malloc: Move mmap code out of __libc_free hotpath

Message ID PAWPR08MB8982CFB797B86F9DBDA20E6F83A42@PAWPR08MB8982.eurprd08.prod.outlook.com
State New
Series malloc: Move mmap code out of __libc_free hotpath

Commit Message

Wilco Dijkstra March 24, 2025, 7:29 p.m. UTC
Currently __libc_free checks for a freed mmap chunk in the fast path.
Also errno is always saved and restored to preserve it.  Since mmap chunks
are larger than the largest tcache chunk, it is safe to delay this and
handle tcache, smallbin and medium bin blocks first.  Move saving of errno
to cases that actually need it.
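
As a quick illustration, the simplified shape of __libc_free after this patch
(paraphrased from the diff below, with the early NULL and mtag checks omitted)
is:

void
__libc_free (void *mem)
{
  mchunkptr p = mem2chunk (mem);

  MAYBE_INIT_TCACHE ();

  /* Mark the chunk as belonging to the library again.  */
  (void) tag_region (chunk2mem (p), memsize (p));

  /* mmapped chunks and the errno save/restore are now handled inside
     _int_free / _int_free_chunk.  */
  _int_free (arena_for_chunk (p), p, 0);
}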

Performance of bench-malloc-thread improves by 6.4% for 1 thread and
4.9% for 32 threads on Neoverse V2.

Passes regress, OK for commit?

---

Comments

Florian Weimer March 24, 2025, 8:41 p.m. UTC | #1
* Wilco Dijkstra:

> Currently __libc_free checks for a freed mmap chunk in the fast path.
> Also errno is always saved and restored to preserve it.  Since mmap chunks
> are larger than the largest tcache chunk, it is safe to delay this and
> handle tcache, smallbin and medium bin blocks first.  Move saving of errno
> to cases that actually need it.

I think the “Since mmap chunks are larger than the largest tcache
chunk” part is not correct.  We can fall back to performing mmap for a
smallish allocation.  But I think it's still fine to put such chunks
into tcache.  Maybe add a comment to this effect?
Wilco Dijkstra March 24, 2025, 9:15 p.m. UTC | #2
Hi Florian,

>> Currently __libc_free checks for a freed mmap chunk in the fast path.
>> Also errno is always saved and restored to preserve it.  Since mmap chunks
>> are larger than the largest tcache chunk, it is safe to delay this and
>> handle tcache, smallbin and medium bin blocks first.  Move saving of errno
>> to cases that actually need it.
>
> I think the “Since mmap chunks are larger than the largest tcache
> chunk” part is not correct.  We can fall back to performing mmap for a
> smallish allocation.  But I think it's still fine to put such chunks
> into tcache.  Maybe add a comment to this effect?

Well, assuming the minimum page size is 4KB, it should not currently be possible
for the smallest mmap chunk to get into tcache. However the available size
of an mmap chunk is different, so it would go wrong unless we make the
chunk layouts compatible (that would be a simplification that removes the
special cases in memsize and musable).
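
To make the layout difference concrete, the rule is roughly this (an
illustrative helper only, not the actual memsize/musable code): a heap chunk
can reuse the next chunk's prev_size field as user data, while an mmapped
chunk has nothing following it in the mapping, so the same chunk size yields
less usable space.

static size_t
usable_size_sketch (size_t chunk_size, int is_mmapped)
{
  size_t header = 2 * sizeof (size_t);                 /* prev_size + size */
  size_t borrowed = is_mmapped ? 0 : sizeof (size_t);  /* next prev_size   */
  return chunk_size - header + borrowed;
}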

Cheers,
Wilco
Wilco Dijkstra March 25, 2025, 1:17 a.m. UTC | #3
Hi DJ,

> ... the rest of this looks ok, but it leaves __libc_free and _int_free as
> being very small functions.  Do we need them?  Or are we relying on the
> inlining to make them efficient?

No, we don't need them - the next step is to move the _int_free* functions into
__libc_free, simplify the code and use a similar tailcall approach to
__libc_malloc. This gives similar speedups to malloc!
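
Roughly the shape would be something like this (only a sketch of the
direction, not the actual follow-up patch - tcache_free_fastpath and
__libc_free_slow are made-up names here):

void
__libc_free (void *mem)
{
  if (mem == NULL)
    return;
  mchunkptr p = mem2chunk (mem);
  if (tcache_free_fastpath (p))    /* inline tcache fast path */
    return;
  __libc_free_slow (mem);          /* cold path, compiled to a tail call */
}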

Cheers,
Wilco
Florian Weimer March 25, 2025, 9:11 a.m. UTC | #4
* Wilco Dijkstra:

> Hi Florian,
>
>>> Currently __libc_free checks for a freed mmap chunk in the fast path.
>>> Also errno is always saved and restored to preserve it.  Since mmap chunks
>>> are larger than the largest tcache chunk, it is safe to delay this and
>>> handle tcache, smallbin and medium bin blocks first.  Move saving of errno
>>> to cases that actually need it.
>>
>> I think the “Since mmap chunks are larger than the largest tcache
>> chunk” part is not correct.  We can fall back to performing mmap for a
>> smallish allocation.  But I think it's still fine to put such chunks
>> into tcache.  Maybe add a comment to this effect?
>
> Well, assuming the minimum page size is 4KB, it should not currently be possible
> for the smallest mmap chunk to get into tcache.

I think 2K alignment will produce a 2K chunk.  Still outside the range
of tcache, though.  I thought it reached further.

> However the available size of an mmap chunk is different, so it would
> go wrong unless we make the chunk layouts compatible (that would be a
> simplification that removes the special cases in memsize and musable).

Oh, I had forgotten about that.

In this case, we probably want some sort of assert that this doesn't go
wrong if someone changes TCACHE_MAX_BINS.  And the comment needs updating
to say that the number is no longer arbitrary.
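
Something along these lines perhaps (only a sketch; the 2048 lower bound for
the smallest mmapped chunk is my assumption based on the 2K-alignment case
above, not a value taken from the sources):

/* If TCACHE_MAX_BINS ever grows enough that a tcache-sized chunk could
   also be mmapped, the new __libc_free fast path would be wrong.  */
_Static_assert (MAX_TCACHE_SIZE < 2048,
                "mmapped chunks must be larger than any tcache-sized chunk");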

Thanks,
Florian
Wilco Dijkstra March 26, 2025, 5:06 p.m. UTC | #5
Hi DJ,

> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>>> ... the rest of this looks ok, but it leaves __libc_free and _int_free as
>>> being very small functions.  Do we need them?  Or are we relying on the
>>> inlining to make them efficient?
>>
>> No, we don't need them - the next step is to move the _int_free* functions into
>> __libc_free, simplify the code and use a similar tailcall approach to
>> __libc_malloc. This gives similar speedups to malloc!
>
> Does this imply that the tcache stuff will only live in __libc_free and
> nothing else will populate tcache from internally?

Yes, for free there doesn't seem to be a need for another interface that frees into
tcache besides __libc_free. On current trunk the _int_free functions are only used once.
So basically there is no need to use _int_free: either you want tcache and thus use
__libc_free, or if you don't, use _int_free_chunk. Note tcache_thread_shutdown
calls __libc_free even though it should be calling _int_free_chunk and then has to
jump through hoops to avoid infinite recursion...
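
For example the shutdown path could free its block along these lines (just a
sketch of the idea, not a tested change; tcache_tmp is the local pointer that
currently gets passed to __libc_free):

  /* In tcache_thread_shutdown: hand the cache block to _int_free_chunk
     directly instead of going back through __libc_free.  */
  mchunkptr p = mem2chunk (tcache_tmp);
  _int_free_chunk (arena_for_chunk (p), p, chunksize (p), 0);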

Cheers,
Wilco

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index b73ddbf554461da34d99258fae87c6ece6d175ba..f362227e68b78e25583aa971cb6ee131f44a92e5 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -3313,6 +3313,9 @@  tcache_init(void)
   if (tcache_shutting_down)
     return;
 
+  /* Preserve errno when called from free() - _int_malloc may corrupt it.  */
+  int err = errno;
+
   arena_get (ar_ptr, bytes);
   victim = _int_malloc (ar_ptr, bytes);
   if (!victim && ar_ptr != NULL)
@@ -3321,10 +3324,11 @@  tcache_init(void)
       victim = _int_malloc (ar_ptr, bytes);
     }
 
-
   if (ar_ptr != NULL)
     __libc_lock_unlock (ar_ptr->mutex);
 
+  __set_errno (err);
+
   /* In a low memory situation, we may not be able to allocate memory
      - in which case, we just keep trying later.  However, we
      typically do this very early, so either there is sufficient
@@ -3446,37 +3450,15 @@  __libc_free (void *mem)
   if (__glibc_unlikely (mtag_enabled))
     *(volatile char *)mem;
 
-  int err = errno;
-
   p = mem2chunk (mem);
 
-  if (chunk_is_mmapped (p))                       /* release mmapped memory. */
-    {
-      /* See if the dynamic brk/mmap threshold needs adjusting.
-	 Dumped fake mmapped chunks do not affect the threshold.  */
-      if (!mp_.no_dyn_threshold
-          && chunksize_nomask (p) > mp_.mmap_threshold
-          && chunksize_nomask (p) <= DEFAULT_MMAP_THRESHOLD_MAX)
-        {
-          mp_.mmap_threshold = chunksize (p);
-          mp_.trim_threshold = 2 * mp_.mmap_threshold;
-          LIBC_PROBE (memory_mallopt_free_dyn_thresholds, 2,
-                      mp_.mmap_threshold, mp_.trim_threshold);
-        }
-      munmap_chunk (p);
-    }
-  else
-    {
-      MAYBE_INIT_TCACHE ();
-
-      /* Mark the chunk as belonging to the library again.  */
-      (void)tag_region (chunk2mem (p), memsize (p));
+  MAYBE_INIT_TCACHE ();
 
-      ar_ptr = arena_for_chunk (p);
-      _int_free (ar_ptr, p, 0);
-    }
+  /* Mark the chunk as belonging to the library again.  */
+  (void)tag_region (chunk2mem (p), memsize (p));
 
-  __set_errno (err);
+  ar_ptr = arena_for_chunk (p);
+  _int_free (ar_ptr, p, 0);
 }
 libc_hidden_def (__libc_free)
 
@@ -4665,6 +4647,9 @@  _int_free_chunk (mstate av, mchunkptr p, INTERNAL_SIZE_T size, int have_lock)
 
   else if (!chunk_is_mmapped(p)) {
 
+    /* Preserve errno in case block merging results in munmap.  */
+    int err = errno;
+
     /* If we're single-threaded, don't lock the arena.  */
     if (SINGLE_THREAD_P)
       have_lock = true;
@@ -4676,13 +4661,33 @@  _int_free_chunk (mstate av, mchunkptr p, INTERNAL_SIZE_T size, int have_lock)
 
     if (!have_lock)
       __libc_lock_unlock (av->mutex);
+
+    __set_errno (err);
   }
   /*
     If the chunk was allocated via mmap, release via munmap().
   */
 
   else {
+
+    /* Preserve errno in case munmap sets it.  */
+    int err = errno;
+
+    /* See if the dynamic brk/mmap threshold needs adjusting.
+       Dumped fake mmapped chunks do not affect the threshold.  */
+    if (!mp_.no_dyn_threshold
+        && chunksize_nomask (p) > mp_.mmap_threshold
+        && chunksize_nomask (p) <= DEFAULT_MMAP_THRESHOLD_MAX)
+      {
+        mp_.mmap_threshold = chunksize (p);
+        mp_.trim_threshold = 2 * mp_.mmap_threshold;
+        LIBC_PROBE (memory_mallopt_free_dyn_thresholds, 2,
+		    mp_.mmap_threshold, mp_.trim_threshold);
+      }
+
     munmap_chunk (p);
+
+    __set_errno (err);
   }
 }