[RFC/PoC] malloc: use wfcqueue to speed up remote frees

Message ID 20180731084936.g4yw6wnvt677miti@dcvr
State New
Series: [RFC/PoC] malloc: use wfcqueue to speed up remote frees

Commit Message

Eric Wong July 31, 2018, 8:49 a.m. UTC
The goal is to reduce contention and improve locality of cross-thread
malloc/free traffic common to IPC systems (including Userspace-RCU) and
some garbage-collected runtimes.

Very rough benchmarks using `xthr`[1], a small URCU test program
I wrote years ago, show huge improvements in time and space:

  $ /usr/bin/time ./before.sh ./xthr -a 2 -m 2500 -i $((1024 * 1024 * 5))
  2.46user 3.51system 0:05.50elapsed 108%CPU (0avgtext+0avgdata 3352592maxresident)k
  0inputs+0outputs (17major+838014minor)pagefaults 0swaps

  $ /usr/bin/time ./after.sh ./xthr -a 2 -m 2500 -i $((1024 * 1024 * 5))
  2.68user 0.48system 0:02.55elapsed 123%CPU (0avgtext+0avgdata 532904maxresident)k
  0inputs+0outputs (0major+174304minor)pagefaults 0swaps

Where before.sh and after.sh are script wrappers around ld-linux for
the appropriate glibc installation.

  #!/bin/sh
  exec /tmp/6/lib/ld-linux-x86-64.so.2 --library-path /tmp/6/lib "$@"

  [1] xthr.c: https://80x24.org/spew/20180731082205.vykyunsm5xg7ml3e@dcvr/

It avoids lock contention by deferring `_int_free' for remote chunks
until the arena lock is already held by the arena's owner.  Three
functions are added:

* remote_free_begin  - Producer - enqueues the allocation into an arena
                       designated to another thread.  This is wait-free,
                       branchless, and only modifies the last (cold)
                       cacheline of the arena belonging to another thread

* remote_free_step   - Consumer - grabs everything enqueued by
                       remote_free_begin and calls `_int_free' locally
                       without acquiring extra locks.  Returns `true'
                       if it did any work, as other threads may have
                       called `remote_free_begin' in the meantime.

* remote_free_finish - Consumer - calls remote_free_step in a loop until
                       there is no more work to do.  It runs before most
                       calls to malloc_consolidate.

wfcqueue is the LGPL-2.1+ Wait-Free Concurrent Queue distributed
with Userspace-RCU <http://liburcu.org/>.  wfcqueue does not
depend on RCU itself (only atomics), but forms the basis of the
workqueue and call_rcu primitive within liburcu.

The functions I'm using from wfcqueue can be statically linked from
the header files, so there is no extra linkage at runtime.
Note: Debian users can `apt-get install liburcu-dev' to get
wfcqueue.h; I expect it's available for other distros.
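
For reference, the enqueue/splice pattern the patch relies on looks
like this outside of malloc.  This is only a standalone sketch that
mirrors the wfcqueue calls used in the patch (struct item and the
printf loop are just for illustration); it is not heavily tested:

  /* wfcq-example.c: producer enqueues, consumer splices and drains */
  #include <assert.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define _LGPL_SOURCE              /* use the inline implementations */
  #include <urcu/compiler.h>        /* caa_container_of */
  #include <urcu/wfcqueue.h>

  struct item {
    struct cds_wfcq_node node;
    int value;
  };

  int main (void)
  {
    struct __cds_wfcq_head head, tmp_head;
    struct cds_wfcq_tail tail, tmp_tail;
    struct cds_wfcq_node *qnode, *n;
    enum cds_wfcq_ret ret;
    int i;

    __cds_wfcq_init (&head, &tail);

    /* producer side (what remote_free_begin does): wait-free enqueue */
    for (i = 0; i < 3; i++)
      {
        struct item *it = malloc (sizeof (*it));
        it->value = i;
        cds_wfcq_node_init (&it->node);
        cds_wfcq_enqueue (&head, &tail, &it->node);
      }

    /* consumer side (what remote_free_step does): capture everything
       in one splice, then walk the captured list locally */
    __cds_wfcq_init (&tmp_head, &tmp_tail);
    ret = __cds_wfcq_splice_nonblocking (&tmp_head, &tmp_tail, &head, &tail);
    assert (ret == CDS_WFCQ_RET_DEST_EMPTY);

    __cds_wfcq_for_each_blocking_safe (&tmp_head, &tmp_tail, qnode, n)
      {
        struct item *it = caa_container_of (qnode, struct item, node);
        printf ("dequeued %d\n", it->value);
        free (it);
      }
    return 0;
  }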

If this proof-of-concept is found acceptable, I can work on
making wfcqueue use the atomics provided by gcc/glibc instead
of the `uatomic` headers of URCU so it can be bundled with
glibc.  But maybe adding liburcu as a build-time dependency
is acceptable :)

Note: I haven't gotten "make -j4 check" anywhere close to passing,
even without this patch, on commit 98864ed0e055583707e37cdb7d41a9cdeac4473b.
It's likely a problem on my end; I'm on a fairly common Debian 9
x86-64 system, but I haven't built glibc in years.

On the other hand, with the exception of the fiddle (dl-dependent)
and tz tests, most of Ruby's "test-all" suite passes when using
either the before.sh or after.sh glibc wrapper (but I haven't done
much testing otherwise):

    make test-all TESTS='-x fiddle -x time_tz' \
		RUNRUBYOPT=--precommand=/path/to/after.sh
---
 malloc/malloc.c | 108 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 103 insertions(+), 5 deletions(-)

Comments

Carlos O'Donell July 31, 2018, 12:16 p.m. UTC | #1
On 07/31/2018 04:49 AM, Eric Wong wrote:
> The goal is to reduce contention and improve locality of cross-thread
> malloc/free traffic common to IPC systems (including Userspace-RCU) and
> some garbage-collected runtimes.
Eric,

This looks like a really interesting contribution!

For anyone reviewing this patch I just want to point out that Eric *does*
have FSF copyright assignment for glibc, so review can proceed normally
for this patch. Thank you!

I would like to see urcu used within glibc to provide better data structures
for key thread, dynamic loader, and malloc algorithms. So if anything I think
this is a move in the right direction.

It would be interesting to roll your RFC into Fedora Rawhide for 6 months and
see if we hit any problems.

I have a few high-level questions:

- Can you explain the RSS reduction given this patch? You might think that just
  adding the frees to a queue wouldn't result in any RSS gains. However, you
  are calling _int_free a lot in a row and that deinterleaving may help (you really
  want a vector free API here so you don't walk all the lists so many times; tcache
  had the same problem but in reverse for finding chunks). 

- Adding urcu as a build-time dependency is not acceptable for bootstrap, instead
  we would bundle a copy of urcu and keep it in sync with upstream. Would that
  make your work easier?

- What problems are you having with `make -j4 check?' Try master and report back.
  We are about to release 2.28 so it should build and pass.

Thank you again for testing this out.
Eric Wong July 31, 2018, 11:18 p.m. UTC | #2
Carlos O'Donell <carlos@redhat.com> wrote:
> On 07/31/2018 04:49 AM, Eric Wong wrote:
> > The goal is to reduce contention and improve locality of cross-thread
> > malloc/free traffic common to IPC systems (including Userspace-RCU) and
> > some garbage-collected runtimes.
> Eric,
> 
> This looks like a really interesting contribution!
> 
> For anyone reviewing this patch I just want to point out that Eric *does*
> have FSF copyright assignment for glibc, so review can proceed normally
> for this patch. Thank you!

Yep, it's been a while.

> I would like to see urcu used within glibc to provide better data structures
> for key thread, dynamic loader, and malloc algorithms. So if anything I think
> this is a move in the right direction.

That's great to hear :>

> It would be interesting to roll your RFC into Fedora Rawhide for 6 months and
> see if we hit any problems.

Sure, sounds good to me.

> I have a few high-level questions:

> - Can you explain the RSS reduction given this patch? You
> might think that just adding the frees to a queue wouldn't
> result in any RSS gains.

At least two reasons I can see:

1) With lock contention, the freeing thread can lose to the
   allocating thread.  This makes the allocating thread hit
   sysmalloc since it prevented the freeing thread from doing
   its job.  sysmalloc is the slow path, so the lock gets held
   even longer and the problem compounds from there.

2) thread caching - memory ends up in the wrong thread and
   could never get used in some cases.  Fortunately this is
   bounded, but still a waste.

I'm still new to the code, but it looks like threads are pinned
to the arena and the memory used for arenas never gets released.
Is that correct?

I was wondering if there was another possibility: the allocating
thread gives up the arena and creates a new one because the
freeing thread locked it, but I don't think that's the case.

Also, if I spawn a bunch of threads and get a bunch of arenas
early in the program lifetime, and then only have a few threads
later, there can be a lot of idle arenas.

> However, you are calling _int_free a lot in row and that
> deinterleaving may help (you really want vector free API here
> so you don't walk all the lists so many times, tcache had the
> same problem but in reverse for finding chunks). 

Maybe...  I think in the ideal case, the number of allocations
and frees is close to 1:1, so the loop is kept short.

What may be worth trying is to bypass _int_free for cases where
a chunk can fulfill the allocation which triggers it.  Delaying
or avoiding consolidation could worsen fragmentation, though. 

> - Adding urcu as a build-time dependency is not acceptable for
> bootstrap, instead we would bundle a copy of urcu and keep it
> in sync with upstream. Would that make your work easier?

Yes, bundling that sounds great.  I assume it's something for
you or one of the regular contributors to work on (build systems
scare me :x)

> - What problems are you having with `make -j4 check?' Try
> master and report back.  We are about to release 2.28 so it
> should build and pass.

My fault.  It seems like tests aren't automatically rerun when I
change the code; so some of my broken work-in-progress changes
ended up being false positives :x.  When working on this, I made
the mistake of calling remote_free_step inside malloc_consolidate,
which could recurse into _int_free or _int_malloc.

I guess I should remove the *.test-result files before rerunning
tests?

I still get:

FAIL: nptl/tst-sched1

	"Create failed"

	I guess my system was overloaded.  pthread_create
	failures seemed to happen a lot for me when working
	on Ruby, too, and POSIX forcing EAGAIN makes it
	hard to diagnose :< (ulimit -u 47999 and 12GB RAM)

	Removing the test-result and retrying seems OK.

FAIL: resolv/tst-resolv-ai_idn
FAIL: resolv/tst-resolv-ai_idn-latin1

	Not root, so no CLONE_NEWUTS

So I think that's expected...
Carlos O'Donell Aug. 1, 2018, 4:41 a.m. UTC | #3
On 07/31/2018 07:18 PM, Eric Wong wrote:
>> - Can you explain the RSS reduction given this patch? You
>> might think that just adding the frees to a queue wouldn't
>> result in any RSS gains.
> 
> At least two reasons I can see:
> 
> 1) With lock contention, the freeing thread can lose to the
>    allocating thread.  This makes the allocating thread hit
>    sysmalloc since it prevented the freeing thread from doing
>    its job.  sysmalloc is the slow path, so the lock gets held
>    even longer and the problem compounds from there.

How does this impact RSS? It would only block the remote thread
from freeing in a timely fashion, but it would eventually make
progress.

> 2) thread caching - memory ends up in the wrong thread and
>    could never get used in some cases.  Fortunately this is
>    bounded, but still a waste.

We can't have memory end up in the wrong thread. The remote thread
computes the arena from the chunk it has, and then frees back to
the appropriate arena, even if it's not the arena that the thread
is attached to.

> I'm still new to the code, but it looks like threads are pinned
> to the arena and the memory used for arenas never gets released.
> Is that correct?

Threads are pinned to their arenas, but they can move in the event
of allocation failures, particularly to the main arena to attempt
sbrk to get more memory.

> I was wondering if there was another possibility: the allocating
> thread gives up the arena and creates a new one because the
> freeing thread locked it, but I don't think that's the case.

No.

> Also, if I spawn a bunch of threads and get a bunch of
> arenas early in the program lifetime; and then only have few
> threads later, there can be a lot of idle arenas.
 
Yes. That is true. We don't coalesce arenas to match the thread
demand.

>> However, you are calling _int_free a lot in row and that
>> deinterleaving may help (you really want vector free API here
>> so you don't walk all the lists so many times, tcache had the
>> same problem but in reverse for finding chunks). 
> 
> Maybe...  I think in the ideal case, the number of allocations
> and frees is close 1:1, so the loop is kept short.
> 
> What may be worth trying is to bypass _int_free for cases where
> a chunk can fulfill the allocation which triggers it.  Delaying
> or avoiding consolidation could worsen fragmentation, though. 

Right.

>> - Adding urcu as a build-time dependency is not acceptable for
>> bootstrap, instead we would bundle a copy of urcu and keep it
>> in sync with upstream. Would that make your work easier?
> 
> Yes, bundling that sounds great.  I assume it's something for
> you or one of the regular contributors to work on (build systems
> scare me :x)

Yes, that is something we'd have to do.

>> - What problems are you having with `make -j4 check?' Try
>> master and report back.  We are about to release 2.28 so it
>> should build and pass.
> 
> My fault.  It seems like tests aren't automatically rerun when I
> change the code; so some of my broken work-in-progress changes
> ended up being false positives :x.  When working on this, I made
> the mistake of doing remote_free_step inside malloc_consolidate,
> which could recurse into _int_free or _int_malloc

This depends a bit on what you touch.

> I guess I should remove the *.test-result files before rerunning
> tests?

Yes, that will definitely force the test to be re-run.

> I still get:
> 
> FAIL: nptl/tst-sched1
> 
> 	"Create failed"
> 
> 	I guess my system was overloaded.  pthread_create
> 	failures seemed to happen a lot for me when working
> 	on Ruby, too, and POSIX forcing EAGAIN makes it
> 	hard to diagnose :< (ulimit -u 47999 and 12GB RAM)
> 
> 	Removing the test-result and retrying seems OK.

OK. This one is new. There are a few tests where pthread_create
fails with EAGAIN because the kernel can't reap the children
fast enough.

> 
> FAIL: resolv/tst-resolv-ai_idn
> FAIL: resolv/tst-resolv-ai_idn-latin1
> 
> 	Not root, so no CLONE_NEWUTS
> 
> So I think that's expected...
> 

Agreed.
Eric Wong Aug. 1, 2018, 6:23 a.m. UTC | #4
Carlos O'Donell <carlos@redhat.com> wrote:
> On 07/31/2018 07:18 PM, Eric Wong wrote:
> >> - Can you explain the RSS reduction given this patch? You
> >> might think that just adding the frees to a queue wouldn't
> >> result in any RSS gains.
> > 
> > At least two reasons I can see:
> > 
> > 1) With lock contention, the freeing thread can lose to the
> >    allocating thread.  This makes the allocating thread hit
> >    sysmalloc since it prevented the freeing thread from doing
> >    its job.  sysmalloc is the slow path, so the lock gets held
> >    even longer and the problem compounds from there.
> 
> How does this impact RSS? It would only block the remote thread
> from freeing in a timely fashion, but it would eventually make
> progress.

Blocking the freeing thread causes the allocating thread to
sysmalloc more.  If the freeing thread could always beat the
allocating thread, then the freed memory would be available in
the arena by the time the allocating thread takes the lock.

> > 2) thread caching - memory ends up in the wrong thread and
> >    could never get used in some cases.  Fortunately this is
> >    bounded, but still a waste.
> 
> We can't have memory end up in the wrong thread. The remote thread
> computes the arena from the chunk it has, and then frees back to
> the appropriate arena, even if it's not the arena that the thread
> is attached to.

Really?  I see:

   __libc_free -> MAYBE_INIT_TCACHE && _int_free -> tcache_put

I am not seeing anything in _int_free which makes the tcache_put
arena-aware.  If we drop MAYBE_INIT_TCACHE from __libc_free,
then the tcache_put could be avoided.

> > I'm still new to the code, but it looks like threads are pinned
> > to the arena and the memory used for arenas never gets released.
> > Is that correct?
> 
> Threads are pinned to their arenas, but they can move in the event
> of allocation failures, particularly to the main arena to attempt
> sbrk to get more memory.

OK.

> > I was wondering if there was another possibility: the allocating
> > thread gives up the arena and creates a new one because the
> > freeing thread locked it, but I don't think that's the case.
> 
> No.
> 
> > Also, if I spawn a bunch of threads and get a bunch of
> > arenas early in the program lifetime; and then only have few
> > threads later, there can be a lot of idle arenas.
>  
> Yes. That is true. We don't coalesce arenas to match the thread
> demand.

Eep :<    If contention can be avoided (which tcache seems to
work well for), limiting arenas to CPU count seems desirable and
worth trying.

<snip>

> >> - Adding urcu as a build-time dependency is not acceptable for
> >> bootstrap, instead we would bundle a copy of urcu and keep it
> >> in sync with upstream. Would that make your work easier?
> > 
> > Yes, bundling that sounds great.  I assume it's something for
> > you or one of the regular contributors to work on (build systems
> > scare me :x)
> 
> Yes, that is something we'd have to do.

OK, I noticed my patch fails conformance tests because
(despite my use of __cds_wfcq_splice_nonblocking) it references
poll(), even though poll() is only reachable on an impossible code path:

   __cds_wfcq_splice_nonblocking -> ___cds_wfcq_splice
	   -> ___cds_wfcq_busy_wait -> poll

The poll call is impossible because the `blocking' parameter is 0;
but I guess the linker doesn't know that?

> >> - What problems are you having with `make -j4 check?' Try
> >> master and report back.  We are about to release 2.28 so it
> >> should build and pass.
> > 
> > My fault.  It seems like tests aren't automatically rerun when I
> > change the code; so some of my broken work-in-progress changes
> > ended up being false positives :x.  When working on this, I made
> > the mistake of doing remote_free_step inside malloc_consolidate,
> > which could recurse into _int_free or _int_malloc
> 
> This depends a bit on what you touch.

Alright, I'll keep that in mind.  Thanks!
Carlos O'Donell Aug. 1, 2018, 7:01 a.m. UTC | #5
On 08/01/2018 02:23 AM, Eric Wong wrote:
> Carlos O'Donell <carlos@redhat.com> wrote:
>> On 07/31/2018 07:18 PM, Eric Wong wrote:
>>>> - Can you explain the RSS reduction given this patch? You
>>>> might think that just adding the frees to a queue wouldn't
>>>> result in any RSS gains.
>>>
>>> At least two reasons I can see:
>>>
>>> 1) With lock contention, the freeing thread can lose to the
>>>    allocating thread.  This makes the allocating thread hit
>>>    sysmalloc since it prevented the freeing thread from doing
>>>    its job.  sysmalloc is the slow path, so the lock gets held
>>>    even longer and the problem compounds from there.
>>
>> How does this impact RSS? It would only block the remote thread
>> from freeing in a timely fashion, but it would eventually make
>> progress.
> 
> Blocking the freeing thread causes the allocating thread to
> sysmalloc more.  If the freeing thread could always beat the
> allocating thread, then the freed memory would be available in
> the arena by the time the allocating thread takes the lock.

I see what you mean now. Yes, that could reduce RSS by reducing
the time between when the remote thread frees memory and when
the producer thread (let's call it that) can reuse the returned
chunks.

>>> 2) thread caching - memory ends up in the wrong thread and
>>>    could never get used in some cases.  Fortunately this is
>>>    bounded, but still a waste.
>>
>> We can't have memory end up in the wrong thread. The remote thread
>> computes the arena from the chunk it has, and then frees back to
>> the appropriate arena, even if it's not the arena that the thread
>> is attached to.
> 
> Really?  I see:
> 
>    __libc_free -> MAYBE_INIT_TCACHE && _int_free -> tcache_put
> 
> I am not seeing anything in _int_free which makes the tcache_put
> arena-aware.  If we drop MAYBE_INIT_TCACHE from __libc_free,
> then the tcache_put could be avoided.

Thank you, that clarifies it for me; I was glossing over tcache.

Yes, the tcache layer doesn't care where the block came from and
will happily cache it.

In a producer-consumer model though, as this seems to be the example
from which we are drawing parallels, the consumer rarely needs to
allocate anything, so yes, the tcache effectively slows the initial
rate of frees to the producer thread, but only to a limit (as
you note).

>>> I'm still new to the code, but it looks like threads are pinned
>>> to the arena and the memory used for arenas never gets released.
>>> Is that correct?
>>
>> Threads are pinned to their arenas, but they can move in the event
>> of allocation failures, particularly to the main arena to attempt
>> sbrk to get more memory.
> 
> OK.
> 
>>> I was wondering if there was another possibility: the allocating
>>> thread gives up the arena and creates a new one because the
>>> freeing thread locked it, but I don't think that's the case.
>>
>> No.
>>
>>> Also, if I spawn a bunch of threads and get a bunch of
>>> arenas early in the program lifetime; and then only have few
>>> threads later, there can be a lot of idle arenas.
>>  
>> Yes. That is true. We don't coalesce arenas to match the thread
>> demand.
> 
> Eep :<    If contention can be avoided (which tcache seems to
> work well for), limiting arenas to CPU count seems desirable and
> worth trying.

Agreed.

In general it is not as bad as you think.

An arena is made up of a chain of heaps, each an mmap'd block, and
if we can manage to free an entire heap then we unmap the heap,
and if we're lucky we can manage to free down the entire arena
(_int_free -> large chunk / consolidate -> heap_trim -> shrink_heap).

So we might just end up with a large number of arenas that don't
have very much allocated at all, but are all on the arena free list
waiting for a thread to attach to them to reduce overall contention.

I agree that it would be *better* if we had one arena per CPU and
each thread could easily determine the CPU it was on (via a
restartable sequence) and then allocate CPU-local memory to work
with (the best you can do; ignoring NUMA effects).
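
Roughly the idea, with sched_getcpu() standing in for a restartable
sequence (illustrative only; NARENAS, cpu_arenas, and arena_for_this_cpu
are made-up names, not glibc's):

  #define _GNU_SOURCE
  #include <sched.h>

  #define NARENAS 64                  /* illustrative cap, not a glibc tunable */

  struct arena;                       /* stand-in for malloc's mstate */
  static struct arena *cpu_arenas[NARENAS];

  /* sched_getcpu() can be stale by the time the result is used; a
     restartable sequence would let the thread detect migration and
     retry, making the per-CPU access reliable. */
  static struct arena *
  arena_for_this_cpu (void)
  {
    int cpu = sched_getcpu ();
    if (cpu < 0)
      cpu = 0;
    return cpu_arenas[cpu % NARENAS]; /* caller still takes the arena lock */
  }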

> <snip>
> 
>>>> - Adding urcu as a build-time dependency is not acceptable for
>>>> bootstrap, instead we would bundle a copy of urcu and keep it
>>>> in sync with upstream. Would that make your work easier?
>>>
>>> Yes, bundling that sounds great.  I assume it's something for
>>> you or one of the regular contributors to work on (build systems
>>> scare me :x)
>>
>> Yes, that is something we'd have to do.
> 
> OK, I noticed my patch fails conformance tests because
> (despite my use of __cds_wfcq_splice_nonblocking) it references
> poll(), despite poll() being in an impossible code path:
> 
>    __cds_wfcq_splice_nonblocking -> ___cds_wfcq_splice
> 	   -> ___cds_wfcq_busy_wait -> poll
> 
> The poll call is impossible because the `blocking' parameter is 0;
> but I guess the linker doesn't know that?

Correct. We can fix that easily at a later date. Don't worry about it.
Eric Wong Aug. 1, 2018, 9:26 a.m. UTC | #6
Carlos O'Donell <carlos@redhat.com> wrote:
> On 08/01/2018 02:23 AM, Eric Wong wrote:
> > Carlos O'Donell <carlos@redhat.com> wrote:
> >> On 07/31/2018 07:18 PM, Eric Wong wrote:
> >>> Also, if I spawn a bunch of threads and get a bunch of
> >>> arenas early in the program lifetime; and then only have few
> >>> threads later, there can be a lot of idle arenas.
> >>  
> >> Yes. That is true. We don't coalesce arenas to match the thread
> >> demand.
> > 
> > Eep :<    If contention can be avoided (which tcache seems to
> > work well for), limiting arenas to CPU count seems desirable and
> > worth trying.
> 
> Agreed.
> 
> In general it is not as bad as you think.
> 
> An arena is made up of a chain of heaps, each an mmap'd block, and
> if we can manage to free an entire heap then we unmap the heap,
> and if we're lucky we can manage to free down the entire arena
> (_int_free -> large chunk / consolidate -> heap_trim -> shrink_heap).
> 
> So we might just end up with a large number of arena's that don't
> have very much allocated at all, but are all on the arena free list
> waiting for a thread to attach to them to reduce overall contention.
> 
> I agree that it would be *better* if we had one arena per CPU and
> each thread could easily determine the CPU it was on (via a
> restartable sequence) and then allocate CPU-local memory to work
> with (the best you can do; ignoring NUMA effects).

Thanks for the info on arenas.  One problem for Ruby is we get
many threads[1], and they create allocations of varying
lifetimes.  All this while malloc contention is rarely a
problem in Ruby because of the global VM lock (GVL).

Even without restartable sequences, I was wondering if lfstack
(also in urcu) could even be used for sharing/distributing
arenas between threads.  This would require tcache to avoid
retries on lfstack pop/push.
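
Very roughly, something like this with lfstack (hypothetical sketch,
none of it is in the patch; `struct arena' and the idle_arena_* names
are placeholders, not glibc code):

  #include <urcu/compiler.h>          /* caa_container_of */
  #include <urcu/lfstack.h>

  struct arena;                       /* stand-in for malloc's mstate */

  struct idle_arena {
    struct cds_lfs_node node;
    struct arena *arena;
  };

  /* process-wide stack of idle arenas; cds_lfs_init() runs once at startup */
  static struct cds_lfs_stack idle_arenas;

  /* thread exit / detach: hand the arena back */
  static void
  idle_arena_put (struct idle_arena *e)
  {
    cds_lfs_node_init (&e->node);
    cds_lfs_push (&idle_arenas, &e->node);
  }

  /* thread start / attach: reuse an idle arena before creating a new one */
  static struct idle_arena *
  idle_arena_get (void)
  {
    struct cds_lfs_node *n = cds_lfs_pop_blocking (&idle_arenas);
    return n ? caa_container_of (n, struct idle_arena, node) : NULL;
  }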

Much less straightforward than using wfcqueue for frees with
this patch, though :)


[1] we only had green-threads back in Ruby 1.8, and I guess many
    Rubyists got used to the idea that they could have many
    threads cheaply.  Ruby 1.9+ moved to 100% native threads,
    so I'm also trying to reintroduce green threads as an option
    back into Ruby (but still keeping native threads)

> > OK, I noticed my patch fails conformance tests because
> > (despite my use of __cds_wfcq_splice_nonblocking) it references
> > poll(), despite poll() being in an impossible code path:
> > 
> >    __cds_wfcq_splice_nonblocking -> ___cds_wfcq_splice
> > 	   -> ___cds_wfcq_busy_wait -> poll
> > 
> > The poll call is impossible because the `blocking' parameter is 0;
> > but I guess the linker doesn't know that?
> 
> Correct. We can fix that easily at a later date. Don't worry about it.

Heh, a bit dirty, but #define-ing poll away seems to work :)

diff --git a/malloc/malloc.c b/malloc/malloc.c
index 40d61e45db..89e675c7a0 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -247,6 +247,11 @@
 /* For SINGLE_THREAD_P.  */
 #include <sysdep-cancel.h>
 
+/* prevent wfcqueue.h from including poll.h and linking to it */
+#include <poll.h>
+#undef poll
+#define poll(a,b,c) assert(0 && "should not be called")
+
 #define _LGPL_SOURCE /* allows inlines */
 #include <urcu/wfcqueue.h>
Carlos O'Donell Aug. 2, 2018, 9:38 p.m. UTC | #7
On 08/01/2018 05:26 AM, Eric Wong wrote:
> Carlos O'Donell <carlos@redhat.com> wrote:
>> On 08/01/2018 02:23 AM, Eric Wong wrote:
>>> Carlos O'Donell <carlos@redhat.com> wrote:
>>>> On 07/31/2018 07:18 PM, Eric Wong wrote:
>>>>> Also, if I spawn a bunch of threads and get a bunch of
>>>>> arenas early in the program lifetime; and then only have few
>>>>> threads later, there can be a lot of idle arenas.
>>>>  
>>>> Yes. That is true. We don't coalesce arenas to match the thread
>>>> demand.
>>>
>>> Eep :<    If contention can be avoided (which tcache seems to
>>> work well for), limiting arenas to CPU count seems desirable and
>>> worth trying.
>>
>> Agreed.
>>
>> In general it is not as bad as you think.
>>
>> An arena is made up of a chain of heaps, each an mmap'd block, and
>> if we can manage to free an entire heap then we unmap the heap,
>> and if we're lucky we can manage to free down the entire arena
>> (_int_free -> large chunk / consolidate -> heap_trim -> shrink_heap).
>>
>> So we might just end up with a large number of arena's that don't
>> have very much allocated at all, but are all on the arena free list
>> waiting for a thread to attach to them to reduce overall contention.
>>
>> I agree that it would be *better* if we had one arena per CPU and
>> each thread could easily determine the CPU it was on (via a
>> restartable sequence) and then allocate CPU-local memory to work
>> with (the best you can do; ignoring NUMA effects).
> 
> Thanks for the info on arenas.  One problem for Ruby is we get
> many threads[1], and they create allocations of varying
> lifetimes.  All this while malloc contention is rarely a
> problem in Ruby because of the global VM lock (GVL).

The allocations of varying lifetimes will make it impossible to free
down a heap from a heap-based allocator. This is a serious issue with
heap-based allocators, and it will impact the max RSS that you'll
need to reach steady state. It's not really a tractable problem, I think;
I don't know how to deinterlace the chunks which have differing
lifetimes. Your only chance is to take the existing large/small/fast
bin machinery and, instead of mixing them in one heap, split them into
one smaller heap each, and see how that goes, i.e. adopt size classes,
but keep it heap-based.

> Even without restartable sequences, I was wondering if lfstack
> (also in urcu) could even be used for sharing/distributing
> arenas between threads.  This would require tcache to avoid
> retries on lfstack pop/push.

We absolutely need better balancing across arenas; even now we don't
do any rebalancing based on load or attach count. We should. That
problem would go away if you just used restartable sequences to find
your CPU, map that to the local arena for that CPU, and allocate there.

> Much less straighforward than using wfcqueue for frees with
> this patch, though :) 
Correct.
 
> Heh, a bit dirty, but #define-ing poll away seems to work :)
> 
> diff --git a/malloc/malloc.c b/malloc/malloc.c
> index 40d61e45db..89e675c7a0 100644
> --- a/malloc/malloc.c
> +++ b/malloc/malloc.c
> @@ -247,6 +247,11 @@
>  /* For SINGLE_THREAD_P.  */
>  #include <sysdep-cancel.h>
>  
> +/* prevent wfcqueue.h from including poll.h and linking to it */
> +#include <poll.h>
> +#undef poll
> +#define poll(a,b,c) assert(0 && "should not be called")
> +

Call __poll instead. That should fix the issue.
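
Something like this (sketch only; it assumes the internal __poll
prototype is in scope where malloc.c includes wfcqueue.h):

  /* redirect the impossible poll reference to the internal alias so
     the public poll symbol is never pulled in */
  #include <poll.h>
  #undef poll
  #define poll(fds, nfds, timeout) __poll (fds, nfds, timeout)

  #define _LGPL_SOURCE /* allows inlines */
  #include <urcu/wfcqueue.h>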

>  #define _LGPL_SOURCE /* allows inlines */
>  #include <urcu/wfcqueue.h>
>  
>
Eric Wong Aug. 8, 2018, 10:40 a.m. UTC | #8
Carlos O'Donell <carlos@redhat.com> wrote:
> - Adding urcu as a build-time dependency is not acceptable for bootstrap, instead
>   we would bundle a copy of urcu and keep it in sync with upstream. Would that
>   make your work easier?

Any ETA or update on urcu bundling?  It's not urgent and I've
been busy with other projects, too; but it could be helpful for
planning around other work.

Anyway, I'm pretty satisfied with my original patch and will use
the __poll call in my next iteration.  My main concern right now
is not hurting existing fast cases.  So I will need to look into
more benchmarks on list archives and wiki...

Thanks.
Eric Wong Jan. 17, 2023, 6:42 a.m. UTC | #9
Carlos O'Donell <carlos@redhat.com> wrote:
> >> - Adding urcu as a build-time dependency is not acceptable for
> >> bootstrap, instead we would bundle a copy of urcu and keep it
> >> in sync with upstream. Would that make your work easier?
>
> Eric Wong wrote:
> > Yes, bundling that sounds great.  I assume it's something for
> > you or one of the regular contributors to work on (build systems
> > scare me :x)
> 
> Yes, that is something we'd have to do.

Hi, bringing this topic from 2018 up again (+Cc Mathieu):
https://inbox.sourceware.org/libc-alpha/c061de55-cc2a-88fe-564b-2ea9c4a7e632@redhat.com/T/

I'm wondering if URCU-in-glibc is still on the table.  I'm also
considering learning C11 atomics and deriving a standalone
wfcqueue w/o URCU atomics.
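
(The append itself is tiny; a rough C11 sketch of just the enqueue
step, with my own names and memory orders, so not a drop-in for
wfcqueue:)

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  struct wfq_node {
    _Atomic(struct wfq_node *) next;
  };

  struct wfq {
    struct wfq_node stub;              /* dummy node, never handed out */
    _Atomic(struct wfq_node *) tail;   /* most recently enqueued node */
  };

  static void wfq_init (struct wfq *q)
  {
    atomic_store_explicit (&q->stub.next, NULL, memory_order_relaxed);
    atomic_store_explicit (&q->tail, &q->stub, memory_order_relaxed);
  }

  /* wait-free enqueue: one xchg plus one release store; returns true
     if the previous tail was the stub (queue looked empty).  A dequeuer
     seeing a NULL ->next behind a non-stub tail has to wait for the
     release store (the case wfcqueue handles with its busy-wait). */
  static bool wfq_enqueue (struct wfq *q, struct wfq_node *node)
  {
    atomic_store_explicit (&node->next, NULL, memory_order_relaxed);
    struct wfq_node *old = atomic_exchange_explicit (&q->tail, node,
                                                     memory_order_acq_rel);
    atomic_store_explicit (&old->next, node, memory_order_release);
    return old == &q->stub;
  }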

Thanks.
Mathieu Desnoyers Jan. 17, 2023, 7:05 p.m. UTC | #10
On 2023-01-17 01:42, Eric Wong wrote:
> Carlos O'Donell <carlos@redhat.com> wrote:
>>>> - Adding urcu as a build-time dependency is not acceptable for
>>>> bootstrap, instead we would bundle a copy of urcu and keep it
>>>> in sync with upstream. Would that make your work easier?
>>
>> Eric Wong wrote:
>>> Yes, bundling that sounds great.  I assume it's something for
>>> you or one of the regular contributors to work on (build systems
>>> scare me :x)
>>
>> Yes, that is something we'd have to do.
> 
> Hi, bringing this topic from 2018 up again (+Cc Mathieu):
> https://inbox.sourceware.org/libc-alpha/c061de55-cc2a-88fe-564b-2ea9c4a7e632@redhat.com/T/
> 
> I'm wondering if URCU-in-glibc is still on the table.  I'm also
> considering an learning C11 atomics and deriving a standalone
> wfcqueue w/o URCU atomics.

Hi Eric,

I'm very much interested in contributing portions of liburcu to glibc. I
think what we would need at this point is to document where we see that 
liburcu infrastructure can improve glibc, and create a gradual 
integration roadmap based on priorities. We should identify the 
stake-holders interested in seeing this done, and then we can discuss 
the time-frame and resources available for this project.

That being said, I suspect we'd want to cover a few pieces of technology 
in this list, namely:

- Restartable Sequences (rseq system call),
- membarrier system call,
- liburcu and libside [1,2] RCU implementations,
- liburcu data structures (e.g. wfcqueue).

I would be tempted to go for an approach closer to the RCU 
implementation I have done in the libside project for glibc, because it 
supports having multiple RCU grace period domains per process, which 
allows nicely splitting the locking dependencies into sub-domains, and
therefore lessens the chances of deadlocks due to nesting of a
global-domain RCU read-side/synchronize_rcu and locking (e.g. mutexes).

Thanks,

Mathieu

[1] https://github.com/efficios/libside/blob/master/src/rcu.c
[2] https://github.com/efficios/libside/blob/master/src/rcu.h
Mathieu Desnoyers Jan. 18, 2023, 2:53 p.m. UTC | #11
On 2023-01-17 01:42, Eric Wong wrote:
> Carlos O'Donell <carlos@redhat.com> wrote:
>>>> - Adding urcu as a build-time dependency is not acceptable for
>>>> bootstrap, instead we would bundle a copy of urcu and keep it
>>>> in sync with upstream. Would that make your work easier?
>>
>> Eric Wong wrote:
>>> Yes, bundling that sounds great.  I assume it's something for
>>> you or one of the regular contributors to work on (build systems
>>> scare me :x)
>>
>> Yes, that is something we'd have to do.
> 
> Hi, bringing this topic from 2018 up again (+Cc Mathieu):
> https://inbox.sourceware.org/libc-alpha/c061de55-cc2a-88fe-564b-2ea9c4a7e632@redhat.com/T/
> 
> I'm wondering if URCU-in-glibc is still on the table.  I'm also
> considering an learning C11 atomics and deriving a standalone
> wfcqueue w/o URCU atomics.

I've done a quick review of your proposed patch, and there is one thing 
that I'm concerned about: forward progress of remote_free_finish(). 
AFAIU, if we have a steady flow of remote_free_begin() calls, it can 
prevent forward progress of the arena owner.

When remote_free_step() captures the queue (splice) and processes it, it 
returns whether it has processed any elements, and the caller attempts 
to splice again if there was anything present. What I wonder is why is 
it required that the caller splices queue elements that were queued 
_after_ the begin of remote_free_finish() ? Can't we simply leave those 
to the next (eventual) remote_free_finish calls ?

If we do that change, this means remote_free_finish would only be needed 
when tearing down an arena, and upon allocation only a single call to 
remote_free_step() would be needed.

Thoughts ?

Thanks,

Mathieu
Mathieu Desnoyers Jan. 18, 2023, 2:58 p.m. UTC | #12
On 2023-01-18 09:53, Mathieu Desnoyers wrote:
> On 2023-01-17 01:42, Eric Wong wrote:
>> Carlos O'Donell <carlos@redhat.com> wrote:
>>>>> - Adding urcu as a build-time dependency is not acceptable for
>>>>> bootstrap, instead we would bundle a copy of urcu and keep it
>>>>> in sync with upstream. Would that make your work easier?
>>>
>>> Eric Wong wrote:
>>>> Yes, bundling that sounds great.  I assume it's something for
>>>> you or one of the regular contributors to work on (build systems
>>>> scare me :x)
>>>
>>> Yes, that is something we'd have to do.
>>
>> Hi, bringing this topic from 2018 up again (+Cc Mathieu):
>> https://inbox.sourceware.org/libc-alpha/c061de55-cc2a-88fe-564b-2ea9c4a7e632@redhat.com/T/
>>
>> I'm wondering if URCU-in-glibc is still on the table.  I'm also
>> considering an learning C11 atomics and deriving a standalone
>> wfcqueue w/o URCU atomics.
> 
> I've done a quick review of your proposed patch, and there is one thing 
> that I'm concerned about: forward progress of remote_free_finish(). 
> AFAIU, if we have a steady flow of remote_free_begin() calls, it can 
> prevent forward progress of the anena owner.
> 
> When remote_free_step() captures the queue (splice) and processes it, it 
> returns whether it has processed any elements, and the caller attempts 
> to splice again if there was anything present. What I wonder is why is 
> it required that the caller splices queue elements that were queued 
> _after_ the begin of remote_free_finish() ? Can't we simply leave those 
> to the next (eventual) remote_free_finish calls ?
> 
> If we do that change, this means remote_free_finish would only be needed 
> when tearing down an arena, and upon allocation only a single call to 
> remote_free_step() would be needed.
> 
> Thoughts ?

Well, never mind: it appears that what I explain here (step in malloc,
finish only when consolidation is needed) is exactly what your 
implementation does. It took me a more careful reading and a few more 
sips of morning coffee to get it. :)

Thanks,

Mathieu


> 
> Thanks,
> 
> Mathieu
>
Mathieu Desnoyers Jan. 18, 2023, 3:48 p.m. UTC | #13
On 2023-01-17 14:05, Mathieu Desnoyers wrote:
> On 2023-01-17 01:42, Eric Wong wrote:
>> Carlos O'Donell <carlos@redhat.com> wrote:
>>>>> - Adding urcu as a build-time dependency is not acceptable for
>>>>> bootstrap, instead we would bundle a copy of urcu and keep it
>>>>> in sync with upstream. Would that make your work easier?
>>>
>>> Eric Wong wrote:
>>>> Yes, bundling that sounds great.  I assume it's something for
>>>> you or one of the regular contributors to work on (build systems
>>>> scare me :x)
>>>
>>> Yes, that is something we'd have to do.
>>
>> Hi, bringing this topic from 2018 up again (+Cc Mathieu):
>> https://inbox.sourceware.org/libc-alpha/c061de55-cc2a-88fe-564b-2ea9c4a7e632@redhat.com/T/
>>
>> I'm wondering if URCU-in-glibc is still on the table.  I'm also
>> considering an learning C11 atomics and deriving a standalone
>> wfcqueue w/o URCU atomics.
> 
> Hi Eric,
> 
> I'm very much interested to contribute portions of liburcu to glibc. I 
> think what we would need at this point is to document where we see that 
> liburcu infrastructure can improve glibc, and create a gradual 
> integration roadmap based on priorities. We should identify the 
> stake-holders interested in seeing this done, and then we can discuss 
> the time-frame and resources available for this project.
> 
> That being said, I suspect we'd want to cover a few pieces of technology 
> in this list, namely:
> 
> - Restartable Sequences (rseq system call),
> - membarrier system call,
> - liburcu and libside [1,2] RCU implementations,
> - liburcu data structures (e.g. wfcqueue).

Here is a PoC implementing liburcu wfcqueue with C11 atomics:

https://review.lttng.org/c/userspace-rcu/+/9271 PoC: wfcqueue: remove dependencies on liburcu headers

Feedback is welcome!

Thanks,

Mathieu

> 
> I would be tempted to go for an approach closer to the RCU 
> implementation I have done in the libside project for glibc, because it 
> supports having multiple RCU grace period domains per process, which 
> allows to nicely split the locking dependencies into sub-domains, and 
> therefore lessen the chances of deadlocks due to nesting of a 
> global-domain RCU read-side/synchronize_rcu, and locking (e.g. mutexes).
> 
> Thanks,
> 
> Mathieu
> 
> [1] https://github.com/efficios/libside/blob/master/src/rcu.c
> [2] https://github.com/efficios/libside/blob/master/src/rcu.h
>
Eric Wong Jan. 18, 2023, 7:12 p.m. UTC | #14
Mathieu Desnoyers via Libc-alpha <libc-alpha@sourceware.org> wrote:
> Here is a PoC implementing liburcu wfcqueue with C11 atomics:
> 
> https://review.lttng.org/c/userspace-rcu/+/9271 PoC: wfcqueue: remove dependencies on liburcu headers
> 
> Feedback is welcome!

Is there a non-JavaScript version or an address to git clone?  I can't view it
(I use w3m and do all my work from an ancient machine or via ssh||mosh)

Thanks.
Mathieu Desnoyers Jan. 18, 2023, 7:17 p.m. UTC | #15
On 2023-01-18 14:12, Eric Wong wrote:
> Mathieu Desnoyers via Libc-alpha <libc-alpha@sourceware.org> wrote:
>> Here is a PoC implementing liburcu wfcqueue with C11 atomics:
>>
>> https://review.lttng.org/c/userspace-rcu/+/9271 PoC: wfcqueue: remove dependencies on liburcu headers
>>
>> Feedback is welcome!
> 
> Is there a non-JavaScript version or an address to git clone?  I can't view it
> (I use w3m and do all my work from an ancient machine or via ssh||mosh)

git clone https://review.lttng.org/userspace-rcu
cd userspace-rcu
git fetch https://review.lttng.org/userspace-rcu refs/changes/71/9271/3 && git checkout FETCH_HEAD

should do the trick.

Thanks!

Mathieu
Eric Wong Jan. 18, 2023, 8:05 p.m. UTC | #16
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> On 2023-01-18 14:12, Eric Wong wrote:
> > Mathieu Desnoyers via Libc-alpha <libc-alpha@sourceware.org> wrote:
> > > Here is a PoC implementing liburcu wfcqueue with C11 atomics:
> > > 
> > > https://review.lttng.org/c/userspace-rcu/+/9271 PoC: wfcqueue: remove dependencies on liburcu headers
> > > 
> > > Feedback is welcome!
> > 
> > Is there a non-JavaScript version or an address to git clone?  I can't view it
> > (I use w3m and do all my work from an ancient machine or via ssh||mosh)
> 
> git clone https://review.lttng.org/userspace-rcu
> cd userspace-rcu
> git fetch https://review.lttng.org/userspace-rcu refs/changes/71/9271/3 && git checkout FETCH_HEAD

Thanks. (Fwiw, I prefer `git show --color-words -W refs/changes/71/9271/3')
It looks fine to me, but I'm no expert on this stuff and just
getting my feet wet with C11...

The busy wait / cpu_relax path isn't needed for malloc, at least.

Patch

diff --git a/malloc/malloc.c b/malloc/malloc.c
index e247c77b7d..40d61e45db 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -247,6 +247,9 @@ 
 /* For SINGLE_THREAD_P.  */
 #include <sysdep-cancel.h>
 
+#define _LGPL_SOURCE /* allows inlines */
+#include <urcu/wfcqueue.h>
+
 /*
   Debugging:
 
@@ -1660,6 +1663,9 @@  struct malloc_state
   /* Serialize access.  */
   __libc_lock_define (, mutex);
 
+  /* Only the owner of this arena writes to the head */
+  struct __cds_wfcq_head remote_free_head;
+
   /* Flags (formerly in max_fast).  */
   int flags;
 
@@ -1697,6 +1703,11 @@  struct malloc_state
   /* Memory allocated from the system in this arena.  */
   INTERNAL_SIZE_T system_mem;
   INTERNAL_SIZE_T max_system_mem;
+
+  /* remote_free_tail is written to by a thread other than the owner of
+     this arena, so we want this on a different cacheline than
+     remote_free_head */
+  struct cds_wfcq_tail remote_free_tail;
 };
 
 struct malloc_par
@@ -1794,6 +1805,7 @@  malloc_init_state (mstate av)
   int i;
   mbinptr bin;
 
+  __cds_wfcq_init (&av->remote_free_head, &av->remote_free_tail);
   /* Establish circular links for normal bins */
   for (i = 1; i < NBINS; ++i)
     {
@@ -3007,6 +3019,67 @@  tcache_thread_shutdown (void)
 
 #endif /* !USE_TCACHE  */
 
+_Static_assert (MINSIZE >= sizeof(struct cds_wfcq_node),
+                "struct cds_wfcq_node too big");
+/* wait-free enqueue to the remote arena */
+static void
+remote_free_begin (mstate av, void *mem)
+{
+  struct cds_wfcq_node *node = mem;
+
+  cds_wfcq_node_init (node);
+  cds_wfcq_enqueue (&av->remote_free_head, &av->remote_free_tail, node);
+  /* other thread calls remote_free_step */
+}
+
+/*
+ * process remote free queue, must have locked av
+ * returns true if it did anything
+ */
+static bool
+remote_free_step (mstate av)
+{
+  struct cds_wfcq_node *node, *n;
+  struct __cds_wfcq_head tmp_head;
+  struct cds_wfcq_tail tmp_tail;
+  enum cds_wfcq_ret ret;
+
+  if (__glibc_unlikely (av == NULL))
+    {
+      return false;
+    }
+
+  __cds_wfcq_init (&tmp_head, &tmp_tail);
+  ret = __cds_wfcq_splice_nonblocking (&tmp_head, &tmp_tail,
+				       &av->remote_free_head,
+				       &av->remote_free_tail);
+
+  if (__glibc_unlikely (ret == CDS_WFCQ_RET_DEST_EMPTY))
+    {
+      MAYBE_INIT_TCACHE ();
+      __cds_wfcq_for_each_blocking_safe (&tmp_head, &tmp_tail, node, n)
+        {
+          _int_free (av, mem2chunk(node), 1);
+        }
+
+      /*
+       * tell the caller we did some work; it's possible other threads
+       * enqueued more work for us while we were busy
+       */
+      return true;
+    }
+
+  assert (ret != CDS_WFCQ_RET_DEST_NON_EMPTY);
+
+  return false; /* did nothing */
+}
+
+static void
+remote_free_finish (mstate av)
+{
+  while (remote_free_step (av)) ;
+}
+
 void *
 __libc_malloc (size_t bytes)
 {
@@ -3045,6 +3118,7 @@  __libc_malloc (size_t bytes)
     }
 
   arena_get (ar_ptr, bytes);
+  remote_free_step (ar_ptr);
 
   victim = _int_malloc (ar_ptr, bytes);
   /* Retry with another arena only if we were able to find a usable arena
@@ -3053,6 +3127,7 @@  __libc_malloc (size_t bytes)
     {
       LIBC_PROBE (memory_malloc_retry, 1, bytes);
       ar_ptr = arena_get_retry (ar_ptr, bytes);
+      remote_free_step (ar_ptr);
       victim = _int_malloc (ar_ptr, bytes);
     }
 
@@ -3102,10 +3177,16 @@  __libc_free (void *mem)
       return;
     }
 
-  MAYBE_INIT_TCACHE ();
-
   ar_ptr = arena_for_chunk (p);
-  _int_free (ar_ptr, p, 0);
+  if (thread_arena == ar_ptr)	/* thread_arena may be NULL */
+    {
+      MAYBE_INIT_TCACHE (); /* XXX is this needed if thread_arena == ar_ptr? */
+      _int_free (ar_ptr, p, 0);
+    }
+  else
+    {
+      remote_free_begin (ar_ptr, mem);
+    }
 }
 libc_hidden_def (__libc_free)
 
@@ -3211,6 +3292,8 @@  __libc_realloc (void *oldmem, size_t bytes)
 
   __libc_lock_lock (ar_ptr->mutex);
 
+  remote_free_step (ar_ptr);
+
   newp = _int_realloc (ar_ptr, oldp, oldsize, nb);
 
   __libc_lock_unlock (ar_ptr->mutex);
@@ -3225,7 +3308,14 @@  __libc_realloc (void *oldmem, size_t bytes)
       if (newp != NULL)
         {
           memcpy (newp, oldmem, oldsize - SIZE_SZ);
-          _int_free (ar_ptr, oldp, 0);
+          if (thread_arena == ar_ptr)
+            {
+              _int_free (ar_ptr, oldp, 0);
+            }
+          else  /* don't lock again */
+            {
+              remote_free_begin (ar_ptr, oldmem);
+            }
         }
     }
 
@@ -3294,12 +3384,14 @@  _mid_memalign (size_t alignment, size_t bytes, void *address)
     }
 
   arena_get (ar_ptr, bytes + alignment + MINSIZE);
+  remote_free_step(ar_ptr);
 
   p = _int_memalign (ar_ptr, alignment, bytes);
   if (!p && ar_ptr != NULL)
     {
       LIBC_PROBE (memory_memalign_retry, 2, bytes, alignment);
       ar_ptr = arena_get_retry (ar_ptr, bytes);
+      remote_free_step(ar_ptr);
       p = _int_memalign (ar_ptr, alignment, bytes);
     }
 
@@ -3388,7 +3480,10 @@  __libc_calloc (size_t n, size_t elem_size)
   if (SINGLE_THREAD_P)
     av = &main_arena;
   else
-    arena_get (av, sz);
+    {
+      arena_get (av, sz);
+      remote_free_step (av);
+    }
 
   if (av)
     {
@@ -3428,6 +3523,7 @@  __libc_calloc (size_t n, size_t elem_size)
 	{
 	  LIBC_PROBE (memory_calloc_retry, 1, sz);
 	  av = arena_get_retry (av, sz);
+	  remote_free_step(av);
 	  mem = _int_malloc (av, sz);
 	}
 
@@ -4750,6 +4846,7 @@  static int
 mtrim (mstate av, size_t pad)
 {
   /* Ensure all blocks are consolidated.  */
+  remote_free_finish (av);
   malloc_consolidate (av);
 
   const size_t ps = GLRO (dl_pagesize);
@@ -5133,6 +5230,7 @@  __libc_mallopt (int param_number, int value)
 
   /* We must consolidate main arena before changing max_fast
      (see definition of set_max_fast).  */
+  remote_free_finish (av);
   malloc_consolidate (av);
 
   switch (param_number)