
[4/5] linux-user: Support CLONE_VM and extended clone options

Message ID 20200612014606.147691-5-jkz@google.com
State New
Series linux-user: Support extended clone(CLONE_VM)

Commit Message

Josh Kunz June 12, 2020, 1:46 a.m. UTC
The `clone` system call can be used to create new processes that share
attributes with their parents, such as virtual memory, file
system location, file descriptor tables, etc. These can be useful to a
variety of guest programs.

Before this patch, QEMU supported only a limited set of these attributes:
basically the ones needed for threads, and the options used by fork.
This change adds support for all flag combinations involving CLONE_VM.
In theory, almost all clone options could be supported, but invocations
not using CLONE_VM are likely to run afoul of linux-user's inherently
multi-threaded design.

To add this support, this patch updates the `qemu_clone` helper. An
overview of the mechanism used to support general `clone` options with
CLONE_VM is described below.

This patch also enables the `clone` unit tests in
tests/tcg/multiarch/linux-test.c by default, and adds an additional test for
duplicate exit signals, based on a bug found during development.

!! Overview

Adding support for CLONE_VM is tricky. The parent and child processes will
share an address space (similar to threads), so the emulator must
coordinate between the parent and the child. Currently, QEMU relies
heavily on Thread Local Storage (TLS) as part of this coordination
strategy. For threads, this works fine, because libc manages the
thread-local data region used for TLS when we create new threads using
`pthread_create`. Ideally we could use the same mechanism for the
"process-local storage" needed to allow the parent and child processes to
emulate in tandem. Unfortunately, TLS is tightly integrated into libc.
The only way to create TLS data regions is via the `pthread_create` API,
which also spawns a new thread (rather than a new process, which is
what we want). Worse still, TLS itself is a complicated arch-specific
feature that is tightly integrated into the rest of libc and the dynamic
linker. Re-implementing TLS support for QEMU would likely require a
special dynamic linker / libc. Alternatively, the popular libcs could be
extended to allow users to create TLS regions without creating
threads. Even if major libcs decide to add this support, QEMU will still
need a temporary workaround until those libcs are widely deployed. It's
also unclear if libcs will be interested in supporting this case, since
TLS image creation is generally deeply integrated with thread setup.
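
To make the shared-TLS problem concrete, here is a small standalone sketch
(not part of this patch; x86_64 + glibc behaviour assumed): a child created
with a plain `clone(CLONE_VM)` inherits the parent's TLS pointer, so its
"thread-local" accesses alias the parent's storage.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static __thread int tls_counter;        /* nominally per-thread state */

    static int child_fn(void *arg)
    {
        (void)arg;
        tls_counter = 42;   /* same TLS base + shared VM: clobbers the parent's slot */
        return 0;
    }

    int main(void)
    {
        char *stack = malloc(64 * 1024);
        if (!stack) {
            return 1;
        }
        tls_counter = 1;
        pid_t pid = clone(child_fn, stack + 64 * 1024, CLONE_VM | SIGCHLD, NULL);
        if (pid < 0) {
            perror("clone");
            return 1;
        }
        waitpid(pid, NULL, 0);
        printf("parent's tls_counter is now %d\n", tls_counter);   /* prints 42 */
        free(stack);
        return 0;
    }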

In this patch, I've employed an alternative approach: spawning a thread
and "stealing" its TLS image for use in the child process. This approach
leaves a dangling thread while the TLS image is in use, but by design
that thread will not become schedulable until after the TLS data is no
longer in use by the child (as described in a moment). Therefore, it
should cause relatively minimal overhead. When considered in the larger
context, this seems like a reasonable tradeoff.

A major complication of this approach is knowing when it is safe to clean
up the stack and TLS image used by a child process. When a child is
created with `CLONE_VM`, its stack and TLS data need to remain valid
until that child has either exited or successfully called `execve` (on
`execve` the child is given a new virtual memory map by the kernel). One
approach would be to use `waitid(WNOWAIT)` (the `WNOWAIT` allows the guest
to reap the child). The problem is that the `wait` family of calls only
waits for termination. The pattern of `clone() ... execve()` for
long-running child processes is pretty common. If we waited for child
processes to exit, it's likely we would end up using substantially more
memory, and keep the suspended TLS thread around much longer than
necessary. Instead, in this patch, I've used a "trampoline" process. The
real parent first clones a trampoline, and the trampoline then clones the
ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends
the trampoline process until the child has exited or called `execve`.
Once the trampoline is re-scheduled, we know it is safe to clean up
after the child. This creates one more suspended process, but typically
the trampoline only exists for a short period of time.
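
The key property the trampoline relies on can be shown with a standalone
Linux sketch (again, not QEMU code): a parent that clones with
`CLONE_VM | CLONE_VFORK` is suspended until the child exits or calls
`execve`, at which point the child can no longer reference the stack (or
TLS image) it was lent.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int child_fn(void *arg)
    {
        (void)arg;
        execlp("true", "true", (char *)NULL);   /* exec gives the child a fresh address space */
        _exit(127);                             /* only reached if the exec fails */
    }

    int main(void)
    {
        char *stack = malloc(64 * 1024);
        if (!stack) {
            return 1;
        }
        pid_t pid = clone(child_fn, stack + 64 * 1024,
                          CLONE_VM | CLONE_VFORK | SIGCHLD, NULL);
        if (pid < 0) {
            perror("clone");
            return 1;
        }
        /*
         * clone() only returns here once the child has exec'd or exited,
         * so the stack we handed it is safe to reclaim.
         */
        free(stack);
        waitpid(pid, NULL, 0);
        return 0;
    }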

!! CLONE_VM setup, step by step

1. First, the suspended thread whose TLS we will use is created using
   `pthread_create`. The thread fetches and returns its "TLS pointer"
   (an arch-specific value given to the kernel) to the parent. It then
   blocks on a lock to prevent its TLS data from being cleaned up.
   Ultimately the lock will be unlocked by the trampoline once the child
   exits.
2. Once the TLS thread has fetched the TLS pointer, it notifies the real
   parent thread, which calls `clone()` to create the trampoline
   process. For ease of implementation, the TLS image is set for the
   trampoline process during this step. This allows the trampoline to
   use functions that require TLS if needed (e.g., printf). TLS location
   is inherited when a new child is spawned, so this TLS data will
   automatically be inherited by the child.
3. Once the trampoline has been spawned, it registers itself as a
   "hidden" process with the signal subsystem. This prevents the exit
   signal from the trampoline from ever being forwarded to the guest.
   This is needed due to the way that Linux sets the exit signal for the
   ultimate child when `CLONE_PARENT` is set. See the source for
   details.
4. Once setup is complete, the trampoline spawns the final child with
   the original clone flags, plus `CLONE_PARENT`, so the child is
   correctly parented to the kernel task on which the guest invoked
   `clone`. Without this, kernel features like PDEATHSIG, and
   subreapers, would not work properly. As previously discussed, the
   trampoline also supplies `CLONE_VFORK` so that it is suspended until
   the child can be cleaned up.
5. Once the child is spawned, it signals the original parent thread that
   it is running. At this point, the trampoline process is suspended
   (due to CLONE_VFORK).
6. Finally, the call to `qemu_clone` in the parent is finished, the
   child begins executing the given callback function in the new child
   process.

!! Cleaning up

Clean up itself is a multi-step process. Once the child exits, or is
killed by a signal (cleanup is the same in both cases), the trampoline
process becomes schedulable. When the trampoline is scheduled, it frees
the child stack, and unblocks the suspended TLS thread. This cleans up
the child resources, but not the stack used by the trampoline itself. It
is possible for a process to clean up its own stack, but it is tricky,
and architecture-specific. Instead we leverage the TLS manager thread to
clean up the trampoline stack. When the trampoline is cloned (in step 2
above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID`
flags. The target location for the SET/CLEAR TID is set to a special field
known by the TLS manager. Then, when the TLS manager thread is unsuspended,
it performs an additional `FUTEX_WAIT` on this location. That blocks the
TLS manager thread until the trampoline has fully exited; the TLS
manager thread then frees the trampoline process's stack before exiting
itself.

!! Shortcomings of this patch

* It's complicated.
* It doesn't support any clone options when CLONE_VM is omitted.
* It doesn't properly clean up the CPU queue when the child process
  terminates, or calls execve().
* RCU unregistration is done in the trampoline process (in clone.c), but
  registration happens in syscall.c. This should be made more explicit.
* The TLS image, and trampoline stack are not cleaned up if the parent
  calls `execve` or `exit_group` before the child does. This is because
  those cleanup tasks are handled by the TLS manager thread. The TLS
  manager thread is in the same thread group as the parent, so it will
  be terminated if the parent exits or calls `execve`.

!! Alternatives considered

* Non-standard libc extension to allow creating TLS images independent
  of threads. This would allow us to just `clone` the child directly
  instead of this complicated maneuver. Though we probably would still
  need the cleanup logic. For libcs, TLS image allocation is tightly
  connected to thread stack allocation, which is also arch-specific. I
  do not have enough experience with libc development to know if
  maintainers of any popular libcs would be open to supporting such an
  API. Additionally, since it will probably take years before a libc
  fix would be widely deployed, we need an interim solution anyways.
* Non-standard, Linux-only, libc extension to allow us to specify the
  CLONE_* flags used by `pthread_create`. The processes we are creating
  are basically threads in a different thread group. If we could alter
  the flags used, this whole process could become a `pthread_create()`.
  The problem with this approach is that I don't know what requirements
  pthreads has on threads to ensure they function properly. I suspect
  that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to clean up detached
  thread state. Since we don't control the child exit reason (Linux only
  handles CHILD_CLEARTID on normal, non-signal process termination), we
  probably can't use this same tracking mechanism.
* Other mechanisms for detecting child exit so cleanup can happen
  besides CLONE_VFORK:
  * waitid(WNOWAIT): This can only detect exit, not execve.
  * file descriptors with close on exec set: This cannot detect children
    cloned with CLONE_FILES.
  * System V semaphore adjustments: Cannot detect children cloned with
    CLONE_SYSVSEM.
  * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally
    terminated children.
* Doing the child clone directly in the TLS manager thread: This avoids
  the need for the trampoline process, but it causes the child process to
  be parented to the wrong kernel task (the TLS thread instead of the main
  thread), breaking things like PDEATHSIG.

Signed-off-by: Josh Kunz <jkz@google.com>
---
 linux-user/clone.c               | 415 ++++++++++++++++++++++++++++++-
 linux-user/qemu.h                |  17 ++
 linux-user/signal.c              |  49 ++++
 linux-user/syscall.c             |  69 +++--
 tests/tcg/multiarch/linux-test.c |  67 ++++-
 5 files changed, 592 insertions(+), 25 deletions(-)

Comments

Josh Kunz June 13, 2020, 12:10 a.m. UTC | #1
> +    child_tid = atomic_fetch_or(&mgr->managed_tid, 0);
> +    /*
> +     * Check if the child has already terminated by this point. If not, wait
> +     * for the child to exit. As long as the trampoline is not killed by
> +     * a signal, the kernel guarantees that the memory at &mgr->managed_tid
> +     * will be cleared, and a FUTEX_WAKE at that address will triggered.
> +     */
> +    if (child_tid != 0) {
> +        ret = syscall(SYS_futex, &mgr->managed_tid, FUTEX_WAIT,
> +                      child_tid, NULL, NULL, 0);
> +        assert(ret == 0 && "clone manager futex should always succeed");
> +    }

A note for any reviewers/maintainers: While doing some additional
testing today, I discovered there is a bug in this section of the
patch. The child process can exit between the `atomic_fetch` and start
of the `futex(FUTEX_WAIT)` call, causing the kernel to respond with an
`EAGAIN` error, which will be caught by the assert and crash the
program. I have a patch for this. I suspect there will be comments on
this change, so I'm holding off on re-sending the series until initial
reviews have been done. I just wanted to make maintainers aware to
avoid the possibility of this bug being merged in the (very) unlikely
case there are no comments.
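
For reference, one way to tolerate that race (a sketch only; the actual fix
may look different) is to treat EAGAIN from the FUTEX_WAIT as meaning the
trampoline has already exited:

    child_tid = atomic_fetch_or(&mgr->managed_tid, 0);
    if (child_tid != 0) {
        ret = syscall(SYS_futex, &mgr->managed_tid, FUTEX_WAIT,
                      child_tid, NULL, NULL, 0);
        /*
         * EAGAIN means the kernel cleared managed_tid between the atomic
         * read above and the FUTEX_WAIT, i.e. the trampoline has already
         * exited, which is exactly the condition being waited for.
         */
        assert((ret == 0 || errno == EAGAIN) &&
               "clone manager futex should always succeed");
    }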
Alex Bennée June 16, 2020, 4:08 p.m. UTC | #2
Josh Kunz <jkz@google.com> writes:

> The `clone` system call can be used to create new processes that share
> attributes with their parents, such as virtual memory, file
> system location, file descriptor tables, etc. These can be useful to a
> variety of guest programs.
>
> Before this patch, QEMU had support for a limited set of these attributes.
> Basically the ones needed for threads, and the options used by fork.
> This change adds support for all flag combinations involving CLONE_VM.
> In theory, almost all clone options could be supported, but invocations
> not using CLONE_VM are likely to run afoul of linux-user's inherently
> multi-threaded design.
>
> To add this support, this patch updates the `qemu_clone` helper. An
> overview of the mechanism used to support general `clone` options with
> CLONE_VM is described below.
>
> This patch also enables by-default the `clone` unit-tests in
> tests/tcg/multiarch/linux-test.c, and adds an additional test for duplicate
> exit signals, based on a bug found during development.

Which by the way fail on some targets:

    TEST    linux-test on alpha
  /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death
  make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
  make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2
  make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2

Have you managed a clean check-tcg with docker enabled so all the guest
architectures get tested?

>
> !! Overview
>
> Adding support for CLONE_VM is tricky. The parent and guest process will
> share an address space (similar to threads), so the emulator must
> coordinate between the parent and the child. Currently, QEMU relies
> heavily on Thread Local Storage (TLS) as part of this coordination
> strategy. For threads, this works fine, because libc manages the
> thread-local data region used for TLS, when we create new threads using
> `pthread_create`. Ideally we could use the same mechanism for
> "process-local storage" needed to allow the parent/child processes to
> emulate in tandem. Unfortunately TLS is tightly integrated into libc.
> The only way to create TLS data regions is via the `pthread_create` API
> which also spawns a new thread (rather than a new processes, which is
> what we want). Worse still, TLS itself is a complicated arch-specific
> feature that is tightly integrated into the rest of libc and the dynamic
> linker. Re-implementing TLS support for QEMU would likely require a
> special dynamic linker / libc. Alternatively, the popular libcs could be
> extended, to allow for users to create TLS regions without creating
> threads. Even if major libcs decide to add this support, QEMU will still
> need a temporary work around until those libcs are widely deployed. It's
> also unclear if libcs will be interested in supporting this case, since
> TLS image creation is generally deeply integrated with thread setup.
>
> In this patch, I've employed an alternative approach: spawning a thread
> an "stealing" its TLS image for use in the child process. This approach
> leaves a dangling thread while the TLS image is in use, but by design
> that thread will not become schedulable until after the TLS data is no
> longer in-use by the child (as described in a moment). Therefore, it
> should cause relatively minimal overhead. When considered in the larger
> context, this seems like a reasonable tradeoff.

*sharp intake of breath*

OK so the solution to the complexity of handling threads is to add more
threads? cool cool cool....

>
> A major complication of this approach knowing when it is safe to clean up
> the stack, and TLS image, used by a child process. When a child is
> created with `CLONE_VM` its stack, and TLS data, need to remain valid
> until that child has either exited, or successfully called `execve` (on
> `execve` the child is given a new VMM by the kernel). One approach would
> be to use `waitid(WNOWAIT)` (the `WNOWAIT` allows the guest to reap the
> child). The problem is that the `wait` family of calls only waits for
> termination. The pattern of `clone() ... execve()` for long running
> child processes is pretty common. If we waited for child processes to
> exit, it's likely we would end up using substantially more memory, and
> keep the suspended TLS thread around much longer than necessary.
> Instead, in this patch, I've used an "trampoline" process. The real
> parent first clones a trampoline, the trampoline then clones the
> ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends
> the trampoline process until the child has exited, or called `execve`.
> Once the trampoline is re-scheduled, we know it is safe to clean up
> after the child. This creates one more suspended process, but typically,
> the trampoline only exists for a short period of time.
>
> !! CLONE_VM setup, step by step
>
> 1. First, the suspended thread whose TLS we will use is created using
>    `pthread_create`. The thread fetches and returns it's "TLS pointer"
>    (an arch-specific value given to the kernel) to the parent. It then
>    blocks on a lock to prevent its TLS data from being cleaned up.
>    Ultimately the lock will be unlocked by the trampoline once the child
>    exits.
> 2. Once the TLS thread has fetched the TLS pointer, it notifies the real
>    parent thread, which calls `clone()` to create the trampoline
>    process. For ease of implementation, the TLS image is set for the
>    trampoline process during this step. This allows the trampoline to
>    use functions that require TLS if needed (e.g., printf). TLS location
>    is inherited when a new child is spawned, so this TLS data will
>    automatically be inherited by the child.
> 3. Once the trampoline has been spawned, it registers itself as a
>    "hidden" process with the signal subsystem. This prevents the exit
>    signal from the trampoline from ever being forwarded to the guest.
>    This is needed due to the way that Linux sets the exit signal for the
>    ultimate child when `CLONE_PARENT` is set. See the source for
>    details.
> 4. Once setup is complete, the trampoline spawns the final child with
>    the original clone flags, plus `CLONE_PARENT`, so the child is
>    correctly parented to the kernel task on which the guest invoked
>    `clone`. Without this, kernel features like PDEATHSIG, and
>    subreapers, would not work properly. As previously discussed, the
>    trampoline also supplies `CLONE_VFORK` so that it is suspended until
>    the child can be cleaned up.
> 5. Once the child is spawned, it signals the original parent thread that
>    it is running. At this point, the trampoline process is suspended
>    (due to CLONE_VFORK).
> 6. Finally, the call to `qemu_clone` in the parent is finished, the
>    child begins executing the given callback function in the new child
>    process.
>
> !! Cleaning up
>
> Clean up itself is a multi-step process. Once the child exits, or is
> killed by a signal (cleanup is the same in both cases), the trampoline
> process becomes schedulable. When the trampoline is scheduled, it frees
> the child stack, and unblocks the suspended TLS thread. This cleans up
> the child resources, but not the stack used by the trampoline itself. It
> is possible for a process to clean up its own stack, but it is tricky,
> and architecture-specific. Instead we leverage the TLS manager thread to
> clean up the trampoline stack. When the trampoline is cloned (in step 2
> above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID`
> flags. The target location for the SET/CLEAR TID is set to a special field
> known by the TLS manager. Then, when the TLS manager thread is unsuspended,
> it performs an additional `FUTEX_WAIT` on this location. That blocks the
> TLS manager thread until the trampoline has fully exited, then the TLS
> manager thread frees the trampoline process's stack, before exiting
> itself.
>
> !! Shortcomings of this patch
>
> * It's complicated.
> * It doesn't support any clone options when CLONE_VM is omitted.
> * It doesn't properly clean up the CPU queue when the child process
>   terminates, or calls execve().
> * RCU unregistration is done in the trampoline process (in clone.c), but
>   registration happens in syscall.c This should be made more explicit.
> * The TLS image, and trampoline stack are not cleaned up if the parent
>   calls `execve` or `exit_group` before the child does. This is because
>   those cleanup tasks are handled by the TLS manager thread. The TLS
>   manager thread is in the same thread group as the parent, so it will
>   be terminated if the parent exits or calls `execve`.
>
> !! Alternatives considered
>
> * Non-standard libc extension to allow creating TLS images independent
>   of threads. This would allow us to just `clone` the child directly
>   instead of this complicated maneuver. Though we probably would still
>   need the cleanup logic. For libcs, TLS image allocation is tightly
>   connected to thread stack allocation, which is also arch-specific. I
>   do not have enough experience with libc development to know if
>   maintainers of any popular libcs would be open to supporting such an
>   API. Additionally, since it will probably take years before a libc
>   fix would be widely deployed, we need an interim solution anyways.

We could consider a custom lib stub that intercepts calls to the guest's
original libc and replaces it with a QEMU-aware one?

> * Non-standard, Linux-only, libc extension to allow us to specify the
>   CLONE_* flags used by `pthread_create`. The processes we are creating
>   are basically threads in a different thread group. If we could alter
>   the flags used, this whole processes could become a `pthread_create.`
>   The problem with this approach is that I don't know what requirements
>   pthreads has on threads to ensure they function properly. I suspect
>   that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to cleanup detached
>   thread state. Since we don't control the child exit reason (Linux only
>   handles CHILD_CLEARTID on normal, non-signal process termination), we
>   probably can't use this same tracking mechanism.
> * Other mechanisms for detecting child exit so cleanup can happen
>   besides CLONE_VFORK:
>   * waitid(WNOWAIT): This can only detect exit, not execve.
>   * file descriptors with close on exec set: This cannot detect children
>     cloned with CLONE_FILES.
>   * System V semaphore adjustments: Cannot detect children cloned with
>     CLONE_SYSVSEM.
>   * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally
>     terminated children.
> * Doing the child clone directly in the TLS manager thread: This saves the
>   need for the trampoline process, but it causes the child process to be
>   parented to the wrong kernel task (the TLS thread instead of the Main
>   thread) breaking things like PDEATHSIG.

Have you considered a daemon which could co-ordinate between the
multiple processes that are sharing some state?


> Signed-off-by: Josh Kunz <jkz@google.com>
> ---
>  linux-user/clone.c               | 415 ++++++++++++++++++++++++++++++-
>  linux-user/qemu.h                |  17 ++
>  linux-user/signal.c              |  49 ++++
>  linux-user/syscall.c             |  69 +++--
>  tests/tcg/multiarch/linux-test.c |  67 ++++-
>  5 files changed, 592 insertions(+), 25 deletions(-)
>
> diff --git a/linux-user/clone.c b/linux-user/clone.c
> index f02ae8c464..3f7344cf9e 100644
> --- a/linux-user/clone.c
> +++ b/linux-user/clone.c
> @@ -12,6 +12,12 @@
>  #include <stdbool.h>
>  #include <assert.h>
>  
> +/* arch-specifc includes needed to fetch the TLS base offset. */
> +#if defined(__x86_64__)
> +#include <asm/prctl.h>
> +#include <sys/prctl.h>
> +#endif
> +
>  static const unsigned long NEW_STACK_SIZE = 0x40000UL;
>  
>  /*
> @@ -62,6 +68,397 @@ static void completion_finish(struct completion *c)
>      pthread_mutex_unlock(&c->mu);
>  }
>  
> +struct tls_manager {
> +    void *tls_ptr;
> +    /* fetched is completed once tls_ptr has been set by the thread. */
> +    struct completion fetched;
> +    /*
> +     * spawned is completed by the user once the managed_tid
> +     * has been spawned.
> +     */
> +    struct completion spawned;
> +    /*
> +     * TID of the child whose memory is cleaned up upon death. This memory
> +     * location is used as part of a futex op, and is cleared by the kernel
> +     * since we specify CHILD_CLEARTID.
> +     */
> +    int managed_tid;
> +    /*
> +     * The value to be `free`'d up once the janitor is ready to clean up the
> +     * TLS section, and the managed tid has exited.
> +     */
> +    void *cleanup;
> +};
> +
> +/*
> + * tls_ptr fetches the TLS "pointer" for the current thread. This pointer
> + * should be whatever platform-specific address is used to represent the TLS
> + * base address.
> + */
> +static void *tls_ptr()

This and a number of other prototypes need void args to stop the
compiler complaining about missing prototypes.
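
For example (sketch):

    static void *tls_ptr(void);
    static bool clone_vm_supported(void);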

> +{
> +    void *ptr;
> +#if defined(__x86_64__)
> +    /*
> +     * On x86_64, the TLS base is stored in the `fs` segment register, we can
> +     * fetch it with `ARCH_GET_FS`:
> +     */
> +    (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr);
> +#else
> +    ptr = NULL;
> +#endif
> +    return ptr;
> +}
> +
> +/*
> + * clone_vm_supported returns true if clone_vm() is supported on this
> + * platform.
> + */
> +static bool clone_vm_supported()
> +{
> +#if defined(__x86_64__)
> +    return true;
> +#else
> +    return false;
> +#endif
> +}
<snip>
Josh Kunz June 23, 2020, 3:43 a.m. UTC | #3
Thanks for the responses Alex. I'm working on your comments, but
wanted to clarify some of the points you brought up before mailing a
second version. Responses inline.

On Tue, Jun 16, 2020 at 9:08 AM Alex Bennée <alex.bennee@linaro.org> wrote:
> Which by the way fail on some targets:
>
>     TEST    linux-test on alpha
>   /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death
>   make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
>   make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2
>   make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2
>
> Have you managed a clean check-tcg with docker enabled so all the guest
> architectures get tested?

I've gotten this Alpha failure to reproduce on my local build and I'm
working on a fix. Thanks for pointing this out. I'll make sure I get a
clean `make check-tcg` for `linux-test` on all guest architectures.

> > In this patch, I've employed an alternative approach: spawning a thread
> > an "stealing" its TLS image for use in the child process. This approach
> > leaves a dangling thread while the TLS image is in use, but by design
> > that thread will not become schedulable until after the TLS data is no
> > longer in-use by the child (as described in a moment). Therefore, it
> > should cause relatively minimal overhead. When considered in the larger
> > context, this seems like a reasonable tradeoff.
>
> *sharp intake of breath*
>
> OK so the solution to the complexity of handling threads is to add more
> threads? cool cool cool....

The solution to the complexity of shared memory, but yeah, not my
favorite either. I was kinda hoping that someone on the list would
explain why this approach is clearly wrong.

> > * Non-standard libc extension to allow creating TLS images independent
> >   of threads. This would allow us to just `clone` the child directly
> >   instead of this complicated maneuver. Though we probably would still
> >   need the cleanup logic. For libcs, TLS image allocation is tightly
> >   connected to thread stack allocation, which is also arch-specific. I
> >   do not have enough experience with libc development to know if
> >   maintainers of any popular libcs would be open to supporting such an
> >   API. Additionally, since it will probably take years before a libc
> >   fix would be widely deployed, we need an interim solution anyways.
>
> We could consider a custom lib stub that intercepts calls to the guests
> original libc and replaces it with a QEMU aware one?

Unfortunately the problem here is host libc, rather than guest libc.
We need to make TLS variables in QEMU itself work, so intercepting
guest libc calls won't help much. Or am I misunderstanding the point?

> Have you considered a daemon which could co-ordinate between the
> multiple processes that are sharing some state?

Not really for the `CLONE_VM` support added in this patch series. I
have considered trying to pull tcg out of the guest process, but not
very seriously, since it seems like a pretty heavyweight approach.
Especially compared to the solution included in this series. Do you
think there's a simpler approach that involves using a daemon to do
coordination?

Thanks again for your reviews.

--
Josh Kunz
Alex Bennée June 23, 2020, 8:21 a.m. UTC | #4
Josh Kunz <jkz@google.com> writes:

> Thanks for the responses Alex. I'm working on your comments, but
> wanted to clarify some of the points you brought up before mailing a
> second version. Responses inline.
>
> On Tue, Jun 16, 2020 at 9:08 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>> Which by the way fail on some targets:
>>
>>     TEST    linux-test on alpha
>>   /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death
>>   make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
>>   make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2
>>   make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2
>>
>> Have you managed a clean check-tcg with docker enabled so all the guest
>> architectures get tested?
>
> I've gotten this Alpha failure to reproduce on my local build and I'm
> working on a fix. Thanks for pointing this out. I'll make sure I get a
> clean `make check-tcg` for `linux-test` on all guest architectures.
>
>> > In this patch, I've employed an alternative approach: spawning a thread
>> > an "stealing" its TLS image for use in the child process. This approach
>> > leaves a dangling thread while the TLS image is in use, but by design
>> > that thread will not become schedulable until after the TLS data is no
>> > longer in-use by the child (as described in a moment). Therefore, it
>> > should cause relatively minimal overhead. When considered in the larger
>> > context, this seems like a reasonable tradeoff.
>>
>> *sharp intake of breath*
>>
>> OK so the solution to the complexity of handling threads is to add more
>> threads? cool cool cool....
>
> The solution to the complexity of shared memory, but yeah, not my
> favorite either. I was kinda hoping that someone on the list would
> explain why this approach is clearly wrong.
>
>> > * Non-standard libc extension to allow creating TLS images independent
>> >   of threads. This would allow us to just `clone` the child directly
>> >   instead of this complicated maneuver. Though we probably would still
>> >   need the cleanup logic. For libcs, TLS image allocation is tightly
>> >   connected to thread stack allocation, which is also arch-specific. I
>> >   do not have enough experience with libc development to know if
>> >   maintainers of any popular libcs would be open to supporting such an
>> >   API. Additionally, since it will probably take years before a libc
>> >   fix would be widely deployed, we need an interim solution anyways.
>>
>> We could consider a custom lib stub that intercepts calls to the guests
>> original libc and replaces it with a QEMU aware one?
>
> Unfortunately the problem here is host libc, rather than guest libc.
> We need to make TLS variables in QEMU itself work, so intercepting
> guest libc calls won't help much. Or am I misunderstanding the point?

Hold up - I'm a little confused now. Why does the host TLS affect the
guest TLS? We have complete control over the guest's view of the world so
we should be able to control its TLS storage.

>> Have you considered a daemon which could co-ordinate between the
>> multiple processes that are sharing some state?
>
> Not really for the `CLONE_VM` support added in this patch series. I
> have considered trying to pull tcg out of the guest process, but not
> very seriously, since it seems like a pretty heavyweight approach.
> Especially compared to the solution included in this series. Do you
> think there's a simpler approach that involves using a daemon to do
> coordination?

I'm getting a little lost now. Exactly what state are we trying to share
between two QEMU guests which are now in separate execution contexts?
Josh Kunz July 9, 2020, 12:16 a.m. UTC | #5
Sorry for the late reply, response inline. Also, I noticed that a couple of
mails ago I seem to have removed the devel list and maintainers.
I've re-added them to the CC line.

On Wed, Jun 24, 2020 at 3:17 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Josh Kunz <jkz@google.com> writes:
>
> > On Tue, Jun 23, 2020, 1:21 AM Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> > (snip)
> >
> >> >> > * Non-standard libc extension to allow creating TLS images independent
> >> >> >   of threads. This would allow us to just `clone` the child directly
> >> >> >   instead of this complicated maneuver. Though we probably would still
> >> >> >   need the cleanup logic. For libcs, TLS image allocation is tightly
> >> >> >   connected to thread stack allocation, which is also arch-specific. I
> >> >> >   do not have enough experience with libc development to know if
> >> >> >   maintainers of any popular libcs would be open to supporting such an
> >> >> >   API. Additionally, since it will probably take years before a libc
> >> >> >   fix would be widely deployed, we need an interim solution anyways.
> >> >>
> >> >> We could consider a custom lib stub that intercepts calls to the guests
> >> >> original libc and replaces it with a QEMU aware one?
> >> >
> >> > Unfortunately the problem here is host libc, rather than guest libc.
> >> > We need to make TLS variables in QEMU itself work, so intercepting
> >> > guest libc calls won't help much. Or am I misunderstanding the point?
> >>
> >> Hold up - I'm a little confused now. Why does the host TLS affect the
> >> guest TLS? We have complete control over the guests view of the world so
> >> we should be able to control it's TLS storage.
> >
> > Guest TLS is unaffected, just like in the existing case for guest
> > threads. Guest TLS is handled by the guest libc and the CPU emulation.
> > Just to be clear: This series changes nothing about guest TLS.
> >
> > The complexity of this series is to deal with *host* usage of TLS.
> > That is to say: use of thread local variables in QEMU itself. Host TLS
> > is needed to allow the subprocess created with `clone(CLONE_VM, ...)`
> > to run at all. TLS variables are used in QEMU for the RCU
> > implementation, parts of the TCG, and all over the place to access the
> > CPU/TaskState for the running thread. Host TLS is managed by the host
> > libc, and TLS is only set up for host threads created via
> > `pthread_create`. Subprocesses created with `clone(CLONE_VM)` share a
> > virtual memory map *and* TLS data with their parent[1], since libcs
> > provide no special handling of TLS when `clone(CLONE_VM)` is used.
> > Without the workaround used in this patch, both the parent and child
> > process's thread local variables reference the same memory locations.
> > This just doesn't work, since thread local data is assumed to actually
> > be thread local.
> >
> > The "alternative" proposed was to make the host libc support TLS for
> > processes created using clone (there are several ways to go about
> > this, each with different tradeoffs). You mentioned that "We could
> > consider a custom lib stub that intercepts calls to the guests
> > original libc..." in your comment. Since *guest* libc is not involved
> > here I was a bit confused about how this could help, and wanted to
> > clarify.
> >
> >> >> Have you considered a daemon which could co-ordinate between the
> >> >> multiple processes that are sharing some state?
> >> >
> >> > Not really for the `CLONE_VM` support added in this patch series. I
> >> > have considered trying to pull tcg out of the guest process, but not
> >> > very seriously, since it seems like a pretty heavyweight approach.
> >> > Especially compared to the solution included in this series. Do you
> >> > think there's a simpler approach that involves using a daemon to do
> >> > coordination?
> >>
> >> I'm getting a little lost now. Exactly what state are we trying to share
> >> between two QEMU guests which are now in separate execution contexts?
> >
> > Since this series only deals with `clone(CLONE_VM)` we always want to
> > share guest virtual memory between the execution contexts. There is
> > also some extra state that needs to be shared depending on which flags
> > are provided to `clone()`. E.g., signal handler tables for
> > CLONE_SIGHAND, file descriptor tables for CLONE_FILES, etc.
> >
> > The problem is that since QEMU and the guest live in the same virtual
> > memory map, keeping the mappings the same between the guest parent and
> > guest child means that the mappings also stay the same between the
> > host (QEMU) parent and host child. Two hosts can live in the same
> > virtual memory map, like we do right now with threads, but *only* with
> > valid TLS for each thread/process. That's why we bend-over backwards
> > to get set-up TLS for emulation in the child process.
>
> OK thanks for that. I'd obviously misunderstood from my first read
> through. So while hiding the underlying bits of QEMU from the guest is
> relatively easy it's quite hard to hide QEMU from itself in this
> CLONE_VM case.

Yes exactly.

> The other approach would be to suppress CLONE_VM for the actual process
> (thereby allowing QEMU to safely have a new instance and no clashing
> shared data) but emulate CLONE_VM for the guest itself (making the guest
> portions of memory shared and visible to each other). The trouble then
> would be co-ordination of mapping operations and other things that
> should be visible in a real CLONE_VM setup. This is the sort of
> situation I envisioned a co-ordination daemon might be useful.

Ah. This is interesting. Effectively the inverse of this patch. I had
not considered this approach. Thinking more about it, a "no shared
memory" approach does seem more straightforward implementation wise.
Unfortunately I think there would be a few substantial drawbacks:

1. Memory overhead. Every guest thread would need a full copy of QEMU
memory, including the translated guest binary.
2. Performance overhead. To keep virtual memory maps consistent across
tasks, a heavyweight 2 phase commit scheme, or similar, would be
needed for every `mmap`. That could have substantial performance
overhead for the guest. This could be a huge problem for processes
that use a large number of threads *and* do a lot of memory mapping or
frequently change page permissions.
3. There would be lots of similarly-fiddly bits that need to be shared
and coordinated in addition to guest memory. At least the signal
handler tables and fd_trans tables, but there are likely others I'm
missing.

The performance drawbacks could be largely mitigated by using the
current thread-only `CLONE_VM` support, but having *any* threads in
the process at all would lead to deadlocks after fork() or similar
non-CLONE_VM clone() calls. This could be worked around with a "stop
the world" button somewhat like `start_exclusive`, but expanded to
include all emulator threads. That will substantially slow down
fork().

Given all this I think the approach used in this series is probably at
least as "good" as a "no shared memory" approach. It has its own
complexities and drawbacks, but doesn't have obvious performance
issues. If you or other maintainers disagree, I'd be happy to write up
an RFC comparing the approaches in more detail (or we can just use
this thread), just let me know. Until then I'll keep pursuing this
patch.

> > [1] At least on x86_64, because TLS references are defined in terms of
> > the %fs segment, which is inherited on linux. Theoretically it's up to
> > the architecture to specify how TLS is inherited across execution
> > contexts. t's possible that the child actually ends up with no valid
> > TLS rather than using the parent TLS data. But that's not really
> > relevant here. The important thing is that the child ends up with
> > *valid* TLS, not invalid or inherited TLS.
>
>
> --
> Alex Bennée

--
Josh Kunz
Alex Bennée July 16, 2020, 10:41 a.m. UTC | #6
Josh Kunz <jkz@google.com> writes:

> Sorry for the late reply, response inline. Also I noticed a couple
> mails ago I seemed to have removed the devel list and maintainers.
> I've re-added them to the CC line.
>
> On Wed, Jun 24, 2020 at 3:17 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>
>> Josh Kunz <jkz@google.com> writes:
>>
>> > On Tue, Jun 23, 2020, 1:21 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>> >
>> > (snip)
>> >
>> >> >> > * Non-standard libc extension to allow creating TLS images independent
>> >> >> >   of threads. This would allow us to just `clone` the child directly
>> >> >> >   instead of this complicated maneuver. Though we probably would still
>> >> >> >   need the cleanup logic. For libcs, TLS image allocation is tightly
>> >> >> >   connected to thread stack allocation, which is also arch-specific. I
>> >> >> >   do not have enough experience with libc development to know if
>> >> >> >   maintainers of any popular libcs would be open to supporting such an
>> >> >> >   API. Additionally, since it will probably take years before a libc
>> >> >> >   fix would be widely deployed, we need an interim solution anyways.
>> >> >>
>> >> >> We could consider a custom lib stub that intercepts calls to the guests
>> >> >> original libc and replaces it with a QEMU aware one?
>> >> >
>> >> > Unfortunately the problem here is host libc, rather than guest libc.
>> >> > We need to make TLS variables in QEMU itself work, so intercepting
>> >> > guest libc calls won't help much. Or am I misunderstanding the point?
>> >>
>> >> Hold up - I'm a little confused now. Why does the host TLS affect the
>> >> guest TLS? We have complete control over the guests view of the world so
>> >> we should be able to control it's TLS storage.
>> >
>> > Guest TLS is unaffected, just like in the existing case for guest
>> > threads. Guest TLS is handled by the guest libc and the CPU emulation.
>> > Just to be clear: This series changes nothing about guest TLS.
>> >
>> > The complexity of this series is to deal with *host* usage of TLS.
>> > That is to say: use of thread local variables in QEMU itself. Host TLS
>> > is needed to allow the subprocess created with `clone(CLONE_VM, ...)`
>> > to run at all. TLS variables are used in QEMU for the RCU
>> > implementation, parts of the TCG, and all over the place to access the
>> > CPU/TaskState for the running thread. Host TLS is managed by the host
>> > libc, and TLS is only set up for host threads created via
>> > `pthread_create`. Subprocesses created with `clone(CLONE_VM)` share a
>> > virtual memory map *and* TLS data with their parent[1], since libcs
>> > provide no special handling of TLS when `clone(CLONE_VM)` is used.
>> > Without the workaround used in this patch, both the parent and child
>> > process's thread local variables reference the same memory locations.
>> > This just doesn't work, since thread local data is assumed to actually
>> > be thread local.
>> >
>> > The "alternative" proposed was to make the host libc support TLS for
>> > processes created using clone (there are several ways to go about
>> > this, each with different tradeoffs). You mentioned that "We could
>> > consider a custom lib stub that intercepts calls to the guests
>> > original libc..." in your comment. Since *guest* libc is not involved
>> > here I was a bit confused about how this could help, and wanted to
>> > clarify.
>> >
>> >> >> Have you considered a daemon which could co-ordinate between the
>> >> >> multiple processes that are sharing some state?
>> >> >
>> >> > Not really for the `CLONE_VM` support added in this patch series. I
>> >> > have considered trying to pull tcg out of the guest process, but not
>> >> > very seriously, since it seems like a pretty heavyweight approach.
>> >> > Especially compared to the solution included in this series. Do you
>> >> > think there's a simpler approach that involves using a daemon to do
>> >> > coordination?
>> >>
>> >> I'm getting a little lost now. Exactly what state are we trying to share
>> >> between two QEMU guests which are now in separate execution contexts?
>> >
>> > Since this series only deals with `clone(CLONE_VM)` we always want to
>> > share guest virtual memory between the execution contexts. There is
>> > also some extra state that needs to be shared depending on which flags
>> > are provided to `clone()`. E.g., signal handler tables for
>> > CLONE_SIGHAND, file descriptor tables for CLONE_FILES, etc.
>> >
>> > The problem is that since QEMU and the guest live in the same virtual
>> > memory map, keeping the mappings the same between the guest parent and
>> > guest child means that the mappings also stay the same between the
>> > host (QEMU) parent and host child. Two hosts can live in the same
>> > virtual memory map, like we do right now with threads, but *only* with
>> > valid TLS for each thread/process. That's why we bend-over backwards
>> > to get set-up TLS for emulation in the child process.
>>
>> OK thanks for that. I'd obviously misunderstood from my first read
>> through. So while hiding the underlying bits of QEMU from the guest is
>> relatively easy it's quite hard to hide QEMU from itself in this
>> CLONE_VM case.
>
> Yes exactly.
>
>> The other approach would be to suppress CLONE_VM for the actual process
>> (thereby allowing QEMU to safely have a new instance and no clashing
>> shared data) but emulate CLONE_VM for the guest itself (making the guest
>> portions of memory shared and visible to each other). The trouble then
>> would be co-ordination of mapping operations and other things that
>> should be visible in a real CLONE_VM setup. This is the sort of
>> situation I envisioned a co-ordination daemon might be useful.
>
> Ah. This is interesting. Effectively the inverse of this patch. I had
> not considered this approach. Thinking more about it, a "no shared
> memory" approach does seem more straightforward implementation wise.
> Unfortunately I think there would be a few substantial drawbacks:
>
> 1. Memory overhead. Every guest thread would need a full copy of QEMU
> memory, including the translated guest binary.

Sure, although I suspect the overhead is not that great. For linux-user
on 64-bit systems we only allocate 128MB of translation buffer per
process. What sort of size systems are you expecting to run on and how
big are the binaries?

> 2. Performance overhead. To keep virtual memory maps consistent across
> tasks, a heavyweight 2 phase commit scheme, or similar, would be
> needed for every `mmap`. That could have substantial performance
> overhead for the guest. This could be a huge problem for processes
> that use a large number of threads *and* do a lot of memory mapping or
> frequently change page permissions.

I suspect that cross-arch highly threaded apps are still in the realm of
"wow, that actually works, neat :-)" for linux-user. We don't have the
luxury of falling back to a single thread like we do for system
emulation so things like strong-on-weak memory order bugs can still trip
us up.

> 3. There would be lots of similarly-fiddly bits that need to be shared
> and coordinated in addition to guest memory. At least the signal
> handler tables and fd_trans tables, but there are likely others I'm
> missing.
>
> The performance drawbacks could be largely mitigated by using the
> current thread-only `CLONE_VM` support, but having *any* threads in
> the process at all would lead to deadlocks after fork() or similar
> non-CLONE_VM clone() calls. This could be worked around with a "stop
> the world" button somewhat like `start_exclusive`, but expanded to
> include all emulator threads. That will substantially slow down
> fork().
>
> Given all this I think the approach used in this series is probably at
> least as "good" as a "no shared memory" approach. It has its own
> complexities and drawbacks, but doesn't have obvious performance
> issues. If you or other maintainers disagree, I'd be happy to write up
> an RFC comparing the approaches in more detail (or we can just use
> this thread), just let me know. Until then I'll keep pursuing this
> patch.

I think that's fair. I'll leave it to the maintainers to chime in if
they have something to add. I'd already given some comments on patch 1 and
given it needs a re-spin I'll have another look on the next iteration.

I will say, expect the system to get some testing on multiple backends, so
if you can expand your testing beyond an x86_64 host, please do.

>
>> > [1] At least on x86_64, because TLS references are defined in terms of
>> > the %fs segment, which is inherited on linux. Theoretically it's up to
>> > the architecture to specify how TLS is inherited across execution
>> > contexts. t's possible that the child actually ends up with no valid
>> > TLS rather than using the parent TLS data. But that's not really
>> > relevant here. The important thing is that the child ends up with
>> > *valid* TLS, not invalid or inherited TLS.
>>
>>
>> --
>> Alex Bennée

Patch

diff --git a/linux-user/clone.c b/linux-user/clone.c
index f02ae8c464..3f7344cf9e 100644
--- a/linux-user/clone.c
+++ b/linux-user/clone.c
@@ -12,6 +12,12 @@ 
 #include <stdbool.h>
 #include <assert.h>
 
+/* arch-specific includes needed to fetch the TLS base offset. */
+#if defined(__x86_64__)
+#include <asm/prctl.h>
+#include <sys/prctl.h>
+#endif
+
 static const unsigned long NEW_STACK_SIZE = 0x40000UL;
 
 /*
@@ -62,6 +68,397 @@  static void completion_finish(struct completion *c)
     pthread_mutex_unlock(&c->mu);
 }
 
+struct tls_manager {
+    void *tls_ptr;
+    /* fetched is completed once tls_ptr has been set by the thread. */
+    struct completion fetched;
+    /*
+     * spawned is completed by the user once the managed_tid
+     * has been spawned.
+     */
+    struct completion spawned;
+    /*
+     * TID of the child whose memory is cleaned up upon death. This memory
+     * location is used as part of a futex op, and is cleared by the kernel
+     * since we specify CHILD_CLEARTID.
+     */
+    int managed_tid;
+    /*
+     * The value to be `free`'d up once the janitor is ready to clean up the
+     * TLS section, and the managed tid has exited.
+     */
+    void *cleanup;
+};
+
+/*
+ * tls_ptr fetches the TLS "pointer" for the current thread. This pointer
+ * should be whatever platform-specific address is used to represent the TLS
+ * base address.
+ */
+static void *tls_ptr()
+{
+    void *ptr;
+#if defined(__x86_64__)
+    /*
+     * On x86_64, the TLS base is stored in the `fs` segment register, we can
+     * fetch it with `ARCH_GET_FS`:
+     */
+    (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr);
+#else
+    ptr = NULL;
+#endif
+    return ptr;
+}
+
+/*
+ * clone_vm_supported returns true if clone_vm() is supported on this
+ * platform.
+ */
+static bool clone_vm_supported()
+{
+#if defined(__x86_64__)
+    return true;
+#else
+    return false;
+#endif
+}
+
+static void *tls_manager_thread(void *arg)
+{
+    struct tls_manager *mgr = (struct tls_manager *) arg;
+    int child_tid, ret;
+
+    /*
+     * NOTE: Do not use any TLS in this thread until after the `spawned`
+     * completion is finished. We need to preserve the pristine state of
+     * the TLS image for this thread, so it can be re-used in a separate
+     * process.
+     */
+    mgr->tls_ptr = tls_ptr();
+
+    /* Notify tls_new that we finished fetching the TLS ptr. */
+    completion_finish(&mgr->fetched);
+
+    /*
+     * Wait for the user of our TLS to tell us the child using our TLS has
+     * been spawned.
+     */
+    completion_await(&mgr->spawned);
+
+    child_tid = atomic_fetch_or(&mgr->managed_tid, 0);
+    /*
+     * Check if the child has already terminated by this point. If not, wait
+     * for the child to exit. As long as the trampoline is not killed by
+     * a signal, the kernel guarantees that the memory at &mgr->managed_tid
+     * will be cleared, and a FUTEX_WAKE at that address will be triggered.
+     */
+    if (child_tid != 0) {
+        ret = syscall(SYS_futex, &mgr->managed_tid, FUTEX_WAIT,
+                      child_tid, NULL, NULL, 0);
+        assert(ret == 0 && "clone manager futex should always succeed");
+    }
+
+    free(mgr->cleanup);
+    g_free(mgr);
+
+    return NULL;
+}
+
+static struct tls_manager *tls_manager_new()
+{
+    struct tls_manager *mgr = g_new0(struct tls_manager, 1);
+    sigset_t block, oldmask;
+
+    sigfillset(&block);
+    if (sigprocmask(SIG_BLOCK, &block, &oldmask) != 0) {
+        return NULL;
+    }
+
+    completion_init(&mgr->fetched);
+    completion_init(&mgr->spawned);
+
+    pthread_attr_t attr;
+    pthread_attr_init(&attr);
+    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
+
+    pthread_t unused;
+    if (pthread_create(&unused, &attr, tls_manager_thread, (void *) mgr)) {
+        pthread_attr_destroy(&attr);
+        g_free(mgr);
+        return NULL;
+    }
+    pthread_attr_destroy(&attr);
+    completion_await(&mgr->fetched);
+
+    if (sigprocmask(SIG_SETMASK, &oldmask, NULL) != 0) {
+        /* Let the thread exit, and cleanup itself. */
+        completion_finish(&mgr->spawned);
+        return NULL;
+    }
+
+    /* Once we finish awaiting, the tls_ptr will be usable. */
+    return mgr;
+}
+
+struct stack {
+    /* Buffer is the "base" of the stack buffer. */
+    void *buffer;
+    /* Top is the "start" of the stack (since stack addresses "grow down"). */
+    void *top;
+};
+
+struct info {
+    /* Stacks used for the trampoline and child process. */
+    struct {
+        struct stack trampoline;
+        struct stack process;
+    } stack;
+    struct completion child_ready;
+    /* `clone` flags for the process the user asked us to make. */
+    int flags;
+    sigset_t orig_mask;
+    /*
+     * Function to run in the ultimate child process, and payload to pass as
+     * the argument.
+     */
+    int (*clone_f)(void *);
+    void *payload;
+    /*
+     * Result of calling `clone` for the child clone. Will be set to
+     * `-errno` if an error occurs.
+     */
+    int result;
+};
+
+static bool stack_new(struct stack *stack)
+{
+    /*
+     * TODO: put a guard page at the bottom of the stack, so we don't
+     * accidentally roll off the end.
+     */
+    if (posix_memalign(&stack->buffer, 16, NEW_STACK_SIZE)) {
+        return false;
+    }
+    memset(stack->buffer, 0, NEW_STACK_SIZE);
+    stack->top = stack->buffer + NEW_STACK_SIZE;
+    return true;
+}
+
+static int clone_child(void *raw_info)
+{
+    struct info *info = (struct info *) raw_info;
+    int (*clone_f)(void *) = info->clone_f;
+    void *payload = info->payload;
+    if (!(info->flags & CLONE_VFORK)) {
+        /*
+         * If CLONE_VFORK is NOT set, then the trampoline has stalled (it
+         * forces VFORK), but the actual clone should return immediately. In
+         * this case, this thread needs to notify the parent that the new
+         * process is running. If CLONE_VFORK IS set, the trampoline will
+         * notify the parent once the normal kernel vfork completes.
+         */
+        completion_finish(&info->child_ready);
+    }
+    if (sigprocmask(SIG_SETMASK, &info->orig_mask, NULL) != 0) {
+        perror("failed to restore signal mask in cloned child");
+        _exit(1);
+    }
+    return clone_f(payload);
+}
+
+static int clone_trampoline(void *raw_info)
+{
+    struct info *info = (struct info *) raw_info;
+    int flags;
+
+    struct stack process_stack = info->stack.process;
+    int orig_flags = info->flags;
+
+    if (orig_flags & CSIGNAL) {
+        /*
+         * It should be safe to call here, since we know signals are blocked
+         * for this process.
+         */
+        hide_current_process_exit_signal();
+    }
+
+    /*
+     * Force CLONE_PARENT, so that we don't accidentally become a child of the
+     * trampoline thread. This kernel task should either be a child of the
+     * trampoline's parent (if CLONE_PARENT is not in info->flags), or a child
+     * of the calling process's parent (if CLONE_PARENT IS in info->flags).
+     * That is to say, our parent should always be the correct parent for the
+     * child task.
+     *
+     * Force CLONE_VFORK so that we know when the child is no longer holding
+     * a reference to this process's virtual memory. CLONE_VFORK just suspends
+     * this task until the child execs or exits; it should not affect how the
+     * child process is created in any way. This is the only generic way I'm
+     * aware of to observe *any* exit or exec, including "abnormal" exits such
+     * as exits via signals.
+     *
+     * Force CLONE_CHILD_SETTID, since we want to track the child TID in the
+     * `info` structure. Capturing the child TID via the `clone` call
+     * directly is slightly nicer than making a syscall in the child. Since
+     * we know we're doing a CLONE_VM here, we can use CLONE_CHILD_SETTID to
+     * guarantee that the kernel sets the child TID before the child runs.
+     * The child TID is visible to the parent, since parent and child share
+     * an address space. If the clone fails, we overwrite `info->result`
+     * with the error code anyway.
+     */
+    flags = orig_flags | CLONE_PARENT | CLONE_VFORK | CLONE_CHILD_SETTID;
+    if (clone(clone_child, info->stack.process.top, flags,
+              (void *) info, NULL, NULL, &info->result) < 0) {
+        info->result = -errno;
+        completion_finish(&info->child_ready);
+        return 0;
+    }
+
+    /*
+     * Clean up the child process stack, since we know the child can no longer
+     * reference it.
+     */
+    free(process_stack.buffer);
+
+    /*
+     * We know the process we created was CLONE_VFORK, so it registered with
+     * the RCU. We share a TLS image with the process, so we can unregister
+     * it from the RCU. Since the TLS image will be valid for at least our
+     * lifetime, it should be OK to leave the child process's RCU entry in
+     * the queue between when the child calls execve or exits, and when the
+     * OS returns here from our vfork.
+     */
+    rcu_unregister_thread();
+
+    /*
+     * If we're doing a real vfork here, we need to notify the parent that the
+     * vfork has happened.
+     */
+    if (orig_flags & CLONE_VFORK) {
+        completion_finish(&info->child_ready);
+    }
+
+    return 0;
+}
+
+static int clone_vm(int flags, int (*callback)(void *), void *payload)
+{
+    struct info info;
+    sigset_t sigmask;
+    int ret;
+
+    assert(flags & CLONE_VM && "CLONE_VM flag must be set");
+
+    memset(&info, 0, sizeof(info));
+    info.clone_f = callback;
+    info.payload = payload;
+    info.flags = flags;
+
+    /*
+     * Set up the stacks for the child processes needed to execute the clone.
+     */
+    if (!stack_new(&info.stack.trampoline)) {
+        return -1;
+    }
+    if (!stack_new(&info.stack.process)) {
+        free(info.stack.trampoline.buffer);
+        return -1;
+    }
+
+    /*
+     * tls_manager_new grants us its ownership of the reference to the
+     * TLS manager, so we "leak" the data pointer instead of using _get().
+     */
+    struct tls_manager *mgr = tls_manager_new();
+    if (mgr == NULL) {
+        free(info.stack.trampoline.buffer);
+        free(info.stack.process.buffer);
+        return -1;
+    }
+
+    /* Manager cleans up the trampoline stack once the trampoline exits. */
+    mgr->cleanup = info.stack.trampoline.buffer;
+
+    /*
+     * Flags used by the trampoline in the 2-phase clone setup for children
+     * cloned with CLONE_VM. We want the trampoline to be essentially identical
+     * to its parent. This improves the performance of cloning the trampoline,
+     * and guarantees that the real flags are implemented correctly.
+     *
+     * CLONE_CHILD_SETTID: Make the kernel set the managed_tid for the TLS
+     * manager.
+     *
+     * CLONE_CHILD_CLEARTID: Make the kernel clear the managed_tid, and
+     * trigger a FUTEX_WAKE (received by the TLS manager), so the TLS manager
+     * knows when to clean up the trampoline stack.
+     *
+     * CLONE_SETTLS: To set the trampoline TLS based on the tls manager.
+     */
+    static const int base_trampoline_flags = (
+        CLONE_FILES | CLONE_FS | CLONE_IO | CLONE_PTRACE |
+        CLONE_SIGHAND | CLONE_SYSVSEM | CLONE_VM
+    ) | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | CLONE_SETTLS;
+
+    int trampoline_flags = base_trampoline_flags;
+
+    /*
+     * To get the process hierarchy right, we set the trampoline
+     * CLONE_PARENT/CLONE_THREAD flag to match the child
+     * CLONE_PARENT/CLONE_THREAD. So add those flags if specified by the child.
+     */
+    trampoline_flags |= (flags & CLONE_PARENT) ? CLONE_PARENT : 0;
+    trampoline_flags |= (flags & CLONE_THREAD) ? CLONE_THREAD : 0;
+
+    /*
+     * When using CLONE_PARENT, Linux always sets the exit_signal for the
+     * task to the exit_signal of the parent process, which for our purposes
+     * is the trampoline process. exit_signal has special significance for
+     * calls like `wait`, so it needs to be set correctly. We add the signal
+     * part of the user flags here so the ultimate child gets the right
+     * signal.
+     *
+     * This has the unfortunate side-effect of sending the parent two exit
+     * signals: one when the true child exits, and one when the trampoline
+     * exits. To work around this we have to capture the exit signal from
+     * the trampoline and suppress it.
+     */
+    trampoline_flags |= (flags & CSIGNAL);
+
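+    /*
+     * Block all signals before cloning so neither the trampoline nor the
+     * child can take a signal before its state is fully set up. The original
+     * mask is restored in clone_child() and again below for the parent.
+     */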
+    sigfillset(&sigmask);
+    if (sigprocmask(SIG_BLOCK, &sigmask, &info.orig_mask) != 0) {
+        free(info.stack.trampoline.buffer);
+        free(info.stack.process.buffer);
+        completion_finish(&mgr->spawned);
+        return -1;
+    }
+
+    if (clone(clone_trampoline,
+              info.stack.trampoline.top, trampoline_flags, &info,
+              NULL, mgr->tls_ptr, &mgr->managed_tid) < 0) {
+        free(info.stack.trampoline.buffer);
+        free(info.stack.process.buffer);
+        completion_finish(&mgr->spawned);
+        return -1;
+    }
+
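+    /*
+     * Wait until the child is running (or the clone failed), then release
+     * the TLS manager thread; it waits for the trampoline's TID to be
+     * cleared and then frees the trampoline stack.
+     */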
+    completion_await(&info.child_ready);
+    completion_finish(&mgr->spawned);
+
+    ret = sigprocmask(SIG_SETMASK, &info.orig_mask, NULL);
+    /*
+     * If our final sigprocmask call fails, we're pretty screwed. We may
+     * have started the final child now, and there's no going back. If this
+     * ever happens, just crash.
+     */
+    assert(!ret && "sigprocmask after clone needs to succeed");
+
+    /* If we have an error result, then set errno as needed. */
+    if (info.result < 0) {
+        errno = -info.result;
+        return -1;
+    }
+    return info.result;
+}
+
 struct clone_thread_info {
     struct completion running;
     int tid;
@@ -120,6 +517,17 @@  int qemu_clone(int flags, int (*callback)(void *), void *payload)
 {
     int ret;
 
+    /*
+     * Backwards Compatibility: Remove once all target platforms support
+     * clone_vm. Previously, we implemented vfork() via a fork() call;
+     * preserve that behavior here instead of failing.
+     */
+    if (!clone_vm_supported()) {
+        if (flags & CLONE_VFORK) {
+            flags &= ~(CLONE_VFORK | CLONE_VM);
+        }
+    }
+
     if (clone_flags_are_thread(flags)) {
         /*
          * The new process uses the same flags as pthread_create, so we can
@@ -146,7 +554,12 @@  int qemu_clone(int flags, int (*callback)(void *), void *payload)
         return ret;
     }
 
-    /* !fork && !thread */
+    if (clone_vm_supported() && (flags & CLONE_VM)) {
+        return clone_vm(flags, callback, payload);
+    }
+
+    /* !fork && !thread && !CLONE_VM. This form is unsupported. */
+
     errno = EINVAL;
     return -1;
 }
diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index 54bf4f47be..e29912466c 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -94,6 +94,7 @@  struct vm86_saved_state {
 
 struct emulated_sigtable {
     int pending; /* true if signal is pending */
+    pid_t exit_pid; /* non-zero host pid, if a process is exiting. */
     target_siginfo_t info;
 };
 
@@ -183,6 +184,15 @@  typedef struct TaskState {
      * least TARGET_NSIG entries
      */
     struct target_sigaction *sigact_tbl;
+
+    /*
+     * Set to true if the process associated with this task state was cloned.
+     * This is needed to disambiguate cloned processes from threads. If
+     * CLONE_VM is used, a pthread_exit(..) will free the stack/TLS of the
+     * trampoline thread, and the trampoline will be unable to conduct its
+     * cleanup.
+     */
+    bool is_cloned;
 } __attribute__((aligned(16))) TaskState;
 
 extern char *exec_path;
@@ -442,6 +452,13 @@  abi_long do_sigaltstack(abi_ulong uss_addr, abi_ulong uoss_addr, abi_ulong sp);
 int do_sigprocmask(int how, const sigset_t *set, sigset_t *oldset);
 abi_long do_swapcontext(CPUArchState *env, abi_ulong uold_ctx,
                         abi_ulong unew_ctx, abi_long ctx_size);
+
+/*
+ * Register the current process as a "hidden" process. Exit signals generated
+ * by this process should not be delivered to the guest.
+ */
+void hide_current_process_exit_signal(void);
+
 /**
  * block_signals: block all signals while handling this guest syscall
  *
diff --git a/linux-user/signal.c b/linux-user/signal.c
index dc98def6d1..a7f0612b64 100644
--- a/linux-user/signal.c
+++ b/linux-user/signal.c
@@ -36,6 +36,21 @@  typedef struct target_sigaction sigact_table[TARGET_NSIG];
 static void host_signal_handler(int host_signum, siginfo_t *info,
                                 void *puc);
 
+/*
+ * This table, initialized in signal_init, is used to track "hidden"
+ * processes for which exit signals should not be delivered. The PIDs of the
+ * hidden processes are stored as keys. Values are always set to NULL.
+ *
+ * Note: Process IDs stored in this table may "leak" (i.e., never be removed
+ * from the table) if the guest ignores (SIG_IGN) the exit signal for the
+ * child it spawned. There is a small risk that such a PID could later be
+ * reused by another child process, whose exit would then be wrongly hidden.
+ * This is an unusual case that is unlikely to happen, but it is possible.
+ */
+static GHashTable *hidden_processes;
+
+/* this lock guards access to the `hidden_processes` table. */
+static pthread_mutex_t hidden_processes_lock = PTHREAD_MUTEX_INITIALIZER;
 
 /*
  * System includes define _NSIG as SIGRTMAX + 1,
@@ -564,6 +579,9 @@  void signal_init(void)
     /* initialize signal conversion tables */
     signal_table_init();
 
+    /* initialize the hidden process table. */
+    hidden_processes = g_hash_table_new(g_direct_hash, g_direct_equal);
+
     /* Set the signal mask from the host mask. */
     sigprocmask(0, 0, &ts->signal_mask);
 
@@ -749,6 +767,10 @@  static void host_signal_handler(int host_signum, siginfo_t *info,
     k = &ts->sigtab[sig - 1];
     k->info = tinfo;
     k->pending = sig;
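+    /*
+     * Remember the host PID for child-exit signals so that
+     * handle_pending_signal() can drop exit signals that belong to hidden
+     * (trampoline) processes.
+     */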
+    k->exit_pid = 0;
+    if (info->si_code == CLD_EXITED || info->si_code == CLD_KILLED ||
+        info->si_code == CLD_DUMPED) {
+        k->exit_pid = info->si_pid;
+    }
     ts->signal_pending = 1;
 
     /* Block host signals until target signal handler entered. We
@@ -930,6 +952,17 @@  int do_sigaction(int sig, const struct target_sigaction *act,
     return ret;
 }
 
+void hide_current_process_exit_signal(void)
+{
+    pid_t pid = getpid();
+
+    pthread_mutex_lock(&hidden_processes_lock);
+
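+    /* Only the key (our PID) matters; the value is always NULL. */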
+    (void)g_hash_table_insert(hidden_processes, GINT_TO_POINTER(pid), NULL);
+
+    pthread_mutex_unlock(&hidden_processes_lock);
+}
+
 static void handle_pending_signal(CPUArchState *cpu_env, int sig,
                                   struct emulated_sigtable *k)
 {
@@ -944,6 +977,22 @@  static void handle_pending_signal(CPUArchState *cpu_env, int sig,
     /* dequeue signal */
     k->pending = 0;
 
+    if (k->exit_pid) {
+        pthread_mutex_lock(&hidden_processes_lock);
+        /*
+         * If the exit signal is for a hidden PID, then just drop it, and
+         * remove the hidden process from the list, since we know it has
+         * exited.
+         */
+        if (g_hash_table_contains(hidden_processes,
+                                  GINT_TO_POINTER(k->exit_pid))) {
+            g_hash_table_remove(hidden_processes, GINT_TO_POINTER(k->exit_pid));
+            pthread_mutex_unlock(&hidden_processes_lock);
+            return;
+        }
+        pthread_mutex_unlock(&hidden_processes_lock);
+    }
+
     sig = gdb_handlesig(cpu, sig);
     if (!sig) {
         sa = NULL;
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 838caf9c98..20cf5d5464 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -139,10 +139,9 @@ 
 
 /* These flags are ignored:
  * CLONE_DETACHED is now ignored by the kernel;
- * CLONE_IO is just an optimisation hint to the I/O scheduler
  */
 #define CLONE_IGNORED_FLAGS                     \
-    (CLONE_DETACHED | CLONE_IO)
+    (CLONE_DETACHED)
 
 /* Flags for fork which we can implement within QEMU itself */
 #define CLONE_EMULATED_FLAGS               \
@@ -5978,14 +5977,31 @@  static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
     }
     proc_flags = (proc_flags & ~CSIGNAL) | host_sig;
 
-    /* Emulate vfork() with fork() */
-    if (proc_flags & CLONE_VFORK) {
-        proc_flags &= ~(CLONE_VFORK | CLONE_VM);
+
+    if (!clone_flags_are_fork(proc_flags) && !(flags & CLONE_VM)) {
+        /*
+         * If the user is doing a non-CLONE_VM clone, which cannot be emulated
+         * with fork, we can't guarantee that we can emulate this correctly.
+         * It should work OK as long as there are no threads in the parent
+         * process, so we hide it behind a flag for users who know what
+         * they're doing.
+         */
+        qemu_log_mask(LOG_UNIMP,
+                      "Refusing non-fork/thread clone without CLONE_VM.");
+        return -TARGET_EINVAL;
     }
 
-    if (!clone_flags_are_fork(proc_flags) &&
-        !clone_flags_are_thread(proc_flags)) {
-        qemu_log_mask(LOG_UNIMP, "unsupported clone flags");
+    if ((flags & CLONE_FILES) && !(flags & CLONE_VM)) {
+        /*
+         * This flag combination is currently unsupported. QEMU needs to update
+         * the fd_trans_table as new file descriptors are opened. This is easy
+         * when CLONE_VM is set, because the fd_trans_table is shared between
+         * the parent and child. Without CLONE_VM the fd_trans_table will need
+         * to be shared specially, using shared memory mappings or a
+         * consistency protocol between the child and the parent.
+         *
+         * For now, just return EINVAL in this case.
+         */
+        qemu_log_mask(LOG_UNIMP, "CLONE_FILES only supported with CLONE_VM");
         return -TARGET_EINVAL;
     }
 
@@ -6042,6 +6058,10 @@  static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
         ts->sigact_tbl = sigact_table_clone(parent_ts->sigact_tbl);
     }
 
+    if (!clone_flags_are_thread(proc_flags)) {
+        ts->is_cloned = true;
+    }
+
     if (flags & CLONE_CHILD_CLEARTID) {
         ts->child_tidptr = child_tidptr;
     }
@@ -6063,10 +6083,8 @@  static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
         tb_flush(cpu);
     }
 
-    if (proc_flags & CLONE_VM) {
-        info.child.register_thread = true;
-        info.child.signal_setup = true;
-    }
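+    /*
+     * With CLONE_VFORK the parent does not resume until the child has
+     * exec'd or exited, so there is no point coordinating signal setup
+     * with it; skip the setup handshake (and the wait below) in that case.
+     */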
+    info.child.signal_setup = (flags & CLONE_VM) && !(flags & CLONE_VFORK);
+    info.child.register_thread = !!(flags & CLONE_VM);
 
     /*
      * It is not safe to deliver signals until the child has finished
@@ -6078,7 +6096,7 @@  static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
 
     ret = get_errno(qemu_clone(proc_flags, clone_run, (void *) &info));
 
-    if (ret >= 0 && (proc_flags & CLONE_VM)) {
+    if (ret >= 0 && (flags & CLONE_VM) && !(flags & CLONE_VFORK)) {
         /*
          * Wait for the child to finish setup if the child is running in the
          * same VM.
@@ -6092,7 +6110,7 @@  static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
     pthread_cond_destroy(&info.cond);
     pthread_mutex_destroy(&info.mutex);
 
-    if (ret >= 0 && !(proc_flags & CLONE_VM)) {
+    if (ret >= 0 && !(flags & CLONE_VM)) {
         /*
          * If !CLONE_VM, then we need to set parent_tidptr, since the child
          * won't set it for us. Should always be safe to set it here anyways.
@@ -7662,6 +7680,7 @@  static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
     switch(num) {
     case TARGET_NR_exit:
     {
+        bool do_pthread_exit = false;
         /* In old applications this may be used to implement _exit(2).
            However in threaded applictions it is used for thread termination,
            and _exit_group is used for application termination.
@@ -7692,10 +7711,20 @@  static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
                           NULL, NULL, 0);
             }
 
+            /*
+             * Need this multi-step process so we can free ts before calling
+             * pthread_exit.
+             */
+            if (!ts->is_cloned) {
+                do_pthread_exit = true;
+            }
+
             thread_cpu = NULL;
             g_free(ts);
-            rcu_unregister_thread();
-            pthread_exit(NULL);
+            if (do_pthread_exit) {
+                rcu_unregister_thread();
+                pthread_exit(NULL);
+            }
         }
 
         pthread_mutex_unlock(&clone_lock);
@@ -9700,6 +9729,14 @@  static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
 #ifdef __NR_exit_group
         /* new thread calls */
     case TARGET_NR_exit_group: {
+        /*
+         * TODO: We need to clean up CPUs (as is done for exit(2))
+         * for all threads in this process when exit_group is called, at
+         * least for tasks that have been cloned. This could also be done in
+         * clone_trampoline/tls_mgr. Since this cleanup is non-trivial (it
+         * needs to be coordinated across threads) and things seem to be fine
+         * without it, just leaving a note for now.
+         */
         preexit_cleanup(cpu_env, arg1);
         return get_errno(exit_group(arg1));
     }
diff --git a/tests/tcg/multiarch/linux-test.c b/tests/tcg/multiarch/linux-test.c
index 8a7c15cd31..a7723556c2 100644
--- a/tests/tcg/multiarch/linux-test.c
+++ b/tests/tcg/multiarch/linux-test.c
@@ -407,14 +407,13 @@  static void test_clone(void)
 
     stack1 = malloc(STACK_SIZE);
     pid1 = chk_error(clone(thread1_func, stack1 + STACK_SIZE,
-                           CLONE_VM | CLONE_FS | CLONE_FILES |
-                           CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM,
+                           CLONE_VM | SIGCHLD,
                             "hello1"));
 
     stack2 = malloc(STACK_SIZE);
     pid2 = chk_error(clone(thread2_func, stack2 + STACK_SIZE,
                            CLONE_VM | CLONE_FS | CLONE_FILES |
-                           CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM,
+                           CLONE_SIGHAND | CLONE_SYSVSEM | SIGCHLD,
                            "hello2"));
 
     wait_for_child(pid1);
@@ -517,6 +516,61 @@  static void test_shm(void)
     chk_error(shmdt(ptr));
 }
 
+static volatile sig_atomic_t test_clone_signal_count_handler_calls;
+
+static void test_clone_signal_count_handler(int sig)
+{
+    test_clone_signal_count_handler_calls++;
+}
+
+/* A clone function that does nothing and exits successfully. */
+static int successful_func(void *arg __attribute__((unused)))
+{
+    return 0;
+}
+
+/*
+ * With our clone implementation it's possible that we could generate too many
+ * child exit signals. Make sure only the single expected child-exit signal is
+ * generated.
+ */
+static void test_clone_signal_count(void)
+{
+    uint8_t *child_stack;
+    struct sigaction prev, test;
+    int status;
+    pid_t pid;
+
+    memset(&test, 0, sizeof(test));
+    test.sa_handler = test_clone_signal_count_handler;
+    test.sa_flags = SA_RESTART;
+
+    /* Use real-time signals, so every signal event gets delivered. */
+    chk_error(sigaction(SIGRTMIN, &test, &prev));
+
+    child_stack = malloc(STACK_SIZE);
+    pid = chk_error(clone(
+        successful_func,
+        child_stack + STACK_SIZE,
+        CLONE_VM | SIGRTMIN,
+        NULL
+    ));
+
+    /*
+     * Need to use __WCLONE here because we are not using SIGCHLD as the
+     * exit_signal. By default, Linux only waits for children spawned with
+     * SIGCHLD.
+     */
+    chk_error(waitpid(pid, &status, __WCLONE));
+
+    chk_error(sigaction(SIGRTMIN, &prev, NULL));
+
+    if (test_clone_signal_count_handler_calls != 1) {
+        error("expected to receive exactly 1 signal, received %d signals",
+              test_clone_signal_count_handler_calls);
+    }
+}
+
 int main(int argc, char **argv)
 {
     test_file();
@@ -524,11 +578,8 @@  int main(int argc, char **argv)
     test_fork();
     test_time();
     test_socket();
-
-    if (argc > 1) {
-        printf("test_clone still considered buggy\n");
-        test_clone();
-    }
+    test_clone();
+    test_clone_signal_count();
 
     test_signal();
     test_shm();