Message ID | 20200612014606.147691-5-jkz@google.com |
---|---|
State | New |
Headers | show |
Series | linux-user: Support extended clone(CLONE_VM) | expand |
> + child_tid = atomic_fetch_or(&mgr->managed_tid, 0); > + /* > + * Check if the child has already terminated by this point. If not, wait > + * for the child to exit. As long as the trampoline is not killed by > + * a signal, the kernel guarantees that the memory at &mgr->managed_tid > + * will be cleared, and a FUTEX_WAKE at that address will triggered. > + */ > + if (child_tid != 0) { > + ret = syscall(SYS_futex, &mgr->managed_tid, FUTEX_WAIT, > + child_tid, NULL, NULL, 0); > + assert(ret == 0 && "clone manager futex should always succeed"); > + } A note for any reviewers/maintainers: While doing some additional testing today, I discovered there is a bug in this section of the patch. The child process can exit between the `atomic_fetch` and start of the `futex(FUTEX_WAIT)` call, causing the kernel to respond with an `EAGAIN` error, which will be caught by the assert and crash the program. I have a patch for this. I suspect there will be comments on this change, so I'm holding off on re-sending the series until initial reviews have been done. I just wanted to make maintainers aware to avoid the possibility of this bug being merged in the (very) unlikely case there are no comments.
Josh Kunz <jkz@google.com> writes: > The `clone` system call can be used to create new processes that share > attributes with their parents, such as virtual memory, file > system location, file descriptor tables, etc. These can be useful to a > variety of guest programs. > > Before this patch, QEMU had support for a limited set of these attributes. > Basically the ones needed for threads, and the options used by fork. > This change adds support for all flag combinations involving CLONE_VM. > In theory, almost all clone options could be supported, but invocations > not using CLONE_VM are likely to run afoul of linux-user's inherently > multi-threaded design. > > To add this support, this patch updates the `qemu_clone` helper. An > overview of the mechanism used to support general `clone` options with > CLONE_VM is described below. > > This patch also enables by-default the `clone` unit-tests in > tests/tcg/multiarch/linux-test.c, and adds an additional test for duplicate > exit signals, based on a bug found during development. Which by the way fail on some targets: TEST linux-test on alpha /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death make[2]: *** [../Makefile.target:153: run-linux-test] Error 1 make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2 make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2 Have you managed a clean check-tcg with docker enabled so all the guest architectures get tested? > > !! Overview > > Adding support for CLONE_VM is tricky. The parent and guest process will > share an address space (similar to threads), so the emulator must > coordinate between the parent and the child. Currently, QEMU relies > heavily on Thread Local Storage (TLS) as part of this coordination > strategy. For threads, this works fine, because libc manages the > thread-local data region used for TLS, when we create new threads using > `pthread_create`. Ideally we could use the same mechanism for > "process-local storage" needed to allow the parent/child processes to > emulate in tandem. Unfortunately TLS is tightly integrated into libc. > The only way to create TLS data regions is via the `pthread_create` API > which also spawns a new thread (rather than a new processes, which is > what we want). Worse still, TLS itself is a complicated arch-specific > feature that is tightly integrated into the rest of libc and the dynamic > linker. Re-implementing TLS support for QEMU would likely require a > special dynamic linker / libc. Alternatively, the popular libcs could be > extended, to allow for users to create TLS regions without creating > threads. Even if major libcs decide to add this support, QEMU will still > need a temporary work around until those libcs are widely deployed. It's > also unclear if libcs will be interested in supporting this case, since > TLS image creation is generally deeply integrated with thread setup. > > In this patch, I've employed an alternative approach: spawning a thread > an "stealing" its TLS image for use in the child process. This approach > leaves a dangling thread while the TLS image is in use, but by design > that thread will not become schedulable until after the TLS data is no > longer in-use by the child (as described in a moment). Therefore, it > should cause relatively minimal overhead. When considered in the larger > context, this seems like a reasonable tradeoff. *sharp intake of breath* OK so the solution to the complexity of handling threads is to add more threads? cool cool cool.... > > A major complication of this approach knowing when it is safe to clean up > the stack, and TLS image, used by a child process. When a child is > created with `CLONE_VM` its stack, and TLS data, need to remain valid > until that child has either exited, or successfully called `execve` (on > `execve` the child is given a new VMM by the kernel). One approach would > be to use `waitid(WNOWAIT)` (the `WNOWAIT` allows the guest to reap the > child). The problem is that the `wait` family of calls only waits for > termination. The pattern of `clone() ... execve()` for long running > child processes is pretty common. If we waited for child processes to > exit, it's likely we would end up using substantially more memory, and > keep the suspended TLS thread around much longer than necessary. > Instead, in this patch, I've used an "trampoline" process. The real > parent first clones a trampoline, the trampoline then clones the > ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends > the trampoline process until the child has exited, or called `execve`. > Once the trampoline is re-scheduled, we know it is safe to clean up > after the child. This creates one more suspended process, but typically, > the trampoline only exists for a short period of time. > > !! CLONE_VM setup, step by step > > 1. First, the suspended thread whose TLS we will use is created using > `pthread_create`. The thread fetches and returns it's "TLS pointer" > (an arch-specific value given to the kernel) to the parent. It then > blocks on a lock to prevent its TLS data from being cleaned up. > Ultimately the lock will be unlocked by the trampoline once the child > exits. > 2. Once the TLS thread has fetched the TLS pointer, it notifies the real > parent thread, which calls `clone()` to create the trampoline > process. For ease of implementation, the TLS image is set for the > trampoline process during this step. This allows the trampoline to > use functions that require TLS if needed (e.g., printf). TLS location > is inherited when a new child is spawned, so this TLS data will > automatically be inherited by the child. > 3. Once the trampoline has been spawned, it registers itself as a > "hidden" process with the signal subsystem. This prevents the exit > signal from the trampoline from ever being forwarded to the guest. > This is needed due to the way that Linux sets the exit signal for the > ultimate child when `CLONE_PARENT` is set. See the source for > details. > 4. Once setup is complete, the trampoline spawns the final child with > the original clone flags, plus `CLONE_PARENT`, so the child is > correctly parented to the kernel task on which the guest invoked > `clone`. Without this, kernel features like PDEATHSIG, and > subreapers, would not work properly. As previously discussed, the > trampoline also supplies `CLONE_VFORK` so that it is suspended until > the child can be cleaned up. > 5. Once the child is spawned, it signals the original parent thread that > it is running. At this point, the trampoline process is suspended > (due to CLONE_VFORK). > 6. Finally, the call to `qemu_clone` in the parent is finished, the > child begins executing the given callback function in the new child > process. > > !! Cleaning up > > Clean up itself is a multi-step process. Once the child exits, or is > killed by a signal (cleanup is the same in both cases), the trampoline > process becomes schedulable. When the trampoline is scheduled, it frees > the child stack, and unblocks the suspended TLS thread. This cleans up > the child resources, but not the stack used by the trampoline itself. It > is possible for a process to clean up its own stack, but it is tricky, > and architecture-specific. Instead we leverage the TLS manager thread to > clean up the trampoline stack. When the trampoline is cloned (in step 2 > above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID` > flags. The target location for the SET/CLEAR TID is set to a special field > known by the TLS manager. Then, when the TLS manager thread is unsuspended, > it performs an additional `FUTEX_WAIT` on this location. That blocks the > TLS manager thread until the trampoline has fully exited, then the TLS > manager thread frees the trampoline process's stack, before exiting > itself. > > !! Shortcomings of this patch > > * It's complicated. > * It doesn't support any clone options when CLONE_VM is omitted. > * It doesn't properly clean up the CPU queue when the child process > terminates, or calls execve(). > * RCU unregistration is done in the trampoline process (in clone.c), but > registration happens in syscall.c This should be made more explicit. > * The TLS image, and trampoline stack are not cleaned up if the parent > calls `execve` or `exit_group` before the child does. This is because > those cleanup tasks are handled by the TLS manager thread. The TLS > manager thread is in the same thread group as the parent, so it will > be terminated if the parent exits or calls `execve`. > > !! Alternatives considered > > * Non-standard libc extension to allow creating TLS images independent > of threads. This would allow us to just `clone` the child directly > instead of this complicated maneuver. Though we probably would still > need the cleanup logic. For libcs, TLS image allocation is tightly > connected to thread stack allocation, which is also arch-specific. I > do not have enough experience with libc development to know if > maintainers of any popular libcs would be open to supporting such an > API. Additionally, since it will probably take years before a libc > fix would be widely deployed, we need an interim solution anyways. We could consider a custom lib stub that intercepts calls to the guests original libc and replaces it with a QEMU aware one? > * Non-standard, Linux-only, libc extension to allow us to specify the > CLONE_* flags used by `pthread_create`. The processes we are creating > are basically threads in a different thread group. If we could alter > the flags used, this whole processes could become a `pthread_create.` > The problem with this approach is that I don't know what requirements > pthreads has on threads to ensure they function properly. I suspect > that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to cleanup detached > thread state. Since we don't control the child exit reason (Linux only > handles CHILD_CLEARTID on normal, non-signal process termination), we > probably can't use this same tracking mechanism. > * Other mechanisms for detecting child exit so cleanup can happen > besides CLONE_VFORK: > * waitid(WNOWAIT): This can only detect exit, not execve. > * file descriptors with close on exec set: This cannot detect children > cloned with CLONE_FILES. > * System V semaphore adjustments: Cannot detect children cloned with > CLONE_SYSVSEM. > * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally > terminated children. > * Doing the child clone directly in the TLS manager thread: This saves the > need for the trampoline process, but it causes the child process to be > parented to the wrong kernel task (the TLS thread instead of the Main > thread) breaking things like PDEATHSIG. Have you considered a daemon which could co-ordinate between the multiple processes that are sharing some state? > Signed-off-by: Josh Kunz <jkz@google.com> > --- > linux-user/clone.c | 415 ++++++++++++++++++++++++++++++- > linux-user/qemu.h | 17 ++ > linux-user/signal.c | 49 ++++ > linux-user/syscall.c | 69 +++-- > tests/tcg/multiarch/linux-test.c | 67 ++++- > 5 files changed, 592 insertions(+), 25 deletions(-) > > diff --git a/linux-user/clone.c b/linux-user/clone.c > index f02ae8c464..3f7344cf9e 100644 > --- a/linux-user/clone.c > +++ b/linux-user/clone.c > @@ -12,6 +12,12 @@ > #include <stdbool.h> > #include <assert.h> > > +/* arch-specifc includes needed to fetch the TLS base offset. */ > +#if defined(__x86_64__) > +#include <asm/prctl.h> > +#include <sys/prctl.h> > +#endif > + > static const unsigned long NEW_STACK_SIZE = 0x40000UL; > > /* > @@ -62,6 +68,397 @@ static void completion_finish(struct completion *c) > pthread_mutex_unlock(&c->mu); > } > > +struct tls_manager { > + void *tls_ptr; > + /* fetched is completed once tls_ptr has been set by the thread. */ > + struct completion fetched; > + /* > + * spawned is completed by the user once the managed_tid > + * has been spawned. > + */ > + struct completion spawned; > + /* > + * TID of the child whose memory is cleaned up upon death. This memory > + * location is used as part of a futex op, and is cleared by the kernel > + * since we specify CHILD_CLEARTID. > + */ > + int managed_tid; > + /* > + * The value to be `free`'d up once the janitor is ready to clean up the > + * TLS section, and the managed tid has exited. > + */ > + void *cleanup; > +}; > + > +/* > + * tls_ptr fetches the TLS "pointer" for the current thread. This pointer > + * should be whatever platform-specific address is used to represent the TLS > + * base address. > + */ > +static void *tls_ptr() This and a number of other prototypes need void args to stop the compiler complaining about missing prototypes. > +{ > + void *ptr; > +#if defined(__x86_64__) > + /* > + * On x86_64, the TLS base is stored in the `fs` segment register, we can > + * fetch it with `ARCH_GET_FS`: > + */ > + (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr); > +#else > + ptr = NULL; > +#endif > + return ptr; > +} > + > +/* > + * clone_vm_supported returns true if clone_vm() is supported on this > + * platform. > + */ > +static bool clone_vm_supported() > +{ > +#if defined(__x86_64__) > + return true; > +#else > + return false; > +#endif > +} <snip>
Thanks for the responses Alex. I'm working on your comments, but wanted to clarify some of the points you brought up before mailing a second version. Responses inline. On Tue, Jun 16, 2020 at 9:08 AM Alex Bennée <alex.bennee@linaro.org> wrote: > Which by the way fail on some targets: > > TEST linux-test on alpha > /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death > make[2]: *** [../Makefile.target:153: run-linux-test] Error 1 > make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2 > make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2 > > Have you managed a clean check-tcg with docker enabled so all the guest > architectures get tested? I've gotten this Alpha failure to reproduce on my local build and I'm working on a fix. Thanks for pointing this out. I'll make sure I get a clean `make check-tcg` for `linux-test` on all guest architectures. > > In this patch, I've employed an alternative approach: spawning a thread > > an "stealing" its TLS image for use in the child process. This approach > > leaves a dangling thread while the TLS image is in use, but by design > > that thread will not become schedulable until after the TLS data is no > > longer in-use by the child (as described in a moment). Therefore, it > > should cause relatively minimal overhead. When considered in the larger > > context, this seems like a reasonable tradeoff. > > *sharp intake of breath* > > OK so the solution to the complexity of handling threads is to add more > threads? cool cool cool.... The solution to the complexity of shared memory, but yeah, not my favorite either. I was kinda hoping that someone on the list would explain why this approach is clearly wrong. > > * Non-standard libc extension to allow creating TLS images independent > > of threads. This would allow us to just `clone` the child directly > > instead of this complicated maneuver. Though we probably would still > > need the cleanup logic. For libcs, TLS image allocation is tightly > > connected to thread stack allocation, which is also arch-specific. I > > do not have enough experience with libc development to know if > > maintainers of any popular libcs would be open to supporting such an > > API. Additionally, since it will probably take years before a libc > > fix would be widely deployed, we need an interim solution anyways. > > We could consider a custom lib stub that intercepts calls to the guests > original libc and replaces it with a QEMU aware one? Unfortunately the problem here is host libc, rather than guest libc. We need to make TLS variables in QEMU itself work, so intercepting guest libc calls won't help much. Or am I misunderstanding the point? > Have you considered a daemon which could co-ordinate between the > multiple processes that are sharing some state? Not really for the `CLONE_VM` support added in this patch series. I have considered trying to pull tcg out of the guest process, but not very seriously, since it seems like a pretty heavyweight approach. Especially compared to the solution included in this series. Do you think there's a simpler approach that involves using a daemon to do coordination? Thanks again for your reviews. -- Josh Kunz
Josh Kunz <jkz@google.com> writes: > Thanks for the responses Alex. I'm working on your comments, but > wanted to clarify some of the points you brought up before mailing a > second version. Responses inline. > > On Tue, Jun 16, 2020 at 9:08 AM Alex Bennée <alex.bennee@linaro.org> wrote: >> Which by the way fail on some targets: >> >> TEST linux-test on alpha >> /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death >> make[2]: *** [../Makefile.target:153: run-linux-test] Error 1 >> make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2 >> make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2 >> >> Have you managed a clean check-tcg with docker enabled so all the guest >> architectures get tested? > > I've gotten this Alpha failure to reproduce on my local build and I'm > working on a fix. Thanks for pointing this out. I'll make sure I get a > clean `make check-tcg` for `linux-test` on all guest architectures. > >> > In this patch, I've employed an alternative approach: spawning a thread >> > an "stealing" its TLS image for use in the child process. This approach >> > leaves a dangling thread while the TLS image is in use, but by design >> > that thread will not become schedulable until after the TLS data is no >> > longer in-use by the child (as described in a moment). Therefore, it >> > should cause relatively minimal overhead. When considered in the larger >> > context, this seems like a reasonable tradeoff. >> >> *sharp intake of breath* >> >> OK so the solution to the complexity of handling threads is to add more >> threads? cool cool cool.... > > The solution to the complexity of shared memory, but yeah, not my > favorite either. I was kinda hoping that someone on the list would > explain why this approach is clearly wrong. > >> > * Non-standard libc extension to allow creating TLS images independent >> > of threads. This would allow us to just `clone` the child directly >> > instead of this complicated maneuver. Though we probably would still >> > need the cleanup logic. For libcs, TLS image allocation is tightly >> > connected to thread stack allocation, which is also arch-specific. I >> > do not have enough experience with libc development to know if >> > maintainers of any popular libcs would be open to supporting such an >> > API. Additionally, since it will probably take years before a libc >> > fix would be widely deployed, we need an interim solution anyways. >> >> We could consider a custom lib stub that intercepts calls to the guests >> original libc and replaces it with a QEMU aware one? > > Unfortunately the problem here is host libc, rather than guest libc. > We need to make TLS variables in QEMU itself work, so intercepting > guest libc calls won't help much. Or am I misunderstanding the point? Hold up - I'm a little confused now. Why does the host TLS affect the guest TLS? We have complete control over the guests view of the world so we should be able to control it's TLS storage. >> Have you considered a daemon which could co-ordinate between the >> multiple processes that are sharing some state? > > Not really for the `CLONE_VM` support added in this patch series. I > have considered trying to pull tcg out of the guest process, but not > very seriously, since it seems like a pretty heavyweight approach. > Especially compared to the solution included in this series. Do you > think there's a simpler approach that involves using a daemon to do > coordination? I'm getting a little lost now. Exactly what state are we trying to share between two QEMU guests which are now in separate execution contexts?
Sorry for the late reply, response inline. Also I noticed a couple mails ago I seemed to have removed the devel list and maintainers. I've re-added them to the CC line. On Wed, Jun 24, 2020 at 3:17 AM Alex Bennée <alex.bennee@linaro.org> wrote: > > > Josh Kunz <jkz@google.com> writes: > > > On Tue, Jun 23, 2020, 1:21 AM Alex Bennée <alex.bennee@linaro.org> wrote: > > > > (snip) > > > >> >> > * Non-standard libc extension to allow creating TLS images independent > >> >> > of threads. This would allow us to just `clone` the child directly > >> >> > instead of this complicated maneuver. Though we probably would still > >> >> > need the cleanup logic. For libcs, TLS image allocation is tightly > >> >> > connected to thread stack allocation, which is also arch-specific. I > >> >> > do not have enough experience with libc development to know if > >> >> > maintainers of any popular libcs would be open to supporting such an > >> >> > API. Additionally, since it will probably take years before a libc > >> >> > fix would be widely deployed, we need an interim solution anyways. > >> >> > >> >> We could consider a custom lib stub that intercepts calls to the guests > >> >> original libc and replaces it with a QEMU aware one? > >> > > >> > Unfortunately the problem here is host libc, rather than guest libc. > >> > We need to make TLS variables in QEMU itself work, so intercepting > >> > guest libc calls won't help much. Or am I misunderstanding the point? > >> > >> Hold up - I'm a little confused now. Why does the host TLS affect the > >> guest TLS? We have complete control over the guests view of the world so > >> we should be able to control it's TLS storage. > > > > Guest TLS is unaffected, just like in the existing case for guest > > threads. Guest TLS is handled by the guest libc and the CPU emulation. > > Just to be clear: This series changes nothing about guest TLS. > > > > The complexity of this series is to deal with *host* usage of TLS. > > That is to say: use of thread local variables in QEMU itself. Host TLS > > is needed to allow the subprocess created with `clone(CLONE_VM, ...)` > > to run at all. TLS variables are used in QEMU for the RCU > > implementation, parts of the TCG, and all over the place to access the > > CPU/TaskState for the running thread. Host TLS is managed by the host > > libc, and TLS is only set up for host threads created via > > `pthread_create`. Subprocesses created with `clone(CLONE_VM)` share a > > virtual memory map *and* TLS data with their parent[1], since libcs > > provide no special handling of TLS when `clone(CLONE_VM)` is used. > > Without the workaround used in this patch, both the parent and child > > process's thread local variables reference the same memory locations. > > This just doesn't work, since thread local data is assumed to actually > > be thread local. > > > > The "alternative" proposed was to make the host libc support TLS for > > processes created using clone (there are several ways to go about > > this, each with different tradeoffs). You mentioned that "We could > > consider a custom lib stub that intercepts calls to the guests > > original libc..." in your comment. Since *guest* libc is not involved > > here I was a bit confused about how this could help, and wanted to > > clarify. > > > >> >> Have you considered a daemon which could co-ordinate between the > >> >> multiple processes that are sharing some state? > >> > > >> > Not really for the `CLONE_VM` support added in this patch series. I > >> > have considered trying to pull tcg out of the guest process, but not > >> > very seriously, since it seems like a pretty heavyweight approach. > >> > Especially compared to the solution included in this series. Do you > >> > think there's a simpler approach that involves using a daemon to do > >> > coordination? > >> > >> I'm getting a little lost now. Exactly what state are we trying to share > >> between two QEMU guests which are now in separate execution contexts? > > > > Since this series only deals with `clone(CLONE_VM)` we always want to > > share guest virtual memory between the execution contexts. There is > > also some extra state that needs to be shared depending on which flags > > are provided to `clone()`. E.g., signal handler tables for > > CLONE_SIGHAND, file descriptor tables for CLONE_FILES, etc. > > > > The problem is that since QEMU and the guest live in the same virtual > > memory map, keeping the mappings the same between the guest parent and > > guest child means that the mappings also stay the same between the > > host (QEMU) parent and host child. Two hosts can live in the same > > virtual memory map, like we do right now with threads, but *only* with > > valid TLS for each thread/process. That's why we bend-over backwards > > to get set-up TLS for emulation in the child process. > > OK thanks for that. I'd obviously misunderstood from my first read > through. So while hiding the underlying bits of QEMU from the guest is > relatively easy it's quite hard to hide QEMU from itself in this > CLONE_VM case. Yes exactly. > The other approach would be to suppress CLONE_VM for the actual process > (thereby allowing QEMU to safely have a new instance and no clashing > shared data) but emulate CLONE_VM for the guest itself (making the guest > portions of memory shared and visible to each other). The trouble then > would be co-ordination of mapping operations and other things that > should be visible in a real CLONE_VM setup. This is the sort of > situation I envisioned a co-ordination daemon might be useful. Ah. This is interesting. Effectively the inverse of this patch. I had not considered this approach. Thinking more about it, a "no shared memory" approach does seem more straightforward implementation wise. Unfortunately I think there would be a few substantial drawbacks: 1. Memory overhead. Every guest thread would need a full copy of QEMU memory, including the translated guest binary. 2. Performance overhead. To keep virtual memory maps consistent across tasks, a heavyweight 2 phase commit scheme, or similar, would be needed for every `mmap`. That could have substantial performance overhead for the guest. This could be a huge problem for processes that use a large number of threads *and* do a lot of memory mapping or frequently change page permissions. 3. There would be lots of similarly-fiddly bits that need to be shared and coordinated in addition to guest memory. At least the signal handler tables and fd_trans tables, but there are likely others I'm missing. The performance drawbacks could be largely mitigated by using the current thread-only `CLONE_VM` support, but having *any* threads in the process at all would lead to deadlocks after fork() or similar non-CLONE_VM clone() calls. This could be worked around with a "stop the world" button somewhat like `start_exclusive`, but expanded to include all emulator threads. That will substantially slow down fork(). Given all this I think the approach used in this series is probably at least as "good" as a "no shared memory" approach. It has its own complexities and drawbacks, but doesn't have obvious performance issues. If you or other maintainers disagree, I'd be happy to write up an RFC comparing the approaches in more detail (or we can just use this thread), just let me know. Until then I'll keep pursuing this patch. > > [1] At least on x86_64, because TLS references are defined in terms of > > the %fs segment, which is inherited on linux. Theoretically it's up to > > the architecture to specify how TLS is inherited across execution > > contexts. t's possible that the child actually ends up with no valid > > TLS rather than using the parent TLS data. But that's not really > > relevant here. The important thing is that the child ends up with > > *valid* TLS, not invalid or inherited TLS. > > > -- > Alex Bennée -- Josh Kunz
Josh Kunz <jkz@google.com> writes: > Sorry for the late reply, response inline. Also I noticed a couple > mails ago I seemed to have removed the devel list and maintainers. > I've re-added them to the CC line. > > On Wed, Jun 24, 2020 at 3:17 AM Alex Bennée <alex.bennee@linaro.org> wrote: >> >> >> Josh Kunz <jkz@google.com> writes: >> >> > On Tue, Jun 23, 2020, 1:21 AM Alex Bennée <alex.bennee@linaro.org> wrote: >> > >> > (snip) >> > >> >> >> > * Non-standard libc extension to allow creating TLS images independent >> >> >> > of threads. This would allow us to just `clone` the child directly >> >> >> > instead of this complicated maneuver. Though we probably would still >> >> >> > need the cleanup logic. For libcs, TLS image allocation is tightly >> >> >> > connected to thread stack allocation, which is also arch-specific. I >> >> >> > do not have enough experience with libc development to know if >> >> >> > maintainers of any popular libcs would be open to supporting such an >> >> >> > API. Additionally, since it will probably take years before a libc >> >> >> > fix would be widely deployed, we need an interim solution anyways. >> >> >> >> >> >> We could consider a custom lib stub that intercepts calls to the guests >> >> >> original libc and replaces it with a QEMU aware one? >> >> > >> >> > Unfortunately the problem here is host libc, rather than guest libc. >> >> > We need to make TLS variables in QEMU itself work, so intercepting >> >> > guest libc calls won't help much. Or am I misunderstanding the point? >> >> >> >> Hold up - I'm a little confused now. Why does the host TLS affect the >> >> guest TLS? We have complete control over the guests view of the world so >> >> we should be able to control it's TLS storage. >> > >> > Guest TLS is unaffected, just like in the existing case for guest >> > threads. Guest TLS is handled by the guest libc and the CPU emulation. >> > Just to be clear: This series changes nothing about guest TLS. >> > >> > The complexity of this series is to deal with *host* usage of TLS. >> > That is to say: use of thread local variables in QEMU itself. Host TLS >> > is needed to allow the subprocess created with `clone(CLONE_VM, ...)` >> > to run at all. TLS variables are used in QEMU for the RCU >> > implementation, parts of the TCG, and all over the place to access the >> > CPU/TaskState for the running thread. Host TLS is managed by the host >> > libc, and TLS is only set up for host threads created via >> > `pthread_create`. Subprocesses created with `clone(CLONE_VM)` share a >> > virtual memory map *and* TLS data with their parent[1], since libcs >> > provide no special handling of TLS when `clone(CLONE_VM)` is used. >> > Without the workaround used in this patch, both the parent and child >> > process's thread local variables reference the same memory locations. >> > This just doesn't work, since thread local data is assumed to actually >> > be thread local. >> > >> > The "alternative" proposed was to make the host libc support TLS for >> > processes created using clone (there are several ways to go about >> > this, each with different tradeoffs). You mentioned that "We could >> > consider a custom lib stub that intercepts calls to the guests >> > original libc..." in your comment. Since *guest* libc is not involved >> > here I was a bit confused about how this could help, and wanted to >> > clarify. >> > >> >> >> Have you considered a daemon which could co-ordinate between the >> >> >> multiple processes that are sharing some state? >> >> > >> >> > Not really for the `CLONE_VM` support added in this patch series. I >> >> > have considered trying to pull tcg out of the guest process, but not >> >> > very seriously, since it seems like a pretty heavyweight approach. >> >> > Especially compared to the solution included in this series. Do you >> >> > think there's a simpler approach that involves using a daemon to do >> >> > coordination? >> >> >> >> I'm getting a little lost now. Exactly what state are we trying to share >> >> between two QEMU guests which are now in separate execution contexts? >> > >> > Since this series only deals with `clone(CLONE_VM)` we always want to >> > share guest virtual memory between the execution contexts. There is >> > also some extra state that needs to be shared depending on which flags >> > are provided to `clone()`. E.g., signal handler tables for >> > CLONE_SIGHAND, file descriptor tables for CLONE_FILES, etc. >> > >> > The problem is that since QEMU and the guest live in the same virtual >> > memory map, keeping the mappings the same between the guest parent and >> > guest child means that the mappings also stay the same between the >> > host (QEMU) parent and host child. Two hosts can live in the same >> > virtual memory map, like we do right now with threads, but *only* with >> > valid TLS for each thread/process. That's why we bend-over backwards >> > to get set-up TLS for emulation in the child process. >> >> OK thanks for that. I'd obviously misunderstood from my first read >> through. So while hiding the underlying bits of QEMU from the guest is >> relatively easy it's quite hard to hide QEMU from itself in this >> CLONE_VM case. > > Yes exactly. > >> The other approach would be to suppress CLONE_VM for the actual process >> (thereby allowing QEMU to safely have a new instance and no clashing >> shared data) but emulate CLONE_VM for the guest itself (making the guest >> portions of memory shared and visible to each other). The trouble then >> would be co-ordination of mapping operations and other things that >> should be visible in a real CLONE_VM setup. This is the sort of >> situation I envisioned a co-ordination daemon might be useful. > > Ah. This is interesting. Effectively the inverse of this patch. I had > not considered this approach. Thinking more about it, a "no shared > memory" approach does seem more straightforward implementation wise. > Unfortunately I think there would be a few substantial drawbacks: > > 1. Memory overhead. Every guest thread would need a full copy of QEMU > memory, including the translated guest binary. Sure although I suspect the overhead is not that great. For linux-user on 64 bit systems we only allocate 128Mb of translation buffer per process. What sort of size systems are you expecting to run on and how big are the binaries? > 2. Performance overhead. To keep virtual memory maps consistent across > tasks, a heavyweight 2 phase commit scheme, or similar, would be > needed for every `mmap`. That could have substantial performance > overhead for the guest. This could be a huge problem for processes > that use a large number of threads *and* do a lot of memory mapping or > frequently change page permissions. I suspect that cross-arch highly threaded apps are still in the realm of "wow, that actually works, neat :-)" for linux-user. We don't have the luxury of falling back to a single thread like we do for system emulation so things like strong-on-weak memory order bugs can still trip us up. > 3. There would be lots of similarly-fiddly bits that need to be shared > and coordinated in addition to guest memory. At least the signal > handler tables and fd_trans tables, but there are likely others I'm > missing. > > The performance drawbacks could be largely mitigated by using the > current thread-only `CLONE_VM` support, but having *any* threads in > the process at all would lead to deadlocks after fork() or similar > non-CLONE_VM clone() calls. This could be worked around with a "stop > the world" button somewhat like `start_exclusive`, but expanded to > include all emulator threads. That will substantially slow down > fork(). > > Given all this I think the approach used in this series is probably at > least as "good" as a "no shared memory" approach. It has its own > complexities and drawbacks, but doesn't have obvious performance > issues. If you or other maintainers disagree, I'd be happy to write up > an RFC comparing the approaches in more detail (or we can just use > this thread), just let me know. Until then I'll keep pursuing this > patch. I think that's fair. I'll leave it to the maintainers to chime in if they have something to add. I'd already given some comments on patch 1 and given it needs a re-spin I'll have another look on the next iteration. I will say expect the system to get some testing on multiple backends so if you can expand your testing beyond an x86_64 host please do. > >> > [1] At least on x86_64, because TLS references are defined in terms of >> > the %fs segment, which is inherited on linux. Theoretically it's up to >> > the architecture to specify how TLS is inherited across execution >> > contexts. t's possible that the child actually ends up with no valid >> > TLS rather than using the parent TLS data. But that's not really >> > relevant here. The important thing is that the child ends up with >> > *valid* TLS, not invalid or inherited TLS. >> >> >> -- >> Alex Bennée
diff --git a/linux-user/clone.c b/linux-user/clone.c index f02ae8c464..3f7344cf9e 100644 --- a/linux-user/clone.c +++ b/linux-user/clone.c @@ -12,6 +12,12 @@ #include <stdbool.h> #include <assert.h> +/* arch-specifc includes needed to fetch the TLS base offset. */ +#if defined(__x86_64__) +#include <asm/prctl.h> +#include <sys/prctl.h> +#endif + static const unsigned long NEW_STACK_SIZE = 0x40000UL; /* @@ -62,6 +68,397 @@ static void completion_finish(struct completion *c) pthread_mutex_unlock(&c->mu); } +struct tls_manager { + void *tls_ptr; + /* fetched is completed once tls_ptr has been set by the thread. */ + struct completion fetched; + /* + * spawned is completed by the user once the managed_tid + * has been spawned. + */ + struct completion spawned; + /* + * TID of the child whose memory is cleaned up upon death. This memory + * location is used as part of a futex op, and is cleared by the kernel + * since we specify CHILD_CLEARTID. + */ + int managed_tid; + /* + * The value to be `free`'d up once the janitor is ready to clean up the + * TLS section, and the managed tid has exited. + */ + void *cleanup; +}; + +/* + * tls_ptr fetches the TLS "pointer" for the current thread. This pointer + * should be whatever platform-specific address is used to represent the TLS + * base address. + */ +static void *tls_ptr() +{ + void *ptr; +#if defined(__x86_64__) + /* + * On x86_64, the TLS base is stored in the `fs` segment register, we can + * fetch it with `ARCH_GET_FS`: + */ + (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr); +#else + ptr = NULL; +#endif + return ptr; +} + +/* + * clone_vm_supported returns true if clone_vm() is supported on this + * platform. + */ +static bool clone_vm_supported() +{ +#if defined(__x86_64__) + return true; +#else + return false; +#endif +} + +static void *tls_manager_thread(void *arg) +{ + struct tls_manager *mgr = (struct tls_manager *) arg; + int child_tid, ret; + + /* + * NOTE: Do not use an TLS in this thread until after the `spawned` + * completion is finished. We need to preserve the pristine state of + * the TLS image for this thread, so it can be re-used in a separate + * process. + */ + mgr->tls_ptr = tls_ptr(); + + /* Notify tls_new that we finished fetching the TLS ptr. */ + completion_finish(&mgr->fetched); + + /* + * Wait for the user of our TLS to tell us the child using our TLS has + * been spawned. + */ + completion_await(&mgr->spawned); + + child_tid = atomic_fetch_or(&mgr->managed_tid, 0); + /* + * Check if the child has already terminated by this point. If not, wait + * for the child to exit. As long as the trampoline is not killed by + * a signal, the kernel guarantees that the memory at &mgr->managed_tid + * will be cleared, and a FUTEX_WAKE at that address will triggered. + */ + if (child_tid != 0) { + ret = syscall(SYS_futex, &mgr->managed_tid, FUTEX_WAIT, + child_tid, NULL, NULL, 0); + assert(ret == 0 && "clone manager futex should always succeed"); + } + + free(mgr->cleanup); + g_free(mgr); + + return NULL; +} + +static struct tls_manager *tls_manager_new() +{ + struct tls_manager *mgr = g_new0(struct tls_manager, 1); + sigset_t block, oldmask; + + sigfillset(&block); + if (sigprocmask(SIG_BLOCK, &block, &oldmask) != 0) { + return NULL; + } + + completion_init(&mgr->fetched); + completion_init(&mgr->spawned); + + pthread_attr_t attr; + pthread_attr_init(&attr); + pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED); + + pthread_t unused; + if (pthread_create(&unused, &attr, tls_manager_thread, (void *) mgr)) { + pthread_attr_destroy(&attr); + g_free(mgr); + return NULL; + } + pthread_attr_destroy(&attr); + completion_await(&mgr->fetched); + + if (sigprocmask(SIG_SETMASK, &oldmask, NULL) != 0) { + /* Let the thread exit, and cleanup itself. */ + completion_finish(&mgr->spawned); + return NULL; + } + + /* Once we finish awaiting, the tls_ptr will be usable. */ + return mgr; +} + +struct stack { + /* Buffer is the "base" of the stack buffer. */ + void *buffer; + /* Top is the "start" of the stack (since stack addresses "grow down"). */ + void *top; +}; + +struct info { + /* Stacks used for the trampoline and child process. */ + struct { + struct stack trampoline; + struct stack process; + } stack; + struct completion child_ready; + /* `clone` flags for the process the user asked us to make. */ + int flags; + sigset_t orig_mask; + /* + * Function to run in the ultimate child process, and payload to pass as + * the argument. + */ + int (*clone_f)(void *); + void *payload; + /* + * Result of calling `clone` for the child clone. Will be set to + * `-errno` if an error occurs. + */ + int result; +}; + +static bool stack_new(struct stack *stack) +{ + /* + * TODO: put a guard page at the bottom of the stack, so we don't + * accidentally roll off the end. + */ + if (posix_memalign(&stack->buffer, 16, NEW_STACK_SIZE)) { + return false; + } + memset(stack->buffer, 0, NEW_STACK_SIZE); + stack->top = stack->buffer + NEW_STACK_SIZE; + return true; +} + +static int clone_child(void *raw_info) +{ + struct info *info = (struct info *) raw_info; + int (*clone_f)(void *) = info->clone_f; + void *payload = info->payload; + if (!(info->flags & CLONE_VFORK)) { + /* + * If CLONE_VFORK is NOT set, then the trampoline has stalled (it + * forces VFORK), but the actual clone should return immediately. In + * this case, this thread needs to notify the parent that the new + * process is running. If CLONE_VFORK IS set, the trampoline will + * notify the parent once the normal kernel vfork completes. + */ + completion_finish(&info->child_ready); + } + if (sigprocmask(SIG_SETMASK, &info->orig_mask, NULL) != 0) { + perror("failed to restore signal mask in cloned child"); + _exit(1); + } + return clone_f(payload); +} + +static int clone_trampoline(void *raw_info) +{ + struct info *info = (struct info *) raw_info; + int flags; + + struct stack process_stack = info->stack.process; + int orig_flags = info->flags; + + if (orig_flags & CSIGNAL) { + /* + * It should be safe to call here, since we know signals are blocked + * for this process. + */ + hide_current_process_exit_signal(); + } + + /* + * Force CLONE_PARENT, so that we don't accidentally become a child of the + * trampoline thread. This kernel task should either be a child of the + * trampoline's parent (if CLONE_PARENT is not in info->flags), or a child + * of the calling process's parent (if CLONE_PARENT IS in info->flags). + * That is to say, our parent should always be the correct parent for the + * child task. + * + * Force CLONE_VFORK so that we know when the child is no longer holding + * a reference to this process's virtual memory. CLONE_VFORK just suspends + * this task until the child execs or exits, it should not affect how the + * child process is created in any way. This is the only generic way I'm + * aware of to observe *any* exit or exec. Including "abnormal" exits like + * exits via signals. + * + * Force CLONE_CHILD_SETTID, since we want to track the CHILD TID in the + * `info` structure. Capturing the child via `clone` call directly is + * slightly nicer than making a syscall in the child. Since we know we're + * doing a CLONE_VM here, we can use CLONE_CHILD_SETTID, to guarantee that + * the kernel must set the child TID before the child is run. The child + * TID should be visibile to the parent, since both parent and child share + * and address space. If the clone fails, we overwrite `info->result` + * anyways with the error code. + */ + flags = orig_flags | CLONE_PARENT | CLONE_VFORK | CLONE_CHILD_SETTID; + if (clone(clone_child, info->stack.process.top, flags, + (void *) info, NULL, NULL, &info->result) < 0) { + info->result = -errno; + completion_finish(&info->child_ready); + return 0; + } + + /* + * Clean up the child process stack, since we know the child can no longer + * reference it. + */ + free(process_stack.buffer); + + /* + * We know the process we created was CLONE_VFORK, so it registered with + * the RCU. We share a TLS image with the process, so we can unregister + * it from the RCU. Since the TLS image will be valid for at least our + * lifetime, it should be OK to leave the child processes RCU entry in + * the queue between when the child execve or exits, and the OS returns + * here from our vfork. + */ + rcu_unregister_thread(); + + /* + * If we're doing a real vfork here, we need to notify the parent that the + * vfork has happened. + */ + if (orig_flags & CLONE_VFORK) { + completion_finish(&info->child_ready); + } + + return 0; +} + +static int clone_vm(int flags, int (*callback)(void *), void *payload) +{ + struct info info; + sigset_t sigmask; + int ret; + + assert(flags & CLONE_VM && "CLONE_VM flag must be set"); + + memset(&info, 0, sizeof(info)); + info.clone_f = callback; + info.payload = payload; + info.flags = flags; + + /* + * Set up the stacks for the child processes needed to execute the clone. + */ + if (!stack_new(&info.stack.trampoline)) { + return -1; + } + if (!stack_new(&info.stack.process)) { + free(info.stack.trampoline.buffer); + return -1; + } + + /* + * tls_manager_new grants us it's ownership of the reference to the + * TLS manager, so we "leak" the data pointer, instead of using _get() + */ + struct tls_manager *mgr = tls_manager_new(); + if (mgr == NULL) { + free(info.stack.trampoline.buffer); + free(info.stack.process.buffer); + return -1; + } + + /* Manager cleans up the trampoline stack once the trampoline exits. */ + mgr->cleanup = info.stack.trampoline.buffer; + + /* + * Flags used by the trampoline in the 2-phase clone setup for children + * cloned with CLONE_VM. We want the trampoline to be essentially identical + * to its parent. This improves the performance of cloning the trampoline, + * and guarantees that the real flags are implemented correctly. + * + * CLONE_CHILD_SETTID: Make the kernel set the managed_tid for the TLS + * manager. + * + * CLONE_CHILD_CLEARTID: Make the kernel clear the managed_tid, and + * trigger a FUTEX_WAKE (received by the TLS manager), so the TLS manager + * knows when to cleanup the trampoline stack. + * + * CLONE_SETTLS: To set the trampoline TLS based on the tls manager. + */ + static const int base_trampoline_flags = ( + CLONE_FILES | CLONE_FS | CLONE_IO | CLONE_PTRACE | + CLONE_SIGHAND | CLONE_SYSVSEM | CLONE_VM + ) | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | CLONE_SETTLS; + + int trampoline_flags = base_trampoline_flags; + + /* + * To get the process hierarchy right, we set the trampoline + * CLONE_PARENT/CLONE_THREAD flag to match the child + * CLONE_PARENT/CLONE_THREAD. So add those flags if specified by the child. + */ + trampoline_flags |= (flags & CLONE_PARENT) ? CLONE_PARENT : 0; + trampoline_flags |= (flags & CLONE_THREAD) ? CLONE_THREAD : 0; + + /* + * When using CLONE_PARENT, linux always sets the exit_signal for the task + * to the exit_signal of the parent process. For our purposes, the + * trampoline process. exit_signal has special significance for calls like + * `wait`, so it needs to be set correctly. We add the signal part of the + * user flags here so the ultimate child gets the right signal. + * + * This has the unfortunate side-effect of sending the parent two exit + * signals. One when the true child exits, and one when the trampoline + * exits. To work-around this we have to capture the exit signal from the + * trampoline and supress it. + */ + trampoline_flags |= (flags & CSIGNAL); + + sigfillset(&sigmask); + if (sigprocmask(SIG_BLOCK, &sigmask, &info.orig_mask) != 0) { + free(info.stack.trampoline.buffer); + free(info.stack.process.buffer); + completion_finish(&mgr->spawned); + return -1; + } + + if (clone(clone_trampoline, + info.stack.trampoline.top, trampoline_flags, &info, + NULL, mgr->tls_ptr, &mgr->managed_tid) < 0) { + free(info.stack.trampoline.buffer); + free(info.stack.process.buffer); + completion_finish(&mgr->spawned); + return -1; + } + + completion_await(&info.child_ready); + completion_finish(&mgr->spawned); + + ret = sigprocmask(SIG_SETMASK, &info.orig_mask, NULL); + /* + * If our final sigproc mask doesn't work, we're pretty screwed. We may + * have started the final child now, and there's no going back. If this + * ever happens, just crash. + */ + assert(!ret && "sigprocmask after clone needs to succeed"); + + /* If we have an error result, then set errno as needed. */ + if (info.result < 0) { + errno = -info.result; + return -1; + } + return info.result; +} + struct clone_thread_info { struct completion running; int tid; @@ -120,6 +517,17 @@ int qemu_clone(int flags, int (*callback)(void *), void *payload) { int ret; + /* + * Backwards Compatibility: Remove once all target platforms support + * clone_vm. Previously, we implemented vfork() via a fork() call, + * preserve that behavior instead of failing. + */ + if (!clone_vm_supported()) { + if (flags & CLONE_VFORK) { + flags &= ~(CLONE_VFORK | CLONE_VM); + } + } + if (clone_flags_are_thread(flags)) { /* * The new process uses the same flags as pthread_create, so we can @@ -146,7 +554,12 @@ int qemu_clone(int flags, int (*callback)(void *), void *payload) return ret; } - /* !fork && !thread */ + if (clone_vm_supported() && (flags & CLONE_VM)) { + return clone_vm(flags, callback, payload); + } + + /* !fork && !thread && !CLONE_VM. This form is unsupported. */ + errno = EINVAL; return -1; } diff --git a/linux-user/qemu.h b/linux-user/qemu.h index 54bf4f47be..e29912466c 100644 --- a/linux-user/qemu.h +++ b/linux-user/qemu.h @@ -94,6 +94,7 @@ struct vm86_saved_state { struct emulated_sigtable { int pending; /* true if signal is pending */ + pid_t exit_pid; /* non-zero host pid, if a process is exiting. */ target_siginfo_t info; }; @@ -183,6 +184,15 @@ typedef struct TaskState { * least TARGET_NSIG entries */ struct target_sigaction *sigact_tbl; + + /* + * Set to true if the process asssociated with this task state was cloned. + * This is needed to disambiguate cloned processes from threads. If + * CLONE_VM is used, a pthread_exit(..) will free the stack/TLS of the + * trampoline thread, and the trampoline will be unable to conduct its + * cleanup. + */ + bool is_cloned; } __attribute__((aligned(16))) TaskState; extern char *exec_path; @@ -442,6 +452,13 @@ abi_long do_sigaltstack(abi_ulong uss_addr, abi_ulong uoss_addr, abi_ulong sp); int do_sigprocmask(int how, const sigset_t *set, sigset_t *oldset); abi_long do_swapcontext(CPUArchState *env, abi_ulong uold_ctx, abi_ulong unew_ctx, abi_long ctx_size); + +/* + * Register the current process as a "hidden" process. Exit signals generated + * by this process should not be delivered to the guest. + */ +void hide_current_process_exit_signal(void); + /** * block_signals: block all signals while handling this guest syscall * diff --git a/linux-user/signal.c b/linux-user/signal.c index dc98def6d1..a7f0612b64 100644 --- a/linux-user/signal.c +++ b/linux-user/signal.c @@ -36,6 +36,21 @@ typedef struct target_sigaction sigact_table[TARGET_NSIG]; static void host_signal_handler(int host_signum, siginfo_t *info, void *puc); +/* + * This table, initilized in signal_init, is used to track "hidden" processes + * for which exit signals should not be delivered. The PIDs of the processes + * hidden processes are stored as keys. Values are always set to NULL. + * + * Note: Process IDs stored in this table may "leak" (i.e., never be removed + * from the table) if the guest blocks (SIG_IGN) the exit signal for the child + * it spawned. There is a small risk, that this PID could later be reused + * by an alternate child process, and the child exit would be hidden. This is + * an unusual case that is unlikely to happen, but it is possible. + */ +static GHashTable *hidden_processes; + +/* this lock guards access to the `hidden_processes` table. */ +static pthread_mutex_t hidden_processes_lock = PTHREAD_MUTEX_INITIALIZER; /* * System includes define _NSIG as SIGRTMAX + 1, @@ -564,6 +579,9 @@ void signal_init(void) /* initialize signal conversion tables */ signal_table_init(); + /* initialize the hidden process table. */ + hidden_processes = g_hash_table_new(g_direct_hash, g_direct_equal); + /* Set the signal mask from the host mask. */ sigprocmask(0, 0, &ts->signal_mask); @@ -749,6 +767,10 @@ static void host_signal_handler(int host_signum, siginfo_t *info, k = &ts->sigtab[sig - 1]; k->info = tinfo; k->pending = sig; + k->exit_pid = 0; + if (info->si_code & (CLD_DUMPED | CLD_KILLED | CLD_EXITED)) { + k->exit_pid = info->si_pid; + } ts->signal_pending = 1; /* Block host signals until target signal handler entered. We @@ -930,6 +952,17 @@ int do_sigaction(int sig, const struct target_sigaction *act, return ret; } +void hide_current_process_exit_signal(void) +{ + pid_t pid = getpid(); + + pthread_mutex_lock(&hidden_processes_lock); + + (void)g_hash_table_insert(hidden_processes, GINT_TO_POINTER(pid), NULL); + + pthread_mutex_unlock(&hidden_processes_lock); +} + static void handle_pending_signal(CPUArchState *cpu_env, int sig, struct emulated_sigtable *k) { @@ -944,6 +977,22 @@ static void handle_pending_signal(CPUArchState *cpu_env, int sig, /* dequeue signal */ k->pending = 0; + if (k->exit_pid) { + pthread_mutex_lock(&hidden_processes_lock); + /* + * If the exit signal is for a hidden PID, then just drop it, and + * remove the hidden process from the list, since we know it has + * exited. + */ + if (g_hash_table_contains(hidden_processes, + GINT_TO_POINTER(k->exit_pid))) { + g_hash_table_remove(hidden_processes, GINT_TO_POINTER(k->exit_pid)); + pthread_mutex_unlock(&hidden_processes_lock); + return; + } + pthread_mutex_unlock(&hidden_processes_lock); + } + sig = gdb_handlesig(cpu, sig); if (!sig) { sa = NULL; diff --git a/linux-user/syscall.c b/linux-user/syscall.c index 838caf9c98..20cf5d5464 100644 --- a/linux-user/syscall.c +++ b/linux-user/syscall.c @@ -139,10 +139,9 @@ /* These flags are ignored: * CLONE_DETACHED is now ignored by the kernel; - * CLONE_IO is just an optimisation hint to the I/O scheduler */ #define CLONE_IGNORED_FLAGS \ - (CLONE_DETACHED | CLONE_IO) + (CLONE_DETACHED) /* Flags for fork which we can implement within QEMU itself */ #define CLONE_EMULATED_FLAGS \ @@ -5978,14 +5977,31 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp, } proc_flags = (proc_flags & ~CSIGNAL) | host_sig; - /* Emulate vfork() with fork() */ - if (proc_flags & CLONE_VFORK) { - proc_flags &= ~(CLONE_VFORK | CLONE_VM); + + if (!clone_flags_are_fork(proc_flags) && !(flags & CLONE_VM)) { + /* + * If the user is doing a non-CLONE_VM clone, which cannot be emulated + * with fork, we can't guarantee that we can emulate this correctly. + * It should work OK as long as there are no threads in parent process, + * so we hide it behind a flag if the user knows what they're doing. + */ + qemu_log_mask(LOG_UNIMP, + "Refusing non-fork/thread clone without CLONE_VM."); + return -TARGET_EINVAL; } - if (!clone_flags_are_fork(proc_flags) && - !clone_flags_are_thread(proc_flags)) { - qemu_log_mask(LOG_UNIMP, "unsupported clone flags"); + if ((flags & CLONE_FILES) && !(flags & CLONE_VM)) { + /* + * This flag combination is currently unsupported. QEMU needs to update + * the fd_trans_table as new file descriptors are opened. This is easy + * when CLONE_VM is set, because the fd_trans_table is shared between + * the parent and child. Without CLONE_VM the fd_trans_table will need + * to be share specially using shared memory mappings, or a + * consistentcy protocol between the child and the parent. + * + * For now, just return EINVAL in this case. + */ + qemu_log_mask(LOG_UNIMP, "CLONE_FILES only supported with CLONE_VM"); return -TARGET_EINVAL; } @@ -6042,6 +6058,10 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp, ts->sigact_tbl = sigact_table_clone(parent_ts->sigact_tbl); } + if (!clone_flags_are_thread(proc_flags)) { + ts->is_cloned = true; + } + if (flags & CLONE_CHILD_CLEARTID) { ts->child_tidptr = child_tidptr; } @@ -6063,10 +6083,8 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp, tb_flush(cpu); } - if (proc_flags & CLONE_VM) { - info.child.register_thread = true; - info.child.signal_setup = true; - } + info.child.signal_setup = (flags & CLONE_VM) && !(flags & CLONE_VFORK); + info.child.register_thread = !!(flags & CLONE_VM); /* * It is not safe to deliver signals until the child has finished @@ -6078,7 +6096,7 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp, ret = get_errno(qemu_clone(proc_flags, clone_run, (void *) &info)); - if (ret >= 0 && (proc_flags & CLONE_VM)) { + if (ret >= 0 && (flags & CLONE_VM) && !(flags & CLONE_VFORK)) { /* * Wait for the child to finish setup if the child is running in the * same VM. @@ -6092,7 +6110,7 @@ static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp, pthread_cond_destroy(&info.cond); pthread_mutex_destroy(&info.mutex); - if (ret >= 0 && !(proc_flags & CLONE_VM)) { + if (ret >= 0 && !(flags & CLONE_VM)) { /* * If !CLONE_VM, then we need to set parent_tidptr, since the child * won't set it for us. Should always be safe to set it here anyways. @@ -7662,6 +7680,7 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1, switch(num) { case TARGET_NR_exit: { + bool do_pthread_exit = false; /* In old applications this may be used to implement _exit(2). However in threaded applictions it is used for thread termination, and _exit_group is used for application termination. @@ -7692,10 +7711,20 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1, NULL, NULL, 0); } + /* + * Need this multi-step process so we can free ts before calling + * pthread_exit. + */ + if (!ts->is_cloned) { + do_pthread_exit = true; + } + thread_cpu = NULL; g_free(ts); - rcu_unregister_thread(); - pthread_exit(NULL); + if (do_pthread_exit) { + rcu_unregister_thread(); + pthread_exit(NULL); + } } pthread_mutex_unlock(&clone_lock); @@ -9700,6 +9729,14 @@ static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1, #ifdef __NR_exit_group /* new thread calls */ case TARGET_NR_exit_group: { + /* + * TODO: We need to clean up CPUs (like is done for exit(2)) + * for all threads in this process when exit_group is called, at least + * for tasks that have been cloned. Could also be done in + * clone_trampoline/tls_mgr. Since this cleanup is non-trival (need to + * coordinate it across threads. Right now it seems to be fine without + * the cleanup, so just leaving a note. + */ preexit_cleanup(cpu_env, arg1); return get_errno(exit_group(arg1)); } diff --git a/tests/tcg/multiarch/linux-test.c b/tests/tcg/multiarch/linux-test.c index 8a7c15cd31..a7723556c2 100644 --- a/tests/tcg/multiarch/linux-test.c +++ b/tests/tcg/multiarch/linux-test.c @@ -407,14 +407,13 @@ static void test_clone(void) stack1 = malloc(STACK_SIZE); pid1 = chk_error(clone(thread1_func, stack1 + STACK_SIZE, - CLONE_VM | CLONE_FS | CLONE_FILES | - CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM, + CLONE_VM | SIGCHLD, "hello1")); stack2 = malloc(STACK_SIZE); pid2 = chk_error(clone(thread2_func, stack2 + STACK_SIZE, CLONE_VM | CLONE_FS | CLONE_FILES | - CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM, + CLONE_SIGHAND | CLONE_SYSVSEM | SIGCHLD, "hello2")); wait_for_child(pid1); @@ -517,6 +516,61 @@ static void test_shm(void) chk_error(shmdt(ptr)); } +static volatile sig_atomic_t test_clone_signal_count_handler_calls; + +static void test_clone_signal_count_handler(int sig) +{ + test_clone_signal_count_handler_calls++; +} + +/* A clone function that does nothing and exits successfully. */ +static int successful_func(void *arg __attribute__((unused))) +{ + return 0; +} + +/* + * With our clone implementation it's possible that we could generate too many + * child exit signals. Make sure only the single expected child-exit signal is + * generated. + */ +static void test_clone_signal_count(void) +{ + uint8_t *child_stack; + struct sigaction prev, test; + int status; + pid_t pid; + + memset(&test, 0, sizeof(test)); + test.sa_handler = test_clone_signal_count_handler; + test.sa_flags = SA_RESTART; + + /* Use real-time signals, so every signal event gets delivered. */ + chk_error(sigaction(SIGRTMIN, &test, &prev)); + + child_stack = malloc(STACK_SIZE); + pid = chk_error(clone( + successful_func, + child_stack + STACK_SIZE, + CLONE_VM | SIGRTMIN, + NULL + )); + + /* + * Need to use __WCLONE here because we are not using SIGCHLD as the + * exit_signal. By default linux only waits for children spawned with + * SIGCHLD. + */ + chk_error(waitpid(pid, &status, __WCLONE)); + + chk_error(sigaction(SIGRTMIN, &prev, NULL)); + + if (test_clone_signal_count_handler_calls != 1) { + error("expected to receive exactly 1 signal, received %d signals", + test_clone_signal_count_handler_calls); + } +} + int main(int argc, char **argv) { test_file(); @@ -524,11 +578,8 @@ int main(int argc, char **argv) test_fork(); test_time(); test_socket(); - - if (argc > 1) { - printf("test_clone still considered buggy\n"); - test_clone(); - } + test_clone(); + test_clone_signal_count(); test_signal(); test_shm();
The `clone` system call can be used to create new processes that share attributes with their parents, such as virtual memory, file system location, file descriptor tables, etc. These can be useful to a variety of guest programs. Before this patch, QEMU had support for a limited set of these attributes. Basically the ones needed for threads, and the options used by fork. This change adds support for all flag combinations involving CLONE_VM. In theory, almost all clone options could be supported, but invocations not using CLONE_VM are likely to run afoul of linux-user's inherently multi-threaded design. To add this support, this patch updates the `qemu_clone` helper. An overview of the mechanism used to support general `clone` options with CLONE_VM is described below. This patch also enables by-default the `clone` unit-tests in tests/tcg/multiarch/linux-test.c, and adds an additional test for duplicate exit signals, based on a bug found during development. !! Overview Adding support for CLONE_VM is tricky. The parent and guest process will share an address space (similar to threads), so the emulator must coordinate between the parent and the child. Currently, QEMU relies heavily on Thread Local Storage (TLS) as part of this coordination strategy. For threads, this works fine, because libc manages the thread-local data region used for TLS, when we create new threads using `pthread_create`. Ideally we could use the same mechanism for "process-local storage" needed to allow the parent/child processes to emulate in tandem. Unfortunately TLS is tightly integrated into libc. The only way to create TLS data regions is via the `pthread_create` API which also spawns a new thread (rather than a new processes, which is what we want). Worse still, TLS itself is a complicated arch-specific feature that is tightly integrated into the rest of libc and the dynamic linker. Re-implementing TLS support for QEMU would likely require a special dynamic linker / libc. Alternatively, the popular libcs could be extended, to allow for users to create TLS regions without creating threads. Even if major libcs decide to add this support, QEMU will still need a temporary work around until those libcs are widely deployed. It's also unclear if libcs will be interested in supporting this case, since TLS image creation is generally deeply integrated with thread setup. In this patch, I've employed an alternative approach: spawning a thread an "stealing" its TLS image for use in the child process. This approach leaves a dangling thread while the TLS image is in use, but by design that thread will not become schedulable until after the TLS data is no longer in-use by the child (as described in a moment). Therefore, it should cause relatively minimal overhead. When considered in the larger context, this seems like a reasonable tradeoff. A major complication of this approach knowing when it is safe to clean up the stack, and TLS image, used by a child process. When a child is created with `CLONE_VM` its stack, and TLS data, need to remain valid until that child has either exited, or successfully called `execve` (on `execve` the child is given a new VMM by the kernel). One approach would be to use `waitid(WNOWAIT)` (the `WNOWAIT` allows the guest to reap the child). The problem is that the `wait` family of calls only waits for termination. The pattern of `clone() ... execve()` for long running child processes is pretty common. If we waited for child processes to exit, it's likely we would end up using substantially more memory, and keep the suspended TLS thread around much longer than necessary. Instead, in this patch, I've used an "trampoline" process. The real parent first clones a trampoline, the trampoline then clones the ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends the trampoline process until the child has exited, or called `execve`. Once the trampoline is re-scheduled, we know it is safe to clean up after the child. This creates one more suspended process, but typically, the trampoline only exists for a short period of time. !! CLONE_VM setup, step by step 1. First, the suspended thread whose TLS we will use is created using `pthread_create`. The thread fetches and returns it's "TLS pointer" (an arch-specific value given to the kernel) to the parent. It then blocks on a lock to prevent its TLS data from being cleaned up. Ultimately the lock will be unlocked by the trampoline once the child exits. 2. Once the TLS thread has fetched the TLS pointer, it notifies the real parent thread, which calls `clone()` to create the trampoline process. For ease of implementation, the TLS image is set for the trampoline process during this step. This allows the trampoline to use functions that require TLS if needed (e.g., printf). TLS location is inherited when a new child is spawned, so this TLS data will automatically be inherited by the child. 3. Once the trampoline has been spawned, it registers itself as a "hidden" process with the signal subsystem. This prevents the exit signal from the trampoline from ever being forwarded to the guest. This is needed due to the way that Linux sets the exit signal for the ultimate child when `CLONE_PARENT` is set. See the source for details. 4. Once setup is complete, the trampoline spawns the final child with the original clone flags, plus `CLONE_PARENT`, so the child is correctly parented to the kernel task on which the guest invoked `clone`. Without this, kernel features like PDEATHSIG, and subreapers, would not work properly. As previously discussed, the trampoline also supplies `CLONE_VFORK` so that it is suspended until the child can be cleaned up. 5. Once the child is spawned, it signals the original parent thread that it is running. At this point, the trampoline process is suspended (due to CLONE_VFORK). 6. Finally, the call to `qemu_clone` in the parent is finished, the child begins executing the given callback function in the new child process. !! Cleaning up Clean up itself is a multi-step process. Once the child exits, or is killed by a signal (cleanup is the same in both cases), the trampoline process becomes schedulable. When the trampoline is scheduled, it frees the child stack, and unblocks the suspended TLS thread. This cleans up the child resources, but not the stack used by the trampoline itself. It is possible for a process to clean up its own stack, but it is tricky, and architecture-specific. Instead we leverage the TLS manager thread to clean up the trampoline stack. When the trampoline is cloned (in step 2 above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID` flags. The target location for the SET/CLEAR TID is set to a special field known by the TLS manager. Then, when the TLS manager thread is unsuspended, it performs an additional `FUTEX_WAIT` on this location. That blocks the TLS manager thread until the trampoline has fully exited, then the TLS manager thread frees the trampoline process's stack, before exiting itself. !! Shortcomings of this patch * It's complicated. * It doesn't support any clone options when CLONE_VM is omitted. * It doesn't properly clean up the CPU queue when the child process terminates, or calls execve(). * RCU unregistration is done in the trampoline process (in clone.c), but registration happens in syscall.c This should be made more explicit. * The TLS image, and trampoline stack are not cleaned up if the parent calls `execve` or `exit_group` before the child does. This is because those cleanup tasks are handled by the TLS manager thread. The TLS manager thread is in the same thread group as the parent, so it will be terminated if the parent exits or calls `execve`. !! Alternatives considered * Non-standard libc extension to allow creating TLS images independent of threads. This would allow us to just `clone` the child directly instead of this complicated maneuver. Though we probably would still need the cleanup logic. For libcs, TLS image allocation is tightly connected to thread stack allocation, which is also arch-specific. I do not have enough experience with libc development to know if maintainers of any popular libcs would be open to supporting such an API. Additionally, since it will probably take years before a libc fix would be widely deployed, we need an interim solution anyways. * Non-standard, Linux-only, libc extension to allow us to specify the CLONE_* flags used by `pthread_create`. The processes we are creating are basically threads in a different thread group. If we could alter the flags used, this whole processes could become a `pthread_create.` The problem with this approach is that I don't know what requirements pthreads has on threads to ensure they function properly. I suspect that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to cleanup detached thread state. Since we don't control the child exit reason (Linux only handles CHILD_CLEARTID on normal, non-signal process termination), we probably can't use this same tracking mechanism. * Other mechanisms for detecting child exit so cleanup can happen besides CLONE_VFORK: * waitid(WNOWAIT): This can only detect exit, not execve. * file descriptors with close on exec set: This cannot detect children cloned with CLONE_FILES. * System V semaphore adjustments: Cannot detect children cloned with CLONE_SYSVSEM. * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally terminated children. * Doing the child clone directly in the TLS manager thread: This saves the need for the trampoline process, but it causes the child process to be parented to the wrong kernel task (the TLS thread instead of the Main thread) breaking things like PDEATHSIG. Signed-off-by: Josh Kunz <jkz@google.com> --- linux-user/clone.c | 415 ++++++++++++++++++++++++++++++- linux-user/qemu.h | 17 ++ linux-user/signal.c | 49 ++++ linux-user/syscall.c | 69 +++-- tests/tcg/multiarch/linux-test.c | 67 ++++- 5 files changed, 592 insertions(+), 25 deletions(-)