
[4/5] linux-user: Support CLONE_VM and extended clone options

Message ID 20200612014606.147691-5-jkz@google.com
State New
Series linux-user: Support extended clone(CLONE_VM)

Commit Message

Josh Kunz June 12, 2020, 1:46 a.m. UTC
The `clone` system call can be used to create new processes that share
attributes with their parents, such as virtual memory, file
system location, file descriptor tables, etc. These can be useful to a
variety of guest programs.

Before this patch, QEMU supported only a limited set of these attributes:
basically the ones needed for threads, and the options used by fork.
This change adds support for all flag combinations involving CLONE_VM.
In theory, almost all clone options could be supported, but invocations
not using CLONE_VM are likely to run afoul of linux-user's inherently
multi-threaded design.

To add this support, this patch updates the `qemu_clone` helper. An
overview of the mechanism used to support general `clone` options with
CLONE_VM is described below.

This patch also enables the `clone` unit tests in
tests/tcg/multiarch/linux-test.c by default, and adds an additional test for
duplicate exit signals, based on a bug found during development.

!! Overview

Adding support for CLONE_VM is tricky. The parent and child processes will
share an address space (similar to threads), so the emulator must
coordinate between the parent and the child. Currently, QEMU relies
heavily on Thread Local Storage (TLS) as part of this coordination
strategy. For threads, this works fine, because libc manages the
thread-local data region used for TLS when we create new threads using
`pthread_create`. Ideally we could use the same mechanism for the
"process-local storage" needed to allow the parent and child processes to
emulate in tandem. Unfortunately, TLS is tightly integrated into libc.
The only way to create TLS data regions is via the `pthread_create` API,
which also spawns a new thread (rather than a new process, which is
what we want). Worse still, TLS itself is a complicated arch-specific
feature that is tightly integrated into the rest of libc and the dynamic
linker. Re-implementing TLS support for QEMU would likely require a
special dynamic linker / libc. Alternatively, the popular libcs could be
extended to allow users to create TLS regions without creating
threads. Even if major libcs decide to add this support, QEMU will still
need a temporary workaround until those libcs are widely deployed. It's
also unclear if libcs will be interested in supporting this case, since
TLS image creation is generally deeply integrated with thread setup.
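
To make the shared-TLS problem concrete, here is a small standalone sketch
(not part of this patch; x86_64 + glibc behaviour assumed): a child created
with a plain `clone(CLONE_VM)` inherits the parent's TLS pointer, so its
"thread-local" accesses alias the parent's storage.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static __thread int tls_counter;        /* nominally per-thread state */

    static int child_fn(void *arg)
    {
        (void)arg;
        tls_counter = 42;   /* same TLS base + shared VM: clobbers the parent's slot */
        return 0;
    }

    int main(void)
    {
        char *stack = malloc(64 * 1024);
        if (!stack) {
            return 1;
        }
        tls_counter = 1;
        pid_t pid = clone(child_fn, stack + 64 * 1024, CLONE_VM | SIGCHLD, NULL);
        if (pid < 0) {
            perror("clone");
            return 1;
        }
        waitpid(pid, NULL, 0);
        printf("parent's tls_counter is now %d\n", tls_counter);   /* prints 42 */
        free(stack);
        return 0;
    }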

In this patch, I've employed an alternative approach: spawning a thread
and "stealing" its TLS image for use in the child process. This approach
leaves a dangling thread while the TLS image is in use, but by design
that thread will not become schedulable until after the TLS data is no
longer in use by the child (as described in a moment). Therefore, it
should cause relatively minimal overhead. When considered in the larger
context, this seems like a reasonable tradeoff.

A major complication of this approach is knowing when it is safe to clean
up the stack and TLS image used by a child process. When a child is
created with `CLONE_VM`, its stack and TLS data need to remain valid
until that child has either exited or successfully called `execve` (on
`execve` the child is given a new virtual memory map by the kernel). One
approach would be to use `waitid(WNOWAIT)` (the `WNOWAIT` allows the guest
to reap the child). The problem is that the `wait` family of calls only
waits for termination. The pattern of `clone() ... execve()` for
long-running child processes is pretty common. If we waited for child
processes to exit, it's likely we would end up using substantially more
memory, and keep the suspended TLS thread around much longer than
necessary. Instead, in this patch, I've used a "trampoline" process. The
real parent first clones a trampoline, and the trampoline then clones the
ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends
the trampoline process until the child has exited or called `execve`.
Once the trampoline is re-scheduled, we know it is safe to clean up
after the child. This creates one more suspended process, but typically
the trampoline only exists for a short period of time.
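
The key property the trampoline relies on can be shown with a standalone
Linux sketch (again, not QEMU code): a parent that clones with
`CLONE_VM | CLONE_VFORK` is suspended until the child exits or calls
`execve`, at which point the child can no longer reference the stack (or
TLS image) it was lent.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int child_fn(void *arg)
    {
        (void)arg;
        execlp("true", "true", (char *)NULL);   /* exec gives the child a fresh address space */
        _exit(127);                             /* only reached if the exec fails */
    }

    int main(void)
    {
        char *stack = malloc(64 * 1024);
        if (!stack) {
            return 1;
        }
        pid_t pid = clone(child_fn, stack + 64 * 1024,
                          CLONE_VM | CLONE_VFORK | SIGCHLD, NULL);
        if (pid < 0) {
            perror("clone");
            return 1;
        }
        /*
         * clone() only returns here once the child has exec'd or exited,
         * so the stack we handed it is safe to reclaim.
         */
        free(stack);
        waitpid(pid, NULL, 0);
        return 0;
    }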

!! CLONE_VM setup, step by step

1. First, the suspended thread whose TLS we will use is created using
   `pthread_create`. The thread fetches and returns its "TLS pointer"
   (an arch-specific value given to the kernel) to the parent. It then
   blocks on a lock to prevent its TLS data from being cleaned up.
   Ultimately the lock will be unlocked by the trampoline once the child
   exits.
2. Once the TLS thread has fetched the TLS pointer, it notifies the real
   parent thread, which calls `clone()` to create the trampoline
   process. For ease of implementation, the TLS image is set for the
   trampoline process during this step. This allows the trampoline to
   use functions that require TLS if needed (e.g., printf). TLS location
   is inherited when a new child is spawned, so this TLS data will
   automatically be inherited by the child.
3. Once the trampoline has been spawned, it registers itself as a
   "hidden" process with the signal subsystem. This prevents the exit
   signal from the trampoline from ever being forwarded to the guest.
   This is needed due to the way that Linux sets the exit signal for the
   ultimate child when `CLONE_PARENT` is set. See the source for
   details.
4. Once setup is complete, the trampoline spawns the final child with
   the original clone flags, plus `CLONE_PARENT`, so the child is
   correctly parented to the kernel task on which the guest invoked
   `clone`. Without this, kernel features like PDEATHSIG, and
   subreapers, would not work properly. As previously discussed, the
   trampoline also supplies `CLONE_VFORK` so that it is suspended until
   the child can be cleaned up.
5. Once the child is spawned, it signals the original parent thread that
   it is running. At this point, the trampoline process is suspended
   (due to CLONE_VFORK).
6. Finally, the call to `qemu_clone` in the parent is finished, the
   child begins executing the given callback function in the new child
   process.

!! Cleaning up

Clean up itself is a multi-step process. Once the child exits, or is
killed by a signal (cleanup is the same in both cases), the trampoline
process becomes schedulable. When the trampoline is scheduled, it frees
the child stack, and unblocks the suspended TLS thread. This cleans up
the child resources, but not the stack used by the trampoline itself. It
is possible for a process to clean up its own stack, but it is tricky,
and architecture-specific. Instead we leverage the TLS manager thread to
clean up the trampoline stack. When the trampoline is cloned (in step 2
above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID`
flags. The target location for the SET/CLEAR TID is set to a special field
known by the TLS manager. Then, when the TLS manager thread is unsuspended,
it performs an additional `FUTEX_WAIT` on this location. That blocks the
TLS manager thread until the trampoline has fully exited; the TLS
manager thread then frees the trampoline process's stack before exiting
itself.

!! Shortcomings of this patch

* It's complicated.
* It doesn't support any clone options when CLONE_VM is omitted.
* It doesn't properly clean up the CPU queue when the child process
  terminates, or calls execve().
* RCU unregistration is done in the trampoline process (in clone.c), but
  registration happens in syscall.c. This should be made more explicit.
* The TLS image, and trampoline stack are not cleaned up if the parent
  calls `execve` or `exit_group` before the child does. This is because
  those cleanup tasks are handled by the TLS manager thread. The TLS
  manager thread is in the same thread group as the parent, so it will
  be terminated if the parent exits or calls `execve`.

!! Alternatives considered

* Non-standard libc extension to allow creating TLS images independent
  of threads. This would allow us to just `clone` the child directly
  instead of this complicated maneuver. Though we probably would still
  need the cleanup logic. For libcs, TLS image allocation is tightly
  connected to thread stack allocation, which is also arch-specific. I
  do not have enough experience with libc development to know if
  maintainers of any popular libcs would be open to supporting such an
  API. Additionally, since it will probably take years before a libc
  fix would be widely deployed, we need an interim solution anyways.
* Non-standard, Linux-only, libc extension to allow us to specify the
  CLONE_* flags used by `pthread_create`. The processes we are creating
  are basically threads in a different thread group. If we could alter
  the flags used, this whole process could become a `pthread_create()`.
  The problem with this approach is that I don't know what requirements
  pthreads has on threads to ensure they function properly. I suspect
  that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to clean up detached
  thread state. Since we don't control the child exit reason (Linux only
  handles CHILD_CLEARTID on normal, non-signal process termination), we
  probably can't use this same tracking mechanism.
* Other mechanisms for detecting child exit so cleanup can happen
  besides CLONE_VFORK:
  * waitid(WNOWAIT): This can only detect exit, not execve.
  * file descriptors with close on exec set: This cannot detect children
    cloned with CLONE_FILES.
  * System V semaphore adjustments: Cannot detect children cloned with
    CLONE_SYSVSEM.
  * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally
    terminated children.
* Doing the child clone directly in the TLS manager thread: This avoids
  the need for the trampoline process, but it causes the child process to
  be parented to the wrong kernel task (the TLS thread instead of the main
  thread), breaking things like PDEATHSIG.

Signed-off-by: Josh Kunz <jkz@google.com>
---
 linux-user/clone.c               | 415 ++++++++++++++++++++++++++++++-
 linux-user/qemu.h                |  17 ++
 linux-user/signal.c              |  49 ++++
 linux-user/syscall.c             |  69 +++--
 tests/tcg/multiarch/linux-test.c |  67 ++++-
 5 files changed, 592 insertions(+), 25 deletions(-)

Comments

Josh Kunz June 13, 2020, 12:10 a.m. UTC | #1
> +    child_tid = atomic_fetch_or(&mgr->managed_tid, 0);
> +    /*
> +     * Check if the child has already terminated by this point. If not, wait
> +     * for the child to exit. As long as the trampoline is not killed by
> +     * a signal, the kernel guarantees that the memory at &mgr->managed_tid
> +     * will be cleared, and a FUTEX_WAKE at that address will triggered.
> +     */
> +    if (child_tid != 0) {
> +        ret = syscall(SYS_futex, &mgr->managed_tid, FUTEX_WAIT,
> +                      child_tid, NULL, NULL, 0);
> +        assert(ret == 0 && "clone manager futex should always succeed");
> +    }

A note for any reviewers/maintainers: While doing some additional
testing today, I discovered there is a bug in this section of the
patch. The child process can exit between the `atomic_fetch` and start
of the `futex(FUTEX_WAIT)` call, causing the kernel to respond with an
`EAGAIN` error, which will be caught by the assert and crash the
program. I have a patch for this. I suspect there will be comments on
this change, so I'm holding off on re-sending the series until initial
reviews have been done. I just wanted to make maintainers aware to
avoid the possibility of this bug being merged in the (very) unlikely
case there are no comments.
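
For reference, one way to tolerate that race (a sketch only; the actual fix
may look different) is to treat EAGAIN from the FUTEX_WAIT as meaning the
trampoline has already exited:

    child_tid = atomic_fetch_or(&mgr->managed_tid, 0);
    if (child_tid != 0) {
        ret = syscall(SYS_futex, &mgr->managed_tid, FUTEX_WAIT,
                      child_tid, NULL, NULL, 0);
        /*
         * EAGAIN means the kernel cleared managed_tid between the atomic
         * read above and the FUTEX_WAIT, i.e. the trampoline has already
         * exited, which is exactly the condition being waited for.
         */
        assert((ret == 0 || errno == EAGAIN) &&
               "clone manager futex should always succeed");
    }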
Alex Bennée June 16, 2020, 4:08 p.m. UTC | #2
Josh Kunz <jkz@google.com> writes:

> The `clone` system call can be used to create new processes that share
> attributes with their parents, such as virtual memory, file
> system location, file descriptor tables, etc. These can be useful to a
> variety of guest programs.
>
> Before this patch, QEMU had support for a limited set of these attributes.
> Basically the ones needed for threads, and the options used by fork.
> This change adds support for all flag combinations involving CLONE_VM.
> In theory, almost all clone options could be supported, but invocations
> not using CLONE_VM are likely to run afoul of linux-user's inherently
> multi-threaded design.
>
> To add this support, this patch updates the `qemu_clone` helper. An
> overview of the mechanism used to support general `clone` options with
> CLONE_VM is described below.
>
> This patch also enables by-default the `clone` unit-tests in
> tests/tcg/multiarch/linux-test.c, and adds an additional test for duplicate
> exit signals, based on a bug found during development.

Which by the way fail on some targets:

    TEST    linux-test on alpha
  /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death
  make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
  make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2
  make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2

Have you managed a clean check-tcg with docker enabled so all the guest
architectures get tested?

>
> !! Overview
>
> Adding support for CLONE_VM is tricky. The parent and guest process will
> share an address space (similar to threads), so the emulator must
> coordinate between the parent and the child. Currently, QEMU relies
> heavily on Thread Local Storage (TLS) as part of this coordination
> strategy. For threads, this works fine, because libc manages the
> thread-local data region used for TLS, when we create new threads using
> `pthread_create`. Ideally we could use the same mechanism for
> "process-local storage" needed to allow the parent/child processes to
> emulate in tandem. Unfortunately TLS is tightly integrated into libc.
> The only way to create TLS data regions is via the `pthread_create` API
> which also spawns a new thread (rather than a new processes, which is
> what we want). Worse still, TLS itself is a complicated arch-specific
> feature that is tightly integrated into the rest of libc and the dynamic
> linker. Re-implementing TLS support for QEMU would likely require a
> special dynamic linker / libc. Alternatively, the popular libcs could be
> extended, to allow for users to create TLS regions without creating
> threads. Even if major libcs decide to add this support, QEMU will still
> need a temporary work around until those libcs are widely deployed. It's
> also unclear if libcs will be interested in supporting this case, since
> TLS image creation is generally deeply integrated with thread setup.
>
> In this patch, I've employed an alternative approach: spawning a thread
> an "stealing" its TLS image for use in the child process. This approach
> leaves a dangling thread while the TLS image is in use, but by design
> that thread will not become schedulable until after the TLS data is no
> longer in-use by the child (as described in a moment). Therefore, it
> should cause relatively minimal overhead. When considered in the larger
> context, this seems like a reasonable tradeoff.

*sharp intake of breath*

OK so the solution to the complexity of handling threads is to add more
threads? cool cool cool....

>
> A major complication of this approach knowing when it is safe to clean up
> the stack, and TLS image, used by a child process. When a child is
> created with `CLONE_VM` its stack, and TLS data, need to remain valid
> until that child has either exited, or successfully called `execve` (on
> `execve` the child is given a new VMM by the kernel). One approach would
> be to use `waitid(WNOWAIT)` (the `WNOWAIT` allows the guest to reap the
> child). The problem is that the `wait` family of calls only waits for
> termination. The pattern of `clone() ... execve()` for long running
> child processes is pretty common. If we waited for child processes to
> exit, it's likely we would end up using substantially more memory, and
> keep the suspended TLS thread around much longer than necessary.
> Instead, in this patch, I've used an "trampoline" process. The real
> parent first clones a trampoline, the trampoline then clones the
> ultimate child using the `CLONE_VFORK` option. `CLONE_VFORK` suspends
> the trampoline process until the child has exited, or called `execve`.
> Once the trampoline is re-scheduled, we know it is safe to clean up
> after the child. This creates one more suspended process, but typically,
> the trampoline only exists for a short period of time.
>
> !! CLONE_VM setup, step by step
>
> 1. First, the suspended thread whose TLS we will use is created using
>    `pthread_create`. The thread fetches and returns it's "TLS pointer"
>    (an arch-specific value given to the kernel) to the parent. It then
>    blocks on a lock to prevent its TLS data from being cleaned up.
>    Ultimately the lock will be unlocked by the trampoline once the child
>    exits.
> 2. Once the TLS thread has fetched the TLS pointer, it notifies the real
>    parent thread, which calls `clone()` to create the trampoline
>    process. For ease of implementation, the TLS image is set for the
>    trampoline process during this step. This allows the trampoline to
>    use functions that require TLS if needed (e.g., printf). TLS location
>    is inherited when a new child is spawned, so this TLS data will
>    automatically be inherited by the child.
> 3. Once the trampoline has been spawned, it registers itself as a
>    "hidden" process with the signal subsystem. This prevents the exit
>    signal from the trampoline from ever being forwarded to the guest.
>    This is needed due to the way that Linux sets the exit signal for the
>    ultimate child when `CLONE_PARENT` is set. See the source for
>    details.
> 4. Once setup is complete, the trampoline spawns the final child with
>    the original clone flags, plus `CLONE_PARENT`, so the child is
>    correctly parented to the kernel task on which the guest invoked
>    `clone`. Without this, kernel features like PDEATHSIG, and
>    subreapers, would not work properly. As previously discussed, the
>    trampoline also supplies `CLONE_VFORK` so that it is suspended until
>    the child can be cleaned up.
> 5. Once the child is spawned, it signals the original parent thread that
>    it is running. At this point, the trampoline process is suspended
>    (due to CLONE_VFORK).
> 6. Finally, the call to `qemu_clone` in the parent is finished, the
>    child begins executing the given callback function in the new child
>    process.
>
> !! Cleaning up
>
> Clean up itself is a multi-step process. Once the child exits, or is
> killed by a signal (cleanup is the same in both cases), the trampoline
> process becomes schedulable. When the trampoline is scheduled, it frees
> the child stack, and unblocks the suspended TLS thread. This cleans up
> the child resources, but not the stack used by the trampoline itself. It
> is possible for a process to clean up its own stack, but it is tricky,
> and architecture-specific. Instead we leverage the TLS manager thread to
> clean up the trampoline stack. When the trampoline is cloned (in step 2
> above), we additionally set the `CHILD_SETTID` and `CHILD_CLEARTID`
> flags. The target location for the SET/CLEAR TID is set to a special field
> known by the TLS manager. Then, when the TLS manager thread is unsuspended,
> it performs an additional `FUTEX_WAIT` on this location. That blocks the
> TLS manager thread until the trampoline has fully exited, then the TLS
> manager thread frees the trampoline process's stack, before exiting
> itself.
>
> !! Shortcomings of this patch
>
> * It's complicated.
> * It doesn't support any clone options when CLONE_VM is omitted.
> * It doesn't properly clean up the CPU queue when the child process
>   terminates, or calls execve().
> * RCU unregistration is done in the trampoline process (in clone.c), but
>   registration happens in syscall.c This should be made more explicit.
> * The TLS image, and trampoline stack are not cleaned up if the parent
>   calls `execve` or `exit_group` before the child does. This is because
>   those cleanup tasks are handled by the TLS manager thread. The TLS
>   manager thread is in the same thread group as the parent, so it will
>   be terminated if the parent exits or calls `execve`.
>
> !! Alternatives considered
>
> * Non-standard libc extension to allow creating TLS images independent
>   of threads. This would allow us to just `clone` the child directly
>   instead of this complicated maneuver. Though we probably would still
>   need the cleanup logic. For libcs, TLS image allocation is tightly
>   connected to thread stack allocation, which is also arch-specific. I
>   do not have enough experience with libc development to know if
>   maintainers of any popular libcs would be open to supporting such an
>   API. Additionally, since it will probably take years before a libc
>   fix would be widely deployed, we need an interim solution anyways.

We could consider a custom lib stub that intercepts calls to the guest's
original libc and replaces it with a QEMU-aware one?

> * Non-standard, Linux-only, libc extension to allow us to specify the
>   CLONE_* flags used by `pthread_create`. The processes we are creating
>   are basically threads in a different thread group. If we could alter
>   the flags used, this whole processes could become a `pthread_create.`
>   The problem with this approach is that I don't know what requirements
>   pthreads has on threads to ensure they function properly. I suspect
>   that pthreads relies on CHILD_CLEARTID+FUTEX_WAKE to cleanup detached
>   thread state. Since we don't control the child exit reason (Linux only
>   handles CHILD_CLEARTID on normal, non-signal process termination), we
>   probably can't use this same tracking mechanism.
> * Other mechanisms for detecting child exit so cleanup can happen
>   besides CLONE_VFORK:
>   * waitid(WNOWAIT): This can only detect exit, not execve.
>   * file descriptors with close on exec set: This cannot detect children
>     cloned with CLONE_FILES.
>   * System V semaphore adjustments: Cannot detect children cloned with
>     CLONE_SYSVSEM.
>   * CLONE_CHILD_CLEARTID + FUTEX_WAIT: Cannot detect abnormally
>     terminated children.
> * Doing the child clone directly in the TLS manager thread: This saves the
>   need for the trampoline process, but it causes the child process to be
>   parented to the wrong kernel task (the TLS thread instead of the Main
>   thread) breaking things like PDEATHSIG.

Have you considered a daemon which could co-ordinate between the
multiple processes that are sharing some state?


> Signed-off-by: Josh Kunz <jkz@google.com>
> ---
>  linux-user/clone.c               | 415 ++++++++++++++++++++++++++++++-
>  linux-user/qemu.h                |  17 ++
>  linux-user/signal.c              |  49 ++++
>  linux-user/syscall.c             |  69 +++--
>  tests/tcg/multiarch/linux-test.c |  67 ++++-
>  5 files changed, 592 insertions(+), 25 deletions(-)
>
> diff --git a/linux-user/clone.c b/linux-user/clone.c
> index f02ae8c464..3f7344cf9e 100644
> --- a/linux-user/clone.c
> +++ b/linux-user/clone.c
> @@ -12,6 +12,12 @@
>  #include <stdbool.h>
>  #include <assert.h>
>  
> +/* arch-specifc includes needed to fetch the TLS base offset. */
> +#if defined(__x86_64__)
> +#include <asm/prctl.h>
> +#include <sys/prctl.h>
> +#endif
> +
>  static const unsigned long NEW_STACK_SIZE = 0x40000UL;
>  
>  /*
> @@ -62,6 +68,397 @@ static void completion_finish(struct completion *c)
>      pthread_mutex_unlock(&c->mu);
>  }
>  
> +struct tls_manager {
> +    void *tls_ptr;
> +    /* fetched is completed once tls_ptr has been set by the thread. */
> +    struct completion fetched;
> +    /*
> +     * spawned is completed by the user once the managed_tid
> +     * has been spawned.
> +     */
> +    struct completion spawned;
> +    /*
> +     * TID of the child whose memory is cleaned up upon death. This memory
> +     * location is used as part of a futex op, and is cleared by the kernel
> +     * since we specify CHILD_CLEARTID.
> +     */
> +    int managed_tid;
> +    /*
> +     * The value to be `free`'d up once the janitor is ready to clean up the
> +     * TLS section, and the managed tid has exited.
> +     */
> +    void *cleanup;
> +};
> +
> +/*
> + * tls_ptr fetches the TLS "pointer" for the current thread. This pointer
> + * should be whatever platform-specific address is used to represent the TLS
> + * base address.
> + */
> +static void *tls_ptr()

This and a number of other prototypes need void args to stop the
compiler complaining about missing prototypes.
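
For example (sketch):

    static void *tls_ptr(void);
    static bool clone_vm_supported(void);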

> +{
> +    void *ptr;
> +#if defined(__x86_64__)
> +    /*
> +     * On x86_64, the TLS base is stored in the `fs` segment register, we can
> +     * fetch it with `ARCH_GET_FS`:
> +     */
> +    (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr);
> +#else
> +    ptr = NULL;
> +#endif
> +    return ptr;
> +}
> +
> +/*
> + * clone_vm_supported returns true if clone_vm() is supported on this
> + * platform.
> + */
> +static bool clone_vm_supported()
> +{
> +#if defined(__x86_64__)
> +    return true;
> +#else
> +    return false;
> +#endif
> +}
<snip>
Josh Kunz June 23, 2020, 3:43 a.m. UTC | #3
Thanks for the responses Alex. I'm working on your comments, but
wanted to clarify some of the points you brought up before mailing a
second version. Responses inline.

On Tue, Jun 16, 2020 at 9:08 AM Alex Bennée <alex.bennee@linaro.org> wrote:
> Which by the way fail on some targets:
>
>     TEST    linux-test on alpha
>   /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death
>   make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
>   make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2
>   make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2
>
> Have you managed a clean check-tcg with docker enabled so all the guest
> architectures get tested?

I've gotten this Alpha failure to reproduce on my local build and I'm
working on a fix. Thanks for pointing this out. I'll make sure I get a
clean `make check-tcg` for `linux-test` on all guest architectures.

> > In this patch, I've employed an alternative approach: spawning a thread
> > an "stealing" its TLS image for use in the child process. This approach
> > leaves a dangling thread while the TLS image is in use, but by design
> > that thread will not become schedulable until after the TLS data is no
> > longer in-use by the child (as described in a moment). Therefore, it
> > should cause relatively minimal overhead. When considered in the larger
> > context, this seems like a reasonable tradeoff.
>
> *sharp intake of breath*
>
> OK so the solution to the complexity of handling threads is to add more
> threads? cool cool cool....

The solution to the complexity of shared memory, but yeah, not my
favorite either. I was kinda hoping that someone on the list would
explain why this approach is clearly wrong.

> > * Non-standard libc extension to allow creating TLS images independent
> >   of threads. This would allow us to just `clone` the child directly
> >   instead of this complicated maneuver. Though we probably would still
> >   need the cleanup logic. For libcs, TLS image allocation is tightly
> >   connected to thread stack allocation, which is also arch-specific. I
> >   do not have enough experience with libc development to know if
> >   maintainers of any popular libcs would be open to supporting such an
> >   API. Additionally, since it will probably take years before a libc
> >   fix would be widely deployed, we need an interim solution anyways.
>
> We could consider a custom lib stub that intercepts calls to the guests
> original libc and replaces it with a QEMU aware one?

Unfortunately the problem here is host libc, rather than guest libc.
We need to make TLS variables in QEMU itself work, so intercepting
guest libc calls won't help much. Or am I misunderstanding the point?

> Have you considered a daemon which could co-ordinate between the
> multiple processes that are sharing some state?

Not really for the `CLONE_VM` support added in this patch series. I
have considered trying to pull tcg out of the guest process, but not
very seriously, since it seems like a pretty heavyweight approach.
Especially compared to the solution included in this series. Do you
think there's a simpler approach that involves using a daemon to do
coordination?

Thanks again for your reviews.

--
Josh Kunz
Alex Bennée June 23, 2020, 8:21 a.m. UTC | #4
Josh Kunz <jkz@google.com> writes:

> Thanks for the responses Alex. I'm working on your comments, but
> wanted to clarify some of the points you brought up before mailing a
> second version. Responses inline.
>
> On Tue, Jun 16, 2020 at 9:08 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>> Which by the way fail on some targets:
>>
>>     TEST    linux-test on alpha
>>   /home/alex/lsrc/qemu.git/tests/tcg/multiarch/linux-test.c:709: child did not receive PDEATHSIG on parent death
>>   make[2]: *** [../Makefile.target:153: run-linux-test] Error 1
>>   make[1]: *** [/home/alex/lsrc/qemu.git/tests/tcg/Makefile.qemu:76: run-guest-tests] Error 2
>>   make: *** [/home/alex/lsrc/qemu.git/tests/Makefile.include:851: run-tcg-tests-alpha-linux-user] Error 2
>>
>> Have you managed a clean check-tcg with docker enabled so all the guest
>> architectures get tested?
>
> I've gotten this Alpha failure to reproduce on my local build and I'm
> working on a fix. Thanks for pointing this out. I'll make sure I get a
> clean `make check-tcg` for `linux-test` on all guest architectures.
>
>> > In this patch, I've employed an alternative approach: spawning a thread
>> > an "stealing" its TLS image for use in the child process. This approach
>> > leaves a dangling thread while the TLS image is in use, but by design
>> > that thread will not become schedulable until after the TLS data is no
>> > longer in-use by the child (as described in a moment). Therefore, it
>> > should cause relatively minimal overhead. When considered in the larger
>> > context, this seems like a reasonable tradeoff.
>>
>> *sharp intake of breath*
>>
>> OK so the solution to the complexity of handling threads is to add more
>> threads? cool cool cool....
>
> The solution to the complexity of shared memory, but yeah, not my
> favorite either. I was kinda hoping that someone on the list would
> explain why this approach is clearly wrong.
>
>> > * Non-standard libc extension to allow creating TLS images independent
>> >   of threads. This would allow us to just `clone` the child directly
>> >   instead of this complicated maneuver. Though we probably would still
>> >   need the cleanup logic. For libcs, TLS image allocation is tightly
>> >   connected to thread stack allocation, which is also arch-specific. I
>> >   do not have enough experience with libc development to know if
>> >   maintainers of any popular libcs would be open to supporting such an
>> >   API. Additionally, since it will probably take years before a libc
>> >   fix would be widely deployed, we need an interim solution anyways.
>>
>> We could consider a custom lib stub that intercepts calls to the guests
>> original libc and replaces it with a QEMU aware one?
>
> Unfortunately the problem here is host libc, rather than guest libc.
> We need to make TLS variables in QEMU itself work, so intercepting
> guest libc calls won't help much. Or am I misunderstanding the point?

Hold up - I'm a little confused now. Why does the host TLS affect the
guest TLS? We have complete control over the guest's view of the world so
we should be able to control its TLS storage.

>> Have you considered a daemon which could co-ordinate between the
>> multiple processes that are sharing some state?
>
> Not really for the `CLONE_VM` support added in this patch series. I
> have considered trying to pull tcg out of the guest process, but not
> very seriously, since it seems like a pretty heavyweight approach.
> Especially compared to the solution included in this series. Do you
> think there's a simpler approach that involves using a daemon to do
> coordination?

I'm getting a little lost now. Exactly what state are we trying to share
between two QEMU guests which are now in separate execution contexts?
Josh Kunz July 9, 2020, 12:16 a.m. UTC | #5
Sorry for the late reply, response inline. Also, I noticed that a couple of
mails ago I seem to have removed the devel list and maintainers.
I've re-added them to the CC line.

On Wed, Jun 24, 2020 at 3:17 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Josh Kunz <jkz@google.com> writes:
>
> > On Tue, Jun 23, 2020, 1:21 AM Alex Bennée <alex.bennee@linaro.org> wrote:
> >
> > (snip)
> >
> >> >> > * Non-standard libc extension to allow creating TLS images independent
> >> >> >   of threads. This would allow us to just `clone` the child directly
> >> >> >   instead of this complicated maneuver. Though we probably would still
> >> >> >   need the cleanup logic. For libcs, TLS image allocation is tightly
> >> >> >   connected to thread stack allocation, which is also arch-specific. I
> >> >> >   do not have enough experience with libc development to know if
> >> >> >   maintainers of any popular libcs would be open to supporting such an
> >> >> >   API. Additionally, since it will probably take years before a libc
> >> >> >   fix would be widely deployed, we need an interim solution anyways.
> >> >>
> >> >> We could consider a custom lib stub that intercepts calls to the guests
> >> >> original libc and replaces it with a QEMU aware one?
> >> >
> >> > Unfortunately the problem here is host libc, rather than guest libc.
> >> > We need to make TLS variables in QEMU itself work, so intercepting
> >> > guest libc calls won't help much. Or am I misunderstanding the point?
> >>
> >> Hold up - I'm a little confused now. Why does the host TLS affect the
> >> guest TLS? We have complete control over the guests view of the world so
> >> we should be able to control it's TLS storage.
> >
> > Guest TLS is unaffected, just like in the existing case for guest
> > threads. Guest TLS is handled by the guest libc and the CPU emulation.
> > Just to be clear: This series changes nothing about guest TLS.
> >
> > The complexity of this series is to deal with *host* usage of TLS.
> > That is to say: use of thread local variables in QEMU itself. Host TLS
> > is needed to allow the subprocess created with `clone(CLONE_VM, ...)`
> > to run at all. TLS variables are used in QEMU for the RCU
> > implementation, parts of the TCG, and all over the place to access the
> > CPU/TaskState for the running thread. Host TLS is managed by the host
> > libc, and TLS is only set up for host threads created via
> > `pthread_create`. Subprocesses created with `clone(CLONE_VM)` share a
> > virtual memory map *and* TLS data with their parent[1], since libcs
> > provide no special handling of TLS when `clone(CLONE_VM)` is used.
> > Without the workaround used in this patch, both the parent and child
> > process's thread local variables reference the same memory locations.
> > This just doesn't work, since thread local data is assumed to actually
> > be thread local.
> >
> > The "alternative" proposed was to make the host libc support TLS for
> > processes created using clone (there are several ways to go about
> > this, each with different tradeoffs). You mentioned that "We could
> > consider a custom lib stub that intercepts calls to the guests
> > original libc..." in your comment. Since *guest* libc is not involved
> > here I was a bit confused about how this could help, and wanted to
> > clarify.
> >
> >> >> Have you considered a daemon which could co-ordinate between the
> >> >> multiple processes that are sharing some state?
> >> >
> >> > Not really for the `CLONE_VM` support added in this patch series. I
> >> > have considered trying to pull tcg out of the guest process, but not
> >> > very seriously, since it seems like a pretty heavyweight approach.
> >> > Especially compared to the solution included in this series. Do you
> >> > think there's a simpler approach that involves using a daemon to do
> >> > coordination?
> >>
> >> I'm getting a little lost now. Exactly what state are we trying to share
> >> between two QEMU guests which are now in separate execution contexts?
> >
> > Since this series only deals with `clone(CLONE_VM)` we always want to
> > share guest virtual memory between the execution contexts. There is
> > also some extra state that needs to be shared depending on which flags
> > are provided to `clone()`. E.g., signal handler tables for
> > CLONE_SIGHAND, file descriptor tables for CLONE_FILES, etc.
> >
> > The problem is that since QEMU and the guest live in the same virtual
> > memory map, keeping the mappings the same between the guest parent and
> > guest child means that the mappings also stay the same between the
> > host (QEMU) parent and host child. Two hosts can live in the same
> > virtual memory map, like we do right now with threads, but *only* with
> > valid TLS for each thread/process. That's why we bend-over backwards
> > to get set-up TLS for emulation in the child process.
>
> OK thanks for that. I'd obviously misunderstood from my first read
> through. So while hiding the underlying bits of QEMU from the guest is
> relatively easy it's quite hard to hide QEMU from itself in this
> CLONE_VM case.

Yes exactly.

> The other approach would be to suppress CLONE_VM for the actual process
> (thereby allowing QEMU to safely have a new instance and no clashing
> shared data) but emulate CLONE_VM for the guest itself (making the guest
> portions of memory shared and visible to each other). The trouble then
> would be co-ordination of mapping operations and other things that
> should be visible in a real CLONE_VM setup. This is the sort of
> situation I envisioned a co-ordination daemon might be useful.

Ah. This is interesting. Effectively the inverse of this patch. I had
not considered this approach. Thinking more about it, a "no shared
memory" approach does seem more straightforward implementation wise.
Unfortunately I think there would be a few substantial drawbacks:

1. Memory overhead. Every guest thread would need a full copy of QEMU
memory, including the translated guest binary.
2. Performance overhead. To keep virtual memory maps consistent across
tasks, a heavyweight 2 phase commit scheme, or similar, would be
needed for every `mmap`. That could have substantial performance
overhead for the guest. This could be a huge problem for processes
that use a large number of threads *and* do a lot of memory mapping or
frequently change page permissions.
3. There would be lots of similarly-fiddly bits that need to be shared
and coordinated in addition to guest memory. At least the signal
handler tables and fd_trans tables, but there are likely others I'm
missing.

The performance drawbacks could be largely mitigated by using the
current thread-only `CLONE_VM` support, but having *any* threads in
the process at all would lead to deadlocks after fork() or similar
non-CLONE_VM clone() calls. This could be worked around with a "stop
the world" button somewhat like `start_exclusive`, but expanded to
include all emulator threads. That will substantially slow down
fork().

Given all this I think the approach used in this series is probably at
least as "good" as a "no shared memory" approach. It has its own
complexities and drawbacks, but doesn't have obvious performance
issues. If you or other maintainers disagree, I'd be happy to write up
an RFC comparing the approaches in more detail (or we can just use
this thread), just let me know. Until then I'll keep pursuing this
patch.

> > [1] At least on x86_64, because TLS references are defined in terms of
> > the %fs segment, which is inherited on linux. Theoretically it's up to
> > the architecture to specify how TLS is inherited across execution
> > contexts. t's possible that the child actually ends up with no valid
> > TLS rather than using the parent TLS data. But that's not really
> > relevant here. The important thing is that the child ends up with
> > *valid* TLS, not invalid or inherited TLS.
>
>
> --
> Alex Bennée

--
Josh Kunz
Alex Bennée July 16, 2020, 10:41 a.m. UTC | #6
Josh Kunz <jkz@google.com> writes:

> Sorry for the late reply, response inline. Also I noticed a couple
> mails ago I seemed to have removed the devel list and maintainers.
> I've re-added them to the CC line.
>
> On Wed, Jun 24, 2020 at 3:17 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>
>> Josh Kunz <jkz@google.com> writes:
>>
>> > On Tue, Jun 23, 2020, 1:21 AM Alex Bennée <alex.bennee@linaro.org> wrote:
>> >
>> > (snip)
>> >
>> >> >> > * Non-standard libc extension to allow creating TLS images independent
>> >> >> >   of threads. This would allow us to just `clone` the child directly
>> >> >> >   instead of this complicated maneuver. Though we probably would still
>> >> >> >   need the cleanup logic. For libcs, TLS image allocation is tightly
>> >> >> >   connected to thread stack allocation, which is also arch-specific. I
>> >> >> >   do not have enough experience with libc development to know if
>> >> >> >   maintainers of any popular libcs would be open to supporting such an
>> >> >> >   API. Additionally, since it will probably take years before a libc
>> >> >> >   fix would be widely deployed, we need an interim solution anyways.
>> >> >>
>> >> >> We could consider a custom lib stub that intercepts calls to the guests
>> >> >> original libc and replaces it with a QEMU aware one?
>> >> >
>> >> > Unfortunately the problem here is host libc, rather than guest libc.
>> >> > We need to make TLS variables in QEMU itself work, so intercepting
>> >> > guest libc calls won't help much. Or am I misunderstanding the point?
>> >>
>> >> Hold up - I'm a little confused now. Why does the host TLS affect the
>> >> guest TLS? We have complete control over the guests view of the world so
>> >> we should be able to control it's TLS storage.
>> >
>> > Guest TLS is unaffected, just like in the existing case for guest
>> > threads. Guest TLS is handled by the guest libc and the CPU emulation.
>> > Just to be clear: This series changes nothing about guest TLS.
>> >
>> > The complexity of this series is to deal with *host* usage of TLS.
>> > That is to say: use of thread local variables in QEMU itself. Host TLS
>> > is needed to allow the subprocess created with `clone(CLONE_VM, ...)`
>> > to run at all. TLS variables are used in QEMU for the RCU
>> > implementation, parts of the TCG, and all over the place to access the
>> > CPU/TaskState for the running thread. Host TLS is managed by the host
>> > libc, and TLS is only set up for host threads created via
>> > `pthread_create`. Subprocesses created with `clone(CLONE_VM)` share a
>> > virtual memory map *and* TLS data with their parent[1], since libcs
>> > provide no special handling of TLS when `clone(CLONE_VM)` is used.
>> > Without the workaround used in this patch, both the parent and child
>> > process's thread local variables reference the same memory locations.
>> > This just doesn't work, since thread local data is assumed to actually
>> > be thread local.
>> >
>> > The "alternative" proposed was to make the host libc support TLS for
>> > processes created using clone (there are several ways to go about
>> > this, each with different tradeoffs). You mentioned that "We could
>> > consider a custom lib stub that intercepts calls to the guests
>> > original libc..." in your comment. Since *guest* libc is not involved
>> > here I was a bit confused about how this could help, and wanted to
>> > clarify.
>> >
>> >> >> Have you considered a daemon which could co-ordinate between the
>> >> >> multiple processes that are sharing some state?
>> >> >
>> >> > Not really for the `CLONE_VM` support added in this patch series. I
>> >> > have considered trying to pull tcg out of the guest process, but not
>> >> > very seriously, since it seems like a pretty heavyweight approach.
>> >> > Especially compared to the solution included in this series. Do you
>> >> > think there's a simpler approach that involves using a daemon to do
>> >> > coordination?
>> >>
>> >> I'm getting a little lost now. Exactly what state are we trying to share
>> >> between two QEMU guests which are now in separate execution contexts?
>> >
>> > Since this series only deals with `clone(CLONE_VM)` we always want to
>> > share guest virtual memory between the execution contexts. There is
>> > also some extra state that needs to be shared depending on which flags
>> > are provided to `clone()`. E.g., signal handler tables for
>> > CLONE_SIGHAND, file descriptor tables for CLONE_FILES, etc.
>> >
>> > The problem is that since QEMU and the guest live in the same virtual
>> > memory map, keeping the mappings the same between the guest parent and
>> > guest child means that the mappings also stay the same between the
>> > host (QEMU) parent and host child. Two hosts can live in the same
>> > virtual memory map, like we do right now with threads, but *only* with
>> > valid TLS for each thread/process. That's why we bend-over backwards
>> > to get set-up TLS for emulation in the child process.
>>
>> OK thanks for that. I'd obviously misunderstood from my first read
>> through. So while hiding the underlying bits of QEMU from the guest is
>> relatively easy it's quite hard to hide QEMU from itself in this
>> CLONE_VM case.
>
> Yes exactly.
>
>> The other approach would be to suppress CLONE_VM for the actual process
>> (thereby allowing QEMU to safely have a new instance and no clashing
>> shared data) but emulate CLONE_VM for the guest itself (making the guest
>> portions of memory shared and visible to each other). The trouble then
>> would be co-ordination of mapping operations and other things that
>> should be visible in a real CLONE_VM setup. This is the sort of
>> situation I envisioned a co-ordination daemon might be useful.
>
> Ah. This is interesting. Effectively the inverse of this patch. I had
> not considered this approach. Thinking more about it, a "no shared
> memory" approach does seem more straightforward implementation wise.
> Unfortunately I think there would be a few substantial drawbacks:
>
> 1. Memory overhead. Every guest thread would need a full copy of QEMU
> memory, including the translated guest binary.

Sure, although I suspect the overhead is not that great. For linux-user
on 64-bit systems we only allocate 128MB of translation buffer per
process. What sort of size systems are you expecting to run on and how
big are the binaries?

> 2. Performance overhead. To keep virtual memory maps consistent across
> tasks, a heavyweight 2 phase commit scheme, or similar, would be
> needed for every `mmap`. That could have substantial performance
> overhead for the guest. This could be a huge problem for processes
> that use a large number of threads *and* do a lot of memory mapping or
> frequently change page permissions.

I suspect that cross-arch highly threaded apps are still in the realm of
"wow, that actually works, neat :-)" for linux-user. We don't have the
luxury of falling back to a single thread like we do for system
emulation so things like strong-on-weak memory order bugs can still trip
us up.

> 3. There would be lots of similarly-fiddly bits that need to be shared
> and coordinated in addition to guest memory. At least the signal
> handler tables and fd_trans tables, but there are likely others I'm
> missing.
>
> The performance drawbacks could be largely mitigated by using the
> current thread-only `CLONE_VM` support, but having *any* threads in
> the process at all would lead to deadlocks after fork() or similar
> non-CLONE_VM clone() calls. This could be worked around with a "stop
> the world" button somewhat like `start_exclusive`, but expanded to
> include all emulator threads. That will substantially slow down
> fork().
>
> Given all this I think the approach used in this series is probably at
> least as "good" as a "no shared memory" approach. It has its own
> complexities and drawbacks, but doesn't have obvious performance
> issues. If you or other maintainers disagree, I'd be happy to write up
> an RFC comparing the approaches in more detail (or we can just use
> this thread), just let me know. Until then I'll keep pursuing this
> patch.

I think that's fair. I'll leave it to the maintainers to chime in if
they have something to add. I'd already given some comments on patch 1 and
given it needs a re-spin I'll have another look on the next iteration.

I will say, expect the system to get some testing on multiple backends, so
if you can expand your testing beyond an x86_64 host, please do.

>
>> > [1] At least on x86_64, because TLS references are defined in terms of
>> > the %fs segment, which is inherited on linux. Theoretically it's up to
>> > the architecture to specify how TLS is inherited across execution
>> > contexts. t's possible that the child actually ends up with no valid
>> > TLS rather than using the parent TLS data. But that's not really
>> > relevant here. The important thing is that the child ends up with
>> > *valid* TLS, not invalid or inherited TLS.
>>
>>
>> --
>> Alex Bennée

Patch

diff --git a/linux-user/clone.c b/linux-user/clone.c
index f02ae8c464..3f7344cf9e 100644
--- a/linux-user/clone.c
+++ b/linux-user/clone.c
@@ -12,6 +12,12 @@ 
 #include <stdbool.h>
 #include <assert.h>
 
+/* arch-specific includes needed to fetch the TLS base offset. */
+#if defined(__x86_64__)
+#include <asm/prctl.h>
+#include <sys/prctl.h>
+#endif
+
 static const unsigned long NEW_STACK_SIZE = 0x40000UL;
 
 /*
@@ -62,6 +68,397 @@  static void completion_finish(struct completion *c)
     pthread_mutex_unlock(&c->mu);
 }
 
+struct tls_manager {
+    void *tls_ptr;
+    /* fetched is completed once tls_ptr has been set by the thread. */
+    struct completion fetched;
+    /*
+     * spawned is completed by the user once the managed_tid
+     * has been spawned.
+     */
+    struct completion spawned;
+    /*
+     * TID of the child whose memory is cleaned up upon death. This memory
+     * location is used as part of a futex op, and is cleared by the kernel
+     * since we specify CHILD_CLEARTID.
+     */
+    int managed_tid;
+    /*
+     * The value to be `free`'d up once the janitor is ready to clean up the
+     * TLS section, and the managed tid has exited.
+     */
+    void *cleanup;
+};
+
+/*
+ * tls_ptr fetches the TLS "pointer" for the current thread. This pointer
+ * should be whatever platform-specific address is used to represent the TLS
+ * base address.
+ */
+static void *tls_ptr()
+{
+    void *ptr;
+#if defined(__x86_64__)
+    /*
+     * On x86_64, the TLS base is stored in the `fs` segment register, we can
+     * fetch it with `ARCH_GET_FS`:
+     */
+    (void)syscall(SYS_arch_prctl, ARCH_GET_FS, (unsigned long) &ptr);
+#else
+    ptr = NULL;
+#endif
+    return ptr;
+}
+
+/*
+ * clone_vm_supported returns true if clone_vm() is supported on this
+ * platform.
+ */
+static bool clone_vm_supported()
+{
+#if defined(__x86_64__)
+    return true;
+#else
+    return false;
+#endif
+}
+
+static void *tls_manager_thread(void *arg)
+{
+    struct tls_manager *mgr = (struct tls_manager *) arg;
+    int child_tid, ret;
+
+    /*
+     * NOTE: Do not use any TLS in this thread until after the `spawned`
+     * completion is finished. We need to preserve the pristine state of
+     * the TLS image for this thread, so it can be re-used in a separate
+     * process.
+     */
+    mgr->tls_ptr = tls_ptr();
+
+    /* Notify tls_new that we finished fetching the TLS ptr. */
+    completion_finish(&mgr->fetched);
+
+    /*
+     * Wait for the user of our TLS to tell us the child using our TLS has
+     * been spawned.
+     */
+    completion_await(&mgr->spawned);
+
+    child_tid = atomic_fetch_or(&mgr->managed_tid, 0);
+    /*
+     * Check if the child has already terminated by this point. If not, wait
+     * for the child to exit. As long as the trampoline is not killed by
+     * a signal, the kernel guarantees that the memory at &mgr->managed_tid
+     * will be cleared, and a FUTEX_WAKE at that address will be triggered.
+     */
+    if (child_tid != 0) {
+        ret = syscall(SYS_futex, &mgr->managed_tid, FUTEX_WAIT,
+                      child_tid, NULL, NULL, 0);
+        assert(ret == 0 && "clone manager futex should always succeed");
+    }
+
+    free(mgr->cleanup);
+    g_free(mgr);
+
+    return NULL;
+}
+
+static struct tls_manager *tls_manager_new()
+{
+    struct tls_manager *mgr = g_new0(struct tls_manager, 1);
+    sigset_t block, oldmask;
+
+    sigfillset(&block);
+    if (sigprocmask(SIG_BLOCK, &block, &oldmask) != 0) {
+        return NULL;
+    }
+
+    completion_init(&mgr->fetched);
+    completion_init(&mgr->spawned);
+
+    pthread_attr_t attr;
+    pthread_attr_init(&attr);
+    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
+
+    pthread_t unused;
+    if (pthread_create(&unused, &attr, tls_manager_thread, (void *) mgr)) {
+        pthread_attr_destroy(&attr);
+        g_free(mgr);
+        return NULL;
+    }
+    pthread_attr_destroy(&attr);
+    completion_await(&mgr->fetched);
+
+    if (sigprocmask(SIG_SETMASK, &oldmask, NULL) != 0) {
+        /* Let the thread exit, and cleanup itself. */
+        completion_finish(&mgr->spawned);
+        return NULL;
+    }
+
+    /* Once we finish awaiting, the tls_ptr will be usable. */
+    return mgr;
+}
+
+struct stack {
+    /* Buffer is the "base" of the stack buffer. */
+    void *buffer;
+    /* Top is the "start" of the stack (since stack addresses "grow down"). */
+    void *top;
+};
+
+struct info {
+    /* Stacks used for the trampoline and child process. */
+    struct {
+        struct stack trampoline;
+        struct stack process;
+    } stack;
+    struct completion child_ready;
+    /* `clone` flags for the process the user asked us to make. */
+    int flags;
+    sigset_t orig_mask;
+    /*
+     * Function to run in the ultimate child process, and payload to pass as
+     * the argument.
+     */
+    int (*clone_f)(void *);
+    void *payload;
+    /*
+     * Result of calling `clone` for the child clone. Will be set to
+     * `-errno` if an error occurs.
+     */
+    int result;
+};
+
+static bool stack_new(struct stack *stack)
+{
+    /*
+     * TODO: put a guard page at the bottom of the stack, so we don't
+     * accidentally roll off the end.
+     */
+    if (posix_memalign(&stack->buffer, 16, NEW_STACK_SIZE)) {
+        return false;
+    }
+    memset(stack->buffer, 0, NEW_STACK_SIZE);
+    stack->top = stack->buffer + NEW_STACK_SIZE;
+    return true;
+}
+
+static int clone_child(void *raw_info)
+{
+    struct info *info = (struct info *) raw_info;
+    int (*clone_f)(void *) = info->clone_f;
+    void *payload = info->payload;
+    if (!(info->flags & CLONE_VFORK)) {
+        /*
+         * If CLONE_VFORK is NOT set, then the trampoline has stalled (it
+         * forces VFORK), but the actual clone should return immediately. In
+         * this case, this thread needs to notify the parent that the new
+         * process is running. If CLONE_VFORK IS set, the trampoline will
+         * notify the parent once the normal kernel vfork completes.
+         */
+        completion_finish(&info->child_ready);
+    }
+    if (sigprocmask(SIG_SETMASK, &info->orig_mask, NULL) != 0) {
+        perror("failed to restore signal mask in cloned child");
+        _exit(1);
+    }
+    return clone_f(payload);
+}
+
+static int clone_trampoline(void *raw_info)
+{
+    struct info *info = (struct info *) raw_info;
+    int flags;
+
+    struct stack process_stack = info->stack.process;
+    int orig_flags = info->flags;
+
+    if (orig_flags & CSIGNAL) {
+        /*
+         * It should be safe to call here, since we know signals are blocked
+         * for this process.
+         */
+        hide_current_process_exit_signal();
+    }
+
+    /*
+     * Force CLONE_PARENT, so that we don't accidentally become a child of the
+     * trampoline thread. This kernel task should either be a child of the
+     * trampoline's parent (if CLONE_PARENT is not in info->flags), or a child
+     * of the calling process's parent (if CLONE_PARENT IS in info->flags).
+     * That is to say, our parent should always be the correct parent for the
+     * child task.
+     *
+     * Force CLONE_VFORK so that we know when the child is no longer holding
+     * a reference to this process's virtual memory. CLONE_VFORK just suspends
+     * this task until the child execs or exits; it should not affect how the
+     * child process is created in any way. This is the only generic way I'm
+     * aware of to observe *any* exit or exec, including "abnormal" exits such
+     * as exits via signals.
+     *
+     * Force CLONE_CHILD_SETTID, since we want to track the child TID in the
+     * `info` structure. Capturing the child TID via the `clone` call
+     * directly is slightly nicer than making a syscall in the child. Since
+     * we know we're doing a CLONE_VM here, we can use CLONE_CHILD_SETTID to
+     * guarantee that the kernel sets the child TID before the child runs.
+     * The child TID is visible to the parent, since parent and child share
+     * an address space. If the clone fails, we overwrite `info->result`
+     * with the error code anyway.
+     */
+    flags = orig_flags | CLONE_PARENT | CLONE_VFORK | CLONE_CHILD_SETTID;
+    if (clone(clone_child, info->stack.process.top, flags,
+              (void *) info, NULL, NULL, &info->result) < 0) {
+        info->result = -errno;
+        completion_finish(&info->child_ready);
+        return 0;
+    }
+
+    /*
+     * Clean up the child process stack, since we know the child can no longer
+     * reference it.
+     */
+    free(process_stack.buffer);
+
+    /*
+     * We know the process we created was CLONE_VFORK, so it registered with
+     * the RCU. We share a TLS image with the process, so we can unregister
+     * it from the RCU. Since the TLS image will be valid for at least our
+     * lifetime, it should be OK to leave the child process's RCU entry in
+     * the queue between when the child calls execve or exits, and when the
+     * OS returns here from our vfork.
+     */
+    rcu_unregister_thread();
+
+    /*
+     * If we're doing a real vfork here, we need to notify the parent that the
+     * vfork has happened.
+     */
+    if (orig_flags & CLONE_VFORK) {
+        completion_finish(&info->child_ready);
+    }
+
+    return 0;
+}
+
+static int clone_vm(int flags, int (*callback)(void *), void *payload)
+{
+    struct info info;
+    sigset_t sigmask;
+    int ret;
+
+    assert(flags & CLONE_VM && "CLONE_VM flag must be set");
+
+    memset(&info, 0, sizeof(info));
+    info.clone_f = callback;
+    info.payload = payload;
+    info.flags = flags;
+
+    /*
+     * Set up the stacks for the child processes needed to execute the clone.
+     */
+    if (!stack_new(&info.stack.trampoline)) {
+        return -1;
+    }
+    if (!stack_new(&info.stack.process)) {
+        free(info.stack.trampoline.buffer);
+        return -1;
+    }
+
+    /*
+     * tls_manager_new grants us its ownership of the reference to the
+     * TLS manager, so we "leak" the data pointer instead of using _get().
+     */
+    struct tls_manager *mgr = tls_manager_new();
+    if (mgr == NULL) {
+        free(info.stack.trampoline.buffer);
+        free(info.stack.process.buffer);
+        return -1;
+    }
+
+    /* Manager cleans up the trampoline stack once the trampoline exits. */
+    mgr->cleanup = info.stack.trampoline.buffer;
+
+    /*
+     * Flags used by the trampoline in the 2-phase clone setup for children
+     * cloned with CLONE_VM. We want the trampoline to be essentially identical
+     * to its parent. This improves the performance of cloning the trampoline,
+     * and guarantees that the real flags are implemented correctly.
+     *
+     * CLONE_CHILD_SETTID: Make the kernel set the managed_tid for the TLS
+     * manager.
+     *
+     * CLONE_CHILD_CLEARTID: Make the kernel clear the managed_tid, and
+     * trigger a FUTEX_WAKE (received by the TLS manager), so the TLS manager
+     * knows when to clean up the trampoline stack.
+     *
+     * CLONE_SETTLS: To set the trampoline TLS based on the tls manager.
+     */
+    static const int base_trampoline_flags = (
+        CLONE_FILES | CLONE_FS | CLONE_IO | CLONE_PTRACE |
+        CLONE_SIGHAND | CLONE_SYSVSEM | CLONE_VM
+    ) | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | CLONE_SETTLS;
+
+    int trampoline_flags = base_trampoline_flags;
+
+    /*
+     * To get the process hierarchy right, we set the trampoline
+     * CLONE_PARENT/CLONE_THREAD flag to match the child
+     * CLONE_PARENT/CLONE_THREAD. So add those flags if specified by the child.
+     */
+    trampoline_flags |= (flags & CLONE_PARENT) ? CLONE_PARENT : 0;
+    trampoline_flags |= (flags & CLONE_THREAD) ? CLONE_THREAD : 0;
+
+    /*
+     * When using CLONE_PARENT, Linux always sets the exit_signal for the
+     * task to the exit_signal of the parent process, which for our purposes
+     * is the trampoline process. exit_signal has special significance for
+     * calls like `wait`, so it needs to be set correctly. We add the signal
+     * part of the user flags here so the ultimate child gets the right
+     * signal.
+     *
+     * This has the unfortunate side-effect of sending the parent two exit
+     * signals: one when the true child exits, and one when the trampoline
+     * exits. To work around this we have to capture the exit signal from
+     * the trampoline and suppress it.
+     */
+    trampoline_flags |= (flags & CSIGNAL);
+
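+    /*
+     * Block all signals before cloning so neither the trampoline nor the
+     * child can take a signal before its state is fully set up. The original
+     * mask is restored in clone_child() and again below for the parent.
+     */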
+    sigfillset(&sigmask);
+    if (sigprocmask(SIG_BLOCK, &sigmask, &info.orig_mask) != 0) {
+        free(info.stack.trampoline.buffer);
+        free(info.stack.process.buffer);
+        completion_finish(&mgr->spawned);
+        return -1;
+    }
+
+    if (clone(clone_trampoline,
+              info.stack.trampoline.top, trampoline_flags, &info,
+              NULL, mgr->tls_ptr, &mgr->managed_tid) < 0) {
+        free(info.stack.trampoline.buffer);
+        free(info.stack.process.buffer);
+        completion_finish(&mgr->spawned);
+        return -1;
+    }
+
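+    /*
+     * Wait until the child is running (or the clone failed), then release
+     * the TLS manager thread; it waits for the trampoline's TID to be
+     * cleared and then frees the trampoline stack.
+     */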
+    completion_await(&info.child_ready);
+    completion_finish(&mgr->spawned);
+
+    ret = sigprocmask(SIG_SETMASK, &info.orig_mask, NULL);
+    /*
+     * If our final sigprocmask call fails, we're pretty screwed. We may
+     * have started the final child now, and there's no going back. If this
+     * ever happens, just crash.
+     */
+    assert(!ret && "sigprocmask after clone needs to succeed");
+
+    /* If we have an error result, then set errno as needed. */
+    if (info.result < 0) {
+        errno = -info.result;
+        return -1;
+    }
+    return info.result;
+}
+
 struct clone_thread_info {
     struct completion running;
     int tid;
@@ -120,6 +517,17 @@  int qemu_clone(int flags, int (*callback)(void *), void *payload)
 {
     int ret;
 
+    /*
+     * Backwards Compatibility: Remove once all target platforms support
+     * clone_vm. Previously, we implemented vfork() via a fork() call;
+     * preserve that behavior here instead of failing.
+     */
+    if (!clone_vm_supported()) {
+        if (flags & CLONE_VFORK) {
+            flags &= ~(CLONE_VFORK | CLONE_VM);
+        }
+    }
+
     if (clone_flags_are_thread(flags)) {
         /*
          * The new process uses the same flags as pthread_create, so we can
@@ -146,7 +554,12 @@  int qemu_clone(int flags, int (*callback)(void *), void *payload)
         return ret;
     }
 
-    /* !fork && !thread */
+    if (clone_vm_supported() && (flags & CLONE_VM)) {
+        return clone_vm(flags, callback, payload);
+    }
+
+    /* !fork && !thread && !CLONE_VM. This form is unsupported. */
+
     errno = EINVAL;
     return -1;
 }
diff --git a/linux-user/qemu.h b/linux-user/qemu.h
index 54bf4f47be..e29912466c 100644
--- a/linux-user/qemu.h
+++ b/linux-user/qemu.h
@@ -94,6 +94,7 @@  struct vm86_saved_state {
 
 struct emulated_sigtable {
     int pending; /* true if signal is pending */
+    pid_t exit_pid; /* non-zero host pid, if a process is exiting. */
     target_siginfo_t info;
 };
 
@@ -183,6 +184,15 @@  typedef struct TaskState {
      * least TARGET_NSIG entries
      */
     struct target_sigaction *sigact_tbl;
+
+    /*
+     * Set to true if the process associated with this task state was cloned.
+     * This is needed to disambiguate cloned processes from threads. If
+     * CLONE_VM is used, a pthread_exit(..) will free the stack/TLS of the
+     * trampoline thread, and the trampoline will be unable to conduct its
+     * cleanup.
+     */
+    bool is_cloned;
 } __attribute__((aligned(16))) TaskState;
 
 extern char *exec_path;
@@ -442,6 +452,13 @@  abi_long do_sigaltstack(abi_ulong uss_addr, abi_ulong uoss_addr, abi_ulong sp);
 int do_sigprocmask(int how, const sigset_t *set, sigset_t *oldset);
 abi_long do_swapcontext(CPUArchState *env, abi_ulong uold_ctx,
                         abi_ulong unew_ctx, abi_long ctx_size);
+
+/*
+ * Register the current process as a "hidden" process. Exit signals generated
+ * by this process should not be delivered to the guest.
+ */
+void hide_current_process_exit_signal(void);
+
 /**
  * block_signals: block all signals while handling this guest syscall
  *
diff --git a/linux-user/signal.c b/linux-user/signal.c
index dc98def6d1..a7f0612b64 100644
--- a/linux-user/signal.c
+++ b/linux-user/signal.c
@@ -36,6 +36,21 @@  typedef struct target_sigaction sigact_table[TARGET_NSIG];
 static void host_signal_handler(int host_signum, siginfo_t *info,
                                 void *puc);
 
+/*
+ * This table, initialized in signal_init, is used to track "hidden"
+ * processes for which exit signals should not be delivered. The PIDs of the
+ * hidden processes are stored as keys. Values are always set to NULL.
+ *
+ * Note: Process IDs stored in this table may "leak" (i.e., never be removed
+ * from the table) if the guest ignores (SIG_IGN) the exit signal for the
+ * child it spawned. There is a small risk that such a PID could later be
+ * reused by another child process, whose exit would then be wrongly hidden.
+ * This is an unusual case that is unlikely to happen, but it is possible.
+ */
+static GHashTable *hidden_processes;
+
+/* this lock guards access to the `hidden_processes` table. */
+static pthread_mutex_t hidden_processes_lock = PTHREAD_MUTEX_INITIALIZER;
 
 /*
  * System includes define _NSIG as SIGRTMAX + 1,
@@ -564,6 +579,9 @@  void signal_init(void)
     /* initialize signal conversion tables */
     signal_table_init();
 
+    /* initialize the hidden process table. */
+    hidden_processes = g_hash_table_new(g_direct_hash, g_direct_equal);
+
     /* Set the signal mask from the host mask. */
     sigprocmask(0, 0, &ts->signal_mask);
 
@@ -749,6 +767,10 @@  static void host_signal_handler(int host_signum, siginfo_t *info,
     k = &ts->sigtab[sig - 1];
     k->info = tinfo;
     k->pending = sig;
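+    /*
+     * Remember the host PID for child-exit signals so that
+     * handle_pending_signal() can drop exit signals that belong to hidden
+     * (trampoline) processes.
+     */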
+    k->exit_pid = 0;
+    if (info->si_code == CLD_EXITED || info->si_code == CLD_KILLED ||
+        info->si_code == CLD_DUMPED) {
+        k->exit_pid = info->si_pid;
+    }
     ts->signal_pending = 1;
 
     /* Block host signals until target signal handler entered. We
@@ -930,6 +952,17 @@  int do_sigaction(int sig, const struct target_sigaction *act,
     return ret;
 }
 
+void hide_current_process_exit_signal(void)
+{
+    pid_t pid = getpid();
+
+    pthread_mutex_lock(&hidden_processes_lock);
+
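+    /* Only the key (our PID) matters; the value is always NULL. */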
+    (void)g_hash_table_insert(hidden_processes, GINT_TO_POINTER(pid), NULL);
+
+    pthread_mutex_unlock(&hidden_processes_lock);
+}
+
 static void handle_pending_signal(CPUArchState *cpu_env, int sig,
                                   struct emulated_sigtable *k)
 {
@@ -944,6 +977,22 @@  static void handle_pending_signal(CPUArchState *cpu_env, int sig,
     /* dequeue signal */
     k->pending = 0;
 
+    if (k->exit_pid) {
+        pthread_mutex_lock(&hidden_processes_lock);
+        /*
+         * If the exit signal is for a hidden PID, then just drop it, and
+         * remove the hidden process from the list, since we know it has
+         * exited.
+         */
+        if (g_hash_table_contains(hidden_processes,
+                                  GINT_TO_POINTER(k->exit_pid))) {
+            g_hash_table_remove(hidden_processes, GINT_TO_POINTER(k->exit_pid));
+            pthread_mutex_unlock(&hidden_processes_lock);
+            return;
+        }
+        pthread_mutex_unlock(&hidden_processes_lock);
+    }
+
     sig = gdb_handlesig(cpu, sig);
     if (!sig) {
         sa = NULL;
diff --git a/linux-user/syscall.c b/linux-user/syscall.c
index 838caf9c98..20cf5d5464 100644
--- a/linux-user/syscall.c
+++ b/linux-user/syscall.c
@@ -139,10 +139,9 @@ 
 
 /* These flags are ignored:
  * CLONE_DETACHED is now ignored by the kernel;
- * CLONE_IO is just an optimisation hint to the I/O scheduler
  */
 #define CLONE_IGNORED_FLAGS                     \
-    (CLONE_DETACHED | CLONE_IO)
+    (CLONE_DETACHED)
 
 /* Flags for fork which we can implement within QEMU itself */
 #define CLONE_EMULATED_FLAGS               \
@@ -5978,14 +5977,31 @@  static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
     }
     proc_flags = (proc_flags & ~CSIGNAL) | host_sig;
 
-    /* Emulate vfork() with fork() */
-    if (proc_flags & CLONE_VFORK) {
-        proc_flags &= ~(CLONE_VFORK | CLONE_VM);
+
+    if (!clone_flags_are_fork(proc_flags) && !(flags & CLONE_VM)) {
+        /*
+         * If the user is doing a non-CLONE_VM clone, which cannot be emulated
+         * with fork, we can't guarantee that we can emulate this correctly.
+         * It should work OK as long as there are no threads in the parent
+         * process, so we hide it behind a flag for users who know what
+         * they're doing.
+         */
+        qemu_log_mask(LOG_UNIMP,
+                      "Refusing non-fork/thread clone without CLONE_VM.");
+        return -TARGET_EINVAL;
     }
 
-    if (!clone_flags_are_fork(proc_flags) &&
-        !clone_flags_are_thread(proc_flags)) {
-        qemu_log_mask(LOG_UNIMP, "unsupported clone flags");
+    if ((flags & CLONE_FILES) && !(flags & CLONE_VM)) {
+        /*
+         * This flag combination is currently unsupported. QEMU needs to update
+         * the fd_trans_table as new file descriptors are opened. This is easy
+         * when CLONE_VM is set, because the fd_trans_table is shared between
+         * the parent and child. Without CLONE_VM the fd_trans_table will need
+         * to be shared specially, using shared memory mappings or a
+         * consistency protocol between the child and the parent.
+         *
+         * For now, just return EINVAL in this case.
+         */
+        qemu_log_mask(LOG_UNIMP, "CLONE_FILES only supported with CLONE_VM");
         return -TARGET_EINVAL;
     }
 
@@ -6042,6 +6058,10 @@  static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
         ts->sigact_tbl = sigact_table_clone(parent_ts->sigact_tbl);
     }
 
+    if (!clone_flags_are_thread(proc_flags)) {
+        ts->is_cloned = true;
+    }
+
     if (flags & CLONE_CHILD_CLEARTID) {
         ts->child_tidptr = child_tidptr;
     }
@@ -6063,10 +6083,8 @@  static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
         tb_flush(cpu);
     }
 
-    if (proc_flags & CLONE_VM) {
-        info.child.register_thread = true;
-        info.child.signal_setup = true;
-    }
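+    /*
+     * With CLONE_VFORK the parent does not resume until the child has
+     * exec'd or exited, so there is no point coordinating signal setup
+     * with it; skip the setup handshake (and the wait below) in that case.
+     */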
+    info.child.signal_setup = (flags & CLONE_VM) && !(flags & CLONE_VFORK);
+    info.child.register_thread = !!(flags & CLONE_VM);
 
     /*
      * It is not safe to deliver signals until the child has finished
@@ -6078,7 +6096,7 @@  static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
 
     ret = get_errno(qemu_clone(proc_flags, clone_run, (void *) &info));
 
-    if (ret >= 0 && (proc_flags & CLONE_VM)) {
+    if (ret >= 0 && (flags & CLONE_VM) && !(flags & CLONE_VFORK)) {
         /*
          * Wait for the child to finish setup if the child is running in the
          * same VM.
@@ -6092,7 +6110,7 @@  static int do_fork(CPUArchState *env, unsigned int flags, abi_ulong newsp,
     pthread_cond_destroy(&info.cond);
     pthread_mutex_destroy(&info.mutex);
 
-    if (ret >= 0 && !(proc_flags & CLONE_VM)) {
+    if (ret >= 0 && !(flags & CLONE_VM)) {
         /*
          * If !CLONE_VM, then we need to set parent_tidptr, since the child
          * won't set it for us. Should always be safe to set it here anyways.
@@ -7662,6 +7680,7 @@  static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
     switch(num) {
     case TARGET_NR_exit:
     {
+        bool do_pthread_exit = false;
         /* In old applications this may be used to implement _exit(2).
            However in threaded applictions it is used for thread termination,
            and _exit_group is used for application termination.
@@ -7692,10 +7711,20 @@  static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
                           NULL, NULL, 0);
             }
 
+            /*
+             * Need this multi-step process so we can free ts before calling
+             * pthread_exit.
+             */
+            if (!ts->is_cloned) {
+                do_pthread_exit = true;
+            }
+
             thread_cpu = NULL;
             g_free(ts);
-            rcu_unregister_thread();
-            pthread_exit(NULL);
+            if (do_pthread_exit) {
+                rcu_unregister_thread();
+                pthread_exit(NULL);
+            }
         }
 
         pthread_mutex_unlock(&clone_lock);
@@ -9700,6 +9729,14 @@  static abi_long do_syscall1(void *cpu_env, int num, abi_long arg1,
 #ifdef __NR_exit_group
         /* new thread calls */
     case TARGET_NR_exit_group: {
+        /*
+         * TODO: We need to clean up CPUs (as is done for exit(2))
+         * for all threads in this process when exit_group is called, at
+         * least for tasks that have been cloned. This could also be done in
+         * clone_trampoline/tls_mgr. Since this cleanup is non-trivial (it
+         * needs to be coordinated across threads) and things seem to be fine
+         * without it, just leaving a note for now.
+         */
         preexit_cleanup(cpu_env, arg1);
         return get_errno(exit_group(arg1));
     }
diff --git a/tests/tcg/multiarch/linux-test.c b/tests/tcg/multiarch/linux-test.c
index 8a7c15cd31..a7723556c2 100644
--- a/tests/tcg/multiarch/linux-test.c
+++ b/tests/tcg/multiarch/linux-test.c
@@ -407,14 +407,13 @@  static void test_clone(void)
 
     stack1 = malloc(STACK_SIZE);
     pid1 = chk_error(clone(thread1_func, stack1 + STACK_SIZE,
-                           CLONE_VM | CLONE_FS | CLONE_FILES |
-                           CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM,
+                           CLONE_VM | SIGCHLD,
                             "hello1"));
 
     stack2 = malloc(STACK_SIZE);
     pid2 = chk_error(clone(thread2_func, stack2 + STACK_SIZE,
                            CLONE_VM | CLONE_FS | CLONE_FILES |
-                           CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM,
+                           CLONE_SIGHAND | CLONE_SYSVSEM | SIGCHLD,
                            "hello2"));
 
     wait_for_child(pid1);
@@ -517,6 +516,61 @@  static void test_shm(void)
     chk_error(shmdt(ptr));
 }
 
+static volatile sig_atomic_t test_clone_signal_count_handler_calls;
+
+static void test_clone_signal_count_handler(int sig)
+{
+    test_clone_signal_count_handler_calls++;
+}
+
+/* A clone function that does nothing and exits successfully. */
+static int successful_func(void *arg __attribute__((unused)))
+{
+    return 0;
+}
+
+/*
+ * With our clone implementation it's possible that we could generate too many
+ * child exit signals. Make sure only the single expected child-exit signal is
+ * generated.
+ */
+static void test_clone_signal_count(void)
+{
+    uint8_t *child_stack;
+    struct sigaction prev, test;
+    int status;
+    pid_t pid;
+
+    memset(&test, 0, sizeof(test));
+    test.sa_handler = test_clone_signal_count_handler;
+    test.sa_flags = SA_RESTART;
+
+    /* Use real-time signals, so every signal event gets delivered. */
+    chk_error(sigaction(SIGRTMIN, &test, &prev));
+
+    child_stack = malloc(STACK_SIZE);
+    pid = chk_error(clone(
+        successful_func,
+        child_stack + STACK_SIZE,
+        CLONE_VM | SIGRTMIN,
+        NULL
+    ));
+
+    /*
+     * Need to use __WCLONE here because we are not using SIGCHLD as the
+     * exit_signal. By default, Linux only waits for children spawned with
+     * SIGCHLD.
+     */
+    chk_error(waitpid(pid, &status, __WCLONE));
+
+    chk_error(sigaction(SIGRTMIN, &prev, NULL));
+
+    if (test_clone_signal_count_handler_calls != 1) {
+        error("expected to receive exactly 1 signal, received %d signals",
+              test_clone_signal_count_handler_calls);
+    }
+}
+
 int main(int argc, char **argv)
 {
     test_file();
@@ -524,11 +578,8 @@  int main(int argc, char **argv)
     test_fork();
     test_time();
     test_socket();
-
-    if (argc > 1) {
-        printf("test_clone still considered buggy\n");
-        test_clone();
-    }
+    test_clone();
+    test_clone_signal_count();
 
     test_signal();
     test_shm();