mbox series

[v8,0/7] Add pidfd and cgroupv2 support for process creation

Message ID 20230818140642.1623571-1-adhemerval.zanella@linaro.org
Headers show
Series Add pidfd and cgroupv2 support for process creation | expand

Message

Adhemerval Zanella Netto Aug. 18, 2023, 2:06 p.m. UTC
The glibc 2.36 added wrappers for Linux syscall pidfd_open, pidfd_getfd,
and pidfd_send_signal, and exported the P_PIDFD to use along with
waitid. The pidfd is a race-free interface, however, the pidfd_open is
subject to TOCTOU if the file descriptor is not obtained directly from
the clone or clone3 syscall (there is still a small window between the
clone return and the pidfd_getfd where the process can be reaped and the
process ID reused).

A fully race-free interface with posix_spawn interface is being
discussed by GNOME [1] [2], and Qt already uses it in its QtProcess
implementation [3].  The Qt implementation has some pitfalls:

  - It calls clone through the syscall symbol, which does not run the
    pthread_atfork handlers even though it intends to use the clone
semantic for fork (by only using CLONE_PIDFD | SIGCHLD).

  - It also does not reset any internal state, such as internal IO,
    malloc, loader, etc. locks.

  - It does not set the TCB tid field nor the robust list, used by the
    pthread code.

  - It does not optimize process creation by using CLONE_VM and
    CLONE_VFORK.

Also, the recent Linux kernel (starting with 5.7) provides a way to
create a new process in a different cgroups version 2 than the default
one (through clone3 CLONE_INTO_CGROUP flag).  Providing it through glibc
interfaces makes it usable without the risk of potential breakage by
issuing clone3 syscall directly (check BZ#26371 discussion).

This patch set adds new interfaces that take care of these potential
issues.  The new posix_spawn / posix_spawnp extensions:

  #define POSIX_SPAWN_SETCGROUP 0x100

  int posix_spawnattr_getcgroup_np (const posix_spawnattr_t
				    restrict *attr, int *cgroup);
  int posix_spawnattr_setcgroup_np (posix_spawnattr_t *restrict attr,
                                    int cgroup);
  
Allow spawning a new process on a different cgroupv2.  

The pidfd_spawn and pidfd_spawnp is similar to posix_spawn and
posix_spawnp, but return a process file descriptor instead of a PID.

  int pidfd_spawn (int *restrict pidfd,
 		   const char *restrict file,
  		   const posix_spawn_file_actions_t *restrict facts,
  		   const posix_spawnattr_t *restrict attrp,
  		   char *const argv[restrict],
  		   char *const envp[restrict]);

  int pidfd_spawnp (int *restrict pidfd,
 		    const char *restrict path,
  		    const posix_spawn_file_actions_t *restrict facts,
  		    const posix_spawnattr_t *restrict attrp,
  		    char *const argv[restrict_arr],
  		    char *const envp[restrict_arr]);

The implementation makes sure that kernel must support the complete
pidfd interface, meaning that waitid (P_PIDFD) should be supported.  It
ensures that a non-racy workaround is required (such as reading procfs
fdinfo pid to use along with old wait interfaces).  If the kernel does
not have the required support the interface returns ENOSYS.

A new symbol is used instead of a posix_spawn extension to avoid
possible issues with language bindings that might track the argument
lifetime.

Both symbols reuse the posix_spawn posix_spawn_file_actions_t and
posix_spawnattr_t, to either avoid rehashing the posix_spawn API or add
a new one.  It also means that both interfaces support the same
attribute and file actions, and a new flag or file action on posix_spawn
is also added automatically for pidfd_spawn. It includes
POSIX_SPAWN_SETCGROUP.

Along with the spawn interface, a fork-like one is also provided:

  typedef union
  {
    struct
    {
      __uint64_t fork_np_flags;
      int fork_np_pidfd;
      int fork_np_cgroup;
      int fork_np_exit_signal;
  #define fork_np_flags       __data.fork_np_flags
  #define fork_np_pidfd       __data.fork_np_pidfd
  #define fork_np_cgroup      __data.fork_np_cgroup
  #define fork_np_exit_signal __data.fork_np_exit_signal
    } __data;
    char __size [FORK_NP_ARGS_SIZE_VER0];
  } fork_np_args_t;

  #define FORK_NP_PIDFD        (1ULL << 1)
  #define FORK_NP_CGROUP       (1ULL << 2)
  #define FORK_NP_ASYNCSAFE    (1ULL << 3)
  #define FORK_NP_EXIT_SIGNAL  (1ULL << 4)

  pid_t fork_np (fork_np_args_t *args, size_t size)

The SIZE must represent a supported pidfd_fork_args_t type, otherwise,
the function returns EINVAL.

If ARGS has all members set to 0, no file descriptor is returned and
pidfd_fork acts as fork.  If PIDFDFORK_PIDFD is set on the flags member,
a new file descriptor is returned on pidfd member and the kernel sets
O_CLOEXEC as default.  The pidfd_fork follows the fork/_Fork convention
on returning a positive or negative value to the parent (with a negative
indicating an error) and zero to the child.

If PIDFDFORK_CGROUP is set, the value on the cgroup member is used as
the cgroupv2 to be placed in the new process (by using the
CLONE_INTO_CGROUP clone flag).

If PIDFDFORK_ASYNCSAFE is set, pidfd_fork acts as _Fork, thus avoiding
running pthread_atfork handlers.

If PIDFDFORK_EXIT_SIGNAL is set, the signal on exit_signal is sent as
process termination (SIGCHLD is the default). The 0 value is also valid,
meaning no signal will be sent.

The kernel already sets O_CLOEXEC as default and it follows the
fork/_Fork convention on returning a positive or negative value to the
parent (with negative indicating an error) and zero to the child.

Similar to fork, pidfd_fork also runs the pthread_atfork handlers It can
be changed by using the PIDFDFORK_ASYNCSAFE flag, which makes pidfd_fork
act a _Fork.  It also sends SIGCHLD to the parent when the new process
terminates.

To have a way to interop between process IDs and process file
descriptors, the pidfd_getpid is also provided:

   pid_t pidfd_getpid (int fd)

It reads the procfs fdinfo entry from the file descriptor to get the
process ID.

[1] https://gitlab.gnome.org/GNOME/glib/-/issues/1866
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=30349
[3] https://codebrowser.dev/qt6/qtbase/src/3rdparty/forkfd/forkfd_linux.c.html

---

Changes from v7:
- Redefine __ASSUME_CLONE3 to 0 if the architecture does not support the
  syscall.
- Fixed some failing errors to be reported by spawned processes.
- Fixed pre-commit CI for AArch64 failures.
- Rename pidfd_fork to fork_np and make the API extensible
- Document more possible pidfd_getpid errors.

Changes from v6:
- Rebased against master, adjusted symbol version and NEWS entry.
- Added arm/mips clone3 implementation.

Changes from v5:
- Added cgroupv2 support for posix_spawn, pidfd_spawn, and pidfd_fork.

Changes from v4:
- Changed pidfd_fork signature to return a pid_t instead of the PID file
  descriptor.
- Changed pidfd_getpid to return EBADF for negative input, instead of
  EINVAL.
- Added PIDFDFORK_NOSIGCHLD option.
- Fixed nested __BEGIN_DECLS on spawn.h

Changes from v3:
- Remove strtoul usage.
- Fixed patchwork tst-pidfd_getpid.c regression.
- Fixed manual and NEWS typos.

Changes from v2:
- Added pidfd_fork and pidfd_getpid manual entries
- Change pidfd_fork to act as fork as default, instead as _Fork.
- Changed PIDFD_FORK_RUNATFORK flag to PIDFDFORK_ASYNCSAFE.
- Added pidfd_getpid test for EREMOTE.

Changes from v1:
- Extended pidfd_getpid error codes to return EBADF if fdinfo does not
  have Pid entry or if the value is invalid, EREMOTE is pid is in a 
  separate namespace, and ESRCH if is already terminated.
- Extended tst-pidfd_getpid.
- Rename PIDFD_FORK_RUNATFORK to PIDFDFORK_RUNATFORK to avoid clashes
  with possible kernel extensions.

Adhemerval Zanella (7):
  arm: Add the clone3 wrapper
  mips: Add the clone3 wrapper
  linux: Define __ASSUME_CLONE3 to 0 for alpha, ia64, nios2, sh, and
    sparc
  linux: Add posix_spawnattr_{get,set}cgroup_np (BZ 26731)
  posix: Add pidfd_spawn and pidfd_spawnp (BZ 30349)
  posix: Add fork_np (BZ 26371)
  linux: Add pidfd_getpid

 NEWS                                          |  24 ++
 bits/spawn_ext.h                              |  21 ++
 include/clone_internal.h                      |  21 ++
 manual/process.texi                           | 122 ++++++++-
 posix/Makefile                                |   5 +-
 posix/fork-internal.c                         | 127 ++++++++++
 posix/fork-internal.h                         |  36 +++
 posix/fork.c                                  | 107 +-------
 posix/spawn.h                                 |   6 +-
 posix/spawn_int.h                             |   3 +-
 posix/spawnattr_setflags.c                    |   3 +-
 posix/tst-posix_spawn-setsid.c                | 169 +++++++++----
 posix/tst-spawn-chdir.c                       |  15 +-
 posix/tst-spawn.c                             |  24 +-
 posix/tst-spawn.h                             |  36 +++
 posix/tst-spawn2.c                            |  17 +-
 posix/tst-spawn3.c                            | 100 ++++----
 posix/tst-spawn4.c                            |   7 +-
 posix/tst-spawn5.c                            |  14 +-
 posix/tst-spawn6.c                            |  13 +-
 posix/tst-spawn7.c                            |  13 +-
 sysdeps/nptl/_Fork.c                          |   2 +-
 sysdeps/unix/sysv/linux/Makefile              |  29 +++
 sysdeps/unix/sysv/linux/Versions              |   8 +
 sysdeps/unix/sysv/linux/aarch64/libc.abilist  |   6 +
 .../unix/sysv/linux/alpha/kernel-features.h   |   4 +
 sysdeps/unix/sysv/linux/alpha/libc.abilist    |   6 +
 sysdeps/unix/sysv/linux/arc/libc.abilist      |   6 +
 sysdeps/unix/sysv/linux/arch-fork.h           |  16 +-
 sysdeps/unix/sysv/linux/arm/be/libc.abilist   |   6 +
 sysdeps/unix/sysv/linux/arm/clone3.S          |  80 ++++++
 sysdeps/unix/sysv/linux/arm/le/libc.abilist   |   6 +
 sysdeps/unix/sysv/linux/arm/sysdep.h          |   1 +
 sysdeps/unix/sysv/linux/bits/spawn_ext.h      |  71 ++++++
 sysdeps/unix/sysv/linux/bits/unistd_ext.h     |  51 ++++
 sysdeps/unix/sysv/linux/clone-internal.c      |  58 ++++-
 sysdeps/unix/sysv/linux/clone-pidfd-support.c |  60 +++++
 sysdeps/unix/sysv/linux/csky/libc.abilist     |   6 +
 sysdeps/unix/sysv/linux/fork_np.c             |  97 +++++++
 sysdeps/unix/sysv/linux/hppa/libc.abilist     |   6 +
 sysdeps/unix/sysv/linux/i386/libc.abilist     |   6 +
 .../unix/sysv/linux/ia64/kernel-features.h    |   4 +
 sysdeps/unix/sysv/linux/ia64/libc.abilist     |   6 +
 .../sysv/linux/loongarch/lp64/libc.abilist    |   6 +
 .../sysv/linux/m68k/coldfire/libc.abilist     |   6 +
 .../unix/sysv/linux/m68k/m680x0/libc.abilist  |   6 +
 .../sysv/linux/microblaze/be/libc.abilist     |   6 +
 .../sysv/linux/microblaze/le/libc.abilist     |   6 +
 sysdeps/unix/sysv/linux/mips/clone3.S         | 139 +++++++++++
 .../sysv/linux/mips/mips32/fpu/libc.abilist   |   6 +
 .../sysv/linux/mips/mips32/nofpu/libc.abilist |   6 +
 .../sysv/linux/mips/mips64/n32/libc.abilist   |   6 +
 .../sysv/linux/mips/mips64/n64/libc.abilist   |   6 +
 sysdeps/unix/sysv/linux/mips/sysdep.h         |   2 +
 .../unix/sysv/linux/nios2/kernel-features.h   |  24 ++
 sysdeps/unix/sysv/linux/nios2/libc.abilist    |   6 +
 sysdeps/unix/sysv/linux/or1k/libc.abilist     |   6 +
 sysdeps/unix/sysv/linux/pidfd_getpid.c        | 126 ++++++++++
 sysdeps/unix/sysv/linux/pidfd_spawn.c         |  30 +++
 sysdeps/unix/sysv/linux/pidfd_spawnp.c        |  30 +++
 .../linux/powerpc/powerpc32/fpu/libc.abilist  |   6 +
 .../powerpc/powerpc32/nofpu/libc.abilist      |   6 +
 .../linux/powerpc/powerpc64/be/libc.abilist   |   6 +
 .../linux/powerpc/powerpc64/le/libc.abilist   |   6 +
 sysdeps/unix/sysv/linux/procutils.c           |  97 +++++++
 sysdeps/unix/sysv/linux/procutils.h           |  43 ++++
 .../unix/sysv/linux/riscv/rv32/libc.abilist   |   6 +
 .../unix/sysv/linux/riscv/rv64/libc.abilist   |   6 +
 .../unix/sysv/linux/s390/s390-32/libc.abilist |   6 +
 .../unix/sysv/linux/s390/s390-64/libc.abilist |   6 +
 sysdeps/unix/sysv/linux/sh/be/libc.abilist    |   6 +
 sysdeps/unix/sysv/linux/sh/kernel-features.h  |   4 +
 sysdeps/unix/sysv/linux/sh/le/libc.abilist    |   6 +
 .../unix/sysv/linux/sparc/kernel-features.h   |   4 +
 .../sysv/linux/sparc/sparc32/libc.abilist     |   6 +
 .../sysv/linux/sparc/sparc64/libc.abilist     |   6 +
 .../unix/sysv/linux/spawnattr_getcgroup_np.c  |  28 +++
 .../unix/sysv/linux/spawnattr_setcgroup_np.c  |  27 ++
 sysdeps/unix/sysv/linux/spawni.c              |  42 +++-
 sysdeps/unix/sysv/linux/sys/pidfd.h           |   4 +
 sysdeps/unix/sysv/linux/tst-fork_np-cgroup.c  | 170 +++++++++++++
 sysdeps/unix/sysv/linux/tst-fork_np.c         | 236 ++++++++++++++++++
 sysdeps/unix/sysv/linux/tst-pidfd.c           |  48 ++++
 sysdeps/unix/sysv/linux/tst-pidfd_getpid.c    | 126 ++++++++++
 .../sysv/linux/tst-posix_spawn-setsid-pidfd.c |  20 ++
 sysdeps/unix/sysv/linux/tst-spawn-cgroup.c    | 223 +++++++++++++++++
 .../unix/sysv/linux/tst-spawn-chdir-pidfd.c   |  20 ++
 sysdeps/unix/sysv/linux/tst-spawn-pidfd.c     |  20 ++
 sysdeps/unix/sysv/linux/tst-spawn-pidfd.h     |  63 +++++
 sysdeps/unix/sysv/linux/tst-spawn2-pidfd.c    |  20 ++
 sysdeps/unix/sysv/linux/tst-spawn3-pidfd.c    |  20 ++
 sysdeps/unix/sysv/linux/tst-spawn4-pidfd.c    |  20 ++
 sysdeps/unix/sysv/linux/tst-spawn5-pidfd.c    |  20 ++
 sysdeps/unix/sysv/linux/tst-spawn6-pidfd.c    |  20 ++
 sysdeps/unix/sysv/linux/tst-spawn7-pidfd.c    |  20 ++
 .../unix/sysv/linux/x86_64/64/libc.abilist    |   6 +
 .../unix/sysv/linux/x86_64/x32/libc.abilist   |   6 +
 97 files changed, 2947 insertions(+), 267 deletions(-)
 create mode 100644 bits/spawn_ext.h
 create mode 100644 posix/fork-internal.c
 create mode 100644 posix/fork-internal.h
 create mode 100644 posix/tst-spawn.h
 create mode 100644 sysdeps/unix/sysv/linux/arm/clone3.S
 create mode 100644 sysdeps/unix/sysv/linux/bits/spawn_ext.h
 create mode 100644 sysdeps/unix/sysv/linux/clone-pidfd-support.c
 create mode 100644 sysdeps/unix/sysv/linux/fork_np.c
 create mode 100644 sysdeps/unix/sysv/linux/mips/clone3.S
 create mode 100644 sysdeps/unix/sysv/linux/nios2/kernel-features.h
 create mode 100644 sysdeps/unix/sysv/linux/pidfd_getpid.c
 create mode 100644 sysdeps/unix/sysv/linux/pidfd_spawn.c
 create mode 100644 sysdeps/unix/sysv/linux/pidfd_spawnp.c
 create mode 100644 sysdeps/unix/sysv/linux/procutils.c
 create mode 100644 sysdeps/unix/sysv/linux/procutils.h
 create mode 100644 sysdeps/unix/sysv/linux/spawnattr_getcgroup_np.c
 create mode 100644 sysdeps/unix/sysv/linux/spawnattr_setcgroup_np.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-fork_np-cgroup.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-fork_np.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-pidfd_getpid.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-posix_spawn-setsid-pidfd.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-spawn-cgroup.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-spawn-chdir-pidfd.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-spawn-pidfd.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-spawn-pidfd.h
 create mode 100644 sysdeps/unix/sysv/linux/tst-spawn2-pidfd.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-spawn3-pidfd.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-spawn4-pidfd.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-spawn5-pidfd.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-spawn6-pidfd.c
 create mode 100644 sysdeps/unix/sysv/linux/tst-spawn7-pidfd.c

Comments

Rich Felker Aug. 18, 2023, 5:51 p.m. UTC | #1
On Fri, Aug 18, 2023 at 11:06:35AM -0300, Adhemerval Zanella via Libc-alpha wrote:
> The glibc 2.36 added wrappers for Linux syscall pidfd_open, pidfd_getfd,
> and pidfd_send_signal, and exported the P_PIDFD to use along with
> waitid. The pidfd is a race-free interface, however, the pidfd_open is
> subject to TOCTOU if the file descriptor is not obtained directly from
> the clone or clone3 syscall (there is still a small window between the
> clone return and the pidfd_getfd where the process can be reaped and the
> process ID reused).

Unless I'm missing something, that window is purely programmer error.
The pid belongs to the parent process, that called fork, posix_spawn,
clone, or whatever, and is responsible for not freeing it until it's
done using it.

Yes this can happen if you install a SIGCHLD handler that reaps
anything it sees, or if you're calling wait without a pid. This is
programming error. If you're stuck with code outside your control that
makes that mistake, you can already avoid it with clone by setting the
child exit signal to 0 rather than SIGCHLD. But it's best just not to
do that.

IMO making a new, complex, highly nonstandard interface to work around
a problem that's programmer error, and getting this nonstandard and
nonportable pattern into mainstream software, has negative value.

Rich
Adhemerval Zanella Netto Aug. 18, 2023, 6:34 p.m. UTC | #2
On 18/08/23 14:51, Rich Felker wrote:
> On Fri, Aug 18, 2023 at 11:06:35AM -0300, Adhemerval Zanella via Libc-alpha wrote:
>> The glibc 2.36 added wrappers for Linux syscall pidfd_open, pidfd_getfd,
>> and pidfd_send_signal, and exported the P_PIDFD to use along with
>> waitid. The pidfd is a race-free interface, however, the pidfd_open is
>> subject to TOCTOU if the file descriptor is not obtained directly from
>> the clone or clone3 syscall (there is still a small window between the
>> clone return and the pidfd_getfd where the process can be reaped and the
>> process ID reused).
> 
> Unless I'm missing something, that window is purely programmer error.
> The pid belongs to the parent process, that called fork, posix_spawn,
> clone, or whatever, and is responsible for not freeing it until it's
> done using it.
> 
> Yes this can happen if you install a SIGCHLD handler that reaps
> anything it sees, or if you're calling wait without a pid. This is
> programming error. If you're stuck with code outside your control that
> makes that mistake, you can already avoid it with clone by setting the
> child exit signal to 0 rather than SIGCHLD. But it's best just not to
> do that.
> 

Yes, this is the issue GNOME is having with their code base [1] and that
motivated this new interface.  Systemd also seems to be interested in 
these interface, although I am not sure if it is also subject to same
issue.

I don't have a strong opinion whether this should be considered a solid
reason to provide a new API, another option would to close BZ#30349 [2] 
as wontfix with this rationale.  However, this does not really provide 
an workaround, and worse it will pass the idea that to fully resolve it 
you will need either to allow the racy condition or issue clone directly.

> IMO making a new, complex, highly nonstandard interface to work around
> a problem that's programmer error, and getting this nonstandard and
> nonportable pattern into mainstream software, has negative value.

Although I see this makes sense for the pidfd_spawn, there is still the
BZ 26371 requirement for cgroupv2 support.  This would required at least
something analogous to posix_spawnattr_{get,set}cgroup_np.  I am not
sure about if my fork_np suggestion is really required here, although
the idea is to allow a more extensible fork interface for possible future
clone support.

Florian and Lennart seems to lean for a clone3 similar to the clone
interface, which I really would like to avoid.

[1] https://gitlab.gnome.org/GNOME/glib/-/issues/1866
[2] https://sourceware.org/bugzilla/show_bug.cgi?id=30349
Florian Weimer Aug. 21, 2023, 6:53 a.m. UTC | #3
* Rich Felker:

> On Fri, Aug 18, 2023 at 11:06:35AM -0300, Adhemerval Zanella via Libc-alpha wrote:
>> The glibc 2.36 added wrappers for Linux syscall pidfd_open, pidfd_getfd,
>> and pidfd_send_signal, and exported the P_PIDFD to use along with
>> waitid. The pidfd is a race-free interface, however, the pidfd_open is
>> subject to TOCTOU if the file descriptor is not obtained directly from
>> the clone or clone3 syscall (there is still a small window between the
>> clone return and the pidfd_getfd where the process can be reaped and the
>> process ID reused).
>
> Unless I'm missing something, that window is purely programmer error.
> The pid belongs to the parent process, that called fork, posix_spawn,
> clone, or whatever, and is responsible for not freeing it until it's
> done using it.
>
> Yes this can happen if you install a SIGCHLD handler that reaps
> anything it sees, or if you're calling wait without a pid. This is
> programming error. If you're stuck with code outside your control that
> makes that mistake, you can already avoid it with clone by setting the
> child exit signal to 0 rather than SIGCHLD. But it's best just not to
> do that.

I think clone3 with exit_signal set to 0 and CLONE_PIDFD allows the
creation of subprocesses that are difficult to observe by accident from
the rest of the process, while obtaining a stable identifier for the
process.  I do not think there is any other way to achieve that.  I
think it's desirable to expose this functionality in some way.

Thanks,
Florian
Rich Felker Aug. 21, 2023, 1:55 p.m. UTC | #4
On Mon, Aug 21, 2023 at 08:53:53AM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> > On Fri, Aug 18, 2023 at 11:06:35AM -0300, Adhemerval Zanella via Libc-alpha wrote:
> >> The glibc 2.36 added wrappers for Linux syscall pidfd_open, pidfd_getfd,
> >> and pidfd_send_signal, and exported the P_PIDFD to use along with
> >> waitid. The pidfd is a race-free interface, however, the pidfd_open is
> >> subject to TOCTOU if the file descriptor is not obtained directly from
> >> the clone or clone3 syscall (there is still a small window between the
> >> clone return and the pidfd_getfd where the process can be reaped and the
> >> process ID reused).
> >
> > Unless I'm missing something, that window is purely programmer error.
> > The pid belongs to the parent process, that called fork, posix_spawn,
> > clone, or whatever, and is responsible for not freeing it until it's
> > done using it.
> >
> > Yes this can happen if you install a SIGCHLD handler that reaps
> > anything it sees, or if you're calling wait without a pid. This is
> > programming error. If you're stuck with code outside your control that
> > makes that mistake, you can already avoid it with clone by setting the
> > child exit signal to 0 rather than SIGCHLD. But it's best just not to
> > do that.
> 
> I think clone3 with exit_signal set to 0 and CLONE_PIDFD allows the
> creation of subprocesses that are difficult to observe by accident from
> the rest of the process, while obtaining a stable identifier for the
> process.  I do not think there is any other way to achieve that.  I
> think it's desirable to expose this functionality in some way.

Indeed that seems like useful functionality to expose for cases where
you can't fix some bad code, but there are lots of issues with how
clone3 (and even clone) should behave with respect to the child
environment when you don't exec -- is it _Fork-like or fork-like? Can
you use AS-unsafe interfaces in the child of a MT parent? Etc. This
should all be discussed on libc-coord not libc-alpha, IMO.

Independent of that, though, I'd like to focus on the fact that
randomly reaping children with no regard to what part of your process
owned those pids has been wrong and broken practice for something like
4 decades. Just like looping over a range of fds and closing
everything is broken. It's a type of UAF bug and it's not a pattern we
suddenly need to support because systemd or whoever says so. It's a
bug that's easily detected with static analysis (any calls to wait or
to waitpid/waitid without the pid specified, or SIGCHLD handlers) and
usually easy to fix -- and then the fix works on every POSIX-ish
platform that's existed, pretty much ever, not just on GNU/Linux.

Rich
Florian Weimer Aug. 24, 2023, 7:25 a.m. UTC | #5
* Rich Felker:

> On Mon, Aug 21, 2023 at 08:53:53AM +0200, Florian Weimer wrote:
>> I think clone3 with exit_signal set to 0 and CLONE_PIDFD allows the
>> creation of subprocesses that are difficult to observe by accident from
>> the rest of the process, while obtaining a stable identifier for the
>> process.  I do not think there is any other way to achieve that.  I
>> think it's desirable to expose this functionality in some way.
>
> Indeed that seems like useful functionality to expose for cases where
> you can't fix some bad code, but there are lots of issues with how

It's more about providing a confirming implementation of function in the
wait family in case the implementation needs to create processes for
some reason.  It's not about “bad code”, these are standard POSIX
interfaces.

> clone3 (and even clone) should behave with respect to the child
> environment when you don't exec -- is it _Fork-like or fork-like? Can
> you use AS-unsafe interfaces in the child of a MT parent? Etc. This
> should all be discussed on libc-coord not libc-alpha, IMO.

Seems more like linux-api material, given that it's Linux-specific?

Thanks,
Florian
Rich Felker Aug. 24, 2023, 12:21 p.m. UTC | #6
On Thu, Aug 24, 2023 at 09:25:05AM +0200, Florian Weimer wrote:
> * Rich Felker:
> 
> > On Mon, Aug 21, 2023 at 08:53:53AM +0200, Florian Weimer wrote:
> >> I think clone3 with exit_signal set to 0 and CLONE_PIDFD allows the
> >> creation of subprocesses that are difficult to observe by accident from
> >> the rest of the process, while obtaining a stable identifier for the
> >> process.  I do not think there is any other way to achieve that.  I
> >> think it's desirable to expose this functionality in some way.
> >
> > Indeed that seems like useful functionality to expose for cases where
> > you can't fix some bad code, but there are lots of issues with how
> 
> It's more about providing a confirming implementation of function in the
> wait family in case the implementation needs to create processes for
> some reason.  It's not about “bad code”, these are standard POSIX
> interfaces.

You're talking about the implementation (exit_signal=0 is very useful
for this) and I thought we were talking about interfaces to expose to
the application (where yes calling wait is a standard interface you
can use, but the standard consequence of "bad code" doing that when
another part of the program might have pids it doesn't want reaped is
that things blow up and you get to keep both pieces).

> > clone3 (and even clone) should behave with respect to the child
> > environment when you don't exec -- is it _Fork-like or fork-like? Can
> > you use AS-unsafe interfaces in the child of a MT parent? Etc. This
> > should all be discussed on libc-coord not libc-alpha, IMO.
> 
> Seems more like linux-api material, given that it's Linux-specific?

Perhaps.

Rich
Luca Boccassi Aug. 28, 2023, 12:52 p.m. UTC | #7
> On 18/08/23 14:51, Rich Felker wrote:
> > On Fri, Aug 18, 2023 at 11:06:35AM -0300, Adhemerval Zanella via
> Libc-alpha wrote:
> >> The glibc 2.36 added wrappers for Linux syscall pidfd_open,
> pidfd_getfd,
> >> and pidfd_send_signal, and exported the P_PIDFD to use along with
> >> waitid. The pidfd is a race-free interface, however, the
> pidfd_open is
> >> subject to TOCTOU if the file descriptor is not obtained directly
> from
> >> the clone or clone3 syscall (there is still a small window between
> the
> >> clone return and the pidfd_getfd where the process can be reaped
> and the
> >> process ID reused).
> > 
> > Unless I'm missing something, that window is purely programmer
> error.
> > The pid belongs to the parent process, that called fork,
> posix_spawn,
> > clone, or whatever, and is responsible for not freeing it until
> it's
> > done using it.
> > 
> > Yes this can happen if you install a SIGCHLD handler that reaps
> > anything it sees, or if you're calling wait without a pid. This is
> > programming error. If you're stuck with code outside your control
> that
> > makes that mistake, you can already avoid it with clone by setting
> the
> > child exit signal to 0 rather than SIGCHLD. But it's best just not
> to
> > do that.
> > 
> 
> Yes, this is the issue GNOME is having with their code base [1] and
> that
> motivated this new interface.  Systemd also seems to be interested in
> these interface, although I am not sure if it is also subject to same
> issue.
> 
> I don't have a strong opinion whether this should be considered a
> solid
> reason to provide a new API, another option would to close BZ#30349
> [2] 
> as wontfix with this rationale.  However, this does not really
> provide 
> an workaround, and worse it will pass the idea that to fully resolve
> it 
> you will need either to allow the racy condition or issue clone
> directly.

These are real race conditions, that cannot be solved otherwise,
characterizing them as 'programming errors' is very misleading and
wrong.

We very much need both of those interfaces in systemd, and fully intend
to use them as soon as they are available. We are slowly moving towards
using pidfds everywhere to be able to do end-to-end race-free process
tracking and management, and these are fundamental pieces for this
effort. From what I can read the GNOME developers feel the same way,
and I wouldn't be surprised if QT followed suit too given what you
mentioned in the cover letter.

Surely implementing useful, core functionality for the direct and
immediate benefit of 3 major Linux projects is a reason as solid as you
could ever find to add a new interface.
Florian Weimer Aug. 28, 2023, 1:21 p.m. UTC | #8
* Luca Boccassi:

> These are real race conditions, that cannot be solved otherwise,
> characterizing them as 'programming errors' is very misleading and
> wrong.
>
> We very much need both of those interfaces in systemd, and fully intend
> to use them as soon as they are available. We are slowly moving towards
> using pidfds everywhere to be able to do end-to-end race-free process
> tracking and management, and these are fundamental pieces for this
> effort. From what I can read the GNOME developers feel the same way,
> and I wouldn't be surprised if QT followed suit too given what you
> mentioned in the cover letter.
>
> Surely implementing useful, core functionality for the direct and
> immediate benefit of 3 major Linux projects is a reason as solid as you
> could ever find to add a new interface.

I see value in adding fork support, too.

The fundamental issue with the fork part is that it's not future-proof
at all.  The programming model is completely different from the kernel
interface.

If I start a discussion about API alternatives, who should I Cc: except
you and Rich?

Thanks,
Florian
Luca Boccassi Aug. 28, 2023, 1:50 p.m. UTC | #9
On Mon, 28 Aug 2023 at 14:21, Florian Weimer <fweimer@redhat.com> wrote:
>
> * Luca Boccassi:
>
> > These are real race conditions, that cannot be solved otherwise,
> > characterizing them as 'programming errors' is very misleading and
> > wrong.
> >
> > We very much need both of those interfaces in systemd, and fully intend
> > to use them as soon as they are available. We are slowly moving towards
> > using pidfds everywhere to be able to do end-to-end race-free process
> > tracking and management, and these are fundamental pieces for this
> > effort. From what I can read the GNOME developers feel the same way,
> > and I wouldn't be surprised if QT followed suit too given what you
> > mentioned in the cover letter.
> >
> > Surely implementing useful, core functionality for the direct and
> > immediate benefit of 3 major Linux projects is a reason as solid as you
> > could ever find to add a new interface.
>
> I see value in adding fork support, too.
>
> The fundamental issue with the fork part is that it's not future-proof
> at all.  The programming model is completely different from the kernel
> interface.

Sure that would be nice too, it is less urgent for us as we are moving
towards spawn(), so it seems fine if that is handled separately from
this series for me.

> If I start a discussion about API alternatives, who should I Cc: except
> you and Rich?

Nobody else for now on my side, thanks.