[RFC,00/13] nommu UML

Message ID	cover.1729770373.git.thehajime@gmail.com
State	RFC

Headers

From: Hajime Tazaki <thehajime@gmail.com>
To: linux-um@lists.infradead.org,
	jdike@addtoit.com,
	richard@nod.at,
	anton.ivanov@cambridgegreys.com,
	johannes@sipsolutions.net
Cc: thehajime@gmail.com,
	ricarkol@google.com
Subject: [RFC PATCH 00/13] nommu UML
Date: Thu, 24 Oct 2024 21:09:08 +0900
Message-ID: <cover.1729770373.git.thehajime@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
Sender: "linux-um" <linux-um-bounces@lists.infradead.org>
Errors-To: linux-um-bounces+incoming=patchwork.ozlabs.org@lists.infradead.org

Message

Hajime Tazaki Oct. 24, 2024, 12:09 p.m. UTC

This patch series adds a nommu mode to UML.  Comments and opinions
on it are welcome.

We have already found several limitations/issues; here is the list.

- the prompt configured in /etc/profile is broken (variables are not
  expanded: ${HOSTNAME%%.*}:$PWD#)
- there is no mechanism to cache the mapped memory of exec(2), so
  files are always read from the filesystem on every exec, which makes
  some benchmarks (lmbench) slow.
- a crash in a userspace program crashes the UML kernel instead of
  delivering SIGSEGV to the program.
- commit c27e618 (merged during v6.12-rc1), which updates the internal
  procedure of the maple_tree subsystem, introduces an invalid access
  to a vma structure in our case.  We are trying to fix the issue, but
  a random process still crashes on exit(2).

UML has been built with CONFIG_MMU since day 0.  This feature
introduces a nommu mode from a different angle than what the Linux
Kernel Library tried.


What is it for ?
================

- Alleviate the overhead of the syscall hook implemented with ptrace(2)
- Exercise nommu code over UML (and over KUnit)
- Less dependency on host facilities


How it works ?
==============

To illustrate how this feature works, the following shows how syscalls
are called in the nommu UML environment.

- boot the kernel and set up the zpoline trampoline code (detailed
  later) at address 0x0
- (userspace starts)
- userspace calls the vfork/execve syscalls
- during execve, more specifically in the load_elf_fdpic_binary()
  function, the kernel replaces `syscall/sysenter` instructions with
  `call *%rax`; %rax holds the syscall number, so the call usually
  targets an address between 0 and NR_syscalls (around 512), where the
  trampoline code was installed during startup.
- when userspace issues a syscall, it jumps to *%rax, slides through
  the `nop` instructions until they end, and jumps to the hooked
  function, `__kernel_vsyscall`, which is the syscall entry point in
  the nommu UML environment.
- the handler function in sys_call_table[] is called, following how
  UML syscalls normally work.
- return to userspace
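
The translation step above can be sketched in user space as follows.
This is only a minimal illustration in Python, not the code in
arch/x86/um/zpoline.c: the function names and NR_SYSCALLS = 512 are
assumptions made for the example, and the naive two-byte scan would
also match byte pairs that sit inside longer instructions or data,
which a real implementation has to rule out by decoding instructions.

```python
# x86_64 opcode bytes.
SYSCALL = bytes([0x0F, 0x05])    # syscall
SYSENTER = bytes([0x0F, 0x34])   # sysenter
CALL_RAX = bytes([0xFF, 0xD0])   # call *%rax
NOP = 0x90

NR_SYSCALLS = 512  # assumed upper bound on syscall numbers


def build_trampoline(nr_syscalls=NR_SYSCALLS):
    """Return a nop sled of nr_syscalls bytes, meant for address 0.

    A `call *%rax` with %rax = syscall number lands inside the sled and
    slides to whatever follows it (the jump to the syscall entry point,
    which is omitted in this sketch).
    """
    return bytes([NOP]) * nr_syscalls


def translate(text):
    """Replace every syscall/sysenter byte pair in text with call *%rax.

    Returns the patched bytes and the list of patched offsets.
    """
    patched = bytearray(text)
    offsets = []
    i = 0
    while i + 1 < len(patched):
        if bytes(patched[i:i + 2]) in (SYSCALL, SYSENTER):
            patched[i:i + 2] = CALL_RAX
            offsets.append(i)
            i += 2
        else:
            i += 1
    return bytes(patched), offsets


# mov $39, %eax ; syscall ; ret   (39 is getpid on x86_64)
code = bytes([0xB8, 0x27, 0x00, 0x00, 0x00, 0x0F, 0x05, 0xC3])
patched, offsets = translate(code)
```

All three instructions are two bytes long, so the replacement can be
done in place without moving any of the surrounding code.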


What are the differences from MMU-full UML ?
============================================

The current nommu implementation adds three features which MMU-full
UML doesn't have:

- kernel address space is directly accessible from userspace
  - so, uaccess() always returns 1
  - generic implementations of memcpy/strcpy/futex are also used
- an alternate syscall entry point that does not use ptrace
- translation of syscall/sysenter instructions into trampoline code
  and syscall hooks

With those modifications, we can use unmodified userspace binaries
with nommu UML.


History
=======

This feature was originally introduced by Ricardo Koller at Open
Source Summit NA 2020, and was then integrated with the syscall
translation functionality, along with a cleanup of the original code.

Building and running
====================

```
% make ARCH=um x86_64_nommu_defconfig
% make ARCH=um
```

will build UML with CONFIG_MMU=n applied.

KUnit tests can be run with the following command:

```
% ./tools/testing/kunit/kunit.py run --kconfig_add CONFIG_MMU=n
```

To run a typical Linux distribution, we need a nommu-aware userspace.
We can use a stock version of Alpine Linux with nommu builds of
busybox and musl-libc.


Preparing root filesystem
=========================

nommu UML requires a standard library which is aware of nommu
kernels.  We have tested custom-built musl-libc and busybox, both of
which have built-in support for nommu kernels.

There are no Linux distributions available for nommu on the x86_64
architecture, so we need to prepare our own image for the root
filesystem.  We use Alpine Linux as a base distribution and replace
busybox and musl-libc on top of that.  The following steps prepare
the filesystem for a quick start.

```
     # export the filesystem of the prebuilt container image as a tarball
     container_id=$(docker create ghcr.io/thehajime/alpine:3.20.3-um-nommu)
     docker start "$container_id"
     docker wait "$container_id"
     docker export "$container_id" > alpine.tar
     docker rm "$container_id"

     # create a sparse 1G file, format it as ext4, and unpack the tarball
     mnt=$(mktemp -d)
     dd if=/dev/zero of=alpine.ext4 bs=1 count=0 seek=1G
     sudo chmod og+wr "alpine.ext4"
     yes 2>/dev/null | mkfs.ext4 "alpine.ext4" || true
     sudo mount "alpine.ext4" "$mnt"
     sudo tar -xf alpine.tar -C "$mnt"
     sudo umount "$mnt"
```

This will create a file image, `alpine.ext4`, which contains the nommu
builds of busybox and musl on the Alpine Linux root filesystem.  The
file can be passed to the UML command line with the `ubd0=` argument.

```
  ./vmlinux eth0=tuntap,tap100,0e:fd:0:0:0:1,172.17.0.1 ubd0=./alpine.ext4 rw mem=1024m loglevel=8 init=/sbin/init
```

We plan to upstream apk packages for busybox and musl so that we can
follow the proper procedure to set up the root filesystem.


Quick start with docker
=======================

There is a docker image which lets you get started with a single step.

```
  docker run -it -v /dev/shm:/dev/shm --rm ghcr.io/thehajime/alpine:3.20.3-um-nommu
```

This will launch a UML instance with a pre-configured root filesystem.

Benchmark
=========

The following shows an example of performance measurements conducted
with lmbench and a (self-crafted) getpid benchmark (with the v6.12-rc3
linus tree).

### lmbench (usec)

|benchmark|native|um|um-nommu|
|--|--|--|--|
|select-10    |0.5645|28.3738|0.2647|
|select-100   |2.3872|28.8385|1.1021|
|select-1000  |20.5527|37.6364|9.4264|
|syscall      |0.1735|26.8711|0.1037|
|read         |0.3442|28.5771|0.1370|
|write        |0.2862|28.7340|0.1236|
|stat         |1.9236|38.5928|0.4640|
|open/close   |3.8308|66.8451|0.7789|
|fork+sh      |1176.4444|8221.5000|21443.0000|
|fork+execve  |533.1053|3034.5000|4894.3333|

### do_getpid bench (nsec)

|benchmark|native|um|um-nommu|
|--|--|--|--|
|getpid | 180 | 31579 | 101|
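
To make the numbers above easier to compare, the ratios can be
computed directly from the tables.  The following is a small Python
sketch using a few rows copied from the tables; the variable and
function names are ours and only serve the example.

```python
# (native, um, um-nommu); lmbench rows are in usec, getpid is in nsec.
results = {
    "syscall": (0.1735, 26.8711, 0.1037),
    "read": (0.3442, 28.5771, 0.1370),
    "fork+sh": (1176.4444, 8221.5000, 21443.0000),
    "fork+execve": (533.1053, 3034.5000, 4894.3333),
    "getpid": (180, 31579, 101),
}


def um_over_nommu(name):
    """Ratio of MMU-full UML time to nommu UML time.

    A value above 1 means nommu UML is faster, below 1 means slower.
    """
    _native, um, nommu = results[name]
    return um / nommu


for name in results:
    print(f"{name:12s} um/um-nommu = {um_over_nommu(name):8.2f}")
```

The syscall-bound rows come out roughly 200 to 300 times faster with
nommu UML, while fork+sh and fork+execve give ratios below 1, i.e.
nommu UML is slower there, which matches the missing exec(2) cache
listed among the known issues.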


Limitations
===========

generic nommu limitations
-------------------------
Since this port is a nommu kernel, the implementation inherits the
characteristics of other nommu kernels (riscv, arm, etc.), described
below.

- vfork(2) should be used instead of fork(2)
- ELF loader only loads PIE (position independent executable) binaries
- all processes share a single address space
- mmap(2) offers a subset of functionalities (e.g., MAP_FIXED is
  unsupported)
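
For userspace code that launches helper programs, the first item means
avoiding fork(2)-based process creation.  Below is a minimal Python
sketch of the idea.  It assumes a Python interpreter is available in
the target environment (it is not part of the image described here),
and that the C library implements posix_spawn(3) without a plain
fork(2); the latter is an assumption about the libc, not something
this series guarantees.  The helper name `run_without_fork` is ours.

```python
import os
import sys


def run_without_fork(argv):
    """Start argv[0] via posix_spawn(3) and return its exit status."""
    pid = os.posix_spawn(argv[0], argv, os.environ)
    _pid, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)


# Spawn the current interpreter with a trivial script.
exit_code = run_without_fork([sys.executable, "-c", "pass"])
```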

Thus, the options for userspace programs are limited.  We have tested
Alpine Linux with musl-libc, which supports nommu kernels.

access to mmap_min_addr
-----------------------
As the syscall translation mechanism relies on the ability to
read/write memory at address zero (0x0), we need to configure the host
kernel with the following command (as root):

```
% sh -c "echo 0 > /proc/sys/vm/mmap_min_addr"
```
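
Before starting UML, the host setting can be checked with a few lines
of Python.  This is only a convenience sketch; the function names are
made up for the example, and the check reads the same sysctl file that
the command above writes.

```python
from pathlib import Path

SYSCTL = Path("/proc/sys/vm/mmap_min_addr")


def allows_zero_page(value):
    """True if the sysctl text `value` permits mapping address 0."""
    return int(value.strip()) == 0


def host_allows_zero_page(path=SYSCTL):
    """Read the host setting; only meaningful on a Linux host."""
    return allows_zero_page(path.read_text())
```

Calling `host_allows_zero_page()` on the host returns True once the
sysctl has been set to 0 as shown above.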

supported architecture
----------------------
The current implementation of nommu UML only works with the x86_64
SUBARCH.  We have not tested it in a 32-bit environment.

target of syscall translation
-----------------------------
For the moment, the syscall translation only applies to the executable
and the interpreter of the ELF binary files processed by the execve(2)
syscall: other libraries, such as linked libraries and dlopen-ed ones,
aren't translated; we may be able to trigger the translation with
LD_PRELOAD.

Note that with musl-libc in Alpine Linux, which we have tested, most
syscalls are implemented in the interpreter file (ld-musl-x86_64.so),
so syscall/sysenter instructions issued from linked/loaded libraries
might be rare.  It is still possible, though, so a workaround with
LD_PRELOAD is effective.


Further reading about NOMMU UML
===============================

- NOMMU UML (original code by Ricardo Koller)
https://static.sched.com/hosted_files/ossna2020/ec/kollerr_linux_um_nommu.pdf

- zpoline: syscall translation mechanism
https://www.usenix.org/conference/atc23/presentation/yasukata

Please review the following changes for suitability for inclusion. If you have
any objections or suggestions for improvement, please respond to the patches. If
you agree with the changes, please provide your Acked-by.

The following changes since commit c2ee9f594da826bea183ed14f2cc029c719bf4da:

  KVM: selftests: Fix build on on non-x86 architectures (2024-10-21 15:49:33 -0700)

are available in the Git repository at:

  https://github.com/thehajime/linux 82a7ee8b31c51edb47e144922581824a3b5e371d
  https://github.com/thehajime/linux/tree/um-nommu-v6.12-rc4-rfc

Hajime Tazaki (13):
  fs: binfmt_elf_efpic: add architecture hook elf_arch_finalize_exec
  x86/um: nommu: elf loader for fdpic
  um: nommu: memory handling
  x86/um: nommu: syscall handling
  x86/um: nommu: syscall translation by zpoline
  x86/um: nommu: process/thread handling
  um: nommu: configure fs register on host syscall invocation
  x86/um/vdso: nommu: vdso memory update
  x86/um: nommu: signal handling
  x86/um: nommu: stack save/restore on vfork
  um: change machine name for uname output
  um: nommu: add documentation of nommu UML
  um: nommu: plug nommu code into build system

 Documentation/virt/uml/nommu-uml.rst    | 219 +++++++++++++++++++++++
 arch/um/Kconfig                         |  13 +-
 arch/um/Makefile                        |   6 +
 arch/um/configs/x86_64_nommu_defconfig  |  64 +++++++
 arch/um/include/asm/futex.h             |   4 +
 arch/um/include/asm/mmu.h               |   8 +
 arch/um/include/asm/mmu_context.h       |  14 +-
 arch/um/include/asm/ptrace-generic.h    |  17 ++
 arch/um/include/asm/tlbflush.h          |  23 ++-
 arch/um/include/asm/uaccess.h           |   7 +-
 arch/um/include/shared/common-offsets.h |   3 +
 arch/um/include/shared/os.h             |   9 +
 arch/um/kernel/Makefile                 |   3 +-
 arch/um/kernel/exec.c                   |   8 +
 arch/um/kernel/mem.c                    |  13 ++
 arch/um/kernel/physmem.c                |   6 +
 arch/um/kernel/process.c                |  34 +++-
 arch/um/kernel/skas/Makefile            |   3 +-
 arch/um/kernel/trap.c                   |   4 +
 arch/um/os-Linux/main.c                 |   5 +
 arch/um/os-Linux/process.c              |  22 +++
 arch/um/os-Linux/skas/process.c         |   4 +
 arch/um/os-Linux/start_up.c             |  47 +++++
 arch/um/os-Linux/time.c                 |   3 +-
 arch/um/os-Linux/util.c                 |   3 +-
 arch/x86/um/Makefile                    |  18 ++
 arch/x86/um/asm/elf.h                   |  12 +-
 arch/x86/um/asm/module.h                |  19 +-
 arch/x86/um/asm/processor.h             |  12 ++
 arch/x86/um/do_syscall_64.c             | 113 ++++++++++++
 arch/x86/um/entry_64.S                  | 110 ++++++++++++
 arch/x86/um/shared/sysdep/syscalls_64.h |   4 +
 arch/x86/um/signal.c                    |  26 +++
 arch/x86/um/syscalls_64.c               |  67 +++++++
 arch/x86/um/vdso/um_vdso.c              |  20 +++
 arch/x86/um/vdso/vma.c                  |  16 +-
 arch/x86/um/zpoline.c                   | 228 ++++++++++++++++++++++++
 fs/Kconfig.binfmt                       |   2 +-
 fs/binfmt_elf_fdpic.c                   |  10 ++
 39 files changed, 1164 insertions(+), 35 deletions(-)
 create mode 100644 Documentation/virt/uml/nommu-uml.rst
 create mode 100644 arch/um/configs/x86_64_nommu_defconfig
 create mode 100644 arch/x86/um/do_syscall_64.c
 create mode 100644 arch/x86/um/entry_64.S
 create mode 100644 arch/x86/um/zpoline.c

Comments

Benjamin Berg Oct. 26, 2024, 10:19 a.m. UTC | #1

Hi,

On Thu, 2024-10-24 at 21:09 +0900, Hajime Tazaki wrote:
> This is a series of patches of nommu arch addition to UML.  It would
> be nice to ask comments/opinions on this.
> 
> There are several limitations/issues which we already found; here is
> the list of those issues.
> 
> - prompt configured with /etc/profile is broken (variables are not
>   expanded, ${HOSTNAME%%.*}:$PWD#)
> - there are no mechanism implemented to cache for mapped memory of
>   exec(2) thus, always read files from filesystem upon every exec,
>   which makes slow on some benchmark (lmbench).
> - a crash on userspace programs crashes a UML kernel, not signaling
>   with SIGSEGV to the program.
> - commit c27e618 (during v6.12-rc1 merge) introduces invalid access to
>   a vma structure for our case, which updates the internal procedure
>   of maple_tree subsystem.  We're trying to fix issue but still a
>   random process on exit(2) crashes.

Btw. are you handling FP register save/restore? If it is not there, it
probably would not be too hard to add (XSAVE, etc.), though it might
add a bit of additional overhead. Especially as UML always saves the FP
state rather than optimizing it like the x86 architectures.


I am a bit confused overall. I mean, zpoline seems kind of neat, but a
requirement on patching userspace code also seems like a lot.

To me, it seems much more natural to catch the userspace syscalls using
a SECCOMP filter[1]. While quite a lot slower, that should be much more
portable across architectures. For improved speed one could still do
architecture specific things inside the vDSO or by using zpoline. But
those would then "just" be optimizations and unpatched code would still
work correctly (e.g. JIT).

For me, a big argument in favour of such an approach is its simplicity.
I am mostly basing that on the fact that this patchset should properly
handle other signals like SIGFPE and SIGSEGV. And, once it does that,
you will already have all the infrastructure to do the correct register
save/restore using the host mcontext, which is what is needed in the
SIGSYS handler when using SECCOMP. The filter itself should be simple
as it just needs to catch all syscalls within valid userspace
executable memory[2] ranges.

Benjamin

[1] Maybe not surprising, as I have been working on a SECCOMP based UML
that does not require ptrace.
[2] I am assuming that userspace executable code is already confined to
a certain address space within the UML process. Obviously, the kernel
itself and loaded modules need to be free to do host syscalls and
should not be affected by the SECCOMP filter.
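
As an illustration of the filter logic described in this reply (it is
not code from either patchset), the decision can be modelled in a few
lines of Python.  A real filter would be a BPF program that inspects
the `instruction_pointer` field of `struct seccomp_data` and returns
SECCOMP_RET_TRAP or SECCOMP_RET_ALLOW; the names, the range layout and
the example addresses below are assumptions made for the sketch.

```python
TRAP = "trap"    # deliver SIGSYS so that UML can emulate the syscall
ALLOW = "allow"  # let the host kernel execute the syscall


def filter_decision(instruction_pointer, userspace_ranges):
    """Decide what to do with a syscall issued at instruction_pointer.

    userspace_ranges is a list of (start, end) half-open address ranges
    holding guest userspace code; everything else (the UML kernel and
    its modules) may issue host syscalls directly.
    """
    for start, end in userspace_ranges:
        if start <= instruction_pointer < end:
            return TRAP
    return ALLOW


# Hypothetical layout: one userspace code range.
ranges = [(0x10000000, 0x20000000)]
decision = filter_decision(0x10001234, ranges)
```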




Hajime Tazaki Oct. 27, 2024, 9:10 a.m. UTC | #2

Hello Benjamin,

Thank you for taking the time to look at this.

On Sat, 26 Oct 2024 19:19:08 +0900,
Benjamin Berg wrote:

> > - a crash on userspace programs crashes a UML kernel, not signaling
> >   with SIGSEGV to the program.
> > - commit c27e618 (during v6.12-rc1 merge) introduces invalid access to
> >   a vma structure for our case, which updates the internal procedure
> >   of maple_tree subsystem.  We're trying to fix issue but still a
> >   random process on exit(2) crashes.
> 
> Btw. are you handling FP register save/restore? If it is not there, it
> probably would not be too hard to add (XSAVE, etc.), though it might
> add a bit of additional overhead. Especially as UML always saves the FP
> state rather than optimizing it like the x86 architectures.

The patch handles the fp registers on syscall entry/exit; the [07/13]
patch contains this part.

I'm not familiar with that, but what kind of optimizations does the
x86 architecture do for fp register handling?

> I am a bit confused overall. I mean, zpoline seems kind of neat, but a
> requirement on patching userspace code also seems like a lot.
> 
> To me, it seems much more natural to catch the userspace syscalls using
> a SECCOMP filter[1]. While quite a lot slower, that should be much more
> portable across architectures. For improved speed one could still do
> architecture specific things inside the vDSO or by using zpoline. But
> those would then "just" be optimizations and unpatched code would still
> work correctly (e.g. JIT).

I'm not proposing this patch to replace the existing UML
implementations; for instance, the patchset cannot run the CONFIG_MMU
code in the kernel tree, so the existing ptrace-based implementation
still has real use cases.  And the ptrace-based syscall hook is indeed
not fast, so the improvement from using a seccomp filter instead
clearly has benefits.  I think it's independent of this patchset.

So I think this patchset can exist in parallel while your seccomp
patches are in review.

btw, though I mentioned in a different reply that JIT-generated code
is not currently handled, it can be implemented as an extension to
this patchset; the original implementation of zpoline is now able to
patch JIT-generated code as well.

https://github.com/yasukata/zpoline/pull/20/commits/c42af16757ad3fcdf7084c9f2139bb9105796873

It is not implemented for the moment.

In terms of portability, the basic idea of a syscall hook with
zpoline is applicable to other platforms, like aarch64
(https://github.com/retrage/svc-hook), so I believe there is a chance
to expand this idea to architectures other than x86_64.

> For me, a big argument in favour of such an approach is its simplicity.
> I am mostly basing that on the fact that this patchset should properly
> handle other signals like SIGFPE and SIGSEGV. And, once it does that,
> you will already have all the infrastructure to do the correct register
> save/restore using the host mcontext, which is what is needed in the
> SIGSYS handler when using SECCOMP. The filter itself should be simple
> as it just needs to catch all syscalls within valid userspace
> executable memory[2] ranges.

I agree with your observation that the approach is simple.
I don't have a good idea of how to handle SIGSEGV yet, but I will try
to work it out with your input.

> Benjamin
> 
> [1] Maybe not surprising, as I have been working on a SECCOMP based UML
> that does not require ptrace.

Yes, I have been aware of it for a while.  I have also benchmarked
several hook mechanisms, including seccomp, with a simple getpid
measurement.

https://speakerdeck.com/thehajime/netdev0x18-zpoline?slide=16

> [2] I am assuming that userspace executable code is already confined to
> a certain address space within the UML process. Obviously, the kernel
> itself and loaded modules need to be free to do host syscalls and
> should not be affected by the SECCOMP filter.

I think our !MMU UML doesn't break this assumption.  But did you see
something in our patchset that does?

Thanks again,
-- Hajime

Benjamin Berg Oct. 28, 2024, 1:32 p.m. UTC | #3

Hello Hajime,

On Sun, 2024-10-27 at 18:10 +0900, Hajime Tazaki wrote:
> thank you for your time looking at this.
> 
> On Sat, 26 Oct 2024 19:19:08 +0900,
> Benjamin Berg wrote:
> 
> > > - a crash on userspace programs crashes a UML kernel, not signaling
> > >   with SIGSEGV to the program.
> > > - commit c27e618 (during v6.12-rc1 merge) introduces invalid access to
> > >   a vma structure for our case, which updates the internal procedure
> > >   of maple_tree subsystem.  We're trying to fix issue but still a
> > >   random process on exit(2) crashes.
> > 
> > Btw. are you handling FP register save/restore? If it is not there, it
> > probably would not be too hard to add (XSAVE, etc.), though it might
> > add a bit of additional overhead. Especially as UML always saves the FP
> > state rather than optimizing it like the x86 architectures.
> 
> The patch handles fp register on entry/leave at syscall; [07/13] patch
> contains this part.

That looks like FS/GS registers which are for thread-local storage. I
was talking about floating point registers. Maybe you meant another
patch?

> I'm not familiar with that but what kind of optimizations does x86
> architecture do for fp register handling ?

The kernel does not usually need the FP registers. So it optimizes the
pretty common case of a userspace -> kernel -> userspace switch that
happens for a syscall by simply not saving/restoring these registers at
all.

Obviously, it then still needs to do the work when the task is switched
or in the rare case that the kernel wants to use floating point itself.

> > I am a bit confused overall. I mean, zpoline seems kind of neat, but a
> > requirement on patching userspace code also seems like a lot.
> > 
> > To me, it seems much more natural to catch the userspace syscalls using
> > a SECCOMP filter[1]. While quite a lot slower, that should be much more
> > portable across architectures. For improved speed one could still do
> > architecture specific things inside the vDSO or by using zpoline. But
> > those would then "just" be optimizations and unpatched code would still
> > work correctly (e.g. JIT).
> 
> I'm not proposing this patch to replace existing UML implementations;
> for instance, the patchset cannot run CONFIG_MMU code in the whole
> kernel tree, so the existing ptrace-based implementation still has a
> real use case.  The ptrace-based syscall hook is indeed not fast, and
> replacing it with a seccomp filter clearly has benefits.  I
> think that is independent of this patchset.

Of course. nommu mode is a completely independent feature.

I am still wondering a bit about the users for such a mode. It is not
interesting for us as we use it for testing. Of course, speed is nice
but it is not the primary objective.

I understand that it can be an approach for a small "container", but
then you would need a very strict SECCOMP filter for the kernel itself.

> So I think while your seccomp patches are also in review, this
> patchset can exist in parallel.
> 
> btw, though I mentioned that JIT generated code is not currently
> handled in a different reply, it can be implemented as an extension to
> this patchset; the original implementation of zpoline now is able to
> patch JIT generated code as well.
> 
> https://github.com/yasukata/zpoline/pull/20/commits/c42af16757ad3fcdf7084c9f2139bb9105796873
> 
> it is not implemented for the moment.
> 
> in terms of the portability, the basic idea of syscall hook with
> zpoline is applicable to other platform, like aarch64
> (https://github.com/retrage/svc-hook).  so I believe it has a chance
> to expand this idea to other architectures than x86_64.

Right, aarch64 is probably the most interesting one in general. At
least there was some interest in a UML port.

> > For me, a big argument in favour of such an approach is its simplicity.
> > I am mostly basing that on the fact that this patchset should properly
> > handle other signals like SIGFPE and SIGSEGV. And, once it does that,
> > you will already have all the infrastructure to do the correct register
> > save/restore using the host mcontext, which is what is needed in the
> > SIGSYS handler when using SECCOMP. The filter itself should be simple
> > as it just needs to catch all syscalls within valid userspace
> > executable memory[2] ranges.
> 
> I agree with your observation that the approach is simple.
> I don't have a good idea on how to handle SIGSEGV, but will try to see
> with your inputs.

You can probably use "[RFC PATCH v2 5/9] um: Add helper functions to
get/set state for SECCOMP" for getting the registers and also writing
them back if you want to restore using rt_sigreturn.
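
As a rough illustration of the underlying mechanism (not the helper from the patch referenced above), the following sketch shows how a host signal handler can read the interrupted registers from the mcontext. It assumes x86_64 Linux with glibc; `install_fault_handler`, `fault_handler` and the `seen_*` variables are hypothetical names. Any value written back into `uc_mcontext.gregs` before the handler returns is what rt_sigreturn restores.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for REG_RIP / REG_RSP */
#endif
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <ucontext.h>

/* Values captured by the handler (hypothetical names, illustration only). */
static volatile sig_atomic_t seen_signal;
static volatile uint64_t seen_rip, seen_rsp;

static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
    ucontext_t *uc = ctx;

    (void)info;
    /* The host kernel saved the interrupted registers in the mcontext. */
    seen_rip = (uint64_t)uc->uc_mcontext.gregs[REG_RIP];
    seen_rsp = (uint64_t)uc->uc_mcontext.gregs[REG_RSP];
    seen_signal = sig;
    /*
     * Writing to uc->uc_mcontext.gregs[] here would change the state
     * that is restored when the handler returns via rt_sigreturn.
     */
}

/* Install the handler for `sig`; returns 0 on success, -1 on error. */
int install_fault_handler(int sig)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}
```

The same handler shape applies to SIGSEGV, SIGFPE and SIGSYS; a real implementation would copy the registers into the UML register state rather than into globals.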

> > [1] Maybe not surprising, as I have been working on a SECCOMP based UML
> > that does not require ptrace.
> 
> Yes, I have been aware of it for some time.  I have also benchmarked
> several hook mechanisms, including seccomp, with a simple getpid
> measurement.
> 
> https://speakerdeck.com/thehajime/netdev0x18-zpoline?slide=16

Sure! I saw that :-)

> > [2] I am assuming that userspace executable code is already confined to
> > a certain address space within the UML process. Obviously, the kernel
> > itself and loaded modules need to be free to do host syscalls and
> > should not be affected by the SECCOMP filter.
> 
> I think our !MMU UML doesn't break this assumption.  But did you see
> something in our patchset that does?

I also assume that is fine. One just needs to understand this when
writing a SECCOMP filter for syscall emulation in nommu mode.
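
A minimal sketch of such a filter, under stated assumptions: x86_64 (little endian), a single userspace range `[start, end)` that does not cross a 4 GiB boundary, and a hypothetical helper name `build_ip_range_filter`. Classic BPF only compares 32-bit words, so the 64-bit instruction pointer is checked in two halves. Syscalls issued from inside the range get SECCOMP_RET_TRAP (delivered as SIGSYS); everything else, i.e. the kernel and its modules, is allowed.

```c
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IP_RANGE_FILTER_LEN 10

/* Little-endian offsets of the two halves of the instruction pointer. */
#define IP_LO_OFF (offsetof(struct seccomp_data, instruction_pointer))
#define IP_HI_OFF (offsetof(struct seccomp_data, instruction_pointer) + 4)

/*
 * Hypothetical helper: fill `out` with a filter that traps syscalls whose
 * instruction pointer lies in [start, end). Returns the number of
 * instructions, or -1 if the range is empty or crosses a 4 GiB boundary.
 */
int build_ip_range_filter(struct sock_filter out[IP_RANGE_FILTER_LEN],
                          uint64_t start, uint64_t end)
{
    if (start >= end || (start >> 32) != ((end - 1) >> 32))
        return -1;

    uint32_t hi = (uint32_t)(start >> 32);
    uint32_t lo_first = (uint32_t)start;
    uint32_t lo_last = (uint32_t)(end - 1);

    struct sock_filter prog[IP_RANGE_FILTER_LEN] = {
        /* 0-2: refuse anything that is not a native x86_64 syscall */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
        /* 3-4: high half must match, otherwise jump to ALLOW (8) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_HI_OFF),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, hi, 0, 3),
        /* 5-7: low half must be within [lo_first, lo_last] */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, IP_LO_OFF),
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, lo_first, 0, 1),
        BPF_JUMP(BPF_JMP | BPF_JGT | BPF_K, lo_last, 0, 1),
        /* 8: outside the userspace range, e.g. kernel or modules */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        /* 9: inside the userspace range, hand over to the SIGSYS handler */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
    };

    memcpy(out, prog, sizeof(prog));
    return IP_RANGE_FILTER_LEN;
}
```

The resulting array would be wrapped in a `struct sock_fprog` and installed with `prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...)` after `PR_SET_NO_NEW_PRIVS`. Several disjoint userspace ranges would need one comparison group per range.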

Benjamin

Hajime Tazaki Oct. 30, 2024, 9:25 a.m. UTC | #4

Hello,

On Mon, 28 Oct 2024 22:32:43 +0900,
Benjamin Berg wrote:

> > > > - a crash on userspace programs crashes a UML kernel, not signaling
> > > >   with SIGSEGV to the program.
> > > > - commit c27e618 (during v6.12-rc1 merge) introduces invalid access to
> > > >   a vma structure for our case, which updates the internal procedure
> > > >   of maple_tree subsystem.  We're trying to fix issue but still a
> > > >   random process on exit(2) crashes.
> > > 
> > > Btw. are you handling FP register save/restore? If it is not there, it
> > > probably would not be too hard to add (XSAVE, etc.), though it might
> > > add a bit of additional overhead. Especially as UML always saves the FP
> > > state rather than optimizing it like the x86 architectures.
> > 
> > The patch handles fp register on entry/leave at syscall; [07/13] patch
> > contains this part.
> 
> That looks like FS/GS registers which are for thread-local storage. I
> was talking about floating point registers. Maybe you meant another
> patch?

Oh, this is my terrible mistake...
No, the patch doesn't handle FP registers at all.

> > I'm not familiar with that but what kind of optimizations does x86
> > architecture do for fp register handling ?
> 
> The kernel does not usually need the FP registers. So it optimizes the
> pretty common case of a userspace -> kernel -> userspace switch that
> happens for a syscall by simply not saving/restoring these registers at
> all.
> 
> Obviously, it then still needs to do the work when the task is switched
> or in the rare case that the kernel wants to use floating point itself.

thanks for the information.

> > > I am a bit confused overall. I mean, zpoline seems kind of neat, but a
> > > requirement on patching userspace code also seems like a lot.
> > > 
> > > To me, it seems much more natural to catch the userspace syscalls using
> > > a SECCOMP filter[1]. While quite a lot slower, that should be much more
> > > portable across architectures. For improved speed one could still do
> > > architecture specific things inside the vDSO or by using zpoline. But
> > > those would then "just" be optimizations and unpatched code would still
> > > work correctly (e.g. JIT).
> > 
> > I'm not proposing this patch to replace existing UML implementations;
> > for instance, the patchset cannot run CONFIG_MMU code in the whole
> > kernel tree, so the existing ptrace-based implementation still has a
> > real use case.  The ptrace-based syscall hook is indeed not fast, and
> > replacing it with a seccomp filter clearly has benefits.  I
> > think that is independent of this patchset.
> 
> Of course. nommu mode is a completely independent feature.
> 
> I am still wondering a bit about the users for such a mode. It is not
> interesting for us as we use it for testing. Of course, speed is nice
> but it is not the primary objective.
> 
> I understand that it can be an approach for a small "container", but
> then you would need a very strict SECCOMP filter for the kernel itself.

I didn't specifically describe the use cases in the v1 patch, but here
is the list I have in mind.

1) a container-like use case (the original work was proposed for
this),
2) testing nommu code in the kernel,
3) faster I/O workloads that issue many syscalls over UML.

I think this list covers most of the reasons to have a !MMU mode
alongside the current MMU-full UML.

Speed might indeed not be the primary objective, but if you have
dozens of test cases that issue many syscalls (which I think is a
plausible case), this might be helpful.

(snip)

> > > For me, a big argument in favour of such an approach is its simplicity.
> > > I am mostly basing that on the fact that this patchset should properly
> > > handle other signals like SIGFPE and SIGSEGV. And, once it does that,
> > > you will already have all the infrastructure to do the correct register
> > > save/restore using the host mcontext, which is what is needed in the
> > > SIGSYS handler when using SECCOMP. The filter itself should be simple
> > > as it just needs to catch all syscalls within valid userspace
> > > executable memory[2] ranges.
> > 
> > I agree with your observation that the approach is simple.
> > I don't have a good idea on how to handle SIGSEGV, but will try to see
> > with your inputs.
> 
> You can probably use "[RFC PATCH v2 5/9] um: Add helper functions to
> get/set state for SECCOMP" for getting the registers and also writing
> them back if you want to restore using rt_sigreturn.

thanks,

I'm still trying various ways to deliver SIGSEGV to userspace, but
no luck so far...  I will get back to you once I have something in
good shape.

(snip)
> > > [2] I am assuming that userspace executable code is already confined to
> > > a certain address space within the UML process. Obviously, the kernel
> > > itself and loaded modules need to be free to do host syscalls and
> > > should not be affected by the SECCOMP filter.
> > 
> > I think our !MMU UML doesn't break this assumption.  But did you see
> > something in our patchset that does?
> 
> I also assume that is fine. One just needs to understand this when
> writing a SECCOMP filter for syscall emulation in nommu mode.

okay, thanks for the clarification.

-- Hajime
