mbox series

[bpf-next,v8,00/11] Landlock LSM: Toward unprivileged sandboxing

Message ID 20180227004121.3633-1-mic@digikod.net
Headers show
Series Landlock LSM: Toward unprivileged sandboxing | expand

Message

Mickaël Salaün Feb. 27, 2018, 12:41 a.m. UTC
Hi,

This eight series is a major revamp of the Landlock design compared to
the previous series [1]. This enables more flexibility and granularity
of access control with file paths. It is now possible to enforce an
access control according to a file hierarchy. Landlock uses the concept
of inode and path to identify such hierarchy. In a way, it brings tools
to program what is a file hierarchy.

There is now three types of Landlock hooks: FS_WALK, FS_PICK and FS_GET.
Each of them accepts a dedicated eBPF program, called a Landlock
program.  They can be chained to enforce a full access control according
to a list of directories or files. The set of actions on a file is well
defined (e.g. read, write, ioctl, append, lock, mount...) taking
inspiration from the major Linux LSMs and some other access-controls
like Capsicum.  These program types are designed to be cache-friendly,
which give room for optimizations in the future.

The documentation patch contains some kernel documentation and
explanations on how to use Landlock.  The compiled documentation and
a talk I gave at FOSDEM can be found here: https://landlock.io
This patch series can be found in the branch landlock-v8 in this repo:
https://github.com/landlock-lsm/linux

There is still some minor issues with this patch series but it should
demonstrate how powerful this design may be. One of these issues is that
it is not a stackable LSM anymore, but the infrastructure management of
security blobs should allow to stack it with other LSM [4].

This is the first step of the roadmap discussed at LPC [2].  While the
intended final goal is to allow unprivileged users to use Landlock, this
series allows only a process with global CAP_SYS_ADMIN to load and
enforce a rule.  This may help to get feedback and avoid unexpected
behaviors.

This series can be applied on top of bpf-next, commit 7d72637eb39f
("Merge branch 'x86-jit'").  This can be tested with
CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK.  I would really
appreciate constructive comments on the design and the code.


# Landlock LSM

The goal of this new Linux Security Module (LSM) called Landlock is to
allow any process, including unprivileged ones, to create powerful
security sandboxes comparable to XNU Sandbox or OpenBSD Pledge. This
kind of sandbox is expected to help mitigate the security impact of bugs
or unexpected/malicious behaviors in user-space applications.

The approach taken is to add the minimum amount of code while still
allowing the user-space application to create quite complex access
rules.  A dedicated security policy language such as the one used by
SELinux, AppArmor and other major LSMs involves a lot of code and is
usually permitted to only a trusted user (i.e. root).  On the contrary,
eBPF programs already exist and are designed to be safely loaded by
unprivileged user-space.

This design does not seem too intrusive but is flexible enough to allow
a powerful sandbox mechanism accessible by any process on Linux. The use
of seccomp and Landlock is more suitable with the help of a user-space
library (e.g.  libseccomp) that could help to specify a high-level
language to express a security policy instead of raw eBPF programs.
Moreover, thanks to the LLVM front-end, it is quite easy to write an
eBPF program with a subset of the C language.


# Frequently asked questions

## Why is seccomp-bpf not enough?

A seccomp filter can access only raw syscall arguments (i.e. the
register values) which means that it is not possible to filter according
to the value pointed to by an argument, such as a file pathname. As an
embryonic Landlock version demonstrated, filtering at the syscall level
is complicated (e.g. need to take care of race conditions). This is
mainly because the access control checkpoints of the kernel are not at
this high-level but more underneath, at the LSM-hook level. The LSM
hooks are designed to handle this kind of checks.  Landlock abstracts
this approach to leverage the ability of unprivileged users to limit
themselves.

Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt


## Why use the seccomp(2) syscall?

Landlock use the same semantic as seccomp to apply access rule
restrictions. It add a new layer of security for the current process
which is inherited by its children. It makes sense to use an unique
access-restricting syscall (that should be allowed by seccomp filters)
which can only drop privileges. Moreover, a Landlock rule could come
from outside a process (e.g.  passed through a UNIX socket). It is then
useful to differentiate the creation/load of Landlock eBPF programs via
bpf(2), from rule enforcement via seccomp(2).


## Why a new LSM? Are SELinux, AppArmor, Smack and Tomoyo not good
   enough?

The current access control LSMs are fine for their purpose which is to
give the *root* the ability to enforce a security policy for the
*system*. What is missing is a way to enforce a security policy for any
application by its developer and *unprivileged user* as seccomp can do
for raw syscall filtering.

Differences from other (access control) LSMs:
* not only dedicated to administrators (i.e. no_new_priv);
* limited kernel attack surface (e.g. policy parsing);
* constrained policy rules (no DoS: deterministic execution time);
* do not leak more information than the loader process can legitimately
  have access to (minimize metadata inference).


# Changes since v7

* major revamp of the file system enforcement:
  * new eBPF map dedicated to tie an inode with an arbitrary 64-bits
    value, which can be used to tag files
  * three new Landlock hooks: FS_WALK, FS_PICK and FS_GET
  * add the ability to chain Landlock programs
  * add a new eBPF map type to compare inodes
  * don't use macros anymore
* replace subtype fields:
  * triggers: fine-grained bitfiel of actions on which a Landlock
    program may be called (if it comes from a sandbox process)
  * previous: a parent chained program
* upstreamed patches:
  * commit 369130b63178 ("selftests: Enhance kselftest_harness.h to
    print which assert failed")


# Changes since v6

* upstreamed patches:
  * commit 752ba56fb130 ("bpf: Extend check_uarg_tail_zero() checks")
  * commit 0b40808a1084 ("selftests: Make test_harness.h more generally
    available") and related ones
  * commit 3bb857e47e49 ("LSM: Enable multiple calls to
    security_add_hooks() for the same LSM")
* simplify the landlock_context (remove syscall_* fields) and add three
  FS sub-events: IOCTL, LOCK, FCNTL
* minimize the number of callable BPF functions from a Landlock rule
* do not split put_seccomp_filter() with put_seccomp()
* rename Landlock version to Landlock ABI
* miscellaneous fixes
* rebase on net-next


# Changes since v5

* eBPF program subtype:
  * use a prog_subtype pointer instead of inlining it into bpf_attr
  * enable a future-proof behavior (reject unhandled data/size)
  * add tests
* use a simple rule hierarchy (similar to seccomp-bpf)
* add a ptrace scope protection
* add more tests
* add more documentation
* rename some files
* miscellaneous fixes
* rebase on net-next


# Changes since v4

* upstreamed patches:
  * commit d498f8719a09 ("bpf: Rebuild bpf.o for any dependency update")
  * commit a734fb5d6006 ("samples/bpf: Reset global variables") and
    related ones
  * commit f4874d01beba ("bpf: Use bpf_create_map() from the library")
    and related ones
  * commit d02d8986a768 ("bpf: Always test unprivileged programs")
  * commit 640eb7e7b524 ("fs: Constify path_is_under()'s arguments")
  * commit 535e7b4b5ef2 ("bpf: Use u64_to_user_ptr()")
* revamp Landlock to not expose an LSM hook interface but wrap and
  abstract them with Landlock events (currently one for all filesystem
  related operations: LANDLOCK_SUBTYPE_EVENT_FS)
* wrap all filesystem kernel objects through the same FS handle (struct
  landlock_handle_fs): struct file, struct inode, struct path and struct
  dentry
* a rule don't return an errno code but only a boolean to allow or deny
  an access request
* handle all filesystem related LSM hooks
* add some tests and a sample:
  * BPF context tests
  * Landlock sandboxing tests and sample
  * write Landlock rules in C and compile them with LLVM
* change field names of eBPF program subtype
* remove arraymap of handles for now (will be replaced with a revamped
  map)
* remove cgroup handling for now
* add user and kernel documentation
* rebase on net-next


# Changes since v3

* upstreamed patch:
  * commit 1955351da41c ("bpf: Set register type according to
    is_valid_access()")
* use abstract LSM hook arguments with custom types (e.g.
  *_LANDLOCK_ARG_FS for struct file, struct inode and struct path)
* add more LSM hooks to support full filesystem access control
* improve the sandbox example
* fix races and RCU issues:
  * eBPF program execution and eBPF helpers
  * revamp the arraymap of handles to cleanly deal with update/delete
* eBPF program subtype for Landlock:
  * remove the "origin" field
  * add an "option" field
* rebase onto Daniel Mack's patches v7 [3]
* remove merged commit 1955351da41c ("bpf: Set register type according
  to is_valid_access()")
* fix spelling mistakes
* cleanup some type and variable names
* split patches
* for now, remove cgroup delegation handling for unprivileged user
* remove extra access check for cgroup_get_from_fd()
* remove unused example code dealing with skb
* remove seccomp-bpf link:
  * no more seccomp cookie
  * for now, it is no more possible to check the current syscall
    properties


# Changes since v2

* revamp cgroup handling:
  * use Daniel Mack's patches "Add eBPF hooks for cgroups" v5
  * remove bpf_landlock_cmp_cgroup_beneath()
  * make BPF_PROG_ATTACH usable with delegated cgroups
  * add a new CGRP_NO_NEW_PRIVS flag for safe cgroups
  * handle Landlock sandboxing for cgroups hierarchy
  * allow unprivileged processes to attach Landlock eBPF program to
    cgroups
* add subtype to eBPF programs:
  * replace Landlock hook identification by custom eBPF program types
    with a dedicated subtype field
  * manage fine-grained privileged Landlock programs
  * register Landlock programs for dedicated trigger origins (e.g.
    syscall, return from seccomp filter and/or interruption)
* performance and memory optimizations: use an array to access Landlock
  hooks directly but do not duplicated it for each thread
  (seccomp-based)
* allow running Landlock programs without seccomp filter
* fix seccomp-related issues
* remove extra errno bounding check for Landlock programs
* add some examples for optional eBPF functions or context access
  (network related) according to security checks to allow more features
  for privileged programs (e.g. Checmate)


# Changes since v1

* focus on the LSM hooks, not the syscalls:
  * much more simple implementation
  * does not need audit cache tricks to avoid race conditions
  * more simple to use and more generic because using the LSM hook
    abstraction directly
  * more efficient because only checking in LSM hooks
  * architecture agnostic
* switch from cBPF to eBPF:
  * new eBPF program types dedicated to Landlock
  * custom functions used by the eBPF program
  * gain some new features (e.g. 10 registers, can load values of
    different size, LLVM translator) but only a few functions allowed
    and a dedicated map type
  * new context: LSM hook ID, cookie and LSM hook arguments
  * need to set the sysctl kernel.unprivileged_bpf_disable to 0 (default
    value) to be able to load hook filters as unprivileged users
* smaller and simpler:
  * no more checker groups but dedicated arraymap of handles
  * simpler userland structs thanks to eBPF functions
* distinctive name: Landlock


[1] https://lkml.kernel.org/r/20170821000933.13024-1-mic@digikod.net
[2] https://lkml.kernel.org/r/5828776A.1010104@digikod.net
[3] https://lkml.kernel.org/r/1477390454-12553-1-git-send-email-daniel@zonque.org
[4] https://lwn.net/Articles/741963/

Regards,

Mickaël Salaün (11):
  fs,security: Add a security blob to nameidata
  fs,security: Add a new file access type: MAY_CHROOT
  bpf: Add eBPF program subtype and is_valid_subtype() verifier
  bpf,landlock: Define an eBPF program type for Landlock hooks
  seccomp,landlock: Enforce Landlock programs per process hierarchy
  bpf,landlock: Add a new map type: inode
  landlock: Handle filesystem access control
  landlock: Add ptrace restrictions
  bpf: Add a Landlock sandbox example
  bpf,landlock: Add tests for Landlock
  landlock: Add user and kernel documentation for Landlock

 Documentation/security/index.rst               |    1 +
 Documentation/security/landlock/index.rst      |   19 +
 Documentation/security/landlock/kernel.rst     |  100 +++
 Documentation/security/landlock/user.rst       |  206 +++++
 MAINTAINERS                                    |   13 +
 fs/namei.c                                     |   39 +
 fs/open.c                                      |    3 +-
 include/linux/bpf.h                            |   35 +-
 include/linux/bpf_types.h                      |    6 +
 include/linux/fs.h                             |    1 +
 include/linux/landlock.h                       |   61 ++
 include/linux/lsm_hooks.h                      |   12 +
 include/linux/namei.h                          |   14 +-
 include/linux/seccomp.h                        |    5 +
 include/linux/security.h                       |    7 +
 include/uapi/linux/bpf.h                       |   34 +-
 include/uapi/linux/landlock.h                  |  155 ++++
 include/uapi/linux/seccomp.h                   |    1 +
 kernel/bpf/Makefile                            |    3 +
 kernel/bpf/cgroup.c                            |    6 +-
 kernel/bpf/core.c                              |    6 +-
 kernel/bpf/helpers.c                           |   38 +
 kernel/bpf/inodemap.c                          |  318 ++++++++
 kernel/bpf/syscall.c                           |   62 +-
 kernel/bpf/verifier.c                          |   44 +-
 kernel/fork.c                                  |    8 +-
 kernel/seccomp.c                               |    4 +
 kernel/trace/bpf_trace.c                       |   15 +-
 net/core/filter.c                              |   76 +-
 samples/bpf/Makefile                           |    4 +
 samples/bpf/bpf_load.c                         |   85 +-
 samples/bpf/bpf_load.h                         |    7 +
 samples/bpf/cookie_uid_helper_example.c        |    2 +-
 samples/bpf/fds_example.c                      |    2 +-
 samples/bpf/landlock1.h                        |   14 +
 samples/bpf/landlock1_kern.c                   |  171 ++++
 samples/bpf/landlock1_user.c                   |  164 ++++
 samples/bpf/sock_example.c                     |    3 +-
 samples/bpf/test_cgrp2_attach.c                |    2 +-
 samples/bpf/test_cgrp2_attach2.c               |    4 +-
 samples/bpf/test_cgrp2_sock.c                  |    2 +-
 security/Kconfig                               |    1 +
 security/Makefile                              |    2 +
 security/landlock/Kconfig                      |   18 +
 security/landlock/Makefile                     |    6 +
 security/landlock/chain.c                      |   39 +
 security/landlock/chain.h                      |   35 +
 security/landlock/common.h                     |   94 +++
 security/landlock/enforce.c                    |  386 +++++++++
 security/landlock/enforce.h                    |   21 +
 security/landlock/enforce_seccomp.c            |  112 +++
 security/landlock/hooks.c                      |  121 +++
 security/landlock/hooks.h                      |   35 +
 security/landlock/hooks_cred.c                 |   52 ++
 security/landlock/hooks_cred.h                 |    1 +
 security/landlock/hooks_fs.c                   | 1021 ++++++++++++++++++++++++
 security/landlock/hooks_fs.h                   |   60 ++
 security/landlock/hooks_ptrace.c               |  124 +++
 security/landlock/hooks_ptrace.h               |   11 +
 security/landlock/init.c                       |  238 ++++++
 security/landlock/tag.c                        |  373 +++++++++
 security/landlock/tag.h                        |   36 +
 security/landlock/tag_fs.c                     |   59 ++
 security/landlock/tag_fs.h                     |   26 +
 security/landlock/task.c                       |   34 +
 security/landlock/task.h                       |   29 +
 security/security.c                            |   19 +-
 tools/include/uapi/linux/bpf.h                 |   34 +-
 tools/include/uapi/linux/landlock.h            |  155 ++++
 tools/lib/bpf/bpf.c                            |   15 +-
 tools/lib/bpf/bpf.h                            |    8 +-
 tools/lib/bpf/libbpf.c                         |    5 +-
 tools/perf/tests/bpf.c                         |    2 +-
 tools/testing/selftests/Makefile               |    1 +
 tools/testing/selftests/bpf/bpf_helpers.h      |    7 +
 tools/testing/selftests/bpf/test_align.c       |    2 +-
 tools/testing/selftests/bpf/test_tag.c         |    2 +-
 tools/testing/selftests/bpf/test_verifier.c    |  102 ++-
 tools/testing/selftests/landlock/.gitignore    |    5 +
 tools/testing/selftests/landlock/Makefile      |   35 +
 tools/testing/selftests/landlock/test.h        |   31 +
 tools/testing/selftests/landlock/test_base.c   |   27 +
 tools/testing/selftests/landlock/test_chain.c  |  249 ++++++
 tools/testing/selftests/landlock/test_fs.c     |  492 ++++++++++++
 tools/testing/selftests/landlock/test_ptrace.c |  158 ++++
 85 files changed, 5960 insertions(+), 75 deletions(-)
 create mode 100644 Documentation/security/landlock/index.rst
 create mode 100644 Documentation/security/landlock/kernel.rst
 create mode 100644 Documentation/security/landlock/user.rst
 create mode 100644 include/linux/landlock.h
 create mode 100644 include/uapi/linux/landlock.h
 create mode 100644 kernel/bpf/inodemap.c
 create mode 100644 samples/bpf/landlock1.h
 create mode 100644 samples/bpf/landlock1_kern.c
 create mode 100644 samples/bpf/landlock1_user.c
 create mode 100644 security/landlock/Kconfig
 create mode 100644 security/landlock/Makefile
 create mode 100644 security/landlock/chain.c
 create mode 100644 security/landlock/chain.h
 create mode 100644 security/landlock/common.h
 create mode 100644 security/landlock/enforce.c
 create mode 100644 security/landlock/enforce.h
 create mode 100644 security/landlock/enforce_seccomp.c
 create mode 100644 security/landlock/hooks.c
 create mode 100644 security/landlock/hooks.h
 create mode 100644 security/landlock/hooks_cred.c
 create mode 100644 security/landlock/hooks_cred.h
 create mode 100644 security/landlock/hooks_fs.c
 create mode 100644 security/landlock/hooks_fs.h
 create mode 100644 security/landlock/hooks_ptrace.c
 create mode 100644 security/landlock/hooks_ptrace.h
 create mode 100644 security/landlock/init.c
 create mode 100644 security/landlock/tag.c
 create mode 100644 security/landlock/tag.h
 create mode 100644 security/landlock/tag_fs.c
 create mode 100644 security/landlock/tag_fs.h
 create mode 100644 security/landlock/task.c
 create mode 100644 security/landlock/task.h
 create mode 100644 tools/include/uapi/linux/landlock.h
 create mode 100644 tools/testing/selftests/landlock/.gitignore
 create mode 100644 tools/testing/selftests/landlock/Makefile
 create mode 100644 tools/testing/selftests/landlock/test.h
 create mode 100644 tools/testing/selftests/landlock/test_base.c
 create mode 100644 tools/testing/selftests/landlock/test_chain.c
 create mode 100644 tools/testing/selftests/landlock/test_fs.c
 create mode 100644 tools/testing/selftests/landlock/test_ptrace.c

Comments

Andy Lutomirski Feb. 27, 2018, 4:36 a.m. UTC | #1
On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün <mic@digikod.net> wrote:
> Hi,
>
> This eight series is a major revamp of the Landlock design compared to
> the previous series [1]. This enables more flexibility and granularity
> of access control with file paths. It is now possible to enforce an
> access control according to a file hierarchy. Landlock uses the concept
> of inode and path to identify such hierarchy. In a way, it brings tools
> to program what is a file hierarchy.
>
> There is now three types of Landlock hooks: FS_WALK, FS_PICK and FS_GET.
> Each of them accepts a dedicated eBPF program, called a Landlock
> program.  They can be chained to enforce a full access control according
> to a list of directories or files. The set of actions on a file is well
> defined (e.g. read, write, ioctl, append, lock, mount...) taking
> inspiration from the major Linux LSMs and some other access-controls
> like Capsicum.  These program types are designed to be cache-friendly,
> which give room for optimizations in the future.
>
> The documentation patch contains some kernel documentation and
> explanations on how to use Landlock.  The compiled documentation and
> a talk I gave at FOSDEM can be found here: https://landlock.io
> This patch series can be found in the branch landlock-v8 in this repo:
> https://github.com/landlock-lsm/linux
>
> There is still some minor issues with this patch series but it should
> demonstrate how powerful this design may be. One of these issues is that
> it is not a stackable LSM anymore, but the infrastructure management of
> security blobs should allow to stack it with other LSM [4].
>
> This is the first step of the roadmap discussed at LPC [2].  While the
> intended final goal is to allow unprivileged users to use Landlock, this
> series allows only a process with global CAP_SYS_ADMIN to load and
> enforce a rule.  This may help to get feedback and avoid unexpected
> behaviors.
>
> This series can be applied on top of bpf-next, commit 7d72637eb39f
> ("Merge branch 'x86-jit'").  This can be tested with
> CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK.  I would really
> appreciate constructive comments on the design and the code.
>
>
> # Landlock LSM
>
> The goal of this new Linux Security Module (LSM) called Landlock is to
> allow any process, including unprivileged ones, to create powerful
> security sandboxes comparable to XNU Sandbox or OpenBSD Pledge. This
> kind of sandbox is expected to help mitigate the security impact of bugs
> or unexpected/malicious behaviors in user-space applications.
>
> The approach taken is to add the minimum amount of code while still
> allowing the user-space application to create quite complex access
> rules.  A dedicated security policy language such as the one used by
> SELinux, AppArmor and other major LSMs involves a lot of code and is
> usually permitted to only a trusted user (i.e. root).  On the contrary,
> eBPF programs already exist and are designed to be safely loaded by
> unprivileged user-space.
>
> This design does not seem too intrusive but is flexible enough to allow
> a powerful sandbox mechanism accessible by any process on Linux. The use
> of seccomp and Landlock is more suitable with the help of a user-space
> library (e.g.  libseccomp) that could help to specify a high-level
> language to express a security policy instead of raw eBPF programs.
> Moreover, thanks to the LLVM front-end, it is quite easy to write an
> eBPF program with a subset of the C language.
>
>
> # Frequently asked questions
>
> ## Why is seccomp-bpf not enough?
>
> A seccomp filter can access only raw syscall arguments (i.e. the
> register values) which means that it is not possible to filter according
> to the value pointed to by an argument, such as a file pathname. As an
> embryonic Landlock version demonstrated, filtering at the syscall level
> is complicated (e.g. need to take care of race conditions). This is
> mainly because the access control checkpoints of the kernel are not at
> this high-level but more underneath, at the LSM-hook level. The LSM
> hooks are designed to handle this kind of checks.  Landlock abstracts
> this approach to leverage the ability of unprivileged users to limit
> themselves.
>
> Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt
>
>
> ## Why use the seccomp(2) syscall?
>
> Landlock use the same semantic as seccomp to apply access rule
> restrictions. It add a new layer of security for the current process
> which is inherited by its children. It makes sense to use an unique
> access-restricting syscall (that should be allowed by seccomp filters)
> which can only drop privileges. Moreover, a Landlock rule could come
> from outside a process (e.g.  passed through a UNIX socket). It is then
> useful to differentiate the creation/load of Landlock eBPF programs via
> bpf(2), from rule enforcement via seccomp(2).

This seems like a weak argument to me.  Sure, this is a bit different
from seccomp(), and maybe shoving it into the seccomp() multiplexer is
awkward, but surely the bpf() multiplexer is even less applicable.

But I think that you have more in common with seccomp() than you're
giving it credit for.  With seccomp, you need to either prevent
ptrace() of any more-privileged task or you need to filter to make
sure you can't trace a more privileged program.  With landlock, you
need exactly the same thing.  You have basically the same no_new_privs
considerations, etc.

Also, looking forward, I think you're going to want a bunch of the
stuff that's under consideration as new seccomp features.  Tycho is
working on a "user notifier" feature for seccomp where, in addition to
accepting, rejecting, or kicking to ptrace, you can send a message to
the creator of the filter and wait for a reply.  I think that Landlock
will want exactly the same feature.

In other words, it really seems to be that you should extend seccomp()
with the ability to attach filters to things that aren't syscall
entry, e.g. file open.

I would also seriously consider doing a scaled-back Landlock variant
first, with the intent of getting the main mechanism into the kernel.
In particular, there are two big sources of complexity in Landlock.
You need to deal with the API for managing bpf programs that filter
various actions beyond just syscall entry, and you need to deal with
giving those filters a way to deal with inodes, paths, etc.  But you
can do the former without the latter.  For example, you could start
with some Landlock-style filters on things that have nothing to do
with files.  For example, you could allow a filter for connecting to
an abstract-namespace unix socket.  Or you could have a hook for
file_receive.  (You couldn't meaningfully filter based on the *path*
of the fd being received without adding all the path infrastructure,
but you could fitler on the *type* of the fd being received.)  Both of
these add new sandboxing abilities that don't currently exist.  In
particular, you can't write a seccomp rule that prevents receiving an
fd using recvmsg() right now unless you block cmsg entirely.  And you
can't write a filter that allows connecting to unix sockets by path
without allowing abstract namespace sockets either.

If you split up Landlock like this then, once you got all the
installation and management of filters down, you could submit patches
to add all the path stuff and deal with that review separately.

What do you all think?
Mickaël Salaün Feb. 27, 2018, 10:03 p.m. UTC | #2
On 27/02/2018 05:36, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün <mic@digikod.net> wrote:
>> Hi,
>>
>> This eight series is a major revamp of the Landlock design compared to
>> the previous series [1]. This enables more flexibility and granularity
>> of access control with file paths. It is now possible to enforce an
>> access control according to a file hierarchy. Landlock uses the concept
>> of inode and path to identify such hierarchy. In a way, it brings tools
>> to program what is a file hierarchy.
>>
>> There is now three types of Landlock hooks: FS_WALK, FS_PICK and FS_GET.
>> Each of them accepts a dedicated eBPF program, called a Landlock
>> program.  They can be chained to enforce a full access control according
>> to a list of directories or files. The set of actions on a file is well
>> defined (e.g. read, write, ioctl, append, lock, mount...) taking
>> inspiration from the major Linux LSMs and some other access-controls
>> like Capsicum.  These program types are designed to be cache-friendly,
>> which give room for optimizations in the future.
>>
>> The documentation patch contains some kernel documentation and
>> explanations on how to use Landlock.  The compiled documentation and
>> a talk I gave at FOSDEM can be found here: https://landlock.io
>> This patch series can be found in the branch landlock-v8 in this repo:
>> https://github.com/landlock-lsm/linux
>>
>> There is still some minor issues with this patch series but it should
>> demonstrate how powerful this design may be. One of these issues is that
>> it is not a stackable LSM anymore, but the infrastructure management of
>> security blobs should allow to stack it with other LSM [4].
>>
>> This is the first step of the roadmap discussed at LPC [2].  While the
>> intended final goal is to allow unprivileged users to use Landlock, this
>> series allows only a process with global CAP_SYS_ADMIN to load and
>> enforce a rule.  This may help to get feedback and avoid unexpected
>> behaviors.
>>
>> This series can be applied on top of bpf-next, commit 7d72637eb39f
>> ("Merge branch 'x86-jit'").  This can be tested with
>> CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK.  I would really
>> appreciate constructive comments on the design and the code.
>>
>>
>> # Landlock LSM
>>
>> The goal of this new Linux Security Module (LSM) called Landlock is to
>> allow any process, including unprivileged ones, to create powerful
>> security sandboxes comparable to XNU Sandbox or OpenBSD Pledge. This
>> kind of sandbox is expected to help mitigate the security impact of bugs
>> or unexpected/malicious behaviors in user-space applications.
>>
>> The approach taken is to add the minimum amount of code while still
>> allowing the user-space application to create quite complex access
>> rules.  A dedicated security policy language such as the one used by
>> SELinux, AppArmor and other major LSMs involves a lot of code and is
>> usually permitted to only a trusted user (i.e. root).  On the contrary,
>> eBPF programs already exist and are designed to be safely loaded by
>> unprivileged user-space.
>>
>> This design does not seem too intrusive but is flexible enough to allow
>> a powerful sandbox mechanism accessible by any process on Linux. The use
>> of seccomp and Landlock is more suitable with the help of a user-space
>> library (e.g.  libseccomp) that could help to specify a high-level
>> language to express a security policy instead of raw eBPF programs.
>> Moreover, thanks to the LLVM front-end, it is quite easy to write an
>> eBPF program with a subset of the C language.
>>
>>
>> # Frequently asked questions
>>
>> ## Why is seccomp-bpf not enough?
>>
>> A seccomp filter can access only raw syscall arguments (i.e. the
>> register values) which means that it is not possible to filter according
>> to the value pointed to by an argument, such as a file pathname. As an
>> embryonic Landlock version demonstrated, filtering at the syscall level
>> is complicated (e.g. need to take care of race conditions). This is
>> mainly because the access control checkpoints of the kernel are not at
>> this high-level but more underneath, at the LSM-hook level. The LSM
>> hooks are designed to handle this kind of checks.  Landlock abstracts
>> this approach to leverage the ability of unprivileged users to limit
>> themselves.
>>
>> Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt
>>
>>
>> ## Why use the seccomp(2) syscall?
>>
>> Landlock use the same semantic as seccomp to apply access rule
>> restrictions. It add a new layer of security for the current process
>> which is inherited by its children. It makes sense to use an unique
>> access-restricting syscall (that should be allowed by seccomp filters)
>> which can only drop privileges. Moreover, a Landlock rule could come
>> from outside a process (e.g.  passed through a UNIX socket). It is then
>> useful to differentiate the creation/load of Landlock eBPF programs via
>> bpf(2), from rule enforcement via seccomp(2).
> 
> This seems like a weak argument to me.  Sure, this is a bit different
> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
> awkward, but surely the bpf() multiplexer is even less applicable.

I think using the seccomp syscall is fine, and everyone agreed on it.

> 
> But I think that you have more in common with seccomp() than you're
> giving it credit for.  With seccomp, you need to either prevent
> ptrace() of any more-privileged task or you need to filter to make
> sure you can't trace a more privileged program.  With landlock, you
> need exactly the same thing.  You have basically the same no_new_privs
> considerations, etc.

Right, I did not mean to not give credit to seccomp at all.

> 
> Also, looking forward, I think you're going to want a bunch of the
> stuff that's under consideration as new seccomp features.  Tycho is
> working on a "user notifier" feature for seccomp where, in addition to
> accepting, rejecting, or kicking to ptrace, you can send a message to
> the creator of the filter and wait for a reply.  I think that Landlock
> will want exactly the same feature.

I don't think why this may be useful at all her. Landlock does not
filter at the syscall level but handles kernel object and actions as
does an LSM. That is the whole purpose of Landlock.

> 
> In other words, it really seems to be that you should extend seccomp()
> with the ability to attach filters to things that aren't syscall
> entry, e.g. file open.

It seems that it is what I do… Not sure to understand you here.

> 
> I would also seriously consider doing a scaled-back Landlock variant
> first, with the intent of getting the main mechanism into the kernel.
> In particular, there are two big sources of complexity in Landlock.
> You need to deal with the API for managing bpf programs that filter
> various actions beyond just syscall entry, and you need to deal with
> giving those filters a way to deal with inodes, paths, etc.  But you
> can do the former without the latter.  For example, you could start
> with some Landlock-style filters on things that have nothing to do
> with files.  For example, you could allow a filter for connecting to
> an abstract-namespace unix socket.  Or you could have a hook for
> file_receive.  (You couldn't meaningfully filter based on the *path*
> of the fd being received without adding all the path infrastructure,
> but you could fitler on the *type* of the fd being received.)  Both of
> these add new sandboxing abilities that don't currently exist.  In
> particular, you can't write a seccomp rule that prevents receiving an
> fd using recvmsg() right now unless you block cmsg entirely.  And you
> can't write a filter that allows connecting to unix sockets by path
> without allowing abstract namespace sockets either.

What you are proposing here seems like my previous patch series (v7). As
it was suggested by Kees (and requested by future users), this patch
series is able to handle file paths. This is a good thing because it
highlight that this design can scale to evaluate complex objects (i.e.
file paths), which was not the case for the previous patch series.

> 
> If you split up Landlock like this then, once you got all the
> installation and management of filters down, you could submit patches
> to add all the path stuff and deal with that review separately.
> 
> What do you all think?
> 

I may be able to strip some parts for a first inclusion, but a complex
use case like controlling access to files is an important use case.
Andy Lutomirski Feb. 27, 2018, 11:09 p.m. UTC | #3
On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün <mic@digikod.net> wrote:
>
> On 27/02/2018 05:36, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>> Hi,
>>>

>>>
>>> ## Why use the seccomp(2) syscall?
>>>
>>> Landlock use the same semantic as seccomp to apply access rule
>>> restrictions. It add a new layer of security for the current process
>>> which is inherited by its children. It makes sense to use an unique
>>> access-restricting syscall (that should be allowed by seccomp filters)
>>> which can only drop privileges. Moreover, a Landlock rule could come
>>> from outside a process (e.g.  passed through a UNIX socket). It is then
>>> useful to differentiate the creation/load of Landlock eBPF programs via
>>> bpf(2), from rule enforcement via seccomp(2).
>>
>> This seems like a weak argument to me.  Sure, this is a bit different
>> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
>> awkward, but surely the bpf() multiplexer is even less applicable.
>
> I think using the seccomp syscall is fine, and everyone agreed on it.
>

Ah, sorry, I completely misread what you wrote.  My apologies.  You
can disregard most of my email.

>
>>
>> Also, looking forward, I think you're going to want a bunch of the
>> stuff that's under consideration as new seccomp features.  Tycho is
>> working on a "user notifier" feature for seccomp where, in addition to
>> accepting, rejecting, or kicking to ptrace, you can send a message to
>> the creator of the filter and wait for a reply.  I think that Landlock
>> will want exactly the same feature.
>
> I don't think why this may be useful at all her. Landlock does not
> filter at the syscall level but handles kernel object and actions as
> does an LSM. That is the whole purpose of Landlock.

Suppose I'm writing a container manager.  I want to run "mount" in the
container, but I don't want to allow moun() in general and I want to
emulate certain mount() actions.  I can write a filter that catches
mount using seccomp and calls out to the container manager for help.
This isn't theoretical -- Tycho wants *exactly* this use case to be
supported.

But using seccomp for this is indeed annoying.  It would be nice to
use Landlock's ability to filter based on the filesystem type, for
example.  So Tycho could write a Landlock rule like:

bool filter_mount(...)
{
  if (path needs emulation)
    call_user_notifier();
}

And it should work.

This means that, if both seccomp user notifiers and Landlock make it
upstream, then there should probably be a way to have a user notifier
bound to a seccomp filter and a set of landlock filters.
Mickaël Salaün March 6, 2018, 10:25 p.m. UTC | #4
On 28/02/2018 00:09, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>
>> On 27/02/2018 05:36, Andy Lutomirski wrote:
>>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>>> Hi,
>>>>
> 
>>>>
>>>> ## Why use the seccomp(2) syscall?
>>>>
>>>> Landlock use the same semantic as seccomp to apply access rule
>>>> restrictions. It add a new layer of security for the current process
>>>> which is inherited by its children. It makes sense to use an unique
>>>> access-restricting syscall (that should be allowed by seccomp filters)
>>>> which can only drop privileges. Moreover, a Landlock rule could come
>>>> from outside a process (e.g.  passed through a UNIX socket). It is then
>>>> useful to differentiate the creation/load of Landlock eBPF programs via
>>>> bpf(2), from rule enforcement via seccomp(2).
>>>
>>> This seems like a weak argument to me.  Sure, this is a bit different
>>> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
>>> awkward, but surely the bpf() multiplexer is even less applicable.
>>
>> I think using the seccomp syscall is fine, and everyone agreed on it.
>>
> 
> Ah, sorry, I completely misread what you wrote.  My apologies.  You
> can disregard most of my email.
> 
>>
>>>
>>> Also, looking forward, I think you're going to want a bunch of the
>>> stuff that's under consideration as new seccomp features.  Tycho is
>>> working on a "user notifier" feature for seccomp where, in addition to
>>> accepting, rejecting, or kicking to ptrace, you can send a message to
>>> the creator of the filter and wait for a reply.  I think that Landlock
>>> will want exactly the same feature.
>>
>> I don't think why this may be useful at all her. Landlock does not
>> filter at the syscall level but handles kernel object and actions as
>> does an LSM. That is the whole purpose of Landlock.
> 
> Suppose I'm writing a container manager.  I want to run "mount" in the
> container, but I don't want to allow moun() in general and I want to
> emulate certain mount() actions.  I can write a filter that catches
> mount using seccomp and calls out to the container manager for help.
> This isn't theoretical -- Tycho wants *exactly* this use case to be
> supported.

Well, I think this use case should be handled with something like
LD_PRELOAD and a helper library. FYI, I did something like this:
https://github.com/stemjail/stemshim

Otherwise, we should think about enabling a process to (dynamically)
extend/patch the vDSO (similar to LD_PRELOAD but at the syscall level
and works with static binaries) for a subset of processes (the same way
seccomp filters are inherited). It may be more powerful and flexible
than extending the kernel/seccomp to patch (buggy?) userland.

> 
> But using seccomp for this is indeed annoying.  It would be nice to
> use Landlock's ability to filter based on the filesystem type, for
> example.  So Tycho could write a Landlock rule like:
> 
> bool filter_mount(...)
> {
>   if (path needs emulation)
>     call_user_notifier();
> }
> 
> And it should work.
> 
> This means that, if both seccomp user notifiers and Landlock make it
> upstream, then there should probably be a way to have a user notifier
> bound to a seccomp filter and a set of landlock filters.
> 

Using seccomp filters and Landlock programs may be powerful. However,
for this use case, I think a *post-syscall* vDSO-like (which could get
some data returned by a Landlock program) may be much more flexible
(with less kernel code). What is needed here is a way to know the kernel
semantic (Landlock) and a way to patch userland without patching its
code (vDSO-like).
Andy Lutomirski March 6, 2018, 10:33 p.m. UTC | #5
On Tue, Mar 6, 2018 at 10:25 PM, Mickaël Salaün <mic@digikod.net> wrote:
>
>
> On 28/02/2018 00:09, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>
>>> On 27/02/2018 05:36, Andy Lutomirski wrote:
>>>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>> Hi,
>>>>>
>>
>>>>>
>>>>> ## Why use the seccomp(2) syscall?
>>>>>
>>>>> Landlock use the same semantic as seccomp to apply access rule
>>>>> restrictions. It add a new layer of security for the current process
>>>>> which is inherited by its children. It makes sense to use an unique
>>>>> access-restricting syscall (that should be allowed by seccomp filters)
>>>>> which can only drop privileges. Moreover, a Landlock rule could come
>>>>> from outside a process (e.g.  passed through a UNIX socket). It is then
>>>>> useful to differentiate the creation/load of Landlock eBPF programs via
>>>>> bpf(2), from rule enforcement via seccomp(2).
>>>>
>>>> This seems like a weak argument to me.  Sure, this is a bit different
>>>> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
>>>> awkward, but surely the bpf() multiplexer is even less applicable.
>>>
>>> I think using the seccomp syscall is fine, and everyone agreed on it.
>>>
>>
>> Ah, sorry, I completely misread what you wrote.  My apologies.  You
>> can disregard most of my email.
>>
>>>
>>>>
>>>> Also, looking forward, I think you're going to want a bunch of the
>>>> stuff that's under consideration as new seccomp features.  Tycho is
>>>> working on a "user notifier" feature for seccomp where, in addition to
>>>> accepting, rejecting, or kicking to ptrace, you can send a message to
>>>> the creator of the filter and wait for a reply.  I think that Landlock
>>>> will want exactly the same feature.
>>>
>>> I don't think why this may be useful at all her. Landlock does not
>>> filter at the syscall level but handles kernel object and actions as
>>> does an LSM. That is the whole purpose of Landlock.
>>
>> Suppose I'm writing a container manager.  I want to run "mount" in the
>> container, but I don't want to allow moun() in general and I want to
>> emulate certain mount() actions.  I can write a filter that catches
>> mount using seccomp and calls out to the container manager for help.
>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>> supported.
>
> Well, I think this use case should be handled with something like
> LD_PRELOAD and a helper library. FYI, I did something like this:
> https://github.com/stemjail/stemshim

I doubt that will work for containers.  Containers that use user
namespaces and, for example, setuid programs aren't going to honor
LD_PRELOAD.

>
> Otherwise, we should think about enabling a process to (dynamically)
> extend/patch the vDSO (similar to LD_PRELOAD but at the syscall level
> and works with static binaries) for a subset of processes (the same way
> seccomp filters are inherited). It may be more powerful and flexible
> than extending the kernel/seccomp to patch (buggy?) userland.

Egads!
Tycho Andersen March 6, 2018, 10:46 p.m. UTC | #6
On Tue, Mar 06, 2018 at 10:33:17PM +0000, Andy Lutomirski wrote:
> >> Suppose I'm writing a container manager.  I want to run "mount" in the
> >> container, but I don't want to allow moun() in general and I want to
> >> emulate certain mount() actions.  I can write a filter that catches
> >> mount using seccomp and calls out to the container manager for help.
> >> This isn't theoretical -- Tycho wants *exactly* this use case to be
> >> supported.
> >
> > Well, I think this use case should be handled with something like
> > LD_PRELOAD and a helper library. FYI, I did something like this:
> > https://github.com/stemjail/stemshim
> 
> I doubt that will work for containers.  Containers that use user
> namespaces and, for example, setuid programs aren't going to honor
> LD_PRELOAD.

Or anything that calls syscalls directly, like go programs.

Tycho
Mickaël Salaün March 6, 2018, 11:06 p.m. UTC | #7
On 06/03/2018 23:46, Tycho Andersen wrote:
> On Tue, Mar 06, 2018 at 10:33:17PM +0000, Andy Lutomirski wrote:
>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>> container, but I don't want to allow moun() in general and I want to
>>>> emulate certain mount() actions.  I can write a filter that catches
>>>> mount using seccomp and calls out to the container manager for help.
>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>> supported.
>>>
>>> Well, I think this use case should be handled with something like
>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>> https://github.com/stemjail/stemshim
>>
>> I doubt that will work for containers.  Containers that use user
>> namespaces and, for example, setuid programs aren't going to honor
>> LD_PRELOAD.
> 
> Or anything that calls syscalls directly, like go programs.

That's why the vDSO-like approach. Enforcing an access control is not
the issue here, patching a buggy userland (without patching its code) is
the issue isn't it?

As far as I remember, the main problem is to handle file descriptors
while "emulating" the kernel behavior. This can be done with a "shim"
code mapped in every processes. Chrome used something like this (in a
previous sandbox mechanism) as a kind of emulation (with the current
seccomp-bpf ). I think it should be doable to replace the (userland)
emulation code with an IPC wrapper receiving file descriptors through
UNIX socket.
Andy Lutomirski March 7, 2018, 1:21 a.m. UTC | #8
On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün <mic@digikod.net> wrote:
>
> On 06/03/2018 23:46, Tycho Andersen wrote:
>> On Tue, Mar 06, 2018 at 10:33:17PM +0000, Andy Lutomirski wrote:
>>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>>> container, but I don't want to allow moun() in general and I want to
>>>>> emulate certain mount() actions.  I can write a filter that catches
>>>>> mount using seccomp and calls out to the container manager for help.
>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>> supported.
>>>>
>>>> Well, I think this use case should be handled with something like
>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>> https://github.com/stemjail/stemshim
>>>
>>> I doubt that will work for containers.  Containers that use user
>>> namespaces and, for example, setuid programs aren't going to honor
>>> LD_PRELOAD.
>>
>> Or anything that calls syscalls directly, like go programs.
>
> That's why the vDSO-like approach. Enforcing an access control is not
> the issue here, patching a buggy userland (without patching its code) is
> the issue isn't it?
>
> As far as I remember, the main problem is to handle file descriptors
> while "emulating" the kernel behavior. This can be done with a "shim"
> code mapped in every processes. Chrome used something like this (in a
> previous sandbox mechanism) as a kind of emulation (with the current
> seccomp-bpf ). I think it should be doable to replace the (userland)
> emulation code with an IPC wrapper receiving file descriptors through
> UNIX socket.
>

Can you explain exactly what you mean by "vDSO-like"?

When a 64-bit program does a syscall, it just executes the SYSCALL
instruction.  The vDSO isn't involved at all.  32-bit programs usually
go through the vDSO, but not always.

It could be possible to force-load a DSO into an entire container and
rig up seccomp to intercept all SYSCALLs not originating from the DSO
such that they merely redirect control to the DSO, but that seems
quite messy.
Mickaël Salaün March 8, 2018, 11:51 p.m. UTC | #9
On 07/03/2018 02:21, Andy Lutomirski wrote:
> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>
>> On 06/03/2018 23:46, Tycho Andersen wrote:
>>> On Tue, Mar 06, 2018 at 10:33:17PM +0000, Andy Lutomirski wrote:
>>>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>>>> container, but I don't want to allow moun() in general and I want to
>>>>>> emulate certain mount() actions.  I can write a filter that catches
>>>>>> mount using seccomp and calls out to the container manager for help.
>>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>>> supported.
>>>>>
>>>>> Well, I think this use case should be handled with something like
>>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>>> https://github.com/stemjail/stemshim
>>>>
>>>> I doubt that will work for containers.  Containers that use user
>>>> namespaces and, for example, setuid programs aren't going to honor
>>>> LD_PRELOAD.
>>>
>>> Or anything that calls syscalls directly, like go programs.
>>
>> That's why the vDSO-like approach. Enforcing an access control is not
>> the issue here, patching a buggy userland (without patching its code) is
>> the issue isn't it?
>>
>> As far as I remember, the main problem is to handle file descriptors
>> while "emulating" the kernel behavior. This can be done with a "shim"
>> code mapped in every processes. Chrome used something like this (in a
>> previous sandbox mechanism) as a kind of emulation (with the current
>> seccomp-bpf ). I think it should be doable to replace the (userland)
>> emulation code with an IPC wrapper receiving file descriptors through
>> UNIX socket.
>>
> 
> Can you explain exactly what you mean by "vDSO-like"?
> 
> When a 64-bit program does a syscall, it just executes the SYSCALL
> instruction.  The vDSO isn't involved at all.  32-bit programs usually
> go through the vDSO, but not always.
> 
> It could be possible to force-load a DSO into an entire container and
> rig up seccomp to intercept all SYSCALLs not originating from the DSO
> such that they merely redirect control to the DSO, but that seems
> quite messy.

vDSO is a code mapped for all processes. As you said, these processes
may use it or not. What I was thinking about is to use the same concept,
i.e. map a "shim" code into each processes pertaining to a particular
hierarchy (the same way seccomp filters are inherited across processes).
With a seccomp filter matching some syscall (e.g. mount, open), it is
possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
shim code should then be able to emulate/patch what is needed, even
faking a file opening by receiving a file descriptor through a UNIX
socket. As did the Chrome sandbox, the seccomp filter may look at the
calling address to allow the shim code to call syscalls without being
catched, if needed. However, relying on SIGSYS may not fit with
arbitrary code. Using a new SECCOMP_RET_EMULATE (?) may be used to jump
to a specific process address, to emulate the syscall in an easier way
than only relying on a {c,e}BPF program.
Andy Lutomirski March 8, 2018, 11:53 p.m. UTC | #10
On Thu, Mar 8, 2018 at 11:51 PM, Mickaël Salaün <mic@digikod.net> wrote:
>
> On 07/03/2018 02:21, Andy Lutomirski wrote:
>> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>
>>> On 06/03/2018 23:46, Tycho Andersen wrote:
>>>> On Tue, Mar 06, 2018 at 10:33:17PM +0000, Andy Lutomirski wrote:
>>>>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>>>>> container, but I don't want to allow moun() in general and I want to
>>>>>>> emulate certain mount() actions.  I can write a filter that catches
>>>>>>> mount using seccomp and calls out to the container manager for help.
>>>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>>>> supported.
>>>>>>
>>>>>> Well, I think this use case should be handled with something like
>>>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>>>> https://github.com/stemjail/stemshim
>>>>>
>>>>> I doubt that will work for containers.  Containers that use user
>>>>> namespaces and, for example, setuid programs aren't going to honor
>>>>> LD_PRELOAD.
>>>>
>>>> Or anything that calls syscalls directly, like go programs.
>>>
>>> That's why the vDSO-like approach. Enforcing an access control is not
>>> the issue here, patching a buggy userland (without patching its code) is
>>> the issue isn't it?
>>>
>>> As far as I remember, the main problem is to handle file descriptors
>>> while "emulating" the kernel behavior. This can be done with a "shim"
>>> code mapped in every processes. Chrome used something like this (in a
>>> previous sandbox mechanism) as a kind of emulation (with the current
>>> seccomp-bpf ). I think it should be doable to replace the (userland)
>>> emulation code with an IPC wrapper receiving file descriptors through
>>> UNIX socket.
>>>
>>
>> Can you explain exactly what you mean by "vDSO-like"?
>>
>> When a 64-bit program does a syscall, it just executes the SYSCALL
>> instruction.  The vDSO isn't involved at all.  32-bit programs usually
>> go through the vDSO, but not always.
>>
>> It could be possible to force-load a DSO into an entire container and
>> rig up seccomp to intercept all SYSCALLs not originating from the DSO
>> such that they merely redirect control to the DSO, but that seems
>> quite messy.
>
> vDSO is a code mapped for all processes. As you said, these processes
> may use it or not. What I was thinking about is to use the same concept,
> i.e. map a "shim" code into each processes pertaining to a particular
> hierarchy (the same way seccomp filters are inherited across processes).
> With a seccomp filter matching some syscall (e.g. mount, open), it is
> possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
> shim code should then be able to emulate/patch what is needed, even
> faking a file opening by receiving a file descriptor through a UNIX
> socket. As did the Chrome sandbox, the seccomp filter may look at the
> calling address to allow the shim code to call syscalls without being
> catched, if needed. However, relying on SIGSYS may not fit with
> arbitrary code. Using a new SECCOMP_RET_EMULATE (?) may be used to jump
> to a specific process address, to emulate the syscall in an easier way
> than only relying on a {c,e}BPF program.
>

This could indeed be done, but I think that Tycho's approach is much
cleaner and probably faster.
Mickaël Salaün April 1, 2018, 10:04 p.m. UTC | #11
On 03/09/2018 12:53 AM, Andy Lutomirski wrote:
> On Thu, Mar 8, 2018 at 11:51 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>
>> On 07/03/2018 02:21, Andy Lutomirski wrote:
>>> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün <mic@digikod.net> wrote:
>>>>
>>>> On 06/03/2018 23:46, Tycho Andersen wrote:
>>>>> On Tue, Mar 06, 2018 at 10:33:17PM +0000, Andy Lutomirski wrote:
>>>>>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>>>>>> container, but I don't want to allow moun() in general and I want to
>>>>>>>> emulate certain mount() actions.  I can write a filter that catches
>>>>>>>> mount using seccomp and calls out to the container manager for help.
>>>>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>>>>> supported.
>>>>>>>
>>>>>>> Well, I think this use case should be handled with something like
>>>>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>>>>> https://github.com/stemjail/stemshim
>>>>>>
>>>>>> I doubt that will work for containers.  Containers that use user
>>>>>> namespaces and, for example, setuid programs aren't going to honor
>>>>>> LD_PRELOAD.
>>>>>
>>>>> Or anything that calls syscalls directly, like go programs.
>>>>
>>>> That's why the vDSO-like approach. Enforcing an access control is not
>>>> the issue here, patching a buggy userland (without patching its code) is
>>>> the issue isn't it?
>>>>
>>>> As far as I remember, the main problem is to handle file descriptors
>>>> while "emulating" the kernel behavior. This can be done with a "shim"
>>>> code mapped in every processes. Chrome used something like this (in a
>>>> previous sandbox mechanism) as a kind of emulation (with the current
>>>> seccomp-bpf ). I think it should be doable to replace the (userland)
>>>> emulation code with an IPC wrapper receiving file descriptors through
>>>> UNIX socket.
>>>>
>>>
>>> Can you explain exactly what you mean by "vDSO-like"?
>>>
>>> When a 64-bit program does a syscall, it just executes the SYSCALL
>>> instruction.  The vDSO isn't involved at all.  32-bit programs usually
>>> go through the vDSO, but not always.
>>>
>>> It could be possible to force-load a DSO into an entire container and
>>> rig up seccomp to intercept all SYSCALLs not originating from the DSO
>>> such that they merely redirect control to the DSO, but that seems
>>> quite messy.
>>
>> vDSO is a code mapped for all processes. As you said, these processes
>> may use it or not. What I was thinking about is to use the same concept,
>> i.e. map a "shim" code into each processes pertaining to a particular
>> hierarchy (the same way seccomp filters are inherited across processes).
>> With a seccomp filter matching some syscall (e.g. mount, open), it is
>> possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
>> shim code should then be able to emulate/patch what is needed, even
>> faking a file opening by receiving a file descriptor through a UNIX
>> socket. As did the Chrome sandbox, the seccomp filter may look at the
>> calling address to allow the shim code to call syscalls without being
>> catched, if needed. However, relying on SIGSYS may not fit with
>> arbitrary code. Using a new SECCOMP_RET_EMULATE (?) may be used to jump
>> to a specific process address, to emulate the syscall in an easier way
>> than only relying on a {c,e}BPF program.
>>
> 
> This could indeed be done, but I think that Tycho's approach is much
> cleaner and probably faster.
> 

I like it too but how does this handle file descriptors?
Tycho Andersen April 2, 2018, 12:39 a.m. UTC | #12
Hi Mickaël,

On Mon, Apr 02, 2018 at 12:04:36AM +0200, Mickaël Salaün wrote:
> >> vDSO is a code mapped for all processes. As you said, these processes
> >> may use it or not. What I was thinking about is to use the same concept,
> >> i.e. map a "shim" code into each processes pertaining to a particular
> >> hierarchy (the same way seccomp filters are inherited across processes).
> >> With a seccomp filter matching some syscall (e.g. mount, open), it is
> >> possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
> >> shim code should then be able to emulate/patch what is needed, even
> >> faking a file opening by receiving a file descriptor through a UNIX
> >> socket. As did the Chrome sandbox, the seccomp filter may look at the
> >> calling address to allow the shim code to call syscalls without being
> >> catched, if needed. However, relying on SIGSYS may not fit with
> >> arbitrary code. Using a new SECCOMP_RET_EMULATE (?) may be used to jump
> >> to a specific process address, to emulate the syscall in an easier way
> >> than only relying on a {c,e}BPF program.
> >>
> > 
> > This could indeed be done, but I think that Tycho's approach is much
> > cleaner and probably faster.
> > 
> 
> I like it too but how does this handle file descriptors?

I think it could be done fairly simply, the most complicated part is
probably designing an API that doesn't suck. But the basic idea would
be:

struct seccomp_notif_resp {
    __u64 id;
    __s32 error;
    __s64 val;
    __s32 fd;
};

if the handler responds with fd >= 0, we grab the tracer's fd,
duplicate it, and install it somewhere in the tracee's fd table. Since
things like socket() will want to return the fd number as its
installed and the handler doesn't know that, we'll probably want some
way to indicate that the kernel should return this value. We could
either mandate that if fd >= 0, that's the value that will be returned
from the syscall, or add another flag that says "no, install the fd,
but really return what's in val instead).

I guess we can't mandate that we return fd, because e.g. netlink
sockets can sometimes return fds as part of the netlink messages, and
not as the return value from the syscall.

Tycho