[v3,bpf-next,00/21] bpf: Sysctl hook

Message ID cover.1554485409.git.rdna@fb.com

Message

Andrey Ignatov April 5, 2019, 7:35 p.m. UTC
v2->v3:
- simplify C based selftests by relying on variable offset stack access.

v1->v2:
- add fs/proc/proc_sysctl.c maintainers to Cc:.

This patch set introduces a new BPF hook for sysctl.

It adds a new program type, BPF_PROG_TYPE_CGROUP_SYSCTL, and a new attach
type, BPF_CGROUP_SYSCTL.

The BPF_CGROUP_SYSCTL hook is placed before the call to the sysctl's
proc_handler, so that accesses (read/write) to a sysctl can be controlled
for a specific cgroup and either allowed, denied, or traced.

The hook has access to the sysctl name, the current sysctl value and (on
write only) the new sysctl value via corresponding helpers. The new value
can be overridden by the program. Both name and values (current/new) are
represented as strings, the same way they're visible in /proc/sys/. It is
up to the program to parse these strings.

To help with parsing the most common kind of sysctl value, a vector of
integers, two new helpers are provided: bpf_strtol and bpf_strtoul, with
semantics similar to user space strtol(3) and strtoul(3).
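
For illustration only, here is a user-space sketch of the kind of parsing
loop a program might run over a vector-of-integers value. It uses
strtoul(3), which the text above names as the model for bpf_strtoul. The
function name parse_ulong_vec and its error conventions are hypothetical
and not part of the patch set; a real BPF program would call the helper
instead and would have to stay within verifier limits.

```c
#include <ctype.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the patch set): parse up to `max`
 * whitespace-separated unsigned decimal integers from `s` into `out`.
 * Returns the number of values parsed, or -1 on malformed input,
 * overflow, or more than `max` values.
 */
int parse_ulong_vec(const char *s, unsigned long *out, size_t max)
{
	size_t n = 0;

	for (;;) {
		char *end;

		/* Skip separators (spaces, tabs, trailing newline). */
		while (isspace((unsigned char)*s))
			s++;
		if (*s == '\0')
			return (int)n;
		if (n == max)
			return -1;
		/* strtoul() would accept a leading sign; reject it here. */
		if (!isdigit((unsigned char)*s))
			return -1;

		errno = 0;
		out[n] = strtoul(s, &end, 10);
		if (errno == ERANGE)
			return -1;
		/* A value must be followed by a separator or end of string. */
		if (*end != '\0' && !isspace((unsigned char)*end))
			return -1;
		n++;
		s = end;
	}
}
```

The sketch rejects signs and trailing garbage explicitly because
strtoul(3) on its own would accept "-1" and would stop silently at "12x".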

The hook also provides bpf_sysctl context with two fields:
* @write indicates whether sysctl is being read (= 0) or written (= 1);
* @file_pos is sysctl file position to read from or write to, can be
  overridden.
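
As a rough illustration of how the two context fields could drive a
decision, the following is a user-space model, not a BPF program. The
struct sysctl_ctx_model, the enum names and the policy itself are
hypothetical; the `name` argument stands in for what a real program would
fetch with bpf_sysctl_get_name(). The sketch assumes the hook's return
convention of 0 for "deny" and 1 for "allow".

```c
#include <string.h>

/* User-space stand-in for the two context fields described above. */
struct sysctl_ctx_model {
	unsigned int write;    /* 0 = sysctl is being read, 1 = written */
	unsigned int file_pos; /* position in the sysctl file */
};

/* Assumed return convention of the hook: 0 denies, 1 allows. */
enum { SYSCTL_DENY = 0, SYSCTL_ALLOW = 1 };

/*
 * Hypothetical policy: reads are always allowed; writes are allowed
 * only to one named sysctl and only from the start of the file.
 */
int sysctl_policy(const struct sysctl_ctx_model *ctx, const char *name)
{
	if (!ctx->write)
		return SYSCTL_ALLOW;
	if (ctx->file_pos != 0)
		return SYSCTL_DENY;
	if (strcmp(name, "net/ipv4/tcp_mem") == 0)
		return SYSCTL_ALLOW;
	return SYSCTL_DENY;
}
```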

The hook allows better isolation of containerized applications that run
as root, so that one container can't change a sysctl and affect all other
containers on a host. It also makes changes to allowed sysctls safer and
simplifies sysctl tracing for cgroups.

Patch 1 is preliminary refactoring.
Patch 2 adds new program and attach types.
Patches 3-5 implement helpers to access sysctl name and value.
Patch 6 adds file_pos field to bpf_sysctl context.
Patch 7 updates UAPI in tools.
Patches 8-9 add support for the new hook to libbpf and corresponding test.
Patches 10-14 add selftests for the new hook.
Patch 15 adds support for new arg types to verifier: pointer to integer.
Patch 16 adds bpf_strto{l,ul} helpers to parse integers from sysctl value.
Patch 17 updates UAPI in tools.
Patch 18 updates bpf_helpers.h.
Patch 19 adds selftests for pointer to integer in verifier.
Patches 20-21 add selftests for bpf_strto{l,ul}, including integration
              C based test for sysctl value parsing.


Andrey Ignatov (21):
  bpf: Add base proto function for cgroup-bpf programs
  bpf: Sysctl hook
  bpf: Introduce bpf_sysctl_get_name helper
  bpf: Introduce bpf_sysctl_get_current_value helper
  bpf: Introduce bpf_sysctl_{get,set}_new_value helpers
  bpf: Add file_pos field to bpf_sysctl ctx
  bpf: Sync bpf.h to tools/
  libbpf: Support sysctl hook
  selftests/bpf: Test sysctl section name
  selftests/bpf: Test BPF_CGROUP_SYSCTL
  selftests/bpf: Test bpf_sysctl_get_name helper
  selftests/bpf: Test sysctl_get_current_value helper
  selftests/bpf: Test bpf_sysctl_{get,set}_new_value helpers
  selftests/bpf: Test file_pos field in bpf_sysctl ctx
  bpf: Introduce ARG_PTR_TO_{INT,LONG} arg types
  bpf: Introduce bpf_strtol and bpf_strtoul helpers
  bpf: Sync bpf.h to tools/
  selftests/bpf: Add sysctl and strtoX helpers to bpf_helpers.h
  selftests/bpf: Test ARG_PTR_TO_LONG arg type
  selftests/bpf: Test bpf_strtol and bpf_strtoul helpers
  selftests/bpf: C based test for sysctl and strtoX

 fs/proc/proc_sysctl.c                         |   25 +-
 include/linux/bpf-cgroup.h                    |   21 +
 include/linux/bpf.h                           |    4 +
 include/linux/bpf_types.h                     |    1 +
 include/linux/filter.h                        |   16 +
 include/uapi/linux/bpf.h                      |  139 +-
 kernel/bpf/cgroup.c                           |  364 +++-
 kernel/bpf/helpers.c                          |  131 ++
 kernel/bpf/syscall.c                          |    7 +
 kernel/bpf/verifier.c                         |   30 +
 tools/include/uapi/linux/bpf.h                |  139 +-
 tools/lib/bpf/libbpf.c                        |    3 +
 tools/lib/bpf/libbpf_probes.c                 |    1 +
 tools/testing/selftests/bpf/Makefile          |    3 +-
 tools/testing/selftests/bpf/bpf_helpers.h     |   19 +
 .../selftests/bpf/progs/test_sysctl_prog.c    |   70 +
 .../selftests/bpf/test_section_names.c        |    5 +
 tools/testing/selftests/bpf/test_sysctl.c     | 1567 +++++++++++++++++
 .../testing/selftests/bpf/verifier/int_ptr.c  |  160 ++
 19 files changed, 2697 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/test_sysctl_prog.c
 create mode 100644 tools/testing/selftests/bpf/test_sysctl.c
 create mode 100644 tools/testing/selftests/bpf/verifier/int_ptr.c

Comments

Kees Cook April 6, 2019, 4:43 p.m. UTC | #1
On Fri, Apr 5, 2019 at 12:36 PM Andrey Ignatov <rdna@fb.com> wrote:
>
> v2->v3:
> - simplify C based selftests by relying on variable offset stack access.
>
> v1->v2:
> - add fs/proc/proc_sysctl.c maintainers to Cc:.
>
> The patch set introduces new BPF hook for sysctl.
>
> It adds new program type BPF_PROG_TYPE_CGROUP_SYSCTL and attach type
> BPF_CGROUP_SYSCTL.
>
> BPF_CGROUP_SYSCTL hook is placed before calling to sysctl's proc_handler so
> that accesses (read/write) to sysctl can be controlled for specific cgroup
> and either allowed or denied, or traced.
>
> The hook has access to sysctl name, current sysctl value and (on write
> only) to new sysctl value via corresponding helpers. New sysctl value can
> be overridden by program. Both name and values (current/new) are
> represented as strings same way they're visible in /proc/sys/. It is up to
> program to parse these strings.
>
> To help with parsing the most common kind of sysctl value, vector of
> integers, two new helpers are provided: bpf_strtol and bpf_strtoul with
> semantic similar to user space strtol(3) and strtoul(3).
>
> The hook also provides bpf_sysctl context with two fields:
> * @write indicates whether sysctl is being read (= 0) or written (= 1);
> * @file_pos is sysctl file position to read from or write to, can be
>   overridden.
>
> The hook allows to make better isolation for containerized applications
> that are run as root so that one container can't change a sysctl and affect
> all other containers on a host, make changes to allowed sysctl in a safer
> way and simplify sysctl tracing for cgroups.

This sounds more like an LSM than BPF. So sysctls can get blocked when
new BPF is added to a cgroup? Can the BPF be removed (or rather,
what's the lifetime of such BPF?)

Alexei Starovoitov April 6, 2019, 5:02 p.m. UTC | #2
On Sat, Apr 06, 2019 at 09:43:50AM -0700, Kees Cook wrote:
> On Fri, Apr 5, 2019 at 12:36 PM Andrey Ignatov <rdna@fb.com> wrote:
> >
> > v2->v3:
> > - simplify C based selftests by relying on variable offset stack access.
> >
> > v1->v2:
> > - add fs/proc/proc_sysctl.c maintainers to Cc:.
> >
> > The patch set introduces new BPF hook for sysctl.
> >
> > It adds new program type BPF_PROG_TYPE_CGROUP_SYSCTL and attach type
> > BPF_CGROUP_SYSCTL.
> >
> > BPF_CGROUP_SYSCTL hook is placed before calling to sysctl's proc_handler so
> > that accesses (read/write) to sysctl can be controlled for specific cgroup
> > and either allowed or denied, or traced.
> >
> > The hook has access to sysctl name, current sysctl value and (on write
> > only) to new sysctl value via corresponding helpers. New sysctl value can
> > be overridden by program. Both name and values (current/new) are
> > represented as strings same way they're visible in /proc/sys/. It is up to
> > program to parse these strings.
> >
> > To help with parsing the most common kind of sysctl value, vector of
> > integers, two new helpers are provided: bpf_strtol and bpf_strtoul with
> > semantic similar to user space strtol(3) and strtoul(3).
> >
> > The hook also provides bpf_sysctl context with two fields:
> > * @write indicates whether sysctl is being read (= 0) or written (= 1);
> > * @file_pos is sysctl file position to read from or write to, can be
> >   overridden.
> >
> > The hook allows to make better isolation for containerized applications
> > that are run as root so that one container can't change a sysctl and affect
> > all other containers on a host, make changes to allowed sysctl in a safer
> > way and simplify sysctl tracing for cgroups.
> 
> This sounds more like an LSM than BPF. 

not at all. the key difference is being cgroup scoped.
essentially for different containers.

> So sysctls can get blocked when
> new BPF is added to a cgroup? 

bpf prog is attached to this hook in a particular cgroup
and executed for sysctls for tasks that belong to that cgroup.

> Can the BPF be removed (or rather,
> what's the lifetime of such BPF?)

same as all other cgroup-bpf hooks.
Do you have a specific concern or just asking how life time of programs
is managed?
High level description of lifetime is here:
https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html

Kees Cook April 9, 2019, 4:50 p.m. UTC | #3
On Sat, Apr 6, 2019 at 10:03 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sat, Apr 06, 2019 at 09:43:50AM -0700, Kees Cook wrote:
> > On Fri, Apr 5, 2019 at 12:36 PM Andrey Ignatov <rdna@fb.com> wrote:
> > > BPF_CGROUP_SYSCTL hook is placed before calling to sysctl's proc_handler so
> > > that accesses (read/write) to sysctl can be controlled for specific cgroup
> > > and either allowed or denied, or traced.
> >
> > This sounds more like an LSM than BPF.
>
> not at all. the key difference is being cgroup scoped.
> essentially for different containers.

Okay, works for me. I was looking at it from the perspective of
something providing resource access control policy, which usually
falls into the LSM world.

> bpf prog is attached to this hook in a particular cgroup
> and executed for sysctls for tasks that belong to that cgroup.

So it's root limiting root-in-a-container? Nice to have some
boundaries there, for sure.

> > Can the BPF be removed (or rather,
> > what's the lifetime of such BPF?)
>
> same as all other cgroup-bpf hooks.
> Do you have a specific concern or just asking how life time of programs
> is managed?
> High level description of lifetime is here:
> https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html

I'm mostly curious about the access control stacking. i.e. can
in-container root add new eBPF to its own cgroup, and if so, can it
undo the restrictions already present? (I assume it can't, but figured
I'd ask...)

Jann Horn April 9, 2019, 8:41 p.m. UTC | #4
On Tue, Apr 9, 2019 at 10:26 PM Andrey Ignatov <rdna@fb.com> wrote:
> The patch set introduces new BPF hook for sysctl.
>
> It adds new program type BPF_PROG_TYPE_CGROUP_SYSCTL and attach type
> BPF_CGROUP_SYSCTL.
>
> BPF_CGROUP_SYSCTL hook is placed before calling to sysctl's proc_handler so
> that accesses (read/write) to sysctl can be controlled for specific cgroup
> and either allowed or denied, or traced.

Don't look at the credentials of "current" in a read or write handler.
Consider what happens if, for example, someone inside a cgroup opens a
sysctl file and passes the file descriptor to another process outside
the cgroup over a unix domain socket, and that other process then
writes to it. Either do your access check on open, or use the
credentials that were saved during open() in the read/write handler.
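
To make the scenario concrete, here is a minimal sketch of descriptor
passing over a unix domain socket with SCM_RIGHTS. The names send_fd and
recv_fd are illustrative and do not come from the patch set. The receiver
ends up with a descriptor for the same open file, so a later write() on it
is issued by the receiving process, whatever cgroup that process is in.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send descriptor `fd` over unix socket `sock`. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
	char byte = 0; /* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align; /* forces correct alignment */
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from unix socket `sock`. Returns it or -1. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = {
		.msg_iov = &iov,
		.msg_iovlen = 1,
		.msg_control = u.buf,
		.msg_controllen = sizeof(u.buf),
	};
	struct cmsghdr *cmsg;
	int fd;

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```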

> The hook has access to sysctl name, current sysctl value and (on write
> only) to new sysctl value via corresponding helpers. New sysctl value can
> be overridden by program. Both name and values (current/new) are
> represented as strings same way they're visible in /proc/sys/. It is up to
> program to parse these strings.

But even if a filter is installed that prevents all access to a
sysctl, you can still read it by installing your own filter that, when
a read is attempted the next time, dumps the value into a map or
something like that, right?

> To help with parsing the most common kind of sysctl value, vector of
> integers, two new helpers are provided: bpf_strtol and bpf_strtoul with
> semantic similar to user space strtol(3) and strtoul(3).
>
> The hook also provides bpf_sysctl context with two fields:
> * @write indicates whether sysctl is being read (= 0) or written (= 1);
> * @file_pos is sysctl file position to read from or write to, can be
>   overridden.
>
> The hook allows to make better isolation for containerized applications
> that are run as root so that one container can't change a sysctl and affect
> all other containers on a host, make changes to allowed sysctl in a safer
> way and simplify sysctl tracing for cgroups.

Why can't you use a user namespace and isolate things properly that
way? That would be much cleaner, wouldn't it?

Andrey Ignatov April 9, 2019, 11:04 p.m. UTC | #5
Jann Horn <jannh@google.com> [Tue, 2019-04-09 13:42 -0700]:
> On Tue, Apr 9, 2019 at 10:26 PM Andrey Ignatov <rdna@fb.com> wrote:
> > The patch set introduces new BPF hook for sysctl.
> >
> > It adds new program type BPF_PROG_TYPE_CGROUP_SYSCTL and attach type
> > BPF_CGROUP_SYSCTL.
> >
> > BPF_CGROUP_SYSCTL hook is placed before calling to sysctl's proc_handler so
> > that accesses (read/write) to sysctl can be controlled for specific cgroup
> > and either allowed or denied, or traced.
> 
> Don't look at the credentials of "current" in a read or write handler.
> Consider what happens if, for example, someone inside a cgroup opens a
> sysctl file and passes the file descriptor to another process outside
> the cgroup over a unix domain socket, and that other process then
> writes to it. Either do your access check on open, or use the
> credentials that were saved during open() in the read/write handler.

For that to work, someone inside the cgroup must already have control
over something running as root [1] outside of the cgroup, i.e. the game
is already lost, even without this hook.

[1] Since proc_sys_read() / proc_sys_write() check sysctl_perm() before
    execution reaches the hook.

This patch set doesn't look at credentials at all and relies on what
checks were already done at sys_open time or in proc_sys_call_handler()
before execution reaches the hook.

> > The hook has access to sysctl name, current sysctl value and (on write
> > only) to new sysctl value via corresponding helpers. New sysctl value can
> > be overridden by program. Both name and values (current/new) are
> > represented as strings same way they're visible in /proc/sys/. It is up to
> > program to parse these strings.
> 
> But even if a filter is installed that prevents all access to a
> sysctl, you can still read it by installing your own filter that, when
> a read is attempted the next time, dumps the value into a map or
> something like that, right?

No. This can be controlled by cgroup hierarchy and appropriate attach
flags, same way as with any other cgroup-bpf hook.

E.g. imagine there is a cgroup hierarchy:
  root/slice/container/

and container application runs in root/slice/container/ in a cgroup
namespace (CLONE_NEWCGROUP) that makes visible only "container/" part of
the hierarchy, i.e. from inside container application can't even see
"root/slice/".

Administrator can then attach sysctl hook to "root/slice/" with attach
flag NONE (bpf_attr.attach_flags = 0), which means nobody down the
hierarchy can override the program attached by administrator.

> > To help with parsing the most common kind of sysctl value, vector of
> > integers, two new helpers are provided: bpf_strtol and bpf_strtoul with
> > semantic similar to user space strtol(3) and strtoul(3).
> >
> > The hook also provides bpf_sysctl context with two fields:
> > * @write indicates whether sysctl is being read (= 0) or written (= 1);
> > * @file_pos is sysctl file position to read from or write to, can be
> >   overridden.
> >
> > The hook allows to make better isolation for containerized applications
> > that are run as root so that one container can't change a sysctl and affect
> > all other containers on a host, make changes to allowed sysctl in a safer
> > way and simplify sysctl tracing for cgroups.
> 
> Why can't you use a user namespace and isolate things properly that
> way? That would be much cleaner, wouldn't it?

I'm not sure I understand how user namespace helps here. From my
understanding it can only completely deny access to sysctl and can't do
fine-grained control for specific sysctl knobs. It also can't make
allow/deny decision based on sysctl value being written.

Basically user namespace is all or nothing. This sysctl hook provides a
way to implement fine-grained access control for sysctl knobs based on
sysctl name or value being written or whatever else policy administrator
can come up with.

Andrey Ignatov April 9, 2019, 11:17 p.m. UTC | #6
Kees Cook <keescook@chromium.org> [Tue, 2019-04-09 09:51 -0700]:
> On Sat, Apr 6, 2019 at 10:03 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sat, Apr 06, 2019 at 09:43:50AM -0700, Kees Cook wrote:
> > > On Fri, Apr 5, 2019 at 12:36 PM Andrey Ignatov <rdna@fb.com> wrote:
> > > Can the BPF be removed (or rather,
> > > what's the lifetime of such BPF?)
> >
> > same as all other cgroup-bpf hooks.
> > Do you have a specific concern or just asking how life time of programs
> > is managed?
> > High level description of lifetime is here:
> > > https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html
> 
> I'm mostly curious about the access control stacking. i.e. can
> in-container root add new eBPF to its own cgroup, and if so, can it
> undo the restrictions already present? (I assume it can't, but figured
> I'd ask...)

Since I answered similar question from Jann below, I'll answer it here
as well (even though it was addressed to Alexei).

Stacking can be controlled by attach flags (NONE, BPF_F_ALLOW_OVERRIDE,
BPF_F_ALLOW_MULTI) described in include/uapi/linux/bpf.h.

Basically if one attaches a program to a cgroup with
`bpf_attr.attach_flags = 0` (0 is "NONE"), then nobody can override it
by their own programs of same type in any sub-cgroup. It can be hardened
further by cgroup namespace so that in-container root doesn't even see
part of cgroup hierarchy where cgroup-bpf program is attached to with
attach flags NONE.
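
The stacking rule described above can be sketched as a small user-space
model. It is loosely based on my reading of the hierarchy check in
kernel/bpf/cgroup.c and is deliberately simplified: struct cgroup_model is
hypothetical, only one attach type is modeled, and the additional rule
that flags must match programs already attached to the same cgroup is
left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Flag values as defined in include/uapi/linux/bpf.h. */
#define BPF_F_ALLOW_OVERRIDE (1U << 0)
#define BPF_F_ALLOW_MULTI    (1U << 1)

/* Hypothetical stand-in for a cgroup, for a single attach type. */
struct cgroup_model {
	const struct cgroup_model *parent;
	unsigned int attach_flags; /* flags used when attaching here */
	unsigned int nr_progs;     /* programs attached directly here */
};

/*
 * May a program be attached to `cg`? Walk up the ancestors: the nearest
 * ancestor that has a program attached decides. With attach flags NONE
 * (0) on that ancestor, nothing below it may attach.
 */
bool hierarchy_allows_attach(const struct cgroup_model *cg)
{
	const struct cgroup_model *p;

	for (p = cg->parent; p; p = p->parent) {
		if (p->attach_flags & BPF_F_ALLOW_MULTI)
			return true;
		if (p->nr_progs > 0)
			return (p->attach_flags & BPF_F_ALLOW_OVERRIDE) != 0;
	}
	return true;
}
```

With a hierarchy like root/slice/container/ and a program attached to
slice with flags 0, the model refuses an attach at container, while the
same attach succeeds if slice used BPF_F_ALLOW_OVERRIDE or
BPF_F_ALLOW_MULTI.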

Jann Horn April 9, 2019, 11:22 p.m. UTC | #7
On Wed, Apr 10, 2019 at 1:04 AM Andrey Ignatov <rdna@fb.com> wrote:
> Jann Horn <jannh@google.com> [Tue, 2019-04-09 13:42 -0700]:
> > On Tue, Apr 9, 2019 at 10:26 PM Andrey Ignatov <rdna@fb.com> wrote:
> > > The patch set introduces new BPF hook for sysctl.
> > >
> > > It adds new program type BPF_PROG_TYPE_CGROUP_SYSCTL and attach type
> > > BPF_CGROUP_SYSCTL.
> > >
> > > BPF_CGROUP_SYSCTL hook is placed before calling to sysctl's proc_handler so
> > > that accesses (read/write) to sysctl can be controlled for specific cgroup
> > > and either allowed or denied, or traced.
> >
> > Don't look at the credentials of "current" in a read or write handler.
> > Consider what happens if, for example, someone inside a cgroup opens a
> > sysctl file and passes the file descriptor to another process outside
> > the cgroup over a unix domain socket, and that other process then
> > writes to it. Either do your access check on open, or use the
> > credentials that were saved during open() in the read/write handler.
>
> This way this someone inside cgroup should already have control over
> something running as root [1] outside of this cgroup, i.e. the game is
> already lost, even without this hook.
>
> [1] Since proc_sys_read() / proc_sys_write() check sysctl_perm() before
>     execution reaches the hook.

You don't need to have _control_ over something running as root. You
only need to be able to communicate with something that expects to be
passed in file descriptors for some purpose.

> This patch set doesn't look at credentials at all and relies on what
> checks were already done at sys_open time or in proc_sys_call_handler()
> before execution reaches the hook.

You're looking at the cgroup though.

> > > The hook has access to sysctl name, current sysctl value and (on write
> > > only) to new sysctl value via corresponding helpers. New sysctl value can
> > > be overridden by program. Both name and values (current/new) are
> > > represented as strings same way they're visible in /proc/sys/. It is up to
> > > program to parse these strings.
> >
> > But even if a filter is installed that prevents all access to a
> > sysctl, you can still read it by installing your own filter that, when
> > a read is attempted the next time, dumps the value into a map or
> > something like that, right?
>
> No. This can be controlled by cgroup hierarchy and appropriate attach
> flags, same way as with any other cgroup-bpf hook.
>
> E.g. imagine there is a cgroup hierarchy:
>   root/slice/container/
>
> and container application runs in root/slice/container/ in a cgroup
> namespace (CLONE_NEWCGROUP) that makes visible only "container/" part of
> the hierarchy, i.e. from inside container application can't even see
> "root/slice/".
>
> Administrator can then attach sysctl hook to "root/slice/" with attach
> flag NONE (bpf_attr.attach_flags = 0) what means nobody down the
> hierarchy can override the program attached by administrator.

Ah, okay.

> > > To help with parsing the most common kind of sysctl value, vector of
> > > integers, two new helpers are provided: bpf_strtol and bpf_strtoul with
> > > semantic similar to user space strtol(3) and strtoul(3).
> > >
> > > The hook also provides bpf_sysctl context with two fields:
> > > * @write indicates whether sysctl is being read (= 0) or written (= 1);
> > > * @file_pos is sysctl file position to read from or write to, can be
> > >   overridden.
> > >
> > > The hook allows to make better isolation for containerized applications
> > > that are run as root so that one container can't change a sysctl and affect
> > > all other containers on a host, make changes to allowed sysctl in a safer
> > > way and simplify sysctl tracing for cgroups.
> >
> > Why can't you use a user namespace and isolate things properly that
> > way? That would be much cleaner, wouldn't it?
>
> I'm not sure I understand how user namespace helps here. From my
> understanding it can only completely deny access to sysctl and can't do
> fine-grained control for specific sysctl knobs. It also can't make
> allow/deny decision based on sysctl value being written.
>
> Basically user namespace is all or nothing. This sysctl hook provides a
> way to implement fine-grained access control for sysctl knobs based on
> sysctl name or value being written or whatever else policy administrator
> can come up with.

But there's a reason why user namespaces are all-or-nothing on these
things. If the kernel does not explicitly make a sysctl available to a
container, the sysctl has global effects, and therefore probably
shouldn't be exposed to anything other than someone with
administrative privileges across the whole system. If the kernel does
make it available to a container, the sysctl's effects are limited to
the container (or otherwise it's a kernel bug).

Can you give examples of sysctls that you want to permit using from
containers, that wouldn't be accessible in a user namespace?

Alexei Starovoitov April 9, 2019, 11:34 p.m. UTC | #8
On Tue, Apr 9, 2019 at 4:22 PM Jann Horn <jannh@google.com> wrote:
>
> But there's a reason why user namespaces are all-or-nothing on these
> things. If the kernel does not explicitly make a sysctl available to a
> container, the sysctl has global effects, and therefore probably
> shouldn't be exposed to anything other than someone with
> administrative privileges across the whole system. If the kernel does
> make it available to a container, the sysctl's effects are limited to
> the container (or otherwise it's a kernel bug).
>
> Can you give examples of sysctls that you want to permit using from
> containers, that wouldn't be accessible in a user namespace?

I think this discussion has started with incorrect
assumptions about the goal of the patch set.
There is no _security_ part here.
The sysctl hook is there to prevent silly things from being done by chef
and apps. Most interesting sysctls need root anyway.
The root can detach all progs and do its thing.
Consider tcp_mem sysctl. We've seen it's been misconfigured
and caused performance issues. bpf prog can track what is
being written, alarm, etc.
User namespaces are not applicable here.

Alexei Starovoitov April 12, 2019, 9:27 p.m. UTC | #9
On Fri, Apr 05, 2019 at 12:35:22PM -0700, Andrey Ignatov wrote:
> v2->v3:
> - simplify C based selftests by relying on variable offset stack access.
> 
> v1->v2:
> - add fs/proc/proc_sysctl.c maintainers to Cc:.
> 
> The patch set introduces new BPF hook for sysctl.
> 
> It adds new program type BPF_PROG_TYPE_CGROUP_SYSCTL and attach type
> BPF_CGROUP_SYSCTL.
> 
> BPF_CGROUP_SYSCTL hook is placed before calling to sysctl's proc_handler so
> that accesses (read/write) to sysctl can be controlled for specific cgroup
> and either allowed or denied, or traced.
> 
> The hook has access to sysctl name, current sysctl value and (on write
> only) to new sysctl value via corresponding helpers. New sysctl value can
> be overridden by program. Both name and values (current/new) are
> represented as strings same way they're visible in /proc/sys/. It is up to
> program to parse these strings.
> 
> To help with parsing the most common kind of sysctl value, vector of
> integers, two new helpers are provided: bpf_strtol and bpf_strtoul with
> semantic similar to user space strtol(3) and strtoul(3).
> 
> The hook also provides bpf_sysctl context with two fields:
> * @write indicates whether sysctl is being read (= 0) or written (= 1);
> * @file_pos is sysctl file position to read from or write to, can be
>   overridden.
> 
> The hook allows to make better isolation for containerized applications
> that are run as root so that one container can't change a sysctl and affect
> all other containers on a host, make changes to allowed sysctl in a safer
> way and simplify sysctl tracing for cgroups.

Applied to bpf-next. Thanks!

Andrey,
as a follow up please add a doc describing that this bpf hook cannot be used
as a security mechanism to limit sysctl usage.
Like: explaining that task_dfl_cgroup(current) is checked at the time of read/write,
it's not a replacement for sysctl_perm, root can detach bpf progs, etc.
I think the commit 7568f4cbbeae ("selftests/bpf: C based test for sysctl and strtoX")
gives an idea of what is possible with this hook and intended usage,
but it needs to be clearly documented that it's for 'trusted root' environment.