[01/15] add Documentation/namespaces/user_namespace.txt (v3)

Message ID 1314993400-6910-4-git-send-email-serge@hallyn.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Serge E. Hallyn Sept. 2, 2011, 7:56 p.m. UTC
From: "Serge E. Hallyn" <serge@hallyn.com>

Quoting David Howells (dhowells@redhat.com):
> Randy Dunlap <rdunlap@xenotime.net> wrote:
>
> > > +Any task in or resource belonging to the initial user namespace will, to this
> > > +new task, appear to belong to UID and GID -1 - which is usually known as
> >
> > that extra hyphen is confusing.  how about:
> >
> >                               to UID and GID -1, which is
>
> 'which are'.
>
> David

This will hold some info about the design.  Currently it contains
future todos, issues and questions.

Changelog:
   jul 26: incorporate feedback from David Howells.
   jul 29: incorporate feedback from Randy Dunlap.

Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
---
 Documentation/namespaces/user_namespace.txt |  107 +++++++++++++++++++++++++++
 1 files changed, 107 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/namespaces/user_namespace.txt

Comments

Andrew Morton Sept. 7, 2011, 10:50 p.m. UTC | #1
On Fri,  2 Sep 2011 19:56:26 +0000
Serge Hallyn <serge@hallyn.com> wrote:

> +Note that this userid mapping for the VFS is not yet implemented, though the
> +lkml and containers mailing list archives will show several previous
> +prototypes.  In the end, those got hung up waiting on the concept of targeted
> +capabilities to be developed, which, thanks to the insight of Eric Biederman,
> +they finally did.

not-yet-implemented things worry me.  When can we expect this to
happen, and how big and ugly will it be?

I'm not seeing many (any) reviewed-by's on these patches.  I could get
down and stare at them myself, but that wouldn't be very useful.  This
work goes pretty deep and is quite security-affecting.  And network-affecting.
Can you round up some suitable people and get the reviewing and testing happening
please?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Serge Hallyn Sept. 9, 2011, 1:10 p.m. UTC | #2
Quoting Andrew Morton (akpm@linux-foundation.org):
> On Fri,  2 Sep 2011 19:56:26 +0000
> Serge Hallyn <serge@hallyn.com> wrote:
> 
> > +Note that this userid mapping for the VFS is not yet implemented, though the
> > +lkml and containers mailing list archives will show several previous
> > +prototypes.  In the end, those got hung up waiting on the concept of targeted
> > +capabilities to be developed, which, thanks to the insight of Eric Biederman,
> > +they finally did.
> 
> not-yet-implemented things worry me.  When can we expect this to
> happen, and how big and ugly will it be?

Hi Andrew,

We did a proof of concept of the simplest version of this in early August
(see git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-userns-devel.git)
which actually was very un-scary.  So technically we could push it at the
same time as this set, but I thought that might just be too much for
review in one cycle.  That set (Eric's) is the very simplest approach
which tags an entire filesystem with a user namespace.

We would also want to pursue the more baroque approach, where filesystems
themselves are user-namespace aware.  I did an approach like that in
2008, see
https://lists.linux-foundation.org/pipermail/containers/2008-August/012679.html
It again is very do-able without being ugly, but, importantly, user
namespaces are usable for containers without that.  For starters, we only
need /proc and /sys to be user namespace aware (since they must allow
access from multiple namespaces), and that is simple as they are not
persistent.

So I believe that this is the last scary patchset, and that user
namespaces could actually be usable by the end of the year!

> I'm not seeing many (any) reviewed-by's on these patches.  I could get
> down and stare at them myself, but that wouldn't be very useful.  This
> work goes pretty deep and is quite security-affecting.  And network-affecting.
> Can you round up some suitable people and get the reviewing and testing happening
> please?

Will try.  Unfortunately I missed my chance to beg and bribe people in
person at plumbers :(

thanks,
-serge
Vasiliy Kulikov Sept. 26, 2011, 7:17 p.m. UTC | #3
(cc'ed kernel-hardening)


Hi Serge,

I haven't studied the patches deeply yet (sorry!), but I have some
long-term questions about the technique in general.  I couldn't find
answers to them in the documentation.

First, the patches by design expose much kernel code to unprivileged
userspace processes.  This code doesn't expect malformed data (e.g. VFS,
specific filesystems, block layer, char drivers, sysadmin part of LSMs,
etc. etc.).  By relaxing permission rules you greatly increase attack
surface of the kernel from unprivileged users.  Are you (or somebody
else) planning to audit this code?

Also, will it be possible to somehow restrict what specific kernel
facilities are accessible from users (IOW, what root emulation
limitations are in action)?  It is useful both for the sysadmin,
who might not want to allow users to do such things, and from the
security POV, in the sense of attack surface reduction.

The patches explicitly enable some features for users on white list
basis.  It's possible to do it for simple cases, but what are you going
to do with multiplexing functions where there is a permission check
before the actual multiplexing?  FS, networking drivers, etc.  Are you
going to do the same thing as net_namespace does? - For each multiplexed
entity create bool ->ns_aware which is false by default for all
"untrusted"/not prepared protocols and is true for audited/prepared
protocols.  Or probably you have something else in mind?

Thanks,

On Fri, Sep 02, 2011 at 19:56 +0000, Serge Hallyn wrote:
> From: "Serge E. Hallyn" <serge@hallyn.com>
> 
> Quoting David Howells (dhowells@redhat.com):
> > Randy Dunlap <rdunlap@xenotime.net> wrote:
> >
> > > > +Any task in or resource belonging to the initial user namespace will, to this
> > > > +new task, appear to belong to UID and GID -1 - which is usually known as
> > >
> > > that extra hyphen is confusing.  how about:
> > >
> > >                               to UID and GID -1, which is
> >
> > 'which are'.
> >
> > David
> 
> This will hold some info about the design.  Currently it contains
> future todos, issues and questions.
> 
> Changelog:
>    jul 26: incorporate feedback from David Howells.
>    jul 29: incorporate feedback from Randy Dunlap.
> 
> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
> Cc: Eric W. Biederman <ebiederm@xmission.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Randy Dunlap <rdunlap@xenotime.net>
> ---
>  Documentation/namespaces/user_namespace.txt |  107 +++++++++++++++++++++++++++
>  1 files changed, 107 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/namespaces/user_namespace.txt
> 
> diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
> new file mode 100644
> index 0000000..b0bc480
> --- /dev/null
> +++ b/Documentation/namespaces/user_namespace.txt
> @@ -0,0 +1,107 @@
> +Description
> +===========
> +
> +Traditionally, each task is owned by a user ID (UID) and belongs to one or more
> +groups (GID).  Both are simple numeric IDs, though userspace usually translates
> +them to names.  The user namespace allows tasks to have different views of the
> +UIDs and GIDs associated with tasks and other resources.  (See 'UID mapping'
> +below for more.)
> +
> +The user namespace is a simple hierarchical one.  The system starts with all
> +tasks belonging to the initial user namespace.  A task creates a new user
> +namespace by passing the CLONE_NEWUSER flag to clone(2).  This requires the
> +creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
> +but it does not need to be running as root.  The clone(2) call will result in a
> +new task which to itself appears to be running as UID and GID 0, but to its
> +creator seems to have the creator's credentials.
> +
> +To this new task, any resource belonging to the initial user namespace will
> +appear to belong to user and group 'nobody', which are UID and GID -1.
> +Permission to open such files will be granted according to world access
> +permissions.  UID comparisons and group membership checks will return false,
> +and privilege will be denied.
> +
> +When a task belonging to (for example) userid 500 in the initial user namespace
> +creates a new user namespace, even though the new task will see itself as
> +belonging to UID 0, any task in the initial user namespace will see it as
> +belonging to UID 500.  Therefore, UID 500 in the initial user namespace will be
> +able to kill the new task.  Files created by the new user will (eventually) be
> +seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
> +the initial user namespace as belonging to UID 500.
> +
> +Note that this userid mapping for the VFS is not yet implemented, though the
> +lkml and containers mailing list archives will show several previous
> +prototypes.  In the end, those got hung up waiting on the concept of targeted
> +capabilities to be developed, which, thanks to the insight of Eric Biederman,
> +they finally did.
> +
> +Relationship between the User namespace and other namespaces
> +============================================================
> +
> +Other namespaces, such as UTS and network, are owned by a user namespace.  When
> +such a namespace is created, it is assigned to the user namespace of the task
> +by which it was created.  Therefore, attempts to exercise privilege to
> +resources in, for instance, a particular network namespace, can be properly
> +validated by checking whether the caller has the needed privilege (i.e.
> +CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
> +This is done using the ns_capable() function.
> +
> +As an example, if a new task is cloned with a private user namespace but
> +no private network namespace, then the task's network namespace is owned
> +by the parent user namespace.  The new task has no privilege to the
> +parent user namespace, so it will not be able to create or configure
> +network devices.  If, instead, the task were cloned with both private
> +user and network namespaces, then the private network namespace is owned
> +by the private user namespace, and so root in the new user namespace
> +will have privilege targeted to the network namespace.  It will be able
> +to create and configure network devices.
> +
> +UID Mapping
> +===========
> +The current plan (see 'flexible UID mapping' at
> +https://wiki.ubuntu.com/UserNamespace) is:
> +
> +The UID/GID stored on disk will be that in the init_user_ns.  Most likely
> +UID/GID in other namespaces will be stored in xattrs.  But Eric was advocating
> +(a few years ago) leaving the details up to filesystems while providing a lib/
> +stock implementation.  See the thread around here:
> +http://www.mail-archive.com/devel@openvz.org/msg09331.html
> +
> +
> +Working notes
> +=============
> +Capability checks for actions related to syslog must be against the
> +init_user_ns until syslog is containerized.
> +
> +Same is true for reboot and power, control groups, devices, and time.
> +
> +Perf actions (kernel/event/core.c for instance) will always be constrained to
> +init_user_ns.
> +
> +Q:
> +Is accounting considered properly containerized with respect to pidns?  (it
> +appears to be).  If so, then we can change the capable() check in
> +kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_PACCT)'
> +
> +Q:
> +For things like nice and schedaffinity, we could allow root in a container to
> +control those, and leave only cgroups to constrain the container.  I'm not sure
> +whether that is right, or whether it violates admin expectations.
> +
> +I deferred some of commoncap.c.  I'm punting on xattr stuff as they take
> +dentries, not inodes.
> +
> +For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
> +them) target the capability checks at the user_ns owning the tty.  That will
> +have to wait until we get userns owning files straightened out.
> +
> +We need to figure out how to label devices.  Should we just toss a user_ns
> +right into struct device?
> +
> +capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
> +some day LSMs were to be containerized, near zero chance.
> +
> +inode_owner_or_capable() should probably take an optional ns and cap parameter.
> +If cap is 0, then CAP_FOWNER is checked.  If ns is NULL, we derive the ns from
> +inode.  But if ns is provided, then callers who need to derive
> +inode_userns(inode) anyway can save a few cycles.
> -- 
> 1.7.5.4
Serge Hallyn Sept. 27, 2011, 1:21 p.m. UTC | #4
Quoting Vasiliy Kulikov (segoon@openwall.com):
> (cc'ed kernel-hardening)
> 
> 
> Hi Serge,
> 
> I haven't studied the patches deeply yet (sorry!), but I have some
> long-term questions about the technique in general.  I couldn't find
> answers to them in the documentation.

Great - thanks for your time, Vasiliy.

There is documentation at https://wiki.ubuntu.com/UserNamespace,
and I was adding a Documentation/namespaces/user_namespace.txt file
(which hasn't gone in yet) which you can see here:
https://lkml.org/lkml/2011/7/26/351

But those don't answer your questions sufficiently.

> First, the patches by design expose much kernel code to unprivileged
> userspace processes.  This code doesn't expect malformed data (e.g. VFS,
> specific filesystems, block layer, char drivers, sysadmin part of LSMs,
> etc. etc.).  By relaxing permission rules you greatly increase attack
> surface of the kernel from unprivileged users.  Are you (or somebody
> else) planning to audit this code?

I had wanted to (but didn't) propose a discussion at ksummit about how
best to approach the filesystem code.  That's not even just for user
namespaces - patches have been floated in the past to make mount an
unprivileged operation depending on the FS and the user's permission
over the device and target.  So I don't know if a combination of auditing
and fuzzing is the way to go, or what, and wanted to get input from
some people who are more knowledgeable on that topic than me.

You're right about other kernel code as well.

I'll certainly join in this effort, but don't want to go blindly
charging in without some advice/guidance about the best way to do
this and, if others are interested, coordinate it.

We can start by looking through all code which is currently under
ns_capable(), and analyzing that.  But what tools do we have
available to perform the analysis?

Do you think a kernel summit discussion (I suppose given the late
timing, a beer bof) would be beneficial?  (I wouldn't be there)

> Also, will it be possible to somehow restrict what specific kernel
> facilities are accessible from users (IOW, what root emulation
> limitations are in action)?  It is useful from both points of sysadmin,
> who might not want to allow users to do such things, and from the
> security POV in sense of attack surface reduction.

You're probably thinking along different lines, but this is why I've
been wanting seccomp2 to get pushed through.  So that we can deny a
container the syscalls we know it won't need, especially newer ones,
to reduce the attack surface available to it.

> The patches explicitly enable some features for users on white list
> basis.  It's possible to do it for simple cases, but what are you going
> to do with multiplexing functions where there is a permission check
> before the actual multiplexing?  FS, networking drivers, etc.  Are you
> going to do the same thing as net_namespace does? - For each multiplexed
> entity create bool ->ns_aware which is false by default for all
> "untrusted"/not prepared protocols and is true for audited/prepared
> protocols.  Or probably you have something else in mind?

Ah, I typed the bottom paragraph before realizing what you were actually
asking.  The filesystems are a good example.  In the unprivileged mounts
patchsets, for instance, a flag was added to each filesystem indicating
if it was safe for unprivileged mounting (turned off for all real block
filesystems :).  For targeted capabilities, my goal would be simply to
make sure that each non-netns-aware entity does an (untargeted) capable()
check.  Without pointing to a specific example it's hard to say what I
will do.  It depends on how the code was previously laid out, and what
the maintainer of that subsystem prefers.
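
As a rough illustration of that opt-in flag (and of the default-false
->ns_aware idea from the question), here is a minimal userspace sketch
in C.  It is not code from the unprivileged mounts patchset: struct
fs_entry, unpriv_safe, fs_table, may_mount() and the filesystem names
are all hypothetical, chosen only to show the default-deny shape.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical registry entry; not the kernel's struct file_system_type. */
struct fs_entry {
	const char *name;
	bool unpriv_safe;	/* opt-in flag; an omitted initializer means false */
};

/* Illustrative names only; no real filesystem is implied. */
static const struct fs_entry fs_table[] = {
	{ .name = "blockfs" },				/* not audited: flag stays false */
	{ .name = "pseudofs", .unpriv_safe = true },	/* explicitly marked safe */
};

/* Privileged callers may mount any registered type; others need the flag. */
static bool may_mount(const char *name, bool privileged)
{
	assert(name != NULL);
	for (size_t i = 0; i < sizeof(fs_table) / sizeof(fs_table[0]); i++)
		if (strcmp(fs_table[i].name, name) == 0)
			return privileged || fs_table[i].unpriv_safe;
	return false;	/* unknown type: deny */
}
```

In this sketch an entry that nobody has touched is denied to
unprivileged callers simply because its flag was never set, so a
subsystem only becomes reachable after someone deliberately marks it.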

The way we're approaching it right now is that by default everything
stays 'capable(X)', so that a non-init user namespace doesn't get the
privileges.  While some of my patchsets this summer didn't follow this,
Eric reminded me that we should first clamp down on the user namespaces
as much as possible, and relax permissions in child namespaces later.
So the small (1-2 patch sized) sets I've been sending the last few
weeks are just trying to fix existing inadequate userid or capability
checks.
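
A toy userspace model of that default may help.  It assumes the
simplified rule that a capability held in a user namespace also applies
to its descendants, and it ignores the namespace creator's special
case; struct user_ns, struct task, model_ns_capable() and
model_capable() are made-up stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy capability bits; the values are arbitrary and not the kernel's. */
enum { CAP_NET_ADMIN_BIT = 1u << 0, CAP_SYS_ADMIN_BIT = 1u << 1 };

struct user_ns {
	const struct user_ns *parent;	/* NULL for the initial namespace */
};

struct task {
	const struct user_ns *ns;	/* user namespace the task lives in */
	unsigned int caps;		/* capabilities held in that namespace */
};

static const struct user_ns init_ns = { NULL };

/*
 * Targeted check: walk from the target namespace up towards the root.
 * If we meet the task's own namespace, the task's capability set
 * decides.  If we run off the top, the task sits outside the target's
 * ancestry and gets nothing.
 */
static bool model_ns_capable(const struct task *t,
			     const struct user_ns *target, unsigned int cap)
{
	assert(t != NULL && target != NULL);
	for (const struct user_ns *ns = target; ns != NULL; ns = ns->parent)
		if (ns == t->ns)
			return (t->caps & cap) != 0;
	return false;
}

/* Untargeted check: always measured against the initial namespace. */
static bool model_capable(const struct task *t, unsigned int cap)
{
	return model_ns_capable(t, &init_ns, cap);
}
```

With a child namespace whose parent is init_ns, a task holding
CAP_NET_ADMIN_BIT in the child passes model_ns_capable() against the
child but fails model_capable(), which is the behaviour described
above: anything left as capable(X) stays out of reach of a non-init
user namespace.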

-serge
Vasiliy Kulikov Sept. 27, 2011, 3:56 p.m. UTC | #5
On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote:
> > First, the patches by design expose much kernel code to unprivileged
> > userspace processes.  This code doesn't expect malformed data (e.g. VFS,
> > specific filesystems, block layer, char drivers, sysadmin part of LSMs,
> > etc. etc.).  By relaxing permission rules you greatly increase attack
> > surface of the kernel from unprivileged users.  Are you (or somebody
> > else) planning to audit this code?
> 
> I had wanted to (but didn't) propose a discussion at ksummit about how
> best to approach the filesystem code.  That's not even just for user
> namespaces - patches have been floated in the past to make mount an
> unprivileged operation depending on the FS and the user's permission
> over the device and target.

This is a dangerous operation by itself.  AFAICS, this is the reason why
e.g. FUSE doesn't pass user mount points to other users, or even to root.
The problems range from violating rules such as the existence of a single
"." and ".." in each directory to filenames containing /, \000, and
characters like `, ", ', \.


>  So I don't know if a combination of auditing
> and fuzzing is the way to go,

Maybe the combination of both.  There are no generic recommendations,
it's always limited to the subsystem, checked property, and the
auditor.


> > Also, will it be possible to somehow restrict what specific kernel
> > facilities are accessible from users (IOW, what root emulation
> > limitations are in action)?  It is useful from both points of sysadmin,
> > who might not want to allow users to do such things, and from the
> > security POV in sense of attack surface reduction.
> 
> You're probably thinking along different lines, but this is why I've
> been wanting seccomp2 to get pushed through.  So that we can deny a
> container the syscalls we know it won't need, especially newer ones,
> to reduce the attack surface available to it.

This dependency greatly complicates things.

First, there is a big misunderstanding between Will and Ingo about what
needs seccompv2 should serve.  Will wants to reduce kernel attack
surface by limiting syscalls and syscall arguments available to a user
(a single task, btw).  Ingo wants to see a full featured filtering
engine, which needs code changes all over the kernel.  Given the amount
of changes needed, it is unlikely to reduce the attack surface.

You probably don't want Will's version, as syscall filtering is a very
bad abstraction in your case.  user_namespaces likely need Ingo's
version of seccomp, as it would make it possible to filter e.g.
fs-specific events, but even if it is implemented, it will take a looong
time for your needs IMHO.


Also, I'm afraid for _good_ user_namespace filtering the policy
definition will be too complicated (like SELinux policy definition for
non-trivial applications) if it is implemented in events filtering
terms.


> The way we're approaching it right now is that by default everything
> stays 'capable(X)', so that a non-init user namespace doesn't get the
> privileges.

Great.  I was not sure about it.


>  While some of my patchsets this summer didn't follow this,
> Eric reminded me that we should first clamp down on the user namespaces
> as much as possible, and relax permissions in child namespaces later.

I think it is the only sane way.


> So the small (1-2 patch sized) sets I've been sending the last few
> weeks are just trying to fix existing inadequate userid or capability
> checks.
> 
> -serge

Thanks,
Serge Hallyn Oct. 1, 2011, 5 p.m. UTC | #6
Quoting Vasiliy Kulikov (segoon@openwall.com):
> On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote:
> > > First, the patches by design expose much kernel code to unprivileged
> > > userspace processes.  This code doesn't expect malformed data (e.g. VFS,
> > > specific filesystems, block layer, char drivers, sysadmin part of LSMs,
> > > etc. etc.).  By relaxing permission rules you greatly increase attack
> > > surface of the kernel from unprivileged users.  Are you (or somebody
> > > else) planning to audit this code?
> > 
> > I had wanted to (but didn't) propose a discussion at ksummit about how
> > best to approach the filesystem code.  That's not even just for user
> > namespaces - patches have been floated in the past to make mount an
> > unprivileged operation depending on the FS and the user's permission
> > over the device and target.
> 
> This is a dangerous operation by itself.

Of course it is :)  And it's been a while since it has been brought up,
but it *was* quite well thought through and thoroughly discussed - see
e.g. https://lkml.org/lkml/2008/1/8/131

Oh, that's right.  In the end the reason it didn't go in had to do with
the ability for an unprivileged user to prevent a privileged user from
unmounting trees by leaving a busy mount in a hidden namespace.

Eric, in the past we didn't know what to do about that, but I wonder
if setns could be used in some clever way to solve it from userspace.

> AFAICS, this is the reason why
> e.g. FUSE doesn't pass user mount points to other users and even root.
> Beginning from violating some rules like existence of single "." and
> ".." in each directory and ending with filename charsets with /, \000
> and things like `, ", ', \ inside.
> 
> 
> >  So I don't know if a combination of auditing
> > and fuzzing is the way to go,
> 
> Maybe the combination of both.  There are no generic recommendations,
> it's always limited to the subsystem, checked property, and the
> auditor.

Ok, let me keep focusing on the tightening down right now, and then
before proceeding with relaxing, I'll start some analysis and discussion
of the code which is already under targeted (ns_capable) capability checks.

> > > Also, will it be possible to somehow restrict what specific kernel
> > > facilities are accessible from users (IOW, what root emulation
> > > limitations are in action)?  It is useful from both points of sysadmin,
> > > who might not want to allow users to do such things, and from the
> > > security POV in sense of attack surface reduction.
> > 
> > You're probably thinking along different lines, but this is why I've
> > been wanting seccomp2 to get pushed through.  So that we can deny a
> > container the syscalls we know it won't need, especially newer ones,
> > to reduce the attack surface available to it.
> 
> This dependency greatly complicates things.

IMO this is not a dependency for user namespaces though - it's only a
dependency for unprivileged user namespaces.  And we haven't seriously
discussed doing that yet precisely because we're nowhere near ready
(and frankly I don't know that it'll ever be sane).

> First, there is a big misunderstanding between Will and Ingo in what
> needs seccompv2 should serve.  Will wants to reduce kernel attack

I know I know :)

> surface by limiting syscalls and syscall arguments available to a user
> (a single task, btw).  Ingo wants to see a full featured filtering
> engine, which needs code changes all over the kernel.  Given the needed
> changes amounts, it will unlikely reduce attack surface.
> 
> You probably don't want Will's version as syscalls filtering is a very

It seems to me per-syscall filtering is a great start.  I'm not looking
to seccomp2 as an assurance against formerly privileged (and now only
privileged per-namespace) code which may have had previously overlooked
bugs.  I'm looking to seccomp2 as an assurance against bugs in newly
written syscalls or the compatibility layer.

> bad abstraction in your case.  user_namespaces likely need Ingo's
> version of seccomp as it will be possible to filter e.g. fs-specific
> events, but even if it is implemented, it will take a looong time for
> your needs IMHO.

Yes, I think that would just lead to exploits through bad policy.

> Also, I'm afraid for _good_ user_namespace filtering the policy
> definition will be too complicated (like SELinux policy definition for
> non-trivial applications) if it is implemented in events filtering
> terms.
> 
> 
> > The way we're approaching it right now is that by default everything
> > stays 'capable(X)', so that a non-init user namespace doesn't get the
> > privileges.
> 
> Great.  I was not sure about it.
> 
> 
> >  While some of my patchsets this summer didn't follow this,
> > Eric reminded me that we should first clamp down on the user namespaces
> > as much as possible, and relax permissions in child namespaces later.
> 
> I think it is the only sane way.

Yup.  I trust you and Eric will keep me in check if I get over-zealous :)

-serge
Eric W. Biederman Oct. 3, 2011, 1:46 a.m. UTC | #7
"Serge E. Hallyn" <serge.hallyn@canonical.com> writes:

> Quoting Vasiliy Kulikov (segoon@openwall.com):
>> On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote:
>> > > First, the patches by design expose much kernel code to unprivileged
>> > > userspace processes.  This code doesn't expect malformed data (e.g. VFS,
>> > > specific filesystems, block layer, char drivers, sysadmin part of LSMs,
>> > > etc. etc.).  By relaxing permission rules you greatly increase attack
>> > > surface of the kernel from unprivileged users.  Are you (or somebody
>> > > else) planning to audit this code?

Well, in theory this does expose this code to unprivileged user
space in a way that increases the attack surface.  However, right now
there are a lot of cases where, because the kernel lacks a sufficient
mechanism, people are just given root privileges so that they can get
things done.  For example: Network manager controlling the network
stack as an unprivileged user, and random filesystems on usb sticks
being mounted and unmounted automatically when the usb sticks are
inserted and removed.

I completely agree that auditing and looking at the code is necessary.
I think most of what will happen is that we will start directly
supporting how the kernel is actually used in the real world, which
should actually reduce our level of vulnerability, because we give up
the delusion that large classes of operations don't need careful
attention because only root can perform them.  Those are operations for
which user space authors turn around and write a suid binary, so that
unprivileged user space performs them all day long.

>> > I had wanted to (but didn't) propose a discussion at ksummit about how
>> > best to approach the filesystem code.  That's not even just for user
>> > namespaces - patches have been floated in the past to make mount an
>> > unprivileged operation depending on the FS and the user's permission
>> > over the device and target.
>> 
>> This is a dangerous operation by itself.
>
> Of course it is :)  And it's been a while since it has been brought up,
> but it *was* quite well thought through and thoroughly discussed - see
> e.g. https://lkml.org/lkml/2008/1/8/131
>
> Oh, that's right.  In the end the reason it didn't go in had to do with
> the ability for an unprivileged user to prevent a privileged user from
> unmounting trees by leaving a busy mount in a hidden namespace.
>
> Eric, in the past we didn't know what to do about that, but I wonder
> if setns could be used in some clever way to solve it from userspace.

Oh.  That is a good objection.  I had not realized that unprivileged
mounts had that problem.

Still, the solution is straightforward.  If the concern is that an
unprivileged user can prevent a privileged user from unmounting trees,
we need to require that a forced unmount of the filesystem triggers a
revoke on all open files.  sysfs and proc already support revoke at the
per file level so we can safely remove modules, we just need to extend
that support to the forced unmount case.

This is a problem that actually needs to be solved for ordinary file
systems as well because of hot pluggable usb drives.  For filesystems
like ext4 it is more difficult because we need a solution that does
not sacrifice performance in the common case.  I was talking to
Ted Ts'o a bit about this at plumbers conf.  It happens that hot unplug
of usb devices with mounted filesystems is currently a never-ending
source of subtle bugs in the extN code.

The one implementation detail that sounds a bit tricky is what to do
about mount structures in mount namespaces when we forcibly unmount
a filesystem.  That could get a bit complicated but if that is the only
hang-up I'm certain we can figure something out.

Eric
Eric W. Biederman Oct. 3, 2011, 7:53 p.m. UTC | #8
ebiederm@xmission.com (Eric W. Biederman) writes:

> "Serge E. Hallyn" <serge.hallyn@canonical.com> writes:
>
>> Quoting Vasiliy Kulikov (segoon@openwall.com):
>>> On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote:
>>> > > First, the patches by design expose much kernel code to unprivileged
>>> > > userspace processes.  This code doesn't expect malformed data (e.g. VFS,
>>> > > specific filesystems, block layer, char drivers, sysadmin part of LSMs,
>>> > > etc. etc.).  By relaxing permission rules you greatly increase attack
>>> > > surface of the kernel from unprivileged users.  Are you (or somebody
>>> > > else) planning to audit this code?
>
> Well in theory this does expose this code to unprivileged user
> space in a way that increases the attack surface.    However right now
> there are a lot of cases where because the kernel lacks a sufficient
> mechanism people are just given root privileges so that they can get things
> done.  Network manager controlling the network stack as an unprivileged
> user.  Random filesystems on usb sticks being mounted and unmounted
> automatically when the usb sticks are inserted and removed.
>
> I completely agree that auditing and looking at the code is necessary.  I
> think most of what will happen is that we will start directly supporting
> how the kernel is actually used in the real world.  Which should
> actually reduce our level of vulnerability, because we give up the
> delusion that large classes of operations don't need careful
> attention because only root can perform them.   Operations which the
> user space authors turn around and write a suid binary for and
> unprivileged user space performs those operations all day long.
>
>>> > I had wanted to (but didn't) propose a discussion at ksummit about how
>>> > best to approach the filesystem code.  That's not even just for user
>>> > namespaces - patches have been floated in the past to make mount an
>>> > unprivileged operation depending on the FS and the user's permission
>>> > over the device and target.
>>> 
>>> This is a dangerous operation by itself.
>>
>> Of course it is :)  And it's been a while since it has been brought up,
>> but it *was* quite well thought through and thoroughly discussed - see
>> e.g. https://lkml.org/lkml/2008/1/8/131
>>
>> Oh, that's right.  In the end the reason it didn't go in had to do with
>> the ability for an unprivileged user to prevent a privileged user from
>> unmounting trees by leaving a busy mount in a hidden namespace.
>>
>> Eric, in the past we didn't know what to do about that, but I wonder
>> if setns could be used in some clever way to solve it from userspace.
>
> Oh.  That is a good objection.  I had not realized that unprivileged
> mounts had that problem.

I just re-read the discussion you are referring to and that wasn't
it.  Fuse already has something like a revoke in its umount -f
implementation.

Eric
Serge Hallyn Oct. 3, 2011, 8:04 p.m. UTC | #9
Quoting Eric W. Biederman (ebiederm@xmission.com):
> ebiederm@xmission.com (Eric W. Biederman) writes:
> 
> > "Serge E. Hallyn" <serge.hallyn@canonical.com> writes:
> >
> >> Quoting Vasiliy Kulikov (segoon@openwall.com):
> >>> On Tue, Sep 27, 2011 at 08:21 -0500, Serge E. Hallyn wrote:
> >>> > > First, the patches by design expose much kernel code to unprivileged
> >>> > > userspace processes.  This code doesn't expect malformed data (e.g. VFS,
> >>> > > specific filesystems, block layer, char drivers, sysadmin part of LSMs,
> >>> > > etc. etc.).  By relaxing permission rules you greatly increase attack
> >>> > > surface of the kernel from unprivileged users.  Are you (or somebody
> >>> > > else) planning to audit this code?
> >
> > Well in theory these patches do expose this code to unprivileged user
> > space in a way that increases the attack surface.  However right now
> > there are a lot of cases where, because the kernel lacks a sufficient
> > mechanism, people are just given root privileges so that they can get
> > things done.  Network manager controlling the network stack as an
> > unprivileged user.  Random filesystems on usb sticks being mounted and
> > unmounted automatically when the usb sticks are inserted and removed.
> >
> > I completely agree that auditing and looking at the code is necessary.  I
> > think most of what will happen is that we will start directly supporting
> > how the kernel is actually used in the real world.  Which should
> > actually reduce our level of vulnerability, because we give up the
> > delusion that large classes of operations don't need careful
> > attention because only root can perform them.  These are operations for
> > which user space authors turn around and write a suid binary, so that
> > unprivileged user space performs them all day long.
> >
> >>> > I had wanted to (but didn't) propose a discussion at ksummit about how
> >>> > best to approach the filesystem code.  That's not even just for user
> >>> > namespaces - patches have been floated in the past to make mount an
> >>> > unprivileged operation depending on the FS and the user's permission
> >>> > over the device and target.
> >>> 
> >>> This is a dangerous operation by itself.
> >>
> >> Of course it is :)  And it's been a while since it has been brought up,
> >> but it *was* quite well thought through and thoroughly discussed - see
> >> e.g. https://lkml.org/lkml/2008/1/8/131
> >>
> >> Oh, that's right.  In the end the reason it didn't go in had to do with
> >> the ability for an unprivileged user to prevent a privileged user from
> >> unmounting trees by leaving a busy mount in a hidden namespace.
> >>
> >> Eric, in the past we didn't know what to do about that, but I wonder
> >> if setns could be used in some clever way to solve it from userspace.
> >
> > Oh.  That is a good objection.  I had not realized that unprivileged
> > mounts had that problem.
> 
> I just re-read the discussion you are referring to and that wasn't

The one I linked was one discussion, but not the final one.

https://lkml.org/lkml/2008/10/6/72 is the one where the need for
revoke is brought up.

> it.  Fuse already has something like a revoke in its umount -f
> implementation.

I'll have to (haven't yet) take a look at it.

-serge

Patch

diff --git a/Documentation/namespaces/user_namespace.txt b/Documentation/namespaces/user_namespace.txt
new file mode 100644
index 0000000..b0bc480
--- /dev/null
+++ b/Documentation/namespaces/user_namespace.txt
@@ -0,0 +1,107 @@ 
+Description
+===========
+
+Traditionally, each task is owned by a user ID (UID) and belongs to one or more
+groups (GID).  Both are simple numeric IDs, though userspace usually translates
+them to names.  The user namespace allows tasks to have different views of the
+UIDs and GIDs associated with tasks and other resources.  (See 'UID mapping'
+below for more.)
+
+The user namespace is a simple hierarchical one.  The system starts with all
+tasks belonging to the initial user namespace.  A task creates a new user
+namespace by passing the CLONE_NEWUSER flag to clone(2).  This requires the
+creating task to have the CAP_SETUID, CAP_SETGID, and CAP_CHOWN capabilities,
+but it does not need to be running as root.  The clone(2) call will result in a
+new task which to itself appears to be running as UID and GID 0, but to its
+creator seems to have the creator's credentials.
+
+To this new task, any resource belonging to the initial user namespace will
+appear to belong to user and group 'nobody', which are UID and GID -1.
+Permission to open such files will be granted according to world access
+permissions.  UID comparisons and group membership checks will return false,
+and privilege will be denied.
+
+When a task belonging to (for example) userid 500 in the initial user namespace
+creates a new user namespace, even though the new task will see itself as
+belonging to UID 0, any task in the initial user namespace will see it as
+belonging to UID 500.  Therefore, UID 500 in the initial user namespace will be
+able to kill the new task.  Files created by the new user will (eventually) be
+seen by tasks in its own user namespace as belonging to UID 0, but to tasks in
+the initial user namespace as belonging to UID 500.
+
+Note that this userid mapping for the VFS is not yet implemented, though the
+lkml and containers mailing list archives will show several previous
+prototypes.  In the end, those got hung up waiting for the concept of targeted
+capabilities to be developed, which, thanks to the insight of Eric Biederman,
+it finally was.
+
+Relationship between the User namespace and other namespaces
+============================================================
+
+Other namespaces, such as UTS and network, are owned by a user namespace.  When
+such a namespace is created, it is assigned to the user namespace of the task
+by which it was created.  Therefore, attempts to exercise privilege to
+resources in, for instance, a particular network namespace, can be properly
+validated by checking whether the caller has the needed privilege (e.g.
+CAP_NET_ADMIN) targeted to the user namespace which owns the network namespace.
+This is done using the ns_capable() function.
+
+As an example, if a new task is cloned with a private user namespace but
+no private network namespace, then the task's network namespace is owned
+by the parent user namespace.  The new task has no privilege to the
+parent user namespace, so it will not be able to create or configure
+network devices.  If, instead, the task were cloned with both private
+user and network namespaces, then the private network namespace is owned
+by the private user namespace, and so root in the new user namespace
+will have privilege targeted to the network namespace.  It will be able
+to create and configure network devices.
+
+UID Mapping
+===========
+The current plan (see 'flexible UID mapping' at
+https://wiki.ubuntu.com/UserNamespace) is:
+
+The UID/GID stored on disk will be that in the init_user_ns.  Most likely
+UID/GID in other namespaces will be stored in xattrs.  But Eric was advocating
+(a few years ago) leaving the details up to filesystems while providing a lib/
+stock implementation.  See the thread around here:
+http://www.mail-archive.com/devel@openvz.org/msg09331.html
+
+
+Working notes
+=============
+Capability checks for actions related to syslog must be against the
+init_user_ns until syslog is containerized.
+
+Same is true for reboot and power, control groups, devices, and time.
+
+Perf actions (kernel/event/core.c for instance) will always be constrained to
+init_user_ns.
+
+Q:
+Is accounting considered properly containerized with respect to pidns?  (it
+appears to be).  If so, then we can change the capable() check in
+kernel/acct.c to 'ns_capable(current_pid_ns()->user_ns, CAP_SYS_PACCT)'
+
+Q:
+For things like nice and schedaffinity, we could allow root in a container to
+control those, and leave only cgroups to constrain the container.  I'm not sure
+whether that is right, or whether it violates admin expectations.
+
+I deferred some of commoncap.c.  I'm punting on xattr stuff as they take
+dentries, not inodes.
+
+For drivers/tty/tty_io.c and drivers/tty/vt/vt.c, we'll want to (for some of
+them) target the capability checks at the user_ns owning the tty.  That will
+have to wait until we get userns owning files straightened out.
+
+We need to figure out how to label devices.  Should we just toss a user_ns
+right into struct device?
+
+capable(CAP_MAC_ADMIN) checks are always to be against init_user_ns, unless
+LSMs are some day containerized, which is very unlikely.
+
+inode_owner_or_capable() should probably take an optional ns and cap parameter.
+If cap is 0, then CAP_FOWNER is checked.  If ns is NULL, we derive the ns from
+inode.  But if ns is provided, then callers who need to derive
+inode_userns(inode) anyway can save a few cycles.
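
The ownership rule described in the patch's 'Relationship between the User
namespace and other namespaces' section can be sketched as a small userspace
model in C.  This is not kernel code: every model_* name below is
hypothetical, a single boolean stands in for holding the needed capability,
and the rule that privilege in a namespace extends to the namespaces created
beneath it is a simplifying assumption that goes beyond what the text states.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only.  None of these types are the kernel's definitions. */
struct model_user_ns {
    const struct model_user_ns *parent;   /* NULL for the initial namespace */
};

struct model_net_ns {
    const struct model_user_ns *owner;    /* user namespace of the creating task */
};

struct model_task {
    const struct model_user_ns *user_ns;  /* namespace the task belongs to */
    bool has_cap;                         /* stands in for e.g. CAP_NET_ADMIN */
    const struct model_net_ns *net_ns;    /* network namespace the task uses */
};

/*
 * Model of a capability check targeted at a user namespace: the task must
 * hold the capability, and the target must be the task's own namespace or
 * (assumption) one of its descendants.  A task never gains privilege over
 * a parent namespace this way.
 */
bool model_ns_capable(const struct model_task *t,
                      const struct model_user_ns *target)
{
    if (!t->has_cap)
        return false;
    /* Walk from the target up towards the initial namespace. */
    for (const struct model_user_ns *ns = target; ns != NULL; ns = ns->parent) {
        if (ns == t->user_ns)
            return true;
    }
    return false;
}

/* Configuring network devices needs privilege targeted at the user
 * namespace that owns the task's network namespace. */
bool model_may_configure_net(const struct model_task *t)
{
    return model_ns_capable(t, t->net_ns->owner);
}
```

Given an initial namespace, a child namespace whose parent is the initial
one, and one network namespace owned by each, a task that is root in the
child but still uses the initial network namespace gets false from
model_may_configure_net(), while the same task using the child-owned network
namespace gets true.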