Message ID | 20190906150952.23066-3-cneirabustos@gmail.com |
---|---|
State | Changes Requested |
Delegated to: | BPF Maintainers |
Headers | show |
Series | [bpf-next,v10,1/4] fs/namei.c: make available filename_lookup() for bpf helpers. | expand |
On Fri, Sep 06, 2019 at 11:09:50AM -0400, Carlos Neira wrote: > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32, > + size) > +{ > + const char *pidns_path = "/proc/self/ns/pid"; > + fname = kmem_cache_alloc(names_cachep, GFP_ATOMIC); > + if (unlikely(!fname)) { > + ret = -ENOMEM; > + goto clear; > + } > + const size_t fnamesize = offsetof(struct filename, iname[1]); > + struct filename *tmp; > + > + tmp = kmalloc(fnamesize, GFP_ATOMIC); > + if (unlikely(!tmp)) { > + __putname(fname); > + ret = -ENOMEM; > + goto clear; > + } > + > + tmp->name = (char *)fname; > + fname = tmp; > + len = strlen(pidns_path) + 1; > + memcpy((char *)fname->name, pidns_path, len); > + fname->uptr = NULL; > + fname->aname = NULL; > + fname->refcnt = 1; > + > + ret = filename_lookup(AT_FDCWD, fname, 0, &kp, NULL); > + if (ret) > + goto clear; Where do I begin? * getname_kernel() is there for purpose * so's kern_path(), damnit > + > + inode = d_backing_inode(kp.dentry); > + pidns_info->dev = (u32)inode->i_rdev; * ... and this is utter bollocks - userland doesn't have to have procfs mounted anywhere; it doesn't have to have it mounted on /proc; it can bloody well bind a symlink to anywhere and anythin on top of /proc/self even if its has procfs mounted there. This is fundamentally wrong; nothing in the kernel (bpf very much included) has any business assuming anything about what's mounted where. And while we are at it, how deep on kernel stack can that thing be called? Because pathname resolution can bring all kinds of interesting crap into the game - consider e.g. NFS4 referral traversal. And it can occur - see above about the lack of warranties that your pathwalk will go to procfs and will remain there. NAKed-by: Al Viro <viro@zeniv.linux.org.uk>
On Fri, Sep 06, 2019 at 04:24:35PM +0100, Al Viro wrote: > > + tmp = kmalloc(fnamesize, GFP_ATOMIC); > > + if (unlikely(!tmp)) { > > + __putname(fname); > > + ret = -ENOMEM; > > + goto clear; > > + } > > + > > + tmp->name = (char *)fname; > > + fname = tmp; > > + len = strlen(pidns_path) + 1; > > + memcpy((char *)fname->name, pidns_path, len); > > + fname->uptr = NULL; > > + fname->aname = NULL; > > + fname->refcnt = 1; > > + > > + ret = filename_lookup(AT_FDCWD, fname, 0, &kp, NULL); > > + if (ret) > > + goto clear; > > Where do I begin? > * getname_kernel() is there for purpose > * so's kern_path(), damnit Oh, and filename_lookup() *CAN* sleep, obviously. So that GFP_ATOMIC above is completely pointless. > > + > > + inode = d_backing_inode(kp.dentry); > > + pidns_info->dev = (u32)inode->i_rdev; Why are plaing with device number, anyway? And why would it be anything other than 0?
On Fri, Sep 06, 2019 at 04:46:47PM +0100, Al Viro wrote: > > Where do I begin? > > * getname_kernel() is there for purpose > > * so's kern_path(), damnit > > Oh, and filename_lookup() *CAN* sleep, obviously. So that > GFP_ATOMIC above is completely pointless. > > > > + > > > + inode = d_backing_inode(kp.dentry); > > > + pidns_info->dev = (u32)inode->i_rdev; In the original variant of patchset it used to be ->i_sb->s_dev, which is also bloody strange - you are not asking filename_lookup() to follow symlinks, so you'd get that of whatever filesystem /proc/self/ns resides on. ->i_rdev use makes no sense whatsoever - it's a symlink and neither it nor its target are device nodes; ->i_rdev will be left zero for both. What data are you really trying to get there?
On 9/6/19 9:00 AM, Al Viro wrote: > On Fri, Sep 06, 2019 at 04:46:47PM +0100, Al Viro wrote: > >>> Where do I begin? >>> * getname_kernel() is there for purpose >>> * so's kern_path(), damnit >> >> Oh, and filename_lookup() *CAN* sleep, obviously. So that >> GFP_ATOMIC above is completely pointless. >> >>>> + >>>> + inode = d_backing_inode(kp.dentry); >>>> + pidns_info->dev = (u32)inode->i_rdev; > > In the original variant of patchset it used to be ->i_sb->s_dev, > which is also bloody strange - you are not asking filename_lookup() > to follow symlinks, so you'd get that of whatever filesystem > /proc/self/ns resides on. > > ->i_rdev use makes no sense whatsoever - it's a symlink and > neither it nor its target are device nodes; ->i_rdev will be > left zero for both. > > What data are you really trying to get there? Let me explain a little bit background here. The ultimate goal is for bpf program to filter over (pid_namespace, tgid/pid inside pid_namespace) so bpf based tools can run inside the container. Typically, pid namespace is achieved by looking at /proc/self/ns/pid: -bash-4.4$ lsns NS TYPE NPROCS PID USER COMMAND 4026531835 cgroup 44 8261 yhs /usr/lib/systemd/systemd --user 4026531836 pid 44 8261 yhs /usr/lib/systemd/systemd --user 4026531837 user 44 8261 yhs /usr/lib/systemd/systemd --user 4026531838 uts 44 8261 yhs /usr/lib/systemd/systemd --user 4026531839 ipc 44 8261 yhs /usr/lib/systemd/systemd --user 4026531840 mnt 44 8261 yhs /usr/lib/systemd/systemd --user 4026532008 net 44 8261 yhs /usr/lib/systemd/systemd --user -bash-4.4$ readlink /proc/self/ns/pid pid:[4026531836] -bash-4.4$ stat /proc/self/ns/pid File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’ Size: 0 Blocks: 0 IO Block: 1024 symbolic link Device: 4h/4d Inode: 344795989 Links: 1 Access: (0777/lrwxrwxrwx) Uid: (128203/ yhs) Gid: ( 100/ users) Context: user_u:base_r:base_t Access: 2019-09-06 16:06:09.431616380 -0700 Modify: 2019-09-06 16:06:09.431616380 -0700 Change: 2019-09-06 16:06:09.431616380 -0700 Birth: - -bash-4.4$ Based on a discussion with Eric Biederman back in 2019 Linux Plumbers, Eric suggested that to uniquely identify a namespace, device id (major/minor) number should also be included. Although today's kernel implementation has the same device for all namespace pseudo files, but from uapi perspective, device id should be included. That is the reason why we try to get device id which holds pid namespace pseudo file. Do you have a better suggestion on how to get the device id for 'current' pid namespace? Or from design, we really should not care about device id at all?
On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote: > -bash-4.4$ readlink /proc/self/ns/pid > pid:[4026531836] > -bash-4.4$ stat /proc/self/ns/pid > File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’ > Size: 0 Blocks: 0 IO Block: 1024 symbolic link > Device: 4h/4d Inode: 344795989 Links: 1 > Access: (0777/lrwxrwxrwx) Uid: (128203/ yhs) Gid: ( 100/ users) > Context: user_u:base_r:base_t > Access: 2019-09-06 16:06:09.431616380 -0700 > Modify: 2019-09-06 16:06:09.431616380 -0700 > Change: 2019-09-06 16:06:09.431616380 -0700 > Birth: - > -bash-4.4$ > > Based on a discussion with Eric Biederman back in 2019 Linux > Plumbers, Eric suggested that to uniquely identify a > namespace, device id (major/minor) number should also > be included. Although today's kernel implementation > has the same device for all namespace pseudo files, > but from uapi perspective, device id should be included. > > That is the reason why we try to get device id which holds > pid namespace pseudo file. > > Do you have a better suggestion on how to get > the device id for 'current' pid namespace? Or from design, we > really should not care about device id at all? What the hell is "device id for pid namespace"? This is the first time I've heard about that mystery object, so it's hard to tell where it could be found. I can tell you what device numbers are involved in the areas you seem to be looking in. 1) there's whatever device number that gets assigned to (this) procfs instance. That, ironically, _is_ per-pidns, but that of the procfs instance, not that of your process (and those can be different). That's what you get in ->st_dev when doing lstat() of anything in /proc (assuming that procfs is mounted there, in the first place). NOTE: that's lstat(2), not stat(2). stat(1) uses lstat(2), unless given -L (in which case it's stat(2) time). The difference: root@kvm1:~# stat /proc/self/ns/pid File: /proc/self/ns/pid -> pid:[4026531836] Size: 0 Blocks: 0 IO Block: 1024 symbolic link Device: 4h/4d Inode: 17396 Links: 1 Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2019-09-06 19:43:11.871312319 -0400 Modify: 2019-09-06 19:43:11.871312319 -0400 Change: 2019-09-06 19:43:11.871312319 -0400 Birth: - root@kvm1:~# stat -L /proc/self/ns/pid File: /proc/self/ns/pid Size: 0 Blocks: 0 IO Block: 4096 regular empty file Device: 3h/3d Inode: 4026531836 Links: 1 Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root) Access: 2019-09-06 19:43:15.955313293 -0400 Modify: 2019-09-06 19:43:15.955313293 -0400 Change: 2019-09-06 19:43:15.955313293 -0400 Birth: - The former is lstat, the latter - stat. 2) device number of the filesystem where the symlink target lives. In this case, it's nsfs and there's only one instance on the entire system. _That_ would be obtained by looking at st_dev in stat(2) on /proc/self/ns/pid (0:3 above). 3) device number *OF* the symlink. That would be st_rdev in lstat(2). There's none - it's a symlink, not a character or block device. It's always zero and always will be zero. 4) the same for the target; st_rdev in stat(2) results and again, there's no such beast - it's neither character nor block device. Your code is looking at (3). Please, reread any textbook on Unix in the section that would cover stat(2) and discussion of the difference between st_dev and st_rdev. I have no idea what Eric had been talking about - it's hard to reconstruct by what you said so far. Making nsfs per-userns, perhaps? But that makes no sense whatsoever, not that userns ever had... Cheap shots aside, I really can't guess what that's about. Sorry. In any case, pathname resolution is *NOT* for the situations where you can't block. Even if it's procfs (and from the same pidns as the process) mounted there, there is no promise that the target of /proc/self has already been looked up and not evicted from memory since then. And in case of cache miss pathwalk will have to call ->lookup(), which requires locking the directory (rw_sem, shared). You can't do that in such context. And that doesn't even go into the possibility that process has something very different mounted on /proc. Again, I don't know what it is that you want to get to, but I would strongly recommend finding a way to get to that data that would not involve going anywhere near pathname resolution. How would you expect the userland to work with that value, whatever it might be? If it's just a 32bit field that will never be read, you might as well store there the same value you store now (0, that is) in much cheaper and safer way ;-)
On 9/6/19 5:10 PM, Al Viro wrote: > On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote: > >> -bash-4.4$ readlink /proc/self/ns/pid >> pid:[4026531836] >> -bash-4.4$ stat /proc/self/ns/pid >> File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’ >> Size: 0 Blocks: 0 IO Block: 1024 symbolic link >> Device: 4h/4d Inode: 344795989 Links: 1 >> Access: (0777/lrwxrwxrwx) Uid: (128203/ yhs) Gid: ( 100/ users) >> Context: user_u:base_r:base_t >> Access: 2019-09-06 16:06:09.431616380 -0700 >> Modify: 2019-09-06 16:06:09.431616380 -0700 >> Change: 2019-09-06 16:06:09.431616380 -0700 >> Birth: - >> -bash-4.4$ >> >> Based on a discussion with Eric Biederman back in 2019 Linux >> Plumbers, Eric suggested that to uniquely identify a >> namespace, device id (major/minor) number should also >> be included. Although today's kernel implementation >> has the same device for all namespace pseudo files, >> but from uapi perspective, device id should be included. >> >> That is the reason why we try to get device id which holds >> pid namespace pseudo file. >> >> Do you have a better suggestion on how to get >> the device id for 'current' pid namespace? Or from design, we >> really should not care about device id at all? > > What the hell is "device id for pid namespace"? This is the > first time I've heard about that mystery object, so it's > hard to tell where it could be found. > > I can tell you what device numbers are involved in the areas > you seem to be looking in. > > 1) there's whatever device number that gets assigned to > (this) procfs instance. That, ironically, _is_ per-pidns, but > that of the procfs instance, not that of your process (and > those can be different). That's what you get in ->st_dev > when doing lstat() of anything in /proc (assuming that > procfs is mounted there, in the first place). NOTE: > that's lstat(2), not stat(2). stat(1) uses lstat(2), > unless given -L (in which case it's stat(2) time). The > difference: > > root@kvm1:~# stat /proc/self/ns/pid > File: /proc/self/ns/pid -> pid:[4026531836] > Size: 0 Blocks: 0 IO Block: 1024 symbolic link > Device: 4h/4d Inode: 17396 Links: 1 > Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2019-09-06 19:43:11.871312319 -0400 > Modify: 2019-09-06 19:43:11.871312319 -0400 > Change: 2019-09-06 19:43:11.871312319 -0400 > Birth: - > root@kvm1:~# stat -L /proc/self/ns/pid > File: /proc/self/ns/pid > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > Device: 3h/3d Inode: 4026531836 Links: 1 > Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root) > Access: 2019-09-06 19:43:15.955313293 -0400 > Modify: 2019-09-06 19:43:15.955313293 -0400 > Change: 2019-09-06 19:43:15.955313293 -0400 > Birth: - > > The former is lstat, the latter - stat. > > 2) device number of the filesystem where the symlink target lives. > In this case, it's nsfs and there's only one instance on the entire > system. _That_ would be obtained by looking at st_dev in stat(2) on > /proc/self/ns/pid (0:3 above). > > 3) device number *OF* the symlink. That would be st_rdev in lstat(2). > There's none - it's a symlink, not a character or block device. It's > always zero and always will be zero. > > 4) the same for the target; st_rdev in stat(2) results and again, > there's no such beast - it's neither character nor block device. > > Your code is looking at (3). Please, reread any textbook on Unix > in the section that would cover stat(2) and discussion of the > difference between st_dev and st_rdev. > > I have no idea what Eric had been talking about - it's hard to > reconstruct by what you said so far. Making nsfs per-userns, > perhaps? But that makes no sense whatsoever, not that userns > ever had... Cheap shots aside, I really can't guess what that's > about. Sorry. Thanks for the detailed information. The device number we want is nsfs. Indeed, currently, there is only one instance on the entire system. But not exactly sure what is the possibility to have more than one nsfs device in the future. Maybe per-userns or any other criteria? > > In any case, pathname resolution is *NOT* for the situations where > you can't block. Even if it's procfs (and from the same pidns as > the process) mounted there, there is no promise that the target > of /proc/self has already been looked up and not evicted from > memory since then. And in case of cache miss pathwalk will > have to call ->lookup(), which requires locking the directory > (rw_sem, shared). You can't do that in such context. > > And that doesn't even go into the possibility that process has > something very different mounted on /proc. > > Again, I don't know what it is that you want to get to, but > I would strongly recommend finding a way to get to that data > that would not involve going anywhere near pathname resolution. > > How would you expect the userland to work with that value, > whatever it might be? If it's just a 32bit field that will > never be read, you might as well store there the same value > you store now (0, that is) in much cheaper and safer way ;-) Suppose inside pid namespace, user can pass the device number, say n1, (`stat -L /proc/self/ns/pid`) to bpf program (through map or JIT). At runtime, bpf program will try to get device number, say n2, for the 'current' process. If n1 is not the same as n2, that means they are not in the same namespace. 'current' is in the same pid namespace as the user iff n1 == n2 and also pidns id is the same for 'current' and the one with `lsns -t pid`. Are you aware of any way to get the pidns device number for 'current' without going through the pathname lookup?
Thanks a lot, Al Viro and Yonghong for taking the time to review this patch and provide technical insights needed on this one. But how do we move this forward? Al Viro's review is clear that this will not work and we should strip the name resolution code (thanks for your detailed analysis). As there is currently only one instance of the nsfs device on the system, I think we could leave out the retrieval of the pidns device number and address it when the situation changes. What do you think? On Sat, Sep 07, 2019 at 06:34:39AM +0000, Yonghong Song wrote: > > > On 9/6/19 5:10 PM, Al Viro wrote: > > On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote: > > > >> -bash-4.4$ readlink /proc/self/ns/pid > >> pid:[4026531836] > >> -bash-4.4$ stat /proc/self/ns/pid > >> File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’ > >> Size: 0 Blocks: 0 IO Block: 1024 symbolic link > >> Device: 4h/4d Inode: 344795989 Links: 1 > >> Access: (0777/lrwxrwxrwx) Uid: (128203/ yhs) Gid: ( 100/ users) > >> Context: user_u:base_r:base_t > >> Access: 2019-09-06 16:06:09.431616380 -0700 > >> Modify: 2019-09-06 16:06:09.431616380 -0700 > >> Change: 2019-09-06 16:06:09.431616380 -0700 > >> Birth: - > >> -bash-4.4$ > >> > >> Based on a discussion with Eric Biederman back in 2019 Linux > >> Plumbers, Eric suggested that to uniquely identify a > >> namespace, device id (major/minor) number should also > >> be included. Although today's kernel implementation > >> has the same device for all namespace pseudo files, > >> but from uapi perspective, device id should be included. > >> > >> That is the reason why we try to get device id which holds > >> pid namespace pseudo file. > >> > >> Do you have a better suggestion on how to get > >> the device id for 'current' pid namespace? Or from design, we > >> really should not care about device id at all? > > > > What the hell is "device id for pid namespace"? This is the > > first time I've heard about that mystery object, so it's > > hard to tell where it could be found. > > > > I can tell you what device numbers are involved in the areas > > you seem to be looking in. > > > > 1) there's whatever device number that gets assigned to > > (this) procfs instance. That, ironically, _is_ per-pidns, but > > that of the procfs instance, not that of your process (and > > those can be different). That's what you get in ->st_dev > > when doing lstat() of anything in /proc (assuming that > > procfs is mounted there, in the first place). NOTE: > > that's lstat(2), not stat(2). stat(1) uses lstat(2), > > unless given -L (in which case it's stat(2) time). The > > difference: > > > > root@kvm1:~# stat /proc/self/ns/pid > > File: /proc/self/ns/pid -> pid:[4026531836] > > Size: 0 Blocks: 0 IO Block: 1024 symbolic link > > Device: 4h/4d Inode: 17396 Links: 1 > > Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root) > > Access: 2019-09-06 19:43:11.871312319 -0400 > > Modify: 2019-09-06 19:43:11.871312319 -0400 > > Change: 2019-09-06 19:43:11.871312319 -0400 > > Birth: - > > root@kvm1:~# stat -L /proc/self/ns/pid > > File: /proc/self/ns/pid > > Size: 0 Blocks: 0 IO Block: 4096 regular empty file > > Device: 3h/3d Inode: 4026531836 Links: 1 > > Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root) > > Access: 2019-09-06 19:43:15.955313293 -0400 > > Modify: 2019-09-06 19:43:15.955313293 -0400 > > Change: 2019-09-06 19:43:15.955313293 -0400 > > Birth: - > > > > The former is lstat, the latter - stat. > > > > 2) device number of the filesystem where the symlink target lives. > > In this case, it's nsfs and there's only one instance on the entire > > system. _That_ would be obtained by looking at st_dev in stat(2) on > > /proc/self/ns/pid (0:3 above). > > > > 3) device number *OF* the symlink. That would be st_rdev in lstat(2). > > There's none - it's a symlink, not a character or block device. It's > > always zero and always will be zero. > > > > 4) the same for the target; st_rdev in stat(2) results and again, > > there's no such beast - it's neither character nor block device. > > > > Your code is looking at (3). Please, reread any textbook on Unix > > in the section that would cover stat(2) and discussion of the > > difference between st_dev and st_rdev. > > > > I have no idea what Eric had been talking about - it's hard to > > reconstruct by what you said so far. Making nsfs per-userns, > > perhaps? But that makes no sense whatsoever, not that userns > > ever had... Cheap shots aside, I really can't guess what that's > > about. Sorry. > > Thanks for the detailed information. The device number we want > is nsfs. Indeed, currently, there is only one instance > on the entire system. But not exactly sure what is the possibility > to have more than one nsfs device in the future. Maybe per-userns > or any other criteria? > > > > > In any case, pathname resolution is *NOT* for the situations where > > you can't block. Even if it's procfs (and from the same pidns as > > the process) mounted there, there is no promise that the target > > of /proc/self has already been looked up and not evicted from > > memory since then. And in case of cache miss pathwalk will > > have to call ->lookup(), which requires locking the directory > > (rw_sem, shared). You can't do that in such context. > > > > And that doesn't even go into the possibility that process has > > something very different mounted on /proc. > > > > Again, I don't know what it is that you want to get to, but > > I would strongly recommend finding a way to get to that data > > that would not involve going anywhere near pathname resolution. > > > > How would you expect the userland to work with that value, > > whatever it might be? If it's just a 32bit field that will > > never be read, you might as well store there the same value > > you store now (0, that is) in much cheaper and safer way ;-) > > Suppose inside pid namespace, user can pass the device number, > say n1, (`stat -L /proc/self/ns/pid`) to bpf program (through map > or JIT). At runtime, bpf program will try to get device number, > say n2, for the 'current' process. If n1 is not the same as > n2, that means they are not in the same namespace. 'current' > is in the same pid namespace as the user iff > n1 == n2 and also pidns id is the same for 'current' and > the one with `lsns -t pid`. > > Are you aware of any way to get the pidns device number > for 'current' without going through the pathname > lookup? >
Carlos, Discussed with Eric today for what is the best way to get the device number for a namespace. The following patch seems a reasonable start although Eric would like to see how the helper is used in order to decide whether the interface looks right. commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2) Author: Yonghong Song <yhs@fb.com> Date: Mon Sep 9 21:50:51 2019 -0700 nsfs: add an interface function ns_get_inum_dev() This patch added an interface function ns_get_inum_dev(). Given a ns_common structure, the function returns the inode and device numbers. The function will be used later by a newly added bpf helper. Signed-off-by: Yonghong Song <yhs@fb.com> diff --git a/fs/nsfs.c b/fs/nsfs.c index a0431642c6b5..a603c6fc3f54 100644 --- a/fs/nsfs.c +++ b/fs/nsfs.c @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd) return ERR_PTR(-EINVAL); } +/* Get the device number for the current task pidns. + */ +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev) +{ + *inum = ns->inum; + *dev = nsfs_mnt->mnt_sb->s_dev; +} + static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry) { struct inode *inode = d_inode(dentry); diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h index d31cb6215905..b8fc680cdf1a 100644 --- a/include/linux/proc_ns.h +++ b/include/linux/proc_ns.h @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct task_struct *task, typedef struct ns_common *ns_get_path_helper_t(void *); extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t ns_get_cb, void *private_data); +extern void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev); extern int ns_get_name(char *buf, size_t size, struct task_struct *task, const struct proc_ns_operations *ns_ops); Could you put the above change and patch #1 and then have all your other patches? In your kernel change, please use interface function ns_get_inum_dev() to get pidns inode number and dev number. On 9/9/19 6:45 PM, Carlos Antonio Neira Bustos wrote: > Thanks a lot, Al Viro and Yonghong for taking the time to review this patch and > provide technical insights needed on this one. > But how do we move this forward? > Al Viro's review is clear that this will not work and we should strip the name > resolution code (thanks for your detailed analysis). > As there is currently only one instance of the nsfs device on the system, > I think we could leave out the retrieval of the pidns device number and address it > when the situation changes. > What do you think? > > > On Sat, Sep 07, 2019 at 06:34:39AM +0000, Yonghong Song wrote: >> >> >> On 9/6/19 5:10 PM, Al Viro wrote: >>> On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote: >>> >>>> -bash-4.4$ readlink /proc/self/ns/pid >>>> pid:[4026531836] >>>> -bash-4.4$ stat /proc/self/ns/pid >>>> File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’ >>>> Size: 0 Blocks: 0 IO Block: 1024 symbolic link >>>> Device: 4h/4d Inode: 344795989 Links: 1 >>>> Access: (0777/lrwxrwxrwx) Uid: (128203/ yhs) Gid: ( 100/ users) >>>> Context: user_u:base_r:base_t >>>> Access: 2019-09-06 16:06:09.431616380 -0700 >>>> Modify: 2019-09-06 16:06:09.431616380 -0700 >>>> Change: 2019-09-06 16:06:09.431616380 -0700 >>>> Birth: - >>>> -bash-4.4$ >>>> >>>> Based on a discussion with Eric Biederman back in 2019 Linux >>>> Plumbers, Eric suggested that to uniquely identify a >>>> namespace, device id (major/minor) number should also >>>> be included. Although today's kernel implementation >>>> has the same device for all namespace pseudo files, >>>> but from uapi perspective, device id should be included. >>>> >>>> That is the reason why we try to get device id which holds >>>> pid namespace pseudo file. >>>> >>>> Do you have a better suggestion on how to get >>>> the device id for 'current' pid namespace? Or from design, we >>>> really should not care about device id at all? >>> >>> What the hell is "device id for pid namespace"? This is the >>> first time I've heard about that mystery object, so it's >>> hard to tell where it could be found. >>> >>> I can tell you what device numbers are involved in the areas >>> you seem to be looking in. >>> >>> 1) there's whatever device number that gets assigned to >>> (this) procfs instance. That, ironically, _is_ per-pidns, but >>> that of the procfs instance, not that of your process (and >>> those can be different). That's what you get in ->st_dev >>> when doing lstat() of anything in /proc (assuming that >>> procfs is mounted there, in the first place). NOTE: >>> that's lstat(2), not stat(2). stat(1) uses lstat(2), >>> unless given -L (in which case it's stat(2) time). The >>> difference: >>> >>> root@kvm1:~# stat /proc/self/ns/pid >>> File: /proc/self/ns/pid -> pid:[4026531836] >>> Size: 0 Blocks: 0 IO Block: 1024 symbolic link >>> Device: 4h/4d Inode: 17396 Links: 1 >>> Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root) >>> Access: 2019-09-06 19:43:11.871312319 -0400 >>> Modify: 2019-09-06 19:43:11.871312319 -0400 >>> Change: 2019-09-06 19:43:11.871312319 -0400 >>> Birth: - >>> root@kvm1:~# stat -L /proc/self/ns/pid >>> File: /proc/self/ns/pid >>> Size: 0 Blocks: 0 IO Block: 4096 regular empty file >>> Device: 3h/3d Inode: 4026531836 Links: 1 >>> Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root) >>> Access: 2019-09-06 19:43:15.955313293 -0400 >>> Modify: 2019-09-06 19:43:15.955313293 -0400 >>> Change: 2019-09-06 19:43:15.955313293 -0400 >>> Birth: - >>> >>> The former is lstat, the latter - stat. >>> >>> 2) device number of the filesystem where the symlink target lives. >>> In this case, it's nsfs and there's only one instance on the entire >>> system. _That_ would be obtained by looking at st_dev in stat(2) on >>> /proc/self/ns/pid (0:3 above). >>> >>> 3) device number *OF* the symlink. That would be st_rdev in lstat(2). >>> There's none - it's a symlink, not a character or block device. It's >>> always zero and always will be zero. >>> >>> 4) the same for the target; st_rdev in stat(2) results and again, >>> there's no such beast - it's neither character nor block device. >>> >>> Your code is looking at (3). Please, reread any textbook on Unix >>> in the section that would cover stat(2) and discussion of the >>> difference between st_dev and st_rdev. >>> >>> I have no idea what Eric had been talking about - it's hard to >>> reconstruct by what you said so far. Making nsfs per-userns, >>> perhaps? But that makes no sense whatsoever, not that userns >>> ever had... Cheap shots aside, I really can't guess what that's >>> about. Sorry. >> >> Thanks for the detailed information. The device number we want >> is nsfs. Indeed, currently, there is only one instance >> on the entire system. But not exactly sure what is the possibility >> to have more than one nsfs device in the future. Maybe per-userns >> or any other criteria? >> >>> >>> In any case, pathname resolution is *NOT* for the situations where >>> you can't block. Even if it's procfs (and from the same pidns as >>> the process) mounted there, there is no promise that the target >>> of /proc/self has already been looked up and not evicted from >>> memory since then. And in case of cache miss pathwalk will >>> have to call ->lookup(), which requires locking the directory >>> (rw_sem, shared). You can't do that in such context. >>> >>> And that doesn't even go into the possibility that process has >>> something very different mounted on /proc. >>> >>> Again, I don't know what it is that you want to get to, but >>> I would strongly recommend finding a way to get to that data >>> that would not involve going anywhere near pathname resolution. >>> >>> How would you expect the userland to work with that value, >>> whatever it might be? If it's just a 32bit field that will >>> never be read, you might as well store there the same value >>> you store now (0, that is) in much cheaper and safer way ;-) >> >> Suppose inside pid namespace, user can pass the device number, >> say n1, (`stat -L /proc/self/ns/pid`) to bpf program (through map >> or JIT). At runtime, bpf program will try to get device number, >> say n2, for the 'current' process. If n1 is not the same as >> n2, that means they are not in the same namespace. 'current' >> is in the same pid namespace as the user iff >> n1 == n2 and also pidns id is the same for 'current' and >> the one with `lsns -t pid`. >> >> Are you aware of any way to get the pidns device number >> for 'current' without going through the pathname >> lookup? >>
On 9/6/19 4:09 PM, Carlos Neira wrote: > This helper(bpf_get_current_pidns_info) obtains the active namespace from > current and returns pid, tgid, device and namespace id as seen from that > namespace, allowing to instrument a process inside a container. > > Signed-off-by: Carlos Neira <cneirabustos@gmail.com> > --- > include/linux/bpf.h | 1 + > include/uapi/linux/bpf.h | 35 +++++++++++++++++++- > kernel/bpf/core.c | 1 + > kernel/bpf/helpers.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++ > kernel/trace/bpf_trace.c | 2 ++ > 5 files changed, 124 insertions(+), 1 deletion(-) > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > index 5b9d22338606..819cb1c84be0 100644 > --- a/include/linux/bpf.h > +++ b/include/linux/bpf.h > @@ -1055,6 +1055,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto; > extern const struct bpf_func_proto bpf_strtol_proto; > extern const struct bpf_func_proto bpf_strtoul_proto; > extern const struct bpf_func_proto bpf_tcp_sock_proto; > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto; > > /* Shared helpers among cBPF and eBPF. */ > void bpf_user_rnd_init_once(void); > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index b5889257cc33..3ec9aa1438b7 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -2747,6 +2747,32 @@ union bpf_attr { > * **-EOPNOTSUPP** kernel configuration does not enable SYN cookies > * > * **-EPROTONOSUPPORT** IP packet version is not 4 or 6 > + * > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns) > + * Description > + * Get tgid, pid and namespace id as seen by the current namespace, > + * and device major/minor numbers from /proc/self/ns/pid. Such > + * information is stored in *pidns* of size *size*. > + * > + * This helper is used when pid filtering is needed inside a > + * container as bpf_get_current_tgid() helper always returns the > + * pid id as seen by the root namespace. > + * Return > + * 0 on success > + * > + * On failure, the returned value is one of the following: > + * > + * **-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid > + * or tgid of the current task. > + * > + * **-ENOENT** if /proc/self/ns/pid does not exists. > + * > + * **-ENOENT** if /proc/self/ns does not exists. > + * > + * **-ENOMEM** if helper internal allocation fails. -ENOMEM can be removed. > + * > + * **-EPERM** if not able to call helper. > + * > */ > #define __BPF_FUNC_MAPPER(FN) \ > FN(unspec), \ > @@ -2859,7 +2885,8 @@ union bpf_attr { > FN(sk_storage_get), \ > FN(sk_storage_delete), \ > FN(send_signal), \ > - FN(tcp_gen_syncookie), > + FN(tcp_gen_syncookie), \ > + FN(get_current_pidns_info), > > /* integer value in 'imm' field of BPF_CALL instruction selects which helper > * function eBPF program intends to call > @@ -3610,4 +3637,10 @@ struct bpf_sockopt { > __s32 retval; > }; > > +struct bpf_pidns_info { > + __u32 dev; /* dev_t from /proc/self/ns/pid inode */ /* dev_t of pid namespace pseudo file (typically /proc/seelf/ns/pid) after following symbolic link */ > + __u32 nsid; > + __u32 tgid; > + __u32 pid; > +}; > #endif /* _UAPI__LINUX_BPF_H__ */ > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > index 8191a7db2777..3159f2a0188c 100644 > --- a/kernel/bpf/core.c > +++ b/kernel/bpf/core.c > @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak; > const struct bpf_func_proto bpf_get_current_comm_proto __weak; > const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak; > const struct bpf_func_proto bpf_get_local_storage_proto __weak; > +const struct bpf_func_proto bpf_get_current_pidns_info __weak; > > const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) > { > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index 5e28718928ca..8dbe6347893c 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -11,6 +11,11 @@ > #include <linux/uidgid.h> > #include <linux/filter.h> > #include <linux/ctype.h> > +#include <linux/pid_namespace.h> > +#include <linux/kdev_t.h> > +#include <linux/stat.h> > +#include <linux/namei.h> > +#include <linux/version.h> > > #include "../../lib/kstrtox.h" > > @@ -312,6 +317,87 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src, > preempt_enable(); > } > > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32, > + size) > +{ > + const char *pidns_path = "/proc/self/ns/pid"; > + struct pid_namespace *pidns = NULL; > + struct filename *fname = NULL; > + struct inode *inode; > + struct path kp; > + pid_t tgid = 0; > + pid_t pid = 0; > + int ret = -EINVAL; > + int len; > + > + if (unlikely(in_interrupt() || > + current->flags & (PF_KTHREAD | PF_EXITING))) > + return -EPERM; > + > + if (unlikely(size != sizeof(struct bpf_pidns_info))) > + return -EINVAL; > + > + pidns = task_active_pid_ns(current); > + if (unlikely(!pidns)) > + return -ENOENT; > + > + pidns_info->nsid = pidns->ns.inum; > + pid = task_pid_nr_ns(current, pidns); > + if (unlikely(!pid)) > + goto clear; > + > + tgid = task_tgid_nr_ns(current, pidns); > + if (unlikely(!tgid)) > + goto clear; > + > + pidns_info->tgid = (u32) tgid; > + pidns_info->pid = (u32) pid; > + [...] > + fname = kmem_cache_alloc(names_cachep, GFP_ATOMIC); > + if (unlikely(!fname)) { > + ret = -ENOMEM; > + goto clear; > + } > + const size_t fnamesize = offsetof(struct filename, iname[1]); > + struct filename *tmp; > + > + tmp = kmalloc(fnamesize, GFP_ATOMIC); > + if (unlikely(!tmp)) { > + __putname(fname); > + ret = -ENOMEM; > + goto clear; > + } > + > + tmp->name = (char *)fname; > + fname = tmp; > + len = strlen(pidns_path) + 1; > + memcpy((char *)fname->name, pidns_path, len); > + fname->uptr = NULL; > + fname->aname = NULL; > + fname->refcnt = 1; > + > + ret = filename_lookup(AT_FDCWD, fname, 0, &kp, NULL); > + if (ret) > + goto clear; > + > + inode = d_backing_inode(kp.dentry); > + pidns_info->dev = (u32)inode->i_rdev; The above can bee replaced with new nsfs interface function ns_get_inum_dev(). > + > + return 0; > + > +clear: > + memset((void *)pidns_info, 0, (size_t) size); > + return ret; > +} > + > +const struct bpf_func_proto bpf_get_current_pidns_info_proto = { > + .func = bpf_get_current_pidns_info, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_PTR_TO_UNINIT_MEM, > + .arg2_type = ARG_CONST_SIZE, > +}; > + > #ifdef CONFIG_CGROUPS > BPF_CALL_0(bpf_get_current_cgroup_id) > { > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index ca1255d14576..5e1dc22765a5 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) > #endif > case BPF_FUNC_send_signal: > return &bpf_send_signal_proto; > + case BPF_FUNC_get_current_pidns_info: > + return &bpf_get_current_pidns_info_proto; > default: > return NULL; > } >
On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote: > > Carlos, > > Discussed with Eric today for what is the best way to get > the device number for a namespace. The following patch seems > a reasonable start although Eric would like to see > how the helper is used in order to decide whether the > interface looks right. > > commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2) > Author: Yonghong Song <yhs@fb.com> > Date: Mon Sep 9 21:50:51 2019 -0700 > > nsfs: add an interface function ns_get_inum_dev() > > This patch added an interface function > ns_get_inum_dev(). Given a ns_common structure, > the function returns the inode and device > numbers. The function will be used later > by a newly added bpf helper. > > Signed-off-by: Yonghong Song <yhs@fb.com> > > diff --git a/fs/nsfs.c b/fs/nsfs.c > index a0431642c6b5..a603c6fc3f54 100644 > --- a/fs/nsfs.c > +++ b/fs/nsfs.c > @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd) > return ERR_PTR(-EINVAL); > } > > +/* Get the device number for the current task pidns. > + */ > +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev) > +{ > + *inum = ns->inum; > + *dev = nsfs_mnt->mnt_sb->s_dev; > +} Umm... Where would it get the device number once we get (hell knows what for) multiple nsfs instances? I still don't understand what would that be about, TBH... Is it really per-userns? Or something else entirely? Eric, could you give some context?
On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote: Thanks a lot Yonghong. I'll include this patch when submitting changes for version 11 of this patch. > > Carlos, > > Discussed with Eric today for what is the best way to get > the device number for a namespace. The following patch seems > a reasonable start although Eric would like to see > how the helper is used in order to decide whether the > interface looks right. > > commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2) > Author: Yonghong Song <yhs@fb.com> > Date: Mon Sep 9 21:50:51 2019 -0700 > > nsfs: add an interface function ns_get_inum_dev() > > This patch added an interface function > ns_get_inum_dev(). Given a ns_common structure, > the function returns the inode and device > numbers. The function will be used later > by a newly added bpf helper. > > Signed-off-by: Yonghong Song <yhs@fb.com> > > diff --git a/fs/nsfs.c b/fs/nsfs.c > index a0431642c6b5..a603c6fc3f54 100644 > --- a/fs/nsfs.c > +++ b/fs/nsfs.c > @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd) > return ERR_PTR(-EINVAL); > } > > +/* Get the device number for the current task pidns. > + */ > +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev) > +{ > + *inum = ns->inum; > + *dev = nsfs_mnt->mnt_sb->s_dev; > +} > + > static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry) > { > struct inode *inode = d_inode(dentry); > diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h > index d31cb6215905..b8fc680cdf1a 100644 > --- a/include/linux/proc_ns.h > +++ b/include/linux/proc_ns.h > @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct > task_struct *task, > typedef struct ns_common *ns_get_path_helper_t(void *); > extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t > ns_get_cb, > void *private_data); > +extern void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev); > > extern int ns_get_name(char *buf, size_t size, struct task_struct *task, > const struct proc_ns_operations *ns_ops); > > Could you put the above change and patch #1 and then have > all your other patches? In your kernel change, please use > interface function ns_get_inum_dev() to get pidns inode number > and dev number. > > On 9/9/19 6:45 PM, Carlos Antonio Neira Bustos wrote: > > Thanks a lot, Al Viro and Yonghong for taking the time to review this patch and > > provide technical insights needed on this one. > > But how do we move this forward? > > Al Viro's review is clear that this will not work and we should strip the name > > resolution code (thanks for your detailed analysis). > > As there is currently only one instance of the nsfs device on the system, > > I think we could leave out the retrieval of the pidns device number and address it > > when the situation changes. > > What do you think? > > > > > > On Sat, Sep 07, 2019 at 06:34:39AM +0000, Yonghong Song wrote: > >> > >> > >> On 9/6/19 5:10 PM, Al Viro wrote: > >>> On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote: > >>> > >>>> -bash-4.4$ readlink /proc/self/ns/pid > >>>> pid:[4026531836] > >>>> -bash-4.4$ stat /proc/self/ns/pid > >>>> File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’ > >>>> Size: 0 Blocks: 0 IO Block: 1024 symbolic link > >>>> Device: 4h/4d Inode: 344795989 Links: 1 > >>>> Access: (0777/lrwxrwxrwx) Uid: (128203/ yhs) Gid: ( 100/ users) > >>>> Context: user_u:base_r:base_t > >>>> Access: 2019-09-06 16:06:09.431616380 -0700 > >>>> Modify: 2019-09-06 16:06:09.431616380 -0700 > >>>> Change: 2019-09-06 16:06:09.431616380 -0700 > >>>> Birth: - > >>>> -bash-4.4$ > >>>> > >>>> Based on a discussion with Eric Biederman back in 2019 Linux > >>>> Plumbers, Eric suggested that to uniquely identify a > >>>> namespace, device id (major/minor) number should also > >>>> be included. Although today's kernel implementation > >>>> has the same device for all namespace pseudo files, > >>>> but from uapi perspective, device id should be included. > >>>> > >>>> That is the reason why we try to get device id which holds > >>>> pid namespace pseudo file. > >>>> > >>>> Do you have a better suggestion on how to get > >>>> the device id for 'current' pid namespace? Or from design, we > >>>> really should not care about device id at all? > >>> > >>> What the hell is "device id for pid namespace"? This is the > >>> first time I've heard about that mystery object, so it's > >>> hard to tell where it could be found. > >>> > >>> I can tell you what device numbers are involved in the areas > >>> you seem to be looking in. > >>> > >>> 1) there's whatever device number that gets assigned to > >>> (this) procfs instance. That, ironically, _is_ per-pidns, but > >>> that of the procfs instance, not that of your process (and > >>> those can be different). That's what you get in ->st_dev > >>> when doing lstat() of anything in /proc (assuming that > >>> procfs is mounted there, in the first place). NOTE: > >>> that's lstat(2), not stat(2). stat(1) uses lstat(2), > >>> unless given -L (in which case it's stat(2) time). The > >>> difference: > >>> > >>> root@kvm1:~# stat /proc/self/ns/pid > >>> File: /proc/self/ns/pid -> pid:[4026531836] > >>> Size: 0 Blocks: 0 IO Block: 1024 symbolic link > >>> Device: 4h/4d Inode: 17396 Links: 1 > >>> Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root) > >>> Access: 2019-09-06 19:43:11.871312319 -0400 > >>> Modify: 2019-09-06 19:43:11.871312319 -0400 > >>> Change: 2019-09-06 19:43:11.871312319 -0400 > >>> Birth: - > >>> root@kvm1:~# stat -L /proc/self/ns/pid > >>> File: /proc/self/ns/pid > >>> Size: 0 Blocks: 0 IO Block: 4096 regular empty file > >>> Device: 3h/3d Inode: 4026531836 Links: 1 > >>> Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root) > >>> Access: 2019-09-06 19:43:15.955313293 -0400 > >>> Modify: 2019-09-06 19:43:15.955313293 -0400 > >>> Change: 2019-09-06 19:43:15.955313293 -0400 > >>> Birth: - > >>> > >>> The former is lstat, the latter - stat. > >>> > >>> 2) device number of the filesystem where the symlink target lives. > >>> In this case, it's nsfs and there's only one instance on the entire > >>> system. _That_ would be obtained by looking at st_dev in stat(2) on > >>> /proc/self/ns/pid (0:3 above). > >>> > >>> 3) device number *OF* the symlink. That would be st_rdev in lstat(2). > >>> There's none - it's a symlink, not a character or block device. It's > >>> always zero and always will be zero. > >>> > >>> 4) the same for the target; st_rdev in stat(2) results and again, > >>> there's no such beast - it's neither character nor block device. > >>> > >>> Your code is looking at (3). Please, reread any textbook on Unix > >>> in the section that would cover stat(2) and discussion of the > >>> difference between st_dev and st_rdev. > >>> > >>> I have no idea what Eric had been talking about - it's hard to > >>> reconstruct by what you said so far. Making nsfs per-userns, > >>> perhaps? But that makes no sense whatsoever, not that userns > >>> ever had... Cheap shots aside, I really can't guess what that's > >>> about. Sorry. > >> > >> Thanks for the detailed information. The device number we want > >> is nsfs. Indeed, currently, there is only one instance > >> on the entire system. But not exactly sure what is the possibility > >> to have more than one nsfs device in the future. Maybe per-userns > >> or any other criteria? > >> > >>> > >>> In any case, pathname resolution is *NOT* for the situations where > >>> you can't block. Even if it's procfs (and from the same pidns as > >>> the process) mounted there, there is no promise that the target > >>> of /proc/self has already been looked up and not evicted from > >>> memory since then. And in case of cache miss pathwalk will > >>> have to call ->lookup(), which requires locking the directory > >>> (rw_sem, shared). You can't do that in such context. > >>> > >>> And that doesn't even go into the possibility that process has > >>> something very different mounted on /proc. > >>> > >>> Again, I don't know what it is that you want to get to, but > >>> I would strongly recommend finding a way to get to that data > >>> that would not involve going anywhere near pathname resolution. > >>> > >>> How would you expect the userland to work with that value, > >>> whatever it might be? If it's just a 32bit field that will > >>> never be read, you might as well store there the same value > >>> you store now (0, that is) in much cheaper and safer way ;-) > >> > >> Suppose inside pid namespace, user can pass the device number, > >> say n1, (`stat -L /proc/self/ns/pid`) to bpf program (through map > >> or JIT). At runtime, bpf program will try to get device number, > >> say n2, for the 'current' process. If n1 is not the same as > >> n2, that means they are not in the same namespace. 'current' > >> is in the same pid namespace as the user iff > >> n1 == n2 and also pidns id is the same for 'current' and > >> the one with `lsns -t pid`. > >> > >> Are you aware of any way to get the pidns device number > >> for 'current' without going through the pathname > >> lookup? > >>
Thanks for reviewing the rest of the code, I'll include these changes in ver 11. Bests On Tue, Sep 10, 2019 at 10:46:45PM +0000, Yonghong Song wrote: > > > On 9/6/19 4:09 PM, Carlos Neira wrote: > > This helper(bpf_get_current_pidns_info) obtains the active namespace from > > current and returns pid, tgid, device and namespace id as seen from that > > namespace, allowing to instrument a process inside a container. > > > > Signed-off-by: Carlos Neira <cneirabustos@gmail.com> > > --- > > include/linux/bpf.h | 1 + > > include/uapi/linux/bpf.h | 35 +++++++++++++++++++- > > kernel/bpf/core.c | 1 + > > kernel/bpf/helpers.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++ > > kernel/trace/bpf_trace.c | 2 ++ > > 5 files changed, 124 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > index 5b9d22338606..819cb1c84be0 100644 > > --- a/include/linux/bpf.h > > +++ b/include/linux/bpf.h > > @@ -1055,6 +1055,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto; > > extern const struct bpf_func_proto bpf_strtol_proto; > > extern const struct bpf_func_proto bpf_strtoul_proto; > > extern const struct bpf_func_proto bpf_tcp_sock_proto; > > +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto; > > > > /* Shared helpers among cBPF and eBPF. */ > > void bpf_user_rnd_init_once(void); > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index b5889257cc33..3ec9aa1438b7 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -2747,6 +2747,32 @@ union bpf_attr { > > * **-EOPNOTSUPP** kernel configuration does not enable SYN cookies > > * > > * **-EPROTONOSUPPORT** IP packet version is not 4 or 6 > > + * > > + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns) > > + * Description > > + * Get tgid, pid and namespace id as seen by the current namespace, > > + * and device major/minor numbers from /proc/self/ns/pid. Such > > + * information is stored in *pidns* of size *size*. > > + * > > + * This helper is used when pid filtering is needed inside a > > + * container as bpf_get_current_tgid() helper always returns the > > + * pid id as seen by the root namespace. > > + * Return > > + * 0 on success > > + * > > + * On failure, the returned value is one of the following: > > + * > > + * **-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid > > + * or tgid of the current task. > > + * > > + * **-ENOENT** if /proc/self/ns/pid does not exists. > > + * > > + * **-ENOENT** if /proc/self/ns does not exists. > > + * > > + * **-ENOMEM** if helper internal allocation fails. > > -ENOMEM can be removed. > > > + * > > + * **-EPERM** if not able to call helper. > > + * > > */ > > #define __BPF_FUNC_MAPPER(FN) \ > > FN(unspec), \ > > @@ -2859,7 +2885,8 @@ union bpf_attr { > > FN(sk_storage_get), \ > > FN(sk_storage_delete), \ > > FN(send_signal), \ > > - FN(tcp_gen_syncookie), > > + FN(tcp_gen_syncookie), \ > > + FN(get_current_pidns_info), > > > > /* integer value in 'imm' field of BPF_CALL instruction selects which helper > > * function eBPF program intends to call > > @@ -3610,4 +3637,10 @@ struct bpf_sockopt { > > __s32 retval; > > }; > > > > +struct bpf_pidns_info { > > + __u32 dev; /* dev_t from /proc/self/ns/pid inode */ > > /* dev_t of pid namespace pseudo file (typically /proc/seelf/ns/pid) > after following symbolic link */ > > > + __u32 nsid; > > + __u32 tgid; > > + __u32 pid; > > +}; > > #endif /* _UAPI__LINUX_BPF_H__ */ > > diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c > > index 8191a7db2777..3159f2a0188c 100644 > > --- a/kernel/bpf/core.c > > +++ b/kernel/bpf/core.c > > @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak; > > const struct bpf_func_proto bpf_get_current_comm_proto __weak; > > const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak; > > const struct bpf_func_proto bpf_get_local_storage_proto __weak; > > +const struct bpf_func_proto bpf_get_current_pidns_info __weak; > > > > const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) > > { > > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > > index 5e28718928ca..8dbe6347893c 100644 > > --- a/kernel/bpf/helpers.c > > +++ b/kernel/bpf/helpers.c > > @@ -11,6 +11,11 @@ > > #include <linux/uidgid.h> > > #include <linux/filter.h> > > #include <linux/ctype.h> > > +#include <linux/pid_namespace.h> > > +#include <linux/kdev_t.h> > > +#include <linux/stat.h> > > +#include <linux/namei.h> > > +#include <linux/version.h> > > > > #include "../../lib/kstrtox.h" > > > > @@ -312,6 +317,87 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src, > > preempt_enable(); > > } > > > > +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32, > > + size) > > +{ > > + const char *pidns_path = "/proc/self/ns/pid"; > > + struct pid_namespace *pidns = NULL; > > + struct filename *fname = NULL; > > + struct inode *inode; > > + struct path kp; > > + pid_t tgid = 0; > > + pid_t pid = 0; > > + int ret = -EINVAL; > > + int len; > > + > > + if (unlikely(in_interrupt() || > > + current->flags & (PF_KTHREAD | PF_EXITING))) > > + return -EPERM; > > + > > + if (unlikely(size != sizeof(struct bpf_pidns_info))) > > + return -EINVAL; > > + > > + pidns = task_active_pid_ns(current); > > + if (unlikely(!pidns)) > > + return -ENOENT; > > + > > + pidns_info->nsid = pidns->ns.inum; > > + pid = task_pid_nr_ns(current, pidns); > > + if (unlikely(!pid)) > > + goto clear; > > + > > + tgid = task_tgid_nr_ns(current, pidns); > > + if (unlikely(!tgid)) > > + goto clear; > > + > > + pidns_info->tgid = (u32) tgid; > > + pidns_info->pid = (u32) pid; > > + > [...] > > + fname = kmem_cache_alloc(names_cachep, GFP_ATOMIC); > > + if (unlikely(!fname)) { > > + ret = -ENOMEM; > > + goto clear; > > + } > > + const size_t fnamesize = offsetof(struct filename, iname[1]); > > + struct filename *tmp; > > + > > + tmp = kmalloc(fnamesize, GFP_ATOMIC); > > + if (unlikely(!tmp)) { > > + __putname(fname); > > + ret = -ENOMEM; > > + goto clear; > > + } > > + > > + tmp->name = (char *)fname; > > + fname = tmp; > > + len = strlen(pidns_path) + 1; > > + memcpy((char *)fname->name, pidns_path, len); > > + fname->uptr = NULL; > > + fname->aname = NULL; > > + fname->refcnt = 1; > > + > > + ret = filename_lookup(AT_FDCWD, fname, 0, &kp, NULL); > > + if (ret) > > + goto clear; > > + > > + inode = d_backing_inode(kp.dentry); > > + pidns_info->dev = (u32)inode->i_rdev; > The above can bee replaced with new nsfs interface function > ns_get_inum_dev(). > > + > > + return 0; > > + > > +clear: > > + memset((void *)pidns_info, 0, (size_t) size); > > + return ret; > > +} > > + > > +const struct bpf_func_proto bpf_get_current_pidns_info_proto = { > > + .func = bpf_get_current_pidns_info, > > + .gpl_only = false, > > + .ret_type = RET_INTEGER, > > + .arg1_type = ARG_PTR_TO_UNINIT_MEM, > > + .arg2_type = ARG_CONST_SIZE, > > +}; > > + > > #ifdef CONFIG_CGROUPS > > BPF_CALL_0(bpf_get_current_cgroup_id) > > { > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > > index ca1255d14576..5e1dc22765a5 100644 > > --- a/kernel/trace/bpf_trace.c > > +++ b/kernel/trace/bpf_trace.c > > @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) > > #endif > > case BPF_FUNC_send_signal: > > return &bpf_send_signal_proto; > > + case BPF_FUNC_get_current_pidns_info: > > + return &bpf_get_current_pidns_info_proto; > > default: > > return NULL; > > } > >
Al Viro <viro@zeniv.linux.org.uk> writes: > On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote: >> >> Carlos, >> >> Discussed with Eric today for what is the best way to get >> the device number for a namespace. The following patch seems >> a reasonable start although Eric would like to see >> how the helper is used in order to decide whether the >> interface looks right. >> >> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2) >> Author: Yonghong Song <yhs@fb.com> >> Date: Mon Sep 9 21:50:51 2019 -0700 >> >> nsfs: add an interface function ns_get_inum_dev() >> >> This patch added an interface function >> ns_get_inum_dev(). Given a ns_common structure, >> the function returns the inode and device >> numbers. The function will be used later >> by a newly added bpf helper. >> >> Signed-off-by: Yonghong Song <yhs@fb.com> >> >> diff --git a/fs/nsfs.c b/fs/nsfs.c >> index a0431642c6b5..a603c6fc3f54 100644 >> --- a/fs/nsfs.c >> +++ b/fs/nsfs.c >> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd) >> return ERR_PTR(-EINVAL); >> } >> >> +/* Get the device number for the current task pidns. >> + */ >> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev) >> +{ >> + *inum = ns->inum; >> + *dev = nsfs_mnt->mnt_sb->s_dev; >> +} > > Umm... Where would it get the device number once we get (hell knows > what for) multiple nsfs instances? I still don't understand what > would that be about, TBH... Is it really per-userns? Or something > else entirely? Eric, could you give some context? My goal is not to paint things into a corner, with future changes. Right now it is possible to stat a namespace file descriptor and get a device and inode number. Then compare that. I don't want people using the inode number in nsfd as some magic namespace id. We have had times in the past where there was more than one superblock and thus more than one device number. Further if userspace ever uses this heavily there may be times in the future where for checkpoint/restart purposes we will want multiple nsfd's so we can preserve the inode number accross a migration. Realistically there will probably just some kind of hotplug notification to userspace to say we have hotplugged your operatining system as a migration notification. Now the halway discussion did not quite capture everything I was trying to say but it at least got to the right ballpark. The helper in fs/nsfs.c should be: bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) { return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev)); } That way if/when there are multiple inodes identifying the same namespace the bpf programs don't need to change. Up farther in the stack it should be something like: > BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino) > { > return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino); > } > > const struct bpf_func_proto bpf_current_pidns_match_proto = { > .func = bpf_current_pins_match, > .gpl_only = true, > .ret_type = RET_INTEGER > .arg1_type = ARG_PTR_TO_DEVICE_NUMBER, > .arg2_type = ARG_PTR_TO_INODE_NUMBER, > }; That allows comparing what the bpf came up with with whatever value userspace generated by stating the file descriptor. That is the least bad suggestion I currently have for that functionality. It really would be better to not have that filter in the bpf program itself but in the infrastructure that binds a program to a set of tasks. The problem with this approach is whatever device/inode you have when the namespace they refer to exits there is the possibility that the inode will be reused. So your filter will eventually start matching on the wrong thing. Eric
Carlos Antonio Neira Bustos <cneirabustos@gmail.com> writes: > On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote: > Thanks a lot Yonghong. > I'll include this patch when submitting changes for version 11 of > this patch. Please see my reply to Al. Eric
On 9/11/19 9:16 AM, Eric W. Biederman wrote: > Al Viro <viro@zeniv.linux.org.uk> writes: > >> On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote: >>> >>> Carlos, >>> >>> Discussed with Eric today for what is the best way to get >>> the device number for a namespace. The following patch seems >>> a reasonable start although Eric would like to see >>> how the helper is used in order to decide whether the >>> interface looks right. >>> >>> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2) >>> Author: Yonghong Song <yhs@fb.com> >>> Date: Mon Sep 9 21:50:51 2019 -0700 >>> >>> nsfs: add an interface function ns_get_inum_dev() >>> >>> This patch added an interface function >>> ns_get_inum_dev(). Given a ns_common structure, >>> the function returns the inode and device >>> numbers. The function will be used later >>> by a newly added bpf helper. >>> >>> Signed-off-by: Yonghong Song <yhs@fb.com> >>> >>> diff --git a/fs/nsfs.c b/fs/nsfs.c >>> index a0431642c6b5..a603c6fc3f54 100644 >>> --- a/fs/nsfs.c >>> +++ b/fs/nsfs.c >>> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd) >>> return ERR_PTR(-EINVAL); >>> } >>> >>> +/* Get the device number for the current task pidns. >>> + */ >>> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev) >>> +{ >>> + *inum = ns->inum; >>> + *dev = nsfs_mnt->mnt_sb->s_dev; >>> +} >> >> Umm... Where would it get the device number once we get (hell knows >> what for) multiple nsfs instances? I still don't understand what >> would that be about, TBH... Is it really per-userns? Or something >> else entirely? Eric, could you give some context? > > My goal is not to paint things into a corner, with future changes. > Right now it is possible to stat a namespace file descriptor and > get a device and inode number. Then compare that. > > I don't want people using the inode number in nsfd as some magic > namespace id. > > We have had times in the past where there was more than one superblock > and thus more than one device number. Further if userspace ever uses > this heavily there may be times in the future where for > checkpoint/restart purposes we will want multiple nsfd's so we can > preserve the inode number accross a migration. > > Realistically there will probably just some kind of hotplug notification > to userspace to say we have hotplugged your operatining system as > a migration notification. > > Now the halway discussion did not quite capture everything I was trying > to say but it at least got to the right ballpark. > > The helper in fs/nsfs.c should be: > > bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) > { > return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev)); > } > > That way if/when there are multiple inodes identifying the same > namespace the bpf programs don't need to change. Thanks, Eric. This is indeed better. The bpf helper should focus on comparing dev/ino, instead of return the dev/ino to bpf program. So overall, nsfs related change will look like: diff --git a/fs/nsfs.c b/fs/nsfs.c index a0431642c6b5..7e78d89c2172 100644 --- a/fs/nsfs.c +++ b/fs/nsfs.c @@ -245,6 +245,11 @@ struct file *proc_ns_fget(int fd) return ERR_PTR(-EINVAL); } +bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) +{ + return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev)); +} + static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry) { struct inode *inode = d_inode(dentry); diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h index d31cb6215905..79639807e960 100644 --- a/include/linux/proc_ns.h +++ b/include/linux/proc_ns.h @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct task_struct *task, typedef struct ns_common *ns_get_path_helper_t(void *); extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t ns_get_cb, void *private_data); +extern bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino); extern int ns_get_name(char *buf, size_t size, struct task_struct *task, const struct proc_ns_operations *ns_ops); > > Up farther in the stack it should be something like: > >> BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino) >> { >> return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino); >> } >> >> const struct bpf_func_proto bpf_current_pidns_match_proto = { >> .func = bpf_current_pins_match, >> .gpl_only = true, >> .ret_type = RET_INTEGER >> .arg1_type = ARG_PTR_TO_DEVICE_NUMBER, >> .arg2_type = ARG_PTR_TO_INODE_NUMBER, >> }; > > That allows comparing what the bpf came up with with whatever value > userspace generated by stating the file descriptor. > > > That is the least bad suggestion I currently have for that > functionality. It really would be better to not have that filter in the > bpf program itself but in the infrastructure that binds a program to a > set of tasks. > > The problem with this approach is whatever device/inode you have when > the namespace they refer to exits there is the possibility that the > inode will be reused. So your filter will eventually start matching on > the wrong thing. I come up with a differeent helper definition, which is much more similar to existing bpf_get_current_pid_tgid() and helper definition much more conforms to bpf convention. diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 5e28718928ca..bc26903c80c7 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -11,6 +11,8 @@ #include <linux/uidgid.h> #include <linux/filter.h> #include <linux/ctype.h> +#include <linux/pid_namespace.h> +#include <linux/proc_ns.h> #include "../../lib/kstrtox.h" @@ -487,3 +489,33 @@ const struct bpf_func_proto bpf_strtoul_proto = { .arg4_type = ARG_PTR_TO_LONG, }; #endif + +BPF_CALL_2(bpf_get_ns_current_pid_tgid, u32, dev, u32, inum) +{ + struct task_struct *task = current; + struct pid_namespace *pidns; + pid_t pid, tgid; + + if (unlikely(!task)) + return -EINVAL; + + + pidns = task_active_pid_ns(task); + if (unlikely(!pidns)) + return -ENOENT; + + if (!ns_match(&pidns->ns, (dev_t)dev, inum)) + return -EINVAL; + + pid = task_pid_nr_ns(task, pidns); + tgid = task_tgid_nr_ns(task, pidns); + + return (u64) tgid << 32 | pid; +} + +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = { + .func = bpf_get_ns_current_pid_tgid, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, + .arg2_type = ARG_ANYTHING, +}; Existing usage of bpf_get_current_pid_tgid() can be converted to bpf_get_ns_current_pid_tgid() if ns dev/inode number is supplied. For bpf_get_ns_current_pid_tgid(), checking return value ( < 0 or not) is needed. > > Eric >
On 9/12/19 3:03 PM, carlos antonio neira bustos wrote: > Yonghong, > > I think bpf_get_ns_current_pid_tgid interface is a lot better than the > one proposed in my patch, how are we going to move forward? Should I > take these changes and refactor the selftests to use this new interface > and submit version 12 or as the interface changed completely is a new > set of patches?. This is to solve the same problem. You can just submit as version 12. This way, we preserves discussion history clearly. > > Eric, > Thank you very much for explaining the problem and your help to move > forward with this new helper. > > > Bests
On Fri, Sep 13, 2019 at 02:56:43AM +0000, Yonghong Song wrote: Yonghong, Great, I'll submit this new interface along self tests as version 12. Thanks for your help. Bests > > > On 9/12/19 3:03 PM, carlos antonio neira bustos wrote: > > Yonghong, > > > > I think bpf_get_ns_current_pid_tgid interface is a lot better than the > > one proposed in my patch, how are we going to move forward? Should I > > take these changes and refactor the selftests to use this new interface > > and submit version 12 or as the interface changed completely is a new > > set of patches?. > > This is to solve the same problem. You can just submit as version 12. > This way, we preserves discussion history clearly. > > > > > Eric, > > Thank you very much for explaining the problem and your help to move > > forward with this new helper. > > > > > > Bests
Yonghong Song <yhs@fb.com> writes: > On 9/11/19 9:16 AM, Eric W. Biederman wrote: >> Al Viro <viro@zeniv.linux.org.uk> writes: >> >>> On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote: >>>> >>>> Carlos, >>>> >>>> Discussed with Eric today for what is the best way to get >>>> the device number for a namespace. The following patch seems >>>> a reasonable start although Eric would like to see >>>> how the helper is used in order to decide whether the >>>> interface looks right. >>>> >>>> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2) >>>> Author: Yonghong Song <yhs@fb.com> >>>> Date: Mon Sep 9 21:50:51 2019 -0700 >>>> >>>> nsfs: add an interface function ns_get_inum_dev() >>>> >>>> This patch added an interface function >>>> ns_get_inum_dev(). Given a ns_common structure, >>>> the function returns the inode and device >>>> numbers. The function will be used later >>>> by a newly added bpf helper. >>>> >>>> Signed-off-by: Yonghong Song <yhs@fb.com> >>>> >>>> diff --git a/fs/nsfs.c b/fs/nsfs.c >>>> index a0431642c6b5..a603c6fc3f54 100644 >>>> --- a/fs/nsfs.c >>>> +++ b/fs/nsfs.c >>>> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd) >>>> return ERR_PTR(-EINVAL); >>>> } >>>> >>>> +/* Get the device number for the current task pidns. >>>> + */ >>>> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev) >>>> +{ >>>> + *inum = ns->inum; >>>> + *dev = nsfs_mnt->mnt_sb->s_dev; >>>> +} >>> >>> Umm... Where would it get the device number once we get (hell knows >>> what for) multiple nsfs instances? I still don't understand what >>> would that be about, TBH... Is it really per-userns? Or something >>> else entirely? Eric, could you give some context? >> >> My goal is not to paint things into a corner, with future changes. >> Right now it is possible to stat a namespace file descriptor and >> get a device and inode number. Then compare that. >> >> I don't want people using the inode number in nsfd as some magic >> namespace id. >> >> We have had times in the past where there was more than one superblock >> and thus more than one device number. Further if userspace ever uses >> this heavily there may be times in the future where for >> checkpoint/restart purposes we will want multiple nsfd's so we can >> preserve the inode number accross a migration. >> >> Realistically there will probably just some kind of hotplug notification >> to userspace to say we have hotplugged your operatining system as >> a migration notification. >> >> Now the halway discussion did not quite capture everything I was trying >> to say but it at least got to the right ballpark. >> >> The helper in fs/nsfs.c should be: >> >> bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) >> { >> return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev)); >> } >> >> That way if/when there are multiple inodes identifying the same >> namespace the bpf programs don't need to change. > > Thanks, Eric. This is indeed better. The bpf helper should focus > on comparing dev/ino, instead of return the dev/ino to bpf program. > > So overall, nsfs related change will look like: > > diff --git a/fs/nsfs.c b/fs/nsfs.c > index a0431642c6b5..7e78d89c2172 100644 > --- a/fs/nsfs.c > +++ b/fs/nsfs.c > @@ -245,6 +245,11 @@ struct file *proc_ns_fget(int fd) > return ERR_PTR(-EINVAL); > } > > +bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) > +{ > + return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev)); > +} > + > static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry) > { > struct inode *inode = d_inode(dentry); > diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h > index d31cb6215905..79639807e960 100644 > --- a/include/linux/proc_ns.h > +++ b/include/linux/proc_ns.h > @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct > task_struct *task, > typedef struct ns_common *ns_get_path_helper_t(void *); > extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t > ns_get_cb, > void *private_data); > +extern bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino); > > extern int ns_get_name(char *buf, size_t size, struct task_struct *task, > const struct proc_ns_operations *ns_ops); > >> >> Up farther in the stack it should be something like: >> >>> BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino) >>> { >>> return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino); >>> } >>> >>> const struct bpf_func_proto bpf_current_pidns_match_proto = { >>> .func = bpf_current_pins_match, >>> .gpl_only = true, >>> .ret_type = RET_INTEGER >>> .arg1_type = ARG_PTR_TO_DEVICE_NUMBER, >>> .arg2_type = ARG_PTR_TO_INODE_NUMBER, >>> }; >> >> That allows comparing what the bpf came up with with whatever value >> userspace generated by stating the file descriptor. >> >> >> That is the least bad suggestion I currently have for that >> functionality. It really would be better to not have that filter in the >> bpf program itself but in the infrastructure that binds a program to a >> set of tasks. >> >> The problem with this approach is whatever device/inode you have when >> the namespace they refer to exits there is the possibility that the >> inode will be reused. So your filter will eventually start matching on >> the wrong thing. > > I come up with a differeent helper definition, which is much more > similar to existing bpf_get_current_pid_tgid() and helper definition > much more conforms to bpf convention. There is a problem with your bpf_get_ns_current_pid_tgid below. The inode number is a 64bit number. To be nice to old userspace we try and not use 64bit inode numbers where they are not required but in this case we should not use an interface that assumes inode numbers are 32bit. They just aren't. I didn't know how to express that in the bpf proto so I did what I could. The alternative to this would be to simply restrict this helper to bpf programs registered in the initial pid namespace. At which point you could just ensure all the numbers are in the global pid namespace. Hmm. Looing at the comment below I am confused. > diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c > index 5e28718928ca..bc26903c80c7 100644 > --- a/kernel/bpf/helpers.c > +++ b/kernel/bpf/helpers.c > @@ -11,6 +11,8 @@ > #include <linux/uidgid.h> > #include <linux/filter.h> > #include <linux/ctype.h> > +#include <linux/pid_namespace.h> > +#include <linux/proc_ns.h> > > #include "../../lib/kstrtox.h" > > @@ -487,3 +489,33 @@ const struct bpf_func_proto bpf_strtoul_proto = { > .arg4_type = ARG_PTR_TO_LONG, > }; > #endif > + > +BPF_CALL_2(bpf_get_ns_current_pid_tgid, u32, dev, u32, inum) > +{ > + struct task_struct *task = current; > + struct pid_namespace *pidns; > + pid_t pid, tgid; > + > + if (unlikely(!task)) > + return -EINVAL; > + > + > + pidns = task_active_pid_ns(task); > + if (unlikely(!pidns)) > + return -ENOENT; > + > + if (!ns_match(&pidns->ns, (dev_t)dev, inum)) > + return -EINVAL; > + > + pid = task_pid_nr_ns(task, pidns); > + tgid = task_tgid_nr_ns(task, pidns); > + > + return (u64) tgid << 32 | pid; > +} > + > +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = { > + .func = bpf_get_ns_current_pid_tgid, > + .gpl_only = false, > + .ret_type = RET_INTEGER, > + .arg1_type = ARG_ANYTHING, > + .arg2_type = ARG_ANYTHING, > +}; > > Existing usage of bpf_get_current_pid_tgid() can be converted > to bpf_get_ns_current_pid_tgid() if ns dev/inode number > is supplied. For bpf_get_ns_current_pid_tgid(), checking > return value ( < 0 or not) is needed. Ok. I missed something. What is the problem bpf_get_ns_current_pid_tgid trying to solve that bpf_get_current_pid_tgid does not solve. I would think since much of tracing ebpf is fundamentally restricted to the global root user. Limiting the ebpf programs to the initial pid namespace should not be a problem. So I don't understand why you need to specify the namespace in the ebpf call. Can someone give me a clue what problem is being sovled by this new call? Eric
On 9/13/19 5:59 PM, Eric W. Biederman wrote: > Yonghong Song <yhs@fb.com> writes: > >> On 9/11/19 9:16 AM, Eric W. Biederman wrote: >>> Al Viro <viro@zeniv.linux.org.uk> writes: >>> >>>> On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote: >>>>> >>>>> Carlos, >>>>> >>>>> Discussed with Eric today for what is the best way to get >>>>> the device number for a namespace. The following patch seems >>>>> a reasonable start although Eric would like to see >>>>> how the helper is used in order to decide whether the >>>>> interface looks right. >>>>> >>>>> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2) >>>>> Author: Yonghong Song <yhs@fb.com> >>>>> Date: Mon Sep 9 21:50:51 2019 -0700 >>>>> >>>>> nsfs: add an interface function ns_get_inum_dev() >>>>> >>>>> This patch added an interface function >>>>> ns_get_inum_dev(). Given a ns_common structure, >>>>> the function returns the inode and device >>>>> numbers. The function will be used later >>>>> by a newly added bpf helper. >>>>> >>>>> Signed-off-by: Yonghong Song <yhs@fb.com> >>>>> >>>>> diff --git a/fs/nsfs.c b/fs/nsfs.c >>>>> index a0431642c6b5..a603c6fc3f54 100644 >>>>> --- a/fs/nsfs.c >>>>> +++ b/fs/nsfs.c >>>>> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd) >>>>> return ERR_PTR(-EINVAL); >>>>> } >>>>> >>>>> +/* Get the device number for the current task pidns. >>>>> + */ >>>>> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev) >>>>> +{ >>>>> + *inum = ns->inum; >>>>> + *dev = nsfs_mnt->mnt_sb->s_dev; >>>>> +} >>>> >>>> Umm... Where would it get the device number once we get (hell knows >>>> what for) multiple nsfs instances? I still don't understand what >>>> would that be about, TBH... Is it really per-userns? Or something >>>> else entirely? Eric, could you give some context? >>> >>> My goal is not to paint things into a corner, with future changes. >>> Right now it is possible to stat a namespace file descriptor and >>> get a device and inode number. Then compare that. >>> >>> I don't want people using the inode number in nsfd as some magic >>> namespace id. >>> >>> We have had times in the past where there was more than one superblock >>> and thus more than one device number. Further if userspace ever uses >>> this heavily there may be times in the future where for >>> checkpoint/restart purposes we will want multiple nsfd's so we can >>> preserve the inode number accross a migration. >>> >>> Realistically there will probably just some kind of hotplug notification >>> to userspace to say we have hotplugged your operatining system as >>> a migration notification. >>> >>> Now the halway discussion did not quite capture everything I was trying >>> to say but it at least got to the right ballpark. >>> >>> The helper in fs/nsfs.c should be: >>> >>> bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) >>> { >>> return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev)); >>> } >>> >>> That way if/when there are multiple inodes identifying the same >>> namespace the bpf programs don't need to change. >> >> Thanks, Eric. This is indeed better. The bpf helper should focus >> on comparing dev/ino, instead of return the dev/ino to bpf program. >> >> So overall, nsfs related change will look like: >> >> diff --git a/fs/nsfs.c b/fs/nsfs.c >> index a0431642c6b5..7e78d89c2172 100644 >> --- a/fs/nsfs.c >> +++ b/fs/nsfs.c >> @@ -245,6 +245,11 @@ struct file *proc_ns_fget(int fd) >> return ERR_PTR(-EINVAL); >> } >> >> +bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) >> +{ >> + return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev)); >> +} >> + >> static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry) >> { >> struct inode *inode = d_inode(dentry); >> diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h >> index d31cb6215905..79639807e960 100644 >> --- a/include/linux/proc_ns.h >> +++ b/include/linux/proc_ns.h >> @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct >> task_struct *task, >> typedef struct ns_common *ns_get_path_helper_t(void *); >> extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t >> ns_get_cb, >> void *private_data); >> +extern bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino); >> >> extern int ns_get_name(char *buf, size_t size, struct task_struct *task, >> const struct proc_ns_operations *ns_ops); >> >>> >>> Up farther in the stack it should be something like: >>> >>>> BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino) >>>> { >>>> return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino); >>>> } >>>> >>>> const struct bpf_func_proto bpf_current_pidns_match_proto = { >>>> .func = bpf_current_pins_match, >>>> .gpl_only = true, >>>> .ret_type = RET_INTEGER >>>> .arg1_type = ARG_PTR_TO_DEVICE_NUMBER, >>>> .arg2_type = ARG_PTR_TO_INODE_NUMBER, >>>> }; >>> >>> That allows comparing what the bpf came up with with whatever value >>> userspace generated by stating the file descriptor. >>> >>> >>> That is the least bad suggestion I currently have for that >>> functionality. It really would be better to not have that filter in the >>> bpf program itself but in the infrastructure that binds a program to a >>> set of tasks. >>> >>> The problem with this approach is whatever device/inode you have when >>> the namespace they refer to exits there is the possibility that the >>> inode will be reused. So your filter will eventually start matching on >>> the wrong thing. >> >> I come up with a differeent helper definition, which is much more >> similar to existing bpf_get_current_pid_tgid() and helper definition >> much more conforms to bpf convention. > > There is a problem with your bpf_get_ns_current_pid_tgid below. > The inode number is a 64bit number. To be nice to old userspace > we try and not use 64bit inode numbers where they are not required > but in this case we should not use an interface that assumes inode > numbers are 32bit. They just aren't. > > I didn't know how to express that in the bpf proto so I did what > I could. We can change inum to u64. Just change the prototype like `u64, inum` should be good. > > The alternative to this would be to simply restrict this > helper to bpf programs registered in the initial pid namespace. > At which point you could just ensure all the numbers are in > the global pid namespace. > > Hmm. Looing at the comment below I am confused. > >> diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c >> index 5e28718928ca..bc26903c80c7 100644 >> --- a/kernel/bpf/helpers.c >> +++ b/kernel/bpf/helpers.c >> @@ -11,6 +11,8 @@ >> #include <linux/uidgid.h> >> #include <linux/filter.h> >> #include <linux/ctype.h> >> +#include <linux/pid_namespace.h> >> +#include <linux/proc_ns.h> >> >> #include "../../lib/kstrtox.h" >> >> @@ -487,3 +489,33 @@ const struct bpf_func_proto bpf_strtoul_proto = { >> .arg4_type = ARG_PTR_TO_LONG, >> }; >> #endif >> + >> +BPF_CALL_2(bpf_get_ns_current_pid_tgid, u32, dev, u32, inum) >> +{ >> + struct task_struct *task = current; >> + struct pid_namespace *pidns; >> + pid_t pid, tgid; >> + >> + if (unlikely(!task)) >> + return -EINVAL; >> + >> + >> + pidns = task_active_pid_ns(task); >> + if (unlikely(!pidns)) >> + return -ENOENT; >> + >> + if (!ns_match(&pidns->ns, (dev_t)dev, inum)) >> + return -EINVAL; >> + >> + pid = task_pid_nr_ns(task, pidns); >> + tgid = task_tgid_nr_ns(task, pidns); >> + >> + return (u64) tgid << 32 | pid; >> +} >> + >> +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = { >> + .func = bpf_get_ns_current_pid_tgid, >> + .gpl_only = false, >> + .ret_type = RET_INTEGER, >> + .arg1_type = ARG_ANYTHING, >> + .arg2_type = ARG_ANYTHING, >> +}; >> >> Existing usage of bpf_get_current_pid_tgid() can be converted >> to bpf_get_ns_current_pid_tgid() if ns dev/inode number >> is supplied. For bpf_get_ns_current_pid_tgid(), checking >> return value ( < 0 or not) is needed. > > Ok. I missed something. > > What is the problem bpf_get_ns_current_pid_tgid trying to solve > that bpf_get_current_pid_tgid does not solve. > > I would think since much of tracing ebpf is fundamentally restricted > to the global root user. Limiting the ebpf programs to the initial > pid namespace should not be a problem. > > So I don't understand why you need to specify the namespace in > the ebpf call. > > Can someone give me a clue what problem is being sovled by this > new call? We want to run the bpf program inside the namespace. There are parallel work to introduce CAP_BPF and CAP_TRACING (https://lore.kernel.org/bpf/20190906231053.1276792-1-ast@kernel.org/T/#t) to facilitate this. We have users requesting to use bcc tools inside the containers. https://github.com/iovisor/bcc/issues/1875 https://github.com/iovisor/bcc/issues/1366 https://github.com/iovisor/bcc/issues/1329 https://github.com/iovisor/bcc/issues/1532 ... Yes, this may require granting `root` privilege to containers. This can be done outside container as well. But it is just a big usability improvement if people can do inside the containers. In addition, we have requests below and internal requests as well to filter based on containers. https://github.com/iovisor/bcc/issues/1119 The new helper permits at root that you can filter based on a particular container (not container id, but dev/inode should identifying one). Hope this clarify why this helper is useful for tracing community. > > Eric >
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 5b9d22338606..819cb1c84be0 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1055,6 +1055,7 @@ extern const struct bpf_func_proto bpf_get_local_storage_proto; extern const struct bpf_func_proto bpf_strtol_proto; extern const struct bpf_func_proto bpf_strtoul_proto; extern const struct bpf_func_proto bpf_tcp_sock_proto; +extern const struct bpf_func_proto bpf_get_current_pidns_info_proto; /* Shared helpers among cBPF and eBPF. */ void bpf_user_rnd_init_once(void); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index b5889257cc33..3ec9aa1438b7 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2747,6 +2747,32 @@ union bpf_attr { * **-EOPNOTSUPP** kernel configuration does not enable SYN cookies * * **-EPROTONOSUPPORT** IP packet version is not 4 or 6 + * + * int bpf_get_current_pidns_info(struct bpf_pidns_info *pidns, u32 size_of_pidns) + * Description + * Get tgid, pid and namespace id as seen by the current namespace, + * and device major/minor numbers from /proc/self/ns/pid. Such + * information is stored in *pidns* of size *size*. + * + * This helper is used when pid filtering is needed inside a + * container as bpf_get_current_tgid() helper always returns the + * pid id as seen by the root namespace. + * Return + * 0 on success + * + * On failure, the returned value is one of the following: + * + * **-EINVAL** if *size_of_pidns* is not valid or unable to get ns, pid + * or tgid of the current task. + * + * **-ENOENT** if /proc/self/ns/pid does not exists. + * + * **-ENOENT** if /proc/self/ns does not exists. + * + * **-ENOMEM** if helper internal allocation fails. + * + * **-EPERM** if not able to call helper. + * */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -2859,7 +2885,8 @@ union bpf_attr { FN(sk_storage_get), \ FN(sk_storage_delete), \ FN(send_signal), \ - FN(tcp_gen_syncookie), + FN(tcp_gen_syncookie), \ + FN(get_current_pidns_info), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call @@ -3610,4 +3637,10 @@ struct bpf_sockopt { __s32 retval; }; +struct bpf_pidns_info { + __u32 dev; /* dev_t from /proc/self/ns/pid inode */ + __u32 nsid; + __u32 tgid; + __u32 pid; +}; #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 8191a7db2777..3159f2a0188c 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -2038,6 +2038,7 @@ const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak; const struct bpf_func_proto bpf_get_current_comm_proto __weak; const struct bpf_func_proto bpf_get_current_cgroup_id_proto __weak; const struct bpf_func_proto bpf_get_local_storage_proto __weak; +const struct bpf_func_proto bpf_get_current_pidns_info __weak; const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void) { diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 5e28718928ca..8dbe6347893c 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -11,6 +11,11 @@ #include <linux/uidgid.h> #include <linux/filter.h> #include <linux/ctype.h> +#include <linux/pid_namespace.h> +#include <linux/kdev_t.h> +#include <linux/stat.h> +#include <linux/namei.h> +#include <linux/version.h> #include "../../lib/kstrtox.h" @@ -312,6 +317,87 @@ void copy_map_value_locked(struct bpf_map *map, void *dst, void *src, preempt_enable(); } +BPF_CALL_2(bpf_get_current_pidns_info, struct bpf_pidns_info *, pidns_info, u32, + size) +{ + const char *pidns_path = "/proc/self/ns/pid"; + struct pid_namespace *pidns = NULL; + struct filename *fname = NULL; + struct inode *inode; + struct path kp; + pid_t tgid = 0; + pid_t pid = 0; + int ret = -EINVAL; + int len; + + if (unlikely(in_interrupt() || + current->flags & (PF_KTHREAD | PF_EXITING))) + return -EPERM; + + if (unlikely(size != sizeof(struct bpf_pidns_info))) + return -EINVAL; + + pidns = task_active_pid_ns(current); + if (unlikely(!pidns)) + return -ENOENT; + + pidns_info->nsid = pidns->ns.inum; + pid = task_pid_nr_ns(current, pidns); + if (unlikely(!pid)) + goto clear; + + tgid = task_tgid_nr_ns(current, pidns); + if (unlikely(!tgid)) + goto clear; + + pidns_info->tgid = (u32) tgid; + pidns_info->pid = (u32) pid; + + fname = kmem_cache_alloc(names_cachep, GFP_ATOMIC); + if (unlikely(!fname)) { + ret = -ENOMEM; + goto clear; + } + const size_t fnamesize = offsetof(struct filename, iname[1]); + struct filename *tmp; + + tmp = kmalloc(fnamesize, GFP_ATOMIC); + if (unlikely(!tmp)) { + __putname(fname); + ret = -ENOMEM; + goto clear; + } + + tmp->name = (char *)fname; + fname = tmp; + len = strlen(pidns_path) + 1; + memcpy((char *)fname->name, pidns_path, len); + fname->uptr = NULL; + fname->aname = NULL; + fname->refcnt = 1; + + ret = filename_lookup(AT_FDCWD, fname, 0, &kp, NULL); + if (ret) + goto clear; + + inode = d_backing_inode(kp.dentry); + pidns_info->dev = (u32)inode->i_rdev; + + return 0; + +clear: + memset((void *)pidns_info, 0, (size_t) size); + return ret; +} + +const struct bpf_func_proto bpf_get_current_pidns_info_proto = { + .func = bpf_get_current_pidns_info, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_UNINIT_MEM, + .arg2_type = ARG_CONST_SIZE, +}; + #ifdef CONFIG_CGROUPS BPF_CALL_0(bpf_get_current_cgroup_id) { diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ca1255d14576..5e1dc22765a5 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -709,6 +709,8 @@ tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) #endif case BPF_FUNC_send_signal: return &bpf_send_signal_proto; + case BPF_FUNC_get_current_pidns_info: + return &bpf_get_current_pidns_info_proto; default: return NULL; }
This helper(bpf_get_current_pidns_info) obtains the active namespace from current and returns pid, tgid, device and namespace id as seen from that namespace, allowing to instrument a process inside a container. Signed-off-by: Carlos Neira <cneirabustos@gmail.com> --- include/linux/bpf.h | 1 + include/uapi/linux/bpf.h | 35 +++++++++++++++++++- kernel/bpf/core.c | 1 + kernel/bpf/helpers.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++ kernel/trace/bpf_trace.c | 2 ++ 5 files changed, 124 insertions(+), 1 deletion(-)