Message ID | 20190710141548.132193-1-joel@joelfernandes.org |
---|---|
Headers | show |
Series | Add support to directly attach BPF program to ftrace | expand |
On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote: > Hi, why are you cc-ing the whole world for this patch set? I'll reply to all as well, but I suspect a bunch of folks consider it spam. Please read Documentation/bpf/bpf_devel_QA.rst Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam, so I'm not sure this set reached public mailing lists. > These patches make it possible to attach BPF programs directly to tracepoints > using ftrace (/sys/kernel/debug/tracing) without needing the process doing the > attach to be alive. This has the following benefits: > > 1. Simplified Security: In Android, we have finer-grained security controls to > specific ftrace trace events using SELinux labels. We control precisely who is > allowed to enable an ftrace event already. By adding a node to ftrace for > attaching BPF programs, we can use the same mechanism to further control who is > allowed to attach to a trace event. > > 2. Process lifetime: In Android we are adding usecases where a tracing program > needs to be attached all the time to a tracepoint, for the full life time of > the system. Such as to gather statistics where there no need for a detach for > the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this > means keeping a process alive all the time. However, in Android our BPF loader > currently (for hardeneded security) involves just starting a process at boot > time, doing the BPF program loading, and then pinning them to /sys/fs/bpf. We > don't keep this process alive all the time. It is more suitable to do a > one-shot attach of the program using ftrace and not need to have a process > alive all the time anymore for this. Such process also needs elevated > privileges since tracepoint program loading currently requires CAP_SYS_ADMIN > anyway so by design Android's bpfloader runs once at init and exits. > > This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf > The following commands can be written into it: > attach:<fd> Attaches BPF prog fd to tracepoint > detach:<fd> Detaches BPF prog fd to tracepoint Looks like, to detach a program the user needs to read a text file, parse bpf prog id from text into binary. Then call fd_from_id bpf syscall, get a binary FD, convert it back to text and write as a text back into this file. I think this is just a single example why text based apis are not accepted in bpf anymore. Through the patch set you call it ftrace. As far as I can see, this set has zero overlap with ftrace. There is no ftrace-bpf connection here at all that we discussed in the past Steven. It's all quite confusing. I suggest android to solve sticky raw_tracepoint problem with user space deamon. The reasons, you point out why user daemon cannot be used, sound weak to me. Another acceptable solution would be to introduce pinning of raw_tp objects. bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would be natural extension.
On Tue, Jul 16, 2019 at 01:54:57PM -0700, Alexei Starovoitov wrote: > On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote: > > Hi, > > why are you cc-ing the whole world for this patch set? Well, the whole world happens to be interested in BPF on Android. > I'll reply to all as well, but I suspect a bunch of folks consider it spam. > Please read Documentation/bpf/bpf_devel_QA.rst Ok, I'll read it. > Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam, > so I'm not sure this set reached public mailing lists. Certainly the CC list here is not added to folks who consider it spam. All the folks added have been interested in BPF on Android at various points of time. Is this CC list really that large? It has around 24 email addresses or so. I can trim it a bit if needed. Also, you sound like as if people are screaming at me to stop emailing them, certainly that's not the case and no one has told me it is spam. And, it did reach the public archive btw: https://lore.kernel.org/netdev/20190716205455.iimn3pqpvsc3k4ry@ast-mbp.dhcp.thefacebook.com/T/#m1460ba463b78312e38b68b8c118f673d2ead9446 > > These patches make it possible to attach BPF programs directly to tracepoints > > using ftrace (/sys/kernel/debug/tracing) without needing the process doing the > > attach to be alive. This has the following benefits: > > > > 1. Simplified Security: In Android, we have finer-grained security controls to > > specific ftrace trace events using SELinux labels. We control precisely who is > > allowed to enable an ftrace event already. By adding a node to ftrace for > > attaching BPF programs, we can use the same mechanism to further control who is > > allowed to attach to a trace event. > > > > 2. Process lifetime: In Android we are adding usecases where a tracing program > > needs to be attached all the time to a tracepoint, for the full life time of > > the system. Such as to gather statistics where there no need for a detach for > > the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this > > means keeping a process alive all the time. However, in Android our BPF loader > > currently (for hardeneded security) involves just starting a process at boot > > time, doing the BPF program loading, and then pinning them to /sys/fs/bpf. We > > don't keep this process alive all the time. It is more suitable to do a > > one-shot attach of the program using ftrace and not need to have a process > > alive all the time anymore for this. Such process also needs elevated > > privileges since tracepoint program loading currently requires CAP_SYS_ADMIN > > anyway so by design Android's bpfloader runs once at init and exits. > > > > This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf > > The following commands can be written into it: > > attach:<fd> Attaches BPF prog fd to tracepoint > > detach:<fd> Detaches BPF prog fd to tracepoint > > Looks like, to detach a program the user needs to read a text file, > parse bpf prog id from text into binary. Then call fd_from_id bpf syscall, > get a binary FD, convert it back to text and write as a text back into this file. > I think this is just a single example why text based apis are not accepted > in bpf anymore. This can also be considered a tracefs API. And we can certainly change the detach to accept program ids as well if that's easier. 'detach:prog:<prog_id>' and 'detach:fd:<fd>'. By the way, I can also list the set of cumbersome steps needed to attach a BPF program using perf and I bet it will be longer ;-) > Through the patch set you call it ftrace. As far as I can see, this set > has zero overlap with ftrace. There is no ftrace-bpf connection here at all > that we discussed in the past Steven. It's all quite confusing. It depends on what you mean by ftrace, may be I can call it 'trace events' or something if it is less ambiguious. All of this has been collectively called ftrace before. I am not sure if you you are making sense actually, trace_events mechanism is a part of ftrace. See the documentation: Documentation/trace/ftrace.rst. Even the documentation file name has the word ftrace in it. I have also spoken to Steven before about this, I don't think he ever told me there is no connection so again I am a bit lost at your comments. > I suggest android to solve sticky raw_tracepoint problem with user space deamon. > The reasons, you point out why user daemon cannot be used, sound weak to me. I don't think it is weak. It seems overkill to have a daemon for a trace event that is say supposed to be attached to all the time for the lifetime of the system. Why should there be a daemon consuming resources if it is active all the time? In Android, we are very careful about spawning useless processes and leaving them alive for the lifetime of the system - for no good reason. Our security teams also don't like this, and they can comment more. > Another acceptable solution would be to introduce pinning of raw_tp objects. > bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would > be natural extension. I don't think the pinning solves the security problem, it just solves the process lifetime problem. Currently, attaching trace events through perf requires CAP_SYS_ADMIN. However, with ftrace events, we already control security of events by labeling the nodes in tracefs and granting access to the labeled context through the selinux policies. Having a 'bpf' node in tracefs for events, and granting access to the labels is a natural extension. I also thought about the pinning idea before, but we also want to add support for not just raw tracepoints, but also regular tracepoints (events if you will). I am hesitant to add a new BPF API just for creating regular tracepoints and then pinning those as well. I don't see why a new bpf node for a trace event is a bad idea, really. tracefs is how we deal with trace events on Android. We do it in production systems. This is a natural extension to that and fits with the security model well. thanks, - Joel
On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote: > > I also thought about the pinning idea before, but we also want to add support > for not just raw tracepoints, but also regular tracepoints (events if you > will). I am hesitant to add a new BPF API just for creating regular > tracepoints and then pinning those as well. and they should be done through the pinning as well. > I don't see why a new bpf node for a trace event is a bad idea, really. See the patches for kprobe/uprobe FD-based api and the reasons behind it. tldr: text is racy, doesn't scale, poor security, etc. > tracefs is how we deal with trace events on Android. android has made mistakes in the past as well. > This is a natural extension to that and fits with the security model > well. I think it's the opposite. I'm absolutely against text based apis.
On Tue, 16 Jul 2019 17:30:50 -0400 Joel Fernandes <joel@joelfernandes.org> wrote: > I don't see why a new bpf node for a trace event is a bad idea, really. > tracefs is how we deal with trace events on Android. We do it in production > systems. This is a natural extension to that and fits with the security model > well. What I would like to see is a way to have BPF inject data into the ftrace ring buffer directly. There's a bpf_trace_printk() that I find a bit of a hack (especially since it hooks into trace_printk() which is only for debugging purposes). Have a dedicated bpf ftrace ring buffer event that can be triggered is what I am looking for. Then comes the issue of what ring buffer to place it in, as ftrace can have multiple ring buffer instances. But these instances are defined by the tracefs instances directory. Having a way to associate a bpf program to a specific event in a specific tracefs directory could allow for ways to trigger writing into the correct ftrace buffer. But looking over the patches, I see what Alexei means that there's no overlap with ftrace and these patches except for the tracefs directory itself (which is part of the ftrace infrastructure). And the trace events are technically part of the ftrace infrastructure too. I see the tracefs interface being used, but I don't see how the bpf programs being added affect the ftrace ring buffer or other parts of ftrace. And I'm guessing that's what is confusing Alexei. -- Steve
On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote: > On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote: > > > > I also thought about the pinning idea before, but we also want to add support > > for not just raw tracepoints, but also regular tracepoints (events if you > > will). I am hesitant to add a new BPF API just for creating regular > > tracepoints and then pinning those as well. > > and they should be done through the pinning as well. Hmm ok, I will give it some more thought. > > I don't see why a new bpf node for a trace event is a bad idea, really. > > See the patches for kprobe/uprobe FD-based api and the reasons behind it. > tldr: text is racy, doesn't scale, poor security, etc. Is it possible to use perf without CAP_SYS_ADMIN and control security at the per-event level? We are selective about who can access which event, using selinux. That's how our ftrace-based tracers work. Its fine grained per-event control. That's where I was going with the tracefs approach since we get that granularity using the file system. Thanks.
On Tue, 16 Jul 2019 15:26:52 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> I'm absolutely against text based apis.
I guess you don't use /proc ;-)
-- Steve
On Tue, Jul 16, 2019 at 06:31:17PM -0400, Steven Rostedt wrote: > On Tue, 16 Jul 2019 17:30:50 -0400 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > I don't see why a new bpf node for a trace event is a bad idea, really. > > tracefs is how we deal with trace events on Android. We do it in production > > systems. This is a natural extension to that and fits with the security model > > well. > > What I would like to see is a way to have BPF inject data into the > ftrace ring buffer directly. There's a bpf_trace_printk() that I find a > bit of a hack (especially since it hooks into trace_printk() which is > only for debugging purposes). Have a dedicated bpf ftrace ring > buffer event that can be triggered is what I am looking for. Then comes > the issue of what ring buffer to place it in, as ftrace can have > multiple ring buffer instances. But these instances are defined by the > tracefs instances directory. Having a way to associate a bpf program to > a specific event in a specific tracefs directory could allow for ways to > trigger writing into the correct ftrace buffer. But his problem is with doing the association of a BPF program with tracefs itself. How would you attach a BPF program with tracefs without doing a text based approach? His problem is with the text based approach per his last email. > But looking over the patches, I see what Alexei means that there's no > overlap with ftrace and these patches except for the tracefs directory > itself (which is part of the ftrace infrastructure). And the trace > events are technically part of the ftrace infrastructure too. I see the > tracefs interface being used, but I don't see how the bpf programs > being added affect the ftrace ring buffer or other parts of ftrace. And > I'm guessing that's what is confusing Alexei. In a follow-up patch which I am still writing, I am using the trace ring buffer as temporary storage since I am formatting the trace event into it. This patch you are replying to is just for raw tracepoint and yes, I agree this one does not use the ring buffer, but a future addition to it does. So I don't think the association of this patch series with ftrace is going to be an issue IMO. thanks, - Joel
On Tue, Jul 16, 2019 at 06:41:50PM -0400, Joel Fernandes wrote: > On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote: > > On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote: > > > > > > I also thought about the pinning idea before, but we also want to add support > > > for not just raw tracepoints, but also regular tracepoints (events if you > > > will). I am hesitant to add a new BPF API just for creating regular > > > tracepoints and then pinning those as well. > > > > and they should be done through the pinning as well. > > Hmm ok, I will give it some more thought. I think I can make the new BPF API + pinning approach work, I will try to work on something like this and post it soon. Also, I had a question below if you don't mind taking a look: thanks Alexei! > > > I don't see why a new bpf node for a trace event is a bad idea, really. > > > > See the patches for kprobe/uprobe FD-based api and the reasons behind it. > > tldr: text is racy, doesn't scale, poor security, etc. > > Is it possible to use perf without CAP_SYS_ADMIN and control security at the > per-event level? We are selective about who can access which event, using > selinux. That's how our ftrace-based tracers work. Its fine grained per-event > control. That's where I was going with the tracefs approach since we get that > granularity using the file system. > > Thanks. >
On Tue, Jul 16, 2019 at 07:55:00PM -0400, Joel Fernandes wrote: > On Tue, Jul 16, 2019 at 06:41:50PM -0400, Joel Fernandes wrote: > > On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote: > > > On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote: > > > > > > > > I also thought about the pinning idea before, but we also want to add support > > > > for not just raw tracepoints, but also regular tracepoints (events if you > > > > will). I am hesitant to add a new BPF API just for creating regular > > > > tracepoints and then pinning those as well. > > > > > > and they should be done through the pinning as well. > > > > Hmm ok, I will give it some more thought. > > I think I can make the new BPF API + pinning approach work, I will try to > work on something like this and post it soon. > > Also, I had a question below if you don't mind taking a look: > > thanks Alexei! > > > > > I don't see why a new bpf node for a trace event is a bad idea, really. > > > > > > See the patches for kprobe/uprobe FD-based api and the reasons behind it. > > > tldr: text is racy, doesn't scale, poor security, etc. > > > > Is it possible to use perf without CAP_SYS_ADMIN and control security at the > > per-event level? We are selective about who can access which event, using > > selinux. That's how our ftrace-based tracers work. Its fine grained per-event > > control. That's where I was going with the tracefs approach since we get that > > granularity using the file system. android's choice of selinux is not a factor in deciding kernel apis. It's completely separate discusion wether disallowing particular tracepoints for given user make sense at all. Just because you can hack it in via selinux blocking particular /sys/debug/tracing/ directory and convince yourself that it's somehow makes android more secure. It doesn't mean that all new api should fit into this model. I think allowing one tracepoint and disallowing another is pointless from security point of view. Tracing bpf program can do bpf_probe_read of anything.
On Tue, Jul 16, 2019 at 06:31:17PM -0400, Steven Rostedt wrote: > On Tue, 16 Jul 2019 17:30:50 -0400 > Joel Fernandes <joel@joelfernandes.org> wrote: > > > I don't see why a new bpf node for a trace event is a bad idea, really. > > tracefs is how we deal with trace events on Android. We do it in production > > systems. This is a natural extension to that and fits with the security model > > well. > > What I would like to see is a way to have BPF inject data into the > ftrace ring buffer directly. There's a bpf_trace_printk() that I find a > bit of a hack (especially since it hooks into trace_printk() which is > only for debugging purposes). Have a dedicated bpf ftrace ring > buffer event that can be triggered is what I am looking for. Then comes > the issue of what ring buffer to place it in, as ftrace can have > multiple ring buffer instances. But these instances are defined by the > tracefs instances directory. Having a way to associate a bpf program to > a specific event in a specific tracefs directory could allow for ways to > trigger writing into the correct ftrace buffer. bpf can write anything (including full skb) into perf ring buffer. I don't think there is a use case yet to write into ftrace ring buffer. But I can be convinced otherwise :) > But looking over the patches, I see what Alexei means that there's no > overlap with ftrace and these patches except for the tracefs directory > itself (which is part of the ftrace infrastructure). And the trace > events are technically part of the ftrace infrastructure too. I see the > tracefs interface being used, but I don't see how the bpf programs > being added affect the ftrace ring buffer or other parts of ftrace. And > I'm guessing that's what is confusing Alexei. yep. What I really like to see some day is proper integration of real ftrace and bpf. So that bpf progs can be invoked from some of the ftrace machinery. And the other way around. Like I'd love to be able to start ftracing all functions from bpf and stop it from bpf. The use case is kernel debugging. I'd like to examine a bunch of condition and start ftracing the execution. Then later decide wether this collection of ip addresses is interesting to analyze and post process it quickly while still inside bpf program.
On Tue, Jul 16, 2019 at 06:24:07PM -0700, Alexei Starovoitov wrote: [snip] > > > > > I don't see why a new bpf node for a trace event is a bad idea, really. > > > > > > > > See the patches for kprobe/uprobe FD-based api and the reasons behind it. > > > > tldr: text is racy, doesn't scale, poor security, etc. > > > > > > Is it possible to use perf without CAP_SYS_ADMIN and control security at the > > > per-event level? We are selective about who can access which event, using > > > selinux. That's how our ftrace-based tracers work. Its fine grained per-event > > > control. That's where I was going with the tracefs approach since we get that > > > granularity using the file system. > > android's choice of selinux is not a factor in deciding kernel apis. > It's completely separate discusion wether disallowing particular tracepoints > for given user make sense at all. > Just because you can hack it in via selinux blocking particular > /sys/debug/tracing/ directory and convince yourself that it's somehow > makes android more secure. It doesn't mean that all new api should fit > into this model. Its not like a hack, it is just control of which tracefs node can be accessed and which cannot be since the tracing can run on production systems out in the field and there are several concerns to address like security, privacy etc. It is not just for debugging usecases. We do collect traces out in the field where these issues are real and cannot be ignored. SELinux model is deny everything, and then selectively grant access to what is needed. The VFS and security LSM hooks provide this control quite well. I am not sure if such control is possible through perf hence I asked the question. > I think allowing one tracepoint and disallowing another is pointless > from security point of view. Tracing bpf program can do bpf_probe_read > of anything. I think the assumption here is the user controls the program instructions at runtime, but that's not the case. The BPF program we are loading is not dynamically generated, it is built at build time and it is loaded from a secure verified partition, so even though it can do bpf_probe_read, it is still not something that the user can change. And, we are planning to make it even more secure by making it kernel verify the program at load time as well (you were on some discussions about that a few months ago). thanks, - Joel
On Wed, Jul 17, 2019 at 6:01 AM Joel Fernandes <joel@joelfernandes.org> wrote: I trimmed cc. some emails were bouncing. > > I think allowing one tracepoint and disallowing another is pointless > > from security point of view. Tracing bpf program can do bpf_probe_read > > of anything. > > I think the assumption here is the user controls the program instructions at > runtime, but that's not the case. The BPF program we are loading is not > dynamically generated, it is built at build time and it is loaded from a > secure verified partition, so even though it can do bpf_probe_read, it is > still not something that the user can change. so you're saying that by having a set of signed bpf programs which instructions are known to be non-malicious and allowed set of tracepoints to attach via selinux whitelist, such setup will be safe? Have you considered how mix and match will behave? > And, we are planning to make it > even more secure by making it kernel verify the program at load time as well > (you were on some discussions about that a few months ago). It sounds like api decisions for this sticky raw_tp feature are driven by security choices which are not actually secure. I'm suggesting to avoid bringing up point of security as a reason for this api design, since it's making the opposite effect.
Hi Alexei, On Wed, Jul 17, 2019 at 02:40:42PM -0700, Alexei Starovoitov wrote: > On Wed, Jul 17, 2019 at 6:01 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > I trimmed cc. some emails were bouncing. Ok, thanks. > > > I think allowing one tracepoint and disallowing another is pointless > > > from security point of view. Tracing bpf program can do bpf_probe_read > > > of anything. > > > > I think the assumption here is the user controls the program instructions at > > runtime, but that's not the case. The BPF program we are loading is not > > dynamically generated, it is built at build time and it is loaded from a > > secure verified partition, so even though it can do bpf_probe_read, it is > > still not something that the user can change. > > so you're saying that by having a set of signed bpf programs which > instructions are known to be non-malicious and allowed set of tracepoints > to attach via selinux whitelist, such setup will be safe? > Have you considered how mix and match will behave? Do you mean the effect of mixing tracepoints and programs? I have not considered this. I am Ok with further enforcing of this (only certain tracepoints can be attached to certain programs) if needed. What do you think? We could have a new bpf(2) syscall attribute specify which tracepoint is expected, or similar. I wanted to walk you through our 2 usecases we are working on: 1. timeinstate: By hooking 2 programs onto sched_switch and cpu_frequency tracepoints, we are able to collect CPU power per-UID (specific app). Connor O'Brien is working on that. 2. inode to file path mapping: By hooking onto VFS tracepoints we are adding to the android kernels, we can collect data when the kernel resolves a file path to a inode/device number. A BPF map stores the inode/dev number (key) and the path (value). We have usecases where we need a high speed lookup of this without having to scan all the files in the filesystem. For the first usecase, the BPF program will be loaded and attached to the scheduler and cpufreq tracepoints at boot time and will stay attached forever. This is why I was saying having a daemon to stay alive all the time is pointless. However, if since you are completely against using tracefs which it sounds like, then we can do a daemon that is always alive. For the second usecase, the program attach is needed on-demand unlike the first usecase, and then after the usecase completes, it is detached to avoid overhead. For the second usecase, privacy is important and we want the data to not be available to any process. So we want to make sure only selected processes can attach to that tracepoint. This is the reason why I was doing working on these patches which use the tracefs as well, since we get that level of control. As you can see, I was trying to solve the sticky tracepoint problem in usecase 1 and the privacy problem in usecase 2. I had some discussions today at office and we think we can use the daemon approach to solve both these problems as well which I think would make you happy as well. What do you think about all of this? Any other feedback? > > And, we are planning to make it > > even more secure by making it kernel verify the program at load time as well > > (you were on some discussions about that a few months ago). > > It sounds like api decisions for this sticky raw_tp feature are > driven by security choices which are not actually secure. > I'm suggesting to avoid bringing up point of security as a reason for > this api design, since it's making the opposite effect. Ok, that's a fair point. thanks, - Joel
On Wed, Jul 17, 2019 at 10:51:43PM -0400, Joel Fernandes wrote: > Hi Alexei, > > On Wed, Jul 17, 2019 at 02:40:42PM -0700, Alexei Starovoitov wrote: > > On Wed, Jul 17, 2019 at 6:01 AM Joel Fernandes <joel@joelfernandes.org> wrote: > > > > I trimmed cc. some emails were bouncing. > > Ok, thanks. > > > > > I think allowing one tracepoint and disallowing another is pointless > > > > from security point of view. Tracing bpf program can do bpf_probe_read > > > > of anything. > > > > > > I think the assumption here is the user controls the program instructions at > > > runtime, but that's not the case. The BPF program we are loading is not > > > dynamically generated, it is built at build time and it is loaded from a > > > secure verified partition, so even though it can do bpf_probe_read, it is > > > still not something that the user can change. > > > > so you're saying that by having a set of signed bpf programs which > > instructions are known to be non-malicious and allowed set of tracepoints > > to attach via selinux whitelist, such setup will be safe? > > Have you considered how mix and match will behave? > > Do you mean the effect of mixing tracepoints and programs? I have not > considered this. I am Ok with further enforcing of this (only certain > tracepoints can be attached to certain programs) if needed. What do > you think? We could have a new bpf(2) syscall attribute specify which > tracepoint is expected, or similar. > > I wanted to walk you through our 2 usecases we are working on: thanks for sharing the use case details. Appreciate it. > 1. timeinstate: By hooking 2 programs onto sched_switch and cpu_frequency > tracepoints, we are able to collect CPU power per-UID (specific app). Connor > O'Brien is working on that. > > 2. inode to file path mapping: By hooking onto VFS tracepoints we are adding to > the android kernels, we can collect data when the kernel resolves a file path > to a inode/device number. A BPF map stores the inode/dev number (key) and the > path (value). We have usecases where we need a high speed lookup of this > without having to scan all the files in the filesystem. Can you share the link to vfs tracepoints you're adding? Sounds like you're not going to attempt to upstream them knowing Al's stance towards them? May be there is a way we can do the feature you need, but w/o tracepoints? > For the first usecase, the BPF program will be loaded and attached to the > scheduler and cpufreq tracepoints at boot time and will stay attached > forever. This is why I was saying having a daemon to stay alive all the time > is pointless. However, if since you are completely against using tracefs > which it sounds like, then we can do a daemon that is always alive. As I said earlier this use case can be solved by pinning raw_tp object into bpffs. Such patches are welcomed. > For the second usecase, the program attach is needed on-demand unlike the > first usecase, and then after the usecase completes, it is detached to avoid > overhead. > > For the second usecase, privacy is important and we want the data to not be > available to any process. So we want to make sure only selected processes can > attach to that tracepoint. This is the reason why I was doing working on > these patches which use the tracefs as well, since we get that level of > control. It's hard to recommend anything w/o seeing the actual tracepoints you're adding to vfs and type of data bpf program extracts from there. Sounds like it's some sort of cache of inode->file name ? If so, why is it privacy related?
On Tue, Jul 23, 2019 at 03:11:10PM -0700, Alexei Starovoitov wrote: > > > > > I think allowing one tracepoint and disallowing another is pointless > > > > > from security point of view. Tracing bpf program can do bpf_probe_read > > > > > of anything. > > > > > > > > I think the assumption here is the user controls the program instructions at > > > > runtime, but that's not the case. The BPF program we are loading is not > > > > dynamically generated, it is built at build time and it is loaded from a > > > > secure verified partition, so even though it can do bpf_probe_read, it is > > > > still not something that the user can change. > > > > > > so you're saying that by having a set of signed bpf programs which > > > instructions are known to be non-malicious and allowed set of tracepoints > > > to attach via selinux whitelist, such setup will be safe? > > > Have you considered how mix and match will behave? > > > > Do you mean the effect of mixing tracepoints and programs? I have not > > considered this. I am Ok with further enforcing of this (only certain > > tracepoints can be attached to certain programs) if needed. What do > > you think? We could have a new bpf(2) syscall attribute specify which > > tracepoint is expected, or similar. > > > > I wanted to walk you through our 2 usecases we are working on: > > thanks for sharing the use case details. Appreciate it. No problem and thanks for your thoughts. > > 1. timeinstate: By hooking 2 programs onto sched_switch and cpu_frequency > > tracepoints, we are able to collect CPU power per-UID (specific app). Connor > > O'Brien is working on that. > > > > 2. inode to file path mapping: By hooking onto VFS tracepoints we are adding to > > the android kernels, we can collect data when the kernel resolves a file path > > to a inode/device number. A BPF map stores the inode/dev number (key) and the > > path (value). We have usecases where we need a high speed lookup of this > > without having to scan all the files in the filesystem. > > Can you share the link to vfs tracepoints you're adding? > Sounds like you're not going to attempt to upstream them knowing > Al's stance towards them? > May be there is a way we can do the feature you need, but w/o tracepoints? Yes, given Al's stance I understand the patch is not upstreamable. The patch is here: For tracepoint: https://android.googlesource.com/kernel/common/+/27d3bfe20558d279041af403a887e7bdbdcc6f24%5E%21/ For bpf program: https://android.googlesource.com/platform/system/bpfprogs/+/908f6cd718fab0de7a944f84628c56f292efeb17%5E%21/ I intended to submit the tracepoint only for the Android kernels, however if there is an upstream solution to this then that's even better since upstream can benefit. Were you thinking of a BPF helper function to get this data? > > > For the first usecase, the BPF program will be loaded and attached to the > > scheduler and cpufreq tracepoints at boot time and will stay attached > > forever. This is why I was saying having a daemon to stay alive all the time > > is pointless. However, if since you are completely against using tracefs > > which it sounds like, then we can do a daemon that is always alive. > > As I said earlier this use case can be solved by pinning raw_tp object > into bpffs. Such patches are welcomed. Ok will think more about it. > > For the second usecase, the program attach is needed on-demand unlike the > > first usecase, and then after the usecase completes, it is detached to avoid > > overhead. > > > > For the second usecase, privacy is important and we want the data to not be > > available to any process. So we want to make sure only selected processes can > > attach to that tracepoint. This is the reason why I was doing working on > > these patches which use the tracefs as well, since we get that level of > > control. > > It's hard to recommend anything w/o seeing the actual tracepoints you're adding > to vfs and type of data bpf program extracts from there. > Sounds like it's some sort of cache of inode->file name ? Yes, that's what it is. > If so, why is it privacy related? The reasoning is the file paths could reveal user activity (such as an app that opens a document) and Android has requirements to control/restrict that. thanks, - Joel
On Wed, Jul 24, 2019 at 09:57:14AM -0400, Joel Fernandes wrote: > On Tue, Jul 23, 2019 at 03:11:10PM -0700, Alexei Starovoitov wrote: > > > > > > I think allowing one tracepoint and disallowing another is pointless > > > > > > from security point of view. Tracing bpf program can do bpf_probe_read > > > > > > of anything. > > > > > > > > > > I think the assumption here is the user controls the program instructions at > > > > > runtime, but that's not the case. The BPF program we are loading is not > > > > > dynamically generated, it is built at build time and it is loaded from a > > > > > secure verified partition, so even though it can do bpf_probe_read, it is > > > > > still not something that the user can change. > > > > > > > > so you're saying that by having a set of signed bpf programs which > > > > instructions are known to be non-malicious and allowed set of tracepoints > > > > to attach via selinux whitelist, such setup will be safe? > > > > Have you considered how mix and match will behave? > > > > > > Do you mean the effect of mixing tracepoints and programs? I have not > > > considered this. I am Ok with further enforcing of this (only certain > > > tracepoints can be attached to certain programs) if needed. What do > > > you think? We could have a new bpf(2) syscall attribute specify which > > > tracepoint is expected, or similar. > > > > > > I wanted to walk you through our 2 usecases we are working on: > > > > thanks for sharing the use case details. Appreciate it. > > No problem and thanks for your thoughts. > > > > 1. timeinstate: By hooking 2 programs onto sched_switch and cpu_frequency > > > tracepoints, we are able to collect CPU power per-UID (specific app). Connor > > > O'Brien is working on that. > > > > > > 2. inode to file path mapping: By hooking onto VFS tracepoints we are adding to > > > the android kernels, we can collect data when the kernel resolves a file path > > > to a inode/device number. A BPF map stores the inode/dev number (key) and the > > > path (value). We have usecases where we need a high speed lookup of this > > > without having to scan all the files in the filesystem. > > > > Can you share the link to vfs tracepoints you're adding? > > Sounds like you're not going to attempt to upstream them knowing > > Al's stance towards them? > > May be there is a way we can do the feature you need, but w/o tracepoints? > > Yes, given Al's stance I understand the patch is not upstreamable. The patch > is here: > For tracepoint: > https://android.googlesource.com/kernel/common/+/27d3bfe20558d279041af403a887e7bdbdcc6f24%5E%21/ this is way more than tracepoint. > For bpf program: > https://android.googlesource.com/platform/system/bpfprogs/+/908f6cd718fab0de7a944f84628c56f292efeb17%5E%21/ what is unsafe_bpf_map_update_elem() in there? The verifier comment sounds odd. Could you describe the issue you see with the verifier? > I intended to submit the tracepoint only for the Android kernels, however if > there is an upstream solution to this then that's even better since upstream can > benefit. Were you thinking of a BPF helper function to get this data? I think the best way to evaluate the patches is whether they are upstreamable or not. If they're not (like this case), it means that there is something wrong with their design and if android decides to go with such approach it will only create serious issues long term. Starting with the whole idea of dev+inode -> filepath cache. dev+inode is not a unique identifier of the file. In some filesystems two different files may have the same ino integer value. Have you looked at 'struct file_handle' ? and name_to_handle_at ? I think fhandle is the only way to get unique identifier of the file. Could you please share more details why android needs this cache of dev+ino->path? I guess something uses ino to find paths? Sort of faster version of 'find -inum' ?
On Fri, Jul 26, 2019 at 11:39:56AM -0700, Alexei Starovoitov wrote: [snip] > > > > 1. timeinstate: By hooking 2 programs onto sched_switch and cpu_frequency > > > > tracepoints, we are able to collect CPU power per-UID (specific app). Connor > > > > O'Brien is working on that. > > > > > > > > 2. inode to file path mapping: By hooking onto VFS tracepoints we are adding to > > > > the android kernels, we can collect data when the kernel resolves a file path > > > > to a inode/device number. A BPF map stores the inode/dev number (key) and the > > > > path (value). We have usecases where we need a high speed lookup of this > > > > without having to scan all the files in the filesystem. > > > > > > Can you share the link to vfs tracepoints you're adding? > > > Sounds like you're not going to attempt to upstream them knowing > > > Al's stance towards them? > > > May be there is a way we can do the feature you need, but w/o tracepoints? > > > > Yes, given Al's stance I understand the patch is not upstreamable. The patch > > is here: > > For tracepoint: > > https://android.googlesource.com/kernel/common/+/27d3bfe20558d279041af403a887e7bdbdcc6f24%5E%21/ > > this is way more than tracepoint. True there is some code that calls the tracepoint. I want to optimize it more but lets see I am ready to think more about it before doing it this way, based on your suggestions. > > For bpf program: > > https://android.googlesource.com/platform/system/bpfprogs/+/908f6cd718fab0de7a944f84628c56f292efeb17%5E%21/ > > what is unsafe_bpf_map_update_elem() in there? > The verifier comment sounds odd. > Could you describe the issue you see with the verifier? Will dig out the verifier issue I was seeing. I was just trying to get a prototype working so I did not go into verifier details much. > > I intended to submit the tracepoint only for the Android kernels, however if > > there is an upstream solution to this then that's even better since upstream can > > benefit. Were you thinking of a BPF helper function to get this data? > > I think the best way to evaluate the patches is whether they are upstreamable or not. > If they're not (like this case), it means that there is something wrong with their design > and if android decides to go with such approach it will only create serious issues long term. > Starting with the whole idea of dev+inode -> filepath cache. > dev+inode is not a unique identifier of the file. > In some filesystems two different files may have the same ino integer value. > Have you looked at 'struct file_handle' ? and name_to_handle_at ? > I think fhandle is the only way to get unique identifier of the file. > Could you please share more details why android needs this cache of dev+ino->path? I will follow-up with you on this by email off the list, thanks. thanks, - Joel
On Fri, Jul 26, 2019 at 3:18 PM Joel Fernandes <joel@joelfernandes.org> wrote: > > On Fri, Jul 26, 2019 at 11:39:56AM -0700, Alexei Starovoitov wrote: [snip] > > > For bpf program: > > > https://android.googlesource.com/platform/system/bpfprogs/+/908f6cd718fab0de7a944f84628c56f292efeb17%5E%21/ > > > > what is unsafe_bpf_map_update_elem() in there? > > The verifier comment sounds odd. > > Could you describe the issue you see with the verifier? > > Will dig out the verifier issue I was seeing. I was just trying to get a > prototype working so I did not go into verifier details much. This is actually slightly old code, the actual function name is bpf_map_update_elem_unsafe() . https://android.googlesource.com/platform/system/bpf/+/refs/heads/master/progs/include/bpf_helpers.h#39 This function came about because someone added a DEFINE_BPF_MAP macro which defines BPF map accessors based on the type of the key and value. So that's the "safe" variant: https://android.googlesource.com/platform/system/bpf/+/refs/heads/master/progs/include/bpf_helpers.h#54 (added in commit https://android.googlesource.com/platform/system/bpf/+/6564b8eac46fc27dde807a39856386d98d2471c3) So the "safe" variant of the bpf_map_update_elem for us became a map specific version with a prototype: static inline __always_inline __unused int bpf_##the_map##_update_elem(TypeOfKey* k, TypeOfValue* v, unsigned long long flags) Since I had not upgraded my BPF program to the "safe" variant, I had to use the internal "unsafe" variant of the API (if that makes sense..). thanks Alexei! - Joel