[RFC,0/4] Add support to directly attach BPF program to ftrace

Message ID	20190710141548.132193-1-joel@joelfernandes.org
Headers	show Return-Path: <netdev-owner@vger.kernel.org> From: "Joel Fernandes (Google)" <joel@joelfernandes.org> To: linux-kernel@vger.kernel.org Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>, Adrian Ratiu <adrian.ratiu@collabora.com>, Alexei Starovoitov <ast@kernel.org>, bpf@vger.kernel.org, Brendan Gregg <brendan.d.gregg@gmail.com>, connoro@google.com, Daniel Borkmann <daniel@iogearbox.net>, duyuchao <yuchao.du@unisoc.com>, Ingo Molnar <mingo@redhat.com>, jeffv@google.com, Karim Yaghmour <karim.yaghmour@opersys.com>, kernel-team@android.com, linux-kselftest@vger.kernel.org, Manali Shukla <manalishukla14@gmail.com>, Manjo Raja Rao <linux@manojrajarao.com>, Martin KaFai Lau <kafai@fb.com>, Masami Hiramatsu <mhiramat@kernel.org>, Matt Mullins <mmullins@fb.com>, Michal Gregorczyk <michalgr@fb.com>, Michal Gregorczyk <michalgr@live.com>, Mohammad Husain <russoue@gmail.com>, namhyung@google.com, namhyung@kernel.org, netdev@vger.kernel.org, paul.chaignon@gmail.com, primiano@google.com, Qais Yousef <qais.yousef@arm.com>, Shuah Khan <shuah@kernel.org>, Song Liu <songliubraving@fb.com>, Srinivas Ramana <sramana@codeaurora.org>, Steven Rostedt <rostedt@goodmis.org>, Tamir Carmeli <carmeli.tamir@gmail.com>, Yonghong Song <yhs@fb.com> Subject: [PATCH RFC 0/4] Add support to directly attach BPF program to ftrace Date: Wed, 10 Jul 2019 10:15:44 -0400 Message-Id: <20190710141548.132193-1-joel@joelfernandes.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: netdev-owner@vger.kernel.org Precedence: bulk
Series	Add support to directly attach BPF program to ftrace \| expand [RFC,0/4] Add support to directly attach BPF program to ftrace [RFC,1/4] Move bpf_raw_tracepoint functionality into bpf_trace.c [RFC,2/4] trace/bpf: Add support for attach/detach of ftrace events to BPF [RFC,3/4] lib/bpf: Add support for ftrace event attach and detach [RFC,4/4] selftests/bpf: Add test for ftrace-based BPF attach/detach

Joel Fernandes July 10, 2019, 2:15 p.m. UTC

Hi,
These patches make it possible to attach BPF programs directly to tracepoints
using ftrace (/sys/kernel/debug/tracing) without needing the process doing the
attach to be alive. This has the following benefits:

1. Simplified Security: In Android, we have finer-grained security controls to
specific ftrace trace events using SELinux labels. We control precisely who is
allowed to enable an ftrace event already. By adding a node to ftrace for
attaching BPF programs, we can use the same mechanism to further control who is
allowed to attach to a trace event.

2. Process lifetime: In Android we are adding usecases where a tracing program
needs to be attached all the time to a tracepoint, for the full life time of
the system. Such as to gather statistics where there no need for a detach for
the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this
means keeping a process alive all the time. However, in Android our BPF loader
currently (for hardeneded security) involves just starting a process at boot
time, doing the BPF program loading, and then pinning them to /sys/fs/bpf. We
don't keep this process alive all the time. It is more suitable to do a
one-shot attach of the program using ftrace and not need to have a process
alive all the time anymore for this. Such process also needs elevated
privileges since tracepoint program loading currently requires CAP_SYS_ADMIN
anyway so by design Android's bpfloader runs once at init and exits.

This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf
The following commands can be written into it:
attach:<fd> Attaches BPF prog fd to tracepoint
detach:<fd> Detaches BPF prog fd to tracepoint

Reading the bpf file will show all the attached programs to the tracepoint.

Joel Fernandes (Google) (4):
Move bpf_raw_tracepoint functionality into bpf_trace.c
trace/bpf: Add support for attach/detach of ftrace events to BPF
lib/bpf: Add support for ftrace event attach and detach
selftests/bpf: Add test for ftrace-based BPF attach/detach

include/linux/bpf_trace.h | 16 ++
include/linux/trace_events.h | 1 +
kernel/bpf/syscall.c | 69 +-----
kernel/trace/bpf_trace.c | 225 ++++++++++++++++++
kernel/trace/trace.h | 1 +
kernel/trace/trace_events.c | 8 +
tools/lib/bpf/bpf.c | 53 +++++
tools/lib/bpf/bpf.h | 4 +
tools/lib/bpf/libbpf.map | 2 +
.../raw_tp_writable_test_ftrace_run.c | 89 +++++++
10 files changed, 410 insertions(+), 58 deletions(-)
create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_ftrace_run.c

--
2.22.0.410.gd8fdbe21b5-goog

Alexei Starovoitov July 16, 2019, 8:54 p.m. UTC | #1

On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote:
> Hi,

why are you cc-ing the whole world for this patch set?
I'll reply to all as well, but I suspect a bunch of folks consider it spam.
Please read Documentation/bpf/bpf_devel_QA.rst

Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam,
so I'm not sure this set reached public mailing lists.

> These patches make it possible to attach BPF programs directly to tracepoints
> using ftrace (/sys/kernel/debug/tracing) without needing the process doing the
> attach to be alive. This has the following benefits:
> 
> 1. Simplified Security: In Android, we have finer-grained security controls to
> specific ftrace trace events using SELinux labels. We control precisely who is
> allowed to enable an ftrace event already. By adding a node to ftrace for
> attaching BPF programs, we can use the same mechanism to further control who is
> allowed to attach to a trace event.
> 
> 2. Process lifetime: In Android we are adding usecases where a tracing program
> needs to be attached all the time to a tracepoint, for the full life time of
> the system. Such as to gather statistics where there no need for a detach for
> the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this
> means keeping a process alive all the time.  However, in Android our BPF loader
> currently (for hardeneded security) involves just starting a process at boot
> time, doing the BPF program loading, and then pinning them to /sys/fs/bpf.  We
> don't keep this process alive all the time. It is more suitable to do a
> one-shot attach of the program using ftrace and not need to have a process
> alive all the time anymore for this. Such process also needs elevated
> privileges since tracepoint program loading currently requires CAP_SYS_ADMIN
> anyway so by design Android's bpfloader runs once at init and exits.
> 
> This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf
> The following commands can be written into it:
> attach:<fd>     Attaches BPF prog fd to tracepoint
> detach:<fd>     Detaches BPF prog fd to tracepoint

Looks like, to detach a program the user needs to read a text file,
parse bpf prog id from text into binary. Then call fd_from_id bpf syscall,
get a binary FD, convert it back to text and write as a text back into this file.
I think this is just a single example why text based apis are not accepted
in bpf anymore.

Through the patch set you call it ftrace. As far as I can see, this set
has zero overlap with ftrace. There is no ftrace-bpf connection here at all
that we discussed in the past Steven. It's all quite confusing.

I suggest android to solve sticky raw_tracepoint problem with user space deamon.
The reasons, you point out why user daemon cannot be used, sound weak to me.
Another acceptable solution would be to introduce pinning of raw_tp objects.
bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would
be natural extension.

Joel Fernandes July 16, 2019, 9:30 p.m. UTC | #2

On Tue, Jul 16, 2019 at 01:54:57PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 10, 2019 at 10:15:44AM -0400, Joel Fernandes (Google) wrote:
> > Hi,
> 
> why are you cc-ing the whole world for this patch set?

Well, the whole world happens to be interested in BPF on Android.

> I'll reply to all as well, but I suspect a bunch of folks consider it spam.
> Please read Documentation/bpf/bpf_devel_QA.rst

Ok, I'll read it.

> Also, I think, netdev@vger rejects emails with 80+ characters in cc as spam,
> so I'm not sure this set reached public mailing lists.

Certainly the CC list here is not added to folks who consider it spam. All
the folks added have been interested in BPF on Android at various points of
time. Is this CC list really that large? It has around 24 email addresses or
so. I can trim it a bit if needed. Also, you sound like as if people are
screaming at me to stop emailing them, certainly that's not the case and no
one has told me it is spam.

And, it did reach the public archive btw:
https://lore.kernel.org/netdev/20190716205455.iimn3pqpvsc3k4ry@ast-mbp.dhcp.thefacebook.com/T/#m1460ba463b78312e38b68b8c118f673d2ead9446

> > These patches make it possible to attach BPF programs directly to tracepoints
> > using ftrace (/sys/kernel/debug/tracing) without needing the process doing the
> > attach to be alive. This has the following benefits:
> > 
> > 1. Simplified Security: In Android, we have finer-grained security controls to
> > specific ftrace trace events using SELinux labels. We control precisely who is
> > allowed to enable an ftrace event already. By adding a node to ftrace for
> > attaching BPF programs, we can use the same mechanism to further control who is
> > allowed to attach to a trace event.
> > 
> > 2. Process lifetime: In Android we are adding usecases where a tracing program
> > needs to be attached all the time to a tracepoint, for the full life time of
> > the system. Such as to gather statistics where there no need for a detach for
> > the full system lifetime. With perf or bpf(2)'s BPF_RAW_TRACEPOINT_OPEN, this
> > means keeping a process alive all the time.  However, in Android our BPF loader
> > currently (for hardeneded security) involves just starting a process at boot
> > time, doing the BPF program loading, and then pinning them to /sys/fs/bpf.  We
> > don't keep this process alive all the time. It is more suitable to do a
> > one-shot attach of the program using ftrace and not need to have a process
> > alive all the time anymore for this. Such process also needs elevated
> > privileges since tracepoint program loading currently requires CAP_SYS_ADMIN
> > anyway so by design Android's bpfloader runs once at init and exits.
> > 
> > This series add a new bpf file to /sys/kernel/debug/tracing/events/X/Y/bpf
> > The following commands can be written into it:
> > attach:<fd>     Attaches BPF prog fd to tracepoint
> > detach:<fd>     Detaches BPF prog fd to tracepoint
> 
> Looks like, to detach a program the user needs to read a text file,
> parse bpf prog id from text into binary. Then call fd_from_id bpf syscall,
> get a binary FD, convert it back to text and write as a text back into this file.
> I think this is just a single example why text based apis are not accepted
> in bpf anymore.

This can also be considered a tracefs API.

And we can certainly change the detach to accept program ids as well if
that's easier. 'detach:prog:<prog_id>' and 'detach:fd:<fd>'.

By the way, I can also list the set of cumbersome steps needed to attach a
BPF program using perf and I bet it will be longer ;-)

> Through the patch set you call it ftrace. As far as I can see, this set
> has zero overlap with ftrace. There is no ftrace-bpf connection here at all
> that we discussed in the past Steven. It's all quite confusing.

It depends on what you mean by ftrace, may be I can call it 'trace events' or
something if it is less ambiguious. All of this has been collectively called
ftrace before.

I am not sure if you you are making sense actually, trace_events mechanism is
a part of ftrace. See the documentation: Documentation/trace/ftrace.rst. Even
the documentation file name has the word ftrace in it.

I have also spoken to Steven before about this, I don't think he ever told me
there is no connection so again I am a bit lost at your comments.

> I suggest android to solve sticky raw_tracepoint problem with user space deamon.
> The reasons, you point out why user daemon cannot be used, sound weak to me.

I don't think it is weak. It seems overkill to have a daemon for a trace
event that is say supposed to be attached to all the time for the lifetime of
the system. Why should there be a daemon consuming resources if it is active
all the time?

In Android, we are very careful about spawning useless processes and leaving
them alive for the lifetime of the system - for no good reason. Our security
teams also don't like this, and they can comment more.

> Another acceptable solution would be to introduce pinning of raw_tp objects.
> bpf progs and maps can be pinned in bpffs already. Pinning raw_tp would
> be natural extension.

I don't think the pinning solves the security problem, it just solves the
process lifetime problem. Currently, attaching trace events through perf
requires CAP_SYS_ADMIN. However, with ftrace events, we already control
security of events by labeling the nodes in tracefs and granting access to
the labeled context through the selinux policies. Having a 'bpf' node in
tracefs for events, and granting access to the labels is a natural extension.

I also thought about the pinning idea before, but we also want to add support
for not just raw tracepoints, but also regular tracepoints (events if you
will). I am hesitant to add a new BPF API just for creating regular
tracepoints and then pinning those as well.

I don't see why a new bpf node for a trace event is a bad idea, really.
tracefs is how we deal with trace events on Android. We do it in production
systems. This is a natural extension to that and fits with the security model
well.

thanks,

 - Joel

Alexei Starovoitov July 16, 2019, 10:26 p.m. UTC | #3

On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
> 
> I also thought about the pinning idea before, but we also want to add support
> for not just raw tracepoints, but also regular tracepoints (events if you
> will). I am hesitant to add a new BPF API just for creating regular
> tracepoints and then pinning those as well.

and they should be done through the pinning as well.

> I don't see why a new bpf node for a trace event is a bad idea, really.

See the patches for kprobe/uprobe FD-based api and the reasons behind it.
tldr: text is racy, doesn't scale, poor security, etc.

> tracefs is how we deal with trace events on Android. 

android has made mistakes in the past as well.

> This is a natural extension to that and fits with the security model
> well.

I think it's the opposite.
I'm absolutely against text based apis.

Steven Rostedt July 16, 2019, 10:31 p.m. UTC | #4

On Tue, 16 Jul 2019 17:30:50 -0400
Joel Fernandes <joel@joelfernandes.org> wrote:

> I don't see why a new bpf node for a trace event is a bad idea, really.
> tracefs is how we deal with trace events on Android. We do it in production
> systems. This is a natural extension to that and fits with the security model
> well.

What I would like to see is a way to have BPF inject data into the
ftrace ring buffer directly. There's a bpf_trace_printk() that I find a
bit of a hack (especially since it hooks into trace_printk() which is
only for debugging purposes). Have a dedicated bpf ftrace ring
buffer event that can be triggered is what I am looking for. Then comes
the issue of what ring buffer to place it in, as ftrace can have
multiple ring buffer instances. But these instances are defined by the
tracefs instances directory. Having a way to associate a bpf program to
a specific event in a specific tracefs directory could allow for ways to
trigger writing into the correct ftrace buffer.

But looking over the patches, I see what Alexei means that there's no
overlap with ftrace and these patches except for the tracefs directory
itself (which is part of the ftrace infrastructure). And the trace
events are technically part of the ftrace infrastructure too. I see the
tracefs interface being used, but I don't see how the bpf programs
being added affect the ftrace ring buffer or other parts of ftrace. And
I'm guessing that's what is confusing Alexei.

-- Steve

Joel Fernandes July 16, 2019, 10:41 p.m. UTC | #5

On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote:
> On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
> > 
> > I also thought about the pinning idea before, but we also want to add support
> > for not just raw tracepoints, but also regular tracepoints (events if you
> > will). I am hesitant to add a new BPF API just for creating regular
> > tracepoints and then pinning those as well.
> 
> and they should be done through the pinning as well.

Hmm ok, I will give it some more thought.

> > I don't see why a new bpf node for a trace event is a bad idea, really.
> 
> See the patches for kprobe/uprobe FD-based api and the reasons behind it.
> tldr: text is racy, doesn't scale, poor security, etc.

Is it possible to use perf without CAP_SYS_ADMIN and control security at the
per-event level? We are selective about who can access which event, using
selinux. That's how our ftrace-based tracers work. Its fine grained per-event
control. That's where I was going with the tracefs approach since we get that
granularity using the file system.

Thanks.

Steven Rostedt July 16, 2019, 10:43 p.m. UTC | #6

On Tue, 16 Jul 2019 15:26:52 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> I'm absolutely against text based apis.

I guess you don't use /proc ;-)

-- Steve

Joel Fernandes July 16, 2019, 10:46 p.m. UTC | #7

On Tue, Jul 16, 2019 at 06:31:17PM -0400, Steven Rostedt wrote:
> On Tue, 16 Jul 2019 17:30:50 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > I don't see why a new bpf node for a trace event is a bad idea, really.
> > tracefs is how we deal with trace events on Android. We do it in production
> > systems. This is a natural extension to that and fits with the security model
> > well.
> 
> What I would like to see is a way to have BPF inject data into the
> ftrace ring buffer directly. There's a bpf_trace_printk() that I find a
> bit of a hack (especially since it hooks into trace_printk() which is
> only for debugging purposes). Have a dedicated bpf ftrace ring
> buffer event that can be triggered is what I am looking for. Then comes
> the issue of what ring buffer to place it in, as ftrace can have
> multiple ring buffer instances. But these instances are defined by the
> tracefs instances directory. Having a way to associate a bpf program to
> a specific event in a specific tracefs directory could allow for ways to
> trigger writing into the correct ftrace buffer.

But his problem is with doing the association of a BPF program with tracefs
itself. How would you attach a BPF program with tracefs without doing a text
based approach? His problem is with the text based approach per his last
email.

> But looking over the patches, I see what Alexei means that there's no
> overlap with ftrace and these patches except for the tracefs directory
> itself (which is part of the ftrace infrastructure). And the trace
> events are technically part of the ftrace infrastructure too. I see the
> tracefs interface being used, but I don't see how the bpf programs
> being added affect the ftrace ring buffer or other parts of ftrace. And
> I'm guessing that's what is confusing Alexei.

In a follow-up patch which I am still writing, I am using the trace ring
buffer as temporary storage since I am formatting the trace event into it.
This patch you are replying to is just for raw tracepoint and yes, I agree
this one does not use the ring buffer, but a future addition to it does. So
I don't think the association of this patch series with ftrace is going to be
an issue IMO.

thanks,

 - Joel

Joel Fernandes July 16, 2019, 11:55 p.m. UTC | #8

On Tue, Jul 16, 2019 at 06:41:50PM -0400, Joel Fernandes wrote:
> On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote:
> > On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
> > > 
> > > I also thought about the pinning idea before, but we also want to add support
> > > for not just raw tracepoints, but also regular tracepoints (events if you
> > > will). I am hesitant to add a new BPF API just for creating regular
> > > tracepoints and then pinning those as well.
> > 
> > and they should be done through the pinning as well.
> 
> Hmm ok, I will give it some more thought.

I think I can make the new BPF API + pinning approach work, I will try to
work on something like this and post it soon.

Also, I had a question below if you don't mind taking a look:

thanks Alexei!

> > > I don't see why a new bpf node for a trace event is a bad idea, really.
> > 
> > See the patches for kprobe/uprobe FD-based api and the reasons behind it.
> > tldr: text is racy, doesn't scale, poor security, etc.
> 
> Is it possible to use perf without CAP_SYS_ADMIN and control security at the
> per-event level? We are selective about who can access which event, using
> selinux. That's how our ftrace-based tracers work. Its fine grained per-event
> control. That's where I was going with the tracefs approach since we get that
> granularity using the file system.
> 
> Thanks.
>

Alexei Starovoitov July 17, 2019, 1:24 a.m. UTC | #9

On Tue, Jul 16, 2019 at 07:55:00PM -0400, Joel Fernandes wrote:
> On Tue, Jul 16, 2019 at 06:41:50PM -0400, Joel Fernandes wrote:
> > On Tue, Jul 16, 2019 at 03:26:52PM -0700, Alexei Starovoitov wrote:
> > > On Tue, Jul 16, 2019 at 05:30:50PM -0400, Joel Fernandes wrote:
> > > > 
> > > > I also thought about the pinning idea before, but we also want to add support
> > > > for not just raw tracepoints, but also regular tracepoints (events if you
> > > > will). I am hesitant to add a new BPF API just for creating regular
> > > > tracepoints and then pinning those as well.
> > > 
> > > and they should be done through the pinning as well.
> > 
> > Hmm ok, I will give it some more thought.
> 
> I think I can make the new BPF API + pinning approach work, I will try to
> work on something like this and post it soon.
> 
> Also, I had a question below if you don't mind taking a look:
> 
> thanks Alexei!
> 
> > > > I don't see why a new bpf node for a trace event is a bad idea, really.
> > > 
> > > See the patches for kprobe/uprobe FD-based api and the reasons behind it.
> > > tldr: text is racy, doesn't scale, poor security, etc.
> > 
> > Is it possible to use perf without CAP_SYS_ADMIN and control security at the
> > per-event level? We are selective about who can access which event, using
> > selinux. That's how our ftrace-based tracers work. Its fine grained per-event
> > control. That's where I was going with the tracefs approach since we get that
> > granularity using the file system.

android's choice of selinux is not a factor in deciding kernel apis.
It's completely separate discusion wether disallowing particular tracepoints
for given user make sense at all.
Just because you can hack it in via selinux blocking particular
/sys/debug/tracing/ directory and convince yourself that it's somehow
makes android more secure. It doesn't mean that all new api should fit
into this model.
I think allowing one tracepoint and disallowing another is pointless
from security point of view. Tracing bpf program can do bpf_probe_read
of anything.

Alexei Starovoitov July 17, 2019, 1:30 a.m. UTC | #10

On Tue, Jul 16, 2019 at 06:31:17PM -0400, Steven Rostedt wrote:
> On Tue, 16 Jul 2019 17:30:50 -0400
> Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> > I don't see why a new bpf node for a trace event is a bad idea, really.
> > tracefs is how we deal with trace events on Android. We do it in production
> > systems. This is a natural extension to that and fits with the security model
> > well.
> 
> What I would like to see is a way to have BPF inject data into the
> ftrace ring buffer directly. There's a bpf_trace_printk() that I find a
> bit of a hack (especially since it hooks into trace_printk() which is
> only for debugging purposes). Have a dedicated bpf ftrace ring
> buffer event that can be triggered is what I am looking for. Then comes
> the issue of what ring buffer to place it in, as ftrace can have
> multiple ring buffer instances. But these instances are defined by the
> tracefs instances directory. Having a way to associate a bpf program to
> a specific event in a specific tracefs directory could allow for ways to
> trigger writing into the correct ftrace buffer.

bpf can write anything (including full skb) into perf ring buffer.
I don't think there is a use case yet to write into ftrace ring buffer.
But I can be convinced otherwise :)

> But looking over the patches, I see what Alexei means that there's no
> overlap with ftrace and these patches except for the tracefs directory
> itself (which is part of the ftrace infrastructure). And the trace
> events are technically part of the ftrace infrastructure too. I see the
> tracefs interface being used, but I don't see how the bpf programs
> being added affect the ftrace ring buffer or other parts of ftrace. And
> I'm guessing that's what is confusing Alexei.

yep.
What I really like to see some day is proper integration of real ftrace
and bpf. So that bpf progs can be invoked from some of the ftrace machinery.
And the other way around.
Like I'd love to be able to start ftracing all functions from bpf
and stop it from bpf.
The use case is kernel debugging. I'd like to examine a bunch of condition
and start ftracing the execution. Then later decide wether this collection
of ip addresses is interesting to analyze and post process it quickly
while still inside bpf program.

Joel Fernandes July 17, 2019, 1:01 p.m. UTC | #11

On Tue, Jul 16, 2019 at 06:24:07PM -0700, Alexei Starovoitov wrote:
[snip]
> > > > > I don't see why a new bpf node for a trace event is a bad idea, really.
> > > > 
> > > > See the patches for kprobe/uprobe FD-based api and the reasons behind it.
> > > > tldr: text is racy, doesn't scale, poor security, etc.
> > > 
> > > Is it possible to use perf without CAP_SYS_ADMIN and control security at the
> > > per-event level? We are selective about who can access which event, using
> > > selinux. That's how our ftrace-based tracers work. Its fine grained per-event
> > > control. That's where I was going with the tracefs approach since we get that
> > > granularity using the file system.
> 
> android's choice of selinux is not a factor in deciding kernel apis.
> It's completely separate discusion wether disallowing particular tracepoints
> for given user make sense at all.
> Just because you can hack it in via selinux blocking particular
> /sys/debug/tracing/ directory and convince yourself that it's somehow
> makes android more secure. It doesn't mean that all new api should fit
> into this model.

Its not like a hack, it is just control of which tracefs node can be
accessed and which cannot be since the tracing can run on production systems
out in the field and there are several concerns to address like security,
privacy etc. It is not just for debugging usecases. We do collect traces out
in the field where these issues are real and cannot be ignored.

SELinux model is deny everything, and then selectively grant access to what
is needed. The VFS and security LSM hooks provide this control quite well. I am
not sure if such control is possible through perf hence I asked the question.

> I think allowing one tracepoint and disallowing another is pointless
> from security point of view. Tracing bpf program can do bpf_probe_read
> of anything.

I think the assumption here is the user controls the program instructions at
runtime, but that's not the case. The BPF program we are loading is not
dynamically generated, it is built at build time and it is loaded from a
secure verified partition, so even though it can do bpf_probe_read, it is
still not something that the user can change. And, we are planning to make it
even more secure by making it kernel verify the program at load time as well
(you were on some discussions about that a few months ago).

thanks,

 - Joel

Alexei Starovoitov July 17, 2019, 9:40 p.m. UTC | #12

On Wed, Jul 17, 2019 at 6:01 AM Joel Fernandes <joel@joelfernandes.org> wrote:

I trimmed cc. some emails were bouncing.

> > I think allowing one tracepoint and disallowing another is pointless
> > from security point of view. Tracing bpf program can do bpf_probe_read
> > of anything.
>
> I think the assumption here is the user controls the program instructions at
> runtime, but that's not the case. The BPF program we are loading is not
> dynamically generated, it is built at build time and it is loaded from a
> secure verified partition, so even though it can do bpf_probe_read, it is
> still not something that the user can change.

so you're saying that by having a set of signed bpf programs which
instructions are known to be non-malicious and allowed set of tracepoints
to attach via selinux whitelist, such setup will be safe?
Have you considered how mix and match will behave?

> And, we are planning to make it
> even more secure by making it kernel verify the program at load time as well
> (you were on some discussions about that a few months ago).

It sounds like api decisions for this sticky raw_tp feature are
driven by security choices which are not actually secure.
I'm suggesting to avoid bringing up point of security as a reason for
this api design, since it's making the opposite effect.

Joel Fernandes July 18, 2019, 2:51 a.m. UTC | #13

Hi Alexei,

On Wed, Jul 17, 2019 at 02:40:42PM -0700, Alexei Starovoitov wrote:
> On Wed, Jul 17, 2019 at 6:01 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> I trimmed cc. some emails were bouncing.

Ok, thanks.

> > > I think allowing one tracepoint and disallowing another is pointless
> > > from security point of view. Tracing bpf program can do bpf_probe_read
> > > of anything.
> >
> > I think the assumption here is the user controls the program instructions at
> > runtime, but that's not the case. The BPF program we are loading is not
> > dynamically generated, it is built at build time and it is loaded from a
> > secure verified partition, so even though it can do bpf_probe_read, it is
> > still not something that the user can change.
> 
> so you're saying that by having a set of signed bpf programs which
> instructions are known to be non-malicious and allowed set of tracepoints
> to attach via selinux whitelist, such setup will be safe?
> Have you considered how mix and match will behave?

Do you mean the effect of mixing tracepoints and programs? I have not
considered this. I am Ok with further enforcing of this (only certain
tracepoints can be attached to certain programs) if needed. What do
you think? We could have a new bpf(2) syscall attribute specify which
tracepoint is expected, or similar.

I wanted to walk you through our 2 usecases we are working on:

1. timeinstate: By hooking 2 programs onto sched_switch and cpu_frequency
tracepoints, we are able to collect CPU power per-UID (specific app). Connor
O'Brien is working on that.

2. inode to file path mapping: By hooking onto VFS tracepoints we are adding to
the android kernels, we can collect data when the kernel resolves a file path
to a inode/device number. A BPF map stores the inode/dev number (key) and the
path (value). We have usecases where we need a high speed lookup of this
without having to scan all the files in the filesystem.

For the first usecase, the BPF program will be loaded and attached to the
scheduler and cpufreq tracepoints at boot time and will stay attached
forever.  This is why I was saying having a daemon to stay alive all the time
is pointless. However, if since you are completely against using tracefs
which it sounds like, then we can do a daemon that is always alive.

For the second usecase, the program attach is needed on-demand unlike the
first usecase, and then after the usecase completes, it is detached to avoid
overhead.

For the second usecase, privacy is important and we want the data to not be
available to any process. So we want to make sure only selected processes can
attach to that tracepoint. This is the reason why I was doing working on
these patches which use the tracefs as well, since we get that level of
control.

As you can see, I was trying to solve the sticky tracepoint problem in
usecase 1 and the privacy problem in usecase 2.

I had some discussions today at office and we think we can use the daemon
approach to solve both these problems as well which I think would make you
happy as well.

What do you think about all of this? Any other feedback?

> > And, we are planning to make it
> > even more secure by making it kernel verify the program at load time as well
> > (you were on some discussions about that a few months ago).
> 
> It sounds like api decisions for this sticky raw_tp feature are
> driven by security choices which are not actually secure.
> I'm suggesting to avoid bringing up point of security as a reason for
> this api design, since it's making the opposite effect.

Ok, that's a fair point.

thanks,

 - Joel

Alexei Starovoitov July 23, 2019, 10:11 p.m. UTC | #14

On Wed, Jul 17, 2019 at 10:51:43PM -0400, Joel Fernandes wrote:
> Hi Alexei,
> 
> On Wed, Jul 17, 2019 at 02:40:42PM -0700, Alexei Starovoitov wrote:
> > On Wed, Jul 17, 2019 at 6:01 AM Joel Fernandes <joel@joelfernandes.org> wrote:
> > 
> > I trimmed cc. some emails were bouncing.
> 
> Ok, thanks.
> 
> > > > I think allowing one tracepoint and disallowing another is pointless
> > > > from security point of view. Tracing bpf program can do bpf_probe_read
> > > > of anything.
> > >
> > > I think the assumption here is the user controls the program instructions at
> > > runtime, but that's not the case. The BPF program we are loading is not
> > > dynamically generated, it is built at build time and it is loaded from a
> > > secure verified partition, so even though it can do bpf_probe_read, it is
> > > still not something that the user can change.
> > 
> > so you're saying that by having a set of signed bpf programs which
> > instructions are known to be non-malicious and allowed set of tracepoints
> > to attach via selinux whitelist, such setup will be safe?
> > Have you considered how mix and match will behave?
> 
> Do you mean the effect of mixing tracepoints and programs? I have not
> considered this. I am Ok with further enforcing of this (only certain
> tracepoints can be attached to certain programs) if needed. What do
> you think? We could have a new bpf(2) syscall attribute specify which
> tracepoint is expected, or similar.
> 
> I wanted to walk you through our 2 usecases we are working on:

thanks for sharing the use case details. Appreciate it.

> 1. timeinstate: By hooking 2 programs onto sched_switch and cpu_frequency
> tracepoints, we are able to collect CPU power per-UID (specific app). Connor
> O'Brien is working on that.
> 
> 2. inode to file path mapping: By hooking onto VFS tracepoints we are adding to
> the android kernels, we can collect data when the kernel resolves a file path
> to a inode/device number. A BPF map stores the inode/dev number (key) and the
> path (value). We have usecases where we need a high speed lookup of this
> without having to scan all the files in the filesystem.

Can you share the link to vfs tracepoints you're adding?
Sounds like you're not going to attempt to upstream them knowing
Al's stance towards them?
May be there is a way we can do the feature you need, but w/o tracepoints?

> For the first usecase, the BPF program will be loaded and attached to the
> scheduler and cpufreq tracepoints at boot time and will stay attached
> forever.  This is why I was saying having a daemon to stay alive all the time
> is pointless. However, if since you are completely against using tracefs
> which it sounds like, then we can do a daemon that is always alive.

As I said earlier this use case can be solved by pinning raw_tp object
into bpffs. Such patches are welcomed.

> For the second usecase, the program attach is needed on-demand unlike the
> first usecase, and then after the usecase completes, it is detached to avoid
> overhead.
> 
> For the second usecase, privacy is important and we want the data to not be
> available to any process. So we want to make sure only selected processes can
> attach to that tracepoint. This is the reason why I was doing working on
> these patches which use the tracefs as well, since we get that level of
> control.

It's hard to recommend anything w/o seeing the actual tracepoints you're adding
to vfs and type of data bpf program extracts from there.
Sounds like it's some sort of cache of inode->file name ?
If so, why is it privacy related?

Joel Fernandes July 24, 2019, 1:57 p.m. UTC | #15

On Tue, Jul 23, 2019 at 03:11:10PM -0700, Alexei Starovoitov wrote:
> > > > > I think allowing one tracepoint and disallowing another is pointless
> > > > > from security point of view. Tracing bpf program can do bpf_probe_read
> > > > > of anything.
> > > >
> > > > I think the assumption here is the user controls the program instructions at
> > > > runtime, but that's not the case. The BPF program we are loading is not
> > > > dynamically generated, it is built at build time and it is loaded from a
> > > > secure verified partition, so even though it can do bpf_probe_read, it is
> > > > still not something that the user can change.
> > > 
> > > so you're saying that by having a set of signed bpf programs which
> > > instructions are known to be non-malicious and allowed set of tracepoints
> > > to attach via selinux whitelist, such setup will be safe?
> > > Have you considered how mix and match will behave?
> > 
> > Do you mean the effect of mixing tracepoints and programs? I have not
> > considered this. I am Ok with further enforcing of this (only certain
> > tracepoints can be attached to certain programs) if needed. What do
> > you think? We could have a new bpf(2) syscall attribute specify which
> > tracepoint is expected, or similar.
> > 
> > I wanted to walk you through our 2 usecases we are working on:
> 
> thanks for sharing the use case details. Appreciate it.

No problem and thanks for your thoughts.

> > 1. timeinstate: By hooking 2 programs onto sched_switch and cpu_frequency
> > tracepoints, we are able to collect CPU power per-UID (specific app). Connor
> > O'Brien is working on that.
> > 
> > 2. inode to file path mapping: By hooking onto VFS tracepoints we are adding to
> > the android kernels, we can collect data when the kernel resolves a file path
> > to a inode/device number. A BPF map stores the inode/dev number (key) and the
> > path (value). We have usecases where we need a high speed lookup of this
> > without having to scan all the files in the filesystem.
> 
> Can you share the link to vfs tracepoints you're adding?
> Sounds like you're not going to attempt to upstream them knowing
> Al's stance towards them?
> May be there is a way we can do the feature you need, but w/o tracepoints?

Yes, given Al's stance I understand the patch is not upstreamable. The patch
is here:
For tracepoint:
https://android.googlesource.com/kernel/common/+/27d3bfe20558d279041af403a887e7bdbdcc6f24%5E%21/
For bpf program:
https://android.googlesource.com/platform/system/bpfprogs/+/908f6cd718fab0de7a944f84628c56f292efeb17%5E%21/

I intended to submit the tracepoint only for the Android kernels, however if
there is an upstream solution to this then that's even better since upstream can
benefit. Were you thinking of a BPF helper function to get this data?

> 
> > For the first usecase, the BPF program will be loaded and attached to the
> > scheduler and cpufreq tracepoints at boot time and will stay attached
> > forever.  This is why I was saying having a daemon to stay alive all the time
> > is pointless. However, if since you are completely against using tracefs
> > which it sounds like, then we can do a daemon that is always alive.
> 
> As I said earlier this use case can be solved by pinning raw_tp object
> into bpffs. Such patches are welcomed.

Ok will think more about it.

> > For the second usecase, the program attach is needed on-demand unlike the
> > first usecase, and then after the usecase completes, it is detached to avoid
> > overhead.
> > 
> > For the second usecase, privacy is important and we want the data to not be
> > available to any process. So we want to make sure only selected processes can
> > attach to that tracepoint. This is the reason why I was doing working on
> > these patches which use the tracefs as well, since we get that level of
> > control.
> 
> It's hard to recommend anything w/o seeing the actual tracepoints you're adding
> to vfs and type of data bpf program extracts from there.
> Sounds like it's some sort of cache of inode->file name ?

Yes, that's what it is.

> If so, why is it privacy related?

The reasoning is the file paths could reveal user activity (such as an app
that opens a document) and Android has requirements to control/restrict that.

thanks,

 - Joel

Alexei Starovoitov July 26, 2019, 6:39 p.m. UTC | #16

On Wed, Jul 24, 2019 at 09:57:14AM -0400, Joel Fernandes wrote:
> On Tue, Jul 23, 2019 at 03:11:10PM -0700, Alexei Starovoitov wrote:
> > > > > > I think allowing one tracepoint and disallowing another is pointless
> > > > > > from security point of view. Tracing bpf program can do bpf_probe_read
> > > > > > of anything.
> > > > >
> > > > > I think the assumption here is the user controls the program instructions at
> > > > > runtime, but that's not the case. The BPF program we are loading is not
> > > > > dynamically generated, it is built at build time and it is loaded from a
> > > > > secure verified partition, so even though it can do bpf_probe_read, it is
> > > > > still not something that the user can change.
> > > > 
> > > > so you're saying that by having a set of signed bpf programs which
> > > > instructions are known to be non-malicious and allowed set of tracepoints
> > > > to attach via selinux whitelist, such setup will be safe?
> > > > Have you considered how mix and match will behave?
> > > 
> > > Do you mean the effect of mixing tracepoints and programs? I have not
> > > considered this. I am Ok with further enforcing of this (only certain
> > > tracepoints can be attached to certain programs) if needed. What do
> > > you think? We could have a new bpf(2) syscall attribute specify which
> > > tracepoint is expected, or similar.
> > > 
> > > I wanted to walk you through our 2 usecases we are working on:
> > 
> > thanks for sharing the use case details. Appreciate it.
> 
> No problem and thanks for your thoughts.
> 
> > > 1. timeinstate: By hooking 2 programs onto sched_switch and cpu_frequency
> > > tracepoints, we are able to collect CPU power per-UID (specific app). Connor
> > > O'Brien is working on that.
> > > 
> > > 2. inode to file path mapping: By hooking onto VFS tracepoints we are adding to
> > > the android kernels, we can collect data when the kernel resolves a file path
> > > to a inode/device number. A BPF map stores the inode/dev number (key) and the
> > > path (value). We have usecases where we need a high speed lookup of this
> > > without having to scan all the files in the filesystem.
> > 
> > Can you share the link to vfs tracepoints you're adding?
> > Sounds like you're not going to attempt to upstream them knowing
> > Al's stance towards them?
> > May be there is a way we can do the feature you need, but w/o tracepoints?
> 
> Yes, given Al's stance I understand the patch is not upstreamable. The patch
> is here:
> For tracepoint:
> https://android.googlesource.com/kernel/common/+/27d3bfe20558d279041af403a887e7bdbdcc6f24%5E%21/

this is way more than tracepoint.

> For bpf program:
> https://android.googlesource.com/platform/system/bpfprogs/+/908f6cd718fab0de7a944f84628c56f292efeb17%5E%21/

what is unsafe_bpf_map_update_elem() in there?
The verifier comment sounds odd.
Could you describe the issue you see with the verifier?

> I intended to submit the tracepoint only for the Android kernels, however if
> there is an upstream solution to this then that's even better since upstream can
> benefit. Were you thinking of a BPF helper function to get this data?

I think the best way to evaluate the patches is whether they are upstreamable or not.
If they're not (like this case), it means that there is something wrong with their design
and if android decides to go with such approach it will only create serious issues long term.
Starting with the whole idea of dev+inode -> filepath cache.
dev+inode is not a unique identifier of the file.
In some filesystems two different files may have the same ino integer value.
Have you looked at 'struct file_handle' ? and name_to_handle_at ?
I think fhandle is the only way to get unique identifier of the file.
Could you please share more details why android needs this cache of dev+ino->path?
I guess something uses ino to find paths?
Sort of faster version of 'find -inum' ?

Joel Fernandes July 26, 2019, 7:18 p.m. UTC | #17

On Fri, Jul 26, 2019 at 11:39:56AM -0700, Alexei Starovoitov wrote:
[snip]
> > > > 1. timeinstate: By hooking 2 programs onto sched_switch and cpu_frequency
> > > > tracepoints, we are able to collect CPU power per-UID (specific app). Connor
> > > > O'Brien is working on that.
> > > > 
> > > > 2. inode to file path mapping: By hooking onto VFS tracepoints we are adding to
> > > > the android kernels, we can collect data when the kernel resolves a file path
> > > > to a inode/device number. A BPF map stores the inode/dev number (key) and the
> > > > path (value). We have usecases where we need a high speed lookup of this
> > > > without having to scan all the files in the filesystem.
> > > 
> > > Can you share the link to vfs tracepoints you're adding?
> > > Sounds like you're not going to attempt to upstream them knowing
> > > Al's stance towards them?
> > > May be there is a way we can do the feature you need, but w/o tracepoints?
> > 
> > Yes, given Al's stance I understand the patch is not upstreamable. The patch
> > is here:
> > For tracepoint:
> > https://android.googlesource.com/kernel/common/+/27d3bfe20558d279041af403a887e7bdbdcc6f24%5E%21/
> 
> this is way more than tracepoint.

True there is some code that calls the tracepoint. I want to optimize it more
but lets see I am ready to think more about it before doing it this way,
based on your suggestions.

> > For bpf program:
> > https://android.googlesource.com/platform/system/bpfprogs/+/908f6cd718fab0de7a944f84628c56f292efeb17%5E%21/
> 
> what is unsafe_bpf_map_update_elem() in there?
> The verifier comment sounds odd.
> Could you describe the issue you see with the verifier?

Will dig out the verifier issue I was seeing. I was just trying to get a
prototype working so I did not go into verifier details much.

> > I intended to submit the tracepoint only for the Android kernels, however if
> > there is an upstream solution to this then that's even better since upstream can
> > benefit. Were you thinking of a BPF helper function to get this data?
> 
> I think the best way to evaluate the patches is whether they are upstreamable or not.
> If they're not (like this case), it means that there is something wrong with their design
> and if android decides to go with such approach it will only create serious issues long term.
> Starting with the whole idea of dev+inode -> filepath cache.
> dev+inode is not a unique identifier of the file.
> In some filesystems two different files may have the same ino integer value.
> Have you looked at 'struct file_handle' ? and name_to_handle_at ?
> I think fhandle is the only way to get unique identifier of the file.
> Could you please share more details why android needs this cache of dev+ino->path?

I will follow-up with you on this by email off the list, thanks.

thanks,

 - Joel

Joel Fernandes July 26, 2019, 7:49 p.m. UTC | #18

On Fri, Jul 26, 2019 at 3:18 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Fri, Jul 26, 2019 at 11:39:56AM -0700, Alexei Starovoitov wrote:
[snip]
> > > For bpf program:
> > > https://android.googlesource.com/platform/system/bpfprogs/+/908f6cd718fab0de7a944f84628c56f292efeb17%5E%21/
> >
> > what is unsafe_bpf_map_update_elem() in there?
> > The verifier comment sounds odd.
> > Could you describe the issue you see with the verifier?
>
> Will dig out the verifier issue I was seeing. I was just trying to get a
> prototype working so I did not go into verifier details much.

This is actually slightly old code, the actual function name is
bpf_map_update_elem_unsafe() .
 https://android.googlesource.com/platform/system/bpf/+/refs/heads/master/progs/include/bpf_helpers.h#39

This function came about because someone added a DEFINE_BPF_MAP macro
which defines BPF map accessors based on the type of the key and
value. So that's the "safe" variant:
https://android.googlesource.com/platform/system/bpf/+/refs/heads/master/progs/include/bpf_helpers.h#54
(added in commit
https://android.googlesource.com/platform/system/bpf/+/6564b8eac46fc27dde807a39856386d98d2471c3)

So the "safe" variant of the bpf_map_update_elem for us became a map
specific version with a prototype:
static inline __always_inline __unused int
bpf_##the_map##_update_elem(TypeOfKey* k, TypeOfValue* v, unsigned
long long flags)

Since I had not upgraded my BPF program to the "safe" variant, I had
to use the internal "unsafe" variant of the API (if that makes
sense..).

thanks Alexei!

- Joel

[RFC,0/4] Add support to directly attach BPF program to ftrace

Message

Comments