diff mbox series

[bpf-next,4/6] tools: add new members to bpf_attr.raw_tracepoint in bpf.h

Message ID 159467114405.370286.1690821122507970067.stgit@toke.dk
State Changes Requested
Delegated to: BPF Maintainers
Headers show
Series bpf: Support multi-attach for freplace programs | expand

Commit Message

Toke Høiland-Jørgensen July 13, 2020, 8:12 p.m. UTC
From: Toke Høiland-Jørgensen <toke@redhat.com>

Sync addition of new members from main kernel tree.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 tools/include/uapi/linux/bpf.h |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Andrii Nakryiko July 14, 2020, 5:21 a.m. UTC | #1
On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> From: Toke Høiland-Jørgensen <toke@redhat.com>
>
> Sync addition of new members from main kernel tree.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  tools/include/uapi/linux/bpf.h |    9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index da9bf35a26f8..662a15e4a1a1 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -573,8 +573,13 @@ union bpf_attr {
>         } query;
>
>         struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
> -               __u64 name;
> -               __u32 prog_fd;
> +               __u64           name;
> +               __u32           prog_fd;
> +               __u32           log_level;      /* verbosity level of log */
> +               __u32           log_size;       /* size of user buffer */
> +               __aligned_u64   log_buf;        /* user supplied buffer */
> +               __u32           tgt_prog_fd;
> +               __u32           tgt_btf_id;
>         } raw_tracepoint;
>
>         struct { /* anonymous struct for BPF_BTF_LOAD */
>

I think BPF syscall would benefit from common/generalized log support
across all commands, given how powerful/complex it already is.
Sometimes it's literally impossible to understand why one gets -EINVAL
without adding printk()s in the kernel.

But it feels wrong to add log_level/log_size/log_buf fields to every
section of bpf_attr and require the user to specify it differently for
each command. So before we go and start adding per-command fields,
let's discuss how we can do this more generically. I wonder if we can
come up with a good way to do it in one common way and then gradually
support that way throughout all BPF commands.

Unfortunately it's too late to just add a few common fields to
bpf_attr in front of all other per-command fields, but here's two more
ideas just to start discussion. I hope someone can come up with
something nicer.

1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size).
We can extend it with two more optional arguments: one for pointer to
log-defining attr (for log_buf pointer, log_size, log_level, maybe
something more later) and another for the size of that log attr.
Beyond that we'd need some way to specify that the user actually meant
to provide those 2 optional args. The most straightforward way would
be to use the highest bit of cmd argument. So instead of calling bpf()
with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE |
BPF_LOGGED, where BPF_LOGGED = 1<<31.

2. A more "stateful" approach, would be to have an extra BPF command
to set log buffer (and level) per thread. And if such a log is set, it
would be overwritten with content on each bpf() syscall invocation
(i.e., log position would be reset to 0 on each BPF syscall).

Of course, the existing BPF_LOAD_PROG command would keep working with
existing log, but could fall back to the "common one", if
BPF_LOAD_PROG-specific one is not set.

It also doesn't seem to be all that critical to signal when the log
buffer is overflown. It's pretty easy to detect from user-space:
- either last byte would be non-zero, if we don't care about
guaranteeing zero-termination for truncated log;
- of second-to-last byte would be non-zero, if BPF syscall will always
zero-terminate the log.

Of course, if user code cares about such detection of log truncation,
it will need to set last/second-to-last bytes to zero before each
syscall.

Both approaches have their pros/cons, we can dig into those later, but
I'd like to start this discussion and see what other people think. I
also wonder if there are other syscalls with similarly advanced input
arguments (like bpf_attr) and common logging, we could learn from
those.

Thoughts?
Toke Høiland-Jørgensen July 14, 2020, 12:12 p.m. UTC | #2
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>
>> Sync addition of new members from main kernel tree.
>>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> ---
>>  tools/include/uapi/linux/bpf.h |    9 +++++++--
>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> index da9bf35a26f8..662a15e4a1a1 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -573,8 +573,13 @@ union bpf_attr {
>>         } query;
>>
>>         struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
>> -               __u64 name;
>> -               __u32 prog_fd;
>> +               __u64           name;
>> +               __u32           prog_fd;
>> +               __u32           log_level;      /* verbosity level of log */
>> +               __u32           log_size;       /* size of user buffer */
>> +               __aligned_u64   log_buf;        /* user supplied buffer */
>> +               __u32           tgt_prog_fd;
>> +               __u32           tgt_btf_id;
>>         } raw_tracepoint;
>>
>>         struct { /* anonymous struct for BPF_BTF_LOAD */
>>
>
> I think BPF syscall would benefit from common/generalized log support
> across all commands, given how powerful/complex it already is.
> Sometimes it's literally impossible to understand why one gets -EINVAL
> without adding printk()s in the kernel.

Yes, I agree! This is horrible UI!

> But it feels wrong to add log_level/log_size/log_buf fields to every
> section of bpf_attr and require the user to specify it differently for
> each command. So before we go and start adding per-command fields,
> let's discuss how we can do this more generically. I wonder if we can
> come up with a good way to do it in one common way and then gradually
> support that way throughout all BPF commands.
>
> Unfortunately it's too late to just add a few common fields to
> bpf_attr in front of all other per-command fields, but here's two more
> ideas just to start discussion. I hope someone can come up with
> something nicer.
>
> 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size).
> We can extend it with two more optional arguments: one for pointer to
> log-defining attr (for log_buf pointer, log_size, log_level, maybe
> something more later) and another for the size of that log attr.
> Beyond that we'd need some way to specify that the user actually meant
> to provide those 2 optional args. The most straightforward way would
> be to use the highest bit of cmd argument. So instead of calling bpf()
> with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE |
> BPF_LOGGED, where BPF_LOGGED = 1<<31.

Well, if only we'd had a 'flags' argument to the syscall... I don't
suppose we want a bpf2()? :)

I like your idea of just using the top bits of the 'cmd' field as flag
bits, but in that case we should just define this explicitly, say
'#define BPF_CMD_FLAGS_MASK 0xFFFF0000'?

And instead of adding new arguments, why not just change the meaning of
the 'attr' argument? Say we define:

struct bpf_extended_attr {
       __u32 log_level;
       __u32 log_size;
       __aligned_u64 log_buf;
       __u8 reserved[48];
       union bpf_attr attr;
};

And then define a new flag BPF_USES_EXTENDED_ATTR which will cause the
kernel to interpret the second argument of the syscall as a pointer to
that struct instead of to the bpf_attr union?

> 2. A more "stateful" approach, would be to have an extra BPF command
> to set log buffer (and level) per thread. And if such a log is set, it
> would be overwritten with content on each bpf() syscall invocation
> (i.e., log position would be reset to 0 on each BPF syscall).

I don't think adding something stateful is a good idea; that's bound to
lead to weird issues when someone messes up the state management in
userspace.

> Of course, the existing BPF_LOAD_PROG command would keep working with
> existing log, but could fall back to the "common one", if
> BPF_LOAD_PROG-specific one is not set.
>
> It also doesn't seem to be all that critical to signal when the log
> buffer is overflown. It's pretty easy to detect from user-space:
> - either last byte would be non-zero, if we don't care about
> guaranteeing zero-termination for truncated log;
> - of second-to-last byte would be non-zero, if BPF syscall will always
> zero-terminate the log.

I think if we're doing this we should think about making the contents of
the log machine-readable, so applications can figure out what's going on
without having to parse the text strings. The simplest would be to make
this new log buffer use TLV-style messaging, say we define the log
buffer output to be a series of messages like these:

struct bpf_log_msg {
       __u16 type;
       __u32 len;
       __u8 contents[]; /* of size 'len' */
} __attribute__((packed));

To begin with we could define two types:

struct bpf_log_msg_string {
       __u16 type; /* BPF_LOG_MSG_TYPE_STRING */
       __u32 len;
       char message[];
}  __attribute__((packed));

struct bpf_log_msg_end {
       __u16 type; /* BPF_LOG_MSG_TYPE_END */
       __u32 len;
       __u16 num_truncations;
}  __attribute__((packed));

The TYPE_STRING would just be a wrapper for the existing text-based
messages, but delimited so userspace can pick them apart. And the second
one would solve your 'has the log been truncated' issue above; the
kernel simply always reserves eight bytes at the end of the buffer and
ends with writing a TYPE_END message with the number of messages that
were dropped due to lack of space. That would make it trivial for
userspace to detect truncation.

We could then add new message types later for machine-consumption. Say,
and extended error code, or offsets into the BTF information, or
whatever we end up needing. But just wrapping the existing log messages
in TLVs like the ones above could be fairly straight-forwardly
implemented with the existing bpf_log() infrastructure in the kernel,
without having to settle on which machine-readable information is useful
ahead of time... And the TLV format makes it easy for userspace to just
skip message types it doesn't understand.

WDYT?

> Of course, if user code cares about such detection of log truncation,
> it will need to set last/second-to-last bytes to zero before each
> syscall.
>
> Both approaches have their pros/cons, we can dig into those later, but
> I'd like to start this discussion and see what other people think. I
> also wonder if there are other syscalls with similarly advanced input
> arguments (like bpf_attr) and common logging, we could learn from
> those.

The first one that comes to mind is netlink extacks. Of course it's not
quite comparable since netlink already has message-based semantics, but
it does do sorta-kinda the same thing from a user PoV. The TLV format is
obviously inspired by netlink (or, well, binary networking protocols in
general).

-Toke
Andrii Nakryiko July 14, 2020, 7:14 p.m. UTC | #3
On Tue, Jul 14, 2020 at 5:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> From: Toke Høiland-Jørgensen <toke@redhat.com>
> >>
> >> Sync addition of new members from main kernel tree.
> >>
> >> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> >> ---
> >>  tools/include/uapi/linux/bpf.h |    9 +++++++--
> >>  1 file changed, 7 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> >> index da9bf35a26f8..662a15e4a1a1 100644
> >> --- a/tools/include/uapi/linux/bpf.h
> >> +++ b/tools/include/uapi/linux/bpf.h
> >> @@ -573,8 +573,13 @@ union bpf_attr {
> >>         } query;
> >>
> >>         struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
> >> -               __u64 name;
> >> -               __u32 prog_fd;
> >> +               __u64           name;
> >> +               __u32           prog_fd;
> >> +               __u32           log_level;      /* verbosity level of log */
> >> +               __u32           log_size;       /* size of user buffer */
> >> +               __aligned_u64   log_buf;        /* user supplied buffer */
> >> +               __u32           tgt_prog_fd;
> >> +               __u32           tgt_btf_id;
> >>         } raw_tracepoint;
> >>
> >>         struct { /* anonymous struct for BPF_BTF_LOAD */
> >>
> >
> > I think BPF syscall would benefit from common/generalized log support
> > across all commands, given how powerful/complex it already is.
> > Sometimes it's literally impossible to understand why one gets -EINVAL
> > without adding printk()s in the kernel.
>
> Yes, I agree! This is horrible UI!

UI?.. It's a perfectly fine and extensible API for all functionality
it provides, it just needs a better human-readable feedback mechanism,
which is what I'm proposing. Error codes are not working when you have
so many different situations that can result in error.

>
> > But it feels wrong to add log_level/log_size/log_buf fields to every
> > section of bpf_attr and require the user to specify it differently for
> > each command. So before we go and start adding per-command fields,
> > let's discuss how we can do this more generically. I wonder if we can
> > come up with a good way to do it in one common way and then gradually
> > support that way throughout all BPF commands.
> >
> > Unfortunately it's too late to just add a few common fields to
> > bpf_attr in front of all other per-command fields, but here's two more
> > ideas just to start discussion. I hope someone can come up with
> > something nicer.
> >
> > 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size).
> > We can extend it with two more optional arguments: one for pointer to
> > log-defining attr (for log_buf pointer, log_size, log_level, maybe
> > something more later) and another for the size of that log attr.
> > Beyond that we'd need some way to specify that the user actually meant
> > to provide those 2 optional args. The most straightforward way would
> > be to use the highest bit of cmd argument. So instead of calling bpf()
> > with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE |
> > BPF_LOGGED, where BPF_LOGGED = 1<<31.
>
> Well, if only we'd had a 'flags' argument to the syscall... I don't
> suppose we want a bpf2()? :)
>
> I like your idea of just using the top bits of the 'cmd' field as flag
> bits, but in that case we should just define this explicitly, say
> '#define BPF_CMD_FLAGS_MASK 0xFFFF0000'?

sure

>
> And instead of adding new arguments, why not just change the meaning of
> the 'attr' argument? Say we define:
>
> struct bpf_extended_attr {
>        __u32 log_level;
>        __u32 log_size;
>        __aligned_u64 log_buf;
>        __u8 reserved[48];
>        union bpf_attr attr;
> };

because this is a major PITA for libraries, plus it's not very
extensible, once you run out of 48 bytes? And when you don't need
those 48 bytes, you still need to zero them out, the kernel still
needs to copy them, etc. It just feels unclean to me.

But before we argue that, is there a problem adding 2 more arguments
which are never used/read unless we have an extra bit set in cmd?
Honest question, are there any implications to user-space with such an
approach? Backwards-compatibility issues or anything?


>
> And then define a new flag BPF_USES_EXTENDED_ATTR which will cause the
> kernel to interpret the second argument of the syscall as a pointer to
> that struct instead of to the bpf_attr union?
>
> > 2. A more "stateful" approach, would be to have an extra BPF command
> > to set log buffer (and level) per thread. And if such a log is set, it
> > would be overwritten with content on each bpf() syscall invocation
> > (i.e., log position would be reset to 0 on each BPF syscall).
>
> I don't think adding something stateful is a good idea; that's bound to
> lead to weird issues when someone messes up the state management in
> userspace.

I agree, I'd prefer a stateless approach, but wanted to lay out a
stateful one for completeness as well.

>
> > Of course, the existing BPF_LOAD_PROG command would keep working with
> > existing log, but could fall back to the "common one", if
> > BPF_LOAD_PROG-specific one is not set.
> >
> > It also doesn't seem to be all that critical to signal when the log
> > buffer is overflown. It's pretty easy to detect from user-space:
> > - either last byte would be non-zero, if we don't care about
> > guaranteeing zero-termination for truncated log;
> > - of second-to-last byte would be non-zero, if BPF syscall will always
> > zero-terminate the log.
>
> I think if we're doing this we should think about making the contents of
> the log machine-readable, so applications can figure out what's going on
> without having to parse the text strings. The simplest would be to make
> this new log buffer use TLV-style messaging, say we define the log
> buffer output to be a series of messages like these:
>
> struct bpf_log_msg {
>        __u16 type;
>        __u32 len;
>        __u8 contents[]; /* of size 'len' */
> } __attribute__((packed));
>
> To begin with we could define two types:
>
> struct bpf_log_msg_string {
>        __u16 type; /* BPF_LOG_MSG_TYPE_STRING */
>        __u32 len;
>        char message[];
> }  __attribute__((packed));
>
> struct bpf_log_msg_end {
>        __u16 type; /* BPF_LOG_MSG_TYPE_END */
>        __u32 len;
>        __u16 num_truncations;
> }  __attribute__((packed));
>
> The TYPE_STRING would just be a wrapper for the existing text-based
> messages, but delimited so userspace can pick them apart. And the second
> one would solve your 'has the log been truncated' issue above; the
> kernel simply always reserves eight bytes at the end of the buffer and
> ends with writing a TYPE_END message with the number of messages that
> were dropped due to lack of space. That would make it trivial for
> userspace to detect truncation.
>

Log truncation is not an issue, we can make bpf syscall to write back
the actual size of emitted log (and optionally extra bool to mean
"truncated") into the original log_size field.

> We could then add new message types later for machine-consumption. Say,
> and extended error code, or offsets into the BTF information, or
> whatever we end up needing. But just wrapping the existing log messages
> in TLVs like the ones above could be fairly straight-forwardly
> implemented with the existing bpf_log() infrastructure in the kernel,
> without having to settle on which machine-readable information is useful
> ahead of time... And the TLV format makes it easy for userspace to just
> skip message types it doesn't understand.
>
> WDYT?

I think it's taking it a bit too far and adds more API on top of
existing API. Now all those types become a fixed API, messages become
an API, etc. Just more backwards compatibility stuff we need to
support forever, for, what I believe, very little gain.

In practice, using human-readable strings works just fine. If there is
any kind of real issue, usually it involves humans reading debug logs
and understanding what's going on.

Also adopting these packet-like messages is not as straightforward
through BPF code, as now you can't just construct a single log line
with few calls to bpf_log().

>
> > Of course, if user code cares about such detection of log truncation,
> > it will need to set last/second-to-last bytes to zero before each
> > syscall.
> >
> > Both approaches have their pros/cons, we can dig into those later, but
> > I'd like to start this discussion and see what other people think. I
> > also wonder if there are other syscalls with similarly advanced input
> > arguments (like bpf_attr) and common logging, we could learn from
> > those.
>
> The first one that comes to mind is netlink extacks. Of course it's not
> quite comparable since netlink already has message-based semantics, but
> it does do sorta-kinda the same thing from a user PoV. The TLV format is
> obviously inspired by netlink (or, well, binary networking protocols in
> general).
>

Yeah, I'm aware of extack, but as you said, TLV is more of a netlink
format, extack messages themselves are just strings. But my question
was more of how this log could be done for complicated API calls using
similar extendable attrs, like perf_event_open, clone3, openat2, etc.

> -Toke
>
Toke Høiland-Jørgensen July 14, 2020, 8:47 p.m. UTC | #4
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Tue, Jul 14, 2020 at 5:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> From: Toke Høiland-Jørgensen <toke@redhat.com>
>> >>
>> >> Sync addition of new members from main kernel tree.
>> >>
>> >> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> >> ---
>> >>  tools/include/uapi/linux/bpf.h |    9 +++++++--
>> >>  1 file changed, 7 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>> >> index da9bf35a26f8..662a15e4a1a1 100644
>> >> --- a/tools/include/uapi/linux/bpf.h
>> >> +++ b/tools/include/uapi/linux/bpf.h
>> >> @@ -573,8 +573,13 @@ union bpf_attr {
>> >>         } query;
>> >>
>> >>         struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
>> >> -               __u64 name;
>> >> -               __u32 prog_fd;
>> >> +               __u64           name;
>> >> +               __u32           prog_fd;
>> >> +               __u32           log_level;      /* verbosity level of log */
>> >> +               __u32           log_size;       /* size of user buffer */
>> >> +               __aligned_u64   log_buf;        /* user supplied buffer */
>> >> +               __u32           tgt_prog_fd;
>> >> +               __u32           tgt_btf_id;
>> >>         } raw_tracepoint;
>> >>
>> >>         struct { /* anonymous struct for BPF_BTF_LOAD */
>> >>
>> >
>> > I think BPF syscall would benefit from common/generalized log support
>> > across all commands, given how powerful/complex it already is.
>> > Sometimes it's literally impossible to understand why one gets -EINVAL
>> > without adding printk()s in the kernel.
>>
>> Yes, I agree! This is horrible UI!
>
> UI?.. It's a perfectly fine and extensible API for all functionality
> it provides, it just needs a better human-readable feedback mechanism,
> which is what I'm proposing. Error codes are not working when you have
> so many different situations that can result in error.

Yes. I was agreeing with you: the lack of understandable error messages
means you have to play "guess where this EINVAL came from", which is a
terrible user experience (should have been UX instead of UI, I guess,
sorry about that). 

>> > But it feels wrong to add log_level/log_size/log_buf fields to every
>> > section of bpf_attr and require the user to specify it differently for
>> > each command. So before we go and start adding per-command fields,
>> > let's discuss how we can do this more generically. I wonder if we can
>> > come up with a good way to do it in one common way and then gradually
>> > support that way throughout all BPF commands.
>> >
>> > Unfortunately it's too late to just add a few common fields to
>> > bpf_attr in front of all other per-command fields, but here's two more
>> > ideas just to start discussion. I hope someone can come up with
>> > something nicer.
>> >
>> > 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size).
>> > We can extend it with two more optional arguments: one for pointer to
>> > log-defining attr (for log_buf pointer, log_size, log_level, maybe
>> > something more later) and another for the size of that log attr.
>> > Beyond that we'd need some way to specify that the user actually meant
>> > to provide those 2 optional args. The most straightforward way would
>> > be to use the highest bit of cmd argument. So instead of calling bpf()
>> > with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE |
>> > BPF_LOGGED, where BPF_LOGGED = 1<<31.
>>
>> Well, if only we'd had a 'flags' argument to the syscall... I don't
>> suppose we want a bpf2()? :)
>>
>> I like your idea of just using the top bits of the 'cmd' field as flag
>> bits, but in that case we should just define this explicitly, say
>> '#define BPF_CMD_FLAGS_MASK 0xFFFF0000'?
>
> sure
>
>>
>> And instead of adding new arguments, why not just change the meaning of
>> the 'attr' argument? Say we define:
>>
>> struct bpf_extended_attr {
>>        __u32 log_level;
>>        __u32 log_size;
>>        __aligned_u64 log_buf;
>>        __u8 reserved[48];
>>        union bpf_attr attr;
>> };
>
> because this is a major PITA for libraries, plus it's not very
> extensible, once you run out of 48 bytes? And when you don't need
> those 48 bytes, you still need to zero them out, the kernel still
> needs to copy them, etc. It just feels unclean to me.
>
> But before we argue that, is there a problem adding 2 more arguments
> which are never used/read unless we have an extra bit set in cmd?
> Honest question, are there any implications to user-space with such an
> approach? Backwards-compatibility issues or anything?

No idea; I don't know enough about how the lower-level details of the
syscall interface works to tell either way. Is it even *possible* to add
arguments like that in a backwards-compatible way?

However, assuming it *is* possible, my larger point was that we
shouldn't add just a 'logging struct', but rather a 'common options
struct' which can be extended further as needed. And if it is *not*
possible to add new arguments to a syscall like you're proposing, my
suggestion above would be a different way to achieve basically the same
(at the cost of having to specify the maximum reserved space in advance).

>> And then define a new flag BPF_USES_EXTENDED_ATTR which will cause the
>> kernel to interpret the second argument of the syscall as a pointer to
>> that struct instead of to the bpf_attr union?
>>
>> > 2. A more "stateful" approach, would be to have an extra BPF command
>> > to set log buffer (and level) per thread. And if such a log is set, it
>> > would be overwritten with content on each bpf() syscall invocation
>> > (i.e., log position would be reset to 0 on each BPF syscall).
>>
>> I don't think adding something stateful is a good idea; that's bound to
>> lead to weird issues when someone messes up the state management in
>> userspace.
>
> I agree, I'd prefer a stateless approach, but wanted to lay out a
> stateful one for completeness as well.

OK, cool.

>>
>> > Of course, the existing BPF_LOAD_PROG command would keep working with
>> > existing log, but could fall back to the "common one", if
>> > BPF_LOAD_PROG-specific one is not set.
>> >
>> > It also doesn't seem to be all that critical to signal when the log
>> > buffer is overflown. It's pretty easy to detect from user-space:
>> > - either last byte would be non-zero, if we don't care about
>> > guaranteeing zero-termination for truncated log;
>> > - of second-to-last byte would be non-zero, if BPF syscall will always
>> > zero-terminate the log.
>>
>> I think if we're doing this we should think about making the contents of
>> the log machine-readable, so applications can figure out what's going on
>> without having to parse the text strings. The simplest would be to make
>> this new log buffer use TLV-style messaging, say we define the log
>> buffer output to be a series of messages like these:
>>
>> struct bpf_log_msg {
>>        __u16 type;
>>        __u32 len;
>>        __u8 contents[]; /* of size 'len' */
>> } __attribute__((packed));
>>
>> To begin with we could define two types:
>>
>> struct bpf_log_msg_string {
>>        __u16 type; /* BPF_LOG_MSG_TYPE_STRING */
>>        __u32 len;
>>        char message[];
>> }  __attribute__((packed));
>>
>> struct bpf_log_msg_end {
>>        __u16 type; /* BPF_LOG_MSG_TYPE_END */
>>        __u32 len;
>>        __u16 num_truncations;
>> }  __attribute__((packed));
>>
>> The TYPE_STRING would just be a wrapper for the existing text-based
>> messages, but delimited so userspace can pick them apart. And the second
>> one would solve your 'has the log been truncated' issue above; the
>> kernel simply always reserves eight bytes at the end of the buffer and
>> ends with writing a TYPE_END message with the number of messages that
>> were dropped due to lack of space. That would make it trivial for
>> userspace to detect truncation.
>>
>
> Log truncation is not an issue, we can make bpf syscall to write back
> the actual size of emitted log (and optionally extra bool to mean
> "truncated") into the original log_size field.

So what was all that you were talking about with "check the
second-to-last byte" in your previous email? I understood that to be
about detecting truncation?

>> We could then add new message types later for machine-consumption. Say,
>> and extended error code, or offsets into the BTF information, or
>> whatever we end up needing. But just wrapping the existing log messages
>> in TLVs like the ones above could be fairly straight-forwardly
>> implemented with the existing bpf_log() infrastructure in the kernel,
>> without having to settle on which machine-readable information is useful
>> ahead of time... And the TLV format makes it easy for userspace to just
>> skip message types it doesn't understand.
>>
>> WDYT?
>
> I think it's taking it a bit too far and adds more API on top of
> existing API. Now all those types become a fixed API, messages become
> an API, etc. Just more backwards compatibility stuff we need to
> support forever, for, what I believe, very little gain.
>
> In practice, using human-readable strings works just fine. If there is
> any kind of real issue, usually it involves humans reading debug logs
> and understanding what's going on.

Let me give an example of what I want to be able to do. Just today I
helped someone debug getting xdp-filter to run, and they were getting
output like this:

      libbpf: -- BEGIN DUMP LOG ---
      libbpf:
      xdpfilt_alw_all() is not a global function
      processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0
      peak_states 0 mark_read 0

      libbpf: -- END LOG --

I would like to have xdp-filter catch that, and turn it into a
friendlier "your compiler is too old" message. Having to effectively
grep through the free-form log output to pick out that message feels
brittle and error prone, as opposed to just having the kernel add a
machine-readable id ("err_func_linkage_not_global" or somesuch) and
stick it in a machine-parsable TLV.

> Also adopting these packet-like messages is not as straightforward
> through BPF code, as now you can't just construct a single log line
> with few calls to bpf_log().

Why not? bpf_log() could just transparently write the four bytes of
header (TYPE_STRING, followed by strlen(msg)) into the buffer before the
string? And in the future, an enhanced version could take (say) an error
ID as another parameter and transparently add that as a separate message.

>> > Of course, if user code cares about such detection of log truncation,
>> > it will need to set last/second-to-last bytes to zero before each
>> > syscall.
>> >
>> > Both approaches have their pros/cons, we can dig into those later, but
>> > I'd like to start this discussion and see what other people think. I
>> > also wonder if there are other syscalls with similarly advanced input
>> > arguments (like bpf_attr) and common logging, we could learn from
>> > those.
>>
>> The first one that comes to mind is netlink extacks. Of course it's not
>> quite comparable since netlink already has message-based semantics, but
>> it does do sorta-kinda the same thing from a user PoV. The TLV format is
>> obviously inspired by netlink (or, well, binary networking protocols in
>> general).
>>
>
> Yeah, I'm aware of extack, but as you said, TLV is more of a netlink
> format, extack messages themselves are just strings. But my question
> was more of how this log could be done for complicated API calls using
> similar extendable attrs, like perf_event_open, clone3, openat2, etc.

Ah, right, no idea :)

-Toke
Andrii Nakryiko July 14, 2020, 9:58 p.m. UTC | #5
On Tue, Jul 14, 2020 at 1:47 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Tue, Jul 14, 2020 at 5:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >>
> >> > On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> From: Toke Høiland-Jørgensen <toke@redhat.com>
> >> >>
> >> >> Sync addition of new members from main kernel tree.
> >> >>
> >> >> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> >> >> ---
> >> >>  tools/include/uapi/linux/bpf.h |    9 +++++++--
> >> >>  1 file changed, 7 insertions(+), 2 deletions(-)
> >> >>
> >> >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> >> >> index da9bf35a26f8..662a15e4a1a1 100644
> >> >> --- a/tools/include/uapi/linux/bpf.h
> >> >> +++ b/tools/include/uapi/linux/bpf.h
> >> >> @@ -573,8 +573,13 @@ union bpf_attr {
> >> >>         } query;
> >> >>
> >> >>         struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
> >> >> -               __u64 name;
> >> >> -               __u32 prog_fd;
> >> >> +               __u64           name;
> >> >> +               __u32           prog_fd;
> >> >> +               __u32           log_level;      /* verbosity level of log */
> >> >> +               __u32           log_size;       /* size of user buffer */
> >> >> +               __aligned_u64   log_buf;        /* user supplied buffer */
> >> >> +               __u32           tgt_prog_fd;
> >> >> +               __u32           tgt_btf_id;
> >> >>         } raw_tracepoint;
> >> >>
> >> >>         struct { /* anonymous struct for BPF_BTF_LOAD */
> >> >>
> >> >
> >> > I think BPF syscall would benefit from common/generalized log support
> >> > across all commands, given how powerful/complex it already is.
> >> > Sometimes it's literally impossible to understand why one gets -EINVAL
> >> > without adding printk()s in the kernel.
> >>
> >> Yes, I agree! This is horrible UI!
> >
> > UI?.. It's a perfectly fine and extensible API for all functionality
> > it provides, it just needs a better human-readable feedback mechanism,
> > which is what I'm proposing. Error codes are not working when you have
> > so many different situations that can result in error.
>
> Yes. I was agreeing with you: the lack of understandable error messages
> means you have to play "guess where this EINVAL came from", which is a
> terrible user experience (should have been UX instead of UI, I guess,
> sorry about that).

Ah, ok, the term confused me :)

>
> >> > But it feels wrong to add log_level/log_size/log_buf fields to every
> >> > section of bpf_attr and require the user to specify it differently for
> >> > each command. So before we go and start adding per-command fields,
> >> > let's discuss how we can do this more generically. I wonder if we can
> >> > come up with a good way to do it in one common way and then gradually
> >> > support that way throughout all BPF commands.
> >> >
> >> > Unfortunately it's too late to just add a few common fields to
> >> > bpf_attr in front of all other per-command fields, but here's two more
> >> > ideas just to start discussion. I hope someone can come up with
> >> > something nicer.
> >> >
> >> > 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size).
> >> > We can extend it with two more optional arguments: one for pointer to
> >> > log-defining attr (for log_buf pointer, log_size, log_level, maybe
> >> > something more later) and another for the size of that log attr.
> >> > Beyond that we'd need some way to specify that the user actually meant
> >> > to provide those 2 optional args. The most straightforward way would
> >> > be to use the highest bit of cmd argument. So instead of calling bpf()
> >> > with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE |
> >> > BPF_LOGGED, where BPF_LOGGED = 1<<31.
> >>
> >> Well, if only we'd had a 'flags' argument to the syscall... I don't
> >> suppose we want a bpf2()? :)
> >>
> >> I like your idea of just using the top bits of the 'cmd' field as flag
> >> bits, but in that case we should just define this explicitly, say
> >> '#define BPF_CMD_FLAGS_MASK 0xFFFF0000'?
> >
> > sure
> >
> >>
> >> And instead of adding new arguments, why not just change the meaning of
> >> the 'attr' argument? Say we define:
> >>
> >> struct bpf_extended_attr {
> >>        __u32 log_level;
> >>        __u32 log_size;
> >>        __aligned_u64 log_buf;
> >>        __u8 reserved[48];
> >>        union bpf_attr attr;
> >> };
> >
> > because this is a major PITA for libraries, plus it's not very
> > extensible, once you run out of 48 bytes? And when you don't need
> > those 48 bytes, you still need to zero them out, the kernel still
> > needs to copy them, etc. It just feels unclean to me.
> >
> > But before we argue that, is there a problem adding 2 more arguments
> > which are never used/read unless we have an extra bit set in cmd?
> > Honest question, are there any implications to user-space with such an
> > approach? Backwards-compatibility issues or anything?
>
> No idea; I don't know enough about how the lower-level details of the
> syscall interface works to tell either way. Is it even *possible* to add
> arguments like that in a backwards-compatible way?
>
> However, assuming it *is* possible, my larger point was that we
> shouldn't add just a 'logging struct', but rather a 'common options
> struct' which can be extended further as needed. And if it is *not*
> possible to add new arguments to a syscall like you're proposing, my
> suggestion above would be a different way to achieve basically the same
> (at the cost of having to specify the maximum reserved space in advance).
>

yeah-yeah, I agree, it's less a "logging attr", more of "common attr
across all commands".

> >> And then define a new flag BPF_USES_EXTENDED_ATTR which will cause the
> >> kernel to interpret the second argument of the syscall as a pointer to
> >> that struct instead of to the bpf_attr union?
> >>
> >> > 2. A more "stateful" approach, would be to have an extra BPF command
> >> > to set log buffer (and level) per thread. And if such a log is set, it
> >> > would be overwritten with content on each bpf() syscall invocation
> >> > (i.e., log position would be reset to 0 on each BPF syscall).
> >>
> >> I don't think adding something stateful is a good idea; that's bound to
> >> lead to weird issues when someone messes up the state management in
> >> userspace.
> >
> > I agree, I'd prefer a stateless approach, but wanted to lay out a
> > stateful one for completeness as well.
>
> OK, cool.
>
> >>
> >> > Of course, the existing BPF_LOAD_PROG command would keep working with
> >> > existing log, but could fall back to the "common one", if
> >> > BPF_LOAD_PROG-specific one is not set.
> >> >
> >> > It also doesn't seem to be all that critical to signal when the log
> >> > buffer is overflown. It's pretty easy to detect from user-space:
> >> > - either last byte would be non-zero, if we don't care about
> >> > guaranteeing zero-termination for truncated log;
> >> > - of second-to-last byte would be non-zero, if BPF syscall will always
> >> > zero-terminate the log.
> >>
> >> I think if we're doing this we should think about making the contents of
> >> the log machine-readable, so applications can figure out what's going on
> >> without having to parse the text strings. The simplest would be to make
> >> this new log buffer use TLV-style messaging, say we define the log
> >> buffer output to be a series of messages like these:
> >>
> >> struct bpf_log_msg {
> >>        __u16 type;
> >>        __u32 len;
> >>        __u8 contents[]; /* of size 'len' */
> >> } __attribute__((packed));
> >>
> >> To begin with we could define two types:
> >>
> >> struct bpf_log_msg_string {
> >>        __u16 type; /* BPF_LOG_MSG_TYPE_STRING */
> >>        __u32 len;
> >>        char message[];
> >> }  __attribute__((packed));
> >>
> >> struct bpf_log_msg_end {
> >>        __u16 type; /* BPF_LOG_MSG_TYPE_END */
> >>        __u32 len;
> >>        __u16 num_truncations;
> >> }  __attribute__((packed));
> >>
> >> The TYPE_STRING would just be a wrapper for the existing text-based
> >> messages, but delimited so userspace can pick them apart. And the second
> >> one would solve your 'has the log been truncated' issue above; the
> >> kernel simply always reserves eight bytes at the end of the buffer and
> >> ends with writing a TYPE_END message with the number of messages that
> >> were dropped due to lack of space. That would make it trivial for
> >> userspace to detect truncation.
> >>
> >
> > Log truncation is not an issue, we can make bpf syscall to write back
> > the actual size of emitted log (and optionally extra bool to mean
> > "truncated") into the original log_size field.
>
> So what was all that you were talking about with "check the
> second-to-last byte" in your previous email? I understood that to be
> about detecting truncation?

Yeah, ignore my initial rambling. One can do that (detecting
truncationg) without any extra "feedback" from bpf syscall, but I
think returning filled length is probably a better approach and
doesn't hamper any other aspects.

>
> >> We could then add new message types later for machine-consumption. Say,
> >> and extended error code, or offsets into the BTF information, or
> >> whatever we end up needing. But just wrapping the existing log messages
> >> in TLVs like the ones above could be fairly straight-forwardly
> >> implemented with the existing bpf_log() infrastructure in the kernel,
> >> without having to settle on which machine-readable information is useful
> >> ahead of time... And the TLV format makes it easy for userspace to just
> >> skip message types it doesn't understand.
> >>
> >> WDYT?
> >
> > I think it's taking it a bit too far and adds more API on top of
> > existing API. Now all those types become a fixed API, messages become
> > an API, etc. Just more backwards compatibility stuff we need to
> > support forever, for, what I believe, very little gain.
> >
> > In practice, using human-readable strings works just fine. If there is
> > any kind of real issue, usually it involves humans reading debug logs
> > and understanding what's going on.
>
> Let me give an example of what I want to be able to do. Just today I
> helped someone debug getting xdp-filter to run, and they were getting
> output like this:
>
>       libbpf: -- BEGIN DUMP LOG ---
>       libbpf:
>       xdpfilt_alw_all() is not a global function
>       processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0
>       peak_states 0 mark_read 0
>
>       libbpf: -- END LOG --
>
> I would like to have xdp-filter catch that, and turn it into a
> friendlier "your compiler is too old" message. Having to effectively
> grep through the free-form log output to pick out that message feels
> brittle and error prone, as opposed to just having the kernel add a
> machine-readable id ("err_func_linkage_not_global" or somesuch) and
> stick it in a machine-parsable TLV.

Yeah, I guessed as much, but I still think it's unnecessarily
restrictive (for BPF kernel developers) to adhere to such a rigorous
error format. We'll have a few hundred different failure reasons, and
with the verifier constantly evolving some of them might get
deprecated, others will be constantly added, some other might change
the meaning, etc. I don't see this realistically working long term. So
at that point, free-form text-delimited text is as good as we should
(and, realistically, would) probably go. Library, if it wants to be
extra-helpful, can still reasonably easily parse line-delimited text,
even if that's not an ideal setup (especially if error messages will
get periodically changed).

But I'd be happy to hear what others think on this matter as well, of course.

>
> > Also adopting these packet-like messages is not as straightforward
> > through BPF code, as now you can't just construct a single log line
> > with few calls to bpf_log().
>
> Why not? bpf_log() could just transparently write the four bytes of
> header (TYPE_STRING, followed by strlen(msg)) into the buffer before the
> string? And in the future, an enhanced version could take (say) an error
> ID as another parameter and transparently add that as a separate message.

I mean when you construct one error message with few printf-like
functions. We do have that in libbpf, but I haven't checked the
verifier code. Basically, assuming one bpf_log() call is a complete
"message" might not be true.

>
> >> > Of course, if user code cares about such detection of log truncation,
> >> > it will need to set last/second-to-last bytes to zero before each
> >> > syscall.
> >> >
> >> > Both approaches have their pros/cons, we can dig into those later, but
> >> > I'd like to start this discussion and see what other people think. I
> >> > also wonder if there are other syscalls with similarly advanced input
> >> > arguments (like bpf_attr) and common logging, we could learn from
> >> > those.
> >>
> >> The first one that comes to mind is netlink extacks. Of course it's not
> >> quite comparable since netlink already has message-based semantics, but
> >> it does do sorta-kinda the same thing from a user PoV. The TLV format is
> >> obviously inspired by netlink (or, well, binary networking protocols in
> >> general).
> >>
> >
> > Yeah, I'm aware of extack, but as you said, TLV is more of a netlink
> > format, extack messages themselves are just strings. But my question
> > was more of how this log could be done for complicated API calls using
> > similar extendable attrs, like perf_event_open, clone3, openat2, etc.
>
> Ah, right, no idea :)
>
> -Toke
>
Toke Høiland-Jørgensen July 14, 2020, 10:19 p.m. UTC | #6
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

>> However, assuming it *is* possible, my larger point was that we
>> shouldn't add just a 'logging struct', but rather a 'common options
>> struct' which can be extended further as needed. And if it is *not*
>> possible to add new arguments to a syscall like you're proposing, my
>> suggestion above would be a different way to achieve basically the same
>> (at the cost of having to specify the maximum reserved space in advance).
>>
>
> yeah-yeah, I agree, it's less a "logging attr", more of "common attr
> across all commands".

Right, great. I think we are broadly in agreement with where we want to
go with this, actually :)

Let's see if anyone else chimes in; otherwise I guess I can incorporate
something along these lines in the next version of this series. I'm
going on vacation at the end of this week, though, so I will most likely
not be able to carry it to completion before then; but at least I can
post something for someone else to pick up (or if no one does it can
wait until I get back).

[...]

> Yeah, ignore my initial rambling. One can do that (detecting
> truncationg) without any extra "feedback" from bpf syscall, but I
> think returning filled length is probably a better approach and
> doesn't hamper any other aspects.

OK, sure, makes sense.

[...]

>> > Also adopting these packet-like messages is not as straightforward
>> > through BPF code, as now you can't just construct a single log line
>> > with few calls to bpf_log().
>>
>> Why not? bpf_log() could just transparently write the four bytes of
>> header (TYPE_STRING, followed by strlen(msg)) into the buffer before the
>> string? And in the future, an enhanced version could take (say) an error
>> ID as another parameter and transparently add that as a separate message.
>
> I mean when you construct one error message with few printf-like
> functions. We do have that in libbpf, but I haven't checked the
> verifier code. Basically, assuming one bpf_log() call is a complete
> "message" might not be true.

Ah, I see what you mean. I guess that could be worked around with a flag
or something, but I'll concede that in that case it's less of an obvious
drop-in replacement :)

-Toke
Alexei Starovoitov July 14, 2020, 11:11 p.m. UTC | #7
On Wed, Jul 15, 2020 at 12:19:03AM +0200, Toke Høiland-Jørgensen wrote:
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> 
> >> However, assuming it *is* possible, my larger point was that we
> >> shouldn't add just a 'logging struct', but rather a 'common options
> >> struct' which can be extended further as needed. And if it is *not*
> >> possible to add new arguments to a syscall like you're proposing, my
> >> suggestion above would be a different way to achieve basically the same
> >> (at the cost of having to specify the maximum reserved space in advance).
> >>
> >
> > yeah-yeah, I agree, it's less a "logging attr", more of "common attr
> > across all commands".
> 
> Right, great. I think we are broadly in agreement with where we want to
> go with this, actually :)

I really don't like 'common attr across all commands'.
Both of you are talking as libbpf developers who occasionally need to
add printk-s to the kernel. That is not an excuse to bloat api that will be
useful to two people.

The only reason log_buf sort-of make sense in raw_tp_open is because
btf comparison is moved from prog_load into raw_tp_open.
Miscompare of (prog_fd1, btf_id1) vs (prog_fd2, btf_id2) can be easily solved
by libbpf with as nice and as human friendly message libbpf can do.

I'm not convinced yet that it's a kernel job to print it nicely. It certainly can,
but it's quite a bit different from two existing bpf commands where log_buf is used:
PROG_LOAD and BTF_LOAD. In these two cases the kernel verifies the program
and the BTF. raw_tp_open is different, since the kernel needs to compare
that function signatures (prog_fd1, btf_id1) and (prog_fd2, btf_id2) are
exactly the same. The kernel can indicate that with single specific errno and
libbpf can print human friendly function signatures via btf_dump infra for
humans to see.
So I really don't see why log_buf is such a necessity for raw_tp_open.
Toke Høiland-Jørgensen July 15, 2020, 12:56 p.m. UTC | #8
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Wed, Jul 15, 2020 at 12:19:03AM +0200, Toke Høiland-Jørgensen wrote:
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>> 
>> >> However, assuming it *is* possible, my larger point was that we
>> >> shouldn't add just a 'logging struct', but rather a 'common options
>> >> struct' which can be extended further as needed. And if it is *not*
>> >> possible to add new arguments to a syscall like you're proposing, my
>> >> suggestion above would be a different way to achieve basically the same
>> >> (at the cost of having to specify the maximum reserved space in advance).
>> >>
>> >
>> > yeah-yeah, I agree, it's less a "logging attr", more of "common attr
>> > across all commands".
>> 
>> Right, great. I think we are broadly in agreement with where we want to
>> go with this, actually :)
>
> I really don't like 'common attr across all commands'.
> Both of you are talking as libbpf developers who occasionally need to
> add printk-s to the kernel. That is not an excuse to bloat api that will be
> useful to two people.

What? No, this is about making error messages comprehensible to people
who *can't* just go around adding printks. "Guess the source of the
EINVAL" is a really bad user experience!

> The only reason log_buf sort-of make sense in raw_tp_open is because
> btf comparison is moved from prog_load into raw_tp_open.
> Miscompare of (prog_fd1, btf_id1) vs (prog_fd2, btf_id2) can be easily solved
> by libbpf with as nice and as human friendly message libbpf can do.

So userspace is supposed to replicate all the checks done by the kernel
because we can't be bothered to add proper error messages? Really?

> I'm not convinced yet that it's a kernel job to print it nicely. It certainly can,
> but it's quite a bit different from two existing bpf commands where log_buf is used:
> PROG_LOAD and BTF_LOAD. In these two cases the kernel verifies the program
> and the BTF. raw_tp_open is different, since the kernel needs to compare
> that function signatures (prog_fd1, btf_id1) and (prog_fd2, btf_id2) are
> exactly the same. The kernel can indicate that with single specific errno and
> libbpf can print human friendly function signatures via btf_dump infra for
> humans to see.
> So I really don't see why log_buf is such a necessity for raw_tp_open.

I'll drop them from raw_tp_open for this series, but I still think they
should be added globally (or something like it). Returning a
user-friendly error message should be the absolute minimum we do. Just
like extack does for netlink.

-Toke
Andrii Nakryiko July 15, 2020, 7:02 p.m. UTC | #9
On Tue, Jul 14, 2020 at 4:11 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 12:19:03AM +0200, Toke Høiland-Jørgensen wrote:
> > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >
> > >> However, assuming it *is* possible, my larger point was that we
> > >> shouldn't add just a 'logging struct', but rather a 'common options
> > >> struct' which can be extended further as needed. And if it is *not*
> > >> possible to add new arguments to a syscall like you're proposing, my
> > >> suggestion above would be a different way to achieve basically the same
> > >> (at the cost of having to specify the maximum reserved space in advance).
> > >>
> > >
> > > yeah-yeah, I agree, it's less a "logging attr", more of "common attr
> > > across all commands".
> >
> > Right, great. I think we are broadly in agreement with where we want to
> > go with this, actually :)
>
> I really don't like 'common attr across all commands'.
> Both of you are talking as libbpf developers who occasionally need to
> add printk-s to the kernel.

How did you come to this conclusion?

Inability to figure out what's wrong when using BPF is at the top of
complaints from many users, together with hard to understand logs from
verifier.

> That is not an excuse to bloat api that will be
> useful to two people.

What do you mean specifically by bloat in API? I could understand how
it would bloat the API if we were to add log-related fields into every
part of bpf_attr, one for each type of commands. But that's exactly
what I advocate to not do.

But having even a slight improvement of error reporting, beyond
current -EINVAL, -E2BIG, etc, would improve experience immensely for
*all* BPF users.

>
> The only reason log_buf sort-of make sense in raw_tp_open is because
> btf comparison is moved from prog_load into raw_tp_open.
> Miscompare of (prog_fd1, btf_id1) vs (prog_fd2, btf_id2) can be easily solved
> by libbpf with as nice and as human friendly message libbpf can do.
>
> I'm not convinced yet that it's a kernel job to print it nicely. It certainly can,
> but it's quite a bit different from two existing bpf commands where log_buf is used:
> PROG_LOAD and BTF_LOAD. In these two cases the kernel verifies the program
> and the BTF. raw_tp_open is different, since the kernel needs to compare
> that function signatures (prog_fd1, btf_id1) and (prog_fd2, btf_id2) are
> exactly the same. The kernel can indicate that with single specific errno and
> libbpf can print human friendly function signatures via btf_dump infra for
> humans to see.
> So I really don't see why log_buf is such a necessity for raw_tp_open.
Alexei Starovoitov July 15, 2020, 11:41 p.m. UTC | #10
On Wed, Jul 15, 2020 at 02:56:36PM +0200, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> 
> > On Wed, Jul 15, 2020 at 12:19:03AM +0200, Toke Høiland-Jørgensen wrote:
> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> >> 
> >> >> However, assuming it *is* possible, my larger point was that we
> >> >> shouldn't add just a 'logging struct', but rather a 'common options
> >> >> struct' which can be extended further as needed. And if it is *not*
> >> >> possible to add new arguments to a syscall like you're proposing, my
> >> >> suggestion above would be a different way to achieve basically the same
> >> >> (at the cost of having to specify the maximum reserved space in advance).
> >> >>
> >> >
> >> > yeah-yeah, I agree, it's less a "logging attr", more of "common attr
> >> > across all commands".
> >> 
> >> Right, great. I think we are broadly in agreement with where we want to
> >> go with this, actually :)
> >
> > I really don't like 'common attr across all commands'.
> > Both of you are talking as libbpf developers who occasionally need to
> > add printk-s to the kernel. That is not an excuse to bloat api that will be
> > useful to two people.
> 
> What? No, this is about making error messages comprehensible to people
> who *can't* just go around adding printks. "Guess the source of the
> EINVAL" is a really bad user experience!
> 
> > The only reason log_buf sort-of make sense in raw_tp_open is because
> > btf comparison is moved from prog_load into raw_tp_open.
> > Miscompare of (prog_fd1, btf_id1) vs (prog_fd2, btf_id2) can be easily solved
> > by libbpf with as nice and as human friendly message libbpf can do.
> 
> So userspace is supposed to replicate all the checks done by the kernel
> because we can't be bothered to add proper error messages? Really?

That's not what I said. The kernel can report unique errno for miscompare
and all nice messages can and _should be_ be printed by libbpf.


On Wed, Jul 15, 2020, Andrii Nakryiko wrote:
>
> Inability to figure out what's wrong when using BPF is at the top of
> complaints from many users, together with hard to understand logs from
> verifier.

Only the second part is true. All users are complaining about the verifier.
No one is complaing that failed prog attach is somehow lacking string message.
The users are also complaing about libbpf being too verbose.
Yet you've refused to address the verbosity where it should be reduced and
now refusing to add it where it's needed.
It's libbpf job to explain users kernel errors.

The same thing is happening with perf_event_open syscall.
Every one who's trying to code it directly complaining about the kernel. But
not a single user is complaing about perf syscall when they use libraries and
tools. Same thing with bpf syscall. libbpf is the interface. It needs to clear
and to the point. Right now it's not doing it well. elf dump is too verbose and
unnecessary whereas in other places it says nothing informative where it
could have printed human hint.

libbpf's pr_perm_msg() hint is the only one where libbpf cares about its users.
All other messages are useful to libbpf developers and not its users.

The kernel verifier messages suck as well. They need to be improved.
But this thread 'lets add strings everywhere and users will be happy' is
completely missing the mark.
Andrii Nakryiko July 16, 2020, 1:11 a.m. UTC | #11
On Wed, Jul 15, 2020 at 4:41 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 02:56:36PM +0200, Toke Høiland-Jørgensen wrote:
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >
> > > On Wed, Jul 15, 2020 at 12:19:03AM +0200, Toke Høiland-Jørgensen wrote:
> > >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> > >>
> > >> >> However, assuming it *is* possible, my larger point was that we
> > >> >> shouldn't add just a 'logging struct', but rather a 'common options
> > >> >> struct' which can be extended further as needed. And if it is *not*
> > >> >> possible to add new arguments to a syscall like you're proposing, my
> > >> >> suggestion above would be a different way to achieve basically the same
> > >> >> (at the cost of having to specify the maximum reserved space in advance).
> > >> >>
> > >> >
> > >> > yeah-yeah, I agree, it's less a "logging attr", more of "common attr
> > >> > across all commands".
> > >>
> > >> Right, great. I think we are broadly in agreement with where we want to
> > >> go with this, actually :)
> > >
> > > I really don't like 'common attr across all commands'.
> > > Both of you are talking as libbpf developers who occasionally need to
> > > add printk-s to the kernel. That is not an excuse to bloat api that will be
> > > useful to two people.
> >
> > What? No, this is about making error messages comprehensible to people
> > who *can't* just go around adding printks. "Guess the source of the
> > EINVAL" is a really bad user experience!
> >
> > > The only reason log_buf sort-of make sense in raw_tp_open is because
> > > btf comparison is moved from prog_load into raw_tp_open.
> > > Miscompare of (prog_fd1, btf_id1) vs (prog_fd2, btf_id2) can be easily solved
> > > by libbpf with as nice and as human friendly message libbpf can do.
> >
> > So userspace is supposed to replicate all the checks done by the kernel
> > because we can't be bothered to add proper error messages? Really?
>
> That's not what I said. The kernel can report unique errno for miscompare
> and all nice messages can and _should be_ be printed by libbpf.
>
>
> On Wed, Jul 15, 2020, Andrii Nakryiko wrote:
> >
> > Inability to figure out what's wrong when using BPF is at the top of
> > complaints from many users, together with hard to understand logs from
> > verifier.
>
> Only the second part is true. All users are complaining about the verifier.
> No one is complaing that failed prog attach is somehow lacking string message.

Ok, next time I'll be helping someone to figure out another -EINVAL,
I'll remember to reassure them that it's not really frustrating, not a
guess game, and not a time sink at all.

> The users are also complaing about libbpf being too verbose.

Very well might be, but apart from your complaints on that patch
adding program loading debug message, I can't remember a single case
when someone complained about that. Do you have a link for me to get
some context?

> Yet you've refused to address the verbosity where it should be reduced and

It's open-source, everyone is welcome to submit their patches. Just
because I don't think we need to remove some log messages and thus not
am creating such patches, doesn't mean it can't be done by someone
else.

> now refusing to add it where it's needed.

Can you point to or quote where I refused to add a helpful message to libbpf?

> It's libbpf job to explain users kernel errors.

To the best of its ability, yes. Unfortunately there were many times
where I, as a human, couldn't figure it out without printk'ing my way
around the kernel. If I can't do that, I can't teach libbpf to do it.
Error codes are just not granular enough to allow distinguishing a lot
of error conditions, either by humans or automatically by libbpf.

>
> The same thing is happening with perf_event_open syscall.
> Every one who's trying to code it directly complaining about the kernel. But
> not a single user is complaing about perf syscall when they use libraries and
> tools. Same thing with bpf syscall. libbpf is the interface. It needs to clear
> and to the point. Right now it's not doing it well. elf dump is too verbose and
> unnecessary whereas in other places it says nothing informative where it
> could have printed human hint.
>
> libbpf's pr_perm_msg() hint is the only one where libbpf cares about its users.
> All other messages are useful to libbpf developers and not its users.

"Couldn't load trivial BPF program. Make sure your kernel supports BPF
(CONFIG_BPF_SYSCALL=y) and/or that RLIMIT_MEMLOCK is set to big enough
value."
"kernel doesn't support global data"
"can't attach BPF program w/o FD (did you load it?)"
"specified path %s is not on BPF FS"
"vmlinux BTF is not found" -- we should clearly add "you need a kernel
built with CONFIG_DEBUG_INFO_BTF=y"
"invalid relo for \'%s\' in special section 0x%x; forgot to initialize
global var?.."

And so on. Sure, we can have more and better error messages, no one is
claiming libbpf is perfect or done.

That doesn't mean, though, that the kernel itself can't do better in
terms of error reporting. But you clearly don't think it's a real
problem, so I'll let it rest, thank you.

>
> The kernel verifier messages suck as well. They need to be improved.
> But this thread 'lets add strings everywhere and users will be happy' is
> completely missing the mark.
Alexei Starovoitov July 16, 2020, 5:44 a.m. UTC | #12
On Wed, Jul 15, 2020 at 06:11:39PM -0700, Andrii Nakryiko wrote:
> >
> > On Wed, Jul 15, 2020, Andrii Nakryiko wrote:
> > >
> > > Inability to figure out what's wrong when using BPF is at the top of
> > > complaints from many users, together with hard to understand logs from
> > > verifier.
> >
> > Only the second part is true. All users are complaining about the verifier.
> > No one is complaing that failed prog attach is somehow lacking string message.
> 
> Ok, next time I'll be helping someone to figure out another -EINVAL,
> I'll remember to reassure them that it's not really frustrating, not a
> guess game, and not a time sink at all.

When the next time the users hit EINVAL due to _their_ usage and not
due to kernel or libbpf bug and libbpf couldn't do anything to make
the error user friendly then yes please raise it up.

> > The users are also complaing about libbpf being too verbose.
> 
> Very well might be, but apart from your complaints on that patch
> adding program loading debug message, I can't remember a single case
> when someone complained about that. Do you have a link for me to get
> some context?
> 
> > Yet you've refused to address the verbosity where it should be reduced and
> 
> It's open-source, everyone is welcome to submit their patches. Just
> because I don't think we need to remove some log messages and thus not
> am creating such patches, doesn't mean it can't be done by someone
> else.

So myself and Toke are wearing 'bpf user' hat in that context.
Both of us indicated that libbpf output is too verbose.
Your response "just send a patch" is a sure way to turn away more users.

> > now refusing to add it where it's needed.
> 
> Can you point to or quote where I refused to add a helpful message to libbpf?

see below for detailed example.

> 
> > It's libbpf job to explain users kernel errors.
> 
> To the best of its ability, yes. Unfortunately there were many times
> where I, as a human, couldn't figure it out without printk'ing my way
> around the kernel. If I can't do that, I can't teach libbpf to do it.
> Error codes are just not granular enough to allow distinguishing a lot
> of error conditions, either by humans or automatically by libbpf.

If there are such cases please bring it up. I'm sure kernel errnos
can become more unique.

> >
> > The same thing is happening with perf_event_open syscall.
> > Every one who's trying to code it directly complaining about the kernel. But
> > not a single user is complaing about perf syscall when they use libraries and
> > tools. Same thing with bpf syscall. libbpf is the interface. It needs to clear
> > and to the point. Right now it's not doing it well. elf dump is too verbose and
> > unnecessary whereas in other places it says nothing informative where it
> > could have printed human hint.
> >
> > libbpf's pr_perm_msg() hint is the only one where libbpf cares about its users.
> > All other messages are useful to libbpf developers and not its users.
> 
> "Couldn't load trivial BPF program. Make sure your kernel supports BPF
> (CONFIG_BPF_SYSCALL=y) and/or that RLIMIT_MEMLOCK is set to big enough
> value."
> "kernel doesn't support global data"
> "can't attach BPF program w/o FD (did you load it?)"
> "specified path %s is not on BPF FS"
> "vmlinux BTF is not found" -- we should clearly add "you need a kernel
> built with CONFIG_DEBUG_INFO_BTF=y"
> "invalid relo for \'%s\' in special section 0x%x; forgot to initialize
> global var?.."

sure. those count too.

> That doesn't mean, though, that the kernel itself can't do better in
> terms of error reporting. But you clearly don't think it's a real
> problem, so I'll let it rest, thank you.

It is a real problem and libbpf has to step up to address it.
The kernel does everything it can already.

Let's take this raw_tp_open patches as an example.
Currently raw_tp_open will EINVAL if prog_fd is incorrect, tp name
is not specified or expected_attach_type doesn't match.
If that happens it's a _libbpf_ bug. It's not a user mistake.
With Toke's patches tgt prog_fd and btf_id are added.
Both can be incorrect. If that happens it's most likely libbpf bug.
The user is writing their bpf_prog.c file with SEC("freplace/name")
libbpf could have messed up prog_fd and btf_id resolution.
The user didn't specify btf_id as a raw integer.
It was a libbpf job to convert user's string name into prog_fd
and btf_id in the first place. If libbpf messes it it shouldn't rely
on the kernel to catch such bugs. It's not a job of the kernel to
point out bugs in the libraries that suppose to be tightly
coupled with the kernel.
When libbpf is doing its job correctly tgt_prog_fd and tgt_btf_id are
valid and BTF of the extension prog can miscompare with BTF of
target prog it's trying to attach to via raw_tp_open with these two
arguments. Take a look at Toke's patches. That comparison is done via
btf_check_type_match(). The most helpful kernel message in such case
will be 'arg2 in foo() has size 4 while bar() has 2'.
That's the best the kernel can do. Yet it's very user unfriendly.
The kernel has no infra today to print BTF in a human friendly way.
But libbpf has BTF dumper. btf_dump__emit_type_decl() alone is
magnitude friendlier to users than kernel messages.
When the kernel's raw_tp_open syscall realizes that BTFs don't match
it can return single unique errno and libbpf can tell the user:
"
Function prototype of BPF prog:
   int ext_bpf_prog(struct __sk_buff *);
doesn't match function prototype of target prog:
   int tgt_prog_name(struct xdp_md *);
"
Everything is available in libbpf to print such error in human friendly way.
Whereas the kernel has no ability to do so.
If that message was coming from the kernel it would have been from
btf_check_type_match() and would be:
"arg1 in ext_bpf_prog() is not a pointer to context".
Compare that to above libbpf error message.
I think the difference is pretty drastic.
libbpf by far is a better place to print human friendly errors.

I'm arguing that raw_tp_open with additional tgt_prog_fd and tgt_btf_id
parameters is not much different from sys_kcmp syscall and from memcmp() libc
function. They compare two objects and return equal or not.
It's not a job of the kernel to say
'objects are not equal because byte 2 is different'.
Such message is not helpeful to users.
It's libbpf job to print two BTFs in human friendly way when the kernel
found them different. It's easier for human to glance over two function
prototypes and spot the difference instead of 'arg1 is not a pointer to context'.

So please make libbpf user friendly. Enough of pointing fingers at the kernel.
Andrii Nakryiko July 16, 2020, 7:59 p.m. UTC | #13
On Wed, Jul 15, 2020 at 10:44 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jul 15, 2020 at 06:11:39PM -0700, Andrii Nakryiko wrote:
> > >
> > > On Wed, Jul 15, 2020, Andrii Nakryiko wrote:
> > > >
> > > > Inability to figure out what's wrong when using BPF is at the top of
> > > > complaints from many users, together with hard to understand logs from
> > > > verifier.
> > >
> > > Only the second part is true. All users are complaining about the verifier.
> > > No one is complaing that failed prog attach is somehow lacking string message.
> >
> > Ok, next time I'll be helping someone to figure out another -EINVAL,
> > I'll remember to reassure them that it's not really frustrating, not a
> > guess game, and not a time sink at all.
>
> When the next time the users hit EINVAL due to _their_ usage and not
> due to kernel or libbpf bug and libbpf couldn't do anything to make
> the error user friendly then yes please raise it up.

I know this is futile to convince you anyways, so I won't go dig all
the details, but here are few general examples. With the benefit of
hindsight in a lot of those cases libbpf can do extra checks and
guessing (though bad guess is worse than no guess), but that doesn't
scale because of the sheer amount of possible situations.

What I personally went through when I was building runqslower:

1. You write a simple BPF program, open + load. You run it, you get EPERM.
2. You go check if you ran the program with sudo, retry with sudo.
3. You still get EPERM. You bang your head against the wall for 30
minutes, you recall about RLIMIT_MEMLOCK, you bump it up.
4. If you are happy, now everything works. I don't remember now if I
had to bang my head some more until I got the initial minimal version
of runqslower successfully load BPF program.


Some popular other issues I can recall.

1. People get excited and try to use fentry/fexit, get -EINVAL. Most
probably it's because they don't have a recent enough kernel.
Otherwise (if they know libbpf behavior), they'd probably have gotten
something about not being able to find the desired BTF func in the
kernel. Even in that case, is that that they don't have BTF at all
(CONFIG_DEBUG_BTF_INFO=y)? Or it's just an old one with no FUNCs
because pahole isn't recent enough? Kernel build won't mention that
you need 1.16 for fentry/fexit, of course. And so on. This is actually
not the worst case, because I can walk them through this process.

2. When dealing with cgroups. You get EINVAL for every tiny misstep.
You want to replace the BPF program, you get EINVAL. Maybe you forgot
one of the necessary flags? Or maybe you specified incompatible flags
(BPF_F_REPLACE | BPF_F_ALLOW_MULTI)? Or the parent cgroup has a
non-overridable BPF program attached already? Or maybe prog FD is
wrong, or cgroup FD is wrong, or?...

3. Someone gets excited about libbpf-tools in BCC, decides to convert
a tool. Needs to dump the hashmap fast. There is BATCH_LOOKUP, nice!
They try, they get EINVAL. Most probably outdated kernel, but I'll
need to dig into kernel sources to see what else could go wrong.

4. Try using some of the low-level APIs from libbpf/bpf.h, like the
same cgroup program replacement. You'll get E2BIG and will scratch
your head for a while, checking how the list of attached BPF programs
can be too big, if it's the only one. Just to realize that again, your
kernel is just a touch too old and just complains about non-zero
unknown field in bpf_attr.

5. Just few days ago, one user was doing bpf_program__attach_tracepoint():

libbpf: program 'tp/sched/sched_process_exit': failed to attach to pfd
10: File exists

After a bit of back and forth turns out he had a second instance of
that program running in parallel. Good, I quickly realize that it's an
old kernel and it doesn't allow me to attach more than 1 BPF program
to the tracepoint. Case solved. But what about many other BPF users
that do not have access to someone developing BPF in the kernel?

And the list goes on. Even if it was my full-time job just to
anticipate all the misuses and try to check/guess them in libbpf, that
wouldn't work and won't scale.

And all I was asking (and not finger pointing or blaming anything or
anyone) to have a mechanism in the kernel to get a single-line
human-readable hint as to which one out of many
EINVAL/E2BIG/ENOENT/EPERM conditions was hit. No need to immediately
convert all of them, we could have gradually added that, prioritizing
common and most probable ones to hit. But alas, it's easier to blame
me for not perfecting libbpf and anticipating all that beforehand.


>
> > > The users are also complaing about libbpf being too verbose.
> >
> > Very well might be, but apart from your complaints on that patch
> > adding program loading debug message, I can't remember a single case
> > when someone complained about that. Do you have a link for me to get
> > some context?
> >
> > > Yet you've refused to address the verbosity where it should be reduced and
> >
> > It's open-source, everyone is welcome to submit their patches. Just
> > because I don't think we need to remove some log messages and thus not
> > am creating such patches, doesn't mean it can't be done by someone
> > else.
>
> So myself and Toke are wearing 'bpf user' hat in that context.
> Both of us indicated that libbpf output is too verbose.
> Your response "just send a patch" is a sure way to turn away more users.
>

I can't find any such complaint from Toke in this thread, and can't
really recall something like that from recent discussions. I'd rather
have him speak for himself.

But neither you nor Toke are "just bpf users", you are totally capable
of creating a patch addressing something that bugs you.

Upstream BPF users have never complained to me about the verbosity of
logs. Furthermore, they rarely (if ever) see those **debug** logs,
until I actually ask them to turn it on to help me help them. Yes,
debug-level libbpf logs are driven more towards me and other people
(like Toke) who know internals and are helping users with issues. I
actually try to help BPF users, not ask them for patches to libbpf,
and you can find many patches from me where I was
improving/fixing/adding to kernel and libbpf as a response to both
upstream and internal BPF users. I am far from asking for recognition
of that, but please don't make false accusations either.

> > > now refusing to add it where it's needed.
> >
> > Can you point to or quote where I refused to add a helpful message to libbpf?
>
> see below for detailed example.
>
> >
> > > It's libbpf job to explain users kernel errors.
> >
> > To the best of its ability, yes. Unfortunately there were many times
> > where I, as a human, couldn't figure it out without printk'ing my way
> > around the kernel. If I can't do that, I can't teach libbpf to do it.
> > Error codes are just not granular enough to allow distinguishing a lot
> > of error conditions, either by humans or automatically by libbpf.
>
> If there are such cases please bring it up. I'm sure kernel errnos
> can become more unique.

See above for a few typical examples. I don't keep track of all the
issues I had to debug myself or help others with.

>
> > >
> > > The same thing is happening with perf_event_open syscall.
> > > Every one who's trying to code it directly complaining about the kernel. But
> > > not a single user is complaing about perf syscall when they use libraries and
> > > tools. Same thing with bpf syscall. libbpf is the interface. It needs to clear
> > > and to the point. Right now it's not doing it well. elf dump is too verbose and
> > > unnecessary whereas in other places it says nothing informative where it
> > > could have printed human hint.
> > >
> > > libbpf's pr_perm_msg() hint is the only one where libbpf cares about its users.
> > > All other messages are useful to libbpf developers and not its users.
> >
> > "Couldn't load trivial BPF program. Make sure your kernel supports BPF
> > (CONFIG_BPF_SYSCALL=y) and/or that RLIMIT_MEMLOCK is set to big enough
> > value."
> > "kernel doesn't support global data"
> > "can't attach BPF program w/o FD (did you load it?)"
> > "specified path %s is not on BPF FS"
> > "vmlinux BTF is not found" -- we should clearly add "you need a kernel
> > built with CONFIG_DEBUG_INFO_BTF=y"
> > "invalid relo for \'%s\' in special section 0x%x; forgot to initialize
> > global var?.."
>
> sure. those count too.

great, thanks for acknowledging

>
> > That doesn't mean, though, that the kernel itself can't do better in
> > terms of error reporting. But you clearly don't think it's a real
> > problem, so I'll let it rest, thank you.
>
> It is a real problem and libbpf has to step up to address it.
> The kernel does everything it can already.
>
> Let's take this raw_tp_open patches as an example.
> Currently raw_tp_open will EINVAL if prog_fd is incorrect, tp name
> is not specified or expected_attach_type doesn't match.
> If that happens it's a _libbpf_ bug. It's not a user mistake.
> With Toke's patches tgt prog_fd and btf_id are added.
> Both can be incorrect. If that happens it's most likely libbpf bug.
> The user is writing their bpf_prog.c file with SEC("freplace/name")
> libbpf could have messed up prog_fd and btf_id resolution.
> The user didn't specify btf_id as a raw integer.
> It was a libbpf job to convert user's string name into prog_fd
> and btf_id in the first place. If libbpf messes it it shouldn't rely
> on the kernel to catch such bugs. It's not a job of the kernel to
> point out bugs in the libraries that suppose to be tightly
> coupled with the kernel.
> When libbpf is doing its job correctly tgt_prog_fd and tgt_btf_id are
> valid and BTF of the extension prog can miscompare with BTF of
> target prog it's trying to attach to via raw_tp_open with these two
> arguments. Take a look at Toke's patches. That comparison is done via
> btf_check_type_match(). The most helpful kernel message in such case
> will be 'arg2 in foo() has size 4 while bar() has 2'.
> That's the best the kernel can do. Yet it's very user unfriendly.
> The kernel has no infra today to print BTF in a human friendly way.
> But libbpf has BTF dumper. btf_dump__emit_type_decl() alone is
> magnitude friendlier to users than kernel messages.
> When the kernel's raw_tp_open syscall realizes that BTFs don't match
> it can return single unique errno and libbpf can tell the user:
> "
> Function prototype of BPF prog:
>    int ext_bpf_prog(struct __sk_buff *);
> doesn't match function prototype of target prog:
>    int tgt_prog_name(struct xdp_md *);
> "
> Everything is available in libbpf to print such error in human friendly way.
> Whereas the kernel has no ability to do so.
> If that message was coming from the kernel it would have been from
> btf_check_type_match() and would be:
> "arg1 in ext_bpf_prog() is not a pointer to context".
> Compare that to above libbpf error message.
> I think the difference is pretty drastic.
> libbpf by far is a better place to print human friendly errors.
>
> I'm arguing that raw_tp_open with additional tgt_prog_fd and tgt_btf_id
> parameters is not much different from sys_kcmp syscall and from memcmp() libc
> function. They compare two objects and return equal or not.
> It's not a job of the kernel to say
> 'objects are not equal because byte 2 is different'.
> Such message is not helpeful to users.
> It's libbpf job to print two BTFs in human friendly way when the kernel
> found them different. It's easier for human to glance over two function
> prototypes and spot the difference instead of 'arg1 is not a pointer to context'.

Ok, so I assume that is an example where I refused to add helpful
error reporting to libbpf?

First, thanks for taking the time to elaborate on this. I'm sure Toke
will appreciate it, though I have nothing to do with arguing for or
against this. It was Toke's scepticism, not mine, that you are
addressing. I never said we shouldn't do this.

I do agree that in some cases libbpf has the same or more information
and it's easier to pretty-print extra context, like in this case,
which we should take advantage of.

I started this topic to discuss extending bpf() syscall in general,
regardless of this particular change and feature.

>
> So please make libbpf user friendly. Enough of pointing fingers at the kernel.

It's an honor you are associating me personally and libbpf, but I
still hope we can move this forward as a community and collective
project. Of course let's make libbpf more user-friendly where possible
and feasible, but I don't think arguing about debug-level logging in
libbpf is the right place to start.

I don't know where "finger pointing" comes from, I had no intent nor I
said anything to offend either the kernel or you personally. Proposing
changes up for discussion shouldn't be considered a form of attacking
or finger pointing, I hope.
Toke Høiland-Jørgensen July 16, 2020, 8:19 p.m. UTC | #14
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

>> So myself and Toke are wearing 'bpf user' hat in that context.
>> Both of us indicated that libbpf output is too verbose.
>> Your response "just send a patch" is a sure way to turn away more users.
>>
>
> I can't find any such complaint from Toke in this thread, and can't
> really recall something like that from recent discussions. I'd rather
> have him speak for himself.

I think what I said (not in this thread, way back during some other
discussion) was that I agreed that libbpf was being too verbose by
dumping all the sections and relocations it finds when reading an ELF
file, which causes the useful error messages to get lost. I would like
to see those messages demoted to another log level, or removed
altogether.

I won't have time to look more at this right now, but I do plan to
circle back to it: I agree with you that we need to make this more
friendly. And yes, I also think this should include finding a way to
disambiguate between different conditions leading to the same error from
the kernel. I've run into a lot of the same issues as you when
supporting people who are new to BPF - thank you for the extensive list!

-Toke
Alexei Starovoitov July 17, 2020, 3:09 a.m. UTC | #15
On Thu, Jul 16, 2020 at 12:59:30PM -0700, Andrii Nakryiko wrote:
> On Wed, Jul 15, 2020 at 10:44 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Jul 15, 2020 at 06:11:39PM -0700, Andrii Nakryiko wrote:
> > > >
> > > > On Wed, Jul 15, 2020, Andrii Nakryiko wrote:
> > > > >
> > > > > Inability to figure out what's wrong when using BPF is at the top of
> > > > > complaints from many users, together with hard to understand logs from
> > > > > verifier.
> > > >
> > > > Only the second part is true. All users are complaining about the verifier.
> > > > No one is complaing that failed prog attach is somehow lacking string message.
> > >
> > > Ok, next time I'll be helping someone to figure out another -EINVAL,
> > > I'll remember to reassure them that it's not really frustrating, not a
> > > guess game, and not a time sink at all.
> >
> > When the next time the users hit EINVAL due to _their_ usage and not
> > due to kernel or libbpf bug and libbpf couldn't do anything to make
> > the error user friendly then yes please raise it up.
> 
> I know this is futile to convince you anyways, so I won't go dig all
> the details, but here are few general examples. With the benefit of
> hindsight in a lot of those cases libbpf can do extra checks and
> guessing (though bad guess is worse than no guess), but that doesn't
> scale because of the sheer amount of possible situations.
> 
> What I personally went through when I was building runqslower:
> 
> 1. You write a simple BPF program, open + load. You run it, you get EPERM.
> 2. You go check if you ran the program with sudo, retry with sudo.
> 3. You still get EPERM. You bang your head against the wall for 30
> minutes, you recall about RLIMIT_MEMLOCK, you bump it up.
> 4. If you are happy, now everything works. I don't remember now if I
> had to bang my head some more until I got the initial minimal version
> of runqslower successfully load BPF program.

But we have libbpf hinting on that already?
Are you saying that hint wasn't working somehow?

> 
> Some popular other issues I can recall.
> 
> 1. People get excited and try to use fentry/fexit, get -EINVAL. Most
> probably it's because they don't have a recent enough kernel.
> Otherwise (if they know libbpf behavior), they'd probably have gotten
> something about not being able to find the desired BTF func in the
> kernel. Even in that case, is that that they don't have BTF at all
> (CONFIG_DEBUG_BTF_INFO=y)? Or it's just an old one with no FUNCs
> because pahole isn't recent enough? Kernel build won't mention that
> you need 1.16 for fentry/fexit, of course. And so on. This is actually
> not the worst case, because I can walk them through this process.

Exactly. Either old kernel or missing config in the kernel.
libbpf is above that. It could have provided a hint.
When libbpf is processing SEC("fentry/") and it fails to load with
empty verifier log it could start probing.
Only libbpf can do it. Kernel is helpless here.
Say we change the kernel errno for all unsuported prog types and maps
it would return ENOTSUPP or something.
Would it really help the situation?
Probably not. There will be old kernels and the same usability
issue as you described.

> 2. When dealing with cgroups. You get EINVAL for every tiny misstep.
> You want to replace the BPF program, you get EINVAL. Maybe you forgot
> one of the necessary flags? Or maybe you specified incompatible flags
> (BPF_F_REPLACE | BPF_F_ALLOW_MULTI)? Or the parent cgroup has a
> non-overridable BPF program attached already? Or maybe prog FD is
> wrong, or cgroup FD is wrong, or?...

A wrong FD for prog or cgroup would be a libbpf bug.
imo that's the case where kernel must not return a string even
if there is a log_buf facility.

non-overridable already attached is EPERM. It's not EINVAL.
multi-prog vs overridable is also EPERM.
Say, the kernel has log_buf for hierarchy_allows_attach().
what kind of message do you think it can print that will be user friendly?
"attach is not allowed because there is a parent cgroup
that has no override flag set"
How is it much better than EPERM?
Two differentiate between these two EPERM?
Both to me look like the same category if permission checks.
I don't know how to the kernel can print full cgroup path here.
It has 'struct cgroup *'. Then go to kernfs? then what?
cgroupfs is mounted somwhere.
To have meaningul message the user would want to see something like:
"
  /sys/fs/cgroup/my_container <- has no-override bpf prog, hence
  the kernel doesn't allow attach at:
  /sys/fs/cgroup/my_container/here
"
Only libbpf can print such message.
The user gave libbpf a string path. Attaching is relative to that.
Whereas the kernel has hard time with mounts/paths/namespaces.

May be to solve this cgroup attaching ambiguity we can add a libbpf
helper that can walk hierarchy and print state?
Either libbpf calls it automatically or user can trigger it?

> 3. Someone gets excited about libbpf-tools in BCC, decides to convert
> a tool. Needs to dump the hashmap fast. There is BATCH_LOOKUP, nice!
> They try, they get EINVAL. Most probably outdated kernel, but I'll
> need to dig into kernel sources to see what else could go wrong.

right. that is similar to 1. I don't see how kernel could have helped.
Say sys_bpf got top level log_buf for _all_ commands.
The user is passing a command that is unsupported.
The kernel has no clue whether log_buf is even there in bpf_attr.
It cannot return a string. Only errno.
If we change that errno from current EINVAL to ENOTSUPP...
it would go back to the point I made above.
Not really helping much.
imo that's another case where libbpf probing can go long way to
improve user experience.

> 
> 4. Try using some of the low-level APIs from libbpf/bpf.h, like the
> same cgroup program replacement. You'll get E2BIG and will scratch
> your head for a while, checking how the list of attached BPF programs
> can be too big, if it's the only one. Just to realize that again, your
> kernel is just a touch too old and just complains about non-zero
> unknown field in bpf_attr.

The low-level APIs of libbpf is indeed a pain.
There are also old and partially broken libbpf APIs.
I think we need to start aggressively deprecating some of them.
Especially those that cause debugging issues to users.

> 5. Just few days ago, one user was doing bpf_program__attach_tracepoint():
> 
> libbpf: program 'tp/sched/sched_process_exit': failed to attach to pfd
> 10: File exists
> 
> After a bit of back and forth turns out he had a second instance of
> that program running in parallel. Good, I quickly realize that it's an
> old kernel and it doesn't allow me to attach more than 1 BPF program
> to the tracepoint. Case solved. But what about many other BPF users
> that do not have access to someone developing BPF in the kernel?

I similarly don't see how kernel string would have helped.
It would have said the same thing: "cannot attach".

> And the list goes on. Even if it was my full-time job just to
> anticipate all the misuses and try to check/guess them in libbpf, that
> wouldn't work and won't scale.
> 
> And all I was asking (and not finger pointing or blaming anything or
> anyone) to have a mechanism in the kernel to get a single-line
> human-readable hint as to which one out of many
> EINVAL/E2BIG/ENOENT/EPERM conditions was hit. No need to immediately
> convert all of them, we could have gradually added that, prioritizing
> common and most probable ones to hit.

I'm not against adding log_buf to more bpf commands.
In many cases they are needed.
I'm against the blank statement that _all_ bpf commands need log_buf
and that is somehow will solve user debug nightmares.

> project. Of course let's make libbpf more user-friendly where possible
> and feasible,

Let's do so.
This thread jumped to early conclusion that log_buf for all bpf commands
will magically improve user experience. Both you and Toke were happy to
conclude that "horrible kernel UI/UX" is responsible for everything and
it has to be the one to fix. I don't think that's the case and I hope
working through the examples of bad user experience above made it clear.
Andrii Nakryiko July 18, 2020, 3:54 a.m. UTC | #16
On Thu, Jul 16, 2020 at 8:09 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Jul 16, 2020 at 12:59:30PM -0700, Andrii Nakryiko wrote:
> > On Wed, Jul 15, 2020 at 10:44 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Wed, Jul 15, 2020 at 06:11:39PM -0700, Andrii Nakryiko wrote:
> > > > >
> > > > > On Wed, Jul 15, 2020, Andrii Nakryiko wrote:
> > > > > >
> > > > > > Inability to figure out what's wrong when using BPF is at the top of
> > > > > > complaints from many users, together with hard to understand logs from
> > > > > > verifier.
> > > > >
> > > > > Only the second part is true. All users are complaining about the verifier.
> > > > > No one is complaing that failed prog attach is somehow lacking string message.
> > > >
> > > > Ok, next time I'll be helping someone to figure out another -EINVAL,
> > > > I'll remember to reassure them that it's not really frustrating, not a
> > > > guess game, and not a time sink at all.
> > >
> > > When the next time the users hit EINVAL due to _their_ usage and not
> > > due to kernel or libbpf bug and libbpf couldn't do anything to make
> > > the error user friendly then yes please raise it up.
> >
> > I know this is futile to convince you anyways, so I won't go dig all
> > the details, but here are few general examples. With the benefit of
> > hindsight in a lot of those cases libbpf can do extra checks and
> > guessing (though bad guess is worse than no guess), but that doesn't
> > scale because of the sheer amount of possible situations.
> >
> > What I personally went through when I was building runqslower:
> >
> > 1. You write a simple BPF program, open + load. You run it, you get EPERM.
> > 2. You go check if you ran the program with sudo, retry with sudo.
> > 3. You still get EPERM. You bang your head against the wall for 30
> > minutes, you recall about RLIMIT_MEMLOCK, you bump it up.
> > 4. If you are happy, now everything works. I don't remember now if I
> > had to bang my head some more until I got the initial minimal version
> > of runqslower successfully load BPF program.
>
> But we have libbpf hinting on that already?
> Are you saying that hint wasn't working somehow?

At that time the message was just "failed to load primitive BPF
program", we added mentions of CONFIG_BPF_SYSCALL and RLIMIT_MEMLOCK
much later. And even current message might be misleading, because it
might be that you are not a root (or, in more modern kernels, don't
have CAP_BPF). Maybe not, if we are relying on unprivileged BPF for
primitive prog, I don't know.

>
> >
> > Some popular other issues I can recall.
> >
> > 1. People get excited and try to use fentry/fexit, get -EINVAL. Most
> > probably it's because they don't have a recent enough kernel.
> > Otherwise (if they know libbpf behavior), they'd probably have gotten
> > something about not being able to find the desired BTF func in the
> > kernel. Even in that case, is that that they don't have BTF at all
> > (CONFIG_DEBUG_BTF_INFO=y)? Or it's just an old one with no FUNCs
> > because pahole isn't recent enough? Kernel build won't mention that
> > you need 1.16 for fentry/fexit, of course. And so on. This is actually
> > not the worst case, because I can walk them through this process.
>
> Exactly. Either old kernel or missing config in the kernel.
> libbpf is above that. It could have provided a hint.
> When libbpf is processing SEC("fentry/") and it fails to load with
> empty verifier log it could start probing.

As I said, listing these examples I didn't expect to convince you,
because I knew you'd propose one or another way how libbpf can provide
more "diagnostics". In a lot of cases it can. But usually after the
fact (debugging it in a painful way). I don't believe we can
anticipate all cases, though, and even those that we can, I'm not sure
how scalable it is to do all that. Requiring anyone adding any new API
to provide a comprehensive set of probing/checks/diagnostics to
prevent possible error conditions seems to be way too harsh on
contributors.

Unfortunately, details escape me now for the case I had in mind, where
I had absolutely no clue which of many checks are failing, even after
reading kernel source code very carefully. Too bad, but I don't keep
such detailed notes.

> Only libbpf can do it. Kernel is helpless here.
> Say we change the kernel errno for all unsuported prog types and maps
> it would return ENOTSUPP or something.
> Would it really help the situation?

IMO, if the kernel just prints out "Unknown BPF command 123" or
"Unknown map type 345" that would be already a nice improvement.


> Probably not. There will be old kernels and the same usability
> issue as you described.

Of course old kernels will still have those problems, but it never
stopped us from fixing and improving the kernel moving forward so that
eventually a better version is available (almost) everywhere.

>
> > 2. When dealing with cgroups. You get EINVAL for every tiny misstep.
> > You want to replace the BPF program, you get EINVAL. Maybe you forgot
> > one of the necessary flags? Or maybe you specified incompatible flags
> > (BPF_F_REPLACE | BPF_F_ALLOW_MULTI)? Or the parent cgroup has a
> > non-overridable BPF program attached already? Or maybe prog FD is
> > wrong, or cgroup FD is wrong, or?...
>
> A wrong FD for prog or cgroup would be a libbpf bug.

cgroup fd is the direct input parameter for
bpf_program__attach_cgroup() and bpf_prog_attach() APIs. But, of
course, we can also add a variant that takes a string path to cgroup
and fetches it. Or we can double-check that provided cgroup FD is
actually cgroup FD.

> imo that's the case where kernel must not return a string even
> if there is a log_buf facility.
>
> non-overridable already attached is EPERM. It's not EINVAL.
> multi-prog vs overridable is also EPERM.

Attaching the same BPF prog is EINVAL, though. But it doesn't matter
which error code specifically, either EPERM, or EINVAL, or even E2BIG
have many possible meanings in bpf() syscall.

> Say, the kernel has log_buf for hierarchy_allows_attach().
> what kind of message do you think it can print that will be user friendly?
> "attach is not allowed because there is a parent cgroup
> that has no override flag set"
> How is it much better than EPERM?

"parent cgroup prevents attaching BPF program" would already lead into
the right direction. It's not perfect and full, but better than more
generic EPERM. But sure, EPERM has less variants than EINVAL, so I'm
personally always happier with EPERM, than with EINVAL.

> Two differentiate between these two EPERM?
> Both to me look like the same category if permission checks.
> I don't know how to the kernel can print full cgroup path here.
> It has 'struct cgroup *'. Then go to kernfs? then what?
> cgroupfs is mounted somwhere.
> To have meaningul message the user would want to see something like:
> "
>   /sys/fs/cgroup/my_container <- has no-override bpf prog, hence
>   the kernel doesn't allow attach at:
>   /sys/fs/cgroup/my_container/here
> "
> Only libbpf can print such message.
> The user gave libbpf a string path. Attaching is relative to that.
> Whereas the kernel has hard time with mounts/paths/namespaces.

Sure, I didn't even hope for cgroup path or anything like that. Maybe
cgroup ID would be nice, don't know, but even without that just saying
that it's parent cgroup that's the culprit is a big step forward, IMO.

>
> May be to solve this cgroup attaching ambiguity we can add a libbpf
> helper that can walk hierarchy and print state?
> Either libbpf calls it automatically or user can trigger it?
>
> > 3. Someone gets excited about libbpf-tools in BCC, decides to convert
> > a tool. Needs to dump the hashmap fast. There is BATCH_LOOKUP, nice!
> > They try, they get EINVAL. Most probably outdated kernel, but I'll
> > need to dig into kernel sources to see what else could go wrong.
>
> right. that is similar to 1. I don't see how kernel could have helped.
> Say sys_bpf got top level log_buf for _all_ commands.
> The user is passing a command that is unsupported.
> The kernel has no clue whether log_buf is even there in bpf_attr.
> It cannot return a string. Only errno.
> If we change that errno from current EINVAL to ENOTSUPP...
> it would go back to the point I made above.
> Not really helping much.
> imo that's another case where libbpf probing can go long way to
> improve user experience.

log_buf can't help existing kernels. Period. No one is arguing or
expecting that. But moving forward, just having that "unknown command
123" would be great.

But yeah, of course libbpf can create a probing map and try to do
BATCH_LOOKUP, to detect BATCH_LOOKUP support.

>
> >
> > 4. Try using some of the low-level APIs from libbpf/bpf.h, like the
> > same cgroup program replacement. You'll get E2BIG and will scratch
> > your head for a while, checking how the list of attached BPF programs
> > can be too big, if it's the only one. Just to realize that again, your
> > kernel is just a touch too old and just complains about non-zero
> > unknown field in bpf_attr.
>
> The low-level APIs of libbpf is indeed a pain.
> There are also old and partially broken libbpf APIs.
> I think we need to start aggressively deprecating some of them.
> Especially those that cause debugging issues to users.
>
> > 5. Just few days ago, one user was doing bpf_program__attach_tracepoint():
> >
> > libbpf: program 'tp/sched/sched_process_exit': failed to attach to pfd
> > 10: File exists
> >
> > After a bit of back and forth turns out he had a second instance of
> > that program running in parallel. Good, I quickly realize that it's an
> > old kernel and it doesn't allow me to attach more than 1 BPF program
> > to the tracepoint. Case solved. But what about many other BPF users
> > that do not have access to someone developing BPF in the kernel?
>
> I similarly don't see how kernel string would have helped.
> It would have said the same thing: "cannot attach".

This one is for perf subsystem, actually, it's its
PERF_EVENT_IOC_SET_BPF ioctl (until we add bpf_link for perf_event
attachment).

"limit of allowed BPF programs reached" would be good enough. "cannot
attach" is too generic.

>
> > And the list goes on. Even if it was my full-time job just to
> > anticipate all the misuses and try to check/guess them in libbpf, that
> > wouldn't work and won't scale.
> >
> > And all I was asking (and not finger pointing or blaming anything or
> > anyone) to have a mechanism in the kernel to get a single-line
> > human-readable hint as to which one out of many
> > EINVAL/E2BIG/ENOENT/EPERM conditions was hit. No need to immediately
> > convert all of them, we could have gradually added that, prioritizing
> > common and most probable ones to hit.
>
> I'm not against adding log_buf to more bpf commands.
> In many cases they are needed.
> I'm against the blank statement that _all_ bpf commands need log_buf
> and that is somehow will solve user debug nightmares.

There was no such statement. There was no statement that it will
"solve" debug nightmares. But it will certainly ease the pain.

My proposal was about adding the ability to emit something to log_buf
from any of the BPF commands, if that BPF command chooses to provide
extra error information. The whole point of this was to avoid adding
log_buf in command-specific ways (as Toke was doing in the patch that
I used to initiate the discussion) and do it once for entire syscall,
so that we can gradually utilize it where it makes most sense.

>
> > project. Of course let's make libbpf more user-friendly where possible
> > and feasible,
>
> Let's do so.
> This thread jumped to early conclusion that log_buf for all bpf commands
> will magically improve user experience. Both you and Toke were happy to
> conclude that "horrible kernel UI/UX" is responsible for everything and
> it has to be the one to fix.

You somehow concluded that it has to be either kernel or libbpf that
has to be improved, not both. I don't know how you came to this
conclusion. I didn't say or mean that, neither I read that from Toke's
replies. Let me walk through relevant parts verbatim.

Me:
> > > > I think BPF syscall would benefit from common/generalized log support
> > > > across all commands, given how powerful/complex it already is.
> > > > Sometimes it's literally impossible to understand why one gets -EINVAL
> > > > without adding printk()s in the kernel.

Toke:
> > > Yes, I agree! This is horrible UI!

Me:
> > UI?.. It's a perfectly fine and extensible API for all functionality
> > it provides, it just needs a better human-readable feedback mechanism,
> > which is what I'm proposing. Error codes are not working when you have
>> so many different situations that can result in error.

... Some technical details back and forth follows ...

You:
> I really don't like 'common attr across all commands'.
> Both of you are talking as libbpf developers who occasionally need to
> add printk-s to the kernel. That is not an excuse to bloat api that will be
> useful to two people.

You not liking this I get, this is fine, we all have our preferences,
we don't have to agree on everything. But then you just jumped into
conclusion about our motivation and claimed that it will be useful to
only two people (presumably me and Toke). I'm also not sure about
bloating the API, given that such API is already part of BPF_PROG_LOAD
command, but there is no point in discussing such technical details
anymore.

Toke said that it's a horrible story to debug such generic errors (and
I concur, it is), not that the kernel is horrible itself or was
written by horrible people. I don't feel offended if libbpf provides
horrible user experience in parts I've implemented or changed. I might
disagree about the qualification in some cases, but I won't get
offended about someone not liking my code, API or design.


> I don't think that's the case and I hope
> working through the examples of bad user experience above made it clear.

Not at all. All you showed is that once someone runs into some
specific API misuse and debugs it to success, then it's usually pretty
obvious how libbpf could have helped doing additional diagnostics.
With the benefit of hindsight.

I agree that if such diagnostics are reliable and the situation itself
is common and experienced by multiple users, then it might make sense
to add such checks to libbpf. But I also don't think it's always
possible to diagnose something automatically with 100% confidence. We
can give hints, but misdiagnosing the problem can just further confuse
things instead. Also quite often such problems are one-offs, which
doesn't make them any less confusing and frustrating, but custom
diagnostics for every such case has a potential of bloating libbpf
beyond recognition and I'm not convinced that we should diagnose all
of them. That's what I meant that this is not a scalable approach to
just say "fix libbpf to be more user friendly, kernel does its best
already".


This thread was about generalizing log_buf to entire bpf(), even
though it got diverted into something entirely different. I would
still be interested to know the technical implications of syscall
using extra args (like in my original proposal #1, about adding 2
extra args if some flag in the existing argument is set). Is that even
safe to do with syscalls? Even if we don't do it, it's still useful to
know for the future. I think the syscall() wrapper is inherently
vararg and accepts an arbitrary number of args, so I assumed it's
actually ok, but don't know how to prove it.
Alexei Starovoitov July 20, 2020, 10:30 p.m. UTC | #17
On Fri, Jul 17, 2020 at 08:54:45PM -0700, Andrii Nakryiko wrote:
> 
> > Only libbpf can do it. Kernel is helpless here.
> > Say we change the kernel errno for all unsuported prog types and maps
> > it would return ENOTSUPP or something.
> > Would it really help the situation?
> 
> IMO, if the kernel just prints out "Unknown BPF command 123" or
> "Unknown map type 345" that would be already a nice improvement.
...
> log_buf can't help existing kernels. Period. No one is arguing or
> expecting that. But moving forward, just having that "unknown command
> 123" would be great.
> 
> But yeah, of course libbpf can create a probing map and try to do
> BATCH_LOOKUP, to detect BATCH_LOOKUP support.

Also with BTF the kernel is self documented.
The following will print all commands that kernel supports:
bpftool btf dump file ./bld_x64/vmlinux |grep -A40 bpf_cmd
and with 'grep BPF_MAP_TYPE_' all supported maps.
For older kernels there is 'bpftool feature probe'.
Since libbpf reads vmlinux BTF anyway it could have got a knowledge
of all the features it supports based on BTF.

But let's continue this thought experiment for augmenting error
reporting with a string.
For 'Unknown BPF command 123' to work log_buf needs to passed
outside of 'union bpf_attr'. Probably as 4th argument to sys_bpf ?
and then a bit in 'int cmd' would need to be burned to indicate
that 4th arg is there. Probably size of the arg needs to be passed
either as 5th arg or as part of the 'struct bpf_log_buf' so it can
be extensible.
Using that log_buf directly in
SYSCALL_DEFINE[45](bpf, int cmd, ... struct bpf_log_buf *log_buf)
will be trivial.
But to pass it into any of first level functions (bpf_iter_create,
bpf_prog_attach, etc) they would need to gain an extra argument.
To pass it all the way into hierarchy_allows_attach() it needs to be added to:
__cgroup_bpf_attach 
cgroup_bpf_attach
cgroup_bpf_link_attach
link_create

Another case of ambiguous 'return -EINVAL' would cause another change
to a bunch of function prototypes.
So it's better to integrate it into current task_struct to avoid huge code churn.

But if we do so what stops us from generalizing log_buf reporting to
other syscalls? perf_event_open is in the same category.

> This one is for perf subsystem, actually, it's its
> PERF_EVENT_IOC_SET_BPF ioctl (until we add bpf_link for perf_event
> attachment).

clearly not only sys_bpf and sys_perf_event_open, but sys_ioctl would need
string support to be a full answer to ambiguous einval-s.

> My proposal was about adding the ability to emit something to log_buf
> from any of the BPF commands, if that BPF command chooses to provide
> extra error information. The whole point of this was to avoid adding
> log_buf in command-specific ways (as Toke was doing in the patch that
> I used to initiate the discussion) and do it once for entire syscall,
> so that we can gradually utilize it where it makes most sense.

I don't think that works due to code churn. Whether we pay that price
once or 'gradually' it doesn't make it any better.
When log_buf is added to an existing command the 'union bpf_attr' is there
in the function proto and nothing new needs to passed to a lot of functions.
So I certainly prefer Toke's approach of adding log_buf to one specific
command if it's really needed there.

The alternative is to solve it for all syscalls.

> I agree that if such diagnostics are reliable and the situation itself
> is common and experienced by multiple users, then it might make sense
> to add such checks to libbpf. 

'experienced by multiple users' is going to be a judgement call
either for libbpf or for kernel.
I'm saying let's improve libbpf user-friendlyness everywhere we can.
We can always drop these hints later. Unlike kernel messages that
might become stable api.
One thing is log_buf that the verifier is dumping. It's huge and not parsable.
Whereas a string next to return EINVAL may become an uapi.
I wouldn't be surprised if some of netlink extack strings got to this
level of stability and changing the string may cause somebody to notice
and the commit would get reverted.

> But I also don't think it's always
> possible to diagnose something automatically with 100% confidence. We
> can give hints, but misdiagnosing the problem can just further confuse
> things instead.

I think the possible confusion is fine.
The libbpf has an opportunity to try them and remove if things don't work out.
The kernel strings would be much more scrutinized and harder to change.

> Also quite often such problems are one-offs, which
> doesn't make them any less confusing and frustrating, but custom
> diagnostics for every such case has a potential of bloating libbpf
> beyond recognition

bloating libbpf with strings is imo better way to figure out how to
improve user experience than bloating kernel.

> of them. That's what I meant that this is not a scalable approach to
> just say "fix libbpf to be more user friendly, kernel does its best
> already".

Ok. I will take it back. The kernel can be improved with extra strings
here and there, but only if it's done without huge code churn.
If current task_struct approach can work and the changes would be
limited to something like:
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index ac53102e244a..244df18728c2 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -196,8 +196,12 @@ static bool hierarchy_allows_attach(struct cgroup *cgrp,
                        return true;
                cnt = prog_list_length(&p->bpf.progs[type]);
                WARN_ON_ONCE(cnt > 1);
-               if (cnt == 1)
-                       return !!(flags & BPF_F_ALLOW_OVERRIDE);
+               if (cnt == 1) {
+                       ret = !!(flags & BPF_F_ALLOW_OVERRIDE);
+                       if (!ret)
+                               syscall_string("cgroup has non-overridable program in the parent");
+                       return ret;
+               }

Then such syscall_string reporting mechanism would be solid addition to the
kernel. Otherwise the cost of passing explicit log_buf everywhere is not worth
it.

I think such syscall_string can probably piggy back on "socket local storage
into generic local storage" work. Sooner or later we will have
per task_struct storage. If that's a single pointer in task_struct that will
be used by both task local storage from inside tracing bpf program and
by this syscall_string() reporting mechanism then we can land it.
May be all syscalls will become stateful. sys_bpf, sys_perf_event_open, sys_ioctl
followed by new syscall "give me back the error string".
Or may be we can do some asm magic and pass both errno and a string from
a syscall at the same time.
Andrii Nakryiko July 21, 2020, 3 a.m. UTC | #18
On Mon, Jul 20, 2020 at 3:31 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jul 17, 2020 at 08:54:45PM -0700, Andrii Nakryiko wrote:
> >
> > > Only libbpf can do it. Kernel is helpless here.
> > > Say we change the kernel errno for all unsuported prog types and maps
> > > it would return ENOTSUPP or something.
> > > Would it really help the situation?
> >
> > IMO, if the kernel just prints out "Unknown BPF command 123" or
> > "Unknown map type 345" that would be already a nice improvement.
> ...
> > log_buf can't help existing kernels. Period. No one is arguing or
> > expecting that. But moving forward, just having that "unknown command
> > 123" would be great.
> >
> > But yeah, of course libbpf can create a probing map and try to do
> > BATCH_LOOKUP, to detect BATCH_LOOKUP support.
>
> Also with BTF the kernel is self documented.
> The following will print all commands that kernel supports:
> bpftool btf dump file ./bld_x64/vmlinux |grep -A40 bpf_cmd
> and with 'grep BPF_MAP_TYPE_' all supported maps.
> For older kernels there is 'bpftool feature probe'.
> Since libbpf reads vmlinux BTF anyway it could have got a knowledge
> of all the features it supports based on BTF.
>

Yes.

> But let's continue this thought experiment for augmenting error
> reporting with a string.
> For 'Unknown BPF command 123' to work log_buf needs to passed
> outside of 'union bpf_attr'. Probably as 4th argument to sys_bpf ?
> and then a bit in 'int cmd' would need to be burned to indicate
> that 4th arg is there. Probably size of the arg needs to be passed
> either as 5th arg or as part of the 'struct bpf_log_buf' so it can
> be extensible.
> Using that log_buf directly in
> SYSCALL_DEFINE[45](bpf, int cmd, ... struct bpf_log_buf *log_buf)
> will be trivial.

Right, so far exactly what I proposed as #1 proposal.

> But to pass it into any of first level functions (bpf_iter_create,
> bpf_prog_attach, etc) they would need to gain an extra argument.
> To pass it all the way into hierarchy_allows_attach() it needs to be added to:
> __cgroup_bpf_attach
> cgroup_bpf_attach
> cgroup_bpf_link_attach
> link_create

I was hoping we can do it in such a way that we won't have to thread
through extra input arguments. Current task_struct looked like a good
candidate, given syscalls are always happening in a process context. I
was considering a per-CPU variable, but CPU migration precludes that.
I certainly was hoping that it could be implemented without huge code
churn.

>
> Another case of ambiguous 'return -EINVAL' would cause another change
> to a bunch of function prototypes.
> So it's better to integrate it into current task_struct to avoid huge code churn.

Right, exactly.

>
> But if we do so what stops us from generalizing log_buf reporting to
> other syscalls? perf_event_open is in the same category.

Nothing, of course, except for the need to convince other folks. I
thought starting with bpf() syscall and proving usefulness in practice
was a good first step. Then we could expand that to other syscalls.

>
> > This one is for perf subsystem, actually, it's its
> > PERF_EVENT_IOC_SET_BPF ioctl (until we add bpf_link for perf_event
> > attachment).
>
> clearly not only sys_bpf and sys_perf_event_open, but sys_ioctl would need
> string support to be a full answer to ambiguous einval-s.
>
> > My proposal was about adding the ability to emit something to log_buf
> > from any of the BPF commands, if that BPF command chooses to provide
> > extra error information. The whole point of this was to avoid adding
> > log_buf in command-specific ways (as Toke was doing in the patch that
> > I used to initiate the discussion) and do it once for entire syscall,
> > so that we can gradually utilize it where it makes most sense.
>
> I don't think that works due to code churn. Whether we pay that price
> once or 'gradually' it doesn't make it any better.

See above, here you are assuming explicit passing of log_buf through
all functions. If done through current there would be no code churn.

> When log_buf is added to an existing command the 'union bpf_attr' is there
> in the function proto and nothing new needs to passed to a lot of functions.

Not all the functions that do error checking have a bpf_attr pointer
anyway, so even in that case we could have a lot of code churn, if we
don't do it in a smarter way (i.e., current).

> So I certainly prefer Toke's approach of adding log_buf to one specific
> command if it's really needed there.
>
> The alternative is to solve it for all syscalls.

I do like the alternative, of course, but I'd start with just bpf()
syscall initially. But that's less of a technical choice, of course.

>
> > I agree that if such diagnostics are reliable and the situation itself
> > is common and experienced by multiple users, then it might make sense
> > to add such checks to libbpf.
>
> 'experienced by multiple users' is going to be a judgement call
> either for libbpf or for kernel.
> I'm saying let's improve libbpf user-friendlyness everywhere we can.
> We can always drop these hints later. Unlike kernel messages that
> might become stable api.
> One thing is log_buf that the verifier is dumping. It's huge and not parsable.
> Whereas a string next to return EINVAL may become an uapi.
> I wouldn't be surprised if some of netlink extack strings got to this
> level of stability and changing the string may cause somebody to notice
> and the commit would get reverted.
>
> > But I also don't think it's always
> > possible to diagnose something automatically with 100% confidence. We
> > can give hints, but misdiagnosing the problem can just further confuse
> > things instead.
>
> I think the possible confusion is fine.
> The libbpf has an opportunity to try them and remove if things don't work out.
> The kernel strings would be much more scrutinized and harder to change.
>

I see where you are coming from w.r.t. stability of those strings, but
I hope that's not going to be the case. Otherwise one might argue that
kernel WARN() messages could become a stable API.

But regardless, "try them and remove if things don't work out" isn't
exactly a user-friendly strategy, if that causes pain and confusion.
So I think we should still be very careful with what we add to libbpf.


> > Also quite often such problems are one-offs, which
> > doesn't make them any less confusing and frustrating, but custom
> > diagnostics for every such case has a potential of bloating libbpf
> > beyond recognition
>
> bloating libbpf with strings is imo better way to figure out how to
> improve user experience than bloating kernel.

By bloat I meant not strings, but the logic of determining the exact
cause of failure. Which from what we discussed can be quite elaborate
and non-trivial. Strings bloat are an easy thing to fix in the kernel.
We can conditionally compile them out if some Kconfig value is set.
E.g., for cases of embedded kernels where every kilobyte counts. It's
a matter of having the proper macros/funcs to do error reporting.

>
> > of them. That's what I meant that this is not a scalable approach to
> > just say "fix libbpf to be more user friendly, kernel does its best
> > already".
>
> Ok. I will take it back. The kernel can be improved with extra strings
> here and there, but only if it's done without huge code churn.
> If current task_struct approach can work and the changes would be
> limited to something like:
> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
> index ac53102e244a..244df18728c2 100644
> --- a/kernel/bpf/cgroup.c
> +++ b/kernel/bpf/cgroup.c
> @@ -196,8 +196,12 @@ static bool hierarchy_allows_attach(struct cgroup *cgrp,
>                         return true;
>                 cnt = prog_list_length(&p->bpf.progs[type]);
>                 WARN_ON_ONCE(cnt > 1);
> -               if (cnt == 1)
> -                       return !!(flags & BPF_F_ALLOW_OVERRIDE);
> +               if (cnt == 1) {
> +                       ret = !!(flags & BPF_F_ALLOW_OVERRIDE);
> +                       if (!ret)
> +                               syscall_string("cgroup has non-overridable program in the parent");
> +                       return ret;
> +               }
>
> Then such syscall_string reporting mechanism would be solid addition to the
> kernel. Otherwise the cost of passing explicit log_buf everywhere is not worth
> it.
>

Right, that's something like what I had in mind in terms of how
non-intrusive it has to be to get adopted.

> I think such syscall_string can probably piggy back on "socket local storage
> into generic local storage" work. Sooner or later we will have
> per task_struct storage. If that's a single pointer in task_struct that will
> be used by both task local storage from inside tracing bpf program and
> by this syscall_string() reporting mechanism then we can land it.

Not sure why the need to unify it with task-local storage? What's the
use case and what are the benefits of conflating syscall logging and
some BPF program's logging? It just adds problems, e.g., some BPF
programs might interleave its log output with syscall's output. I'm
not saying we shouldn't do it, but I don't see why we have to do it.

> May be all syscalls will become stateful. sys_bpf, sys_perf_event_open, sys_ioctl
> followed by new syscall "give me back the error string".
> Or may be we can do some asm magic and pass both errno and a string from
> a syscall at the same time.


So stateful logging (e.g., through extra syscall to set user log
buffer and unset it), which was proposal #2, definitely makes it easy
to do this across all syscalls uniformly without touching that syscall
API at all. The biggest downside, though, is runtimes like Golang,
where application doesn't control which OS thread it runs on. So
getting reliable and consistent log between syscalls becomes a
problem. Maybe there is some thread pinning API, though, and this
might not be a problem. Don't know.
diff mbox series

Patch

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index da9bf35a26f8..662a15e4a1a1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -573,8 +573,13 @@  union bpf_attr {
 	} query;
 
 	struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */
-		__u64 name;
-		__u32 prog_fd;
+		__u64		name;
+		__u32		prog_fd;
+		__u32		log_level;	/* verbosity level of log */
+		__u32		log_size;	/* size of user buffer */
+		__aligned_u64	log_buf;	/* user supplied buffer */
+		__u32		tgt_prog_fd;
+		__u32		tgt_btf_id;
 	} raw_tracepoint;
 
 	struct { /* anonymous struct for BPF_BTF_LOAD */