Message ID | 159467114405.370286.1690821122507970067.stgit@toke.dk |
---|---|
State | Changes Requested |
Delegated to: | BPF Maintainers |
Headers | show |
Series | bpf: Support multi-attach for freplace programs | expand |
On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > From: Toke Høiland-Jørgensen <toke@redhat.com> > > Sync addition of new members from main kernel tree. > > Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> > --- > tools/include/uapi/linux/bpf.h | 9 +++++++-- > 1 file changed, 7 insertions(+), 2 deletions(-) > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > index da9bf35a26f8..662a15e4a1a1 100644 > --- a/tools/include/uapi/linux/bpf.h > +++ b/tools/include/uapi/linux/bpf.h > @@ -573,8 +573,13 @@ union bpf_attr { > } query; > > struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */ > - __u64 name; > - __u32 prog_fd; > + __u64 name; > + __u32 prog_fd; > + __u32 log_level; /* verbosity level of log */ > + __u32 log_size; /* size of user buffer */ > + __aligned_u64 log_buf; /* user supplied buffer */ > + __u32 tgt_prog_fd; > + __u32 tgt_btf_id; > } raw_tracepoint; > > struct { /* anonymous struct for BPF_BTF_LOAD */ > I think BPF syscall would benefit from common/generalized log support across all commands, given how powerful/complex it already is. Sometimes it's literally impossible to understand why one gets -EINVAL without adding printk()s in the kernel. But it feels wrong to add log_level/log_size/log_buf fields to every section of bpf_attr and require the user to specify it differently for each command. So before we go and start adding per-command fields, let's discuss how we can do this more generically. I wonder if we can come up with a good way to do it in one common way and then gradually support that way throughout all BPF commands. Unfortunately it's too late to just add a few common fields to bpf_attr in front of all other per-command fields, but here's two more ideas just to start discussion. I hope someone can come up with something nicer. 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size). We can extend it with two more optional arguments: one for pointer to log-defining attr (for log_buf pointer, log_size, log_level, maybe something more later) and another for the size of that log attr. Beyond that we'd need some way to specify that the user actually meant to provide those 2 optional args. The most straightforward way would be to use the highest bit of cmd argument. So instead of calling bpf() with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE | BPF_LOGGED, where BPF_LOGGED = 1<<31. 2. A more "stateful" approach, would be to have an extra BPF command to set log buffer (and level) per thread. And if such a log is set, it would be overwritten with content on each bpf() syscall invocation (i.e., log position would be reset to 0 on each BPF syscall). Of course, the existing BPF_LOAD_PROG command would keep working with existing log, but could fall back to the "common one", if BPF_LOAD_PROG-specific one is not set. It also doesn't seem to be all that critical to signal when the log buffer is overflown. It's pretty easy to detect from user-space: - either last byte would be non-zero, if we don't care about guaranteeing zero-termination for truncated log; - of second-to-last byte would be non-zero, if BPF syscall will always zero-terminate the log. Of course, if user code cares about such detection of log truncation, it will need to set last/second-to-last bytes to zero before each syscall. Both approaches have their pros/cons, we can dig into those later, but I'd like to start this discussion and see what other people think. I also wonder if there are other syscalls with similarly advanced input arguments (like bpf_attr) and common logging, we could learn from those. Thoughts?
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> From: Toke Høiland-Jørgensen <toke@redhat.com> >> >> Sync addition of new members from main kernel tree. >> >> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> >> --- >> tools/include/uapi/linux/bpf.h | 9 +++++++-- >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h >> index da9bf35a26f8..662a15e4a1a1 100644 >> --- a/tools/include/uapi/linux/bpf.h >> +++ b/tools/include/uapi/linux/bpf.h >> @@ -573,8 +573,13 @@ union bpf_attr { >> } query; >> >> struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */ >> - __u64 name; >> - __u32 prog_fd; >> + __u64 name; >> + __u32 prog_fd; >> + __u32 log_level; /* verbosity level of log */ >> + __u32 log_size; /* size of user buffer */ >> + __aligned_u64 log_buf; /* user supplied buffer */ >> + __u32 tgt_prog_fd; >> + __u32 tgt_btf_id; >> } raw_tracepoint; >> >> struct { /* anonymous struct for BPF_BTF_LOAD */ >> > > I think BPF syscall would benefit from common/generalized log support > across all commands, given how powerful/complex it already is. > Sometimes it's literally impossible to understand why one gets -EINVAL > without adding printk()s in the kernel. Yes, I agree! This is horrible UI! > But it feels wrong to add log_level/log_size/log_buf fields to every > section of bpf_attr and require the user to specify it differently for > each command. So before we go and start adding per-command fields, > let's discuss how we can do this more generically. I wonder if we can > come up with a good way to do it in one common way and then gradually > support that way throughout all BPF commands. > > Unfortunately it's too late to just add a few common fields to > bpf_attr in front of all other per-command fields, but here's two more > ideas just to start discussion. I hope someone can come up with > something nicer. > > 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size). > We can extend it with two more optional arguments: one for pointer to > log-defining attr (for log_buf pointer, log_size, log_level, maybe > something more later) and another for the size of that log attr. > Beyond that we'd need some way to specify that the user actually meant > to provide those 2 optional args. The most straightforward way would > be to use the highest bit of cmd argument. So instead of calling bpf() > with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE | > BPF_LOGGED, where BPF_LOGGED = 1<<31. Well, if only we'd had a 'flags' argument to the syscall... I don't suppose we want a bpf2()? :) I like your idea of just using the top bits of the 'cmd' field as flag bits, but in that case we should just define this explicitly, say '#define BPF_CMD_FLAGS_MASK 0xFFFF0000'? And instead of adding new arguments, why not just change the meaning of the 'attr' argument? Say we define: struct bpf_extended_attr { __u32 log_level; __u32 log_size; __aligned_u64 log_buf; __u8 reserved[48]; union bpf_attr attr; }; And then define a new flag BPF_USES_EXTENDED_ATTR which will cause the kernel to interpret the second argument of the syscall as a pointer to that struct instead of to the bpf_attr union? > 2. A more "stateful" approach, would be to have an extra BPF command > to set log buffer (and level) per thread. And if such a log is set, it > would be overwritten with content on each bpf() syscall invocation > (i.e., log position would be reset to 0 on each BPF syscall). I don't think adding something stateful is a good idea; that's bound to lead to weird issues when someone messes up the state management in userspace. > Of course, the existing BPF_LOAD_PROG command would keep working with > existing log, but could fall back to the "common one", if > BPF_LOAD_PROG-specific one is not set. > > It also doesn't seem to be all that critical to signal when the log > buffer is overflown. It's pretty easy to detect from user-space: > - either last byte would be non-zero, if we don't care about > guaranteeing zero-termination for truncated log; > - of second-to-last byte would be non-zero, if BPF syscall will always > zero-terminate the log. I think if we're doing this we should think about making the contents of the log machine-readable, so applications can figure out what's going on without having to parse the text strings. The simplest would be to make this new log buffer use TLV-style messaging, say we define the log buffer output to be a series of messages like these: struct bpf_log_msg { __u16 type; __u32 len; __u8 contents[]; /* of size 'len' */ } __attribute__((packed)); To begin with we could define two types: struct bpf_log_msg_string { __u16 type; /* BPF_LOG_MSG_TYPE_STRING */ __u32 len; char message[]; } __attribute__((packed)); struct bpf_log_msg_end { __u16 type; /* BPF_LOG_MSG_TYPE_END */ __u32 len; __u16 num_truncations; } __attribute__((packed)); The TYPE_STRING would just be a wrapper for the existing text-based messages, but delimited so userspace can pick them apart. And the second one would solve your 'has the log been truncated' issue above; the kernel simply always reserves eight bytes at the end of the buffer and ends with writing a TYPE_END message with the number of messages that were dropped due to lack of space. That would make it trivial for userspace to detect truncation. We could then add new message types later for machine-consumption. Say, and extended error code, or offsets into the BTF information, or whatever we end up needing. But just wrapping the existing log messages in TLVs like the ones above could be fairly straight-forwardly implemented with the existing bpf_log() infrastructure in the kernel, without having to settle on which machine-readable information is useful ahead of time... And the TLV format makes it easy for userspace to just skip message types it doesn't understand. WDYT? > Of course, if user code cares about such detection of log truncation, > it will need to set last/second-to-last bytes to zero before each > syscall. > > Both approaches have their pros/cons, we can dig into those later, but > I'd like to start this discussion and see what other people think. I > also wonder if there are other syscalls with similarly advanced input > arguments (like bpf_attr) and common logging, we could learn from > those. The first one that comes to mind is netlink extacks. Of course it's not quite comparable since netlink already has message-based semantics, but it does do sorta-kinda the same thing from a user PoV. The TLV format is obviously inspired by netlink (or, well, binary networking protocols in general). -Toke
On Tue, Jul 14, 2020 at 5:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > > On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> > >> From: Toke Høiland-Jørgensen <toke@redhat.com> > >> > >> Sync addition of new members from main kernel tree. > >> > >> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> > >> --- > >> tools/include/uapi/linux/bpf.h | 9 +++++++-- > >> 1 file changed, 7 insertions(+), 2 deletions(-) > >> > >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > >> index da9bf35a26f8..662a15e4a1a1 100644 > >> --- a/tools/include/uapi/linux/bpf.h > >> +++ b/tools/include/uapi/linux/bpf.h > >> @@ -573,8 +573,13 @@ union bpf_attr { > >> } query; > >> > >> struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */ > >> - __u64 name; > >> - __u32 prog_fd; > >> + __u64 name; > >> + __u32 prog_fd; > >> + __u32 log_level; /* verbosity level of log */ > >> + __u32 log_size; /* size of user buffer */ > >> + __aligned_u64 log_buf; /* user supplied buffer */ > >> + __u32 tgt_prog_fd; > >> + __u32 tgt_btf_id; > >> } raw_tracepoint; > >> > >> struct { /* anonymous struct for BPF_BTF_LOAD */ > >> > > > > I think BPF syscall would benefit from common/generalized log support > > across all commands, given how powerful/complex it already is. > > Sometimes it's literally impossible to understand why one gets -EINVAL > > without adding printk()s in the kernel. > > Yes, I agree! This is horrible UI! UI?.. It's a perfectly fine and extensible API for all functionality it provides, it just needs a better human-readable feedback mechanism, which is what I'm proposing. Error codes are not working when you have so many different situations that can result in error. > > > But it feels wrong to add log_level/log_size/log_buf fields to every > > section of bpf_attr and require the user to specify it differently for > > each command. So before we go and start adding per-command fields, > > let's discuss how we can do this more generically. I wonder if we can > > come up with a good way to do it in one common way and then gradually > > support that way throughout all BPF commands. > > > > Unfortunately it's too late to just add a few common fields to > > bpf_attr in front of all other per-command fields, but here's two more > > ideas just to start discussion. I hope someone can come up with > > something nicer. > > > > 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size). > > We can extend it with two more optional arguments: one for pointer to > > log-defining attr (for log_buf pointer, log_size, log_level, maybe > > something more later) and another for the size of that log attr. > > Beyond that we'd need some way to specify that the user actually meant > > to provide those 2 optional args. The most straightforward way would > > be to use the highest bit of cmd argument. So instead of calling bpf() > > with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE | > > BPF_LOGGED, where BPF_LOGGED = 1<<31. > > Well, if only we'd had a 'flags' argument to the syscall... I don't > suppose we want a bpf2()? :) > > I like your idea of just using the top bits of the 'cmd' field as flag > bits, but in that case we should just define this explicitly, say > '#define BPF_CMD_FLAGS_MASK 0xFFFF0000'? sure > > And instead of adding new arguments, why not just change the meaning of > the 'attr' argument? Say we define: > > struct bpf_extended_attr { > __u32 log_level; > __u32 log_size; > __aligned_u64 log_buf; > __u8 reserved[48]; > union bpf_attr attr; > }; because this is a major PITA for libraries, plus it's not very extensible, once you run out of 48 bytes? And when you don't need those 48 bytes, you still need to zero them out, the kernel still needs to copy them, etc. It just feels unclean to me. But before we argue that, is there a problem adding 2 more arguments which are never used/read unless we have an extra bit set in cmd? Honest question, are there any implications to user-space with such an approach? Backwards-compatibility issues or anything? > > And then define a new flag BPF_USES_EXTENDED_ATTR which will cause the > kernel to interpret the second argument of the syscall as a pointer to > that struct instead of to the bpf_attr union? > > > 2. A more "stateful" approach, would be to have an extra BPF command > > to set log buffer (and level) per thread. And if such a log is set, it > > would be overwritten with content on each bpf() syscall invocation > > (i.e., log position would be reset to 0 on each BPF syscall). > > I don't think adding something stateful is a good idea; that's bound to > lead to weird issues when someone messes up the state management in > userspace. I agree, I'd prefer a stateless approach, but wanted to lay out a stateful one for completeness as well. > > > Of course, the existing BPF_LOAD_PROG command would keep working with > > existing log, but could fall back to the "common one", if > > BPF_LOAD_PROG-specific one is not set. > > > > It also doesn't seem to be all that critical to signal when the log > > buffer is overflown. It's pretty easy to detect from user-space: > > - either last byte would be non-zero, if we don't care about > > guaranteeing zero-termination for truncated log; > > - of second-to-last byte would be non-zero, if BPF syscall will always > > zero-terminate the log. > > I think if we're doing this we should think about making the contents of > the log machine-readable, so applications can figure out what's going on > without having to parse the text strings. The simplest would be to make > this new log buffer use TLV-style messaging, say we define the log > buffer output to be a series of messages like these: > > struct bpf_log_msg { > __u16 type; > __u32 len; > __u8 contents[]; /* of size 'len' */ > } __attribute__((packed)); > > To begin with we could define two types: > > struct bpf_log_msg_string { > __u16 type; /* BPF_LOG_MSG_TYPE_STRING */ > __u32 len; > char message[]; > } __attribute__((packed)); > > struct bpf_log_msg_end { > __u16 type; /* BPF_LOG_MSG_TYPE_END */ > __u32 len; > __u16 num_truncations; > } __attribute__((packed)); > > The TYPE_STRING would just be a wrapper for the existing text-based > messages, but delimited so userspace can pick them apart. And the second > one would solve your 'has the log been truncated' issue above; the > kernel simply always reserves eight bytes at the end of the buffer and > ends with writing a TYPE_END message with the number of messages that > were dropped due to lack of space. That would make it trivial for > userspace to detect truncation. > Log truncation is not an issue, we can make bpf syscall to write back the actual size of emitted log (and optionally extra bool to mean "truncated") into the original log_size field. > We could then add new message types later for machine-consumption. Say, > and extended error code, or offsets into the BTF information, or > whatever we end up needing. But just wrapping the existing log messages > in TLVs like the ones above could be fairly straight-forwardly > implemented with the existing bpf_log() infrastructure in the kernel, > without having to settle on which machine-readable information is useful > ahead of time... And the TLV format makes it easy for userspace to just > skip message types it doesn't understand. > > WDYT? I think it's taking it a bit too far and adds more API on top of existing API. Now all those types become a fixed API, messages become an API, etc. Just more backwards compatibility stuff we need to support forever, for, what I believe, very little gain. In practice, using human-readable strings works just fine. If there is any kind of real issue, usually it involves humans reading debug logs and understanding what's going on. Also adopting these packet-like messages is not as straightforward through BPF code, as now you can't just construct a single log line with few calls to bpf_log(). > > > Of course, if user code cares about such detection of log truncation, > > it will need to set last/second-to-last bytes to zero before each > > syscall. > > > > Both approaches have their pros/cons, we can dig into those later, but > > I'd like to start this discussion and see what other people think. I > > also wonder if there are other syscalls with similarly advanced input > > arguments (like bpf_attr) and common logging, we could learn from > > those. > > The first one that comes to mind is netlink extacks. Of course it's not > quite comparable since netlink already has message-based semantics, but > it does do sorta-kinda the same thing from a user PoV. The TLV format is > obviously inspired by netlink (or, well, binary networking protocols in > general). > Yeah, I'm aware of extack, but as you said, TLV is more of a netlink format, extack messages themselves are just strings. But my question was more of how this log could be done for complicated API calls using similar extendable attrs, like perf_event_open, clone3, openat2, etc. > -Toke >
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > On Tue, Jul 14, 2020 at 5:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> >> > On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> >> >> From: Toke Høiland-Jørgensen <toke@redhat.com> >> >> >> >> Sync addition of new members from main kernel tree. >> >> >> >> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> >> >> --- >> >> tools/include/uapi/linux/bpf.h | 9 +++++++-- >> >> 1 file changed, 7 insertions(+), 2 deletions(-) >> >> >> >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h >> >> index da9bf35a26f8..662a15e4a1a1 100644 >> >> --- a/tools/include/uapi/linux/bpf.h >> >> +++ b/tools/include/uapi/linux/bpf.h >> >> @@ -573,8 +573,13 @@ union bpf_attr { >> >> } query; >> >> >> >> struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */ >> >> - __u64 name; >> >> - __u32 prog_fd; >> >> + __u64 name; >> >> + __u32 prog_fd; >> >> + __u32 log_level; /* verbosity level of log */ >> >> + __u32 log_size; /* size of user buffer */ >> >> + __aligned_u64 log_buf; /* user supplied buffer */ >> >> + __u32 tgt_prog_fd; >> >> + __u32 tgt_btf_id; >> >> } raw_tracepoint; >> >> >> >> struct { /* anonymous struct for BPF_BTF_LOAD */ >> >> >> > >> > I think BPF syscall would benefit from common/generalized log support >> > across all commands, given how powerful/complex it already is. >> > Sometimes it's literally impossible to understand why one gets -EINVAL >> > without adding printk()s in the kernel. >> >> Yes, I agree! This is horrible UI! > > UI?.. It's a perfectly fine and extensible API for all functionality > it provides, it just needs a better human-readable feedback mechanism, > which is what I'm proposing. Error codes are not working when you have > so many different situations that can result in error. Yes. I was agreeing with you: the lack of understandable error messages means you have to play "guess where this EINVAL came from", which is a terrible user experience (should have been UX instead of UI, I guess, sorry about that). >> > But it feels wrong to add log_level/log_size/log_buf fields to every >> > section of bpf_attr and require the user to specify it differently for >> > each command. So before we go and start adding per-command fields, >> > let's discuss how we can do this more generically. I wonder if we can >> > come up with a good way to do it in one common way and then gradually >> > support that way throughout all BPF commands. >> > >> > Unfortunately it's too late to just add a few common fields to >> > bpf_attr in front of all other per-command fields, but here's two more >> > ideas just to start discussion. I hope someone can come up with >> > something nicer. >> > >> > 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size). >> > We can extend it with two more optional arguments: one for pointer to >> > log-defining attr (for log_buf pointer, log_size, log_level, maybe >> > something more later) and another for the size of that log attr. >> > Beyond that we'd need some way to specify that the user actually meant >> > to provide those 2 optional args. The most straightforward way would >> > be to use the highest bit of cmd argument. So instead of calling bpf() >> > with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE | >> > BPF_LOGGED, where BPF_LOGGED = 1<<31. >> >> Well, if only we'd had a 'flags' argument to the syscall... I don't >> suppose we want a bpf2()? :) >> >> I like your idea of just using the top bits of the 'cmd' field as flag >> bits, but in that case we should just define this explicitly, say >> '#define BPF_CMD_FLAGS_MASK 0xFFFF0000'? > > sure > >> >> And instead of adding new arguments, why not just change the meaning of >> the 'attr' argument? Say we define: >> >> struct bpf_extended_attr { >> __u32 log_level; >> __u32 log_size; >> __aligned_u64 log_buf; >> __u8 reserved[48]; >> union bpf_attr attr; >> }; > > because this is a major PITA for libraries, plus it's not very > extensible, once you run out of 48 bytes? And when you don't need > those 48 bytes, you still need to zero them out, the kernel still > needs to copy them, etc. It just feels unclean to me. > > But before we argue that, is there a problem adding 2 more arguments > which are never used/read unless we have an extra bit set in cmd? > Honest question, are there any implications to user-space with such an > approach? Backwards-compatibility issues or anything? No idea; I don't know enough about how the lower-level details of the syscall interface works to tell either way. Is it even *possible* to add arguments like that in a backwards-compatible way? However, assuming it *is* possible, my larger point was that we shouldn't add just a 'logging struct', but rather a 'common options struct' which can be extended further as needed. And if it is *not* possible to add new arguments to a syscall like you're proposing, my suggestion above would be a different way to achieve basically the same (at the cost of having to specify the maximum reserved space in advance). >> And then define a new flag BPF_USES_EXTENDED_ATTR which will cause the >> kernel to interpret the second argument of the syscall as a pointer to >> that struct instead of to the bpf_attr union? >> >> > 2. A more "stateful" approach, would be to have an extra BPF command >> > to set log buffer (and level) per thread. And if such a log is set, it >> > would be overwritten with content on each bpf() syscall invocation >> > (i.e., log position would be reset to 0 on each BPF syscall). >> >> I don't think adding something stateful is a good idea; that's bound to >> lead to weird issues when someone messes up the state management in >> userspace. > > I agree, I'd prefer a stateless approach, but wanted to lay out a > stateful one for completeness as well. OK, cool. >> >> > Of course, the existing BPF_LOAD_PROG command would keep working with >> > existing log, but could fall back to the "common one", if >> > BPF_LOAD_PROG-specific one is not set. >> > >> > It also doesn't seem to be all that critical to signal when the log >> > buffer is overflown. It's pretty easy to detect from user-space: >> > - either last byte would be non-zero, if we don't care about >> > guaranteeing zero-termination for truncated log; >> > - of second-to-last byte would be non-zero, if BPF syscall will always >> > zero-terminate the log. >> >> I think if we're doing this we should think about making the contents of >> the log machine-readable, so applications can figure out what's going on >> without having to parse the text strings. The simplest would be to make >> this new log buffer use TLV-style messaging, say we define the log >> buffer output to be a series of messages like these: >> >> struct bpf_log_msg { >> __u16 type; >> __u32 len; >> __u8 contents[]; /* of size 'len' */ >> } __attribute__((packed)); >> >> To begin with we could define two types: >> >> struct bpf_log_msg_string { >> __u16 type; /* BPF_LOG_MSG_TYPE_STRING */ >> __u32 len; >> char message[]; >> } __attribute__((packed)); >> >> struct bpf_log_msg_end { >> __u16 type; /* BPF_LOG_MSG_TYPE_END */ >> __u32 len; >> __u16 num_truncations; >> } __attribute__((packed)); >> >> The TYPE_STRING would just be a wrapper for the existing text-based >> messages, but delimited so userspace can pick them apart. And the second >> one would solve your 'has the log been truncated' issue above; the >> kernel simply always reserves eight bytes at the end of the buffer and >> ends with writing a TYPE_END message with the number of messages that >> were dropped due to lack of space. That would make it trivial for >> userspace to detect truncation. >> > > Log truncation is not an issue, we can make bpf syscall to write back > the actual size of emitted log (and optionally extra bool to mean > "truncated") into the original log_size field. So what was all that you were talking about with "check the second-to-last byte" in your previous email? I understood that to be about detecting truncation? >> We could then add new message types later for machine-consumption. Say, >> and extended error code, or offsets into the BTF information, or >> whatever we end up needing. But just wrapping the existing log messages >> in TLVs like the ones above could be fairly straight-forwardly >> implemented with the existing bpf_log() infrastructure in the kernel, >> without having to settle on which machine-readable information is useful >> ahead of time... And the TLV format makes it easy for userspace to just >> skip message types it doesn't understand. >> >> WDYT? > > I think it's taking it a bit too far and adds more API on top of > existing API. Now all those types become a fixed API, messages become > an API, etc. Just more backwards compatibility stuff we need to > support forever, for, what I believe, very little gain. > > In practice, using human-readable strings works just fine. If there is > any kind of real issue, usually it involves humans reading debug logs > and understanding what's going on. Let me give an example of what I want to be able to do. Just today I helped someone debug getting xdp-filter to run, and they were getting output like this: libbpf: -- BEGIN DUMP LOG --- libbpf: xdpfilt_alw_all() is not a global function processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0 libbpf: -- END LOG -- I would like to have xdp-filter catch that, and turn it into a friendlier "your compiler is too old" message. Having to effectively grep through the free-form log output to pick out that message feels brittle and error prone, as opposed to just having the kernel add a machine-readable id ("err_func_linkage_not_global" or somesuch) and stick it in a machine-parsable TLV. > Also adopting these packet-like messages is not as straightforward > through BPF code, as now you can't just construct a single log line > with few calls to bpf_log(). Why not? bpf_log() could just transparently write the four bytes of header (TYPE_STRING, followed by strlen(msg)) into the buffer before the string? And in the future, an enhanced version could take (say) an error ID as another parameter and transparently add that as a separate message. >> > Of course, if user code cares about such detection of log truncation, >> > it will need to set last/second-to-last bytes to zero before each >> > syscall. >> > >> > Both approaches have their pros/cons, we can dig into those later, but >> > I'd like to start this discussion and see what other people think. I >> > also wonder if there are other syscalls with similarly advanced input >> > arguments (like bpf_attr) and common logging, we could learn from >> > those. >> >> The first one that comes to mind is netlink extacks. Of course it's not >> quite comparable since netlink already has message-based semantics, but >> it does do sorta-kinda the same thing from a user PoV. The TLV format is >> obviously inspired by netlink (or, well, binary networking protocols in >> general). >> > > Yeah, I'm aware of extack, but as you said, TLV is more of a netlink > format, extack messages themselves are just strings. But my question > was more of how this log could be done for complicated API calls using > similar extendable attrs, like perf_event_open, clone3, openat2, etc. Ah, right, no idea :) -Toke
On Tue, Jul 14, 2020 at 1:47 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > > On Tue, Jul 14, 2020 at 5:12 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> > >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > >> > >> > On Mon, Jul 13, 2020 at 1:13 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> >> > >> >> From: Toke Høiland-Jørgensen <toke@redhat.com> > >> >> > >> >> Sync addition of new members from main kernel tree. > >> >> > >> >> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> > >> >> --- > >> >> tools/include/uapi/linux/bpf.h | 9 +++++++-- > >> >> 1 file changed, 7 insertions(+), 2 deletions(-) > >> >> > >> >> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h > >> >> index da9bf35a26f8..662a15e4a1a1 100644 > >> >> --- a/tools/include/uapi/linux/bpf.h > >> >> +++ b/tools/include/uapi/linux/bpf.h > >> >> @@ -573,8 +573,13 @@ union bpf_attr { > >> >> } query; > >> >> > >> >> struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */ > >> >> - __u64 name; > >> >> - __u32 prog_fd; > >> >> + __u64 name; > >> >> + __u32 prog_fd; > >> >> + __u32 log_level; /* verbosity level of log */ > >> >> + __u32 log_size; /* size of user buffer */ > >> >> + __aligned_u64 log_buf; /* user supplied buffer */ > >> >> + __u32 tgt_prog_fd; > >> >> + __u32 tgt_btf_id; > >> >> } raw_tracepoint; > >> >> > >> >> struct { /* anonymous struct for BPF_BTF_LOAD */ > >> >> > >> > > >> > I think BPF syscall would benefit from common/generalized log support > >> > across all commands, given how powerful/complex it already is. > >> > Sometimes it's literally impossible to understand why one gets -EINVAL > >> > without adding printk()s in the kernel. > >> > >> Yes, I agree! This is horrible UI! > > > > UI?.. It's a perfectly fine and extensible API for all functionality > > it provides, it just needs a better human-readable feedback mechanism, > > which is what I'm proposing. Error codes are not working when you have > > so many different situations that can result in error. > > Yes. I was agreeing with you: the lack of understandable error messages > means you have to play "guess where this EINVAL came from", which is a > terrible user experience (should have been UX instead of UI, I guess, > sorry about that). Ah, ok, the term confused me :) > > >> > But it feels wrong to add log_level/log_size/log_buf fields to every > >> > section of bpf_attr and require the user to specify it differently for > >> > each command. So before we go and start adding per-command fields, > >> > let's discuss how we can do this more generically. I wonder if we can > >> > come up with a good way to do it in one common way and then gradually > >> > support that way throughout all BPF commands. > >> > > >> > Unfortunately it's too late to just add a few common fields to > >> > bpf_attr in front of all other per-command fields, but here's two more > >> > ideas just to start discussion. I hope someone can come up with > >> > something nicer. > >> > > >> > 1. Currently bpf syscall accepts 3 input arguments (cmd, uattr, size). > >> > We can extend it with two more optional arguments: one for pointer to > >> > log-defining attr (for log_buf pointer, log_size, log_level, maybe > >> > something more later) and another for the size of that log attr. > >> > Beyond that we'd need some way to specify that the user actually meant > >> > to provide those 2 optional args. The most straightforward way would > >> > be to use the highest bit of cmd argument. So instead of calling bpf() > >> > with BPF_MAP_CREATE command, you'd call it with BPF_MAP_CREATE | > >> > BPF_LOGGED, where BPF_LOGGED = 1<<31. > >> > >> Well, if only we'd had a 'flags' argument to the syscall... I don't > >> suppose we want a bpf2()? :) > >> > >> I like your idea of just using the top bits of the 'cmd' field as flag > >> bits, but in that case we should just define this explicitly, say > >> '#define BPF_CMD_FLAGS_MASK 0xFFFF0000'? > > > > sure > > > >> > >> And instead of adding new arguments, why not just change the meaning of > >> the 'attr' argument? Say we define: > >> > >> struct bpf_extended_attr { > >> __u32 log_level; > >> __u32 log_size; > >> __aligned_u64 log_buf; > >> __u8 reserved[48]; > >> union bpf_attr attr; > >> }; > > > > because this is a major PITA for libraries, plus it's not very > > extensible, once you run out of 48 bytes? And when you don't need > > those 48 bytes, you still need to zero them out, the kernel still > > needs to copy them, etc. It just feels unclean to me. > > > > But before we argue that, is there a problem adding 2 more arguments > > which are never used/read unless we have an extra bit set in cmd? > > Honest question, are there any implications to user-space with such an > > approach? Backwards-compatibility issues or anything? > > No idea; I don't know enough about how the lower-level details of the > syscall interface works to tell either way. Is it even *possible* to add > arguments like that in a backwards-compatible way? > > However, assuming it *is* possible, my larger point was that we > shouldn't add just a 'logging struct', but rather a 'common options > struct' which can be extended further as needed. And if it is *not* > possible to add new arguments to a syscall like you're proposing, my > suggestion above would be a different way to achieve basically the same > (at the cost of having to specify the maximum reserved space in advance). > yeah-yeah, I agree, it's less a "logging attr", more of "common attr across all commands". > >> And then define a new flag BPF_USES_EXTENDED_ATTR which will cause the > >> kernel to interpret the second argument of the syscall as a pointer to > >> that struct instead of to the bpf_attr union? > >> > >> > 2. A more "stateful" approach, would be to have an extra BPF command > >> > to set log buffer (and level) per thread. And if such a log is set, it > >> > would be overwritten with content on each bpf() syscall invocation > >> > (i.e., log position would be reset to 0 on each BPF syscall). > >> > >> I don't think adding something stateful is a good idea; that's bound to > >> lead to weird issues when someone messes up the state management in > >> userspace. > > > > I agree, I'd prefer a stateless approach, but wanted to lay out a > > stateful one for completeness as well. > > OK, cool. > > >> > >> > Of course, the existing BPF_LOAD_PROG command would keep working with > >> > existing log, but could fall back to the "common one", if > >> > BPF_LOAD_PROG-specific one is not set. > >> > > >> > It also doesn't seem to be all that critical to signal when the log > >> > buffer is overflown. It's pretty easy to detect from user-space: > >> > - either last byte would be non-zero, if we don't care about > >> > guaranteeing zero-termination for truncated log; > >> > - of second-to-last byte would be non-zero, if BPF syscall will always > >> > zero-terminate the log. > >> > >> I think if we're doing this we should think about making the contents of > >> the log machine-readable, so applications can figure out what's going on > >> without having to parse the text strings. The simplest would be to make > >> this new log buffer use TLV-style messaging, say we define the log > >> buffer output to be a series of messages like these: > >> > >> struct bpf_log_msg { > >> __u16 type; > >> __u32 len; > >> __u8 contents[]; /* of size 'len' */ > >> } __attribute__((packed)); > >> > >> To begin with we could define two types: > >> > >> struct bpf_log_msg_string { > >> __u16 type; /* BPF_LOG_MSG_TYPE_STRING */ > >> __u32 len; > >> char message[]; > >> } __attribute__((packed)); > >> > >> struct bpf_log_msg_end { > >> __u16 type; /* BPF_LOG_MSG_TYPE_END */ > >> __u32 len; > >> __u16 num_truncations; > >> } __attribute__((packed)); > >> > >> The TYPE_STRING would just be a wrapper for the existing text-based > >> messages, but delimited so userspace can pick them apart. And the second > >> one would solve your 'has the log been truncated' issue above; the > >> kernel simply always reserves eight bytes at the end of the buffer and > >> ends with writing a TYPE_END message with the number of messages that > >> were dropped due to lack of space. That would make it trivial for > >> userspace to detect truncation. > >> > > > > Log truncation is not an issue, we can make bpf syscall to write back > > the actual size of emitted log (and optionally extra bool to mean > > "truncated") into the original log_size field. > > So what was all that you were talking about with "check the > second-to-last byte" in your previous email? I understood that to be > about detecting truncation? Yeah, ignore my initial rambling. One can do that (detecting truncationg) without any extra "feedback" from bpf syscall, but I think returning filled length is probably a better approach and doesn't hamper any other aspects. > > >> We could then add new message types later for machine-consumption. Say, > >> and extended error code, or offsets into the BTF information, or > >> whatever we end up needing. But just wrapping the existing log messages > >> in TLVs like the ones above could be fairly straight-forwardly > >> implemented with the existing bpf_log() infrastructure in the kernel, > >> without having to settle on which machine-readable information is useful > >> ahead of time... And the TLV format makes it easy for userspace to just > >> skip message types it doesn't understand. > >> > >> WDYT? > > > > I think it's taking it a bit too far and adds more API on top of > > existing API. Now all those types become a fixed API, messages become > > an API, etc. Just more backwards compatibility stuff we need to > > support forever, for, what I believe, very little gain. > > > > In practice, using human-readable strings works just fine. If there is > > any kind of real issue, usually it involves humans reading debug logs > > and understanding what's going on. > > Let me give an example of what I want to be able to do. Just today I > helped someone debug getting xdp-filter to run, and they were getting > output like this: > > libbpf: -- BEGIN DUMP LOG --- > libbpf: > xdpfilt_alw_all() is not a global function > processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 > peak_states 0 mark_read 0 > > libbpf: -- END LOG -- > > I would like to have xdp-filter catch that, and turn it into a > friendlier "your compiler is too old" message. Having to effectively > grep through the free-form log output to pick out that message feels > brittle and error prone, as opposed to just having the kernel add a > machine-readable id ("err_func_linkage_not_global" or somesuch) and > stick it in a machine-parsable TLV. Yeah, I guessed as much, but I still think it's unnecessarily restrictive (for BPF kernel developers) to adhere to such a rigorous error format. We'll have a few hundred different failure reasons, and with the verifier constantly evolving some of them might get deprecated, others will be constantly added, some other might change the meaning, etc. I don't see this realistically working long term. So at that point, free-form text-delimited text is as good as we should (and, realistically, would) probably go. Library, if it wants to be extra-helpful, can still reasonably easily parse line-delimited text, even if that's not an ideal setup (especially if error messages will get periodically changed). But I'd be happy to hear what others think on this matter as well, of course. > > > Also adopting these packet-like messages is not as straightforward > > through BPF code, as now you can't just construct a single log line > > with few calls to bpf_log(). > > Why not? bpf_log() could just transparently write the four bytes of > header (TYPE_STRING, followed by strlen(msg)) into the buffer before the > string? And in the future, an enhanced version could take (say) an error > ID as another parameter and transparently add that as a separate message. I mean when you construct one error message with few printf-like functions. We do have that in libbpf, but I haven't checked the verifier code. Basically, assuming one bpf_log() call is a complete "message" might not be true. > > >> > Of course, if user code cares about such detection of log truncation, > >> > it will need to set last/second-to-last bytes to zero before each > >> > syscall. > >> > > >> > Both approaches have their pros/cons, we can dig into those later, but > >> > I'd like to start this discussion and see what other people think. I > >> > also wonder if there are other syscalls with similarly advanced input > >> > arguments (like bpf_attr) and common logging, we could learn from > >> > those. > >> > >> The first one that comes to mind is netlink extacks. Of course it's not > >> quite comparable since netlink already has message-based semantics, but > >> it does do sorta-kinda the same thing from a user PoV. The TLV format is > >> obviously inspired by netlink (or, well, binary networking protocols in > >> general). > >> > > > > Yeah, I'm aware of extack, but as you said, TLV is more of a netlink > > format, extack messages themselves are just strings. But my question > > was more of how this log could be done for complicated API calls using > > similar extendable attrs, like perf_event_open, clone3, openat2, etc. > > Ah, right, no idea :) > > -Toke >
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> However, assuming it *is* possible, my larger point was that we >> shouldn't add just a 'logging struct', but rather a 'common options >> struct' which can be extended further as needed. And if it is *not* >> possible to add new arguments to a syscall like you're proposing, my >> suggestion above would be a different way to achieve basically the same >> (at the cost of having to specify the maximum reserved space in advance). >> > > yeah-yeah, I agree, it's less a "logging attr", more of "common attr > across all commands". Right, great. I think we are broadly in agreement with where we want to go with this, actually :) Let's see if anyone else chimes in; otherwise I guess I can incorporate something along these lines in the next version of this series. I'm going on vacation at the end of this week, though, so I will most likely not be able to carry it to completion before then; but at least I can post something for someone else to pick up (or if no one does it can wait until I get back). [...] > Yeah, ignore my initial rambling. One can do that (detecting > truncationg) without any extra "feedback" from bpf syscall, but I > think returning filled length is probably a better approach and > doesn't hamper any other aspects. OK, sure, makes sense. [...] >> > Also adopting these packet-like messages is not as straightforward >> > through BPF code, as now you can't just construct a single log line >> > with few calls to bpf_log(). >> >> Why not? bpf_log() could just transparently write the four bytes of >> header (TYPE_STRING, followed by strlen(msg)) into the buffer before the >> string? And in the future, an enhanced version could take (say) an error >> ID as another parameter and transparently add that as a separate message. > > I mean when you construct one error message with few printf-like > functions. We do have that in libbpf, but I haven't checked the > verifier code. Basically, assuming one bpf_log() call is a complete > "message" might not be true. Ah, I see what you mean. I guess that could be worked around with a flag or something, but I'll concede that in that case it's less of an obvious drop-in replacement :) -Toke
On Wed, Jul 15, 2020 at 12:19:03AM +0200, Toke Høiland-Jørgensen wrote: > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > >> However, assuming it *is* possible, my larger point was that we > >> shouldn't add just a 'logging struct', but rather a 'common options > >> struct' which can be extended further as needed. And if it is *not* > >> possible to add new arguments to a syscall like you're proposing, my > >> suggestion above would be a different way to achieve basically the same > >> (at the cost of having to specify the maximum reserved space in advance). > >> > > > > yeah-yeah, I agree, it's less a "logging attr", more of "common attr > > across all commands". > > Right, great. I think we are broadly in agreement with where we want to > go with this, actually :) I really don't like 'common attr across all commands'. Both of you are talking as libbpf developers who occasionally need to add printk-s to the kernel. That is not an excuse to bloat api that will be useful to two people. The only reason log_buf sort-of make sense in raw_tp_open is because btf comparison is moved from prog_load into raw_tp_open. Miscompare of (prog_fd1, btf_id1) vs (prog_fd2, btf_id2) can be easily solved by libbpf with as nice and as human friendly message libbpf can do. I'm not convinced yet that it's a kernel job to print it nicely. It certainly can, but it's quite a bit different from two existing bpf commands where log_buf is used: PROG_LOAD and BTF_LOAD. In these two cases the kernel verifies the program and the BTF. raw_tp_open is different, since the kernel needs to compare that function signatures (prog_fd1, btf_id1) and (prog_fd2, btf_id2) are exactly the same. The kernel can indicate that with single specific errno and libbpf can print human friendly function signatures via btf_dump infra for humans to see. So I really don't see why log_buf is such a necessity for raw_tp_open.
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > On Wed, Jul 15, 2020 at 12:19:03AM +0200, Toke Høiland-Jørgensen wrote: >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> >> >> However, assuming it *is* possible, my larger point was that we >> >> shouldn't add just a 'logging struct', but rather a 'common options >> >> struct' which can be extended further as needed. And if it is *not* >> >> possible to add new arguments to a syscall like you're proposing, my >> >> suggestion above would be a different way to achieve basically the same >> >> (at the cost of having to specify the maximum reserved space in advance). >> >> >> > >> > yeah-yeah, I agree, it's less a "logging attr", more of "common attr >> > across all commands". >> >> Right, great. I think we are broadly in agreement with where we want to >> go with this, actually :) > > I really don't like 'common attr across all commands'. > Both of you are talking as libbpf developers who occasionally need to > add printk-s to the kernel. That is not an excuse to bloat api that will be > useful to two people. What? No, this is about making error messages comprehensible to people who *can't* just go around adding printks. "Guess the source of the EINVAL" is a really bad user experience! > The only reason log_buf sort-of make sense in raw_tp_open is because > btf comparison is moved from prog_load into raw_tp_open. > Miscompare of (prog_fd1, btf_id1) vs (prog_fd2, btf_id2) can be easily solved > by libbpf with as nice and as human friendly message libbpf can do. So userspace is supposed to replicate all the checks done by the kernel because we can't be bothered to add proper error messages? Really? > I'm not convinced yet that it's a kernel job to print it nicely. It certainly can, > but it's quite a bit different from two existing bpf commands where log_buf is used: > PROG_LOAD and BTF_LOAD. In these two cases the kernel verifies the program > and the BTF. raw_tp_open is different, since the kernel needs to compare > that function signatures (prog_fd1, btf_id1) and (prog_fd2, btf_id2) are > exactly the same. The kernel can indicate that with single specific errno and > libbpf can print human friendly function signatures via btf_dump infra for > humans to see. > So I really don't see why log_buf is such a necessity for raw_tp_open. I'll drop them from raw_tp_open for this series, but I still think they should be added globally (or something like it). Returning a user-friendly error message should be the absolute minimum we do. Just like extack does for netlink. -Toke
On Tue, Jul 14, 2020 at 4:11 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Jul 15, 2020 at 12:19:03AM +0200, Toke Høiland-Jørgensen wrote: > > Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > > > >> However, assuming it *is* possible, my larger point was that we > > >> shouldn't add just a 'logging struct', but rather a 'common options > > >> struct' which can be extended further as needed. And if it is *not* > > >> possible to add new arguments to a syscall like you're proposing, my > > >> suggestion above would be a different way to achieve basically the same > > >> (at the cost of having to specify the maximum reserved space in advance). > > >> > > > > > > yeah-yeah, I agree, it's less a "logging attr", more of "common attr > > > across all commands". > > > > Right, great. I think we are broadly in agreement with where we want to > > go with this, actually :) > > I really don't like 'common attr across all commands'. > Both of you are talking as libbpf developers who occasionally need to > add printk-s to the kernel. How did you come to this conclusion? Inability to figure out what's wrong when using BPF is at the top of complaints from many users, together with hard to understand logs from verifier. > That is not an excuse to bloat api that will be > useful to two people. What do you mean specifically by bloat in API? I could understand how it would bloat the API if we were to add log-related fields into every part of bpf_attr, one for each type of commands. But that's exactly what I advocate to not do. But having even a slight improvement of error reporting, beyond current -EINVAL, -E2BIG, etc, would improve experience immensely for *all* BPF users. > > The only reason log_buf sort-of make sense in raw_tp_open is because > btf comparison is moved from prog_load into raw_tp_open. > Miscompare of (prog_fd1, btf_id1) vs (prog_fd2, btf_id2) can be easily solved > by libbpf with as nice and as human friendly message libbpf can do. > > I'm not convinced yet that it's a kernel job to print it nicely. It certainly can, > but it's quite a bit different from two existing bpf commands where log_buf is used: > PROG_LOAD and BTF_LOAD. In these two cases the kernel verifies the program > and the BTF. raw_tp_open is different, since the kernel needs to compare > that function signatures (prog_fd1, btf_id1) and (prog_fd2, btf_id2) are > exactly the same. The kernel can indicate that with single specific errno and > libbpf can print human friendly function signatures via btf_dump infra for > humans to see. > So I really don't see why log_buf is such a necessity for raw_tp_open.
On Wed, Jul 15, 2020 at 02:56:36PM +0200, Toke Høiland-Jørgensen wrote: > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > > > On Wed, Jul 15, 2020 at 12:19:03AM +0200, Toke Høiland-Jørgensen wrote: > >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > >> > >> >> However, assuming it *is* possible, my larger point was that we > >> >> shouldn't add just a 'logging struct', but rather a 'common options > >> >> struct' which can be extended further as needed. And if it is *not* > >> >> possible to add new arguments to a syscall like you're proposing, my > >> >> suggestion above would be a different way to achieve basically the same > >> >> (at the cost of having to specify the maximum reserved space in advance). > >> >> > >> > > >> > yeah-yeah, I agree, it's less a "logging attr", more of "common attr > >> > across all commands". > >> > >> Right, great. I think we are broadly in agreement with where we want to > >> go with this, actually :) > > > > I really don't like 'common attr across all commands'. > > Both of you are talking as libbpf developers who occasionally need to > > add printk-s to the kernel. That is not an excuse to bloat api that will be > > useful to two people. > > What? No, this is about making error messages comprehensible to people > who *can't* just go around adding printks. "Guess the source of the > EINVAL" is a really bad user experience! > > > The only reason log_buf sort-of make sense in raw_tp_open is because > > btf comparison is moved from prog_load into raw_tp_open. > > Miscompare of (prog_fd1, btf_id1) vs (prog_fd2, btf_id2) can be easily solved > > by libbpf with as nice and as human friendly message libbpf can do. > > So userspace is supposed to replicate all the checks done by the kernel > because we can't be bothered to add proper error messages? Really? That's not what I said. The kernel can report unique errno for miscompare and all nice messages can and _should be_ be printed by libbpf. On Wed, Jul 15, 2020, Andrii Nakryiko wrote: > > Inability to figure out what's wrong when using BPF is at the top of > complaints from many users, together with hard to understand logs from > verifier. Only the second part is true. All users are complaining about the verifier. No one is complaing that failed prog attach is somehow lacking string message. The users are also complaing about libbpf being too verbose. Yet you've refused to address the verbosity where it should be reduced and now refusing to add it where it's needed. It's libbpf job to explain users kernel errors. The same thing is happening with perf_event_open syscall. Every one who's trying to code it directly complaining about the kernel. But not a single user is complaing about perf syscall when they use libraries and tools. Same thing with bpf syscall. libbpf is the interface. It needs to clear and to the point. Right now it's not doing it well. elf dump is too verbose and unnecessary whereas in other places it says nothing informative where it could have printed human hint. libbpf's pr_perm_msg() hint is the only one where libbpf cares about its users. All other messages are useful to libbpf developers and not its users. The kernel verifier messages suck as well. They need to be improved. But this thread 'lets add strings everywhere and users will be happy' is completely missing the mark.
On Wed, Jul 15, 2020 at 4:41 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Jul 15, 2020 at 02:56:36PM +0200, Toke Høiland-Jørgensen wrote: > > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes: > > > > > On Wed, Jul 15, 2020 at 12:19:03AM +0200, Toke Høiland-Jørgensen wrote: > > >> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: > > >> > > >> >> However, assuming it *is* possible, my larger point was that we > > >> >> shouldn't add just a 'logging struct', but rather a 'common options > > >> >> struct' which can be extended further as needed. And if it is *not* > > >> >> possible to add new arguments to a syscall like you're proposing, my > > >> >> suggestion above would be a different way to achieve basically the same > > >> >> (at the cost of having to specify the maximum reserved space in advance). > > >> >> > > >> > > > >> > yeah-yeah, I agree, it's less a "logging attr", more of "common attr > > >> > across all commands". > > >> > > >> Right, great. I think we are broadly in agreement with where we want to > > >> go with this, actually :) > > > > > > I really don't like 'common attr across all commands'. > > > Both of you are talking as libbpf developers who occasionally need to > > > add printk-s to the kernel. That is not an excuse to bloat api that will be > > > useful to two people. > > > > What? No, this is about making error messages comprehensible to people > > who *can't* just go around adding printks. "Guess the source of the > > EINVAL" is a really bad user experience! > > > > > The only reason log_buf sort-of make sense in raw_tp_open is because > > > btf comparison is moved from prog_load into raw_tp_open. > > > Miscompare of (prog_fd1, btf_id1) vs (prog_fd2, btf_id2) can be easily solved > > > by libbpf with as nice and as human friendly message libbpf can do. > > > > So userspace is supposed to replicate all the checks done by the kernel > > because we can't be bothered to add proper error messages? Really? > > That's not what I said. The kernel can report unique errno for miscompare > and all nice messages can and _should be_ be printed by libbpf. > > > On Wed, Jul 15, 2020, Andrii Nakryiko wrote: > > > > Inability to figure out what's wrong when using BPF is at the top of > > complaints from many users, together with hard to understand logs from > > verifier. > > Only the second part is true. All users are complaining about the verifier. > No one is complaing that failed prog attach is somehow lacking string message. Ok, next time I'll be helping someone to figure out another -EINVAL, I'll remember to reassure them that it's not really frustrating, not a guess game, and not a time sink at all. > The users are also complaing about libbpf being too verbose. Very well might be, but apart from your complaints on that patch adding program loading debug message, I can't remember a single case when someone complained about that. Do you have a link for me to get some context? > Yet you've refused to address the verbosity where it should be reduced and It's open-source, everyone is welcome to submit their patches. Just because I don't think we need to remove some log messages and thus not am creating such patches, doesn't mean it can't be done by someone else. > now refusing to add it where it's needed. Can you point to or quote where I refused to add a helpful message to libbpf? > It's libbpf job to explain users kernel errors. To the best of its ability, yes. Unfortunately there were many times where I, as a human, couldn't figure it out without printk'ing my way around the kernel. If I can't do that, I can't teach libbpf to do it. Error codes are just not granular enough to allow distinguishing a lot of error conditions, either by humans or automatically by libbpf. > > The same thing is happening with perf_event_open syscall. > Every one who's trying to code it directly complaining about the kernel. But > not a single user is complaing about perf syscall when they use libraries and > tools. Same thing with bpf syscall. libbpf is the interface. It needs to clear > and to the point. Right now it's not doing it well. elf dump is too verbose and > unnecessary whereas in other places it says nothing informative where it > could have printed human hint. > > libbpf's pr_perm_msg() hint is the only one where libbpf cares about its users. > All other messages are useful to libbpf developers and not its users. "Couldn't load trivial BPF program. Make sure your kernel supports BPF (CONFIG_BPF_SYSCALL=y) and/or that RLIMIT_MEMLOCK is set to big enough value." "kernel doesn't support global data" "can't attach BPF program w/o FD (did you load it?)" "specified path %s is not on BPF FS" "vmlinux BTF is not found" -- we should clearly add "you need a kernel built with CONFIG_DEBUG_INFO_BTF=y" "invalid relo for \'%s\' in special section 0x%x; forgot to initialize global var?.." And so on. Sure, we can have more and better error messages, no one is claiming libbpf is perfect or done. That doesn't mean, though, that the kernel itself can't do better in terms of error reporting. But you clearly don't think it's a real problem, so I'll let it rest, thank you. > > The kernel verifier messages suck as well. They need to be improved. > But this thread 'lets add strings everywhere and users will be happy' is > completely missing the mark.
On Wed, Jul 15, 2020 at 06:11:39PM -0700, Andrii Nakryiko wrote: > > > > On Wed, Jul 15, 2020, Andrii Nakryiko wrote: > > > > > > Inability to figure out what's wrong when using BPF is at the top of > > > complaints from many users, together with hard to understand logs from > > > verifier. > > > > Only the second part is true. All users are complaining about the verifier. > > No one is complaing that failed prog attach is somehow lacking string message. > > Ok, next time I'll be helping someone to figure out another -EINVAL, > I'll remember to reassure them that it's not really frustrating, not a > guess game, and not a time sink at all. When the next time the users hit EINVAL due to _their_ usage and not due to kernel or libbpf bug and libbpf couldn't do anything to make the error user friendly then yes please raise it up. > > The users are also complaing about libbpf being too verbose. > > Very well might be, but apart from your complaints on that patch > adding program loading debug message, I can't remember a single case > when someone complained about that. Do you have a link for me to get > some context? > > > Yet you've refused to address the verbosity where it should be reduced and > > It's open-source, everyone is welcome to submit their patches. Just > because I don't think we need to remove some log messages and thus not > am creating such patches, doesn't mean it can't be done by someone > else. So myself and Toke are wearing 'bpf user' hat in that context. Both of us indicated that libbpf output is too verbose. Your response "just send a patch" is a sure way to turn away more users. > > now refusing to add it where it's needed. > > Can you point to or quote where I refused to add a helpful message to libbpf? see below for detailed example. > > > It's libbpf job to explain users kernel errors. > > To the best of its ability, yes. Unfortunately there were many times > where I, as a human, couldn't figure it out without printk'ing my way > around the kernel. If I can't do that, I can't teach libbpf to do it. > Error codes are just not granular enough to allow distinguishing a lot > of error conditions, either by humans or automatically by libbpf. If there are such cases please bring it up. I'm sure kernel errnos can become more unique. > > > > The same thing is happening with perf_event_open syscall. > > Every one who's trying to code it directly complaining about the kernel. But > > not a single user is complaing about perf syscall when they use libraries and > > tools. Same thing with bpf syscall. libbpf is the interface. It needs to clear > > and to the point. Right now it's not doing it well. elf dump is too verbose and > > unnecessary whereas in other places it says nothing informative where it > > could have printed human hint. > > > > libbpf's pr_perm_msg() hint is the only one where libbpf cares about its users. > > All other messages are useful to libbpf developers and not its users. > > "Couldn't load trivial BPF program. Make sure your kernel supports BPF > (CONFIG_BPF_SYSCALL=y) and/or that RLIMIT_MEMLOCK is set to big enough > value." > "kernel doesn't support global data" > "can't attach BPF program w/o FD (did you load it?)" > "specified path %s is not on BPF FS" > "vmlinux BTF is not found" -- we should clearly add "you need a kernel > built with CONFIG_DEBUG_INFO_BTF=y" > "invalid relo for \'%s\' in special section 0x%x; forgot to initialize > global var?.." sure. those count too. > That doesn't mean, though, that the kernel itself can't do better in > terms of error reporting. But you clearly don't think it's a real > problem, so I'll let it rest, thank you. It is a real problem and libbpf has to step up to address it. The kernel does everything it can already. Let's take this raw_tp_open patches as an example. Currently raw_tp_open will EINVAL if prog_fd is incorrect, tp name is not specified or expected_attach_type doesn't match. If that happens it's a _libbpf_ bug. It's not a user mistake. With Toke's patches tgt prog_fd and btf_id are added. Both can be incorrect. If that happens it's most likely libbpf bug. The user is writing their bpf_prog.c file with SEC("freplace/name") libbpf could have messed up prog_fd and btf_id resolution. The user didn't specify btf_id as a raw integer. It was a libbpf job to convert user's string name into prog_fd and btf_id in the first place. If libbpf messes it it shouldn't rely on the kernel to catch such bugs. It's not a job of the kernel to point out bugs in the libraries that suppose to be tightly coupled with the kernel. When libbpf is doing its job correctly tgt_prog_fd and tgt_btf_id are valid and BTF of the extension prog can miscompare with BTF of target prog it's trying to attach to via raw_tp_open with these two arguments. Take a look at Toke's patches. That comparison is done via btf_check_type_match(). The most helpful kernel message in such case will be 'arg2 in foo() has size 4 while bar() has 2'. That's the best the kernel can do. Yet it's very user unfriendly. The kernel has no infra today to print BTF in a human friendly way. But libbpf has BTF dumper. btf_dump__emit_type_decl() alone is magnitude friendlier to users than kernel messages. When the kernel's raw_tp_open syscall realizes that BTFs don't match it can return single unique errno and libbpf can tell the user: " Function prototype of BPF prog: int ext_bpf_prog(struct __sk_buff *); doesn't match function prototype of target prog: int tgt_prog_name(struct xdp_md *); " Everything is available in libbpf to print such error in human friendly way. Whereas the kernel has no ability to do so. If that message was coming from the kernel it would have been from btf_check_type_match() and would be: "arg1 in ext_bpf_prog() is not a pointer to context". Compare that to above libbpf error message. I think the difference is pretty drastic. libbpf by far is a better place to print human friendly errors. I'm arguing that raw_tp_open with additional tgt_prog_fd and tgt_btf_id parameters is not much different from sys_kcmp syscall and from memcmp() libc function. They compare two objects and return equal or not. It's not a job of the kernel to say 'objects are not equal because byte 2 is different'. Such message is not helpeful to users. It's libbpf job to print two BTFs in human friendly way when the kernel found them different. It's easier for human to glance over two function prototypes and spot the difference instead of 'arg1 is not a pointer to context'. So please make libbpf user friendly. Enough of pointing fingers at the kernel.
On Wed, Jul 15, 2020 at 10:44 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Wed, Jul 15, 2020 at 06:11:39PM -0700, Andrii Nakryiko wrote: > > > > > > On Wed, Jul 15, 2020, Andrii Nakryiko wrote: > > > > > > > > Inability to figure out what's wrong when using BPF is at the top of > > > > complaints from many users, together with hard to understand logs from > > > > verifier. > > > > > > Only the second part is true. All users are complaining about the verifier. > > > No one is complaing that failed prog attach is somehow lacking string message. > > > > Ok, next time I'll be helping someone to figure out another -EINVAL, > > I'll remember to reassure them that it's not really frustrating, not a > > guess game, and not a time sink at all. > > When the next time the users hit EINVAL due to _their_ usage and not > due to kernel or libbpf bug and libbpf couldn't do anything to make > the error user friendly then yes please raise it up. I know this is futile to convince you anyways, so I won't go dig all the details, but here are few general examples. With the benefit of hindsight in a lot of those cases libbpf can do extra checks and guessing (though bad guess is worse than no guess), but that doesn't scale because of the sheer amount of possible situations. What I personally went through when I was building runqslower: 1. You write a simple BPF program, open + load. You run it, you get EPERM. 2. You go check if you ran the program with sudo, retry with sudo. 3. You still get EPERM. You bang your head against the wall for 30 minutes, you recall about RLIMIT_MEMLOCK, you bump it up. 4. If you are happy, now everything works. I don't remember now if I had to bang my head some more until I got the initial minimal version of runqslower successfully load BPF program. Some popular other issues I can recall. 1. People get excited and try to use fentry/fexit, get -EINVAL. Most probably it's because they don't have a recent enough kernel. Otherwise (if they know libbpf behavior), they'd probably have gotten something about not being able to find the desired BTF func in the kernel. Even in that case, is that that they don't have BTF at all (CONFIG_DEBUG_BTF_INFO=y)? Or it's just an old one with no FUNCs because pahole isn't recent enough? Kernel build won't mention that you need 1.16 for fentry/fexit, of course. And so on. This is actually not the worst case, because I can walk them through this process. 2. When dealing with cgroups. You get EINVAL for every tiny misstep. You want to replace the BPF program, you get EINVAL. Maybe you forgot one of the necessary flags? Or maybe you specified incompatible flags (BPF_F_REPLACE | BPF_F_ALLOW_MULTI)? Or the parent cgroup has a non-overridable BPF program attached already? Or maybe prog FD is wrong, or cgroup FD is wrong, or?... 3. Someone gets excited about libbpf-tools in BCC, decides to convert a tool. Needs to dump the hashmap fast. There is BATCH_LOOKUP, nice! They try, they get EINVAL. Most probably outdated kernel, but I'll need to dig into kernel sources to see what else could go wrong. 4. Try using some of the low-level APIs from libbpf/bpf.h, like the same cgroup program replacement. You'll get E2BIG and will scratch your head for a while, checking how the list of attached BPF programs can be too big, if it's the only one. Just to realize that again, your kernel is just a touch too old and just complains about non-zero unknown field in bpf_attr. 5. Just few days ago, one user was doing bpf_program__attach_tracepoint(): libbpf: program 'tp/sched/sched_process_exit': failed to attach to pfd 10: File exists After a bit of back and forth turns out he had a second instance of that program running in parallel. Good, I quickly realize that it's an old kernel and it doesn't allow me to attach more than 1 BPF program to the tracepoint. Case solved. But what about many other BPF users that do not have access to someone developing BPF in the kernel? And the list goes on. Even if it was my full-time job just to anticipate all the misuses and try to check/guess them in libbpf, that wouldn't work and won't scale. And all I was asking (and not finger pointing or blaming anything or anyone) to have a mechanism in the kernel to get a single-line human-readable hint as to which one out of many EINVAL/E2BIG/ENOENT/EPERM conditions was hit. No need to immediately convert all of them, we could have gradually added that, prioritizing common and most probable ones to hit. But alas, it's easier to blame me for not perfecting libbpf and anticipating all that beforehand. > > > > The users are also complaing about libbpf being too verbose. > > > > Very well might be, but apart from your complaints on that patch > > adding program loading debug message, I can't remember a single case > > when someone complained about that. Do you have a link for me to get > > some context? > > > > > Yet you've refused to address the verbosity where it should be reduced and > > > > It's open-source, everyone is welcome to submit their patches. Just > > because I don't think we need to remove some log messages and thus not > > am creating such patches, doesn't mean it can't be done by someone > > else. > > So myself and Toke are wearing 'bpf user' hat in that context. > Both of us indicated that libbpf output is too verbose. > Your response "just send a patch" is a sure way to turn away more users. > I can't find any such complaint from Toke in this thread, and can't really recall something like that from recent discussions. I'd rather have him speak for himself. But neither you nor Toke are "just bpf users", you are totally capable of creating a patch addressing something that bugs you. Upstream BPF users have never complained to me about the verbosity of logs. Furthermore, they rarely (if ever) see those **debug** logs, until I actually ask them to turn it on to help me help them. Yes, debug-level libbpf logs are driven more towards me and other people (like Toke) who know internals and are helping users with issues. I actually try to help BPF users, not ask them for patches to libbpf, and you can find many patches from me where I was improving/fixing/adding to kernel and libbpf as a response to both upstream and internal BPF users. I am far from asking for recognition of that, but please don't make false accusations either. > > > now refusing to add it where it's needed. > > > > Can you point to or quote where I refused to add a helpful message to libbpf? > > see below for detailed example. > > > > > > It's libbpf job to explain users kernel errors. > > > > To the best of its ability, yes. Unfortunately there were many times > > where I, as a human, couldn't figure it out without printk'ing my way > > around the kernel. If I can't do that, I can't teach libbpf to do it. > > Error codes are just not granular enough to allow distinguishing a lot > > of error conditions, either by humans or automatically by libbpf. > > If there are such cases please bring it up. I'm sure kernel errnos > can become more unique. See above for a few typical examples. I don't keep track of all the issues I had to debug myself or help others with. > > > > > > > The same thing is happening with perf_event_open syscall. > > > Every one who's trying to code it directly complaining about the kernel. But > > > not a single user is complaing about perf syscall when they use libraries and > > > tools. Same thing with bpf syscall. libbpf is the interface. It needs to clear > > > and to the point. Right now it's not doing it well. elf dump is too verbose and > > > unnecessary whereas in other places it says nothing informative where it > > > could have printed human hint. > > > > > > libbpf's pr_perm_msg() hint is the only one where libbpf cares about its users. > > > All other messages are useful to libbpf developers and not its users. > > > > "Couldn't load trivial BPF program. Make sure your kernel supports BPF > > (CONFIG_BPF_SYSCALL=y) and/or that RLIMIT_MEMLOCK is set to big enough > > value." > > "kernel doesn't support global data" > > "can't attach BPF program w/o FD (did you load it?)" > > "specified path %s is not on BPF FS" > > "vmlinux BTF is not found" -- we should clearly add "you need a kernel > > built with CONFIG_DEBUG_INFO_BTF=y" > > "invalid relo for \'%s\' in special section 0x%x; forgot to initialize > > global var?.." > > sure. those count too. great, thanks for acknowledging > > > That doesn't mean, though, that the kernel itself can't do better in > > terms of error reporting. But you clearly don't think it's a real > > problem, so I'll let it rest, thank you. > > It is a real problem and libbpf has to step up to address it. > The kernel does everything it can already. > > Let's take this raw_tp_open patches as an example. > Currently raw_tp_open will EINVAL if prog_fd is incorrect, tp name > is not specified or expected_attach_type doesn't match. > If that happens it's a _libbpf_ bug. It's not a user mistake. > With Toke's patches tgt prog_fd and btf_id are added. > Both can be incorrect. If that happens it's most likely libbpf bug. > The user is writing their bpf_prog.c file with SEC("freplace/name") > libbpf could have messed up prog_fd and btf_id resolution. > The user didn't specify btf_id as a raw integer. > It was a libbpf job to convert user's string name into prog_fd > and btf_id in the first place. If libbpf messes it it shouldn't rely > on the kernel to catch such bugs. It's not a job of the kernel to > point out bugs in the libraries that suppose to be tightly > coupled with the kernel. > When libbpf is doing its job correctly tgt_prog_fd and tgt_btf_id are > valid and BTF of the extension prog can miscompare with BTF of > target prog it's trying to attach to via raw_tp_open with these two > arguments. Take a look at Toke's patches. That comparison is done via > btf_check_type_match(). The most helpful kernel message in such case > will be 'arg2 in foo() has size 4 while bar() has 2'. > That's the best the kernel can do. Yet it's very user unfriendly. > The kernel has no infra today to print BTF in a human friendly way. > But libbpf has BTF dumper. btf_dump__emit_type_decl() alone is > magnitude friendlier to users than kernel messages. > When the kernel's raw_tp_open syscall realizes that BTFs don't match > it can return single unique errno and libbpf can tell the user: > " > Function prototype of BPF prog: > int ext_bpf_prog(struct __sk_buff *); > doesn't match function prototype of target prog: > int tgt_prog_name(struct xdp_md *); > " > Everything is available in libbpf to print such error in human friendly way. > Whereas the kernel has no ability to do so. > If that message was coming from the kernel it would have been from > btf_check_type_match() and would be: > "arg1 in ext_bpf_prog() is not a pointer to context". > Compare that to above libbpf error message. > I think the difference is pretty drastic. > libbpf by far is a better place to print human friendly errors. > > I'm arguing that raw_tp_open with additional tgt_prog_fd and tgt_btf_id > parameters is not much different from sys_kcmp syscall and from memcmp() libc > function. They compare two objects and return equal or not. > It's not a job of the kernel to say > 'objects are not equal because byte 2 is different'. > Such message is not helpeful to users. > It's libbpf job to print two BTFs in human friendly way when the kernel > found them different. It's easier for human to glance over two function > prototypes and spot the difference instead of 'arg1 is not a pointer to context'. Ok, so I assume that is an example where I refused to add helpful error reporting to libbpf? First, thanks for taking the time to elaborate on this. I'm sure Toke will appreciate it, though I have nothing to do with arguing for or against this. It was Toke's scepticism, not mine, that you are addressing. I never said we shouldn't do this. I do agree that in some cases libbpf has the same or more information and it's easier to pretty-print extra context, like in this case, which we should take advantage of. I started this topic to discuss extending bpf() syscall in general, regardless of this particular change and feature. > > So please make libbpf user friendly. Enough of pointing fingers at the kernel. It's an honor you are associating me personally and libbpf, but I still hope we can move this forward as a community and collective project. Of course let's make libbpf more user-friendly where possible and feasible, but I don't think arguing about debug-level logging in libbpf is the right place to start. I don't know where "finger pointing" comes from, I had no intent nor I said anything to offend either the kernel or you personally. Proposing changes up for discussion shouldn't be considered a form of attacking or finger pointing, I hope.
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes: >> So myself and Toke are wearing 'bpf user' hat in that context. >> Both of us indicated that libbpf output is too verbose. >> Your response "just send a patch" is a sure way to turn away more users. >> > > I can't find any such complaint from Toke in this thread, and can't > really recall something like that from recent discussions. I'd rather > have him speak for himself. I think what I said (not in this thread, way back during some other discussion) was that I agreed that libbpf was being too verbose by dumping all the sections and relocations it finds when reading an ELF file, which causes the useful error messages to get lost. I would like to see those messages demoted to another log level, or removed altogether. I won't have time to look more at this right now, but I do plan to circle back to it: I agree with you that we need to make this more friendly. And yes, I also think this should include finding a way to disambiguate between different conditions leading to the same error from the kernel. I've run into a lot of the same issues as you when supporting people who are new to BPF - thank you for the extensive list! -Toke
On Thu, Jul 16, 2020 at 12:59:30PM -0700, Andrii Nakryiko wrote: > On Wed, Jul 15, 2020 at 10:44 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Wed, Jul 15, 2020 at 06:11:39PM -0700, Andrii Nakryiko wrote: > > > > > > > > On Wed, Jul 15, 2020, Andrii Nakryiko wrote: > > > > > > > > > > Inability to figure out what's wrong when using BPF is at the top of > > > > > complaints from many users, together with hard to understand logs from > > > > > verifier. > > > > > > > > Only the second part is true. All users are complaining about the verifier. > > > > No one is complaing that failed prog attach is somehow lacking string message. > > > > > > Ok, next time I'll be helping someone to figure out another -EINVAL, > > > I'll remember to reassure them that it's not really frustrating, not a > > > guess game, and not a time sink at all. > > > > When the next time the users hit EINVAL due to _their_ usage and not > > due to kernel or libbpf bug and libbpf couldn't do anything to make > > the error user friendly then yes please raise it up. > > I know this is futile to convince you anyways, so I won't go dig all > the details, but here are few general examples. With the benefit of > hindsight in a lot of those cases libbpf can do extra checks and > guessing (though bad guess is worse than no guess), but that doesn't > scale because of the sheer amount of possible situations. > > What I personally went through when I was building runqslower: > > 1. You write a simple BPF program, open + load. You run it, you get EPERM. > 2. You go check if you ran the program with sudo, retry with sudo. > 3. You still get EPERM. You bang your head against the wall for 30 > minutes, you recall about RLIMIT_MEMLOCK, you bump it up. > 4. If you are happy, now everything works. I don't remember now if I > had to bang my head some more until I got the initial minimal version > of runqslower successfully load BPF program. But we have libbpf hinting on that already? Are you saying that hint wasn't working somehow? > > Some popular other issues I can recall. > > 1. People get excited and try to use fentry/fexit, get -EINVAL. Most > probably it's because they don't have a recent enough kernel. > Otherwise (if they know libbpf behavior), they'd probably have gotten > something about not being able to find the desired BTF func in the > kernel. Even in that case, is that that they don't have BTF at all > (CONFIG_DEBUG_BTF_INFO=y)? Or it's just an old one with no FUNCs > because pahole isn't recent enough? Kernel build won't mention that > you need 1.16 for fentry/fexit, of course. And so on. This is actually > not the worst case, because I can walk them through this process. Exactly. Either old kernel or missing config in the kernel. libbpf is above that. It could have provided a hint. When libbpf is processing SEC("fentry/") and it fails to load with empty verifier log it could start probing. Only libbpf can do it. Kernel is helpless here. Say we change the kernel errno for all unsuported prog types and maps it would return ENOTSUPP or something. Would it really help the situation? Probably not. There will be old kernels and the same usability issue as you described. > 2. When dealing with cgroups. You get EINVAL for every tiny misstep. > You want to replace the BPF program, you get EINVAL. Maybe you forgot > one of the necessary flags? Or maybe you specified incompatible flags > (BPF_F_REPLACE | BPF_F_ALLOW_MULTI)? Or the parent cgroup has a > non-overridable BPF program attached already? Or maybe prog FD is > wrong, or cgroup FD is wrong, or?... A wrong FD for prog or cgroup would be a libbpf bug. imo that's the case where kernel must not return a string even if there is a log_buf facility. non-overridable already attached is EPERM. It's not EINVAL. multi-prog vs overridable is also EPERM. Say, the kernel has log_buf for hierarchy_allows_attach(). what kind of message do you think it can print that will be user friendly? "attach is not allowed because there is a parent cgroup that has no override flag set" How is it much better than EPERM? Two differentiate between these two EPERM? Both to me look like the same category if permission checks. I don't know how to the kernel can print full cgroup path here. It has 'struct cgroup *'. Then go to kernfs? then what? cgroupfs is mounted somwhere. To have meaningul message the user would want to see something like: " /sys/fs/cgroup/my_container <- has no-override bpf prog, hence the kernel doesn't allow attach at: /sys/fs/cgroup/my_container/here " Only libbpf can print such message. The user gave libbpf a string path. Attaching is relative to that. Whereas the kernel has hard time with mounts/paths/namespaces. May be to solve this cgroup attaching ambiguity we can add a libbpf helper that can walk hierarchy and print state? Either libbpf calls it automatically or user can trigger it? > 3. Someone gets excited about libbpf-tools in BCC, decides to convert > a tool. Needs to dump the hashmap fast. There is BATCH_LOOKUP, nice! > They try, they get EINVAL. Most probably outdated kernel, but I'll > need to dig into kernel sources to see what else could go wrong. right. that is similar to 1. I don't see how kernel could have helped. Say sys_bpf got top level log_buf for _all_ commands. The user is passing a command that is unsupported. The kernel has no clue whether log_buf is even there in bpf_attr. It cannot return a string. Only errno. If we change that errno from current EINVAL to ENOTSUPP... it would go back to the point I made above. Not really helping much. imo that's another case where libbpf probing can go long way to improve user experience. > > 4. Try using some of the low-level APIs from libbpf/bpf.h, like the > same cgroup program replacement. You'll get E2BIG and will scratch > your head for a while, checking how the list of attached BPF programs > can be too big, if it's the only one. Just to realize that again, your > kernel is just a touch too old and just complains about non-zero > unknown field in bpf_attr. The low-level APIs of libbpf is indeed a pain. There are also old and partially broken libbpf APIs. I think we need to start aggressively deprecating some of them. Especially those that cause debugging issues to users. > 5. Just few days ago, one user was doing bpf_program__attach_tracepoint(): > > libbpf: program 'tp/sched/sched_process_exit': failed to attach to pfd > 10: File exists > > After a bit of back and forth turns out he had a second instance of > that program running in parallel. Good, I quickly realize that it's an > old kernel and it doesn't allow me to attach more than 1 BPF program > to the tracepoint. Case solved. But what about many other BPF users > that do not have access to someone developing BPF in the kernel? I similarly don't see how kernel string would have helped. It would have said the same thing: "cannot attach". > And the list goes on. Even if it was my full-time job just to > anticipate all the misuses and try to check/guess them in libbpf, that > wouldn't work and won't scale. > > And all I was asking (and not finger pointing or blaming anything or > anyone) to have a mechanism in the kernel to get a single-line > human-readable hint as to which one out of many > EINVAL/E2BIG/ENOENT/EPERM conditions was hit. No need to immediately > convert all of them, we could have gradually added that, prioritizing > common and most probable ones to hit. I'm not against adding log_buf to more bpf commands. In many cases they are needed. I'm against the blank statement that _all_ bpf commands need log_buf and that is somehow will solve user debug nightmares. > project. Of course let's make libbpf more user-friendly where possible > and feasible, Let's do so. This thread jumped to early conclusion that log_buf for all bpf commands will magically improve user experience. Both you and Toke were happy to conclude that "horrible kernel UI/UX" is responsible for everything and it has to be the one to fix. I don't think that's the case and I hope working through the examples of bad user experience above made it clear.
On Thu, Jul 16, 2020 at 8:09 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Jul 16, 2020 at 12:59:30PM -0700, Andrii Nakryiko wrote: > > On Wed, Jul 15, 2020 at 10:44 PM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Wed, Jul 15, 2020 at 06:11:39PM -0700, Andrii Nakryiko wrote: > > > > > > > > > > On Wed, Jul 15, 2020, Andrii Nakryiko wrote: > > > > > > > > > > > > Inability to figure out what's wrong when using BPF is at the top of > > > > > > complaints from many users, together with hard to understand logs from > > > > > > verifier. > > > > > > > > > > Only the second part is true. All users are complaining about the verifier. > > > > > No one is complaing that failed prog attach is somehow lacking string message. > > > > > > > > Ok, next time I'll be helping someone to figure out another -EINVAL, > > > > I'll remember to reassure them that it's not really frustrating, not a > > > > guess game, and not a time sink at all. > > > > > > When the next time the users hit EINVAL due to _their_ usage and not > > > due to kernel or libbpf bug and libbpf couldn't do anything to make > > > the error user friendly then yes please raise it up. > > > > I know this is futile to convince you anyways, so I won't go dig all > > the details, but here are few general examples. With the benefit of > > hindsight in a lot of those cases libbpf can do extra checks and > > guessing (though bad guess is worse than no guess), but that doesn't > > scale because of the sheer amount of possible situations. > > > > What I personally went through when I was building runqslower: > > > > 1. You write a simple BPF program, open + load. You run it, you get EPERM. > > 2. You go check if you ran the program with sudo, retry with sudo. > > 3. You still get EPERM. You bang your head against the wall for 30 > > minutes, you recall about RLIMIT_MEMLOCK, you bump it up. > > 4. If you are happy, now everything works. I don't remember now if I > > had to bang my head some more until I got the initial minimal version > > of runqslower successfully load BPF program. > > But we have libbpf hinting on that already? > Are you saying that hint wasn't working somehow? At that time the message was just "failed to load primitive BPF program", we added mentions of CONFIG_BPF_SYSCALL and RLIMIT_MEMLOCK much later. And even current message might be misleading, because it might be that you are not a root (or, in more modern kernels, don't have CAP_BPF). Maybe not, if we are relying on unprivileged BPF for primitive prog, I don't know. > > > > > Some popular other issues I can recall. > > > > 1. People get excited and try to use fentry/fexit, get -EINVAL. Most > > probably it's because they don't have a recent enough kernel. > > Otherwise (if they know libbpf behavior), they'd probably have gotten > > something about not being able to find the desired BTF func in the > > kernel. Even in that case, is that that they don't have BTF at all > > (CONFIG_DEBUG_BTF_INFO=y)? Or it's just an old one with no FUNCs > > because pahole isn't recent enough? Kernel build won't mention that > > you need 1.16 for fentry/fexit, of course. And so on. This is actually > > not the worst case, because I can walk them through this process. > > Exactly. Either old kernel or missing config in the kernel. > libbpf is above that. It could have provided a hint. > When libbpf is processing SEC("fentry/") and it fails to load with > empty verifier log it could start probing. As I said, listing these examples I didn't expect to convince you, because I knew you'd propose one or another way how libbpf can provide more "diagnostics". In a lot of cases it can. But usually after the fact (debugging it in a painful way). I don't believe we can anticipate all cases, though, and even those that we can, I'm not sure how scalable it is to do all that. Requiring anyone adding any new API to provide a comprehensive set of probing/checks/diagnostics to prevent possible error conditions seems to be way too harsh on contributors. Unfortunately, details escape me now for the case I had in mind, where I had absolutely no clue which of many checks are failing, even after reading kernel source code very carefully. Too bad, but I don't keep such detailed notes. > Only libbpf can do it. Kernel is helpless here. > Say we change the kernel errno for all unsuported prog types and maps > it would return ENOTSUPP or something. > Would it really help the situation? IMO, if the kernel just prints out "Unknown BPF command 123" or "Unknown map type 345" that would be already a nice improvement. > Probably not. There will be old kernels and the same usability > issue as you described. Of course old kernels will still have those problems, but it never stopped us from fixing and improving the kernel moving forward so that eventually a better version is available (almost) everywhere. > > > 2. When dealing with cgroups. You get EINVAL for every tiny misstep. > > You want to replace the BPF program, you get EINVAL. Maybe you forgot > > one of the necessary flags? Or maybe you specified incompatible flags > > (BPF_F_REPLACE | BPF_F_ALLOW_MULTI)? Or the parent cgroup has a > > non-overridable BPF program attached already? Or maybe prog FD is > > wrong, or cgroup FD is wrong, or?... > > A wrong FD for prog or cgroup would be a libbpf bug. cgroup fd is the direct input parameter for bpf_program__attach_cgroup() and bpf_prog_attach() APIs. But, of course, we can also add a variant that takes a string path to cgroup and fetches it. Or we can double-check that provided cgroup FD is actually cgroup FD. > imo that's the case where kernel must not return a string even > if there is a log_buf facility. > > non-overridable already attached is EPERM. It's not EINVAL. > multi-prog vs overridable is also EPERM. Attaching the same BPF prog is EINVAL, though. But it doesn't matter which error code specifically, either EPERM, or EINVAL, or even E2BIG have many possible meanings in bpf() syscall. > Say, the kernel has log_buf for hierarchy_allows_attach(). > what kind of message do you think it can print that will be user friendly? > "attach is not allowed because there is a parent cgroup > that has no override flag set" > How is it much better than EPERM? "parent cgroup prevents attaching BPF program" would already lead into the right direction. It's not perfect and full, but better than more generic EPERM. But sure, EPERM has less variants than EINVAL, so I'm personally always happier with EPERM, than with EINVAL. > Two differentiate between these two EPERM? > Both to me look like the same category if permission checks. > I don't know how to the kernel can print full cgroup path here. > It has 'struct cgroup *'. Then go to kernfs? then what? > cgroupfs is mounted somwhere. > To have meaningul message the user would want to see something like: > " > /sys/fs/cgroup/my_container <- has no-override bpf prog, hence > the kernel doesn't allow attach at: > /sys/fs/cgroup/my_container/here > " > Only libbpf can print such message. > The user gave libbpf a string path. Attaching is relative to that. > Whereas the kernel has hard time with mounts/paths/namespaces. Sure, I didn't even hope for cgroup path or anything like that. Maybe cgroup ID would be nice, don't know, but even without that just saying that it's parent cgroup that's the culprit is a big step forward, IMO. > > May be to solve this cgroup attaching ambiguity we can add a libbpf > helper that can walk hierarchy and print state? > Either libbpf calls it automatically or user can trigger it? > > > 3. Someone gets excited about libbpf-tools in BCC, decides to convert > > a tool. Needs to dump the hashmap fast. There is BATCH_LOOKUP, nice! > > They try, they get EINVAL. Most probably outdated kernel, but I'll > > need to dig into kernel sources to see what else could go wrong. > > right. that is similar to 1. I don't see how kernel could have helped. > Say sys_bpf got top level log_buf for _all_ commands. > The user is passing a command that is unsupported. > The kernel has no clue whether log_buf is even there in bpf_attr. > It cannot return a string. Only errno. > If we change that errno from current EINVAL to ENOTSUPP... > it would go back to the point I made above. > Not really helping much. > imo that's another case where libbpf probing can go long way to > improve user experience. log_buf can't help existing kernels. Period. No one is arguing or expecting that. But moving forward, just having that "unknown command 123" would be great. But yeah, of course libbpf can create a probing map and try to do BATCH_LOOKUP, to detect BATCH_LOOKUP support. > > > > > 4. Try using some of the low-level APIs from libbpf/bpf.h, like the > > same cgroup program replacement. You'll get E2BIG and will scratch > > your head for a while, checking how the list of attached BPF programs > > can be too big, if it's the only one. Just to realize that again, your > > kernel is just a touch too old and just complains about non-zero > > unknown field in bpf_attr. > > The low-level APIs of libbpf is indeed a pain. > There are also old and partially broken libbpf APIs. > I think we need to start aggressively deprecating some of them. > Especially those that cause debugging issues to users. > > > 5. Just few days ago, one user was doing bpf_program__attach_tracepoint(): > > > > libbpf: program 'tp/sched/sched_process_exit': failed to attach to pfd > > 10: File exists > > > > After a bit of back and forth turns out he had a second instance of > > that program running in parallel. Good, I quickly realize that it's an > > old kernel and it doesn't allow me to attach more than 1 BPF program > > to the tracepoint. Case solved. But what about many other BPF users > > that do not have access to someone developing BPF in the kernel? > > I similarly don't see how kernel string would have helped. > It would have said the same thing: "cannot attach". This one is for perf subsystem, actually, it's its PERF_EVENT_IOC_SET_BPF ioctl (until we add bpf_link for perf_event attachment). "limit of allowed BPF programs reached" would be good enough. "cannot attach" is too generic. > > > And the list goes on. Even if it was my full-time job just to > > anticipate all the misuses and try to check/guess them in libbpf, that > > wouldn't work and won't scale. > > > > And all I was asking (and not finger pointing or blaming anything or > > anyone) to have a mechanism in the kernel to get a single-line > > human-readable hint as to which one out of many > > EINVAL/E2BIG/ENOENT/EPERM conditions was hit. No need to immediately > > convert all of them, we could have gradually added that, prioritizing > > common and most probable ones to hit. > > I'm not against adding log_buf to more bpf commands. > In many cases they are needed. > I'm against the blank statement that _all_ bpf commands need log_buf > and that is somehow will solve user debug nightmares. There was no such statement. There was no statement that it will "solve" debug nightmares. But it will certainly ease the pain. My proposal was about adding the ability to emit something to log_buf from any of the BPF commands, if that BPF command chooses to provide extra error information. The whole point of this was to avoid adding log_buf in command-specific ways (as Toke was doing in the patch that I used to initiate the discussion) and do it once for entire syscall, so that we can gradually utilize it where it makes most sense. > > > project. Of course let's make libbpf more user-friendly where possible > > and feasible, > > Let's do so. > This thread jumped to early conclusion that log_buf for all bpf commands > will magically improve user experience. Both you and Toke were happy to > conclude that "horrible kernel UI/UX" is responsible for everything and > it has to be the one to fix. You somehow concluded that it has to be either kernel or libbpf that has to be improved, not both. I don't know how you came to this conclusion. I didn't say or mean that, neither I read that from Toke's replies. Let me walk through relevant parts verbatim. Me: > > > > I think BPF syscall would benefit from common/generalized log support > > > > across all commands, given how powerful/complex it already is. > > > > Sometimes it's literally impossible to understand why one gets -EINVAL > > > > without adding printk()s in the kernel. Toke: > > > Yes, I agree! This is horrible UI! Me: > > UI?.. It's a perfectly fine and extensible API for all functionality > > it provides, it just needs a better human-readable feedback mechanism, > > which is what I'm proposing. Error codes are not working when you have >> so many different situations that can result in error. ... Some technical details back and forth follows ... You: > I really don't like 'common attr across all commands'. > Both of you are talking as libbpf developers who occasionally need to > add printk-s to the kernel. That is not an excuse to bloat api that will be > useful to two people. You not liking this I get, this is fine, we all have our preferences, we don't have to agree on everything. But then you just jumped into conclusion about our motivation and claimed that it will be useful to only two people (presumably me and Toke). I'm also not sure about bloating the API, given that such API is already part of BPF_PROG_LOAD command, but there is no point in discussing such technical details anymore. Toke said that it's a horrible story to debug such generic errors (and I concur, it is), not that the kernel is horrible itself or was written by horrible people. I don't feel offended if libbpf provides horrible user experience in parts I've implemented or changed. I might disagree about the qualification in some cases, but I won't get offended about someone not liking my code, API or design. > I don't think that's the case and I hope > working through the examples of bad user experience above made it clear. Not at all. All you showed is that once someone runs into some specific API misuse and debugs it to success, then it's usually pretty obvious how libbpf could have helped doing additional diagnostics. With the benefit of hindsight. I agree that if such diagnostics are reliable and the situation itself is common and experienced by multiple users, then it might make sense to add such checks to libbpf. But I also don't think it's always possible to diagnose something automatically with 100% confidence. We can give hints, but misdiagnosing the problem can just further confuse things instead. Also quite often such problems are one-offs, which doesn't make them any less confusing and frustrating, but custom diagnostics for every such case has a potential of bloating libbpf beyond recognition and I'm not convinced that we should diagnose all of them. That's what I meant that this is not a scalable approach to just say "fix libbpf to be more user friendly, kernel does its best already". This thread was about generalizing log_buf to entire bpf(), even though it got diverted into something entirely different. I would still be interested to know the technical implications of syscall using extra args (like in my original proposal #1, about adding 2 extra args if some flag in the existing argument is set). Is that even safe to do with syscalls? Even if we don't do it, it's still useful to know for the future. I think the syscall() wrapper is inherently vararg and accepts an arbitrary number of args, so I assumed it's actually ok, but don't know how to prove it.
On Fri, Jul 17, 2020 at 08:54:45PM -0700, Andrii Nakryiko wrote: > > > Only libbpf can do it. Kernel is helpless here. > > Say we change the kernel errno for all unsuported prog types and maps > > it would return ENOTSUPP or something. > > Would it really help the situation? > > IMO, if the kernel just prints out "Unknown BPF command 123" or > "Unknown map type 345" that would be already a nice improvement. ... > log_buf can't help existing kernels. Period. No one is arguing or > expecting that. But moving forward, just having that "unknown command > 123" would be great. > > But yeah, of course libbpf can create a probing map and try to do > BATCH_LOOKUP, to detect BATCH_LOOKUP support. Also with BTF the kernel is self documented. The following will print all commands that kernel supports: bpftool btf dump file ./bld_x64/vmlinux |grep -A40 bpf_cmd and with 'grep BPF_MAP_TYPE_' all supported maps. For older kernels there is 'bpftool feature probe'. Since libbpf reads vmlinux BTF anyway it could have got a knowledge of all the features it supports based on BTF. But let's continue this thought experiment for augmenting error reporting with a string. For 'Unknown BPF command 123' to work log_buf needs to passed outside of 'union bpf_attr'. Probably as 4th argument to sys_bpf ? and then a bit in 'int cmd' would need to be burned to indicate that 4th arg is there. Probably size of the arg needs to be passed either as 5th arg or as part of the 'struct bpf_log_buf' so it can be extensible. Using that log_buf directly in SYSCALL_DEFINE[45](bpf, int cmd, ... struct bpf_log_buf *log_buf) will be trivial. But to pass it into any of first level functions (bpf_iter_create, bpf_prog_attach, etc) they would need to gain an extra argument. To pass it all the way into hierarchy_allows_attach() it needs to be added to: __cgroup_bpf_attach cgroup_bpf_attach cgroup_bpf_link_attach link_create Another case of ambiguous 'return -EINVAL' would cause another change to a bunch of function prototypes. So it's better to integrate it into current task_struct to avoid huge code churn. But if we do so what stops us from generalizing log_buf reporting to other syscalls? perf_event_open is in the same category. > This one is for perf subsystem, actually, it's its > PERF_EVENT_IOC_SET_BPF ioctl (until we add bpf_link for perf_event > attachment). clearly not only sys_bpf and sys_perf_event_open, but sys_ioctl would need string support to be a full answer to ambiguous einval-s. > My proposal was about adding the ability to emit something to log_buf > from any of the BPF commands, if that BPF command chooses to provide > extra error information. The whole point of this was to avoid adding > log_buf in command-specific ways (as Toke was doing in the patch that > I used to initiate the discussion) and do it once for entire syscall, > so that we can gradually utilize it where it makes most sense. I don't think that works due to code churn. Whether we pay that price once or 'gradually' it doesn't make it any better. When log_buf is added to an existing command the 'union bpf_attr' is there in the function proto and nothing new needs to passed to a lot of functions. So I certainly prefer Toke's approach of adding log_buf to one specific command if it's really needed there. The alternative is to solve it for all syscalls. > I agree that if such diagnostics are reliable and the situation itself > is common and experienced by multiple users, then it might make sense > to add such checks to libbpf. 'experienced by multiple users' is going to be a judgement call either for libbpf or for kernel. I'm saying let's improve libbpf user-friendlyness everywhere we can. We can always drop these hints later. Unlike kernel messages that might become stable api. One thing is log_buf that the verifier is dumping. It's huge and not parsable. Whereas a string next to return EINVAL may become an uapi. I wouldn't be surprised if some of netlink extack strings got to this level of stability and changing the string may cause somebody to notice and the commit would get reverted. > But I also don't think it's always > possible to diagnose something automatically with 100% confidence. We > can give hints, but misdiagnosing the problem can just further confuse > things instead. I think the possible confusion is fine. The libbpf has an opportunity to try them and remove if things don't work out. The kernel strings would be much more scrutinized and harder to change. > Also quite often such problems are one-offs, which > doesn't make them any less confusing and frustrating, but custom > diagnostics for every such case has a potential of bloating libbpf > beyond recognition bloating libbpf with strings is imo better way to figure out how to improve user experience than bloating kernel. > of them. That's what I meant that this is not a scalable approach to > just say "fix libbpf to be more user friendly, kernel does its best > already". Ok. I will take it back. The kernel can be improved with extra strings here and there, but only if it's done without huge code churn. If current task_struct approach can work and the changes would be limited to something like: diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c index ac53102e244a..244df18728c2 100644 --- a/kernel/bpf/cgroup.c +++ b/kernel/bpf/cgroup.c @@ -196,8 +196,12 @@ static bool hierarchy_allows_attach(struct cgroup *cgrp, return true; cnt = prog_list_length(&p->bpf.progs[type]); WARN_ON_ONCE(cnt > 1); - if (cnt == 1) - return !!(flags & BPF_F_ALLOW_OVERRIDE); + if (cnt == 1) { + ret = !!(flags & BPF_F_ALLOW_OVERRIDE); + if (!ret) + syscall_string("cgroup has non-overridable program in the parent"); + return ret; + } Then such syscall_string reporting mechanism would be solid addition to the kernel. Otherwise the cost of passing explicit log_buf everywhere is not worth it. I think such syscall_string can probably piggy back on "socket local storage into generic local storage" work. Sooner or later we will have per task_struct storage. If that's a single pointer in task_struct that will be used by both task local storage from inside tracing bpf program and by this syscall_string() reporting mechanism then we can land it. May be all syscalls will become stateful. sys_bpf, sys_perf_event_open, sys_ioctl followed by new syscall "give me back the error string". Or may be we can do some asm magic and pass both errno and a string from a syscall at the same time.
On Mon, Jul 20, 2020 at 3:31 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, Jul 17, 2020 at 08:54:45PM -0700, Andrii Nakryiko wrote: > > > > > Only libbpf can do it. Kernel is helpless here. > > > Say we change the kernel errno for all unsuported prog types and maps > > > it would return ENOTSUPP or something. > > > Would it really help the situation? > > > > IMO, if the kernel just prints out "Unknown BPF command 123" or > > "Unknown map type 345" that would be already a nice improvement. > ... > > log_buf can't help existing kernels. Period. No one is arguing or > > expecting that. But moving forward, just having that "unknown command > > 123" would be great. > > > > But yeah, of course libbpf can create a probing map and try to do > > BATCH_LOOKUP, to detect BATCH_LOOKUP support. > > Also with BTF the kernel is self documented. > The following will print all commands that kernel supports: > bpftool btf dump file ./bld_x64/vmlinux |grep -A40 bpf_cmd > and with 'grep BPF_MAP_TYPE_' all supported maps. > For older kernels there is 'bpftool feature probe'. > Since libbpf reads vmlinux BTF anyway it could have got a knowledge > of all the features it supports based on BTF. > Yes. > But let's continue this thought experiment for augmenting error > reporting with a string. > For 'Unknown BPF command 123' to work log_buf needs to passed > outside of 'union bpf_attr'. Probably as 4th argument to sys_bpf ? > and then a bit in 'int cmd' would need to be burned to indicate > that 4th arg is there. Probably size of the arg needs to be passed > either as 5th arg or as part of the 'struct bpf_log_buf' so it can > be extensible. > Using that log_buf directly in > SYSCALL_DEFINE[45](bpf, int cmd, ... struct bpf_log_buf *log_buf) > will be trivial. Right, so far exactly what I proposed as #1 proposal. > But to pass it into any of first level functions (bpf_iter_create, > bpf_prog_attach, etc) they would need to gain an extra argument. > To pass it all the way into hierarchy_allows_attach() it needs to be added to: > __cgroup_bpf_attach > cgroup_bpf_attach > cgroup_bpf_link_attach > link_create I was hoping we can do it in such a way that we won't have to thread through extra input arguments. Current task_struct looked like a good candidate, given syscalls are always happening in a process context. I was considering a per-CPU variable, but CPU migration precludes that. I certainly was hoping that it could be implemented without huge code churn. > > Another case of ambiguous 'return -EINVAL' would cause another change > to a bunch of function prototypes. > So it's better to integrate it into current task_struct to avoid huge code churn. Right, exactly. > > But if we do so what stops us from generalizing log_buf reporting to > other syscalls? perf_event_open is in the same category. Nothing, of course, except for the need to convince other folks. I thought starting with bpf() syscall and proving usefulness in practice was a good first step. Then we could expand that to other syscalls. > > > This one is for perf subsystem, actually, it's its > > PERF_EVENT_IOC_SET_BPF ioctl (until we add bpf_link for perf_event > > attachment). > > clearly not only sys_bpf and sys_perf_event_open, but sys_ioctl would need > string support to be a full answer to ambiguous einval-s. > > > My proposal was about adding the ability to emit something to log_buf > > from any of the BPF commands, if that BPF command chooses to provide > > extra error information. The whole point of this was to avoid adding > > log_buf in command-specific ways (as Toke was doing in the patch that > > I used to initiate the discussion) and do it once for entire syscall, > > so that we can gradually utilize it where it makes most sense. > > I don't think that works due to code churn. Whether we pay that price > once or 'gradually' it doesn't make it any better. See above, here you are assuming explicit passing of log_buf through all functions. If done through current there would be no code churn. > When log_buf is added to an existing command the 'union bpf_attr' is there > in the function proto and nothing new needs to passed to a lot of functions. Not all the functions that do error checking have a bpf_attr pointer anyway, so even in that case we could have a lot of code churn, if we don't do it in a smarter way (i.e., current). > So I certainly prefer Toke's approach of adding log_buf to one specific > command if it's really needed there. > > The alternative is to solve it for all syscalls. I do like the alternative, of course, but I'd start with just bpf() syscall initially. But that's less of a technical choice, of course. > > > I agree that if such diagnostics are reliable and the situation itself > > is common and experienced by multiple users, then it might make sense > > to add such checks to libbpf. > > 'experienced by multiple users' is going to be a judgement call > either for libbpf or for kernel. > I'm saying let's improve libbpf user-friendlyness everywhere we can. > We can always drop these hints later. Unlike kernel messages that > might become stable api. > One thing is log_buf that the verifier is dumping. It's huge and not parsable. > Whereas a string next to return EINVAL may become an uapi. > I wouldn't be surprised if some of netlink extack strings got to this > level of stability and changing the string may cause somebody to notice > and the commit would get reverted. > > > But I also don't think it's always > > possible to diagnose something automatically with 100% confidence. We > > can give hints, but misdiagnosing the problem can just further confuse > > things instead. > > I think the possible confusion is fine. > The libbpf has an opportunity to try them and remove if things don't work out. > The kernel strings would be much more scrutinized and harder to change. > I see where you are coming from w.r.t. stability of those strings, but I hope that's not going to be the case. Otherwise one might argue that kernel WARN() messages could become a stable API. But regardless, "try them and remove if things don't work out" isn't exactly a user-friendly strategy, if that causes pain and confusion. So I think we should still be very careful with what we add to libbpf. > > Also quite often such problems are one-offs, which > > doesn't make them any less confusing and frustrating, but custom > > diagnostics for every such case has a potential of bloating libbpf > > beyond recognition > > bloating libbpf with strings is imo better way to figure out how to > improve user experience than bloating kernel. By bloat I meant not strings, but the logic of determining the exact cause of failure. Which from what we discussed can be quite elaborate and non-trivial. Strings bloat are an easy thing to fix in the kernel. We can conditionally compile them out if some Kconfig value is set. E.g., for cases of embedded kernels where every kilobyte counts. It's a matter of having the proper macros/funcs to do error reporting. > > > of them. That's what I meant that this is not a scalable approach to > > just say "fix libbpf to be more user friendly, kernel does its best > > already". > > Ok. I will take it back. The kernel can be improved with extra strings > here and there, but only if it's done without huge code churn. > If current task_struct approach can work and the changes would be > limited to something like: > diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c > index ac53102e244a..244df18728c2 100644 > --- a/kernel/bpf/cgroup.c > +++ b/kernel/bpf/cgroup.c > @@ -196,8 +196,12 @@ static bool hierarchy_allows_attach(struct cgroup *cgrp, > return true; > cnt = prog_list_length(&p->bpf.progs[type]); > WARN_ON_ONCE(cnt > 1); > - if (cnt == 1) > - return !!(flags & BPF_F_ALLOW_OVERRIDE); > + if (cnt == 1) { > + ret = !!(flags & BPF_F_ALLOW_OVERRIDE); > + if (!ret) > + syscall_string("cgroup has non-overridable program in the parent"); > + return ret; > + } > > Then such syscall_string reporting mechanism would be solid addition to the > kernel. Otherwise the cost of passing explicit log_buf everywhere is not worth > it. > Right, that's something like what I had in mind in terms of how non-intrusive it has to be to get adopted. > I think such syscall_string can probably piggy back on "socket local storage > into generic local storage" work. Sooner or later we will have > per task_struct storage. If that's a single pointer in task_struct that will > be used by both task local storage from inside tracing bpf program and > by this syscall_string() reporting mechanism then we can land it. Not sure why the need to unify it with task-local storage? What's the use case and what are the benefits of conflating syscall logging and some BPF program's logging? It just adds problems, e.g., some BPF programs might interleave its log output with syscall's output. I'm not saying we shouldn't do it, but I don't see why we have to do it. > May be all syscalls will become stateful. sys_bpf, sys_perf_event_open, sys_ioctl > followed by new syscall "give me back the error string". > Or may be we can do some asm magic and pass both errno and a string from > a syscall at the same time. So stateful logging (e.g., through extra syscall to set user log buffer and unset it), which was proposal #2, definitely makes it easy to do this across all syscalls uniformly without touching that syscall API at all. The biggest downside, though, is runtimes like Golang, where application doesn't control which OS thread it runs on. So getting reliable and consistent log between syscalls becomes a problem. Maybe there is some thread pinning API, though, and this might not be a problem. Don't know.
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index da9bf35a26f8..662a15e4a1a1 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -573,8 +573,13 @@ union bpf_attr { } query; struct { /* anonymous struct used by BPF_RAW_TRACEPOINT_OPEN command */ - __u64 name; - __u32 prog_fd; + __u64 name; + __u32 prog_fd; + __u32 log_level; /* verbosity level of log */ + __u32 log_size; /* size of user buffer */ + __aligned_u64 log_buf; /* user supplied buffer */ + __u32 tgt_prog_fd; + __u32 tgt_btf_id; } raw_tracepoint; struct { /* anonymous struct for BPF_BTF_LOAD */