Message ID | 20200513075849.20868-1-daniel@iogearbox.net |
---|---|
State | Accepted |
Delegated to: | BPF Maintainers |
Headers | show |
Series | [bpf-next] bpf, bpftool: Allow probing for CONFIG_HZ from kernel config | expand |
2020-05-13 09:58 UTC+0200 ~ Daniel Borkmann <daniel@iogearbox.net> > In Cilium we've recently switched to make use of bpf_jiffies64() for > parts of our tc and XDP datapath since bpf_ktime_get_ns() is more > expensive and high-precision is not needed for our timeouts we have > anyway. Our agent has a probe manager which picks up the json of > bpftool's feature probe and we also use the macro output in our C > programs e.g. to have workarounds when helpers are not available on > older kernels. > > Extend the kernel config info dump to also include the kernel's > CONFIG_HZ, and rework the probe_kernel_image_config() for allowing a > macro dump such that CONFIG_HZ can be propagated to BPF C code as a > simple define if available via config. Latter allows to have _compile- > time_ resolution of jiffies <-> sec conversion in our code since all > are propagated as known constants. > > Given we cannot generally assume availability of kconfig everywhere, > we also have a kernel hz probe [0] as a fallback. Potentially, bpftool > could have an integrated probe fallback as well, although to derive it, > we might need to place it under 'bpftool feature probe full' or similar > given it would slow down the probing process overall. Yet 'full' doesn't > fit either for us since we don't want to pollute the kernel log with > warning messages from bpf_probe_write_user() and bpf_trace_printk() on > agent startup; I've left it out for the time being. > > [0] https://github.com/cilium/cilium/blob/master/bpf/cilium-probe-kernel-hz.c > > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> > Cc: Martin KaFai Lau <kafai@fb.com> Looks good to me, thanks! I think at the time the "bpftool feature probe" was added we didn't settle on a particular format for dumping the CONFIG_* as part as the C macro output, but other than that I can see no specific reason why not to have them, so we could even list them all and avoid the macro_dump bool. But I'm fine either way, other CONFIG_* can still be added to C macro output at a later time if someone needs them anyway. Regarding a fallback for the jiffies, not sure what would be best. I agree with you for the "full" keyword, so we would need another word I suppose. But adding new keyword for fallbacks for probing features not directly related to BPF might be going a bit beyond bpftool's scope? I don't know. Anyway, for the current patch: Reviewed-by: Quentin Monnet <quentin@isovalent.com>
On 5/13/20 12:42 PM, Quentin Monnet wrote: > 2020-05-13 09:58 UTC+0200 ~ Daniel Borkmann <daniel@iogearbox.net> >> In Cilium we've recently switched to make use of bpf_jiffies64() for >> parts of our tc and XDP datapath since bpf_ktime_get_ns() is more >> expensive and high-precision is not needed for our timeouts we have >> anyway. Our agent has a probe manager which picks up the json of >> bpftool's feature probe and we also use the macro output in our C >> programs e.g. to have workarounds when helpers are not available on >> older kernels. >> >> Extend the kernel config info dump to also include the kernel's >> CONFIG_HZ, and rework the probe_kernel_image_config() for allowing a >> macro dump such that CONFIG_HZ can be propagated to BPF C code as a >> simple define if available via config. Latter allows to have _compile- >> time_ resolution of jiffies <-> sec conversion in our code since all >> are propagated as known constants. >> >> Given we cannot generally assume availability of kconfig everywhere, >> we also have a kernel hz probe [0] as a fallback. Potentially, bpftool >> could have an integrated probe fallback as well, although to derive it, >> we might need to place it under 'bpftool feature probe full' or similar >> given it would slow down the probing process overall. Yet 'full' doesn't >> fit either for us since we don't want to pollute the kernel log with >> warning messages from bpf_probe_write_user() and bpf_trace_printk() on >> agent startup; I've left it out for the time being. >> >> [0] https://github.com/cilium/cilium/blob/master/bpf/cilium-probe-kernel-hz.c >> >> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> >> Cc: Martin KaFai Lau <kafai@fb.com> > > Looks good to me, thanks! > > I think at the time the "bpftool feature probe" was added we didn't > settle on a particular format for dumping the CONFIG_* as part as the C > macro output, but other than that I can see no specific reason why not > to have them, so we could even list them all and avoid the macro_dump > bool. But I'm fine either way, other CONFIG_* can still be added to C > macro output at a later time if someone needs them anyway. Right, I initially thought about listing them all, but then my thinking was that we should really only dump those CONFIG_* as defines that actually have a real-world use case in a BPF prog somewhere. This also helps to better understand what is useful and why and avoids unrelated noise e.g. in the bpf_features.h dump we have in Cilium as part of the agent bootstrap. > Regarding a fallback for the jiffies, not sure what would be best. I > agree with you for the "full" keyword, so we would need another word I > suppose. But adding new keyword for fallbacks for probing features not > directly related to BPF might be going a bit beyond bpftool's scope? I > don't know. Anyway, for the current patch: Agree, had the same thought. > Reviewed-by: Quentin Monnet <quentin@isovalent.com> Thanks! Daniel
On Wed, May 13, 2020 at 1:00 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > In Cilium we've recently switched to make use of bpf_jiffies64() for > parts of our tc and XDP datapath since bpf_ktime_get_ns() is more > expensive and high-precision is not needed for our timeouts we have > anyway. Our agent has a probe manager which picks up the json of > bpftool's feature probe and we also use the macro output in our C > programs e.g. to have workarounds when helpers are not available on > older kernels. > > Extend the kernel config info dump to also include the kernel's > CONFIG_HZ, and rework the probe_kernel_image_config() for allowing a > macro dump such that CONFIG_HZ can be propagated to BPF C code as a > simple define if available via config. Latter allows to have _compile- > time_ resolution of jiffies <-> sec conversion in our code since all > are propagated as known constants. > > Given we cannot generally assume availability of kconfig everywhere, > we also have a kernel hz probe [0] as a fallback. Potentially, bpftool > could have an integrated probe fallback as well, although to derive it, > we might need to place it under 'bpftool feature probe full' or similar > given it would slow down the probing process overall. Yet 'full' doesn't > fit either for us since we don't want to pollute the kernel log with > warning messages from bpf_probe_write_user() and bpf_trace_printk() on > agent startup; I've left it out for the time being. > > [0] https://github.com/cilium/cilium/blob/master/bpf/cilium-probe-kernel-hz.c > > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> > Cc: Martin KaFai Lau <kafai@fb.com> > --- libbpf supports kconfig values, so don't have to even probe for that, it will just appear as a constant global variable: extern unsigned long CONFIG_HZ __kconfig; But I assume you want this for iproute2 case, which doesn't support this, right? We really should try to make iproute2 just use libbpf as a loader with all the libbpf features available. I think all (at least all major ones) features needed are already there in libbpf, iproute2 would just need to implement custom support for old-style map definitions and maybe something else, not sure. I wonder if any of iproute2/BPF users willing to spend some effort on this? > tools/bpf/bpftool/feature.c | 120 ++++++++++++++++++++---------------- > 1 file changed, 67 insertions(+), 53 deletions(-) > [...]
On Thu, May 14, 2020 at 04:19:41PM -0700, Andrii Nakryiko wrote: > On Wed, May 13, 2020 at 1:00 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > > > In Cilium we've recently switched to make use of bpf_jiffies64() for > > parts of our tc and XDP datapath since bpf_ktime_get_ns() is more > > expensive and high-precision is not needed for our timeouts we have > > anyway. Our agent has a probe manager which picks up the json of > > bpftool's feature probe and we also use the macro output in our C > > programs e.g. to have workarounds when helpers are not available on > > older kernels. > > > > Extend the kernel config info dump to also include the kernel's > > CONFIG_HZ, and rework the probe_kernel_image_config() for allowing a > > macro dump such that CONFIG_HZ can be propagated to BPF C code as a > > simple define if available via config. Latter allows to have _compile- > > time_ resolution of jiffies <-> sec conversion in our code since all > > are propagated as known constants. > > > > Given we cannot generally assume availability of kconfig everywhere, > > we also have a kernel hz probe [0] as a fallback. Potentially, bpftool > > could have an integrated probe fallback as well, although to derive it, > > we might need to place it under 'bpftool feature probe full' or similar > > given it would slow down the probing process overall. Yet 'full' doesn't > > fit either for us since we don't want to pollute the kernel log with > > warning messages from bpf_probe_write_user() and bpf_trace_printk() on > > agent startup; I've left it out for the time being. > > > > [0] https://github.com/cilium/cilium/blob/master/bpf/cilium-probe-kernel-hz.c > > > > Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> > > Cc: Martin KaFai Lau <kafai@fb.com> > > --- > > libbpf supports kconfig values, so don't have to even probe for that, > it will just appear as a constant global variable: > > extern unsigned long CONFIG_HZ __kconfig; > > But I assume you want this for iproute2 case, which doesn't support > this, right? We really should try to make iproute2 just use libbpf as It's one but not the only reason. Our golang daemon picks up the json config from `bpftool feature -j` and based on that we decide which features our BPF datapath will have where the daemons needs to also adapt accordingly. So for users running older kernels we need fallback behavior in our daemon, for example, in case of missing LPM delete or get_next_key from syscall side the LPM map management and/or program regeneration will differ in the agent. In case of jiffies, it's also not as trivial from control plane side, e.g. existing deployments cannot simply switch from ktime to jiffies during upgrade while traffic is live in the datapath given this has upgrade and downgrade implications on timeouts. However, we can switch to using it for new deployments via helm. In that case, the agent today probes for availability, so it needs to know i) whether the underlying kernel supports jiffies64 helper, ii) it needs to know the kernel hz value in order to convert timeouts for our CT's GC. This is done based on bpftool feature -j from the agent's probe manager. Next, we also cannot assume general availability of an existing .config from distro side, so in that case we run the probe to determine kernel hz and emit the CONFIG_HZ define instead, and if all breaks down we fall back to using ktime in our data path. From the macro side, the timeouts all resolve nicely during compilation time since everything is passed as a constant here. We have a small helper for bpf_jiffies_to_sec() and bpf_sec_to_jiffies() that resolves it whereas `extern unsigned long CONFIG_HZ __kconfig` hapens at load time plus relies on the fact that config must be available, although the latter could probably be fixed via user-callback. > a loader with all the libbpf features available. I think all (at least > all major ones) features needed are already there in libbpf, iproute2 > would just need to implement custom support for old-style map > definitions and maybe something else, not sure. I wonder if any of > iproute2/BPF users willing to spend some effort on this? Right, my main concern are all the behavioral subtleties where things could break on our side e.g. debugging something like [0] is not fun, but I might eventually do it, at least it's on my list. I recently converted our LB to be compileable and loadable from both tc as well as XDP side and I'm in the process of optimizing the BPF side further. So I found myself annoyed enough that `bpftool prog profile` doesn't work. ;) Lack of BTF - the iproute2 lib does load BTF [1], but it broke with newer LLVM versions (we ship clang-10 these days). So yeah, either fixing the BTF handling for getting `bpftool prog profile` working or investing the cycles to move iproute2 finally to libbpf along with our entire datapath. It's still an intermediate step as long-term we would love to handle everything native from golang via cilium/ebpf library to avoid shelling out, but it would allow for opening usage of other features in the meantime and latter might still be further out. One of the things that is still not clear yet to me is the global data handling and how to have a clean solution for both old kernels that don't have BPF global data support and new ones that have it. Our iproute2 version uses the bpf_apply_relo_glob() variant [2] which we discussed longer time ago at plumbers and while a hack, it solved the use-case of avoiding to invoke the compiler for every regeneration even on old kernels down to 4.9 [3] where BPF global data is not available of course. Tricky, but maybe there is some low-overhead solution we could add to libbpf that would resolve it under the hood, or worst case just inline asm ... [0] https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/commit/?id=e4c4685fd6e4afd3e9b6818c96fded371f350a3f [1] https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/commit/?id=f823f36012fb5ab4ddfca6ed4ff56188730f281e [2] https://github.com/cilium/iproute2/blob/static-data/lib/bpf.c#L2525 [3] https://cilium.io/blog/2019/04/24/cilium-15#BpfTemplating > > tools/bpf/bpftool/feature.c | 120 ++++++++++++++++++++---------------- > > 1 file changed, 67 insertions(+), 53 deletions(-) > > > > [...] Thanks, Daniel
On Fri, May 15, 2020 at 5:12 AM Daniel Borkmann <daniel@iogearbox.net> wrote: > > > > libbpf supports kconfig values, so don't have to even probe for that, > > it will just appear as a constant global variable: > > > > extern unsigned long CONFIG_HZ __kconfig; > > > > But I assume you want this for iproute2 case, which doesn't support > > this, right? We really should try to make iproute2 just use libbpf as > > It's one but not the only reason. Our golang daemon picks up the json config > from `bpftool feature -j` and based on that we decide which features our BPF > datapath will have where the daemons needs to also adapt accordingly. So for > users running older kernels we need fallback behavior in our daemon, for > example, in case of missing LPM delete or get_next_key from syscall side the > LPM map management and/or program regeneration will differ in the agent. In > case of jiffies, it's also not as trivial from control plane side, e.g. > existing deployments cannot simply switch from ktime to jiffies during upgrade > while traffic is live in the datapath given this has upgrade and downgrade > implications on timeouts. However, we can switch to using it for new deployments > via helm. In that case, the agent today probes for availability, so it needs > to know i) whether the underlying kernel supports jiffies64 helper, ii) it needs > to know the kernel hz value in order to convert timeouts for our CT's GC. This > is done based on bpftool feature -j from the agent's probe manager. Next, we > also cannot assume general availability of an existing .config from distro side, > so in that case we run the probe to determine kernel hz and emit the CONFIG_HZ > define instead, and if all breaks down we fall back to using ktime in our data > path. From the macro side, the timeouts all resolve nicely during compilation > time since everything is passed as a constant here. We have a small helper for > bpf_jiffies_to_sec() and bpf_sec_to_jiffies() that resolves it whereas `extern > unsigned long CONFIG_HZ __kconfig` hapens at load time plus relies on the fact > that config must be available, although the latter could probably be fixed via > user-callback. fair enough. Applied to bpf-next. Thanks
diff --git a/tools/bpf/bpftool/feature.c b/tools/bpf/bpftool/feature.c index f54347f55ee0..1b73e63274b5 100644 --- a/tools/bpf/bpftool/feature.c +++ b/tools/bpf/bpftool/feature.c @@ -80,13 +80,12 @@ print_bool_feature(const char *feat_name, const char *plain_name, printf("%s is %savailable\n", plain_name, res ? "" : "NOT "); } -static void print_kernel_option(const char *name, const char *value) +static void print_kernel_option(const char *name, const char *value, + const char *define_prefix) { char *endptr; int res; - /* No support for C-style ouptut */ - if (json_output) { if (!value) { jsonw_null_field(json_wtr, name); @@ -98,6 +97,12 @@ static void print_kernel_option(const char *name, const char *value) jsonw_int_field(json_wtr, name, res); else jsonw_string_field(json_wtr, name, value); + } else if (define_prefix) { + if (value) + printf("#define %s%s %s\n", define_prefix, + name, value); + else + printf("/* %s%s is not set */\n", define_prefix, name); } else { if (value) printf("%s is set to %s\n", name, value); @@ -315,77 +320,84 @@ static bool read_next_kernel_config_option(gzFile file, char *buf, size_t n, return false; } -static void probe_kernel_image_config(void) +static void probe_kernel_image_config(const char *define_prefix) { - static const char * const options[] = { + static const struct { + const char * const name; + bool macro_dump; + } options[] = { /* Enable BPF */ - "CONFIG_BPF", + { "CONFIG_BPF", }, /* Enable bpf() syscall */ - "CONFIG_BPF_SYSCALL", + { "CONFIG_BPF_SYSCALL", }, /* Does selected architecture support eBPF JIT compiler */ - "CONFIG_HAVE_EBPF_JIT", + { "CONFIG_HAVE_EBPF_JIT", }, /* Compile eBPF JIT compiler */ - "CONFIG_BPF_JIT", + { "CONFIG_BPF_JIT", }, /* Avoid compiling eBPF interpreter (use JIT only) */ - "CONFIG_BPF_JIT_ALWAYS_ON", + { "CONFIG_BPF_JIT_ALWAYS_ON", }, /* cgroups */ - "CONFIG_CGROUPS", + { "CONFIG_CGROUPS", }, /* BPF programs attached to cgroups */ - "CONFIG_CGROUP_BPF", + { "CONFIG_CGROUP_BPF", }, /* bpf_get_cgroup_classid() helper */ - "CONFIG_CGROUP_NET_CLASSID", + { "CONFIG_CGROUP_NET_CLASSID", }, /* bpf_skb_{,ancestor_}cgroup_id() helpers */ - "CONFIG_SOCK_CGROUP_DATA", + { "CONFIG_SOCK_CGROUP_DATA", }, /* Tracing: attach BPF to kprobes, tracepoints, etc. */ - "CONFIG_BPF_EVENTS", + { "CONFIG_BPF_EVENTS", }, /* Kprobes */ - "CONFIG_KPROBE_EVENTS", + { "CONFIG_KPROBE_EVENTS", }, /* Uprobes */ - "CONFIG_UPROBE_EVENTS", + { "CONFIG_UPROBE_EVENTS", }, /* Tracepoints */ - "CONFIG_TRACING", + { "CONFIG_TRACING", }, /* Syscall tracepoints */ - "CONFIG_FTRACE_SYSCALLS", + { "CONFIG_FTRACE_SYSCALLS", }, /* bpf_override_return() helper support for selected arch */ - "CONFIG_FUNCTION_ERROR_INJECTION", + { "CONFIG_FUNCTION_ERROR_INJECTION", }, /* bpf_override_return() helper */ - "CONFIG_BPF_KPROBE_OVERRIDE", + { "CONFIG_BPF_KPROBE_OVERRIDE", }, /* Network */ - "CONFIG_NET", + { "CONFIG_NET", }, /* AF_XDP sockets */ - "CONFIG_XDP_SOCKETS", + { "CONFIG_XDP_SOCKETS", }, /* BPF_PROG_TYPE_LWT_* and related helpers */ - "CONFIG_LWTUNNEL_BPF", + { "CONFIG_LWTUNNEL_BPF", }, /* BPF_PROG_TYPE_SCHED_ACT, TC (traffic control) actions */ - "CONFIG_NET_ACT_BPF", + { "CONFIG_NET_ACT_BPF", }, /* BPF_PROG_TYPE_SCHED_CLS, TC filters */ - "CONFIG_NET_CLS_BPF", + { "CONFIG_NET_CLS_BPF", }, /* TC clsact qdisc */ - "CONFIG_NET_CLS_ACT", + { "CONFIG_NET_CLS_ACT", }, /* Ingress filtering with TC */ - "CONFIG_NET_SCH_INGRESS", + { "CONFIG_NET_SCH_INGRESS", }, /* bpf_skb_get_xfrm_state() helper */ - "CONFIG_XFRM", + { "CONFIG_XFRM", }, /* bpf_get_route_realm() helper */ - "CONFIG_IP_ROUTE_CLASSID", + { "CONFIG_IP_ROUTE_CLASSID", }, /* BPF_PROG_TYPE_LWT_SEG6_LOCAL and related helpers */ - "CONFIG_IPV6_SEG6_BPF", + { "CONFIG_IPV6_SEG6_BPF", }, /* BPF_PROG_TYPE_LIRC_MODE2 and related helpers */ - "CONFIG_BPF_LIRC_MODE2", + { "CONFIG_BPF_LIRC_MODE2", }, /* BPF stream parser and BPF socket maps */ - "CONFIG_BPF_STREAM_PARSER", + { "CONFIG_BPF_STREAM_PARSER", }, /* xt_bpf module for passing BPF programs to netfilter */ - "CONFIG_NETFILTER_XT_MATCH_BPF", + { "CONFIG_NETFILTER_XT_MATCH_BPF", }, /* bpfilter back-end for iptables */ - "CONFIG_BPFILTER", + { "CONFIG_BPFILTER", }, /* bpftilter module with "user mode helper" */ - "CONFIG_BPFILTER_UMH", + { "CONFIG_BPFILTER_UMH", }, /* test_bpf module for BPF tests */ - "CONFIG_TEST_BPF", + { "CONFIG_TEST_BPF", }, + + /* Misc configs useful in BPF C programs */ + /* jiffies <-> sec conversion for bpf_jiffies64() helper */ + { "CONFIG_HZ", true, } }; char *values[ARRAY_SIZE(options)] = { }; struct utsname utsn; @@ -427,7 +439,8 @@ static void probe_kernel_image_config(void) while (read_next_kernel_config_option(file, buf, sizeof(buf), &value)) { for (i = 0; i < ARRAY_SIZE(options); i++) { - if (values[i] || strcmp(buf, options[i])) + if ((define_prefix && !options[i].macro_dump) || + values[i] || strcmp(buf, options[i].name)) continue; values[i] = strdup(value); @@ -439,7 +452,9 @@ static void probe_kernel_image_config(void) gzclose(file); for (i = 0; i < ARRAY_SIZE(options); i++) { - print_kernel_option(options[i], values[i]); + if (define_prefix && !options[i].macro_dump) + continue; + print_kernel_option(options[i].name, values[i], define_prefix); free(values[i]); } } @@ -632,23 +647,22 @@ section_system_config(enum probe_component target, const char *define_prefix) switch (target) { case COMPONENT_KERNEL: case COMPONENT_UNSPEC: - if (define_prefix) - break; - print_start_section("system_config", "Scanning system configuration...", - NULL, /* define_comment never used here */ - NULL); /* define_prefix always NULL here */ - if (check_procfs()) { - probe_unprivileged_disabled(); - probe_jit_enable(); - probe_jit_harden(); - probe_jit_kallsyms(); - probe_jit_limit(); - } else { - p_info("/* procfs not mounted, skipping related probes */"); + "/*** Misc kernel config items ***/", + define_prefix); + if (!define_prefix) { + if (check_procfs()) { + probe_unprivileged_disabled(); + probe_jit_enable(); + probe_jit_harden(); + probe_jit_kallsyms(); + probe_jit_limit(); + } else { + p_info("/* procfs not mounted, skipping related probes */"); + } } - probe_kernel_image_config(); + probe_kernel_image_config(define_prefix); print_end_section(); break; default:
In Cilium we've recently switched to make use of bpf_jiffies64() for parts of our tc and XDP datapath since bpf_ktime_get_ns() is more expensive and high-precision is not needed for our timeouts we have anyway. Our agent has a probe manager which picks up the json of bpftool's feature probe and we also use the macro output in our C programs e.g. to have workarounds when helpers are not available on older kernels. Extend the kernel config info dump to also include the kernel's CONFIG_HZ, and rework the probe_kernel_image_config() for allowing a macro dump such that CONFIG_HZ can be propagated to BPF C code as a simple define if available via config. Latter allows to have _compile- time_ resolution of jiffies <-> sec conversion in our code since all are propagated as known constants. Given we cannot generally assume availability of kconfig everywhere, we also have a kernel hz probe [0] as a fallback. Potentially, bpftool could have an integrated probe fallback as well, although to derive it, we might need to place it under 'bpftool feature probe full' or similar given it would slow down the probing process overall. Yet 'full' doesn't fit either for us since we don't want to pollute the kernel log with warning messages from bpf_probe_write_user() and bpf_trace_printk() on agent startup; I've left it out for the time being. [0] https://github.com/cilium/cilium/blob/master/bpf/cilium-probe-kernel-hz.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> --- tools/bpf/bpftool/feature.c | 120 ++++++++++++++++++++---------------- 1 file changed, 67 insertions(+), 53 deletions(-)