Message ID | 20191221062608.1183091-1-kafai@fb.com |
---|---|
State | Changes Requested |
Delegated to: | BPF Maintainers |
Headers | show |
Series | Introduce BPF STRUCT_OPS | expand |
On 12/20/19 10:26 PM, Martin KaFai Lau wrote: > The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value > is a kernel struct with its func ptr implemented in bpf prog. > This new map is the interface to register/unregister/introspect > a bpf implemented kernel struct. > > The kernel struct is actually embedded inside another new struct > (or called the "value" struct in the code). For example, > "struct tcp_congestion_ops" is embbeded in: > struct bpf_struct_ops_tcp_congestion_ops { > refcount_t refcnt; > enum bpf_struct_ops_state state; > struct tcp_congestion_ops data; /* <-- kernel subsystem struct here */ > } > The map value is "struct bpf_struct_ops_tcp_congestion_ops". > The "bpftool map dump" will then be able to show the > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g. > number of tcp_sock in the tcp_congestion_ops case). This "value" struct > is created automatically by a macro. Having a separate "value" struct > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some > initialization works before registering the struct_ops to the kernel > subsystem). The libbpf will take care of finding and populating the > "struct bpf_struct_ops_XYZ" from "struct XYZ". > > Register a struct_ops to a kernel subsystem: > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s) > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id > set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the > running kernel. > Instead of reusing the attr->btf_value_type_id, > btf_vmlinux_value_type_id s added such that attr->btf_fd can still be > used as the "user" btf which could store other useful sysadmin/debug > info that may be introduced in the furture, > e.g. creation-date/compiler-details/map-creator...etc. > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described > in the running kernel btf. Populate the value of this object. > The function ptr should be populated with the prog fds. > 4. Call BPF_MAP_UPDATE with the object created in (3) as > the map value. The key is always "0". > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's > args as an array of u64 is generated. BPF_MAP_UPDATE also allows > the specific struct_ops to do some final checks in "st_ops->init_member()" > (e.g. ensure all mandatory func ptrs are implemented). > If everything looks good, it will register this kernel struct > to the kernel subsystem. The map will not allow further update > from this point. > > Unregister a struct_ops from the kernel subsystem: > BPF_MAP_DELETE with key "0". > > Introspect a struct_ops: > BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will > have the prog _id_ populated as the func ptr. > > The map value state (enum bpf_struct_ops_state) will transit from: > INIT (map created) => > INUSE (map updated, i.e. reg) => > TOBEFREE (map value deleted, i.e. unreg) > > The kernel subsystem needs to call bpf_struct_ops_get() and > bpf_struct_ops_put() to manage the "refcnt" in the > "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt > for the purose of tracking the subsystem usage. Another approach > is to reuse the map->refcnt and then "show" (i.e. during map_lookup) > the subsystem's usage by doing map->refcnt - map->usercnt to filter out > the map-fd/pinned-map usage. However, that will also tie down the > future semantics of map->refcnt and map->usercnt. > > The very first subsystem's refcnt (during reg()) holds one > count to map->refcnt. When the very last subsystem's refcnt > is gone, it will also release the map->refcnt. All bpf_prog will be > freed when the map->refcnt reaches 0 (i.e. during map_free()). > > Here is how the bpftool map command will look like: > [root@arch-fb-vm1 bpf]# bpftool map show > 6: struct_ops name dctcp flags 0x0 > key 4B value 256B max_entries 1 memlock 4096B > btf_id 6 > [root@arch-fb-vm1 bpf]# bpftool map dump id 6 > [{ > "value": { > "refcnt": { > "refs": { > "counter": 1 > } > }, > "state": 1, The bpftool dump with "state" 1 is a little bit cryptic. Since this is common for all struct_ops maps, can we make it explicit, e.g., as enum values, like INIT/INUSE/TOBEFREE? > "data": { > "list": { > "next": 0, > "prev": 0 > }, > "key": 0, > "flags": 2, > "init": 24, > "release": 0, > "ssthresh": 25, > "cong_avoid": 30, > "set_state": 27, > "cwnd_event": 28, > "in_ack_event": 26, > "undo_cwnd": 29, > "pkts_acked": 0, > "min_tso_segs": 0, > "sndbuf_expand": 0, > "cong_control": 0, > "get_info": 0, > "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0 > ], > "owner": 0 > } > } > } > ] > > Misc Notes: > * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup. > It does an inplace update on "*value" instead returning a pointer > to syscall.c. Otherwise, it needs a separate copy of "zero" value > for the BPF_STRUCT_OPS_STATE_INIT to avoid races. > > * The bpf_struct_ops_map_delete_elem() is also called without > preempt_disable() from map_delete_elem(). It is because > the "->unreg()" may requires sleepable context, e.g. > the "tcp_unregister_congestion_control()". > > * "const" is added to some of the existing "struct btf_func_model *" > function arg to avoid a compiler warning caused by this patch. > > Signed-off-by: Martin KaFai Lau <kafai@fb.com> > --- > arch/x86/net/bpf_jit_comp.c | 11 +- > include/linux/bpf.h | 49 +++- > include/linux/bpf_types.h | 3 + > include/linux/btf.h | 13 + > include/uapi/linux/bpf.h | 7 +- > kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++- > kernel/bpf/btf.c | 20 +- > kernel/bpf/map_in_map.c | 3 +- > kernel/bpf/syscall.c | 49 ++-- > kernel/bpf/trampoline.c | 5 +- > kernel/bpf/verifier.c | 5 + > 11 files changed, 593 insertions(+), 40 deletions(-) > [...] > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index c1eeb3e0e116..38059880963e 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -136,6 +136,7 @@ enum bpf_map_type { > BPF_MAP_TYPE_STACK, > BPF_MAP_TYPE_SK_STORAGE, > BPF_MAP_TYPE_DEVMAP_HASH, > + BPF_MAP_TYPE_STRUCT_OPS, > }; > > /* Note that tracing related programs such as > @@ -398,6 +399,10 @@ union bpf_attr { > __u32 btf_fd; /* fd pointing to a BTF type data */ > __u32 btf_key_type_id; /* BTF type_id of the key */ > __u32 btf_value_type_id; /* BTF type_id of the value */ > + __u32 btf_vmlinux_value_type_id;/* BTF type_id of a kernel- > + * struct stored as the > + * map value > + */ > }; > > struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ > @@ -3350,7 +3355,7 @@ struct bpf_map_info { > __u32 map_flags; > char name[BPF_OBJ_NAME_LEN]; > __u32 ifindex; > - __u32 :32; > + __u32 btf_vmlinux_value_type_id; > __u64 netns_dev; > __u64 netns_ino; > __u32 btf_id; > diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c > index c9f81bd1df83..fb9a0b3e4580 100644 > --- a/kernel/bpf/bpf_struct_ops.c > +++ b/kernel/bpf/bpf_struct_ops.c > @@ -10,8 +10,66 @@ > #include <linux/seq_file.h> > #include <linux/refcount.h> > > +enum bpf_struct_ops_state { > + BPF_STRUCT_OPS_STATE_INIT, > + BPF_STRUCT_OPS_STATE_INUSE, > + BPF_STRUCT_OPS_STATE_TOBEFREE, > +}; > + > +#define BPF_STRUCT_OPS_COMMON_VALUE \ > + refcount_t refcnt; \ > + enum bpf_struct_ops_state state > + > +struct bpf_struct_ops_value { > + BPF_STRUCT_OPS_COMMON_VALUE; > + char data[0] ____cacheline_aligned_in_smp; > +}; > + > +struct bpf_struct_ops_map { > + struct bpf_map map; > + const struct bpf_struct_ops *st_ops; > + /* protect map_update */ > + spinlock_t lock; > + /* progs has all the bpf_prog that is populated > + * to the func ptr of the kernel's struct > + * (in kvalue.data). > + */ > + struct bpf_prog **progs; > + /* image is a page that has all the trampolines > + * that stores the func args before calling the bpf_prog. > + * A PAGE_SIZE "image" is enough to store all trampoline for > + * "progs[]". > + */ > + void *image; > + /* uvalue->data stores the kernel struct > + * (e.g. tcp_congestion_ops) that is more useful > + * to userspace than the kvalue. For example, > + * the bpf_prog's id is stored instead of the kernel > + * address of a func ptr. > + */ > + struct bpf_struct_ops_value *uvalue; > + /* kvalue.data stores the actual kernel's struct > + * (e.g. tcp_congestion_ops) that will be > + * registered to the kernel subsystem. > + */ > + struct bpf_struct_ops_value kvalue; > +}; > + > +#define VALUE_PREFIX "bpf_struct_ops_" > +#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1) > + > +/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is > + * the map's value exposed to the userspace and its btf-type-id is > + * stored at the map->btf_vmlinux_value_type_id. > + * > + */ > #define BPF_STRUCT_OPS_TYPE(_name) \ > -extern struct bpf_struct_ops bpf_##_name; > +extern struct bpf_struct_ops bpf_##_name; \ > + \ > +struct bpf_struct_ops_##_name { \ > + BPF_STRUCT_OPS_COMMON_VALUE; \ > + struct _name data ____cacheline_aligned_in_smp; \ > +}; > #include "bpf_struct_ops_types.h" > #undef BPF_STRUCT_OPS_TYPE > > @@ -35,19 +93,51 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = { > const struct bpf_prog_ops bpf_struct_ops_prog_ops = { > }; > > +static const struct btf_type *module_type; > + > void bpf_struct_ops_init(struct btf *_btf_vmlinux) > { > + s32 type_id, value_id, module_id; > const struct btf_member *member; > struct bpf_struct_ops *st_ops; > struct bpf_verifier_log log = {}; > const struct btf_type *t; > + char value_name[128]; > const char *mname; > - s32 type_id; > u32 i, j; > > + /* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */ > +#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name); > +#include "bpf_struct_ops_types.h" > +#undef BPF_STRUCT_OPS_TYPE This looks great! > + > + module_id = btf_find_by_name_kind(_btf_vmlinux, "module", > + BTF_KIND_STRUCT); > + if (module_id < 0) { > + pr_warn("Cannot find struct module in btf_vmlinux\n"); > + return; > + } > + module_type = btf_type_by_id(_btf_vmlinux, module_id); > + > for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { > st_ops = bpf_struct_ops[i]; > > + if (strlen(st_ops->name) + VALUE_PREFIX_LEN >= > + sizeof(value_name)) { > + pr_warn("struct_ops name %s is too long\n", > + st_ops->name); > + continue; > + } > + sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name); > + > + value_id = btf_find_by_name_kind(_btf_vmlinux, value_name, > + BTF_KIND_STRUCT); > + if (value_id < 0) { > + pr_warn("Cannot find struct %s in btf_vmlinux\n", > + value_name); > + continue; > + } > + > type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name, > BTF_KIND_STRUCT); > if (type_id < 0) { > @@ -99,6 +189,9 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux) > } else { > st_ops->type_id = type_id; > st_ops->type = t; > + st_ops->value_id = value_id; > + st_ops->value_type = > + btf_type_by_id(_btf_vmlinux, value_id); > } > } > } > @@ -106,6 +199,22 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux) > > extern struct btf *btf_vmlinux; > [...] > + > +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, > + void *value, u64 flags) > +{ > + struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; > + const struct bpf_struct_ops *st_ops = st_map->st_ops; > + struct bpf_struct_ops_value *uvalue, *kvalue; > + const struct btf_member *member; > + const struct btf_type *t = st_ops->type; > + void *udata, *kdata; > + int prog_fd, err = 0; > + void *image; > + u32 i; > + > + if (flags) > + return -EINVAL; > + > + if (*(u32 *)key != 0) > + return -E2BIG; > + > + uvalue = (struct bpf_struct_ops_value *)value; > + if (uvalue->state || refcount_read(&uvalue->refcnt)) > + return -EINVAL; > + > + uvalue = (struct bpf_struct_ops_value *)st_map->uvalue; > + kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue; > + > + spin_lock(&st_map->lock); > + > + if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) { > + err = -EBUSY; > + goto unlock; > + } > + > + memcpy(uvalue, value, map->value_size); > + > + udata = &uvalue->data; > + kdata = &kvalue->data; > + image = st_map->image; > + > + for_each_member(i, t, member) { > + const struct btf_type *mtype, *ptype; > + struct bpf_prog *prog; > + u32 moff; > + > + moff = btf_member_bit_offset(t, member) / 8; > + mtype = btf_type_by_id(btf_vmlinux, member->type); > + ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL); > + if (ptype == module_type) { > + *(void **)(kdata + moff) = BPF_MODULE_OWNER; > + continue; > + } > + > + err = st_ops->init_member(t, member, kdata, udata); > + if (err < 0) > + goto reset_unlock; > + > + /* The ->init_member() has handled this member */ > + if (err > 0) > + continue; > + > + /* If st_ops->init_member does not handle it, > + * we will only handle func ptrs and zero-ed members > + * here. Reject everything else. > + */ > + > + /* All non func ptr member must be 0 */ > + if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, > + NULL)) { > + u32 msize; > + > + mtype = btf_resolve_size(btf_vmlinux, mtype, > + &msize, NULL, NULL); > + if (IS_ERR(mtype)) { > + err = PTR_ERR(mtype); > + goto reset_unlock; > + } > + > + if (memchr_inv(udata + moff, 0, msize)) { > + err = -EINVAL; > + goto reset_unlock; > + } > + > + continue; > + } > + > + prog_fd = (int)(*(unsigned long *)(udata + moff)); > + /* Similar check as the attr->attach_prog_fd */ > + if (!prog_fd) > + continue; > + > + prog = bpf_prog_get(prog_fd); > + if (IS_ERR(prog)) { > + err = PTR_ERR(prog); > + goto reset_unlock; > + } > + st_map->progs[i] = prog; > + > + if (prog->type != BPF_PROG_TYPE_STRUCT_OPS || > + prog->aux->attach_btf_id != st_ops->type_id || > + prog->expected_attach_type != i) { > + err = -EINVAL; > + goto reset_unlock; > + } > + > + err = arch_prepare_bpf_trampoline(image, > + &st_ops->func_models[i], 0, > + &prog, 1, NULL, 0, NULL); > + if (err < 0) > + goto reset_unlock; > + > + *(void **)(kdata + moff) = image; > + image += err; > + > + /* put prog_id to udata */ > + *(unsigned long *)(udata + moff) = prog->aux->id; > + } Should we check whether user indeed provided `module` member or not before declaring success? > + > + refcount_set(&kvalue->refcnt, 1); > + bpf_map_inc(map); > + > + err = st_ops->reg(kdata); > + if (!err) { > + /* Pair with smp_load_acquire() during lookup */ > + smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE); > + goto unlock; > + } > + > + /* Error during st_ops->reg() */ > + bpf_map_put(map); > + > +reset_unlock: > + bpf_struct_ops_map_put_progs(st_map); > + memset(uvalue, 0, map->value_size); > + memset(kvalue, 0, map->value_size); > + > +unlock: > + spin_unlock(&st_map->lock); > + return err; > +} > + [...] > + > +static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) > +{ > + const struct bpf_struct_ops *st_ops; > + size_t map_total_size, st_map_size; > + struct bpf_struct_ops_map *st_map; > + const struct btf_type *t, *vt; > + struct bpf_map_memory mem; > + struct bpf_map *map; > + int err; > + > + if (!capable(CAP_SYS_ADMIN)) > + return ERR_PTR(-EPERM); > + > + st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id); > + if (!st_ops) > + return ERR_PTR(-ENOTSUPP); > + > + vt = st_ops->value_type; > + if (attr->value_size != vt->size) > + return ERR_PTR(-EINVAL); > + > + t = st_ops->type; > + > + st_map_size = sizeof(*st_map) + > + /* kvalue stores the > + * struct bpf_struct_ops_tcp_congestions_ops > + */ > + (vt->size - sizeof(struct bpf_struct_ops_value)); > + map_total_size = st_map_size + > + /* uvalue */ > + sizeof(vt->size) + > + /* struct bpf_progs **progs */ > + btf_type_vlen(t) * sizeof(struct bpf_prog *); > + err = bpf_map_charge_init(&mem, map_total_size); > + if (err < 0) > + return ERR_PTR(err); > + > + st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE); > + if (!st_map) { > + bpf_map_charge_finish(&mem); > + return ERR_PTR(-ENOMEM); > + } > + st_map->st_ops = st_ops; > + map = &st_map->map; > + > + st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE); > + st_map->progs = > + bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *), > + NUMA_NO_NODE); > + /* Each trampoline costs < 64 bytes. Ensure one page > + * is enough for max number of func ptrs. > + */ > + BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS); This maybe true for x86 now, but it may not hold up for future other architectures. Not sure whether we should get the value for arch call backs, or we just bail out during map update if we ever grow exceeds one page. > + st_map->image = bpf_jit_alloc_exec(PAGE_SIZE); > + if (!st_map->uvalue || !st_map->progs || !st_map->image) { > + bpf_struct_ops_map_free(map); > + bpf_map_charge_finish(&mem); > + return ERR_PTR(-ENOMEM); > + } > + > + spin_lock_init(&st_map->lock); > + set_vm_flush_reset_perms(st_map->image); > + set_memory_x((long)st_map->image, 1); > + bpf_map_init_from_attr(map, attr); > + bpf_map_charge_move(&map->memory, &mem); > + > + return map; > +} > + > +const struct bpf_map_ops bpf_struct_ops_map_ops = { > + .map_alloc_check = bpf_struct_ops_map_alloc_check, > + .map_alloc = bpf_struct_ops_map_alloc, > + .map_free = bpf_struct_ops_map_free, > + .map_get_next_key = bpf_struct_ops_map_get_next_key, > + .map_lookup_elem = bpf_struct_ops_map_lookup_elem, > + .map_delete_elem = bpf_struct_ops_map_delete_elem, > + .map_update_elem = bpf_struct_ops_map_update_elem, > + .map_seq_show_elem = bpf_struct_ops_map_seq_show_elem, > +}; [...]
On Mon, Dec 23, 2019 at 11:58 AM Yonghong Song <yhs@fb.com> wrote: > > > > On 12/20/19 10:26 PM, Martin KaFai Lau wrote: > > The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value > > is a kernel struct with its func ptr implemented in bpf prog. > > This new map is the interface to register/unregister/introspect > > a bpf implemented kernel struct. > > > > The kernel struct is actually embedded inside another new struct > > (or called the "value" struct in the code). For example, > > "struct tcp_congestion_ops" is embbeded in: > > struct bpf_struct_ops_tcp_congestion_ops { > > refcount_t refcnt; > > enum bpf_struct_ops_state state; > > struct tcp_congestion_ops data; /* <-- kernel subsystem struct here */ > > } > > The map value is "struct bpf_struct_ops_tcp_congestion_ops". > > The "bpftool map dump" will then be able to show the > > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g. > > number of tcp_sock in the tcp_congestion_ops case). This "value" struct > > is created automatically by a macro. Having a separate "value" struct > > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding > > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some > > initialization works before registering the struct_ops to the kernel > > subsystem). The libbpf will take care of finding and populating the > > "struct bpf_struct_ops_XYZ" from "struct XYZ". > > > > Register a struct_ops to a kernel subsystem: > > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s) > > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id > > set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the > > running kernel. > > Instead of reusing the attr->btf_value_type_id, > > btf_vmlinux_value_type_id s added such that attr->btf_fd can still be > > used as the "user" btf which could store other useful sysadmin/debug > > info that may be introduced in the furture, > > e.g. creation-date/compiler-details/map-creator...etc. > > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described > > in the running kernel btf. Populate the value of this object. > > The function ptr should be populated with the prog fds. > > 4. Call BPF_MAP_UPDATE with the object created in (3) as > > the map value. The key is always "0". > > > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's > > args as an array of u64 is generated. BPF_MAP_UPDATE also allows > > the specific struct_ops to do some final checks in "st_ops->init_member()" > > (e.g. ensure all mandatory func ptrs are implemented). > > If everything looks good, it will register this kernel struct > > to the kernel subsystem. The map will not allow further update > > from this point. > > > > Unregister a struct_ops from the kernel subsystem: > > BPF_MAP_DELETE with key "0". > > > > Introspect a struct_ops: > > BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will > > have the prog _id_ populated as the func ptr. > > > > The map value state (enum bpf_struct_ops_state) will transit from: > > INIT (map created) => > > INUSE (map updated, i.e. reg) => > > TOBEFREE (map value deleted, i.e. unreg) > > > > The kernel subsystem needs to call bpf_struct_ops_get() and > > bpf_struct_ops_put() to manage the "refcnt" in the > > "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt > > for the purose of tracking the subsystem usage. Another approach > > is to reuse the map->refcnt and then "show" (i.e. during map_lookup) > > the subsystem's usage by doing map->refcnt - map->usercnt to filter out > > the map-fd/pinned-map usage. However, that will also tie down the > > future semantics of map->refcnt and map->usercnt. > > > > The very first subsystem's refcnt (during reg()) holds one > > count to map->refcnt. When the very last subsystem's refcnt > > is gone, it will also release the map->refcnt. All bpf_prog will be > > freed when the map->refcnt reaches 0 (i.e. during map_free()). > > > > Here is how the bpftool map command will look like: > > [root@arch-fb-vm1 bpf]# bpftool map show > > 6: struct_ops name dctcp flags 0x0 > > key 4B value 256B max_entries 1 memlock 4096B > > btf_id 6 > > [root@arch-fb-vm1 bpf]# bpftool map dump id 6 > > [{ > > "value": { > > "refcnt": { > > "refs": { > > "counter": 1 > > } > > }, > > "state": 1, > > The bpftool dump with "state" 1 is a little bit cryptic. > Since this is common for all struct_ops maps, can we > make it explicit, e.g., as enum values, like INIT/INUSE/TOBEFREE? This can (and probably should) be done generically in bpftool for any field of enum type. Not blocking this patch set, though. > > > "data": { > > "list": { > > "next": 0, > > "prev": 0 > > }, > > "key": 0, > > "flags": 2, > > "init": 24, > > "release": 0, > > "ssthresh": 25, > > "cong_avoid": 30, > > "set_state": 27, > > "cwnd_event": 28, > > "in_ack_event": 26, > > "undo_cwnd": 29, > > "pkts_acked": 0, > > "min_tso_segs": 0, > > "sndbuf_expand": 0, > > "cong_control": 0, > > "get_info": 0, > > "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0 > > ], Same here, bpftool should be smart enough to figure out that this is a string, not just an array of bytes. > > "owner": 0 > > } > > } > > } > > ] > > [...]
On Mon, Dec 23, 2019 at 01:44:29PM -0800, Andrii Nakryiko wrote: > On Mon, Dec 23, 2019 at 11:58 AM Yonghong Song <yhs@fb.com> wrote: > > > > > > > > On 12/20/19 10:26 PM, Martin KaFai Lau wrote: > > > The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value > > > is a kernel struct with its func ptr implemented in bpf prog. > > > This new map is the interface to register/unregister/introspect > > > a bpf implemented kernel struct. > > > > > > The kernel struct is actually embedded inside another new struct > > > (or called the "value" struct in the code). For example, > > > "struct tcp_congestion_ops" is embbeded in: > > > struct bpf_struct_ops_tcp_congestion_ops { > > > refcount_t refcnt; > > > enum bpf_struct_ops_state state; > > > struct tcp_congestion_ops data; /* <-- kernel subsystem struct here */ > > > } > > > The map value is "struct bpf_struct_ops_tcp_congestion_ops". > > > The "bpftool map dump" will then be able to show the > > > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g. > > > number of tcp_sock in the tcp_congestion_ops case). This "value" struct > > > is created automatically by a macro. Having a separate "value" struct > > > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding > > > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some > > > initialization works before registering the struct_ops to the kernel > > > subsystem). The libbpf will take care of finding and populating the > > > "struct bpf_struct_ops_XYZ" from "struct XYZ". > > > > > > Register a struct_ops to a kernel subsystem: > > > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s) > > > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id > > > set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the > > > running kernel. > > > Instead of reusing the attr->btf_value_type_id, > > > btf_vmlinux_value_type_id s added such that attr->btf_fd can still be > > > used as the "user" btf which could store other useful sysadmin/debug > > > info that may be introduced in the furture, > > > e.g. creation-date/compiler-details/map-creator...etc. > > > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described > > > in the running kernel btf. Populate the value of this object. > > > The function ptr should be populated with the prog fds. > > > 4. Call BPF_MAP_UPDATE with the object created in (3) as > > > the map value. The key is always "0". > > > > > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's > > > args as an array of u64 is generated. BPF_MAP_UPDATE also allows > > > the specific struct_ops to do some final checks in "st_ops->init_member()" > > > (e.g. ensure all mandatory func ptrs are implemented). > > > If everything looks good, it will register this kernel struct > > > to the kernel subsystem. The map will not allow further update > > > from this point. > > > > > > Unregister a struct_ops from the kernel subsystem: > > > BPF_MAP_DELETE with key "0". > > > > > > Introspect a struct_ops: > > > BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will > > > have the prog _id_ populated as the func ptr. > > > > > > The map value state (enum bpf_struct_ops_state) will transit from: > > > INIT (map created) => > > > INUSE (map updated, i.e. reg) => > > > TOBEFREE (map value deleted, i.e. unreg) > > > > > > The kernel subsystem needs to call bpf_struct_ops_get() and > > > bpf_struct_ops_put() to manage the "refcnt" in the > > > "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt > > > for the purose of tracking the subsystem usage. Another approach > > > is to reuse the map->refcnt and then "show" (i.e. during map_lookup) > > > the subsystem's usage by doing map->refcnt - map->usercnt to filter out > > > the map-fd/pinned-map usage. However, that will also tie down the > > > future semantics of map->refcnt and map->usercnt. > > > > > > The very first subsystem's refcnt (during reg()) holds one > > > count to map->refcnt. When the very last subsystem's refcnt > > > is gone, it will also release the map->refcnt. All bpf_prog will be > > > freed when the map->refcnt reaches 0 (i.e. during map_free()). > > > > > > Here is how the bpftool map command will look like: > > > [root@arch-fb-vm1 bpf]# bpftool map show > > > 6: struct_ops name dctcp flags 0x0 > > > key 4B value 256B max_entries 1 memlock 4096B > > > btf_id 6 > > > [root@arch-fb-vm1 bpf]# bpftool map dump id 6 > > > [{ > > > "value": { > > > "refcnt": { > > > "refs": { > > > "counter": 1 > > > } > > > }, > > > "state": 1, > > > > The bpftool dump with "state" 1 is a little bit cryptic. > > Since this is common for all struct_ops maps, can we > > make it explicit, e.g., as enum values, like INIT/INUSE/TOBEFREE? > > This can (and probably should) be done generically in bpftool for any > field of enum type. Not blocking this patch set, though. > > > > > > "data": { > > > "list": { > > > "next": 0, > > > "prev": 0 > > > }, > > > "key": 0, > > > "flags": 2, > > > "init": 24, > > > "release": 0, > > > "ssthresh": 25, > > > "cong_avoid": 30, > > > "set_state": 27, > > > "cwnd_event": 28, > > > "in_ack_event": 26, > > > "undo_cwnd": 29, > > > "pkts_acked": 0, > > > "min_tso_segs": 0, > > > "sndbuf_expand": 0, > > > "cong_control": 0, > > > "get_info": 0, > > > "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0 > > > ], > > Same here, bpftool should be smart enough to figure out that this is a > string, not just an array of bytes. Agree on both above that bpftool can print better strings. Those are generic improvements to bpftool and not specific to a particular map type. > > > > "owner": 0 > > > } > > > } > > > } > > > ] > > > > > [...]
On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote: > > The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value > is a kernel struct with its func ptr implemented in bpf prog. > This new map is the interface to register/unregister/introspect > a bpf implemented kernel struct. > > The kernel struct is actually embedded inside another new struct > (or called the "value" struct in the code). For example, > "struct tcp_congestion_ops" is embbeded in: > struct bpf_struct_ops_tcp_congestion_ops { > refcount_t refcnt; > enum bpf_struct_ops_state state; > struct tcp_congestion_ops data; /* <-- kernel subsystem struct here */ > } > The map value is "struct bpf_struct_ops_tcp_congestion_ops". > The "bpftool map dump" will then be able to show the > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g. > number of tcp_sock in the tcp_congestion_ops case). This "value" struct > is created automatically by a macro. Having a separate "value" struct > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some > initialization works before registering the struct_ops to the kernel > subsystem). The libbpf will take care of finding and populating the > "struct bpf_struct_ops_XYZ" from "struct XYZ". > > Register a struct_ops to a kernel subsystem: > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s) > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id > set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the > running kernel. > Instead of reusing the attr->btf_value_type_id, > btf_vmlinux_value_type_id s added such that attr->btf_fd can still be > used as the "user" btf which could store other useful sysadmin/debug > info that may be introduced in the furture, > e.g. creation-date/compiler-details/map-creator...etc. > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described > in the running kernel btf. Populate the value of this object. > The function ptr should be populated with the prog fds. > 4. Call BPF_MAP_UPDATE with the object created in (3) as > the map value. The key is always "0". > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's > args as an array of u64 is generated. BPF_MAP_UPDATE also allows > the specific struct_ops to do some final checks in "st_ops->init_member()" > (e.g. ensure all mandatory func ptrs are implemented). > If everything looks good, it will register this kernel struct > to the kernel subsystem. The map will not allow further update > from this point. > > Unregister a struct_ops from the kernel subsystem: > BPF_MAP_DELETE with key "0". > > Introspect a struct_ops: > BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will > have the prog _id_ populated as the func ptr. > > The map value state (enum bpf_struct_ops_state) will transit from: > INIT (map created) => > INUSE (map updated, i.e. reg) => > TOBEFREE (map value deleted, i.e. unreg) > > The kernel subsystem needs to call bpf_struct_ops_get() and > bpf_struct_ops_put() to manage the "refcnt" in the > "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt > for the purose of tracking the subsystem usage. Another approach > is to reuse the map->refcnt and then "show" (i.e. during map_lookup) > the subsystem's usage by doing map->refcnt - map->usercnt to filter out > the map-fd/pinned-map usage. However, that will also tie down the > future semantics of map->refcnt and map->usercnt. > > The very first subsystem's refcnt (during reg()) holds one > count to map->refcnt. When the very last subsystem's refcnt > is gone, it will also release the map->refcnt. All bpf_prog will be > freed when the map->refcnt reaches 0 (i.e. during map_free()). > > Here is how the bpftool map command will look like: > [root@arch-fb-vm1 bpf]# bpftool map show > 6: struct_ops name dctcp flags 0x0 > key 4B value 256B max_entries 1 memlock 4096B > btf_id 6 > [root@arch-fb-vm1 bpf]# bpftool map dump id 6 > [{ > "value": { > "refcnt": { > "refs": { > "counter": 1 > } > }, > "state": 1, > "data": { > "list": { > "next": 0, > "prev": 0 > }, > "key": 0, > "flags": 2, > "init": 24, > "release": 0, > "ssthresh": 25, > "cong_avoid": 30, > "set_state": 27, > "cwnd_event": 28, > "in_ack_event": 26, > "undo_cwnd": 29, > "pkts_acked": 0, > "min_tso_segs": 0, > "sndbuf_expand": 0, > "cong_control": 0, > "get_info": 0, > "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0 > ], > "owner": 0 > } > } > } > ] > > Misc Notes: > * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup. > It does an inplace update on "*value" instead returning a pointer > to syscall.c. Otherwise, it needs a separate copy of "zero" value > for the BPF_STRUCT_OPS_STATE_INIT to avoid races. > > * The bpf_struct_ops_map_delete_elem() is also called without > preempt_disable() from map_delete_elem(). It is because > the "->unreg()" may requires sleepable context, e.g. > the "tcp_unregister_congestion_control()". > > * "const" is added to some of the existing "struct btf_func_model *" > function arg to avoid a compiler warning caused by this patch. > > Signed-off-by: Martin KaFai Lau <kafai@fb.com> > --- LGTM! Few questions below to improve my understanding. Acked-by: Andrii Nakryiko <andriin@fb.com> > arch/x86/net/bpf_jit_comp.c | 11 +- > include/linux/bpf.h | 49 +++- > include/linux/bpf_types.h | 3 + > include/linux/btf.h | 13 + > include/uapi/linux/bpf.h | 7 +- > kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++- > kernel/bpf/btf.c | 20 +- > kernel/bpf/map_in_map.c | 3 +- > kernel/bpf/syscall.c | 49 ++-- > kernel/bpf/trampoline.c | 5 +- > kernel/bpf/verifier.c | 5 + > 11 files changed, 593 insertions(+), 40 deletions(-) > [...] > + /* All non func ptr member must be 0 */ > + if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, > + NULL)) { > + u32 msize; > + > + mtype = btf_resolve_size(btf_vmlinux, mtype, > + &msize, NULL, NULL); > + if (IS_ERR(mtype)) { > + err = PTR_ERR(mtype); > + goto reset_unlock; > + } > + > + if (memchr_inv(udata + moff, 0, msize)) { just double-checking: we are ok with having non-zeroed padding in a struct, is that right? > + err = -EINVAL; > + goto reset_unlock; > + } > + > + continue; > + } > + [...] > + > + err = arch_prepare_bpf_trampoline(image, > + &st_ops->func_models[i], 0, > + &prog, 1, NULL, 0, NULL); > + if (err < 0) > + goto reset_unlock; > + > + *(void **)(kdata + moff) = image; > + image += err; are there any alignment requirements on image pointer for trampoline? > + > + /* put prog_id to udata */ > + *(unsigned long *)(udata + moff) = prog->aux->id; > + } > + [...]
Hi Martin, I love your patch! Yet something to improve: [auto build test ERROR on bpf-next/master] [cannot apply to bpf/master net/master v5.5-rc3 next-20191219] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system. BTW, we also suggest to use '--base' option to specify the base tree in git format-patch, please see https://stackoverflow.com/a/37406982] url: https://github.com/0day-ci/linux/commits/Martin-KaFai-Lau/Introduce-BPF-STRUCT_OPS/20191224-085617 base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master config: arm64-defconfig (attached as .config) compiler: aarch64-linux-gcc (GCC) 7.5.0 reproduce: wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree GCC_VERSION=7.5.0 make.cross ARCH=arm64 If you fix the issue, kindly add following tag Reported-by: kbuild test robot <lkp@intel.com> All error/warnings (new ones prefixed by >>): kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_init': kernel/bpf/bpf_struct_ops.c:176:8: error: implicit declaration of function 'btf_distill_func_proto'; did you mean 'btf_type_is_func_proto'? [-Werror=implicit-function-declaration] btf_distill_func_proto(&log, _btf_vmlinux, ^~~~~~~~~~~~~~~~~~~~~~ btf_type_is_func_proto kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_update_elem': >> kernel/bpf/bpf_struct_ops.c:408:2: error: implicit declaration of function 'bpf_map_inc'; did you mean 'bpf_map_put'? [-Werror=implicit-function-declaration] bpf_map_inc(map); ^~~~~~~~~~~ bpf_map_put kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_free': >> kernel/bpf/bpf_struct_ops.c:468:2: error: implicit declaration of function 'bpf_map_area_free'; did you mean 'bpf_prog_free'? [-Werror=implicit-function-declaration] bpf_map_area_free(st_map->progs); ^~~~~~~~~~~~~~~~~ bpf_prog_free kernel/bpf/bpf_struct_ops.c: In function 'bpf_struct_ops_map_alloc': kernel/bpf/bpf_struct_ops.c:515:8: error: implicit declaration of function 'bpf_map_charge_init'; did you mean 'bpf_prog_change_xdp'? [-Werror=implicit-function-declaration] err = bpf_map_charge_init(&mem, map_total_size); ^~~~~~~~~~~~~~~~~~~ bpf_prog_change_xdp >> kernel/bpf/bpf_struct_ops.c:519:11: error: implicit declaration of function 'bpf_map_area_alloc'; did you mean 'bpf_prog_alloc'? [-Werror=implicit-function-declaration] st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE); ^~~~~~~~~~~~~~~~~~ bpf_prog_alloc >> kernel/bpf/bpf_struct_ops.c:519:9: warning: assignment makes pointer from integer without a cast [-Wint-conversion] st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE); ^ >> kernel/bpf/bpf_struct_ops.c:521:3: error: implicit declaration of function 'bpf_map_charge_finish'; did you mean 'bpf_map_flags_to_cap'? [-Werror=implicit-function-declaration] bpf_map_charge_finish(&mem); ^~~~~~~~~~~~~~~~~~~~~ bpf_map_flags_to_cap kernel/bpf/bpf_struct_ops.c:527:17: warning: assignment makes pointer from integer without a cast [-Wint-conversion] st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE); ^ kernel/bpf/bpf_struct_ops.c:528:16: warning: assignment makes pointer from integer without a cast [-Wint-conversion] st_map->progs = ^ kernel/bpf/bpf_struct_ops.c:545:2: error: implicit declaration of function 'bpf_map_init_from_attr'; did you mean 'bpf_jit_get_func_addr'? [-Werror=implicit-function-declaration] bpf_map_init_from_attr(map, attr); ^~~~~~~~~~~~~~~~~~~~~~ bpf_jit_get_func_addr kernel/bpf/bpf_struct_ops.c:546:2: error: implicit declaration of function 'bpf_map_charge_move'; did you mean 'bpf_prog_change_xdp'? [-Werror=implicit-function-declaration] bpf_map_charge_move(&map->memory, &mem); ^~~~~~~~~~~~~~~~~~~ bpf_prog_change_xdp cc1: some warnings being treated as errors vim +408 kernel/bpf/bpf_struct_ops.c 289 290 static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, 291 void *value, u64 flags) 292 { 293 struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; 294 const struct bpf_struct_ops *st_ops = st_map->st_ops; 295 struct bpf_struct_ops_value *uvalue, *kvalue; 296 const struct btf_member *member; 297 const struct btf_type *t = st_ops->type; 298 void *udata, *kdata; 299 int prog_fd, err = 0; 300 void *image; 301 u32 i; 302 303 if (flags) 304 return -EINVAL; 305 306 if (*(u32 *)key != 0) 307 return -E2BIG; 308 309 uvalue = (struct bpf_struct_ops_value *)value; 310 if (uvalue->state || refcount_read(&uvalue->refcnt)) 311 return -EINVAL; 312 313 uvalue = (struct bpf_struct_ops_value *)st_map->uvalue; 314 kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue; 315 316 spin_lock(&st_map->lock); 317 318 if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) { 319 err = -EBUSY; 320 goto unlock; 321 } 322 323 memcpy(uvalue, value, map->value_size); 324 325 udata = &uvalue->data; 326 kdata = &kvalue->data; 327 image = st_map->image; 328 329 for_each_member(i, t, member) { 330 const struct btf_type *mtype, *ptype; 331 struct bpf_prog *prog; 332 u32 moff; 333 334 moff = btf_member_bit_offset(t, member) / 8; 335 mtype = btf_type_by_id(btf_vmlinux, member->type); 336 ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL); 337 if (ptype == module_type) { 338 *(void **)(kdata + moff) = BPF_MODULE_OWNER; 339 continue; 340 } 341 342 err = st_ops->init_member(t, member, kdata, udata); 343 if (err < 0) 344 goto reset_unlock; 345 346 /* The ->init_member() has handled this member */ 347 if (err > 0) 348 continue; 349 350 /* If st_ops->init_member does not handle it, 351 * we will only handle func ptrs and zero-ed members 352 * here. Reject everything else. 353 */ 354 355 /* All non func ptr member must be 0 */ 356 if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, 357 NULL)) { 358 u32 msize; 359 360 mtype = btf_resolve_size(btf_vmlinux, mtype, 361 &msize, NULL, NULL); 362 if (IS_ERR(mtype)) { 363 err = PTR_ERR(mtype); 364 goto reset_unlock; 365 } 366 367 if (memchr_inv(udata + moff, 0, msize)) { 368 err = -EINVAL; 369 goto reset_unlock; 370 } 371 372 continue; 373 } 374 375 prog_fd = (int)(*(unsigned long *)(udata + moff)); 376 /* Similar check as the attr->attach_prog_fd */ 377 if (!prog_fd) 378 continue; 379 380 prog = bpf_prog_get(prog_fd); 381 if (IS_ERR(prog)) { 382 err = PTR_ERR(prog); 383 goto reset_unlock; 384 } 385 st_map->progs[i] = prog; 386 387 if (prog->type != BPF_PROG_TYPE_STRUCT_OPS || 388 prog->aux->attach_btf_id != st_ops->type_id || 389 prog->expected_attach_type != i) { 390 err = -EINVAL; 391 goto reset_unlock; 392 } 393 394 err = arch_prepare_bpf_trampoline(image, 395 &st_ops->func_models[i], 0, 396 &prog, 1, NULL, 0, NULL); 397 if (err < 0) 398 goto reset_unlock; 399 400 *(void **)(kdata + moff) = image; 401 image += err; 402 403 /* put prog_id to udata */ 404 *(unsigned long *)(udata + moff) = prog->aux->id; 405 } 406 407 refcount_set(&kvalue->refcnt, 1); > 408 bpf_map_inc(map); 409 410 err = st_ops->reg(kdata); 411 if (!err) { 412 /* Pair with smp_load_acquire() during lookup */ 413 smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE); 414 goto unlock; 415 } 416 417 /* Error during st_ops->reg() */ 418 bpf_map_put(map); 419 420 reset_unlock: 421 bpf_struct_ops_map_put_progs(st_map); 422 memset(uvalue, 0, map->value_size); 423 memset(kvalue, 0, map->value_size); 424 425 unlock: 426 spin_unlock(&st_map->lock); 427 return err; 428 } 429 430 static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key) 431 { 432 enum bpf_struct_ops_state prev_state; 433 struct bpf_struct_ops_map *st_map; 434 435 st_map = (struct bpf_struct_ops_map *)map; 436 prev_state = cmpxchg(&st_map->kvalue.state, 437 BPF_STRUCT_OPS_STATE_INUSE, 438 BPF_STRUCT_OPS_STATE_TOBEFREE); 439 if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) { 440 st_map->st_ops->unreg(&st_map->kvalue.data); 441 if (refcount_dec_and_test(&st_map->kvalue.refcnt)) 442 bpf_map_put(map); 443 } 444 445 return 0; 446 } 447 448 static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key, 449 struct seq_file *m) 450 { 451 void *value; 452 453 value = bpf_struct_ops_map_lookup_elem(map, key); 454 if (!value) 455 return; 456 457 btf_type_seq_show(btf_vmlinux, map->btf_vmlinux_value_type_id, 458 value, m); 459 seq_puts(m, "\n"); 460 } 461 462 static void bpf_struct_ops_map_free(struct bpf_map *map) 463 { 464 struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; 465 466 if (st_map->progs) 467 bpf_struct_ops_map_put_progs(st_map); > 468 bpf_map_area_free(st_map->progs); 469 bpf_jit_free_exec(st_map->image); 470 bpf_map_area_free(st_map->uvalue); 471 bpf_map_area_free(st_map); 472 } 473 474 static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr) 475 { 476 if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 || 477 attr->map_flags || !attr->btf_vmlinux_value_type_id) 478 return -EINVAL; 479 return 0; 480 } 481 482 static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) 483 { 484 const struct bpf_struct_ops *st_ops; 485 size_t map_total_size, st_map_size; 486 struct bpf_struct_ops_map *st_map; 487 const struct btf_type *t, *vt; 488 struct bpf_map_memory mem; 489 struct bpf_map *map; 490 int err; 491 492 if (!capable(CAP_SYS_ADMIN)) 493 return ERR_PTR(-EPERM); 494 495 st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id); 496 if (!st_ops) 497 return ERR_PTR(-ENOTSUPP); 498 499 vt = st_ops->value_type; 500 if (attr->value_size != vt->size) 501 return ERR_PTR(-EINVAL); 502 503 t = st_ops->type; 504 505 st_map_size = sizeof(*st_map) + 506 /* kvalue stores the 507 * struct bpf_struct_ops_tcp_congestions_ops 508 */ 509 (vt->size - sizeof(struct bpf_struct_ops_value)); 510 map_total_size = st_map_size + 511 /* uvalue */ 512 sizeof(vt->size) + 513 /* struct bpf_progs **progs */ 514 btf_type_vlen(t) * sizeof(struct bpf_prog *); > 515 err = bpf_map_charge_init(&mem, map_total_size); 516 if (err < 0) 517 return ERR_PTR(err); 518 > 519 st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE); 520 if (!st_map) { > 521 bpf_map_charge_finish(&mem); 522 return ERR_PTR(-ENOMEM); 523 } 524 st_map->st_ops = st_ops; 525 map = &st_map->map; 526 527 st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE); 528 st_map->progs = 529 bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *), 530 NUMA_NO_NODE); 531 /* Each trampoline costs < 64 bytes. Ensure one page 532 * is enough for max number of func ptrs. 533 */ 534 BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS); 535 st_map->image = bpf_jit_alloc_exec(PAGE_SIZE); 536 if (!st_map->uvalue || !st_map->progs || !st_map->image) { 537 bpf_struct_ops_map_free(map); 538 bpf_map_charge_finish(&mem); 539 return ERR_PTR(-ENOMEM); 540 } 541 542 spin_lock_init(&st_map->lock); 543 set_vm_flush_reset_perms(st_map->image); 544 set_memory_x((long)st_map->image, 1); 545 bpf_map_init_from_attr(map, attr); 546 bpf_map_charge_move(&map->memory, &mem); 547 548 return map; 549 } 550 --- 0-DAY kernel test infrastructure Open Source Technology Center https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org Intel Corporation
On Mon, Dec 23, 2019 at 11:57:48AM -0800, Yonghong Song wrote: [ ... ] > > +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, > > + void *value, u64 flags) > > +{ > > + struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; > > + const struct bpf_struct_ops *st_ops = st_map->st_ops; > > + struct bpf_struct_ops_value *uvalue, *kvalue; > > + const struct btf_member *member; > > + const struct btf_type *t = st_ops->type; > > + void *udata, *kdata; > > + int prog_fd, err = 0; > > + void *image; > > + u32 i; > > + > > + if (flags) > > + return -EINVAL; > > + > > + if (*(u32 *)key != 0) > > + return -E2BIG; > > + > > + uvalue = (struct bpf_struct_ops_value *)value; > > + if (uvalue->state || refcount_read(&uvalue->refcnt)) > > + return -EINVAL; > > + > > + uvalue = (struct bpf_struct_ops_value *)st_map->uvalue; > > + kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue; > > + > > + spin_lock(&st_map->lock); > > + > > + if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) { > > + err = -EBUSY; > > + goto unlock; > > + } > > + > > + memcpy(uvalue, value, map->value_size); > > + > > + udata = &uvalue->data; > > + kdata = &kvalue->data; > > + image = st_map->image; > > + > > + for_each_member(i, t, member) { > > + const struct btf_type *mtype, *ptype; > > + struct bpf_prog *prog; > > + u32 moff; > > + > > + moff = btf_member_bit_offset(t, member) / 8; > > + mtype = btf_type_by_id(btf_vmlinux, member->type); > > + ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL); > > + if (ptype == module_type) { > > + *(void **)(kdata + moff) = BPF_MODULE_OWNER; > > + continue; > > + } > > + > > + err = st_ops->init_member(t, member, kdata, udata); > > + if (err < 0) > > + goto reset_unlock; > > + > > + /* The ->init_member() has handled this member */ > > + if (err > 0) > > + continue; > > + > > + /* If st_ops->init_member does not handle it, > > + * we will only handle func ptrs and zero-ed members > > + * here. Reject everything else. > > + */ > > + > > + /* All non func ptr member must be 0 */ > > + if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, > > + NULL)) { > > + u32 msize; > > + > > + mtype = btf_resolve_size(btf_vmlinux, mtype, > > + &msize, NULL, NULL); > > + if (IS_ERR(mtype)) { > > + err = PTR_ERR(mtype); > > + goto reset_unlock; > > + } > > + > > + if (memchr_inv(udata + moff, 0, msize)) { > > + err = -EINVAL; > > + goto reset_unlock; > > + } > > + > > + continue; > > + } > > + > > + prog_fd = (int)(*(unsigned long *)(udata + moff)); > > + /* Similar check as the attr->attach_prog_fd */ > > + if (!prog_fd) > > + continue; > > + > > + prog = bpf_prog_get(prog_fd); > > + if (IS_ERR(prog)) { > > + err = PTR_ERR(prog); > > + goto reset_unlock; > > + } > > + st_map->progs[i] = prog; > > + > > + if (prog->type != BPF_PROG_TYPE_STRUCT_OPS || > > + prog->aux->attach_btf_id != st_ops->type_id || > > + prog->expected_attach_type != i) { > > + err = -EINVAL; > > + goto reset_unlock; > > + } > > + > > + err = arch_prepare_bpf_trampoline(image, > > + &st_ops->func_models[i], 0, > > + &prog, 1, NULL, 0, NULL); > > + if (err < 0) > > + goto reset_unlock; > > + > > + *(void **)(kdata + moff) = image; > > + image += err; > > + > > + /* put prog_id to udata */ > > + *(unsigned long *)(udata + moff) = prog->aux->id; > > + } > > Should we check whether user indeed provided `module` member or > not before declaring success? hmm.... By 'module', you meant the "if (ptype == module_type)" logic at the beginning? Yes, it should check user passed 0 also. will fix. > > +static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) > > +{ > > + const struct bpf_struct_ops *st_ops; > > + size_t map_total_size, st_map_size; > > + struct bpf_struct_ops_map *st_map; > > + const struct btf_type *t, *vt; > > + struct bpf_map_memory mem; > > + struct bpf_map *map; > > + int err; > > + > > + if (!capable(CAP_SYS_ADMIN)) > > + return ERR_PTR(-EPERM); > > + > > + st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id); > > + if (!st_ops) > > + return ERR_PTR(-ENOTSUPP); > > + > > + vt = st_ops->value_type; > > + if (attr->value_size != vt->size) > > + return ERR_PTR(-EINVAL); > > + > > + t = st_ops->type; > > + > > + st_map_size = sizeof(*st_map) + > > + /* kvalue stores the > > + * struct bpf_struct_ops_tcp_congestions_ops > > + */ > > + (vt->size - sizeof(struct bpf_struct_ops_value)); > > + map_total_size = st_map_size + > > + /* uvalue */ > > + sizeof(vt->size) + > > + /* struct bpf_progs **progs */ > > + btf_type_vlen(t) * sizeof(struct bpf_prog *); > > + err = bpf_map_charge_init(&mem, map_total_size); > > + if (err < 0) > > + return ERR_PTR(err); > > + > > + st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE); > > + if (!st_map) { > > + bpf_map_charge_finish(&mem); > > + return ERR_PTR(-ENOMEM); > > + } > > + st_map->st_ops = st_ops; > > + map = &st_map->map; > > + > > + st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE); > > + st_map->progs = > > + bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *), > > + NUMA_NO_NODE); > > + /* Each trampoline costs < 64 bytes. Ensure one page > > + * is enough for max number of func ptrs. > > + */ > > + BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS); > > This maybe true for x86 now, but it may not hold up for future other > architectures. Not sure whether we should get the value for arch call > backs, or we just bail out during map update if we ever grow exceeds > one page. I will reuse the existing WARN_ON_ONCE() check in arch_prepare_bpf_trampoline(). Need to parameterize the end-of-image (i.e. the current PAGE_SIZE / 2 assumption).
On Mon, Dec 23, 2019 at 03:05:08PM -0800, Andrii Nakryiko wrote: > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value > > is a kernel struct with its func ptr implemented in bpf prog. > > This new map is the interface to register/unregister/introspect > > a bpf implemented kernel struct. > > > > The kernel struct is actually embedded inside another new struct > > (or called the "value" struct in the code). For example, > > "struct tcp_congestion_ops" is embbeded in: > > struct bpf_struct_ops_tcp_congestion_ops { > > refcount_t refcnt; > > enum bpf_struct_ops_state state; > > struct tcp_congestion_ops data; /* <-- kernel subsystem struct here */ > > } > > The map value is "struct bpf_struct_ops_tcp_congestion_ops". > > The "bpftool map dump" will then be able to show the > > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g. > > number of tcp_sock in the tcp_congestion_ops case). This "value" struct > > is created automatically by a macro. Having a separate "value" struct > > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding > > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some > > initialization works before registering the struct_ops to the kernel > > subsystem). The libbpf will take care of finding and populating the > > "struct bpf_struct_ops_XYZ" from "struct XYZ". > > > > Register a struct_ops to a kernel subsystem: > > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s) > > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id > > set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the > > running kernel. > > Instead of reusing the attr->btf_value_type_id, > > btf_vmlinux_value_type_id s added such that attr->btf_fd can still be > > used as the "user" btf which could store other useful sysadmin/debug > > info that may be introduced in the furture, > > e.g. creation-date/compiler-details/map-creator...etc. > > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described > > in the running kernel btf. Populate the value of this object. > > The function ptr should be populated with the prog fds. > > 4. Call BPF_MAP_UPDATE with the object created in (3) as > > the map value. The key is always "0". > > > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's > > args as an array of u64 is generated. BPF_MAP_UPDATE also allows > > the specific struct_ops to do some final checks in "st_ops->init_member()" > > (e.g. ensure all mandatory func ptrs are implemented). > > If everything looks good, it will register this kernel struct > > to the kernel subsystem. The map will not allow further update > > from this point. > > > > Unregister a struct_ops from the kernel subsystem: > > BPF_MAP_DELETE with key "0". > > > > Introspect a struct_ops: > > BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will > > have the prog _id_ populated as the func ptr. > > > > The map value state (enum bpf_struct_ops_state) will transit from: > > INIT (map created) => > > INUSE (map updated, i.e. reg) => > > TOBEFREE (map value deleted, i.e. unreg) > > > > The kernel subsystem needs to call bpf_struct_ops_get() and > > bpf_struct_ops_put() to manage the "refcnt" in the > > "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt > > for the purose of tracking the subsystem usage. Another approach > > is to reuse the map->refcnt and then "show" (i.e. during map_lookup) > > the subsystem's usage by doing map->refcnt - map->usercnt to filter out > > the map-fd/pinned-map usage. However, that will also tie down the > > future semantics of map->refcnt and map->usercnt. > > > > The very first subsystem's refcnt (during reg()) holds one > > count to map->refcnt. When the very last subsystem's refcnt > > is gone, it will also release the map->refcnt. All bpf_prog will be > > freed when the map->refcnt reaches 0 (i.e. during map_free()). > > > > Here is how the bpftool map command will look like: > > [root@arch-fb-vm1 bpf]# bpftool map show > > 6: struct_ops name dctcp flags 0x0 > > key 4B value 256B max_entries 1 memlock 4096B > > btf_id 6 > > [root@arch-fb-vm1 bpf]# bpftool map dump id 6 > > [{ > > "value": { > > "refcnt": { > > "refs": { > > "counter": 1 > > } > > }, > > "state": 1, > > "data": { > > "list": { > > "next": 0, > > "prev": 0 > > }, > > "key": 0, > > "flags": 2, > > "init": 24, > > "release": 0, > > "ssthresh": 25, > > "cong_avoid": 30, > > "set_state": 27, > > "cwnd_event": 28, > > "in_ack_event": 26, > > "undo_cwnd": 29, > > "pkts_acked": 0, > > "min_tso_segs": 0, > > "sndbuf_expand": 0, > > "cong_control": 0, > > "get_info": 0, > > "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0 > > ], > > "owner": 0 > > } > > } > > } > > ] > > > > Misc Notes: > > * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup. > > It does an inplace update on "*value" instead returning a pointer > > to syscall.c. Otherwise, it needs a separate copy of "zero" value > > for the BPF_STRUCT_OPS_STATE_INIT to avoid races. > > > > * The bpf_struct_ops_map_delete_elem() is also called without > > preempt_disable() from map_delete_elem(). It is because > > the "->unreg()" may requires sleepable context, e.g. > > the "tcp_unregister_congestion_control()". > > > > * "const" is added to some of the existing "struct btf_func_model *" > > function arg to avoid a compiler warning caused by this patch. > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com> > > --- > > LGTM! Few questions below to improve my understanding. > > Acked-by: Andrii Nakryiko <andriin@fb.com> > > > arch/x86/net/bpf_jit_comp.c | 11 +- > > include/linux/bpf.h | 49 +++- > > include/linux/bpf_types.h | 3 + > > include/linux/btf.h | 13 + > > include/uapi/linux/bpf.h | 7 +- > > kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++- > > kernel/bpf/btf.c | 20 +- > > kernel/bpf/map_in_map.c | 3 +- > > kernel/bpf/syscall.c | 49 ++-- > > kernel/bpf/trampoline.c | 5 +- > > kernel/bpf/verifier.c | 5 + > > 11 files changed, 593 insertions(+), 40 deletions(-) > > > > [...] > > > + /* All non func ptr member must be 0 */ > > + if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, > > + NULL)) { > > + u32 msize; > > + > > + mtype = btf_resolve_size(btf_vmlinux, mtype, > > + &msize, NULL, NULL); > > + if (IS_ERR(mtype)) { > > + err = PTR_ERR(mtype); > > + goto reset_unlock; > > + } > > + > > + if (memchr_inv(udata + moff, 0, msize)) { > > > just double-checking: we are ok with having non-zeroed padding in a > struct, is that right? Sorry for the delay. You meant the end-padding of the kernel side struct (i.e. kdata (or kvalue)) could be non-zero? The btf's struct size (i.e. vt->size) should include the padding and the whole vt->size is init to 0. or you meant the user passed in udata (or uvalue)? > > > + err = -EINVAL; > > + goto reset_unlock; > > + } > > + > > + continue; > > + } > > + > > [...] > > > + > > + err = arch_prepare_bpf_trampoline(image, > > + &st_ops->func_models[i], 0, > > + &prog, 1, NULL, 0, NULL); > > + if (err < 0) > > + goto reset_unlock; > > + > > + *(void **)(kdata + moff) = image; > > + image += err; > > are there any alignment requirements on image pointer for trampoline? Not that I know of from reading arch_prepare_bpf_trampoline() which also can generate codes to call multiple bpf_prog. > > > + > > + /* put prog_id to udata */ > > + *(unsigned long *)(udata + moff) = prog->aux->id; > > + } > > + > > [...]
On Fri, Dec 27, 2019 at 5:47 PM Martin Lau <kafai@fb.com> wrote: > > On Mon, Dec 23, 2019 at 03:05:08PM -0800, Andrii Nakryiko wrote: > > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value > > > is a kernel struct with its func ptr implemented in bpf prog. > > > This new map is the interface to register/unregister/introspect > > > a bpf implemented kernel struct. > > > > > > The kernel struct is actually embedded inside another new struct > > > (or called the "value" struct in the code). For example, > > > "struct tcp_congestion_ops" is embbeded in: > > > struct bpf_struct_ops_tcp_congestion_ops { > > > refcount_t refcnt; > > > enum bpf_struct_ops_state state; > > > struct tcp_congestion_ops data; /* <-- kernel subsystem struct here */ > > > } > > > The map value is "struct bpf_struct_ops_tcp_congestion_ops". > > > The "bpftool map dump" will then be able to show the > > > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g. > > > number of tcp_sock in the tcp_congestion_ops case). This "value" struct > > > is created automatically by a macro. Having a separate "value" struct > > > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding > > > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some > > > initialization works before registering the struct_ops to the kernel > > > subsystem). The libbpf will take care of finding and populating the > > > "struct bpf_struct_ops_XYZ" from "struct XYZ". > > > > > > Register a struct_ops to a kernel subsystem: > > > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s) > > > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id > > > set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the > > > running kernel. > > > Instead of reusing the attr->btf_value_type_id, > > > btf_vmlinux_value_type_id s added such that attr->btf_fd can still be > > > used as the "user" btf which could store other useful sysadmin/debug > > > info that may be introduced in the furture, > > > e.g. creation-date/compiler-details/map-creator...etc. > > > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described > > > in the running kernel btf. Populate the value of this object. > > > The function ptr should be populated with the prog fds. > > > 4. Call BPF_MAP_UPDATE with the object created in (3) as > > > the map value. The key is always "0". > > > > > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's > > > args as an array of u64 is generated. BPF_MAP_UPDATE also allows > > > the specific struct_ops to do some final checks in "st_ops->init_member()" > > > (e.g. ensure all mandatory func ptrs are implemented). > > > If everything looks good, it will register this kernel struct > > > to the kernel subsystem. The map will not allow further update > > > from this point. > > > > > > Unregister a struct_ops from the kernel subsystem: > > > BPF_MAP_DELETE with key "0". > > > > > > Introspect a struct_ops: > > > BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will > > > have the prog _id_ populated as the func ptr. > > > > > > The map value state (enum bpf_struct_ops_state) will transit from: > > > INIT (map created) => > > > INUSE (map updated, i.e. reg) => > > > TOBEFREE (map value deleted, i.e. unreg) > > > > > > The kernel subsystem needs to call bpf_struct_ops_get() and > > > bpf_struct_ops_put() to manage the "refcnt" in the > > > "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt > > > for the purose of tracking the subsystem usage. Another approach > > > is to reuse the map->refcnt and then "show" (i.e. during map_lookup) > > > the subsystem's usage by doing map->refcnt - map->usercnt to filter out > > > the map-fd/pinned-map usage. However, that will also tie down the > > > future semantics of map->refcnt and map->usercnt. > > > > > > The very first subsystem's refcnt (during reg()) holds one > > > count to map->refcnt. When the very last subsystem's refcnt > > > is gone, it will also release the map->refcnt. All bpf_prog will be > > > freed when the map->refcnt reaches 0 (i.e. during map_free()). > > > > > > Here is how the bpftool map command will look like: > > > [root@arch-fb-vm1 bpf]# bpftool map show > > > 6: struct_ops name dctcp flags 0x0 > > > key 4B value 256B max_entries 1 memlock 4096B > > > btf_id 6 > > > [root@arch-fb-vm1 bpf]# bpftool map dump id 6 > > > [{ > > > "value": { > > > "refcnt": { > > > "refs": { > > > "counter": 1 > > > } > > > }, > > > "state": 1, > > > "data": { > > > "list": { > > > "next": 0, > > > "prev": 0 > > > }, > > > "key": 0, > > > "flags": 2, > > > "init": 24, > > > "release": 0, > > > "ssthresh": 25, > > > "cong_avoid": 30, > > > "set_state": 27, > > > "cwnd_event": 28, > > > "in_ack_event": 26, > > > "undo_cwnd": 29, > > > "pkts_acked": 0, > > > "min_tso_segs": 0, > > > "sndbuf_expand": 0, > > > "cong_control": 0, > > > "get_info": 0, > > > "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0 > > > ], > > > "owner": 0 > > > } > > > } > > > } > > > ] > > > > > > Misc Notes: > > > * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup. > > > It does an inplace update on "*value" instead returning a pointer > > > to syscall.c. Otherwise, it needs a separate copy of "zero" value > > > for the BPF_STRUCT_OPS_STATE_INIT to avoid races. > > > > > > * The bpf_struct_ops_map_delete_elem() is also called without > > > preempt_disable() from map_delete_elem(). It is because > > > the "->unreg()" may requires sleepable context, e.g. > > > the "tcp_unregister_congestion_control()". > > > > > > * "const" is added to some of the existing "struct btf_func_model *" > > > function arg to avoid a compiler warning caused by this patch. > > > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com> > > > --- > > > > LGTM! Few questions below to improve my understanding. > > > > Acked-by: Andrii Nakryiko <andriin@fb.com> > > > > > arch/x86/net/bpf_jit_comp.c | 11 +- > > > include/linux/bpf.h | 49 +++- > > > include/linux/bpf_types.h | 3 + > > > include/linux/btf.h | 13 + > > > include/uapi/linux/bpf.h | 7 +- > > > kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++- > > > kernel/bpf/btf.c | 20 +- > > > kernel/bpf/map_in_map.c | 3 +- > > > kernel/bpf/syscall.c | 49 ++-- > > > kernel/bpf/trampoline.c | 5 +- > > > kernel/bpf/verifier.c | 5 + > > > 11 files changed, 593 insertions(+), 40 deletions(-) > > > > > > > [...] > > > > > + /* All non func ptr member must be 0 */ > > > + if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, > > > + NULL)) { > > > + u32 msize; > > > + > > > + mtype = btf_resolve_size(btf_vmlinux, mtype, > > > + &msize, NULL, NULL); > > > + if (IS_ERR(mtype)) { > > > + err = PTR_ERR(mtype); > > > + goto reset_unlock; > > > + } > > > + > > > + if (memchr_inv(udata + moff, 0, msize)) { > > > > > > just double-checking: we are ok with having non-zeroed padding in a > > struct, is that right? > Sorry for the delay. > > You meant the end-padding of the kernel side struct (i.e. kdata (or kvalue)) > could be non-zero? The btf's struct size (i.e. vt->size) should include > the padding and the whole vt->size is init to 0. > > or you meant the user passed in udata (or uvalue)? The latter, udata. You check member-by-member, but if there is padding between fields or at the end of a struct, nothing is currently checking it for zeroes. So probably safer to check those paddings inbetween as well. > > > > > > + err = -EINVAL; > > > + goto reset_unlock; > > > + } > > > + > > > + continue; > > > + } > > > + > > > > [...] > > > > > + > > > + err = arch_prepare_bpf_trampoline(image, > > > + &st_ops->func_models[i], 0, > > > + &prog, 1, NULL, 0, NULL); > > > + if (err < 0) > > > + goto reset_unlock; > > > + > > > + *(void **)(kdata + moff) = image; > > > + image += err; > > > > are there any alignment requirements on image pointer for trampoline? > Not that I know of from reading arch_prepare_bpf_trampoline() > which also can generate codes to call multiple bpf_prog. Yeah, I think you are right, at least for x86. Variable-sized x86 instructions pretty much rule out any alignment. It might not be the case for fixed-sized instructions architectures, but we should cross that bridge when we get to it. > > > > > > + > > > + /* put prog_id to udata */ > > > + *(unsigned long *)(udata + moff) = prog->aux->id; > > > + } > > > + > > > > [...]
On Fri, Dec 27, 2019 at 06:24:41PM -0800, Andrii Nakryiko wrote: > On Fri, Dec 27, 2019 at 5:47 PM Martin Lau <kafai@fb.com> wrote: > > > > On Mon, Dec 23, 2019 at 03:05:08PM -0800, Andrii Nakryiko wrote: > > > On Fri, Dec 20, 2019 at 10:26 PM Martin KaFai Lau <kafai@fb.com> wrote: > > > > > > > > The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value > > > > is a kernel struct with its func ptr implemented in bpf prog. > > > > This new map is the interface to register/unregister/introspect > > > > a bpf implemented kernel struct. > > > > > > > > The kernel struct is actually embedded inside another new struct > > > > (or called the "value" struct in the code). For example, > > > > "struct tcp_congestion_ops" is embbeded in: > > > > struct bpf_struct_ops_tcp_congestion_ops { > > > > refcount_t refcnt; > > > > enum bpf_struct_ops_state state; > > > > struct tcp_congestion_ops data; /* <-- kernel subsystem struct here */ > > > > } > > > > The map value is "struct bpf_struct_ops_tcp_congestion_ops". > > > > The "bpftool map dump" will then be able to show the > > > > state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g. > > > > number of tcp_sock in the tcp_congestion_ops case). This "value" struct > > > > is created automatically by a macro. Having a separate "value" struct > > > > will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding > > > > "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some > > > > initialization works before registering the struct_ops to the kernel > > > > subsystem). The libbpf will take care of finding and populating the > > > > "struct bpf_struct_ops_XYZ" from "struct XYZ". > > > > > > > > Register a struct_ops to a kernel subsystem: > > > > 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s) > > > > 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id > > > > set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the > > > > running kernel. > > > > Instead of reusing the attr->btf_value_type_id, > > > > btf_vmlinux_value_type_id s added such that attr->btf_fd can still be > > > > used as the "user" btf which could store other useful sysadmin/debug > > > > info that may be introduced in the furture, > > > > e.g. creation-date/compiler-details/map-creator...etc. > > > > 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described > > > > in the running kernel btf. Populate the value of this object. > > > > The function ptr should be populated with the prog fds. > > > > 4. Call BPF_MAP_UPDATE with the object created in (3) as > > > > the map value. The key is always "0". > > > > > > > > During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's > > > > args as an array of u64 is generated. BPF_MAP_UPDATE also allows > > > > the specific struct_ops to do some final checks in "st_ops->init_member()" > > > > (e.g. ensure all mandatory func ptrs are implemented). > > > > If everything looks good, it will register this kernel struct > > > > to the kernel subsystem. The map will not allow further update > > > > from this point. > > > > > > > > Unregister a struct_ops from the kernel subsystem: > > > > BPF_MAP_DELETE with key "0". > > > > > > > > Introspect a struct_ops: > > > > BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will > > > > have the prog _id_ populated as the func ptr. > > > > > > > > The map value state (enum bpf_struct_ops_state) will transit from: > > > > INIT (map created) => > > > > INUSE (map updated, i.e. reg) => > > > > TOBEFREE (map value deleted, i.e. unreg) > > > > > > > > The kernel subsystem needs to call bpf_struct_ops_get() and > > > > bpf_struct_ops_put() to manage the "refcnt" in the > > > > "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt > > > > for the purose of tracking the subsystem usage. Another approach > > > > is to reuse the map->refcnt and then "show" (i.e. during map_lookup) > > > > the subsystem's usage by doing map->refcnt - map->usercnt to filter out > > > > the map-fd/pinned-map usage. However, that will also tie down the > > > > future semantics of map->refcnt and map->usercnt. > > > > > > > > The very first subsystem's refcnt (during reg()) holds one > > > > count to map->refcnt. When the very last subsystem's refcnt > > > > is gone, it will also release the map->refcnt. All bpf_prog will be > > > > freed when the map->refcnt reaches 0 (i.e. during map_free()). > > > > > > > > Here is how the bpftool map command will look like: > > > > [root@arch-fb-vm1 bpf]# bpftool map show > > > > 6: struct_ops name dctcp flags 0x0 > > > > key 4B value 256B max_entries 1 memlock 4096B > > > > btf_id 6 > > > > [root@arch-fb-vm1 bpf]# bpftool map dump id 6 > > > > [{ > > > > "value": { > > > > "refcnt": { > > > > "refs": { > > > > "counter": 1 > > > > } > > > > }, > > > > "state": 1, > > > > "data": { > > > > "list": { > > > > "next": 0, > > > > "prev": 0 > > > > }, > > > > "key": 0, > > > > "flags": 2, > > > > "init": 24, > > > > "release": 0, > > > > "ssthresh": 25, > > > > "cong_avoid": 30, > > > > "set_state": 27, > > > > "cwnd_event": 28, > > > > "in_ack_event": 26, > > > > "undo_cwnd": 29, > > > > "pkts_acked": 0, > > > > "min_tso_segs": 0, > > > > "sndbuf_expand": 0, > > > > "cong_control": 0, > > > > "get_info": 0, > > > > "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0 > > > > ], > > > > "owner": 0 > > > > } > > > > } > > > > } > > > > ] > > > > > > > > Misc Notes: > > > > * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup. > > > > It does an inplace update on "*value" instead returning a pointer > > > > to syscall.c. Otherwise, it needs a separate copy of "zero" value > > > > for the BPF_STRUCT_OPS_STATE_INIT to avoid races. > > > > > > > > * The bpf_struct_ops_map_delete_elem() is also called without > > > > preempt_disable() from map_delete_elem(). It is because > > > > the "->unreg()" may requires sleepable context, e.g. > > > > the "tcp_unregister_congestion_control()". > > > > > > > > * "const" is added to some of the existing "struct btf_func_model *" > > > > function arg to avoid a compiler warning caused by this patch. > > > > > > > > Signed-off-by: Martin KaFai Lau <kafai@fb.com> > > > > --- > > > > > > LGTM! Few questions below to improve my understanding. > > > > > > Acked-by: Andrii Nakryiko <andriin@fb.com> > > > > > > > arch/x86/net/bpf_jit_comp.c | 11 +- > > > > include/linux/bpf.h | 49 +++- > > > > include/linux/bpf_types.h | 3 + > > > > include/linux/btf.h | 13 + > > > > include/uapi/linux/bpf.h | 7 +- > > > > kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++- > > > > kernel/bpf/btf.c | 20 +- > > > > kernel/bpf/map_in_map.c | 3 +- > > > > kernel/bpf/syscall.c | 49 ++-- > > > > kernel/bpf/trampoline.c | 5 +- > > > > kernel/bpf/verifier.c | 5 + > > > > 11 files changed, 593 insertions(+), 40 deletions(-) > > > > > > > > > > [...] > > > > > > > + /* All non func ptr member must be 0 */ > > > > + if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, > > > > + NULL)) { > > > > + u32 msize; > > > > + > > > > + mtype = btf_resolve_size(btf_vmlinux, mtype, > > > > + &msize, NULL, NULL); > > > > + if (IS_ERR(mtype)) { > > > > + err = PTR_ERR(mtype); > > > > + goto reset_unlock; > > > > + } > > > > + > > > > + if (memchr_inv(udata + moff, 0, msize)) { > > > > > > > > > just double-checking: we are ok with having non-zeroed padding in a > > > struct, is that right? > > Sorry for the delay. > > > > You meant the end-padding of the kernel side struct (i.e. kdata (or kvalue)) > > could be non-zero? The btf's struct size (i.e. vt->size) should include > > the padding and the whole vt->size is init to 0. > > > > or you meant the user passed in udata (or uvalue)? > > The latter, udata. You check member-by-member, but if there is padding > between fields or at the end of a struct, nothing is currently > checking it for zeroes. So probably safer to check those paddings > inbetween as well. Agree. Will do.
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 4c8a2d1f8470..775347c78947 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1328,7 +1328,7 @@ xadd: if (is_imm8(insn->off)) return proglen; } -static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args, +static void save_regs(const struct btf_func_model *m, u8 **prog, int nr_args, int stack_size) { int i; @@ -1344,7 +1344,7 @@ static void save_regs(struct btf_func_model *m, u8 **prog, int nr_args, -(stack_size - i * 8)); } -static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args, +static void restore_regs(const struct btf_func_model *m, u8 **prog, int nr_args, int stack_size) { int i; @@ -1361,7 +1361,7 @@ static void restore_regs(struct btf_func_model *m, u8 **prog, int nr_args, -(stack_size - i * 8)); } -static int invoke_bpf(struct btf_func_model *m, u8 **pprog, +static int invoke_bpf(const struct btf_func_model *m, u8 **pprog, struct bpf_prog **progs, int prog_cnt, int stack_size) { u8 *prog = *pprog; @@ -1456,7 +1456,8 @@ static int invoke_bpf(struct btf_func_model *m, u8 **pprog, * add rsp, 8 // skip eth_type_trans's frame * ret // return to its caller */ -int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags, +int arch_prepare_bpf_trampoline(void *image, + const struct btf_func_model *m, u32 flags, struct bpf_prog **fentry_progs, int fentry_cnt, struct bpf_prog **fexit_progs, int fexit_cnt, void *orig_call) @@ -1529,7 +1530,7 @@ int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags */ if (WARN_ON_ONCE(prog - (u8 *)image > PAGE_SIZE / 2 - BPF_INSN_SAFETY)) return -EFAULT; - return 0; + return (void *)prog - image; } static int emit_cond_near_jump(u8 **pprog, void *func, void *ip, u8 jmp_cond) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index b8f087eb4bdf..4e6bdf15d33f 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -17,6 +17,7 @@ #include <linux/u64_stats_sync.h> #include <linux/refcount.h> #include <linux/mutex.h> +#include <linux/module.h> struct bpf_verifier_env; struct bpf_verifier_log; @@ -106,6 +107,7 @@ struct bpf_map { struct btf *btf; struct bpf_map_memory memory; char name[BPF_OBJ_NAME_LEN]; + u32 btf_vmlinux_value_type_id; bool unpriv_array; bool frozen; /* write-once; write-protected by freeze_mutex */ /* 22 bytes hole */ @@ -183,7 +185,8 @@ static inline bool bpf_map_offload_neutral(const struct bpf_map *map) static inline bool bpf_map_support_seq_show(const struct bpf_map *map) { - return map->btf && map->ops->map_seq_show_elem; + return (map->btf_value_type_id || map->btf_vmlinux_value_type_id) && + map->ops->map_seq_show_elem; } int map_check_no_btf(const struct bpf_map *map, @@ -441,7 +444,8 @@ struct btf_func_model { * fentry = a set of program to run before calling original function * fexit = a set of program to run after original function */ -int arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags, +int arch_prepare_bpf_trampoline(void *image, + const struct btf_func_model *m, u32 flags, struct bpf_prog **fentry_progs, int fentry_cnt, struct bpf_prog **fexit_progs, int fexit_cnt, void *orig_call); @@ -671,6 +675,7 @@ struct bpf_array_aux { struct work_struct work; }; +struct bpf_struct_ops_value; struct btf_type; struct btf_member; @@ -680,21 +685,61 @@ struct bpf_struct_ops { int (*init)(struct btf *_btf_vmlinux); int (*check_member)(const struct btf_type *t, const struct btf_member *member); + int (*init_member)(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata); + int (*reg)(void *kdata); + void (*unreg)(void *kdata); const struct btf_type *type; + const struct btf_type *value_type; const char *name; struct btf_func_model func_models[BPF_STRUCT_OPS_MAX_NR_MEMBERS]; u32 type_id; + u32 value_id; }; #if defined(CONFIG_BPF_JIT) +#define BPF_MODULE_OWNER ((void *)((0xeB9FUL << 2) + POISON_POINTER_DELTA)) const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id); void bpf_struct_ops_init(struct btf *_btf_vmlinux); +bool bpf_struct_ops_get(const void *kdata); +void bpf_struct_ops_put(const void *kdata); +int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key, + void *value); +static inline bool bpf_try_module_get(const void *data, struct module *owner) +{ + if (owner == BPF_MODULE_OWNER) + return bpf_struct_ops_get(data); + else + return try_module_get(owner); +} +static inline void bpf_module_put(const void *data, struct module *owner) +{ + if (owner == BPF_MODULE_OWNER) + bpf_struct_ops_put(data); + else + module_put(owner); +} #else static inline const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id) { return NULL; } static inline void bpf_struct_ops_init(struct btf *_btf_vmlinux) { } +static inline bool bpf_try_module_get(const void *data, struct module *owner) +{ + return try_module_get(owner); +} +static inline void bpf_module_put(const void *data, struct module *owner) +{ + module_put(owner); +} +static inline int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, + void *key, + void *value) +{ + return -EINVAL; +} #endif struct bpf_array { diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index fadd243ffa2d..9f326e6ef885 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -109,3 +109,6 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops) #endif BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops) +#if defined(CONFIG_BPF_JIT) +BPF_MAP_TYPE(BPF_MAP_TYPE_STRUCT_OPS, bpf_struct_ops_map_ops) +#endif diff --git a/include/linux/btf.h b/include/linux/btf.h index f74a09a7120b..881e9b76ef49 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -7,6 +7,8 @@ #include <linux/types.h> #include <uapi/linux/btf.h> +#define BTF_TYPE_EMIT(type) ((void)(type *)0) + struct btf; struct btf_member; struct btf_type; @@ -60,6 +62,10 @@ const struct btf_type *btf_type_resolve_ptr(const struct btf *btf, u32 id, u32 *res_id); const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf, u32 id, u32 *res_id); +const struct btf_type * +btf_resolve_size(const struct btf *btf, const struct btf_type *type, + u32 *type_size, const struct btf_type **elem_type, + u32 *total_nelems); #define for_each_member(i, struct_type, member) \ for (i = 0, member = btf_type_member(struct_type); \ @@ -106,6 +112,13 @@ static inline bool btf_type_kflag(const struct btf_type *t) return BTF_INFO_KFLAG(t->info); } +static inline u32 btf_member_bit_offset(const struct btf_type *struct_type, + const struct btf_member *member) +{ + return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset) + : member->offset; +} + static inline u32 btf_member_bitfield_size(const struct btf_type *struct_type, const struct btf_member *member) { diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index c1eeb3e0e116..38059880963e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -136,6 +136,7 @@ enum bpf_map_type { BPF_MAP_TYPE_STACK, BPF_MAP_TYPE_SK_STORAGE, BPF_MAP_TYPE_DEVMAP_HASH, + BPF_MAP_TYPE_STRUCT_OPS, }; /* Note that tracing related programs such as @@ -398,6 +399,10 @@ union bpf_attr { __u32 btf_fd; /* fd pointing to a BTF type data */ __u32 btf_key_type_id; /* BTF type_id of the key */ __u32 btf_value_type_id; /* BTF type_id of the value */ + __u32 btf_vmlinux_value_type_id;/* BTF type_id of a kernel- + * struct stored as the + * map value + */ }; struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */ @@ -3350,7 +3355,7 @@ struct bpf_map_info { __u32 map_flags; char name[BPF_OBJ_NAME_LEN]; __u32 ifindex; - __u32 :32; + __u32 btf_vmlinux_value_type_id; __u64 netns_dev; __u64 netns_ino; __u32 btf_id; diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index c9f81bd1df83..fb9a0b3e4580 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -10,8 +10,66 @@ #include <linux/seq_file.h> #include <linux/refcount.h> +enum bpf_struct_ops_state { + BPF_STRUCT_OPS_STATE_INIT, + BPF_STRUCT_OPS_STATE_INUSE, + BPF_STRUCT_OPS_STATE_TOBEFREE, +}; + +#define BPF_STRUCT_OPS_COMMON_VALUE \ + refcount_t refcnt; \ + enum bpf_struct_ops_state state + +struct bpf_struct_ops_value { + BPF_STRUCT_OPS_COMMON_VALUE; + char data[0] ____cacheline_aligned_in_smp; +}; + +struct bpf_struct_ops_map { + struct bpf_map map; + const struct bpf_struct_ops *st_ops; + /* protect map_update */ + spinlock_t lock; + /* progs has all the bpf_prog that is populated + * to the func ptr of the kernel's struct + * (in kvalue.data). + */ + struct bpf_prog **progs; + /* image is a page that has all the trampolines + * that stores the func args before calling the bpf_prog. + * A PAGE_SIZE "image" is enough to store all trampoline for + * "progs[]". + */ + void *image; + /* uvalue->data stores the kernel struct + * (e.g. tcp_congestion_ops) that is more useful + * to userspace than the kvalue. For example, + * the bpf_prog's id is stored instead of the kernel + * address of a func ptr. + */ + struct bpf_struct_ops_value *uvalue; + /* kvalue.data stores the actual kernel's struct + * (e.g. tcp_congestion_ops) that will be + * registered to the kernel subsystem. + */ + struct bpf_struct_ops_value kvalue; +}; + +#define VALUE_PREFIX "bpf_struct_ops_" +#define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1) + +/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is + * the map's value exposed to the userspace and its btf-type-id is + * stored at the map->btf_vmlinux_value_type_id. + * + */ #define BPF_STRUCT_OPS_TYPE(_name) \ -extern struct bpf_struct_ops bpf_##_name; +extern struct bpf_struct_ops bpf_##_name; \ + \ +struct bpf_struct_ops_##_name { \ + BPF_STRUCT_OPS_COMMON_VALUE; \ + struct _name data ____cacheline_aligned_in_smp; \ +}; #include "bpf_struct_ops_types.h" #undef BPF_STRUCT_OPS_TYPE @@ -35,19 +93,51 @@ const struct bpf_verifier_ops bpf_struct_ops_verifier_ops = { const struct bpf_prog_ops bpf_struct_ops_prog_ops = { }; +static const struct btf_type *module_type; + void bpf_struct_ops_init(struct btf *_btf_vmlinux) { + s32 type_id, value_id, module_id; const struct btf_member *member; struct bpf_struct_ops *st_ops; struct bpf_verifier_log log = {}; const struct btf_type *t; + char value_name[128]; const char *mname; - s32 type_id; u32 i, j; + /* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */ +#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name); +#include "bpf_struct_ops_types.h" +#undef BPF_STRUCT_OPS_TYPE + + module_id = btf_find_by_name_kind(_btf_vmlinux, "module", + BTF_KIND_STRUCT); + if (module_id < 0) { + pr_warn("Cannot find struct module in btf_vmlinux\n"); + return; + } + module_type = btf_type_by_id(_btf_vmlinux, module_id); + for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { st_ops = bpf_struct_ops[i]; + if (strlen(st_ops->name) + VALUE_PREFIX_LEN >= + sizeof(value_name)) { + pr_warn("struct_ops name %s is too long\n", + st_ops->name); + continue; + } + sprintf(value_name, "%s%s", VALUE_PREFIX, st_ops->name); + + value_id = btf_find_by_name_kind(_btf_vmlinux, value_name, + BTF_KIND_STRUCT); + if (value_id < 0) { + pr_warn("Cannot find struct %s in btf_vmlinux\n", + value_name); + continue; + } + type_id = btf_find_by_name_kind(_btf_vmlinux, st_ops->name, BTF_KIND_STRUCT); if (type_id < 0) { @@ -99,6 +189,9 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux) } else { st_ops->type_id = type_id; st_ops->type = t; + st_ops->value_id = value_id; + st_ops->value_type = + btf_type_by_id(_btf_vmlinux, value_id); } } } @@ -106,6 +199,22 @@ void bpf_struct_ops_init(struct btf *_btf_vmlinux) extern struct btf *btf_vmlinux; +static const struct bpf_struct_ops * +bpf_struct_ops_find_value(u32 value_id) +{ + unsigned int i; + + if (!value_id || !btf_vmlinux) + return NULL; + + for (i = 0; i < ARRAY_SIZE(bpf_struct_ops); i++) { + if (bpf_struct_ops[i]->value_id == value_id) + return bpf_struct_ops[i]; + } + + return NULL; +} + const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id) { unsigned int i; @@ -120,3 +229,358 @@ const struct bpf_struct_ops *bpf_struct_ops_find(u32 type_id) return NULL; } + +static int bpf_struct_ops_map_get_next_key(struct bpf_map *map, void *key, + void *next_key) +{ + if (key && *(u32 *)key == 0) + return -ENOENT; + + *(u32 *)next_key = 0; + return 0; +} + +int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key, + void *value) +{ + struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; + struct bpf_struct_ops_value *uvalue, *kvalue; + enum bpf_struct_ops_state state; + + if (unlikely(*(u32 *)key != 0)) + return -ENOENT; + + kvalue = &st_map->kvalue; + /* Pair with smp_store_release() during map_update */ + state = smp_load_acquire(&kvalue->state); + if (state == BPF_STRUCT_OPS_STATE_INIT) { + memset(value, 0, map->value_size); + return 0; + } + + /* No lock is needed. state and refcnt do not need + * to be updated together under atomic context. + */ + uvalue = (struct bpf_struct_ops_value *)value; + memcpy(uvalue, st_map->uvalue, map->value_size); + uvalue->state = state; + refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt)); + + return 0; +} + +static void *bpf_struct_ops_map_lookup_elem(struct bpf_map *map, void *key) +{ + return ERR_PTR(-EINVAL); +} + +static void bpf_struct_ops_map_put_progs(struct bpf_struct_ops_map *st_map) +{ + const struct btf_type *t = st_map->st_ops->type; + u32 i; + + for (i = 0; i < btf_type_vlen(t); i++) { + if (st_map->progs[i]) { + bpf_prog_put(st_map->progs[i]); + st_map->progs[i] = NULL; + } + } +} + +static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, + void *value, u64 flags) +{ + struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; + const struct bpf_struct_ops *st_ops = st_map->st_ops; + struct bpf_struct_ops_value *uvalue, *kvalue; + const struct btf_member *member; + const struct btf_type *t = st_ops->type; + void *udata, *kdata; + int prog_fd, err = 0; + void *image; + u32 i; + + if (flags) + return -EINVAL; + + if (*(u32 *)key != 0) + return -E2BIG; + + uvalue = (struct bpf_struct_ops_value *)value; + if (uvalue->state || refcount_read(&uvalue->refcnt)) + return -EINVAL; + + uvalue = (struct bpf_struct_ops_value *)st_map->uvalue; + kvalue = (struct bpf_struct_ops_value *)&st_map->kvalue; + + spin_lock(&st_map->lock); + + if (kvalue->state != BPF_STRUCT_OPS_STATE_INIT) { + err = -EBUSY; + goto unlock; + } + + memcpy(uvalue, value, map->value_size); + + udata = &uvalue->data; + kdata = &kvalue->data; + image = st_map->image; + + for_each_member(i, t, member) { + const struct btf_type *mtype, *ptype; + struct bpf_prog *prog; + u32 moff; + + moff = btf_member_bit_offset(t, member) / 8; + mtype = btf_type_by_id(btf_vmlinux, member->type); + ptype = btf_type_resolve_ptr(btf_vmlinux, member->type, NULL); + if (ptype == module_type) { + *(void **)(kdata + moff) = BPF_MODULE_OWNER; + continue; + } + + err = st_ops->init_member(t, member, kdata, udata); + if (err < 0) + goto reset_unlock; + + /* The ->init_member() has handled this member */ + if (err > 0) + continue; + + /* If st_ops->init_member does not handle it, + * we will only handle func ptrs and zero-ed members + * here. Reject everything else. + */ + + /* All non func ptr member must be 0 */ + if (!btf_type_resolve_func_ptr(btf_vmlinux, member->type, + NULL)) { + u32 msize; + + mtype = btf_resolve_size(btf_vmlinux, mtype, + &msize, NULL, NULL); + if (IS_ERR(mtype)) { + err = PTR_ERR(mtype); + goto reset_unlock; + } + + if (memchr_inv(udata + moff, 0, msize)) { + err = -EINVAL; + goto reset_unlock; + } + + continue; + } + + prog_fd = (int)(*(unsigned long *)(udata + moff)); + /* Similar check as the attr->attach_prog_fd */ + if (!prog_fd) + continue; + + prog = bpf_prog_get(prog_fd); + if (IS_ERR(prog)) { + err = PTR_ERR(prog); + goto reset_unlock; + } + st_map->progs[i] = prog; + + if (prog->type != BPF_PROG_TYPE_STRUCT_OPS || + prog->aux->attach_btf_id != st_ops->type_id || + prog->expected_attach_type != i) { + err = -EINVAL; + goto reset_unlock; + } + + err = arch_prepare_bpf_trampoline(image, + &st_ops->func_models[i], 0, + &prog, 1, NULL, 0, NULL); + if (err < 0) + goto reset_unlock; + + *(void **)(kdata + moff) = image; + image += err; + + /* put prog_id to udata */ + *(unsigned long *)(udata + moff) = prog->aux->id; + } + + refcount_set(&kvalue->refcnt, 1); + bpf_map_inc(map); + + err = st_ops->reg(kdata); + if (!err) { + /* Pair with smp_load_acquire() during lookup */ + smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_INUSE); + goto unlock; + } + + /* Error during st_ops->reg() */ + bpf_map_put(map); + +reset_unlock: + bpf_struct_ops_map_put_progs(st_map); + memset(uvalue, 0, map->value_size); + memset(kvalue, 0, map->value_size); + +unlock: + spin_unlock(&st_map->lock); + return err; +} + +static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key) +{ + enum bpf_struct_ops_state prev_state; + struct bpf_struct_ops_map *st_map; + + st_map = (struct bpf_struct_ops_map *)map; + prev_state = cmpxchg(&st_map->kvalue.state, + BPF_STRUCT_OPS_STATE_INUSE, + BPF_STRUCT_OPS_STATE_TOBEFREE); + if (prev_state == BPF_STRUCT_OPS_STATE_INUSE) { + st_map->st_ops->unreg(&st_map->kvalue.data); + if (refcount_dec_and_test(&st_map->kvalue.refcnt)) + bpf_map_put(map); + } + + return 0; +} + +static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key, + struct seq_file *m) +{ + void *value; + + value = bpf_struct_ops_map_lookup_elem(map, key); + if (!value) + return; + + btf_type_seq_show(btf_vmlinux, map->btf_vmlinux_value_type_id, + value, m); + seq_puts(m, "\n"); +} + +static void bpf_struct_ops_map_free(struct bpf_map *map) +{ + struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map; + + if (st_map->progs) + bpf_struct_ops_map_put_progs(st_map); + bpf_map_area_free(st_map->progs); + bpf_jit_free_exec(st_map->image); + bpf_map_area_free(st_map->uvalue); + bpf_map_area_free(st_map); +} + +static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr) +{ + if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 || + attr->map_flags || !attr->btf_vmlinux_value_type_id) + return -EINVAL; + return 0; +} + +static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr) +{ + const struct bpf_struct_ops *st_ops; + size_t map_total_size, st_map_size; + struct bpf_struct_ops_map *st_map; + const struct btf_type *t, *vt; + struct bpf_map_memory mem; + struct bpf_map *map; + int err; + + if (!capable(CAP_SYS_ADMIN)) + return ERR_PTR(-EPERM); + + st_ops = bpf_struct_ops_find_value(attr->btf_vmlinux_value_type_id); + if (!st_ops) + return ERR_PTR(-ENOTSUPP); + + vt = st_ops->value_type; + if (attr->value_size != vt->size) + return ERR_PTR(-EINVAL); + + t = st_ops->type; + + st_map_size = sizeof(*st_map) + + /* kvalue stores the + * struct bpf_struct_ops_tcp_congestions_ops + */ + (vt->size - sizeof(struct bpf_struct_ops_value)); + map_total_size = st_map_size + + /* uvalue */ + sizeof(vt->size) + + /* struct bpf_progs **progs */ + btf_type_vlen(t) * sizeof(struct bpf_prog *); + err = bpf_map_charge_init(&mem, map_total_size); + if (err < 0) + return ERR_PTR(err); + + st_map = bpf_map_area_alloc(st_map_size, NUMA_NO_NODE); + if (!st_map) { + bpf_map_charge_finish(&mem); + return ERR_PTR(-ENOMEM); + } + st_map->st_ops = st_ops; + map = &st_map->map; + + st_map->uvalue = bpf_map_area_alloc(vt->size, NUMA_NO_NODE); + st_map->progs = + bpf_map_area_alloc(btf_type_vlen(t) * sizeof(struct bpf_prog *), + NUMA_NO_NODE); + /* Each trampoline costs < 64 bytes. Ensure one page + * is enough for max number of func ptrs. + */ + BUILD_BUG_ON(PAGE_SIZE / 64 < BPF_STRUCT_OPS_MAX_NR_MEMBERS); + st_map->image = bpf_jit_alloc_exec(PAGE_SIZE); + if (!st_map->uvalue || !st_map->progs || !st_map->image) { + bpf_struct_ops_map_free(map); + bpf_map_charge_finish(&mem); + return ERR_PTR(-ENOMEM); + } + + spin_lock_init(&st_map->lock); + set_vm_flush_reset_perms(st_map->image); + set_memory_x((long)st_map->image, 1); + bpf_map_init_from_attr(map, attr); + bpf_map_charge_move(&map->memory, &mem); + + return map; +} + +const struct bpf_map_ops bpf_struct_ops_map_ops = { + .map_alloc_check = bpf_struct_ops_map_alloc_check, + .map_alloc = bpf_struct_ops_map_alloc, + .map_free = bpf_struct_ops_map_free, + .map_get_next_key = bpf_struct_ops_map_get_next_key, + .map_lookup_elem = bpf_struct_ops_map_lookup_elem, + .map_delete_elem = bpf_struct_ops_map_delete_elem, + .map_update_elem = bpf_struct_ops_map_update_elem, + .map_seq_show_elem = bpf_struct_ops_map_seq_show_elem, +}; + +/* "const void *" because some subsystem is + * passing a const (e.g. const struct tcp_congestion_ops *) + */ +bool bpf_struct_ops_get(const void *kdata) +{ + struct bpf_struct_ops_value *kvalue; + + kvalue = container_of(kdata, struct bpf_struct_ops_value, data); + + return refcount_inc_not_zero(&kvalue->refcnt); +} + +void bpf_struct_ops_put(const void *kdata) +{ + struct bpf_struct_ops_value *kvalue; + + kvalue = container_of(kdata, struct bpf_struct_ops_value, data); + if (refcount_dec_and_test(&kvalue->refcnt)) { + struct bpf_struct_ops_map *st_map; + + st_map = container_of(kvalue, struct bpf_struct_ops_map, + kvalue); + bpf_map_put(&st_map->map); + } +} diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 0e879a512cf4..0ef265170e1e 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -500,13 +500,6 @@ static const char *btf_int_encoding_str(u8 encoding) return "UNKN"; } -static u32 btf_member_bit_offset(const struct btf_type *struct_type, - const struct btf_member *member) -{ - return btf_type_kflag(struct_type) ? BTF_MEMBER_BIT_OFFSET(member->offset) - : member->offset; -} - static u32 btf_type_int(const struct btf_type *t) { return *(u32 *)(t + 1); @@ -1089,7 +1082,7 @@ static const struct resolve_vertex *env_stack_peak(struct btf_verifier_env *env) * *elem_type: same as return type ("struct X") * *total_nelems: 1 */ -static const struct btf_type * +const struct btf_type * btf_resolve_size(const struct btf *btf, const struct btf_type *type, u32 *type_size, const struct btf_type **elem_type, u32 *total_nelems) @@ -1143,8 +1136,10 @@ btf_resolve_size(const struct btf *btf, const struct btf_type *type, return ERR_PTR(-EINVAL); *type_size = nelems * size; - *total_nelems = nelems; - *elem_type = type; + if (total_nelems) + *total_nelems = nelems; + if (elem_type) + *elem_type = type; return array_type ? : type; } @@ -1858,7 +1853,10 @@ static void btf_modifier_seq_show(const struct btf *btf, u32 type_id, void *data, u8 bits_offset, struct seq_file *m) { - t = btf_type_id_resolve(btf, &type_id); + if (btf->resolved_ids) + t = btf_type_id_resolve(btf, &type_id); + else + t = btf_type_skip_modifiers(btf, type_id, NULL); btf_type_ops(t)->seq_show(btf, t, type_id, data, bits_offset, m); } diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c index 5e9366b33f0f..b3c48d1533cb 100644 --- a/kernel/bpf/map_in_map.c +++ b/kernel/bpf/map_in_map.c @@ -22,7 +22,8 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) */ if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY || inner_map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE || - inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE) { + inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE || + inner_map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { fdput(f); return ERR_PTR(-ENOTSUPP); } diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 03a02ef4c496..a07800ec5023 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -628,7 +628,7 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, return ret; } -#define BPF_MAP_CREATE_LAST_FIELD btf_value_type_id +#define BPF_MAP_CREATE_LAST_FIELD btf_vmlinux_value_type_id /* called via syscall */ static int map_create(union bpf_attr *attr) { @@ -642,6 +642,14 @@ static int map_create(union bpf_attr *attr) if (err) return -EINVAL; + if (attr->btf_vmlinux_value_type_id) { + if (attr->map_type != BPF_MAP_TYPE_STRUCT_OPS || + attr->btf_key_type_id || attr->btf_value_type_id) + return -EINVAL; + } else if (attr->btf_key_type_id && !attr->btf_value_type_id) { + return -EINVAL; + } + f_flags = bpf_get_file_flag(attr->map_flags); if (f_flags < 0) return f_flags; @@ -664,32 +672,35 @@ static int map_create(union bpf_attr *attr) atomic64_set(&map->usercnt, 1); mutex_init(&map->freeze_mutex); - if (attr->btf_key_type_id || attr->btf_value_type_id) { + map->spin_lock_off = -EINVAL; + if (attr->btf_key_type_id || attr->btf_value_type_id || + /* Even the map's value is a kernel's struct, + * the bpf_prog.o must have BTF to begin with + * to figure out the corresponding kernel's + * counter part. Thus, attr->btf_fd has + * to be valid also. + */ + attr->btf_vmlinux_value_type_id) { struct btf *btf; - if (!attr->btf_value_type_id) { - err = -EINVAL; - goto free_map; - } - btf = btf_get_by_fd(attr->btf_fd); if (IS_ERR(btf)) { err = PTR_ERR(btf); goto free_map; } + map->btf = btf; - err = map_check_btf(map, btf, attr->btf_key_type_id, - attr->btf_value_type_id); - if (err) { - btf_put(btf); - goto free_map; + if (attr->btf_value_type_id) { + err = map_check_btf(map, btf, attr->btf_key_type_id, + attr->btf_value_type_id); + if (err) + goto free_map; } - map->btf = btf; map->btf_key_type_id = attr->btf_key_type_id; map->btf_value_type_id = attr->btf_value_type_id; - } else { - map->spin_lock_off = -EINVAL; + map->btf_vmlinux_value_type_id = + attr->btf_vmlinux_value_type_id; } err = security_bpf_map_alloc(map); @@ -888,6 +899,9 @@ static int map_lookup_elem(union bpf_attr *attr) } else if (map->map_type == BPF_MAP_TYPE_QUEUE || map->map_type == BPF_MAP_TYPE_STACK) { err = map->ops->map_peek_elem(map, value); + } else if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { + /* struct_ops map requires directly updating "value" */ + err = bpf_struct_ops_map_sys_lookup_elem(map, key, value); } else { rcu_read_lock(); if (map->ops->map_lookup_elem_sys_only) @@ -1092,7 +1106,9 @@ static int map_delete_elem(union bpf_attr *attr) if (bpf_map_is_dev_bound(map)) { err = bpf_map_offload_delete_elem(map, key); goto out; - } else if (IS_FD_PROG_ARRAY(map)) { + } else if (IS_FD_PROG_ARRAY(map) || + map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { + /* These maps require sleepable context */ err = map->ops->map_delete_elem(map, key); goto out; } @@ -2822,6 +2838,7 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map, info.btf_key_type_id = map->btf_key_type_id; info.btf_value_type_id = map->btf_value_type_id; } + info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id; if (bpf_map_is_dev_bound(map)) { err = bpf_map_offload_info_fill(&info, map); diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 5ee301ddbd00..610109cfc7a8 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -110,7 +110,7 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr) fentry, fentry_cnt, fexit, fexit_cnt, tr->func.addr); - if (err) + if (err < 0) goto out; if (tr->selector) @@ -244,7 +244,8 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start) } int __weak -arch_prepare_bpf_trampoline(void *image, struct btf_func_model *m, u32 flags, +arch_prepare_bpf_trampoline(void *image, + const struct btf_func_model *m, u32 flags, struct bpf_prog **fentry_progs, int fentry_cnt, struct bpf_prog **fexit_progs, int fexit_cnt, void *orig_call) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 4c1eaa1a2965..990f13165c52 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -8149,6 +8149,11 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, return -EINVAL; } + if (map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { + verbose(env, "bpf_struct_ops map cannot be used in prog\n"); + return -EINVAL; + } + return 0; }
The patch introduces BPF_MAP_TYPE_STRUCT_OPS. The map value is a kernel struct with its func ptr implemented in bpf prog. This new map is the interface to register/unregister/introspect a bpf implemented kernel struct. The kernel struct is actually embedded inside another new struct (or called the "value" struct in the code). For example, "struct tcp_congestion_ops" is embbeded in: struct bpf_struct_ops_tcp_congestion_ops { refcount_t refcnt; enum bpf_struct_ops_state state; struct tcp_congestion_ops data; /* <-- kernel subsystem struct here */ } The map value is "struct bpf_struct_ops_tcp_congestion_ops". The "bpftool map dump" will then be able to show the state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g. number of tcp_sock in the tcp_congestion_ops case). This "value" struct is created automatically by a macro. Having a separate "value" struct will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some initialization works before registering the struct_ops to the kernel subsystem). The libbpf will take care of finding and populating the "struct bpf_struct_ops_XYZ" from "struct XYZ". Register a struct_ops to a kernel subsystem: 1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s) 2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the running kernel. Instead of reusing the attr->btf_value_type_id, btf_vmlinux_value_type_id s added such that attr->btf_fd can still be used as the "user" btf which could store other useful sysadmin/debug info that may be introduced in the furture, e.g. creation-date/compiler-details/map-creator...etc. 3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described in the running kernel btf. Populate the value of this object. The function ptr should be populated with the prog fds. 4. Call BPF_MAP_UPDATE with the object created in (3) as the map value. The key is always "0". During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's args as an array of u64 is generated. BPF_MAP_UPDATE also allows the specific struct_ops to do some final checks in "st_ops->init_member()" (e.g. ensure all mandatory func ptrs are implemented). If everything looks good, it will register this kernel struct to the kernel subsystem. The map will not allow further update from this point. Unregister a struct_ops from the kernel subsystem: BPF_MAP_DELETE with key "0". Introspect a struct_ops: BPF_MAP_LOOKUP_ELEM with key "0". The map value returned will have the prog _id_ populated as the func ptr. The map value state (enum bpf_struct_ops_state) will transit from: INIT (map created) => INUSE (map updated, i.e. reg) => TOBEFREE (map value deleted, i.e. unreg) The kernel subsystem needs to call bpf_struct_ops_get() and bpf_struct_ops_put() to manage the "refcnt" in the "struct bpf_struct_ops_XYZ". This patch uses a separate refcnt for the purose of tracking the subsystem usage. Another approach is to reuse the map->refcnt and then "show" (i.e. during map_lookup) the subsystem's usage by doing map->refcnt - map->usercnt to filter out the map-fd/pinned-map usage. However, that will also tie down the future semantics of map->refcnt and map->usercnt. The very first subsystem's refcnt (during reg()) holds one count to map->refcnt. When the very last subsystem's refcnt is gone, it will also release the map->refcnt. All bpf_prog will be freed when the map->refcnt reaches 0 (i.e. during map_free()). Here is how the bpftool map command will look like: [root@arch-fb-vm1 bpf]# bpftool map show 6: struct_ops name dctcp flags 0x0 key 4B value 256B max_entries 1 memlock 4096B btf_id 6 [root@arch-fb-vm1 bpf]# bpftool map dump id 6 [{ "value": { "refcnt": { "refs": { "counter": 1 } }, "state": 1, "data": { "list": { "next": 0, "prev": 0 }, "key": 0, "flags": 2, "init": 24, "release": 0, "ssthresh": 25, "cong_avoid": 30, "set_state": 27, "cwnd_event": 28, "in_ack_event": 26, "undo_cwnd": 29, "pkts_acked": 0, "min_tso_segs": 0, "sndbuf_expand": 0, "cong_control": 0, "get_info": 0, "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0 ], "owner": 0 } } } ] Misc Notes: * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup. It does an inplace update on "*value" instead returning a pointer to syscall.c. Otherwise, it needs a separate copy of "zero" value for the BPF_STRUCT_OPS_STATE_INIT to avoid races. * The bpf_struct_ops_map_delete_elem() is also called without preempt_disable() from map_delete_elem(). It is because the "->unreg()" may requires sleepable context, e.g. the "tcp_unregister_congestion_control()". * "const" is added to some of the existing "struct btf_func_model *" function arg to avoid a compiler warning caused by this patch. Signed-off-by: Martin KaFai Lau <kafai@fb.com> --- arch/x86/net/bpf_jit_comp.c | 11 +- include/linux/bpf.h | 49 +++- include/linux/bpf_types.h | 3 + include/linux/btf.h | 13 + include/uapi/linux/bpf.h | 7 +- kernel/bpf/bpf_struct_ops.c | 468 +++++++++++++++++++++++++++++++++++- kernel/bpf/btf.c | 20 +- kernel/bpf/map_in_map.c | 3 +- kernel/bpf/syscall.c | 49 ++-- kernel/bpf/trampoline.c | 5 +- kernel/bpf/verifier.c | 5 + 11 files changed, 593 insertions(+), 40 deletions(-)