Message ID:   1444078101-29060-2-git-send-email-ast@plumgrid.com
State:        Changes Requested, archived
Delegated to: David Miller
On Mon, Oct 5, 2015 at 1:48 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> In order to let unprivileged users load and execute eBPF programs
> teach verifier to prevent pointer leaks.
> Verifier will prevent
> - any arithmetic on pointers
>   (except R10+Imm which is used to compute stack addresses)
> - comparison of pointers
> - passing pointers to helper functions
> - indirectly passing pointers in stack to helper functions
> - returning pointer from bpf program
> - storing pointers into ctx or maps

Does the arithmetic restriction include using a pointer as an index to
a maps-based tail call? I'm still worried about pointer-based
side-effects.

-Kees

>
> Spill/fill of pointers into stack is allowed, but mangling
> of pointers stored in the stack or reading them byte by byte is not.
>
> Within bpf programs the pointers do exist, since programs need to
> be able to access maps, pass skb pointer to LD_ABS insns, etc
> but programs cannot pass such pointer values to the outside
> or obfuscate them.
>
> Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
> so that socket filters (tcpdump), af_packet (quic acceleration)
> and future kcm can use it.
> tracing and tc cls/act program types still require root permissions,
> since tracing actually needs to be able to see all kernel pointers
> and tc is for root only.
>
> For example, the following unprivileged socket filter program is allowed:
>
>   int foo(struct __sk_buff *skb)
>   {
>       char fmt[] = "hello %d\n";
>       bpf_trace_printk(fmt, sizeof(fmt), skb->len);
>       return 0;
>   }
>
> but the following program is not:
>
>   int foo(struct __sk_buff *skb)
>   {
>       char fmt[] = "hello %p\n";
>       bpf_trace_printk(fmt, sizeof(fmt), fmt);
>       return 0;
>   }
>
> since it would leak the kernel stack address via bpf_trace_printk().
>
> Unprivileged socket filter bpf programs have access to the
> following helper functions:
> - map lookup/update/delete (but they cannot store kernel pointers into them)
> - get_random (it's already exposed to unprivileged user space)
> - get_smp_processor_id
> - tail_call into another socket filter program
> - ktime_get_ns
> - bpf_trace_printk (for debugging)
>
> The feature is controlled by sysctl kernel.bpf_enable_unprivileged
> which is off by default.
>
> New tests were added to test_verifier:
>   unpriv: return pointer OK
>   unpriv: add const to pointer OK
>   unpriv: add pointer to pointer OK
>   unpriv: neg pointer OK
>   unpriv: cmp pointer with const OK
>   unpriv: cmp pointer with pointer OK
>   unpriv: pass pointer to printk OK
>   unpriv: pass pointer to helper function OK
>   unpriv: indirectly pass pointer on stack to helper function OK
>   unpriv: mangle pointer on stack 1 OK
>   unpriv: mangle pointer on stack 2 OK
>   unpriv: read pointer from stack in small chunks OK
>   unpriv: write pointer into ctx OK
>   unpriv: write pointer into map elem value OK
>   unpriv: partial copy of pointer OK
>
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> ---
>  include/linux/bpf.h         |    3 +
>  kernel/bpf/syscall.c        |   11 +-
>  kernel/bpf/verifier.c       |  114 +++++++++++++++++++--
>  kernel/sysctl.c             |   10 ++
>  kernel/trace/bpf_trace.c    |    3 +
>  samples/bpf/libbpf.h        |    8 ++
>  samples/bpf/test_verifier.c |  239 ++++++++++++++++++++++++++++++++++++++++++-
>  7 files changed, 371 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index f57d7fed9ec3..acf97d66b681 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -66,6 +66,7 @@ enum bpf_arg_type {
>
>          ARG_PTR_TO_CTX,         /* pointer to context */
>          ARG_ANYTHING,           /* any (initialized) argument is ok */
> +        ARG_VARARG,             /* optional argument */
>  };
>
>  /* type of values returned from helper functions */
> @@ -168,6 +169,8 @@ void bpf_prog_put_rcu(struct bpf_prog *prog);
>  struct bpf_map *bpf_map_get(struct fd f);
>  void bpf_map_put(struct bpf_map *map);
>
> +extern int sysctl_bpf_enable_unprivileged;
> +
>  /* verify correctness of eBPF program */
>  int bpf_check(struct bpf_prog **fp, union bpf_attr *attr);
>  #else
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 5f35f420c12f..9a2098da2da9 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -18,6 +18,8 @@
>  #include <linux/filter.h>
>  #include <linux/version.h>
>
> +int sysctl_bpf_enable_unprivileged __read_mostly;
> +
>  static LIST_HEAD(bpf_map_types);
>
>  static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
> @@ -542,6 +544,9 @@ static int bpf_prog_load(union bpf_attr *attr)
>              attr->kern_version != LINUX_VERSION_CODE)
>                  return -EINVAL;
>
> +        if (type != BPF_PROG_TYPE_SOCKET_FILTER && !capable(CAP_SYS_ADMIN))
> +                return -EPERM;
> +
>          /* plain bpf_prog allocation */
>          prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER);
>          if (!prog)
> @@ -597,11 +602,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
>          union bpf_attr attr = {};
>          int err;
>
> -        /* the syscall is limited to root temporarily. This restriction will be
> -         * lifted when security audit is clean. Note that eBPF+tracing must have
> -         * this restriction, since it may pass kernel data to user space
> -         */
> -        if (!capable(CAP_SYS_ADMIN))
> +        if (!capable(CAP_SYS_ADMIN) && !sysctl_bpf_enable_unprivileged)
>                  return -EPERM;
>
>          if (!access_ok(VERIFY_READ, uattr, 1))
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index b074b23000d6..dccaeeeb1128 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -199,6 +199,7 @@ struct verifier_env {
>          struct verifier_state_list **explored_states; /* search pruning optimization */
>          struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */
>          u32 used_map_cnt;               /* number of used maps */
> +        bool allow_ptr_leaks;
>  };
>
>  /* verbose verifier prints what it's seeing
> @@ -538,6 +539,21 @@ static int bpf_size_to_bytes(int bpf_size)
>          return -EINVAL;
>  }
>
> +static bool is_spillable_regtype(enum bpf_reg_type type)
> +{
> +        switch (type) {
> +        case PTR_TO_MAP_VALUE:
> +        case PTR_TO_MAP_VALUE_OR_NULL:
> +        case PTR_TO_STACK:
> +        case PTR_TO_CTX:
> +        case FRAME_PTR:
> +        case CONST_PTR_TO_MAP:
> +                return true;
> +        default:
> +                return false;
> +        }
> +}
> +
>  /* check_stack_read/write functions track spill/fill of registers,
>   * stack boundary and alignment are checked in check_mem_access()
>   */
> @@ -550,9 +566,7 @@ static int check_stack_write(struct verifier_state *state, int off, int size,
>           */
>
>          if (value_regno >= 0 &&
> -            (state->regs[value_regno].type == PTR_TO_MAP_VALUE ||
> -             state->regs[value_regno].type == PTR_TO_STACK ||
> -             state->regs[value_regno].type == PTR_TO_CTX)) {
> +            is_spillable_regtype(state->regs[value_regno].type)) {
>
>                  /* register containing pointer is being spilled into stack */
>                  if (size != BPF_REG_SIZE) {
> @@ -643,6 +657,20 @@ static int check_ctx_access(struct verifier_env *env, int off, int size,
>          return -EACCES;
>  }
>
> +static bool is_pointer_value(struct verifier_env *env, int regno)
> +{
> +        if (env->allow_ptr_leaks)
> +                return false;
> +
> +        switch (env->cur_state.regs[regno].type) {
> +        case UNKNOWN_VALUE:
> +        case CONST_IMM:
> +                return false;
> +        default:
> +                return true;
> +        }
> +}
> +
>  /* check whether memory at (regno + off) is accessible for t = (read | write)
>   * if t==write, value_regno is a register which value is stored into memory
>   * if t==read, value_regno is a register which will receive the value from memory
> @@ -669,11 +697,21 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off,
>          }
>
>          if (state->regs[regno].type == PTR_TO_MAP_VALUE) {
> +                if (t == BPF_WRITE && value_regno >= 0 &&
> +                    is_pointer_value(env, value_regno)) {
> +                        verbose("R%d leaks addr into map\n", value_regno);
> +                        return -EACCES;
> +                }
>                  err = check_map_access(env, regno, off, size);
>                  if (!err && t == BPF_READ && value_regno >= 0)
>                          mark_reg_unknown_value(state->regs, value_regno);
>
>          } else if (state->regs[regno].type == PTR_TO_CTX) {
> +                if (t == BPF_WRITE && value_regno >= 0 &&
> +                    is_pointer_value(env, value_regno)) {
> +                        verbose("R%d leaks addr into ctx\n", value_regno);
> +                        return -EACCES;
> +                }
>                  err = check_ctx_access(env, off, size, t);
>                  if (!err && t == BPF_READ && value_regno >= 0)
>                          mark_reg_unknown_value(state->regs, value_regno);
> @@ -684,10 +722,17 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off,
>                          verbose("invalid stack off=%d size=%d\n", off, size);
>                          return -EACCES;
>                  }
> -                if (t == BPF_WRITE)
> +                if (t == BPF_WRITE) {
> +                        if (!env->allow_ptr_leaks &&
> +                            state->stack_slot_type[MAX_BPF_STACK + off] == STACK_SPILL &&
> +                            size != BPF_REG_SIZE) {
> +                                verbose("attempt to corrupt spilled pointer on stack\n");
> +                                return -EACCES;
> +                        }
>                          err = check_stack_write(state, off, size, value_regno);
> -                else
> +                } else {
>                          err = check_stack_read(state, off, size, value_regno);
> +                }
>          } else {
>                  verbose("R%d invalid mem access '%s'\n",
>                          regno, reg_type_str[state->regs[regno].type]);
> @@ -770,13 +815,26 @@ static int check_func_arg(struct verifier_env *env, u32 regno,
>          if (arg_type == ARG_DONTCARE)
>                  return 0;
>
> +        if (arg_type == ARG_VARARG) {
> +                if (is_pointer_value(env, regno)) {
> +                        verbose("R%d leaks addr into printk\n", regno);
> +                        return -EACCES;
> +                }
> +                return 0;
> +        }
> +
>          if (reg->type == NOT_INIT) {
>                  verbose("R%d !read_ok\n", regno);
>                  return -EACCES;
>          }
>
> -        if (arg_type == ARG_ANYTHING)
> +        if (arg_type == ARG_ANYTHING) {
> +                if (is_pointer_value(env, regno)) {
> +                        verbose("R%d leaks addr into helper function\n", regno);
> +                        return -EACCES;
> +                }
>                  return 0;
> +        }
>
>          if (arg_type == ARG_PTR_TO_STACK || arg_type == ARG_PTR_TO_MAP_KEY ||
>              arg_type == ARG_PTR_TO_MAP_VALUE) {
> @@ -950,8 +1008,9 @@ static int check_call(struct verifier_env *env, int func_id)
>  }
>
>  /* check validity of 32-bit and 64-bit arithmetic operations */
> -static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn)
> +static int check_alu_op(struct verifier_env *env, struct bpf_insn *insn)
>  {
> +        struct reg_state *regs = env->cur_state.regs;
>          u8 opcode = BPF_OP(insn->code);
>          int err;
>
> @@ -976,6 +1035,12 @@ static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn)
>                  if (err)
>                          return err;
>
> +                if (is_pointer_value(env, insn->dst_reg)) {
> +                        verbose("R%d pointer arithmetic prohibited\n",
> +                                insn->dst_reg);
> +                        return -EACCES;
> +                }
> +
>                  /* check dest operand */
>                  err = check_reg_arg(regs, insn->dst_reg, DST_OP);
>                  if (err)
> @@ -1012,6 +1077,11 @@ static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn)
>                           */
>                          regs[insn->dst_reg] = regs[insn->src_reg];
>                  } else {
> +                        if (is_pointer_value(env, insn->src_reg)) {
> +                                verbose("R%d partial copy of pointer\n",
> +                                        insn->src_reg);
> +                                return -EACCES;
> +                        }
>                          regs[insn->dst_reg].type = UNKNOWN_VALUE;
>                          regs[insn->dst_reg].map_ptr = NULL;
>                  }
> @@ -1061,8 +1131,18 @@ static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn)
>                  /* pattern match 'bpf_add Rx, imm' instruction */
>                  if (opcode == BPF_ADD && BPF_CLASS(insn->code) == BPF_ALU64 &&
>                      regs[insn->dst_reg].type == FRAME_PTR &&
> -                    BPF_SRC(insn->code) == BPF_K)
> +                    BPF_SRC(insn->code) == BPF_K) {
>                          stack_relative = true;
> +                } else if (is_pointer_value(env, insn->dst_reg)) {
> +                        verbose("R%d pointer arithmetic prohibited\n",
> +                                insn->dst_reg);
> +                        return -EACCES;
> +                } else if (BPF_SRC(insn->code) == BPF_X &&
> +                           is_pointer_value(env, insn->src_reg)) {
> +                        verbose("R%d pointer arithmetic prohibited\n",
> +                                insn->src_reg);
> +                        return -EACCES;
> +                }
>
>                  /* check dest operand */
>                  err = check_reg_arg(regs, insn->dst_reg, DST_OP);
> @@ -1101,6 +1181,12 @@ static int check_cond_jmp_op(struct verifier_env *env,
>                  err = check_reg_arg(regs, insn->src_reg, SRC_OP);
>                  if (err)
>                          return err;
> +
> +                if (is_pointer_value(env, insn->src_reg)) {
> +                        verbose("R%d pointer comparison prohibited\n",
> +                                insn->src_reg);
> +                        return -EACCES;
> +                }
>          } else {
>                  if (insn->src_reg != BPF_REG_0) {
>                          verbose("BPF_JMP uses reserved fields\n");
> @@ -1155,6 +1241,9 @@ static int check_cond_jmp_op(struct verifier_env *env,
>                          regs[insn->dst_reg].type = CONST_IMM;
>                          regs[insn->dst_reg].imm = 0;
>                  }
> +        } else if (is_pointer_value(env, insn->dst_reg)) {
> +                verbose("R%d pointer comparison prohibited\n", insn->dst_reg);
> +                return -EACCES;
>          } else if (BPF_SRC(insn->code) == BPF_K &&
>                     (opcode == BPF_JEQ || opcode == BPF_JNE)) {
>
> @@ -1658,7 +1747,7 @@ static int do_check(struct verifier_env *env)
>                  }
>
>                  if (class == BPF_ALU || class == BPF_ALU64) {
> -                        err = check_alu_op(regs, insn);
> +                        err = check_alu_op(env, insn);
>                          if (err)
>                                  return err;
>
> @@ -1816,6 +1905,11 @@ static int do_check(struct verifier_env *env)
>                          if (err)
>                                  return err;
>
> +                        if (is_pointer_value(env, BPF_REG_0)) {
> +                                verbose("R0 leaks addr as return value\n");
> +                                return -EACCES;
> +                        }
> +
>  process_bpf_exit:
>                          insn_idx = pop_stack(env, &prev_insn_idx);
>                          if (insn_idx < 0) {
> @@ -2144,6 +2238,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
>          if (ret < 0)
>                  goto skip_full_check;
>
> +        env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
> +
>          ret = do_check(env);
>
>  skip_full_check:
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index e69201d8094e..a281849278d9 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -64,6 +64,7 @@
>  #include <linux/binfmts.h>
>  #include <linux/sched/sysctl.h>
>  #include <linux/kexec.h>
> +#include <linux/bpf.h>
>
>  #include <asm/uaccess.h>
>  #include <asm/processor.h>
> @@ -1139,6 +1140,15 @@ static struct ctl_table kern_table[] = {
>                  .proc_handler   = timer_migration_handler,
>          },
>  #endif
> +#ifdef CONFIG_BPF_SYSCALL
> +        {
> +                .procname       = "bpf_enable_unprivileged",
> +                .data           = &sysctl_bpf_enable_unprivileged,
> +                .maxlen         = sizeof(sysctl_bpf_enable_unprivileged),
> +                .mode           = 0644,
> +                .proc_handler   = proc_dointvec,
> +        },
> +#endif
>          { }
>  };
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 0fe96c7c8803..46bbed24d1f5 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -173,6 +173,9 @@ static const struct bpf_func_proto bpf_trace_printk_proto = {
>          .ret_type       = RET_INTEGER,
>          .arg1_type      = ARG_PTR_TO_STACK,
>          .arg2_type      = ARG_CONST_STACK_SIZE,
> +        .arg3_type      = ARG_VARARG,
> +        .arg4_type      = ARG_VARARG,
> +        .arg5_type      = ARG_VARARG,
>  };
>
>  const struct bpf_func_proto *bpf_get_trace_printk_proto(void)
> diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
> index 7235e292a03b..b7f63c70b4a2 100644
> --- a/samples/bpf/libbpf.h
> +++ b/samples/bpf/libbpf.h
> @@ -64,6 +64,14 @@ extern char bpf_log_buf[LOG_BUF_SIZE];
>                  .off   = 0, \
>                  .imm   = 0 })
>
> +#define BPF_MOV32_REG(DST, SRC) \
> +        ((struct bpf_insn) { \
> +                .code  = BPF_ALU | BPF_MOV | BPF_X, \
> +                .dst_reg = DST, \
> +                .src_reg = SRC, \
> +                .off   = 0, \
> +                .imm   = 0 })
> +
>  /* Short form of mov, dst_reg = imm32 */
>
>  #define BPF_MOV64_IMM(DST, IMM) \
> diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c
> index ee0f110c9c54..266d014a885e 100644
> --- a/samples/bpf/test_verifier.c
> +++ b/samples/bpf/test_verifier.c
> @@ -15,6 +15,7 @@
>  #include <string.h>
>  #include <linux/filter.h>
>  #include <stddef.h>
> +#include <stdbool.h>
>  #include "libbpf.h"
>
>  #define MAX_INSNS 512
> @@ -25,10 +26,12 @@ struct bpf_test {
>          struct bpf_insn insns[MAX_INSNS];
>          int fixup[32];
>          const char *errstr;
> +        const char *errstr_unpriv;
>          enum {
> +                UNDEF,
>                  ACCEPT,
>                  REJECT
> -        } result;
> +        } result, result_unpriv;
>          enum bpf_prog_type prog_type;
>  };
>
> @@ -96,6 +99,7 @@ static struct bpf_test tests[] = {
>                          BPF_EXIT_INSN(),
>                  },
>                  .errstr = "invalid BPF_LD_IMM insn",
> +                .errstr_unpriv = "R1 pointer comparison",
>                  .result = REJECT,
>          },
>          {
> @@ -109,6 +113,7 @@ static struct bpf_test tests[] = {
>                          BPF_EXIT_INSN(),
>                  },
>                  .errstr = "invalid BPF_LD_IMM insn",
> +                .errstr_unpriv = "R1 pointer comparison",
>                  .result = REJECT,
>          },
>          {
> @@ -219,6 +224,7 @@ static struct bpf_test tests[] = {
>                          BPF_EXIT_INSN(),
>                  },
>                  .errstr = "R0 !read_ok",
> +                .errstr_unpriv = "R1 pointer comparison",
>                  .result = REJECT,
>          },
>          {
> @@ -294,7 +300,9 @@ static struct bpf_test tests[] = {
>                          BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
>                          BPF_EXIT_INSN(),
>                  },
> +                .errstr_unpriv = "R0 leaks addr",
>                  .result = ACCEPT,
> +                .result_unpriv = REJECT,
>          },
>          {
>                  "check corrupted spill/fill",
> @@ -310,6 +318,7 @@
>
>                          BPF_EXIT_INSN(),
>                  },
> +                .errstr_unpriv = "attempt to corrupt spilled",
>                  .errstr = "corrupted spill",
>                  .result = REJECT,
>          },
> @@ -473,6 +482,7 @@ static struct bpf_test tests[] = {
>          },
>          .fixup = {3},
>          .errstr = "R0 invalid mem access",
> +        .errstr_unpriv = "R0 leaks addr",
>          .result = REJECT,
>          },
>          {
> @@ -495,6 +505,8 @@ static struct bpf_test tests[] = {
>                          BPF_MOV64_IMM(BPF_REG_0, 0),
>                          BPF_EXIT_INSN(),
>                  },
> +                .errstr_unpriv = "R1 pointer comparison",
> +                .result_unpriv = REJECT,
>                  .result = ACCEPT,
>          },
>          {
> @@ -521,6 +533,8 @@ static struct bpf_test tests[] = {
>                          BPF_MOV64_IMM(BPF_REG_0, 0),
>                          BPF_EXIT_INSN(),
>                  },
> +                .errstr_unpriv = "R1 pointer comparison",
> +                .result_unpriv = REJECT,
>                  .result = ACCEPT,
>          },
>          {
> @@ -555,6 +569,8 @@ static struct bpf_test tests[] = {
>                          BPF_EXIT_INSN(),
>                  },
>                  .fixup = {24},
> +                .errstr_unpriv = "R1 pointer comparison",
> +                .result_unpriv = REJECT,
>                  .result = ACCEPT,
>          },
>          {
> @@ -603,6 +619,8 @@ static struct bpf_test tests[] = {
>                          BPF_MOV64_IMM(BPF_REG_0, 0),
>                          BPF_EXIT_INSN(),
>                  },
> +                .errstr_unpriv = "R1 pointer comparison",
> +                .result_unpriv = REJECT,
>                  .result = ACCEPT,
>          },
>          {
> @@ -642,6 +660,8 @@ static struct bpf_test tests[] = {
>                          BPF_MOV64_IMM(BPF_REG_0, 0),
>                          BPF_EXIT_INSN(),
>                  },
> +                .errstr_unpriv = "R1 pointer comparison",
> +                .result_unpriv = REJECT,
>                  .result = ACCEPT,
>          },
>          {
> @@ -699,6 +719,7 @@ static struct bpf_test tests[] = {
>          },
>          .fixup = {4},
>          .errstr = "different pointers",
> +        .errstr_unpriv = "R1 pointer comparison",
>          .result = REJECT,
>          },
>          {
> @@ -720,6 +741,7 @@ static struct bpf_test tests[] = {
>          },
>          .fixup = {6},
>          .errstr = "different pointers",
> +        .errstr_unpriv = "R1 pointer comparison",
>          .result = REJECT,
>          },
>          {
> @@ -742,6 +764,7 @@ static struct bpf_test tests[] = {
>          },
>          .fixup = {7},
>          .errstr = "different pointers",
> +        .errstr_unpriv = "R1 pointer comparison",
>          .result = REJECT,
>          },
>          {
> @@ -752,6 +775,7 @@ static struct bpf_test tests[] = {
>                          BPF_EXIT_INSN(),
>                  },
>                  .errstr = "invalid bpf_context access",
> +                .errstr_unpriv = "R1 leaks addr",
>                  .result = REJECT,
>          },
>          {
> @@ -762,6 +786,7 @@ static struct bpf_test tests[] = {
>                          BPF_EXIT_INSN(),
>                  },
>                  .errstr = "invalid bpf_context access",
> +                .errstr_unpriv = "R1 leaks addr",
>                  .result = REJECT,
>          },
>          {
> @@ -772,6 +797,7 @@ static struct bpf_test tests[] = {
>                          BPF_EXIT_INSN(),
>                  },
>                  .errstr = "invalid bpf_context access",
> +                .errstr_unpriv = "R1 leaks addr",
>                  .result = REJECT,
>          },
>          {
> @@ -782,6 +808,7 @@ static struct bpf_test tests[] = {
>                          BPF_EXIT_INSN(),
>                  },
>                  .errstr = "invalid bpf_context access",
> +                .errstr_unpriv = "",
>                  .result = REJECT,
>                  .prog_type = BPF_PROG_TYPE_SCHED_ACT,
>          },
> @@ -803,6 +830,8 @@ static struct bpf_test tests[] = {
>                          BPF_EXIT_INSN(),
>                  },
>                  .result = ACCEPT,
> +                .errstr_unpriv = "R1 leaks addr",
> +                .result_unpriv = REJECT,
>          },
>          {
>                  "write skb fields from tc_cls_act prog",
> @@ -819,6 +848,8 @@ static struct bpf_test tests[] = {
>                                      offsetof(struct __sk_buff, cb[3])),
>                          BPF_EXIT_INSN(),
>                  },
> +                .errstr_unpriv = "",
> +                .result_unpriv = REJECT,
>                  .result = ACCEPT,
>                  .prog_type = BPF_PROG_TYPE_SCHED_CLS,
>          },
> @@ -881,6 +912,195 @@ static struct bpf_test tests[] = {
>                  .result = REJECT,
>                  .errstr = "invalid stack off=0 size=8",
>          },
> +        {
> +                "unpriv: return pointer",
> +                .insns = {
> +                        BPF_MOV64_REG(BPF_REG_0, BPF_REG_10),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .result = ACCEPT,
> +                .result_unpriv = REJECT,
> +                .errstr_unpriv = "R0 leaks addr",
> +        },
> +        {
> +                "unpriv: add const to pointer",
> +                .insns = {
> +                        BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 8),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .result = ACCEPT,
> +                .result_unpriv = REJECT,
> +                .errstr_unpriv = "R1 pointer arithmetic",
> +        },
> +        {
> +                "unpriv: add pointer to pointer",
> +                .insns = {
> +                        BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_10),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .result = ACCEPT,
> +                .result_unpriv = REJECT,
> +                .errstr_unpriv = "R1 pointer arithmetic",
> +        },
> +        {
> +                "unpriv: neg pointer",
> +                .insns = {
> +                        BPF_ALU64_IMM(BPF_NEG, BPF_REG_1, 0),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .result = ACCEPT,
> +                .result_unpriv = REJECT,
> +                .errstr_unpriv = "R1 pointer arithmetic",
> +        },
> +        {
> +                "unpriv: cmp pointer with const",
> +                .insns = {
> +                        BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 0),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .result = ACCEPT,
> +                .result_unpriv = REJECT,
> +                .errstr_unpriv = "R1 pointer comparison",
> +        },
> +        {
> +                "unpriv: cmp pointer with pointer",
> +                .insns = {
> +                        BPF_JMP_REG(BPF_JEQ, BPF_REG_1, BPF_REG_10, 0),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .result = ACCEPT,
> +                .result_unpriv = REJECT,
> +                .errstr_unpriv = "R10 pointer comparison",
> +        },
> +        {
> +                "unpriv: pass pointer to printk",
> +                .insns = {
> +                        BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> +                        BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
> +                        BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8),
> +                        BPF_MOV64_IMM(BPF_REG_2, 8),
> +                        BPF_MOV64_REG(BPF_REG_3, BPF_REG_1),
> +                        BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_trace_printk),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .errstr_unpriv = "R3 leaks addr",
> +                .result_unpriv = REJECT,
> +                .result = ACCEPT,
> +        },
> +        {
> +                "unpriv: pass pointer to helper function",
> +                .insns = {
> +                        BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> +                        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> +                        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +                        BPF_LD_MAP_FD(BPF_REG_1, 0),
> +                        BPF_MOV64_REG(BPF_REG_3, BPF_REG_2),
> +                        BPF_MOV64_REG(BPF_REG_4, BPF_REG_2),
> +                        BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .fixup = {3},
> +                .errstr_unpriv = "R4 leaks addr",
> +                .result_unpriv = REJECT,
> +                .result = ACCEPT,
> +        },
> +        {
> +                "unpriv: indirectly pass pointer on stack to helper function",
> +                .insns = {
> +                        BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_10, -8),
> +                        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> +                        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +                        BPF_LD_MAP_FD(BPF_REG_1, 0),
> +                        BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .fixup = {3},
> +                .errstr = "invalid indirect read from stack off -8+0 size 8",
> +                .result = REJECT,
> +        },
> +        {
> +                "unpriv: mangle pointer on stack 1",
> +                .insns = {
> +                        BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_10, -8),
> +                        BPF_ST_MEM(BPF_W, BPF_REG_10, -8, 0),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .errstr_unpriv = "attempt to corrupt spilled",
> +                .result_unpriv = REJECT,
> +                .result = ACCEPT,
> +        },
> +        {
> +                "unpriv: mangle pointer on stack 2",
> +                .insns = {
> +                        BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_10, -8),
> +                        BPF_ST_MEM(BPF_B, BPF_REG_10, -1, 0),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .errstr_unpriv = "attempt to corrupt spilled",
> +                .result_unpriv = REJECT,
> +                .result = ACCEPT,
> +        },
> +        {
> +                "unpriv: read pointer from stack in small chunks",
> +                .insns = {
> +                        BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_10, -8),
> +                        BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_10, -8),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .errstr = "invalid size",
> +                .result = REJECT,
> +        },
> +        {
> +                "unpriv: write pointer into ctx",
> +                .insns = {
> +                        BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .errstr_unpriv = "R1 leaks addr",
> +                .result_unpriv = REJECT,
> +                .errstr = "invalid bpf_context access",
> +                .result = REJECT,
> +        },
> +        {
> +                "unpriv: write pointer into map elem value",
> +                .insns = {
> +                        BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> +                        BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> +                        BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> +                        BPF_LD_MAP_FD(BPF_REG_1, 0),
> +                        BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> +                        BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
> +                        BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .fixup = {3},
> +                .errstr_unpriv = "R0 leaks addr",
> +                .result_unpriv = REJECT,
> +                .result = ACCEPT,
> +        },
> +        {
> +                "unpriv: partial copy of pointer",
> +                .insns = {
> +                        BPF_MOV32_REG(BPF_REG_1, BPF_REG_10),
> +                        BPF_MOV64_IMM(BPF_REG_0, 0),
> +                        BPF_EXIT_INSN(),
> +                },
> +                .errstr_unpriv = "R10 partial copy",
> +                .result_unpriv = REJECT,
> +                .result = ACCEPT,
> +        },
>  };
>
>  static int probe_filter_length(struct bpf_insn *fp)
> @@ -910,12 +1130,15 @@ static int create_map(void)
>  static int test(void)
>  {
>          int prog_fd, i, pass_cnt = 0, err_cnt = 0;
> +        bool unpriv = geteuid() != 0;
>
>          for (i = 0; i < ARRAY_SIZE(tests); i++) {
>                  struct bpf_insn *prog = tests[i].insns;
>                  int prog_type = tests[i].prog_type;
>                  int prog_len = probe_filter_length(prog);
>                  int *fixup = tests[i].fixup;
> +                int expected_result;
> +                const char *expected_errstr;
>                  int map_fd = -1;
>
>                  if (*fixup) {
> @@ -932,7 +1155,17 @@ static int test(void)
>                          prog, prog_len * sizeof(struct bpf_insn),
>                          "GPL", 0);
>
> -                if (tests[i].result == ACCEPT) {
> +                if (unpriv && tests[i].result_unpriv != UNDEF)
> +                        expected_result = tests[i].result_unpriv;
> +                else
> +                        expected_result = tests[i].result;
> +
> +                if (unpriv && tests[i].errstr_unpriv)
> +                        expected_errstr = tests[i].errstr_unpriv;
> +                else
> +                        expected_errstr = tests[i].errstr;
> +
> +                if (expected_result == ACCEPT) {
>                          if (prog_fd < 0) {
>                                  printf("FAIL\nfailed to load prog '%s'\n",
>                                         strerror(errno));
> @@ -947,7 +1180,7 @@ static int test(void)
>                                  err_cnt++;
>                                  goto fail;
>                          }
> -                        if (strstr(bpf_log_buf, tests[i].errstr) == 0) {
> +                        if (strstr(bpf_log_buf, expected_errstr) == 0) {
>                                  printf("FAIL\nunexpected error message: %s",
>                                         bpf_log_buf);
>                                  err_cnt++;
> --
> 1.7.9.5
On 10/5/15 2:00 PM, Kees Cook wrote:
> On Mon, Oct 5, 2015 at 1:48 PM, Alexei Starovoitov<ast@plumgrid.com> wrote:
>> In order to let unprivileged users load and execute eBPF programs
>> teach verifier to prevent pointer leaks.
>> Verifier will prevent
>> - any arithmetic on pointers
>>   (except R10+Imm which is used to compute stack addresses)
>> - comparison of pointers
>> - passing pointers to helper functions
>> - indirectly passing pointers in stack to helper functions
>> - returning pointer from bpf program
>> - storing pointers into ctx or maps
>
> Does the arithmetic restriction include using a pointer as an index to
> a maps-based tail call? I'm still worried about pointer-based
> side-effects.

the array maps that hold FDs (BPF_MAP_TYPE_PROG_ARRAY and
BPF_MAP_TYPE_PERF_EVENT_ARRAY) don't have lookup/update accessors
from the program side, so programs cannot see or manipulate
those pointers.
For the former only bpf_tail_call() is allowed that takes integer
index and jumps to it. And the latter map accessed with
bpf_perf_event_read() that also takes index only (this helper
is not available to socket filters anyway).
Also bpf_tail_call() can only jump to the program of the same type.
So I'm quite certain it's safe.

Yes, please ask questions and try to poke holes. Now it is time.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Oct 5, 2015 at 2:12 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On 10/5/15 2:00 PM, Kees Cook wrote:
>> On Mon, Oct 5, 2015 at 1:48 PM, Alexei Starovoitov<ast@plumgrid.com>
>> wrote:
>>> In order to let unprivileged users load and execute eBPF programs
>>> teach verifier to prevent pointer leaks.
>>> Verifier will prevent
>>> - any arithmetic on pointers
>>>   (except R10+Imm which is used to compute stack addresses)
>>> - comparison of pointers
>>> - passing pointers to helper functions
>>> - indirectly passing pointers in stack to helper functions
>>> - returning pointer from bpf program
>>> - storing pointers into ctx or maps
>>
>> Does the arithmetic restriction include using a pointer as an index to
>> a maps-based tail call? I'm still worried about pointer-based
>> side-effects.
>
> the array maps that hold FDs (BPF_MAP_TYPE_PROG_ARRAY and
> BPF_MAP_TYPE_PERF_EVENT_ARRAY) don't have lookup/update accessors
> from the program side, so programs cannot see or manipulate
> those pointers.
> For the former only bpf_tail_call() is allowed that takes integer
> index and jumps to it. And the latter map accessed with
> bpf_perf_event_read() that also takes index only (this helper
> is not available to socket filters anyway).
> Also bpf_tail_call() can only jump to the program of the same type.
> So I'm quite certain it's safe.

At some point there will be an unprivileged way to create a map,
though, and we don't want to let pointers get poked into the map. Or
am I misunderstanding?

--Andy
On 10/5/15 2:16 PM, Andy Lutomirski wrote:
> On Mon, Oct 5, 2015 at 2:12 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>> On 10/5/15 2:00 PM, Kees Cook wrote:
>>> On Mon, Oct 5, 2015 at 1:48 PM, Alexei Starovoitov<ast@plumgrid.com>
>>> wrote:
>>>> In order to let unprivileged users load and execute eBPF programs
>>>> teach verifier to prevent pointer leaks.
>>>> Verifier will prevent
>>>> - any arithmetic on pointers
>>>>   (except R10+Imm which is used to compute stack addresses)
>>>> - comparison of pointers
>>>> - passing pointers to helper functions
>>>> - indirectly passing pointers in stack to helper functions
>>>> - returning pointer from bpf program
>>>> - storing pointers into ctx or maps
>>>
>>> Does the arithmetic restriction include using a pointer as an index to
>>> a maps-based tail call? I'm still worried about pointer-based
>>> side-effects.
>>
>> the array maps that hold FDs (BPF_MAP_TYPE_PROG_ARRAY and
>> BPF_MAP_TYPE_PERF_EVENT_ARRAY) don't have lookup/update accessors
>> from the program side, so programs cannot see or manipulate
>> those pointers.
>> For the former only bpf_tail_call() is allowed that takes integer
>> index and jumps to it. And the latter map accessed with
>> bpf_perf_event_read() that also takes index only (this helper
>> is not available to socket filters anyway).
>> Also bpf_tail_call() can only jump to the program of the same type.
>> So I'm quite certain it's safe.
>
> At some point there will be an unprivileged way to create a map,
> though, and we don't want to let pointers get poked into the map.

yes. exactly. With these two patches non-root can create a map
against memlock user limit and have a program store bytes into it
(like data from network packet), but it cannot store pointers into it.
That's covered by test "unpriv: write pointer into map elem value"

I've added new tests for all cases that can 'leak pointer':
  unpriv: return pointer OK
  unpriv: add const to pointer OK
  unpriv: add pointer to pointer OK
  unpriv: neg pointer OK
  unpriv: cmp pointer with const OK
  unpriv: cmp pointer with pointer OK
  unpriv: pass pointer to printk OK
  unpriv: pass pointer to helper function OK
  unpriv: indirectly pass pointer on stack to helper function OK
  unpriv: mangle pointer on stack 1 OK
  unpriv: mangle pointer on stack 2 OK
  unpriv: read pointer from stack in small chunks OK
  unpriv: write pointer into ctx OK
  unpriv: write pointer into map elem value OK
  unpriv: partial copy of pointer OK

the most interesting one is 'indirectly pass pointer'. It checks
the case where user stores a pointer into a stack and then uses
that stack region either as a key for lookup or as part of
format string for printk.
On Mon, Oct 5, 2015 at 2:12 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On 10/5/15 2:00 PM, Kees Cook wrote:
>> On Mon, Oct 5, 2015 at 1:48 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
>>> In order to let unprivileged users load and execute eBPF programs
>>> teach verifier to prevent pointer leaks.
>>> Verifier will prevent
>>> - any arithmetic on pointers
>>>   (except R10+Imm which is used to compute stack addresses)
>>> - comparison of pointers
>>> - passing pointers to helper functions
>>> - indirectly passing pointers in stack to helper functions
>>> - returning pointer from bpf program
>>> - storing pointers into ctx or maps
>>
>> Does the arithmetic restriction include using a pointer as an index to
>> a maps-based tail call? I'm still worried about pointer-based
>> side-effects.
>
> the array maps that hold FDs (BPF_MAP_TYPE_PROG_ARRAY and
> BPF_MAP_TYPE_PERF_EVENT_ARRAY) don't have lookup/update accessors
> from the program side, so programs cannot see or manipulate
> those pointers.
> For the former only bpf_tail_call() is allowed that takes integer
> index and jumps to it. And the latter map accessed with

Okay, so I can't take a pointer, put it on the stack, take back any
part of it as an integer and use it for a tail call? Sounds like this
is shaping up nicely! Thanks for adding all these checks.

-Kees

> bpf_perf_event_read() that also takes index only (this helper
> is not available to socket filters anyway).
> Also bpf_tail_call() can only jump to the program of the same type.
> So I'm quite certain it's safe.
>
> Yes, please ask questions and try to poke holes. Now it is time.
On 10/05/2015 10:48 PM, Alexei Starovoitov wrote:
> In order to let unprivileged users load and execute eBPF programs
> teach verifier to prevent pointer leaks.
> Verifier will prevent
> - any arithmetic on pointers
>   (except R10+Imm which is used to compute stack addresses)
> - comparison of pointers
> - passing pointers to helper functions
> - indirectly passing pointers in stack to helper functions
> - returning pointer from bpf program
> - storing pointers into ctx or maps
[...]
> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>

One scenario that comes to mind ... what happens when there are kernel
pointers stored in skb->cb[] (either from the current layer or an old
one from a different layer that the skb went through previously, but
which did not get overwritten)?

Socket filters could read a portion of skb->cb[] also when unprived and
leak that out through maps. I think the verifier doesn't catch that,
right?

Thanks,
Daniel
On 10/5/15 3:02 PM, Kees Cook wrote:
>> the array maps that hold FDs (BPF_MAP_TYPE_PROG_ARRAY and
>> BPF_MAP_TYPE_PERF_EVENT_ARRAY) don't have lookup/update accessors
>> from the program side, so programs cannot see or manipulate
>> those pointers.
>> For the former only bpf_tail_call() is allowed that takes integer
>> index and jumps to it. And the latter map accessed with
>
> Okay, so I can't take a pointer, put it on the stack, take back any
> part of it as an integer and use it for a tail call?

not quite. you can store a pointer to stack and read it back as an
8-byte load into another register, but reading <8 bytes of it will be
rejected. That's the test:
  unpriv: read pointer from stack in small chunks
we obviously want to avoid hiding pointers in integers.
After reading it back from the stack as a pointer you cannot use this
register to pass as an index into bpf_tail_call(). That's the test:
  unpriv: pass pointer to helper function

please keep shooting everything that comes to mind.
On 10/5/15 3:14 PM, Daniel Borkmann wrote:
> One scenario that comes to mind ... what happens when there are kernel
> pointers stored in skb->cb[] (either from the current layer or an old
> one from a different layer that the skb went through previously, but
> which did not get overwritten)?
>
> Socket filters could read a portion of skb->cb[] also when unprived and
> leak that out through maps. I think the verifier doesn't catch that,
> right?

grrr. indeed. previous layer before sk_filter() can leave junk in there.
Would need to disable cb[0-5] for unpriv, but that will make tail_call
much harder to use, since cb[0-5] is a way to pass arguments from one
prog to another and clearing them is not an option, since it's too
expensive. Like samples/bpf/sockex3_kern.c usage of cb[0] won't work
anymore. I guess that's the price of unpriv.
Will fix this, add few tail_call specific tests and respin.

Please keep poking.
* Alexei Starovoitov <ast@plumgrid.com> wrote:

> On 10/5/15 3:14 PM, Daniel Borkmann wrote:
> > One scenario that comes to mind ... what happens when there are kernel
> > pointers stored in skb->cb[] (either from the current layer or an old
> > one from a different layer that the skb went through previously, but
> > which did not get overwritten)?
> >
> > Socket filters could read a portion of skb->cb[] also when unprived and
> > leak that out through maps. I think the verifier doesn't catch that,
> > right?
>
> grrr. indeed. previous layer before sk_filter() can leave junk in there.

Could this be solved by activating zeroing/sanitizing of this data if
there's an active BPF function around that can access that socket?

Thanks,

	Ingo
On 10/06/2015 09:13 AM, Ingo Molnar wrote:
>
> * Alexei Starovoitov <ast@plumgrid.com> wrote:
>
>> On 10/5/15 3:14 PM, Daniel Borkmann wrote:
>>> One scenario that comes to mind ... what happens when there are kernel
>>> pointers stored in skb->cb[] (either from the current layer or an old
>>> one from a different layer that the skb went through previously, but
>>> which did not get overwritten)?
>>>
>>> Socket filters could read a portion of skb->cb[] also when unprived and
>>> leak that out through maps. I think the verifier doesn't catch that,
>>> right?
>>
>> grrr. indeed. previous layer before sk_filter() can leave junk in there.
>
> Could this be solved by activating zeroing/sanitizing of this data if
> there's an active BPF function around that can access that socket?

I think this check could only be done in sk_filter() for testing these
conditions (unprivileged user + access to cb area), so it would need to
happen from outside a native eBPF program. :/ Also classic BPF would
then need to test for it, since a socket filter doesn't really know
whether native eBPF is loaded there or a classic-to-eBPF transformed
one, and classic never makes use of this. Anyway, it could be done by
adding a bit flag cb_access:1 to the bpf_prog, set it during eBPF
verification phase, and test it inside sk_filter() if I see it
correctly.

The reason is that this sanitizing must only be done in the 'top-level'
program that is run from sk_filter() _directly_, because a user at any
time could decide to put an already loaded eBPF fd into a tail call
map. And cb[] is then used to pass args/state around between two
programs, thus it cannot be unconditionally cleared from within the
program. The association to a socket filter (SO_ATTACH_BPF) happens at
a later time after a native eBPF program has already been loaded via
bpf(2).
Thanks,
Daniel
* Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 10/06/2015 09:13 AM, Ingo Molnar wrote:
> > [...]
> > Could this be solved by activating zeroing/sanitizing of this data if
> > there's an active BPF function around that can access that socket?
>
> I think this check could only be done in sk_filter() for testing these
> conditions (unprivileged user + access to cb area), so it would need to
> happen from outside a native eBPF program. :/

Yes, the kernel (with code running outside of any eBPF program) would
guarantee that those data fields are zeroed/sanitized, if there's an
eBPF program that is attached to that socket.

> [...] Also classic BPF would then need to test for it, since a socket
> filter doesn't really know whether native eBPF is loaded there or a
> classic-to-eBPF transformed one, and classic never makes use of this.
> Anyway, it could be done by adding a bit flag cb_access:1 to the
> bpf_prog, set it during eBPF verification phase, and test it inside
> sk_filter() if I see it correctly.

That could also be done in an unlikely() branch, to keep the cost to
the non-eBPF case near zero.

> The reason is that this sanitizing must only be done in the 'top-level'
> program that is run from sk_filter() _directly_, because a user at any
> time could decide to put an already loaded eBPF fd into a tail call
> map. And cb[] is then used to pass args/state around between two
> programs, thus it cannot be unconditionally cleared from within the
> program. The association to a socket filter (SO_ATTACH_BPF) happens at
> a later time after a native eBPF program has already been loaded via
> bpf(2).

So zeroing tends to be very cheap and it could also be beneficial to
performance in terms of bringing the cacheline into the CPU cache. But
I really don't know the filter code so I'm just handwaving.

Thanks,

	Ingo
On 10/06/2015 10:20 AM, Ingo Molnar wrote:
>
> * Daniel Borkmann <daniel@iogearbox.net> wrote:
>
>> On 10/06/2015 09:13 AM, Ingo Molnar wrote:
>>> [...]
>>> Could this be solved by activating zeroing/sanitizing of this data if
>>> there's an active BPF function around that can access that socket?
>>
>> I think this check could only be done in sk_filter() for testing these
>> conditions (unprivileged user + access to cb area), so it would need to
>> happen from outside a native eBPF program. :/
>
> Yes, the kernel (with code running outside of any eBPF program) would
> guarantee that those data fields are zeroed/sanitized, if there's an
> eBPF program that is attached to that socket.
>
>> [...] Also classic BPF would then need to test for it, since a socket
>> filter doesn't really know whether native eBPF is loaded there or a
>> classic-to-eBPF transformed one, and classic never makes use of this.
>> Anyway, it could be done by adding a bit flag cb_access:1 to the
>> bpf_prog, set it during eBPF verification phase, and test it inside
>> sk_filter() if I see it correctly.
>
> That could also be done in an unlikely() branch, to keep the cost to
> the non-eBPF case near zero.

Yes, agreed. For the time being, the majority of users are coming from
the classic BPF side anyway and the unlikely() could still be changed
later on if it should not be the case anymore. The flag and bpf_func
would share the same cacheline as well.

>> The reason is that this sanitizing must only be done in the 'top-level'
>> program that is run from sk_filter() _directly_, because a user at any
>> time could decide to put an already loaded eBPF fd into a tail call
>> map. And cb[] is then used to pass args/state around between two
>> programs, thus it cannot be unconditionally cleared from within the
>> program. [...]
>
> So zeroing tends to be very cheap and it could also be beneficial to
> performance in terms of bringing the cacheline into the CPU cache. But
> I really don't know the filter code so I'm just handwaving.
On 10/06/2015 02:51 AM, Alexei Starovoitov wrote:
> On 10/5/15 3:14 PM, Daniel Borkmann wrote:
>> One scenario that comes to mind ... what happens when there are kernel
>> pointers stored in skb->cb[] (either from the current layer or an old
>> one from a different layer that the skb went through previously, but
>> which did not get overwritten)?
>>
>> Socket filters could read a portion of skb->cb[] also when unprived and
>> leak that out through maps. I think the verifier doesn't catch that,
>> right?
...
> Please keep poking.

;) I'm still wondering whether sysctl_bpf_enable_unprivileged is a good
way to go with regards to controlling capabilities of bpf(2), hmm, but
don't really have a good idea at the moment.

So, the rationale of this is to give it some soaking time before
flipping the switch that then defaults to on, and later on to still
have a possibility for an admin to turn it off (if not silently
overwritten by some system application later on ;)).

I think only having a Kconfig doesn't really make sense as distros will
blindly turn lots of stuff on anyway. A hidden Kconfig entry that is
not exposed into menuconfig might allow for sorting everything out
first, but with the issue of getting only little testing exposure.

If I see this correctly, perf_event_open(2) has a number of paranoia
levels with some helpers wrapped around it, f.e.:

  /*
   * perf event paranoia level:
   *  -1 - not paranoid at all
   *   0 - disallow raw tracepoint access for unpriv
   *   1 - disallow cpu events for unpriv
   *   2 - disallow kernel profiling for unpriv
   */
  int sysctl_perf_event_paranoid __read_mostly = 1;

Should instead something similar be adapted on bpf(2) as well? Or, would
that even be more painful for application developers shipping their
stuff through distros in the end (where they might then decide to just
setup everything BPF-related and then drop privs)?

I'm also wondering with regards to seccomp, which could adapt to eBPF
at some point and be used by unprivileged programs. Perhaps then, a
single paranoia alike setting might not suit to all eBPF subsystem
users. Any ideas?

Thanks,
Daniel
On 10/6/15 1:39 AM, Daniel Borkmann wrote:
>>> [...] Also classic BPF would then need to test for it, since a socket
>>> filter doesn't really know whether native eBPF is loaded there or a
>>> classic-to-eBPF transformed one, and classic never makes use of this.
>>> Anyway, it could be done by adding a bit flag cb_access:1 to the
>>> bpf_prog, set it during eBPF verification phase, and test it inside
>>> sk_filter() if I see it correctly.
>>
>> That could also be done in an unlikely() branch, to keep the cost to
>> the non-eBPF case near zero.
>
> Yes, agreed. For the time being, the majority of users are coming from
> the classic BPF side anyway and the unlikely() could still be changed
> later on if it should not be the case anymore. The flag and bpf_func
> would share the same cacheline as well.

was also thinking that we can do it only in paths that actually
have multiple protocol layers, since today bpf is mainly used with
tcpdump(raw_socket) and new af_packet fanout both have cb cleared
on RX, because it just came out of alloc_skb and no layers were called,
and on TX we can clear 20 bytes in dev_queue_xmit_nit().
af_unix/netlink also have clean skb. Need to analyze tun and sctp...
but it feels overly fragile to save a branch in sk_filter,
so planning to go with
if(unlikely(prog->cb_access)) memset in sk_filter().
On Tue, 2015-10-06 at 10:50 -0700, Alexei Starovoitov wrote:

> was also thinking that we can do it only in paths that actually
> have multiple protocol layers, since today bpf is mainly used with
> tcpdump(raw_socket) and new af_packet fanout both have cb cleared
> on RX, because it just came out of alloc_skb and no layers were called,
> and on TX we can clear 20 bytes in dev_queue_xmit_nit().
> af_unix/netlink also have clean skb. Need to analyze tun and sctp...
> but it feels overly fragile to save a branch in sk_filter,
> so planning to go with
> if(unlikely(prog->cb_access)) memset in sk_filter().

This will break TCP use of sk_filter().

skb->cb[] contains useful data in TCP layer.
On 10/06/2015 07:50 PM, Alexei Starovoitov wrote:
> On 10/6/15 1:39 AM, Daniel Borkmann wrote:
>> [...]
>> Yes, agreed. For the time being, the majority of users are coming from
>> the classic BPF side anyway and the unlikely() could still be changed
>> later on if it should not be the case anymore. The flag and bpf_func
>> would share the same cacheline as well.
>
> was also thinking that we can do it only in paths that actually
> have multiple protocol layers, since today bpf is mainly used with
> tcpdump(raw_socket) and new af_packet fanout both have cb cleared
> on RX, because it just came out of alloc_skb and no layers were called,
> and on TX we can clear 20 bytes in dev_queue_xmit_nit().
> af_unix/netlink also have clean skb. Need to analyze tun and sctp...
> but it feels overly fragile to save a branch in sk_filter,
> so planning to go with
> if(unlikely(prog->cb_access)) memset in sk_filter().

I was also thinking that for dev_queue_xmit_nit(), since we do the
skb_clone() there, to have a clone version (w/o affecting performance
of the current one) that instead of copying cb[] over, it would just do
a memset(). But that would just be limited to AF_PACKET, and doesn't
catch all sk_filter() users.

Thanks,
Daniel
On Tue, Oct 6, 2015 at 10:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, 2015-10-06 at 10:50 -0700, Alexei Starovoitov wrote:
>
>> was also thinking that we can do it only in paths that actually
>> have multiple protocol layers, since today bpf is mainly used with
>> tcpdump(raw_socket) and new af_packet fanout both have cb cleared
>> on RX, because it just came out of alloc_skb and no layers were called,
>> and on TX we can clear 20 bytes in dev_queue_xmit_nit().
>> af_unix/netlink also have clean skb. Need to analyze tun and sctp...
>> but it feels overly fragile to save a branch in sk_filter,
>> so planning to go with
>> if(unlikely(prog->cb_access)) memset in sk_filter().
>
> This will break TCP use of sk_filter().
> skb->cb[] contains useful data in TCP layer.

Since I don't know too much about the networking details:

1. Does "skb->cb" *ever* contain anything useful for an unprivileged user?

2. Does skb->cb form a stable ABI?

Unless both answers are solid yesses, then maybe the right solution is
to just deny access entirely to unprivileged users.

--Andy
On 10/6/15 10:56 AM, Eric Dumazet wrote:
> On Tue, 2015-10-06 at 10:50 -0700, Alexei Starovoitov wrote:
>
>> [...]
>> but it feels overly fragile to save a branch in sk_filter,
>> so planning to go with
>> if(unlikely(prog->cb_access)) memset in sk_filter().
>
> This will break TCP use of sk_filter().
> skb->cb[] contains useful data in TCP layer.

oops. thanks for catching. In case of sk_filter on top of tcp sock, it
shouldn't be looking at cb at all. I'm thinking to send a patch to get
rid of cb access for socket filters all together until better solution
found.
* Andy Lutomirski <luto@amacapital.net> wrote:

> On Tue, Oct 6, 2015 at 10:56 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Tue, 2015-10-06 at 10:50 -0700, Alexei Starovoitov wrote:
> > [...]
> > This will break TCP use of sk_filter().
> > skb->cb[] contains useful data in TCP layer.
>
> Since I don't know too much about the networking details:
>
> 1. Does "skb->cb" *ever* contain anything useful for an unprivileged user?
>
> 2. Does skb->cb form a stable ABI?
>
> Unless both answers are solid yesses, then maybe the right solution is
> to just deny access entirely to unprivileged users.

So this kind of instrumentation data is not an ABI in a similar fashion
as tracing information is not an ABI either. I.e. tracepoints can (and
sometimes do) change 'semantics' - in that the interpretation of the
implementational details behind that data changes as the implementation
changes.

That's not something that can ever be an ABI, just like the contents of
/proc/kcore or /proc/slabinfo can not be an ABI.

Thanks,

	Ingo
On 10/6/15 5:45 AM, Daniel Borkmann wrote:
> Should instead something similar be adapted on bpf(2) as well? Or, would
> that even be more painful for application developers shipping their stuff
> through distros in the end (where they might then decide to just setup
> everything BPF-related and then drop privs)?

I think loading as root and then dropping privs won't work in many
cases, since apps still need to access maps even after dropping privs
and today it's not possible, since cap_sys_admin is tested for every
bpf syscall.

> I'm also wondering with regards to seccomp, which could adapt to eBPF at
> some point and be used by unprivileged programs. Perhaps then, a single
> paranoia alike setting might not suit to all eBPF subsystem users. Any
> ideas?

There is no such paranoid sysctl for cBPF, so there is no reason to
add one for eBPF other than fear.
Adding multiple sysctl knobs for seccomp, socket, tracing is only
reflection of even higher fear.
What sysadmins suppose to do with such sysctl when kernel is kinda
saying 'may be something unsafe here you're on your own' ?
Also the presence of this sysctl_bpf_enable_unprivileged or any other
one doesn't help with CVEs. Any bug with security implications will
be a CVE regardless, so I think the better course of action is to
avoid introducing this sysctl.

We've discussed adding something like CAP_BPF to control it,
but then again, do we want this because of fear of bugs or because
it's actually needed. I think the design of all CAP_* is to give
unprivileged users permissions to do something beyond normal that
can potentially be harmful for other users or the whole system.
In this case it's not the case. One user can load eBPF programs
and maps up to its MEMLOCK limit and they cannot interfere with
other users or affect the host, so CAP_BPF is not necessary either.
Thoughts?
On 10/07/2015 11:20 PM, Alexei Starovoitov wrote:
> On 10/6/15 5:45 AM, Daniel Borkmann wrote:
>> Should instead something similar be adapted on bpf(2) as well? Or, would
>> that even be more painful for application developers shipping their stuff
>> through distros in the end (where they might then decide to just setup
>> everything BPF-related and then drop privs)?
>
> I think loading as root and then dropping privs won't work in many
> cases, since apps still need to access maps even after dropping privs
> and today it's not possible, since cap_sys_admin is tested for every
> bpf syscall.

Yep, maps-only would then need to be made accessible in some way.

>> I'm also wondering with regards to seccomp, which could adapt to eBPF at
>> some point and be used by unprivileged programs. Perhaps then, a single
>> paranoia alike setting might not suit to all eBPF subsystem users. Any
>> ideas?
>
> There is no such paranoid sysctl for cBPF, so there is no reason to
> add one for eBPF other than fear.
> Adding multiple sysctl knobs for seccomp, socket, tracing is only
> reflection of even higher fear.
> What sysadmins suppose to do with such sysctl when kernel is kinda
> saying 'may be something unsafe here you're on your own' ?
> Also the presence of this sysctl_bpf_enable_unprivileged or any other
> one doesn't help with CVEs. Any bug with security implications will
> be a CVE regardless, so I think the better course of action is to
> avoid introducing this sysctl.

Yes, I agree with you that there would be a CVE regardless. I still
like the option of configurable access, not a big fan of the sysctl
either. Thinking out loudly, what about a Kconfig option? We started
out like this on bpf(2) itself (initially under expert settings, now
afaik not anymore), and depending on usage scenarios, a requirement
could be to have immutable cap_sys_admin-only, for other use-cases a
requirement on the kernel might instead be to have unprivileged users
as well.

> We've discussed adding something like CAP_BPF to control it,
> but then again, do we want this because of fear of bugs or because
> it's actually needed. I think the design of all CAP_* is to give
> unprivileged users permissions to do something beyond normal that
> can potentially be harmful for other users or the whole system.
> In this case it's not the case. One user can load eBPF programs
> and maps up to its MEMLOCK limit and they cannot interfere with
> other users or affect the host, so CAP_BPF is not necessary either.

Thanks,
Daniel
On Wed, Oct 7, 2015 at 3:07 PM, Daniel Borkmann <daniel@iogearbox.net> wrote: > On 10/07/2015 11:20 PM, Alexei Starovoitov wrote: >> >> On 10/6/15 5:45 AM, Daniel Borkmann wrote: >>> >>> Should instead something similar be adapted on bpf(2) as well? Or, would >>> that even be more painful for application developers shipping their stuff >>> through distros in the end (where they might then decide to just setup >>> everything BPF-related and then drop privs)? >> >> >> I think loading as root and then dropping privs won't work in many >> cases, since apps still need to access maps even after dropping privs >> and today it's not possible, since cap_sys_admin is tested for every >> bpf syscall. > > > Yep, maps-only would then need to be made accessible in some way. > >>> I'm also wondering with regards to seccomp, which could adapt to eBPF at >>> some point and be used by unprivileged programs. Perhaps then, a single >>> paranoia alike setting might not suit to all eBPF subsystem users. Any >>> ideas? >> >> >> There is no such paranoid sysctl for cBPF, so there is no reason to >> add one for eBPF other than fear. >> Adding multiple sysctl knobs for seccomp, socket, tracing is only >> reflection of even higher fear. >> What sysadmins suppose to do with such sysctl when kernel is kinda >> saying 'may be something unsafe here you're on your own' ? >> Also the presence of this sysctl_bpf_enable_unprivileged or any other >> one doesn't help with CVEs. Any bug with security implications will >> be a CVE regardless, so I think the better course of action is to >> avoid introducing this sysctl. > > > Yes, I agree with you that there would be a CVE regardless. I still > like the option of configurable access, not a big fan of the sysctl > either. Thinking out loudly, what about a Kconfig option? 
> We started
> out like this on bpf(2) itself (initially under expert settings, now
> afaik not anymore), and depending on usage scenarios, a requirement
> could be to have immutable cap_sys_admin-only, for other use-cases a
> requirement on the kernel might instead be to have unprivileged users
> as well.

It'd be nice to have it just be a Kconfig, but this shoots
distro-users in the foot if a distro decides to include unpriv bpf and
the user doesn't want it. I think it's probably a good idea to keep
the sysctl.

-Kees

>> We've discussed adding something like CAP_BPF to control it,
>> but then again, do we want this because of fear of bugs or because
>> it's actually needed. I think the design of all CAP_* is to give
>> unprivileged users permissions to do something beyond normal that
>> can potentially be harmful for other users or the whole system.
>> In this case it's not the case. One user can load eBPF programs
>> and maps up to its MEMLOCK limit and they cannot interfere with
>> other users or affect the host, so CAP_BPF is not necessary either.
>
> Thanks,
> Daniel
On 10/7/15 3:22 PM, Kees Cook wrote:
>> Yes, I agree with you that there would be a CVE regardless. I still
>> like the option of configurable access, not a big fan of the sysctl
>> either. Thinking out loudly, what about a Kconfig option? We started
>> out like this on bpf(2) itself (initially under expert settings, now
>> afaik not anymore), and depending on usage scenarios, a requirement
>> could be to have immutable cap_sys_admin-only, for other use-cases a
>> requirement on the kernel might instead be to have unprivileged users
>> as well.
>
> It'd be nice to have it just be a Kconfig, but this shoots
> distro-users in the foot if a distro decides to include unpriv bpf and
> the user doesn't want it. I think it's probably a good idea to keep
> the sysctl.

I don't like introducing a Kconfig option for no clear reason. It only
adds to the testing matrix and makes it harder to hack around.
Paranoid distros can disable bpf via a single config already;
there is no reason to go more fine-grained here.
The unpriv checks add a minimal amount of code, so even for tinification
purposes there is no need to chop off a few bytes; tiny kernels would
disable bpf altogether.

As far as sysctl we can look at two with similar purpose:
sysctl_perf_event_paranoid and modules_disabled.
First one is indeed multi level, but not because of the fear of bugs,
but because of real security implications. Like raw events on
hyperthreaded cpu or uncore events can extract data from other
user processes. So it controls these extra privileges.
For bpf there are no hw implications to deal with.
If we make seccomp+bpf in the future it shouldn't need another knob
or extra bit. There are no extra privileges to grant, so none is needed.

modules_disabled is off by default and can be toggled on once.
I think for paranoid distro users that "don't want bpf" that is
the better model.
So I'm thinking to do sysctl_unprivileged_bpf_disabled that will be
0=off by default (meaning that users can load unpriv socket filter
programs, and seccomp ones in the future) and that can be switched
to 1=on once and stay that way until reboot.
I think that's the best balance: it avoids adding checks to all
apps that want to use bpf, and admins can still act on it.
From the app's point of view it's no different than if the bpf syscall
was not compiled in, so a single feature test for the bpf syscall will
be enough.
On 10/5/15 1:48 PM, Alexei Starovoitov wrote:
> Unprivileged socket filter bpf programs have access to the
> following helper functions:
> - map lookup/update/delete (but they cannot store kernel pointers into them)
> - get_random (it's already exposed to unprivileged user space)
> - get_smp_processor_id
> - tail_call into another socket filter program
> - ktime_get_ns
> - bpf_trace_printk (for debugging)

While reviewing everything for the Nth time I realized that
bpf_trace_printk is useless for unprivileged users, since trace_pipe
is root only. So I'm going to drop it in v2.
* Alexei Starovoitov <ast@plumgrid.com> wrote:

> As far as sysctl we can look at two with similar purpose:
> sysctl_perf_event_paranoid and modules_disabled.
> First one is indeed multi level, but not because of the fear of bugs,
> but because of real security implications.

It serves both purposes flexibly, and note that most people and distros
will use the default value.

> [...] Like raw events on hyperthreaded cpu or uncore events can extract data
> from other user processes. So it controls these extra privileges.

It also controls the generally increased risk caused by a larger attack
surface, which some users may not want to carry and which they can thus
shrink.

With a static keys approach there would be no runtime overhead worth
speaking of, so I see no reason why unprivileged eBPF couldn't have a
sysctl too - with the default value set to permissive.

Thanks,

	Ingo
On 10/7/15 11:21 PM, Ingo Molnar wrote:
> so I see no reason why unprivileged eBPF couldn't have a sysctl too - with the
> default value set to permissive.

Agreed. Sent out v2; it follows the 'modules_disabled' style.
On Wed, Oct 7, 2015 at 4:49 PM, Alexei Starovoitov <ast@plumgrid.com> wrote:
> On 10/7/15 3:22 PM, Kees Cook wrote:
>>> Yes, I agree with you that there would be a CVE regardless. I still
>>> like the option of configurable access, not a big fan of the sysctl
>>> either. Thinking out loudly, what about a Kconfig option? We started
>>> out like this on bpf(2) itself (initially under expert settings, now
>>> afaik not anymore), and depending on usage scenarios, a requirement
>>> could be to have immutable cap_sys_admin-only, for other use-cases a
>>> requirement on the kernel might instead be to have unprivileged users
>>> as well.
>>
>> It'd be nice to have it just be a Kconfig, but this shoots
>> distro-users in the foot if a distro decides to include unpriv bpf and
>> the user doesn't want it. I think it's probably a good idea to keep
>> the sysctl.
>
> I don't like introducing Kconfig for no clear reason. It only adds
> to the testing matrix and makes it harder to hack around.
> Paranoid distros can disable bpf via single config already,
> there is no reason to go more fine grained here.
> Unpriv checks add minimal amount of code, so even for tinification
> purpose there is no need to chop of few bytes. tiny kernels would
> disable bpf all together.
>
> As far as sysctl we can look at two with similar purpose:
> sysctl_perf_event_paranoid and modules_disabled.
> First one is indeed multi level, but not because of the fear of bugs,
> but because of real security implications. Like raw events on
> hyperthreaded cpu or uncore events can extract data from other
> user processes. So it controls these extra privileges.
> For bpf there are no hw implications to deal with.
> If we make seccomp+bpf in the future it shouldn't need another knob
> or extra bit. There are no extra privileges to grant, so not needed.
>
> modules_disabled is off by default and can be toggled on once.
> I think for paranoid distro users that "don't want bpf" that is
> the better model.
> So I'm thinking to do sysctl_unprivileged_bpf_disabled that will be
> 0=off by default (meaning that users can load unpriv socket filter
> programs and seccomp in the future) and that can be switched
> to 1=on once and stay that way until reboot.
> I think that's the best balance that avoids adding checks to all
> apps that want to use bpf and admins can still act on it.
> From app point of view it's no different than bpf syscall
> was not compiled in. So single feature test for bpf syscall will
> be enough.

I think this would be great. :)

-Kees
diff --git a/include/linux/bpf.h b/include/linux/bpf.h index f57d7fed9ec3..acf97d66b681 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -66,6 +66,7 @@ enum bpf_arg_type { ARG_PTR_TO_CTX, /* pointer to context */ ARG_ANYTHING, /* any (initialized) argument is ok */ + ARG_VARARG, /* optional argument */ }; /* type of values returned from helper functions */ @@ -168,6 +169,8 @@ void bpf_prog_put_rcu(struct bpf_prog *prog); struct bpf_map *bpf_map_get(struct fd f); void bpf_map_put(struct bpf_map *map); +extern int sysctl_bpf_enable_unprivileged; + /* verify correctness of eBPF program */ int bpf_check(struct bpf_prog **fp, union bpf_attr *attr); #else diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 5f35f420c12f..9a2098da2da9 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -18,6 +18,8 @@ #include <linux/filter.h> #include <linux/version.h> +int sysctl_bpf_enable_unprivileged __read_mostly; + static LIST_HEAD(bpf_map_types); static struct bpf_map *find_and_alloc_map(union bpf_attr *attr) @@ -542,6 +544,9 @@ static int bpf_prog_load(union bpf_attr *attr) attr->kern_version != LINUX_VERSION_CODE) return -EINVAL; + if (type != BPF_PROG_TYPE_SOCKET_FILTER && !capable(CAP_SYS_ADMIN)) + return -EPERM; + /* plain bpf_prog allocation */ prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER); if (!prog) @@ -597,11 +602,7 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz union bpf_attr attr = {}; int err; - /* the syscall is limited to root temporarily. This restriction will be - * lifted when security audit is clean. 
Note that eBPF+tracing must have - * this restriction, since it may pass kernel data to user space - */ - if (!capable(CAP_SYS_ADMIN)) + if (!capable(CAP_SYS_ADMIN) && !sysctl_bpf_enable_unprivileged) return -EPERM; if (!access_ok(VERIFY_READ, uattr, 1)) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index b074b23000d6..dccaeeeb1128 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -199,6 +199,7 @@ struct verifier_env { struct verifier_state_list **explored_states; /* search pruning optimization */ struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */ u32 used_map_cnt; /* number of used maps */ + bool allow_ptr_leaks; }; /* verbose verifier prints what it's seeing @@ -538,6 +539,21 @@ static int bpf_size_to_bytes(int bpf_size) return -EINVAL; } +static bool is_spillable_regtype(enum bpf_reg_type type) +{ + switch (type) { + case PTR_TO_MAP_VALUE: + case PTR_TO_MAP_VALUE_OR_NULL: + case PTR_TO_STACK: + case PTR_TO_CTX: + case FRAME_PTR: + case CONST_PTR_TO_MAP: + return true; + default: + return false; + } +} + /* check_stack_read/write functions track spill/fill of registers, * stack boundary and alignment are checked in check_mem_access() */ @@ -550,9 +566,7 @@ static int check_stack_write(struct verifier_state *state, int off, int size, */ if (value_regno >= 0 && - (state->regs[value_regno].type == PTR_TO_MAP_VALUE || - state->regs[value_regno].type == PTR_TO_STACK || - state->regs[value_regno].type == PTR_TO_CTX)) { + is_spillable_regtype(state->regs[value_regno].type)) { /* register containing pointer is being spilled into stack */ if (size != BPF_REG_SIZE) { @@ -643,6 +657,20 @@ static int check_ctx_access(struct verifier_env *env, int off, int size, return -EACCES; } +static bool is_pointer_value(struct verifier_env *env, int regno) +{ + if (env->allow_ptr_leaks) + return false; + + switch (env->cur_state.regs[regno].type) { + case UNKNOWN_VALUE: + case CONST_IMM: + return false; + default: + return 
true; + } +} + /* check whether memory at (regno + off) is accessible for t = (read | write) * if t==write, value_regno is a register which value is stored into memory * if t==read, value_regno is a register which will receive the value from memory @@ -669,11 +697,21 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off, } if (state->regs[regno].type == PTR_TO_MAP_VALUE) { + if (t == BPF_WRITE && value_regno >= 0 && + is_pointer_value(env, value_regno)) { + verbose("R%d leaks addr into map\n", value_regno); + return -EACCES; + } err = check_map_access(env, regno, off, size); if (!err && t == BPF_READ && value_regno >= 0) mark_reg_unknown_value(state->regs, value_regno); } else if (state->regs[regno].type == PTR_TO_CTX) { + if (t == BPF_WRITE && value_regno >= 0 && + is_pointer_value(env, value_regno)) { + verbose("R%d leaks addr into ctx\n", value_regno); + return -EACCES; + } err = check_ctx_access(env, off, size, t); if (!err && t == BPF_READ && value_regno >= 0) mark_reg_unknown_value(state->regs, value_regno); @@ -684,10 +722,17 @@ static int check_mem_access(struct verifier_env *env, u32 regno, int off, verbose("invalid stack off=%d size=%d\n", off, size); return -EACCES; } - if (t == BPF_WRITE) + if (t == BPF_WRITE) { + if (!env->allow_ptr_leaks && + state->stack_slot_type[MAX_BPF_STACK + off] == STACK_SPILL && + size != BPF_REG_SIZE) { + verbose("attempt to corrupt spilled pointer on stack\n"); + return -EACCES; + } err = check_stack_write(state, off, size, value_regno); - else + } else { err = check_stack_read(state, off, size, value_regno); + } } else { verbose("R%d invalid mem access '%s'\n", regno, reg_type_str[state->regs[regno].type]); @@ -770,13 +815,26 @@ static int check_func_arg(struct verifier_env *env, u32 regno, if (arg_type == ARG_DONTCARE) return 0; + if (arg_type == ARG_VARARG) { + if (is_pointer_value(env, regno)) { + verbose("R%d leaks addr into printk\n", regno); + return -EACCES; + } + return 0; + } + if (reg->type 
== NOT_INIT) { verbose("R%d !read_ok\n", regno); return -EACCES; } - if (arg_type == ARG_ANYTHING) + if (arg_type == ARG_ANYTHING) { + if (is_pointer_value(env, regno)) { + verbose("R%d leaks addr into helper function\n", regno); + return -EACCES; + } return 0; + } if (arg_type == ARG_PTR_TO_STACK || arg_type == ARG_PTR_TO_MAP_KEY || arg_type == ARG_PTR_TO_MAP_VALUE) { @@ -950,8 +1008,9 @@ static int check_call(struct verifier_env *env, int func_id) } /* check validity of 32-bit and 64-bit arithmetic operations */ -static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn) +static int check_alu_op(struct verifier_env *env, struct bpf_insn *insn) { + struct reg_state *regs = env->cur_state.regs; u8 opcode = BPF_OP(insn->code); int err; @@ -976,6 +1035,12 @@ static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn) if (err) return err; + if (is_pointer_value(env, insn->dst_reg)) { + verbose("R%d pointer arithmetic prohibited\n", + insn->dst_reg); + return -EACCES; + } + /* check dest operand */ err = check_reg_arg(regs, insn->dst_reg, DST_OP); if (err) @@ -1012,6 +1077,11 @@ static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn) */ regs[insn->dst_reg] = regs[insn->src_reg]; } else { + if (is_pointer_value(env, insn->src_reg)) { + verbose("R%d partial copy of pointer\n", + insn->src_reg); + return -EACCES; + } regs[insn->dst_reg].type = UNKNOWN_VALUE; regs[insn->dst_reg].map_ptr = NULL; } @@ -1061,8 +1131,18 @@ static int check_alu_op(struct reg_state *regs, struct bpf_insn *insn) /* pattern match 'bpf_add Rx, imm' instruction */ if (opcode == BPF_ADD && BPF_CLASS(insn->code) == BPF_ALU64 && regs[insn->dst_reg].type == FRAME_PTR && - BPF_SRC(insn->code) == BPF_K) + BPF_SRC(insn->code) == BPF_K) { stack_relative = true; + } else if (is_pointer_value(env, insn->dst_reg)) { + verbose("R%d pointer arithmetic prohibited\n", + insn->dst_reg); + return -EACCES; + } else if (BPF_SRC(insn->code) == BPF_X && + is_pointer_value(env, 
insn->src_reg)) { + verbose("R%d pointer arithmetic prohibited\n", + insn->src_reg); + return -EACCES; + } /* check dest operand */ err = check_reg_arg(regs, insn->dst_reg, DST_OP); @@ -1101,6 +1181,12 @@ static int check_cond_jmp_op(struct verifier_env *env, err = check_reg_arg(regs, insn->src_reg, SRC_OP); if (err) return err; + + if (is_pointer_value(env, insn->src_reg)) { + verbose("R%d pointer comparison prohibited\n", + insn->src_reg); + return -EACCES; + } } else { if (insn->src_reg != BPF_REG_0) { verbose("BPF_JMP uses reserved fields\n"); @@ -1155,6 +1241,9 @@ static int check_cond_jmp_op(struct verifier_env *env, regs[insn->dst_reg].type = CONST_IMM; regs[insn->dst_reg].imm = 0; } + } else if (is_pointer_value(env, insn->dst_reg)) { + verbose("R%d pointer comparison prohibited\n", insn->dst_reg); + return -EACCES; } else if (BPF_SRC(insn->code) == BPF_K && (opcode == BPF_JEQ || opcode == BPF_JNE)) { @@ -1658,7 +1747,7 @@ static int do_check(struct verifier_env *env) } if (class == BPF_ALU || class == BPF_ALU64) { - err = check_alu_op(regs, insn); + err = check_alu_op(env, insn); if (err) return err; @@ -1816,6 +1905,11 @@ static int do_check(struct verifier_env *env) if (err) return err; + if (is_pointer_value(env, BPF_REG_0)) { + verbose("R0 leaks addr as return value\n"); + return -EACCES; + } + process_bpf_exit: insn_idx = pop_stack(env, &prev_insn_idx); if (insn_idx < 0) { @@ -2144,6 +2238,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr) if (ret < 0) goto skip_full_check; + env->allow_ptr_leaks = capable(CAP_SYS_ADMIN); + ret = do_check(env); skip_full_check: diff --git a/kernel/sysctl.c b/kernel/sysctl.c index e69201d8094e..a281849278d9 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -64,6 +64,7 @@ #include <linux/binfmts.h> #include <linux/sched/sysctl.h> #include <linux/kexec.h> +#include <linux/bpf.h> #include <asm/uaccess.h> #include <asm/processor.h> @@ -1139,6 +1140,15 @@ static struct ctl_table kern_table[] = { 
.proc_handler = timer_migration_handler, }, #endif +#ifdef CONFIG_BPF_SYSCALL + { + .procname = "bpf_enable_unprivileged", + .data = &sysctl_bpf_enable_unprivileged, + .maxlen = sizeof(sysctl_bpf_enable_unprivileged), + .mode = 0644, + .proc_handler = proc_dointvec, + }, +#endif { } }; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 0fe96c7c8803..46bbed24d1f5 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -173,6 +173,9 @@ static const struct bpf_func_proto bpf_trace_printk_proto = { .ret_type = RET_INTEGER, .arg1_type = ARG_PTR_TO_STACK, .arg2_type = ARG_CONST_STACK_SIZE, + .arg3_type = ARG_VARARG, + .arg4_type = ARG_VARARG, + .arg5_type = ARG_VARARG, }; const struct bpf_func_proto *bpf_get_trace_printk_proto(void) diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h index 7235e292a03b..b7f63c70b4a2 100644 --- a/samples/bpf/libbpf.h +++ b/samples/bpf/libbpf.h @@ -64,6 +64,14 @@ extern char bpf_log_buf[LOG_BUF_SIZE]; .off = 0, \ .imm = 0 }) +#define BPF_MOV32_REG(DST, SRC) \ + ((struct bpf_insn) { \ + .code = BPF_ALU | BPF_MOV | BPF_X, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = 0 }) + /* Short form of mov, dst_reg = imm32 */ #define BPF_MOV64_IMM(DST, IMM) \ diff --git a/samples/bpf/test_verifier.c b/samples/bpf/test_verifier.c index ee0f110c9c54..266d014a885e 100644 --- a/samples/bpf/test_verifier.c +++ b/samples/bpf/test_verifier.c @@ -15,6 +15,7 @@ #include <string.h> #include <linux/filter.h> #include <stddef.h> +#include <stdbool.h> #include "libbpf.h" #define MAX_INSNS 512 @@ -25,10 +26,12 @@ struct bpf_test { struct bpf_insn insns[MAX_INSNS]; int fixup[32]; const char *errstr; + const char *errstr_unpriv; enum { + UNDEF, ACCEPT, REJECT - } result; + } result, result_unpriv; enum bpf_prog_type prog_type; }; @@ -96,6 +99,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .errstr = "invalid BPF_LD_IMM insn", + .errstr_unpriv = "R1 pointer comparison", .result = REJECT, }, { @@ 
-109,6 +113,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .errstr = "invalid BPF_LD_IMM insn", + .errstr_unpriv = "R1 pointer comparison", .result = REJECT, }, { @@ -219,6 +224,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .errstr = "R0 !read_ok", + .errstr_unpriv = "R1 pointer comparison", .result = REJECT, }, { @@ -294,7 +300,9 @@ static struct bpf_test tests[] = { BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), BPF_EXIT_INSN(), }, + .errstr_unpriv = "R0 leaks addr", .result = ACCEPT, + .result_unpriv = REJECT, }, { "check corrupted spill/fill", @@ -310,6 +318,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, + .errstr_unpriv = "attempt to corrupt spilled", .errstr = "corrupted spill", .result = REJECT, }, @@ -473,6 +482,7 @@ static struct bpf_test tests[] = { }, .fixup = {3}, .errstr = "R0 invalid mem access", + .errstr_unpriv = "R0 leaks addr", .result = REJECT, }, { @@ -495,6 +505,8 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, + .errstr_unpriv = "R1 pointer comparison", + .result_unpriv = REJECT, .result = ACCEPT, }, { @@ -521,6 +533,8 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, + .errstr_unpriv = "R1 pointer comparison", + .result_unpriv = REJECT, .result = ACCEPT, }, { @@ -555,6 +569,8 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .fixup = {24}, + .errstr_unpriv = "R1 pointer comparison", + .result_unpriv = REJECT, .result = ACCEPT, }, { @@ -603,6 +619,8 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, + .errstr_unpriv = "R1 pointer comparison", + .result_unpriv = REJECT, .result = ACCEPT, }, { @@ -642,6 +660,8 @@ static struct bpf_test tests[] = { BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }, + .errstr_unpriv = "R1 pointer comparison", + .result_unpriv = REJECT, .result = ACCEPT, }, { @@ -699,6 +719,7 @@ static struct bpf_test tests[] = { }, .fixup = {4}, .errstr = "different pointers", + 
.errstr_unpriv = "R1 pointer comparison", .result = REJECT, }, { @@ -720,6 +741,7 @@ static struct bpf_test tests[] = { }, .fixup = {6}, .errstr = "different pointers", + .errstr_unpriv = "R1 pointer comparison", .result = REJECT, }, { @@ -742,6 +764,7 @@ static struct bpf_test tests[] = { }, .fixup = {7}, .errstr = "different pointers", + .errstr_unpriv = "R1 pointer comparison", .result = REJECT, }, { @@ -752,6 +775,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .errstr = "invalid bpf_context access", + .errstr_unpriv = "R1 leaks addr", .result = REJECT, }, { @@ -762,6 +786,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .errstr = "invalid bpf_context access", + .errstr_unpriv = "R1 leaks addr", .result = REJECT, }, { @@ -772,6 +797,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .errstr = "invalid bpf_context access", + .errstr_unpriv = "R1 leaks addr", .result = REJECT, }, { @@ -782,6 +808,7 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .errstr = "invalid bpf_context access", + .errstr_unpriv = "", .result = REJECT, .prog_type = BPF_PROG_TYPE_SCHED_ACT, }, @@ -803,6 +830,8 @@ static struct bpf_test tests[] = { BPF_EXIT_INSN(), }, .result = ACCEPT, + .errstr_unpriv = "R1 leaks addr", + .result_unpriv = REJECT, }, { "write skb fields from tc_cls_act prog", @@ -819,6 +848,8 @@ static struct bpf_test tests[] = { offsetof(struct __sk_buff, cb[3])), BPF_EXIT_INSN(), }, + .errstr_unpriv = "", + .result_unpriv = REJECT, .result = ACCEPT, .prog_type = BPF_PROG_TYPE_SCHED_CLS, }, @@ -881,6 +912,195 @@ static struct bpf_test tests[] = { .result = REJECT, .errstr = "invalid stack off=0 size=8", }, + { + "unpriv: return pointer", + .insns = { + BPF_MOV64_REG(BPF_REG_0, BPF_REG_10), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .result_unpriv = REJECT, + .errstr_unpriv = "R0 leaks addr", + }, + { + "unpriv: add const to pointer", + .insns = { + BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 8), + BPF_MOV64_IMM(BPF_REG_0, 0), + 
BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .result_unpriv = REJECT, + .errstr_unpriv = "R1 pointer arithmetic", + }, + { + "unpriv: add pointer to pointer", + .insns = { + BPF_ALU64_REG(BPF_ADD, BPF_REG_1, BPF_REG_10), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .result_unpriv = REJECT, + .errstr_unpriv = "R1 pointer arithmetic", + }, + { + "unpriv: neg pointer", + .insns = { + BPF_ALU64_IMM(BPF_NEG, BPF_REG_1, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .result_unpriv = REJECT, + .errstr_unpriv = "R1 pointer arithmetic", + }, + { + "unpriv: cmp pointer with const", + .insns = { + BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .result_unpriv = REJECT, + .errstr_unpriv = "R1 pointer comparison", + }, + { + "unpriv: cmp pointer with pointer", + .insns = { + BPF_JMP_REG(BPF_JEQ, BPF_REG_1, BPF_REG_10, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .result_unpriv = REJECT, + .errstr_unpriv = "R10 pointer comparison", + }, + { + "unpriv: pass pointer to printk", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8), + BPF_MOV64_IMM(BPF_REG_2, 8), + BPF_MOV64_REG(BPF_REG_3, BPF_REG_1), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_trace_printk), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .errstr_unpriv = "R3 leaks addr", + .result_unpriv = REJECT, + .result = ACCEPT, + }, + { + "unpriv: pass pointer to helper function", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_MOV64_REG(BPF_REG_3, BPF_REG_2), + BPF_MOV64_REG(BPF_REG_4, BPF_REG_2), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_update_elem), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .fixup = {3}, + 
.errstr_unpriv = "R4 leaks addr", + .result_unpriv = REJECT, + .result = ACCEPT, + }, + { + "unpriv: indirectly pass pointer on stack to helper function", + .insns = { + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_10, -8), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .fixup = {3}, + .errstr = "invalid indirect read from stack off -8+0 size 8", + .result = REJECT, + }, + { + "unpriv: mangle pointer on stack 1", + .insns = { + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_10, -8), + BPF_ST_MEM(BPF_W, BPF_REG_10, -8, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .errstr_unpriv = "attempt to corrupt spilled", + .result_unpriv = REJECT, + .result = ACCEPT, + }, + { + "unpriv: mangle pointer on stack 2", + .insns = { + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_10, -8), + BPF_ST_MEM(BPF_B, BPF_REG_10, -1, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .errstr_unpriv = "attempt to corrupt spilled", + .result_unpriv = REJECT, + .result = ACCEPT, + }, + { + "unpriv: read pointer from stack in small chunks", + .insns = { + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_10, -8), + BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_10, -8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .errstr = "invalid size", + .result = REJECT, + }, + { + "unpriv: write pointer into ctx", + .insns = { + BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_1, 0), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .errstr_unpriv = "R1 leaks addr", + .result_unpriv = REJECT, + .errstr = "invalid bpf_context access", + .result = REJECT, + }, + { + "unpriv: write pointer into map elem value", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, 
BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), + BPF_STX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .fixup = {3}, + .errstr_unpriv = "R0 leaks addr", + .result_unpriv = REJECT, + .result = ACCEPT, + }, + { + "unpriv: partial copy of pointer", + .insns = { + BPF_MOV32_REG(BPF_REG_1, BPF_REG_10), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .errstr_unpriv = "R10 partial copy", + .result_unpriv = REJECT, + .result = ACCEPT, + }, }; static int probe_filter_length(struct bpf_insn *fp) @@ -910,12 +1130,15 @@ static int create_map(void) static int test(void) { int prog_fd, i, pass_cnt = 0, err_cnt = 0; + bool unpriv = geteuid() != 0; for (i = 0; i < ARRAY_SIZE(tests); i++) { struct bpf_insn *prog = tests[i].insns; int prog_type = tests[i].prog_type; int prog_len = probe_filter_length(prog); int *fixup = tests[i].fixup; + int expected_result; + const char *expected_errstr; int map_fd = -1; if (*fixup) { @@ -932,7 +1155,17 @@ static int test(void) prog, prog_len * sizeof(struct bpf_insn), "GPL", 0); - if (tests[i].result == ACCEPT) { + if (unpriv && tests[i].result_unpriv != UNDEF) + expected_result = tests[i].result_unpriv; + else + expected_result = tests[i].result; + + if (unpriv && tests[i].errstr_unpriv) + expected_errstr = tests[i].errstr_unpriv; + else + expected_errstr = tests[i].errstr; + + if (expected_result == ACCEPT) { if (prog_fd < 0) { printf("FAIL\nfailed to load prog '%s'\n", strerror(errno)); @@ -947,7 +1180,7 @@ static int test(void) err_cnt++; goto fail; } - if (strstr(bpf_log_buf, tests[i].errstr) == 0) { + if (strstr(bpf_log_buf, expected_errstr) == 0) { printf("FAIL\nunexpected error message: %s", bpf_log_buf); err_cnt++;
In order to let unprivileged users load and execute eBPF programs
teach verifier to prevent pointer leaks.
Verifier will prevent
- any arithmetic on pointers
  (except R10+Imm which is used to compute stack addresses)
- comparison of pointers
- passing pointers to helper functions
- indirectly passing pointers in stack to helper functions
- returning pointer from bpf program
- storing pointers into ctx or maps

Spill/fill of pointers into stack is allowed, but mangling
of pointers stored in the stack or reading them byte by byte is not.

Within bpf programs the pointers do exist, since programs need to
be able to access maps, pass skb pointer to LD_ABS insns, etc,
but programs cannot pass such pointer values to the outside
or obfuscate them.

Only allow BPF_PROG_TYPE_SOCKET_FILTER unprivileged programs,
so that socket filters (tcpdump), af_packet (quic acceleration)
and future kcm can use it.
tracing and tc cls/act program types still require root permissions,
since tracing actually needs to be able to see all kernel pointers
and tc is for root only.

For example, the following unprivileged socket filter program is allowed:

  int foo(struct __sk_buff *skb)
  {
          char fmt[] = "hello %d\n";
          bpf_trace_printk(fmt, sizeof(fmt), skb->len);
          return 0;
  }

but the following program is not:

  int foo(struct __sk_buff *skb)
  {
          char fmt[] = "hello %p\n";
          bpf_trace_printk(fmt, sizeof(fmt), fmt);
          return 0;
  }

since it would leak the kernel stack address via bpf_trace_printk().

Unprivileged socket filter bpf programs have access to the
following helper functions:
- map lookup/update/delete (but they cannot store kernel pointers into them)
- get_random (it's already exposed to unprivileged user space)
- get_smp_processor_id
- tail_call into another socket filter program
- ktime_get_ns
- bpf_trace_printk (for debugging)

The feature is controlled by sysctl kernel.bpf_enable_unprivileged
which is off by default.
New tests were added to test_verifier:
  unpriv: return pointer OK
  unpriv: add const to pointer OK
  unpriv: add pointer to pointer OK
  unpriv: neg pointer OK
  unpriv: cmp pointer with const OK
  unpriv: cmp pointer with pointer OK
  unpriv: pass pointer to printk OK
  unpriv: pass pointer to helper function OK
  unpriv: indirectly pass pointer on stack to helper function OK
  unpriv: mangle pointer on stack 1 OK
  unpriv: mangle pointer on stack 2 OK
  unpriv: read pointer from stack in small chunks OK
  unpriv: write pointer into ctx OK
  unpriv: write pointer into map elem value OK
  unpriv: partial copy of pointer OK

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
---
 include/linux/bpf.h         |   3 +
 kernel/bpf/syscall.c        |  11 +-
 kernel/bpf/verifier.c       | 114 +++++++++++++++++++--
 kernel/sysctl.c             |  10 ++
 kernel/trace/bpf_trace.c    |   3 +
 samples/bpf/libbpf.h        |   8 ++
 samples/bpf/test_verifier.c | 239 ++++++++++++++++++++++++++++++++++++++++++-
 7 files changed, 371 insertions(+), 17 deletions(-)