Message ID: 20230807132123.14731-2-jacob.martin@canonical.com
State:      New
Series:     CVE-2021-4001
On 07.08.23 15:21, Jacob Martin wrote:
> From: Daniel Borkmann <daniel@iogearbox.net>
>
> Commit a23740ec43ba ("bpf: Track contents of read-only maps as scalars")
> checks whether maps are read-only both from the BPF program side and the
> user space side, and then, given their content is constant, reads out
> their data via map->ops->map_direct_value_addr(), which is subsequently
> used as a known scalar value for the register, that is, it is marked as
> __mark_reg_known() with the read value at verification time. Before
> a23740ec43ba, the register content was marked as an unknown scalar, so
> the verifier could not make any assumptions about the map content.
>
> The current implementation, however, is prone to a TOCTOU race: the
> value read as a known scalar for the register is not guaranteed to be
> exactly the same at a later point when the program is executed, and as
> such, the assumptions previously made by the verifier about the program
> become invalid, which can cause issues such as OOB access, etc.
>
> While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
> specified at map creation time, the map->frozen property is initially
> set to false for the map given the map value needs to be populated,
> e.g. for global data sections. Once complete, the loader "freezes" the
> map from user space such that no subsequent updates/deletes are
> possible anymore. For the rest of the lifetime of the map, this freeze
> one-time trigger cannot be undone anymore after a successful
> BPF_MAP_FREEZE cmd return. Meaning, any new BPF_* cmd calls which would
> update/delete map entries will be rejected with -EPERM since
> map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
> means that pending update/delete map entries must still complete before
> this guarantee is given. This corner case is not an issue for loaders
> since they create and prepare such program-private maps in successive
> steps.
>
> However, a malicious user is able to trigger this TOCTOU race in two
> different ways: i) via userfaultfd, and ii) via batched updates. For i)
> userfaultfd is used to widen the race window, so that map_update_elem()
> can modify the contents of the map after map_freeze() and
> bpf_prog_load() were executed. This works because userfaultfd halts the
> parallel thread which triggered a map_update_elem() at the point where
> key/value are copied from the user buffer; that update has already
> passed the FMODE_CAN_WRITE capability test, given the map was not yet
> "frozen" at that time. Then, the main thread performs the map_freeze()
> and bpf_prog_load(), and once that has completed successfully, the
> other thread is woken up to complete the pending map_update_elem()
> which then changes the map content. For ii) the idea of the batched
> update is similar: when there are a large number of updates to be
> processed, the race window between the two grows. It is therefore
> possible in practice to modify the contents of the map after executing
> map_freeze() and bpf_prog_load().
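For anyone who wants to see the shape of variant i), here is a minimal
sketch of the race window. This is not the actual reproducer; the
helper names, map sizes, and the reduced userfaultfd plumbing are
illustrative assumptions, and error handling is elided:

/* Thread B's BPF_MAP_UPDATE_ELEM passes the FMODE_CAN_WRITE check,
 * then blocks on a userfaultfd-armed value page; the freeze and
 * program load would happen inside that window. */
#include <fcntl.h>
#include <linux/bpf.h>
#include <linux/userfaultfd.h>
#include <pthread.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_bpf(int cmd, union bpf_attr *attr)
{
	return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

static int map_fd;

static void *updater(void *val_page)
{
	unsigned int key = 0;
	union bpf_attr attr = {};

	attr.map_fd = map_fd;
	attr.key = (unsigned long)&key;
	attr.value = (unsigned long)val_page;	/* uffd-armed page */
	/* Sleeps in the kernel's user copy until the fault is resolved,
	 * long after the write-permission check already passed. */
	sys_bpf(BPF_MAP_UPDATE_ELEM, &attr);
	return NULL;
}

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg = {};
	struct uffdio_copy cpy = {};
	union bpf_attr attr = {};
	struct uffd_msg msg;
	char *val, *src;
	pthread_t t;
	int uffd;

	attr.map_type = BPF_MAP_TYPE_ARRAY;
	attr.key_size = 4;
	attr.value_size = 4;
	attr.max_entries = 1;
	attr.map_flags = BPF_F_RDONLY_PROG;
	map_fd = sys_bpf(BPF_MAP_CREATE, &attr);

	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	ioctl(uffd, UFFDIO_API, &api);

	val = mmap(NULL, page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	reg.range.start = (unsigned long)val;
	reg.range.len = page;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	pthread_create(&t, NULL, updater, val);
	read(uffd, &msg, sizeof(msg));	/* updater now stuck mid-update */

	/* The race window: on unfixed kernels, BPF_MAP_FREEZE and
	 * BPF_PROG_LOAD succeed here; with this patch, the elevated
	 * writecnt makes BPF_MAP_FREEZE fail with -EBUSY instead. */

	src = mmap(NULL, page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(src, 0x42, page);
	cpy.dst = (unsigned long)val;
	cpy.src = (unsigned long)src;
	cpy.len = page;
	ioctl(uffd, UFFDIO_COPY, &cpy);	/* pending update now lands */

	pthread_join(t, NULL);
	return 0;
}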
> One way to fix both i) and ii) at the same time is to expand the use of
> the map's map->writecnt. The latter was introduced in fc9702273e2e
> ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY") and further refined
> in 1f6cb19be2e2 ("bpf: Prevent re-mmap()'ing BPF map as writable for
> initially r/o mapping") with the rationale to make a writable
> mmap()'ing of a map mutually exclusive with read-only freezing. The
> counter indicates writable mmap() mappings and then prevents/fails the
> freeze operation. Its semantics can be expanded beyond just mmap() by
> generally indicating ongoing write phases. This would essentially span
> any parallel regular and batched flavor of update/delete operation and
> then also have map_freeze() fail with -EBUSY. For the
> check_mem_access() in the verifier we expand upon the
> bpf_map_is_rdonly() check, ensuring that all last pending writes have
> completed via the bpf_map_write_active() test. Only once map->frozen is
> set and bpf_map_write_active() indicates a map->writecnt of 0 are we
> really guaranteed to use the map's data as known constants. If
> map->frozen is set while pending writes are still in the process of
> completing, we fall back to marking that register as an unknown scalar
> so we don't end up making assumptions about it. With this, both TOCTOU
> reproducers from i) and ii) are fixed.
>
> Note that map->writecnt has been converted into an atomic64 in the fix
> in order to avoid a double freeze_mutex mutex_{un,}lock() pair when
> updating map->writecnt in the various map update/delete BPF_* cmd
> flavors. Spanning the freeze_mutex over entire map update/delete
> operations on the syscall side would not be possible since it would
> cause everything to be serialized. Similarly, something like
> synchronize_rcu() after setting map->frozen to wait for update/deletes
> to complete is not possible either since it would also have to span the
> user copy, which can sleep. On the libbpf side, this won't break
> d66562fba1ce ("libbpf: Add BPF object skeleton support") as the
> anonymous mmap()-ed "map initialization image" is remapped as BPF
> map-backed mmap()-ed memory where for .rodata it's non-writable.
>
> Fixes: a23740ec43ba ("bpf: Track contents of read-only maps as scalars")
> Reported-by: w1tcher.bupt@gmail.com
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
>
> CVE-2021-4001
>
> (backported from commit 353050be4c19e102178ccc05988101887c25ae53)
> [jacobmartin: bpf map mmap and batch support are not present, so omit
>  fixes for them]
> Signed-off-by: Jacob Martin <jacob.martin@canonical.com>

Acked-by: Stefan Bader <stefan.bader@canonical.com>

> ---

Just as a personal note: I find things more readable if the new s-o-b
block is detached and follows a <reason>, <origin>, <s-o-b> order.

-Stefan
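One practical consequence for loaders: BPF_MAP_FREEZE can now also fail
with -EBUSY while a write is still in flight, so the populate/freeze/load
steps must be strictly ordered. A hedged sketch of that contract follows
(the freeze_rodata_map() helper is hypothetical, error handling reduced
to errno):

/* Loader-side ordering the patch assumes: all update_elem() calls
 * return first, then freeze, then prog load. */
#include <errno.h>
#include <linux/bpf.h>
#include <sys/syscall.h>
#include <unistd.h>

static int sys_bpf(int cmd, union bpf_attr *attr)
{
	return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

int freeze_rodata_map(int map_fd)
{
	union bpf_attr attr = {};

	attr.map_fd = map_fd;
	/* With this fix, a still-pending update/delete yields -EBUSY
	 * here instead of silently racing the verifier's constant
	 * tracking; callers may simply retry once writers are done. */
	if (sys_bpf(BPF_MAP_FREEZE, &attr))
		return -errno;
	return 0;
}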
>  include/linux/bpf.h   |  2 ++
>  kernel/bpf/syscall.c  | 25 +++++++++++++++++++++++++
>  kernel/bpf/verifier.c | 17 ++++++++++++++++-
>  3 files changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 5705cda3c4c4..e6ed8cbda837 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -103,6 +103,7 @@ struct bpf_map {
>  	atomic_t usercnt;
>  	struct work_struct work;
>  	char name[BPF_OBJ_NAME_LEN];
> +	atomic64_t writecnt;
>  };
>
>  static inline bool map_value_has_spin_lock(const struct bpf_map *map)
> @@ -663,6 +664,7 @@ void bpf_map_charge_move(struct bpf_map_memory *dst,
>  			 struct bpf_map_memory *src);
>  void *bpf_map_area_alloc(u64 size, int numa_node);
>  void bpf_map_area_free(void *base);
> +bool bpf_map_write_active(const struct bpf_map *map);
>  void bpf_map_init_from_attr(struct bpf_map *map, union bpf_attr *attr);
>
>  extern int sysctl_unprivileged_bpf_disabled;
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index de788761b708..56eee42da0d0 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -127,6 +127,21 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
>  	return map;
>  }
>
> +static void bpf_map_write_active_inc(struct bpf_map *map)
> +{
> +	atomic64_inc(&map->writecnt);
> +}
> +
> +static void bpf_map_write_active_dec(struct bpf_map *map)
> +{
> +	atomic64_dec(&map->writecnt);
> +}
> +
> +bool bpf_map_write_active(const struct bpf_map *map)
> +{
> +	return atomic64_read(&map->writecnt) != 0;
> +}
> +
>  void *bpf_map_area_alloc(u64 size, int numa_node)
>  {
>  	/* We really just want to fail instead of triggering OOM killer
> @@ -890,6 +905,7 @@ static int map_update_elem(union bpf_attr *attr)
>  	map = __bpf_map_get(f);
>  	if (IS_ERR(map))
>  		return PTR_ERR(map);
> +	bpf_map_write_active_inc(map);
>  	if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
>  		err = -EPERM;
>  		goto err_put;
> @@ -979,6 +995,7 @@ static int map_update_elem(union bpf_attr *attr)
>  free_key:
>  	kfree(key);
>  err_put:
> +	bpf_map_write_active_dec(map);
>  	fdput(f);
>  	return err;
>  }
> @@ -1001,6 +1018,7 @@ static int map_delete_elem(union bpf_attr *attr)
>  	map = __bpf_map_get(f);
>  	if (IS_ERR(map))
>  		return PTR_ERR(map);
> +	bpf_map_write_active_inc(map);
>  	if (!(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
>  		err = -EPERM;
>  		goto err_put;
> @@ -1028,6 +1046,7 @@ static int map_delete_elem(union bpf_attr *attr)
>  out:
>  	kfree(key);
>  err_put:
> +	bpf_map_write_active_dec(map);
>  	fdput(f);
>  	return err;
>  }
> @@ -1119,6 +1138,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
>  	map = __bpf_map_get(f);
>  	if (IS_ERR(map))
>  		return PTR_ERR(map);
> +	bpf_map_write_active_inc(map);
>  	if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ) ||
>  	    !(map_get_sys_perms(map, f) & FMODE_CAN_WRITE)) {
>  		err = -EPERM;
> @@ -1160,6 +1180,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
>  free_key:
>  	kfree(key);
>  err_put:
> +	bpf_map_write_active_dec(map);
>  	fdput(f);
>  	return err;
>  }
> @@ -1179,6 +1200,10 @@ static int map_freeze(const union bpf_attr *attr)
>  	map = __bpf_map_get(f);
>  	if (IS_ERR(map))
>  		return PTR_ERR(map);
> +	if (bpf_map_write_active(map)) {
> +		err = -EBUSY;
> +		goto err_put;
> +	}
>  	if (READ_ONCE(map->frozen)) {
>  		err = -EBUSY;
>  		goto err_put;
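The inc()/dec() bracketing above is the entire mechanism, and the
ordering matters: the counter is raised before the FMODE_CAN_WRITE check
so that a concurrent map_freeze() is guaranteed to observe a pending
writer. As a mental model only (plain C11 atomics, not kernel code), the
protocol reduces to the following toy:

/* Illustrative userspace analogue of the writecnt bracket. The kernel
 * additionally serializes freeze via freeze_mutex, omitted here. */
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_map {
	atomic_llong writecnt;	/* mirrors map->writecnt */
	atomic_bool frozen;	/* mirrors map->frozen (one-way) */
};

static int toy_update(struct toy_map *m)
{
	atomic_fetch_add(&m->writecnt, 1);	/* before the perm check */
	if (atomic_load(&m->frozen)) {		/* FMODE_CAN_WRITE analogue */
		atomic_fetch_sub(&m->writecnt, 1);
		return -EPERM;
	}
	/* ... copy key/value, possibly sleeping (the racy window) ... */
	atomic_fetch_sub(&m->writecnt, 1);
	return 0;
}

static int toy_freeze(struct toy_map *m)
{
	if (atomic_load(&m->writecnt) != 0)
		return -EBUSY;		/* pending writer: refuse */
	if (atomic_exchange(&m->frozen, true))
		return -EBUSY;		/* already frozen */
	return 0;
}

static bool toy_is_rdonly(struct toy_map *m)
{
	/* constant only once frozen *and* no writer is in flight */
	return atomic_load(&m->frozen) &&
	       atomic_load(&m->writecnt) == 0;
}

Freezing refuses while a writer is in flight, and the constant-tracking
read side additionally requires writecnt == 0, which is exactly what the
map_freeze() hunk above and the bpf_map_is_rdonly() hunk below add.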
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 530664693ac4..cda7907d08d9 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2773,7 +2773,22 @@ static void coerce_reg_to_size(struct bpf_reg_state *reg, int size)
>
>  static bool bpf_map_is_rdonly(const struct bpf_map *map)
>  {
> -	return (map->map_flags & BPF_F_RDONLY_PROG) && map->frozen;
> +	/* A map is considered read-only if the following conditions are true:
> +	 *
> +	 * 1) BPF program side cannot change any of the map content. The
> +	 *    BPF_F_RDONLY_PROG flag is fixed throughout the lifetime of a map
> +	 *    and was set at map creation time.
> +	 * 2) The map value(s) have been initialized from user space by a
> +	 *    loader and then "frozen", such that no new map update/delete
> +	 *    operations from syscall side are possible for the rest of
> +	 *    the map's lifetime from that point onwards.
> +	 * 3) Any parallel/pending map update/delete operations from syscall
> +	 *    side have been completed. Only after that point, it's safe to
> +	 *    assume that map value(s) are immutable.
> +	 */
> +	return (map->map_flags & BPF_F_RDONLY_PROG) &&
> +	       READ_ONCE(map->frozen) &&
> +	       !bpf_map_write_active(map);
>  }
>
>  static int bpf_map_direct_read(struct bpf_map *map, int off, int size, u64 *val)
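To make the verifier-side effect concrete: when a map satisfies all three
conditions, a load from .rodata is tracked as a known scalar and can
prune dead branches; with a write still pending, the same load is now
conservatively an unknown scalar. A hypothetical libbpf-style fragment
(illustrative only, not part of this patch):

/* "debug_on" is placed in the .rodata array map by libbpf; once the
 * map is frozen and no writes are pending, the verifier knows it is 0
 * and can eliminate the branch. If bpf_map_is_rdonly() says no, the
 * branch stays live instead of relying on a possibly stale constant. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

const volatile int debug_on = 0;

SEC("kprobe/do_sys_open")
int probe(void *ctx)
{
	if (debug_on)	/* known scalar 0 => dead code after freeze */
		bpf_printk("do_sys_open called\n");
	return 0;
}

char LICENSE[] SEC("license") = "GPL";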