From patchwork Wed Dec 14 15:43:35 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Daniel Mack X-Patchwork-Id: 705701 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 3tf18f5MRHz9svs for ; Thu, 15 Dec 2016 02:44:18 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932805AbcLNPoF (ORCPT ); Wed, 14 Dec 2016 10:44:05 -0500 Received: from svenfoo.org ([82.94.215.22]:48524 "EHLO mail.zonque.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932171AbcLNPoD (ORCPT ); Wed, 14 Dec 2016 10:44:03 -0500 Received: from localhost (localhost [127.0.0.1]) by mail.zonque.de (Postfix) with ESMTP id 79364C0346; Wed, 14 Dec 2016 16:44:01 +0100 (CET) Received: from mail.zonque.de ([127.0.0.1]) by localhost (rambrand.bugwerft.de [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 4TfF24KC3FOm; Wed, 14 Dec 2016 16:44:01 +0100 (CET) Received: from tao.fritz.box (pd95c9a6f.dip0.t-ipconnect.de [217.92.154.111]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.zonque.de (Postfix) with ESMTPSA id 07CE0C033E; Wed, 14 Dec 2016 16:44:00 +0100 (CET) From: Daniel Mack To: ast@fb.com Cc: dh.herrmann@gmail.com, daniel@iogearbox.net, netdev@vger.kernel.org, davem@davemloft.net, Daniel Mack Subject: [PATCH RFC 1/2] bpf: add a longest prefix match trie map implementation Date: Wed, 14 Dec 2016 16:43:35 +0100 Message-Id: <20161214154336.17639-2-daniel@zonque.org> X-Mailer: git-send-email 2.9.3 In-Reply-To: <20161214154336.17639-1-daniel@zonque.org> References: <20161214154336.17639-1-daniel@zonque.org> Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This trie implements a longest prefix match algorithm that can be used to match IP addresses to a stored set of ranges. Internally, data is stored in an unbalanced trie of nodes that has a maximum height of n, where n is the prefixlen the trie was created with. Tries may be created with prefix lengths that are multiples of 8, in the range from 8 to 2048. The key used for lookup and update operations is a struct bpf_lpm_trie_key, and the value is a uint64_t. The code carries more information about the internal implementation. Signed-off-by: Daniel Mack Reviewed-by: David Herrmann --- include/uapi/linux/bpf.h | 7 + kernel/bpf/Makefile | 2 +- kernel/bpf/lpm_trie.c | 491 +++++++++++++++++++++++++++++++++++++++++++++++ 3 files changed, 499 insertions(+), 1 deletion(-) create mode 100644 kernel/bpf/lpm_trie.c diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0eb0e87..d564277 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -63,6 +63,12 @@ struct bpf_insn { __s32 imm; /* signed immediate constant */ }; +/* Key of an a BPF_MAP_TYPE_LPM_TRIE entry */ +struct bpf_lpm_trie_key { + __u32 prefixlen; /* up to 32 for AF_INET, 128 for AF_INET6 */ + __u8 data[0]; /* Arbitrary size */ +}; + /* BPF syscall commands, see bpf(2) man-page for details. */ enum bpf_cmd { BPF_MAP_CREATE, @@ -89,6 +95,7 @@ enum bpf_map_type { BPF_MAP_TYPE_CGROUP_ARRAY, BPF_MAP_TYPE_LRU_HASH, BPF_MAP_TYPE_LRU_PERCPU_HASH, + BPF_MAP_TYPE_LPM_TRIE, }; enum bpf_prog_type { diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index 1276474..e1ce4f4 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -1,7 +1,7 @@ obj-y := core.o obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o -obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o +obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o ifeq ($(CONFIG_PERF_EVENTS),y) obj-$(CONFIG_BPF_SYSCALL) += stackmap.o endif diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c new file mode 100644 index 0000000..cae759d --- /dev/null +++ b/kernel/bpf/lpm_trie.c @@ -0,0 +1,491 @@ +/* + * Longest prefix match list implementation + * + * Copyright (c) 2016 Daniel Mack + * Copyright (c) 2016 David Herrmann + * + * This file is subject to the terms and conditions of version 2 of the GNU + * General Public License. See the file COPYING in the main directory of the + * Linux distribution for more details. + */ + +#include +#include +#include +#include +#include +#include + +/* Intermediate node */ +#define LPM_TREE_NODE_FLAG_IM BIT(0) + +struct lpm_trie_node; + +struct lpm_trie_node { + struct rcu_head rcu; + struct lpm_trie_node *child[2]; + u32 prefixlen; + u32 flags; + u64 value; + u8 data[0]; +}; + +struct lpm_trie { + struct bpf_map map; + struct lpm_trie_node *root; + size_t n_entries; + size_t max_prefixlen; + size_t data_size; + spinlock_t lock; +}; + +/* + * This trie implements a longest prefix match algorithm that can be used to + * match IP addresses to a stored set of ranges. + * + * Data stored in @data of struct bpf_lpm_key and struct lpm_trie_node is + * interpreted as big endian, so data[0] stores the most significant byte. + * + * Match ranges are internally stored in instances of struct lpm_trie_node + * which each contain their prefix length as well as two pointers that may + * lead to more nodes containing more specific matches. Each node also stores + * a value that is defined by and returned to userspace via the update_elem + * and lookup functions. + * + * For instance, let's start with a trie that was created with a prefix length + * of 32, so it can be used for IPv4 addresses, and one single element that + * matches 192.168.0.0/16. The data array would hence contain + * [0xc0, 0xa8, 0x00, 0x00] in big-endian notation. This documentation will + * stick to IP-address notation for readability though. + * + * As the trie is empty initially, the new node (1) will be places as root + * node, denoted as (R) in the example below. As there are no other node, both + * child pointers are %NULL. + * + * +----------------+ + * | (1) (R) | + * | 192.168.0.0/16 | + * | value: 1 | + * | [0] [1] | + * +----------------+ + * + * Next, let's add a new node (2) matching 192.168.0.0/24. As there is already + * a node with the same data and a smaller prefix (ie, a less specific one), + * node (2) will become a child of (1). In child index depends on the next bit + * that is outside of that (1) matches, and that bit is 0, so (2) will be + * child[0] of (1): + * + * +----------------+ + * | (1) (R) | + * | 192.168.0.0/16 | + * | value: 1 | + * | [0] [1] | + * +----------------+ + * | + * +----------------+ + * | (2) | + * | 192.168.0.0/24 | + * | value: 2 | + * | [0] [1] | + * +----------------+ + * + * The child[1] slot of (1) could be filled with another node which has bit #17 + * (the next bit after the ones that (1) matches on) set to 1. For instance, + * 192.168.128.0/24: + * + * +----------------+ + * | (1) (R) | + * | 192.168.0.0/16 | + * | value: 1 | + * | [0] [1] | + * +----------------+ + * | | + * +----------------+ +------------------+ + * | (2) | | (3) | + * | 192.168.0.0/24 | | 192.168.128.0/24 | + * | value: 2 | | value: 3 | + * | [0] [1] | | [0] [1] | + * +----------------+ +------------------+ + * + * Let's add another node (4) to the game for 192.168.1.0/24. In order to place + * it, node (1) is looked at first, and because (4) of the semantics laid out + * above (bit #17 is 0), it would normally be attached to (1) as child[0]. + * However, that slot is already allocated, so a new node is needed in between. + * That node is does not have a value attached to it and it will never be + * returned to users as result of a lookup. It is only there to differenciate + * the traversal further. It will get a prefix as wide as necessary to + * distinguish its two children: + * + * +----------------+ + * | (1) (R) | + * | 192.168.0.0/16 | + * | value: 1 | + * | [0] [1] | + * +----------------+ + * | | + * +----------------+ +------------------+ + * | (4) (I) | | (3) | + * | 192.168.0.0/23 | | 192.168.128.0/24 | + * | value: --- | | value: 3 | + * | [0] [1] | | [0] [1] | + * +----------------+ +------------------+ + * | | + * +----------------+ +----------------+ + * | (2) | | (5) | + * | 192.168.0.0/24 | | 192.168.1.0/24 | + * | value: 2 | | value: 5 | + * | [0] [1] | | [0] [1] | + * +----------------+ +----------------+ + * + * 192.168.1.1/32 would be a child of (5) etc. + * + * An intermediate node will be turned into a 'real' node on demand. In the + * example above, (4) would be re-used if 192.168.0.0/23 is added to the trie. + * + * A fully populated trie would have a height of 32 nodes, as the trie was + * created with a prefix length of 32. + * + * The lookup starts at the root node. If the current node matches and if there + * is a child that can be used to become more specific, the trie is traversed + * downwards. The last node in the traversal that is a non-intermediate one is + * returned. + */ + +static inline int extract_bit(const u8 *data, size_t index) +{ + return !!(data[index / 8] & (1 << (7 - (index % 8)))); +} + +/** + * longest_prefix_match() - determine the longest prefix + * @trie: The trie to get internal sizes from + * @node: The node to operate on + * @key: The key to compare to @node + * + * Determine the longest prefix of @node that matches the bits in @key. + */ +static size_t longest_prefix_match(const struct lpm_trie *trie, + const struct lpm_trie_node *node, + const struct bpf_lpm_trie_key *key) +{ + size_t prefixlen = 0; + int i; + + for (i = 0; i < trie->data_size; i++) { + size_t b; + + b = 8 - fls(node->data[i] ^ key->data[i]); + prefixlen += b; + + if (prefixlen >= node->prefixlen || prefixlen >= key->prefixlen) + return min(node->prefixlen, key->prefixlen); + + if (b < 8) + break; + } + + return prefixlen; +} + +/* Called from syscall or from eBPF program */ +static void *trie_lookup_elem(struct bpf_map *map, void *_key) +{ + struct lpm_trie_node *node, *found = NULL; + struct bpf_lpm_trie_key *key = _key; + struct lpm_trie *trie = + container_of(map, struct lpm_trie, map); + + /* Start walking the trie from the root node ... */ + + for (node = rcu_dereference(trie->root); node;) { + unsigned int next_bit; + size_t matchlen; + + /* + * Determine the longest prefix of @node that matches @key. + * If it's the maximum possible prefix for this trie, we have + * an exact match and can return it directly. + */ + matchlen = longest_prefix_match(trie, node, key); + if (matchlen == trie->max_prefixlen) + return &node->value; + + /* + * If the number of bits that match is smaller than the prefix + * length of @node, bail out and return the node we have seen + * last in the traversal (ie, the parent). + */ + if (matchlen < node->prefixlen) + break; + + /* + * Consider this node as return candidate unless it is an + * artificially added intermediate one + */ + if (!(node->flags & LPM_TREE_NODE_FLAG_IM)) + found = node; + + /* + * If the node match is fully satisfied, let's see if we can + * become more specific. Determine the next bit in the key and + * traverse down. + */ + next_bit = extract_bit(key->data, node->prefixlen); + node = rcu_dereference(node->child[next_bit]); + } + + return found ? &found->value : NULL; +} + +static struct lpm_trie_node *lpm_trie_node_alloc(size_t data_size) +{ + return kmalloc(sizeof(struct lpm_trie_node) + data_size, + GFP_ATOMIC | __GFP_NOWARN); +} + +/** + *_lpm_trie_find_target_node() - locate a spot to put a new node + * @trie: The trie to walk + * @key: The key to find a slot for + * @node_ret: Return variable for a node slot + * + * Find a slot to put a new node for @key, and return it in @node_ret. + * + * If the target location is an empty child of an existing node, or the + * root is unused, a pointer to that empty spot is returned in @node_ret + * and 0 is returned by the function. + * + * Otherwise, if a node is detected that conflicts with @key, that conflicting + * node is returned in @node_ret. The caller should then replace that node with + * an intermediate node. In this case, the longest prefix match between the + * existing node and @key is returned. + */ +static size_t find_target_node(struct lpm_trie *trie, + struct bpf_lpm_trie_key *key, + struct lpm_trie_node ***node_ret) +{ + struct lpm_trie_node **node = &trie->root; + size_t matchlen = 0; + + while (*node) { + unsigned int next_bit; + + matchlen = longest_prefix_match(trie, *node, key); + + if ((*node)->prefixlen != matchlen || + (*node)->prefixlen == key->prefixlen || + (*node)->prefixlen == trie->max_prefixlen) + break; + + next_bit = extract_bit(key->data, (*node)->prefixlen); + node = &(*node)->child[next_bit]; + } + + *node_ret = node; + + return *node ? matchlen : 0; +} + +/* Called from syscall or from eBPF program */ +static int trie_update_elem(struct bpf_map *map, + void *_key, void *value, u64 flags) +{ + struct lpm_trie *trie = container_of(map, struct lpm_trie, map); + struct lpm_trie_node **node, *im_node, *new_node = NULL; + struct bpf_lpm_trie_key *key = _key; + size_t matchlen; + int ret = 0; + + if (key->prefixlen > trie->max_prefixlen) + return -EINVAL; + + spin_lock(&trie->lock); + + /* Allocate and fill a new node */ + + if (trie->n_entries == trie->map.max_entries) { + ret = -ENOSPC; + goto out; + } + + new_node = lpm_trie_node_alloc(trie->data_size); + if (!new_node) { + ret = -ENOMEM; + goto out; + } + + trie->n_entries++; + new_node->value = *(u64 *) value; + new_node->prefixlen = key->prefixlen; + new_node->flags = 0; + new_node->child[0] = NULL; + new_node->child[1] = NULL; + memcpy(new_node->data, key->data, trie->data_size); + + /* + * Now find a place to attach the new node. find_target_node() + * either returned an empty slot (the root or an empty leaf), or the + * closest match, in which case an intermediate node has to be created + * and installed. + */ + matchlen = find_target_node(trie, key, &node); + if (!*node) { + rcu_assign_pointer(*node, new_node); + goto out; + } + + /* + * If the node we got back as target already exists, replace it + * new_node, which already has the correct data array and value set. + * If the node that is replaced is an intermediate one, turn it into a + * 'real' node. + */ + if ((*node)->prefixlen == matchlen) { + struct lpm_trie_node *tmp; + + new_node->child[0] = (*node)->child[0]; + new_node->child[1] = (*node)->child[1]; + + tmp = rcu_dereference(*node); + if (!(tmp->flags & LPM_TREE_NODE_FLAG_IM)) + trie->n_entries--; + + rcu_assign_pointer(*node, new_node); + kfree_rcu(tmp, rcu); + + goto out; + } + + /* + * If the new node matches the prefix completely, it must be an + * inserted as an ancestor. Simply insert it between @node and @*node. + */ + if (matchlen == key->prefixlen) { + new_node->child[extract_bit((*node)->data, matchlen)] = *node; + rcu_assign_pointer(*node, new_node); + goto out; + } + + /* Create an intermediate node and place it inbetween */ + im_node = lpm_trie_node_alloc(trie->data_size); + if (!im_node) { + ret = -ENOMEM; + goto out; + } + + im_node->prefixlen = matchlen; + im_node->flags |= LPM_TREE_NODE_FLAG_IM; + memcpy(im_node->data, (*node)->data, trie->data_size); + + /* Now determine which child to install in which slot */ + if (extract_bit(key->data, matchlen)) { + im_node->child[0] = *node; + im_node->child[1] = new_node; + } else { + im_node->child[0] = new_node; + im_node->child[1] = *node; + } + + /* Finally, assign the intermediate node to the determined spot */ + rcu_assign_pointer(*node, im_node); + +out: + if (ret) { + if (new_node) + trie->n_entries--; + + kfree(new_node); + kfree(im_node); + } + + spin_unlock(&trie->lock); + + return ret; +} + +static struct bpf_map *trie_alloc(union bpf_attr *attr) +{ + struct lpm_trie *trie; + + /* check sanity of attributes */ + if (attr->max_entries == 0 || attr->map_flags || + attr->key_size < sizeof(struct bpf_lpm_trie_key) + 1 || + attr->key_size > sizeof(struct bpf_lpm_trie_key) + 256 || + attr->value_size != sizeof(u64)) + return ERR_PTR(-EINVAL); + + trie = kzalloc(sizeof(*trie), GFP_USER | __GFP_NOWARN); + if (!trie) + return NULL; + + /* copy mandatory map attributes */ + trie->map.map_type = attr->map_type; + trie->map.key_size = attr->key_size; + trie->map.value_size = attr->value_size; + trie->map.max_entries = attr->max_entries; + trie->data_size = attr->key_size - + offsetof(struct bpf_lpm_trie_key, data); + trie->max_prefixlen = trie->data_size * 8; + + spin_lock_init(&trie->lock); + + return &trie->map; +} + +static void trie_free(struct bpf_map *map) +{ + struct lpm_trie_node **node; + struct lpm_trie *trie = + container_of(map, struct lpm_trie, map); + + spin_lock(&trie->lock); + + /* + * Always start at the root and walk down to a node that has no + * children. Then free that node, nullify its parent pointer and + * start over. + */ + + for (;;) { + node = &trie->root; + if (!*node) + break; + + for (;;) { + if ((*node)->child[0]) { + node = &(*node)->child[0]; + continue; + } + + if ((*node)->child[1]) { + node = &(*node)->child[1]; + continue; + } + + kfree(*node); + *node = NULL; + break; + } + } + + spin_unlock(&trie->lock); +} + +static const struct bpf_map_ops trie_ops = { + .map_alloc = trie_alloc, + .map_free = trie_free, + .map_lookup_elem = trie_lookup_elem, + .map_update_elem = trie_update_elem, +}; + +static struct bpf_map_type_list trie_type __read_mostly = { + .ops = &trie_ops, + .type = BPF_MAP_TYPE_LPM_TRIE, +}; + +static int __init register_trie_map(void) +{ + bpf_register_map_type(&trie_type); + return 0; +} +late_initcall(register_trie_map);