From patchwork Tue Mar 26 15:30:23 2019 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Dmitry Safonov X-Patchwork-Id: 1065725 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=none (mailfrom) smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=quarantine dis=none) header.from=arista.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=arista.com header.i=@arista.com header.b="inAjRHxP"; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 44TFT83QzQz9sSV for ; Wed, 27 Mar 2019 02:30:52 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732197AbfCZPat (ORCPT ); Tue, 26 Mar 2019 11:30:49 -0400 Received: from mail-ed1-f67.google.com ([209.85.208.67]:39884 "EHLO mail-ed1-f67.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1731840AbfCZPad (ORCPT ); Tue, 26 Mar 2019 11:30:33 -0400 Received: by mail-ed1-f67.google.com with SMTP id p20so10664334eds.6 for ; Tue, 26 Mar 2019 08:30:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=arista.com; s=googlenew; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=/RqmWo28JW2clwB4Gt/VepomkIt6hea2E0CtBSqiPnc=; b=inAjRHxPIQ+irCWZ7SEoSkYaS0rOJ9J/urBPRydWhX4FurEkB/I8sLRg/hBKO1z8XZ BNgf5ASqV1N+yByh9AUbM2el/zlDCYUTRXpLK9Jdi0qSmAYpBLxy4HYJk3N6JnwtWO4C WwSywzAWU/ocX1E3MA/vRi4vVH0QuTRjiYp0xoj19d72q28ERxXzKhES0xhQXmxW272z fnKXtRAAs6D9E7fjM4EV84yDRZEVXO97vXS2NlWcYBH2leNz1Gdf6lH13/nr7jeYg/3G fs2i9sWnw/ZBCNgbocrqy3DOzXy8RTv20LoG3qPk5kqdL0Riwa71d+JvH6NvDiKTqV7y qCAg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=/RqmWo28JW2clwB4Gt/VepomkIt6hea2E0CtBSqiPnc=; b=JLAKFoOHSsytLRptwOWwZbJ6/TOUIm77suQaNQdckH0TtNYb5uQfsjp8CyILPeiNnN urZQuDY0w+Bx0xf13yvlzSqBQDsmGZ36k8SajPCSeck+rsjUU7XVf4/1Nbk8FTfWn2qX albvSrfUfuUAkIblP3laEDSPCcUHxGbMKgKkarB/EhYsvAbQ19TK3uAXQEqV83rrQAIt OgBB7kMUUgGqmKG9NLnCTy7whUbrSPnFBq9dJzqO8UlOprlnIfgRTbcIYRFuLOwMLFrq MEnvpJ3j5ZNwFY7i80Q41na3n6RO+CNRAdR84LkKbqlo2zxYkGvnjuHL1CpDHnyPqm9L Y9DA== X-Gm-Message-State: APjAAAW563r2FpmRFS9Se48iy5kY5T2VwDnXDEgC9h1HDt/gBh86rbrN iVcweOhYW0e8jdcE+nfVhB8RWw== X-Google-Smtp-Source: APXvYqwqlWChpnwD65Z/LGYIe6DuGEUtTKTV7h6YnYZ8jdwZo/3YNzljZpWbLZRhdGR+LFD7QgWhcg== X-Received: by 2002:a50:be01:: with SMTP id a1mr17051112edi.22.1553614230686; Tue, 26 Mar 2019 08:30:30 -0700 (PDT) Received: from Mindolluin.ire.aristanetworks.com ([217.173.96.166]) by smtp.gmail.com with ESMTPSA id b2sm5310830eda.36.2019.03.26.08.30.29 (version=TLS1_3 cipher=AEAD-AES256-GCM-SHA384 bits=256/256); Tue, 26 Mar 2019 08:30:30 -0700 (PDT) From: Dmitry Safonov To: linux-kernel@vger.kernel.org Cc: Dmitry Safonov , Alexander Duyck , Alexey Kuznetsov , David Ahern , "David S. Miller" , Eric Dumazet , Hideaki YOSHIFUJI , Ido Schimmel , netdev@vger.kernel.org, linux-doc@vger.kernel.org, Jonathan Corbet Subject: [RFC 2/4] net/fib: Provide fib_balance_budget sysctl Date: Tue, 26 Mar 2019 15:30:23 +0000 Message-Id: <20190326153026.24493-3-dima@arista.com> X-Mailer: git-send-email 2.21.0 In-Reply-To: <20190326153026.24493-1-dima@arista.com> References: <20190326153026.24493-1-dima@arista.com> MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Unfortunately, MAX_WORK at this moment is broken: during trie_rebalance() climbing upwards resize() is being called for each tnode. Which makes the limit useless the bigger trie gets: at each level there are 10 attempts to balance tnode and childs, resulting in O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is the level where we changed the trie originally (adding/removing alias/route). Which results in the following price of removing one route under big trie (similar for some other single routes that results in reallocating tnodes by hitting the threshold limits for halve/inflate resize): Before: Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes. Main: Aver depth: 1.99 Max depth: 2 Leaves: 77825 Prefixes: 77825 Internal nodes: 77 9: 1 10: 76 Pointers: 78336 Null ptrs: 435 Total size: 7912 kB Local: [omitted] After: Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes. Main: Aver depth: 3.00 Max depth: 3 Leaves: 77824 Prefixes: 77824 Internal nodes: 20491 1: 2048 2: 18432 6: 1 11: 10 Pointers: 98368 Null ptrs: 54 Total size: 8865 kB Local: [omitted] Provide a sysctl to control amount of pending balancing work. (by default unlimited as it was) Cc: linux-doc@vger.kernel.org Cc: Jonathan Corbet Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down into inflate/halve") Signed-off-by: Dmitry Safonov --- Documentation/networking/ip-sysctl.txt | 6 +++ include/net/ip.h | 1 + net/ipv4/fib_trie.c | 60 +++++++++++++++----------- net/ipv4/sysctl_net_ipv4.c | 7 +++ 4 files changed, 49 insertions(+), 25 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index acdfb5d2bcaa..fb71dacff4dd 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER 0 - Layer 3 1 - Layer 4 +fib_balance_budget - UNSIGNED INTEGER + Limits the number of resize attempts during balancing fib trie + on adding/removing new routes. + Possible values: + Default: UINT_MAX (0xFFFFFFFF) + ip_forward_update_priority - INTEGER Whether to update SKB priority from "TOS" field in IPv4 header after it is forwarded. The new SKB priority is mapped from TOS field value diff --git a/include/net/ip.h b/include/net/ip.h index be3cad9c2e4c..305d0e43088b 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -421,6 +421,7 @@ static inline unsigned int ip_skb_dst_mtu(struct sock *sk, return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU); } +extern unsigned int fib_balance_budget; struct dst_metrics *ip_fib_metrics_init(struct net *net, struct nlattr *fc_mx, int fc_mx_len, struct netlink_ext_ack *extack); diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index ad7d56c421cb..d90cf9dfd443 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -182,8 +182,10 @@ struct trie { #endif }; -static struct key_vector *resize(struct trie *t, struct key_vector *tn); +static struct key_vector *resize(struct trie *t, struct key_vector *tn, + unsigned int *budget); static size_t tnode_free_size; +unsigned int fib_balance_budget = UINT_MAX; /* * synchronize_rcu after call_rcu for that many pages; it should be especially @@ -506,7 +508,8 @@ static void tnode_free(struct key_vector *tn) static struct key_vector *replace(struct trie *t, struct key_vector *oldtnode, - struct key_vector *tn) + struct key_vector *tn, + unsigned int *budget) { struct key_vector *tp = node_parent(oldtnode); unsigned long i; @@ -522,19 +525,19 @@ static struct key_vector *replace(struct trie *t, tnode_free(oldtnode); /* resize children now that oldtnode is freed */ - for (i = child_length(tn); i;) { + for (i = child_length(tn); i && *budget;) { struct key_vector *inode = get_child(tn, --i); /* resize child node */ if (tnode_full(tn, inode)) - tn = resize(t, inode); + tn = resize(t, inode, budget); } return tp; } -static struct key_vector *inflate(struct trie *t, - struct key_vector *oldtnode) +static struct key_vector *inflate(struct trie *t, struct key_vector *oldtnode, + unsigned int *budget) { struct key_vector *tn; unsigned long i; @@ -621,7 +624,7 @@ static struct key_vector *inflate(struct trie *t, } /* setup the parent pointers into and out of this node */ - return replace(t, oldtnode, tn); + return replace(t, oldtnode, tn, budget); nomem: /* all pointers should be clean so we are done */ tnode_free(tn); @@ -629,8 +632,8 @@ static struct key_vector *inflate(struct trie *t, return NULL; } -static struct key_vector *halve(struct trie *t, - struct key_vector *oldtnode) +static struct key_vector *halve(struct trie *t, struct key_vector *oldtnode, + unsigned int *budget) { struct key_vector *tn; unsigned long i; @@ -676,7 +679,7 @@ static struct key_vector *halve(struct trie *t, } /* setup the parent pointers into and out of this node */ - return replace(t, oldtnode, tn); + return replace(t, oldtnode, tn, budget); nomem: /* all pointers should be clean so we are done */ tnode_free(tn); @@ -843,15 +846,15 @@ static inline bool should_collapse(struct key_vector *tn) return used < 2; } -#define MAX_WORK 10 -static struct key_vector *resize(struct trie *t, struct key_vector *tn) +static struct key_vector *resize(struct trie *t, struct key_vector *tn, + unsigned int *budget) { #ifdef CONFIG_IP_FIB_TRIE_STATS struct trie_use_stats __percpu *stats = t->stats; #endif struct key_vector *tp = node_parent(tn); unsigned long cindex = get_index(tn->key, tp); - int max_work = MAX_WORK; + bool inflated = false; pr_debug("In tnode_resize %p inflate_threshold=%d threshold=%d\n", tn, inflate_threshold, halve_threshold); @@ -865,8 +868,8 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) /* Double as long as the resulting node has a number of * nonempty nodes that are above the threshold. */ - while (should_inflate(tp, tn) && max_work) { - tp = inflate(t, tn); + while (should_inflate(tp, tn) && *budget) { + tp = inflate(t, tn, budget); if (!tp) { #ifdef CONFIG_IP_FIB_TRIE_STATS this_cpu_inc(stats->resize_node_skipped); @@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) break; } - max_work--; + (*budget)--; + inflated = true; tn = get_child(tp, cindex); } /* update parent in case inflate failed */ tp = node_parent(tn); - /* Return if at least one inflate is run */ - if (max_work != MAX_WORK) + /* Return if at least one inflate is run: + * microoptimization to not recalculate thresholds + */ + if (inflated) return tp; /* Halve as long as the number of empty children in this * node is above threshold. */ - while (should_halve(tp, tn) && max_work) { - tp = halve(t, tn); + while (should_halve(tp, tn) && *budget) { + tp = halve(t, tn, budget); if (!tp) { #ifdef CONFIG_IP_FIB_TRIE_STATS this_cpu_inc(stats->resize_node_skipped); @@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) break; } - max_work--; + (*budget)--; tn = get_child(tp, cindex); } @@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen, static void trie_rebalance(struct trie *t, struct key_vector *tn) { - while (!IS_TRIE(tn)) - tn = resize(t, tn); + unsigned int budget = fib_balance_budget; + + while (budget && !IS_TRIE(tn)) + tn = resize(t, tn, &budget); } static int fib_insert_node(struct trie *t, struct key_vector *tp, @@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb) void fib_table_flush_external(struct fib_table *tb) { struct trie *t = (struct trie *)tb->tb_data; + unsigned int budget = fib_balance_budget; struct key_vector *pn = t->kv; unsigned long cindex = 1; struct hlist_node *tmp; @@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb) update_suffix(pn); /* resize completed node */ - pn = resize(t, pn); + pn = resize(t, pn, &budget); cindex = get_index(pkey, pn); continue; @@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb) int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) { struct trie *t = (struct trie *)tb->tb_data; + unsigned int budget = fib_balance_budget; struct key_vector *pn = t->kv; unsigned long cindex = 1; struct hlist_node *tmp; @@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) update_suffix(pn); /* resize completed node */ - pn = resize(t, pn); + pn = resize(t, pn, &budget); cindex = get_index(pkey, pn); continue; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index ba0fc4b18465..d7274cc442af 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -443,6 +443,13 @@ static struct ctl_table ipv4_table[] = { .mode = 0644, .proc_handler = proc_dointvec }, + { + .procname = "fib_balance_budget", + .data = &fib_balance_budget, + .maxlen = sizeof(fib_balance_budget), + .mode = 0644, + .proc_handler = proc_douintvec, + }, { .procname = "inet_peer_threshold", .data = &inet_peer_threshold,