Message ID | 20190326153026.24493-3-dima@arista.com |
---|---|
State | RFC |
Delegated to: | David Miller |
Headers | show |
Series | net/fib: Speed up trie rebalancing for full view | expand |
On Tue, 2019-03-26 at 15:30 +0000, Dmitry Safonov wrote: > Unfortunately, MAX_WORK at this moment is broken: during > trie_rebalance() climbing upwards resize() is being called for each > tnode. Which makes the limit useless the bigger trie gets: at each level > there are 10 attempts to balance tnode and childs, resulting in > O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is > the level where we changed the trie originally (adding/removing > alias/route). > > Which results in the following price of removing one route under big > trie (similar for some other single routes that results in reallocating > tnodes by hitting the threshold limits for halve/inflate resize): The info below doesn't really tell us the price. It might be useful if you could just time the removal of that one route and provide that information. > Before: > Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes. > Main: > Aver depth: 1.99 > Max depth: 2 > Leaves: 77825 > Prefixes: 77825 > Internal nodes: 77 > 9: 1 10: 76 > Pointers: 78336 > Null ptrs: 435 > Total size: 7912 kB > Local: > [omitted] > > After: > Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes. > Main: > Aver depth: 3.00 > Max depth: 3 > Leaves: 77824 > Prefixes: 77824 > Internal nodes: 20491 > 1: 2048 2: 18432 6: 1 11: 10 > Pointers: 98368 > Null ptrs: 54 > Total size: 8865 kB > Local: > [omitted] > Is this before/after the patch, or just before/after the removal of the one route? I assume you are are just talking about the one route since the number of prefixes has dropped by 1. > Provide a sysctl to control amount of pending balancing work. > (by default unlimited as it was) > > Cc: linux-doc@vger.kernel.org > Cc: Jonathan Corbet <corbet@lwn.net> > Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down > into inflate/halve") > Signed-off-by: Dmitry Safonov <dima@arista.com> > --- > Documentation/networking/ip-sysctl.txt | 6 +++ > include/net/ip.h | 1 + > net/ipv4/fib_trie.c | 60 +++++++++++++++----------- > net/ipv4/sysctl_net_ipv4.c | 7 +++ > 4 files changed, 49 insertions(+), 25 deletions(-) > > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt > index acdfb5d2bcaa..fb71dacff4dd 100644 > --- a/Documentation/networking/ip-sysctl.txt > +++ b/Documentation/networking/ip-sysctl.txt > @@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER > 0 - Layer 3 > 1 - Layer 4 > > +fib_balance_budget - UNSIGNED INTEGER > + Limits the number of resize attempts during balancing fib trie > + on adding/removing new routes. > + Possible values: > + Default: UINT_MAX (0xFFFFFFFF) > + So I am not really a fan of the use of UNSIGNED_INTEGER here. I would much rather see this be a signed integer and the test be for the value going negative instead of having to test in so many places if the value is 0. > ip_forward_update_priority - INTEGER > Whether to update SKB priority from "TOS" field in IPv4 header after it > is forwarded. The new SKB priority is mapped from TOS field value > diff --git a/include/net/ip.h b/include/net/ip.h > index be3cad9c2e4c..305d0e43088b 100644 > --- a/include/net/ip.h > +++ b/include/net/ip.h > @@ -421,6 +421,7 @@ static inline unsigned int ip_skb_dst_mtu(struct sock *sk, > return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU); > } > > +extern unsigned int fib_balance_budget; > struct dst_metrics *ip_fib_metrics_init(struct net *net, struct nlattr *fc_mx, > int fc_mx_len, > struct netlink_ext_ack *extack); > diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c > index ad7d56c421cb..d90cf9dfd443 100644 > --- a/net/ipv4/fib_trie.c > +++ b/net/ipv4/fib_trie.c > @@ -182,8 +182,10 @@ struct trie { > #endif > }; > > -static struct key_vector *resize(struct trie *t, struct key_vector *tn); > +static struct key_vector *resize(struct trie *t, struct key_vector *tn, > + unsigned int *budget); > static size_t tnode_free_size; > +unsigned int fib_balance_budget = UINT_MAX; > > /* > * synchronize_rcu after call_rcu for that many pages; it should be especially > @@ -506,7 +508,8 @@ static void tnode_free(struct key_vector *tn) > > static struct key_vector *replace(struct trie *t, > struct key_vector *oldtnode, > - struct key_vector *tn) > + struct key_vector *tn, > + unsigned int *budget) > { > struct key_vector *tp = node_parent(oldtnode); > unsigned long i; > @@ -522,19 +525,19 @@ static struct key_vector *replace(struct trie *t, > tnode_free(oldtnode); > > /* resize children now that oldtnode is freed */ > - for (i = child_length(tn); i;) { > + for (i = child_length(tn); i && *budget;) { So this is an example of the kind of stuff I am talking about. Having a check for budget here isn't going to add much in the way of value. I would much rather keep the logic contained within resize, and just do a check for *budget >= 0. > struct key_vector *inode = get_child(tn, --i); > > /* resize child node */ > if (tnode_full(tn, inode)) > - tn = resize(t, inode); > + tn = resize(t, inode, budget); > } > > return tp; > } > > -static struct key_vector *inflate(struct trie *t, > - struct key_vector *oldtnode) > +static struct key_vector *inflate(struct trie *t, struct key_vector *oldtnode, > + unsigned int *budget) > { > struct key_vector *tn; > unsigned long i; > @@ -621,7 +624,7 @@ static struct key_vector *inflate(struct trie *t, > } > > /* setup the parent pointers into and out of this node */ > - return replace(t, oldtnode, tn); > + return replace(t, oldtnode, tn, budget); > nomem: > /* all pointers should be clean so we are done */ > tnode_free(tn); > @@ -629,8 +632,8 @@ static struct key_vector *inflate(struct trie *t, > return NULL; > } > > -static struct key_vector *halve(struct trie *t, > - struct key_vector *oldtnode) > +static struct key_vector *halve(struct trie *t, struct key_vector *oldtnode, > + unsigned int *budget) > { > struct key_vector *tn; > unsigned long i; > @@ -676,7 +679,7 @@ static struct key_vector *halve(struct trie *t, > } > > /* setup the parent pointers into and out of this node */ > - return replace(t, oldtnode, tn); > + return replace(t, oldtnode, tn, budget); > nomem: > /* all pointers should be clean so we are done */ > tnode_free(tn); > @@ -843,15 +846,15 @@ static inline bool should_collapse(struct key_vector *tn) > return used < 2; > } > > -#define MAX_WORK 10 > -static struct key_vector *resize(struct trie *t, struct key_vector *tn) > +static struct key_vector *resize(struct trie *t, struct key_vector *tn, > + unsigned int *budget) > { > #ifdef CONFIG_IP_FIB_TRIE_STATS > struct trie_use_stats __percpu *stats = t->stats; > #endif > struct key_vector *tp = node_parent(tn); > unsigned long cindex = get_index(tn->key, tp); > - int max_work = MAX_WORK; > + bool inflated = false; > > pr_debug("In tnode_resize %p inflate_threshold=%d threshold=%d\n", > tn, inflate_threshold, halve_threshold); > @@ -865,8 +868,8 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) > /* Double as long as the resulting node has a number of > * nonempty nodes that are above the threshold. > */ > - while (should_inflate(tp, tn) && max_work) { > - tp = inflate(t, tn); > + while (should_inflate(tp, tn) && *budget) { > + tp = inflate(t, tn, budget); > if (!tp) { > #ifdef CONFIG_IP_FIB_TRIE_STATS > this_cpu_inc(stats->resize_node_skipped); > @@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) > break; > } > > - max_work--; > + (*budget)--; One thing we might explore here would be to look at pulling out the check of budget at the start of the loop and instead combine the decrement with the check for < 0 here so we have something like: if (--*budget < 0) break; That way what should happen is that we will achieve maximum depth in any resize and make at least one pass optimizing from the bottom up instead of the top down. > + inflated = true; > tn = get_child(tp, cindex); > } > > /* update parent in case inflate failed */ > tp = node_parent(tn); > > - /* Return if at least one inflate is run */ > - if (max_work != MAX_WORK) > + /* Return if at least one inflate is run: > + * microoptimization to not recalculate thresholds > + */ > + if (inflated) > return tp; An alternative to having to add another variable would be to look at switching the loop to a combination of an if and a do/while like so: if (should_inflate(tp, tn)) { do { ... } while (should_inflate(tp, tn)); /* update parent in case inflate failed */ return node_parent(tn); } > /* Halve as long as the number of empty children in this > * node is above threshold. > */ > - while (should_halve(tp, tn) && max_work) { > - tp = halve(t, tn); > + while (should_halve(tp, tn) && *budget) { > + tp = halve(t, tn, budget); > if (!tp) { > #ifdef CONFIG_IP_FIB_TRIE_STATS > this_cpu_inc(stats->resize_node_skipped); > @@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) > break; > } > > - max_work--; > + (*budget)--; Same here. It might make sense to pull the budget check out of the start of the loop and move it here so that it becomes a depth first optimization instead of being a shallow one. > tn = get_child(tp, cindex); > } > > @@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen, > > static void trie_rebalance(struct trie *t, struct key_vector *tn) > { > - while (!IS_TRIE(tn)) > - tn = resize(t, tn); > + unsigned int budget = fib_balance_budget; > + > + while (budget && !IS_TRIE(tn)) > + tn = resize(t, tn, &budget); I would not have the budget check here. At a minimum we should be replacing trie nodes with leaves in the case that we are down to one leaf node in the trie. > } > > static int fib_insert_node(struct trie *t, struct key_vector *tp, > @@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb) > void fib_table_flush_external(struct fib_table *tb) > { > struct trie *t = (struct trie *)tb->tb_data; > + unsigned int budget = fib_balance_budget; > struct key_vector *pn = t->kv; > unsigned long cindex = 1; > struct hlist_node *tmp; > @@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb) > update_suffix(pn); > > /* resize completed node */ > - pn = resize(t, pn); > + pn = resize(t, pn, &budget); > cindex = get_index(pkey, pn); > > continue; So we would probably want a completely different budget here, or at least we would want the budget to be multiplied by the leaves we are removing. The problem is we aren't pulling single addresses, but are literally splitting the tree into two separate tries. The same budget isn't going to work for both cases. > @@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb) > int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > { > struct trie *t = (struct trie *)tb->tb_data; > + unsigned int budget = fib_balance_budget; > struct key_vector *pn = t->kv; > unsigned long cindex = 1; > struct hlist_node *tmp; > @@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) > update_suffix(pn); > > /* resize completed node */ > - pn = resize(t, pn); > + pn = resize(t, pn, &budget); > cindex = get_index(pkey, pn); > > continue; Similar issues here, though I don't know what the typical amount of change is we expect to see from a fib_table_flush call versus something like the fib_table_flush_external call. > diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c > index ba0fc4b18465..d7274cc442af 100644 > --- a/net/ipv4/sysctl_net_ipv4.c > +++ b/net/ipv4/sysctl_net_ipv4.c > @@ -443,6 +443,13 @@ static struct ctl_table ipv4_table[] = { > .mode = 0644, > .proc_handler = proc_dointvec > }, > + { > + .procname = "fib_balance_budget", > + .data = &fib_balance_budget, > + .maxlen = sizeof(fib_balance_budget), > + .mode = 0644, > + .proc_handler = proc_douintvec, > + }, > { > .procname = "inet_peer_threshold", > .data = &inet_peer_threshold,
On 4/1/19 7:09 PM, Alexander Duyck wrote: > On Tue, 2019-03-26 at 15:30 +0000, Dmitry Safonov wrote: >> Unfortunately, MAX_WORK at this moment is broken: during >> trie_rebalance() climbing upwards resize() is being called for each >> tnode. Which makes the limit useless the bigger trie gets: at each level >> there are 10 attempts to balance tnode and childs, resulting in >> O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is >> the level where we changed the trie originally (adding/removing >> alias/route). >> >> Which results in the following price of removing one route under big >> trie (similar for some other single routes that results in reallocating >> tnodes by hitting the threshold limits for halve/inflate resize): > > The info below doesn't really tell us the price. It might be useful if > you could just time the removal of that one route and provide that > information. Yes, well, that was an attempt to point that MAX_WORK in the current state doesn't limit much complexity to balance trie. So, most of the time (up to a couple of seconds) is spent on synchronise_rcu() (which was addressed by 4/4 and a patch from David Ahern in -next). But I thought fixing max_work is also worth attention, regardless the fix for synchronising rcu. And playing with the sysctl limit that is provided by the patch, I've found out that even with 4/4 applied one can tune this sysctl knob the way that results in dividing the work between consecutive route manipulations. So that on a 4-core switch the time spent to add a route in maximum was more than a second and adjusting work limit can bring it down to 300msec in max. Though, the total time spent to add the whole route table was the same. I will provide more details for v2. > >> Before: >> Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes. >> Main: >> Aver depth: 1.99 >> Max depth: 2 >> Leaves: 77825 >> Prefixes: 77825 >> Internal nodes: 77 >> 9: 1 10: 76 >> Pointers: 78336 >> Null ptrs: 435 >> Total size: 7912 kB >> Local: >> [omitted] >> >> After: >> Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes. >> Main: >> Aver depth: 3.00 >> Max depth: 3 >> Leaves: 77824 >> Prefixes: 77824 >> Internal nodes: 20491 >> 1: 2048 2: 18432 6: 1 11: 10 >> Pointers: 98368 >> Null ptrs: 54 >> Total size: 8865 kB >> Local: >> [omitted] >> > > Is this before/after the patch, or just before/after the removal of the > one route? I assume you are are just talking about the one route since > the number of prefixes has dropped by 1. Yes, just for removing one route. Should have mentioned that explicitly, sorry. > >> Provide a sysctl to control amount of pending balancing work. >> (by default unlimited as it was) >> >> Cc: linux-doc@vger.kernel.org >> Cc: Jonathan Corbet <corbet@lwn.net> >> Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down >> into inflate/halve") >> Signed-off-by: Dmitry Safonov <dima@arista.com> >> --- >> Documentation/networking/ip-sysctl.txt | 6 +++ >> include/net/ip.h | 1 + >> net/ipv4/fib_trie.c | 60 +++++++++++++++----------- >> net/ipv4/sysctl_net_ipv4.c | 7 +++ >> 4 files changed, 49 insertions(+), 25 deletions(-) >> >> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt >> index acdfb5d2bcaa..fb71dacff4dd 100644 >> --- a/Documentation/networking/ip-sysctl.txt >> +++ b/Documentation/networking/ip-sysctl.txt >> @@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER >> 0 - Layer 3 >> 1 - Layer 4 >> >> +fib_balance_budget - UNSIGNED INTEGER >> + Limits the number of resize attempts during balancing fib trie >> + on adding/removing new routes. >> + Possible values: >> + Default: UINT_MAX (0xFFFFFFFF) >> + > > So I am not really a fan of the use of UNSIGNED_INTEGER here. I would > much rather see this be a signed integer and the test be for the value > going negative instead of having to test in so many places if the value > is 0. Ok, sounds good to me - will change the type for v2. [..] >> @@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) >> break; >> } >> >> - max_work--; >> + (*budget)--; > > One thing we might explore here would be to look at pulling out the > check of budget at the start of the loop and instead combine the > decrement with the check for < 0 here so we have something like: > if (--*budget < 0) > break; > > That way what should happen is that we will achieve maximum depth in > any resize and make at least one pass optimizing from the bottom up > instead of the top down. Yeah, sounds reasonable - will do this way and retest. > >> + inflated = true; >> tn = get_child(tp, cindex); >> } >> >> /* update parent in case inflate failed */ >> tp = node_parent(tn); >> >> - /* Return if at least one inflate is run */ >> - if (max_work != MAX_WORK) >> + /* Return if at least one inflate is run: >> + * microoptimization to not recalculate thresholds >> + */ >> + if (inflated) >> return tp; > > An alternative to having to add another variable would be to look at > switching the loop to a combination of an if and a do/while like so: > if (should_inflate(tp, tn)) { > do { > ... > } while (should_inflate(tp, tn)); > > /* update parent in case inflate failed */ > return node_parent(tn); > } That's neat! Will drop the variable. > > >> /* Halve as long as the number of empty children in this >> * node is above threshold. >> */ >> - while (should_halve(tp, tn) && max_work) { >> - tp = halve(t, tn); >> + while (should_halve(tp, tn) && *budget) { >> + tp = halve(t, tn, budget); >> if (!tp) { >> #ifdef CONFIG_IP_FIB_TRIE_STATS >> this_cpu_inc(stats->resize_node_skipped); >> @@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) >> break; >> } >> >> - max_work--; >> + (*budget)--; > > Same here. It might make sense to pull the budget check out of the > start of the loop and move it here so that it becomes a depth first > optimization instead of being a shallow one. Makes sense, will rework. > >> tn = get_child(tp, cindex); >> } >> >> @@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen, >> >> static void trie_rebalance(struct trie *t, struct key_vector *tn) >> { >> - while (!IS_TRIE(tn)) >> - tn = resize(t, tn); >> + unsigned int budget = fib_balance_budget; >> + >> + while (budget && !IS_TRIE(tn)) >> + tn = resize(t, tn, &budget); > > I would not have the budget check here. At a minimum we should be > replacing trie nodes with leaves in the case that we are down to one > leaf node in the trie. Yeah, I see. > >> } >> >> static int fib_insert_node(struct trie *t, struct key_vector *tp, >> @@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb) >> void fib_table_flush_external(struct fib_table *tb) >> { >> struct trie *t = (struct trie *)tb->tb_data; >> + unsigned int budget = fib_balance_budget; >> struct key_vector *pn = t->kv; >> unsigned long cindex = 1; >> struct hlist_node *tmp; >> @@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb) >> update_suffix(pn); >> >> /* resize completed node */ >> - pn = resize(t, pn); >> + pn = resize(t, pn, &budget); >> cindex = get_index(pkey, pn); >> >> continue; > > So we would probably want a completely different budget here, or at > least we would want the budget to be multiplied by the leaves we are > removing. The problem is we aren't pulling single addresses, but are > literally splitting the tree into two separate tries. The same budget > isn't going to work for both cases. Which will multiply the work (and the time spent). Hmm, well, I guess we can multiply here and relax rtnl_mutex pressure if needed by introducing fib_lock. Does that sounds good? > >> @@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb) >> int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) >> { >> struct trie *t = (struct trie *)tb->tb_data; >> + unsigned int budget = fib_balance_budget; >> struct key_vector *pn = t->kv; >> unsigned long cindex = 1; >> struct hlist_node *tmp; >> @@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) >> update_suffix(pn); >> >> /* resize completed node */ >> - pn = resize(t, pn); >> + pn = resize(t, pn, &budget); >> cindex = get_index(pkey, pn); >> >> continue; > > Similar issues here, though I don't know what the typical amount of > change is we expect to see from a fib_table_flush call versus something > like the fib_table_flush_external call. Yes. Well, I was also curious if it's possible to avoid resize() call under (flush_all == true)? So, that exiting a namespace wouldn't result in rebalancing a possibly huge routing trie. Thanks for your notes and review, Dmitry
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index acdfb5d2bcaa..fb71dacff4dd 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -81,6 +81,12 @@ fib_multipath_hash_policy - INTEGER 0 - Layer 3 1 - Layer 4 +fib_balance_budget - UNSIGNED INTEGER + Limits the number of resize attempts during balancing fib trie + on adding/removing new routes. + Possible values: + Default: UINT_MAX (0xFFFFFFFF) + ip_forward_update_priority - INTEGER Whether to update SKB priority from "TOS" field in IPv4 header after it is forwarded. The new SKB priority is mapped from TOS field value diff --git a/include/net/ip.h b/include/net/ip.h index be3cad9c2e4c..305d0e43088b 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -421,6 +421,7 @@ static inline unsigned int ip_skb_dst_mtu(struct sock *sk, return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU); } +extern unsigned int fib_balance_budget; struct dst_metrics *ip_fib_metrics_init(struct net *net, struct nlattr *fc_mx, int fc_mx_len, struct netlink_ext_ack *extack); diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index ad7d56c421cb..d90cf9dfd443 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -182,8 +182,10 @@ struct trie { #endif }; -static struct key_vector *resize(struct trie *t, struct key_vector *tn); +static struct key_vector *resize(struct trie *t, struct key_vector *tn, + unsigned int *budget); static size_t tnode_free_size; +unsigned int fib_balance_budget = UINT_MAX; /* * synchronize_rcu after call_rcu for that many pages; it should be especially @@ -506,7 +508,8 @@ static void tnode_free(struct key_vector *tn) static struct key_vector *replace(struct trie *t, struct key_vector *oldtnode, - struct key_vector *tn) + struct key_vector *tn, + unsigned int *budget) { struct key_vector *tp = node_parent(oldtnode); unsigned long i; @@ -522,19 +525,19 @@ static struct key_vector *replace(struct trie *t, tnode_free(oldtnode); /* resize children now that oldtnode is freed */ - for (i = child_length(tn); i;) { + for (i = child_length(tn); i && *budget;) { struct key_vector *inode = get_child(tn, --i); /* resize child node */ if (tnode_full(tn, inode)) - tn = resize(t, inode); + tn = resize(t, inode, budget); } return tp; } -static struct key_vector *inflate(struct trie *t, - struct key_vector *oldtnode) +static struct key_vector *inflate(struct trie *t, struct key_vector *oldtnode, + unsigned int *budget) { struct key_vector *tn; unsigned long i; @@ -621,7 +624,7 @@ static struct key_vector *inflate(struct trie *t, } /* setup the parent pointers into and out of this node */ - return replace(t, oldtnode, tn); + return replace(t, oldtnode, tn, budget); nomem: /* all pointers should be clean so we are done */ tnode_free(tn); @@ -629,8 +632,8 @@ static struct key_vector *inflate(struct trie *t, return NULL; } -static struct key_vector *halve(struct trie *t, - struct key_vector *oldtnode) +static struct key_vector *halve(struct trie *t, struct key_vector *oldtnode, + unsigned int *budget) { struct key_vector *tn; unsigned long i; @@ -676,7 +679,7 @@ static struct key_vector *halve(struct trie *t, } /* setup the parent pointers into and out of this node */ - return replace(t, oldtnode, tn); + return replace(t, oldtnode, tn, budget); nomem: /* all pointers should be clean so we are done */ tnode_free(tn); @@ -843,15 +846,15 @@ static inline bool should_collapse(struct key_vector *tn) return used < 2; } -#define MAX_WORK 10 -static struct key_vector *resize(struct trie *t, struct key_vector *tn) +static struct key_vector *resize(struct trie *t, struct key_vector *tn, + unsigned int *budget) { #ifdef CONFIG_IP_FIB_TRIE_STATS struct trie_use_stats __percpu *stats = t->stats; #endif struct key_vector *tp = node_parent(tn); unsigned long cindex = get_index(tn->key, tp); - int max_work = MAX_WORK; + bool inflated = false; pr_debug("In tnode_resize %p inflate_threshold=%d threshold=%d\n", tn, inflate_threshold, halve_threshold); @@ -865,8 +868,8 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) /* Double as long as the resulting node has a number of * nonempty nodes that are above the threshold. */ - while (should_inflate(tp, tn) && max_work) { - tp = inflate(t, tn); + while (should_inflate(tp, tn) && *budget) { + tp = inflate(t, tn, budget); if (!tp) { #ifdef CONFIG_IP_FIB_TRIE_STATS this_cpu_inc(stats->resize_node_skipped); @@ -874,22 +877,25 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) break; } - max_work--; + (*budget)--; + inflated = true; tn = get_child(tp, cindex); } /* update parent in case inflate failed */ tp = node_parent(tn); - /* Return if at least one inflate is run */ - if (max_work != MAX_WORK) + /* Return if at least one inflate is run: + * microoptimization to not recalculate thresholds + */ + if (inflated) return tp; /* Halve as long as the number of empty children in this * node is above threshold. */ - while (should_halve(tp, tn) && max_work) { - tp = halve(t, tn); + while (should_halve(tp, tn) && *budget) { + tp = halve(t, tn, budget); if (!tp) { #ifdef CONFIG_IP_FIB_TRIE_STATS this_cpu_inc(stats->resize_node_skipped); @@ -897,7 +903,7 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) break; } - max_work--; + (*budget)--; tn = get_child(tp, cindex); } @@ -1005,8 +1011,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen, static void trie_rebalance(struct trie *t, struct key_vector *tn) { - while (!IS_TRIE(tn)) - tn = resize(t, tn); + unsigned int budget = fib_balance_budget; + + while (budget && !IS_TRIE(tn)) + tn = resize(t, tn, &budget); } static int fib_insert_node(struct trie *t, struct key_vector *tp, @@ -1784,6 +1792,7 @@ struct fib_table *fib_trie_unmerge(struct fib_table *oldtb) void fib_table_flush_external(struct fib_table *tb) { struct trie *t = (struct trie *)tb->tb_data; + unsigned int budget = fib_balance_budget; struct key_vector *pn = t->kv; unsigned long cindex = 1; struct hlist_node *tmp; @@ -1806,7 +1815,7 @@ void fib_table_flush_external(struct fib_table *tb) update_suffix(pn); /* resize completed node */ - pn = resize(t, pn); + pn = resize(t, pn, &budget); cindex = get_index(pkey, pn); continue; @@ -1853,6 +1862,7 @@ void fib_table_flush_external(struct fib_table *tb) int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) { struct trie *t = (struct trie *)tb->tb_data; + unsigned int budget = fib_balance_budget; struct key_vector *pn = t->kv; unsigned long cindex = 1; struct hlist_node *tmp; @@ -1876,7 +1886,7 @@ int fib_table_flush(struct net *net, struct fib_table *tb, bool flush_all) update_suffix(pn); /* resize completed node */ - pn = resize(t, pn); + pn = resize(t, pn, &budget); cindex = get_index(pkey, pn); continue; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index ba0fc4b18465..d7274cc442af 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -443,6 +443,13 @@ static struct ctl_table ipv4_table[] = { .mode = 0644, .proc_handler = proc_dointvec }, + { + .procname = "fib_balance_budget", + .data = &fib_balance_budget, + .maxlen = sizeof(fib_balance_budget), + .mode = 0644, + .proc_handler = proc_douintvec, + }, { .procname = "inet_peer_threshold", .data = &inet_peer_threshold,
Unfortunately, MAX_WORK at this moment is broken: during trie_rebalance() climbing upwards resize() is being called for each tnode. Which makes the limit useless the bigger trie gets: at each level there are 10 attempts to balance tnode and childs, resulting in O(10^n + 10^{n-1} + ... 10^{n-k}) complexity to balance trie, where k - is the level where we changed the trie originally (adding/removing alias/route). Which results in the following price of removing one route under big trie (similar for some other single routes that results in reallocating tnodes by hitting the threshold limits for halve/inflate resize): Before: Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes. Main: Aver depth: 1.99 Max depth: 2 Leaves: 77825 Prefixes: 77825 Internal nodes: 77 9: 1 10: 76 Pointers: 78336 Null ptrs: 435 Total size: 7912 kB Local: [omitted] After: Basic info: size of leaf: 40 bytes, size of tnode: 40 bytes. Main: Aver depth: 3.00 Max depth: 3 Leaves: 77824 Prefixes: 77824 Internal nodes: 20491 1: 2048 2: 18432 6: 1 11: 10 Pointers: 98368 Null ptrs: 54 Total size: 8865 kB Local: [omitted] Provide a sysctl to control amount of pending balancing work. (by default unlimited as it was) Cc: linux-doc@vger.kernel.org Cc: Jonathan Corbet <corbet@lwn.net> Fixes: ff181ed8768f ("fib_trie: Push assignment of child to parent down into inflate/halve") Signed-off-by: Dmitry Safonov <dima@arista.com> --- Documentation/networking/ip-sysctl.txt | 6 +++ include/net/ip.h | 1 + net/ipv4/fib_trie.c | 60 +++++++++++++++----------- net/ipv4/sysctl_net_ipv4.c | 7 +++ 4 files changed, 49 insertions(+), 25 deletions(-)