Message ID:   20190326153026.24493-5-dima@arista.com
State:        RFC
Delegated to: David Miller
Series:       net/fib: Speed up trie rebalancing for full view
On 3/26/19 9:30 AM, Dmitry Safonov wrote:
> Fib trie has a hard-coded sync_pages limit to call synchronize_rcu().
> The limit is 128 pages or 512Kb (considering the common case with 4Kb
> pages).
>
> Unfortunately, at Arista we have use scenarios with full-view software
> forwarding. At the scale of 100K and more routes, even on 2-core boxes
> the hard-coded limit starts actively shooting us in the foot: the
> lockup detector notices that rtnl_lock is held for seconds.
> The first reason is the previously broken MAX_WORK, which didn't limit
> pending balancing work. While fixing that, I noticed that the
> bottleneck is actually the number of synchronize_rcu() calls.
>
> I tried to fix it with a patch that decrements the number of tnodes in
> the RCU callback, but it didn't affect performance much.
>
> One possible way to "fix" it is to provide another sysctl to control
> sync_pages, but in my view that's nasty - exposing another
> implementation detail to user-space.

well, that was accepted last week. ;-)

commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f
Author: David Ahern <dsahern@gmail.com>
Date:   Wed Mar 20 09:18:59 2019 -0700

    ipv4: Allow amount of dirty memory from fib resizing to be controllable

Can you see how that change (should backport easily) affects your test
case? From my perspective 16MB was the sweet spot.
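The trade-off being debated here can be put in rough numbers. The sketch below is a back-of-the-envelope model, not kernel code: it only assumes that synchronize_rcu() fires each time the accumulated dirty memory crosses the limit; the 1GB "total dirty memory" figure is a made-up illustration value, not taken from the thread.

```python
# Rough model of the batching trade-off: synchronize_rcu() is called every
# time the dirty memory accumulated by trie rebalancing crosses a limit.
# 512KB is the hard-coded limit (128 pages * 4KB); 16MB is the "sweet spot"
# mentioned above. The total dirtied amount is an illustration value.

PAGE_SIZE = 4096
OLD_LIMIT = 128 * PAGE_SIZE        # 512KB, the hard-coded sync_pages limit
NEW_LIMIT = 16 * 1024 * 1024       # 16MB, the suggested sysctl setting

def sync_calls(total_dirty_bytes, limit):
    """Number of synchronize_rcu() grace-period waits for a workload."""
    return total_dirty_bytes // limit

total = 1 << 30                    # pretend a rebalance run dirties 1GB
print(sync_calls(total, OLD_LIMIT))  # 2048 grace-period waits
print(sync_calls(total, NEW_LIMIT))  # 64 grace-period waits
```

Raising the limit 32x cuts the number of grace-period waits by the same factor, which is why the hard-coded 512KB hurts so badly under rtnl_lock at full-view scale.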
Hi David,

On 3/26/19 3:39 PM, David Ahern wrote:
> On 3/26/19 9:30 AM, Dmitry Safonov wrote:
>> [...]
>> One possible way to "fix" it is to provide another sysctl to control
>> sync_pages, but in my view that's nasty - exposing another
>> implementation detail to user-space.
>
> well, that was accepted last week. ;-)
>
> commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f
> Author: David Ahern <dsahern@gmail.com>
> Date:   Wed Mar 20 09:18:59 2019 -0700
>
>     ipv4: Allow amount of dirty memory from fib resizing to be controllable
>
> Can you see how that change (should backport easily) affects your test
> case? From my perspective 16MB was the sweet spot.

Heh, I based my series on master, so I hadn't seen it yet.

I still wonder if it's better to expose this to userspace rather than
use a shrinker, but your patch should probably work for me - I'll test
it in the coming days.

Thanks,
Dmitry
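For testing the referenced change, the knob can be adjusted at runtime. The sysctl name below is taken from commit 9ab948a91b2c; it requires a kernel carrying that commit, so verify the knob exists before relying on it.

```shell
# Inspect and tune the fib resize dirty-memory limit added by
# commit 9ab948a91b2c ("ipv4: Allow amount of dirty memory from fib
# resizing to be controllable"). Requires root and a kernel with the patch.
sysctl net.ipv4.fib_sync_mem              # print the current limit (bytes)
sysctl -w net.ipv4.fib_sync_mem=16777216  # 16MB, the "sweet spot" from the thread
```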
On 3/26/19 11:15 AM, Dmitry Safonov wrote:
> I still wonder if it's good to expose it to userspace rather than a
> shrinker, but this should probably work for me - I'll test it in the
> coming days.

I did not know about the shrinker, so I cannot comment on whether it is
better or not. If it is, my patch can always be reverted since it was
just applied.
On 3/26/19 5:57 PM, David Ahern wrote:
> On 3/26/19 11:15 AM, Dmitry Safonov wrote:
>> I still wonder if it's good to expose it to userspace rather than a
>> shrinker, but this should probably work for me - I'll test it in the
>> coming days.
>
> I did not know about the shrinker, so I cannot comment on whether it is
> better or not. If it is, my patch can always be reverted since it was
> just applied.

Oh, don't misunderstand me - I didn't mean to say your patch is wrong;
I wondered what made this approach preferable for you. I've sent my
patches as RFC, so this is probably the place to discuss what's better,
rather than rushing to revert your commit.

Initially, I was also thinking about introducing a sysctl like that, but
thought it a bit "dirty" to expose the limit to userspace for tuning,
rather than making it work automatically under OOM.

On the other side, there are some minor bits of your sysctl that are not
yet perfect and ask to be improved if we go this way:
1. I've noticed that the accounted size is only ever increased or set to
   zero, which in turn can result in more calls to synchronize_rcu()
   than needed.
2. The memory summed up under the limit is only the tnodes freed in
   tnode_free(), while nodes are not counted there. Also, some tnodes
   are freed with node_free(), if I'm not mistaken.

So please don't take me wrong: your patch may also work for me, but it's
probably worth discussing both ways, and I very much value your opinion
here.

Thanks,
Dmitry
On 3/26/19 3:39 PM, David Ahern wrote:
> On 3/26/19 9:30 AM, Dmitry Safonov wrote:
>> [...]
>> One possible way to "fix" it is to provide another sysctl to control
>> sync_pages, but in my view that's nasty - exposing another
>> implementation detail to user-space.
>
> well, that was accepted last week. ;-)
>
> commit 9ab948a91b2c2abc8e82845c0e61f4b1683e3a4f
> Author: David Ahern <dsahern@gmail.com>
> Date:   Wed Mar 20 09:18:59 2019 -0700
>
>     ipv4: Allow amount of dirty memory from fib resizing to be controllable
>
> Can you see how that change (should backport easily) affects your test
> case? From my perspective 16MB was the sweet spot.

FWIW, I would like to +Cc Paul here.

TL;DR: David and I are looking into ways to improve a hard-coded limit,
tnode_free_size in net/ipv4/fib_trie.c: currently it's way too low
(512Kb). David created a patch that provides a sysctl to control the
limit, which would solve the problem for both of us. In parallel, I
thought that exposing this to userspace is not much fun and added a
shrinker that calls synchronize_rcu().

I'm not at all sure that the latter is actually a sane solution...
Is there any guarantee that memory to-be-freed by call_rcu() will get
freed under OOM conditions? Might there be a chance that we don't need
any limit here at all?

Worth mentioning: I'm not arguing against David's patch - as noted, it
would (will) solve the problem for both of us - but with good intentions
I'm wondering whether we can do something here other than add a new
sysctl knob.

Thanks,
Dmitry
On Tue, Mar 26, 2019 at 11:14:43PM +0000, Dmitry Safonov wrote:
> [...]
> TL;DR: David and I are looking into ways to improve a hard-coded limit,
> tnode_free_size in net/ipv4/fib_trie.c: currently it's way too low
> (512Kb). David created a patch that provides a sysctl to control the
> limit, which would solve the problem for both of us. In parallel, I
> thought that exposing this to userspace is not much fun and added a
> shrinker that calls synchronize_rcu().
>
> I'm not at all sure that the latter is actually a sane solution...
> Is there any guarantee that memory to-be-freed by call_rcu() will get
> freed under OOM conditions? Might there be a chance that we don't need
> any limit here at all?

Yes, unless whatever is causing the OOM is also stalling a CPU or task
that RCU is waiting on. The extreme case is of course when the OOM is
in fact being caused by the fact that RCU is waiting on a stalled CPU
or task. Of course, the fact that the CPU or task is being stalled is
a bug in its own right.

So, in the absence of bugs, yes, the memory that was passed to
call_rcu() should be freed within a reasonable length of time, even
under OOM conditions.

> Worth mentioning: I'm not arguing against David's patch - as noted, it
> would (will) solve the problem for both of us - but with good
> intentions I'm wondering whether we can do something here other than
> add a new sysctl knob.

An intermediate position would be to have a reasonably high setting so
that the sysctl knob almost never needed to be adjusted.

RCU used to detect OOM conditions and work harder to finish the grace
period in those cases, but this was abandoned because it was found not
to make a significant difference in production. Which might support the
position of assuming that memory passed to call_rcu() gets freed
reasonably quickly even under OOM conditions.

							Thanx, Paul
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 2ce2739e7693..5773d479e7d2 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -184,16 +184,9 @@ struct trie {
 static struct key_vector *resize(struct trie *t, struct key_vector *tn,
				 unsigned int *budget);
-static size_t tnode_free_size;
+static atomic_long_t objects_waiting_rcu;
 unsigned int fib_balance_budget = UINT_MAX;
 
-/*
- * synchronize_rcu after call_rcu for that many pages; it should be especially
- * useful before resizing the root node with PREEMPT_NONE configs; the value was
- * obtained experimentally, aiming to avoid visible slowdown.
- */
-static const int sync_pages = 128;
-
 static struct kmem_cache *fn_alias_kmem __ro_after_init;
 static struct kmem_cache *trie_leaf_kmem __ro_after_init;
 
@@ -306,11 +299,16 @@ static const int inflate_threshold_root = 30;
 static void __alias_free_mem(struct rcu_head *head)
 {
	struct fib_alias *fa = container_of(head, struct fib_alias, rcu);
+
+	atomic_long_dec(&objects_waiting_rcu);
	kmem_cache_free(fn_alias_kmem, fa);
 }
 
 static inline void alias_free_mem_rcu(struct fib_alias *fa)
 {
+	lockdep_rtnl_is_held();
+
+	atomic_long_inc(&objects_waiting_rcu);
	call_rcu(&fa->rcu, __alias_free_mem);
 }
 
@@ -318,13 +316,40 @@ static void __node_free_rcu(struct rcu_head *head)
 {
	struct tnode *n = container_of(head, struct tnode, rcu);
 
+	atomic_long_dec(&objects_waiting_rcu);
	if (!n->tn_bits)
		kmem_cache_free(trie_leaf_kmem, n);
	else
		kvfree(n);
 }
 
-#define node_free(n) call_rcu(&tn_info(n)->rcu, __node_free_rcu)
+static inline void node_free(struct key_vector *n)
+{
+	lockdep_rtnl_is_held();
+
+	atomic_long_inc(&objects_waiting_rcu);
+	call_rcu(&tn_info(n)->rcu, __node_free_rcu);
+}
+
+static unsigned long fib_shrink_count(struct shrinker *s,
+				      struct shrink_control *sc)
+{
+	return (unsigned long)atomic_long_read(&objects_waiting_rcu);
+}
+
+static unsigned long fib_shrink_scan(struct shrinker *s,
+				     struct shrink_control *sc)
+{
+	long ret = (unsigned long)atomic_long_read(&objects_waiting_rcu);
+
+	synchronize_rcu();
+	return (unsigned long)ret;
+}
+
+static struct shrinker fib_shrinker = {
+	.count_objects = fib_shrink_count,
+	.scan_objects = fib_shrink_scan,
+};
 
 static struct tnode *tnode_alloc(int bits)
 {
@@ -494,16 +519,9 @@ static void tnode_free(struct key_vector *tn)
 
	while (head) {
		head = head->next;
-		tnode_free_size += TNODE_SIZE(1ul << tn->bits);
		node_free(tn);
-
		tn = container_of(head, struct tnode, rcu)->kv;
	}
-
-	if (tnode_free_size >= PAGE_SIZE * sync_pages) {
-		tnode_free_size = 0;
-		synchronize_rcu();
-	}
 }
 
 static struct key_vector *replace(struct trie *t,
@@ -2118,6 +2136,9 @@ void __init fib_trie_init(void)
 
	trie_leaf_kmem = kmem_cache_create("ip_fib_trie",
					   LEAF_SIZE,
					   0, SLAB_PANIC, NULL);
+
+	if (register_shrinker(&fib_shrinker))
+		panic("IP FIB: failed to register fib_shrinker\n");
 }
 
 struct fib_table *fib_trie_table(u32 id, struct fib_table *alias)
Fib trie has a hard-coded sync_pages limit to call synchronize_rcu().
The limit is 128 pages or 512Kb (considering the common case with 4Kb
pages).

Unfortunately, at Arista we have use scenarios with full-view software
forwarding. At the scale of 100K and more routes, even on 2-core boxes
the hard-coded limit starts actively shooting us in the foot: the lockup
detector notices that rtnl_lock is held for seconds.
The first reason is the previously broken MAX_WORK, which didn't limit
pending balancing work. While fixing that, I noticed that the bottleneck
is actually the number of synchronize_rcu() calls.

I tried to fix it with a patch that decrements the number of tnodes in
the RCU callback, but it didn't affect performance much.

One possible way to "fix" it is to provide another sysctl to control
sync_pages, but in my view that's nasty - exposing another
implementation detail to user-space.

To be completely honest, I'm not sure whether calling synchronize_rcu()
from a shrinker is a sane idea: under OOM we're slowed down enough
already, and adding a synchronization point there will probably
noticeably falter the shrinker.

Anyway, I've got the following results on a very naive benchmark that
adds routes one-by-one and then removes them (with unlimited
fib_balance_budget), measuring the time spent on each route operation
(totals first, then the ten slowest single operations):

*Before* on 4-cores switch (AMD GX-420CA SOC):
v4 Create of 4194304 routes: 76806ms
 0(2097152): 000. 32.  0.  0	3353ms
 1(3145729): 000. 48.  0.  1	1311ms
 2(1048577): 000. 16.  0.  1	1286ms
 3(524289):  000.  8.  0.  1	 865ms
 4(4098744): 000. 62.138.184	 858ms
 5(3145728): 000. 48.  0.  0	 832ms
 6(1048576): 000. 16.  0.  0	 794ms
 7(2621441): 000. 40.  0.  1	 663ms
 8(2621440): 000. 40.  0.  0	 525ms
 9(524288):  000.  8.  0.  0	 508ms
v4 Delete of 4194304 routes: 111129ms
 0(1589247): 000. 24. 63.255	3033ms
 1(3702783): 000. 56.127.255	2833ms
 2(3686399): 000. 56. 63.255	2630ms
 3(1605631): 000. 24.127.255	2574ms
 4(1581055): 000. 24. 31.255	2395ms
 5(3671039): 000. 56.  3.255	2289ms
 6(1573887): 000. 24.  3.255	2234ms
 7(3678207): 000. 56. 31.255	2143ms
 8(3670527): 000. 56.  1.255	2109ms
 9(1573375): 000. 24.  1.255	2070ms

*After* on 4-cores switch:
v4 Create of 4194304 routes: 65305ms
 0(2097153): 000. 32.  0.  1	1871ms
 1(1048577): 000. 16.  0.  1	1064ms
 2(2097152): 000. 32.  0.  0	 905ms
 3(524289):  000.  8.  0.  1	 507ms
 4(1048576): 000. 16.  0.  0	 451ms
 5(2097154): 000. 32.  0.  2	 355ms
 6(262145):  000.  4.  0.  1	 240ms
 7(524288):  000.  8.  0.  0	 230ms
 8(262144):  000.  4.  0.  0	 115ms
 9(131073):  000.  2.  0.  1	 109ms
v4 Delete of 4194304 routes: 38015ms
 0(3571711): 000. 54.127.255	1616ms
 1(3565567): 000. 54.103.255	1340ms
 2(3670015): 000. 55.255.255	1297ms
 3(3565183): 000. 54.102.127	1226ms
 4(3565159): 000. 54.102.103	 912ms
 5(3604479): 000. 54.255.255	 596ms
 6(3670016): 000. 56.  0.  0	 474ms
 7(3565311): 000. 54.102.255	 434ms
 8(3567615): 000. 54.111.255	 388ms
 9(3565167): 000. 54.102.111	 376ms

After the patch, one core is completely busy running the benchmark,
while previously no CPU was saturated. By controlling the balancing
budget sysctl knob, one can spread the balancing work done on route
add/remove across subsequent changes (at the price of a possibly less
balanced trie and slightly more expensive lookups).

Fixes: fc86a93b46d7 ("fib_trie: Push tnode flushing down to inflate/halve")
Signed-off-by: Dmitry Safonov <dima@arista.com>
---
 net/ipv4/fib_trie.c | 53 +++++++++++++++++++++++++++++++--------------
 1 file changed, 37 insertions(+), 16 deletions(-)
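As a quick sanity check of the claimed improvement, the overall speedups implied by the benchmark totals quoted above can be computed directly (all four numbers are copied from the benchmark output; nothing else is assumed):

```python
# Speedup implied by the benchmark totals quoted in the commit message.
create_before, create_after = 76806, 65305   # ms, v4 create of 4194304 routes
delete_before, delete_after = 111129, 38015  # ms, v4 delete of 4194304 routes

print(f"create speedup: {create_before / create_after:.2f}x")  # ~1.18x
print(f"delete speedup: {delete_before / delete_after:.2f}x")  # ~2.92x
```

So the patch mostly helps the delete path, where the rtnl_lock-holding synchronize_rcu() storms were observed.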