Message ID:   20140828194422.GB8867@hostway.ca
State:        RFC, archived
Delegated to: David Miller
Simon Kirby <sim@hostway.ca> writes:

> On Thu, Aug 28, 2014 at 12:24:31PM -0700, Paul E. McKenney wrote:
>
>> On Tue, Aug 19, 2014 at 10:58:55PM -0700, Simon Kirby wrote:
>> > Hello!
>> >
>> > In trying to figure out what happened to a box running lots of vsftpd
>> > since we deployed a CONFIG_NET_NS=y kernel to it, we found that the
>> > (wall) time needed for cleanup_net() to complete, even on an idle box,
>> > can be quite long:
>> >
>> > #!/bin/bash
>> >
>> > ip netns delete test >&/dev/null
>> > while ip netns add test; do
>> > 	echo hi
>> > 	ip netns delete test
>> > done
>> >
>> > On my desktop and typical hosts, this prints at only around 4 or 6 per
>> > second. While this is happening, "vmstat 1" reports 100% idle, and
>> > there are D-state processes with stacks similar to:
>> >
>> > 30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, kthread+0xbb, ret_from_fork+0x7c, 0xffffffffffffffff
>> >
>> > 32220 ip D copy_net_ns+0x68, create_new_namespaces+0xfc, unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, system_call_fastpath+0x16, 0xffffffffffffffff
>> >
>> > copy_net_ns() is waiting on net_mutex which is held by cleanup_net().
>> >
>> > vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment
>> > about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that
>> > case). I didn't find anything that makes 2.6.35 any faster, but on Debian
>> > 2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10,
>> > 3.16, though still not anything I'd ever want to rely on per connection.
>> >
>> > C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c
>> >
>> > Kernel stack "top": http://0x.ca/sim/ref/tools/pstack
>> >
>> > What's going on here?
>>
>> That is a bit slow for many configurations, but there are some exceptions.
>>
>> So, what is your kernel's .config?
>
> I was unable to find a config (or stock kernel) that was any different,
> but here's the one we're using: http://0x.ca/sim/ref/3.10/config-3.10.53
>
> How fast does the above test run for you?
>
> We've been running with the attached, which has helped a little, but it's
> still quite slow in our particular use case (vsftpd), and with the above
> test. Should I enable RCU_TRACE or STALL_INFO with a low timeout or
> something?

I just want to add a little bit more analysis to this.

What we desire to be fast is copy_net_ns; cleanup_net is batched and
asynchronous, and nothing really cares how long it takes, except that
cleanup_net holds the net_mutex and thus blocks copy_net_ns.

The puzzle is which RCU delays Simon is seeing in the network namespace
cleanup path, and why: it seems the synchronize_rcu is not the only one,
and in the case of vsftpd, with trivial network namespaces where nothing
has been done, we should not need to delay at all.

Eric

> Simon-
>
> -- >8 --
> Subject: [PATCH] netns: use synchronize_rcu_expedited instead of
>  synchronize_rcu
>
> Similar to ef323088, with synchronize_rcu(), we are only able to create
> and destroy about 4 or 7 net namespaces per second, which really puts a
> dent in the performance of programs attempting to use CLONE_NEWNET for
> privilege separation (vsftpd, chromium).
> ---
>  net/core/net_namespace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 85b6269..6dcb4b3 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -296,7 +296,7 @@ static void cleanup_net(struct work_struct *work)
>  	 * This needs to be before calling the exit() notifiers, so
>  	 * the rcu_barrier() below isn't sufficient alone.
>  	 */
> -	synchronize_rcu();
> +	synchronize_rcu_expedited();
>
>  	/* Run all of the network namespace exit methods */
>  	list_for_each_entry_reverse(ops, &pernet_list, list)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 28, 2014 at 03:33:42PM -0500, Eric W. Biederman wrote:
> Simon Kirby <sim@hostway.ca> writes:
>
> > On Thu, Aug 28, 2014 at 12:24:31PM -0700, Paul E. McKenney wrote:
> >
> >> On Tue, Aug 19, 2014 at 10:58:55PM -0700, Simon Kirby wrote:
> >> > Hello!
> >> >
> >> > In trying to figure out what happened to a box running lots of vsftpd
> >> > since we deployed a CONFIG_NET_NS=y kernel to it, we found that the
> >> > (wall) time needed for cleanup_net() to complete, even on an idle box,
> >> > can be quite long:
> >> >
> >> > #!/bin/bash
> >> >
> >> > ip netns delete test >&/dev/null
> >> > while ip netns add test; do
> >> > 	echo hi
> >> > 	ip netns delete test
> >> > done
> >> >
> >> > On my desktop and typical hosts, this prints at only around 4 or 6 per
> >> > second. While this is happening, "vmstat 1" reports 100% idle, and
> >> > there are D-state processes with stacks similar to:
> >> >
> >> > 30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, kthread+0xbb, ret_from_fork+0x7c, 0xffffffffffffffff
> >> >
> >> > 32220 ip D copy_net_ns+0x68, create_new_namespaces+0xfc, unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, system_call_fastpath+0x16, 0xffffffffffffffff
> >> >
> >> > copy_net_ns() is waiting on net_mutex which is held by cleanup_net().
> >> >
> >> > vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment
> >> > about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that
> >> > case). I didn't find anything that makes 2.6.35 any faster, but on Debian
> >> > 2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10,
> >> > 3.16, though still not anything I'd ever want to rely on per connection.
> >> >
> >> > C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c
> >> >
> >> > Kernel stack "top": http://0x.ca/sim/ref/tools/pstack
> >> >
> >> > What's going on here?
> >>
> >> That is a bit slow for many configurations, but there are some exceptions.
> >>
> >> So, what is your kernel's .config?
> >
> > I was unable to find a config (or stock kernel) that was any different,
> > but here's the one we're using: http://0x.ca/sim/ref/3.10/config-3.10.53
> >
> > How fast does the above test run for you?
> >
> > We've been running with the attached, which has helped a little, but it's
> > still quite slow in our particular use case (vsftpd), and with the above
> > test. Should I enable RCU_TRACE or STALL_INFO with a low timeout or
> > something?
>
> I just want to add a little bit more analysis to this.
>
> What we desire to be fast is copy_net_ns; cleanup_net is batched and
> asynchronous, and nothing really cares how long it takes, except that
> cleanup_net holds the net_mutex and thus blocks copy_net_ns.
>
> The puzzle is which RCU delays Simon is seeing in the network namespace
> cleanup path, and why: it seems the synchronize_rcu is not the only one,
> and in the case of vsftpd, with trivial network namespaces where nothing
> has been done, we should not need to delay at all.

Indeed, given the version and .config, I can't see why any individual
RCU grace-period operation would be particularly slow.

I suggest using ftrace on synchronize_rcu() and friends.

							Thanx, Paul

> Eric
>
> > Simon-
> >
> > -- >8 --
> > Subject: [PATCH] netns: use synchronize_rcu_expedited instead of
> >  synchronize_rcu
> >
> > Similar to ef323088, with synchronize_rcu(), we are only able to create
> > and destroy about 4 or 7 net namespaces per second, which really puts a
> > dent in the performance of programs attempting to use CLONE_NEWNET for
> > privilege separation (vsftpd, chromium).
> > ---
> >  net/core/net_namespace.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> > index 85b6269..6dcb4b3 100644
> > --- a/net/core/net_namespace.c
> > +++ b/net/core/net_namespace.c
> > @@ -296,7 +296,7 @@ static void cleanup_net(struct work_struct *work)
> >  	 * This needs to be before calling the exit() notifiers, so
> >  	 * the rcu_barrier() below isn't sufficient alone.
> >  	 */
> > -	synchronize_rcu();
> > +	synchronize_rcu_expedited();
> >
> >  	/* Run all of the network namespace exit methods */
> >  	list_for_each_entry_reverse(ops, &pernet_list, list)
On Thu, Aug 28, 2014 at 01:46:58PM -0700, Paul E. McKenney wrote:

> On Thu, Aug 28, 2014 at 03:33:42PM -0500, Eric W. Biederman wrote:
>
> > I just want to add a little bit more analysis to this.
> >
> > What we desire to be fast is copy_net_ns; cleanup_net is batched and
> > asynchronous, and nothing really cares how long it takes, except that
> > cleanup_net holds the net_mutex and thus blocks copy_net_ns.
> >
> > The puzzle is which RCU delays Simon is seeing in the network namespace
> > cleanup path, and why: it seems the synchronize_rcu is not the only one,
> > and in the case of vsftpd, with trivial network namespaces where nothing
> > has been done, we should not need to delay at all.
>
> Indeed, given the version and .config, I can't see why any individual
> RCU grace-period operation would be particularly slow.
>
> I suggest using ftrace on synchronize_rcu() and friends.

I made a parallel net namespace create/destroy benchmark that prints the
progress and time to create and clean up 32 unshare()d child processes:

http://0x.ca/sim/ref/tools/netnsbench.c

I noticed that if I haven't run it for a while, the first batch often is
fast, followed by slowness from then on:

++++++++++++++++++++++++++++++++-------------------------------- 0.039478s
++++++++++++++++++++-----+----------------+++++++++---------++-- 4.463837s
+++++++++++++++++++++++++------+--------------------++++++------ 3.011882s
+++++++++++++++---+-------------++++++++++++++++---------------- 2.283993s

Fiddling around on a stock kernel, "echo 1 > /sys/kernel/rcu_expedited"
makes behaviour change as it did with my patch:

++-++-+++-+-----+-+-++-+-++--++-+--+-+-++--++-+-+-+-++-+--++---- 0.801406s
+-+-+-++-+-+-+-+-++--+-+-++-+--++-+-+-+-+-+-+-+-+-+-+-+--++-+--- 0.872011s
++--+-++--+-++--+-++--+-+-+-+-++-+--++--+-++-+-+-+-+--++-+-+-+-- 0.946745s

How would I use ftrace on synchronize_rcu() here?

As Eric said, cleanup_net() is batched, but while it is cleaning up,
net_mutex is held. Isn't the issue just that net_mutex is held while
some other things are going on that are meant to be lazy / batched?

What is net_mutex protecting in cleanup_net()?

I noticed that [kworker/u16:0]'s stack is often:

[<ffffffff810942a6>] wait_rcu_gp+0x46/0x50
[<ffffffff8109607e>] synchronize_sched+0x2e/0x50
[<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat]
[<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60
[<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
[<ffffffff81062997>] process_one_work+0x157/0x440
[<ffffffff81063303>] worker_thread+0x63/0x520
[<ffffffff81068b96>] kthread+0xd6/0xf0
[<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

and

[<ffffffff81095364>] _rcu_barrier+0x154/0x1f0
[<ffffffff81095450>] rcu_barrier+0x10/0x20
[<ffffffff81102c2c>] kmem_cache_destroy+0x6c/0xb0
[<ffffffffa0089e97>] nf_conntrack_cleanup_net_list+0x167/0x1c0 [nf_conntrack]
[<ffffffffa008aab5>] nf_conntrack_pernet_exit+0x65/0x70 [nf_conntrack]
[<ffffffff81720353>] ops_exit_list.isra.4+0x53/0x60
[<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
[<ffffffff81062997>] process_one_work+0x157/0x440
[<ffffffff81063303>] worker_thread+0x63/0x520
[<ffffffff81068b96>] kthread+0xd6/0xf0
[<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

So I tried flushing iptables rules and rmmoding netfilter bits:

++++++++++++++++++++-+--------------------+++++++++++----------- 0.179940s
++++++++++++++--+-------------+++++++++++++++++----------------- 0.151988s
++++++++++++++++++++++++++++---+--------------------------+++--- 0.159967s
++++++++++++++++++++++----------------------++++++++++---------- 0.175964s

Expedited:

++-+--++-+-+-+-+-+-+--++-+-+-++-+-+-+--++-+-+-+-+-+-+-+-+-+-+--- 0.079988s
++-+-+-+-+-+-+-+-+-+-+-+--++-+--++-+--+-++-+-+--++-+-+-+-+-+-+-- 0.089347s
++++--+++--++--+-+++++++-+++++--------------++-+-+--++-+-+--++-- 0.081566s
+++++-+++-------++-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-+-+-+-+-+-+--- 0.089026s

So, much faster. It seems that just loading nf_conntrack_ipv4 (like by
running iptables -t nat -nvL) is enough to slow it way down. But it is
still capable of being fast, as above.

Simon-
	Hello,

On Thu, 28 Aug 2014, Simon Kirby wrote:

> I noticed that [kworker/u16:0]'s stack is often:
>
> [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50
> [<ffffffff8109607e>] synchronize_sched+0x2e/0x50
> [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat]

	I guess the problem is in nf_nat_net_exit, and maybe in other
nf exit handlers too. pernet-exit handlers should avoid synchronize_rcu
and rcu_barrier; an RCU callback plus an rcu_barrier in module-exit is
the way to go. cleanup_net already includes an rcu_barrier, so
pernet-exit does not need such calls.

> [<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60
> [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
> [<ffffffff81062997>] process_one_work+0x157/0x440
> [<ffffffff81063303>] worker_thread+0x63/0x520
> [<ffffffff81068b96>] kthread+0xd6/0xf0
> [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff

Regards

--
Julian Anastasov <ja@ssi.bg>
Julian Anastasov <ja@ssi.bg> writes:

> 	Hello,
>
> On Thu, 28 Aug 2014, Simon Kirby wrote:
>
>> I noticed that [kworker/u16:0]'s stack is often:
>>
>> [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50
>> [<ffffffff8109607e>] synchronize_sched+0x2e/0x50
>> [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat]
>
> 	I guess the problem is in nf_nat_net_exit, and maybe in other
> nf exit handlers too. pernet-exit handlers should avoid synchronize_rcu
> and rcu_barrier; an RCU callback plus an rcu_barrier in module-exit is
> the way to go. cleanup_net already includes an rcu_barrier, so
> pernet-exit does not need such calls.

In principle I agree, but in this particular case it looks a bit tricky
because of the separate hash table used to track NAT state per network
namespace.

At the same time, all of the packets should be drained before we get to
nf_nat_net_exit, so it doesn't look like the synchronize_rcu there is
actually protecting anything.

Further, calling an RCU delay function in net_exit methods largely
destroys the batched cleanup of network namespaces, so it is very
unpleasant.

Could someone who knows nf_nat_core.c better than I do look and see if
we can just remove that synchronize_rcu?

>> [<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60
>> [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
>> [<ffffffff81062997>] process_one_work+0x157/0x440
>> [<ffffffff81063303>] worker_thread+0x63/0x520
>> [<ffffffff81068b96>] kthread+0xd6/0xf0
>> [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
>> [<ffffffffffffffff>] 0xffffffffffffffff

Eric
On Thu, Aug 28, 2014 at 05:40:29PM -0700, Simon Kirby wrote:
> On Thu, Aug 28, 2014 at 01:46:58PM -0700, Paul E. McKenney wrote:
>
> > On Thu, Aug 28, 2014 at 03:33:42PM -0500, Eric W. Biederman wrote:
> >
> > > I just want to add a little bit more analysis to this.
> > >
> > > What we desire to be fast is copy_net_ns; cleanup_net is batched and
> > > asynchronous, and nothing really cares how long it takes, except that
> > > cleanup_net holds the net_mutex and thus blocks copy_net_ns.
> > >
> > > The puzzle is which RCU delays Simon is seeing in the network namespace
> > > cleanup path, and why: it seems the synchronize_rcu is not the only one,
> > > and in the case of vsftpd, with trivial network namespaces where nothing
> > > has been done, we should not need to delay at all.
> >
> > Indeed, given the version and .config, I can't see why any individual
> > RCU grace-period operation would be particularly slow.
> >
> > I suggest using ftrace on synchronize_rcu() and friends.
>
> I made a parallel net namespace create/destroy benchmark that prints the
> progress and time to create and clean up 32 unshare()d child processes:
>
> http://0x.ca/sim/ref/tools/netnsbench.c
>
> I noticed that if I haven't run it for a while, the first batch often is
> fast, followed by slowness from then on:
>
> ++++++++++++++++++++++++++++++++-------------------------------- 0.039478s
> ++++++++++++++++++++-----+----------------+++++++++---------++-- 4.463837s
> +++++++++++++++++++++++++------+--------------------++++++------ 3.011882s
> +++++++++++++++---+-------------++++++++++++++++---------------- 2.283993s
>
> Fiddling around on a stock kernel, "echo 1 > /sys/kernel/rcu_expedited"
> makes behaviour change as it did with my patch:
>
> ++-++-+++-+-----+-+-++-+-++--++-+--+-+-++--++-+-+-+-++-+--++---- 0.801406s
> +-+-+-++-+-+-+-+-++--+-+-++-+--++-+-+-+-+-+-+-+-+-+-+-+--++-+--- 0.872011s
> ++--+-++--+-++--+-++--+-+-+-+-++-+--++--+-++-+-+-+-+--++-+-+-+-- 0.946745s
>
> How would I use ftrace on synchronize_rcu() here?

http://lwn.net/Articles/370423/ is your friend here. If your kernel is
built with the needed configuration, you give the command
"echo synchronize_rcu > set_ftrace_filter".

http://lwn.net/Articles/365835/ and http://lwn.net/Articles/366796/ have
background info.

> As Eric said, cleanup_net() is batched, but while it is cleaning up,
> net_mutex is held. Isn't the issue just that net_mutex is held while
> some other things are going on that are meant to be lazy / batched?
>
> What is net_mutex protecting in cleanup_net()?
>
> I noticed that [kworker/u16:0]'s stack is often:
>
> [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50
> [<ffffffff8109607e>] synchronize_sched+0x2e/0x50
> [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat]
> [<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60
> [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
> [<ffffffff81062997>] process_one_work+0x157/0x440
> [<ffffffff81063303>] worker_thread+0x63/0x520
> [<ffffffff81068b96>] kthread+0xd6/0xf0
> [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> and
>
> [<ffffffff81095364>] _rcu_barrier+0x154/0x1f0
> [<ffffffff81095450>] rcu_barrier+0x10/0x20
> [<ffffffff81102c2c>] kmem_cache_destroy+0x6c/0xb0
> [<ffffffffa0089e97>] nf_conntrack_cleanup_net_list+0x167/0x1c0 [nf_conntrack]
> [<ffffffffa008aab5>] nf_conntrack_pernet_exit+0x65/0x70 [nf_conntrack]
> [<ffffffff81720353>] ops_exit_list.isra.4+0x53/0x60
> [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
> [<ffffffff81062997>] process_one_work+0x157/0x440
> [<ffffffff81063303>] worker_thread+0x63/0x520
> [<ffffffff81068b96>] kthread+0xd6/0xf0
> [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> So I tried flushing iptables rules and rmmoding netfilter bits:
>
> ++++++++++++++++++++-+--------------------+++++++++++----------- 0.179940s
> ++++++++++++++--+-------------+++++++++++++++++----------------- 0.151988s
> ++++++++++++++++++++++++++++---+--------------------------+++--- 0.159967s
> ++++++++++++++++++++++----------------------++++++++++---------- 0.175964s
>
> Expedited:
>
> ++-+--++-+-+-+-+-+-+--++-+-+-++-+-+-+--++-+-+-+-+-+-+-+-+-+-+--- 0.079988s
> ++-+-+-+-+-+-+-+-+-+-+-+--++-+--++-+--+-++-+-+--++-+-+-+-+-+-+-- 0.089347s
> ++++--+++--++--+-+++++++-+++++--------------++-+-+--++-+-+--++-- 0.081566s
> +++++-+++-------++-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-+-+-+-+-+-+--- 0.089026s
>
> So, much faster.

My first guess is that this code sequence is calling synchronize_rcu()
quite often. Would it be possible to consolidate these?

							Thanx, Paul
	Hello,

On Fri, 29 Aug 2014, Eric W. Biederman wrote:

> > 	I guess the problem is in nf_nat_net_exit, and maybe in other
> > nf exit handlers too. pernet-exit handlers should avoid synchronize_rcu
> > and rcu_barrier; an RCU callback plus an rcu_barrier in module-exit is
> > the way to go. cleanup_net already includes an rcu_barrier, so
> > pernet-exit does not need such calls.
>
> In principle I agree, but in this particular case it looks a bit tricky
> because of the separate hash table used to track NAT state per network
> namespace.

	It is still possible for the module's pernet-init handler to
attach to net->ct... a special structure holding all the pointers that
should be freed by the module's RCU callback, such as the hash table.
For example:

struct netns_ct_nat_rcu_allocs {
	struct rcu_head		rcu_head;
	struct hlist_head	*nat_bysource;
	unsigned int		nat_htable_size;
};

- pernet-init:
  - allocate the structure and attach it to net->ct.nat_rcu_allocs
  - the original nat_bysource place remains, because we want to avoid a
    net->ct.nat_rcu_allocs dereference in the fast path
- pernet-exit:
  - copy nat_bysource and nat_htable_size into net->ct.nat_rcu_allocs
    (this can even be done in the pernet-init function above)
  - call_rcu(&net->ct.nat_rcu_allocs->rcu_head, nat_rcu_free);
- cleanup_net:
  - rcu_barrier()
- RCU callback (nat_rcu_free):
  - call nf_ct_free_hashtable
  - kfree the structure
- cleanup_net:
  - drop the netns after the rcu_barrier

	Due to the rcu_barrier in cleanup_net, it is even possible to
provide a per-module rcu_head instead of an allocated structure, for
example:

	call_rcu(&net->ct.nat_rcu_head, nat_rcu_free);

Then the nat_rcu_free function just calls nf_ct_free_hashtable before
cleanup_net drops the netns struct. In this case the memory price is
just one rcu_head for every module that uses an RCU callback.

> At the same time all of the packets should be drained before we get to
> nf_nat_net_exit, so it doesn't look like the synchronize_rcu there is
> actually protecting anything.

	That is true for cleanup_net. I don't remember: can we still see
packets while a particular module-exit calls unregister_pernet_subsys()?
Maybe yes?

Regards

--
Julian Anastasov <ja@ssi.bg>
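[Julian's per-module rcu_head variant, expressed as a kernel-style pseudocode sketch. The nat_rcu_head/nat_bysource_saved fields and the nat_rcu_free function are hypothetical names from the proposal above, not existing nf_nat code, and the fragment is not compilable on its own.]

```
/* Sketch only: hypothetical fields on struct netns_ct. */
static void nat_rcu_free(struct rcu_head *head)
{
	struct netns_ct *ct = container_of(head, struct netns_ct, nat_rcu_head);

	/* Runs after a grace period; the rcu_barrier() in cleanup_net()
	 * guarantees this has finished before the netns is dropped. */
	nf_ct_free_hashtable(ct->nat_bysource_saved, ct->nat_htable_size_saved);
}

static void __net_exit nf_nat_net_exit(struct net *net)
{
	/* No synchronize_rcu() here: defer the free instead of stalling
	 * the whole batched cleanup while net_mutex is held. */
	net->ct.nat_bysource_saved = net->ct.nat_bysource;
	net->ct.nat_htable_size_saved = net->ct.nat_htable_size;
	call_rcu(&net->ct.nat_rcu_head, nat_rcu_free);
}
```

The key property is that pernet-exit becomes non-blocking: the grace-period wait moves out of the net_mutex-protected cleanup loop and into a callback that cleanup_net()'s existing rcu_barrier() already waits for.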
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 85b6269..6dcb4b3 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -296,7 +296,7 @@ static void cleanup_net(struct work_struct *work)
 	 * This needs to be before calling the exit() notifiers, so
 	 * the rcu_barrier() below isn't sufficient alone.
 	 */
-	synchronize_rcu();
+	synchronize_rcu_expedited();
 
 	/* Run all of the network namespace exit methods */
 	list_for_each_entry_reverse(ops, &pernet_list, list)