net_ns cleanup / RCU overhead

Message ID 20140828194422.GB8867@hostway.ca
State RFC, archived
Delegated to: David Miller

Commit Message

Simon Kirby Aug. 28, 2014, 7:44 p.m. UTC
On Thu, Aug 28, 2014 at 12:24:31PM -0700, Paul E. McKenney wrote:

> On Tue, Aug 19, 2014 at 10:58:55PM -0700, Simon Kirby wrote:
> > Hello!
> > 
> > In trying to figure out what happened to a box running lots of vsftpd
> > since we deployed a CONFIG_NET_NS=y kernel to it, we found that the
> > (wall) time needed for cleanup_net() to complete, even on an idle box,
> > can be quite long:
> > 
> > #!/bin/bash
> > 
> > ip netns delete test >&/dev/null
> > while ip netns add test; do
> >         echo hi
> >         ip netns delete test
> > done
> > 
> > On my desktop and typical hosts, this prints at only around 4 or 6 per
> > second. While this is happening, "vmstat 1" reports 100% idle, and
> > there are D-state processes with stacks similar to:
> > 
> > 30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, kthread+0xbb, ret_from_fork+0x7c, 0xffffffffffffffff
> > 
> > 32220 ip              D copy_net_ns+0x68, create_new_namespaces+0xfc, unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, system_call_fastpath+0x16, 0xffffffffffffffff
> > 
> > copy_net_ns() is waiting on net_mutex which is held by cleanup_net().
> > 
> > vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment
> > about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that
> > case). I didn't find anything that makes 2.6.35 any faster, but on Debian
> > 2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10,
> > 3.16, though still not anything I'd ever want to rely on per connection.
> > 
> > C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c
> > 
> > Kernel stack "top": http://0x.ca/sim/ref/tools/pstack
> > 
> > What's going on here?
> 
> That is a bit slow for many configurations, but there are some exceptions.
> 
> So, what is your kernel's .config?

I was unable to find a config (or stock kernel) that was any different,
but here's the one we're using: http://0x.ca/sim/ref/3.10/config-3.10.53

How fast does the above test run for you?

We've been running with the attached, which has helped a little, but it's
still quite slow in our particular use case (vsftpd), and with the above
test. Should I enable RCU_TRACE or STALL_INFO with a low timeout or
something?

Simon-

-- >8 --
Subject: [PATCH] netns: use synchronize_rcu_expedited instead of
 synchronize_rcu

Similar to ef323088, with synchronize_rcu(), we are only able to create
and destroy about 4 or 7 net namespaces per second, which really puts a
dent in the performance of programs attempting to use CLONE_NEWNET for
privilege separation (vsftpd, chromium).
---
 net/core/net_namespace.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric W. Biederman Aug. 28, 2014, 8:33 p.m. UTC | #1
Simon Kirby <sim@hostway.ca> writes:

> On Thu, Aug 28, 2014 at 12:24:31PM -0700, Paul E. McKenney wrote:
>
>> On Tue, Aug 19, 2014 at 10:58:55PM -0700, Simon Kirby wrote:
>> > Hello!
>> > 
>> > In trying to figure out what happened to a box running lots of vsftpd
>> > since we deployed a CONFIG_NET_NS=y kernel to it, we found that the
>> > (wall) time needed for cleanup_net() to complete, even on an idle box,
>> > can be quite long:
>> > 
>> > #!/bin/bash
>> > 
>> > ip netns delete test >&/dev/null
>> > while ip netns add test; do
>> >         echo hi
>> >         ip netns delete test
>> > done
>> > 
>> > On my desktop and typical hosts, this prints at only around 4 or 6 per
>> > second. While this is happening, "vmstat 1" reports 100% idle, and
>> > there are D-state processes with stacks similar to:
>> > 
>> > 30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, kthread+0xbb, ret_from_fork+0x7c, 0xffffffffffffffff
>> > 
>> > 32220 ip              D copy_net_ns+0x68, create_new_namespaces+0xfc, unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, system_call_fastpath+0x16, 0xffffffffffffffff
>> > 
>> > copy_net_ns() is waiting on net_mutex which is held by cleanup_net().
>> > 
>> > vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment
>> > about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that
>> > case). I didn't find anything that makes 2.6.35 any faster, but on Debian
>> > 2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10,
>> > 3.16, though still not anything I'd ever want to rely on per connection.
>> > 
>> > C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c
>> > 
>> > Kernel stack "top": http://0x.ca/sim/ref/tools/pstack
>> > 
>> > What's going on here?
>> 
>> That is a bit slow for many configurations, but there are some exceptions.
>> 
>> So, what is your kernel's .config?
>
> I was unable to find a config (or stock kernel) that was any different,
> but here's the one we're using: http://0x.ca/sim/ref/3.10/config-3.10.53
>
> How fast does the above test run for you?
>
> We've been running with the attached, which has helped a little, but it's
> still quite slow in our particular use case (vsftpd), and with the above
> test. Should I enable RCU_TRACE or STALL_INFO with a low timeout or
> something?

I just want to add a little bit more analysis to this.

What we desire to be fast is copy_net_ns; cleanup_net is batched and
asynchronous, and nobody really cares how long it takes, except that
cleanup_net holds net_mutex and thus blocks copy_net_ns.

The puzzle is which rcu delays Simon is seeing in the network namespace
cleanup path, and why: it seems the synchronize_rcu is not the only one,
and in the case of vsftpd, with trivial network namespaces where nothing
has been done, we should not need to delay at all.
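
Roughly, the structure (a simplified sketch of the 3.x code paths with
the helpers elided, not the verbatim kernel source) is:

	/* copy_net_ns(): runs in the clone()/unshare() path */
	mutex_lock(&net_mutex);
	rv = setup_net(net, ...);	/* run every pernet init method */
	mutex_unlock(&net_mutex);

	/* cleanup_net(): runs from a workqueue, batched over dead namespaces */
	mutex_lock(&net_mutex);
	/* unhash the dying namespaces from the netns list */
	synchronize_rcu();		/* wait for rcu readers of that list */
	/* run every pernet exit method (ops_exit_list) */
	mutex_unlock(&net_mutex);
	rcu_barrier();			/* wait for call_rcu callbacks, then free */

So any rcu delay taken with net_mutex held is paid by the next
copy_net_ns().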

Eric


> Simon-
>
> -- >8 --
> Subject: [PATCH] netns: use synchronize_rcu_expedited instead of
>  synchronize_rcu
>
> Similar to ef323088, with synchronize_rcu(), we are only able to create
> and destroy about 4 or 7 net namespaces per second, which really puts a
> dent in the performance of programs attempting to use CLONE_NEWNET for
> privilege separation (vsftpd, chromium).
> ---
>  net/core/net_namespace.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> index 85b6269..6dcb4b3 100644
> --- a/net/core/net_namespace.c
> +++ b/net/core/net_namespace.c
> @@ -296,7 +296,7 @@ static void cleanup_net(struct work_struct *work)
>  	 * This needs to be before calling the exit() notifiers, so
>  	 * the rcu_barrier() below isn't sufficient alone.
>  	 */
> -	synchronize_rcu();
> +	synchronize_rcu_expedited();
>  
>  	/* Run all of the network namespace exit methods */
>  	list_for_each_entry_reverse(ops, &pernet_list, list)
Paul E. McKenney Aug. 28, 2014, 8:46 p.m. UTC | #2
On Thu, Aug 28, 2014 at 03:33:42PM -0500, Eric W. Biederman wrote:
> Simon Kirby <sim@hostway.ca> writes:
> 
> > On Thu, Aug 28, 2014 at 12:24:31PM -0700, Paul E. McKenney wrote:
> >
> >> On Tue, Aug 19, 2014 at 10:58:55PM -0700, Simon Kirby wrote:
> >> > Hello!
> >> > 
> >> > In trying to figure out what happened to a box running lots of vsftpd
> >> > since we deployed a CONFIG_NET_NS=y kernel to it, we found that the
> >> > (wall) time needed for cleanup_net() to complete, even on an idle box,
> >> > can be quite long:
> >> > 
> >> > #!/bin/bash
> >> > 
> >> > ip netns delete test >&/dev/null
> >> > while ip netns add test; do
> >> >         echo hi
> >> >         ip netns delete test
> >> > done
> >> > 
> >> > On my desktop and typical hosts, this prints at only around 4 or 6 per
> >> > second. While this is happening, "vmstat 1" reports 100% idle, and
> >> > there are D-state processes with stacks similar to:
> >> > 
> >> > 30566 [kworker/u16:1] D wait_rcu_gp+0x48, synchronize_sched+0x2f, cleanup_net+0xdb, process_one_work+0x175, worker_thread+0x119, kthread+0xbb, ret_from_fork+0x7c, 0xffffffffffffffff
> >> > 
> >> > 32220 ip              D copy_net_ns+0x68, create_new_namespaces+0xfc, unshare_nsproxy_namespaces+0x66, SyS_unshare+0x159, system_call_fastpath+0x16, 0xffffffffffffffff
> >> > 
> >> > copy_net_ns() is waiting on net_mutex which is held by cleanup_net().
> >> > 
> >> > vsftpd uses CLONE_NEWNET to set up privsep processes. There is a comment
> >> > about it being really slow before 2.6.35 (it avoids CLONE_NEWNET in that
> >> > case). I didn't find anything that makes 2.6.35 any faster, but on Debian
> >> > 2.6.36-5-amd64, I notice it does seem to be a bit faster than 3.2, 3.10,
> >> > 3.16, though still not anything I'd ever want to rely on per connection.
> >> > 
> >> > C implementation of the above: http://0x.ca/sim/ref/tools/netnsloop.c
> >> > 
> >> > Kernel stack "top": http://0x.ca/sim/ref/tools/pstack
> >> > 
> >> > What's going on here?
> >> 
> >> That is a bit slow for many configurations, but there are some exceptions.
> >> 
> >> So, what is your kernel's .config?
> >
> > I was unable to find a config (or stock kernel) that was any different,
> > but here's the one we're using: http://0x.ca/sim/ref/3.10/config-3.10.53
> >
> > How fast does the above test run for you?
> >
> > We've been running with the attached, which has helped a little, but it's
> > still quite slow in our particular use case (vsftpd), and with the above
> > test. Should I enable RCU_TRACE or STALL_INFO with a low timeout or
> > something?
> 
> I just want to add a little bit more analysis to this.
> 
> What we desire to be fast is copy_net_ns; cleanup_net is batched and
> asynchronous, and nobody really cares how long it takes, except that
> cleanup_net holds net_mutex and thus blocks copy_net_ns.
> 
> The puzzle is which rcu delays Simon is seeing in the network namespace
> cleanup path, and why: it seems the synchronize_rcu is not the only one,
> and in the case of vsftpd, with trivial network namespaces where nothing
> has been done, we should not need to delay at all.

Indeed, given the version and .config, I can't see why any individual
RCU grace-period operation would be particularly slow.

I suggest using ftrace on synchronize_rcu() and friends.

							Thanx, Paul

> Eric
> 
> 
> > Simon-
> >
> > -- >8 --
> > Subject: [PATCH] netns: use synchronize_rcu_expedited instead of
> >  synchronize_rcu
> >
> > Similar to ef323088, with synchronize_rcu(), we are only able to create
> > and destroy about 4 or 7 net namespaces per second, which really puts a
> > dent in the performance of programs attempting to use CLONE_NEWNET for
> > privilege separation (vsftpd, chromium).
> > ---
> >  net/core/net_namespace.c |    2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> > index 85b6269..6dcb4b3 100644
> > --- a/net/core/net_namespace.c
> > +++ b/net/core/net_namespace.c
> > @@ -296,7 +296,7 @@ static void cleanup_net(struct work_struct *work)
> >  	 * This needs to be before calling the exit() notifiers, so
> >  	 * the rcu_barrier() below isn't sufficient alone.
> >  	 */
> > -	synchronize_rcu();
> > +	synchronize_rcu_expedited();
> >  
> >  	/* Run all of the network namespace exit methods */
> >  	list_for_each_entry_reverse(ops, &pernet_list, list)
> 

Simon Kirby Aug. 29, 2014, 12:40 a.m. UTC | #3
On Thu, Aug 28, 2014 at 01:46:58PM -0700, Paul E. McKenney wrote:

> On Thu, Aug 28, 2014 at 03:33:42PM -0500, Eric W. Biederman wrote:
> 
> > I just want to add a little bit more analysis to this.
> > 
> > What we desire to be fast is copy_net_ns; cleanup_net is batched and
> > asynchronous, and nobody really cares how long it takes, except that
> > cleanup_net holds net_mutex and thus blocks copy_net_ns.
> > 
> > The puzzle is which rcu delays Simon is seeing in the network namespace
> > cleanup path, and why: it seems the synchronize_rcu is not the only one,
> > and in the case of vsftpd, with trivial network namespaces where nothing
> > has been done, we should not need to delay at all.
> 
> Indeed, given the version and .config, I can't see why any individual
> RCU grace-period operation would be particularly slow.
> 
> I suggest using ftrace on synchronize_rcu() and friends.

I made a parallel net namespace create/destroy benchmark that prints the
progress and time to create and cleanup 32 unshare()d child processes:

http://0x.ca/sim/ref/tools/netnsbench.c
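
The core of it is just the kind of loop below (a stripped-down,
hypothetical rewrite, not the actual netnsbench.c; the real tool
interleaves the '+'/'-' progress markers as children report, while this
sketch just prints them at fork/reap time):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NCHILD 32

int main(void)
{
	struct timespec t0, t1;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NCHILD; i++) {
		pid_t pid = fork();
		if (pid < 0) {
			perror("fork");
			exit(1);
		}
		if (pid == 0) {
			/* needs CAP_SYS_ADMIN; the netns dies when the child exits */
			if (unshare(CLONE_NEWNET))
				perror("unshare");
			_exit(0);
		}
		putchar('+');
		fflush(stdout);
	}
	for (i = 0; i < NCHILD; i++) {
		wait(NULL);
		putchar('-');
		fflush(stdout);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf(" %.6fs\n", (t1.tv_sec - t0.tv_sec) +
			   (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}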

I noticed that if I haven't run it for a while, the first batch often is
fast, followed by slowness from then on:

++++++++++++++++++++++++++++++++-------------------------------- 0.039478s
++++++++++++++++++++-----+----------------+++++++++---------++-- 4.463837s
+++++++++++++++++++++++++------+--------------------++++++------ 3.011882s
+++++++++++++++---+-------------++++++++++++++++---------------- 2.283993s

Fiddling around on a stock kernel, "echo 1 > /sys/kernel/rcu_expedited"
makes behaviour change as it did with my patch:

++-++-+++-+-----+-+-++-+-++--++-+--+-+-++--++-+-+-+-++-+--++---- 0.801406s
+-+-+-++-+-+-+-+-++--+-+-++-+--++-+-+-+-+-+-+-+-+-+-+-+--++-+--- 0.872011s
++--+-++--+-++--+-++--+-+-+-+-++-+--++--+-++-+-+-+-+--++-+-+-+-- 0.946745s

How would I use ftrace on synchronize_rcu() here?

As Eric said, cleanup_net() is batched, but while it is cleaning up,
net_mutex is held. Isn't the issue just that net_mutex is held while
some other things are going on that are meant to be lazy / batched?

What is net_mutex protecting in cleanup_net()?

I noticed that [kworker/u16:0]'s stack is often:

[<ffffffff810942a6>] wait_rcu_gp+0x46/0x50
[<ffffffff8109607e>] synchronize_sched+0x2e/0x50
[<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat]
[<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60
[<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
[<ffffffff81062997>] process_one_work+0x157/0x440
[<ffffffff81063303>] worker_thread+0x63/0x520
[<ffffffff81068b96>] kthread+0xd6/0xf0
[<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

and

[<ffffffff81095364>] _rcu_barrier+0x154/0x1f0
[<ffffffff81095450>] rcu_barrier+0x10/0x20
[<ffffffff81102c2c>] kmem_cache_destroy+0x6c/0xb0
[<ffffffffa0089e97>] nf_conntrack_cleanup_net_list+0x167/0x1c0 [nf_conntrack]
[<ffffffffa008aab5>] nf_conntrack_pernet_exit+0x65/0x70 [nf_conntrack]
[<ffffffff81720353>] ops_exit_list.isra.4+0x53/0x60
[<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
[<ffffffff81062997>] process_one_work+0x157/0x440
[<ffffffff81063303>] worker_thread+0x63/0x520
[<ffffffff81068b96>] kthread+0xd6/0xf0
[<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

So I tried flushing iptables rules and rmmoding netfilter bits:

++++++++++++++++++++-+--------------------+++++++++++----------- 0.179940s
++++++++++++++--+-------------+++++++++++++++++----------------- 0.151988s
++++++++++++++++++++++++++++---+--------------------------+++--- 0.159967s
++++++++++++++++++++++----------------------++++++++++---------- 0.175964s

Expedited:

++-+--++-+-+-+-+-+-+--++-+-+-++-+-+-+--++-+-+-+-+-+-+-+-+-+-+--- 0.079988s
++-+-+-+-+-+-+-+-+-+-+-+--++-+--++-+--+-++-+-+--++-+-+-+-+-+-+-- 0.089347s
++++--+++--++--+-+++++++-+++++--------------++-+-+--++-+-+--++-- 0.081566s
+++++-+++-------++-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-+-+-+-+-+-+--- 0.089026s

So, much faster. It seems that just loading nf_conntrack_ipv4 (like by
running iptables -t nat -nvL) is enough to slow it way down. But it is
still capable of being fast, as above.
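
Roughly what the flush/unload step looked like (module names as on this
box; they may differ elsewhere):

iptables -F
iptables -t nat -F
iptables -X
rmmod iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack
./netnsbench		# fast, as in the first set of numbers above
iptables -t nat -nvL	# autoloads conntrack/nat again
./netnsbench		# slow again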

Simon-
Julian Anastasov Aug. 29, 2014, 3:57 a.m. UTC | #4
Hello,

On Thu, 28 Aug 2014, Simon Kirby wrote:

> I noticed that [kworker/u16:0]'s stack is often:
> 
> [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50
> [<ffffffff8109607e>] synchronize_sched+0x2e/0x50
> [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat]

	I guess the problem is in nf_nat_net_exit,
maybe other nf exit handlers too. pernet-exit handlers
should avoid synchronize_rcu and rcu_barrier.
An RCU callback and rcu_barrier in module-exit is the way
to go. cleanup_net includes rcu_barrier, so pernet-exit
does not need such calls.

> [<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60
> [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
> [<ffffffff81062997>] process_one_work+0x157/0x440
> [<ffffffff81063303>] worker_thread+0x63/0x520
> [<ffffffff81068b96>] kthread+0xd6/0xf0
> [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff

Regards

--
Julian Anastasov <ja@ssi.bg>
Eric W. Biederman Aug. 29, 2014, 9:57 p.m. UTC | #5
Julian Anastasov <ja@ssi.bg> writes:

> 	Hello,
>
> On Thu, 28 Aug 2014, Simon Kirby wrote:
>
>> I noticed that [kworker/u16:0]'s stack is often:
>> 
>> [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50
>> [<ffffffff8109607e>] synchronize_sched+0x2e/0x50
>> [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat]
>
> 	I guess the problem is in nf_nat_net_exit,
> maybe other nf exit handlers too. pernet-exit handlers
> should avoid synchronize_rcu and rcu_barrier.
> An RCU callback and rcu_barrier in module-exit is the way
> to go. cleanup_net includes rcu_barrier, so pernet-exit
> does not need such calls.

In principle I agree, however in this particular case it looks a bit
tricky because there is a separate hash table to track nat state per
network namespace.

At the same time all of the packets should be drained before we get to
nf_nat_net_exit, so it doesn't look like the synchronize_rcu in
nf_nat_exit is actually protecting anything.

Further, calling an rcu delay function in net_exit methods largely
destroys the batched cleanup of network namespaces, so it is very
unpleasant.

Could someone who knows nf_nat_core.c better than I do look and
see if we can just remove the synchronize_rcu in nf_nat_exit?

>> [<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60
>> [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
>> [<ffffffff81062997>] process_one_work+0x157/0x440
>> [<ffffffff81063303>] worker_thread+0x63/0x520
>> [<ffffffff81068b96>] kthread+0xd6/0xf0
>> [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
>> [<ffffffffffffffff>] 0xffffffffffffffff

Eric
Paul E. McKenney Aug. 30, 2014, 2:52 a.m. UTC | #6
On Thu, Aug 28, 2014 at 05:40:29PM -0700, Simon Kirby wrote:
> On Thu, Aug 28, 2014 at 01:46:58PM -0700, Paul E. McKenney wrote:
> 
> > On Thu, Aug 28, 2014 at 03:33:42PM -0500, Eric W. Biederman wrote:
> > 
> > > I just want to add a little bit more analysis to this.
> > > 
> > > What we desire to be fast is copy_net_ns; cleanup_net is batched and
> > > asynchronous, and nobody really cares how long it takes, except that
> > > cleanup_net holds net_mutex and thus blocks copy_net_ns.
> > > 
> > > The puzzle is which rcu delays Simon is seeing in the network namespace
> > > cleanup path, and why: it seems the synchronize_rcu is not the only one,
> > > and in the case of vsftpd, with trivial network namespaces where nothing
> > > has been done, we should not need to delay at all.
> > 
> > Indeed, given the version and .config, I can't see why any individual
> > RCU grace-period operation would be particularly slow.
> > 
> > I suggest using ftrace on synchronize_rcu() and friends.
> 
> I made a parallel net namespace create/destroy benchmark that prints the
> progress and time to create and cleanup 32 unshare()d child processes:
> 
> http://0x.ca/sim/ref/tools/netnsbench.c
> 
> I noticed that if I haven't run it for a while, the first batch often is
> fast, followed by slowness from then on:
> 
> ++++++++++++++++++++++++++++++++-------------------------------- 0.039478s
> ++++++++++++++++++++-----+----------------+++++++++---------++-- 4.463837s
> +++++++++++++++++++++++++------+--------------------++++++------ 3.011882s
> +++++++++++++++---+-------------++++++++++++++++---------------- 2.283993s
> 
> Fiddling around on a stock kernel, "echo 1 > /sys/kernel/rcu_expedited"
> makes behaviour change as it did with my patch:
> 
> ++-++-+++-+-----+-+-++-+-++--++-+--+-+-++--++-+-+-+-++-+--++---- 0.801406s
> +-+-+-++-+-+-+-+-++--+-+-++-+--++-+-+-+-+-+-+-+-+-+-+-+--++-+--- 0.872011s
> ++--+-++--+-++--+-++--+-+-+-+-++-+--++--+-++-+-+-+-+--++-+-+-+-- 0.946745s
> 
> How would I use ftrace on synchronize_rcu() here?

http://lwn.net/Articles/370423/ is your friend here.  If your kernel
is built with the needed configuration, you give the command
"echo synchronize_rcu > set_ftrace_filter"

http://lwn.net/Articles/365835/ and http://lwn.net/Articles/366796/
have background info.
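
A typical session (a sketch, assuming FUNCTION_TRACER is configured and
debugfs is mounted at /sys/kernel/debug) would be something like:

cd /sys/kernel/debug/tracing
echo 0 > tracing_on
echo synchronize_rcu synchronize_sched > set_ftrace_filter
echo function > current_tracer
echo 1 > tracing_on
# ... run the netns loop/benchmark ...
echo 0 > tracing_on
cat trace

The function_graph tracer (using set_graph_function rather than
set_ftrace_filter) also shows how long each traced call took.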

> As Eric said, cleanup_net() is batched, but while it is cleaning up,
> net_mutex is held. Isn't the issue just that net_mutex is held while
> some other things are going on that are meant to be lazy / batched?
> 
> What is net_mutex protecting in cleanup_net()?
> 
> I noticed that [kworker/u16:0]'s stack is often:
> 
> [<ffffffff810942a6>] wait_rcu_gp+0x46/0x50
> [<ffffffff8109607e>] synchronize_sched+0x2e/0x50
> [<ffffffffa00385ac>] nf_nat_net_exit+0x2c/0x50 [nf_nat]
> [<ffffffff81720339>] ops_exit_list.isra.4+0x39/0x60
> [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
> [<ffffffff81062997>] process_one_work+0x157/0x440
> [<ffffffff81063303>] worker_thread+0x63/0x520
> [<ffffffff81068b96>] kthread+0xd6/0xf0
> [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> and
> 
> [<ffffffff81095364>] _rcu_barrier+0x154/0x1f0
> [<ffffffff81095450>] rcu_barrier+0x10/0x20
> [<ffffffff81102c2c>] kmem_cache_destroy+0x6c/0xb0
> [<ffffffffa0089e97>] nf_conntrack_cleanup_net_list+0x167/0x1c0 [nf_conntrack]
> [<ffffffffa008aab5>] nf_conntrack_pernet_exit+0x65/0x70 [nf_conntrack]
> [<ffffffff81720353>] ops_exit_list.isra.4+0x53/0x60
> [<ffffffff817209e0>] cleanup_net+0xf0/0x1a0
> [<ffffffff81062997>] process_one_work+0x157/0x440
> [<ffffffff81063303>] worker_thread+0x63/0x520
> [<ffffffff81068b96>] kthread+0xd6/0xf0
> [<ffffffff818d412c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> So I tried flushing iptables rules and rmmoding netfilter bits:
> 
> ++++++++++++++++++++-+--------------------+++++++++++----------- 0.179940s
> ++++++++++++++--+-------------+++++++++++++++++----------------- 0.151988s
> ++++++++++++++++++++++++++++---+--------------------------+++--- 0.159967s
> ++++++++++++++++++++++----------------------++++++++++---------- 0.175964s
> 
> Expedited:
> 
> ++-+--++-+-+-+-+-+-+--++-+-+-++-+-+-+--++-+-+-+-+-+-+-+-+-+-+--- 0.079988s
> ++-+-+-+-+-+-+-+-+-+-+-+--++-+--++-+--+-++-+-+--++-+-+-+-+-+-+-- 0.089347s
> ++++--+++--++--+-+++++++-+++++--------------++-+-+--++-+-+--++-- 0.081566s
> +++++-+++-------++-+-+-+-+-+-+-+-+-+-+-++-+-+-+-+-+-+-+-+-+-+--- 0.089026s
> 
> So, much faster. It seems that just loading nf_conntrack_ipv4 (like by
> running iptables -t nat -nvL) is enough to slow it way down. But it is
> still capable of being fast, as above.

My first guess is that this code sequence is calling synchronize_rcu()
quite often.  Would it be possible to consolidate these?

							Thanx, Paul

Julian Anastasov Aug. 30, 2014, 8:20 a.m. UTC | #7
Hello,

On Fri, 29 Aug 2014, Eric W. Biederman wrote:

> > 	I guess the problem is in nf_nat_net_exit,
> > maybe other nf exit handlers too. pernet-exit handlers
> > should avoid synchronize_rcu and rcu_barrier.
> > An RCU callback and rcu_barrier in module-exit is the way
> > to go. cleanup_net includes rcu_barrier, so pernet-exit
> > does not need such calls.
> 
> In principle I agree, however in this particular case it looks a bit
> tricky because there is a separate hash table to track nat state per
> network namespace.

	It is still possible for the module's pernet-init handler to
attach in net->ct... a special structure with all pointers that
should be freed by an RCU callback for the module, like the hash table.

For example:

struct netns_ct_nat_rcu_allocs {
	struct rcu_head		rcu_head;
	struct hlist_head	*nat_bysource;
	unsigned int		nat_htable_size;
};

- pernet-init:
	- allocate structure, attach it to net->ct.nat_rcu_allocs
	- the original nat_bysource place remains because
	we want to avoid net->ct.nat_rcu_allocs dereference.

- pernet-exit:
	- copy nat_bysource and nat_htable_size to net->ct.nat_rcu_allocs,
	this can be done even in above pernet-init function
	- call_rcu(&net->ct.nat_rcu_allocs->rcu_head, nat_rcu_free);

- cleanup_net:
	- rcu_barrier()

- RCU callback (nat_rcu_free):
	- call nf_ct_free_hashtable
	- kfree the structure

- cleanup_net:
	- drop netns after rcu_barrier

	Due to the rcu_barrier in cleanup_net it is even
possible to provide a per-module rcu_head instead of using an
allocated structure, for example:

	call_rcu(&net->ct.nat_rcu_head, nat_rcu_free);

	Then the nat_rcu_free function will just call
nf_ct_free_hashtable before cleanup_net drops the netns struct.
In this case the memory price is just one rcu_head for every module
that uses an RCU callback.
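
	Concretely, a sketch of the per-module rcu_head variant
(hypothetical code, names are illustrative, and whatever per-conntrack
cleanup the real nf_nat_net_exit does first is omitted):

/* one new field in struct netns_ct, next to the existing nat_bysource: */
	struct rcu_head		nat_rcu_head;

static void nat_rcu_free(struct rcu_head *head)
{
	struct netns_ct *ct = container_of(head, struct netns_ct, nat_rcu_head);

	nf_ct_free_hashtable(ct->nat_bysource, ct->nat_htable_size);
}

static void __net_exit nf_nat_net_exit(struct net *net)
{
	/* no synchronize_rcu()/rcu_barrier() here: the rcu_barrier() in
	 * cleanup_net() guarantees nat_rcu_free has run before the
	 * netns structure is dropped */
	call_rcu(&net->ct.nat_rcu_head, nat_rcu_free);
}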

> At the same time all of the packets should be drained before we get to
> nf_nat_net_exit, so it doesn't look like the synchronize_rcu in
> nf_nat_exit is actually protecting anything.

	It is true for cleanup_net. I don't remember, can we
see packets while the particular module-exit calls
unregister_pernet_subsys()? Maybe yes.

Regards

--
Julian Anastasov <ja@ssi.bg>

Patch

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 85b6269..6dcb4b3 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -296,7 +296,7 @@  static void cleanup_net(struct work_struct *work)
 	 * This needs to be before calling the exit() notifiers, so
 	 * the rcu_barrier() below isn't sufficient alone.
 	 */
-	synchronize_rcu();
+	synchronize_rcu_expedited();
 
 	/* Run all of the network namespace exit methods */
 	list_for_each_entry_reverse(ops, &pernet_list, list)