[RFC,net-next,01/11] netns: Fix race between put_net() and netlink_kernel_create()

Message ID 1430988770-28907-2-git-send-email-ying.xue@windriver.com
State RFC, archived
Delegated to: David Miller

Commit Message

Ying Xue May 7, 2015, 8:52 a.m. UTC
Commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and
netlink_kernel_create().") attempts to fix the following race
scenario:

put_net()
  if (atomic_dec_and_test(&net->refcnt))
    /* true */
      __put_net(net);
        queue_work(...);

/*
 * note: the net now has refcnt 0, but still in
 * the global list of net namespaces
 */

== re-schedule ==

register_pernet_subsys(&some_ops);
  register_pernet_operations(&some_ops);
    (*some_ops)->init(net);
      /*
       * we call netlink_kernel_create() here
       * in some places
       */
      netlink_kernel_create();
         sk_alloc();
            get_net(net); /* refcnt = 1 */
         /*
          * now we drop the net refcount not to
          * block the net namespace exit in the
          * future (or this can be done on the
          * error path)
          */
         put_net(sk->sk_net);
             if (atomic_dec_and_test(&...))
                   /*
                    * true. BOOOM! The net is
                    * scheduled for release twice
                    */

To prevent the race, that commit adopted the following solution: create
the netlink socket inside the init_net namespace and then re-attach it
to the desired namespace right after the socket is created; similarly,
when closing the socket, first move it back to init_net so that the
socket is destroyed in the same context in which it was created.
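Concretely, that workaround is built on the sk_change_net()/
sk_release_kernel() pair, which at the time look roughly like this
(debug accounting and minor details trimmed):

	static inline void sk_change_net(struct sock *sk, struct net *net)
	{
		/* Drop the init_net reference taken in sk_alloc() and
		 * attach the target namespace without taking a reference
		 * on it, so the kernel socket does not block the
		 * namespace's exit.
		 */
		put_net(sock_net(sk));
		sock_net_set(sk, net);
	}

	void sk_release_kernel(struct sock *sk)
	{
		if (sk == NULL || sk->sk_socket == NULL)
			return;

		sock_hold(sk);
		sock_release(sk->sk_socket);
		/* Move back to init_net so the final sk_free()/put_net()
		 * runs against a namespace that can never be dying.
		 */
		sock_net_set(sk, get_net(&init_net));
		sock_put(sk);
	}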

Actually that approach makes the whole thing unnecessarily complex.
There is a simpler way to avoid the risk of a double release of the
net: if __put_net() finds that a cleanup work is already pending for
the net, it does not queue a new one. This solution is not only simple
and easy to understand, it also lets us drop the unnecessary namespace
change for kernel sockets, which the following commits in this series
do. The resulting __put_net() is shown below.
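With the change applied (see the full patch at the bottom of this
mail), __put_net() becomes:

	void __put_net(struct net *net)
	{
		/* Cleanup the network namespace in process context */
		unsigned long flags;
		bool added = false;

		spin_lock_irqsave(&cleanup_list_lock, flags);
		if (list_empty(&net->cleanup_list)) {
			list_add(&net->cleanup_list, &cleanup_list);
			added = true;
		}
		spin_unlock_irqrestore(&cleanup_list_lock, flags);

		/* Queue the cleanup work only on the first 0-transition;
		 * a later put_net() on an already-dying net is a no-op.
		 */
		if (added)
			queue_work(netns_wq, &net_cleanup_work);
	}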

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/core/net_namespace.c |   10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

Herbert Xu May 7, 2015, 9:04 a.m. UTC | #1
On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>
> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>  {
>  	/* Cleanup the network namespace in process context */
>  	unsigned long flags;
> +	bool added = false;
>  
>  	spin_lock_irqsave(&cleanup_list_lock, flags);
> -	list_add(&net->cleanup_list, &cleanup_list);
> +	if (list_empty(&net->cleanup_list)) {
> +		list_add(&net->cleanup_list, &cleanup_list);
> +		added = true;
> +	}

Stop right there.  If a ref count is hitting zero twice you've
got big problems.  Because it means that after hitting zero the
first time it'll become positive again (if only briefly) which
opens you up to a race against the cleanup.

So this idea is a non-starter.

The rules for ref counts are really simple, if you hit zero then
you're dead.  Can someone explain to me in simple terms why this
ref count is special and needs to hit zero twice?

Cheers,
Cong Wang May 7, 2015, 5:19 p.m. UTC | #2
On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>
>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>  {
>>       /* Cleanup the network namespace in process context */
>>       unsigned long flags;
>> +     bool added = false;
>>
>>       spin_lock_irqsave(&cleanup_list_lock, flags);
>> -     list_add(&net->cleanup_list, &cleanup_list);
>> +     if (list_empty(&net->cleanup_list)) {
>> +             list_add(&net->cleanup_list, &cleanup_list);
>> +             added = true;
>> +     }
>
> Stop right there.  If a ref count is hitting zero twice you've
> got big problems.  Because it means that after hitting zero the
> first time it'll become positive again (if only briefly) which
> opens you up to a race against the cleanup.

That is true.

>
> So this idea is a non-starter.
>
> The rules for ref counts are really simple, if you hit zero then
> you're dead.  Can someone explain to me in simple terms why this
> ref count is special and needs to hit zero twice?

Because after it hits zero, it queues the final cleanup to a workqueue,
which could be delayed (depends on when the worker will be scheduled),
therefore there is a small window that new ->init() could come in.

You can argue that netns->init() should not see a dead netns, but
unfortunately that seems even harder since removing netns from global
netns list requires rtnl lock which is not doable in __put_net().
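For reference, put_net() can run in atomic context (e.g. from
sk_free()), so __put_net() cannot take a sleeping lock such as the
rtnl lock; all it can do is defer to the workqueue. Roughly (from
include/net/net_namespace.h of that era):

	static inline void put_net(struct net *net)
	{
		if (atomic_dec_and_test(&net->count))
			__put_net(net); /* may be atomic context: cannot
					 * take rtnl_lock(), only queues
					 * cleanup_net() */
	}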
Eric W. Biederman May 7, 2015, 5:28 p.m. UTC | #3
Cong Wang <cwang@twopensource.com> writes:

> On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>>
>>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>>  {
>>>       /* Cleanup the network namespace in process context */
>>>       unsigned long flags;
>>> +     bool added = false;
>>>
>>>       spin_lock_irqsave(&cleanup_list_lock, flags);
>>> -     list_add(&net->cleanup_list, &cleanup_list);
>>> +     if (list_empty(&net->cleanup_list)) {
>>> +             list_add(&net->cleanup_list, &cleanup_list);
>>> +             added = true;
>>> +     }
>>
>> Stop right there.  If a ref count is hitting zero twice you've
>> got big problems.  Because it means that after hitting zero the
>> first time it'll become positive again (if only briefly) which
>> opens you up to a race against the cleanup.
>
> That is true.
>
>>
>> So this idea is a non-starter.
>>
>> The rules for ref counts are really simple, if you hit zero then
>> you're dead.  Can someone explain to me in simple terms why this
>> ref count is special and needs to hit zero twice?
>
> Because after it hits zero, it queues the final cleanup to a workqueue,
> which could be delayed (depends on when the worker will be scheduled),
> therefore there is a small window that new ->init() could come in.
>
> You can argue that netns->init() should not see a dead netns, but
> unfortunately that seems even harder since removing netns from global
> netns list requires rtnl lock which is not doable in __put_net().

See my reply upthread where I have already written the code to ensure
netns->init() does not see a dead netns.  It is a very reasonable
expectation and not all that hard to ensure.

Eric



Eric W. Biederman May 8, 2015, 11:20 a.m. UTC | #4
Cong Wang <cwang@twopensource.com> writes:

> On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>>
>>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>>  {
>>>       /* Cleanup the network namespace in process context */
>>>       unsigned long flags;
>>> +     bool added = false;
>>>
>>>       spin_lock_irqsave(&cleanup_list_lock, flags);
>>> -     list_add(&net->cleanup_list, &cleanup_list);
>>> +     if (list_empty(&net->cleanup_list)) {
>>> +             list_add(&net->cleanup_list, &cleanup_list);
>>> +             added = true;
>>> +     }
>>
>> Stop right there.  If a ref count is hitting zero twice you've
>> got big problems.  Because it means that after hitting zero the
>> first time it'll become positive again (if only briefly) which
>> opens you up to a race against the cleanup.
>
> That is true.
>
>>
>> So this idea is a non-starter.
>>
>> The rules for ref counts are really simple, if you hit zero then
>> you're dead.  Can someone explain to me in simple terms why this
>> ref count is special and needs to hit zero twice?
>
> Because after it hits zero, it queues the final cleanup to a workqueue,
> which could be delayed (depends on when the worker will be scheduled),
> therefore there is a small window that new ->init() could come in.

The bad case in all of this is if we hit that window and increment the
count.  The namespace could be removed from the lists, and ->exit() methods
could run.  Then the reference count could drop to zero again.

Having a network namespace alive with cleanup methods having already run
is too nasty to think about.  We really don't want to do that.

Eric

Ying Xue May 8, 2015, 11:20 a.m. UTC | #5
Thanks, everyone, for the review, comments, and suggestions!
Please see my inline responses below.

On 05/08/2015 01:19 AM, Cong Wang wrote:
> On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>>
>>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>>  {
>>>       /* Cleanup the network namespace in process context */
>>>       unsigned long flags;
>>> +     bool added = false;
>>>
>>>       spin_lock_irqsave(&cleanup_list_lock, flags);
>>> -     list_add(&net->cleanup_list, &cleanup_list);
>>> +     if (list_empty(&net->cleanup_list)) {
>>> +             list_add(&net->cleanup_list, &cleanup_list);
>>> +             added = true;
>>> +     }
>>
>> Stop right there.  If a ref count is hitting zero twice you've
>> got big problems.  Because it means that after hitting zero the
>> first time it'll become positive again (if only briefly) which
>> opens you up to a race against the cleanup.
> 
> That is true.
> 
>>
>> So this idea is a non-starter.
>>
>> The rules for ref counts are really simple, if you hit zero then
>> you're dead.  Can someone explain to me in simple terms why this
>> ref count is special and needs to hit zero twice?
> 
> Because after it hits zero, it queues the final cleanup to a workqueue,
> which could be delayed (depends on when the worker will be scheduled),
> therefore there is a small window that new ->init() could come in.
> 
> You can argue that netns->init() should not see a dead netns, but
> unfortunately that seems even harder since removing netns from global
> netns list requires rtnl lock which is not doable in __put_net().
> 
> 

Cong's understanding of the race between put_net() and kernel socket creation is right.

But in my opinion the patch is still able to cover all the race
conditions involved. You may ask why it is safe to take a reference on
a dead net again. As the race scenario is a bit complex, please let me
spend a little more time explaining why it is safe, taking the tipc
module as an example:

//Thread shutting down a namespace:
put_net()
  if (atomic_dec_and_test(&net->refcnt))
    /* true */ -----> the net's refcount is zero now
      __put_net(net);
        queue_work(netns_wq, &net_cleanup_work);

//Worker thread (queued above; it only runs these steps once it gets
//net_mutex, which the tipc thread below grabs first):
cleanup_net()
  mutex_lock(&net_mutex);
  rtnl_lock();
  list_del_rcu(&net->list); //delete a net from net_namespace_list
  rtnl_unlock();
  list_for_each_entry_reverse(ops, &pernet_list, list)
    ops_exit_list()
       ops->exit(net)
  mutex_unlock(&net_mutex);
  net_free(net);

== re-schedule ==
//Thread inserting the tipc module (it wins the race for net_mutex,
//while the dead net is still on net_namespace_list):
tipc_init()
 register_pernet_subsys(&tipc_net_ops);
 mutex_lock(&net_mutex);
  __register_pernet_operations(&tipc_net_ops)
   for_each_net(net) //iterate the net_namespace_list
    ops_init(ops, net_dead)
      tipc_init_net(net_dead)
        tipc_topsrv_start(net_dead)
          tipc_create_listen_sock(net_dead)
            sock_create_kern(net_dead)
              sk_alloc();
                get_net(net_dead); /* refcnt = 1 */
              put_net(net_dead)
               __put_net(net_dead); /* refcnt = 0 */
                if (list_empty(&net->cleanup_list))
                   queue_work(netns_wq, &net_cleanup_work);
           /*
            * As net_dead is already linked on the pending cleanup list,
            * the check fails and we don't queue the cleanup work again.
            * cleanup_net() is not called a second time, so the double
            * release of the net is avoided.
            */

The original risk exists because destroying a net has to be done in a
worker thread: after a net's refcount drops to 0, and before the
destruction is finished in cleanup_net(), there is a small race window.
In the case above, during that tiny window the tipc module may be
inserted from another thread, and a TIPC kernel socket is created along
the way. During the socket creation, the dead net's refcount is
increased from 0 to 1 in sk_alloc(), and then decreased back to 0 in
sk_change_net(). At that moment the net's refcount hits zero a second
time in put_net(), so __put_net() is called again, which without this
patch means cleanup_net() is also called a second time. So the net gets
released twice!

However, as __register_pernet_operations() is protected by net_mutex
and cleanup_net() must also grab net_mutex before it destroys a net,
the dead net remains temporarily valid during the registration.
Therefore, during this time we can safely operate on a dead net, as
tipc_init_net() does, for instance taking its refcount in sk_alloc()
and dropping it again after the socket creation. Once the registration
is over, the dead net is freed as soon as the cleanup worker is
scheduled.
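For reference, both paths serialize on net_mutex; condensed from
net/core/net_namespace.c of that time (unrelated details trimmed):

	int register_pernet_subsys(struct pernet_operations *ops)
	{
		int error;

		mutex_lock(&net_mutex);
		error = register_pernet_operations(first_device, ops);
		mutex_unlock(&net_mutex);
		return error;
	}

	static void cleanup_net(struct work_struct *work)
	{
		/* ... snapshot the pending nets ... */
		mutex_lock(&net_mutex);	/* waits while a registration is
					 * walking the dying net in its
					 * ->init() method */
		/* ... run ->exit() methods, then free the nets ... */
		mutex_unlock(&net_mutex);
	}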

Actually there is another possible way to avoid the namespace change:
when we create a kernel socket, don't increase its net's refcount in
sk_alloc() at all, and correspondingly don't decrease the net refcount
for a kernel socket in __sk_free(). However, that would require
changing a lot of code, and the result would be ugly, so it does not
look feasible. A rough sketch of the idea is below anyway.
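Just to illustrate (a hypothetical sketch only, not existing code; the
"kern" flag is an assumed way of marking kernel sockets):

	struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
			      struct proto *prot, int kern)
	{
		struct sock *sk;

		sk = sk_prot_alloc(prot, priority | __GFP_ZERO, family);
		if (sk) {
			/* ... existing initialization ... */

			/* A kernel socket would not pin its namespace at
			 * all, so no matching put_net() would be needed
			 * later in __sk_free().
			 */
			sock_net_set(sk, kern ? net : get_net(net));
		}
		return sk;
	}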

Of course, if you find other race with the solution, please let me know.

Thanks,
Ying


Patch

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 78fc04a..058508f 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -242,6 +242,7 @@  static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	net->dev_base_seq = 1;
 	net->user_ns = user_ns;
 	idr_init(&net->netns_ids);
+	INIT_LIST_HEAD(&net->cleanup_list);
 
 	list_for_each_entry(ops, &pernet_list, list) {
 		error = ops_init(ops, net);
@@ -409,12 +410,17 @@  void __put_net(struct net *net)
 {
 	/* Cleanup the network namespace in process context */
 	unsigned long flags;
+	bool added = false;
 
 	spin_lock_irqsave(&cleanup_list_lock, flags);
-	list_add(&net->cleanup_list, &cleanup_list);
+	if (list_empty(&net->cleanup_list)) {
+		list_add(&net->cleanup_list, &cleanup_list);
+		added = true;
+	}
 	spin_unlock_irqrestore(&cleanup_list_lock, flags);
 
-	queue_work(netns_wq, &net_cleanup_work);
+	if (added)
+		queue_work(netns_wq, &net_cleanup_work);
 }
 EXPORT_SYMBOL_GPL(__put_net);