Message ID: 1430988770-28907-2-git-send-email-ying.xue@windriver.com
State: RFC, archived
Delegated to: David Miller
On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>
> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>  {
>  	/* Cleanup the network namespace in process context */
>  	unsigned long flags;
> +	bool added = false;
>
>  	spin_lock_irqsave(&cleanup_list_lock, flags);
> -	list_add(&net->cleanup_list, &cleanup_list);
> +	if (list_empty(&net->cleanup_list)) {
> +		list_add(&net->cleanup_list, &cleanup_list);
> +		added = true;
> +	}

Stop right there. If a ref count is hitting zero twice you've
got big problems. Because it means that after hitting zero the
first time it'll become positive again (if only briefly), which
opens you up to a race against the cleanup.

So this idea is a non-starter.

The rules for ref counts are really simple: if you hit zero then
you're dead. Can someone explain to me in simple terms why this
ref count is special and needs to hit zero twice?

Cheers,
On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>
>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>  {
>>  	/* Cleanup the network namespace in process context */
>>  	unsigned long flags;
>> +	bool added = false;
>>
>>  	spin_lock_irqsave(&cleanup_list_lock, flags);
>> -	list_add(&net->cleanup_list, &cleanup_list);
>> +	if (list_empty(&net->cleanup_list)) {
>> +		list_add(&net->cleanup_list, &cleanup_list);
>> +		added = true;
>> +	}
>
> Stop right there. If a ref count is hitting zero twice you've
> got big problems. Because it means that after hitting zero the
> first time it'll become positive again (if only briefly) which
> opens you up to a race against the cleanup.

That is true.

> So this idea is a non-starter.
>
> The rules for ref counts are really simple, if you hit zero then
> you're dead. Can someone explain to me in simple terms why this
> ref count is special and needs to hit zero twice?

Because after it hits zero, it queues the final cleanup to a workqueue,
which could be delayed (depending on when the worker is scheduled), so
there is a small window in which a new ->init() could come in.

You can argue that netns->init() should not see a dead netns, but
unfortunately that seems even harder, since removing a netns from the
global netns list requires the rtnl lock, which is not doable in
__put_net().
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Cong Wang <cwang@twopensource.com> writes:

> On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>>
>>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>>  {
>>>  	/* Cleanup the network namespace in process context */
>>>  	unsigned long flags;
>>> +	bool added = false;
>>>
>>>  	spin_lock_irqsave(&cleanup_list_lock, flags);
>>> -	list_add(&net->cleanup_list, &cleanup_list);
>>> +	if (list_empty(&net->cleanup_list)) {
>>> +		list_add(&net->cleanup_list, &cleanup_list);
>>> +		added = true;
>>> +	}
>>
>> Stop right there. If a ref count is hitting zero twice you've
>> got big problems. Because it means that after hitting zero the
>> first time it'll become positive again (if only briefly) which
>> opens you up to a race against the cleanup.
>
> That is true.
>
>> So this idea is a non-starter.
>>
>> The rules for ref counts are really simple, if you hit zero then
>> you're dead. Can someone explain to me in simple terms why this
>> ref count is special and needs to hit zero twice?
>
> Because after it hits zero, it queues the final cleanup to a workqueue,
> which could be delayed (depends on when the worker will be scheduled),
> therefore there is a small window that new ->init() could come in.
>
> You can argue that netns->init() should not see a dead netns, but
> unfortunately that seems even harder since removing netns from global
> netns list requires rtnl lock which is not doable in __put_net().

See my reply upthread where I have already written the code to ensure
netns->init() does not see a dead netns. It is a very reasonable
expectation and not all that hard to ensure.

Eric
Cong Wang <cwang@twopensource.com> writes:

> On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>>
>>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>>  {
>>>  	/* Cleanup the network namespace in process context */
>>>  	unsigned long flags;
>>> +	bool added = false;
>>>
>>>  	spin_lock_irqsave(&cleanup_list_lock, flags);
>>> -	list_add(&net->cleanup_list, &cleanup_list);
>>> +	if (list_empty(&net->cleanup_list)) {
>>> +		list_add(&net->cleanup_list, &cleanup_list);
>>> +		added = true;
>>> +	}
>>
>> Stop right there. If a ref count is hitting zero twice you've
>> got big problems. Because it means that after hitting zero the
>> first time it'll become positive again (if only briefly) which
>> opens you up to a race against the cleanup.
>
> That is true.
>
>> So this idea is a non-starter.
>>
>> The rules for ref counts are really simple, if you hit zero then
>> you're dead. Can someone explain to me in simple terms why this
>> ref count is special and needs to hit zero twice?
>
> Because after it hits zero, it queues the final cleanup to a workqueue,
> which could be delayed (depends on when the worker will be scheduled),
> therefore there is a small window that new ->init() could come in.

The bad case in all of this is if we hit that window and increment the
count. The namespace could be removed from the lists and the ->exit()
methods could run. Then the reference count could drop to zero again.

Having a network namespace alive with its cleanup methods having already
run is too nasty to think about. We really don't want to do that.

Eric
Thank you all for the review, comments, and suggestions! Please see my
inline responses to your inputs.

On 05/08/2015 01:19 AM, Cong Wang wrote:
> On Thu, May 7, 2015 at 2:04 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Thu, May 07, 2015 at 04:52:40PM +0800, Ying Xue wrote:
>>>
>>> @@ -409,12 +410,17 @@ void __put_net(struct net *net)
>>>  {
>>>  	/* Cleanup the network namespace in process context */
>>>  	unsigned long flags;
>>> +	bool added = false;
>>>
>>>  	spin_lock_irqsave(&cleanup_list_lock, flags);
>>> -	list_add(&net->cleanup_list, &cleanup_list);
>>> +	if (list_empty(&net->cleanup_list)) {
>>> +		list_add(&net->cleanup_list, &cleanup_list);
>>> +		added = true;
>>> +	}
>>
>> Stop right there. If a ref count is hitting zero twice you've
>> got big problems. Because it means that after hitting zero the
>> first time it'll become positive again (if only briefly) which
>> opens you up to a race against the cleanup.
>
> That is true.
>
>> So this idea is a non-starter.
>>
>> The rules for ref counts are really simple, if you hit zero then
>> you're dead. Can someone explain to me in simple terms why this
>> ref count is special and needs to hit zero twice?
>
> Because after it hits zero, it queues the final cleanup to a workqueue,
> which could be delayed (depends on when the worker will be scheduled),
> therefore there is a small window that new ->init() could come in.
>
> You can argue that netns->init() should not see a dead netns, but
> unfortunately that seems even harder since removing netns from global
> netns list requires rtnl lock which is not doable in __put_net().

Cong's understanding of the race between put_net() and kernel socket
creation is right. But in my opinion the patch is still able to cover all
of the race conditions. You may ask why we can take a reference on a dead
net again at all. As the race scenario is a bit complex, please let me
spend a little more time explaining why it is safe for us.
To make the case clear, let us consider the tipc module as an example:

    // Thread shutting down a namespace:
    put_net()
      if (atomic_dec_and_test(&net->refcnt)) /* true */
        /* the net's refcount is zero now */
        __put_net(net);
          queue_work(netns_wq, &net_cleanup_work);

    // Worker thread:
    cleanup_net()
      mutex_lock(&net_mutex);
      rtnl_lock();
      list_del_rcu(&net->list); /* delete the net from net_namespace_list */
      rtnl_unlock();
      list_for_each_entry_reverse(ops, &pernet_list, list)
        ops_exit_list()
          ops->exit(net)
      mutex_unlock(&net_mutex);
      net_free(net);

    == re-schedule ==

    // Thread inserting the tipc module:
    tipc_init()
      register_pernet_subsys(&tipc_net_ops);
        mutex_lock(&net_mutex);
        __register_pernet_operations(&tipc_net_ops)
          for_each_net(net) /* iterate net_namespace_list */
            ops_init(ops, net_dead)
              tipc_init_net(net_dead)
                tipc_topsrv_start(net_dead)
                  tipc_create_listen_sock(net_dead)
                    sock_create_kern(net_dead)
                      sk_alloc();
                        get_net(net_dead); /* refcnt = 1 */
                    put_net(net_dead); /* refcnt = 0 */
                      __put_net(net_dead);
                        if (list_empty(&net->cleanup_list)) /* false */
                          queue_work(netns_wq, &net_cleanup_work);
                        /*
                         * As net_dead is still linked on cleanup_list, we
                         * do not queue the cleanup work again. cleanup_net()
                         * is not called a second time, which means the risk
                         * of a double free of the net is avoided.
                         */

The reason the original risk occurs is that destroying a net has to be
done in a worker thread. After a net's refcount is decreased to 0, and
before the destruction of the net is finished in cleanup_net(), there
remains a small race window. In the above case, during that window the
tipc module may be inserted from another thread, in which a TIPC kernel
socket is created. During the socket creation, the dead net's refcount is
increased from 0 to 1 in sk_alloc(), and then decreased back to 0 in
sk_change_net(). At that moment, the net's refcount hits zero a second
time in put_net(), so __put_net() is called again, which means
cleanup_net() would also be called a second time. So the net would be
released twice!
However, as __register_pernet_operations() is protected by net_mutex, and
cleanup_net() also needs to grab net_mutex before it destroys a net, the
dead net remains temporarily valid during the registration. Therefore,
during this time, we can safely operate on a net that is about to die,
like what we are doing in tipc_init_net(): for instance, take its
refcount in sk_alloc() and subsequently drop it after the socket
creation. But once the registration is over, the dead net will be freed
when the cleanup worker is scheduled.

Actually, there is another possible method to avoid the namespace change:
when we create a kernel socket, do not increase its net's refcount at all
in sk_alloc(), and correspondingly do not decrease the net refcount for a
kernel socket in __sk_free(). However, that requires changing a lot of
code, and the solution is also ugly, so it is not feasible.

Of course, if you find another race with this solution, please let me
know.

Thanks,
Ying
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 78fc04a..058508f 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -242,6 +242,7 @@ static __net_init int setup_net(struct net *net, struct user_namespace *user_ns)
 	net->dev_base_seq = 1;
 	net->user_ns = user_ns;
 	idr_init(&net->netns_ids);
+	INIT_LIST_HEAD(&net->cleanup_list);
 
 	list_for_each_entry(ops, &pernet_list, list) {
 		error = ops_init(ops, net);
@@ -409,12 +410,17 @@ void __put_net(struct net *net)
 {
 	/* Cleanup the network namespace in process context */
 	unsigned long flags;
+	bool added = false;
 
 	spin_lock_irqsave(&cleanup_list_lock, flags);
-	list_add(&net->cleanup_list, &cleanup_list);
+	if (list_empty(&net->cleanup_list)) {
+		list_add(&net->cleanup_list, &cleanup_list);
+		added = true;
+	}
 	spin_unlock_irqrestore(&cleanup_list_lock, flags);
 
-	queue_work(netns_wq, &net_cleanup_work);
+	if (added)
+		queue_work(netns_wq, &net_cleanup_work);
 }
 EXPORT_SYMBOL_GPL(__put_net);
Commit 23fe18669e7f ("[NETNS]: Fix race between put_net() and
netlink_kernel_create().") attempts to fix the following race scenario:

    put_net()
      if (atomic_dec_and_test(&net->refcnt)) /* true */
        __put_net(net);
          queue_work(...);
    /*
     * note: the net now has refcnt 0, but is still on
     * the global list of net namespaces
     */

    == re-schedule ==

    register_pernet_subsys(&some_ops);
      register_pernet_operations(&some_ops);
        (*some_ops)->init(net);
          /*
           * we call netlink_kernel_create() here
           * in some places
           */
          netlink_kernel_create();
            sk_alloc();
              get_net(net); /* refcnt = 1 */
          /*
           * now we drop the net refcount not to
           * block the net namespace exit in the
           * future (or this can be done on the
           * error path)
           */
          put_net(sk->sk_net);
            if (atomic_dec_and_test(&...))
              /*
               * true. BOOOM! The net is
               * scheduled for release twice
               */

To prevent the race, that commit adopted the following solution: create
the netlink socket inside the init_net namespace and then re-attach it to
the desired one right after the socket is created; similarly, when
closing the socket, first move its namespace back to init_net so that the
socket is destroyed in the same context as the socket creation.

That proposal artificially makes the whole thing complex. There is a
simpler solution that avoids the risk of a double release of a net: if
__put_net() finds that a cleanup work is still pending, it does not queue
a new cleanup work. This solution is not only simple and easy to
understand, it also helps us avoid the unnecessary namespace changes for
kernel sockets that will be removed in future commits.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
---
 net/core/net_namespace.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)