
[PATCH/RFC] make unregister_netdev() delete more than 4 interfaces per second

Message ID 20091017221857.GG1925@kvack.org
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Benjamin LaHaise Oct. 17, 2009, 10:18 p.m. UTC
Hi folks,

Below is a patch that changes the interaction between netdev_wait_allrefs() 
and dev_put() to replace an msleep(250) with a wait_event() on the final 
dev_put().  This change reduces the time spent sleeping during an 
unregister_netdev(), causing the system to go from <1% CPU time to something 
more CPU bound (50+% in a test vm).  This increases the speed of a bulk 
unregister_netdev() from 4 interfaces per second to over 500 per second on 
my test system.  The requirement comes from handling thousands of L2TP 
sessions where a tunnel flap results in all interfaces being torn down at 
one time.

Note that there is still more work to be done in this area.

		-ben

Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Oct. 18, 2009, 4:26 a.m. UTC | #1
Benjamin LaHaise a écrit :
> Hi folks,
> 
> Below is a patch that changes the interaction between netdev_wait_allrefs() 
> and dev_put() to replace an msleep(250) with a wait_event() on the final 
> dev_put().  This change reduces the time spent sleeping during an 
> unregister_netdev(), causing the system to go from <1% CPU time to something 
> more CPU bound (50+% in a test vm).  This increases the speed of a bulk 
> unregister_netdev() from 4 interfaces per second to over 500 per second on 
> my test system.  The requirement comes from handling thousands of L2TP 
> sessions where a tunnel flap results in all interfaces being torn down at 
> one time.
> 
> Note that there is still more work to be done in this area.
> 
> 		-ben
>  

> +DECLARE_WAIT_QUEUE_HEAD(netdev_refcnt_wait);
> +
> +void dev_put(struct net_device *dev)
> +{
> +        if (atomic_dec_and_test(&dev->refcnt))
> +		wake_up(&netdev_refcnt_wait);
> +}
> +EXPORT_SYMBOL(dev_put);
> +


Unfortunately this slows down the fast path by an order of magnitude.

atomic_dec() is pretty cheap (and eventually could use a per_cpu thing,
now we have a new and sexy per_cpu allocator), but atomic_dec_and_test()
is not that cheap and, more importantly, forbids a per_cpu conversion.

Benjamin LaHaise Oct. 18, 2009, 4:13 p.m. UTC | #2
On Sun, Oct 18, 2009 at 06:26:22AM +0200, Eric Dumazet wrote:
> Unfortunately this slows down the fast path by an order of magnitude.
> 
> atomic_dec() is pretty cheap (and eventually could use a per_cpu thing,
> now we have a new and sexy per_cpu allocator), but atomic_dec_and_test()
> is not that cheap and, more importantly, forbids a per_cpu conversion.

dev_put() is not a fast path by any means.  atomic_dec_and_test() costs 
the same as atomic_dec() on any modern CPU -- the cost is in the cacheline 
bouncing and serialisation both require.  The case of the device count 
becoming 0 is quite rare -- any device with a route on it will never hit 
a reference count of 0.

		-ben
Eric Dumazet Oct. 18, 2009, 5:51 p.m. UTC | #3
Benjamin LaHaise a écrit :
> On Sun, Oct 18, 2009 at 06:26:22AM +0200, Eric Dumazet wrote:
>> Unfortunately this slows down the fast path by an order of magnitude.
>>
>> atomic_dec() is pretty cheap (and eventually could use a per_cpu thing,
>> now we have a new and sexy per_cpu allocator), but atomic_dec_and_test()
>> is not that cheap and, more importantly, forbids a per_cpu conversion.
> 
> dev_put() is not a fast path by any means.  atomic_dec_and_test() costs 
> the same as atomic_dec() on any modern CPU -- the cost is in the cacheline 
> bouncing and serialisation both require.  The case of the device count 
> becoming 0 is quite rare -- any device with a route on it will never hit 
> a reference count of 0.

You forgot af_packet sendmsg() users, and heavy routers where the route cache
is stressed or disabled. I know several of them; they even added mmap TX 
support to get better performance. They will be disappointed by your patch.

atomic_dec_and_test() is definitely more expensive, because of its strong
barrier semantics and the added test after the decrement.
Whether refcnt is close to zero or not has no impact, even on two-year-old CPUs.

Machines hardly ever had to dismantle a netdevice in a normal lifetime, so maybe
we were lazy with this insane msleep(250). It dates from old Linux times,
when CPUs were soooo slow and programmers soooo lazy :)

The msleep(250) should be tuned first. Then, if it is really necessary
to dismantle 100,000 netdevices per second, we might have to think a bit more.

Just try msleep(1) or msleep(2); it should work quite well.
Benjamin LaHaise Oct. 18, 2009, 6:21 p.m. UTC | #4
On Sun, Oct 18, 2009 at 07:51:56PM +0200, Eric Dumazet wrote:
> You forgot af_packet sendmsg() users, and heavy routers where route cache
> is stressed or disabled. I know several of them, they even added mmap TX 
> support to get better performance. They will be disapointed by your patch.

If that's a problem, the cacheline overhead is a more serious issue.  
AF_PACKET should really keep the reference on the device between syscalls.  
Do you have a benchmark in mind that would show the overhead?

> atomic_dec_and_test() is definitely more expensive, because of its strong
> barrier semantics and the added test after the decrement.
> Whether refcnt is close to zero or not has no impact, even on two-year-old CPUs.

At least on x86, the atomic_dec_and_test() cost is pretty much identical to 
atomic_dec().  If this really is a performance bottleneck, people should be 
complaining about the cache miss and lock overhead, which dwarf the difference 
between atomic_dec_and_test() and atomic_dec().  Granted, I'm not saying 
that it isn't an issue on other architectures, but on x86 the lock prefix 
is what's expensive, not checking the flags after the operation.

If your complaint is about uninlining dev_put(), I'm indifferent to keeping 
it inline or out of line and can change the patch to suit.

> Machines hardly ever had to dismantle a netdevice in a normal lifetime, so maybe
> we were lazy with this insane msleep(250). It dates from old Linux times,
> when CPUs were soooo slow and programmers soooo lazy :)

It's only now that machines can actually route one or more 10Gbps links 
that it really becomes an issue.  I've been hacking around it for some 
time, but fixing it properly is starting to be a real requirement.

> The msleep(250) should be tuned first. Then, if it is really necessary
> to dismantle 100,000 netdevices per second, we might have to think a bit more.
> 
> Just try msleep(1) or msleep(2); it should work quite well.

My goal is tearing down 100,000 interfaces in a few seconds, which really is 
necessary.  Right now we're running about 40,000 interfaces on a not yet 
saturated 10Gbps link.  Going to dual 10Gbps links means pushing more than 
100,000 subscriber interfaces, and it looks like a modern dual socket system 
can handle that.

A bigger concern is rtnl_lock().  It is a huge impediment to scaling up 
interface creation/deletion on multicore systems.  That's going to be a 
lot more invasive to fix, though.

		-ben
Eric Dumazet Oct. 18, 2009, 7:36 p.m. UTC | #5
Benjamin LaHaise a écrit :
> 
> My goal is tearing down 100,000 interfaces in a few seconds, which really is 
> necessary.  Right now we're running about 40,000 interfaces on a not yet 
> saturated 10Gbps link.  Going to dual 10Gbps links means pushing more than 
> 100,000 subscriber interfaces, and it looks like a modern dual socket system 
> can handle that.
> 
> A bigger concern is rtnl_lock().  It is a huge impediment to scaling up 
> interface creation/deletion on multicore systems.  That's going to be a 
> lot more invasive to fix, though.

Don't forget synchronize_net() too (two calls per rollback_registered()).

You need something to dismantle XXXXX interfaces at once, instead of
serializing them one by one, because in three years you'll want to
dismantle 1,000,000 interfaces in one second.

Maybe defining an asynchronous unregister_netdev() function...



Octavian Purdila Oct. 21, 2009, 12:39 p.m. UTC | #6
On Sunday 18 October 2009 21:21:44 you wrote:
> > The msleep(250) should be tuned first. Then, if it is really necessary
> > to dismantle 100,000 netdevices per second, we might have to think a bit
> > more.
> > Just try msleep(1) or msleep(2); it should work quite well.
> 
> My goal is tearing down 100,000 interfaces in a few seconds, which really
> is necessary.  Right now we're running about 40,000 interfaces on a not
> yet saturated 10Gbps link.  Going to dual 10Gbps links means pushing more
> than 100,000 subscriber interfaces, and it looks like a modern dual socket
> system can handle that.
> 

I would also like to see this patch go in; we are running into scalability
issues with creating and deleting lots of interfaces as well.

Thanks,
tavi

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 812a5f3..e20d4a4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1514,10 +1514,7 @@  extern void netdev_run_todo(void);
  *
  * Release reference to device to allow it to be freed.
  */
-static inline void dev_put(struct net_device *dev)
-{
-	atomic_dec(&dev->refcnt);
-}
+void dev_put(struct net_device *dev);
 
 /**
  *	dev_hold - get reference to device
diff --git a/net/core/dev.c b/net/core/dev.c
index b8f74cf..155217f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4945,6 +4945,16 @@  out:
 }
 EXPORT_SYMBOL(register_netdev);
 
+DECLARE_WAIT_QUEUE_HEAD(netdev_refcnt_wait);
+
+void dev_put(struct net_device *dev)
+{
+        if (atomic_dec_and_test(&dev->refcnt))
+		wake_up(&netdev_refcnt_wait);
+}
+EXPORT_SYMBOL(dev_put);
+
+
 /*
  * netdev_wait_allrefs - wait until all references are gone.
  *
@@ -4984,7 +4994,8 @@  static void netdev_wait_allrefs(struct net_device *dev)
 			rebroadcast_time = jiffies;
 		}
 
-		msleep(250);
+		wait_event_timeout(netdev_refcnt_wait,
+				   !atomic_read(&dev->refcnt), HZ/4);
 
 		if (time_after(jiffies, warning_time + 10 * HZ)) {
 			printk(KERN_EMERG "unregister_netdevice: "