Message ID | 49C380A6.4000904@cosmosbay.com |
---|---|
State | Rejected, archived |
Delegated to: | David Miller |
Headers | show |
On Fri, Mar 20, 2009 at 12:40:22PM +0100, Eric Dumazet wrote: > One point of contention in high network loads is the dst_release() performed > when a transmited skb is freed. This is because NIC tx completion calls skb free > long after original call to dev_queue_xmit(skb). > > CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is > quite visible if one CPU is 100% handling softirqs for a network device, > since dst_clone() is done by other cpus, involving cache line ping pongs. > > I believe we can release dst in dev_queue_xmit(), while cpu cache is hot, since > caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce > work to be done by softirq handler, and decrease cache misses. > > I also believe only pktgen can call dev_queue_xmit() with skb which have > a skb->users != 1. But pkthen skbs have a NULL dst entry. > > I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst > if skb->users != 1 > > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > > diff --git a/net/core/dev.c b/net/core/dev.c > index f112970..9e0fd01 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1852,6 +1852,20 @@ gso: > if (q->enqueue) { > spinlock_t *root_lock = qdisc_lock(q); > > + /* > + * Release dst while its refcnt is hot in CPU cache, instead > + * of waiting NIC tx completion > + */ > + if (likely(skb->dst)) { > + if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) { > + int newrefcnt; > + > + smp_mb__before_atomic_dec(); > + newrefcnt = atomic_dec_return(&skb->dst->__refcnt); > + WARN_ON(newrefcnt < 0); > + skb->dst = NULL; > + } > + } > spin_lock(root_lock); > > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) { > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > I think this seems like a pretty good idea. I thought for a moment that some stacked interfaces (bonds, vlan devices), might have a problem with this, since they tend to pass through dev_queue_xmit twice, but I can't see a problem with either one of those cases either, since neither of thier xmit routines makes any use of the dst pointer. I'd say include it Acked-by: Neil Horman <nhorman@tuxdriver.com> -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <dada1@cosmosbay.com> Date: Fri, 20 Mar 2009 12:40:22 +0100 > I believe we can release dst in dev_queue_xmit(), while cpu cache is > hot, since caller of dev_queue_xmit() had to hold a reference on dst > right before. This reduce work to be done by softirq handler, and > decrease cache misses. > This will break various packet schedulers and classifiers. Heck sch_sfq.c even uses skb->dst as part of it's flow hash function :-) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Miller a écrit : > From: Eric Dumazet <dada1@cosmosbay.com> > Date: Fri, 20 Mar 2009 12:40:22 +0100 > >> I believe we can release dst in dev_queue_xmit(), while cpu cache is >> hot, since caller of dev_queue_xmit() had to hold a reference on dst >> right before. This reduce work to be done by softirq handler, and >> decrease cache misses. >> > > This will break various packet schedulers and classifiers. > > Heck sch_sfq.c even uses skb->dst as part of it's flow hash > function :-) Well, as one of the hash perturbator, for other protocols than IPV4 & IPV6... default: h = (unsigned long)skb->dst ^ skb->protocol; h2 = (unsigned long)skb->sk; } return sfq_fold_hash(q, h, h2); But teql indeed mandates dst in __teql_resolve() Darn... This dst freeing should be done very late then, in the NIC driver itself, just before giving skb to hardware, or right before in dev_hard_start_xmit() If done in dev_hard_start_xmit(), skb could be requeued (because of NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being NULL is not a problem ? Thanks a lot David -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <dada1@cosmosbay.com> Date: Wed, 25 Mar 2009 08:13:30 +0100 > If done in dev_hard_start_xmit(), skb could be requeued (because of > NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being > NULL is not a problem ? Usually it should be OK because the packet schedulers have a sort-of one-behind state that allows them to reinsert the SKB into their queue datastructures without reclassifying. But I'm not %100 sure there isn't some case that might still need skb->dst there. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
David Miller wrote, On 03/25/2009 08:17 AM: > From: Eric Dumazet <dada1@cosmosbay.com> > Date: Wed, 25 Mar 2009 08:13:30 +0100 > >> If done in dev_hard_start_xmit(), skb could be requeued (because of >> NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being >> NULL is not a problem ? > > Usually it should be OK because the packet schedulers have > a sort-of one-behind state that allows them to reinsert > the SKB into their queue datastructures without reclassifying. Actually, since David has dumped requeuing there is no reinserting; this last one "requeued" skb is buffered at the top in q->gso_skb and waiting for better times. Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/core/dev.c b/net/core/dev.c index f112970..9e0fd01 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1852,6 +1852,20 @@ gso: if (q->enqueue) { spinlock_t *root_lock = qdisc_lock(q); + /* + * Release dst while its refcnt is hot in CPU cache, instead + * of waiting NIC tx completion + */ + if (likely(skb->dst)) { + if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) { + int newrefcnt; + + smp_mb__before_atomic_dec(); + newrefcnt = atomic_dec_return(&skb->dst->__refcnt); + WARN_ON(newrefcnt < 0); + skb->dst = NULL; + } + } spin_lock(root_lock); if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
One point of contention in high network loads is the dst_release() performed when a transmited skb is freed. This is because NIC tx completion calls skb free long after original call to dev_queue_xmit(skb). CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is quite visible if one CPU is 100% handling softirqs for a network device, since dst_clone() is done by other cpus, involving cache line ping pongs. I believe we can release dst in dev_queue_xmit(), while cpu cache is hot, since caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce work to be done by softirq handler, and decrease cache misses. I also believe only pktgen can call dev_queue_xmit() with skb which have a skb->users != 1. But pkthen skbs have a NULL dst entry. I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst if skb->users != 1 Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html