Message ID | 49CA7AD7.8040401@cosmosbay.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Wed, Mar 25, 2009 at 07:41:27PM +0100, Eric Dumazet wrote: > Jarek Poplawski a écrit : > > David Miller wrote, On 03/25/2009 08:17 AM: > > > >> From: Eric Dumazet <dada1@cosmosbay.com> > >> Date: Wed, 25 Mar 2009 08:13:30 +0100 > >> > >>> If done in dev_hard_start_xmit(), skb could be requeued (because of > >>> NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being > >>> NULL is not a problem ? > >> Usually it should be OK because the packet schedulers have > >> a sort-of one-behind state that allows them to reinsert > >> the SKB into their queue datastructures without reclassifying. > > > > > > Actually, since David has dumped requeuing there is no reinserting; > > this last one "requeued" skb is buffered at the top in q->gso_skb > > and waiting for better times. > > Yes indeed, this is what I thought too, thanks Jarek. Alas I'm a bit concerned with virtual devs, e.g. now I'm looking at xmits in macvlan and pppoe. Maybe this patch should exclude them? Jarek P. > > I tested following patch today on my machine, but obviously could not try > all possible quirks :) > > [PATCH] net: release dst entry in dev_hard_start_xmit() > > One point of contention in high network loads is the dst_release() performed > when a transmited skb is freed. This is because NIC tx completion calls skb free > long after original call to dev_queue_xmit(skb). > > CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is > quite visible if one CPU is 100% handling softirqs for a network device, > since dst_clone() is done by other cpus, involving cache line ping pongs. > > I believe we can release dst in dev_hard_start_xmit(), while cpu cache is hot, since > caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce > work to be done by softirq handler, and decrease cache misses. > > I also believe only pktgen can call dev_queue_xmit() with skb which have > a skb->users != 1. But pkthen skbs have a NULL dst entry. > > I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst > if skb->users != 1 > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > diff --git a/net/core/dev.c b/net/core/dev.c > index e3fe5c7..a622db6 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1664,6 +1664,26 @@ static int dev_gso_segment(struct sk_buff *skb) > return 0; > } > > + > +/* > + * Release dst while its refcnt is likely hot in CPU cache, instead > + * of waiting NIC tx completion. > + * We inline dst_release() code for performance reason > + */ > +static void release_skb_dst(struct sk_buff *skb) > +{ > + if (likely(skb->dst)) { > + if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) { > + int newrefcnt; > + > + smp_mb__before_atomic_dec(); > + newrefcnt = atomic_dec_return(&skb->dst->__refcnt); > + WARN_ON(newrefcnt < 0); > + skb->dst = NULL; > + } > + } > +} > + > int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, > struct netdev_queue *txq) > { > @@ -1681,6 +1701,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, > goto gso; > } > > + release_skb_dst(skb); > return ops->ndo_start_xmit(skb, dev); > } > > @@ -1691,6 +1712,7 @@ gso: > > skb->next = nskb->next; > nskb->next = NULL; > + release_skb_dst(nskb); > rc = ops->ndo_start_xmit(nskb, dev); > if (unlikely(rc)) { > nskb->next = skb->next; > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski a écrit : > On Wed, Mar 25, 2009 at 07:41:27PM +0100, Eric Dumazet wrote: >> Jarek Poplawski a écrit : >>> David Miller wrote, On 03/25/2009 08:17 AM: >>> >>>> From: Eric Dumazet <dada1@cosmosbay.com> >>>> Date: Wed, 25 Mar 2009 08:13:30 +0100 >>>> >>>>> If done in dev_hard_start_xmit(), skb could be requeued (because of >>>>> NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being >>>>> NULL is not a problem ? >>>> Usually it should be OK because the packet schedulers have >>>> a sort-of one-behind state that allows them to reinsert >>>> the SKB into their queue datastructures without reclassifying. >>> >>> Actually, since David has dumped requeuing there is no reinserting; >>> this last one "requeued" skb is buffered at the top in q->gso_skb >>> and waiting for better times. >> Yes indeed, this is what I thought too, thanks Jarek. > > Alas I'm a bit concerned with virtual devs, e.g. now I'm looking at > xmits in macvlan and pppoe. Maybe this patch should exclude them? > Yes, MACVLAN :) its macvlan_start_xmit() function calls dev_queue_xmit(skb) again, so we go back to packet schedulers and classifiers, they might need dst again :( Only other potential problem I found was in drivers/net/appletalk/ipddp.c static int ipddp_xmit(struct sk_buff *skb, struct net_device *dev) { __be32 paddr = ((struct rtable*)skb->dst)->rt_gateway; Not sure this driver is still supported, or if this paddr can be found elsewhere... __sk_dst_get(skb->sk) ??? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 25, 2009 at 08:40:05PM +0100, Eric Dumazet wrote: > Jarek Poplawski a écrit : ... > > Alas I'm a bit concerned with virtual devs, e.g. now I'm looking at > > xmits in macvlan and pppoe. Maybe this patch should exclude them? > > > > Yes, MACVLAN :) its macvlan_start_xmit() function calls > dev_queue_xmit(skb) again, so we go back to packet schedulers and > classifiers, they might need dst again :( > > Only other potential problem I found was in > drivers/net/appletalk/ipddp.c > > static int ipddp_xmit(struct sk_buff *skb, struct net_device *dev) > { > __be32 paddr = ((struct rtable*)skb->dst)->rt_gateway; > > Not sure this driver is still supported, or if this paddr can be found elsewhere... > __sk_dst_get(skb->sk) ??? > I guess you've considered the loopback too... Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski a écrit : > I guess you've considered the loopback too... > I had for the first patch (carefuly avoiding loopback being not responsive) : I was releasing dst only if enqueue() was called. You are right, loopback doesnt work anymore with last patch, and I have no idea why... Do you know why ? Thank you -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 25, 2009 at 09:28:37PM +0100, Eric Dumazet wrote: > Jarek Poplawski a écrit : > > > I guess you've considered the loopback too... > > > > I had for the first patch (carefuly avoiding loopback being not responsive) : > I was releasing dst only if enqueue() was called. > > You are right, loopback doesnt work anymore with last patch, > and I have no idea why... > > Do you know why ? Hmm.. isn't this test for dst == NULL in ip_rcv_finish expected to be negative for loopback source? Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jarek Poplawski wrote: > On Wed, Mar 25, 2009 at 09:28:37PM +0100, Eric Dumazet wrote: > >> Jarek Poplawski a écrit : >> >> >>> I guess you've considered the loopback too... >>> >>> >> I had for the first patch (carefuly avoiding loopback being not responsive) : >> I was releasing dst only if enqueue() was called. >> >> You are right, loopback doesnt work anymore with last patch, >> and I have no idea why... >> >> Do you know why ? >> > > Hmm.. isn't this test for dst == NULL in ip_rcv_finish expected to be > negative for loopback source? ip_route_input() doesn't accept loopback addresses, so loopback packets already need to have a dst_entry attached. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/core/dev.c b/net/core/dev.c index e3fe5c7..a622db6 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1664,6 +1664,26 @@ static int dev_gso_segment(struct sk_buff *skb) return 0; } + +/* + * Release dst while its refcnt is likely hot in CPU cache, instead + * of waiting NIC tx completion. + * We inline dst_release() code for performance reason + */ +static void release_skb_dst(struct sk_buff *skb) +{ + if (likely(skb->dst)) { + if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) { + int newrefcnt; + + smp_mb__before_atomic_dec(); + newrefcnt = atomic_dec_return(&skb->dst->__refcnt); + WARN_ON(newrefcnt < 0); + skb->dst = NULL; + } + } +} + int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq) { @@ -1681,6 +1701,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, goto gso; } + release_skb_dst(skb); return ops->ndo_start_xmit(skb, dev); } @@ -1691,6 +1712,7 @@ gso: skb->next = nskb->next; nskb->next = NULL; + release_skb_dst(nskb); rc = ops->ndo_start_xmit(nskb, dev); if (unlikely(rc)) { nskb->next = skb->next;