Message ID | 200905292344.56814.rusty@rustcorp.com.au
---|---
State | Changes Requested, archived
Delegated to: | David Miller
Rusty Russell wrote:
> Various drivers do skb_orphan() because they do not free transmitted
> skbs in a timely manner (causing apps which hit their socket limits to
> stall, socket close to hang, etc.).
>
> DaveM points out that there are advantages to doing it generally (it's
> more likely to be on same CPU than after xmit), and I couldn't find
> any new starvation issues in simple benchmarking here.

If really no starvation is possible at all, I really wonder why some
guys added memory accounting to UDP flows. Maybe they don't run "simple
benchmarks" but real apps? :)

For TCP, I agree your patch is a huge benefit, since it's paced by remote
ACKs and window control, but a UDP sender will likely be able to saturate
a link.

Maybe we can try to selectively call skb_orphan() only for TCP packets?
I understand your motivation is the driver simplification, so you want
all packets being orphaned... hmm...

> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
>
> I removed the drivers' skb_orphan(), though it's harmless.
>
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> Cc: Divy Le Ray <divy@chelsio.com>
> Cc: Roland Dreier <rolandd@cisco.com>
> Cc: Pavel Emelianov <xemul@openvz.org>
> Cc: Dan Williams <dcbw@redhat.com>
> Cc: libertas-dev@lists.infradead.org
> ---
>  drivers/net/cxgb3/sge.c            |   27 ---------------------------
>  drivers/net/loopback.c             |    2 --
>  drivers/net/mlx4/en_tx.c           |    4 ----
>  drivers/net/niu.c                  |    3 +--
>  drivers/net/veth.c                 |    2 --
>  drivers/net/wireless/libertas/tx.c |    3 ---
>  net/core/dev.c                     |   21 +++++----------------
>  7 files changed, 6 insertions(+), 56 deletions(-)
>
> diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
> --- a/drivers/net/cxgb3/sge.c
> +++ b/drivers/net/cxgb3/sge.c
> @@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str
>  	dev->trans_start = jiffies;
>  	spin_unlock(&q->lock);
>
> -	/*
> -	 * We do not use Tx completion interrupts to free DMAd Tx packets.
> -	 * This is good for performamce but means that we rely on new Tx
> -	 * packets arriving to run the destructors of completed packets,
> -	 * which open up space in their sockets' send queues.  Sometimes
> -	 * we do not get such new packets causing Tx to stall.  A single
> -	 * UDP transmitter is a good example of this situation.  We have
> -	 * a clean up timer that periodically reclaims completed packets
> -	 * but it doesn't run often enough (nor do we want it to) to prevent
> -	 * lengthy stalls.  A solution to this problem is to run the
> -	 * destructor early, after the packet is queued but before it's DMAd.
> -	 * A cons is that we lie to socket memory accounting, but the amount
> -	 * of extra memory is reasonable (limited by the number of Tx
> -	 * descriptors), the packets do actually get freed quickly by new
> -	 * packets almost always, and for protocols like TCP that wait for
> -	 * acks to really free up the data the extra memory is even less.
> -	 * On the positive side we run the destructors on the sending CPU
> -	 * rather than on a potentially different completing CPU, usually a
> -	 * good thing.  We also run them without holding our Tx queue lock,
> -	 * unlike what reclaim_completed_tx() would otherwise do.
> -	 *
> -	 * Run the destructor before telling the DMA engine about the packet
> -	 * to make sure it doesn't complete and get freed prematurely.
> -	 */
> -	if (likely(!skb_shared(skb)))
> -		skb_orphan(skb);
> -
>  	write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
>  	check_ring_tx_db(adap, q);
>  	return NETDEV_TX_OK;
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff
>  {
>  	struct pcpu_lstats *pcpu_lstats, *lb_stats;
>
> -	skb_orphan(skb);
> -
>  	skb->protocol = eth_type_trans(skb,dev);
>
>  	/* it's OK to use per_cpu_ptr() because BHs are off */
> diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
> --- a/drivers/net/mlx4/en_tx.c
> +++ b/drivers/net/mlx4/en_tx.c
> @@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st
>  	if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
>  		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
>
> -	/* Run destructor before passing skb to HW */
> -	if (likely(!skb_shared(skb)))
> -		skb_orphan(skb);
> -
>  	/* Ensure new descirptor hits memory
>  	 * before setting ownership of this descriptor to HW */
>  	wmb();
> diff --git a/drivers/net/niu.c b/drivers/net/niu.c
> --- a/drivers/net/niu.c
> +++ b/drivers/net/niu.c
> @@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff
>  		}
>  		kfree_skb(skb);
>  		skb = skb_new;
> -	} else
> -		skb_orphan(skb);
> +	}
>
>  	align = ((unsigned long) skb->data & (16 - 1));
>  	headroom = align + sizeof(struct tx_pkt_hdr);
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb
>  	struct veth_net_stats *stats, *rcv_stats;
>  	int length, cpu;
>
> -	skb_orphan(skb);
> -
>  	priv = netdev_priv(dev);
>  	rcv = priv->peer;
>  	rcv_priv = netdev_priv(rcv);
> diff --git a/drivers/net/wireless/libertas/tx.c b/drivers/net/wireless/libertas/tx.c
> --- a/drivers/net/wireless/libertas/tx.c
> +++ b/drivers/net/wireless/libertas/tx.c
> @@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff *
>  	if (priv->monitormode) {
>  		/* Keep the skb to echo it back once Tx feedback is
>  		   received from FW */
> -		skb_orphan(skb);
> -
> -		/* Keep the skb around for when we get feedback */
>  		priv->currenttxskb = skb;
>  	} else {
>  free:
> diff --git a/net/core/dev.c b/net/core/dev.c
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff *
>  	const struct net_device_ops *ops = dev->netdev_ops;
>  	int rc;
>
> +	/* Call destructor here.  It's SMP-cache-friendly and avoids issues
> +	 * with drivers which can hold transmitted skbs for long times */
> +	skb_orphan(skb);
> +
>  	if (likely(!skb->next)) {
>  		if (!list_empty(&ptype_all))
>  			dev_queue_xmit_nit(skb, dev);
> @@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff *
>  		goto gso;
>  	}
>
> -	rc = ops->ndo_start_xmit(skb, dev);
> -	/*
> -	 * TODO: if skb_orphan() was called by
> -	 * dev->hard_start_xmit() (for example, the unmodified
> -	 * igb driver does that; bnx2 doesn't), then
> -	 * skb_tx_software_timestamp() will be unable to send
> -	 * back the time stamp.
> -	 *
> -	 * How can this be prevented? Always create another
> -	 * reference to the socket before calling
> -	 * dev->hard_start_xmit()? Prevent that skb_orphan()
> -	 * does anything in dev->hard_start_xmit() by clearing
> -	 * the skb destructor before the call and restoring it
> -	 * afterwards, then doing the skb_orphan() ourselves?
> -	 */
> -	return rc;
> +	return ops->ndo_start_xmit(skb, dev);
>  }
>
>  gso:

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
> Rusty Russell wrote:
> > DaveM points out that there are advantages to doing it generally (it's
> > more likely to be on same CPU than after xmit), and I couldn't find
> > any new starvation issues in simple benchmarking here.
>
> If really no starvation is possible at all, I really wonder why some
> guys added memory accounting to UDP flows. Maybe they don't run "simple
> benchmarks" but real apps? :)

Well, without any accounting at all you could use quite a lot of memory,
as there are many places packets can be queued.

> For TCP, I agree your patch is a huge benefit, since it's paced by remote
> ACKs and window control

I doubt that. There'll be some cache friendliness, but I'm not sure it'll
be measurable, let alone "huge". It's the win for drivers which don't have
a timely and batching tx free mechanism which I aim for.

> , but a UDP sender will likely be able to saturate a link.

I couldn't see any difference in saturation here (with the default
scheduler and a 100MBit e1000e). Two reasons come to mind: firstly, only
the hardware queue is unregulated: the tx queue is still accounted. And
when you add scheduling to the mix, I can't in practice cause starvation
of other senders.

Hope that clarifies,
Rusty.
On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.

Would it be possible to make the new skb_orphan() at the start of
dev_hard_start_xmit() conditional, so that it is not executed for
packets that are to be time stamped?

As discussed before
(http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
socket pointer is required for sending back the send time stamp from
inside the device driver. Calling skb_orphan() unconditionally as in
this patch would break the hardware time stamping of outgoing packets.

This should work without much overhead (skb_tx() expands to a lookup
of the skb_shared_info):

	if (!skb_tx(skb)->flags)
		skb_orphan(skb);

Instead of looking at the individual skb_shared_tx "hardware" and
"software" bits I'm just checking the whole byte here because it is
shorter and perhaps faster. Semantically the result is the same.
From: Patrick Ohly <patrick.ohly@intel.com>
Date: Mon, 01 Jun 2009 21:47:22 +0200

> On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
>
> Would it be possible to make the new skb_orphan() at the start of
> dev_hard_start_xmit() conditional so that it is not executed for
> packets that are to be time stamped?
>
> As discussed before
> (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> socket pointer is required for sending back the send time stamp from
> inside the device driver. Calling skb_orphan() unconditionally as in
> this patch would break the hardware time stamping of outgoing packets.

Indeed, we need to check that case, at a minimum.

And there are potentially other problems. For example, I wonder how
this interacts with the new TX MMAP af_packet support in net-next-2.6 :-/
On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
> From: Patrick Ohly <patrick.ohly@intel.com>
> Date: Mon, 01 Jun 2009 21:47:22 +0200
>
> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
> > > This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
> > > can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
> >
> > Would it be possible to make the new skb_orphan() at the start of
> > dev_hard_start_xmit() conditional so that it is not executed for
> > packets that are to be time stamped?
> >
> > As discussed before
> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
> > socket pointer is required for sending back the send time stamp from
> > inside the device driver. Calling skb_orphan() unconditionally as in
> > this patch would break the hardware time stamping of outgoing packets.
>
> Indeed, we need to check that case, at a minimum.
>
> And there are potentially other problems. For example, I wonder how
> this interacts with the new TX MMAP af_packet support in net-next-2.6 :-/

I think I'll do this in the driver for now, and let's revisit doing it
generically later?

Thanks,
Rusty.
From: Rusty Russell <rusty@rustcorp.com.au>
Date: Tue, 2 Jun 2009 23:38:29 +0930

> On Tue, 2 Jun 2009 04:55:53 pm David Miller wrote:
>> From: Patrick Ohly <patrick.ohly@intel.com>
>> Date: Mon, 01 Jun 2009 21:47:22 +0200
>>
>> > On Fri, 2009-05-29 at 23:44 +0930, Rusty Russell wrote:
>> >> This patch adds skb_orphan to the start of dev_hard_start_xmit(): it
>> >> can be premature in the NETDEV_TX_BUSY case, but that's uncommon.
>> >
>> > Would it be possible to make the new skb_orphan() at the start of
>> > dev_hard_start_xmit() conditional so that it is not executed for
>> > packets that are to be time stamped?
>> >
>> > As discussed before
>> > (http://article.gmane.org/gmane.linux.network/121378/), the skb->sk
>> > socket pointer is required for sending back the send time stamp from
>> > inside the device driver. Calling skb_orphan() unconditionally as in
>> > this patch would break the hardware time stamping of outgoing packets.
>>
>> Indeed, we need to check that case, at a minimum.
>>
>> And there are potentially other problems. For example, I wonder how
>> this interacts with the new TX MMAP af_packet support in net-next-2.6 :-/
>
> I think I'll do this in the driver for now, and let's revisit doing it
> generically later?

That might be the best course of action for the time being.
This whole area is a rat's nest.
Rusty Russell wrote:
> On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
>> Rusty Russell wrote:
>>> DaveM points out that there are advantages to doing it generally (it's
>>> more likely to be on same CPU than after xmit), and I couldn't find
>>> any new starvation issues in simple benchmarking here.
>> If really no starvation is possible at all, I really wonder why some
>> guys added memory accounting to UDP flows. Maybe they don't run "simple
>> benchmarks" but real apps? :)
>
> Well, without any accounting at all you could use quite a lot of memory,
> as there are many places packets can be queued.
>
>> For TCP, I agree your patch is a huge benefit, since it's paced by remote
>> ACKs and window control
>
> I doubt that. There'll be some cache friendliness, but I'm not sure it'll
> be measurable, let alone "huge". It's the win for drivers which don't have
> a timely and batching tx free mechanism which I aim for.

At 250,000 packets/second on a Gigabit link, this is huge, I can tell you.
(250,000 incoming packets and 250,000 outgoing packets per second,
700 Mbit/s.)

According to this oprofile on CPU0 (dedicated to softirqs on one bnx2 eth
adapter), we can see sock_wfree() being number 2 on the profile, because
it touches three cache lines per socket and transmitted packet in the TX
completion handler.

Also, taking a reference on the socket for each xmit packet in flight is
very expensive, since it slows down the receiver in __udp4_lib_lookup().
Several cpus are fighting for the sk->refcnt cache line.

CPU: Core 2, speed 3000.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (Unhalted core cycles) count 100000

samples  cum. samples  %        cum. %   symbol name
21215     21215        11.8847  11.8847  bnx2_poll_work
17239     38454         9.6573  21.5420  sock_wfree  << effect of udp memory accounting >>
14817     53271         8.3005  29.8425  __slab_free
14635     67906         8.1986  38.0411  __udp4_lib_lookup
11425     79331         6.4003  44.4414  __alloc_skb
 9710     89041         5.4396  49.8810  __slab_alloc
 8095     97136         4.5348  54.4158  __udp4_lib_rcv
 7831    104967         4.3869  58.8027  sock_def_write_space
 7586    112553         4.2497  63.0524  ip_rcv
 7518    120071         4.2116  67.2640  skb_dma_unmap
 6711    126782         3.7595  71.0235  netif_receive_skb
 6272    133054         3.5136  74.5371  udp_queue_rcv_skb
 5262    138316         2.9478  77.4849  skb_release_data
 5023    143339         2.8139  80.2988  __kmalloc_track_caller
 4070    147409         2.2800  82.5788  kmem_cache_alloc
 3216    150625         1.8016  84.3804  ipt_do_table
 2576    153201         1.4431  85.8235  skb_queue_tail
On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
> Rusty Russell wrote:
> > On Sat, 30 May 2009 12:41:00 am Eric Dumazet wrote:
> >> For TCP, I agree your patch is a huge benefit, since it's paced by
> >> remote ACKs and window control
> >
> > I doubt that. There'll be some cache friendliness, but I'm not sure
> > it'll be measurable, let alone "huge". ...
>
> We can see sock_wfree() being number 2 on the profile, because it touches
> three cache lines per socket and transmitted packet in the TX completion
> handler.

Interesting, I take it back: got some "after" stats as well?

> Also, taking a reference on the socket for each xmit packet in flight is
> very expensive, since it slows down the receiver in __udp4_lib_lookup().
> Several cpus are fighting for the sk->refcnt cache line.

Now we have decent dynamic per-cpu, we can finally implement bigrefs. More
obvious for device counts than sockets, but perhaps applicable here as
well?

Rusty.
From: Rusty Russell <rusty@rustcorp.com.au>
Date: Thu, 4 Jun 2009 13:24:57 +0930

> On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
>> Also, taking a reference on the socket for each xmit packet in flight
>> is very expensive, since it slows down the receiver in
>> __udp4_lib_lookup(). Several cpus are fighting for the sk->refcnt
>> cache line.
>
> Now we have decent dynamic per-cpu, we can finally implement bigrefs.
> More obvious for device counts than sockets, but perhaps applicable here
> as well?

It might be very beneficial for longer-lasting, active connections, but
for high connection rates it's going to be a lose in my estimation.
David Miller wrote:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Thu, 4 Jun 2009 13:24:57 +0930
>
>> On Thu, 4 Jun 2009 06:32:53 am Eric Dumazet wrote:
>>> Also, taking a reference on the socket for each xmit packet in flight
>>> is very expensive, since it slows down the receiver in
>>> __udp4_lib_lookup(). Several cpus are fighting for the sk->refcnt
>>> cache line.
>> Now we have decent dynamic per-cpu, we can finally implement bigrefs.
>> More obvious for device counts than sockets, but perhaps applicable
>> here as well?
>
> It might be very beneficial for longer-lasting, active connections, but
> for high connection rates it's going to be a lose in my estimation.

Agreed.

We also can avoid the sock_put()/sock_hold() pair for each tx packet, and
only touch sk_wmem_alloc (with an appropriate atomic_sub_return() in
sock_wfree() and atomic_dec_and_test() in sk_free()).

We could initialize sk->sk_wmem_alloc to one instead of 0, so that
sock_wfree() could just synchronize itself with sk_free():

void sk_free(struct sock *sk)
{
	if (atomic_dec_and_test(&sk->sk_wmem_alloc))
		__sk_free(sk);
}

static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
-	sock_hold(sk);
	skb->sk = sk;
	skb->destructor = sock_wfree;
	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
}

void sock_wfree(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;
+	int res;

	/* In case it might be waiting for more memory. */
-	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
		sk->sk_write_space(sk);
-	sock_put(sk);
+	if (res == 0)
+		__sk_free(sk);
}

Patch will follow after some testing.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Jun 2009 06:54:24 +0200

> We also can avoid the sock_put()/sock_hold() pair for each tx packet,
> and only touch sk_wmem_alloc (with an appropriate atomic_sub_return()
> in sock_wfree() and atomic_dec_and_test() in sk_free()).
>
> We could initialize sk->sk_wmem_alloc to one instead of 0, so that
> sock_wfree() could just synchronize itself with sk_free()

Excellent idea Eric.

> Patch will follow after some testing

I look forward to it :-)
David Miller <davem@davemloft.net> wrote:
>
>> I think I'll do this in the driver for now, and let's revisit doing it
>> generically later?
>
> That might be the best course of action for the time being.
> This whole area is a rat's nest.

Calling skb_orphan like this should be forbidden. Apart from the
problems already raised, it is a sign that the driver is trying to
paper over a more serious issue of not cleaning up skbs in a timely
manner.

Yes, skb_orphan will work for the cases where calling the skb
destructor allows forward progress, but for the cases where you really
need the skb to be freed (e.g., iSCSI or Xen), this simply doesn't
work.

So anytime someone tries to propose such a solution it is a sign that
they have bigger problems.

Cheers,
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 3 Jul 2009 15:55:30 +0800

> Calling skb_orphan like this should be forbidden. Apart from the
> problems already raised, it is a sign that the driver is trying to
> paper over a more serious issue of not cleaning up skbs in a timely
> manner.
>
> Yes, skb_orphan will work for the cases where calling the skb
> destructor allows forward progress, but for the cases where you really
> need the skb to be freed (e.g., iSCSI or Xen), this simply doesn't
> work.
>
> So anytime someone tries to propose such a solution it is a sign that
> they have bigger problems.

Agreed, but alas we are foaming at the mouth until we have a truly
usable alternative.

In particular, the case of handling a device without usable TX
completion event indications is still quite troublesome.
On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>
> In particular, the case of handling a device without usable TX
> completion event indications is still quite troublesome.

Which particular devices do you have in mind?

Cheers,
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 4 Jul 2009 11:08:30 +0800

> On Fri, Jul 03, 2009 at 08:02:54PM -0700, David Miller wrote:
>>
>> In particular, the case of handling a device without usable TX
>> completion event indications is still quite troublesome.
>
> Which particular devices do you have in mind?

NIU.

I basically can't defer interrupts because the chip supports per-TX-desc
interrupt indications but it lacks an "all TX queue sent" event. So if,
say, we tell it to interrupt every 1/4 of the TX queue, then up to 1/4 of
the queue can have packets "stuck" in there if TX activity all of a
sudden ceases.

The only thing I've come up with to be able to mitigate interrupts is to
use an hrtimer of some sort. But that's going to be hard to get right,
and who knows what kind of latencies will be introduced for TX completion
packet freeing unless I am very careful.

And finally, this belongs in generic code, not in the NIU driver,
whatever we come up with. Especially since my understanding is that this
is similar to what Rusty needs.
On Fri, Jul 03, 2009 at 08:13:47PM -0700, David Miller wrote:
>
> NIU
>
> I basically can't defer interrupts because the chip supports
> per-TX-desc interrupt indications but it lacks an "all TX queue sent"
> event. So if, say, we tell it to interrupt every 1/4 of the TX queue,
> then up to 1/4 of the queue can have packets "stuck" in there if TX
> activity all of a sudden ceases.

Here's an idea: we let the sender decide whether we need to enable
notification. This decision would be carried as a flag in the skb.
For example, UDP would set this flag when its socket buffer is close
to capacity. Routing would set this flag per NAPI run, etc.

Of course you'd ignore this flag completely if the qdisc queue is
non-empty.

Cheers,
On Sat, Jul 04, 2009 at 03:42:45PM +0800, Herbert Xu wrote:
>
> Here's an idea: we let the sender decide whether we need to enable
> notification. This decision would be carried as a flag in the skb.
> For example, UDP would set this flag when its socket buffer is close
> to capacity. Routing would set this flag per NAPI run, etc.

Actually it doesn't even matter for routing, because only those that are
charged by the skbs or the pages care, and they're the only ones that
would need to set this.

One potential problem is if the socket is constantly running close to
capacity, but that should only happen if the device TX queue is also
close to capacity, which means that the qdisc queue should be non-empty.

Cheers,
On Sat, Jul 04, 2009 at 05:09:10PM +0800, Herbert Xu wrote:
>
> One potential problem is if the socket is constantly running close to
> capacity, but that should only happen if the device TX queue is also
> close to capacity, which means that the qdisc queue should be
> non-empty.

Here's another crazy idea: let's use dummy TX descriptors to generate an
interrupt, either with or without transmitting an actual packet on the
wire, depending on the NIC.

xmit(skb)

	if (TX queue contains no interrupting descriptor &&
	    qdisc is empty)
		mark TX descriptor as interrupting

clean()

	do work
	if (TX queue contains no interrupting descriptor &&
	    TX queue non-empty && qdisc is empty)
		send dummy TX descriptor

This is based on the assumption that in the time it takes for the NIC to
process one packet and interrupt us, we can generate more packets.
Obviously if we can't, then even if the NIC had a queue-empty interrupt
capability it wouldn't help.

Cheers,
On Sun, Jul 05, 2009 at 11:26:58AM +0800, Herbert Xu wrote:
>
> Here's another crazy idea:
>
> Let's use dummy TX descriptors to generate an interrupt, either with or
> without transmitting an actual packet on the wire, depending on the NIC.

Here's an even crazier idea that doesn't use dummy descriptors.

xmit(skb)

	if (TX queue contains no interrupting descriptor &&
	    qdisc is empty)
		mark TX descriptor as interrupting

	if (TX queue now contains an interrupting descriptor &&
	    qdisc len < 2)
		stop queue

	if (TX ring full)
		stop queue

clean()

	do work
	wake queue as per usual

Cheers,
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 5 Jul 2009 11:34:08 +0800

> Here's an even crazier idea that doesn't use dummy descriptors.
>
> xmit(skb)
>
>	if (TX queue contains no interrupting descriptor &&
>	    qdisc is empty)
>		mark TX descriptor as interrupting
>
>	if (TX queue now contains an interrupting descriptor &&
>	    qdisc len < 2)
>		stop queue
>
>	if (TX ring full)
>		stop queue
>
> clean()
>
>	do work
>	wake queue as per usual

I'm pretty sure that for normal TCP and UDP workloads, this is just
going to set the interrupt bit on the first packet that gets into the
queue, and then not in the rest.

TCP just loops over packets in the send queue, and at initial state the
qdisc will be empty.

It's very hard to get this to work as well as if we had a real queue
empty interrupt status event.

Even if you get upstream status from the protocols saying "there's more
packets coming" via some flag in the SKB, that only says what one client
feeding the TX ring is about to do. It says nothing about other threads
of control which are about to start feeding packets to the same device.
On Mon, Aug 17, 2009 at 06:47:12PM -0700, David Miller wrote:
>
> I'm pretty sure that for normal TCP and UDP workloads, this is just
> going to set the interrupt bit on the first packet that gets into the
> queue, and then not in the rest.
>
> TCP just loops over packets in the send queue, and at initial state the
> qdisc will be empty.

The scheme I actually tried on the e1000e is not quite the same as what I
had here. I'm basically just adding descriptors per usual. However, I
don't give them to the hardware immediately. Instead they're held back so
that on average we have about three descriptors in the queue, spaced
equally, that will cause interrupts.

It seems to work fairly well for netperf-generated TCP/UDP traffic in
terms of generating the right number of interrupts (one every qlen/3
packets).

I haven't posted anything yet because I ran into a weird problem with the
e1000e. It was generating a lot more interrupts than what I'm telling it
to do. Even if I hard-code the interrupts at 0, qlen/3 and 2*qlen/3, I
still get way more than qlen/3 interrupts for qlen packets.

This may be related to the fact that the e1000e doesn't really have a way
of turning interrupts off for a given descriptor. Well, actually it does,
but it also turns off status reporting, which means that we will never
know whether that entry has finished processing. So I've had to rely on
using its TX interrupt delay mechanism as an approximation of turning
interrupt notification off. Evidently that is not working in the way I
intended it to.

I'm in the process of repeating the same experiment with cxgb3, which
hopefully should let me turn interrupts off on descriptors while still
reporting completion status.

Cheers,
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 19 Aug 2009 13:19:26 +1000

> I'm in the process of repeating the same experiment with cxgb3, which
> hopefully should let me turn interrupts off on descriptors while still
> reporting completion status.

Ok, I look forward to seeing your work, however it turns out.

Once I see what you've done, I'll give it a spin on niu.
diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -1288,33 +1288,6 @@ int t3_eth_xmit(struct sk_buff *skb, str
 	dev->trans_start = jiffies;
 	spin_unlock(&q->lock);
 
-	/*
-	 * We do not use Tx completion interrupts to free DMAd Tx packets.
-	 * This is good for performamce but means that we rely on new Tx
-	 * packets arriving to run the destructors of completed packets,
-	 * which open up space in their sockets' send queues.  Sometimes
-	 * we do not get such new packets causing Tx to stall.  A single
-	 * UDP transmitter is a good example of this situation.  We have
-	 * a clean up timer that periodically reclaims completed packets
-	 * but it doesn't run often enough (nor do we want it to) to prevent
-	 * lengthy stalls.  A solution to this problem is to run the
-	 * destructor early, after the packet is queued but before it's DMAd.
-	 * A cons is that we lie to socket memory accounting, but the amount
-	 * of extra memory is reasonable (limited by the number of Tx
-	 * descriptors), the packets do actually get freed quickly by new
-	 * packets almost always, and for protocols like TCP that wait for
-	 * acks to really free up the data the extra memory is even less.
-	 * On the positive side we run the destructors on the sending CPU
-	 * rather than on a potentially different completing CPU, usually a
-	 * good thing.  We also run them without holding our Tx queue lock,
-	 * unlike what reclaim_completed_tx() would otherwise do.
-	 *
-	 * Run the destructor before telling the DMA engine about the packet
-	 * to make sure it doesn't complete and get freed prematurely.
-	 */
-	if (likely(!skb_shared(skb)))
-		skb_orphan(skb);
-
 	write_tx_pkt_wr(adap, skb, pi, pidx, gen, q, ndesc, compl);
 	check_ring_tx_db(adap, q);
 	return NETDEV_TX_OK;
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -72,8 +72,6 @@ static int loopback_xmit(struct sk_buff
 {
 	struct pcpu_lstats *pcpu_lstats, *lb_stats;
 
-	skb_orphan(skb);
-
 	skb->protocol = eth_type_trans(skb,dev);
 
 	/* it's OK to use per_cpu_ptr() because BHs are off */
diff --git a/drivers/net/mlx4/en_tx.c b/drivers/net/mlx4/en_tx.c
--- a/drivers/net/mlx4/en_tx.c
+++ b/drivers/net/mlx4/en_tx.c
@@ -807,10 +807,6 @@ int mlx4_en_xmit(struct sk_buff *skb, st
 	if (tx_desc == (struct mlx4_en_tx_desc *) ring->bounce_buf)
 		tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
 
-	/* Run destructor before passing skb to HW */
-	if (likely(!skb_shared(skb)))
-		skb_orphan(skb);
-
 	/* Ensure new descirptor hits memory
 	 * before setting ownership of this descriptor to HW */
 	wmb();
diff --git a/drivers/net/niu.c b/drivers/net/niu.c
--- a/drivers/net/niu.c
+++ b/drivers/net/niu.c
@@ -6667,8 +6667,7 @@ static int niu_start_xmit(struct sk_buff
 		}
 		kfree_skb(skb);
 		skb = skb_new;
-	} else
-		skb_orphan(skb);
+	}
 
 	align = ((unsigned long) skb->data & (16 - 1));
 	headroom = align + sizeof(struct tx_pkt_hdr);
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -155,8 +155,6 @@ static int veth_xmit(struct sk_buff *skb
 	struct veth_net_stats *stats, *rcv_stats;
 	int length, cpu;
 
-	skb_orphan(skb);
-
 	priv = netdev_priv(dev);
 	rcv = priv->peer;
 	rcv_priv = netdev_priv(rcv);
diff --git a/drivers/net/wireless/libertas/tx.c b/drivers/net/wireless/libertas/tx.c
--- a/drivers/net/wireless/libertas/tx.c
+++ b/drivers/net/wireless/libertas/tx.c
@@ -154,9 +154,6 @@ int lbs_hard_start_xmit(struct sk_buff *
 	if (priv->monitormode) {
 		/* Keep the skb to echo it back once Tx feedback
 		   is received from FW */
-		skb_orphan(skb);
-
-		/* Keep the skb around for when we get feedback */
 		priv->currenttxskb = skb;
 	} else {
 free:
diff --git a/net/core/dev.c b/net/core/dev.c
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1677,6 +1677,10 @@ int dev_hard_start_xmit(struct sk_buff *
 	const struct net_device_ops *ops = dev->netdev_ops;
 	int rc;
 
+	/* Call destructor here.  It's SMP-cache-friendly and avoids issues
+	 * with drivers which can hold transmitted skbs for long times */
+	skb_orphan(skb);
+
 	if (likely(!skb->next)) {
 		if (!list_empty(&ptype_all))
 			dev_queue_xmit_nit(skb, dev);
@@ -1688,22 +1692,7 @@ int dev_hard_start_xmit(struct sk_buff *
 		goto gso;
 	}
 
-	rc = ops->ndo_start_xmit(skb, dev);
-	/*
-	 * TODO: if skb_orphan() was called by
-	 * dev->hard_start_xmit() (for example, the unmodified
-	 * igb driver does that; bnx2 doesn't), then
-	 * skb_tx_software_timestamp() will be unable to send
-	 * back the time stamp.
-	 *
-	 * How can this be prevented?  Always create another
-	 * reference to the socket before calling
-	 * dev->hard_start_xmit()?  Prevent that skb_orphan()
-	 * does anything in dev->hard_start_xmit() by clearing
-	 * the skb destructor before the call and restoring it
-	 * afterwards, then doing the skb_orphan() ourselves?
-	 */
-	return rc;
+	return ops->ndo_start_xmit(skb, dev);
 }
 
 gso:
Various drivers do skb_orphan() because they do not free transmitted
skbs in a timely manner (causing apps which hit their socket limits to
stall, socket close to hang, etc.).

DaveM points out that there are advantages to doing it generally (it's
more likely to be on the same CPU than after xmit), and I couldn't find
any new starvation issues in simple benchmarking here.

This patch adds skb_orphan() to the start of dev_hard_start_xmit(): it
can be premature in the NETDEV_TX_BUSY case, but that's uncommon.

I removed the drivers' skb_orphan() calls, though they're harmless.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Divy Le Ray <divy@chelsio.com>
Cc: Roland Dreier <rolandd@cisco.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Dan Williams <dcbw@redhat.com>
Cc: libertas-dev@lists.infradead.org
---
 drivers/net/cxgb3/sge.c            |   27 ---------------------------
 drivers/net/loopback.c             |    2 --
 drivers/net/mlx4/en_tx.c           |    4 ----
 drivers/net/niu.c                  |    3 +--
 drivers/net/veth.c                 |    2 --
 drivers/net/wireless/libertas/tx.c |    3 ---
 net/core/dev.c                     |   21 +++++----------------
 7 files changed, 6 insertions(+), 56 deletions(-)