Message ID | 20110323124939.GA7834@redhat.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On 03/23/11 08:52, Stanislaw Gruszka wrote: > on my i686 system, what should also make myri10ge_clean_rx_done() > being faster. I tested this on my very old, very weak dual-core athlon64 systems. These machines can barely achieve 10Gb/s using a 1500b MTU with LRO. Running 35 60 second netperf tests into the machines with the stock driver, and again with this patch applied, I see a tiny bandwidth increase (1.4Mb/s on average) which is statistically significant ( p < 0.001). There is no statistically significant CPU load reduction. > + if (len<= mgp->small_bytes) { > + rx =&ss->rx_small; > + bytes = mgp->small_bytes; > + } else { > + rx =&ss->rx_big, Small nit: the "," above should be a ";" Between the small bandwidth increase, and the code size reduction, I'm very appreciative of this patch. Thank you, Drew -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 23 Mar 2011 13:52:04 +0100 Stanislaw Gruszka <sgruszka@redhat.com> wrote: > Add lro_enable variable to read NETIF_F_LRO flag only once per napi poll > call. This should fix theoretical race condition with > myri10ge_set_rx_csum() and myri10ge_set_flags() where flag NETIF_F_LRO > can be changed. You may need a barrier or the race may still be there. The driver seems to use mb() where wmb() is intended, and never use rmb()?
On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote: > On Wed, 23 Mar 2011 13:52:04 +0100 > Stanislaw Gruszka <sgruszka@redhat.com> wrote: > > > Add lro_enable variable to read NETIF_F_LRO flag only once per napi poll > > call. This should fix theoretical race condition with > > myri10ge_set_rx_csum() and myri10ge_set_flags() where flag NETIF_F_LRO > > can be changed. > > You may need a barrier or the race may still be there. I don't understand why barrier in that case is need. What I tried to avoid is. myri10ge_clean_rx_done(): if (dev->features & NETIF_F_LRO) setup lro myri10ge_set_flags() if (dev->features & NETIF_F_LRO) flush lro Now we read dev->features & NETIF_F_LRO only once to local lro_enabled variable. So we can not flush without setup or setup without flush. No idea why memory barries is still needed. > The driver seems to use mb() where wmb() is intended, and never use rmb()? Yes, I think we can have some optimalization here. Stanislaw -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 24 Mar 2011 09:16:21 +0100 Stanislaw Gruszka <sgruszka@redhat.com> wrote: > On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote: > > On Wed, 23 Mar 2011 13:52:04 +0100 > > Stanislaw Gruszka <sgruszka@redhat.com> wrote: > > > > > Add lro_enable variable to read NETIF_F_LRO flag only once per napi poll > > > call. This should fix theoretical race condition with > > > myri10ge_set_rx_csum() and myri10ge_set_flags() where flag NETIF_F_LRO > > > can be changed. > > > > You may need a barrier or the race may still be there. > > I don't understand why barrier in that case is need. > > What I tried to avoid is. > > myri10ge_clean_rx_done(): > > if (dev->features & NETIF_F_LRO) > setup lro > myri10ge_set_flags() > > if (dev->features & NETIF_F_LRO) > flush lro > > Now we read dev->features & NETIF_F_LRO only once to local > lro_enabled variable. So we can not flush without setup > or setup without flush. No idea why memory barries is still > needed. > > > The driver seems to use mb() where wmb() is intended, and never use rmb()? > > Yes, I think we can have some optimalization here. > Without barrier there is no guarantee that compiler read the flags into a local variable. It is free to do the same thing as the original code.
On Thu, Mar 24, 2011 at 08:15:39AM -0700, Stephen Hemminger wrote: > > On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote: > > > On Wed, 23 Mar 2011 13:52:04 +0100 > > > Stanislaw Gruszka <sgruszka@redhat.com> wrote: > > > > > > > Add lro_enable variable to read NETIF_F_LRO flag only once per napi poll > > > > call. This should fix theoretical race condition with > > > > myri10ge_set_rx_csum() and myri10ge_set_flags() where flag NETIF_F_LRO > > > > can be changed. > > > > > > You may need a barrier or the race may still be there. > > > > I don't understand why barrier in that case is need. > > > > What I tried to avoid is. > > > > myri10ge_clean_rx_done(): > > > > if (dev->features & NETIF_F_LRO) > > setup lro > > myri10ge_set_flags() > > > > if (dev->features & NETIF_F_LRO) > > flush lro > > > > Now we read dev->features & NETIF_F_LRO only once to local > > lro_enabled variable. So we can not flush without setup > > or setup without flush. No idea why memory barries is still > > needed. > > > > > The driver seems to use mb() where wmb() is intended, and never use rmb()? > > > > Yes, I think we can have some optimalization here. > > > > Without barrier there is no guarantee that compiler read the flags > into a local variable. It is free to do the same thing as the original > code. Ok, so C code like: code1 if (dev->features & NETIF_F_LRO) branch1 code2; if (dev->features & NETIF_F_LRO) branch2 and bool lro_enabled = dev->features & NETIF_F_LRO; code1 if (lro_enabled) branch1 code2 if (lro_enabled) branch2 can give the same assembly output. It's really hard for me to understand that. I could understand, if we would get global variable directly like: bool lro_enabled = dev->lro_enabled; instead of: bool lro_enabled = dev->features & NETIF_F_LRO; David, can you confirm that Staphen is correct? Also where this barrier() should go. Before "bool lro_enabled = dev->features & NETIF_F_LRO;" or after? Stanislaw -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2011-03-24 at 16:59 +0100, Stanislaw Gruszka wrote: > On Thu, Mar 24, 2011 at 08:15:39AM -0700, Stephen Hemminger wrote: > > > On Wed, Mar 23, 2011 at 08:33:57AM -0700, Stephen Hemminger wrote: > > > > On Wed, 23 Mar 2011 13:52:04 +0100 > > > > Stanislaw Gruszka <sgruszka@redhat.com> wrote: > > > > > > > > > Add lro_enable variable to read NETIF_F_LRO flag only once per napi poll > > > > > call. This should fix theoretical race condition with > > > > > myri10ge_set_rx_csum() and myri10ge_set_flags() where flag NETIF_F_LRO > > > > > can be changed. > > > > > > > > You may need a barrier or the race may still be there. > > > > > > I don't understand why barrier in that case is need. > > > > > > What I tried to avoid is. > > > > > > myri10ge_clean_rx_done(): > > > > > > if (dev->features & NETIF_F_LRO) > > > setup lro > > > myri10ge_set_flags() > > > > > > if (dev->features & NETIF_F_LRO) > > > flush lro > > > > > > Now we read dev->features & NETIF_F_LRO only once to local > > > lro_enabled variable. So we can not flush without setup > > > or setup without flush. No idea why memory barries is still > > > needed. > > > > > > > The driver seems to use mb() where wmb() is intended, and never use rmb()? > > > > > > Yes, I think we can have some optimalization here. > > > > > > > Without barrier there is no guarantee that compiler read the flags > > into a local variable. It is free to do the same thing as the original > > code. > > Ok, so C code like: > > code1 > if (dev->features & NETIF_F_LRO) > branch1 > code2; > if (dev->features & NETIF_F_LRO) > branch2 > > and > > bool lro_enabled = dev->features & NETIF_F_LRO; > code1 > if (lro_enabled) > branch1 > code2 > if (lro_enabled) > branch2 > > can give the same assembly output. [...] Yes. A C compiler is allowed to assume that data are not shared between multiple threads, and apply any transformations that would not affect the behaviour of a single-threaded program. We can make use of some gcc extensions (wrapped up in macros like barrier()) to inhibit some such transformations. We also assume that access to an int, long or pointer variable can be atomic. The ACCESS_ONCE() macro adds volatile-qualification to such memory access, which inhibits duplication of the access. So, if you only care that this function has a consistent value for lro_enabled, you can read dev->features with ACCESS_ONCE(): bool lro_enabled = ACCESS_ONCE(dev->features) & NETIF_F_LRO; Ben.
Stanislaw Gruszka <sgruszka@redhat.com> wrote:
> David, can you confirm that Staphen is correct?
Stephen is correct. The compiler is perfectly at liberty to merge the two
loads if the value being read is not marked volatile.
If you stick a barrier() in there between the reads, then I think the compiler
will be required to emit two load instructions. However, the _CPU_ is then
entitled to merge them.
If you don't want the CPU to merge them, you have to use smp_rmb() or smp_mb()
between.
However, the compiler is also allowed to re-read the variable (ie. emit two
load instructions) if it would otherwise have to save the value on the stack
to free up a register, unless the pointed to value is marked volatile.
If you want to ensure that the value is read once only, then you need to use
ACCESS_ONCE() or stick a read/full barrier of some degree after the read.
Note the use of a barrier implies a partial ordering between two memory
accesses in the same instruction stream. If you can't say which two memory
memory accesses you want to order, you probably shouldn't be using a barrier.
David
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/myri10ge/myri10ge.c b/drivers/net/myri10ge/myri10ge.c index 24386a8..2e71240 100644 --- a/drivers/net/myri10ge/myri10ge.c +++ b/drivers/net/myri10ge/myri10ge.c @@ -1312,17 +1312,26 @@ myri10ge_unmap_rx_page(struct pci_dev *pdev, * page into an skb */ static inline int -myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx, - int bytes, int len, __wsum csum) +myri10ge_rx_done(struct myri10ge_slice_state *ss, int len, __wsum csum, + bool lro_enabled) { struct myri10ge_priv *mgp = ss->mgp; struct sk_buff *skb; struct skb_frag_struct rx_frags[MYRI10GE_MAX_FRAGS_PER_FRAME]; - int i, idx, hlen, remainder; + struct myri10ge_rx_buf *rx; + int i, idx, hlen, remainder, bytes; struct pci_dev *pdev = mgp->pdev; struct net_device *dev = mgp->dev; u8 *va; + if (len <= mgp->small_bytes) { + rx = &ss->rx_small; + bytes = mgp->small_bytes; + } else { + rx = &ss->rx_big, + bytes = mgp->big_bytes; + } + len += MXGEFW_PAD; idx = rx->cnt & rx->mask; va = page_address(rx->info[idx].page) + rx->info[idx].page_offset; @@ -1341,7 +1350,7 @@ myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx, remainder -= MYRI10GE_ALLOC_SIZE; } - if (dev->features & NETIF_F_LRO) { + if (lro_enabled) { rx_frags[0].page_offset += MXGEFW_PAD; rx_frags[0].size -= MXGEFW_PAD; len -= MXGEFW_PAD; @@ -1464,6 +1473,7 @@ myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget) struct myri10ge_rx_done *rx_done = &ss->rx_done; struct myri10ge_priv *mgp = ss->mgp; struct net_device *netdev = mgp->dev; + bool lro_enabled = (netdev->features & NETIF_F_LRO) ? true : false; unsigned long rx_bytes = 0; unsigned long rx_packets = 0; unsigned long rx_ok; @@ -1478,14 +1488,7 @@ myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget) length = ntohs(rx_done->entry[idx].length); rx_done->entry[idx].length = 0; checksum = csum_unfold(rx_done->entry[idx].checksum); - if (length <= mgp->small_bytes) - rx_ok = myri10ge_rx_done(ss, &ss->rx_small, - mgp->small_bytes, - length, checksum); - else - rx_ok = myri10ge_rx_done(ss, &ss->rx_big, - mgp->big_bytes, - length, checksum); + rx_ok = myri10ge_rx_done(ss, length, checksum, lro_enabled); rx_packets += rx_ok; rx_bytes += rx_ok * (unsigned long)length; cnt++; @@ -1497,7 +1500,7 @@ myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget) ss->stats.rx_packets += rx_packets; ss->stats.rx_bytes += rx_bytes; - if (netdev->features & NETIF_F_LRO) + if (lro_enabled) lro_flush_all(&rx_done->lro_mgr); /* restock receive rings if needed */