
myri10ge: again fix lro_gen_skb() alignment

Message ID 49EE1C32.1060202@myri.com
State RFC, archived
Delegated to: David Miller

Commit Message

Andrew Gallatin April 21, 2009, 7:19 p.m. UTC
Herbert Xu wrote:
 > On Wed, Apr 15, 2009 at 04:42:48PM -0700, David Miller wrote:
 >> Herbert has been working on various optimizations to get
 >> cxgb3 GRO performance on par with LRO.  Perhaps he has
 >> some things for you to try :-)
 >
 > Yes, this patch should improve performance.  In fact, when you
 > reopen the net-next tree feel free to put this patch in :)
 >
 > gro: New frags interface to avoid copying shinfo
<...>

Hi Herbert,

With a net-next tree pulled 2 hours ago, I can now see line rate when
using frags with myri10ge on my weakest machines when receiving a
1500b TCP stream.  To achieve line rate on these machines with both
inet_lro and GRO, I must bind the netserver and device IRQ to
different CPUs.  Unfortunately, CPU accounting seems to currently be
broken in the Linux kernel, so I cannot provide an accurate comparison
at line rate.

So to compare inet_lro and GRO, I'm binding the netserver and device IRQ
to the same CPU.  When I do this, that CPU is saturated and GRO is
roughly 17% slower than inet_lro.  For comparison, here are netperf
results from a fast peer sending to my weak machine (AMD Athlon(tm) 64
X2 Dual Core Processor 3800+, 2GHz).  First inet_lro:

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  65536  65536    60.02      6631.52   12.45    50.10    0.308   1.238

And now GRO:
 87380  65536  65536    60.01      5488.99   9.79     50.00    0.292   1.492
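The bindings described above come down to a hex CPU mask written to the IRQ's
smp_affinity file plus a taskset for netserver. A small sketch of the mask
arithmetic (the IRQ number and exact commands are illustrative, not taken
from these mails):

```shell
# Hex mask with only the given CPU's bit set -- the format both
# /proc/irq/<n>/smp_affinity and taskset accept.
cpu_to_mask() { printf '%x\n' $((1 << $1)); }

irq_mask=$(cpu_to_mask 0)    # device IRQ on CPU0
app_mask=$(cpu_to_mask 1)    # netserver on CPU1 (different-CPU case)
echo "$irq_mask $app_mask"   # prints: 1 2
# On a live receiver (as root), roughly:
#   echo "$irq_mask" > /proc/irq/<irq>/smp_affinity
#   taskset "$app_mask" netserver
```

For the same-CPU experiments, both masks are simply the same value.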

Also, can you tell me how to handle my device, which passes a simple
16-bit checksum across the entire frame (excluding the first 14 bytes),
via GRO?  Simply setting skb->ip_summed = CHECKSUM_COMPLETE leads
to a "hw csum failure".
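For context on what "hw csum failure" means: CHECKSUM_COMPLETE tells the
stack that skb->csum holds the ones'-complement sum of the packet starting
at skb->data, so a hardware sum taken over the wrong byte range fails
verification. A minimal sketch of that arithmetic (the helper names here
are illustrative, not kernel or Myricom firmware APIs):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071-style ones'-complement sum over a byte range -- the kind of
 * value a NIC hands up for CHECKSUM_COMPLETE.  Illustrative only. */
static uint32_t ones_sum(const uint8_t *buf, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	return sum;
}

/* Fold carries back into 16 bits, as the kernel's csum_fold() does. */
static uint16_t ones_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* The sum the stack expects: everything after the 14-byte Ethernet
 * header, i.e. from the network header onward. */
static uint16_t frame_csum(const uint8_t *frame, size_t len)
{
	return ones_fold(ones_sum(frame + 14, len - 14, 0));
}
```

If the hardware's sum covers a different range, or headers are pulled
without adjusting skb->csum, the stack's own fold disagrees and it logs
the "hw csum failure" warning.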

I've attached my work-in-progress patch so you can see what I'm doing.
I do not want this applied due to performance and correctness issues.

Thanks for your help,

Drew

Comments

Herbert Xu April 22, 2009, 10:48 a.m. UTC | #1
On Tue, Apr 21, 2009 at 03:19:14PM -0400, Andrew Gallatin wrote:
>
> So to compare inet_lro and GRO, I'm binding the netserver and device IRQ
> to the same CPU.  When I do this, that CPU is saturated and GRO is
> roughly 17% slower than inet_lro.  For comparison, here are netperf
> results from a fast peer sending to my weak machine (AMD Athlon(tm) 64
> X2 Dual Core Processor 3800+, 2GHz).  First inet_lro:
>
> Recv   Send    Send                          Utilization       Service Demand
> Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
> Size   Size    Size     Time     Throughput  local    remote   local   remote
> bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
>
>  87380  65536  65536    60.02      6631.52   12.45    50.10    0.308   1.238
>
> And now GRO:
>  87380  65536  65536    60.01      5488.99   9.79     50.00    0.292   1.492

I was sure I had tested my case with the IRQs bound (using a cxgb3),
but when I tried it again today GRO was indeed slower (8G vs. 9.4G)!
I fiddled with it all day and couldn't figure out why this was
so.  We weren't spending any more time in the GRO code than LRO,
and in fact we were aggregating more packets with GRO (700k segments
instead of 900k segments).  GRO was also sending far fewer ACKs
than LRO.

It finally dawned on me that my sender had been upgraded from 2.6.18
to 2.6.30-rc1.  Indeed, rebooting into 2.6.18 seems to restore
the balance between GRO and LRO.  I wonder if the ACK reduction
has anything to do with this.

Hopefully tomorrow I'll get my hands on a Myricom NIC and try to
replicate your problem.

In the meantime, can you see if there is any disparity in the
number of aggregated segments and ACKs between GRO and LRO?
netstat -s should be sufficient to measure this (TCP segments
received and sent).
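A minimal sketch of that measurement, run here against a captured sample so
the parsing is reproducible (on a live system, replace the here-string with
the output of `netstat -s`; the field positions assume the net-tools
wording quoted elsewhere in this thread):

```shell
# Pull the two TCP counters out of netstat -s style output.
sample='    2301987 segments received
    570331 segments send out
    12345 active connections openings'
recv=$(printf '%s\n' "$sample" | awk '/segments received/ {print $1}')
sent=$(printf '%s\n' "$sample" | awk '/segments send out/ {print $1}')
echo "received=$recv sent=$sent"   # received=2301987 sent=570331
```

Snapshot the counters before and after each netperf run and diff them: the
received count falls as aggregation improves, while the sent count on a
pure receiver roughly tracks how many ACKs it generated.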

Thanks,
Andrew Gallatin April 22, 2009, 3:37 p.m. UTC | #2
Herbert Xu wrote:

 >
 > In the mean time, can you see if there is any disparity in the
 > number of aggregated segments and ACKs between GRO and LRO?
 > netstat -s should be sufficient to measure this (TCP segments
 > received and sent).

I booted the sender into a kernel.org 2.6.18.2 kernel to get results
as close to yours as possible (I was running 2.6.22 on the
sender before).

I ran 2 sets of experiments, with different CPU bindings.  First
I bound the netserver and IRQ to the same CPU:

LRO:
2301987 segments received
570331  segments send out

Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  65536  65536    60.01      6637.79   10.07    49.99    0.249   1.234

GRO:
2035181 segments received
493042  segments send out

 87380  65536  65536    60.01      5768.21   8.60     49.98    0.244   1.420


Then I bound them to different CPUs, so as to get close to line rate:


LRO:
3165013 segments received
1763169 segments send out
 87380  65536  65536    60.01      9473.27   15.75    49.58    0.272   0.858


GRO:
3032484 segments received
2265453 segments send out
 87380  65536  65536    60.01      9472.69   15.64    48.73    0.270   0.843


Do you know what is broken with respect to CPU utilization in recent
kernels?  If I bind the IRQ to CPU0, then watch mpstat I see
zero load on that CPU:

% mpstat -P 0 1
Linux 2.6.30-rc1 (venice)       04/22/09

11:25:25     CPU   %user   %nice %system %iowait    %irq   %soft   %idle    intr/s
11:25:26       0    0.00    0.00    0.00    0.00    0.00    0.00  100.00  13248.00
11:25:27       0    0.00    0.00    0.00    0.00    0.00    0.00  100.00  13280.00

Common sense tells me that is wrong, and oprofile verifies there is
a lot happening on CPU0.  This makes it hard to use netperf's
service demand to compare LRO and GRO.

When I run a cpu-soaker in usermode bound to CPU0, I start to see
irq, softirq, etc:

11:28:02     CPU   %user   %nice %system %iowait    %irq   %soft   %idle    intr/s
11:28:03       0   45.10    0.00    0.00    0.00    1.96   52.94    0.00  13019.61
11:28:04       0   46.46    0.00    0.00    0.00    2.02   51.52    0.00  13414.14


If I use this as a poor-man's way to measure CPU load on the CPU running
the softirq, then it's clear that GRO is using a bit more CPU than LRO.
The above mpstat output is from LRO, and this is from GRO:

11:29:16       0   39.60    0.00    0.00    0.00    2.97   57.43    0.00  13146.53
11:29:17       0   38.00    0.00    0.00    0.00    2.00   60.00    0.00  13278.00
11:29:18       0   39.00    0.00    0.00    0.00    4.00   57.00    0.00  13273.00

Once we have the checksum issue worked out, either GRO or my driver
will be using even more CPU as it will need to verify the partial
checksums.  Remember that my current patch is just setting
CHECKSUM_UNNECESSARY to get around the checksum problem I was seeing.

Drew


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu April 23, 2009, 8 a.m. UTC | #3
On Tue, Apr 21, 2009 at 03:19:14PM -0400, Andrew Gallatin wrote:
>
> -	if (mgp->csum_flag) {
> -		if ((skb->protocol == htons(ETH_P_IP)) ||
> -		    (skb->protocol == htons(ETH_P_IPV6))) {
> -			skb->csum = csum;
> -			skb->ip_summed = CHECKSUM_COMPLETE;
> -		} else
> -			myri10ge_vlan_ip_csum(skb, csum);
> -	}
> -	netif_receive_skb(skb);
> +	rx_frags[0].page_offset += MXGEFW_PAD;
> +	rx_frags[0].size -= MXGEFW_PAD;
> +	len -= MXGEFW_PAD;
> +	skb_shinfo(skb)->nr_frags = i;
> +	skb->len = len;
> +	skb->data_len = len;
> +	skb->truesize += len;
> +	/*      skb->ip_summed = CHECKSUM_COMPLETE; */

You need to set skb->csum, just as you did for the non-LRO case.

Cheers,
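The non-LRO path (and the old myri10ge_get_frag_header()) does exactly
that: it carries the hardware sum in skb->csum and, for VLAN frames,
strips the VLAN header's contribution with csum_sub() before the stack
verifies it. A standalone sketch of the ones'-complement subtraction
(simplified byte-wise helpers, not the kernel's optimized
csum_partial()/csum_sub()):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified ones'-complement sum of a byte range, folded to 16 bits. */
static uint16_t csum16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones'-complement subtraction: a - b == a + ~b, with end-around carry.
 * This mirrors what csum_sub() achieves when removing the VLAN header's
 * bytes from a whole-frame hardware checksum. */
static uint16_t csum16_sub(uint16_t a, uint16_t b)
{
	uint32_t sum = (uint32_t)a + (uint16_t)~b;

	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

Because ones'-complement addition is associative, subtracting the sum of
the VLAN header bytes from the whole-frame sum yields the sum of the
remaining bytes, which is what the stack expects in skb->csum.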

Patch

--- /home/gallatin/linux/git/net-next-2.6/drivers/net/myri10ge/myri10ge.c	2009-04-21 14:00:22.166783937 -0400
+++ linux-tmp/drivers/net/myri10ge/myri10ge.c	2009-04-21 15:08:06.241539153 -0400
@@ -75,7 +75,7 @@ 
 #include "myri10ge_mcp.h"
 #include "myri10ge_mcp_gen_header.h"
 
-#define MYRI10GE_VERSION_STR "1.4.4-1.412"
+#define MYRI10GE_VERSION_STR "1.4.4-1.413"
 
 MODULE_DESCRIPTION("Myricom 10G driver (10GbE)");
 MODULE_AUTHOR("Maintainer: help@myri.com");
@@ -161,8 +161,6 @@  struct myri10ge_rx_done {
 	dma_addr_t bus;
 	int cnt;
 	int idx;
-	struct net_lro_mgr lro_mgr;
-	struct net_lro_desc lro_desc[MYRI10GE_MAX_LRO_DESCRIPTORS];
 };
 
 struct myri10ge_slice_netstats {
@@ -1272,11 +1270,11 @@  myri10ge_unmap_rx_page(struct pci_dev *p
 
 static inline int
 myri10ge_rx_done(struct myri10ge_slice_state *ss, struct myri10ge_rx_buf *rx,
-		 int bytes, int len, __wsum csum)
+		 struct napi_struct *napi, int bytes, int len, __wsum csum)
 {
 	struct myri10ge_priv *mgp = ss->mgp;
 	struct sk_buff *skb;
-	struct skb_frag_struct rx_frags[MYRI10GE_MAX_FRAGS_PER_FRAME];
+	struct skb_frag_struct *rx_frags;
 	int i, idx, hlen, remainder;
 	struct pci_dev *pdev = mgp->pdev;
 	struct net_device *dev = mgp->dev;
@@ -1286,6 +1284,20 @@  myri10ge_rx_done(struct myri10ge_slice_s
 	idx = rx->cnt & rx->mask;
 	va = page_address(rx->info[idx].page) + rx->info[idx].page_offset;
 	prefetch(va);
+
+	skb = napi_get_frags(napi);
+	if (unlikely(!skb)) {
+		for (i = 0, remainder = len; remainder > 0; i++) {
+			myri10ge_unmap_rx_page(pdev, &rx->info[idx], bytes);
+			put_page(rx->info[idx].page);
+			rx->cnt++;
+			idx = rx->cnt & rx->mask;
+			remainder -= MYRI10GE_ALLOC_SIZE;
+		}
+		return 0;
+	}
+	rx_frags = skb_shinfo(skb)->frags;
+
 	/* Fill skb_frag_struct(s) with data from our receive */
 	for (i = 0, remainder = len; remainder > 0; i++) {
 		myri10ge_unmap_rx_page(pdev, &rx->info[idx], bytes);
@@ -1300,52 +1311,18 @@  myri10ge_rx_done(struct myri10ge_slice_s
 		remainder -= MYRI10GE_ALLOC_SIZE;
 	}
 
-	if (mgp->csum_flag && myri10ge_lro) {
-		rx_frags[0].page_offset += MXGEFW_PAD;
-		rx_frags[0].size -= MXGEFW_PAD;
-		len -= MXGEFW_PAD;
-		lro_receive_frags(&ss->rx_done.lro_mgr, rx_frags,
-				  /* opaque, will come back in get_frag_header */
-				  len, len,
-				  (void *)(__force unsigned long)csum, csum);
-
-		return 1;
-	}
-
-	hlen = MYRI10GE_HLEN > len ? len : MYRI10GE_HLEN;
-
-	/* allocate an skb to attach the page(s) to. This is done
-	 * after trying LRO, so as to avoid skb allocation overheads */
-
-	skb = netdev_alloc_skb(dev, MYRI10GE_HLEN + 16);
-	if (unlikely(skb == NULL)) {
-		ss->stats.rx_dropped++;
-		do {
-			i--;
-			put_page(rx_frags[i].page);
-		} while (i != 0);
-		return 0;
-	}
-
-	/* Attach the pages to the skb, and trim off any padding */
-	myri10ge_rx_skb_build(skb, va, rx_frags, len, hlen);
-	if (skb_shinfo(skb)->frags[0].size <= 0) {
-		put_page(skb_shinfo(skb)->frags[0].page);
-		skb_shinfo(skb)->nr_frags = 0;
-	}
-	skb->protocol = eth_type_trans(skb, dev);
-	skb_record_rx_queue(skb, ss - &mgp->ss[0]);
-
-	if (mgp->csum_flag) {
-		if ((skb->protocol == htons(ETH_P_IP)) ||
-		    (skb->protocol == htons(ETH_P_IPV6))) {
-			skb->csum = csum;
-			skb->ip_summed = CHECKSUM_COMPLETE;
-		} else
-			myri10ge_vlan_ip_csum(skb, csum);
-	}
-	netif_receive_skb(skb);
+	rx_frags[0].page_offset += MXGEFW_PAD;
+	rx_frags[0].size -= MXGEFW_PAD;
+	len -= MXGEFW_PAD;
+	skb_shinfo(skb)->nr_frags = i;
+	skb->len = len;
+	skb->data_len = len;
+	skb->truesize += len;
+	/*      skb->ip_summed = CHECKSUM_COMPLETE; */
+	skb->ip_summed = CHECKSUM_UNNECESSARY;	/* XXXXXX */
+	napi_gro_frags(napi);
 	return 1;
+
 }
 
 static inline void
@@ -1418,7 +1395,8 @@  myri10ge_tx_done(struct myri10ge_slice_s
 }
 
 static inline int
-myri10ge_clean_rx_done(struct myri10ge_slice_state *ss, int budget)
+myri10ge_clean_rx_done(struct myri10ge_slice_state *ss,
+		       struct napi_struct *napi, int budget)
 {
 	struct myri10ge_rx_done *rx_done = &ss->rx_done;
 	struct myri10ge_priv *mgp = ss->mgp;
@@ -1438,10 +1416,12 @@  myri10ge_clean_rx_done(struct myri10ge_s
 		checksum = csum_unfold(rx_done->entry[idx].checksum);
 		if (length <= mgp->small_bytes)
 			rx_ok = myri10ge_rx_done(ss, &ss->rx_small,
+						 napi,
 						 mgp->small_bytes,
 						 length, checksum);
 		else
 			rx_ok = myri10ge_rx_done(ss, &ss->rx_big,
+						 napi,
 						 mgp->big_bytes,
 						 length, checksum);
 		rx_packets += rx_ok;
@@ -1455,9 +1435,6 @@  myri10ge_clean_rx_done(struct myri10ge_s
 	ss->stats.rx_packets += rx_packets;
 	ss->stats.rx_bytes += rx_bytes;
 
-	if (myri10ge_lro)
-		lro_flush_all(&rx_done->lro_mgr);
-
 	/* restock receive rings if needed */
 	if (ss->rx_small.fill_cnt - ss->rx_small.cnt < myri10ge_fill_thresh)
 		myri10ge_alloc_rx_pages(mgp, &ss->rx_small,
@@ -1522,7 +1499,7 @@  static int myri10ge_poll(struct napi_str
 #endif
 
 	/* process as many rx events as NAPI will allow */
-	work_done = myri10ge_clean_rx_done(ss, budget);
+	work_done = myri10ge_clean_rx_done(ss, napi, budget);
 
 	if (work_done < budget) {
 		napi_complete(napi);
@@ -1762,9 +1739,7 @@  static const char myri10ge_gstrings_slic
 	"----------- slice ---------",
 	"tx_pkt_start", "tx_pkt_done", "tx_req", "tx_done",
 	"rx_small_cnt", "rx_big_cnt",
-	"wake_queue", "stop_queue", "tx_linearized", "LRO aggregated",
-	    "LRO flushed",
-	"LRO avg aggr", "LRO no_desc"
+	"wake_queue", "stop_queue", "tx_linearized"
 };
 
 #define MYRI10GE_NET_STATS_LEN      21
@@ -1863,14 +1838,6 @@  myri10ge_get_ethtool_stats(struct net_de
 		data[i++] = (unsigned int)ss->tx.wake_queue;
 		data[i++] = (unsigned int)ss->tx.stop_queue;
 		data[i++] = (unsigned int)ss->tx.linearized;
-		data[i++] = ss->rx_done.lro_mgr.stats.aggregated;
-		data[i++] = ss->rx_done.lro_mgr.stats.flushed;
-		if (ss->rx_done.lro_mgr.stats.flushed)
-			data[i++] = ss->rx_done.lro_mgr.stats.aggregated /
-			    ss->rx_done.lro_mgr.stats.flushed;
-		else
-			data[i++] = 0;
-		data[i++] = ss->rx_done.lro_mgr.stats.no_desc;
 	}
 }
 
@@ -2198,67 +2165,6 @@  static void myri10ge_free_irq(struct myr
 		pci_disable_msix(pdev);
 }
 
-static int
-myri10ge_get_frag_header(struct skb_frag_struct *frag, void **mac_hdr,
-			 void **ip_hdr, void **tcpudp_hdr,
-			 u64 * hdr_flags, void *priv)
-{
-	struct ethhdr *eh;
-	struct vlan_ethhdr *veh;
-	struct iphdr *iph;
-	u8 *va = page_address(frag->page) + frag->page_offset;
-	unsigned long ll_hlen;
-	/* passed opaque through lro_receive_frags() */
-	__wsum csum = (__force __wsum) (unsigned long)priv;
-
-	/* find the mac header, aborting if not IPv4 */
-
-	eh = (struct ethhdr *)va;
-	*mac_hdr = eh;
-	ll_hlen = ETH_HLEN;
-	if (eh->h_proto != htons(ETH_P_IP)) {
-		if (eh->h_proto == htons(ETH_P_8021Q)) {
-			veh = (struct vlan_ethhdr *)va;
-			if (veh->h_vlan_encapsulated_proto != htons(ETH_P_IP))
-				return -1;
-
-			ll_hlen += VLAN_HLEN;
-
-			/*
-			 *  HW checksum starts ETH_HLEN bytes into
-			 *  frame, so we must subtract off the VLAN
-			 *  header's checksum before csum can be used
-			 */
-			csum = csum_sub(csum, csum_partial(va + ETH_HLEN,
-							   VLAN_HLEN, 0));
-		} else {
-			return -1;
-		}
-	}
-	*hdr_flags = LRO_IPV4;
-
-	iph = (struct iphdr *)(va + ll_hlen);
-	*ip_hdr = iph;
-	if (iph->protocol != IPPROTO_TCP)
-		return -1;
-	if (iph->frag_off & htons(IP_MF | IP_OFFSET))
-		return -1;
-	*hdr_flags |= LRO_TCP;
-	*tcpudp_hdr = (u8 *) (*ip_hdr) + (iph->ihl << 2);
-
-	/* verify the IP checksum */
-	if (unlikely(ip_fast_csum((u8 *) iph, iph->ihl)))
-		return -1;
-
-	/* verify the  checksum */
-	if (unlikely(csum_tcpudp_magic(iph->saddr, iph->daddr,
-				       ntohs(iph->tot_len) - (iph->ihl << 2),
-				       IPPROTO_TCP, csum)))
-		return -1;
-
-	return 0;
-}
-
 static int myri10ge_get_txrx(struct myri10ge_priv *mgp, int slice)
 {
 	struct myri10ge_cmd cmd;
@@ -2329,7 +2235,6 @@  static int myri10ge_open(struct net_devi
 	struct myri10ge_cmd cmd;
 	int i, status, big_pow2, slice;
 	u8 *itable;
-	struct net_lro_mgr *lro_mgr;
 
 	if (mgp->running != MYRI10GE_ETH_STOPPED)
 		return -EBUSY;
@@ -2450,19 +2355,6 @@  static int myri10ge_open(struct net_devi
 			goto abort_with_rings;
 		}
 
-		lro_mgr = &ss->rx_done.lro_mgr;
-		lro_mgr->dev = dev;
-		lro_mgr->features = LRO_F_NAPI;
-		lro_mgr->ip_summed = CHECKSUM_COMPLETE;
-		lro_mgr->ip_summed_aggr = CHECKSUM_UNNECESSARY;
-		lro_mgr->max_desc = MYRI10GE_MAX_LRO_DESCRIPTORS;
-		lro_mgr->lro_arr = ss->rx_done.lro_desc;
-		lro_mgr->get_frag_header = myri10ge_get_frag_header;
-		lro_mgr->max_aggr = myri10ge_lro_max_pkts;
-		lro_mgr->frag_align_pad = 2;
-		if (lro_mgr->max_aggr > MAX_SKB_FRAGS)
-			lro_mgr->max_aggr = MAX_SKB_FRAGS;
-
 		/* must happen prior to any irq */
 		napi_enable(&(ss)->napi);
 	}
@@ -3910,7 +3802,7 @@  static int myri10ge_probe(struct pci_dev
 
 	if (dac_enabled)
 		netdev->features |= NETIF_F_HIGHDMA;
-
+	netdev->features |= NETIF_F_GRO;
 	/* make sure we can get an irq, and that MSI can be
 	 * setup (if available).  Also ensure netdev->irq
 	 * is set to correct value if MSI is enabled */