Message ID: 20100115053352.31564.765.sendpatchset@krkumar2.in.ibm.com
State: RFC, archived
Delegated to: David Miller
From: Krishna Kumar <krkumar2@in.ibm.com>
Date: Fri, 15 Jan 2010 11:03:52 +0530

> From: Krishna Kumar <krkumar2@in.ibm.com>
>
> Remove inline skb data in tcp_sendmsg(). For the few devices that
> don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> and pass the penalty to those slow devices (the following drivers
> do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> cris/eth_v10.c).

I was really surprised to see r8169.c in that list.

It even has all the code in its ->ndo_start_xmit() method to build
fragments properly and handle segmented SKBs; it simply doesn't set
NETIF_F_SG in dev->features for whatever reason.

Bonding is on your list, but it does indeed support NETIF_F_SG as
long as all of its slaves do. See bond_compute_features() and how it
uses netdev_increment_features() over the slaves.

Anyways...

> This patch does not affect devices that support SG but turn it off
> via ethtool after register_netdev.
>
> I ran the following test cases with iperf - #threads: 1 4 8 16 32
> 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> System is a 4-proc Opteron with a Chelsio 10gbps NIC. Results (BW
> figures are the aggregate across 5 iterations in mbps):
...
> Please review if the idea is acceptable.
>
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>

So how badly does it hurt performance for a chip that doesn't
support NETIF_F_SG?

That's what people will complain about if this goes in.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 15 Jan 2010 00:36:36 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> > Remove inline skb data in tcp_sendmsg(). For the few devices that
> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> > and pass the penalty to those slow devices (the following drivers
> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> > cris/eth_v10.c).
>
> I was really surprised to see r8169.c in that list.
>
> It even has all the code in its ->ndo_start_xmit() method to build
> fragments properly and handle segmented SKBs; it simply doesn't set
> NETIF_F_SG in dev->features for whatever reason.

The same thing goes for via-velocity.c; it's turned on via ethtool
though (ethtool_op_set_sg).

// Simon
From: Simon Kagstrom <simon.kagstrom@netinsight.net>
Date: Fri, 15 Jan 2010 09:50:49 +0100

> On Fri, 15 Jan 2010 00:36:36 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
>
>> > Remove inline skb data in tcp_sendmsg(). For the few devices that
>> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
>> > and pass the penalty to those slow devices (the following drivers
>> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
>> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
>> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
>> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
>> > cris/eth_v10.c).
>>
>> I was really surprised to see r8169.c in that list.
>>
>> It even has all the code in its ->ndo_start_xmit() method to build
>> fragments properly and handle segmented SKBs; it simply doesn't set
>> NETIF_F_SG in dev->features for whatever reason.
>
> The same thing goes for via-velocity.c; it's turned on via ethtool
> though (ethtool_op_set_sg).

Indeed, see my reply to Krishna's ethtool_op_set_sg() patch.

I think it's a cruddy way to do things; SG ought to be on by default
always, unless it is defective. And if it's defective, support should
be removed entirely.
On Fri, 15 Jan 2010 00:52:24 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> > The same thing goes for via-velocity.c; it's turned on via ethtool
> > though (ethtool_op_set_sg).
>
> Indeed, see my reply to Krishna's ethtool_op_set_sg() patch.
>
> I think it's a cruddy way to do things; SG ought to be on by default
> always, unless it is defective. And if it's defective, support
> should be removed entirely.

I kept it off by default since I didn't see any big improvement in my
tests (negative in some, positive in some). But I suppose you're
right.

// Simon
> David Miller <davem@davemloft.net>
> Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
>
> From: Krishna Kumar <krkumar2@in.ibm.com>
> Date: Fri, 15 Jan 2010 11:03:52 +0530
>
> > From: Krishna Kumar <krkumar2@in.ibm.com>
> >
> > Remove inline skb data in tcp_sendmsg(). For the few devices that
> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> > and pass the penalty to those slow devices (the following drivers
> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> > cris/eth_v10.c).
>
> I was really surprised to see r8169.c in that list.
>
> It even has all the code in its ->ndo_start_xmit() method to build
> fragments properly and handle segmented SKBs; it simply doesn't set
> NETIF_F_SG in dev->features for whatever reason.

I didn't notice this driver, but had checked via-velocity and found
that it too had support but was not setting this bit.

> Bonding is on your list, but it does indeed support NETIF_F_SG as
> long as all of its slaves do. See bond_compute_features() and how it
> uses netdev_increment_features() over the slaves.
>
> Anyways...
>
> > This patch does not affect devices that support SG but turn it off
> > via ethtool after register_netdev.
> >
> > I ran the following test cases with iperf - #threads: 1 4 8 16 32
> > 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> > 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> > System is a 4-proc Opteron with a Chelsio 10gbps NIC. Results (BW
> > figures are the aggregate across 5 iterations in mbps):
> ...
> > Please review if the idea is acceptable.
> >
> > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
>
> So how badly does it hurt performance for a chip that doesn't
> support NETIF_F_SG?
>
> That's what people will complain about if this goes in.

I don't know how much performance will drop for those 15-18 drivers
listed above, since I don't have access to those devices.

Thanks,

- KK
From: Simon Kagstrom <simon.kagstrom@netinsight.net>
Date: Fri, 15 Jan 2010 10:00:11 +0100

> On Fri, 15 Jan 2010 00:52:24 -0800 (PST)
> David Miller <davem@davemloft.net> wrote:
>
>> > The same thing goes for via-velocity.c; it's turned on via
>> > ethtool though (ethtool_op_set_sg).
>>
>> Indeed, see my reply to Krishna's ethtool_op_set_sg() patch.
>>
>> I think it's a cruddy way to do things; SG ought to be on by
>> default always, unless it is defective. And if it's defective,
>> support should be removed entirely.
>
> I kept it off by default since I didn't see any big improvement in
> my tests (negative in some, positive in some). But I suppose you're
> right.

Well, it has to provide significantly better performance for
sendfile() (especially wrt. cpu utilization), because we avoid the
copy out of the page cache pages entirely.
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Fri, 15 Jan 2010 14:50:04 +0530

> I wonder if there is some other way to test it. I could test it on
> the card I have, cxgbe, by ethtool F_SG off, and then testing
> this patch with existing code (both with ethtool F_SG off)? Will
> that be enough to get an idea, or I cannot assume this is
> reasonable for real non-sg drivers? I am sure there is a
> degradation, and mentioned that part as a "penalty" for those
> drivers in my patch.

I think such a test would provide useful data by which to judge this
change.
Sorry, I sent off the reply sooner than I wanted to. Continued at the
end...

Krishna Kumar2/India/IBM wrote on 01/15/2010 02:33:25 PM:

> Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
>
> > David Miller <davem@davemloft.net>
> >
> > Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
> >
> > From: Krishna Kumar <krkumar2@in.ibm.com>
> > Date: Fri, 15 Jan 2010 11:03:52 +0530
> >
> > > From: Krishna Kumar <krkumar2@in.ibm.com>
> > >
> > > Remove inline skb data in tcp_sendmsg(). For the few devices
> > > that don't support NETIF_F_SG, dev_queue_xmit will call
> > > skb_linearize, and pass the penalty to those slow devices (the
> > > following drivers do not support NETIF_F_SG: 8139cp.c,
> > > amd8111e.c, dl2k.c, dm9000.c, dnet.c, ethoc.c, ibmveth.c,
> > > ioc3-eth.c, macb.c, ps3_gelic_net.c, r8169.c, rionet.c,
> > > spider_net.c, tsi108_eth.c, veth.c, via-velocity.c,
> > > atlx/atl2.c, bonding/bond_main.c, can/dev.c, cris/eth_v10.c).
> >
> > I was really surprised to see r8169.c in that list.
> >
> > It even has all the code in its ->ndo_start_xmit() method to
> > build fragments properly and handle segmented SKBs; it simply
> > doesn't set NETIF_F_SG in dev->features for whatever reason.
>
> I didn't notice this driver, but had checked via-velocity and found
> that it too had support but was not setting this bit.
>
> > Bonding is on your list, but it does indeed support NETIF_F_SG as
> > long as all of its slaves do. See bond_compute_features() and how
> > it uses netdev_increment_features() over the slaves.
> >
> > Anyways...
> >
> > > This patch does not affect devices that support SG but turn it
> > > off via ethtool after register_netdev.
> > >
> > > I ran the following test cases with iperf - #threads: 1 4 8 16
> > > 32 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case
> > > runs for 1 minute, repeat 5 iterations. Total test run time is
> > > 6 hours. System is a 4-proc Opteron with a Chelsio 10gbps NIC.
> > > Results (BW figures are the aggregate across 5 iterations in
> > > mbps):
> > ...
> > > Please review if the idea is acceptable.
> > >
> > > Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
> >
> > So how badly does it hurt performance for a chip that doesn't
> > support NETIF_F_SG?
> >
> > That's what people will complain about if this goes in.
>
> I don't know how much performance will drop for those 15-18 drivers
> listed above, since I don't have access to those devices.

I wonder if there is some other way to test it. I could test it on
the card I have, cxgbe, by ethtool F_SG off, and then testing this
patch with existing code (both with ethtool F_SG off)? Will that be
enough to get an idea, or I cannot assume this is reasonable for real
non-sg drivers? I am sure there is a degradation, and mentioned that
part as a "penalty" for those drivers in my patch.

Thanks,

- KK
Krishna Kumar wrote:

> From: Krishna Kumar <krkumar2@in.ibm.com>
>
> Remove inline skb data in tcp_sendmsg(). For the few devices that
> don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> and pass the penalty to those slow devices (the following drivers
> do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> cris/eth_v10.c).
>
> This patch does not affect devices that support SG but turn it off
> via ethtool after register_netdev.
>
> I ran the following test cases with iperf - #threads: 1 4 8 16 32
> 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> System is a 4-proc Opteron with a Chelsio 10gbps NIC. Results (BW
> figures are the aggregate across 5 iterations in mbps):
>
> -------------------------------------------------------
> #Process   I/O Size   Org-BW   New-BW   %-change
> -------------------------------------------------------
> 1          256        2098     2147     2.33
> 1          4K         14057    14269    1.50
> 1          16K        25984    27317    5.13
> 1          64K        25920    27539    6.24
> ...
> 256        256        1947     1955     0.41
> 256        4K         9828     12265    24.79
> 256        16K        25087    24977    -0.43
> 256        64K        26715    27997    4.79
> -------------------------------------------------------
> Total:     -          600071   634906   5.80
> -------------------------------------------------------

Does bandwidth alone convey the magnitude of the change? I would
think that would only be the case if the CPU(s) were 100% utilized,
and perhaps not even completely then. At the risk of a shameless
plug, it's not for nothing that netperf reports service demand :)

I would think that change in service demand (CPU per unit of work)
would be something one wants to see. Also, the world does not run on
bandwidth alone, so small packet performance and any delta there
would be good to have.
Multiple process tests may not be as easy in netperf as in iperf, but
under:

ftp://ftp.netperf.org/netperf/misc

I have a single-stream test script I use called runemomni.sh and an
example of its output, as well as an aggregate script I use called
runemomniagg2.sh - I'll post an example of its output there as soon
as I finish some runs. The script presumes one has ./configure'd
netperf:

./configure --enable-burst --enable-omni ...

The netperf omni tests still ass-u-me that the CPU util each measures
is all his own, which means the service demands from aggregate tests
require some post-processing fixup.

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance

happy benchmarking,

rick jones

FWIW, service demand and pps performance may be even more important
for non-SG devices, because they may be slow 1 Gig devices and still
hit link-rate on a bulk throughput test even with a non-trivial
increase in CPU util. However, a non-trivial hit in CPU util may
rather change the pps performance.

PPS - there is a *lot* of output in those omni test results - best
viewed with a spreadsheet program.
Rick Jones <rick.jones2@hp.com> wrote on 01/15/2010 11:43:06 PM:

> > Remove inline skb data in tcp_sendmsg(). For the few devices that
> > don't support NETIF_F_SG, dev_queue_xmit will call skb_linearize,
> > and pass the penalty to those slow devices (the following drivers
> > do not support NETIF_F_SG: 8139cp.c, amd8111e.c, dl2k.c, dm9000.c,
> > dnet.c, ethoc.c, ibmveth.c, ioc3-eth.c, macb.c, ps3_gelic_net.c,
> > r8169.c, rionet.c, spider_net.c, tsi108_eth.c, veth.c,
> > via-velocity.c, atlx/atl2.c, bonding/bond_main.c, can/dev.c,
> > cris/eth_v10.c).
> >
> > This patch does not affect devices that support SG but turn it off
> > via ethtool after register_netdev.
> >
> > I ran the following test cases with iperf - #threads: 1 4 8 16 32
> > 64 128 192 256, I/O sizes: 256 4K 16K 64K, each test case runs for
> > 1 minute, repeat 5 iterations. Total test run time is 6 hours.
> > System is a 4-proc Opteron with a Chelsio 10gbps NIC. Results (BW
> > figures are the aggregate across 5 iterations in mbps):
> >
> > -------------------------------------------------------
> > #Process   I/O Size   Org-BW   New-BW   %-change
> > -------------------------------------------------------
> > 1          256        2098     2147     2.33
> > 1          4K         14057    14269    1.50
> > 1          16K        25984    27317    5.13
> > 1          64K        25920    27539    6.24
> > ...
> > 256        256        1947     1955     0.41
> > 256        4K         9828     12265    24.79
> > 256        16K        25087    24977    -0.43
> > 256        64K        26715    27997    4.79
> > -------------------------------------------------------
> > Total:     -          600071   634906   5.80
> > -------------------------------------------------------
>
> Does bandwidth alone convey the magnitude of the change? I would
> think that would only be the case if the CPU(s) were 100% utilized,
> and perhaps not even completely then. At the risk of a shameless
> plug, it's not for nothing that netperf reports service demand :)
>
> I would think that change in service demand (CPU per unit of work)
> would be something one wants to see.
>
> Also, the world does not run on bandwidth alone, so small packet
> performance and any delta there would be good to have.
>
> Multiple process tests may not be as easy in netperf as in iperf,
> but under:
>
> ftp://ftp.netperf.org/netperf/misc
>
> I have a single-stream test script I use called runemomni.sh and an
> example of its output, as well as an aggregate script I use called
> runemomniagg2.sh - I'll post an example of its output there as soon
> as I finish some runs.

I usually run netperf for a smaller number of threads and aggregate
the output using some scripts. I will try what you suggested above,
and see if I can get consistent results for higher numbers of
processes. Thanks for the links.

- KK

> The script presumes one has ./configure'd netperf:
>
> ./configure --enable-burst --enable-omni ...
>
> The netperf omni tests still ass-u-me that the CPU util each
> measures is all his own, which means the service demands from
> aggregate tests require some post-processing fixup.
>
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance
>
> happy benchmarking,
>
> rick jones
>
> FWIW, service demand and pps performance may be even more important
> for non-SG devices, because they may be slow 1 Gig devices and still
> hit link-rate on a bulk throughput test even with a non-trivial
> increase in CPU util. However, a non-trivial hit in CPU util may
> rather change the pps performance.
>
> PPS - there is a *lot* of output in those omni test results - best
> viewed with a spreadsheet program.
Krishna Kumar2 wrote:

> Rick Jones <rick.jones2@hp.com> wrote on 01/15/2010 11:43:06 PM:
>
>> Multiple process tests may not be as easy in netperf as in iperf,
>> but under:
>>
>> ftp://ftp.netperf.org/netperf/misc
>>
>> I have a single-stream test script I use called runemomni.sh and
>> an example of its output, as well as an aggregate script I use
>> called runemomniagg2.sh - I'll post an example of its output there
>> as soon as I finish some runs.
>
> I usually run netperf for a smaller number of threads and aggregate
> the output using some scripts. I will try what you suggested above,
> and see if I can get consistent results for higher numbers of
> processes. Thanks for the links.

You're welcome - the output of the aggregate script is up there too
now. Generally I run three systems for the aggregate tests and then
include the TCP_MAERTS stuff, but in this case, since I had only two
otherwise identical systems, TCP_MAERTS would have been redundant.

rick jones
Hi Dave,

> David Miller <davem@davemloft.net> wrote on 01/15/2010 02:48:29 PM:
>
> Re: [RFC] [PATCH] Optimize TCP sendmsg in favour of fast devices?
>
> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Fri, 15 Jan 2010 14:50:04 +0530
>
> > I wonder if there is some other way to test it. I could test it
> > on the card I have, cxgbe, by ethtool F_SG off, and then testing
> > this patch with existing code (both with ethtool F_SG off)? Will
> > that be enough to get an idea, or I cannot assume this is
> > reasonable for real non-sg drivers? I am sure there is a
> > degradation, and mentioned that part as a "penalty" for those
> > drivers in my patch.
>
> I think such a test would provide useful data by which to judge
> this change.

I had to remove the F_SG flag from the cxgb3 driver (using ethtool
didn't show any difference in performance, since GSO was enabled on
the device due to register_netdev setting it). Testing shows a drop
of 25% in performance with this patch for a non-SG device; the extra
alloc/memcpy is showing up.

For the SG driver, I get a good performance gain (not anywhere close
to 25%, though). What do you suggest?

Thanks,

- KK
From: Krishna Kumar2 <krkumar2@in.ibm.com>
Date: Wed, 20 Jan 2010 17:49:18 +0530

> I had to remove the F_SG flag from the cxgb3 driver (using ethtool
> didn't show any difference in performance, since GSO was enabled on
> the device due to register_netdev setting it). Testing shows a drop
> of 25% in performance with this patch for a non-SG device; the
> extra alloc/memcpy is showing up.
>
> For the SG driver, I get a good performance gain (not anywhere
> close to 25%, though). What do you suggest?

I don't think we can add your change if it hurts non-SG devices that
much.
David Miller <davem@davemloft.net> wrote:

> From: Krishna Kumar2 <krkumar2@in.ibm.com>
> Date: Wed, 20 Jan 2010 17:49:18 +0530
>
>> I had to remove the F_SG flag from the cxgb3 driver (using ethtool
>> didn't show any difference in performance, since GSO was enabled
>> on the device due to register_netdev setting it). Testing shows a
>> drop of 25% in performance with this patch for a non-SG device;
>> the extra alloc/memcpy is showing up.
>>
>> For the SG driver, I get a good performance gain (not anywhere
>> close to 25%, though). What do you suggest?
>
> I don't think we can add your change if it hurts non-SG devices
> that much.

Wait, we need to be careful when testing this. Non-SG devices do
actually benefit from TSO, which they otherwise cannot access.

If you unset the F_SG bit, then that would disable TSO too. So you
need to enable GSO to compensate. So Krishna, did you check with
tcpdump to see if GSO was really enabled with SG off?

IIRC, when I did a similar test with e1000 back when I wrote this,
the performance of GSO with SG off was pretty much the same as no GSO
with SG off.

Cheers,
Hi Herbert,

> Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/21/2010 03:11 PM

Sorry for the late response.

> >> I had to remove the F_SG flag from the cxgb3 driver (using
> >> ethtool didn't show any difference in performance, since GSO was
> >> enabled on the device due to register_netdev setting it).
> >> Testing shows a drop of 25% in performance with this patch for a
> >> non-SG device; the extra alloc/memcpy is showing up.
> >>
> >> For the SG driver, I get a good performance gain (not anywhere
> >> close to 25%, though). What do you suggest?
> >
> > I don't think we can add your change if it hurts non-SG devices
> > that much.
>
> Wait, we need to be careful when testing this. Non-SG devices do
> actually benefit from TSO, which they otherwise cannot access.
>
> If you unset the F_SG bit, then that would disable TSO too. So you
> need to enable GSO to compensate. So Krishna, did you check with
> tcpdump to see if GSO was really enabled with SG off?

OK, I unset F_SG and set F_GSO (in the driver). With this, tcpdump
shows GSO is enabled - the TCP packet size builds up to 65160 bytes.

I ran 5 serial netperf's with 16K and another 5 serial netperf's with
64K I/O sizes, and the aggregate results are:

0. Driver unsets F_SG but sets F_GSO:
        Original code with 16K: 19471.65
        New code with 16K:      19409.70
        Original code with 64K: 21357.23
        New code with 64K:      22050.42

To recap the other tests I did today:

1. Driver unsets F_SG, and with GSO off:
        Original code with 16K: 10123.56
        New code with 16K:      7111.12
        Original code with 64K: 11568.99
        New code with 64K:      7611.37

2. Driver unsets F_SG and uses ethtool to set GSO:
        Original code with 16K: 18864.38
        New code with 16K:      18465.54
        Original code with 64K: 21005.43
        New code with 64K:      22529.24

Thanks,

- KK

> IIRC, when I did a similar test with e1000 back when I wrote this,
> the performance of GSO with SG off was pretty much the same as no
> GSO with SG off.
> Krishna Kumar2/India/IBM wrote on 01/27/2010 12:42 PM
>
> > Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/21/2010 03:11 PM
>
> Sorry for the late response.
>
> > >> I had to remove the F_SG flag from the cxgb3 driver (using
> > >> ethtool didn't show any difference in performance, since GSO
> > >> was enabled on the device due to register_netdev setting it).
> > >> Testing shows a drop of 25% in performance with this patch for
> > >> a non-SG device; the extra alloc/memcpy is showing up.
> > >>
> > >> For the SG driver, I get a good performance gain (not anywhere
> > >> close to 25%, though). What do you suggest?
> > >
> > > I don't think we can add your change if it hurts non-SG devices
> > > that much.
> >
> > Wait, we need to be careful when testing this. Non-SG devices do
> > actually benefit from TSO, which they otherwise cannot access.
> >
> > If you unset the F_SG bit, then that would disable TSO too. So
> > you need to enable GSO to compensate. So Krishna, did you check
> > with tcpdump to see if GSO was really enabled with SG off?
>
> OK, I unset F_SG and set F_GSO (in the driver). With this, tcpdump
> shows GSO is enabled - the TCP packet size builds up to 65160 bytes.

I should have mentioned this too - if I unset F_SG in the cxgb3
driver and nothing else, ethtool -k still shows GSO is set, and
tcpdump shows a max packet size of 1448. If I additionally set GSO in
the driver, then ethtool still has the same output, but tcpdump shows
a max packet size of 65160.

Thanks,

- KK

> I ran 5 serial netperf's with 16K and another 5 serial netperf's
> with 64K I/O sizes, and the aggregate results are:
>
> 0. Driver unsets F_SG but sets F_GSO:
>         Original code with 16K: 19471.65
>         New code with 16K:      19409.70
>         Original code with 64K: 21357.23
>         New code with 64K:      22050.42
>
> To recap the other tests I did today:
>
> 1. Driver unsets F_SG, and with GSO off:
>         Original code with 16K: 10123.56
>         New code with 16K:      7111.12
>         Original code with 64K: 11568.99
>         New code with 64K:      7611.37
>
> 2. Driver unsets F_SG and uses ethtool to set GSO:
>         Original code with 16K: 18864.38
>         New code with 16K:      18465.54
>         Original code with 64K: 21005.43
>         New code with 64K:      22529.24
>
> > IIRC, when I did a similar test with e1000 back when I wrote
> > this, the performance of GSO with SG off was pretty much the same
> > as no GSO with SG off.
On Wed, Jan 27, 2010 at 12:42:12PM +0530, Krishna Kumar2 wrote:
>
> OK, I unset F_SG and set F_GSO (in the driver). With this, tcpdump
> shows GSO is enabled - the TCP packet size builds up to 65160
> bytes.
>
> I ran 5 serial netperf's with 16K and another 5 serial netperf's
> with 64K I/O sizes, and the aggregate results are:
>
> 0. Driver unsets F_SG but sets F_GSO:
>         Original code with 16K: 19471.65
>         New code with 16K:      19409.70
>         Original code with 64K: 21357.23
>         New code with 64K:      22050.42

OK, this is more in line with what I was expecting, namely that
enabling GSO is actually beneficial even without SG.

It would be good to get the CPU utilisation figures so we can see the
complete picture.

Cheers,
On Wed, Jan 27, 2010 at 03:12:48PM +0530, Krishna Kumar2 wrote:
>
> I should have mentioned this too - if I unset F_SG in the cxgb3
> driver and nothing else, ethtool -k still shows GSO is set, and
> tcpdump shows a max packet size of 1448. If I additionally set GSO
> in the driver, then ethtool still has the same output, but tcpdump
> shows a max packet size of 65160.

This sounds like a bug.

Cheers,
> Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/29/2010 02:36:25 PM:
>
> > I ran 5 serial netperf's with 16K and another 5 serial netperf's
> > with 64K I/O sizes, and the aggregate results are:
> >
> > 0. Driver unsets F_SG but sets F_GSO:
> >         Original code with 16K: 19471.65
> >         New code with 16K:      19409.70
> >         Original code with 64K: 21357.23
> >         New code with 64K:      22050.42
>
> OK, this is more in line with what I was expecting, namely that
> enabling GSO is actually beneficial even without SG.
>
> It would be good to get the CPU utilisation figures so we can see
> the complete picture.

Same 5 runs of single netperf's:

0. Driver unsets F_SG but sets F_GSO:
        Org (16K): BW: 18180.71   SD: 13.485
        New (16K): BW: 18113.15   SD: 13.551
        Org (64K): BW: 21980.28   SD: 10.306
        New (64K): BW: 21386.59   SD: 10.447

1. Driver unsets F_SG, and with GSO off:
        Org (16K): BW: 10894.62   SD: 26.591
        New (16K): BW: 7262.10    SD: 35.340
        Org (64K): BW: 12396.41   SD: 23.357
        New (64K): BW: 7853.02    SD: 32.405

2. Driver unsets F_SG and uses ethtool to set GSO:
        Org (16K): BW: 18094.11   SD: 13.603
        New (16K): BW: 17952.38   SD: 13.743
        Org (64K): BW: 21540.78   SD: 10.771
        New (64K): BW: 21818.35   SD: 10.598

> > I should have mentioned this too - if I unset F_SG in the cxgb3
> > driver and nothing else, ethtool -k still shows GSO is set, and
> > tcpdump shows a max packet size of 1448. If I additionally set
> > GSO in the driver, then ethtool still has the same output, but
> > tcpdump shows a max packet size of 65160.
>
> This sounds like a bug.

Yes, an ethtool bug (version 6). In test case #1 above, I wrote that
GSO is off, but ethtool "thinks" it is on (after a modprobe -r cxgb3;
modprobe cxgb3). So for test #2, I simply ran "ethtool ... gso on",
and GSO is now really on in the kernel, explaining the better
results.

thanks,

- KK
On Fri, Jan 29, 2010 at 04:45:01PM +0530, Krishna Kumar2 wrote:
>
> Same 5 runs of single netperf's:
>
> 0. Driver unsets F_SG but sets F_GSO:
>         Org (16K): BW: 18180.71   SD: 13.485
>         New (16K): BW: 18113.15   SD: 13.551
>         Org (64K): BW: 21980.28   SD: 10.306
>         New (64K): BW: 21386.59   SD: 10.447
>
> 1. Driver unsets F_SG, and with GSO off:
>         Org (16K): BW: 10894.62   SD: 26.591
>         New (16K): BW: 7262.10    SD: 35.340
>         Org (64K): BW: 12396.41   SD: 23.357
>         New (64K): BW: 7853.02    SD: 32.405
>
> 2. Driver unsets F_SG and uses ethtool to set GSO:
>         Org (16K): BW: 18094.11   SD: 13.603
>         New (16K): BW: 17952.38   SD: 13.743
>         Org (64K): BW: 21540.78   SD: 10.771
>         New (64K): BW: 21818.35   SD: 10.598

Hmm, any idea what is causing case 0 to be different from case 2? In
particular, the 64K performance in case 0 appears to be a regression,
but in case 2 it's showing up as an improvement.

AFAICS these two cases should produce identical results, or is this
just jitter across tests?

Thanks,
Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/29/2010 05:03:46 PM:

> > Same 5 runs of single netperf's:
> >
> > 0. Driver unsets F_SG but sets F_GSO:
> >	Org (16K): BW: 18180.71  SD: 13.485
> >	New (16K): BW: 18113.15  SD: 13.551
> >	Org (64K): BW: 21980.28  SD: 10.306
> >	New (64K): BW: 21386.59  SD: 10.447
> >
> > 1. Driver unsets F_SG, and with GSO off:
> >	Org (16K): BW: 10894.62  SD: 26.591
> >	New (16K): BW: 7262.10   SD: 35.340
> >	Org (64K): BW: 12396.41  SD: 23.357
> >	New (64K): BW: 7853.02   SD: 32.405
> >
> > 2. Driver unsets F_SG and uses ethtool to set GSO:
> >	Org (16K): BW: 18094.11  SD: 13.603
> >	New (16K): BW: 17952.38  SD: 13.743
> >	Org (64K): BW: 21540.78  SD: 10.771
> >	New (64K): BW: 21818.35  SD: 10.598
>
> Hmm, any idea what is causing case 0 to be different from case 2?
> In particular, the 64K performance in case 0 appears to be a
> regression but in case 2 it's showing up as an improvement.
>
> AFAICS these two cases should produce identical results, or is
> this just jitter across tests?

You are right about the jitter. I have run this many times; most of
the time #0 and #2 are almost identical, but sometimes they vary a
bit.

Also, about my earlier ethtool comment:

> Yes, an ethtool bug (version 6). In test case #1 above, I noted
> that GSO is off even though ethtool "thinks" it is on (after a
> modprobe -r cxgb3; modprobe cxgb3). So for test #2, I simply ran
> "ethtool ... gso on", and GSO is now really on in the kernel,
> which explains the better results.

Hmmm, it seems I had a bad ethtool build. I built the latest version
to debug this problem, and it shows the settings correctly.

thanks,

- KK
Krishna Kumar2 wrote:
>>Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/29/2010 02:36:25 PM:
>>
>>>I ran 5 serial netperf's with 16K and another 5 serial netperfs
>>>with 64K I/O sizes, and the aggregate result is:
>>>
>>>0. Driver unsets F_SG but sets F_GSO:
>>>	Original code with 16K: 19471.65
>>>	New code with 16K: 19409.70
>>>	Original code with 64K: 21357.23
>>>	New code with 64K: 22050.42
>>
>>OK this is more in line with what I was expecting, namely that
>>enabling GSO is actually beneficial even without SG.
>>
>>It would be good to get the CPU utilisation figures so we can
>>see the complete picture.
>
> Same 5 runs of single netperf's:
>
> 0. Driver unsets F_SG but sets F_GSO:
>	Org (16K): BW: 18180.71  SD: 13.485
>	New (16K): BW: 18113.15  SD: 13.551
>	Org (64K): BW: 21980.28  SD: 10.306
>	New (64K): BW: 21386.59  SD: 10.447
>
> 1. Driver unsets F_SG, and with GSO off:
>	Org (16K): BW: 10894.62  SD: 26.591
>	New (16K): BW: 7262.10   SD: 35.340
>	Org (64K): BW: 12396.41  SD: 23.357
>	New (64K): BW: 7853.02   SD: 32.405
>
> 2. Driver unsets F_SG and uses ethtool to set GSO:
>	Org (16K): BW: 18094.11  SD: 13.603
>	New (16K): BW: 17952.38  SD: 13.743
>	Org (64K): BW: 21540.78  SD: 10.771
>	New (64K): BW: 21818.35  SD: 10.598

Just a slight change in service demand there... For those
unfamiliar, service demand in netperf is the microseconds of
non-idle CPU time per KB of data transferred. Smaller is better.

happy benchmarking,

rick jones
Herbert Xu wrote:
> On Fri, Jan 29, 2010 at 04:45:01PM +0530, Krishna Kumar2 wrote:
>
>>Same 5 runs of single netperf's:
>>
>>0. Driver unsets F_SG but sets F_GSO:
>>	Org (16K): BW: 18180.71  SD: 13.485
>>	New (16K): BW: 18113.15  SD: 13.551
>>	Org (64K): BW: 21980.28  SD: 10.306
>>	New (64K): BW: 21386.59  SD: 10.447
>>
>>1. Driver unsets F_SG, and with GSO off:
>>	Org (16K): BW: 10894.62  SD: 26.591
>>	New (16K): BW: 7262.10   SD: 35.340
>>	Org (64K): BW: 12396.41  SD: 23.357
>>	New (64K): BW: 7853.02   SD: 32.405
>>
>>2. Driver unsets F_SG and uses ethtool to set GSO:
>>	Org (16K): BW: 18094.11  SD: 13.603
>>	New (16K): BW: 17952.38  SD: 13.743
>>	Org (64K): BW: 21540.78  SD: 10.771
>>	New (64K): BW: 21818.35  SD: 10.598
>
> Hmm, any idea what is causing case 0 to be different from case 2?
> In particular, the 64K performance in case 0 appears to be a
> regression but in case 2 it's showing up as an improvement.
>
> AFAICS these two cases should produce identical results, or is
> this just jitter across tests?

To get some idea of the run-to-run variation without running
multiple explicit netperf commands and doing the statistical work
afterwards, one can add global command-line arguments to netperf:

netperf ... -i 30,3 -I 99,<width> ...

This tells netperf to run at least 3 iterations (the smallest
minimum netperf accepts) and no more than 30 iterations (the
largest maximum netperf accepts), attempting to be 99% confident
that the mean for throughput (and for CPU utilization if -c and/or
-C are present and a global -r is not) is within +/- width/2
percent. For example:

netperf -H remote -i 30,3 -I 99,0.5 -c -C

will attempt to be 99% certain that the means it reports for
throughput and for local and remote CPU utilization are within
+/- 0.25% of the actual mean. If, after 30 iterations, it has not
achieved that confidence, it will emit warnings giving the width of
the confidence intervals it did achieve.

happy benchmarking,

rick jones
diff -ruNp org/net/ipv4/tcp.c new/net/ipv4/tcp.c
--- org/net/ipv4/tcp.c	2010-01-13 10:43:27.000000000 +0530
+++ new/net/ipv4/tcp.c	2010-01-13 10:43:37.000000000 +0530
@@ -876,26 +876,6 @@ ssize_t tcp_sendpage(struct socket *sock
 #define TCP_PAGE(sk)	(sk->sk_sndmsg_page)
 #define TCP_OFF(sk)	(sk->sk_sndmsg_off)
 
-static inline int select_size(struct sock *sk, int sg)
-{
-	struct tcp_sock *tp = tcp_sk(sk);
-	int tmp = tp->mss_cache;
-
-	if (sg) {
-		if (sk_can_gso(sk))
-			tmp = 0;
-		else {
-			int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
-
-			if (tmp >= pgbreak &&
-			    tmp <= pgbreak + (MAX_SKB_FRAGS - 1) * PAGE_SIZE)
-				tmp = pgbreak;
-		}
-	}
-
-	return tmp;
-}
-
 int tcp_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 		size_t size)
 {
@@ -905,7 +885,7 @@ int tcp_sendmsg(struct kiocb *iocb, stru
 	struct sk_buff *skb;
 	int iovlen, flags;
 	int mss_now, size_goal;
-	int sg, err, copied;
+	int err, copied;
 	long timeo;
 
 	lock_sock(sk);
@@ -933,8 +913,6 @@ int tcp_sendmsg(struct kiocb *iocb, stru
 	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 		goto out_err;
 
-	sg = sk->sk_route_caps & NETIF_F_SG;
-
 	while (--iovlen >= 0) {
 		int seglen = iov->iov_len;
 		unsigned char __user *from = iov->iov_base;
@@ -944,6 +922,8 @@ int tcp_sendmsg(struct kiocb *iocb, stru
 		while (seglen > 0) {
 			int copy = 0;
 			int max = size_goal;
+			int merge, i, off;
+			struct page *page;
 
 			skb = tcp_write_queue_tail(sk);
 			if (tcp_send_head(sk)) {
@@ -954,14 +934,11 @@ int tcp_sendmsg(struct kiocb *iocb, stru
 			if (copy <= 0) {
 new_segment:
-				/* Allocate new segment. If the interface is SG,
-				 * allocate skb fitting to single page.
-				 */
+				/* Allocate new segment with a single page */
 				if (!sk_stream_memory_free(sk))
 					goto wait_for_sndbuf;
 
-				skb = sk_stream_alloc_skb(sk,
-						select_size(sk, sg),
+				skb = sk_stream_alloc_skb(sk, 0,
 						sk->sk_allocation);
 				if (!skb)
 					goto wait_for_memory;
@@ -981,84 +958,77 @@ new_segment:
 			if (copy > seglen)
 				copy = seglen;
 
-			/* Where to copy to? */
-			if (skb_tailroom(skb) > 0) {
-				/* We have some space in skb head. Superb! */
-				if (copy > skb_tailroom(skb))
-					copy = skb_tailroom(skb);
-				if ((err = skb_add_data(skb, from, copy)) != 0)
-					goto do_fault;
-			} else {
-				int merge = 0;
-				int i = skb_shinfo(skb)->nr_frags;
-				struct page *page = TCP_PAGE(sk);
-				int off = TCP_OFF(sk);
-
-				if (skb_can_coalesce(skb, i, page, off) &&
-				    off != PAGE_SIZE) {
-					/* We can extend the last page
-					 * fragment. */
-					merge = 1;
-				} else if (i == MAX_SKB_FRAGS || !sg) {
-					/* Need to add new fragment and cannot
-					 * do this because interface is non-SG,
-					 * or because all the page slots are
-					 * busy. */
-					tcp_mark_push(tp, skb);
-					goto new_segment;
-				} else if (page) {
-					if (off == PAGE_SIZE) {
-						put_page(page);
-						TCP_PAGE(sk) = page = NULL;
-						off = 0;
-					}
-				} else
-					off = 0;
+			merge = 0;
+			i = skb_shinfo(skb)->nr_frags;
+			page = TCP_PAGE(sk);
+			off = TCP_OFF(sk);
+
+			if (skb_can_coalesce(skb, i, page, off) &&
+			    off != PAGE_SIZE) {
+				/* We can extend the last page
+				 * fragment. */
+				merge = 1;
+			} else if (i == MAX_SKB_FRAGS) {
+				/*
+				 * Need to add new fragment and cannot
+				 * do this because all the page slots are
+				 * busy. For the (rare) non-SG devices,
+				 * dev_queue_xmit handles this skb.
+				 */
+				tcp_mark_push(tp, skb);
+				goto new_segment;
+			} else if (page) {
+				if (off == PAGE_SIZE) {
+					put_page(page);
+					TCP_PAGE(sk) = page = NULL;
+					off = 0;
+				}
+			} else
+				off = 0;
 
-				if (copy > PAGE_SIZE - off)
-					copy = PAGE_SIZE - off;
+			if (copy > PAGE_SIZE - off)
+				copy = PAGE_SIZE - off;
 
-				if (!sk_wmem_schedule(sk, copy))
-					goto wait_for_memory;
+			if (!sk_wmem_schedule(sk, copy))
+				goto wait_for_memory;
 
-				if (!page) {
-					/* Allocate new cache page. */
-					if (!(page = sk_stream_alloc_page(sk)))
-						goto wait_for_memory;
-				}
+			if (!page) {
+				/* Allocate new cache page. */
+				if (!(page = sk_stream_alloc_page(sk)))
+					goto wait_for_memory;
+			}
 
-				/* Time to copy data. We are close to
-				 * the end! */
-				err = skb_copy_to_page(sk, from, skb, page,
-						       off, copy);
-				if (err) {
-					/* If this page was new, give it to the
-					 * socket so it does not get leaked.
-					 */
-					if (!TCP_PAGE(sk)) {
-						TCP_PAGE(sk) = page;
-						TCP_OFF(sk) = 0;
-					}
-					goto do_error;
+			/* Time to copy data. We are close to
+			 * the end! */
+			err = skb_copy_to_page(sk, from, skb, page,
+					       off, copy);
+			if (err) {
+				/* If this page was new, give it to the
+				 * socket so it does not get leaked.
+				 */
+				if (!TCP_PAGE(sk)) {
+					TCP_PAGE(sk) = page;
+					TCP_OFF(sk) = 0;
 				}
+				goto do_error;
+			}
 
-				/* Update the skb. */
-				if (merge) {
-					skb_shinfo(skb)->frags[i - 1].size +=
-									copy;
-				} else {
-					skb_fill_page_desc(skb, i, page, off, copy);
-					if (TCP_PAGE(sk)) {
-						get_page(page);
-					} else if (off + copy < PAGE_SIZE) {
-						get_page(page);
-						TCP_PAGE(sk) = page;
-					}
+			/* Update the skb. */
+			if (merge) {
+				skb_shinfo(skb)->frags[i - 1].size +=
+								copy;
+			} else {
+				skb_fill_page_desc(skb, i, page, off, copy);
+				if (TCP_PAGE(sk)) {
+					get_page(page);
+				} else if (off + copy < PAGE_SIZE) {
+					get_page(page);
+					TCP_PAGE(sk) = page;
 				}
-
-				TCP_OFF(sk) = off + copy;
 			}
 
+			TCP_OFF(sk) = off + copy;
+
 			if (!copied)
 				TCP_SKB_CB(skb)->flags &= ~TCPCB_FLAG_PSH;
@@ -1101,16 +1071,6 @@ out:
 	release_sock(sk);
 	return copied;
 
-do_fault:
-	if (!skb->len) {
-		tcp_unlink_write_queue(skb, sk);
-		/* It is the one place in all of TCP, except connection
-		 * reset, where we can be unlinking the send_head.
-		 */
-		tcp_check_send_head(sk, skb);
-		sk_wmem_free_skb(sk, skb);
-	}
-
 do_error:
 	if (copied)
 		goto out;