
[RFC,net-next,0/3] net: batched receive in GRO path

Message ID 7920e85c-439e-0622-46f8-0602cf37e306@solarflare.com


Edward Cree July 9, 2019, 7:27 p.m. UTC
This series listifies part of GRO processing, in a manner which allows those
 packets which are not GROed (i.e. for which dev_gro_receive returns
 GRO_NORMAL) to be passed on to the listified regular receive path.
dev_gro_receive() itself is not listified, nor the per-protocol GRO
 callback, since GRO's need to hold packets on lists under napi->gro_hash
 makes keeping the packets on other lists awkward, and since the GRO control
 block state of held skbs can refer only to one 'new' skb at a time.
Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb
 onto a list in the napi struct; that list is passed up the stack at the end
 of the napi poll, or sooner if its length exceeds the (new) sysctl
 net.core.gro_normal_batch.
Unlike my previous design ([1]), this does not require changes in drivers,
 and also does not prevent the re-use of napi->skb after GRO_MERGED_FREE or
 GRO_DROP.

Performance figures with this series, collected on a back-to-back pair of
 Solarflare sfn8522-r2 NICs with 120-second NetPerf tests.  In the stats,
 sample size n for old and new code is 6 runs each; p is from a Welch t-test.
Tests were run both with GRO enabled and disabled, the latter simulating
 uncoalesceable packets (e.g. due to IP or TCP options).  The receive side
 (which was the device under test) had the NetPerf process pinned to one CPU,
 and the device interrupts pinned to a second CPU.  CPU utilisation figures
 (used in cases of line-rate performance) are summed across all CPUs.
Where not specified (as batch=), net.core.gro_normal_batch was set to 8.
The net-next baseline used for these tests was commit 7d30a7f6424e.
TCP 4 streams, GRO on: all results line rate (9.415Gbps)
net-next: 210.3% cpu
after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
after #3: 191.7% cpu (- 8.9%, p=0.102 vs net-next)
TCP 4 streams, GRO off:
after #1: 7.785 Gbps
after #3: 8.387 Gbps (+ 7.7%, p=0.215 vs #1, but note *)
TCP 1 stream, GRO on: all results line rate & ~200% cpu.
TCP 1 stream, GRO off:
after #1: 6.444 Gbps
after #3: 7.363 Gbps (+14.3%, p=0.003 vs #1)
batch=16: 7.199 Gbps
batch= 4: 7.354 Gbps
batch= 0: 5.899 Gbps
TCP 100 RR, GRO off:
net-next: 995.083 us
after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)

(*) These tests produced a mixture of line-rate and below-line-rate results,
 meaning that statistically speaking the results were 'censored' by the
 upper bound, and were thus not normally distributed, making a Welch t-test
 mathematically invalid.  I therefore also calculated estimators according
 to [2], which gave the following:
after #1: 8.155 Gbps
after #3: 8.716 Gbps (+ 6.9%, p=0.291 vs #1)
(though my procedure for determining ν wasn't mathematically well-founded
 either, so take that p-value with a grain of salt).

Conclusion:
* Patch #1 is a fairly unambiguous improvement.
* Patch #3 has no statistically significant effect when GRO is active.
* Any effect of patch #3 on latency is within statistical noise.
* When GRO is inactive, patch #3 improves bandwidth, though for multiple
  streams the effect is smaller (possibly owing to the line-rate limit).
* The optimal batch size for this setup appears to be around 8.
* Setting the batch size to zero gives worse performance than before the
  patch; perhaps a static key is needed?
* Drivers which, unlike sfc, pass UDP traffic to GRO would expect to see a
  benefit from gaining access to batching.

Notes for future thought: in principle if we passed the napi pointer to
 napi_gro_complete(), it could add its superframe skb to napi->rx_list,
 rather than immediately netif_receive_skb_internal()ing it.  Without that
 I'm not sure if there's a possibility of OoO between normal and GROed SKBs
 on the same flow.

[1]: http://patchwork.ozlabs.org/cover/997844/
[2]: Cohen 1959, doi: 10.1080/00401706.1959.10489859

Edward Cree (3):
  sfc: don't score irq moderation points for GRO
  sfc: falcon: don't score irq moderation points for GRO
  net: use listified RX for handling GRO_NORMAL skbs

 drivers/net/ethernet/sfc/falcon/rx.c |  5 +----
 drivers/net/ethernet/sfc/rx.c        |  5 +----
 include/linux/netdevice.h            |  3 +++
 net/core/dev.c                       | 32 ++++++++++++++++++++++++++--
 net/core/sysctl_net_core.c           |  8 +++++++
 5 files changed, 43 insertions(+), 10 deletions(-)

Comments

Paolo Abeni July 10, 2019, 7:27 a.m. UTC | #1
Hi,

On Tue, 2019-07-09 at 20:27 +0100, Edward Cree wrote:
> Where not specified (as batch=), net.core.gro_normal_batch was set to 8.
> The net-next baseline used for these tests was commit 7d30a7f6424e.
> TCP 4 streams, GRO on: all results line rate (9.415Gbps)
> net-next: 210.3% cpu
> after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
> after #3: 191.7% cpu (- 8.9%, p=0.102 vs net-next)
> TCP 4 streams, GRO off:
> after #1: 7.785 Gbps
> after #3: 8.387 Gbps (+ 7.7%, p=0.215 vs #1, but note *)
> TCP 1 stream, GRO on: all results line rate & ~200% cpu.
> TCP 1 stream, GRO off:
> after #1: 6.444 Gbps
> after #3: 7.363 Gbps (+14.3%, p=0.003 vs #1)
> batch=16: 7.199 Gbps
> batch= 4: 7.354 Gbps
> batch= 0: 5.899 Gbps
> TCP 100 RR, GRO off:
> net-next: 995.083 us
> after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
> after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)
> 
> (*) These tests produced a mixture of line-rate and below-line-rate results,
>  meaning that statistically speaking the results were 'censored' by the
>  upper bound, and were thus not normally distributed, making a Welch t-test
>  mathematically invalid.  I therefore also calculated estimators according
>  to [2], which gave the following:
> after #1: 8.155 Gbps
> after #3: 8.716 Gbps (+ 6.9%, p=0.291 vs #1)
> (though my procedure for determining ν wasn't mathematically well-founded
>  either, so take that p-value with a grain of salt).

I'm toying with a patch similar to your 3/3 (the most relevant difference
being the lack of a limit on the batch size), on top of ixgbe (which
sends all the pkts to the GRO engine), and I'm observing more
mixed results (UDP only):

* when a single rx queue is running, I see a just-above-noise
performance delta
* when multiple rx queues are running, I observe measurable regressions
(note: I use small pkts, still well under line rate even with multiple
rx queues)

I'll try to test your patch in the following days.

Side note: I think that in patch 3/3, it's necessary to add a call to
gro_normal_list() also inside napi_busy_loop().

Cheers,

Paolo
Edward Cree July 10, 2019, 2:52 p.m. UTC | #2
On 10/07/2019 08:27, Paolo Abeni wrote:
> I'm toying with a patch similar to your 3/3 (most relevant difference
> being the lack of a limit to the batch size), on top of ixgbe (which
> sends all the pkts to the GRO engine), and I'm observing more
> controversial results (UDP only):
>
> * when a single rx queue is running, I see a just-above-noise
> peformance delta
> * when multiple rx queues are running, I observe measurable regressions
> (note: I use small pkts, still well under line rate even with multiple
> rx queues)
>
> I'll try to test your patch in the following days.
I look forward to it.

> Side note: I think that in patch 3/3, it's necessary to add a call to
> gro_normal_list() also inside napi_busy_loop().
Hmm, I was caught out by the call to napi_poll() actually being a local
 function pointer, not the static function of the same name.  How did a
 shadow like that ever get allowed?
But in that case I _really_ don't understand napi_busy_loop(); nothing
 in it seems to ever flush GRO, so it's relying on either
 (1) stuff getting flushed because the bucket runs out of space, or
 (2) the next napi poll after busy_poll_stop() doing the flush.
What am I missing, and where exactly in napi_busy_loop() should the
 gro_normal_list() call go?

-Ed
Eric Dumazet July 10, 2019, 3:41 p.m. UTC | #3
On 7/10/19 4:52 PM, Edward Cree wrote:

> Hmm, I was caught out by the call to napi_poll() actually being a local
>  function pointer, not the static function of the same name.  How did a
>  shadow like that ever get allowed?
> But in that case I _really_ don't understand napi_busy_loop(); nothing
>  in it seems to ever flush GRO, so it's relying on either
>  (1) stuff getting flushed because the bucket runs out of space, or
>  (2) the next napi poll after busy_poll_stop() doing the flush.
> What am I missing, and where exactly in napi_busy_loop() should the
>  gro_normal_list() call go?

Please look at busy_poll_stop()
Edward Cree July 10, 2019, 4:47 p.m. UTC | #4
On 10/07/2019 16:41, Eric Dumazet wrote:
> On 7/10/19 4:52 PM, Edward Cree wrote:
>> Hmm, I was caught out by the call to napi_poll() actually being a local
>>  function pointer, not the static function of the same name.  How did a
>>  shadow like that ever get allowed?
>> But in that case I _really_ don't understand napi_busy_loop(); nothing
>>  in it seems to ever flush GRO, so it's relying on either
>>  (1) stuff getting flushed because the bucket runs out of space, or
>>  (2) the next napi poll after busy_poll_stop() doing the flush.
>> What am I missing, and where exactly in napi_busy_loop() should the
>>  gro_normal_list() call go?
> Please look at busy_poll_stop()
I did look there, but now I've looked again and harder, and I think I get it:
busy_poll_stop() calls napi->poll(), which (eventually, possibly in the
 subsequent poll that we schedule if rc == budget) calls napi_complete_done()
 which does the flush.
In which case, the same applies to the napi->rx_list, which similarly gets
 handled in napi_complete_done().  So I don't think napi_busy_loop() needs a
 gro_normal_list() adding to it(?)

As a general rule, I think we need to gro_normal_list() in those places, and
 only those places, that call napi_gro_flush().  But as I mentioned in the
 patch 3/3 description, I'm still confused by the (few) drivers that call
 napi_gro_flush() themselves.

-Ed
Eric Dumazet July 10, 2019, 5:39 p.m. UTC | #5
On 7/10/19 6:47 PM, Edward Cree wrote:
> On 10/07/2019 16:41, Eric Dumazet wrote:
>> On 7/10/19 4:52 PM, Edward Cree wrote:
>>> Hmm, I was caught out by the call to napi_poll() actually being a local
>>>  function pointer, not the static function of the same name.  How did a
>>>  shadow like that ever get allowed?
>>> But in that case I _really_ don't understand napi_busy_loop(); nothing
>>>  in it seems to ever flush GRO, so it's relying on either
>>>  (1) stuff getting flushed because the bucket runs out of space, or
>>>  (2) the next napi poll after busy_poll_stop() doing the flush.
>>> What am I missing, and where exactly in napi_busy_loop() should the
>>>  gro_normal_list() call go?
>> Please look at busy_poll_stop()
> I did look there, but now I've looked again and harder, and I think I get it:
> busy_poll_stop() calls napi->poll(), which (eventually, possibly in the
>  subsequent poll that we schedule if rc == budget) calls napi_complete_done()
>  which does the flush.
> In which case, the same applies to the napi->rx_list, which similarly gets
>  handled in napi_complete_done().  So I don't think napi_busy_loop() needs a
>  gro_normal_list() adding to it(?)

I advise you to try busypoll then, with TCP_RR, with say 50 usec delay in /proc/sys/net/core/busy_read

Holding a small packet in the list up to the point we call busy_poll_stop()
will basically make busypoll not work anymore.

napi_complete_done() has special behavior when busy polling is active.


> 
> As a general rule, I think we need to gro_normal_list() in those places, and
>  only those places, that call napi_gro_flush().  But as I mentioned in the
>  patch 3/3 description, I'm still confused by the (few) drivers that call
>  napi_gro_flush() themselves.
> 
> -Ed
>
Edward Cree July 12, 2019, 3:59 p.m. UTC | #6
On 10/07/2019 18:39, Eric Dumazet wrote:
> Holding a small packet in the list up to the point we call busy_poll_stop()
> will basically make busypoll non working anymore.
>
> napi_complete_done() has special behavior when busy polling is active.
Yep, I get it now, sorry for being dumb :)
Essentially we're saying that things coalesced by GRO are 'bulk' traffic and
 can wait around, but the rest is the stuff we're polling for, for low latency.
I'm putting a gro_normal_list() call after the trace_napi_poll() in
 napi_busy_loop() and testing that, let's see how it goes...
Eric Dumazet July 12, 2019, 4:48 p.m. UTC | #7
On 7/12/19 5:59 PM, Edward Cree wrote:
> On 10/07/2019 18:39, Eric Dumazet wrote:
>> Holding a small packet in the list up to the point we call busy_poll_stop()
>> will basically make busypoll non working anymore.
>>
>> napi_complete_done() has special behavior when busy polling is active.
> Yep, I get it now, sorry for being dumb :)
> Essentially we're saying that things coalesced by GRO are 'bulk' traffic and
>  can wait around, 


GRO can still be beneficial even when busypolling, since TCP stack
will send a single ACK back, and a read()/recv() will copy the whole train
instead of a single MSS.

I should have mentioned that we have a patch that I forgot to upstream adding
the PSH flag to all TSO packets, meaning the receiver can automatically learn
the boundary of a GRO packet and not have to wait for the napi->poll() end
(busypolling or not)

> but the rest is the stuff we're polling for for low latency.
> I'm putting a gro_normal_list() call after the trace_napi_poll() in
>  napi_busy_loop() and testing that, let's see how it goes...
>
David Miller July 12, 2019, 6:30 p.m. UTC | #8
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 12 Jul 2019 18:48:29 +0200

> I should have mentioned that we have a patch that I forgot to
> upstream adding the PSH flag to all TSO packets, meaning the
> receiver can automatically learn the boundary of a GRO packet and
> not have to wait for the napi->poll() end (busypolling or not)

Wow, I thought we were already doing this...
Edward Cree July 24, 2019, 9:49 p.m. UTC | #9
On 12/07/2019 17:48, Eric Dumazet wrote:
>> but the rest is the stuff we're polling for for low latency.
>> I'm putting a gro_normal_list() call after the trace_napi_poll() in
>>  napi_busy_loop() and testing that, let's see how it goes...
One thing that's causing me some uncertainty: busy_poll_stop() does a
 napi->poll(), which can potentially gro_normal_one() something.  But
 when I tried to put a gro_normal_list() just after that, I ran into
 list corruption because it could race against the one in
 napi_complete_done().  I'm not entirely sure how, my current theory
 goes something like:
- clear_bit(IN_BUSY_POLL)
- task switch, start napi poll
- get as far as starting gro_normal_list()
- task switch back to busy_poll_stop()
- local_bh_disable()
- do a napi poll
- start gro_normal_list()
- list corruption ensues as we have two instances of
  netif_receive_skb_list_internal() trying to consume the same list
But I may be wildly mistaken.
Questions that arise from that:
1) Is it safe to potentially be adding to the rx_list (gro_normal_one(),
   which in theory can end up calling gro_normal_list() as well) within
   busy_poll_stop()?  I haven't ever seen a splat from that, but it seems
   every bit as possible as what I have been seeing.
2) Why does busy_poll_stop() not do its local_bh_disable() *before*
   clearing napi state bits, which (if I'm understanding correctly) would
   ensure an ordinary napi poll can't race with the one in busy_poll_stop()?

Apart from that I have indeed established that with the patches as posted
 busy-polling latency is awful, but adding a gro_normal_list() into
 napi_busy_loop() fixes that, as expected.

-Ed
Edward Cree July 30, 2019, 8:02 p.m. UTC | #10
On 24/07/2019 22:49, Edward Cree wrote:
> One thing that's causing me some uncertainty: busy_poll_stop() does a
>  napi->poll(), which can potentially gro_normal_one() something.  But
>  when I tried to put a gro_normal_list() just after that, I ran into
>  list corruption because it could race against the one in
>  napi_complete_done().
Turns out that the bh_disables are a red herring; we're racing against a
 napi poll on a different CPU.
I *think* that the sequence of events is
- enter busy_poll_stop
- enter napi->poll
- napi->poll calls napi_complete_done(), which schedules a new napi
- that new napi runs, on another CPU
- meanwhile, busy_poll_stop returns from napi->poll; if it then does a
  gro_normal_list it can race with the one on the other CPU from the new
  napi-poll.
Which means that...
> Questions that arise from that:
> 1) Is it safe to potentially be adding to the rx_list (gro_normal_one(),
>    which in theory can end up calling gro_normal_list() as well) within
>    busy_poll_stop()?  I haven't ever seen a splat from that, but it seems
>    every bit as possible as what I have been seeing.
... this isn't a problem, because napi->poll() will do all of its
 gro_normal_one()s before it calls napi_complete_done().

Just gathering some final performance numbers, then I'll post the updated
 patches (hopefully) later tonight.

-Ed