Message ID: 7920e85c-439e-0622-46f8-0602cf37e306@solarflare.com
Series: net: batched receive in GRO path
Hi,

On Tue, 2019-07-09 at 20:27 +0100, Edward Cree wrote:
> Where not specified (as batch=), net.core.gro_normal_batch was set to 8.
> The net-next baseline used for these tests was commit 7d30a7f6424e.
> TCP 4 streams, GRO on: all results line rate (9.415Gbps)
> net-next: 210.3% cpu
> after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
> after #3: 191.7% cpu (- 8.9%, p=0.102 vs net-next)
> TCP 4 streams, GRO off:
> after #1: 7.785 Gbps
> after #3: 8.387 Gbps (+ 7.7%, p=0.215 vs #1, but note *)
> TCP 1 stream, GRO on: all results line rate & ~200% cpu.
> TCP 1 stream, GRO off:
> after #1: 6.444 Gbps
> after #3: 7.363 Gbps (+14.3%, p=0.003 vs #1)
> batch=16: 7.199 Gbps
> batch= 4: 7.354 Gbps
> batch= 0: 5.899 Gbps
> TCP 100 RR, GRO off:
> net-next: 995.083 us
> after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
> after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)
>
> (*) These tests produced a mixture of line-rate and below-line-rate results,
> meaning that statistically speaking the results were 'censored' by the
> upper bound, and were thus not normally distributed, making a Welch t-test
> mathematically invalid.  I therefore also calculated estimators according
> to [2], which gave the following:
> after #1: 8.155 Gbps
> after #3: 8.716 Gbps (+ 6.9%, p=0.291 vs #1)
> (though my procedure for determining ν wasn't mathematically well-founded
> either, so take that p-value with a grain of salt).

I'm toying with a patch similar to your 3/3 (the most relevant difference
being the lack of a limit on the batch size), on top of ixgbe (which sends
all the pkts to the GRO engine), and I'm observing more mixed results
(UDP only):

* when a single rx queue is running, I see a just-above-noise performance
  delta
* when multiple rx queues are running, I observe measurable regressions
  (note: I use small pkts, still well under line rate even with multiple
  rx queues)

I'll try to test your patch in the following days.

Side note: I think that in patch 3/3, it's necessary to add a call to
gro_normal_list() also inside napi_busy_loop().

Cheers,

Paolo
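For readers following along, the batching mechanism being discussed (the gro_normal_one()/gro_normal_list() pair and the net.core.gro_normal_batch threshold from the patch series) can be modelled in plain userspace C roughly as below. This is a sketch of the logic only, not the kernel code: struct fake_napi, the fixed GRO_NORMAL_BATCH constant, and integer packet ids are invented stand-ins for struct napi_struct, the sysctl, and sk_buff lists.

```c
#include <assert.h>

/* Simplified stand-in for the kernel's napi->rx_list / napi->rx_count
 * and the net.core.gro_normal_batch sysctl (default 8 in the tests
 * above).  All names here are hypothetical, for illustration only. */
#define GRO_NORMAL_BATCH 8

struct fake_napi {
	int rx_list[64];	/* queued packet ids (kernel: sk_buff list) */
	int rx_count;		/* packets pending in the batch */
	int delivered;		/* packets passed up the stack so far */
};

/* Model of gro_normal_list(): flush the pending batch up the stack
 * (the kernel calls netif_receive_skb_list_internal() here). */
static void gro_normal_list(struct fake_napi *napi)
{
	if (!napi->rx_count)
		return;
	napi->delivered += napi->rx_count;
	napi->rx_count = 0;
}

/* Model of gro_normal_one(): queue one packet, flushing once the
 * batch threshold is reached. */
static void gro_normal_one(struct fake_napi *napi, int pkt)
{
	napi->rx_list[napi->rx_count++] = pkt;
	if (napi->rx_count >= GRO_NORMAL_BATCH)
		gro_normal_list(napi);
}
```

Packets below a full batch stay queued until something (in the kernel, napi_complete_done() at the end of the poll) flushes the list, which is exactly the property the rest of this thread turns on.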
On 10/07/2019 08:27, Paolo Abeni wrote:
> I'm toying with a patch similar to your 3/3 (the most relevant difference
> being the lack of a limit on the batch size), on top of ixgbe (which
> sends all the pkts to the GRO engine), and I'm observing more
> mixed results (UDP only):
>
> * when a single rx queue is running, I see a just-above-noise
>   performance delta
> * when multiple rx queues are running, I observe measurable regressions
>   (note: I use small pkts, still well under line rate even with multiple
>   rx queues)
>
> I'll try to test your patch in the following days.

I look forward to it.

> Side note: I think that in patch 3/3, it's necessary to add a call to
> gro_normal_list() also inside napi_busy_loop().

Hmm, I was caught out by the call to napi_poll() actually being a local
function pointer, not the static function of the same name.  How did a
shadow like that ever get allowed?

But in that case I _really_ don't understand napi_busy_loop(); nothing
in it seems to ever flush GRO, so it's relying on either
(1) stuff getting flushed because the bucket runs out of space, or
(2) the next napi poll after busy_poll_stop() doing the flush.
What am I missing, and where exactly in napi_busy_loop() should the
gro_normal_list() call go?

-Ed
On 7/10/19 4:52 PM, Edward Cree wrote:
> Hmm, I was caught out by the call to napi_poll() actually being a local
> function pointer, not the static function of the same name.  How did a
> shadow like that ever get allowed?
> But in that case I _really_ don't understand napi_busy_loop(); nothing
> in it seems to ever flush GRO, so it's relying on either
> (1) stuff getting flushed because the bucket runs out of space, or
> (2) the next napi poll after busy_poll_stop() doing the flush.
> What am I missing, and where exactly in napi_busy_loop() should the
> gro_normal_list() call go?

Please look at busy_poll_stop().
On 10/07/2019 16:41, Eric Dumazet wrote:
> On 7/10/19 4:52 PM, Edward Cree wrote:
>> Hmm, I was caught out by the call to napi_poll() actually being a local
>> function pointer, not the static function of the same name.  How did a
>> shadow like that ever get allowed?
>> But in that case I _really_ don't understand napi_busy_loop(); nothing
>> in it seems to ever flush GRO, so it's relying on either
>> (1) stuff getting flushed because the bucket runs out of space, or
>> (2) the next napi poll after busy_poll_stop() doing the flush.
>> What am I missing, and where exactly in napi_busy_loop() should the
>> gro_normal_list() call go?
> Please look at busy_poll_stop()

I did look there, but now I've looked again and harder, and I think I
get it: busy_poll_stop() calls napi->poll(), which (eventually, possibly
in the subsequent poll that we schedule if rc == budget) calls
napi_complete_done(), which does the flush.

In which case, the same applies to the napi->rx_list, which similarly
gets handled in napi_complete_done().  So I don't think napi_busy_loop()
needs a gro_normal_list() adding to it(?)

As a general rule, I think we need to gro_normal_list() in those places,
and only those places, that call napi_gro_flush().  But as I mentioned
in the patch 3/3 description, I'm still confused by the (few) drivers
that call napi_gro_flush() themselves.

-Ed
On 7/10/19 6:47 PM, Edward Cree wrote:
> On 10/07/2019 16:41, Eric Dumazet wrote:
>> On 7/10/19 4:52 PM, Edward Cree wrote:
>>> Hmm, I was caught out by the call to napi_poll() actually being a local
>>> function pointer, not the static function of the same name.  How did a
>>> shadow like that ever get allowed?
>>> But in that case I _really_ don't understand napi_busy_loop(); nothing
>>> in it seems to ever flush GRO, so it's relying on either
>>> (1) stuff getting flushed because the bucket runs out of space, or
>>> (2) the next napi poll after busy_poll_stop() doing the flush.
>>> What am I missing, and where exactly in napi_busy_loop() should the
>>> gro_normal_list() call go?
>> Please look at busy_poll_stop()
> I did look there, but now I've looked again and harder, and I think I
> get it: busy_poll_stop() calls napi->poll(), which (eventually, possibly
> in the subsequent poll that we schedule if rc == budget) calls
> napi_complete_done(), which does the flush.
> In which case, the same applies to the napi->rx_list, which similarly
> gets handled in napi_complete_done().  So I don't think napi_busy_loop()
> needs a gro_normal_list() adding to it(?)

I advise you to try busypoll then, with TCP_RR, with, say, a 50 usec
delay in /proc/sys/net/core/busy_read.

Holding a small packet in the list up to the point we call
busy_poll_stop() will basically make busypoll stop working.

napi_complete_done() has special behavior when busy polling is active.

> As a general rule, I think we need to gro_normal_list() in those places,
> and only those places, that call napi_gro_flush().  But as I mentioned
> in the patch 3/3 description, I'm still confused by the (few) drivers
> that call napi_gro_flush() themselves.
>
> -Ed
On 10/07/2019 18:39, Eric Dumazet wrote:
> Holding a small packet in the list up to the point we call busy_poll_stop()
> will basically make busypoll stop working.
>
> napi_complete_done() has special behavior when busy polling is active.

Yep, I get it now, sorry for being dumb :)
Essentially we're saying that things coalesced by GRO are 'bulk' traffic
and can wait around, but the rest is the stuff we're polling for, for
low latency.

I'm putting a gro_normal_list() call after the trace_napi_poll() in
napi_busy_loop() and testing that, let's see how it goes...
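The latency problem Eric describes, and the per-iteration flush Edward proposes to test, can be sketched with a small userspace model. Again this is an illustration of the logic only: struct fake_napi, fake_poll(), and busy_loop_iter_fixed() are invented stand-ins, not the kernel functions.

```c
#include <assert.h>

/* Model of the busy-poll latency problem: a packet below the batch
 * threshold sits in napi->rx_list until something flushes it.
 * All names are hypothetical stand-ins for the kernel structures. */
#define GRO_NORMAL_BATCH 8

struct fake_napi {
	int rx_count;		/* packets pending in napi->rx_list */
	int delivered;		/* packets handed up the stack */
};

/* One napi->poll() invocation that received 'pkts' small packets,
 * none of which GRO could coalesce.  Without busy_poll_stop()'s
 * napi_complete_done(), only a full bucket flushes. */
static void fake_poll(struct fake_napi *napi, int pkts)
{
	napi->rx_count += pkts;
	if (napi->rx_count >= GRO_NORMAL_BATCH) {
		napi->delivered += napi->rx_count;
		napi->rx_count = 0;
	}
}

/* gro_normal_list() model: unconditional flush of the pending list. */
static void gro_normal_list(struct fake_napi *napi)
{
	napi->delivered += napi->rx_count;
	napi->rx_count = 0;
}

/* A napi_busy_loop() iteration with the proposed fix: flush after
 * every poll (in the kernel, after the trace_napi_poll() call), so a
 * lone request/response packet is not held back waiting for a full
 * batch. */
static void busy_loop_iter_fixed(struct fake_napi *napi, int pkts)
{
	fake_poll(napi, pkts);
	gro_normal_list(napi);
}
```

The point of the per-iteration flush is visible immediately: a single small packet, the typical TCP_RR case, reaches the stack on the same busy-poll iteration instead of idling in the list.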
On 7/12/19 5:59 PM, Edward Cree wrote:
> On 10/07/2019 18:39, Eric Dumazet wrote:
>> Holding a small packet in the list up to the point we call busy_poll_stop()
>> will basically make busypoll stop working.
>>
>> napi_complete_done() has special behavior when busy polling is active.
> Yep, I get it now, sorry for being dumb :)
> Essentially we're saying that things coalesced by GRO are 'bulk' traffic and
> can wait around,

GRO can still be beneficial even when busypolling, since the TCP stack
will send a single ACK back, and a read()/recv() will copy the whole
train instead of a single MSS.

I should have mentioned that we have a patch that I forgot to upstream
adding the PSH flag to all TSO packets, meaning the receiver can
automatically learn the boundary of a GRO packet and not have to wait
for the napi->poll() end (busypolling or not).

> but the rest is the stuff we're polling for, for low latency.
> I'm putting a gro_normal_list() call after the trace_napi_poll() in
> napi_busy_loop() and testing that, let's see how it goes...
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 12 Jul 2019 18:48:29 +0200

> I should have mentioned that we have a patch that I forgot to
> upstream adding the PSH flag to all TSO packets, meaning the
> receiver can automatically learn the boundary of a GRO packet and
> not have to wait for the napi->poll() end (busypolling or not)

Wow, I thought we were already doing this...
On 12/07/2019 17:48, Eric Dumazet wrote:
>> but the rest is the stuff we're polling for, for low latency.
>> I'm putting a gro_normal_list() call after the trace_napi_poll() in
>> napi_busy_loop() and testing that, let's see how it goes...

One thing that's causing me some uncertainty: busy_poll_stop() does a
napi->poll(), which can potentially gro_normal_one() something.  But
when I tried to put a gro_normal_list() just after that, I ran into
list corruption because it could race against the one in
napi_complete_done().  I'm not entirely sure how; my current theory
goes something like:
- clear_bit(IN_BUSY_POLL)
- task switch, start napi poll
- get as far as starting gro_normal_list()
- task switch back to busy_poll_stop()
- local_bh_disable()
- do a napi poll
- start gro_normal_list()
- list corruption ensues as we have two instances of
  netif_receive_skb_list_internal() trying to consume the same list
But I may be wildly mistaken.

Questions that arise from that:
1) Is it safe to potentially be adding to the rx_list (gro_normal_one(),
   which in theory can end up calling gro_normal_list() as well) within
   busy_poll_stop()?  I haven't ever seen a splat from that, but it
   seems every bit as possible as what I have been seeing.
2) Why does busy_poll_stop() not do its local_bh_disable() *before*
   clearing napi state bits, which (if I'm understanding correctly)
   would ensure an ordinary napi poll can't race with the one in
   busy_poll_stop()?

Apart from that, I have indeed established that with the patches as
posted, busy-polling latency is awful, but adding a gro_normal_list()
into napi_busy_loop() fixes that, as expected.

-Ed
On 24/07/2019 22:49, Edward Cree wrote:
> One thing that's causing me some uncertainty: busy_poll_stop() does a
> napi->poll(), which can potentially gro_normal_one() something.  But
> when I tried to put a gro_normal_list() just after that, I ran into
> list corruption because it could race against the one in
> napi_complete_done().

Turns out that the bh_disables are a red herring; we're racing against
a napi poll on a different CPU.  I *think* that the sequence of events
is:
- enter busy_poll_stop
- enter napi->poll
- napi->poll calls napi_complete_done(), which schedules a new napi
- that new napi runs, on another CPU
- meanwhile, busy_poll_stop returns from napi->poll; if it then does a
  gro_normal_list it can race with the one on the other CPU from the
  new napi poll.
Which means that...

> Questions that arise from that:
> 1) Is it safe to potentially be adding to the rx_list (gro_normal_one(),
>    which in theory can end up calling gro_normal_list() as well) within
>    busy_poll_stop()?  I haven't ever seen a splat from that, but it
>    seems every bit as possible as what I have been seeing.

... this isn't a problem, because napi->poll() will do all of its
gro_normal_one()s before it calls napi_complete_done().

Just gathering some final performance numbers, then I'll post the
updated patches (hopefully) later tonight.

-Ed
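Edward's closing observation, that a completing napi->poll() queues everything with gro_normal_one() before napi_complete_done() flushes the list, so nothing is left pending when busy_poll_stop() returns, can also be checked with a small model. As before, struct fake_napi and fake_poll() are invented stand-ins for the kernel code, illustrating only the ordering argument.

```c
#include <assert.h>

/* Model of the ordering that makes the race harmless: within one
 * napi->poll(), all gro_normal_one() calls happen before
 * napi_complete_done(), which flushes napi->rx_list.  Names here are
 * hypothetical stand-ins for the kernel structures. */
struct fake_napi {
	int rx_count;		/* packets pending in napi->rx_list */
	int delivered;		/* packets handed up the stack */
};

/* gro_normal_one() model: queue one packet (threshold ignored here,
 * as it doesn't affect the ordering argument). */
static void gro_normal_one(struct fake_napi *napi)
{
	napi->rx_count++;
}

/* napi_complete_done() model: gro_normal_list() flushes whatever is
 * pending before the napi is marked complete. */
static void napi_complete_done(struct fake_napi *napi)
{
	napi->delivered += napi->rx_count;
	napi->rx_count = 0;
}

/* A completing napi->poll(): receive everything, then complete. */
static void fake_poll(struct fake_napi *napi, int pkts)
{
	for (int i = 0; i < pkts; i++)
		gro_normal_one(napi);
	napi_complete_done(napi);
}
```

Because the flush is the last thing the poll does, the list is empty by the time busy_poll_stop() regains control, so there is nothing for it to double-consume.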