[v2,0/17] staging: qlge: Fix rx stall in case of allocation failures

Message ID 20190927101210.23856-1-bpoirier@suse.com

Message

Benjamin Poirier Sept. 27, 2019, 10:11 a.m. UTC
qlge refills rx buffers from napi context. In case of allocation failure,
the allocation is retried the next time napi runs. If a receive queue runs
out of free buffers entirely (possibly after repeated allocation failures),
it drops all traffic, no longer raises interrupts and napi is no longer
scheduled; reception stalls until manual admin intervention.

This patch series adds a fallback mechanism for rx buffer allocation. If
an rx buffer queue becomes empty, a workqueue is scheduled to refill it
from process context, where allocation can block until the mm has freed
some pages (hopefully). This approach was inspired by the virtio_net
driver (commit 3161e453e496 "virtio: net refill on out-of-memory").

I've compared this with how some other drivers with a similar allocation
scheme handle this situation: mlx4 relies on a periodic watchdog, sfc uses
a timer, and e1000e and fm10k rely on periodic hardware interrupts (IIUC).
In all cases, this is used to schedule napi at a fixed interval (10-250ms)
until allocations succeed. That kind of approach simplifies allocations
because only one context may refill buffers; however, it is inefficient
because of the fixed interval: either the interval is too short, in which
case the allocation fails again and work is done without forward progress,
or the interval is too long, in which case buffers could have been
allocated and rx restarted sooner, but traffic is dropped in the meantime
while the system sits idle.
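
Roughly, those timer-based schemes amount to something like the following
(illustrative sketch with invented names, not code from any of those
drivers):

static void rx_refill_timer(struct timer_list *t)
{
	struct rx_ring *rx_ring = from_timer(rx_ring, t, refill_timer);

	/* Kick napi so that its (atomic) refill path runs again. */
	napi_schedule(&rx_ring->napi);
}

/* On allocation failure in the napi path, at some fixed interval: */
mod_timer(&rx_ring->refill_timer, jiffies + msecs_to_jiffies(100));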

Note that the qlge driver (and device) uses two kinds of buffers for
received data, so-called "small buffers" and "large buffers". The two are
arranged in ring pairs, the sbq and lbq. Depending on frame size, protocol
content and header splitting, data can go into either type of buffer.
Because of their size, lbq allocations are more likely to fail and lead to
a stall; however, I've reproduced the problem with the sbq as well. The
problem was originally found when running jumbo frames, in which case qlge
uses order-1 allocations for the large buffers. Although the two kinds of
buffers are managed similarly, the qlge driver duplicates most data
structures and code for their handling. In fact, even a casual look at the
qlge driver shows it to be in a state of disrepair, to put it kindly...

Patches 1-14 are cleanups that remove, fix and deduplicate code related to
sbq and lbq handling. Regarding those cleanups, patches 2 ("Remove
irq_cnt") and 8 ("Deduplicate rx buffer queue management") are the most
important. Finally, patches 15-17 fix the actual problem of rx stalls in
case of allocation failures by implementing the fallback of allocations to
a workqueue.
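
To give an idea of what patch 8 does, the duplicated sbq/lbq handling
collapses into a single set of structures and helpers parameterized by
queue type, along these lines (field and type names are illustrative, not
the exact ones from the patch):

enum qlge_bq_type { QLGE_SB, QLGE_LB };

struct qlge_bq {
	__le64 *base;			/* dma-mapped descriptor ring */
	dma_addr_t base_dma;
	struct qlge_bq_desc *queue;	/* per-entry driver bookkeeping */
	u32 len;			/* number of entries */
	u32 next_to_use;
	u32 next_to_clean;
	enum qlge_bq_type type;		/* which half of the pair this is */
};

/* One refill helper instead of two near-identical copies: */
static int qlge_refill_bq(struct qlge_bq *bq, gfp_t gfp);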

I've tested these patches using two different approaches:
1) A sender uses pktgen to send udp traffic. The receiver has a large swap,
a large net.core.rmem_max, runs a program that dirties all free memory in a
loop and runs a program that opens as many udp sockets as possible but
doesn't read from them (a sketch of such a helper follows this list).
Since received data is all queued in the sockets
rather than freed, qlge is allocating receive buffers as quickly as
possible and faces allocation failures if the swap is slower than the
network.
2) A sender uses super_netperf. Likewise, the receiver has a large swap, a
large net.core.rmem_max and runs a program that dirties all free memory in
a loop. After the netperf send test is started, `killall -s SIGSTOP
netserver` on the receiver leads to the same situation as above.
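
For reference, the socket-hoarding helper from approach 1 can be as simple
as the following stand-alone version (illustrative, not the exact program
used): bind as many udp sockets as possible, bump their receive buffers
(capped by net.core.rmem_max) and never read from them.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in sin = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};
	int big = 1 << 30;	/* clamped to net.core.rmem_max */
	int n = 0;

	for (int port = 1024; port < 65536; port++) {
		int fd = socket(AF_INET, SOCK_DGRAM, 0);

		if (fd < 0)
			break;	/* out of file descriptors */
		setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &big, sizeof(big));
		sin.sin_port = htons(port);
		if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
			close(fd);
			continue;
		}
		n++;
	}
	printf("bound %d udp sockets\n", n);
	pause();	/* never read; received data stays queued */
	return 0;
}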

---
Changes
v1->v2
https://lore.kernel.org/netdev/20190617074858.32467-1-bpoirier@suse.com/
* simplified QLGE_FIT16 macro down to a simple cast
* added "qlge: Fix irq masking in INTx mode"
* fixed address in pci_unmap_page() calls in "qlge: Deduplicate rx buffer
  queue management", no effect on end result of series
* adjusted series following move of driver to staging

Comments

Greg Kroah-Hartman Oct. 4, 2019, 8:19 a.m. UTC | #1
On Fri, Sep 27, 2019 at 07:11:54PM +0900, Benjamin Poirier wrote:
[...]

As this code got moved to staging with the goal to drop it from the
tree, why are you working on fixing it up?  Do you want it moved back
out of staging into the "real" part of the tree, or are you just fixing
things that you find in order to make it cleaner before we delete it?

confused,

greg k-h
Benjamin Poirier Oct. 4, 2019, 9:15 a.m. UTC | #2
On 2019/10/04 10:19, Greg Kroah-Hartman wrote:
> On Fri, Sep 27, 2019 at 07:11:54PM +0900, Benjamin Poirier wrote:
[...]
> 
> As this code got moved to staging with the goal to drop it from the
> tree, why are you working on fixing it up?  Do you want it moved back
> out of staging into the "real" part of the tree, or are you just fixing
> things that you find in order to make it cleaner before we delete it?
> 
> confused,
> 

I expected one of two possible outcomes after moving the qlge driver to
staging:
1) it gets the attention of people looking for something to work on and
the driver is improved and submitted for normal inclusion in the future
2) it doesn't get enough attention and the driver is removed

I don't plan to do further work on it and I'm admittedly not holding my
breath for others to rush in but I already had those patches; it wasn't
a big effort to submit them as a first step towards outcome #1.

If #2 is a foregone conclusion, then there's little point in applying
the patches. The only benefit I can think of is that if the complete
removal is reverted in the future, this specific problem will at least
be fixed.
Greg Kroah-Hartman Oct. 4, 2019, 3:19 p.m. UTC | #3
On Fri, Oct 04, 2019 at 06:15:45PM +0900, Benjamin Poirier wrote:
> On 2019/10/04 10:19, Greg Kroah-Hartman wrote:
> > [...]
> 
> I expected one of two possible outcomes after moving the qlge driver to
> staging:
> 1) it gets the attention of people looking for something to work on and
> the driver is improved and submitted for normal inclusion in the future
> 2) it doesn't get enough attention and the driver is removed
> 
> I don't plan to do further work on it and I'm admittedly not holding my
> breath for others to rush in but I already had those patches; it wasn't
> a big effort to submit them as a first step towards outcome #1.
> 
> If #2 is a foregone conclusion, then there's little point in applying
> the patches. The only benefit I can think of is that if the complete
> removal is reverted in the future, this specific problem will at least
> be fixed.

That makes more sense, I'll go queue these up now, as I don't want to
waste the work you did on this.

thanks,

greg k-h