Message ID | 20231029-usb-fixes-1-v2-0-623533f6316e@marcan.st |
---|---|
Headers | show |
Series | USB fixes: xHCI error handling | expand |
On Sun, Oct 29, 2023 at 2:38 AM Hector Martin <marcan@marcan.st> wrote: > > This series is the first of a few bundles of USB fixes we have been > carrying downstream on the Asahi U-Boot branch for a few months. > > Patches #1-#6 fix a series of related robustness issues. Certain > conditions related to endpoint stalls revealed a chain of bugs > throughout the stack that caused U-Boot to completely fail when > encountering some errors (and errors are a fact of life with USB). > > Example scenario: > - The device stalls a bulk endpoint due to an error > - The upper layer driver tries to use the endpoint again > - There is no endpoint stall clear wired up in the API, so for starters > this is doomed to fail (fix: #4) > - xHCI knows the endpoint is halted, but tries to queue the TRB anyway, > which can't work (fix: #5) > - Since the endpoint is halted nothing happens, so the transfer times > out. > - The timeout handling tries to abort the transfer > - abort_td() tries to stop the endpoint, but since it is already halted, > this results in a context state error. As the transfer never started, > there is no completion event, so xhci_wait_for_event() is waiting for > the wrong event type, and logs an error and returns NULL. (fix: #2) > - abort_td() crashes due to failing to guard against the NULL event > (fix: #1) > - Even after fixing all that, abort_td() would not handle the context > state error properly and BUG() (fix: #3). This also fixes a race > condition documented in the xHCI spec that could occur even in the > absence of all the other bugs. > > Patch #6 addresses a related robustness issue where > xhci_wait_for_event() panics on event timeouts other than for transfers. > While this is clearly an unexpected condition and indicates a bug > elsewhere, it's no reason to outright crash. USB is notoriously > impossible to get 100% right, and we can't afford to be breaking users' > systems at any sign of trouble. Error conditions should always be dealt > with as gracefully as possible (even if that results in a completely > broken USB controller, that is much preferable to aborting the boot > process entirely, especially on devices with non-USB storage and > keyboards where USB support is effectively optional for most users). > Since after patch #1 we now guard all callers to xhci_wait_for_event() > with at least trivial NULL checks, it's okay to return and continue in > case of timeouts. > > Patch #7 addresses an unrelated bug I ran into while debugging all this, > and patch #8 adds extra debug logs to make finding future issues less > painful. > > I believe this should fix this Fedora bug too, which is either an > instance of the above sequence of events, or (more likely, given the > difficulty reproducing) the race condition documented in xHCI 4.6.9: > > https://bugzilla.redhat.com/show_bug.cgi?id=2244305 > > Signed-off-by: Hector Martin <marcan@marcan.st> > --- > Changes in v2: > - Squashed in a trivial fix for patch #1 > - Removed spurious blank line > - Added a longer description to the cover letter > - Link to v1: https://lore.kernel.org/r/20231027-usb-fixes-1-v1-0-1c879bbcd928@marcan.st > > --- > Hector Martin (8): > usb: xhci: Guard all calls to xhci_wait_for_event > usb: xhci: Better error handling in abort_td() > usb: xhci: Allow context state errors when halting an endpoint > usb: xhci: Recover from halted bulk endpoints > usb: xhci: Fail on attempt to queue TRBs to a halted endpoint > usb: xhci: Do not panic on event timeouts > usb: xhci: Fix DMA address calculation in queue_trb > usb: xhci: Add more debugging > > drivers/usb/host/xhci-ring.c | 99 ++++++++++++++++++++++++++++++++++++-------- > drivers/usb/host/xhci.c | 9 ++++ > include/usb/xhci.h | 2 + > 3 files changed, 92 insertions(+), 18 deletions(-) > --- > base-commit: 7c0d668103abae3ec14cd96d882ba20134bb4529 > change-id: 20231027-usb-fixes-1-83bfc7013012 > Series LGTM. Reviewed-by: Neal Gompa <neal@gompa.dev>
This series is the first of a few bundles of USB fixes we have been carrying downstream on the Asahi U-Boot branch for a few months. Patches #1-#6 fix a series of related robustness issues. Certain conditions related to endpoint stalls revealed a chain of bugs throughout the stack that caused U-Boot to completely fail when encountering some errors (and errors are a fact of life with USB). Example scenario: - The device stalls a bulk endpoint due to an error - The upper layer driver tries to use the endpoint again - There is no endpoint stall clear wired up in the API, so for starters this is doomed to fail (fix: #4) - xHCI knows the endpoint is halted, but tries to queue the TRB anyway, which can't work (fix: #5) - Since the endpoint is halted nothing happens, so the transfer times out. - The timeout handling tries to abort the transfer - abort_td() tries to stop the endpoint, but since it is already halted, this results in a context state error. As the transfer never started, there is no completion event, so xhci_wait_for_event() is waiting for the wrong event type, and logs an error and returns NULL. (fix: #2) - abort_td() crashes due to failing to guard against the NULL event (fix: #1) - Even after fixing all that, abort_td() would not handle the context state error properly and BUG() (fix: #3). This also fixes a race condition documented in the xHCI spec that could occur even in the absence of all the other bugs. Patch #6 addresses a related robustness issue where xhci_wait_for_event() panics on event timeouts other than for transfers. While this is clearly an unexpected condition and indicates a bug elsewhere, it's no reason to outright crash. USB is notoriously impossible to get 100% right, and we can't afford to be breaking users' systems at any sign of trouble. Error conditions should always be dealt with as gracefully as possible (even if that results in a completely broken USB controller, that is much preferable to aborting the boot process entirely, especially on devices with non-USB storage and keyboards where USB support is effectively optional for most users). Since after patch #1 we now guard all callers to xhci_wait_for_event() with at least trivial NULL checks, it's okay to return and continue in case of timeouts. Patch #7 addresses an unrelated bug I ran into while debugging all this, and patch #8 adds extra debug logs to make finding future issues less painful. I believe this should fix this Fedora bug too, which is either an instance of the above sequence of events, or (more likely, given the difficulty reproducing) the race condition documented in xHCI 4.6.9: https://bugzilla.redhat.com/show_bug.cgi?id=2244305 Signed-off-by: Hector Martin <marcan@marcan.st> --- Changes in v2: - Squashed in a trivial fix for patch #1 - Removed spurious blank line - Added a longer description to the cover letter - Link to v1: https://lore.kernel.org/r/20231027-usb-fixes-1-v1-0-1c879bbcd928@marcan.st --- Hector Martin (8): usb: xhci: Guard all calls to xhci_wait_for_event usb: xhci: Better error handling in abort_td() usb: xhci: Allow context state errors when halting an endpoint usb: xhci: Recover from halted bulk endpoints usb: xhci: Fail on attempt to queue TRBs to a halted endpoint usb: xhci: Do not panic on event timeouts usb: xhci: Fix DMA address calculation in queue_trb usb: xhci: Add more debugging drivers/usb/host/xhci-ring.c | 99 ++++++++++++++++++++++++++++++++++++-------- drivers/usb/host/xhci.c | 9 ++++ include/usb/xhci.h | 2 + 3 files changed, 92 insertions(+), 18 deletions(-) --- base-commit: 7c0d668103abae3ec14cd96d882ba20134bb4529 change-id: 20231027-usb-fixes-1-83bfc7013012 Best regards,