mbox series

[v2,0/8] USB fixes: xHCI error handling

Message ID 20231029-usb-fixes-1-v2-0-623533f6316e@marcan.st
Headers show
Series USB fixes: xHCI error handling | expand

Message

Hector Martin Oct. 29, 2023, 6:37 a.m. UTC
This series is the first of a few bundles of USB fixes we have been
carrying downstream on the Asahi U-Boot branch for a few months.

Patches #1-#6 fix a series of related robustness issues. Certain
conditions related to endpoint stalls revealed a chain of bugs
throughout the stack that caused U-Boot to completely fail when
encountering some errors (and errors are a fact of life with USB).

Example scenario:
- The device stalls a bulk endpoint due to an error
- The upper layer driver tries to use the endpoint again
- There is no endpoint stall clear wired up in the API, so for starters
  this is doomed to fail (fix: #4)
- xHCI knows the endpoint is halted, but tries to queue the TRB anyway,
  which can't work (fix: #5)
- Since the endpoint is halted nothing happens, so the transfer times
  out.
- The timeout handling tries to abort the transfer
- abort_td() tries to stop the endpoint, but since it is already halted,
  this results in a context state error. As the transfer never started,
  there is no completion event, so xhci_wait_for_event() is waiting for
  the wrong event type, and logs an error and returns NULL. (fix: #2)
- abort_td() crashes due to failing to guard against the NULL event
  (fix: #1)
- Even after fixing all that, abort_td() would not handle the context
  state error properly and BUG() (fix: #3). This also fixes a race
  condition documented in the xHCI spec that could occur even in the
  absence of all the other bugs.

Patch #6 addresses a related robustness issue where
xhci_wait_for_event() panics on event timeouts other than for transfers.
While this is clearly an unexpected condition and indicates a bug
elsewhere, it's no reason to outright crash. USB is notoriously
impossible to get 100% right, and we can't afford to be breaking users'
systems at any sign of trouble. Error conditions should always be dealt
with as gracefully as possible (even if that results in a completely
broken USB controller, that is much preferable to aborting the boot
process entirely, especially on devices with non-USB storage and
keyboards where USB support is effectively optional for most users).
Since after patch #1 we now guard all callers to xhci_wait_for_event()
with at least trivial NULL checks, it's okay to return and continue in
case of timeouts.

Patch #7 addresses an unrelated bug I ran into while debugging all this,
and patch #8 adds extra debug logs to make finding future issues less
painful.

I believe this should fix this Fedora bug too, which is either an
instance of the above sequence of events, or (more likely, given the
difficulty reproducing) the race condition documented in xHCI 4.6.9:

https://bugzilla.redhat.com/show_bug.cgi?id=2244305

Signed-off-by: Hector Martin <marcan@marcan.st>
---
Changes in v2:
- Squashed in a trivial fix for patch #1
- Removed spurious blank line
- Added a longer description to the cover letter
- Link to v1: https://lore.kernel.org/r/20231027-usb-fixes-1-v1-0-1c879bbcd928@marcan.st

---
Hector Martin (8):
      usb: xhci: Guard all calls to xhci_wait_for_event
      usb: xhci: Better error handling in abort_td()
      usb: xhci: Allow context state errors when halting an endpoint
      usb: xhci: Recover from halted bulk endpoints
      usb: xhci: Fail on attempt to queue TRBs to a halted endpoint
      usb: xhci: Do not panic on event timeouts
      usb: xhci: Fix DMA address calculation in queue_trb
      usb: xhci: Add more debugging

 drivers/usb/host/xhci-ring.c | 99 ++++++++++++++++++++++++++++++++++++--------
 drivers/usb/host/xhci.c      |  9 ++++
 include/usb/xhci.h           |  2 +
 3 files changed, 92 insertions(+), 18 deletions(-)
---
base-commit: 7c0d668103abae3ec14cd96d882ba20134bb4529
change-id: 20231027-usb-fixes-1-83bfc7013012

Best regards,

Comments

Neal Gompa Oct. 29, 2023, 4:21 p.m. UTC | #1
On Sun, Oct 29, 2023 at 2:38 AM Hector Martin <marcan@marcan.st> wrote:
>
> This series is the first of a few bundles of USB fixes we have been
> carrying downstream on the Asahi U-Boot branch for a few months.
>
> Patches #1-#6 fix a series of related robustness issues. Certain
> conditions related to endpoint stalls revealed a chain of bugs
> throughout the stack that caused U-Boot to completely fail when
> encountering some errors (and errors are a fact of life with USB).
>
> Example scenario:
> - The device stalls a bulk endpoint due to an error
> - The upper layer driver tries to use the endpoint again
> - There is no endpoint stall clear wired up in the API, so for starters
>   this is doomed to fail (fix: #4)
> - xHCI knows the endpoint is halted, but tries to queue the TRB anyway,
>   which can't work (fix: #5)
> - Since the endpoint is halted nothing happens, so the transfer times
>   out.
> - The timeout handling tries to abort the transfer
> - abort_td() tries to stop the endpoint, but since it is already halted,
>   this results in a context state error. As the transfer never started,
>   there is no completion event, so xhci_wait_for_event() is waiting for
>   the wrong event type, and logs an error and returns NULL. (fix: #2)
> - abort_td() crashes due to failing to guard against the NULL event
>   (fix: #1)
> - Even after fixing all that, abort_td() would not handle the context
>   state error properly and BUG() (fix: #3). This also fixes a race
>   condition documented in the xHCI spec that could occur even in the
>   absence of all the other bugs.
>
> Patch #6 addresses a related robustness issue where
> xhci_wait_for_event() panics on event timeouts other than for transfers.
> While this is clearly an unexpected condition and indicates a bug
> elsewhere, it's no reason to outright crash. USB is notoriously
> impossible to get 100% right, and we can't afford to be breaking users'
> systems at any sign of trouble. Error conditions should always be dealt
> with as gracefully as possible (even if that results in a completely
> broken USB controller, that is much preferable to aborting the boot
> process entirely, especially on devices with non-USB storage and
> keyboards where USB support is effectively optional for most users).
> Since after patch #1 we now guard all callers to xhci_wait_for_event()
> with at least trivial NULL checks, it's okay to return and continue in
> case of timeouts.
>
> Patch #7 addresses an unrelated bug I ran into while debugging all this,
> and patch #8 adds extra debug logs to make finding future issues less
> painful.
>
> I believe this should fix this Fedora bug too, which is either an
> instance of the above sequence of events, or (more likely, given the
> difficulty reproducing) the race condition documented in xHCI 4.6.9:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=2244305
>
> Signed-off-by: Hector Martin <marcan@marcan.st>
> ---
> Changes in v2:
> - Squashed in a trivial fix for patch #1
> - Removed spurious blank line
> - Added a longer description to the cover letter
> - Link to v1: https://lore.kernel.org/r/20231027-usb-fixes-1-v1-0-1c879bbcd928@marcan.st
>
> ---
> Hector Martin (8):
>       usb: xhci: Guard all calls to xhci_wait_for_event
>       usb: xhci: Better error handling in abort_td()
>       usb: xhci: Allow context state errors when halting an endpoint
>       usb: xhci: Recover from halted bulk endpoints
>       usb: xhci: Fail on attempt to queue TRBs to a halted endpoint
>       usb: xhci: Do not panic on event timeouts
>       usb: xhci: Fix DMA address calculation in queue_trb
>       usb: xhci: Add more debugging
>
>  drivers/usb/host/xhci-ring.c | 99 ++++++++++++++++++++++++++++++++++++--------
>  drivers/usb/host/xhci.c      |  9 ++++
>  include/usb/xhci.h           |  2 +
>  3 files changed, 92 insertions(+), 18 deletions(-)
> ---
> base-commit: 7c0d668103abae3ec14cd96d882ba20134bb4529
> change-id: 20231027-usb-fixes-1-83bfc7013012
>

Series LGTM.

Reviewed-by: Neal Gompa <neal@gompa.dev>