[0/1,SRU,J,L,M] A deadlock issue in scsi rescan task while resuming from S3

Message ID 20230620033659.136024-1-acelan.kao@canonical.com

Message

AceLan Kao June 20, 2023, 3:36 a.m. UTC
From: "Chia-Lin Kao (AceLan)" <acelan.kao@canonical.com>

BugLink: https://launchpad.net/bugs/2018566

[Impact]
During S3 stress testing, the system sometimes hangs on resume. The SCSI
rescan task cannot acquire the mutex while the system is resuming from S3,
because EH already holds it and is waiting for the device to become ready
for a rescan. Neither side makes progress and the mutex is never released,
which results in a deadlock.

[Fix]
Kaiheng submitted a patch that fixes this issue by deferring the rescan
while the disk is still suspended, so that the resume process of the disk
device can proceed.
https://patchwork.ozlabs.org/project/linux-ide/patch/20230502150435.423770-2-kai.heng.feng@canonical.com/
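
The deferral idea can be illustrated with a small userspace model. This is
only a sketch under stated assumptions, not the kernel change itself: the
names model_disk, model_rescan and model_run_rescan are hypothetical, the
"suspended" flag stands in for the real device power state, and the retry
loop stands in for rescheduling the rescan work. The point it shows is that
the rescan returns early instead of blocking while the disk is suspended.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a disk device; not a kernel structure. */
struct model_disk {
    bool suspended;     /* true until the resume path has finished */
    int  rescans_done;  /* number of rescans actually performed */
};

enum rescan_result { RESCAN_DONE, RESCAN_DEFERRED };

/*
 * Model of the rescan work item. If the disk is still suspended, return
 * early without touching it (and so without waiting on any lock) and ask
 * the caller to try again later.
 */
enum rescan_result model_rescan(struct model_disk *disk)
{
    assert(disk != NULL);

    if (disk->suspended)
        return RESCAN_DEFERRED;

    disk->rescans_done++;
    return RESCAN_DONE;
}

/*
 * Model of the rescheduling: re-run the rescan until it completes. The
 * disk is assumed to finish resuming once resume_after deferrals have
 * happened. Returns the number of deferred attempts.
 */
int model_run_rescan(struct model_disk *disk, int resume_after)
{
    int deferrals = 0;

    while (model_rescan(disk) == RESCAN_DEFERRED) {
        deferrals++;
        if (deferrals >= resume_after)
            disk->suspended = false;  /* resume path has completed */
    }
    return deferrals;
}
```

In this model the resume path is never blocked by the rescan, because a
deferred rescan holds nothing while it waits for its next attempt.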

Since the patch has not been accepted upstream yet, submit it to the OEM kernel for now.

A similar patch has been included in v6.4-rc7; backport it to the generic
Ubuntu kernels:
6aa0365a3c85 ata: libata-scsi: Avoid deadlock on rescan after device resume

[Test]
Verified on the machines by me and by the ODM.

[Where problems could occur]
The patch only defers the rescan task and should not have any impact on current systems.

Damien Le Moal (1):
  ata: libata-scsi: Avoid deadlock on rescan after device resume

 drivers/ata/libata-core.c |  3 ++-
 drivers/ata/libata-eh.c   |  2 +-
 drivers/ata/libata-scsi.c | 22 +++++++++++++++++++++-
 include/linux/libata.h    |  2 +-
 4 files changed, 25 insertions(+), 4 deletions(-)

Comments

Stefan Bader June 20, 2023, 7:53 a.m. UTC | #1
On 20.06.23 05:36, AceLan Kao wrote:
> From: "Chia-Lin Kao (AceLan)" <acelan.kao@canonical.com>
> 
> BugLink: https://launchpad.net/bugs/2018566
                                    ^ bugs.launchpad.net
> 
> [Impact]
> During the S3 stress test, the system sometimes hangs when resuming. This
> is due to the SCSI rescan task being unable to acquire the mutex lock
> during the resumption from S3. The mutex lock has already been acquired by
> EH and is waiting for the device to be ready for a rescan. Unfortunately,
> the mutex lock is never released by either party, leading to a deadlock.
> 
> [Fix]
> Kaiheng submitted a patch to fix this issue which defers the rescan if the
> disk is still suspended so the resume process of the disk device can proceed.
> https://patchwork.ozlabs.org/project/linux-ide/patch/20230502150435.423770-2-kai.heng.feng@canonical.com/
> 
> Since the patch has not been accepted by the upstream yet, so submit it to the OEM kernel for now.

This is no longer true. The submitted patch is upstream as of v6.4-rc7. 
Updating old justifications might help to convince others to look at 
this more favorably.

> 
> The similiar patch has been included in v6.4-rc7, backport this to
> generic ubuntu kernels.
> 6aa0365a3c85 ata: libata-scsi: Avoid deadlock on rescan after device resume
> 
> [Test]
> Verified on the machines by me and ODM.
> 
> [Where problems could occur]
> It only defers the rescan task, and should not have any impact to current systems.
> 
> Damien Le Moal (1):
>    ata: libata-scsi: Avoid deadlock on rescan after device resume
> 
>   drivers/ata/libata-core.c |  3 ++-
>   drivers/ata/libata-eh.c   |  2 +-
>   drivers/ata/libata-scsi.c | 22 +++++++++++++++++++++-
>   include/linux/libata.h    |  2 +-
>   4 files changed, 25 insertions(+), 4 deletions(-)
> 

Anyhow, the submitted patch appears to be identical to upstream.

Acked-by: Stefan Bader <stefan.bader@canonical.com>

Tim Gardner June 20, 2023, 12:41 p.m. UTC | #2
On 6/19/23 9:36 PM, AceLan Kao wrote:
> From: "Chia-Lin Kao (AceLan)" <acelan.kao@canonical.com>
> 
> BugLink: https://launchpad.net/bugs/2018566
> 
> [Impact]
> During the S3 stress test, the system sometimes hangs when resuming. This
> is due to the SCSI rescan task being unable to acquire the mutex lock
> during the resumption from S3. The mutex lock has already been acquired by
> EH and is waiting for the device to be ready for a rescan. Unfortunately,
> the mutex lock is never released by either party, leading to a deadlock.
> 
> [Fix]
> Kaiheng submitted a patch to fix this issue which defers the rescan if the
> disk is still suspended so the resume process of the disk device can proceed.
> https://patchwork.ozlabs.org/project/linux-ide/patch/20230502150435.423770-2-kai.heng.feng@canonical.com/
> 
> Since the patch has not been accepted by the upstream yet, so submit it to the OEM kernel for now.
> 
> The similiar patch has been included in v6.4-rc7, backport this to
> generic ubuntu kernels.
> 6aa0365a3c85 ata: libata-scsi: Avoid deadlock on rescan after device resume
> 
> [Test]
> Verified on the machines by me and ODM.
> 
> [Where problems could occur]
> It only defers the rescan task, and should not have any impact to current systems.
> 
> Damien Le Moal (1):
>    ata: libata-scsi: Avoid deadlock on rescan after device resume
> 
>   drivers/ata/libata-core.c |  3 ++-
>   drivers/ata/libata-eh.c   |  2 +-
>   drivers/ata/libata-scsi.c | 22 +++++++++++++++++++++-
>   include/linux/libata.h    |  2 +-
>   4 files changed, 25 insertions(+), 4 deletions(-)
> 
Acked-by: Tim Gardner <tim.gardner@canonical.com>

Andrei Gherzan June 26, 2023, 11:19 a.m. UTC | #3
On 23/06/20 09:53AM, Stefan Bader wrote:
> On 20.06.23 05:36, AceLan Kao wrote:
> > From: "Chia-Lin Kao (AceLan)" <acelan.kao@canonical.com>
> > 
> > BugLink: https://launchpad.net/bugs/2018566
>                                    ^ bugs.launchpad.net
> > 
> > [Impact]
> > During the S3 stress test, the system sometimes hangs when resuming. This
> > is due to the SCSI rescan task being unable to acquire the mutex lock
> > during the resumption from S3. The mutex lock has already been acquired by
> > EH and is waiting for the device to be ready for a rescan. Unfortunately,
> > the mutex lock is never released by either party, leading to a deadlock.
> > 
> > [Fix]
> > Kaiheng submitted a patch to fix this issue which defers the rescan if the
> > disk is still suspended so the resume process of the disk device can proceed.
> > https://patchwork.ozlabs.org/project/linux-ide/patch/20230502150435.423770-2-kai.heng.feng@canonical.com/
> > 
> > Since the patch has not been accepted by the upstream yet, so submit it to the OEM kernel for now.
> 
> This is no longer true. The submitted patch is upstream as of v6.4-rc7.
> Updating old justifications might help to convince others to look at this
> more favorably.

As Stefan mentioned, this is already in:

v6.4
v6.4-rc7

> > 
> > The similiar patch has been included in v6.4-rc7, backport this to
> > generic ubuntu kernels.
> > 6aa0365a3c85 ata: libata-scsi: Avoid deadlock on rescan after device resume
> > 
> > [Test]
> > Verified on the machines by me and ODM.
> > 
> > [Where problems could occur]
> > It only defers the rescan task, and should not have any impact to current systems.
> > 
> > Damien Le Moal (1):
> >    ata: libata-scsi: Avoid deadlock on rescan after device resume
> > 
> >   drivers/ata/libata-core.c |  3 ++-
> >   drivers/ata/libata-eh.c   |  2 +-
> >   drivers/ata/libata-scsi.c | 22 +++++++++++++++++++++-
> >   include/linux/libata.h    |  2 +-
> >   4 files changed, 25 insertions(+), 4 deletions(-)
> > 
> 
> Anyhow, the submitted patch appears to be identical to upstream.
> 
> Acked-by: Stefan Bader <stefan.bader@canonical.com>
> 

Acked-by: Andrei Gherzan <andrei.gherzan@canonical.com>

Roxana Nicolescu July 7, 2023, 1:34 p.m. UTC | #4
On 20/06/2023 05:36, AceLan Kao wrote:
> From: "Chia-Lin Kao (AceLan)" <acelan.kao@canonical.com>
>
> BugLink: https://launchpad.net/bugs/2018566
>
> [Impact]
> During the S3 stress test, the system sometimes hangs when resuming. This
> is due to the SCSI rescan task being unable to acquire the mutex lock
> during the resumption from S3. The mutex lock has already been acquired by
> EH and is waiting for the device to be ready for a rescan. Unfortunately,
> the mutex lock is never released by either party, leading to a deadlock.
>
> [Fix]
> Kaiheng submitted a patch to fix this issue which defers the rescan if the
> disk is still suspended so the resume process of the disk device can proceed.
> https://patchwork.ozlabs.org/project/linux-ide/patch/20230502150435.423770-2-kai.heng.feng@canonical.com/
>
> Since the patch has not been accepted by the upstream yet, so submit it to the OEM kernel for now.
>
> The similiar patch has been included in v6.4-rc7, backport this to
> generic ubuntu kernels.
> 6aa0365a3c85 ata: libata-scsi: Avoid deadlock on rescan after device resume
>
> [Test]
> Verified on the machines by me and ODM.
>
> [Where problems could occur]
> It only defers the rescan task, and should not have any impact to current systems.
>
> Damien Le Moal (1):
>    ata: libata-scsi: Avoid deadlock on rescan after device resume
>
>   drivers/ata/libata-core.c |  3 ++-
>   drivers/ata/libata-eh.c   |  2 +-
>   drivers/ata/libata-scsi.c | 22 +++++++++++++++++++++-
>   include/linux/libata.h    |  2 +-
>   4 files changed, 25 insertions(+), 4 deletions(-)
>
Applied to lunar/jammy:master-next. I adjusted the buglink accordingly. 
Thanks.

Roxana