diff mbox series

[v5,09/23] scsi: sd: Do not issue commands to suspended disks on shutdown

Message ID 20230921180758.955317-10-dlemoal@kernel.org
State New
Headers show
Series Fix libata suspend/resume handling and code cleanup | expand

Commit Message

Damien Le Moal Sept. 21, 2023, 6:07 p.m. UTC
If an error occurs when resuming a host adapter before the devices
attached to the adapter are resumed, the adapter low level driver may
remove the scsi host, resulting in a call to sd_remove() for the
disks of the host. This in turn results in a call to sd_shutdown() which
will issue a synchronize cache command and a start stop unit command to
spindown the disk. sd_shutdown() issues the commands only if the device
is not already suspended but does not check the power state for
system-wide suspend/resume. That is, the commands may be issued with the
device in a suspended state, which causes PM resume to hang, forcing a
reset of the machine to recover.

Fix this by not calling sd_shutdown() in sd_remove() if the device
is not running.

Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
---
 drivers/scsi/sd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Bart Van Assche Sept. 22, 2023, 6:14 p.m. UTC | #1
On 9/21/23 11:07, Damien Le Moal wrote:
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 1d106c8ad5af..d86306d42445 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3727,7 +3727,8 @@ static int sd_remove(struct device *dev)
>   
>   	device_del(&sdkp->disk_dev);
>   	del_gendisk(sdkp->disk);
> -	sd_shutdown(dev);
> +	if (sdkp->device->sdev_state == SDEV_RUNNING)
> +		sd_shutdown(dev);
>   
>   	put_disk(sdkp->disk);
>   	return 0;

Doesn't this patch involve a behavior change? I'm concerned the above patch
will always cause the sd_shutdown() call to be skipped if scsi_remove_host()
is called, whether or not the scsi_remove_host() call is triggered by a
resume failure.

Thanks,

Bart.
Damien Le Moal Sept. 22, 2023, 7:10 p.m. UTC | #2
On 2023/09/22 11:14, Bart Van Assche wrote:
> On 9/21/23 11:07, Damien Le Moal wrote:
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index 1d106c8ad5af..d86306d42445 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -3727,7 +3727,8 @@ static int sd_remove(struct device *dev)
>>   
>>   	device_del(&sdkp->disk_dev);
>>   	del_gendisk(sdkp->disk);
>> -	sd_shutdown(dev);
>> +	if (sdkp->device->sdev_state == SDEV_RUNNING)
>> +		sd_shutdown(dev);
>>   
>>   	put_disk(sdkp->disk);
>>   	return 0;
> 
> Doesn't this patch involve a behavior change? I'm concerned the above patch
> will always cause the sd_shutdown() call to be skipped if scsi_remove_host()
> is called, whether or not the scsi_remove_host() call is triggered by a
> resume failure.

I guess the point is: what are the possible states of the scsi device when
sd_remove() is called ? From what I have seen so far, it is RUNNING. But I am
not 100% sure this true all the time. But given that sd_shutdown() issues
commands, I would guess that it makes sense that it is always running.

Looking at the code, scsi_remove_host() calls scsi_forget_host() which calls
__scsi_remove_device() for any device that is not in the SDEV_DEL state.
__scsi_remove_device() then sets the state to SDEV_CANCEL. So it appears that
the state should always be CANCEL and not running. However, my tests showed it
to be running. I am not fully understanding how sd_remove() end up being called...

I will run tests again without this patch for libata suspend/resume. This patch
was added because of the hang I saw with the pm80xx driver (using libsas/libata)
failing to resume. That resume failure is fixed and libata does not need this
patch I think. But I will double check and drop it for now.

I think we should investigate this further though, to make sure that we can
always safely call sd_shutdown(). __scsi_remove_device() has this comment:

/*
 * If blocked, we go straight to DEL and restart the queue so
 * any commands issued during driver shutdown (like sync
 * cache) are errored immediately.
 */

Which kind of give a hint that we should probably not blindy always try to call
sd_shutdown().
Bart Van Assche Sept. 22, 2023, 8:08 p.m. UTC | #3
On 9/22/23 12:10, Damien Le Moal wrote:
> Looking at the code, scsi_remove_host() calls scsi_forget_host() which calls
> __scsi_remove_device() for any device that is not in the SDEV_DEL state.
> __scsi_remove_device() then sets the state to SDEV_CANCEL. So it appears that
> the state should always be CANCEL and not running. However, my tests showed it
> to be running. I am not fully understanding how sd_remove() end up being called...

I think this is how sd_sync_cache() gets called from inside
scsi_remove_host():

scsi_remove_host()
   -> scsi_forget_host()
     -> __scsi_remove_device()
       -> device_del(&sdev->sdev_gendev)
         -> bus_remove_device()
           -> device_release_driver()
             -> __device_release_driver()
               -> sd_remove()
                 -> sd_shutdown()
                   -> sd_sync_cache()

In other words, it is guaranteed that scsi_device_set_state(sdev, 
SDEV_CANCEL) has been called before sd_remove() if it is called by 
scsi_remove_host().

> I think we should investigate this further though, to make sure that we can
> always safely call sd_shutdown(). __scsi_remove_device() has this comment:
> 
> /*
>   * If blocked, we go straight to DEL and restart the queue so
>   * any commands issued during driver shutdown (like sync
>   * cache) are errored immediately.
>   */
> 
> Which kind of give a hint that we should probably not blindy always try to call
> sd_shutdown().

Does that comment perhaps refer to the SDEV_BLOCK / SDEV_CREATED_BLOCK
states? Anyway, I'm wondering whether there are better ways to prevent
that it is attempted to queue SCSI commands if a SCSI device is
suspended, e.g. by checking the suspended state from inside
scsi_device_state_check() or scsi_dispatch_cmd().

Thanks,

Bart.
Damien Le Moal Sept. 22, 2023, 8:36 p.m. UTC | #4
On 2023/09/22 13:08, Bart Van Assche wrote:
> On 9/22/23 12:10, Damien Le Moal wrote:
>> Looking at the code, scsi_remove_host() calls scsi_forget_host() which calls
>> __scsi_remove_device() for any device that is not in the SDEV_DEL state.
>> __scsi_remove_device() then sets the state to SDEV_CANCEL. So it appears that
>> the state should always be CANCEL and not running. However, my tests showed it
>> to be running. I am not fully understanding how sd_remove() end up being called...
> 
> I think this is how sd_sync_cache() gets called from inside
> scsi_remove_host():
> 
> scsi_remove_host()
>    -> scsi_forget_host()
>      -> __scsi_remove_device()
>        -> device_del(&sdev->sdev_gendev)
>          -> bus_remove_device()
>            -> device_release_driver()
>              -> __device_release_driver()
>                -> sd_remove()
>                  -> sd_shutdown()
>                    -> sd_sync_cache()
> 
> In other words, it is guaranteed that scsi_device_set_state(sdev, 
> SDEV_CANCEL) has been called before sd_remove() if it is called by 
> scsi_remove_host().
> 
>> I think we should investigate this further though, to make sure that we can
>> always safely call sd_shutdown(). __scsi_remove_device() has this comment:
>>
>> /*
>>   * If blocked, we go straight to DEL and restart the queue so
>>   * any commands issued during driver shutdown (like sync
>>   * cache) are errored immediately.
>>   */
>>
>> Which kind of give a hint that we should probably not blindy always try to call
>> sd_shutdown().
> 
> Does that comment perhaps refer to the SDEV_BLOCK / SDEV_CREATED_BLOCK
> states? Anyway, I'm wondering whether there are better ways to prevent
> that it is attempted to queue SCSI commands if a SCSI device is
> suspended, e.g. by checking the suspended state from inside
> scsi_device_state_check() or scsi_dispatch_cmd().

Using information in the device ->power structure is not reliable without
holding the device lock(), so we should not do that. But we can add a
"suspended" scsi_device flag that we maintain on execution of
sd_suspend_system() and sd_resume_system(). Many drivers do that...
Thoughts ?
Bart Van Assche Sept. 22, 2023, 9:32 p.m. UTC | #5
On 9/22/23 13:36, Damien Le Moal wrote:
> On 2023/09/22 13:08, Bart Van Assche wrote:
>> Does that comment perhaps refer to the SDEV_BLOCK / SDEV_CREATED_BLOCK
>> states? Anyway, I'm wondering whether there are better ways to prevent
>> that it is attempted to queue SCSI commands if a SCSI device is
>> suspended, e.g. by checking the suspended state from inside
>> scsi_device_state_check() or scsi_dispatch_cmd().
> 
> Using information in the device ->power structure is not reliable without
> holding the device lock(), so we should not do that. But we can add a
> "suspended" scsi_device flag that we maintain on execution of
> sd_suspend_system() and sd_resume_system(). Many drivers do that...
> Thoughts ?

That sounds good to me.

Thanks,

Bart.
diff mbox series

Patch

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 1d106c8ad5af..d86306d42445 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3727,7 +3727,8 @@  static int sd_remove(struct device *dev)
 
 	device_del(&sdkp->disk_dev);
 	del_gendisk(sdkp->disk);
-	sd_shutdown(dev);
+	if (sdkp->device->sdev_state == SDEV_RUNNING)
+		sd_shutdown(dev);
 
 	put_disk(sdkp->disk);
 	return 0;