[1/3] libata: avoid waking disk for several commands

Message ID	20240107180258.360886-2-phill@thesusis.net
State	New
Headers	show Return-Path: <linux-ide+bounces-166-incoming=patchwork.ozlabs.org@vger.kernel.org> From: Phillip Susi <phill@thesusis.net> To: linux-ide@vger.kernel.org Cc: Damien Le Moal <dlemoal@kernel.org>, Sergey Shtylyov <s.shtylyov@omp.ru>, Phillip Susi <phill@thesusis.net> Subject: [PATCH 1/3] libata: avoid waking disk for several commands Date: Sun, 7 Jan 2024 13:02:56 -0500 Message-Id: <20240107180258.360886-2-phill@thesusis.net> In-Reply-To: <20240107180258.360886-1-phill@thesusis.net> References: <87msthdo11.fsf@vps.thesusis.net> <20240107180258.360886-1-phill@thesusis.net> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Let sleeping disks lie \| expand [0/3] Let sleeping disks lie [1/3] libata: avoid waking disk for several commands [2/3] libata: only wake a drive once on system resume [3/3] libata: don't start PuiS disks on resume

Phillip Susi Jan. 7, 2024, 6:02 p.m. UTC

When a disk is in SLEEP mode it can not respond to any
any commands.  Several commands are simply a NOOP to a disk
that is in standby mode, but when a disk is in SLEEP mode,
they frequencly cause the disk to spin up for no reason.
To avoid this, complete these commands in libata without
waking the disk.  These commands are:

CHECK POWER MODE
FLUSH CACHE
SLEEP
STANDBY IMMEDIATE
IDENTIFY

If we know the disk is sleeping, we don't need to wake it up
to find out if it is in standby, so just pretend it is in
standby.  While asleep, there's no dirty pages in the cache,
so there's no need to flush it.  There's no point in waking
a disk from sleep just to put it back to sleep.  We also have
a cache of the IDENTIFY information so just return that
instead of waking the disk.
---
 drivers/ata/libata-core.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

Damien Le Moal Jan. 8, 2024, 6:25 a.m. UTC | #1

On 1/8/24 03:02, Phillip Susi wrote:
> When a disk is in SLEEP mode it can not respond to any
> any commands.  Several commands are simply a NOOP to a disk
> that is in standby mode, but when a disk is in SLEEP mode,
> they frequencly cause the disk to spin up for no reason.
> To avoid this, complete these commands in libata without
> waking the disk.  These commands are:

As commented in patch 3/3, please use full 72-char lines for commit messages.

> 
> CHECK POWER MODE
> FLUSH CACHE
> SLEEP
> STANDBY IMMEDIATE
> IDENTIFY
> 
> If we know the disk is sleeping, we don't need to wake it up
> to find out if it is in standby, so just pretend it is in

sleep and standby are different power states. So saying that a disk that is
sleeping is in standby does not make sense. And if you wake up a drive from
sleep mode, it will *not* be in standby (need to re-check, but I think that
holds true even with PUIS enabled).

> standby.  While asleep, there's no dirty pages in the cache,
> so there's no need to flush it.  There's no point in waking
> a disk from sleep just to put it back to sleep.  We also have
> a cache of the IDENTIFY information so just return that
> instead of waking the disk.

The problem here is that ATA_DFLAG_SLEEPING is a horrible hack to not endup with
lots of timeout failures if the user execute "hdparm -Y". Executing such
passthrough command with a disk being used by an FS (for instance) is complete
nonsense and should not be done.

So I would rather see this handled correctly, through the kernel pm runtime
suspend/resume:
1) Define a libata device sysfs attribute that allows going to sleep instead of
the default standby when the disk is runtime suspended. If sleep is used, set
ATA_DFLAG_SLEEPING.
2) With that, any command issued to the disk will trigger runtime resume. If
ATA_DFLAG_SLEEPING is set, then the drive can be woken up with a link reset from
EH, going through ata_port_runtime_resume(), exactly like with the default
standby state for suspend. ATA_DFLAG_SLEEPING being set or not will indicate if
a simple verify command can spinup the disk or if a link abort is needed (like
done now in ata_qc_issue() which is really a nasty place to do that).

Now, the annoying thing is the drive being randomly woken-up due to commands
being issued, like the ones you mention. This is indeed bad, and seeing news
like this:

https://www.phoronix.com/news/Linux-PM-Regulatory-Bugs

I think we really should do better...

But I do not think the kernel is necessarilly the right place to fix this, at
least in the case of commands issued from userspace by things like smartd or
udevd. Patching there is needed to avoid uselessly waking up disks in runtime
suspend. systemd already has power policies etc, so there is integration with
the kernel side power management. Your issues come from using a tool (hdparm)
that has no integration at all with the OS daemons.

For FSes issued commands like flush, these are generally not random at all. If
you see them appearing randomly, then there is a problem with the FS and
patching the FS may be needed. Beside flush, there are other things to consider
here. Ex: FSes using zoned block devices (SMR disks) doing garbage collection
while idle. We cannot prevent this from happening, which is why I seriously
dislike the idea of faking any command for a sleeping disk.

> ---
>  drivers/ata/libata-core.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 09ed67772fae..6c5269de4bf2 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -5040,6 +5040,26 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
>  
>  	/* if device is sleeping, schedule reset and abort the link */
>  	if (unlikely(qc->dev->flags & ATA_DFLAG_SLEEPING)) {
> +		switch (qc->tf.command)
> +		{
> +		case ATA_CMD_CHK_POWER:
> +		case ATA_CMD_SLEEP:
> +		case ATA_CMD_FLUSH:
> +		case ATA_CMD_FLUSH_EXT:
> +		case ATA_CMD_STANDBYNOW1:
> +			if (qc->tf.command == ATA_CMD_ID_ATA)
> +			{
> +				/* only fake the reply for IDENTIFY if it is from userspace */
> +				if (ata_tag_internal(qc->tag))
> +					break;
> +				sg_copy_from_buffer(qc->sg, 1, qc->dev->id, 2 * ATA_ID_WORDS);
> +			}
> +			/* fake reply to avoid waking drive */
> +			qc->flags |= ATA_QCFLAG_RTF_FILLED;
> +			qc->result_tf.nsect = 0;
> +			ata_qc_complete(qc);
> +			return;
> +		}
>  		link->eh_info.action |= ATA_EH_RESET;
>  		ata_ehi_push_desc(&link->eh_info, "waking up from sleep");
>  		ata_link_abort(link);

Sergey Shtylyov Jan. 8, 2024, 8:48 a.m. UTC | #2

On 1/7/24 9:02 PM, Phillip Susi wrote:

> When a disk is in SLEEP mode it can not respond to any
> any commands.  Several commands are simply a NOOP to a disk
> that is in standby mode, but when a disk is in SLEEP mode,
> they frequencly cause the disk to spin up for no reason.

   Frequently.

> To avoid this, complete these commands in libata without
> waking the disk.  These commands are:
> 
> CHECK POWER MODE
> FLUSH CACHE
> SLEEP
> STANDBY IMMEDIATE
> IDENTIFY
> 
> If we know the disk is sleeping, we don't need to wake it up
> to find out if it is in standby, so just pretend it is in
> standby.  While asleep, there's no dirty pages in the cache,
> so there's no need to flush it.  There's no point in waking
> a disk from sleep just to put it back to sleep.  We also have
> a cache of the IDENTIFY information so just return that
> instead of waking the disk.
[...]

   Did you run your patches thru scripts/checkpatch.pl? Looking
at this patch, I think you didn't... :-)

> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index 09ed67772fae..6c5269de4bf2 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -5040,6 +5040,26 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
>  
>  	/* if device is sleeping, schedule reset and abort the link */
>  	if (unlikely(qc->dev->flags & ATA_DFLAG_SLEEPING)) {
> +		switch (qc->tf.command)
> +		{

		switch (qc->tf.command) {

> +		case ATA_CMD_CHK_POWER:
> +		case ATA_CMD_SLEEP:
> +		case ATA_CMD_FLUSH:
> +		case ATA_CMD_FLUSH_EXT:
> +		case ATA_CMD_STANDBYNOW1:
> +			if (qc->tf.command == ATA_CMD_ID_ATA)

   This can't happen, you forgot:

		case ATA_CMD_ID_ATA:

> +			{

			if (qc->tf.command == ATA_CMD_ID_ATA) {

> +				/* only fake the reply for IDENTIFY if it is from userspace */
> +				if (ata_tag_internal(qc->tag))
> +					break;
> +				sg_copy_from_buffer(qc->sg, 1, qc->dev->id, 2 * ATA_ID_WORDS);
> +			}
> +			/* fake reply to avoid waking drive */
> +			qc->flags |= ATA_QCFLAG_RTF_FILLED;
> +			qc->result_tf.nsect = 0;
> +			ata_qc_complete(qc);
> +			return;
> +		}
>  		link->eh_info.action |= ATA_EH_RESET;
>  		ata_ehi_push_desc(&link->eh_info, "waking up from sleep");
>  		ata_link_abort(link);

MBR, Sergey

Phillip Susi Jan. 8, 2024, 1:27 p.m. UTC | #3

Damien Le Moal <dlemoal@kernel.org> writes:

> sleep and standby are different power states. So saying that a disk that is
> sleeping is in standby does not make sense. And if you wake up a drive from

There is no way to answer CHECK POWER MODE and say the drive is in
sleep.  It can only be either active, or standby, so standby is the
closest fit.  At least it gets smartd and udisks2 to leave the drive
alone.

> The problem here is that ATA_DFLAG_SLEEPING is a horrible hack to not endup with
> lots of timeout failures if the user execute "hdparm -Y". Executing such
> passthrough command with a disk being used by an FS (for instance) is complete
> nonsense and should not be done.

I'm not sure what you propose be done instead.  Regardless, this is how
it has always been done, so I don't think there is any changing it now.
You also have the legacy standby timer that is exposed to users through
udisks2/gnome-disk-utility that still has to be supported.

> So I would rather see this handled correctly, through the kernel pm runtime
> suspend/resume:

I'd eventually like to as well, but it should also work in kernels that
aren't built with runtime pm enabled.

> Now, the annoying thing is the drive being randomly woken-up due to commands
> being issued, like the ones you mention. This is indeed bad, and seeing news
> like this:
>
> https://www.phoronix.com/news/Linux-PM-Regulatory-Bugs
>
> I think we really should do better...

That sounds like broken firmware to me.  When you ask the firmware to
power off the system, it should really be powered off.

> For FSes issued commands like flush, these are generally not random at all. If
> you see them appearing randomly, then there is a problem with the FS and
> patching the FS may be needed. Beside flush, there are other things to
> consider

I'm not sure the filesystem maintainers will see it that way.  They
generally issue barriers as part of a commit at regular intervals, and
that gets turned into FLUSH CACHE.  Also the kernel issues one during
system suspend, and I think that happens even if no filesystem is
mounted.  I think systemd also issues a sync() during shutdown, which
would wake up a sleeping disk only to shut down.

I don't think it is up to all of these other sources to be patched to
avoid this.  libata knows the disk is in sleep mode, so that is the
place to handle it.

> here. Ex: FSes using zoned block devices (SMR disks) doing garbage collection
> while idle. We cannot prevent this from happening, which is why I seriously
> dislike the idea of faking any command for a sleeping disk.

I don't see the connection.  If the FS issues IO in the background, the
disk will wake up, just like it does when in standby.  The faking of the
commands simply replicates the behavior you get from standby in sleep
mode.  With this patch, as far as anyone can tell, a sleeping disk is
exactly the same as one in standby.

Phillip Susi Jan. 8, 2024, 1:30 p.m. UTC | #4

Sergey Shtylyov <s.shtylyov@omp.ru> writes:

>    Did you run your patches thru scripts/checkpatch.pl? Looking
> at this patch, I think you didn't... :-)

Will do.

>    This can't happen, you forgot:
>
> 		case ATA_CMD_ID_ATA:

Woops.. that must have gotten lost somehow when I was cutting and
pasting to avoid the fallthrough.

Damien Le Moal Jan. 10, 2024, 2:39 a.m. UTC | #5

On 1/8/24 22:27, Phillip Susi wrote:
> Damien Le Moal <dlemoal@kernel.org> writes:
> 
>> sleep and standby are different power states. So saying that a disk that is
>> sleeping is in standby does not make sense. And if you wake up a drive from
> 
> There is no way to answer CHECK POWER MODE and say the drive is in
> sleep.  It can only be either active, or standby, so standby is the
> closest fit.  At least it gets smartd and udisks2 to leave the drive
> alone.

I was not commenting about what to do about your problem, but about the fact
that your sentence was very hard to understand because it was not technically
accurate.

>> The problem here is that ATA_DFLAG_SLEEPING is a horrible hack to not endup with
>> lots of timeout failures if the user execute "hdparm -Y". Executing such
>> passthrough command with a disk being used by an FS (for instance) is complete
>> nonsense and should not be done.
> 
> I'm not sure what you propose be done instead.  Regardless, this is how
> it has always been done, so I don't think there is any changing it now.

I never proposed to change that in any way. That is fine and can stay as it is.
What I do NOT want to do is expand upon this to try to solve issues. The reason
for that, which I already stated, is that hdparm issue passthrough commands. And
if the user wants to use passthrough commands, then most of the time, he/she
will have to deal with the consequences of not using kernel-provided management
methods.

I did propose to allow for runtime suspend to to use sleep state instead of
standby. That would be fairly easy to do and replace manual "hdparm -Y" with a
well integrated control of the disk power state up to the block layer.
You never commented back about this.

> You also have the legacy standby timer that is exposed to users through
> udisks2/gnome-disk-utility that still has to be supported.

What is this legacy standby timer ? What control path does it trigger ? Do
udisks2/gnome-disk-utility use that timer to issue commands like "hdparm -Y"  ?
Or does that timer tigh into the regular runtime suspend ?

>> So I would rather see this handled correctly, through the kernel pm runtime
>> suspend/resume:
> 
> I'd eventually like to as well, but it should also work in kernels that
> aren't built with runtime pm enabled.

No. As said many times now, I am not going to do anything about the hdparm -Y
hacking. If a user want better power management features, he/she should enable
power management in their kernels.

>> For FSes issued commands like flush, these are generally not random at all. If
>> you see them appearing randomly, then there is a problem with the FS and
>> patching the FS may be needed. Beside flush, there are other things to
>> consider
> 
> I'm not sure the filesystem maintainers will see it that way.  They
> generally issue barriers as part of a commit at regular intervals, and
> that gets turned into FLUSH CACHE.  Also the kernel issues one during
> system suspend, and I think that happens even if no filesystem is
> mounted.  I think systemd also issues a sync() during shutdown, which
> would wake up a sleeping disk only to shut down.

No. The scsi layer issues a FLUSH CACHE whenever a disk is removed, goes to
sleep or the system shutdown. And there is no need to do that if the disk is
already in standby. If you see that happening, then we need to fix that.

If the device is in sleep state from "hdparm -Y", then only libata knows that
the device is sleeping with the ATA_DFLAG_SLEEPING flag. That is the fundamental
problem here: pm-core, scsi and the block layer do not know that the block
device is sleeping (and so already had its write cache flushed). Your patches
are not solving this root cause issue. They are only hidding it by faking the
commands. This is a hack, wich likely will need more hacks in the future for
different cases. See my point ? I do not want to go down such route. Let's fix
things properly.

> I don't think it is up to all of these other sources to be patched to
> avoid this.  libata knows the disk is in sleep mode, so that is the
> place to handle it.

Not that simple. See above.

Phillip Susi Jan. 16, 2024, 5:06 p.m. UTC | #6

Damien Le Moal <dlemoal@kernel.org> writes:

> I did propose to allow for runtime suspend to to use sleep state instead of
> standby. That would be fairly easy to do and replace manual "hdparm -Y" with a
> well integrated control of the disk power state up to the block layer.
> You never commented back about this.

That would be nice.  I assume that would involve changing how
libata-scsi.c translates SYNCHRONIZE CACHE from the scsi layer?

> What is this legacy standby timer ? What control path does it trigger ? Do
> udisks2/gnome-disk-utility use that timer to issue commands like "hdparm -Y"  ?
> Or does that timer tigh into the regular runtime suspend ?

The ATA disk internal auto standby timer, i.e. hdparm -S.

> No. As said many times now, I am not going to do anything about the hdparm -Y
> hacking. If a user want better power management features, he/she should enable
> power management in their kernels.

So you are saying that we need to patch the kernel to make runtime pm
work better, then patch smartd and udisks2 to check for runtime pm
before issuing their SMART commands, and patch udsisks2 to enable
runtime pm rather than using the legacy ATA standby timer?

> No. The scsi layer issues a FLUSH CACHE whenever a disk is removed, goes to
> sleep or the system shutdown. And there is no need to do that if the disk is
> already in standby. If you see that happening, then we need to fix that.

I'm almost certain that I have seen this happen, and I don't currently
see any code in sd.c that would would prevent it from issuing a FLUSH
CACHE to a disk that is runtime suspended when the system suspends or
shuts down.

The block layer also would need patched to avoid turning a barrier into
a FLUSH CACHE if the disk is runtime suspended, and also the sync()
path.  Is that even sensible to do?  It is true that for all block
devices, their caches do not need flushed while they are runtime
suspended?  It seems like it may be, but I'm not certain.

Phillip Susi Jan. 19, 2024, 8:43 p.m. UTC | #7

Phillip Susi <phill@thesusis.net> writes:

> The block layer also would need patched to avoid turning a barrier into
> a FLUSH CACHE if the disk is runtime suspended, and also the sync()
> path.  Is that even sensible to do?  It is true that for all block
> devices, their caches do not need flushed while they are runtime
> suspended?  It seems like it may be, but I'm not certain.

I was trying to do this.  I think the right place is in
blkdev_issue_flush(), but apparently bdev->bd_device is not the same
struct device that gets suspended.  I can't seem to work out where the
right struct device is to pass to pm_runtime_suspended() and skip the
flush operation.

Phillip Susi Jan. 20, 2024, 6:08 p.m. UTC | #8

Phillip Susi <phill@thesusis.net> writes:

> I was trying to do this.  I think the right place is in
> blkdev_issue_flush(), but apparently bdev->bd_device is not the same
> struct device that gets suspended.  I can't seem to work out where the
> right struct device is to pass to pm_runtime_suspended() and skip the
> flush operation.

I don't know what I was thinking yesterday.  It can't rely on
pm_runtime_suspended() because it would continue to flush and reset the
suspend timer before it ever gets suspended.  I wonder if it could use
the performance counters?  Whenever a flush is done, and also when
suspending, store the value of the write counter, and only if it has
changed, issue the flush, otherwise skip it?

Damien Le Moal Jan. 21, 2024, 12:37 a.m. UTC | #9

On 1/20/24 05:43, Phillip Susi wrote:
> Phillip Susi <phill@thesusis.net> writes:
> 
>> The block layer also would need patched to avoid turning a barrier into
>> a FLUSH CACHE if the disk is runtime suspended, and also the sync()
>> path.  Is that even sensible to do?  It is true that for all block
>> devices, their caches do not need flushed while they are runtime
>> suspended?  It seems like it may be, but I'm not certain.
> 
> I was trying to do this.  I think the right place is in
> blkdev_issue_flush(), but apparently bdev->bd_device is not the same
> struct device that gets suspended.  I can't seem to work out where the
> right struct device is to pass to pm_runtime_suspended() and skip the
> flush operation.

Flush issuing is a lot more complicated than just blkdev_issue_flush(). There is
a whole file dedicated to handling flushes (block/blk-flush.c).

But that is beside the point, which is that trying to not execute flush is
simply completely wrong. Please stop trying.

For your case, which is a drive put to sleep with hdparm -Y, only libata is
aware that the drive is sleeping. That first needs to change if you want the
kernel overall to be able to do anything. As I proposed already, using runtime
PM with sleep mode instead of standby would be a good start.

Regarding the flushes and other commands you see that are waking up the drive
for no *apparent* good reasons, please identify what application/component is
issuing these commands and look there to see why the commands are being issued
and if that is done with awareness for the device power state.

Then we can patch properly instead of trying to "hack".

Damien Le Moal Jan. 21, 2024, 12:37 a.m. UTC | #10

On 1/21/24 03:08, Phillip Susi wrote:
> Phillip Susi <phill@thesusis.net> writes:
> 
>> I was trying to do this.  I think the right place is in
>> blkdev_issue_flush(), but apparently bdev->bd_device is not the same
>> struct device that gets suspended.  I can't seem to work out where the
>> right struct device is to pass to pm_runtime_suspended() and skip the
>> flush operation.
> 
> I don't know what I was thinking yesterday.  It can't rely on
> pm_runtime_suspended() because it would continue to flush and reset the
> suspend timer before it ever gets suspended.  I wonder if it could use
> the performance counters?  Whenever a flush is done, and also when
> suspending, store the value of the write counter, and only if it has
> changed, issue the flush, otherwise skip it?

See my reply to your previous comment.
What you are trying to do is not the correct approach.

Phillip Susi Jan. 24, 2024, 4:04 p.m. UTC | #11

Damien Le Moal <dlemoal@kernel.org> writes:

> Flush issuing is a lot more complicated than just blkdev_issue_flush(). There is
> a whole file dedicated to handling flushes (block/blk-flush.c).
>
> But that is beside the point, which is that trying to not execute flush is
> simply completely wrong. Please stop trying.

I tried before to have libata ignore the useless flush and you said to
stop the flush from happening in the first place.  Now you say that's wrong?

> For your case, which is a drive put to sleep with hdparm -Y, only libata is
> aware that the drive is sleeping. That first needs to change if you want the
> kernel overall to be able to do anything. As I proposed already, using runtime
> PM with sleep mode instead of standby would be a good start.

No, I'm working on runtime pm now, as you suggested.  If you try using
runtime pm with disks, you quickly see that it does not work.

> Regarding the flushes and other commands you see that are waking up the drive
> for no *apparent* good reasons, please identify what application/component is
> issuing these commands and look there to see why the commands are being issued
> and if that is done with awareness for the device power state.

Filesystems flush every few seconds.  So does anyone calling sync(),
which the kernel does when you suspend to ram.

Either the filesystems need to keep track of whether a flush is needed
and skip it, or if they all call the same place ( blkdev_issue_flush ),
then it only needs done once in that place.

The core logic needs to be "if nothing has been written, then nothing
needs to be flushed".  Right now filesystems just flush periodically or
when their sync method is called to make sure that anything that has
been written is stable.

In userspace you have smartd and udisks2 to worry about.  The
information they need to know is whether the disk has otherwise been in
recent use and so they should not poll the SMART data.  I don't see
anything exported in the power/ directory that would give an indication
of the current *remaining* time until autosuspend, or how long it has
been since the last access.  Either one would be useful for userspace to
decide that nobody else seems to be using the disk, so I'll leave it
alone for now so it can go to sleep.

Damien Le Moal Jan. 24, 2024, 9:51 p.m. UTC | #12

On 1/25/24 01:04, Phillip Susi wrote:
> Damien Le Moal <dlemoal@kernel.org> writes:
> 
>> Flush issuing is a lot more complicated than just blkdev_issue_flush(). There is
>> a whole file dedicated to handling flushes (block/blk-flush.c).
>>
>> But that is beside the point, which is that trying to not execute flush is
>> simply completely wrong. Please stop trying.
> 
> I tried before to have libata ignore the useless flush and you said to
> stop the flush from happening in the first place.  Now you say that's wrong?

No, not wrong. But you are looking at the wrong layer. Do not try to prevent
flushes from being issued at the block layer level but *above* it. That is,
FSes, applications etc.

>> For your case, which is a drive put to sleep with hdparm -Y, only libata is
>> aware that the drive is sleeping. That first needs to change if you want the
>> kernel overall to be able to do anything. As I proposed already, using runtime
>> PM with sleep mode instead of standby would be a good start.
> 
> No, I'm working on runtime pm now, as you suggested.  If you try using
> runtime pm with disks, you quickly see that it does not work.

What does not work ? Everything is fine with my testing: the drive is always
woken up whenever someone (FS, applications etc) issue a block IO (including
flush) to the block layer. That is the expected behavior. If you want to have
the disk keep sleeping, the device users must stop poking at the drive.

> 
>> Regarding the flushes and other commands you see that are waking up the drive
>> for no *apparent* good reasons, please identify what application/component is
>> issuing these commands and look there to see why the commands are being issued
>> and if that is done with awareness for the device power state.
> 
> Filesystems flush every few seconds.  So does anyone calling sync(),
> which the kernel does when you suspend to ram.

Filesystems issue flush only if they have dirty data to flush, normally. I have
not looked at ext4/xfs code in a while, but if the FS has not committed any
transaction in the last cycle, there should be no flush issued. If there is,
then someone is reading or writing files (for reading, if you have atime
enabled, that will cause a metadata change and so a transaction commit).

> Either the filesystems need to keep track of whether a flush is needed
> and skip it, or if they all call the same place ( blkdev_issue_flush ),
> then it only needs done once in that place.

See above. That is why I am telling you to run blktrace or any tracer to figure
out the command sequence and who is doing what. There is absolutely no clarity
to what you are describing and so we cannot plan for a solution.

> The core logic needs to be "if nothing has been written, then nothing
> needs to be flushed".  Right now filesystems just flush periodically or
> when their sync method is called to make sure that anything that has
> been written is stable.

I do not think that the periodic flush happens if nothing has been written. For
"sync", someone issues that, not the FS on its own. So trace it to find out who
is doing that.

> In userspace you have smartd and udisks2 to worry about.  The
> information they need to know is whether the disk has otherwise been in
> recent use and so they should not poll the SMART data.  I don't see
> anything exported in the power/ directory that would give an indication
> of the current *remaining* time until autosuspend, or how long it has
> been since the last access.  Either one would be useful for userspace to
> decide that nobody else seems to be using the disk, so I'll leave it
> alone for now so it can go to sleep.

Phillip Susi Feb. 1, 2024, 8:01 p.m. UTC | #13

Damien Le Moal <dlemoal@kernel.org> writes:

> What does not work ? Everything is fine with my testing: the drive is always
> woken up whenever someone (FS, applications etc) issue a block IO (including
> flush) to the block layer. That is the expected behavior. If you want to have
> the disk keep sleeping, the device users must stop poking at the drive.

It seems that I have put my foot in my mouth.  When I first started
working on these patches way back when, I did see flushes without any
writes in the blktrace keeping the drive awake.  I assumed that was
still a problem that I needed to tackle.  I should have tested it
first.  It seems this problem has been fixed already.

ext4 does still seem to issue a flush with no writes in the sync path
though, causing a drive to spin up for no reason, then right back down
when you suspend-to-ram.  I guess I'll ask about this on the ext4 list.

Circling back to the PuiS patch, did I understand correctly that you are
fine with that as long as it integrates with runtime pm?

I had tried at one point to add support for REQUEST SENSE to libata so
that sd can issue that to find out if the disk is powered up or not, and
set the runtime_pm status of the disk accordingly.  Does that make sense
to you?

Damien Le Moal Feb. 2, 2024, 1:08 a.m. UTC | #14

On 2/2/24 05:01, Phillip Susi wrote:
> Damien Le Moal <dlemoal@kernel.org> writes:
> 
>> What does not work ? Everything is fine with my testing: the drive is always
>> woken up whenever someone (FS, applications etc) issue a block IO (including
>> flush) to the block layer. That is the expected behavior. If you want to have
>> the disk keep sleeping, the device users must stop poking at the drive.
> 
> It seems that I have put my foot in my mouth.  When I first started
> working on these patches way back when, I did see flushes without any
> writes in the blktrace keeping the drive awake.  I assumed that was
> still a problem that I needed to tackle.  I should have tested it
> first.  It seems this problem has been fixed already.
> 
> ext4 does still seem to issue a flush with no writes in the sync path
> though, causing a drive to spin up for no reason, then right back down
> when you suspend-to-ram.  I guess I'll ask about this on the ext4 list.
> 
> Circling back to the PuiS patch, did I understand correctly that you are
> fine with that as long as it integrates with runtime pm?

Yes, but only for drives that report full identify data when PUIS is enabled.
For drives that report incomplete identify data, we have no choice but to wake
them up. And yes, we need integration with runtime pm to set the initial power
state of the drive to standby (instead of "on") for both the ata device and its
scsi device.

> I had tried at one point to add support for REQUEST SENSE to libata so
> that sd can issue that to find out if the disk is powered up or not, and
> set the runtime_pm status of the disk accordingly.  Does that make sense
> to you?

I need to check that. I think there may be a better/easier way to get the
current power state of a drive. Will get back to you on that.

Phillip Susi Feb. 2, 2024, 7:53 p.m. UTC | #15

Damien Le Moal <dlemoal@kernel.org> writes:

> Yes, but only for drives that report full identify data when PUIS is enabled.
> For drives that report incomplete identify data, we have no choice but to wake
> them up. And yes, we need integration with runtime pm to set the
> initial power

Why was that again?  I think you said something about needing to set the
speed correctly so you at least need to know if this drive requires a
lower speed than the other in the PATA master/slave pair?  Wouldn't that
only require the speed information, not all identify data?

> state of the drive to standby (instead of "on") for both the ata device and its
> scsi device.

You mean if the whole device hierarchy were changed so that instead of
the scsi_host being a child of the port with the links and devices
hanging off to the side, the scsi_host would be the child of the ata device?

> I need to check that. I think there may be a better/easier way to get the
> current power state of a drive. Will get back to you on that.

That would be good.  At one point that was the way I found, and also I
think it was in the SAT-3 spec that is how REQUEST SENSE should be
implemented for ATA disks.

Damien Le Moal Feb. 2, 2024, 11:17 p.m. UTC | #16

On 2/3/24 04:53, Phillip Susi wrote:
> Damien Le Moal <dlemoal@kernel.org> writes:
> 
>> Yes, but only for drives that report full identify data when PUIS is enabled.
>> For drives that report incomplete identify data, we have no choice but to wake
>> them up. And yes, we need integration with runtime pm to set the
>> initial power
> 
> Why was that again?  I think you said something about needing to set the
> speed correctly so you at least need to know if this drive requires a
> lower speed than the other in the PATA master/slave pair?  Wouldn't that
> only require the speed information, not all identify data?

See ata_dev_revalidate() and ata_dev_configure() and all the drive features that
are checked using the identify data. We need to preserve that to ensure that
nothing changed on the drive so that its representation in libata is kept in
sync with the drive config. That is why drive starting with PUIS and not giving
full identify data *must* be woken up, which is the current libata behavior.

>> state of the drive to standby (instead of "on") for both the ata device and its
>> scsi device.
> 
> You mean if the whole device hierarchy were changed so that instead of
> the scsi_host being a child of the port with the links and devices
> hanging off to the side, the scsi_host would be the child of the ata device?

You do not need to change the hierarchy of devices. An ata_dev is already a
child of its scsi_dev. So if you want to set the ata device to runtime suspend
state, you have to have the parent in the same state too. runtime suspend work
top-to-bottom in the device chain. You cannot have random device in the middle
of the chain going to suspend without the devices above it also being suspended.

Also, the user does not use ata devices directly. They use the scsi device
representing the ata device. You must thus have that in sync with the ata device
state.

>> I need to check that. I think there may be a better/easier way to get the
>> current power state of a drive. Will get back to you on that.
> 
> That would be good.  At one point that was the way I found, and also I
> think it was in the SAT-3 spec that is how REQUEST SENSE should be
> implemented for ATA disks.

TEST UNIT READY is the command to use. I need to check the specs again about how
it reports the device state though. I think it is through sense data. The scsi
disk driver issues that command already to check that the drive is spun-up. See
sd_revalidate_disk() calling sd_spinup_disk(). So that will also need to change
to be aware of the initial power state to not spinup the drive if not wanted.
That will need to be done extremely carefully though to only affect ata devices
with PUIS. Likely some additional scsi device flag will be needed for this. I
have not looked into that yet.

Phillip Susi Feb. 5, 2024, 7:52 p.m. UTC | #17

Damien Le Moal <dlemoal@kernel.org> writes:

> See ata_dev_revalidate() and ata_dev_configure() and all the drive features that
> are checked using the identify data. We need to preserve that to ensure that
> nothing changed on the drive so that its representation in libata is kept in
> sync with the drive config. That is why drive starting with PUIS and not giving
> full identify data *must* be woken up, which is the current libata behavior.

Yes, the information is needed to revalidate, but why must this
revalidation be done during system resume, rather than postponed until later?

> You do not need to change the hierarchy of devices. An ata_dev is already a
> child of its scsi_dev. So if you want to set the ata device to runtime
> suspend

How is an ata_dev a child of its own scsi_dev?  Or did you mean to say
the reverse: the scsi_dev is a child of the ata_dev?  But that isn't the
case either.  In sysfs, you have:

ata_port
 - scsi_host
   - scsi_target
     - scsi_lun
 - ata_link
   - ata_dev

The link and dev hang off to the side, not in the ancestry of the scsi
disk.

> state, you have to have the parent in the same state too. runtime suspend work
> top-to-bottom in the device chain. You cannot have random device in the middle
> of the chain going to suspend without the devices above it also being suspended.
>
> Also, the user does not use ata devices directly. They use the scsi device
> representing the ata device. You must thus have that in sync with the ata device
> state.

Right... so when the system is resumed, first the ata_port is resumed,
then it has to be on for the scsi host, target, and lun to resume.  At
that point sd could check if the disk is actually spinning, and if not,
force it to suspend, which then allows the target to suspend, wich then
allows the host to suspend, and finally the ata_port.

[1/3] libata: avoid waking disk for several commands

Commit Message

Comments

Patch