libata: Whitelist SSDs that are known to properly return zeroes after TRIM

Message ID yq18uiojfq9.fsf@sermon.lab.mkp.net
State Not Applicable
Delegated to: David Miller

Commit Message

Martin K. Petersen Dec. 4, 2014, 2:44 a.m. UTC
As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
Zero After Trim) flags in the ATA Command Set are unreliable in the
sense that they only define what happens if the device successfully
executed the DSM TRIM command. TRIM is only advisory, however, and the
device is free to silently ignore all or parts of the request.

In practice this renders the DRAT and RZAT flags completely useless,
and because the results are unpredictable we decided to disable discard
in MD for 3.18 to avoid the risk of data corruption.

Hardware vendors in the real world obviously need better guarantees than
what the standards bodies provide. Unfortunately those guarantees are
encoded in product requirements documents rather than somewhere we can
key off of them programmatically. So we are compelled to disable
discard_zeroes_data for all devices unless we explicitly have data to
support whitelisting them.

This patch whitelists SSDs from a few of the main vendors. None of the
whitelists are based on written guarantees. They are purely based on
empirical evidence collected from internal and external users that have
tested or qualified these drives in RAID deployments.

The whitelist is only meant as a starting point and is by no means
comprehensive:

   - All intel SSD models except for 510
   - Micron M5*
   - Samsung SSDs
   - Seagate SSDs

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/ata/libata-core.c |   18 ++++++++++++++----
 drivers/ata/libata-scsi.c |   10 ++++++----
 include/linux/libata.h    |    1 +
 3 files changed, 21 insertions(+), 8 deletions(-)

Comments

Phillip Susi Dec. 4, 2014, 3:02 a.m. UTC | #1

On 12/03/2014 09:44 PM, Martin K. Petersen wrote:
> This patch whitelists SSDs from a few of the main vendors. None of
> the whitelists are based on written guarantees. They are purely
> based on empirical evidence collected from internal and external
> users that have tested or qualified these drives in RAID
> deployments.

Without anything in writing from the vendor, such a whitelist amounts
to nothing more than wishful thinking.  You can't really test for it,
and even if it *appears* to be so, the drive is free to change its
behavior at any time.
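(For reference, the empirical checks referred to later in the thread are
roughly of the shape sketched below: discard a range, read it back, and
see whether it comes back zeroed. The device path and range are
placeholders, the test is destructive, and a pass only demonstrates the
behaviour of that particular firmware at that particular moment, which is
exactly the objection being raised here.)

    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
            uint64_t range[2] = { 0, 1024 * 1024 };  /* offset, length in bytes */
            static unsigned char buf[1024 * 1024];
            int fd = open("/dev/sdX", O_RDWR);       /* hypothetical device */

            if (fd < 0 || ioctl(fd, BLKDISCARD, &range))
                    return perror("discard"), 1;
            /* A careful test would also bypass the page cache (O_DIRECT). */
            if (pread(fd, buf, sizeof(buf), range[0]) != (ssize_t)sizeof(buf))
                    return perror("read"), 1;
            for (size_t i = 0; i < sizeof(buf); i++)
                    if (buf[i])
                            return printf("stale data after TRIM\n"), 1;
            printf("range reads back as zeroes (for now)\n");
            return 0;
    }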



Martin K. Petersen Dec. 4, 2014, 3:24 a.m. UTC | #2
>>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes:

Phillip> On 12/03/2014 09:44 PM, Martin K. Petersen wrote:
>> This patch whitelists SSDs from a few of the main vendors. None of
>> the whitelists are based on written guarantees. They are purely based
>> on empirical evidence collected from internal and external users that
>> have tested or qualified these drives in RAID deployments.

Phillip> Without anything in writing from the vendor, such a white list
Phillip> amounts to nothing more than wishful thinking.  You can't
Phillip> really test for it and even if it *appears* to be so, the drive
Phillip> is free to change its behavior any time.

Well. It's a bit stronger than wishful thinking given that OEM product
requirements documents and supply contracts make it so. But it is not
something we can get in writing.

It is, however, a step up from the blanket whitelist we've had so far
where we assumed that it worked for every drive that had DRAT/RZAT set.
Phillip Susi Dec. 4, 2014, 3:28 a.m. UTC | #3

On 12/03/2014 10:24 PM, Martin K. Petersen wrote:
> Well. It's a bit stronger than wishful thinking given that OEM
> product requirements documents and supply contracts make it so. But
> it is not something we can get in writing.

OEM product requirements documents and supply contracts sound like
forms of writing to me.


Martin K. Petersen Dec. 4, 2014, 3:35 a.m. UTC | #4
>>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes:

Phillip> On 12/03/2014 10:24 PM, Martin K. Petersen wrote:
>> Well. It's a bit stronger than wishful thinking given that OEM
>> product requirements documents and supply contracts make it so. But
>> it is not something we can get in writing.

Phillip> OEM product requirements documents and supply contracts sound
Phillip> like forms of writing to me.

Except they are not forms of writing that we as a community have access
to.
Phillip Susi Dec. 4, 2014, 4:40 a.m. UTC | #5

On 12/03/2014 10:35 PM, Martin K. Petersen wrote:
> Phillip> OEM product requirements documents and supply contracts sound
> Phillip> like forms of writing to me.
> 
> Except they are not forms of writing that we as a community have
> access to.

So your assertion is that you have seen it in writing, so we should
all assume the drives will adhere to that, but the writing is private
and cannot be verified, and the manufacturers cannot be held to it if
they choose not to abide by it in the general case, but we should
assume that if they are bound by private contracts, they would
perform the same way with publicly sold models that claim to have
the same model and revision?

I'm not saying a hard hell no, but this certainly makes me
uncomfortable.  I'd much rather see the manufacturers put it in
writing that yes, this make and model will perform this way even
though it is not strictly required by ATA8.  What would be better
still is if the bloody ATA standard got a clue and said that if the
drive claims that it does in fact zero after TRIM, then the TRIM
command becomes mandatory instead of advisory.


Tejun Heo Dec. 4, 2014, 5:06 p.m. UTC | #6
Hello, Martin.

On Wed, Dec 03, 2014 at 09:44:46PM -0500, Martin K. Petersen wrote:
> 
> As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
> Zero After Trim) flags in the ATA Command Set are unreliable in the
> sense that they only define what happens if the device successfully
> executed the DSM TRIM command. TRIM is only advisory, however, and the
> device is free to silently ignore all or parts of the request.
> 
> In practice this renders the DRAT and RZAT flags completely useless and
> because the results are unpredictable we decided to disable discard in
> MD for 3.18 to avoid the risk of data corruption.
> 
> Hardware vendors in the real world obviously need better guarantees than
> what the standards bodies provide. Unfortunately those guarantees are
> encoded in product requirements documents rather than somewhere we can
> key off of them programmatically. So we are compelled to disable
> discard_zeroes_data for all devices unless we explicitly have data to
> support whitelisting them.
> 
> This patch whitelists SSDs from a few of the main vendors. None of the
> whitelists are based on written guarantees. They are purely based on
> empirical evidence collected from internal and external users that have
> tested or qualified these drives in RAID deployments.
> 
> The whitelist is only meant as a starting point and is by no means
> comprehensive:
> 
>    - All intel SSD models except for 510
>    - Micron M5*
>    - Samsung SSDs
>    - Seagate SSDs

Generally, I'm extremely skeptical about whitelists.  It's very
difficult to keep them meaningfully up-to-date and they often just end
up bit rotting after the initial flurry of updates, especially given how
little this particular feature would be visible to most users.

If there's something on the horizon which would solve the
identification problem and we only have to worry about the current
batch of devices, whitelisting can be useful but otherwise I'm not
sure this is a good idea.

If there currently is no way of properly indicating this feature,
let's please disable it unconditionally.  If this is absolutely
necessary in certain cases (is it?), we can add libata.force flags or
sysfs knobs to force-enable it for users who care enough about it and
know what they're doing.

Thanks.
Alan Cox Dec. 4, 2014, 9:49 p.m. UTC | #7
On Wed, 03 Dec 2014 22:35:59 -0500
"Martin K. Petersen" <martin.petersen@oracle.com> wrote:

> >>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes:
> 
> Phillip> On 12/03/2014 10:24 PM, Martin K. Petersen wrote:
> >> Well. It's a bit stronger than wishful thinking given that OEM
> >> product requirements documents and supply contracts make it so. But
> >> it is not something we can get in writing.
> 
> Phillip> OEM product requirements documents and supply contracts sound
> Phillip> like forms of writing to me.
> 
> Except they are not forms of writing that we as a community have access
> to.

That's true of many things we rely upon.

I assume it's something vaguely of the form

Vendor A has agreed with Oracle that drive identity X will do this and
meet some certification so Oracle can approve it for databases

Oracle has agreed to break vendor A's legs if it doesn't

Oracle would rather like that the kernel just knows about this drive as
being good, based on the above.

In which case IMHO it's good enough. "Do not p*ss off a major customer"
is probably more watertight than a lot of signed paper things 8)

Alan
Martin K. Petersen Dec. 5, 2014, 1:53 a.m. UTC | #8
>>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes:

Phillip> So your assertion is that you have seen it in writing, 

Actually I haven't. I have talked to a bunch of people that have done
RAID qualification on SSDs. I also gleaned information from a few vendor
lists of supported drive configurations.

The good news is that the practice of each OEM having their own custom
drive firmware build is not very common for SSDs. And most of the results
I looked at were collected using off-the-shelf drives.

Phillip> I'm not saying a hard hell no, but this certainly makes me
Phillip> uncomfortable.  

You do realize that my patch *restricts* the drives we enable
discard_zeroes_data on, right? Instead of relying solely on the drive's
own reporting we now also require empirical evidence that they do the
right thing.

Phillip> I'd much rather see the manufacturers put it in writing that
Phillip> yes, this make and model will perform this way even though it
Phillip> is not strictly required by ATA8.

Wishful thinking.

The sad reality is that standards are deliberately written to be
vague. And that any hard guarantees are part of product requirements
documents. That's not specific to SSDs in any way. That's true for
pretty much any piece of hardware.

Phillip> What would be better still is if the bloody ATA standard got a
Phillip> clue and said that if the drive claims that it does in fact
Phillip> zero after TRIM, that the TRIM command becomes mandatory
Phillip> instead of advisory.

I agree that the standards efforts in this department are a total train
wreck.
Martin K. Petersen Dec. 5, 2014, 2:13 a.m. UTC | #9
>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> Generally, I'm extremely skeptical about whitelists.

Me too. Unfortunately, in this case the blacklist is essentially
unbounded whereas the whitelist is small.

Tejun> If there's something on the horizon which would solve the
Tejun> identification problem and we only have to worry about the
Tejun> current batch of devices, whitelisting can be useful but
Tejun> otherwise I'm not sure this is a good idea.

There isn't :( The only saving grace is that SSDs are gravitating
towards NVMe and other non-ATA interfaces.

Tejun> It's very difficult to keep them meaningfully up-to-date and
Tejun> often just ends up bit rotting after the initial flurry of
Tejun> updates,

I'm well aware that this adds another truckload of ongoing pain to my
plate. Several of us discussed this issue at conferences this fall and
the consensus was that whitelisting is the only way to go about it.

I've already tightened up things in SCSI so we now prefer WRITE SAME
which does give hard guarantees unlike UNMAP. But because we use WRITE
SAME in the libata SATL it is imperative that we change our internal
flagging to be equally restrictive.

Tejun> If there currently is no way of properly indicating this feature,
Tejun> let's please disable it unconditionally.

Well, we recently disabled discard support in MD RAID5/6 to avoid
corruption. There currently is a manual override and that may be good
enough.

I just feel bad about disabling the feature for the many existing users
(and there are quite a few) that are using well-behaved drives in their
RAID deployments. And the filesystem folks have been begging for the
zeroout discard variant that I posted a few weeks ago. So the users are
there. I'm just trying to accommodate them the best I can given the lame
spec.
Martin K. Petersen Dec. 5, 2014, 2:46 a.m. UTC | #10
>>>>> "Alan" == One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> writes:

Alan> I assume it's something vaguely of the form

Alan> Vendor A has agreed with Oracle that drive identity X will do this
Alan> and meet some certification so Oracle can approve it for databases

Actually, we don't trust devices to zero anything for us :)

Alan> Oracle has agreed to break vendor A's legs if it doesn't

Alan> Oracle would rather like that the kernel just knows about this
Alan> drive as being good, based on the above.

We're generally not dealing with this kind of consumer/SMB hardware at
Oracle. So s/Oracle/$OTHER_OEM/, but yeah.

My interest in this is purely from the upstream discard lackey
perspective...
Tejun Heo Dec. 5, 2014, 2:51 p.m. UTC | #11
Hello,

On Thu, Dec 04, 2014 at 09:13:22PM -0500, Martin K. Petersen wrote:
> Tejun> If there's something on the horizon which would solve the
> Tejun> identification problem and we only have to worry about the
> Tejun> current batch of devices, whitelisting can be useful but
> Tejun> otherwise I'm not sure this is a good idea.
> 
> There isn't :( The only saving grace is that SSDs are gravitating
> towards NVMe and other non-ATA interfaces.

Yeap, that's true.  This whole thing is on the way out.

> I've already tightened up things in SCSI so we now prefer WRITE SAME
> which does give hard guarantees unlike UNMAP. But because we use WRITE
> SAME in the libata SATL it is imperative that we change our internal
> flagging to be equally restrictive.

Can you please elaborate on this in the changelog and add comments
explaining why ATA_HORKAGE_ZERO_AFTER_TRIM is necessary and how it's
used?

> I just feel bad about disabling the feature for the many existing users
> (and there are quite a few) that are using well-behaved drives in their
> RAID deployments. And the filesystem folks have been begging for the
> zeroout discard variant that I posted a few weeks ago. So the users are
> there. I'm just trying to accommodate them the best I can given the lame
> spec.

Can you please explain further the practical gains of using trims
which guarantee zeroing?

Thanks.
James Bottomley Dec. 10, 2014, 4:09 a.m. UTC | #12
On Fri, 2014-12-05 at 09:51 -0500, Tejun Heo wrote:
> > I just feel bad about disabling the feature for the many existing users
> > (and there are quite a few) that are using well-behaved drives in their
> > RAID deployments. And the filesystem folks have been begging for the
> > zeroout discard variant that I posted a few weeks ago. So the users are
> > there. I'm just trying to accommodate them the best I can given the lame
> > spec.
> 
> Can you please explain further the practical gains of using trims

It's for RAID devices.  We'd like to trim RAID devices simply by sending
the trim down to every component.  If we actually have to translate trim
to a write it would damage performance (and defeat the purpose of
helping the SSD FTL reuse blocks).  However, RAID requires that the
redundancy verification of the array components matches, otherwise a
verify of the RAID fails.  This means we have to have a guarantee of what
the verify read of a trimmed block of a RAID component will return.  So
for RAID-1, we just need both trimmed components to return the same data
(we don't actually care what it is, just that it be mirrored); for
RAID-5 we need zeros on every trimmed component, because zero^zero =
zero.

Conversely, drives that return random junk after a trim cause
verification failures, so we just elect not to transmit trim down to
them from the RAID layer.
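(To make the zero^zero point concrete, here is a toy verify in C over a
two-data-plus-parity stripe; the byte values are made up. If every
trimmed component reads back as zeroes the XOR check still holds, while
one component returning stale junk makes the verify fail.)

    #include <stdint.h>
    #include <stdio.h>

    /* Toy RAID-5 style verify: data0 ^ data1 ^ parity must be zero everywhere. */
    static int verify(const uint8_t *d0, const uint8_t *d1, const uint8_t *p, int n)
    {
            for (int i = 0; i < n; i++)
                    if (d0[i] ^ d1[i] ^ p[i])
                            return 0;
            return 1;
    }

    int main(void)
    {
            uint8_t zero[4] = { 0 };
            uint8_t junk[4] = { 0xde, 0xad, 0xbe, 0xef };

            /* Every component trimmed and reading back zeroes: verify passes. */
            printf("zero after trim: verify %s\n",
                   verify(zero, zero, zero, 4) ? "passes" : "fails");
            /* One component returns stale junk after TRIM: verify fails. */
            printf("junk after trim: verify %s\n",
                   verify(zero, junk, zero, 4) ? "passes" : "fails");
            return 0;
    }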

James


Tejun Heo Dec. 10, 2014, 2:29 p.m. UTC | #13
Hello,

On Wed, Dec 10, 2014 at 07:09:38AM +0300, James Bottomley wrote:
> Conversely, drives that return random junk after a trim cause
> verification failures, so we just elect not to transmit trim down to
> them from the RAID layer.

I see.  Thanks for the explanation.  I suppose the current raid
implementations' metadata handling on partially built devices isn't
developed enough to simply mark those trimmed blocks as unsynced?  If
raid consistency truly is the only reason for this, that approach
seems way more fruitful to me than playing this optional feature game
with hardware vendors which so often leads to eventual abysmal
outcomes.

Thanks.
Tim Small Dec. 10, 2014, 3:43 p.m. UTC | #14
On 10/12/14 04:09, James Bottomley wrote:

> RAID requires that the redundancy verification of the array 
> components matches otherwise a verify of the RAID fails.  This means
> we have to have a guarantee what the verify read of a trimmed block
> of a RAID component will return. So for RAID-1, we just need both
> trimmed components to return the same data (we don't actually care
> what it is, just that it be mirrored);


FWIW, unless I'm out-of-date, md RAID-1 and RAID-10 currently DON'T
guarantee that the data is always mirrored.  The md(4) man page from
http://git.neil.brown.name/?p=mdadm.git;a=blob;f=md.4 says:

> However on RAID1 and RAID10 it is possible for software issues to 
> cause a mismatch to be reported.  This does not necessarily mean
> that the data on the array is corrupted.  It could simply be that
> the system does not care what is stored on that part of the array -
> it is unused space.
> 
> The most likely cause for an unexpected mismatch on RAID1 or RAID10 
> occurs if a swap partition or swap file is stored on the array.
> 
> When the swap subsystem wants to write a page of memory out, it
> flags the page as 'clean' in the memory manager and requests the
> swap device to write it out.  It is quite possible that the memory
> will be changed while the write-out is happening.  In that case the
> 'clean' flag will be found to be clear when the write completes and
> so the swap subsystem will simply forget that the swapout had been 
> attempted, and will possibly choose a different page to write out.
> 
> If the swap device was on RAID1 (or RAID10), then the data is sent 
> from memory to a device twice (or more depending on the number of 
> devices in the array).  Thus it is possible that the memory gets 
> changed  between the times it is sent, so different data can be 
> written to the different devices in the array.  This will be
> detected by check as a mismatch.  However it does not reflect any
> corruption as the block where this mismatch occurs is being treated
> by the swap system as being empty, and the data will never be read
> from that block.
> 
> It is conceivable for a similar situation to occur on non-swap
> files, though it is less likely.



In my experience these inconsistencies are very common in data written
by certain applications (e.g. DBMS) on md RAID1 and RAID10 devices, and
effectively make md verify useless for them.  This is a shame -
particularly with things like FTLs where (at least anecdotally) data
scrambling seems to be a reasonably frequent occurrence.

So unreliable read-zero-on-TRIM isn't going to break the md RAID-1 /
RAID-10 use-case significantly, because verify hasn't worked on those
forever anyway.

Of course, a non-deterministic return-zero-on-trim will still cause
problems with md RAID5/6, and with other users such as dm RAID too I'd
guess.

Tim.
James Bottomley Dec. 10, 2014, 8:34 p.m. UTC | #15
On Wed, 2014-12-10 at 09:29 -0500, Tejun Heo wrote:
> Hello,
> 
> On Wed, Dec 10, 2014 at 07:09:38AM +0300, James Bottomley wrote:
> > Conversely, drives that return random junk after a trim cause
> > verification failures, so we just elect not to transmit trim down to
> > them from the RAID layer.
> 
> I see.  Thanks for the explanation.  I suppose the current raid
> implementations' metadata handling on partially built devices isn't
> developed enough to simply mark those trimmed blocks as unsynced?

We do have a log, which could be used for RAID-1 for this purpose, but
it doesn't seem to be much used in practice.  It takes extra space which
most admins don't account for.

For RAID-2+ or erasure codes this won't work because a bad block read
corrupts the stripe: the really subtle failure here is that you trim a
stripe and then partially write it; the RMW you do for parity will be an
incorrect partial syndrome because it's based on junk rather than the
syndrome of the other blocks and the corruption wouldn't be detected
until you get an actual disk failure (meaning everything will look fine
until that crucial day you need your data protection mechanism to work).
We could cope with this with an even more subtle logging mechanism,
where we only trim stripes and then log trimmed stripes and insist on
full instead of partial writes so we get back to a known syndrome, but
that's introducing a lot of subtlety into the logging code ...

James


Martin K. Petersen Dec. 10, 2014, 9:02 p.m. UTC | #16
>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> If raid consistency truly is the only reason for this, that
Tejun> approach seems way more fruitful to me than playing this optional
Tejun> feature game with hardware vendors which so often leads to
Tejun> eventual abysmal outcomes.

The other use case is the filesystem one where it is common to zero
block ranges for bitmaps, etc. In many workloads there is a
significant win to trimming over writing out many blocks of zeroes.
Ming Lei Dec. 12, 2014, 8:35 a.m. UTC | #17
On Thu, Dec 11, 2014 at 5:02 AM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
>
> Tejun> If raid consistency truly is the only reason for this, that
> Tejun> approach seems way more fruitful to me than playing this optional
> Tejun> feature game with hardware vendors which so often leads to
> Tejun> eventual abysmal outcomes.
>
> The other use case is the filesystem one where it is common to zero
> block ranges for bitmaps, etc. In many workloads there is a
> significant win to trimming over writing out many blocks of zeroes.

Yes, it looks like mkfs is using the feature.

Thanks,
Ming Lei
Tejun Heo Jan. 5, 2015, 4:28 p.m. UTC | #18
Hello, Martin.

On Wed, Dec 10, 2014 at 04:02:32PM -0500, Martin K. Petersen wrote:
> >>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
> 
> Tejun> If raid consistency truly is the only reason for this, that
> Tejun> approach seems way more fruitful to me than playing this optional
> Tejun> feature game with hardware vendors which so often leads to
> Tejun> eventual abysmal outcomes.
> 
> The other use case is the filesystem one where it is common to zero
> block ranges for bitmaps, etc. In many workloads there is a
> significant win to trimming over writing out many blocks of zeroes.

Isn't that kinda niche and specialized tho?  Besides, in this use
case, the kernel is essentially just serving as the whitelist and
nothing else.  Userland would have to issue TRIMs directly.  I'm not
against this patch being merged but this really looks like an
ultimately futile effort to me.

Thanks.
Martin K. Petersen Jan. 7, 2015, 12:05 a.m. UTC | #19
>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

>> The other use case is the filesystem one where it is common to zero
>> block ranges for bitmaps, etc. In many workloads there's is a
>> significant win to trimming over writing out many blocks of zeroes.

Tejun> Isn't that kinda niche and specialized tho?  

I don't think so. There are two reasons for zeroing block ranges:

 1) To ensure they contain zeroes on subsequent reads

 2) To preallocate them or anchor them down on thin provisioned devices

The filesystem folks have specifically asked to be able to make that
distinction. Hence the patch that changes blkdev_issue_zeroout().

You really don't want to write out gobs and gobs of zeroes and cause
unnecessary flash wear if all you care about is the blocks being in a
deterministic state.
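(A rough user-space illustration of that distinction, using the existing
BLKZEROOUT and BLKDISCARD block device ioctls; device path and range are
placeholders. The zeroout discard variant mentioned above is essentially
about letting the kernel satisfy case 1 with a TRIM when, and only when,
the device is known to reliably return zeroes.)

    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
            uint64_t range[2] = { 0, 1024 * 1024 };  /* offset, length in bytes */
            int fd = open("/dev/sdX", O_RDWR);       /* hypothetical device */

            if (fd < 0)
                    return perror("open"), 1;

            /* Case 1: the range must read back as zeroes afterwards. */
            if (ioctl(fd, BLKZEROOUT, &range))
                    perror("BLKZEROOUT");

            /* Plain discard: advisory deallocation; read-back contents are
             * undefined unless the device reliably zeroes after TRIM. */
            if (ioctl(fd, BLKDISCARD, &range))
                    perror("BLKDISCARD");

            close(fd);
            return 0;
    }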
Tejun Heo Jan. 7, 2015, 2:54 a.m. UTC | #20
Hello, Martin.

On Tue, Jan 06, 2015 at 07:05:40PM -0500, Martin K. Petersen wrote:
> Tejun> Isn't that kinda niche and specialized tho?  
> 
> I don't think so. There are two reasons for zeroing block ranges:
> 
>  1) To ensure they contain zeroes on subsequent reads
> 
>  2) To preallocate them or anchor them down on thin provisioned devices
> 
> The filesystem folks have specifically asked to be able to make that
> distinction. Hence the patch that changes blkdev_issue_zeroout().
> 
> You really don't want to write out gobs and gobs of zeroes and cause
> unnecessary flash wear if all you care about is the blocks being in a
> deterministic state.

I think I'm still missing something.  Are there enough cases where
filesystems want to write out zeroes during operation?  Earlier in the
thread, it was mentioned that this is currently mostly useful for
raids which need the blocks actually cleared for checksum consistency,
which basically means that raid metadata handling isn't (yet) capable
of just marking those (parts of) stripes as unused.  If a filesystem
wants to read back zeros from data blocks, wouldn't it be just marking
the matching index as such?  And if you take out the zeroing case,
trims are just trims and whether they return 0 afterwards or not is
irrelevant.

There sure can be use cases where zeroing fast and securely makes a
noticeable difference but the cases put forth till this point seem
relatively weak.  I mean, after all, requiring trim to zero the blocks
is essentially pushing down that amount of metadata management to the
device - the device would do exactly the same thing.  Pushing it down
the layers can definitely be beneficial especially when there's no
agreed-upon metadata on the medium (so, mkfs time), but it seems kinda
superfluous during normal operation.  What am I missing?

Thanks.
Dave Chinner Jan. 7, 2015, 4:15 a.m. UTC | #21
On Tue, Jan 06, 2015 at 09:54:55PM -0500, Tejun Heo wrote:
> Hello, Martin.
> 
> On Tue, Jan 06, 2015 at 07:05:40PM -0500, Martin K. Petersen wrote:
> > Tejun> Isn't that kinda niche and specialized tho?  
> > 
> > I don't think so. There are two reasons for zeroing block ranges:
> > 
> >  1) To ensure they contain zeroes on subsequent reads
> > 
> >  2) To preallocate them or anchor them down on thin provisioned devices
> > 
> > The filesystem folks have specifically asked to be able to make that
> > distinction. Hence the patch that changes blkdev_issue_zeroout().
> > 
> > You really don't want to write out gobs and gobs of zeroes and cause
> > unnecessary flash wear if all you care about is the blocks being in a
> > deterministic state.
> 
> I think I'm still missing something.  Are there enough cases where
> filesystems want to write out zeroes during operation?

IMO, yes.

w.r.t. thinp devices, we need to be able to guarantee that
preallocated regions in the filesystem are actually backed by real
blocks in the thinp device so we don't get ENOSPC from the thinp
device. No filesystems do this yet because we don't have a mechanism
for telling the lower layers "preallocate these blocks to zero".

The biggest issue is that we currently have no easy way to say
"these blocks need to contain zeros, but we aren't actually using
them yet". i.e. the filesystem code assumes that they contain zeros
(e.g. in ext4 inode tables because mkfs used to zero them) if they
haven't been used, so when it reads them it detects that
initialisation is needed because the blocks are empty....

FWIW, some filesystems need these regions to actually contain
zeros because they can't track unwritten extents (e.g.
gfs2).  Having sb_issue_zeroout() just do the right thing enables us
to efficiently zero the regions they are preallocating...
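(As a user-visible analogue of that guarantee, a minimal sketch using
fallocate(2) on an ordinary file; the filename is a placeholder. Whether
the allocation below actually pins real blocks in a thin provisioned
device underneath is precisely the guarantee that is still missing.)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("prealloc.tmp", O_CREAT | O_RDWR, 0600);

            if (fd < 0)
                    return perror("open"), 1;

            /* Preallocate 1 MiB; unwritten extents read back as zeroes. */
            if (fallocate(fd, 0, 0, 1024 * 1024))
                    perror("fallocate");

            /* Where supported, zero an existing range without streaming
             * zero-filled writes through the page cache. */
            if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, 1024 * 1024))
                    perror("FALLOC_FL_ZERO_RANGE");

            close(fd);
            return 0;
    }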

> Earlier in the
> thread, it was mentioned that this is currently mostly useful for
> raids which need the blocks actually cleared for checksum consistency,
> which basically means that raid metadata handling isn't (yet) capable
> of just marking those (parts of) stripes as unused.  If a filesystem
> wants to read back zeros from data blocks, wouldn't it be just marking
> the matching index as such?

Not all filesystems can do this for user data (see gfs2 case above)
and no Linux filesystem tracks whether free space contains zeros or
stale data. Hence if we want blocks to be zeroed on disk, we
currently have to write zeros to them and hence they get pinned in
devices as "used space" even though they may never get used again.

Cheers,

Dave.
Tejun Heo Jan. 7, 2015, 3:26 p.m. UTC | #22
Hello, Dave.

On Wed, Jan 07, 2015 at 03:15:37PM +1100, Dave Chinner wrote:
> w.r.t. thinp devices, we need to be able to guarantee that
> preallocated regions in the filesystem are actually backed by real
> blocks in the thinp device so we don't get ENOSPC from the thinp
> device. No filesystems do this yet because we don't have a mechanism
> for telling the lower layers "preallocate these blocks to zero".
> 
> The biggest issue is that we currently have no easy way to say
> "these blocks need to contain zeros, but we aren't actually using
> them yet". i.e. the filesystem code assumes that they contain zeros
> (e.g. in ext4 inode tables because mkfs used to zero them) if they
> haven't been used, so when it reads them it detects that
> initialisation is needed because the blocks are empty....
> 
> FWIW, some filesystems need these regions to actually contain
> zeros because they can't track unwritten extents (e.g.
> gfs2).  Having sb_issue_zeroout() just do the right thing enables us
> to efficiently zero the regions they are preallocating...
> 
> > Earlier in the
> > thread, it was mentioned that this is currently mostly useful for
> > raids which need the blocks actually cleared for checksum consistency,
> > which basically means that raid metadata handling isn't (yet) capable
> > of just marking those (parts of) stripes as unused.  If a filesystem
> > wants to read back zeros from data blocks, wouldn't it be just marking
> > the matching index as such?
> 
> Not all filesystems can do this for user data (see gfs2 case above)
> and no linux filesystem tracks whether free space contains zeros or
> stale data. Hence if we want blocks to be zeroed on disk, we
> currently have to write zeros to them and hence they get pinned in
> devices as "used space" even though they may never get used again.

Okay, I'll take it that this benefits generic enough use cases from the
filesystem POV.  As long as that's the case, it prolly has a
reasonable chance of actually being widely and properly implemented by
hw vendors.

It's easy to complain about mainstream hardware but IMHO the market is
actually pretty efficient at shooting down extra cruft which doesn't
really matter and only exists to increase the number of feature
checkboxes.  Hopefully, this has actual benefits and won't end up that
way.

Martin, do you have a newer version or shall I apply the original one?

Thanks.
Phillip Susi Jan. 8, 2015, 4:05 a.m. UTC | #23

On 01/06/2015 11:15 PM, Dave Chinner wrote:
> (e.g. in ext4 inode tables because mkfs used to zero them) if they 
> haven't been used, so when it reads them it detects that 
> initialisation is needed because the blocks are empty....

No, it knows that the inode table needs to be initialized because there
is a flag in the group descriptor that says this inode table is still
uninitialized.  It never reads the blocks to see if they are full of
zeros.  mke2fs sets the flag when it does not initialize the table
with zeros, either by direct writes (which it doesn't do if
lazy_itable_init is true, which it defaults to these days), or by
discarding the blocks when the device claims to support deterministic
discard that zeros.


Andreas Dilger Jan. 8, 2015, 4:58 a.m. UTC | #24
On Jan 7, 2015, at 9:05 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 01/06/2015 11:15 PM, Dave Chinner wrote:
>> (e.g. in ext4 inode tables because mkfs used to zero them) if they 
>> haven't been used, so when it reads them it detects that 
>> initialisation is needed because the blocks are empty....
> 
> No, it knows that the inode table needs initialized because there is a
> flag in the group descriptor that says this inode table is still
> uninitialized.  It never reads the blocks to see if they are full of
> zeros.  mke2fs sets the flag when it does not initialize the table
> with zeros, either by direct writes ( which it doesn't do if
> lazy_itable_init is true, which it defaults to these days ), or by
> discarding the blocks when the device claims to support deterministic
> discard that zeros.

That is only partially correct.  While it is true that mke2fs sets the
UNINIT flag at format time, the "lazy" part of that means there is a
kernel thread that still does the zeroing of the inode table blocks, but after
the filesystem is mounted, for any group that does not have the ZEROED
flag set.  After that point, the "UNINIT" flag is an optimization to
avoid reading the bitmap and unused blocks from disk during allocation.

This is needed in case the group descriptor or inode bitmap is corrupted,
and e2fsck needs to scan the inode table for in-use inodes.  We don't
want it to find old inodes from before the filesystem was formatted.

The ext4_init_inode_table() calls sb_issue_zeroout->blkdev_issue_zeroout(),
so if the underlying storage supported deterministic zeroing, this could
be handled very efficiently.

Cheers, Andreas





Phillip Susi Jan. 8, 2015, 2:09 p.m. UTC | #25

On 1/7/2015 11:58 PM, Andreas Dilger wrote:
>> No, it knows that the inode table needs initialized because there
>> is a flag in the group descriptor that says this inode table is
>> still uninitialized.  It never reads the blocks to see if they are
>> full of zeros.  mke2fs sets the flag when it does not initialize
>> the table with zeros, either by direct writes ( which it doesn't
>> do if lazy_itable_init is true, which it defaults to these days
>> ), or by discarding the blocks when the device claims to support
>> deterministic discard that zeros.
> 
> That is only partially correct.  While it is true that mke2fs sets
> the UNINIT flag at format time, the "lazy" part of that means there
> is a kernel thread that still does the zeroing of the inode table
> blocks, but after the filesystem is mounted, for any group that
> does not have the ZEROED flag set.  After that point, the "UNINIT"
> flag is an optimization to avoid reading the bitmap and unused
> blocks from disk during allocation.

That is pretty much what I said, except that I was pointing out that
it does not *read* first to see if the disk is already zeroed, as that
would be a waste of time.  It just writes out the zeros for block
groups that still have the uninit flag set.

> This is needed in case the group descriptor or inode bitmap is
> corrupted, and e2fsck needs to scan the inode table for in-use
> inodes.  We don't want it to find old inodes from before the
> filesystem was formatted.
> 
> The ext4_init_inode_table() calls
> sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying
> storage supported deterministic zeroing of the underlying storage,
> this could be handled very efficiently.

Again, that's pretty much what I said.


Martin K. Petersen Jan. 8, 2015, 2:29 p.m. UTC | #26
>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:

Tejun> do you have a newer version or shall I apply the original one?

Just sent an updated patch that catches the latest Micron/Crucial drives
as well.
Andreas Dilger Jan. 8, 2015, 10:31 p.m. UTC | #27
On Jan 8, 2015, at 7:09 AM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 1/7/2015 11:58 PM, Andreas Dilger wrote:
>>> No, it knows that the inode table needs initialized because there
>>> is a flag in the group descriptor that says this inode table is
>>> still uninitialized.  It never reads the blocks to see if they are
>>> full of zeros.  mke2fs sets the flag when it does not initialize
>>> the table with zeros, either by direct writes ( which it doesn't
>>> do if lazy_itable_init is true, which it defaults to these days
>>> ), or by discarding the blocks when the device claims to support
>>> deterministic discard that zeros.
>> 
>> That is only partially correct.  While it is true that mke2fs sets
>> the UNINIT flag at format time, the "lazy" part of that means there
>> is a kernel thread that still does the zeroing of the inode table
>> blocks, but after the filesystem is mounted, for any group that
>> does not have the ZEROED flag set.  After that point, the "UNINIT"
>> flag is an optimization to avoid reading the bitmap and unused
>> blocks from disk during allocation.
> 
> That is pretty much what I said, except that I was pointing out that
> it does not *read* first to see if the disk is already zeroed, as that
> would be a waste of time.  It just writes out the zeros for block
> groups that still have the uninit flag set.

Sorry, I didn't get that from my reading, so I thought I'd clarify.
I'd actually proposed that the ext4_init_inode_table() thread start
by reading the itable blocks first, checking them for zeroes, and only
switching over to writing if it finds any non-zero data in the blocks.

I think that would be a net win in some cases, and only a tiny bit
of overhead (a single read) if it turns out to be wrong.
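(The check that proposal hinges on is trivial; a sketch with a made-up
helper name, which a caller would run on each itable block read back
before deciding whether any zeroing is needed at all.)

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical helper: true if the buffer already contains only zeroes,
     * in which case the zero-out write for that block can be skipped. */
    static bool buf_is_zeroed(const unsigned char *buf, size_t len)
    {
            for (size_t i = 0; i < len; i++)
                    if (buf[i])
                            return false;
            return true;
    }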

>> This is needed in case the group descriptor or inode bitmap is
>> corrupted, and e2fsck needs to scan the inode table for in-use
>> inodes.  We don't want it to find old inodes from before the
>> filesystem was formatted.
>> 
>> The ext4_init_inode_table() calls
>> sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying
>> storage supported deterministic zeroing of the underlying storage,
>> this could be handled very efficiently.
> 
> Again, that's pretty much what I said.


Cheers, Andreas






Patch

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c5ba15af87d3..f41f24a8bc21 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4225,10 +4225,20 @@  static const struct ata_blacklist_entry ata_device_blacklist [] = {
 	{ "PIONEER DVD-RW  DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
 
 	/* devices that don't properly handle queued TRIM commands */
-	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+	{ "Micron_M5?0*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
+						ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Crucial_CT???M5?0SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+
+	/*
+	 * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
+	 * SSDs that provide reliable zero after TRIM.
+	 */
+	{ "INTEL*SSDSC2MH*",		NULL,	0, }, /* Blacklist intel 510 */
+	{ "INTEL*SSD*", 		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
 
 	/*
 	 * Some WD SATA-I drives spin up and down erratically when the link
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 0586f66d70fa..a19479282b65 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2515,13 +2515,15 @@  static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		rbuf[15] = lowest_aligned;
 
 		if (ata_id_has_trim(args->id)) {
-			rbuf[14] |= 0x80; /* TPE */
+			rbuf[14] |= 0x80; /* LBPME */
 
-			if (ata_id_has_zero_after_trim(args->id))
-				rbuf[14] |= 0x40; /* TPRZ */
+			if (ata_id_has_zero_after_trim(args->id) &&
+			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+				ata_dev_info(dev, "Enabling discard_zeroes_data\n");
+				rbuf[14] |= 0x40; /* LBPRZ */
+			}
 		}
 	}
-
 	return 0;
 }
 
diff --git a/include/linux/libata.h b/include/linux/libata.h
index bd5fefeaf548..45ac825b8366 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -421,6 +421,7 @@  enum {
 	ATA_HORKAGE_NO_NCQ_TRIM	= (1 << 19),	/* don't use queued TRIM */
 	ATA_HORKAGE_NOLPM	= (1 << 20),	/* don't use LPM */
 	ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21),	/* some WDs have broken LPM */
+	ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
 
 	 /* DMA mask for user DMA control: User visible values; DO NOT
 	    renumber */
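
(For reference on how entries such as "ST[1248][0248]0[FH]*" are
interpreted: libata compares the reported model string against these
patterns with its small glob matcher, which supports '*', '?' and
character classes. A user-space approximation with fnmatch(3) behaves the
same way for patterns like these; the model strings below are invented
examples, not drives from the whitelist.)

    #include <fnmatch.h>
    #include <stdio.h>

    int main(void)
    {
            const char *pat = "ST[1248][0248]0[FH]*";
            const char *models[] = { "ST480FN0021", "ST9500620NS" };

            for (int i = 0; i < 2; i++)
                    printf("%-12s %s\n", models[i],
                           fnmatch(pat, models[i], 0) == 0 ? "matches" : "no match");
            return 0;
    }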