Message ID | yq18uiojfq9.fsf@sermon.lab.mkp.net
---|---
State | Not Applicable
Delegated to: | David Miller
On 12/03/2014 09:44 PM, Martin K. Petersen wrote:
> This patch whitelists SSDs from a few of the main vendors. None of
> the whitelists are based on written guarantees. They are purely
> based on empirical evidence collected from internal and external
> users that have tested or qualified these drives in RAID
> deployments.

Without anything in writing from the vendor, such a white list amounts
to nothing more than wishful thinking. You can't really test for it and
even if it *appears* to be so, the drive is free to change its behavior
any time.
>>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes: Phillip> On 12/03/2014 09:44 PM, Martin K. Petersen wrote: >> This patch whitelists SSDs from a few of the main vendors. None of >> the whitelists are based on written guarantees. They are purely based >> on empirical evidence collected from internal and external users that >> have tested or qualified these drives in RAID deployments. Phillip> Without anything in writing from the vendor, such a white list Phillip> amounts to nothing more than wishful thinking. You can't Phillip> really test for it and even if it *appears* to be so, the drive Phillip> is free to change its behavior any time. Well. It's a bit stronger than wishful thinking given that OEM product requirements documents and supply contracts make it so. But it is not something we can get in writing. It is, however, a step up from the blanket whitelist we've had so far where we assumed that it worked for every drive that had DRAT/RZAT set.
On 12/03/2014 10:24 PM, Martin K. Petersen wrote:
> Well. It's a bit stronger than wishful thinking given that OEM
> product requirements documents and supply contracts make it so. But
> it is not something we can get in writing.

OEM product requirements documents and supply contracts sound like
forms of writing to me.
>>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes: Phillip> On 12/03/2014 10:24 PM, Martin K. Petersen wrote: >> Well. It's a bit stronger than wishful thinking given that OEM >> product requirements documents and supply contracts make it so. But >> it is not something we can get in writing. Phillip> OEM product requirements documents and supply contracts sound Phillip> like forms of writing to me. Except they are not forms of writing that we as a community have access to.
On 12/03/2014 10:35 PM, Martin K. Petersen wrote:
> Phillip> OEM product requirements documents and supply contracts sound
> Phillip> like forms of writing to me.
>
> Except they are not forms of writing that we as a community have
> access to.

So your assertion is that you have seen it in writing, and we should all
assume the drives will adhere to that, even though the writing is
private, cannot be verified, and the manufacturers cannot be held to it
if they choose not to abide by it in the general case? And we should
assume that, because they are bound by private contracts, they will
perform the same way with publicly sold units that claim the same model
and revision?

I'm not saying a hard hell no, but this certainly makes me
uncomfortable. I'd much rather see the manufacturers put it in writing
that yes, this make and model will perform this way even though it is
not strictly required by ATA8.

What would be better still is if the bloody ATA standard got a clue and
said that if the drive claims that it does in fact zero after TRIM, then
the TRIM command becomes mandatory instead of advisory.
Hello, Martin.

On Wed, Dec 03, 2014 at 09:44:46PM -0500, Martin K. Petersen wrote:
> As defined, the DRAT (Deterministic Read After Trim) and RZAT (Return
> Zero After Trim) flags in the ATA Command Set are unreliable in the
> sense that they only define what happens if the device successfully
> executed the DSM TRIM command. TRIM is only advisory, however, and the
> device is free to silently ignore all or parts of the request.
>
> In practice this renders the DRAT and RZAT flags completely useless and
> because the results are unpredictable we decided to disable discard in
> MD for 3.18 to avoid the risk of data corruption.
>
> Hardware vendors in the real world obviously need better guarantees than
> what the standards bodies provide. Unfortunately those guarantees are
> encoded in product requirements documents rather than somewhere we can
> key off of them programmatically. So we are compelled to disable
> discard_zeroes_data for all devices unless we explicitly have data to
> support whitelisting them.
>
> This patch whitelists SSDs from a few of the main vendors. None of the
> whitelists are based on written guarantees. They are purely based on
> empirical evidence collected from internal and external users that have
> tested or qualified these drives in RAID deployments.
>
> The whitelist is only meant as a starting point and is by no means
> comprehensive:
>
>  - All Intel SSD models except for 510
>  - Micron M5*
>  - Samsung SSDs
>  - Seagate SSDs

Generally, I'm extremely skeptical about whitelists. It's very difficult
to keep them meaningfully up to date and they often just end up bit
rotting after the initial flurry of updates, especially given how little
this particular feature is visible to most users.

If there's something on the horizon which would solve the identification
problem and we only have to worry about the current batch of devices,
whitelisting can be useful, but otherwise I'm not sure this is a good
idea. If there currently is no way of properly indicating this feature,
let's please disable it unconditionally. If this is absolutely necessary
in certain cases (is it?), we can add libata.force flags or sysfs knobs
to force-enable it for users who care enough about it and know what
they're doing.

Thanks.
On Wed, 03 Dec 2014 22:35:59 -0500
"Martin K. Petersen" <martin.petersen@oracle.com> wrote:

> >>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes:
>
> Phillip> On 12/03/2014 10:24 PM, Martin K. Petersen wrote:
> >> Well. It's a bit stronger than wishful thinking given that OEM
> >> product requirements documents and supply contracts make it so. But
> >> it is not something we can get in writing.
>
> Phillip> OEM product requirements documents and supply contracts sound
> Phillip> like forms of writing to me.
>
> Except they are not forms of writing that we as a community have access
> to.

That's true of many things we rely upon. I assume it's something vaguely
of the form:

  Vendor A has agreed with Oracle that drive identity X will do this and
  meet some certification so Oracle can approve it for databases.

  Oracle has agreed to break vendor A's legs if it doesn't.

  Oracle would rather like that the kernel just knows about this drive
  as being good, based on the above.

In which case IMHO it's good enough. "Do not p*ss off a major customer"
is probably more watertight than a lot of signed paper things 8)

Alan
>>>>> "Phillip" == Phillip Susi <psusi@ubuntu.com> writes:
Phillip> So your assertion is that you have seen it in writing,
Actually, I haven't. I have talked to a bunch of people that have done
RAID qualification on SSDs. I have also gleaned a few vendor lists of
supported drive configurations.
The good news is that the practice of each OEM having their own custom
drive firmware build is not very common for SSDs. And most of the
results I looked at were collected using off-the-shelf drives.
Phillip> I'm not saying a hard hell no, but this certainly makes me
Phillip> uncomfortable.
You do realize that my patch *restricts* the drives we enable
discard_zeroes_data on, right? Instead of relying solely on the drive's
own reporting we now also require empirical evidence that they do the
right thing.
Phillip> I'd much rather see the manufacturers put it in writing that
Phillip> yes, this make and model will perform this way even though it
Phillip> is not strictly required by ATA8.
Wishful thinking.
The sad reality is that standards are deliberately written to be
vague. And that any hard guarantees are part of product requirements
documents. That's not specific to SSDs in any way. That's true for
pretty much any piece of hardware.
Phillip> What would be better still is if the bloody ATA standard got a
Phillip> clue and said that if the drive claims that it does in fact
Phillip> zero after TRIM, that the TRIM command becomes mandatory
Phillip> instead of advisory.
I agree that the standards efforts in this department are a total train
wreck.
>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
Tejun> Generally, I'm extremely skeptical about whitelists.
Me too. Unfortunately, in this case the blacklist is essentially
unbounded whereas the whitelist is small.
Tejun> If there's something on the horizon which would solve the
Tejun> identification problem and we only have to worry about the
Tejun> current batch of devices, whitelisting can be useful but
Tejun> otherwise I'm not sure this is a good idea.
There isn't :( The only saving grace is that SSDs are gravitating
towards NVMe and other non-ATA interfaces.
Tejun> It's very difficult to keep them meaningfully up-to-date and
Tejun> often just ends up bit rotting after the initial flurry of
Tejun> updates,
I'm well aware that this adds another truckload of ongoing pain to my
plate. Several of us discussed this issue at conferences this fall and
the consensus was that whitelisting is the only way to go about it.
I've already tightened up things in SCSI so we now prefer WRITE SAME
which does give hard guarantees unlike UNMAP. But because we use WRITE
SAME in the libata SATL it is imperative that we change our internal
flagging to be equally restrictive.
Tejun> If there currently is no way of properly indicating this feature,
Tejun> let's please disable it unconditionally.
Well, we recently disabled discard support in MD RAID5/6 to avoid
corruption. There currently is a manual override and that may be good
enough.
I just feel bad about disabling the feature for the many existing users
(and there are quite a few) that are using well-behaved drives in their
RAID deployments. And the filesystem folks have been begging for the
zeroout discard variant that I posted a few weeks ago. So the users are
there. I'm just trying to accommodate them the best I can given the lame
spec.
>>>>> "Alan" == One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> writes:
Alan> I assume it's something vaguely of the form
Alan> Vendor A has agreed with Oracle that drive identity X will do this
Alan> and meet some certification so Oracle can approve it for databases
Actually, we don't trust devices to zero anything for us :)
Alan> Oracle has agreed to break vendor A's legs if it doesn't
Alan> Oracle would rather like that the kernel just knows about this
Alan> drive as being good, based on the above.
We're generally not dealing with this kind of consumer/SMB hardware at
Oracle. So s/Oracle/$OTHER_OEM/, but yeah.
My interest in this is purely from the upstream discard lackey
perspective...
Hello,

On Thu, Dec 04, 2014 at 09:13:22PM -0500, Martin K. Petersen wrote:
> Tejun> If there's something on the horizon which would solve the
> Tejun> identification problem and we only have to worry about the
> Tejun> current batch of devices, whitelisting can be useful but
> Tejun> otherwise I'm not sure this is a good idea.
>
> There isn't :( The only saving grace is that SSDs are gravitating
> towards NVMe and other non-ATA interfaces.

Yeap, that's true. This whole thing is on the way out.

> I've already tightened up things in SCSI so we now prefer WRITE SAME
> which does give hard guarantees unlike UNMAP. But because we use WRITE
> SAME in the libata SATL it is imperative that we change our internal
> flagging to be equally restrictive.

Can you please elaborate on this in the changelog and add comments
explaining why ATA_HORKAGE_ZERO_AFTER_TRIM is necessary and how it's
used?

> I just feel bad about disabling the feature for the many existing users
> (and there are quite a few) that are using well-behaved drives in their
> RAID deployments. And the filesystem folks have been begging for the
> zeroout discard variant that I posted a few weeks ago. So the users are
> there. I'm just trying to accommodate them the best I can given the lame
> spec.

Can you please explain further the practical gains of using trims which
guarantee zeroing?

Thanks.
On Fri, 2014-12-05 at 09:51 -0500, Tejun Heo wrote:
> > I just feel bad about disabling the feature for the many existing users
> > (and there are quite a few) that are using well-behaved drives in their
> > RAID deployments. And the filesystem folks have been begging for the
> > zeroout discard variant that I posted a few weeks ago. So the users are
> > there. I'm just trying to accommodate them the best I can given the lame
> > spec.
>
> Can you please explain further the practical gains of using trims

It's for RAID devices. We'd like to trim RAID devices simply by sending
the trim down to every component. If we actually have to translate trim
to a write it would damage performance (and defeat the purpose of
helping the SSD FTL reuse blocks).

However, RAID requires that the redundancy verification of the array
components matches, otherwise a verify of the RAID fails. This means we
have to have a guarantee about what the verify read of a trimmed block
of a RAID component will return. So for RAID-1, we just need both
trimmed components to return the same data (we don't actually care what
it is, just that it be mirrored); for RAID-5 we need zeros on every
trimmed component, because zero ^ zero = zero.

Conversely, drives that return random junk after a trim cause
verification failures, so we just elect not to transmit trim down to
them from the RAID layer.

James
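To make the zero ^ zero = zero argument concrete, here is a small
stand-alone sketch (user-space C, purely illustrative rather than md
code) of why a RAID-5 style verify stays consistent when every trimmed
member reads back zeroes, and fails when a member returns stale junk:

/*
 * Toy illustration of the RAID-5 consistency argument: the parity of a
 * stripe is the XOR of its data blocks, so a stripe whose trimmed
 * members all read back as zeroes still verifies, while a member that
 * returns arbitrary junk after TRIM does not.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 8	/* tiny block size for the demo */

static void xor_parity(const uint8_t d0[BLOCK], const uint8_t d1[BLOCK],
		       uint8_t parity[BLOCK])
{
	for (int i = 0; i < BLOCK; i++)
		parity[i] = d0[i] ^ d1[i];
}

static int stripe_verifies(const uint8_t d0[BLOCK], const uint8_t d1[BLOCK],
			   const uint8_t parity[BLOCK])
{
	uint8_t expect[BLOCK];

	xor_parity(d0, d1, expect);
	return memcmp(expect, parity, BLOCK) == 0;
}

int main(void)
{
	uint8_t zero[BLOCK] = { 0 };
	uint8_t junk[BLOCK] = { 0xde, 0xad, 0xbe, 0xef, 0x01, 0x02, 0x03, 0x04 };

	/* Zero after TRIM: data and parity members all read zeroes. */
	printf("zero-after-trim stripe verifies: %d\n",
	       stripe_verifies(zero, zero, zero));	/* 1: 0 ^ 0 == 0 */

	/* Non-deterministic TRIM: one member returns stale junk. */
	printf("junk-after-trim stripe verifies: %d\n",
	       stripe_verifies(zero, junk, zero));	/* 0: parity mismatch */

	return 0;
}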
Hello,

On Wed, Dec 10, 2014 at 07:09:38AM +0300, James Bottomley wrote:
> Conversely, drives that return random junk after a trim cause
> verification failures, so we just elect not to transmit trim down to
> them from the RAID layer.

I see. Thanks for the explanation. I suppose the current raid
implementations' metadata handling on partially built devices isn't
developed enough to simply mark those trimmed blocks as unsynced?

If raid consistency truly is the only reason for this, that approach
seems way more fruitful to me than playing this optional feature game
with hardware vendors which so often leads to eventual abysmal outcomes.

Thanks.
On 10/12/14 04:09, James Bottomley wrote:
> RAID requires that the redundancy verification of the array
> components matches, otherwise a verify of the RAID fails. This means
> we have to have a guarantee about what the verify read of a trimmed
> block of a RAID component will return. So for RAID-1, we just need
> both trimmed components to return the same data (we don't actually
> care what it is, just that it be mirrored);

FWIW, unless I'm out of date, md RAID-1 and RAID-10 currently DON'T
guarantee that the data is always mirrored. The md(4) man page from
http://git.neil.brown.name/?p=mdadm.git;a=blob;f=md.4 says:

> However on RAID1 and RAID10 it is possible for software issues to
> cause a mismatch to be reported. This does not necessarily mean
> that the data on the array is corrupted. It could simply be that
> the system does not care what is stored on that part of the array -
> it is unused space.
>
> The most likely cause for an unexpected mismatch on RAID1 or RAID10
> occurs if a swap partition or swap file is stored on the array.
>
> When the swap subsystem wants to write a page of memory out, it
> flags the page as 'clean' in the memory manager and requests the
> swap device to write it out. It is quite possible that the memory
> will be changed while the write-out is happening. In that case the
> 'clean' flag will be found to be clear when the write completes and
> so the swap subsystem will simply forget that the swapout had been
> attempted, and will possibly choose a different page to write out.
>
> If the swap device was on RAID1 (or RAID10), then the data is sent
> from memory to a device twice (or more depending on the number of
> devices in the array). Thus it is possible that the memory gets
> changed between the times it is sent, so different data can be
> written to the different devices in the array. This will be
> detected by check as a mismatch. However it does not reflect any
> corruption as the block where this mismatch occurs is being treated
> by the swap system as being empty, and the data will never be read
> from that block.
>
> It is conceivable for a similar situation to occur on non-swap
> files, though it is less likely.

In my experience these inconsistencies are very common in data written
by certain applications (e.g. DBMS) on md RAID1 and RAID10 devices, and
they effectively make md verify useless for those arrays. This is a
shame, particularly with things like FTLs where (at least anecdotally)
data scrambling seems to be a reasonably frequent occurrence.

So unreliable read-zero-on-TRIM isn't going to break the md RAID-1 /
RAID-10 use case significantly, because verify hasn't worked on those
forever anyway. Of course, non-deterministic reads after TRIM will still
cause problems with md RAID5/6, and with other users such as dm RAID
too, I'd guess.

Tim.
On Wed, 2014-12-10 at 09:29 -0500, Tejun Heo wrote:
> Hello,
>
> On Wed, Dec 10, 2014 at 07:09:38AM +0300, James Bottomley wrote:
> > Conversely, drives that return random junk after a trim cause
> > verification failures, so we just elect not to transmit trim down to
> > them from the RAID layer.
>
> I see. Thanks for the explanation. I suppose the current raid
> implementations' metadata handling on partially built devices isn't
> developed enough to simply mark those trimmed blocks as unsynced?

We do have a log, which could be used for RAID-1 for this purpose, but
it doesn't seem to be much used in practice. It takes extra space which
most admins don't account for.

For RAID-2+ or erasure codes this won't work because a bad block read
corrupts the stripe. The really subtle failure here is that you trim a
stripe and then partially write it: the RMW you do for parity will
produce an incorrect partial syndrome because it's based on junk rather
than on the syndrome of the other blocks, and the corruption wouldn't be
detected until you get an actual disk failure (meaning everything will
look fine until that crucial day you need your data protection mechanism
to work).

We could cope with this with an even more subtle logging mechanism,
where we only trim whole stripes, log the trimmed stripes, and insist on
full instead of partial writes so we get back to a known syndrome, but
that's introducing a lot of subtlety into the logging code ...

James
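The partial-write hazard James describes can also be sketched in a few
lines. The example below (again user-space C, not md code) shows a
read-modify-write parity update going wrong when the "old data" read
back from a trimmed block is junk rather than the zeroes the parity was
computed over; the damage only surfaces when the stripe has to be
reconstructed:

/*
 * Toy illustration of the RMW hazard: parity' = parity ^ old_data ^
 * new_data assumes old_data is what the parity was computed over. If a
 * trimmed block returns junk instead, the updated parity is silently
 * wrong and the error only shows up at reconstruction time.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* Stripe with two data blocks (1 byte each for the demo) and parity. */
	uint8_t d0 = 0x00, d1 = 0x00;	/* both trimmed, assumed zero */
	uint8_t parity = d0 ^ d1;	/* consistent parity: 0x00    */

	/* RMW update of d0: parity' = parity ^ old_d0 ^ new_d0 */
	uint8_t new_d0 = 0x5a;
	uint8_t old_d0_read = 0x00;	/* zero-after-trim drive      */
	uint8_t junk_read = 0xc3;	/* junk-after-trim drive      */

	uint8_t good_parity = parity ^ old_d0_read ^ new_d0;
	uint8_t bad_parity  = parity ^ junk_read ^ new_d0;

	/* Reconstruction of d1 from d0 and parity after a disk failure: */
	printf("d1 recovered with good parity: 0x%02x (expected 0x%02x)\n",
	       (uint8_t)(new_d0 ^ good_parity), d1);
	printf("d1 recovered with bad parity:  0x%02x (expected 0x%02x)\n",
	       (uint8_t)(new_d0 ^ bad_parity), d1);
	return 0;
}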
>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
Tejun> If raid consistency truly is the only reason for this, that
Tejun> approach seems way more fruitful to me than playing this optional
Tejun> feature game with hardware vendors which so often leads to
Tejun> eventual abysmal outcomes.
The other use case is the filesystem one where it is common to zero
block ranges for bitmaps, etc. In many workloads there is a
significant win to trimming over writing out many blocks of zeroes.
On Thu, Dec 11, 2014 at 5:02 AM, Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
>
> Tejun> If raid consistency truly is the only reason for this, that
> Tejun> approach seems way more fruitful to me than playing this optional
> Tejun> feature game with hardware vendors which so often leads to
> Tejun> eventual abysmal outcomes.
>
> The other use case is the filesystem one where it is common to zero
> block ranges for bitmaps, etc. In many workloads there is a
> significant win to trimming over writing out many blocks of zeroes.

Yes, it looks like mkfs is using the feature.

Thanks,
Ming Lei
Hello, Martin.

On Wed, Dec 10, 2014 at 04:02:32PM -0500, Martin K. Petersen wrote:
> >>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
>
> Tejun> If raid consistency truly is the only reason for this, that
> Tejun> approach seems way more fruitful to me than playing this optional
> Tejun> feature game with hardware vendors which so often leads to
> Tejun> eventual abysmal outcomes.
>
> The other use case is the filesystem one where it is common to zero
> block ranges for bitmaps, etc. In many workloads there is a
> significant win to trimming over writing out many blocks of zeroes.

Isn't that kinda niche and specialized tho? Besides, in this use case,
the kernel is essentially just serving as the whitelist and nothing
else. Userland would have to issue TRIMs directly.

I'm not against this patch being merged but this really looks like an
ultimately futile effort to me.

Thanks.
>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes: >> The other use case is the filesystem one where it is common to zero >> block ranges for bitmaps, etc. In many workloads there's is a >> significant win to trimming over writing out many blocks of zeroes. Tejun> Isn't that kinda niche and specialized tho? I don't think so. There are two reasons for zeroing block ranges: 1) To ensure they contain zeroes on subsequent reads 2) To preallocate them or anchor them down on thin provisioned devices The filesystem folks have specifically asked to be able to make that distinction. Hence the patch that changes blkdev_issue_zeroout(). You really don't want to write out gobs and gobs of zeroes and cause unnecessary flash wear if all you care about is the blocks being in a deterministic state.
Hello, Martin.

On Tue, Jan 06, 2015 at 07:05:40PM -0500, Martin K. Petersen wrote:
> Tejun> Isn't that kinda niche and specialized tho?
>
> I don't think so. There are two reasons for zeroing block ranges:
>
>  1) To ensure they contain zeroes on subsequent reads
>
>  2) To preallocate them or anchor them down on thin provisioned devices
>
> The filesystem folks have specifically asked to be able to make that
> distinction. Hence the patch that changes blkdev_issue_zeroout().
>
> You really don't want to write out gobs and gobs of zeroes and cause
> unnecessary flash wear if all you care about is the blocks being in a
> deterministic state.

I think I'm still missing something. Are there enough cases where
filesystems want to write out zeroes during operation? Earlier in the
thread, it was mentioned that this is currently mostly useful for raids
which need the blocks actually cleared for checksum consistency, which
basically means that raid metadata handling isn't (yet) capable of just
marking those (parts of) stripes as unused. If a filesystem wants to
read back zeros from data blocks, wouldn't it be just marking the
matching index as such? And if you take out the zeroing case, trims are
just trims and whether they return 0 afterwards or not is irrelevant.

There sure can be use cases where zeroing fast and securely makes a
noticeable difference, but the cases put forth till this point seem
relatively weak. I mean, after all, requiring trim to zero the blocks is
essentially pushing down that amount of metadata management to the
device - the device would do exactly the same thing. Pushing it down the
layers can definitely be beneficial, especially when there's no
agreed-upon metadata on the medium (so, mkfs time), but it seems kinda
superfluous during normal operation.

What am I missing?

Thanks.
On Tue, Jan 06, 2015 at 09:54:55PM -0500, Tejun Heo wrote:
> Hello, Martin.
>
> On Tue, Jan 06, 2015 at 07:05:40PM -0500, Martin K. Petersen wrote:
> > Tejun> Isn't that kinda niche and specialized tho?
> >
> > I don't think so. There are two reasons for zeroing block ranges:
> >
> >  1) To ensure they contain zeroes on subsequent reads
> >
> >  2) To preallocate them or anchor them down on thin provisioned devices
> >
> > The filesystem folks have specifically asked to be able to make that
> > distinction. Hence the patch that changes blkdev_issue_zeroout().
> >
> > You really don't want to write out gobs and gobs of zeroes and cause
> > unnecessary flash wear if all you care about is the blocks being in a
> > deterministic state.
>
> I think I'm still missing something. Are there enough cases where
> filesystems want to write out zeroes during operation?

IMO, yes.

w.r.t. thinp devices, we need to be able to guarantee that preallocated
regions in the filesystem are actually backed by real blocks in the
thinp device so we don't get ENOSPC from the thinp device. No
filesystems do this yet because we don't have a mechanism for telling
the lower layers "preallocate these blocks to zero".

The biggest issue is that we currently have no easy way to say "these
blocks need to contain zeros, but we aren't actually using them yet".
i.e. the filesystem code assumes that they contain zeros (e.g. in ext4
inode tables because mkfs used to zero them) if they haven't been used,
so when it reads them it detects that initialisation is needed because
the blocks are empty....

FWIW, some filesystems need these regions to actually contain zeros
because they can't track unwritten extents (e.g. gfs2). Having
sb_issue_zeroout() just do the right thing enables us to efficiently
zero the regions they are preallocating...

> Earlier in the
> thread, it was mentioned that this is currently mostly useful for
> raids which need the blocks actually cleared for checksum consistency,
> which basically means that raid metadata handling isn't (yet) capable
> of just marking those (parts of) stripes as unused. If a filesystem
> wants to read back zeros from data blocks, wouldn't it be just marking
> the matching index as such?

Not all filesystems can do this for user data (see gfs2 case above) and
no linux filesystem tracks whether free space contains zeros or stale
data. Hence if we want blocks to be zeroed on disk, we currently have to
write zeros to them and hence they get pinned in devices as "used space"
even though they may never get used again.

Cheers,

Dave.
Hello, Dave.

On Wed, Jan 07, 2015 at 03:15:37PM +1100, Dave Chinner wrote:
> w.r.t. thinp devices, we need to be able to guarantee that
> preallocated regions in the filesystem are actually backed by real
> blocks in the thinp device so we don't get ENOSPC from the thinp
> device. No filesystems do this yet because we don't have a mechanism
> for telling the lower layers "preallocate these blocks to zero".
>
> The biggest issue is that we currently have no easy way to say
> "these blocks need to contain zeros, but we aren't actually using
> them yet". i.e. the filesystem code assumes that they contain zeros
> (e.g. in ext4 inode tables because mkfs used to zero them) if they
> haven't been used, so when it reads them it detects that
> initialisation is needed because the blocks are empty....
>
> FWIW, some filesystems need these regions to actually contain
> zeros because they can't track unwritten extents (e.g. gfs2).
> Having sb_issue_zeroout() just do the right thing enables us to
> efficiently zero the regions they are preallocating...
>
> > Earlier in the
> > thread, it was mentioned that this is currently mostly useful for
> > raids which need the blocks actually cleared for checksum consistency,
> > which basically means that raid metadata handling isn't (yet) capable
> > of just marking those (parts of) stripes as unused. If a filesystem
> > wants to read back zeros from data blocks, wouldn't it be just marking
> > the matching index as such?
>
> Not all filesystems can do this for user data (see gfs2 case above)
> and no linux filesystem tracks whether free space contains zeros or
> stale data. Hence if we want blocks to be zeroed on disk, we
> currently have to write zeros to them and hence they get pinned in
> devices as "used space" even though they may never get used again.

Okay, I'll take it that this benefits generic enough use cases from the
filesystem POV. As long as that's the case, it prolly has a reasonable
chance of actually being widely and properly implemented by hw vendors.
It's easy to complain about mainstream hardware, but IMHO the market is
actually pretty efficient at shooting down extra cruft which doesn't
really matter and only exists to increase the number of feature
checkboxes. Hopefully, this has actual benefits and won't end up that
way.

Martin, do you have a newer version or shall I apply the original one?

Thanks.
On 01/06/2015 11:15 PM, Dave Chinner wrote:
> (e.g. in ext4 inode tables because mkfs used to zero them) if they
> haven't been used, so when it reads them it detects that
> initialisation is needed because the blocks are empty....

No, it knows that the inode table needs to be initialized because there
is a flag in the group descriptor that says this inode table is still
uninitialized. It never reads the blocks to see if they are full of
zeros. mke2fs sets the flag when it does not initialize the table with
zeros, either by direct writes (which it doesn't do if lazy_itable_init
is true, which it defaults to these days), or by discarding the blocks
when the device claims to support deterministic discard that zeros.
On Jan 7, 2015, at 9:05 PM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 01/06/2015 11:15 PM, Dave Chinner wrote:
>> (e.g. in ext4 inode tables because mkfs used to zero them) if they
>> haven't been used, so when it reads them it detects that
>> initialisation is needed because the blocks are empty....
>
> No, it knows that the inode table needs to be initialized because
> there is a flag in the group descriptor that says this inode table is
> still uninitialized. It never reads the blocks to see if they are
> full of zeros. mke2fs sets the flag when it does not initialize the
> table with zeros, either by direct writes (which it doesn't do if
> lazy_itable_init is true, which it defaults to these days), or by
> discarding the blocks when the device claims to support deterministic
> discard that zeros.

That is only partially correct. While it is true that mke2fs sets the
UNINIT flag at format time, the "lazy" part of that means there is a
kernel thread that still does the zeroing of the inode table blocks,
but after the filesystem is mounted, for any group that does not have
the ZEROED flag set. After that point, the "UNINIT" flag is an
optimization to avoid reading the bitmap and unused blocks from disk
during allocation.

This is needed in case the group descriptor or inode bitmap is
corrupted, and e2fsck needs to scan the inode table for in-use inodes.
We don't want it to find old inodes from before the filesystem was
formatted.

The ext4_init_inode_table() calls
sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying storage
supported deterministic zeroing, this could be handled very
efficiently.

Cheers, Andreas
On 1/7/2015 11:58 PM, Andreas Dilger wrote:
>> No, it knows that the inode table needs to be initialized because
>> there is a flag in the group descriptor that says this inode table
>> is still uninitialized. It never reads the blocks to see if they are
>> full of zeros. mke2fs sets the flag when it does not initialize the
>> table with zeros, either by direct writes (which it doesn't do if
>> lazy_itable_init is true, which it defaults to these days), or by
>> discarding the blocks when the device claims to support deterministic
>> discard that zeros.
>
> That is only partially correct. While it is true that mke2fs sets
> the UNINIT flag at format time, the "lazy" part of that means there
> is a kernel thread that still does the zeroing of the inode table
> blocks, but after the filesystem is mounted, for any group that
> does not have the ZEROED flag set. After that point, the "UNINIT"
> flag is an optimization to avoid reading the bitmap and unused
> blocks from disk during allocation.

That is pretty much what I said, except that I was pointing out that it
does not *read* first to see if the disk is already zeroed, as that
would be a waste of time. It just writes out the zeros for block groups
that still have the uninit flag set.

> This is needed in case the group descriptor or inode bitmap is
> corrupted, and e2fsck needs to scan the inode table for in-use
> inodes. We don't want it to find old inodes from before the
> filesystem was formatted.
>
> The ext4_init_inode_table() calls
> sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying
> storage supported deterministic zeroing, this could be handled very
> efficiently.

Again, that's pretty much what I said.
>>>>> "Tejun" == Tejun Heo <tj@kernel.org> writes:
Tejun> do you have a newer version or shall I apply the original one?
Just sent an updated patch that catches the latest Micron/Crucial drives
as well.
On Jan 8, 2015, at 7:09 AM, Phillip Susi <psusi@ubuntu.com> wrote:
> On 1/7/2015 11:58 PM, Andreas Dilger wrote:
>>> No, it knows that the inode table needs to be initialized because
>>> there is a flag in the group descriptor that says this inode table
>>> is still uninitialized. It never reads the blocks to see if they
>>> are full of zeros. mke2fs sets the flag when it does not initialize
>>> the table with zeros, either by direct writes (which it doesn't do
>>> if lazy_itable_init is true, which it defaults to these days), or
>>> by discarding the blocks when the device claims to support
>>> deterministic discard that zeros.
>>
>> That is only partially correct. While it is true that mke2fs sets
>> the UNINIT flag at format time, the "lazy" part of that means there
>> is a kernel thread that still does the zeroing of the inode table
>> blocks, but after the filesystem is mounted, for any group that
>> does not have the ZEROED flag set. After that point, the "UNINIT"
>> flag is an optimization to avoid reading the bitmap and unused
>> blocks from disk during allocation.
>
> That is pretty much what I said, except that I was pointing out that
> it does not *read* first to see if the disk is already zeroed, as that
> would be a waste of time. It just writes out the zeros for block
> groups that still have the uninit flag set.

Sorry, I didn't get that from my reading, so I thought I'd clarify.

I'd actually proposed that the ext4_init_inode_table() thread start by
reading the itable blocks, check them for zeroes, and only switch over
to writing if it finds any non-zero data in the blocks. I think that
would be a net win in some cases, and only a tiny bit of overhead (a
single read) if it turns out to be wrong.

>> This is needed in case the group descriptor or inode bitmap is
>> corrupted, and e2fsck needs to scan the inode table for in-use
>> inodes. We don't want it to find old inodes from before the
>> filesystem was formatted.
>>
>> The ext4_init_inode_table() calls
>> sb_issue_zeroout->blkdev_issue_zeroout(), so if the underlying
>> storage supported deterministic zeroing, this could be handled very
>> efficiently.
>
> Again, that's pretty much what I said.

Cheers, Andreas
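Andreas's proposed optimization is easy to sketch. The program below is
a user-space approximation only; read_blocks() and zero_blocks() are
hypothetical stand-ins for the buffer-head I/O the real ext4 lazy-init
thread would do. The idea is simply to read the inode-table region once
and only fall back to zeroing it if any non-zero byte turns up:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static unsigned char fake_itable[4096];	/* pretend on-disk inode table */

static int read_blocks(void *buf, size_t len)
{
	memcpy(buf, fake_itable, len);	/* simulate reading the itable */
	return 0;
}

static int zero_blocks(size_t len)
{
	printf("zeroing %zu bytes via a zeroout-style path\n", len);
	memset(fake_itable, 0, len);
	return 0;
}

static int lazy_init_itable(size_t itable_bytes)
{
	unsigned char *buf = malloc(itable_bytes);
	int need_zeroing = 0;
	size_t i;

	if (!buf)
		return -1;

	/* One extra read up front; cheap next to rewriting the whole table. */
	if (read_blocks(buf, itable_bytes) == 0) {
		for (i = 0; i < itable_bytes; i++)
			if (buf[i]) {
				need_zeroing = 1;
				break;
			}
	} else {
		need_zeroing = 1;	/* can't tell, so play it safe */
	}
	free(buf);

	if (!need_zeroing) {
		printf("inode table already zero, skipping the writes\n");
		return 0;
	}
	return zero_blocks(itable_bytes);
}

int main(void)
{
	lazy_init_itable(sizeof(fake_itable));	/* all zero: no writes issued */

	fake_itable[100] = 0xff;		/* stale data: must be zeroed */
	lazy_init_itable(sizeof(fake_itable));
	return 0;
}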
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index c5ba15af87d3..f41f24a8bc21 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4225,10 +4225,20 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
 	{ "PIONEER DVD-RW DVR-216D",	NULL,	ATA_HORKAGE_NOSETXFER },
 
 	/* devices that don't properly handle queued TRIM commands */
-	{ "Micron_M500*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT???M500SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Micron_M550*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
-	{ "Crucial_CT*M550SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+	{ "Micron_M5?0*",		NULL,	ATA_HORKAGE_NO_NCQ_TRIM |
+						ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Crucial_CT???M5?0SSD*",	NULL,	ATA_HORKAGE_NO_NCQ_TRIM, },
+
+	/*
+	 * DRAT/RZAT are weak guarantees. Explicitly black/whitelist
+	 * SSDs that provide reliable zero after TRIM.
+	 */
+	{ "INTEL*SSDSC2MH*",		NULL,	0, }, /* Blacklist intel 510 */
+	{ "INTEL*SSD*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SSD*INTEL*",			NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "Samsung*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "SAMSUNG*SSD*",		NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
+	{ "ST[1248][0248]0[FH]*",	NULL,	ATA_HORKAGE_ZERO_AFTER_TRIM, },
 
 	/*
 	 * Some WD SATA-I drives spin up and down erratically when the link
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 0586f66d70fa..a19479282b65 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -2515,13 +2515,15 @@ static unsigned int ata_scsiop_read_cap(struct ata_scsi_args *args, u8 *rbuf)
 		rbuf[15] = lowest_aligned;
 
 		if (ata_id_has_trim(args->id)) {
-			rbuf[14] |= 0x80; /* TPE */
+			rbuf[14] |= 0x80; /* LBPME */
 
-			if (ata_id_has_zero_after_trim(args->id))
-				rbuf[14] |= 0x40; /* TPRZ */
+			if (ata_id_has_zero_after_trim(args->id) &&
+			    dev->horkage & ATA_HORKAGE_ZERO_AFTER_TRIM) {
+				ata_dev_info(dev, "Enabling discard_zeroes_data\n");
+				rbuf[14] |= 0x40; /* LBPRZ */
+			}
 		}
 	}
-
 	return 0;
 }
 
diff --git a/include/linux/libata.h b/include/linux/libata.h
index bd5fefeaf548..45ac825b8366 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -421,6 +421,7 @@ enum {
 	ATA_HORKAGE_NO_NCQ_TRIM	= (1 << 19),	/* don't use queued TRIM */
 	ATA_HORKAGE_NOLPM	= (1 << 20),	/* don't use LPM */
 	ATA_HORKAGE_WD_BROKEN_LPM = (1 << 21),	/* some WDs have broken LPM */
+	ATA_HORKAGE_ZERO_AFTER_TRIM = (1 << 22),/* guarantees zero after trim */
 
 	/* DMA mask for user DMA control: User visible values; DO NOT renumber */
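As a practical footnote, once a kernel carrying this patch is running,
the net effect can be checked from user space. The sketch below assumes
a kernel of this vintage, which exposes the discard_zeroes_data queue
attribute in sysfs; pass a disk name such as sda as the argument:

#include <stdio.h>

int main(int argc, char **argv)
{
	char path[256];
	FILE *f;
	int val;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <disk>  (e.g. sda)\n", argv[0]);
		return 1;
	}

	snprintf(path, sizeof(path),
		 "/sys/block/%s/queue/discard_zeroes_data", argv[1]);

	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &val) != 1) {
		fprintf(stderr, "%s: unexpected contents\n", path);
		fclose(f);
		return 1;
	}
	fclose(f);

	printf("%s: discard_zeroes_data=%d (%s)\n", argv[1], val,
	       val ? "discarded blocks will read back as zeroes"
		   : "no zeroing guarantee, MD/RAID should not rely on TRIM");
	return 0;
}

A drive only reports 1 here if it sets RZAT in its IDENTIFY data and it
matches one of the whitelist globs above.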