Message ID | 1282232941-9910-3-git-send-email-martin.petersen@oracle.com |
---|---|
State | Not Applicable |
Delegated to: | David Miller |
Headers | show |
On Thu, Aug 19, 2010 at 11:48 AM, Martin K. Petersen <martin.petersen@oracle.com> wrote: > ATA IDENTIFY DEVICE word 105 contains the number of 512-byte blocks of > TRIM payload information the device can accept in one command. Use this > value to enable payloads > 512 bytes. > > Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> > --- > drivers/ata/libata-scsi.c | 7 ++++++- > include/linux/ata.h | 9 +++++++++ > 2 files changed, 15 insertions(+), 1 deletions(-) > > diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c > index e280ae6..145f099 100644 > --- a/drivers/ata/libata-scsi.c > +++ b/drivers/ata/libata-scsi.c > @@ -2152,7 +2152,12 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf) > * with the unmap bit set. > */ > if (ata_id_has_trim(args->id)) { > - put_unaligned_be32(65535 * 512 / 8, &rbuf[20]); > + unsigned int blocks; since blocks is nebulous, a comment would be nice, maybe: unsigned int blocks; /* the ATA spec specifies 512-byte blocks for trim payload */ > + > + /* Default to 1 if unspecified in word 105. Cap at 1 page. */ > + blocks = clamp(ata_id_trim_range_blocks(args->id), 1U, 8U); Should there at least be a todo comment about raising that cap? Or is there a fundamental reason for it. ie. I don't think the ATA spec calls for it, so this is a kernel restriction I assume. > + > + put_unaligned_be32(65535 * 512 / 8 * blocks, &rbuf[20]); Is this patch actually enabling the block layer to initiate ATA multi-sector trim payloads, or is this only allowing the max payloads sectors to be queried? Are there plans (or existing code) to accumulate trim ranges in the block layer and create trim commands with multiple sectors? AFAIK, the only block layer feature exported to the file system layer only accepts one range. I'd like to see multiple trim ranges accumulated either by the filesystem prior to calling into the block layer, of have the block layer accumulate them. (I suspect its easier for the filesystem to do it via the proposed fitrim() ioctl that is on the ext4 list recently. ie. the current proposed fitrim() ioctl calls into the block layer for each trim range, but it could easily have the ability to accumulate a vector of trim ranges and call the block layer only once per trim vector if the block layer had a interface for that. Greg -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes: >> + /* Default to 1 if unspecified in word 105. Cap at 1 page. */ Greg> Should there at least be a todo comment about raising that cap? We don't currently plan to. The payload is a single page which we can allocate fairly easily. A bigger payload would be more problematic. With the 4KB payload you can adress 16GB with one command. Greg> Or is there a fundamental reason for it. ie. I don't think the Greg> ATA spec calls for it, so this is a kernel restriction I assume. Yes, this is part of the internal SCSI-ATA translation mechanism in Linux. But fwiw, there I'm only aware of one drive that currently supports more than 512 bytes of payload and it also caps at 4KB. Greg> Is this patch actually enabling the block layer to initiate ATA Greg> multi-sector trim payloads, or is this only allowing the max Greg> payloads sectors to be queried? TRIM only takes 65535 blocks per entry. So we need many entries to decribe a single, contiguous space being freed. That's what's being bumped here. Greg> Are there plans (or existing code) to accumulate trim ranges in Greg> the block layer and create trim commands with multiple sectors? I have worked on this on and off. It's not trivial for several reasons. We discussed the issues at the storage workshop last week and there was concensus that the changes were too intrusive. So we're holding off until we see what's happening with: a) the SSD market. Win 7 also issues lots of small, individual trims like we do b) the TRIM TNG command that's being worked on in T13 This doesn't mean that TRIM coalescing is impossible and that it will never happen. But right now it appears to be more hassle than it's worth...
On Thu, Aug 19, 2010 at 2:05 PM, Martin K. Petersen <martin.petersen@oracle.com> wrote: >>>>>> "Greg" == Greg Freemyer <greg.freemyer@gmail.com> writes: > >>> + /* Default to 1 if unspecified in word 105. Cap at 1 page. */ > > Greg> Should there at least be a todo comment about raising that cap? > > We don't currently plan to. The payload is a single page which we can > allocate fairly easily. A bigger payload would be more problematic. > > With the 4KB payload you can adress 16GB with one command. > > > Greg> Or is there a fundamental reason for it. ie. I don't think the > Greg> ATA spec calls for it, so this is a kernel restriction I assume. > > Yes, this is part of the internal SCSI-ATA translation mechanism in > Linux. But fwiw, there I'm only aware of one drive that currently > supports more than 512 bytes of payload and it also caps at 4KB. > If you're curious about harware support for larger payloads, you might ask Mark Lord. I understand the way he implemented it in hdparm accepts many more 512 byte payload blocks than one page worth. And the only SSD hardware he's reported problems with are Intel which apparently only supports one 512 byte payload. ie. I believe hdparm will typically use up to 1MB of payload. And the ranges he collects via his script and dispatches via hdparm are typically not contiguous. But I believe he is bypassing much of the kernel stack. To be honest, I have not traced a hdparm created trim command through the kernel, so I don't know the details, and he may be breaking that 1 MB payload that is coming from userspace into smaller payloads. The fundamental problem with hdparm is that because it bypasses most of the kernel i/o stack, it is not compatible with DM/LVM, but I think its a great prototype of what the kernel could do eventually. ==> from man hdparm --trim-sector-ranges For Solid State Drives (SSDs). EXCEPTIONALLY DANGEROUS. DO NOT USE THIS FLAG!! Tells the drive firmware to discard unneeded data sectors, destroying any data that may have been present within them. This makes those sectors available for immediate use by the firmware's garbage collection mechanism, to improve scheduling for wear-leveling of the flash media. This option expects one or more sector range pairs immediately after the flag: an LBA starting address, a colon, and a sector count, with no intervening spaces. EXCEP- TIONALLY DANGEROUS. DO NOT USE THIS FLAG!! Eg. hdparm --trim-sector-ranges 1000:4 7894:16 /dev/sdz --trim-sector-ranges-stdin Identical to --trim-sector-ranges above, except the list of lba:count pairs is read from stdin rather than being specified on the command line. This can be used to avoid prob- lems with excessively long command lines. It also permits batching of many more sector ranges into single commands to the drive, up to the currently configured transfer limit (max_sectors_kb). ==> > > Greg> Is this patch actually enabling the block layer to initiate ATA > Greg> multi-sector trim payloads, or is this only allowing the max > Greg> payloads sectors to be queried? > > TRIM only takes 65535 blocks per entry. So we need many entries to > decribe a single, contiguous space being freed. That's what's being > bumped here. > Thanks, I did not appreciate that restriction and thus my confusion about the patch > > Greg> Are there plans (or existing code) to accumulate trim ranges in > Greg> the block layer and create trim commands with multiple sectors? > > I have worked on this on and off. It's not trivial for several reasons. > We discussed the issues at the storage workshop last week and there was > concensus that the changes were too intrusive. So we're holding off > until we see what's happening with: > > a) the SSD market. Win 7 also issues lots of small, individual > trims like we do > > b) the TRIM TNG command that's being worked on in T13 > > This doesn't mean that TRIM coalescing is impossible and that it will > never happen. But right now it appears to be more hassle than it's > worth... I believe it's almost trivial to coalesce non-contiguous ranges into a single vector of ranges in the proposed fitrim() ioctl because it is walking the filesystem structures looking for free ranges. I assume its just a matter of maintaining the locks long enough to ensure the blocks being trimmed are not used by other filesystem code while the ranges are being collected. The bigger problem for now is that the block layer does not export a way to pass in a vectorized list of ranges to discard. > -- > Martin K. Petersen Oracle Linux Engineering > Greg -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10-08-19 02:05 PM, Martin K. Petersen wrote: >I'm only aware of one drive that currently > supports more than 512 bytes of payload and it also caps at 4KB. SSDs based upon the Indilinx Barefoot controller support more or less "infinite" payload for TRIM, with no cap. But it predates idword[105], so just has a zero there. OCZ Vertex-LE specifies a single sector of payload. Those are the ones I know about and have on hand here for TRIM. Cheers -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 19, 2010 at 05:50:11PM -0400, Mark Lord wrote: > On 10-08-19 02:05 PM, Martin K. Petersen wrote: > >I'm only aware of one drive that currently > >supports more than 512 bytes of payload and it also caps at 4KB. > > SSDs based upon the Indilinx Barefoot controller support > more or less "infinite" payload for TRIM, with no cap. > But it predates idword[105], so just has a zero there. Is there an easy way to identify them? If so we could add a quirk for them if it provides enough benefit. -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10-08-20 04:58 AM, Christoph Hellwig wrote: > On Thu, Aug 19, 2010 at 05:50:11PM -0400, Mark Lord wrote: >> On 10-08-19 02:05 PM, Martin K. Petersen wrote: >>> I'm only aware of one drive that currently >>> supports more than 512 bytes of payload and it also caps at 4KB. >> >> SSDs based upon the Indilinx Barefoot controller support >> more or less "infinite" payload for TRIM, with no cap. >> But it predates idword[105], so just has a zero there. > > Is there an easy way to identify them? If so we could add a quirk > for them if it provides enough benefit. Each brand/model identifies itself differently. But we could start a whitelist on based on the model name field from the ATA identify data. The OCZ VERTEX drives I have here, identify themselves as "OCZ-VERTEX", and accept very very long payloads (no limit?). The OCZ VERTEX-LE drives here do have a limit, of 8 sectors, and identify themselves as "OCZ VERTEX-LE" in that field. That's what hdparm-9.30 uses to figure out the max TRIM payloads, in the absence of a value in word 105. Cheers -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Aug 20, 2010 at 9:53 AM, Mark Lord <kernel@teksavvy.com> wrote: > On 10-08-20 04:58 AM, Christoph Hellwig wrote: >> >> On Thu, Aug 19, 2010 at 05:50:11PM -0400, Mark Lord wrote: >>> >>> On 10-08-19 02:05 PM, Martin K. Petersen wrote: >>>> >>>> I'm only aware of one drive that currently >>>> supports more than 512 bytes of payload and it also caps at 4KB. >>> >>> SSDs based upon the Indilinx Barefoot controller support >>> more or less "infinite" payload for TRIM, with no cap. >>> But it predates idword[105], so just has a zero there. >> >> Is there an easy way to identify them? If so we could add a quirk >> for them if it provides enough benefit. > > > Each brand/model identifies itself differently. > But we could start a whitelist on based on the model name field > from the ATA identify data. > > The OCZ VERTEX drives I have here, identify themselves as "OCZ-VERTEX", > and accept very very long payloads (no limit?). > > The OCZ VERTEX-LE drives here do have a limit, of 8 sectors, > and identify themselves as "OCZ VERTEX-LE" in that field. > > That's what hdparm-9.30 uses to figure out the max TRIM payloads, > in the absence of a value in word 105. > > Cheers > A whitelist to enable large contiguous range trims via 8 512B-block payloads seems useful for those devices that don't support word 105. (ie. 4KB payload) But, I doubt there is enough observable performance advantage to justify reworking the internal SCSI-ATA translation mechanism in the kernel to handle larger payloads. Especially if only one or two SSDs will accept greater than 4KB of payload to the trim command. Greg -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10-08-20 01:13 PM, Greg Freemyer wrote: > On Fri, Aug 20, 2010 at 9:53 AM, Mark Lord<kernel@teksavvy.com> wrote: >> On 10-08-20 04:58 AM, Christoph Hellwig wrote: >>> >>> On Thu, Aug 19, 2010 at 05:50:11PM -0400, Mark Lord wrote: >>>> >>>> On 10-08-19 02:05 PM, Martin K. Petersen wrote: >>>>> >>>>> I'm only aware of one drive that currently >>>>> supports more than 512 bytes of payload and it also caps at 4KB. >>>> >>>> SSDs based upon the Indilinx Barefoot controller support >>>> more or less "infinite" payload for TRIM, with no cap. >>>> But it predates idword[105], so just has a zero there. >>> >>> Is there an easy way to identify them? If so we could add a quirk >>> for them if it provides enough benefit. .. > A whitelist to enable large contiguous range trims via 8 512B-block > payloads seems useful for those devices that don't support word 105. > (ie. 4KB payload) > > But, I doubt there is enough observable performance advantage to > justify reworking the internal SCSI-ATA translation mechanism in the > kernel to handle larger payloads. Especially if only one or two SSDs > will accept greater than 4KB of payload to the trim command. .. Actually, for in-kernel uses, even just a single 512-byte long list would likely be adequate. The biggest gains will come from simply having more than one range per TRIM command issued. Once we get that, it's diminishing returns for larger and larger lists, and in-kernel code is unlikely to ever be able to generate lists that long. But getting started should be easier than folks are making it out to be. The first step is to have "file deletion" issue multi-range trims for all extents of the file at once, rather than one range at a time as at present. That'll be the biggest gain, and shouldn't be too complex to implement. Especially not after Christoph/Tejun's current barrier rework stuff is in place. -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 19, 2010 at 11:48:57AM -0400, Martin K. Petersen wrote: > ATA IDENTIFY DEVICE word 105 contains the number of 512-byte blocks of > TRIM payload information the device can accept in one command. Use this > value to enable payloads > 512 bytes. Currently ata_scsi_write_same_xlat passes uses a constant 512 for the payload all over the place, so the larger payloads won't actually work yet. I don't think advertising a larger size than we actually support is a good idea. -- To unsubscribe from this list: send the line "unsubscribe linux-ide" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>> "Christoph" == Christoph Hellwig <hch@lst.de> writes: Christoph> On Thu, Aug 19, 2010 at 11:48:57AM -0400, Martin K. Petersen wrote: >> ATA IDENTIFY DEVICE word 105 contains the number of 512-byte blocks >> of TRIM payload information the device can accept in one command. >> Use this value to enable payloads > 512 bytes. Christoph> Currently ata_scsi_write_same_xlat passes uses a constant 512 Christoph> for the payload all over the place, so the larger payloads Christoph> won't actually work yet. Christoph> I don't think advertising a larger size than we actually Christoph> support is a good idea. I was on crack when I included this in the series last week. It came from my discard coalescing tree where things are quite different. In any case we're also limited by the range of WRITE SAME as long as we're relying on SAT. So under all circumstances we're dead in the water with a payload > 1KB. Let's just drop this one for now.
diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c index e280ae6..145f099 100644 --- a/drivers/ata/libata-scsi.c +++ b/drivers/ata/libata-scsi.c @@ -2152,7 +2152,12 @@ static unsigned int ata_scsiop_inq_b0(struct ata_scsi_args *args, u8 *rbuf) * with the unmap bit set. */ if (ata_id_has_trim(args->id)) { - put_unaligned_be32(65535 * 512 / 8, &rbuf[20]); + unsigned int blocks; + + /* Default to 1 if unspecified in word 105. Cap at 1 page. */ + blocks = clamp(ata_id_trim_range_blocks(args->id), 1U, 8U); + + put_unaligned_be32(65535 * 512 / 8 * blocks, &rbuf[20]); put_unaligned_be32(1, &rbuf[28]); } diff --git a/include/linux/ata.h b/include/linux/ata.h index fe6e681..5584356 100644 --- a/include/linux/ata.h +++ b/include/linux/ata.h @@ -88,6 +88,7 @@ enum { ATA_ID_HW_CONFIG = 93, ATA_ID_SPG = 98, ATA_ID_LBA_CAPACITY_2 = 100, + ATA_ID_TRIM_RANGE_BLKS = 105, ATA_ID_SECTOR_SIZE = 106, ATA_ID_LAST_LUN = 126, ATA_ID_DLF = 128, @@ -827,6 +828,14 @@ static inline int ata_id_has_zero_after_trim(const u16 *id) return 0; } +static inline unsigned int ata_id_trim_range_blocks(const u16 *id) +{ + if (ata_id_has_trim(id)) + return id[ATA_ID_TRIM_RANGE_BLKS]; + + return 0; +} + static inline int ata_id_current_chs_valid(const u16 *id) { /* For ATA-1 devices, if the INITIALIZE DEVICE PARAMETERS command
ATA IDENTIFY DEVICE word 105 contains the number of 512-byte blocks of TRIM payload information the device can accept in one command. Use this value to enable payloads > 512 bytes. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> --- drivers/ata/libata-scsi.c | 7 ++++++- include/linux/ata.h | 9 +++++++++ 2 files changed, 15 insertions(+), 1 deletions(-)