diff mbox

[RFC] xfs: batched discard support

Message ID 4A8810C4.3050800@rtr.ca
State Not Applicable
Delegated to: David Miller
Headers show

Commit Message

Mark Lord Aug. 16, 2009, 1:59 p.m. UTC
Mark Lord wrote:
> Christoph Hellwig wrote:
>> On Sat, Aug 15, 2009 at 10:19:21PM -0400, Mark Lord wrote:
>>> Mark Lord wrote:
>>>> Christoph Hellwig wrote:
>>> ..
>>>>> Mark, any chance to try it?  Just create an XFS filesystem, age it a
>>>>> bit and then call the attached little trim.c program on the mountmoint
>>>>> (or any file inside the filesystem for that matter)
>>>> ..
>>>>
>>>> Looking at it now.  Thanks, Christoph!
>>> ..
>>>
>>> Fails to work on 64-bit kernel w/ 32-bit userspace (no compat ioctl).
>>> Rebuilding with 32-bit kernel now..
>>
>> The actual ioctl is compatible, just add the
>>
>>     case XFS_IOC_TRIM:
>>         return xfs_ioc_trim(mp, arg);
>>
>> to xfs_file_compat_ioctl().  I'll add this to the next spin of the patch.
> ..
> 
> Okay, this gives me ENOSYS now --> discard/trim support is missing from
> the lower layers.
> 
> What other patches do I need to make this work?
> 
> The latest from Matthew's discard tree (May 2009) don't appear to be sufficient,
> even after updating them for 2.6.31-rc6.
..

Okay, I got Matthews patches updated onto 2.6.31, and fixed the incompatibilities
between those and the XFS TRIM patch (from Christoph), plus a sector_t printk issue.

My apologies for attachments, but I am attaching the updated Christoph patch,
as well as my hacked-up forward-port of Matthew's patches.

Not pretty, but they work.  :)

Now.. running Christoph's "xfs trim" on a 4.6GB mostly already-trimmed
XFS partition gave this for the first time around:

[   25.961891] Filesystem "sdb3": discarding sectors [0xc558-0x102328]
[   27.814553] Filesystem "sdb3": discarding sectors [0x10ea78-0x10e688]
[   29.771218] Filesystem "sdb3": discarding sectors [0x21d120-0x10e860]
[   31.726444] Filesystem "sdb3": discarding sectors [0x32b9a0-0x10e860]
[   33.679023] Filesystem "sdb3": discarding sectors [0x43f220-0x109860]
[   35.629948] Filesystem "sdb3": discarding sectors [0x548aa0-0x10e860]
[   37.583142] Filesystem "sdb3": discarding sectors [0x657320-0x10e860]
[   39.531822] Filesystem "sdb3": discarding sectors [0x765ba0-0x10e860]

Slow, but presumably thorough.
Subsequent runs were equally slow.

The problem is, it still issues TRIMs to the LLD one extent at a time.
Compare this with doing it all in a single TRIM command
with the wiper.sh script (filesystem unmounted):

	[~] time wiper.sh /dev/sdb3 --commit
	
	wiper.sh: Linux SATA SSD TRIM utility, version 1.9b, by Mark Lord.
	Preparing for offline TRIM of free space on /dev/sdb3 (xfs non-mounted).
	This operation could destroy your data.  Are you sure (y/N)? y
	Syncing disks..
	Beginning TRIM operations..
	Trimming 168 free extents encompassing 8793136 sectors (4294 MB)
	Done.
	
	real    0m1.249s
	user    0m0.110s
	sys     0m0.063s

That includes the time for me to type 'y' and hit enter.  :)

Cheers

Comments

Mark Lord Aug. 16, 2009, 2:06 p.m. UTC | #1
Mark Lord wrote:
..
> Slow, but presumably thorough.
> Subsequent runs were equally slow.
> 
> The problem is, it still issues TRIMs to the LLD one extent at a time.
> Compare this with doing it all in a single TRIM command
> with the wiper.sh script (filesystem unmounted):
> 
>     [~] time wiper.sh /dev/sdb3 --commit
>     
>     wiper.sh: Linux SATA SSD TRIM utility, version 1.9b, by Mark Lord.
>     Preparing for offline TRIM of free space on /dev/sdb3 (xfs 
> non-mounted).
>     This operation could destroy your data.  Are you sure (y/N)? y
>     Syncing disks..
>     Beginning TRIM operations..
>     Trimming 168 free extents encompassing 8793136 sectors (4294 MB)
>     Done.
>     
>     real    0m1.249s
>     user    0m0.110s
>     sys     0m0.063s
> 
> That includes the time for me to type 'y' and hit enter.  :)
..

For completeness, here's the same operation again,
except this time on the *mounted* xfs filesystem.
It won't be trimming quite as many blocks
(leaves 1% free space in reserve),
but otherwise is similar:

	[~] time wiper.sh /dev/sdb3 --commit
	
	wiper.sh: Linux SATA SSD TRIM utility, version 1.9b, by Mark Lord.
	Preparing for online TRIM of free space on /dev/sdb3 (xfs mounted read-write at /x).
	This operation could destroy your data.  Are you sure (y/N)? y
	Creating temporary file (4348405 KB)..
	Syncing disks..
	Beginning TRIM operations..
	Trimming 134 free extents encompassing 8696816 sectors (4246 MB)
	Removing temporary file..
	Syncing disks..
	Done.
	
	real    0m1.212s
	user    0m0.043s
	sys     0m0.053s

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Aug. 16, 2009, 2:23 p.m. UTC | #2
On Sun, Aug 16, 2009 at 09:59:32AM -0400, Mark Lord wrote:
> Okay, I got Matthews patches updated onto 2.6.31, and fixed the incompatibilities
> between those and the XFS TRIM patch (from Christoph), plus a sector_t printk issue.
>
> My apologies for attachments, but I am attaching the updated Christoph patch,
> as well as my hacked-up forward-port of Matthew's patches.
>
> Not pretty, but they work.  :)
>
> Now.. running Christoph's "xfs trim" on a 4.6GB mostly already-trimmed
> XFS partition gave this for the first time around:

> The problem is, it still issues TRIMs to the LLD one extent at a time.
> Compare this with doing it all in a single TRIM command
> with the wiper.sh script (filesystem unmounted):

I could do a variant which issues a single TRIM, but that would require
us to lock out all other allocations for the time the trim takes.  I'll
hack that up once I get some time.

--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mark Lord Aug. 16, 2009, 2:26 p.m. UTC | #3
Christoph Hellwig wrote:
> On Sun, Aug 16, 2009 at 09:59:32AM -0400, Mark Lord wrote:
>> Okay, I got Matthews patches updated onto 2.6.31, and fixed the incompatibilities
>> between those and the XFS TRIM patch (from Christoph), plus a sector_t printk issue.
>>
>> My apologies for attachments, but I am attaching the updated Christoph patch,
>> as well as my hacked-up forward-port of Matthew's patches.
>>
>> Not pretty, but they work.  :)
>>
>> Now.. running Christoph's "xfs trim" on a 4.6GB mostly already-trimmed
>> XFS partition gave this for the first time around:
> 
>> The problem is, it still issues TRIMs to the LLD one extent at a time.
>> Compare this with doing it all in a single TRIM command
>> with the wiper.sh script (filesystem unmounted):
> 
> I could do a variant which issues a single TRIM, but that would require
> us to lock out all other allocations for the time the trim takes.  I'll
> hack that up once I get some time.
..

Matthew's stuff will have to change to support that, too.
--
To unsubscribe from this list: send the line "unsubscribe linux-ide" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/block/blk-barrier.c linux/block/blk-barrier.c
--- linux-2.6.31-rc6/block/blk-barrier.c	2009-08-16 09:36:36.431146680 -0400
+++ linux/block/blk-barrier.c	2009-08-16 09:20:15.164578531 -0400
@@ -425,3 +425,4 @@ 
 					DISCARD_BARRIER, NULL);
 }
 EXPORT_SYMBOL(blkdev_issue_discard);
+EXPORT_SYMBOL(__blkdev_issue_discard);
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl.c linux/fs/xfs/linux-2.6/xfs_ioctl.c
--- linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl.c	2009-08-16 09:16:39.000433070 -0400
+++ linux/fs/xfs/linux-2.6/xfs_ioctl.c	2009-08-16 09:30:38.973683042 -0400
@@ -1274,6 +1274,31 @@ 
 	return 0;
 }
 
+int
+xfs_ioc_trim(
+	struct xfs_mount	*mp,
+	__uint32_t		*argp)
+{
+	xfs_agnumber_t		agno;
+	int			error = 0;
+	__uint32_t		minlen;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (get_user(minlen, argp))
+		return -EFAULT;
+
+	down_read(&mp->m_peraglock);
+	for (agno = 0; agno < mp->m_sb.sb_agcount; agno++) {
+		error = -xfs_trim_extents(mp, agno, minlen);
+		if (error)
+			break;
+	}
+	up_read(&mp->m_peraglock);
+
+	return error;
+}
+
 /*
  * Note: some of the ioctl's return positive numbers as a
  * byte count indicating success, such as readlink_by_handle.
@@ -1523,6 +1548,9 @@ 
 		error = xfs_errortag_clearall(mp, 1);
 		return -error;
 
+	case XFS_IOC_TRIM:
+		return xfs_ioc_trim(mp, arg);
+
 	default:
 		return -ENOTTY;
 	}
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl32.c linux/fs/xfs/linux-2.6/xfs_ioctl32.c
--- linux-2.6.31-rc6/fs/xfs/linux-2.6/xfs_ioctl32.c	2009-06-09 23:05:27.000000000 -0400
+++ linux/fs/xfs/linux-2.6/xfs_ioctl32.c	2009-08-16 09:31:21.005588977 -0400
@@ -539,6 +539,7 @@ 
 	void			__user *arg = (void __user *)p;
 	int			ioflags = 0;
 	int			error;
+	extern int xfs_ioc_trim(struct xfs_mount *mp, __uint32_t *argp);
 
 	if (filp->f_mode & FMODE_NOCMTIME)
 		ioflags |= IO_INVIS;
@@ -564,6 +565,8 @@ 
 	case XFS_IOC_ERROR_INJECTION:
 	case XFS_IOC_ERROR_CLEARALL:
 		return xfs_file_ioctl(filp, cmd, p);
+	case XFS_IOC_TRIM:
+		return xfs_ioc_trim(mp, arg);
 #ifndef BROKEN_X86_ALIGNMENT
 	/* These are handled fine if no alignment issues */
 	case XFS_IOC_ALLOCSP:
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/xfs_alloc.h linux/fs/xfs/xfs_alloc.h
--- linux-2.6.31-rc6/fs/xfs/xfs_alloc.h	2009-06-09 23:05:27.000000000 -0400
+++ linux/fs/xfs/xfs_alloc.h	2009-08-16 09:20:15.167913313 -0400
@@ -215,4 +215,7 @@ 
 	xfs_fsblock_t	bno,	/* starting block number of extent */
 	xfs_extlen_t	len);	/* length of extent */
 
+int xfs_trim_extents(struct xfs_mount *mp, xfs_agnumber_t agno,
+	xfs_extlen_t minlen);
+
 #endif	/* __XFS_ALLOC_H__ */
diff -u --recursive --new-file --exclude-from=linux-2.6.31-rc6//Documentation/dontdiff --exclude='*.lds' --exclude-from=linux-2.6.31-rc6//.gitignore linux-2.6.31-rc6/fs/xfs/xfs_fs.h linux/fs/xfs/xfs_fs.h
--- linux-2.6.31-rc6/fs/xfs/xfs_fs.h	2009-08-16 09:16:39.017099926 -0400
+++ linux/fs/xfs/xfs_fs.h	2009-08-16 09:20:15.171246419 -0400
@@ -475,6 +475,7 @@ 
 #define XFS_IOC_ATTRMULTI_BY_HANDLE  _IOW ('X', 123, struct xfs_fsop_attrmulti_handlereq)
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 124, struct xfs_fsop_geom)
 #define XFS_IOC_GOINGDOWN	     _IOR ('X', 125, __uint32_t)
+#define XFS_IOC_TRIM		     _IOR ('X', 126, __uint32_t)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
--- linux-2.6.31-rc6/fs/xfs/xfs_alloc.c	2009-06-09 23:05:27.000000000 -0400
+++ linux/fs/xfs/xfs_alloc.c	2009-08-16 09:44:51.073580438 -0400
@@ -39,6 +39,9 @@ 
 #include "xfs_alloc.h"
 #include "xfs_error.h"
 
+int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
+				sector_t nr_sects, gfp_t gfp_mask,
+				unsigned type, struct completion *completion);
 
 #define XFS_ABSDIFF(a,b)	(((a) <= (b)) ? ((b) - (a)) : ((a) - (b)))
 
@@ -2609,6 +2612,97 @@ 
 	return error;
 }
 
+STATIC int
+xfs_trim_extent(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_agblock_t		fbno,
+	xfs_extlen_t		flen)
+{
+	xfs_daddr_t		blkno = XFS_AGB_TO_DADDR(mp, agno, fbno);
+	sector_t		nblks = XFS_FSB_TO_BB(mp, flen);
+	int			error;
+	DECLARE_COMPLETION_ONSTACK(done);
+
+	xfs_fs_cmn_err(CE_NOTE, mp, "discarding sectors [0x%llx-0x%llx]",
+			blkno, (u64)nblks);
+
+	error = -__blkdev_issue_discard(mp->m_ddev_targp->bt_bdev,
+			blkno, nblks, GFP_NOFS, DISCARD_BARRIER, &done);
+	if (error && error != EOPNOTSUPP)
+		xfs_fs_cmn_err(CE_NOTE, mp, "discard failed, error %d", error);
+	return error;
+}
+
+/*
+ * Notify the underlying block device about our free extent map.
+ *
+ * This walks all free extents above a minimum threshold and notifies the
+ * underlying device that these blocks are unused.  That information is
+ * useful for SSDs or thinly provisioned storage in high end arrays or
+ * virtualization scenarios.
+ */
+int
+xfs_trim_extents(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		minlen)	/* minimum extent size to bother */
+{
+	struct xfs_btree_cur	*cur;	/* cursor for the by-block btree */
+	struct xfs_buf		*agbp;	/* AGF buffer pointer */
+	xfs_agblock_t		bno;	/* block the for next search */
+	xfs_agblock_t		fbno;	/* start block of found extent */
+	xfs_extlen_t		flen;	/* length of found extent */
+	int			error;
+	int			i;
+
+	error = xfs_alloc_read_agf(mp, NULL, agno, 0, &agbp);
+	if (error)
+		return error;
+
+	bno = 0;
+	for (;;) {
+		cur = xfs_allocbt_init_cursor(mp, NULL, agbp, agno,
+					      XFS_BTNUM_BNO);
+
+		error = xfs_alloc_lookup_ge(cur, bno, minlen, &i);
+		if (error)
+			goto error0;
+		if (!i) {
+			/*
+			 * No more free extents found: done.
+			 */
+			xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+			break;
+		}
+
+		error = xfs_alloc_get_rec(cur, &fbno, &flen, &i);
+		if (error)
+			goto error0;
+		XFS_WANT_CORRUPTED_GOTO(i == 1, error0);
+
+		/*
+		 * Pass if the freespace extent isn't long enough to bother.
+		 */
+		if (flen >= minlen) {
+			error = xfs_trim_extent(mp, agno, fbno, flen);
+			if (error) {
+				xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+				break;
+			}
+		}
+
+		xfs_btree_del_cursor(cur, XFS_BTREE_NOERROR);
+		bno = fbno + flen;
+	}
+
+out:
+	xfs_buf_relse(agbp);
+	return error;
+error0:
+	xfs_btree_del_cursor(cur, XFS_BTREE_ERROR);
+	goto out;
+}
 
 /*
  * AG Busy list management