[1/3] ext4: Fix insertion point of extent in mext_insert_across_blocks()

Message ID 4B8E0679.8060706@rs.jp.nec.com
State Accepted, archived

Commit Message

Akira Fujita March 3, 2010, 6:49 a.m. UTC
ext4: Fix insertion point of extent in mext_insert_across_blocks()

From: Akira Fujita <a-fujita@rs.jp.nec.com>

If the leaf node has space for two extents or fewer and the
EXT4_IOC_MOVE_EXT ioctl is called with a file offset beyond the
range that the 2nd extent covers, mext_insert_across_blocks()
always tries to insert the new extent at the first extent.
As a result, the file gets corrupted because the extents
end up in the wrong order.  This patch fixes the problem.

Signed-off-by: Akira Fujita <a-fujita@rs.jp.nec.com>
---
  fs/ext4/move_extent.c |    4 ++++
  1 files changed, 4 insertions(+), 0 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Theodore Ts'o March 4, 2010, 1:25 a.m. UTC | #1
On Wed, Mar 03, 2010 at 03:49:29PM +0900, Akira Fujita wrote:
> ext4: Fix insertion point of extent in mext_insert_across_blocks()
> 
> From: Akira Fujita <a-fujita@rs.jp.nec.com>
> 
> If the leaf node has space for two extents or fewer and the
> EXT4_IOC_MOVE_EXT ioctl is called with a file offset beyond the
> range that the 2nd extent covers, mext_insert_across_blocks()
> always tries to insert the new extent at the first extent.
> As a result, the file gets corrupted because the extents
> end up in the wrong order.  This patch fixes the problem.

Do you have test cases that we can use as part of a regression test
suite to test the EXT4_IOC_MOVE_EXT ioctl?  I'm very glad you found
these problems (although timing --- right before the merge window is
about to close --- wasn't exactly ideal), but what's more important to
me is how we get better regression testing.

The other two patches are obviously correct, but this one is going
to require me to spend a long time staring at the various corner cases
in order for me to convince myself that it is totally safe.  If we had
a set of test cases where we could easily verify the "before" and
"after" file system images as being correct, and then combined it with
a code coverage tool, it would make it a lot easier to validate future
patches in fs/ext4/move_extent.c.

It would be useful for other parts of the kernel as well, but at least
for the standard extents function we have some fairly aggressive
generic file system tests, combined with the fact that
fs/ext4/extents.c gets exercised much more frequently than
fs/ext4/move_extent.c.

So the question is how we can get to the point where we can
comfortably tell people that e4defrag is totally safe, and has no
chance of corrupting their data?

						- Ted

P.S.  Here's another random idea for how we might aggressively test
the EXT4_IOC_MOVE_EXT ioctl: (1) create an empty filesystem, (2)
create a tool which randomly sets 50% of the bits in the block
allocation bitmap, marking them as in use, and making the free space
look very badly fragmented.  (3) write a large number of files into
the filesystem.  (4) calculate the checksums for all of the files.
(5) run e2fsck on the filesystem to fix up the block allocation
bitmap.  (6) defrag all of the files on the filesystem.  (7) use
e2fsck to make sure the filesystem is still consistent.  (8) calculate
the checksums for all of the files to make sure they still contain
their original data.
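
Steps (4) and (8) above amount to hashing every file before and after the defrag run and comparing the results.  A minimal sketch of such a helper follows.  The function name file_fnv1a is made up for illustration, and FNV-1a is a non-cryptographic hash swapped in only to keep the example dependency-free; a real test would more likely use md5sum or sha256sum.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: compute a 64-bit FNV-1a hash over a file's
 * contents.  Returns 0 on success and stores the hash in *out;
 * returns -1 if the file cannot be opened or read.
 */
static int file_fnv1a(const char *path, uint64_t *out)
{
	unsigned char buf[65536];
	uint64_t h = 14695981039346656037ULL;	/* FNV offset basis */
	size_t n, i;
	FILE *f = fopen(path, "rb");

	if (!f)
		return -1;
	while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
		for (i = 0; i < n; i++) {
			h ^= buf[i];
			h *= 1099511628211ULL;	/* FNV prime */
		}
	}
	if (ferror(f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	*out = h;
	return 0;
}
```

Recording the hash of each file after step (4) and comparing it with the value computed in step (8) flags any file whose data changed during the defrag.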

Greg Freemyer March 4, 2010, 5:50 a.m. UTC | #2
>
> P.S.  Here's another random idea for how we might aggressively test
> the EXT4_IOC_MOVE_EXT ioctl: (1) create an empty filesystem, (2)
> create a tool which randomly sets 50% of the bits in the block
> allocation bitmap, marking them as in use, and making the free space
> look very badly fragmented.  (3) write a large number of files into
> the filesystem.  (4) calculate the checksums for all of the files.
> (5) run e2fsck on the filesystem to fix up the block allocation
> bitmap.  (6) defrag all of the files on the filesystem.  (7) use
> e2fsck to make sure the filesystem is still consistent.  (8) calculate
> the checksums for all of the files to make sure they still contain
> their original data.

Even that does not address issues with files in use during defrag.

That is, defragmenting a MySQL database file while it is in use seems
like an important test case that is missing above.

Also, one issue with repeated testing via the e4defrag tool is that you
only end up moving everything once; in theory, extra passes then
have little to do.

The ohsm project has written a userspace "relocate" tool that calls
the EXT4_IOC_MOVE_EXT ioctl to move files around on the filesystem.

In the absence of any ext4 ohsm kernel patches, the blocks allocated to
the donor file would just come from the normal block allocators.
It should therefore be relatively easy to use the tool to call
EXT4_IOC_MOVE_EXT at random and swap out a file's underlying blocks.
That may be useful in building up a test suite for EXT4_IOC_MOVE_EXT.

In addition, for static files such as you describe above, we plan to
use the 60 GB or so of real world public domain data at
http://edrm.net/activities/projects/dataset as potential well known /
well defined real world data.  That data already has published MD5
values available, so data corruption at any point in the process
should be readily identifiable.

Greg
Akira Fujita March 5, 2010, 8:19 a.m. UTC | #3
(2010/03/04 10:25), tytso@mit.edu wrote:
> On Wed, Mar 03, 2010 at 03:49:29PM +0900, Akira Fujita wrote:
>> ext4: Fix insertion point of extent in mext_insert_across_blocks()
>>
>> From: Akira Fujita <a-fujita@rs.jp.nec.com>
>>
>> If the leaf node has space for two extents or fewer and the
>> EXT4_IOC_MOVE_EXT ioctl is called with a file offset beyond the
>> range that the 2nd extent covers, mext_insert_across_blocks()
>> always tries to insert the new extent at the first extent.
>> As a result, the file gets corrupted because the extents
>> end up in the wrong order.  This patch fixes the problem.
>
> Do you have test cases that we can use as part of a regression test
> suite to test the EXT4_IOC_MOVE_EXT ioctl?  I'm very glad you found
> these problems (although timing --- right before the merge window is
> about to close --- wasn't exactly ideal), but what's more important to
> me is how we get better regression testing.
>
> The other two patches are obviously correct, but this one is going
> to require me to spend a long time staring at the various corner cases
> in order for me to convince myself that it is totally safe.  If we had
> a set of test cases where we could easily verify the "before" and
> "after" file system images as being correct, and then combined it with
> a code coverage tool, it would make it a lot easier to validate future
> patches in fs/ext4/move_extent.c.

Yes, I have some small regression test cases,
but they need to be tidied up before I can release them.
I'll send them to you later; please give me a few days.

> It would be useful for other parts of the kernel as well, but at least
> for the standard extents function we have some fairly aggressive
> generic file system tests, combined with the fact that
> fs/ext4/extents.c gets exercised much more frequently than
> fs/ext4/move_extent.c.
>
> So the question is how we can get to the point where we can
> comfortably tell people that e4defrag is totally safe, and has no
> chance of corrupting their data?

e4defrag does just the following three things:
1. Create a donor file
2. Allocate blocks to the donor with fallocate
3. Exchange blocks between orig and donor with EXT4_IOC_MOVE_EXT

So if we can say that EXT4_IOC_MOVE_EXT is safe, e4defrag is safe as well
(this presumes that fallocate is already of reliable quality, though).
One thing that worries me slightly: if a crash occurs during e4defrag,
the donor file has to be removed by hand.  I think that is hard to manage.

To improve the quality of e4defrag, we need more
people (courageous users) to use it.
For that, at the very least, the open mode fix patch I posted
(http://marc.info/?l=linux-ext4&m=126387585515465&w=2) needs to be merged into e2fsprogs.
Currently users cannot run e4defrag because
the file open mode used in user space does not match what the kernel expects.

>
> P.S.  Here's another random idea for how we might aggressively test
> the EXT4_IOC_MOVE_EXT ioctl: (1) create an empty filesystem, (2)
> create a tool which randomly sets 50% of the bits in the block
> allocation bitmap, marking them as in use, and making the free space
> look very badly fragmented.  (3) write a large number of files into
> the filesystem.  (4) calculate the checksums for all of the files.
> (5) run e2fsck on the filesystem to fix up the block allocation
> bitmap.  (6) defrag all of the files on the filesystem.  (7) use
> e2fsck to make sure the filesystem is still consistent.  (8) calculate
> the checksums for all of the files to make sure they still contain
> their original data.

Sounds interesting.
It looks easy to try, except for (2).
I think we can mark blocks as in use in the block bitmap with debugfs (do_setb).
Do you have a better idea for the tool you mentioned in (2)?

Regards,
Akira Fujita
Theodore Ts'o March 5, 2010, 4:10 p.m. UTC | #4
On Fri, Mar 05, 2010 at 05:19:53PM +0900, Akira Fujita wrote:
> 
> Sounds interesting.
> It looks easy to try, except for (2).
> I think we can mark blocks as in use in the block bitmap with debugfs (do_setb).
> Do you have a better idea for the tool you mentioned in (2)?

The libext2fs library is very powerful.  :-)

Here's a quick program I whipped up.  The code that
actually messes with the block bitmap is in the for loop; the rest is
just generic setup.  Feel free to reuse this program as a framework
for other times when you want to quickly create a program to do
something programmatic where debugfs isn't quite powerful enough for
your needs.

I have considered trying to integrate tcl into debugfs, so that you
could do this kind of thing more easily in debugfs directly, but it's so
easy to write throwaway C programs that I've never bothered.

Regards,

       	 		  	  - Ted

/*
 * fill-bitmap.c --- Program which marks roughly half of the
 *     blocks in the filesystem as being in use.
 *
 * Compile via: cc -o fill-bitmap fill-bitmap.c -lext2fs -lcom_err
 *
 * Copyright 2010 by Theodore Ts'o.
 *
 * %Begin-Header%
 * This file may be redistributed under the terms of the GNU Public
 * License.
 * %End-Header%
 */

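#include <ext2fs/ext2_fs.h>
#include <ext2fs/ext2fs.h>
#include <et/com_err.h>
#include <stdio.h>
#include <stdlib.h>

char *program_name;

static void usage(void)
{
	fprintf(stderr, "Usage: %s device\n", program_name);
	exit(1);
}

int main(int argc, char **argv)
{
	errcode_t retval;
	ext2_filsys fs;
	char *device_name;
	blk_t blk;

	add_error_table(&et_ext2_error_table);
	/* Set program_name first so that usage() can print it. */
	program_name = argv[0];
	if (argc != 2)
		usage();
	device_name = argv[1];

	/* Open read/write so the modified bitmap can be written back. */
	retval = ext2fs_open(device_name, EXT2_FLAG_RW, 0, 0,
			     unix_io_manager, &fs);
	if (retval) {
		com_err(program_name, retval, "while trying to open %s",
			device_name);
		exit(1);
	}

	retval = ext2fs_read_bitmaps(fs);
	if (retval) {
		com_err(program_name, retval, "while reading bitmaps");
		exit(1);
	}

	/*
	 * Walk every block; leave blocks that are already in use alone
	 * and mark roughly half of the free ones as allocated.
	 */
	for (blk = fs->super->s_first_data_block;
	     blk < fs->super->s_blocks_count; blk++) {
		if (ext2fs_test_block_bitmap(fs->block_map, blk))
			continue;

		/* Coin flip: skip about 50% of the free blocks. */
		if (random() & 1)
			continue;

		ext2fs_mark_block_bitmap(fs->block_map, blk);
		/* Keep the free block counts in step with the bitmap. */
		ext2fs_block_alloc_stats(fs, blk, 1);
	}

	/* Closing the filesystem writes out the dirty bitmap and superblock. */
	retval = ext2fs_close(fs);
	if (retval) {
		com_err(program_name, retval, "while closing %s",
			device_name);
		exit(1);
	}
	remove_error_table(&et_ext2_error_table);
	exit(0);
}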
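
To get a feel for how fragmented the free space ends up, the effect of the for loop above can be modelled in memory without touching a real filesystem.  The sketch below is only a model: struct frag_stats and fill_and_measure are names invented for illustration, the "bitmap" is a plain byte array with one byte per block, and rand() is used in place of random() to stay within ISO C.

```c
#include <stddef.h>
#include <stdlib.h>

struct frag_stats {
	size_t free_blocks;	/* blocks still free after the fill */
	size_t free_runs;	/* number of contiguous runs of free blocks */
};

/*
 * Mark roughly half of the free entries in map[] (0 = free, 1 = in use)
 * as in use, the same way the loop above does, then count how many
 * free blocks and free runs remain.
 */
static struct frag_stats fill_and_measure(unsigned char *map, size_t nblocks,
					  unsigned int seed)
{
	struct frag_stats st = { 0, 0 };
	size_t i;

	srand(seed);
	for (i = 0; i < nblocks; i++) {
		if (map[i])		/* already in use: leave alone */
			continue;
		if (rand() & 1)		/* coin flip: keep this one free */
			continue;
		map[i] = 1;
	}

	for (i = 0; i < nblocks; i++) {
		if (map[i])
			continue;
		st.free_blocks++;
		/* A free block starts a new run if its predecessor is in use. */
		if (i == 0 || map[i - 1])
			st.free_runs++;
	}
	return st;
}
```

Dividing free_blocks by free_runs gives the mean length of a free run; for a map that starts out empty this should come out at around two blocks, which is the kind of badly fragmented free space the test idea calls for.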
Akira Fujita March 8, 2010, 8:40 a.m. UTC | #5
Hi Ted,

(2010/03/06 1:10), tytso@mit.edu wrote:
> On Fri, Mar 05, 2010 at 05:19:53PM +0900, Akira Fujita wrote:
>>
>> Sounds interesting.
>> It looks easy to try, except for (2).
>> I think we can mark blocks as in use in the block bitmap with debugfs (do_setb).
>> Do you have a better idea for the tool you mentioned in (2)?
> 
> The libext2fs library is very powerful.  :-)
> 
> Here's a quick program I whipped up.  The code that
> actually messes with the block bitmap is in the for loop; the rest is
> just generic setup.  Feel free to reuse this program as a framework
> for other times when you want to quickly create a program to do
> something programmatic where debugfs isn't quite powerful enough for
> your needs.
> 
> I have considered trying to integrate tcl into debugfs, so that you
> could do this kind of thing more easily in debugfs directly, but it's so
> easy to write throwaway C programs that I've never bothered.
> 

Thank you.
This program will be a great help for my work.
I will give it a shot.

Regards,
Akira Fujita



Patch

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 1654eb8..9eca1c0 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -252,6 +252,7 @@  mext_insert_across_blocks(handle_t *handle, struct inode *orig_inode,
  		}

  		o_start->ee_len = start_ext->ee_len;
+		eblock = le32_to_cpu(start_ext->ee_block);
  		new_flag = 1;

  	} else if (start_ext->ee_len && new_ext->ee_len &&
@@ -262,6 +263,7 @@  mext_insert_across_blocks(handle_t *handle, struct inode *orig_inode,
  		 * orig  |------------------------------|
  		 */
  		o_start->ee_len = start_ext->ee_len;
+		eblock = le32_to_cpu(start_ext->ee_block);
  		new_flag = 1;

  	} else if (!start_ext->ee_len && new_ext->ee_len &&
@@ -502,6 +504,7 @@  mext_leaf_block(handle_t *handle, struct inode *orig_inode,
  		le32_to_cpu(oext->ee_block) + oext_alen) {
  		start_ext.ee_len = cpu_to_le16(le32_to_cpu(new_ext.ee_block) -
  					       le32_to_cpu(oext->ee_block));
+		start_ext.ee_block = oext->ee_block;
  		copy_extent_status(oext, &start_ext);
  	} else if (oext > EXT_FIRST_EXTENT(orig_path[depth].p_hdr)) {
  		prev_ext = oext - 1;
@@ -515,6 +518,7 @@  mext_leaf_block(handle_t *handle, struct inode *orig_inode,
  			start_ext.ee_len = cpu_to_le16(
  				ext4_ext_get_actual_len(prev_ext) +
  				new_ext_alen);
+			start_ext.ee_block = oext->ee_block;
  			copy_extent_status(prev_ext, &start_ext);
  			new_ext.ee_len = 0;
  		}
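
The ordering problem described in the commit message can be illustrated with a toy model of a leaf: an array of extents that must stay sorted by logical start block.  Everything below is a simplified sketch, not the kernel code; struct toy_extent and the toy_* helpers are invented for illustration and only show why always inserting at the first slot breaks the ordering, whereas picking the slot from the extent's logical block keeps it intact.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for an extent; NOT the kernel's struct ext4_extent. */
struct toy_extent {
	uint32_t block;	/* first logical block covered */
	uint16_t len;	/* number of blocks covered */
};

/* Index at which an extent starting at `block` keeps the leaf sorted. */
static size_t toy_insert_pos(const struct toy_extent *leaf, size_t count,
			     uint32_t block)
{
	size_t i = 0;

	while (i < count && leaf[i].block < block)
		i++;
	return i;
}

/* Insert `ext` at index `pos`, shifting later entries up by one slot. */
static void toy_insert_at(struct toy_extent *leaf, size_t *count, size_t cap,
			  size_t pos, struct toy_extent ext)
{
	assert(*count < cap && pos <= *count);
	memmove(&leaf[pos + 1], &leaf[pos],
		(*count - pos) * sizeof(leaf[0]));
	leaf[pos] = ext;
	(*count)++;
}

/* A leaf is consistent only if its extents are ordered and do not overlap. */
static int toy_leaf_is_sorted(const struct toy_extent *leaf, size_t count)
{
	size_t i;

	for (i = 1; i < count; i++)
		if (leaf[i - 1].block + leaf[i - 1].len > leaf[i].block)
			return 0;
	return 1;
}
```

With a leaf holding extents for blocks 0-9 and 10-19, inserting an extent that starts at block 20 at index 0 makes toy_leaf_is_sorted() report a broken leaf, while inserting it at the index returned by toy_insert_pos() does not.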