
Improve buffered streaming write ordering

Message ID 20081006101605.GA15881@skywalker
State Superseded, archived

Commit Message

Aneesh Kumar K.V Oct. 6, 2008, 10:16 a.m. UTC
On Fri, Oct 03, 2008 at 03:45:55PM -0400, Chris Mason wrote:
> On Fri, 2008-10-03 at 09:43 +1000, Dave Chinner wrote:
> > On Thu, Oct 02, 2008 at 11:48:56PM +0530, Aneesh Kumar K.V wrote:
> > > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote:
> > > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> > > > For a 4.5GB streaming buffered write, this printk inside
> > > > ext4_da_writepage shows up 372,429 times in /var/log/messages.
> > > > 
> > > 
> > > Part of that can happen due to the shrink_page_list -> pageout -> writepage
> > > callback with lots of unallocated buffer_heads (blocks).
> > 
> > Quite frankly, a simple streaming buffered write should *never*
> > trigger writeback from the LRU in memory reclaim.
> 
> The blktrace runs on ext4 didn't show kswapd doing any IO.  It isn't
> clear if this is because ext4 did the redirty trick or if kswapd didn't
> call writepage.
> 
> -chris

This patch actually reduced the number of extents for the test below
from 564 to 171.

$dd if=/dev/zero of=test bs=1M count=1024
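
(The extent count was presumably measured with something like filefrag
from e2fsprogs; the exact command is not shown in the thread:

$ filefrag test
test: 171 extents found)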

The ext4 mballoc block allocator can still be improved to give us
fewer extents. I am looking into this.

What the change below basically does is make sure we only advance
writeback_index after accounting for the pages that were skipped. With
delayed allocation, when we request 100 blocks we just add each of these
blocks to the in-memory extent, and by the time we have built the
contiguous 100-block request the index has already been updated to 100.
We then ask the block allocator for 100 blocks, but the allocator gives
us back only 50. With the current code writeback_index still points to
100; with the change below it points to 50.
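
A minimal sketch of that adjustment, using the same names as the patch
below (the actual change is in the last hunk of the diff):

	/*
	 * Example: 100 dirty pages were queued, but the allocator
	 * returned blocks for only 50 of them, so 50 pages were
	 * skipped and redirtied.  write_cache_pages has already
	 * advanced writeback_index past them; rewind it by the
	 * skipped count so the next cyclic pass resumes at page 50.
	 */
	if (wbc->range_cyclic && (wbc->pages_skipped != pages_skipped))
		mapping->writeback_index -=
			(wbc->pages_skipped - pages_skipped);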


We also force a loop inside ext4_da_writepages with the conditions below
(a rough pseudo-C sketch follows the list):
a) if nr_to_write is zero, break out of the loop;
b) if we are able to allocate blocks but pages were skipped, reset
writeback_index/range_start to their initial values and retry with a new
nr_to_write value. This makes sure we allocate blocks for all the
requested dirty pages. The reason is __fsync_super, which does

sync_inodes_sb(sb, 0);
sync_blockdev(sb->s_bdev);
sync_inodes_sb(sb, 1);

I guess the first call is supposed to have done all the metadata
allocation? So I was forcing the block allocation without even looking
at WB_SYNC_ALL;
c) if we don't have anything left to write, break out of the loop.
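
In pseudo-C the resulting loop looks roughly like this (simplified;
blocks_allocated and nothing_to_write stand in for the checks the real
code derives from mpd and the mpage_da_writepages return value):

	while (1) {
		ret = mpage_da_writepages(mapping, wbc, &mpd);
		if (wbc->nr_to_write == 0)
			break;			/* (a) budget exhausted */
		if (blocks_allocated && wbc->pages_skipped != pages_skipped) {
			/* (b) retry the skipped range with a fresh budget */
			wbc->range_start = range_start;	/* or rewind writeback_index */
			wbc->nr_to_write = to_write +
				(wbc->pages_skipped - pages_skipped);
			wbc->pages_skipped = pages_skipped;
			continue;
		}
		if (nothing_to_write)
			break;			/* (c) */
	}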


Comments

Chris Mason Oct. 6, 2008, 2:21 p.m. UTC | #1
On Mon, 2008-10-06 at 15:46 +0530, Aneesh Kumar K.V wrote:
> On Fri, Oct 03, 2008 at 03:45:55PM -0400, Chris Mason wrote:
> > On Fri, 2008-10-03 at 09:43 +1000, Dave Chinner wrote:
> > > On Thu, Oct 02, 2008 at 11:48:56PM +0530, Aneesh Kumar K.V wrote:
> > > > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote:
> > > > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> > > > > For a 4.5GB streaming buffered write, this printk inside
> > > > > ext4_da_writepage shows up 372,429 times in /var/log/messages.
> > > > > 
> > > > 
> > > > Part of that can happen due to the shrink_page_list -> pageout -> writepage
> > > > callback with lots of unallocated buffer_heads (blocks).
> > > 
> > > Quite frankly, a simple streaming buffered write should *never*
> > > trigger writeback from the LRU in memory reclaim.
> > 
> > The blktrace runs on ext4 didn't show kswapd doing any IO.  It isn't
> > clear if this is because ext4 did the redirty trick or if kswapd didn't
> > call writepage.
> > 
> > -chris
> 
> This patch actually reduced the number of extents for the below test
> from 564 to 171.
> 

For my array, this patch brings the number of ext4 extents down from
over 4000 to 27.  The throughput reported by dd goes up from ~80MB/s to
330MB/s, which means buffered IO is going as fast as O_DIRECT.

Here's the graph:

http://oss.oracle.com/~mason/bugs/writeback_ordering/ext4-aneesh.png

The strange metadata writeback for the uninit block groups is gone.

Looking at the patch, I think the ext4_da_writepages code should just
make its own copy of write_cache_pages.  It's pretty hard to separate
the code that is there for ext4 from the code that is there to make
write_cache_pages do what ext4 expects.

-chris
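
For illustration, an ext4-private iterator along the lines suggested
above might look like the skeleton below. This is hypothetical code,
not something posted in the thread: it is just the pagevec walk that
write_cache_pages already does, trimmed down so the ext4-specific
clustering and retry logic could live in one place.

	/*
	 * Hypothetical helper for fs/ext4/inode.c; assumes the usual
	 * includes (linux/pagevec.h, linux/writeback.h) are present.
	 */
	static int ext4_da_write_cache_pages(struct address_space *mapping,
					     struct writeback_control *wbc,
					     writepage_t writepage, void *data)
	{
		struct pagevec pvec;
		pgoff_t index = mapping->writeback_index;
		int done = 0, ret = 0;

		pagevec_init(&pvec, 0);
		while (!done && wbc->nr_to_write > 0) {
			unsigned i, nr;

			nr = pagevec_lookup_tag(&pvec, mapping, &index,
						PAGECACHE_TAG_DIRTY, PAGEVEC_SIZE);
			if (!nr)
				break;
			for (i = 0; i < nr; i++) {
				struct page *page = pvec.pages[i];

				lock_page(page);
				/* recheck state under the page lock */
				if (page->mapping != mapping ||
				    !clear_page_dirty_for_io(page)) {
					unlock_page(page);
					continue;
				}
				/* the callback is expected to unlock the page */
				ret = (*writepage)(page, wbc, data);
				if (ret || --wbc->nr_to_write <= 0) {
					done = 1;
					break;
				}
			}
			pagevec_release(&pvec);
			cond_resched();
		}
		mapping->writeback_index = index;
		return ret;
	}

(No handling of range_cyclic wrap-around or WB_SYNC_ALL here; a real
version would keep those parts of write_cache_pages as well.)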



Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 21f1d3a..58d010d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2386,14 +2386,16 @@  static int ext4_da_writepages(struct address_space *mapping,
 		wbc->nr_to_write = sbi->s_mb_stream_request;
 	}
 
-	if (!wbc->range_cyclic)
+	if (wbc->range_cyclic) {
+		range_start =  mapping->writeback_index;
+	} else {
 		/*
 		 * If range_cyclic is not set force range_cont
-		 * and save the old writeback_index
+		 * and save the old range_start;
 		 */
 		wbc->range_cont = 1;
-
-	range_start =  wbc->range_start;
+		range_start =  wbc->range_start;
+	}
 	pages_skipped = wbc->pages_skipped;
 
 	mpd.wbc = wbc;
@@ -2440,6 +2442,19 @@  static int ext4_da_writepages(struct address_space *mapping,
 			 */
 			to_write += wbc->nr_to_write;
 			ret = 0;
+			if (pages_skipped != wbc->pages_skipped) {
+				/* writepages skipped some pages */
+				if (wbc->range_cont) {
+					wbc->range_start = range_start;
+				} else {
+					/* range_cyclic */
+					mapping->writeback_index = range_start;
+				}
+				wbc->nr_to_write = to_write +
+					(wbc->pages_skipped - pages_skipped);
+				wbc->pages_skipped = pages_skipped;
+			} else
+				wbc->nr_to_write = to_write;
 		} else if (wbc->nr_to_write) {
 			/*
 			 * There is no more writeout needed
@@ -2449,18 +2464,27 @@  static int ext4_da_writepages(struct address_space *mapping,
 			to_write += wbc->nr_to_write;
 			break;
 		}
-		wbc->nr_to_write = to_write;
 	}
 
 	if (wbc->range_cont && (pages_skipped != wbc->pages_skipped)) {
 		/* We skipped pages in this loop */
 		wbc->range_start = range_start;
 		wbc->nr_to_write = to_write +
-				wbc->pages_skipped - pages_skipped;
+				(wbc->pages_skipped - pages_skipped);
 		wbc->pages_skipped = pages_skipped;
 		goto restart_loop;
 	}
 
+	if (wbc->range_cyclic && (pages_skipped != wbc->pages_skipped)) {
+		/*
+		 * we need to make sure we don't move the
+		 * writeback_index further without looking
+		 * at the pages skipped.
+		 */
+		mapping->writeback_index = mapping->writeback_index -
+					(wbc->pages_skipped - pages_skipped);
+	}
+
 out_writepages:
 	wbc->nr_to_write = to_write - nr_to_writebump;
 	wbc->range_start = range_start;