Message ID:   E1KhvsB-0002Ik-Ex@closure.thunk.org
State:        Superseded, archived
Delegated to: Theodore Ts'o
On Mon, 22 Sep 2008 20:35:23 -0400 "Theodore Ts'o" <tytso@mit.edu> wrote:
> With modern hard drives, reading 64k takes roughly the same time as
> reading a 4k block.  So request adjacent inode table blocks to reduce
> the time it takes when iterating over directories (especially when doing
> this in htree sort order) in a cold cache case.  With this patch, the
> time it takes to run "git status" on a kernel tree after flushing the
> caches via "echo 3 > /proc/sys/vm/drop_caches" is reduced by 21%.
>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Acked-by: Alan Cox <alan@redhat.com>

I'm actually surprised that 16 is the magic tuning number you've used and
a bigger one isn't even more of a win.

Alan
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sep 23, 2008 10:16 +0100, Alan Cox wrote:
> On Mon, 22 Sep 2008 20:35:23 -0400
> "Theodore Ts'o" <tytso@mit.edu> wrote:
> > With modern hard drives, reading 64k takes roughly the same time as
> > reading a 4k block.  So request adjacent inode table blocks to reduce
> > the time it takes when iterating over directories (especially when doing
> > this in htree sort order) in a cold cache case.  With this patch, the
> > time it takes to run "git status" on a kernel tree after flushing the
> > caches via "echo 3 > /proc/sys/vm/drop_caches" is reduced by 21%.
> >
> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>
> Acked-by: Alan Cox <alan@redhat.com>
>
> I'm actually surprised that 16 is the magic tuning number you've used and
> a bigger one isn't even more of a win

I was going to suggest making this at least a #defined constant instead
of hard coding the values there.  Making it a mount option and/or /proc
value would allow further testing.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Andreas Dilger wrote:
> On Sep 23, 2008 10:16 +0100, Alan Cox wrote:
>> I'm actually surprised that 16 is the magic tuning number you've used and
>> a bigger one isn't even more of a win
>
> I was going to suggest making this at least a #defined constant instead
> of hard coding the values there.  Making it a mount option and/or /proc
> value would allow further testing.

I think that Alan is probably right - the magic number for modern drives
is probably closer to 256K.  Having it be a /sys tunable (with a larger
default) would be a nice way to verify this.

Ric
On Tue, Sep 23, 2008 at 08:18:54AM -0400, Ric Wheeler wrote:
> I think that Alan is probably right - the magic number for modern drives
> is probably closer to 256K.  Having it be a /sys tunable (with a larger
> default) would be a nice way to verify this.

I've played with this a bit, and with the "git status" workload,
increasing the magic number beyond 16 (64k) doesn't actually help,
because the number of inodes we need to touch wasn't big enough.

So I switched to a different workload, which ran "find /path -size 0
-print" with a much larger directory hierarchy.  With that workload I
got the following results:

ra_bits   ra_blocks   ra_kb   seconds   % improvement
   0           1         4      53.3         -
   1           2         8      47.3       11.3%
   2           4        16      41.7       21.8%
   3           8        32      37.5       29.6%
   4          16        64      34.4       35.5%
   5          32       128      32.0       40.0%
   6          64       256      30.7       42.4%
   7         128       512      28.8       46.0%
   8         256      1024      28.3       46.9%
   9         512      2048      27.5       48.4%

Given these numbers, I'm using a default inode_readahead_bits of 5
(i.e., 32 blocks, or 128k for 4k blocksize filesystems).  For a
workload that is 100% stat-based, without any I/O, it is possible to
get better results by using a higher number, yes, but I'm concerned
that a larger readahead may end up interfering with other reads.  We
need to run some other workloads to be sure a larger number won't
cause problems before we go more aggressive on this parameter.

I'll send the revised patch in another message.

						- Ted
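[Editorial note: the first three columns of Ted's table are pure arithmetic -
ra_blocks = 1 << ra_bits, and ra_kb = 4 << ra_bits for a 4k filesystem block
size.  A minimal sketch, with helper names that are illustrative and not from
the thread:]

```c
#include <assert.h>

/* Number of inode table blocks covered by a readahead of 2^ra_bits. */
static unsigned ra_blocks(unsigned ra_bits)
{
	return 1u << ra_bits;
}

/* KiB covered by that readahead, assuming a 4 KiB filesystem block size. */
static unsigned ra_kb(unsigned ra_bits)
{
	return 4u << ra_bits;
}
```

So ra_bits = 5 corresponds to the 32-block, 128 KiB row of the table.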
Theodore Tso wrote:
> On Tue, Sep 23, 2008 at 08:18:54AM -0400, Ric Wheeler wrote:
>> I think that Alan is probably right - the magic number for modern drives
>> is probably closer to 256K.  Having it be a /sys tunable (with a larger
>> default) would be a nice way to verify this.
>
> I've played with this a bit, and with the "git status" workload,
> increasing the magic number beyond 16 (64k) doesn't actually help,
> because the number of inodes we need to touch wasn't big enough.
>
> [ table of results ]
>
> Given these numbers, I'm using a default inode_readahead_bits of 5
> (i.e., 32 blocks, or 128k for 4k blocksize filesystems).  [...]
>
> I'll send the revised patch in another message.

That sounds about right for modern S-ATA/SAS drives.  I would expect that
having this be a tunable knob might help for some types of storage (SSD
might not care, but should be faster in any case?).

ric
On Wed, 2008-09-24 at 09:23 -0400, Ric Wheeler wrote:
> Theodore Tso wrote:
> > On Tue, Sep 23, 2008 at 08:18:54AM -0400, Ric Wheeler wrote:
>
> [ numbers ]
>
> > Given these numbers, I'm using a default inode_readahead_bits of 5
> > (i.e., 32 blocks, or 128k for 4k blocksize filesystems).  For a
> > workload that is 100% stat-based, without any I/O, it is possible to
> > get better results by using a higher number, yes, but I'm concerned
> > that a larger readahead may end up interfering with other reads.  We
> > need to run some other workloads to be sure a larger number won't
> > cause problems before we go more aggressive on this parameter.
>
> That sounds about right for modern S-ATA/SAS drives.  I would expect that
> having this be a tunable knob might help for some types of storage (SSD
> might not care, but should be faster in any case?).

For the test runs being done here, there's a pretty high chance that all
of the inodes you read ahead will get used before the pages are dropped,
so we want to find a balance between those and the worst case workloads
where inode reads are basically random.

One good data point is the completion time for IOs of different sizes.
I used fio to measure the latencies on O_DIRECT random reads of given
sizes on a fast 500GB sata drive.  Here is the output for a 4k run
(I used elevator=noop, but cfq was about the same):

f4k: (groupid=6, jobs=1): err= 0: pid=22877
  read : io=15816KiB, bw=539KiB/s, iops=131, runt= 30004msec
    clat (usec): min=555, max=20909, avg=7581.38, stdev=2475.88
  issued r/w: total=3954/0, short=0/0
     lat (usec): 750=0.03%
     lat (msec): 2=0.03%, 4=7.08%, 10=71.60%, 20=21.24%, 50=0.03%

clat is completion latency, but note fio switches between usec and msec
just to keep us on our toes.  Other important numbers are iops and total
issued ios.  The test limits the run on each IO size to 30 seconds.  The
4k run gets 131 iops, so my sata drive can read 131 inodes/second in a
worst case random workload.

iop rates for the others:

4k      131
8k      130
16k     128
32k     126
64k     121
128k    113
256k    100

A slightly trimmed job output is below, and the fio job file I used is
attached if anyone wants to try this on their own machines.  I'd stick
with either 32k or 64k as the sweet spots, but a tunable is definitely a
good idea.

-chris

f256k: (groupid=0, jobs=1): err= 0: pid=22871
  read : io=770816KiB, bw=26309KiB/s, iops=100, runt= 30001msec
    clat (msec): min=1, max=45, avg= 9.96, stdev= 2.63
  issued r/w: total=3011/0, short=0/0
     lat (msec): 2=0.03%, 10=50.35%, 20=49.58%, 50=0.03%

f128k: (groupid=1, jobs=1): err= 0: pid=22872
  read : io=434560KiB, bw=14830KiB/s, iops=113, runt= 30005msec
    clat (msec): min=1, max=72, avg= 8.83, stdev= 2.82
  issued r/w: total=3395/0, short=0/0
     lat (msec): 2=0.06%, 4=0.62%, 10=63.62%, 20=35.64%, 100=0.06%

f64k: (groupid=2, jobs=1): err= 0: pid=22873
  read : io=233280KiB, bw=7961KiB/s, iops=121, runt= 30006msec
    clat (usec): min=815, max=14931, avg=8225.21, stdev=2471.22
  issued r/w: total=3645/0, short=0/0
     lat (usec): 1000=0.05%
     lat (msec): 4=2.50%, 10=69.11%, 20=28.34%

f32k: (groupid=3, jobs=1): err= 0: pid=22874
  read : io=121472KiB, bw=4144KiB/s, iops=126, runt= 30010msec
    clat (usec): min=715, max=53124, avg=7898.75, stdev=2613.35
  issued r/w: total=3796/0, short=0/0
     lat (usec): 750=0.03%
     lat (msec): 4=4.77%, 10=70.10%, 20=25.08%, 100=0.03%

f16k: (groupid=4, jobs=1): err= 0: pid=22875
  read : io=61584KiB, bw=2101KiB/s, iops=128, runt= 30001msec
    clat (msec): min=1, max=16, avg= 7.79, stdev= 2.46
  issued r/w: total=3849/0, short=0/0

[global]
filename=/dev/sdb
numjobs=1
size=16g
rw=randread
direct=1

[f256k]
bs=256k
runtime=30
stonewall

[f128k]
bs=128k
runtime=30
stonewall

[f64k]
bs=64k
runtime=30
stonewall

[f32k]
bs=32k
runtime=30
stonewall

[f16k]
bs=16k
runtime=30
stonewall

[f8k]
bs=8k
runtime=30
stonewall

[f4k]
bs=4k
runtime=30
stonewall
On Wed, Sep 24, 2008 at 09:23:34AM -0400, Ric Wheeler wrote:
>
> That sounds about right for modern S-ATA/SAS drives.  I would expect that
> having this be a tunable knob might help for some types of storage (SSD
> might not care, but should be faster in any case?).

Well, for SSD's, seek time shouldn't matter; the limiting factor should
be the overhead of the read transaction and the I/O bandwidth of the
SSD.  So prefetching too much might hurt even more for SSD's, although
in comparison with spinning rust platters, it would probably still be
faster.  :-)  So I'm pretty sure that with an SSD we'll want to turn
the tunable down, not up.

On Wed, Sep 24, 2008 at 10:20:34AM -0400, Chris Mason wrote:
> For the test runs being done here, there's a pretty high chance that all
> of the inodes you read ahead will get used before the pages are dropped,
> so we want to find a balance between those and the worst case workloads
> where inode reads are basically random.

Yep, agreed.  On the other hand, if we take your iops figures and
translate them to milliseconds, so we can measure the latency in the
case where the workload is essentially doing random reads, and then
cross-correlate them with my measurements, we get this table:

i/o size   iops   ms latency   % degradation       % improvement of
                               of random inodes    related inodes I/O
  4k       131      7.634
  8k       130      7.692          0.77%                11.3%
 16k       128      7.813          2.34%                21.8%
 32k       126      7.937          3.97%                29.6%
 64k       121      8.264          8.26%                35.5%
128k       113      8.850         15.93%                40.0%
256k       100     10.000         31.00%                42.4%

Depending on whether you believe that workloads involving random inode
reads are more common than related inode I/O, the sweet spot is
probably somewhere between 32k and 128k.  I'm open to opinions
(preferably backed up with more benchmarks of likely workloads) on
whether we should use a default inode_readahead_bits of 4 or 5 (i.e.,
64k, my original guess, or 128k, as in v2 of the patch).

But yes, making it tunable is definitely going to be necessary, since
different workloads (e.g. squid vs. git repositories) will have very
different requirements.

The other thought which comes to mind is whether we should use a
similarly large readahead when we are doing writes vs. reads.  For
example, if we are just updating a single inode, and we are reading a
block to do a read/modify/write cycle, maybe we shouldn't be doing as
much readahead.

						- Ted

P.S.  One caveat is that my numbers were taken from a laptop SATA
drive, and if Chris's were taken from a desktop/server SATA drive the
numbers might not be properly comparable.
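[Editorial note: the middle columns of Ted's table follow mechanically from
Chris's iops figures - average latency in ms is 1000/iops, and the
degradation column is each latency relative to the 4k baseline.  A minimal
sketch of that arithmetic; the helper names are illustrative, not from the
thread:]

```c
#include <math.h>

/* Average completion latency in milliseconds for a given iops rate. */
static double latency_ms(double iops)
{
	return 1000.0 / iops;
}

/*
 * Percent slowdown of a random single-inode read at some I/O size,
 * relative to the 4k baseline rate.
 */
static double degradation_pct(double iops, double baseline_iops)
{
	return (latency_ms(iops) / latency_ms(baseline_iops) - 1.0) * 100.0;
}
```

For example, Chris's 256k rate of 100 iops gives latency_ms(100) = 10.000 ms
and degradation_pct(100, 131) = 31.00%, matching the last row of the table.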
On Sep 24, 2008 16:35 -0400, Theodore Ts'o wrote:
> On the other hand, if we take your iops figures and translate them to
> milliseconds, so we can measure the latency in the case where the
> workload is essentially doing random reads, and then cross-correlate
> them with my measurements, we get this table:

Comparing the incremental benefit of each step (the two trailing columns
are the incremental degradation and improvement for each doubling):

> i/o size   iops   ms latency   % degradation     % improvement
>                                of random inodes  of related inodes I/O
>   4k       131      7.634
>   8k       130      7.692          0.77%   11.3%      1.57%  10.5%
>  16k       128      7.813          2.34%   21.8%      1.63%   7.8%
>  32k       126      7.937          3.97%   29.6%      4.29%   5.9%
>  64k       121      8.264          8.26%   35.5%      7.67%   4.5%
> 128k       113      8.850         15.93%   40.0%     16.07%   2.4%
> 256k       100     10.000         31.00%   42.4%
>
> Depending on whether you believe that workloads involving random inode
> reads are more common than related inode I/O, the sweet spot is
> probably somewhere between 32k and 128k.  I'm open to opinions
> (preferably backed up with more benchmarks of likely workloads) on
> whether we should use a default inode_readahead_bits of 4 or 5 (i.e.,
> 64k, my original guess, or 128k, as in v2 of the patch).  But yes,
> making it tunable is definitely going to be necessary, since different
> workloads (e.g. squid vs. git repositories) will have very different
> requirements.

It looks like moving from 64kB to 128kB readahead might be a loss for
"unknown" workloads, since that increases latency by 7.67% for the
random inode case, but we only get a 4.5% improvement in the sequential
inode case.

Also recall that at large scale "htree" breaks down to random inode
lookup, so that isn't exactly a fringe case (though readahead may still
help if the cache is large enough).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index eed1265..9f6c10e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3833,41 +3833,6 @@ out_stop:
 	ext4_journal_stop(handle);
 }
 
-static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb,
-		unsigned long ino, struct ext4_iloc *iloc)
-{
-	ext4_group_t block_group;
-	unsigned long offset;
-	ext4_fsblk_t block;
-	struct ext4_group_desc *gdp;
-
-	if (!ext4_valid_inum(sb, ino)) {
-		/*
-		 * This error is already checked for in namei.c unless we are
-		 * looking at an NFS filehandle, in which case no error
-		 * report is needed
-		 */
-		return 0;
-	}
-
-	block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
-	gdp = ext4_get_group_desc(sb, block_group, NULL);
-	if (!gdp)
-		return 0;
-
-	/*
-	 * Figure out the offset within the block group inode table
-	 */
-	offset = ((ino - 1) % EXT4_INODES_PER_GROUP(sb)) *
-		EXT4_INODE_SIZE(sb);
-	block = ext4_inode_table(sb, gdp) +
-		(offset >> EXT4_BLOCK_SIZE_BITS(sb));
-
-	iloc->block_group = block_group;
-	iloc->offset = offset & (EXT4_BLOCK_SIZE(sb) - 1);
-	return block;
-}
-
 /*
  * ext4_get_inode_loc returns with an extra refcount against the inode's
  * underlying buffer_head on success. If 'in_mem' is true, we have all
@@ -3877,13 +3842,30 @@ static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb,
 static int __ext4_get_inode_loc(struct inode *inode,
 				struct ext4_iloc *iloc, int in_mem)
 {
+	struct ext4_group_desc *gdp;
 	ext4_fsblk_t block;
 	struct buffer_head *bh;
 
-	block = ext4_get_inode_block(inode->i_sb, inode->i_ino, iloc);
-	if (!block)
+	iloc->bh = 0;
+	if (!ext4_valid_inum(inode->i_sb, inode->i_ino))
 		return -EIO;
 
+	iloc->block_group = (inode->i_ino - 1) /
+		EXT4_INODES_PER_GROUP(inode->i_sb);
+	gdp = ext4_get_group_desc(inode->i_sb, iloc->block_group, NULL);
+	if (!gdp)
+		return -EIO;
+
+	/*
+	 * Figure out the offset within the block group inode table
+	 */
+	iloc->offset = ((inode->i_ino - 1) %
+			EXT4_INODES_PER_GROUP(inode->i_sb)) *
+		EXT4_INODE_SIZE(inode->i_sb);
+	block = ext4_inode_table(inode->i_sb, gdp) +
+		(iloc->offset >> EXT4_BLOCK_SIZE_BITS(inode->i_sb));
+	iloc->offset = iloc->offset & (EXT4_BLOCK_SIZE(inode->i_sb) - 1);
+
 	bh = sb_getblk(inode->i_sb, block);
 	if (!bh) {
 		ext4_error (inode->i_sb, "ext4_get_inode_loc",
@@ -3969,6 +3951,31 @@ static int __ext4_get_inode_loc(struct inode *inode,
 make_io:
 	/*
+	 * If we need to do any I/O, try to readahead up to 16
+	 * blocks from the inode table.
+	 */
+	{
+		ext4_fsblk_t b, end, table;
+		unsigned num;
+
+		table = ext4_inode_table(inode->i_sb, gdp);
+		b = block & ~15;
+		if (table > b)
+			b = table;
+		end = b+16;
+		num = EXT4_INODES_PER_GROUP(inode->i_sb);
+		if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb,
+				EXT4_FEATURE_RO_COMPAT_GDT_CSUM))
+			num -= le16_to_cpu(gdp->bg_itable_unused);
+		table += num / (EXT4_BLOCK_SIZE(inode->i_sb) /
+				EXT4_INODE_SIZE(inode->i_sb));
+		if (end > table)
+			end = table;
+		while (b <= end)
+			sb_breadahead(inode->i_sb, b++);
+	}
+
+	/*
 	 * There are other valid inodes in the buffer, this inode
 	 * has in-inode xattrs, or we don't have this inode in memory.
 	 * Read the block from disk.
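[Editorial note: the clamping arithmetic in the make_io hunk above can be
lifted into plain userspace C to see which window of inode table blocks gets
issued for readahead.  The function and variable names here are mine, for
illustration; the kernel code computes the same bounds from the group
descriptor and then walks them with sb_breadahead():]

```c
#include <stdint.h>

/*
 * Sketch of the patch's readahead-window computation.  Given the inode
 * table block that must be read, the first block of the group's inode
 * table, and how many table blocks the group's used inodes occupy,
 * compute the clamped window [*start, *end) of blocks to read ahead.
 */
static void itable_readahead_window(uint64_t block, uint64_t table,
				    uint64_t table_blocks,
				    uint64_t *start, uint64_t *end)
{
	uint64_t b = block & ~(uint64_t)15;	/* align down to 16 blocks */
	if (table > b)				/* don't start before the table */
		b = table;
	uint64_t e = b + 16;			/* read ahead up to 16 blocks */
	uint64_t limit = table + table_blocks;	/* end of used inode blocks */
	if (e > limit)				/* don't run past used inodes */
		e = limit;
	*start = b;
	*end = e;
}
```

For example, if the inode table starts at block 1000 (not 16-aligned) and the
needed block is 1003, aligning down would give 992, so the window is clamped
to start at 1000.  Note the clamping against bg_itable_unused in the patch is
what keeps GDT_CSUM filesystems from prefetching never-initialized blocks.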
With modern hard drives, reading 64k takes roughly the same time as
reading a 4k block.  So request adjacent inode table blocks to reduce
the time it takes when iterating over directories (especially when doing
this in htree sort order) in a cold cache case.  With this patch, the
time it takes to run "git status" on a kernel tree after flushing the
caches via "echo 3 > /proc/sys/vm/drop_caches" is reduced by 21%.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
Note: this patch could work for ext3 as well.  Anyone see any downsides?
A 20% improvement seems too easy...  I guess it does increase what we
hold in the buffer cache by a small amount, but once the inodes are
cached, we'll stop needing to do any I/O, and we only try to do the
readahead when we know that we need to do some I/O anyway.

This patch also eliminates ext4_get_inode_block(); since it was a static
function called only once, and we needed access to the block group
descriptor, it simplified the code to move its logic into
__ext4_get_inode_loc().  The interesting bits are in the very last hunk
of the patch.