Message ID | 20110620202854.2473133.32514.stgit@localhost.localdomain
---|---
State | New, archived
On Mon, Jun 20, 2011 at 10:28:54PM +0200, Bernd Schubert wrote:
> From: Bernd Schubert <bernd.schubert@fastmail.fm>
>
> changes from v1 -> v2:
> Limit the number of read-ahead blocks as suggested by Andreas.
>
> While creating files in large directories we noticed an endless number
> of 4K reads. And those reads very much reduced file creation numbers
> as shown by bonnie. While we would expect about 2000 creates/s, we
> only got about 25 creates/s. Running the benchmarks for a long time
> improved the numbers, but not above 200 creates/s.
> It turned out those reads came from directory index block reads
> and probably the bh cache never cached all dx blocks. Given by
> the high number of directories we have (8192) and number of files required
> to trigger the issue (16 million), rather probably bh cached dx blocks
> got lost in favour of other less important blocks.
> The patch below implements a read-ahead for *all* dx blocks of a directory
> if a single dx block is missing in the cache. That also helps the LRU
> to cache important dx blocks.

If you have 8192 directories, and about 16 million files, that means you have about 2,000 files per directory. I'll assume that each file name averages 8-12 characters, so you need 24 bytes per directory entry. If we assume that each leaf block is about 2/3rds full, you have about 17 leaf blocks, which means we're only talking about one extent index block per directory. Does that sound about right?

Even if I'm underestimating the number or size of your index blocks, the real problem is that you have a very inefficient system; probably something like 80% or more of the space in your 8192 index blocks (one per directory) is empty. Given that, it's no wonder the index blocks are getting pushed out of memory.
If you reduce the number of directories that you have, say by a factor of 4 so that you only have 2048 directories, you will still only have about one index block per directory, but it will be much fuller, and those index blocks will be hit 4 times more often, which makes it more likely that they stay in memory. It also means that instead of pinning about 32 megabytes of memory for all of your index blocks, you'll only pin about 8 megabytes of memory.

It also makes me wonder why your patch is helping you. If there's only one index block per directory, then there's no readahead to accomplish. So maybe I'm underestimating how many leaf blocks you have in an average directory. But the file names would have to be very, very, VERY large in order to cause us to have more than a single index block.

OK, so what am I missing?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Ted, thanks for looking into it!

On 07/17/2011 01:59 AM, Ted Ts'o wrote:
> On Mon, Jun 20, 2011 at 10:28:54PM +0200, Bernd Schubert wrote:
>> From: Bernd Schubert <bernd.schubert@fastmail.fm>
>>
>> changes from v1 -> v2:
>> Limit the number of read-ahead blocks as suggested by Andreas.
>>
>> While creating files in large directories we noticed an endless number
>> of 4K reads. And those reads very much reduced file creation numbers
>> as shown by bonnie. While we would expect about 2000 creates/s, we
>> only got about 25 creates/s. Running the benchmarks for a long time
>> improved the numbers, but not above 200 creates/s.
>> It turned out those reads came from directory index block reads
>> and probably the bh cache never cached all dx blocks. Given by
>> the high number of directories we have (8192) and number of files required
>> to trigger the issue (16 million), rather probably bh cached dx blocks
>> got lost in favour of other less important blocks.
>> The patch below implements a read-ahead for *all* dx blocks of a directory
>> if a single dx block is missing in the cache. That also helps the LRU
>> to cache important dx blocks.
>
> If you have 8192 directories, and about 16 million files, that means
> you have about 2,000 files per directory. I'll assume that each file
> averages 8-12 characters per file, so you need 24 bytes per directory
> entry. If we assume that each leaf block is about 2/3rds full, you
> have about 17 leaf blocks, which means we're only talking about one
> extent index block per directory. Does that sound about right?

I don't yet understand either why we have so many, but each directory has about 20 to 30 index blocks.

> Even if I'm underestimating the number size of your index blocks, the
> real problem you have a very inefficient system; probably something
> like 80% or more of the space in your 8192 index blocks (one per
> directory) are are empty. Given that, it's no wonder the index blocks
> are getting pushed out of memory.
> If you reduce the number of
> directories that you have, say by a factor of 4 so that you only have
> 2048 directories, you will still only have about one index block per
> directory, but it will be much fuller, and those index blocks will be
> hit 4 times more often, which probably makes them more likely that
> they stay in memory. It also means that instead of pinning about 32
> megabytes of memory for all of your index blocks, you'll only pin
> about 8 megabytes of memory.

For a file system with 16 million files, sure, 8192 hash directories are far too many. However, it also easily might go up to 600 million or more files. Let's assume 70000 to 100000 files per directory as the worst case. With that many files, a hash table size of 8192 is more sane, I think. Of course, that's a clear disadvantage of hash tables - either the size is wrong, or slow re-hashing is required.

> It also makes me wonder why your patch is helping you. If there's
> only one index block per directory, then there's no readahead to
> accomplish. So maybe I'm underestimating how many leaf blocks you
> have in an average directory. But the file names would have to be
> very, very, VERY large in order to cause us to have more than a single
> index block.
>
> OK, so what am I missing?

All files I tested with have a fixed length of 24 characters (16 byte UUID + hostname).

Thanks,
Bernd

PS: Btw, I have a v3 patch series that mostly fixes the 3rd patch ("Map blocks with a single semaphore lock") to be checkpatch.pl clean. I just didn't want to send another revision until we have agreed on the overall concept (didn't want to send patch-spam...).
On Jul 16, 2011, at 9:02 PM, Bernd Schubert wrote:

> I don't understand it either yet why we have so many, but each directory
> has about 20 to 30 index blocks

Hmm..... can you send me the output of ls -ld on one or two of the directories? I want to see how big they are. Also, just as a sanity check, can you send me the following:

debugfs -R "htree /path/to/a/typical/directory" /dev/sdXX | gzip -9 > /tmp/htree_dump.gz

I just want to take a look at the structure of one of these htrees and see if there's anything surprising going on....

-- Ted
> On Jul 16, 2011, at 9:02 PM, Bernd Schubert wrote:
>
> > I don't understand it either yet why we have so many, but each directory
> > has about 20 to 30 index blocks

OK, I think I know what's going on. Those aren't 20-30 index blocks; those are 20-30 leaf blocks. Your directories are approximately 80-120k each, right?

So what your patch is doing is constantly doing readahead to bring the *entire* directory into the buffer cache any time you do a dx_probe. That's definitely not what we would want to enable by default, but I really don't like the idea of adding Yet Another Mount option. It expands our testing effort, and the reality is very few people will take advantage of the mount option.

How about this? What if we don't actually perform readahead, but instead try to look up all of the blocks to see if they are in the buffer cache using sb_find_get_block(). If it is in the buffer cache, it will get touched, so it will be less likely to be evicted from the page cache. So for a workload like yours, it should do what you want. But it won't cause all of the pages to get pulled in after the first reference of the directory in question.

I'm still worried about the case of a very large directory (say an unreaped tmp directory that has grown to be tens of megabytes). If a program does a sequential scan through the directory doing a "readdir+stat" (i.e., for example a tmp cleaner or someone running the command "ls -sF"), we probably shouldn't be trying to keep all of those directory blocks in memory. So if a sequential scan is detected, that should probably suppress the calls to sb_find_get_block().

- Ted
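A minimal sketch of what Ted describes might look like this (a hypothetical helper, not a posted patch; note that sb_find_get_block() takes a physical block number, and a cache lookup bumps the buffer in the LRU without issuing any I/O):

```c
/*
 * Hypothetical sketch of the suggestion above: instead of issuing
 * readahead, only look up dx blocks that are already in the buffer
 * cache. sb_find_get_block() returns NULL without doing I/O when a
 * block is not cached; when it is cached, the lookup itself touches
 * the buffer, making it less likely to be evicted.
 */
static void dx_touch_cached_blocks(struct super_block *sb,
				   const sector_t *blocks, int count)
{
	int i;

	for (i = 0; i < count; i++) {
		struct buffer_head *bh = sb_find_get_block(sb, blocks[i]);

		if (bh)
			brelse(bh);	/* lookup already marked it recently used */
	}
}
```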
Ted, sorry for my late reply and thanks a lot for your help!

On 07/18/2011 02:23 AM, Ted Ts'o wrote:
>> On Jul 16, 2011, at 9:02 PM, Bernd Schubert wrote:
>>
>>> I don't understand it either yet why we have so many, but each directory
>>> has about 20 to 30 index blocks
>
> OK, I think I know what's goign on. Those are 20-30 index blocks;
> those are 20-30 leaf blocks. Your directories are approximately
> 80-120k, each, right?

Yes, you are right. For example:

drwxr-xr-x 2 root root 102400 Jul 18 13:39 FFB

I also uploaded the debugfs htree output to
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/ext4/htree_dump.bz2

> So what your patch is doing is constantly doing readahead to bring the
> *entire* directory into the buffer cache any time you do a dx_probe.
> That's definitely not what we would want to enable by default, but I
> really don't like the idea of adding Yet Another Mount option. It
> expands our testing effort, and the reality is very few people will
> take advantage of the mount option.
>
> How about this? What if we don't actually perform readahead, but
> instead try to look up all of the blocks to see if they are in the
> buffer cache using sb_find_get_block(). If it is in the the buffer

In principle that should be mostly fine. We could read all directories on starting up our application and those pages would be kept in cache then. While our main concern right now is the meta data server, where that patch would help and where we will also change the on-disk layout to entirely work around that issue, the issue also affects storage servers. On those we are not sure if the patch would help, as real data pages might push those directory pages out of the cache.

Also interesting is that the whole issue might easily explain metadata issues I experienced in the past with Lustre on systems with lots of files per OST - Lustre and FhGFS have a rather similar on-disk layout for data files and so should suffer from similar underlying storage issues.
We have already been discussing here for some time whether we could change that layout for FhGFS, but that would bring up other, even more critical problems...

> cache, it will get touched, so it will be less likely to be evicted
> from the page cache. So for a workload like yours, it should do what
> you want. But if won't cause all of the pages to get pulled in after
> the first reference of the directory in question.

I think we would still need to map the ext4 block to the real block for sb_find_get_block(). So what about keeping most of the existing patch, but updating ext4_bread_ra() to:

+/*
+ * Read-ahead blocks
+ */
+int ext4_bread_ra(struct inode *inode, ext4_lblk_t block)
+{
+	struct buffer_head *bh;
+	int err;
+
+	bh = ext4_getblk(NULL, inode, block, 0, &err);
+	if (!bh)
+		return -1;
+
+	if (buffer_uptodate(bh))
+		touch_buffer(bh);	/* patch update here */
+
+	brelse(bh);
+	return 0;
+}
+

> I'm still worried about the case of a very large directory (say an
> unreaped tmp directory that has grown to be tens of megabytes). If a
> program does a sequential scan through the directory doing a
> "readdir+stat" (i.e., for example a tmp cleaner or someone running the
> command ls -sF"), we probably shouldn't be trying to keep all of those
> directory blocks in memory. So if a sequential scan is detected, that
> should probably suppress the calls to sb_find_get_block().

Do you have a suggestion for how to detect that? Set a flag in the dir inode during a readdir, and if that flag is set we won't do the touch_buffer(bh)? So add flags in the struct file and struct inode private data about the readdir and remove those on close of the file?

It would be better if a readdirplus syscall existed, which ls would use and which would set its intent itself... But even if we added that, it would take ages until all user space programs used it.
Thanks,
Bernd
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 3ae9bc9..fad70ea 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -404,6 +404,12 @@ dioread_nolock		locking. If the dioread_nolock option is specified
 i_version		Enable 64-bit inode version support. This option
 			is off by default.
 
+dx_read_ahead		Enables read-ahead of directory index blocks.
+			This option should be enabled if the filesystem has
+			several directories with a high number of files. The
+			disadvantage is that first access to a directory incurs
+			additional reads, which might slow down other operations.
+
 Data Mode
 =========
 There are 3 different data modes:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1921392..997323a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -916,6 +916,8 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_DISCARD		0x40000000 /* Issue DISCARD requests */
 #define EXT4_MOUNT_INIT_INODE_TABLE	0x80000000 /* Initialize uninitialized itables */
 
+#define EXT4_MOUNT2_DX_READ_AHEAD	0x00002 /* Read ahead directory index blocks */
+
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
 						~EXT4_MOUNT_##opt
 #define set_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt |= \
@@ -1802,6 +1804,7 @@
 struct buffer_head *ext4_getblk(handle_t *, struct inode *,
 						ext4_lblk_t, int, int *);
 struct buffer_head *ext4_bread(handle_t *, struct inode *,
 						ext4_lblk_t, int, int *);
+int ext4_bread_ra(struct inode *inode, ext4_lblk_t block);
 int ext4_get_block(struct inode *inode, sector_t iblock,
 			struct buffer_head *bh_result, int create);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a5763e3..938fb6c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1490,6 +1490,9 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
 	return bh;
 }
 
+/*
+ * Synchronous read of blocks
+ */
 struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
 			       ext4_lblk_t block, int create, int *err)
 {
@@ -1500,6 +1503,7 @@ struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
 		return bh;
 	if (buffer_uptodate(bh))
 		return bh;
+
 	ll_rw_block(READ_META, 1, &bh);
 	wait_on_buffer(bh);
 	if (buffer_uptodate(bh))
@@ -1509,6 +1513,30 @@ struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
 	return NULL;
 }
 
+/*
+ * Read-ahead blocks
+ */
+int ext4_bread_ra(struct inode *inode, ext4_lblk_t block)
+{
+	struct buffer_head *bh;
+	int err;
+
+	bh = ext4_getblk(NULL, inode, block, 0, &err);
+	if (!bh)
+		return -1;
+
+	if (buffer_uptodate(bh)) {
+		brelse(bh);
+		return 0;
+	}
+
+	ll_rw_block(READA, 1, &bh);
+
+	brelse(bh);
+	return 0;
+}
+
+
 static int walk_page_buffers(handle_t *handle,
 			     struct buffer_head *head,
 			     unsigned from,
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index bfb749f..9643722 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -49,6 +49,8 @@
 #define NAMEI_RA_SIZE        (NAMEI_RA_CHUNKS * NAMEI_RA_BLOCKS)
 #define NAMEI_RA_INDEX(c,b)  (((c) * NAMEI_RA_BLOCKS) + (b))
 
+#define NAMEI_RA_DX_BLOCKS 32	/* Better use BH_LRU_SIZE? */
+
 static struct buffer_head *ext4_append(handle_t *handle,
 					struct inode *inode,
 					ext4_lblk_t *block, int *err)
@@ -334,6 +336,50 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
 #endif /* DX_DEBUG */
 
 /*
+ * Read ahead directory index blocks
+ */
+static void dx_ra_blocks(struct inode *dir, struct dx_entry *entries,
+			 struct dx_entry *at)
+{
+	int i, err = 0;
+	struct dx_entry *first_ra_entry = entries + 1;
+	unsigned num_entries = dx_get_count(entries) - 1;
+
+	if (num_entries < 2 || num_entries > dx_get_limit(entries)) {
+		dxtrace(printk("dx read-ahead: invalid number of entries:%d\n",
+			       num_entries));
+		return;
+	}
+
+	/* limit read ahead blocks */
+	if (num_entries > NAMEI_RA_DX_BLOCKS) {
+		int min = at - first_ra_entry; /* first_ra_entry + min = at */
+		int max = num_entries - min - 1; /* at + max = last_ra_entry */
+		int half_limit = NAMEI_RA_DX_BLOCKS >> 1;
+
+		min = min(min, half_limit);
+		max = min(max, half_limit);
+
+		first_ra_entry = at - min;
+
+		/* We do not use exactly NAMEI_RA_DX_BLOCKS here, as the logic
+		 * for min and max would be unnecessarily complex */
+		num_entries = min + max;
+	}
+
+	dxtrace(printk("dx read-ahead: %d entries in dir-ino %lu \n",
+		       num_entries, dir->i_ino));
+
+	i = 0;
+	do {
+		struct dx_entry *entry = first_ra_entry + i;
+
+		err = ext4_bread_ra(dir, dx_get_block(entry));
+		i++;
+	} while (i < num_entries && !err);
+}
+
+/*
  * Probe for a directory leaf block to search.
 *
 * dx_probe can return ERR_BAD_DX_DIR, which means there was a format
@@ -347,11 +393,12 @@
 dx_probe(const struct qstr *d_name, struct inode *dir,
 	 struct dx_hash_info *hinfo, struct dx_frame *frame_in, int *err)
 {
 	unsigned count, indirect;
-	struct dx_entry *at, *entries, *p, *q, *m;
+	struct dx_entry *at, *entries, *ra_entries, *p, *q, *m;
 	struct dx_root *root;
 	struct buffer_head *bh;
 	struct dx_frame *frame = frame_in;
 	u32 hash;
+	bool did_ra = false;
 
 	frame->bh = NULL;
 	if (!(bh = ext4_bread (NULL,dir, 0, 0, err)))
@@ -390,7 +437,7 @@ dx_probe(const struct qstr *d_name, struct inode *dir,
 		goto fail;
 	}
 
-	entries = (struct dx_entry *) (((char *)&root->info) +
+	ra_entries = entries = (struct dx_entry *) (((char *)&root->info) +
 					root->info.info_length);
 
 	if (dx_get_limit(entries) != dx_root_limit(dir,
@@ -446,9 +493,27 @@ dx_probe(const struct qstr *d_name, struct inode *dir,
 		frame->bh = bh;
 		frame->entries = entries;
 		frame->at = at;
-		if (!indirect--) return frame;
-		if (!(bh = ext4_bread (NULL,dir, dx_get_block(at), 0, err)))
+
+		if (!did_ra && test_opt2(dir->i_sb, DX_READ_AHEAD)) {
+			/* read-ahead of dx blocks */
+			struct buffer_head *test_bh;
+			ext4_lblk_t block = dx_get_block(at);
+
+			test_bh = ext4_getblk(NULL, dir, block, 0, err);
+			if (test_bh && !buffer_uptodate(test_bh)) {
+				dx_ra_blocks(dir, ra_entries, at);
+				did_ra = true;
+			}
+			brelse(test_bh);
+		}
+
+		if (!indirect--)
+			return frame;
+
+		bh = ext4_bread(NULL, dir, dx_get_block(at), 0, err);
+		if (!bh)
 			goto fail2;
+
 		at = entries = ((struct dx_node *) bh->b_data)->entries;
 		if (dx_get_limit(entries) != dx_node_limit (dir)) {
 			ext4_warning(dir->i_sb,
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index cc5c157..9dd7c05 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1119,6 +1119,9 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_printf(seq, ",init_inode_table=%u",
 			   (unsigned) sbi->s_li_wait_mult);
 
+	if (test_opt2(sb, DX_READ_AHEAD))
+		seq_puts(seq, ",dx_read_ahead");
+
 	ext4_show_quota_options(seq, sb);
 
 	return 0;
@@ -1294,6 +1297,7 @@ enum {
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard,
 	Opt_init_inode_table, Opt_noinit_inode_table,
+	Opt_dx_read_ahead,
 };
 
 static const match_table_t tokens = {
@@ -1369,6 +1373,8 @@ static const match_table_t tokens = {
 	{Opt_init_inode_table, "init_itable=%u"},
 	{Opt_init_inode_table, "init_itable"},
 	{Opt_noinit_inode_table, "noinit_itable"},
+	{Opt_dx_read_ahead, "dx_read_ahead=%u"},
+	{Opt_dx_read_ahead, "dx_read_ahead"},
 	{Opt_err, NULL},
 };
 
@@ -1859,6 +1865,17 @@ set_qf_format:
 	case Opt_noinit_inode_table:
 		clear_opt(sb, INIT_INODE_TABLE);
 		break;
+	case Opt_dx_read_ahead:
+		if (args[0].from) {
+			if (match_int(&args[0], &option))
+				return 0;
+		} else
+			option = 1;	/* No argument, default to 1 */
+		if (option)
+			set_opt2(sb, DX_READ_AHEAD);
+		else
+			clear_opt2(sb, DX_READ_AHEAD);
+		break;
 	default:
 		ext4_msg(sb, KERN_ERR,
 			 "Unrecognized mount option \"%s\" "