Message ID | 20190311090851.29189-1-c17828@cray.com
---|---
State | New
Series | [RFC] don't search large block range if disk is full
Artem, we discussed this patch on the Ext4 concall today.  A couple of
items came up during discussion:
- the patch submission should include performance results to show that
  the patch is providing an improvement
- it would be preferable if the thresholds for the stages were found
  dynamically in the kernel, based on how many groups have been skipped
  and the free chunk size in each group
- there would need to be some way to dynamically reset the scanning
  level when lots of blocks have been freed

Cheers, Andreas

> On Mar 11, 2019, at 03:08, Artem Blagodarenko <artem.blagodarenko@gmail.com> wrote:
>
> The block allocator tries to find:
> 1) a group with a free extent of exactly the requested size
> 2) a group whose average free extent size is at least the requested size
> 3) a group with the required amount of free space
> 4) any group
>
> On a nearly full disk, step 1 fails with high probability but still
> takes a lot of time.
>
> Skip the 1st step if the disk is more than 75% full.
> Skip the 2nd step if the disk is more than 85% full.
> Skip the 3rd step if the disk is more than 95% full.
>
> These three thresholds can be adjusted through the added sysfs interface.
Hello Andreas,

Thank you for the feedback!  I was about to send a new version of this
patch this evening (with test results, but without the in-kernel
decision-maker), but you were faster.

> On 30 May 2019, at 19:56, Andreas Dilger <adilger@dilger.ca> wrote:
>
> Artem, we discussed this patch on the Ext4 concall today.  A couple
> of items came up during discussion:
> - the patch submission should include performance results to
>   show that the patch is providing an improvement
> - it would be preferable if the thresholds for the stages were found
>   dynamically in the kernel based on how many groups have been skipped
>   and the free chunk size in each group
> - there would need to be some way to dynamically reset the scanning
>   level when lots of blocks have been freed
>
> Cheers, Andreas

My suggestion is to split this plan into two phases.

Phase 1 - the loop-skipping code plus a user-space interface that lets
an administrator configure it.
Phase 2 - an in-kernel decision-maker based on group information (and
some other data).

Here are the test results I wanted to add to the new patch version;
adding them here for discussion.

During the test the file system was fragmented with a pattern of
"50 free blocks - 50 occupied blocks".  Write performance degraded from
1.2 GB/s to about 10 MB/s:

68719476736 bytes (69 GB) copied, 6619.02 s, 10.4 MB/s

Let's exclude the c1 loops:

echo "60" > /sys/fs/ext4/md0/mb_c1_threshold

Excluding the c1 loops does not change performance - still 10 MB/s.
The statistics show that 981753 c1 loops were skipped, but 1664192 c1
loops still finished without success:

mballoc: (7829, 1664192, 0) useless c(0,1,2) loops
mballoc: (981753, 0, 0) skipped c(0,1,2) loops

Then both the c1 and c2 loops were disabled:

echo "60" > /sys/fs/ext4/md0/mb_c1_threshold
echo "60" > /sys/fs/ext4/md0/mb_c2_threshold

mballoc: (0, 0, 0) useless c(0,1,2) loops
mballoc: (1425456, 1393743, 0) skipped c(0,1,2) loops

A lot of c1 and c2 loops were skipped.  For the given fragmentation,
write performance returned to ~500 MB/s:

68719476736 bytes (69 GB) copied, 133.066 s, 516 MB/s

This is an example of how to improve performance for one particular
fragmentation pattern.  The patch adds the interfaces needed to tune
the block allocator for any such situation.

Best regards,
Artem Blagodarenko.
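Purely as an illustration of the phase 2 idea above (thresholds derived in
the kernel from how many groups are being scanned without success, with a
reset once enough space has been freed), here is a toy user-space model of
such a feedback loop.  Nothing below is part of the posted patch, and every
name, constant and heuristic in it is invented for the sketch.

/*
 * Toy user-space model of the "dynamic threshold" idea discussed above.
 * All names and numbers here are hypothetical; nothing below exists in ext4.
 */
#include <stdio.h>

struct alloc_feedback {
        unsigned long groups_scanned;   /* groups examined since last reset */
        unsigned long groups_useful;    /* groups that satisfied a request */
        unsigned long blocks_freed;     /* blocks freed since last reset */
        int start_cr;                   /* criterion to start the next scan with */
};

/* Tighten: if most recent scans were useless, start at a weaker criterion. */
static void feedback_on_scan(struct alloc_feedback *fb)
{
        if (fb->groups_scanned < 1024)
                return;                 /* not enough data yet */
        if (fb->groups_useful * 8 < fb->groups_scanned && fb->start_cr < 3)
                fb->start_cr++;
        fb->groups_scanned = fb->groups_useful = 0;
}

/* Relax: once enough space has been freed, try the stricter criteria again. */
static void feedback_on_free(struct alloc_feedback *fb, unsigned long blocks,
                             unsigned long fs_blocks)
{
        fb->blocks_freed += blocks;
        if (fb->blocks_freed * 100 > fs_blocks) {       /* >1% of the fs freed */
                fb->start_cr = 0;
                fb->blocks_freed = 0;
        }
}

int main(void)
{
        struct alloc_feedback fb = { .start_cr = 0 };
        unsigned long fs_blocks = 1UL << 24;    /* pretend 16M-block filesystem */

        /* Simulate 2048 scanned groups of which only 100 were useful. */
        fb.groups_scanned = 2048;
        fb.groups_useful = 100;
        feedback_on_scan(&fb);
        printf("after a run of mostly useless scans, start_cr = %d\n", fb.start_cr);

        /* Simulate freeing 2% of the filesystem. */
        feedback_on_free(&fb, fs_blocks / 50, fs_blocks);
        printf("after a large free, start_cr = %d\n", fb.start_cr);
        return 0;
}

The only point of the model is the shape of the control loop: demote the
starting search criterion while group scans keep failing, and promote it
back once a noticeable fraction of the file system has been freed.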
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 185a05d3257e..fbccb459a296 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1431,6 +1431,9 @@ struct ext4_sb_info {
         unsigned int s_mb_min_to_scan;
         unsigned int s_mb_stats;
         unsigned int s_mb_order2_reqs;
+        unsigned int s_mb_c1_threshold;
+        unsigned int s_mb_c2_threshold;
+        unsigned int s_mb_c3_threshold;
         unsigned int s_mb_group_prealloc;
         unsigned int s_max_dir_size_kb;
         /* where last allocation was done - for stream allocation */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4e6c36ff1d55..85f364aa96c9 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2096,6 +2096,20 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
         return 0;
 }
 
+static u64 available_blocks_count(struct ext4_sb_info *sbi)
+{
+        ext4_fsblk_t resv_blocks;
+        u64 bfree;
+        struct ext4_super_block *es = sbi->s_es;
+
+        resv_blocks = EXT4_C2B(sbi, atomic64_read(&sbi->s_resv_clusters));
+        bfree = percpu_counter_sum_positive(&sbi->s_freeclusters_counter) -
+                percpu_counter_sum_positive(&sbi->s_dirtyclusters_counter);
+
+        bfree = EXT4_C2B(sbi, max_t(s64, bfree, 0));
+        return bfree - (ext4_r_blocks_count(es) + resv_blocks);
+}
+
 static noinline_for_stack int
 ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 {
@@ -2104,10 +2118,13 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
         int err = 0, first_err = 0;
         struct ext4_sb_info *sbi;
         struct super_block *sb;
+        struct ext4_super_block *es;
         struct ext4_buddy e4b;
+        unsigned int free_rate;
 
         sb = ac->ac_sb;
         sbi = EXT4_SB(sb);
+        es = sbi->s_es;
         ngroups = ext4_get_groups_count(sb);
         /* non-extent files are limited to low blocks/groups */
         if (!(ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS)))
@@ -2157,6 +2174,18 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
         /* Let's just scan groups to find more-less suitable blocks */
         cr = ac->ac_2order ? 0 : 1;
+
+        /* Choose what loop to pass based on disk fullness */
+        free_rate = available_blocks_count(sbi) * 100 / ext4_blocks_count(es);
+
+        if (free_rate < sbi->s_mb_c3_threshold) {
+                cr = 3;
+        } else if (free_rate < sbi->s_mb_c2_threshold) {
+                cr = 2;
+        } else if (free_rate < sbi->s_mb_c1_threshold) {
+                cr = 1;
+        }
+
         /*
          * cr == 0 try to get exact allocation,
          * cr == 3 try to get anything
@@ -2618,6 +2647,9 @@ int ext4_mb_init(struct super_block *sb)
         sbi->s_mb_stats = MB_DEFAULT_STATS;
         sbi->s_mb_stream_request = MB_DEFAULT_STREAM_THRESHOLD;
         sbi->s_mb_order2_reqs = MB_DEFAULT_ORDER2_REQS;
+        sbi->s_mb_c1_threshold = MB_DEFAULT_C1_THRESHOLD;
+        sbi->s_mb_c2_threshold = MB_DEFAULT_C2_THRESHOLD;
+        sbi->s_mb_c3_threshold = MB_DEFAULT_C3_THRESHOLD;
         /*
          * The default group preallocation is 512, which for 4k block
          * sizes translates to 2 megabytes.  However for bigalloc file
diff --git a/fs/ext4/mballoc.h b/fs/ext4/mballoc.h
index 88c98f17e3d9..d880923e55a5 100644
--- a/fs/ext4/mballoc.h
+++ b/fs/ext4/mballoc.h
@@ -71,6 +71,9 @@ do { \
  * for which requests use 2^N search using buddies
  */
 #define MB_DEFAULT_ORDER2_REQS          2
+#define MB_DEFAULT_C1_THRESHOLD         25
+#define MB_DEFAULT_C2_THRESHOLD         15
+#define MB_DEFAULT_C3_THRESHOLD         5
 
 /*
  * default group prealloc size 512 blocks
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index 9212a026a1f1..e4f1d98195c2 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -175,6 +175,9 @@ EXT4_RW_ATTR_SBI_UI(mb_stats, s_mb_stats);
 EXT4_RW_ATTR_SBI_UI(mb_max_to_scan, s_mb_max_to_scan);
 EXT4_RW_ATTR_SBI_UI(mb_min_to_scan, s_mb_min_to_scan);
 EXT4_RW_ATTR_SBI_UI(mb_order2_req, s_mb_order2_reqs);
+EXT4_RW_ATTR_SBI_UI(mb_c1_threshold, s_mb_c1_threshold);
+EXT4_RW_ATTR_SBI_UI(mb_c2_threshold, s_mb_c2_threshold);
+EXT4_RW_ATTR_SBI_UI(mb_c3_threshold, s_mb_c3_threshold);
 EXT4_RW_ATTR_SBI_UI(mb_stream_req, s_mb_stream_request);
 EXT4_RW_ATTR_SBI_UI(mb_group_prealloc, s_mb_group_prealloc);
 EXT4_RW_ATTR_SBI_UI(extent_max_zeroout_kb, s_extent_max_zeroout_kb);
@@ -203,6 +206,9 @@ static struct attribute *ext4_attrs[] = {
         ATTR_LIST(mb_max_to_scan),
         ATTR_LIST(mb_min_to_scan),
         ATTR_LIST(mb_order2_req),
+        ATTR_LIST(mb_c1_threshold),
+        ATTR_LIST(mb_c2_threshold),
+        ATTR_LIST(mb_c3_threshold),
         ATTR_LIST(mb_stream_req),
         ATTR_LIST(mb_group_prealloc),
         ATTR_LIST(max_writeback_mb_bump),
The block allocator tries to find:
1) a group with a free extent of exactly the requested size
2) a group whose average free extent size is at least the requested size
3) a group with the required amount of free space
4) any group

On a nearly full disk, step 1 fails with high probability but still
takes a lot of time.

Skip the 1st step if the disk is more than 75% full.
Skip the 2nd step if the disk is more than 85% full.
Skip the 3rd step if the disk is more than 95% full.

These three thresholds can be adjusted through the added sysfs interface.

Signed-off-by: Artem Blagodarenko <c17828@cray.com>
---
 fs/ext4/ext4.h    |  3 +++
 fs/ext4/mballoc.c | 32 ++++++++++++++++++++++++++++++++
 fs/ext4/mballoc.h |  3 +++
 fs/ext4/sysfs.c   |  6 ++++++
 4 files changed, 44 insertions(+)
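For reference, the commit message above speaks in terms of how full the disk
is, while the code compares the percentage of free blocks
(available_blocks_count(sbi) * 100 / ext4_blocks_count(es)) against the
c1/c2/c3 thresholds, whose defaults are 25, 15 and 5.  The following
stand-alone snippet restates that mapping; it is a sketch only, since the
real allocator additionally starts at cr = 1 instead of cr = 0 when the
request is not a power-of-two size.

#include <stdio.h>

/* Mirror of the threshold checks in the patch, using the default values. */
static int starting_cr(unsigned int free_pct)
{
        if (free_pct < 5)       /* disk more than 95% full: take anything */
                return 3;
        if (free_pct < 15)      /* more than 85% full: also skip the average-size search */
                return 2;
        if (free_pct < 25)      /* more than 75% full: skip the exact-size search */
                return 1;
        return 0;               /* otherwise leave the normal starting criterion (0 here) */
}

int main(void)
{
        unsigned int pct[] = { 40, 20, 10, 3 };

        for (unsigned int i = 0; i < sizeof(pct) / sizeof(pct[0]); i++)
                printf("%u%% free -> start at cr=%d\n", pct[i], starting_cr(pct[i]));
        return 0;
}

With the default thresholds, "more than 75% full" therefore means "less than
25% free", and raising a threshold through sysfs, as in the test above,
simply widens the range of fullness for which that search pass is skipped.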