Message ID: 20230824092619.1327976-1-yi.zhang@huaweicloud.com
Series: ext4: more accurate metadata reservation for delalloc mount option
Hello!

On Thu 24-08-23 17:26:03, Zhang Yi wrote:
> The delayed allocation method allocates blocks during page writeback in
> ext4_writepages(), which cannot handle block allocation failures due to
> e.g. ENOSPC if it needs to acquire more extent blocks. To deal with
> this, commit 79f0be8d2e6e ("ext4: Switch to non delalloc mode when we
> are low on free blocks count.") introduced ext4_nonda_switch() to
> convert to non-delalloc mode if the number of free blocks is less than
> 150% of the dirty blocks or below the watermark.

Well, that functionality is there mainly so that we can really allocate all
blocks available in the filesystem. But you are right that since we are not
reserving metadata blocks explicitly anymore, it is questionable whether
we really still need this.

> In the meantime, commit 27dd43854227 ("ext4: introduce reserved space")
> reserves some of the file system space (2% or 4096 clusters, whichever
> is smaller). Both of those solutions can make sure that space is not
> exhausted when mapping delalloc blocks in most cases, but they cannot
> guarantee this works in all cases, which could lead to an infinite loop
> or data loss (please see patch 14 for details).

OK, I agree that in principle there could be problems due to percpu
counter inaccuracy etc. but were you able to reproduce the problem under
at least somewhat realistic conditions? We were discussing making the free
space percpu counters switch to exact counting in case we are running
tight on space to avoid these problems, but it never proved to be a
problem in practice so we never bothered to actually implement it.

> This patch set wants to reserve metadata space more accurately for the
> delalloc mount option. The metadata block reservation is very tricky
> and is also related to the contiguity of physical blocks; an effective
> way is to reserve for the worst case, which means that every data block
> is discontiguous and one data block costs one extent entry. Reserving
> metadata space for the worst case can make sure enough blocks are
> reserved during data writeback, and the unused reservation can be
> released after mapping the data blocks.

Well, as you say, there is a problem with the worst case estimates - either
you *heavily* overestimate the number of needed metadata blocks or the code
to estimate the number of needed metadata blocks is really complex. We used
to have estimates of needed metadata and we ditched that code (in favor of
reserved clusters) exactly because it was complex and suffered from corner
cases that were hard to fix. I haven't quite digested the other patches in
this series to judge which case it is, but it seems to lean on the
"complex" side :).

So I'm somewhat skeptical this complexity is really needed but I can be
convinced :).

> After doing this, add a worker to submit delayed allocations to prevent
> excessive reservations. Finally, we could completely drop the policy of
> switching back to non-delayed allocation.

BTW the worker there in patch 15 seems really pointless. If you do:
queue_work(), flush_work() then you could just directly do the work inline
and get as a bonus more efficiency and proper lockdep tracking of
dependencies. But that's just a minor issue I have noticed.

								Honza
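[Editor's note: the fallback heuristic discussed in this exchange can be sketched as a few lines of userspace C. This is an illustrative model of the check as the cover letter describes it, not the kernel code; the watermark value here is an assumption (the real EXT4_FREECLUSTERS_WATERMARK scales with the number of CPUs), and in the kernel both counters come from approximate percpu reads, which is exactly the inaccuracy Jan refers to.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the ext4_nonda_switch() heuristic from commit
 * 79f0be8d2e6e, as described above: fall back to non-delalloc writes
 * when free clusters drop below 150% of the dirty (reserved) clusters,
 * or below a fixed watermark. The value below is an assumption for
 * illustration only.
 */
#define MODEL_FREECLUSTERS_WATERMARK 1024

static bool nonda_switch(long long free_clusters, long long dirty_clusters)
{
	/* free < 1.5 * dirty, kept in integer arithmetic */
	return 2 * free_clusters < 3 * dirty_clusters ||
	       free_clusters < MODEL_FREECLUSTERS_WATERMARK;
}
```

In the kernel the two inputs are read with percpu_counter_read_positive(), whose per-CPU batching error is what makes the check unreliable right at the ENOSPC boundary.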
Hello! Thanks for your reply.

On 2023/8/30 23:30, Jan Kara wrote:
> Hello!
>
> On Thu 24-08-23 17:26:03, Zhang Yi wrote:
>> The delayed allocation method allocates blocks during page writeback in
>> ext4_writepages(), which cannot handle block allocation failures due to
>> e.g. ENOSPC if it needs to acquire more extent blocks. To deal with
>> this, commit 79f0be8d2e6e ("ext4: Switch to non delalloc mode when we
>> are low on free blocks count.") introduced ext4_nonda_switch() to
>> convert to non-delalloc mode if the number of free blocks is less than
>> 150% of the dirty blocks or below the watermark.
>
> Well, that functionality is there mainly so that we can really allocate
> all blocks available in the filesystem. But you are right that since we
> are not reserving metadata blocks explicitly anymore, it is questionable
> whether we really still need this.
>
>> In the meantime, commit 27dd43854227 ("ext4: introduce reserved space")
>> reserves some of the file system space (2% or 4096 clusters, whichever
>> is smaller). Both of those solutions can make sure that space is not
>> exhausted when mapping delalloc blocks in most cases, but they cannot
>> guarantee this works in all cases, which could lead to an infinite loop
>> or data loss (please see patch 14 for details).
>
> OK, I agree that in principle there could be problems due to percpu
> counter inaccuracy etc. but were you able to reproduce the problem under
> at least somewhat realistic conditions? We were discussing making the
> free space percpu counters switch to exact counting in case we are
> running tight on space to avoid these problems, but it never proved to
> be a problem in practice so we never bothered to actually implement it.
>

Yes, we first caught this problem in our products and we reproduced it
when doing stress tests on a disk with low free space, but the frequency
is very low. After analysis we found the root cause, and Zhihao helped to
write a 100% reproducer, below.

1. Apply 'infinite_loop.diff' from the attachments, which adds debug
   info and delay injection into the ext4 code.
2. Run 'enospc.sh' on a virtual machine with 4 CPUs (important, because
   the CPU count affects EXT4_FREECLUSTERS_WATERMARK and thus the
   reproduction).

After several minutes, the writeback process will loop infinitely, and
other processes which rely on it will hang.

[  304.815575] INFO: task sync:7292 blocked for more than 153 seconds.
[  304.818130]       Not tainted 6.5.0-dirty #578
[  304.819926] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  304.822747] task:sync  state:D  stack:0  pid:7292  ppid:1  flags:0x00004000
[  304.825677] Call Trace:
[  304.826538]  <TASK>
[  304.827307]  __schedule+0x577/0x12b0
[  304.828300]  ? sync_fs_one_sb+0x50/0x50
[  304.829433]  schedule+0x9d/0x1e0
[  304.830451]  wb_wait_for_completion+0x82/0xd0
[  304.831811]  ? cpuacct_css_alloc+0x100/0x100
[  304.833090]  sync_inodes_sb+0xf1/0x440
[  304.834207]  ? sync_fs_one_sb+0x50/0x50
[  304.835304]  sync_inodes_one_sb+0x21/0x30
[  304.836528]  iterate_supers+0xd2/0x180
[  304.837614]  ksys_sync+0x50/0xf0
[  304.838356]  __do_sys_sync+0x12/0x20
[  304.839207]  do_syscall_64+0x68/0xf0
[  304.839964]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

On the contrary, after tweaking the delay injection a little, we could
reproduce the data loss problem easily.

1. Apply 'data_loss.diff' from the attachments.
2. Run 'enospc.sh' like the previous one; we then get the error message
   below.

[   52.226320] EXT4-fs (sda): Delayed block allocation failed for inode 571 at logical offset 8 with max blocks 1 with error 28
[   52.229126] EXT4-fs (sda): This should not happen!! Data will be lost

>> This patch set wants to reserve metadata space more accurately for the
>> delalloc mount option. The metadata block reservation is very tricky
>> and is also related to the contiguity of physical blocks; an effective
>> way is to reserve for the worst case, which means that every data block
>> is discontiguous and one data block costs one extent entry. Reserving
>> metadata space for the worst case can make sure enough blocks are
>> reserved during data writeback, and the unused reservation can be
>> released after mapping the data blocks.
>
> Well, as you say, there is a problem with the worst case estimates -
> either you *heavily* overestimate the number of needed metadata blocks
> or the code to estimate the number of needed metadata blocks is really
> complex. We used to have estimates of needed metadata and we ditched
> that code (in favor of reserved clusters) exactly because it was complex
> and suffered from corner cases that were hard to fix. I haven't quite
> digested the other patches in this series to judge which case it is,
> but it seems to lean on the "complex" side :).
>
> So I'm somewhat skeptical this complexity is really needed but I can be
> convinced :).

I understand your concern. At first I tried to solve this problem with
other, simpler solutions, but failed. I suppose reserving blocks for the
worst case is the only way to cover all cases, and I noticed that xfs
also uses this reservation method, so I learned the implementation from
it, though it's not exactly the same.

Although it's a worst-case reservation, it is not that complicated.
Firstly, the estimate formula is simple: just add the 'extent & node'
blocks calculated from the **total** delayed allocated data blocks and
the remaining btree heights (there is no need to care whether the logical
positions of the extents are contiguous or not; the btree heights can be
merged between the discontiguous extent entries of one inode, which
reduces the overestimate to some extent). Secondly, the method of
reserving metadata blocks is similar to that of reserving data blocks,
just with a different estimate formula. Fortunately, there are already
data reservation helpers like ext4_da_update_reserve_space() and
ext4_da_release_reserve_space(), so it only takes some minor changes to
make this work. BTW, I don't really know the ditched estimation code and
the corner cases you mentioned - what were they like?

Finally, maybe this reservation could bring other benefits in the long
run. For example, after we've done this, maybe we could also reserve
metadata for DIO and buffered IO with dioread_nolock in the future; then
we could drop EXT4_GET_BLOCKS_PRE_IO safely, which currently looks like a
compromise, and maybe we could get some performance improvement by doing
so (I haven't thought about it deeply, just a whim :) ). But that's a
different thing.

>> After doing this, add a worker to submit delayed allocations to
>> prevent excessive reservations. Finally, we could completely drop the
>> policy of switching back to non-delayed allocation.
>
> BTW the worker there in patch 15 seems really pointless. If you do:
> queue_work(), flush_work() then you could just directly do the work
> inline and get as a bonus more efficiency and proper lockdep tracking
> of dependencies. But that's just a minor issue I have noticed.
>

Yes, I added this worker because I want to run the work asynchronously
when s_dirtyclusters_counter runs beyond the watermark; this way the I/O
flow could be smoother. But I haven't implemented that yet because I
don't know whether you like this estimate solution; I can do it if so.

Thanks,
Yi.

--- infinite_loop.diff ---

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 43775a6ca505..6772cbc74224 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2192,6 +2192,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
  * mapped so that it can be written out (and thus forward progress is
  * guaranteed). After mapping we submit all mapped pages for IO.
  */
+#include <linux/delay.h>
 static int mpage_map_and_submit_extent(handle_t *handle,
 				       struct mpage_da_data *mpd,
 				       bool *give_up_on_write)
@@ -2203,6 +2204,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 	int progress = 0;
 	ext4_io_end_t *io_end = mpd->io_submit.io_end;
 	struct ext4_io_end_vec *io_end_vec;
+	static int wait = 0;
 
 	io_end_vec = ext4_alloc_io_end_vec(io_end);
 	if (IS_ERR(io_end_vec))
@@ -2213,6 +2215,26 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 		if (err < 0) {
 			struct super_block *sb = inode->i_sb;
 
+			if (!wait && err == -ENOSPC) {
+				wait = 1;
+				if (!ext4_count_free_clusters(sb)) {
+					/*
+					 * Failed to allocate metadata block,
+					 * will trigger infinite loop and hung.
+					 */
+					pr_err("will hung\n");
+				} else {
+					/*
+					 * Failed to allocate data block, wait
+					 * test.sh to free a block.
+					 */
+					pr_err("wait free\n");
+					msleep(3000);
+					pr_err("after free, now %llu\n",
+						ext4_count_free_clusters(sb));
+				}
+			}
+
 			if (ext4_forced_shutdown(EXT4_SB(sb)) ||
 			    ext4_test_mount_flag(sb, EXT4_MF_FS_ABORTED))
 				goto invalidate_dirty_pages;
@@ -2888,6 +2910,10 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 	/* In case writeback began while the folio was unlocked */
 	folio_wait_stable(folio);
 
+	/* Use task name and DISCARD mount option as delay inject filter. */
+	if (!strcmp(current->comm, "dd") && test_opt(inode->i_sb, DISCARD))
+		msleep(3000);
+
 #ifdef CONFIG_FS_ENCRYPTION
 	ret = ext4_block_write_begin(folio, pos, len, ext4_da_get_block_prep);
 #else
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c94ebf704616..79f4e96b8691 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5715,6 +5715,15 @@ static int ext4_fill_super(struct super_block *sb, struct fs_context *fc)
 	/* Update the s_overhead_clusters if necessary */
 	ext4_update_overhead(sb, false);
 
+	if (!strcmp(sb->s_bdev->bd_disk->disk_name, "sda")) {
+		pr_err("r_blocks %lld s_resv_clusters %llu free %lld dirty %lld EXT4_FREECLUSTERS_WATERMARK %u\n",
+			(ext4_r_blocks_count(sbi->s_es) >> sbi->s_cluster_bits),
+			atomic64_read(&sbi->s_resv_clusters),
+			percpu_counter_read_positive(&sbi->s_freeclusters_counter),
+			percpu_counter_read_positive(&sbi->s_dirtyclusters_counter),
+			EXT4_FREECLUSTERS_WATERMARK);
+	}
 	return 0;
 
 free_sbi:

--- data_loss.diff ---

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 43775a6ca505..11e47a530435 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2192,6 +2192,7 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
  * mapped so that it can be written out (and thus forward progress is
  * guaranteed). After mapping we submit all mapped pages for IO.
  */
+#include <linux/delay.h>
 static int mpage_map_and_submit_extent(handle_t *handle,
 				       struct mpage_da_data *mpd,
 				       bool *give_up_on_write)
@@ -2203,6 +2204,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 	int progress = 0;
 	ext4_io_end_t *io_end = mpd->io_submit.io_end;
 	struct ext4_io_end_vec *io_end_vec;
+	static int wait = 0;
 
 	io_end_vec = ext4_alloc_io_end_vec(io_end);
 	if (IS_ERR(io_end_vec))
@@ -2213,6 +2215,26 @@ static int mpage_map_and_submit_extent(handle_t *handle,
 		if (err < 0) {
 			struct super_block *sb = inode->i_sb;
 
+			if (!wait && err == -ENOSPC) {
+				wait = 1;
+				if (!ext4_count_free_clusters(sb)) {
+					/*
+					 * Failed to allocate data block, wait
+					 * test.sh to free a block.
+					 */
+					pr_err("wait free\n");
+					msleep(3000);
+					pr_err("after free, now %llu\n",
+						ext4_count_free_clusters(sb));
+				} else {
+					/*
+					 * Failed to allocate metadata block,
+					 * will trigger infinite loop and hung.
+					 */
+					pr_err("will hung\n");
+				}
+			}
+
 			if (ext4_forced_shutdown(EXT4_SB(sb)) ||
 			    ext4_test_mount_flag(sb, EXT4_MF_FS_ABORTED))
 				goto invalidate_dirty_pages;
@@ -2888,6 +2910,10 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
 	/* In case writeback began while the folio was unlocked */
 	folio_wait_stable(folio);
 
+	/* Use task name and DISCARD mount option as delay inject filter. */
+	if (!strcmp(current->comm, "dd") && test_opt(inode->i_sb, DISCARD))
+		msleep(3000);
+
 #ifdef CONFIG_FS_ENCRYPTION
 	ret = ext4_block_write_begin(folio, pos, len, ext4_da_get_block_prep);
 #else
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c94ebf704616..79f4e96b8691 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5715,6 +5715,15 @@ static int ext4_fill_super(struct super_block *sb, struct fs_context *fc)
 	/* Update the s_overhead_clusters if necessary */
 	ext4_update_overhead(sb, false);
 
+	if (!strcmp(sb->s_bdev->bd_disk->disk_name, "sda")) {
+		pr_err("r_blocks %lld s_resv_clusters %llu free %lld dirty %lld EXT4_FREECLUSTERS_WATERMARK %u\n",
+			(ext4_r_blocks_count(sbi->s_es) >> sbi->s_cluster_bits),
+			atomic64_read(&sbi->s_resv_clusters),
+			percpu_counter_read_positive(&sbi->s_freeclusters_counter),
+			percpu_counter_read_positive(&sbi->s_dirtyclusters_counter),
+			EXT4_FREECLUSTERS_WATERMARK);
+	}
 	return 0;
 
 free_sbi:

--- enospc.sh ---

#!/bin/bash
sysctl -w kernel.hung_task_timeout_secs=15
umount /root/temp
mkfs.ext4 -F -b 4096 /dev/sda 100M
mount /dev/sda /root/temp
dd if=/dev/zero of=/root/temp/file bs=4K count=1
for i in {0..1100}
do
	touch /root/temp/f_$i
	dd if=/dev/zero of=/root/temp/f_$i bs=4K count=1
	dd if=/dev/zero of=/root/temp/f_$i bs=4K count=1 seek=2
	dd if=/dev/zero of=/root/temp/f_$i bs=4K count=1 seek=4
	dd if=/dev/zero of=/root/temp/f_$i bs=4K count=1 seek=6
done
dd if=/dev/zero of=/root/temp/consumer bs=1M count=68
umount /root/temp
mount -odiscard /dev/sda /root/temp
for i in {0..1100}
do
	dd if=/dev/zero of=/root/temp/f_$i bs=4K count=1 seek=8 &
done
sleep 1
dmesg -c > /dev/null
wait
sync &
sleep 1
while true
do
	res=`dmesg -c`
	if [[ "$res" =~ "wait free" ]]
	then
		echo "delete file"
		rm -f /root/temp/file
		break;
	elif [[ "$res" =~ "will hung" ]]
	then
		echo "will hung"
		break;
	fi
	sleep 1
done
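[Editor's note: the "worst case" sizing debated in this thread can be made concrete with a small userspace model: assume every delalloc cluster is physically discontiguous and costs one extent entry, then count the extent-tree blocks needed at each btree level. This is only an illustration of the shape of such an estimate, not the code in this series; the 340-entries-per-block figure assumes a 4KiB block size (12-byte header, 12-byte entries), and the 4-entry in-inode root matches the on-disk extent tree layout.]

```c
#include <assert.h>

/*
 * Entries that fit in one 4KiB extent-tree block:
 * (4096 - 12-byte header) / 12 bytes per entry = 340.
 * An assumption for illustration; ext4 derives this from the block size.
 */
#define ENTRIES_PER_BLOCK 340

/*
 * Worst-case number of extent-tree blocks needed to map `n` delalloc
 * clusters, assuming every cluster is discontiguous and so costs one
 * extent entry. Sum the blocks needed at each btree level until the
 * remaining entries fit in the in-inode root (which holds 4 entries).
 */
static unsigned long worst_case_extent_blocks(unsigned long n)
{
	unsigned long blocks = 0;

	while (n > 4) {		/* up to 4 entries fit in the inode itself */
		/* blocks needed to hold `n` entries at this level */
		n = (n + ENTRIES_PER_BLOCK - 1) / ENTRIES_PER_BLOCK;
		blocks += n;	/* each block contributes one entry upward */
	}
	return blocks;
}
```

The interesting property, visible in the model, is how slowly the overhead grows: a few thousand fully discontiguous clusters still only cost a handful of tree blocks, which is why a per-inode worst-case reserve merged across extents need not be a heavy overestimate.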
From: Zhang Yi <yi.zhang@huawei.com>

Hello,

The delayed allocation method allocates blocks during page writeback in
ext4_writepages(), which cannot handle block allocation failures due to
e.g. ENOSPC if it needs to acquire more extent blocks. To deal with this,
commit 79f0be8d2e6e ("ext4: Switch to non delalloc mode when we are low
on free blocks count.") introduced ext4_nonda_switch() to convert to
non-delalloc mode if the number of free blocks is less than 150% of the
dirty blocks or below the watermark. In the meantime, commit 27dd43854227
("ext4: introduce reserved space") reserves some of the file system space
(2% or 4096 clusters, whichever is smaller). Both of those solutions can
make sure that space is not exhausted when mapping delalloc blocks in
most cases, but they cannot guarantee this works in all cases, which
could lead to an infinite loop or data loss (please see patch 14 for
details).

This patch set wants to reserve metadata space more accurately for the
delalloc mount option. The metadata block reservation is very tricky and
is also related to the contiguity of physical blocks; an effective way is
to reserve for the worst case, which means that every data block is
discontiguous and one data block costs one extent entry. Reserving
metadata space for the worst case can make sure enough blocks are
reserved during data writeback, and the unused reservation can be
released after mapping the data blocks. After doing this, add a worker to
submit delayed allocations to prevent excessive reservations. Finally, we
could completely drop the policy of switching back to non-delayed
allocation.

The patch set is based on the latest ext4 dev branch.

Patch 1-2:   Fix two reserved data block problems triggered when the
             bigalloc feature is enabled.
Patch 3-6:   Move reserved data block updating from
             ext4_{ext|ind}_map_blocks() to ext4_es_insert_extent(),
             preparing for reserving metadata.
Patch 7-14:  Reserve metadata blocks for delayed allocation as the worst
             case, and update the count after allocating or releasing.
Patch 15-16: In order to prevent too many reserved metadata blocks from
             triggering a false positive out-of-space condition (the
             allocation doesn't take that much after mapping data
             blocks), add a worker to submit IO if the reservation is
             too big.

About tests:
1. This patch set has passed 'kvm-xfstests -g auto' many times.
2. The performance looks not significantly affected after doing the
   following tests on my virtual machine with 4 CPU cores and 32GB
   memory, based on a Kunpeng-920 arm64 CPU and a 1.5TB NVMe SSD.

fio -directory=/test -direct=0 -iodepth=10 -fsync=$sync -rw=$rw \
    -numjobs=${numjobs} -bs=${bs}k -ioengine=libaio -size=10G \
    -ramp_time=10 -runtime=60 -norandommap=0 -group_reporting \
    -name=tests

Disable bigalloc:
                              |      Before      |      After
rw        fsync jobs bs(kB)   | iops   bw(MiB/s) | iops   bw(MiB/s)
------------------------------|------------------|-----------------
write       0    1     4      | 27500   107      | 27100   106
write       0    4     4      | 33900   132      | 35300   138
write       0    1    1024    |   134   135      |   149   150
write       0    4    1024    |   172   174      |   199   200
write       1    1     4      |  1530   6.1      |  1651   6.6
write       1    4     4      |  3139   12.3     |  3131   12.2
write       1    1    1024    |   184   185      |   195   196
write       1    4    1024    |   117   119      |   114   115
randwrite   0    1     4      | 17900   69.7     | 17600   68.9
randwrite   0    4     4      | 32700   128      | 34600   135
randwrite   0    1    1024    |   145   146      |   155   155
randwrite   0    4    1024    |   193   194      |   207   209
randwrite   1    1     4      |  1335   5.3      |  1444   5.7
randwrite   1    4     4      |  3364   13.1     |  3428   13.4
randwrite   1    1    1024    |   180   180      |   171   172
randwrite   1    4    1024    |   132   134      |   141   142

Enable bigalloc:
                              |      Before      |      After
rw        fsync jobs bs(kB)   | iops   bw(MiB/s) | iops   bw(MiB/s)
------------------------------|------------------|-----------------
write       0    1     4      | 27500   107      | 30300   118
write       0    4     4      | 28800   112      | 34000   137
write       0    1    1024    |   141   142      |   162   162
write       0    4    1024    |   172   173      |   195   196
write       1    1     4      |  1410   5.6      |  1302   5.2
write       1    4     4      |  3052   11.9     |  3002   11.7
write       1    1    1024    |   153   153      |   163   164
write       1    4    1024    |   113   114      |   110   111
randwrite   0    1     4      | 17500   68.5     | 18400   72
randwrite   0    4     4      | 26600   104      | 24800   96
randwrite   0    1    1024    |   170   171      |   165   165
randwrite   0    4    1024    |   168   169      |   152   153
randwrite   1    1     4      |  1281   5.1      |  1335   5.3
randwrite   1    4     4      |  3115   12.2     |  3315   12
randwrite   1    1    1024    |   150   150      |   151   152
randwrite   1    4    1024    |   134   135      |   132   133

Tests on ramdisk:

Disable bigalloc:
                              |      Before      |      After
rw        fsync jobs bs(kB)   | iops   bw(MiB/s) | iops   bw(MiB/s)
------------------------------|------------------|-----------------
write       1    1     4      |  4699   18.4     |  4858   18
write       1    1    1024    |   245   246      |   247   248

Enable bigalloc:
                              |      Before      |      After
rw        fsync jobs bs(kB)   | iops   bw(MiB/s) | iops   bw(MiB/s)
------------------------------|------------------|-----------------
write       1    1     4      |  4634   18.1     |  5073   19.8
write       1    1    1024    |   246   247      |   268   269

Thanks,
Yi.

Zhang Yi (16):
  ext4: correct the start block of counting reserved clusters
  ext4: make sure allocate pending entry not fail
  ext4: let __revise_pending() return the number of new inserts pendings
  ext4: count removed reserved blocks for delalloc only es entry
  ext4: pass real delayed status into ext4_es_insert_extent()
  ext4: move delalloc data reserve space updating into
    ext4_es_insert_extent()
  ext4: count inode's total delalloc data blocks into ext4_es_tree
  ext4: refactor delalloc space reservation
  ext4: count reserved metadata blocks for delalloc per inode
  ext4: reserve meta blocks in ext4_da_reserve_space()
  ext4: factor out common part of
    ext4_da_{release|update_reserve}_space()
  ext4: update reserved meta blocks in
    ext4_da_{release|update_reserve}_space()
  ext4: calculate the worst extent blocks needed of a delalloc es entry
  ext4: reserve extent blocks for delalloc
  ext4: flush delalloc blocks if no free space
  ext4: drop ext4_nonda_switch()

 fs/ext4/balloc.c            |  47 ++++-
 fs/ext4/ext4.h              |  14 +-
 fs/ext4/extents.c           |  65 +++----
 fs/ext4/extents_status.c    | 340 +++++++++++++++++++++---------------
 fs/ext4/extents_status.h    |   3 +-
 fs/ext4/indirect.c          |   7 -
 fs/ext4/inode.c             | 191 ++++++++++----------
 fs/ext4/super.c             |  22 ++-
 include/trace/events/ext4.h |  70 ++++++--
 9 files changed, 439 insertions(+), 320 deletions(-)
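[Editor's note: the existing safety net this series wants to improve on, the reserved space added by commit 27dd43854227 ("2% or 4096 clusters, whichever is smaller"), reduces to a one-line calculation. A simplified userspace sketch follows; the real ext4_calculate_resv_clusters() operates on the superblock's cluster count and is only approximated here.]

```c
#include <assert.h>

/*
 * Simplified model of the reserved-cluster sizing from commit
 * 27dd43854227 ("ext4: introduce reserved space"): hold back 2% of the
 * filesystem's clusters, capped at 4096, so that writeback can still
 * allocate metadata blocks when the filesystem is nearly full.
 */
static unsigned long long calc_resv_clusters(unsigned long long clusters)
{
	unsigned long long resv = clusters / 50;	/* 2% */

	return resv < 4096 ? resv : 4096;
}
```

On the 100MB / 4KiB-block test filesystem from the reproducer (25600 clusters) this reserve is only 512 clusters, which is small enough that a burst of worst-case metadata demand plus percpu counter drift can still exhaust it, matching the failure the thread describes.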