diff mbox series

[v1] ext4: fix a assertion failure due to ungranted bh dirting

Message ID 20241010025855.2632516-1-liubaolin12138@163.com
State New
Headers show
Series [v1] ext4: fix a assertion failure due to ungranted bh dirting | expand

Commit Message

liubaolin Oct. 10, 2024, 2:58 a.m. UTC
From: Baolin Liu <liubaolin@kylinos.cn>

Since the merge of commit 3910b513fcdf ("ext4: persist the new uptodate
buffers in ext4_journalled_zero_new_buffers"), a new assertion failure
occurred under a old kernel(ext3, data=journal, pagesize=64k) with
corresponding ported patches:
================================================================
Call trace:
  __ext4_handle_dirty_metadata+0x320/0x7e8
  write_end_fn+0x78/0x178
  ext4_journalled_zero_new_buffers+0xd0/0x2c8
  ext4_block_write_begin+0x850/0xc00
  ext4_write_begin+0x334/0xc68
  generic_perform_write+0x1a4/0x380
  ext4_buffered_write_iter+0x180/0x370
  ext4_file_write_iter+0x194/0xfc0
  new_sync_write+0x338/0x4b8
  __vfs_write+0xc4/0xe8
  vfs_write+0x12c/0x3d0
  ksys_write+0xf4/0x230
  sys_write+0x34/0x48
  el0_svc_naked+0x44/0x48
================================================================

which was caused by bh dirting without calling
do_journal_get_write_access().

In the loop for all bhs of a page in ext4_block_write_begin(),
when a err occurred, it will jump out of loop.
But that will leaves some bhs being processed and some not,
which will lead to the asserion failure in calling write_end_fn().

To fixed that, get write access for the rest unprocessed bhs, just
as what write_end_fn do.

Fixes: 3910b513fcdf ("ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers")
Reported-and-tested-by: Zhi Long <longzhi@sangfor.com.cn>
Suggested-by: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
---
 fs/ext4/inode.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

Comments

Jan Kara Oct. 10, 2024, 9:29 a.m. UTC | #1
On Thu 10-10-24 10:58:55, Baolin Liu wrote:
> From: Baolin Liu <liubaolin@kylinos.cn>
> 
> Since the merge of commit 3910b513fcdf ("ext4: persist the new uptodate
> buffers in ext4_journalled_zero_new_buffers"), a new assertion failure
> occurred under a old kernel(ext3, data=journal, pagesize=64k) with
> corresponding ported patches:
...
> which was caused by bh dirting without calling
> do_journal_get_write_access().
> 
> In the loop for all bhs of a page in ext4_block_write_begin(),
> when a err occurred, it will jump out of loop.
> But that will leaves some bhs being processed and some not,
> which will lead to the asserion failure in calling write_end_fn().

Thanks for the patch but I don't understand one thing here: For
ext4_journalled_zero_new_buffers() to call write_end_fn() the buffer must
have buffer_new flag set. That flag can get set only by ext4_get_block()
function when it succeeds in which case we also call
do_journal_get_write_access(). So how is it possible that buffer_new was
set on a buffer on which we didn't call do_journal_get_write_access()? This
indicates there may be some deeper problem hidden. How exactly did you
trigger this problem?

								Honza

> 
> To fixed that, get write access for the rest unprocessed bhs, just
> as what write_end_fn do.
> 
> Fixes: 3910b513fcdf ("ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers")
> Reported-and-tested-by: Zhi Long <longzhi@sangfor.com.cn>
> Suggested-by: Shida Zhang <zhangshida@kylinos.cn>
> Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
> ---
>  fs/ext4/inode.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 54bdd4884fe6..a72f951288e4 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1102,9 +1102,24 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>  			err = -EIO;
>  	}
>  	if (unlikely(err)) {
> -		if (should_journal_data)
> +		if (should_journal_data) {
> +			if (bh != head || !block_start) {
> +				do {
> +					block_end = block_start + bh->b_size;
> +
> +					if (buffer_new(bh))
> +						if (block_end > from && block_start < to)
> +							do_journal_get_write_access(handle,
> +										    inode, bh);
> +
> +					block_start = block_end;
> +					bh = bh->b_this_page;
> +				} while (bh != head);
> +			}
> +
>  			ext4_journalled_zero_new_buffers(handle, inode, folio,
>  							 from, to);
> +		}
>  		else
>  			folio_zero_new_buffers(folio, from, to);
>  	} else if (fscrypt_inode_uses_fs_layer_crypto(inode)) {
> -- 
> 2.39.2
>
liubaolin Oct. 11, 2024, 6:18 a.m. UTC | #2
> Greetings,
> This problem is reproduced by our customer using their own testing tool “run_bug”.
> When I consulted with a client, the testing tool “run_bug” used a variety of background programs to benchmark 
> (including memory pressure, cpu pressure, file cycle manipulation, fsstress Stress testing tool, postmark program,and so on).
> The recurrence probability is relatively low.
> 
> In response to your query, in ext4_block_write_begin, the new state will be clear before get block, 
> and the bh that failed get_block will not be set to new.
> However, when the page size is greater than the block size, a page will contain multiple bh. 
> bh->b_this_page is a circular list for managing all bh on the same page.
> After get_block jumps out of the for loop, then bh->b_this_page is not processed by clear new in the for loop.
> So after calling ext4_journalled_zero_new_buffers,
> The ext4_journalled_zero_new_buffers function will determine all bh of the same page and call write_end_fn if they are in new state,
> get_block returns err's bh->b_this_page and circular list other unhandled bh if the state was previously set to new.
> Because bh not get access, the corresponding transaction is not placed in jh->b_transaction, resulting in a crash.
> 
> Therefore, the patch processing method I submit is to make unprocessed bh determines if it is in new state and get access.
> There is another way to handle the remaining bh clear_buffer_new that is not processed.
> I personally recommend get access this way, the impact is small. 
> Please guide the two processing methods, which one do you recommend?



在 2024/10/10 17:29, Jan Kara 写道:
> On Thu 10-10-24 10:58:55, Baolin Liu wrote:
>> From: Baolin Liu <liubaolin@kylinos.cn>
>>
>> Since the merge of commit 3910b513fcdf ("ext4: persist the new uptodate
>> buffers in ext4_journalled_zero_new_buffers"), a new assertion failure
>> occurred under a old kernel(ext3, data=journal, pagesize=64k) with
>> corresponding ported patches:
> ...
>> which was caused by bh dirting without calling
>> do_journal_get_write_access().
>>
>> In the loop for all bhs of a page in ext4_block_write_begin(),
>> when a err occurred, it will jump out of loop.
>> But that will leaves some bhs being processed and some not,
>> which will lead to the asserion failure in calling write_end_fn().
> 
> Thanks for the patch but I don't understand one thing here: For
> ext4_journalled_zero_new_buffers() to call write_end_fn() the buffer must
> have buffer_new flag set. That flag can get set only by ext4_get_block()
> function when it succeeds in which case we also call
> do_journal_get_write_access(). So how is it possible that buffer_new was
> set on a buffer on which we didn't call do_journal_get_write_access()? This
> indicates there may be some deeper problem hidden. How exactly did you
> trigger this problem?
> 
> 								Honza
> 
>>
>> To fixed that, get write access for the rest unprocessed bhs, just
>> as what write_end_fn do.
>>
>> Fixes: 3910b513fcdf ("ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers")
>> Reported-and-tested-by: Zhi Long <longzhi@sangfor.com.cn>
>> Suggested-by: Shida Zhang <zhangshida@kylinos.cn>
>> Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
>> ---
>>   fs/ext4/inode.c | 17 ++++++++++++++++-
>>   1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 54bdd4884fe6..a72f951288e4 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1102,9 +1102,24 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
>>   			err = -EIO;
>>   	}
>>   	if (unlikely(err)) {
>> -		if (should_journal_data)
>> +		if (should_journal_data) {
>> +			if (bh != head || !block_start) {
>> +				do {
>> +					block_end = block_start + bh->b_size;
>> +
>> +					if (buffer_new(bh))
>> +						if (block_end > from && block_start < to)
>> +							do_journal_get_write_access(handle,
>> +										    inode, bh);
>> +
>> +					block_start = block_end;
>> +					bh = bh->b_this_page;
>> +				} while (bh != head);
>> +			}
>> +
>>   			ext4_journalled_zero_new_buffers(handle, inode, folio,
>>   							 from, to);
>> +		}
>>   		else
>>   			folio_zero_new_buffers(folio, from, to);
>>   	} else if (fscrypt_inode_uses_fs_layer_crypto(inode)) {
>> -- 
>> 2.39.2
>>
liubaolin Oct. 16, 2024, 2:42 a.m. UTC | #3
> Greetings,
> Regarding this issue, 
> I was able to reproduce it quickly by injecting faults via module parameter passing in fsstest while testing simultaneously. 
> And we tested that neither get access nor clear new would reproduce the issue after injecting faults. 
> Could you please take a look at which approach, get access or clear new, is better? 
> 
> The fsstress testing and injection fault command are as follows:
> fsstress_arm -d "/fsstress_dir2/" -l 102400 -n 100 -p 128 -r -S -s 10 -c 
> watch -n 1 "echo  1  > /sys/module/ext4/parameters/inject_fault"
> 
> The injected fault test is modified as follows, 
> where the module parameter inject_fault injects the fault, 
> and the module parameter try_fix selects whether to handle the fault by getting access or clearing the new:
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b231cd437..590f84391 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -50,6 +50,12 @@
>  
>  #define MPAGE_DA_EXTENT_TAIL 0x01
>  
> +static int ext4_inject_fault __read_mostly;
> +module_param_named(inject_fault, ext4_inject_fault, int, 0644);
> +static int ext4_try_fix __read_mostly;
> +module_param_named(try_fix, ext4_try_fix, int, 0644);
> +
> +
>  static void ext4_journalled_zero_new_buffers(handle_t *handle,
>  					     struct page *page,
>  					     unsigned int from, unsigned int to);
> @@ -1065,6 +1071,12 @@ int ext4_block_write_begin(handle_t *handle, struct page *page, loff_t pos, unsi
>  			clear_buffer_new(bh);
>  		if (!buffer_mapped(bh)) {
>  			WARN_ON(bh->b_size != blocksize);
> +			if (unlikely(ext4_inject_fault)) {
> +				ext4_inject_fault = 0;
> +				ext4_warning(inode->i_sb, "XXX inject fault get_block return -ENOSPC\n");
> +				err = -ENOSPC;
> +				break;
> +			}
>  			err = get_block(inode, block, bh, 1);
>  			if (err)
>  				break;
> @@ -1116,10 +1128,31 @@ int ext4_block_write_begin(handle_t *handle, struct page *page, loff_t pos, unsi
>  			err = -EIO;
>  	}
>  	if (unlikely(err))
> -		if (should_journal_data)
> +		if (should_journal_data) {
> +			if(bh != head || !block_start) {
> +				do {
> +					block_end = block_start + bh->b_size;
> +
> +					if (buffer_new(bh))
> +						if (block_end > from && block_start < to) {
> +							if (ext4_try_fix == 1) {
> +								ext4_warning(inode->i_sb, "XXX try fix 1\n");
> +								do_journal_get_write_access(handle,
> +											    bh);
> +							} else if (ext4_try_fix == 2) {
> +								ext4_warning(inode->i_sb, "XXX try fix 2\n");
> +								clear_buffer_new(bh);
> +							}
> +						}
> +
> +					block_start = block_end;
> +					bh = bh->b_this_page;
> +				} while (bh != head);
> +			}
> +
>  			ext4_journalled_zero_new_buffers(handle, page, from,
>  							 to);
> -		else
> +		} else
>  			page_zero_new_buffers(page, from, to);
>  	else if (decrypt)
>  		err = fscrypt_decrypt_page(page->mapping->host, page,


在 2024/10/11 14:18, liubaolin 写道:
>> Greetings,
>> This problem is reproduced by our customer using their own testing 
>> tool “run_bug”.
>> When I consulted with a client, the testing tool “run_bug” used a 
>> variety of background programs to benchmark (including memory 
>> pressure, cpu pressure, file cycle manipulation, fsstress Stress 
>> testing tool, postmark program,and so on).
>> The recurrence probability is relatively low.
>>
>> In response to your query, in ext4_block_write_begin, the new state 
>> will be clear before get block, and the bh that failed get_block will 
>> not be set to new.
>> However, when the page size is greater than the block size, a page 
>> will contain multiple bh. bh->b_this_page is a circular list for 
>> managing all bh on the same page.
>> After get_block jumps out of the for loop, then bh->b_this_page is not 
>> processed by clear new in the for loop.
>> So after calling ext4_journalled_zero_new_buffers,
>> The ext4_journalled_zero_new_buffers function will determine all bh of 
>> the same page and call write_end_fn if they are in new state,
>> get_block returns err's bh->b_this_page and circular list other 
>> unhandled bh if the state was previously set to new.
>> Because bh not get access, the corresponding transaction is not placed 
>> in jh->b_transaction, resulting in a crash.
>>
>> Therefore, the patch processing method I submit is to make unprocessed 
>> bh determines if it is in new state and get access.
>> There is another way to handle the remaining bh clear_buffer_new that 
>> is not processed.
>> I personally recommend get access this way, the impact is small. 
>> Please guide the two processing methods, which one do you recommend?
> 
> 
> 
> 在 2024/10/10 17:29, Jan Kara 写道:
>> On Thu 10-10-24 10:58:55, Baolin Liu wrote:
>>> From: Baolin Liu <liubaolin@kylinos.cn>
>>>
>>> Since the merge of commit 3910b513fcdf ("ext4: persist the new uptodate
>>> buffers in ext4_journalled_zero_new_buffers"), a new assertion failure
>>> occurred under a old kernel(ext3, data=journal, pagesize=64k) with
>>> corresponding ported patches:
>> ...
>>> which was caused by bh dirting without calling
>>> do_journal_get_write_access().
>>>
>>> In the loop for all bhs of a page in ext4_block_write_begin(),
>>> when a err occurred, it will jump out of loop.
>>> But that will leaves some bhs being processed and some not,
>>> which will lead to the asserion failure in calling write_end_fn().
>>
>> Thanks for the patch but I don't understand one thing here: For
>> ext4_journalled_zero_new_buffers() to call write_end_fn() the buffer must
>> have buffer_new flag set. That flag can get set only by ext4_get_block()
>> function when it succeeds in which case we also call
>> do_journal_get_write_access(). So how is it possible that buffer_new was
>> set on a buffer on which we didn't call do_journal_get_write_access()? 
>> This
>> indicates there may be some deeper problem hidden. How exactly did you
>> trigger this problem?
>>
>>                                 Honza
>>
>>>
>>> To fixed that, get write access for the rest unprocessed bhs, just
>>> as what write_end_fn do.
>>>
>>> Fixes: 3910b513fcdf ("ext4: persist the new uptodate buffers in 
>>> ext4_journalled_zero_new_buffers")
>>> Reported-and-tested-by: Zhi Long <longzhi@sangfor.com.cn>
>>> Suggested-by: Shida Zhang <zhangshida@kylinos.cn>
>>> Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
>>> ---
>>>   fs/ext4/inode.c | 17 ++++++++++++++++-
>>>   1 file changed, 16 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>> index 54bdd4884fe6..a72f951288e4 100644
>>> --- a/fs/ext4/inode.c
>>> +++ b/fs/ext4/inode.c
>>> @@ -1102,9 +1102,24 @@ int ext4_block_write_begin(handle_t *handle, 
>>> struct folio *folio,
>>>               err = -EIO;
>>>       }
>>>       if (unlikely(err)) {
>>> -        if (should_journal_data)
>>> +        if (should_journal_data) {
>>> +            if (bh != head || !block_start) {
>>> +                do {
>>> +                    block_end = block_start + bh->b_size;
>>> +
>>> +                    if (buffer_new(bh))
>>> +                        if (block_end > from && block_start < to)
>>> +                            do_journal_get_write_access(handle,
>>> +                                            inode, bh);
>>> +
>>> +                    block_start = block_end;
>>> +                    bh = bh->b_this_page;
>>> +                } while (bh != head);
>>> +            }
>>> +
>>>               ext4_journalled_zero_new_buffers(handle, inode, folio,
>>>                                from, to);
>>> +        }
>>>           else
>>>               folio_zero_new_buffers(folio, from, to);
>>>       } else if (fscrypt_inode_uses_fs_layer_crypto(inode)) {
>>> -- 
>>> 2.39.2
>>>
>
Jan Kara Oct. 16, 2024, 10:33 a.m. UTC | #4
Hello,

On Fri 11-10-24 12:08:58, Baolin Liu wrote:
> Greetings,
> 
> This problem is reproduced by our customer using their own testing tool
> “run_bug”. When I consulted with a client, the testing tool “run_bug”
> used a variety of background programs to benchmark (including memory
> pressure, cpu pressure, file cycle manipulation, fsstress Stress testing
> tool, postmark program,and so on).
> 
> The recurrence probability is relatively low.

OK, thanks for asking!

> In response to your query, in ext4_block_write_begin, the new state will
> be clear before get block, and the bh that failed get_block will not be
> set to new. However, when the page size is greater than the block size, a
> page will contain multiple bh. 

True. I wanted to argue that the buffer_new bit should be either cleared in
ext4_block_write_begin() (in case of error) or in
ext4_journalled_write_end() (in case of success) but actually
ext4_journalled_write_end() misses the clearing. So I think the better
solution is like the attached patch. I'll submit it once testing finishes
but it would be great if you could test that it fixes your problems as
well. Thanks!

								Honza
liubaolin Oct. 16, 2024, 1:38 p.m. UTC | #5
> Hello,
> I reviewed the patch attached in your email. The issue you mentioned about clearing buffer_new(bh) in write_end_fn() is indeed a bug.
> However, this patch does not resolve the crash issue we encountered.
> 
> Let me explain my analysis in detail below.
> The crash occurs in the function jbd2_journal_dirty_metadata().
> 
> ext4_block_write_begin() -> ext4_journalled_zero_new_buffers() -> write_end_fn()
>  -> ext4_dirty_journalled_data() -> ext4_handle_dirty_metadata() -> __ext4_handle_dirty_metadata()
>  -> jbd2_journal_dirty_metadata()
> 
> In the function jbd2_journal_dirty_metadata(), there is the following condition:
> —---------------------------------------------------------------------------------------------------
>         if (data_race(jh->b_transaction != transaction &&
>             jh->b_next_transaction != transaction)) {
>                 spin_lock(&jh->b_state_lock);
>                 J_ASSERT_JH(jh, jh->b_transaction == transaction ||
>                                 jh->b_next_transaction == transaction);
>                 spin_unlock(&jh->b_state_lock);
>         }
> ----------------------------------------------------------------------------------------------------
> By analyzing the vmcore, I found that both jh->b_transaction and jh->b_next_transaction are NULL.
> Through code analysis, I discovered that the __jbd2_journal_file_buffer() function adds the corresponding transaction of bh to jh->b_transaction.
> Normally, this is accessed through do_journal_get_write_access(), which can call __jbd2_journal_file_buffer().
> The detailed function call process is as follows:
> do_journal_get_write_access() -> ext4_journal_get_write_access() -> __ext4_journal_get_write_access()
>  -> jbd2_journal_get_write_access() -> do_get_write_access() -> __jbd2_journal_file_buffer()
>  
> 
> Therefore, resolving the crash issue requires obtaining write access before calling the jbd2_journal_dirty_metadata() function.
> The comment at the definition of the jbd2_journal_dirty_metadata() function also states: 
> 	'The buffer must have previously had jbd2_journal_get_write_access().'
> 
> In the ext4_block_write_begin() function, if get_block() encounters an error, then neither bh->b_this_page nor the subsequent bh calls do_journal_get_write_access().
> If bh->b_this_page and the subsequent bh are in the new state, it will lead to a crash when reaching the jbd2_journal_dirty_metadata() function.
> 	
> So, there are two ways to resolve this crash issue:
> 1、Call do_journal_get_write_access() on bh that is not handled due to get_block() error.
> 	The patch modification is in the attachment 0001-ext4-fix-a-assertion-failure-due-to-ungranted-bh-dir.patch.
> 
> 2、Call clear_buffer_new() on bh that is not handled due to get_block() error.
> 	The patch modification is in the attachment 0001-ext4-fix-a-assertion-failure-due-to-bh-not-clear-new.patch.
> 
> Additionally, I have found a method to quickly reproduce this crash issue.
> For details, please refer to the email I previously sent you: “https://lore.kernel.org/all/bd41c24b-7325-4584-a965-392a32e32c74@163.com/”.
> I have verified that this quick reproduction method works for both solutions to resolve the issue.
> 
> Please continue to consider which method is better to resolve this issue. 
> If you think that using clear_buffer_new() is a better solution, I can resend the patch via git send-mail.



在 2024/10/16 18:33, Jan Kara 写道:
> Hello,
> 
> On Fri 11-10-24 12:08:58, Baolin Liu wrote:
>> Greetings,
>>
>> This problem is reproduced by our customer using their own testing tool
>> “run_bug”. When I consulted with a client, the testing tool “run_bug”
>> used a variety of background programs to benchmark (including memory
>> pressure, cpu pressure, file cycle manipulation, fsstress Stress testing
>> tool, postmark program,and so on).
>>
>> The recurrence probability is relatively low.
> 
> OK, thanks for asking!
> 
>> In response to your query, in ext4_block_write_begin, the new state will
>> be clear before get block, and the bh that failed get_block will not be
>> set to new. However, when the page size is greater than the block size, a
>> page will contain multiple bh.
> 
> True. I wanted to argue that the buffer_new bit should be either cleared in
> ext4_block_write_begin() (in case of error) or in
> ext4_journalled_write_end() (in case of success) but actually
> ext4_journalled_write_end() misses the clearing. So I think the better
> solution is like the attached patch. I'll submit it once testing finishes
> but it would be great if you could test that it fixes your problems as
> well. Thanks!
> 
> 								Honza
From cbed522ccb695681d94ec02940e958fcf77e58cd Mon Sep 17 00:00:00 2001
From: Baolin Liu <liubaolin@kylinos.cn>
Date: Wed, 9 Oct 2024 09:30:34 +0800
Subject: [PATCH] ext4: fix a assertion failure due to bh not clear new

Since the merge of commit 3910b513fcdf ("ext4: persist the new uptodate
buffers in ext4_journalled_zero_new_buffers"), a new assertion failure
occurred under a old kernel(ext3, data=journal, pagesize=64k) with
corresponding ported patches:
================================================================
Call trace:
  __ext4_handle_dirty_metadata+0x320/0x7e8
  write_end_fn+0x78/0x178
  ext4_journalled_zero_new_buffers+0xd0/0x2c8
  ext4_block_write_begin+0x850/0xc00
  ext4_write_begin+0x334/0xc68
  generic_perform_write+0x1a4/0x380
  ext4_buffered_write_iter+0x180/0x370
  ext4_file_write_iter+0x194/0xfc0
  new_sync_write+0x338/0x4b8
  __vfs_write+0xc4/0xe8
  vfs_write+0x12c/0x3d0
  ksys_write+0xf4/0x230
  sys_write+0x34/0x48
  el0_svc_naked+0x44/0x48
================================================================

which was caused by bh dirting without calling clear_buffer_new().

In the loop for all bhs of a page in ext4_block_write_begin(),
when a err occurred, it will jump out of loop.
But that will leaves some bhs being processed and some not,
which will lead to the asserion failure in calling write_end_fn().

To fixed that, clear_buffer_new for the rest unprocessed bhs, just
as what write_end_fn do.

Fixes: 3910b513fcdf ("ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers")
Reported-and-tested-by: Zhi Long <longzhi@sangfor.com.cn>
Suggested-by: Shida Zhang <zhangshida@kylinos.cn>
Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
---
 fs/ext4/inode.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 54bdd4884fe6..26c107e083c6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1102,9 +1102,23 @@ int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 			err = -EIO;
 	}
 	if (unlikely(err)) {
-		if (should_journal_data)
+		if (should_journal_data) {
+			if (bh != head || !block_start) {
+				do {
+					block_end = block_start + bh->b_size;
+
+					if (buffer_new(bh))
+						if (block_end > from && block_start < to)
+							clear_buffer_new(bh);
+
+					block_start = block_end;
+					bh = bh->b_this_page;
+				} while (bh != head);
+			}
+
 			ext4_journalled_zero_new_buffers(handle, inode, folio,
 							 from, to);
+		}
 		else
 			folio_zero_new_buffers(folio, from, to);
 	} else if (fscrypt_inode_uses_fs_layer_crypto(inode)) {
liubaolin Oct. 18, 2024, 1:48 a.m. UTC | #6
> Hello, I am very sorry.
> I did not previously understand the approach of your patch to solve the issue.
> Yesterday, I intentionally injected faults during the quick reproduction test, 
> and indeed, after applying your patch, the crash issue was resolved and did not occur again.
> I finally understood your approach to solving the problem. Please disregard my previous email.
> Thank you for helping me solve this crash issue in a better way.
> I still need to improve my skills in file systems, and I truly appreciate your guidance.



在 2024/10/16 21:38, liubaolin 写道:
>> Hello,
>> I reviewed the patch attached in your email. The issue you mentioned 
>> about clearing buffer_new(bh) in write_end_fn() is indeed a bug.
>> However, this patch does not resolve the crash issue we encountered.
>>
>> Let me explain my analysis in detail below.
>> The crash occurs in the function jbd2_journal_dirty_metadata().
>>
>> ext4_block_write_begin() -> ext4_journalled_zero_new_buffers() -> 
>> write_end_fn()
>>  -> ext4_dirty_journalled_data() -> ext4_handle_dirty_metadata() -> 
>> __ext4_handle_dirty_metadata()
>>  -> jbd2_journal_dirty_metadata()
>>
>> In the function jbd2_journal_dirty_metadata(), there is the following 
>> condition:
>> —---------------------------------------------------------------------------------------------------
>>         if (data_race(jh->b_transaction != transaction &&
>>             jh->b_next_transaction != transaction)) {
>>                 spin_lock(&jh->b_state_lock);
>>                 J_ASSERT_JH(jh, jh->b_transaction == transaction ||
>>                                 jh->b_next_transaction == transaction);
>>                 spin_unlock(&jh->b_state_lock);
>>         }
>> ----------------------------------------------------------------------------------------------------
>> By analyzing the vmcore, I found that both jh->b_transaction and jh- 
>> >b_next_transaction are NULL.
>> Through code analysis, I discovered that the 
>> __jbd2_journal_file_buffer() function adds the corresponding 
>> transaction of bh to jh->b_transaction.
>> Normally, this is accessed through do_journal_get_write_access(), 
>> which can call __jbd2_journal_file_buffer().
>> The detailed function call process is as follows:
>> do_journal_get_write_access() -> ext4_journal_get_write_access() -> 
>> __ext4_journal_get_write_access()
>>  -> jbd2_journal_get_write_access() -> do_get_write_access() -> 
>> __jbd2_journal_file_buffer()
>>
>>
>> Therefore, resolving the crash issue requires obtaining write access 
>> before calling the jbd2_journal_dirty_metadata() function.
>> The comment at the definition of the jbd2_journal_dirty_metadata() 
>> function also states:     'The buffer must have previously had 
>> jbd2_journal_get_write_access().'
>>
>> In the ext4_block_write_begin() function, if get_block() encounters an 
>> error, then neither bh->b_this_page nor the subsequent bh calls 
>> do_journal_get_write_access().
>> If bh->b_this_page and the subsequent bh are in the new state, it will 
>> lead to a crash when reaching the jbd2_journal_dirty_metadata() function.
>>
>> So, there are two ways to resolve this crash issue:
>> 1、Call do_journal_get_write_access() on bh that is not handled due to 
>> get_block() error.
>>     The patch modification is in the attachment 0001-ext4-fix-a- 
>> assertion-failure-due-to-ungranted-bh-dir.patch.
>>
>> 2、Call clear_buffer_new() on bh that is not handled due to 
>> get_block() error.
>>     The patch modification is in the attachment 0001-ext4-fix-a- 
>> assertion-failure-due-to-bh-not-clear-new.patch.
>>
>> Additionally, I have found a method to quickly reproduce this crash 
>> issue.
>> For details, please refer to the email I previously sent you: 
>> “https://lore.kernel.org/all/bd41c24b-7325-4584- 
>> a965-392a32e32c74@163.com/”.
>> I have verified that this quick reproduction method works for both 
>> solutions to resolve the issue.
>>
>> Please continue to consider which method is better to resolve this 
>> issue. If you think that using clear_buffer_new() is a better 
>> solution, I can resend the patch via git send-mail.
> 
> 
> 
> 在 2024/10/16 18:33, Jan Kara 写道:
>> Hello,
>>
>> On Fri 11-10-24 12:08:58, Baolin Liu wrote:
>>> Greetings,
>>>
>>> This problem is reproduced by our customer using their own testing tool
>>> “run_bug”. When I consulted with a client, the testing tool “run_bug”
>>> used a variety of background programs to benchmark (including memory
>>> pressure, cpu pressure, file cycle manipulation, fsstress Stress testing
>>> tool, postmark program,and so on).
>>>
>>> The recurrence probability is relatively low.
>>
>> OK, thanks for asking!
>>
>>> In response to your query, in ext4_block_write_begin, the new state will
>>> be clear before get block, and the bh that failed get_block will not be
>>> set to new. However, when the page size is greater than the block 
>>> size, a
>>> page will contain multiple bh.
>>
>> True. I wanted to argue that the buffer_new bit should be either 
>> cleared in
>> ext4_block_write_begin() (in case of error) or in
>> ext4_journalled_write_end() (in case of success) but actually
>> ext4_journalled_write_end() misses the clearing. So I think the better
>> solution is like the attached patch. I'll submit it once testing finishes
>> but it would be great if you could test that it fixes your problems as
>> well. Thanks!
>>
>>                                 Honza
Jan Kara Oct. 18, 2024, 9:14 a.m. UTC | #7
On Fri 18-10-24 09:48:17, liubaolin wrote:
> > Hello, I am very sorry.
> > I did not previously understand the approach of your patch to solve the issue.
> > Yesterday, I intentionally injected faults during the quick reproduction
> > test, and indeed, after applying your patch, the crash issue was
> > resolved and did not occur again.
> > I finally understood your approach to solving the problem. Please disregard my previous email.
> > Thank you for helping me solve this crash issue in a better way.
> > I still need to improve my skills in file systems, and I truly appreciate your guidance.

Great! Thanks for testing. I'll send the patch for inclusion then.

								Honza

> 在 2024/10/16 21:38, liubaolin 写道:
> > > Hello,
> > > I reviewed the patch attached in your email. The issue you mentioned
> > > about clearing buffer_new(bh) in write_end_fn() is indeed a bug.
> > > However, this patch does not resolve the crash issue we encountered.
> > > 
> > > Let me explain my analysis in detail below.
> > > The crash occurs in the function jbd2_journal_dirty_metadata().
> > > 
> > > ext4_block_write_begin() -> ext4_journalled_zero_new_buffers() ->
> > > write_end_fn()
> > >  -> ext4_dirty_journalled_data() -> ext4_handle_dirty_metadata() ->
> > > __ext4_handle_dirty_metadata()
> > >  -> jbd2_journal_dirty_metadata()
> > > 
> > > In the function jbd2_journal_dirty_metadata(), there is the
> > > following condition:
> > > —---------------------------------------------------------------------------------------------------
> > >         if (data_race(jh->b_transaction != transaction &&
> > >             jh->b_next_transaction != transaction)) {
> > >                 spin_lock(&jh->b_state_lock);
> > >                 J_ASSERT_JH(jh, jh->b_transaction == transaction ||
> > >                                 jh->b_next_transaction == transaction);
> > >                 spin_unlock(&jh->b_state_lock);
> > >         }
> > > ----------------------------------------------------------------------------------------------------
> > > By analyzing the vmcore, I found that both jh->b_transaction and jh-
> > > >b_next_transaction are NULL.
> > > Through code analysis, I discovered that the
> > > __jbd2_journal_file_buffer() function adds the corresponding
> > > transaction of bh to jh->b_transaction.
> > > Normally, this is accessed through do_journal_get_write_access(),
> > > which can call __jbd2_journal_file_buffer().
> > > The detailed function call process is as follows:
> > > do_journal_get_write_access() -> ext4_journal_get_write_access() ->
> > > __ext4_journal_get_write_access()
> > >  -> jbd2_journal_get_write_access() -> do_get_write_access() ->
> > > __jbd2_journal_file_buffer()
> > > 
> > > 
> > > Therefore, resolving the crash issue requires obtaining write access
> > > before calling the jbd2_journal_dirty_metadata() function.
> > > The comment at the definition of the jbd2_journal_dirty_metadata()
> > > function also states:     'The buffer must have previously had
> > > jbd2_journal_get_write_access().'
> > > 
> > > In the ext4_block_write_begin() function, if get_block() encounters
> > > an error, then neither bh->b_this_page nor the subsequent bh calls
> > > do_journal_get_write_access().
> > > If bh->b_this_page and the subsequent bh are in the new state, it
> > > will lead to a crash when reaching the jbd2_journal_dirty_metadata()
> > > function.
> > > 
> > > So, there are two ways to resolve this crash issue:
> > > 1、Call do_journal_get_write_access() on bh that is not handled due
> > > to get_block() error.
> > >     The patch modification is in the attachment 0001-ext4-fix-a-
> > > assertion-failure-due-to-ungranted-bh-dir.patch.
> > > 
> > > 2、Call clear_buffer_new() on bh that is not handled due to
> > > get_block() error.
> > >     The patch modification is in the attachment 0001-ext4-fix-a-
> > > assertion-failure-due-to-bh-not-clear-new.patch.
> > > 
> > > Additionally, I have found a method to quickly reproduce this crash
> > > issue.
> > > For details, please refer to the email I previously sent you:
> > > “https://lore.kernel.org/all/bd41c24b-7325-4584-
> > > a965-392a32e32c74@163.com/”.
> > > I have verified that this quick reproduction method works for both
> > > solutions to resolve the issue.
> > > 
> > > Please continue to consider which method is better to resolve this
> > > issue. If you think that using clear_buffer_new() is a better
> > > solution, I can resend the patch via git send-mail.
> > 
> > 
> > 
> > 在 2024/10/16 18:33, Jan Kara 写道:
> > > Hello,
> > > 
> > > On Fri 11-10-24 12:08:58, Baolin Liu wrote:
> > > > Greetings,
> > > > 
> > > > This problem is reproduced by our customer using their own testing tool
> > > > “run_bug”. When I consulted with a client, the testing tool “run_bug”
> > > > used a variety of background programs to benchmark (including memory
> > > > pressure, cpu pressure, file cycle manipulation, fsstress Stress testing
> > > > tool, postmark program,and so on).
> > > > 
> > > > The recurrence probability is relatively low.
> > > 
> > > OK, thanks for asking!
> > > 
> > > > In response to your query, in ext4_block_write_begin, the new state will
> > > > be clear before get block, and the bh that failed get_block will not be
> > > > set to new. However, when the page size is greater than the
> > > > block size, a
> > > > page will contain multiple bh.
> > > 
> > > True. I wanted to argue that the buffer_new bit should be either
> > > cleared in
> > > ext4_block_write_begin() (in case of error) or in
> > > ext4_journalled_write_end() (in case of success) but actually
> > > ext4_journalled_write_end() misses the clearing. So I think the better
> > > solution is like the attached patch. I'll submit it once testing finishes
> > > but it would be great if you could test that it fixes your problems as
> > > well. Thanks!
> > > 
> > >                                 Honza
>
diff mbox series

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 54bdd4884fe6..a72f951288e4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1102,9 +1102,24 @@  int ext4_block_write_begin(handle_t *handle, struct folio *folio,
 			err = -EIO;
 	}
 	if (unlikely(err)) {
-		if (should_journal_data)
+		if (should_journal_data) {
+			if (bh != head || !block_start) {
+				do {
+					block_end = block_start + bh->b_size;
+
+					if (buffer_new(bh))
+						if (block_end > from && block_start < to)
+							do_journal_get_write_access(handle,
+										    inode, bh);
+
+					block_start = block_end;
+					bh = bh->b_this_page;
+				} while (bh != head);
+			}
+
 			ext4_journalled_zero_new_buffers(handle, inode, folio,
 							 from, to);
+		}
 		else
 			folio_zero_new_buffers(folio, from, to);
 	} else if (fscrypt_inode_uses_fs_layer_crypto(inode)) {