diff mbox series

[-next] ext4: ensure data forced to disk when rename

Message ID 20220901062657.1192732-1-yebin10@huawei.com
State Rejected
Headers show
Series [-next] ext4: ensure data forced to disk when rename | expand

Commit Message

yebin (H) Sept. 1, 2022, 6:26 a.m. UTC
There is issue that data lost when rename new file. Reason is dioread_nolock is
enabled default now, which map blocks will mark extent unwritten. But inode
size is changed after submit io. Then read file will return zero if extent is
unwritten.
To solve above isuue, wait all data force to disk before rename when enable
dioread_nolock and enable delay allocate.

Signed-off-by: Ye Bin <yebin10@huawei.com>
---
 fs/ext4/ext4.h  |  1 +
 fs/ext4/inode.c | 16 ++++++++++++++++
 fs/ext4/namei.c |  2 +-
 3 files changed, 18 insertions(+), 1 deletion(-)

Comments

Jan Kara Sept. 1, 2022, 8:37 a.m. UTC | #1
On Thu 01-09-22 14:26:57, Ye Bin wrote:
> There is issue that data lost when rename new file. Reason is dioread_nolock is
> enabled default now, which map blocks will mark extent unwritten. But inode
> size is changed after submit io. Then read file will return zero if extent is
> unwritten.
> To solve above isuue, wait all data force to disk before rename when enable
> dioread_nolock and enable delay allocate.
> 
> Signed-off-by: Ye Bin <yebin10@huawei.com>

Well, but it was always like that. We do not guarantee that the data is
securely on disk before rename - userspace is responsible for that if it
needs this guarantee. We just want to make the time window shorter and so
start data writeback because lot of userspace is careless. But waiting for
data writeback before rename will significantly slow down some workloads
and IMHO without a good reason.

								Honza

> ---
>  fs/ext4/ext4.h  |  1 +
>  fs/ext4/inode.c | 16 ++++++++++++++++
>  fs/ext4/namei.c |  2 +-
>  3 files changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 9bca5565547b..111296259b89 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2998,6 +2998,7 @@ extern int ext4_break_layouts(struct inode *);
>  extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length);
>  extern void ext4_set_inode_flags(struct inode *, bool init);
>  extern int ext4_alloc_da_blocks(struct inode *inode);
> +extern void ext4_flush_data(struct inode *inode);
>  extern void ext4_set_aops(struct inode *inode);
>  extern int ext4_writepage_trans_blocks(struct inode *);
>  extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 601214453c3a..4df7ffd3f1b0 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3133,6 +3133,22 @@ int ext4_alloc_da_blocks(struct inode *inode)
>  	return filemap_flush(inode->i_mapping);
>  }
>  
> +void ext4_flush_data(struct inode *inode)
> +{
> +	if (!EXT4_I(inode)->i_reserved_data_blocks)
> +		return;
> +
> +	if (filemap_flush(inode->i_mapping))
> +		return;
> +
> +	if (test_opt(inode->i_sb, DELALLOC) &&
> +	    ext4_should_dioread_nolock(inode) &&
> +	    atomic_read(&EXT4_I(inode)->i_unwritten))
> +		filemap_fdatawait(inode->i_mapping);
> +
> +	return;
> +}
> +
>  /*
>   * bmap() is special.  It gets used by applications such as lilo and by
>   * the swapper to find the on-disk block of a specific piece of data.
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 3a31b662f661..030402fe3b9f 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -3820,7 +3820,7 @@ static int ext4_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
>  		}
>  	}
>  	if (new.inode && !test_opt(new.dir->i_sb, NO_AUTO_DA_ALLOC))
> -		ext4_alloc_da_blocks(old.inode);
> +		ext4_flush_data(old.inode);
>  
>  	credits = (2 * EXT4_DATA_TRANS_BLOCKS(old.dir->i_sb) +
>  		   EXT4_INDEX_EXTRA_TRANS_BLOCKS + 2);
> -- 
> 2.31.1
>
Zhang Yi Sept. 1, 2022, 1:19 p.m. UTC | #2
On 2022/9/1 16:37, Jan Kara wrote:
> On Thu 01-09-22 14:26:57, Ye Bin wrote:
>> There is issue that data lost when rename new file. Reason is dioread_nolock is
>> enabled default now, which map blocks will mark extent unwritten. But inode
>> size is changed after submit io. Then read file will return zero if extent is
>> unwritten.
>> To solve above isuue, wait all data force to disk before rename when enable
>> dioread_nolock and enable delay allocate.
>>
>> Signed-off-by: Ye Bin <yebin10@huawei.com>
> 
> Well, but it was always like that. We do not guarantee that the data is
> securely on disk before rename - userspace is responsible for that if it
> needs this guarantee. We just want to make the time window shorter and so
> start data writeback because lot of userspace is careless. But waiting for
> data writeback before rename will significantly slow down some workloads
> and IMHO without a good reason.
> 

Hi, Jan

Our case is modifing a no-empty file names "data.conf" through vim, it will
cp "data.conf" to ".data.conf.swp" firstly, overwrite rename ".data.conf.swp"
back to "data.conf" and fsync after modifying. If we power down between rename
and fsync, we may lose all data in the original "data.conf" and get a whole
zero file. Before we enable dioread_nolock by defalut, the newly appending data
may lost, but at least we will not lose the original data in the default
data=ordered and auto_da_alloc mode. It seems that dioread_nolock breaks the
guarantee provide by auto_da_alloc. Maybe we should add a fsync before rename
in vim ?

Thanks,
Yi.
Theodore Ts'o Sept. 1, 2022, 2:45 p.m. UTC | #3
On Thu, Sep 01, 2022 at 09:19:04PM +0800, Zhang Yi wrote:
> 
> Our case is modifing a no-empty file names "data.conf" through vim, it will
> cp "data.conf" to ".data.conf.swp" firstly, overwrite rename ".data.conf.swp"
> back to "data.conf" and fsync after modifying. If we power down between rename
> and fsync, we may lose all data in the original "data.conf" and get a whole
> zero file. Before we enable dioread_nolock by defalut, the newly appending data
> may lost, but at least we will not lose the original data in the default
> data=ordered and auto_da_alloc mode. It seems that dioread_nolock breaks the
> guarantee provide by auto_da_alloc. Maybe we should add a fsync before rename
> in vim ?

What I normally recommend is to write the new contents of "data.conf"
to "data.conf.new".  Then fsync the file descriptor corresponding to
data.conf.new.  If you want to backup data.conf, you can create a hard
link of data.conf to data.conf.bak, then rename data.conf.new on top
of data.conf.

The auto_da_alloc mode is a best effort attempt to protect against bad
application programs, and it never was a guarantee that it would
*always* protect against data loss if you had a overwriting rename
racing against a power failure.  Without auto_da_alloc mode, the
window where the user might lose data if they application failed to do
the fsync() was 30 seconds.  With auto_da_alloc (and if dioread_nolock
was disabled), that window was reduced to say, a few hundred
milliseconds.  With dioread_nolock, that window was increased to
somehwere between 0 and 5 seconds (on average, 2.5 seconds) plus the
time for the write to complete (a few hundred milliseconds).

But note that your proposed patch doesn't actually really improve
things all that much.  That's because calling filemap_datawrite() only
waits for the write request to complete.  But you still need to update
the metadata before the blocks become visible after a crash.  That
requires forcing a journal commit, and waiting for that commit to
complete, which is what ext4_sync_file() does, and which is of course,
quite expensive.

We could add something to the end of ext4_convert_unwritten_io_end_vec()
where if the inode is da_alloc eligible, we trigger an asynchronous
journal commit; that is, once we converted all of the unwritten extents,
if the file has been closed after being opened with O_TRUNC, or the file has
been renamed on top of a pre-existing file and there were dirty blocks
that were flushed out when we called ext4_alloc_da_blocks(), that
ext4_convert_unwritten_io_end_vec() would then request that a journal
commit be started.   But we'd want to see what sort of performance regression
this adds, since nothing is for free, and the goal here is to paper over
buggy applications without imposing too much of a performance cost.

But it should be clear that this is *buggy* applications, and
applications SHOULD be calling fsync() before an overwriting rename()
if they care about data not being lost if there is a power failure at
an inconvenient point.   All we are trying to do here is to reduce the
probability of data loss caused by buggy applications.  There was never
any guarantees.

Cheers,

						- Ted
Zhang Yi Sept. 2, 2022, 3:31 a.m. UTC | #4
On 2022/9/1 22:45, Theodore Ts'o wrote:
> On Thu, Sep 01, 2022 at 09:19:04PM +0800, Zhang Yi wrote:
>>
>> Our case is modifing a no-empty file names "data.conf" through vim, it will
>> cp "data.conf" to ".data.conf.swp" firstly, overwrite rename ".data.conf.swp"
>> back to "data.conf" and fsync after modifying. If we power down between rename
>> and fsync, we may lose all data in the original "data.conf" and get a whole
>> zero file. Before we enable dioread_nolock by defalut, the newly appending data
>> may lost, but at least we will not lose the original data in the default
>> data=ordered and auto_da_alloc mode. It seems that dioread_nolock breaks the
>> guarantee provide by auto_da_alloc. Maybe we should add a fsync before rename
>> in vim ?
> > What I normally recommend is to write the new contents of "data.conf"
> to "data.conf.new".  Then fsync the file descriptor corresponding to
> data.conf.new.  If you want to backup data.conf, you can create a hard
> link of data.conf to data.conf.bak, then rename data.conf.new on top
> of data.conf.
> 
> The auto_da_alloc mode is a best effort attempt to protect against bad
> application programs, and it never was a guarantee that it would
> *always* protect against data loss if you had a overwriting rename
> racing against a power failure.  Without auto_da_alloc mode, the
> window where the user might lose data if they application failed to do
> the fsync() was 30 seconds.  With auto_da_alloc (and if dioread_nolock
> was disabled), that window was reduced to say, a few hundred

Thanks for the explanation and suggestions, I totally agree with you that
applications SHOULD be calling fsync() before an overwriting rename() and
the filesystem just do best efforts.

For the above "with auto_da_alloc and no dioread_nolock" case, I'm not sure
I understand right, please correct me if I say wrong. With auto_da_alloc,
data=ordered and no dioread_nolock, it separates into two cases: 1) the new
written contents was appended to an allocated block; 2) the new contents
was written to a new block.

For case 1, we have a few hundred milliseconds window as you said,
because the inode will not be added to transaction->t_inode_list, so
the transaction committing progress does not wait data blocks to
write to complete. For case 2, the inode will be added to
transaction->t_inode_list, we can guarantee that data write to disk
before the rename() completes, so there is no window for this case.

Thanks,
Yi.

> milliseconds.  With dioread_nolock, that window was increased to
> somehwere between 0 and 5 seconds (on average, 2.5 seconds) plus the
> time for the write to complete (a few hundred milliseconds).
> 
> But note that your proposed patch doesn't actually really improve
> things all that much.  That's because calling filemap_datawrite() only
> waits for the write request to complete.  But you still need to update
> the metadata before the blocks become visible after a crash.  That
> requires forcing a journal commit, and waiting for that commit to
> complete, which is what ext4_sync_file() does, and which is of course,
> quite expensive.
> 
> We could add something to the end of ext4_convert_unwritten_io_end_vec()
> where if the inode is da_alloc eligible, we trigger an asynchronous
> journal commit; that is, once we converted all of the unwritten extents,
> if the file has been closed after being opened with O_TRUNC, or the file has
> been renamed on top of a pre-existing file and there were dirty blocks
> that were flushed out when we called ext4_alloc_da_blocks(), that
> ext4_convert_unwritten_io_end_vec() would then request that a journal
> commit be started.   But we'd want to see what sort of performance regression
> this adds, since nothing is for free, and the goal here is to paper over
> buggy applications without imposing too much of a performance cost.
> 
> But it should be clear that this is *buggy* applications, and
> applications SHOULD be calling fsync() before an overwriting rename()
> if they care about data not being lost if there is a power failure at
> an inconvenient point.   All we are trying to do here is to reduce the
> probability of data loss caused by buggy applications.  There was never
> any guarantees.
> 
> Cheers,
> 
> 						- Ted
> 
> .
>
diff mbox series

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9bca5565547b..111296259b89 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2998,6 +2998,7 @@  extern int ext4_break_layouts(struct inode *);
 extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length);
 extern void ext4_set_inode_flags(struct inode *, bool init);
 extern int ext4_alloc_da_blocks(struct inode *inode);
+extern void ext4_flush_data(struct inode *inode);
 extern void ext4_set_aops(struct inode *inode);
 extern int ext4_writepage_trans_blocks(struct inode *);
 extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 601214453c3a..4df7ffd3f1b0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3133,6 +3133,22 @@  int ext4_alloc_da_blocks(struct inode *inode)
 	return filemap_flush(inode->i_mapping);
 }
 
+void ext4_flush_data(struct inode *inode)
+{
+	if (!EXT4_I(inode)->i_reserved_data_blocks)
+		return;
+
+	if (filemap_flush(inode->i_mapping))
+		return;
+
+	if (test_opt(inode->i_sb, DELALLOC) &&
+	    ext4_should_dioread_nolock(inode) &&
+	    atomic_read(&EXT4_I(inode)->i_unwritten))
+		filemap_fdatawait(inode->i_mapping);
+
+	return;
+}
+
 /*
  * bmap() is special.  It gets used by applications such as lilo and by
  * the swapper to find the on-disk block of a specific piece of data.
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 3a31b662f661..030402fe3b9f 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -3820,7 +3820,7 @@  static int ext4_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 		}
 	}
 	if (new.inode && !test_opt(new.dir->i_sb, NO_AUTO_DA_ALLOC))
-		ext4_alloc_da_blocks(old.inode);
+		ext4_flush_data(old.inode);
 
 	credits = (2 * EXT4_DATA_TRANS_BLOCKS(old.dir->i_sb) +
 		   EXT4_INDEX_EXTRA_TRANS_BLOCKS + 2);