Message ID | 20220901062657.1192732-1-yebin10@huawei.com |
---|---|
State | Rejected |
Headers | show |
Series | [-next] ext4: ensure data forced to disk when rename | expand |
On Thu 01-09-22 14:26:57, Ye Bin wrote: > There is issue that data lost when rename new file. Reason is dioread_nolock is > enabled default now, which map blocks will mark extent unwritten. But inode > size is changed after submit io. Then read file will return zero if extent is > unwritten. > To solve above isuue, wait all data force to disk before rename when enable > dioread_nolock and enable delay allocate. > > Signed-off-by: Ye Bin <yebin10@huawei.com> Well, but it was always like that. We do not guarantee that the data is securely on disk before rename - userspace is responsible for that if it needs this guarantee. We just want to make the time window shorter and so start data writeback because lot of userspace is careless. But waiting for data writeback before rename will significantly slow down some workloads and IMHO without a good reason. Honza > --- > fs/ext4/ext4.h | 1 + > fs/ext4/inode.c | 16 ++++++++++++++++ > fs/ext4/namei.c | 2 +- > 3 files changed, 18 insertions(+), 1 deletion(-) > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index 9bca5565547b..111296259b89 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -2998,6 +2998,7 @@ extern int ext4_break_layouts(struct inode *); > extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length); > extern void ext4_set_inode_flags(struct inode *, bool init); > extern int ext4_alloc_da_blocks(struct inode *inode); > +extern void ext4_flush_data(struct inode *inode); > extern void ext4_set_aops(struct inode *inode); > extern int ext4_writepage_trans_blocks(struct inode *); > extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks); > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 601214453c3a..4df7ffd3f1b0 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3133,6 +3133,22 @@ int ext4_alloc_da_blocks(struct inode *inode) > return filemap_flush(inode->i_mapping); > } > > +void ext4_flush_data(struct inode *inode) > +{ > + if (!EXT4_I(inode)->i_reserved_data_blocks) > + return; > + > + if (filemap_flush(inode->i_mapping)) > + return; > + > + if (test_opt(inode->i_sb, DELALLOC) && > + ext4_should_dioread_nolock(inode) && > + atomic_read(&EXT4_I(inode)->i_unwritten)) > + filemap_fdatawait(inode->i_mapping); > + > + return; > +} > + > /* > * bmap() is special. It gets used by applications such as lilo and by > * the swapper to find the on-disk block of a specific piece of data. > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c > index 3a31b662f661..030402fe3b9f 100644 > --- a/fs/ext4/namei.c > +++ b/fs/ext4/namei.c > @@ -3820,7 +3820,7 @@ static int ext4_rename(struct user_namespace *mnt_userns, struct inode *old_dir, > } > } > if (new.inode && !test_opt(new.dir->i_sb, NO_AUTO_DA_ALLOC)) > - ext4_alloc_da_blocks(old.inode); > + ext4_flush_data(old.inode); > > credits = (2 * EXT4_DATA_TRANS_BLOCKS(old.dir->i_sb) + > EXT4_INDEX_EXTRA_TRANS_BLOCKS + 2); > -- > 2.31.1 >
On 2022/9/1 16:37, Jan Kara wrote: > On Thu 01-09-22 14:26:57, Ye Bin wrote: >> There is issue that data lost when rename new file. Reason is dioread_nolock is >> enabled default now, which map blocks will mark extent unwritten. But inode >> size is changed after submit io. Then read file will return zero if extent is >> unwritten. >> To solve above isuue, wait all data force to disk before rename when enable >> dioread_nolock and enable delay allocate. >> >> Signed-off-by: Ye Bin <yebin10@huawei.com> > > Well, but it was always like that. We do not guarantee that the data is > securely on disk before rename - userspace is responsible for that if it > needs this guarantee. We just want to make the time window shorter and so > start data writeback because lot of userspace is careless. But waiting for > data writeback before rename will significantly slow down some workloads > and IMHO without a good reason. > Hi, Jan Our case is modifing a no-empty file names "data.conf" through vim, it will cp "data.conf" to ".data.conf.swp" firstly, overwrite rename ".data.conf.swp" back to "data.conf" and fsync after modifying. If we power down between rename and fsync, we may lose all data in the original "data.conf" and get a whole zero file. Before we enable dioread_nolock by defalut, the newly appending data may lost, but at least we will not lose the original data in the default data=ordered and auto_da_alloc mode. It seems that dioread_nolock breaks the guarantee provide by auto_da_alloc. Maybe we should add a fsync before rename in vim ? Thanks, Yi.
On Thu, Sep 01, 2022 at 09:19:04PM +0800, Zhang Yi wrote: > > Our case is modifing a no-empty file names "data.conf" through vim, it will > cp "data.conf" to ".data.conf.swp" firstly, overwrite rename ".data.conf.swp" > back to "data.conf" and fsync after modifying. If we power down between rename > and fsync, we may lose all data in the original "data.conf" and get a whole > zero file. Before we enable dioread_nolock by defalut, the newly appending data > may lost, but at least we will not lose the original data in the default > data=ordered and auto_da_alloc mode. It seems that dioread_nolock breaks the > guarantee provide by auto_da_alloc. Maybe we should add a fsync before rename > in vim ? What I normally recommend is to write the new contents of "data.conf" to "data.conf.new". Then fsync the file descriptor corresponding to data.conf.new. If you want to backup data.conf, you can create a hard link of data.conf to data.conf.bak, then rename data.conf.new on top of data.conf. The auto_da_alloc mode is a best effort attempt to protect against bad application programs, and it never was a guarantee that it would *always* protect against data loss if you had a overwriting rename racing against a power failure. Without auto_da_alloc mode, the window where the user might lose data if they application failed to do the fsync() was 30 seconds. With auto_da_alloc (and if dioread_nolock was disabled), that window was reduced to say, a few hundred milliseconds. With dioread_nolock, that window was increased to somehwere between 0 and 5 seconds (on average, 2.5 seconds) plus the time for the write to complete (a few hundred milliseconds). But note that your proposed patch doesn't actually really improve things all that much. That's because calling filemap_datawrite() only waits for the write request to complete. But you still need to update the metadata before the blocks become visible after a crash. That requires forcing a journal commit, and waiting for that commit to complete, which is what ext4_sync_file() does, and which is of course, quite expensive. We could add something to the end of ext4_convert_unwritten_io_end_vec() where if the inode is da_alloc eligible, we trigger an asynchronous journal commit; that is, once we converted all of the unwritten extents, if the file has been closed after being opened with O_TRUNC, or the file has been renamed on top of a pre-existing file and there were dirty blocks that were flushed out when we called ext4_alloc_da_blocks(), that ext4_convert_unwritten_io_end_vec() would then request that a journal commit be started. But we'd want to see what sort of performance regression this adds, since nothing is for free, and the goal here is to paper over buggy applications without imposing too much of a performance cost. But it should be clear that this is *buggy* applications, and applications SHOULD be calling fsync() before an overwriting rename() if they care about data not being lost if there is a power failure at an inconvenient point. All we are trying to do here is to reduce the probability of data loss caused by buggy applications. There was never any guarantees. Cheers, - Ted
On 2022/9/1 22:45, Theodore Ts'o wrote: > On Thu, Sep 01, 2022 at 09:19:04PM +0800, Zhang Yi wrote: >> >> Our case is modifing a no-empty file names "data.conf" through vim, it will >> cp "data.conf" to ".data.conf.swp" firstly, overwrite rename ".data.conf.swp" >> back to "data.conf" and fsync after modifying. If we power down between rename >> and fsync, we may lose all data in the original "data.conf" and get a whole >> zero file. Before we enable dioread_nolock by defalut, the newly appending data >> may lost, but at least we will not lose the original data in the default >> data=ordered and auto_da_alloc mode. It seems that dioread_nolock breaks the >> guarantee provide by auto_da_alloc. Maybe we should add a fsync before rename >> in vim ? > > What I normally recommend is to write the new contents of "data.conf" > to "data.conf.new". Then fsync the file descriptor corresponding to > data.conf.new. If you want to backup data.conf, you can create a hard > link of data.conf to data.conf.bak, then rename data.conf.new on top > of data.conf. > > The auto_da_alloc mode is a best effort attempt to protect against bad > application programs, and it never was a guarantee that it would > *always* protect against data loss if you had a overwriting rename > racing against a power failure. Without auto_da_alloc mode, the > window where the user might lose data if they application failed to do > the fsync() was 30 seconds. With auto_da_alloc (and if dioread_nolock > was disabled), that window was reduced to say, a few hundred Thanks for the explanation and suggestions, I totally agree with you that applications SHOULD be calling fsync() before an overwriting rename() and the filesystem just do best efforts. For the above "with auto_da_alloc and no dioread_nolock" case, I'm not sure I understand right, please correct me if I say wrong. With auto_da_alloc, data=ordered and no dioread_nolock, it separates into two cases: 1) the new written contents was appended to an allocated block; 2) the new contents was written to a new block. For case 1, we have a few hundred milliseconds window as you said, because the inode will not be added to transaction->t_inode_list, so the transaction committing progress does not wait data blocks to write to complete. For case 2, the inode will be added to transaction->t_inode_list, we can guarantee that data write to disk before the rename() completes, so there is no window for this case. Thanks, Yi. > milliseconds. With dioread_nolock, that window was increased to > somehwere between 0 and 5 seconds (on average, 2.5 seconds) plus the > time for the write to complete (a few hundred milliseconds). > > But note that your proposed patch doesn't actually really improve > things all that much. That's because calling filemap_datawrite() only > waits for the write request to complete. But you still need to update > the metadata before the blocks become visible after a crash. That > requires forcing a journal commit, and waiting for that commit to > complete, which is what ext4_sync_file() does, and which is of course, > quite expensive. > > We could add something to the end of ext4_convert_unwritten_io_end_vec() > where if the inode is da_alloc eligible, we trigger an asynchronous > journal commit; that is, once we converted all of the unwritten extents, > if the file has been closed after being opened with O_TRUNC, or the file has > been renamed on top of a pre-existing file and there were dirty blocks > that were flushed out when we called ext4_alloc_da_blocks(), that > ext4_convert_unwritten_io_end_vec() would then request that a journal > commit be started. But we'd want to see what sort of performance regression > this adds, since nothing is for free, and the goal here is to paper over > buggy applications without imposing too much of a performance cost. > > But it should be clear that this is *buggy* applications, and > applications SHOULD be calling fsync() before an overwriting rename() > if they care about data not being lost if there is a power failure at > an inconvenient point. All we are trying to do here is to reduce the > probability of data loss caused by buggy applications. There was never > any guarantees. > > Cheers, > > - Ted > > . >
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 9bca5565547b..111296259b89 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2998,6 +2998,7 @@ extern int ext4_break_layouts(struct inode *); extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length); extern void ext4_set_inode_flags(struct inode *, bool init); extern int ext4_alloc_da_blocks(struct inode *inode); +extern void ext4_flush_data(struct inode *inode); extern void ext4_set_aops(struct inode *inode); extern int ext4_writepage_trans_blocks(struct inode *); extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 601214453c3a..4df7ffd3f1b0 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3133,6 +3133,22 @@ int ext4_alloc_da_blocks(struct inode *inode) return filemap_flush(inode->i_mapping); } +void ext4_flush_data(struct inode *inode) +{ + if (!EXT4_I(inode)->i_reserved_data_blocks) + return; + + if (filemap_flush(inode->i_mapping)) + return; + + if (test_opt(inode->i_sb, DELALLOC) && + ext4_should_dioread_nolock(inode) && + atomic_read(&EXT4_I(inode)->i_unwritten)) + filemap_fdatawait(inode->i_mapping); + + return; +} + /* * bmap() is special. It gets used by applications such as lilo and by * the swapper to find the on-disk block of a specific piece of data. diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c index 3a31b662f661..030402fe3b9f 100644 --- a/fs/ext4/namei.c +++ b/fs/ext4/namei.c @@ -3820,7 +3820,7 @@ static int ext4_rename(struct user_namespace *mnt_userns, struct inode *old_dir, } } if (new.inode && !test_opt(new.dir->i_sb, NO_AUTO_DA_ALLOC)) - ext4_alloc_da_blocks(old.inode); + ext4_flush_data(old.inode); credits = (2 * EXT4_DATA_TRANS_BLOCKS(old.dir->i_sb) + EXT4_INDEX_EXTRA_TRANS_BLOCKS + 2);
There is issue that data lost when rename new file. Reason is dioread_nolock is enabled default now, which map blocks will mark extent unwritten. But inode size is changed after submit io. Then read file will return zero if extent is unwritten. To solve above isuue, wait all data force to disk before rename when enable dioread_nolock and enable delay allocate. Signed-off-by: Ye Bin <yebin10@huawei.com> --- fs/ext4/ext4.h | 1 + fs/ext4/inode.c | 16 ++++++++++++++++ fs/ext4/namei.c | 2 +- 3 files changed, 18 insertions(+), 1 deletion(-)