Message ID | 20241022111059.2566137-4-yi.zhang@huaweicloud.com |
---|---|
State | New |
Headers | show |
Series | ext4: use iomap for regular file's buffered I/O path and enable large folio | expand |
On Tue, Oct 22, 2024 at 07:10:34PM +0800, Zhang Yi wrote: > From: Zhang Yi <yi.zhang@huawei.com> > > There is no need to write back all data before punching a hole in > data=ordered|writeback mode since it will be dropped soon after removing > space, so just remove the filemap_write_and_wait_range() in these modes. > However, in data=journal mode, we need to write dirty pages out before > discarding page cache in case of crash before committing the freeing > data transaction, which could expose old, stale data. Can't the same thing happen with non-journaled data writes? Say you write 1GB of "A"s to a file and fsync. Then you write "B"s to the same 1GB of file and immediately start punching it. If the system reboots before the mapping updates all get written to disk, won't you risk seeing some of those "A" because we no longer flush the "B"s? Also, since the program didn't explicitly fsync the Bs, why bother flushing the dirty data at all? Are data=journal writes supposed to be synchronous flushing writes nowadays? --D > Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > --- > fs/ext4/inode.c | 26 +++++++++++++++----------- > 1 file changed, 15 insertions(+), 11 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index f8796f7b0f94..94b923afcd9c 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > > trace_ext4_punch_hole(inode, offset, length, 0); > > - /* > - * Write out all dirty pages to avoid race conditions > - * Then release them. > - */ > - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) { > - ret = filemap_write_and_wait_range(mapping, offset, > - offset + length - 1); > - if (ret) > - return ret; > - } > - > inode_lock(inode); > > /* No need to punch hole beyond i_size */ > @@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) > ret = ext4_update_disksize_before_punch(inode, offset, length); > if (ret) > goto out_dio; > + > + /* > + * For journalled data we need to write (and checkpoint) pages > + * before discarding page cache to avoid inconsitent data on > + * disk in case of crash before punching trans is committed. > + */ > + if (ext4_should_journal_data(inode)) { > + ret = filemap_write_and_wait_range(mapping, > + first_block_offset, last_block_offset); > + if (ret) > + goto out_dio; > + } > + > + ext4_truncate_folios_range(inode, first_block_offset, > + last_block_offset + 1); > truncate_pagecache_range(inode, first_block_offset, > last_block_offset); > } > -- > 2.46.1 > >
On 2024/11/19 7:15, Darrick J. Wong wrote: > On Tue, Oct 22, 2024 at 07:10:34PM +0800, Zhang Yi wrote: >> From: Zhang Yi <yi.zhang@huawei.com> >> >> There is no need to write back all data before punching a hole in >> data=ordered|writeback mode since it will be dropped soon after removing >> space, so just remove the filemap_write_and_wait_range() in these modes. >> However, in data=journal mode, we need to write dirty pages out before >> discarding page cache in case of crash before committing the freeing >> data transaction, which could expose old, stale data. > > Can't the same thing happen with non-journaled data writes? > > Say you write 1GB of "A"s to a file and fsync. Then you write "B"s to > the same 1GB of file and immediately start punching it. If the system > reboots before the mapping updates all get written to disk, won't you > risk seeing some of those "A" because we no longer flush the "B"s? > > Also, since the program didn't explicitly fsync the Bs, why bother > flushing the dirty data at all? Are data=journal writes supposed to be > synchronous flushing writes nowadays? Thanks you for your replay. This case is not exactly the problem that can occur in data=journal mode, the problem is even if we fsync "B"s before punching the hole, we may still encounter old data ("A"s or even order) if the system reboots before the hole-punching process is completed. The details of this problem is the ext4_punch_hole()-> truncate_pagecache_range()-> ..->journal_unmap_buffer() will drop the checkpoint transaction, which may contain B's journaled data. Consequently, the journal tail could move advance beyond this point. If we do not flush the data before dropping the cache and a crash occurs before the punching transaction is committed, B's transaction will never recover, resulting in the loss of B's data. Therefore, this cannot happen in non-journaled data writes. This flush logic is copied from ext4_zero_range() since it has the same problem, Jan added it in commit 783ae448b7a2 ("ext4: Fix special handling of journalled data from extent zeroing"), please see it for more details. Jan, please correct me if my understanding is incorrect. Thanks, Yi. > > --D > >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> >> --- >> fs/ext4/inode.c | 26 +++++++++++++++----------- >> 1 file changed, 15 insertions(+), 11 deletions(-) >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index f8796f7b0f94..94b923afcd9c 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) >> >> trace_ext4_punch_hole(inode, offset, length, 0); >> >> - /* >> - * Write out all dirty pages to avoid race conditions >> - * Then release them. >> - */ >> - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) { >> - ret = filemap_write_and_wait_range(mapping, offset, >> - offset + length - 1); >> - if (ret) >> - return ret; >> - } >> - >> inode_lock(inode); >> >> /* No need to punch hole beyond i_size */ >> @@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) >> ret = ext4_update_disksize_before_punch(inode, offset, length); >> if (ret) >> goto out_dio; >> + >> + /* >> + * For journalled data we need to write (and checkpoint) pages >> + * before discarding page cache to avoid inconsitent data on >> + * disk in case of crash before punching trans is committed. >> + */ >> + if (ext4_should_journal_data(inode)) { >> + ret = filemap_write_and_wait_range(mapping, >> + first_block_offset, last_block_offset); >> + if (ret) >> + goto out_dio; >> + } >> + >> + ext4_truncate_folios_range(inode, first_block_offset, >> + last_block_offset + 1); >> truncate_pagecache_range(inode, first_block_offset, >> last_block_offset); >> } >> -- >> 2.46.1 >> >>
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index f8796f7b0f94..94b923afcd9c 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3965,17 +3965,6 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) trace_ext4_punch_hole(inode, offset, length, 0); - /* - * Write out all dirty pages to avoid race conditions - * Then release them. - */ - if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) { - ret = filemap_write_and_wait_range(mapping, offset, - offset + length - 1); - if (ret) - return ret; - } - inode_lock(inode); /* No need to punch hole beyond i_size */ @@ -4037,6 +4026,21 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length) ret = ext4_update_disksize_before_punch(inode, offset, length); if (ret) goto out_dio; + + /* + * For journalled data we need to write (and checkpoint) pages + * before discarding page cache to avoid inconsitent data on + * disk in case of crash before punching trans is committed. + */ + if (ext4_should_journal_data(inode)) { + ret = filemap_write_and_wait_range(mapping, + first_block_offset, last_block_offset); + if (ret) + goto out_dio; + } + + ext4_truncate_folios_range(inode, first_block_offset, + last_block_offset + 1); truncate_pagecache_range(inode, first_block_offset, last_block_offset); }