
[01/27] ext4: remove writable userspace mappings before truncating page cache

Message ID 20241022111059.2566137-2-yi.zhang@huaweicloud.com
State New
Series ext4: use iomap for regular file's buffered I/O path and enable large folio

Commit Message

Zhang Yi Oct. 22, 2024, 11:10 a.m. UTC
From: Zhang Yi <yi.zhang@huawei.com>

When zeroing a range of folios on a filesystem whose block size is
less than the page size, the file's mapped partial blocks within one
page will be marked as unwritten. We should remove writable userspace
mappings to ensure that ext4_page_mkwrite() can be called during
subsequent write access to these folios; otherwise, data written by
subsequent mmap writes may not be saved to disk.

 $mkfs.ext4 -b 1024 /dev/vdb
 $mount /dev/vdb /mnt
 $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
               -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
               -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo

 $od -Ax -t x1z /mnt/foo
 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 *
 000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
 *
 001000

 $umount /mnt && mount /dev/vdb /mnt
 $od -Ax -t x1z /mnt/foo
 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
 *
 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 *
 001000

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/ext4.h    |  2 ++
 fs/ext4/extents.c |  1 +
 fs/ext4/inode.c   | 41 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 44 insertions(+)

Comments

Jan Kara Dec. 4, 2024, 11:13 a.m. UTC | #1
I'm sorry for the huge delay here...

On Tue 22-10-24 19:10:32, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When zeroing a range of folios on a filesystem whose block size is
> less than the page size, the file's mapped partial blocks within one
> page will be marked as unwritten. We should remove writable userspace
> mappings to ensure that ext4_page_mkwrite() can be called during
> subsequent write access to these folios; otherwise, data written by
> subsequent mmap writes may not be saved to disk.
> 
>  $mkfs.ext4 -b 1024 /dev/vdb
>  $mount /dev/vdb /mnt
>  $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
>                -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
>                -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo
> 
>  $od -Ax -t x1z /mnt/foo
>  000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>  *
>  000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
>  *
>  001000
> 
>  $umount /mnt && mount /dev/vdb /mnt
>  $od -Ax -t x1z /mnt/foo
>  000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>  *
>  000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  *
>  001000
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

This is a great catch! I think this may be the source of the sporadic data
corruption issues we observe with blocksize < pagesize.

> +static inline void ext4_truncate_folio(struct inode *inode,
> +				       loff_t start, loff_t end)
> +{
> +	unsigned long blocksize = i_blocksize(inode);
> +	struct folio *folio;
> +
> +	if (round_up(start, blocksize) >= round_down(end, blocksize))
> +		return;
> +
> +	folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
> +	if (IS_ERR(folio))
> +		return;
> +
> +	if (folio_mkclean(folio))
> +		folio_mark_dirty(folio);
> +	folio_unlock(folio);
> +	folio_put(folio);

I don't think this is enough. In your example from the changelog, this would
leave the page at index 0 dirty and still with 0x5a values in the 2048-4096
range. Then truncate_pagecache_range() does nothing, and
ext4_alloc_file_blocks() converts the blocks under 2048-4096 to the unwritten
state. But what handles zeroing of the page cache in the 2048-4096 range?
ext4_zero_partial_blocks() zeroes only partial blocks, not full blocks. Am I
missing something?

If I'm right, I'd keep it simple and just write out these partial folios with
filemap_write_and_wait_range() and expand the range that
truncate_pagecache_range() removes to include these partial folios. The
overhead won't be big, and it isn't like this is some very
performance-sensitive path.
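
Roughly something like this (an untested sketch just to illustrate; the
helper name is made up and the exact rounding is illustrative):

	static int ext4_flush_and_truncate_range(struct inode *inode,
						 loff_t start, loff_t end)
	{
		loff_t pg_start = round_down(start, PAGE_SIZE);
		loff_t pg_end = round_up(end, PAGE_SIZE);
		int ret;

		/* Write back the folios partially covered by the range... */
		ret = filemap_write_and_wait_range(inode->i_mapping,
						   pg_start, pg_end - 1);
		if (ret)
			return ret;
		/* ...and then drop them whole from the page cache. */
		truncate_pagecache_range(inode, pg_start, pg_end - 1);
		return 0;
	}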

> +}
> +
> +/*
> + * When truncating a range of folios, if the block size is less than the
> + * page size, the file's mapped partial blocks within one page could be
> + * freed or converted to unwritten. We should call this function to remove
> + * writable userspace mappings so that ext4_page_mkwrite() can be called
> + * during subsequent write access to these folios.
> + */
> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)

Maybe call this ext4_truncate_page_cache_block_range()? And assert that
start & end are block aligned. Because this essentially prepares page cache
for manipulation with a block range.
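
I.e., something like the following (untested, just to illustrate the naming
and the assertion):

	void ext4_truncate_page_cache_block_range(struct inode *inode,
						  loff_t start, loff_t end)
	{
		unsigned long blocksize = i_blocksize(inode);

		/* Sketch only: callers pass a block-aligned range */
		WARN_ON_ONCE(!IS_ALIGNED(start, blocksize) ||
			     !IS_ALIGNED(end, blocksize));
		...
	}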

> +{
> +	unsigned long blocksize = i_blocksize(inode);
> +
> +	if (end > inode->i_size)
> +		end = inode->i_size;
> +	if (start >= end || blocksize >= PAGE_SIZE)
> +		return;
> +
> +	ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end));
> +	if (end > round_up(start, PAGE_SIZE))
> +		ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end);
> +}

So I'd move the following truncate_pagecache_range() into
ext4_truncate_folios_range(). And also the preceding:

                /*
                 * For journalled data we need to write (and checkpoint) pages
                 * before discarding page cache to avoid inconsistent data on
                 * disk in case of crash before zeroing trans is committed.
                 */
                if (ext4_should_journal_data(inode)) {
                        ret = filemap_write_and_wait_range(mapping, start,
                                                           end - 1);
		...

into this function, so that it becomes a self-contained "do the right thing
with the page cache to prepare for block range manipulations".
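
The end result could look roughly like this (an untested sketch; error
handling abbreviated):

	int ext4_truncate_page_cache_block_range(struct inode *inode,
						 loff_t start, loff_t end)
	{
		struct address_space *mapping = inode->i_mapping;
		int ret;

		/*
		 * For journalled data we need to write (and checkpoint)
		 * pages before discarding page cache to avoid inconsistent
		 * data on disk in case of crash before the zeroing
		 * transaction is committed.
		 */
		if (ext4_should_journal_data(inode)) {
			ret = filemap_write_and_wait_range(mapping, start,
							   end - 1);
			if (ret)
				return ret;
		}

		/* Clean writable mappings in the partial folios at the ends */
		ext4_truncate_folio(inode, start,
				    min(round_up(start, PAGE_SIZE), end));
		if (end > round_up(start, PAGE_SIZE))
			ext4_truncate_folio(inode,
					    round_down(end, PAGE_SIZE), end);

		/* Now remove (and zero the edges of) the page cache range */
		truncate_pagecache_range(inode, start, end - 1);
		return 0;
	}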

								Honza
Zhang Yi Dec. 6, 2024, 7:59 a.m. UTC | #2
On 2024/12/4 19:13, Jan Kara wrote:
> I'm sorry for the huge delay here...
> 
It's fine. I know you've probably been busy lately, and this series has
undergone significant modifications, which requires considerable time to
review. Thanks a lot for taking the time to review this series!

> On Tue 22-10-24 19:10:32, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When zeroing a range of folios on a filesystem whose block size is
>> less than the page size, the file's mapped partial blocks within one
>> page will be marked as unwritten. We should remove writable userspace
>> mappings to ensure that ext4_page_mkwrite() can be called during
>> subsequent write access to these folios; otherwise, data written by
>> subsequent mmap writes may not be saved to disk.
>>
>>  $mkfs.ext4 -b 1024 /dev/vdb
>>  $mount /dev/vdb /mnt
>>  $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
>>                -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
>>                -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo
>>
>>  $od -Ax -t x1z /mnt/foo
>>  000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>>  *
>>  000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
>>  *
>>  001000
>>
>>  $umount /mnt && mount /dev/vdb /mnt
>>  $od -Ax -t x1z /mnt/foo
>>  000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>>  *
>>  000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>>  *
>>  001000
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> This is a great catch! I think this may be the source of the sporadic data
> corruption issues we observe with blocksize < pagesize.
> 
>> +static inline void ext4_truncate_folio(struct inode *inode,
>> +				       loff_t start, loff_t end)
>> +{
>> +	unsigned long blocksize = i_blocksize(inode);
>> +	struct folio *folio;
>> +
>> +	if (round_up(start, blocksize) >= round_down(end, blocksize))
>> +		return;
>> +
>> +	folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
>> +	if (IS_ERR(folio))
>> +		return;
>> +
>> +	if (folio_mkclean(folio))
>> +		folio_mark_dirty(folio);
>> +	folio_unlock(folio);
>> +	folio_put(folio);
> 
> I don't think this is enough. In your example from the changelog, this would
> leave the page at index 0 dirty and still with 0x5a values in the 2048-4096
> range. Then truncate_pagecache_range() does nothing, and
> ext4_alloc_file_blocks() converts the blocks under 2048-4096 to the unwritten
> state. But what handles zeroing of the page cache in the 2048-4096 range?
> ext4_zero_partial_blocks() zeroes only partial blocks, not full blocks. Am I
> missing something?
> 

Sorry, I don't understand why truncate_pagecache_range() would do nothing. In
my example, 'start' is 2048 and 'end' is 4096, and the call chain
truncate_pagecache_range(inode, 2048, 4096-1) -> .. ->
truncate_inode_partial_folio() -> folio_zero_range() does zero the 2048-4096
range. I also tested it below; it was zeroed:

  xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
               -c "mwrite -S 0x5a 2048 2048" \
               -c "fzero 2048 2048" -c "close" /mnt/foo

  od -Ax -t x1z /mnt/foo
  000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  >XXXXXXXXXXXXXXXX<
  *
  000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
  *
  001000

> If I'm right, I'd keep it simple and just write out these partial folios with
> filemap_write_and_wait_range() and expand the range that
> truncate_pagecache_range() removes to include these partial folios. The

What I mean is that truncate_pagecache_range() already covers the partial
folios, right?

> overhead won't be big, and it isn't like this is some very
> performance-sensitive path.
> 
>> +}
>> +
>> +/*
>> + * When truncating a range of folios, if the block size is less than the
>> + * page size, the file's mapped partial blocks within one page could be
>> + * freed or converted to unwritten. We should call this function to remove
>> + * writable userspace mappings so that ext4_page_mkwrite() can be called
>> + * during subsequent write access to these folios.
>> + */
>> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)
> 
> Maybe call this ext4_truncate_page_cache_block_range()? And assert that
> start & end are block aligned. Because this essentially prepares page cache
> for manipulation with a block range.

Ha, it's a good idea. I agree with you that we should move
truncate_pagecache_range() and the hunk that flushes in journalled data mode
into this function. But I don't understand why we should assert that 'start'
and 'end' are block aligned. I think ext4_truncate_page_cache_block_range()
should allow unaligned input parameters and align them itself; especially
after patches 04 and 05, ext4_zero_range() and ext4_punch_hole() will pass
offset and offset + len directly, which may be block unaligned.
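
For example (an untested sketch of what I have in mind):

	void ext4_truncate_page_cache_block_range(struct inode *inode,
						  loff_t start, loff_t end)
	{
		/*
		 * Sketch: 'start'/'end' may be unaligned (e.g. offset and
		 * offset + len from ext4_zero_range()/ext4_punch_hole());
		 * ext4_truncate_folio() and truncate_pagecache_range()
		 * below handle any alignment themselves.
		 */
		ext4_truncate_folio(inode, start,
				    min(round_up(start, PAGE_SIZE), end));
		if (end > round_up(start, PAGE_SIZE))
			ext4_truncate_folio(inode,
					    round_down(end, PAGE_SIZE), end);
		truncate_pagecache_range(inode, start, end - 1);
	}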

Thanks,
Yi.

> 
>> +{
>> +	unsigned long blocksize = i_blocksize(inode);
>> +
>> +	if (end > inode->i_size)
>> +		end = inode->i_size;
>> +	if (start >= end || blocksize >= PAGE_SIZE)
>> +		return;
>> +
>> +	ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end));
>> +	if (end > round_up(start, PAGE_SIZE))
>> +		ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end);
>> +}
> 
> So I'd move the following truncate_pagecache_range() into
> ext4_truncate_folios_range(). And also the preceding:
> 
>                 /*
>                  * For journalled data we need to write (and checkpoint) pages
>                  * before discarding page cache to avoid inconsistent data on
>                  * disk in case of crash before zeroing trans is committed.
>                  */
>                 if (ext4_should_journal_data(inode)) {
>                         ret = filemap_write_and_wait_range(mapping, start,
>                                                            end - 1);
> 		...
> 
> into this function, so that it becomes a self-contained "do the right thing
> with the page cache to prepare for block range manipulations".
> 
> 								Honza
Jan Kara Dec. 6, 2024, 3:49 p.m. UTC | #3
On Fri 06-12-24 15:59:44, Zhang Yi wrote:
> On 2024/12/4 19:13, Jan Kara wrote:
> > On Tue 22-10-24 19:10:32, Zhang Yi wrote:
> >> +static inline void ext4_truncate_folio(struct inode *inode,
> >> +				       loff_t start, loff_t end)
> >> +{
> >> +	unsigned long blocksize = i_blocksize(inode);
> >> +	struct folio *folio;
> >> +
> >> +	if (round_up(start, blocksize) >= round_down(end, blocksize))
> >> +		return;
> >> +
> >> +	folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
> >> +	if (IS_ERR(folio))
> >> +		return;
> >> +
> >> +	if (folio_mkclean(folio))
> >> +		folio_mark_dirty(folio);
> >> +	folio_unlock(folio);
> >> +	folio_put(folio);
> > 
> > I don't think this is enough. In your example from the changelog, this would
> > leave the page at index 0 dirty and still with 0x5a values in the 2048-4096
> > range. Then truncate_pagecache_range() does nothing, and
> > ext4_alloc_file_blocks() converts the blocks under 2048-4096 to the unwritten
> > state. But what handles zeroing of the page cache in the 2048-4096 range?
> > ext4_zero_partial_blocks() zeroes only partial blocks, not full blocks. Am I
> > missing something?
> > 
> 
> Sorry, I don't understand why truncate_pagecache_range() would do nothing. In
> my example, 'start' is 2048 and 'end' is 4096, and the call chain
> truncate_pagecache_range(inode, 2048, 4096-1) -> .. ->
> truncate_inode_partial_folio() -> folio_zero_range() does zero the 2048-4096
> range. I also tested it below; it was zeroed:
> 
>   xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
>                -c "mwrite -S 0x5a 2048 2048" \
>                -c "fzero 2048 2048" -c "close" /mnt/foo
> 
>   od -Ax -t x1z /mnt/foo
>   000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  >XXXXXXXXXXXXXXXX<
>   *
>   000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  >................<
>   *
>   001000

Yeah, sorry, I got totally confused here. truncate_pagecache_range() indeed
does all the zeroing we need. Your version of ext4_truncate_folio() should
do the right thing.

> > If I'm right, I'd keep it simple and just write out these partial folios with
> > filemap_write_and_wait_range() and expand the range that
> > truncate_pagecache_range() removes to include these partial folios. The
> 
> What I mean is that truncate_pagecache_range() already covers the partial
> folios, right?

Right, it should cover the partial folios.

> > overhead won't be big, and it isn't like this is some very
> > performance-sensitive path.
> > 
> >> +}
> >> +
> >> +/*
> >> + * When truncating a range of folios, if the block size is less than the
> >> + * page size, the file's mapped partial blocks within one page could be
> >> + * freed or converted to unwritten. We should call this function to remove
> >> + * writable userspace mappings so that ext4_page_mkwrite() can be called
> >> + * during subsequent write access to these folios.
> >> + */
> >> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)
> > 
> > Maybe call this ext4_truncate_page_cache_block_range()? And assert that
> > start & end are block aligned. Because this essentially prepares page cache
> > for manipulation with a block range.
> 
> Ha, it's a good idea. I agree with you that we should move
> truncate_pagecache_range() and the hunk that flushes in journalled data
> mode into this function. But I don't understand why we should assert that
> 'start' and 'end' are block aligned.

Yes, that shouldn't be needed since truncate_pagecache_range() will do the
right thing.

								Honza

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 44b0d418143c..6d0267afd4c1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3020,6 +3020,8 @@  extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
 extern int ext4_break_layouts(struct inode *);
+extern void ext4_truncate_folios_range(struct inode *inode, loff_t start,
+				       loff_t end);
 extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length);
 extern void ext4_set_inode_flags(struct inode *, bool init);
 extern int ext4_alloc_da_blocks(struct inode *inode);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 34e25eee6521..2a054c3689f0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4677,6 +4677,7 @@  static long ext4_zero_range(struct file *file, loff_t offset,
 		}
 
 		/* Now release the pages and zero block aligned part of pages */
+		ext4_truncate_folios_range(inode, start, end);
 		truncate_pagecache_range(inode, start, end - 1);
 		inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 54bdd4884fe6..8b34e79112d5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -31,6 +31,7 @@ 
 #include <linux/writeback.h>
 #include <linux/pagevec.h>
 #include <linux/mpage.h>
+#include <linux/rmap.h>
 #include <linux/namei.h>
 #include <linux/uio.h>
 #include <linux/bio.h>
@@ -3870,6 +3871,46 @@  int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 	return ret;
 }
 
+static inline void ext4_truncate_folio(struct inode *inode,
+				       loff_t start, loff_t end)
+{
+	unsigned long blocksize = i_blocksize(inode);
+	struct folio *folio;
+
+	if (round_up(start, blocksize) >= round_down(end, blocksize))
+		return;
+
+	folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
+	if (IS_ERR(folio))
+		return;
+
+	if (folio_mkclean(folio))
+		folio_mark_dirty(folio);
+	folio_unlock(folio);
+	folio_put(folio);
+}
+
+/*
+ * When truncating a range of folios, if the block size is less than the
+ * page size, the file's mapped partial blocks within one page could be
+ * freed or converted to unwritten. We should call this function to remove
+ * writable userspace mappings so that ext4_page_mkwrite() can be called
+ * during subsequent write access to these folios.
+ */
+void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)
+{
+	unsigned long blocksize = i_blocksize(inode);
+
+	if (end > inode->i_size)
+		end = inode->i_size;
+	if (start >= end || blocksize >= PAGE_SIZE)
+		return;
+
+	ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end));
+	if (end > round_up(start, PAGE_SIZE))
+		ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end);
+}
+
 static void ext4_wait_dax_page(struct inode *inode)
 {
 	filemap_invalidate_unlock(inode->i_mapping);