Message ID | 20241022111059.2566137-2-yi.zhang@huaweicloud.com |
---|---|
State | New |
Headers | show |
Series | ext4: use iomap for regular file's buffered I/O path and enable large folio | expand |
I'm sorry for the huge delay here... On Tue 22-10-24 19:10:32, Zhang Yi wrote: > From: Zhang Yi <yi.zhang@huawei.com> > > When zeroing a range of folios on the filesystem which block size is > less than the page size, the file's mapped partial blocks within one > page will be marked as unwritten, we should remove writable userspace > mappings to ensure that ext4_page_mkwrite() can be called during > subsequent write access to these folios. Otherwise, data written by > subsequent mmap writes may not be saved to disk. > > $mkfs.ext4 -b 1024 /dev/vdb > $mount /dev/vdb /mnt > $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \ > -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \ > -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo > > $od -Ax -t x1z /mnt/foo > 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 > * > 000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 > * > 001000 > > $umount /mnt && mount /dev/vdb /mnt > $od -Ax -t x1z /mnt/foo > 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 > * > 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > * > 001000 > > Signed-off-by: Zhang Yi <yi.zhang@huawei.com> This is a great catch! I think this may be source of the sporadic data corruption issues we observe with blocksize < pagesize. > +static inline void ext4_truncate_folio(struct inode *inode, > + loff_t start, loff_t end) > +{ > + unsigned long blocksize = i_blocksize(inode); > + struct folio *folio; > + > + if (round_up(start, blocksize) >= round_down(end, blocksize)) > + return; > + > + folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT); > + if (IS_ERR(folio)) > + return; > + > + if (folio_mkclean(folio)) > + folio_mark_dirty(folio); > + folio_unlock(folio); > + folio_put(folio); I don't think this is enough. In your example from the changelog, this would leave the page at index 0 dirty and still with 0x5a values in 2048-4096 range. Then truncate_pagecache_range() does nothing, ext4_alloc_file_blocks() converts blocks under 2048-4096 to unwritten state. But what handles zeroing of page cache in 2048-4096 range? ext4_zero_partial_blocks() zeroes only partial blocks, not full blocks. Am I missing something? If I'm right, I'd keep it simple and just writeout these partial folios with filemap_write_and_wait_range() and expand the range truncate_pagecache_range() removes to include these partial folios. The overhead won't be big and it isn't like this is some very performance sensitive path. > +} > + > +/* > + * When truncating a range of folios, if the block size is less than the > + * page size, the file's mapped partial blocks within one page could be > + * freed or converted to unwritten. We should call this function to remove > + * writable userspace mappings so that ext4_page_mkwrite() can be called > + * during subsequent write access to these folios. > + */ > +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end) Maybe call this ext4_truncate_page_cache_block_range()? And assert that start & end are block aligned. Because this essentially prepares page cache for manipulation with a block range. > +{ > + unsigned long blocksize = i_blocksize(inode); > + > + if (end > inode->i_size) > + end = inode->i_size; > + if (start >= end || blocksize >= PAGE_SIZE) > + return; > + > + ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end)); > + if (end > round_up(start, PAGE_SIZE)) > + ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end); > +} So I'd move the following truncate_pagecache_range() into ext4_truncate_folios_range(). And also the preceding: /* * For journalled data we need to write (and checkpoint) pages * before discarding page cache to avoid inconsitent data on * disk in case of crash before zeroing trans is committed. */ if (ext4_should_journal_data(inode)) { ret = filemap_write_and_wait_range(mapping, start, end - 1); ... into this function. So that it can be self-contained "do the right thing with page cache to prepare for block range manipulations". Honza
On 2024/12/4 19:13, Jan Kara wrote: > I'm sorry for the huge delay here... > It's fine, I know you're probably been busy lately, and this series has undergone significant modifications, which should require considerable time for review. Thanks a lot for taking time to review this series! > On Tue 22-10-24 19:10:32, Zhang Yi wrote: >> From: Zhang Yi <yi.zhang@huawei.com> >> >> When zeroing a range of folios on the filesystem which block size is >> less than the page size, the file's mapped partial blocks within one >> page will be marked as unwritten, we should remove writable userspace >> mappings to ensure that ext4_page_mkwrite() can be called during >> subsequent write access to these folios. Otherwise, data written by >> subsequent mmap writes may not be saved to disk. >> >> $mkfs.ext4 -b 1024 /dev/vdb >> $mount /dev/vdb /mnt >> $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \ >> -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \ >> -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo >> >> $od -Ax -t x1z /mnt/foo >> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 >> * >> 000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 >> * >> 001000 >> >> $umount /mnt && mount /dev/vdb /mnt >> $od -Ax -t x1z /mnt/foo >> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 >> * >> 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> * >> 001000 >> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com> > > This is a great catch! I think this may be source of the sporadic data > corruption issues we observe with blocksize < pagesize. > >> +static inline void ext4_truncate_folio(struct inode *inode, >> + loff_t start, loff_t end) >> +{ >> + unsigned long blocksize = i_blocksize(inode); >> + struct folio *folio; >> + >> + if (round_up(start, blocksize) >= round_down(end, blocksize)) >> + return; >> + >> + folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT); >> + if (IS_ERR(folio)) >> + return; >> + >> + if (folio_mkclean(folio)) >> + folio_mark_dirty(folio); >> + folio_unlock(folio); >> + folio_put(folio); > > I don't think this is enough. In your example from the changelog, this would > leave the page at index 0 dirty and still with 0x5a values in 2048-4096 range. > Then truncate_pagecache_range() does nothing, ext4_alloc_file_blocks() > converts blocks under 2048-4096 to unwritten state. But what handles > zeroing of page cache in 2048-4096 range? ext4_zero_partial_blocks() zeroes > only partial blocks, not full blocks. Am I missing something? > Sorry, I don't understand why truncate_pagecache_range() does nothing? In my example, the variable 'start' is 2048, the variable 'end' is 4096, and the call process truncate_pagecache_range(inode, 2048, 4096-1)->..-> truncate_inode_partial_folio()->folio_zero_range() does zeroing the 2048-4096 range. I also tested it below, it was zeroed. xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \ -c "mwrite -S 0x5a 2048 2048" \ -c "fzero 2048 2048" -c "close" /mnt/foo od -Ax -t x1z /mnt/foo 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 >XXXXXXXXXXXXXXXX< * 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................< * 001000 > If I'm right, I'd keep it simple and just writeout these partial folios with > filemap_write_and_wait_range() and expand the range > truncate_pagecache_range() removes to include these partial folios. The What I mean is the truncate_pagecache_range() has already covered the partial folios. right? > overhead won't be big and it isn't like this is some very performance > sensitive path. > >> +} >> + >> +/* >> + * When truncating a range of folios, if the block size is less than the >> + * page size, the file's mapped partial blocks within one page could be >> + * freed or converted to unwritten. We should call this function to remove >> + * writable userspace mappings so that ext4_page_mkwrite() can be called >> + * during subsequent write access to these folios. >> + */ >> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end) > > Maybe call this ext4_truncate_page_cache_block_range()? And assert that > start & end are block aligned. Because this essentially prepares page cache > for manipulation with a block range. Ha, it's a good idea, I agree with you that move truncate_pagecache_range() and the hunk of flushing in journal data mode into this function. But I don't understand why assert that 'start & end' are block aligned? I think ext4_truncate_page_cache_block_range() should allow passing unaligned input parameters and aligned them itself, especially, after patch 04 and 05, ext4_zero_range() and ext4_punch_hole() will passing offset and offset+len directly, which may block unaligned. Thanks, Yi. > >> +{ >> + unsigned long blocksize = i_blocksize(inode); >> + >> + if (end > inode->i_size) >> + end = inode->i_size; >> + if (start >= end || blocksize >= PAGE_SIZE) >> + return; >> + >> + ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end)); >> + if (end > round_up(start, PAGE_SIZE)) >> + ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end); >> +} > > So I'd move the following truncate_pagecache_range() into > ext4_truncate_folios_range(). And also the preceding: > > /* > * For journalled data we need to write (and checkpoint) pages > * before discarding page cache to avoid inconsitent data on > * disk in case of crash before zeroing trans is committed. > */ > if (ext4_should_journal_data(inode)) { > ret = filemap_write_and_wait_range(mapping, start, > end - 1); > ... > > into this function. So that it can be self-contained "do the right thing > with page cache to prepare for block range manipulations". > > Honza
On Fri 06-12-24 15:59:44, Zhang Yi wrote: > On 2024/12/4 19:13, Jan Kara wrote: > > On Tue 22-10-24 19:10:32, Zhang Yi wrote: > >> +static inline void ext4_truncate_folio(struct inode *inode, > >> + loff_t start, loff_t end) > >> +{ > >> + unsigned long blocksize = i_blocksize(inode); > >> + struct folio *folio; > >> + > >> + if (round_up(start, blocksize) >= round_down(end, blocksize)) > >> + return; > >> + > >> + folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT); > >> + if (IS_ERR(folio)) > >> + return; > >> + > >> + if (folio_mkclean(folio)) > >> + folio_mark_dirty(folio); > >> + folio_unlock(folio); > >> + folio_put(folio); > > > > I don't think this is enough. In your example from the changelog, this would > > leave the page at index 0 dirty and still with 0x5a values in 2048-4096 range. > > Then truncate_pagecache_range() does nothing, ext4_alloc_file_blocks() > > converts blocks under 2048-4096 to unwritten state. But what handles > > zeroing of page cache in 2048-4096 range? ext4_zero_partial_blocks() zeroes > > only partial blocks, not full blocks. Am I missing something? > > > > Sorry, I don't understand why truncate_pagecache_range() does nothing? In my > example, the variable 'start' is 2048, the variable 'end' is 4096, and the > call process truncate_pagecache_range(inode, 2048, 4096-1)->..-> > truncate_inode_partial_folio()->folio_zero_range() does zeroing the 2048-4096 > range. I also tested it below, it was zeroed. > > xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \ > -c "mwrite -S 0x5a 2048 2048" \ > -c "fzero 2048 2048" -c "close" /mnt/foo > > od -Ax -t x1z /mnt/foo > 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 >XXXXXXXXXXXXXXXX< > * > 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................< > * > 001000 Yeah, sorry. I've got totally confused here. truncate_pagecache_range() indeed does all the zeroing we need. Your version of ext4_truncate_folio() should do the right thing. > > If I'm right, I'd keep it simple and just writeout these partial folios with > > filemap_write_and_wait_range() and expand the range > > truncate_pagecache_range() removes to include these partial folios. The > > What I mean is the truncate_pagecache_range() has already covered the partial > folios. right? Right, it should cover the partial folios. > > overhead won't be big and it isn't like this is some very performance > > sensitive path. > > > >> +} > >> + > >> +/* > >> + * When truncating a range of folios, if the block size is less than the > >> + * page size, the file's mapped partial blocks within one page could be > >> + * freed or converted to unwritten. We should call this function to remove > >> + * writable userspace mappings so that ext4_page_mkwrite() can be called > >> + * during subsequent write access to these folios. > >> + */ > >> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end) > > > > Maybe call this ext4_truncate_page_cache_block_range()? And assert that > > start & end are block aligned. Because this essentially prepares page cache > > for manipulation with a block range. > > Ha, it's a good idea, I agree with you that move truncate_pagecache_range() > and the hunk of flushing in journal data mode into this function. But I don't > understand why assert that 'start & end' are block aligned? Yes, that shouldn't be needed since truncate_pagecache_range() will do the right thing. Honza
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 44b0d418143c..6d0267afd4c1 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -3020,6 +3020,8 @@ extern int ext4_inode_attach_jinode(struct inode *inode); extern int ext4_can_truncate(struct inode *inode); extern int ext4_truncate(struct inode *); extern int ext4_break_layouts(struct inode *); +extern void ext4_truncate_folios_range(struct inode *inode, loff_t start, + loff_t end); extern int ext4_punch_hole(struct file *file, loff_t offset, loff_t length); extern void ext4_set_inode_flags(struct inode *, bool init); extern int ext4_alloc_da_blocks(struct inode *inode); diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 34e25eee6521..2a054c3689f0 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4677,6 +4677,7 @@ static long ext4_zero_range(struct file *file, loff_t offset, } /* Now release the pages and zero block aligned part of pages */ + ext4_truncate_folios_range(inode, start, end); truncate_pagecache_range(inode, start, end - 1); inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode)); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 54bdd4884fe6..8b34e79112d5 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -31,6 +31,7 @@ #include <linux/writeback.h> #include <linux/pagevec.h> #include <linux/mpage.h> +#include <linux/rmap.h> #include <linux/namei.h> #include <linux/uio.h> #include <linux/bio.h> @@ -3870,6 +3871,46 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset, return ret; } +static inline void ext4_truncate_folio(struct inode *inode, + loff_t start, loff_t end) +{ + unsigned long blocksize = i_blocksize(inode); + struct folio *folio; + + if (round_up(start, blocksize) >= round_down(end, blocksize)) + return; + + folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT); + if (IS_ERR(folio)) + return; + + if (folio_mkclean(folio)) + folio_mark_dirty(folio); + folio_unlock(folio); + folio_put(folio); +} + +/* + * When truncating a range of folios, if the block size is less than the + * page size, the file's mapped partial blocks within one page could be + * freed or converted to unwritten. We should call this function to remove + * writable userspace mappings so that ext4_page_mkwrite() can be called + * during subsequent write access to these folios. + */ +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end) +{ + unsigned long blocksize = i_blocksize(inode); + + if (end > inode->i_size) + end = inode->i_size; + if (start >= end || blocksize >= PAGE_SIZE) + return; + + ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end)); + if (end > round_up(start, PAGE_SIZE)) + ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end); +} + static void ext4_wait_dax_page(struct inode *inode) { filemap_invalidate_unlock(inode->i_mapping);