ext4: defer updating i_disksize until endio

Message ID	20230324092907.1341457-1-cccheng@synology.com
State	New
Headers	show Return-Path: <SRS0=QB4p=7Q=vger.kernel.org=linux-ext4-owner@ozlabs.org> From: Chung-Chiang Cheng <cccheng@synology.com> To: linux-ext4@vger.kernel.org, tytso@mit.edu, adilger.kernel@dilger.ca, jack@suse.cz, yi.zhang@huawei.com Cc: shepjeng@gmail.com, kernel@cccheng.net, Chung-Chiang Cheng <cccheng@synology.com>, Robbie Ko <robbieko@synology.com> Subject: [PATCH] ext4: defer updating i_disksize until endio Date: Fri, 24 Mar 2023 17:29:07 +0800 Message-Id: <20230324092907.1341457-1-cccheng@synology.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	ext4: defer updating i_disksize until endio \| expand ext4: defer updating i_disksize until endio

Chung-Chiang Cheng March 24, 2023, 9:29 a.m. UTC

During a buffer write, the on-disk inode size (i_disksize) gets updated
at ext4_da_write_end() or mpage_map_and_submit_extent(). This update is
then committed into the journal. However, at that point, the writeback
process is not yet completed. If an interruption occurs, the content of
the file may not be in a consistent state because the i_disksize is
replayed by the journal, but the corresponding content may not have been
actually submitted to the disk.

    $ mount -o commit=1 /dev/vdc /mnt
    $ echo -n foo >> /mnt/test; sync
    $ echo -n bar >> /mnt/test; sleep 3; echo b > /proc/sysrq-trigger

    $ xxd /mnt/test
    00000000: 666f 6f00 0000                           foo...

After the above steps have been executed, there are padding zeros at the
end of the file, which are obviously not part of the actual content.
To fix this issue, we can defer updating i_disksize until the endio. The
original ext4_end_io_rsv_work() was to convert unwritten extents to
extents, but it now also updates the disk size.

Reviewed-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
---
 fs/ext4/inode.c   | 41 ++++--------------------------
 fs/ext4/page-io.c | 64 +++++++++++++++++++++++++++++++++++++++--------
 2 files changed, 58 insertions(+), 47 deletions(-)

kernel test robot March 24, 2023, 11:57 a.m. UTC | #1

Hi Chung-Chiang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on v6.3-rc3]
[also build test WARNING on linus/master]
[cannot apply to tytso-ext4/dev next-20230324]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Chung-Chiang-Cheng/ext4-defer-updating-i_disksize-until-endio/20230324-173716
patch link:    https://lore.kernel.org/r/20230324092907.1341457-1-cccheng%40synology.com
patch subject: [PATCH] ext4: defer updating i_disksize until endio
config: x86_64-randconfig-a014 (https://download.01.org/0day-ci/archive/20230324/202303241950.C1ApLyal-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/9a02b364d2cffe71a45866edf750b0280c8cb990
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Chung-Chiang-Cheng/ext4-defer-updating-i_disksize-until-endio/20230324-173716
        git checkout 9a02b364d2cffe71a45866edf750b0280c8cb990
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash fs/ext4/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303241950.C1ApLyal-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> fs/ext4/inode.c:3115:23: warning: variable 'end' set but not used [-Wunused-but-set-variable]
           unsigned long start, end;
                                ^
   1 warning generated.


vim +/end +3115 fs/ext4/inode.c

64769240bd07f4 Alex Tomas         2008-07-11  3108  
64769240bd07f4 Alex Tomas         2008-07-11  3109  static int ext4_da_write_end(struct file *file,
64769240bd07f4 Alex Tomas         2008-07-11  3110  			     struct address_space *mapping,
64769240bd07f4 Alex Tomas         2008-07-11  3111  			     loff_t pos, unsigned len, unsigned copied,
64769240bd07f4 Alex Tomas         2008-07-11  3112  			     struct page *page, void *fsdata)
64769240bd07f4 Alex Tomas         2008-07-11  3113  {
64769240bd07f4 Alex Tomas         2008-07-11  3114  	struct inode *inode = mapping->host;
632eaeab1feb5d Mingming Cao       2008-07-11 @3115  	unsigned long start, end;
79f0be8d2e6ebd Aneesh Kumar K.V   2008-10-08  3116  	int write_mode = (int)(unsigned long)fsdata;
79f0be8d2e6ebd Aneesh Kumar K.V   2008-10-08  3117  
74d553aad7926e Theodore Ts'o      2013-04-03  3118  	if (write_mode == FALL_BACK_TO_NONDELALLOC)
74d553aad7926e Theodore Ts'o      2013-04-03  3119  		return ext4_write_end(file, mapping, pos,
79f0be8d2e6ebd Aneesh Kumar K.V   2008-10-08  3120  				      len, copied, page, fsdata);
632eaeab1feb5d Mingming Cao       2008-07-11  3121  
9bffad1ed2a003 Theodore Ts'o      2009-06-17  3122  	trace_ext4_da_write_end(inode, pos, len, copied);
6984aef59814fb Zhang Yi           2021-07-16  3123  
6984aef59814fb Zhang Yi           2021-07-16  3124  	if (write_mode != CONVERT_INLINE_DATA &&
6984aef59814fb Zhang Yi           2021-07-16  3125  	    ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) &&
6984aef59814fb Zhang Yi           2021-07-16  3126  	    ext4_has_inline_data(inode))
6984aef59814fb Zhang Yi           2021-07-16  3127  		return ext4_write_inline_data_end(inode, pos, len, copied, page);
6984aef59814fb Zhang Yi           2021-07-16  3128  
09cbfeaf1a5a67 Kirill A. Shutemov 2016-04-01  3129  	start = pos & (PAGE_SIZE - 1);
632eaeab1feb5d Mingming Cao       2008-07-11  3130  	end = start + copied - 1;
64769240bd07f4 Alex Tomas         2008-07-11  3131  
64769240bd07f4 Alex Tomas         2008-07-11  3132  	/*
4df031ff5876d9 Zhang Yi           2021-07-16  3133  	 * Since we are holding inode lock, we are sure i_disksize <=
4df031ff5876d9 Zhang Yi           2021-07-16  3134  	 * i_size. We also know that if i_disksize < i_size, there are
9a02b364d2cffe Chung-Chiang Cheng 2023-03-24  3135  	 * delalloc writes pending in the range upto i_size. There's no
9a02b364d2cffe Chung-Chiang Cheng 2023-03-24  3136  	 * need to touch i_disksize since the endio of writeback will
9a02b364d2cffe Chung-Chiang Cheng 2023-03-24  3137  	 * push i_disksize upto i_size eventually.
4df031ff5876d9 Zhang Yi           2021-07-16  3138  	 *
4df031ff5876d9 Zhang Yi           2021-07-16  3139  	 * Note that we defer inode dirtying to generic_write_end() /
4df031ff5876d9 Zhang Yi           2021-07-16  3140  	 * ext4_da_write_inline_data_end().
64769240bd07f4 Alex Tomas         2008-07-11  3141  	 */
9c3569b50f12e4 Tao Ma             2012-12-10  3142  
cc883236b79297 Zhang Yi           2021-07-16  3143  	return generic_write_end(file, mapping, pos, len, copied, page, fsdata);
64769240bd07f4 Alex Tomas         2008-07-11  3144  }
64769240bd07f4 Alex Tomas         2008-07-11  3145

Zhang Yi March 24, 2023, 12:58 p.m. UTC | #2

Hello!

On 2023/3/24 17:29, Chung-Chiang Cheng wrote:
> During a buffer write, the on-disk inode size (i_disksize) gets updated
> at ext4_da_write_end() or mpage_map_and_submit_extent(). This update is
> then committed into the journal. However, at that point, the writeback
> process is not yet completed. If an interruption occurs, the content of
> the file may not be in a consistent state because the i_disksize is
> replayed by the journal, but the corresponding content may not have been
> actually submitted to the disk.
> 
>     $ mount -o commit=1 /dev/vdc /mnt
>     $ echo -n foo >> /mnt/test; sync
>     $ echo -n bar >> /mnt/test; sleep 3; echo b > /proc/sysrq-trigger
> 
>     $ xxd /mnt/test
>     00000000: 666f 6f00 0000                           foo...
> 
> After the above steps have been executed, there are padding zeros at the
> end of the file, which are obviously not part of the actual content.
> To fix this issue, we can defer updating i_disksize until the endio. The
> original ext4_end_io_rsv_work() was to convert unwritten extents to
> extents, but it now also updates the disk size.
> 

Thanks for the patch, but I'm sorry this patch doesn't looks right. Firstly,
the ext4 filesystem just do best efforts and don't guarantee data integrity
after system crash or power-off, so the best way is doing fsync() to reduce
the window of lose data. And then, we cannot simply postpone updating
i_disksize until IO end procedure because the related block allocating and
inode's extents changes are not atomic(they will not belong to one journal
transaction), so it may probability lead to file system inconsistency (please
run xfstests with the patch). Finally, we also cannot start a new handle in
the IO end procedure because it may lead to potential deadlock on page
writeback (please looks reserve handle).

Thanks,
Yi.

> Reviewed-by: Robbie Ko <robbieko@synology.com>
> Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
> ---
>  fs/ext4/inode.c   | 41 ++++--------------------------
>  fs/ext4/page-io.c | 64 +++++++++++++++++++++++++++++++++++++++--------
>  2 files changed, 58 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 96517785a9f8..c3537cd603dc 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2515,6 +2515,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  			goto update_disksize;
>  	} while (map->m_len);
>  
> +	return err;
> +
>  update_disksize:
>  	/*
>  	 * Update on-disk size after IO is submitted.  Races with
> @@ -3105,36 +3107,12 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
>  	return ret;
>  }
>  
> -/*
> - * Check if we should update i_disksize
> - * when write to the end of file but not require block allocation
> - */
> -static int ext4_da_should_update_i_disksize(struct page *page,
> -					    unsigned long offset)
> -{
> -	struct buffer_head *bh;
> -	struct inode *inode = page->mapping->host;
> -	unsigned int idx;
> -	int i;
> -
> -	bh = page_buffers(page);
> -	idx = offset >> inode->i_blkbits;
> -
> -	for (i = 0; i < idx; i++)
> -		bh = bh->b_this_page;
> -
> -	if (!buffer_mapped(bh) || (buffer_delay(bh)) || buffer_unwritten(bh))
> -		return 0;
> -	return 1;
> -}
> -
>  static int ext4_da_write_end(struct file *file,
>  			     struct address_space *mapping,
>  			     loff_t pos, unsigned len, unsigned copied,
>  			     struct page *page, void *fsdata)
>  {
>  	struct inode *inode = mapping->host;
> -	loff_t new_i_size;
>  	unsigned long start, end;
>  	int write_mode = (int)(unsigned long)fsdata;
>  
> @@ -3155,22 +3133,13 @@ static int ext4_da_write_end(struct file *file,
>  	/*
>  	 * Since we are holding inode lock, we are sure i_disksize <=
>  	 * i_size. We also know that if i_disksize < i_size, there are
> -	 * delalloc writes pending in the range upto i_size. If the end of
> -	 * the current write is <= i_size, there's no need to touch
> -	 * i_disksize since writeback will push i_disksize upto i_size
> -	 * eventually. If the end of the current write is > i_size and
> -	 * inside an allocated block (ext4_da_should_update_i_disksize()
> -	 * check), we need to update i_disksize here as neither
> -	 * ext4_writepage() nor certain ext4_writepages() paths not
> -	 * allocating blocks update i_disksize.
> +	 * delalloc writes pending in the range upto i_size. There's no
> +	 * need to touch i_disksize since the endio of writeback will
> +	 * push i_disksize upto i_size eventually.
>  	 *
>  	 * Note that we defer inode dirtying to generic_write_end() /
>  	 * ext4_da_write_inline_data_end().
>  	 */
> -	new_i_size = pos + copied;
> -	if (copied && new_i_size > inode->i_size &&
> -	    ext4_da_should_update_i_disksize(page, end))
> -		ext4_update_i_disksize(inode, new_i_size);
>  
>  	return generic_write_end(file, mapping, pos, len, copied, page, fsdata);
>  }
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 1e4db96a04e6..f893d26c4b88 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -182,6 +182,10 @@ static int ext4_end_io_end(ext4_io_end_t *io_end)
>  		   "list->prev 0x%p\n",
>  		   io_end, inode->i_ino, io_end->list.next, io_end->list.prev);
>  
> +	/* Only reserved conversions from writeback should enter here */
> +	WARN_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> +	WARN_ON(!io_end->handle && EXT4_SB(inode->i_sb)->s_journal);
> +
>  	io_end->handle = NULL;	/* Following call will use up the handle */
>  	ret = ext4_convert_unwritten_io_end_vec(handle, io_end);
>  	if (ret < 0 && !ext4_forced_shutdown(EXT4_SB(inode->i_sb))) {
> @@ -226,9 +230,6 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
>  	struct workqueue_struct *wq;
>  	unsigned long flags;
>  
> -	/* Only reserved conversions from writeback should enter here */
> -	WARN_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> -	WARN_ON(!io_end->handle && sbi->s_journal);
>  	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
>  	wq = sbi->rsv_conversion_wq;
>  	if (list_empty(&ei->i_rsv_conversion_list))
> @@ -237,6 +238,43 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
>  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
>  }
>  
> +static int ext4_endio_update_disksize(ext4_io_end_t *io_end)
> +{
> +	int ret = 0;
> +	loff_t i_size, disksize = 0;
> +	handle_t *handle;
> +	struct ext4_io_end_vec *io_end_vec;
> +	struct inode *inode = io_end->inode;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	list_for_each_entry(io_end_vec, &io_end->list_vec, list) {
> +		if (disksize < io_end_vec->offset + io_end_vec->size)
> +			disksize = io_end_vec->offset + io_end_vec->size;
> +	}
> +
> +	if (disksize > ei->i_disksize) {
> +		down_write(&ei->i_data_sem);
> +		i_size = inode->i_size;
> +		if (disksize > i_size)
> +			disksize = i_size;
> +		if (disksize > ei->i_disksize)
> +			WRITE_ONCE(ei->i_disksize, i_size);
> +		up_write(&ei->i_data_sem);
> +
> +		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
> +		if (IS_ERR(handle))
> +			return PTR_ERR(handle);
> +
> +		ret = ext4_mark_inode_dirty(handle, inode);
> +		if (ret)
> +			ext4_error_err(inode->i_sb, ret, "Failed to mark inode %lu dirty",
> +				       inode->i_ino);
> +		ext4_journal_stop(handle);
> +	}
> +
> +	return ret;
> +}
> +
>  static int ext4_do_flush_completed_IO(struct inode *inode,
>  				      struct list_head *head)
>  {
> @@ -253,10 +291,16 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
>  
>  	while (!list_empty(&unwritten)) {
>  		io_end = list_entry(unwritten.next, ext4_io_end_t, list);
> -		BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
>  		list_del_init(&io_end->list);
>  
> -		err = ext4_end_io_end(io_end);
> +		err = ext4_endio_update_disksize(io_end);
> +		if (unlikely(!ret && err))
> +			ret = err;
> +
> +		if (io_end->flag & EXT4_IO_END_UNWRITTEN)
> +			err = ext4_end_io_end(io_end);
> +		else
> +			ext4_release_io_end(io_end);
>  		if (unlikely(!ret && err))
>  			ret = err;
>  	}
> @@ -264,7 +308,7 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
>  }
>  
>  /*
> - * work on completed IO, to convert unwritten extents to extents
> + * work on completed IO, to convert unwritten extents to extents and update disksize
>   */
>  void ext4_end_io_rsv_work(struct work_struct *work)
>  {
> @@ -289,12 +333,10 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
>  void ext4_put_io_end_defer(ext4_io_end_t *io_end)
>  {
>  	if (refcount_dec_and_test(&io_end->count)) {
> -		if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) ||
> -				list_empty(&io_end->list_vec)) {
> +		if (list_empty(&io_end->list_vec))
>  			ext4_release_io_end(io_end);
> -			return;
> -		}
> -		ext4_add_complete_io(io_end);
> +		else
> +			ext4_add_complete_io(io_end);
>  	}
>  }
>  
>

kernel test robot March 24, 2023, 12:59 p.m. UTC | #3

Hi Chung-Chiang,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on v6.3-rc3]
[also build test WARNING on linus/master]
[cannot apply to tytso-ext4/dev next-20230324]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Chung-Chiang-Cheng/ext4-defer-updating-i_disksize-until-endio/20230324-173716
patch link:    https://lore.kernel.org/r/20230324092907.1341457-1-cccheng%40synology.com
patch subject: [PATCH] ext4: defer updating i_disksize until endio
config: ia64-allyesconfig (https://download.01.org/0day-ci/archive/20230324/202303242033.wiXS9jKn-lkp@intel.com/config)
compiler: ia64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/9a02b364d2cffe71a45866edf750b0280c8cb990
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Chung-Chiang-Cheng/ext4-defer-updating-i_disksize-until-endio/20230324-173716
        git checkout 9a02b364d2cffe71a45866edf750b0280c8cb990
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=ia64 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=ia64 SHELL=/bin/bash fs/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>
| Link: https://lore.kernel.org/oe-kbuild-all/202303242033.wiXS9jKn-lkp@intel.com/

All warnings (new ones prefixed by >>):

   fs/ext4/inode.c: In function 'ext4_da_write_end':
>> fs/ext4/inode.c:3115:30: warning: variable 'end' set but not used [-Wunused-but-set-variable]
    3115 |         unsigned long start, end;
         |                              ^~~


vim +/end +3115 fs/ext4/inode.c

64769240bd07f4 Alex Tomas         2008-07-11  3108  
64769240bd07f4 Alex Tomas         2008-07-11  3109  static int ext4_da_write_end(struct file *file,
64769240bd07f4 Alex Tomas         2008-07-11  3110  			     struct address_space *mapping,
64769240bd07f4 Alex Tomas         2008-07-11  3111  			     loff_t pos, unsigned len, unsigned copied,
64769240bd07f4 Alex Tomas         2008-07-11  3112  			     struct page *page, void *fsdata)
64769240bd07f4 Alex Tomas         2008-07-11  3113  {
64769240bd07f4 Alex Tomas         2008-07-11  3114  	struct inode *inode = mapping->host;
632eaeab1feb5d Mingming Cao       2008-07-11 @3115  	unsigned long start, end;
79f0be8d2e6ebd Aneesh Kumar K.V   2008-10-08  3116  	int write_mode = (int)(unsigned long)fsdata;
79f0be8d2e6ebd Aneesh Kumar K.V   2008-10-08  3117  
74d553aad7926e Theodore Ts'o      2013-04-03  3118  	if (write_mode == FALL_BACK_TO_NONDELALLOC)
74d553aad7926e Theodore Ts'o      2013-04-03  3119  		return ext4_write_end(file, mapping, pos,
79f0be8d2e6ebd Aneesh Kumar K.V   2008-10-08  3120  				      len, copied, page, fsdata);
632eaeab1feb5d Mingming Cao       2008-07-11  3121  
9bffad1ed2a003 Theodore Ts'o      2009-06-17  3122  	trace_ext4_da_write_end(inode, pos, len, copied);
6984aef59814fb Zhang Yi           2021-07-16  3123  
6984aef59814fb Zhang Yi           2021-07-16  3124  	if (write_mode != CONVERT_INLINE_DATA &&
6984aef59814fb Zhang Yi           2021-07-16  3125  	    ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA) &&
6984aef59814fb Zhang Yi           2021-07-16  3126  	    ext4_has_inline_data(inode))
6984aef59814fb Zhang Yi           2021-07-16  3127  		return ext4_write_inline_data_end(inode, pos, len, copied, page);
6984aef59814fb Zhang Yi           2021-07-16  3128  
09cbfeaf1a5a67 Kirill A. Shutemov 2016-04-01  3129  	start = pos & (PAGE_SIZE - 1);
632eaeab1feb5d Mingming Cao       2008-07-11  3130  	end = start + copied - 1;
64769240bd07f4 Alex Tomas         2008-07-11  3131  
64769240bd07f4 Alex Tomas         2008-07-11  3132  	/*
4df031ff5876d9 Zhang Yi           2021-07-16  3133  	 * Since we are holding inode lock, we are sure i_disksize <=
4df031ff5876d9 Zhang Yi           2021-07-16  3134  	 * i_size. We also know that if i_disksize < i_size, there are
9a02b364d2cffe Chung-Chiang Cheng 2023-03-24  3135  	 * delalloc writes pending in the range upto i_size. There's no
9a02b364d2cffe Chung-Chiang Cheng 2023-03-24  3136  	 * need to touch i_disksize since the endio of writeback will
9a02b364d2cffe Chung-Chiang Cheng 2023-03-24  3137  	 * push i_disksize upto i_size eventually.
4df031ff5876d9 Zhang Yi           2021-07-16  3138  	 *
4df031ff5876d9 Zhang Yi           2021-07-16  3139  	 * Note that we defer inode dirtying to generic_write_end() /
4df031ff5876d9 Zhang Yi           2021-07-16  3140  	 * ext4_da_write_inline_data_end().
64769240bd07f4 Alex Tomas         2008-07-11  3141  	 */
9c3569b50f12e4 Tao Ma             2012-12-10  3142  
cc883236b79297 Zhang Yi           2021-07-16  3143  	return generic_write_end(file, mapping, pos, len, copied, page, fsdata);
64769240bd07f4 Alex Tomas         2008-07-11  3144  }
64769240bd07f4 Alex Tomas         2008-07-11  3145

Jan Kara March 27, 2023, 9:29 a.m. UTC | #4

On Fri 24-03-23 17:29:07, Chung-Chiang Cheng wrote:
> During a buffer write, the on-disk inode size (i_disksize) gets updated
> at ext4_da_write_end() or mpage_map_and_submit_extent(). This update is
> then committed into the journal. However, at that point, the writeback
> process is not yet completed. If an interruption occurs, the content of
> the file may not be in a consistent state because the i_disksize is
> replayed by the journal, but the corresponding content may not have been
> actually submitted to the disk.
> 
>     $ mount -o commit=1 /dev/vdc /mnt
>     $ echo -n foo >> /mnt/test; sync
>     $ echo -n bar >> /mnt/test; sleep 3; echo b > /proc/sysrq-trigger
> 
>     $ xxd /mnt/test
>     00000000: 666f 6f00 0000                           foo...
> 
> After the above steps have been executed, there are padding zeros at the
> end of the file, which are obviously not part of the actual content.
> To fix this issue, we can defer updating i_disksize until the endio. The
> original ext4_end_io_rsv_work() was to convert unwritten extents to
> extents, but it now also updates the disk size.
> 
> Reviewed-by: Robbie Ko <robbieko@synology.com>
> Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>

As Zhang Yi already noted in his review, this is expected at least with
data=writeback mount option. With data=ordered this should not happen
though as the commit of the transaction with i_disksize update will wait
for page writeback to complete (this is exactly the reason why data=ordered
exists after all). Are you able to observe this problem with data=ordered
mount option?

								Honza

> ---
>  fs/ext4/inode.c   | 41 ++++--------------------------
>  fs/ext4/page-io.c | 64 +++++++++++++++++++++++++++++++++++++++--------
>  2 files changed, 58 insertions(+), 47 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 96517785a9f8..c3537cd603dc 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2515,6 +2515,8 @@ static int mpage_map_and_submit_extent(handle_t *handle,
>  			goto update_disksize;
>  	} while (map->m_len);
>  
> +	return err;
> +
>  update_disksize:
>  	/*
>  	 * Update on-disk size after IO is submitted.  Races with
> @@ -3105,36 +3107,12 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
>  	return ret;
>  }
>  
> -/*
> - * Check if we should update i_disksize
> - * when write to the end of file but not require block allocation
> - */
> -static int ext4_da_should_update_i_disksize(struct page *page,
> -					    unsigned long offset)
> -{
> -	struct buffer_head *bh;
> -	struct inode *inode = page->mapping->host;
> -	unsigned int idx;
> -	int i;
> -
> -	bh = page_buffers(page);
> -	idx = offset >> inode->i_blkbits;
> -
> -	for (i = 0; i < idx; i++)
> -		bh = bh->b_this_page;
> -
> -	if (!buffer_mapped(bh) || (buffer_delay(bh)) || buffer_unwritten(bh))
> -		return 0;
> -	return 1;
> -}
> -
>  static int ext4_da_write_end(struct file *file,
>  			     struct address_space *mapping,
>  			     loff_t pos, unsigned len, unsigned copied,
>  			     struct page *page, void *fsdata)
>  {
>  	struct inode *inode = mapping->host;
> -	loff_t new_i_size;
>  	unsigned long start, end;
>  	int write_mode = (int)(unsigned long)fsdata;
>  
> @@ -3155,22 +3133,13 @@ static int ext4_da_write_end(struct file *file,
>  	/*
>  	 * Since we are holding inode lock, we are sure i_disksize <=
>  	 * i_size. We also know that if i_disksize < i_size, there are
> -	 * delalloc writes pending in the range upto i_size. If the end of
> -	 * the current write is <= i_size, there's no need to touch
> -	 * i_disksize since writeback will push i_disksize upto i_size
> -	 * eventually. If the end of the current write is > i_size and
> -	 * inside an allocated block (ext4_da_should_update_i_disksize()
> -	 * check), we need to update i_disksize here as neither
> -	 * ext4_writepage() nor certain ext4_writepages() paths not
> -	 * allocating blocks update i_disksize.
> +	 * delalloc writes pending in the range upto i_size. There's no
> +	 * need to touch i_disksize since the endio of writeback will
> +	 * push i_disksize upto i_size eventually.
>  	 *
>  	 * Note that we defer inode dirtying to generic_write_end() /
>  	 * ext4_da_write_inline_data_end().
>  	 */
> -	new_i_size = pos + copied;
> -	if (copied && new_i_size > inode->i_size &&
> -	    ext4_da_should_update_i_disksize(page, end))
> -		ext4_update_i_disksize(inode, new_i_size);
>  
>  	return generic_write_end(file, mapping, pos, len, copied, page, fsdata);
>  }
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 1e4db96a04e6..f893d26c4b88 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -182,6 +182,10 @@ static int ext4_end_io_end(ext4_io_end_t *io_end)
>  		   "list->prev 0x%p\n",
>  		   io_end, inode->i_ino, io_end->list.next, io_end->list.prev);
>  
> +	/* Only reserved conversions from writeback should enter here */
> +	WARN_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> +	WARN_ON(!io_end->handle && EXT4_SB(inode->i_sb)->s_journal);
> +
>  	io_end->handle = NULL;	/* Following call will use up the handle */
>  	ret = ext4_convert_unwritten_io_end_vec(handle, io_end);
>  	if (ret < 0 && !ext4_forced_shutdown(EXT4_SB(inode->i_sb))) {
> @@ -226,9 +230,6 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
>  	struct workqueue_struct *wq;
>  	unsigned long flags;
>  
> -	/* Only reserved conversions from writeback should enter here */
> -	WARN_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
> -	WARN_ON(!io_end->handle && sbi->s_journal);
>  	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
>  	wq = sbi->rsv_conversion_wq;
>  	if (list_empty(&ei->i_rsv_conversion_list))
> @@ -237,6 +238,43 @@ static void ext4_add_complete_io(ext4_io_end_t *io_end)
>  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
>  }
>  
> +static int ext4_endio_update_disksize(ext4_io_end_t *io_end)
> +{
> +	int ret = 0;
> +	loff_t i_size, disksize = 0;
> +	handle_t *handle;
> +	struct ext4_io_end_vec *io_end_vec;
> +	struct inode *inode = io_end->inode;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	list_for_each_entry(io_end_vec, &io_end->list_vec, list) {
> +		if (disksize < io_end_vec->offset + io_end_vec->size)
> +			disksize = io_end_vec->offset + io_end_vec->size;
> +	}
> +
> +	if (disksize > ei->i_disksize) {
> +		down_write(&ei->i_data_sem);
> +		i_size = inode->i_size;
> +		if (disksize > i_size)
> +			disksize = i_size;
> +		if (disksize > ei->i_disksize)
> +			WRITE_ONCE(ei->i_disksize, i_size);
> +		up_write(&ei->i_data_sem);
> +
> +		handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
> +		if (IS_ERR(handle))
> +			return PTR_ERR(handle);
> +
> +		ret = ext4_mark_inode_dirty(handle, inode);
> +		if (ret)
> +			ext4_error_err(inode->i_sb, ret, "Failed to mark inode %lu dirty",
> +				       inode->i_ino);
> +		ext4_journal_stop(handle);
> +	}
> +
> +	return ret;
> +}
> +
>  static int ext4_do_flush_completed_IO(struct inode *inode,
>  				      struct list_head *head)
>  {
> @@ -253,10 +291,16 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
>  
>  	while (!list_empty(&unwritten)) {
>  		io_end = list_entry(unwritten.next, ext4_io_end_t, list);
> -		BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
>  		list_del_init(&io_end->list);
>  
> -		err = ext4_end_io_end(io_end);
> +		err = ext4_endio_update_disksize(io_end);
> +		if (unlikely(!ret && err))
> +			ret = err;
> +
> +		if (io_end->flag & EXT4_IO_END_UNWRITTEN)
> +			err = ext4_end_io_end(io_end);
> +		else
> +			ext4_release_io_end(io_end);
>  		if (unlikely(!ret && err))
>  			ret = err;
>  	}
> @@ -264,7 +308,7 @@ static int ext4_do_flush_completed_IO(struct inode *inode,
>  }
>  
>  /*
> - * work on completed IO, to convert unwritten extents to extents
> + * work on completed IO, to convert unwritten extents to extents and update disksize
>   */
>  void ext4_end_io_rsv_work(struct work_struct *work)
>  {
> @@ -289,12 +333,10 @@ ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
>  void ext4_put_io_end_defer(ext4_io_end_t *io_end)
>  {
>  	if (refcount_dec_and_test(&io_end->count)) {
> -		if (!(io_end->flag & EXT4_IO_END_UNWRITTEN) ||
> -				list_empty(&io_end->list_vec)) {
> +		if (list_empty(&io_end->list_vec))
>  			ext4_release_io_end(io_end);
> -			return;
> -		}
> -		ext4_add_complete_io(io_end);
> +		else
> +			ext4_add_complete_io(io_end);
>  	}
>  }
>  
> -- 
> 2.34.1
>

Chung-Chiang Cheng March 27, 2023, 10:28 a.m. UTC | #5

On Mon, Mar 27, 2023 at 5:29 PM Jan Kara <jack@suse.cz> wrote:
>
> As Zhang Yi already noted in his review, this is expected at least with
> data=writeback mount option. With data=ordered this should not happen
> though as the commit of the transaction with i_disksize update will wait
> for page writeback to complete (this is exactly the reason why data=ordered
> exists after all). Are you able to observe this problem with data=ordered
> mount option?
>
>                                                                 Honza

It's a pity that this issue also occurs with data=ordered due to delayed
allocation being enabled by default. If delayed allocation were disabled,
it would not be as easy to reproduce.

This is because if data is written to the end of a file and the block is
allocated, the new i_disksize will be immediately committed to the journal
at ext4_da_write_end(), but the writeback procedure is not yet triggered.
By default, ext4 commits the journal every 5 seconds, but a dirty page may
not be written back until 30 seconds later. This is not a short time window,
and any improper shutdown during this time may lead to the issue :(

Thanks.
C.C.Cheng

Zhang Yi March 27, 2023, 11:17 a.m. UTC | #6

On 2023/3/27 18:28, Chung-Chiang Cheng wrote:
> On Mon, Mar 27, 2023 at 5:29 PM Jan Kara <jack@suse.cz> wrote:
>>
>> As Zhang Yi already noted in his review, this is expected at least with
>> data=writeback mount option. With data=ordered this should not happen
>> though as the commit of the transaction with i_disksize update will wait
>> for page writeback to complete (this is exactly the reason why data=ordered
>> exists after all). Are you able to observe this problem with data=ordered
>> mount option?
>>
>>                                                                 Honza
> 
> It's a pity that this issue also occurs with data=ordered due to delayed
> allocation being enabled by default. If delayed allocation were disabled,
> it would not be as easy to reproduce.
> 
> This is because if data is written to the end of a file and the block is
> allocated, the new i_disksize will be immediately committed to the journal
> at ext4_da_write_end(), but the writeback procedure is not yet triggered.
> By default, ext4 commits the journal every 5 seconds, but a dirty page may
> not be written back until 30 seconds later. This is not a short time window,
> and any improper shutdown during this time may lead to the issue :(
> 

It seems that the case you've mentioned is intra-block append write (no?),
current data=ordered mount option doesn't work in this case because
ext4_map_blocks() doesn't attach inode to the t_inode_list of the running
transaction. If delayed allocation were disabled, the lose data window is still
there, because ext4_write_end()->ext4_update_inode_size() is also updating
i_disksize before writing data back. This is at least guarantee no store data.
We had discussed this in [1].

[1]. https://lore.kernel.org/linux-ext4/1554370192-113254-1-git-send-email-yi.zhang@huawei.com/

Thanks,
Yi.

Jan Kara March 27, 2023, 11:34 a.m. UTC | #7

On Mon 27-03-23 18:28:55, Chung-Chiang Cheng wrote:
> On Mon, Mar 27, 2023 at 5:29 PM Jan Kara <jack@suse.cz> wrote:
> >
> > As Zhang Yi already noted in his review, this is expected at least with
> > data=writeback mount option. With data=ordered this should not happen
> > though as the commit of the transaction with i_disksize update will wait
> > for page writeback to complete (this is exactly the reason why data=ordered
> > exists after all). Are you able to observe this problem with data=ordered
> > mount option?
> >
> >                                                                 Honza
> 
> It's a pity that this issue also occurs with data=ordered due to delayed
> allocation being enabled by default. If delayed allocation were disabled,
> it would not be as easy to reproduce.

Ah, ok. With data=ordered and expanding within the last block, you are
right you can see zeros at the end of the file after a crash. We were
discussing this in the past already but decided not to improve this because
the fix would have performance cost we didn't want to impose on users.

> This is because if data is written to the end of a file and the block is
> allocated, the new i_disksize will be immediately committed to the journal
> at ext4_da_write_end(), but the writeback procedure is not yet triggered.
> By default, ext4 commits the journal every 5 seconds, but a dirty page may
> not be written back until 30 seconds later. This is not a short time window,
> and any improper shutdown during this time may lead to the issue :(

Yeah, I agree. The time window is not small. What we could do and what
could even bring some performance benefit is if we moved the i_disksize
update from ext4_da_write_end() to ext4_do_writepages(). Currently we do
the i_disksize update only in mpage_map_and_submit_extent() but we could
add a similar logic when exiting from ext4_do_writepages() to update
i_disksize for written back pages beyond i_disksize which didn't need block
allocation. *Except* there is a problem that we couldn't do this i_disksize
update when the pages are written from jbd2 during ordered data writeback
(we cannot start transaction in that context). And this is nasty because
we will completely loose the i_disksize update. We could handle it by
redirtying the tail page in this case but it gets a bit ugly...

								Honza

Chung-Chiang Cheng March 29, 2023, 3:36 a.m. UTC | #8

On Mon, Mar 27, 2023 at 7:17 PM Zhang Yi <yi.zhang@huawei.com> wrote:
>
> On 2023/3/27 18:28, Chung-Chiang Cheng wrote:
> > It's a pity that this issue also occurs with data=ordered due to delayed
> > allocation being enabled by default. If delayed allocation were disabled,
> > it would not be as easy to reproduce.
> >
> > This is because if data is written to the end of a file and the block is
> > allocated, the new i_disksize will be immediately committed to the journal
> > at ext4_da_write_end(), but the writeback procedure is not yet triggered.
> > By default, ext4 commits the journal every 5 seconds, but a dirty page may
> > not be written back until 30 seconds later. This is not a short time window,
> > and any improper shutdown during this time may lead to the issue :(
> >

Thank you for the explanation from you and Jan. I agree that it is not the
responsibility of ext4 to provide application consistency, but this case is
not even crash consistent, although no sensitive data were revealed after
crash.

> It seems that the case you've mentioned is intra-block append write (no?),
> current data=ordered mount option doesn't work in this case because
> ext4_map_blocks() doesn't attach inode to the t_inode_list of the running
> transaction. If delayed allocation were disabled, the lose data window is still
> there, because ext4_write_end()->ext4_update_inode_size() is also updating
> i_disksize before writing data back. This is at least guarantee no store data.
> We had discussed this in [1].

Yes, you're right. I've reconfirmed my experiment and determined that this
issue can be reproduced with or without delayed allocation.

I've tried your previous solution of adding the required inode to the current
transaction's ordered data list. It seems to work perfectly for me and simply
solves the issue, but the journal handling needs to be added back to the
delayed allocation write. Does it really have an obvious performance impact?

>
> [1]. https://lore.kernel.org/linux-ext4/1554370192-113254-1-git-send-email-yi.zhang@huawei.com/
>
> Thanks,
> Yi.

Zhang Yi March 29, 2023, 11:37 a.m. UTC | #9

On 2023/3/29 11:36, Chung-Chiang Cheng wrote:
> On Mon, Mar 27, 2023 at 7:17 PM Zhang Yi <yi.zhang@huawei.com> wrote:
>>
>> On 2023/3/27 18:28, Chung-Chiang Cheng wrote:
>>> It's a pity that this issue also occurs with data=ordered due to delayed
>>> allocation being enabled by default. If delayed allocation were disabled,
>>> it would not be as easy to reproduce.
>>>
>>> This is because if data is written to the end of a file and the block is
>>> allocated, the new i_disksize will be immediately committed to the journal
>>> at ext4_da_write_end(), but the writeback procedure is not yet triggered.
>>> By default, ext4 commits the journal every 5 seconds, but a dirty page may
>>> not be written back until 30 seconds later. This is not a short time window,
>>> and any improper shutdown during this time may lead to the issue :(
>>>
> 
> Thank you for the explanation from you and Jan. I agree that it is not the
> responsibility of ext4 to provide application consistency, but this case is
> not even crash consistent, although no sensitive data were revealed after
> crash.
> 
>> It seems that the case you've mentioned is intra-block append write (no?),
>> current data=ordered mount option doesn't work in this case because
>> ext4_map_blocks() doesn't attach inode to the t_inode_list of the running
>> transaction. If delayed allocation were disabled, the lose data window is still
>> there, because ext4_write_end()->ext4_update_inode_size() is also updating
>> i_disksize before writing data back. This is at least guarantee no store data.
>> We had discussed this in [1].
> 
> Yes, you're right. I've reconfirmed my experiment and determined that this
> issue can be reproduced with or without delayed allocation.
> 
> I've tried your previous solution of adding the required inode to the current
> transaction's ordered data list. It seems to work perfectly for me and simply
> solves the issue, but the journal handling needs to be added back to the
> delayed allocation write. Does it really have an obvious performance impact?
> 

It depends on the writing behavior (proportion of partial block write), I had
test fio sequence write with bs=1K[1] on the default 4K block size file system
(the cases were not enough, I hope that will be helpful to you), it's have
about 30% degradation at that time. I haven't test it recently, maybe it could
have more degradation than before at the delayed allocation write since we
removed the journal handling. Or you can test it on your products.

Thanks,
Yi.

[1] fio --name=foo --size=5G --bs=1k --numjobs=24 --iodepth=1 --rw=write \
        --norandommap --group_reporting --runtime=100 --time_based \
        --nrfiles=3 --directory=/mnt/ --fallocate=none --fsync=1

ext4: defer updating i_disksize until endio

Commit Message

Comments

Patch