Message ID: 20241022111059.2566137-1-yi.zhang@huaweicloud.com
Series: ext4: use iomap for regular file's buffered I/O path and enable large folio
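For anyone who wants to fetch and apply the series locally, the Message ID above can be handed to the b4 tool; a minimal sketch (the mbox file name b4 generates will differ):

b4 am 20241022111059.2566137-1-yi.zhang@huaweicloud.com   # download the full series from lore.kernel.org
git am ./*.mbx                                            # apply it to a 6.12-rc4 based tree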
On Tue, Oct 22, 2024 at 5:13 AM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>
> From: Zhang Yi <yi.zhang@huawei.com>
>
> [full cover letter snipped; see the original posting below]
>
> fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
>     -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
>     -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
>     -group_reporting -name=$name --output=/tmp/test_log

Hi Zhang Yi,

Can you clarify the fio values for the various parameters?

Thanks.

BR,
-Sedat-
On 2024/10/22 14:59, Sedat Dilek wrote:
> On Tue, Oct 22, 2024 at 5:13 AM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>> [cover letter and fio command line snipped]
>
> Hi Zhang Yi,
>
> Can you clarify the fio values for the various parameters?

Hi Sedat,

Sure. The test I presented here is a simple single-thread, single-I/O-depth
case with the psync ioengine. Most of the fio parameters are shown in the
tables below. For the rest, 'iodepth' and 'numjobs' are always set to 1 and
'size' is 40GB. During the write cache test, I also disabled the writeback
process through:

 echo 0 > /proc/sys/vm/dirty_writeback_centisecs
 echo 100 > /proc/sys/vm/dirty_background_ratio
 echo 100 > /proc/sys/vm/dirty_ratio

Thanks,
Yi.
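For readers who prefer a job file over the long command line, the quoted fio invocation can also be expressed as one; a minimal sketch of a single cell of the write test, using the values given in the reply above (the job name 'seqwrite-4k' and the /tmp paths are illustrative, not from the original test setup):

cat > /tmp/bufwrite.fio <<'EOF'
[global]
directory=/mnt
direct=0
ioengine=psync
iodepth=1
numjobs=1
size=40G
runtime=60
fallocate=none
group_reporting

[seqwrite-4k]
rw=write
bs=4k
EOF
fio /tmp/bufwrite.fio --output=/tmp/test_log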
On Tue, Oct 22, 2024 at 11:22 AM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>
> [earlier quotes snipped]
>
> Sure. The test I presented here is a simple single-thread,
> single-I/O-depth case with the psync ioengine. Most of the fio
> parameters are shown in the tables below.

Hi Zhang Yi,

Thanks for your reply.

Can you share a fio config file with all (relevant) settings?
Maybe it is in the below link?

Link: https://packages.debian.org/sid/all/fio-examples/filelist

> For the rest, 'iodepth' and 'numjobs' are always set to 1 and 'size' is
> 40GB. During the write cache test, I also disabled the writeback
> process through:
>
>  echo 0 > /proc/sys/vm/dirty_writeback_centisecs
>  echo 100 > /proc/sys/vm/dirty_background_ratio
>  echo 100 > /proc/sys/vm/dirty_ratio

^^ Is this info in one of the patches? If not, can you add it to the next
version's cover letter?

Are the patchset and its improvements relevant only for powerful servers,
or does a notebook user benefit as well? If you have benchmark data,
please share it.

I can NOT promise that I will give this patchset a try.

Best thanks.

Best regards,
-Sedat-
On 2024/10/23 20:13, Sedat Dilek wrote:
> On Tue, Oct 22, 2024 at 11:22 AM Zhang Yi <yi.zhang@huaweicloud.com> wrote:
>> [earlier quotes snipped]
>
> Can you share a fio config file with all (relevant) settings?
> Maybe it is in the below link?
>
> Link: https://packages.debian.org/sid/all/fio-examples/filelist

No, I don't have such a configuration file. I simply wrote two
straightforward scripts to do this test. They serve as a reference,
primarily for performance analysis of basic read/write operations on
different backends; more complex cases should be adjusted to the actual
circumstances. I have attached the scripts, feel free to use them. I
suggest adjusting the parameters according to your machine configuration
and service I/O model.

> Are the patchset and its improvements relevant only for powerful
> servers, or does a notebook user benefit as well?

The performance improvement is primarily attributed to the cost savings
in the kernel software stack with large I/O. Therefore, performance
should improve wherever the CPU becomes a bottleneck, i.e. the faster
the disk, the more pronounced the benefits, regardless of whether the
system is a server or a notebook.

Thanks,
Yi.
--- Attached script 1: buffered read test ---

#!/bin/bash
# Compare ext4 buffer_head vs iomap buffered read performance on a
# ramdisk and an NVMe disk, reading holes, in-RAM data and on-disk data.
ramdev=$1
nvmedev=$2
MOUNT_OPT=""
test_size=40G

function run_fio() {
	local rw=read
	local sync=$1
	local bs=$2
	local iodepth=$3
	local numjobs=$4
	local overwrite=$5
	local name=1
	local size=$6

	fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
	    -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
	    -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
	    -group_reporting -name=$name --output=/tmp/log
	cat /tmp/log >> /tmp/fio_result
}

function init_env() {
	local hole=$1
	local size=$2
	local dev=$3

	rm -rf /mnt/*
	# Create either a sparse file (read from holes) or a fully
	# written file (read real data).
	if [[ "$hole" == "1" ]]; then
		truncate -s $size /mnt/1.0.0
	else
		xfs_io -f -c "pwrite 0 $size" /mnt/1.0.0
	fi
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function reset_env() {
	local dev=$1

	# Remount to drop the page cache between runs.
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function do_one_test() {
	local sync=0
	local hole=$1
	local size=$2
	local dev=$3

	echo "-------------------" | tee -a /tmp/fio_result
	echo "=== 4K:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 4k 1 1 0 $size
	echo "=== 64K:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 64k 1 1 0 $size
	echo "=== 1M:" | tee -a /tmp/fio_result
	reset_env $dev
	run_fio $sync 1M 1 1 0 $size
	echo "-------------------" | tee -a /tmp/fio_result
}

function run_one_round() {
	local hole=$1
	local size=$2
	local dev=$3

	init_env $hole $size $dev
	do_one_test $hole $size $dev
}

function run_test() {
	echo "---- TEST RAMDEV ----" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $ramdev /mnt
	echo "----- 1. READ HOLE" | tee -a /tmp/fio_result
	run_one_round 1 $test_size $ramdev
	echo "----- 2. READ RAM DATA" | tee -a /tmp/fio_result
	run_one_round 0 $test_size $ramdev
	umount /mnt

	echo "---- TEST NVMEDEV ----" | tee -a /tmp/fio_result
	echo "----- 3. READ NVME DATA" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $nvmedev /mnt
	run_one_round 0 $test_size $nvmedev
	umount /mnt
}

if [ -z "$ramdev" ] || [ -z "$nvmedev" ]; then
	echo "$0 <ramdev> <nvmedev>"
	exit
fi

umount /mnt
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $ramdev
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $nvmedev
cp /tmp/fio_result /tmp/fio_result.old
rm -f /tmp/fio_result

## TEST base ramdev
echo "==== TEST BASE ====" | tee -a /tmp/fio_result
MOUNT_OPT="nobuffered_iomap"
run_test

## TEST iomap ramdev
echo "==== TEST IOMAP ====" | tee -a /tmp/fio_result
MOUNT_OPT="buffered_iomap"
run_test

--- Attached script 2: buffered write test ---

#!/bin/bash
# Compare ext4 buffer_head vs iomap buffered write performance,
# covering cache-only, background-writeback and fsync cases.
ramdev=$1
nvmedev=$2
MOUNT_OPT=""
test_size=40G

function run_fio() {
	local rw=write
	local sync=$1
	local bs=$2
	local iodepth=$3
	local numjobs=$4
	local overwrite=$5
	local name=1
	local size=$6

	fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
	    -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
	    -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
	    -group_reporting -name=$name --output=/tmp/log
	cat /tmp/log >> /tmp/fio_result
}

function init_env() {
	local dev=$1

	rm -rf /mnt/*
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function reset_env() {
	local overwrite=$1
	local dev=$2

	# For non-overwrite runs, start from an empty directory.
	if [[ "$overwrite" == "0" ]]; then
		rm -rf /mnt/*
	fi
	umount /mnt
	mount -o $MOUNT_OPT $dev /mnt
}

function do_one_test() {
	local sync=$1
	local overwrite=$2
	local size=$3
	local dev=$4

	echo "-------------------" | tee -a /tmp/fio_result
	echo "=== 4K:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 4k 1 1 $overwrite $size
	echo "=== 64K:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 64k 1 1 $overwrite $size
	echo "=== 1M:" | tee -a /tmp/fio_result
	reset_env $overwrite $dev
	run_fio $sync 1M 1 1 $overwrite $size
	echo "-------------------" | tee -a /tmp/fio_result
}

function run_one_round() {
	local sync=$1
	local overwrite=$2
	local size=$3
	local dev=$4

	echo "Sync:$sync, Overwrite:$overwrite" | tee -a /tmp/fio_result
	init_env $dev
	do_one_test $sync $overwrite $size $dev
}

function run_test() {
	echo "---- TEST RAMDEV ----" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $ramdev /mnt
	echo "----- 1. WRITE CACHE" | tee -a /tmp/fio_result
	# Stop writeback
	echo 0 > /proc/sys/vm/dirty_writeback_centisecs
	echo 30000 > /proc/sys/vm/dirty_expire_centisecs
	echo 100 > /proc/sys/vm/dirty_background_ratio
	echo 100 > /proc/sys/vm/dirty_ratio
	run_one_round 0 0 $test_size $ramdev
	run_one_round 0 1 $test_size $ramdev

	echo "----- 2. WRITE RAM DISK" | tee -a /tmp/fio_result
	# Restore writeback
	echo 500 > /proc/sys/vm/dirty_writeback_centisecs
	echo 3000 > /proc/sys/vm/dirty_expire_centisecs
	echo 10 > /proc/sys/vm/dirty_background_ratio
	echo 20 > /proc/sys/vm/dirty_ratio
	run_one_round 0 0 $test_size $ramdev
	run_one_round 0 1 $test_size $ramdev
	run_one_round 1 0 $test_size $ramdev
	run_one_round 1 1 $test_size $ramdev
	umount /mnt

	echo "---- TEST NVMEDEV ----" | tee -a /tmp/fio_result
	echo "----- 3. WRITE NVME DISK" | tee -a /tmp/fio_result
	mount -o $MOUNT_OPT $nvmedev /mnt
	run_one_round 0 0 $test_size $nvmedev
	run_one_round 0 1 $test_size $nvmedev
	run_one_round 1 0 $test_size $nvmedev
	run_one_round 1 1 $test_size $nvmedev
	umount /mnt
}

if [ -z "$ramdev" ] || [ -z "$nvmedev" ]; then
	echo "$0 <ramdev> <nvmedev>"
	exit
fi

umount /mnt
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $ramdev
mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F $nvmedev
cp /tmp/fio_result /tmp/fio_result.old
rm -f /tmp/fio_result

## TEST base
echo "==== TEST BASE ====" | tee -a /tmp/fio_result
MOUNT_OPT="nobuffered_iomap"
run_test

## TEST iomap
echo "==== TEST IOMAP ====" | tee -a /tmp/fio_result
MOUNT_OPT="buffered_iomap"
run_test
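A usage sketch for the two scripts above (the device paths and script file names are illustrative; both scripts run mkfs.ext4 on the given devices, so they must be scratch devices):

# WARNING: both devices are reformatted by the scripts.
./fio-read-test.sh /dev/ram0 /dev/nvme0n1p1
./fio-write-test.sh /dev/ram0 /dev/nvme0n1p1
cat /tmp/fio_result        # accumulated fio output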
From: Zhang Yi <yi.zhang@huawei.com>

Hello!

This patch series is the latest version based on my previous RFC
series[1], which converts the buffered I/O path of ext4 regular files to
iomap and enables large folios. After several months of work, almost all
preparatory changes have been upstreamed; thanks a lot for the review
and comments from Jan, Dave, Christoph, Darrick and Ritesh. Now it is
time for the main implementation of this conversion.

This series is the main part of the iomap buffered I/O conversion. It is
based on 6.12-rc4, and its code context also depends on another cleanup
series of mine[1] (I've included that in this series so we can merge it
directly). It fixes all minor bugs found in my previous RFC v4 series.
Additionally, I've updated the change logs in each patch and made some
code modifications following Dave's suggestions. This series implements
the core iomap APIs in ext4 and introduces a mount option called
"buffered_iomap" to enable the iomap buffered I/O path. We already
support the default features, the default mount options and the bigalloc
feature. However, we do not yet support online defragmentation, inline
data, fsverity, fscrypt, ext3, or data=journal mode; ext4 falls back to
the buffer_head I/O path automatically if you use those features and
options. Some of these features should be supported gradually in the
near future.

Most of the implementation resembles the original buffer_head path;
however, there are four key differences.

1. The first difference is block allocation in the writeback path. The
   iomap framework will invoke ->map_blocks() at least once for each
   dirty folio. To ensure optimal writeback performance, we aim to
   allocate a range of delalloc blocks that is as long as possible
   within the writeback length for each invocation. In certain
   situations, we may allocate a range of blocks that exceeds the amount
   we will actually write back. Therefore,
   1) We cannot allocate a written extent for those blocks because that
      may expose stale data in such short-write cases. Instead, we
      should allocate an unwritten extent, which means we must always
      enable the dioread_nolock option (a short sketch illustrating
      unwritten extents follows the series details below). This change
      could also bring many other benefits.
   2) We should postpone updating 'i_disksize' until the end of the I/O
      process, based on the actual written length. This approach can
      also prevent the exposure of zeroed data, which may occur if there
      is a power failure during an append write.
   3) We do not need to pre-split extents during writeback; we can
      postpone this task to the end-of-I/O process, while converting
      unwritten extents.

2. The second difference is that since we always allocate unwritten
   space for new blocks, there is no risk of exposing stale data. As a
   result, we do not need to order the data, which allows us to disable
   data=ordered mode. Consequently, we also do not require a reserved
   handle when converting the unwritten extent in the final I/O worker;
   we can directly start with a normal handle.

Series details:

Patches 1-10 are just another series of mine that refactors the
fallocate functions[1]. This series relies on the code context of that
one but has no logical dependencies on it. I put it here just for easy
access and merging.

Patches 11-21 implement the iomap buffered read/write path, the dirty
folio writeback path and the mmap path for ext4 regular files.

Patches 22-23 disable the unsupported online-defragmentation function
and disable changing the inode journal flag to data=journal mode.
Please look at those patches for details.
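As referenced above, a quick way to see unwritten extents in practice, independent of this series, is with fallocate and filefrag (/mnt/testfile is an illustrative path):

fallocate -l 1M /mnt/testfile    # preallocate blocks without writing data
filefrag -v /mnt/testfile        # the flags column shows "unwritten"

Reads from such an extent return zeroes until the blocks are written and converted, which is exactly why allocating unwritten extents at writeback time cannot expose stale data.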
Patches 24-27 introduce the "buffered_iomap" mount option (not enabled
by default for now) to partially enable the iomap buffered I/O path and
also enable large folios.


About performance:

Fio tests with psync on my machine with an Intel Xeon Gold 6240 CPU,
400GB of system RAM, a 200GB ramdisk and a 4TB NVMe SSD.

fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
    -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
    -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
    -group_reporting -name=$name --output=/tmp/test_log

(A minimal sketch for reproducing a single data point follows the
changelog below.)

== buffer read ==

                   buffer_head          iomap + large folio
 type     bs     IOPS   BW(MiB/s)     IOPS   BW(MiB/s)
 -------------------------------------------------------
 hole     4K     576k   2253          762k   2975   +32%
 hole     64K    48.7k  3043          77.8k  4860   +60%
 hole     1M     2960   2960          4942   4942   +67%
 ramdisk  4K     443k   1732          530k   2069   +19%
 ramdisk  64K    34.5k  2156          45.6k  2850   +32%
 ramdisk  1M     2093   2093          2841   2841   +36%
 nvme     4K     339k   1323          364k   1425   +8%
 nvme     64K    23.6k  1471          25.2k  1574   +7%
 nvme     1M     2012   2012          2153   2153   +7%

== buffer write ==

                                          buffer_head        iomap + large folio
 type     Overwrite Sync Writeback  bs    IOPS   BW(MiB/s)   IOPS   BW(MiB/s)
 ----------------------------------------------------------------------------
 cache    N         N    N          4K    417k   1631        440k   1719   +5%
 cache    N         N    N          64K   33.4k  2088        81.5k  5092   +144%
 cache    N         N    N          1M    2143   2143        5716   5716   +167%
 cache    Y         N    N          4K    449k   1755        469k   1834   +5%
 cache    Y         N    N          64K   36.6k  2290        82.3k  5142   +125%
 cache    Y         N    N          1M    2352   2352        5577   5577   +137%
 ramdisk  N         N    Y          4K    365k   1424        354k   1384   -3%
 ramdisk  N         N    Y          64K   31.2k  1950        74.2k  4640   +138%
 ramdisk  N         N    Y          1M    1968   1968        5201   5201   +164%
 ramdisk  N         Y    N          4K    9984   39          12.9k  51     +29%
 ramdisk  N         Y    N          64K   5936   371         8960   560    +51%
 ramdisk  N         Y    N          1M    1050   1050        1835   1835   +75%
 ramdisk  Y         N    Y          4K    411k   1609        443k   1731   +8%
 ramdisk  Y         N    Y          64K   34.1k  2134        77.5k  4844   +127%
 ramdisk  Y         N    Y          1M    2248   2248        5372   5372   +139%
 ramdisk  Y         Y    N          4K    182k   711         186k   730    +3%
 ramdisk  Y         Y    N          64K   18.7k  1170        34.7k  2171   +86%
 ramdisk  Y         Y    N          1M    1229   1229        2269   2269   +85%
 nvme     N         N    Y          4K    373k   1458        387k   1512   +4%
 nvme     N         N    Y          64K   29.2k  1827        70.9k  4431   +143%
 nvme     N         N    Y          1M    1835   1835        4919   4919   +168%
 nvme     N         Y    N          4K    11.7k  46          11.7k  46     0%
 nvme     N         Y    N          64K   6453   403         8661   541    +34%
 nvme     N         Y    N          1M    649    649         1351   1351   +108%
 nvme     Y         N    Y          4K    372k   1456        433k   1693   +16%
 nvme     Y         N    Y          64K   33.0k  2064        74.7k  4669   +126%
 nvme     Y         N    Y          1M    2131   2131        5273   5273   +147%
 nvme     Y         Y    N          4K    56.7k  222         56.4k  220    -1%
 nvme     Y         Y    N          64K   13.4k  840         19.4k  1214   +45%
 nvme     Y         Y    N          1M    714    714         1504   1504   +111%

Thanks,
Yi.

Major changes since RFC v4:
- Disable unsupported online defragmentation; do not fall back to the
  buffer_head path.
- Write and wait data back while doing a partial block truncate down, to
  fix a stale data problem.
- Disable the online changing of the inode journal flag to data=journal
  mode.
- Since iomap can zero out dirty pages over an unwritten extent, do not
  write data before zeroing out in ext4_zero_range(), and also do not
  zero partial blocks under a started journal handle.
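As referenced above, a minimal sketch for reproducing a single data point with the series applied (the device path is illustrative; the buffered_iomap option only exists with these patches, and mkfs destroys the device's contents):

mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 -F /dev/nvme0n1p1
mount -o buffered_iomap /dev/nvme0n1p1 /mnt
fio -directory=/mnt -direct=0 -iodepth=1 -rw=read -numjobs=1 -bs=4k \
    -ioengine=psync -size=40G -runtime=60 -fallocate=none \
    -group_reporting -name=bufread --output=/tmp/test_log

Mounting with nobuffered_iomap instead gives the buffer_head baseline from the tables above.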
[1] https://lore.kernel.org/linux-ext4/20241010133333.146793-1-yi.zhang@huawei.com/

---
RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/

Zhang Yi (27):
  ext4: remove writable userspace mappings before truncating page cache
  ext4: don't explicit update times in ext4_fallocate()
  ext4: don't write back data before punch hole in nojournal mode
  ext4: refactor ext4_punch_hole()
  ext4: refactor ext4_zero_range()
  ext4: refactor ext4_collapse_range()
  ext4: refactor ext4_insert_range()
  ext4: factor out ext4_do_fallocate()
  ext4: move out inode_lock into ext4_fallocate()
  ext4: move out common parts into ext4_fallocate()
  ext4: use reserved metadata blocks when splitting extent on endio
  ext4: introduce seq counter for the extent status entry
  ext4: add a new iomap aops for regular file's buffered IO path
  ext4: implement buffered read iomap path
  ext4: implement buffered write iomap path
  ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP
  ext4: implement writeback iomap path
  ext4: implement mmap iomap path
  ext4: do not always order data when partial zeroing out a block
  ext4: do not start handle if unnecessary while partial zeroing out a
    block
  ext4: implement zero_range iomap path
  ext4: disable online defrag when inode using iomap buffered I/O path
  ext4: disable inode journal mode when using iomap buffered I/O path
  ext4: partially enable iomap for the buffered I/O path of regular
    files
  ext4: enable large folio for regular file with iomap buffered I/O path
  ext4: change mount options code style
  ext4: introduce a mount option for iomap buffered I/O path

 fs/ext4/ext4.h              |  17 +-
 fs/ext4/ext4_jbd2.c         |   3 +-
 fs/ext4/ext4_jbd2.h         |   8 +
 fs/ext4/extents.c           | 568 +++++++++++----------------
 fs/ext4/extents_status.c    |  13 +-
 fs/ext4/file.c              |  19 +-
 fs/ext4/ialloc.c            |   5 +
 fs/ext4/inode.c             | 755 ++++++++++++++++++++++++++++++------
 fs/ext4/move_extent.c       |   7 +
 fs/ext4/page-io.c           | 105 +++++
 fs/ext4/super.c             | 185 ++++-----
 include/trace/events/ext4.h |  57 +--
 12 files changed, 1153 insertions(+), 589 deletions(-)