diff mbox series

[RFC,v4,33/34] ext4: don't mark IOMAP_F_DIRTY for buffer write

Message ID 20240410150313.2820364-5-yi.zhang@huaweicloud.com
State Superseded
Headers show
Series ext4: use iomap for regular file's buffered IO path and enable large folio | expand

Commit Message

Zhang Yi April 10, 2024, 3:03 p.m. UTC
From: Zhang Yi <yi.zhang@huawei.com>

The data sync dirty check in ext4_inode_datasync_dirty() is expansive
since jbd2_transaction_committed() holds journal->j_state lock when
journal is enabled, it costs a lot in high-concurrency iomap buffered
read/write paths, but we never check IOMAP_F_DIRTY in these cases, so
let's check it only in swap file, dax and direct IO cases. Tested by
Unixbench on 100GB ramdisk:

./Run -c 128 -i 10 fstime fsbuffer fsdisk

  == without this patch ==
  128 CPUs in system; running 128 parallel copies of tests

  File Copy 1024 bufsize 2000 maxblocks       6332521.0 KBps
  File Copy 256 bufsize 500 maxblocks         1639726.0 KBps
  File Copy 4096 bufsize 8000 maxblocks      24018572.0 KBps

  == with this patch ==
  128 CPUs in system; running 128 parallel copies of tests

  File Copy 1024 bufsize 2000 maxblocks      49229257.0 KBps
  File Copy 256 bufsize 500 maxblocks        24057510.0 KBps
  File Copy 4096 bufsize 8000 maxblocks      75704437.0 KBps

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/ext4/inode.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

Comments

Dave Chinner May 1, 2024, 9:27 a.m. UTC | #1
On Wed, Apr 10, 2024 at 11:03:12PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> The data sync dirty check in ext4_inode_datasync_dirty() is expansive
> since jbd2_transaction_committed() holds journal->j_state lock when
> journal is enabled, it costs a lot in high-concurrency iomap buffered
> read/write paths, but we never check IOMAP_F_DIRTY in these cases, so
> let's check it only in swap file, dax and direct IO cases. Tested by
> Unixbench on 100GB ramdisk:
> 
> ./Run -c 128 -i 10 fstime fsbuffer fsdisk
> 
>   == without this patch ==
>   128 CPUs in system; running 128 parallel copies of tests
> 
>   File Copy 1024 bufsize 2000 maxblocks       6332521.0 KBps
>   File Copy 256 bufsize 500 maxblocks         1639726.0 KBps
>   File Copy 4096 bufsize 8000 maxblocks      24018572.0 KBps
> 
>   == with this patch ==
>   128 CPUs in system; running 128 parallel copies of tests
> 
>   File Copy 1024 bufsize 2000 maxblocks      49229257.0 KBps
>   File Copy 256 bufsize 500 maxblocks        24057510.0 KBps
>   File Copy 4096 bufsize 8000 maxblocks      75704437.0 KBps
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/ext4/inode.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1cb219d347af..269503749ef5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3281,9 +3281,13 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>  	 * there is no other metadata changes being made or are pending.
>  	 */
>  	iomap->flags = 0;
> -	if (ext4_inode_datasync_dirty(inode) ||
> -	    offset + length > i_size_read(inode))
> -		iomap->flags |= IOMAP_F_DIRTY;
> +	if ((flags & (IOMAP_DAX | IOMAP_REPORT)) ||
> +	    ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) ==
> +	     (IOMAP_WRITE | IOMAP_DIRECT))) {
> +		if (offset + length > i_size_read(inode) ||
> +		    ext4_inode_datasync_dirty(inode))
> +			iomap->flags |= IOMAP_F_DIRTY;
> +	}

NACK. This just adds a nasty landmine that anyone working on the
iomap infrastructure can step on. i.e. any time we add a new check
for IOMAP_F_DIRTY in the generic infrastructure, ext4 is going to
break because it won't set the IOMAP_F_DIRTY flag correctly.

If checking an inode is dirty is expensive on ext4, then make it
less expensive and everyone will benefit.

/me goes and looks at jbd2_transaction_committed()

Oh, it it's just a sequence number comparison, and it needs a lock
because it has to dereference the running/committed transactions
structures to get the current sequence numbers. Why not just store
the commiting/running transaction tids in the journal_t, and then
you can sample them without needing any locking and the whole
ext4_inode_datasync_dirty() scalability problem goes away...

-Dave.
Zhang Yi May 6, 2024, 2:02 p.m. UTC | #2
On 2024/5/1 17:27, Dave Chinner wrote:
> On Wed, Apr 10, 2024 at 11:03:12PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> The data sync dirty check in ext4_inode_datasync_dirty() is expansive
>> since jbd2_transaction_committed() holds journal->j_state lock when
>> journal is enabled, it costs a lot in high-concurrency iomap buffered
>> read/write paths, but we never check IOMAP_F_DIRTY in these cases, so
>> let's check it only in swap file, dax and direct IO cases. Tested by
>> Unixbench on 100GB ramdisk:
>>
>> ./Run -c 128 -i 10 fstime fsbuffer fsdisk
>>
>>   == without this patch ==
>>   128 CPUs in system; running 128 parallel copies of tests
>>
>>   File Copy 1024 bufsize 2000 maxblocks       6332521.0 KBps
>>   File Copy 256 bufsize 500 maxblocks         1639726.0 KBps
>>   File Copy 4096 bufsize 8000 maxblocks      24018572.0 KBps
>>
>>   == with this patch ==
>>   128 CPUs in system; running 128 parallel copies of tests
>>
>>   File Copy 1024 bufsize 2000 maxblocks      49229257.0 KBps
>>   File Copy 256 bufsize 500 maxblocks        24057510.0 KBps
>>   File Copy 4096 bufsize 8000 maxblocks      75704437.0 KBps
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/ext4/inode.c | 10 +++++++---
>>  1 file changed, 7 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 1cb219d347af..269503749ef5 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3281,9 +3281,13 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
>>  	 * there is no other metadata changes being made or are pending.
>>  	 */
>>  	iomap->flags = 0;
>> -	if (ext4_inode_datasync_dirty(inode) ||
>> -	    offset + length > i_size_read(inode))
>> -		iomap->flags |= IOMAP_F_DIRTY;
>> +	if ((flags & (IOMAP_DAX | IOMAP_REPORT)) ||
>> +	    ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) ==
>> +	     (IOMAP_WRITE | IOMAP_DIRECT))) {
>> +		if (offset + length > i_size_read(inode) ||
>> +		    ext4_inode_datasync_dirty(inode))
>> +			iomap->flags |= IOMAP_F_DIRTY;
>> +	}
> 
> NACK. This just adds a nasty landmine that anyone working on the
> iomap infrastructure can step on. i.e. any time we add a new check
> for IOMAP_F_DIRTY in the generic infrastructure, ext4 is going to
> break because it won't set the IOMAP_F_DIRTY flag correctly.
> 
> If checking an inode is dirty is expensive on ext4, then make it
> less expensive and everyone will benefit.
> 
> /me goes and looks at jbd2_transaction_committed()
> 
> Oh, it it's just a sequence number comparison, and it needs a lock
> because it has to dereference the running/committed transactions
> structures to get the current sequence numbers. Why not just store
> the commiting/running transaction tids in the journal_t, and then
> you can sample them without needing any locking and the whole
> ext4_inode_datasync_dirty() scalability problem goes away...
> 

Indeed, it could be useful, and it seems could also simplify many
other jbd2 processes.

Thanks,
Yi.
diff mbox series

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1cb219d347af..269503749ef5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3281,9 +3281,13 @@  static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
 	 * there is no other metadata changes being made or are pending.
 	 */
 	iomap->flags = 0;
-	if (ext4_inode_datasync_dirty(inode) ||
-	    offset + length > i_size_read(inode))
-		iomap->flags |= IOMAP_F_DIRTY;
+	if ((flags & (IOMAP_DAX | IOMAP_REPORT)) ||
+	    ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) ==
+	     (IOMAP_WRITE | IOMAP_DIRECT))) {
+		if (offset + length > i_size_read(inode) ||
+		    ext4_inode_datasync_dirty(inode))
+			iomap->flags |= IOMAP_F_DIRTY;
+	}
 
 	if (map->m_flags & EXT4_MAP_NEW)
 		iomap->flags |= IOMAP_F_NEW;