significant drop fio IOPS performance on v5.10

Message ID	5c819c9d6190452f9b10bb78a72cb47f@amazon.com
State	New
Headers	show Return-Path: <SRS0=Yr0J=Z7=vger.kernel.org=linux-ext4-owner@ozlabs.org> From: "Lu, Davina" <davinalu@amazon.com> To: "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>, "tytso@mit.edu" <tytso@mit.edu>, "adilger.kernel@dilger.ca" <adilger.kernel@dilger.ca>, "regressions@lists.linux.dev" <regressions@lists.linux.dev>, "stable@vger.kernel.org" <stable@vger.kernel.org>, "Mohamed Abuelfotoh, Hazem" <abuehaze@amazon.com>, hazem ahmed mohamed <hazem.ahmed.abuelfotoh@gmail.com>, "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> CC: "Kiselev, Oleg" <okiselev@amazon.com>, "Liu, Frank" <franklmz@amazon.com> Subject: significant drop fio IOPS performance on v5.10 Thread-Topic: significant drop fio IOPS performance on v5.10 Thread-Index: AdjO5vTzPItE0he4TsCHXHwBRGK6iAB1NzvQAJElJTA= Date: Wed, 28 Sep 2022 06:07:34 +0000 Message-ID: <5c819c9d6190452f9b10bb78a72cb47f@amazon.com> References: <357ace228adf4e859df5e9f3f4f18b49@amazon.com> <1cdc68e6a98d4e93a95be5d887bcc75d@amazon.com> In-Reply-To: <1cdc68e6a98d4e93a95be5d887bcc75d@amazon.com> Accept-Language: en-US Content-Language: en-US Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 Precedence: bulk
Series	significant drop fio IOPS performance on v5.10 \| expand significant drop fio IOPS performance on v5.10

Lu, Davina Sept. 28, 2022, 6:07 a.m. UTC

Hello,

I was profiling the 5.10 kernel and comparing it to 4.14.  On a system with 64 virtual CPUs and 256 GiB of RAM, I am observing a significant drop in IO performance. Using the following FIO with the script "sudo ftest_write.sh <dev_name>" in attachment, I saw FIO iops result drop from 22K to less than 1K. 
The script simply does: mount a the EXT4 16GiB volume with max IOPS 64000K, mounting option is " -o noatime,nodiratime,data=ordered", then run fio with 2048 fio wring thread with 28800000 file size with { --name=16kb_rand_write_only_2048_jobs --directory=/rdsdbdata1 --rw=randwrite --ioengine=sync --buffered=1 --bs=16k --max-jobs=2048 --numjobs=2048 --runtime=60 --time_based --thread --filesize=28800000 --fsync=1 --group_reporting }.

My analyzing is that the degradation is introduce by commit {244adf6426ee31a83f397b700d964cff12a247d3} and the issue is the contention on rsv_conversion_wq.  The simplest option is to increase the journal size, but that introduces more operational complexity.  Another option is to add the following change in attachment "allow more ext4-rsv-conversion workqueue.patch"

From 27e1b0e14275a281b3529f6a60c7b23a81356751 Mon Sep 17 00:00:00 2001
From: davinalu <davinalu@amazon.com>
Date: Fri, 23 Sep 2022 00:43:53 +0000
Subject: [PATCH] allow more ext4-rsv-conversion workqueue to speedup fio  writing
---
 fs/ext4/super.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

My thought is: If the max_active is 1, it means the "__WQ_ORDERED" combined with WQ_UNBOUND setting, based on alloc_workqueue(). So I added it .
I am not sure should we need "__WQ_ORDERED" or not? without "__WQ_ORDERED" it looks also work at my testbed, but I added since not much fio TP difference on my testbed result with/out "__WQ_ORDERED".

From My understanding and observation: with dioread_unlock and delay_alloc both enabled,  the  bio_endio() and ext4_writepages() will trigger this work queue to ext4_do_flush_completed_IO(). Looks like the work queue is an one-by-one updating: at EXT4 extend.c io_end->list_vec  list only have one io_end_vec each time. So if the BIO has high performance, and we have only one thread to do EXT4 flush will be an bottleneck here. The "ext4-rsv-conversion" this workqueue is mainly for update the EXT4_IO_END_UNWRITTEN extend block(only exist on dioread_unlock and delay_alloc options are set) and extend status  if I understand correctly here. Am  I correct?

This works on my test system and passes xfstests, but  will this cause any corruption on ext4 extends blocks updates, not even sure about the journal transaction updates either?
Can you tell me what I will break if this change is made?

Thanks
Davina

Theodore Ts'o Sept. 28, 2022, 10:36 p.m. UTC | #1

On Wed, Sep 28, 2022 at 06:07:34AM +0000, Lu, Davina wrote:
> 
> Hello,
> 
> My analyzing is that the degradation is introduce by commit
> {244adf6426ee31a83f397b700d964cff12a247d3} and the issue is the
> contention on rsv_conversion_wq.  The simplest option is to increase
> the journal size, but that introduces more operational complexity.
> Another option is to add the following change in attachment "allow
> more ext4-rsv-conversion workqueue.patch"

Hi, thanks for the patch.  However, I don't believe as written it is
safe.  That's because your patch will allow multiple CPU's to run
ext4_flush_completed_IO(), and this function is not set up to be safe
to be run concurrently.  That is, I don't see the necessary locking if
we have two CPU's trying to convert unwritten extents on the same
inode.

It could be made safe, and certainly if the problem is that you are
worried about contention across multiple inodes being written to by
different FIO jobs, then certainly this could be a potential
contention issue.

However, what doesn't make sense is that increasing the journal size
also seems to fix the issue for you.  That implies the problem is one
of the journal being to small, and so you are running into an issue of
stalls caused by the need to do a synchronous checkpoint to clear
space in the journal.  That is a different problem than one of there
being a contention problem with rsv_conversion_wq.

So I want to make sure we understand what you are seeing before we
make such a change.  One potential concern is that will cause a large
number of additional kernel threads.  Now, if these extra kernel
threads are doing useful work, then that's not necessarily an issue.
But if not, then it's going to be burning a large amount of kernel
memory (especially for a system with a large number of CPU cores).

Regards,

					- Ted

Linux regression tracking (Thorsten Leemhuis) Sept. 29, 2022, 7:03 a.m. UTC | #2

TWIMC: this mail is primarily send for documentation purposes and for
regzbot, my Linux kernel regression tracking bot. These mails usually
contain '#forregzbot' in the subject, to make them easy to spot and filter.

[TLDR: I'm adding this regression report to the list of tracked
regressions; all text from me you find below is based on a few templates
paragraphs you might have encountered already already in similar form.]

Hi, this is your Linux kernel regression tracker.

On 28.09.22 08:07, Lu, Davina wrote:
> 
> Hello,
> 
> I was profiling the 5.10 kernel and comparing it to 4.14.  On a system with 64 virtual CPUs and 256 GiB of RAM, I am observing a significant drop in IO performance. Using the following FIO with the script "sudo ftest_write.sh <dev_name>" in attachment, I saw FIO iops result drop from 22K to less than 1K. 
> The script simply does: mount a the EXT4 16GiB volume with max IOPS 64000K, mounting option is " -o noatime,nodiratime,data=ordered", then run fio with 2048 fio wring thread with 28800000 file size with { --name=16kb_rand_write_only_2048_jobs --directory=/rdsdbdata1 --rw=randwrite --ioengine=sync --buffered=1 --bs=16k --max-jobs=2048 --numjobs=2048 --runtime=60 --time_based --thread --filesize=28800000 --fsync=1 --group_reporting }.
> 
> My analyzing is that the degradation is introduce by commit {244adf6426ee31a83f397b700d964cff12a247d3} and the issue is the contention on rsv_conversion_wq.  The simplest option is to increase the journal size, but that introduces more operational complexity.  Another option is to add the following change in attachment "allow more ext4-rsv-conversion workqueue.patch"
> 
> From 27e1b0e14275a281b3529f6a60c7b23a81356751 Mon Sep 17 00:00:00 2001
> From: davinalu <davinalu@amazon.com>
> Date: Fri, 23 Sep 2022 00:43:53 +0000
> Subject: [PATCH] allow more ext4-rsv-conversion workqueue to speedup fio  writing
> ---
>  fs/ext4/super.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c index a0af833f7da7..6b34298cdc3b 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4963,7 +4963,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>          * concurrency isn't really necessary.  Limit it to 1.
>          */
>         EXT4_SB(sb)->rsv_conversion_wq =
> -               alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
> +               alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | 
> + WQ_UNBOUND | __WQ_ORDERED, 0);
>         if (!EXT4_SB(sb)->rsv_conversion_wq) {
>                 printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
>                 ret = -ENOMEM;
> 
> My thought is: If the max_active is 1, it means the "__WQ_ORDERED" combined with WQ_UNBOUND setting, based on alloc_workqueue(). So I added it .
> I am not sure should we need "__WQ_ORDERED" or not? without "__WQ_ORDERED" it looks also work at my testbed, but I added since not much fio TP difference on my testbed result with/out "__WQ_ORDERED".
> 
> From My understanding and observation: with dioread_unlock and delay_alloc both enabled,  the  bio_endio() and ext4_writepages() will trigger this work queue to ext4_do_flush_completed_IO(). Looks like the work queue is an one-by-one updating: at EXT4 extend.c io_end->list_vec  list only have one io_end_vec each time. So if the BIO has high performance, and we have only one thread to do EXT4 flush will be an bottleneck here. The "ext4-rsv-conversion" this workqueue is mainly for update the EXT4_IO_END_UNWRITTEN extend block(only exist on dioread_unlock and delay_alloc options are set) and extend status  if I understand correctly here. Am  I correct?
> 
> This works on my test system and passes xfstests, but  will this cause any corruption on ext4 extends blocks updates, not even sure about the journal transaction updates either?
> Can you tell me what I will break if this change is made?

Thanks for the report. To be sure below issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, my Linux kernel regression
tracking bot:

#regzbot ^introduced 244adf6426ee31a8
#regzbot title ext4: significant drop  fio IOPS performance on v5.10
#regzbot backburner: performance regression introduced a while ago that
will take some time to get analyzed and fixed
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply -- ideally with also
telling regzbot about it, as explained here:
https://linux-regtracking.leemhuis.info/tracked-regression/

Reminder for developers: When fixing the issue, add 'Link:' tags
pointing to the report (the mail this one replies to), as explained for
in the Linux kernel's documentation; above webpage explains why this is
important for tracked regressions.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)

P.S.: As the Linux kernel's regression tracker I deal with a lot of
reports and sometimes miss something important when writing mails like
this. If that's the case here, don't hesitate to tell me in a public
reply, it's in everyone's interest to set the public record straight.

Theodore Ts'o Sept. 29, 2022, 11:36 a.m. UTC | #3

This is a repeat of a discussion regarding changing the default of
dioread_nolock.  Note that you can "revert" this change in defaults by
using the mount options "nodioread_nolock" (or "dioread_lock").  There
were some good reasons why the change in the default was made, though.

#regzbot invalid: caused by a change of defaults that (among others) was
done for security reasons, see Ted's answer in
https://lore.kernel.org/all/Yypx6VQRbl3bFP2v@mit.edu/ for details

					- Ted

Lu, Davina Oct. 5, 2022, 7:24 a.m. UTC | #4

Hi Ted,

Thanks for your reply! Sorry for the late reply since I was busy on another urgent EXT4 issue recently.

Understood the concern about the patch here-- the patch is just a simple start and good suggestions on to make it safe update in multiple CPUs.
1.  Max thread number could not really need 256. From my side, ~30 could be good enough and this only apply with cpu_num >= some_threads (E.g. 16), we can do this in a sysctl/mount option etc...
2. Definitely agree that we need to make safe way and thanks for clear explanation.  A quick question, so if we make sure converting unwritten extents on same inode  is in "atomic" way, then it should be a good patch to start with?

About the larger the journal size -- which can help this IOPS issue, I found is, with small journal size with bad IOPS, journal transaction are more frequently (e.g. from 450 to 700 in a period of time) and each transaction has very less handlers (e.g from 1000 to 70).
Tracing the code, I found: the reserved block exists in each journal update in EXT4 which requires more blocks, then in 
add_transaction_credits()
if (jbd2_log_space_left(journal) < journal->j_max_transaction_buffers) { //since journal size is too small, it has less change to goes into this if to start an journal update.
            if (jbd2_log_space_left(journal) <
                                                journal->j_max_transaction_buffers) 
                   __jbd2_log_wait_for_space(journal); // good to trigger commit-> 
                   jbd2_log_do_checkpoint() ->  jbd2_log_start_commit() ->
                    __jbd2_log_start_commit  -> journal->j_commit_request = target;

So most of time, in this func, it has to wait for journal commit... The slower journal update could affect the IO performance from my understanding (fio test is like: one pwrite64 and then followed one fsync). Therefore if we larger the journal size from 128M to 2GB, mostly the "if" condition can go into to ease this problem.

Thanks
Davina

-----Original Message-----
From: Theodore Ts'o <tytso@mit.edu> 
Sent: Thursday, September 29, 2022 8:36 AM
To: Lu, Davina <davinalu@amazon.com>
Cc: linux-ext4@vger.kernel.org; adilger.kernel@dilger.ca; regressions@lists.linux.dev; stable@vger.kernel.org; Mohamed Abuelfotoh, Hazem <abuehaze@amazon.com>; hazem ahmed mohamed <hazem.ahmed.abuelfotoh@gmail.com>; Ritesh Harjani (IBM) <ritesh.list@gmail.com>; Kiselev, Oleg <okiselev@amazon.com>; Liu, Frank <franklmz@amazon.com>
Subject: RE: [EXTERNAL]significant drop fio IOPS performance on v5.10

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.

On Wed, Sep 28, 2022 at 06:07:34AM +0000, Lu, Davina wrote:
>
> Hello,
>
> My analyzing is that the degradation is introduce by commit 
> {244adf6426ee31a83f397b700d964cff12a247d3} and the issue is the 
> contention on rsv_conversion_wq.  The simplest option is to increase 
> the journal size, but that introduces more operational complexity.
> Another option is to add the following change in attachment "allow 
> more ext4-rsv-conversion workqueue.patch"

Hi, thanks for the patch.  However, I don't believe as written it is safe.  That's because your patch will allow multiple CPU's to run ext4_flush_completed_IO(), and this function is not set up to be safe to be run concurrently.  That is, I don't see the necessary locking if we have two CPU's trying to convert unwritten extents on the same inode.

It could be made safe, and certainly if the problem is that you are worried about contention across multiple inodes being written to by different FIO jobs, then certainly this could be a potential contention issue.

However, what doesn't make sense is that increasing the journal size also seems to fix the issue for you.  That implies the problem is one of the journal being to small, and so you are running into an issue of stalls caused by the need to do a synchronous checkpoint to clear space in the journal.  That is a different problem than one of there being a contention problem with rsv_conversion_wq.

So I want to make sure we understand what you are seeing before we make such a change.  One potential concern is that will cause a large number of additional kernel threads.  Now, if these extra kernel threads are doing useful work, then that's not necessarily an issue.
But if not, then it's going to be burning a large amount of kernel memory (especially for a system with a large number of CPU cores).

Regards,

                                        - Ted

Lu, Davina Nov. 9, 2022, 1:02 a.m. UTC | #5

Hi Ted,

For adding the lock per inode per CPU, I didn't use the "inode->i_data_sem" since this already been added to protect at ext4_map_blocks(). This will been called eventually by ext4-rsv-conversion workqueue. So not sure if we still need to another semaphore for per inode protection?

So I just simply add a new rw_semaphore under ext4_inode_info, also change the maximum queue size won't be more than supported CPU number. I tested fio and xfstest-dev ext4 cases and all good here.

Not sure your opinions about the protection, the patch is below and I also put into attachment.

Thanks
Davina

From 4906bba1e9ce127fb8a92cf709c3948ac92c617c Mon Sep 17 00:00:00 2001
From: davinalu <davinalu@amazon.com>
Date: Wed, 9 Nov 2022 00:44:25 +0000
Subject: [PATCH] ext4:allow_multi_rsv_conversion_queue

---
 fs/ext4/ext4.h    | 1 +
 fs/ext4/extents.c | 4 +++-
 fs/ext4/super.c   | 5 ++++-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3bf9a6926798..4414750b0134 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1166,6 +1166,7 @@ struct ext4_inode_info {
        atomic_t i_unwritten; /* Nr. of inflight conversions pending */

        spinlock_t i_block_reservation_lock;
+       struct rw_semaphore i_rsv_unwritten_sem;

        /*
         * Transactions that contain inode's metadata needed to complete
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5235974126bd..806393da92cf 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4808,8 +4808,9 @@ int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
                                break;
                        }
                }
+               down_write(&EXT4_I(inode)->i_rsv_unwritten_sem);
                ret = ext4_map_blocks(handle, inode, &map,
-                                     EXT4_GET_BLOCKS_IO_CONVERT_EXT);
+                               EXT4_GET_BLOCKS_IO_CONVERT_EXT);
                if (ret <= 0)
                        ext4_warning(inode->i_sb,
                                     "inode #%lu: block %u: len %u: "
@@ -4817,6 +4818,7 @@ int ext4_convert_unwritten_extents(handle_t *handle, struct inode *inode,
                                     inode->i_ino, map.m_lblk,
                                     map.m_len, ret);
                ret2 = ext4_mark_inode_dirty(handle, inode);
+               up_write(&EXT4_I(inode)->i_rsv_unwritten_sem);
                if (credits) {
                        ret3 = ext4_journal_stop(handle);
                        if (unlikely(ret3))
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 091db733834e..b3c7544798b8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1399,6 +1399,7 @@ static void init_once(void *foo)
        INIT_LIST_HEAD(&ei->i_orphan);
        init_rwsem(&ei->xattr_sem);
        init_rwsem(&ei->i_data_sem);
+       init_rwsem(&ei->i_rsv_unwritten_sem);
        inode_init_once(&ei->vfs_inode);
        ext4_fc_init_inode(&ei->vfs_inode);
 }
@@ -5212,7 +5213,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
         * concurrency isn't really necessary.  Limit it to 1.
         */
        EXT4_SB(sb)->rsv_conversion_wq =
-               alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+               alloc_workqueue("ext4-rsv-conversion",
+                               WQ_MEM_RECLAIM | WQ_UNBOUND | __WQ_ORDERED,
+                               num_active_cpus() > 1 ? num_active_cpus() : 1);
        if (!EXT4_SB(sb)->rsv_conversion_wq) {
                printk(KERN_ERR "EXT4-fs: failed to create workqueue\n");
                ret = -ENOMEM;
--
2.37.1

-----Original Message-----
From: Theodore Ts'o <tytso@mit.edu> 
Sent: Thursday, September 29, 2022 8:36 AM
To: Lu, Davina <davinalu@amazon.com>
Cc: linux-ext4@vger.kernel.org; adilger.kernel@dilger.ca; regressions@lists.linux.dev; stable@vger.kernel.org; Mohamed Abuelfotoh, Hazem <abuehaze@amazon.com>; hazem ahmed mohamed <hazem.ahmed.abuelfotoh@gmail.com>; Ritesh Harjani (IBM) <ritesh.list@gmail.com>; Kiselev, Oleg <okiselev@amazon.com>; Liu, Frank <franklmz@amazon.com>
Subject: RE: [EXTERNAL]significant drop fio IOPS performance on v5.10

CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.

On Wed, Sep 28, 2022 at 06:07:34AM +0000, Lu, Davina wrote:
>
> Hello,
>
> My analyzing is that the degradation is introduce by commit 
> {244adf6426ee31a83f397b700d964cff12a247d3} and the issue is the 
> contention on rsv_conversion_wq.  The simplest option is to increase 
> the journal size, but that introduces more operational complexity.
> Another option is to add the following change in attachment "allow 
> more ext4-rsv-conversion workqueue.patch"

Hi, thanks for the patch.  However, I don't believe as written it is safe.  That's because your patch will allow multiple CPU's to run ext4_flush_completed_IO(), and this function is not set up to be safe to be run concurrently.  That is, I don't see the necessary locking if we have two CPU's trying to convert unwritten extents on the same inode.

It could be made safe, and certainly if the problem is that you are worried about contention across multiple inodes being written to by different FIO jobs, then certainly this could be a potential contention issue.

However, what doesn't make sense is that increasing the journal size also seems to fix the issue for you.  That implies the problem is one of the journal being to small, and so you are running into an issue of stalls caused by the need to do a synchronous checkpoint to clear space in the journal.  That is a different problem than one of there being a contention problem with rsv_conversion_wq.

So I want to make sure we understand what you are seeing before we make such a change.  One potential concern is that will cause a large number of additional kernel threads.  Now, if these extra kernel threads are doing useful work, then that's not necessarily an issue.
But if not, then it's going to be burning a large amount of kernel memory (especially for a system with a large number of CPU cores).

Regards,

                                        - Ted

significant drop fio IOPS performance on v5.10

Commit Message

Comments

Patch