[SRU,K,0/8] sbitmap: fix IO hung due to waiters not

Message ID 20230605054601.1410517-1-gerald.yang@canonical.com

Message

Gerald Yang June 5, 2023, 5:45 a.m. UTC
BugLink: https://bugs.launchpad.net/bugs/2022318

SRU Justification:

[ Impact ]

When running fio against an NVMe device on an AWS test instance with the 5.19 kernel, IOs get stuck and fio never finishes

fio command:
sudo fio --name=read_iops_test --filename=/dev/nvme1n1 --filesize=50G --time_based --ramp_time=2s --runtime=1m --ioengine=libaio --direct=1 --verify=0 --randrepeat=0 --bs=16K --iodepth=256 --rw=randread
read_iops_test: (g=0): rw=randread, bs=(R) 16.0KiB-16.0KiB, (W) 16.0KiB-16.0KiB, (T) 16.0KiB-16.0KiB, ioengine=libaio, iodepth=256
fio-3.28
Starting 1 process
Jobs: 1 (f=0): [/(1)][-.-%][eta 01m:02s]

IOs get completely stuck, and after a while the kernel log shows:
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.230970] INFO: task fio:2545 blocked for more than 120 seconds.
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.232878] Not tainted 5.19.0-43-generic #44~22.04.1-Ubuntu
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.234738] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237053] task:fio state:D stack: 0 pid: 2545 ppid: 2495 flags:0x00000002
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237057] Call Trace:
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237058] <TASK>
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237061] __schedule+0x257/0x5d0
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237066] schedule+0x68/0x110
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237068] io_schedule+0x46/0x80
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237069] blk_mq_get_tag+0x117/0x300
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237072] ? destroy_sched_domains_rcu+0x40/0x40
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237076] __blk_mq_alloc_requests+0xc4/0x1e0
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237079] blk_mq_get_new_requests+0xf6/0x1a0
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237080] blk_mq_submit_bio+0x1eb/0x440
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237082] __submit_bio+0x109/0x1a0
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237085] submit_bio_noacct_nocheck+0xc2/0x120
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237087] submit_bio_noacct+0x209/0x590
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237088] submit_bio+0x40/0xf0
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237090] __blkdev_direct_IO_async+0x146/0x1f0
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237092] blkdev_direct_IO.part.0+0x40/0xa0
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237093] blkdev_read_iter+0x9f/0x1a0
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237094] aio_read+0xec/0x1d0
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237097] ? __io_submit_one.constprop.0+0x113/0x200
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237099] __io_submit_one.constprop.0+0x113/0x200
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237100] ? __io_submit_one.constprop.0+0x113/0x200
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237101] io_submit_one+0xe8/0x3d0
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237103] __x64_sys_io_submit+0x84/0x190
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237104] ? do_syscall_64+0x69/0x90
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237106] ? do_syscall_64+0x69/0x90
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237107] do_syscall_64+0x59/0x90
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237108] ? syscall_exit_to_user_mode+0x2a/0x50
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237110] ? do_syscall_64+0x69/0x90
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237111] entry_SYSCALL_64_after_hwframe+0x63/0xcd
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237113] RIP: 0033:0x7f44f351ea3d
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237116] RSP: 002b:00007fff1dcfe558 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237117] RAX: ffffffffffffffda RBX: 00007f44f2272b68 RCX: 00007f44f351ea3d
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237118] RDX: 000056315d9ad828 RSI: 0000000000000001 RDI: 00007f44f224f000
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237119] RBP: 00007f44f224f000 R08: 00007f44e9430000 R09: 00000000000002d8
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237120] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237120] R13: 0000000000000000 R14: 000056315d9ad828 R15: 000056315d9e1830
Jun 1 03:57:52 ip-172-31-39-141 kernel: [ 370.237122] </TASK>

This issue cannot be reproduced on the 5.15 and 6.2 kernels

From the call trace, the task was stuck for more than 120 seconds waiting for previous IOs to complete and free some tags, so that new IO requests could obtain tags
But in fact not all previous IOs were stuck: at least some of them should have completed, yet the waiters were never woken up

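As a rough illustration, the allocation path in the trace has the shape sketched below (a hand-written simplification, not the actual 5.19 blk_mq_get_tag(); the waitqueue selection is reduced and get_tag_sketch() is a made-up name):

#include <linux/sbitmap.h>
#include <linux/sched.h>

static int get_tag_sketch(struct sbitmap_queue *bt, atomic_t *wait_index)
{
        /* Pick one of the sbitmap waitqueues to park on. */
        struct sbq_wait_state *ws = sbq_wait_ptr(bt, wait_index);
        DEFINE_SBQ_WAIT(wait);
        int tag;

        for (;;) {
                /* Fast path: try to grab a free tag. */
                tag = __sbitmap_queue_get(bt);
                if (tag != -1)
                        break;
                /* Queue ourselves before the final retry, so a wake-up
                 * arriving between the retry and the sleep is not missed. */
                sbitmap_prepare_to_wait(bt, ws, &wait, TASK_UNINTERRUPTIBLE);
                tag = __sbitmap_queue_get(bt);
                if (tag != -1)
                        break;
                /* Sleep until a completion frees tags and wakes us up:
                 * this is the io_schedule() frame in the trace above. */
                io_schedule();
        }
        sbitmap_finish_wait(bt, ws, &wait);
        return tag;
}

There is no timeout in this loop: if the completion side never issues the wake-up, the task sleeps in D state forever, which is exactly what the hung task watchdog reports above.
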
This issue is fixed by the upstream commit below, which was merged in kernel 6.1:
commit 4acb83417cadfdcbe64215f9d0ddcf3132af808e
Author: Keith Busch <kbusch@kernel.org>
Date: Fri Sep 9 11:40:22 2022 -0700

    sbitmap: fix batched wait_cnt accounting

    Batched completions can clear multiple bits, but we're only decrementing
    the wait_cnt by one each time. This can cause waiters to never be woken,
    stalling IO. Use the batched count instead.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=215679
    Signed-off-by: Keith Busch <kbusch@kernel.org>
    Link: https://lore.kernel.org/r/20220909184022.1709476-1-kbusch@fb.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

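In sbitmap, each waitqueue carries a wake_batch quota (wait_cnt) and waiters are only woken once that quota is consumed. The sketch below shows the accounting idea only, not the literal upstream diff (the real __sbq_wake_up() also has to cope with concurrent wakers and with wait_cnt going negative); sbq_wake_up_sketch() is a made-up name:

static bool sbq_wake_up_sketch(struct sbitmap_queue *sbq, int nr_cleared)
{
        /* Pick the waitqueue whose quota we are consuming. */
        struct sbq_wait_state *ws = sbq_wake_ptr(sbq);
        int wait_cnt;

        if (!ws)
                return false;

        /*
         * The bug: this was effectively atomic_dec_return(&ws->wait_cnt),
         * so a batched completion clearing N tags was accounted as one.
         * The fix: consume the whole batch.
         */
        wait_cnt = atomic_sub_return(nr_cleared, &ws->wait_cnt);
        if (wait_cnt > 0)
                return false;

        /* Quota consumed: rearm this waitqueue, advance to the next one,
         * and wake up a batch of waiters. */
        atomic_set(&ws->wait_cnt, sbq->wake_batch);
        sbq_index_atomic_inc(&sbq->wake_index);
        wake_up_nr(&ws->wait, sbq->wake_batch);
        return true;
}

With the old one-per-call decrement, a workload completing tags mostly in batches could leave wait_cnt stuck above zero, so wake_up_nr() was never reached and the tasks parked in blk_mq_get_tag() never ran again.
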
To SRU the commit to the 5.19 kernel, we also need to SRU all of its dependencies and one further fix:

30514bd2dd4e sbitmap: fix lockup while swapping
4acb83417cad sbitmap: fix batched wait_cnt accounting
c35227d4e8cb sbitmap: Use atomic_long_try_cmpxchg in __sbitmap_queue_get_batch
48c033314f37 sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up()
bce1b56c7382 Revert "sbitmap: fix batched wait_cnt accounting"
16ede66973c8 sbitmap: fix batched wait_cnt accounting
ddbfc34fcf5d sbitmap: remove unnecessary code in __sbitmap_queue_get_batch
040b83fcecfb sbitmap: fix possible io hung due to lost wakeup

[ Test Plan ]

This can be reproduced simply by launching an instance on AWS EC2 and running the fio command above against an NVMe device for a few hours, to make sure IOs don't get stuck

I've built a test kernel with the above commits on top of the 5.19.0-43 generic kernel here:
https://launchpad.net/~gerald-yang-tw/+archive/ubuntu/361041-generic

With this test kernel, fio has been running for a few hours without any issues

[ Where problems could occur ]

sbitmap is mainly used by blk-mq in the block layer, by SCSI drivers, and by the Fungible ethernet driver
If this series introduced any issue in sbitmap, the symptom would be hung IO, or packets getting stuck in the Fungible driver

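For reference, every sbitmap user goes through the same allocate/free cycle, sketched below with the public sbitmap_queue API (tag_cycle_sketch() is a made-up name and error handling is trimmed); a regression in this series would show up as the wake-up triggered by the clear step never reaching a sleeping allocator:

#include <linux/sbitmap.h>
#include <linux/smp.h>

static int tag_cycle_sketch(void)
{
        struct sbitmap_queue sbq;
        int tag, ret;

        /* 128 tags, default shift, no round-robin allocation. */
        ret = sbitmap_queue_init_node(&sbq, 128, -1, false, GFP_KERNEL,
                                      NUMA_NO_NODE);
        if (ret)
                return ret;

        /* Allocate: returns -1 when all tags are in use. */
        tag = __sbitmap_queue_get(&sbq);
        if (tag >= 0)
                /* Free: this is the path that must wake any waiters. */
                sbitmap_queue_clear(&sbq, tag, raw_smp_processor_id());

        sbitmap_queue_free(&sbq);
        return 0;
}
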
Hugh Dickins (1):
  sbitmap: fix lockup while swapping

Jan Kara (1):
  sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up()

Jens Axboe (1):
  Revert "sbitmap: fix batched wait_cnt accounting"

Keith Busch (2):
  sbitmap: fix batched wait_cnt accounting
  sbitmap: fix batched wait_cnt accounting

Liu Song (1):
  sbitmap: remove unnecessary code in __sbitmap_queue_get_batch

Uros Bizjak (1):
  sbitmap: Use atomic_long_try_cmpxchg in __sbitmap_queue_get_batch

Yu Kuai (1):
  sbitmap: fix possible io hung due to lost wakeup

 block/blk-mq-tag.c      |   2 +-
 include/linux/sbitmap.h |   3 +-
 lib/sbitmap.c           | 109 ++++++++++++++++++++++++++--------------
 3 files changed, 73 insertions(+), 41 deletions(-)

Comments

Andrei Gherzan June 5, 2023, 9:41 a.m. UTC | #1
On 23/06/05 01:45PM, Gerald Yang wrote:
> To SRU the commit to the 5.19 kernel, we also need to SRU all of its dependencies and one further fix:
> 

All of the commits, as you mention above, have landed upstream (6.1). As
such, I would have expected the relevant footers, including your
Signed-off-by:

"(cherry picked from" ... or "(backported from ..."
Signed-off-by: ...

> 30514bd2dd4e sbitmap: fix lockup while swapping

This was a clean cherry-pick from
30514bd2dd4e86a3ecfd6a93a3eadf7b9ea164a0.

> 4acb83417cad sbitmap: fix batched wait_cnt accounting

Clean cherry-pick:
4acb83417cadfdcbe64215f9d0ddcf3132af808e.

> c35227d4e8cb sbitmap: Use atomic_long_try_cmpxchg in __sbitmap_queue_get_batch

Clean cherry-pick:
c35227d4e8cbc70a6622cc7cc5f8c3bff513f1fa.

> 48c033314f37 sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up()

Clean cherry-pick:
48c033314f372478548203c583529f53080fd078.

> bce1b56c7382 Revert "sbitmap: fix batched wait_cnt accounting"

I don't understand the meaning of this revert. You are adding the
commit and reverting it later. That revert is also upstream
(bce1b56c73826fec8caf6187f0c922ede397a5a8) but is
this "dance" needed for this submission?

> 16ede66973c8 sbitmap: fix batched wait_cnt accounting

4acb83417cadfdcbe64215f9d0ddcf3132af808e doesn't apply cleanly. Your
submission does, so this should have been mentioned as a backport with
some details on its implications.

> ddbfc34fcf5d sbitmap: remove unnecessary code in __sbitmap_queue_get_batch

Clean cherry-pick:
ddbfc34fcf5d0bc33b006b90c580c56edeb31068.

> 040b83fcecfb sbitmap: fix possible io hung due to lost wakeup

Clean cherry-pick:
040b83fcecfb86f3225d3a5de7fd9b3fbccf83b4.

> 
> [ Test Plan ]
> 
> This can be simply reproduced by launching an instance on AWS EC2
> And run the fio command on a nvme device for few hours to make sure IOs don’t get stuck
> 
> I’ve built a test kernel with above commits on 5.19.0-43 generic kernel here:
> https://launchpad.net/~gerald-yang-tw/+archive/ubuntu/361041-generic
> 
> With this test kernel, fio has been running for few hours without any issue
> 
> [ Where problems could occur ]
> 
> The sbitmap is mainly used in blk-mq in block layer, scsi drivers and fungible ethernet driver
> If there is any issue happens in sbitmap, the symptom should be IO hung, or packets get stuck in fungible driver
> 
> Hugh Dickins (1):
>   sbitmap: fix lockup while swapping
> 
> Jan Kara (1):
>   sbitmap: Avoid leaving waitqueue in invalid state in __sbq_wake_up()
> 
> Jens Axboe (1):
>   Revert "sbitmap: fix batched wait_cnt accounting"
> 
> Keith Busch (2):
>   sbitmap: fix batched wait_cnt accounting
>   sbitmap: fix batched wait_cnt accounting
> 
> Liu Song (1):
>   sbitmap: remove unnecessary code in __sbitmap_queue_get_batch
> 
> Uros Bizjak (1):
>   sbitmap: Use atomic_long_try_cmpxchg in __sbitmap_queue_get_batch
> 
> Yu Kuai (1):
>   sbitmap: fix possible io hung due to lost wakeup
> 
>  block/blk-mq-tag.c      |   2 +-
>  include/linux/sbitmap.h |   3 +-
>  lib/sbitmap.c           | 109 ++++++++++++++++++++++++++--------------
>  3 files changed, 73 insertions(+), 41 deletions(-)

I'll avoid NACK-ing for now, as we could fix all of these when we apply,
but I would resubmit.
Andrei Gherzan June 5, 2023, 9:46 a.m. UTC | #2
On 23/06/05 10:41AM, Andrei Gherzan wrote:
> On 23/06/05 01:45PM, Gerald Yang wrote:
> > BugLink: https://bugs.launchpad.net/bugs/2022318

You are also missing the BugLink references in the patches. I'll NACK as I
think a refresh would be way easier for the maintainers to work with.

Gerald Yang June 6, 2023, 1:25 a.m. UTC | #3
Thanks Andrei for the review and the details. I will adjust the patches and
resubmit the SRU.

Regards,
Gerald
