mbox series

[SRU,F,0/1] xfs: Preallocated ioend transactions cause deadlock due to log buffer exhaustion

Message ID 20230214055022.7876-1-matthew.ruffell@canonical.com
Headers show
Series xfs: Preallocated ioend transactions cause deadlock due to log buffer exhaustion | expand

Message

Matthew Ruffell Feb. 14, 2023, 5:50 a.m. UTC
BugLink: https://bugs.launchpad.net/bugs/2007219

[Impact]

A deadlock exists in the XFS filesystem that occurs when the XFS log buffer
becomes completely exhausted.

XFS maintains a circular ring buffer for the log, and it is a fixed size. To be
able to create a transaction, you need to be able to reserve space on the log
buffer.

Certain ioend transactions, such as file append, can be preallocated for a
negligible performance gain. This takes up space in the log buffer, and these
preallocated ioends are placed on a workqueue, behind other ioends that are not
preallocated.

The deadlock occurs when the XFS log buffer is running low on space, and an
ioend append transaction comes in. It is preallocated, consuming nearly all of
the remaining XFS log buffer space, and is placed at the very end of the ioend
workqueue. The kernel then takes a ioend from the top of the ioend
workqeueue, creates a transaction, xfs_trans_alloc(), attempts to allocate space 
for it, xfs_trans_reserve(), xfs_log_reserve(), and then waits in a while loop
checking free space in the log, xlog_grant_head_check(), xlog_grant_head_wait().

Since there is no space, the thread sleeps with schedule(). This happens with
all ioend transactions, since the log is exhausted and I/O is not moving.

Since I/O never moves, the thread is never woken up, and we get hung task
timeouts, with system failure shortly afterward.

An example hung task timeout is:

INFO: task kworker/60:0:4002982 blocked for more than 360 seconds.
      Tainted: G           OE     5.4.0-137-generic #154-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/60:0    D    0 4002982      2 0x90004080
Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 xlog_grant_head_wait+0xb9/0x1e0 [xfs]
 xlog_grant_head_check+0xde/0x100 [xfs]
 xfs_log_reserve+0xc9/0x1e0 [xfs]
 xfs_trans_reserve+0x17a/0x1e0 [xfs]
 xfs_trans_alloc+0xda/0x170 [xfs]
 xfs_iomap_write_unwritten+0x128/0x2f0 [xfs]
 xfs_end_ioend+0x15b/0x1b0 [xfs]
 xfs_end_io+0xb1/0xe0 [xfs]
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x1f/0x40

There is no known workaround, other than to have a larger log buffer at
filesystem creation, but even then, it only buys you time until you get high
enough load to exhaust the log buffer.

[Fix]

This was fixed in 5.13-rc1 by the following patch:

commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
Author: Brian Foster <bfoster@redhat.com>
Date:   Fri Apr 9 10:27:43 2021 -0700
Subject: xfs: drop submit side trans alloc for append ioends
Link: https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5

The patch more or less removes all preallocated ioend transactions, and instead,
when ioend appends are needed, they go to the standard workqueue like any other
ioend, where transactions are allocated when they reach the top of the 
workqueue. 

The patch required some backporting for Focal. The changes to the patch itself
is minimal and should be straightforward to read, however, the changes to the
XFS ioend subsystem between 5.4 and 5.13-rc1 were quite extreme, with a lot of
refactoring taking place over very many commits.

Additionally, the patch was part of a five part series, the first, fixes the
deadlock, and the rest remove all the code to do with transaction preallocation.

It is safe to leave the rest of the code in place. It will become dead code, but
it will not be reachable, and not cause any risk of regression, due to 
ioend->io_append_trans always being NULL, and not entering into any of the if
statements.

[Testcase]

There is currently no known testcase for this issue. The issue has only been
seen in a customer production environment, running under heavy load. The issue
has not been seen in a customer test environment, only production.

The production workload is a busy Kubernetes cluster running containers and VMs,
with the hosts's filesystem being broken into separate mountpoints over several
partitions, all XFS.

The kubernetes containers are all backed from a large 4TB disk which is XFS.

A test kernel is available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf353709-test

This test kernel has been deployed to several production hosts, and the deadlock
no longer occurs.

[Where problems can occur]

We are changing how certain ioend transactions take place in the XFS filesystem.
If a regression were to occur, it would impact all XFS users. Users would have
to downgrade their kernel, as there are no workarounds for enabling or disabling
the behaviour change.

The overall risk of the change should be low. ioend append transactions would
still be processed in nearly the same way, still being placed at the end of the
ioend workqueue like any other transaction, with the only change being it is
allocated at the time the transaction is created, at the top of the workqueue
when it is processed, and not preallocated when the ioend is first submitted.

There will be a very minor performance penalty, but it wouldn't be measurable
in any tangible workload.

I have run xfstests for the xfs/* subset against the released 5.4.0-137-generic
and the test 5.4.0-137-generic test kernel with the patch included. The test
kernel had identical results, it will likely not cause any regressions.

There is some additional risk leaving the dead code in place, but I have read
the code and the commits to remove the dead code, and I came to the conclusion
it is less risky to leave it in place, than to backport the refactor commits.

[Other Info]

The XFS ioend subsystem refactor took place in the following commits:

commit 598ecfbaa742aca0dcdbbea25681406f95cc0b63
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 17 Oct 2019 13:12:15 -0700
Subject: iomap: lift the xfs writeback code to iomap
Link: https://github.com/torvalds/linux/commit/598ecfbaa742aca0dcdbbea25681406f95cc0b63

commit 9e91c5728cab3d0aa3197d009c3d63e147914e77
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 17 Oct 2019 13:12:13 -0700
Subject: iomap: lift common tracing code from xfs to iomap
Link: https://github.com/torvalds/linux/commit/9e91c5728cab3d0aa3197d009c3d63e147914e77

commit 433dad94ec5d6b90385b56a8bc8718dd9542b289
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 17 Oct 2019 13:12:07 -0700
Subject: xfs: refactor the ioend merging code
Link: https://github.com/torvalds/linux/commit/433dad94ec5d6b90385b56a8bc8718dd9542b289

ioend->io_append_trans was renamed to ioend->io_private in the following commit:

commit 5653017bc44e54baa299f3523f160c23ac0628fd
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 17 Oct 2019 13:12:09 -0700
Subject: xfs: turn io_append_trans into an io_private void pointer
Link: https://github.com/torvalds/linux/commit/5653017bc44e54baa299f3523f160c23ac0628fd

The full five part preallocated ioend deadlock patch series is:

commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
Author: Brian Foster <bfoster@redhat.com>
Date:   Fri Apr 9 10:27:43 2021 -0700
Subject: xfs: drop submit side trans alloc for append ioends
Link: https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5

commit 7adb8f14e134d5f885d47c4ccd620836235f0b7f
Author: Brian Foster <bfoster@redhat.com>
Date:   Fri Apr 9 10:27:55 2021 -0700
Subject: xfs: open code ioend needs workqueue helper
Link: https://github.com/torvalds/linux/commit/7adb8f14e134d5f885d47c4ccd620836235f0b7f

commit 044c6449f18f174ba8d86640936add3fc7582e49
Author: Brian Foster <bfoster@redhat.com>
Date:   Fri Apr 9 10:27:55 2021 -0700
Subject: xfs: drop unused ioend private merge and setfilesize code
Link: https://github.com/torvalds/linux/commit/044c6449f18f174ba8d86640936add3fc7582e49

commit e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
Author: Brian Foster <bfoster@redhat.com>
Date:   Fri Apr 9 10:27:56 2021 -0700
Subject: xfs: drop unnecessary setfilesize helper
Link: https://github.com/torvalds/linux/commit/e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4

commit 6e552494fb90acae005d74ce6a2ee102d965184b
Author: Brian Foster <bfoster@redhat.com>
Date:   Tue May 4 08:54:29 2021 -0700
Subject: iomap: remove unused private field from ioend
Link: https://github.com/torvalds/linux/commit/6e552494fb90acae005d74ce6a2ee102d965184b

As you can see, the four latter commits are not necessary. They simply remove
dead code, which has no harm being left in place. They also do not backport
at all, not without the ALL of the significant refactor commits.

Hence, we only take "xfs: drop submit side trans alloc for append ioends" only,
as it is the only needed commit to solve the problem.

Brian Foster (1):
  xfs: drop submit side trans alloc for append ioends

 fs/xfs/xfs_aops.c | 43 +++----------------------------------------
 1 file changed, 3 insertions(+), 40 deletions(-)

Comments

Andrea Righi Feb. 14, 2023, 6:32 a.m. UTC | #1
On Tue, Feb 14, 2023 at 06:50:21PM +1300, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2007219
> 
> [Impact]
> 
> A deadlock exists in the XFS filesystem that occurs when the XFS log buffer
> becomes completely exhausted.
> 
> XFS maintains a circular ring buffer for the log, and it is a fixed size. To be
> able to create a transaction, you need to be able to reserve space on the log
> buffer.
> 
> Certain ioend transactions, such as file append, can be preallocated for a
> negligible performance gain. This takes up space in the log buffer, and these
> preallocated ioends are placed on a workqueue, behind other ioends that are not
> preallocated.
> 
> The deadlock occurs when the XFS log buffer is running low on space, and an
> ioend append transaction comes in. It is preallocated, consuming nearly all of
> the remaining XFS log buffer space, and is placed at the very end of the ioend
> workqueue. The kernel then takes a ioend from the top of the ioend
> workqeueue, creates a transaction, xfs_trans_alloc(), attempts to allocate space 
> for it, xfs_trans_reserve(), xfs_log_reserve(), and then waits in a while loop
> checking free space in the log, xlog_grant_head_check(), xlog_grant_head_wait().
> 
> Since there is no space, the thread sleeps with schedule(). This happens with
> all ioend transactions, since the log is exhausted and I/O is not moving.
> 
> Since I/O never moves, the thread is never woken up, and we get hung task
> timeouts, with system failure shortly afterward.
> 
> An example hung task timeout is:
> 
> INFO: task kworker/60:0:4002982 blocked for more than 360 seconds.
>       Tainted: G           OE     5.4.0-137-generic #154-Ubuntu
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/60:0    D    0 4002982      2 0x90004080
> Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
> Call Trace:
>  __schedule+0x2e3/0x740
>  schedule+0x42/0xb0
>  xlog_grant_head_wait+0xb9/0x1e0 [xfs]
>  xlog_grant_head_check+0xde/0x100 [xfs]
>  xfs_log_reserve+0xc9/0x1e0 [xfs]
>  xfs_trans_reserve+0x17a/0x1e0 [xfs]
>  xfs_trans_alloc+0xda/0x170 [xfs]
>  xfs_iomap_write_unwritten+0x128/0x2f0 [xfs]
>  xfs_end_ioend+0x15b/0x1b0 [xfs]
>  xfs_end_io+0xb1/0xe0 [xfs]
>  process_one_work+0x1eb/0x3b0
>  worker_thread+0x4d/0x400
>  kthread+0x104/0x140
>  ? process_one_work+0x3b0/0x3b0
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x1f/0x40
> 
> There is no known workaround, other than to have a larger log buffer at
> filesystem creation, but even then, it only buys you time until you get high
> enough load to exhaust the log buffer.
> 
> [Fix]
> 
> This was fixed in 5.13-rc1 by the following patch:
> 
> commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:43 2021 -0700
> Subject: xfs: drop submit side trans alloc for append ioends
> Link: https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
> 
> The patch more or less removes all preallocated ioend transactions, and instead,
> when ioend appends are needed, they go to the standard workqueue like any other
> ioend, where transactions are allocated when they reach the top of the 
> workqueue. 
> 
> The patch required some backporting for Focal. The changes to the patch itself
> is minimal and should be straightforward to read, however, the changes to the
> XFS ioend subsystem between 5.4 and 5.13-rc1 were quite extreme, with a lot of
> refactoring taking place over very many commits.
> 
> Additionally, the patch was part of a five part series, the first, fixes the
> deadlock, and the rest remove all the code to do with transaction preallocation.
> 
> It is safe to leave the rest of the code in place. It will become dead code, but
> it will not be reachable, and not cause any risk of regression, due to 
> ioend->io_append_trans always being NULL, and not entering into any of the if
> statements.
> 
> [Testcase]
> 
> There is currently no known testcase for this issue. The issue has only been
> seen in a customer production environment, running under heavy load. The issue
> has not been seen in a customer test environment, only production.
> 
> The production workload is a busy Kubernetes cluster running containers and VMs,
> with the hosts's filesystem being broken into separate mountpoints over several
> partitions, all XFS.
> 
> The kubernetes containers are all backed from a large 4TB disk which is XFS.
> 
> A test kernel is available in the following ppa:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf353709-test
> 
> This test kernel has been deployed to several production hosts, and the deadlock
> no longer occurs.
> 
> [Where problems can occur]
> 
> We are changing how certain ioend transactions take place in the XFS filesystem.
> If a regression were to occur, it would impact all XFS users. Users would have
> to downgrade their kernel, as there are no workarounds for enabling or disabling
> the behaviour change.
> 
> The overall risk of the change should be low. ioend append transactions would
> still be processed in nearly the same way, still being placed at the end of the
> ioend workqueue like any other transaction, with the only change being it is
> allocated at the time the transaction is created, at the top of the workqueue
> when it is processed, and not preallocated when the ioend is first submitted.
> 
> There will be a very minor performance penalty, but it wouldn't be measurable
> in any tangible workload.
> 
> I have run xfstests for the xfs/* subset against the released 5.4.0-137-generic
> and the test 5.4.0-137-generic test kernel with the patch included. The test
> kernel had identical results, it will likely not cause any regressions.
> 
> There is some additional risk leaving the dead code in place, but I have read
> the code and the commits to remove the dead code, and I came to the conclusion
> it is less risky to leave it in place, than to backport the refactor commits.
> 
> [Other Info]
> 
> The XFS ioend subsystem refactor took place in the following commits:
> 
> commit 598ecfbaa742aca0dcdbbea25681406f95cc0b63
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:15 -0700
> Subject: iomap: lift the xfs writeback code to iomap
> Link: https://github.com/torvalds/linux/commit/598ecfbaa742aca0dcdbbea25681406f95cc0b63
> 
> commit 9e91c5728cab3d0aa3197d009c3d63e147914e77
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:13 -0700
> Subject: iomap: lift common tracing code from xfs to iomap
> Link: https://github.com/torvalds/linux/commit/9e91c5728cab3d0aa3197d009c3d63e147914e77
> 
> commit 433dad94ec5d6b90385b56a8bc8718dd9542b289
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:07 -0700
> Subject: xfs: refactor the ioend merging code
> Link: https://github.com/torvalds/linux/commit/433dad94ec5d6b90385b56a8bc8718dd9542b289
> 
> ioend->io_append_trans was renamed to ioend->io_private in the following commit:
> 
> commit 5653017bc44e54baa299f3523f160c23ac0628fd
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:09 -0700
> Subject: xfs: turn io_append_trans into an io_private void pointer
> Link: https://github.com/torvalds/linux/commit/5653017bc44e54baa299f3523f160c23ac0628fd
> 
> The full five part preallocated ioend deadlock patch series is:
> 
> commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:43 2021 -0700
> Subject: xfs: drop submit side trans alloc for append ioends
> Link: https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
> 
> commit 7adb8f14e134d5f885d47c4ccd620836235f0b7f
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:55 2021 -0700
> Subject: xfs: open code ioend needs workqueue helper
> Link: https://github.com/torvalds/linux/commit/7adb8f14e134d5f885d47c4ccd620836235f0b7f
> 
> commit 044c6449f18f174ba8d86640936add3fc7582e49
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:55 2021 -0700
> Subject: xfs: drop unused ioend private merge and setfilesize code
> Link: https://github.com/torvalds/linux/commit/044c6449f18f174ba8d86640936add3fc7582e49
> 
> commit e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:56 2021 -0700
> Subject: xfs: drop unnecessary setfilesize helper
> Link: https://github.com/torvalds/linux/commit/e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
> 
> commit 6e552494fb90acae005d74ce6a2ee102d965184b
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Tue May 4 08:54:29 2021 -0700
> Subject: iomap: remove unused private field from ioend
> Link: https://github.com/torvalds/linux/commit/6e552494fb90acae005d74ce6a2ee102d965184b
> 
> As you can see, the four latter commits are not necessary. They simply remove
> dead code, which has no harm being left in place. They also do not backport
> at all, not without the ALL of the significant refactor commits.
> 
> Hence, we only take "xfs: drop submit side trans alloc for append ioends" only,
> as it is the only needed commit to solve the problem.

Makes sense and the backport part looks correct to me, therefore:

Acked-by: Andrea Righi <andrea.righi@canonical.com>
Khalid Elmously Feb. 14, 2023, 7:14 a.m. UTC | #2
On 2023-02-14 18:50:21 , Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2007219
> 
> [Impact]
> 
> A deadlock exists in the XFS filesystem that occurs when the XFS log buffer
> becomes completely exhausted.
> 
> XFS maintains a circular ring buffer for the log, and it is a fixed size. To be
> able to create a transaction, you need to be able to reserve space on the log
> buffer.
> 
> Certain ioend transactions, such as file append, can be preallocated for a
> negligible performance gain. This takes up space in the log buffer, and these
> preallocated ioends are placed on a workqueue, behind other ioends that are not
> preallocated.
> 
> The deadlock occurs when the XFS log buffer is running low on space, and an
> ioend append transaction comes in. It is preallocated, consuming nearly all of
> the remaining XFS log buffer space, and is placed at the very end of the ioend
> workqueue. The kernel then takes a ioend from the top of the ioend
> workqeueue, creates a transaction, xfs_trans_alloc(), attempts to allocate space 
> for it, xfs_trans_reserve(), xfs_log_reserve(), and then waits in a while loop
> checking free space in the log, xlog_grant_head_check(), xlog_grant_head_wait().
> 
> Since there is no space, the thread sleeps with schedule(). This happens with
> all ioend transactions, since the log is exhausted and I/O is not moving.
> 
> Since I/O never moves, the thread is never woken up, and we get hung task
> timeouts, with system failure shortly afterward.
> 
> An example hung task timeout is:
> 
> INFO: task kworker/60:0:4002982 blocked for more than 360 seconds.
>       Tainted: G           OE     5.4.0-137-generic #154-Ubuntu
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/60:0    D    0 4002982      2 0x90004080
> Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
> Call Trace:
>  __schedule+0x2e3/0x740
>  schedule+0x42/0xb0
>  xlog_grant_head_wait+0xb9/0x1e0 [xfs]
>  xlog_grant_head_check+0xde/0x100 [xfs]
>  xfs_log_reserve+0xc9/0x1e0 [xfs]
>  xfs_trans_reserve+0x17a/0x1e0 [xfs]
>  xfs_trans_alloc+0xda/0x170 [xfs]
>  xfs_iomap_write_unwritten+0x128/0x2f0 [xfs]
>  xfs_end_ioend+0x15b/0x1b0 [xfs]
>  xfs_end_io+0xb1/0xe0 [xfs]
>  process_one_work+0x1eb/0x3b0
>  worker_thread+0x4d/0x400
>  kthread+0x104/0x140
>  ? process_one_work+0x3b0/0x3b0
>  ? kthread_park+0x90/0x90
>  ret_from_fork+0x1f/0x40
> 
> There is no known workaround, other than to have a larger log buffer at
> filesystem creation, but even then, it only buys you time until you get high
> enough load to exhaust the log buffer.
> 
> [Fix]
> 
> This was fixed in 5.13-rc1 by the following patch:
> 
> commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:43 2021 -0700
> Subject: xfs: drop submit side trans alloc for append ioends
> Link: https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
> 
> The patch more or less removes all preallocated ioend transactions, and instead,
> when ioend appends are needed, they go to the standard workqueue like any other
> ioend, where transactions are allocated when they reach the top of the 
> workqueue. 
> 
> The patch required some backporting for Focal. The changes to the patch itself
> is minimal and should be straightforward to read, however, the changes to the
> XFS ioend subsystem between 5.4 and 5.13-rc1 were quite extreme, with a lot of
> refactoring taking place over very many commits.
> 
> Additionally, the patch was part of a five part series, the first, fixes the
> deadlock, and the rest remove all the code to do with transaction preallocation.
> 
> It is safe to leave the rest of the code in place. It will become dead code, but
> it will not be reachable, and not cause any risk of regression, due to 
> ioend->io_append_trans always being NULL, and not entering into any of the if
> statements.
> 
> [Testcase]
> 
> There is currently no known testcase for this issue. The issue has only been
> seen in a customer production environment, running under heavy load. The issue
> has not been seen in a customer test environment, only production.
> 
> The production workload is a busy Kubernetes cluster running containers and VMs,
> with the hosts's filesystem being broken into separate mountpoints over several
> partitions, all XFS.
> 
> The kubernetes containers are all backed from a large 4TB disk which is XFS.
> 
> A test kernel is available in the following ppa:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf353709-test
> 
> This test kernel has been deployed to several production hosts, and the deadlock
> no longer occurs.
> 
> [Where problems can occur]
> 
> We are changing how certain ioend transactions take place in the XFS filesystem.
> If a regression were to occur, it would impact all XFS users. Users would have
> to downgrade their kernel, as there are no workarounds for enabling or disabling
> the behaviour change.
> 
> The overall risk of the change should be low. ioend append transactions would
> still be processed in nearly the same way, still being placed at the end of the
> ioend workqueue like any other transaction, with the only change being it is
> allocated at the time the transaction is created, at the top of the workqueue
> when it is processed, and not preallocated when the ioend is first submitted.
> 
> There will be a very minor performance penalty, but it wouldn't be measurable
> in any tangible workload.
> 
> I have run xfstests for the xfs/* subset against the released 5.4.0-137-generic
> and the test 5.4.0-137-generic test kernel with the patch included. The test
> kernel had identical results, it will likely not cause any regressions.
> 
> There is some additional risk leaving the dead code in place, but I have read
> the code and the commits to remove the dead code, and I came to the conclusion
> it is less risky to leave it in place, than to backport the refactor commits.
> 
> [Other Info]
> 
> The XFS ioend subsystem refactor took place in the following commits:
> 
> commit 598ecfbaa742aca0dcdbbea25681406f95cc0b63
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:15 -0700
> Subject: iomap: lift the xfs writeback code to iomap
> Link: https://github.com/torvalds/linux/commit/598ecfbaa742aca0dcdbbea25681406f95cc0b63
> 
> commit 9e91c5728cab3d0aa3197d009c3d63e147914e77
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:13 -0700
> Subject: iomap: lift common tracing code from xfs to iomap
> Link: https://github.com/torvalds/linux/commit/9e91c5728cab3d0aa3197d009c3d63e147914e77
> 
> commit 433dad94ec5d6b90385b56a8bc8718dd9542b289
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:07 -0700
> Subject: xfs: refactor the ioend merging code
> Link: https://github.com/torvalds/linux/commit/433dad94ec5d6b90385b56a8bc8718dd9542b289
> 
> ioend->io_append_trans was renamed to ioend->io_private in the following commit:
> 
> commit 5653017bc44e54baa299f3523f160c23ac0628fd
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:09 -0700
> Subject: xfs: turn io_append_trans into an io_private void pointer
> Link: https://github.com/torvalds/linux/commit/5653017bc44e54baa299f3523f160c23ac0628fd
> 
> The full five part preallocated ioend deadlock patch series is:
> 
> commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:43 2021 -0700
> Subject: xfs: drop submit side trans alloc for append ioends
> Link: https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
> 
> commit 7adb8f14e134d5f885d47c4ccd620836235f0b7f
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:55 2021 -0700
> Subject: xfs: open code ioend needs workqueue helper
> Link: https://github.com/torvalds/linux/commit/7adb8f14e134d5f885d47c4ccd620836235f0b7f
> 
> commit 044c6449f18f174ba8d86640936add3fc7582e49
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:55 2021 -0700
> Subject: xfs: drop unused ioend private merge and setfilesize code
> Link: https://github.com/torvalds/linux/commit/044c6449f18f174ba8d86640936add3fc7582e49
> 
> commit e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:56 2021 -0700
> Subject: xfs: drop unnecessary setfilesize helper
> Link: https://github.com/torvalds/linux/commit/e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
> 
> commit 6e552494fb90acae005d74ce6a2ee102d965184b
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Tue May 4 08:54:29 2021 -0700
> Subject: iomap: remove unused private field from ioend
> Link: https://github.com/torvalds/linux/commit/6e552494fb90acae005d74ce6a2ee102d965184b
> 
> As you can see, the four latter commits are not necessary. They simply remove
> dead code, which has no harm being left in place. They also do not backport
> at all, not without the ALL of the significant refactor commits.
> 
> Hence, we only take "xfs: drop submit side trans alloc for append ioends" only,
> as it is the only needed commit to solve the problem.
> 
> Brian Foster (1):
>   xfs: drop submit side trans alloc for append ioends
> 
>  fs/xfs/xfs_aops.c | 43 +++----------------------------------------
>  1 file changed, 3 insertions(+), 40 deletions(-)
> 


Acked-by: Khalid Elmously <khalid.elmously@canonical.com>

Also, I believe the convention we follow is "backported from commit xxxxx" not "backported from xxxxx".



> -- 
> 2.37.2
> 
> 
> -- 
> kernel-team mailing list
> kernel-team@lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
Stefan Bader Feb. 17, 2023, 10:10 a.m. UTC | #3
On 14.02.23 06:50, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/2007219
> 
> [Impact]
> 
> A deadlock exists in the XFS filesystem that occurs when the XFS log buffer
> becomes completely exhausted.
> 
> XFS maintains a circular ring buffer for the log, and it is a fixed size. To be
> able to create a transaction, you need to be able to reserve space on the log
> buffer.
> 
> Certain ioend transactions, such as file append, can be preallocated for a
> negligible performance gain. This takes up space in the log buffer, and these
> preallocated ioends are placed on a workqueue, behind other ioends that are not
> preallocated.
> 
> The deadlock occurs when the XFS log buffer is running low on space, and an
> ioend append transaction comes in. It is preallocated, consuming nearly all of
> the remaining XFS log buffer space, and is placed at the very end of the ioend
> workqueue. The kernel then takes a ioend from the top of the ioend
> workqeueue, creates a transaction, xfs_trans_alloc(), attempts to allocate space
> for it, xfs_trans_reserve(), xfs_log_reserve(), and then waits in a while loop
> checking free space in the log, xlog_grant_head_check(), xlog_grant_head_wait().
> 
> Since there is no space, the thread sleeps with schedule(). This happens with
> all ioend transactions, since the log is exhausted and I/O is not moving.
> 
> Since I/O never moves, the thread is never woken up, and we get hung task
> timeouts, with system failure shortly afterward.
> 
> An example hung task timeout is:
> 
> INFO: task kworker/60:0:4002982 blocked for more than 360 seconds.
>        Tainted: G           OE     5.4.0-137-generic #154-Ubuntu
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/60:0    D    0 4002982      2 0x90004080
> Workqueue: xfs-conv/dm-3 xfs_end_io [xfs]
> Call Trace:
>   __schedule+0x2e3/0x740
>   schedule+0x42/0xb0
>   xlog_grant_head_wait+0xb9/0x1e0 [xfs]
>   xlog_grant_head_check+0xde/0x100 [xfs]
>   xfs_log_reserve+0xc9/0x1e0 [xfs]
>   xfs_trans_reserve+0x17a/0x1e0 [xfs]
>   xfs_trans_alloc+0xda/0x170 [xfs]
>   xfs_iomap_write_unwritten+0x128/0x2f0 [xfs]
>   xfs_end_ioend+0x15b/0x1b0 [xfs]
>   xfs_end_io+0xb1/0xe0 [xfs]
>   process_one_work+0x1eb/0x3b0
>   worker_thread+0x4d/0x400
>   kthread+0x104/0x140
>   ? process_one_work+0x3b0/0x3b0
>   ? kthread_park+0x90/0x90
>   ret_from_fork+0x1f/0x40
> 
> There is no known workaround, other than to have a larger log buffer at
> filesystem creation, but even then, it only buys you time until you get high
> enough load to exhaust the log buffer.
> 
> [Fix]
> 
> This was fixed in 5.13-rc1 by the following patch:
> 
> commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:43 2021 -0700
> Subject: xfs: drop submit side trans alloc for append ioends
> Link: https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
> 
> The patch more or less removes all preallocated ioend transactions, and instead,
> when ioend appends are needed, they go to the standard workqueue like any other
> ioend, where transactions are allocated when they reach the top of the
> workqueue.
> 
> The patch required some backporting for Focal. The changes to the patch itself
> is minimal and should be straightforward to read, however, the changes to the
> XFS ioend subsystem between 5.4 and 5.13-rc1 were quite extreme, with a lot of
> refactoring taking place over very many commits.
> 
> Additionally, the patch was part of a five part series, the first, fixes the
> deadlock, and the rest remove all the code to do with transaction preallocation.
> 
> It is safe to leave the rest of the code in place. It will become dead code, but
> it will not be reachable, and not cause any risk of regression, due to
> ioend->io_append_trans always being NULL, and not entering into any of the if
> statements.
> 
> [Testcase]
> 
> There is currently no known testcase for this issue. The issue has only been
> seen in a customer production environment, running under heavy load. The issue
> has not been seen in a customer test environment, only production.
> 
> The production workload is a busy Kubernetes cluster running containers and VMs,
> with the hosts's filesystem being broken into separate mountpoints over several
> partitions, all XFS.
> 
> The kubernetes containers are all backed from a large 4TB disk which is XFS.
> 
> A test kernel is available in the following ppa:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf353709-test
> 
> This test kernel has been deployed to several production hosts, and the deadlock
> no longer occurs.
> 
> [Where problems can occur]
> 
> We are changing how certain ioend transactions take place in the XFS filesystem.
> If a regression were to occur, it would impact all XFS users. Users would have
> to downgrade their kernel, as there are no workarounds for enabling or disabling
> the behaviour change.
> 
> The overall risk of the change should be low. ioend append transactions would
> still be processed in nearly the same way, still being placed at the end of the
> ioend workqueue like any other transaction, with the only change being it is
> allocated at the time the transaction is created, at the top of the workqueue
> when it is processed, and not preallocated when the ioend is first submitted.
> 
> There will be a very minor performance penalty, but it wouldn't be measurable
> in any tangible workload.
> 
> I have run xfstests for the xfs/* subset against the released 5.4.0-137-generic
> and the test 5.4.0-137-generic test kernel with the patch included. The test
> kernel had identical results, it will likely not cause any regressions.
> 
> There is some additional risk leaving the dead code in place, but I have read
> the code and the commits to remove the dead code, and I came to the conclusion
> it is less risky to leave it in place, than to backport the refactor commits.
> 
> [Other Info]
> 
> The XFS ioend subsystem refactor took place in the following commits:
> 
> commit 598ecfbaa742aca0dcdbbea25681406f95cc0b63
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:15 -0700
> Subject: iomap: lift the xfs writeback code to iomap
> Link: https://github.com/torvalds/linux/commit/598ecfbaa742aca0dcdbbea25681406f95cc0b63
> 
> commit 9e91c5728cab3d0aa3197d009c3d63e147914e77
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:13 -0700
> Subject: iomap: lift common tracing code from xfs to iomap
> Link: https://github.com/torvalds/linux/commit/9e91c5728cab3d0aa3197d009c3d63e147914e77
> 
> commit 433dad94ec5d6b90385b56a8bc8718dd9542b289
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:07 -0700
> Subject: xfs: refactor the ioend merging code
> Link: https://github.com/torvalds/linux/commit/433dad94ec5d6b90385b56a8bc8718dd9542b289
> 
> ioend->io_append_trans was renamed to ioend->io_private in the following commit:
> 
> commit 5653017bc44e54baa299f3523f160c23ac0628fd
> From: Christoph Hellwig <hch@lst.de>
> Date: Thu, 17 Oct 2019 13:12:09 -0700
> Subject: xfs: turn io_append_trans into an io_private void pointer
> Link: https://github.com/torvalds/linux/commit/5653017bc44e54baa299f3523f160c23ac0628fd
> 
> The full five part preallocated ioend deadlock patch series is:
> 
> commit 7cd3099f4925d7c15887d1940ebd65acd66100f5
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:43 2021 -0700
> Subject: xfs: drop submit side trans alloc for append ioends
> Link: https://github.com/torvalds/linux/commit/7cd3099f4925d7c15887d1940ebd65acd66100f5
> 
> commit 7adb8f14e134d5f885d47c4ccd620836235f0b7f
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:55 2021 -0700
> Subject: xfs: open code ioend needs workqueue helper
> Link: https://github.com/torvalds/linux/commit/7adb8f14e134d5f885d47c4ccd620836235f0b7f
> 
> commit 044c6449f18f174ba8d86640936add3fc7582e49
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:55 2021 -0700
> Subject: xfs: drop unused ioend private merge and setfilesize code
> Link: https://github.com/torvalds/linux/commit/044c6449f18f174ba8d86640936add3fc7582e49
> 
> commit e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Apr 9 10:27:56 2021 -0700
> Subject: xfs: drop unnecessary setfilesize helper
> Link: https://github.com/torvalds/linux/commit/e7a3d7e792a5ad50583a2e6c35e72bd2ca6096f4
> 
> commit 6e552494fb90acae005d74ce6a2ee102d965184b
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Tue May 4 08:54:29 2021 -0700
> Subject: iomap: remove unused private field from ioend
> Link: https://github.com/torvalds/linux/commit/6e552494fb90acae005d74ce6a2ee102d965184b
> 
> As you can see, the four latter commits are not necessary. They simply remove
> dead code, which has no harm being left in place. They also do not backport
> at all, not without the ALL of the significant refactor commits.
> 
> Hence, we only take "xfs: drop submit side trans alloc for append ioends" only,
> as it is the only needed commit to solve the problem.
> 
> Brian Foster (1):
>    xfs: drop submit side trans alloc for append ioends
> 
>   fs/xfs/xfs_aops.c | 43 +++----------------------------------------
>   1 file changed, 3 insertions(+), 40 deletions(-)
> 

Applied to focal:linux/master-next. Thanks.

-Stefan