Message ID | 20230621174114.1320834-1-bongiojp@gmail.com |
---|---|
Headers | show |
Series | iomap regression for aio dio 4k writes | expand |
On Wed, Jun 21, 2023 at 10:29:19AM -0700, Jeremy Bongio wrote: > Hi Darrick and Allison, > > There has been a standing performance regression involving AIO DIO > 4k-aligned writes on ext4 backed by a fast local SSD since the switch > to iomap. I think it was originally reported and investigated in this > thread: https://lore.kernel.org/all/87lf7rkffv.fsf@collabora.com/ > > Short version: > Pre-iomap, for ext4 async direct writes, after the bio is written to disk > the completion function is called directly during the endio stage. > > Post-iomap, for direct writes, after the bio is written to disk, the completion > function is deferred to a work queue. This adds latency that impacts > performance most noticeably in very fast SSDs. > > Detailed version: > A possible explanation is below, followed by a few questions to figure > out the right way to fix it. > > In 4.15, ext4 uses fs/direct-io.c. When an AIO DIO write has completed > in the nvme driver, the interrupt handler for the write request ends > in calling bio_endio() which ends up calling dio_bio_end_aio(). A > different end_io function is used for async and sync io. If there are > no pages mapped in memory for the write operation's inode, then the > completion function for ext4 is called directly. If there are pages > mapped, then they might be dirty and need to be updated and work > is deferred to a work queue. > > Here is the relevant 4.15 code: > > fs/direct-io.c: dio_bio_end_aio() > if (dio->result) > defer_completion = dio->defer_completion || > (dio_op == REQ_OP_WRITE && > dio->inode->i_mapping->nrpages); > if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > } else { > dio_complete(dio, 0, DIO_COMPLETE_ASYNC); > } > > After ext4 switched to using iomap, the endio function became > iomap_dio_bio_end_io() in fs/iomap/direct-io.c. In iomap the same end io > function is used for both async and sync io. Yes, because the IO completion processing is the same regardless of whether the IO is submitted for sync or async completion. > All write requests will > defer io completion to a work queue even if there are no mapped pages > for the inode. Yup. Consider O_DSYNC DIO writes: Where are the post-IO completion integrity operations done? Consider DIO write IO completion for different filesystems: how does iomap_dio_complete() know whether dio->dops->end_io() needs to run in task context or not. e.g. DIO writes into unwritten extents: where are the written conversion transactions run? > With the attached patch, I see significantly better performance in 5.10 than 4.15. 5.10 is the latest kernel where I have driver support for an SSD that is fast enough to reproduce the regression. I verified that upstream iomap works the same. > > Test results using the reproduction script from the original report > and testing with 4k/8k/12k/16k blocksizes and write-only: > https://people.collabora.com/~krisman/dio/week21/bench.sh > > fio benchmark command: > fio --ioengine libaio --size=2G --direct=1 --filename=${MNT}/file --iodepth=64 \ > --time_based=1 --thread=1 --overwrite=1 --bs=${BS} --rw=$RW \ Ah, you are testing pure overwrites, which means for ext4 the only thing it needs to care about is cached mappings. What happens when you add O_DSYNC here? > --name "`uname -r`-${TYPE}-${RW}-${BS}-${FS}" \ > --runtime=100 --output-format=terse >> ${LOG} > > For 4.15, with all write completions called in io handler: > 4k: bw=1056MiB/s > 8k: bw=2082MiB/s > 12k: bw=2332MiB/s > 16k: bw=2453MiB/s > > For unmodified 5.10, with all write completions deferred: > 4k: bw=1004MiB/s > 8k: bw=2074MiB/s > 12k: bw=2309MiB/s > 16k: bw=2465MiB/s I don't see a regression here - the differences are in the noise of a typical fio overwrite test. > For modified 5.10, with all write completions called in io handler: > 4k: bw=1193MiB/s > 8k: bw=2258MiB/s > 12k: bw=2346MiB/s > 16k: bw=2446MiB/s > > Questions: > > Why did iomap from the beginning not make the async/sync io and > mapped/unmapped distinction that fs/direct-io.c did? Because the iomap code was designed from the ground up as an extent-based concurrent async IO engine that supported concurrent reads and writes to the same sparse file with full data integrity handling guarantees. The old dio code started off as "sync" DIO only, and it was for a long time completely broken for async DIO. Correct support for that was eventually tacked on via DIO_COMPLETE_ASYNC, but it was still basically an awful, nasty, complex, "block at a time" synchronous IO engine. > Since no issues have been found for ext4 calling completion work > directly in the io handler pre-iomap, it is unlikely that this is > unsafe (sleeping within an io handler callback). However, this may not > be true for all filesystems. Does XFS potentially sleep in its > completion code? Yes, and ext4 does too. e.g. O_DSYNC overwrites always need to be deferred to task context to be able to take sleeping locks and potentially block on journal and or device cache flushes. i.e. Have you considered what context all of XFS, f2fs, btrfs, zonefs and gfs2 need for pure DIO overwrite completion in all it's different variants? AFAIC, it's far simpler conceptually to defer all writes to completion context than it is to try to work out what writes need to be deferred and what doesn't, especially as the filesystem ->end_io completion might need to sleep and the iomap code has no idea whether that is possible. Cheers, Dave.
On Thu, Jun 22, 2023 at 09:59:30AM +1000, Dave Chinner wrote: > On Wed, Jun 21, 2023 at 10:29:19AM -0700, Jeremy Bongio wrote: > > Since no issues have been found for ext4 calling completion work > > directly in the io handler pre-iomap, it is unlikely that this is > > unsafe (sleeping within an io handler callback). However, this may not > > be true for all filesystems. Does XFS potentially sleep in its > > completion code? > > Yes, and ext4 does too. e.g. O_DSYNC overwrites always need to be > deferred to task context to be able to take sleeping locks and > potentially block on journal and or device cache flushes. > > i.e. Have you considered what context all of XFS, f2fs, btrfs, > zonefs and gfs2 need for pure DIO overwrite completion in all it's > different variants? > > AFAIC, it's far simpler conceptually to defer all writes to > completion context than it is to try to work out what writes need to > be deferred and what doesn't, especially as the filesystem ->end_io > completion might need to sleep and the iomap code has no idea > whether that is possible. Ok, so having spent a bit more thought on this away from the office this morning, I think there is a generic way we can avoid deferring completions for pure overwrites. We already have a mechanism in iomap that tells us if the write is a pure overwrite and we use it to change how we issue O_DSYNC DIO writes. i.e. we use it to determine if we can use FUA writes rather than a post-IO journal/device cache flush to guarantee data integrity. See IOMAP_DIO_WRITE_FUA for how we determine whether we need issue a generic_write_sync() call or not in the post IO completion processing. The iomap flags that determines if we can make this optimisation are IOMAP_F_SHARED and IOMAP_F_DIRTY. IOMAP_F_SHARED indicates a COW is required to break sharing for the write IO to proceed, whilst IOMAP_F_DIRTY indicates that the inode is either dirty or that the write IO requires metadata to be dirtied at completion time (e.g. unwritten extent conversion) before the sync operation that provides data integrity guarantees can be run. If neither of these flags are set in the iomap, it effectively means that the IO is a pure overwrite. i.e. the filesytsem has explicitly said that this write IO does not need any post-IO completion filesystem work to be done. At this point, the iomap code can optimise for a pure overwrite into an IOMAP_MAPPED extent, knowing that the only thing it needs to care about on completion is internal data integrity requirements (i.e. O_DSYNC/O_SYNC) of the IO. Hence if the filesystem has told iomap that it has no IO completion requirements, and iomap doesn't need generic_write_sync() for data integrity (i.e. no data integrity required or FUA was used for the entire IO), then we could complete the DIO write directly from the bio completion callback context... IOWs, what you want can be done, it's just a whole lot more complex than just avoiding a queue_work() call... -Dave.
On Thu, Jun 22, 2023 at 11:55:23AM +1000, Dave Chinner wrote: > Ok, so having spent a bit more thought on this away from the office > this morning, I think there is a generic way we can avoid deferring > completions for pure overwrites. OK, this is how we can, but should we? The same amount of work needs to be done, no matter whether we do it in interrupt context or workqueue context. Doing it in interrupt context has lower latency, but maybe allows us to batch up the work and so get better bandwidth. And we can't handle other interrupts while we're handling this one, so from a whole-system perspective, I think we'd rather do the work in the workqueue. Latency is important for reads, but why is it important for writes? There's such a thing as a dependent read, but writes are usually buffered and we can wait as long as we like for a write to complete.
On Thu, Jun 22, 2023 at 03:55:52AM +0100, Matthew Wilcox wrote: > Latency is important for reads, but why is it important for writes? > There's such a thing as a dependent read, but writes are usually buffered > and we can wait as long as we like for a write to complete. That was exactly my reasoning on why I did always defer the write completions in the initial iomap direct I/O code, and until now no one has complained. I could see why people care either about synchronous writes, or polled io_uring writes, for which we might be able to do the work in the completion thread context, but for aio writes this feels a bit odd.
On Thu, Jun 22, 2023 at 03:55:52AM +0100, Matthew Wilcox wrote: > On Thu, Jun 22, 2023 at 11:55:23AM +1000, Dave Chinner wrote: > > Ok, so having spent a bit more thought on this away from the office > > this morning, I think there is a generic way we can avoid deferring > > completions for pure overwrites. > > OK, this is how we can, but should we? The same amount of work > needs to be done, no matter whether we do it in interrupt context or > workqueue context. Doing it in interrupt context has lower latency, > but maybe allows us to batch up the work and so get better bandwidth. > And we can't handle other interrupts while we're handling this one, > so from a whole-system perspective, I think we'd rather do the work in > the workqueue. Yup, I agree with you there, but I can also be easily convinced that optimising the pure in-place DIO overwrite path is worth the effort. > Latency is important for reads, but why is it important for writes? > There's such a thing as a dependent read, but writes are usually buffered > and we can wait as long as we like for a write to complete. The OP cares about async direct IO performance, not buffered writes. And for DIO writes, there is most definitely such a thing as "dependent writes". Think about journalled data - you can't overwrite data in place until the data write to the journal has first completed all the way down to stable storage. i.e. there's an inherent IO completion-to-submission write ordering constraint in the algorithm, and so we have dependent writes. And that's the whole point of the DIO write FUA optimisations in iomap; they avoid the dependent "write" that provides data integrity i.e. the journal flush and/or device cache flush that generic_write_sync() issues in IO completion is a dependent write because it cannot start until all the data being written has reached the device entirely. Using completion-to-submission ordering of the integrity operations means we don't need to block other IOs to the same file, other journal operations in the filesystem or other data IO to provide that data integrity requirement for the specific O_DSYNC DIO write IO. If we can use an FUA write for this instead of a separate cache flush, then we end up providing O_DSYNC writes with about 40% lower completion latency than a "write + cache flush" sequential IO pair. This means that things like high performance databases improve throughput by 25-50% and operational latency goes down by ~30-40% if we can make extensive use of FUA writes to provide the desired data integrity guarantees. From that perspective, an application doing pure overwrites with ordering depedencies might actually be very dependent on minimising individual DIO write latency for overall performance... Cheers, Dave.
On Wed, 2023-06-21 at 10:29 -0700, Jeremy Bongio wrote: > Hi Darrick and Allison, > > There has been a standing performance regression involving AIO DIO > 4k-aligned writes on ext4 backed by a fast local SSD since the switch > to iomap. I think it was originally reported and investigated in this > thread: > https://urldefense.com/v3/__https://lore.kernel.org/all/87lf7rkffv.fsf@collabora.com/__;!!ACWV5N9M2RV99hQ!L6pU0-g5XWj5298hj3etLj9LW11t5Ga7cvTM1iDf158n1gTLot0r0WELozslls0LvJ8X9vil81pOn_CQMgHZyQ$ > > > Short version: > Pre-iomap, for ext4 async direct writes, after the bio is written to > disk > the completion function is called directly during the endio stage. > > Post-iomap, for direct writes, after the bio is written to disk, the > completion > function is deferred to a work queue. This adds latency that impacts > performance most noticeably in very fast SSDs. > > Detailed version: > A possible explanation is below, followed by a few questions to > figure > out the right way to fix it. > > In 4.15, ext4 uses fs/direct-io.c. When an AIO DIO write has > completed > in the nvme driver, the interrupt handler for the write request ends > in calling bio_endio() which ends up calling dio_bio_end_aio(). A > different end_io function is used for async and sync io. If there are > no pages mapped in memory for the write operation's inode, then the > completion function for ext4 is called directly. If there are pages > mapped, then they might be dirty and need to be updated and work > is deferred to a work queue. > > Here is the relevant 4.15 code: > > fs/direct-io.c: dio_bio_end_aio() > if (dio->result) > defer_completion = dio->defer_completion || > (dio_op == REQ_OP_WRITE && > dio->inode->i_mapping->nrpages); > if (defer_completion) { > INIT_WORK(&dio->complete_work, dio_aio_complete_work); > queue_work(dio->inode->i_sb->s_dio_done_wq, > &dio->complete_work); > } else { > dio_complete(dio, 0, DIO_COMPLETE_ASYNC); > } > > After ext4 switched to using iomap, the endio function became > iomap_dio_bio_end_io() in fs/iomap/direct-io.c. In iomap the same end > io > function is used for both async and sync io. All write requests will > defer io completion to a work queue even if there are no mapped pages > for the inode. > > Here is the relevant code in latest kernel (post-iomap) ... > > fs/iomap/direct-io.c: iomap_dio_bio_end_io() > if (dio->wait_for_completion) { > struct task_struct *waiter = dio->submit.waiter; > WRITE_ONCE(dio->submit.waiter, NULL); > blk_wake_io_task(waiter); > } else if (dio->flags & IOMAP_DIO_WRITE) { > struct inode *inode = file_inode(dio->iocb->ki_filp); > > WRITE_ONCE(dio->iocb->private, NULL); > INIT_WORK(&dio->aio.work, iomap_dio_complete_work); > queue_work(inode->i_sb->s_dio_done_wq, &dio->aio.work); > } else { > WRITE_ONCE(dio->iocb->private, NULL); > iomap_dio_complete_work(&dio->aio.work); > } > > With the attached patch, I see significantly better performance in > 5.10 than 4.15. 5.10 is the latest kernel where I have driver support > for an SSD that is fast enough to reproduce the regression. I > verified that upstream iomap works the same. > > Test results using the reproduction script from the original report > and testing with 4k/8k/12k/16k blocksizes and write-only: > https://urldefense.com/v3/__https://people.collabora.com/*krisman/dio/week21/bench.sh__;fg!!ACWV5N9M2RV99hQ!L6pU0-g5XWj5298hj3etLj9LW11t5Ga7cvTM1iDf158n1gTLot0r0WELozslls0LvJ8X9vil81pOn_CpnhVXfg$ > > > fio benchmark command: > fio --ioengine libaio --size=2G --direct=1 --filename=${MNT}/file -- > iodepth=64 \ > --time_based=1 --thread=1 --overwrite=1 --bs=${BS} --rw=$RW \ > --name "`uname -r`-${TYPE}-${RW}-${BS}-${FS}" \ > --runtime=100 --output-format=terse >> ${LOG} > > For 4.15, with all write completions called in io handler: > 4k: bw=1056MiB/s > 8k: bw=2082MiB/s > 12k: bw=2332MiB/s > 16k: bw=2453MiB/s > > For unmodified 5.10, with all write completions deferred: > 4k: bw=1004MiB/s > 8k: bw=2074MiB/s > 12k: bw=2309MiB/s > 16k: bw=2465MiB/s > > For modified 5.10, with all write completions called in io handler: > 4k: bw=1193MiB/s > 8k: bw=2258MiB/s > 12k: bw=2346MiB/s > 16k: bw=2446MiB/s > > Questions: > > Why did iomap from the beginning not make the async/sync io and > mapped/unmapped distinction that fs/direct-io.c did? > > Since no issues have been found for ext4 calling completion work > directly in the io handler pre-iomap, it is unlikely that this is > unsafe (sleeping within an io handler callback). However, this may > not > be true for all filesystems. Does XFS potentially sleep in its > completion code? > > Allison, Ted mentioned you saw a similar problem when doing > performance testing for the latest version of Unbreakable Linux. Can > you test/verify this patch addresses your performance regression as > well? Hi Jeremy, Sure I can link this patch to the bug report to see if the tester sees any improvements. If anything, I think it's a good data point to have, even if we are still considering other solutions. Thanks! Allison > > Jeremy Bongio (1): > For DIO writes with no mapped pages for inode, skip deferring > completion. > > fs/iomap/direct-io.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) >
On Thu, Jun 22, 2023 at 09:59:29AM +1000, Dave Chinner wrote: > Ah, you are testing pure overwrites, which means for ext4 the only > thing it needs to care about is cached mappings. What happens when > you add O_DSYNC here? I think you mean O_SYNC, right? In a pure overwrite case, where all of the extents are initialized and where the Oracle or DB2 server is doing writes to preallocated, pre-initialized space in the tablespace file followed by fdatasync(), there *are* no post-I/O data integrity operations which are required. If the file is opened O_SYNC or if the blocks were not preallocated using fallocate(2) and not initialized ahead of time, then sure, we can't use this optimization. However, the cases where databases workloads *are* doing overwrites and using fdatasync(2) most certainly do exist, and the benefit of this optimization can be a 20% throughput. Which is nothing to sneeze at. What we might to do is to let the file system tell the iomap layer via a flag whether or not there are no post-I/O metadata operations required, and then *if* that flag is set, and *if* the inode has no pages in the page cache (so there are no invalidate operations necessary), it should be safe to skip using queue_work(). That way, the file system has to affirmatively state that it is safe to skip the workqueue, so it shouldn't do any harm to other file systems using the iomap DIO layer. What am I missing? Cheers, - Ted
On Thu, Jun 22, 2023 at 10:32:33PM -0400, Theodore Ts'o wrote: > On Thu, Jun 22, 2023 at 09:59:29AM +1000, Dave Chinner wrote: > > Ah, you are testing pure overwrites, which means for ext4 the only > > thing it needs to care about is cached mappings. What happens when > > you add O_DSYNC here? > > I think you mean O_SYNC, right? No, I *explicitly* meant O_DSYNC. > In a pure overwrite case, where all > of the extents are initialized and where the Oracle or DB2 server is > doing writes to preallocated, pre-initialized space in the tablespace > file followed by fdatasync(), there *are* no post-I/O data integrity > operations which are required. Wrong: O_DSYNC DIO write IO requires the data to be on stable storage at IO completion. This means the pure overwrite IO must be either issued as a REQ_FUA write or as a normal write followed by a device cache flush. That device cache flush is a post-I/O data integrity operation and that is handled by iomap_dio_complete() -> generic_write_sync() -> vfs_fsync_range().... > If the file is opened O_SYNC or if the blocks were not > preallocated using fallocate(2) and not initialized ahead of time, > then sure, we can't use this optimization. Well, yes. That's the whole point of the IOMAP_F_DIRTY flag - if that is set, we don't attempt any pure overwrite optimisations because it's not a pure overwrite and metadata needs flushing to the journal. Hence we need to call generic_write_sync(). > What we might to do is to let the file system tell the iomap layer > via a flag whether or not there are no post-I/O metadata > operations required, and then *if* that flag is set, and *if* the > inode has no pages in the page cache (so there are no invalidate > operations necessary), it should be safe to skip using > queue_work(). That way, the file system has to affirmatively > state that it is safe to skip the workqueue, so it shouldn't do > any harm to other file systems using the iomap DIO layer. > > What am I missing? You didn't read my followup email. IOMAP_F_DIRTY is the flag you describe, and it already exists. -Dave.