Message ID: 1311791126-11383-2-git-send-email-freddy77@gmail.com
State: New
Did you test this at all?

On Wed, Jul 27, 2011 at 08:25:25PM +0200, Frediano Ziglio wrote:
> +    case QEMU_AIO_FLUSH:
> +        io_prep_fdsync(iocbs, fd);
> +        break;

Looks great, but doesn't work as expected.

Hint: grep for aio_fsync in the linux kernel.
On 27 Jul 2011, at 20:31, Christoph Hellwig <hch@lst.de> wrote:
> Did you test this at all?

Yes! Not at kernel level :-) Usually I trust documentation and man pages.

> On Wed, Jul 27, 2011 at 08:25:25PM +0200, Frediano Ziglio wrote:
>> +    case QEMU_AIO_FLUSH:
>> +        io_prep_fdsync(iocbs, fd);
>> +        break;
>
> Looks great, but doesn't work as expected.
>
> Hint: grep for aio_fsync in the linux kernel.

Thanks. I'll try to port the misaligned access handling to Linux AIO.
I'll also add some comments to the code so nobody else makes the same
mistake I did.

Mainly, however, qemu-img -k and aio=native in blockdev options are
silently ignored if nocache is not enabled.

I also noticed that combining XFS, Linux AIO, O_DIRECT and O_DSYNC
gives impressive performance, but currently there is no way to specify
all these flags together, because nocache enables O_DIRECT while
O_DSYNC is enabled with writethrough.

Frediano
On Wed, Jul 27, 2011 at 09:52:51PM +0200, Frediano Ziglio wrote:
>
> Yes! Not at kernel level :-)

In that case we have a bad error handling problem somewhere in qemu.
The IOCB_CMD_FDSYNC aio opcode will always return EINVAL currently, and
we really should have caught that somewhere in qemu.

> Thanks. I'll try to port the misaligned access handling to Linux AIO.
> I'll also add some comments to the code so nobody else makes the same
> mistake I did.

It's the direct I/O code in general that doesn't handle misaligned
access. Given that we should never get misaligned I/O from guests I
just didn't bother duplicating the read-modify-write code for that code
path as well.

> Mainly, however, qemu-img -k and aio=native in blockdev options are
> silently ignored if nocache is not enabled.

Maybe we should indeed error out instead. Care to prepare a patch for
that?

> I also noticed that combining XFS, Linux AIO, O_DIRECT and O_DSYNC
> gives impressive performance, but currently there is no way to specify
> all these flags together, because nocache enables O_DIRECT while
> O_DSYNC is enabled with writethrough.

Indeed. This has come up a few times, and actually is a mostly trivial
task. Maybe we should give up waiting for -blockdev and separate cache
mode settings and allow a nocache-writethrough or similar mode now?
It's going to be around 10 lines of code + documentation.
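For reference, the submission failure Christoph describes can be
reproduced with a minimal standalone libaio program. This is a sketch,
not QEMU code; the file name is arbitrary, and whether the error
surfaces as a negative return from io_submit() or in the completion
event's res field depends on the kernel:

    /* probe-fdsync.c: check whether the running kernel accepts
     * IOCB_CMD_FDSYNC.  Build with: cc probe-fdsync.c -laio */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        int fd = open("probe.tmp", O_CREAT | O_WRONLY, 0600);

        if (fd < 0 || io_setup(1, &ctx) < 0) {
            perror("setup");
            return 1;
        }

        io_prep_fdsync(&cb, fd);
        int ret = io_submit(ctx, 1, cbs);
        if (ret < 0) {
            /* Expected on kernels without aio_fsync support: the
             * opcode is rejected at submission time with EINVAL. */
            fprintf(stderr, "io_submit: %s\n", strerror(-ret));
        } else {
            struct io_event ev;
            io_getevents(ctx, 1, 1, &ev, NULL);
            /* A negative ev.res would be a per-request error. */
            printf("fdsync completed, res=%ld\n", (long)ev.res);
        }
        io_destroy(ctx);
        return 0;
    }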
On 27.07.2011 21:57, Christoph Hellwig wrote:
> On Wed, Jul 27, 2011 at 09:52:51PM +0200, Frediano Ziglio wrote:
>> I also noticed that combining XFS, Linux AIO, O_DIRECT and O_DSYNC
>> gives impressive performance, but currently there is no way to
>> specify all these flags together, because nocache enables O_DIRECT
>> while O_DSYNC is enabled with writethrough.
>
> Indeed. This has come up a few times, and actually is a mostly trivial
> task. Maybe we should give up waiting for -blockdev and separate cache
> mode settings and allow a nocache-writethrough or similar mode now?
> It's going to be around 10 lines of code + documentation.

I understand that there may be reasons for using O_DIRECT | O_DSYNC,
but what is the explanation for O_DSYNC improving performance?

Christoph, on another note: Can we rely on Linux AIO never returning
short writes except on EOF? Currently we return -EINVAL in this case,
so I hope it's true or we wouldn't return the correct error code.

The reason why I'm asking is that I want to allow reads across EOF for
growable images and pad with zeros (the synchronous code does this
today in order to allow bdrv_pread/pwrite to work, and when we start
using coroutines in the block layer, these cases will hit the AIO
paths).

Kevin
On Thu, Jul 28, 2011 at 09:47:05AM +0200, Kevin Wolf wrote:
> > Indeed. This has come up a few times, and actually is a mostly trivial
> > task. Maybe we should give up waiting for -blockdev and separate cache
> > mode settings and allow a nocache-writethrough or similar mode now?
> > It's going to be around 10 lines of code + documentation.
>
> I understand that there may be reasons for using O_DIRECT | O_DSYNC,
> but what is the explanation for O_DSYNC improving performance?

There isn't any, at least for modern Linux. O_DSYNC at this point is
equivalent to a range fdatasync for each write call, and given that
we're doing O_DIRECT the range flushes don't matter. If you do have a
modern host and an old guest it might end up being faster because the
barrier implementation in Linux used to suck so badly, but that's not
inherent to the I/O model. If your guest doesn't support cache flushes
at all, however, O_DIRECT | O_DSYNC is the only sane model to use for
local filesystems and block devices.

> Christoph, on another note: Can we rely on Linux AIO never returning
> short writes except on EOF? Currently we return -EINVAL in this case,
> so I hope it's true or we wouldn't return the correct error code.

More or less. There's one corner case for all Linux I/O, and that is
that only writes up to INT_MAX are supported, and larger writes (and
reads) get truncated to it. It's pretty nasty, but Linux has been
vocally opposed to fixing this issue.
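The cache mode under discussion boils down to a single open() call on
the image file. A sketch with an illustrative path, not an excerpt
from QEMU:

    #define _GNU_SOURCE             /* O_DIRECT is a GNU extension */
    #include <fcntl.h>
    #include <stdio.h>

    int main(void)
    {
        /* Bypass the host page cache (O_DIRECT) and make every write
         * durable before it completes (O_DSYNC) -- the combination a
         * nocache-writethrough style mode would request. */
        int fd = open("disk.img", O_RDWR | O_DIRECT | O_DSYNC);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        return 0;
    }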
On 28.07.2011 14:15, Christoph Hellwig wrote:
>> Christoph, on another note: Can we rely on Linux AIO never returning
>> short writes except on EOF? Currently we return -EINVAL in this case,

"short reads", I meant, of course.

>> so I hope it's true or we wouldn't return the correct error code.
>
> More or less. There's one corner case for all Linux I/O, and that is
> that only writes up to INT_MAX are supported, and larger writes (and
> reads) get truncated to it. It's pretty nasty, but Linux has been
> vocally opposed to fixing this issue.

I think we can safely ignore this. So just replacing the current
ret = -EINVAL; by a memset(buf + ret, 0, len - ret); ret = 0; should be
okay, right? (Of course using the qiov versions, but you get the idea)

Kevin
On Thu, Jul 28, 2011 at 02:41:02PM +0200, Kevin Wolf wrote:
> I think we can safely ignore this. So just replacing the current
> ret = -EINVAL; by a memset(buf + ret, 0, len - ret); ret = 0; should be
> okay, right? (Of course using the qiov versions, but you get the idea)

This should be safe.
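Spelled out over a plain struct iovec array rather than QEMU's actual
qiov API (the helper name and signature here are illustrative), the
padding Kevin describes might look like this:

    #include <string.h>
    #include <sys/uio.h>

    /* Zero everything after the first 'done' bytes of the vector, so
     * a short read across EOF looks like a full read of zeros.
     * Assumes the request length equals the total vector length. */
    static void pad_short_read(struct iovec *iov, int niov, size_t done)
    {
        for (int i = 0; i < niov; i++) {
            if (done >= iov[i].iov_len) {
                done -= iov[i].iov_len;   /* fully filled by the read */
                continue;
            }
            memset((char *)iov[i].iov_base + done, 0,
                   iov[i].iov_len - done);
            done = 0;                     /* rest is zeroed entirely */
        }
    }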
On Thu, Jul 28, 2011 at 1:15 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Thu, Jul 28, 2011 at 09:47:05AM +0200, Kevin Wolf wrote:
>> I understand that there may be reasons for using O_DIRECT | O_DSYNC,
>> but what is the explanation for O_DSYNC improving performance?
>
> There isn't any, at least for modern Linux. O_DSYNC at this point is
> equivalent to a range fdatasync for each write call, and given that
> we're doing O_DIRECT the range flushes don't matter. If you do have a
> modern host and an old guest it might end up being faster because the
> barrier implementation in Linux used to suck so badly, but that's not
> inherent to the I/O model. If your guest doesn't support cache flushes
> at all, however, O_DIRECT | O_DSYNC is the only sane model to use for
> local filesystems and block devices.

I can rebase this cache=directsync patch and send it:
http://repo.or.cz/w/qemu/stefanha.git/commitdiff/6756719a46ac9876ac6d0460a33ad98e96a3a923

The other weird caching-related option I was playing with is
-drive ...,readahead=on|off. It lets you disable the host kernel
readahead in buffered modes (cache=writeback|writethrough):
http://repo.or.cz/w/qemu/stefanha.git/commitdiff/f2fc2b297a2b2dd0cccd1dc2f7c519f3b0374e0d

Stefan
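On the host side, disabling readahead for buffered modes can be
expressed with standard POSIX file advice. A sketch of the general
mechanism only, not the code in Stefan's patch:

    #include <fcntl.h>

    static int disable_readahead(int fd)
    {
        /* POSIX_FADV_RANDOM tells the kernel to expect random access,
         * which switches off speculative readahead on this file.
         * offset/len of 0, 0 means "the whole file". */
        return posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    }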
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 3c6bd4b..27ae81e 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -628,6 +628,13 @@ static BlockDriverAIOCB *raw_aio_flush(BlockDriverState *bs,
     if (fd_open(bs) < 0)
         return NULL;
 
+#ifdef CONFIG_LINUX_AIO
+    if (s->use_aio) {
+        return laio_submit(bs, s->aio_ctx, s->fd, 0, NULL,
+                           0, cb, opaque, QEMU_AIO_FLUSH);
+    }
+#endif
+
     return paio_submit(bs, s->fd, 0, NULL, 0, cb, opaque, QEMU_AIO_FLUSH);
 }
 
diff --git a/linux-aio.c b/linux-aio.c
index 68f4b3d..d07c435 100644
--- a/linux-aio.c
+++ b/linux-aio.c
@@ -215,6 +215,9 @@ BlockDriverAIOCB *laio_submit(BlockDriverState *bs, void *aio_ctx, int fd,
     case QEMU_AIO_READ:
         io_prep_preadv(iocbs, fd, qiov->iov, qiov->niov, offset);
         break;
+    case QEMU_AIO_FLUSH:
+        io_prep_fdsync(iocbs, fd);
+        break;
     default:
         fprintf(stderr, "%s: invalid AIO request type 0x%x.\n",
                 __func__, type);
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
---
 block/raw-posix.c |    7 +++++++
 linux-aio.c       |    3 +++
 2 files changed, 10 insertions(+), 0 deletions(-)