Message ID | 1431409679-16077-3-git-send-email-den@openvz.org |
---|---|
State | New |
On 12.05.2015 at 07:47, Denis V. Lunev wrote:
> [...]
> diff --git a/block/io.c b/block/io.c
> index 908a3d1..071652c 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -205,7 +205,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
>          bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
>      } else {
>          bs->bl.min_mem_alignment = 512;
> -        bs->bl.opt_mem_alignment = 512;
> +        bs->bl.opt_mem_alignment = getpagesize();
>      }
>
>      if (bs->backing_hd) {

I think it would make more sense to keep this specific to the raw-posix
driver. After all, it's only the kernel page cache that we optimise
here. Other backends probably don't take advantage of page alignment.

> [...]
> @@ -726,7 +727,9 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
>
>      raw_probe_alignment(bs, s->fd, errp);
>      bs->bl.min_mem_alignment = s->buf_align;
> -    bs->bl.opt_mem_alignment = s->buf_align;
> +    if (bs->bl.min_mem_alignment > bs->bl.opt_mem_alignment) {
> +        bs->bl.opt_mem_alignment = bs->bl.min_mem_alignment;
> +    }

Or, if you want to keep the getpagesize() initialisation as a generic
fallback just in case, I would still suggest being explicit here instead
of relying on the default, like this:

    bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize()).

Kevin
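For illustration only (this is not part of the posted patch), the explicit computation suggested above can be sketched as a standalone helper. The names max_size() and raw_opt_mem_align() are hypothetical; sysconf(_SC_PAGESIZE) stands in for getpagesize(), and the argument stands in for the probed s->buf_align.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for QEMU's MAX() macro. */
static size_t max_size(size_t a, size_t b)
{
    return a > b ? a : b;
}

/*
 * Hypothetical helper: optimal memory alignment for a raw POSIX file,
 * given the minimum buffer alignment the driver has probed. The result
 * is never smaller than the system page size.
 */
static size_t raw_opt_mem_align(size_t probed_buf_align)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(page > 0);
    return max_size(probed_buf_align, (size_t)page);
}
```

With a probed alignment of 512 the helper returns the page size; a probed alignment larger than the page size is returned unchanged.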
On 12/05/15 13:27, Kevin Wolf wrote:
> [...]
> Or, if you want to keep the getpagesize() initialisation as a generic
> fallback just in case, I would still suggest being explicit here instead
> of relying on the default, like this:
>
>     bs->bl.opt_mem_alignment = MAX(s->buf_align, getpagesize()).
>
> Kevin

I can definitely do this if it is a strict requirement. I have not
performed any real testing on Windows and other platforms, but from my
point of view we will be on the safe side with this alignment.

Please note that I do not add any new allocation or any new alignment
check. The patch just forces the alignment of an allocation that would
be performed in any case. This approach simply matches I/O coming from
the guest with I/O initiated by qemu-img/qemu-io. All guest operations
(both Windows and Linux) are really page aligned by address and offset
nowadays.

This approach is safe. It does not bring any additional (significant)
overhead.

Den
On 12/05/2015 12:27, Kevin Wolf wrote:
> I think it would make more sense to keep this specific to the raw-posix
> driver. After all, it's only the kernel page cache that we optimise
> here. Other backends probably don't take advantage of page alignment.

I don't think it makes sense to keep it raw-posix-specific, though. It's
not the page cache that we optimize for, because this is with O_DIRECT.
If anything, making it page aligned means that the buffer spans one
fewer physical page and thus it may economize a bit on TLB misses.

Paolo
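The "one fewer physical page" remark can be checked with simple arithmetic. The following is a minimal sketch, not code from QEMU; pages_spanned() is a hypothetical helper that counts the pages touched by a buffer, assuming a power-of-two page size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of pages touched by a buffer of len bytes
 * starting at address addr. page_size must be a power of two.
 */
static size_t pages_spanned(uintptr_t addr, size_t len, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    if (len == 0) {
        return 0;
    }
    uintptr_t first = addr / page_size;             /* page of first byte */
    uintptr_t last = (addr + len - 1) / page_size;  /* page of last byte */
    return (size_t)(last - first + 1);
}
```

For a 4096-byte buffer and 4096-byte pages, a page-aligned start address touches one page, while a start address that is only 512-byte aligned touches two. For a 512 KiB buffer the counts are 128 and 129 respectively.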
On 12.05.2015 at 12:36, Denis V. Lunev wrote:
> On 12/05/15 13:27, Kevin Wolf wrote:
> > [...]
> I can definitely do this if it is a strict requirement. I have not
> performed any real testing on Windows and other platforms, but from my
> point of view we will be on the safe side with this alignment.

Yes, it certainly won't hurt as a default, so I'm okay with keeping it
in block.c. I would only like to have it explicit in raw-posix, too,
because the justification you use in the commit message is specific to
raw-posix (or, to be more precise, specific to raw-posix on Linux).

Paolo is right that I missed that the page cache isn't involved, but
then it must be the Linux block layer that splits the requests as you
reported. That's still raw-posix only.

For other backends (like network protocols), defaulting to pagesize
shouldn't hurt and possibly there are some effects that make it an
improvement there as well, but for raw-posix we actually have a good
reason to do so and to be explicit about it in the driver.

> Please note that I do not add any new allocation or any new alignment
> check. The patch just forces the alignment of an allocation that would
> be performed in any case. This approach simply matches I/O coming from
> the guest with I/O initiated by qemu-img/qemu-io. All guest operations
> (both Windows and Linux) are really page aligned by address and offset
> nowadays.
>
> This approach is safe. It does not bring any additional (significant)
> overhead.

Yes, I understand that. :-)

Kevin
On 12/05/15 16:08, Kevin Wolf wrote:
> [...]
> Yes, it certainly won't hurt as a default, so I'm okay with keeping it
> in block.c. I would only like to have it explicit in raw-posix, too,
> because the justification you use in the commit message is specific to
> raw-posix (or, to be more precise, specific to raw-posix on Linux).
>
> Paolo is right that I missed that the page cache isn't involved, but
> then it must be the Linux block layer that splits the requests as you
> reported. That's still raw-posix only.
>
> For other backends (like network protocols), defaulting to pagesize
> shouldn't hurt and possibly there are some effects that make it an
> improvement there as well, but for raw-posix we actually have a good
> reason to do so and to be explicit about it in the driver.

ok, makes sense.

> [...]
diff --git a/block.c b/block.c
index e293907..325f727 100644
--- a/block.c
+++ b/block.c
@@ -106,8 +106,8 @@ int is_windows_drive(const char *filename)
 size_t bdrv_opt_mem_align(BlockDriverState *bs)
 {
     if (!bs || !bs->drv) {
-        /* 4k should be on the safe side */
-        return 4096;
+        /* page size or 4k (hdd sector size) should be on the safe side */
+        return MAX(4096, getpagesize());
     }
 
     return bs->bl.opt_mem_alignment;
@@ -116,8 +116,8 @@ size_t bdrv_opt_mem_align(BlockDriverState *bs)
 size_t bdrv_min_mem_align(BlockDriverState *bs)
 {
     if (!bs || !bs->drv) {
-        /* 4k should be on the safe side */
-        return 4096;
+        /* page size or 4k (hdd sector size) should be on the safe side */
+        return MAX(4096, getpagesize());
     }
 
     return bs->bl.min_mem_alignment;
diff --git a/block/io.c b/block/io.c
index 908a3d1..071652c 100644
--- a/block/io.c
+++ b/block/io.c
@@ -205,7 +205,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
         bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
     } else {
         bs->bl.min_mem_alignment = 512;
-        bs->bl.opt_mem_alignment = 512;
+        bs->bl.opt_mem_alignment = getpagesize();
     }
 
     if (bs->backing_hd) {
diff --git a/block/raw-posix.c b/block/raw-posix.c
index 7083924..04f3d4e 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -301,6 +301,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
 {
     BDRVRawState *s = bs->opaque;
     char *buf;
+    size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
 
     /* For /dev/sg devices the alignment is not really used.
        With buffered I/O, we don't have any restrictions. */
@@ -330,9 +331,9 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     /* If we could not get the sizes so far, we can only guess them */
     if (!s->buf_align) {
         size_t align;
-        buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
-        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
-            if (raw_is_io_aligned(fd, buf + align, MAX_BLOCKSIZE)) {
+        buf = qemu_memalign(max_align, 2 * max_align);
+        for (align = 512; align <= max_align; align <<= 1) {
+            if (raw_is_io_aligned(fd, buf + align, max_align)) {
                 s->buf_align = align;
                 break;
             }
@@ -342,8 +343,8 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
 
     if (!bs->request_alignment) {
         size_t align;
-        buf = qemu_memalign(s->buf_align, MAX_BLOCKSIZE);
-        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+        buf = qemu_memalign(s->buf_align, max_align);
+        for (align = 512; align <= max_align; align <<= 1) {
             if (raw_is_io_aligned(fd, buf, align)) {
                 bs->request_alignment = align;
                 break;
@@ -726,7 +727,9 @@ static void raw_refresh_limits(BlockDriverState *bs, Error **errp)
 
     raw_probe_alignment(bs, s->fd, errp);
     bs->bl.min_mem_alignment = s->buf_align;
-    bs->bl.opt_mem_alignment = s->buf_align;
+    if (bs->bl.min_mem_alignment > bs->bl.opt_mem_alignment) {
+        bs->bl.opt_mem_alignment = bs->bl.min_mem_alignment;
+    }
 }
 
 static int check_for_dasd(int fd)
The following sequence
    int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    for (i = 0; i < 100000; i++)
            write(fd, buf, 4096);
performs 5% better if buf is aligned to 4096 bytes.

The difference is quite reliable.

On the other hand, we do not want at the moment to enforce bounce
buffering if the guest request is aligned to 512 bytes.

The patch changes the default optimal alignment of the bounce buffer to
MAX(page size, 4k). 4k is chosen as the maximal known sector size on
real HDDs.

The justification of the performance improvement is quite interesting.
From the kernel point of view, each request to the disk was split
in two. This can be seen with blktrace like this:
  9,0 11 1 0.000000000 11151 Q WS 312737792 + 1023 [qemu-img]
  9,0 11 2 0.000007938 11151 Q WS 312738815 + 8 [qemu-img]
  9,0 11 3 0.000030735 11151 Q WS 312738823 + 1016 [qemu-img]
  9,0 11 4 0.000032482 11151 Q WS 312739839 + 8 [qemu-img]
  9,0 11 5 0.000041379 11151 Q WS 312739847 + 1016 [qemu-img]
  9,0 11 6 0.000042818 11151 Q WS 312740863 + 8 [qemu-img]
  9,0 11 7 0.000051236 11151 Q WS 312740871 + 1017 [qemu-img]
  9,0 5 1 0.169071519 11151 Q WS 312741888 + 1023 [qemu-img]
After the patch the pattern becomes normal:
  9,0 6 1 0.000000000 12422 Q WS 314834944 + 1024 [qemu-img]
  9,0 6 2 0.000038527 12422 Q WS 314835968 + 1024 [qemu-img]
  9,0 6 3 0.000072849 12422 Q WS 314836992 + 1024 [qemu-img]
  9,0 6 4 0.000106276 12422 Q WS 314838016 + 1024 [qemu-img]
and the number of requests sent to the disk (which can be calculated by
counting the lines in the blktrace output) is reduced by about a factor
of 2.

Both qemu-img and qemu-io are affected while qemu-kvm is not. The guest
does its job well and real requests come properly aligned (to page).

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 block.c           |  8 ++++----
 block/io.c        |  2 +-
 block/raw-posix.c | 14 ++++++++------
 3 files changed, 13 insertions(+), 11 deletions(-)
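As a rough sketch of how the buffer in the benchmark above could be aligned (an assumption for illustration, not code from the patch), the helpers below allocate a buffer aligned to the larger of 4096 and the page size. The names default_io_align(), alloc_io_buffer() and is_aligned() are hypothetical; posix_memalign() and sysconf() are the standard POSIX calls.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper mirroring the patch's MAX(4096, getpagesize()). */
static size_t default_io_align(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t align = 4096;

    if (page > 0 && (size_t)page > align) {
        align = (size_t)page;
    }
    return align;
}

/*
 * Allocate a zeroed buffer of len bytes aligned to default_io_align().
 * Returns NULL on failure; the caller releases it with free().
 */
static void *alloc_io_buffer(size_t len)
{
    void *buf = NULL;

    assert(len > 0);
    if (posix_memalign(&buf, default_io_align(), len) != 0) {
        return NULL;
    }
    memset(buf, 0, len);
    return buf;
}

/* Returns nonzero if p is a multiple of align. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}
```

A buffer obtained from alloc_io_buffer(4096) could then be passed as buf to the write() loop above; whether the 5% difference reproduces depends on the kernel and device in use.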