From patchwork Fri Jan 30 18:39:11 2015
X-Patchwork-Submitter: "Denis V. Lunev"
X-Patchwork-Id: 434972
Message-ID: <54CBCFCF.7090006@parallels.com>
In-Reply-To: <54CA3A7A.8090208@openvz.org>
Date: Fri, 30 Jan 2015 21:39:11 +0300
From: "Denis V. Lunev"
To: Kevin Wolf, Paolo Bonzini
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi
Subject: Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign

On 29/01/15 16:49, Denis V. Lunev wrote:
> On 29/01/15 16:18, Kevin Wolf wrote:
>> On 29.01.2015 at 11:50, Denis V. Lunev wrote:
>>> The following sequence
>>>
>>>     int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
>>>     for (i = 0; i < 100000; i++)
>>>         write(fd, buf, 4096);
>>>
>>> performs 5% better if buf is aligned to 4096 bytes rather than to
>>> 512 bytes on an HDD with 512/4096 logical/physical sector size.
>>>
>>> The difference is quite reliable.
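For anyone who wants to reproduce the measurement, a self-contained
version of the loop above might look as follows. This is a sketch: the
command-line alignment argument, the error handling and the use of
posix_memalign() are reconstructions, not part of the original test.
Time it with time(1), varying the alignment between 512 and 4096:

#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* compare align = 512 vs align = 4096 */
    size_t align = argc > 2 ? (size_t)atoll(argv[2]) : 4096;
    void *buf;
    int fd, i;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file> [alignment]\n", argv[0]);
        return 1;
    }
    /* O_DIRECT requires an aligned user buffer; the alignment of buf
     * is the only variable in this test */
    if (posix_memalign(&buf, align, 4096) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0xaa, 4096);

    fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (i = 0; i < 100000; i++) {
        if (write(fd, buf, 4096) != 4096) {
            perror("write");
            return 1;
        }
    }
    close(fd);
    free(buf);
    return 0;
}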
>>> On the other hand, we do not want at the moment to enforce bounce
>>> buffering if the guest request is aligned to 512 bytes. This patch
>>> forces page alignment only when we are really forced to perform a
>>> memory allocation.
>>>
>>> Signed-off-by: Denis V. Lunev
>>> CC: Paolo Bonzini
>>> CC: Kevin Wolf
>>> CC: Stefan Hajnoczi
>>> ---
>>>  block.c | 9 ++++++++-
>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index d45e4dd..38cf73f 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -5293,7 +5293,11 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
>>>  
>>>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
>>>  {
>>> -    return qemu_memalign(bdrv_opt_mem_align(bs), size);
>>> +    size_t align = bdrv_opt_mem_align(bs);
>>> +    if (align < 4096) {
>>> +        align = 4096;
>>> +    }
>>> +    return qemu_memalign(align, size);
>>>  }
>>>  
>>>  void *qemu_blockalign0(BlockDriverState *bs, size_t size)
>>> @@ -5307,6 +5311,9 @@ void *qemu_try_blockalign(BlockDriverState *bs, size_t size)
>>>  
>>>      /* Ensure that NULL is never returned on success */
>>>      assert(align > 0);
>>> +    if (align < 4096) {
>>> +        align = 4096;
>>> +    }
>>>      if (size == 0) {
>>>          size = align;
>>>      }
>>
>> This is the wrong place to make this change. First, you're duplicating
>> logic in the callers of bdrv_opt_mem_align() instead of making it
>> return the right thing in the first place.
>
> This was actually done in the first iteration. bdrv_opt_mem_align is
> called in three places:
>     qemu_blockalign
>     qemu_try_blockalign
>     bdrv_qiov_is_aligned
> Paolo says that he does not want bdrv_qiov_is_aligned affected, to
> avoid extra bounce buffering.
>
> From my point of view this extra bounce buffering is better than an
> unaligned pointer during a write to the disk, as disks with 512/4096
> logical/physical sector size are mainstream now. Though I don't want
> to argue that point here. Normal guest operations result in
> page-aligned requests, so this is not a problem at all; the number of
> 512-aligned requests from the guest side is quite negligible.
>
>> Second, you're arguing with numbers from a simple test case for
>> O_DIRECT on Linux, but you're changing the alignment for everyone
>> instead of just the raw-posix driver, which is responsible for
>> accessing Linux files.
>
> This should not be a real problem. We are allocating memory for the
> buffer. A slightly stricter alignment is cheap for any libc
> implementation, so this kludge will not produce any significant
> overhead.
>
>> Also, what's the real reason for the performance improvement? Having
>> page alignment? If so, actually querying the page size instead of
>> assuming 4k might be worth a thought.
>>
>> Kevin
>
> Most likely the problem comes from the read-modify-write pattern,
> either in the kernel or in the disk. Actually, my experience says that
> it is a bad idea to supply a 512-byte aligned buffer for O_DIRECT I/O.
> The ABI technically allows this, but in general it is much less tested.
>
> Yes, this synthetic test shows some difference here. In terms of
> qemu-io the result is also visible, but smaller:
>     qemu-img create -f qcow2 ./1.img 64G
>     qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
> performs 1% better.
>
> There is also a similar kludge here, which just uses the 4096 constant:
>     size_t bdrv_opt_mem_align(BlockDriverState *bs)
>     {
>         if (!bs || !bs->drv) {
>             /* 4k should be on the safe side */
>             return 4096;
>         }
>
>         return bs->bl.opt_mem_alignment;
>     }
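On the question of actually querying the page size: on POSIX systems
this is sysconf(_SC_PAGESIZE) (what getpagesize() returns on Linux),
and it is what the patch versions further down use. A minimal sketch
with a fallback to the current 4096 constant; host_page_size() is an
illustrative name, not an existing QEMU helper:

#include <unistd.h>

static long host_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);    /* may return -1 on error */
    return sz > 0 ? sz : 4096;          /* fall back to the old constant */
}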
> Yes, I could agree that querying the page size could be a good idea,
> but I do not know at the moment how to do that. Can you please share
> your opinion if you have one?
>
> Regards,
>     Den

Paolo, Kevin,

I have spent a bit more time digging into the issue and found some
additional information.

The same 5% difference between a 512-byte and a 4096-byte aligned
buffer is observed for the following devices/filesystems:
1) ext4 with block size equal to 1024 over a 512/512 physical/logical
   sector size SSD disk
2) ext4 with block size equal to 4096 over a 512/512 physical/logical
   sector size SSD disk
3) ext4 with block size equal to 4096 over a 512/4096 physical/logical
   sector size rotational disk (WDC WD20EZRX)
4) with block size equal to 4096 over a 512/512 physical/logical
   sector size SSD disk

This means that only the page size (4k) matters.

Guys, you have proposed quite different approaches. I can extend this
patch to use sysconf(_SC_PAGESIZE) to detect the page size and drop the
hardcoded 4096; this is not a problem. But you have different opinions
about the place to insert the check. Could you please come to an
agreement? Proper defines/configuration work still has to be done; at
this point I am only trying to settle the principal approach.

Version 1)

diff --git a/block.c b/block.c
index d45e4dd..bc5d1e7 100644
--- a/block.c
+++ b/block.c
@@ -543,7 +543,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
         bs->bl.max_transfer_length = bs->file->bl.max_transfer_length;
         bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
     } else {
-        bs->bl.opt_mem_alignment = 512;
+        bs->bl.opt_mem_alignment = sysconf(_SC_PAGESIZE);
     }
 
     if (bs->backing_hd) {

diff --git a/block/raw-posix.c b/block/raw-posix.c
index ec38fee..d1b3388 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -266,7 +266,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     if (!s->buf_align) {
         size_t align;
         buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
-        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+        for (align = sysconf(_SC_PAGESIZE); align <= MAX_BLOCKSIZE; align <<= 1) {
             if (pread(fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) {
                 s->buf_align = align;
                 break;

Version 2)

diff --git a/block.c b/block.c
index d45e4dd..e2bb3fd 100644
--- a/block.c
+++ b/block.c
@@ -5293,7 +5293,12 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
 {
-    return qemu_memalign(bdrv_opt_mem_align(bs), size);
+    int align = bdrv_opt_mem_align(bs);
+    int page_size = sysconf(_SC_PAGESIZE);
+    if (align < page_size) {
+        align = page_size;
+    }
+    return qemu_memalign(align, size);
 }
 
 void *qemu_blockalign0(BlockDriverState *bs, size_t size)
@@ -5304,9 +5309,13 @@ void *qemu_blockalign0(BlockDriverState *bs, size_t size)
 void *qemu_try_blockalign(BlockDriverState *bs, size_t size)
 {
     size_t align = bdrv_opt_mem_align(bs);
+    int page_size = sysconf(_SC_PAGESIZE);
 
     /* Ensure that NULL is never returned on success */
     assert(align > 0);
+    if (align < page_size) {
+        align = page_size;
+    }
     if (size == 0) {
         size = align;
     }

I am totally fine with both versions.

Regards,
    Den

P.S. A bit improved version of the test is attached.
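Note that both hunks of Version 2 repeat the same clamp, which is
exactly the duplication Kevin objected to. If that variant were chosen,
the check could be hoisted into a single place in block.c; a sketch
under that assumption (qemu_block_mem_align() is a made-up name, not
part of either posted version):

static size_t qemu_block_mem_align(BlockDriverState *bs)
{
    size_t align = bdrv_opt_mem_align(bs);
    long page_size = sysconf(_SC_PAGESIZE);

    /* never hand out a buffer with less than page alignment */
    if (page_size > 0 && align < (size_t)page_size) {
        align = page_size;
    }
    return align;
}

void *qemu_blockalign(BlockDriverState *bs, size_t size)
{
    return qemu_memalign(qemu_block_mem_align(bs), size);
}

qemu_try_blockalign() would call the same helper, while
bdrv_qiov_is_aligned() stays untouched, as Paolo prefers.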