From patchwork Fri Jan 30 18:39:11 2015
X-Patchwork-Submitter: "Denis V. Lunev"
X-Patchwork-Id: 434972
Message-ID: <54CBCFCF.7090006@parallels.com>
In-Reply-To: <54CA3A7A.8090208@openvz.org>
Date: Fri, 30 Jan 2015 21:39:11 +0300
From: "Denis V. Lunev"
To: Kevin Wolf, Paolo Bonzini
Cc: qemu-devel@nongnu.org, Stefan Hajnoczi
Subject: Re: [Qemu-devel] [PATCH 1/1] block: enforce minimal 4096 alignment in qemu_blockalign

On 29/01/15 16:49, Denis V. Lunev wrote:
> On 29/01/15 16:18, Kevin Wolf wrote:
>> On 29.01.2015 at 11:50, Denis V. Lunev wrote:
>>> The following sequence
>>>
>>>     int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
>>>     for (i = 0; i < 100000; i++)
>>>         write(fd, buf, 4096);
>>>
>>> performs 5% better if buf is aligned to 4096 bytes rather than to
>>> 512 bytes on an HDD with 512/4096 logical/physical sector size.
>>>
>>> The difference is quite reliable.
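For anyone who wants to reproduce the measurement, a self-contained
version of the loop above might look as follows. This is a sketch: the
command-line alignment argument, the error handling and the use of
posix_memalign() are reconstructions, not part of the original test.
Time it with time(1), varying the alignment between 512 and 4096:

#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* compare align = 512 vs align = 4096 */
    size_t align = argc > 2 ? (size_t)atoll(argv[2]) : 4096;
    void *buf;
    int fd, i;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file> [alignment]\n", argv[0]);
        return 1;
    }
    /* O_DIRECT requires an aligned user buffer; the alignment of buf
     * is the only variable in this test */
    if (posix_memalign(&buf, align, 4096) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0xaa, 4096);

    fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (i = 0; i < 100000; i++) {
        if (write(fd, buf, 4096) != 4096) {
            perror("write");
            return 1;
        }
    }
    close(fd);
    free(buf);
    return 0;
}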
>>> On the other hand, we do not want at the moment to enforce bounce
>>> buffering if the guest request is aligned to 512 bytes. This patch
>>> forces page alignment only when we are really forced to perform a
>>> memory allocation.
>>>
>>> Signed-off-by: Denis V. Lunev
>>> CC: Paolo Bonzini
>>> CC: Kevin Wolf
>>> CC: Stefan Hajnoczi
>>> ---
>>>  block.c | 9 ++++++++-
>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index d45e4dd..38cf73f 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -5293,7 +5293,11 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
>>>  
>>>  void *qemu_blockalign(BlockDriverState *bs, size_t size)
>>>  {
>>> -    return qemu_memalign(bdrv_opt_mem_align(bs), size);
>>> +    size_t align = bdrv_opt_mem_align(bs);
>>> +    if (align < 4096) {
>>> +        align = 4096;
>>> +    }
>>> +    return qemu_memalign(align, size);
>>>  }
>>>  
>>>  void *qemu_blockalign0(BlockDriverState *bs, size_t size)
>>> @@ -5307,6 +5311,9 @@ void *qemu_try_blockalign(BlockDriverState *bs, size_t size)
>>>  
>>>      /* Ensure that NULL is never returned on success */
>>>      assert(align > 0);
>>> +    if (align < 4096) {
>>> +        align = 4096;
>>> +    }
>>>      if (size == 0) {
>>>          size = align;
>>>      }
>>
>> This is the wrong place to make this change. First, you're duplicating
>> logic in the callers of bdrv_opt_mem_align() instead of making it
>> return the right thing in the first place.
>
> This was actually done in the first iteration. bdrv_opt_mem_align is
> called in three places:
>     qemu_blockalign
>     qemu_try_blockalign
>     bdrv_qiov_is_aligned
> Paolo says that he does not want bdrv_qiov_is_aligned affected, to
> avoid extra bounce buffering.
>
> From my point of view this extra bounce buffering is better than an
> unaligned pointer during a write to the disk, as disks with 512/4096
> logical/physical sector size are mainstream now. Though I don't want
> to argue that point here. Normal guest operations result in
> page-aligned requests, so this is not a problem at all; the number of
> 512-aligned requests from the guest side is quite negligible.
>
>> Second, you're arguing with numbers from a simple test case for
>> O_DIRECT on Linux, but you're changing the alignment for everyone
>> instead of just the raw-posix driver, which is responsible for
>> accessing Linux files.
>
> This should not be a real problem. We are allocating memory for the
> buffer. A slightly stricter alignment is cheap for any libc
> implementation, so this kludge will not produce any significant
> overhead.
>
>> Also, what's the real reason for the performance improvement? Having
>> page alignment? If so, actually querying the page size instead of
>> assuming 4k might be worth a thought.
>>
>> Kevin
>
> Most likely the problem comes from the read-modify-write pattern,
> either in the kernel or in the disk. Actually, my experience says that
> it is a bad idea to supply a 512-byte aligned buffer for O_DIRECT I/O.
> The ABI technically allows this, but in general it is much less tested.
>
> Yes, this synthetic test shows some difference here. In terms of
> qemu-io the result is also visible, but smaller:
>     qemu-img create -f qcow2 ./1.img 64G
>     qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
> performs 1% better.
>
> There is also a similar kludge here, which just uses the 4096 constant:
>     size_t bdrv_opt_mem_align(BlockDriverState *bs)
>     {
>         if (!bs || !bs->drv) {
>             /* 4k should be on the safe side */
>             return 4096;
>         }
>
>         return bs->bl.opt_mem_alignment;
>     }
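On the question of actually querying the page size: on POSIX systems
this is sysconf(_SC_PAGESIZE) (what getpagesize() returns on Linux),
and it is what the patch versions further down use. A minimal sketch
with a fallback to the current 4096 constant; host_page_size() is an
illustrative name, not an existing QEMU helper:

#include <unistd.h>

static long host_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);    /* may return -1 on error */
    return sz > 0 ? sz : 4096;          /* fall back to the old constant */
}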
> Yes, I could agree that querying the page size could be a good idea,
> but I do not know at the moment how to do that. Can you please share
> your opinion if you have one?
>
> Regards,
>     Den

Paolo, Kevin,

I have spent a bit more time digging into the issue and found some
additional information.

The same 5% difference between a 512-byte and a 4096-byte aligned
buffer is observed for the following devices/filesystems:
1) ext4 with block size equal to 1024 over a 512/512 physical/logical
   sector size SSD disk
2) ext4 with block size equal to 4096 over a 512/512 physical/logical
   sector size SSD disk
3) ext4 with block size equal to 4096 over a 512/4096 physical/logical
   sector size rotational disk (WDC WD20EZRX)
4) with block size equal to 4096 over a 512/512 physical/logical
   sector size SSD disk

This means that only the page size (4k) matters.

Guys, you have proposed quite different approaches. I can extend this
patch to use sysconf(_SC_PAGESIZE) to detect the page size and drop the
hardcoded 4096; this is not a problem. But you have different opinions
about the place to insert the check. Could you please come to an
agreement? Proper defines/configuration work still has to be done; at
this point I am only trying to settle the principal approach.

Version 1)

diff --git a/block.c b/block.c
index d45e4dd..bc5d1e7 100644
--- a/block.c
+++ b/block.c
@@ -543,7 +543,7 @@ void bdrv_refresh_limits(BlockDriverState *bs, Error **errp)
         bs->bl.max_transfer_length = bs->file->bl.max_transfer_length;
         bs->bl.opt_mem_alignment = bs->file->bl.opt_mem_alignment;
     } else {
-        bs->bl.opt_mem_alignment = 512;
+        bs->bl.opt_mem_alignment = sysconf(_SC_PAGESIZE);
     }
 
     if (bs->backing_hd) {

diff --git a/block/raw-posix.c b/block/raw-posix.c
index ec38fee..d1b3388 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -266,7 +266,7 @@ static void raw_probe_alignment(BlockDriverState *bs, int fd, Error **errp)
     if (!s->buf_align) {
         size_t align;
         buf = qemu_memalign(MAX_BLOCKSIZE, 2 * MAX_BLOCKSIZE);
-        for (align = 512; align <= MAX_BLOCKSIZE; align <<= 1) {
+        for (align = sysconf(_SC_PAGESIZE); align <= MAX_BLOCKSIZE; align <<= 1) {
             if (pread(fd, buf + align, MAX_BLOCKSIZE, 0) >= 0) {
                 s->buf_align = align;
                 break;

Version 2)

diff --git a/block.c b/block.c
index d45e4dd..e2bb3fd 100644
--- a/block.c
+++ b/block.c
@@ -5293,7 +5293,12 @@ void bdrv_set_guest_block_size(BlockDriverState *bs, int align)
 
 void *qemu_blockalign(BlockDriverState *bs, size_t size)
 {
-    return qemu_memalign(bdrv_opt_mem_align(bs), size);
+    int align = bdrv_opt_mem_align(bs);
+    int page_size = sysconf(_SC_PAGESIZE);
+    if (align < page_size) {
+        align = page_size;
+    }
+    return qemu_memalign(align, size);
 }
 
 void *qemu_blockalign0(BlockDriverState *bs, size_t size)
@@ -5304,9 +5309,13 @@ void *qemu_blockalign0(BlockDriverState *bs, size_t size)
 void *qemu_try_blockalign(BlockDriverState *bs, size_t size)
 {
     size_t align = bdrv_opt_mem_align(bs);
+    int page_size = sysconf(_SC_PAGESIZE);
 
     /* Ensure that NULL is never returned on success */
     assert(align > 0);
+    if (align < page_size) {
+        align = page_size;
+    }
     if (size == 0) {
         size = align;
     }

I am totally fine with both versions.

Regards,
    Den

P.S. A bit improved version of the test is attached.
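Note that both hunks of Version 2 repeat the same clamp, which is
exactly the duplication Kevin objected to. If that variant were chosen,
the check could be hoisted into a single place in block.c; a sketch
under that assumption (qemu_block_mem_align() is a made-up name, not
part of either posted version):

static size_t qemu_block_mem_align(BlockDriverState *bs)
{
    size_t align = bdrv_opt_mem_align(bs);
    long page_size = sysconf(_SC_PAGESIZE);

    /* never hand out a buffer with less than page alignment */
    if (page_size > 0 && align < (size_t)page_size) {
        align = page_size;
    }
    return align;
}

void *qemu_blockalign(BlockDriverState *bs, size_t size)
{
    return qemu_memalign(qemu_block_mem_align(bs), size);
}

qemu_try_blockalign() would call the same helper, while
bdrv_qiov_is_aligned() stays untouched, as Paolo prefers.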