[1/1] block: enforce minimal 4096 alignment in qemu_blockalign

On 29/01/15 16:49, Denis V. Lunev wrote:
> On 29/01/15 16:18, Kevin Wolf wrote:
>> Am 29.01.2015 um 11:50 hat Denis V. Lunev geschrieben:
>>> The following sequence
>>>      int fd = open(argv[1], O_RDWR | O_CREAT | O_DIRECT, 0644);
>>>      for (i = 0; i < 100000; i++)
>>>              write(fd, buf, 4096);
>>> performs 5% better if buf is aligned to 4096 bytes rather then to
>>> 512 bytes on HDD with 512/4096 logical/physical sector size.
>>>
>>> The difference is quite reliable.
>>>
>>> On the other hand we do not want at the moment to enforce bounce
>>> buffering if guest request is aligned to 512 bytes. This patch
>>> forces page alignment when we really forced to perform memory
>>> allocation.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> CC: Paolo Bonzini <pbonzini@redhat.com>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>   block.c | 9 ++++++++-
>>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block.c b/block.c
>>> index d45e4dd..38cf73f 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -5293,7 +5293,11 @@ void 
>>> bdrv_set_guest_block_size(BlockDriverState *bs, int align)
>>>     void *qemu_blockalign(BlockDriverState *bs, size_t size)
>>>   {
>>> -    return qemu_memalign(bdrv_opt_mem_align(bs), size);
>>> +    size_t align = bdrv_opt_mem_align(bs);
>>> +    if (align < 4096) {
>>> +        align = 4096;
>>> +    }
>>> +    return qemu_memalign(align, size);
>>>   }
>>>     void *qemu_blockalign0(BlockDriverState *bs, size_t size)
>>> @@ -5307,6 +5311,9 @@ void *qemu_try_blockalign(BlockDriverState 
>>> *bs, size_t size)
>>>         /* Ensure that NULL is never returned on success */
>>>       assert(align > 0);
>>> +    if (align < 4096) {
>>> +        align = 4096;
>>> +    }
>>>       if (size == 0) {
>>>           size = align;
>>>       }
>> This is the wrong place to make this change. First you're duplicating
>> logic in the callers of bdrv_opt_mem_align() instead of making it return
>> the right thing in the first place.
> This has been actually done in the first iteration. bdrv_opt_mem_align
> is called actually three times in:
>   qemu_blockalign
>   qemu_try_blockalign
>   bdrv_qiov_is_aligned
> Paolo says that he does not want to have bdrv_qiov_is_aligned affected
> to avoid extra bounce buffering.
>
> From my point of view this extra bounce buffering is better than 
> unaligned
> pointer during write to the disk as 512/4096 logical/physical sectors 
> size
> disks are mainstream now. Though I don't want to specially argue here.
> Normal guest operations results in page aligned requests and this is not
> a problem at all. The amount of 512 aligned requests from guest side is
> quite negligible.
>>   Second, you're arguing with numbers
>> from a simple test case for O_DIRECT on Linux, but you're changing the
>> alignment for everyone instead of just the raw-posix driver which is
>> responsible for accessing Linux files.
> This should not be a real problem. We are allocation memory for the
> buffer. A little bit stricter alignment is not a big overhead for any 
> libc
> implementation thus this kludge will not produce any significant 
> overhead.
>> Also, what's the real reason for the performance improvement? Having
>> page alignment? If so, actually querying the page size instead of
>> assuming 4k might be worth a thought.
>>
>> Kevin
> Most likely the problem comes from the read-modify-write pattern
> either in kernel or in disk. Actually my experience says that it is a
> bad idea to supply 512 byte aligned buffer for O_DIRECT IO.
> ABI technically allows this but in general it is much less tested.
>
> Yes, this synthetic test shows some difference here. In terms of
> qemu-io the result is also visible, but less
>   qemu-img create -f qcow2 ./1.img 64G
>   qemu-io -n -c 'write -P 0xaa 0 1G' 1.img
> performs 1% better.
>
> There is also similar kludge here
> size_t bdrv_opt_mem_align(BlockDriverState *bs)
> {
>     if (!bs || !bs->drv) {
>         /* 4k should be on the safe side */
>         return 4096;
>     }
>
>     return bs->bl.opt_mem_alignment;
> }
> which just uses 4096 constant.
>
> Yes, I could agree that queering page size could be a good idea, but
> I do not know at the moment how to do that. Can you pls share your
> opinion if you have any.
>
> Regards,
>     Den
Paolo, Kevin,

I have spent a bit more time digging the issue and found some
additional information. The same 5% difference if the buffer is
aligned to 512/4096 is observed for the following devices/filesystems

1) ext4 with block size equals to 1024 over 512/512 physical/logical
    sector size SSD disk
2) ext4 with block size equals to 4096 over 512/512 physical/logical
    sector size SSD disk
3) ext4 with block size equals to 4096 over 512/4096 physical/logical
    sector size rotational disk (WDC WD20EZRX)
4) with block size equals to 4096 over 512/512 physical/logical
    sector size SSD disk

This means that only page size (4k) matters.

Guys, you propose quite different approaches. I can extend this patch
to use sysconf(_SC_PAGESIZE) to detect page size and drop hardcoded
4096. This is not a problem. But you have different opinion about
the place to insert the check.

Could you please come into agreement?

Proper defines/configuration work to be done, I am trying to negotiate
principal approach.

Version 1)


@@ -5304,9 +5309,13 @@ void *qemu_blockalign0(BlockDriverState *bs, 
size_t size)
  void *qemu_try_blockalign(BlockDriverState *bs, size_t size)
  {
      size_t align = bdrv_opt_mem_align(bs);
+    int page_size = sysconf(_SC_PAGESIZE);

      /* Ensure that NULL is never returned on success */
      assert(align > 0);
+    if (align < page_size) {
+        align = page_size;
+    }
      if (size == 0) {
          size = align;
      }

I am totally fine with both versions.

Regards,
     Den

P.S. A bit improved version of test is attached.


[1/1] block: enforce minimal 4096 alignment in qemu_blockalign

Commit Message

Comments

Patch