qemu-img convert cache mode for source

Message ID 53109E99.3020102@kamp.de
State New

Commit Message

Peter Lieven Feb. 28, 2014, 2:35 p.m. UTC
On 27.02.2014 09:57, Stefan Hajnoczi wrote:
> On Wed, Feb 26, 2014 at 05:01:52PM +0100, Peter Lieven wrote:
>> On 26.02.2014 16:41, Stefan Hajnoczi wrote:
>>> On Wed, Feb 26, 2014 at 11:14:04AM +0100, Peter Lieven wrote:
>>>> I was wondering if it would be a good idea to set the O_DIRECT mode for the source
>>>> files of a qemu-img convert process if the source is a host_device?
>>>>
>>>> Currently the backup of a host device is polluting the page cache.
>>> Points to consider:
>>>
>>> 1. O_DIRECT does not work on Linux tmpfs, you get EINVAL when opening
>>>     the file.  A fallback is necessary.
>>>
>>> 2. O_DIRECT has no readahead so performance could actually decrease.
>>>     The question is, how important is reahead versus polluting page
>>>     cache?
>>>
>>> 3. For raw files it would make sense to tell the kernel that access is
>>>     sequential and data will be used only once.  Then we can get the best
>>>     of both worlds (avoid polluting page cache but still get readahead).
>>>     This is done using posix_fadvise(2).
>>>
>>>     The problem is what to do for image formats.  An image file can be
>>>     very fragmented so the readahead might not be a win.  Does this mean
>>>     that for image formats we should tell the kernel access will be
>>>     random?
>>>
>>>     Furthermore, maybe it's best to do readahead inside QEMU so that even
>>>     network protocols (nbd, iscsi, etc) can get good performance.  They
>>>     act like O_DIRECT is always on.
>> your comments are regarding qemu-img convert, right?
>> How would you implement this? A new open flag because
>> the fadvise had to goto inside the protocol driver.
>>
>> I would start with host_devices first and see how it performs there.
>>
>> For qemu-img convert I would issue a FADV_DONTNEED after
>> a write for the bytes that have been written
>> (i have tested this with Linux and it seems to work quite well).
>>
>> Question is, what is the right paramter for reads? Also FADV_DONTNEED?
> I think so but this should be justified with benchmark results.

I ran some benchmarks and found that a FADV_DONTNEED issued after
a read does not hurt performance, but it keeps the buffers from
increasing while I read from a host_device or raw file.
As for writing, it only works if I issue an fdatasync after each write, but
this should be equivalent to O_DIRECT. So I would keep the patch
limited to qemu-img convert sources that are a host_device or a file.
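
For reference, the pattern I benchmarked boils down to this standalone sketch
(block size and command-line handling are placeholders here, this is not the
qemu-img code itself):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE (2 * 1024 * 1024)  /* arbitrary read granularity */

int main(int argc, char **argv)
{
    char *buf;
    off_t offset = 0;
    ssize_t n;
    int fd;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <host_device or raw file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    buf = malloc(BLOCK_SIZE);
    while ((n = pread(fd, buf, BLOCK_SIZE, offset)) > 0) {
#ifdef POSIX_FADV_DONTNEED
        /* drop only the range that has just been read */
        posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);
#endif
        offset += n;
    }
    free(buf);
    close(fd);
    return n < 0 ? 1 : 0;
}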

Here is a proposal for a patch:


Comments

Kevin Wolf March 3, 2014, 10:38 a.m. UTC | #1
Am 28.02.2014 um 15:35 hat Peter Lieven geschrieben:
> On 27.02.2014 09:57, Stefan Hajnoczi wrote:
> >On Wed, Feb 26, 2014 at 05:01:52PM +0100, Peter Lieven wrote:
> >>On 26.02.2014 16:41, Stefan Hajnoczi wrote:
> >>>On Wed, Feb 26, 2014 at 11:14:04AM +0100, Peter Lieven wrote:
> >>>>I was wondering if it would be a good idea to set the O_DIRECT mode for the source
> >>>>files of a qemu-img convert process if the source is a host_device?
> >>>>
> >>>>Currently the backup of a host device is polluting the page cache.
> >>>Points to consider:
> >>>
> >>>1. O_DIRECT does not work on Linux tmpfs, you get EINVAL when opening
> >>>    the file.  A fallback is necessary.
> >>>
> >>>2. O_DIRECT has no readahead so performance could actually decrease.
> >>>    The question is, how important is reahead versus polluting page
> >>>    cache?
> >>>
> >>>3. For raw files it would make sense to tell the kernel that access is
> >>>    sequential and data will be used only once.  Then we can get the best
> >>>    of both worlds (avoid polluting page cache but still get readahead).
> >>>    This is done using posix_fadvise(2).
> >>>
> >>>    The problem is what to do for image formats.  An image file can be
> >>>    very fragmented so the readahead might not be a win.  Does this mean
> >>>    that for image formats we should tell the kernel access will be
> >>>    random?
> >>>
> >>>    Furthermore, maybe it's best to do readahead inside QEMU so that even
> >>>    network protocols (nbd, iscsi, etc) can get good performance.  They
> >>>    act like O_DIRECT is always on.
> >>your comments are regarding qemu-img convert, right?
> >>How would you implement this? A new open flag because
> >>the fadvise had to goto inside the protocol driver.
> >>
> >>I would start with host_devices first and see how it performs there.
> >>
> >>For qemu-img convert I would issue a FADV_DONTNEED after
> >>a write for the bytes that have been written
> >>(i have tested this with Linux and it seems to work quite well).
> >>
> >>Question is, what is the right paramter for reads? Also FADV_DONTNEED?
> >I think so but this should be justified with benchmark results.
> 
> I ran some benchmarks at found that a FADV_DONTNEED issues after
> a read does not hurt regarding to performance. But it avoids buffers
> increasing while I read from a host_device of raw file.

Okay, sounds reasonable.

> As for writing it does only work if I issue a fdatasync after each write, but
> this should be equivalent to O_DIRECT. So I would keep the patch
> to support qemu-img convert sources if they are host_device or file.

Doing an fdatasync() is not an option (and not equivalent to O_DIRECT
at all).

> Here is a proposal for a patch:
> 
> diff --git a/block.c b/block.c
> index 2fd5482..2445433 100644
> --- a/block.c
> +++ b/block.c
> @@ -2626,6 +2626,14 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
>              qemu_aio_wait();
>          }
>      }
> +
> +#ifdef POSIX_FADV_DONTNEED
> +    if (!rwco.ret && bs->open_flags & BDRV_O_SEQUENTIAL &&
> +        bs->drv->bdrv_fadvise && !is_write) {
> +        bs->drv->bdrv_fadvise(bs, offset, qiov->size, POSIX_FADV_DONTNEED);
> +    }
> +#endif
> +

This #ifdef should be in the raw-posix driver. Please try to keep the
qemu interface backend agnostic and leave POSIX_FADV_DONTNEED and
friends as an implementation detail of block drivers.
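
I.e. something along these lines (the names bdrv_advise and
BDRV_ADVISE_DONTNEED are made up purely to illustrate the split, they are
not existing interfaces):

/* block.h: a backend-agnostic hint type */
typedef enum {
    BDRV_ADVISE_DONTNEED,   /* the range will not be accessed again */
} BdrvAdviseType;

/* block.c: forward the hint if the driver implements it */
static void bdrv_advise(BlockDriverState *bs, int64_t offset, int64_t bytes,
                        BdrvAdviseType advise)
{
    if (bs->drv && bs->drv->bdrv_advise) {
        bs->drv->bdrv_advise(bs, offset, bytes, advise);
    }
}

/* block/raw-posix.c: the only place that knows about posix_fadvise() */
static void raw_advise(BlockDriverState *bs, int64_t offset, int64_t bytes,
                       BdrvAdviseType advise)
{
#ifdef POSIX_FADV_DONTNEED
    BDRVRawState *s = bs->opaque;

    if (advise == BDRV_ADVISE_DONTNEED) {
        posix_fadvise(s->fd, offset, bytes, POSIX_FADV_DONTNEED);
    }
#endif
}

That way other protocol drivers (iscsi, nfs, ...) could later map the same
hint to whatever makes sense for them.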

> diff --git a/include/block/block.h b/include/block/block.h
> index 780f48b..a4dcc3c 100644
> --- a/include/block/block.h
> +++ b/include/block/block.h
> @@ -105,6 +105,9 @@ typedef enum {
>  #define BDRV_O_PROTOCOL    0x8000  /* if no block driver is explicitly given:
>                                        select an appropriate protocol driver,
>                                        ignoring the format layer */
> +#define BDRV_O_SEQUENTIAL 0x10000  /* open device for sequential read/write */
> +
> +
> 
>  #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)

Why two additional newlines?

BDRV_O_SEQUENTIAL works for me as the external interface.

Kevin
Peter Lieven March 3, 2014, 11:20 a.m. UTC | #2
On 03.03.2014 11:38, Kevin Wolf wrote:
> Am 28.02.2014 um 15:35 hat Peter Lieven geschrieben:
>> On 27.02.2014 09:57, Stefan Hajnoczi wrote:
>>> On Wed, Feb 26, 2014 at 05:01:52PM +0100, Peter Lieven wrote:
>>>> On 26.02.2014 16:41, Stefan Hajnoczi wrote:
>>>>> On Wed, Feb 26, 2014 at 11:14:04AM +0100, Peter Lieven wrote:
>>>>>> I was wondering if it would be a good idea to set the O_DIRECT mode for the source
>>>>>> files of a qemu-img convert process if the source is a host_device?
>>>>>>
>>>>>> Currently the backup of a host device is polluting the page cache.
>>>>> Points to consider:
>>>>>
>>>>> 1. O_DIRECT does not work on Linux tmpfs, you get EINVAL when opening
>>>>>     the file.  A fallback is necessary.
>>>>>
>>>>> 2. O_DIRECT has no readahead so performance could actually decrease.
>>>>>     The question is, how important is reahead versus polluting page
>>>>>     cache?
>>>>>
>>>>> 3. For raw files it would make sense to tell the kernel that access is
>>>>>     sequential and data will be used only once.  Then we can get the best
>>>>>     of both worlds (avoid polluting page cache but still get readahead).
>>>>>     This is done using posix_fadvise(2).
>>>>>
>>>>>     The problem is what to do for image formats.  An image file can be
>>>>>     very fragmented so the readahead might not be a win.  Does this mean
>>>>>     that for image formats we should tell the kernel access will be
>>>>>     random?
>>>>>
>>>>>     Furthermore, maybe it's best to do readahead inside QEMU so that even
>>>>>     network protocols (nbd, iscsi, etc) can get good performance.  They
>>>>>     act like O_DIRECT is always on.
>>>> your comments are regarding qemu-img convert, right?
>>>> How would you implement this? A new open flag because
>>>> the fadvise had to goto inside the protocol driver.
>>>>
>>>> I would start with host_devices first and see how it performs there.
>>>>
>>>> For qemu-img convert I would issue a FADV_DONTNEED after
>>>> a write for the bytes that have been written
>>>> (i have tested this with Linux and it seems to work quite well).
>>>>
>>>> Question is, what is the right paramter for reads? Also FADV_DONTNEED?
>>> I think so but this should be justified with benchmark results.
>> I ran some benchmarks at found that a FADV_DONTNEED issues after
>> a read does not hurt regarding to performance. But it avoids buffers
>> increasing while I read from a host_device of raw file.
> Okay, sounds reasonable.
>
>> As for writing it does only work if I issue a fdatasync after each write, but
>> this should be equivalent to O_DIRECT. So I would keep the patch
>> to support qemu-img convert sources if they are host_device or file.
> Doing an fdatasync() is not an option (and not equivalent to O_DIRECT
> at all).
Of course, fdatasync is not an option. I just wanted to point out that one
should use O_DIRECT if the page cache is to be disabled for the output
file.
>
>> Here is a proposal for a patch:
>>
>> diff --git a/block.c b/block.c
>> index 2fd5482..2445433 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -2626,6 +2626,14 @@ static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
>>               qemu_aio_wait();
>>           }
>>       }
>> +
>> +#ifdef POSIX_FADV_DONTNEED
>> +    if (!rwco.ret && bs->open_flags & BDRV_O_SEQUENTIAL &&
>> +        bs->drv->bdrv_fadvise && !is_write) {
>> +        bs->drv->bdrv_fadvise(bs, offset, qiov->size, POSIX_FADV_DONTNEED);
>> +    }
>> +#endif
>> +
> This #ifdef should be in the raw-posix driver. Please try to keep the
> qemu interface backend agnostic and leave POSIX_FADV_DONTNEED and
> friends as an implementation detail of block drivers.
I had the same idea, but as far as I can see the callback for the completed
request is handled in block.c. raw-posix uses the aio interface and
not coroutines.
>
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 780f48b..a4dcc3c 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
>> @@ -105,6 +105,9 @@ typedef enum {
>>   #define BDRV_O_PROTOCOL    0x8000  /* if no block driver is explicitly given:
>>                                         select an appropriate protocol driver,
>>                                         ignoring the format layer */
>> +#define BDRV_O_SEQUENTIAL 0x10000  /* open device for sequential read/write */
>> +
>> +
>>
>>   #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)
> Why two additional newlines?
Typo. This patch was basically an RFC :-)
>
> BDRV_O_SEQUENTIAL works for me as the external interface.
>
> Kevin

Peter
Stefan Hajnoczi March 3, 2014, 12:03 p.m. UTC | #3
On Fri, Feb 28, 2014 at 03:35:05PM +0100, Peter Lieven wrote:
> On 27.02.2014 09:57, Stefan Hajnoczi wrote:
> >On Wed, Feb 26, 2014 at 05:01:52PM +0100, Peter Lieven wrote:
> >>On 26.02.2014 16:41, Stefan Hajnoczi wrote:
> >>>On Wed, Feb 26, 2014 at 11:14:04AM +0100, Peter Lieven wrote:
> >>>>I was wondering if it would be a good idea to set the O_DIRECT mode for the source
> >>>>files of a qemu-img convert process if the source is a host_device?
> >>>>
> >>>>Currently the backup of a host device is polluting the page cache.
> >>>Points to consider:
> >>>
> >>>1. O_DIRECT does not work on Linux tmpfs, you get EINVAL when opening
> >>>    the file.  A fallback is necessary.
> >>>
> >>>2. O_DIRECT has no readahead so performance could actually decrease.
> >>>    The question is, how important is reahead versus polluting page
> >>>    cache?
> >>>
> >>>3. For raw files it would make sense to tell the kernel that access is
> >>>    sequential and data will be used only once.  Then we can get the best
> >>>    of both worlds (avoid polluting page cache but still get readahead).
> >>>    This is done using posix_fadvise(2).
> >>>
> >>>    The problem is what to do for image formats.  An image file can be
> >>>    very fragmented so the readahead might not be a win.  Does this mean
> >>>    that for image formats we should tell the kernel access will be
> >>>    random?
> >>>
> >>>    Furthermore, maybe it's best to do readahead inside QEMU so that even
> >>>    network protocols (nbd, iscsi, etc) can get good performance.  They
> >>>    act like O_DIRECT is always on.
> >>your comments are regarding qemu-img convert, right?
> >>How would you implement this? A new open flag because
> >>the fadvise had to goto inside the protocol driver.
> >>
> >>I would start with host_devices first and see how it performs there.
> >>
> >>For qemu-img convert I would issue a FADV_DONTNEED after
> >>a write for the bytes that have been written
> >>(i have tested this with Linux and it seems to work quite well).
> >>
> >>Question is, what is the right paramter for reads? Also FADV_DONTNEED?
> >I think so but this should be justified with benchmark results.
> 
> I ran some benchmarks at found that a FADV_DONTNEED issues after
> a read does not hurt regarding to performance. But it avoids buffers
> increasing while I read from a host_device of raw file.

It was mentioned in this thread that a sequential read shouldn't promote the
pages anyway - they should be dropped by the kernel if there is memory
pressure.

So what is the actual performance problem you are trying to solve, and
what benchmark output are you getting when you compare runs with
FADV_DONTNEED against runs without it?

I think there's a danger that the discussion will go around in circles.
Please post the performance results that kicked off this whole effort
and let's focus on the data.  That way it's much easier to evaluate what
changes to QEMU are a win and which are not necessary.

> As for writing it does only work if I issue a fdatasync after each write, but
> this should be equivalent to O_DIRECT. So I would keep the patch
> to support qemu-img convert sources if they are host_device or file.

fdatasync(2) is much more heavy-weight than writing out pages because
it sends a disk write cache flush command and waits for it to complete.

Stefan
Peter Lieven March 3, 2014, 12:20 p.m. UTC | #4
On 03.03.2014 13:03, Stefan Hajnoczi wrote:
> On Fri, Feb 28, 2014 at 03:35:05PM +0100, Peter Lieven wrote:
>> On 27.02.2014 09:57, Stefan Hajnoczi wrote:
>>> On Wed, Feb 26, 2014 at 05:01:52PM +0100, Peter Lieven wrote:
>>>> On 26.02.2014 16:41, Stefan Hajnoczi wrote:
>>>>> On Wed, Feb 26, 2014 at 11:14:04AM +0100, Peter Lieven wrote:
>>>>>> I was wondering if it would be a good idea to set the O_DIRECT mode for the source
>>>>>> files of a qemu-img convert process if the source is a host_device?
>>>>>>
>>>>>> Currently the backup of a host device is polluting the page cache.
>>>>> Points to consider:
>>>>>
>>>>> 1. O_DIRECT does not work on Linux tmpfs, you get EINVAL when opening
>>>>>     the file.  A fallback is necessary.
>>>>>
>>>>> 2. O_DIRECT has no readahead so performance could actually decrease.
>>>>>     The question is, how important is reahead versus polluting page
>>>>>     cache?
>>>>>
>>>>> 3. For raw files it would make sense to tell the kernel that access is
>>>>>     sequential and data will be used only once.  Then we can get the best
>>>>>     of both worlds (avoid polluting page cache but still get readahead).
>>>>>     This is done using posix_fadvise(2).
>>>>>
>>>>>     The problem is what to do for image formats.  An image file can be
>>>>>     very fragmented so the readahead might not be a win.  Does this mean
>>>>>     that for image formats we should tell the kernel access will be
>>>>>     random?
>>>>>
>>>>>     Furthermore, maybe it's best to do readahead inside QEMU so that even
>>>>>     network protocols (nbd, iscsi, etc) can get good performance.  They
>>>>>     act like O_DIRECT is always on.
>>>> your comments are regarding qemu-img convert, right?
>>>> How would you implement this? A new open flag because
>>>> the fadvise had to goto inside the protocol driver.
>>>>
>>>> I would start with host_devices first and see how it performs there.
>>>>
>>>> For qemu-img convert I would issue a FADV_DONTNEED after
>>>> a write for the bytes that have been written
>>>> (i have tested this with Linux and it seems to work quite well).
>>>>
>>>> Question is, what is the right paramter for reads? Also FADV_DONTNEED?
>>> I think so but this should be justified with benchmark results.
>> I ran some benchmarks at found that a FADV_DONTNEED issues after
>> a read does not hurt regarding to performance. But it avoids buffers
>> increasing while I read from a host_device of raw file.
> It was mentioned in this thread that a sequential shouldn't promote the
> pages anyway - they should be dropped by the kernel if there is memory
> pressure.
Yes, but this costs CPU time in spikes, and the page cache is polluted
with data that is definitely not needed.
>
> So what is the actual performance problem you are trying to solve and
> what benchmark output are you getting when you compare with
> FADV_DONTNEED against without FADV_DONTNEED?
I found the performance to be identical. For the problem, please see below.
>
> I think there's a danger that the discussion will go around in circles.
> Please post the performance results that kicked off this whole effort
> and let's focus on the data.  That way it's much easier to evaluate what
> changes to QEMU are a win and which are not necessary.
I found that under memory pressure the increasing buffers
led to vserver memory being swapped out. This caused trouble
especially in overcommit scenarios (where all memory is backed by
swap).
>
>> As for writing it does only work if I issue a fdatasync after each write, but
>> this should be equivalent to O_DIRECT. So I would keep the patch
>> to support qemu-img convert sources if they are host_device or file.
> fdatasync(2) is much more heavy-weight than writing out a pages because
> it sends a disk write cache flush command and waits for it to complete.
As mentioned before, the FADV_DONTNEED approach doesn't work for the
write path.

Peter
Paolo Bonzini March 3, 2014, 12:59 p.m. UTC | #5
Il 03/03/2014 12:20, Peter Lieven ha scritto:
>>>
>> This #ifdef should be in the raw-posix driver. Please try to keep the
>> qemu interface backend agnostic and leave POSIX_FADV_DONTNEED and
>> friends as an implementation detail of block drivers.
> I had the same idee, but as far as I see the callback to the completes
> request is handled in block.c. raw-posix uses the aio interface and
> not coroutines.

You can put it in the QEMU_AIO_READ case of aio_worker.

linux-aio.c is only for O_DIRECT, so you do not need it there.

BTW, perhaps you need to exclude the fadvise if O_DIRECT is in use?

Paolo
Peter Lieven March 3, 2014, 1:07 p.m. UTC | #6
On 03.03.2014 13:59, Paolo Bonzini wrote:
> Il 03/03/2014 12:20, Peter Lieven ha scritto:
>>>>
>>> This #ifdef should be in the raw-posix driver. Please try to keep the
>>> qemu interface backend agnostic and leave POSIX_FADV_DONTNEED and
>>> friends as an implementation detail of block drivers.
>> I had the same idee, but as far as I see the callback to the completes
>> request is handled in block.c. raw-posix uses the aio interface and
>> not coroutines.
>
> You can put it in the QEMU_AIO_READ case of aio_worker.
Thanks for the pointer.
>
> linux-aio.c is only for O_DIRECT, so you do not need it there.
>
> BTW, perhaps you need to exclude the fadvise if O_DIRECT is in use?
Good point!
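
Something like this in aio_worker() then (just a sketch; the aiocb field
names and the exact spot in the read path are assumptions on my side):

    /* block/raw-posix.c, aio_worker(), sketch only */
    case QEMU_AIO_READ:
        ret = handle_aiocb_rw(aiocb);
        /* ... existing short-read handling stays unchanged ... */
#ifdef POSIX_FADV_DONTNEED
        /* only for buffered access: with cache=none (O_DIRECT) there is
         * no page cache to drop */
        if ((aiocb->bs->open_flags & BDRV_O_SEQUENTIAL) &&
            !(aiocb->bs->open_flags & BDRV_O_NOCACHE)) {
            posix_fadvise(aiocb->aio_fildes, aiocb->aio_offset,
                          aiocb->aio_nbytes, POSIX_FADV_DONTNEED);
        }
#endif
        break;
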
>
> Paolo

Peter
Stefan Hajnoczi March 4, 2014, 9:24 a.m. UTC | #7
On Mon, Mar 03, 2014 at 01:20:21PM +0100, Peter Lieven wrote:
> On 03.03.2014 13:03, Stefan Hajnoczi wrote:
> >So what is the actual performance problem you are trying to solve and
> >what benchmark output are you getting when you compare with
> >FADV_DONTNEED against without FADV_DONTNEED?
> I found the performance to be identical. For the problem see below please.
> >
> >I think there's a danger that the discussion will go around in circles.
> >Please post the performance results that kicked off this whole effort
> >and let's focus on the data.  That way it's much easier to evaluate what
> >changes to QEMU are a win and which are not necessary.
> I found that under memory pressure situations the increasing buffers
> leads to vserver memory being swapped out. This caused trouble
> especially in overcommit scenarios (where all memory is backed by
> swap).

I think the general idea is qemu-img should not impact running guests,
even on a heavily loaded machine.  But again, this needs to be discussed
using concrete benchmarks with configurations and results posted to the
list.

Stefan
Peter Lieven March 5, 2014, 2:44 p.m. UTC | #8
Am 04.03.2014 10:24, schrieb Stefan Hajnoczi:
> On Mon, Mar 03, 2014 at 01:20:21PM +0100, Peter Lieven wrote:
>> On 03.03.2014 13:03, Stefan Hajnoczi wrote:
>>> So what is the actual performance problem you are trying to solve and
>>> what benchmark output are you getting when you compare with
>>> FADV_DONTNEED against without FADV_DONTNEED?
>> I found the performance to be identical. For the problem see below please.
>>> I think there's a danger that the discussion will go around in circles.
>>> Please post the performance results that kicked off this whole effort
>>> and let's focus on the data.  That way it's much easier to evaluate what
>>> changes to QEMU are a win and which are not necessary.
>> I found that under memory pressure situations the increasing buffers
>> leads to vserver memory being swapped out. This caused trouble
>> especially in overcommit scenarios (where all memory is backed by
>> swap).
> I think the general idea is qemu-img should not impact running guests,
> even on a heavily loaded machine.  But again, this needs to be discussed
> using concrete benchmarks with configurations and results posted to the
> list.
Sure, this is why I started to look at this. I found that under high memory
pressure a backup (local storage -> NFS) causes swapping. I started to
use libnfs as the destination to avoid the influence of the kernel NFS client. But
I saw that the buffers still increase while a backup is running. With the
proposed patch I sent recently

[PATCH] block: introduce BDRV_O_SEQUENTIAL

I no longer see this behaviour, and I have not yet observed a performance penalty.

Peter
>
> Stefan
Marcus March 5, 2014, 3:20 p.m. UTC | #9
I think this is a more generic sysadmin problem.  I've seen the same
thing in the past with simply snapshotting a logical volume or zfs
zvol and copying it off somewhere. Page cache bloats, the system
starts swapping. To avoid it, we wrote a small C program that calls
FADV_DONTNEED on a file, and our backup scripts fork off a process to
call it on the source file every X seconds.
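
Roughly this kind of thing, reconstructed from memory for illustration (not
the exact tool we use):

/* drop a file's pages from the page cache; the backup script runs this
 * against the source file every X seconds */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int fd, ret;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* len == 0 means "to the end of the file" */
    ret = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return ret ? 1 : 0;
}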

It's a little strange to me to have qemu-img do this, just like it
would be strange if 'cp' did it, but I can see it as a very useful
shortcut if it's an optional flag.  qemu-img to me is just an admin
tool, and the admin should decide if they want their tool's reads
cached.  Some additional things that come to mind:

* If you are running qemu-img on a running VM's source file,
FADV_DONTNEED may ruin the cache you wanted if the VM is not running
cache=none.

* O_DIRECT I think will cause unexpected problems, for example the
zfsonlinux guys (and tmpfs as mentioned) don't yet support it. If it
is used, there has to be a fallback or a way to turn it off.

On Wed, Mar 5, 2014 at 7:44 AM, Peter Lieven <pl@kamp.de> wrote:
> Am 04.03.2014 10:24, schrieb Stefan Hajnoczi:
>> On Mon, Mar 03, 2014 at 01:20:21PM +0100, Peter Lieven wrote:
>>> On 03.03.2014 13:03, Stefan Hajnoczi wrote:
>>>> So what is the actual performance problem you are trying to solve and
>>>> what benchmark output are you getting when you compare with
>>>> FADV_DONTNEED against without FADV_DONTNEED?
>>> I found the performance to be identical. For the problem see below please.
>>>> I think there's a danger that the discussion will go around in circles.
>>>> Please post the performance results that kicked off this whole effort
>>>> and let's focus on the data.  That way it's much easier to evaluate what
>>>> changes to QEMU are a win and which are not necessary.
>>> I found that under memory pressure situations the increasing buffers
>>> leads to vserver memory being swapped out. This caused trouble
>>> especially in overcommit scenarios (where all memory is backed by
>>> swap).
>> I think the general idea is qemu-img should not impact running guests,
>> even on a heavily loaded machine.  But again, this needs to be discussed
>> using concrete benchmarks with configurations and results posted to the
>> list.
> Sure, this is why I started to look at this. I found that under high memory
> pressure a backup (local storage -> NFS) causes swapping. I started to
> use libnfs as destination to avoid influence of the kernel NFS client. But
> I saw that the buffers still increase while a backup is running. With the
> proposed patch I sent recently
>
> [PATCH] block: introduce BDRV_O_SEQUENTIAL
>
> I don't see this behaviour while I have not yet observed a performance penalty.
>
> Peter
>>
>> Stefan
>
>
Peter Lieven March 5, 2014, 3:53 p.m. UTC | #10
Am 05.03.2014 16:20, schrieb Marcus:
> I think this is a more generic sysadmin problem.  I've seen the same
> thing in the past with simply snapshotting a logical volume or zfs
> zvol and copying it off somewhere. Page cache bloats, the system
> starts swapping. To avoid it, we wrote a small C program that calls
> FADV_DONTNEED on a file, and fork off a process to call it on the
> source file every X seconds in our backup scripts.
I do not call FADV_DONTNEED on the whole file, but only
on the block that has just been read.
>
> It's a little strange to me to have qemu-img do this, just like it
> would be strange if 'cp' did it, but I can see it as a very useful
> shortcut if it's an optional flag.  qemu-img to me is just an admin
> tool, and the admin should decide if they want their tool's reads
> cached.  Some additional things that come to mind:
>
> * If you are running qemu-img on a running VM's source file,
> FADV_DONTNEED may ruin the cache you wanted if the VM is not running
> cache=none.
You would normally not run it on the source directly. In my case
I run it on a snapshot of a logical volume, but I see your point.

So can you confirm my observations, and would you be happy if
this behaviour could be toggled with a cmdline switch?
>
> * O_DIRECT I think will cause unexpected problems, for example the
> zfsonlinux guys (and tmpfs as mentioned) don't yet support it. If it
> is used, there has to be a fallback or a way to turn it off.
I don't use O_DIRECT. It's an option for the destination file only at the
moment. You can set it with -t none as a qemu-img argument.

Peter
>
> On Wed, Mar 5, 2014 at 7:44 AM, Peter Lieven <pl@kamp.de> wrote:
>> Am 04.03.2014 10:24, schrieb Stefan Hajnoczi:
>>> On Mon, Mar 03, 2014 at 01:20:21PM +0100, Peter Lieven wrote:
>>>> On 03.03.2014 13:03, Stefan Hajnoczi wrote:
>>>>> So what is the actual performance problem you are trying to solve and
>>>>> what benchmark output are you getting when you compare with
>>>>> FADV_DONTNEED against without FADV_DONTNEED?
>>>> I found the performance to be identical. For the problem see below please.
>>>>> I think there's a danger that the discussion will go around in circles.
>>>>> Please post the performance results that kicked off this whole effort
>>>>> and let's focus on the data.  That way it's much easier to evaluate what
>>>>> changes to QEMU are a win and which are not necessary.
>>>> I found that under memory pressure situations the increasing buffers
>>>> leads to vserver memory being swapped out. This caused trouble
>>>> especially in overcommit scenarios (where all memory is backed by
>>>> swap).
>>> I think the general idea is qemu-img should not impact running guests,
>>> even on a heavily loaded machine.  But again, this needs to be discussed
>>> using concrete benchmarks with configurations and results posted to the
>>> list.
>> Sure, this is why I started to look at this. I found that under high memory
>> pressure a backup (local storage -> NFS) causes swapping. I started to
>> use libnfs as destination to avoid influence of the kernel NFS client. But
>> I saw that the buffers still increase while a backup is running. With the
>> proposed patch I sent recently
>>
>> [PATCH] block: introduce BDRV_O_SEQUENTIAL
>>
>> I don't see this behaviour while I have not yet observed a performance penalty.
>>
>> Peter
>>> Stefan
>>
Marcus March 5, 2014, 5:38 p.m. UTC | #11
On Wed, Mar 5, 2014 at 8:53 AM, Peter Lieven <pl@kamp.de> wrote:
> Am 05.03.2014 16:20, schrieb Marcus:
>> I think this is a more generic sysadmin problem.  I've seen the same
>> thing in the past with simply snapshotting a logical volume or zfs
>> zvol and copying it off somewhere. Page cache bloats, the system
>> starts swapping. To avoid it, we wrote a small C program that calls
>> FADV_DONTNEED on a file, and fork off a process to call it on the
>> source file every X seconds in our backup scripts.
> I do not call FADV_DONTNEED on the whole file, but only
> on the block that has just been read.

Yes, I suppose that's one of the advantages of having it integrated
into the reader.

>>
>> It's a little strange to me to have qemu-img do this, just like it
>> would be strange if 'cp' did it, but I can see it as a very useful
>> shortcut if it's an optional flag.  qemu-img to me is just an admin
>> tool, and the admin should decide if they want their tool's reads
>> cached.  Some additional things that come to mind:
>>
>> * If you are running qemu-img on a running VM's source file,
>> FADV_DONTNEED may ruin the cache you wanted if the VM is not running
>> cache=none.
> You would normally not run it on the source directly. In my case
> I run it on a snapshot of an logical volume, but I see your point.

Totally depends on the situation, just thought it was worth consideration.

>
> So you can confirm my oberservations and would be happy if
> this behaviour could be toggled with a cmdline switch?

Yes, I've seen the same behavior you mention with just 'cp'. That was
with a version of the CentOS 6.2 kernel, at least, before we added
FADV_DONTNEED to the backup scripts.

>>
>> * O_DIRECT I think will cause unexpected problems, for example the
>> zfsonlinux guys (and tmpfs as mentioned) don't yet support it. If it
>> is used, there has to be a fallback or a way to turn it off.
> I don't use O_DIRECT. Its an option for the destination file only at the
> moment. You can set it with -t none as qemu-img argument.

I just mentioned it because setting it on the source was suggested
originally and subsequently discussed.

>
> Peter
>>
>> On Wed, Mar 5, 2014 at 7:44 AM, Peter Lieven <pl@kamp.de> wrote:
>>> Am 04.03.2014 10:24, schrieb Stefan Hajnoczi:
>>>> On Mon, Mar 03, 2014 at 01:20:21PM +0100, Peter Lieven wrote:
>>>>> On 03.03.2014 13:03, Stefan Hajnoczi wrote:
>>>>>> So what is the actual performance problem you are trying to solve and
>>>>>> what benchmark output are you getting when you compare with
>>>>>> FADV_DONTNEED against without FADV_DONTNEED?
>>>>> I found the performance to be identical. For the problem see below please.
>>>>>> I think there's a danger that the discussion will go around in circles.
>>>>>> Please post the performance results that kicked off this whole effort
>>>>>> and let's focus on the data.  That way it's much easier to evaluate what
>>>>>> changes to QEMU are a win and which are not necessary.
>>>>> I found that under memory pressure situations the increasing buffers
>>>>> leads to vserver memory being swapped out. This caused trouble
>>>>> especially in overcommit scenarios (where all memory is backed by
>>>>> swap).
>>>> I think the general idea is qemu-img should not impact running guests,
>>>> even on a heavily loaded machine.  But again, this needs to be discussed
>>>> using concrete benchmarks with configurations and results posted to the
>>>> list.
>>> Sure, this is why I started to look at this. I found that under high memory
>>> pressure a backup (local storage -> NFS) causes swapping. I started to
>>> use libnfs as destination to avoid influence of the kernel NFS client. But
>>> I saw that the buffers still increase while a backup is running. With the
>>> proposed patch I sent recently
>>>
>>> [PATCH] block: introduce BDRV_O_SEQUENTIAL
>>>
>>> I don't see this behaviour while I have not yet observed a performance penalty.
>>>
>>> Peter
>>>> Stefan
>>>
>
Peter Lieven March 5, 2014, 6:09 p.m. UTC | #12
Am 05.03.2014 18:38, schrieb Marcus:
> On Wed, Mar 5, 2014 at 8:53 AM, Peter Lieven <pl@kamp.de> wrote:
>> Am 05.03.2014 16:20, schrieb Marcus:
>>> I think this is a more generic sysadmin problem.  I've seen the same
>>> thing in the past with simply snapshotting a logical volume or zfs
>>> zvol and copying it off somewhere. Page cache bloats, the system
>>> starts swapping. To avoid it, we wrote a small C program that calls
>>> FADV_DONTNEED on a file, and fork off a process to call it on the
>>> source file every X seconds in our backup scripts.
>> I do not call FADV_DONTNEED on the whole file, but only
>> on the block that has just been read.
> Yes, I suppose that's one of the advantages of having it integrated
> into the reader.
>
>>> It's a little strange to me to have qemu-img do this, just like it
>>> would be strange if 'cp' did it, but I can see it as a very useful
>>> shortcut if it's an optional flag.  qemu-img to me is just an admin
>>> tool, and the admin should decide if they want their tool's reads
>>> cached.  Some additional things that come to mind:
>>>
>>> * If you are running qemu-img on a running VM's source file,
>>> FADV_DONTNEED may ruin the cache you wanted if the VM is not running
>>> cache=none.
>> You would normally not run it on the source directly. In my case
>> I run it on a snapshot of an logical volume, but I see your point.
> Totally depends on the situation, just thought it was worth consideration.
Yes, and it was a good remark.
>
>> So you can confirm my oberservations and would be happy if
>> this behaviour could be toggled with a cmdline switch?
> Yes, I've seen the same behavior you mention just with 'cp'. It was
> with a version of the CentOS 6.2 kernel, at least, before we added
> FADV_DONTNEED into the backup scripts.
OK, Stefan, would you be happy with it?
>
>>> * O_DIRECT I think will cause unexpected problems, for example the
>>> zfsonlinux guys (and tmpfs as mentioned) don't yet support it. If it
>>> is used, there has to be a fallback or a way to turn it off.
>> I don't use O_DIRECT. Its an option for the destination file only at the
>> moment. You can set it with -t none as qemu-img argument.
> I just mentioned it because setting it on the source was suggested
> originally and subsequently discussed.
Yes, but it would break readahead and would not work on tmpfs
and many other things we don't yet know about.

Peter
Stefan Hajnoczi March 6, 2014, 10:29 a.m. UTC | #13
On Wed, Mar 05, 2014 at 03:44:17PM +0100, Peter Lieven wrote:
> [PATCH] block: introduce BDRV_O_SEQUENTIAL

It hasn't shown up on the mailing list yet.

I received it privately because I am CCed but it needs to be on the list
in order to get review and be merged.

Anthony: how can Peter troubleshoot a patch email that isn't appearing
on the mailing list?

Stefan
Stefan Hajnoczi March 6, 2014, 10:41 a.m. UTC | #14
On Wed, Mar 05, 2014 at 07:09:13PM +0100, Peter Lieven wrote:
> Am 05.03.2014 18:38, schrieb Marcus:
> > On Wed, Mar 5, 2014 at 8:53 AM, Peter Lieven <pl@kamp.de> wrote:
> >> So you can confirm my oberservations and would be happy if
> >> this behaviour could be toggled with a cmdline switch?
> > Yes, I've seen the same behavior you mention just with 'cp'. It was
> > with a version of the CentOS 6.2 kernel, at least, before we added
> > FADV_DONTNEED into the backup scripts.
> Ok, Stefan would you be happy with it?

I'm happy with it.

But I'm a little frustrated that I've asked multiple times for a
concrete benchmark and it hasn't been provided.

In order to review patches that improve performance, we need a benchmark
that can be reproduced.  And the commit description must include results
that quantify the improvement.

Otherwise it's too easy to merge patches that sound reasonable but don't
have the effect that everyone thought they had.

I see two possible benchmarks:

1. Create memory pressure so that a benchmark performs worse unless we
   slim down our page cache usage.

2. Compare mm statistics before and after qemu-img to prove it no longer
   bloats the page cache.

#2 doesn't prove that there is a practical performance issue, it just
optimizes the mm statistics.  But at least it quantifies that and serves
as a test case.

Either would be okay.
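
For #2, even a trivial helper that dumps the page-cache counters from
/proc/meminfo before and after the qemu-img run would be enough to quantify
it, e.g.:

/* print the page cache related counters from /proc/meminfo;
 * run it before and after qemu-img convert and compare */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f) {
        perror("/proc/meminfo");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "Buffers:", 8) ||
            !strncmp(line, "Cached:", 7) ||
            !strncmp(line, "Dirty:", 6)) {
            fputs(line, stdout);
        }
    }
    fclose(f);
    return 0;
}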

Stefan
Paolo Bonzini March 6, 2014, 11:29 a.m. UTC | #15
Il 06/03/2014 11:29, Stefan Hajnoczi ha scritto:
> On Wed, Mar 05, 2014 at 03:44:17PM +0100, Peter Lieven wrote:
>> [PATCH] block: introduce BDRV_O_SEQUENTIAL
>
> It hasn't shown up on the mailing list yet.
>
> I received it privately because I am CCed but it needs to be on the list
> in order to get review and be merged.
>
> Anthony: how can Peter troubleshoot a patch email that isn't appearing
> on the mailing list?

Maybe it's been caught by the savannah spam filter.

Paolo
Liguori, Anthony March 6, 2014, 2:19 p.m. UTC | #16
We can check the moderation queue although it's usually empty.  Savannah has a pretty aggressive spam filter and mail delivery isn't always reliable.

Regards,

Anthony Liguori
Peter Lieven March 6, 2014, 6:07 p.m. UTC | #17
Hi Anthony,

Am 06.03.2014 15:19, schrieb Liguori, Anthony:
> We can check the moderation queue although it's usually empty.  Savannah has a pretty aggressive spam filter and mail delivery isn't always reliable.
It would be good to know why these mails got lost. It seems to have happened a few times in the past as well.

Thanks,
Peter
>
> Regards,
>
> Anthony Liguori
>
> ________________________________________
> From: Paolo Bonzini [paolo.bonzini@gmail.com] on behalf of Paolo Bonzini [pbonzini@redhat.com]
> Sent: Thursday, March 06, 2014 3:29 AM
> To: Stefan Hajnoczi; Peter Lieven
> Cc: Kevin Wolf; qemu-devel@nongnu.org; Stefan Hajnoczi; Liguori, Anthony
> Subject: Re: qemu-img convert cache mode for source
>
> Il 06/03/2014 11:29, Stefan Hajnoczi ha scritto:
>> On Wed, Mar 05, 2014 at 03:44:17PM +0100, Peter Lieven wrote:
>>> [PATCH] block: introduce BDRV_O_SEQUENTIAL
>> It hasn't shown up on the mailing list yet.
>>
>> I received it privately because I am CCed but it needs to be on the list
>> in order to get review and be merged.
>>
>> Anthony: how can Peter troubleshoot a patch email that isn't appearing
>> on the mailing list?
> Maybe it's been caught by the savannah spam filter.
>
> Paolo
>
Peter Lieven March 6, 2014, 6:58 p.m. UTC | #18
Am 06.03.2014 11:41, schrieb Stefan Hajnoczi:
> On Wed, Mar 05, 2014 at 07:09:13PM +0100, Peter Lieven wrote:
>> Am 05.03.2014 18:38, schrieb Marcus:
>>> On Wed, Mar 5, 2014 at 8:53 AM, Peter Lieven <pl@kamp.de> wrote:
>>>> So you can confirm my oberservations and would be happy if
>>>> this behaviour could be toggled with a cmdline switch?
>>> Yes, I've seen the same behavior you mention just with 'cp'. It was
>>> with a version of the CentOS 6.2 kernel, at least, before we added
>>> FADV_DONTNEED into the backup scripts.
>> Ok, Stefan would you be happy with it?
> I'm happy with it.
>
> But I'm a little frustrated that I've asked multiple times for a
> concrete benchmark and it hasn't been provided.
I did not want to frustrate you with that. I thought my abstract
description of my observations in the commit message would be sufficient,
especially after Marcus confirmed the same observations.

I will try to set up a test case.

Peter
>
> In order to review patches that improve performance, we need a benchmark
> that can be reproduced.  And the commit description must include results
> that quantify the improvement.
>
> Otherwise it's too easy to merge patches that sound reasonable but don't
> have the effect that everyone thought they had.
>
> I see two possible benchmarks:
>
> 1. Create memory pressure so that a benchmark performs worse unless we
>    slim down our page cache usage.
>
> 2. Compare mm statistics before and after qemu-img to prove no longer
>    bloats the page cache.
>
> #2 doesn't prove that there is a practical performance issue, it just
> optimizes the mm statistics.  But at least it quantifies that and serves
> as a test case.
>
> Either would be okay.
>
> Stefan
Peter Lieven March 7, 2014, 8:03 a.m. UTC | #19
On 06.03.2014 15:19, Liguori, Anthony wrote:
> We can check the moderation queue although it's usually empty.  Savannah has a pretty aggressive spam filter and mail delivery isn't always reliable.
I think I got it. I accidentally had a typo (missing space) in the git send-email command, which left the envelope-sender empty:

git send-email --envelope-sender=pl@kamp.de--cc=pbonzini@redhat.com --cc=kwolf@redhat.com --cc=stefanha@redhat.com 0001-block-introduce-BDRV_O_SEQUENTIAL.patch

Peter

>
> Regards,
>
> Anthony Liguori
>
> ________________________________________
> From: Paolo Bonzini [paolo.bonzini@gmail.com] on behalf of Paolo Bonzini [pbonzini@redhat.com]
> Sent: Thursday, March 06, 2014 3:29 AM
> To: Stefan Hajnoczi; Peter Lieven
> Cc: Kevin Wolf; qemu-devel@nongnu.org; Stefan Hajnoczi; Liguori, Anthony
> Subject: Re: qemu-img convert cache mode for source
>
> Il 06/03/2014 11:29, Stefan Hajnoczi ha scritto:
>> On Wed, Mar 05, 2014 at 03:44:17PM +0100, Peter Lieven wrote:
>>> [PATCH] block: introduce BDRV_O_SEQUENTIAL
>> It hasn't shown up on the mailing list yet.
>>
>> I received it privately because I am CCed but it needs to be on the list
>> in order to get review and be merged.
>>
>> Anthony: how can Peter troubleshoot a patch email that isn't appearing
>> on the mailing list?
> Maybe it's been caught by the savannah spam filter.
>
> Paolo
>

Patch

diff --git a/block.c b/block.c
index 2fd5482..2445433 100644
--- a/block.c
+++ b/block.c
@@ -2626,6 +2626,14 @@  static int bdrv_prwv_co(BlockDriverState *bs, int64_t offset,
              qemu_aio_wait();
          }
      }
+
+#ifdef POSIX_FADV_DONTNEED
+    if (!rwco.ret && bs->open_flags & BDRV_O_SEQUENTIAL &&
+        bs->drv->bdrv_fadvise && !is_write) {
+        bs->drv->bdrv_fadvise(bs, offset, qiov->size, POSIX_FADV_DONTNEED);
+    }
+#endif
+
      return rwco.ret;
  }

diff --git a/block/raw-posix.c b/block/raw-posix.c
index 161ea14..d8d78d8 100644
--- a/block/raw-posix.c
+++ b/block/raw-posix.c
@@ -1397,6 +1397,12 @@  static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
      return 0;
  }

+static int raw_fadvise(BlockDriverState *bs, off_t offset, off_t len, int advise)
+{
+    BDRVRawState *s = bs->opaque;
+    return posix_fadvise(s->fd, offset, len, advise);
+}
+
  static QEMUOptionParameter raw_create_options[] = {
      {
          .name = BLOCK_OPT_SIZE,
@@ -1433,6 +1439,7 @@  static BlockDriver bdrv_file = {
      .bdrv_get_info = raw_get_info,
      .bdrv_get_allocated_file_size
                          = raw_get_allocated_file_size,
+    .bdrv_fadvise = raw_fadvise,

      .create_options = raw_create_options,
  };
@@ -1811,6 +1818,7 @@  static BlockDriver bdrv_host_device = {
      .bdrv_get_info = raw_get_info,
      .bdrv_get_allocated_file_size
                          = raw_get_allocated_file_size,
+    .bdrv_fadvise = raw_fadvise,

      /* generic scsi device */
  #ifdef __linux__
diff --git a/block/raw_bsd.c b/block/raw_bsd.c
index 01ea692..f09bc70 100644
--- a/block/raw_bsd.c
+++ b/block/raw_bsd.c
@@ -171,6 +171,15 @@  static int raw_probe(const uint8_t *buf, int buf_size, const char *filename)
      return 1;
  }

+static int raw_fadvise(BlockDriverState *bs, off_t offset, off_t len, int advise)
+{
+    if (bs->file->drv->bdrv_fadvise) {
+        return bs->file->drv->bdrv_fadvise(bs->file, offset, len, advise);
+    }
+    return 0;
+}
+
+
  static BlockDriver bdrv_raw = {
      .format_name          = "raw",
      .bdrv_probe           = &raw_probe,
@@ -195,7 +204,8 @@  static BlockDriver bdrv_raw = {
      .bdrv_ioctl           = &raw_ioctl,
      .bdrv_aio_ioctl       = &raw_aio_ioctl,
      .create_options       = &raw_create_options[0],
-    .bdrv_has_zero_init   = &raw_has_zero_init
+    .bdrv_has_zero_init   = &raw_has_zero_init,
+    .bdrv_fadvise         = &raw_fadvise,
  };

  static void bdrv_raw_init(void)
diff --git a/include/block/block.h b/include/block/block.h
index 780f48b..a4dcc3c 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -105,6 +105,9 @@  typedef enum {
  #define BDRV_O_PROTOCOL    0x8000  /* if no block driver is explicitly given:
                                        select an appropriate protocol driver,
                                        ignoring the format layer */
+#define BDRV_O_SEQUENTIAL 0x10000  /* open device for sequential read/write */
+
+

  #define BDRV_O_CACHE_MASK  (BDRV_O_NOCACHE | BDRV_O_CACHE_WB | BDRV_O_NO_FLUSH)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 0bcf1c9..7efad55 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -246,6 +246,8 @@  struct BlockDriver {
       * zeros, 0 otherwise.
       */
      int (*bdrv_has_zero_init)(BlockDriverState *bs);
+
+    int (*bdrv_fadvise)(BlockDriverState *bs, off_t offset, off_t len, int advise);

      QLIST_ENTRY(BlockDriver) list;
  };
diff --git a/qemu-img.c b/qemu-img.c
index 78fc868..2b900d0 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -1298,7 +1298,8 @@  static int img_convert(int argc, char **argv)

      total_sectors = 0;
      for (bs_i = 0; bs_i < bs_n; bs_i++) {
-        bs[bs_i] = bdrv_new_open(argv[optind + bs_i], fmt, BDRV_O_FLAGS, true,
+        bs[bs_i] = bdrv_new_open(argv[optind + bs_i], fmt,
+                                 BDRV_O_FLAGS | BDRV_O_SEQUENTIAL, true,
                                   quiet);
          if (!bs[bs_i]) {
              error_report("Could not open '%s'", argv[optind + bs_i]);