
[RFC] qcow2: add a readahead cache for qcow2_decompress_cluster

Message ID 1388074792-29946-1-git-send-email-pl@kamp.de
State New

Commit Message

Peter Lieven Dec. 26, 2013, 4:19 p.m. UTC
While evaluating compressed qcow2 images as a good basis for
virtual machine templates, I found that there are a lot
of partly redundant (compressed clusters share physical
sectors) and relatively short reads.

This doesn't hurt if the image resides on a local
filesystem where we can benefit from the local page cache,
but it adds a lot of penalty when accessing remote images
on NFS or similar exports.

This patch effectively implements a readahead of 2 * cluster_size,
which is 2 * 64kB by default, resulting in 128kB of readahead. This
matches the common default readahead setting on Linux, for instance.

For example this leads to the following times when converting
a compressed qcow2 image to a local tmpfs partition.

Old:
time ./qemu-img convert nfs://10.0.0.1/export/VC-Ubuntu-LTS-12.04.2-64bit.qcow2 /tmp/test.raw
real	0m24.681s
user	0m8.597s
sys	0m4.084s

New:
time ./qemu-img convert nfs://10.0.0.1/export/VC-Ubuntu-LTS-12.04.2-64bit.qcow2 /tmp/test.raw
real	0m16.121s
user	0m7.932s
sys	0m2.244s

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 block/qcow2-cluster.c |   27 +++++++++++++++++++++++++--
 block/qcow2.h         |    1 +
 2 files changed, 26 insertions(+), 2 deletions(-)
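
For clarity, the "readahead of 2 * cluster_size" above corresponds to the
max_read computation in the patch below: read up to two clusters' worth of
sectors, clamped at the end of the image file. The following is only a
minimal standalone sketch of that sizing; remaining_sectors and
cluster_sectors are illustrative stand-ins for
bs->file->total_sectors - (coffset >> 9) and s->cluster_sectors, and the
code is not part of the patch.

#include <inttypes.h>
#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Number of 512-byte sectors to read ahead: at most two clusters,
 * clamped so the read does not run past the end of the image file. */
static int64_t readahead_sectors(int64_t remaining_sectors, int cluster_sectors)
{
    return MIN(remaining_sectors, 2 * (int64_t)cluster_sectors);
}

int main(void)
{
    /* 64kB clusters = 128 sectors, so 2 * 64kB = 256 sectors = 128kB */
    printf("%" PRId64 "\n", readahead_sectors(1000000, 128)); /* prints 256 */
    /* near the end of the file the readahead shrinks to what is left */
    printf("%" PRId64 "\n", readahead_sectors(100, 128));     /* prints 100 */
    return 0;
}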

Comments

Fam Zheng Dec. 27, 2013, 3:23 a.m. UTC | #1
On 2013-12-27 00:19, Peter Lieven wrote:
> While evaluating compressed qcow2 images as a good basis for
> virtual machine templates, I found that there are a lot
> of partly redundant (compressed clusters share physical
> sectors) and relatively short reads.
>
> This doesn't hurt if the image resides on a local
> filesystem where we can benefit from the local page cache,
> but it adds a lot of penalty when accessing remote images
> on NFS or similar exports.
>
> This patch effectively implements a readahead of 2 * cluster_size,
> which is 2 * 64kB by default, resulting in 128kB of readahead. This
> matches the common default readahead setting on Linux, for instance.
>
> For example this leads to the following times when converting
> a compressed qcow2 image to a local tmpfs partition.
>
> Old:
> time ./qemu-img convert nfs://10.0.0.1/export/VC-Ubuntu-LTS-12.04.2-64bit.qcow2 /tmp/test.raw
> real	0m24.681s
> user	0m8.597s
> sys	0m4.084s
>
> New:
> time ./qemu-img convert nfs://10.0.0.1/export/VC-Ubuntu-LTS-12.04.2-64bit.qcow2 /tmp/test.raw
> real	0m16.121s
> user	0m7.932s
> sys	0m2.244s
>
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>   block/qcow2-cluster.c |   27 +++++++++++++++++++++++++--
>   block/qcow2.h         |    1 +
>   2 files changed, 26 insertions(+), 2 deletions(-)

I like this idea, but here's a question. Actually, this penalty is
common to all protocol drivers: curl, gluster, whatever. Readahead is
not only good for compression processing, but also quite helpful for
boot: BIOS and GRUB may send sequential 1-sector I/Os synchronously and
thus suffer from the high latency of network communication. So I think
if we want to do this, we will want to share it with other format and
protocol combinations.

Fam
Peter Lieven Dec. 28, 2013, 3:35 p.m. UTC | #2
On 27.12.2013 04:23, Fam Zheng wrote:
> On 2013-12-27 00:19, Peter Lieven wrote:
>> While evaluating compressed qcow2 images as a good basis for
>> virtual machine templates, I found that there are a lot
>> of partly redundant (compressed clusters share physical
>> sectors) and relatively short reads.
>>
>> This doesn't hurt if the image resides on a local
>> filesystem where we can benefit from the local page cache,
>> but it adds a lot of penalty when accessing remote images
>> on NFS or similar exports.
>>
>> This patch effectively implements a readahead of 2 * cluster_size,
>> which is 2 * 64kB by default, resulting in 128kB of readahead. This
>> matches the common default readahead setting on Linux, for instance.
>>
>> For example this leads to the following times when converting
>> a compressed qcow2 image to a local tmpfs partition.
>>
>> Old:
>> time ./qemu-img convert nfs://10.0.0.1/export/VC-Ubuntu-LTS-12.04.2-64bit.qcow2 /tmp/test.raw
>> real    0m24.681s
>> user    0m8.597s
>> sys    0m4.084s
>>
>> New:
>> time ./qemu-img convert nfs://10.0.0.1/export/VC-Ubuntu-LTS-12.04.2-64bit.qcow2 /tmp/test.raw
>> real    0m16.121s
>> user    0m7.932s
>> sys    0m2.244s
>>
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>>   block/qcow2-cluster.c |   27 +++++++++++++++++++++++++--
>>   block/qcow2.h         |    1 +
>>   2 files changed, 26 insertions(+), 2 deletions(-)
>
> I like this idea, but here's a question. Actually, this penalty is common to all protocol drivers: curl, gluster, whatever. Readahead is not only good for compression processing, but also quite helpful for boot: BIOS and GRUB may send sequential 1-sector I/Os synchronously and thus suffer from the high latency of network communication. So I think if we want to do this, we will want to share it with other format and protocol combinations.
I had the same idea in mind. Not only high latency, but also high I/O load on the storage, as reading sectors one by one produces high IOPS.
But we have to be very careful:
- It's likely that the OS already does readahead, so we should not add that complexity to qemu in this case.
- We definitely destroy zero-copy functionality.

My idea would be to only do a readahead if we observe a read smaller than n bytes and then round the request up to that size. Maybe
we should only apply this logic to 1-sector reads and then read e.g. 4K. In any case this has to be an opt-in feature.

If I have some time I will collect a histogram of transfer size versus timing while booting popular OSs.

Peter
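
To make the opt-in idea above concrete, here is a rough, hypothetical sketch
of rounding small reads up to 4K before they hit the protocol driver. None of
these names (raw_read_fn, read_rounded_up, the constants) exist in QEMU; this
only illustrates the rounding and shows where zero-copy is lost, it is not a
BlockDriver integration.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SMALL_READ_THRESHOLD  512   /* only expand reads of <= 1 sector   */
#define READAHEAD_MIN_BYTES   4096  /* read at least 4K from the protocol */

/* hypothetical callback that performs the actual (network) read */
typedef int (*raw_read_fn)(void *opaque, uint64_t offset, void *buf, size_t len);

/* Read 'len' bytes at 'offset'; if the request is tiny and the feature is
 * enabled, fetch READAHEAD_MIN_BYTES instead and copy out the part the
 * caller asked for.  Assumes offset + len <= file_size. */
static int read_rounded_up(raw_read_fn raw_read, void *opaque,
                           uint64_t file_size, bool readahead_enabled,
                           uint64_t offset, void *buf, size_t len)
{
    uint8_t tmp[READAHEAD_MIN_BYTES];
    size_t expanded;
    int ret;

    if (!readahead_enabled || len > SMALL_READ_THRESHOLD) {
        return raw_read(opaque, offset, buf, len);  /* unchanged fast path */
    }

    expanded = READAHEAD_MIN_BYTES;
    if (offset + expanded > file_size) {
        expanded = file_size - offset;              /* clamp at EOF */
    }
    if (expanded <= len) {
        return raw_read(opaque, offset, buf, len);
    }

    ret = raw_read(opaque, offset, tmp, expanded);  /* read ahead into tmp */
    if (ret < 0) {
        return ret;
    }
    /* the copy below is exactly the zero-copy loss mentioned above; the
     * remaining expanded - len bytes would go into a small cache */
    memcpy(buf, tmp, len);
    return 0;
}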
Peter Lieven Dec. 28, 2013, 3:38 p.m. UTC | #3
On 28.12.2013 16:35, Peter Lieven wrote:
> On 27.12.2013 04:23, Fam Zheng wrote:
>> On 2013-12-27 00:19, Peter Lieven wrote:
>>> While evaluating compressed qcow2 images as a good basis for
>>> virtual machine templates, I found that there are a lot
>>> of partly redundant (compressed clusters share physical
>>> sectors) and relatively short reads.
>>>
>>> This doesn't hurt if the image resides on a local
>>> filesystem where we can benefit from the local page cache,
>>> but it adds a lot of penalty when accessing remote images
>>> on NFS or similar exports.
>>>
>>> This patch effectively implements a readahead of 2 * cluster_size,
>>> which is 2 * 64kB by default, resulting in 128kB of readahead. This
>>> matches the common default readahead setting on Linux, for instance.
>>>
>>> For example this leads to the following times when converting
>>> a compressed qcow2 image to a local tmpfs partition.
>>>
>>> Old:
>>> time ./qemu-img convert nfs://10.0.0.1/export/VC-Ubuntu-LTS-12.04.2-64bit.qcow2 /tmp/test.raw
>>> real    0m24.681s
>>> user    0m8.597s
>>> sys    0m4.084s
>>>
>>> New:
>>> time ./qemu-img convert nfs://10.0.0.1/export/VC-Ubuntu-LTS-12.04.2-64bit.qcow2 /tmp/test.raw
>>> real    0m16.121s
>>> user    0m7.932s
>>> sys    0m2.244s
>>>
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>> ---
>>>   block/qcow2-cluster.c |   27 +++++++++++++++++++++++++--
>>>   block/qcow2.h         |    1 +
>>>   2 files changed, 26 insertions(+), 2 deletions(-)
>> I like this idea, but here's a question. Actually, this penalty is common to all protocol drivers: curl, gluster, whatever. Readahead is not only good for compression processing, but also quite helpful for boot: BIOS and GRUB may send sequential 1-sector I/Os synchronously and thus suffer from the high latency of network communication. So I think if we want to do this, we will want to share it with other format and protocol combinations.
> I had the same idea in mind. Not only high latency, but also high I/O load on the storage, as reading sectors one by one produces high IOPS.
> But we have to be very careful:
> - It's likely that the OS already does readahead, so we should not add that complexity to qemu in this case.
> - We definitely destroy zero-copy functionality.
>
> My idea would be to only do a readahead if we observe a read smaller than n bytes and then round the request up to that size. Maybe
> we should only apply this logic to 1-sector reads and then read e.g. 4K. In any case this has to be an opt-in feature.
>
> If I have some time I will collect a histogram of transfer size versus timing while booting popular OSs.
What I forgot here: in the QCOW2 compressed cluster case this was very low-hanging fruit, as the buffers etc. are already there.
In this case it's obvious that we benefit from readahead, but maybe qemu-img could enable it itself in this special case
if we really build it into the BlockDriver.

Peter

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 11f9c50..367f089 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1321,7 +1321,7 @@  static int decompress_buffer(uint8_t *out_buf, int out_buf_size,
 int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
 {
     BDRVQcowState *s = bs->opaque;
-    int ret, csize, nb_csectors, sector_offset;
+    int ret, csize, nb_csectors, sector_offset, max_read;
     uint64_t coffset;
 
     coffset = cluster_offset & s->cluster_offset_mask;
@@ -1329,9 +1329,32 @@  int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
         nb_csectors = ((cluster_offset >> s->csize_shift) & s->csize_mask) + 1;
         sector_offset = coffset & 511;
         csize = nb_csectors * 512 - sector_offset;
+        max_read = MIN((bs->file->total_sectors - (coffset >> 9)), 2 * s->cluster_sectors);
         BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
-        ret = bdrv_read(bs->file, coffset >> 9, s->cluster_data, nb_csectors);
+        if (s->cluster_cache_offset != -1 && coffset > s->cluster_cache_offset &&
+           (coffset >> 9) < (s->cluster_cache_offset >> 9) + s->cluster_data_sectors) {
+            int cached_sectors = s->cluster_data_sectors - ((coffset >> 9) -
+                                 (s->cluster_cache_offset >> 9));
+            memmove(s->cluster_data,
+                    s->cluster_data + (s->cluster_data_sectors - cached_sectors) * 512,
+                    cached_sectors * 512);
+            s->cluster_data_sectors = cached_sectors;
+            if (nb_csectors > cached_sectors) {
+                /* some sectors are missing; read them and fill up to max_read sectors */
+                ret = bdrv_read(bs->file, (coffset >> 9) + cached_sectors,
+                                s->cluster_data + cached_sectors * 512,
+                                max_read);
+                s->cluster_data_sectors = cached_sectors + max_read;
+            } else {
+                /* all relevant sectors are in the cache */
+                ret = 0;
+            }
+        } else {
+            ret = bdrv_read(bs->file, coffset >> 9, s->cluster_data, max_read);
+            s->cluster_data_sectors = max_read;
+        }
         if (ret < 0) {
+            s->cluster_data_sectors = 0;
             return ret;
         }
         if (decompress_buffer(s->cluster_cache, s->cluster_size,
diff --git a/block/qcow2.h b/block/qcow2.h
index 922e190..5edad26 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -185,6 +185,7 @@  typedef struct BDRVQcowState {
 
     uint8_t *cluster_cache;
     uint8_t *cluster_data;
+    int cluster_data_sectors;
     uint64_t cluster_cache_offset;
     QLIST_HEAD(QCowClusterAlloc, QCowL2Meta) cluster_allocs;
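
As a worked example of the overlap check the patch adds above, the following
standalone snippet runs through the same sector arithmetic with concrete
numbers. The variables mirror the patch, but this is only a model: the
s->cluster_cache_offset != -1 check and the actual bdrv_read calls are
omitted.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t cluster_cache_offset = 0x100000; /* previous read started here  */
    int      cluster_data_sectors = 256;      /* ...and covered 256 sectors  */
    uint64_t coffset              = 0x10a200; /* next compressed cluster     */
    int      nb_csectors          = 40;       /* sectors needed this time    */

    /* the buffer is reused if the new offset lies inside the cached window */
    if (coffset > cluster_cache_offset &&
        (coffset >> 9) < (cluster_cache_offset >> 9) + cluster_data_sectors) {
        int cached_sectors = cluster_data_sectors -
                             ((coffset >> 9) - (cluster_cache_offset >> 9));
        /* 2129 - 2048 = 81 sectors into the window, 256 - 81 = 175 cached */
        printf("cached_sectors = %d\n", cached_sectors);
        /* 40 <= 175, so all relevant sectors are already in the cache */
        printf("extra read needed: %s\n",
               nb_csectors > cached_sectors ? "yes" : "no");
    }
    return 0;
}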