[1.8,5/6] qemu-img: add option to align writes to cluster_sectors during convert

Message ID 1385387840-17307-6-git-send-email-pl@kamp.de
State New

Commit Message

Peter Lieven Nov. 25, 2013, 1:57 p.m. UTC
Signed-off-by: Peter Lieven <pl@kamp.de>
---
 qemu-img-cmds.hx |    4 ++--
 qemu-img.c       |   37 +++++++++++++++++++++++++------------
 qemu-img.texi    |    5 ++++-
 3 files changed, 31 insertions(+), 15 deletions(-)

Comments

Paolo Bonzini Nov. 25, 2013, 3:11 p.m. UTC | #1
Il 25/11/2013 14:57, Peter Lieven ha scritto:
> Signed-off-by: Peter Lieven <pl@kamp.de>

Ok, given this patch I think the cluster_size is the right one to use
here---and also the way you used the optimal unmap granularity makes
sense; you could also use MAX(optimal unmap granularity, optimal
transfer length granularity).

However, there is no need to write one cluster at a time.  What matters,
I think, is to align the *end* of the transfer, so that the next
transfer can start aligned.

> +            if (align && cluster_sectors > 0) {
> +                int64_t next_aligned_sector = (sector_num + cluster_sectors);

So this should be "+ n", not "+ cluster_sectors".

Perhaps it could be conditional on "n > cluster_sectors" (small requests
happen when you have a sparse region, and breaking them up doesn't help).
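
To make the difference concrete, here is a minimal C sketch that puts the computation from the patch next to the "+ n" variant. The function names are hypothetical and not part of qemu-img.c. The guard in the second function is an assumption of this sketch: without some guard (such as the size condition mentioned above), a short request that does not reach a cluster boundary would end up with a negative length.

```c
#include <assert.h>
#include <stdint.h>

/* As posted in the patch: cap the request at the first cluster boundary
 * after sector_num, so at most one cluster is written per request. */
int64_t trim_to_next_boundary(int64_t sector_num, int64_t n,
                              int64_t cluster_sectors)
{
    assert(sector_num >= 0 && n >= 0 && cluster_sectors > 0);
    int64_t next = sector_num + cluster_sectors;
    next -= next % cluster_sectors;
    return (sector_num + n > next) ? next - sector_num : n;
}

/* As suggested in the review: keep the request long and only round its
 * end down to a cluster boundary, so the next request starts aligned. */
int64_t trim_end_to_boundary(int64_t sector_num, int64_t n,
                             int64_t cluster_sectors)
{
    assert(sector_num >= 0 && n >= 0 && cluster_sectors > 0);
    int64_t end = sector_num + n;
    end -= end % cluster_sectors;
    /* Guard (assumption of this sketch): leave the request untouched if
     * it does not reach a boundary beyond its own start. */
    return (end > sector_num) ? end - sector_num : n;
}
```

For example, with 8-sector clusters, a 20-sector request starting at sector 5 is cut to 3 sectors by the first function and to 19 sectors by the second; in both cases the following request starts on a cluster boundary.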

Finally, I believe there is no need for a separate "-a" knob.

The patch looks fine to me with these small changes, though.

Also, a couple of ideas for separate patches.  Perhaps the default value
of "-S" could be cluster_size if specified?  This would avoid making raw
images too fragmented, and compounding filesystem-level fragmentation
with qcow2-level fragmentation.  And 4K is too small a default in my
opinion; it could be easily changed to 64K, though 4K was of course an
improvement compared to 512 before commit a22f123 (qemu-img: Require
larger zero areas for sparse handling, 2011-08-26).

Paolo

> +                next_aligned_sector -= next_aligned_sector % cluster_sectors;
> +                if (sector_num + n > next_aligned_sector) {
> +                    n = next_aligned_sector - sector_num;
> +                }
> +            }
> +
>              if (n > bs_offset + bs_sectors - sector_num) {
>                  n = bs_offset + bs_sectors - sector_num;
>              }
> diff --git a/qemu-img.texi b/qemu-img.texi
> index 87f9d0f..9b1720f 100644
> --- a/qemu-img.texi
> +++ b/qemu-img.texi
> @@ -179,11 +179,14 @@ Error on reading data
>  
>  @end table
>  
> -@item convert [-c] [-p] [-n] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_name}] [-S @var{sparse_size}] [-m @var{iobuf_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
> +@item convert [-c] [-p] [-n] [-a] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_name}] [-S @var{sparse_size}] [-m @var{iobuf_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
>  
>  Convert the disk image @var{filename} or a snapshot @var{snapshot_name} to disk image @var{output_filename}
>  using format @var{output_fmt}. It can be optionally compressed (@code{-c}
>  option) or use any format specific options like encryption (@code{-o} option).
> +If the @code{-a} option is specified write requests will be aligned
> +to the cluster size of the output image if possible. This is the default
> +for compressed images.
>  
>  Only the formats @code{qcow} and @code{qcow2} support compression. The
>  compression is read-only. It means that if a compressed sector is
>
Peter Lieven Nov. 25, 2013, 3:32 p.m. UTC | #2
On 25.11.2013 16:11, Paolo Bonzini wrote:
> Il 25/11/2013 14:57, Peter Lieven ha scritto:
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> Ok, given this patch I think the cluster_size is the right one to use
> here---and also the way you used the optimal unmap granularity makes
> sense; you could also use MAX(optimal unmap granularity, optimal
> transfer length granularity).
>
> However, there is no need to write one cluster at a time.  What matters,
> I think, is to align the *end* of the transfer, so that the next
> transfer can start aligned.
>
>> +            if (align && cluster_sectors > 0) {
>> +                int64_t next_aligned_sector = (sector_num + cluster_sectors);
> So this should be "+ n", not "+ cluster_sectors".
>
> Perhaps it could be conditional on "n > cluster_sectors" (small requests
> happen when you have a sparse region, and breaking them up doesn't help).
>
> Finally, I believe there is no need for a separate "-a" knob.
>
> The patch looks fine to me with these small changes, though.
>
> Also, a couple of ideas for separate patches.  Perhaps the default value
> of "-S" could be cluster_size if specified?  This would avoid making raw
> images too fragmented, and compounding filesystem-level fragmentation
> with qcow2-level fragmentation.  And 4K is too small a default in my
> opinion; it could be easily changed to 64K, though 4K was of course an
> improvement compared to 512 before commit a22f123 (qemu-img: Require
> larger zero areas for sparse handling, 2011-08-26).
I would vote for 64K or 256K; we have already been using the former for
some time. However, it turned out that (much) bigger values decrease
performance. Setting it to cluster_size can be dangerous. As described,
in my case it's 15MB, and I think for vhd it's 1MB. That can be a lot of
zeros that have to be written.

Peter
>
> Paolo
>
>> +                next_aligned_sector -= next_aligned_sector % cluster_sectors;
>> +                if (sector_num + n > next_aligned_sector) {
>> +                    n = next_aligned_sector - sector_num;
>> +                }
>> +            }
>> +
>>               if (n > bs_offset + bs_sectors - sector_num) {
>>                   n = bs_offset + bs_sectors - sector_num;
>>               }
>> diff --git a/qemu-img.texi b/qemu-img.texi
>> index 87f9d0f..9b1720f 100644
>> --- a/qemu-img.texi
>> +++ b/qemu-img.texi
>> @@ -179,11 +179,14 @@ Error on reading data
>>   
>>   @end table
>>   
>> -@item convert [-c] [-p] [-n] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_name}] [-S @var{sparse_size}] [-m @var{iobuf_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
>> +@item convert [-c] [-p] [-n] [-a] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_name}] [-S @var{sparse_size}] [-m @var{iobuf_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
>>   
>>   Convert the disk image @var{filename} or a snapshot @var{snapshot_name} to disk image @var{output_filename}
>>   using format @var{output_fmt}. It can be optionally compressed (@code{-c}
>>   option) or use any format specific options like encryption (@code{-o} option).
>> +If the @code{-a} option is specified write requests will be aligned
>> +to the cluster size of the output image if possible. This is the default
>> +for compressed images.
>>   
>>   Only the formats @code{qcow} and @code{qcow2} support compression. The
>>   compression is read-only. It means that if a compressed sector is
>>
Paolo Bonzini Nov. 25, 2013, 3:50 p.m. UTC | #3
Il 25/11/2013 16:32, Peter Lieven ha scritto:
>>
>> Also, a couple of ideas for separate patches.  Perhaps the default value
>> of "-S" could be cluster_size if specified?  This would avoid making raw
>> images too fragmented, and compounding filesystem-level fragmentation
>> with qcow2-level fragmentation.  And 4K is too small a default in my
>> opinion; it could be easily changed to 64K, though 4K was of course an
>> improvement compared to 512 before commit a22f123 (qemu-img: Require
>> larger zero areas for sparse handling, 2011-08-26).
> I would vote for 64K or 256K; we have already been using the former for
> some time. However, it turned out that (much) bigger values decrease
> performance. Setting it to cluster_size can be dangerous. As described,
> in my case it's 15MB, and I think for vhd it's 1MB. That can be a lot of
> zeros that have to be written.

What about max(4096, min(bdi->cluster_size, 1048576))?
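
The clamp proposed above can be sketched as a small C helper. This is only an illustration: the function name and the two constants are hypothetical, and cluster_size is assumed to be the byte value reported in bdi->cluster_size (0 if the driver reports none).

```c
#include <assert.h>
#include <stdint.h>

#define SPARSE_SIZE_MIN 4096     /* lower bound from the proposal, in bytes */
#define SPARSE_SIZE_MAX 1048576  /* upper bound from the proposal, in bytes */

/* max(4096, min(cluster_size, 1048576)) */
int64_t default_sparse_size(int64_t cluster_size)
{
    assert(cluster_size >= 0);
    int64_t v = cluster_size < SPARSE_SIZE_MAX ? cluster_size : SPARSE_SIZE_MAX;
    return v > SPARSE_SIZE_MIN ? v : SPARSE_SIZE_MIN;
}
```

With this clamp a 64K cluster size is used as is, a 15MB cluster size is capped at 1MB, and a missing or tiny cluster size falls back to 4K.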

Paolo
Peter Lieven Nov. 25, 2013, 3:55 p.m. UTC | #4
On 25.11.2013 16:50, Paolo Bonzini wrote:
> Il 25/11/2013 16:32, Peter Lieven ha scritto:
>>> Also, a couple of ideas for separate patches.  Perhaps the default value
>>> of "-S" could be cluster_size if specified?  This would avoid making raw
>>> images too fragmented, and compounding filesystem-level fragmentation
>>> with qcow2-level fragmentation.  And 4K is too small a default in my
>>> opinion; it could be easily changed to 64K, though 4K was of course an
>>> improvement compared to 512 before commit a22f123 (qemu-img: Require
>>> larger zero areas for sparse handling, 2011-08-26).
>> I would vote for 64K or 256K; we have already been using the former for
>> some time. However, it turned out that (much) bigger values decrease
>> performance. Setting it to cluster_size can be dangerous. As described,
>> in my case it's 15MB, and I think for vhd it's 1MB. That can be a lot of
>> zeros that have to be written.
> What about max(4096, min(bdi->cluster_size, 1048576))?
Changing sparse_size from 65536 to 1048576 gives about a 5% performance decrease:

lieven@lieven-pc:~/git/qemu$ time ./qemu-img convert -pp -m 15728640 -S 1048576 /tmp/VC-Ubuntu-LTS-12.04.2-64bit.qcow2 iscsi://172.21.200.45/iqn.2001-05.com.equallogic:0-8a0906-9d95c510a-344001d54795289f-2012-r2-1-7-0/0
40980480 of 40980480 sectors converted.

real    0m29.263s
user    0m7.544s
sys    0m1.636s
lieven@lieven-pc:~/git/qemu$ time ./qemu-img convert -pp -m 15728640 -S 4096 /tmp/VC-Ubuntu-LTS-12.04.2-64bit.qcow2 iscsi://172.21.200.45/iqn.2001-05.com.equallogic:0-8a0906-9d95c510a-344001d54795289f-2012-r2-1-7-0/0
40980480 of 40980480 sectors converted.

real    0m28.169s
user    0m7.792s
sys    0m1.516s
lieven@lieven-pc:~/git/qemu$ time ./qemu-img convert -pp -m 15728640 -S 65536 /tmp/VC-Ubuntu-LTS-12.04.2-64bit.qcow2 iscsi://172.21.200.45/iqn.2001-05.com.equallogic:0-8a0906-9d95c510a-344001d54795289f-2012-r2-1-7-0/0
40980480 of 40980480 sectors converted.

real    0m27.643s
user    0m7.644s
sys    0m1.520s

I wouldn't go over 64k until we fully understand what impact it has.
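
For reference, the relative differences between the three "real" times above can be worked out with a few lines of C. The helper name is hypothetical, and each figure comes from a single run, so the results are only indicative.

```c
#include <assert.h>

/* Percentage by which `t` is slower than `baseline` (both in seconds). */
double slowdown_percent(double baseline, double t)
{
    assert(baseline > 0.0);
    return (t - baseline) / baseline * 100.0;
}
```

Taking -S 65536 (27.643s) as the baseline, slowdown_percent(27.643, 29.263) comes to roughly 5.9 for -S 1048576, and slowdown_percent(27.643, 28.169) to roughly 1.9 for -S 4096.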

Peter
Paolo Bonzini Nov. 25, 2013, 4:02 p.m. UTC | #5
Il 25/11/2013 16:55, Peter Lieven ha scritto:
>>
> Changing sparse_size from 65536 to 1048576 gives about a 5% performance
> decrease:
> 
> lieven@lieven-pc:~/git/qemu$ time ./qemu-img convert -pp -m 15728640 -S
> 1048576 /tmp/VC-Ubuntu-LTS-12.04.2-64bit.qcow2
> iscsi://172.21.200.45/iqn.2001-05.com.equallogic:0-8a0906-9d95c510a-344001d54795289f-2012-r2-1-7-0/0
> 
> 40980480 of 40980480 sectors converted.
> 
> real    0m29.263s
> user    0m7.544s
> sys    0m1.636s
> lieven@lieven-pc:~/git/qemu$ time ./qemu-img convert -pp -m 15728640 -S
> 4096 /tmp/VC-Ubuntu-LTS-12.04.2-64bit.qcow2
> iscsi://172.21.200.45/iqn.2001-05.com.equallogic:0-8a0906-9d95c510a-344001d54795289f-2012-r2-1-7-0/0
> 
> 40980480 of 40980480 sectors converted.
> 
> real    0m28.169s
> user    0m7.792s
> sys    0m1.516s
> lieven@lieven-pc:~/git/qemu$ time ./qemu-img convert -pp -m 15728640 -S
> 65536 /tmp/VC-Ubuntu-LTS-12.04.2-64bit.qcow2
> iscsi://172.21.200.45/iqn.2001-05.com.equallogic:0-8a0906-9d95c510a-344001d54795289f-2012-r2-1-7-0/0
> 
> 40980480 of 40980480 sectors converted.
> 
> real    0m27.643s
> user    0m7.644s
> sys    0m1.520s
> 
> I wouldn't go over 64k until we fully understand what impact it has.

I agree.

Paolo
Peter Lieven Nov. 25, 2013, 4:11 p.m. UTC | #6
On 25.11.2013 16:11, Paolo Bonzini wrote:
> Il 25/11/2013 14:57, Peter Lieven ha scritto:
>> Signed-off-by: Peter Lieven <pl@kamp.de>
> Ok, given this patch I think the cluster_size is the right one to use
> here---and also the way you used the optimal unmap granularity makes
> sense; you could also use MAX(optimal unmap granularity, optimal
> transfer length granularity).
>
> However, there is no need to write one cluster at a time.  What matters,
> I think, is to align the *end* of the transfer, so that the next
> transfer can start aligned.
>
>> +            if (align && cluster_sectors > 0) {
>> +                int64_t next_aligned_sector = (sector_num + cluster_sectors);
> So this should be "+ n", not "+ cluster_sectors".
>
> Perhaps it could be conditional on "n > cluster_sectors" (small requests
> happen when you have a sparse region, and breaking them up doesn't help).
Would you also agree to n >= cluster_sectors? In my case, and especially
if n is bounded by iobuf_size, the condition n > cluster_sectors will be
hard to meet.

Peter

>
> Finally, I believe there is no need for a separate "-a" knob.
>
> The patch looks fine to me with these small changes, though.
>
> Also, a couple of ideas for separate patches.  Perhaps the default value
> of "-S" could be cluster_size if specified?  This would avoid making raw
> images too fragmented, and compounding filesystem-level fragmentation
> with qcow2-level fragmentation.  And 4K is too small a default in my
> opinion; it could be easily changed to 64K, though 4K was of course an
> improvement compared to 512 before commit a22f123 (qemu-img: Require
> larger zero areas for sparse handling, 2011-08-26).
>
> Paolo
>
>> +                next_aligned_sector -= next_aligned_sector % cluster_sectors;
>> +                if (sector_num + n > next_aligned_sector) {
>> +                    n = next_aligned_sector - sector_num;
>> +                }
>> +            }
>> +
>>               if (n > bs_offset + bs_sectors - sector_num) {
>>                   n = bs_offset + bs_sectors - sector_num;
>>               }
>> diff --git a/qemu-img.texi b/qemu-img.texi
>> index 87f9d0f..9b1720f 100644
>> --- a/qemu-img.texi
>> +++ b/qemu-img.texi
>> @@ -179,11 +179,14 @@ Error on reading data
>>   
>>   @end table
>>   
>> -@item convert [-c] [-p] [-n] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_name}] [-S @var{sparse_size}] [-m @var{iobuf_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
>> +@item convert [-c] [-p] [-n] [-a] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_name}] [-S @var{sparse_size}] [-m @var{iobuf_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
>>   
>>   Convert the disk image @var{filename} or a snapshot @var{snapshot_name} to disk image @var{output_filename}
>>   using format @var{output_fmt}. It can be optionally compressed (@code{-c}
>>   option) or use any format specific options like encryption (@code{-o} option).
>> +If the @code{-a} option is specified write requests will be aligned
>> +to the cluster size of the output image if possible. This is the default
>> +for compressed images.
>>   
>>   Only the formats @code{qcow} and @code{qcow2} support compression. The
>>   compression is read-only. It means that if a compressed sector is
>>
Paolo Bonzini Nov. 25, 2013, 4:34 p.m. UTC | #7
Il 25/11/2013 17:11, Peter Lieven ha scritto:
> On 25.11.2013 16:11, Paolo Bonzini wrote:
>> Il 25/11/2013 14:57, Peter Lieven ha scritto:
>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> Ok, given this patch I think the cluster_size is the right one to use
>> here---and also the way you used the optimal unmap granularity makes
>> sense; you could also use MAX(optimal unmap granularity, optimal
>> transfer length granularity).
>>
>> However, there is no need to write one cluster at a time.  What matters,
>> I think, is to align the *end* of the transfer, so that the next
>> transfer can start aligned.
>>
>>> +            if (align && cluster_sectors > 0) {
>>> +                int64_t next_aligned_sector = (sector_num +
>>> cluster_sectors);
>> So this should be "+ n", not "+ cluster_sectors".
>>
>> Perhaps it could be conditional on "n > cluster_sectors" (small requests
>> happen when you have a sparse region, and breaking them up doesn't help).
> 
> Would you also agree to n >= cluster_sectors? In my case, and especially
> if n is bounded by iobuf_size, the condition n > cluster_sectors will be
> hard to meet.

Of course.  In fact > alone is wrong ("n > cluster_sectors || n ==
iobuf_size" could be right, but perhaps it's a useless complication).
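
The effect of the two conditions can be shown with a short C sketch. The function and its `strict` parameter are hypothetical and only serve the comparison. The numbers in the note below assume the 15MB cluster size and the -m 15728640 buffer mentioned earlier in the thread, both of which correspond to 30720 sectors of 512 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round the end of a request down to a cluster boundary, but only for
 * requests that are large enough. With strict set, "large enough" means
 * n > cluster_sectors; otherwise it means n >= cluster_sectors. */
int64_t align_request(int64_t sector_num, int64_t n,
                      int64_t cluster_sectors, bool strict)
{
    assert(sector_num >= 0 && n >= 0 && cluster_sectors > 0);

    bool big_enough = strict ? (n > cluster_sectors) : (n >= cluster_sectors);
    if (!big_enough) {
        return n; /* small request: leave it alone */
    }

    int64_t end = sector_num + n;
    end -= end % cluster_sectors;
    /* end > sector_num holds here because n >= cluster_sectors. */
    return end - sector_num;
}
```

When n is capped at 30720 sectors by the I/O buffer and the cluster is also 30720 sectors, the strict test never fires: a request starting at sector 100 keeps its full length and ends unaligned at sector 30820. With ">=" the same request is trimmed to 30620 sectors and ends on the boundary at sector 30720.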

Paolo
Patch

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index e0b8ab4..266cdf3 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -34,9 +34,9 @@  STEXI
 ETEXI
 
 DEF("convert", img_convert,
-    "convert [-c] [-p] [-q] [-n] [-f fmt] [-t cache] [-O output_fmt] [-o options] [-s snapshot_name] [-S sparse_size] [-m iobuf_size] filename [filename2 [...]] output_filename")
+    "convert [-c] [-p] [-q] [-n] [-a] [-f fmt] [-t cache] [-O output_fmt] [-o options] [-s snapshot_name] [-S sparse_size] [-m iobuf_size] filename [filename2 [...]] output_filename")
 STEXI
-@item convert [-c] [-p] [-q] [-n] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_name}] [-S @var{sparse_size}] [-m @var{iobuf_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
+@item convert [-c] [-p] [-q] [-n] [-a] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_name}] [-S @var{sparse_size}] [-m @var{iobuf_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
 ETEXI
 
 DEF("info", img_info,
diff --git a/qemu-img.c b/qemu-img.c
index 0ce5d14..9fa8fd4 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -109,6 +109,7 @@  static void help(void)
            "  '--output' takes the format in which the output must be done (human or json)\n"
            "  '-n' skips the target volume creation (useful if the volume is created\n"
            "       prior to running qemu-img)\n"
+           "  '-a' align write requests to cluster size if possible\n"
            "\n"
            "Parameters to check subcommand:\n"
            "  '-r' tries to repair any inconsistencies that are found during the check.\n"
@@ -1125,8 +1126,7 @@  out3:
 
 static int img_convert(int argc, char **argv)
 {
-    int c, n, n1, bs_n, bs_i, compress, cluster_size,
-        cluster_sectors, skip_create;
+    int c, n, n1, bs_n, bs_i, compress, cluster_sectors, skip_create;
     int64_t ret = 0;
     int progress = 0, flags;
     const char *fmt, *out_fmt, *cache, *out_baseimg, *out_filename;
@@ -1144,7 +1144,7 @@  static int img_convert(int argc, char **argv)
     char *options = NULL;
     const char *snapshot_name = NULL;
     int min_sparse = 8; /* Need at least 4k of zeros for sparse detection */
-    bool quiet = false;
+    bool quiet = false, align = false;
     Error *local_err = NULL;
 
     fmt = NULL;
@@ -1154,7 +1154,7 @@  static int img_convert(int argc, char **argv)
     compress = 0;
     skip_create = 0;
     for(;;) {
-        c = getopt(argc, argv, "f:O:B:s:hce6o:pS:m:t:qn");
+        c = getopt(argc, argv, "f:O:B:s:hcae6o:pS:m:t:qn");
         if (c == -1) {
             break;
         }
@@ -1175,6 +1175,9 @@  static int img_convert(int argc, char **argv)
         case 'c':
             compress = 1;
             break;
+        case 'a':
+            align = true;
+            break;
         case 'e':
             error_report("option -e is deprecated, please use \'-o "
                   "encryption\' instead!");
@@ -1402,19 +1405,21 @@  static int img_convert(int argc, char **argv)
         }
     }
 
+    cluster_sectors = 0;
+    ret = bdrv_get_info(out_bs, &bdi);
+    if (ret < 0 && compress) {
+        error_report("could not get block driver info");
+        goto out;
+    } else {
+        cluster_sectors = bdi.cluster_size / BDRV_SECTOR_SIZE;
+    }
+
     if (compress) {
-        ret = bdrv_get_info(out_bs, &bdi);
-        if (ret < 0) {
-            error_report("could not get block driver info");
-            goto out;
-        }
-        cluster_size = bdi.cluster_size;
-        if (cluster_size <= 0 || cluster_size > bufsectors * BDRV_SECTOR_SIZE) {
+        if (cluster_sectors <= 0 || cluster_sectors > bufsectors) {
             error_report("invalid cluster size");
             ret = -1;
             goto out;
         }
-        cluster_sectors = cluster_size >> 9;
         sector_num = 0;
 
         nb_sectors = total_sectors;
@@ -1552,6 +1557,14 @@  static int img_convert(int argc, char **argv)
                 n = nb_sectors;
             }
 
+            if (align && cluster_sectors > 0) {
+                int64_t next_aligned_sector = (sector_num + cluster_sectors);
+                next_aligned_sector -= next_aligned_sector % cluster_sectors;
+                if (sector_num + n > next_aligned_sector) {
+                    n = next_aligned_sector - sector_num;
+                }
+            }
+
             if (n > bs_offset + bs_sectors - sector_num) {
                 n = bs_offset + bs_sectors - sector_num;
             }
diff --git a/qemu-img.texi b/qemu-img.texi
index 87f9d0f..9b1720f 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -179,11 +179,14 @@  Error on reading data
 
 @end table
 
-@item convert [-c] [-p] [-n] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_name}] [-S @var{sparse_size}] [-m @var{iobuf_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
+@item convert [-c] [-p] [-n] [-a] [-f @var{fmt}] [-t @var{cache}] [-O @var{output_fmt}] [-o @var{options}] [-s @var{snapshot_name}] [-S @var{sparse_size}] [-m @var{iobuf_size}] @var{filename} [@var{filename2} [...]] @var{output_filename}
 
 Convert the disk image @var{filename} or a snapshot @var{snapshot_name} to disk image @var{output_filename}
 using format @var{output_fmt}. It can be optionally compressed (@code{-c}
 option) or use any format specific options like encryption (@code{-o} option).
+If the @code{-a} option is specified write requests will be aligned
+to the cluster size of the output image if possible. This is the default
+for compressed images.
 
 Only the formats @code{qcow} and @code{qcow2} support compression. The
 compression is read-only. It means that if a compressed sector is
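
A side note on the qemu-img.c hunk that computes cluster_sectors: as written, when bdrv_get_info() fails and compress is not set, the else branch still reads bdi.cluster_size, which may not have been filled in. One possible restructuring is sketched below as a standalone C function. The struct, the function name and the SECTOR_SIZE constant are stand-ins invented for this sketch, not QEMU definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SECTOR_SIZE 512 /* stand-in for BDRV_SECTOR_SIZE */

/* Stand-in for the one BlockDriverInfo field used here. */
struct info {
    int cluster_size;
};

/* `ret` is the result of the info query. On failure the info is never
 * read: cluster_sectors stays 0, and the error is only propagated when
 * compression needs the cluster size. */
int cluster_sectors_from_info(int ret, const struct info *bdi,
                              bool compress, int *cluster_sectors)
{
    assert(cluster_sectors != NULL);
    *cluster_sectors = 0;

    if (ret < 0) {
        return compress ? ret : 0;
    }

    assert(bdi != NULL);
    *cluster_sectors = bdi->cluster_size / SECTOR_SIZE;
    return 0;
}
```

With this shape, a failed query without -c simply leaves cluster_sectors at 0, which the later "cluster_sectors > 0" check in the copy loop already treats as "do not align".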