[1/5] RFC: Efficient VM backup for qemu (v1)

Message ID	1353488464-82756-1-git-send-email-dietmar@proxmox.com
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: Dietmar Maurer <dietmar@proxmox.com> To: qemu-devel@nongnu.org Date: Wed, 21 Nov 2012 10:01:00 +0100 Message-Id: <1353488464-82756-1-git-send-email-dietmar@proxmox.com> Cc: kwolf@redhat.com, Dietmar Maurer <dietmar@proxmox.com> Subject: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu (v1) Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Dietmar Maurer Nov. 21, 2012, 9:01 a.m. UTC

This series provides a way to efficiently backup VMs.

* Backup to a single archive file
* Backup contain all data to restore VM (full backup)
* Do not depend on storage type or image format
* Avoid use of temporary storage
* store sparse images efficiently

The file docs/backup-rfc.txt contains more details.

Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
---
 docs/backup-rfc.txt |  119 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 119 insertions(+), 0 deletions(-)
 create mode 100644 docs/backup-rfc.txt

Kevin Wolf Nov. 21, 2012, 10:48 a.m. UTC | #1

Am 21.11.2012 10:01, schrieb Dietmar Maurer:
> +Some storage types/formats supports internal snapshots using some kind
> +of reference counting (rados, sheepdog, dm-thin, qcow2). It would be possible
> +to use that for backups, but for now we want to be storage-independent.
> +
> +Note: It turned out that taking a qcow2 snapshot can take a very long
> +time on larger files.

Hm, really? What are "larger files"? It has always been relatively quick
when I tested it, though internal snapshots are not my focus, so that
need not mean much.

If this is really an important use case for someone, I think qcow2
internal snapshots still have some potential for relatively easy
performance optimisations.

But that just as an aside...

> +
> +=Make it more efficient=
> +
> +The be more efficient, we simply need to avoid unnecessary steps. The
> +following steps are always required:
> +
> +1.) read old data before it gets overwritten
> +2.) write that data into the backup archive
> +3.) write new data (VM write)
> +
> +As you can see, this involves only one read, an two writes.

Looks like a nice approach to backup indeed.

The question is how to fit this into the big picture of qemu's live
block operations. Much of it looks like an active mirror (which is still
to be implemented), with the difference that it doesn't write the new,
but the old data, and that it keeps a bitmap of clusters that should not
be mirrored.

I'm not sure if this means that code should be shared between these two
or if the differences are too big. However, both of them have things in
common regarding the design. For example, both have a background part
(copying the existing data) and an active part (mirroring/backing up
data on writes). Block jobs are the right tool for the background part.

The active part is a bit more tricky. You're putting some code into
block.c to achieve it, which is kind of ugly. We have been talking about
"block filters" previously that would provide a generic infrastructure,
and at least in the mid term the additions to block.c must disappear.
(Same for block.h and block_int.h - keep things as separated from the
core as possible) Maybe we should introduce this infrastructure now.

Another interesting point is how (or whether) to link block jobs with
block filters. I think when the job is started, the filter should be
inserted automatically, and when you cancel it, it should be stopped.
When you pause the job... no idea. :-)

> +
> +To make that work, our backup archive need to be able to store image
> +data 'out of order'. It is important to notice that this will not work
> +with traditional archive formats like tar.

> +* works on any storage type and image format.
> +* we can define a new and simple archive format, which is able to
> +  store sparse files efficiently.

> +
> +Note: Storing sparse files is a mess with existing archive
> +formats. For example, tar requires information about holes at the
> +beginning of the archive.

> +* we need to define a new archive format
> +
> +Note: Most existing archive formats are optimized to store small files
> +including file attributes. We simply do not need that for VM archives.
> +
> +* archive contains data 'out of order'
> +
> +If you want to access image data in sequential order, you need to
> +re-order archive data. It would be possible to to that on the fly,
> +using temporary files.
> +
> +Fortunately, a normal restore/extract works perfectly with 'out of
> +order' data, because the target files are seekable.

> +=Archive format requirements=
> +
> +The basic requirement for such new format is that we can store image
> +date 'out of order'. It is also very likely that we have less than 256
> +drives/images per VM, and we want to be able to store VM configuration
> +files.
> +
> +We have defined a very simply format with those properties, see:
> +
> +docs/specs/vma_spec.txt
> +
> +Please let us know if you know an existing format which provides the
> +same functionality.

Essentially, what you need is an image format. You want to be
independent from the source image formats, but you're okay with using a
specific format for the backup (or you wouldn't have defined a new
format for it).

The one special thing that you need is storing multiple images in one
file. There's something like this already in qemu: qcow2 with its
internal snapshots is basically a flat file system.

Not saying that this is necessarily the best option, but I think reusing
existing formats and implementation is always a good thing, so it's an
idea to consider.

Kevin

Dietmar Maurer Nov. 21, 2012, 11:10 a.m. UTC | #2

> > +Note: It turned out that taking a qcow2 snapshot can take a very long
> > +time on larger files.
> 
> Hm, really? What are "larger files"? It has always been relatively quick when I
> tested it, though internal snapshots are not my focus, so that need not mean
> much.

300GB or larger
 
> If this is really an important use case for someone, I think qcow2 internal
> snapshots still have some potential for relatively easy performance
> optimisations.

I guess the problem is the small cluster size, so the reference table gets quite large
(for example fvd uses 2GB to minimize table size).
 
> But that just as an aside...
> 
> > +
> > +=Make it more efficient=
> > +
> > +The be more efficient, we simply need to avoid unnecessary steps. The
> > +following steps are always required:
> > +
> > +1.) read old data before it gets overwritten
> > +2.) write that data into the backup archive
> > +3.) write new data (VM write)
> > +
> > +As you can see, this involves only one read, an two writes.
> 
> Looks like a nice approach to backup indeed.
> 
> The question is how to fit this into the big picture of qemu's live block
> operations. Much of it looks like an active mirror (which is still to be
> implemented), with the difference that it doesn't write the new, but the old
> data, and that it keeps a bitmap of clusters that should not be mirrored.
> 
> I'm not sure if this means that code should be shared between these two or
> if the differences are too big. However, both of them have things in common
> regarding the design. For example, both have a background part (copying the
> existing data) and an active part (mirroring/backing up data on writes). Block
> jobs are the right tool for the background part.

I already use block jobs. Or do you want to share more?
 
> The active part is a bit more tricky. You're putting some code into block.c to
> achieve it, which is kind of ugly. 

yes. but I tried to keep that small ;-)

>We have been talking about "block filters"
> previously that would provide a generic infrastructure, and at least in the mid
> term the additions to block.c must disappear.
> (Same for block.h and block_int.h - keep things as separated from the core as
> possible) Maybe we should introduce this infrastructure now.

I have no idea what you talk about? Can you point me to the relevant discussion?
 
> Another interesting point is how (or whether) to link block jobs with block
> filters. I think when the job is started, the filter should be inserted
> automatically, and when you cancel it, it should be stopped.
> When you pause the job... no idea. :-)
> 
> > +
> > +To make that work, our backup archive need to be able to store image
> > +data 'out of order'. It is important to notice that this will not
> > +work with traditional archive formats like tar.
> 
> > +* works on any storage type and image format.
> > +* we can define a new and simple archive format, which is able to
> > +  store sparse files efficiently.
> 
> > +
> > +Note: Storing sparse files is a mess with existing archive formats.
> > +For example, tar requires information about holes at the beginning of
> > +the archive.
> 
> > +* we need to define a new archive format
> > +
> > +Note: Most existing archive formats are optimized to store small
> > +files including file attributes. We simply do not need that for VM archives.
> > +
> > +* archive contains data 'out of order'
> > +
> > +If you want to access image data in sequential order, you need to
> > +re-order archive data. It would be possible to to that on the fly,
> > +using temporary files.
> > +
> > +Fortunately, a normal restore/extract works perfectly with 'out of
> > +order' data, because the target files are seekable.
> 
> > +=Archive format requirements=
> > +
> > +The basic requirement for such new format is that we can store image
> > +date 'out of order'. It is also very likely that we have less than
> > +256 drives/images per VM, and we want to be able to store VM
> > +configuration files.
> > +
> > +We have defined a very simply format with those properties, see:
> > +
> > +docs/specs/vma_spec.txt
> > +
> > +Please let us know if you know an existing format which provides the
> > +same functionality.
> 
> Essentially, what you need is an image format. You want to be independent
> from the source image formats, but you're okay with using a specific format
> for the backup (or you wouldn't have defined a new format for it).
> 
> The one special thing that you need is storing multiple images in one file.
> There's something like this already in qemu: qcow2 with its internal
> snapshots is basically a flat file system.
> 
> Not saying that this is necessarily the best option, but I think reusing existing
> formats and implementation is always a good thing, so it's an idea to
> consider.

AFAIK qcow2 file cannot store data out of order. In general, an backup fd is not seekable, 
and we only want to do sequential writes. Image format always requires seekable fds?

Anyways, a qcow2 file is really complex beast - I am quite unsure if I would use 
that for backup if it is possible. 

That would require any external tool to include >=50000 LOC

The vma reader code is about 700 LOC (quite easy).

Dietmar Maurer Nov. 21, 2012, 11:23 a.m. UTC | #3

> Not saying that this is necessarily the best option, but I think reusing existing
> formats and implementation is always a good thing, so it's an idea to
> consider.

Yes, I would really like to reuse something. Our current backup software uses 'tar' files,
but that is really inefficient. We also analyzed all other available
archive formats, but none of them is capable to store sparse files efficiently. 

And storing data out of order is beyond the scope of existing format.

Kevin Wolf Nov. 21, 2012, 12:37 p.m. UTC | #4

Am 21.11.2012 12:10, schrieb Dietmar Maurer:
>>> +Note: It turned out that taking a qcow2 snapshot can take a very long
>>> +time on larger files.
>>
>> Hm, really? What are "larger files"? It has always been relatively quick when I
>> tested it, though internal snapshots are not my focus, so that need not mean
>> much.
> 
> 300GB or larger
>  
>> If this is really an important use case for someone, I think qcow2 internal
>> snapshots still have some potential for relatively easy performance
>> optimisations.
> 
> I guess the problem is the small cluster size, so the reference table gets quite large
> (for example fvd uses 2GB to minimize table size).

qemu-img check gives an idea of what it costs to read in the whole
metadata of an image. Updating some of it should mean not more than a
factor of two. I'm seeing much bigger differences, so I suspect there's
something wrong.

Somebody should probably try tracing where the performance is lost.

>> But that just as an aside...
>>
>>> +
>>> +=Make it more efficient=
>>> +
>>> +The be more efficient, we simply need to avoid unnecessary steps. The
>>> +following steps are always required:
>>> +
>>> +1.) read old data before it gets overwritten
>>> +2.) write that data into the backup archive
>>> +3.) write new data (VM write)
>>> +
>>> +As you can see, this involves only one read, an two writes.
>>
>> Looks like a nice approach to backup indeed.
>>
>> The question is how to fit this into the big picture of qemu's live block
>> operations. Much of it looks like an active mirror (which is still to be
>> implemented), with the difference that it doesn't write the new, but the old
>> data, and that it keeps a bitmap of clusters that should not be mirrored.
>>
>> I'm not sure if this means that code should be shared between these two or
>> if the differences are too big. However, both of them have things in common
>> regarding the design. For example, both have a background part (copying the
>> existing data) and an active part (mirroring/backing up data on writes). Block
>> jobs are the right tool for the background part.
> 
> I already use block jobs. Or do you want to share more?

I was thinking about sharing code between a future active mirror and the
backup job. Which may or may not make sense. I'm mostly hoping for input
from Paolo here.

>> The active part is a bit more tricky. You're putting some code into block.c to
>> achieve it, which is kind of ugly. 
> 
> yes. but I tried to keep that small ;-)

Yup, it's already not too bad. I haven't looked into it in much detail,
but I'd like to reduce it even a bit more. In particular, the
backup_info field in the BlockDriverState feels wrong to me. In the long
term the generic block layer shouldn't know at all what a backup is, and
baking it into BDS couples it very tightly.

>> We have been talking about "block filters"
>> previously that would provide a generic infrastructure, and at least in the mid
>> term the additions to block.c must disappear.
>> (Same for block.h and block_int.h - keep things as separated from the core as
>> possible) Maybe we should introduce this infrastructure now.
> 
> I have no idea what you talk about? Can you point me to the relevant discussion?

Not sure if a single discussion explains it, and I can't even find one
at the moment.

In short, the idea is that you can stick filters on top of a
BlockDriverState, so that any read/writes (and possibly more requests,
if necessary) are routed through the filter before they are passed to
the block driver of this BDS. Filters would be implemented as
BlockDrivers, i.e. you could implement .bdrv_co_write() in a filter to
intercept all writes to an image.

>> Another interesting point is how (or whether) to link block jobs with block
>> filters. I think when the job is started, the filter should be inserted
>> automatically, and when you cancel it, it should be stopped.
>> When you pause the job... no idea. :-)

>> Essentially, what you need is an image format. You want to be independent
>> from the source image formats, but you're okay with using a specific format
>> for the backup (or you wouldn't have defined a new format for it).
>>
>> The one special thing that you need is storing multiple images in one file.
>> There's something like this already in qemu: qcow2 with its internal
>> snapshots is basically a flat file system.
>>
>> Not saying that this is necessarily the best option, but I think reusing existing
>> formats and implementation is always a good thing, so it's an idea to
>> consider.
> 
> AFAIK qcow2 file cannot store data out of order. In general, an backup fd is not seekable, 
> and we only want to do sequential writes. Image format always requires seekable fds?

Ah, this is what you mean by "out of order". Just out of curiosity, what
are these non-seekable backup fds usually?

In principle even for this qcow2 could be used as an image format,
however the existing implementation wouldn't be of much use for you, so
it loses quite a bit of its attractiveness.

> Anyways, a qcow2 file is really complex beast - I am quite unsure if I would use 
> that for backup if it is possible. 
> 
> That would require any external tool to include >=50000 LOC
> 
> The vma reader code is about 700 LOC (quite easy).

So what? qemu-img is already there.

Kevin

Paolo Bonzini Nov. 21, 2012, 1:23 p.m. UTC | #5

Il 21/11/2012 13:37, Kevin Wolf ha scritto:
>>> >> The active part is a bit more tricky. You're putting some code into block.c to
>>> >> achieve it, which is kind of ugly. 
>> > 
>> > yes. but I tried to keep that small ;-)
> Yup, it's already not too bad. I haven't looked into it in much detail,
> but I'd like to reduce it even a bit more. In particular, the
> backup_info field in the BlockDriverState feels wrong to me. In the long
> term the generic block layer shouldn't know at all what a backup is, and
> baking it into BDS couples it very tightly.

My plan was to have something like bs->job->job_type->{before,after}_write.

   int coroutine_fn (*before_write)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        void **cookie);
   int coroutine_fn (*after_write)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        void *cookie);


The before_write could optionally return a "cookie" that is passed back
to the after_write callback.

Actually this was plan B, as a poor-man implementation of the filter
infrastructure.  Plan A was that the block filters would materialize
suddenly in someone's git tree.

Anyway, it should be very easy to convert Dietmar's code to something
like that, and the active mirror could use it as well.

>> > AFAIK qcow2 file cannot store data out of order. In general, an backup fd is not seekable, 
>> > and we only want to do sequential writes. Image format always requires seekable fds?
> Ah, this is what you mean by "out of order". Just out of curiosity, what
> are these non-seekable backup fds usually?

Perhaps I've been reading the SCSI standards too much lately, but tapes
come to mind. :)

Paolo

Dietmar Maurer Nov. 21, 2012, 1:25 p.m. UTC | #6

> > AFAIK qcow2 file cannot store data out of order. In general, an backup
> > fd is not seekable, and we only want to do sequential writes. Image format
> always requires seekable fds?
> 
> Ah, this is what you mean by "out of order". Just out of curiosity, what are
> these non-seekable backup fds usually?

/dev/nst0 ;-)

But there are better examples. Usually you want to use some kind of
compression, and you do that with existing tools:

# backup to stdout|gzip|...

A common usage scenario is to pipe a backup into a restore (copy)

# backup to stdout|ssh to remote host -c 'restore from stdin'

It is also a performance question. Seeks are terrible slow.

> In principle even for this qcow2 could be used as an image format, however
> the existing implementation wouldn't be of much use for you, so it loses
> quite a bit of its attractiveness.
> 
> > Anyways, a qcow2 file is really complex beast - I am quite unsure if I
> > would use that for backup if it is possible.
> >
> > That would require any external tool to include >=50000 LOC
> >
> > The vma reader code is about 700 LOC (quite easy).
> 
> So what? qemu-img is already there.

Anyways, you already pointed out that the existing implementation does not work.

But I already expected such discussion. So maybe it is better we simply pipe all data to an external binary?
We just need to define a minimal protocol. 

In future we can produce different archivers as independent/external binaries?

Kevin Wolf Nov. 21, 2012, 1:58 p.m. UTC | #7

Am 21.11.2012 14:25, schrieb Dietmar Maurer:
>>> AFAIK qcow2 file cannot store data out of order. In general, an backup
>>> fd is not seekable, and we only want to do sequential writes. Image format
>> always requires seekable fds?
>>
>> Ah, this is what you mean by "out of order". Just out of curiosity, what are
>> these non-seekable backup fds usually?
> 
> /dev/nst0 ;-)

Sure. :-)

> But there are better examples. Usually you want to use some kind of
> compression, and you do that with existing tools:
> 
> # backup to stdout|gzip|...

When you use an image/archive format anyway, you could use a compression
mechanism that it already supports.

> A common usage scenario is to pipe a backup into a restore (copy)
> 
> # backup to stdout|ssh to remote host -c 'restore from stdin'

This is a good one. I believe our usual solution would have been to
backup to a NBD server on the remote host instead.

In general I can see that being able to pipe it to other programs could
be nice. I'm not sure if it's an absolute requirement. Would your tools
for taking the backup employ any specific use of pipes?

> It is also a performance question. Seeks are terrible slow.

You wouldn't do it a lot. Only for metadata, and you would only write
out the metadata once the in-memory cache is full.

>> In principle even for this qcow2 could be used as an image format, however
>> the existing implementation wouldn't be of much use for you, so it loses
>> quite a bit of its attractiveness.
>>
>>> Anyways, a qcow2 file is really complex beast - I am quite unsure if I
>>> would use that for backup if it is possible.
>>>
>>> That would require any external tool to include >=50000 LOC
>>>
>>> The vma reader code is about 700 LOC (quite easy).
>>
>> So what? qemu-img is already there.
> 
> Anyways, you already pointed out that the existing implementation does not work.

I'm still trying to figure out the real requirements to think some more
about it. :-)

> But I already expected such discussion. So maybe it is better we simply pipe all data to an external binary?
> We just need to define a minimal protocol. 
> 
> In future we can produce different archivers as independent/external binaries?

You shouldn't look at discussions as a bad thing. We're not trying to
block your changes, but to understand and possibly improve them.

Yes, discussions mean that it takes a bit longer to get things merged,
but they also mean that usually something better is merged in the end
that actually fits well in qemu's design, is maintainable, generic and
so on. Evading the discussions by keeping code externally wouldn't
improve things.

Which doesn't mean that external archivers are completely out of the
question, but I would only consider them if there's a good technical
reason to do so.

So if eventually we come to the conclusion that vma (or for that matter,
anything else in your patches) is the right solution, let's take it. But
first please give us the chance to understand the reasons of why you did
things the way you did them, and to discuss the pros and cons of
alternative solutions.

Kevin

Dietmar Maurer Nov. 21, 2012, 3:47 p.m. UTC | #8

> >> Ah, this is what you mean by "out of order". Just out of curiosity,
> >> what are these non-seekable backup fds usually?
> >
> > /dev/nst0 ;-)
> 
> Sure. :-)
> 
> > But there are better examples. Usually you want to use some kind of
> > compression, and you do that with existing tools:
> >
> > # backup to stdout|gzip|...
> 
> When you use an image/archive format anyway, you could use a
> compression mechanism that it already supports.

Many archive formats does not support compressions internally (tar, cpio, ..).
I also avoided to include that in the 'vma' format. So you can use any
external tool. 

Some user wants to compress, other wants bzip2, or gzip -1, xz, pgzip, ...
Or maybe pipe into some kind of encryption tool ...

> > A common usage scenario is to pipe a backup into a restore (copy)
> >
> > # backup to stdout|ssh to remote host -c 'restore from stdin'
> 
> This is a good one. I believe our usual solution would have been to backup to
> a NBD server on the remote host instead.
> 
> In general I can see that being able to pipe it to other programs could be nice.
> I'm not sure if it's an absolute requirement. Would your tools for taking the
> backup employ any specific use of pipes?

Yes, we currently have that functionality, and I do not want to remove features.

> > It is also a performance question. Seeks are terrible slow.
> 
> You wouldn't do it a lot. Only for metadata, and you would only write out the
> metadata once the in-memory cache is full.

IMHO it is still much better to write sequentially, because that has 'zero' overhead.

Besides, writing data sequentially is so much easier (on the implementation side)

The current VMA code also use checksums and special 'uuid' markers, which
makes it possible to find and recover damaged archives. I guess such things
are quite impossible with qcow2, or very hard to do?

> >> In principle even for this qcow2 could be used as an image format,
> >> however the existing implementation wouldn't be of much use for you,
> >> so it loses quite a bit of its attractiveness.
> >>
> >>> Anyways, a qcow2 file is really complex beast - I am quite unsure if
> >>> I would use that for backup if it is possible.
> >>>
> >>> That would require any external tool to include >=50000 LOC
> >>>
> >>> The vma reader code is about 700 LOC (quite easy).
> >>
> >> So what? qemu-img is already there.
> >
> > Anyways, you already pointed out that the existing implementation does
> not work.
> 
> I'm still trying to figure out the real requirements to think some more about
> it. :-)

Any existing archive format I know works on pipes (without seeks). 
Well, that does not really mean anything.

> > But I already expected such discussion. So maybe it is better we simply pipe
> all data to an external binary?
> > We just need to define a minimal protocol.
> >
> > In future we can produce different archivers as independent/external
> binaries?
> 
> You shouldn't look at discussions as a bad thing. We're not trying to block
> your changes, but to understand and possibly improve them.

I do not consider your comments as 'bad thing' - above idea was a real suggestion ;-)

I already have plans to use a Content Addressable Storage (instead of 'vma'), so
such plugin architecture makes it easier to play around with different formats.
 
> Yes, discussions mean that it takes a bit longer to get things merged, but they
> also mean that usually something better is merged in the end that actually
> fits well in qemu's design, is maintainable, generic and so on. Evading the
> discussions by keeping code externally wouldn't improve things.

sure.
 
> Which doesn't mean that external archivers are completely out of the
> question, but I would only consider them if there's a good technical reason to
> do so.

As noted above, I can see rooms for different format. 

1.) 'vma' is my proof of concept, easy to implement and use.
2.) CAS - very useful to sync backup data across datacenters (this
gives us deduplication and kind of 'incremental backups')
3.) support existing archive format like 'tar' (this is possible if we
use temporary files to store out-of-order data)
4.) backup to some kind of external server
5.) plugins for existing backup tools (bacula, ...)?

> So if eventually we come to the conclusion that vma (or for that matter,
> anything else in your patches) is the right solution, let's take it. But first
> please give us the chance to understand the reasons of why you did things
> the way you did them, and to discuss the pros and cons of alternative
> solutions.

Sure. I was not aware that I wrote something negative in the previous reply - sorry for that.

Stefan Hajnoczi Nov. 22, 2012, 11:12 a.m. UTC | #9

On Wed, Nov 21, 2012 at 10:01:00AM +0100, Dietmar Maurer wrote:
> +==Disadvantages==
> +
> +* we need to define a new archive format
> +
> +Note: Most existing archive formats are optimized to store small files
> +including file attributes. We simply do not need that for VM archives.

Did you look at the VMDK "Stream-Optimized Compressed" subformat?

http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?src=vmdk

It is a stream of compressed "grains" (data).  They are out-of-order and
each grain comes with the virtual disk lba where the data should be
visible to the guest.

The stream also contains "grain tables" and "grain directories".  This
metadata makes random read access to the file possible once you have
downloaded the entire file (i.e. it is seekable).  Although tools can
choose to consume the stream in sequential order too and ignore the
metadata.

In other words, the format is an out-of-order stream of data chunks plus
random access lookup tables at the end.

QEMU's block/vmdk.c already has some support for this format although I
don't think we generate out-of-order yet.

The benefit of reusing this code is that existing tools can consume
these files.

Stefan

Dietmar Maurer Nov. 22, 2012, 11:26 a.m. UTC | #10

> Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> 
> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
> src=vmdk
> 
> It is a stream of compressed "grains" (data).  They are out-of-order and each
> grain comes with the virtual disk lba where the data should be visible to the
> guest.
> 

What kind of license is applied to that specification?

Dietmar Maurer Nov. 22, 2012, 11:40 a.m. UTC | #11

> Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> 
> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
> src=vmdk

Max file size 2TB?

Dietmar Maurer Nov. 22, 2012, noon UTC | #12

> It is a stream of compressed "grains" (data).  They are out-of-order and each
> grain comes with the virtual disk lba where the data should be visible to the
> guest.
> 
> The stream also contains "grain tables" and "grain directories".  This
> metadata makes random read access to the file possible once you have
> downloaded the entire file (i.e. it is seekable).  Although tools can choose to
> consume the stream in sequential order too and ignore the metadata.
> 
> In other words, the format is an out-of-order stream of data chunks plus
> random access lookup tables at the end.
> 
> QEMU's block/vmdk.c already has some support for this format although I
> don't think we generate out-of-order yet.
> 
> The benefit of reusing this code is that existing tools can consume these files.

Compression format is hardcoded to RFC 1951 (defalte). I think this is a major disadvantage, 
because it is really slow (compared to lzop).

Dietmar Maurer Nov. 22, 2012, 12:03 p.m. UTC | #13

> Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> 
> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
> src=vmdk

And is that covered by any patents?

Stefan Hajnoczi Nov. 22, 2012, 12:44 p.m. UTC | #14

On Thu, Nov 22, 2012 at 11:26:21AM +0000, Dietmar Maurer wrote:
> > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> > 
> > http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
> > src=vmdk
> > 
> > It is a stream of compressed "grains" (data).  They are out-of-order and each
> > grain comes with the virtual disk lba where the data should be visible to the
> > guest.
> > 
> 
> What kind of license is applied to that specification?

The document I linked came straight from Google Search and you don't
need to agree to anything to view it.  The document doesn't seem to
impose restrictions.  QEMU has supported the VMDK format and so have
other open source tools for a number of years.

For anything more specific you could search VMware's website and/or
check with a lawyer.

Stefan

Dietmar Maurer Nov. 22, 2012, 12:55 p.m. UTC | #15

> -----Original Message-----
> From: Stefan Hajnoczi [mailto:stefanha@gmail.com]
> Sent: Donnerstag, 22. November 2012 13:45
> To: Dietmar Maurer
> Cc: qemu-devel@nongnu.org; kwolf@redhat.com
> Subject: Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu
> (v1)
> 
> On Thu, Nov 22, 2012 at 11:26:21AM +0000, Dietmar Maurer wrote:
> > > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> > >
> > >
> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
> > > src=vmdk
> > >
> > > It is a stream of compressed "grains" (data).  They are out-of-order
> > > and each grain comes with the virtual disk lba where the data should
> > > be visible to the guest.
> > >
> >
> > What kind of license is applied to that specification?
> 
> The document I linked came straight from Google Search and you don't need
> to agree to anything to view it.  The document doesn't seem to impose
> restrictions.  QEMU has supported the VMDK format and so have other open
> source tools for a number of years.
> 
> For anything more specific you could search VMware's website and/or check
> with a lawyer.

The documents says: VMware products are covered by one or more patents listed at http://www.vmware.com/go/patents

I simply do not have the time to check all those things, which make that format unusable for me.

Anyways, thanks for the link.

Stefan Hajnoczi Nov. 22, 2012, 3:30 p.m. UTC | #16

On Thu, Nov 22, 2012 at 1:55 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> -----Original Message-----
>> From: Stefan Hajnoczi [mailto:stefanha@gmail.com]
>> Sent: Donnerstag, 22. November 2012 13:45
>> To: Dietmar Maurer
>> Cc: qemu-devel@nongnu.org; kwolf@redhat.com
>> Subject: Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu
>> (v1)
>>
>> On Thu, Nov 22, 2012 at 11:26:21AM +0000, Dietmar Maurer wrote:
>> > > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
>> > >
>> > >
>> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
>> > > src=vmdk
>> > >
>> > > It is a stream of compressed "grains" (data).  They are out-of-order
>> > > and each grain comes with the virtual disk lba where the data should
>> > > be visible to the guest.
>> > >
>> >
>> > What kind of license is applied to that specification?
>>
>> The document I linked came straight from Google Search and you don't need
>> to agree to anything to view it.  The document doesn't seem to impose
>> restrictions.  QEMU has supported the VMDK format and so have other open
>> source tools for a number of years.
>>
>> For anything more specific you could search VMware's website and/or check
>> with a lawyer.
>
> The documents says: VMware products are covered by one or more patents listed at http://www.vmware.com/go/patents
>
> I simply do not have the time to check all those things, which make that format unusable for me.

In think proxmox ships the QEMU vmdk functionality today?  In that
case you should check this :).

Stefan

Stefan Hajnoczi Nov. 22, 2012, 3:42 p.m. UTC | #17

On Thu, Nov 22, 2012 at 12:40 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> Did you look at the VMDK "Stream-Optimized Compressed" subformat?
>>
>> http://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf?
>> src=vmdk
>
> Max file size 2TB?

That is for a single .vmdk file.  But vmdks can also be split across
multiple files ("extents") so you can get more than 2TB.

QEMU has some support for reading vmdks with extents, but I think we
never create files like this today.

Stefan

Stefan Hajnoczi Nov. 22, 2012, 3:45 p.m. UTC | #18

On Thu, Nov 22, 2012 at 1:00 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> It is a stream of compressed "grains" (data).  They are out-of-order and each
>> grain comes with the virtual disk lba where the data should be visible to the
>> guest.
>>
>> The stream also contains "grain tables" and "grain directories".  This
>> metadata makes random read access to the file possible once you have
>> downloaded the entire file (i.e. it is seekable).  Although tools can choose to
>> consume the stream in sequential order too and ignore the metadata.
>>
>> In other words, the format is an out-of-order stream of data chunks plus
>> random access lookup tables at the end.
>>
>> QEMU's block/vmdk.c already has some support for this format although I
>> don't think we generate out-of-order yet.
>>
>> The benefit of reusing this code is that existing tools can consume these files.
>
> Compression format is hardcoded to RFC 1951 (defalte). I think this is a major disadvantage,
> because it is really slow (compared to lzop).

It's a naughty thing to do but we could simply pick a new constant and
support LZO as an incompatible option.  The file is then no longer
compatible with existing vmdk tools but at least we then have a choice
of using compatible deflate or the LZO extension.

VMDK already has 99% of what you need and we already have a bunch of
code to handle this format.  This seems like a good opportunity to
flesh out VMDK support and avoid reinventing the wheel.

Stefan

Dietmar Maurer Nov. 22, 2012, 3:56 p.m. UTC | #19

> It's a naughty thing to do but we could simply pick a new constant and
> support LZO as an incompatible option.  The file is then no longer compatible
> with existing vmdk tools but at least we then have a choice of using
> compatible deflate or the LZO extension.

To be 100% incompatible to existing tools? That would remove any advantage.
 
> VMDK already has 99% of what you need and we already have a bunch of
> code to handle this format.  This seems like a good opportunity to flesh out
> VMDK support and avoid reinventing the wheel.

Using 'probably' patented software is a bad idea - I will not go that way.

Dietmar Maurer Nov. 22, 2012, 3:58 p.m. UTC | #20

> > The documents says: VMware products are covered by one or more
> patents
> > listed at http://www.vmware.com/go/patents
> >
> > I simply do not have the time to check all those things, which make that
> format unusable for me.
> 
> In think proxmox ships the QEMU vmdk functionality today?  In that case you
> should check this :).

Well, and also remove it from the qemu repository? Such things are not compatible with GPL?

Stefan Hajnoczi Nov. 22, 2012, 4:37 p.m. UTC | #21

On Thu, Nov 22, 2012 at 4:56 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> It's a naughty thing to do but we could simply pick a new constant and
>> support LZO as an incompatible option.  The file is then no longer compatible
>> with existing vmdk tools but at least we then have a choice of using
>> compatible deflate or the LZO extension.
>
> To be 100% incompatible to existing tools? That would remove any advantage.

No, it should be an option.  Users who care about compatibility can use deflate.

Stefan

Stefan Hajnoczi Nov. 22, 2012, 5:02 p.m. UTC | #22

On Thu, Nov 22, 2012 at 4:58 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> > The documents says: VMware products are covered by one or more
>> patents
>> > listed at http://www.vmware.com/go/patents
>> >
>> > I simply do not have the time to check all those things, which make that
>> format unusable for me.
>>
>> In think proxmox ships the QEMU vmdk functionality today?  In that case you
>> should check this :).
>
> Well, and also remove it from the qemu repository? Such things are not compatible with GPL?

If you are really concerned about this then submit a patch to add
./configure --disable-vmdk, ship QEMU without VMDK, and drop it from
your documentation/wiki.

If you're not really concerned, then let's accept that VMDK support is okay.

Stefan

Stefan Hajnoczi Nov. 22, 2012, 5:16 p.m. UTC | #23

On Thu, Nov 22, 2012 at 12:12 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Wed, Nov 21, 2012 at 10:01:00AM +0100, Dietmar Maurer wrote:
>> +==Disadvantages==
>> +
>> +* we need to define a new archive format
>> +
>> +Note: Most existing archive formats are optimized to store small files
>> +including file attributes. We simply do not need that for VM archives.
>
> Did you look at the VMDK "Stream-Optimized Compressed" subformat?

We've gone down several sub-threads discussing whether VMDK is
suitable.  I want to summarize why this is a good approach:

The VMDK format already allows for out-of-order data and is supported
by existing tools - this is very important for backups where people
are (rightfully) paranoid about putting their backups in an obscure
format.  They want to be able to access their data years later,
whether your tool is still around or not.

QEMU's implementation has partial support for Stream-Optimized
Compressed images.  If you complete the code for this subformat, not
only does this benefit the VM Backup feature, but it also makes
qemu-img convert more powerful for everyone.  I hope we can kill two
birds with one stone here.

Stefan

Dietmar Maurer Nov. 22, 2012, 5:34 p.m. UTC | #24

> -----Original Message-----
> From: Stefan Hajnoczi [mailto:stefanha@gmail.com]
> Sent: Donnerstag, 22. November 2012 18:02
> To: Dietmar Maurer
> Cc: kwolf@redhat.com; qemu-devel@nongnu.org
> Subject: Re: [Qemu-devel] [PATCH 1/5] RFC: Efficient VM backup for qemu
> (v1)
> 
> On Thu, Nov 22, 2012 at 4:58 PM, Dietmar Maurer <dietmar@proxmox.com>
> wrote:
> >> > The documents says: VMware products are covered by one or more
> >> patents
> >> > listed at http://www.vmware.com/go/patents
> >> >
> >> > I simply do not have the time to check all those things, which make
> >> > that
> >> format unusable for me.
> >>
> >> In think proxmox ships the QEMU vmdk functionality today?  In that
> >> case you should check this :).
> >
> > Well, and also remove it from the qemu repository? Such things are not
> compatible with GPL?
> 
> If you are really concerned about this then submit a patch to add ./configure
> --disable-vmdk, ship QEMU without VMDK, and drop it from your
> documentation/wiki.
> 
> If you're not really concerned, then let's accept that VMDK support is okay.

I simply don't want to waste time one something with unclear License issues.
So I will not work on such format.

Don't get me wrong - that is just my personal opinion.

Dietmar Maurer Nov. 22, 2012, 5:46 p.m. UTC | #25

> > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
> 
> We've gone down several sub-threads discussing whether VMDK is suitable.
> I want to summarize why this is a good approach:
> 
> The VMDK format already allows for out-of-order data and is supported by
> existing tools - this is very important for backups where people are
> (rightfully) paranoid about putting their backups in an obscure format.  They
> want to be able to access their data years later, whether your tool is still
> around or not.

The VMDK format has strong disadvantages:

- unclear License (the spec links to patents)
- they use a very slow compression algorithm (deflate), which makes it unusable for backup

Dietmar Maurer Nov. 22, 2012, 5:50 p.m. UTC | #26

> The VMDK format has strong disadvantages:
> 
> - unclear License (the spec links to patents)
> - they use a very slow compression algorithm (deflate), which makes it
> unusable for backup

Seems they do not support multiple configuration files. You can only
a single text block, and that needs to contain vmware specific info.
So where do I add my qemu related config?

Dietmar Maurer Nov. 22, 2012, 6:05 p.m. UTC | #27

> QEMU's implementation has partial support for Stream-Optimized
> Compressed images.  If you complete the code for this subformat, not only
> does this benefit the VM Backup feature, but it also makes qemu-img convert
> more powerful for everyone.  I hope we can kill two birds with one stone

The doc contain the following link:

http://www.vmware.com/download/patents.html

I simply have no idea how to check all those patents. How can someone tell
that they do not cover things in the specs? I am really curios?

Dietmar Maurer Nov. 22, 2012, 6:15 p.m. UTC | #28

> The VMDK format already allows for out-of-order data and is supported by
> existing tools - this is very important for backups where people are
> (rightfully) paranoid about putting their backups in an obscure format.  They
> want to be able to access their data years later, whether your tool is still
> around or not.

Anything we will add to the qemu source fulfills those properties. Or do you
really think qemu will disappear soon?

Besides, the VMA format is much simpler than the vmdk format. Thus I consider
it safer (and not 'obscure').

Stefan Hajnoczi Nov. 23, 2012, 5:19 a.m. UTC | #29

On Thu, Nov 22, 2012 at 7:05 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> QEMU's implementation has partial support for Stream-Optimized
>> Compressed images.  If you complete the code for this subformat, not only
>> does this benefit the VM Backup feature, but it also makes qemu-img convert
>> more powerful for everyone.  I hope we can kill two birds with one stone
>
> The doc contain the following link:
>
> http://www.vmware.com/download/patents.html
>
> I simply have no idea how to check all those patents. How can someone tell
> that they do not cover things in the specs? I am really curios?

If you want to investigate it then you would look at each one or use
search engines to make it easier (skip all the non-disk image related
patents).

But keep in mind that any other company out there could have a patent
on out-of-order data in an image file or other aspects of what you're
proposing.  Reinventing the wheel may not stop you from infringing on
their patents so the fact that VMware may or may not have patents
doesn't change things if you're really trying to find possible issues.

This is why SQLite has a policy of only using algorithms that are
older than 17 years, see comment by drh:
http://www.sqlite.org/cvstrac/wiki?p=BlueSky

Stefan

Stefan Hajnoczi Nov. 23, 2012, 5:21 a.m. UTC | #30

On Thu, Nov 22, 2012 at 6:50 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> The VMDK format has strong disadvantages:
>>
>> - unclear License (the spec links to patents)
>> - they use a very slow compression algorithm (deflate), which makes it
>> unusable for backup
>
> Seems they do not support multiple configuration files. You can only
> a single text block, and that needs to contain vmware specific info.
> So where do I add my qemu related config?

This is true.  QEMU uses a VMDK as a single disk image.  To handle
multiple disks there would need to be multiple images plus a vmstate
or config file.

Stefan

Stefan Hajnoczi Nov. 23, 2012, 5:23 a.m. UTC | #31

On Thu, Nov 22, 2012 at 6:46 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>> > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
>>
>> We've gone down several sub-threads discussing whether VMDK is suitable.
>> I want to summarize why this is a good approach:
>>
>> The VMDK format already allows for out-of-order data and is supported by
>> existing tools - this is very important for backups where people are
>> (rightfully) paranoid about putting their backups in an obscure format.  They
>> want to be able to access their data years later, whether your tool is still
>> around or not.
>
> The VMDK format has strong disadvantages:
>
> - unclear License (the spec links to patents)

I've already pointed out that you're taking an inconsistent position
on this point.  It's FUD.

> - they use a very slow compression algorithm (deflate), which makes it unusable for backup

I've already pointed out that we can optionally support other algorithms.

Stefan

Stefan Hajnoczi Nov. 23, 2012, 5:25 a.m. UTC | #32

On Fri, Nov 23, 2012 at 6:23 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Thu, Nov 22, 2012 at 6:46 PM, Dietmar Maurer <dietmar@proxmox.com> wrote:
>>> > Did you look at the VMDK "Stream-Optimized Compressed" subformat?
>>>
>>> We've gone down several sub-threads discussing whether VMDK is suitable.
>>> I want to summarize why this is a good approach:
>>>
>>> The VMDK format already allows for out-of-order data and is supported by
>>> existing tools - this is very important for backups where people are
>>> (rightfully) paranoid about putting their backups in an obscure format.  They
>>> want to be able to access their data years later, whether your tool is still
>>> around or not.
>>
>> The VMDK format has strong disadvantages:
>>
>> - unclear License (the spec links to patents)
>
> I've already pointed out that you're taking an inconsistent position
> on this point.  It's FUD.
>
>> - they use a very slow compression algorithm (deflate), which makes it unusable for backup
>
> I've already pointed out that we can optionally support other algorithms.

To make progress here I'll review the RFC patches.  VMDK or not isn't
the main thing, a backup feature like this looks interesting.

Stefan

Dietmar Maurer Nov. 23, 2012, 6:05 a.m. UTC | #33

> But keep in mind that any other company out there could have a patent on
> out-of-order data in an image file or other aspects of what you're proposing.

Sorry, but the vmware docs explicitly include a pointer to those patents. So this
is something completely different to me.

Dietmar Maurer Nov. 23, 2012, 6:13 a.m. UTC | #34

> > The VMDK format has strong disadvantages:
> >
> > - unclear License (the spec links to patents)
> 
> I've already pointed out that you're taking an inconsistent position on this
> point.  It's FUD.
> 
> > - they use a very slow compression algorithm (deflate), which makes it
> > unusable for backup
> 
> I've already pointed out that we can optionally support other algorithms.

Well, I guess we both pointed out our opinions. 

I will try to implement some kind of plugging architecture for backup formats.
That way we can implement/support more than one format.

Dietmar Maurer Nov. 23, 2012, 6:18 a.m. UTC | #35

> To make progress here I'll review the RFC patches.  VMDK or not isn't the
> main thing, a backup feature like this looks interesting.

Yes, a 'review' would be great - thanks.

- Dietmar

Dietmar Maurer Nov. 23, 2012, 7:38 a.m. UTC | #36

> In short, the idea is that you can stick filters on top of a BlockDriverState, so
> that any read/writes (and possibly more requests, if necessary) are routed
> through the filter before they are passed to the block driver of this BDS.
> Filters would be implemented as BlockDrivers, i.e. you could implement
> .bdrv_co_write() in a filter to intercept all writes to an image.

I am quite unsure if that make things easier.

Dietmar Maurer Nov. 23, 2012, 7:42 a.m. UTC | #37

> > Yup, it's already not too bad. I haven't looked into it in much
> > detail, but I'd like to reduce it even a bit more. In particular, the
> > backup_info field in the BlockDriverState feels wrong to me. In the
> > long term the generic block layer shouldn't know at all what a backup
> > is, and baking it into BDS couples it very tightly.
> 
> My plan was to have something like bs->job->job_type->{before,after}_write.
> 
>    int coroutine_fn (*before_write)(BlockDriverState *bs,
>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>         void **cookie);
>    int coroutine_fn (*after_write)(BlockDriverState *bs,
>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>         void *cookie);
> 
> 
> The before_write could optionally return a "cookie" that is passed back to
> the after_write callback.

I don't really understand why a filter is related to the job? This is sometimes useful,
but not a generic filter infrastructure (maybe someone want to use filters without a job).

Dietmar Maurer Nov. 23, 2012, 8:12 a.m. UTC | #38

> > Yup, it's already not too bad. I haven't looked into it in much
> > detail, but I'd like to reduce it even a bit more. In particular, the
> > backup_info field in the BlockDriverState feels wrong to me. In the
> > long term the generic block layer shouldn't know at all what a backup
> > is, and baking it into BDS couples it very tightly.
> 
> My plan was to have something like bs->job->job_type->{before,after}_write.
> 
>    int coroutine_fn (*before_write)(BlockDriverState *bs,
>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>         void **cookie);
>    int coroutine_fn (*after_write)(BlockDriverState *bs,
>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>         void *cookie);

I don't think that job is the right place. Instead I would put a list 
of filters into BDS:

typedef struct BlockFilter {
    void *opaque;
    int cluster_size;
    int coroutine_fn (before_read)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        BdrvRequestFlags flags, void **cookie);
    int coroutine_fn (after_read)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        BdrvRequestFlags flags, void *cookie);
    int coroutine_fn (*before_write)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        void **cookie);
    int coroutine_fn (*after_write)(BlockDriverState *bs,
        int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
        void *cookie);
} BlockFilter;

struct BlockDriverState {
   ...
    QLIST_HEAD(, BlockFilters) filters;
};

Would that work for you?

Dietmar Maurer Nov. 23, 2012, 9:01 a.m. UTC | #39

> > My plan was to have something like bs->job->job_type-
> >{before,after}_write.
> >
> >    int coroutine_fn (*before_write)(BlockDriverState *bs,
> >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> >         void **cookie);
> >    int coroutine_fn (*after_write)(BlockDriverState *bs,
> >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> >         void *cookie);
> 
> I don't think that job is the right place. Instead I would put a list of filters into
> BDS:

Well, I can also add it to job_type. Just tell me what you prefer, and I will write the patch.

Dietmar Maurer Nov. 23, 2012, 9:05 a.m. UTC | #40

> > > My plan was to have something like bs->job->job_type-
> > >{before,after}_write.
> > >
> > >    int coroutine_fn (*before_write)(BlockDriverState *bs,
> > >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> > >         void **cookie);
> > >    int coroutine_fn (*after_write)(BlockDriverState *bs,
> > >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
> > >         void *cookie);
> >
> > I don't think that job is the right place. Instead I would put a list
> > of filters into
> > BDS:
> 
> Well, I can also add it to job_type. Just tell me what you prefer, and I will
> write the patch.

BTW, will such filters work with the new virtio-blk-data-plane?

Kevin Wolf Nov. 23, 2012, 9:08 a.m. UTC | #41

Am 23.11.2012 08:38, schrieb Dietmar Maurer:
>> In short, the idea is that you can stick filters on top of a BlockDriverState, so
>> that any read/writes (and possibly more requests, if necessary) are routed
>> through the filter before they are passed to the block driver of this BDS.
>> Filters would be implemented as BlockDrivers, i.e. you could implement
>> .bdrv_co_write() in a filter to intercept all writes to an image.
> 
> I am quite unsure if that make things easier.

At least it would make for a much cleaner design compared to putting
code for every feature you can think of into bdrv_co_do_readv/writev().

Kevin

Paolo Bonzini Nov. 23, 2012, 9:15 a.m. UTC | #42

Il 23/11/2012 10:05, Dietmar Maurer ha scritto:
>>>> My plan was to have something like bs->job->job_type-
>>>> {before,after}_write.
>>>>
>>>>    int coroutine_fn (*before_write)(BlockDriverState *bs,
>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>         void **cookie);
>>>>    int coroutine_fn (*after_write)(BlockDriverState *bs,
>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>         void *cookie);
>>>
>>> I don't think that job is the right place. Instead I would put a list
>>> of filters into
>>> BDS:
>>
>> Well, I can also add it to job_type. Just tell me what you prefer, and I will
>> write the patch.
> 
> BTW, will such filters work with the new virtio-blk-data-plane?

No, virtio-blk-data-plane is a hack and will be slowly rewritten to
support all fancy features.

Paolo

Dietmar Maurer Nov. 23, 2012, 9:17 a.m. UTC | #43

> > BTW, will such filters work with the new virtio-blk-data-plane?
> 
> No, virtio-blk-data-plane is a hack and will be slowly rewritten to support all
> fancy features.

Ah, good to know ;-) thanks.

Paolo Bonzini Nov. 23, 2012, 9:18 a.m. UTC | #44

Il 23/11/2012 08:42, Dietmar Maurer ha scritto:
>> > My plan was to have something like bs->job->job_type->{before,after}_write.
>> > 
>> >    int coroutine_fn (*before_write)(BlockDriverState *bs,
>> >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>> >         void **cookie);
>> >    int coroutine_fn (*after_write)(BlockDriverState *bs,
>> >         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>> >         void *cookie);
>> > 
>> > 
>> > The before_write could optionally return a "cookie" that is passed back to
>> > the after_write callback.
> I don't really understand why a filter is related to the job? This is sometimes useful,
> but not a generic filter infrastructure (maybe someone want to use filters without a job).

See the part you snipped:

Actually this was plan B, as a poor-man implementation of the filter
infrastructure.  Plan A was that the block filters would materialize
suddenly in someone's git tree.

Paolo

Dietmar Maurer Nov. 23, 2012, 9:21 a.m. UTC | #45

> >> Filters would be implemented as BlockDrivers, i.e. you could
> >> implement
> >> .bdrv_co_write() in a filter to intercept all writes to an image.
> >
> > I am quite unsure if that make things easier.
> 
> At least it would make for a much cleaner design compared to putting code
> for every feature you can think of into bdrv_co_do_readv/writev().

So if you want to add a filter, you simply modify bs->drv to point to the filter?

Dietmar Maurer Nov. 23, 2012, 9:28 a.m. UTC | #46

> Actually this was plan B, as a poor-man implementation of the filter
> infrastructure.  Plan A was that the block filters would materialize suddenly in
> someone's git tree.

OK, so let us summarize the options:

a.) wait untit it materialize suddenly in someone's git tree.
b.) add BlockFilter inside BDS
c.) add filter callbacks to block bojs (job_type)
d.) use BlockDriver as filter
e.) use the current BackupInfo unless filters materialize suddenly in someone's git tree.

more ideas?

Dietmar Maurer Nov. 23, 2012, 9:31 a.m. UTC | #47

> > >> Filters would be implemented as BlockDrivers, i.e. you could
> > >> implement
> > >> .bdrv_co_write() in a filter to intercept all writes to an image.
> > >
> > > I am quite unsure if that make things easier.
> >
> > At least it would make for a much cleaner design compared to putting
> > code for every feature you can think of into bdrv_co_do_readv/writev().
> 
> So if you want to add a filter, you simply modify bs->drv to point to the filter?

Seems the BlockDriver struct does not contain any 'state' (I guess that is by design),
so where do you store filter related dynamic data?

Kevin Wolf Nov. 23, 2012, 9:55 a.m. UTC | #48

Am 23.11.2012 10:05, schrieb Dietmar Maurer:
>>>> My plan was to have something like bs->job->job_type-
>>>> {before,after}_write.
>>>>
>>>>    int coroutine_fn (*before_write)(BlockDriverState *bs,
>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>         void **cookie);
>>>>    int coroutine_fn (*after_write)(BlockDriverState *bs,
>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>         void *cookie);
>>>
>>> I don't think that job is the right place. Instead I would put a list
>>> of filters into
>>> BDS:
>>
>> Well, I can also add it to job_type. Just tell me what you prefer, and I will
>> write the patch.

A block filter shouldn't be tied to a job, I think. We have things like
blkdebug that are really filters and aren't coupled with a job, and on
the other hand we want to generalise "block jobs" into just "jobs", so
adding block specific things to job_type would be a step in the wrong
direction.

I also think that before_write/after_write isn't a convenient interface,
it brings back much of the callback-based AIO cruft and passing void*
isn't nice anyway. It's much nice to have a single .bdrv_co_write
callback that somewhere in the middle calls the layer below with a
simple function call.

Also read/write aren't enough, for a full filter interface you
potentially also need flush, discard and probably most other operations.

This is why I suggested using a regular BlockDriver struct for filters,
it already has all functions that are needed.

> BTW, will such filters work with the new virtio-blk-data-plane?

Not initially, but I think as soon as data plane gets support for image
formats, filters would work as well.

Kevin

Kevin Wolf Nov. 23, 2012, 10:29 a.m. UTC | #49

Am 23.11.2012 10:31, schrieb Dietmar Maurer:
>>>>> Filters would be implemented as BlockDrivers, i.e. you could
>>>>> implement
>>>>> .bdrv_co_write() in a filter to intercept all writes to an image.
>>>>
>>>> I am quite unsure if that make things easier.
>>>
>>> At least it would make for a much cleaner design compared to putting
>>> code for every feature you can think of into bdrv_co_do_readv/writev().
>>
>> So if you want to add a filter, you simply modify bs->drv to point to the filter?
> 
> Seems the BlockDriver struct does not contain any 'state' (I guess that is by design),
> so where do you store filter related dynamic data?

You wouldn't change bs->drv of the block device, you still need that one
after having processed the data in the filter.

Instead, you'd have some BlockDriverState *first_filter in bs to which
requests are forwarded. first_filter->file would point to either the
next filter or if there are no more filters to the real BlockDriverState.

Which raises the question of how to distinguish whether it's a new
request to bs that must go through the filters or whether it actually
comes from the last filter in the chain. As you can see, we don't have a
well thought out plan yet, just rough ideas (otherwise it would probably
be implemented already).

Kevin

Markus Armbruster Nov. 23, 2012, 10:55 a.m. UTC | #50

Kevin Wolf <kwolf@redhat.com> writes:

> Am 23.11.2012 10:05, schrieb Dietmar Maurer:
>>>>> My plan was to have something like bs->job->job_type-
>>>>> {before,after}_write.
>>>>>
>>>>>    int coroutine_fn (*before_write)(BlockDriverState *bs,
>>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>>         void **cookie);
>>>>>    int coroutine_fn (*after_write)(BlockDriverState *bs,
>>>>>         int64_t sector_num, int nb_sectors, QEMUIOVector *qiov,
>>>>>         void *cookie);
>>>>
>>>> I don't think that job is the right place. Instead I would put a list
>>>> of filters into
>>>> BDS:
>>>
>>> Well, I can also add it to job_type. Just tell me what you prefer, and I will
>>> write the patch.
>
> A block filter shouldn't be tied to a job, I think. We have things like
> blkdebug that are really filters and aren't coupled with a job, and on
> the other hand we want to generalise "block jobs" into just "jobs", so
> adding block specific things to job_type would be a step in the wrong
> direction.
>
> I also think that before_write/after_write isn't a convenient interface,
> it brings back much of the callback-based AIO cruft and passing void*
> isn't nice anyway. It's much nice to have a single .bdrv_co_write
> callback that somewhere in the middle calls the layer below with a
> simple function call.
>
> Also read/write aren't enough, for a full filter interface you
> potentially also need flush, discard and probably most other operations.
>
> This is why I suggested using a regular BlockDriver struct for filters,
> it already has all functions that are needed.

Let me elaborate a bit.

A block backend is a tree of block driver instances (BlockDriverState).
Common examples:

        raw             qcow2           qcow2
         |              /   \           /   \
        file         file   raw      file  qcow2
                             |             /   \
                            file         file  raw
                                                |
                                               file

A less common example:

        raw
         |
      blkdebug
         |
        file

Here, "blkdebug" acts as a filter, i.e. a block driver that can be put
between two adjacent tree nodes.  It injects errors by selectively
failing some bdrv_aio_readv() and bdrv_aio_writev() operations.

Actually, "raw" could also be viewed as a degenerate filter that does
nothing[*], but such a filter isn't particularly useful.

Except perhaps to serve as base for real filters, that do stuff.  To do
stuff in your filter, you'd replace raw's operations with your own.

Hmm, blkdebug implements much fewer operations than raw.  Makes me
wonder whether it works only in special places in the tree now.

[...]

[*] Except occasionally inject bugs when somebody adds new BlockDriver
operations without updating "raw" to forward them.

Dietmar Maurer Nov. 26, 2012, 5:51 a.m. UTC | #51

> Which raises the question of how to distinguish whether it's a new request to
> bs that must go through the filters or whether it actually comes from the last
> filter in the chain. As you can see, we don't have a well thought out plan yet,
> just rough ideas (otherwise it would probably be implemented already).

The question is if I should modify my backup patch (regarding block filters)?

IMHO, the current implementation is quite simple and easy to maintain. We can easily
convert it if someone comes up with a full featured 'block filter' solution.

Paolo Bonzini Nov. 26, 2012, 12:07 p.m. UTC | #52

Il 26/11/2012 06:51, Dietmar Maurer ha scritto:
>> Which raises the question of how to distinguish whether it's a new request to
>> bs that must go through the filters or whether it actually comes from the last
>> filter in the chain. As you can see, we don't have a well thought out plan yet,
>> just rough ideas (otherwise it would probably be implemented already).
> 
> The question is if I should modify my backup patch (regarding block filters)?

The only solution I came up with is to add before/after hooks in the
block job.  I agree with the criticism, but I think it's general enough
and at the same time easy enough to implement.

> IMHO, the current implementation is quite simple and easy to maintain.

No, "if (bs->backup_info)" simply doesn't belong in bdrv_co_writev.

Paolo

Dietmar Maurer Nov. 27, 2012, 6:20 a.m. UTC | #53

> The only solution I came up with is to add before/after hooks in the block
> job.  I agree with the criticism, but I think it's general enough and at the same
> time easy enough to implement.
> 
> > IMHO, the current implementation is quite simple and easy to maintain.
> 
> No, "if (bs->backup_info)" simply doesn't belong in bdrv_co_writev.

I do not really understand that argument, because the current COPY_ON_READ
implementation also works that way:

    if (bs->copy_on_read) {
        flags |= BDRV_REQ_COPY_ON_READ;
    }
    if (flags & BDRV_REQ_COPY_ON_READ) {
        bs->copy_on_read_in_flight++;
    }

    if (bs->copy_on_read_in_flight) {
        wait_for_overlapping_requests(bs, sector_num, nb_sectors);
    }

    tracked_request_begin(&req, bs, sector_num, nb_sectors, false);

    if (flags & BDRV_REQ_COPY_ON_READ) {
...

Or do you also want to move that to block job hooks?

Dietmar Maurer Nov. 27, 2012, 7:15 a.m. UTC | #54

> > The only solution I came up with is to add before/after hooks in the
> > block job.  I agree with the criticism, but I think it's general
> > enough and at the same time easy enough to implement.
> >
> > > IMHO, the current implementation is quite simple and easy to maintain.
> >
> > No, "if (bs->backup_info)" simply doesn't belong in bdrv_co_writev.
> 
> I do not really understand that argument, because the current
> COPY_ON_READ implementation also works that way:
> 
>     if (bs->copy_on_read) {
>         flags |= BDRV_REQ_COPY_ON_READ;
>     }
>     if (flags & BDRV_REQ_COPY_ON_READ) {
>         bs->copy_on_read_in_flight++;
>     }
> 
>     if (bs->copy_on_read_in_flight) {
>         wait_for_overlapping_requests(bs, sector_num, nb_sectors);
>     }
> 
>     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
> 
>     if (flags & BDRV_REQ_COPY_ON_READ) { ...
> 
> Or do you also want to move that to block job hooks?

Just tried to move that code, but copy on read feature is unrelated to block jobs,
i.e. one can open a bdrv with BDRV_O_COPY_ON_READ, and that does not create
a job.

I already suggested to add those hooks to BDS instead - don't you think that would work?

Kevin Wolf Nov. 27, 2012, 8:48 a.m. UTC | #55

Am 27.11.2012 08:15, schrieb Dietmar Maurer:
>>> The only solution I came up with is to add before/after hooks in the
>>> block job.  I agree with the criticism, but I think it's general
>>> enough and at the same time easy enough to implement.
>>>
>>>> IMHO, the current implementation is quite simple and easy to maintain.
>>>
>>> No, "if (bs->backup_info)" simply doesn't belong in bdrv_co_writev.
>>
>> I do not really understand that argument, because the current
>> COPY_ON_READ implementation also works that way:
>>
>>     if (bs->copy_on_read) {
>>         flags |= BDRV_REQ_COPY_ON_READ;
>>     }
>>     if (flags & BDRV_REQ_COPY_ON_READ) {
>>         bs->copy_on_read_in_flight++;
>>     }
>>
>>     if (bs->copy_on_read_in_flight) {
>>         wait_for_overlapping_requests(bs, sector_num, nb_sectors);
>>     }
>>
>>     tracked_request_begin(&req, bs, sector_num, nb_sectors, false);
>>
>>     if (flags & BDRV_REQ_COPY_ON_READ) { ...
>>
>> Or do you also want to move that to block job hooks?
> 
> Just tried to move that code, but copy on read feature is unrelated to block jobs,
> i.e. one can open a bdrv with BDRV_O_COPY_ON_READ, and that does not create
> a job.
> 
> I already suggested to add those hooks to BDS instead - don't you think that would work?

To which BDS? If it is the BDS that is being backed up, the problem is
that you could only have one implementation per BDS, i.e. you couldn't
use backup and copy on read or I/O throttling or whatever at the same time.

Kevin

Wayne Xia Nov. 27, 2012, 10:09 a.m. UTC | #56

于 2012-11-21 17:01, Dietmar Maurer 写道:
> This series provides a way to efficiently backup VMs.
> 
> * Backup to a single archive file
> * Backup contain all data to restore VM (full backup)
> * Do not depend on storage type or image format
> * Avoid use of temporary storage
> * store sparse images efficiently
> 
> The file docs/backup-rfc.txt contains more details.
> 
> Signed-off-by: Dietmar Maurer <dietmar@proxmox.com>
> ---
>   docs/backup-rfc.txt |  119 +++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 files changed, 119 insertions(+), 0 deletions(-)
>   create mode 100644 docs/backup-rfc.txt
> 
> diff --git a/docs/backup-rfc.txt b/docs/backup-rfc.txt
> new file mode 100644
> index 0000000..5b4b3df
> --- /dev/null
> +++ b/docs/backup-rfc.txt
> @@ -0,0 +1,119 @@
> +RFC: Efficient VM backup for qemu
> +
> +=Requirements=
> +
> +* Backup to a single archive file
> +* Backup needs to contain all data to restore VM (full backup)
> +* Do not depend on storage type or image format
> +* Avoid use of temporary storage
> +* store sparse images efficiently
> +
> +=Introduction=
> +
> +Most VM backup solutions use some kind of snapshot to get a consistent
> +VM view at a specific point in time. For example, we previously used
> +LVM to create a snapshot of all used VM images, which are then copied
> +into a tar file.
> +
> +That basically means that any data written during backup involve
> +considerable overhead. For LVM we get the following steps:
> +
> +1.) read original data (VM write)
> +2.) write original data into snapshot (VM write)
> +3.) write new data (VM write)
> +4.) read data from snapshot (backup)
> +5.) write data from snapshot into tar file (backup)
> +
> +Another approach to backup VM images is to create a new qcow2 image
> +which use the old image as base. During backup, writes are redirected
> +to the new image, so the old image represents a 'snapshot'. After
> +backup, data need to be copied back from new image into the old
> +one (commit). So a simple write during backup triggers the following
> +steps:
> +
> +1.) write new data to new image (VM write)
> +2.) read data from old image (backup)
> +3.) write data from old image into tar file (backup)
> +
> +4.) read data from new image (commit)
> +5.) write data to old image (commit)
> +
> +This is in fact the same overhead as before. Other tools like qemu
> +livebackup produces similar overhead (2 reads, 3 writes).
> +
> +Some storage types/formats supports internal snapshots using some kind
> +of reference counting (rados, sheepdog, dm-thin, qcow2). It would be possible
> +to use that for backups, but for now we want to be storage-independent.
> +
> +Note: It turned out that taking a qcow2 snapshot can take a very long
> +time on larger files.
> +
> +=Make it more efficient=
> +
> +The be more efficient, we simply need to avoid unnecessary steps. The
> +following steps are always required:
> +
> +1.) read old data before it gets overwritten
> +2.) write that data into the backup archive
> +3.) write new data (VM write)
> +
> +As you can see, this involves only one read, an two writes.
> +
> +To make that work, our backup archive need to be able to store image
> +data 'out of order'. It is important to notice that this will not work
> +with traditional archive formats like tar.
> +
> +During backup we simply intercept writes, then read existing data and
> +store that directly into the archive. After that we can continue the
> +write.
> +
> +==Advantages==
> +
> +* very good performance (1 read, 2 writes)
> +* works on any storage type and image format.
> +* avoid usage of temporary storage
> +* we can define a new and simple archive format, which is able to
> +  store sparse files efficiently.
> +
> +Note: Storing sparse files is a mess with existing archive
> +formats. For example, tar requires information about holes at the
> +beginning of the archive.
> +
> +==Disadvantages==
> +
> +* we need to define a new archive format
> +
> +Note: Most existing archive formats are optimized to store small files
> +including file attributes. We simply do not need that for VM archives.
> +
> +* archive contains data 'out of order'
> +
> +If you want to access image data in sequential order, you need to
> +re-order archive data. It would be possible to to that on the fly,
> +using temporary files.
> +
> +Fortunately, a normal restore/extract works perfectly with 'out of
> +order' data, because the target files are seekable.
> +
> +* slow backup storage can slow down VM during backup
> +
> +It is important to note that we only do sequential writes to the
> +backup storage. Furthermore one can compress the backup stream. IMHO,
> +it is better to slow down the VM a bit. All other solutions creates
> +large amounts of temporary data during backup.
> +
> +=Archive format requirements=
> +
> +The basic requirement for such new format is that we can store image
> +date 'out of order'. It is also very likely that we have less than 256
> +drives/images per VM, and we want to be able to store VM configuration
> +files.
> +
> +We have defined a very simply format with those properties, see:
> +
> +docs/specs/vma_spec.txt
> +
> +Please let us know if you know an existing format which provides the
> +same functionality.
> +
> +
> 
 Just want to confirm something to understand it better:
you are backing up the block image not including VM memory
state right?  I am considering a way to do live Savevm including
memory and device state, so wonder if you already had a solution
for it.

Dietmar Maurer Nov. 27, 2012, 10:37 a.m. UTC | #57

>  Just want to confirm something to understand it better:
> you are backing up the block image not including VM memory state right?  I
> am considering a way to do live Savevm including memory and device state,
> so wonder if you already had a solution for it.

Yes, I have already code for that.

Wayne Xia Nov. 28, 2012, 9:39 a.m. UTC | #58

于 2012-11-27 18:37, Dietmar Maurer 写道:
>>   Just want to confirm something to understand it better:
>> you are backing up the block image not including VM memory state right?  I
>> am considering a way to do live Savevm including memory and device state,
>> so wonder if you already had a solution for it.
>
> Yes, I have already code for that.
>
>

   Does those code for VM memory and device state lively save/restore
included in this patch serials? I quickly reviewed the patches but
did not found a hook to save VM memory state? Hope you can enlight
me your way, my thoughts is do live migration into qcow2 file,
but your code seems not touched qcow2 images.

Dietmar Maurer Nov. 28, 2012, 11:08 a.m. UTC | #59

>    Does those code for VM memory and device state lively save/restore

> included in this patch serials? I quickly reviewed the patches but did not

> found a hook to save VM memory state? Hope you can enlight me your way,

> my thoughts is do live migration into qcow2 file, but your code seems not

> touched qcow2 images.


I will post that code in a few days.

[1/5] RFC: Efficient VM backup for qemu (v1)

Commit Message

Comments

Patch