diff mbox

block: allow write-threshold on device name

Message ID 1433641131-24123-1-git-send-email-eblake@redhat.com
State New
Headers show

Commit Message

Eric Blake June 7, 2015, 1:38 a.m. UTC
Commit e2462113 allowed the ability to fire an event if a BDS
node exceeds a threshold during a write, but limited the option
to only work on node names.  For convenience, expand this to
allow a device name as a way to set the threshold on the BDS
at the active layer of the device.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/write-threshold.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Jianjun Kong June 7, 2015, 8:53 a.m. UTC | #1
On Sun, Jun 7, 2015 at 9:38 AM, Eric Blake <eblake@redhat.com> wrote:
>
> Commit e2462113 allowed the ability to fire an event if a BDS
> node exceeds a threshold during a write, but limited the option
> to only work on node names.  For convenience, expand this to
> allow a device name as a way to set the threshold on the BDS
> at the active layer of the device.
>
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/write-threshold.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/block/write-threshold.c b/block/write-threshold.c
> index a53c1f5..e3df419 100644
> --- a/block/write-threshold.c
> +++ b/block/write-threshold.c
> @@ -1,7 +1,7 @@
>  /*
>   * QEMU System Emulator block write threshold notification
>   *
> - * Copyright Red Hat, Inc. 2014
> + * Copyright Red Hat, Inc. 2014, 2015
>   *
>   * Authors:
>   *  Francesco Romani <fromani@redhat.com>
> @@ -110,9 +110,8 @@ void qmp_block_set_write_threshold(const char *node_name,
>      BlockDriverState *bs;
>      AioContext *aio_context;
>
> -    bs = bdrv_find_node(node_name);
> +    bs = bdrv_lookup_bs(node_name, node_name, errp);

It means we can pass device name by 'node_name' parameter?
do we need to update command doc in qapi/block-core.json?


>      if (!bs) {
> -        error_setg(errp, "Device '%s' not found", node_name);
>          return;
>      }
>
> --
> 2.4.2
>
>
Eric Blake June 8, 2015, 2:35 p.m. UTC | #2
On 06/07/2015 02:53 AM, Amos Kong wrote:
> On Sun, Jun 7, 2015 at 9:38 AM, Eric Blake <eblake@redhat.com> wrote:
>>
>> Commit e2462113 allowed the ability to fire an event if a BDS
>> node exceeds a threshold during a write, but limited the option
>> to only work on node names.  For convenience, expand this to
>> allow a device name as a way to set the threshold on the BDS
>> at the active layer of the device.
>>
>> Signed-off-by: Eric Blake <eblake@redhat.com>
>> ---
>>  block/write-threshold.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>

>> -    bs = bdrv_find_node(node_name);
>> +    bs = bdrv_lookup_bs(node_name, node_name, errp);
> 
> It means we can pass device name by 'node_name' parameter?

Yes. The two namespaces cannot overlap, so it is unambiguous that a
device name means the active node attached to the device (we use it in a
number of other commands).

> do we need to update command doc in qapi/block-core.json?

Good call; existing docs state:

# @block-set-write-threshold
...
# @node-name: graph node name on which the threshold must be set.
...
{ 'command': 'block-set-write-threshold',
  'data': { 'node-name': 'str', 'write-threshold': 'uint64' } }

so I'll prepare a v2 that tweaks it.
Eric Blake June 9, 2015, 10:35 p.m. UTC | #3
On 06/06/2015 07:38 PM, Eric Blake wrote:
> Commit e2462113 allowed the ability to fire an event if a BDS
> node exceeds a threshold during a write, but limited the option
> to only work on node names.  For convenience, expand this to
> allow a device name as a way to set the threshold on the BDS
> at the active layer of the device.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> ---
>  block/write-threshold.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 

> @@ -110,9 +110,8 @@ void qmp_block_set_write_threshold(const char *node_name,
>      BlockDriverState *bs;
>      AioContext *aio_context;
> 
> -    bs = bdrv_find_node(node_name);
> +    bs = bdrv_lookup_bs(node_name, node_name, errp);

Hmm, I'm not quite sure this does what I want.  If I understand
correctly, when you open a qcow2 image, you get the following
query-block layout (abbreviated):

            "inserted": {
...
                "image": {
...
                    "backing-filename-format": "qcow2",
                    "virtual-size": 12884901888,
                    "filename": "/dev/sda6",
                    "cluster-size": 65536,
                    "format": "qcow2",
                    "actual-size": 0,
...
                "drv": "qcow2",
                "iops": 0,
                "bps_wr": 0,
                "write_threshold": 0,
...
                "file": "/dev/sda6",


That is, the only write_threshold reported is that of the active layer
BDS (write_threshold of other BDS is reported through
query-named-block-nodes, but only if those BDS have a name), but
query-block is not listing any secondary BDS details.

Meanwhile, here is the corresponding query-blockstats layout:

            "device": "drive-virtio-disk0",
            "parent": {
                "stats": {
                    "flush_total_time_ns": 0,
                    "wr_highest_offset": 72482304,
...
            "stats": {
                "flush_total_time_ns": 728455560,
                "wr_highest_offset": 9129332224,

which DOES show the BDS chain; in particular, each qcow2 file has two
BDS (one for the protocol, and the other ('parent') for the actual file).

The statistic I'm interested in is the allocation of the block device
(the host offset, aka wr_highest_offset 72482304 above), and NOT the
usage pattern of the guest (the qcow2 protocol, wr_highest_offset
9129332224).  But bdrv_lookup_bs() finds the qcow2 protocol layer,
rather than the intended backing file layer; likewise, query-block is
only reporting write_threshold for the protocol layer.

I'm wondering if, when a device name is given rather than a node name,
it is safe to blindly follow the active layer down to its lowest member
(or error out if there are more than one lower members, as in quorum),
as that is the statistic that libvirt and upper layers really want ("am
I about to exceed the allocation of my underlying storage?").  Likewise,
on reporting, it is more useful to know the threshold of the backing
layer if the qcow2 protocol layer does not have a threshold.  I'm
playing with that idea before submitting a v2.
Kevin Wolf June 10, 2015, 7:57 a.m. UTC | #4
Am 10.06.2015 um 00:35 hat Eric Blake geschrieben:
> On 06/06/2015 07:38 PM, Eric Blake wrote:
> > Commit e2462113 allowed the ability to fire an event if a BDS
> > node exceeds a threshold during a write, but limited the option
> > to only work on node names.  For convenience, expand this to
> > allow a device name as a way to set the threshold on the BDS
> > at the active layer of the device.
> > 
> > Signed-off-by: Eric Blake <eblake@redhat.com>
> > ---
> >  block/write-threshold.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> 
> > @@ -110,9 +110,8 @@ void qmp_block_set_write_threshold(const char *node_name,
> >      BlockDriverState *bs;
> >      AioContext *aio_context;
> > 
> > -    bs = bdrv_find_node(node_name);
> > +    bs = bdrv_lookup_bs(node_name, node_name, errp);
> 
> Hmm, I'm not quite sure this does what I want.  If I understand
> correctly, when you open a qcow2 image, you get the following
> query-block layout (abbreviated):
> 
>             "inserted": {
> ...
>                 "image": {
> ...
>                     "backing-filename-format": "qcow2",
>                     "virtual-size": 12884901888,
>                     "filename": "/dev/sda6",
>                     "cluster-size": 65536,
>                     "format": "qcow2",
>                     "actual-size": 0,
> ...
>                 "drv": "qcow2",
>                 "iops": 0,
>                 "bps_wr": 0,
>                 "write_threshold": 0,
> ...
>                 "file": "/dev/sda6",
> 
> 
> That is, the only write_threshold reported is that of the active layer
> BDS (write_threshold of other BDS is reported through
> query-named-block-nodes, but only if those BDS have a name), but
> query-block is not listing any secondary BDS details.
> 
> Meanwhile, here is the corresponding query-blockstats layout:
> 
>             "device": "drive-virtio-disk0",
>             "parent": {
>                 "stats": {
>                     "flush_total_time_ns": 0,
>                     "wr_highest_offset": 72482304,
> ...
>             "stats": {
>                 "flush_total_time_ns": 728455560,
>                 "wr_highest_offset": 9129332224,
> 
> which DOES show the BDS chain; in particular, each qcow2 file has two
> BDS (one for the protocol, and the other ('parent') for the actual file).
> 
> The statistic I'm interested in is the allocation of the block device
> (the host offset, aka wr_highest_offset 72482304 above), and NOT the
> usage pattern of the guest (the qcow2 protocol, wr_highest_offset
> 9129332224).  But bdrv_lookup_bs() finds the qcow2 protocol layer,
> rather than the intended backing file layer; likewise, query-block is
> only reporting write_threshold for the protocol layer.
> 
> I'm wondering if, when a device name is given rather than a node name,
> it is safe to blindly follow the active layer down to its lowest member
> (or error out if there are more than one lower members, as in quorum),
> as that is the statistic that libvirt and upper layers really want ("am
> I about to exceed the allocation of my underlying storage?").  Likewise,
> on reporting, it is more useful to know the threshold of the backing
> layer if the qcow2 protocol layer does not have a threshold.  I'm
> playing with that idea before submitting a v2.

That is indeed what you need in your specific use case. However, qemu
shouldn't try to guess what management tools really want. It should
provide a clean interface that allows management tools to express
themselves what they want.

The cleanest interface that I can think of is that you access exactly
the node whose name you specified. If we do any magic like going down
the chain (which chain? What do you do with things like quorum in the
path?), we make the interface inconsistent and if anyone really wants to
know the highest offset that the guest accessed on its virtual disk, it
wouldn't even be possible any more because we said that that's not what
a management tool is interested in.

Let's stay away from such magic, as much as we can. libvirt can just
specify a node-name for the protocol layer and use that.

Kevin
Eric Blake June 10, 2015, 1:07 p.m. UTC | #5
On 06/10/2015 01:57 AM, Kevin Wolf wrote:

>>
>> The statistic I'm interested in is the allocation of the block device
>> (the host offset, aka wr_highest_offset 72482304 above), and NOT the
>> usage pattern of the guest (the qcow2 protocol, wr_highest_offset
>> 9129332224).  But bdrv_lookup_bs() finds the qcow2 protocol layer,
>> rather than the intended backing file layer; likewise, query-block is
>> only reporting write_threshold for the protocol layer.
>>
>> I'm wondering if, when a device name is given rather than a node name,
>> it is safe to blindly follow the active layer down to its lowest member
>> (or error out if there are more than one lower members, as in quorum),
>> as that is the statistic that libvirt and upper layers really want ("am
>> I about to exceed the allocation of my underlying storage?").  Likewise,
>> on reporting, it is more useful to know the threshold of the backing
>> layer if the qcow2 protocol layer does not have a threshold.  I'm
>> playing with that idea before submitting a v2.
> 
> That is indeed what you need in your specific use case. However, qemu
> shouldn't try to guess what management tools really want. It should
> provide a clean interface that allows management tools to express
> themselves what they want.

Well, I think that means I need to bite the bullet and teach libvirt to
use node names before it can take advantage of this feature; at which
point this idea of allowing a threshold on device name is no longer
important.

> 
> The cleanest interface that I can think of is that you access exactly
> the node whose name you specified. If we do any magic like going down
> the chain (which chain? What do you do with things like quorum in the
> path?), we make the interface inconsistent and if anyone really wants to
> know the highest offset that the guest accessed on its virtual disk, it
> wouldn't even be possible any more because we said that that's not what
> a management tool is interested in.

My problem here is that libvirt tracks only a single <disk>, but that
disk has two potential node names that need tracking (both the qcow2
protocol, and the underlying file).  Furthermore, operations like
snapshot creation, drive-mirror, and active block commit can change what
the active layer is, and thus need another node name.

It would really make life easier if qemu could auto-assign node names
(so that EVERY node has a name without libvirt having to invent two
names per qcow2 file), and then give libvirt an easy way to query the
node names in use (query-block should make it obvious what the full
node-name tree is, so that libvirt can then pick out the node name it is
interested in).

> 
> Let's stay away from such magic, as much as we can. libvirt can just
> specify a node-name for the protocol layer and use that.

Okay, I'll probably abandon this patch, then, but still work on
something to make node names easier for libvirt to integrate with.
Kevin Wolf June 10, 2015, 1:43 p.m. UTC | #6
Am 10.06.2015 um 15:07 hat Eric Blake geschrieben:
> On 06/10/2015 01:57 AM, Kevin Wolf wrote:
> 
> >>
> >> The statistic I'm interested in is the allocation of the block device
> >> (the host offset, aka wr_highest_offset 72482304 above), and NOT the
> >> usage pattern of the guest (the qcow2 protocol, wr_highest_offset
> >> 9129332224).  But bdrv_lookup_bs() finds the qcow2 protocol layer,
> >> rather than the intended backing file layer; likewise, query-block is
> >> only reporting write_threshold for the protocol layer.
> >>
> >> I'm wondering if, when a device name is given rather than a node name,
> >> it is safe to blindly follow the active layer down to its lowest member
> >> (or error out if there are more than one lower members, as in quorum),
> >> as that is the statistic that libvirt and upper layers really want ("am
> >> I about to exceed the allocation of my underlying storage?").  Likewise,
> >> on reporting, it is more useful to know the threshold of the backing
> >> layer if the qcow2 protocol layer does not have a threshold.  I'm
> >> playing with that idea before submitting a v2.
> > 
> > That is indeed what you need in your specific use case. However, qemu
> > shouldn't try to guess what management tools really want. It should
> > provide a clean interface that allows management tools to express
> > themselves what they want.
> 
> Well, I think that means I need to bite the bullet and teach libvirt to
> use node names before it can take advantage of this feature; at which
> point this idea of allowing a threshold on device name is no longer
> important.

Yes, I think libvirt needs to learn about node-names for this. And I'm
sure that very soon you'll find more uses for them anyway, so I prefer
doing the right thing now instead of adding a short-term hack for each
individual case in order to avoid the need for node-names.

> > The cleanest interface that I can think of is that you access exactly
> > the node whose name you specified. If we do any magic like going down
> > the chain (which chain? What do you do with things like quorum in the
> > path?), we make the interface inconsistent and if anyone really wants to
> > know the highest offset that the guest accessed on its virtual disk, it
> > wouldn't even be possible any more because we said that that's not what
> > a management tool is interested in.
> 
> My problem here is that libvirt tracks only a single <disk>, but that
> disk has two potential node names that need tracking (both the qcow2
> protocol, and the underlying file).

I'm afraid this sound as if you hadn't fully understood the consequences
of the blockdev work yet (even though I'm sure you do know them in
theory). Make it s/two/n/ and we'll talk about it.

Even today you can put filters like blkdebug/blkverify between the
format and the protocol to have more than two layers in the bs->file
chain. And you can use VMDK or Quorum to have many children instead of
just bs->file and bs->backing_hd.

These are setups that libvirt will increasingly need to understand.

> Furthermore, operations like
> snapshot creation, drive-mirror, and active block commit can change what
> the active layer is, and thus need another node name.
> 
> It would really make life easier if qemu could auto-assign node names
> (so that EVERY node has a name without libvirt having to invent two
> names per qcow2 file), and then give libvirt an easy way to query the
> node names in use (query-block should make it obvious what the full
> node-name tree is, so that libvirt can then pick out the node name it is
> interested in).

If you think we should pick up Jeff's patches for autogeneration of node
names again, I'm in favour of that. I think a few more reasons came up
recently why this would be useful.

> > Let's stay away from such magic, as much as we can. libvirt can just
> > specify a node-name for the protocol layer and use that.
> 
> Okay, I'll probably abandon this patch, then, but still work on
> something to make node names easier for libvirt to integrate with.

Hm, okay. I would find it nice to accept device names everywhere where a
node name is expected (just for aesthetic reasons), but I see that it
would probably remain unused, so abandoning the patch is okay with me.

Kevin
diff mbox

Patch

diff --git a/block/write-threshold.c b/block/write-threshold.c
index a53c1f5..e3df419 100644
--- a/block/write-threshold.c
+++ b/block/write-threshold.c
@@ -1,7 +1,7 @@ 
 /*
  * QEMU System Emulator block write threshold notification
  *
- * Copyright Red Hat, Inc. 2014
+ * Copyright Red Hat, Inc. 2014, 2015
  *
  * Authors:
  *  Francesco Romani <fromani@redhat.com>
@@ -110,9 +110,8 @@  void qmp_block_set_write_threshold(const char *node_name,
     BlockDriverState *bs;
     AioContext *aio_context;

-    bs = bdrv_find_node(node_name);
+    bs = bdrv_lookup_bs(node_name, node_name, errp);
     if (!bs) {
-        error_setg(errp, "Device '%s' not found", node_name);
         return;
     }