[2/2] NBD proto: add GET_LBA_STATUS extension

Message ID	1458742562-30624-3-git-send-email-den@openvz.org
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: "Denis V. Lunev" <den@openvz.org> To: nbd-general@lists.sourceforge.net, qemu-devel@nongnu.org Date: Wed, 23 Mar 2016 17:16:02 +0300 Message-Id: <1458742562-30624-3-git-send-email-den@openvz.org> In-Reply-To: <1458742562-30624-1-git-send-email-den@openvz.org> References: <1458742562-30624-1-git-send-email-den@openvz.org> Cc: Kevin Wolf <kwolf@redhat.com>, Pavel Borzenkov <pborzenkov@virtuozzo.com>, Stefan Hajnoczi <stefanha@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Wouter Verhelst <w@uter.be>, den@openvz.org Subject: [Qemu-devel] [PATCH 2/2] NBD proto: add GET_LBA_STATUS extension Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Denis V. Lunev March 23, 2016, 2:16 p.m. UTC

From: Pavel Borzenkov <pborzenkov@virtuozzo.com>

With the availability of sparse storage formats, it is often needed to
query status of a particular LBA range and read only those blocks of
data that are actually present on the block device.

To provide such information, the patch adds GET_LBA_STATUS extension
with one new NBD_CMD_GET_LBA_STATUS command.

There exists a concept of data dirtiness, which is required during, for
example, incremental block device backup. To express this concept via
NBD protocol, this patch also adds additional mode of operation to
NBD_CMD_GET_LBA_STATUS command.

Since NBD protocol has no notion of block size, and to mimic SCSI "GET
LBA STATUS" command more closely, it has been chosen to return a list of
extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
bitmap.

Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Wouter Verhelst <w@uter.be>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)

Eric Blake March 23, 2016, 4:27 p.m. UTC | #1

On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.

The acronym LBA is not used elsewhere in the NBD spec; should we spell
it out at least once?

> 
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
> 
> There exists a concept of data dirtiness, which is required during, for
> example, incremental block device backup. To express this concept via
> NBD protocol, this patch also adds additional mode of operation to
> NBD_CMD_GET_LBA_STATUS command.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> LBA STATUS" command more closely, it has been chosen to return a list of
> extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> bitmap.
> 
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 82 insertions(+)
> 

>  
> +### `GET_LBA_STATUS` extension
> +
> +With the availability of sparse storage formats, it is often needed to query
> +status of a particular LBA range and read only those blocks of data that are
> +actually present on the block device.
> +
> +Some storage formats and operations over such formats express a concept of
> +data dirtiness. Whether the operation is block device mirroring,
> +incremental block device backup or any other operation with a concept of
> +data dirtiness, they all share a need to provide a list of LBA ranges
> +that this particular operation treats as dirty.
> +
> +To provide such class of information, `GET_LBA_STATUS` extension adds new
> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> +their respective states.
> +
> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    An LBA range status query request. Length and offset define the range
> +    of interest. The server MUST reply with a reply header, followed
> +    immediately by the following data:
> +
> +      - 32 bits, length of parameter data that follow (unsigned)

Is this length the number of descriptors, or the number of bytes
occupied by those descriptors?  It looks like bytes (that is, with the
current layout, this field should be a multiple of 14 unless an error is
returned and the data is bogus).

I guess 32 bits is sufficient: transmission commands are limited to
32-bit length, and we are unlikely to have more than one extent per 512
bytes of length, so even if we have a header for every single sector
(worst-case for alternating clean/dirty sectors), as long as the
smallest granularity of an extent is larger than the extent field, the
'length of parameter data' in bytes is still sufficient.

> +      - zero or more LBA status descriptors, each having the following

zero or more? [1]

> +        structure:
> +
> +        * 64 bits, offset (unsigned)
> +        * 32 bits, length (unsigned)
> +        * 16 bits, status (unsigned)

An array of these status descriptors is packed on the wire, while the
typical C layout of an array of these structures will have padding to
reach a nice 8-byte alignment.  Should 'status' be a 32-bit field, so
that clients and servers do not have to pack/unpack between 14 bytes on
the wire and 16 bytes in efficient array handling, at the expense of
more network traffic?

Conversely, it would be possible to send less data over the wire, as
long as we require that all LBA status descriptors cover consecutive
offsets.  That is, having the server reply with offsets is pointless,
since they can be reconstructed on the client by starting with the
offset in the client's request, then adding length from each status
field.  Is less network traffic desirable?

> +
> +    unless an error condition has occurred.
> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MUST then either close the
> +    connection, or send *length of parameter data* bytes of data
> +    (which MAY be invalid).
> +
> +    The type of information required by the client is passed to server in the
> +    command flags field. If the server does not implement requested type or
> +    have no means to express it, it MUST NOT return an error, but instead MUST
> +    return a single LBA status descriptor with *offset* and *length* equal to
> +    the *offset* and *length* from request, and *status* set to `0`.

[1] So in what situations will we ever return an array of zero status
fields? On an error?  Should we make it clear that the server MUST NOT
return 0 status fields except on an error?

Do we want to require that the server MUST reply with enough extents to
sum up to the length of the client's request, or are we permitting a
short reply?

> +
> +    The following request types are currently defined for the command:
> +
> +    1. Block provisioning state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return

Here, you spell it '0x0'; in the previous patch, you spelled the command
flag as 'bit 1' - does that mean that Block provisioning state is the
default when no command flags are sent?  What if we later add other
flags but still want block provisioning mode?  Wouldn't it be better to
state that this mode is entered when the NBD_FLAG_GET_DIRTY flag is
clear, without regards to the state of the other flags, than allowing
this mode only when all 16 flags are zero?

For example, should we allow a flag that states that the client is
interested only in allocated/unallocated, and that the server may
coalesce NBD_STATE_ZEROED extents as if they were NBD_STATE_ALLOCATED
for fewer extents reported and thus potentially less network traffic?

> +    the provisioning state of the device. The following provisionnig states

s/provisionnig/provisioning/

> +    are defined for the command:
> +
> +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> +        and contains zeroes;
> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> +        block device. A client MUST NOT make any assumptions about the
> +        contents of the extent.

Can NBD_STATE_ALLOCATED and NBD_STATE_DEALLOCATED both be set at the
same time, or is that an error on the server?  What do we know about an
extent that has neither NBD_STATE_ALLOCATED nor NBD_STATE_DEALLOCATED set?

/me re-reads

Oh, you are using this as the _entire_ 16-bit status value, rather than
as bits 0, 1, and 2 as flags.

But I think you have two binary flags (four possible states) here: it is
quite conceivable to have a server on top of a SCSI device, where we
know that the extent is unallocated in SCSI, but where the server will
guarantee that it reads as all zeroes (possibly because the server
bypasses SCSI on the NBD read commands when it knows SCSI is
unallocated).  That is, if we set this up as two flags:

0x1 - allocated
0x2 - reads as 0

then we can express four states:

0x0 - LBA extent not present, client MUST NOT make assumptions about
contents, and reads should not be attempted
0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
0x2 - LBA extent not present, but client can treat the extent as zeroes
and reads will succeed
0x3 - LBA extent present, client can treat the extent as zeroes and
reads will succeed

Actually, we should probably pick the bit values such that 0x0 means
allocated and readable, as the most common state, since we also require
that the command returns a single extent over the entire length with
status 0 if the server can't provide any further details.

I'm not familiar enough with the SCSI "GET LBA STATUS" command to know
if your command sanely matches to that one.

> +
> +    2. Block dirtiness state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return

This overlaps with the bit value for NBD_FLAG_SEND_FUA in the previous
patch.  Is that okay?  Or should we use a different bit value, on the
grounds that some future extension may want to use both flags
orthogonally within the same (possibly new) command?  Again, consistency
in the spelling ('bit 1' in the previous patch, '0x1' here), would be nice.

> +    the dirtiness status of the device. The following dirtiness states
> +    are defined for the command:
> +
> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.

Again, it looks like you are using these as two entire 16-bit status
values, rather than as two separate bits (1<<0 and 1<<1).  Another way
of expressing it is that a single boolean flag is defined, if clear, the
extent is dirty, if set, the extent is clean.

> +
> +    Generic NBD client implementation without knowledge of a particular NBD
> +    server operation MUST NOT make any assumption on the meaning of the
> +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> +
> +The server SHOULD return `EINVAL` if it receives a `GET_LBA_STATUS` request
> +including one or more sectors beyond the size of the device.

As mentioned in the previous mail, should we also recommend an EINVAL if
NBD_CMD_GET_LBA_STATUS was not negotiated in the export options but the
client sends the command anyways; and/or a requirement that the client
must not issue the command in that case?

> +
>  ## About this file
>  
>  This file tries to document the NBD protocol as it is currently
>

Wouter Verhelst March 23, 2016, 5:58 p.m. UTC | #2

On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
> 
> There exists a concept of data dirtiness, which is required during, for
> example, incremental block device backup. To express this concept via
> NBD protocol, this patch also adds additional mode of operation to
> NBD_CMD_GET_LBA_STATUS command.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> LBA STATUS" command more closely, it has been chosen to return a list of
> extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> bitmap.
> 
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 82 insertions(+)
> 
> diff --git a/doc/proto.md b/doc/proto.md
> index cda213c..fff515d 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle negotiation:
>    `NBD_CMD_TRIM` commands
>  - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
>    supports `NBD_CMD_WRITE_ZEROES` commands
> +- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server
> +  supports `NBD_CMD_GET_LBA_STATUS` commands
>  
>  ##### Client flags
>  
> @@ -477,6 +479,10 @@ The following request types exist:
>  
>      Defined by the experimental `WRITE_ZEROES` extension; see below.
>  
> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    Defined by the experimental `GET_LBA_STATUS` extension; see below.
> +
>  * Other requests
>  
>      Some third-party implementations may require additional protocol
> @@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a write zeroes request
>  including one or more sectors beyond the size of the device. It SHOULD
>  return `EPERM` if it receives a write zeroes request on a read-only export.
>  
> +### `GET_LBA_STATUS` extension
> +
> +With the availability of sparse storage formats, it is often needed to query
> +status of a particular LBA range and read only those blocks of data that are
> +actually present on the block device.
> +
> +Some storage formats and operations over such formats express a concept of
> +data dirtiness. Whether the operation is block device mirroring,
> +incremental block device backup or any other operation with a concept of
> +data dirtiness, they all share a need to provide a list of LBA ranges
> +that this particular operation treats as dirty.
> +
> +To provide such class of information, `GET_LBA_STATUS` extension adds new
> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> +their respective states.
> +
> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    An LBA range status query request. Length and offset define the range
> +    of interest. The server MUST reply with a reply header, followed
> +    immediately by the following data:

As Eric noted, please expand LBA at least once.

> +      - 32 bits, length of parameter data that follow (unsigned)
> +      - zero or more LBA status descriptors, each having the following
> +        structure:
> +
> +        * 64 bits, offset (unsigned)
> +        * 32 bits, length (unsigned)
> +        * 16 bits, status (unsigned)
> +
> +    unless an error condition has occurred.
> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MUST then either close the
> +    connection, or send *length of parameter data* bytes of data
> +    (which MAY be invalid).
> +
> +    The type of information required by the client is passed to server in the
> +    command flags field. If the server does not implement requested type or
> +    have no means to express it, it MUST NOT return an error, but instead MUST
> +    return a single LBA status descriptor with *offset* and *length* equal to
> +    the *offset* and *length* from request, and *status* set to `0`.
> +
> +    The following request types are currently defined for the command:
> +
> +    1. Block provisioning state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return

I prefer to have a non-zero flag value.

> +    the provisioning state of the device. The following provisionnig states
> +    are defined for the command:
> +
> +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> +        and contains zeroes;

Presumably this should be "contains only zeroes"?

Also, this may end up being a fairly expensive call for the server to
process. Is it really useful?

> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> +        block device. A client MUST NOT make any assumptions about the
> +        contents of the extent.
> +
> +    2. Block dirtiness state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> +    the dirtiness status of the device. The following dirtiness states
> +    are defined for the command:
> +
> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> +
> +    Generic NBD client implementation without knowledge of a particular NBD
> +    server operation MUST NOT make any assumption on the meaning of the
> +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.

That makes it a useless call. A server can read /dev/random to decide
whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with
this spec.

Either the spec should define what it means for a block to be in a dirty
state, or it should not talk about it.

Kevin Wolf March 23, 2016, 6:14 p.m. UTC | #3

Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben:
> On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > +    The type of information required by the client is passed to server in the
> > +    command flags field. If the server does not implement requested type or
> > +    have no means to express it, it MUST NOT return an error, but instead MUST
> > +    return a single LBA status descriptor with *offset* and *length* equal to
> > +    the *offset* and *length* from request, and *status* set to `0`.
> > +
> > +    The following request types are currently defined for the command:
> > +
> > +    1. Block provisioning state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> 
> I prefer to have a non-zero flag value.
> 
> > +    the provisioning state of the device. The following provisionnig states
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > +        and contains zeroes;
> 
> Presumably this should be "contains only zeroes"?
> 
> Also, this may end up being a fairly expensive call for the server to
> process. Is it really useful?

I think we need to make clear that this is meant as an optimisation and
it's always a valid option for a server to return NBD_STATE_ALLOCATED
even if the contents is zeroed.

It is definitely useful if the server has a means to efficiently find
out the allocation status (e.g. SEEK_HOLE). In that case the client may
be able to avoid reading the block and sending it over the network, or
when making a copy, it could use it to keep the target file sparse. If
the client can't take advantage, we didn't have much overhead, so it's
fine.

It's less clear in a case where the server needs to read in the block
and scan its contents. It could still be a net win if the next thing the
client does is retrieving the block: The server would still have the
cost of reading the block, but it wouldn't be transferred. But when the
client doesn't follow up with a read, it's quite a bit of overhead that
we had for no benefit. Returning NBD_STATE_ALLOCATED might be more
appropriate in this case than scanning for zeros.

Kevin

Pavel Borzenkov March 24, 2016, 8:25 a.m. UTC | #4

On Wed, Mar 23, 2016 at 07:14:54PM +0100, Kevin Wolf wrote:
> Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben:
> > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > > +    The type of information required by the client is passed to server in the
> > > +    command flags field. If the server does not implement requested type or
> > > +    have no means to express it, it MUST NOT return an error, but instead MUST
> > > +    return a single LBA status descriptor with *offset* and *length* equal to
> > > +    the *offset* and *length* from request, and *status* set to `0`.
> > > +
> > > +    The following request types are currently defined for the command:
> > > +
> > > +    1. Block provisioning state
> > > +
> > > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > > +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> > 
> > I prefer to have a non-zero flag value.
> > 
> > > +    the provisioning state of the device. The following provisionnig states
> > > +    are defined for the command:
> > > +
> > > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > > +        and contains zeroes;
> > 
> > Presumably this should be "contains only zeroes"?
> > 
> > Also, this may end up being a fairly expensive call for the server to
> > process. Is it really useful?
> 
> I think we need to make clear that this is meant as an optimisation and
> it's always a valid option for a server to return NBD_STATE_ALLOCATED
> even if the contents is zeroed.
> 
> It is definitely useful if the server has a means to efficiently find
> out the allocation status (e.g. SEEK_HOLE). In that case the client may
> be able to avoid reading the block and sending it over the network, or
> when making a copy, it could use it to keep the target file sparse. If
> the client can't take advantage, we didn't have much overhead, so it's
> fine.

Yes, that was the idea. I'll add a note that the server may return
NBD_STATE_ALLOCATED instead of NBD_STATE_ZEROED if it has not means to
efficiently differentiate allocated blocks with zeroes from allocated
blocks with non-zeroed content.

> 
> It's less clear in a case where the server needs to read in the block
> and scan its contents. It could still be a net win if the next thing the
> client does is retrieving the block: The server would still have the
> cost of reading the block, but it wouldn't be transferred. But when the
> client doesn't follow up with a read, it's quite a bit of overhead that
> we had for no benefit. Returning NBD_STATE_ALLOCATED might be more
> appropriate in this case than scanning for zeros.
> 
> Kevin



> ------------------------------------------------------------------------------
> Transform Data into Opportunity.
> Accelerate data analysis in your applications with
> Intel Data Analytics Acceleration Library.
> Click to learn more.
> http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140

> _______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general

Wouter Verhelst March 24, 2016, 8:41 a.m. UTC | #5

On Thu, Mar 24, 2016 at 11:25:52AM +0300, Pavel Borzenkov wrote:
> On Wed, Mar 23, 2016 at 07:14:54PM +0100, Kevin Wolf wrote:
> > Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben:
> > > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > > > +    the provisioning state of the device. The following provisionnig states
> > > > +    are defined for the command:
> > > > +
> > > > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > > > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > > > +        and contains zeroes;
> > > 
> > > Presumably this should be "contains only zeroes"?
> > > 
> > > Also, this may end up being a fairly expensive call for the server to
> > > process. Is it really useful?
> > 
> > I think we need to make clear that this is meant as an optimisation and
> > it's always a valid option for a server to return NBD_STATE_ALLOCATED
> > even if the contents is zeroed.
> > 
> > It is definitely useful if the server has a means to efficiently find
> > out the allocation status (e.g. SEEK_HOLE). In that case the client may
> > be able to avoid reading the block and sending it over the network, or
> > when making a copy, it could use it to keep the target file sparse. If
> > the client can't take advantage, we didn't have much overhead, so it's
> > fine.
> 
> Yes, that was the idea. I'll add a note that the server may return
> NBD_STATE_ALLOCATED instead of NBD_STATE_ZEROED if it has not means to
> efficiently differentiate allocated blocks with zeroes from allocated
> blocks with non-zeroed content.

Okay, that alleviates my concerns.

In that case it might be useful if the server could say something along
the lines of "I know it's allocated, but I didn't check whether there's
anything non-zero in there"? The client can then decide to do nothing
with that information; but the more useful information is sent along,
the better...

Pavel Borzenkov March 24, 2016, 8:43 a.m. UTC | #6

On Wed, Mar 23, 2016 at 06:58:34PM +0100, Wouter Verhelst wrote:
> On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > 
> > With the availability of sparse storage formats, it is often needed to
> > query status of a particular LBA range and read only those blocks of
> > data that are actually present on the block device.
> > 
> > To provide such information, the patch adds GET_LBA_STATUS extension
> > with one new NBD_CMD_GET_LBA_STATUS command.
> > 
> > There exists a concept of data dirtiness, which is required during, for
> > example, incremental block device backup. To express this concept via
> > NBD protocol, this patch also adds additional mode of operation to
> > NBD_CMD_GET_LBA_STATUS command.
> > 
> > Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> > LBA STATUS" command more closely, it has been chosen to return a list of
> > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> > bitmap.
> > 
> > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > CC: Wouter Verhelst <w@uter.be>
> > CC: Paolo Bonzini <pbonzini@redhat.com>
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 82 insertions(+)
> > 
> > diff --git a/doc/proto.md b/doc/proto.md
> > index cda213c..fff515d 100644
> > --- a/doc/proto.md
> > +++ b/doc/proto.md
> > @@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle negotiation:
> >    `NBD_CMD_TRIM` commands
> >  - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
> >    supports `NBD_CMD_WRITE_ZEROES` commands
> > +- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server
> > +  supports `NBD_CMD_GET_LBA_STATUS` commands
> >  
> >  ##### Client flags
> >  
> > @@ -477,6 +479,10 @@ The following request types exist:
> >  
> >      Defined by the experimental `WRITE_ZEROES` extension; see below.
> >  
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +    Defined by the experimental `GET_LBA_STATUS` extension; see below.
> > +
> >  * Other requests
> >  
> >      Some third-party implementations may require additional protocol
> > @@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a write zeroes request
> >  including one or more sectors beyond the size of the device. It SHOULD
> >  return `EPERM` if it receives a write zeroes request on a read-only export.
> >  
> > +### `GET_LBA_STATUS` extension
> > +
> > +With the availability of sparse storage formats, it is often needed to query
> > +status of a particular LBA range and read only those blocks of data that are
> > +actually present on the block device.
> > +
> > +Some storage formats and operations over such formats express a concept of
> > +data dirtiness. Whether the operation is block device mirroring,
> > +incremental block device backup or any other operation with a concept of
> > +data dirtiness, they all share a need to provide a list of LBA ranges
> > +that this particular operation treats as dirty.
> > +
> > +To provide such class of information, `GET_LBA_STATUS` extension adds new
> > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> > +their respective states.
> > +
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +    An LBA range status query request. Length and offset define the range
> > +    of interest. The server MUST reply with a reply header, followed
> > +    immediately by the following data:
> 
> As Eric noted, please expand LBA at least once.
> 
> > +      - 32 bits, length of parameter data that follow (unsigned)
> > +      - zero or more LBA status descriptors, each having the following
> > +        structure:
> > +
> > +        * 64 bits, offset (unsigned)
> > +        * 32 bits, length (unsigned)
> > +        * 16 bits, status (unsigned)
> > +
> > +    unless an error condition has occurred.
> > +
> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MUST then either close the
> > +    connection, or send *length of parameter data* bytes of data
> > +    (which MAY be invalid).
> > +
> > +    The type of information required by the client is passed to server in the
> > +    command flags field. If the server does not implement requested type or
> > +    have no means to express it, it MUST NOT return an error, but instead MUST
> > +    return a single LBA status descriptor with *offset* and *length* equal to
> > +    the *offset* and *length* from request, and *status* set to `0`.
> > +
> > +    The following request types are currently defined for the command:
> > +
> > +    1. Block provisioning state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> 
> I prefer to have a non-zero flag value.
> 
> > +    the provisioning state of the device. The following provisionnig states
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > +        and contains zeroes;
> 
> Presumably this should be "contains only zeroes"?
> 
> Also, this may end up being a fairly expensive call for the server to
> process. Is it really useful?
> 
> > +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> > +        block device. A client MUST NOT make any assumptions about the
> > +        contents of the extent.
> > +
> > +    2. Block dirtiness state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> > +    the dirtiness status of the device. The following dirtiness states
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> > +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> > +
> > +    Generic NBD client implementation without knowledge of a particular NBD
> > +    server operation MUST NOT make any assumption on the meaning of the
> > +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> 
> That makes it a useless call. A server can read /dev/random to decide
> whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with
> this spec.
> 
> Either the spec should define what it means for a block to be in a dirty
> state, or it should not talk about it.

What I was trying to say with this sentence is that without knowing what
is currently going on on the server side, the DIRTY state has no
meaning. If we are doing incremental backup DIRTY state might represent blocks
that have changed since last backup. For mirroring - blocks that are yet
to be migrated. And for the same block device different sets of DIRTY
ranges might be returned in response to this command. Basically, the
meaning of DIRTY depends on the context. 

Should I state in the spec, that the meaning of DIRTY state is
implementation-specific? I see that NBD_REP_SERVER already uses this
approach.

> 
> -- 
> < ron> I mean, the main *practical* problem with C++, is there's like a dozen
>        people in the world who think they really understand all of its rules,
>        and pretty much all of them are just lying to themselves too.
>  -- #debian-devel, OFTC, 2016-02-12



> ------------------------------------------------------------------------------
> Transform Data into Opportunity.
> Accelerate data analysis in your applications with
> Intel Data Analytics Acceleration Library.
> Click to learn more.
> http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140

> _______________________________________________
> Nbd-general mailing list
> Nbd-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nbd-general

Wouter Verhelst March 24, 2016, 9:33 a.m. UTC | #7

On Thu, Mar 24, 2016 at 11:43:18AM +0300, Pavel Borzenkov wrote:
> On Wed, Mar 23, 2016 at 06:58:34PM +0100, Wouter Verhelst wrote:
> > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > > +    2. Block dirtiness state
> > > +
> > > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > > +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> > > +    the dirtiness status of the device. The following dirtiness states
> > > +    are defined for the command:
> > > +
> > > +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> > > +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> > > +
> > > +    Generic NBD client implementation without knowledge of a particular NBD
> > > +    server operation MUST NOT make any assumption on the meaning of the
> > > +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> > 
> > That makes it a useless call. A server can read /dev/random to decide
> > whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with
> > this spec.
> > 
> > Either the spec should define what it means for a block to be in a dirty
> > state, or it should not talk about it.
> 
> What I was trying to say with this sentence is that without knowing what
> is currently going on on the server side, the DIRTY state has no
> meaning. If we are doing incremental backup DIRTY state might represent blocks
> that have changed since last backup. For mirroring - blocks that are yet
> to be migrated. And for the same block device different sets of DIRTY
> ranges might be returned in response to this command. Basically, the
> meaning of DIRTY depends on the context. 
> 
> Should I state in the spec, that the meaning of DIRTY state is
> implementation-specific? I see that NBD_REP_SERVER already uses this
> approach.

That's still meaningless without a way for the client to detect what the
server-side implementation actually is. The protocol doesn't have that.

This is okay for NBD_REP_SERVER, because the extra information there is
*also* defined to be "UTF-8 encoded data suitable for direct display to
a human being", which means it just allows the server to pass on some
extra info (e.g., qemu could use it to state whether it's in use by a
live VM, or what the backend storage pool is, etc), but the data there
is not going to be critical.

For this proposed message, however, that is not the case; the whole
point of this part of your proposal seems to be to tell the client
whether some part of the export is "dirty" or not. Now I'm not saying we
need to fully define what it means for a part of the backend to be
"dirty" or not.  It's okay to leave part of the meaning in the dark,
leaving that implementation-defined. But saying "must not make any
assumption" basically means "give me some random metadata, and I'll
throw it away then because there's nothing I can do with it".

Alternatively, we could define more states than just "dirty" and
"clean", and have those be more strictly defined than what your current
proposal says.

Alex Bligh March 24, 2016, 10:32 a.m. UTC | #8

On 24 Mar 2016, at 09:33, Wouter Verhelst <w@uter.be> wrote:

> Now I'm not saying we
> need to fully define what it means for a part of the backend to be
> "dirty" or not.  It's okay to leave part of the meaning in the dark,
> leaving that implementation-defined.

Well, the 3 possible states are:

* unallocated
* zero
* non-zero

So the possible replies are a bitfield of those, with a '1' if it 'might'
be in that state, i.e.

111 = no idea
110 = might be zero or unallocated, but isn't zero
011 = I know it's allocated, but I don't know whether it is zero or not

And '000' is not permitted!

etc.

--
Alex Bligh

Pavel Borzenkov March 24, 2016, 11:36 a.m. UTC | #9

On Thu, Mar 24, 2016 at 09:41:29AM +0100, Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 11:25:52AM +0300, Pavel Borzenkov wrote:
> > On Wed, Mar 23, 2016 at 07:14:54PM +0100, Kevin Wolf wrote:
> > > Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben:
> > > > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote:
> > > > > +    the provisioning state of the device. The following provisionnig states
> > > > > +    are defined for the command:
> > > > > +
> > > > > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > > > > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > > > > +        and contains zeroes;
> > > > 
> > > > Presumably this should be "contains only zeroes"?
> > > > 
> > > > Also, this may end up being a fairly expensive call for the server to
> > > > process. Is it really useful?
> > > 
> > > I think we need to make clear that this is meant as an optimisation and
> > > it's always a valid option for a server to return NBD_STATE_ALLOCATED
> > > even if the contents is zeroed.
> > > 
> > > It is definitely useful if the server has a means to efficiently find
> > > out the allocation status (e.g. SEEK_HOLE). In that case the client may
> > > be able to avoid reading the block and sending it over the network, or
> > > when making a copy, it could use it to keep the target file sparse. If
> > > the client can't take advantage, we didn't have much overhead, so it's
> > > fine.
> > 
> > Yes, that was the idea. I'll add a note that the server may return
> > NBD_STATE_ALLOCATED instead of NBD_STATE_ZEROED if it has not means to
> > efficiently differentiate allocated blocks with zeroes from allocated
> > blocks with non-zeroed content.
> 
> Okay, that alleviates my concerns.
> 
> In that case it might be useful if the server could say something along
> the lines of "I know it's allocated, but I didn't check whether there's
> anything non-zero in there"? The client can then decide to do nothing
> with that information; but the more useful information is sent along,
> the better...

Doesn't allocated state mean exactly this? E.g. it is allocated and I
have no idea what the content is.

> 
> -- 
> < ron> I mean, the main *practical* problem with C++, is there's like a dozen
>        people in the world who think they really understand all of its rules,
>        and pretty much all of them are just lying to themselves too.
>  -- #debian-devel, OFTC, 2016-02-12

Paolo Bonzini March 24, 2016, 11:55 a.m. UTC | #10

On 23/03/2016 18:58, Wouter Verhelst wrote:
>> +To provide such class of information, `GET_LBA_STATUS` extension adds new
>> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
>> +their respective states.
>> +
>> +* `NBD_CMD_GET_LBA_STATUS` (7)
>> +
>> +    An LBA range status query request. Length and offset define the range
>> +    of interest. The server MUST reply with a reply header, followed
>> +    immediately by the following data:
> 
> As Eric noted, please expand LBA at least once.

Let's just use "block" (e.g. NBD_CMD_GET_BLOCK_STATUS).

>> +      - 32 bits, length of parameter data that follow (unsigned)
>> +      - zero or more LBA status descriptors, each having the following
>> +        structure:
>> +
>> +        * 64 bits, offset (unsigned)
>> +        * 32 bits, length (unsigned)
>> +        * 16 bits, status (unsigned)
>> +
>> +    unless an error condition has occurred.
>> +

Can we just return one descriptor?  That would simplify the protocol a bit.

>> +    the provisioning state of the device. The following provisionnig states
>> +    are defined for the command:
>> +
>> +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
>> +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
>> +        and contains zeroes;
> 
> Presumably this should be "contains only zeroes"?

Yes, or "reads as zeroes".

> Also, this may end up being a fairly expensive call for the server to
> process. Is it really useful?

It's always okay for the server to omit NBD_STATE_ZERO, but it can be
useful if the state is known for some reason.  For example, file system
holes are always zero, but unallocated blocks on a block device are not
always zero.

However, let's make these bits, so that

NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
NBD_STATE_ZERO (0x2), LBA extent will read as zeroes

and you can have NBD_STATE_ALLOCATED|NBD_STATE_ZERO.  File systems do
have the concept of unwritten extents which would be represented like
that.  The API to access the information (the FIEMAP ioctl) is broken,
but perhaps in the future a non-broken API could be added---for example
SEEK_ZERO and SEEK_NONZERO values for lseek's "whence" argument.

>> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
>> +        block device. A client MUST NOT make any assumptions about the
>> +        contents of the extent.
>> +
>> +    2. Block dirtiness state
>> +
>> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
>> +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
>> +    the dirtiness status of the device. The following dirtiness states
>> +    are defined for the command:
>> +
>> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
>> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
>> +
>> +    Generic NBD client implementation without knowledge of a particular NBD
>> +    server operation MUST NOT make any assumption on the meaning of the
>> +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> 
> That makes it a useless call. A server can read /dev/random to decide
> whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with
> this spec.
> 
> Either the spec should define what it means for a block to be in a dirty
> state, or it should not talk about it.

Here is my attempt:

    This command is meant to operate in tandem with other (non-NBD)
    channels to the server.  Generally, a "dirty" block is a block that
    has been written to by someone, but the exact meaning of "has been
    written" is left to the implementation.  For example, a virtual
    machine monitor could provide a (non-NBD) command to start tracking
    blocks written by the virtual machine.  A backup client then can
    connect to an NBD server provided by the virtual machine monitor
    and use NBD_CMD_GET_BLOCK_STATUS only read blocks that the virtual
    machine has changed.

    An implementation that doesn't track the "dirtiness" state of blocks
    MUST either fail this command with EINVAL, or mark all blocks as
    dirty in the descriptor that it returns.

Paolo

Paolo Bonzini March 24, 2016, 11:58 a.m. UTC | #11

On 24/03/2016 11:32, Alex Bligh wrote:
>> > Now I'm not saying we
>> > need to fully define what it means for a part of the backend to be
>> > "dirty" or not.  It's okay to leave part of the meaning in the dark,
>> > leaving that implementation-defined.
> Well, the 3 possible states are:
> 
> * unallocated
> * zero
> * non-zero
> 
> So the possible replies are a bitfield of those, with a '1' if it 'might'
> be in that state, i.e.
> 
> 111 = no idea
> 110 = might be zero or unallocated, but isn't zero
> 011 = I know it's allocated, but I don't know whether it is zero or not

How do you represent "definitely unallocated?"

Paolo

> And '000' is not permitted!

Alex Bligh March 24, 2016, 12:17 p.m. UTC | #12

On 24 Mar 2016, at 11:58, Paolo Bonzini <pbonzini@redhat.com> wrote:

> 
> On 24/03/2016 11:32, Alex Bligh wrote:
>>>> Now I'm not saying we
>>>> need to fully define what it means for a part of the backend to be
>>>> "dirty" or not.  It's okay to leave part of the meaning in the dark,
>>>> leaving that implementation-defined.
>> Well, the 3 possible states are:
>> 
>> * unallocated
>> * zero
>> * non-zero
>> 
>> So the possible replies are a bitfield of those, with a '1' if it 'might'
>> be in that state, i.e.
>> 
>> 111 = no idea
>> 110 = might be zero or unallocated, but isn't zero
>> 011 = I know it's allocated, but I don't know whether it is zero or not
> 
> How do you represent "definitely unallocated?"

100 is definitely allocated. The first '1' says it 'might' be in allocated state,
but as we know it's NOT in any of the other states (next two zeroes), by a
process of elimination, it's definitely unallocated. Similarly 010 and 001
are the two other 'definite' states.

Pavel Borzenkov March 24, 2016, 12:30 p.m. UTC | #13

On Wed, Mar 23, 2016 at 10:27:00AM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > 
> > With the availability of sparse storage formats, it is often needed to
> > query status of a particular LBA range and read only those blocks of
> > data that are actually present on the block device.
> 
> The acronym LBA is not used elsewhere in the NBD spec; should we spell
> it out at least once?
> 
> > 
> > To provide such information, the patch adds GET_LBA_STATUS extension
> > with one new NBD_CMD_GET_LBA_STATUS command.
> > 
> > There exists a concept of data dirtiness, which is required during, for
> > example, incremental block device backup. To express this concept via
> > NBD protocol, this patch also adds additional mode of operation to
> > NBD_CMD_GET_LBA_STATUS command.
> > 
> > Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> > LBA STATUS" command more closely, it has been chosen to return a list of
> > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> > bitmap.
> > 
> > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> > Signed-off-by: Denis V. Lunev <den@openvz.org>
> > CC: Wouter Verhelst <w@uter.be>
> > CC: Paolo Bonzini <pbonzini@redhat.com>
> > CC: Kevin Wolf <kwolf@redhat.com>
> > CC: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 82 insertions(+)
> > 
> 
> >  
> > +### `GET_LBA_STATUS` extension
> > +
> > +With the availability of sparse storage formats, it is often needed to query
> > +status of a particular LBA range and read only those blocks of data that are
> > +actually present on the block device.
> > +
> > +Some storage formats and operations over such formats express a concept of
> > +data dirtiness. Whether the operation is block device mirroring,
> > +incremental block device backup or any other operation with a concept of
> > +data dirtiness, they all share a need to provide a list of LBA ranges
> > +that this particular operation treats as dirty.
> > +
> > +To provide such class of information, `GET_LBA_STATUS` extension adds new
> > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> > +their respective states.
> > +
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +    An LBA range status query request. Length and offset define the range
> > +    of interest. The server MUST reply with a reply header, followed
> > +    immediately by the following data:
> > +
> > +      - 32 bits, length of parameter data that follow (unsigned)
> 
> Is this length the number of descriptors, or the number of bytes
> occupied by those descriptors?  It looks like bytes (that is, with the
> current layout, this field should be a multiple of 14 unless an error is
> returned and the data is bogus).
> 
> I guess 32 bits is sufficient: transmission commands are limited to
> 32-bit length, and we are unlikely to have more than one extent per 512
> bytes of length, so even if we have a header for every single sector
> (worst-case for alternating clean/dirty sectors), as long as the
> smallest granularity of an extent is larger than the extent field, the
> 'length of parameter data' in bytes is still sufficient.
> 
> > +      - zero or more LBA status descriptors, each having the following
> 
> zero or more? [1]
> 
> > +        structure:
> > +
> > +        * 64 bits, offset (unsigned)
> > +        * 32 bits, length (unsigned)
> > +        * 16 bits, status (unsigned)
> 
> An array of these status descriptors is packed on the wire, while the
> typical C layout of an array of these structures will have padding to
> reach a nice 8-byte alignment.  Should 'status' be a 32-bit field, so
> that clients and servers do not have to pack/unpack between 14 bytes on
> the wire and 16 bytes in efficient array handling, at the expense of
> more network traffic?
> 
> Conversely, it would be possible to send less data over the wire, as
> long as we require that all LBA status descriptors cover consecutive
> offsets.  That is, having the server reply with offsets is pointless,
> since they can be reconstructed on the client by starting with the
> offset in the client's request, then adding length from each status
> field.  Is less network traffic desirable?

I think it's better to explicitly send the start of each LBA extent.
So I'll go with changing 'status' field to 32 bits to avoid
packing/unpacking issues.

> 
> > +
> > +    unless an error condition has occurred.
> > +
> > +    If an error occurs, the server SHOULD set the appropriate error code
> > +    in the error field. The server MUST then either close the
> > +    connection, or send *length of parameter data* bytes of data
> > +    (which MAY be invalid).
> > +
> > +    The type of information required by the client is passed to server in the
> > +    command flags field. If the server does not implement requested type or
> > +    have no means to express it, it MUST NOT return an error, but instead MUST
> > +    return a single LBA status descriptor with *offset* and *length* equal to
> > +    the *offset* and *length* from request, and *status* set to `0`.
> 
> [1] So in what situations will we ever return an array of zero status
> fields? On an error?  Should we make it clear that the server MUST NOT
> return 0 status fields except on an error?
> 
> Do we want to require that the server MUST reply with enough extents to
> sum up to the length of the client's request, or are we permitting a
> short reply?

While the "GET LBA STATUS" command in SCSI allows partial reply, I
believe it'd better to keep things simple and require that the server
must either return a list of extents that covers the whole requested
range, or an error.

> 
> > +
> > +    The following request types are currently defined for the command:
> > +
> > +    1. Block provisioning state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> 
> Here, you spell it '0x0'; in the previous patch, you spelled the command
> flag as 'bit 1' - does that mean that Block provisioning state is the
> default when no command flags are sent?  What if we later add other
> flags but still want block provisioning mode?  Wouldn't it be better to
> state that this mode is entered when the NBD_FLAG_GET_DIRTY flag is
> clear, without regards to the state of the other flags, than allowing
> this mode only when all 16 flags are zero?
> 
> For example, should we allow a flag that states that the client is
> interested only in allocated/unallocated, and that the server may
> coalesce NBD_STATE_ZEROED extents as if they were NBD_STATE_ALLOCATED
> for fewer extents reported and thus potentially less network traffic?

Actually, for this command I treat 'command flags' field not as a set of
flags, but rather as a plain number representing required mode of
operation. Probably, not a good idea as it doesn't match the rest of the
commands.

I went this way because I didn't want to allow clients to request
several modes simultaneously (e.g. provisioning + dirtiness) in the same
request. This makes server side implementation harder.

I think I'll just switch to bits to match the rest of the commands and
will add a note, that server should return EINVAL in case several modes
are requested simultaneously.

> 
> > +    the provisioning state of the device. The following provisionnig states
> 
> s/provisionnig/provisioning/
> 
> > +    are defined for the command:
> > +
> > +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> > +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> > +        and contains zeroes;
> > +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> > +        block device. A client MUST NOT make any assumptions about the
> > +        contents of the extent.
> 
> Can NBD_STATE_ALLOCATED and NBD_STATE_DEALLOCATED both be set at the
> same time, or is that an error on the server?  What do we know about an
> extent that has neither NBD_STATE_ALLOCATED nor NBD_STATE_DEALLOCATED set?
> 
> /me re-reads
> 
> Oh, you are using this as the _entire_ 16-bit status value, rather than
> as bits 0, 1, and 2 as flags.

Yes

> 
> But I think you have two binary flags (four possible states) here: it is
> quite conceivable to have a server on top of a SCSI device, where we
> know that the extent is unallocated in SCSI, but where the server will
> guarantee that it reads as all zeroes (possibly because the server
> bypasses SCSI on the NBD read commands when it knows SCSI is
> unallocated).  That is, if we set this up as two flags:
> 
> 0x1 - allocated
> 0x2 - reads as 0
> 
> then we can express four states:
> 
> 0x0 - LBA extent not present, client MUST NOT make assumptions about
> contents, and reads should not be attempted
> 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
> 0x2 - LBA extent not present, but client can treat the extent as zeroes
> and reads will succeed
> 0x3 - LBA extent present, client can treat the extent as zeroes and
> reads will succeed

I'm not sure that clients need this level of details. From client's POV
0x2 and 0x3 are the same.

> 
> Actually, we should probably pick the bit values such that 0x0 means
> allocated and readable, as the most common state, since we also require
> that the command returns a single extent over the entire length with
> status 0 if the server can't provide any further details.
> 
> I'm not familiar enough with the SCSI "GET LBA STATUS" command to know
> if your command sanely matches to that one.

Extents returned by "GET LBA STATUS" in SCSI can have three possible
state: MAPPED/ANCHORED/DEALLOCATED. These are not bits and cannot be
combined together.

> 
> > +
> > +    2. Block dirtiness state
> > +
> > +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> > +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> 
> This overlaps with the bit value for NBD_FLAG_SEND_FUA in the previous
> patch.  Is that okay?  Or should we use a different bit value, on the
> grounds that some future extension may want to use both flags
> orthogonally within the same (possibly new) command?  Again, consistency
> in the spelling ('bit 1' in the previous patch, '0x1' here), would be nice.

I couldn't think of any use case. But, just in case, I'll change the
value of the flags so they don't overlap.

Paolo Bonzini March 24, 2016, 12:32 p.m. UTC | #14

On 24/03/2016 13:17, Alex Bligh wrote:
>>> >> * unallocated
>>> >> * zero
>>> >> * non-zero
>>> >> 
>>> >> So the possible replies are a bitfield of those, with a '1' if it 'might'
>>> >> be in that state, i.e.
>>> >> 
>>> >> 111 = no idea
>>> >> 110 = might be zero or unallocated, but isn't zero
>>> >> 011 = I know it's allocated, but I don't know whether it is zero or not
>> > 
>> > How do you represent "definitely unallocated?"
> 100 is definitely allocated. The first '1' says it 'might' be in allocated state,
> but as we know it's NOT in any of the other states (next two zeroes), by a
> process of elimination, it's definitely unallocated. Similarly 010 and 001
> are the two other 'definite' states.

An unallocated block can still be definitely zero.

Paolo

Wouter Verhelst March 24, 2016, 12:32 p.m. UTC | #15

On Thu, Mar 24, 2016 at 02:36:52PM +0300, Pavel Borzenkov wrote:
> On Thu, Mar 24, 2016 at 09:41:29AM +0100, Wouter Verhelst wrote:
> > Okay, that alleviates my concerns.
> > 
> > In that case it might be useful if the server could say something along
> > the lines of "I know it's allocated, but I didn't check whether there's
> > anything non-zero in there"? The client can then decide to do nothing
> > with that information; but the more useful information is sent along,
> > the better...
> 
> Doesn't allocated state mean exactly this? E.g. it is allocated and I
> have no idea what the content is.

I suppose you're right.

Wouter Verhelst March 24, 2016, 12:43 p.m. UTC | #16

Hi Paolo,

On Thu, Mar 24, 2016 at 12:55:51PM +0100, Paolo Bonzini wrote:
> 
> 
> On 23/03/2016 18:58, Wouter Verhelst wrote:
> >> +To provide such class of information, `GET_LBA_STATUS` extension adds new
> >> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> >> +their respective states.
> >> +
> >> +* `NBD_CMD_GET_LBA_STATUS` (7)
> >> +
> >> +    An LBA range status query request. Length and offset define the range
> >> +    of interest. The server MUST reply with a reply header, followed
> >> +    immediately by the following data:
> > 
> > As Eric noted, please expand LBA at least once.
> 
> Let's just use "block" (e.g. NBD_CMD_GET_BLOCK_STATUS).

That works too :-)

[...]
> > Also, this may end up being a fairly expensive call for the server to
> > process. Is it really useful?
> 
> It's always okay for the server to omit NBD_STATE_ZERO, but it can be
> useful if the state is known for some reason.  For example, file system
> holes are always zero, but unallocated blocks on a block device are not
> always zero.
> 
> However, let's make these bits, so that
> 
> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> 
> and you can have NBD_STATE_ALLOCATED|NBD_STATE_ZERO.  File systems do
> have the concept of unwritten extents which would be represented like
> that.  The API to access the information (the FIEMAP ioctl) is broken,
> but perhaps in the future a non-broken API could be added---for example
> SEEK_ZERO and SEEK_NONZERO values for lseek's "whence" argument.

Okay, that works for me.

> >> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> >> +        block device. A client MUST NOT make any assumptions about the
> >> +        contents of the extent.
> >> +
> >> +    2. Block dirtiness state
> >> +
> >> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> >> +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> >> +    the dirtiness status of the device. The following dirtiness states
> >> +    are defined for the command:
> >> +
> >> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> >> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
> >> +
> >> +    Generic NBD client implementation without knowledge of a particular NBD
> >> +    server operation MUST NOT make any assumption on the meaning of the
> >> +    NBD_STATE_DIRTY or NBD_STATE_CLEAN states.
> > 
> > That makes it a useless call. A server can read /dev/random to decide
> > whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with
> > this spec.
> > 
> > Either the spec should define what it means for a block to be in a dirty
> > state, or it should not talk about it.
> 
> Here is my attempt:
> 
>     This command is meant to operate in tandem with other (non-NBD)
>     channels to the server.  Generally, a "dirty" block is a block that
>     has been written to by someone, but the exact meaning of "has been
>     written" is left to the implementation.  For example, a virtual
>     machine monitor could provide a (non-NBD) command to start tracking
>     blocks written by the virtual machine.  A backup client then can
>     connect to an NBD server provided by the virtual machine monitor
>     and use NBD_CMD_GET_BLOCK_STATUS only read blocks that the virtual
                                      ^ to
>     machine has changed.
> 
>     An implementation that doesn't track the "dirtiness" state of blocks
>     MUST either fail this command with EINVAL, or mark all blocks as
>     dirty in the descriptor that it returns.

That seems saner, yes -- and I'm starting to understand what the
rationale is for this protocol message :-)

I suppose I could also implement that in nbd-server to send out
information about changed blocks if the copy-on-write option has been
switched on.

It might also be possible to add an in-protocol message to start
tracking changes (e.g., a "create checkpoint" message), but I'm not sure
if that's worth it (and it could massively complicate the NBD state
machine and protocol; not sure whether that's worth it)

At any rate, your suggestion does alleviate my concerns.

Alex Bligh March 24, 2016, 1:31 p.m. UTC | #17

On 24 Mar 2016, at 12:32, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 24/03/2016 13:17, Alex Bligh wrote:
>>>>>> * unallocated
>>>>>> * zero
>>>>>> * non-zero
>>>>>> 
>>>>>> So the possible replies are a bitfield of those, with a '1' if it 'might'
>>>>>> be in that state, i.e.
>>>>>> 
>>>>>> 111 = no idea
>>>>>> 110 = might be zero or unallocated, but isn't zero
>>>>>> 011 = I know it's allocated, but I don't know whether it is zero or not
>>>> 
>>>> How do you represent "definitely unallocated?"
>> 100 is definitely allocated. The first '1' says it 'might' be in allocated state,
>> but as we know it's NOT in any of the other states (next two zeroes), by a
>> process of elimination, it's definitely unallocated. Similarly 010 and 001
>> are the two other 'definite' states.
> 
> An unallocated block can still be definitely zero.

Sorry, I should have been clearer on the states:

bits
210

1-- Unallocated, and hence reads as zero
-1- Allocated, and reads as zero
--1 Allocated, and reads as non-zero

So 100 means 'definitely unallocated, will read as zero'.

If you are saying that there is also a state called 'Unallocated, but reads
as non-zero', that could be handled by adding a fourth bit. Same idea. One
would presume this would only ever be set in conjunction with bit 2.

My point in general was to represent all the possible states and let the client
determine what it wants to do with the information.

Paolo Bonzini March 24, 2016, 1:32 p.m. UTC | #18

On 24/03/2016 14:31, Alex Bligh wrote:
> Sorry, I should have been clearer on the states:
> 
> bits
> 210
> 
> 1-- Unallocated, and hence reads as zero
> -1- Allocated, and reads as zero
> --1 Allocated, and reads as non-zero
> 
> So 100 means 'definitely unallocated, will read as zero'.
> 
> If you are saying that there is also a state called 'Unallocated, but reads
> as non-zero',

Yes.

> that could be handled by adding a fourth bit. Same idea. One
> would presume this would only ever be set in conjunction with bit 2.
> My point in general was to represent all the possible states and let the client
> determine what it wants to do with the information.

This seems overengineered...

Paolo

Eric Blake March 24, 2016, 3:04 p.m. UTC | #19

On 03/24/2016 06:30 AM, Pavel Borzenkov wrote:
>> Conversely, it would be possible to send less data over the wire, as
>> long as we require that all LBA status descriptors cover consecutive
>> offsets.  That is, having the server reply with offsets is pointless,
>> since they can be reconstructed on the client by starting with the
>> offset in the client's request, then adding length from each status
>> field.  Is less network traffic desirable?
> 
> I think it's better to explicitly send the start of each LBA extent.

Why? Is the redundancy for something that the client can reconstruct
worth the extra safety at the cost of more traffic?

> So I'll go with changing 'status' field to 32 bits to avoid
> packing/unpacking issues.

You may want to do that even if you eliminate the offset field, so that
you have 8 bytes per struct (instead of 16).

>>
>> Do we want to require that the server MUST reply with enough extents to
>> sum up to the length of the client's request, or are we permitting a
>> short reply?
> 
> While the "GET LBA STATUS" command in SCSI allows partial reply, I
> believe it'd better to keep things simple and require that the server
> must either return a list of extents that covers the whole requested
> range, or an error.

Make sure you specify whether ranges are allowed to overlap, or must be
distinct (I prefer the latter), and whether they must appear in sorted
order (which I also prefer) - once you mandate ordered and
non-overlapping coverage, client-side validation that the server's
answer makes sense is easier (remember, we want the client to detect
man-in-the-middle corruption of the server's reply).

> Actually, for this command I treat 'command flags' field not as a set of
> flags, but rather as a plain number representing required mode of
> operation. Probably, not a good idea as it doesn't match the rest of the
> commands.
> 
> I went this way because I didn't want to allow clients to request
> several modes simultaneously (e.g. provisioning + dirtiness) in the same
> request. This makes server side implementation harder.
> 
> I think I'll just switch to bits to match the rest of the commands and
> will add a note, that server should return EINVAL in case several modes
> are requested simultaneously.

But you don't need two bits.  Just a single bit will do (off for
provisioning mode, on for dirty mode).  So there are no conflicting
modes that can be simultaneously requested, at least in the current
definition of a single valid flag bit.  (Then again, I did make a
suggestion about additional bits, useful only during provisioning mode,
that might be used to state that the client is okay if the server
coalesces extents that differ only in allocation or only in zeroed
content - if we add that bit or two bits, it would be an error to use it
while requesting dirty mode).

>> then we can express four states:
>>
>> 0x0 - LBA extent not present, client MUST NOT make assumptions about
>> contents, and reads should not be attempted
>> 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents
>> 0x2 - LBA extent not present, but client can treat the extent as zeroes
>> and reads will succeed
>> 0x3 - LBA extent present, client can treat the extent as zeroes and
>> reads will succeed
> 
> I'm not sure that clients need this level of details. From client's POV
> 0x2 and 0x3 are the same.

No, if the client is trying to EXACTLY copy sparseness, then 0x2 and 0x3
differ on whether the client will punch a hole vs. explicitly allocate
zeroes.

Eric Blake March 24, 2016, 3:25 p.m. UTC | #20

On 03/24/2016 05:55 AM, Paolo Bonzini wrote:
>> As Eric noted, please expand LBA at least once.
> 
> Let's just use "block" (e.g. NBD_CMD_GET_BLOCK_STATUS).

Yes, avoiding the term LBA and using BLOCK everywhere also nicely solves
the problem of introducing yet more terminology.

> 
>>> +      - 32 bits, length of parameter data that follow (unsigned)
>>> +      - zero or more LBA status descriptors, each having the following
>>> +        structure:
>>> +
>>> +        * 64 bits, offset (unsigned)
>>> +        * 32 bits, length (unsigned)
>>> +        * 16 bits, status (unsigned)
>>> +
>>> +    unless an error condition has occurred.
>>> +
> 
> Can we just return one descriptor?  That would simplify the protocol a bit.

As in, the return is exactly one descriptor, consisting of:

* 32 bits, length (unsigned): must be > 0, <= the client's length
* 16 bits, status (unsigned): status of that block

Of course, it means more traffic. The nice part about returning an array
of descriptors is that I can learn the status of 1G of the file, even if
the file alternates every 512 bytes between extent status, in just one
client call. But returning only a single descriptor at a time means I'd
have to make 2M client calls to learn the same pattern of allocation.
Fortunately, in the common case, allocation patterns tend to not be that
disjoint.

On the other hand, returning only one descriptor at a time (for possibly
less length than the client requested) may be easier when using
lseek(SEEK_DATA/HOLE) as the mechanism for determining the bounds of
each extent, since the server only has to search once per command,
instead of dynamically construct the entire reply.

I don't have any strong opinions on which would be better, but it is
definitely food for thought.

> 
> However, let's make these bits, so that
> 
> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes

Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
allocated, 1 means not present), so that an overall status of 0 is a
safe default?  (That is, it should always be safe to state a sector is
allocated when it is not, and always safe to state a sector is not known
to read as zeroes even if that happens to be its contents - all that we
lose by reporting this safe default state is that the client will be
unable to optimize for skipping holes).

>> Either the spec should define what it means for a block to be in a dirty
>> state, or it should not talk about it.
> 
> Here is my attempt:
> 
>     This command is meant to operate in tandem with other (non-NBD)
>     channels to the server.  Generally, a "dirty" block is a block that
>     has been written to by someone, but the exact meaning of "has been
>     written" is left to the implementation.  For example, a virtual
>     machine monitor could provide a (non-NBD) command to start tracking
>     blocks written by the virtual machine.  A backup client then can
>     connect to an NBD server provided by the virtual machine monitor
>     and use NBD_CMD_GET_BLOCK_STATUS only read blocks that the virtual

s/only/to only/

>     machine has changed.

s/changed/changed since it started tracking/

> 
>     An implementation that doesn't track the "dirtiness" state of blocks
>     MUST either fail this command with EINVAL, or mark all blocks as
>     dirty in the descriptor that it returns.

Is it feasible to return zero/allocated/dirty status all at the same
time, or do we want to strictly require two different modes of
operation?  That is, if we are returning zero and allocated as two bits,
can we also return a third bit for dirty/clean?  Should we flip the
sense of the bit, where 0 means dirty and 1 means clean, again so that a
server can always return a status of 0 as the safe default?

Paolo Bonzini March 24, 2016, 3:33 p.m. UTC | #21

On 24/03/2016 16:25, Eric Blake wrote:
>> However, let's make these bits, so that
>>
>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> 
> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> allocated, 1 means not present), so that an overall status of 0 is a
> safe default?

Double negations are evil (and don't work the same in all languages), so
I think it's a worse option.

>>     An implementation that doesn't track the "dirtiness" state of blocks
>>     MUST either fail this command with EINVAL, or mark all blocks as
>>     dirty in the descriptor that it returns.
> 
> Is it feasible to return zero/allocated/dirty status all at the same
> time, or do we want to strictly require two different modes of
> operation?

I think we should differentiate them, because it makes sense to support
only one.

In particular, while it is more or less obvious that (in my proposal
above) a trivial implementation must return NBD_STATE_ALLOCATED, it is
quite weird to require a trivial implementation to return
NBD_STATE_ALLOCATED|NBD_STATE_DIRTY.

Paolo

> That is, if we are returning zero and allocated as two bits,
> can we also return a third bit for dirty/clean?  Should we flip the
> sense of the bit, where 0 means dirty and 1 means clean, again so that a
> server can always return a status of 0 as the safe default?
>

Wouter Verhelst March 24, 2016, 3:53 p.m. UTC | #22

On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> On 24/03/2016 16:25, Eric Blake wrote:
> >> However, let's make these bits, so that
> >>
> >> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> >> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > 
> > Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> > allocated, 1 means not present), so that an overall status of 0 is a
> > safe default?
> 
> Double negations are evil (and don't work the same in all languages), so
> I think it's a worse option.

I agree that a bit which says "unallocated" is confusing in that manner,
but that just means we need a better name (one that doesn't contain
"un-" or "not")

I like the idea of having zero be the "sensible" default, although I'm
quite unable to come up with a better name myself.

Eric Blake March 24, 2016, 4:04 p.m. UTC | #23

On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
>> On 24/03/2016 16:25, Eric Blake wrote:
>>>> However, let's make these bits, so that
>>>>
>>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
>>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
>>>
>>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
>>> allocated, 1 means not present), so that an overall status of 0 is a
>>> safe default?
>>
>> Double negations are evil (and don't work the same in all languages), so
>> I think it's a worse option.
> 
> I agree that a bit which says "unallocated" is confusing in that manner,
> but that just means we need a better name (one that doesn't contain
> "un-" or "not")
> 
> I like the idea of having zero be the "sensible" default, although I'm
> quite unable to come up with a better name myself.

NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
matches well that we have NBD_CMD_TRIM for requesting the creation of
such a state.

Kevin Wolf March 24, 2016, 4:07 p.m. UTC | #24

Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> >> On 24/03/2016 16:25, Eric Blake wrote:
> >>>> However, let's make these bits, so that
> >>>>
> >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> >>>
> >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> >>> allocated, 1 means not present), so that an overall status of 0 is a
> >>> safe default?
> >>
> >> Double negations are evil (and don't work the same in all languages), so
> >> I think it's a worse option.
> > 
> > I agree that a bit which says "unallocated" is confusing in that manner,
> > but that just means we need a better name (one that doesn't contain
> > "un-" or "not")
> > 
> > I like the idea of having zero be the "sensible" default, although I'm
> > quite unable to come up with a better name myself.
> 
> NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> matches well that we have NBD_CMD_TRIM for requesting the creation of
> such a state.

How about NBD_STATE_HOLE?

Kevin

Pavel Borzenkov March 24, 2016, 4:36 p.m. UTC | #25

On Thu, Mar 24, 2016 at 09:04:07AM -0600, Eric Blake wrote:
> On 03/24/2016 06:30 AM, Pavel Borzenkov wrote:
> >> Conversely, it would be possible to send less data over the wire, as
> >> long as we require that all LBA status descriptors cover consecutive
> >> offsets.  That is, having the server reply with offsets is pointless,
> >> since they can be reconstructed on the client by starting with the
> >> offset in the client's request, then adding length from each status
> >> field.  Is less network traffic desirable?
> > 
> > I think it's better to explicitly send the start of each LBA extent.
> 
> Why? Is the redundancy for something that the client can reconstruct
> worth the extra safety at the cost of more traffic?

On second thought, not sending offsets look better. Firstly, they are
really not required for client to process the reply. Secondly, it also
implies that the regions are distinct and in sorted order and makes it
impossible to do otherwise.

Wouter Verhelst March 24, 2016, 4:47 p.m. UTC | #26

On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote:
> Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> > On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> > >> On 24/03/2016 16:25, Eric Blake wrote:
> > >>>> However, let's make these bits, so that
> > >>>>
> > >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> > >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > >>>
> > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> > >>> allocated, 1 means not present), so that an overall status of 0 is a
> > >>> safe default?
> > >>
> > >> Double negations are evil (and don't work the same in all languages), so
> > >> I think it's a worse option.
> > > 
> > > I agree that a bit which says "unallocated" is confusing in that manner,
> > > but that just means we need a better name (one that doesn't contain
> > > "un-" or "not")
> > > 
> > > I like the idea of having zero be the "sensible" default, although I'm
> > > quite unable to come up with a better name myself.
> > 
> > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> > matches well that we have NBD_CMD_TRIM for requesting the creation of
> > such a state.
> 
> How about NBD_STATE_HOLE?

Both will work, although I like NBD_STATE_TRIM slightly better because
it indeed nicely references NBD_CMD_TRIM.

However, I also think it should then be made clear that issuing
NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for
that region if the backend storage format dosn't support that, to avoid
confusion later on.

Eric Blake March 24, 2016, 10:08 p.m. UTC | #27

On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
> 

> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    An LBA range status query request. Length and offset define the range
> +    of interest. The server MUST reply with a reply header, followed
> +    immediately by the following data:
> +
> +      - 32 bits, length of parameter data that follow (unsigned)
> +      - zero or more LBA status descriptors, each having the following
> +        structure:
> +
> +        * 64 bits, offset (unsigned)
> +        * 32 bits, length (unsigned)
> +        * 16 bits, status (unsigned)
> +
> +    unless an error condition has occurred.

To date, only the NBD_CMD_READ command caused the server to send data
after the handle in its reply.  This would be the second command with a
data field in the response, but it is returning a variable amount of
data, not directly tied to the client's length - but at least you did
make it structured so that the client knows how much to read.  However,
your patch is incomplete; you'll need to edit the "Transmission" section
of the document to mention the rules on the server sending data, as the
existing text would now be wrong:

> The server replies with:
> 
> S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC)
> S: 32 bits, error
> S: 64 bits, handle
> S: (length bytes of data if the request is of type NBD_CMD_READ)

You may also want to add a rule that for all future extensions, any
command that requires data in the server response (other than the server
response to NBD_CMD_READ) must include a 32-bit length as the first
field of its data payload, so that the server reply is always structured.

Hmm, I still think it would be hard to write a wireshark dissector for
server replies, since there is no indication whether a given reply will
include data or not.  The _client_ knows (well, any well-written client
that uses a different value for 'handle' for every command sent to the
server), because it can read the returned 'handle' field to know what
command it matches to and therefore what data to expect; but a
third-party observer seeing _just_ the server messages has no idea which
server responses should have payload.  Scanning the stream and assuming
that a new reply starts each time NBD_REPLY_MAGIC is encountered is a
close approximation, but hits false positives if the data being returned
for NBD_CMD_READ contains the same magic number as part of its contents.
 Obviously, back-compat says we can't change the response to
NBD_CMD_READ, but that means that a wireshark dissector has to either
maintain context, or hunt through returned data looking for potential
magic numbers and possibly hitting false positives.

That said, maybe we want to allow global flag negotiation prior to
transmission to add a new flag on both server and client side - the
server would advertise that it is capable of a new reply mode, and the
client then has to reply that it wants to use the new reply mode; in
that mode, all server replies starting with magic 0x67446698
(NBD_REPLY_MAGIC) will never have a data payload, and all commands that
cause a reply with payload (both NBD_CMD_READ and the new
NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it)
will reply with a NEW magic number:

S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2)
S: 32 bits, error
S: 64 bits, handle
S: 32 bits, length
S: length bytes of data

so that the server's data stream is fully structured without having to
maintain context of the client's requests.  That is, a dissector can now
merely scan for both magic numbers; and on a stream between a client and
server that have negotiated the new mode, the old magic number will
never have a payload, and the new magic number will always be
accompanied with a payload that describes how much data to read to the
boundary of the next reply.

For that matter, right now, NBD_CMD_READ is required to either end the
session or return length bytes of data even when error is non-zero (but
where those bytes MAY be invalid); but by adding a negotiated flag for
structured length replies, it would be possible to allow for short reads
and/or returning an error with 0 bytes of payload but still keeping the
connection to the client open, without having to send wasted bytes over
the wire.

I could write up a negotiation of global flags for structured reply
lengths as an extension proposal, if you think it is worth it.

Wouter Verhelst March 25, 2016, 8:49 a.m. UTC | #28

On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote:
> On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > 
> > With the availability of sparse storage formats, it is often needed to
> > query status of a particular LBA range and read only those blocks of
> > data that are actually present on the block device.
> > 
> > To provide such information, the patch adds GET_LBA_STATUS extension
> > with one new NBD_CMD_GET_LBA_STATUS command.
> > 
> 
> > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > +
> > +    An LBA range status query request. Length and offset define the range
> > +    of interest. The server MUST reply with a reply header, followed
> > +    immediately by the following data:
> > +
> > +      - 32 bits, length of parameter data that follow (unsigned)
> > +      - zero or more LBA status descriptors, each having the following
> > +        structure:
> > +
> > +        * 64 bits, offset (unsigned)
> > +        * 32 bits, length (unsigned)
> > +        * 16 bits, status (unsigned)
> > +
> > +    unless an error condition has occurred.
> 
> To date, only the NBD_CMD_READ command caused the server to send data
> after the handle in its reply.  This would be the second command with a
> data field in the response, but it is returning a variable amount of
> data, not directly tied to the client's length - but at least you did
> make it structured so that the client knows how much to read.  However,
> your patch is incomplete; you'll need to edit the "Transmission" section
> of the document to mention the rules on the server sending data, as the
> existing text would now be wrong:
> 
> > The server replies with:
> > 
> > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC)
> > S: 32 bits, error
> > S: 64 bits, handle
> > S: (length bytes of data if the request is of type NBD_CMD_READ)
> 
> You may also want to add a rule that for all future extensions, any
> command that requires data in the server response (other than the server
> response to NBD_CMD_READ) must include a 32-bit length as the first
> field of its data payload, so that the server reply is always structured.

Right.

> Hmm, I still think it would be hard to write a wireshark dissector for
> server replies, since there is no indication whether a given reply will
> include data or not.

Well, a wireshark dissector already exists. However, it is very old; it
doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
deal with negotiation at all. It was written when newstyle negotiation
didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
according to the guy who wrote it (don't remember who that was).

> The _client_ knows (well, any well-written client
> that uses a different value for 'handle' for every command sent to the
> server), because it can read the returned 'handle' field to know what
> command it matches to and therefore what data to expect; but a
> third-party observer seeing _just_ the server messages has no idea which
> server responses should have payload.

It can if it knows enough about the protocol, but granted, that makes it
harder for us to extend the protocol without breaking the dissector.

> Scanning the stream and assuming that a new reply starts each time
> NBD_REPLY_MAGIC is encountered is a close approximation, but hits
> false positives if the data being returned for NBD_CMD_READ contains
> the same magic number as part of its contents.

Indeed.

> Obviously, back-compat says we can't change the response to
> NBD_CMD_READ, but that means that a wireshark dissector has to either
> maintain context, or hunt through returned data looking for potential
> magic numbers and possibly hitting false positives.
>
> That said, maybe we want to allow global flag negotiation prior to
> transmission to add a new flag on both server and client side - the
> server would advertise that it is capable of a new reply mode, and the
> client then has to reply that it wants to use the new reply mode; in

Global flag negotiation is already possible. There is a client flags
field, which is sent before option haggling, that could be used for
negotiation of such backwards-incompatible features.

> that mode, all server replies starting with magic 0x67446698
> (NBD_REPLY_MAGIC) will never have a data payload, and all commands that
> cause a reply with payload (both NBD_CMD_READ and the new
> NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it)
> will reply with a NEW magic number:
> 
> S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2)
> S: 32 bits, error
> S: 64 bits, handle
> S: 32 bits, length
> S: length bytes of data

I like this. However, before committing to it, I would like to see
Markus' feedback on that (explicit Cc added, even though he's on the
list, too).

We'd also need a way to communicate the ability to speak this protocol
from the kernel to userspace before telling the server that the client
understands something which maybe its kernel doesn't.

Markus?

> so that the server's data stream is fully structured without having to
> maintain context of the client's requests.  That is, a dissector can now
> merely scan for both magic numbers; and on a stream between a client and
> server that have negotiated the new mode, the old magic number will
> never have a payload, and the new magic number will always be
> accompanied with a payload that describes how much data to read to the
> boundary of the next reply.
> 
> For that matter, right now, NBD_CMD_READ is required to either end the
> session or return length bytes of data even when error is non-zero (but
> where those bytes MAY be invalid); but by adding a negotiated flag for
> structured length replies, it would be possible to allow for short reads
> and/or returning an error with 0 bytes of payload but still keeping the
> connection to the client open, without having to send wasted bytes over
> the wire.

Yes. This has been discussed on the nbd-general list in the past. There
is also the (significant) problem of the server having maybe already
sent out the header before discovering there is an error, at which point
it can *only* drop the connection.

> I could write up a negotiation of global flags for structured reply
> lengths as an extension proposal, if you think it is worth it.

I think it is worth it...

Alex Bligh March 25, 2016, 9:01 a.m. UTC | #29

On 25 Mar 2016, at 08:49, Wouter Verhelst <w@uter.be> wrote:

> Yes. This has been discussed on the nbd-general list in the past. There
> is also the (significant) problem of the server having maybe already
> sent out the header before discovering there is an error, at which point
> it can *only* drop the connection.

Indeed.

I think where we got to last time this was discussed was that the
server could have the option of returning 'chunks' of reply, each
chunk either being a {length, data} tuple, or an error. The total
data transmitted would add up to the full length if there is no
error.

>> I could write up a negotiation of global flags for structured reply
>> lengths as an extension proposal, if you think it is worth it.
> 
> I think it is worth it...

+1 - for NBD_CMD_READ too

Eric Blake March 28, 2016, 3:58 p.m. UTC | #30

On 03/25/2016 02:49 AM, Wouter Verhelst wrote:

>> You may also want to add a rule that for all future extensions, any
>> command that requires data in the server response (other than the server
>> response to NBD_CMD_READ) must include a 32-bit length as the first
>> field of its data payload, so that the server reply is always structured.
> 
> Right.
> 
>> Hmm, I still think it would be hard to write a wireshark dissector for
>> server replies, since there is no indication whether a given reply will
>> include data or not.
> 
> Well, a wireshark dissector already exists. However, it is very old; it
> doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
> support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
> deal with negotiation at all. It was written when newstyle negotiation
> didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
> according to the guy who wrote it (don't remember who that was).

And I've never written a wireshark dissector myself, to know how easy or
hard it would be to extend that work.

> 
>> The _client_ knows (well, any well-written client
>> that uses a different value for 'handle' for every command sent to the
>> server), because it can read the returned 'handle' field to know what
>> command it matches to and therefore what data to expect; but a
>> third-party observer seeing _just_ the server messages has no idea which
>> server responses should have payload.
> 
> It can if it knows enough about the protocol, but granted, that makes it
> harder for us to extend the protocol without breaking the dissector.
> 
>> Scanning the stream and assuming that a new reply starts each time
>> NBD_REPLY_MAGIC is encountered is a close approximation, but hits
>> false positives if the data being returned for NBD_CMD_READ contains
>> the same magic number as part of its contents.
> 
> Indeed.

One benefit of TCP: we can rely on packet boundaries (whether or not
fragmentation is also happening); I'm not sure if UDP shares the same
benefits (then again, I'm not even sure if UDP is usable for the NBD
protocol, since we definitely rely on in-order delivery of packets: a
read command can definitely return more bytes than even a jumbo frame
can contain, even though we do allow out-of-order delivery of replies).
 So if the server always sends each reply as the start of its own packet
(rather than coalescing multiple replies into a single network packet),
and the dissector only looks for magic numbers at the start of a packet,
then any server packet not starting with the magic number must be data
payload, and you could potentially even avoid the false positives by
choosing to break data replies into packets by adjusting packet lengths
by one byte any time the next data chunk would otherwise happen to start
with the same two bytes as the magic number.  But I haven't actually
tested any of this, to know if packet coalescing goes on, and I
certainly don't think it is worth complicating the server to avoid false
positive magic number detection by splitting read payloads across packet
boundaries, just for the benefit of wire-sniffing.

> I like this. However, before committing to it, I would like to see
> Markus' feedback on that (explicit Cc added, even though he's on the
> list, too).
> 
> We'd also need a way to communicate the ability to speak this protocol
> from the kernel to userspace before telling the server that the client
> understands something which maybe its kernel doesn't.

proto.md already documents that ioctl()s exist for the user space to
inform the kernel about options sent by the server prior to kicking off
transmission phase, and that the NBD protocol itself does not go into
full detail about available ioctl()s (the kernel/userspace coordination
does not affect the protocol).  It seems fairly straightforward for the
kernel to supply another ioctl() that userspace can use to learn whether
it is allowed or forbidden from advertising structured reply status
during the handshake phase (including the case where the ioctl() is not
present being treated as the client must not enable structured replies).

> 
> Markus?

Markus is on vacation for a bit, so we'll just have to wait for a reply.

> 
>> I could write up a negotiation of global flags for structured reply
>> lengths as an extension proposal, if you think it is worth it.
> 
> I think it is worth it...

Okay, I'll give it a shot, in a separate thread.

Kevin Wolf March 29, 2016, 9:38 a.m. UTC | #31

Am 24.03.2016 um 17:47 hat Wouter Verhelst geschrieben:
> On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote:
> > Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> > > On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> > > >> On 24/03/2016 16:25, Eric Blake wrote:
> > > >>>> However, let's make these bits, so that
> > > >>>>
> > > >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> > > >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > > >>>
> > > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> > > >>> allocated, 1 means not present), so that an overall status of 0 is a
> > > >>> safe default?
> > > >>
> > > >> Double negations are evil (and don't work the same in all languages), so
> > > >> I think it's a worse option.
> > > > 
> > > > I agree that a bit which says "unallocated" is confusing in that manner,
> > > > but that just means we need a better name (one that doesn't contain
> > > > "un-" or "not")
> > > > 
> > > > I like the idea of having zero be the "sensible" default, although I'm
> > > > quite unable to come up with a better name myself.
> > > 
> > > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> > > matches well that we have NBD_CMD_TRIM for requesting the creation of
> > > such a state.
> > 
> > How about NBD_STATE_HOLE?
> 
> Both will work, although I like NBD_STATE_TRIM slightly better because
> it indeed nicely references NBD_CMD_TRIM.

I just thought that "trim" sounds more like an action than a status, and
while the reason for a hole to exist can be a previous TRIM command,
another option is that it's an area in an image that just has never been
written to. In that case "trim" would be a misnomer.

> However, I also think it should then be made clear that issuing
> NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for
> that region if the backend storage format dosn't support that, to avoid
> confusion later on.

Good point. That might be another reason for not calling the status
"trim".

Kevin

Wouter Verhelst March 29, 2016, 9:53 a.m. UTC | #32

On Tue, Mar 29, 2016 at 11:38:35AM +0200, Kevin Wolf wrote:
> Am 24.03.2016 um 17:47 hat Wouter Verhelst geschrieben:
> > On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote:
> > > Am 24.03.2016 um 17:04 hat Eric Blake geschrieben:
> > > > On 03/24/2016 09:53 AM, Wouter Verhelst wrote:
> > > > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote:
> > > > >> On 24/03/2016 16:25, Eric Blake wrote:
> > > > >>>> However, let's make these bits, so that
> > > > >>>>
> > > > >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device
> > > > >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes
> > > > >>>
> > > > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means
> > > > >>> allocated, 1 means not present), so that an overall status of 0 is a
> > > > >>> safe default?
> > > > >>
> > > > >> Double negations are evil (and don't work the same in all languages), so
> > > > >> I think it's a worse option.
> > > > > 
> > > > > I agree that a bit which says "unallocated" is confusing in that manner,
> > > > > but that just means we need a better name (one that doesn't contain
> > > > > "un-" or "not")
> > > > > 
> > > > > I like the idea of having zero be the "sensible" default, although I'm
> > > > > quite unable to come up with a better name myself.
> > > > 
> > > > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated);
> > > > matches well that we have NBD_CMD_TRIM for requesting the creation of
> > > > such a state.
> > > 
> > > How about NBD_STATE_HOLE?
> > 
> > Both will work, although I like NBD_STATE_TRIM slightly better because
> > it indeed nicely references NBD_CMD_TRIM.
> 
> I just thought that "trim" sounds more like an action than a status, and
> while the reason for a hole to exist can be a previous TRIM command,
> another option is that it's an area in an image that just has never been
> written to. In that case "trim" would be a misnomer.

Point. It could be "TRIMMED" instead, I suppose.

> > However, I also think it should then be made clear that issuing
> > NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for
> > that region if the backend storage format dosn't support that, to avoid
> > confusion later on.
> 
> Good point. That might be another reason for not calling the status
> "trim".

Also a good point...

Paolo Bonzini March 29, 2016, 10:25 a.m. UTC | #33

On 29/03/2016 11:38, Kevin Wolf wrote:
> > > How about NBD_STATE_HOLE?
> > 
> > Both will work, although I like NBD_STATE_TRIM slightly better because
> > it indeed nicely references NBD_CMD_TRIM.
> 
> I just thought that "trim" sounds more like an action than a status, and
> while the reason for a hole to exist can be a previous TRIM command,
> another option is that it's an area in an image that just has never been
> written to. In that case "trim" would be a misnomer.

I agree with Kevin.  My preference is still on NBD_STATE_ALLOCATED, but
NBD_STATE_HOLE is a fine name as well.

Paolo

Markus Pargmann April 4, 2016, 10:18 a.m. UTC | #34

Hi,

back from my easter vacation. A bit surprised to find 200 mails in the
NBD mailing list ;).

On Friday 25 March 2016 09:49:29 Wouter Verhelst wrote:
> On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote:
> > On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> > > From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> > > 
> > > With the availability of sparse storage formats, it is often needed to
> > > query status of a particular LBA range and read only those blocks of
> > > data that are actually present on the block device.
> > > 
> > > To provide such information, the patch adds GET_LBA_STATUS extension
> > > with one new NBD_CMD_GET_LBA_STATUS command.
> > > 
> > 
> > > +* `NBD_CMD_GET_LBA_STATUS` (7)
> > > +
> > > +    An LBA range status query request. Length and offset define the range
> > > +    of interest. The server MUST reply with a reply header, followed
> > > +    immediately by the following data:
> > > +
> > > +      - 32 bits, length of parameter data that follow (unsigned)
> > > +      - zero or more LBA status descriptors, each having the following
> > > +        structure:
> > > +
> > > +        * 64 bits, offset (unsigned)
> > > +        * 32 bits, length (unsigned)
> > > +        * 16 bits, status (unsigned)
> > > +
> > > +    unless an error condition has occurred.
> > 
> > To date, only the NBD_CMD_READ command caused the server to send data
> > after the handle in its reply.  This would be the second command with a
> > data field in the response, but it is returning a variable amount of
> > data, not directly tied to the client's length - but at least you did
> > make it structured so that the client knows how much to read.  However,
> > your patch is incomplete; you'll need to edit the "Transmission" section
> > of the document to mention the rules on the server sending data, as the
> > existing text would now be wrong:
> > 
> > > The server replies with:
> > > 
> > > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC)
> > > S: 32 bits, error
> > > S: 64 bits, handle
> > > S: (length bytes of data if the request is of type NBD_CMD_READ)
> > 
> > You may also want to add a rule that for all future extensions, any
> > command that requires data in the server response (other than the server
> > response to NBD_CMD_READ) must include a 32-bit length as the first
> > field of its data payload, so that the server reply is always structured.
> 
> Right.
> 
> > Hmm, I still think it would be hard to write a wireshark dissector for
> > server replies, since there is no indication whether a given reply will
> > include data or not.
> 
> Well, a wireshark dissector already exists. However, it is very old; it
> doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
> support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
> deal with negotiation at all. It was written when newstyle negotiation
> didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
> according to the guy who wrote it (don't remember who that was).
> 
> > The _client_ knows (well, any well-written client
> > that uses a different value for 'handle' for every command sent to the
> > server), because it can read the returned 'handle' field to know what
> > command it matches to and therefore what data to expect; but a
> > third-party observer seeing _just_ the server messages has no idea which
> > server responses should have payload.
> 
> It can if it knows enough about the protocol, but granted, that makes it
> harder for us to extend the protocol without breaking the dissector.
> 
> > Scanning the stream and assuming that a new reply starts each time
> > NBD_REPLY_MAGIC is encountered is a close approximation, but hits
> > false positives if the data being returned for NBD_CMD_READ contains
> > the same magic number as part of its contents.
> 
> Indeed.
> 
> > Obviously, back-compat says we can't change the response to
> > NBD_CMD_READ, but that means that a wireshark dissector has to either
> > maintain context, or hunt through returned data looking for potential
> > magic numbers and possibly hitting false positives.
> >
> > That said, maybe we want to allow global flag negotiation prior to
> > transmission to add a new flag on both server and client side - the
> > server would advertise that it is capable of a new reply mode, and the
> > client then has to reply that it wants to use the new reply mode; in
> 
> Global flag negotiation is already possible. There is a client flags
> field, which is sent before option haggling, that could be used for
> negotiation of such backwards-incompatible features.
> 
> > that mode, all server replies starting with magic 0x67446698
> > (NBD_REPLY_MAGIC) will never have a data payload, and all commands that
> > cause a reply with payload (both NBD_CMD_READ and the new
> > NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it)
> > will reply with a NEW magic number:
> > 
> > S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2)
> > S: 32 bits, error
> > S: 64 bits, handle
> > S: 32 bits, length
> > S: length bytes of data
> 
> I like this. However, before committing to it, I would like to see
> Markus' feedback on that (explicit Cc added, even though he's on the
> list, too).
> 
> We'd also need a way to communicate the ability to speak this protocol
> from the kernel to userspace before telling the server that the client
> understands something which maybe its kernel doesn't.
> 
> Markus?
> 
> > so that the server's data stream is fully structured without having to
> > maintain context of the client's requests.  That is, a dissector can now
> > merely scan for both magic numbers; and on a stream between a client and
> > server that have negotiated the new mode, the old magic number will
> > never have a payload, and the new magic number will always be
> > accompanied with a payload that describes how much data to read to the
> > boundary of the next reply.
> > 
> > For that matter, right now, NBD_CMD_READ is required to either end the
> > session or return length bytes of data even when error is non-zero (but
> > where those bytes MAY be invalid); but by adding a negotiated flag for
> > structured length replies, it would be possible to allow for short reads
> > and/or returning an error with 0 bytes of payload but still keeping the
> > connection to the client open, without having to send wasted bytes over
> > the wire.
> 
> Yes. This has been discussed on the nbd-general list in the past. There
> is also the (significant) problem of the server having maybe already
> sent out the header before discovering there is an error, at which point
> it can *only* drop the connection.

I am still not through all the new mails on the list, so there may be
some more discussions about this. But I will answer here simply.

I generally like the idea of having this new kind of reply. Is this
solving our issues where we want to "stream" data directly from a
filedescriptor into a tcp-stream?

Does it make sense to extend this for reads with an offset? This way we
could not only send in chunks but also order them randomly. Is there any
use-case where it does make sense to read data not sequentially?

Best Regards,

Markus

Markus Pargmann April 4, 2016, 10:32 a.m. UTC | #35

Hi,

On Monday 28 March 2016 09:58:52 Eric Blake wrote:
> On 03/25/2016 02:49 AM, Wouter Verhelst wrote:
> 
> >> You may also want to add a rule that for all future extensions, any
> >> command that requires data in the server response (other than the server
> >> response to NBD_CMD_READ) must include a 32-bit length as the first
> >> field of its data payload, so that the server reply is always structured.
> > 
> > Right.
> > 
> >> Hmm, I still think it would be hard to write a wireshark dissector for
> >> server replies, since there is no indication whether a given reply will
> >> include data or not.
> > 
> > Well, a wireshark dissector already exists. However, it is very old; it
> > doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't
> > support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't
> > deal with negotiation at all. It was written when newstyle negotiation
> > didn't exist yet, and scanning for INIT_PASSWD at the time was too hard,
> > according to the guy who wrote it (don't remember who that was).
> 
> And I've never written a wireshark dissector myself, to know how easy or
> hard it would be to extend that work.
> 
> > 
> >> The _client_ knows (well, any well-written client
> >> that uses a different value for 'handle' for every command sent to the
> >> server), because it can read the returned 'handle' field to know what
> >> command it matches to and therefore what data to expect; but a
> >> third-party observer seeing _just_ the server messages has no idea which
> >> server responses should have payload.
> > 
> > It can if it knows enough about the protocol, but granted, that makes it
> > harder for us to extend the protocol without breaking the dissector.
> > 
> >> Scanning the stream and assuming that a new reply starts each time
> >> NBD_REPLY_MAGIC is encountered is a close approximation, but hits
> >> false positives if the data being returned for NBD_CMD_READ contains
> >> the same magic number as part of its contents.
> > 
> > Indeed.
> 
> One benefit of TCP: we can rely on packet boundaries (whether or not
> fragmentation is also happening); I'm not sure if UDP shares the same
> benefits (then again, I'm not even sure if UDP is usable for the NBD
> protocol, since we definitely rely on in-order delivery of packets: a
> read command can definitely return more bytes than even a jumbo frame
> can contain, even though we do allow out-of-order delivery of replies).
>  So if the server always sends each reply as the start of its own packet
> (rather than coalescing multiple replies into a single network packet),
> and the dissector only looks for magic numbers at the start of a packet,
> then any server packet not starting with the magic number must be data
> payload, and you could potentially even avoid the false positives by
> choosing to break data replies into packets by adjusting packet lengths
> by one byte any time the next data chunk would otherwise happen to start
> with the same two bytes as the magic number.  But I haven't actually
> tested any of this, to know if packet coalescing goes on, and I
> certainly don't think it is worth complicating the server to avoid false
> positive magic number detection by splitting read payloads across packet
> boundaries, just for the benefit of wire-sniffing.
> 
> > I like this. However, before committing to it, I would like to see
> > Markus' feedback on that (explicit Cc added, even though he's on the
> > list, too).
> > 
> > We'd also need a way to communicate the ability to speak this protocol
> > from the kernel to userspace before telling the server that the client
> > understands something which maybe its kernel doesn't.
> 
> proto.md already documents that ioctl()s exist for the user space to
> inform the kernel about options sent by the server prior to kicking off
> transmission phase, and that the NBD protocol itself does not go into
> full detail about available ioctl()s (the kernel/userspace coordination
> does not affect the protocol).  It seems fairly straightforward for the
> kernel to supply another ioctl() that userspace can use to learn whether
> it is allowed or forbidden from advertising structured reply status
> during the handshake phase (including the case where the ioctl() is not
> present being treated as the client must not enable structured replies).

Depending on the implementation we may not even need to communicate the
used NBD protocol from userspace to the kernel. We have a different
magic and the magic stays at the beginning of the message so we can
simply use the magic to decide what message we have. Of course that
would need another receive which may be too slow.

For the other direction, giving the userspace information about the
capabilities of the kernel implementation, it may be better to use a
sysfs entry? All current ioctls are used for the exact nbd device it is
used on. But implementation capabilities are a common property over all
nbd devices.

Best Regards,

Markus

Eric Blake April 4, 2016, 4:40 p.m. UTC | #36

On 03/23/2016 08:16 AM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> 
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.
> 
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
> 
> There exists a concept of data dirtiness, which is required during, for
> example, incremental block device backup. To express this concept via
> NBD protocol, this patch also adds additional mode of operation to
> NBD_CMD_GET_LBA_STATUS command.
> 
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> LBA STATUS" command more closely, it has been chosen to return a list of
> extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> bitmap.
> 
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 82 insertions(+)

I've posted a v2 of this proposal under a new title, rebased on top of
the recent work to add structured replies, and trying to take into
account a number of the suggestions that occurred in this thread.

Eric Blake April 4, 2016, 4:54 p.m. UTC | #37

On 04/04/2016 04:18 AM, Markus Pargmann wrote:
> Hi,
> 
> back from my easter vacation. A bit surprised to find 200 mails in the
> NBD mailing list ;).
> 

>> Yes. This has been discussed on the nbd-general list in the past. There
>> is also the (significant) problem of the server having maybe already
>> sent out the header before discovering there is an error, at which point
>> it can *only* drop the connection.
> 
> I am still not through all the new mails on the list, so there may be
> some more discussions about this. But I will answer here simply.
> 
> I generally like the idea of having this new kind of reply. Is this
> solving our issues where we want to "stream" data directly from a
> filedescriptor into a tcp-stream?

I'm a relative newcomer on the NBD list, so I'm not sure what the issue
was there, but yes, structured replies DOES help with read semantics
that encounter a partial error that can be detected only after you have
started the reply stream.

> 
> Does it make sense to extend this for reads with an offset? This way we
> could not only send in chunks but also order them randomly. Is there any
> use-case where it does make sense to read data not sequentially?

In fact, that's what already got committed while you were gone.  It may
help if you jump in and read the current state of proto.md, if you don't
want to plow through all the mail churn in the meantime for how we got
to the state committed.  And of course, if you have suggestions on how
to further improve it, the extension is still documented as
experimental, so we can further tweak it before releasing upstream NBD
or downstream QEMU implementations of the extension (I'm currently
coding up the work on a qemu implementation, and will report on anything
that I run into during that process).

Denis V. Lunev April 4, 2016, 8:16 p.m. UTC | #38

On 03/23/2016 05:16 PM, Denis V. Lunev wrote:
> From: Pavel Borzenkov <pborzenkov@virtuozzo.com>
>
> With the availability of sparse storage formats, it is often needed to
> query status of a particular LBA range and read only those blocks of
> data that are actually present on the block device.
>
> To provide such information, the patch adds GET_LBA_STATUS extension
> with one new NBD_CMD_GET_LBA_STATUS command.
>
> There exists a concept of data dirtiness, which is required during, for
> example, incremental block device backup. To express this concept via
> NBD protocol, this patch also adds additional mode of operation to
> NBD_CMD_GET_LBA_STATUS command.
>
> Since NBD protocol has no notion of block size, and to mimic SCSI "GET
> LBA STATUS" command more closely, it has been chosen to return a list of
> extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a
> bitmap.
>
> Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com>
> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Wouter Verhelst <w@uter.be>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>   doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 82 insertions(+)
>
> diff --git a/doc/proto.md b/doc/proto.md
> index cda213c..fff515d 100644
> --- a/doc/proto.md
> +++ b/doc/proto.md
> @@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle negotiation:
>     `NBD_CMD_TRIM` commands
>   - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server
>     supports `NBD_CMD_WRITE_ZEROES` commands
> +- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server
> +  supports `NBD_CMD_GET_LBA_STATUS` commands
>   
>   ##### Client flags
>   
> @@ -477,6 +479,10 @@ The following request types exist:
>   
>       Defined by the experimental `WRITE_ZEROES` extension; see below.
>   
> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    Defined by the experimental `GET_LBA_STATUS` extension; see below.
> +
>   * Other requests
>   
>       Some third-party implementations may require additional protocol
> @@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a write zeroes request
>   including one or more sectors beyond the size of the device. It SHOULD
>   return `EPERM` if it receives a write zeroes request on a read-only export.
>   
> +### `GET_LBA_STATUS` extension
> +
> +With the availability of sparse storage formats, it is often needed to query
> +status of a particular LBA range and read only those blocks of data that are
> +actually present on the block device.
> +
> +Some storage formats and operations over such formats express a concept of
> +data dirtiness. Whether the operation is block device mirroring,
> +incremental block device backup or any other operation with a concept of
> +data dirtiness, they all share a need to provide a list of LBA ranges
> +that this particular operation treats as dirty.
> +
> +To provide such class of information, `GET_LBA_STATUS` extension adds new
> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with
> +their respective states.
> +
> +* `NBD_CMD_GET_LBA_STATUS` (7)
> +
> +    An LBA range status query request. Length and offset define the range
> +    of interest. The server MUST reply with a reply header, followed
> +    immediately by the following data:
> +
> +      - 32 bits, length of parameter data that follow (unsigned)
> +      - zero or more LBA status descriptors, each having the following
> +        structure:
> +
> +        * 64 bits, offset (unsigned)
> +        * 32 bits, length (unsigned)
> +        * 16 bits, status (unsigned)
> +
> +    unless an error condition has occurred.
> +
> +    If an error occurs, the server SHOULD set the appropriate error code
> +    in the error field. The server MUST then either close the
> +    connection, or send *length of parameter data* bytes of data
> +    (which MAY be invalid).
> +
> +    The type of information required by the client is passed to server in the
> +    command flags field. If the server does not implement requested type or
> +    have no means to express it, it MUST NOT return an error, but instead MUST
> +    return a single LBA status descriptor with *offset* and *length* equal to
> +    the *offset* and *length* from request, and *status* set to `0`.
> +
> +    The following request types are currently defined for the command:
> +
> +    1. Block provisioning state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
> +    the provisioning state of the device. The following provisionnig states
> +    are defined for the command:
> +
> +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
> +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
> +        and contains zeroes;
> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
> +        block device. A client MUST NOT make any assumptions about the
> +        contents of the extent.
> +
> +    2. Block dirtiness state
> +
> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
> +    field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return
> +    the dirtiness status of the device. The following dirtiness states
> +    are defined for the command:
> +
> +      - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty;
> +      - `NBD_STATE_CLEAN` (0x1), LBA extent is clean.
we can add 'NBD_STATE_DIRTY_DEALLOCATED' (0x2) here as additional hint

Eric Blake April 4, 2016, 8:36 p.m. UTC | #39

On 04/04/2016 02:16 PM, Denis V. Lunev wrote:

>> +    The following request types are currently defined for the command:
>> +
>> +    1. Block provisioning state
>> +
>> +    Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags
>> +    field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return
>> +    the provisioning state of the device. The following provisionnig states
>> +    are defined for the command:
>> +
>> +      - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device;
>> +      - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device
>> +        and contains zeroes;
>> +      - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the
>> +        block device. A client MUST NOT make any assumptions about the
>> +        contents of the extent.

> we can add 'NBD_STATE_DIRTY_DEALLOCATED' (0x2) here as additional hint

No, DEALLOCATED and HOLE are the same thing, and we really want the
status to be a bitwise-OR of flags (bit 0: is it allocated or
deallocated. bit 1: is it unknown content or all 0), rather than a set
of 3 states, since it really is possible to have all four combinations
of those two orthognal status information. That was one of the topics
already hashed out in the v1 conversation, and fixed in v2.

Wouter Verhelst April 4, 2016, 10:17 p.m. UTC | #40

On Mon, Apr 04, 2016 at 12:18:37PM +0200, Markus Pargmann wrote:
> Hi,
> 
> back from my easter vacation. A bit surprised to find 200 mails in the
> NBD mailing list ;).

I'm sure you were :-)

[...]
> > Yes. This has been discussed on the nbd-general list in the past. There
> > is also the (significant) problem of the server having maybe already
> > sent out the header before discovering there is an error, at which point
> > it can *only* drop the connection.
> 
> I am still not through all the new mails on the list, so there may be
> some more discussions about this. But I will answer here simply.
> 
> I generally like the idea of having this new kind of reply. Is this
> solving our issues where we want to "stream" data directly from a
> filedescriptor into a tcp-stream?

The eventual suggestion does, yes (as I'm sure you've found out by now).

> Does it make sense to extend this for reads with an offset?

I'm not sure what you mean by that. "reads with an offset"? Don't all our reads
have an offset?

> This way we could not only send in chunks but also order them randomly. Is
> there any use-case where it does make sense to read data not sequentially?

If you just mean "read replies with offset", then yes :-)

[2/2] NBD proto: add GET_LBA_STATUS extension

Commit Message

Comments

Patch