Message ID | 1458742562-30624-3-git-send-email-den@openvz.org |
---|---|
State | New |
Headers | show |
On 03/23/2016 08:16 AM, Denis V. Lunev wrote: > From: Pavel Borzenkov <pborzenkov@virtuozzo.com> > > With the availability of sparse storage formats, it is often needed to > query status of a particular LBA range and read only those blocks of > data that are actually present on the block device. The acronym LBA is not used elsewhere in the NBD spec; should we spell it out at least once? > > To provide such information, the patch adds GET_LBA_STATUS extension > with one new NBD_CMD_GET_LBA_STATUS command. > > There exists a concept of data dirtiness, which is required during, for > example, incremental block device backup. To express this concept via > NBD protocol, this patch also adds additional mode of operation to > NBD_CMD_GET_LBA_STATUS command. > > Since NBD protocol has no notion of block size, and to mimic SCSI "GET > LBA STATUS" command more closely, it has been chosen to return a list of > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a > bitmap. > > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Wouter Verhelst <w@uter.be> > CC: Paolo Bonzini <pbonzini@redhat.com> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > --- > doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 82 insertions(+) > > > +### `GET_LBA_STATUS` extension > + > +With the availability of sparse storage formats, it is often needed to query > +status of a particular LBA range and read only those blocks of data that are > +actually present on the block device. > + > +Some storage formats and operations over such formats express a concept of > +data dirtiness. Whether the operation is block device mirroring, > +incremental block device backup or any other operation with a concept of > +data dirtiness, they all share a need to provide a list of LBA ranges > +that this particular operation treats as dirty. > + > +To provide such class of information, `GET_LBA_STATUS` extension adds new > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with > +their respective states. > + > +* `NBD_CMD_GET_LBA_STATUS` (7) > + > + An LBA range status query request. Length and offset define the range > + of interest. The server MUST reply with a reply header, followed > + immediately by the following data: > + > + - 32 bits, length of parameter data that follow (unsigned) Is this length the number of descriptors, or the number of bytes occupied by those descriptors? It looks like bytes (that is, with the current layout, this field should be a multiple of 14 unless an error is returned and the data is bogus). I guess 32 bits is sufficient: transmission commands are limited to 32-bit length, and we are unlikely to have more than one extent per 512 bytes of length, so even if we have a header for every single sector (worst-case for alternating clean/dirty sectors), as long as the smallest granularity of an extent is larger than the extent field, the 'length of parameter data' in bytes is still sufficient. > + - zero or more LBA status descriptors, each having the following zero or more? [1] > + structure: > + > + * 64 bits, offset (unsigned) > + * 32 bits, length (unsigned) > + * 16 bits, status (unsigned) An array of these status descriptors is packed on the wire, while the typical C layout of an array of these structures will have padding to reach a nice 8-byte alignment. Should 'status' be a 32-bit field, so that clients and servers do not have to pack/unpack between 14 bytes on the wire and 16 bytes in efficient array handling, at the expense of more network traffic? Conversely, it would be possible to send less data over the wire, as long as we require that all LBA status descriptors cover consecutive offsets. That is, having the server reply with offsets is pointless, since they can be reconstructed on the client by starting with the offset in the client's request, then adding length from each status field. Is less network traffic desirable? > + > + unless an error condition has occurred. > + > + If an error occurs, the server SHOULD set the appropriate error code > + in the error field. The server MUST then either close the > + connection, or send *length of parameter data* bytes of data > + (which MAY be invalid). > + > + The type of information required by the client is passed to server in the > + command flags field. If the server does not implement requested type or > + have no means to express it, it MUST NOT return an error, but instead MUST > + return a single LBA status descriptor with *offset* and *length* equal to > + the *offset* and *length* from request, and *status* set to `0`. [1] So in what situations will we ever return an array of zero status fields? On an error? Should we make it clear that the server MUST NOT return 0 status fields except on an error? Do we want to require that the server MUST reply with enough extents to sum up to the length of the client's request, or are we permitting a short reply? > + > + The following request types are currently defined for the command: > + > + 1. Block provisioning state > + > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > + field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return Here, you spell it '0x0'; in the previous patch, you spelled the command flag as 'bit 1' - does that mean that Block provisioning state is the default when no command flags are sent? What if we later add other flags but still want block provisioning mode? Wouldn't it be better to state that this mode is entered when the NBD_FLAG_GET_DIRTY flag is clear, without regards to the state of the other flags, than allowing this mode only when all 16 flags are zero? For example, should we allow a flag that states that the client is interested only in allocated/unallocated, and that the server may coalesce NBD_STATE_ZEROED extents as if they were NBD_STATE_ALLOCATED for fewer extents reported and thus potentially less network traffic? > + the provisioning state of the device. The following provisionnig states s/provisionnig/provisioning/ > + are defined for the command: > + > + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; > + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device > + and contains zeroes; > + - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the > + block device. A client MUST NOT make any assumptions about the > + contents of the extent. Can NBD_STATE_ALLOCATED and NBD_STATE_DEALLOCATED both be set at the same time, or is that an error on the server? What do we know about an extent that has neither NBD_STATE_ALLOCATED nor NBD_STATE_DEALLOCATED set? /me re-reads Oh, you are using this as the _entire_ 16-bit status value, rather than as bits 0, 1, and 2 as flags. But I think you have two binary flags (four possible states) here: it is quite conceivable to have a server on top of a SCSI device, where we know that the extent is unallocated in SCSI, but where the server will guarantee that it reads as all zeroes (possibly because the server bypasses SCSI on the NBD read commands when it knows SCSI is unallocated). That is, if we set this up as two flags: 0x1 - allocated 0x2 - reads as 0 then we can express four states: 0x0 - LBA extent not present, client MUST NOT make assumptions about contents, and reads should not be attempted 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents 0x2 - LBA extent not present, but client can treat the extent as zeroes and reads will succeed 0x3 - LBA extent present, client can treat the extent as zeroes and reads will succeed Actually, we should probably pick the bit values such that 0x0 means allocated and readable, as the most common state, since we also require that the command returns a single extent over the entire length with status 0 if the server can't provide any further details. I'm not familiar enough with the SCSI "GET LBA STATUS" command to know if your command sanely matches to that one. > + > + 2. Block dirtiness state > + > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > + field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return This overlaps with the bit value for NBD_FLAG_SEND_FUA in the previous patch. Is that okay? Or should we use a different bit value, on the grounds that some future extension may want to use both flags orthogonally within the same (possibly new) command? Again, consistency in the spelling ('bit 1' in the previous patch, '0x1' here), would be nice. > + the dirtiness status of the device. The following dirtiness states > + are defined for the command: > + > + - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty; > + - `NBD_STATE_CLEAN` (0x1), LBA extent is clean. Again, it looks like you are using these as two entire 16-bit status values, rather than as two separate bits (1<<0 and 1<<1). Another way of expressing it is that a single boolean flag is defined, if clear, the extent is dirty, if set, the extent is clean. > + > + Generic NBD client implementation without knowledge of a particular NBD > + server operation MUST NOT make any assumption on the meaning of the > + NBD_STATE_DIRTY or NBD_STATE_CLEAN states. > + > +The server SHOULD return `EINVAL` if it receives a `GET_LBA_STATUS` request > +including one or more sectors beyond the size of the device. As mentioned in the previous mail, should we also recommend an EINVAL if NBD_CMD_GET_LBA_STATUS was not negotiated in the export options but the client sends the command anyways; and/or a requirement that the client must not issue the command in that case? > + > ## About this file > > This file tries to document the NBD protocol as it is currently >
On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote: > From: Pavel Borzenkov <pborzenkov@virtuozzo.com> > > With the availability of sparse storage formats, it is often needed to > query status of a particular LBA range and read only those blocks of > data that are actually present on the block device. > > To provide such information, the patch adds GET_LBA_STATUS extension > with one new NBD_CMD_GET_LBA_STATUS command. > > There exists a concept of data dirtiness, which is required during, for > example, incremental block device backup. To express this concept via > NBD protocol, this patch also adds additional mode of operation to > NBD_CMD_GET_LBA_STATUS command. > > Since NBD protocol has no notion of block size, and to mimic SCSI "GET > LBA STATUS" command more closely, it has been chosen to return a list of > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a > bitmap. > > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Wouter Verhelst <w@uter.be> > CC: Paolo Bonzini <pbonzini@redhat.com> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > --- > doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 82 insertions(+) > > diff --git a/doc/proto.md b/doc/proto.md > index cda213c..fff515d 100644 > --- a/doc/proto.md > +++ b/doc/proto.md > @@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle negotiation: > `NBD_CMD_TRIM` commands > - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server > supports `NBD_CMD_WRITE_ZEROES` commands > +- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server > + supports `NBD_CMD_GET_LBA_STATUS` commands > > ##### Client flags > > @@ -477,6 +479,10 @@ The following request types exist: > > Defined by the experimental `WRITE_ZEROES` extension; see below. > > +* `NBD_CMD_GET_LBA_STATUS` (7) > + > + Defined by the experimental `GET_LBA_STATUS` extension; see below. > + > * Other requests > > Some third-party implementations may require additional protocol > @@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a write zeroes request > including one or more sectors beyond the size of the device. It SHOULD > return `EPERM` if it receives a write zeroes request on a read-only export. > > +### `GET_LBA_STATUS` extension > + > +With the availability of sparse storage formats, it is often needed to query > +status of a particular LBA range and read only those blocks of data that are > +actually present on the block device. > + > +Some storage formats and operations over such formats express a concept of > +data dirtiness. Whether the operation is block device mirroring, > +incremental block device backup or any other operation with a concept of > +data dirtiness, they all share a need to provide a list of LBA ranges > +that this particular operation treats as dirty. > + > +To provide such class of information, `GET_LBA_STATUS` extension adds new > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with > +their respective states. > + > +* `NBD_CMD_GET_LBA_STATUS` (7) > + > + An LBA range status query request. Length and offset define the range > + of interest. The server MUST reply with a reply header, followed > + immediately by the following data: As Eric noted, please expand LBA at least once. > + - 32 bits, length of parameter data that follow (unsigned) > + - zero or more LBA status descriptors, each having the following > + structure: > + > + * 64 bits, offset (unsigned) > + * 32 bits, length (unsigned) > + * 16 bits, status (unsigned) > + > + unless an error condition has occurred. > + > + If an error occurs, the server SHOULD set the appropriate error code > + in the error field. The server MUST then either close the > + connection, or send *length of parameter data* bytes of data > + (which MAY be invalid). > + > + The type of information required by the client is passed to server in the > + command flags field. If the server does not implement requested type or > + have no means to express it, it MUST NOT return an error, but instead MUST > + return a single LBA status descriptor with *offset* and *length* equal to > + the *offset* and *length* from request, and *status* set to `0`. > + > + The following request types are currently defined for the command: > + > + 1. Block provisioning state > + > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > + field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return I prefer to have a non-zero flag value. > + the provisioning state of the device. The following provisionnig states > + are defined for the command: > + > + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; > + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device > + and contains zeroes; Presumably this should be "contains only zeroes"? Also, this may end up being a fairly expensive call for the server to process. Is it really useful? > + - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the > + block device. A client MUST NOT make any assumptions about the > + contents of the extent. > + > + 2. Block dirtiness state > + > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > + field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return > + the dirtiness status of the device. The following dirtiness states > + are defined for the command: > + > + - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty; > + - `NBD_STATE_CLEAN` (0x1), LBA extent is clean. > + > + Generic NBD client implementation without knowledge of a particular NBD > + server operation MUST NOT make any assumption on the meaning of the > + NBD_STATE_DIRTY or NBD_STATE_CLEAN states. That makes it a useless call. A server can read /dev/random to decide whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with this spec. Either the spec should define what it means for a block to be in a dirty state, or it should not talk about it.
Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben: > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote: > > + The type of information required by the client is passed to server in the > > + command flags field. If the server does not implement requested type or > > + have no means to express it, it MUST NOT return an error, but instead MUST > > + return a single LBA status descriptor with *offset* and *length* equal to > > + the *offset* and *length* from request, and *status* set to `0`. > > + > > + The following request types are currently defined for the command: > > + > > + 1. Block provisioning state > > + > > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > > + field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return > > I prefer to have a non-zero flag value. > > > + the provisioning state of the device. The following provisionnig states > > + are defined for the command: > > + > > + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; > > + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device > > + and contains zeroes; > > Presumably this should be "contains only zeroes"? > > Also, this may end up being a fairly expensive call for the server to > process. Is it really useful? I think we need to make clear that this is meant as an optimisation and it's always a valid option for a server to return NBD_STATE_ALLOCATED even if the contents is zeroed. It is definitely useful if the server has a means to efficiently find out the allocation status (e.g. SEEK_HOLE). In that case the client may be able to avoid reading the block and sending it over the network, or when making a copy, it could use it to keep the target file sparse. If the client can't take advantage, we didn't have much overhead, so it's fine. It's less clear in a case where the server needs to read in the block and scan its contents. It could still be a net win if the next thing the client does is retrieving the block: The server would still have the cost of reading the block, but it wouldn't be transferred. But when the client doesn't follow up with a read, it's quite a bit of overhead that we had for no benefit. Returning NBD_STATE_ALLOCATED might be more appropriate in this case than scanning for zeros. Kevin
On Wed, Mar 23, 2016 at 07:14:54PM +0100, Kevin Wolf wrote: > Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben: > > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote: > > > + The type of information required by the client is passed to server in the > > > + command flags field. If the server does not implement requested type or > > > + have no means to express it, it MUST NOT return an error, but instead MUST > > > + return a single LBA status descriptor with *offset* and *length* equal to > > > + the *offset* and *length* from request, and *status* set to `0`. > > > + > > > + The following request types are currently defined for the command: > > > + > > > + 1. Block provisioning state > > > + > > > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > > > + field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return > > > > I prefer to have a non-zero flag value. > > > > > + the provisioning state of the device. The following provisionnig states > > > + are defined for the command: > > > + > > > + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; > > > + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device > > > + and contains zeroes; > > > > Presumably this should be "contains only zeroes"? > > > > Also, this may end up being a fairly expensive call for the server to > > process. Is it really useful? > > I think we need to make clear that this is meant as an optimisation and > it's always a valid option for a server to return NBD_STATE_ALLOCATED > even if the contents is zeroed. > > It is definitely useful if the server has a means to efficiently find > out the allocation status (e.g. SEEK_HOLE). In that case the client may > be able to avoid reading the block and sending it over the network, or > when making a copy, it could use it to keep the target file sparse. If > the client can't take advantage, we didn't have much overhead, so it's > fine. Yes, that was the idea. I'll add a note that the server may return NBD_STATE_ALLOCATED instead of NBD_STATE_ZEROED if it has not means to efficiently differentiate allocated blocks with zeroes from allocated blocks with non-zeroed content. > > It's less clear in a case where the server needs to read in the block > and scan its contents. It could still be a net win if the next thing the > client does is retrieving the block: The server would still have the > cost of reading the block, but it wouldn't be transferred. But when the > client doesn't follow up with a read, it's quite a bit of overhead that > we had for no benefit. Returning NBD_STATE_ALLOCATED might be more > appropriate in this case than scanning for zeros. > > Kevin > ------------------------------------------------------------------------------ > Transform Data into Opportunity. > Accelerate data analysis in your applications with > Intel Data Analytics Acceleration Library. > Click to learn more. > http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140 > _______________________________________________ > Nbd-general mailing list > Nbd-general@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/nbd-general
On Thu, Mar 24, 2016 at 11:25:52AM +0300, Pavel Borzenkov wrote: > On Wed, Mar 23, 2016 at 07:14:54PM +0100, Kevin Wolf wrote: > > Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben: > > > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote: > > > > + the provisioning state of the device. The following provisionnig states > > > > + are defined for the command: > > > > + > > > > + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; > > > > + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device > > > > + and contains zeroes; > > > > > > Presumably this should be "contains only zeroes"? > > > > > > Also, this may end up being a fairly expensive call for the server to > > > process. Is it really useful? > > > > I think we need to make clear that this is meant as an optimisation and > > it's always a valid option for a server to return NBD_STATE_ALLOCATED > > even if the contents is zeroed. > > > > It is definitely useful if the server has a means to efficiently find > > out the allocation status (e.g. SEEK_HOLE). In that case the client may > > be able to avoid reading the block and sending it over the network, or > > when making a copy, it could use it to keep the target file sparse. If > > the client can't take advantage, we didn't have much overhead, so it's > > fine. > > Yes, that was the idea. I'll add a note that the server may return > NBD_STATE_ALLOCATED instead of NBD_STATE_ZEROED if it has not means to > efficiently differentiate allocated blocks with zeroes from allocated > blocks with non-zeroed content. Okay, that alleviates my concerns. In that case it might be useful if the server could say something along the lines of "I know it's allocated, but I didn't check whether there's anything non-zero in there"? The client can then decide to do nothing with that information; but the more useful information is sent along, the better...
On Wed, Mar 23, 2016 at 06:58:34PM +0100, Wouter Verhelst wrote: > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote: > > From: Pavel Borzenkov <pborzenkov@virtuozzo.com> > > > > With the availability of sparse storage formats, it is often needed to > > query status of a particular LBA range and read only those blocks of > > data that are actually present on the block device. > > > > To provide such information, the patch adds GET_LBA_STATUS extension > > with one new NBD_CMD_GET_LBA_STATUS command. > > > > There exists a concept of data dirtiness, which is required during, for > > example, incremental block device backup. To express this concept via > > NBD protocol, this patch also adds additional mode of operation to > > NBD_CMD_GET_LBA_STATUS command. > > > > Since NBD protocol has no notion of block size, and to mimic SCSI "GET > > LBA STATUS" command more closely, it has been chosen to return a list of > > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a > > bitmap. > > > > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com> > > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> > > Signed-off-by: Denis V. Lunev <den@openvz.org> > > CC: Wouter Verhelst <w@uter.be> > > CC: Paolo Bonzini <pbonzini@redhat.com> > > CC: Kevin Wolf <kwolf@redhat.com> > > CC: Stefan Hajnoczi <stefanha@redhat.com> > > --- > > doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 82 insertions(+) > > > > diff --git a/doc/proto.md b/doc/proto.md > > index cda213c..fff515d 100644 > > --- a/doc/proto.md > > +++ b/doc/proto.md > > @@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle negotiation: > > `NBD_CMD_TRIM` commands > > - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server > > supports `NBD_CMD_WRITE_ZEROES` commands > > +- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server > > + supports `NBD_CMD_GET_LBA_STATUS` commands > > > > ##### Client flags > > > > @@ -477,6 +479,10 @@ The following request types exist: > > > > Defined by the experimental `WRITE_ZEROES` extension; see below. > > > > +* `NBD_CMD_GET_LBA_STATUS` (7) > > + > > + Defined by the experimental `GET_LBA_STATUS` extension; see below. > > + > > * Other requests > > > > Some third-party implementations may require additional protocol > > @@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a write zeroes request > > including one or more sectors beyond the size of the device. It SHOULD > > return `EPERM` if it receives a write zeroes request on a read-only export. > > > > +### `GET_LBA_STATUS` extension > > + > > +With the availability of sparse storage formats, it is often needed to query > > +status of a particular LBA range and read only those blocks of data that are > > +actually present on the block device. > > + > > +Some storage formats and operations over such formats express a concept of > > +data dirtiness. Whether the operation is block device mirroring, > > +incremental block device backup or any other operation with a concept of > > +data dirtiness, they all share a need to provide a list of LBA ranges > > +that this particular operation treats as dirty. > > + > > +To provide such class of information, `GET_LBA_STATUS` extension adds new > > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with > > +their respective states. > > + > > +* `NBD_CMD_GET_LBA_STATUS` (7) > > + > > + An LBA range status query request. Length and offset define the range > > + of interest. The server MUST reply with a reply header, followed > > + immediately by the following data: > > As Eric noted, please expand LBA at least once. > > > + - 32 bits, length of parameter data that follow (unsigned) > > + - zero or more LBA status descriptors, each having the following > > + structure: > > + > > + * 64 bits, offset (unsigned) > > + * 32 bits, length (unsigned) > > + * 16 bits, status (unsigned) > > + > > + unless an error condition has occurred. > > + > > + If an error occurs, the server SHOULD set the appropriate error code > > + in the error field. The server MUST then either close the > > + connection, or send *length of parameter data* bytes of data > > + (which MAY be invalid). > > + > > + The type of information required by the client is passed to server in the > > + command flags field. If the server does not implement requested type or > > + have no means to express it, it MUST NOT return an error, but instead MUST > > + return a single LBA status descriptor with *offset* and *length* equal to > > + the *offset* and *length* from request, and *status* set to `0`. > > + > > + The following request types are currently defined for the command: > > + > > + 1. Block provisioning state > > + > > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > > + field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return > > I prefer to have a non-zero flag value. > > > + the provisioning state of the device. The following provisionnig states > > + are defined for the command: > > + > > + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; > > + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device > > + and contains zeroes; > > Presumably this should be "contains only zeroes"? > > Also, this may end up being a fairly expensive call for the server to > process. Is it really useful? > > > + - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the > > + block device. A client MUST NOT make any assumptions about the > > + contents of the extent. > > + > > + 2. Block dirtiness state > > + > > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > > + field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return > > + the dirtiness status of the device. The following dirtiness states > > + are defined for the command: > > + > > + - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty; > > + - `NBD_STATE_CLEAN` (0x1), LBA extent is clean. > > + > > + Generic NBD client implementation without knowledge of a particular NBD > > + server operation MUST NOT make any assumption on the meaning of the > > + NBD_STATE_DIRTY or NBD_STATE_CLEAN states. > > That makes it a useless call. A server can read /dev/random to decide > whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with > this spec. > > Either the spec should define what it means for a block to be in a dirty > state, or it should not talk about it. What I was trying to say with this sentence is that without knowing what is currently going on on the server side, the DIRTY state has no meaning. If we are doing incremental backup DIRTY state might represent blocks that have changed since last backup. For mirroring - blocks that are yet to be migrated. And for the same block device different sets of DIRTY ranges might be returned in response to this command. Basically, the meaning of DIRTY depends on the context. Should I state in the spec, that the meaning of DIRTY state is implementation-specific? I see that NBD_REP_SERVER already uses this approach. > > -- > < ron> I mean, the main *practical* problem with C++, is there's like a dozen > people in the world who think they really understand all of its rules, > and pretty much all of them are just lying to themselves too. > -- #debian-devel, OFTC, 2016-02-12 > ------------------------------------------------------------------------------ > Transform Data into Opportunity. > Accelerate data analysis in your applications with > Intel Data Analytics Acceleration Library. > Click to learn more. > http://pubads.g.doubleclick.net/gampad/clk?id=278785351&iu=/4140 > _______________________________________________ > Nbd-general mailing list > Nbd-general@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/nbd-general
On Thu, Mar 24, 2016 at 11:43:18AM +0300, Pavel Borzenkov wrote: > On Wed, Mar 23, 2016 at 06:58:34PM +0100, Wouter Verhelst wrote: > > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote: > > > + 2. Block dirtiness state > > > + > > > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > > > + field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return > > > + the dirtiness status of the device. The following dirtiness states > > > + are defined for the command: > > > + > > > + - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty; > > > + - `NBD_STATE_CLEAN` (0x1), LBA extent is clean. > > > + > > > + Generic NBD client implementation without knowledge of a particular NBD > > > + server operation MUST NOT make any assumption on the meaning of the > > > + NBD_STATE_DIRTY or NBD_STATE_CLEAN states. > > > > That makes it a useless call. A server can read /dev/random to decide > > whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with > > this spec. > > > > Either the spec should define what it means for a block to be in a dirty > > state, or it should not talk about it. > > What I was trying to say with this sentence is that without knowing what > is currently going on on the server side, the DIRTY state has no > meaning. If we are doing incremental backup DIRTY state might represent blocks > that have changed since last backup. For mirroring - blocks that are yet > to be migrated. And for the same block device different sets of DIRTY > ranges might be returned in response to this command. Basically, the > meaning of DIRTY depends on the context. > > Should I state in the spec, that the meaning of DIRTY state is > implementation-specific? I see that NBD_REP_SERVER already uses this > approach. That's still meaningless without a way for the client to detect what the server-side implementation actually is. The protocol doesn't have that. This is okay for NBD_REP_SERVER, because the extra information there is *also* defined to be "UTF-8 encoded data suitable for direct display to a human being", which means it just allows the server to pass on some extra info (e.g., qemu could use it to state whether it's in use by a live VM, or what the backend storage pool is, etc), but the data there is not going to be critical. For this proposed message, however, that is not the case; the whole point of this part of your proposal seems to be to tell the client whether some part of the export is "dirty" or not. Now I'm not saying we need to fully define what it means for a part of the backend to be "dirty" or not. It's okay to leave part of the meaning in the dark, leaving that implementation-defined. But saying "must not make any assumption" basically means "give me some random metadata, and I'll throw it away then because there's nothing I can do with it". Alternatively, we could define more states than just "dirty" and "clean", and have those be more strictly defined than what your current proposal says.
On 24 Mar 2016, at 09:33, Wouter Verhelst <w@uter.be> wrote: > Now I'm not saying we > need to fully define what it means for a part of the backend to be > "dirty" or not. It's okay to leave part of the meaning in the dark, > leaving that implementation-defined. Well, the 3 possible states are: * unallocated * zero * non-zero So the possible replies are a bitfield of those, with a '1' if it 'might' be in that state, i.e. 111 = no idea 110 = might be zero or unallocated, but isn't zero 011 = I know it's allocated, but I don't know whether it is zero or not And '000' is not permitted! etc. -- Alex Bligh
On Thu, Mar 24, 2016 at 09:41:29AM +0100, Wouter Verhelst wrote: > On Thu, Mar 24, 2016 at 11:25:52AM +0300, Pavel Borzenkov wrote: > > On Wed, Mar 23, 2016 at 07:14:54PM +0100, Kevin Wolf wrote: > > > Am 23.03.2016 um 18:58 hat Wouter Verhelst geschrieben: > > > > On Wed, Mar 23, 2016 at 05:16:02PM +0300, Denis V. Lunev wrote: > > > > > + the provisioning state of the device. The following provisionnig states > > > > > + are defined for the command: > > > > > + > > > > > + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; > > > > > + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device > > > > > + and contains zeroes; > > > > > > > > Presumably this should be "contains only zeroes"? > > > > > > > > Also, this may end up being a fairly expensive call for the server to > > > > process. Is it really useful? > > > > > > I think we need to make clear that this is meant as an optimisation and > > > it's always a valid option for a server to return NBD_STATE_ALLOCATED > > > even if the contents is zeroed. > > > > > > It is definitely useful if the server has a means to efficiently find > > > out the allocation status (e.g. SEEK_HOLE). In that case the client may > > > be able to avoid reading the block and sending it over the network, or > > > when making a copy, it could use it to keep the target file sparse. If > > > the client can't take advantage, we didn't have much overhead, so it's > > > fine. > > > > Yes, that was the idea. I'll add a note that the server may return > > NBD_STATE_ALLOCATED instead of NBD_STATE_ZEROED if it has not means to > > efficiently differentiate allocated blocks with zeroes from allocated > > blocks with non-zeroed content. > > Okay, that alleviates my concerns. > > In that case it might be useful if the server could say something along > the lines of "I know it's allocated, but I didn't check whether there's > anything non-zero in there"? The client can then decide to do nothing > with that information; but the more useful information is sent along, > the better... Doesn't allocated state mean exactly this? E.g. it is allocated and I have no idea what the content is. > > -- > < ron> I mean, the main *practical* problem with C++, is there's like a dozen > people in the world who think they really understand all of its rules, > and pretty much all of them are just lying to themselves too. > -- #debian-devel, OFTC, 2016-02-12
On 23/03/2016 18:58, Wouter Verhelst wrote: >> +To provide such class of information, `GET_LBA_STATUS` extension adds new >> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with >> +their respective states. >> + >> +* `NBD_CMD_GET_LBA_STATUS` (7) >> + >> + An LBA range status query request. Length and offset define the range >> + of interest. The server MUST reply with a reply header, followed >> + immediately by the following data: > > As Eric noted, please expand LBA at least once. Let's just use "block" (e.g. NBD_CMD_GET_BLOCK_STATUS). >> + - 32 bits, length of parameter data that follow (unsigned) >> + - zero or more LBA status descriptors, each having the following >> + structure: >> + >> + * 64 bits, offset (unsigned) >> + * 32 bits, length (unsigned) >> + * 16 bits, status (unsigned) >> + >> + unless an error condition has occurred. >> + Can we just return one descriptor? That would simplify the protocol a bit. >> + the provisioning state of the device. The following provisionnig states >> + are defined for the command: >> + >> + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; >> + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device >> + and contains zeroes; > > Presumably this should be "contains only zeroes"? Yes, or "reads as zeroes". > Also, this may end up being a fairly expensive call for the server to > process. Is it really useful? It's always okay for the server to omit NBD_STATE_ZERO, but it can be useful if the state is known for some reason. For example, file system holes are always zero, but unallocated blocks on a block device are not always zero. However, let's make these bits, so that NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device NBD_STATE_ZERO (0x2), LBA extent will read as zeroes and you can have NBD_STATE_ALLOCATED|NBD_STATE_ZERO. File systems do have the concept of unwritten extents which would be represented like that. The API to access the information (the FIEMAP ioctl) is broken, but perhaps in the future a non-broken API could be added---for example SEEK_ZERO and SEEK_NONZERO values for lseek's "whence" argument. >> + - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the >> + block device. A client MUST NOT make any assumptions about the >> + contents of the extent. >> + >> + 2. Block dirtiness state >> + >> + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags >> + field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return >> + the dirtiness status of the device. The following dirtiness states >> + are defined for the command: >> + >> + - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty; >> + - `NBD_STATE_CLEAN` (0x1), LBA extent is clean. >> + >> + Generic NBD client implementation without knowledge of a particular NBD >> + server operation MUST NOT make any assumption on the meaning of the >> + NBD_STATE_DIRTY or NBD_STATE_CLEAN states. > > That makes it a useless call. A server can read /dev/random to decide > whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with > this spec. > > Either the spec should define what it means for a block to be in a dirty > state, or it should not talk about it. Here is my attempt: This command is meant to operate in tandem with other (non-NBD) channels to the server. Generally, a "dirty" block is a block that has been written to by someone, but the exact meaning of "has been written" is left to the implementation. For example, a virtual machine monitor could provide a (non-NBD) command to start tracking blocks written by the virtual machine. A backup client then can connect to an NBD server provided by the virtual machine monitor and use NBD_CMD_GET_BLOCK_STATUS only read blocks that the virtual machine has changed. An implementation that doesn't track the "dirtiness" state of blocks MUST either fail this command with EINVAL, or mark all blocks as dirty in the descriptor that it returns. Paolo
On 24/03/2016 11:32, Alex Bligh wrote: >> > Now I'm not saying we >> > need to fully define what it means for a part of the backend to be >> > "dirty" or not. It's okay to leave part of the meaning in the dark, >> > leaving that implementation-defined. > Well, the 3 possible states are: > > * unallocated > * zero > * non-zero > > So the possible replies are a bitfield of those, with a '1' if it 'might' > be in that state, i.e. > > 111 = no idea > 110 = might be zero or unallocated, but isn't zero > 011 = I know it's allocated, but I don't know whether it is zero or not How do you represent "definitely unallocated?" Paolo > And '000' is not permitted!
On 24 Mar 2016, at 11:58, Paolo Bonzini <pbonzini@redhat.com> wrote: > > On 24/03/2016 11:32, Alex Bligh wrote: >>>> Now I'm not saying we >>>> need to fully define what it means for a part of the backend to be >>>> "dirty" or not. It's okay to leave part of the meaning in the dark, >>>> leaving that implementation-defined. >> Well, the 3 possible states are: >> >> * unallocated >> * zero >> * non-zero >> >> So the possible replies are a bitfield of those, with a '1' if it 'might' >> be in that state, i.e. >> >> 111 = no idea >> 110 = might be zero or unallocated, but isn't zero >> 011 = I know it's allocated, but I don't know whether it is zero or not > > How do you represent "definitely unallocated?" 100 is definitely allocated. The first '1' says it 'might' be in allocated state, but as we know it's NOT in any of the other states (next two zeroes), by a process of elimination, it's definitely unallocated. Similarly 010 and 001 are the two other 'definite' states.
On Wed, Mar 23, 2016 at 10:27:00AM -0600, Eric Blake wrote: > On 03/23/2016 08:16 AM, Denis V. Lunev wrote: > > From: Pavel Borzenkov <pborzenkov@virtuozzo.com> > > > > With the availability of sparse storage formats, it is often needed to > > query status of a particular LBA range and read only those blocks of > > data that are actually present on the block device. > > The acronym LBA is not used elsewhere in the NBD spec; should we spell > it out at least once? > > > > > To provide such information, the patch adds GET_LBA_STATUS extension > > with one new NBD_CMD_GET_LBA_STATUS command. > > > > There exists a concept of data dirtiness, which is required during, for > > example, incremental block device backup. To express this concept via > > NBD protocol, this patch also adds additional mode of operation to > > NBD_CMD_GET_LBA_STATUS command. > > > > Since NBD protocol has no notion of block size, and to mimic SCSI "GET > > LBA STATUS" command more closely, it has been chosen to return a list of > > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a > > bitmap. > > > > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com> > > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> > > Signed-off-by: Denis V. Lunev <den@openvz.org> > > CC: Wouter Verhelst <w@uter.be> > > CC: Paolo Bonzini <pbonzini@redhat.com> > > CC: Kevin Wolf <kwolf@redhat.com> > > CC: Stefan Hajnoczi <stefanha@redhat.com> > > --- > > doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 82 insertions(+) > > > > > > > +### `GET_LBA_STATUS` extension > > + > > +With the availability of sparse storage formats, it is often needed to query > > +status of a particular LBA range and read only those blocks of data that are > > +actually present on the block device. > > + > > +Some storage formats and operations over such formats express a concept of > > +data dirtiness. Whether the operation is block device mirroring, > > +incremental block device backup or any other operation with a concept of > > +data dirtiness, they all share a need to provide a list of LBA ranges > > +that this particular operation treats as dirty. > > + > > +To provide such class of information, `GET_LBA_STATUS` extension adds new > > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with > > +their respective states. > > + > > +* `NBD_CMD_GET_LBA_STATUS` (7) > > + > > + An LBA range status query request. Length and offset define the range > > + of interest. The server MUST reply with a reply header, followed > > + immediately by the following data: > > + > > + - 32 bits, length of parameter data that follow (unsigned) > > Is this length the number of descriptors, or the number of bytes > occupied by those descriptors? It looks like bytes (that is, with the > current layout, this field should be a multiple of 14 unless an error is > returned and the data is bogus). > > I guess 32 bits is sufficient: transmission commands are limited to > 32-bit length, and we are unlikely to have more than one extent per 512 > bytes of length, so even if we have a header for every single sector > (worst-case for alternating clean/dirty sectors), as long as the > smallest granularity of an extent is larger than the extent field, the > 'length of parameter data' in bytes is still sufficient. > > > + - zero or more LBA status descriptors, each having the following > > zero or more? [1] > > > + structure: > > + > > + * 64 bits, offset (unsigned) > > + * 32 bits, length (unsigned) > > + * 16 bits, status (unsigned) > > An array of these status descriptors is packed on the wire, while the > typical C layout of an array of these structures will have padding to > reach a nice 8-byte alignment. Should 'status' be a 32-bit field, so > that clients and servers do not have to pack/unpack between 14 bytes on > the wire and 16 bytes in efficient array handling, at the expense of > more network traffic? > > Conversely, it would be possible to send less data over the wire, as > long as we require that all LBA status descriptors cover consecutive > offsets. That is, having the server reply with offsets is pointless, > since they can be reconstructed on the client by starting with the > offset in the client's request, then adding length from each status > field. Is less network traffic desirable? I think it's better to explicitly send the start of each LBA extent. So I'll go with changing 'status' field to 32 bits to avoid packing/unpacking issues. > > > + > > + unless an error condition has occurred. > > + > > + If an error occurs, the server SHOULD set the appropriate error code > > + in the error field. The server MUST then either close the > > + connection, or send *length of parameter data* bytes of data > > + (which MAY be invalid). > > + > > + The type of information required by the client is passed to server in the > > + command flags field. If the server does not implement requested type or > > + have no means to express it, it MUST NOT return an error, but instead MUST > > + return a single LBA status descriptor with *offset* and *length* equal to > > + the *offset* and *length* from request, and *status* set to `0`. > > [1] So in what situations will we ever return an array of zero status > fields? On an error? Should we make it clear that the server MUST NOT > return 0 status fields except on an error? > > Do we want to require that the server MUST reply with enough extents to > sum up to the length of the client's request, or are we permitting a > short reply? While the "GET LBA STATUS" command in SCSI allows partial reply, I believe it'd better to keep things simple and require that the server must either return a list of extents that covers the whole requested range, or an error. > > > + > > + The following request types are currently defined for the command: > > + > > + 1. Block provisioning state > > + > > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > > + field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return > > Here, you spell it '0x0'; in the previous patch, you spelled the command > flag as 'bit 1' - does that mean that Block provisioning state is the > default when no command flags are sent? What if we later add other > flags but still want block provisioning mode? Wouldn't it be better to > state that this mode is entered when the NBD_FLAG_GET_DIRTY flag is > clear, without regards to the state of the other flags, than allowing > this mode only when all 16 flags are zero? > > For example, should we allow a flag that states that the client is > interested only in allocated/unallocated, and that the server may > coalesce NBD_STATE_ZEROED extents as if they were NBD_STATE_ALLOCATED > for fewer extents reported and thus potentially less network traffic? Actually, for this command I treat 'command flags' field not as a set of flags, but rather as a plain number representing required mode of operation. Probably, not a good idea as it doesn't match the rest of the commands. I went this way because I didn't want to allow clients to request several modes simultaneously (e.g. provisioning + dirtiness) in the same request. This makes server side implementation harder. I think I'll just switch to bits to match the rest of the commands and will add a note, that server should return EINVAL in case several modes are requested simultaneously. > > > + the provisioning state of the device. The following provisionnig states > > s/provisionnig/provisioning/ > > > + are defined for the command: > > + > > + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; > > + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device > > + and contains zeroes; > > + - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the > > + block device. A client MUST NOT make any assumptions about the > > + contents of the extent. > > Can NBD_STATE_ALLOCATED and NBD_STATE_DEALLOCATED both be set at the > same time, or is that an error on the server? What do we know about an > extent that has neither NBD_STATE_ALLOCATED nor NBD_STATE_DEALLOCATED set? > > /me re-reads > > Oh, you are using this as the _entire_ 16-bit status value, rather than > as bits 0, 1, and 2 as flags. Yes > > But I think you have two binary flags (four possible states) here: it is > quite conceivable to have a server on top of a SCSI device, where we > know that the extent is unallocated in SCSI, but where the server will > guarantee that it reads as all zeroes (possibly because the server > bypasses SCSI on the NBD read commands when it knows SCSI is > unallocated). That is, if we set this up as two flags: > > 0x1 - allocated > 0x2 - reads as 0 > > then we can express four states: > > 0x0 - LBA extent not present, client MUST NOT make assumptions about > contents, and reads should not be attempted > 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents > 0x2 - LBA extent not present, but client can treat the extent as zeroes > and reads will succeed > 0x3 - LBA extent present, client can treat the extent as zeroes and > reads will succeed I'm not sure that clients need this level of details. From client's POV 0x2 and 0x3 are the same. > > Actually, we should probably pick the bit values such that 0x0 means > allocated and readable, as the most common state, since we also require > that the command returns a single extent over the entire length with > status 0 if the server can't provide any further details. > > I'm not familiar enough with the SCSI "GET LBA STATUS" command to know > if your command sanely matches to that one. Extents returned by "GET LBA STATUS" in SCSI can have three possible state: MAPPED/ANCHORED/DEALLOCATED. These are not bits and cannot be combined together. > > > + > > + 2. Block dirtiness state > > + > > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > > + field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return > > This overlaps with the bit value for NBD_FLAG_SEND_FUA in the previous > patch. Is that okay? Or should we use a different bit value, on the > grounds that some future extension may want to use both flags > orthogonally within the same (possibly new) command? Again, consistency > in the spelling ('bit 1' in the previous patch, '0x1' here), would be nice. I couldn't think of any use case. But, just in case, I'll change the value of the flags so they don't overlap.
On 24/03/2016 13:17, Alex Bligh wrote: >>> >> * unallocated >>> >> * zero >>> >> * non-zero >>> >> >>> >> So the possible replies are a bitfield of those, with a '1' if it 'might' >>> >> be in that state, i.e. >>> >> >>> >> 111 = no idea >>> >> 110 = might be zero or unallocated, but isn't zero >>> >> 011 = I know it's allocated, but I don't know whether it is zero or not >> > >> > How do you represent "definitely unallocated?" > 100 is definitely allocated. The first '1' says it 'might' be in allocated state, > but as we know it's NOT in any of the other states (next two zeroes), by a > process of elimination, it's definitely unallocated. Similarly 010 and 001 > are the two other 'definite' states. An unallocated block can still be definitely zero. Paolo
On Thu, Mar 24, 2016 at 02:36:52PM +0300, Pavel Borzenkov wrote: > On Thu, Mar 24, 2016 at 09:41:29AM +0100, Wouter Verhelst wrote: > > Okay, that alleviates my concerns. > > > > In that case it might be useful if the server could say something along > > the lines of "I know it's allocated, but I didn't check whether there's > > anything non-zero in there"? The client can then decide to do nothing > > with that information; but the more useful information is sent along, > > the better... > > Doesn't allocated state mean exactly this? E.g. it is allocated and I > have no idea what the content is. I suppose you're right.
Hi Paolo, On Thu, Mar 24, 2016 at 12:55:51PM +0100, Paolo Bonzini wrote: > > > On 23/03/2016 18:58, Wouter Verhelst wrote: > >> +To provide such class of information, `GET_LBA_STATUS` extension adds new > >> +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with > >> +their respective states. > >> + > >> +* `NBD_CMD_GET_LBA_STATUS` (7) > >> + > >> + An LBA range status query request. Length and offset define the range > >> + of interest. The server MUST reply with a reply header, followed > >> + immediately by the following data: > > > > As Eric noted, please expand LBA at least once. > > Let's just use "block" (e.g. NBD_CMD_GET_BLOCK_STATUS). That works too :-) [...] > > Also, this may end up being a fairly expensive call for the server to > > process. Is it really useful? > > It's always okay for the server to omit NBD_STATE_ZERO, but it can be > useful if the state is known for some reason. For example, file system > holes are always zero, but unallocated blocks on a block device are not > always zero. > > However, let's make these bits, so that > > NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device > NBD_STATE_ZERO (0x2), LBA extent will read as zeroes > > and you can have NBD_STATE_ALLOCATED|NBD_STATE_ZERO. File systems do > have the concept of unwritten extents which would be represented like > that. The API to access the information (the FIEMAP ioctl) is broken, > but perhaps in the future a non-broken API could be added---for example > SEEK_ZERO and SEEK_NONZERO values for lseek's "whence" argument. Okay, that works for me. > >> + - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the > >> + block device. A client MUST NOT make any assumptions about the > >> + contents of the extent. > >> + > >> + 2. Block dirtiness state > >> + > >> + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > >> + field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return > >> + the dirtiness status of the device. The following dirtiness states > >> + are defined for the command: > >> + > >> + - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty; > >> + - `NBD_STATE_CLEAN` (0x1), LBA extent is clean. > >> + > >> + Generic NBD client implementation without knowledge of a particular NBD > >> + server operation MUST NOT make any assumption on the meaning of the > >> + NBD_STATE_DIRTY or NBD_STATE_CLEAN states. > > > > That makes it a useless call. A server can read /dev/random to decide > > whether to send STATE_DIRTY or STATE_CLEAN, and still be compliant with > > this spec. > > > > Either the spec should define what it means for a block to be in a dirty > > state, or it should not talk about it. > > Here is my attempt: > > This command is meant to operate in tandem with other (non-NBD) > channels to the server. Generally, a "dirty" block is a block that > has been written to by someone, but the exact meaning of "has been > written" is left to the implementation. For example, a virtual > machine monitor could provide a (non-NBD) command to start tracking > blocks written by the virtual machine. A backup client then can > connect to an NBD server provided by the virtual machine monitor > and use NBD_CMD_GET_BLOCK_STATUS only read blocks that the virtual ^ to > machine has changed. > > An implementation that doesn't track the "dirtiness" state of blocks > MUST either fail this command with EINVAL, or mark all blocks as > dirty in the descriptor that it returns. That seems saner, yes -- and I'm starting to understand what the rationale is for this protocol message :-) I suppose I could also implement that in nbd-server to send out information about changed blocks if the copy-on-write option has been switched on. It might also be possible to add an in-protocol message to start tracking changes (e.g., a "create checkpoint" message), but I'm not sure if that's worth it (and it could massively complicate the NBD state machine and protocol; not sure whether that's worth it) At any rate, your suggestion does alleviate my concerns.
On 24 Mar 2016, at 12:32, Paolo Bonzini <pbonzini@redhat.com> wrote: > On 24/03/2016 13:17, Alex Bligh wrote: >>>>>> * unallocated >>>>>> * zero >>>>>> * non-zero >>>>>> >>>>>> So the possible replies are a bitfield of those, with a '1' if it 'might' >>>>>> be in that state, i.e. >>>>>> >>>>>> 111 = no idea >>>>>> 110 = might be zero or unallocated, but isn't zero >>>>>> 011 = I know it's allocated, but I don't know whether it is zero or not >>>> >>>> How do you represent "definitely unallocated?" >> 100 is definitely allocated. The first '1' says it 'might' be in allocated state, >> but as we know it's NOT in any of the other states (next two zeroes), by a >> process of elimination, it's definitely unallocated. Similarly 010 and 001 >> are the two other 'definite' states. > > An unallocated block can still be definitely zero. Sorry, I should have been clearer on the states: bits 210 1-- Unallocated, and hence reads as zero -1- Allocated, and reads as zero --1 Allocated, and reads as non-zero So 100 means 'definitely unallocated, will read as zero'. If you are saying that there is also a state called 'Unallocated, but reads as non-zero', that could be handled by adding a fourth bit. Same idea. One would presume this would only ever be set in conjunction with bit 2. My point in general was to represent all the possible states and let the client determine what it wants to do with the information.
On 24/03/2016 14:31, Alex Bligh wrote: > Sorry, I should have been clearer on the states: > > bits > 210 > > 1-- Unallocated, and hence reads as zero > -1- Allocated, and reads as zero > --1 Allocated, and reads as non-zero > > So 100 means 'definitely unallocated, will read as zero'. > > If you are saying that there is also a state called 'Unallocated, but reads > as non-zero', Yes. > that could be handled by adding a fourth bit. Same idea. One > would presume this would only ever be set in conjunction with bit 2. > My point in general was to represent all the possible states and let the client > determine what it wants to do with the information. This seems overengineered... Paolo
On 03/24/2016 06:30 AM, Pavel Borzenkov wrote: >> Conversely, it would be possible to send less data over the wire, as >> long as we require that all LBA status descriptors cover consecutive >> offsets. That is, having the server reply with offsets is pointless, >> since they can be reconstructed on the client by starting with the >> offset in the client's request, then adding length from each status >> field. Is less network traffic desirable? > > I think it's better to explicitly send the start of each LBA extent. Why? Is the redundancy for something that the client can reconstruct worth the extra safety at the cost of more traffic? > So I'll go with changing 'status' field to 32 bits to avoid > packing/unpacking issues. You may want to do that even if you eliminate the offset field, so that you have 8 bytes per struct (instead of 16). >> >> Do we want to require that the server MUST reply with enough extents to >> sum up to the length of the client's request, or are we permitting a >> short reply? > > While the "GET LBA STATUS" command in SCSI allows partial reply, I > believe it'd better to keep things simple and require that the server > must either return a list of extents that covers the whole requested > range, or an error. Make sure you specify whether ranges are allowed to overlap, or must be distinct (I prefer the latter), and whether they must appear in sorted order (which I also prefer) - once you mandate ordered and non-overlapping coverage, client-side validation that the server's answer makes sense is easier (remember, we want the client to detect man-in-the-middle corruption of the server's reply). > Actually, for this command I treat 'command flags' field not as a set of > flags, but rather as a plain number representing required mode of > operation. Probably, not a good idea as it doesn't match the rest of the > commands. > > I went this way because I didn't want to allow clients to request > several modes simultaneously (e.g. provisioning + dirtiness) in the same > request. This makes server side implementation harder. > > I think I'll just switch to bits to match the rest of the commands and > will add a note, that server should return EINVAL in case several modes > are requested simultaneously. But you don't need two bits. Just a single bit will do (off for provisioning mode, on for dirty mode). So there are no conflicting modes that can be simultaneously requested, at least in the current definition of a single valid flag bit. (Then again, I did make a suggestion about additional bits, useful only during provisioning mode, that might be used to state that the client is okay if the server coalesces extents that differ only in allocation or only in zeroed content - if we add that bit or two bits, it would be an error to use it while requesting dirty mode). >> then we can express four states: >> >> 0x0 - LBA extent not present, client MUST NOT make assumptions about >> contents, and reads should not be attempted >> 0x1 - LBA extent allocated, reads will succeed but no guarantee on contents >> 0x2 - LBA extent not present, but client can treat the extent as zeroes >> and reads will succeed >> 0x3 - LBA extent present, client can treat the extent as zeroes and >> reads will succeed > > I'm not sure that clients need this level of details. From client's POV > 0x2 and 0x3 are the same. No, if the client is trying to EXACTLY copy sparseness, then 0x2 and 0x3 differ on whether the client will punch a hole vs. explicitly allocate zeroes.
On 03/24/2016 05:55 AM, Paolo Bonzini wrote: >> As Eric noted, please expand LBA at least once. > > Let's just use "block" (e.g. NBD_CMD_GET_BLOCK_STATUS). Yes, avoiding the term LBA and using BLOCK everywhere also nicely solves the problem of introducing yet more terminology. > >>> + - 32 bits, length of parameter data that follow (unsigned) >>> + - zero or more LBA status descriptors, each having the following >>> + structure: >>> + >>> + * 64 bits, offset (unsigned) >>> + * 32 bits, length (unsigned) >>> + * 16 bits, status (unsigned) >>> + >>> + unless an error condition has occurred. >>> + > > Can we just return one descriptor? That would simplify the protocol a bit. As in, the return is exactly one descriptor, consisting of: * 32 bits, length (unsigned): must be > 0, <= the client's length * 16 bits, status (unsigned): status of that block Of course, it means more traffic. The nice part about returning an array of descriptors is that I can learn the status of 1G of the file, even if the file alternates every 512 bytes between extent status, in just one client call. But returning only a single descriptor at a time means I'd have to make 2M client calls to learn the same pattern of allocation. Fortunately, in the common case, allocation patterns tend to not be that disjoint. On the other hand, returning only one descriptor at a time (for possibly less length than the client requested) may be easier when using lseek(SEEK_DATA/HOLE) as the mechanism for determining the bounds of each extent, since the server only has to search once per command, instead of dynamically construct the entire reply. I don't have any strong opinions on which would be better, but it is definitely food for thought. > > However, let's make these bits, so that > > NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device > NBD_STATE_ZERO (0x2), LBA extent will read as zeroes Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means allocated, 1 means not present), so that an overall status of 0 is a safe default? (That is, it should always be safe to state a sector is allocated when it is not, and always safe to state a sector is not known to read as zeroes even if that happens to be its contents - all that we lose by reporting this safe default state is that the client will be unable to optimize for skipping holes). >> Either the spec should define what it means for a block to be in a dirty >> state, or it should not talk about it. > > Here is my attempt: > > This command is meant to operate in tandem with other (non-NBD) > channels to the server. Generally, a "dirty" block is a block that > has been written to by someone, but the exact meaning of "has been > written" is left to the implementation. For example, a virtual > machine monitor could provide a (non-NBD) command to start tracking > blocks written by the virtual machine. A backup client then can > connect to an NBD server provided by the virtual machine monitor > and use NBD_CMD_GET_BLOCK_STATUS only read blocks that the virtual s/only/to only/ > machine has changed. s/changed/changed since it started tracking/ > > An implementation that doesn't track the "dirtiness" state of blocks > MUST either fail this command with EINVAL, or mark all blocks as > dirty in the descriptor that it returns. Is it feasible to return zero/allocated/dirty status all at the same time, or do we want to strictly require two different modes of operation? That is, if we are returning zero and allocated as two bits, can we also return a third bit for dirty/clean? Should we flip the sense of the bit, where 0 means dirty and 1 means clean, again so that a server can always return a status of 0 as the safe default?
On 24/03/2016 16:25, Eric Blake wrote: >> However, let's make these bits, so that >> >> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device >> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes > > Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means > allocated, 1 means not present), so that an overall status of 0 is a > safe default? Double negations are evil (and don't work the same in all languages), so I think it's a worse option. >> An implementation that doesn't track the "dirtiness" state of blocks >> MUST either fail this command with EINVAL, or mark all blocks as >> dirty in the descriptor that it returns. > > Is it feasible to return zero/allocated/dirty status all at the same > time, or do we want to strictly require two different modes of > operation? I think we should differentiate them, because it makes sense to support only one. In particular, while it is more or less obvious that (in my proposal above) a trivial implementation must return NBD_STATE_ALLOCATED, it is quite weird to require a trivial implementation to return NBD_STATE_ALLOCATED|NBD_STATE_DIRTY. Paolo > That is, if we are returning zero and allocated as two bits, > can we also return a third bit for dirty/clean? Should we flip the > sense of the bit, where 0 means dirty and 1 means clean, again so that a > server can always return a status of 0 as the safe default? >
On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote: > On 24/03/2016 16:25, Eric Blake wrote: > >> However, let's make these bits, so that > >> > >> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device > >> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes > > > > Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means > > allocated, 1 means not present), so that an overall status of 0 is a > > safe default? > > Double negations are evil (and don't work the same in all languages), so > I think it's a worse option. I agree that a bit which says "unallocated" is confusing in that manner, but that just means we need a better name (one that doesn't contain "un-" or "not") I like the idea of having zero be the "sensible" default, although I'm quite unable to come up with a better name myself.
On 03/24/2016 09:53 AM, Wouter Verhelst wrote: > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote: >> On 24/03/2016 16:25, Eric Blake wrote: >>>> However, let's make these bits, so that >>>> >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes >>> >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means >>> allocated, 1 means not present), so that an overall status of 0 is a >>> safe default? >> >> Double negations are evil (and don't work the same in all languages), so >> I think it's a worse option. > > I agree that a bit which says "unallocated" is confusing in that manner, > but that just means we need a better name (one that doesn't contain > "un-" or "not") > > I like the idea of having zero be the "sensible" default, although I'm > quite unable to come up with a better name myself. NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated); matches well that we have NBD_CMD_TRIM for requesting the creation of such a state.
Am 24.03.2016 um 17:04 hat Eric Blake geschrieben: > On 03/24/2016 09:53 AM, Wouter Verhelst wrote: > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote: > >> On 24/03/2016 16:25, Eric Blake wrote: > >>>> However, let's make these bits, so that > >>>> > >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device > >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes > >>> > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means > >>> allocated, 1 means not present), so that an overall status of 0 is a > >>> safe default? > >> > >> Double negations are evil (and don't work the same in all languages), so > >> I think it's a worse option. > > > > I agree that a bit which says "unallocated" is confusing in that manner, > > but that just means we need a better name (one that doesn't contain > > "un-" or "not") > > > > I like the idea of having zero be the "sensible" default, although I'm > > quite unable to come up with a better name myself. > > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated); > matches well that we have NBD_CMD_TRIM for requesting the creation of > such a state. How about NBD_STATE_HOLE? Kevin
On Thu, Mar 24, 2016 at 09:04:07AM -0600, Eric Blake wrote: > On 03/24/2016 06:30 AM, Pavel Borzenkov wrote: > >> Conversely, it would be possible to send less data over the wire, as > >> long as we require that all LBA status descriptors cover consecutive > >> offsets. That is, having the server reply with offsets is pointless, > >> since they can be reconstructed on the client by starting with the > >> offset in the client's request, then adding length from each status > >> field. Is less network traffic desirable? > > > > I think it's better to explicitly send the start of each LBA extent. > > Why? Is the redundancy for something that the client can reconstruct > worth the extra safety at the cost of more traffic? On second thought, not sending offsets look better. Firstly, they are really not required for client to process the reply. Secondly, it also implies that the regions are distinct and in sorted order and makes it impossible to do otherwise.
On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote: > Am 24.03.2016 um 17:04 hat Eric Blake geschrieben: > > On 03/24/2016 09:53 AM, Wouter Verhelst wrote: > > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote: > > >> On 24/03/2016 16:25, Eric Blake wrote: > > >>>> However, let's make these bits, so that > > >>>> > > >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device > > >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes > > >>> > > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means > > >>> allocated, 1 means not present), so that an overall status of 0 is a > > >>> safe default? > > >> > > >> Double negations are evil (and don't work the same in all languages), so > > >> I think it's a worse option. > > > > > > I agree that a bit which says "unallocated" is confusing in that manner, > > > but that just means we need a better name (one that doesn't contain > > > "un-" or "not") > > > > > > I like the idea of having zero be the "sensible" default, although I'm > > > quite unable to come up with a better name myself. > > > > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated); > > matches well that we have NBD_CMD_TRIM for requesting the creation of > > such a state. > > How about NBD_STATE_HOLE? Both will work, although I like NBD_STATE_TRIM slightly better because it indeed nicely references NBD_CMD_TRIM. However, I also think it should then be made clear that issuing NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for that region if the backend storage format dosn't support that, to avoid confusion later on.
On 03/23/2016 08:16 AM, Denis V. Lunev wrote: > From: Pavel Borzenkov <pborzenkov@virtuozzo.com> > > With the availability of sparse storage formats, it is often needed to > query status of a particular LBA range and read only those blocks of > data that are actually present on the block device. > > To provide such information, the patch adds GET_LBA_STATUS extension > with one new NBD_CMD_GET_LBA_STATUS command. > > +* `NBD_CMD_GET_LBA_STATUS` (7) > + > + An LBA range status query request. Length and offset define the range > + of interest. The server MUST reply with a reply header, followed > + immediately by the following data: > + > + - 32 bits, length of parameter data that follow (unsigned) > + - zero or more LBA status descriptors, each having the following > + structure: > + > + * 64 bits, offset (unsigned) > + * 32 bits, length (unsigned) > + * 16 bits, status (unsigned) > + > + unless an error condition has occurred. To date, only the NBD_CMD_READ command caused the server to send data after the handle in its reply. This would be the second command with a data field in the response, but it is returning a variable amount of data, not directly tied to the client's length - but at least you did make it structured so that the client knows how much to read. However, your patch is incomplete; you'll need to edit the "Transmission" section of the document to mention the rules on the server sending data, as the existing text would now be wrong: > The server replies with: > > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC) > S: 32 bits, error > S: 64 bits, handle > S: (length bytes of data if the request is of type NBD_CMD_READ) You may also want to add a rule that for all future extensions, any command that requires data in the server response (other than the server response to NBD_CMD_READ) must include a 32-bit length as the first field of its data payload, so that the server reply is always structured. Hmm, I still think it would be hard to write a wireshark dissector for server replies, since there is no indication whether a given reply will include data or not. The _client_ knows (well, any well-written client that uses a different value for 'handle' for every command sent to the server), because it can read the returned 'handle' field to know what command it matches to and therefore what data to expect; but a third-party observer seeing _just_ the server messages has no idea which server responses should have payload. Scanning the stream and assuming that a new reply starts each time NBD_REPLY_MAGIC is encountered is a close approximation, but hits false positives if the data being returned for NBD_CMD_READ contains the same magic number as part of its contents. Obviously, back-compat says we can't change the response to NBD_CMD_READ, but that means that a wireshark dissector has to either maintain context, or hunt through returned data looking for potential magic numbers and possibly hitting false positives. That said, maybe we want to allow global flag negotiation prior to transmission to add a new flag on both server and client side - the server would advertise that it is capable of a new reply mode, and the client then has to reply that it wants to use the new reply mode; in that mode, all server replies starting with magic 0x67446698 (NBD_REPLY_MAGIC) will never have a data payload, and all commands that cause a reply with payload (both NBD_CMD_READ and the new NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it) will reply with a NEW magic number: S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2) S: 32 bits, error S: 64 bits, handle S: 32 bits, length S: length bytes of data so that the server's data stream is fully structured without having to maintain context of the client's requests. That is, a dissector can now merely scan for both magic numbers; and on a stream between a client and server that have negotiated the new mode, the old magic number will never have a payload, and the new magic number will always be accompanied with a payload that describes how much data to read to the boundary of the next reply. For that matter, right now, NBD_CMD_READ is required to either end the session or return length bytes of data even when error is non-zero (but where those bytes MAY be invalid); but by adding a negotiated flag for structured length replies, it would be possible to allow for short reads and/or returning an error with 0 bytes of payload but still keeping the connection to the client open, without having to send wasted bytes over the wire. I could write up a negotiation of global flags for structured reply lengths as an extension proposal, if you think it is worth it.
On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote: > On 03/23/2016 08:16 AM, Denis V. Lunev wrote: > > From: Pavel Borzenkov <pborzenkov@virtuozzo.com> > > > > With the availability of sparse storage formats, it is often needed to > > query status of a particular LBA range and read only those blocks of > > data that are actually present on the block device. > > > > To provide such information, the patch adds GET_LBA_STATUS extension > > with one new NBD_CMD_GET_LBA_STATUS command. > > > > > +* `NBD_CMD_GET_LBA_STATUS` (7) > > + > > + An LBA range status query request. Length and offset define the range > > + of interest. The server MUST reply with a reply header, followed > > + immediately by the following data: > > + > > + - 32 bits, length of parameter data that follow (unsigned) > > + - zero or more LBA status descriptors, each having the following > > + structure: > > + > > + * 64 bits, offset (unsigned) > > + * 32 bits, length (unsigned) > > + * 16 bits, status (unsigned) > > + > > + unless an error condition has occurred. > > To date, only the NBD_CMD_READ command caused the server to send data > after the handle in its reply. This would be the second command with a > data field in the response, but it is returning a variable amount of > data, not directly tied to the client's length - but at least you did > make it structured so that the client knows how much to read. However, > your patch is incomplete; you'll need to edit the "Transmission" section > of the document to mention the rules on the server sending data, as the > existing text would now be wrong: > > > The server replies with: > > > > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC) > > S: 32 bits, error > > S: 64 bits, handle > > S: (length bytes of data if the request is of type NBD_CMD_READ) > > You may also want to add a rule that for all future extensions, any > command that requires data in the server response (other than the server > response to NBD_CMD_READ) must include a 32-bit length as the first > field of its data payload, so that the server reply is always structured. Right. > Hmm, I still think it would be hard to write a wireshark dissector for > server replies, since there is no indication whether a given reply will > include data or not. Well, a wireshark dissector already exists. However, it is very old; it doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't deal with negotiation at all. It was written when newstyle negotiation didn't exist yet, and scanning for INIT_PASSWD at the time was too hard, according to the guy who wrote it (don't remember who that was). > The _client_ knows (well, any well-written client > that uses a different value for 'handle' for every command sent to the > server), because it can read the returned 'handle' field to know what > command it matches to and therefore what data to expect; but a > third-party observer seeing _just_ the server messages has no idea which > server responses should have payload. It can if it knows enough about the protocol, but granted, that makes it harder for us to extend the protocol without breaking the dissector. > Scanning the stream and assuming that a new reply starts each time > NBD_REPLY_MAGIC is encountered is a close approximation, but hits > false positives if the data being returned for NBD_CMD_READ contains > the same magic number as part of its contents. Indeed. > Obviously, back-compat says we can't change the response to > NBD_CMD_READ, but that means that a wireshark dissector has to either > maintain context, or hunt through returned data looking for potential > magic numbers and possibly hitting false positives. > > That said, maybe we want to allow global flag negotiation prior to > transmission to add a new flag on both server and client side - the > server would advertise that it is capable of a new reply mode, and the > client then has to reply that it wants to use the new reply mode; in Global flag negotiation is already possible. There is a client flags field, which is sent before option haggling, that could be used for negotiation of such backwards-incompatible features. > that mode, all server replies starting with magic 0x67446698 > (NBD_REPLY_MAGIC) will never have a data payload, and all commands that > cause a reply with payload (both NBD_CMD_READ and the new > NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it) > will reply with a NEW magic number: > > S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2) > S: 32 bits, error > S: 64 bits, handle > S: 32 bits, length > S: length bytes of data I like this. However, before committing to it, I would like to see Markus' feedback on that (explicit Cc added, even though he's on the list, too). We'd also need a way to communicate the ability to speak this protocol from the kernel to userspace before telling the server that the client understands something which maybe its kernel doesn't. Markus? > so that the server's data stream is fully structured without having to > maintain context of the client's requests. That is, a dissector can now > merely scan for both magic numbers; and on a stream between a client and > server that have negotiated the new mode, the old magic number will > never have a payload, and the new magic number will always be > accompanied with a payload that describes how much data to read to the > boundary of the next reply. > > For that matter, right now, NBD_CMD_READ is required to either end the > session or return length bytes of data even when error is non-zero (but > where those bytes MAY be invalid); but by adding a negotiated flag for > structured length replies, it would be possible to allow for short reads > and/or returning an error with 0 bytes of payload but still keeping the > connection to the client open, without having to send wasted bytes over > the wire. Yes. This has been discussed on the nbd-general list in the past. There is also the (significant) problem of the server having maybe already sent out the header before discovering there is an error, at which point it can *only* drop the connection. > I could write up a negotiation of global flags for structured reply > lengths as an extension proposal, if you think it is worth it. I think it is worth it...
On 25 Mar 2016, at 08:49, Wouter Verhelst <w@uter.be> wrote: > Yes. This has been discussed on the nbd-general list in the past. There > is also the (significant) problem of the server having maybe already > sent out the header before discovering there is an error, at which point > it can *only* drop the connection. Indeed. I think where we got to last time this was discussed was that the server could have the option of returning 'chunks' of reply, each chunk either being a {length, data} tuple, or an error. The total data transmitted would add up to the full length if there is no error. >> I could write up a negotiation of global flags for structured reply >> lengths as an extension proposal, if you think it is worth it. > > I think it is worth it... +1 - for NBD_CMD_READ too
On 03/25/2016 02:49 AM, Wouter Verhelst wrote: >> You may also want to add a rule that for all future extensions, any >> command that requires data in the server response (other than the server >> response to NBD_CMD_READ) must include a 32-bit length as the first >> field of its data payload, so that the server reply is always structured. > > Right. > >> Hmm, I still think it would be hard to write a wireshark dissector for >> server replies, since there is no indication whether a given reply will >> include data or not. > > Well, a wireshark dissector already exists. However, it is very old; it > doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't > support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't > deal with negotiation at all. It was written when newstyle negotiation > didn't exist yet, and scanning for INIT_PASSWD at the time was too hard, > according to the guy who wrote it (don't remember who that was). And I've never written a wireshark dissector myself, to know how easy or hard it would be to extend that work. > >> The _client_ knows (well, any well-written client >> that uses a different value for 'handle' for every command sent to the >> server), because it can read the returned 'handle' field to know what >> command it matches to and therefore what data to expect; but a >> third-party observer seeing _just_ the server messages has no idea which >> server responses should have payload. > > It can if it knows enough about the protocol, but granted, that makes it > harder for us to extend the protocol without breaking the dissector. > >> Scanning the stream and assuming that a new reply starts each time >> NBD_REPLY_MAGIC is encountered is a close approximation, but hits >> false positives if the data being returned for NBD_CMD_READ contains >> the same magic number as part of its contents. > > Indeed. One benefit of TCP: we can rely on packet boundaries (whether or not fragmentation is also happening); I'm not sure if UDP shares the same benefits (then again, I'm not even sure if UDP is usable for the NBD protocol, since we definitely rely on in-order delivery of packets: a read command can definitely return more bytes than even a jumbo frame can contain, even though we do allow out-of-order delivery of replies). So if the server always sends each reply as the start of its own packet (rather than coalescing multiple replies into a single network packet), and the dissector only looks for magic numbers at the start of a packet, then any server packet not starting with the magic number must be data payload, and you could potentially even avoid the false positives by choosing to break data replies into packets by adjusting packet lengths by one byte any time the next data chunk would otherwise happen to start with the same two bytes as the magic number. But I haven't actually tested any of this, to know if packet coalescing goes on, and I certainly don't think it is worth complicating the server to avoid false positive magic number detection by splitting read payloads across packet boundaries, just for the benefit of wire-sniffing. > I like this. However, before committing to it, I would like to see > Markus' feedback on that (explicit Cc added, even though he's on the > list, too). > > We'd also need a way to communicate the ability to speak this protocol > from the kernel to userspace before telling the server that the client > understands something which maybe its kernel doesn't. proto.md already documents that ioctl()s exist for the user space to inform the kernel about options sent by the server prior to kicking off transmission phase, and that the NBD protocol itself does not go into full detail about available ioctl()s (the kernel/userspace coordination does not affect the protocol). It seems fairly straightforward for the kernel to supply another ioctl() that userspace can use to learn whether it is allowed or forbidden from advertising structured reply status during the handshake phase (including the case where the ioctl() is not present being treated as the client must not enable structured replies). > > Markus? Markus is on vacation for a bit, so we'll just have to wait for a reply. > >> I could write up a negotiation of global flags for structured reply >> lengths as an extension proposal, if you think it is worth it. > > I think it is worth it... Okay, I'll give it a shot, in a separate thread.
Am 24.03.2016 um 17:47 hat Wouter Verhelst geschrieben: > On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote: > > Am 24.03.2016 um 17:04 hat Eric Blake geschrieben: > > > On 03/24/2016 09:53 AM, Wouter Verhelst wrote: > > > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote: > > > >> On 24/03/2016 16:25, Eric Blake wrote: > > > >>>> However, let's make these bits, so that > > > >>>> > > > >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device > > > >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes > > > >>> > > > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means > > > >>> allocated, 1 means not present), so that an overall status of 0 is a > > > >>> safe default? > > > >> > > > >> Double negations are evil (and don't work the same in all languages), so > > > >> I think it's a worse option. > > > > > > > > I agree that a bit which says "unallocated" is confusing in that manner, > > > > but that just means we need a better name (one that doesn't contain > > > > "un-" or "not") > > > > > > > > I like the idea of having zero be the "sensible" default, although I'm > > > > quite unable to come up with a better name myself. > > > > > > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated); > > > matches well that we have NBD_CMD_TRIM for requesting the creation of > > > such a state. > > > > How about NBD_STATE_HOLE? > > Both will work, although I like NBD_STATE_TRIM slightly better because > it indeed nicely references NBD_CMD_TRIM. I just thought that "trim" sounds more like an action than a status, and while the reason for a hole to exist can be a previous TRIM command, another option is that it's an area in an image that just has never been written to. In that case "trim" would be a misnomer. > However, I also think it should then be made clear that issuing > NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for > that region if the backend storage format dosn't support that, to avoid > confusion later on. Good point. That might be another reason for not calling the status "trim". Kevin
On Tue, Mar 29, 2016 at 11:38:35AM +0200, Kevin Wolf wrote: > Am 24.03.2016 um 17:47 hat Wouter Verhelst geschrieben: > > On Thu, Mar 24, 2016 at 05:07:47PM +0100, Kevin Wolf wrote: > > > Am 24.03.2016 um 17:04 hat Eric Blake geschrieben: > > > > On 03/24/2016 09:53 AM, Wouter Verhelst wrote: > > > > > On Thu, Mar 24, 2016 at 04:33:42PM +0100, Paolo Bonzini wrote: > > > > >> On 24/03/2016 16:25, Eric Blake wrote: > > > > >>>> However, let's make these bits, so that > > > > >>>> > > > > >>>> NBD_STATE_ALLOCATED (0x1), LBA extent is present on the block device > > > > >>>> NBD_STATE_ZERO (0x2), LBA extent will read as zeroes > > > > >>> > > > > >>> Should we flip the sense and call this NBD_STATE_UNALLOCATED (0 means > > > > >>> allocated, 1 means not present), so that an overall status of 0 is a > > > > >>> safe default? > > > > >> > > > > >> Double negations are evil (and don't work the same in all languages), so > > > > >> I think it's a worse option. > > > > > > > > > > I agree that a bit which says "unallocated" is confusing in that manner, > > > > > but that just means we need a better name (one that doesn't contain > > > > > "un-" or "not") > > > > > > > > > > I like the idea of having zero be the "sensible" default, although I'm > > > > > quite unable to come up with a better name myself. > > > > > > > > NBD_STATE_TRIM, perhaps? (0 for present, 1 for trimmed or unallocated); > > > > matches well that we have NBD_CMD_TRIM for requesting the creation of > > > > such a state. > > > > > > How about NBD_STATE_HOLE? > > > > Both will work, although I like NBD_STATE_TRIM slightly better because > > it indeed nicely references NBD_CMD_TRIM. > > I just thought that "trim" sounds more like an action than a status, and > while the reason for a hole to exist can be a previous TRIM command, > another option is that it's an area in an image that just has never been > written to. In that case "trim" would be a misnomer. Point. It could be "TRIMMED" instead, I suppose. > > However, I also think it should then be made clear that issuing > > NBD_CMD_TRIM doesn't *require* that GET_BLOCK returns NBD_STATE_TRIM for > > that region if the backend storage format dosn't support that, to avoid > > confusion later on. > > Good point. That might be another reason for not calling the status > "trim". Also a good point...
On 29/03/2016 11:38, Kevin Wolf wrote: > > > How about NBD_STATE_HOLE? > > > > Both will work, although I like NBD_STATE_TRIM slightly better because > > it indeed nicely references NBD_CMD_TRIM. > > I just thought that "trim" sounds more like an action than a status, and > while the reason for a hole to exist can be a previous TRIM command, > another option is that it's an area in an image that just has never been > written to. In that case "trim" would be a misnomer. I agree with Kevin. My preference is still on NBD_STATE_ALLOCATED, but NBD_STATE_HOLE is a fine name as well. Paolo
Hi, back from my easter vacation. A bit surprised to find 200 mails in the NBD mailing list ;). On Friday 25 March 2016 09:49:29 Wouter Verhelst wrote: > On Thu, Mar 24, 2016 at 04:08:13PM -0600, Eric Blake wrote: > > On 03/23/2016 08:16 AM, Denis V. Lunev wrote: > > > From: Pavel Borzenkov <pborzenkov@virtuozzo.com> > > > > > > With the availability of sparse storage formats, it is often needed to > > > query status of a particular LBA range and read only those blocks of > > > data that are actually present on the block device. > > > > > > To provide such information, the patch adds GET_LBA_STATUS extension > > > with one new NBD_CMD_GET_LBA_STATUS command. > > > > > > > > +* `NBD_CMD_GET_LBA_STATUS` (7) > > > + > > > + An LBA range status query request. Length and offset define the range > > > + of interest. The server MUST reply with a reply header, followed > > > + immediately by the following data: > > > + > > > + - 32 bits, length of parameter data that follow (unsigned) > > > + - zero or more LBA status descriptors, each having the following > > > + structure: > > > + > > > + * 64 bits, offset (unsigned) > > > + * 32 bits, length (unsigned) > > > + * 16 bits, status (unsigned) > > > + > > > + unless an error condition has occurred. > > > > To date, only the NBD_CMD_READ command caused the server to send data > > after the handle in its reply. This would be the second command with a > > data field in the response, but it is returning a variable amount of > > data, not directly tied to the client's length - but at least you did > > make it structured so that the client knows how much to read. However, > > your patch is incomplete; you'll need to edit the "Transmission" section > > of the document to mention the rules on the server sending data, as the > > existing text would now be wrong: > > > > > The server replies with: > > > > > > S: 32 bits, 0x67446698, magic (NBD_REPLY_MAGIC) > > > S: 32 bits, error > > > S: 64 bits, handle > > > S: (length bytes of data if the request is of type NBD_CMD_READ) > > > > You may also want to add a rule that for all future extensions, any > > command that requires data in the server response (other than the server > > response to NBD_CMD_READ) must include a 32-bit length as the first > > field of its data payload, so that the server reply is always structured. > > Right. > > > Hmm, I still think it would be hard to write a wireshark dissector for > > server replies, since there is no indication whether a given reply will > > include data or not. > > Well, a wireshark dissector already exists. However, it is very old; it > doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't > support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't > deal with negotiation at all. It was written when newstyle negotiation > didn't exist yet, and scanning for INIT_PASSWD at the time was too hard, > according to the guy who wrote it (don't remember who that was). > > > The _client_ knows (well, any well-written client > > that uses a different value for 'handle' for every command sent to the > > server), because it can read the returned 'handle' field to know what > > command it matches to and therefore what data to expect; but a > > third-party observer seeing _just_ the server messages has no idea which > > server responses should have payload. > > It can if it knows enough about the protocol, but granted, that makes it > harder for us to extend the protocol without breaking the dissector. > > > Scanning the stream and assuming that a new reply starts each time > > NBD_REPLY_MAGIC is encountered is a close approximation, but hits > > false positives if the data being returned for NBD_CMD_READ contains > > the same magic number as part of its contents. > > Indeed. > > > Obviously, back-compat says we can't change the response to > > NBD_CMD_READ, but that means that a wireshark dissector has to either > > maintain context, or hunt through returned data looking for potential > > magic numbers and possibly hitting false positives. > > > > That said, maybe we want to allow global flag negotiation prior to > > transmission to add a new flag on both server and client side - the > > server would advertise that it is capable of a new reply mode, and the > > client then has to reply that it wants to use the new reply mode; in > > Global flag negotiation is already possible. There is a client flags > field, which is sent before option haggling, that could be used for > negotiation of such backwards-incompatible features. > > > that mode, all server replies starting with magic 0x67446698 > > (NBD_REPLY_MAGIC) will never have a data payload, and all commands that > > cause a reply with payload (both NBD_CMD_READ and the new > > NBD_CMD_GET_LBA_STATUS of this message, or whatever name we give it) > > will reply with a NEW magic number: > > > > S: 32 bits, XXXX, magic (NBD_REPLY_MAGIC2) > > S: 32 bits, error > > S: 64 bits, handle > > S: 32 bits, length > > S: length bytes of data > > I like this. However, before committing to it, I would like to see > Markus' feedback on that (explicit Cc added, even though he's on the > list, too). > > We'd also need a way to communicate the ability to speak this protocol > from the kernel to userspace before telling the server that the client > understands something which maybe its kernel doesn't. > > Markus? > > > so that the server's data stream is fully structured without having to > > maintain context of the client's requests. That is, a dissector can now > > merely scan for both magic numbers; and on a stream between a client and > > server that have negotiated the new mode, the old magic number will > > never have a payload, and the new magic number will always be > > accompanied with a payload that describes how much data to read to the > > boundary of the next reply. > > > > For that matter, right now, NBD_CMD_READ is required to either end the > > session or return length bytes of data even when error is non-zero (but > > where those bytes MAY be invalid); but by adding a negotiated flag for > > structured length replies, it would be possible to allow for short reads > > and/or returning an error with 0 bytes of payload but still keeping the > > connection to the client open, without having to send wasted bytes over > > the wire. > > Yes. This has been discussed on the nbd-general list in the past. There > is also the (significant) problem of the server having maybe already > sent out the header before discovering there is an error, at which point > it can *only* drop the connection. I am still not through all the new mails on the list, so there may be some more discussions about this. But I will answer here simply. I generally like the idea of having this new kind of reply. Is this solving our issues where we want to "stream" data directly from a filedescriptor into a tcp-stream? Does it make sense to extend this for reads with an offset? This way we could not only send in chunks but also order them randomly. Is there any use-case where it does make sense to read data not sequentially? Best Regards, Markus
Hi, On Monday 28 March 2016 09:58:52 Eric Blake wrote: > On 03/25/2016 02:49 AM, Wouter Verhelst wrote: > > >> You may also want to add a rule that for all future extensions, any > >> command that requires data in the server response (other than the server > >> response to NBD_CMD_READ) must include a 32-bit length as the first > >> field of its data payload, so that the server reply is always structured. > > > > Right. > > > >> Hmm, I still think it would be hard to write a wireshark dissector for > >> server replies, since there is no indication whether a given reply will > >> include data or not. > > > > Well, a wireshark dissector already exists. However, it is very old; it > > doesn't support the NBD_CMD_TRIM or NBD_CMD_FLUSH commands, it doesn't > > support the NBD_CMD_FLAG_FUA option to NBD_CMD_WRITE, and it doesn't > > deal with negotiation at all. It was written when newstyle negotiation > > didn't exist yet, and scanning for INIT_PASSWD at the time was too hard, > > according to the guy who wrote it (don't remember who that was). > > And I've never written a wireshark dissector myself, to know how easy or > hard it would be to extend that work. > > > > >> The _client_ knows (well, any well-written client > >> that uses a different value for 'handle' for every command sent to the > >> server), because it can read the returned 'handle' field to know what > >> command it matches to and therefore what data to expect; but a > >> third-party observer seeing _just_ the server messages has no idea which > >> server responses should have payload. > > > > It can if it knows enough about the protocol, but granted, that makes it > > harder for us to extend the protocol without breaking the dissector. > > > >> Scanning the stream and assuming that a new reply starts each time > >> NBD_REPLY_MAGIC is encountered is a close approximation, but hits > >> false positives if the data being returned for NBD_CMD_READ contains > >> the same magic number as part of its contents. > > > > Indeed. > > One benefit of TCP: we can rely on packet boundaries (whether or not > fragmentation is also happening); I'm not sure if UDP shares the same > benefits (then again, I'm not even sure if UDP is usable for the NBD > protocol, since we definitely rely on in-order delivery of packets: a > read command can definitely return more bytes than even a jumbo frame > can contain, even though we do allow out-of-order delivery of replies). > So if the server always sends each reply as the start of its own packet > (rather than coalescing multiple replies into a single network packet), > and the dissector only looks for magic numbers at the start of a packet, > then any server packet not starting with the magic number must be data > payload, and you could potentially even avoid the false positives by > choosing to break data replies into packets by adjusting packet lengths > by one byte any time the next data chunk would otherwise happen to start > with the same two bytes as the magic number. But I haven't actually > tested any of this, to know if packet coalescing goes on, and I > certainly don't think it is worth complicating the server to avoid false > positive magic number detection by splitting read payloads across packet > boundaries, just for the benefit of wire-sniffing. > > > I like this. However, before committing to it, I would like to see > > Markus' feedback on that (explicit Cc added, even though he's on the > > list, too). > > > > We'd also need a way to communicate the ability to speak this protocol > > from the kernel to userspace before telling the server that the client > > understands something which maybe its kernel doesn't. > > proto.md already documents that ioctl()s exist for the user space to > inform the kernel about options sent by the server prior to kicking off > transmission phase, and that the NBD protocol itself does not go into > full detail about available ioctl()s (the kernel/userspace coordination > does not affect the protocol). It seems fairly straightforward for the > kernel to supply another ioctl() that userspace can use to learn whether > it is allowed or forbidden from advertising structured reply status > during the handshake phase (including the case where the ioctl() is not > present being treated as the client must not enable structured replies). Depending on the implementation we may not even need to communicate the used NBD protocol from userspace to the kernel. We have a different magic and the magic stays at the beginning of the message so we can simply use the magic to decide what message we have. Of course that would need another receive which may be too slow. For the other direction, giving the userspace information about the capabilities of the kernel implementation, it may be better to use a sysfs entry? All current ioctls are used for the exact nbd device it is used on. But implementation capabilities are a common property over all nbd devices. Best Regards, Markus
On 03/23/2016 08:16 AM, Denis V. Lunev wrote: > From: Pavel Borzenkov <pborzenkov@virtuozzo.com> > > With the availability of sparse storage formats, it is often needed to > query status of a particular LBA range and read only those blocks of > data that are actually present on the block device. > > To provide such information, the patch adds GET_LBA_STATUS extension > with one new NBD_CMD_GET_LBA_STATUS command. > > There exists a concept of data dirtiness, which is required during, for > example, incremental block device backup. To express this concept via > NBD protocol, this patch also adds additional mode of operation to > NBD_CMD_GET_LBA_STATUS command. > > Since NBD protocol has no notion of block size, and to mimic SCSI "GET > LBA STATUS" command more closely, it has been chosen to return a list of > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a > bitmap. > > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Wouter Verhelst <w@uter.be> > CC: Paolo Bonzini <pbonzini@redhat.com> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > --- > doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 82 insertions(+) I've posted a v2 of this proposal under a new title, rebased on top of the recent work to add structured replies, and trying to take into account a number of the suggestions that occurred in this thread.
On 04/04/2016 04:18 AM, Markus Pargmann wrote: > Hi, > > back from my easter vacation. A bit surprised to find 200 mails in the > NBD mailing list ;). > >> Yes. This has been discussed on the nbd-general list in the past. There >> is also the (significant) problem of the server having maybe already >> sent out the header before discovering there is an error, at which point >> it can *only* drop the connection. > > I am still not through all the new mails on the list, so there may be > some more discussions about this. But I will answer here simply. > > I generally like the idea of having this new kind of reply. Is this > solving our issues where we want to "stream" data directly from a > filedescriptor into a tcp-stream? I'm a relative newcomer on the NBD list, so I'm not sure what the issue was there, but yes, structured replies DOES help with read semantics that encounter a partial error that can be detected only after you have started the reply stream. > > Does it make sense to extend this for reads with an offset? This way we > could not only send in chunks but also order them randomly. Is there any > use-case where it does make sense to read data not sequentially? In fact, that's what already got committed while you were gone. It may help if you jump in and read the current state of proto.md, if you don't want to plow through all the mail churn in the meantime for how we got to the state committed. And of course, if you have suggestions on how to further improve it, the extension is still documented as experimental, so we can further tweak it before releasing upstream NBD or downstream QEMU implementations of the extension (I'm currently coding up the work on a qemu implementation, and will report on anything that I run into during that process).
On 03/23/2016 05:16 PM, Denis V. Lunev wrote: > From: Pavel Borzenkov <pborzenkov@virtuozzo.com> > > With the availability of sparse storage formats, it is often needed to > query status of a particular LBA range and read only those blocks of > data that are actually present on the block device. > > To provide such information, the patch adds GET_LBA_STATUS extension > with one new NBD_CMD_GET_LBA_STATUS command. > > There exists a concept of data dirtiness, which is required during, for > example, incremental block device backup. To express this concept via > NBD protocol, this patch also adds additional mode of operation to > NBD_CMD_GET_LBA_STATUS command. > > Since NBD protocol has no notion of block size, and to mimic SCSI "GET > LBA STATUS" command more closely, it has been chosen to return a list of > extents in the response of NBD_CMD_GET_LBA_STATUS command, instead of a > bitmap. > > Signed-off-by: Pavel Borzenkov <pborzenkov@virtuozzo.com> > Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Wouter Verhelst <w@uter.be> > CC: Paolo Bonzini <pbonzini@redhat.com> > CC: Kevin Wolf <kwolf@redhat.com> > CC: Stefan Hajnoczi <stefanha@redhat.com> > --- > doc/proto.md | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 82 insertions(+) > > diff --git a/doc/proto.md b/doc/proto.md > index cda213c..fff515d 100644 > --- a/doc/proto.md > +++ b/doc/proto.md > @@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle negotiation: > `NBD_CMD_TRIM` commands > - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server > supports `NBD_CMD_WRITE_ZEROES` commands > +- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server > + supports `NBD_CMD_GET_LBA_STATUS` commands > > ##### Client flags > > @@ -477,6 +479,10 @@ The following request types exist: > > Defined by the experimental `WRITE_ZEROES` extension; see below. > > +* `NBD_CMD_GET_LBA_STATUS` (7) > + > + Defined by the experimental `GET_LBA_STATUS` extension; see below. > + > * Other requests > > Some third-party implementations may require additional protocol > @@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a write zeroes request > including one or more sectors beyond the size of the device. It SHOULD > return `EPERM` if it receives a write zeroes request on a read-only export. > > +### `GET_LBA_STATUS` extension > + > +With the availability of sparse storage formats, it is often needed to query > +status of a particular LBA range and read only those blocks of data that are > +actually present on the block device. > + > +Some storage formats and operations over such formats express a concept of > +data dirtiness. Whether the operation is block device mirroring, > +incremental block device backup or any other operation with a concept of > +data dirtiness, they all share a need to provide a list of LBA ranges > +that this particular operation treats as dirty. > + > +To provide such class of information, `GET_LBA_STATUS` extension adds new > +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with > +their respective states. > + > +* `NBD_CMD_GET_LBA_STATUS` (7) > + > + An LBA range status query request. Length and offset define the range > + of interest. The server MUST reply with a reply header, followed > + immediately by the following data: > + > + - 32 bits, length of parameter data that follow (unsigned) > + - zero or more LBA status descriptors, each having the following > + structure: > + > + * 64 bits, offset (unsigned) > + * 32 bits, length (unsigned) > + * 16 bits, status (unsigned) > + > + unless an error condition has occurred. > + > + If an error occurs, the server SHOULD set the appropriate error code > + in the error field. The server MUST then either close the > + connection, or send *length of parameter data* bytes of data > + (which MAY be invalid). > + > + The type of information required by the client is passed to server in the > + command flags field. If the server does not implement requested type or > + have no means to express it, it MUST NOT return an error, but instead MUST > + return a single LBA status descriptor with *offset* and *length* equal to > + the *offset* and *length* from request, and *status* set to `0`. > + > + The following request types are currently defined for the command: > + > + 1. Block provisioning state > + > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > + field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return > + the provisioning state of the device. The following provisionnig states > + are defined for the command: > + > + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; > + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device > + and contains zeroes; > + - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the > + block device. A client MUST NOT make any assumptions about the > + contents of the extent. > + > + 2. Block dirtiness state > + > + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags > + field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return > + the dirtiness status of the device. The following dirtiness states > + are defined for the command: > + > + - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty; > + - `NBD_STATE_CLEAN` (0x1), LBA extent is clean. we can add 'NBD_STATE_DIRTY_DEALLOCATED' (0x2) here as additional hint
On 04/04/2016 02:16 PM, Denis V. Lunev wrote: >> + The following request types are currently defined for the command: >> + >> + 1. Block provisioning state >> + >> + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags >> + field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return >> + the provisioning state of the device. The following provisionnig states >> + are defined for the command: >> + >> + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; >> + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device >> + and contains zeroes; >> + - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the >> + block device. A client MUST NOT make any assumptions about the >> + contents of the extent. > we can add 'NBD_STATE_DIRTY_DEALLOCATED' (0x2) here as additional hint No, DEALLOCATED and HOLE are the same thing, and we really want the status to be a bitwise-OR of flags (bit 0: is it allocated or deallocated. bit 1: is it unknown content or all 0), rather than a set of 3 states, since it really is possible to have all four combinations of those two orthognal status information. That was one of the topics already hashed out in the v1 conversation, and fixed in v2.
On Mon, Apr 04, 2016 at 12:18:37PM +0200, Markus Pargmann wrote: > Hi, > > back from my easter vacation. A bit surprised to find 200 mails in the > NBD mailing list ;). I'm sure you were :-) [...] > > Yes. This has been discussed on the nbd-general list in the past. There > > is also the (significant) problem of the server having maybe already > > sent out the header before discovering there is an error, at which point > > it can *only* drop the connection. > > I am still not through all the new mails on the list, so there may be > some more discussions about this. But I will answer here simply. > > I generally like the idea of having this new kind of reply. Is this > solving our issues where we want to "stream" data directly from a > filedescriptor into a tcp-stream? The eventual suggestion does, yes (as I'm sure you've found out by now). > Does it make sense to extend this for reads with an offset? I'm not sure what you mean by that. "reads with an offset"? Don't all our reads have an offset? > This way we could not only send in chunks but also order them randomly. Is > there any use-case where it does make sense to read data not sequentially? If you just mean "read replies with offset", then yes :-)
diff --git a/doc/proto.md b/doc/proto.md index cda213c..fff515d 100644 --- a/doc/proto.md +++ b/doc/proto.md @@ -243,6 +243,8 @@ immediately after the global flags field in oldstyle negotiation: `NBD_CMD_TRIM` commands - bit 6, `NBD_FLAG_SEND_WRITE_ZEROES`; should be set to 1 if the server supports `NBD_CMD_WRITE_ZEROES` commands +- bit 7, `NBD_FLAG_SEND_GET_LBA_STATUS`; should be set to 1 if the server + supports `NBD_CMD_GET_LBA_STATUS` commands ##### Client flags @@ -477,6 +479,10 @@ The following request types exist: Defined by the experimental `WRITE_ZEROES` extension; see below. +* `NBD_CMD_GET_LBA_STATUS` (7) + + Defined by the experimental `GET_LBA_STATUS` extension; see below. + * Other requests Some third-party implementations may require additional protocol @@ -638,6 +644,82 @@ The server SHOULD return `ENOSPC` if it receives a write zeroes request including one or more sectors beyond the size of the device. It SHOULD return `EPERM` if it receives a write zeroes request on a read-only export. +### `GET_LBA_STATUS` extension + +With the availability of sparse storage formats, it is often needed to query +status of a particular LBA range and read only those blocks of data that are +actually present on the block device. + +Some storage formats and operations over such formats express a concept of +data dirtiness. Whether the operation is block device mirroring, +incremental block device backup or any other operation with a concept of +data dirtiness, they all share a need to provide a list of LBA ranges +that this particular operation treats as dirty. + +To provide such class of information, `GET_LBA_STATUS` extension adds new +`NBD_CMD_GET_LBA_STATUS` command which returns a list of LBA ranges with +their respective states. + +* `NBD_CMD_GET_LBA_STATUS` (7) + + An LBA range status query request. Length and offset define the range + of interest. The server MUST reply with a reply header, followed + immediately by the following data: + + - 32 bits, length of parameter data that follow (unsigned) + - zero or more LBA status descriptors, each having the following + structure: + + * 64 bits, offset (unsigned) + * 32 bits, length (unsigned) + * 16 bits, status (unsigned) + + unless an error condition has occurred. + + If an error occurs, the server SHOULD set the appropriate error code + in the error field. The server MUST then either close the + connection, or send *length of parameter data* bytes of data + (which MAY be invalid). + + The type of information required by the client is passed to server in the + command flags field. If the server does not implement requested type or + have no means to express it, it MUST NOT return an error, but instead MUST + return a single LBA status descriptor with *offset* and *length* equal to + the *offset* and *length* from request, and *status* set to `0`. + + The following request types are currently defined for the command: + + 1. Block provisioning state + + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags + field set to `NBD_FLAG_GET_ALLOCATED` (0x0), the server MUST return + the provisioning state of the device. The following provisionnig states + are defined for the command: + + - `NBD_STATE_ALLOCATED` (0x0), LBA extent is present on the block device; + - `NBD_STATE_ZEROED` (0x1), LBA extent is present on the block device + and contains zeroes; + - `NBD_STATE_DEALLOCATED` (0x2), LBA extent is not present on the + block device. A client MUST NOT make any assumptions about the + contents of the extent. + + 2. Block dirtiness state + + Upon receiving an `NBD_CMD_GET_LBA_STATUS` command with command flags + field set to `NBD_FLAG_GET_DIRTY` (0x1), the server MUST return + the dirtiness status of the device. The following dirtiness states + are defined for the command: + + - `NBD_STATE_DIRTY` (0x0), LBA extent is dirty; + - `NBD_STATE_CLEAN` (0x1), LBA extent is clean. + + Generic NBD client implementation without knowledge of a particular NBD + server operation MUST NOT make any assumption on the meaning of the + NBD_STATE_DIRTY or NBD_STATE_CLEAN states. + +The server SHOULD return `EINVAL` if it receives a `GET_LBA_STATUS` request +including one or more sectors beyond the size of the device. + ## About this file This file tries to document the NBD protocol as it is currently