Message ID | 20201217210222.779619-1-its@irrelevant.dk |
---|---|
Headers | show |
Series | hw/block/nvme: dif-based end-to-end data protection support | expand |
On Thu, Dec 17, 2020 at 10:02:19PM +0100, Klaus Jensen wrote: > From: Klaus Jensen <k.jensen@samsung.com> > > This series adds support for extended LBAs and end-to-end data > protection. Marked RFC, since there are a bunch of issues that could use > some discussion. > > Storing metadata bytes contiguously with the logical block data and > creating a physically extended logical block basically breaks the DULBE > and deallocation support I just added. Formatting a namespace with > protection information requires the app- and reftags of deallocated or > unwritten blocks to be 0xffff and 0xffffffff respectively; this could be > used to reintroduce DULBE support in that case, albeit at a somewhat > higher cost than the block status flag-based approach. > > There is basically three ways of storing metadata (and maybe a forth, > but that is probably quite the endeavour): > > 1. Storing metadata as extended blocks directly on the blockdev. That > is the approach used in this RFC. > > 2. Use a separate blockdev. Incidentially, this is also the easiest > and most straightforward solution to support MPTR-based "separate > metadata". This also allows DULBE and block deallocation to be > supported using the existing approach. > > 3. A hybrid of 1 and 2 where the metadata is stored contiguously at > the end of the nvme-ns blockdev. > > Option 1 obviously works well with DIF-based protection information and > extended LBAs since it maps one to one. Option 2 works flawlessly with > MPTR-based metadata, but extended LBAs can be "emulated" at the cost of > a bunch of scatter/gather operations. Are there any actual users of extended metadata that we care about? I'm aware of only a few niche places that can even access an extended metadata format. There's not kernel support in any major OS that I know of. Option 2 sounds fine. If option 3 means that you're still using MPTR, but just sequester space at the end of the backing block device for meta-data purposes, then that is fine too. You can even resize it dynamically if you want to support different metadata sizes. > The 4th option is extending an existing image format (QCOW2) or create > something on top of RAW to supports metadata bytes per block. But both > approaches require full API support through the block layer. And > probably a lot of other stuff that I did not think about. It definitely sounds appealing to push the feature to a lower level if you're really willing to see that through. In any case, calculating T10 CRCs is *really* slow unless you have special hardware and software support for it.
On Dec 17 13:14, Keith Busch wrote: > On Thu, Dec 17, 2020 at 10:02:19PM +0100, Klaus Jensen wrote: > > From: Klaus Jensen <k.jensen@samsung.com> > > > > This series adds support for extended LBAs and end-to-end data > > protection. Marked RFC, since there are a bunch of issues that could use > > some discussion. > > > > Storing metadata bytes contiguously with the logical block data and > > creating a physically extended logical block basically breaks the DULBE > > and deallocation support I just added. Formatting a namespace with > > protection information requires the app- and reftags of deallocated or > > unwritten blocks to be 0xffff and 0xffffffff respectively; this could be > > used to reintroduce DULBE support in that case, albeit at a somewhat > > higher cost than the block status flag-based approach. > > > > There is basically three ways of storing metadata (and maybe a forth, > > but that is probably quite the endeavour): > > > > 1. Storing metadata as extended blocks directly on the blockdev. That > > is the approach used in this RFC. > > > > 2. Use a separate blockdev. Incidentially, this is also the easiest > > and most straightforward solution to support MPTR-based "separate > > metadata". This also allows DULBE and block deallocation to be > > supported using the existing approach. > > > > 3. A hybrid of 1 and 2 where the metadata is stored contiguously at > > the end of the nvme-ns blockdev. > > > > Option 1 obviously works well with DIF-based protection information and > > extended LBAs since it maps one to one. Option 2 works flawlessly with > > MPTR-based metadata, but extended LBAs can be "emulated" at the cost of > > a bunch of scatter/gather operations. > > Are there any actual users of extended metadata that we care about? I'm > aware of only a few niche places that can even access an extended > metadata format. There's not kernel support in any major OS that I know > of. > Yes, there are definitely actual users in enterprise storage. But the main use case here is testing (using extended LBAs with SPDK for instance). > Option 2 sounds fine. > > If option 3 means that you're still using MPTR, but just sequester space > at the end of the backing block device for meta-data purposes, then that > is fine too. You can even resize it dynamically if you want to support > different metadata sizes. Heh, I tend to think that my English vocabulary is pretty decent but I had to look up 'sequester'. I just learned a new word today \o/ Yes. I actually also like option 3. Technically option 2 does not break image interoperability between devices (ide, virtio), but you would leave out a bunch of metadata that your application might depend on, so I don't see any way to not break interoperability really. And I would then be just fine with "emulating" extended LBAs at the cost of more I/O. Because I would like the device to support that mode of operation as well. We have not implemented this, but my gut feeling says that it can be done. > > > The 4th option is extending an existing image format (QCOW2) or create > > something on top of RAW to supports metadata bytes per block. But both > > approaches require full API support through the block layer. And > > probably a lot of other stuff that I did not think about. > > It definitely sounds appealing to push the feature to a lower level if > you're really willing to see that through. > Yes, its super appealing and I would like to have some input from the block layer guys on this. That is, if anyone has ever explored it? > In any case, calculating T10 CRCs is *really* slow unless you have > special hardware and software support for it. > Yeah. I know this is super slow. For for emulation and testing purposes I think it is a nice feature for the device to optionally offer.
On Fri, Dec 18, 2020 at 10:39:01AM +0100, Klaus Jensen wrote: > On Dec 17 13:14, Keith Busch wrote: > > On Thu, Dec 17, 2020 at 10:02:19PM +0100, Klaus Jensen wrote: > > > > Are there any actual users of extended metadata that we care about? I'm > > aware of only a few niche places that can even access an extended > > metadata format. There's not kernel support in any major OS that I know > > of. > > > > Yes, there are definitely actual users in enterprise storage. But the > main use case here is testing (using extended LBAs with SPDK for > instance). Fair enough. And just to make sure we're coverging on the same nomenclature, spec suggests "extended" metadata means the interleaved format that does not use the MPTR field. Metadata with the MPTR field is referred to as "separate". I'm only mentioning this because I've been in confused conversations where "extended LBA" interchangably meant either method. > Yes. I actually also like option 3. Technically option 2 does not break > image interoperability between devices (ide, virtio), but you would > leave out a bunch of metadata that your application might depend on, so > I don't see any way to not break interoperability really. Either is fine. If you're switching metadata modes through your qemu parameters, you could think of this as a firmware change, which doesn't guarantee the same LBA formats. > And I would then be just fine with "emulating" extended LBAs at the cost > of more I/O. Because I would like the device to support that mode of > operation as well. We have not implemented this, but my gut feeling says > that it can be done. It can. My qemu tree from way back did this, but infradead.org lost it all and never recovered. I didn't retain a local copy either, but starting from scratch is probably the best course anyway. > > In any case, calculating T10 CRCs is *really* slow unless you have > > special hardware and software support for it. > > > > Yeah. I know this is super slow. For for emulation and testing purposes > I think it is a nice feature for the device to optionally offer. Bonus if you want to implement this with PCLMULQDQ support in x86-64 hosts. For reference, here's the linux kernel's implementation: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/crypto/crct10dif-pcl-asm_64.S I wouldn't necessarily tie metadata support to T10DIF, though, since it has uses beyond protection info.
From: Klaus Jensen <k.jensen@samsung.com> This series adds support for extended LBAs and end-to-end data protection. Marked RFC, since there are a bunch of issues that could use some discussion. Storing metadata bytes contiguously with the logical block data and creating a physically extended logical block basically breaks the DULBE and deallocation support I just added. Formatting a namespace with protection information requires the app- and reftags of deallocated or unwritten blocks to be 0xffff and 0xffffffff respectively; this could be used to reintroduce DULBE support in that case, albeit at a somewhat higher cost than the block status flag-based approach. There is basically three ways of storing metadata (and maybe a forth, but that is probably quite the endeavour): 1. Storing metadata as extended blocks directly on the blockdev. That is the approach used in this RFC. 2. Use a separate blockdev. Incidentially, this is also the easiest and most straightforward solution to support MPTR-based "separate metadata". This also allows DULBE and block deallocation to be supported using the existing approach. 3. A hybrid of 1 and 2 where the metadata is stored contiguously at the end of the nvme-ns blockdev. Option 1 obviously works well with DIF-based protection information and extended LBAs since it maps one to one. Option 2 works flawlessly with MPTR-based metadata, but extended LBAs can be "emulated" at the cost of a bunch of scatter/gather operations. The 4th option is extending an existing image format (QCOW2) or create something on top of RAW to supports metadata bytes per block. But both approaches require full API support through the block layer. And probably a lot of other stuff that I did not think about. Anyway, we would love some comments on this. Gollu Appalanaidu (2): nvme: add support for extended LBAs hw/block/nvme: end-to-end data protection Klaus Jensen (1): hw/block/nvme: refactor nvme_dma hw/block/nvme-ns.h | 22 +- hw/block/nvme.h | 36 +++ include/block/nvme.h | 24 +- hw/block/nvme-ns.c | 66 ++++- hw/block/nvme.c | 616 ++++++++++++++++++++++++++++++++++++++---- hw/block/trace-events | 10 + 6 files changed, 704 insertions(+), 70 deletions(-)