Message ID | 20210112094235.188686-1-its@irrelevant.dk |
---|---|
State | New |
Headers | show |
Series | hw/block/nvme: fix zone write finalize | expand |
On Tue, Jan 12, 2021 at 10:42:35AM +0100, Klaus Jensen wrote: > From: Klaus Jensen <k.jensen@samsung.com> > > The zone write pointer is unconditionally advanced, even for write > faults. Make sure that the zone is always transitioned to Full if the > write pointer reaches zone capacity. Looks like some spec weirdness. It says we can transition to full: b) as a result of successful completion of a write operation that writes one or more logical blocks that causes the zone to reach its writeable zone capacity; But a failed write also advances the write pointer as you've pointed out, so they might want to strike "successful". Looks fine to me. Reviewed-by: Keith Busch <kbusch@kernel.org> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com> > Cc: Dmitry Fomichev <dmitry.fomichev@wdc.com> > --- > hw/block/nvme.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index 0854ee307227..280b31b4459d 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -1268,10 +1268,13 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req, > nlb = le16_to_cpu(rw->nlb) + 1; > zone = nvme_get_zone_by_slba(ns, slba); > > + zone->d.wp += nlb; > + > if (failed) { > res->slba = 0; > - zone->d.wp += nlb; > - } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) { > + } > + > + if (zone->d.wp == nvme_zone_wr_boundary(zone)) { The previous check was using 'zone->w_ptr', but now it's 'zone->d.wp'. As far as I can tell, this difference will mean the zone won't finalize until the last write completes, where before it could finalize after the zone's last write is submitted. Either way looks okay, but I think these two values ought to always be in sync. > switch (nvme_get_zone_state(zone)) { > case NVME_ZONE_STATE_IMPLICITLY_OPEN: > case NVME_ZONE_STATE_EXPLICITLY_OPEN: > @@ -1288,9 +1291,6 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req, > default: > assert(false); > } > - zone->d.wp = zone->w_ptr; > - } else { > - zone->d.wp += nlb; > } > }
On Jan 14 15:03, Keith Busch wrote: > On Tue, Jan 12, 2021 at 10:42:35AM +0100, Klaus Jensen wrote: > > From: Klaus Jensen <k.jensen@samsung.com> > > > > The zone write pointer is unconditionally advanced, even for write > > faults. Make sure that the zone is always transitioned to Full if the > > write pointer reaches zone capacity. > > Looks like some spec weirdness. It says we can transition to full: > > b) as a result of successful completion of a write operation that > writes one or more logical blocks that causes the zone to reach its > writeable zone capacity; > > But a failed write also advances the write pointer as you've pointed > out, so they might want to strike "successful". > Yes, its slightly inconsistent. Empty/Closed to Opened is also kinda fuzzy in the spec. The wording is slightly different when transitioning from Empty and Closed to Implicitly Opened. ZSE->ZSIO just says "and a write operation writes on or more logical blocks of that zone". ZSC->ZSIO says "and a write operation that writes one or more logical blocks to the zone completes succesfully". This has given me some headaches on if the transition should occur when "preparing/submitting" the write or when the write completes. But I guess this is pretty much an implementation detail of the device. I should mention this to my representatives ;) > Looks fine to me. > > Reviewed-by: Keith Busch <kbusch@kernel.org> > > > Signed-off-by: Klaus Jensen <k.jensen@samsung.com> > > Cc: Dmitry Fomichev <dmitry.fomichev@wdc.com> > > --- > > hw/block/nvme.c | 10 +++++----- > > 1 file changed, 5 insertions(+), 5 deletions(-) > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > index 0854ee307227..280b31b4459d 100644 > > --- a/hw/block/nvme.c > > +++ b/hw/block/nvme.c > > @@ -1268,10 +1268,13 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req, > > nlb = le16_to_cpu(rw->nlb) + 1; > > zone = nvme_get_zone_by_slba(ns, slba); > > > > + zone->d.wp += nlb; > > + > > if (failed) { > > res->slba = 0; > > - zone->d.wp += nlb; > > - } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) { > > + } > > + > > + if (zone->d.wp == nvme_zone_wr_boundary(zone)) { > > The previous check was using 'zone->w_ptr', but now it's 'zone->d.wp'. > As far as I can tell, this difference will mean the zone won't finalize > until the last write completes, where before it could finalize after the > zone's last write is submitted. Either way looks okay, but I think these > two values ought to always be in sync. > Right - the zone will transition to Full on the last completed write. I really not sure if this models devices better or not. But I would argue that a real device would not relinquish resources until the last write completes. An issue is that for QD>1 (zone append), we can easily end up with appends A and B completing in reversed order such that d.wp is advanced by B_nlb, but we left a "hole" of A_nlb in the zone because that write has not completed yet. But - I think *some* inconsistency on the write pointer value is to be expected when using Zone Append. The host can only ever depend on a consistent value of the write pointer by quiescing I/O to the zone and then getting a zone report - and d.wp and d.w_ptr will always converge in that case.
On Jan 12 10:42, Klaus Jensen wrote: > From: Klaus Jensen <k.jensen@samsung.com> > > The zone write pointer is unconditionally advanced, even for write > faults. Make sure that the zone is always transitioned to Full if the > write pointer reaches zone capacity. > > Signed-off-by: Klaus Jensen <k.jensen@samsung.com> > Cc: Dmitry Fomichev <dmitry.fomichev@wdc.com> > --- > hw/block/nvme.c | 10 +++++----- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index 0854ee307227..280b31b4459d 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -1268,10 +1268,13 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req, > nlb = le16_to_cpu(rw->nlb) + 1; > zone = nvme_get_zone_by_slba(ns, slba); > > + zone->d.wp += nlb; > + > if (failed) { > res->slba = 0; > - zone->d.wp += nlb; > - } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) { > + } > + > + if (zone->d.wp == nvme_zone_wr_boundary(zone)) { > switch (nvme_get_zone_state(zone)) { > case NVME_ZONE_STATE_IMPLICITLY_OPEN: > case NVME_ZONE_STATE_EXPLICITLY_OPEN: > @@ -1288,9 +1291,6 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req, > default: > assert(false); > } > - zone->d.wp = zone->w_ptr; > - } else { > - zone->d.wp += nlb; > } > } > Applied to nvme-next.
diff --git a/hw/block/nvme.c b/hw/block/nvme.c index 0854ee307227..280b31b4459d 100644 --- a/hw/block/nvme.c +++ b/hw/block/nvme.c @@ -1268,10 +1268,13 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req, nlb = le16_to_cpu(rw->nlb) + 1; zone = nvme_get_zone_by_slba(ns, slba); + zone->d.wp += nlb; + if (failed) { res->slba = 0; - zone->d.wp += nlb; - } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) { + } + + if (zone->d.wp == nvme_zone_wr_boundary(zone)) { switch (nvme_get_zone_state(zone)) { case NVME_ZONE_STATE_IMPLICITLY_OPEN: case NVME_ZONE_STATE_EXPLICITLY_OPEN: @@ -1288,9 +1291,6 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req, default: assert(false); } - zone->d.wp = zone->w_ptr; - } else { - zone->d.wp += nlb; } }