hw/block/nvme: fix zone write finalize

Message ID	20210112094235.188686-1-its@irrelevant.dk
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: Klaus Jensen <its@irrelevant.dk> To: qemu-devel@nongnu.org Subject: [PATCH] hw/block/nvme: fix zone write finalize Date: Tue, 12 Jan 2021 10:42:35 +0100 Message-Id: <20210112094235.188686-1-its@irrelevant.dk> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=64.147.123.20; envelope-from=its@irrelevant.dk; helo=wout4-smtp.messagingengine.com X-Spam_score_int: -27 X-Spam_score: -2.8 X-Spam_bar: -- X-Spam_report: (-2.8 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Cc: Kevin Wolf <kwolf@redhat.com>, qemu-block@nongnu.org, Dmitry Fomichev <dmitry.fomichev@wdc.com>, Klaus Jensen <k.jensen@samsung.com>, Max Reitz <mreitz@redhat.com>, Klaus Jensen <its@irrelevant.dk>, Keith Busch <kbusch@kernel.org> Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
Series	hw/block/nvme: fix zone write finalize \| expand hw/block/nvme: fix zone write finalize

Message ID

20210112094235.188686-1-its@irrelevant.dk

State

New

Headers

From: Klaus Jensen <its@irrelevant.dk>
To: qemu-devel@nongnu.org
Subject: [PATCH] hw/block/nvme: fix zone write finalize
Date: Tue, 12 Jan 2021 10:42:35 +0100
Message-Id: <20210112094235.188686-1-its@irrelevant.dk>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Received-SPF: pass client-ip=64.147.123.20; envelope-from=its@irrelevant.dk;
 helo=wout4-smtp.messagingengine.com
X-Spam_score_int: -27
X-Spam_score: -2.8
X-Spam_bar: --
X-Spam_report: (-2.8 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1,
 DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1,
 RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001,
 SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <https://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
 <mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Cc: Kevin Wolf <kwolf@redhat.com>, qemu-block@nongnu.org,
 Dmitry Fomichev <dmitry.fomichev@wdc.com>,
 Klaus Jensen <k.jensen@samsung.com>,
 Max Reitz <mreitz@redhat.com>, Klaus Jensen <its@irrelevant.dk>,
 Keith Busch <kbusch@kernel.org>
Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
Sender: "Qemu-devel"
 <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>

Series

hw/block/nvme: fix zone write finalize | expand

Commit Message

Klaus Jensen Jan. 12, 2021, 9:42 a.m. UTC

From: Klaus Jensen <k.jensen@samsung.com>

The zone write pointer is unconditionally advanced, even for write
faults. Make sure that the zone is always transitioned to Full if the
write pointer reaches zone capacity.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Cc: Dmitry Fomichev <dmitry.fomichev@wdc.com>
---
 hw/block/nvme.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Keith Busch Jan. 14, 2021, 11:03 p.m. UTC | #1

On Tue, Jan 12, 2021 at 10:42:35AM +0100, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> The zone write pointer is unconditionally advanced, even for write
> faults. Make sure that the zone is always transitioned to Full if the
> write pointer reaches zone capacity.

Looks like some spec weirdness. It says we can transition to full:

  b) as a result of successful completion of a write operation that
     writes one or more logical blocks that causes the zone to reach its
     writeable zone capacity;

But a failed write also advances the write pointer as you've pointed
out, so they might want to strike "successful".

Looks fine to me.

Reviewed-by: Keith Busch <kbusch@kernel.org>

> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> Cc: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 0854ee307227..280b31b4459d 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1268,10 +1268,13 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
>      nlb = le16_to_cpu(rw->nlb) + 1;
>      zone = nvme_get_zone_by_slba(ns, slba);
>  
> +    zone->d.wp += nlb;
> +
>      if (failed) {
>          res->slba = 0;
> -        zone->d.wp += nlb;
> -    } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) {
> +    }
> +
> +    if (zone->d.wp == nvme_zone_wr_boundary(zone)) {

The previous check was using 'zone->w_ptr', but now it's 'zone->d.wp'.
As far as I can tell, this difference will mean the zone won't finalize
until the last write completes, where before it could finalize after the
zone's last write is submitted. Either way looks okay, but I think these
two values ought to always be in sync.

>          switch (nvme_get_zone_state(zone)) {
>          case NVME_ZONE_STATE_IMPLICITLY_OPEN:
>          case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> @@ -1288,9 +1291,6 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
>          default:
>              assert(false);
>          }
> -        zone->d.wp = zone->w_ptr;
> -    } else {
> -        zone->d.wp += nlb;
>      }
>  }

Klaus Jensen Jan. 15, 2021, 6:01 a.m. UTC | #2

On Jan 14 15:03, Keith Busch wrote:
> On Tue, Jan 12, 2021 at 10:42:35AM +0100, Klaus Jensen wrote:
> > From: Klaus Jensen <k.jensen@samsung.com>
> > 
> > The zone write pointer is unconditionally advanced, even for write
> > faults. Make sure that the zone is always transitioned to Full if the
> > write pointer reaches zone capacity.
> 
> Looks like some spec weirdness. It says we can transition to full:
> 
>   b) as a result of successful completion of a write operation that
>      writes one or more logical blocks that causes the zone to reach its
>      writeable zone capacity;
> 
> But a failed write also advances the write pointer as you've pointed
> out, so they might want to strike "successful".
> 

Yes, its slightly inconsistent.

Empty/Closed to Opened is also kinda fuzzy in the spec. The wording is
slightly different when transitioning from Empty and Closed to
Implicitly Opened. ZSE->ZSIO just says "and a write operation writes on
or more logical blocks of that zone". ZSC->ZSIO says "and a write
operation that writes one or more logical blocks to the zone completes
succesfully". This has given me some headaches on if the transition
should occur when "preparing/submitting" the write or when the write
completes. But I guess this is pretty much an implementation detail of
the device.

I should mention this to my representatives ;)

> Looks fine to me.
> 
> Reviewed-by: Keith Busch <kbusch@kernel.org>
> 
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > Cc: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> >  hw/block/nvme.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 0854ee307227..280b31b4459d 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1268,10 +1268,13 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
> >      nlb = le16_to_cpu(rw->nlb) + 1;
> >      zone = nvme_get_zone_by_slba(ns, slba);
> >  
> > +    zone->d.wp += nlb;
> > +
> >      if (failed) {
> >          res->slba = 0;
> > -        zone->d.wp += nlb;
> > -    } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) {
> > +    }
> > +
> > +    if (zone->d.wp == nvme_zone_wr_boundary(zone)) {
> 
> The previous check was using 'zone->w_ptr', but now it's 'zone->d.wp'.
> As far as I can tell, this difference will mean the zone won't finalize
> until the last write completes, where before it could finalize after the
> zone's last write is submitted. Either way looks okay, but I think these
> two values ought to always be in sync.
> 

Right - the zone will transition to Full on the last completed write. I
really not sure if this models devices better or not. But I would argue
that a real device would not relinquish resources until the last write
completes.

An issue is that for QD>1 (zone append), we can easily end up with
appends A and B completing in reversed order such that d.wp is advanced
by B_nlb, but we left a "hole" of A_nlb in the zone because that write
has not completed yet.

But - I think *some* inconsistency on the write pointer value is to be
expected when using Zone Append. The host can only ever depend on a
consistent value of the write pointer by quiescing I/O to the zone and
then getting a zone report - and d.wp and d.w_ptr will always
converge in that case.

Klaus Jensen Jan. 20, 2021, 9:14 a.m. UTC | #3

On Jan 12 10:42, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> The zone write pointer is unconditionally advanced, even for write
> faults. Make sure that the zone is always transitioned to Full if the
> write pointer reaches zone capacity.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> Cc: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> ---
>  hw/block/nvme.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 0854ee307227..280b31b4459d 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1268,10 +1268,13 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
>      nlb = le16_to_cpu(rw->nlb) + 1;
>      zone = nvme_get_zone_by_slba(ns, slba);
>  
> +    zone->d.wp += nlb;
> +
>      if (failed) {
>          res->slba = 0;
> -        zone->d.wp += nlb;
> -    } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) {
> +    }
> +
> +    if (zone->d.wp == nvme_zone_wr_boundary(zone)) {
>          switch (nvme_get_zone_state(zone)) {
>          case NVME_ZONE_STATE_IMPLICITLY_OPEN:
>          case NVME_ZONE_STATE_EXPLICITLY_OPEN:
> @@ -1288,9 +1291,6 @@ static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
>          default:
>              assert(false);
>          }
> -        zone->d.wp = zone->w_ptr;
> -    } else {
> -        zone->d.wp += nlb;
>      }
>  }
>  

Applied to nvme-next.

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 0854ee307227..280b31b4459d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1268,10 +1268,13 @@  static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
     nlb = le16_to_cpu(rw->nlb) + 1;
     zone = nvme_get_zone_by_slba(ns, slba);
 
+    zone->d.wp += nlb;
+
     if (failed) {
         res->slba = 0;
-        zone->d.wp += nlb;
-    } else if (zone->w_ptr == nvme_zone_wr_boundary(zone)) {
+    }
+
+    if (zone->d.wp == nvme_zone_wr_boundary(zone)) {
         switch (nvme_get_zone_state(zone)) {
         case NVME_ZONE_STATE_IMPLICITLY_OPEN:
         case NVME_ZONE_STATE_EXPLICITLY_OPEN:
@@ -1288,9 +1291,6 @@  static void nvme_finalize_zoned_write(NvmeNamespace *ns, NvmeRequest *req,
         default:
             assert(false);
         }
-        zone->d.wp = zone->w_ptr;
-    } else {
-        zone->d.wp += nlb;
     }
 }

hw/block/nvme: fix zone write finalize

Commit Message

Comments

Patch