hw/nvme: fix handling of over-committed queues

Message ID	20241025-issue-2388-v1-1-16707e0d3342@samsung.com
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> Feedback-ID: idc91472f:Fastmail From: Klaus Jensen <its@irrelevant.dk> Date: Fri, 25 Oct 2024 12:50:45 +0200 Subject: [PATCH] hw/nvme: fix handling of over-committed queues MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Message-Id: <20241025-issue-2388-v1-1-16707e0d3342@samsung.com> To: Keith Busch <kbusch@kernel.org>, Klaus Jensen <its@irrelevant.dk>, Jesper Devantier <foss@defmacro.it>, qemu-block@nongnu.org, qemu-devel@nongnu.org Cc: Klaus Jensen <k.jensen@samsung.com>, qemu-stable@nongnu.org, Waldemar Kozaczuk <jwkozaczuk@gmail.com> Received-SPF: pass client-ip=103.168.172.148; envelope-from=its@irrelevant.dk; helo=fout-a5-smtp.messagingengine.com X-Spam_score_int: -27 X-Spam_score: -2.8 X-Spam_bar: -- X-Spam_report: (-2.8 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_LOW=-0.7, RCVD_IN_VALIDITY_CERTIFIED_BLOCKED=0.001, RCVD_IN_VALIDITY_RPBL_BLOCKED=0.001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
Series	hw/nvme: fix handling of over-committed queues \| expand hw/nvme: fix handling of over-committed queues

Klaus Jensen Oct. 25, 2024, 10:50 a.m. UTC

From: Klaus Jensen <k.jensen@samsung.com>

If a host chooses to use the SQHD "hint" in the CQE to know if there is
room in the submission queue for additional commands, it may result in a
situation where there are not enough internal resources (struct
NvmeRequest) available to process the command. For a lack of a better
term, the host may "over-commit" the device (i.e., it may have more
inflight commands than the queue size).

For example, assume a queue with N entries. The host submits N commands
and all are picked up for processing, advancing the head and emptying
the queue. Regardless of which of these N commands complete first, the
SQHD field of that CQE will indicate to the host that the queue is
empty, which allows the host to issue N commands again. However, if the
device has not posted CQEs for all the previous commands yet, the device
will have less than N resources available to process the commands, so
queue processing is suspended.

And here lies an 11 year latent bug. In the absense of any additional
tail updates on the submission queue, we never schedule the processing
bottom-half again unless we observe a head update on an associated full
completion queue. This has been sufficient to handle N-to-1 SQ/CQ setups
(in the absense of over-commit of course). Incidentially, that "kick all
associated SQs" mechanism can now be killed since we now just schedule
queue processing when we return a processing resource to a non-empty
submission queue, which happens to cover both edge cases.

So, apparently, no previous driver tested with hw/nvme has ever used
SQHD (e.g., neither the Linux NVMe driver or SPDK uses it). But then OSv
shows up with the driver that actually does. I salute you.

Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface")
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388
Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/nvme/ctrl.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)


---
base-commit: e67b7aef7c7f67ecd0282e903e0daff806d5d680
change-id: 20241025-issue-2388-bd047487f74c

Best regards,

Waldek Kozaczuk Oct. 25, 2024, 4:20 p.m. UTC | #1

Hi,

I have applied this patch to the same QEMU source tree where I reproduced
this issue. I changed the queue size to 3 on the OSv side to trigger this
bug, but unfortunately, I still see the same behavior of the OSv guest
hanging.

Regards,
Waldemar Kozaczuk

On Fri, Oct 25, 2024 at 6:51 AM Klaus Jensen <its@irrelevant.dk> wrote:

> From: Klaus Jensen <k.jensen@samsung.com>
>
> If a host chooses to use the SQHD "hint" in the CQE to know if there is
> room in the submission queue for additional commands, it may result in a
> situation where there are not enough internal resources (struct
> NvmeRequest) available to process the command. For a lack of a better
> term, the host may "over-commit" the device (i.e., it may have more
> inflight commands than the queue size).
>
> For example, assume a queue with N entries. The host submits N commands
> and all are picked up for processing, advancing the head and emptying
> the queue. Regardless of which of these N commands complete first, the
> SQHD field of that CQE will indicate to the host that the queue is
> empty, which allows the host to issue N commands again. However, if the
> device has not posted CQEs for all the previous commands yet, the device
> will have less than N resources available to process the commands, so
> queue processing is suspended.
>
> And here lies an 11 year latent bug. In the absense of any additional
> tail updates on the submission queue, we never schedule the processing
> bottom-half again unless we observe a head update on an associated full
> completion queue. This has been sufficient to handle N-to-1 SQ/CQ setups
> (in the absense of over-commit of course). Incidentially, that "kick all
> associated SQs" mechanism can now be killed since we now just schedule
> queue processing when we return a processing resource to a non-empty
> submission queue, which happens to cover both edge cases.
>
> So, apparently, no previous driver tested with hw/nvme has ever used
> SQHD (e.g., neither the Linux NVMe driver or SPDK uses it). But then OSv
> shows up with the driver that actually does. I salute you.
>
> Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface")
> Cc: qemu-stable@nongnu.org
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388
> Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/nvme/ctrl.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index
> f4e89203c1a6e3b051fd7185cbf01ec9bae9684a..b13585c4da911b9e8ae4a722761fd85dfa24be4d
> 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
>              stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
>              break;
>          }
> +
>          QTAILQ_REMOVE(&cq->req_list, req, entry);
> +
>          nvme_inc_cq_tail(cq);
>          nvme_sg_unmap(&req->sg);
> +
> +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> +            qemu_bh_schedule(sq->bh);
> +        }
> +
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }
>      if (cq->tail != cq->head) {
> @@ -7950,7 +7957,6 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr
> addr, int val)
>          /* Completion queue doorbell write */
>
>          uint16_t new_head = val & 0xffff;
> -        int start_sqs;
>          NvmeCQueue *cq;
>
>          qid = (addr - (0x1000 + (1 << 2))) >> 3;
> @@ -8001,18 +8007,10 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr
> addr, int val)
>
>          trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
>
> -        start_sqs = nvme_cq_full(cq) ? 1 : 0;
>          cq->head = new_head;
>          if (!qid && n->dbbuf_enabled) {
>              stl_le_pci_dma(pci, cq->db_addr, cq->head,
> MEMTXATTRS_UNSPECIFIED);
>          }
> -        if (start_sqs) {
> -            NvmeSQueue *sq;
> -            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
> -                qemu_bh_schedule(sq->bh);
> -            }
> -            qemu_bh_schedule(cq->bh);
> -        }
>
>          if (cq->tail == cq->head) {
>              if (cq->irq_enabled) {
>
> ---
> base-commit: e67b7aef7c7f67ecd0282e903e0daff806d5d680
> change-id: 20241025-issue-2388-bd047487f74c
>
> Best regards,
> --
> Klaus Jensen <k.jensen@samsung.com>
>
>

Keith Busch Oct. 25, 2024, 4:45 p.m. UTC | #2

On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
>          nvme_inc_cq_tail(cq);
>          nvme_sg_unmap(&req->sg);
> +
> +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> +            qemu_bh_schedule(sq->bh);
> +        }
> +
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }

Shouldn't we schedule the bottom half after the req has been added to
the list? I think everything the callback needs to be written prior to
calling qemu_bh_schedule().

Klaus Jensen Oct. 28, 2024, 9:01 a.m. UTC | #3

On Oct 25 10:45, Keith Busch wrote:
> On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> > @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
> >          nvme_inc_cq_tail(cq);
> >          nvme_sg_unmap(&req->sg);
> > +
> > +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> > +            qemu_bh_schedule(sq->bh);
> > +        }
> > +
> >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> >      }
> 
> Shouldn't we schedule the bottom half after the req has been added to
> the list? I think everything the callback needs to be written prior to
> calling qemu_bh_schedule().
> 

Not as far as I know. It is only queued up; it won't be executed
immediately. It might run next (ASAP) if we are already in a bottom
half, but not before whatever context we are in returns.

Keith Busch Oct. 28, 2024, 3:15 p.m. UTC | #4

On Mon, Oct 28, 2024 at 10:01:50AM +0100, Klaus Jensen wrote:
> On Oct 25 10:45, Keith Busch wrote:
> > On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> > > @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
> > >          nvme_inc_cq_tail(cq);
> > >          nvme_sg_unmap(&req->sg);
> > > +
> > > +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> > > +            qemu_bh_schedule(sq->bh);
> > > +        }
> > > +
> > >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> > >      }
> > 
> > Shouldn't we schedule the bottom half after the req has been added to
> > the list? I think everything the callback needs to be written prior to
> > calling qemu_bh_schedule().
> > 
> 
> Not as far as I know. It is only queued up; it won't be executed
> immediately. It might run next (ASAP) if we are already in a bottom
> half, but not before whatever context we are in returns.

Okay. I was trying to come up with an explanation for why Waldek was
still able to reproduce the problem, and that was all I have so far.

Klaus Jensen Oct. 29, 2024, 12:33 p.m. UTC | #5

On Oct 28 09:15, Keith Busch wrote:
> On Mon, Oct 28, 2024 at 10:01:50AM +0100, Klaus Jensen wrote:
> > On Oct 25 10:45, Keith Busch wrote:
> > > On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> > > > @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
> > > >          nvme_inc_cq_tail(cq);
> > > >          nvme_sg_unmap(&req->sg);
> > > > +
> > > > +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> > > > +            qemu_bh_schedule(sq->bh);
> > > > +        }
> > > > +
> > > >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> > > >      }
> > > 
> > > Shouldn't we schedule the bottom half after the req has been added to
> > > the list? I think everything the callback needs to be written prior to
> > > calling qemu_bh_schedule().
> > > 
> > 
> > Not as far as I know. It is only queued up; it won't be executed
> > immediately. It might run next (ASAP) if we are already in a bottom
> > half, but not before whatever context we are in returns.
> 
> Okay. I was trying to come up with an explanation for why Waldek was
> still able to reproduce the problem, and that was all I have so far.
> 

I was too eager in removing the start_sqs stuff. I removed kicking the
cq when transitioning from full to non-full. The core fix is the right
one, but I introduced a bug...

v2 just posted should be good. Verified it with OSv master.

hw/nvme: fix handling of over-committed queues

Commit Message

Comments

Patch