diff mbox series

hw/nvme: fix handling of over-committed queues

Message ID 20241025-issue-2388-v1-1-16707e0d3342@samsung.com
State New
Headers show
Series hw/nvme: fix handling of over-committed queues | expand

Commit Message

Klaus Jensen Oct. 25, 2024, 10:50 a.m. UTC
From: Klaus Jensen <k.jensen@samsung.com>

If a host chooses to use the SQHD "hint" in the CQE to know if there is
room in the submission queue for additional commands, it may result in a
situation where there are not enough internal resources (struct
NvmeRequest) available to process the command. For a lack of a better
term, the host may "over-commit" the device (i.e., it may have more
inflight commands than the queue size).

For example, assume a queue with N entries. The host submits N commands
and all are picked up for processing, advancing the head and emptying
the queue. Regardless of which of these N commands complete first, the
SQHD field of that CQE will indicate to the host that the queue is
empty, which allows the host to issue N commands again. However, if the
device has not posted CQEs for all the previous commands yet, the device
will have less than N resources available to process the commands, so
queue processing is suspended.

And here lies an 11 year latent bug. In the absense of any additional
tail updates on the submission queue, we never schedule the processing
bottom-half again unless we observe a head update on an associated full
completion queue. This has been sufficient to handle N-to-1 SQ/CQ setups
(in the absense of over-commit of course). Incidentially, that "kick all
associated SQs" mechanism can now be killed since we now just schedule
queue processing when we return a processing resource to a non-empty
submission queue, which happens to cover both edge cases.

So, apparently, no previous driver tested with hw/nvme has ever used
SQHD (e.g., neither the Linux NVMe driver or SPDK uses it). But then OSv
shows up with the driver that actually does. I salute you.

Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface")
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388
Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/nvme/ctrl.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)


---
base-commit: e67b7aef7c7f67ecd0282e903e0daff806d5d680
change-id: 20241025-issue-2388-bd047487f74c

Best regards,

Comments

Waldek Kozaczuk Oct. 25, 2024, 4:20 p.m. UTC | #1
Hi,

I have applied this patch to the same QEMU source tree where I reproduced
this issue. I changed the queue size to 3 on the OSv side to trigger this
bug, but unfortunately, I still see the same behavior of the OSv guest
hanging.

Regards,
Waldemar Kozaczuk

On Fri, Oct 25, 2024 at 6:51 AM Klaus Jensen <its@irrelevant.dk> wrote:

> From: Klaus Jensen <k.jensen@samsung.com>
>
> If a host chooses to use the SQHD "hint" in the CQE to know if there is
> room in the submission queue for additional commands, it may result in a
> situation where there are not enough internal resources (struct
> NvmeRequest) available to process the command. For a lack of a better
> term, the host may "over-commit" the device (i.e., it may have more
> inflight commands than the queue size).
>
> For example, assume a queue with N entries. The host submits N commands
> and all are picked up for processing, advancing the head and emptying
> the queue. Regardless of which of these N commands complete first, the
> SQHD field of that CQE will indicate to the host that the queue is
> empty, which allows the host to issue N commands again. However, if the
> device has not posted CQEs for all the previous commands yet, the device
> will have less than N resources available to process the commands, so
> queue processing is suspended.
>
> And here lies an 11 year latent bug. In the absense of any additional
> tail updates on the submission queue, we never schedule the processing
> bottom-half again unless we observe a head update on an associated full
> completion queue. This has been sufficient to handle N-to-1 SQ/CQ setups
> (in the absense of over-commit of course). Incidentially, that "kick all
> associated SQs" mechanism can now be killed since we now just schedule
> queue processing when we return a processing resource to a non-empty
> submission queue, which happens to cover both edge cases.
>
> So, apparently, no previous driver tested with hw/nvme has ever used
> SQHD (e.g., neither the Linux NVMe driver or SPDK uses it). But then OSv
> shows up with the driver that actually does. I salute you.
>
> Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface")
> Cc: qemu-stable@nongnu.org
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388
> Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/nvme/ctrl.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index
> f4e89203c1a6e3b051fd7185cbf01ec9bae9684a..b13585c4da911b9e8ae4a722761fd85dfa24be4d
> 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
>              stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
>              break;
>          }
> +
>          QTAILQ_REMOVE(&cq->req_list, req, entry);
> +
>          nvme_inc_cq_tail(cq);
>          nvme_sg_unmap(&req->sg);
> +
> +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> +            qemu_bh_schedule(sq->bh);
> +        }
> +
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }
>      if (cq->tail != cq->head) {
> @@ -7950,7 +7957,6 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr
> addr, int val)
>          /* Completion queue doorbell write */
>
>          uint16_t new_head = val & 0xffff;
> -        int start_sqs;
>          NvmeCQueue *cq;
>
>          qid = (addr - (0x1000 + (1 << 2))) >> 3;
> @@ -8001,18 +8007,10 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr
> addr, int val)
>
>          trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
>
> -        start_sqs = nvme_cq_full(cq) ? 1 : 0;
>          cq->head = new_head;
>          if (!qid && n->dbbuf_enabled) {
>              stl_le_pci_dma(pci, cq->db_addr, cq->head,
> MEMTXATTRS_UNSPECIFIED);
>          }
> -        if (start_sqs) {
> -            NvmeSQueue *sq;
> -            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
> -                qemu_bh_schedule(sq->bh);
> -            }
> -            qemu_bh_schedule(cq->bh);
> -        }
>
>          if (cq->tail == cq->head) {
>              if (cq->irq_enabled) {
>
> ---
> base-commit: e67b7aef7c7f67ecd0282e903e0daff806d5d680
> change-id: 20241025-issue-2388-bd047487f74c
>
> Best regards,
> --
> Klaus Jensen <k.jensen@samsung.com>
>
>
Keith Busch Oct. 25, 2024, 4:45 p.m. UTC | #2
On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
>          nvme_inc_cq_tail(cq);
>          nvme_sg_unmap(&req->sg);
> +
> +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> +            qemu_bh_schedule(sq->bh);
> +        }
> +
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }

Shouldn't we schedule the bottom half after the req has been added to
the list? I think everything the callback needs to be written prior to
calling qemu_bh_schedule().
Klaus Jensen Oct. 28, 2024, 9:01 a.m. UTC | #3
On Oct 25 10:45, Keith Busch wrote:
> On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> > @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
> >          nvme_inc_cq_tail(cq);
> >          nvme_sg_unmap(&req->sg);
> > +
> > +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> > +            qemu_bh_schedule(sq->bh);
> > +        }
> > +
> >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> >      }
> 
> Shouldn't we schedule the bottom half after the req has been added to
> the list? I think everything the callback needs to be written prior to
> calling qemu_bh_schedule().
> 

Not as far as I know. It is only queued up; it won't be executed
immediately. It might run next (ASAP) if we are already in a bottom
half, but not before whatever context we are in returns.
Keith Busch Oct. 28, 2024, 3:15 p.m. UTC | #4
On Mon, Oct 28, 2024 at 10:01:50AM +0100, Klaus Jensen wrote:
> On Oct 25 10:45, Keith Busch wrote:
> > On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> > > @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
> > >          nvme_inc_cq_tail(cq);
> > >          nvme_sg_unmap(&req->sg);
> > > +
> > > +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> > > +            qemu_bh_schedule(sq->bh);
> > > +        }
> > > +
> > >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> > >      }
> > 
> > Shouldn't we schedule the bottom half after the req has been added to
> > the list? I think everything the callback needs to be written prior to
> > calling qemu_bh_schedule().
> > 
> 
> Not as far as I know. It is only queued up; it won't be executed
> immediately. It might run next (ASAP) if we are already in a bottom
> half, but not before whatever context we are in returns.

Okay. I was trying to come up with an explanation for why Waldek was
still able to reproduce the problem, and that was all I have so far.
Klaus Jensen Oct. 29, 2024, 12:33 p.m. UTC | #5
On Oct 28 09:15, Keith Busch wrote:
> On Mon, Oct 28, 2024 at 10:01:50AM +0100, Klaus Jensen wrote:
> > On Oct 25 10:45, Keith Busch wrote:
> > > On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> > > > @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
> > > >          nvme_inc_cq_tail(cq);
> > > >          nvme_sg_unmap(&req->sg);
> > > > +
> > > > +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> > > > +            qemu_bh_schedule(sq->bh);
> > > > +        }
> > > > +
> > > >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> > > >      }
> > > 
> > > Shouldn't we schedule the bottom half after the req has been added to
> > > the list? I think everything the callback needs to be written prior to
> > > calling qemu_bh_schedule().
> > > 
> > 
> > Not as far as I know. It is only queued up; it won't be executed
> > immediately. It might run next (ASAP) if we are already in a bottom
> > half, but not before whatever context we are in returns.
> 
> Okay. I was trying to come up with an explanation for why Waldek was
> still able to reproduce the problem, and that was all I have so far.
> 

I was too eager in removing the start_sqs stuff. I removed kicking the
cq when transitioning from full to non-full. The core fix is the right
one, but I introduced a bug...

v2 just posted should be good. Verified it with OSv master.
diff mbox series

Patch

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f4e89203c1a6e3b051fd7185cbf01ec9bae9684a..b13585c4da911b9e8ae4a722761fd85dfa24be4d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1520,9 +1520,16 @@  static void nvme_post_cqes(void *opaque)
             stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
             break;
         }
+
         QTAILQ_REMOVE(&cq->req_list, req, entry);
+
         nvme_inc_cq_tail(cq);
         nvme_sg_unmap(&req->sg);
+
+        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
+            qemu_bh_schedule(sq->bh);
+        }
+
         QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
     }
     if (cq->tail != cq->head) {
@@ -7950,7 +7957,6 @@  static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
         /* Completion queue doorbell write */
 
         uint16_t new_head = val & 0xffff;
-        int start_sqs;
         NvmeCQueue *cq;
 
         qid = (addr - (0x1000 + (1 << 2))) >> 3;
@@ -8001,18 +8007,10 @@  static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 
         trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
 
-        start_sqs = nvme_cq_full(cq) ? 1 : 0;
         cq->head = new_head;
         if (!qid && n->dbbuf_enabled) {
             stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
         }
-        if (start_sqs) {
-            NvmeSQueue *sq;
-            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
-                qemu_bh_schedule(sq->bh);
-            }
-            qemu_bh_schedule(cq->bh);
-        }
 
         if (cq->tail == cq->head) {
             if (cq->irq_enabled) {