diff mbox series

[v2] hw/nvme: fix handling of over-committed queues

Message ID 20241029-issue-2388-v2-1-e335658104cf@samsung.com
State New
Headers show
Series [v2] hw/nvme: fix handling of over-committed queues | expand

Commit Message

Klaus Jensen Oct. 29, 2024, 12:15 p.m. UTC
From: Klaus Jensen <k.jensen@samsung.com>

If a host chooses to use the SQHD "hint" in the CQE to know if there is
room in the submission queue for additional commands, it may result in a
situation where there are not enough internal resources (struct
NvmeRequest) available to process the command. For a lack of a better
term, the host may "over-commit" the device (i.e., it may have more
inflight commands than the queue size).

For example, assume a queue with N entries. The host submits N commands
and all are picked up for processing, advancing the head and emptying
the queue. Regardless of which of these N commands complete first, the
SQHD field of that CQE will indicate to the host that the queue is
empty, which allows the host to issue N commands again. However, if the
device has not posted CQEs for all the previous commands yet, the device
will have less than N resources available to process the commands, so
queue processing is suspended.

And here lies an 11 year latent bug. In the absense of any additional
tail updates on the submission queue, we never schedule the processing
bottom-half again unless we observe a head update on an associated full
completion queue. This has been sufficient to handle N-to-1 SQ/CQ setups
(in the absense of over-commit of course). Incidentially, that "kick all
associated SQs" mechanism can now be killed since we now just schedule
queue processing when we return a processing resource to a non-empty
submission queue, which happens to cover both edge cases. However, we
must retain kicking the CQ if it was previously full.

So, apparently, no previous driver tested with hw/nvme has ever used
SQHD (e.g., neither the Linux NVMe driver or SPDK uses it). But then OSv
shows up with the driver that actually does. I salute you.

Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface")
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388
Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
Changes in v2:
- Retain cq kick on previously full queue
- Link to v1: https://lore.kernel.org/r/20241025-issue-2388-v1-1-16707e0d3342@samsung.com
---
 hw/nvme/ctrl.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)


---
base-commit: fdf250e5a37830615e324017cb3a503e84b3712c
change-id: 20241025-issue-2388-bd047487f74c

Best regards,

Comments

Klaus Jensen Nov. 4, 2024, 6:15 p.m. UTC | #1
On Oct 29 13:15, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> If a host chooses to use the SQHD "hint" in the CQE to know if there is
> room in the submission queue for additional commands, it may result in a
> situation where there are not enough internal resources (struct
> NvmeRequest) available to process the command. For a lack of a better
> term, the host may "over-commit" the device (i.e., it may have more
> inflight commands than the queue size).
> 
> For example, assume a queue with N entries. The host submits N commands
> and all are picked up for processing, advancing the head and emptying
> the queue. Regardless of which of these N commands complete first, the
> SQHD field of that CQE will indicate to the host that the queue is
> empty, which allows the host to issue N commands again. However, if the
> device has not posted CQEs for all the previous commands yet, the device
> will have less than N resources available to process the commands, so
> queue processing is suspended.
> 
> And here lies an 11 year latent bug. In the absense of any additional
> tail updates on the submission queue, we never schedule the processing
> bottom-half again unless we observe a head update on an associated full
> completion queue. This has been sufficient to handle N-to-1 SQ/CQ setups
> (in the absense of over-commit of course). Incidentially, that "kick all
> associated SQs" mechanism can now be killed since we now just schedule
> queue processing when we return a processing resource to a non-empty
> submission queue, which happens to cover both edge cases. However, we
> must retain kicking the CQ if it was previously full.
> 
> So, apparently, no previous driver tested with hw/nvme has ever used
> SQHD (e.g., neither the Linux NVMe driver or SPDK uses it). But then OSv
> shows up with the driver that actually does. I salute you.
> 
> Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface")
> Cc: qemu-stable@nongnu.org
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388
> Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
> Changes in v2:
> - Retain cq kick on previously full queue
> - Link to v1: https://lore.kernel.org/r/20241025-issue-2388-v1-1-16707e0d3342@samsung.com
> ---
>  hw/nvme/ctrl.c | 25 ++++++++++++++-----------
>  1 file changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index f4e89203c1a6e3b051fd7185cbf01ec9bae9684a..1185455a94c4af43a39708b1b97dba9624fc7ad3 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
>              stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
>              break;
>          }
> +
>          QTAILQ_REMOVE(&cq->req_list, req, entry);
> +
>          nvme_inc_cq_tail(cq);
>          nvme_sg_unmap(&req->sg);
> +
> +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> +            qemu_bh_schedule(sq->bh);
> +        }
> +
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }
>      if (cq->tail != cq->head) {
> @@ -7950,7 +7957,6 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>          /* Completion queue doorbell write */
>  
>          uint16_t new_head = val & 0xffff;
> -        int start_sqs;
>          NvmeCQueue *cq;
>  
>          qid = (addr - (0x1000 + (1 << 2))) >> 3;
> @@ -8001,19 +8007,16 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>  
>          trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
>  
> -        start_sqs = nvme_cq_full(cq) ? 1 : 0;
> -        cq->head = new_head;
> -        if (!qid && n->dbbuf_enabled) {
> -            stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
> -        }
> -        if (start_sqs) {
> -            NvmeSQueue *sq;
> -            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
> -                qemu_bh_schedule(sq->bh);
> -            }
> +        /* scheduled deferred cqe posting if queue was previously full */
> +        if (nvme_cq_full(cq)) {
>              qemu_bh_schedule(cq->bh);
>          }
>  
> +        cq->head = new_head;
> +        if (!qid && n->dbbuf_enabled) {
> +            stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
> +        }
> +
>          if (cq->tail == cq->head) {
>              if (cq->irq_enabled) {
>                  n->cq_pending--;
> 
> ---
> base-commit: fdf250e5a37830615e324017cb3a503e84b3712c
> change-id: 20241025-issue-2388-bd047487f74c
> 
> Best regards,
> -- 
> Klaus Jensen <k.jensen@samsung.com>
> 

Ping. Tested, but would appreciate a review ;)
Keith Busch Nov. 4, 2024, 6:57 p.m. UTC | #2
On Tue, Oct 29, 2024 at 01:15:19PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> If a host chooses to use the SQHD "hint" in the CQE to know if there is
> room in the submission queue for additional commands, it may result in a
> situation where there are not enough internal resources (struct
> NvmeRequest) available to process the command. For a lack of a better
> term, the host may "over-commit" the device (i.e., it may have more
> inflight commands than the queue size).
>
> ...

LGTM

Reviewed-by: Keith Busch <kbusch@kernel.org>
Waldek Kozaczuk Nov. 4, 2024, 10:32 p.m. UTC | #3
I have run my tests on the OSv side with small queue sizes like 3,4,5 and I
could NOT replicate the issue. So it looks like the V2 version of this
patch fixes the problem.

Thanks a lot,
Waldemar Kozaczuk

On Mon, Nov 4, 2024 at 1:57 PM Keith Busch <kbusch@kernel.org> wrote:

> On Tue, Oct 29, 2024 at 01:15:19PM +0100, Klaus Jensen wrote:
> > From: Klaus Jensen <k.jensen@samsung.com>
> >
> > If a host chooses to use the SQHD "hint" in the CQE to know if there is
> > room in the submission queue for additional commands, it may result in a
> > situation where there are not enough internal resources (struct
> > NvmeRequest) available to process the command. For a lack of a better
> > term, the host may "over-commit" the device (i.e., it may have more
> > inflight commands than the queue size).
> >
> > ...
>
> LGTM
>
> Reviewed-by: Keith Busch <kbusch@kernel.org>
>
diff mbox series

Patch

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f4e89203c1a6e3b051fd7185cbf01ec9bae9684a..1185455a94c4af43a39708b1b97dba9624fc7ad3 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1520,9 +1520,16 @@  static void nvme_post_cqes(void *opaque)
             stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
             break;
         }
+
         QTAILQ_REMOVE(&cq->req_list, req, entry);
+
         nvme_inc_cq_tail(cq);
         nvme_sg_unmap(&req->sg);
+
+        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
+            qemu_bh_schedule(sq->bh);
+        }
+
         QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
     }
     if (cq->tail != cq->head) {
@@ -7950,7 +7957,6 @@  static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
         /* Completion queue doorbell write */
 
         uint16_t new_head = val & 0xffff;
-        int start_sqs;
         NvmeCQueue *cq;
 
         qid = (addr - (0x1000 + (1 << 2))) >> 3;
@@ -8001,19 +8007,16 @@  static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 
         trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
 
-        start_sqs = nvme_cq_full(cq) ? 1 : 0;
-        cq->head = new_head;
-        if (!qid && n->dbbuf_enabled) {
-            stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
-        }
-        if (start_sqs) {
-            NvmeSQueue *sq;
-            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
-                qemu_bh_schedule(sq->bh);
-            }
+        /* scheduled deferred cqe posting if queue was previously full */
+        if (nvme_cq_full(cq)) {
             qemu_bh_schedule(cq->bh);
         }
 
+        cq->head = new_head;
+        if (!qid && n->dbbuf_enabled) {
+            stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
+        }
+
         if (cq->tail == cq->head) {
             if (cq->irq_enabled) {
                 n->cq_pending--;