| Message ID | 20241029-issue-2388-v2-1-e335658104cf@samsung.com |
|---|---|
| State | New |
| Series | [v2] hw/nvme: fix handling of over-committed queues |
On Oct 29 13:15, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
>
> If a host chooses to use the SQHD "hint" in the CQE to know if there is
> room in the submission queue for additional commands, it may result in a
> situation where there are not enough internal resources (struct
> NvmeRequest) available to process the command. For lack of a better
> term, the host may "over-commit" the device (i.e., it may have more
> inflight commands than the queue size).
>
> For example, assume a queue with N entries. The host submits N commands
> and all are picked up for processing, advancing the head and emptying
> the queue. Regardless of which of these N commands completes first, the
> SQHD field of that CQE will indicate to the host that the queue is
> empty, which allows the host to issue N commands again. However, if the
> device has not yet posted CQEs for all the previous commands, the device
> will have fewer than N resources available to process the commands, so
> queue processing is suspended.
>
> And here lies an 11-year latent bug. In the absence of any additional
> tail updates on the submission queue, we never schedule the processing
> bottom half again unless we observe a head update on an associated full
> completion queue. This has been sufficient to handle N-to-1 SQ/CQ setups
> (in the absence of over-commit, of course). Incidentally, that "kick all
> associated SQs" mechanism can now be killed, since we now just schedule
> queue processing when we return a processing resource to a non-empty
> submission queue, which happens to cover both edge cases. However, we
> must retain kicking the CQ if it was previously full.
>
> So, apparently, no previous driver tested with hw/nvme has ever used
> SQHD (e.g., neither the Linux NVMe driver nor SPDK uses it). But then
> OSv shows up with a driver that actually does. I salute you.
>
> Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface")
> Cc: qemu-stable@nongnu.org
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388
> Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
> Changes in v2:
> - Retain cq kick on previously full queue
> - Link to v1: https://lore.kernel.org/r/20241025-issue-2388-v1-1-16707e0d3342@samsung.com
> ---
>  hw/nvme/ctrl.c | 25 ++++++++++++++-----------
>  1 file changed, 14 insertions(+), 11 deletions(-)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index f4e89203c1a6e3b051fd7185cbf01ec9bae9684a..1185455a94c4af43a39708b1b97dba9624fc7ad3 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
>              stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
>              break;
>          }
> +
>          QTAILQ_REMOVE(&cq->req_list, req, entry);
> +
>          nvme_inc_cq_tail(cq);
>          nvme_sg_unmap(&req->sg);
> +
> +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> +            qemu_bh_schedule(sq->bh);
> +        }
> +
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }
>      if (cq->tail != cq->head) {
> @@ -7950,7 +7957,6 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>          /* Completion queue doorbell write */
>
>          uint16_t new_head = val & 0xffff;
> -        int start_sqs;
>          NvmeCQueue *cq;
>
>          qid = (addr - (0x1000 + (1 << 2))) >> 3;
> @@ -8001,19 +8007,16 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>
>          trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
>
> -        start_sqs = nvme_cq_full(cq) ? 1 : 0;
> -        cq->head = new_head;
> -        if (!qid && n->dbbuf_enabled) {
> -            stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
> -        }
> -        if (start_sqs) {
> -            NvmeSQueue *sq;
> -            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
> -                qemu_bh_schedule(sq->bh);
> -            }
> +        /* schedule deferred cqe posting if queue was previously full */
> +        if (nvme_cq_full(cq)) {
>              qemu_bh_schedule(cq->bh);
>          }
>
> +        cq->head = new_head;
> +        if (!qid && n->dbbuf_enabled) {
> +            stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
> +        }
> +
>          if (cq->tail == cq->head) {
>              if (cq->irq_enabled) {
>                  n->cq_pending--;
>
> ---
> base-commit: fdf250e5a37830615e324017cb3a503e84b3712c
> change-id: 20241025-issue-2388-bd047487f74c
>
> Best regards,
> --
> Klaus Jensen <k.jensen@samsung.com>

Ping. Tested, but would appreciate a review ;)
On Tue, Oct 29, 2024 at 01:15:19PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
>
> If a host chooses to use the SQHD "hint" in the CQE to know if there is
> room in the submission queue for additional commands, it may result in a
> situation where there are not enough internal resources (struct
> NvmeRequest) available to process the command. For lack of a better
> term, the host may "over-commit" the device (i.e., it may have more
> inflight commands than the queue size).
>
> ...

LGTM

Reviewed-by: Keith Busch <kbusch@kernel.org>
I have run my tests on the OSv side with small queue sizes like 3, 4 and 5, and I could NOT replicate the issue. So it looks like the v2 version of this patch fixes the problem.

Thanks a lot,
Waldemar Kozaczuk

On Mon, Nov 4, 2024 at 1:57 PM Keith Busch <kbusch@kernel.org> wrote:
> On Tue, Oct 29, 2024 at 01:15:19PM +0100, Klaus Jensen wrote:
> > From: Klaus Jensen <k.jensen@samsung.com>
> >
> > If a host chooses to use the SQHD "hint" in the CQE to know if there is
> > room in the submission queue for additional commands, it may result in a
> > situation where there are not enough internal resources (struct
> > NvmeRequest) available to process the command. For lack of a better
> > term, the host may "over-commit" the device (i.e., it may have more
> > inflight commands than the queue size).
> >
> > ...
>
> LGTM
>
> Reviewed-by: Keith Busch <kbusch@kernel.org>
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f4e89203c1a6e3b051fd7185cbf01ec9bae9684a..1185455a94c4af43a39708b1b97dba9624fc7ad3 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
             stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
             break;
         }
+
         QTAILQ_REMOVE(&cq->req_list, req, entry);
+
         nvme_inc_cq_tail(cq);
         nvme_sg_unmap(&req->sg);
+
+        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
+            qemu_bh_schedule(sq->bh);
+        }
+
         QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
     }
     if (cq->tail != cq->head) {
@@ -7950,7 +7957,6 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
         /* Completion queue doorbell write */

         uint16_t new_head = val & 0xffff;
-        int start_sqs;
         NvmeCQueue *cq;

         qid = (addr - (0x1000 + (1 << 2))) >> 3;
@@ -8001,19 +8007,16 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)

         trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);

-        start_sqs = nvme_cq_full(cq) ? 1 : 0;
-        cq->head = new_head;
-        if (!qid && n->dbbuf_enabled) {
-            stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
-        }
-        if (start_sqs) {
-            NvmeSQueue *sq;
-            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
-                qemu_bh_schedule(sq->bh);
-            }
+        /* schedule deferred cqe posting if queue was previously full */
+        if (nvme_cq_full(cq)) {
             qemu_bh_schedule(cq->bh);
         }

+        cq->head = new_head;
+        if (!qid && n->dbbuf_enabled) {
+            stl_le_pci_dma(pci, cq->db_addr, cq->head, MEMTXATTRS_UNSPECIFIED);
+        }
+
         if (cq->tail == cq->head) {
             if (cq->irq_enabled) {
                 n->cq_pending--;