Message ID | 20241025-issue-2388-v1-1-16707e0d3342@samsung.com
---|---
State | New
Series | hw/nvme: fix handling of over-committed queues
Hi,

I have applied this patch to the same QEMU source tree where I reproduced
this issue. I changed the queue size to 3 on the OSv side to trigger this
bug, but unfortunately, I still see the same behavior of the OSv guest
hanging.

Regards,
Waldemar Kozaczuk

On Fri, Oct 25, 2024 at 6:51 AM Klaus Jensen <its@irrelevant.dk> wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
>
> If a host chooses to use the SQHD "hint" in the CQE to know if there is
> room in the submission queue for additional commands, it may result in a
> situation where there are not enough internal resources (struct
> NvmeRequest) available to process the command. For lack of a better
> term, the host may "over-commit" the device (i.e., it may have more
> inflight commands than the queue size).
>
> For example, assume a queue with N entries. The host submits N commands
> and all are picked up for processing, advancing the head and emptying
> the queue. Regardless of which of these N commands completes first, the
> SQHD field of that CQE will indicate to the host that the queue is
> empty, which allows the host to issue N commands again. However, if the
> device has not posted CQEs for all the previous commands yet, the device
> will have fewer than N resources available to process the commands, so
> queue processing is suspended.
>
> And here lies an 11-year-old latent bug. In the absence of any
> additional tail updates on the submission queue, we never schedule the
> processing bottom half again unless we observe a head update on an
> associated full completion queue. This has been sufficient to handle
> N-to-1 SQ/CQ setups (in the absence of over-commit, of course).
> Incidentally, that "kick all associated SQs" mechanism can now be
> killed, since we now just schedule queue processing when we return a
> processing resource to a non-empty submission queue, which happens to
> cover both edge cases.
>
> So, apparently, no previous driver tested with hw/nvme has ever used
> SQHD (e.g., neither the Linux NVMe driver nor SPDK uses it). But then
> OSv shows up with a driver that actually does. I salute you.
>
> Fixes: f3c507adcd7b ("NVMe: Initial commit for new storage interface")
> Cc: qemu-stable@nongnu.org
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2388
> Reported-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/nvme/ctrl.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index f4e89203c1a6e3b051fd7185cbf01ec9bae9684a..b13585c4da911b9e8ae4a722761fd85dfa24be4d 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
>              stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
>              break;
>          }
> +
>          QTAILQ_REMOVE(&cq->req_list, req, entry);
> +
>          nvme_inc_cq_tail(cq);
>          nvme_sg_unmap(&req->sg);
> +
> +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> +            qemu_bh_schedule(sq->bh);
> +        }
> +
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }
>      if (cq->tail != cq->head) {
> @@ -7950,7 +7957,6 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>          /* Completion queue doorbell write */
>
>          uint16_t new_head = val & 0xffff;
> -        int start_sqs;
>          NvmeCQueue *cq;
>
>          qid = (addr - (0x1000 + (1 << 2))) >> 3;
> @@ -8001,18 +8007,10 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>
>          trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
>
> -        start_sqs = nvme_cq_full(cq) ? 1 : 0;
>          cq->head = new_head;
>          if (!qid && n->dbbuf_enabled) {
>              stl_le_pci_dma(pci, cq->db_addr, cq->head,
>                             MEMTXATTRS_UNSPECIFIED);
>          }
> -        if (start_sqs) {
> -            NvmeSQueue *sq;
> -            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
> -                qemu_bh_schedule(sq->bh);
> -            }
> -            qemu_bh_schedule(cq->bh);
> -        }
>
>          if (cq->tail == cq->head) {
>              if (cq->irq_enabled) {
>
> ---
> base-commit: e67b7aef7c7f67ecd0282e903e0daff806d5d680
> change-id: 20241025-issue-2388-bd047487f74c
>
> Best regards,
> --
> Klaus Jensen <k.jensen@samsung.com>
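To make the over-commit scenario in the commit message concrete, here is a
minimal host-side sketch. It is illustrative only: the struct, the helper
names, and the depth of 3 (matching the queue size Waldemar used in the
reproducer) are assumptions, not code from OSv or any real driver.

/* Hypothetical host-side view of an NVMe submission queue, written only
 * to illustrate the over-commit scenario from the commit message. */
#include <stdint.h>

#define SQ_DEPTH 3 /* matches the queue size used in the OSv reproducer */

struct host_sq {
    uint16_t tail; /* next slot the host writes an SQE into */
    uint16_t head; /* last value of CQE.SQHD reported by the device */
};

/* NVMe ring convention: the queue is full when advancing the tail would
 * make it equal to the head, so one slot always stays unused. */
static unsigned sq_slots_free(const struct host_sq *sq)
{
    return (sq->head + SQ_DEPTH - sq->tail - 1) % SQ_DEPTH;
}

/* The hazard: if the device fetches all pending SQEs at once, the SQHD
 * in the *first* CQE already reports the queue as empty.  A host that
 * trusts it will immediately submit a full queue's worth of new
 * commands while the device still holds NvmeRequest resources for the
 * CQEs it has not posted yet -- the "over-commit" the patch targets. */
static void on_cqe(struct host_sq *sq, uint16_t sqhd)
{
    sq->head = sqhd;
}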
On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
>          nvme_inc_cq_tail(cq);
>          nvme_sg_unmap(&req->sg);
> +
> +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> +            qemu_bh_schedule(sq->bh);
> +        }
> +
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }

Shouldn't we schedule the bottom half after the req has been added to
the list? I think everything the callback needs should be written prior
to calling qemu_bh_schedule().
On Oct 25 10:45, Keith Busch wrote:
> On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> > @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
> >          nvme_inc_cq_tail(cq);
> >          nvme_sg_unmap(&req->sg);
> > +
> > +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> > +            qemu_bh_schedule(sq->bh);
> > +        }
> > +
> >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> >      }
>
> Shouldn't we schedule the bottom half after the req has been added to
> the list? I think everything the callback needs should be written prior
> to calling qemu_bh_schedule().

Not as far as I know. It is only queued up; it won't be executed
immediately. It might run next (ASAP) if we are already in a bottom
half, but not before whatever context we are in returns.
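Klaus's point can be seen in the shape of QEMU's bottom-half API itself.
The sketch below uses the real qemu_bh_new()/qemu_bh_schedule() calls;
the callback body and the setup around them are illustrative assumptions.

#include "qemu/osdep.h"
#include "qemu/main-loop.h"

static void sq_work(void *opaque)
{
    /* Runs from the event loop, never synchronously from the call to
     * qemu_bh_schedule().  By the time this executes, the scheduling
     * context has returned, so a QTAILQ_INSERT_TAIL issued right after
     * the schedule call is already visible here. */
}

static void schedule_example(void)
{
    QEMUBH *bh = qemu_bh_new(sq_work, NULL);

    qemu_bh_schedule(bh); /* only marks the bottom half as scheduled... */
    /* ...execution is deferred until control returns to the main loop,
     * which is why scheduling before the list insertion is safe in the
     * single-threaded main-loop context hw/nvme runs in. */
}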
On Mon, Oct 28, 2024 at 10:01:50AM +0100, Klaus Jensen wrote:
> On Oct 25 10:45, Keith Busch wrote:
> > On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> > > @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
> > >          nvme_inc_cq_tail(cq);
> > >          nvme_sg_unmap(&req->sg);
> > > +
> > > +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> > > +            qemu_bh_schedule(sq->bh);
> > > +        }
> > > +
> > >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> > >      }
> >
> > Shouldn't we schedule the bottom half after the req has been added to
> > the list? I think everything the callback needs should be written prior
> > to calling qemu_bh_schedule().
>
> Not as far as I know. It is only queued up; it won't be executed
> immediately. It might run next (ASAP) if we are already in a bottom
> half, but not before whatever context we are in returns.

Okay. I was trying to come up with an explanation for why Waldek was
still able to reproduce the problem, and that's all I have so far.
On Oct 28 09:15, Keith Busch wrote:
> On Mon, Oct 28, 2024 at 10:01:50AM +0100, Klaus Jensen wrote:
> > On Oct 25 10:45, Keith Busch wrote:
> > > On Fri, Oct 25, 2024 at 12:50:45PM +0200, Klaus Jensen wrote:
> > > > @@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
> > > >          nvme_inc_cq_tail(cq);
> > > >          nvme_sg_unmap(&req->sg);
> > > > +
> > > > +        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
> > > > +            qemu_bh_schedule(sq->bh);
> > > > +        }
> > > > +
> > > >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> > > >      }
> > >
> > > Shouldn't we schedule the bottom half after the req has been added to
> > > the list? I think everything the callback needs should be written prior
> > > to calling qemu_bh_schedule().
> >
> > Not as far as I know. It is only queued up; it won't be executed
> > immediately. It might run next (ASAP) if we are already in a bottom
> > half, but not before whatever context we are in returns.
>
> Okay. I was trying to come up with an explanation for why Waldek was
> still able to reproduce the problem, and that's all I have so far.

I was too eager in removing the start_sqs stuff: I also removed kicking
the cq when transitioning from full to non-full. The core fix is the
right one, but I introduced a bug...

The v2 just posted should be good. Verified it with OSv master.
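To make the regression Klaus describes concrete: in v1, the deleted
start_sqs block (see the second and third hunks of the diff below) also
carried the qemu_bh_schedule(cq->bh) call that resumes nvme_post_cqes()
when a doorbell write takes a full CQ back to non-full. A hedged sketch
of what nvme_process_db() still needs to retain; the actual v2 code may
differ:

/* Sketch only; the exact v2 change may differ.  nvme_post_cqes() stops
 * posting completions while the CQ is full, so a head update that takes
 * a full CQ back to non-full must still kick the *CQ* bottom half.
 * Only the blanket "kick every associated SQ" loop is redundant now,
 * because the core fix reschedules an SQ whenever a processing resource
 * is returned to it. */
if (nvme_cq_full(cq)) {
    qemu_bh_schedule(cq->bh); /* resume posting CQEs */
}

cq->head = new_head;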
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f4e89203c1a6e3b051fd7185cbf01ec9bae9684a..b13585c4da911b9e8ae4a722761fd85dfa24be4d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1520,9 +1520,16 @@ static void nvme_post_cqes(void *opaque)
             stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
             break;
         }
+
         QTAILQ_REMOVE(&cq->req_list, req, entry);
+
         nvme_inc_cq_tail(cq);
         nvme_sg_unmap(&req->sg);
+
+        if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
+            qemu_bh_schedule(sq->bh);
+        }
+
         QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
     }
     if (cq->tail != cq->head) {
@@ -7950,7 +7957,6 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
         /* Completion queue doorbell write */

         uint16_t new_head = val & 0xffff;
-        int start_sqs;
         NvmeCQueue *cq;

         qid = (addr - (0x1000 + (1 << 2))) >> 3;
@@ -8001,18 +8007,10 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)

         trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);

-        start_sqs = nvme_cq_full(cq) ? 1 : 0;
         cq->head = new_head;
         if (!qid && n->dbbuf_enabled) {
             stl_le_pci_dma(pci, cq->db_addr, cq->head,
                            MEMTXATTRS_UNSPECIFIED);
         }
-        if (start_sqs) {
-            NvmeSQueue *sq;
-            QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
-                qemu_bh_schedule(sq->bh);
-            }
-            qemu_bh_schedule(cq->bh);
-        }

         if (cq->tail == cq->head) {
             if (cq->irq_enabled) {
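Finally, a gloss on the core hunk above. The two-line change is the
patch's own; the comment block spelling out the two-part condition is
commentary, not part of the patch.

/* At this point in nvme_post_cqes() a request resource is about to be
 * handed back to the submission queue:
 *
 *   QTAILQ_EMPTY(&sq->req_list)  -- the SQ ran out of NvmeRequest
 *       resources, so nvme_process_sq() stopped consuming SQEs;
 *   !nvme_sq_empty(sq)           -- yet SQEs are still pending
 *       (head != tail), i.e., the host over-committed the queue.
 *
 * Both true means processing is suspended with work outstanding and no
 * further doorbell write is guaranteed to arrive, so the SQ bottom half
 * must be rescheduled here, when the resource becomes available again. */
if (QTAILQ_EMPTY(&sq->req_list) && !nvme_sq_empty(sq)) {
    qemu_bh_schedule(sq->bh);
}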