[RFC,0/3] qmp: make qmp_device_add() a coroutine

Message ID 20230906190141.1286893-1-stefanha@redhat.com

Message

Stefan Hajnoczi Sept. 6, 2023, 7:01 p.m. UTC
It is not safe to call drain_call_rcu() from qmp_device_add() because
some call stacks are not prepared for drain_call_rcu() to drop the Big
QEMU Lock (BQL).

For example, device emulation code is protected by the BQL but when it
calls aio_poll() -> ... -> qmp_device_add() -> drain_call_rcu() then the
BQL is dropped. See https://bugzilla.redhat.com/show_bug.cgi?id=2215192 for a
concrete bug of this type.
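
For reference, drain_call_rcu() currently looks roughly like this
(trimmed excerpt from util/rcu.c; the drain machinery is elided):

    void drain_call_rcu(void)
    {
        bool locked = qemu_mutex_iothread_locked();

        if (locked) {
            qemu_mutex_unlock_iothread();  /* BQL dropped here... */
        }

        /* ...enqueue a marker and wait for the grace period... */

        if (locked) {
            qemu_mutex_lock_iothread();    /* ...and re-acquired here */
        }
    }

Any caller holding the BQL across drain_call_rcu() therefore has its
critical section silently broken.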

Another limitation of drain_call_rcu() is that it cannot be invoked within an
RCU read-side critical section since the reclamation phase cannot complete
until the end of the critical section. Unfortunately, call stacks have been
seen where this happens (see
https://bugzilla.redhat.com/show_bug.cgi?id=2214985).

This patch series introduces drain_call_rcu_co(), which does the same thing as
drain_call_rcu() but asynchronously. By yielding back to the event loop we can
wait until the caller drops the BQL and leaves its RCU read-side critical
section.
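
For illustration, a coroutine command handler would use the new API
along these lines (a minimal sketch of the intended usage, not the
actual patch; see patch 3 for the real conversion):

    void coroutine_fn qmp_device_add(...)
    {
        ...
        /*
         * Yield back to the event loop. The coroutine resumes only
         * after the original call stack has dropped the BQL and left
         * its RCU read-side critical section, and the grace period
         * has elapsed.
         */
        drain_call_rcu_co();
        ...
    }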

Patch 1 changes HMP so that coroutine monitor commands yield back to the event
loop instead of running inside a nested event loop.

Patch 2 introduces the new drain_call_rcu_co() API.

Patch 3 converts qmp_device_add() into a coroutine monitor command and uses
drain_call_rcu_co().

I'm sending this as an RFC because I don't have confirmation yet that the bugs
mentioned above are fixed by this patch series.

Stefan Hajnoczi (3):
  hmp: avoid the nested event loop in handle_hmp_command()
  rcu: add drain_call_rcu_co() API
  qmp: make qmp_device_add() a coroutine

 MAINTAINERS            |  2 ++
 docs/devel/rcu.txt     | 21 ++++++++++++++++
 qapi/qdev.json         |  1 +
 include/monitor/qdev.h |  3 ++-
 include/qemu/rcu.h     |  1 +
 util/rcu-internal.h    |  8 ++++++
 monitor/hmp.c          | 28 +++++++++++----------
 monitor/qmp-cmds.c     |  2 +-
 softmmu/qdev-monitor.c | 34 +++++++++++++++++++++++---
 util/rcu-co.c          | 55 ++++++++++++++++++++++++++++++++++++++++++
 util/rcu.c             |  3 ++-
 hmp-commands.hx        |  1 +
 util/meson.build       |  2 +-
 13 files changed, 140 insertions(+), 21 deletions(-)
 create mode 100644 util/rcu-internal.h
 create mode 100644 util/rcu-co.c

Comments

Paolo Bonzini Sept. 7, 2023, 11:28 a.m. UTC | #1
On 9/6/23 21:01, Stefan Hajnoczi wrote:
> It is not safe to call drain_call_rcu() from qmp_device_add() because
> some call stacks are not prepared for drain_call_rcu() to drop the Big
> QEMU Lock (BQL).
> 
> For example, device emulation code is protected by the BQL but when it
> calls aio_poll() -> ... -> qmp_device_add() -> drain_call_rcu() then the
> BQL is dropped. See https://bugzilla.redhat.com/show_bug.cgi?id=2215192 for a
> concrete bug of this type.
> 
> Another limitation of drain_call_rcu() is that it cannot be invoked within an
> RCU read-side critical section since the reclamation phase cannot complete
> until the end of the critical section. Unfortunately, call stacks have been
> seen where this happens (see
> https://bugzilla.redhat.com/show_bug.cgi?id=2214985).

I think the root cause here is that do_qmp_dispatch_bh is called on the 
wrong context, namely qemu_get_aio_context() instead of 
iohandler_get_aio_context().  This is what causes it to move to the vCPU 
thread.
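
Concretely, I mean something like this in qmp_dispatch() (untested):

    -        aio_bh_schedule_oneshot(qemu_get_aio_context(),
    -                                do_qmp_dispatch_bh, &data);
    +        aio_bh_schedule_oneshot(iohandler_get_aio_context(),
    +                                do_qmp_dispatch_bh, &data);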

Auditing all subsystems that use iohandler_get_aio_context(), for 
example via qemu_set_fd_handler(), together with bottom halves, would be 
a bit daunting.

I don't have any objection to this patch series actually, but I would 
like to see if using the right AioContext also fixes the bug---and then 
treat these changes as more of a cleanup.  Coroutines are pretty 
pervasive in QEMU and are not going away, which, as you say in the 
updated docs, makes drain_call_rcu_co() preferable to drain_call_rcu().

Paolo


Stefan Hajnoczi Sept. 7, 2023, 2 p.m. UTC | #2
On Thu, Sep 07, 2023 at 01:28:55PM +0200, Paolo Bonzini wrote:
> On 9/6/23 21:01, Stefan Hajnoczi wrote:
> > It is not safe to call drain_call_rcu() from qmp_device_add() because
> > some call stacks are not prepared for drain_call_rcu() to drop the Big
> > QEMU Lock (BQL).
> > 
> > For example, device emulation code is protected by the BQL but when it
> > calls aio_poll() -> ... -> qmp_device_add() -> drain_call_rcu() then the
> > BQL is dropped. See https://bugzilla.redhat.com/show_bug.cgi?id=2215192 for a
> > concrete bug of this type.
> > 
> > Another limitation of drain_call_rcu() is that it cannot be invoked within an
> > RCU read-side critical section since the reclamation phase cannot complete
> > until the end of the critical section. Unfortunately, call stacks have been
> > seen where this happens (see
> > https://bugzilla.redhat.com/show_bug.cgi?id=2214985).
> 
> I think the root cause here is that do_qmp_dispatch_bh is called on the
> wrong context, namely qemu_get_aio_context() instead of
> iohandler_get_aio_context().  This is what causes it to move to the vCPU
> thread.
> 
> Auditing all subsystems that use iohandler_get_aio_context(), for example
> via qemu_set_fd_handler(), together with bottom halves, would be a bit
> daunting.
> 
> I don't have any objection to this patch series actually, but I would like
> to see if using the right AioContext also fixes the bug---and then treat
> these changes as more of a cleanup.  Coroutines are pretty pervasive in QEMU
> and are not going away, which, as you say in the updated docs, makes
> drain_call_rcu_co() preferable to drain_call_rcu().

While I agree that the issue would not happen if monitor commands only
ran in the iohandler AioContext, I don't think we can change that.

When Kevin implemented coroutine commands in commit 9ce44e2ce267 ("qmp:
Move dispatcher to a coroutine"), he used qemu_get_aio_context()
deliberately so that AIO_WAIT_WHILE() can make progress.

I'm not clear on the exact scenario though, because coroutines shouldn't
call AIO_WAIT_WHILE().

Kevin?

There is only one coroutine monitor command that calls the QEMU block
layer: qmp_block_resize(). If we're going to change how the AioContext
works then now is the time to do it before there are more commands that
need to be audited/refactored.

Stefan

Paolo Bonzini Sept. 7, 2023, 2:25 p.m. UTC | #3
On Thu, Sep 7, 2023 at 4:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> While I agree that the issue would not happen if monitor commands only
> ran in the iohandler AioContext, I don't think we can change that.
> When Kevin implemented coroutine commands in commit 9ce44e2ce267 ("qmp:
> Move dispatcher to a coroutine"), he used qemu_get_aio_context()
> deliberately so that AIO_WAIT_WHILE() can make progress.

Ah, you are referring to

+        /*
+         * Move the coroutine from iohandler_ctx to qemu_aio_context for
+         * executing the command handler so that it can make progress if it
+         * involves an AIO_WAIT_WHILE().
+         */
+        aio_co_schedule(qemu_get_aio_context(), qmp_dispatcher_co);
+        qemu_coroutine_yield();

> I'm not clear on the exact scenario though, because coroutines shouldn't
> call AIO_WAIT_WHILE().

I think he meant "so that an AIO_WAIT_WHILE() invoked through a bottom
half will make progress on the coroutine as well".

However I am not sure the comment applies here, because
do_qmp_dispatch_bh() only applies to non-coroutine commands; that
commit allowed monitor commands to run in vCPU threads, where they
previously couldn't.

Thinking more about it, I don't like that the

    if (!!(cmd->options & QCO_COROUTINE) == qemu_in_coroutine()) {
    }

check is in qmp_dispatch() rather than monitor_qmp_dispatch().

Any caller of qmp_dispatch() knows if it is in a coroutine or not.
qemu-ga uses neither a coroutine dispatcher nor coroutine commands.
QEMU uses non-coroutine dispatch for out-of-band commands (and we can
forbid coroutine + allow-oob at the same time), and coroutine dispatch
for the others.

So, moving out of coroutine context (through a bottom half) should be
done by monitor_qmp_dispatch(), and likewise moving temporarily out of
the iohandler context in the case of coroutine commands. In the case
of !req_obj->req you don't need to do either of those. qmp_dispatch()
can still assert that the coroutine-ness of the command matches the
context in which qmp_dispatch() is called.

Once this is done, I think moving out of coroutine context can use a
BH that runs in the iohandler context.
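
In code, I am thinking of something like this (a rough sketch with
approximate names, not a tested patch):

    /* in monitor_qmp_dispatch(), running in iohandler_ctx */
    if (cmd->options & QCO_COROUTINE) {
        /* coroutine command: temporarily leave iohandler_ctx */
        aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
        qemu_coroutine_yield();
        qmp_dispatch(...);
        /* and move back afterwards */
        aio_co_schedule(iohandler_get_aio_context(), qemu_coroutine_self());
        qemu_coroutine_yield();
    } else {
        /* non-coroutine command: leave coroutine context via a BH
         * that stays in iohandler_ctx */
        aio_bh_schedule_oneshot(iohandler_get_aio_context(),
                                do_qmp_dispatch_bh, &data);
        qemu_coroutine_yield();
    }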


Paolo
Stefan Hajnoczi Sept. 7, 2023, 3:29 p.m. UTC | #4
On Thu, 7 Sept 2023 at 10:26, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> [...]
>
> So, moving out of coroutine context (through a bottom half) should be
> done by monitor_qmp_dispatch(), and likewise moving temporarily out of
> the iohandler context in the case of coroutine commands. In the case
> of !req_obj->req you don't need to do either of those. qmp_dispatch()
> can still assert that the coroutine-ness of the command matches the
> context in which qmp_dispatch() is called.
>
> Once this is done, I think moving out of coroutine context can use a
> BH that runs in the iohandler context.

I'll wait for Kevin's input and will then revisit the patches based on
the conclusion we come to.

Stefan
Kevin Wolf Sept. 12, 2023, 5:08 p.m. UTC | #5
On 07.09.2023 at 16:25, Paolo Bonzini wrote:
> On Thu, Sep 7, 2023 at 4:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > While I agree that the issue would not happen if monitor commands only
> > ran in the iohandler AioContext, I don't think we can change that.
> > When Kevin implemented coroutine commands in commit 9ce44e2ce267 ("qmp:
> > Move dispatcher to a coroutine"), he used qemu_get_aio_context()
> > deliberately so that AIO_WAIT_WHILE() can make progress.
> 
> Ah, you are referring to
> 
> +        /*
> +         * Move the coroutine from iohandler_ctx to qemu_aio_context for
> +         * executing the command handler so that it can make progress if it
> +         * involves an AIO_WAIT_WHILE().
> +         */
> +        aio_co_schedule(qemu_get_aio_context(), qmp_dispatcher_co);
> +        qemu_coroutine_yield();
> 
> > I'm not clear on the exact scenario though, because coroutines shouldn't
> > call AIO_WAIT_WHILE().
> 
> I think he meant "so that an AIO_WAIT_WHILE() invoked through a bottom
> half will make progress on the coroutine as well".

It's been a while, but I think I may have meant an AIO_WAIT_WHILE() that
is executed by someone else and that depends on the coroutine. For
example, I imagine this is what I could have seen:

1. The QMP command handler does some I/O and yields for it (like
   updating the qcow2 header for block_resize) with increased
   bs->in_flight

2. Something else calls drain, which polls qemu_aio_context, but not
   iohandler_ctx, until the request completes.

3. Nothing will ever resume the coroutine -> deadlock
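
In pseudo-code (do_some_io() is a stand-in, not a real function):

    /* QMP handler coroutine, hypothetically still in iohandler_ctx */
    bdrv_inc_in_flight(bs);
    do_some_io(bs);          /* yields; the resume is queued on
                                iohandler_ctx */

    /* meanwhile, somewhere else, under the BQL: */
    bdrv_drained_begin(bs);  /* AIO_WAIT_WHILE() polls qemu_aio_context
                                until bs->in_flight == 0; iohandler_ctx
                                is never polled, so the handler never
                                resumes and in_flight never drops */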

> However I am not sure the comment applies here, because
> do_qmp_dispatch_bh() only applies to non-coroutine commands; that
> commit allowed monitor commands to run in vCPU threads, where they
> previously couldn't.
> 
> Thinking more about it, I don't like that the
> 
>     if (!!(cmd->options & QCO_COROUTINE) == qemu_in_coroutine()) {
>     }
> 
> check is in qmp_dispatch() rather than monitor_qmp_dispatch().
> 
> Any caller of qmp_dispatch() knows if it is in a coroutine or not.
> qemu-ga uses neither a coroutine dispatcher nor coroutine commands.
> QEMU uses non-coroutine dispatch for out-of-band commands (and we can
> forbid coroutine + allow-oob at the same time), and coroutine dispatch
> for the others.
> 
> So, moving out of coroutine context (through a bottom half) should be
> done by monitor_qmp_dispatch(), and likewise moving temporarily out of
> the iohandler context in the case of coroutine commands. In the case
> of !req_obj->req you don't need to do either of those. qmp_dispatch()
> can still assert that the coroutine-ness of the command matches the
> context in which qmp_dispatch() is called.
> 
> Once this is done, I think moving out of coroutine context can use a
> BH that runs in the iohandler context.

Non-coroutine handlers could probably stay in iohandler_ctx, but I don't
think we can avoid switching to a different context for coroutine handlers.

So maybe we can just move the rescheduling down to the coroutine case in
qmp_dispatch().
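
Something like this, I mean (untested):

    if (cmd->options & QCO_COROUTINE) {
        /* switch context only for coroutine handlers */
        aio_co_schedule(qemu_get_aio_context(), qemu_coroutine_self());
        qemu_coroutine_yield();
    }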

Kevin
Paolo Bonzini Sept. 13, 2023, 11:38 a.m. UTC | #6
On Tue, Sep 12, 2023 at 7:08 PM Kevin Wolf <kwolf@redhat.com> wrote:
> > Any caller of qmp_dispatch() knows if it is in a coroutine or not.
> > qemu-ga uses neither a coroutine dispatcher nor coroutine commands.
> > QEMU uses non-coroutine dispatch for out-of-band commands (and we can
> > forbid coroutine + allow-oob at the same time), and coroutine dispatch
> > for the others.
> >
> > So, moving out of coroutine context (through a bottom half) should be
> > done by monitor_qmp_dispatch(), and likewise moving temporarily out of
> > the iohandler context in the case of coroutine commands. In the case
> > of !req_obj->req you don't need to do either of those. qmp_dispatch()
> > can still assert that the coroutine-ness of the command matches the
> > context in which qmp_dispatch() is called.
> >
> > Once this is done, I think moving out of coroutine context can use a
> > BH that runs in the iohandler context.
>
> > Non-coroutine handlers could probably stay in iohandler_ctx, but I don't
> > think we can avoid switching to a different context for coroutine handlers.

Agreed.

> So maybe we can just move the rescheduling down to the coroutine case in
> qmp_dispatch().

Not sure about qmp_dispatch (see above: any caller of the function
knows if it is in a coroutine or not, and qemu-ga need not know about
coroutines at all). But what you said also applies if the rescheduling
is only pushed to monitor_qmp_dispatch(), which would be my first
option.

Thanks!

Paolo