[1/2] virtio: use blk_io_plug_call() in virtio_irqfd_notify()

Message ID 20230815120805.3235166-2-stefanha@redhat.com
State New
Series virtio-blk: use blk_io_plug_call() instead of notification BH

Commit Message

Stefan Hajnoczi Aug. 15, 2023, 12:08 p.m. UTC
virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used
Buffer Notifications from an IOThread. This involves an eventfd
write(2) syscall. Calling this repeatedly when completing multiple I/O
requests in a row is wasteful.

Use the blk_io_plug_call() API to batch together virtio_irqfd_notify()
calls made during Linux AIO (aio=native) or io_uring (aio=io_uring)
completion processing. Do not modify the thread pool (aio=threads) to
avoid introducing a dependency from util/ onto the block layer.

Behavior is unchanged for emulated devices that do not use blk_io_plug()
since blk_io_plug_call() immediately invokes the callback when called
outside a blk_io_plug()/blk_io_unplug() region.
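
For reference, the batching semantics look roughly like this (a
simplified sketch of block/plug.c, not the exact code -- the real
implementation uses a growable array and bounds checking):

/* Per-thread state for blk_io_plug()/blk_io_unplug() sections */
typedef struct {
    void (*fn)(void *);
    void *opaque;
} UnplugFn;

typedef struct {
    unsigned count;     /* blk_io_plug() nesting depth */
    unsigned num_fns;
    UnplugFn fns[64];   /* fixed size for illustration only */
} Plug;

static __thread Plug plug;

void blk_io_plug(void)
{
    plug.count++;
}

void blk_io_plug_call(void (*fn)(void *), void *opaque)
{
    /* Outside a plug region: invoke the callback immediately */
    if (plug.count == 0) {
        fn(opaque);
        return;
    }

    /* Coalesce duplicates so fn(opaque) runs only once per region */
    for (unsigned i = 0; i < plug.num_fns; i++) {
        if (plug.fns[i].fn == fn && plug.fns[i].opaque == opaque) {
            return;
        }
    }
    plug.fns[plug.num_fns++] = (UnplugFn){ fn, opaque };
}

void blk_io_unplug(void)
{
    /* Run the deferred callbacks when the outermost region ends */
    if (--plug.count == 0) {
        for (unsigned i = 0; i < plug.num_fns; i++) {
            plug.fns[i].fn(plug.fns[i].opaque);
        }
        plug.num_fns = 0;
    }
}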

fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9% with a
single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but this could
be noise. Detailed performance data and configuration specifics are
available here:
https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd

This duplicates the BH that virtio-blk uses for batching. The next
commit will remove it.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 block/io_uring.c   |  6 ++++++
 block/linux-aio.c  |  4 ++++
 hw/virtio/virtio.c | 10 +++++++++-
 3 files changed, 19 insertions(+), 1 deletion(-)

Comments

Ilya Maximets Aug. 16, 2023, 1:36 p.m. UTC | #1
On 8/15/23 14:08, Stefan Hajnoczi wrote:
> virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used
> Buffer Notifications from an IOThread. This involves an eventfd
> write(2) syscall. Calling this repeatedly when completing multiple I/O
> requests in a row is wasteful.

Hi, Stefan.  This is an interesting change!

There is more or less exactly the same problem with fast network backends
and I was playing around with similar ideas in this area while working on
the af-xdp network backend recently.  Primarily, the implementation of the
Rx BH for the virtio-net device and locking the RCU before passing packets
from the backend to the device one by one.

> 
> Use the blk_io_plug_call() API to batch together virtio_irqfd_notify()
> calls made during Linux AIO (aio=native) or io_uring (aio=io_uring)
> completion processing. Do not modify the thread pool (aio=threads) to
> avoid introducing a dependency from util/ onto the block layer.

But you're introducing a dependency from generic virtio code onto the
block layer in this patch.  This seems to break the module abstraction.

It looks like there are 2 options to resolve the semantics issue here:

1. Move virtio_notify_irqfd() from virtio.c down to the block layer.
   Block layer is the only user, so that may be justified, but it
   doesn't seem like a particularly good solution.  (I'm actually not
   sure why block devices are the only ones using this function...)

2. Move and rename the block/plug library somewhere generic.  The plug
   library seems to not have any dependencies on a block layer, other
   than a name, so it should not be hard to generalize it (well, the
   naming might be hard).

In general, while looking at the plug library, it seems to do what is
typically implemented via RCU frameworks - the delayed function call.
The only difference is that RCU doesn't check for duplicates and the
callbacks are global.  Should not be hard to add some new functionality
to RCU framework in order to address these, e.g. rcu_call_local() for
calls that should be executed once the current thread exits its own
critical section.
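
E.g. a declaration along these lines (purely hypothetical, nothing
like this exists in include/qemu/rcu.h today):

/* Hypothetical API: run fn(opaque) once the current thread leaves
 * its outermost RCU read-side critical section, or immediately if
 * the thread is not inside one.  Duplicate fn/opaque registrations
 * within the same critical section would be coalesced. */
void rcu_call_local(void (*fn)(void *opaque), void *opaque);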

Using RCU for non-RCU-protected things might be considered as an abuse.
However, we might solve two issues in one shot if instead of entering
blk_io_plug/unplug section we will enter an RCU critical section and
call callbacks at the exit.  The first issue is the notification batching
that this patch is trying to fix, the second is an excessive number of
thread fences on RCU exits every time virtio_notify_irqfd() and other
virtio functions are invoked.  The second issue can be avoided by using
RCU_READ_LOCK_GUARD() in completion functions.  Not sure if that will
improve performance, but it definitely removes a lot of noise from the
perf top for network backends.  This makes the code a bit less explicit
though, the lock guard will definitely need a comment.  Though, the reason
for blk_io_plug() calls is not fully clear from the module code alone either.

I'm not sure what the best way forward is.  I'm trying to figure that
out myself, since a similar solution will eventually be needed for
network backends and so it might be better to have something more generic.
What do you think?

> 
> Behavior is unchanged for emulated devices that do not use blk_io_plug()
> since blk_io_plug_call() immediately invokes the callback when called
> outside a blk_io_plug()/blk_io_unplug() region.
> 
> fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9% with a
> single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but this could
> be noise. Detailed performance data and configuration specifics are
> available here:
> https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd
> 
> This duplicates the BH that virtio-blk uses for batching. The next
> commit will remove it.

I'm likely missing something, but could you explain why it is safe to
batch unconditionally here?  The current BH code, as you mentioned in
the second patch, is only batching if EVENT_IDX is not set.
Maybe worth adding a few words in the commit message for people like
me, who are a bit rusty on QEMU/virtio internals. :)

Best regards, Ilya Maximets.

> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  block/io_uring.c   |  6 ++++++
>  block/linux-aio.c  |  4 ++++
>  hw/virtio/virtio.c | 10 +++++++++-
>  3 files changed, 19 insertions(+), 1 deletion(-)
> 
> diff --git a/block/io_uring.c b/block/io_uring.c
> index 69d9820928..749cf83934 100644
> --- a/block/io_uring.c
> +++ b/block/io_uring.c
> @@ -124,6 +124,9 @@ static void luring_process_completions(LuringState *s)
>  {
>      struct io_uring_cqe *cqes;
>      int total_bytes;
> +
> +    blk_io_plug();
> +
>      /*
>       * Request completion callbacks can run the nested event loop.
>       * Schedule ourselves so the nested event loop will "see" remaining
> @@ -216,7 +219,10 @@ end:
>              aio_co_wake(luringcb->co);
>          }
>      }
> +
>      qemu_bh_cancel(s->completion_bh);
> +
> +    blk_io_unplug();
>  }
>  
>  static int ioq_submit(LuringState *s)
> diff --git a/block/linux-aio.c b/block/linux-aio.c
> index 561c71a9ae..cef3d6b1c7 100644
> --- a/block/linux-aio.c
> +++ b/block/linux-aio.c
> @@ -204,6 +204,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
>  {
>      struct io_event *events;
>  
> +    blk_io_plug();
> +
>      /* Reschedule so nested event loops see currently pending completions */
>      qemu_bh_schedule(s->completion_bh);
>  
> @@ -230,6 +232,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
>       * own `for` loop.  If we are the last all counters droped to zero. */
>      s->event_max = 0;
>      s->event_idx = 0;
> +
> +    blk_io_unplug();
>  }
>  
>  static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 309038fd46..a691e8526b 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -28,6 +28,7 @@
>  #include "hw/virtio/virtio-bus.h"
>  #include "hw/qdev-properties.h"
>  #include "hw/virtio/virtio-access.h"
> +#include "sysemu/block-backend.h"
>  #include "sysemu/dma.h"
>  #include "sysemu/runstate.h"
>  #include "virtio-qmp.h"
> @@ -2426,6 +2427,13 @@ static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
>      }
>  }
>  
> +/* Batch irqs while inside a blk_io_plug()/blk_io_unplug() section */
> +static void virtio_notify_irqfd_unplug_fn(void *opaque)
> +{
> +    EventNotifier *notifier = opaque;
> +    event_notifier_set(notifier);
> +}
> +
>  void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
>  {
>      WITH_RCU_READ_LOCK_GUARD() {
> @@ -2452,7 +2460,7 @@ void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
>       * to an atomic operation.
>       */
>      virtio_set_isr(vq->vdev, 0x1);
> -    event_notifier_set(&vq->guest_notifier);
> +    blk_io_plug_call(virtio_notify_irqfd_unplug_fn, &vq->guest_notifier);
>  }
>  
>  static void virtio_irq(VirtQueue *vq)
Stefan Hajnoczi Aug. 16, 2023, 3:30 p.m. UTC | #2
On Wed, Aug 16, 2023 at 03:36:32PM +0200, Ilya Maximets wrote:
> On 8/15/23 14:08, Stefan Hajnoczi wrote:
> > virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used
> > Buffer Notifications from an IOThread. This involves an eventfd
> > write(2) syscall. Calling this repeatedly when completing multiple I/O
> > requests in a row is wasteful.
> 
> Hi, Stefan.  This is an interesting change!
> 
> There is more or less exactly the same problem with fast network backends
> and I was playing around with similar ideas in this area while working on
> the af-xdp network backend recently.  Primarily, the implementation of the
> Rx BH for the virtio-net device and locking the RCU before passing packets
> from the backend to the device one by one.
> 
> > 
> > Use the blk_io_plug_call() API to batch together virtio_irqfd_notify()
> > calls made during Linux AIO (aio=native) or io_uring (aio=io_uring)
> > completion processing. Do not modify the thread pool (aio=threads) to
> > avoid introducing a dependency from util/ onto the block layer.
> 
> But you're introducing a dependency from generic virtio code onto the
> block layer in this patch.  This seems to break the module abstraction.
> 
> It looks like there are 2 options to resolve the semantics issue here:

Yes, it's a layering violation.

> 
> 1. Move virtio_notify_irqfd() from virtio.c down to the block layer.
>    Block layer is the only user, so that may be justified, but it
>    doesn't seem like a particularly good solution.  (I'm actually not
>    sure why block devices are the only ones using this function...)

Yes, this is the easiest way to avoid the layering violation for now.

The virtio_notify_irqfd() API is necessary when running in an IOThread
because the normal QEMU irq API must run under the Big QEMU Lock. Block
devices are the only ones that raise interrupts from IOThreads at the
moment.

> 
> 2. Move and rename the block/plug library somewhere generic.  The plug
>    library seems to not have any dependencies on a block layer, other
>    than a name, so it should not be hard to generalize it (well, the
>    naming might be hard).

Yes, it should be possible to make it generic quite easily. I will give
this a try in the next version of the patch.

> In general, while looking at the plug library, it seems to do what is
> typically implemented via RCU frameworks - the delayed function call.
> The only difference is that RCU doesn't check for duplicates and the
> callbacks are global.  Should not be hard to add some new functionality
> to RCU framework in order to address these, e.g. rcu_call_local() for
> calls that should be executed once the current thread exits its own
> critical section.

This rcu_call_local() idea is unrelated to Read Copy Update, so I don't
think it should be part of the RCU API.

Another deferred function call mechanism is QEMUBH. It already supports
coalescing. However, BHs are invoked once per AioContext event loop
iteration and there is no way to invoke the BH earlier. Also the BH pointer
needs to be passed to every function that wishes to schedule a deferred
call, which can be tedious (e.g. block/linux-aio.c should defer the
io_submit(2) syscall until the end of virtio-blk request processing -
there are a lot of layers of code between those two components).

> 
> Using RCU for non-RCU-protected things might be considered as an abuse.
> However, we might solve two issues in one shot if instead of entering
> blk_io_plug/unplug section we will enter an RCU critical section and
> call callbacks at the exit.  The first issue is the notification batching
> that this patch is trying to fix, the second is an excessive number of
> thread fences on RCU exits every time virtio_notify_irqfd() and other
> virtio functions are invoked.  The second issue can be avoided by using
> RCU_READ_LOCK_GUARD() in completion functions.  Not sure if that will
> improve performance, but it definitely removes a lot of noise from the
> perf top for network backends.  This makes the code a bit less explicit
> though, the lock guard will definitely need a comment.  Though, the reason
> for blk_io_plug() calls is not fully clear from the module code alone either.

util/aio-posix.c:run_poll_handlers() has a top-level
RCU_READ_LOCK_GUARD() for this reason. Maybe we should do the same
around aio_bh_poll() + aio_dispatch_ready_handlers() in
util/aio-posix.c:aio_poll()? The downside is that latency-sensitive
call_rcu() callbacks perform worse.

> 
> I'm not sure what the best way forward is.  I'm trying to figure that
> out myself, since a similar solution will eventually be needed for
> network backends and so it might be better to have something more generic.
> What do you think?

I suggest moving blk_io_plug() to util/ and renaming it to
defer_begin/defer_end()/defer_call(). The interface and implementation
can also be improved as needed for other users.
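
Roughly this interface (a proposed sketch, names not final):

/* Proposed util/ API, generalized from block/plug.c (sketch) */
void defer_begin(void);   /* begin a deferral section (nestable) */
void defer_end(void);     /* end a section, invoking deferred calls */

/* Call fn(opaque) at the outermost defer_end(), or immediately when
 * outside a defer_begin()/defer_end() section.  Duplicate fn/opaque
 * pairs are coalesced into a single call. */
void defer_call(void (*fn)(void *), void *opaque);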

> 
> > 
> > Behavior is unchanged for emulated devices that do not use blk_io_plug()
> > since blk_io_plug_call() immediately invokes the callback when called
> > outside a blk_io_plug()/blk_io_unplug() region.
> > 
> > fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9% with a
> > single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but this could
> > be noise. Detailed performance data and configuration specifics are
> > available here:
> > https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd
> > 
> > This duplicates the BH that virtio-blk uses for batching. The next
> > commit will remove it.
> 
> I'm likely missing something, but could you explain why it is safe to
> batch unconditionally here?  The current BH code, as you mentioned in
> the second patch, is only batching if EVENT_IDX is not set.
> Maybe worth adding a few words in the commit message for people like
> me, who are a bit rusty on QEMU/virtio internals. :)

The BH is conditional on EVENT_IDX not for correctness/safety, but for
performance. Commit 12c1c7d7cefb4dbffeb5712e75a33e4692f0a76b
("virtio-blk: dataplane: Don't batch notifications if EVENT_IDX is
present") argued that VIRTIO's EVENT_IDX feature already provides
batching so the BH is not needed in that case.
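
For reference, the EVENT_IDX suppression check looks like this (as
defined in the VIRTIO spec; QEMU has an equivalent helper in
hw/virtio/virtio.c):

/* The device only notifies when the used ring index moves past the
 * event index published by the driver, so a driver that is slow to
 * update used_event gets interrupt batching for free. */
static inline int vring_need_event(uint16_t event_idx,
                                   uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) <
           (uint16_t)(new_idx - old_idx);
}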

Actually I'm not sure exactly why deferring virtio_notify_irqfd() helps,
but IOPS improves on my machine. One theory:

Linux drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_split() (or the
end of the irq handler function) updates EVENT_IDX so that QEMU will
inject an interrupt again when the next request is completed. If the
vCPU and QEMU are running in parallel, then N requests could result in N
interrupts when the vCPU updates EVENT_IDX before QEMU completes the
next request. Maybe that's why I see better performance when deferring
virtio_notify_irqfd().

(I'm thinking of experimenting with rate limiting irq injection so that
it's possible to batch even more efficiently but I worry that it will
be hard to avoid latency vs throughput regressions.)

Stefan

> 
> Best regards, Ilya Maximets.
> 
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  block/io_uring.c   |  6 ++++++
> >  block/linux-aio.c  |  4 ++++
> >  hw/virtio/virtio.c | 10 +++++++++-
> >  3 files changed, 19 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/io_uring.c b/block/io_uring.c
> > index 69d9820928..749cf83934 100644
> > --- a/block/io_uring.c
> > +++ b/block/io_uring.c
> > @@ -124,6 +124,9 @@ static void luring_process_completions(LuringState *s)
> >  {
> >      struct io_uring_cqe *cqes;
> >      int total_bytes;
> > +
> > +    blk_io_plug();
> > +
> >      /*
> >       * Request completion callbacks can run the nested event loop.
> >       * Schedule ourselves so the nested event loop will "see" remaining
> > @@ -216,7 +219,10 @@ end:
> >              aio_co_wake(luringcb->co);
> >          }
> >      }
> > +
> >      qemu_bh_cancel(s->completion_bh);
> > +
> > +    blk_io_unplug();
> >  }
> >  
> >  static int ioq_submit(LuringState *s)
> > diff --git a/block/linux-aio.c b/block/linux-aio.c
> > index 561c71a9ae..cef3d6b1c7 100644
> > --- a/block/linux-aio.c
> > +++ b/block/linux-aio.c
> > @@ -204,6 +204,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >  {
> >      struct io_event *events;
> >  
> > +    blk_io_plug();
> > +
> >      /* Reschedule so nested event loops see currently pending completions */
> >      qemu_bh_schedule(s->completion_bh);
> >  
> > @@ -230,6 +232,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >       * own `for` loop.  If we are the last all counters droped to zero. */
> >      s->event_max = 0;
> >      s->event_idx = 0;
> > +
> > +    blk_io_unplug();
> >  }
> >  
> >  static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
> > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > index 309038fd46..a691e8526b 100644
> > --- a/hw/virtio/virtio.c
> > +++ b/hw/virtio/virtio.c
> > @@ -28,6 +28,7 @@
> >  #include "hw/virtio/virtio-bus.h"
> >  #include "hw/qdev-properties.h"
> >  #include "hw/virtio/virtio-access.h"
> > +#include "sysemu/block-backend.h"
> >  #include "sysemu/dma.h"
> >  #include "sysemu/runstate.h"
> >  #include "virtio-qmp.h"
> > @@ -2426,6 +2427,13 @@ static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
> >      }
> >  }
> >  
> > +/* Batch irqs while inside a blk_io_plug()/blk_io_unplug() section */
> > +static void virtio_notify_irqfd_unplug_fn(void *opaque)
> > +{
> > +    EventNotifier *notifier = opaque;
> > +    event_notifier_set(notifier);
> > +}
> > +
> >  void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >  {
> >      WITH_RCU_READ_LOCK_GUARD() {
> > @@ -2452,7 +2460,7 @@ void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >       * to an atomic operation.
> >       */
> >      virtio_set_isr(vq->vdev, 0x1);
> > -    event_notifier_set(&vq->guest_notifier);
> > +    blk_io_plug_call(virtio_notify_irqfd_unplug_fn, &vq->guest_notifier);
> >  }
> >  
> >  static void virtio_irq(VirtQueue *vq)
>
Ilya Maximets Aug. 16, 2023, 6:30 p.m. UTC | #3
On 8/16/23 17:30, Stefan Hajnoczi wrote:
> On Wed, Aug 16, 2023 at 03:36:32PM +0200, Ilya Maximets wrote:
>> On 8/15/23 14:08, Stefan Hajnoczi wrote:
>>> virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used
>>> Buffer Notifications from an IOThread. This involves an eventfd
>>> write(2) syscall. Calling this repeatedly when completing multiple I/O
>>> requests in a row is wasteful.
>>
>> Hi, Stefan.  This is an interesting change!
>>
>> There is more or less exactly the same problem with fast network backends
>> and I was playing around with similar ideas in this area while working on
>> the af-xdp network backend recently.  Primarily, the implementation of the
>> Rx BH for the virtio-net device and locking the RCU before passing packets
>> from the backend to the device one by one.
>>
>>>
>>> Use the blk_io_plug_call() API to batch together virtio_irqfd_notify()
>>> calls made during Linux AIO (aio=native) or io_uring (aio=io_uring)
>>> completion processing. Do not modify the thread pool (aio=threads) to
>>> avoid introducing a dependency from util/ onto the block layer.
>>
>> But you're introducing a dependency from generic virtio code onto the
>> block layer in this patch.  This seems to break the module abstraction.
>>
>> It looks like there are 2 options to resolve the semantics issue here:
> 
> Yes, it's a layering violation.
> 
>>
>> 1. Move virtio_notify_irqfd() from virtio.c down to the block layer.
>>    Block layer is the only user, so that may be justified, but it
>>    doesn't seem like a particularly good solution.  (I'm actually not
>>    sure why block devices are the only ones using this function...)
> 
> Yes, this is the easiest way to avoid the layering violation for now.
> 
> The virtio_notify_irqfd() API is necessary when running in an IOThread
> because the normal QEMU irq API must run under the Big QEMU Lock. Block
> devices are the only ones that raise interrupts from IOThreads at the
> moment.

Ack.  Thanks for explanation!

> 
>>
>> 2. Move and rename the block/plug library somewhere generic.  The plug
>>    library seems to not have any dependencies on a block layer, other
>>    than a name, so it should not be hard to generalize it (well, the
>>    naming might be hard).
> 
> Yes, it should be possible to make it generic quite easily. I will give
> this a try in the next version of the patch.

OK.  Sounds good to me.

> 
>> In general, while looking at the plug library, it seems to do what is
>> typically implemented via RCU frameworks - the delayed function call.
>> The only difference is that RCU doesn't check for duplicates and the
>> callbacks are global.  Should not be hard to add some new functionality
>> to RCU framework in order to address these, e.g. rcu_call_local() for
>> calls that should be executed once the current thread exits its own
>> critical section.
> 
> This rcu_call_local() idea is unrelated to Read Copy Update, so I don't
> think it should be part of the RCU API.

Agreed.

> Another deferred function call mechanism is QEMUBH. It already supports
> coalescing. However, BHs are invoked once per AioContext event loop
> iteration and there is no way to invoke the BH earlier. Also the BH pointer
> needs to be passed to every function that wishes to schedule a deferred
> call, which can be tedious (e.g. block/linux-aio.c should defer the
> io_submit(2) syscall until the end of virtio-blk request processing -
> there are a lot of layers of code between those two components).
> 
>>
>> Using RCU for non-RCU-protected things might be considered as an abuse.
>> However, we might solve two issues in one shot if instead of entering
>> blk_io_plug/unplug section we will enter an RCU critical section and
>> call callbacks at the exit.  The first issue is the notification batching
>> that this patch is trying to fix, the second is an excessive number of
>> thread fences on RCU exits every time virtio_notify_irqfd() and other
>> virtio functions are invoked.  The second issue can be avoided by using
>> RCU_READ_LOCK_GUARD() in completion functions.  Not sure if that will
>> improve performance, but it definitely removes a lot of noise from the
>> perf top for network backends.  This makes the code a bit less explicit
>> though, the lock guard will definitely need a comment.  Though, the reason
>> for blk_io_plug() calls is not fully clear from the module code alone either.
> 
> util/aio-posix.c:run_poll_handlers() has a top-level
> RCU_READ_LOCK_GUARD() for this reason.

Nice, didn't know that.

> Maybe we should do the same
> around aio_bh_poll() + aio_dispatch_ready_handlers() in
> util/aio-posix.c:aio_poll()? The downside is that latency-sensitive
> call_rcu() callbacks perform worse.

"latency-sensitive call_rcu() callback" is a bit of an oxymoron.
There is no real way to tell when the other thread will exit the
critical section.  But I'm not familiar with the code enough to
make a decision here.

>>
>> I'm not sure what the best way forward is.  I'm trying to figure that
>> out myself, since a similar solution will eventually be needed for
>> network backends and so it might be better to have something more generic.
>> What do you think?
> 
> I suggest moving blk_io_plug() to util/ and renaming it to
> defer_begin/defer_end()/defer_call(). The interface and implementation
> can also be improved as needed for other users.

Sounds good to me.

> 
>>
>>>
>>> Behavior is unchanged for emulated devices that do not use blk_io_plug()
>>> since blk_io_plug_call() immediately invokes the callback when called
>>> outside a blk_io_plug()/blk_io_unplug() region.
>>>
>>> fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9% with a
>>> single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but this could
>>> be noise. Detailed performance data and configuration specifics are
>>> available here:
>>> https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd
>>>
>>> This duplicates the BH that virtio-blk uses for batching. The next
>>> commit will remove it.
>>
>> I'm likely missing something, but could you explain why it is safe to
>> batch unconditionally here?  The current BH code, as you mentioned in
>> the second patch, is only batching if EVENT_IDX is not set.
>> Maybe worth adding a few words in the commit message for people like
>> me, who are a bit rusty on QEMU/virtio internals. :)
> 
> The BH is conditional on EVENT_IDX not for correctness/safety, but for
> performance.

I see.  Thanks!
Do you think using a deferred call in the non-irqfd virtio_notify()
would make sense?  Not in this patch set, but in general, e.g. for
a future use from a network backend.

> Commit 12c1c7d7cefb4dbffeb5712e75a33e4692f0a76b
> ("virtio-blk: dataplane: Don't batch notifications if EVENT_IDX is
> present") argued that VIRTIO's EVENT_IDX feature already provides
> batching so the BH is not needed in that case.
> 
> Actually I'm not sure exactly why deferring virtio_notify_irqfd() helps,
> but IOPS improves on my machine. One theory:
> 
> Linux drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_split() (or the
> end of the irq handler function) updates EVENT_IDX so that QEMU will
> inject an interrupt again when the next request is completed. If the
> vCPU and QEMU are running in parallel, then N requests could result in N
> interrupts when the vCPU updates EVENT_IDX before QEMU completes the
> next request. Maybe that's why I see better performance when deferring
> virtio_notify_irqfd().

Sounds very plausible.  I actually see only latency and interrupt
count measurements in either 12c1c7d7cefb ("virtio-blk: dataplane:
Don't batch notifications if EVENT_IDX is present") or 5b2ffbe4d998
("virtio-blk: dataplane: notify guest as a batch") as well as in
mailing list discussions about these changes.  So, it's possible that
the impact of batching on throughput in terms of IOPS was not actually
measured before (it probably was, but not published).

> 
> (I'm thinking of experimenting with rate limiting irq injection so that
> it's possible to batch even more efficiently but I worry that it will
> be hard to avoid latency vs throughput regressions.)
> 
> Stefan
> 
>>
>> Best regards, Ilya Maximets.
>>
>>>
>>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>  block/io_uring.c   |  6 ++++++
>>>  block/linux-aio.c  |  4 ++++
>>>  hw/virtio/virtio.c | 10 +++++++++-
>>>  3 files changed, 19 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/block/io_uring.c b/block/io_uring.c
>>> index 69d9820928..749cf83934 100644
>>> --- a/block/io_uring.c
>>> +++ b/block/io_uring.c
>>> @@ -124,6 +124,9 @@ static void luring_process_completions(LuringState *s)
>>>  {
>>>      struct io_uring_cqe *cqes;
>>>      int total_bytes;
>>> +
>>> +    blk_io_plug();
>>> +
>>>      /*
>>>       * Request completion callbacks can run the nested event loop.
>>>       * Schedule ourselves so the nested event loop will "see" remaining
>>> @@ -216,7 +219,10 @@ end:
>>>              aio_co_wake(luringcb->co);
>>>          }
>>>      }
>>> +
>>>      qemu_bh_cancel(s->completion_bh);
>>> +
>>> +    blk_io_unplug();
>>>  }
>>>  
>>>  static int ioq_submit(LuringState *s)
>>> diff --git a/block/linux-aio.c b/block/linux-aio.c
>>> index 561c71a9ae..cef3d6b1c7 100644
>>> --- a/block/linux-aio.c
>>> +++ b/block/linux-aio.c
>>> @@ -204,6 +204,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
>>>  {
>>>      struct io_event *events;
>>>  
>>> +    blk_io_plug();
>>> +
>>>      /* Reschedule so nested event loops see currently pending completions */
>>>      qemu_bh_schedule(s->completion_bh);
>>>  
>>> @@ -230,6 +232,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
>>>       * own `for` loop.  If we are the last all counters droped to zero. */
>>>      s->event_max = 0;
>>>      s->event_idx = 0;
>>> +
>>> +    blk_io_unplug();
>>>  }
>>>  
>>>  static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
>>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
>>> index 309038fd46..a691e8526b 100644
>>> --- a/hw/virtio/virtio.c
>>> +++ b/hw/virtio/virtio.c
>>> @@ -28,6 +28,7 @@
>>>  #include "hw/virtio/virtio-bus.h"
>>>  #include "hw/qdev-properties.h"
>>>  #include "hw/virtio/virtio-access.h"
>>> +#include "sysemu/block-backend.h"
>>>  #include "sysemu/dma.h"
>>>  #include "sysemu/runstate.h"
>>>  #include "virtio-qmp.h"
>>> @@ -2426,6 +2427,13 @@ static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
>>>      }
>>>  }
>>>  
>>> +/* Batch irqs while inside a blk_io_plug()/blk_io_unplug() section */
>>> +static void virtio_notify_irqfd_unplug_fn(void *opaque)
>>> +{
>>> +    EventNotifier *notifier = opaque;
>>> +    event_notifier_set(notifier);
>>> +}
>>> +
>>>  void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
>>>  {
>>>      WITH_RCU_READ_LOCK_GUARD() {
>>> @@ -2452,7 +2460,7 @@ void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
>>>       * to an atomic operation.
>>>       */
>>>      virtio_set_isr(vq->vdev, 0x1);
>>> -    event_notifier_set(&vq->guest_notifier);
>>> +    blk_io_plug_call(virtio_notify_irqfd_unplug_fn, &vq->guest_notifier);

There is a trace call in this function.  Should we move it or add a new
trace call to the callback?  The trace used to be printed right before
the actual notification, but now the actual notification is delayed,
potentially reducing the usefulness of the trace.
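
E.g. something like this (untested sketch; the trace event name is
made up and would need a matching trace-events entry, and only the
EventNotifier pointer is available here, not vdev/vq):

static void virtio_notify_irqfd_unplug_fn(void *opaque)
{
    EventNotifier *notifier = opaque;

    /* hypothetical trace event at the point of actual notification */
    trace_virtio_notify_irqfd_deferred(notifier);
    event_notifier_set(notifier);
}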

>>>  }
>>>  
>>>  static void virtio_irq(VirtQueue *vq)
>>
Stefan Hajnoczi Aug. 16, 2023, 9:16 p.m. UTC | #4
On Wed, Aug 16, 2023 at 08:30:58PM +0200, Ilya Maximets wrote:
> On 8/16/23 17:30, Stefan Hajnoczi wrote:
> > On Wed, Aug 16, 2023 at 03:36:32PM +0200, Ilya Maximets wrote:
> >> On 8/15/23 14:08, Stefan Hajnoczi wrote:
> >>> virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used
> >>> Buffer Notifications from an IOThread. This involves an eventfd
> >>> write(2) syscall. Calling this repeatedly when completing multiple I/O
> >>> requests in a row is wasteful.
> >>
> >> Hi, Stefan.  This is an interesting change!
> >>
> >> There is more or less exactly the same problem with fast network backends
> >> and I was playing around with similar ideas in this area while working on
> >> the af-xdp network backend recently.  Primarily, the implementation of the
> >> Rx BH for the virtio-net device and locking the RCU before passing packets
> >> from the backend to the device one by one.
> >>
> >>>
> >>> Use the blk_io_plug_call() API to batch together virtio_irqfd_notify()
> >>> calls made during Linux AIO (aio=native) or io_uring (aio=io_uring)
> >>> completion processing. Do not modify the thread pool (aio=threads) to
> >>> avoid introducing a dependency from util/ onto the block layer.
> >>
> >> But you're introducing a dependency from generic virtio code onto the
> >> block layer in this patch.  This seems to break the module abstraction.
> >>
> >> It looks like there are 2 options to resolve the semantics issue here:
> > 
> > Yes, it's a layering violation.
> > 
> >>
> >> 1. Move virtio_notify_irqfd() from virtio.c down to the block layer.
> >>    Block layer is the only user, so that may be justified, but it
> >>    doesn't seem like a particularly good solution.  (I'm actually not
> >>    sure why block devices are the only ones using this function...)
> > 
> > Yes, this is the easiest way to avoid the layering violation for now.
> > 
> > The virtio_notify_irqfd() API is necessary when running in an IOThread
> > because the normal QEMU irq API must run under the Big QEMU Lock. Block
> > devices are the only ones that raise interrupts from IOThreads at the
> > moment.
> 
> Ack.  Thanks for explanation!
> 
> > 
> >>
> >> 2. Move and rename the block/plug library somewhere generic.  The plug
> >>    library seems to not have any dependencies on a block layer, other
> >>    than a name, so it should not be hard to generalize it (well, the
> >>    naming might be hard).
> > 
> > Yes, it should be possible to make it generic quite easily. I will give
> > this a try in the next version of the patch.
> 
> OK.  Sounds good to me.
> 
> > 
> >> In general, while looking at the plug library, it seems to do what is
> >> typically implemented via RCU frameworks - the delayed function call.
> >> The only difference is that RCU doesn't check for duplicates and the
> >> callbacks are global.  Should not be hard to add some new functionality
> >> to RCU framework in order to address these, e.g. rcu_call_local() for
> >> calls that should be executed once the current thread exits its own
> >> critical section.
> > 
> > This rcu_call_local() idea is unrelated to Read Copy Update, so I don't
> > think it should be part of the RCU API.
> 
> Agreed.
> 
> > Another deferred function call mechanism is QEMUBH. It already supports
> > coalescing. However, BHs are invoked once per AioContext event loop
> > iteration and there is no way to invoke the BH earlier. Also the BH pointer
> > needs to be passed to every function that wishes to schedule a deferred
> > call, which can be tedious (e.g. block/linux-aio.c should defer the
> > io_submit(2) syscall until the end of virtio-blk request processing -
> > there are a lot of layers of code between those two components).
> > 
> >>
> >> Using RCU for non-RCU-protected things might be considered as an abuse.
> >> However, we might solve two issues in one shot if instead of entering
> >> blk_io_plug/unplug section we will enter an RCU critical section and
> >> call callbacks at the exit.  The first issue is the notification batching
> >> that this patch is trying to fix, the second is an excessive number of
> >> thread fences on RCU exits every time virtio_notify_irqfd() and other
> >> virtio functions are invoked.  The second issue can be avoided by using
> >> RCU_READ_LOCK_GUARD() in completion functions.  Not sure if that will
> >> improve performance, but it definitely removes a lot of noise from the
> >> perf top for network backends.  This makes the code a bit less explicit
> >> though, the lock guard will definitely need a comment.  Though, the reason
> >> for blk_io_plug() calls is not fully clear from the module code alone either.
> > 
> > util/aio-posix.c:run_poll_handlers() has a top-level
> > RCU_READ_LOCK_GUARD() for this reason.
> 
> Nice, didn't know that.
> 
> > Maybe we should do the same
> > around aio_bh_poll() + aio_dispatch_ready_handlers() in
> > util/aio-posix.c:aio_poll()? The downside is that latency-sensitive
> > call_rcu() callbacks perform worse.
> 
> "latency-sensitive call_rcu() callback" is a bit of an oxymoron.
> There is no real way to tell when the other thread will exit the
> critical section.  But I'm not familiar with the code enough to
> make a decision here.

I see your point :).

Another issue occurred to me: drain_call_rcu() is invoked by some QEMU
monitor commands and the monitor runs in QEMU's event loop.
drain_call_rcu() waits for the call_rcu thread, which needs
synchronize_rcu() to finish, so I guess it will hang if called with the
RCU read lock held. There is the potential for a deadlock here. It might
be solvable.

> 
> >>
> >> I'm not sure what the best way forward is.  I'm trying to figure that
> >> out myself, since a similar solution will eventually be needed for
> >> network backends and so it might be better to have something more generic.
> >> What do you think?
> > 
> > I suggest moving blk_io_plug() to util/ and renaming it to
> > defer_begin/defer_end()/defer_call(). The interface and implementation
> > can also be improved as needed for other users.
> 
> Sounds good to me.
> 
> > 
> >>
> >>>
> >>> Behavior is unchanged for emulated devices that do not use blk_io_plug()
> >>> since blk_io_plug_call() immediately invokes the callback when called
> >>> outside a blk_io_plug()/blk_io_unplug() region.
> >>>
> >>> fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9% with a
> >>> single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but this could
> >>> be noise. Detailed performance data and configuration specifics are
> >>> available here:
> >>> https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd
> >>>
> >>> This duplicates the BH that virtio-blk uses for batching. The next
> >>> commit will remove it.
> >>
> >> I'm likely missing something, but could you explain why it is safe to
> >> batch unconditionally here?  The current BH code, as you mentioned in
> >> the second patch, is only batching if EVENT_IDX is not set.
> >> Maybe worth adding a few words in the commit message for people like
> >> me, who are a bit rusty on QEMU/virtio internals. :)
> > 
> > The BH is conditional on EVENT_IDX not for correctness/safety, but for
> > performance.
> 
> I see.  Thanks!
> Do you think using a deferred call in the non-irqfd virtio_notify()
> would make sense?  Not in this patch set, but in general, e.g. for
> a future use from a network backend.

Yes, a deferred call may improve performance in the non-irqfd case too.

> > Commit 12c1c7d7cefb4dbffeb5712e75a33e4692f0a76b
> > ("virtio-blk: dataplane: Don't batch notifications if EVENT_IDX is
> > present") argued that VIRTIO's EVENT_IDX feature already provides
> > batching so the BH is not needed in that case.
> > 
> > Actually I'm not sure exactly why deferring virtio_notify_irqfd() helps,
> > but IOPS improves on my machine. One theory:
> > 
> > Linux drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_split() (or the
> > end of the irq handler function) updates EVENT_IDX so that QEMU will
> > inject an interrupt again when the next request is completed. If the
> > vCPU and QEMU are running in parallel, then N requests could result in N
> > interrupts when the vCPU updates EVENT_IDX before QEMU completes the
> > next request. Maybe that's why I see better performance when deferring
> > virtio_notify_irqfd().
> 
> Sounds very plausible.  I actually see only latency and interrupt
> count measurements in either 12c1c7d7cefb ("virtio-blk: dataplane:
> Don't batch notifications if EVENT_IDX is present") or 5b2ffbe4d998
> ("virtio-blk: dataplane: notify guest as a batch") as well as in
> mailing list discussions about these changes.  So, it's possible that
> the impact of batching on throughput in terms of IOPS was not actually
> measured before (it probably was, but not published).
> 
> > 
> > (I'm thinking of experimenting with rate limiting irq injection so that
> > it's possible to batch even more efficienctly but I worry that it will
> > be hard to avoid latency vs throughput regressions.)
> > 
> > Stefan
> > 
> >>
> >> Best regards, Ilya Maximets.
> >>
> >>>
> >>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>> ---
> >>>  block/io_uring.c   |  6 ++++++
> >>>  block/linux-aio.c  |  4 ++++
> >>>  hw/virtio/virtio.c | 10 +++++++++-
> >>>  3 files changed, 19 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/block/io_uring.c b/block/io_uring.c
> >>> index 69d9820928..749cf83934 100644
> >>> --- a/block/io_uring.c
> >>> +++ b/block/io_uring.c
> >>> @@ -124,6 +124,9 @@ static void luring_process_completions(LuringState *s)
> >>>  {
> >>>      struct io_uring_cqe *cqes;
> >>>      int total_bytes;
> >>> +
> >>> +    blk_io_plug();
> >>> +
> >>>      /*
> >>>       * Request completion callbacks can run the nested event loop.
> >>>       * Schedule ourselves so the nested event loop will "see" remaining
> >>> @@ -216,7 +219,10 @@ end:
> >>>              aio_co_wake(luringcb->co);
> >>>          }
> >>>      }
> >>> +
> >>>      qemu_bh_cancel(s->completion_bh);
> >>> +
> >>> +    blk_io_unplug();
> >>>  }
> >>>  
> >>>  static int ioq_submit(LuringState *s)
> >>> diff --git a/block/linux-aio.c b/block/linux-aio.c
> >>> index 561c71a9ae..cef3d6b1c7 100644
> >>> --- a/block/linux-aio.c
> >>> +++ b/block/linux-aio.c
> >>> @@ -204,6 +204,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >>>  {
> >>>      struct io_event *events;
> >>>  
> >>> +    blk_io_plug();
> >>> +
> >>>      /* Reschedule so nested event loops see currently pending completions */
> >>>      qemu_bh_schedule(s->completion_bh);
> >>>  
> >>> @@ -230,6 +232,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >>>       * own `for` loop.  If we are the last all counters droped to zero. */
> >>>      s->event_max = 0;
> >>>      s->event_idx = 0;
> >>> +
> >>> +    blk_io_unplug();
> >>>  }
> >>>  
> >>>  static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
> >>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >>> index 309038fd46..a691e8526b 100644
> >>> --- a/hw/virtio/virtio.c
> >>> +++ b/hw/virtio/virtio.c
> >>> @@ -28,6 +28,7 @@
> >>>  #include "hw/virtio/virtio-bus.h"
> >>>  #include "hw/qdev-properties.h"
> >>>  #include "hw/virtio/virtio-access.h"
> >>> +#include "sysemu/block-backend.h"
> >>>  #include "sysemu/dma.h"
> >>>  #include "sysemu/runstate.h"
> >>>  #include "virtio-qmp.h"
> >>> @@ -2426,6 +2427,13 @@ static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
> >>>      }
> >>>  }
> >>>  
> >>> +/* Batch irqs while inside a blk_io_plug()/blk_io_unplug() section */
> >>> +static void virtio_notify_irqfd_unplug_fn(void *opaque)
> >>> +{
> >>> +    EventNotifier *notifier = opaque;
> >>> +    event_notifier_set(notifier);
> >>> +}
> >>> +
> >>>  void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >>>  {
> >>>      WITH_RCU_READ_LOCK_GUARD() {
> >>> @@ -2452,7 +2460,7 @@ void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >>>       * to an atomic operation.
> >>>       */
> >>>      virtio_set_isr(vq->vdev, 0x1);
> >>> -    event_notifier_set(&vq->guest_notifier);
> >>> +    blk_io_plug_call(virtio_notify_irqfd_unplug_fn, &vq->guest_notifier);
> 
> There is a trace call in this function.  Should we move it or add a new
> trace call to the callback?  The trace used to be printed right before
> the actual notification, but now the actual notification is delayed,
> potentially reducing the usefulness of the trace.
> 
> >>>  }
> >>>  
> >>>  static void virtio_irq(VirtQueue *vq)
> >>
>

Patch

diff --git a/block/io_uring.c b/block/io_uring.c
index 69d9820928..749cf83934 100644
--- a/block/io_uring.c
+++ b/block/io_uring.c
@@ -124,6 +124,9 @@  static void luring_process_completions(LuringState *s)
 {
     struct io_uring_cqe *cqes;
     int total_bytes;
+
+    blk_io_plug();
+
     /*
      * Request completion callbacks can run the nested event loop.
      * Schedule ourselves so the nested event loop will "see" remaining
@@ -216,7 +219,10 @@  end:
             aio_co_wake(luringcb->co);
         }
     }
+
     qemu_bh_cancel(s->completion_bh);
+
+    blk_io_unplug();
 }
 
 static int ioq_submit(LuringState *s)
diff --git a/block/linux-aio.c b/block/linux-aio.c
index 561c71a9ae..cef3d6b1c7 100644
--- a/block/linux-aio.c
+++ b/block/linux-aio.c
@@ -204,6 +204,8 @@  static void qemu_laio_process_completions(LinuxAioState *s)
 {
     struct io_event *events;
 
+    blk_io_plug();
+
     /* Reschedule so nested event loops see currently pending completions */
     qemu_bh_schedule(s->completion_bh);
 
@@ -230,6 +232,8 @@  static void qemu_laio_process_completions(LinuxAioState *s)
      * own `for` loop.  If we are the last all counters droped to zero. */
     s->event_max = 0;
     s->event_idx = 0;
+
+    blk_io_unplug();
 }
 
 static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 309038fd46..a691e8526b 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -28,6 +28,7 @@ 
 #include "hw/virtio/virtio-bus.h"
 #include "hw/qdev-properties.h"
 #include "hw/virtio/virtio-access.h"
+#include "sysemu/block-backend.h"
 #include "sysemu/dma.h"
 #include "sysemu/runstate.h"
 #include "virtio-qmp.h"
@@ -2426,6 +2427,13 @@  static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
     }
 }
 
+/* Batch irqs while inside a blk_io_plug()/blk_io_unplug() section */
+static void virtio_notify_irqfd_unplug_fn(void *opaque)
+{
+    EventNotifier *notifier = opaque;
+    event_notifier_set(notifier);
+}
+
 void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
 {
     WITH_RCU_READ_LOCK_GUARD() {
@@ -2452,7 +2460,7 @@  void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
      * to an atomic operation.
      */
     virtio_set_isr(vq->vdev, 0x1);
-    event_notifier_set(&vq->guest_notifier);
+    blk_io_plug_call(virtio_notify_irqfd_unplug_fn, &vq->guest_notifier);
 }
 
 static void virtio_irq(VirtQueue *vq)