diff mbox series

[v3,2/5] test-bdrv-drain: avoid race with BH in IOThread drain test

Message ID 20230912231037.826804-3-stefanha@redhat.com
State New
Headers show
Series block-backend: process I/O in the current AioContext | expand

Commit Message

Stefan Hajnoczi Sept. 12, 2023, 11:10 p.m. UTC
This patch fixes a race condition in test-bdrv-drain that is difficult
to reproduce. test-bdrv-drain sometimes fails without an error message
on the block pull request sent by Kevin Wolf on Sep 4, 2023. I was able
to reproduce it locally and found that "block-backend: process I/O in
the current AioContext" (in this patch series) is the first commit where
it reproduces.

I do not know why "block-backend: process I/O in the current AioContext"
exposes this bug. It might be related to the fact that the test's preadv
request runs in the main thread instead of IOThread a after my commit.
That might simply change the timing of the test.

Now on to the race condition in test-bdrv-drain. The main thread
schedules a BH in IOThread a and then drains the BDS:

  aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, &data);

  /* The request is running on the IOThread a. Draining its block device
   * will make sure that it has completed as far as the BDS is concerned,
   * but the drain in this thread can continue immediately after
   * bdrv_dec_in_flight() and aio_ret might be assigned only slightly
   * later. */
  do_drain_begin(drain_type, bs);

If the BH completes before do_drain_begin() then there is nothing to
worry about.

If the BH invokes bdrv_flush() before do_drain_begin(), then
do_drain_begin() waits for it to complete.

The problematic case is when do_drain_begin() runs before the BH enters
bdrv_flush(). Then do_drain_begin() misses the BH and the drain
mechanism has failed in quiescing I/O.

Fix this by incrementing the in_flight counter so that do_drain_begin()
waits for test_iothread_main_thread_bh().

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 tests/unit/test-bdrv-drain.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Eric Blake Sept. 13, 2023, 4:08 p.m. UTC | #1
On Tue, Sep 12, 2023 at 07:10:34PM -0400, Stefan Hajnoczi wrote:
> This patch fixes a race condition in test-bdrv-drain that is difficult
> to reproduce. test-bdrv-drain sometimes fails without an error message
> on the block pull request sent by Kevin Wolf on Sep 4, 2023. I was able
> to reproduce it locally and found that "block-backend: process I/O in
> the current AioContext" (in this patch series) is the first commit where
> it reproduces.
> 
> I do not know why "block-backend: process I/O in the current AioContext"
> exposes this bug. It might be related to the fact that the test's preadv
> request runs in the main thread instead of IOThread a after my commit.

In reading the commit message before the impacted code, my first
thought was that you had a typo of an extra word (that is, something
to fix by s/a //), but reading further, a better fix would be calling
attention to the fact that you are referencing a specific named
thread, as in s/IOThread a/IOThread A/...

> That might simply change the timing of the test.
> 
> Now on to the race condition in test-bdrv-drain. The main thread
> schedules a BH in IOThread a and then drains the BDS:

...and another spot with the same parse issue...

> 
>   aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, &data);
> 
>   /* The request is running on the IOThread a. Draining its block device

...but here you were quoting from the existing code base, which is
where I finally realized it was more than just your commit message.

>    * will make sure that it has completed as far as the BDS is concerned,
>    * but the drain in this thread can continue immediately after
>    * bdrv_dec_in_flight() and aio_ret might be assigned only slightly
>    * later. */
>   do_drain_begin(drain_type, bs);
> 
> If the BH completes before do_drain_begin() then there is nothing to
> worry about.
> 
> If the BH invokes bdrv_flush() before do_drain_begin(), then
> do_drain_begin() waits for it to complete.
> 
> The problematic case is when do_drain_begin() runs before the BH enters
> bdrv_flush(). Then do_drain_begin() misses the BH and the drain
> mechanism has failed in quiescing I/O.
> 
> Fix this by incrementing the in_flight counter so that do_drain_begin()
> waits for test_iothread_main_thread_bh().
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  tests/unit/test-bdrv-drain.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
> index ccc453c29e..67a79aa3f0 100644
> --- a/tests/unit/test-bdrv-drain.c
> +++ b/tests/unit/test-bdrv-drain.c
> @@ -512,6 +512,7 @@ static void test_iothread_main_thread_bh(void *opaque)
>       * executed during drain, otherwise this would deadlock. */
>      aio_context_acquire(bdrv_get_aio_context(data->bs));
>      bdrv_flush(data->bs);
> +    bdrv_dec_in_flight(data->bs); /* incremented by test_iothread_common() */
>      aio_context_release(bdrv_get_aio_context(data->bs));
>  }
>  
> @@ -583,6 +584,13 @@ static void test_iothread_common(enum drain_type drain_type, int drain_thread)
>              aio_context_acquire(ctx_a);
>          }
>  
> +        /*
> +         * Increment in_flight so that do_drain_begin() waits for
> +         * test_iothread_main_thread_bh(). This prevents the race between
> +         * test_iothread_main_thread_bh() in IOThread a and do_drain_begin() in
> +         * this thread. test_iothread_main_thread_bh() decrements in_flight.
> +         */
> +        bdrv_inc_in_flight(bs);
>          aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, &data);
>  
>          /* The request is running on the IOThread a. Draining its block device

and indeed, your commit message is consistent with the current code's
naming convention.  If you have reason to respin, a pre-req patch to
change the case before adding more references might be nice, but I
won't insist.

Reviewed-by: Eric Blake <eblake@redhat.com>
Stefan Hajnoczi Sept. 14, 2023, 12:01 p.m. UTC | #2
On Wed, Sep 13, 2023 at 11:08:54AM -0500, Eric Blake wrote:
> On Tue, Sep 12, 2023 at 07:10:34PM -0400, Stefan Hajnoczi wrote:
> > This patch fixes a race condition in test-bdrv-drain that is difficult
> > to reproduce. test-bdrv-drain sometimes fails without an error message
> > on the block pull request sent by Kevin Wolf on Sep 4, 2023. I was able
> > to reproduce it locally and found that "block-backend: process I/O in
> > the current AioContext" (in this patch series) is the first commit where
> > it reproduces.
> > 
> > I do not know why "block-backend: process I/O in the current AioContext"
> > exposes this bug. It might be related to the fact that the test's preadv
> > request runs in the main thread instead of IOThread a after my commit.
> 
> In reading the commit message before the impacted code, my first
> thought was that you had a typo of an extra word (that is, something
> to fix by s/a //), but reading further, a better fix would be calling
> attention to the fact that you are referencing a specific named
> thread, as in s/IOThread a/IOThread A/...
> 
> > That might simply change the timing of the test.
> > 
> > Now on to the race condition in test-bdrv-drain. The main thread
> > schedules a BH in IOThread a and then drains the BDS:
> 
> ...and another spot with the same parse issue...
> 
> > 
> >   aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, &data);
> > 
> >   /* The request is running on the IOThread a. Draining its block device
> 
> ...but here you were quoting from the existing code base, which is
> where I finally realized it was more than just your commit message.
> 
> >    * will make sure that it has completed as far as the BDS is concerned,
> >    * but the drain in this thread can continue immediately after
> >    * bdrv_dec_in_flight() and aio_ret might be assigned only slightly
> >    * later. */
> >   do_drain_begin(drain_type, bs);
> > 
> > If the BH completes before do_drain_begin() then there is nothing to
> > worry about.
> > 
> > If the BH invokes bdrv_flush() before do_drain_begin(), then
> > do_drain_begin() waits for it to complete.
> > 
> > The problematic case is when do_drain_begin() runs before the BH enters
> > bdrv_flush(). Then do_drain_begin() misses the BH and the drain
> > mechanism has failed in quiescing I/O.
> > 
> > Fix this by incrementing the in_flight counter so that do_drain_begin()
> > waits for test_iothread_main_thread_bh().
> > 
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  tests/unit/test-bdrv-drain.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
> > index ccc453c29e..67a79aa3f0 100644
> > --- a/tests/unit/test-bdrv-drain.c
> > +++ b/tests/unit/test-bdrv-drain.c
> > @@ -512,6 +512,7 @@ static void test_iothread_main_thread_bh(void *opaque)
> >       * executed during drain, otherwise this would deadlock. */
> >      aio_context_acquire(bdrv_get_aio_context(data->bs));
> >      bdrv_flush(data->bs);
> > +    bdrv_dec_in_flight(data->bs); /* incremented by test_iothread_common() */
> >      aio_context_release(bdrv_get_aio_context(data->bs));
> >  }
> >  
> > @@ -583,6 +584,13 @@ static void test_iothread_common(enum drain_type drain_type, int drain_thread)
> >              aio_context_acquire(ctx_a);
> >          }
> >  
> > +        /*
> > +         * Increment in_flight so that do_drain_begin() waits for
> > +         * test_iothread_main_thread_bh(). This prevents the race between
> > +         * test_iothread_main_thread_bh() in IOThread a and do_drain_begin() in
> > +         * this thread. test_iothread_main_thread_bh() decrements in_flight.
> > +         */
> > +        bdrv_inc_in_flight(bs);
> >          aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, &data);
> >  
> >          /* The request is running on the IOThread a. Draining its block device
> 
> and indeed, your commit message is consistent with the current code's
> naming convention.  If you have reason to respin, a pre-req patch to
> change the case before adding more references might be nice, but I
> won't insist.
> 
> Reviewed-by: Eric Blake <eblake@redhat.com>

Sorry about that. It is confusing.

Stefan
diff mbox series

Patch

diff --git a/tests/unit/test-bdrv-drain.c b/tests/unit/test-bdrv-drain.c
index ccc453c29e..67a79aa3f0 100644
--- a/tests/unit/test-bdrv-drain.c
+++ b/tests/unit/test-bdrv-drain.c
@@ -512,6 +512,7 @@  static void test_iothread_main_thread_bh(void *opaque)
      * executed during drain, otherwise this would deadlock. */
     aio_context_acquire(bdrv_get_aio_context(data->bs));
     bdrv_flush(data->bs);
+    bdrv_dec_in_flight(data->bs); /* incremented by test_iothread_common() */
     aio_context_release(bdrv_get_aio_context(data->bs));
 }
 
@@ -583,6 +584,13 @@  static void test_iothread_common(enum drain_type drain_type, int drain_thread)
             aio_context_acquire(ctx_a);
         }
 
+        /*
+         * Increment in_flight so that do_drain_begin() waits for
+         * test_iothread_main_thread_bh(). This prevents the race between
+         * test_iothread_main_thread_bh() in IOThread a and do_drain_begin() in
+         * this thread. test_iothread_main_thread_bh() decrements in_flight.
+         */
+        bdrv_inc_in_flight(bs);
         aio_bh_schedule_oneshot(ctx_a, test_iothread_main_thread_bh, &data);
 
         /* The request is running on the IOThread a. Draining its block device