diff mbox series

[v2] migration: hold the BQL during setup

Message ID 20230525164726.45176-1-f.ebner@proxmox.com
State New
Headers show
Series [v2] migration: hold the BQL during setup | expand

Commit Message

Fiona Ebner May 25, 2023, 4:47 p.m. UTC
This is intended to be a semantic revert of commit 9b09503752
("migration: run setup callbacks out of big lock"). There have been so
many changes since that commit (e.g. a new setup callback
dirty_bitmap_save_setup() that also needs to be adapted now), it's
easier to do the revert manually.

For snapshots, the bdrv_writev_vmstate() function is used during setup
(in QIOChannelBlock backing the QEMUFile), but not holding the BQL
while calling it could lead to an assertion failure. To understand
how, first note the following:

1. Generated coroutine wrappers for block layer functions spawn the
coroutine and use AIO_WAIT_WHILE()/aio_poll() to wait for it.
2. If the host OS switches threads at an inconvenient time, it can
happen that a bottom half scheduled for the main thread's AioContext
is executed as part of a vCPU thread's aio_poll().

An example leading to the assertion failure is as follows:

main thread:
1. A snapshot-save QMP command gets issued.
2. snapshot_save_job_bh() is scheduled.

vCPU thread:
3. aio_poll() for the main thread's AioContext is called (e.g. when
the guest writes to a pflash device, as part of blk_pwrite which is a
generated coroutine wrapper).
4. snapshot_save_job_bh() is executed as part of aio_poll().
3. qemu_savevm_state() is called.
4. qemu_mutex_unlock_iothread() is called. Now
qemu_get_current_aio_context() returns 0x0.
5. bdrv_writev_vmstate() is executed during the usual savevm setup
via qemu_fflush(). But this function is a generated coroutine wrapper,
so it uses AIO_WAIT_WHILE. There, the assertion
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
will fail.

To fix it, ensure that the BQL is held during setup. While it would
only be needed for snapshots, adapting migration too avoids additional
logic for conditional locking/unlocking in the setup callbacks.
Writing the header could (in theory) also trigger qemu_fflush() and
thus bdrv_writev_vmstate(), so the locked section also covers the
qemu_savevm_state_header() call, even for migration for consistentcy.

The comment in ram_init_bitmaps() was introduced by 4987783400
("migration: fix incorrect memory_global_dirty_log_start outside BQL")
and is removed, because it referred to the qemu_mutex_lock_iothread()
call.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 include/migration/register.h   | 2 +-
 migration/block-dirty-bitmap.c | 3 ---
 migration/block.c              | 5 -----
 migration/migration.c          | 6 ++++++
 migration/ram.c                | 3 ---
 migration/savevm.c             | 2 --
 6 files changed, 7 insertions(+), 14 deletions(-)

Comments

Juan Quintela May 26, 2023, 8:49 a.m. UTC | #1
Fiona Ebner <f.ebner@proxmox.com> wrote:
> This is intended to be a semantic revert of commit 9b09503752
> ("migration: run setup callbacks out of big lock"). There have been so
> many changes since that commit (e.g. a new setup callback
> dirty_bitmap_save_setup() that also needs to be adapted now), it's
> easier to do the revert manually.
>
> For snapshots, the bdrv_writev_vmstate() function is used during setup
> (in QIOChannelBlock backing the QEMUFile), but not holding the BQL
> while calling it could lead to an assertion failure. To understand
> how, first note the following:
>
> 1. Generated coroutine wrappers for block layer functions spawn the
> coroutine and use AIO_WAIT_WHILE()/aio_poll() to wait for it.
> 2. If the host OS switches threads at an inconvenient time, it can
> happen that a bottom half scheduled for the main thread's AioContext
> is executed as part of a vCPU thread's aio_poll().
>
> An example leading to the assertion failure is as follows:
>
> main thread:
> 1. A snapshot-save QMP command gets issued.
> 2. snapshot_save_job_bh() is scheduled.
>
> vCPU thread:
> 3. aio_poll() for the main thread's AioContext is called (e.g. when
> the guest writes to a pflash device, as part of blk_pwrite which is a
> generated coroutine wrapper).
> 4. snapshot_save_job_bh() is executed as part of aio_poll().
> 3. qemu_savevm_state() is called.
> 4. qemu_mutex_unlock_iothread() is called. Now
> qemu_get_current_aio_context() returns 0x0.
> 5. bdrv_writev_vmstate() is executed during the usual savevm setup
> via qemu_fflush(). But this function is a generated coroutine wrapper,
> so it uses AIO_WAIT_WHILE. There, the assertion
> assert(qemu_get_current_aio_context() == qemu_get_aio_context());
> will fail.
>
> To fix it, ensure that the BQL is held during setup. While it would
> only be needed for snapshots, adapting migration too avoids additional
> logic for conditional locking/unlocking in the setup callbacks.
> Writing the header could (in theory) also trigger qemu_fflush() and
> thus bdrv_writev_vmstate(), so the locked section also covers the
> qemu_savevm_state_header() call, even for migration for consistentcy.
>
> The comment in ram_init_bitmaps() was introduced by 4987783400
> ("migration: fix incorrect memory_global_dirty_log_start outside BQL")
> and is removed, because it referred to the qemu_mutex_lock_iothread()
> call.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

I think we will take this bit, because the problem exist and this is
best solution that I can came with.

Later, Juan.
Juan Quintela May 26, 2023, 10:16 a.m. UTC | #2
Fiona Ebner <f.ebner@proxmox.com> wrote:

Nak

Sometimes it works, and sometimes it hangs.

When I was preparing with make check to send the pull request, I got
this:


Thread 5 (Thread 0x7f01d6ffd6c0 (LWP 921583) "live_migration"):
#0  0x00007f0217ff81d9 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x55d20315e860) at futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x55d20315e860, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2  0x00007f0217ff825f in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x55d20315e860, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3  0x00007f0217ffab79 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f01d6ffc380, cond=0x55d20315e838) at pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0x55d20315e838, mutex=mutex@entry=0x55d20315e808) at pthread_cond_wait.c:618
#5  0x000055d2014fa10b in qemu_cond_wait_impl (cond=0x55d20315e838, mutex=0x55d20315e808, file=0x55d201710eb0 "../../../../mnt/code/qemu/full/util/qemu-thread-posix.c", line=314) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:225
#6  0x000055d2014fa617 in qemu_sem_wait (sem=0x55d20315e808) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:314
#7  0x000055d2011780cf in multifd_send_sync_main (f=f@entry=0x55d202830d10) at ../../../../mnt/code/qemu/full/migration/multifd.c:633
#8  0x000055d2012fdc1b in ram_save_setup (f=0x55d202830d10, opaque=<optimized out>) at ../../../../mnt/code/qemu/full/migration/ram.c:3029
#9  0x000055d201183995 in qemu_savevm_state_setup (f=0x55d202830d10) at ../../../../mnt/code/qemu/full/migration/savevm.c:1269
#10 0x000055d20117482f in migration_thread (opaque=opaque@entry=0x55d202852070) at ../../../../mnt/code/qemu/full/migration/migration.c:2963
#11 0x000055d2014f95f8 in qemu_thread_start (args=0x55d20340a1a0) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:541
#12 0x00007f0217ffb907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#13 0x00007f0218081870 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 4 (Thread 0x7f02158876c0 (LWP 921554) "CPU 0/KVM"):
#0  futex_wait (private=0, expected=2, futex_word=0x55d201dd3000 <qemu_global_mutex>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x55d201dd3000 <qemu_global_mutex>, private=0) at lowlevellock.c:49
#2  0x00007f0217ffee21 in lll_mutex_lock_optimized (mutex=0x55d201dd3000 <qemu_global_mutex>) at pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x55d201dd3000 <qemu_global_mutex>) at pthread_mutex_lock.c:93
#4  0x000055d2014f99d3 in qemu_mutex_lock_impl (mutex=0x55d201dd3000 <qemu_global_mutex>, file=0x55d2016942d0 "../../../../mnt/code/qemu/full/softmmu/physmem.c", line=2588) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:94
#5  0x000055d201148142 in qemu_mutex_lock_iothread_impl (file=file@entry=0x55d2016942d0 "../../../../mnt/code/qemu/full/softmmu/physmem.c", line=line@entry=2588) at ../../../../mnt/code/qemu/full/softmmu/cpus.c:504
#6  0x000055d2012eeaf1 in prepare_mmio_access (mr=0x55d20315db10) at ../../../../mnt/code/qemu/full/softmmu/physmem.c:2588
#7  flatview_write_continue (fv=fv@entry=0x7f01cc016d50, addr=addr@entry=1016, attrs=..., attrs@entry=..., ptr=ptr@entry=0x7f0219524000, len=len@entry=1, addr1=<optimized out>, l=<optimized out>, mr=0x55d20315db10) at ../../../../mnt/code/qemu/full/softmmu/physmem.c:2643
#8  0x000055d2012eebe9 in flatview_write (fv=0x7f01cc016d50, addr=addr@entry=1016, attrs=attrs@entry=..., buf=buf@entry=0x7f0219524000, len=len@entry=1) at ../../../../mnt/code/qemu/full/softmmu/physmem.c:2690
#9  0x000055d2012f1feb in address_space_write (len=1, buf=0x7f0219524000, attrs=..., addr=1016, as=0x55d201de6e60 <address_space_io>) at ../../../../mnt/code/qemu/full/softmmu/physmem.c:2786
#10 address_space_rw (as=0x55d201de6e60 <address_space_io>, addr=addr@entry=1016, attrs=attrs@entry=..., buf=0x7f0219524000, len=len@entry=1, is_write=is_write@entry=true) at ../../../../mnt/code/qemu/full/softmmu/physmem.c:2796
#11 0x000055d20137071b in kvm_handle_io (count=1, size=1, direction=<optimized out>, data=<optimized out>, attrs=..., port=1016) at ../../../../mnt/code/qemu/full/accel/kvm/kvm-all.c:2775
#12 kvm_cpu_exec (cpu=cpu@entry=0x55d202853d60) at ../../../../mnt/code/qemu/full/accel/kvm/kvm-all.c:3026
#13 0x000055d201371bdd in kvm_vcpu_thread_fn (arg=arg@entry=0x55d202853d60) at ../../../../mnt/code/qemu/full/accel/kvm/kvm-accel-ops.c:51
#14 0x000055d2014f95f8 in qemu_thread_start (args=0x55d20285de20) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:541
#15 0x00007f0217ffb907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#16 0x00007f0218081870 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 3 (Thread 0x7f02161ba6c0 (LWP 921553) "IO mon_iothread"):
#0  0x00007f021807435d in __GI___poll (fds=0x7f0208000fe0, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f021859f3a9 in g_main_context_poll (priority=<optimized out>, n_fds=3, fds=0x7f0208000fe0, timeout=<optimized out>, context=0x55d202850d50) at ../glib/gmain.c:4584
#2  g_main_context_iterate.isra.0 (context=0x55d202850d50, block=1, dispatch=1, self=<optimized out>) at ../glib/gmain.c:4271
#3  0x00007f021854099f in g_main_loop_run (loop=0x55d202850f40) at ../glib/gmain.c:4479
#4  0x000055d2013b6819 in iothread_run (opaque=opaque@entry=0x55d20282d870) at ../../../../mnt/code/qemu/full/iothread.c:70
#5  0x000055d2014f95f8 in qemu_thread_start (args=0x55d202850f80) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:541
#6  0x00007f0217ffb907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#7  0x00007f0218081870 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7f0216fff6c0 (LWP 921550) "qemu-system-x86"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x000055d2014fa77a in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /mnt/code/qemu/full/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x55d201e04b00 <rcu_gp_event>) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:464
#3  0x000055d20150391f in wait_for_readers () at ../../../../mnt/code/qemu/full/util/rcu.c:136
#4  synchronize_rcu () at ../../../../mnt/code/qemu/full/util/rcu.c:172
#5  0x000055d201503d8f in call_rcu_thread (opaque=opaque@entry=0x0) at ../../../../mnt/code/qemu/full/util/rcu.c:285
#6  0x000055d2014f95f8 in qemu_thread_start (args=0x55d202586220) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:541
#7  0x00007f0217ffb907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#8  0x00007f0218081870 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f0217a5ff80 (LWP 921548) "qemu-system-x86"):
#0  futex_wait (private=0, expected=2, futex_word=0x55d201dd3000 <qemu_global_mutex>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x55d201dd3000 <qemu_global_mutex>, private=0) at lowlevellock.c:49
#2  0x00007f0217ffee21 in lll_mutex_lock_optimized (mutex=0x55d201dd3000 <qemu_global_mutex>) at pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x55d201dd3000 <qemu_global_mutex>) at pthread_mutex_lock.c:93
#4  0x000055d2014f99d3 in qemu_mutex_lock_impl (mutex=0x55d201dd3000 <qemu_global_mutex>, file=0x55d201714e68 "../../../../mnt/code/qemu/full/util/main-loop.c", line=311) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:94
#5  0x000055d201148142 in qemu_mutex_lock_iothread_impl (file=file@entry=0x55d201714e68 "../../../../mnt/code/qemu/full/util/main-loop.c", line=line@entry=311) at ../../../../mnt/code/qemu/full/softmmu/cpus.c:504
#6  0x000055d20150cf56 in os_host_main_loop_wait (timeout=0) at ../../../../mnt/code/qemu/full/util/main-loop.c:311
#7  main_loop_wait (nonblocking=nonblocking@entry=0) at ../../../../mnt/code/qemu/full/util/main-loop.c:592
#8  0x000055d201152187 in qemu_main_loop () at ../../../../mnt/code/qemu/full/softmmu/runstate.c:738
#9  0x000055d201379ab6 in qemu_default_main () at ../../../../mnt/code/qemu/full/softmmu/main.c:37
#10 0x00007f0217f96b4a in __libc_start_call_main (main=main@entry=0x55d200f4b070 <main>, argc=argc@entry=25, argv=argv@entry=0x7fff8cdb7bc8) at ../sysdeps/nptl/libc_start_call_main.h:58
#11 0x00007f0217f96c0b in __libc_start_main_impl (main=0x55d200f4b070 <main>, argc=25, argv=0x7fff8cdb7bc8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff8cdb7bb8) at ../csu/libc-start.c:360
#12 0x000055d200f4cbd5 in _start ()

And this on the other side:

Thread 4 (Thread 0x7f31277fe6c0 (LWP 921563) "CPU 0/KVM"):
#0  0x00007f312e3941d9 in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x55eb7ed62e0c) at futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x55eb7ed62e0c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2  0x00007f312e39425f in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x55eb7ed62e0c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3  0x00007f312e396b79 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7f31277fd550, cond=0x55eb7ed62de0) at pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=cond@entry=0x55eb7ed62de0, mutex=mutex@entry=0x55eb7d5f9000 <qemu_global_mutex>) at pthread_cond_wait.c:618
#5  0x000055eb7cd2010b in qemu_cond_wait_impl (cond=0x55eb7ed62de0, mutex=0x55eb7d5f9000 <qemu_global_mutex>, file=0x55eb7ce533c0 "../../../../mnt/code/qemu/full/softmmu/cpus.c", line=424) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:225
#6  0x000055eb7c96df71 in qemu_wait_io_event (cpu=cpu@entry=0x55eb7ed58d60) at ../../../../mnt/code/qemu/full/softmmu/cpus.c:424
#7  0x000055eb7cb97bc0 in kvm_vcpu_thread_fn (arg=arg@entry=0x55eb7ed58d60) at ../../../../mnt/code/qemu/full/accel/kvm/kvm-accel-ops.c:56
#8  0x000055eb7cd1f5f8 in qemu_thread_start (args=0x55eb7ed62e20) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:541
#9  0x00007f312e397907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#10 0x00007f312e41d870 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 3 (Thread 0x7f3127fff6c0 (LWP 921562) "IO mon_iothread"):
#0  0x00007f312e41035d in __GI___poll (fds=0x7f3120000fe0, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f312e99f3a9 in g_main_context_poll (priority=<optimized out>, n_fds=3, fds=0x7f3120000fe0, timeout=<optimized out>, context=0x55eb7ed55d50) at ../glib/gmain.c:4584
#2  g_main_context_iterate.isra.0 (context=0x55eb7ed55d50, block=1, dispatch=1, self=<optimized out>) at ../glib/gmain.c:4271
#3  0x00007f312e94099f in g_main_loop_run (loop=0x55eb7ed55f40) at ../glib/gmain.c:4479
#4  0x000055eb7cbdc819 in iothread_run (opaque=opaque@entry=0x55eb7ed32870) at ../../../../mnt/code/qemu/full/iothread.c:70
#5  0x000055eb7cd1f5f8 in qemu_thread_start (args=0x55eb7ed55f80) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:541
#6  0x00007f312e397907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#7  0x00007f312e41d870 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7f312d3ff6c0 (LWP 921559) "qemu-system-x86"):
#0  futex_wait (private=0, expected=2, futex_word=0x55eb7d5f9000 <qemu_global_mutex>) at ../sysdeps/nptl/futex-internal.h:146
#1  __GI___lll_lock_wait (futex=futex@entry=0x55eb7d5f9000 <qemu_global_mutex>, private=0) at lowlevellock.c:49
#2  0x00007f312e39ae21 in lll_mutex_lock_optimized (mutex=0x55eb7d5f9000 <qemu_global_mutex>) at pthread_mutex_lock.c:48
#3  ___pthread_mutex_lock (mutex=mutex@entry=0x55eb7d5f9000 <qemu_global_mutex>) at pthread_mutex_lock.c:93
#4  0x000055eb7cd1f9d3 in qemu_mutex_lock_impl (mutex=0x55eb7d5f9000 <qemu_global_mutex>, file=0x55eb7cf38ce8 "../../../../mnt/code/qemu/full/util/rcu.c", line=286) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:94
#5  0x000055eb7c96e142 in qemu_mutex_lock_iothread_impl (file=file@entry=0x55eb7cf38ce8 "../../../../mnt/code/qemu/full/util/rcu.c", line=line@entry=286) at ../../../../mnt/code/qemu/full/softmmu/cpus.c:504
#6  0x000055eb7cd29cf8 in call_rcu_thread (opaque=opaque@entry=0x0) at ../../../../mnt/code/qemu/full/util/rcu.c:286
#7  0x000055eb7cd1f5f8 in qemu_thread_start (args=0x55eb7ea8b220) at ../../../../mnt/code/qemu/full/util/qemu-thread-posix.c:541
#8  0x00007f312e397907 in start_thread (arg=<optimized out>) at pthread_create.c:444
#9  0x00007f312e41d870 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f312d853f80 (LWP 921557) "qemu-system-x86"):
#0  0x00007f312e41f78b in __recvmsg_syscall (flags=2, msg=0x7ffe27448a50, fd=20) at ../sysdeps/unix/sysv/linux/recvmsg.c:27
#1  __libc_recvmsg (fd=20, msg=msg@entry=0x7ffe27448a50, flags=flags@entry=2) at ../sysdeps/unix/sysv/linux/recvmsg.c:41
#2  0x000055eb7cbb3531 in qio_channel_socket_readv (ioc=<optimized out>, iov=<optimized out>, niov=<optimized out>, fds=0x0, nfds=0x0, flags=<optimized out>, errp=0x7ffe27448bb0) at ../../../../mnt/code/qemu/full/io/channel-socket.c:532
#3  0x000055eb7c9909b0 in migration_channel_read_peek (ioc=ioc@entry=0x55eb7fb859f0, buf=buf@entry=0x7ffe27448bac "", buflen=buflen@entry=4, errp=errp@entry=0x7ffe27448bb0) at ../../../../mnt/code/qemu/full/migration/channel.c:117
#4  0x000055eb7c997de1 in migration_ioc_process_incoming (ioc=ioc@entry=0x55eb7fb859f0, errp=errp@entry=0x7ffe27448bf8) at ../../../../mnt/code/qemu/full/migration/migration.c:689
#5  0x000055eb7c990715 in migration_channel_process_incoming (ioc=0x55eb7fb859f0) at ../../../../mnt/code/qemu/full/migration/channel.c:45
#6  0x000055eb7cbbaab7 in qio_net_listener_channel_func (ioc=<optimized out>, condition=<optimized out>, opaque=<optimized out>) at ../../../../mnt/code/qemu/full/io/net-listener.c:54
#7  0x00007f312e94139c in g_main_dispatch (context=0x55eb7eafdb20) at ../glib/gmain.c:3460
#8  g_main_context_dispatch (context=0x55eb7eafdb20) at ../glib/gmain.c:4200
#9  0x000055eb7cd32fe8 in glib_pollfds_poll () at ../../../../mnt/code/qemu/full/util/main-loop.c:290
#10 os_host_main_loop_wait (timeout=977785000) at ../../../../mnt/code/qemu/full/util/main-loop.c:313
#11 main_loop_wait (nonblocking=nonblocking@entry=0) at ../../../../mnt/code/qemu/full/util/main-loop.c:592
#12 0x000055eb7c978187 in qemu_main_loop () at ../../../../mnt/code/qemu/full/softmmu/runstate.c:738
#13 0x000055eb7cb9fab6 in qemu_default_main () at ../../../../mnt/code/qemu/full/softmmu/main.c:37
#14 0x00007f312e332b4a in __libc_start_call_main (main=main@entry=0x55eb7c771070 <main>, argc=argc@entry=27, argv=argv@entry=0x7ffe27448f18) at ../sysdeps/nptl/libc_start_call_main.h:58
#15 0x00007f312e332c0b in __libc_start_main_impl (main=0x55eb7c771070 <main>, argc=27, argv=0x7ffe27448f18, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe27448f08) at ../csu/libc-start.c:360
#16 0x000055eb7c772bd5 in _start ()

Can you take a look?

Later, Juan.


> This is intended to be a semantic revert of commit 9b09503752
> ("migration: run setup callbacks out of big lock"). There have been so
> many changes since that commit (e.g. a new setup callback
> dirty_bitmap_save_setup() that also needs to be adapted now), it's
> easier to do the revert manually.
>
> For snapshots, the bdrv_writev_vmstate() function is used during setup
> (in QIOChannelBlock backing the QEMUFile), but not holding the BQL
> while calling it could lead to an assertion failure. To understand
> how, first note the following:
>
> 1. Generated coroutine wrappers for block layer functions spawn the
> coroutine and use AIO_WAIT_WHILE()/aio_poll() to wait for it.
> 2. If the host OS switches threads at an inconvenient time, it can
> happen that a bottom half scheduled for the main thread's AioContext
> is executed as part of a vCPU thread's aio_poll().
>
> An example leading to the assertion failure is as follows:
>
> main thread:
> 1. A snapshot-save QMP command gets issued.
> 2. snapshot_save_job_bh() is scheduled.
>
> vCPU thread:
> 3. aio_poll() for the main thread's AioContext is called (e.g. when
> the guest writes to a pflash device, as part of blk_pwrite which is a
> generated coroutine wrapper).
> 4. snapshot_save_job_bh() is executed as part of aio_poll().
> 3. qemu_savevm_state() is called.
> 4. qemu_mutex_unlock_iothread() is called. Now
> qemu_get_current_aio_context() returns 0x0.
> 5. bdrv_writev_vmstate() is executed during the usual savevm setup
> via qemu_fflush(). But this function is a generated coroutine wrapper,
> so it uses AIO_WAIT_WHILE. There, the assertion
> assert(qemu_get_current_aio_context() == qemu_get_aio_context());
> will fail.
>
> To fix it, ensure that the BQL is held during setup. While it would
> only be needed for snapshots, adapting migration too avoids additional
> logic for conditional locking/unlocking in the setup callbacks.
> Writing the header could (in theory) also trigger qemu_fflush() and
> thus bdrv_writev_vmstate(), so the locked section also covers the
> qemu_savevm_state_header() call, even for migration for consistentcy.
>
> The comment in ram_init_bitmaps() was introduced by 4987783400
> ("migration: fix incorrect memory_global_dirty_log_start outside BQL")
> and is removed, because it referred to the qemu_mutex_lock_iothread()
> call.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>  include/migration/register.h   | 2 +-
>  migration/block-dirty-bitmap.c | 3 ---
>  migration/block.c              | 5 -----
>  migration/migration.c          | 6 ++++++
>  migration/ram.c                | 3 ---
>  migration/savevm.c             | 2 --
>  6 files changed, 7 insertions(+), 14 deletions(-)
>
> diff --git a/include/migration/register.h b/include/migration/register.h
> index a8dfd8fefd..ed8c42063b 100644
> --- a/include/migration/register.h
> +++ b/include/migration/register.h
> @@ -20,6 +20,7 @@ typedef struct SaveVMHandlers {
>      /* This runs inside the iothread lock.  */
>      SaveStateHandler *save_state;
>  
> +    int (*save_setup)(QEMUFile *f, void *opaque);
>      void (*save_cleanup)(void *opaque);
>      int (*save_live_complete_postcopy)(QEMUFile *f, void *opaque);
>      int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
> @@ -45,7 +46,6 @@ typedef struct SaveVMHandlers {
>      int (*save_live_iterate)(QEMUFile *f, void *opaque);
>  
>      /* This runs outside the iothread lock!  */
> -    int (*save_setup)(QEMUFile *f, void *opaque);
>      /* Note for save_live_pending:
>       * must_precopy:
>       * - must be migrated in precopy or in stopped state
> diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
> index 032fc5f405..03cb2e72ee 100644
> --- a/migration/block-dirty-bitmap.c
> +++ b/migration/block-dirty-bitmap.c
> @@ -1214,9 +1214,7 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
>      DBMSaveState *s = &((DBMState *)opaque)->save;
>      SaveBitmapState *dbms = NULL;
>  
> -    qemu_mutex_lock_iothread();
>      if (init_dirty_bitmap_migration(s) < 0) {
> -        qemu_mutex_unlock_iothread();
>          return -1;
>      }
>  
> @@ -1224,7 +1222,6 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
>          send_bitmap_start(f, s, dbms);
>      }
>      qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS);
> -    qemu_mutex_unlock_iothread();
>      return 0;
>  }
>  
> diff --git a/migration/block.c b/migration/block.c
> index b9580a6c7e..4df7862f07 100644
> --- a/migration/block.c
> +++ b/migration/block.c
> @@ -722,18 +722,13 @@ static int block_save_setup(QEMUFile *f, void *opaque)
>      trace_migration_block_save("setup", block_mig_state.submitted,
>                                 block_mig_state.transferred);
>  
> -    qemu_mutex_lock_iothread();
>      ret = init_blk_migration(f);
>      if (ret < 0) {
> -        qemu_mutex_unlock_iothread();
>          return ret;
>      }
>  
>      /* start track dirty blocks */
>      ret = set_dirty_tracking();
> -
> -    qemu_mutex_unlock_iothread();
> -
>      if (ret) {
>          return ret;
>      }
> diff --git a/migration/migration.c b/migration/migration.c
> index 5de7f734b9..c1b3c515e9 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2929,7 +2929,9 @@ static void *migration_thread(void *opaque)
>      object_ref(OBJECT(s));
>      update_iteration_initial_status(s);
>  
> +    qemu_mutex_lock_iothread();
>      qemu_savevm_state_header(s->to_dst_file);
> +    qemu_mutex_unlock_iothread();
>  
>      /*
>       * If we opened the return path, we need to make sure dst has it
> @@ -2957,7 +2959,9 @@ static void *migration_thread(void *opaque)
>          qemu_savevm_send_colo_enable(s->to_dst_file);
>      }
>  
> +    qemu_mutex_lock_iothread();
>      qemu_savevm_state_setup(s->to_dst_file);
> +    qemu_mutex_unlock_iothread();
>  
>      qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
>                                 MIGRATION_STATUS_ACTIVE);
> @@ -3068,8 +3072,10 @@ static void *bg_migration_thread(void *opaque)
>      ram_write_tracking_prepare();
>  #endif
>  
> +    qemu_mutex_lock_iothread();
>      qemu_savevm_state_header(s->to_dst_file);
>      qemu_savevm_state_setup(s->to_dst_file);
> +    qemu_mutex_unlock_iothread();
>  
>      qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
>                                 MIGRATION_STATUS_ACTIVE);
> diff --git a/migration/ram.c b/migration/ram.c
> index 9fb076fa58..1a2edef527 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -2881,8 +2881,6 @@ static void migration_bitmap_clear_discarded_pages(RAMState *rs)
>  
>  static void ram_init_bitmaps(RAMState *rs)
>  {
> -    /* For memory_global_dirty_log_start below.  */
> -    qemu_mutex_lock_iothread();
>      qemu_mutex_lock_ramlist();
>  
>      WITH_RCU_READ_LOCK_GUARD() {
> @@ -2894,7 +2892,6 @@ static void ram_init_bitmaps(RAMState *rs)
>          }
>      }
>      qemu_mutex_unlock_ramlist();
> -    qemu_mutex_unlock_iothread();
>  
>      /*
>       * After an eventual first bitmap sync, fixup the initial bitmap
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 03795ce8dc..be97770620 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -1624,10 +1624,8 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
>      memset(&compression_counters, 0, sizeof(compression_counters));
>      ms->to_dst_file = f;
>  
> -    qemu_mutex_unlock_iothread();
>      qemu_savevm_state_header(f);
>      qemu_savevm_state_setup(f);
> -    qemu_mutex_lock_iothread();
>  
>      while (qemu_file_get_error(f) == 0) {
>          if (qemu_savevm_state_iterate(f, false) > 0) {
Fiona Ebner May 26, 2023, 1:47 p.m. UTC | #3
Am 26.05.23 um 12:16 schrieb Juan Quintela:
> Nak
> 
> Sometimes it works, and sometimes it hangs.

Sorry, I originally only ran the tests for x86_64 (native for me). I now
ran into the hang too, with qtest-aarch64/migration-test and
qtest-i386/migration-test.

> Can you take a look?

Will do!

Best Regards,
Fiona
Fiona Ebner May 26, 2023, 3:54 p.m. UTC | #4
Am 26.05.23 um 15:47 schrieb Fiona Ebner:
> Am 26.05.23 um 12:16 schrieb Juan Quintela:
>> Nak
>>
>> Sometimes it works, and sometimes it hangs.
> 
> Sorry, I originally only ran the tests for x86_64 (native for me). I now
> ran into the hang too, with qtest-aarch64/migration-test and
> qtest-i386/migration-test.
> 
>> Can you take a look?
> 
> Will do!
> 

So I took a look at the multifd_send_state->params[$i] and noticed that
the IOChannel c is still NULL and running is still false, while the
name, page_size, etc. have already been initialized.

So it seems like that socket_send_channel_create() did not manage to
execute multifd_new_send_channel_async() yet, telling from the following
in multifd_save_setup():

>         p->page_size = qemu_target_page_size();
>         p->page_count = page_count;
> 
>         if (migrate_zero_copy_send()) {
>             p->write_flags = QIO_CHANNEL_WRITE_FLAG_ZERO_COPY;
>         } else {
>             p->write_flags = 0;
>         }
> 
>         socket_send_channel_create(multifd_new_send_channel_async, p);
>     }

I guess the execution of multifd_new_send_channel_async() after
socket_send_channel_create() somehow depends on the main thread doing
something? But the main thread is waiting on the BQL.

Should I introduce an unlocked section around multifd_send_sync_main()
in ram_save_setup() and hope for the best? The tests ran two times
without errors for me afterwards. Is there an easy way to only run the
single problematic test cases again?

There's still the risk there's something else that needs an unlocked
section. In that regard, v1 of the patch is safer, because it doesn't
change which sections are inside or outside the BQL for migration, just
for snapshot.

Best Regards,
Fiona
diff mbox series

Patch

diff --git a/include/migration/register.h b/include/migration/register.h
index a8dfd8fefd..ed8c42063b 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -20,6 +20,7 @@  typedef struct SaveVMHandlers {
     /* This runs inside the iothread lock.  */
     SaveStateHandler *save_state;
 
+    int (*save_setup)(QEMUFile *f, void *opaque);
     void (*save_cleanup)(void *opaque);
     int (*save_live_complete_postcopy)(QEMUFile *f, void *opaque);
     int (*save_live_complete_precopy)(QEMUFile *f, void *opaque);
@@ -45,7 +46,6 @@  typedef struct SaveVMHandlers {
     int (*save_live_iterate)(QEMUFile *f, void *opaque);
 
     /* This runs outside the iothread lock!  */
-    int (*save_setup)(QEMUFile *f, void *opaque);
     /* Note for save_live_pending:
      * must_precopy:
      * - must be migrated in precopy or in stopped state
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 032fc5f405..03cb2e72ee 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -1214,9 +1214,7 @@  static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
     DBMSaveState *s = &((DBMState *)opaque)->save;
     SaveBitmapState *dbms = NULL;
 
-    qemu_mutex_lock_iothread();
     if (init_dirty_bitmap_migration(s) < 0) {
-        qemu_mutex_unlock_iothread();
         return -1;
     }
 
@@ -1224,7 +1222,6 @@  static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
         send_bitmap_start(f, s, dbms);
     }
     qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS);
-    qemu_mutex_unlock_iothread();
     return 0;
 }
 
diff --git a/migration/block.c b/migration/block.c
index b9580a6c7e..4df7862f07 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -722,18 +722,13 @@  static int block_save_setup(QEMUFile *f, void *opaque)
     trace_migration_block_save("setup", block_mig_state.submitted,
                                block_mig_state.transferred);
 
-    qemu_mutex_lock_iothread();
     ret = init_blk_migration(f);
     if (ret < 0) {
-        qemu_mutex_unlock_iothread();
         return ret;
     }
 
     /* start track dirty blocks */
     ret = set_dirty_tracking();
-
-    qemu_mutex_unlock_iothread();
-
     if (ret) {
         return ret;
     }
diff --git a/migration/migration.c b/migration/migration.c
index 5de7f734b9..c1b3c515e9 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2929,7 +2929,9 @@  static void *migration_thread(void *opaque)
     object_ref(OBJECT(s));
     update_iteration_initial_status(s);
 
+    qemu_mutex_lock_iothread();
     qemu_savevm_state_header(s->to_dst_file);
+    qemu_mutex_unlock_iothread();
 
     /*
      * If we opened the return path, we need to make sure dst has it
@@ -2957,7 +2959,9 @@  static void *migration_thread(void *opaque)
         qemu_savevm_send_colo_enable(s->to_dst_file);
     }
 
+    qemu_mutex_lock_iothread();
     qemu_savevm_state_setup(s->to_dst_file);
+    qemu_mutex_unlock_iothread();
 
     qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
                                MIGRATION_STATUS_ACTIVE);
@@ -3068,8 +3072,10 @@  static void *bg_migration_thread(void *opaque)
     ram_write_tracking_prepare();
 #endif
 
+    qemu_mutex_lock_iothread();
     qemu_savevm_state_header(s->to_dst_file);
     qemu_savevm_state_setup(s->to_dst_file);
+    qemu_mutex_unlock_iothread();
 
     qemu_savevm_wait_unplug(s, MIGRATION_STATUS_SETUP,
                                MIGRATION_STATUS_ACTIVE);
diff --git a/migration/ram.c b/migration/ram.c
index 9fb076fa58..1a2edef527 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -2881,8 +2881,6 @@  static void migration_bitmap_clear_discarded_pages(RAMState *rs)
 
 static void ram_init_bitmaps(RAMState *rs)
 {
-    /* For memory_global_dirty_log_start below.  */
-    qemu_mutex_lock_iothread();
     qemu_mutex_lock_ramlist();
 
     WITH_RCU_READ_LOCK_GUARD() {
@@ -2894,7 +2892,6 @@  static void ram_init_bitmaps(RAMState *rs)
         }
     }
     qemu_mutex_unlock_ramlist();
-    qemu_mutex_unlock_iothread();
 
     /*
      * After an eventual first bitmap sync, fixup the initial bitmap
diff --git a/migration/savevm.c b/migration/savevm.c
index 03795ce8dc..be97770620 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1624,10 +1624,8 @@  static int qemu_savevm_state(QEMUFile *f, Error **errp)
     memset(&compression_counters, 0, sizeof(compression_counters));
     ms->to_dst_file = f;
 
-    qemu_mutex_unlock_iothread();
     qemu_savevm_state_header(f);
     qemu_savevm_state_setup(f);
-    qemu_mutex_lock_iothread();
 
     while (qemu_file_get_error(f) == 0) {
         if (qemu_savevm_state_iterate(f, false) > 0) {