Message ID | 1455259174-3384-3-git-send-email-den@openvz.org |
---|---|
State | New |
Headers | show |
* Denis V. Lunev (den@openvz.org) wrote: > There is a possibility to hit an assert in qcow2_get_specific_info that > s->qcow_version is undefined. This happens when VM in starting from > suspended state, i.e. it processes incoming migration, and in the same > time 'info block' is called. > > The problem is that qcow2_invalidate_cache() closes the image and > memset()s BDRVQcowState in the middle. > > The patch moves processing of bdrv_invalidate_cache_all out of > coroutine context for postcopy migration to avoid that. This function > is called with the following stack: > process_incoming_migration_co > qemu_loadvm_state > qemu_loadvm_state_main > loadvm_process_command > loadvm_postcopy_handle_run > > Signed-off-by: Denis V. Lunev <den@openvz.org> > CC: Paolo Bonzini <pbonzini@redhat.com> > CC: Juan Quintela <quintela@redhat.com> > CC: Amit Shah <amit.shah@redhat.com> I've tested my normal postcopy smoke test and it seems to work, so Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com> however, can you add a comment before loadvm_postcopy_handle_run_bh to explain why it's a bh, otherwise at some point someone might put it back. I'll admit to not really understanding what the difference is between bh and coroutine context; I'd thought if it was all in the main thread stuff was safe. Dave > --- > migration/savevm.c | 27 +++++++++++++++++---------- > 1 file changed, 17 insertions(+), 10 deletions(-) > > diff --git a/migration/savevm.c b/migration/savevm.c > index 94f2894..8415fd9 100644 > --- a/migration/savevm.c > +++ b/migration/savevm.c > @@ -1496,18 +1496,10 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis) > return 0; > } > > -/* After all discards we can start running and asking for pages */ > -static int loadvm_postcopy_handle_run(MigrationIncomingState *mis) > +static void loadvm_postcopy_handle_run_bh(void *opaque) > { > - PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_RUNNING); > Error *local_err = NULL; > > - trace_loadvm_postcopy_handle_run(); > - if (ps != POSTCOPY_INCOMING_LISTENING) { > - error_report("CMD_POSTCOPY_RUN in wrong postcopy state (%d)", ps); > - return -1; > - } > - > /* TODO we should move all of this lot into postcopy_ram.c or a shared code > * in migration.c > */ > @@ -1519,7 +1511,6 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis) > bdrv_invalidate_cache_all(&local_err); > if (local_err) { > error_report_err(local_err); > - return -1; > } > > trace_loadvm_postcopy_handle_run_cpu_sync(); > @@ -1534,6 +1525,22 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis) > /* leave it paused and let management decide when to start the CPU */ > runstate_set(RUN_STATE_PAUSED); > } > +} > + > +/* After all discards we can start running and asking for pages */ > +static int loadvm_postcopy_handle_run(MigrationIncomingState *mis) > +{ > + PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_RUNNING); > + QEMUBH *bh; > + > + trace_loadvm_postcopy_handle_run(); > + if (ps != POSTCOPY_INCOMING_LISTENING) { > + error_report("CMD_POSTCOPY_RUN in wrong postcopy state (%d)", ps); > + return -1; > + } > + > + bh = qemu_bh_new(loadvm_postcopy_handle_run_bh, NULL); > + qemu_bh_schedule(bh); > > /* We need to finish reading the stream from the package > * and also stop reading anything more from the stream that loaded the > -- > 2.1.4 > > -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 12/02/2016 13:50, Dr. David Alan Gilbert wrote: > I'll admit to not really understanding what the difference is > between bh and coroutine context; I'd thought if it was all > in the main thread stuff was safe. It's arguably a bug in the block layer code. It assumes that all code called from a coroutine wants not to block. Moving stuff to a bottom half tells the block layer that you want bdrv_invalidate_cache_all to block. Paolo
On 02/12/2016 03:55 PM, Paolo Bonzini wrote: > > On 12/02/2016 13:50, Dr. David Alan Gilbert wrote: >> I'll admit to not really understanding what the difference is >> between bh and coroutine context; I'd thought if it was all >> in the main thread stuff was safe. > It's arguably a bug in the block layer code. It assumes that all code > called from a coroutine wants not to block. Moving stuff to a bottom > half tells the block layer that you want bdrv_invalidate_cache_all to block. > > Paolo cool description! Really short and exhaustive!
diff --git a/migration/savevm.c b/migration/savevm.c index 94f2894..8415fd9 100644 --- a/migration/savevm.c +++ b/migration/savevm.c @@ -1496,18 +1496,10 @@ static int loadvm_postcopy_handle_listen(MigrationIncomingState *mis) return 0; } -/* After all discards we can start running and asking for pages */ -static int loadvm_postcopy_handle_run(MigrationIncomingState *mis) +static void loadvm_postcopy_handle_run_bh(void *opaque) { - PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_RUNNING); Error *local_err = NULL; - trace_loadvm_postcopy_handle_run(); - if (ps != POSTCOPY_INCOMING_LISTENING) { - error_report("CMD_POSTCOPY_RUN in wrong postcopy state (%d)", ps); - return -1; - } - /* TODO we should move all of this lot into postcopy_ram.c or a shared code * in migration.c */ @@ -1519,7 +1511,6 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis) bdrv_invalidate_cache_all(&local_err); if (local_err) { error_report_err(local_err); - return -1; } trace_loadvm_postcopy_handle_run_cpu_sync(); @@ -1534,6 +1525,22 @@ static int loadvm_postcopy_handle_run(MigrationIncomingState *mis) /* leave it paused and let management decide when to start the CPU */ runstate_set(RUN_STATE_PAUSED); } +} + +/* After all discards we can start running and asking for pages */ +static int loadvm_postcopy_handle_run(MigrationIncomingState *mis) +{ + PostcopyState ps = postcopy_state_set(POSTCOPY_INCOMING_RUNNING); + QEMUBH *bh; + + trace_loadvm_postcopy_handle_run(); + if (ps != POSTCOPY_INCOMING_LISTENING) { + error_report("CMD_POSTCOPY_RUN in wrong postcopy state (%d)", ps); + return -1; + } + + bh = qemu_bh_new(loadvm_postcopy_handle_run_bh, NULL); + qemu_bh_schedule(bh); /* We need to finish reading the stream from the package * and also stop reading anything more from the stream that loaded the
There is a possibility to hit an assert in qcow2_get_specific_info that s->qcow_version is undefined. This happens when VM in starting from suspended state, i.e. it processes incoming migration, and in the same time 'info block' is called. The problem is that qcow2_invalidate_cache() closes the image and memset()s BDRVQcowState in the middle. The patch moves processing of bdrv_invalidate_cache_all out of coroutine context for postcopy migration to avoid that. This function is called with the following stack: process_incoming_migration_co qemu_loadvm_state qemu_loadvm_state_main loadvm_process_command loadvm_postcopy_handle_run Signed-off-by: Denis V. Lunev <den@openvz.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Juan Quintela <quintela@redhat.com> CC: Amit Shah <amit.shah@redhat.com> --- migration/savevm.c | 27 +++++++++++++++++---------- 1 file changed, 17 insertions(+), 10 deletions(-)