Message ID: 20231017202633.296756-3-peterx@redhat.com
State:      New
Series:     migration: Better error handling in rp thread, allow failures in recover
Peter Xu <peterx@redhat.com> wrote:

> Normally the postcopy recover phase should only exist for a super short
> period, that's the duration when QEMU is trying to recover from an
> interrupted postcopy migration, during which a handshake is carried out
> to continue the procedure with state changes from PAUSED -> RECOVER ->
> POSTCOPY_ACTIVE again.
>
> Here the RECOVER phase should be super short; it happens right after the
> admin specified a new but working network link for QEMU to reconnect to
> the dest QEMU.
>
> However there can still be cases where the channel is broken in this
> small RECOVER window.
>
> If that happens, with the current code there's no way the src QEMU can
> get kicked out of the RECOVER stage, and no way either to retry the
> recovery on another channel once established.
>
> This patch allows the RECOVER phase to fail itself too - we're mostly
> ready, just some small things missing, e.g. properly kicking the main
> migration thread out of its sleep on rp_sem when we find that we're at
> the RECOVER stage.  When this happens, it fails the RECOVER itself, and
> rolls back to the PAUSED stage.  Then the user can retry another round
> of recovery.
>
> To make it even stronger, teach the QMP command migrate-pause to
> explicitly kick src/dst QEMU out when needed, so even if for some reason
> the migration thread didn't already get kicked out by a failing
> return-path thread, the admin can also kick it out.
>
> This will be a super, super corner case, but still try to cover it.
>
> One can try to test this with two proxy channels for migration:
>
>   (a) socat unix-listen:/tmp/src.sock,reuseaddr,fork tcp:localhost:10000
>   (b) socat tcp-listen:10000,reuseaddr,fork unix:/tmp/dst.sock
>
> So the migration channel will be:
>
>             (a)                      (b)
>   src -> /tmp/src.sock -> tcp:10000 -> /tmp/dst.sock -> dst
>
> Then to make QEMU hang at the RECOVER stage, one can do the following:
>
>   (1) stop the postcopy using the QMP command postcopy-pause
>   (2) kill the 2nd proxy (b)
>   (3) try to recover the postcopy using /tmp/src.sock on src
>   (4) src QEMU will go into the RECOVER stage but won't be able to
>       continue from there, because the channel is actually broken at (b)
>
> Before this patch, step (4) will leave src QEMU stuck in the RECOVER
> stage, without a way to kick QEMU out or continue the postcopy again.
> After this patch, (4) will quickly fail QEMU and bounce back to the
> PAUSED stage.
>
> The admin can also kick QEMU from (4) into PAUSED using migrate-pause
> when needed.
>
> After bouncing back to the PAUSED stage, one can recover again.
>
> Reported-by: Xiaohui Li <xiaohli@redhat.com>
> Reviewed-by: Fabiano Rosas <farosas@suse.de>
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2111332
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

> -void migration_rp_wait(MigrationState *s)
> +int migration_rp_wait(MigrationState *s)
>  {
> +    /* If migration has failure already, ignore the wait */
> +    if (migrate_has_error(s)) {
> +        return -1;
> +    }
> +
>      qemu_sem_wait(&s->rp_state.rp_sem);
> +
> +    /* After wait, double check that there's no failure */
> +    if (migrate_has_error(s)) {
> +        return -1;
> +    }
> +
> +    return 0;
>  }

Shouldn't this be bool?

We have (too many) functions in migration that return 0/-1 and set an
error; I think we should change them to return bool.  Or even just test
whether the error is set.

Later, Juan.
On Tue, Oct 31, 2023 at 03:26:42PM +0100, Juan Quintela wrote:
> > -void migration_rp_wait(MigrationState *s)
> > +int migration_rp_wait(MigrationState *s)
> >  {
> > +    /* If migration has failure already, ignore the wait */
> > +    if (migrate_has_error(s)) {
> > +        return -1;
> > +    }
> > +
> >      qemu_sem_wait(&s->rp_state.rp_sem);
> > +
> > +    /* After wait, double check that there's no failure */
> > +    if (migrate_has_error(s)) {
> > +        return -1;
> > +    }
> > +
> > +    return 0;
> >  }
>
> Shouldn't this be bool?
>
> We have (too many) functions in migration that return 0/-1 and set an
> error; I think we should change them to return bool.  Or even just test
> whether the error is set.

Yeah, this patch comes earlier than the switch to bools.  I can make
them bool after I rebase on the new pulls and see what's left over.

Thanks.
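For illustration, the bool-returning shape discussed above might look
like the sketch below.  It reuses the same helpers the patch relies on
(migrate_has_error(), rp_state.rp_sem); treat it as a sketch of the
suggested follow-up, not code from this series:

    /*
     * Sketch of a bool-returning migration_rp_wait(): true on success,
     * false if a migration error is already set, or appears while we
     * were sleeping on rp_sem.
     */
    bool migration_rp_wait(MigrationState *s)
    {
        /* Bail out early if migration already carries an error */
        if (migrate_has_error(s)) {
            return false;
        }

        qemu_sem_wait(&s->rp_state.rp_sem);

        /* Re-check: the rp thread may have failed while we slept */
        return !migrate_has_error(s);
    }

Callers would then read "if (!migration_rp_wait(s)) { return -1; }",
keeping the 0/-1 convention local to the few places that still need it.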
diff --git a/migration/migration.h b/migration/migration.h
index 1d1ac13adb..0b695bbfb1 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -494,6 +494,7 @@ int migrate_init(MigrationState *s, Error **errp);
 bool migration_is_blocked(Error **errp);
 /* True if outgoing migration has entered postcopy phase */
 bool migration_in_postcopy(void);
+bool migration_postcopy_is_alive(int state);
 MigrationState *migrate_get_current(void);

 uint64_t ram_get_total_transferred_pages(void);
@@ -534,8 +535,11 @@ void migration_populate_vfio_info(MigrationInfo *info);
 void migration_reset_vfio_bytes_transferred(void);
 void postcopy_temp_page_reset(PostcopyTmpPage *tmp_page);

-/* Migration thread waiting for return path thread. */
-void migration_rp_wait(MigrationState *s);
+/*
+ * Migration thread waiting for return path thread.  Return non-zero if an
+ * error is detected.
+ */
+int migration_rp_wait(MigrationState *s);
 /*
  * Kick the migration thread waiting for return path messages.  NOTE: the
  * name can be slightly confusing (when read as "kick the rp thread"), just
diff --git a/migration/migration.c b/migration/migration.c
index 3123c4a873..0661dad953 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1348,6 +1348,17 @@ bool migration_in_postcopy(void)
     }
 }

+bool migration_postcopy_is_alive(int state)
+{
+    switch (state) {
+    case MIGRATION_STATUS_POSTCOPY_ACTIVE:
+    case MIGRATION_STATUS_POSTCOPY_RECOVER:
+        return true;
+    default:
+        return false;
+    }
+}
+
 bool migration_in_postcopy_after_devices(MigrationState *s)
 {
     return migration_in_postcopy() && s->postcopy_after_devices;
@@ -1557,8 +1568,15 @@ void qmp_migrate_pause(Error **errp)
     MigrationIncomingState *mis = migration_incoming_get_current();
     int ret = 0;

-    if (ms->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
+    if (migration_postcopy_is_alive(ms->state)) {
         /* Source side, during postcopy */
+        Error *error = NULL;
+
+        /* Tell the core migration that we're pausing */
+        error_setg(&error, "Postcopy migration is paused by the user");
+        migrate_set_error(ms, error);
+        error_free(error);
+
         qemu_mutex_lock(&ms->qemu_file_lock);
         if (ms->to_dst_file) {
             ret = qemu_file_shutdown(ms->to_dst_file);
@@ -1567,10 +1585,17 @@ void qmp_migrate_pause(Error **errp)
         if (ret) {
             error_setg(errp, "Failed to pause source migration");
         }
+
+        /*
+         * Kick the migration thread out of any waiting windows (on behalf
+         * of the rp thread).
+         */
+        migration_rp_kick(ms);
+
         return;
     }

-    if (mis->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
+    if (migration_postcopy_is_alive(mis->state)) {
         ret = qemu_file_shutdown(mis->from_src_file);
         if (ret) {
             error_setg(errp, "Failed to pause destination migration");
@@ -1579,7 +1604,7 @@ void qmp_migrate_pause(Error **errp)
     }

     error_setg(errp, "migrate-pause is currently only supported "
-               "during postcopy-active state");
+               "during postcopy-active or postcopy-recover state");
 }

 bool migration_is_blocked(Error **errp)
@@ -1754,9 +1779,21 @@ void qmp_migrate_continue(MigrationStatus state, Error **errp)
     qemu_sem_post(&s->pause_sem);
 }

-void migration_rp_wait(MigrationState *s)
+int migration_rp_wait(MigrationState *s)
 {
+    /* If migration has failure already, ignore the wait */
+    if (migrate_has_error(s)) {
+        return -1;
+    }
+
     qemu_sem_wait(&s->rp_state.rp_sem);
+
+    /* After wait, double check that there's no failure */
+    if (migrate_has_error(s)) {
+        return -1;
+    }
+
+    return 0;
 }

 void migration_rp_kick(MigrationState *s)
@@ -2018,6 +2055,20 @@ out:
         trace_source_return_path_thread_bad_end();
     }

+    if (ms->state == MIGRATION_STATUS_POSTCOPY_RECOVER) {
+        /*
+         * this will be extremely unlikely: that we got yet another network
+         * issue during recovering of the 1st network failure.. during this
+         * period the main migration thread can be waiting on rp_sem for
+         * this thread to sync with the other side.
+         *
+         * When this happens, explicitly kick the migration thread out of
+         * RECOVER stage and back to PAUSED, so the admin can try
+         * everything again.
+         */
+        migration_rp_kick(ms);
+    }
+
     trace_source_return_path_thread_end();
     rcu_unregister_thread();
@@ -2482,7 +2533,9 @@ static int postcopy_resume_handshake(MigrationState *s)
     qemu_savevm_send_postcopy_resume(s->to_dst_file);

     while (s->state == MIGRATION_STATUS_POSTCOPY_RECOVER) {
-        migration_rp_wait(s);
+        if (migration_rp_wait(s)) {
+            return -1;
+        }
     }

     if (s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE) {
diff --git a/migration/ram.c b/migration/ram.c
index 544c5b63cf..973e872284 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -4175,7 +4175,9 @@ static int ram_dirty_bitmap_sync_all(MigrationState *s, RAMState *rs)

     /* Wait until all the ramblocks' dirty bitmap synced */
     while (qatomic_read(&rs->postcopy_bmap_sync_requested)) {
-        migration_rp_wait(s);
+        if (migration_rp_wait(s)) {
+            return -1;
+        }
     }

     trace_ram_dirty_bitmap_sync_complete();
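For reference, the src-side QMP sequence for the test steps in the commit
message might look roughly like the transcript below.  The socket path
comes from the test recipe above, and migrate-pause / migrate with
"resume": true are QEMU's standard postcopy-recovery interface; treat
this as an illustrative sketch, not part of the patch:

    (1) pause the postcopy on src:
        { "execute": "migrate-pause" }

    (2) kill the second socat proxy (b) from a shell

    (3) attempt recovery through the still-listening first proxy:
        { "execute": "migrate",
          "arguments": { "uri": "unix:/tmp/src.sock", "resume": true } }

    (4) with this patch, the handshake fails and the state bounces back:
        { "execute": "query-migrate" }
        -> "status" reports "postcopy-paused" again, so the recovery can
           be retried; a src QEMU stuck in "postcopy-recover" can also be
           forced back to paused with another migrate-pause.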