[v5,8/8] migration: Add a wrapper to cleanup migration files

Message ID 20230831183916.13203-9-farosas@suse.de
State New
Series Fix segfault on migration return path

Commit Message

Fabiano Rosas Aug. 31, 2023, 6:39 p.m. UTC
We currently have a pattern for cleaning up a migration QEMUFile:

  qemu_mutex_lock(&s->qemu_file_lock);
  file = s->file_name;
  s->file_name = NULL;
  qemu_mutex_unlock(&s->qemu_file_lock);

  migration_ioc_unregister_yank_from_file(file);
  qemu_file_shutdown(file);
  qemu_fclose(file);

This sequence requires some consideration about locking to avoid
TOCTOU bugs and to avoid passing NULL into functions that don't
expect it.

There's no need to call shutdown() right before close(), and a
shutdown() issued from another thread as a means to unblock a file
should not collide with this close().

Create a wrapper function to make sure the locking is being done
properly. Remove the extra shutdown().

The yank is linked to the QIOChannel, so if more than one QEMUFile
share the same channel, care must be taken to (un)register only one
yank function.

Move the yank unregister before clearing the pointer, so we can avoid
locking. Also add a comment explaining that we're only using the
QEMUFile as a way to access the channel.
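
For readers outside the thread, the detach-then-close pattern described above can be sketched in standalone C. This is only an illustration under stated assumptions: DemoFile, DemoState, demo_fclose() and demo_file_release() are hypothetical stand-ins, not QEMU types or functions, and a plain pthread mutex stands in for qemu_file_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for QEMUFile; not the real QEMU type. */
typedef struct DemoFile {
    int fd;
} DemoFile;

/* Hypothetical stand-in for the relevant parts of MigrationState. */
typedef struct DemoState {
    pthread_mutex_t file_lock;
    DemoFile *file;
} DemoState;

/* Counts closes so a double close would be visible to the caller. */
static int demo_close_count;

/* Stand-in for a close that may block; must not run under file_lock. */
static void demo_fclose(DemoFile *f)
{
    demo_close_count++;
    free(f);
}

/*
 * Detach the pointer inside the critical section, then close the file
 * outside of it. Returns 1 if a file was closed, 0 if the pointer was
 * already NULL, so a NULL never reaches demo_fclose().
 */
static int demo_file_release(DemoState *s, DemoFile **file)
{
    DemoFile *tmp;

    assert(s != NULL && file != NULL);

    pthread_mutex_lock(&s->file_lock);
    tmp = *file;
    *file = NULL;
    pthread_mutex_unlock(&s->file_lock);

    if (!tmp) {
        return 0;
    }
    demo_fclose(tmp);
    return 1;
}
```

Because the pointer is swapped to NULL while the lock is held, a second call on the same pointer finds NULL and does nothing, which is the property the wrapper is meant to centralize.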

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c      | 93 ++++++++++++--------------------------
 migration/yank_functions.c |  5 ++
 2 files changed, 35 insertions(+), 63 deletions(-)

Comments

Peter Xu Sept. 1, 2023, 4:05 p.m. UTC | #1
On Thu, Aug 31, 2023 at 03:39:16PM -0300, Fabiano Rosas wrote:
> @@ -1166,16 +1183,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>          qemu_mutex_lock_iothread();
>  
>          multifd_save_cleanup();
> -        qemu_mutex_lock(&s->qemu_file_lock);
> -        tmp = s->to_dst_file;
> -        s->to_dst_file = NULL;
> -        qemu_mutex_unlock(&s->qemu_file_lock);
> -        /*
> -         * Close the file handle without the lock to make sure the
> -         * critical section won't block for long.
> -         */
> -        migration_ioc_unregister_yank_from_file(tmp);
> -        qemu_fclose(tmp);
> +
> +        migration_ioc_unregister_yank_from_file(s->to_dst_file);

I think you suggested that we should always take the file lock when
operating on these files, so not holding any lock here is a slight step
backwards. But doing so in migrate_fd_cleanup() is probably fine (it
serializes under the BQL with all the other QMP commands, and the migration
thread should no longer exist at this point).  Your call; it's still much cleaner.

Reviewed-by: Peter Xu <peterx@redhat.com>
Fabiano Rosas Sept. 1, 2023, 6:29 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Thu, Aug 31, 2023 at 03:39:16PM -0300, Fabiano Rosas wrote:
>> @@ -1166,16 +1183,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>>          qemu_mutex_lock_iothread();
>>  
>>          multifd_save_cleanup();
>> -        qemu_mutex_lock(&s->qemu_file_lock);
>> -        tmp = s->to_dst_file;
>> -        s->to_dst_file = NULL;
>> -        qemu_mutex_unlock(&s->qemu_file_lock);
>> -        /*
>> -         * Close the file handle without the lock to make sure the
>> -         * critical section won't block for long.
>> -         */
>> -        migration_ioc_unregister_yank_from_file(tmp);
>> -        qemu_fclose(tmp);
>> +
>> +        migration_ioc_unregister_yank_from_file(s->to_dst_file);
>
> I think you suggested that we should always take the file lock when
> operating on these files, so not holding any lock here is a slight step
> backwards. But doing so in migrate_fd_cleanup() is probably fine (it
> serializes under the BQL with all the other QMP commands, and the migration
> thread should no longer exist at this point).  Your call; it's still much cleaner.

I think I was mistaken. We need the lock on the thread that clears the
pointer so that we can safely dereference it on another thread under the
lock.

Here we're accessing it from the same thread that later does the
clearing. So that's a slightly different problem.
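
The distinction drawn here can be sketched as two sides of the same lock. This is a minimal illustration, not QEMU code: DemoFile, DemoState, demo_shutdown_if_present() and demo_detach() are hypothetical names, and a pthread mutex stands in for qemu_file_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real QEMUFile or MigrationState. */
typedef struct DemoFile {
    int shut_down;
} DemoFile;

typedef struct DemoState {
    pthread_mutex_t file_lock;
    DemoFile *to_dst_file;
} DemoState;

/*
 * Reader side (some other thread): check and dereference inside the
 * same critical section, so the pointer cannot be cleared in between.
 * Returns 0 if the file was present, -1 if it was already released.
 */
static int demo_shutdown_if_present(DemoState *s)
{
    int ret = -1;

    assert(s != NULL);

    pthread_mutex_lock(&s->file_lock);
    if (s->to_dst_file) {
        s->to_dst_file->shut_down = 1;
        ret = 0;
    }
    pthread_mutex_unlock(&s->file_lock);
    return ret;
}

/*
 * Owner side (the thread that clears the pointer): the clearing has to
 * happen under the lock for the reader's check above to mean anything.
 * The caller closes the returned file outside the lock.
 */
static DemoFile *demo_detach(DemoState *s)
{
    DemoFile *tmp;

    assert(s != NULL);

    pthread_mutex_lock(&s->file_lock);
    tmp = s->to_dst_file;
    s->to_dst_file = NULL;
    pthread_mutex_unlock(&s->file_lock);
    return tmp;
}
```

An unlocked read on the owner thread is only safe if no other site can clear the pointer concurrently, which is exactly what would need to be justified for each call site.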
Peter Xu Sept. 5, 2023, 3:34 p.m. UTC | #3
On Fri, Sep 01, 2023 at 03:29:51PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, Aug 31, 2023 at 03:39:16PM -0300, Fabiano Rosas wrote:
> >> @@ -1166,16 +1183,9 @@ static void migrate_fd_cleanup(MigrationState *s)
> >>          qemu_mutex_lock_iothread();
> >>  
> >>          multifd_save_cleanup();
> >> -        qemu_mutex_lock(&s->qemu_file_lock);
> >> -        tmp = s->to_dst_file;
> >> -        s->to_dst_file = NULL;
> >> -        qemu_mutex_unlock(&s->qemu_file_lock);
> >> -        /*
> >> -         * Close the file handle without the lock to make sure the
> >> -         * critical section won't block for long.
> >> -         */
> >> -        migration_ioc_unregister_yank_from_file(tmp);
> >> -        qemu_fclose(tmp);
> >> +
> >> +        migration_ioc_unregister_yank_from_file(s->to_dst_file);
> >
> > I think you suggested that we should always take the file lock when
> > operating on these files, so not holding any lock here is a slight step
> > backwards. But doing so in migrate_fd_cleanup() is probably fine (it
> > serializes under the BQL with all the other QMP commands, and the migration
> > thread should no longer exist at this point).  Your call; it's still much cleaner.
> 
> I think I was mistaken. We need the lock on the thread that clears the
> pointer so that we can safely dereference it on another thread under the
> lock.
> 
> Here we're accessing it from the same thread that later does the
> clearing. So that's a slightly different problem.

But this is not the only place to clear it, so you still need to justify
why the other call sites (e.g., postcopy_pause()) won't happen in parallel
with this call site.

The good thing about your proposal (of always taking that lock) is we avoid
those justifications, as you said before. :)

Thanks,
Fabiano Rosas Sept. 5, 2023, 5:25 p.m. UTC | #4
Peter Xu <peterx@redhat.com> writes:

> On Fri, Sep 01, 2023 at 03:29:51PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Thu, Aug 31, 2023 at 03:39:16PM -0300, Fabiano Rosas wrote:
>> >> @@ -1166,16 +1183,9 @@ static void migrate_fd_cleanup(MigrationState *s)
>> >>          qemu_mutex_lock_iothread();
>> >>  
>> >>          multifd_save_cleanup();
>> >> -        qemu_mutex_lock(&s->qemu_file_lock);
>> >> -        tmp = s->to_dst_file;
>> >> -        s->to_dst_file = NULL;
>> >> -        qemu_mutex_unlock(&s->qemu_file_lock);
>> >> -        /*
>> >> -         * Close the file handle without the lock to make sure the
>> >> -         * critical section won't block for long.
>> >> -         */
>> >> -        migration_ioc_unregister_yank_from_file(tmp);
>> >> -        qemu_fclose(tmp);
>> >> +
>> >> +        migration_ioc_unregister_yank_from_file(s->to_dst_file);
>> >
>> > I think you suggested that we should always take the file lock when
>> > operating on these files, so not holding any lock here is a slight step
>> > backwards. But doing so in migrate_fd_cleanup() is probably fine (it
>> > serializes under the BQL with all the other QMP commands, and the migration
>> > thread should no longer exist at this point).  Your call; it's still much cleaner.
>> 
>> I think I was mistaken. We need the lock on the thread that clears the
>> pointer so that we can safely dereference it on another thread under the
>> lock.
>> 
>> Here we're accessing it from the same thread that later does the
>> clearing. So that's a slightly different problem.
>
> But this is not the only place to clear it, so you still need to justify
> why the other call sites (e.g., postcopy_pause()) won't happen in parallel
> with this call site.
>
> The good thing about your proposal (of always taking that lock) is we avoid
> those justifications, as you said before. :)
>

Yes, I should probably try harder to keep it under the lock.

The issue is that without using the QIOChannel reference count or
keeping a flag there's no way to pair the register/unregister of the
yank. Because 1) we'll never be sure whether the yank was previously
registered when calling the unregister and 2) we don't store the ioc, so
we need to access it from the QEMUFile, but then several QEMUFiles can
have the same ioc.

The easiest way to keep it under the lock would be to add a flag:

migration_file_release(QEMUFile **file, bool unregister_yank);

... and only set it when we're sure the yank has been registered. It is
still a bit hand-wavy though.
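
A rough standalone sketch of how the flag variant might behave, with two files sharing one channel. Everything here is an assumption made for illustration: DemoChannel, DemoFile and demo_file_release() are hypothetical, the yank registration is reduced to an integer field, and a global pthread mutex stands in for qemu_file_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical channel; yank_registered models a registered yank. */
typedef struct DemoChannel {
    int yank_registered;
} DemoChannel;

/* Hypothetical file that only borrows a pointer to its channel. */
typedef struct DemoFile {
    DemoChannel *ioc;
    int closed;
} DemoFile;

static pthread_mutex_t demo_file_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Flag variant: the caller states whether this file is the one that
 * owns the yank registration. The unregister happens under the lock,
 * the (possibly blocking) close happens outside of it.
 */
static void demo_file_release(DemoFile **file, bool unregister_yank)
{
    DemoFile *tmp;

    assert(file != NULL);

    pthread_mutex_lock(&demo_file_lock);
    tmp = *file;
    *file = NULL;
    if (tmp && unregister_yank) {
        tmp->ioc->yank_registered = 0;
    }
    pthread_mutex_unlock(&demo_file_lock);

    if (tmp) {
        tmp->closed = 1;
    }
}
```

With two files on the same channel, only the call that passes true touches the registration. In real code the unregister would take the yank lock inside qemu_file_lock, a nesting that the comment removed from postcopy_pause() in this patch says was deliberately avoided, so this sketch shows the interface rather than a settled locking order.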
Patch

diff --git a/migration/migration.c b/migration/migration.c
index 7fec57ad7f..99d21c3442 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -217,6 +217,25 @@  MigrationIncomingState *migration_incoming_get_current(void)
     return current_incoming;
 }
 
+static void migration_file_release(QEMUFile **file)
+{
+    MigrationState *ms = migrate_get_current();
+    QEMUFile *tmp;
+
+    /*
+     * Reset the pointer before releasing it to avoid holding the lock
+     * for too long.
+     */
+    WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
+        tmp = *file;
+        *file = NULL;
+    }
+
+    if (tmp) {
+        qemu_fclose(tmp);
+    }
+}
+
 void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
 {
     if (mis->socket_address_list) {
@@ -1155,8 +1174,6 @@  static void migrate_fd_cleanup(MigrationState *s)
     qemu_savevm_state_cleanup();
 
     if (s->to_dst_file) {
-        QEMUFile *tmp;
-
         trace_migrate_fd_cleanup();
         qemu_mutex_unlock_iothread();
         if (s->migration_thread_running) {
@@ -1166,16 +1183,9 @@  static void migrate_fd_cleanup(MigrationState *s)
         qemu_mutex_lock_iothread();
 
         multifd_save_cleanup();
-        qemu_mutex_lock(&s->qemu_file_lock);
-        tmp = s->to_dst_file;
-        s->to_dst_file = NULL;
-        qemu_mutex_unlock(&s->qemu_file_lock);
-        /*
-         * Close the file handle without the lock to make sure the
-         * critical section won't block for long.
-         */
-        migration_ioc_unregister_yank_from_file(tmp);
-        qemu_fclose(tmp);
+
+        migration_ioc_unregister_yank_from_file(s->to_dst_file);
+        migration_file_release(&s->to_dst_file);
     }
 
     /*
@@ -1815,38 +1825,6 @@  static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
     return 0;
 }
 
-/*
- * Release ms->rp_state.from_dst_file (and postcopy_qemufile_src if
- * existed) in a safe way.
- */
-static void migration_release_dst_files(MigrationState *ms)
-{
-    QEMUFile *file;
-
-    WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
-        /*
-         * Reset the from_dst_file pointer first before releasing it, as we
-         * can't block within lock section
-         */
-        file = ms->rp_state.from_dst_file;
-        ms->rp_state.from_dst_file = NULL;
-    }
-
-    /*
-     * Do the same to postcopy fast path socket too if there is.  No
-     * locking needed because this qemufile should only be managed by
-     * return path thread.
-     */
-    if (ms->postcopy_qemufile_src) {
-        migration_ioc_unregister_yank_from_file(ms->postcopy_qemufile_src);
-        qemu_file_shutdown(ms->postcopy_qemufile_src);
-        qemu_fclose(ms->postcopy_qemufile_src);
-        ms->postcopy_qemufile_src = NULL;
-    }
-
-    qemu_fclose(file);
-}
-
 /*
  * Handles messages sent on the return path towards the source VM
  *
@@ -2046,7 +2024,12 @@  static int await_return_path_close_on_source(MigrationState *ms)
     ret = ms->rp_state.error;
     ms->rp_state.error = false;
 
-    migration_release_dst_files(ms);
+    migration_file_release(&ms->rp_state.from_dst_file);
+
+    if (ms->postcopy_qemufile_src) {
+        migration_ioc_unregister_yank_from_file(ms->postcopy_qemufile_src);
+    }
+    migration_file_release(&ms->postcopy_qemufile_src);
 
     trace_migration_return_path_end_after(ret);
     return ret;
@@ -2502,26 +2485,10 @@  static MigThrError postcopy_pause(MigrationState *s)
     assert(s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
 
     while (true) {
-        QEMUFile *file;
-
-        /*
-         * Current channel is possibly broken. Release it.  Note that this is
-         * guaranteed even without lock because to_dst_file should only be
-         * modified by the migration thread.  That also guarantees that the
-         * unregister of yank is safe too without the lock.  It should be safe
-         * even to be within the qemu_file_lock, but we didn't do that to avoid
-         * taking more mutex (yank_lock) within qemu_file_lock.  TL;DR: we make
-         * the qemu_file_lock critical section as small as possible.
-         */
+        /* Current channel is possibly broken. Release it. */
         assert(s->to_dst_file);
         migration_ioc_unregister_yank_from_file(s->to_dst_file);
-        qemu_mutex_lock(&s->qemu_file_lock);
-        file = s->to_dst_file;
-        s->to_dst_file = NULL;
-        qemu_mutex_unlock(&s->qemu_file_lock);
-
-        qemu_file_shutdown(file);
-        qemu_fclose(file);
+        migration_file_release(&s->to_dst_file);
 
         /*
          * We're already pausing, so ignore any errors on the return
diff --git a/migration/yank_functions.c b/migration/yank_functions.c
index d5a710a3f2..31b0d790e2 100644
--- a/migration/yank_functions.c
+++ b/migration/yank_functions.c
@@ -48,6 +48,11 @@  void migration_ioc_unregister_yank(QIOChannel *ioc)
     }
 }
 
+/*
+ * There's no direct relationship between the QEMUFile and the
+ * yank. This is just a convenience helper because the QIOChannel and
+ * the QEMUFile lifecycles happen to match.
+ */
 void migration_ioc_unregister_yank_from_file(QEMUFile *file)
 {
     QIOChannel *ioc = qemu_file_get_ioc(file);