diff mbox series

[v3,10/10] migration: Add a wrapper to cleanup migration files

Message ID 20230811150836.2895-11-farosas@suse.de
State New
Headers show
Series Fix segfault on migration return path | expand

Commit Message

Fabiano Rosas Aug. 11, 2023, 3:08 p.m. UTC
We currently have a pattern for cleaning up a migration QEMUFile:

  qemu_mutex_lock(&s->qemu_file_lock);
  file = s->file_name;
  s->file_name = NULL;
  qemu_mutex_unlock(&s->qemu_file_lock);

  migration_ioc_unregister_yank_from_file(file);
  qemu_file_shutdown(file);
  qemu_fclose(file);

There are some considerations for this sequence:

- we must clear the pointer under the lock, to avoid TOC/TOU bugs;
- the shutdown() and close() expect be given a non-null parameter;
- a close() in one thread should not race with a shutdown() in another;

Create a wrapper function to make sure everything works correctly.

Note: the return path did not used to call
      migration_ioc_unregister_yank_from_file(), but I added it
      nonetheless for uniformity.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 92 ++++++++++++-------------------------------
 util/yank.c           |  2 -
 2 files changed, 26 insertions(+), 68 deletions(-)

Comments

Peter Xu Aug. 15, 2023, 10:15 p.m. UTC | #1
On Fri, Aug 11, 2023 at 12:08:36PM -0300, Fabiano Rosas wrote:
> We currently have a pattern for cleaning up a migration QEMUFile:
> 
>   qemu_mutex_lock(&s->qemu_file_lock);
>   file = s->file_name;
>   s->file_name = NULL;
>   qemu_mutex_unlock(&s->qemu_file_lock);
> 
>   migration_ioc_unregister_yank_from_file(file);
>   qemu_file_shutdown(file);
>   qemu_fclose(file);
> 
> There are some considerations for this sequence:
> 
> - we must clear the pointer under the lock, to avoid TOC/TOU bugs;
> - the shutdown() and close() expect be given a non-null parameter;
> - a close() in one thread should not race with a shutdown() in another;
> 
> Create a wrapper function to make sure everything works correctly.
> 
> Note: the return path did not used to call
>       migration_ioc_unregister_yank_from_file(), but I added it
>       nonetheless for uniformity.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>

This definitely looks cleaner.  Probably can be squashed together with
previous patch?  If you could double check whether we can just drop the
shutdown() all over the places when close() altogether, it'll be even
nicer (I hope I didn't miss any real reasons to explicitly do that).

> diff --git a/util/yank.c b/util/yank.c
> index abf47c346d..4b6afbf589 100644
> --- a/util/yank.c
> +++ b/util/yank.c
> @@ -146,8 +146,6 @@ void yank_unregister_function(const YankInstance *instance,
>              return;
>          }
>      }
> -
> -    abort();

I think we can't silently do this.  This check is very strict and I guess
you removed it because you hit a crash.  What's the crash?  Can we just
pair the yank reg/unreg?

>  }
>  
>  void qmp_yank(YankInstanceList *instances,
> -- 
> 2.35.3
>
Fabiano Rosas Aug. 15, 2023, 10:31 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Fri, Aug 11, 2023 at 12:08:36PM -0300, Fabiano Rosas wrote:
>> We currently have a pattern for cleaning up a migration QEMUFile:
>> 
>>   qemu_mutex_lock(&s->qemu_file_lock);
>>   file = s->file_name;
>>   s->file_name = NULL;
>>   qemu_mutex_unlock(&s->qemu_file_lock);
>> 
>>   migration_ioc_unregister_yank_from_file(file);
>>   qemu_file_shutdown(file);
>>   qemu_fclose(file);
>> 
>> There are some considerations for this sequence:
>> 
>> - we must clear the pointer under the lock, to avoid TOC/TOU bugs;
>> - the shutdown() and close() expect be given a non-null parameter;
>> - a close() in one thread should not race with a shutdown() in another;
>> 
>> Create a wrapper function to make sure everything works correctly.
>> 
>> Note: the return path did not used to call
>>       migration_ioc_unregister_yank_from_file(), but I added it
>>       nonetheless for uniformity.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>
> This definitely looks cleaner.  Probably can be squashed together with
> previous patch?  If you could double check whether we can just drop the
> shutdown() all over the places when close() altogether, it'll be even
> nicer (I hope I didn't miss any real reasons to explicitly do that).
>
>> diff --git a/util/yank.c b/util/yank.c
>> index abf47c346d..4b6afbf589 100644
>> --- a/util/yank.c
>> +++ b/util/yank.c
>> @@ -146,8 +146,6 @@ void yank_unregister_function(const YankInstance *instance,
>>              return;
>>          }
>>      }
>> -
>> -    abort();
>
> I think we can't silently do this.  This check is very strict and I guess
> you removed it because you hit a crash.  What's the crash?  Can we just
> pair the yank reg/unreg?
>

Well, the abort() is the crash. It just means that we looped and didn't
find the handler to unregister. It looks harmless to me. I should have
mentioned this in the commit message.

I could certainly add a yank handler to the rp_state.from_dst_file. But
then I have no idea what will happen if we try to yank the return path
at a random moment.

Side note: I see that yank does a qio_channel_shutdown() without the
controversial setting of -EIO. Which means it is probably succeptible to
the same race described in the qemu_file_shutdown() code.
Peter Xu Aug. 16, 2023, 2:26 p.m. UTC | #3
On Tue, Aug 15, 2023 at 07:31:28PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Fri, Aug 11, 2023 at 12:08:36PM -0300, Fabiano Rosas wrote:
> >> We currently have a pattern for cleaning up a migration QEMUFile:
> >> 
> >>   qemu_mutex_lock(&s->qemu_file_lock);
> >>   file = s->file_name;
> >>   s->file_name = NULL;
> >>   qemu_mutex_unlock(&s->qemu_file_lock);
> >> 
> >>   migration_ioc_unregister_yank_from_file(file);
> >>   qemu_file_shutdown(file);
> >>   qemu_fclose(file);
> >> 
> >> There are some considerations for this sequence:
> >> 
> >> - we must clear the pointer under the lock, to avoid TOC/TOU bugs;
> >> - the shutdown() and close() expect be given a non-null parameter;
> >> - a close() in one thread should not race with a shutdown() in another;
> >> 
> >> Create a wrapper function to make sure everything works correctly.
> >> 
> >> Note: the return path did not used to call
> >>       migration_ioc_unregister_yank_from_file(), but I added it
> >>       nonetheless for uniformity.
> >> 
> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> >
> > This definitely looks cleaner.  Probably can be squashed together with
> > previous patch?  If you could double check whether we can just drop the
> > shutdown() all over the places when close() altogether, it'll be even
> > nicer (I hope I didn't miss any real reasons to explicitly do that).
> >
> >> diff --git a/util/yank.c b/util/yank.c
> >> index abf47c346d..4b6afbf589 100644
> >> --- a/util/yank.c
> >> +++ b/util/yank.c
> >> @@ -146,8 +146,6 @@ void yank_unregister_function(const YankInstance *instance,
> >>              return;
> >>          }
> >>      }
> >> -
> >> -    abort();
> >
> > I think we can't silently do this.  This check is very strict and I guess
> > you removed it because you hit a crash.  What's the crash?  Can we just
> > pair the yank reg/unreg?
> >
> 
> Well, the abort() is the crash. It just means that we looped and didn't
> find the handler to unregister. It looks harmless to me. I should have
> mentioned this in the commit message.

Yeah, trust me I wanted to remove that for quite a few times. :) But then I
normally decided to try harder to find what's missing; and so far indeed I
found that the cleanest way is always pair the reg/unreg.

> 
> I could certainly add a yank handler to the rp_state.from_dst_file. But
> then I have no idea what will happen if we try to yank the return path
> at a random moment.

I think the idea was it should be registered always when the channel is
created, and then unregistered when the channel is destroyed.  They should
just pair, alongside with the channel's lifecycle?

> 
> Side note: I see that yank does a qio_channel_shutdown() without the
> controversial setting of -EIO. Which means it is probably succeptible to
> the same race described in the qemu_file_shutdown() code.

Are you looking outside migration code (I saw nbd_teardown_connection()
does have one)?

For migration IIUC it's always via migration_ioc_unregister_yank().
Fabiano Rosas Aug. 16, 2023, 2:57 p.m. UTC | #4
Peter Xu <peterx@redhat.com> writes:

> On Tue, Aug 15, 2023 at 07:31:28PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Fri, Aug 11, 2023 at 12:08:36PM -0300, Fabiano Rosas wrote:
>> >> We currently have a pattern for cleaning up a migration QEMUFile:
>> >> 
>> >>   qemu_mutex_lock(&s->qemu_file_lock);
>> >>   file = s->file_name;
>> >>   s->file_name = NULL;
>> >>   qemu_mutex_unlock(&s->qemu_file_lock);
>> >> 
>> >>   migration_ioc_unregister_yank_from_file(file);
>> >>   qemu_file_shutdown(file);
>> >>   qemu_fclose(file);
>> >> 
>> >> There are some considerations for this sequence:
>> >> 
>> >> - we must clear the pointer under the lock, to avoid TOC/TOU bugs;
>> >> - the shutdown() and close() expect be given a non-null parameter;
>> >> - a close() in one thread should not race with a shutdown() in another;
>> >> 
>> >> Create a wrapper function to make sure everything works correctly.
>> >> 
>> >> Note: the return path did not used to call
>> >>       migration_ioc_unregister_yank_from_file(), but I added it
>> >>       nonetheless for uniformity.
>> >> 
>> >> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> >
>> > This definitely looks cleaner.  Probably can be squashed together with
>> > previous patch?  If you could double check whether we can just drop the
>> > shutdown() all over the places when close() altogether, it'll be even
>> > nicer (I hope I didn't miss any real reasons to explicitly do that).
>> >
>> >> diff --git a/util/yank.c b/util/yank.c
>> >> index abf47c346d..4b6afbf589 100644
>> >> --- a/util/yank.c
>> >> +++ b/util/yank.c
>> >> @@ -146,8 +146,6 @@ void yank_unregister_function(const YankInstance *instance,
>> >>              return;
>> >>          }
>> >>      }
>> >> -
>> >> -    abort();
>> >
>> > I think we can't silently do this.  This check is very strict and I guess
>> > you removed it because you hit a crash.  What's the crash?  Can we just
>> > pair the yank reg/unreg?
>> >
>> 
>> Well, the abort() is the crash. It just means that we looped and didn't
>> find the handler to unregister. It looks harmless to me. I should have
>> mentioned this in the commit message.
>
> Yeah, trust me I wanted to remove that for quite a few times. :) But then I
> normally decided to try harder to find what's missing; and so far indeed I
> found that the cleanest way is always pair the reg/unreg.
>
>> 
>> I could certainly add a yank handler to the rp_state.from_dst_file. But
>> then I have no idea what will happen if we try to yank the return path
>> at a random moment.
>
> I think the idea was it should be registered always when the channel is
> created, and then unregistered when the channel is destroyed.  They should
> just pair, alongside with the channel's lifecycle?
>
>> 
>> Side note: I see that yank does a qio_channel_shutdown() without the
>> controversial setting of -EIO. Which means it is probably succeptible to
>> the same race described in the qemu_file_shutdown() code.
>
> Are you looking outside migration code (I saw nbd_teardown_connection()
> does have one)?
>
> For migration IIUC it's always via migration_ioc_unregister_yank().

I'm talking about the actual yank action, not the unregister.

migration_yank_iochannel() calls qio_channel_shutdown() in the same way
as qemu_file_shutdown(), but unlike the latter, it doesn't set
f->last_error = -EIO. Which means that in theory, we could yank and
still try to use the QEMUFile.

In other words, what commit a555b8092a ("qemu-file: Don't do IO after
shutdown") did does not apply to yank because yank didn't exit at the
time.
Peter Xu Aug. 16, 2023, 3:26 p.m. UTC | #5
On Wed, Aug 16, 2023 at 11:57:17AM -0300, Fabiano Rosas wrote:
> I'm talking about the actual yank action, not the unregister.
> 
> migration_yank_iochannel() calls qio_channel_shutdown() in the same way
> as qemu_file_shutdown(), but unlike the latter, it doesn't set
> f->last_error = -EIO. Which means that in theory, we could yank and
> still try to use the QEMUFile.
> 
> In other words, what commit a555b8092a ("qemu-file: Don't do IO after
> shutdown") did does not apply to yank because yank didn't exit at the
> time.

Ah ok..

Perhaps we should register yank over the qemufiles not ioc for migrations?
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index 4df5ca25c1..3c33e4fae4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -217,6 +217,27 @@  MigrationIncomingState *migration_incoming_get_current(void)
     return current_incoming;
 }
 
+static void migration_file_release(QEMUFile **file)
+{
+    MigrationState *ms = migrate_get_current();
+    QEMUFile *tmp;
+
+    /*
+     * Reset the pointer before releasing it to avoid holding the lock
+     * for too long.
+     */
+    WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
+        tmp = *file;
+        *file = NULL;
+    }
+
+    if (tmp) {
+        migration_ioc_unregister_yank_from_file(tmp);
+        qemu_file_shutdown(tmp);
+        qemu_fclose(tmp);
+    }
+}
+
 void migration_incoming_transport_cleanup(MigrationIncomingState *mis)
 {
     if (mis->socket_address_list) {
@@ -1155,8 +1176,6 @@  static void migrate_fd_cleanup(MigrationState *s)
     qemu_savevm_state_cleanup();
 
     if (s->to_dst_file) {
-        QEMUFile *tmp;
-
         trace_migrate_fd_cleanup();
         qemu_mutex_unlock_iothread();
         if (s->migration_thread_running) {
@@ -1166,17 +1185,7 @@  static void migrate_fd_cleanup(MigrationState *s)
         qemu_mutex_lock_iothread();
 
         multifd_save_cleanup();
-        qemu_mutex_lock(&s->qemu_file_lock);
-        tmp = s->to_dst_file;
-        s->to_dst_file = NULL;
-        qemu_mutex_unlock(&s->qemu_file_lock);
-        /*
-         * Close the file handle without the lock to make sure the
-         * critical section won't block for long.
-         */
-        migration_ioc_unregister_yank_from_file(tmp);
-        qemu_file_shutdown(tmp);
-        qemu_fclose(tmp);
+        migration_file_release(&s->to_dst_file);
     }
 
     /*
@@ -1816,39 +1825,6 @@  static int migrate_handle_rp_resume_ack(MigrationState *s, uint32_t value)
     return 0;
 }
 
-/*
- * Release ms->rp_state.from_dst_file (and postcopy_qemufile_src if
- * existed) in a safe way.
- */
-static void migration_release_dst_files(MigrationState *ms)
-{
-    QEMUFile *file;
-
-    WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
-        /*
-         * Reset the from_dst_file pointer first before releasing it, as we
-         * can't block within lock section
-         */
-        file = ms->rp_state.from_dst_file;
-        ms->rp_state.from_dst_file = NULL;
-    }
-
-    /*
-     * Do the same to postcopy fast path socket too if there is.  No
-     * locking needed because this qemufile should only be managed by
-     * return path thread.
-     */
-    if (ms->postcopy_qemufile_src) {
-        migration_ioc_unregister_yank_from_file(ms->postcopy_qemufile_src);
-        qemu_file_shutdown(ms->postcopy_qemufile_src);
-        qemu_fclose(ms->postcopy_qemufile_src);
-        ms->postcopy_qemufile_src = NULL;
-    }
-
-    qemu_file_shutdown(file);
-    qemu_fclose(file);
-}
-
 /*
  * Handles messages sent on the return path towards the source VM
  *
@@ -2048,7 +2024,8 @@  static int await_return_path_close_on_source(MigrationState *ms)
     ret = ms->rp_state.error;
     ms->rp_state.error = false;
 
-    migration_release_dst_files(ms);
+    migration_file_release(&ms->rp_state.from_dst_file);
+    migration_file_release(&ms->postcopy_qemufile_src);
 
     trace_migration_return_path_end_after(ret);
     return ret;
@@ -2504,26 +2481,9 @@  static MigThrError postcopy_pause(MigrationState *s)
     assert(s->state == MIGRATION_STATUS_POSTCOPY_ACTIVE);
 
     while (true) {
-        QEMUFile *file;
-
-        /*
-         * Current channel is possibly broken. Release it.  Note that this is
-         * guaranteed even without lock because to_dst_file should only be
-         * modified by the migration thread.  That also guarantees that the
-         * unregister of yank is safe too without the lock.  It should be safe
-         * even to be within the qemu_file_lock, but we didn't do that to avoid
-         * taking more mutex (yank_lock) within qemu_file_lock.  TL;DR: we make
-         * the qemu_file_lock critical section as small as possible.
-         */
+        /* Current channel is possibly broken. Release it. */
         assert(s->to_dst_file);
-        migration_ioc_unregister_yank_from_file(s->to_dst_file);
-        qemu_mutex_lock(&s->qemu_file_lock);
-        file = s->to_dst_file;
-        s->to_dst_file = NULL;
-        qemu_mutex_unlock(&s->qemu_file_lock);
-
-        qemu_file_shutdown(file);
-        qemu_fclose(file);
+        migration_file_release(&s->to_dst_file);
 
         /*
          * We're already pausing, so ignore any errors on the return
diff --git a/util/yank.c b/util/yank.c
index abf47c346d..4b6afbf589 100644
--- a/util/yank.c
+++ b/util/yank.c
@@ -146,8 +146,6 @@  void yank_unregister_function(const YankInstance *instance,
             return;
         }
     }
-
-    abort();
 }
 
 void qmp_yank(YankInstanceList *instances,