diff mbox series

[1/3] migration: Stop marking RP bad after shutdown

Message ID 20230728121516.16258-2-farosas@suse.de
State New
Headers show
Series Fix segfault on migration return path | expand

Commit Message

Fabiano Rosas July 28, 2023, 12:15 p.m. UTC
When waiting for the return path (RP) thread to finish, there is
really nothing wrong in the RP if the destination end of the migration
stops responding, leaving it stuck.

Stop returning an error at that point and leave it to other parts of
the code to catch. One such part is the very next routine run by
migration_completion() which checks 'to_dst_file' for an error and fails
the migration. Another is the RP thread itself when the recvmsg()
returns an error.

With this we stop marking RP bad from outside of the thread and can
reuse await_return_path_close_on_source() in the next patches to wait
on the thread during a paused migration.

Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
 migration/migration.c | 1 -
 1 file changed, 1 deletion(-)

Comments

Peter Xu July 28, 2023, 9:37 p.m. UTC | #1
On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote:
> When waiting for the return path (RP) thread to finish, there is
> really nothing wrong in the RP if the destination end of the migration
> stops responding, leaving it stuck.
> 
> Stop returning an error at that point and leave it to other parts of
> the code to catch. One such part is the very next routine run by
> migration_completion() which checks 'to_dst_file' for an error and fails
> the migration. Another is the RP thread itself when the recvmsg()
> returns an error.
> 
> With this we stop marking RP bad from outside of the thread and can
> reuse await_return_path_close_on_source() in the next patches to wait
> on the thread during a paused migration.
> 
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
>  migration/migration.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 91bba630a8..051067f8c5 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2049,7 +2049,6 @@ static int await_return_path_close_on_source(MigrationState *ms)
>           * waiting for the destination.
>           */
>          qemu_file_shutdown(ms->rp_state.from_dst_file);
> -        mark_source_rp_bad(ms);
>      }
>      trace_await_return_path_close_on_source_joining();
>      qemu_thread_join(&ms->rp_state.rp_thread);

The retval of await_return_path_close_on_source() relies on
ms->rp_state.error.  If mark_source_rp_bad() is dropped, is it possible
that it'll start to return succeed where it used to return failure?

Maybe not a big deal: I see migration_completion() also has another
qemu_file_get_error() later to catch errors, but I don't know how solid
that is.

I think as long as after this patch we can fail properly on e.g. network
failures for precopy when cap return-path=on, then we should be good.
Fabiano Rosas July 31, 2023, 9:02 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote:
>> When waiting for the return path (RP) thread to finish, there is
>> really nothing wrong in the RP if the destination end of the migration
>> stops responding, leaving it stuck.
>> 
>> Stop returning an error at that point and leave it to other parts of
>> the code to catch. One such part is the very next routine run by
>> migration_completion() which checks 'to_dst_file' for an error and fails
>> the migration. Another is the RP thread itself when the recvmsg()
>> returns an error.
>> 
>> With this we stop marking RP bad from outside of the thread and can
>> reuse await_return_path_close_on_source() in the next patches to wait
>> on the thread during a paused migration.
>> 
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>>  migration/migration.c | 1 -
>>  1 file changed, 1 deletion(-)
>> 
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 91bba630a8..051067f8c5 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2049,7 +2049,6 @@ static int await_return_path_close_on_source(MigrationState *ms)
>>           * waiting for the destination.
>>           */
>>          qemu_file_shutdown(ms->rp_state.from_dst_file);
>> -        mark_source_rp_bad(ms);
>>      }
>>      trace_await_return_path_close_on_source_joining();
>>      qemu_thread_join(&ms->rp_state.rp_thread);
>
> The retval of await_return_path_close_on_source() relies on
> ms->rp_state.error.  If mark_source_rp_bad() is dropped, is it possible
> that it'll start to return succeed where it used to return failure?

Yep, as described in the commit message, I think it's ok to do that. The
critical part is doing the shutdown. Other instances of
mark_source_rp_bad() continue existing and we continue returning
rp_state.error.

>
> Maybe not a big deal: I see migration_completion() also has another
> qemu_file_get_error() later to catch errors, but I don't know how solid
> that is.

That is the instance I refer to in the commit message. At
await_return_path_close_on_source() we only call mark_source_rp_bad() if
to_dst_file has an error. That will be caught by this
qemu_file_get_error() anyway.
Fabiano Rosas July 31, 2023, 9:13 p.m. UTC | #3
Fabiano Rosas <farosas@suse.de> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Fri, Jul 28, 2023 at 09:15:14AM -0300, Fabiano Rosas wrote:
>>> When waiting for the return path (RP) thread to finish, there is
>>> really nothing wrong in the RP if the destination end of the migration
>>> stops responding, leaving it stuck.
>>> 
>>> Stop returning an error at that point and leave it to other parts of
>>> the code to catch. One such part is the very next routine run by
>>> migration_completion() which checks 'to_dst_file' for an error and fails
>>> the migration. Another is the RP thread itself when the recvmsg()
>>> returns an error.
>>> 
>>> With this we stop marking RP bad from outside of the thread and can
>>> reuse await_return_path_close_on_source() in the next patches to wait
>>> on the thread during a paused migration.
>>> 
>>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>>> ---
>>>  migration/migration.c | 1 -
>>>  1 file changed, 1 deletion(-)
>>> 
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index 91bba630a8..051067f8c5 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -2049,7 +2049,6 @@ static int await_return_path_close_on_source(MigrationState *ms)
>>>           * waiting for the destination.
>>>           */
>>>          qemu_file_shutdown(ms->rp_state.from_dst_file);
>>> -        mark_source_rp_bad(ms);
>>>      }
>>>      trace_await_return_path_close_on_source_joining();
>>>      qemu_thread_join(&ms->rp_state.rp_thread);
>>
>> The retval of await_return_path_close_on_source() relies on
>> ms->rp_state.error.  If mark_source_rp_bad() is dropped, is it possible
>> that it'll start to return succeed where it used to return failure?
>
> Yep, as described in the commit message, I think it's ok to do that. The
> critical part is doing the shutdown. Other instances of
> mark_source_rp_bad() continue existing and we continue returning
> rp_state.error.
>
>>
>> Maybe not a big deal: I see migration_completion() also has another
>> qemu_file_get_error() later to catch errors, but I don't know how solid
>> that is.
>
> That is the instance I refer to in the commit message. At
> await_return_path_close_on_source() we only call mark_source_rp_bad() if
> to_dst_file has an error. That will be caught by this
> qemu_file_get_error() anyway.

Actually, I can do better, I can merge this shutdown() into
migration_completion(). Then this dependency becomes explicit. Since you
suggested moving await_return_path_close_on_source() into
postcopy_pause(), it doesn't make sense to check to_dst_file anymore,
because when pausing we clear that file.
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index 91bba630a8..051067f8c5 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2049,7 +2049,6 @@  static int await_return_path_close_on_source(MigrationState *ms)
          * waiting for the destination.
          */
         qemu_file_shutdown(ms->rp_state.from_dst_file);
-        mark_source_rp_bad(ms);
     }
     trace_await_return_path_close_on_source_joining();
     qemu_thread_join(&ms->rp_state.rp_thread);