diff mbox series

[13/14] migration: Use migrate_has_error() in close_return_path_on_source()

Message ID 20240207133347.1115903-14-clg@redhat.com
State New
Headers show
Series migration: Improve error reporting | expand

Commit Message

Cédric Le Goater Feb. 7, 2024, 1:33 p.m. UTC
close_return_path_on_source() retrieves the migration error from the
the QEMUFile '->to_dst_file' to know if a shutdown is required. This
shutdown is required to exit the return-path thread. However, in
migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling
close_return_path_on_source() and the shutdown is never performed,
leaving the source and destination waiting for an event to occur.

Avoid relying on '->to_dst_file' and use migrate_has_error() instead.

Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>
---
 migration/migration.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Peter Xu Feb. 8, 2024, 5:52 a.m. UTC | #1
On Wed, Feb 07, 2024 at 02:33:46PM +0100, Cédric Le Goater wrote:
> close_return_path_on_source() retrieves the migration error from the
> the QEMUFile '->to_dst_file' to know if a shutdown is required. This
> shutdown is required to exit the return-path thread. However, in
> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling
> close_return_path_on_source() and the shutdown is never performed,
> leaving the source and destination waiting for an event to occur.
> 
> Avoid relying on '->to_dst_file' and use migrate_has_error() instead.
> 
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Cédric Le Goater <clg@redhat.com>

Reviewed-by: Peter Xu <peterx@redhat.com>
Fabiano Rosas Feb. 8, 2024, 1:07 p.m. UTC | #2
Cédric Le Goater <clg@redhat.com> writes:

> close_return_path_on_source() retrieves the migration error from the
> the QEMUFile '->to_dst_file' to know if a shutdown is required. This
> shutdown is required to exit the return-path thread. However, in
> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling
> close_return_path_on_source() and the shutdown is never performed,
> leaving the source and destination waiting for an event to occur.
>
> Avoid relying on '->to_dst_file' and use migrate_has_error() instead.
>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Cédric Le Goater <clg@redhat.com>
> ---
>  migration/migration.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms)
>       * cause it to unblock if it's stuck waiting for the destination.
>       */
>      WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
> -        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
> -            qemu_file_get_error(ms->to_dst_file)) {
> +        if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
>              qemu_file_shutdown(ms->rp_state.from_dst_file);
>          }
>      }

Hm, maybe Peter can help defend this, but this assumes that every
function that takes an 'f' and sets the file error also sets
migrate_set_error(). I'm not sure we have determined that, have we?
Cédric Le Goater Feb. 8, 2024, 1:45 p.m. UTC | #3
On 2/8/24 14:07, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
> 
>> close_return_path_on_source() retrieves the migration error from the
>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This
>> shutdown is required to exit the return-path thread. However, in
>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling
>> close_return_path_on_source() and the shutdown is never performed,
>> leaving the source and destination waiting for an event to occur.
>>
>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead.
>>
>> Suggested-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>> ---
>>   migration/migration.c | 3 +--
>>   1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms)
>>        * cause it to unblock if it's stuck waiting for the destination.
>>        */
>>       WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>> -        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>> -            qemu_file_get_error(ms->to_dst_file)) {
>> +        if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
>>               qemu_file_shutdown(ms->rp_state.from_dst_file);
>>           }
>>       }
> 
> Hm, maybe Peter can help defend this, but this assumes that every
> function that takes an 'f' and sets the file error also sets
> migrate_set_error(). I'm not sure we have determined that, have we?

How could we check all the code path ? I agree it is difficult when
looking at the code :/

Thanks,

C.
Fabiano Rosas Feb. 8, 2024, 1:57 p.m. UTC | #4
Cédric Le Goater <clg@redhat.com> writes:

> On 2/8/24 14:07, Fabiano Rosas wrote:
>> Cédric Le Goater <clg@redhat.com> writes:
>> 
>>> close_return_path_on_source() retrieves the migration error from the
>>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This
>>> shutdown is required to exit the return-path thread. However, in
>>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling
>>> close_return_path_on_source() and the shutdown is never performed,
>>> leaving the source and destination waiting for an event to occur.
>>>
>>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead.
>>>
>>> Suggested-by: Peter Xu <peterx@redhat.com>
>>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>>> ---
>>>   migration/migration.c | 3 +--
>>>   1 file changed, 1 insertion(+), 2 deletions(-)
>>>
>>> diff --git a/migration/migration.c b/migration/migration.c
>>> index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644
>>> --- a/migration/migration.c
>>> +++ b/migration/migration.c
>>> @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms)
>>>        * cause it to unblock if it's stuck waiting for the destination.
>>>        */
>>>       WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>>> -        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>>> -            qemu_file_get_error(ms->to_dst_file)) {
>>> +        if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
>>>               qemu_file_shutdown(ms->rp_state.from_dst_file);
>>>           }
>>>       }
>> 
>> Hm, maybe Peter can help defend this, but this assumes that every
>> function that takes an 'f' and sets the file error also sets
>> migrate_set_error(). I'm not sure we have determined that, have we?
>
> How could we check all the code path ? I agree it is difficult when
> looking at the code :/

It would help if the thing wasn't called 'f' for the most part of the
code to begin with.

Whenever there's a file error at to_dst_file there's the chance that the
rp_state.from_dst_file got stuck. So we cannot ignore the file error.

Would it work if we checked it earlier during cleanup as you did
previously and then set the migration error?
Cédric Le Goater Feb. 12, 2024, 1:03 p.m. UTC | #5
On 2/8/24 14:57, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
> 
>> On 2/8/24 14:07, Fabiano Rosas wrote:
>>> Cédric Le Goater <clg@redhat.com> writes:
>>>
>>>> close_return_path_on_source() retrieves the migration error from the
>>>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This
>>>> shutdown is required to exit the return-path thread. However, in
>>>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling
>>>> close_return_path_on_source() and the shutdown is never performed,
>>>> leaving the source and destination waiting for an event to occur.
>>>>
>>>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead.
>>>>
>>>> Suggested-by: Peter Xu <peterx@redhat.com>
>>>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>>>> ---
>>>>    migration/migration.c | 3 +--
>>>>    1 file changed, 1 insertion(+), 2 deletions(-)
>>>>
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms)
>>>>         * cause it to unblock if it's stuck waiting for the destination.
>>>>         */
>>>>        WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>>>> -        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>>>> -            qemu_file_get_error(ms->to_dst_file)) {
>>>> +        if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
>>>>                qemu_file_shutdown(ms->rp_state.from_dst_file);
>>>>            }
>>>>        }
>>>
>>> Hm, maybe Peter can help defend this, but this assumes that every
>>> function that takes an 'f' and sets the file error also sets
>>> migrate_set_error(). I'm not sure we have determined that, have we?
>>
>> How could we check all the code path ? I agree it is difficult when
>> looking at the code :/
> 
> It would help if the thing wasn't called 'f' for the most part of the
> code to begin with.
> 
> Whenever there's a file error at to_dst_file there's the chance that the
> rp_state.from_dst_file got stuck. So we cannot ignore the file error.
> 
> Would it work if we checked it earlier during cleanup as you did
> previously and then set the migration error?

Do you mean doing something similar to what is done in
source_return_path_thread() ?

         if (qemu_file_get_error(s->to_dst_file)) {
             qemu_file_get_error_obj(s->to_dst_file, &err);
     	    if (err) {
         	migrate_set_error(ms, err);
         	error_free(err);
	...

Yes. That would be safer I think.


Nevertheless, I am struggling to understand how qemu_file_set_error()
and migrate_set_error() fit together. I was expecting some kind of
synchronization  routine but there isn't it seems. Are they completely
orthogonal ? when should we use these routines and when not ?

My initial goal was to modify some of the memory handlers (log_global*)
and migration handlers to propagate errors at the QMP level and them
report to the management layer. This is growing in something bigger
and currently, I don't find a good approach to the problem.

The last two patches of this series try to fix the return-path thread
termination. Let's keep that for after.

Thanks,

C.
Fabiano Rosas Feb. 14, 2024, 4 p.m. UTC | #6
Cédric Le Goater <clg@redhat.com> writes:

> On 2/8/24 14:57, Fabiano Rosas wrote:
>> Cédric Le Goater <clg@redhat.com> writes:
>> 
>>> On 2/8/24 14:07, Fabiano Rosas wrote:
>>>> Cédric Le Goater <clg@redhat.com> writes:
>>>>
>>>>> close_return_path_on_source() retrieves the migration error from the
>>>>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This
>>>>> shutdown is required to exit the return-path thread. However, in
>>>>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling
>>>>> close_return_path_on_source() and the shutdown is never performed,
>>>>> leaving the source and destination waiting for an event to occur.
>>>>>
>>>>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead.
>>>>>
>>>>> Suggested-by: Peter Xu <peterx@redhat.com>
>>>>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>>>>> ---
>>>>>    migration/migration.c | 3 +--
>>>>>    1 file changed, 1 insertion(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>> index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644
>>>>> --- a/migration/migration.c
>>>>> +++ b/migration/migration.c
>>>>> @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms)
>>>>>         * cause it to unblock if it's stuck waiting for the destination.
>>>>>         */
>>>>>        WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>>>>> -        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>>>>> -            qemu_file_get_error(ms->to_dst_file)) {
>>>>> +        if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
>>>>>                qemu_file_shutdown(ms->rp_state.from_dst_file);
>>>>>            }
>>>>>        }
>>>>
>>>> Hm, maybe Peter can help defend this, but this assumes that every
>>>> function that takes an 'f' and sets the file error also sets
>>>> migrate_set_error(). I'm not sure we have determined that, have we?
>>>
>>> How could we check all the code path ? I agree it is difficult when
>>> looking at the code :/
>> 
>> It would help if the thing wasn't called 'f' for the most part of the
>> code to begin with.
>> 
>> Whenever there's a file error at to_dst_file there's the chance that the
>> rp_state.from_dst_file got stuck. So we cannot ignore the file error.
>> 
>> Would it work if we checked it earlier during cleanup as you did
>> previously and then set the migration error?
>
> Do you mean doing something similar to what is done in
> source_return_path_thread() ?
>
>          if (qemu_file_get_error(s->to_dst_file)) {
>              qemu_file_get_error_obj(s->to_dst_file, &err);
>      	    if (err) {
>          	migrate_set_error(ms, err);
>          	error_free(err);
> 	...
>
> Yes. That would be safer I think.

Yes, something like that.

I wish we could make that return path cleanup more deterministic, but
currently it's just: "if something hangs, call shutdown()". We don't
have a way to detect a hang, we just look at the file error and hope it
works.

A crucial aspect here is that calling qemu_file_shutdown() itself sets
the file error. So there's not even a guarantee that an error is
actually an error.

>
>
> Nevertheless, I am struggling to understand how qemu_file_set_error()
> and migrate_set_error() fit together. I was expecting some kind of
> synchronization  routine but there isn't it seems. Are they completely
> orthogonal ? when should we use these routines and when not ?

We're trying to phase out the QEMUFile usage altogether. One thing that
is getting in the way is this dependency on the qemu_file_*_error
functions.

While we're not there yet, a good pattern is to find a
qemu_file_set|get_error() pair and replace it with
migrate_set|has_error(). Unfortunately the return path does not fit in
this, because we don't have a matching qemu_file_set_error, it could be
anywhere. As I said above, we're using that error as a heuristic for: "a
recvmsg() might be hanging".

>
> My initial goal was to modify some of the memory handlers (log_global*)
> and migration handlers to propagate errors at the QMP level and them
> report to the management layer. This is growing in something bigger
> and currently, I don't find a good approach to the problem.
>
> The last two patches of this series try to fix the return-path thread
> termination. Let's keep that for after.

I'll try to figure that out. I see you provided a reproducer.

>
> Thanks,
>
> C.
Cédric Le Goater Feb. 16, 2024, 3:17 p.m. UTC | #7
On 2/14/24 17:00, Fabiano Rosas wrote:
> Cédric Le Goater <clg@redhat.com> writes:
> 
>> On 2/8/24 14:57, Fabiano Rosas wrote:
>>> Cédric Le Goater <clg@redhat.com> writes:
>>>
>>>> On 2/8/24 14:07, Fabiano Rosas wrote:
>>>>> Cédric Le Goater <clg@redhat.com> writes:
>>>>>
>>>>>> close_return_path_on_source() retrieves the migration error from the
>>>>>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This
>>>>>> shutdown is required to exit the return-path thread. However, in
>>>>>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling
>>>>>> close_return_path_on_source() and the shutdown is never performed,
>>>>>> leaving the source and destination waiting for an event to occur.
>>>>>>
>>>>>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead.
>>>>>>
>>>>>> Suggested-by: Peter Xu <peterx@redhat.com>
>>>>>> Signed-off-by: Cédric Le Goater <clg@redhat.com>
>>>>>> ---
>>>>>>     migration/migration.c | 3 +--
>>>>>>     1 file changed, 1 insertion(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>>> index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644
>>>>>> --- a/migration/migration.c
>>>>>> +++ b/migration/migration.c
>>>>>> @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms)
>>>>>>          * cause it to unblock if it's stuck waiting for the destination.
>>>>>>          */
>>>>>>         WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
>>>>>> -        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
>>>>>> -            qemu_file_get_error(ms->to_dst_file)) {
>>>>>> +        if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
>>>>>>                 qemu_file_shutdown(ms->rp_state.from_dst_file);
>>>>>>             }
>>>>>>         }
>>>>>
>>>>> Hm, maybe Peter can help defend this, but this assumes that every
>>>>> function that takes an 'f' and sets the file error also sets
>>>>> migrate_set_error(). I'm not sure we have determined that, have we?
>>>>
>>>> How could we check all the code path ? I agree it is difficult when
>>>> looking at the code :/
>>>
>>> It would help if the thing wasn't called 'f' for the most part of the
>>> code to begin with.
>>>
>>> Whenever there's a file error at to_dst_file there's the chance that the
>>> rp_state.from_dst_file got stuck. So we cannot ignore the file error.
>>>
>>> Would it work if we checked it earlier during cleanup as you did
>>> previously and then set the migration error?
>>
>> Do you mean doing something similar to what is done in
>> source_return_path_thread() ?
>>
>>           if (qemu_file_get_error(s->to_dst_file)) {
>>               qemu_file_get_error_obj(s->to_dst_file, &err);
>>       	    if (err) {
>>           	migrate_set_error(ms, err);
>>           	error_free(err);
>> 	...
>>
>> Yes. That would be safer I think.
> 
> Yes, something like that.
> 
> I wish we could make that return path cleanup more deterministic, but
> currently it's just: "if something hangs, call shutdown()". We don't
> have a way to detect a hang, we just look at the file error and hope it
> works.
> 
> A crucial aspect here is that calling qemu_file_shutdown() itself sets
> the file error. So there's not even a guarantee that an error is
> actually an error.
> 
>>
>>
>> Nevertheless, I am struggling to understand how qemu_file_set_error()
>> and migrate_set_error() fit together. I was expecting some kind of
>> synchronization  routine but there isn't it seems. Are they completely
>> orthogonal ? when should we use these routines and when not ?
> 
> We're trying to phase out the QEMUFile usage altogether. One thing that
> is getting in the way is this dependency on the qemu_file_*_error
> functions.

OK. the other changes, which add an Error** argument to various handlers,
reduce the use of qemu_file_*_error routines in VFIO.

> While we're not there yet, a good pattern is to find a
> qemu_file_set|get_error() pair and replace it with
> migrate_set|has_error(). 

OK. I will keep that in mind for the other changes.

Thanks,

C.



> Unfortunately the return path does not fit in
> this, because we don't have a matching qemu_file_set_error, it could be
> anywhere. As I said above, we're using that error as a heuristic for: "a
> recvmsg() might be hanging".
> 
>>
>> My initial goal was to modify some of the memory handlers (log_global*)
>> and migration handlers to propagate errors at the QMP level and them
>> report to the management layer. This is growing in something bigger
>> and currently, I don't find a good approach to the problem.
>>
>> The last two patches of this series try to fix the return-path thread
>> termination. Let's keep that for after.
> 
> I'll try to figure that out. I see you provided a reproducer.
> 
>>
>> Thanks,
>>
>> C.
>
Peter Xu Feb. 23, 2024, 4:14 a.m. UTC | #8
On Thu, Feb 08, 2024 at 10:07:44AM -0300, Fabiano Rosas wrote:
> > diff --git a/migration/migration.c b/migration/migration.c
> > index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms)
> >       * cause it to unblock if it's stuck waiting for the destination.
> >       */
> >      WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
> > -        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
> > -            qemu_file_get_error(ms->to_dst_file)) {
> > +        if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
> >              qemu_file_shutdown(ms->rp_state.from_dst_file);
> >          }
> >      }
> 
> Hm, maybe Peter can help defend this, but this assumes that every
> function that takes an 'f' and sets the file error also sets
> migrate_set_error(). I'm not sure we have determined that, have we?

[apologies on getting back to this thread late.. I saw there's yet another
 proposal in the other email, will look at that soon]

I think that should be set, or otherwise we lose an error?  After all
s->error is the only thing we report, if there is a qemufile error that is
not reported into s->error it can be lost then.

On src QEMU we have both migration thread and return path thread.  For
migration thread the file error should always be collected by
migration_detect_error() by the qemu_file_get_error_obj_any() (it also
looks after postcopy_qemufile_src).  For return path thread it's always
collected when the loop quits.

Would migrate_has_error() be safer than qemu_file_get_error() in some
cases?  I'm considering when there is an error outside of qemufile itself,
that's the case where qemu_file_get_error(ms->to_dst_file) can return
false, however we may still need a kick to the from_dst_file?
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2372,8 +2372,7 @@  static bool close_return_path_on_source(MigrationState *ms)
      * cause it to unblock if it's stuck waiting for the destination.
      */
     WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) {
-        if (ms->to_dst_file && ms->rp_state.from_dst_file &&
-            qemu_file_get_error(ms->to_dst_file)) {
+        if (migrate_has_error(ms) && ms->rp_state.from_dst_file) {
             qemu_file_shutdown(ms->rp_state.from_dst_file);
         }
     }