Message ID | 20240207133347.1115903-14-clg@redhat.com |
---|---|
State | New |
Headers | show |
Series | migration: Improve error reporting | expand |
On Wed, Feb 07, 2024 at 02:33:46PM +0100, Cédric Le Goater wrote: > close_return_path_on_source() retrieves the migration error from the > the QEMUFile '->to_dst_file' to know if a shutdown is required. This > shutdown is required to exit the return-path thread. However, in > migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling > close_return_path_on_source() and the shutdown is never performed, > leaving the source and destination waiting for an event to occur. > > Avoid relying on '->to_dst_file' and use migrate_has_error() instead. > > Suggested-by: Peter Xu <peterx@redhat.com> > Signed-off-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Peter Xu <peterx@redhat.com>
Cédric Le Goater <clg@redhat.com> writes: > close_return_path_on_source() retrieves the migration error from the > the QEMUFile '->to_dst_file' to know if a shutdown is required. This > shutdown is required to exit the return-path thread. However, in > migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling > close_return_path_on_source() and the shutdown is never performed, > leaving the source and destination waiting for an event to occur. > > Avoid relying on '->to_dst_file' and use migrate_has_error() instead. > > Suggested-by: Peter Xu <peterx@redhat.com> > Signed-off-by: Cédric Le Goater <clg@redhat.com> > --- > migration/migration.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/migration/migration.c b/migration/migration.c > index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms) > * cause it to unblock if it's stuck waiting for the destination. > */ > WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) { > - if (ms->to_dst_file && ms->rp_state.from_dst_file && > - qemu_file_get_error(ms->to_dst_file)) { > + if (migrate_has_error(ms) && ms->rp_state.from_dst_file) { > qemu_file_shutdown(ms->rp_state.from_dst_file); > } > } Hm, maybe Peter can help defend this, but this assumes that every function that takes an 'f' and sets the file error also sets migrate_set_error(). I'm not sure we have determined that, have we?
On 2/8/24 14:07, Fabiano Rosas wrote: > Cédric Le Goater <clg@redhat.com> writes: > >> close_return_path_on_source() retrieves the migration error from the >> the QEMUFile '->to_dst_file' to know if a shutdown is required. This >> shutdown is required to exit the return-path thread. However, in >> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling >> close_return_path_on_source() and the shutdown is never performed, >> leaving the source and destination waiting for an event to occur. >> >> Avoid relying on '->to_dst_file' and use migrate_has_error() instead. >> >> Suggested-by: Peter Xu <peterx@redhat.com> >> Signed-off-by: Cédric Le Goater <clg@redhat.com> >> --- >> migration/migration.c | 3 +-- >> 1 file changed, 1 insertion(+), 2 deletions(-) >> >> diff --git a/migration/migration.c b/migration/migration.c >> index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644 >> --- a/migration/migration.c >> +++ b/migration/migration.c >> @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms) >> * cause it to unblock if it's stuck waiting for the destination. >> */ >> WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) { >> - if (ms->to_dst_file && ms->rp_state.from_dst_file && >> - qemu_file_get_error(ms->to_dst_file)) { >> + if (migrate_has_error(ms) && ms->rp_state.from_dst_file) { >> qemu_file_shutdown(ms->rp_state.from_dst_file); >> } >> } > > Hm, maybe Peter can help defend this, but this assumes that every > function that takes an 'f' and sets the file error also sets > migrate_set_error(). I'm not sure we have determined that, have we? How could we check all the code path ? I agree it is difficult when looking at the code :/ Thanks, C.
Cédric Le Goater <clg@redhat.com> writes: > On 2/8/24 14:07, Fabiano Rosas wrote: >> Cédric Le Goater <clg@redhat.com> writes: >> >>> close_return_path_on_source() retrieves the migration error from the >>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This >>> shutdown is required to exit the return-path thread. However, in >>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling >>> close_return_path_on_source() and the shutdown is never performed, >>> leaving the source and destination waiting for an event to occur. >>> >>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead. >>> >>> Suggested-by: Peter Xu <peterx@redhat.com> >>> Signed-off-by: Cédric Le Goater <clg@redhat.com> >>> --- >>> migration/migration.c | 3 +-- >>> 1 file changed, 1 insertion(+), 2 deletions(-) >>> >>> diff --git a/migration/migration.c b/migration/migration.c >>> index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644 >>> --- a/migration/migration.c >>> +++ b/migration/migration.c >>> @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms) >>> * cause it to unblock if it's stuck waiting for the destination. >>> */ >>> WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) { >>> - if (ms->to_dst_file && ms->rp_state.from_dst_file && >>> - qemu_file_get_error(ms->to_dst_file)) { >>> + if (migrate_has_error(ms) && ms->rp_state.from_dst_file) { >>> qemu_file_shutdown(ms->rp_state.from_dst_file); >>> } >>> } >> >> Hm, maybe Peter can help defend this, but this assumes that every >> function that takes an 'f' and sets the file error also sets >> migrate_set_error(). I'm not sure we have determined that, have we? > > How could we check all the code path ? I agree it is difficult when > looking at the code :/ It would help if the thing wasn't called 'f' for the most part of the code to begin with. Whenever there's a file error at to_dst_file there's the chance that the rp_state.from_dst_file got stuck. So we cannot ignore the file error. Would it work if we checked it earlier during cleanup as you did previously and then set the migration error?
On 2/8/24 14:57, Fabiano Rosas wrote: > Cédric Le Goater <clg@redhat.com> writes: > >> On 2/8/24 14:07, Fabiano Rosas wrote: >>> Cédric Le Goater <clg@redhat.com> writes: >>> >>>> close_return_path_on_source() retrieves the migration error from the >>>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This >>>> shutdown is required to exit the return-path thread. However, in >>>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling >>>> close_return_path_on_source() and the shutdown is never performed, >>>> leaving the source and destination waiting for an event to occur. >>>> >>>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead. >>>> >>>> Suggested-by: Peter Xu <peterx@redhat.com> >>>> Signed-off-by: Cédric Le Goater <clg@redhat.com> >>>> --- >>>> migration/migration.c | 3 +-- >>>> 1 file changed, 1 insertion(+), 2 deletions(-) >>>> >>>> diff --git a/migration/migration.c b/migration/migration.c >>>> index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644 >>>> --- a/migration/migration.c >>>> +++ b/migration/migration.c >>>> @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms) >>>> * cause it to unblock if it's stuck waiting for the destination. >>>> */ >>>> WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) { >>>> - if (ms->to_dst_file && ms->rp_state.from_dst_file && >>>> - qemu_file_get_error(ms->to_dst_file)) { >>>> + if (migrate_has_error(ms) && ms->rp_state.from_dst_file) { >>>> qemu_file_shutdown(ms->rp_state.from_dst_file); >>>> } >>>> } >>> >>> Hm, maybe Peter can help defend this, but this assumes that every >>> function that takes an 'f' and sets the file error also sets >>> migrate_set_error(). I'm not sure we have determined that, have we? >> >> How could we check all the code path ? I agree it is difficult when >> looking at the code :/ > > It would help if the thing wasn't called 'f' for the most part of the > code to begin with. > > Whenever there's a file error at to_dst_file there's the chance that the > rp_state.from_dst_file got stuck. So we cannot ignore the file error. > > Would it work if we checked it earlier during cleanup as you did > previously and then set the migration error? Do you mean doing something similar to what is done in source_return_path_thread() ? if (qemu_file_get_error(s->to_dst_file)) { qemu_file_get_error_obj(s->to_dst_file, &err); if (err) { migrate_set_error(ms, err); error_free(err); ... Yes. That would be safer I think. Nevertheless, I am struggling to understand how qemu_file_set_error() and migrate_set_error() fit together. I was expecting some kind of synchronization routine but there isn't it seems. Are they completely orthogonal ? when should we use these routines and when not ? My initial goal was to modify some of the memory handlers (log_global*) and migration handlers to propagate errors at the QMP level and them report to the management layer. This is growing in something bigger and currently, I don't find a good approach to the problem. The last two patches of this series try to fix the return-path thread termination. Let's keep that for after. Thanks, C.
Cédric Le Goater <clg@redhat.com> writes: > On 2/8/24 14:57, Fabiano Rosas wrote: >> Cédric Le Goater <clg@redhat.com> writes: >> >>> On 2/8/24 14:07, Fabiano Rosas wrote: >>>> Cédric Le Goater <clg@redhat.com> writes: >>>> >>>>> close_return_path_on_source() retrieves the migration error from the >>>>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This >>>>> shutdown is required to exit the return-path thread. However, in >>>>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling >>>>> close_return_path_on_source() and the shutdown is never performed, >>>>> leaving the source and destination waiting for an event to occur. >>>>> >>>>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead. >>>>> >>>>> Suggested-by: Peter Xu <peterx@redhat.com> >>>>> Signed-off-by: Cédric Le Goater <clg@redhat.com> >>>>> --- >>>>> migration/migration.c | 3 +-- >>>>> 1 file changed, 1 insertion(+), 2 deletions(-) >>>>> >>>>> diff --git a/migration/migration.c b/migration/migration.c >>>>> index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644 >>>>> --- a/migration/migration.c >>>>> +++ b/migration/migration.c >>>>> @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms) >>>>> * cause it to unblock if it's stuck waiting for the destination. >>>>> */ >>>>> WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) { >>>>> - if (ms->to_dst_file && ms->rp_state.from_dst_file && >>>>> - qemu_file_get_error(ms->to_dst_file)) { >>>>> + if (migrate_has_error(ms) && ms->rp_state.from_dst_file) { >>>>> qemu_file_shutdown(ms->rp_state.from_dst_file); >>>>> } >>>>> } >>>> >>>> Hm, maybe Peter can help defend this, but this assumes that every >>>> function that takes an 'f' and sets the file error also sets >>>> migrate_set_error(). I'm not sure we have determined that, have we? >>> >>> How could we check all the code path ? I agree it is difficult when >>> looking at the code :/ >> >> It would help if the thing wasn't called 'f' for the most part of the >> code to begin with. >> >> Whenever there's a file error at to_dst_file there's the chance that the >> rp_state.from_dst_file got stuck. So we cannot ignore the file error. >> >> Would it work if we checked it earlier during cleanup as you did >> previously and then set the migration error? > > Do you mean doing something similar to what is done in > source_return_path_thread() ? > > if (qemu_file_get_error(s->to_dst_file)) { > qemu_file_get_error_obj(s->to_dst_file, &err); > if (err) { > migrate_set_error(ms, err); > error_free(err); > ... > > Yes. That would be safer I think. Yes, something like that. I wish we could make that return path cleanup more deterministic, but currently it's just: "if something hangs, call shutdown()". We don't have a way to detect a hang, we just look at the file error and hope it works. A crucial aspect here is that calling qemu_file_shutdown() itself sets the file error. So there's not even a guarantee that an error is actually an error. > > > Nevertheless, I am struggling to understand how qemu_file_set_error() > and migrate_set_error() fit together. I was expecting some kind of > synchronization routine but there isn't it seems. Are they completely > orthogonal ? when should we use these routines and when not ? We're trying to phase out the QEMUFile usage altogether. One thing that is getting in the way is this dependency on the qemu_file_*_error functions. While we're not there yet, a good pattern is to find a qemu_file_set|get_error() pair and replace it with migrate_set|has_error(). Unfortunately the return path does not fit in this, because we don't have a matching qemu_file_set_error, it could be anywhere. As I said above, we're using that error as a heuristic for: "a recvmsg() might be hanging". > > My initial goal was to modify some of the memory handlers (log_global*) > and migration handlers to propagate errors at the QMP level and them > report to the management layer. This is growing in something bigger > and currently, I don't find a good approach to the problem. > > The last two patches of this series try to fix the return-path thread > termination. Let's keep that for after. I'll try to figure that out. I see you provided a reproducer. > > Thanks, > > C.
On 2/14/24 17:00, Fabiano Rosas wrote: > Cédric Le Goater <clg@redhat.com> writes: > >> On 2/8/24 14:57, Fabiano Rosas wrote: >>> Cédric Le Goater <clg@redhat.com> writes: >>> >>>> On 2/8/24 14:07, Fabiano Rosas wrote: >>>>> Cédric Le Goater <clg@redhat.com> writes: >>>>> >>>>>> close_return_path_on_source() retrieves the migration error from the >>>>>> the QEMUFile '->to_dst_file' to know if a shutdown is required. This >>>>>> shutdown is required to exit the return-path thread. However, in >>>>>> migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling >>>>>> close_return_path_on_source() and the shutdown is never performed, >>>>>> leaving the source and destination waiting for an event to occur. >>>>>> >>>>>> Avoid relying on '->to_dst_file' and use migrate_has_error() instead. >>>>>> >>>>>> Suggested-by: Peter Xu <peterx@redhat.com> >>>>>> Signed-off-by: Cédric Le Goater <clg@redhat.com> >>>>>> --- >>>>>> migration/migration.c | 3 +-- >>>>>> 1 file changed, 1 insertion(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/migration/migration.c b/migration/migration.c >>>>>> index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644 >>>>>> --- a/migration/migration.c >>>>>> +++ b/migration/migration.c >>>>>> @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms) >>>>>> * cause it to unblock if it's stuck waiting for the destination. >>>>>> */ >>>>>> WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) { >>>>>> - if (ms->to_dst_file && ms->rp_state.from_dst_file && >>>>>> - qemu_file_get_error(ms->to_dst_file)) { >>>>>> + if (migrate_has_error(ms) && ms->rp_state.from_dst_file) { >>>>>> qemu_file_shutdown(ms->rp_state.from_dst_file); >>>>>> } >>>>>> } >>>>> >>>>> Hm, maybe Peter can help defend this, but this assumes that every >>>>> function that takes an 'f' and sets the file error also sets >>>>> migrate_set_error(). I'm not sure we have determined that, have we? >>>> >>>> How could we check all the code path ? I agree it is difficult when >>>> looking at the code :/ >>> >>> It would help if the thing wasn't called 'f' for the most part of the >>> code to begin with. >>> >>> Whenever there's a file error at to_dst_file there's the chance that the >>> rp_state.from_dst_file got stuck. So we cannot ignore the file error. >>> >>> Would it work if we checked it earlier during cleanup as you did >>> previously and then set the migration error? >> >> Do you mean doing something similar to what is done in >> source_return_path_thread() ? >> >> if (qemu_file_get_error(s->to_dst_file)) { >> qemu_file_get_error_obj(s->to_dst_file, &err); >> if (err) { >> migrate_set_error(ms, err); >> error_free(err); >> ... >> >> Yes. That would be safer I think. > > Yes, something like that. > > I wish we could make that return path cleanup more deterministic, but > currently it's just: "if something hangs, call shutdown()". We don't > have a way to detect a hang, we just look at the file error and hope it > works. > > A crucial aspect here is that calling qemu_file_shutdown() itself sets > the file error. So there's not even a guarantee that an error is > actually an error. > >> >> >> Nevertheless, I am struggling to understand how qemu_file_set_error() >> and migrate_set_error() fit together. I was expecting some kind of >> synchronization routine but there isn't it seems. Are they completely >> orthogonal ? when should we use these routines and when not ? > > We're trying to phase out the QEMUFile usage altogether. One thing that > is getting in the way is this dependency on the qemu_file_*_error > functions. OK. the other changes, which add an Error** argument to various handlers, reduce the use of qemu_file_*_error routines in VFIO. > While we're not there yet, a good pattern is to find a > qemu_file_set|get_error() pair and replace it with > migrate_set|has_error(). OK. I will keep that in mind for the other changes. Thanks, C. > Unfortunately the return path does not fit in > this, because we don't have a matching qemu_file_set_error, it could be > anywhere. As I said above, we're using that error as a heuristic for: "a > recvmsg() might be hanging". > >> >> My initial goal was to modify some of the memory handlers (log_global*) >> and migration handlers to propagate errors at the QMP level and them >> report to the management layer. This is growing in something bigger >> and currently, I don't find a good approach to the problem. >> >> The last two patches of this series try to fix the return-path thread >> termination. Let's keep that for after. > > I'll try to figure that out. I see you provided a reproducer. > >> >> Thanks, >> >> C. >
On Thu, Feb 08, 2024 at 10:07:44AM -0300, Fabiano Rosas wrote: > > diff --git a/migration/migration.c b/migration/migration.c > > index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644 > > --- a/migration/migration.c > > +++ b/migration/migration.c > > @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms) > > * cause it to unblock if it's stuck waiting for the destination. > > */ > > WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) { > > - if (ms->to_dst_file && ms->rp_state.from_dst_file && > > - qemu_file_get_error(ms->to_dst_file)) { > > + if (migrate_has_error(ms) && ms->rp_state.from_dst_file) { > > qemu_file_shutdown(ms->rp_state.from_dst_file); > > } > > } > > Hm, maybe Peter can help defend this, but this assumes that every > function that takes an 'f' and sets the file error also sets > migrate_set_error(). I'm not sure we have determined that, have we? [apologies on getting back to this thread late.. I saw there's yet another proposal in the other email, will look at that soon] I think that should be set, or otherwise we lose an error? After all s->error is the only thing we report, if there is a qemufile error that is not reported into s->error it can be lost then. On src QEMU we have both migration thread and return path thread. For migration thread the file error should always be collected by migration_detect_error() by the qemu_file_get_error_obj_any() (it also looks after postcopy_qemufile_src). For return path thread it's always collected when the loop quits. Would migrate_has_error() be safer than qemu_file_get_error() in some cases? I'm considering when there is an error outside of qemufile itself, that's the case where qemu_file_get_error(ms->to_dst_file) can return false, however we may still need a kick to the from_dst_file?
diff --git a/migration/migration.c b/migration/migration.c index d5f705ceef4c925589aa49335969672c0d761fa2..5f55af3d7624750ca416c4177781241b3e291e5d 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -2372,8 +2372,7 @@ static bool close_return_path_on_source(MigrationState *ms) * cause it to unblock if it's stuck waiting for the destination. */ WITH_QEMU_LOCK_GUARD(&ms->qemu_file_lock) { - if (ms->to_dst_file && ms->rp_state.from_dst_file && - qemu_file_get_error(ms->to_dst_file)) { + if (migrate_has_error(ms) && ms->rp_state.from_dst_file) { qemu_file_shutdown(ms->rp_state.from_dst_file); } }
close_return_path_on_source() retrieves the migration error from the the QEMUFile '->to_dst_file' to know if a shutdown is required. This shutdown is required to exit the return-path thread. However, in migrate_fd_cleanup(), '->to_dst_file' is cleaned up before calling close_return_path_on_source() and the shutdown is never performed, leaving the source and destination waiting for an event to occur. Avoid relying on '->to_dst_file' and use migrate_has_error() instead. Suggested-by: Peter Xu <peterx@redhat.com> Signed-off-by: Cédric Le Goater <clg@redhat.com> --- migration/migration.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)