Message ID | 1312383104-9565-1-git-send-email-mjt@msgid.tls.msk.ru |
---|---|
State | New |
Headers | show |
On Wed, 3 Aug 2011 18:51:44 +0400 Michael Tokarev <mjt@tls.msk.ru> wrote: > If we do, it results in double monitor_resume() (second being called > from migrate_fd_cleanup() anyway) and monitor suspend count becoming > negative. Are you hitting an specific issue or did you find this by code inspection? IIRC, I asked Marcelo to add the monitor_resume() call in the fix for e447b1a603 because the monitor wasn't being resumed in some cases. Don't remember which though, do you Marcelo? I see two possibilities here: 1. After e447b1a603 there was some change that made the monitor_resume() call in migrate_fd_put_buffer() unnecessary 2. We're calling it in the wrong place Taking a quick look at the code I see that migrate_fd_cleanup() doesn't seem to be called when qemu_savevm_state_iterate() fails, for example. > > Signed-Off-By: Michael Tokarev <mjt@tls.msk.ru> > Reviewed-By: Jan Kiszka <jan.kiszka@siemens.com> > --- > migration.c | 3 --- > 1 files changed, 0 insertions(+), 3 deletions(-) > > diff --git a/migration.c b/migration.c > index 2a15b98..7ca883f 100644 > --- a/migration.c > +++ b/migration.c > @@ -330,9 +330,6 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) > if (ret == -EAGAIN) { > qemu_set_fd_handler2(s->fd, NULL, NULL, migrate_fd_put_notify, s); > } else if (ret < 0) { > - if (s->mon) { > - monitor_resume(s->mon); > - } > s->state = MIG_STATE_ERROR; > notifier_list_notify(&migration_state_notifiers, NULL); > }
04.08.2011 23:24, Luiz Capitulino пишет: > On Wed, 3 Aug 2011 18:51:44 +0400 > Michael Tokarev <mjt@tls.msk.ru> wrote: > >> If we do, it results in double monitor_resume() (second being called >> from migrate_fd_cleanup() anyway) and monitor suspend count becoming >> negative. > > Are you hitting an specific issue or did you find this by code inspection? The specific case I'm talking about can trivially be triggered. Run migrate "exec:nosuchfile" and your monitor will be stuck forever. > IIRC, I asked Marcelo to add the monitor_resume() call in the fix for > e447b1a603 because the monitor wasn't being resumed in some cases. Don't > remember which though, do you Marcelo? And this (e447b1a603) is actually exactly the same thing: when the child terminates before qemu completes sending stuff to it. > I see two possibilities here: > > 1. After e447b1a603 there was some change that made the monitor_resume() call > in migrate_fd_put_buffer() unnecessary > > 2. We're calling it in the wrong place > > Taking a quick look at the code I see that migrate_fd_cleanup() doesn't > seem to be called when qemu_savevm_state_iterate() fails, for example. Yes indeed, but that will lead to a havoc in other places, like leaving callback notifier registered even after migration gets cancelled, or keeping the file open. Note that qemu_savevm_state_iterate() gets called from a callback routine. What I observe here is that when the exec: process exits prematurely, qemu will continue writing stuff to pipe up till migrate_fd_put_buffer() hits error (EPIPE iirc), at which point we'll call mon->resume() twice. The same happens in case of tcp migration when the other end closes the connection. I don't know when qemu_savevm_state_iterate() may fail, or why qemu_file_has_error() in there does not return true. Maybe the problem here is much more severe actually, -- that's why I initially asked what it is all about, -- before offering the patch. Thanks, /mjt >> Signed-Off-By: Michael Tokarev <mjt@tls.msk.ru> >> Reviewed-By: Jan Kiszka <jan.kiszka@siemens.com> >> --- >> migration.c | 3 --- >> 1 files changed, 0 insertions(+), 3 deletions(-) >> >> diff --git a/migration.c b/migration.c >> index 2a15b98..7ca883f 100644 >> --- a/migration.c >> +++ b/migration.c >> @@ -330,9 +330,6 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) >> if (ret == -EAGAIN) { >> qemu_set_fd_handler2(s->fd, NULL, NULL, migrate_fd_put_notify, s); >> } else if (ret < 0) { >> - if (s->mon) { >> - monitor_resume(s->mon); >> - } >> s->state = MIG_STATE_ERROR; >> notifier_list_notify(&migration_state_notifiers, NULL); >> } >
04.08.2011 23:52, Michael Tokarev пишет: > 04.08.2011 23:24, Luiz Capitulino пишет: >> On Wed, 3 Aug 2011 18:51:44 +0400 >> Michael Tokarev <mjt@tls.msk.ru> wrote: >> >>> If we do, it results in double monitor_resume() (second being called >>> from migrate_fd_cleanup() anyway) and monitor suspend count becoming >>> negative. >> >> Are you hitting an specific issue or did you find this by code inspection? > > The specific case I'm talking about can trivially be triggered. > Run migrate "exec:nosuchfile" and your monitor will be stuck > forever. > >> IIRC, I asked Marcelo to add the monitor_resume() call in the fix for >> e447b1a603 because the monitor wasn't being resumed in some cases. Don't >> remember which though, do you Marcelo? > > And this (e447b1a603) is actually exactly the same thing: when the > child terminates before qemu completes sending stuff to it. > >> I see two possibilities here: >> >> 1. After e447b1a603 there was some change that made the monitor_resume() call >> in migrate_fd_put_buffer() unnecessary >> >> 2. We're calling it in the wrong place >> >> Taking a quick look at the code I see that migrate_fd_cleanup() doesn't >> seem to be called when qemu_savevm_state_iterate() fails, for example. > > Yes indeed, but that will lead to a havoc in other places, like > leaving callback notifier registered even after migration gets > cancelled, or keeping the file open. > > Note that qemu_savevm_state_iterate() gets called from a callback > routine. > > What I observe here is that when the exec: process exits prematurely, > qemu will continue writing stuff to pipe up till migrate_fd_put_buffer() > hits error (EPIPE iirc), at which point we'll call mon->resume() twice. > The same happens in case of tcp migration when the other end closes > the connection. > > I don't know when qemu_savevm_state_iterate() may fail, or why > qemu_file_has_error() in there does not return true. > > Maybe the problem here is much more severe actually, -- that's why > I initially asked what it is all about, -- before offering the patch. This looks like a race condition: which of the two places will detect the error first. Sometimes I see this (-monitor stdio): (qemu) migrate "exec:nothing" sh: nothing: not found (qemu) and after this, monitor is stuck as I described. But sometimes I see this: (qemu) migrate "exec:nothing" sh: nothing: not found (qemu) And it continues working. Note the extra newline. If the error gets detected sooner (like in case of exec:nothing, or even better, after removing /bin/sh so that system(3) will fail), the chance to have stuck monitor is much higher. If the error gets detected later, it will survive most likely. Or something like that anyway. /mjt
diff --git a/migration.c b/migration.c index 2a15b98..7ca883f 100644 --- a/migration.c +++ b/migration.c @@ -330,9 +330,6 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size) if (ret == -EAGAIN) { qemu_set_fd_handler2(s->fd, NULL, NULL, migrate_fd_put_notify, s); } else if (ret < 0) { - if (s->mon) { - monitor_resume(s->mon); - } s->state = MIG_STATE_ERROR; notifier_list_notify(&migration_state_notifiers, NULL); }