
[V1,2/3] migration: fix suspended runstate

Message ID 1686860800-34667-3-git-send-email-steven.sistare@oracle.com
State: New
Series: fix migration of suspended runstate

Commit Message

Steven Sistare June 15, 2023, 8:26 p.m. UTC
Migration of a guest in the suspended state is broken.  The incoming
migration code automatically tries to wake the guest, which IMO is
wrong -- the guest should end migration in the same state it started.
Further, the wakeup is done by calling qemu_system_wakeup_request(), which
bypasses vm_start().  The guest appears to be in the running state, but
it is not.

To fix, leave the guest in the suspended state, but call
qemu_system_start_on_wakeup_request() so the guest is properly resumed
later, when the client sends a system_wakeup command.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 migration/migration.c | 11 ++++-------
 softmmu/runstate.c    |  1 +
 2 files changed, 5 insertions(+), 7 deletions(-)
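The qemu_system_start_on_wakeup_request() helper is added earlier in this
series, so its body does not appear on this page.  As a rough sketch of the
idea only (the flag name and placement below are assumptions, not the actual
patch): a one-shot global makes the next wakeup go through vm_start() rather
than resume_all_vcpus(), so the RUN_STATE_RUNNING work finally happens on the
incoming side:

    /* Sketch only, not the actual patch: a one-shot flag in softmmu/runstate.c. */
    static bool start_on_wakeup;

    void qemu_system_start_on_wakeup_request(void)
    {
        start_on_wakeup = true;
    }

    /* In the main-loop wakeup path, where resume_all_vcpus() runs today: */
    if (start_on_wakeup) {
        start_on_wakeup = false;
        vm_start();              /* runs vm_prepare_start(): notifiers, ticks */
    } else {
        resume_all_vcpus();
    }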

Comments

Peter Xu June 20, 2023, 9:46 p.m. UTC | #1
On Thu, Jun 15, 2023 at 01:26:39PM -0700, Steve Sistare wrote:
> Migration of a guest in the suspended state is broken.  The incoming
> migration code automatically tries to wake the guest, which IMO is
> wrong -- the guest should end migration in the same state it started.
> Further, the wakeup is done by calling qemu_system_wakeup_request(), which
> bypasses vm_start().  The guest appears to be in the running state, but
> it is not.
> 
> To fix, leave the guest in the suspended state, but call
> qemu_system_start_on_wakeup_request() so the guest is properly resumed
> later, when the client sends a system_wakeup command.
> 
> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> ---
>  migration/migration.c | 11 ++++-------
>  softmmu/runstate.c    |  1 +
>  2 files changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 17b4b47..851fe6d 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -496,6 +496,10 @@ static void process_incoming_migration_bh(void *opaque)
>          vm_start();
>      } else {
>          runstate_set(global_state_get_runstate());
> +        if (runstate_check(RUN_STATE_SUSPENDED)) {
> +            /* Force vm_start to be called later. */
> +            qemu_system_start_on_wakeup_request();
> +        }

Is this really needed, along with patch 1?

I have very limited knowledge of suspension, so I'm prone to making
mistakes..

But from what I read, qemu_system_wakeup_request() (the existing one, without
patch 1 applied) will set up wakeup_reason and kick the main thread
using qemu_notify_event().  Then IIUC e.g. the vcpu wakeups will be done in
the main thread later on, after qemu_wakeup_requested() returns true.
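For reference, the existing wakeup path in softmmu/runstate.c looks roughly
like this (paraphrased from memory, not the exact source):

    void qemu_system_wakeup_request(WakeupReason reason, Error **errp)
    {
        trace_system_wakeup_request(reason);

        if (!runstate_check(RUN_STATE_SUSPENDED)) {
            error_setg(errp, "Unable to wake up: guest is not in suspended state");
            return;
        }
        if (!(wakeup_reason_mask & (1 << reason))) {
            return;                       /* this wakeup reason is masked */
        }
        runstate_set(RUN_STATE_RUNNING);  /* note: no vm_start() involved */
        wakeup_reason = reason;
        qemu_notify_event();              /* kick the main loop */
    }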

>      }
>      /*
>       * This must happen after any state changes since as soon as an external
> @@ -2101,7 +2105,6 @@ static int postcopy_start(MigrationState *ms)
>      qemu_mutex_lock_iothread();
>      trace_postcopy_start_set_run();
>  
> -    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
>      global_state_store();
>      ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
>      if (ret < 0) {
> @@ -2307,7 +2310,6 @@ static void migration_completion(MigrationState *s)
>      if (s->state == MIGRATION_STATUS_ACTIVE) {
>          qemu_mutex_lock_iothread();
>          s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
> -        qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
>  
>          s->vm_old_state = runstate_get();
>          global_state_store();
> @@ -3102,11 +3104,6 @@ static void *bg_migration_thread(void *opaque)
>  
>      qemu_mutex_lock_iothread();
>  
> -    /*
> -     * If VM is currently in suspended state, then, to make a valid runstate
> -     * transition in vm_stop_force_state() we need to wakeup it up.
> -     */
> -    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);

Removal of these three places seems reasonable to me, or we won't persist
the SUSPEND state.

The above comment was the major reason I had thought it was needed
(again, based on zero knowledge around this..), but perhaps it was just
wrong?  I would assume vm_stop_force_state() will still just work with
suspended, am I right?
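For reference, vm_stop_force_state() (softmmu/cpus.c) roughly reads as below
(paraphrased, not the exact source); since the transition table already has a
SUSPENDED -> FINISH_MIGRATE entry (visible as context in the runstate.c hunk
below), the not-running branch applies and the transition stays valid:

    int vm_stop_force_state(RunState state)
    {
        if (runstate_is_running()) {
            return vm_stop(state);
        } else {
            /* A suspended guest lands here: vcpus are already paused,
             * so just record the new state and flush block devices. */
            runstate_set(state);
            bdrv_drain_all();
            return bdrv_flush_all();
        }
    }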

>      s->vm_old_state = runstate_get();
>  
>      global_state_store();
> diff --git a/softmmu/runstate.c b/softmmu/runstate.c
> index e127b21..771896c 100644
> --- a/softmmu/runstate.c
> +++ b/softmmu/runstate.c
> @@ -159,6 +159,7 @@ static const RunStateTransition runstate_transitions_def[] = {
>      { RUN_STATE_RUNNING, RUN_STATE_SUSPENDED },
>      { RUN_STATE_SUSPENDED, RUN_STATE_RUNNING },
>      { RUN_STATE_SUSPENDED, RUN_STATE_FINISH_MIGRATE },
> +    { RUN_STATE_SUSPENDED, RUN_STATE_PAUSED },
>      { RUN_STATE_SUSPENDED, RUN_STATE_PRELAUNCH },
>      { RUN_STATE_SUSPENDED, RUN_STATE_COLO},
>  
> -- 
> 1.8.3.1
>
Steven Sistare June 21, 2023, 7:15 p.m. UTC | #2
On 6/20/2023 5:46 PM, Peter Xu wrote:
> On Thu, Jun 15, 2023 at 01:26:39PM -0700, Steve Sistare wrote:
>> Migration of a guest in the suspended state is broken.  The incoming
>> migration code automatically tries to wake the guest, which IMO is
>> wrong -- the guest should end migration in the same state it started.
>> Further, the wakeup is done by calling qemu_system_wakeup_request(), which
>> bypasses vm_start().  The guest appears to be in the running state, but
>> it is not.
>>
>> To fix, leave the guest in the suspended state, but call
>> qemu_system_start_on_wakeup_request() so the guest is properly resumed
>> later, when the client sends a system_wakeup command.
>>
>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>> ---
>>  migration/migration.c | 11 ++++-------
>>  softmmu/runstate.c    |  1 +
>>  2 files changed, 5 insertions(+), 7 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 17b4b47..851fe6d 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -496,6 +496,10 @@ static void process_incoming_migration_bh(void *opaque)
>>          vm_start();
>>      } else {
>>          runstate_set(global_state_get_runstate());
>> +        if (runstate_check(RUN_STATE_SUSPENDED)) {
>> +            /* Force vm_start to be called later. */
>> +            qemu_system_start_on_wakeup_request();
>> +        }
> 
> Is this really needed, along with patch 1?
> 
> I have very limited knowledge of suspension, so I'm prone to making
> mistakes..
> 
> But from what I read, qemu_system_wakeup_request() (the existing one, without
> patch 1 applied) will set up wakeup_reason and kick the main thread
> using qemu_notify_event().  Then IIUC e.g. the vcpu wakeups will be done in
> the main thread later on, after qemu_wakeup_requested() returns true.

Correct, here:

    if (qemu_wakeup_requested()) {
        pause_all_vcpus();
        qemu_system_wakeup();
        notifier_list_notify(&wakeup_notifiers, &wakeup_reason);
        wakeup_reason = QEMU_WAKEUP_REASON_NONE;
        resume_all_vcpus();
        qapi_event_send_wakeup();
    }

However, that is not sufficient, because vm_start() was never called on the incoming
side.  vm_start calls the vm state notifiers for RUN_STATE_RUNNING, among other things.
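For reference, vm_start() itself is tiny (softmmu/cpus.c); the substance is in
vm_prepare_start().  Roughly, from memory:

    void vm_start(void)
    {
        if (!vm_prepare_start(false)) {
            resume_all_vcpus();
        }
    }

    /* vm_prepare_start() does, among other things (order elided):
     *   qapi_event_send_resume();
     *   cpu_enable_ticks();
     *   runstate_set(RUN_STATE_RUNNING);
     *   vm_state_notify(1, RUN_STATE_RUNNING);
     * None of that happens if only qemu_system_wakeup_request() runs. */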


Without my fixes, it "works" because the outgoing migration automatically wakes a suspended
guest, which sets the state to running, which is saved in global state:

    void migration_completion(MigrationState *s)
        qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
        global_state_store()

Then the incoming migration calls vm_start here:

    migration/migration.c
        if (!global_state_received() ||
            global_state_get_runstate() == RUN_STATE_RUNNING) {
            if (autostart) {
                vm_start();

vm_start must be called for correctness.

- Steve

Peter Xu June 21, 2023, 8:28 p.m. UTC | #3
On Wed, Jun 21, 2023 at 03:15:42PM -0400, Steven Sistare wrote:
> On 6/20/2023 5:46 PM, Peter Xu wrote:
> > On Thu, Jun 15, 2023 at 01:26:39PM -0700, Steve Sistare wrote:
> >> Migration of a guest in the suspended state is broken.  The incoming
> >> migration code automatically tries to wake the guest, which IMO is
> >> wrong -- the guest should end migration in the same state it started.
> >> Further, the wakeup is done by calling qemu_system_wakeup_request(), which
> >> bypasses vm_start().  The guest appears to be in the running state, but
> >> it is not.
> >>
> >> To fix, leave the guest in the suspended state, but call
> >> qemu_system_start_on_wakeup_request() so the guest is properly resumed
> >> later, when the client sends a system_wakeup command.
> >>
> >> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >> ---
> >>  migration/migration.c | 11 ++++-------
> >>  softmmu/runstate.c    |  1 +
> >>  2 files changed, 5 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/migration/migration.c b/migration/migration.c
> >> index 17b4b47..851fe6d 100644
> >> --- a/migration/migration.c
> >> +++ b/migration/migration.c
> >> @@ -496,6 +496,10 @@ static void process_incoming_migration_bh(void *opaque)
> >>          vm_start();
> >>      } else {
> >>          runstate_set(global_state_get_runstate());
> >> +        if (runstate_check(RUN_STATE_SUSPENDED)) {
> >> +            /* Force vm_start to be called later. */
> >> +            qemu_system_start_on_wakeup_request();
> >> +        }
> > 
> > Is this really needed, along with patch 1?
> > 
> > I have very limited knowledge of suspension, so I'm prone to making
> > mistakes..
> > 
> > But from what I read, qemu_system_wakeup_request() (the existing one, without
> > patch 1 applied) will set up wakeup_reason and kick the main thread
> > using qemu_notify_event().  Then IIUC e.g. the vcpu wakeups will be done in
> > the main thread later on, after qemu_wakeup_requested() returns true.
> 
> Correct, here:
> 
>     if (qemu_wakeup_requested()) {
>         pause_all_vcpus();
>         qemu_system_wakeup();
>         notifier_list_notify(&wakeup_notifiers, &wakeup_reason);
>         wakeup_reason = QEMU_WAKEUP_REASON_NONE;
>         resume_all_vcpus();
>         qapi_event_send_wakeup();
>     }
> 
> However, that is not sufficient, because vm_start() was never called on the incoming
> side.  vm_start calls the vm state notifiers for RUN_STATE_RUNNING, among other things.
> 
> 
> Without my fixes, it "works" because the outgoing migration automatically wakes a suspended
> guest, which sets the state to running, which is saved in global state:
> 
>     void migration_completion(MigrationState *s)
>         qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
>         global_state_store()
> 
> Then the incoming migration calls vm_start here:
> 
>     migration/migration.c
>         if (!global_state_received() ||
>             global_state_get_runstate() == RUN_STATE_RUNNING) {
>             if (autostart) {
>                 vm_start();
> 
> vm_start must be called for correctness.

I see.  Though I had a feeling that this is still not the right way to do
it, at least not as clean.

One question is, would the above work for postcopy when the VM is suspended
during the switchover?

I think I see your point that vm_start() (mostly vm_prepare_start())
contains a bunch of operations that we may need before starting the
VM, but then.. should we just make that vm_start() unconditional when
loading the VM completes?  I just don't see anything that won't need it
(besides -S), even COLO.

So I'm wondering about something like this:

===8<===
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -481,19 +481,28 @@ static void process_incoming_migration_bh(void *opaque)
 
     dirty_bitmap_mig_before_vm_start();
 
-    if (!global_state_received() ||
-        global_state_get_runstate() == RUN_STATE_RUNNING) {
-        if (autostart) {
-            vm_start();
-        } else {
-            runstate_set(RUN_STATE_PAUSED);
-        }
-    } else if (migration_incoming_colo_enabled()) {
+    if (migration_incoming_colo_enabled()) {
         migration_incoming_disable_colo();
+        /* COLO should always have autostart=1 or we can enforce it here */
+    }
+
+    if (autostart) {
+        RunState run_state = global_state_get_runstate();
         vm_start();
+        switch (run_state) {
+        case RUN_STATE_RUNNING:
+            break;
+        case RUN_STATE_SUSPENDED:
+            qemu_system_suspend();
+            break;
+        default:
+            runstate_set(run_state);
+            break;
+        }
     } else {
-        runstate_set(global_state_get_runstate());
+        runstate_set(RUN_STATE_PAUSED);
     }
===8<===

IIUC this can drop qemu_system_start_on_wakeup_request() along with the
other global var.  Would something like it work for us?
Steven Sistare June 23, 2023, 6:25 p.m. UTC | #4
On 6/21/2023 4:28 PM, Peter Xu wrote:
> On Wed, Jun 21, 2023 at 03:15:42PM -0400, Steven Sistare wrote:
>> On 6/20/2023 5:46 PM, Peter Xu wrote:
>>> On Thu, Jun 15, 2023 at 01:26:39PM -0700, Steve Sistare wrote:
>>>> Migration of a guest in the suspended state is broken.  The incoming
>>>> migration code automatically tries to wake the guest, which IMO is
>>>> wrong -- the guest should end migration in the same state it started.
>>>> Further, the wakeup is done by calling qemu_system_wakeup_request(), which
>>>> bypasses vm_start().  The guest appears to be in the running state, but
>>>> it is not.
>>>>
>>>> To fix, leave the guest in the suspended state, but call
>>>> qemu_system_start_on_wakeup_request() so the guest is properly resumed
>>>> later, when the client sends a system_wakeup command.
>>>>
>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>> ---
>>>>  migration/migration.c | 11 ++++-------
>>>>  softmmu/runstate.c    |  1 +
>>>>  2 files changed, 5 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index 17b4b47..851fe6d 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -496,6 +496,10 @@ static void process_incoming_migration_bh(void *opaque)
>>>>          vm_start();
>>>>      } else {
>>>>          runstate_set(global_state_get_runstate());
>>>> +        if (runstate_check(RUN_STATE_SUSPENDED)) {
>>>> +            /* Force vm_start to be called later. */
>>>> +            qemu_system_start_on_wakeup_request();
>>>> +        }
>>>
>>> Is this really needed, along with patch 1?
>>>
>>> I have very limited knowledge of suspension, so I'm prone to making
>>> mistakes..
>>>
>>> But from what I read, qemu_system_wakeup_request() (the existing one, without
>>> patch 1 applied) will set up wakeup_reason and kick the main thread
>>> using qemu_notify_event().  Then IIUC e.g. the vcpu wakeups will be done in
>>> the main thread later on, after qemu_wakeup_requested() returns true.
>>
>> Correct, here:
>>
>>     if (qemu_wakeup_requested()) {
>>         pause_all_vcpus();
>>         qemu_system_wakeup();
>>         notifier_list_notify(&wakeup_notifiers, &wakeup_reason);
>>         wakeup_reason = QEMU_WAKEUP_REASON_NONE;
>>         resume_all_vcpus();
>>         qapi_event_send_wakeup();
>>     }
>>
>> However, that is not sufficient, because vm_start() was never called on the incoming
>> side.  vm_start calls the vm state notifiers for RUN_STATE_RUNNING, among other things.
>>
>>
>> Without my fixes, it "works" because the outgoing migration automatically wakes a suspended
>> guest, which sets the state to running, which is saved in global state:
>>
>>     void migration_completion(MigrationState *s)
>>         qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
>>         global_state_store()
>>
>> Then the incoming migration calls vm_start here:
>>
>>     migration/migration.c
>>         if (!global_state_received() ||
>>             global_state_get_runstate() == RUN_STATE_RUNNING) {
>>             if (autostart) {
>>                 vm_start();
>>
>> vm_start must be called for correctness.
> 
> I see.  Though I had a feeling that this is still not the right way to do
> it, at least not as clean.
> 
> One question is, would the above work for postcopy when the VM is suspended
> during the switchover?

Good catch, that is broken.
I added qemu_system_start_on_wakeup_request to loadvm_postcopy_handle_run_bh
and now it works.

    if (global_state_get_runstate() == RUN_STATE_RUNNING) {
        if (autostart) {
            vm_start();
        } else {
            runstate_set(RUN_STATE_PAUSED);
        }
    } else {
        runstate_set(global_state_get_runstate());
        if (runstate_check(RUN_STATE_SUSPENDED)) {
            qemu_system_start_on_wakeup_request();
        }
    }

> I think I see your point that vm_start() (mostly vm_prepare_start())
> contains a bunch of operations that we may need before starting the
> VM, but then.. should we just make that vm_start() unconditional when
> loading the VM completes?  I just don't see anything that won't need it
> (besides -S), even COLO.
> 
> So I'm wondering about something like this:
> 
> ===8<===
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -481,19 +481,28 @@ static void process_incoming_migration_bh(void *opaque)
>  
>      dirty_bitmap_mig_before_vm_start();
>  
> -    if (!global_state_received() ||
> -        global_state_get_runstate() == RUN_STATE_RUNNING) {
> -        if (autostart) {
> -            vm_start();
> -        } else {
> -            runstate_set(RUN_STATE_PAUSED);
> -        }
> -    } else if (migration_incoming_colo_enabled()) {
> +    if (migration_incoming_colo_enabled()) {
>          migration_incoming_disable_colo();
> +        /* COLO should always have autostart=1 or we can enforce it here */
> +    }
> +
> +    if (autostart) {
> +        RunState run_state = global_state_get_runstate();
>          vm_start();

This will resume the guest for a brief time, against the user's wishes.  Not OK IMO.

> +        switch (run_state) {
> +        case RUN_STATE_RUNNING:
> +            break;
> +        case RUN_STATE_SUSPENDED:
> +            qemu_system_suspend();

qemu_system_suspend will not cause the guest to suspend.  It is QEMU's response
after the guest initiates a suspend.
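For reference (paraphrased from softmmu/runstate.c), it only records a suspend
that the guest has already performed; nothing is injected into the guest:

    void qemu_system_suspend(void)
    {
        pause_all_vcpus();
        notifier_list_notify(&suspend_notifiers, NULL);
        runstate_set(RUN_STATE_SUSPENDED);
        qapi_event_send_suspend();
    }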

> +            break;
> +        default:
> +            runstate_set(run_state);
> +            break;
> +        }
>      } else {
> -        runstate_set(global_state_get_runstate());
> +        runstate_set(RUN_STATE_PAUSED);
>      }
> ===8<===
> 
> IIUC this can drop qemu_system_start_on_wakeup_request() along with the
> other global var.  Would something like it work for us?

Afraid not.  start_on_wake is the only correct solution I can think of.

- Steve
Steven Sistare June 23, 2023, 7:56 p.m. UTC | #5
On 6/23/2023 2:25 PM, Steven Sistare wrote:
> On 6/21/2023 4:28 PM, Peter Xu wrote:
>> [...]
>>
>> IIUC this can drop qemu_system_start_on_wakeup_request() along with the
>> other global var.  Would something like it work for us?
> 
> Afraid not.  start_on_wake is the only correct solution I can think of.
> 
> - Steve

Actually, the start_on_wake concept is indeed required, but I can hide the
details in softmmu/cpus.c:

static bool vm_never_started = true;

void vm_start(void)
{
    if (!vm_prepare_start(false)) {
        vm_never_started = false;
        resume_all_vcpus();
    }
}

void vm_wakeup(void)
{
    if (vm_never_started) {
        vm_start();
    } else {
        runstate_set(RUN_STATE_RUNNING);
    }
}

... and qemu_system_wakeup_request() calls vm_wakeup().

Now the migration code simply sets the run state (e.g. SUSPENDED) and no
longer calls qemu_system_start_on_wakeup_request().
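A hypothetical sketch of the resulting incoming-migration tail (names as
above; not the posted patch):

    if (global_state_get_runstate() == RUN_STATE_RUNNING) {
        if (autostart) {
            vm_start();
        } else {
            runstate_set(RUN_STATE_PAUSED);
        }
    } else {
        /* e.g. SUSPENDED: a later system_wakeup lands in vm_wakeup(),
         * which sees vm_never_started and does the full vm_start(). */
        runstate_set(global_state_get_runstate());
    }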

- Steve
Peter Xu June 26, 2023, 6:27 p.m. UTC | #6
On Fri, Jun 23, 2023 at 02:25:05PM -0400, Steven Sistare wrote:
> On 6/21/2023 4:28 PM, Peter Xu wrote:
> > On Wed, Jun 21, 2023 at 03:15:42PM -0400, Steven Sistare wrote:
> >> [...]
> >>
> >> vm_start must be called for correctness.
> > 
> > I see.  Though I had a feeling that this is still not the right way to do
> > it, at least not as clean.
> > 
> > One question is, would the above work for postcopy when the VM is suspended
> > during the switchover?
> 
> Good catch, that is broken.
> I added qemu_system_start_on_wakeup_request to loadvm_postcopy_handle_run_bh
> and now it works.
> 
>     if (global_state_get_runstate() == RUN_STATE_RUNNING) {
>         if (autostart) {
>             vm_start();
>         } else {
>             runstate_set(RUN_STATE_PAUSED);
>         }
>     } else {
>         runstate_set(global_state_get_runstate());
>         if (runstate_check(RUN_STATE_SUSPENDED)) {
>             qemu_system_start_on_wakeup_request();
>         }
>     }
> 
> > I think I see your point that vm_start() (mostly vm_prepare_start())
> > contains a bunch of operations that we may need before starting the
> > VM, but then.. should we just make that vm_start() unconditional when
> > loading the VM completes?  I just don't see anything that won't need it
> > (besides -S), even COLO.
> > 
> > So I'm wondering about something like this:
> > 
> > ===8<===
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -481,19 +481,28 @@ static void process_incoming_migration_bh(void *opaque)
> >  
> >      dirty_bitmap_mig_before_vm_start();
> >  
> > -    if (!global_state_received() ||
> > -        global_state_get_runstate() == RUN_STATE_RUNNING) {
> > -        if (autostart) {
> > -            vm_start();
> > -        } else {
> > -            runstate_set(RUN_STATE_PAUSED);
> > -        }
> > -    } else if (migration_incoming_colo_enabled()) {
> > +    if (migration_incoming_colo_enabled()) {
> >          migration_incoming_disable_colo();
> > +        /* COLO should always have autostart=1 or we can enforce it here */
> > +    }
> > +
> > +    if (autostart) {
> > +        RunState run_state = global_state_get_runstate();
> >          vm_start();
> 
> This will resume the guest for a brief time, against the user's wishes.  Not OK IMO.

Ah okay..

Can we do whatever we need in vm_prepare_start(), then? I assume these
chunks are needed:

    /*
     * WHPX accelerator needs to know whether we are going to step
     * any CPUs, before starting the first one.
     */
    if (cpus_accel->synchronize_pre_resume) {
        cpus_accel->synchronize_pre_resume(step_pending);
    }

    /* We are sending this now, but the CPUs will be resumed shortly later */
    qapi_event_send_resume();

    cpu_enable_ticks();

While these may not be needed unconditionally, but only in the
RUN_STATE_RUNNING case below (runstate_set() will be needed regardless):

    runstate_set(RUN_STATE_RUNNING);
    vm_state_notify(1, RUN_STATE_RUNNING);

So here's another attempt of mine (this time also taking
!global_state_received() into account)...

    RunState run_state;

    if (migration_incoming_colo_enabled()) {
        migration_incoming_disable_colo();
    }

    if (!global_state_received()) {
        run_state = RUN_STATE_RUNNING;
    } else {
        run_state = global_state_get_runstate();
    }

    if (autostart) {
        /* Part of vm_prepare_start(), may isolate into a helper? */
        if (cpus_accel->synchronize_pre_resume) {
            cpus_accel->synchronize_pre_resume(step_pending);
        }
        qapi_event_send_resume();
        cpu_enable_ticks();
        /* Setup the runstate on src */
        runstate_set(run_state);
        if (run_state == RUN_STATE_RUNNING) {
            vm_state_notify(1, RUN_STATE_RUNNING);
        }
    } else {
        runstate_set(RUN_STATE_PAUSED);
    }

The whole idea is still to do whatever is needed here besides resuming the
vcpus, rather than deferring the whole vm_start() to system wakeup.  It can
then avoid touching the system wakeup code, which seems cleaner.

> > IIUC this can drop qemu_system_start_on_wakeup_request() along with the
> > other global var.  Would something like it work for us?
> 
> Afraid not.  start_on_wake is the only correct solution I can think of.

Please check the above again; I just hope we can avoid adding yet another
global to QEMU if possible.

Thanks,
Peter Xu June 26, 2023, 7:11 p.m. UTC | #7
On Mon, Jun 26, 2023 at 02:27:32PM -0400, Peter Xu wrote:
> [...]
> 
> So here's another attempt of mine (this time also taking
> !global_state_received() into account)...
> 
>     RunState run_state;
> 
>     if (migration_incoming_colo_enabled()) {
>         migration_incoming_disable_colo();
>     }
> 
>     if (!global_state_received()) {
>         run_state = RUN_STATE_RUNNING;
>     } else {
>         run_state = global_state_get_runstate();
>     }
> 
>     if (autostart) {
>         /* Part of vm_prepare_start(), may isolate into a helper? */
>         if (cpus_accel->synchronize_pre_resume) {
>             cpus_accel->synchronize_pre_resume(step_pending);
>         }
>         qapi_event_send_resume();
>         cpu_enable_ticks();
>         /* Setup the runstate on src */
>         runstate_set(run_state);
>         if (run_state == RUN_STATE_RUNNING) {
>             vm_state_notify(1, RUN_STATE_RUNNING);

And here I at least forgot:

              resume_all_vcpus();

The pseudo code is just to illustrate the whole idea.  Definitely not tested
in any form, so sorry if it won't even compile..

>         }
>     } else {
>         runstate_set(RUN_STATE_PAUSED);
>     }
Steven Sistare June 30, 2023, 1:50 p.m. UTC | #8
On 6/26/2023 2:27 PM, Peter Xu wrote:
> On Fri, Jun 23, 2023 at 02:25:05PM -0400, Steven Sistare wrote:
> > [...]
>>> +    if (autostart) {
>>> +        RunState run_state = global_state_get_runstate();
>>>          vm_start();
>>
>> This will resume the guest for a brief time, against the user's wishes.  Not OK IMO.
> 
> Ah okay..
> 
> Can we do whatever we need in vm_prepare_start(), then? I assume these
> chunks are needed:
> 
>     /*
>      * WHPX accelerator needs to know whether we are going to step
>      * any CPUs, before starting the first one.
>      */
>     if (cpus_accel->synchronize_pre_resume) {
>         cpus_accel->synchronize_pre_resume(step_pending);
>     }
> 
>     /* We are sending this now, but the CPUs will be resumed shortly later */
>     qapi_event_send_resume();
> 
>     cpu_enable_ticks();
> 
> While these may not be needed unconditionally, but only in the
> RUN_STATE_RUNNING case below (runstate_set() will be needed regardless):
> 
>     runstate_set(RUN_STATE_RUNNING);
>     vm_state_notify(1, RUN_STATE_RUNNING);
> 
> So here's another attempt of mine (this time also taking
> !global_state_received() into account)...
> 
>     RunState run_state;
> 
>     if (migration_incoming_colo_enabled()) {
>         migration_incoming_disable_colo();
>     }
> 
>     if (!global_state_received()) {
>         run_state = RUN_STATE_RUNNING;
>     } else {
>         run_state = global_state_get_runstate();
>     }
> 
>     if (autostart) {
>         /* Part of vm_prepare_start(), may isolate into a helper? */
>         if (cpus_accel->synchronize_pre_resume) {
>             cpus_accel->synchronize_pre_resume(step_pending);
>         }
>         qapi_event_send_resume();
>         cpu_enable_ticks();
>         /* Setup the runstate on src */
>         runstate_set(run_state);
>         if (run_state == RUN_STATE_RUNNING) {
>             vm_state_notify(1, RUN_STATE_RUNNING);
>         }
>     } else {
>         runstate_set(RUN_STATE_PAUSED);
>     }
> 
> The whole idea is still to do whatever is needed here besides resuming the
> vcpus, rather than deferring the whole vm_start() to system wakeup.  It can
> then avoid touching the system wakeup code, which seems cleaner.

The problem is that some actions cannot be performed at migration finish time,
such as vm_state_notify() for RUN_STATE_RUNNING.  The wakeup code called later
still needs to know that vm_state_notify() has not been called, and call it.

I just posted a new series with a cleaner wakeup, but it still uses a global.

- Steve

Peter Xu July 26, 2023, 8:18 p.m. UTC | #9
On Fri, Jun 30, 2023 at 09:50:41AM -0400, Steven Sistare wrote:
> On 6/26/2023 2:27 PM, Peter Xu wrote:
> > On Fri, Jun 23, 2023 at 02:25:05PM -0400, Steven Sistare wrote:
> >> On 6/21/2023 4:28 PM, Peter Xu wrote:
> >>> On Wed, Jun 21, 2023 at 03:15:42PM -0400, Steven Sistare wrote:
> >>>> On 6/20/2023 5:46 PM, Peter Xu wrote:
> >>>>> On Thu, Jun 15, 2023 at 01:26:39PM -0700, Steve Sistare wrote:
> >>>>>> Migration of a guest in the suspended state is broken.  The incoming
> >>>>>> migration code automatically tries to wake the guest, which IMO is
> >>>>>> wrong -- the guest should end migration in the same state it started.
> >>>>>> Further, the wakeup is done by calling qemu_system_wakeup_request(), which
> >>>>>> bypasses vm_start().  The guest appears to be in the running state, but
> >>>>>> it is not.
> >>>>>>
> >>>>>> To fix, leave the guest in the suspended state, but call
> >>>>>> qemu_system_start_on_wakeup_request() so the guest is properly resumed
> >>>>>> later, when the client sends a system_wakeup command.
> >>>>>>
> >>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
> >>>>>> ---
> >>>>>>  migration/migration.c | 11 ++++-------
> >>>>>>  softmmu/runstate.c    |  1 +
> >>>>>>  2 files changed, 5 insertions(+), 7 deletions(-)
> >>>>>>
> >>>>>> diff --git a/migration/migration.c b/migration/migration.c
> >>>>>> index 17b4b47..851fe6d 100644
> >>>>>> --- a/migration/migration.c
> >>>>>> +++ b/migration/migration.c
> >>>>>> @@ -496,6 +496,10 @@ static void process_incoming_migration_bh(void *opaque)
> >>>>>>          vm_start();
> >>>>>>      } else {
> >>>>>>          runstate_set(global_state_get_runstate());
> >>>>>> +        if (runstate_check(RUN_STATE_SUSPENDED)) {
> >>>>>> +            /* Force vm_start to be called later. */
> >>>>>> +            qemu_system_start_on_wakeup_request();
> >>>>>> +        }
> >>>>>
> >>>>> Is this really needed, along with patch 1?
> >>>>>
> >>>>> I have a very limited knowledge on suspension, so I'm prone to making
> >>>>> mistakes..
> >>>>>
> >>>>> But from what I read this, qemu_system_wakeup_request() (existing one, not
> >>>>> after patch 1 applied) will setup wakeup_reason and kick the main thread
> >>>>> using qemu_notify_event().  Then IIUC the e.g. vcpu wakeups will be done in
> >>>>> the main thread later on after qemu_wakeup_requested() returns true.
> >>>>
> >>>> Correct, here:
> >>>>
> >>>>     if (qemu_wakeup_requested()) {
> >>>>         pause_all_vcpus();
> >>>>         qemu_system_wakeup();
> >>>>         notifier_list_notify(&wakeup_notifiers, &wakeup_reason);
> >>>>         wakeup_reason = QEMU_WAKEUP_REASON_NONE;
> >>>>         resume_all_vcpus();
> >>>>         qapi_event_send_wakeup();
> >>>>     }
> >>>>
> >>>> However, that is not sufficient, because vm_start() was never called on the incoming
> >>>> side.  vm_start calls the vm state notifiers for RUN_STATE_RUNNING, among other things.
> >>>>
> >>>>
> >>>> Without my fixes, it "works" because the outgoing migration automatically wakes a suspended
> >>>> guest, which sets the state to running, which is saved in global state:
> >>>>
> >>>>     void migration_completion(MigrationState *s)
> >>>>         qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
> >>>>         global_state_store()
> >>>>
> >>>> Then the incoming migration calls vm_start here:
> >>>>
> >>>>     migration/migration.c
> >>>>         if (!global_state_received() ||
> >>>>             global_state_get_runstate() == RUN_STATE_RUNNING) {
> >>>>             if (autostart) {
> >>>>                 vm_start();
> >>>>
> >>>> vm_start must be called for correctness.
> >>>
> >>> I see.  Though I had a feeling that this is still not the right way to do,
> >>> at least not as clean.
> >>>
> >>> One question is, would above work for postcopy when VM is suspended during
> >>> the switchover?
> >>
> >> Good catch, that is broken.
> >> I added qemu_system_start_on_wakeup_request to loadvm_postcopy_handle_run_bh
> >> and now it works.
> >>
> >>     if (global_state_get_runstate() == RUN_STATE_RUNNING) {
> >>         if (autostart) {
> >>             vm_start();
> >>         } else {
> >>             runstate_set(RUN_STATE_PAUSED);
> >>         }
> >>     } else {
> >>         runstate_set(global_state_get_runstate());
> >>         if (runstate_check(RUN_STATE_SUSPENDED)) {
> >>             qemu_system_start_on_wakeup_request();
> >>         }
> >>     }
> >>
> >>> I think I see your point that vm_start() (mostly vm_prepare_start())
> >>> contains a bunch of operations that maybe we must have before starting the
> >>> VM, but then.. should we just make that vm_start() unconditional when
> >>> loading VM completes?  I just don't see anything won't need it (besides
> >>> -S), even COLO.
> >>>
> >>> So I'm wondering about something like this:
> >>>
> >>> ===8<===
> >>> --- a/migration/migration.c
> >>> +++ b/migration/migration.c
> >>> @@ -481,19 +481,28 @@ static void process_incoming_migration_bh(void *opaque)
> >>>  
> >>>      dirty_bitmap_mig_before_vm_start();
> >>>  
> >>> -    if (!global_state_received() ||
> >>> -        global_state_get_runstate() == RUN_STATE_RUNNING) {
> >>> -        if (autostart) {
> >>> -            vm_start();
> >>> -        } else {
> >>> -            runstate_set(RUN_STATE_PAUSED);
> >>> -        }
> >>> -    } else if (migration_incoming_colo_enabled()) {
> >>> +    if (migration_incoming_colo_enabled()) {
> >>>          migration_incoming_disable_colo();
> >>> +        /* COLO should always have autostart=1 or we can enforce it here */
> >>> +    }
> >>> +
> >>> +    if (autostart) {
> >>> +        RunState run_state = global_state_get_runstate();
> >>>          vm_start();
> >>
> >> This will resume the guest for a brief time, against the user's wishes.  Not OK IMO.
> > 
> > Ah okay..
> > 
> > Can we do whatever we need in vm_prepare_start(), then? I assume these
> > chunks are needed:
> > 
> >     /*
> >      * WHPX accelerator needs to know whether we are going to step
> >      * any CPUs, before starting the first one.
> >      */
> >     if (cpus_accel->synchronize_pre_resume) {
> >         cpus_accel->synchronize_pre_resume(step_pending);
> >     }
> > 
> >     /* We are sending this now, but the CPUs will be resumed shortly later */
> >     qapi_event_send_resume();
> > 
> >     cpu_enable_ticks();
> > 
> > While these may not be needed, but instead only needed if RUN_STATE_RUNNING
> > below (runstate_set() will be needed regardless):
> > 
> >     runstate_set(RUN_STATE_RUNNING);
> >     vm_state_notify(1, RUN_STATE_RUNNING);
> > 
> > So here's another of my attempt (this time also taking
> > !global_state_received() into account)...
> > 
> >     RunState run_state;
> > 
> >     if (migration_incoming_colo_enabled()) {
> >         migration_incoming_disable_colo();
> >     }
> > 
> >     if (!global_state_received()) {
> >         run_state = RUN_STATE_RUNNING;
> >     } else {
> >         run_state = global_state_get_runstate();
> >     }
> > 
> >     if (autostart) {
> >         /* Part of vm_prepare_start(), may isolate into a helper? */
> >         if (cpus_accel->synchronize_pre_resume) {
> >             cpus_accel->synchronize_pre_resume(step_pending);
> >         }
> >         qapi_event_send_resume();
> >         cpu_enable_ticks();
> >         /* Setup the runstate on src */
> >         runstate_set(run_state);
> >         if (run_state == RUN_STATE_RUNNING) {
> >             vm_state_notify(1, RUN_STATE_RUNNING);
> >         }
> >     } else {
> >         runstate_set(RUN_STATE_PAUSED);
> >     }
> > 
> > The whole idea is still to do whatever is needed here besides resuming the
> > vcpus, rather than leaving the whole vm_start() to system wakeup.  It then can
> > avoid touching the system wakeup code which seems cleaner.
> 
> The problem is that some actions cannot be performed at migration finish time,
> such as vm_state_notify RUN_STATE_RUNNING.  The wakeup code called later still 
> needs to know that vm_state_notify has not been called, and call it.

Sorry, very late reply..

Can we just call vm_state_notify() earlier?

I know I'm not familiar with this code at all.. but I'm still not fully
convinced a new global is needed for this.  IMHO QEMU becomes harder to
maintain with such globals, while already tons exist.

> 
> I just posted a new series with a cleaner wakeup, but it still uses a global.

I think the new version does not apply anymore to master.  Do you have a
tree somewhere?

Thanks,
Steven Sistare Aug. 14, 2023, 6:53 p.m. UTC | #10
On 7/26/2023 4:18 PM, Peter Xu wrote:
> On Fri, Jun 30, 2023 at 09:50:41AM -0400, Steven Sistare wrote:
>> On 6/26/2023 2:27 PM, Peter Xu wrote:
>>> On Fri, Jun 23, 2023 at 02:25:05PM -0400, Steven Sistare wrote:
>>>> On 6/21/2023 4:28 PM, Peter Xu wrote:
>>>>> On Wed, Jun 21, 2023 at 03:15:42PM -0400, Steven Sistare wrote:
>>>>>> On 6/20/2023 5:46 PM, Peter Xu wrote:
>>>>>>> On Thu, Jun 15, 2023 at 01:26:39PM -0700, Steve Sistare wrote:
>>>>>>>> Migration of a guest in the suspended state is broken.  The incoming
>>>>>>>> migration code automatically tries to wake the guest, which IMO is
>>>>>>>> wrong -- the guest should end migration in the same state it started.
>>>>>>>> Further, the wakeup is done by calling qemu_system_wakeup_request(), which
>>>>>>>> bypasses vm_start().  The guest appears to be in the running state, but
>>>>>>>> it is not.
>>>>>>>>
>>>>>>>> To fix, leave the guest in the suspended state, but call
>>>>>>>> qemu_system_start_on_wakeup_request() so the guest is properly resumed
>>>>>>>> later, when the client sends a system_wakeup command.
>>>>>>>>
>>>>>>>> Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
>>>>>>>> ---
>>>>>>>>  migration/migration.c | 11 ++++-------
>>>>>>>>  softmmu/runstate.c    |  1 +
>>>>>>>>  2 files changed, 5 insertions(+), 7 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>>>>>> index 17b4b47..851fe6d 100644
>>>>>>>> --- a/migration/migration.c
>>>>>>>> +++ b/migration/migration.c
>>>>>>>> @@ -496,6 +496,10 @@ static void process_incoming_migration_bh(void *opaque)
>>>>>>>>          vm_start();
>>>>>>>>      } else {
>>>>>>>>          runstate_set(global_state_get_runstate());
>>>>>>>> +        if (runstate_check(RUN_STATE_SUSPENDED)) {
>>>>>>>> +            /* Force vm_start to be called later. */
>>>>>>>> +            qemu_system_start_on_wakeup_request();
>>>>>>>> +        }
>>>>>>>
>>>>>>> Is this really needed, along with patch 1?
>>>>>>>
>>>>>>> I have a very limited knowledge on suspension, so I'm prone to making
>>>>>>> mistakes..
>>>>>>>
>>>>>>> But from what I read this, qemu_system_wakeup_request() (existing one, not
>>>>>>> after patch 1 applied) will setup wakeup_reason and kick the main thread
>>>>>>> using qemu_notify_event().  Then IIUC the e.g. vcpu wakeups will be done in
>>>>>>> the main thread later on after qemu_wakeup_requested() returns true.
>>>>>>
>>>>>> Correct, here:
>>>>>>
>>>>>>     if (qemu_wakeup_requested()) {
>>>>>>         pause_all_vcpus();
>>>>>>         qemu_system_wakeup();
>>>>>>         notifier_list_notify(&wakeup_notifiers, &wakeup_reason);
>>>>>>         wakeup_reason = QEMU_WAKEUP_REASON_NONE;
>>>>>>         resume_all_vcpus();
>>>>>>         qapi_event_send_wakeup();
>>>>>>     }
>>>>>>
>>>>>> However, that is not sufficient, because vm_start() was never called on the incoming
>>>>>> side.  vm_start calls the vm state notifiers for RUN_STATE_RUNNING, among other things.
>>>>>>
>>>>>>
>>>>>> Without my fixes, it "works" because the outgoing migration automatically wakes a suspended
>>>>>> guest, which sets the state to running, which is saved in global state:
>>>>>>
>>>>>>     void migration_completion(MigrationState *s)
>>>>>>         qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
>>>>>>         global_state_store()
>>>>>>
>>>>>> Then the incoming migration calls vm_start here:
>>>>>>
>>>>>>     migration/migration.c
>>>>>>         if (!global_state_received() ||
>>>>>>             global_state_get_runstate() == RUN_STATE_RUNNING) {
>>>>>>             if (autostart) {
>>>>>>                 vm_start();
>>>>>>
>>>>>> vm_start must be called for correctness.
>>>>>
>>>>> I see.  Though I had a feeling that this is still not the right way to do it,
>>>>> at least not as clean.
>>>>>
>>>>> One question is, would the above work for postcopy when the VM is suspended during
>>>>> the switchover?
>>>>
>>>> Good catch, that is broken.
>>>> I added qemu_system_start_on_wakeup_request to loadvm_postcopy_handle_run_bh
>>>> and now it works.
>>>>
>>>>     if (global_state_get_runstate() == RUN_STATE_RUNNING) {
>>>>         if (autostart) {
>>>>             vm_start();
>>>>         } else {
>>>>             runstate_set(RUN_STATE_PAUSED);
>>>>         }
>>>>     } else {
>>>>         runstate_set(global_state_get_runstate());
>>>>         if (runstate_check(RUN_STATE_SUSPENDED)) {
>>>>             qemu_system_start_on_wakeup_request();
>>>>         }
>>>>     }
>>>>
>>>>> I think I see your point that vm_start() (mostly vm_prepare_start())
>>>>> contains a bunch of operations that maybe we must have before starting the
>>>>> VM, but then.. should we just make that vm_start() unconditional when
>>>>> loading VM completes?  I just don't see anything that won't need it (besides
>>>>> -S), even COLO.
>>>>>
>>>>> So I'm wondering about something like this:
>>>>>
>>>>> ===8<===
>>>>> --- a/migration/migration.c
>>>>> +++ b/migration/migration.c
>>>>> @@ -481,19 +481,28 @@ static void process_incoming_migration_bh(void *opaque)
>>>>>  
>>>>>      dirty_bitmap_mig_before_vm_start();
>>>>>  
>>>>> -    if (!global_state_received() ||
>>>>> -        global_state_get_runstate() == RUN_STATE_RUNNING) {
>>>>> -        if (autostart) {
>>>>> -            vm_start();
>>>>> -        } else {
>>>>> -            runstate_set(RUN_STATE_PAUSED);
>>>>> -        }
>>>>> -    } else if (migration_incoming_colo_enabled()) {
>>>>> +    if (migration_incoming_colo_enabled()) {
>>>>>          migration_incoming_disable_colo();
>>>>> +        /* COLO should always have autostart=1 or we can enforce it here */
>>>>> +    }
>>>>> +
>>>>> +    if (autostart) {
>>>>> +        RunState run_state = global_state_get_runstate();
>>>>>          vm_start();
>>>>
>>>> This will resume the guest for a brief time, against the user's wishes.  Not OK IMO.
>>>
>>> Ah okay..
>>>
>>> Can we do whatever we need in vm_prepare_start(), then? I assume these
>>> chunks are needed:
>>>
>>>     /*
>>>      * WHPX accelerator needs to know whether we are going to step
>>>      * any CPUs, before starting the first one.
>>>      */
>>>     if (cpus_accel->synchronize_pre_resume) {
>>>         cpus_accel->synchronize_pre_resume(step_pending);
>>>     }
>>>
>>>     /* We are sending this now, but the CPUs will be resumed shortly later */
>>>     qapi_event_send_resume();
>>>
>>>     cpu_enable_ticks();
>>>
>>> While these may not be needed, but instead only needed if RUN_STATE_RUNNING
>>> below (runstate_set() will be needed regardless):
>>>
>>>     runstate_set(RUN_STATE_RUNNING);
>>>     vm_state_notify(1, RUN_STATE_RUNNING);
>>>
>>> So here's another of my attempt (this time also taking
>>> !global_state_received() into account)...
>>>
>>>     RunState run_state;
>>>
>>>     if (migration_incoming_colo_enabled()) {
>>>         migration_incoming_disable_colo();
>>>     }
>>>
>>>     if (!global_state_received()) {
>>>         run_state = RUN_STATE_RUNNING;
>>>     } else {
>>>         run_state = global_state_get_runstate();
>>>     }
>>>
>>>     if (autostart) {
>>>         /* Part of vm_prepare_start(), may isolate into a helper? */
>>>         if (cpus_accel->synchronize_pre_resume) {
>>>             cpus_accel->synchronize_pre_resume(step_pending);
>>>         }
>>>         qapi_event_send_resume();
>>>         cpu_enable_ticks();
>>>         /* Setup the runstate on src */
>>>         runstate_set(run_state);
>>>         if (run_state == RUN_STATE_RUNNING) {
>>>             vm_state_notify(1, RUN_STATE_RUNNING);
>>>         }
>>>     } else {
>>>         runstate_set(RUN_STATE_PAUSED);
>>>     }
>>>
>>> The whole idea is still to do whatever is needed here besides resuming the
>>> vcpus, rather than leaving the whole vm_start() to system wakeup.  It then can
>>> avoid touching the system wakeup code which seems cleaner.
>>
>> The problem is that some actions cannot be performed at migration finish time,
>> such as vm_state_notify RUN_STATE_RUNNING.  The wakeup code called later still 
>> needs to know that vm_state_notify has not been called, and call it.
> 
> Sorry, very late reply..

And I just returned from vacation :)

> Can we just call vm_state_notify() earlier?

We cannot.  The guest is not running yet, and will not be until later.
We cannot call notifiers that perform actions that complete, or react to, 
the guest entering a running state.
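
For illustration, the sort of notifier at issue might look like this
hypothetical handler (the registration API and handler signature are QEMU's;
the device and its helpers are made up):

    /* hypothetical handler: it reacts to the guest actually running */
    static void mydev_vm_state_change(void *opaque, bool running, RunState state)
    {
        MyDevState *s = opaque;        /* made-up device state */

        if (running) {
            mydev_start_dma(s);        /* made-up helper */
        } else {
            mydev_stop_dma(s);
        }
    }

    /* registered at realize time: */
    qemu_add_vm_change_state_handler(mydev_vm_state_change, s);

Calling such a handler with running=true before the vcpus can actually run
is exactly the problem.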

> I know I'm not familiar with this code at all.. but I'm still not fully
> convinced a new global is needed for this.  IMHO QEMU becomes harder to
> maintain with such globals, while already tons exist.
> 
>> I just posted a new series with a cleaner wakeup, but it still uses a global.

See "[PATCH V2 01/10] vl: start on wakeup request" in the new series. 
The transitions of the global are trivial and sensible:

    vm_start()
        vm_started = true;

    do_vm_stop()
        vm_started = false;

current_run_state is already a global, confined to runstate.c.  
vm_started is a new qualifier on the state, and hence is also global.

If runstate were a field in MachineState, then I would add vm_started to
MachineState, but MachineState is not used in runstate.c.
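
As a minimal sketch of that qualifier (following the description above, not
the exact patch text):

    static bool vm_started;    /* true between vm_start() and do_vm_stop() */

    void vm_start(void)
    {
        if (!vm_prepare_start(false)) {   /* notifiers, ticks, RESUME event */
            resume_all_vcpus();
        }
        vm_started = true;
    }

    /* ... and do_vm_stop() correspondingly sets vm_started = false */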

> I think the new version does not apply anymore to master.  Do you have a
> tree somewhere?

I do not, but I will rebase and post V3 in a moment.

- Steve
Peter Xu Aug. 14, 2023, 7:37 p.m. UTC | #11
On Mon, Aug 14, 2023 at 02:53:56PM -0400, Steven Sistare wrote:
> > Can we just call vm_state_notify() earlier?
> 
> We cannot.  The guest is not running yet, and will not be until later.
> We cannot call notifiers that perform actions that complete, or react to, 
> the guest entering a running state.

I tried to look at a few examples of the notifees, and most of the ones I
read do not react to "vcpu running" but to "vm running" (in which case I
think "suspended" mode falls into the "vm running" case); most of them won't
care about the RunState parameter passed in, only the bool "running".

In reality, when running=true, it must be RUNNING so far.

In that case does it mean we should notify right after the switchover,
since after migration the vm is indeed running only if the vcpus are not
during suspend?

One example (of a possible issue) is vfio_vmstate_change(), where iiuc if we
try to suspend a VM it should stay in VFIO_DEVICE_STATE_RUNNING for that
device; this kind of proves to me that SUSPEND is actually one of the
running=true states.

If we postpone all notifiers here even after we switched over to dest qemu
to the next upcoming suspend wakeup, I think it means these devices will
not be in VFIO_DEVICE_STATE_RUNNING after switchover but perhaps
VFIO_DEVICE_STATE_STOP.

Ideally I think we should here call vm_state_notify() with running=true and
state=SUSPEND, but since I do see some hooks are not well prepared for
SUSPEND over running=true, I'd think we should on the safe side call
vm_state_notify(running=true, state=RUNNING) even for SUSPEND at switch
over phase.  With that IIUC it'll naturally work (e.g. when wakeup again
later we just need to call no notifiers).
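
In rough code terms, the suggestion reads (a sketch reusing the names from
the earlier attempt, not a tested patch):

    RunState run_state;

    if (!global_state_received()) {
        run_state = RUN_STATE_RUNNING;
    } else {
        run_state = global_state_get_runstate();
    }

    if (run_state == RUN_STATE_RUNNING || run_state == RUN_STATE_SUSPENDED) {
        /* hooks aren't ready for (running=true, SUSPENDED), so pass RUNNING */
        vm_state_notify(1, RUN_STATE_RUNNING);
    }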
Steven Sistare Aug. 16, 2023, 5:48 p.m. UTC | #12
On 8/14/2023 3:37 PM, Peter Xu wrote:
> On Mon, Aug 14, 2023 at 02:53:56PM -0400, Steven Sistare wrote:
>>> Can we just call vm_state_notify() earlier?
>>
>> We cannot.  The guest is not running yet, and will not be until later.
>> We cannot call notifiers that perform actions that complete, or react to, 
>> the guest entering a running state.
> 
> I tried to look at a few examples of the notifees, and most of the ones I
> read do not react to "vcpu running" but to "vm running" (in which case I
> think "suspended" mode falls into the "vm running" case); most of them won't
> care about the RunState parameter passed in, only the bool "running".
> 
> In reality, when running=true, it must be RUNNING so far.
> 
> In that case does it mean we should notify right after the switchover,
> since after migration the vm is indeed running only if the vcpus are not
> during suspend?

I cannot parse your question, but maybe this answers it.
If the outgoing VM is running and not suspended, then the incoming side
tests for autostart==true and calls vm_start, which calls the notifiers,
right after the switchover.

> One example (of a possible issue) is vfio_vmstate_change(), where iiuc if we
> try to suspend a VM it should stay in VFIO_DEVICE_STATE_RUNNING for that
> device; this kind of proves to me that SUSPEND is actually one of the
> running=true states.
> 
> If we postpone all notifiers here even after we switched over to dest qemu
> to the next upcoming suspend wakeup, I think it means these devices will
> not be in VFIO_DEVICE_STATE_RUNNING after switchover but perhaps
> VFIO_DEVICE_STATE_STOP.

or VFIO_DEVICE_STATE_RESUMING, which is set in vfio_load_setup.
AFAIK it is OK to remain in that state until wakeup is called later.

> Ideally I think we should here call vm_state_notify() with running=true and
> state=SUSPEND, but since I do see some hooks are not well prepared for
> SUSPEND over running=true, I'd think we should on the safe side call
> vm_state_notify(running=true, state=RUNNING) even for SUSPEND at switch
> over phase.  With that IIUC it'll naturally work (e.g. when wakeup again
> later we just need to call no notifiers).

Notifiers are just one piece, all the code in vm_prepare_start must be called.
Is it correct to call all of that long before we actually resume the CPUs in
wakeup?  I don't know, but what is the point?  The wakeup code still needs
modification to conditionally resume the vcpus.  The scheme would be roughly:

    loadvm_postcopy_handle_run_bh()
        runstate = global_state_get_runstate();
        if (runstate == RUN_STATE_RUNNING) {
            vm_start()
        } else if (runstate == RUN_STATE_SUSPENDED) {
            vm_prepare_start();   // the start of vm_start()
        }

    qemu_system_wakeup_request()
        if (some condition)
            resume_all_vcpus();   // the remainder of vm_start()
        else
            runstate_set(RUN_STATE_RUNNING)

How is that better than my patches
    [PATCH V3 01/10] vl: start on wakeup request
    [PATCH V3 02/10] migration: preserve suspended runstate

    loadvm_postcopy_handle_run_bh()
        runstate = global_state_get_runstate();
        if (runstate == RUN_STATE_RUNNING)
            vm_start()
        else
            runstate_set(runstate);    // eg RUN_STATE_SUSPENDED

    qemu_system_wakeup_request()
        if (!vm_started)
            vm_start();
        else
            runstate_set(RUN_STATE_RUNNING);

Recall this thread started with your comment "It then can avoid touching the 
system wakeup code which seems cleaner".  We still need to touch the wakeup
code.

- Steve
Peter Xu Aug. 17, 2023, 6:19 p.m. UTC | #13
On Wed, Aug 16, 2023 at 01:48:13PM -0400, Steven Sistare wrote:
> On 8/14/2023 3:37 PM, Peter Xu wrote:
> > On Mon, Aug 14, 2023 at 02:53:56PM -0400, Steven Sistare wrote:
> >>> Can we just call vm_state_notify() earlier?
> >>
> >> We cannot.  The guest is not running yet, and will not be until later.
> >> We cannot call notifiers that perform actions that complete, or react to, 
> >> the guest entering a running state.
> > 
> > I tried to look at a few examples of the notifees, and most of the ones I
> > read do not react to "vcpu running" but to "vm running" (in which case I
> > think "suspended" mode falls into the "vm running" case); most of them won't
> > care about the RunState parameter passed in, only the bool "running".
> > 
> > In reality, when running=true, it must be RUNNING so far.
> > 
> > In that case does it mean we should notify right after the switchover,
> > since after migration the vm is indeed running only if the vcpus are not
> > during suspend?
> 
> I cannot parse your question, but maybe this answers it.
> If the outgoing VM is running and not suspended, then the incoming side
> tests for autostart==true and calls vm_start, which calls the notifiers,
> right after the switchover.

I meant that IMHO SUSPENDED should be seen as a "vm running" case, just like
RUNNING.  Then we should invoke vm_prepare_start(); it just needs some touch
ups.

> 
> > One example (of a possible issue) is vfio_vmstate_change(), where iiuc if we
> > try to suspend a VM it should stay in VFIO_DEVICE_STATE_RUNNING for that
> > device; this kind of proves to me that SUSPEND is actually one of the
> > running=true states.
> > 
> > If we postpone all notifiers here even after we switched over to dest qemu
> > to the next upcoming suspend wakeup, I think it means these devices will
> > not be in VFIO_DEVICE_STATE_RUNNING after switchover but perhaps
> > VFIO_DEVICE_STATE_STOP.
> 
> or VFIO_DEVICE_STATE_RESUMING, which is set in vfio_load_setup.
> AFAIK it is OK to remain in that state until wakeup is called later.

So let me provide another clue of why I think we should call
vm_prepare_start()..

Firstly, I think the RESUME event should always be sent right after we
switch over, suspended or not.  I just noticed that your test
case would work because you put "wakeup" before RESUME.  I'm not sure
whether that's correct.  I'd bet people could rely on that RESUME to
identify the switchover.

More importantly, I'm wondering whether RTC should still be running during
the suspended mode?  Sorry again if my knowledge over there is just
limited, so correct me otherwise - but my understanding is during suspend
mode (s1 or s3, frankly I can't tell which one this belongs..), rtc should
still be running along with the system clock.  It means we _should_ at
least call cpu_enable_ticks() to enable rtc:

/*
 * enable cpu_get_ticks()
 * Caller must hold BQL which serves as mutex for vm_clock_seqlock.
 */
void cpu_enable_ticks(void)

I think that'll enable cpu_get_tsc() and make it start to work right.
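
In code terms, on the incoming side that would mean something like (sketch;
where exactly this belongs is part of the question):

    /* destination, when the source was suspended */
    runstate_set(RUN_STATE_SUSPENDED);
    cpu_enable_ticks();   /* RTC and cpu_get_ticks() keep advancing, as
                           * they did on the (suspended) source */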

> 
> > Ideally I think we should here call vm_state_notify() with running=true and
> > state=SUSPEND, but since I do see some hooks are not well prepared for
> > SUSPEND over running=true, I'd think we should on the safe side call
> > vm_state_notify(running=true, state=RUNNING) even for SUSPEND at switch
> > over phase.  With that IIUC it'll naturally work (e.g. when wakeup again
> > later we just need to call no notifiers).
> 
> Notifiers are just one piece, all the code in vm_prepare_start must be called.
> Is it correct to call all of that long before we actually resume the CPUs in
> wakeup?  I don't know, but what is the point?

The point is not only for cleanness (again, I really, really don't like that
new global.. sorry), but also now I think we should make the vm running.

> The wakeup code still needs
> modification to conditionally resume the vcpus.  The scheme would be roughly:
> 
>     loadvm_postcopy_handle_run_bh()
>         runstate = global_state_get_runstate();
>         if (runstate == RUN_STATE_RUNNING) {
>             vm_start()
>         } else if (runstate == RUN_STATE_SUSPENDED) {
>             vm_prepare_start();   // the start of vm_start()
>         }
> 
>     qemu_system_wakeup_request()
>         if (some condition)
>             resume_all_vcpus();   // the remainder of vm_start()
>         else
>             runstate_set(RUN_STATE_RUNNING)

No it doesn't.  wakeup_reason is set there, and the main loop does the resuming.
See:

    if (qemu_wakeup_requested()) {
        pause_all_vcpus();
        qemu_system_wakeup();
        notifier_list_notify(&wakeup_notifiers, &wakeup_reason);
        wakeup_reason = QEMU_WAKEUP_REASON_NONE;
        resume_all_vcpus();
        qapi_event_send_wakeup();
    }

> 
> How is that better than my patches
>     [PATCH V3 01/10] vl: start on wakeup request
>     [PATCH V3 02/10] migration: preserve suspended runstate
> 
>     loadvm_postcopy_handle_run_bh()
>         runstate = global_state_get_runstate();
>         if (runstate == RUN_STATE_RUNNING)
>             vm_start()
>         else
>             runstate_set(runstate);    // eg RUN_STATE_SUSPENDED
> 
>     qemu_system_wakeup_request()
>         if (!vm_started)
>             vm_start();
>         else
>             runstate_set(RUN_STATE_RUNNING);
> 
> Recall this thread started with your comment "It then can avoid touching the 
> system wakeup code which seems cleaner".  We still need to touch the wakeup
> code.

Let me provide some code and reply to your new patchset inline.

Thanks.
Steven Sistare Aug. 24, 2023, 8:53 p.m. UTC | #14
On 8/17/2023 2:19 PM, Peter Xu wrote:
> On Wed, Aug 16, 2023 at 01:48:13PM -0400, Steven Sistare wrote:
>> On 8/14/2023 3:37 PM, Peter Xu wrote:
>>> On Mon, Aug 14, 2023 at 02:53:56PM -0400, Steven Sistare wrote:
>>>>> Can we just call vm_state_notify() earlier?
>>>>
>>>> We cannot.  The guest is not running yet, and will not be until later.
>>>> We cannot call notifiers that perform actions that complete, or react to, 
>>>> the guest entering a running state.
>>>
>>> I tried to look at a few examples of the notifees, and most of the ones I
>>> read do not react to "vcpu running" but to "vm running" (in which case I
>>> think "suspended" mode falls into the "vm running" case); most of them won't
>>> care about the RunState parameter passed in, only the bool "running".
>>>
>>> In reality, when running=true, it must be RUNNING so far.
>>>
>>> In that case does it mean we should notify right after the switchover,
>>> since after migration the vm is indeed running only if the vcpus are not
>>> during suspend?
>>
>> I cannot parse your question, but maybe this answers it.
>> If the outgoing VM is running and not suspended, then the incoming side
>> tests for autostart==true and calls vm_start, which calls the notifiers,
>> right after the switchover.
> 
> I meant that IMHO SUSPENDED should be seen as a "vm running" case, just like
> RUNNING.  Then we should invoke vm_prepare_start(); it just needs some touch
> ups.
> 
>>
>>> One example (of a possible issue) is vfio_vmstate_change(), where iiuc if we
>>> try to suspend a VM it should stay in VFIO_DEVICE_STATE_RUNNING for that
>>> device; this kind of proves to me that SUSPEND is actually one of the
>>> running=true states.
>>>
>>> If we postpone all notifiers here even after we switched over to dest qemu
>>> to the next upcoming suspend wakeup, I think it means these devices will
>>> not be in VFIO_DEVICE_STATE_RUNNING after switchover but perhaps
>>> VFIO_DEVICE_STATE_STOP.
>>
>> or VFIO_DEVICE_STATE_RESUMING, which is set in vfio_load_setup.
>> AFAIK it is OK to remain in that state until wakeup is called later.
> 
> So let me provide another clue of why I think we should call
> vm_prepare_start()..
> 
> Firstly, I think the RESUME event should always be sent right after we
> switch over, suspended or not.  I just noticed that your test
> case would work because you put "wakeup" before RESUME.  I'm not sure
> whether that's correct.  I'd bet people could rely on that RESUME to
> identify the switchover.

Yes, possibly.

> More importantly, I'm wondering whether RTC should still be running during
> the suspended mode?  Sorry again if my knowledge over there is just
> limited, so correct me otherwise - but my understanding is during suspend
> mode (s1 or s3, frankly I can't tell which one this belongs..), rtc should
> still be running along with the system clock.  It means we _should_ at
> least call cpu_enable_ticks() to enable rtc:

Agreed.  The comment above cpu_get_ticks says:
  * return the time elapsed in VM between vm_start and vm_stop
Suspending a guest does not call vm_stop, so ticks keep running.
I also verified that experimentally.

> /*
>  * enable cpu_get_ticks()
>  * Caller must hold BQL which serves as mutex for vm_clock_seqlock.
>  */
> void cpu_enable_ticks(void)
> 
> I think that'll enable cpu_get_tsc() and make it start to work right.
> 
>>
>>> Ideally I think we should here call vm_state_notify() with running=true and
>>> state=SUSPEND, but since I do see some hooks are not well prepared for
>>> SUSPEND over running=true, I'd think we should on the safe side call
>>> vm_state_notify(running=true, state=RUNNING) even for SUSPEND at switch
>>> over phase.  With that IIUC it'll naturally work (e.g. when wakeup again
>>> later we just need to call no notifiers).
>>
>> Notifiers are just one piece, all the code in vm_prepare_start must be called.
>> Is it correct to call all of that long before we actually resume the CPUs in
>> wakeup?  I don't know, but what is the point?
> 
> The point is not only for cleanness (again, I really, really don't like that
> new global.. sorry), but also now I think we should make the vm running.

I do believe it is safe to call vm_prepare_start immediately, since the vcpus
are never running when it is called.

>> The wakeup code still needs
>> modification to conditionally resume the vcpus.  The scheme would be roughly:
>>
>>     loadvm_postcopy_handle_run_bh()
>>         runstate = global_state_get_runstate();
>>         if (runstate == RUN_STATE_RUNNING) {
>>             vm_start()
>>         } else if (runstate == RUN_STATE_SUSPENDED) {
>>             vm_prepare_start();   // the start of vm_start()
>>         }
>>
>>     qemu_system_wakeup_request()
>>         if (some condition)
>>             resume_all_vcpus();   // the remainder of vm_start()
>>         else
>>             runstate_set(RUN_STATE_RUNNING)
> 
> No it doesn't.  wakeup_reason is set there, and the main loop does the resuming.
> See:
> 
>     if (qemu_wakeup_requested()) {
>         pause_all_vcpus();
>         qemu_system_wakeup();
>         notifier_list_notify(&wakeup_notifiers, &wakeup_reason);
>         wakeup_reason = QEMU_WAKEUP_REASON_NONE;
>         resume_all_vcpus();
>         qapi_event_send_wakeup();
>     }

Agreed, we can rely on that main loop code to call resume_all_vcpus, and not
modify qemu_system_wakeup_request.  That is cleaner. 
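
Putting the two pieces together, the destination flow could read roughly
(a sketch, not the posted patches):

    /* loadvm_postcopy_handle_run_bh(), sketch */
    RunState runstate = global_state_get_runstate();

    if (runstate == RUN_STATE_RUNNING) {
        vm_start();
    } else if (runstate == RUN_STATE_SUSPENDED) {
        vm_prepare_start(false);    /* notifiers, ticks, RESUME event */
        /* touch-up: vm_prepare_start() itself sets RUNNING */
        runstate_set(RUN_STATE_SUSPENDED);
        /* no resume_all_vcpus() here; on wakeup the main loop's
         * qemu_wakeup_requested() path resumes the vcpus */
    }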

>> How is that better than my patches
>>     [PATCH V3 01/10] vl: start on wakeup request
>>     [PATCH V3 02/10] migration: preserve suspended runstate
>>
>>     loadvm_postcopy_handle_run_bh()
>>         runstate = global_state_get_runstate();
>>         if (runstate == RUN_STATE_RUNNING)
>>             vm_start()
>>         else
>>             runstate_set(runstate);    // eg RUN_STATE_SUSPENDED
>>
>>     qemu_system_wakeup_request()
>>         if (!vm_started)
>>             vm_start();
>>         else
>>             runstate_set(RUN_STATE_RUNNING);
>>
>> Recall this thread started with your comment "It then can avoid touching the 
>> system wakeup code which seems cleaner".  We still need to touch the wakeup
>> code.
> 
> Let me provide some code and reply to your new patchset inline.

Thank you, I have more comments there.

- Steve
diff mbox series

Patch

diff --git a/migration/migration.c b/migration/migration.c
index 17b4b47..851fe6d 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -496,6 +496,10 @@  static void process_incoming_migration_bh(void *opaque)
         vm_start();
     } else {
         runstate_set(global_state_get_runstate());
+        if (runstate_check(RUN_STATE_SUSPENDED)) {
+            /* Force vm_start to be called later. */
+            qemu_system_start_on_wakeup_request();
+        }
     }
     /*
      * This must happen after any state changes since as soon as an external
@@ -2101,7 +2105,6 @@  static int postcopy_start(MigrationState *ms)
     qemu_mutex_lock_iothread();
     trace_postcopy_start_set_run();
 
-    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
     global_state_store();
     ret = vm_stop_force_state(RUN_STATE_FINISH_MIGRATE);
     if (ret < 0) {
@@ -2307,7 +2310,6 @@  static void migration_completion(MigrationState *s)
     if (s->state == MIGRATION_STATUS_ACTIVE) {
         qemu_mutex_lock_iothread();
         s->downtime_start = qemu_clock_get_ms(QEMU_CLOCK_REALTIME);
-        qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
 
         s->vm_old_state = runstate_get();
         global_state_store();
@@ -3102,11 +3104,6 @@  static void *bg_migration_thread(void *opaque)
 
     qemu_mutex_lock_iothread();
 
-    /*
-     * If VM is currently in suspended state, then, to make a valid runstate
-     * transition in vm_stop_force_state() we need to wakeup it up.
-     */
-    qemu_system_wakeup_request(QEMU_WAKEUP_REASON_OTHER, NULL);
     s->vm_old_state = runstate_get();
 
     global_state_store();
diff --git a/softmmu/runstate.c b/softmmu/runstate.c
index e127b21..771896c 100644
--- a/softmmu/runstate.c
+++ b/softmmu/runstate.c
@@ -159,6 +159,7 @@  static const RunStateTransition runstate_transitions_def[] = {
     { RUN_STATE_RUNNING, RUN_STATE_SUSPENDED },
     { RUN_STATE_SUSPENDED, RUN_STATE_RUNNING },
     { RUN_STATE_SUSPENDED, RUN_STATE_FINISH_MIGRATE },
+    { RUN_STATE_SUSPENDED, RUN_STATE_PAUSED },
     { RUN_STATE_SUSPENDED, RUN_STATE_PRELAUNCH },
     { RUN_STATE_SUSPENDED, RUN_STATE_COLO},