[6/6] migration/colo.c: Move colo_notify_compares_event to the right place

Message ID d4555dd5146a54518c4d9d4efd996b7c745c6687.1589193382.git.lukasstraub2@web.de
State New
Series colo: migration related bugfixes

Commit Message

Lukas Straub May 11, 2020, 11:11 a.m. UTC
If the secondary has to fail over during checkpointing, it is still
in the old state (i.e. a different state than the primary). Thus we
can't expose the primary state until after the checkpoint is sent.

This fixes sporadic resets of client connections during failover.

Signed-off-by: Lukas Straub <lukasstraub2@web.de>
---
 migration/colo.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Comments

Zhanghailiang May 14, 2020, 1:27 p.m. UTC | #1
Cc: Zhang Chen <chen.zhang@intel.com>

> 
> If the secondary has to failover during checkpointing, it still is in the old state
> (i.e. different state than primary). Thus we can't expose the primary state
> until after the checkpoint is sent.
> 

Hmm, do you mean we should not flush the network packets to the client connection until the
checkpointing process has almost succeeded, because it may fail during checkpointing?

Lukas Straub May 14, 2020, 2:31 p.m. UTC | #2
On Thu, 14 May 2020 13:27:30 +0000
Zhanghailiang <zhang.zhanghailiang@huawei.com> wrote:

> Cc: Zhang Chen <chen.zhang@intel.com>
> 
> > 
> > If the secondary has to failover during checkpointing, it still is in the old state
> > (i.e. different state than primary). Thus we can't expose the primary state
> > until after the checkpoint is sent.
> >   
> 
> Hmm, do you mean we should not flush the net packages to client connection until checkpointing
> Process almost success because it may fail during checkpointing ?

No.
If the primary fails or crashes during checkpointing, the secondary is still in a different state than the primary because it didn't receive the full checkpoint. We can release the miscompared packets only after both primary and secondary are in the same state.

Example:
1. Client opens a TCP connection, sends SYN.
2. Primary accepts the connection with SYN-ACK, but due to nondeterministic execution the secondary is delayed.
3. Checkpoint happens, primary releases the SYN-ACK packet but then crashes while sending the checkpoint.
4. The secondary fails over. At this point it is still in the old state, where it hasn't sent the SYN-ACK packet.
5. The client responds with ACK to the SYN-ACK packet.
6. Because it doesn't know the connection, the secondary responds with RST and the connection is reset.
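
The six steps above can be condensed into a toy C model. This is only a sketch under stated assumptions, not QEMU code: the names ReleaseOrder, ClientOutcome and primary_crashes_mid_checkpoint are hypothetical, and the model assumes, as in step 6, that a secondary with no record of the connection answers the client's ACK with RST.

```c
#include <assert.h>
#include <stdbool.h>

/* When the primary releases the held (miscompared) packets. Hypothetical names. */
typedef enum {
    RELEASE_BEFORE_CHECKPOINT_SENT, /* ordering before this patch */
    RELEASE_AFTER_CHECKPOINT_SENT   /* ordering after this patch */
} ReleaseOrder;

typedef enum {
    CLIENT_NO_RESET, /* client never saw SYN-ACK, so it sends no ACK */
    CLIENT_GOT_RST   /* client's ACK reached a secondary without that connection */
} ClientOutcome;

/* Models a primary crash while the checkpoint is still being sent. */
ClientOutcome primary_crashes_mid_checkpoint(ReleaseOrder order)
{
    assert(order == RELEASE_BEFORE_CHECKPOINT_SENT ||
           order == RELEASE_AFTER_CHECKPOINT_SENT);

    /* Step 4: the checkpoint never arrived in full, so the secondary
     * fails over in its old state and does not know the connection. */
    const bool secondary_knows_connection = false;

    /* Step 3: only with the old ordering has the SYN-ACK already
     * reached the client by the time the primary crashes. */
    const bool client_saw_syn_ack = (order == RELEASE_BEFORE_CHECKPOINT_SENT);

    /* Steps 5 and 6: an ACK for an unknown connection is answered with RST. */
    if (client_saw_syn_ack && !secondary_knows_connection) {
        return CLIENT_GOT_RST;
    }
    return CLIENT_NO_RESET;
}
```

In this model, calling primary_crashes_mid_checkpoint(RELEASE_BEFORE_CHECKPOINT_SENT) yields CLIENT_GOT_RST, while RELEASE_AFTER_CHECKPOINT_SENT yields CLIENT_NO_RESET.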

Regards,
Lukas Straub

Zhanghailiang May 15, 2020, 1:45 a.m. UTC | #3
> -----Original Message-----
> From: Lukas Straub [mailto:lukasstraub2@web.de]
> Sent: Thursday, May 14, 2020 10:31 PM
> To: Zhanghailiang <zhang.zhanghailiang@huawei.com>
> Cc: qemu-devel <qemu-devel@nongnu.org>; Zhang Chen
> <chen.zhang@intel.com>; Juan Quintela <quintela@redhat.com>; Dr. David
> Alan Gilbert <dgilbert@redhat.com>
> Subject: Re: [PATCH 6/6] migration/colo.c: Move
> colo_notify_compares_event to the right place
> 
> On Thu, 14 May 2020 13:27:30 +0000
> Zhanghailiang <zhang.zhanghailiang@huawei.com> wrote:
> 
> > Cc: Zhang Chen <chen.zhang@intel.com>
> >
> > >
> > > If the secondary has to failover during checkpointing, it still is
> > > in the old state (i.e. different state than primary). Thus we can't
> > > expose the primary state until after the checkpoint is sent.
> > >
> >
> > Hmm, do you mean we should not flush the net packages to client
> > connection until checkpointing Process almost success because it may fail
> > during checkpointing ?
> 
> No.
> If the primary fails/crashes during checkpointing, the secondary is still in
> different state than the primary because it didn't receive the full checkpoint.
> We can release the miscompared packets only after both primary and
> secondary are in the same state.
> 
> Example:
> 1. Client opens a TCP connection, sends SYN.
> 2. Primary accepts the connection with SYN-ACK, but due to
> nondeterministic execution the secondary is delayed.
> 3. Checkpoint happens, primary releases the SYN-ACK packet but then
> crashes while sending the checkpoint.
> 4. The Secondary fails over. At this point it is still in the old state where it
> hasn't sent the SYN-ACK packet.
> 5. The client responds with ACK to the SYN-ACK packet.
> 6. Because it doesn't know the connection, the secondary responds with RST,
> connection reset.
> 

Good example. This patch is OK; I will add my Reviewed-by to your original patch.


> Regards,
> Lukas Straub
> 
Zhanghailiang May 15, 2020, 1:53 a.m. UTC | #4
Reviewed-by: zhanghailiang <zhang.zhanghailiang@huawei.com>

> -----Original Message-----
> From: Lukas Straub [mailto:lukasstraub2@web.de]
> Sent: Monday, May 11, 2020 7:11 PM
> To: qemu-devel <qemu-devel@nongnu.org>
> Cc: Zhanghailiang <zhang.zhanghailiang@huawei.com>; Juan Quintela
> <quintela@redhat.com>; Dr. David Alan Gilbert <dgilbert@redhat.com>
> Subject: [PATCH 6/6] migration/colo.c: Move colo_notify_compares_event
> to the right place
> 
Patch

diff --git a/migration/colo.c b/migration/colo.c
index a69782efc5..a3fc21e86e 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -430,12 +430,6 @@  static int colo_do_checkpoint_transaction(MigrationState *s,
         goto out;
     }
 
-    qemu_event_reset(&s->colo_checkpoint_event);
-    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT, &local_err);
-    if (local_err) {
-        goto out;
-    }
-
     /* Disable block migration */
     migrate_set_block_enabled(false, &local_err);
     qemu_mutex_lock_iothread();
@@ -494,6 +488,12 @@  static int colo_do_checkpoint_transaction(MigrationState *s,
         goto out;
     }
 
+    qemu_event_reset(&s->colo_checkpoint_event);
+    colo_notify_compares_event(NULL, COLO_EVENT_CHECKPOINT, &local_err);
+    if (local_err) {
+        goto out;
+    }
+
     colo_receive_check_message(s->rp_state.from_dst_file,
                        COLO_MESSAGE_VMSTATE_LOADED, &local_err);
     if (local_err) {
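
The effect of the reordering can be checked with a small, self-contained C sketch. It is an abstraction under stated assumptions, not the real function: the Step enum and order_is_failover_safe are hypothetical names, the three steps stand in for the much longer body of colo_do_checkpoint_transaction, and STEP_NOTIFY_COMPARE is assumed (following the discussion in this thread) to be the point where held packets are released to the client.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Abstract steps of one checkpoint transaction (hypothetical names). */
typedef enum {
    STEP_SEND_CHECKPOINT, /* primary sends the checkpoint to the secondary */
    STEP_NOTIFY_COMPARE,  /* COLO_EVENT_CHECKPOINT: held packets are released */
    STEP_WAIT_LOADED      /* wait for COLO_MESSAGE_VMSTATE_LOADED */
} Step;

/*
 * Returns true if there is no crash point at which packets have been
 * released although the checkpoint has not been sent yet.
 */
bool order_is_failover_safe(const Step *steps, size_t n)
{
    assert(steps != NULL || n == 0);

    /* Let the primary crash after completing the first k steps. */
    for (size_t k = 0; k <= n; k++) {
        bool sent = false;
        bool released = false;

        for (size_t i = 0; i < k; i++) {
            if (steps[i] == STEP_SEND_CHECKPOINT) {
                sent = true;
            } else if (steps[i] == STEP_NOTIFY_COMPARE) {
                released = true;
            }
        }
        /* The client saw primary state that the secondary never received. */
        if (released && !sent) {
            return false;
        }
    }
    return true;
}
```

With the pre-patch order (notify, send, wait) the function returns false, because a crash right after the notify step leaves released packets without a sent checkpoint. With the post-patch order (send, notify, wait) it returns true.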