[0/3] migration: Downtime tracepoints

Message ID 20231026155337.596281-1-peterx@redhat.com

Message

Peter Xu Oct. 26, 2023, 3:53 p.m. UTC
This small series (actually only the last patch; the first two are cleanups)
aims to improve QEMU's downtime analysis, similar to what Joao previously
proposed here:

  https://lore.kernel.org/r/20230926161841.98464-1-joao.m.martins@oracle.com

But with a few differences:

  - Nothing exported to QAPI yet, all tracepoints so far

  - Instead of major checkpoints (stop, iterable, non-iterable, resume-rp),
    finer granularity: downtime measurements for each vmstate (I made
    microseconds the unit, for accuracy).  So far it seems iterable /
    non-iterable is the core of the problem, and I want to nail it down
    per-device.

  - Trace dest QEMU too

For the last bullet: consider the case where a device's save() can be super
fast, while its load() can actually be super slow.  Both of them will
contribute to the ultimate downtime, but not as a simple sum: while src
QEMU is save()ing device1, dst QEMU can be load()ing device2, so they can
run in parallel.  The only way to account for all components of the
downtime is to record both sides.
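
To illustrate the shape of it (a rough sketch only; the tracepoint names
below are illustrative rather than the exact ones in patch 3, and the
snippet is loosely modeled on the existing save/load loops in
migration/savevm.c):

    /* migration/trace-events (illustrative):
     *   vmstate_downtime_save(const char *type, const char *idstr,
     *                         uint32_t instance_id, int64_t downtime_us)
     *   vmstate_downtime_load(const char *type, const char *idstr,
     *                         uint32_t instance_id, int64_t downtime_us)
     */

    /* src side, inside the existing per-device save loop: */
    int64_t start_us = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
    ret = vmstate_save(f, se, vmdesc);
    trace_vmstate_downtime_save("non-iterable", se->idstr, se->instance_id,
                                qemu_clock_get_us(QEMU_CLOCK_REALTIME) - start_us);

    /* dst side, inside the existing per-device load loop: */
    start_us = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
    ret = vmstate_load(f, se);
    trace_vmstate_downtime_load("non-iterable", se->idstr, se->instance_id,
                                qemu_clock_get_us(QEMU_CLOCK_REALTIME) - start_us);

With both sides emitting these, the per-device numbers from src and dst
traces can be lined up to see where the real downtime went.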

Please have a look, thanks.

Peter Xu (3):
  migration: Set downtime_start even for postcopy
  migration: Add migration_downtime_start|end() helpers
  migration: Add per vmstate downtime tracepoints

 migration/migration.c  | 38 +++++++++++++++++++++-----------
 migration/savevm.c     | 49 ++++++++++++++++++++++++++++++++++++++----
 migration/trace-events |  2 ++
 3 files changed, 72 insertions(+), 17 deletions(-)

Comments

Joao Martins Oct. 26, 2023, 4:06 p.m. UTC | #1
On 26/10/2023 16:53, Peter Xu wrote:
> This small series (actually only the last patch; first two are cleanups)
> wants to improve ability of QEMU downtime analysis similarly to what Joao
> used to propose here:
> 
>   https://lore.kernel.org/r/20230926161841.98464-1-joao.m.martins@oracle.com
> 
Thanks for following up on the idea; it's been hard to have enough bandwidth for
everything over the past few weeks :(

> But with a few differences:
> 
>   - Nothing exported yet to qapi, all tracepoints so far
> 
>   - Instead of major checkpoints (stop, iterable, non-iterable, resume-rp),
>     finer granule by providing downtime measurements for each vmstate (I
>     made microsecond to be the unit to be accurate).  So far it seems
>     iterable / non-iterable is the core of the problem, and I want to nail
>     it to per-device.
> 
>   - Trace dest QEMU too
> 
> For the last bullet: consider the case where a device save() can be super
> fast, while load() can actually be super slow.  Both of them will
> contribute to the ultimate downtime, but not a simple summary: when src
> QEMU is save()ing on device1, dst QEMU can be load()ing on device2.  So
> they can run in parallel.  However the only way to figure all components of
> the downtime is to record both.
> 
> Please have a look, thanks.
>

I like your series, as it allows a user to pinpoint one particular bad device,
while covering the load side too. The migration checkpoints, on the other hand,
were useful -- while also a bit ugly -- for the sort of big picture of how
downtime breaks down. Perhaps we could add those /also/ as tracepoints without
specifically committing to exposing them in QAPI.

More fundamentally, how can one capture the 'stop' part? There's also time spent
there, e.g. quiescing/stopping vhost-net workers, or suspending the VF
device. All of it likely as bad as the device-state/RAM related stuff those
tracepoints cover (the iterable and non-iterable portions).


> Peter Xu (3):
>   migration: Set downtime_start even for postcopy
>   migration: Add migration_downtime_start|end() helpers
>   migration: Add per vmstate downtime tracepoints
> 
>  migration/migration.c  | 38 +++++++++++++++++++++-----------
>  migration/savevm.c     | 49 ++++++++++++++++++++++++++++++++++++++----
>  migration/trace-events |  2 ++
>  3 files changed, 72 insertions(+), 17 deletions(-)
>
Peter Xu Oct. 26, 2023, 5:03 p.m. UTC | #2
On Thu, Oct 26, 2023 at 05:06:37PM +0100, Joao Martins wrote:
> On 26/10/2023 16:53, Peter Xu wrote:
> > This small series (actually only the last patch; first two are cleanups)
> > wants to improve ability of QEMU downtime analysis similarly to what Joao
> > used to propose here:
> > 
> >   https://lore.kernel.org/r/20230926161841.98464-1-joao.m.martins@oracle.com
> > 
> Thanks for following up on the idea; It's been hard to have enough bandwidth for
> everything on the past set of weeks :(

Yeah, totally understood.  I think our QE team pushed me towards a
series like this, while my plan was to wait for your new version. :)

Then when I started, I decided to go per-device.  I was also thinking of
persisting that information, but then I remembered some ppc guests can
have ~40,000 vmstates..  and the memory to maintain that may or may not regress
a ppc user.  So I figured I should keep it simple with tracepoints first.

> 
> > But with a few differences:
> > 
> >   - Nothing exported yet to qapi, all tracepoints so far
> > 
> >   - Instead of major checkpoints (stop, iterable, non-iterable, resume-rp),
> >     finer granule by providing downtime measurements for each vmstate (I
> >     made microsecond to be the unit to be accurate).  So far it seems
> >     iterable / non-iterable is the core of the problem, and I want to nail
> >     it to per-device.
> > 
> >   - Trace dest QEMU too
> > 
> > For the last bullet: consider the case where a device save() can be super
> > fast, while load() can actually be super slow.  Both of them will
> > contribute to the ultimate downtime, but not a simple summary: when src
> > QEMU is save()ing on device1, dst QEMU can be load()ing on device2.  So
> > they can run in parallel.  However the only way to figure all components of
> > the downtime is to record both.
> > 
> > Please have a look, thanks.
> >
> 
> I like your series, as it allows a user to pinpoint one particular bad device,
> while covering the load side too. The checkpoints of migration on the other hand
> were useful -- while also a bit ugly -- for the sort of big picture of how
> downtime breaks down. Perhaps we could add that /also/ as tracepoitns without
> specifically commiting to be exposed in QAPI.
> 
> More fundamentally, how can one capture the 'stop' part? There's also time spent
> there like e.g. quiescing/stopping vhost-net workers, or suspending the VF
> device. All likely as bad to those tracepoints pertaining device-state/ram
> related stuff (iterable and non-iterable portions).

Yeah, that's a good point.  I didn't cover "stop" yet because I think it's
just trickier and I haven't thought it all through yet.

The first question is, when stopping some backends, the vCPUs are still
running, so it's not 100% clear to me which of that should be counted as
part of the real downtime.

Meanwhile, that's another angle besides vmstates: we'd need to keep an eye
on the state change handlers, and those can be a device, or something else.

Did you measure the stop process in some way before?  Do you have some
rough number or anything surprising you already observed?

Thanks,
Peter Xu Oct. 26, 2023, 6:18 p.m. UTC | #3
On Thu, Oct 26, 2023 at 01:03:57PM -0400, Peter Xu wrote:
> On Thu, Oct 26, 2023 at 05:06:37PM +0100, Joao Martins wrote:
> > On 26/10/2023 16:53, Peter Xu wrote:
> > > This small series (actually only the last patch; first two are cleanups)
> > > wants to improve ability of QEMU downtime analysis similarly to what Joao
> > > used to propose here:
> > > 
> > >   https://lore.kernel.org/r/20230926161841.98464-1-joao.m.martins@oracle.com
> > > 
> > Thanks for following up on the idea; It's been hard to have enough bandwidth for
> > everything on the past set of weeks :(
> 
> Yeah, totally understdood.  I think our QE team pushed me towards some
> series like this, while my plan was waiting for your new version. :)
> 
> Then when I started I decided to go into per-device.  I was thinking of
> also persist that information, but then I remembered some ppc guest can
> have ~40,000 vmstates..  and memory to maintain that may or may not regress
> a ppc user.  So I figured I should first keep it simple with tracepoints.
> 
> > 
> > > But with a few differences:
> > > 
> > >   - Nothing exported yet to qapi, all tracepoints so far
> > > 
> > >   - Instead of major checkpoints (stop, iterable, non-iterable, resume-rp),
> > >     finer granule by providing downtime measurements for each vmstate (I
> > >     made microsecond to be the unit to be accurate).  So far it seems
> > >     iterable / non-iterable is the core of the problem, and I want to nail
> > >     it to per-device.
> > > 
> > >   - Trace dest QEMU too
> > > 
> > > For the last bullet: consider the case where a device save() can be super
> > > fast, while load() can actually be super slow.  Both of them will
> > > contribute to the ultimate downtime, but not a simple summary: when src
> > > QEMU is save()ing on device1, dst QEMU can be load()ing on device2.  So
> > > they can run in parallel.  However the only way to figure all components of
> > > the downtime is to record both.
> > > 
> > > Please have a look, thanks.
> > >
> > 
> > I like your series, as it allows a user to pinpoint one particular bad device,
> > while covering the load side too. The checkpoints of migration on the other hand
> > were useful -- while also a bit ugly -- for the sort of big picture of how
> > downtime breaks down. Perhaps we could add that /also/ as tracepoitns without
> > specifically commiting to be exposed in QAPI.
> > 
> > More fundamentally, how can one capture the 'stop' part? There's also time spent
> > there like e.g. quiescing/stopping vhost-net workers, or suspending the VF
> > device. All likely as bad to those tracepoints pertaining device-state/ram
> > related stuff (iterable and non-iterable portions).
> 
> Yeah that's a good point.  I didn't cover "stop" yet because I think it's
> just more tricky and I didn't think it all through, yet.
> 
> The first question is, when stopping some backends, the vCPUs are still
> running, so it's not 100% clear to me on which should be contributed as
> part of real downtime.

I was wrong.. we always stop vcpus first.

If you don't mind, I can add some tracepoints for all those spots in this
series to cover your other series.  I'll also make sure I do that for both
sides.

Thanks,

> 
> Meanwhile that'll be another angle besides vmstates: need to keep some eye
> on the state change handlers, and that can be a device, or something else.
> 
> Did you measure the stop process in some way before?  Do you have some
> rough number or anything surprising you already observed?
> 
> Thanks,
> 
> -- 
> Peter Xu
Joao Martins Oct. 26, 2023, 7:33 p.m. UTC | #4
On 26/10/2023 19:18, Peter Xu wrote:
> On Thu, Oct 26, 2023 at 01:03:57PM -0400, Peter Xu wrote:
>> On Thu, Oct 26, 2023 at 05:06:37PM +0100, Joao Martins wrote:
>>> On 26/10/2023 16:53, Peter Xu wrote:
>>>> This small series (actually only the last patch; first two are cleanups)
>>>> wants to improve ability of QEMU downtime analysis similarly to what Joao
>>>> used to propose here:
>>>>
>>>>   https://lore.kernel.org/r/20230926161841.98464-1-joao.m.martins@oracle.com
>>>>
>>> Thanks for following up on the idea; It's been hard to have enough bandwidth for
>>> everything on the past set of weeks :(
>>
>> Yeah, totally understdood.  I think our QE team pushed me towards some
>> series like this, while my plan was waiting for your new version. :)
>>

On my end, it was similar (though not driven by QE/QA), with folks drawing a blank
when they see a bigger downtime.

Having an explainer/breakdown makes it much easier to poke at where the
problems are.

>> Then when I started I decided to go into per-device.  I was thinking of
>> also persist that information, but then I remembered some ppc guest can
>> have ~40,000 vmstates..  and memory to maintain that may or may not regress
>> a ppc user.  So I figured I should first keep it simple with tracepoints.
>>

Yeah, I should have removed that last patch for QAPI.

The vmstates part is something I didn't quite like the look of, but I think you
managed to square away a relatively clean way to do it in that last patch.

>>>
>>>> But with a few differences:
>>>>
>>>>   - Nothing exported yet to qapi, all tracepoints so far
>>>>
>>>>   - Instead of major checkpoints (stop, iterable, non-iterable, resume-rp),
>>>>     finer granule by providing downtime measurements for each vmstate (I
>>>>     made microsecond to be the unit to be accurate).  So far it seems
>>>>     iterable / non-iterable is the core of the problem, and I want to nail
>>>>     it to per-device.
>>>>
>>>>   - Trace dest QEMU too
>>>>
>>>> For the last bullet: consider the case where a device save() can be super
>>>> fast, while load() can actually be super slow.  Both of them will
>>>> contribute to the ultimate downtime, but not a simple summary: when src
>>>> QEMU is save()ing on device1, dst QEMU can be load()ing on device2.  So
>>>> they can run in parallel.  However the only way to figure all components of
>>>> the downtime is to record both.
>>>>
>>>> Please have a look, thanks.
>>>>
>>>
>>> I like your series, as it allows a user to pinpoint one particular bad device,
>>> while covering the load side too. The checkpoints of migration on the other hand
>>> were useful -- while also a bit ugly -- for the sort of big picture of how
>>> downtime breaks down. Perhaps we could add that /also/ as tracepoitns without
>>> specifically commiting to be exposed in QAPI.
>>>
>>> More fundamentally, how can one capture the 'stop' part? There's also time spent
>>> there like e.g. quiescing/stopping vhost-net workers, or suspending the VF
>>> device. All likely as bad to those tracepoints pertaining device-state/ram
>>> related stuff (iterable and non-iterable portions).
>>
>> Yeah that's a good point.  I didn't cover "stop" yet because I think it's
>> just more tricky and I didn't think it all through, yet.
>>

It could follow your previous line of thought where you do it per vmstate.

But the catch is that VM state change handlers are nameless, so tracepoints
wouldn't be able to tell which handler each chunk of time is being spent in.
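
Roughly like this (a sketch only, simplified from what runstate.c really
does; trace_vm_state_notify_cb() is a made-up event): the best one could
report per handler is its callback pointer:

    VMChangeStateEntry *e;

    QTAILQ_FOREACH(e, &vm_change_state_head, entries) {
        int64_t start_us = qemu_clock_get_us(QEMU_CLOCK_REALTIME);
        e->cb(e->opaque, running, state);
        /* No name available -- only the callback pointer identifies it */
        trace_vm_state_notify_cb((void *)e->cb,
                                 qemu_clock_get_us(QEMU_CLOCK_REALTIME) - start_us);
    }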

>> The first question is, when stopping some backends, the vCPUs are still
>> running, so it's not 100% clear to me on which should be contributed as
>> part of real downtime.
> 
> I was wrong.. we always stop vcpus first.
> 

I was about to say this, but I guess you figured it out. Even if the vCPUs weren't
stopped first, the external I/O threads (QEMU or kernel) wouldn't service the
guest's own I/O, which is a portion of the outage.

> If you won't mind, I can add some traceopints for all those spots in this
> series to cover your other series.  I'll also make sure I do that for both
> sides.
> 
Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
considering it started from the other patches (if you also agree that is right). The
patches are of course entirely different, but I like to believe the ideas
initially presented and then subsequently improved are what led to the downtime
observability improvements in this series.

	Joao
Peter Xu Oct. 26, 2023, 8:07 p.m. UTC | #5
On Thu, Oct 26, 2023 at 08:33:13PM +0100, Joao Martins wrote:
> Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
> considering it started on the other patches (if you also agree it is right). The
> patches ofc are enterily different, but at least I like to believe the ideas
> initially presented and then subsequently improved are what lead to the downtime
> observability improvements in this series.

Sure, I'll add that.

If you like, I would definitely be happy to have a Co-developed-by: with you,
if you agree.  I just don't know whether this addresses all your needs, and
I need a patch like this for our builds.

Thanks,
Joao Martins Oct. 27, 2023, 8:58 a.m. UTC | #6
On 26/10/2023 21:07, Peter Xu wrote:
> On Thu, Oct 26, 2023 at 08:33:13PM +0100, Joao Martins wrote:
>> Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
>> considering it started on the other patches (if you also agree it is right). The
>> patches ofc are enterily different, but at least I like to believe the ideas
>> initially presented and then subsequently improved are what lead to the downtime
>> observability improvements in this series.
> 
> Sure, I'll add that.
> 
> If you like, I would be definitely happy to have Co-developed-by: with you,
> if you agree. 

Oh, that's great, thanks!

> I just don't know whether that addressed all your need, and
> I need some patch like that for our builds.

I think it achieves the same as the other series. Or rather, it re-implements it
with less compromise on QAPI and makes the tracepoints more 'generic' for
other use cases and less specific to the 'checkpoint breakdown'. That makes the
implementation simpler (e.g. we don't need that array storing the checkpoint
timestamps) given that it's just tracing and not for QAPI.

Though it puts more work on developing new tracing tooling for users, I
think it's a good start towards downtime breakdown "clarity" without trading
off maintainability.

	Joao
Peter Xu Oct. 27, 2023, 2:41 p.m. UTC | #7
On Fri, Oct 27, 2023 at 09:58:03AM +0100, Joao Martins wrote:
> On 26/10/2023 21:07, Peter Xu wrote:
> > On Thu, Oct 26, 2023 at 08:33:13PM +0100, Joao Martins wrote:
> >> Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
> >> considering it started on the other patches (if you also agree it is right). The
> >> patches ofc are enterily different, but at least I like to believe the ideas
> >> initially presented and then subsequently improved are what lead to the downtime
> >> observability improvements in this series.
> > 
> > Sure, I'll add that.
> > 
> > If you like, I would be definitely happy to have Co-developed-by: with you,
> > if you agree. 
> 
> Oh, that's great, thanks!

Great!  I apologize for not asking before the formal patch was posted.

> 
> > I just don't know whether that addressed all your need, and
> > I need some patch like that for our builds.
> 
> I think it achieves the same as the other series. Or rather it re-implements it
> but with less compromise on QAPI and made the tracepoints more 'generic' to even
> other usecases and less specific to the 'checkpoint breakdown'. Which makes the
> implementation simpler (like we don't need that array storing the checkpoint
> timestamps) given that it's just tracing and not for QAPI.

Yes.  Please also feel free to have a closer look at the exact checkpoints
in that patch.  I just want to make sure it can serve the same purpose as the
patch you proposed, but with tracepoints, and that I don't miss
anything important.

The dest checkpoints are all new; I hope I nailed them all right, as we
would expect.

For src checkpoints, IIRC your patch explicitly tracked return-path closing,
while patch 4 only makes it part of the final enclosure; the 4th checkpoint
is after the non-iterable data is sent, and the 5th is the last "downtime-end".
That span covers more than "return path close":

    qemu_savevm_state_complete_precopy_non_iterable <--- 4th checkpoint
    qemu_fflush (after non-iterable sent)
    close_return_path_on_source
    cpu_throttle_stop
    qemu_mutex_lock_iothread
    migration_downtime_end                          <---- 5th checkpoint

If you think it fit or necessary, I can, for example, also add one more
checkpoint right after closing the return path.  I just didn't know whether it was
your intention to explicitly track that point.  Just let me know if so.

Also, on whether you prefer to keep a timestamp in the tracepoint itself:
I only use either "log" or "dtrace"+qemu-trace-stap for tracing, and both of
them contain timestamps already.  But I can also add the timestamps
(offset by downtime_start) if you prefer.

I plan to repost by early next week (want to catch the train for 8.2,
if it works out), so it'll be great to collect all your feedback and amend
things before the repost.

> 
> Though while it puts more work over developing new tracing tooling for users, I
> think it's a good start towards downtime breakdown "clearity" without trading
> off maintenance.

Thanks,
Joao Martins Oct. 27, 2023, 10:17 p.m. UTC | #8
On 27/10/2023 15:41, Peter Xu wrote:
> On Fri, Oct 27, 2023 at 09:58:03AM +0100, Joao Martins wrote:
>> On 26/10/2023 21:07, Peter Xu wrote:
>>> On Thu, Oct 26, 2023 at 08:33:13PM +0100, Joao Martins wrote:
>>>> Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
>>>> considering it started on the other patches (if you also agree it is right). The
>>>> patches ofc are enterily different, but at least I like to believe the ideas
>>>> initially presented and then subsequently improved are what lead to the downtime
>>>> observability improvements in this series.
>>>
>>> Sure, I'll add that.
>>>
>>> If you like, I would be definitely happy to have Co-developed-by: with you,
>>> if you agree. 
>>
>> Oh, that's great, thanks!
> 
> Great!  I apologize on not asking already before a formal patch is post.
> 
>>
>>> I just don't know whether that addressed all your need, and
>>> I need some patch like that for our builds.
>>
>> I think it achieves the same as the other series. Or rather it re-implements it
>> but with less compromise on QAPI and made the tracepoints more 'generic' to even
>> other usecases and less specific to the 'checkpoint breakdown'. Which makes the
>> implementation simpler (like we don't need that array storing the checkpoint
>> timestamps) given that it's just tracing and not for QAPI.
> 
> Yes.  Please also feel free to have a closer look on the exact checkpoints
> in that patch.  I just want to make sure that'll be able to service the
> same as the patch you proposed, but with tracepoints, and I don't miss
> anything important.
> 
> The dest checkpoints are all new, I hope I nailed them all right as we
> would expect.
> 
Yeah, it looks like it; but I am no master of postcopy, so I don't quite have the
surgical eye there.

Perhaps on the loadvm side, 'precopy-bh-enter' is a bit ambiguous (given
that precopy happens on both source and destination?). Perhaps be explicit with
'incoming-bh-enter'? Or even prefix checkpoint names with 'source' on the source
side, for example, to distinguish them?

> For src checkpoints, IIRC your patch explicitly tracked return path closing
> while patch 4 only made it just part of final enclosure; the 4th checkpoint
> is after non-iterable sent, until 5th to be the last "downtime-end". It can
> cover more than "return path close":
> 
>     qemu_savevm_state_complete_precopy_non_iterable <--- 4th checkpoint
>     qemu_fflush (after non-iterable sent)
>     close_return_path_on_source
>     cpu_throttle_stop
>     qemu_mutex_lock_iothread
>     migration_downtime_end                          <---- 5th checkpoint
> 
> If you see fit or necessary, I can, for example, also add one more
> checkpoint right after close return path.  I just didn't know whether it's
> your intention to explicitly check that point.  Just let me know if so.
> 
It was put there because it was a simple catch-all, from the source PoV, of how
much time the destination is taking. Of course, bearing in mind that without a
return path such a metric/tracepoint is not available. On the other hand, with
migration_downtime_end I think we have the equivalent in that resume checkpoint.
So we just need to take the diff between that and the non-iterable checkpoint, and
I think we have the same things I was looking for (even more, I would say).

> Also on whether you prefer to keep a timestamp in the tracepoint itself;
> I only use either "log" or "dtrace"+qemu-trace-stap for tracing: both of
> them contain timestamps already.  But I can also add the timestamps
> (offseted by downtime_start) if you prefer.
> 
Perhaps it would be easy to wrap the checkpoint tracepoint in its own function to
allow extending it later, e.g. adding the timestamp (or any other data) into
the checkpoints, or doing something at a particular checkpoint of the migration.
The timing function from QEMU would give QEMU's own idea of time (directly
correlatable with the downtime metric that applications consume), which would be
more consistent? Though I am of two minds on whether to rely on tracing
timestamps or align with what's provided to applications.
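
Something along these lines, purely as a sketch (migration_trace_checkpoint()
and trace_migration_checkpoint() are made-up names, not from the series):

    /*
     * Funnel every checkpoint through one helper so extra data -- e.g.
     * QEMU's own microsecond timestamp, consistent with the downtime
     * metric -- can be attached in a single place later.
     */
    static void migration_trace_checkpoint(const char *checkpoint)
    {
        trace_migration_checkpoint(checkpoint,
                                   qemu_clock_get_us(QEMU_CLOCK_REALTIME));
    }

    /* e.g. migration_trace_checkpoint("src-downtime-start"); */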

> I plan to repost before early next week (want to catch the train for 8.2,
> if it works out), so it'll be great to collect all your feedback and amend
> that before the repost.
> 
>>
>> Though while it puts more work over developing new tracing tooling for users, I
>> think it's a good start towards downtime breakdown "clearity" without trading
>> off maintenance.
> 
> Thanks,
>
Peter Xu Oct. 30, 2023, 3:13 p.m. UTC | #9
On Fri, Oct 27, 2023 at 11:17:41PM +0100, Joao Martins wrote:
> On 27/10/2023 15:41, Peter Xu wrote:
> > On Fri, Oct 27, 2023 at 09:58:03AM +0100, Joao Martins wrote:
> >> On 26/10/2023 21:07, Peter Xu wrote:
> >>> On Thu, Oct 26, 2023 at 08:33:13PM +0100, Joao Martins wrote:
> >>>> Sure. For the fourth patch, feel free to add Suggested-by and/or a Link,
> >>>> considering it started on the other patches (if you also agree it is right). The
> >>>> patches ofc are enterily different, but at least I like to believe the ideas
> >>>> initially presented and then subsequently improved are what lead to the downtime
> >>>> observability improvements in this series.
> >>>
> >>> Sure, I'll add that.
> >>>
> >>> If you like, I would be definitely happy to have Co-developed-by: with you,
> >>> if you agree. 
> >>
> >> Oh, that's great, thanks!
> > 
> > Great!  I apologize on not asking already before a formal patch is post.
> > 
> >>
> >>> I just don't know whether that addressed all your need, and
> >>> I need some patch like that for our builds.
> >>
> >> I think it achieves the same as the other series. Or rather it re-implements it
> >> but with less compromise on QAPI and made the tracepoints more 'generic' to even
> >> other usecases and less specific to the 'checkpoint breakdown'. Which makes the
> >> implementation simpler (like we don't need that array storing the checkpoint
> >> timestamps) given that it's just tracing and not for QAPI.
> > 
> > Yes.  Please also feel free to have a closer look on the exact checkpoints
> > in that patch.  I just want to make sure that'll be able to service the
> > same as the patch you proposed, but with tracepoints, and I don't miss
> > anything important.
> > 
> > The dest checkpoints are all new, I hope I nailed them all right as we
> > would expect.
> > 
> Yeah, it looks like so; but I am no master of postcopy so I don't have quite the
> cirurgical eye there.
> 
> Perhaps for the loadvm side, 'precopy-bh-enter' suggests some ambiguity (given
> that precopy happens on both source and destination?). Perhaps being explicit in
> either incoming-bh-enter? or even prefix checkpoint names by 'source' on source
> side for example to distinguish?

I don't think src has a bottom half, if we don't consider cleanup_bh as part
of the migration at all.  The destination's BH is a major part of the migration.

In any case, let me prefix them with "src"/"dst" then; hopefully that's even clearer.

> 
> > For src checkpoints, IIRC your patch explicitly tracked return path closing
> > while patch 4 only made it just part of final enclosure; the 4th checkpoint
> > is after non-iterable sent, until 5th to be the last "downtime-end". It can
> > cover more than "return path close":
> > 
> >     qemu_savevm_state_complete_precopy_non_iterable <--- 4th checkpoint
> >     qemu_fflush (after non-iterable sent)
> >     close_return_path_on_source
> >     cpu_throttle_stop
> >     qemu_mutex_lock_iothread
> >     migration_downtime_end                          <---- 5th checkpoint
> > 
> > If you see fit or necessary, I can, for example, also add one more
> > checkpoint right after close return path.  I just didn't know whether it's
> > your intention to explicitly check that point.  Just let me know if so.
> > 
> It was put there because it was a simple catch-all from the source PoV on how
> much the destination is taking. Of course bearing in mind that without
> return-path then such metric/tracepoint is not available. On the other hand, with
> migration_downtime_end I think we have the equivalent in that resume checkpoint.
> So we just need to make the diff between that and the non-iterable and I think
> we have the same things as I was looking for (even more I would say).
> 
> > Also on whether you prefer to keep a timestamp in the tracepoint itself;
> > I only use either "log" or "dtrace"+qemu-trace-stap for tracing: both of
> > them contain timestamps already.  But I can also add the timestamps
> > (offseted by downtime_start) if you prefer.
> > 
> Perhaps it is easy to wrap the checkpoint tracepoint in its own function to
> allow extension of something else e.g. add the timestamp (or any other data into
> the checkpoints) or do something in a particular checkpoint of the migration.
> The timing function from qemu would give qemu own idea of time (directly
> correlable with the downtime metric that applications consume) which would be
> more consistent? though I am at too minds on this whether to rely on tracing
> stamps or align with what's provided to applications.

Yes it should be more consistent.  Let me add them into the checkpoints.

Thanks,
Peter Xu Oct. 30, 2023, 4:09 p.m. UTC | #10
On Mon, Oct 30, 2023 at 11:13:55AM -0400, Peter Xu wrote:
> > Perhaps it is easy to wrap the checkpoint tracepoint in its own function to
> > allow extension of something else e.g. add the timestamp (or any other data into
> > the checkpoints) or do something in a particular checkpoint of the migration.
> > The timing function from qemu would give qemu own idea of time (directly
> > correlable with the downtime metric that applications consume) which would be
> > more consistent? though I am at too minds on this whether to rely on tracing
> > stamps or align with what's provided to applications.
> 
> Yes it should be more consistent.  Let me add them into the checkpoints.

I just noticed dst QEMU doesn't have a downtime_start to offset against..  Only
offsetting the src downtime checkpoints would make it less consistent.  I'll
leave this one for later.  Posting a new version soon.
Joao Martins Oct. 30, 2023, 4:11 p.m. UTC | #11
On 30/10/2023 16:09, Peter Xu wrote:
> On Mon, Oct 30, 2023 at 11:13:55AM -0400, Peter Xu wrote:
>>> Perhaps it is easy to wrap the checkpoint tracepoint in its own function to
>>> allow extension of something else e.g. add the timestamp (or any other data into
>>> the checkpoints) or do something in a particular checkpoint of the migration.
>>> The timing function from qemu would give qemu own idea of time (directly
>>> correlable with the downtime metric that applications consume) which would be
>>> more consistent? though I am at too minds on this whether to rely on tracing
>>> stamps or align with what's provided to applications.
>>
>> Yes it should be more consistent.  Let me add them into the checkpoints.
> 
> I just noticed dst qemu doesn't have downtime_start to offset..  Only
> offset the src downtime checkpoints will make it less consistent.  I'll
> leave this one for later.  Posting a new version soon.
> 
Argh, yes.

Somehow I was thinking of the source only -- let's leave it for later then.