diff mbox series

[4/4] ppc: ensure we update the decrementer value during migration

Message ID 1505054255-2990-5-git-send-email-mark.cave-ayland@ilande.co.uk
State New
Headers show
Series ppc: migration fixes for TCG | expand

Commit Message

Mark Cave-Ayland Sept. 10, 2017, 2:37 p.m. UTC
During local testing with TCG, intermittent errors were found when trying to
migrate Darwin OS images.

The underlying cause was that Darwin resets the decrementer value to fairly
small values on each interrupt. cpu_ppc_set_tb_clk() sets the default value
of the decrementer to 0xffffffff during initialisation which typically
corresponds to several seconds. Hence when restoring the image, the guest
would effectively "lose" decrementer interrupts during this time causing
confusion in the guest.

Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
---
 target/ppc/machine.c |    2 ++
 1 file changed, 2 insertions(+)

Comments

David Gibson Sept. 13, 2017, 7:12 a.m. UTC | #1
On Sun, Sep 10, 2017 at 03:37:35PM +0100, Mark Cave-Ayland wrote:
> During local testing with TCG, intermittent errors were found when trying to
> migrate Darwin OS images.
> 
> The underlying cause was that Darwin resets the decrementer value to fairly
> small values on each interrupt. cpu_ppc_set_tb_clk() sets the default value
> of the decrementer to 0xffffffff during initialisation which typically
> corresponds to several seconds. Hence when restoring the image, the guest
> would effectively "lose" decrementer interrupts during this time causing
> confusion in the guest.
> 
> Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>

This is subtly incorrect.  It sets the DECR on load to exactly the
value that was saved.  That effectively means that the DECR is frozen
for the migration downtime, which probably isn't what we want.

Instead we need to save the DECR as an offset from the timebase, and
restore it relative to the (downtime adjusted) timebase on the
destination.

The complication with that is that the timebase is generally migrated
at the machine level, not the cpu level: the timebase is generally
synchronized between cpus in the machine, and migrating it at the
individual cpu could break that.  Which means we probably need a
machine level hook to handle the decrementer too, even though it
logically *is* per-cpu, because otherwise we'll be trying to restore
it before the timebase is restored.

> ---
>  target/ppc/machine.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/target/ppc/machine.c b/target/ppc/machine.c
> index 10b3c41..a16a856 100644
> --- a/target/ppc/machine.c
> +++ b/target/ppc/machine.c
> @@ -176,6 +176,7 @@ static void cpu_pre_save(void *opaque)
>      env->spr[SPR_CFAR] = env->cfar;
>  #endif
>      env->spr[SPR_BOOKE_SPEFSCR] = env->spe_fscr;
> +    env->spr[SPR_DECR] = cpu_ppc_load_decr(env);
>  
>      for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
>          env->spr[SPR_DBAT0U + 2*i] = env->DBAT[0][i];
> @@ -280,6 +281,7 @@ static int cpu_post_load(void *opaque, int version_id)
>      env->cfar = env->spr[SPR_CFAR];
>  #endif
>      env->spe_fscr = env->spr[SPR_BOOKE_SPEFSCR];
> +    cpu_ppc_store_decr(env, env->spr[SPR_DECR]);
>  
>      for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
>          env->DBAT[0][i] = env->spr[SPR_DBAT0U + 2*i];
Mark Cave-Ayland Sept. 13, 2017, 5:11 p.m. UTC | #2
On 13/09/17 08:12, David Gibson wrote:

> This is subtly incorrect.  It sets the DECR on load to exactly the
> value that was saved.  That effectively means that the DECR is frozen
> for the migration downtime, which probably isn't what we want.
> 
> Instead we need to save the DECR as an offset from the timebase, and
> restore it relative to the (downtime adjusted) timebase on the
> destination.
> 
> The complication with that is that the timebase is generally migrated
> at the machine level, not the cpu level: the timebase is generally
> synchronized between cpus in the machine, and migrating it at the
> individual cpu could break that.  Which means we probably need a
> machine level hook to handle the decrementer too, even though it
> logically *is* per-cpu, because otherwise we'll be trying to restore
> it before the timebase is restored.

I know that we discussed this in-depth last year, however I was working
along the lines that Laurent's patch fixed this along the lines of our
previous discussion:
https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg00338.html (and
indeed Laurent's analysis at
https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg01489.html).

However looking again at the this patch in the context you mentioned
above, I'm starting to wonder if the right solution now is for the
machine init function for g3beige/mac99 to do the same as spapr, e.g.
add cpu_ppc_clock_vm_state_change() as a vm_change_state_handler and
then add VMSTATE_PPC_TIMEBASE_V from the machine PPCTimeBase into my new
subsection?

Laurent, do you think that your state change handler would work
correctly in this way?


ATB,

Mark.
Laurent Vivier Sept. 13, 2017, 5:58 p.m. UTC | #3
On 13/09/2017 19:11, Mark Cave-Ayland wrote:
> On 13/09/17 08:12, David Gibson wrote:
> 
>> This is subtly incorrect.  It sets the DECR on load to exactly the
>> value that was saved.  That effectively means that the DECR is frozen
>> for the migration downtime, which probably isn't what we want.
>>
>> Instead we need to save the DECR as an offset from the timebase, and
>> restore it relative to the (downtime adjusted) timebase on the
>> destination.
>>
>> The complication with that is that the timebase is generally migrated
>> at the machine level, not the cpu level: the timebase is generally
>> synchronized between cpus in the machine, and migrating it at the
>> individual cpu could break that.  Which means we probably need a
>> machine level hook to handle the decrementer too, even though it
>> logically *is* per-cpu, because otherwise we'll be trying to restore
>> it before the timebase is restored.
> 
> I know that we discussed this in-depth last year, however I was working
> along the lines that Laurent's patch fixed this along the lines of our
> previous discussion:
> https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg00338.html (and
> indeed Laurent's analysis at
> https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg01489.html).
> 
> However looking again at the this patch in the context you mentioned
> above, I'm starting to wonder if the right solution now is for the
> machine init function for g3beige/mac99 to do the same as spapr, e.g.
> add cpu_ppc_clock_vm_state_change() as a vm_change_state_handler and
> then add VMSTATE_PPC_TIMEBASE_V from the machine PPCTimeBase into my new
> subsection?
> 
> Laurent, do you think that your state change handler would work
> correctly in this way?

I think all is explained in the second link you have mentioned, it seems 
we don't need a state handler as KVM DECR will no be updated by the migrated value:

hw/ppc/ppc.c

    736 static void __cpu_ppc_store_decr(PowerPCCPU *cpu, uint64_t *nextp,
    737                                  QEMUTimer *timer,
    738                                  void (*raise_excp)(void *),
    739                                  void (*lower_excp)(PowerPCCPU *),
    740                                  uint32_t decr, uint32_t value)
    741 {
...
    749     if (kvm_enabled()) {
    750         /* KVM handles decrementer exceptions, we don't need our own timer */
    751         return;
    752     }
...

But this allows to migrate it for TCG. And it should be correct because in case of TCG I think [need to check] timebase is stopped too (so offset is 0)

David, do you agree with that?

Thanks,
Laurent
David Gibson Sept. 14, 2017, 3:52 a.m. UTC | #4
On Wed, Sep 13, 2017 at 07:58:23PM +0200, Laurent Vivier wrote:
> On 13/09/2017 19:11, Mark Cave-Ayland wrote:
> > On 13/09/17 08:12, David Gibson wrote:
> > 
> >> This is subtly incorrect.  It sets the DECR on load to exactly the
> >> value that was saved.  That effectively means that the DECR is frozen
> >> for the migration downtime, which probably isn't what we want.
> >>
> >> Instead we need to save the DECR as an offset from the timebase, and
> >> restore it relative to the (downtime adjusted) timebase on the
> >> destination.
> >>
> >> The complication with that is that the timebase is generally migrated
> >> at the machine level, not the cpu level: the timebase is generally
> >> synchronized between cpus in the machine, and migrating it at the
> >> individual cpu could break that.  Which means we probably need a
> >> machine level hook to handle the decrementer too, even though it
> >> logically *is* per-cpu, because otherwise we'll be trying to restore
> >> it before the timebase is restored.
> > 
> > I know that we discussed this in-depth last year, however I was working
> > along the lines that Laurent's patch fixed this along the lines of our
> > previous discussion:
> > https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg00338.html (and
> > indeed Laurent's analysis at
> > https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg01489.html).
> > 
> > However looking again at the this patch in the context you mentioned
> > above, I'm starting to wonder if the right solution now is for the
> > machine init function for g3beige/mac99 to do the same as spapr, e.g.
> > add cpu_ppc_clock_vm_state_change() as a vm_change_state_handler and
> > then add VMSTATE_PPC_TIMEBASE_V from the machine PPCTimeBase into my new
> > subsection?
> > 
> > Laurent, do you think that your state change handler would work
> > correctly in this way?
> 
> I think all is explained in the second link you have mentioned, it seems 
> we don't need a state handler as KVM DECR will no be updated by the migrated value:
> 
> hw/ppc/ppc.c
> 
>     736 static void __cpu_ppc_store_decr(PowerPCCPU *cpu, uint64_t *nextp,
>     737                                  QEMUTimer *timer,
>     738                                  void (*raise_excp)(void *),
>     739                                  void (*lower_excp)(PowerPCCPU *),
>     740                                  uint32_t decr, uint32_t value)
>     741 {
> ...
>     749     if (kvm_enabled()) {
>     750         /* KVM handles decrementer exceptions, we don't need our own timer */
>     751         return;
>     752     }
> ...
> 
> But this allows to migrate it for TCG. And it should be correct because in case of TCG I think [need to check] timebase is stopped too (so offset is 0)
> 
> David, do you agree with that?

Yes, I think so.  Some details might be different, but the basic idea
of migrating the timebase and decrementers at machine level should be
the same for pseries and g3beige/whatever.
Mark Cave-Ayland Sept. 15, 2017, 12:45 p.m. UTC | #5
On 14/09/17 04:52, David Gibson wrote:

> On Wed, Sep 13, 2017 at 07:58:23PM +0200, Laurent Vivier wrote:
>> On 13/09/2017 19:11, Mark Cave-Ayland wrote:
>>> On 13/09/17 08:12, David Gibson wrote:
>>>
>>>> This is subtly incorrect.  It sets the DECR on load to exactly the
>>>> value that was saved.  That effectively means that the DECR is frozen
>>>> for the migration downtime, which probably isn't what we want.
>>>>
>>>> Instead we need to save the DECR as an offset from the timebase, and
>>>> restore it relative to the (downtime adjusted) timebase on the
>>>> destination.
>>>>
>>>> The complication with that is that the timebase is generally migrated
>>>> at the machine level, not the cpu level: the timebase is generally
>>>> synchronized between cpus in the machine, and migrating it at the
>>>> individual cpu could break that.  Which means we probably need a
>>>> machine level hook to handle the decrementer too, even though it
>>>> logically *is* per-cpu, because otherwise we'll be trying to restore
>>>> it before the timebase is restored.
>>>
>>> I know that we discussed this in-depth last year, however I was working
>>> along the lines that Laurent's patch fixed this along the lines of our
>>> previous discussion:
>>> https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg00338.html (and
>>> indeed Laurent's analysis at
>>> https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg01489.html).
>>>
>>> However looking again at the this patch in the context you mentioned
>>> above, I'm starting to wonder if the right solution now is for the
>>> machine init function for g3beige/mac99 to do the same as spapr, e.g.
>>> add cpu_ppc_clock_vm_state_change() as a vm_change_state_handler and
>>> then add VMSTATE_PPC_TIMEBASE_V from the machine PPCTimeBase into my new
>>> subsection?
>>>
>>> Laurent, do you think that your state change handler would work
>>> correctly in this way?
>>
>> I think all is explained in the second link you have mentioned, it seems 
>> we don't need a state handler as KVM DECR will no be updated by the migrated value:
>>
>> hw/ppc/ppc.c
>>
>>     736 static void __cpu_ppc_store_decr(PowerPCCPU *cpu, uint64_t *nextp,
>>     737                                  QEMUTimer *timer,
>>     738                                  void (*raise_excp)(void *),
>>     739                                  void (*lower_excp)(PowerPCCPU *),
>>     740                                  uint32_t decr, uint32_t value)
>>     741 {
>> ...
>>     749     if (kvm_enabled()) {
>>     750         /* KVM handles decrementer exceptions, we don't need our own timer */
>>     751         return;
>>     752     }
>> ...
>>
>> But this allows to migrate it for TCG. And it should be correct because in case of TCG I think [need to check] timebase is stopped too (so offset is 0)
>>
>> David, do you agree with that?
> 
> Yes, I think so.  Some details might be different, but the basic idea
> of migrating the timebase and decrementers at machine level should be
> the same for pseries and g3beige/whatever.

From my perspective, I'd really like to bury all of the timebase
migration logic into timebase_load()/timebase_save() so then as you
suggest above, everything will work regardless of the machine type and host.

Playing around with these functions I can see that they are very
PPC-specific - for example cpu_get_host_ticks() can only ever work
correctly migrating between 2 PPC hosts because here on Intel the value
returned is the TSC register which is completely unrelated to the
timebase frequency. Hence the calculated values are nonsense and the
guest inevitably ends up freezing.

I've had a go at converting this back to using the host clock to
calculate the required adjustment to tb_offset (see
https://github.com/mcayland/qemu/commit/68e31f0455a79d41b2ab703364c74ceb4d513154)
but then even that struck me as incorrect since unless there are other
host CPUs with an equivalent of tb_offset, this whole section of code
should only ever run if the host CPU is PPC, and maybe even only under KVM.

My current thoughts are now as follows:

- The existing timebase_load() should never adjust tb_offset unless the
host CPU is also PPC (how do we actually determine this?). Otherwise it
should always be 0.

- The per-CPU decrementer values should be *stored* in the normal
vmstate SPR array as per my latest patch at
https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg02346.html.

BUT:

- The per-CPU decrementer values should be *restored* in timebase_load()
but again should not be adjusted by tb_offset unless the host CPU is
also PPC.

The real question is whether tb_offset is used for a TCG PPC guest on a
PPC host or whether it also remains at 0 as it does here on an Intel
host? If it does remain at 0 for a TCG PPC guest on a PPC host then we
don't need to work out whether we are running on a PPC host at all: we
can just skip the timebase adjustments made by timebase_load() if
!kvm_enabled() and everything should just work.

If this all seems reasonable then what I need to do for the PPC
g3beige/mac99 machines is add the relevant machine state, include the
existing vmstate_change_handler and then add the code to timebase_load()
to set the decrementer. I'm happy to do this as long as everyone agrees
this is a sensible approach?


ATB,

Mark.
David Gibson Sept. 19, 2017, 8:36 a.m. UTC | #6
On Fri, Sep 15, 2017 at 01:45:11PM +0100, Mark Cave-Ayland wrote:
> On 14/09/17 04:52, David Gibson wrote:
> 
> > On Wed, Sep 13, 2017 at 07:58:23PM +0200, Laurent Vivier wrote:
> >> On 13/09/2017 19:11, Mark Cave-Ayland wrote:
> >>> On 13/09/17 08:12, David Gibson wrote:
> >>>
> >>>> This is subtly incorrect.  It sets the DECR on load to exactly the
> >>>> value that was saved.  That effectively means that the DECR is frozen
> >>>> for the migration downtime, which probably isn't what we want.
> >>>>
> >>>> Instead we need to save the DECR as an offset from the timebase, and
> >>>> restore it relative to the (downtime adjusted) timebase on the
> >>>> destination.
> >>>>
> >>>> The complication with that is that the timebase is generally migrated
> >>>> at the machine level, not the cpu level: the timebase is generally
> >>>> synchronized between cpus in the machine, and migrating it at the
> >>>> individual cpu could break that.  Which means we probably need a
> >>>> machine level hook to handle the decrementer too, even though it
> >>>> logically *is* per-cpu, because otherwise we'll be trying to restore
> >>>> it before the timebase is restored.
> >>>
> >>> I know that we discussed this in-depth last year, however I was working
> >>> along the lines that Laurent's patch fixed this along the lines of our
> >>> previous discussion:
> >>> https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg00338.html (and
> >>> indeed Laurent's analysis at
> >>> https://lists.gnu.org/archive/html/qemu-devel/2017-02/msg01489.html).
> >>>
> >>> However looking again at the this patch in the context you mentioned
> >>> above, I'm starting to wonder if the right solution now is for the
> >>> machine init function for g3beige/mac99 to do the same as spapr, e.g.
> >>> add cpu_ppc_clock_vm_state_change() as a vm_change_state_handler and
> >>> then add VMSTATE_PPC_TIMEBASE_V from the machine PPCTimeBase into my new
> >>> subsection?
> >>>
> >>> Laurent, do you think that your state change handler would work
> >>> correctly in this way?
> >>
> >> I think all is explained in the second link you have mentioned, it seems 
> >> we don't need a state handler as KVM DECR will no be updated by the migrated value:
> >>
> >> hw/ppc/ppc.c
> >>
> >>     736 static void __cpu_ppc_store_decr(PowerPCCPU *cpu, uint64_t *nextp,
> >>     737                                  QEMUTimer *timer,
> >>     738                                  void (*raise_excp)(void *),
> >>     739                                  void (*lower_excp)(PowerPCCPU *),
> >>     740                                  uint32_t decr, uint32_t value)
> >>     741 {
> >> ...
> >>     749     if (kvm_enabled()) {
> >>     750         /* KVM handles decrementer exceptions, we don't need our own timer */
> >>     751         return;
> >>     752     }
> >> ...
> >>
> >> But this allows to migrate it for TCG. And it should be correct because in case of TCG I think [need to check] timebase is stopped too (so offset is 0)
> >>
> >> David, do you agree with that?
> > 
> > Yes, I think so.  Some details might be different, but the basic idea
> > of migrating the timebase and decrementers at machine level should be
> > the same for pseries and g3beige/whatever.
> 
> >From my perspective, I'd really like to bury all of the timebase
> migration logic into timebase_load()/timebase_save() so then as you
> suggest above, everything will work regardless of the machine type and host.

That would certainly be ideal if we can manage it.

> Playing around with these functions I can see that they are very
> PPC-specific - for example cpu_get_host_ticks() can only ever work
> correctly migrating between 2 PPC hosts because here on Intel the value
> returned is the TSC register which is completely unrelated to the
> timebase frequency. Hence the calculated values are nonsense and the
> guest inevitably ends up freezing.

Ugh.  That sounds like it needs fixing.

> 
> I've had a go at converting this back to using the host clock to
> calculate the required adjustment to tb_offset (see
> https://github.com/mcayland/qemu/commit/68e31f0455a79d41b2ab703364c74ceb4d513154)
> but then even that struck me as incorrect since unless there are other
> host CPUs with an equivalent of tb_offset, this whole section of code
> should only ever run if the host CPU is PPC, and maybe even only under KVM.

Hm, maybe.  It certainly should be possible to do something
equivalent, recalculating guest TB as necessary from guest generic
clock, frequency and offsets in real time units.  It might end up
being desirable to have a special KVM path which works with raw TB
offsets for simplicity, though.

> My current thoughts are now as follows:
> 
> - The existing timebase_load() should never adjust tb_offset unless the
> host CPU is also PPC (how do we actually determine this?). Otherwise it
> should always be 0.

That doesn't sound right.  But to work out what does makes sense, we
need to figure out what "tb_offset" means in the non-KVM case, which
isn't immediately obvious to me.

> - The per-CPU decrementer values should be *stored* in the normal
> vmstate SPR array as per my latest patch at
> https://lists.gnu.org/archive/html/qemu-devel/2017-09/msg02346.html.

Uh.. maybe.  It kind of makes sense, but the lack of symmetry with the
load path is a bit messy, and may not play well with the way vmstate
descriptions work.

> BUT:
> 
> - The per-CPU decrementer values should be *restored* in timebase_load()
> but again should not be adjusted by tb_offset unless the host CPU is
> also PPC.

That doesn't seem right.  Regardless of whether we're KVM or TCG, we
should recalculate the DECR values based on the stored value adjusted
by how many guest timebase ticks have passed since then (which we'll
need to calculate based on qemu_clock_ns() combined with various other
bits of info (basically the same stuff we already need to compute the
correct timebase on load).

Or to put it another way timebase_load() should calculate (and/or
load) both the guest TB at save time and the guest TB at load time,
then adjust all the DECRs based on the delta between them.

For extra fun, to cover some of the embedded cpus, we'll need to
consider how to handle the PIT, which is sort-of-but-not-quite like
the DECR.

> The real question is whether tb_offset is used for a TCG PPC guest on a
> PPC host or whether it also remains at 0 as it does here on an Intel
> host? If it does remain at 0 for a TCG PPC guest on a PPC host then we
> don't need to work out whether we are running on a PPC host at all: we
> can just skip the timebase adjustments made by timebase_load() if
> !kvm_enabled() and everything should just work.

It kind of sounds like we need to rework that value entirely - or at
least rename it to something less misleading.  Basically to represent
the guest TB we want a tuple of a snapshot TB value with the qemu
clock time at which that snapshot was taken.  Based on that, plus the
tb frequency we can calculate a future TB value - and it's a
migratable quantity, which raw tb_offset is not (even between ppc
hosts).

Obviously there are several equivalent ways of representing that.

> If this all seems reasonable then what I need to do for the PPC
> g3beige/mac99 machines is add the relevant machine state, include the
> existing vmstate_change_handler and then add the code to timebase_load()
> to set the decrementer. I'm happy to do this as long as everyone agrees
> this is a sensible approach?
> 
> 
> ATB,
> 
> Mark.
>
diff mbox series

Patch

diff --git a/target/ppc/machine.c b/target/ppc/machine.c
index 10b3c41..a16a856 100644
--- a/target/ppc/machine.c
+++ b/target/ppc/machine.c
@@ -176,6 +176,7 @@  static void cpu_pre_save(void *opaque)
     env->spr[SPR_CFAR] = env->cfar;
 #endif
     env->spr[SPR_BOOKE_SPEFSCR] = env->spe_fscr;
+    env->spr[SPR_DECR] = cpu_ppc_load_decr(env);
 
     for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
         env->spr[SPR_DBAT0U + 2*i] = env->DBAT[0][i];
@@ -280,6 +281,7 @@  static int cpu_post_load(void *opaque, int version_id)
     env->cfar = env->spr[SPR_CFAR];
 #endif
     env->spe_fscr = env->spr[SPR_BOOKE_SPEFSCR];
+    cpu_ppc_store_decr(env, env->spr[SPR_DECR]);
 
     for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
         env->DBAT[0][i] = env->spr[SPR_DBAT0U + 2*i];