diff mbox

[4/4] target-ppc: ensure we include the decrementer value during migration

Message ID 568FC5FF.9070606@ilande.co.uk
State New
Headers show

Commit Message

Mark Cave-Ayland Jan. 8, 2016, 2:21 p.m. UTC
On 08/01/16 02:47, Alexey Kardashevskiy wrote:

> On 01/07/2016 05:22 AM, Mark Cave-Ayland wrote:
>> During local testing with TCG, intermittent errors were found when
>> trying to
>> migrate Darwin OS images.
>>
>> The underlying cause was that Darwin resets the decrementer value to
>> fairly
>> small values on each interrupt. cpu_ppc_set_tb_clk() sets the default
>> value
>> of the decrementer to 0xffffffff during initialisation which typically
>> corresponds to several seconds. Hence when restoring the image, the guest
>> would effectively "lose" decrementer interrupts during this time causing
>> confusion in the guest.
>>
>> NOTE: there does seem to be some overlap here with the
>> vmstate_ppc_timebase
>> code, however it doesn't seem to handle multiple CPUs which is why
>> I've gone
>> for an independent implementation.
> 
> It does handle multiple CPUs:
> 
> static int timebase_post_load(void *opaque, int version_id)
> {
> ...
>      /* Set new offset to all CPUs */
>      CPU_FOREACH(cpu) {
>          PowerPCCPU *pcpu = POWERPC_CPU(cpu);
>          pcpu->env.tb_env->tb_offset = tb_off_adj;
>      }
> 
> 
> It does not transfer DECR though, it transfers the offset instead, one
> for all CPUs.
> 
> 
> The easier solution would be just adding this instead of the whole patch:
> 
> spr_register(env, SPR_DECR, "DECR",
>              SPR_NOACCESS, SPR_NOACCESS,
>              &spr_read_decr, &spr_write_decr,
>              0x00000000);
> 
> somewhere in target-ppc/translate_init.c for CPUs you want to support,
> gen_tbl() seems to be the right place for this. As long as it is just
> spr_register() and not spr_register_kvm(), I suppose it should not break
> KVM and pseries.

I've just tried adding that but it then gives the following error on
startup:

Error: Trying to register SPR 22 (016) twice !

Based upon this, the existing registration seems fine. I think the
problem is that I can't see anything in __cpu_ppc_store_decr() that
updates the spr[SPR_DECR] value when the decrementer register is
changed, so it needs to be explicitly added to
cpu_pre_save()/cpu_post_load():




I've just tried the diff above instead of my original version and
repeated my savevm/loadvm pair test with a Darwin installation and it
also fixes the random hang I was seeing on loadvm.

Seemingly this should work on KVM in that cpu_ppc_load_decr() and
cpu_ppc_store_decr() become no-ops as long as KVM maintains
env->spr[SPR_DECR], but a second set of eyeballs would be useful here.

If you can let me know if this is suitable then I'll update the patchset
based upon your feedback and send out a v2.


ATB,

Mark.

Comments

Alexey Kardashevskiy Jan. 11, 2016, 1:18 a.m. UTC | #1
On 01/09/2016 01:21 AM, Mark Cave-Ayland wrote:
> On 08/01/16 02:47, Alexey Kardashevskiy wrote:
>
>> On 01/07/2016 05:22 AM, Mark Cave-Ayland wrote:
>>> During local testing with TCG, intermittent errors were found when
>>> trying to
>>> migrate Darwin OS images.
>>>
>>> The underlying cause was that Darwin resets the decrementer value to
>>> fairly
>>> small values on each interrupt. cpu_ppc_set_tb_clk() sets the default
>>> value
>>> of the decrementer to 0xffffffff during initialisation which typically
>>> corresponds to several seconds. Hence when restoring the image, the guest
>>> would effectively "lose" decrementer interrupts during this time causing
>>> confusion in the guest.
>>>
>>> NOTE: there does seem to be some overlap here with the
>>> vmstate_ppc_timebase
>>> code, however it doesn't seem to handle multiple CPUs which is why
>>> I've gone
>>> for an independent implementation.
>>
>> It does handle multiple CPUs:
>>
>> static int timebase_post_load(void *opaque, int version_id)
>> {
>> ...
>>       /* Set new offset to all CPUs */
>>       CPU_FOREACH(cpu) {
>>           PowerPCCPU *pcpu = POWERPC_CPU(cpu);
>>           pcpu->env.tb_env->tb_offset = tb_off_adj;
>>       }
>>
>>
>> It does not transfer DECR though, it transfers the offset instead, one
>> for all CPUs.
>>
>>
>> The easier solution would be just adding this instead of the whole patch:
>>
>> spr_register(env, SPR_DECR, "DECR",
>>               SPR_NOACCESS, SPR_NOACCESS,
>>               &spr_read_decr, &spr_write_decr,
>>               0x00000000);
>>
>> somewhere in target-ppc/translate_init.c for CPUs you want to support,
>> gen_tbl() seems to be the right place for this. As long as it is just
>> spr_register() and not spr_register_kvm(), I suppose it should not break
>> KVM and pseries.
>
> I've just tried adding that but it then gives the following error on
> startup:
>
> Error: Trying to register SPR 22 (016) twice !
>
> Based upon this, the existing registration seems fine. I think the
> problem is that I can't see anything in __cpu_ppc_store_decr() that
> updates the spr[SPR_DECR] value when the decrementer register is
> changed, so it needs to be explicitly added to
> cpu_pre_save()/cpu_post_load():
>
>
> diff --git a/target-ppc/machine.c b/target-ppc/machine.c
> index 251a84b..495e58d 100644
> --- a/target-ppc/machine.c
> +++ b/target-ppc/machine.c
> @@ -141,6 +141,7 @@ static void cpu_pre_save(void *opaque)
>       env->spr[SPR_CFAR] = env->cfar;
>   #endif
>       env->spr[SPR_BOOKE_SPEFSCR] = env->spe_fscr;
> +    env->spr[SPR_DECR] = cpu_ppc_load_decr(env);
>
>       for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
>           env->spr[SPR_DBAT0U + 2*i] = env->DBAT[0][i];
> @@ -175,6 +176,7 @@ static int cpu_post_load(void *opaque, int version_id)
>       env->cfar = env->spr[SPR_CFAR];
>   #endif
>       env->spe_fscr = env->spr[SPR_BOOKE_SPEFSCR];
> +    cpu_ppc_store_decr(env, env->spr[SPR_DECR]);
>
>       for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
>           env->DBAT[0][i] = env->spr[SPR_DBAT0U + 2*i];
>
>
> I've just tried the diff above instead of my original version and
> repeated my savevm/loadvm pair test with a Darwin installation and it
> also fixes the random hang I was seeing on loadvm.
>
> Seemingly this should work on KVM in that cpu_ppc_load_decr() and
> cpu_ppc_store_decr() become no-ops as long as KVM maintains
> env->spr[SPR_DECR], but a second set of eyeballs would be useful here.
>
> If you can let me know if this is suitable then I'll update the patchset
> based upon your feedback and send out a v2.


Looks good to me, I'd just wait a day or two in the case if David wants to 
comment.
David Gibson Jan. 11, 2016, 4:55 a.m. UTC | #2
On Mon, Jan 11, 2016 at 12:18:31PM +1100, Alexey Kardashevskiy wrote:
> On 01/09/2016 01:21 AM, Mark Cave-Ayland wrote:
> >On 08/01/16 02:47, Alexey Kardashevskiy wrote:
> >
> >>On 01/07/2016 05:22 AM, Mark Cave-Ayland wrote:
> >>>During local testing with TCG, intermittent errors were found when
> >>>trying to
> >>>migrate Darwin OS images.
> >>>
> >>>The underlying cause was that Darwin resets the decrementer value to
> >>>fairly
> >>>small values on each interrupt. cpu_ppc_set_tb_clk() sets the default
> >>>value
> >>>of the decrementer to 0xffffffff during initialisation which typically
> >>>corresponds to several seconds. Hence when restoring the image, the guest
> >>>would effectively "lose" decrementer interrupts during this time causing
> >>>confusion in the guest.
> >>>
> >>>NOTE: there does seem to be some overlap here with the
> >>>vmstate_ppc_timebase
> >>>code, however it doesn't seem to handle multiple CPUs which is why
> >>>I've gone
> >>>for an independent implementation.
> >>
> >>It does handle multiple CPUs:
> >>
> >>static int timebase_post_load(void *opaque, int version_id)
> >>{
> >>...
> >>      /* Set new offset to all CPUs */
> >>      CPU_FOREACH(cpu) {
> >>          PowerPCCPU *pcpu = POWERPC_CPU(cpu);
> >>          pcpu->env.tb_env->tb_offset = tb_off_adj;
> >>      }
> >>
> >>
> >>It does not transfer DECR though, it transfers the offset instead, one
> >>for all CPUs.
> >>
> >>
> >>The easier solution would be just adding this instead of the whole patch:
> >>
> >>spr_register(env, SPR_DECR, "DECR",
> >>              SPR_NOACCESS, SPR_NOACCESS,
> >>              &spr_read_decr, &spr_write_decr,
> >>              0x00000000);
> >>
> >>somewhere in target-ppc/translate_init.c for CPUs you want to support,
> >>gen_tbl() seems to be the right place for this. As long as it is just
> >>spr_register() and not spr_register_kvm(), I suppose it should not break
> >>KVM and pseries.
> >
> >I've just tried adding that but it then gives the following error on
> >startup:
> >
> >Error: Trying to register SPR 22 (016) twice !
> >
> >Based upon this, the existing registration seems fine. I think the
> >problem is that I can't see anything in __cpu_ppc_store_decr() that
> >updates the spr[SPR_DECR] value when the decrementer register is
> >changed, so it needs to be explicitly added to
> >cpu_pre_save()/cpu_post_load():
> >
> >
> >diff --git a/target-ppc/machine.c b/target-ppc/machine.c
> >index 251a84b..495e58d 100644
> >--- a/target-ppc/machine.c
> >+++ b/target-ppc/machine.c
> >@@ -141,6 +141,7 @@ static void cpu_pre_save(void *opaque)
> >      env->spr[SPR_CFAR] = env->cfar;
> >  #endif
> >      env->spr[SPR_BOOKE_SPEFSCR] = env->spe_fscr;
> >+    env->spr[SPR_DECR] = cpu_ppc_load_decr(env);
> >
> >      for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
> >          env->spr[SPR_DBAT0U + 2*i] = env->DBAT[0][i];
> >@@ -175,6 +176,7 @@ static int cpu_post_load(void *opaque, int version_id)
> >      env->cfar = env->spr[SPR_CFAR];
> >  #endif
> >      env->spe_fscr = env->spr[SPR_BOOKE_SPEFSCR];
> >+    cpu_ppc_store_decr(env, env->spr[SPR_DECR]);
> >
> >      for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
> >          env->DBAT[0][i] = env->spr[SPR_DBAT0U + 2*i];
> >
> >
> >I've just tried the diff above instead of my original version and
> >repeated my savevm/loadvm pair test with a Darwin installation and it
> >also fixes the random hang I was seeing on loadvm.
> >
> >Seemingly this should work on KVM in that cpu_ppc_load_decr() and
> >cpu_ppc_store_decr() become no-ops as long as KVM maintains
> >env->spr[SPR_DECR], but a second set of eyeballs would be useful here.
> >
> >If you can let me know if this is suitable then I'll update the patchset
> >based upon your feedback and send out a v2.
> 
> 
> Looks good to me, I'd just wait a day or two in the case if David wants to
> comment.

I was on holiday and missed the start of this thread, I'm afraid, so I
don't fully understand the context here.

Am I right in thinking that this change will essentially freeze the
decrementer across the migration downtime?  That doesn't seem right,
since the decrementer is supposed to be linked to the timebase and
represent real time passing.

In other words, isn't this just skipping the decrementer interrupts at
the qemu level rather than the guest level?

It seems that instead we should be reconstructing the decrementer on
the destination based on an offset from the timebase.
Mark Cave-Ayland Jan. 11, 2016, 7:43 a.m. UTC | #3
On 11/01/16 04:55, David Gibson wrote:

> On Mon, Jan 11, 2016 at 12:18:31PM +1100, Alexey Kardashevskiy wrote:
>> On 01/09/2016 01:21 AM, Mark Cave-Ayland wrote:
>>> On 08/01/16 02:47, Alexey Kardashevskiy wrote:
>>>
>>>> On 01/07/2016 05:22 AM, Mark Cave-Ayland wrote:
>>>>> During local testing with TCG, intermittent errors were found when
>>>>> trying to
>>>>> migrate Darwin OS images.
>>>>>
>>>>> The underlying cause was that Darwin resets the decrementer value to
>>>>> fairly
>>>>> small values on each interrupt. cpu_ppc_set_tb_clk() sets the default
>>>>> value
>>>>> of the decrementer to 0xffffffff during initialisation which typically
>>>>> corresponds to several seconds. Hence when restoring the image, the guest
>>>>> would effectively "lose" decrementer interrupts during this time causing
>>>>> confusion in the guest.
>>>>>
>>>>> NOTE: there does seem to be some overlap here with the
>>>>> vmstate_ppc_timebase
>>>>> code, however it doesn't seem to handle multiple CPUs which is why
>>>>> I've gone
>>>>> for an independent implementation.
>>>>
>>>> It does handle multiple CPUs:
>>>>
>>>> static int timebase_post_load(void *opaque, int version_id)
>>>> {
>>>> ...
>>>>      /* Set new offset to all CPUs */
>>>>      CPU_FOREACH(cpu) {
>>>>          PowerPCCPU *pcpu = POWERPC_CPU(cpu);
>>>>          pcpu->env.tb_env->tb_offset = tb_off_adj;
>>>>      }
>>>>
>>>>
>>>> It does not transfer DECR though, it transfers the offset instead, one
>>>> for all CPUs.
>>>>
>>>>
>>>> The easier solution would be just adding this instead of the whole patch:
>>>>
>>>> spr_register(env, SPR_DECR, "DECR",
>>>>              SPR_NOACCESS, SPR_NOACCESS,
>>>>              &spr_read_decr, &spr_write_decr,
>>>>              0x00000000);
>>>>
>>>> somewhere in target-ppc/translate_init.c for CPUs you want to support,
>>>> gen_tbl() seems to be the right place for this. As long as it is just
>>>> spr_register() and not spr_register_kvm(), I suppose it should not break
>>>> KVM and pseries.
>>>
>>> I've just tried adding that but it then gives the following error on
>>> startup:
>>>
>>> Error: Trying to register SPR 22 (016) twice !
>>>
>>> Based upon this, the existing registration seems fine. I think the
>>> problem is that I can't see anything in __cpu_ppc_store_decr() that
>>> updates the spr[SPR_DECR] value when the decrementer register is
>>> changed, so it needs to be explicitly added to
>>> cpu_pre_save()/cpu_post_load():
>>>
>>>
>>> diff --git a/target-ppc/machine.c b/target-ppc/machine.c
>>> index 251a84b..495e58d 100644
>>> --- a/target-ppc/machine.c
>>> +++ b/target-ppc/machine.c
>>> @@ -141,6 +141,7 @@ static void cpu_pre_save(void *opaque)
>>>      env->spr[SPR_CFAR] = env->cfar;
>>>  #endif
>>>      env->spr[SPR_BOOKE_SPEFSCR] = env->spe_fscr;
>>> +    env->spr[SPR_DECR] = cpu_ppc_load_decr(env);
>>>
>>>      for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
>>>          env->spr[SPR_DBAT0U + 2*i] = env->DBAT[0][i];
>>> @@ -175,6 +176,7 @@ static int cpu_post_load(void *opaque, int version_id)
>>>      env->cfar = env->spr[SPR_CFAR];
>>>  #endif
>>>      env->spe_fscr = env->spr[SPR_BOOKE_SPEFSCR];
>>> +    cpu_ppc_store_decr(env, env->spr[SPR_DECR]);
>>>
>>>      for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
>>>          env->DBAT[0][i] = env->spr[SPR_DBAT0U + 2*i];
>>>
>>>
>>> I've just tried the diff above instead of my original version and
>>> repeated my savevm/loadvm pair test with a Darwin installation and it
>>> also fixes the random hang I was seeing on loadvm.
>>>
>>> Seemingly this should work on KVM in that cpu_ppc_load_decr() and
>>> cpu_ppc_store_decr() become no-ops as long as KVM maintains
>>> env->spr[SPR_DECR], but a second set of eyeballs would be useful here.
>>>
>>> If you can let me know if this is suitable then I'll update the patchset
>>> based upon your feedback and send out a v2.
>>
>>
>> Looks good to me, I'd just wait a day or two in the case if David wants to
>> comment.
> 
> I was on holiday and missed the start of this thread, I'm afraid, so I
> don't fully understand the context here.

It's part of a patchset I posted which fixes up problems I had with
migrating g3beige/mac99 machines under TCG:
https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg00544.html.

Apologies for not adding you as CC directly - are you still helping to
cover ppc-next for Alex?

> Am I right in thinking that this change will essentially freeze the
> decrementer across the migration downtime?  That doesn't seem right,
> since the decrementer is supposed to be linked to the timebase and
> represent real time passing.

Yes, that's correct.

> In other words, isn't this just skipping the decrementer interrupts at
> the qemu level rather than the guest level?
> 
> It seems that instead we should be reconstructing the decrementer on
> the destination based on an offset from the timebase.

Well I haven't really looked at how time warping works during in
migration for QEMU, however this seems to be the method used by
hw/ppc/ppc.c's timebase_post_load() function but my understanding is
that this isn't currently available for the g3beige/mac99 machines?
Should the patch in fact do this but also add decrementer support? And
if it did, would this have a negative effect on pseries?

But yes, assuming that the guest time warp is handled correctly (which I
assume is handled correctly elsewhere since this would also be required
for KVM) then I think that this should work.


ATB,

Mark.
David Gibson Jan. 12, 2016, 2:44 a.m. UTC | #4
On Mon, Jan 11, 2016 at 07:43:54AM +0000, Mark Cave-Ayland wrote:
> On 11/01/16 04:55, David Gibson wrote:
> 
> > On Mon, Jan 11, 2016 at 12:18:31PM +1100, Alexey Kardashevskiy wrote:
> >> On 01/09/2016 01:21 AM, Mark Cave-Ayland wrote:
> >>> On 08/01/16 02:47, Alexey Kardashevskiy wrote:
> >>>
> >>>> On 01/07/2016 05:22 AM, Mark Cave-Ayland wrote:
> >>>>> During local testing with TCG, intermittent errors were found when
> >>>>> trying to
> >>>>> migrate Darwin OS images.
> >>>>>
> >>>>> The underlying cause was that Darwin resets the decrementer value to
> >>>>> fairly
> >>>>> small values on each interrupt. cpu_ppc_set_tb_clk() sets the default
> >>>>> value
> >>>>> of the decrementer to 0xffffffff during initialisation which typically
> >>>>> corresponds to several seconds. Hence when restoring the image, the guest
> >>>>> would effectively "lose" decrementer interrupts during this time causing
> >>>>> confusion in the guest.
> >>>>>
> >>>>> NOTE: there does seem to be some overlap here with the
> >>>>> vmstate_ppc_timebase
> >>>>> code, however it doesn't seem to handle multiple CPUs which is why
> >>>>> I've gone
> >>>>> for an independent implementation.
> >>>>
> >>>> It does handle multiple CPUs:
> >>>>
> >>>> static int timebase_post_load(void *opaque, int version_id)
> >>>> {
> >>>> ...
> >>>>      /* Set new offset to all CPUs */
> >>>>      CPU_FOREACH(cpu) {
> >>>>          PowerPCCPU *pcpu = POWERPC_CPU(cpu);
> >>>>          pcpu->env.tb_env->tb_offset = tb_off_adj;
> >>>>      }
> >>>>
> >>>>
> >>>> It does not transfer DECR though, it transfers the offset instead, one
> >>>> for all CPUs.
> >>>>
> >>>>
> >>>> The easier solution would be just adding this instead of the whole patch:
> >>>>
> >>>> spr_register(env, SPR_DECR, "DECR",
> >>>>              SPR_NOACCESS, SPR_NOACCESS,
> >>>>              &spr_read_decr, &spr_write_decr,
> >>>>              0x00000000);
> >>>>
> >>>> somewhere in target-ppc/translate_init.c for CPUs you want to support,
> >>>> gen_tbl() seems to be the right place for this. As long as it is just
> >>>> spr_register() and not spr_register_kvm(), I suppose it should not break
> >>>> KVM and pseries.
> >>>
> >>> I've just tried adding that but it then gives the following error on
> >>> startup:
> >>>
> >>> Error: Trying to register SPR 22 (016) twice !
> >>>
> >>> Based upon this, the existing registration seems fine. I think the
> >>> problem is that I can't see anything in __cpu_ppc_store_decr() that
> >>> updates the spr[SPR_DECR] value when the decrementer register is
> >>> changed, so it needs to be explicitly added to
> >>> cpu_pre_save()/cpu_post_load():
> >>>
> >>>
> >>> diff --git a/target-ppc/machine.c b/target-ppc/machine.c
> >>> index 251a84b..495e58d 100644
> >>> --- a/target-ppc/machine.c
> >>> +++ b/target-ppc/machine.c
> >>> @@ -141,6 +141,7 @@ static void cpu_pre_save(void *opaque)
> >>>      env->spr[SPR_CFAR] = env->cfar;
> >>>  #endif
> >>>      env->spr[SPR_BOOKE_SPEFSCR] = env->spe_fscr;
> >>> +    env->spr[SPR_DECR] = cpu_ppc_load_decr(env);
> >>>
> >>>      for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
> >>>          env->spr[SPR_DBAT0U + 2*i] = env->DBAT[0][i];
> >>> @@ -175,6 +176,7 @@ static int cpu_post_load(void *opaque, int version_id)
> >>>      env->cfar = env->spr[SPR_CFAR];
> >>>  #endif
> >>>      env->spe_fscr = env->spr[SPR_BOOKE_SPEFSCR];
> >>> +    cpu_ppc_store_decr(env, env->spr[SPR_DECR]);
> >>>
> >>>      for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
> >>>          env->DBAT[0][i] = env->spr[SPR_DBAT0U + 2*i];
> >>>
> >>>
> >>> I've just tried the diff above instead of my original version and
> >>> repeated my savevm/loadvm pair test with a Darwin installation and it
> >>> also fixes the random hang I was seeing on loadvm.
> >>>
> >>> Seemingly this should work on KVM in that cpu_ppc_load_decr() and
> >>> cpu_ppc_store_decr() become no-ops as long as KVM maintains
> >>> env->spr[SPR_DECR], but a second set of eyeballs would be useful here.
> >>>
> >>> If you can let me know if this is suitable then I'll update the patchset
> >>> based upon your feedback and send out a v2.
> >>
> >>
> >> Looks good to me, I'd just wait a day or two in the case if David wants to
> >> comment.
> > 
> > I was on holiday and missed the start of this thread, I'm afraid, so I
> > don't fully understand the context here.
> 
> It's part of a patchset I posted which fixes up problems I had with
> migrating g3beige/mac99 machines under TCG:
> https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg00544.html.
> 
> Apologies for not adding you as CC directly - are you still helping to
> cover ppc-next for Alex?

Yes, I am.

> > Am I right in thinking that this change will essentially freeze the
> > decrementer across the migration downtime?  That doesn't seem right,
> > since the decrementer is supposed to be linked to the timebase and
> > represent real time passing.
> 
> Yes, that's correct.
> 
> > In other words, isn't this just skipping the decrementer interrupts at
> > the qemu level rather than the guest level?
> > 
> > It seems that instead we should be reconstructing the decrementer on
> > the destination based on an offset from the timebase.
> 
> Well I haven't really looked at how time warping works during in
> migration for QEMU, however this seems to be the method used by
> hw/ppc/ppc.c's timebase_post_load() function but my understanding is
> that this isn't currently available for the g3beige/mac99 machines?

Ah.. yes, it looks like the timebase migration stuff is only hooked in
on the pseries machine type.  As far as I can tell it should be
trivial to add it to other machines though - it doesn't appear to rely
on anything outside the common ppc timebase stuff.

> Should the patch in fact do this but also add decrementer support? And
> if it did, would this have a negative effect on pseries?

Yes, I think that's the right approach.  Note that rather than
duplicating the logic to adjust the decrementer over migration, it
should be possible to encode the decrementer as a diff from the
timebase across the migration.

In fact.. I'm not sure it ever makes sense to store the decrementer
value as a direct value, since it's constantly changing - probably
makes more sense to derive it from the timebase whenever it is needed.

As far as I know that should be fine for pseries.  I think the current
behaviour is probably technically wrong for pseries as well, but the
timing code of our Linux guests is robust enough to handle a small
displacement to the time of the next decrementer interrupt.

> But yes, assuming that the guest time warp is handled correctly (which I
> assume is handled correctly elsewhere since this would also be required
> for KVM) then I think that this should work.
> 
> 
> ATB,
> 
> Mark.
>
Mark Cave-Ayland Jan. 15, 2016, 5:46 p.m. UTC | #5
On 12/01/16 02:44, David Gibson wrote:

>>> In other words, isn't this just skipping the decrementer interrupts at
>>> the qemu level rather than the guest level?
>>>
>>> It seems that instead we should be reconstructing the decrementer on
>>> the destination based on an offset from the timebase.
>>
>> Well I haven't really looked at how time warping works during in
>> migration for QEMU, however this seems to be the method used by
>> hw/ppc/ppc.c's timebase_post_load() function but my understanding is
>> that this isn't currently available for the g3beige/mac99 machines?
> 
> Ah.. yes, it looks like the timebase migration stuff is only hooked in
> on the pseries machine type.  As far as I can tell it should be
> trivial to add it to other machines though - it doesn't appear to rely
> on anything outside the common ppc timebase stuff.
> 
>> Should the patch in fact do this but also add decrementer support? And
>> if it did, would this have a negative effect on pseries?
> 
> Yes, I think that's the right approach.  Note that rather than
> duplicating the logic to adjust the decrementer over migration, it
> should be possible to encode the decrementer as a diff from the
> timebase across the migration.
> 
> In fact.. I'm not sure it ever makes sense to store the decrementer
> value as a direct value, since it's constantly changing - probably
> makes more sense to derive it from the timebase whenever it is needed.
> 
> As far as I know that should be fine for pseries.  I think the current
> behaviour is probably technically wrong for pseries as well, but the
> timing code of our Linux guests is robust enough to handle a small
> displacement to the time of the next decrementer interrupt.

I've had a bit of an experiment trying to implement something suitable,
but I'm not 100% certain I've got this right.

From the code my understanding is that the timebase is effectively
free-running and so if a migration takes 5s then you use tb_offset to
calculate the difference between the timebase before migration, and
subsequently apply the offset for all future reads of the timebase for
the lifetime of the CPU (i.e. the migrated guest is effectively living
at a point in the past where the timebase is consistent).

So I do see that it would make more sense to switch the decrementer over
to an offset from timebase to avoid duplicating the time-shift migration
logic, but my PPC experience is fairly limited. Any particular pointers
as to what is the best way to approach this? It seems getting this done
separately first will make the changes in the last patch trivial.


Many thanks,

Mark.
David Gibson Jan. 18, 2016, 4:51 a.m. UTC | #6
On Fri, Jan 15, 2016 at 05:46:10PM +0000, Mark Cave-Ayland wrote:
> On 12/01/16 02:44, David Gibson wrote:
> 
> >>> In other words, isn't this just skipping the decrementer interrupts at
> >>> the qemu level rather than the guest level?
> >>>
> >>> It seems that instead we should be reconstructing the decrementer on
> >>> the destination based on an offset from the timebase.
> >>
> >> Well I haven't really looked at how time warping works during in
> >> migration for QEMU, however this seems to be the method used by
> >> hw/ppc/ppc.c's timebase_post_load() function but my understanding is
> >> that this isn't currently available for the g3beige/mac99 machines?
> > 
> > Ah.. yes, it looks like the timebase migration stuff is only hooked in
> > on the pseries machine type.  As far as I can tell it should be
> > trivial to add it to other machines though - it doesn't appear to rely
> > on anything outside the common ppc timebase stuff.
> > 
> >> Should the patch in fact do this but also add decrementer support? And
> >> if it did, would this have a negative effect on pseries?
> > 
> > Yes, I think that's the right approach.  Note that rather than
> > duplicating the logic to adjust the decrementer over migration, it
> > should be possible to encode the decrementer as a diff from the
> > timebase across the migration.
> > 
> > In fact.. I'm not sure it ever makes sense to store the decrementer
> > value as a direct value, since it's constantly changing - probably
> > makes more sense to derive it from the timebase whenever it is needed.
> > 
> > As far as I know that should be fine for pseries.  I think the current
> > behaviour is probably technically wrong for pseries as well, but the
> > timing code of our Linux guests is robust enough to handle a small
> > displacement to the time of the next decrementer interrupt.
> 
> I've had a bit of an experiment trying to implement something suitable,
> but I'm not 100% certain I've got this right.
> 
> >From the code my understanding is that the timebase is effectively
> free-running and so if a migration takes 5s then you use tb_offset to
> calculate the difference between the timebase before migration, and
> subsequently apply the offset for all future reads of the timebase for
> the lifetime of the CPU (i.e. the migrated guest is effectively living
> at a point in the past where the timebase is consistent).

Um.. no.  At least in the usual configuration, the timebase represents
real, wall-clock time, so we expect it to jump forward across the
migration downtime.  This is important because the guest will use the
timebase to calculate real time differences.

However, the absolute value of the timebase may be different on the
*host* between source and destination for migration.  So what we need
to do is before migration we work out the delta between host and guest
notions of wall clock time (as defined by the guest timebase), and
transfer that in the migration stream.

On the destination we initialize the guest timebase so that the guest
maintains the same realtime offset from the host.  This means that as
long as source and destination system time is synchronized, guest
real-time tracking will continue correctly across the migration.

We do also make sure that the guest timebase never goes backwards, but
that would only happen if the source and destination host times were
badly out of sync.
Mark Cave-Ayland Jan. 25, 2016, 5:48 a.m. UTC | #7
On 18/01/16 04:51, David Gibson wrote:

> On Fri, Jan 15, 2016 at 05:46:10PM +0000, Mark Cave-Ayland wrote:
>> On 12/01/16 02:44, David Gibson wrote:
>>
>>>>> In other words, isn't this just skipping the decrementer interrupts at
>>>>> the qemu level rather than the guest level?
>>>>>
>>>>> It seems that instead we should be reconstructing the decrementer on
>>>>> the destination based on an offset from the timebase.
>>>>
>>>> Well I haven't really looked at how time warping works during in
>>>> migration for QEMU, however this seems to be the method used by
>>>> hw/ppc/ppc.c's timebase_post_load() function but my understanding is
>>>> that this isn't currently available for the g3beige/mac99 machines?
>>>
>>> Ah.. yes, it looks like the timebase migration stuff is only hooked in
>>> on the pseries machine type.  As far as I can tell it should be
>>> trivial to add it to other machines though - it doesn't appear to rely
>>> on anything outside the common ppc timebase stuff.
>>>
>>>> Should the patch in fact do this but also add decrementer support? And
>>>> if it did, would this have a negative effect on pseries?
>>>
>>> Yes, I think that's the right approach.  Note that rather than
>>> duplicating the logic to adjust the decrementer over migration, it
>>> should be possible to encode the decrementer as a diff from the
>>> timebase across the migration.
>>>
>>> In fact.. I'm not sure it ever makes sense to store the decrementer
>>> value as a direct value, since it's constantly changing - probably
>>> makes more sense to derive it from the timebase whenever it is needed.
>>>
>>> As far as I know that should be fine for pseries.  I think the current
>>> behaviour is probably technically wrong for pseries as well, but the
>>> timing code of our Linux guests is robust enough to handle a small
>>> displacement to the time of the next decrementer interrupt.
>>
>> I've had a bit of an experiment trying to implement something suitable,
>> but I'm not 100% certain I've got this right.
>>
>> >From the code my understanding is that the timebase is effectively
>> free-running and so if a migration takes 5s then you use tb_offset to
>> calculate the difference between the timebase before migration, and
>> subsequently apply the offset for all future reads of the timebase for
>> the lifetime of the CPU (i.e. the migrated guest is effectively living
>> at a point in the past where the timebase is consistent).
> 
> Um.. no.  At least in the usual configuration, the timebase represents
> real, wall-clock time, so we expect it to jump forward across the
> migration downtime.  This is important because the guest will use the
> timebase to calculate real time differences.
> 
> However, the absolute value of the timebase may be different on the
> *host* between source and destination for migration.  So what we need
> to do is before migration we work out the delta between host and guest
> notions of wall clock time (as defined by the guest timebase), and
> transfer that in the migration stream.
> 
> On the destination we initialize the guest timebase so that the guest
> maintains the same realtime offset from the host.  This means that as
> long as source and destination system time is synchronized, guest
> real-time tracking will continue correctly across the migration.
> 
> We do also make sure that the guest timebase never goes backwards, but
> that would only happen if the source and destination host times were
> badly out of sync.

I had a poke at trying to include the timebase state in the Mac machines
but found that enabling this caused migration to freeze, even on top of
my existing (known working) patchset.

Looking closer at the code, I don't think it can work on an x86 TCG host
in its current form since the host/guest clocks are used interchangeably
in places, e.g. in timebase_pre_save() we have this:

uint64_t ticks = cpu_get_host_ticks();
...
tb->guest_timebase = ticks + first_ppc_cpu->env.tb_env->tb_offset;

So this implies that guest_timebase is set in higher resolution host
ticks. But then in timebase_post_load() we have this:

migration_duration_tb = muldiv64(migration_duration_ns, freq,
    NANOSECONDS_PER_SECOND);

which calculates the migration time in timebase ticks but then:

guest_tb = tb_remote->guest_timebase + MIN(0, migration_duration_tb);
tb_off_adj = guest_tb - cpu_get_host_ticks();

which mixes tb_remote->guest_timebase in host ticks with
migration_duration_tb in timebase ticks.

I think that tb_off_adj should be in host ticks so should
migration_duration_tb be dropped and guest_tb be calculated from
migration_duration_ns instead? At least given the current mixing of host
ticks and timebase ticks, I think this can only work correctly on PPC
where the two are the same.

One other thing I noticed is that this code "broke" my normal testing
pattern when I tend to launch qemu-system-ppc with -loadvm foo -S with
the machine stopped. In this case the migration duration calculation is
performed at the moment state is restored, and not the point where the
CPU is started.

Should the adjustment calculation not take place in a cpu_start()
function instead? Otherwise at the point when I resume the paused
machine the tb_off_adj calculation will be based in the past and
therefore wrong.


ATB,

Mark.
David Gibson Jan. 25, 2016, 11:10 a.m. UTC | #8
On Mon, Jan 25, 2016 at 05:48:02AM +0000, Mark Cave-Ayland wrote:
> On 18/01/16 04:51, David Gibson wrote:
> 
> > On Fri, Jan 15, 2016 at 05:46:10PM +0000, Mark Cave-Ayland wrote:
> >> On 12/01/16 02:44, David Gibson wrote:
> >>
> >>>>> In other words, isn't this just skipping the decrementer interrupts at
> >>>>> the qemu level rather than the guest level?
> >>>>>
> >>>>> It seems that instead we should be reconstructing the decrementer on
> >>>>> the destination based on an offset from the timebase.
> >>>>
> >>>> Well I haven't really looked at how time warping works during in
> >>>> migration for QEMU, however this seems to be the method used by
> >>>> hw/ppc/ppc.c's timebase_post_load() function but my understanding is
> >>>> that this isn't currently available for the g3beige/mac99 machines?
> >>>
> >>> Ah.. yes, it looks like the timebase migration stuff is only hooked in
> >>> on the pseries machine type.  As far as I can tell it should be
> >>> trivial to add it to other machines though - it doesn't appear to rely
> >>> on anything outside the common ppc timebase stuff.
> >>>
> >>>> Should the patch in fact do this but also add decrementer support? And
> >>>> if it did, would this have a negative effect on pseries?
> >>>
> >>> Yes, I think that's the right approach.  Note that rather than
> >>> duplicating the logic to adjust the decrementer over migration, it
> >>> should be possible to encode the decrementer as a diff from the
> >>> timebase across the migration.
> >>>
> >>> In fact.. I'm not sure it ever makes sense to store the decrementer
> >>> value as a direct value, since it's constantly changing - probably
> >>> makes more sense to derive it from the timebase whenever it is needed.
> >>>
> >>> As far as I know that should be fine for pseries.  I think the current
> >>> behaviour is probably technically wrong for pseries as well, but the
> >>> timing code of our Linux guests is robust enough to handle a small
> >>> displacement to the time of the next decrementer interrupt.
> >>
> >> I've had a bit of an experiment trying to implement something suitable,
> >> but I'm not 100% certain I've got this right.
> >>
> >> >From the code my understanding is that the timebase is effectively
> >> free-running and so if a migration takes 5s then you use tb_offset to
> >> calculate the difference between the timebase before migration, and
> >> subsequently apply the offset for all future reads of the timebase for
> >> the lifetime of the CPU (i.e. the migrated guest is effectively living
> >> at a point in the past where the timebase is consistent).
> > 
> > Um.. no.  At least in the usual configuration, the timebase represents
> > real, wall-clock time, so we expect it to jump forward across the
> > migration downtime.  This is important because the guest will use the
> > timebase to calculate real time differences.
> > 
> > However, the absolute value of the timebase may be different on the
> > *host* between source and destination for migration.  So what we need
> > to do is before migration we work out the delta between host and guest
> > notions of wall clock time (as defined by the guest timebase), and
> > transfer that in the migration stream.
> > 
> > On the destination we initialize the guest timebase so that the guest
> > maintains the same realtime offset from the host.  This means that as
> > long as source and destination system time is synchronized, guest
> > real-time tracking will continue correctly across the migration.
> > 
> > We do also make sure that the guest timebase never goes backwards, but
> > that would only happen if the source and destination host times were
> > badly out of sync.
> 
> I had a poke at trying to include the timebase state in the Mac machines
> but found that enabling this caused migration to freeze, even on top of
> my existing (known working) patchset.
> 
> Looking closer at the code, I don't think it can work on an x86 TCG host
> in its current form since the host/guest clocks are used interchangeably
> in places, e.g. in timebase_pre_save() we have this:
> 
> uint64_t ticks = cpu_get_host_ticks();
> ...
> tb->guest_timebase = ticks + first_ppc_cpu->env.tb_env->tb_offset;
> 
> So this implies that guest_timebase is set in higher resolution host
> ticks. But then in timebase_post_load() we have this:
> 
> migration_duration_tb = muldiv64(migration_duration_ns, freq,
>     NANOSECONDS_PER_SECOND);
> 
> which calculates the migration time in timebase ticks but then:
> 
> guest_tb = tb_remote->guest_timebase + MIN(0, migration_duration_tb);
> tb_off_adj = guest_tb - cpu_get_host_ticks();
> 
> which mixes tb_remote->guest_timebase in host ticks with
> migration_duration_tb in timebase ticks.
> 
> I think that tb_off_adj should be in host ticks so should
> migration_duration_tb be dropped and guest_tb be calculated from
> migration_duration_ns instead? At least given the current mixing of host
> ticks and timebase ticks, I think this can only work correctly on PPC
> where the two are the same.
> 
> One other thing I noticed is that this code "broke" my normal testing
> pattern when I tend to launch qemu-system-ppc with -loadvm foo -S with
> the machine stopped. In this case the migration duration calculation is
> performed at the moment state is restored, and not the point where the
> CPU is started.
> 
> Should the adjustment calculation not take place in a cpu_start()
> function instead? Otherwise at the point when I resume the paused
> machine the tb_off_adj calculation will be based in the past and
> therefore wrong.

Um.. so the migration duration is a complete red herring, regardless
of the units.

Remember, we only ever compute the guest timebase value at the moment
the guest requests it - actually maintaining a current timebase value
makes sense in hardware, but would be nuts in software.

The timebase is a function of real, wall-clock time, and the migration
destination has a notion of wall-clock time without reference to the
source.

So what you need to transmit for migration is enough information to
compute the guest timebase from real-time - essentially just an offset
between real-time and the timebase.

The guest can potentially observe the migration duration as a jump in
timebase values, but qemu doesn't need to do any calculations with it.
BALATON Zoltan Jan. 25, 2016, 5:20 p.m. UTC | #9
On Mon, 25 Jan 2016, David Gibson wrote:
> Remember, we only ever compute the guest timebase value at the moment
> the guest requests it - actually maintaining a current timebase value
> makes sense in hardware, but would be nuts in software.
>
> The timebase is a function of real, wall-clock time, and the migration
> destination has a notion of wall-clock time without reference to the
> source.
>
> So what you need to transmit for migration is enough information to
> compute the guest timebase from real-time - essentially just an offset
> between real-time and the timebase.

I don't know anything about all this but if I understand your conversation 
correctly then you don't really need an offset between real-time and 
timebase. That would be true assuming the source and destination has the 
same real-time but that's not necessarily true. What would be needed is an 
offset between source timebase and destination's real time. That is, 
besides the offset on the source you would also need to work out the 
difference in the "real-time" on the source and destination and correct 
the transferred offset with this after migration.

Is this handled somehow in QEMU (for example by doing some message 
exchange to establish the clock difference between the source and 
destination during migration) or is it assumed that migration always needs 
synchronised clocks done by some external means?

Regards,
BALATON Zoltan
David Gibson Jan. 26, 2016, 5:51 a.m. UTC | #10
On Mon, Jan 25, 2016 at 06:20:21PM +0100, BALATON Zoltan wrote:
> On Mon, 25 Jan 2016, David Gibson wrote:
> >Remember, we only ever compute the guest timebase value at the moment
> >the guest requests it - actually maintaining a current timebase value
> >makes sense in hardware, but would be nuts in software.
> >
> >The timebase is a function of real, wall-clock time, and the migration
> >destination has a notion of wall-clock time without reference to the
> >source.
> >
> >So what you need to transmit for migration is enough information to
> >compute the guest timebase from real-time - essentially just an offset
> >between real-time and the timebase.
> 
> I don't know anything about all this but if I understand your conversation
> correctly then you don't really need an offset between real-time and
> timebase. That would be true assuming the source and destination has the
> same real-time but that's not necessarily true.

No, it's not necessarily true, but if it's not that's not really our
problem to deal with.

We do ensure that the guest timebase doesn't go backwards in this
situation, but if the source and destination hosts don't have
synchronized time, the guest real time may jump.  But since we have no
idea whether the source or destination had a better notion of
real-time, we can't really do any better.

> What would be needed is an
> offset between source timebase and destination's real time. That is, besides
> the offset on the source you would also need to work out the difference in
> the "real-time" on the source and destination and correct the transferred
> offset with this after migration.

Yeah, that's not really possible in the current migration model.  Nor
is it something that's really qemu's job.

> Is this handled somehow in QEMU (for example by doing some message exchange
> to establish the clock difference between the source and destination during
> migration) or is it assumed that migration always needs synchronised clocks
> done by some external means?

Migration itself doesn't need synced clocks, but if the hosts'
real-time isn't well behaved, you can't really expect the guest's real
time to be well behaved.
Mark Cave-Ayland Jan. 26, 2016, 10:31 p.m. UTC | #11
On 25/01/16 11:10, David Gibson wrote:

> Um.. so the migration duration is a complete red herring, regardless
> of the units.
> 
> Remember, we only ever compute the guest timebase value at the moment
> the guest requests it - actually maintaining a current timebase value
> makes sense in hardware, but would be nuts in software.
> 
> The timebase is a function of real, wall-clock time, and the migration
> destination has a notion of wall-clock time without reference to the
> source.
> 
> So what you need to transmit for migration is enough information to
> compute the guest timebase from real-time - essentially just an offset
> between real-time and the timebase.
> 
> The guest can potentially observe the migration duration as a jump in
> timebase values, but qemu doesn't need to do any calculations with it.

Thanks for more pointers - I think I'm slowly getting there. My current
thoughts are that the basic migration algorithm is doing the right thing
in that it works out the number of host ticks different between source
and destination.

I have a slight query with this section of code though:

    migration_duration_tb = muldiv64(migration_duration_ns, freq,
                                     NANOSECONDS_PER_SECOND);

This is not technically correct on TCG x86 since the timebase is the x86
TSC which is running somewhere in the GHz range, compared to freq which
is hard-coded to 16MHz. However this doesn't seem to matter because the
timebase adjustment is limited to a maximum of 1s. Why should this be if
the timebase is supposed to be free running as you mentioned in a
previous email?

AFAICT the main problem on TCG x86 is that post-migration the timebase
calculated by cpu_ppc_get_tb() is incorrect:

uint64_t cpu_ppc_get_tb(ppc_tb_t *tb_env, uint64_t vmclk, int64_t tb_offset)
{
    /* TB time in tb periods */
    return muldiv64(vmclk, tb_env->tb_freq, get_ticks_per_sec()) +
                    tb_offset;
}

For a typical savevm/loadvm pair I see something like this:

savevm:

tb->guest_timebase = 26281306490558
qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511

loadvm:

cpu_get_host_ticks() = 26289847005259
tb_off_adj = -8540514701
qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
cpu_ppc_get_tb() = -15785159386

But as cpu_ppc_get_tb() uses QEMU_CLOCK_VIRTUAL for vmclk we end up with
a negative number for the timebase since the virtual clock is dwarfed by
the number of TSC ticks calculated for tb_off_adj. This will work on a
PPC host though since cpu_host_get_ticks() is also derived from the
timebase.

Another question I have is cpu_ppc_load_tbl():

uint64_t cpu_ppc_load_tbl (CPUPPCState *env)
{
    ppc_tb_t *tb_env = env->tb_env;
    uint64_t tb;

    if (kvm_enabled()) {
        return env->spr[SPR_TBL];
    }

    tb = cpu_ppc_get_tb(tb_env, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
                        tb_env->tb_offset);
    LOG_TB("%s: tb %016" PRIx64 "\n", __func__, tb);

    return tb;
}

Compared with cpu_ppc_load_tbu(), it is returning uint64_t rather than
uint32_t and doesn't appear to mask the bottom 32-bits of the timebase
value?


ATB,

Mark.
Mark Cave-Ayland Jan. 26, 2016, 11:08 p.m. UTC | #12
On 26/01/16 22:31, Mark Cave-Ayland wrote:

> For a typical savevm/loadvm pair I see something like this:
> 
> savevm:
> 
> tb->guest_timebase = 26281306490558
> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
> 
> loadvm:
> 
> cpu_get_host_ticks() = 26289847005259
> tb_off_adj = -8540514701
> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
> cpu_ppc_get_tb() = -15785159386
> 
> But as cpu_ppc_get_tb() uses QEMU_CLOCK_VIRTUAL for vmclk we end up with
> a negative number for the timebase since the virtual clock is dwarfed by
> the number of TSC ticks calculated for tb_off_adj. This will work on a
> PPC host though since cpu_host_get_ticks() is also derived from the
> timebase.

Ah... the magic of signed numbers. Here's another attempt but this time
with cpu_ppc_get_tb() printed unsigned:

savevm:

tb->guest_timebase = 2597350923332
qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 4199081744
cpu_ppc_get_tb() = 69704756

loadvm:

cpu_get_host_ticks() = 2606380782824
tb_off_adj = -9029859492
qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 4199081744
cpu_ppc_get_tb() = 18446744064749396880


This implies that the cpu_ppc_get_tb() value post-migration is far too
low - presumably because tb_env->tb_freq is only 16MHz compared to the
actual TSC rate?


ATB,

Mark.
David Gibson Feb. 1, 2016, 12:52 a.m. UTC | #13
On Tue, Jan 26, 2016 at 10:31:19PM +0000, Mark Cave-Ayland wrote:
> On 25/01/16 11:10, David Gibson wrote:
> 
> > Um.. so the migration duration is a complete red herring, regardless
> > of the units.
> > 
> > Remember, we only ever compute the guest timebase value at the moment
> > the guest requests it - actually maintaining a current timebase value
> > makes sense in hardware, but would be nuts in software.
> > 
> > The timebase is a function of real, wall-clock time, and the migration
> > destination has a notion of wall-clock time without reference to the
> > source.
> > 
> > So what you need to transmit for migration is enough information to
> > compute the guest timebase from real-time - essentially just an offset
> > between real-time and the timebase.
> > 
> > The guest can potentially observe the migration duration as a jump in
> > timebase values, but qemu doesn't need to do any calculations with it.
> 
> Thanks for more pointers - I think I'm slowly getting there. My current
> thoughts are that the basic migration algorithm is doing the right thing
> in that it works out the number of host ticks different between source
> and destination.

Sorry, I've take a while to reply to this.  I realised the tb
migration didn't work the way I thought it did, so I've had to get my
head around what's actually going on.

I had thought that it transferred only meta-information telling the
destination how to calculate the timebase, without actually working
out the timebase value at any particular moment.

In fact, what it sends is basically the tuple of (timebase, realtime)
at the point of sending the migration stream.  The destination then
uses that to work out how to compute the timebase from realtime there.

I'm not convinced this is a great approach, but it should basically
work.  However, as you've seen there are also some Just Plain Bugs in
the logic for this.

> I have a slight query with this section of code though:
> 
>     migration_duration_tb = muldiv64(migration_duration_ns, freq,
>                                      NANOSECONDS_PER_SECOND);
> 
> This is not technically correct on TCG x86 since the timebase is the x86
> TSC which is running somewhere in the GHz range, compared to freq which
> is hard-coded to 16MHz.

Um.. what?  AFAICT that line doesn't have any reference to the TSC
speed.  Just ns and the (guest) tb).  Also 16MHz is only for the
oldworld Macs - modern ppc cpus have the TB frequency architected as
512MHz.

> However this doesn't seem to matter because the
> timebase adjustment is limited to a maximum of 1s. Why should this be if
> the timebase is supposed to be free running as you mentioned in a
> previous email?

AFAICT, what it's doing here is assuming that if the migration
duration is >1s (or appears to be >1s) then it's because the host
clocks are out of sync and so just capping the elapsed tb time at 1s.

That's just wrong, IMO.  1s is a long downtime for a live migration,
but it's not impossible, and it will happen nearly always in the
scenariou you've discussed of manually loading the migration stream
from a file.

But more to the point, trying to maintain correctness of the timebase
when the hosts are out of sync is basically futile.  There's no other
reference we can use, so all we can achieve is getting a different
wrong value from what we'd get by blindly trusting the host clock.

We do need to constrain the tb from going backwards, because that will
cause chaos on the guest, but otherwise we should just trust the host
clock and ditch that 1s clamp.  If the hosts are out of sync, then
guest time will jump, but that was always going to happen.

> AFAICT the main problem on TCG x86 is that post-migration the timebase
> calculated by cpu_ppc_get_tb() is incorrect:
> 
> uint64_t cpu_ppc_get_tb(ppc_tb_t *tb_env, uint64_t vmclk, int64_t tb_offset)
> {
>     /* TB time in tb periods */
>     return muldiv64(vmclk, tb_env->tb_freq, get_ticks_per_sec()) +
>                     tb_offset;
> }


So the problem here is that get_ticks_per_sec() (which always returns
1,000,000,000) is not talking about the same ticks as
cpu_get_host_ticks().  That may not have been true when this code was
written.

> For a typical savevm/loadvm pair I see something like this:
> 
> savevm:
> 
> tb->guest_timebase = 26281306490558
> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
> 
> loadvm:
> 
> cpu_get_host_ticks() = 26289847005259
> tb_off_adj = -8540514701
> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
> cpu_ppc_get_tb() = -15785159386
> 
> But as cpu_ppc_get_tb() uses QEMU_CLOCK_VIRTUAL for vmclk we end up with
> a negative number for the timebase since the virtual clock is dwarfed by
> the number of TSC ticks calculated for tb_off_adj. This will work on a
> PPC host though since cpu_host_get_ticks() is also derived from the
> timebase.

Yeah, we shouldn't be using cpu_host_get_ticks() at all - or anything
else which depends on a host frequency.  We should only be using qemu
interfaces which work in real time units (nanoseconds, usually).

> Another question I have is cpu_ppc_load_tbl():
> 
> uint64_t cpu_ppc_load_tbl (CPUPPCState *env)
> {
>     ppc_tb_t *tb_env = env->tb_env;
>     uint64_t tb;
> 
>     if (kvm_enabled()) {
>         return env->spr[SPR_TBL];
>     }
> 
>     tb = cpu_ppc_get_tb(tb_env, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
>                         tb_env->tb_offset);
>     LOG_TB("%s: tb %016" PRIx64 "\n", __func__, tb);
> 
>     return tb;
> }
> 
> Compared with cpu_ppc_load_tbu(), it is returning uint64_t rather than
> uint32_t and doesn't appear to mask the bottom 32-bits of the timebase
> value?

That's because in 64-bit mode loading the "TBL" is actually a load of
the whole 64-bit timebase.
Mark Cave-Ayland Feb. 2, 2016, 11:41 p.m. UTC | #14
On 01/02/16 00:52, David Gibson wrote:

>> Thanks for more pointers - I think I'm slowly getting there. My current
>> thoughts are that the basic migration algorithm is doing the right thing
>> in that it works out the number of host ticks different between source
>> and destination.
> 
> Sorry, I've take a while to reply to this.  I realised the tb
> migration didn't work the way I thought it did, so I've had to get my
> head around what's actually going on.

No problem - it's turning out to be a lot more complicated than I
initially expected.

> I had thought that it transferred only meta-information telling the
> destination how to calculate the timebase, without actually working
> out the timebase value at any particular moment.
> 
> In fact, what it sends is basically the tuple of (timebase, realtime)
> at the point of sending the migration stream.  The destination then
> uses that to work out how to compute the timebase from realtime there.
> 
> I'm not convinced this is a great approach, but it should basically
> work.  However, as you've seen there are also some Just Plain Bugs in
> the logic for this.
> 
>> I have a slight query with this section of code though:
>>
>>     migration_duration_tb = muldiv64(migration_duration_ns, freq,
>>                                      NANOSECONDS_PER_SECOND);
>>
>> This is not technically correct on TCG x86 since the timebase is the x86
>> TSC which is running somewhere in the GHz range, compared to freq which
>> is hard-coded to 16MHz.
> 
> Um.. what?  AFAICT that line doesn't have any reference to the TSC
> speed.  Just ns and the (guest) tb).  Also 16MHz is only for the
> oldworld Macs - modern ppc cpus have the TB frequency architected as
> 512MHz.

On TCG the software timebase for the Mac guests is fixed at 16MHz so how
does KVM handle this? Does it compensate by emulating the 16MHz timebase
for the guest even though the host has a 512HMz timebase?

>> However this doesn't seem to matter because the
>> timebase adjustment is limited to a maximum of 1s. Why should this be if
>> the timebase is supposed to be free running as you mentioned in a
>> previous email?
> 
> AFAICT, what it's doing here is assuming that if the migration
> duration is >1s (or appears to be >1s) then it's because the host
> clocks are out of sync and so just capping the elapsed tb time at 1s.
> 
> That's just wrong, IMO.  1s is a long downtime for a live migration,
> but it's not impossible, and it will happen nearly always in the
> scenariou you've discussed of manually loading the migration stream
> from a file.
> 
> But more to the point, trying to maintain correctness of the timebase
> when the hosts are out of sync is basically futile.  There's no other
> reference we can use, so all we can achieve is getting a different
> wrong value from what we'd get by blindly trusting the host clock.
> 
> We do need to constrain the tb from going backwards, because that will
> cause chaos on the guest, but otherwise we should just trust the host
> clock and ditch that 1s clamp.  If the hosts are out of sync, then
> guest time will jump, but that was always going to happen.

Going back to your earlier email you suggested that the host timebase is
always continuously running, even when the guest is paused. But then
resuming the guest then the timebase must jump in the guest regardless?

If this is the case then this is the big difference between TCG and KVM
guests: TCG timebase is derived from the virtual clock which solves the
problem of paused guests during migration. For example with the existing
migration code, what would happen if you did a migration with the guest
paused on KVM? The offset would surely be wrong as it was calculated at
the end of migration.

And another thought: should it be possible to migrate guests between TCG
and KVM hosts at will?

>> AFAICT the main problem on TCG x86 is that post-migration the timebase
>> calculated by cpu_ppc_get_tb() is incorrect:
>>
>> uint64_t cpu_ppc_get_tb(ppc_tb_t *tb_env, uint64_t vmclk, int64_t tb_offset)
>> {
>>     /* TB time in tb periods */
>>     return muldiv64(vmclk, tb_env->tb_freq, get_ticks_per_sec()) +
>>                     tb_offset;
>> }
> 
> 
> So the problem here is that get_ticks_per_sec() (which always returns
> 1,000,000,000) is not talking about the same ticks as
> cpu_get_host_ticks().  That may not have been true when this code was
> written.

Yes. That's basically what I was trying to say but I think you've
expressed it far more eloquently than I did.

>> For a typical savevm/loadvm pair I see something like this:
>>
>> savevm:
>>
>> tb->guest_timebase = 26281306490558
>> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
>>
>> loadvm:
>>
>> cpu_get_host_ticks() = 26289847005259
>> tb_off_adj = -8540514701
>> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
>> cpu_ppc_get_tb() = -15785159386
>>
>> But as cpu_ppc_get_tb() uses QEMU_CLOCK_VIRTUAL for vmclk we end up with
>> a negative number for the timebase since the virtual clock is dwarfed by
>> the number of TSC ticks calculated for tb_off_adj. This will work on a
>> PPC host though since cpu_host_get_ticks() is also derived from the
>> timebase.
> 
> Yeah, we shouldn't be using cpu_host_get_ticks() at all - or anything
> else which depends on a host frequency.  We should only be using qemu
> interfaces which work in real time units (nanoseconds, usually).

I agree that this is the right way forward. Unfortunately the timebase
behaviour under KVM PPC is quite new to me, so please do bear with me
for asking all these questions.


ATB,

Mark.
David Gibson Feb. 3, 2016, 4:59 a.m. UTC | #15
On Tue, Feb 02, 2016 at 11:41:40PM +0000, Mark Cave-Ayland wrote:
> On 01/02/16 00:52, David Gibson wrote:
> 
> >> Thanks for more pointers - I think I'm slowly getting there. My current
> >> thoughts are that the basic migration algorithm is doing the right thing
> >> in that it works out the number of host ticks different between source
> >> and destination.
> > 
> > Sorry, I've take a while to reply to this.  I realised the tb
> > migration didn't work the way I thought it did, so I've had to get my
> > head around what's actually going on.
> 
> No problem - it's turning out to be a lot more complicated than I
> initially expected.
> 
> > I had thought that it transferred only meta-information telling the
> > destination how to calculate the timebase, without actually working
> > out the timebase value at any particular moment.
> > 
> > In fact, what it sends is basically the tuple of (timebase, realtime)
> > at the point of sending the migration stream.  The destination then
> > uses that to work out how to compute the timebase from realtime there.
> > 
> > I'm not convinced this is a great approach, but it should basically
> > work.  However, as you've seen there are also some Just Plain Bugs in
> > the logic for this.
> > 
> >> I have a slight query with this section of code though:
> >>
> >>     migration_duration_tb = muldiv64(migration_duration_ns, freq,
> >>                                      NANOSECONDS_PER_SECOND);
> >>
> >> This is not technically correct on TCG x86 since the timebase is the x86
> >> TSC which is running somewhere in the GHz range, compared to freq which
> >> is hard-coded to 16MHz.
> > 
> > Um.. what?  AFAICT that line doesn't have any reference to the TSC
> > speed.  Just ns and the (guest) tb).  Also 16MHz is only for the
> > oldworld Macs - modern ppc cpus have the TB frequency architected as
> > 512MHz.
> 
> On TCG the software timebase for the Mac guests is fixed at 16MHz so how
> does KVM handle this?

> Does it compensate by emulating the 16MHz timebase
> for the guest even though the host has a 512HMz timebase?

No, it can't.  The timebase is not privileged, so there's no way to
virtualize it for the guest.  So, the best we can do is to detect KVM,
override the guest device tree with the host timebase frequency and
hope the guest reads it rather than assuming a fixed value for the
platform.

> >> However this doesn't seem to matter because the
> >> timebase adjustment is limited to a maximum of 1s. Why should this be if
> >> the timebase is supposed to be free running as you mentioned in a
> >> previous email?
> > 
> > AFAICT, what it's doing here is assuming that if the migration
> > duration is >1s (or appears to be >1s) then it's because the host
> > clocks are out of sync and so just capping the elapsed tb time at 1s.
> > 
> > That's just wrong, IMO.  1s is a long downtime for a live migration,
> > but it's not impossible, and it will happen nearly always in the
> > scenariou you've discussed of manually loading the migration stream
> > from a file.
> > 
> > But more to the point, trying to maintain correctness of the timebase
> > when the hosts are out of sync is basically futile.  There's no other
> > reference we can use, so all we can achieve is getting a different
> > wrong value from what we'd get by blindly trusting the host clock.
> > 
> > We do need to constrain the tb from going backwards, because that will
> > cause chaos on the guest, but otherwise we should just trust the host
> > clock and ditch that 1s clamp.  If the hosts are out of sync, then
> > guest time will jump, but that was always going to happen.
> 
> Going back to your earlier email you suggested that the host timebase is
> always continuously running, even when the guest is paused. But then
> resuming the guest then the timebase must jump in the guest regardless?
> 
> If this is the case then this is the big difference between TCG and KVM
> guests: TCG timebase is derived from the virtual clock which solves the
> problem of paused guests during migration. For example with the existing
> migration code, what would happen if you did a migration with the guest
> paused on KVM? The offset would surely be wrong as it was calculated at
> the end of migration.

So there are two different cases to consider here.  Once is when the
guest is paused incidentally, such as during migration, the other is
when the guest is explicitly paused.

In the first case the timebase absolutely should keep running (or
appear to do so), since it's the primary source of real time for the
guest.

In the second case, it's a bit unclear what the right thing to do is.
Keeping the tb running means accurate realtime, but stopping it is
often better for debugging, which is one of the main reasons to
explicitly pause.

I believe spapr on KVM HV will keep the TB going, but the TSC on x86
will be stopped.

> And another thought: should it be possible to migrate guests between TCG
> and KVM hosts at will?

It would be nice if that's possible, but I don't think it's a primary goal.

> >> AFAICT the main problem on TCG x86 is that post-migration the timebase
> >> calculated by cpu_ppc_get_tb() is incorrect:
> >>
> >> uint64_t cpu_ppc_get_tb(ppc_tb_t *tb_env, uint64_t vmclk, int64_t tb_offset)
> >> {
> >>     /* TB time in tb periods */
> >>     return muldiv64(vmclk, tb_env->tb_freq, get_ticks_per_sec()) +
> >>                     tb_offset;
> >> }
> > 
> > 
> > So the problem here is that get_ticks_per_sec() (which always returns
> > 1,000,000,000) is not talking about the same ticks as
> > cpu_get_host_ticks().  That may not have been true when this code was
> > written.
> 
> Yes. That's basically what I was trying to say but I think you've
> expressed it far more eloquently than I did.
> 
> >> For a typical savevm/loadvm pair I see something like this:
> >>
> >> savevm:
> >>
> >> tb->guest_timebase = 26281306490558
> >> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
> >>
> >> loadvm:
> >>
> >> cpu_get_host_ticks() = 26289847005259
> >> tb_off_adj = -8540514701
> >> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
> >> cpu_ppc_get_tb() = -15785159386
> >>
> >> But as cpu_ppc_get_tb() uses QEMU_CLOCK_VIRTUAL for vmclk we end up with
> >> a negative number for the timebase since the virtual clock is dwarfed by
> >> the number of TSC ticks calculated for tb_off_adj. This will work on a
> >> PPC host though since cpu_host_get_ticks() is also derived from the
> >> timebase.
> > 
> > Yeah, we shouldn't be using cpu_host_get_ticks() at all - or anything
> > else which depends on a host frequency.  We should only be using qemu
> > interfaces which work in real time units (nanoseconds, usually).
> 
> I agree that this is the right way forward. Unfortunately the timebase
> behaviour under KVM PPC is quite new to me, so please do bear with me
> for asking all these questions.
> 
> 
> ATB,
> 
> Mark.
>
Alexander Graf Feb. 3, 2016, 5:43 a.m. UTC | #16
> Am 03.02.2016 um 06:59 schrieb David Gibson <david@gibson.dropbear.id.au>:
> 
>> On Tue, Feb 02, 2016 at 11:41:40PM +0000, Mark Cave-Ayland wrote:
>> On 01/02/16 00:52, David Gibson wrote:
>> 
>>>> Thanks for more pointers - I think I'm slowly getting there. My current
>>>> thoughts are that the basic migration algorithm is doing the right thing
>>>> in that it works out the number of host ticks different between source
>>>> and destination.
>>> 
>>> Sorry, I've take a while to reply to this.  I realised the tb
>>> migration didn't work the way I thought it did, so I've had to get my
>>> head around what's actually going on.
>> 
>> No problem - it's turning out to be a lot more complicated than I
>> initially expected.
>> 
>>> I had thought that it transferred only meta-information telling the
>>> destination how to calculate the timebase, without actually working
>>> out the timebase value at any particular moment.
>>> 
>>> In fact, what it sends is basically the tuple of (timebase, realtime)
>>> at the point of sending the migration stream.  The destination then
>>> uses that to work out how to compute the timebase from realtime there.
>>> 
>>> I'm not convinced this is a great approach, but it should basically
>>> work.  However, as you've seen there are also some Just Plain Bugs in
>>> the logic for this.
>>> 
>>>> I have a slight query with this section of code though:
>>>> 
>>>>    migration_duration_tb = muldiv64(migration_duration_ns, freq,
>>>>                                     NANOSECONDS_PER_SECOND);
>>>> 
>>>> This is not technically correct on TCG x86 since the timebase is the x86
>>>> TSC which is running somewhere in the GHz range, compared to freq which
>>>> is hard-coded to 16MHz.
>>> 
>>> Um.. what?  AFAICT that line doesn't have any reference to the TSC
>>> speed.  Just ns and the (guest) tb).  Also 16MHz is only for the
>>> oldworld Macs - modern ppc cpus have the TB frequency architected as
>>> 512MHz.
>> 
>> On TCG the software timebase for the Mac guests is fixed at 16MHz so how
>> does KVM handle this?
> 
>> Does it compensate by emulating the 16MHz timebase
>> for the guest even though the host has a 512HMz timebase?
> 
> No, it can't.  The timebase is not privileged, so there's no way to
> virtualize it for the guest.  So, the best we can do is to detect KVM,
> override the guest device tree with the host timebase frequency and
> hope the guest reads it rather than assuming a fixed value for the
> platform.

IIRC Mac OS (9&X) calculates the tb frequency using the cuda clock as reference. So it doesn't honor our dt properties like Linux, but it can adapt to different frequencies.


Alex
Mark Cave-Ayland Feb. 23, 2016, 9:27 p.m. UTC | #17
On 03/02/16 04:59, David Gibson wrote:

>> Going back to your earlier email you suggested that the host timebase is
>> always continuously running, even when the guest is paused. But then
>> resuming the guest then the timebase must jump in the guest regardless?
>>
>> If this is the case then this is the big difference between TCG and KVM
>> guests: TCG timebase is derived from the virtual clock which solves the
>> problem of paused guests during migration. For example with the existing
>> migration code, what would happen if you did a migration with the guest
>> paused on KVM? The offset would surely be wrong as it was calculated at
>> the end of migration.
> 
> So there are two different cases to consider here.  Once is when the
> guest is paused incidentally, such as during migration, the other is
> when the guest is explicitly paused.
> 
> In the first case the timebase absolutely should keep running (or
> appear to do so), since it's the primary source of real time for the
> guest.

I'm not sure I understand this, since if the guest is paused either
deliberately or incidentally during migration then isn't the timebase
also frozen? Or is it external to the CPU?

> In the second case, it's a bit unclear what the right thing to do is.
> Keeping the tb running means accurate realtime, but stopping it is
> often better for debugging, which is one of the main reasons to
> explicitly pause.
> 
> I believe spapr on KVM HV will keep the TB going, but the TSC on x86
> will be stopped.

Is this from a guest-centric view, i.e. if I pause a VM and wait 20 mins
then when the guest resumes the timebase will jump forward by 20 mins
worth of ticks?


ATB,

Mark.
David Gibson Feb. 24, 2016, 12:47 a.m. UTC | #18
On Tue, Feb 23, 2016 at 09:27:09PM +0000, Mark Cave-Ayland wrote:
> On 03/02/16 04:59, David Gibson wrote:
> 
> >> Going back to your earlier email you suggested that the host timebase is
> >> always continuously running, even when the guest is paused. But then
> >> resuming the guest then the timebase must jump in the guest regardless?
> >>
> >> If this is the case then this is the big difference between TCG and KVM
> >> guests: TCG timebase is derived from the virtual clock which solves the
> >> problem of paused guests during migration. For example with the existing
> >> migration code, what would happen if you did a migration with the guest
> >> paused on KVM? The offset would surely be wrong as it was calculated at
> >> the end of migration.
> > 
> > So there are two different cases to consider here.  Once is when the
> > guest is paused incidentally, such as during migration, the other is
> > when the guest is explicitly paused.
> > 
> > In the first case the timebase absolutely should keep running (or
> > appear to do so), since it's the primary source of real time for the
> > guest.
> 
> I'm not sure I understand this, since if the guest is paused either
> deliberately or incidentally during migration then isn't the timebase
> also frozen? Or is it external to the CPU?

I don't really understand the question.  Migration has no equivalent
in real hardware, so there's no "real" behaviour to mimic.  If we
freeze the TB during migration, then the guest's clock will get out of
sync with wall clock time, and in a production environment that's
really bad.  So no, we absolutely must not freeze the TB during
migration.

When the guest has been explicitly paused, there's a case to be made
either way.

> > In the second case, it's a bit unclear what the right thing to do is.
> > Keeping the tb running means accurate realtime, but stopping it is
> > often better for debugging, which is one of the main reasons to
> > explicitly pause.
> > 
> > I believe spapr on KVM HV will keep the TB going, but the TSC on x86
> > will be stopped.
> 
> Is this from a guest-centric view, i.e. if I pause a VM and wait 20 mins
> then when the guest resumes the timebase will jump forward by 20 mins
> worth of ticks?

Yes, that's correct.
Juan Quintela Feb. 24, 2016, 12:31 p.m. UTC | #19
David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Feb 23, 2016 at 09:27:09PM +0000, Mark Cave-Ayland wrote:
>> On 03/02/16 04:59, David Gibson wrote:
>> 
>> >> Going back to your earlier email you suggested that the host timebase is
>> >> always continuously running, even when the guest is paused. But then
>> >> resuming the guest then the timebase must jump in the guest regardless?
>> >>
>> >> If this is the case then this is the big difference between TCG and KVM
>> >> guests: TCG timebase is derived from the virtual clock which solves the
>> >> problem of paused guests during migration. For example with the existing
>> >> migration code, what would happen if you did a migration with the guest
>> >> paused on KVM? The offset would surely be wrong as it was calculated at
>> >> the end of migration.
>> > 
>> > So there are two different cases to consider here.  Once is when the
>> > guest is paused incidentally, such as during migration, the other is
>> > when the guest is explicitly paused.
>> > 
>> > In the first case the timebase absolutely should keep running (or
>> > appear to do so), since it's the primary source of real time for the
>> > guest.
>> 
>> I'm not sure I understand this, since if the guest is paused either
>> deliberately or incidentally during migration then isn't the timebase
>> also frozen? Or is it external to the CPU?
>
> I don't really understand the question.  Migration has no equivalent
> in real hardware, so there's no "real" behaviour to mimic.  If we
> freeze the TB during migration, then the guest's clock will get out of
> sync with wall clock time, and in a production environment that's
> really bad.  So no, we absolutely must not freeze the TB during
> migration.
>
> When the guest has been explicitly paused, there's a case to be made
> either way.

If this is the case, can't we just change the device to just read the
clock from the host at device insntantiation and call it a day?

(* Notice that I haven't seen the previous discussion *)

On migration, having a post-load function that just loads the right
value for that device should work.  Or if we want to make it work for
pause/cont, we should have a notifier to be run each time "cont" is
issued, and put a callback there?

Or I am missing something improtant?

>
>> > In the second case, it's a bit unclear what the right thing to do is.
>> > Keeping the tb running means accurate realtime, but stopping it is
>> > often better for debugging, which is one of the main reasons to
>> > explicitly pause.
>> > 
>> > I believe spapr on KVM HV will keep the TB going, but the TSC on x86
>> > will be stopped.
>> 
>> Is this from a guest-centric view, i.e. if I pause a VM and wait 20 mins
>> then when the guest resumes the timebase will jump forward by 20 mins
>> worth of ticks?
>
> Yes, that's correct.

I.e. my proposal fixes this?
If you want ot make it really, really "classy", you can look at the mess
we have on x86 to introduce ticks "slewly" for windos 95 (and XP?)
during a while, but I don't think that solution would work for 20mins of
ticks :p

Later, Juan.
David Gibson Feb. 25, 2016, 12:19 a.m. UTC | #20
On Wed, Feb 24, 2016 at 01:31:05PM +0100, Juan Quintela wrote:
> David Gibson <david@gibson.dropbear.id.au> wrote:
> > On Tue, Feb 23, 2016 at 09:27:09PM +0000, Mark Cave-Ayland wrote:
> >> On 03/02/16 04:59, David Gibson wrote:
> >> 
> >> >> Going back to your earlier email you suggested that the host timebase is
> >> >> always continuously running, even when the guest is paused. But then
> >> >> resuming the guest then the timebase must jump in the guest regardless?
> >> >>
> >> >> If this is the case then this is the big difference between TCG and KVM
> >> >> guests: TCG timebase is derived from the virtual clock which solves the
> >> >> problem of paused guests during migration. For example with the existing
> >> >> migration code, what would happen if you did a migration with the guest
> >> >> paused on KVM? The offset would surely be wrong as it was calculated at
> >> >> the end of migration.
> >> > 
> >> > So there are two different cases to consider here.  Once is when the
> >> > guest is paused incidentally, such as during migration, the other is
> >> > when the guest is explicitly paused.
> >> > 
> >> > In the first case the timebase absolutely should keep running (or
> >> > appear to do so), since it's the primary source of real time for the
> >> > guest.
> >> 
> >> I'm not sure I understand this, since if the guest is paused either
> >> deliberately or incidentally during migration then isn't the timebase
> >> also frozen? Or is it external to the CPU?
> >
> > I don't really understand the question.  Migration has no equivalent
> > in real hardware, so there's no "real" behaviour to mimic.  If we
> > freeze the TB during migration, then the guest's clock will get out of
> > sync with wall clock time, and in a production environment that's
> > really bad.  So no, we absolutely must not freeze the TB during
> > migration.
> >
> > When the guest has been explicitly paused, there's a case to be made
> > either way.
> 
> If this is the case, can't we just change the device to just read the
> clock from the host at device insntantiation and call it a day?

That's not quite enough, because although the timebase advances in
real time, it will have an offset from realtime that varies boot to
boot.

> (* Notice that I haven't seen the previous discussion *)
> 
> On migration, having a post-load function that just loads the right
> value for that device should work.  Or if we want to make it work for
> pause/cont, we should have a notifier to be run each time "cont" is
> issued, and put a callback there?

Right.  This is basically what we already do on pseries: in pre_save
we store both the timebase value and the current real time.  In
post_load we again check the real time, look at the difference from
the value in the migration stream and advance the TB to match.

> Or I am missing something improtant?
> 
> >
> >> > In the second case, it's a bit unclear what the right thing to do is.
> >> > Keeping the tb running means accurate realtime, but stopping it is
> >> > often better for debugging, which is one of the main reasons to
> >> > explicitly pause.
> >> > 
> >> > I believe spapr on KVM HV will keep the TB going, but the TSC on x86
> >> > will be stopped.
> >> 
> >> Is this from a guest-centric view, i.e. if I pause a VM and wait 20 mins
> >> then when the guest resumes the timebase will jump forward by 20 mins
> >> worth of ticks?
> >
> > Yes, that's correct.
> 
> I.e. my proposal fixes this?
> If you want ot make it really, really "classy", you can look at the mess
> we have on x86 to introduce ticks "slewly" for windos 95 (and XP?)
> during a while, but I don't think that solution would work for 20mins of
> ticks :p
> 
> Later, Juan.
>
Mark Cave-Ayland Feb. 25, 2016, 4:33 a.m. UTC | #21
On 24/02/16 12:31, Juan Quintela wrote:
>> I don't really understand the question.  Migration has no equivalent
>> in real hardware, so there's no "real" behaviour to mimic.  If we
>> freeze the TB during migration, then the guest's clock will get out of
>> sync with wall clock time, and in a production environment that's
>> really bad.  So no, we absolutely must not freeze the TB during
>> migration.
>>
>> When the guest has been explicitly paused, there's a case to be made
>> either way.
> 
> If this is the case, can't we just change the device to just read the
> clock from the host at device insntantiation and call it a day?
> 
> (* Notice that I haven't seen the previous discussion *)
> 
> On migration, having a post-load function that just loads the right
> value for that device should work.  Or if we want to make it work for
> pause/cont, we should have a notifier to be run each time "cont" is
> issued, and put a callback there?
> 
> Or I am missing something improtant?

Right, that's roughly the approach I was thinking when I wrote my last
reply - for KVM derive the timebase from the virtual clock similar to
TCG and adjust on CPU start, e.g.

cpu_reset():
    cpu->tb_env->tb_offset = 0;

cpu_start/resume():
    cpu->tb_env->tb_offset =
        qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) * tb_env->tb_freq +
            cpu->tb_env->tb_offset -
        qemu_clock_get_ns(QEMU_CLOCK_HOST)

Is there any reason why this shouldn't work? My understanding is that
guests supporting KVM_REG_PPC_TB_OFFSET should compensate correctly for
the timebase if tb_offset is the difference from the host timebase at
guest virtual time zero.


ATB,

Mark.
Mark Cave-Ayland Feb. 25, 2016, 5 a.m. UTC | #22
On 25/02/16 04:33, Mark Cave-Ayland wrote:

> cpu_start/resume():
>     cpu->tb_env->tb_offset =
>         qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) * tb_env->tb_freq +
>             cpu->tb_env->tb_offset -
>         qemu_clock_get_ns(QEMU_CLOCK_HOST)

Actually just realised this is slightly wrong and in fact should be:

cpu_start/resume():
    cpu->tb_env->tb_offset =
        muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
                 cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) +
            cpu->tb_env->tb_offset -
        qemu_clock_get_ns(QEMU_CLOCK_HOST)


ATB,

Mark.
Mark Cave-Ayland Feb. 25, 2016, 9:50 a.m. UTC | #23
On 25/02/16 05:00, Mark Cave-Ayland wrote:

> On 25/02/16 04:33, Mark Cave-Ayland wrote:
> 
>> cpu_start/resume():
>>     cpu->tb_env->tb_offset =
>>         qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) * tb_env->tb_freq +
>>             cpu->tb_env->tb_offset -
>>         qemu_clock_get_ns(QEMU_CLOCK_HOST)
> 
> Actually just realised this is slightly wrong and in fact should be:
> 
> cpu_start/resume():
>     cpu->tb_env->tb_offset =
>         muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
>                  cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) +
>             cpu->tb_env->tb_offset -
>         qemu_clock_get_ns(QEMU_CLOCK_HOST)

Sign. And let me try that again, this time after caffeine:

cpu_start/resume():
    cpu->tb_env->tb_offset =
        muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
                 cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) +
            cpu->tb_env->tb_offset -
        cpu_get_host_ticks();

This should translate to: at CPU start, calculate the difference between
the current guest virtual timebase and the host timebase, storing the
difference in cpu->tb_env->tb_offset.


ATB,

Mark.
David Gibson Feb. 26, 2016, 4:35 a.m. UTC | #24
On Thu, Feb 25, 2016 at 09:50:20AM +0000, Mark Cave-Ayland wrote:
> On 25/02/16 05:00, Mark Cave-Ayland wrote:
> 
> > On 25/02/16 04:33, Mark Cave-Ayland wrote:
> > 
> >> cpu_start/resume():
> >>     cpu->tb_env->tb_offset =
> >>         qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) * tb_env->tb_freq +
> >>             cpu->tb_env->tb_offset -
> >>         qemu_clock_get_ns(QEMU_CLOCK_HOST)
> > 
> > Actually just realised this is slightly wrong and in fact should be:
> > 
> > cpu_start/resume():
> >     cpu->tb_env->tb_offset =
> >         muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
> >                  cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) +
> >             cpu->tb_env->tb_offset -
> >         qemu_clock_get_ns(QEMU_CLOCK_HOST)
> 
> Sign. And let me try that again, this time after caffeine:
> 
> cpu_start/resume():
>     cpu->tb_env->tb_offset =
>         muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
>                  cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) +
>             cpu->tb_env->tb_offset -
>         cpu_get_host_ticks();
> 
> This should translate to: at CPU start, calculate the difference between
> the current guest virtual timebase and the host timebase, storing the
> difference in cpu->tb_env->tb_offset.

Ummm... I think that's right.  Except that you need to make sure you
calculate the tb_offset just once, and set the same value to all guest
CPUs.  Otherwise the guest TBs may be slightly out of sync with each
other, which is bad (the host should have already ensure that all host
TBs are in sync with each other).

We really should make helper routines that each Power machine type can
use for this.  Unfortunately we can't put it directly into the common
ppc cpu migration code because of the requirement to keep the TBs
synced across the machine.
Mark Cave-Ayland Feb. 26, 2016, 12:29 p.m. UTC | #25
On 26/02/16 04:35, David Gibson wrote:

>> Sign. And let me try that again, this time after caffeine:
>>
>> cpu_start/resume():
>>     cpu->tb_env->tb_offset =
>>         muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
>>                  cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) +
>>             cpu->tb_env->tb_offset -
>>         cpu_get_host_ticks();
>>
>> This should translate to: at CPU start, calculate the difference between
>> the current guest virtual timebase and the host timebase, storing the
>> difference in cpu->tb_env->tb_offset.
> 
> Ummm... I think that's right.  Except that you need to make sure you
> calculate the tb_offset just once, and set the same value to all guest
> CPUs.  Otherwise the guest TBs may be slightly out of sync with each
> other, which is bad (the host should have already ensure that all host
> TBs are in sync with each other).

Nods. The reason I really like this solution is because it correctly
handles both paused/live machines and simplifies the migration logic
significantly. In fact, with this solution the only information you
would need in vmstate_ppc_timebase for migration would be the current
tb_offset so the receiving host can calculate the guest timebase from
the virtual clock accordingly.

> We really should make helper routines that each Power machine type can
> use for this.  Unfortunately we can't put it directly into the common
> ppc cpu migration code because of the requirement to keep the TBs
> synced across the machine.

Effectively I believe there are 2 cases here: TCG and KVM. For TCG all
of this is a no-op since tb/decr are already derived from the virtual
clock + tb_offset already so that really just leaves KVM.

From what you're saying we would need 2 hooks for KVM here: one on
machine start/resume which updates tb_offset for all vCPUs and one for
vCPU resume which updates just that one particular vCPU.

Just curious as to your comment regarding helper routines for each Power
machine type - is there any reason not to enable this globally for all
Power machine types? If tb_offset isn't supported by the guest then the
problem with not being able to handle a jump in timebase post-migration
still remains exactly the same.

The other question of course relates to the reason this thread was
started which is will this approach still support migrating the
decrementer? My feeling is that this would still work in that we would
encode the number of ticks before the decrementer reaches its interrupt
value into vmstate_ppc_timebase, whether high or low. For TCG it is easy
to ensure that the decrementer will still fire in n ticks time
post-migration (which solves my particular use case), but I don't have a
feeling as to whether this is possible, or indeed desirable for KVM.


ATB,

Mark.
David Gibson Feb. 29, 2016, 3:57 a.m. UTC | #26
On Fri, Feb 26, 2016 at 12:29:51PM +0000, Mark Cave-Ayland wrote:
> On 26/02/16 04:35, David Gibson wrote:
> 
> >> Sign. And let me try that again, this time after caffeine:
> >>
> >> cpu_start/resume():
> >>     cpu->tb_env->tb_offset =
> >>         muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
> >>                  cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) +
> >>             cpu->tb_env->tb_offset -
> >>         cpu_get_host_ticks();
> >>
> >> This should translate to: at CPU start, calculate the difference between
> >> the current guest virtual timebase and the host timebase, storing the
> >> difference in cpu->tb_env->tb_offset.
> > 
> > Ummm... I think that's right.  Except that you need to make sure you
> > calculate the tb_offset just once, and set the same value to all guest
> > CPUs.  Otherwise the guest TBs may be slightly out of sync with each
> > other, which is bad (the host should have already ensure that all host
> > TBs are in sync with each other).
> 
> Nods. The reason I really like this solution is because it correctly
> handles both paused/live machines and simplifies the migration logic
> significantly. In fact, with this solution the only information you
> would need in vmstate_ppc_timebase for migration would be the current
> tb_offset so the receiving host can calculate the guest timebase from
> the virtual clock accordingly.

> > We really should make helper routines that each Power machine type can
> > use for this.  Unfortunately we can't put it directly into the common
> > ppc cpu migration code because of the requirement to keep the TBs
> > synced across the machine.
> 
> Effectively I believe there are 2 cases here: TCG and KVM. For TCG all
> of this is a no-op since tb/decr are already derived from the virtual
> clock + tb_offset already so that really just leaves KVM.
> 
> >From what you're saying we would need 2 hooks for KVM here: one on
> machine start/resume which updates tb_offset for all vCPUs and one for
> vCPU resume which updates just that one particular vCPU.
> 
> Just curious as to your comment regarding helper routines for each Power
> machine type - is there any reason not to enable this globally for all
> Power machine types? If tb_offset isn't supported by the guest then the
> problem with not being able to handle a jump in timebase post-migration
> still remains exactly the same.

Well, I can't see a place to put it globally.  We can't put it in the
general vCPU stuff, because that would migrate each CPU's timebase
independently, but we want to migrate as a system wide operation, to
ensure the TBs are all synchronized in the destination guest.

To do the platform wide stuff, it pretty much has to be in the machine
type.

> The other question of course relates to the reason this thread was
> started which is will this approach still support migrating the
> decrementer? My feeling is that this would still work in that we would
> encode the number of ticks before the decrementer reaches its interrupt
> value into vmstate_ppc_timebase, whether high or low. For TCG it is easy
> to ensure that the decrementer will still fire in n ticks time
> post-migration (which solves my particular use case), but I don't have a
> feeling as to whether this is possible, or indeed desirable for KVM.

Yes, for TCG it should be fairly straightforward.  The DECR should be
calculated from the timebase.  We do need to check it on incoming
migration though, and check when we need to refire the decrementer
interrupt.

For KVM we'll need to load an appropriate value into the real
decrementer.  We probably want to migrate a difference between the TB
and the decrementer.  What could get hairy here is that there are a
number of different variants between ppc models on how exactly the
decrementer interrupt triggers: is it edge-triggered on 1->0
transition, edge-triggered on 0->-1 transition, or level triggered on
the DECR's sign bit.
Mark Cave-Ayland Feb. 29, 2016, 8:21 p.m. UTC | #27
On 29/02/16 03:57, David Gibson wrote:

> On Fri, Feb 26, 2016 at 12:29:51PM +0000, Mark Cave-Ayland wrote:
>> On 26/02/16 04:35, David Gibson wrote:
>>
>>>> Sign. And let me try that again, this time after caffeine:
>>>>
>>>> cpu_start/resume():
>>>>     cpu->tb_env->tb_offset =
>>>>         muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
>>>>                  cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) +
>>>>             cpu->tb_env->tb_offset -
>>>>         cpu_get_host_ticks();
>>>>
>>>> This should translate to: at CPU start, calculate the difference between
>>>> the current guest virtual timebase and the host timebase, storing the
>>>> difference in cpu->tb_env->tb_offset.
>>>
>>> Ummm... I think that's right.  Except that you need to make sure you
>>> calculate the tb_offset just once, and set the same value to all guest
>>> CPUs.  Otherwise the guest TBs may be slightly out of sync with each
>>> other, which is bad (the host should have already ensure that all host
>>> TBs are in sync with each other).
>>
>> Nods. The reason I really like this solution is because it correctly
>> handles both paused/live machines and simplifies the migration logic
>> significantly. In fact, with this solution the only information you
>> would need in vmstate_ppc_timebase for migration would be the current
>> tb_offset so the receiving host can calculate the guest timebase from
>> the virtual clock accordingly.
> 
>>> We really should make helper routines that each Power machine type can
>>> use for this.  Unfortunately we can't put it directly into the common
>>> ppc cpu migration code because of the requirement to keep the TBs
>>> synced across the machine.
>>
>> Effectively I believe there are 2 cases here: TCG and KVM. For TCG all
>> of this is a no-op since tb/decr are already derived from the virtual
>> clock + tb_offset already so that really just leaves KVM.
>>
>> >From what you're saying we would need 2 hooks for KVM here: one on
>> machine start/resume which updates tb_offset for all vCPUs and one for
>> vCPU resume which updates just that one particular vCPU.
>>
>> Just curious as to your comment regarding helper routines for each Power
>> machine type - is there any reason not to enable this globally for all
>> Power machine types? If tb_offset isn't supported by the guest then the
>> problem with not being able to handle a jump in timebase post-migration
>> still remains exactly the same.
> 
> Well, I can't see a place to put it globally.  We can't put it in the
> general vCPU stuff, because that would migrate each CPU's timebase
> independently, but we want to migrate as a system wide operation, to
> ensure the TBs are all synchronized in the destination guest.
> 
> To do the platform wide stuff, it pretty much has to be in the machine
> type.

(goes and looks)

It strikes me that a good solution here would be to introduce a new
PPCMachineClass from which all of the PPC machines could derive instead
of each different machine being a direct subclass of MachineClass (this
is not dissimilar as to the existing PCMachineClass) and move the
timebase and decrementer information into it. With this then all of the
PPC machine types can pick up the changes automatically.

>> The other question of course relates to the reason this thread was
>> started which is will this approach still support migrating the
>> decrementer? My feeling is that this would still work in that we would
>> encode the number of ticks before the decrementer reaches its interrupt
>> value into vmstate_ppc_timebase, whether high or low. For TCG it is easy
>> to ensure that the decrementer will still fire in n ticks time
>> post-migration (which solves my particular use case), but I don't have a
>> feeling as to whether this is possible, or indeed desirable for KVM.
> 
> Yes, for TCG it should be fairly straightforward.  The DECR should be
> calculated from the timebase.  We do need to check it on incoming
> migration though, and check when we need to refire the decrementer
> interrupt.

So just to confirm that while reads from the timebase are not privileged
(and so cannot be intercepted between host and guest), we still have
individual control of the per-guest decrementer interrupts?

> For KVM we'll need to load an appropriate value into the real
> decrementer.  We probably want to migrate a difference between the TB
> and the decrementer.  What could get hairy here is that there are a
> number of different variants between ppc models on how exactly the
> decrementer interrupt triggers: is it edge-triggered on 1->0
> transition, edge-triggered on 0->-1 transition, or level triggered on
> the DECR's sign bit.  

I don't think that is too much of a problem, since for TCG the logic is
already encapsulated in hw/ppc/ppc.c's __cpu_ppc_store_decr(). It should
be possible to move this logic into a shared helper function to keep
everything in one place.

Finally just to re-iterate that while I can write and compile-test a
potential patchset, I have no way to test the KVM parts. If I were to
dedicate some time to this, would yourself/Alex/Alexey be willing to
help test and debug these changes?


ATB,

Mark.
David Gibson March 10, 2016, 4:57 a.m. UTC | #28
On Mon, Feb 29, 2016 at 08:21:39PM +0000, Mark Cave-Ayland wrote:
> On 29/02/16 03:57, David Gibson wrote:
> 
> > On Fri, Feb 26, 2016 at 12:29:51PM +0000, Mark Cave-Ayland wrote:
> >> On 26/02/16 04:35, David Gibson wrote:
> >>
> >>>> Sign. And let me try that again, this time after caffeine:
> >>>>
> >>>> cpu_start/resume():
> >>>>     cpu->tb_env->tb_offset =
> >>>>         muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
> >>>>                  cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) +
> >>>>             cpu->tb_env->tb_offset -
> >>>>         cpu_get_host_ticks();
> >>>>
> >>>> This should translate to: at CPU start, calculate the difference between
> >>>> the current guest virtual timebase and the host timebase, storing the
> >>>> difference in cpu->tb_env->tb_offset.
> >>>
> >>> Ummm... I think that's right.  Except that you need to make sure you
> >>> calculate the tb_offset just once, and set the same value to all guest
> >>> CPUs.  Otherwise the guest TBs may be slightly out of sync with each
> >>> other, which is bad (the host should have already ensure that all host
> >>> TBs are in sync with each other).
> >>
> >> Nods. The reason I really like this solution is because it correctly
> >> handles both paused/live machines and simplifies the migration logic
> >> significantly. In fact, with this solution the only information you
> >> would need in vmstate_ppc_timebase for migration would be the current
> >> tb_offset so the receiving host can calculate the guest timebase from
> >> the virtual clock accordingly.
> > 
> >>> We really should make helper routines that each Power machine type can
> >>> use for this.  Unfortunately we can't put it directly into the common
> >>> ppc cpu migration code because of the requirement to keep the TBs
> >>> synced across the machine.
> >>
> >> Effectively I believe there are 2 cases here: TCG and KVM. For TCG all
> >> of this is a no-op since tb/decr are already derived from the virtual
> >> clock + tb_offset already so that really just leaves KVM.
> >>
> >> >From what you're saying we would need 2 hooks for KVM here: one on
> >> machine start/resume which updates tb_offset for all vCPUs and one for
> >> vCPU resume which updates just that one particular vCPU.
> >>
> >> Just curious as to your comment regarding helper routines for each Power
> >> machine type - is there any reason not to enable this globally for all
> >> Power machine types? If tb_offset isn't supported by the guest then the
> >> problem with not being able to handle a jump in timebase post-migration
> >> still remains exactly the same.
> > 
> > Well, I can't see a place to put it globally.  We can't put it in the
> > general vCPU stuff, because that would migrate each CPU's timebase
> > independently, but we want to migrate as a system wide operation, to
> > ensure the TBs are all synchronized in the destination guest.
> > 
> > To do the platform wide stuff, it pretty much has to be in the machine
> > type.
> 
> (goes and looks)
> 
> It strikes me that a good solution here would be to introduce a new
> PPCMachineClass from which all of the PPC machines could derive instead
> of each different machine being a direct subclass of MachineClass (this
> is not dissimilar as to the existing PCMachineClass) and move the
> timebase and decrementer information into it. With this then all of the
> PPC machine types can pick up the changes automatically.

Um.. maybe, yes.  There might be some gotches in attempting that
(particularly maintaining backwards compat for migration), but it
could be worth a shot.

> >> The other question of course relates to the reason this thread was
> >> started which is will this approach still support migrating the
> >> decrementer? My feeling is that this would still work in that we would
> >> encode the number of ticks before the decrementer reaches its interrupt
> >> value into vmstate_ppc_timebase, whether high or low. For TCG it is easy
> >> to ensure that the decrementer will still fire in n ticks time
> >> post-migration (which solves my particular use case), but I don't have a
> >> feeling as to whether this is possible, or indeed desirable for KVM.
> > 
> > Yes, for TCG it should be fairly straightforward.  The DECR should be
> > calculated from the timebase.  We do need to check it on incoming
> > migration though, and check when we need to refire the decrementer
> > interrupt.
> 
> So just to confirm that while reads from the timebase are not privileged
> (and so cannot be intercepted between host and guest), we still have
> individual control of the per-guest decrementer interrupts?

I'm not entirely sure I understand the question, but I think the
answer is yes.

> > For KVM we'll need to load an appropriate value into the real
> > decrementer.  We probably want to migrate a difference between the TB
> > and the decrementer.  What could get hairy here is that there are a
> > number of different variants between ppc models on how exactly the
> > decrementer interrupt triggers: is it edge-triggered on 1->0
> > transition, edge-triggered on 0->-1 transition, or level triggered on
> > the DECR's sign bit.  
> 
> I don't think that is too much of a problem, since for TCG the logic is
> already encapsulated in hw/ppc/ppc.c's __cpu_ppc_store_decr(). It should
> be possible to move this logic into a shared helper function to keep
> everything in one place.
> 
> Finally just to re-iterate that while I can write and compile-test a
> potential patchset, I have no way to test the KVM parts. If I were to
> dedicate some time to this, would yourself/Alex/Alexey be willing to
> help test and debug these changes?

Um.. to some extent.  I can test that you haven't broken spapr easily
enough.  Testing that the Mac machines are working under KVM PR is
trickier, since I'm not really set up for testing the Mac machine
types.  It might work if you can supply VM images and scripts, which I
can then execute on a ppc machine.
diff mbox

Patch

diff --git a/target-ppc/machine.c b/target-ppc/machine.c
index 251a84b..495e58d 100644
--- a/target-ppc/machine.c
+++ b/target-ppc/machine.c
@@ -141,6 +141,7 @@  static void cpu_pre_save(void *opaque)
     env->spr[SPR_CFAR] = env->cfar;
 #endif
     env->spr[SPR_BOOKE_SPEFSCR] = env->spe_fscr;
+    env->spr[SPR_DECR] = cpu_ppc_load_decr(env);

     for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
         env->spr[SPR_DBAT0U + 2*i] = env->DBAT[0][i];
@@ -175,6 +176,7 @@  static int cpu_post_load(void *opaque, int version_id)
     env->cfar = env->spr[SPR_CFAR];
 #endif
     env->spe_fscr = env->spr[SPR_BOOKE_SPEFSCR];
+    cpu_ppc_store_decr(env, env->spr[SPR_DECR]);

     for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
         env->DBAT[0][i] = env->spr[SPR_DBAT0U + 2*i];