Message ID: 568FC5FF.9070606@ilande.co.uk
State: New
On 01/09/2016 01:21 AM, Mark Cave-Ayland wrote:
> On 08/01/16 02:47, Alexey Kardashevskiy wrote:
>
>> On 01/07/2016 05:22 AM, Mark Cave-Ayland wrote:
>>> During local testing with TCG, intermittent errors were found when
>>> trying to migrate Darwin OS images.
>>>
>>> The underlying cause was that Darwin resets the decrementer value to
>>> fairly small values on each interrupt. cpu_ppc_set_tb_clk() sets the
>>> default value of the decrementer to 0xffffffff during initialisation,
>>> which typically corresponds to several seconds. Hence when restoring
>>> the image, the guest would effectively "lose" decrementer interrupts
>>> during this time, causing confusion in the guest.
>>>
>>> NOTE: there does seem to be some overlap here with the
>>> vmstate_ppc_timebase code, however it doesn't seem to handle multiple
>>> CPUs, which is why I've gone for an independent implementation.
>>
>> It does handle multiple CPUs:
>>
>> static int timebase_post_load(void *opaque, int version_id)
>> {
>> ...
>>     /* Set new offset to all CPUs */
>>     CPU_FOREACH(cpu) {
>>         PowerPCCPU *pcpu = POWERPC_CPU(cpu);
>>         pcpu->env.tb_env->tb_offset = tb_off_adj;
>>     }
>>
>> It does not transfer DECR though; it transfers the offset instead, one
>> for all CPUs.
>>
>> The easier solution would be just adding this instead of the whole patch:
>>
>> spr_register(env, SPR_DECR, "DECR",
>>              SPR_NOACCESS, SPR_NOACCESS,
>>              &spr_read_decr, &spr_write_decr,
>>              0x00000000);
>>
>> somewhere in target-ppc/translate_init.c for the CPUs you want to
>> support; gen_tbl() seems to be the right place for this. As long as it
>> is just spr_register() and not spr_register_kvm(), I suppose it should
>> not break KVM and pseries.
>
> I've just tried adding that but it then gives the following error on
> startup:
>
> Error: Trying to register SPR 22 (016) twice !
>
> Based upon this, the existing registration seems fine. I think the
> problem is that I can't see anything in __cpu_ppc_store_decr() that
> updates the spr[SPR_DECR] value when the decrementer register is
> changed, so it needs to be explicitly added to
> cpu_pre_save()/cpu_post_load():
>
> diff --git a/target-ppc/machine.c b/target-ppc/machine.c
> index 251a84b..495e58d 100644
> --- a/target-ppc/machine.c
> +++ b/target-ppc/machine.c
> @@ -141,6 +141,7 @@ static void cpu_pre_save(void *opaque)
>          env->spr[SPR_CFAR] = env->cfar;
>  #endif
>      env->spr[SPR_BOOKE_SPEFSCR] = env->spe_fscr;
> +    env->spr[SPR_DECR] = cpu_ppc_load_decr(env);
>
>      for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
>          env->spr[SPR_DBAT0U + 2*i] = env->DBAT[0][i];
> @@ -175,6 +176,7 @@ static int cpu_post_load(void *opaque, int version_id)
>          env->cfar = env->spr[SPR_CFAR];
>  #endif
>      env->spe_fscr = env->spr[SPR_BOOKE_SPEFSCR];
> +    cpu_ppc_store_decr(env, env->spr[SPR_DECR]);
>
>      for (i = 0; (i < 4) && (i < env->nb_BATs); i++) {
>          env->DBAT[0][i] = env->spr[SPR_DBAT0U + 2*i];
>
> I've just tried the diff above instead of my original version and
> repeated my savevm/loadvm pair test with a Darwin installation, and it
> also fixes the random hang I was seeing on loadvm.
>
> Seemingly this should work on KVM in that cpu_ppc_load_decr() and
> cpu_ppc_store_decr() become no-ops as long as KVM maintains
> env->spr[SPR_DECR], but a second set of eyeballs would be useful here.
>
> If you can let me know if this is suitable then I'll update the
> patchset based upon your feedback and send out a v2.

Looks good to me; I'd just wait a day or two in case David wants to
comment.
On Mon, Jan 11, 2016 at 12:18:31PM +1100, Alexey Kardashevskiy wrote:
> [snip]

I was on holiday and missed the start of this thread, I'm afraid, so I
don't fully understand the context here.
Am I right in thinking that this change will essentially freeze the
decrementer across the migration downtime? That doesn't seem right,
since the decrementer is supposed to be linked to the timebase and
represent real time passing.

In other words, isn't this just skipping the decrementer interrupts at
the qemu level rather than the guest level?

It seems that instead we should be reconstructing the decrementer on
the destination based on an offset from the timebase.
On 11/01/16 04:55, David Gibson wrote:
> [snip]
> I was on holiday and missed the start of this thread, I'm afraid, so I
> don't fully understand the context here.

It's part of a patchset I posted which fixes up problems I had with
migrating g3beige/mac99 machines under TCG:
https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg00544.html.

Apologies for not adding you as CC directly - are you still helping to
cover ppc-next for Alex?

> Am I right in thinking that this change will essentially freeze the
> decrementer across the migration downtime? That doesn't seem right,
> since the decrementer is supposed to be linked to the timebase and
> represent real time passing.

Yes, that's correct.

> In other words, isn't this just skipping the decrementer interrupts at
> the qemu level rather than the guest level?
>
> It seems that instead we should be reconstructing the decrementer on
> the destination based on an offset from the timebase.

Well, I haven't really looked at how time warping works during
migration in QEMU, but this seems to be the method used by
hw/ppc/ppc.c's timebase_post_load() function; my understanding is that
this isn't currently available for the g3beige/mac99 machines?

Should the patch in fact do this but also add decrementer support? And
if it did, would this have a negative effect on pseries?

But yes, assuming that the guest time warp is handled correctly (which
I assume is handled elsewhere, since this would also be required for
KVM), then I think that this should work.

ATB,

Mark.
On Mon, Jan 11, 2016 at 07:43:54AM +0000, Mark Cave-Ayland wrote:
> [snip]
> It's part of a patchset I posted which fixes up problems I had with
> migrating g3beige/mac99 machines under TCG:
> https://lists.gnu.org/archive/html/qemu-devel/2016-01/msg00544.html.
>
> Apologies for not adding you as CC directly - are you still helping to
> cover ppc-next for Alex?

Yes, I am.

>> Am I right in thinking that this change will essentially freeze the
>> decrementer across the migration downtime? That doesn't seem right,
>> since the decrementer is supposed to be linked to the timebase and
>> represent real time passing.
>
> Yes, that's correct.
>
>> In other words, isn't this just skipping the decrementer interrupts at
>> the qemu level rather than the guest level?
>>
>> It seems that instead we should be reconstructing the decrementer on
>> the destination based on an offset from the timebase.
>
> Well, I haven't really looked at how time warping works during
> migration in QEMU, but this seems to be the method used by
> hw/ppc/ppc.c's timebase_post_load() function; my understanding is that
> this isn't currently available for the g3beige/mac99 machines?

Ah.. yes, it looks like the timebase migration stuff is only hooked in
on the pseries machine type. As far as I can tell it should be trivial
to add it to other machines though - it doesn't appear to rely on
anything outside the common ppc timebase stuff.

> Should the patch in fact do this but also add decrementer support? And
> if it did, would this have a negative effect on pseries?

Yes, I think that's the right approach.
Note that rather than duplicating the logic to adjust the decrementer
over migration, it should be possible to encode the decrementer as a
diff from the timebase across the migration.

In fact.. I'm not sure it ever makes sense to store the decrementer
value as a direct value, since it's constantly changing - probably
makes more sense to derive it from the timebase whenever it is needed.

As far as I know that should be fine for pseries. I think the current
behaviour is probably technically wrong for pseries as well, but the
timing code of our Linux guests is robust enough to handle a small
displacement to the time of the next decrementer interrupt.
On 12/01/16 02:44, David Gibson wrote:
> [snip]
>
> Yes, I think that's the right approach. Note that rather than
> duplicating the logic to adjust the decrementer over migration, it
> should be possible to encode the decrementer as a diff from the
> timebase across the migration.
>
> In fact.. I'm not sure it ever makes sense to store the decrementer
> value as a direct value, since it's constantly changing - probably
> makes more sense to derive it from the timebase whenever it is needed.
>
> As far as I know that should be fine for pseries. I think the current
> behaviour is probably technically wrong for pseries as well, but the
> timing code of our Linux guests is robust enough to handle a small
> displacement to the time of the next decrementer interrupt.

I've had a bit of an experiment trying to implement something suitable,
but I'm not 100% certain I've got this right.
From the code, my understanding is that the timebase is effectively
free-running, and so if a migration takes 5s then you use tb_offset to
record the difference in the timebase across the migration and
subsequently apply the offset for all future reads of the timebase for
the lifetime of the CPU (i.e. the migrated guest is effectively living
at a point in the past where the timebase is consistent).

So I do see that it would make more sense to switch the decrementer
over to an offset from the timebase to avoid duplicating the time-shift
migration logic, but my PPC experience is fairly limited. Any
particular pointers as to what is the best way to approach this? It
seems getting this done separately first will make the changes in the
last patch trivial.

Many thanks,

Mark.
On Fri, Jan 15, 2016 at 05:46:10PM +0000, Mark Cave-Ayland wrote:
> [snip]
>
> I've had a bit of an experiment trying to implement something suitable,
> but I'm not 100% certain I've got this right.
> From the code my understanding is that the timebase is effectively
> free-running and so if a migration takes 5s then you use tb_offset to
> calculate the difference between the timebase before migration, and
> subsequently apply the offset for all future reads of the timebase for
> the lifetime of the CPU (i.e. the migrated guest is effectively living
> at a point in the past where the timebase is consistent).

Um.. no. At least in the usual configuration, the timebase represents
real, wall-clock time, so we expect it to jump forward across the
migration downtime. This is important because the guest will use the
timebase to calculate real time differences.

However, the absolute value of the timebase may be different on the
*host* between source and destination for migration. So what we need
to do is work out, before migration, the delta between host and guest
notions of wall clock time (as defined by the guest timebase), and
transfer that in the migration stream.

On the destination we initialize the guest timebase so that the guest
maintains the same realtime offset from the host. This means that as
long as source and destination system time is synchronized, guest
real-time tracking will continue correctly across the migration.

We do also make sure that the guest timebase never goes backwards, but
that would only happen if the source and destination host times were
badly out of sync.
On 18/01/16 04:51, David Gibson wrote:
> [snip]
>
> Um.. no. At least in the usual configuration, the timebase represents
> real, wall-clock time, so we expect it to jump forward across the
> migration downtime. This is important because the guest will use the
> timebase to calculate real time differences.
>
> However, the absolute value of the timebase may be different on the
> *host* between source and destination for migration. So what we need
> to do is before migration we work out the delta between host and guest
> notions of wall clock time (as defined by the guest timebase), and
> transfer that in the migration stream.
>
> On the destination we initialize the guest timebase so that the guest
> maintains the same realtime offset from the host. This means that as
> long as source and destination system time is synchronized, guest
> real-time tracking will continue correctly across the migration.
>
> We do also make sure that the guest timebase never goes backwards, but
> that would only happen if the source and destination host times were
> badly out of sync.

I had a poke at trying to include the timebase state in the Mac machines
but found that enabling this caused migration to freeze, even on top of
my existing (known working) patchset.

Looking closer at the code, I don't think it can work on an x86 TCG host
in its current form, since the host/guest clocks are used interchangeably
in places, e.g.
in timebase_pre_save() we have this:

    uint64_t ticks = cpu_get_host_ticks();
    ...
    tb->guest_timebase = ticks + first_ppc_cpu->env.tb_env->tb_offset;

So this implies that guest_timebase is set in higher resolution host
ticks. But then in timebase_post_load() we have this:

    migration_duration_tb = muldiv64(migration_duration_ns, freq,
                                     NANOSECONDS_PER_SECOND);

which calculates the migration time in timebase ticks, but then:

    guest_tb = tb_remote->guest_timebase + MIN(0, migration_duration_tb);
    tb_off_adj = guest_tb - cpu_get_host_ticks();

which mixes tb_remote->guest_timebase in host ticks with
migration_duration_tb in timebase ticks.

I think that tb_off_adj should be in host ticks, so should
migration_duration_tb be dropped and guest_tb be calculated from
migration_duration_ns instead? At least given the current mixing of
host ticks and timebase ticks, I think this can only work correctly on
PPC where the two are the same.

One other thing I noticed is that this code "broke" my normal testing
pattern, in which I tend to launch qemu-system-ppc with -loadvm foo -S
with the machine stopped. In this case the migration duration
calculation is performed at the moment state is restored, and not at
the point where the CPU is started.

Should the adjustment calculation not take place in a cpu_start()
function instead? Otherwise, at the point when I resume the paused
machine, the tb_off_adj calculation will be based in the past and
therefore wrong.

ATB,

Mark.
On Mon, Jan 25, 2016 at 05:48:02AM +0000, Mark Cave-Ayland wrote:
> [snip]
> > I had a poke at trying to include the timebase state in the Mac machines > but found that enabling this caused migration to freeze, even on top of > my existing (known working) patchset. > > Looking closer at the code, I don't think it can work on an x86 TCG host > in its current form since the host/guest clocks are used interchangeably > in places, e.g. in timebase_pre_save() we have this: > > uint64_t ticks = cpu_get_host_ticks(); > ... > tb->guest_timebase = ticks + first_ppc_cpu->env.tb_env->tb_offset; > > So this implies that guest_timebase is set in higher resolution host > ticks. But then in timebase_post_load() we have this: > > migration_duration_tb = muldiv64(migration_duration_ns, freq, > NANOSECONDS_PER_SECOND); > > which calculates the migration time in timebase ticks but then: > > guest_tb = tb_remote->guest_timebase + MIN(0, migration_duration_tb); > tb_off_adj = guest_tb - cpu_get_host_ticks(); > > which mixes tb_remote->guest_timebase in host ticks with > migration_duration_tb in timebase ticks. > > I think that tb_off_adj should be in host ticks so should > migration_duration_tb be dropped and guest_tb be calculated from > migration_duration_ns instead? At least given the current mixing of host > ticks and timebase ticks, I think this can only work correctly on PPC > where the two are the same. > > One other thing I noticed is that this code "broke" my normal testing > pattern when I tend to launch qemu-system-ppc with -loadvm foo -S with > the machine stopped. In this case the migration duration calculation is > performed at the moment state is restored, and not the point where the > CPU is started. > > Should the adjustment calculation not take place in a cpu_start() > function instead? Otherwise at the point when I resume the paused > machine the tb_off_adj calculation will be based in the past and > therefore wrong. Um.. so the migration duration is a complete red herring, regardless of the units. 
Remember, we only ever compute the guest timebase value at the moment the guest requests it - actually maintaining a current timebase value makes sense in hardware, but would be nuts in software. The timebase is a function of real, wall-clock time, and the migration destination has a notion of wall-clock time without reference to the source. So what you need to transmit for migration is enough information to compute the guest timebase from real-time - essentially just an offset between real-time and the timebase. The guest can potentially observe the migration duration as a jump in timebase values, but qemu doesn't need to do any calculations with it.
On Mon, 25 Jan 2016, David Gibson wrote: > Remember, we only ever compute the guest timebase value at the moment > the guest requests it - actually maintaining a current timebase value > makes sense in hardware, but would be nuts in software. > > The timebase is a function of real, wall-clock time, and the migration > destination has a notion of wall-clock time without reference to the > source. > > So what you need to transmit for migration is enough information to > compute the guest timebase from real-time - essentially just an offset > between real-time and the timebase. I don't know anything about all this but if I understand your conversation correctly then you don't really need an offset between real-time and timebase. That would be true assuming the source and destination have the same real-time, but that's not necessarily true. What would be needed is an offset between the source timebase and the destination's real time. That is, besides the offset on the source you would also need to work out the difference in the "real-time" on the source and destination and correct the transferred offset with this after migration. Is this handled somehow in QEMU (for example by doing some message exchange to establish the clock difference between the source and destination during migration) or is it assumed that migration always needs synchronised clocks done by some external means? Regards, BALATON Zoltan
On Mon, Jan 25, 2016 at 06:20:21PM +0100, BALATON Zoltan wrote: > On Mon, 25 Jan 2016, David Gibson wrote: > >Remember, we only ever compute the guest timebase value at the moment > >the guest requests it - actually maintaining a current timebase value > >makes sense in hardware, but would be nuts in software. > > > >The timebase is a function of real, wall-clock time, and the migration > >destination has a notion of wall-clock time without reference to the > >source. > > > >So what you need to transmit for migration is enough information to > >compute the guest timebase from real-time - essentially just an offset > >between real-time and the timebase. > > I don't know anything about all this but if I understand your conversation > correctly then you don't really need an offset between real-time and > timebase. That would be true assuming the source and destination has the > same real-time but that's not necessarily true. No, it's not necessarily true, but if it's not that's not really our problem to deal with. We do ensure that the guest timebase doesn't go backwards in this situation, but if the source and destination hosts don't have synchronized time, the guest real time may jump. But since we have no idea whether the source or destination had a better notion of real-time, we can't really do any better. > What would be needed is an > offset between source timebase and destination's real time. That is, besides > the offset on the source you would also need to work out the difference in > the "real-time" on the source and destination and correct the transferred > offset with this after migration. Yeah, that's not really possible in the current migration model. Nor is it something that's really qemu's job. 
> Is this handled somehow in QEMU (for example by doing some message exchange > to establish the clock difference between the source and destination during > migration) or is it assumed that migration always needs synchronised clocks > done by some external means? Migration itself doesn't need synced clocks, but if the hosts' real-time isn't well behaved, you can't really expect the guest's real time to be well behaved.
On 25/01/16 11:10, David Gibson wrote: > Um.. so the migration duration is a complete red herring, regardless > of the units. > > Remember, we only ever compute the guest timebase value at the moment > the guest requests it - actually maintaining a current timebase value > makes sense in hardware, but would be nuts in software. > > The timebase is a function of real, wall-clock time, and the migration > destination has a notion of wall-clock time without reference to the > source. > > So what you need to transmit for migration is enough information to > compute the guest timebase from real-time - essentially just an offset > between real-time and the timebase. > > The guest can potentially observe the migration duration as a jump in > timebase values, but qemu doesn't need to do any calculations with it. Thanks for more pointers - I think I'm slowly getting there. My current thoughts are that the basic migration algorithm is doing the right thing in that it works out the number of host ticks different between source and destination. I have a slight query with this section of code though: migration_duration_tb = muldiv64(migration_duration_ns, freq, NANOSECONDS_PER_SECOND); This is not technically correct on TCG x86 since the timebase is the x86 TSC which is running somewhere in the GHz range, compared to freq which is hard-coded to 16MHz. However this doesn't seem to matter because the timebase adjustment is limited to a maximum of 1s. Why should this be if the timebase is supposed to be free running as you mentioned in a previous email? 
AFAICT the main problem on TCG x86 is that post-migration the timebase calculated by cpu_ppc_get_tb() is incorrect:

    uint64_t cpu_ppc_get_tb(ppc_tb_t *tb_env, uint64_t vmclk, int64_t tb_offset)
    {
        /* TB time in tb periods */
        return muldiv64(vmclk, tb_env->tb_freq, get_ticks_per_sec()) + tb_offset;
    }

For a typical savevm/loadvm pair I see something like this:

savevm:

    tb->guest_timebase = 26281306490558
    qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511

loadvm:

    cpu_get_host_ticks() = 26289847005259
    tb_off_adj = -8540514701
    qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511
    cpu_ppc_get_tb() = -15785159386

But as cpu_ppc_get_tb() uses QEMU_CLOCK_VIRTUAL for vmclk, we end up with a negative number for the timebase since the virtual clock is dwarfed by the number of TSC ticks calculated for tb_off_adj. This will work on a PPC host, though, since cpu_get_host_ticks() is also derived from the timebase.

Another question I have is about cpu_ppc_load_tbl():

    uint64_t cpu_ppc_load_tbl (CPUPPCState *env)
    {
        ppc_tb_t *tb_env = env->tb_env;
        uint64_t tb;

        if (kvm_enabled()) {
            return env->spr[SPR_TBL];
        }

        tb = cpu_ppc_get_tb(tb_env, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL),
                            tb_env->tb_offset);
        LOG_TB("%s: tb %016" PRIx64 "\n", __func__, tb);

        return tb;
    }

Compared with cpu_ppc_load_tbu(), it returns uint64_t rather than uint32_t and doesn't appear to mask the bottom 32 bits of the timebase value?

ATB,

Mark.
On 26/01/16 22:31, Mark Cave-Ayland wrote: > For a typical savevm/loadvm pair I see something like this: > > savevm: > > tb->guest_timebase = 26281306490558 > qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511 > > loadvm: > > cpu_get_host_ticks() = 26289847005259 > tb_off_adj = -8540514701 > qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511 > cpu_ppc_get_tb() = -15785159386 > > But as cpu_ppc_get_tb() uses QEMU_CLOCK_VIRTUAL for vmclk we end up with > a negative number for the timebase since the virtual clock is dwarfed by > the number of TSC ticks calculated for tb_off_adj. This will work on a > PPC host though since cpu_host_get_ticks() is also derived from the > timebase. Ah... the magic of signed numbers. Here's another attempt but this time with cpu_ppc_get_tb() printed unsigned: savevm: tb->guest_timebase = 2597350923332 qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 4199081744 cpu_ppc_get_tb() = 69704756 loadvm: cpu_get_host_ticks() = 2606380782824 tb_off_adj = -9029859492 qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 4199081744 cpu_ppc_get_tb() = 18446744064749396880 This implies that the cpu_ppc_get_tb() value post-migration is far too low - presumably because tb_env->tb_freq is only 16MHz compared to the actual TSC rate? ATB, Mark.
On Tue, Jan 26, 2016 at 10:31:19PM +0000, Mark Cave-Ayland wrote: > On 25/01/16 11:10, David Gibson wrote: > > > Um.. so the migration duration is a complete red herring, regardless > > of the units. > > > > Remember, we only ever compute the guest timebase value at the moment > > the guest requests it - actually maintaining a current timebase value > > makes sense in hardware, but would be nuts in software. > > > > The timebase is a function of real, wall-clock time, and the migration > > destination has a notion of wall-clock time without reference to the > > source. > > > > So what you need to transmit for migration is enough information to > > compute the guest timebase from real-time - essentially just an offset > > between real-time and the timebase. > > > > The guest can potentially observe the migration duration as a jump in > > timebase values, but qemu doesn't need to do any calculations with it. > > Thanks for more pointers - I think I'm slowly getting there. My current > thoughts are that the basic migration algorithm is doing the right thing > in that it works out the number of host ticks different between source > and destination. Sorry, I've taken a while to reply to this. I realised the tb migration didn't work the way I thought it did, so I've had to get my head around what's actually going on. I had thought that it transferred only meta-information telling the destination how to calculate the timebase, without actually working out the timebase value at any particular moment. In fact, what it sends is basically the tuple of (timebase, realtime) at the point of sending the migration stream. The destination then uses that to work out how to compute the timebase from realtime there. I'm not convinced this is a great approach, but it should basically work. However, as you've seen there are also some Just Plain Bugs in the logic for this. 
> I have a slight query with this section of code though: > > migration_duration_tb = muldiv64(migration_duration_ns, freq, > NANOSECONDS_PER_SECOND); > > This is not technically correct on TCG x86 since the timebase is the x86 > TSC which is running somewhere in the GHz range, compared to freq which > is hard-coded to 16MHz. Um.. what? AFAICT that line doesn't have any reference to the TSC speed. Just ns and the (guest) tb. Also 16MHz is only for the oldworld Macs - modern ppc cpus have the TB frequency architected as 512MHz. > However this doesn't seem to matter because the > timebase adjustment is limited to a maximum of 1s. Why should this be if > the timebase is supposed to be free running as you mentioned in a > previous email? AFAICT, what it's doing here is assuming that if the migration duration is >1s (or appears to be >1s) then it's because the host clocks are out of sync and so just capping the elapsed tb time at 1s. That's just wrong, IMO. 1s is a long downtime for a live migration, but it's not impossible, and it will happen nearly always in the scenario you've discussed of manually loading the migration stream from a file. But more to the point, trying to maintain correctness of the timebase when the hosts are out of sync is basically futile. There's no other reference we can use, so all we can achieve is getting a different wrong value from what we'd get by blindly trusting the host clock. We do need to constrain the tb from going backwards, because that will cause chaos on the guest, but otherwise we should just trust the host clock and ditch that 1s clamp. If the hosts are out of sync, then guest time will jump, but that was always going to happen. 
> AFAICT the main problem on TCG x86 is that post-migration the timebase > calculated by cpu_ppc_get_tb() is incorrect: > > uint64_t cpu_ppc_get_tb(ppc_tb_t *tb_env, uint64_t vmclk, int64_t tb_offset) > { > /* TB time in tb periods */ > return muldiv64(vmclk, tb_env->tb_freq, get_ticks_per_sec()) + > tb_offset; > } So the problem here is that get_ticks_per_sec() (which always returns 1,000,000,000) is not talking about the same ticks as cpu_get_host_ticks(). That may not have been true when this code was written. > For a typical savevm/loadvm pair I see something like this: > > savevm: > > tb->guest_timebase = 26281306490558 > qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511 > > loadvm: > > cpu_get_host_ticks() = 26289847005259 > tb_off_adj = -8540514701 > qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511 > cpu_ppc_get_tb() = -15785159386 > > But as cpu_ppc_get_tb() uses QEMU_CLOCK_VIRTUAL for vmclk we end up with > a negative number for the timebase since the virtual clock is dwarfed by > the number of TSC ticks calculated for tb_off_adj. This will work on a > PPC host though since cpu_get_host_ticks() is also derived from the > timebase. Yeah, we shouldn't be using cpu_get_host_ticks() at all - or anything else which depends on a host frequency. We should only be using qemu interfaces which work in real time units (nanoseconds, usually). > Another question I have is cpu_ppc_load_tbl(): > > uint64_t cpu_ppc_load_tbl (CPUPPCState *env) > { > ppc_tb_t *tb_env = env->tb_env; > uint64_t tb; > > if (kvm_enabled()) { > return env->spr[SPR_TBL]; > } > > tb = cpu_ppc_get_tb(tb_env, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), > tb_env->tb_offset); > LOG_TB("%s: tb %016" PRIx64 "\n", __func__, tb); > > return tb; > } > > Compared with cpu_ppc_load_tbu(), it is returning uint64_t rather than > uint32_t and doesn't appear to mask the bottom 32-bits of the timebase > value? That's because in 64-bit mode loading the "TBL" is actually a load of the whole 64-bit timebase.
On 01/02/16 00:52, David Gibson wrote: >> Thanks for more pointers - I think I'm slowly getting there. My current >> thoughts are that the basic migration algorithm is doing the right thing >> in that it works out the number of host ticks different between source >> and destination. > > Sorry, I've taken a while to reply to this. I realised the tb > migration didn't work the way I thought it did, so I've had to get my > head around what's actually going on. No problem - it's turning out to be a lot more complicated than I initially expected. > I had thought that it transferred only meta-information telling the > destination how to calculate the timebase, without actually working > out the timebase value at any particular moment. > > In fact, what it sends is basically the tuple of (timebase, realtime) > at the point of sending the migration stream. The destination then > uses that to work out how to compute the timebase from realtime there. > > I'm not convinced this is a great approach, but it should basically > work. However, as you've seen there are also some Just Plain Bugs in > the logic for this. > >> I have a slight query with this section of code though: >> >> migration_duration_tb = muldiv64(migration_duration_ns, freq, >> NANOSECONDS_PER_SECOND); >> >> This is not technically correct on TCG x86 since the timebase is the x86 >> TSC which is running somewhere in the GHz range, compared to freq which >> is hard-coded to 16MHz. > > Um.. what? AFAICT that line doesn't have any reference to the TSC > speed. Just ns and the (guest) tb). Also 16MHz is only for the > oldworld Macs - modern ppc cpus have the TB frequency architected as > 512MHz. On TCG the software timebase for the Mac guests is fixed at 16MHz, so how does KVM handle this? Does it compensate by emulating the 16MHz timebase for the guest even though the host has a 512MHz timebase? >> However this doesn't seem to matter because the >> timebase adjustment is limited to a maximum of 1s. 
Why should this be if >> the timebase is supposed to be free running as you mentioned in a >> previous email? > > AFAICT, what it's doing here is assuming that if the migration > duration is >1s (or appears to be >1s) then it's because the host > clocks are out of sync and so just capping the elapsed tb time at 1s. > > That's just wrong, IMO. 1s is a long downtime for a live migration, > but it's not impossible, and it will happen nearly always in the > scenariou you've discussed of manually loading the migration stream > from a file. > > But more to the point, trying to maintain correctness of the timebase > when the hosts are out of sync is basically futile. There's no other > reference we can use, so all we can achieve is getting a different > wrong value from what we'd get by blindly trusting the host clock. > > We do need to constrain the tb from going backwards, because that will > cause chaos on the guest, but otherwise we should just trust the host > clock and ditch that 1s clamp. If the hosts are out of sync, then > guest time will jump, but that was always going to happen. Going back to your earlier email you suggested that the host timebase is always continuously running, even when the guest is paused. But then resuming the guest then the timebase must jump in the guest regardless? If this is the case then this is the big difference between TCG and KVM guests: TCG timebase is derived from the virtual clock which solves the problem of paused guests during migration. For example with the existing migration code, what would happen if you did a migration with the guest paused on KVM? The offset would surely be wrong as it was calculated at the end of migration. And another thought: should it be possible to migrate guests between TCG and KVM hosts at will? 
>> AFAICT the main problem on TCG x86 is that post-migration the timebase >> calculated by cpu_ppc_get_tb() is incorrect: >> >> uint64_t cpu_ppc_get_tb(ppc_tb_t *tb_env, uint64_t vmclk, int64_t tb_offset) >> { >> /* TB time in tb periods */ >> return muldiv64(vmclk, tb_env->tb_freq, get_ticks_per_sec()) + >> tb_offset; >> } > > > So the problem here is that get_ticks_per_sec() (which always returns > 1,000,000,000) is not talking about the same ticks as > cpu_get_host_ticks(). That may not have been true when this code was > written. Yes. That's basically what I was trying to say but I think you've expressed it far more eloquently than I did. >> For a typical savevm/loadvm pair I see something like this: >> >> savevm: >> >> tb->guest_timebase = 26281306490558 >> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511 >> >> loadvm: >> >> cpu_get_host_ticks() = 26289847005259 >> tb_off_adj = -8540514701 >> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511 >> cpu_ppc_get_tb() = -15785159386 >> >> But as cpu_ppc_get_tb() uses QEMU_CLOCK_VIRTUAL for vmclk we end up with >> a negative number for the timebase since the virtual clock is dwarfed by >> the number of TSC ticks calculated for tb_off_adj. This will work on a >> PPC host though since cpu_host_get_ticks() is also derived from the >> timebase. > > Yeah, we shouldn't be using cpu_host_get_ticks() at all - or anything > else which depends on a host frequency. We should only be using qemu > interfaces which work in real time units (nanoseconds, usually). I agree that this is the right way forward. Unfortunately the timebase behaviour under KVM PPC is quite new to me, so please do bear with me for asking all these questions. ATB, Mark.
On Tue, Feb 02, 2016 at 11:41:40PM +0000, Mark Cave-Ayland wrote: > On 01/02/16 00:52, David Gibson wrote: > > >> Thanks for more pointers - I think I'm slowly getting there. My current > >> thoughts are that the basic migration algorithm is doing the right thing > >> in that it works out the number of host ticks different between source > >> and destination. > > > > Sorry, I've take a while to reply to this. I realised the tb > > migration didn't work the way I thought it did, so I've had to get my > > head around what's actually going on. > > No problem - it's turning out to be a lot more complicated than I > initially expected. > > > I had thought that it transferred only meta-information telling the > > destination how to calculate the timebase, without actually working > > out the timebase value at any particular moment. > > > > In fact, what it sends is basically the tuple of (timebase, realtime) > > at the point of sending the migration stream. The destination then > > uses that to work out how to compute the timebase from realtime there. > > > > I'm not convinced this is a great approach, but it should basically > > work. However, as you've seen there are also some Just Plain Bugs in > > the logic for this. > > > >> I have a slight query with this section of code though: > >> > >> migration_duration_tb = muldiv64(migration_duration_ns, freq, > >> NANOSECONDS_PER_SECOND); > >> > >> This is not technically correct on TCG x86 since the timebase is the x86 > >> TSC which is running somewhere in the GHz range, compared to freq which > >> is hard-coded to 16MHz. > > > > Um.. what? AFAICT that line doesn't have any reference to the TSC > > speed. Just ns and the (guest) tb). Also 16MHz is only for the > > oldworld Macs - modern ppc cpus have the TB frequency architected as > > 512MHz. > > On TCG the software timebase for the Mac guests is fixed at 16MHz so how > does KVM handle this? 
> Does it compensate by emulating the 16MHz timebase > for the guest even though the host has a 512HMz timebase? No, it can't. The timebase is not privileged, so there's no way to virtualize it for the guest. So, the best we can do is to detect KVM, override the guest device tree with the host timebase frequency and hope the guest reads it rather than assuming a fixed value for the platform. > >> However this doesn't seem to matter because the > >> timebase adjustment is limited to a maximum of 1s. Why should this be if > >> the timebase is supposed to be free running as you mentioned in a > >> previous email? > > > > AFAICT, what it's doing here is assuming that if the migration > > duration is >1s (or appears to be >1s) then it's because the host > > clocks are out of sync and so just capping the elapsed tb time at 1s. > > > > That's just wrong, IMO. 1s is a long downtime for a live migration, > > but it's not impossible, and it will happen nearly always in the > > scenariou you've discussed of manually loading the migration stream > > from a file. > > > > But more to the point, trying to maintain correctness of the timebase > > when the hosts are out of sync is basically futile. There's no other > > reference we can use, so all we can achieve is getting a different > > wrong value from what we'd get by blindly trusting the host clock. > > > > We do need to constrain the tb from going backwards, because that will > > cause chaos on the guest, but otherwise we should just trust the host > > clock and ditch that 1s clamp. If the hosts are out of sync, then > > guest time will jump, but that was always going to happen. > > Going back to your earlier email you suggested that the host timebase is > always continuously running, even when the guest is paused. But then > resuming the guest then the timebase must jump in the guest regardless? 
> > If this is the case then this is the big difference between TCG and KVM > guests: TCG timebase is derived from the virtual clock which solves the > problem of paused guests during migration. For example with the existing > migration code, what would happen if you did a migration with the guest > paused on KVM? The offset would surely be wrong as it was calculated at > the end of migration. So there are two different cases to consider here. One is when the guest is paused incidentally, such as during migration; the other is when the guest is explicitly paused. In the first case the timebase absolutely should keep running (or appear to do so), since it's the primary source of real time for the guest. In the second case, it's a bit unclear what the right thing to do is. Keeping the tb running means accurate realtime, but stopping it is often better for debugging, which is one of the main reasons to explicitly pause. I believe spapr on KVM HV will keep the TB going, but the TSC on x86 will be stopped. > And another thought: should it be possible to migrate guests between TCG > and KVM hosts at will? It would be nice if that's possible, but I don't think it's a primary goal. > >> AFAICT the main problem on TCG x86 is that post-migration the timebase > >> calculated by cpu_ppc_get_tb() is incorrect: > >> > >> uint64_t cpu_ppc_get_tb(ppc_tb_t *tb_env, uint64_t vmclk, int64_t tb_offset) > >> { > >> /* TB time in tb periods */ > >> return muldiv64(vmclk, tb_env->tb_freq, get_ticks_per_sec()) + > >> tb_offset; > >> } > > > > > > So the problem here is that get_ticks_per_sec() (which always returns > > 1,000,000,000) is not talking about the same ticks as > > cpu_get_host_ticks(). That may not have been true when this code was > > written. > > Yes. That's basically what I was trying to say but I think you've > expressed it far more eloquently than I did. 
> > >> For a typical savevm/loadvm pair I see something like this: > >> > >> savevm: > >> > >> tb->guest_timebase = 26281306490558 > >> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511 > >> > >> loadvm: > >> > >> cpu_get_host_ticks() = 26289847005259 > >> tb_off_adj = -8540514701 > >> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) = 7040725511 > >> cpu_ppc_get_tb() = -15785159386 > >> > >> But as cpu_ppc_get_tb() uses QEMU_CLOCK_VIRTUAL for vmclk we end up with > >> a negative number for the timebase since the virtual clock is dwarfed by > >> the number of TSC ticks calculated for tb_off_adj. This will work on a > >> PPC host though since cpu_host_get_ticks() is also derived from the > >> timebase. > > > > Yeah, we shouldn't be using cpu_host_get_ticks() at all - or anything > > else which depends on a host frequency. We should only be using qemu > > interfaces which work in real time units (nanoseconds, usually). > > I agree that this is the right way forward. Unfortunately the timebase > behaviour under KVM PPC is quite new to me, so please do bear with me > for asking all these questions. > > > ATB, > > Mark. >
> Am 03.02.2016 um 06:59 schrieb David Gibson <david@gibson.dropbear.id.au>: > >> On Tue, Feb 02, 2016 at 11:41:40PM +0000, Mark Cave-Ayland wrote: >> On 01/02/16 00:52, David Gibson wrote: >> >>>> Thanks for more pointers - I think I'm slowly getting there. My current >>>> thoughts are that the basic migration algorithm is doing the right thing >>>> in that it works out the number of host ticks different between source >>>> and destination. >>> >>> Sorry, I've take a while to reply to this. I realised the tb >>> migration didn't work the way I thought it did, so I've had to get my >>> head around what's actually going on. >> >> No problem - it's turning out to be a lot more complicated than I >> initially expected. >> >>> I had thought that it transferred only meta-information telling the >>> destination how to calculate the timebase, without actually working >>> out the timebase value at any particular moment. >>> >>> In fact, what it sends is basically the tuple of (timebase, realtime) >>> at the point of sending the migration stream. The destination then >>> uses that to work out how to compute the timebase from realtime there. >>> >>> I'm not convinced this is a great approach, but it should basically >>> work. However, as you've seen there are also some Just Plain Bugs in >>> the logic for this. >>> >>>> I have a slight query with this section of code though: >>>> >>>> migration_duration_tb = muldiv64(migration_duration_ns, freq, >>>> NANOSECONDS_PER_SECOND); >>>> >>>> This is not technically correct on TCG x86 since the timebase is the x86 >>>> TSC which is running somewhere in the GHz range, compared to freq which >>>> is hard-coded to 16MHz. >>> >>> Um.. what? AFAICT that line doesn't have any reference to the TSC >>> speed. Just ns and the (guest) tb). Also 16MHz is only for the >>> oldworld Macs - modern ppc cpus have the TB frequency architected as >>> 512MHz. 
>> >> On TCG the software timebase for the Mac guests is fixed at 16MHz so how >> does KVM handle this? >> Does it compensate by emulating the 16MHz timebase >> for the guest even though the host has a 512MHz timebase? > > No, it can't. The timebase is not privileged, so there's no way to > virtualize it for the guest. So, the best we can do is to detect KVM, > override the guest device tree with the host timebase frequency and > hope the guest reads it rather than assuming a fixed value for the > platform. IIRC Mac OS (9&X) calculates the tb frequency using the Cuda clock as reference. So it doesn't honor our dt properties like Linux does, but it can adapt to different frequencies. Alex
On 03/02/16 04:59, David Gibson wrote: >> Going back to your earlier email you suggested that the host timebase is >> always continuously running, even when the guest is paused. But then when >> resuming the guest, the timebase must jump in the guest regardless? >> >> If this is the case then this is the big difference between TCG and KVM >> guests: TCG timebase is derived from the virtual clock which solves the >> problem of paused guests during migration. For example with the existing >> migration code, what would happen if you did a migration with the guest >> paused on KVM? The offset would surely be wrong as it was calculated at >> the end of migration. > > So there are two different cases to consider here. One is when the > guest is paused incidentally, such as during migration, the other is > when the guest is explicitly paused. > > In the first case the timebase absolutely should keep running (or > appear to do so), since it's the primary source of real time for the > guest. I'm not sure I understand this, since if the guest is paused either deliberately or incidentally during migration then isn't the timebase also frozen? Or is it external to the CPU? > In the second case, it's a bit unclear what the right thing to do is. > Keeping the tb running means accurate realtime, but stopping it is > often better for debugging, which is one of the main reasons to > explicitly pause. > > I believe spapr on KVM HV will keep the TB going, but the TSC on x86 > will be stopped. Is this from a guest-centric view, i.e. if I pause a VM and wait 20 mins then when the guest resumes the timebase will jump forward by 20 mins worth of ticks? ATB, Mark.
On Tue, Feb 23, 2016 at 09:27:09PM +0000, Mark Cave-Ayland wrote: > On 03/02/16 04:59, David Gibson wrote: > > >> Going back to your earlier email you suggested that the host timebase is > >> always continuously running, even when the guest is paused. But then > >> resuming the guest then the timebase must jump in the guest regardless? > >> > >> If this is the case then this is the big difference between TCG and KVM > >> guests: TCG timebase is derived from the virtual clock which solves the > >> problem of paused guests during migration. For example with the existing > >> migration code, what would happen if you did a migration with the guest > >> paused on KVM? The offset would surely be wrong as it was calculated at > >> the end of migration. > > > > So there are two different cases to consider here. Once is when the > > guest is paused incidentally, such as during migration, the other is > > when the guest is explicitly paused. > > > > In the first case the timebase absolutely should keep running (or > > appear to do so), since it's the primary source of real time for the > > guest. > > I'm not sure I understand this, since if the guest is paused either > deliberately or incidentally during migration then isn't the timebase > also frozen? Or is it external to the CPU? I don't really understand the question. Migration has no equivalent in real hardware, so there's no "real" behaviour to mimic. If we freeze the TB during migration, then the guest's clock will get out of sync with wall clock time, and in a production environment that's really bad. So no, we absolutely must not freeze the TB during migration. When the guest has been explicitly paused, there's a case to be made either way. > > In the second case, it's a bit unclear what the right thing to do is. > > Keeping the tb running means accurate realtime, but stopping it is > > often better for debugging, which is one of the main reasons to > > explicitly pause. 
> > > > I believe spapr on KVM HV will keep the TB going, but the TSC on x86 > > will be stopped. > > Is this from a guest-centric view, i.e. if I pause a VM and wait 20 mins > then when the guest resumes the timebase will jump forward by 20 mins > worth of ticks? Yes, that's correct.
David Gibson <david@gibson.dropbear.id.au> wrote: > On Tue, Feb 23, 2016 at 09:27:09PM +0000, Mark Cave-Ayland wrote: >> On 03/02/16 04:59, David Gibson wrote: >> >> >> Going back to your earlier email you suggested that the host timebase is >> >> always continuously running, even when the guest is paused. But then >> >> resuming the guest then the timebase must jump in the guest regardless? >> >> >> >> If this is the case then this is the big difference between TCG and KVM >> >> guests: TCG timebase is derived from the virtual clock which solves the >> >> problem of paused guests during migration. For example with the existing >> >> migration code, what would happen if you did a migration with the guest >> >> paused on KVM? The offset would surely be wrong as it was calculated at >> >> the end of migration. >> > >> > So there are two different cases to consider here. One is when the >> > guest is paused incidentally, such as during migration, the other is >> > when the guest is explicitly paused. >> > >> > In the first case the timebase absolutely should keep running (or >> > appear to do so), since it's the primary source of real time for the >> > guest. >> >> I'm not sure I understand this, since if the guest is paused either >> deliberately or incidentally during migration then isn't the timebase >> also frozen? Or is it external to the CPU? > > I don't really understand the question. Migration has no equivalent > in real hardware, so there's no "real" behaviour to mimic. If we > freeze the TB during migration, then the guest's clock will get out of > sync with wall clock time, and in a production environment that's > really bad. So no, we absolutely must not freeze the TB during > migration. > > When the guest has been explicitly paused, there's a case to be made > either way. If this is the case, can't we just change the device to read the clock from the host at device instantiation and call it a day? 
(* Notice that I haven't seen the previous discussion *) On migration, having a post-load function that just loads the right value for that device should work. Or if we want to make it work for pause/cont, we should have a notifier to be run each time "cont" is issued, and put a callback there? Or am I missing something important? > >> > In the second case, it's a bit unclear what the right thing to do is. >> > Keeping the tb running means accurate realtime, but stopping it is >> > often better for debugging, which is one of the main reasons to >> > explicitly pause. >> > >> > I believe spapr on KVM HV will keep the TB going, but the TSC on x86 >> > will be stopped. >> >> Is this from a guest-centric view, i.e. if I pause a VM and wait 20 mins >> then when the guest resumes the timebase will jump forward by 20 mins >> worth of ticks? > > Yes, that's correct. I.e. my proposal fixes this? If you want to make it really, really "classy", you can look at the mess we have on x86 to introduce ticks "slewly" for Windows 95 (and XP?) for a while, but I don't think that solution would work for 20 mins of ticks :p Later, Juan.
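Juan's notifier idea can be modelled standalone. In real QEMU the registration would go through qemu_add_vm_change_state_handler(); the registry, callback signature, and offset computation below are simplified, hypothetical stand-ins, not the actual API:

```c
#include <stdint.h>

#define MAX_HANDLERS 8

typedef void (*vm_state_cb)(void *opaque, int running);

static struct { vm_state_cb cb; void *opaque; } handlers[MAX_HANDLERS];
static int nhandlers;

/* Register a callback to be run on VM state changes (stand-in for
 * qemu_add_vm_change_state_handler()). */
static void add_vm_change_state_handler(vm_state_cb cb, void *opaque)
{
    if (nhandlers < MAX_HANDLERS) {
        handlers[nhandlers].cb = cb;
        handlers[nhandlers].opaque = opaque;
        nhandlers++;
    }
}

/* What the "cont" command would trigger: notify every handler that the
 * VM is running again. */
static void vm_resume(void)
{
    for (int i = 0; i < nhandlers; i++) {
        handlers[i].cb(handlers[i].opaque, 1 /* running */);
    }
}

/* Example handler: refresh the guest TB offset when the VM resumes.
 * The increment is a placeholder for the real offset recomputation. */
static void tb_resync_on_cont(void *opaque, int running)
{
    int64_t *tb_offset = opaque;
    if (running) {
        *tb_offset += 1;
    }
}
```

The point of the shape: the timebase device registers once, and every pause/cont cycle (including the implicit one around migration) runs the resync without the migration code needing to special-case it.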
On Wed, Feb 24, 2016 at 01:31:05PM +0100, Juan Quintela wrote: > David Gibson <david@gibson.dropbear.id.au> wrote: > > On Tue, Feb 23, 2016 at 09:27:09PM +0000, Mark Cave-Ayland wrote: > >> On 03/02/16 04:59, David Gibson wrote: > >> > >> >> Going back to your earlier email you suggested that the host timebase is > >> >> always continuously running, even when the guest is paused. But then > >> >> resuming the guest then the timebase must jump in the guest regardless? > >> >> > >> >> If this is the case then this is the big difference between TCG and KVM > >> >> guests: TCG timebase is derived from the virtual clock which solves the > >> >> problem of paused guests during migration. For example with the existing > >> >> migration code, what would happen if you did a migration with the guest > >> >> paused on KVM? The offset would surely be wrong as it was calculated at > >> >> the end of migration. > >> > > >> > So there are two different cases to consider here. Once is when the > >> > guest is paused incidentally, such as during migration, the other is > >> > when the guest is explicitly paused. > >> > > >> > In the first case the timebase absolutely should keep running (or > >> > appear to do so), since it's the primary source of real time for the > >> > guest. > >> > >> I'm not sure I understand this, since if the guest is paused either > >> deliberately or incidentally during migration then isn't the timebase > >> also frozen? Or is it external to the CPU? > > > > I don't really understand the question. Migration has no equivalent > > in real hardware, so there's no "real" behaviour to mimic. If we > > freeze the TB during migration, then the guest's clock will get out of > > sync with wall clock time, and in a production environment that's > > really bad. So no, we absolutely must not freeze the TB during > > migration. > > > > When the guest has been explicitly paused, there's a case to be made > > either way. 
> > If this is the case, can't we just change the device to just read the > clock from the host at device insntantiation and call it a day? That's not quite enough, because although the timebase advances in real time, it will have an offset from realtime that varies boot to boot. > (* Notice that I haven't seen the previous discussion *) > > On migration, having a post-load function that just loads the right > value for that device should work. Or if we want to make it work for > pause/cont, we should have a notifier to be run each time "cont" is > issued, and put a callback there? Right. This is basically what we already do on pseries: in pre_save we store both the timebase value and the current real time. In post_load we again check the real time, look at the difference from the value in the migration stream and advance the TB to match. > Or I am missing something improtant? > > > > >> > In the second case, it's a bit unclear what the right thing to do is. > >> > Keeping the tb running means accurate realtime, but stopping it is > >> > often better for debugging, which is one of the main reasons to > >> > explicitly pause. > >> > > >> > I believe spapr on KVM HV will keep the TB going, but the TSC on x86 > >> > will be stopped. > >> > >> Is this from a guest-centric view, i.e. if I pause a VM and wait 20 mins > >> then when the guest resumes the timebase will jump forward by 20 mins > >> worth of ticks? > > > > Yes, that's correct. > > I.e. my proposal fixes this? > If you want ot make it really, really "classy", you can look at the mess > we have on x86 to introduce ticks "slewly" for windos 95 (and XP?) > during a while, but I don't think that solution would work for 20mins of > ticks :p > > Later, Juan. >
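The pre_save/post_load scheme David describes can be sketched in isolation. This is a simplified, hypothetical model; the struct, field, and function names are illustrative rather than QEMU's actual vmstate_ppc_timebase code:

```c
#include <stdint.h>

#define NANOSECONDS_PER_SECOND 1000000000ULL

/* What gets carried in the migration stream: the TB value and the real
 * (host) time at which it was captured on the source. */
typedef struct {
    uint64_t guest_timebase;
    int64_t  time_of_save_ns;
} TimebaseState;

static void timebase_pre_save(TimebaseState *s, uint64_t current_tb,
                              int64_t realtime_ns)
{
    s->guest_timebase = current_tb;
    s->time_of_save_ns = realtime_ns;
}

/* On the destination: advance the saved TB by the real time that elapsed
 * during migration, so the guest's clock stays in sync with wall time. */
static uint64_t timebase_post_load(const TimebaseState *s,
                                   int64_t realtime_now_ns, uint32_t tb_freq)
{
    int64_t elapsed_ns = realtime_now_ns - s->time_of_save_ns;
    if (elapsed_ns < 0) {
        elapsed_ns = 0;  /* guard against unsynchronized host clocks */
    }
    return s->guest_timebase +
           (uint64_t)(((__uint128_t)elapsed_ns * tb_freq) /
                      NANOSECONDS_PER_SECOND);
}
```

With a 512MHz timebase, one second of migration downtime advances the restored TB by 512 million ticks, which is exactly the "must not freeze the TB" behaviour argued for above.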
On 24/02/16 12:31, Juan Quintela wrote: >> I don't really understand the question. Migration has no equivalent >> in real hardware, so there's no "real" behaviour to mimic. If we >> freeze the TB during migration, then the guest's clock will get out of >> sync with wall clock time, and in a production environment that's >> really bad. So no, we absolutely must not freeze the TB during >> migration. >> >> When the guest has been explicitly paused, there's a case to be made >> either way. > > If this is the case, can't we just change the device to just read the > clock from the host at device insntantiation and call it a day? > > (* Notice that I haven't seen the previous discussion *) > > On migration, having a post-load function that just loads the right > value for that device should work. Or if we want to make it work for > pause/cont, we should have a notifier to be run each time "cont" is > issued, and put a callback there? > > Or I am missing something improtant? Right, that's roughly the approach I was thinking when I wrote my last reply - for KVM derive the timebase from the virtual clock similar to TCG and adjust on CPU start, e.g. cpu_reset(): cpu->tb_env->tb_offset = 0; cpu_start/resume(): cpu->tb_env->tb_offset = qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) * tb_env->tb_freq + cpu->tb_env->tb_offset - qemu_clock_get_ns(QEMU_CLOCK_HOST) Is there any reason why this shouldn't work? My understanding is that guests supporting KVM_REG_PPC_TB_OFFSET should compensate correctly for the timebase if tb_offset is the difference from the host timebase at guest virtual time zero. ATB, Mark.
On 25/02/16 04:33, Mark Cave-Ayland wrote: > cpu_start/resume(): > cpu->tb_env->tb_offset = > qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) * tb_env->tb_freq + > cpu->tb_env->tb_offset - > qemu_clock_get_ns(QEMU_CLOCK_HOST) Actually just realised this is slightly wrong and in fact should be: cpu_start/resume(): cpu->tb_env->tb_offset = muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) + cpu->tb_env->tb_offset - qemu_clock_get_ns(QEMU_CLOCK_HOST) ATB, Mark.
On 25/02/16 05:00, Mark Cave-Ayland wrote: > On 25/02/16 04:33, Mark Cave-Ayland wrote: > >> cpu_start/resume(): >> cpu->tb_env->tb_offset = >> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) * tb_env->tb_freq + >> cpu->tb_env->tb_offset - >> qemu_clock_get_ns(QEMU_CLOCK_HOST) > > Actually just realised this is slightly wrong and in fact should be: > > cpu_start/resume(): > cpu->tb_env->tb_offset = > muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), > cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) + > cpu->tb_env->tb_offset - > qemu_clock_get_ns(QEMU_CLOCK_HOST) Sigh. And let me try that again, this time after caffeine: cpu_start/resume(): cpu->tb_env->tb_offset = muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) + cpu->tb_env->tb_offset - cpu_get_host_ticks(); This should translate to: at CPU start, calculate the difference between the current guest virtual timebase and the host timebase, storing the difference in cpu->tb_env->tb_offset. ATB, Mark.
On Thu, Feb 25, 2016 at 09:50:20AM +0000, Mark Cave-Ayland wrote: > On 25/02/16 05:00, Mark Cave-Ayland wrote: > > > On 25/02/16 04:33, Mark Cave-Ayland wrote: > > > >> cpu_start/resume(): > >> cpu->tb_env->tb_offset = > >> qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) * tb_env->tb_freq + > >> cpu->tb_env->tb_offset - > >> qemu_clock_get_ns(QEMU_CLOCK_HOST) > > > > Actually just realised this is slightly wrong and in fact should be: > > > > cpu_start/resume(): > > cpu->tb_env->tb_offset = > > muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), > > cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) + > > cpu->tb_env->tb_offset - > > qemu_clock_get_ns(QEMU_CLOCK_HOST) > > Sigh. And let me try that again, this time after caffeine: > > cpu_start/resume(): > cpu->tb_env->tb_offset = > muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), > cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) + > cpu->tb_env->tb_offset - > cpu_get_host_ticks(); > > This should translate to: at CPU start, calculate the difference between > the current guest virtual timebase and the host timebase, storing the > difference in cpu->tb_env->tb_offset. Ummm... I think that's right. Except that you need to make sure you calculate the tb_offset just once, and set the same value to all guest CPUs. Otherwise the guest TBs may be slightly out of sync with each other, which is bad (the host should have already ensured that all host TBs are in sync with each other). We really should make helper routines that each Power machine type can use for this. Unfortunately we can't put it directly into the common ppc cpu migration code because of the requirement to keep the TBs synced across the machine.
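David's "calculate the offset once, set the same value on every vCPU" point can be sketched as follows. Everything here is a simplified stand-in (the TBEnv struct and resync_tb_offsets() are hypothetical, and muldiv64 is reimplemented locally), assuming the convention guest_tb = host_ticks + tb_offset:

```c
#include <stdint.h>

#define NANOSECONDS_PER_SECOND 1000000000ULL

/* Simplified stand-in for QEMU's per-CPU timebase state. */
typedef struct {
    int64_t  tb_offset;  /* guest_tb = host_ticks + tb_offset */
    uint32_t tb_freq;    /* guest timebase frequency in Hz */
} TBEnv;

/* Local reimplementation of QEMU's muldiv64: (a * b) / c without
 * intermediate overflow. */
static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c)
{
    return (uint64_t)(((__uint128_t)a * b) / c);
}

/* Take ONE snapshot of the virtual clock and the host tick counter,
 * compute the offset once, then apply the identical value to every vCPU
 * so the guest TBs stay synchronized with each other. */
static void resync_tb_offsets(TBEnv *cpus[], int ncpus,
                              int64_t vm_clock_ns, int64_t host_ticks)
{
    if (ncpus == 0) {
        return;
    }
    int64_t guest_tb = muldiv64(vm_clock_ns, cpus[0]->tb_freq,
                                NANOSECONDS_PER_SECOND) + cpus[0]->tb_offset;
    int64_t offset = guest_tb - host_ticks;
    for (int i = 0; i < ncpus; i++) {
        cpus[i]->tb_offset = offset;  /* identical value on every vCPU */
    }
}
```

The key design choice is that vm_clock_ns and host_ticks are sampled once, outside the per-CPU loop; reading the host tick counter inside the loop would give each vCPU a slightly different offset, exactly the de-sync David warns about.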
On 26/02/16 04:35, David Gibson wrote: >> Sign. And let me try that again, this time after caffeine: >> >> cpu_start/resume(): >> cpu->tb_env->tb_offset = >> muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), >> cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) + >> cpu->tb_env->tb_offset - >> cpu_get_host_ticks(); >> >> This should translate to: at CPU start, calculate the difference between >> the current guest virtual timebase and the host timebase, storing the >> difference in cpu->tb_env->tb_offset. > > Ummm... I think that's right. Except that you need to make sure you > calculate the tb_offset just once, and set the same value to all guest > CPUs. Otherwise the guest TBs may be slightly out of sync with each > other, which is bad (the host should have already ensure that all host > TBs are in sync with each other). Nods. The reason I really like this solution is because it correctly handles both paused/live machines and simplifies the migration logic significantly. In fact, with this solution the only information you would need in vmstate_ppc_timebase for migration would be the current tb_offset so the receiving host can calculate the guest timebase from the virtual clock accordingly. > We really should make helper routines that each Power machine type can > use for this. Unfortunately we can't put it directly into the common > ppc cpu migration code because of the requirement to keep the TBs > synced across the machine. Effectively I believe there are 2 cases here: TCG and KVM. For TCG all of this is a no-op since tb/decr are already derived from the virtual clock + tb_offset already so that really just leaves KVM. From what you're saying we would need 2 hooks for KVM here: one on machine start/resume which updates tb_offset for all vCPUs and one for vCPU resume which updates just that one particular vCPU. 
Just curious as to your comment regarding helper routines for each Power machine type - is there any reason not to enable this globally for all Power machine types? If tb_offset isn't supported by the guest then the problem with not being able to handle a jump in timebase post-migration still remains exactly the same. The other question of course relates to the reason this thread was started, which is: will this approach still support migrating the decrementer? My feeling is that this would still work in that we would encode the number of ticks before the decrementer reaches its interrupt value into vmstate_ppc_timebase, whether high or low. For TCG it is easy to ensure that the decrementer will still fire in n ticks time post-migration (which solves my particular use case), but I don't have a feeling as to whether this is possible, or indeed desirable for KVM. ATB, Mark.
On Fri, Feb 26, 2016 at 12:29:51PM +0000, Mark Cave-Ayland wrote: > On 26/02/16 04:35, David Gibson wrote: > > >> Sign. And let me try that again, this time after caffeine: > >> > >> cpu_start/resume(): > >> cpu->tb_env->tb_offset = > >> muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), > >> cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) + > >> cpu->tb_env->tb_offset - > >> cpu_get_host_ticks(); > >> > >> This should translate to: at CPU start, calculate the difference between > >> the current guest virtual timebase and the host timebase, storing the > >> difference in cpu->tb_env->tb_offset. > > > > Ummm... I think that's right. Except that you need to make sure you > > calculate the tb_offset just once, and set the same value to all guest > > CPUs. Otherwise the guest TBs may be slightly out of sync with each > > other, which is bad (the host should have already ensure that all host > > TBs are in sync with each other). > > Nods. The reason I really like this solution is because it correctly > handles both paused/live machines and simplifies the migration logic > significantly. In fact, with this solution the only information you > would need in vmstate_ppc_timebase for migration would be the current > tb_offset so the receiving host can calculate the guest timebase from > the virtual clock accordingly. > > We really should make helper routines that each Power machine type can > > use for this. Unfortunately we can't put it directly into the common > > ppc cpu migration code because of the requirement to keep the TBs > > synced across the machine. > > Effectively I believe there are 2 cases here: TCG and KVM. For TCG all > of this is a no-op since tb/decr are already derived from the virtual > clock + tb_offset already so that really just leaves KVM. > > >From what you're saying we would need 2 hooks for KVM here: one on > machine start/resume which updates tb_offset for all vCPUs and one for > vCPU resume which updates just that one particular vCPU. 
> > Just curious as to your comment regarding helper routines for each Power > machine type - is there any reason not to enable this globally for all > Power machine types? If tb_offset isn't supported by the guest then the > problem with not being able to handle a jump in timebase post-migration > still remains exactly the same. Well, I can't see a place to put it globally. We can't put it in the general vCPU stuff, because that would migrate each CPU's timebase independently, but we want to migrate as a system wide operation, to ensure the TBs are all synchronized in the destination guest. To do the platform wide stuff, it pretty much has to be in the machine type. > The other question of course relates to the reason this thread was > started which is will this approach still support migrating the > decrementer? My feeling is that this would still work in that we would > encode the number of ticks before the decrementer reaches its interrupt > value into vmstate_ppc_timebase, whether high or low. For TCG it is easy > to ensure that the decrementer will still fire in n ticks time > post-migration (which solves my particular use case), but I don't have a > feeling as to whether this is possible, or indeed desirable for KVM. Yes, for TCG it should be fairly straightforward. The DECR should be calculated from the timebase. We do need to check it on incoming migration though, and check when we need to refire the decrementer interrupt. For KVM we'll need to load an appropriate value into the real decrementer. We probably want to migrate a difference between the TB and the decrementer. What could get hairy here is that there are a number of different variants between ppc models on how exactly the decrementer interrupt triggers: is it edge-triggered on 1->0 transition, edge-triggered on 0->-1 transition, or level triggered on the DECR's sign bit.
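Migrating the distance between the TB and the decrementer, as suggested above, might look like the sketch below. The names are illustrative; since DECR counts down at the same rate the TB counts up, the sum tb + decr is the guest-TB moment at which DECR reaches zero, and that sum is what gets migrated:

```c
#include <stdint.h>

/* Source side: compute the guest-TB value at which DECR will hit zero.
 * This quantity is invariant while both counters tick, so it survives
 * the migration downtime unchanged. */
static uint64_t decr_target_pre_save(uint64_t tb, uint32_t decr)
{
    return tb + decr;
}

/* Destination side: reconstruct DECR against the (possibly advanced) TB.
 * A result <= 0 means the deadline passed during migration and the
 * decrementer interrupt needs to be (re)fired on load. */
static int64_t decr_post_load(uint64_t tb, uint64_t decr_target)
{
    return (int64_t)(decr_target - tb);
}
```

This handles the "check when we need to refire" case naturally: the sign of the reloaded value tells the destination whether the interrupt is still pending in the future or already overdue.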
On 29/02/16 03:57, David Gibson wrote: > On Fri, Feb 26, 2016 at 12:29:51PM +0000, Mark Cave-Ayland wrote: >> On 26/02/16 04:35, David Gibson wrote: >> >>>> Sign. And let me try that again, this time after caffeine: >>>> >>>> cpu_start/resume(): >>>> cpu->tb_env->tb_offset = >>>> muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), >>>> cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) + >>>> cpu->tb_env->tb_offset - >>>> cpu_get_host_ticks(); >>>> >>>> This should translate to: at CPU start, calculate the difference between >>>> the current guest virtual timebase and the host timebase, storing the >>>> difference in cpu->tb_env->tb_offset. >>> >>> Ummm... I think that's right. Except that you need to make sure you >>> calculate the tb_offset just once, and set the same value to all guest >>> CPUs. Otherwise the guest TBs may be slightly out of sync with each >>> other, which is bad (the host should have already ensure that all host >>> TBs are in sync with each other). >> >> Nods. The reason I really like this solution is because it correctly >> handles both paused/live machines and simplifies the migration logic >> significantly. In fact, with this solution the only information you >> would need in vmstate_ppc_timebase for migration would be the current >> tb_offset so the receiving host can calculate the guest timebase from >> the virtual clock accordingly. > >>> We really should make helper routines that each Power machine type can >>> use for this. Unfortunately we can't put it directly into the common >>> ppc cpu migration code because of the requirement to keep the TBs >>> synced across the machine. >> >> Effectively I believe there are 2 cases here: TCG and KVM. For TCG all >> of this is a no-op since tb/decr are already derived from the virtual >> clock + tb_offset already so that really just leaves KVM. 
>> >> From what you're saying we would need 2 hooks for KVM here: one on >> machine start/resume which updates tb_offset for all vCPUs and one for >> vCPU resume which updates just that one particular vCPU. >> >> Just curious as to your comment regarding helper routines for each Power >> machine type - is there any reason not to enable this globally for all >> Power machine types? If tb_offset isn't supported by the guest then the >> problem with not being able to handle a jump in timebase post-migration >> still remains exactly the same. > > Well, I can't see a place to put it globally. We can't put it in the > general vCPU stuff, because that would migrate each CPU's timebase > independently, but we want to migrate as a system wide operation, to > ensure the TBs are all synchronized in the destination guest. > > To do the platform wide stuff, it pretty much has to be in the machine > type. (goes and looks) It strikes me that a good solution here would be to introduce a new PPCMachineClass from which all of the PPC machines could derive instead of each different machine being a direct subclass of MachineClass (this is not dissimilar to the existing PCMachineClass) and move the timebase and decrementer information into it. With this then all of the PPC machine types can pick up the changes automatically. >> The other question of course relates to the reason this thread was >> started which is will this approach still support migrating the >> decrementer? My feeling is that this would still work in that we would >> encode the number of ticks before the decrementer reaches its interrupt >> value into vmstate_ppc_timebase, whether high or low. For TCG it is easy >> to ensure that the decrementer will still fire in n ticks time >> post-migration (which solves my particular use case), but I don't have a >> feeling as to whether this is possible, or indeed desirable for KVM. > > Yes, for TCG it should be fairly straightforward. 
The DECR should be > calculated from the timebase. We do need to check it on incoming > migration though, and check when we need to refire the decrementer > interrupt. So just to confirm that while reads from the timebase are not privileged (and so cannot be intercepted between host and guest), we still have individual control of the per-guest decrementer interrupts? > For KVM we'll need to load an appropriate value into the real > decrementer. We probably want to migrate a difference between the TB > and the decrementer. What could get hairy here is that there are a > number of different variants between ppc models on how exactly the > decrementer interrupt triggers: is it edge-triggered on 1->0 > transition, edge-triggered on 0->-1 transition, or level triggered on > the DECR's sign bit. I don't think that is too much of a problem, since for TCG the logic is already encapsulated in hw/ppc/ppc.c's __cpu_ppc_store_decr(). It should be possible to move this logic into a shared helper function to keep everything in one place. Finally just to re-iterate that while I can write and compile-test a potential patchset, I have no way to test the KVM parts. If I were to dedicate some time to this, would yourself/Alex/Alexey be willing to help test and debug these changes? ATB, Mark.
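The three trigger variants David lists could be captured in a helper along these lines. This is a standalone illustration of the dispatch, not the actual logic in hw/ppc/ppc.c's __cpu_ppc_store_decr():

```c
#include <stdbool.h>
#include <stdint.h>

/* The per-model variants of how the decrementer interrupt is raised. */
typedef enum {
    DECR_EDGE_1_TO_0,    /* edge-triggered on the 1 -> 0 transition */
    DECR_EDGE_0_TO_M1,   /* edge-triggered on the 0 -> -1 transition */
    DECR_LEVEL_SIGN_BIT, /* level-triggered on DECR's sign bit */
} DecrTriggerStyle;

/* Decide whether a DECR transition should raise the interrupt under a
 * given model's trigger style. */
static bool decr_should_raise(DecrTriggerStyle style,
                              int32_t old_decr, int32_t new_decr)
{
    switch (style) {
    case DECR_EDGE_1_TO_0:
        return old_decr == 1 && new_decr == 0;
    case DECR_EDGE_0_TO_M1:
        return old_decr == 0 && new_decr == -1;
    case DECR_LEVEL_SIGN_BIT:
        return new_decr < 0;  /* sign bit set */
    }
    return false;
}
```

Keeping the variant selection behind one predicate like this is what would let both the TCG store path and a post-load "do we need to refire?" check share the same per-model rules.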
On Mon, Feb 29, 2016 at 08:21:39PM +0000, Mark Cave-Ayland wrote: > On 29/02/16 03:57, David Gibson wrote: > > > On Fri, Feb 26, 2016 at 12:29:51PM +0000, Mark Cave-Ayland wrote: > >> On 26/02/16 04:35, David Gibson wrote: > >> > >>>> Sign. And let me try that again, this time after caffeine: > >>>> > >>>> cpu_start/resume(): > >>>> cpu->tb_env->tb_offset = > >>>> muldiv64(qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL), > >>>> cpu->tb_env->tb_freq, NANOSECONDS_PER_SECOND) + > >>>> cpu->tb_env->tb_offset - > >>>> cpu_get_host_ticks(); > >>>> > >>>> This should translate to: at CPU start, calculate the difference between > >>>> the current guest virtual timebase and the host timebase, storing the > >>>> difference in cpu->tb_env->tb_offset. > >>> > >>> Ummm... I think that's right. Except that you need to make sure you > >>> calculate the tb_offset just once, and set the same value to all guest > >>> CPUs. Otherwise the guest TBs may be slightly out of sync with each > >>> other, which is bad (the host should have already ensure that all host > >>> TBs are in sync with each other). > >> > >> Nods. The reason I really like this solution is because it correctly > >> handles both paused/live machines and simplifies the migration logic > >> significantly. In fact, with this solution the only information you > >> would need in vmstate_ppc_timebase for migration would be the current > >> tb_offset so the receiving host can calculate the guest timebase from > >> the virtual clock accordingly. > > > >>> We really should make helper routines that each Power machine type can > >>> use for this. Unfortunately we can't put it directly into the common > >>> ppc cpu migration code because of the requirement to keep the TBs > >>> synced across the machine. > >> > >> Effectively I believe there are 2 cases here: TCG and KVM. For TCG all > >> of this is a no-op since tb/decr are already derived from the virtual > >> clock + tb_offset already so that really just leaves KVM. 
> >> > >> From what you're saying we would need 2 hooks for KVM here: one on > >> machine start/resume which updates tb_offset for all vCPUs and one for > >> vCPU resume which updates just that one particular vCPU. > >> > >> Just curious as to your comment regarding helper routines for each Power > >> machine type - is there any reason not to enable this globally for all > >> Power machine types? If tb_offset isn't supported by the guest then the > >> problem with not being able to handle a jump in timebase post-migration > >> still remains exactly the same. > > > > Well, I can't see a place to put it globally. We can't put it in the > > general vCPU stuff, because that would migrate each CPU's timebase > > independently, but we want to migrate as a system wide operation, to > > ensure the TBs are all synchronized in the destination guest. > > > > To do the platform wide stuff, it pretty much has to be in the machine > > type. > > (goes and looks) > > It strikes me that a good solution here would be to introduce a new > PPCMachineClass from which all of the PPC machines could derive instead > of each different machine being a direct subclass of MachineClass (this > is not dissimilar to the existing PCMachineClass) and move the > timebase and decrementer information into it. With this then all of the > PPC machine types can pick up the changes automatically. Um.. maybe, yes. There might be some gotchas in attempting that (particularly maintaining backwards compat for migration), but it could be worth a shot. > >> The other question of course relates to the reason this thread was > >> started which is will this approach still support migrating the > >> decrementer? My feeling is that this would still work in that we would > >> encode the number of ticks before the decrementer reaches its interrupt > >> value into vmstate_ppc_timebase, whether high or low. 
For TCG it is easy > >> to ensure that the decrementer will still fire in n ticks time > >> post-migration (which solves my particular use case), but I don't have a > >> feeling as to whether this is possible, or indeed desirable for KVM. > > > > Yes, for TCG it should be fairly straightforward. The DECR should be > > calculated from the timebase. We do need to check it on incoming > > migration though, and check when we need to refire the decrementer > > interrupt. > > So just to confirm that while reads from the timebase are not privileged > (and so cannot be intercepted between host and guest), we still have > individual control of the per-guest decrementer interrupts? I'm not entirely sure I understand the question, but I think the answer is yes. > > For KVM we'll need to load an appropriate value into the real > > decrementer. We probably want to migrate a difference between the TB > > and the decrementer. What could get hairy here is that there are a > > number of different variants between ppc models on how exactly the > > decrementer interrupt triggers: is it edge-triggered on 1->0 > > transition, edge-triggered on 0->-1 transition, or level triggered on > > the DECR's sign bit. > > I don't think that is too much of a problem, since for TCG the logic is > already encapsulated in hw/ppc/ppc.c's __cpu_ppc_store_decr(). It should > be possible to move this logic into a shared helper function to keep > everything in one place. > > Finally just to re-iterate that while I can write and compile-test a > potential patchset, I have no way to test the KVM parts. If I were to > dedicate some time to this, would yourself/Alex/Alexey be willing to > help test and debug these changes? Um.. to some extent. I can test that you haven't broken spapr easily enough. Testing that the Mac machines are working under KVM PR is trickier, since I'm not really set up for testing the Mac machine types. 
It might work if you can supply VM images and scripts, which I can then execute on a ppc machine.
diff --git a/target-ppc/machine.c b/target-ppc/machine.c index 251a84b..495e58d 100644 --- a/target-ppc/machine.c +++ b/target-ppc/machine.c @@ -141,6 +141,7 @@ static void cpu_pre_save(void *opaque) env->spr[SPR_CFAR] = env->cfar; #endif env->spr[SPR_BOOKE_SPEFSCR] = env->spe_fscr; + env->spr[SPR_DECR] = cpu_ppc_load_decr(env); for (i = 0; (i < 4) && (i < env->nb_BATs); i++) { env->spr[SPR_DBAT0U + 2*i] = env->DBAT[0][i]; @@ -175,6 +176,7 @@ static int cpu_post_load(void *opaque, int version_id) env->cfar = env->spr[SPR_CFAR]; #endif env->spe_fscr = env->spr[SPR_BOOKE_SPEFSCR]; + cpu_ppc_store_decr(env, env->spr[SPR_DECR]); for (i = 0; (i < 4) && (i < env->nb_BATs); i++) { env->DBAT[0][i] = env->spr[SPR_DBAT0U + 2*i];