
x86: Reset MTRR on vCPU reset

Message ID 20140813190916.12400.59842.stgit@gimli.home
State New

Commit Message

Alex Williamson Aug. 13, 2014, 7:09 p.m. UTC
The SDM specifies (June 2014 Vol3 11.11.5):

    On a hardware reset, the P6 and more recent processors clear the
    valid flags in variable-range MTRRs and clear the E flag in the
    IA32_MTRR_DEF_TYPE MSR to disable all MTRRs. All other bits in the
    MTRRs are undefined.

We currently do none of that, so whatever MTRR settings you had prior
to reset is what you have after reset.  Usually this doesn't matter
because KVM often ignores the guest mappings and uses write-back
anyway.  However, if you have an assigned device and an IOMMU that
allows NoSnoop for that device, KVM defers to the guest memory
mappings which are now stale after reset.  The result is that OVMF
rebooting on such a configuration takes a full minute to LZMA
decompress the EFI volume, a process that is nearly instant on the
initial boot.

Add support for resetting the SDM-defined bits on vCPU reset.

Also, by my count we're already in danger of overflowing the entries
array that we pass to KVM, so I've topped it up for a bit of headroom.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Cc: qemu-stable@nongnu.org
---

 target-i386/cpu.c |    6 ++++++
 target-i386/cpu.h |    4 ++++
 target-i386/kvm.c |   14 +++++++++++++-
 3 files changed, 23 insertions(+), 1 deletion(-)

Comments

Laszlo Ersek Aug. 13, 2014, 8:33 p.m. UTC | #1
a number of comments -- feel free to address or ignore each as you see fit:

On 08/13/14 21:09, Alex Williamson wrote:
> The SDM specifies (June 2014 Vol3 11.11.5):
> 
>     On a hardware reset, the P6 and more recent processors clear the
>     valid flags in variable-range MTRRs and clear the E flag in the
>     IA32_MTRR_DEF_TYPE MSR to disable all MTRRs. All other bits in the
>     MTRRs are undefined.
> 
> We currently do none of that, so whatever MTRR settings you had prior
> to reset is what you have after reset.  Usually this doesn't matter
> because KVM often ignores the guest mappings and uses write-back
> anyway.  However, if you have an assigned device and an IOMMU that
> allows NoSnoop for that device, KVM defers to the guest memory
> mappings which are now stale after reset.  The result is that OVMF
> rebooting on such a configuration takes a full minute to LZMA
> decompress the EFI volume, a process that is nearly instant on the

For pedantry, instead of "EFI volume" we could say "LZMA-compressed
Firmware File System file in the FVMAIN_COMPACT firmware volume".

> initial boot.
> 
> Add support for resetting the SDM-defined bits on vCPU reset.
> 
> Also, by my count we're already in danger of overflowing the entries
> array that we pass to KVM, so I've topped it up for a bit of headroom.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> Cc: qemu-stable@nongnu.org
> ---
> 
>  target-i386/cpu.c |    6 ++++++
>  target-i386/cpu.h |    4 ++++
>  target-i386/kvm.c |   14 +++++++++++++-
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index 6d008ab..b5ae654 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -2588,6 +2588,12 @@ static void x86_cpu_reset(CPUState *s)
>  
>      env->xcr0 = 1;
>  
> +    /* MTRR init - Clear global enable bit and valid bit in each variable reg */
> +    env->mtrr_deftype &= ~MSR_MTRRdefType_Enable;
> +    for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
> +        env->mtrr_var[i].mask &= ~MSR_MTRRphysMask_Valid;
> +    }
> +

I can see that the limit, MSR_MTRRcap_VCNT, is #defined as 8. Would you
be willing to update the definition of the "CPUX86State.mtrr_var" array
too, in "target-i386/cpu.h"? Currently it says:

    MTRRVar mtrr_var[8];

>  #if !defined(CONFIG_USER_ONLY)
>      /* We hard-wire the BSP to the first CPU. */
>      if (s->cpu_index == 0) {
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index e634d83..139890f 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -337,6 +337,8 @@
>  #define MSR_MTRRphysBase(reg)           (0x200 + 2 * (reg))
>  #define MSR_MTRRphysMask(reg)           (0x200 + 2 * (reg) + 1)
>  
> +#define MSR_MTRRphysMask_Valid (1 << 11)
> +

Note: a signed integer (int32_t).

>  #define MSR_MTRRfix64K_00000            0x250
>  #define MSR_MTRRfix16K_80000            0x258
>  #define MSR_MTRRfix16K_A0000            0x259
> @@ -353,6 +355,8 @@
>  
>  #define MSR_MTRRdefType                 0x2ff
>  
> +#define MSR_MTRRdefType_Enable (1 << 11)
> +

Note: a signed integer (int32_t).

Now, if you scroll back to the bit-clearing in x86_cpu_reset(), you see

  ~MSR_MTRRdefType_Enable

and

 ~MSR_MTRRphysMask_Valid

These expressions evaluate to negative int (int32_t) values (because the
bit-neg sets their sign bits).

Due to two's complement (which we are allowed to assume in qemu, see
HACKING), the negative int32_t values will be just correct for the next
step, when they are converted to uint64_t for the bit-ands, as part of
the usual arithmetic conversions. ("env->mtrr_deftype" and
"env->mtrr_var[i].mask" are uint64_t.) Mathematically this means an
addition of UINT64_MAX+1. ("Sign extended".)

In general, even though they are correct due to two's complement, I
dislike such detours into negative-valued signed integers by way of
bit-neg, because people are mostly unaware of them and assume they "just
work". My preferred solution would be

#define MSR_MTRRphysMask_Valid (1ull << 11)
#define MSR_MTRRdefType_Enable (1ull << 11)

Feel free to ignore this of course.
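
(For anyone who wants to convince themselves, here's a tiny standalone
snippet -- not QEMU code, just an illustration -- showing that both
spellings end up clearing only bit 11 of a uint64_t; the plain-int
version merely takes a detour through a negative value first:)

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t deftype = UINT64_MAX;

        /* ~(1 << 11) is the int -2049; converting it to uint64_t for
         * the AND yields 0xfffffffffffff7ff, i.e. only bit 11 clear. */
        printf("int: 0x%016" PRIx64 "\n", deftype & ~(1 << 11));
        /* ~(1ULL << 11) is 0xfffffffffffff7ff to begin with. */
        printf("ull: 0x%016" PRIx64 "\n", deftype & ~(1ULL << 11));
        return 0;
    }

Both printf()s produce 0xfffffffffffff7ff.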

>  #define MSR_CORE_PERF_FIXED_CTR0        0x309
>  #define MSR_CORE_PERF_FIXED_CTR1        0x30a
>  #define MSR_CORE_PERF_FIXED_CTR2        0x30b
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 097fe11..cb31338 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -79,6 +79,7 @@ static int lm_capable_kernel;
>  static bool has_msr_hv_hypercall;
>  static bool has_msr_hv_vapic;
>  static bool has_msr_hv_tsc;
> +static bool has_msr_mtrr;
>  
>  static bool has_msr_architectural_pmu;
>  static uint32_t num_architectural_pmu_counters;
> @@ -739,6 +740,10 @@ int kvm_arch_init_vcpu(CPUState *cs)
>          env->kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
>      }
>  
> +    if (env->features[FEAT_1_EDX] & CPUID_MTRR) {
> +        has_msr_mtrr = true;
> +    }
> +

Seems to match "MTRR Feature Identification" in my (older) copy of the SDM.

>      return 0;
>  }
>  
> @@ -1183,7 +1188,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>      CPUX86State *env = &cpu->env;
>      struct {
>          struct kvm_msrs info;
> -        struct kvm_msr_entry entries[100];
> +        struct kvm_msr_entry entries[128];
>      } msr_data;
>      struct kvm_msr_entry *msrs = msr_data.entries;
>      int n = 0, i;
> @@ -1278,6 +1283,13 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>              kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_REFERENCE_TSC,
>                                env->msr_hv_tsc);
>          }
> +        if (has_msr_mtrr) {
> +            kvm_msr_entry_set(&msrs[n++], MSR_MTRRdefType, env->mtrr_deftype);
> +            for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
> +                kvm_msr_entry_set(&msrs[n++],
> +                                  MSR_MTRRphysMask(i), env->mtrr_var[i].mask);
> +            }
> +        }
>  
>          /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see
>           *       kvm_put_msr_feature_control. */
> 

I think that this code is correct (and sufficient for the reset
problem), but I'm uncertain if it's complete:

(a) Shouldn't you put the matching PhysBase registers as well (for the
variable range ones)?

Plus, shouldn't you put mtrr_fixed[11] too (MSR_MTRRfix64K_00000, ...)?
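
Just to make (a) concrete, here is a rough, untested sketch of what a
"complete" put might look like -- assuming env->mtrr_fixed[] is laid
out in the usual fixed-range MSR order and MTRRVar carries base/mask
fields:

    if (has_msr_mtrr) {
        kvm_msr_entry_set(&msrs[n++], MSR_MTRRdefType, env->mtrr_deftype);
        kvm_msr_entry_set(&msrs[n++], MSR_MTRRfix64K_00000, env->mtrr_fixed[0]);
        kvm_msr_entry_set(&msrs[n++], MSR_MTRRfix16K_80000, env->mtrr_fixed[1]);
        kvm_msr_entry_set(&msrs[n++], MSR_MTRRfix16K_A0000, env->mtrr_fixed[2]);
        /* ... the eight MSR_MTRRfix4K_* registers would follow the same
         * pattern for env->mtrr_fixed[3] through [10] ... */
        for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
            kvm_msr_entry_set(&msrs[n++],
                              MSR_MTRRphysBase(i), env->mtrr_var[i].base);
            kvm_msr_entry_set(&msrs[n++],
                              MSR_MTRRphysMask(i), env->mtrr_var[i].mask);
        }
    }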

(b) You only modify kvm_put_msrs(). What about kvm_get_msrs()? I can see
that you make the msr putting dependent on:

    /*
     * The following MSRs have side effects on the guest or are too
     * heavy for normal writeback. Limit them to reset or full state
     * updates.
     */
    if (level >= KVM_PUT_RESET_STATE) {

But that's probably not your reason for omitting matching new code from
kvm_get_msrs(): "HV_X64_MSR_REFERENCE_TSC" is also heavy-weight (visible
in your patch's context), but that one is nevertheless handled in
kvm_get_msrs().

My only reason for (b) is simply symmetry. For example, commit 48a5f3bc
added HV_X64_MSR_REFERENCE_TSC at once to both put() and get().

According to "target-i386/machine.c", mtrr_deftype and co. are even
migrated (part of vmstate), so this asymmetry could become a problem in
migration. Eg. source host doesn't fetch MTRR state from KVM, hence wire
format carries garbage, but on the target you put (part of) that garbage
(right now, just the mask) back into KVM:

do_savevm()
  qemu_savevm_state()
    qemu_savevm_state_complete()
      cpu_synchronize_all_states()
        cpu_synchronize_state()
          kvm_cpu_synchronize_state()
            do_kvm_cpu_synchronize_state()
              kvm_arch_get_registers()
                kvm_get_msrs()

do_loadvm()
  load_vmstate()
    qemu_loadvm_state()
      cpu_synchronize_all_post_init()
        cpu_synchronize_post_init()
          kvm_cpu_synchronize_post_init()
            kvm_arch_put_registers(..., KVM_PUT_FULL_STATE)
              kvm_put_msrs(..., KVM_PUT_FULL_STATE)

/* state subset modified during VCPU reset */
#define KVM_PUT_RESET_STATE     2

/* full state set, modified during initialization or on vmload */
#define KVM_PUT_FULL_STATE      3

Hence I suspect (a) and (b) should be handled.

... And then we arrive at cross-version migration, where both source and
target hosts support MTRR, but the source qemu sends unsynchronized MTRR
data (ie. garbage) in the migration stream, but the target passes it to
KVM. I don't know if this is possible, and if so, what to do about it. :(

(BTW,

        VMSTATE_MTRR_VARS(env.mtrr_var, X86CPU, 8, 8),

should be rebased to MSR_MTRRcap_VCNT too, probably.)

Apologies about the verbiage, I just wrote down whatever crossed my
mind. I don't think I said anything overly important, but I feel unsafe
about giving my R-b until someone disproves my migration worries.
(Basically, before the patch, whatever MTRR data was in the migration
stream never reached KVM. This changes now.)

... Is the following argument valid in your opinion?

  KVM cares about guest-specified MTRR values *only* when
  kvm_arch_has_noncoherent_dma() returns true to vmx_get_mt_mask().
  Since "kvm_arch_has_noncoherent_dma() returning true" (ie. device
  assignment) excludes migration anyway, we don't have to care about
  migration of MTRRs.

I'm confused, but that shouldn't block this patch!

Thanks,
Laszlo
Alex Williamson Aug. 13, 2014, 10:06 p.m. UTC | #2
On Wed, 2014-08-13 at 22:33 +0200, Laszlo Ersek wrote:
> a number of comments -- feel free to address or ignore each as you see fit:
> 
> On 08/13/14 21:09, Alex Williamson wrote:
> > The SDM specifies (June 2014 Vol3 11.11.5):
> > 
> >     On a hardware reset, the P6 and more recent processors clear the
> >     valid flags in variable-range MTRRs and clear the E flag in the
> >     IA32_MTRR_DEF_TYPE MSR to disable all MTRRs. All other bits in the
> >     MTRRs are undefined.
> > 
> > We currently do none of that, so whatever MTRR settings you had prior
> > to reset is what you have after reset.  Usually this doesn't matter
> > because KVM often ignores the guest mappings and uses write-back
> > anyway.  However, if you have an assigned device and an IOMMU that
> > allows NoSnoop for that device, KVM defers to the guest memory
> > mappings which are now stale after reset.  The result is that OVMF
> > rebooting on such a configuration takes a full minute to LZMA
> > decompress the EFI volume, a process that is nearly instant on the
> 
> For pedantry, instead of "EFI volume" we could say "LZMA-compressed
> Firmware File System file in the FVMAIN_COMPACT firmware volume".

Can you come up with something with maybe half that many words?  And
also, does it matter?  I want someone using OVMF and experiencing a long
reboot delay to know that this might fix their problem.  Noting that the
major time consuming stall is in the LZMA decompression code helps to
rationalize why the mapping change is important.  The specific blob of
data that's being decompressed seems mostly irrelevant, which is why I
only gave it 2 words.

> > initial boot.
> > 
> > Add support for resetting the SDM-defined bits on vCPU reset.
> > 
> > Also, by my count we're already in danger of overflowing the entries
> > array that we pass to KVM, so I've topped it up for a bit of headroom.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > Cc: qemu-stable@nongnu.org
> > ---
> > 
> >  target-i386/cpu.c |    6 ++++++
> >  target-i386/cpu.h |    4 ++++
> >  target-i386/kvm.c |   14 +++++++++++++-
> >  3 files changed, 23 insertions(+), 1 deletion(-)
> > 
> > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > index 6d008ab..b5ae654 100644
> > --- a/target-i386/cpu.c
> > +++ b/target-i386/cpu.c
> > @@ -2588,6 +2588,12 @@ static void x86_cpu_reset(CPUState *s)
> >  
> >      env->xcr0 = 1;
> >  
> > +    /* MTRR init - Clear global enable bit and valid bit in each variable reg */
> > +    env->mtrr_deftype &= ~MSR_MTRRdefType_Enable;
> > +    for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
> > +        env->mtrr_var[i].mask &= ~MSR_MTRRphysMask_Valid;
> > +    }
> > +
> 
> I can see that the limit, MSR_MTRRcap_VCNT, is #defined as 8. Would you
> be willing to update the definition of the "CPUX86State.mtrr_var" array
> too, in "target-i386/cpu.h"? Currently it says:

I was tempted to do that, but I was hoping there was some deeper
reasoning why these were already defined separately.  For instance, what
if we wanted to keep a stable vmstate size, but expose fewer variable
MTRRs to the guest.  MSR_MTRRcap_VCNT is the number exposed to the
guest, so it makes sense that we only need to clear the valid bits on
those.  As I look through the commits that got us here, that was
probably just wishful thinking.
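
(For what it's worth, the minimal form of that cleanup would be roughly
the following -- a sketch only; whether the vmstate size/versioning
implications are acceptable would still need checking:)

    /* target-i386/cpu.h */
    MTRRVar mtrr_var[MSR_MTRRcap_VCNT];

    /* target-i386/machine.c */
    VMSTATE_MTRR_VARS(env.mtrr_var, X86CPU, MSR_MTRRcap_VCNT, 8),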

>     MTRRVar mtrr_var[8];
> 
> >  #if !defined(CONFIG_USER_ONLY)
> >      /* We hard-wire the BSP to the first CPU. */
> >      if (s->cpu_index == 0) {
> > diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> > index e634d83..139890f 100644
> > --- a/target-i386/cpu.h
> > +++ b/target-i386/cpu.h
> > @@ -337,6 +337,8 @@
> >  #define MSR_MTRRphysBase(reg)           (0x200 + 2 * (reg))
> >  #define MSR_MTRRphysMask(reg)           (0x200 + 2 * (reg) + 1)
> >  
> > +#define MSR_MTRRphysMask_Valid (1 << 11)
> > +
> 
> Note: a signed integer (int32_t).
> 
> >  #define MSR_MTRRfix64K_00000            0x250
> >  #define MSR_MTRRfix16K_80000            0x258
> >  #define MSR_MTRRfix16K_A0000            0x259
> > @@ -353,6 +355,8 @@
> >  
> >  #define MSR_MTRRdefType                 0x2ff
> >  
> > +#define MSR_MTRRdefType_Enable (1 << 11)
> > +
> 
> Note: a signed integer (int32_t).
> 
> Now, if you scroll back to the bit-clearing in x86_cpu_reset(), you see
> 
>   ~MSR_MTRRdefType_Enable
> 
> and
> 
>  ~MSR_MTRRphysMask_Valid
> 
> These expressions evaluate to negative int (int32_t) values (because the
> bit-neg sets their sign bits).
> 
> Due to two's complement (which we are allowed to assume in qemu, see
> HACKING), the negative int32_t values will be just correct for the next
> step, when they are converted to uint64_t for the bit-ands, as part of
> the usual arithmetic conversions. ("env->mtrr_deftype" and
> "env->mtrr_var[i].mask" are uint64_t.) Mathematically this means an
> addition of UINT64_MAX+1. ("Sign extended".)
> 
> In general, even though they are correct due to two's complement, I
> dislike such detours into negative-valued signed integers by way of
> bit-neg, because people are mostly unaware of them and assume they "just
> work". My preferred solution would be
> 
> #define MSR_MTRRphysMask_Valid (1ull << 11)
> #define MSR_MTRRdefType_Enable (1ull << 11)
> 
> Feel free to ignore this of course.

This seems like an uphill battle, but I suppose I don't have any problem
with an overly pedantic definition like this.

> >  #define MSR_CORE_PERF_FIXED_CTR0        0x309
> >  #define MSR_CORE_PERF_FIXED_CTR1        0x30a
> >  #define MSR_CORE_PERF_FIXED_CTR2        0x30b
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index 097fe11..cb31338 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -79,6 +79,7 @@ static int lm_capable_kernel;
> >  static bool has_msr_hv_hypercall;
> >  static bool has_msr_hv_vapic;
> >  static bool has_msr_hv_tsc;
> > +static bool has_msr_mtrr;
> >  
> >  static bool has_msr_architectural_pmu;
> >  static uint32_t num_architectural_pmu_counters;
> > @@ -739,6 +740,10 @@ int kvm_arch_init_vcpu(CPUState *cs)
> >          env->kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
> >      }
> >  
> > +    if (env->features[FEAT_1_EDX] & CPUID_MTRR) {
> > +        has_msr_mtrr = true;
> > +    }
> > +
> 
> Seems to match "MTRR Feature Identification" in my (older) copy of the SDM.
> 
> >      return 0;
> >  }
> >  
> > @@ -1183,7 +1188,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
> >      CPUX86State *env = &cpu->env;
> >      struct {
> >          struct kvm_msrs info;
> > -        struct kvm_msr_entry entries[100];
> > +        struct kvm_msr_entry entries[128];
> >      } msr_data;
> >      struct kvm_msr_entry *msrs = msr_data.entries;
> >      int n = 0, i;
> > @@ -1278,6 +1283,13 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
> >              kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_REFERENCE_TSC,
> >                                env->msr_hv_tsc);
> >          }
> > +        if (has_msr_mtrr) {
> > +            kvm_msr_entry_set(&msrs[n++], MSR_MTRRdefType, env->mtrr_deftype);
> > +            for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
> > +                kvm_msr_entry_set(&msrs[n++],
> > +                                  MSR_MTRRphysMask(i), env->mtrr_var[i].mask);
> > +            }
> > +        }
> >  
> >          /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see
> >           *       kvm_put_msr_feature_control. */
> > 
> 
> I think that this code is correct (and sufficient for the reset
> problem), but I'm uncertain if it's complete:
> 
> (a) Shouldn't you put the matching PhysBase registers as well (for the
> variable range ones)?
> 
> Plus, shouldn't you put mtrr_fixed[11] too (MSR_MTRRfix64K_00000, ...)?

If my change wasn't isolated to the reset portion of kvm_put_msrs() then
I would agree with you.  But since it is, all of those registers are
undefined by the SDM.

> (b) You only modify kvm_put_msrs(). What about kvm_get_msrs()? I can see
> that you make the msr putting dependent on:
> 
>     /*
>      * The following MSRs have side effects on the guest or are too
>      * heavy for normal writeback. Limit them to reset or full state
>      * updates.
>      */
>     if (level >= KVM_PUT_RESET_STATE) {
> 
> But that's probably not your reason for omitting matching new code from
> kvm_get_msrs(): "HV_X64_MSR_REFERENCE_TSC" is also heavy-weight (visible
> in your patch's context), but that one is nevertheless handled in
> kvm_get_msrs().
> 
> My only reason for (b) is simply symmetry. For example, commit 48a5f3bc
> added HV_X64_MSR_REFERENCE_TSC at once to both put() and get().
> 
> According to "target-i386/machine.c", mtrr_deftype and co. are even
> migrated (part of vmstate), so this asymmetry could become a problem in
> migration. Eg. source host doesn't fetch MTRR state from KVM, hence wire
> format carries garbage, but on the target you put (part of) that garbage
> (right now, just the mask) back into KVM:
> 
> do_savevm()
>   qemu_savevm_state()
>     qemu_savevm_state_complete()
>       cpu_synchronize_all_states()
>         cpu_synchronize_state()
>           kvm_cpu_synchronize_state()
>             do_kvm_cpu_synchronize_state()
>               kvm_arch_get_registers()
>                 kvm_get_msrs()
> 
> do_loadvm()
>   load_vmstate()
>     qemu_loadvm_state()
>       cpu_synchronize_all_post_init()
>         cpu_synchronize_post_init()
>           kvm_cpu_synchronize_post_init()
>             kvm_arch_put_registers(..., KVM_PUT_FULL_STATE)
>               kvm_put_msrs(..., KVM_PUT_FULL_STATE)
> 
> /* state subset modified during VCPU reset */
> #define KVM_PUT_RESET_STATE     2
> 
> /* full state set, modified during initialization or on vmload */
> #define KVM_PUT_FULL_STATE      3
> 
> Hence I suspect (a) and (b) should be handled.
> 
> ... And then we arrive at cross-version migration, where both source and
> target hosts support MTRR, but the source qemu sends unsynchronized MTRR
> data (ie. garbage) in the migration stream, but the target passes it to
> KVM. I don't know if this is possible, and if so, what to do about it. :(

Where does the target pass it to KVM?  I think you've identified that we
migrate unsynchronized data, but the good news is that we don't do
anything with it unless you're running under TCG (in which case it is
synchronized anyway).  We neither load nor store the MTRR state from/to
KVM, which may have implications if you were to boot a guest, migrate
it, then hot-add an assigned device where we need to start caring about
guest mappings.

> (BTW,
> 
>         VMSTATE_MTRR_VARS(env.mtrr_var, X86CPU, 8, 8),
> 
> should be rebased to MSR_MTRRcap_VCNT too, probably.)
> 
> Apologies about the verbiage, I just wrote down whatever crossed my
> mind. I don't think I said anything overly important, but I feel unsafe
> about giving my R-b until someone disproves my migration worries.
> (Basically, before the patch, whatever MTRR data was in the migration
> stream never reached KVM. This changes now.)

Not really because it only gets pushed to KVM on vCPU reset and we're
clearing the necessary enable/valid bits.  The rest is undefined anyway.

> ... Is the following argument valid in your opinion?
> 
>   KVM cares about guest-specified MTRR values *only* when
>   kvm_arch_has_noncoherent_dma() returns true to vmx_get_mt_mask().
>   Since "kvm_arch_has_noncoherent_dma() returning true" (ie. device
>   assignment) excludes migration anyway, we don't have to care about
>   migration of MTRRs.

I think we do need to care about migration of MTRRs because a device can
be hot attached on the migration target while the MTRRs could have been
programmed on the migration source.  Therefore it doesn't matter that
device assignment excludes migration.  This patch still seems correct to
me, but you have identified another issue in the same problem space.
I'll start working on it.  Thanks,

Alex
Laszlo Ersek Aug. 13, 2014, 11:17 p.m. UTC | #3
On 08/14/14 00:06, Alex Williamson wrote:
> On Wed, 2014-08-13 at 22:33 +0200, Laszlo Ersek wrote:
>> a number of comments -- feel free to address or ignore each as you see fit:
>>
>> On 08/13/14 21:09, Alex Williamson wrote:

>>> mappings which are now stale after reset.  The result is that OVMF
>>> rebooting on such a configuration takes a full minute to LZMA
>>> decompress the EFI volume, a process that is nearly instant on the
>>
>> For pedantry, instead of "EFI volume" we could say "LZMA-compressed
>> Firmware File System file in the FVMAIN_COMPACT firmware volume".
> 
> Can you come up with something with maybe half that many words?

"Firmware volume" then. "Firmware volume" is not a generic term, it's a
specific term in the Platform Initialization (PI) spec.

> And
> also, does it matter?

No. :)

> I want someone using OVMF and experiencing a long
> reboot delay to know that this might fix their problem.  Noting that the
> major time consuming stall is in the LZMA decompression code helps to
> rationalize why the mapping change is important.  The specific blob of
> data that's being decompressed seems mostly irrelevant, which is why I
> only gave it 2 words.

Fair enough, it's just that "EFI volume" doesn't mean anything specific
(to me), while "firmware volume" does.

>>> @@ -1183,7 +1188,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>>      CPUX86State *env = &cpu->env;
>>>      struct {
>>>          struct kvm_msrs info;
>>> -        struct kvm_msr_entry entries[100];
>>> +        struct kvm_msr_entry entries[128];
>>>      } msr_data;
>>>      struct kvm_msr_entry *msrs = msr_data.entries;
>>>      int n = 0, i;
>>> @@ -1278,6 +1283,13 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>>              kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_REFERENCE_TSC,
>>>                                env->msr_hv_tsc);
>>>          }
>>> +        if (has_msr_mtrr) {
>>> +            kvm_msr_entry_set(&msrs[n++], MSR_MTRRdefType, env->mtrr_deftype);
>>> +            for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
>>> +                kvm_msr_entry_set(&msrs[n++],
>>> +                                  MSR_MTRRphysMask(i), env->mtrr_var[i].mask);
>>> +            }
>>> +        }
>>>  
>>>          /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see
>>>           *       kvm_put_msr_feature_control. */
>>>
>>
>> I think that this code is correct (and sufficient for the reset
>> problem), but I'm uncertain if it's complete:
>>
>> (a) Shouldn't you put the matching PhysBase registers as well (for the
>> variable range ones)?
>>
>> Plus, shouldn't you put mtrr_fixed[11] too (MSR_MTRRfix64K_00000, ...)?
> 
> If my change wasn't isolated to the reset portion of kvm_put_msrs() then
> I would agree with you.  But since it is, all of those registers are
> undefined by the SDM.

That's a good way to express your point indeed, and a good way to
formulate my concern: I'm not sure your change is isolated to the reset
portion. The check that "gates" the new hunk says

  level >= KVM_PUT_RESET_STATE

and a higher level than that does exist: KVM_PUT_FULL_STATE, which is
used in incoming migration.
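
If the new hunk really were meant to be reset-only, it would have to
check for that exact level rather than ">=" -- just a sketch to
illustrate the distinction, not a suggestion, since completing get/put
is probably the better direction:

    if (has_msr_mtrr && level == KVM_PUT_RESET_STATE) {
        kvm_msr_entry_set(&msrs[n++], MSR_MTRRdefType, env->mtrr_deftype);
        for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
            kvm_msr_entry_set(&msrs[n++],
                              MSR_MTRRphysMask(i), env->mtrr_var[i].mask);
        }
    }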

>> (b) You only modify kvm_put_msrs(). What about kvm_get_msrs()? I can see
>> that you make the msr putting dependent on:
>>
>>     /*
>>      * The following MSRs have side effects on the guest or are too
>>      * heavy for normal writeback. Limit them to reset or full state
>>      * updates.
>>      */
>>     if (level >= KVM_PUT_RESET_STATE) {
>>
>> But that's probably not your reason for omitting matching new code from
>> kvm_get_msrs(): "HV_X64_MSR_REFERENCE_TSC" is also heavy-weight (visible
>> in your patch's context), but that one is nevertheless handled in
>> kvm_get_msrs().
>>
>> My only reason for (b) is simply symmetry. For example, commit 48a5f3bc
>> added HV_X64_MSR_REFERENCE_TSC at once to both put() and get().
>>
>> According to "target-i386/machine.c", mtrr_deftype and co. are even
>> migrated (part of vmstate), so this asymmetry could become a problem in
>> migration. Eg. source host doesn't fetch MTRR state from KVM, hence wire
>> format carries garbage, but on the target you put (part of) that garbage
>> (right now, just the mask) back into KVM:
>>
>> do_savevm()
>>   qemu_savevm_state()
>>     qemu_savevm_state_complete()
>>       cpu_synchronize_all_states()
>>         cpu_synchronize_state()
>>           kvm_cpu_synchronize_state()
>>             do_kvm_cpu_synchronize_state()
>>               kvm_arch_get_registers()
>>                 kvm_get_msrs()
>>
>> do_loadvm()
>>   load_vmstate()
>>     qemu_loadvm_state()
>>       cpu_synchronize_all_post_init()
>>         cpu_synchronize_post_init()
>>           kvm_cpu_synchronize_post_init()
>>             kvm_arch_put_registers(..., KVM_PUT_FULL_STATE)
>>               kvm_put_msrs(..., KVM_PUT_FULL_STATE)
>>
>> /* state subset modified during VCPU reset */
>> #define KVM_PUT_RESET_STATE     2
>>
>> /* full state set, modified during initialization or on vmload */
>> #define KVM_PUT_FULL_STATE      3
>>
>> Hence I suspect (a) and (b) should be handled.
>>
>> ... And then we arrive at cross-version migration, where both source and
>> target hosts support MTRR, but the source qemu sends unsynchronized MTRR
>> data (ie. garbage) in the migration stream, but the target passes it to
>> KVM. I don't know if this is possible, and if so, what to do about it. :(
> 
> Where does the target pass it to KVM?

That's what I tried to show with the 2nd call stack above, the one
- that is rooted in do_loadvm(),
- and ends in kvm_put_msrs(),
- with "level" equalling KVM_PUT_FULL_STATE (--> 3),
- which the gate, ie. level >= KVM_PUT_RESET_STATE (--> 2), will let
  across.

(I do see that right now the patch passes only a part of the MTRR state
to KVM, but *some* part it does pass.)

> I think you've identified that we
> migrate unsynchronized data, but the good news is that we don't do
> anything with it unless you're running under TCG (in which case it is
> synchronized anyway).

That's where I disagree (but hopefully I'm just confused). I think that
once your patch is applied, part of the MTRR state, read from the
migration stream, will be sent to KVM. Because do_loadvm() leads to
kvm_put_msrs(), with level = KVM_PUT_FULL_STATE >= KVM_PUT_RESET_STATE.

> We neither load nor store the MTRR state from/to
> KVM,

(We don't load, I agree; we don't store, I disagree (after your patch).)

> which may have implications if you were to boot a guest, migrate
> it, then hot-add an assigned device where we need to start caring about
> guest mappings.
> 
>> (BTW,
>>
>>         VMSTATE_MTRR_VARS(env.mtrr_var, X86CPU, 8, 8),
>>
>> should be rebased to MSR_MTRRcap_VCNT too, probably.)
>>
>> Apologies about the verbiage, I just wrote down whatever crossed my
>> mind. I don't think I said anything overly important, but I feel unsafe
>> about giving my R-b until someone disproves my migration worries.
>> (Basically, before the patch, whatever MTRR data was in the migration
>> stream never reached KVM. This changes now.)
> 
> Not really because it only gets pushed to KVM on vCPU reset

(and on KVM_PUT_FULL_STATE >= KVM_PUT_RESET_STATE)

> and we're
> clearing the necessary enable/valid bits.  The rest is undefined anyway.

Indeed, when the put is a consequence of the VM being reset. But I think
that the put is reachable differently (ie. on incoming migration), and
consistency would be important then.

> 
>> ... Is the following argument valid in your opinion?
>>
>>   KVM cares about guest-specified MTRR values *only* when
>>   kvm_arch_has_noncoherent_dma() returns true to vmx_get_mt_mask().
>>   Since "kvm_arch_has_noncoherent_dma() returning true" (ie. device
>>   assignment) excludes migration anyway, we don't have to care about
>>   migration of MTRRs.
> 
> I think we do need to care about migration of MTRRs because a device can
> be hot attached on the migration target while the MTRRs could have been
> programmed on the migration source.  Therefore it doesn't matter that
> device assignment excludes migration.  This patch still seems correct to
> me, but you have identified another issue in the same problem space.
> I'll start working on it.  Thanks,

I certainly invite you and others to correct me; I'm a novice in this area.

Anyway, I had a thought that puts even my migration worries to shame :)
Here goes:

In x86_cpu_reset(), could you debug-log the *prior* value of
"env->mtrr_var[i].mask", ie. before you mask out the valid bit? (Same
for "env->mtrr_deftype", before you mask out the global Enable bit?)

I believe that under KVM, those env->mtrr_XXX fields are *always* zero
(dating back to CPUX86State's z-allocation time). Because, where would
we set them to anything nonzero?

$ git grep mtrr_deftype
target-i386/cpu.h:    uint64_t mtrr_deftype;
target-i386/machine.c:        VMSTATE_UINT64_V(env.mtrr_deftype, X86CPU, 8),
target-i386/misc_helper.c:        env->mtrr_deftype = val;
target-i386/misc_helper.c:        val = env->mtrr_deftype;

"target-i386/misc_helper.c" is TCG, isn't it?

The assignment to "env->mtrr_deftype" happens in function helper_wrmsr()
[target-i386/misc_helper.c], which is *only* called in the following chain:

gen_intermediate_code_internal() [target-i386/translate.c]
  disas_insn()                   [target-i386/translate.c]
    gen_helper_wrmsr()           [target-i386/misc_helper.c]

(gen_helper_wrmsr() resolves to helper_wrmsr() through some ingenious
hacks in "include/exec/helper-gen.h"; run
'git grep -e DEF_HELPER --and -e wrmsr', and then check DEF_HELPER_1().)

So, my theory is that under KVM, the hunk for x86_cpu_reset() doesn't
actually clear any set bits (because all those bits are already zero),
and that the kvm_put_msrs() hunk simply writes all-bits-zero values
(which certainly conforms to the reset requirements!) And this happens
exactly because the patch never *loads* MTRR state from KVM.

Here's a summary of this mess of an email:

- With TCG, I think everything just works, and the hunk for
  x86_cpu_reset() improves MTRR conformance.

- With KVM, the lack of loading MTRR state from KVM, combined with the
  (partial) storing of MTRR state to KVM, has two consequences:
  - migration invalidates (loses) MTRR state,
  - without migration, the clearing actions in x86_cpu_reset() have
    no effect (0 -> 0), but this is masked by the fact that
    all-bits-zero values for the registers in question happen to be
    correct for resetting.

- Both the store (which is now partial) and the load (which is
  nonexistent) should be complete instead, because the store is
  reachable on the incoming migration path too, not just when resetting.

I'm sorry if I sound crazy, but the above should be easy to disprove (by
logging the prior values in x86_cpu_reset(), on KVM, and by setting a
breakpoint on the incoming qemu process).

Thanks,
Laszlo
Laszlo Ersek Aug. 13, 2014, 11:44 p.m. UTC | #4
On 08/14/14 01:17, Laszlo Ersek wrote:

> - With KVM, the lack of loading MTRR state from KVM, combined with the
>   (partial) storing of MTRR state to KVM, has two consequences:
>   - migration invalidates (loses) MTRR state,

I'll concede that migration *already* loses MTRR state (on KVM), even
before your patch. On the incoming host, the difference is that
pre-patch, the guest continues running (after migration) with MTRRs in
the "initial" KVM state, while post-patch, the guest continues running
after an explicit zeroing of the variable MTRR masks and the deftype.

I admit that it wouldn't be right to say that the patch "causes" MTRR
state loss.

With that, I think I've actually convinced myself that your patch is
correct:

The x86_cpu_reset() hunk is correct in any case, independently of KVM
vs. TCG. (On TCG it even improves MTRR conformance.) Splitting that hunk
into a separate patch might be worthwhile, but not overly important.

The kvm_put_msrs() hunk forces a zero write to the variable MTRR
PhysMasks and the DefType, on both reset and on incoming migration. For
reset, this is correct behavior. For incoming migration, it is not, but
it certainly shouldn't qualify as a regression, relative to the current
status (where MTRR state is simply lost and replaced with initial MTRR
state on the incoming host).

I think the above "end results" could be expressed more clearly in the
code, but I'm already wondering if you'll ever talk to me again, so I'm
willing to give my R-b if you think that's useful... :)

(Again, I might be wrong, easily.)

Thanks
Laszlo
Alex Williamson Aug. 14, 2014, 3:08 p.m. UTC | #5
On Thu, 2014-08-14 at 01:44 +0200, Laszlo Ersek wrote:
> On 08/14/14 01:17, Laszlo Ersek wrote:
> 
> > - With KVM, the lack of loading MTRR state from KVM, combined with the
> >   (partial) storing of MTRR state to KVM, has two consequences:
> >   - migration invalidates (loses) MTRR state,
> 
> I'll concede that migration *already* loses MTRR state (on KVM), even
> before your patch. On the incoming host, the difference is that
> pre-patch, the guest continues running (after migration) with MTRRs in
> the "initial" KVM state, while post-patch, the guest continues running
> after an explicit zeroing of the variable MTRR masks and the deftype.
> 
> I admit that it wouldn't be right to say that the patch "causes" MTRR
> state loss.
> 
> With that, I think I've actually convinced myself that your patch is
> correct:
> 
> The x86_cpu_reset() hunk is correct in any case, independently of KVM
> vs. TCG. (On TCG it even improves MTRR conformance.) Splitting that hunk
> into a separate patch might be worthwhile, but not overly important.
> 
> The kvm_put_msrs() hunk forces a zero write to the variable MTRR
> PhysMasks and the DefType, on both reset and on incoming migration. For
> reset, this is correct behavior. For incoming migration, it is not, but
> it certainly shouldn't qualify as a regression, relative to the current
> status (where MTRR state is simply lost and replaced with initial MTRR
> state on the incoming host).
> 
> I think the above "end results" could be expressed more clearly in the
> code, but I'm already wondering if you'll ever talk to me again, so I'm
> willing to give my R-b if you think that's useful... :)

Heh, I think you've highlighted an important point, perhaps several.  I
was assuming my kvm_put_msrs() was only for reset, but it's clearly not.
So I agree that we need both get and put support.  It probably makes
sense to create one patch cleaning up the hardcoded variable register
array vs. the guest-advertised count, another implementing the reset
path, and a
final one adding KVM get/put.  I'll get started.  Thanks for the review.
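
For the get side I'm picturing something along these lines -- a rough,
untested sketch that just follows the pattern of the MSRs already
handled in kvm_get_msrs():

    /* in the "request" part of kvm_get_msrs(): */
    if (has_msr_mtrr) {
        msrs[n++].index = MSR_MTRRdefType;
        for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
            msrs[n++].index = MSR_MTRRphysBase(i);
            msrs[n++].index = MSR_MTRRphysMask(i);
        }
    }

    /* ... and in the switch over the entries returned by KVM_GET_MSRS: */
    case MSR_MTRRdefType:
        env->mtrr_deftype = msrs[i].data;
        break;
    case MSR_MTRRphysBase(0):
        env->mtrr_var[0].base = msrs[i].data;
        break;
    case MSR_MTRRphysMask(0):
        env->mtrr_var[0].mask = msrs[i].data;
        break;
    /* ... and so on for the remaining variable ranges and, eventually,
     * the fixed-range registers ... */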

Alex

Patch

diff --git a/target-i386/cpu.c b/target-i386/cpu.c
index 6d008ab..b5ae654 100644
--- a/target-i386/cpu.c
+++ b/target-i386/cpu.c
@@ -2588,6 +2588,12 @@  static void x86_cpu_reset(CPUState *s)
 
     env->xcr0 = 1;
 
+    /* MTRR init - Clear global enable bit and valid bit in each variable reg */
+    env->mtrr_deftype &= ~MSR_MTRRdefType_Enable;
+    for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
+        env->mtrr_var[i].mask &= ~MSR_MTRRphysMask_Valid;
+    }
+
 #if !defined(CONFIG_USER_ONLY)
     /* We hard-wire the BSP to the first CPU. */
     if (s->cpu_index == 0) {
diff --git a/target-i386/cpu.h b/target-i386/cpu.h
index e634d83..139890f 100644
--- a/target-i386/cpu.h
+++ b/target-i386/cpu.h
@@ -337,6 +337,8 @@ 
 #define MSR_MTRRphysBase(reg)           (0x200 + 2 * (reg))
 #define MSR_MTRRphysMask(reg)           (0x200 + 2 * (reg) + 1)
 
+#define MSR_MTRRphysMask_Valid (1 << 11)
+
 #define MSR_MTRRfix64K_00000            0x250
 #define MSR_MTRRfix16K_80000            0x258
 #define MSR_MTRRfix16K_A0000            0x259
@@ -353,6 +355,8 @@ 
 
 #define MSR_MTRRdefType                 0x2ff
 
+#define MSR_MTRRdefType_Enable (1 << 11)
+
 #define MSR_CORE_PERF_FIXED_CTR0        0x309
 #define MSR_CORE_PERF_FIXED_CTR1        0x30a
 #define MSR_CORE_PERF_FIXED_CTR2        0x30b
diff --git a/target-i386/kvm.c b/target-i386/kvm.c
index 097fe11..cb31338 100644
--- a/target-i386/kvm.c
+++ b/target-i386/kvm.c
@@ -79,6 +79,7 @@  static int lm_capable_kernel;
 static bool has_msr_hv_hypercall;
 static bool has_msr_hv_vapic;
 static bool has_msr_hv_tsc;
+static bool has_msr_mtrr;
 
 static bool has_msr_architectural_pmu;
 static uint32_t num_architectural_pmu_counters;
@@ -739,6 +740,10 @@  int kvm_arch_init_vcpu(CPUState *cs)
         env->kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
     }
 
+    if (env->features[FEAT_1_EDX] & CPUID_MTRR) {
+        has_msr_mtrr = true;
+    }
+
     return 0;
 }
 
@@ -1183,7 +1188,7 @@  static int kvm_put_msrs(X86CPU *cpu, int level)
     CPUX86State *env = &cpu->env;
     struct {
         struct kvm_msrs info;
-        struct kvm_msr_entry entries[100];
+        struct kvm_msr_entry entries[128];
     } msr_data;
     struct kvm_msr_entry *msrs = msr_data.entries;
     int n = 0, i;
@@ -1278,6 +1283,13 @@  static int kvm_put_msrs(X86CPU *cpu, int level)
             kvm_msr_entry_set(&msrs[n++], HV_X64_MSR_REFERENCE_TSC,
                               env->msr_hv_tsc);
         }
+        if (has_msr_mtrr) {
+            kvm_msr_entry_set(&msrs[n++], MSR_MTRRdefType, env->mtrr_deftype);
+            for (i = 0; i < MSR_MTRRcap_VCNT; i++) {
+                kvm_msr_entry_set(&msrs[n++],
+                                  MSR_MTRRphysMask(i), env->mtrr_var[i].mask);
+            }
+        }
 
         /* Note: MSR_IA32_FEATURE_CONTROL is written separately, see
          *       kvm_put_msr_feature_control. */