Message ID: 20230621013821.6874-3-dongli.zhang@oracle.com
State: New
Series: target/i386/kvm: fix two svm pmu virtualization bugs
On Wed, Jun 21, 2023 at 9:39 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>
> The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
> side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
> the KVM side.
>
> However, only the Intel gp/fixed/global pmu registers are involved. There
> is not any implementation for AMD pmu registers. The
> 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are
> calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for
> AMD. Before AMD PerfMonV2, the number of gp registers is decided based on
> the CPU version.

Updating the relevant documentation to clarify this part of the deficiency
would be a good first step.

>
> This patch is to add the support for AMD version=1 pmu, to get and put AMD
> pmu registers. Otherwise, there will be a bug:

AMD version=1?

AMD does not have a version 1; it starts directly at version 2, perhaps
for x86 compatibility. AMD also does not have the so-called architectural
pmu, so has_architectural_pmu_version may need to be renamed for AMD.

It might be more helpful to add similar support for AMD PerfMonV2.

>
> 1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
> is running "perf top". The pmu registers are not disabled gracefully.
>
> 2. Although the x86_cpu_reset() resets many registers to zero, the
> kvm_put_msrs() does not put AMD pmu registers to the KVM side. As a result,
> some pmu events are still enabled at the KVM side.

I agree that we should have done that, especially if the guest pmu is
enabled on AMD platforms.

>
> 3. The KVM pmc_speculative_in_use() always returns true so that the events
> will not be reclaimed. The kvm_pmc->perf_event is still active.
>
> 4. After the reboot, the VM kernel reports the error below:
>
> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)
>
> 5. In the worst case, the active kvm_pmc->perf_event is still able to
> inject unknown NMIs randomly to the VM kernel.
>
> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>
> The patch is to fix the issue by resetting AMD pmu registers during the
> reset.

I'm not sure if the qemu_reset or VM kexec will necessarily trigger
kvm::amd_pmu_reset().

>
> Cc: Joe Jin <joe.jin@oracle.com>
> Cc: Like Xu <likexu@tencent.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  target/i386/cpu.h     |  5 +++
>  target/i386/kvm/kvm.c | 83 +++++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 86 insertions(+), 2 deletions(-)
>
> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
> index cd047e0410..b8ba72e87a 100644
> --- a/target/i386/cpu.h
> +++ b/target/i386/cpu.h
> @@ -471,6 +471,11 @@ typedef enum X86Seg {
>  #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>  #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>
> +#define MSR_K7_EVNTSEL0                 0xc0010000
> +#define MSR_K7_PERFCTR0                 0xc0010004
> +#define MSR_F15H_PERF_CTL0              0xc0010200
> +#define MSR_F15H_PERF_CTR0              0xc0010201
> +
>  #define MSR_MC0_CTL                     0x400
>  #define MSR_MC0_STATUS                  0x401
>  #define MSR_MC0_ADDR                    0x402
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index bf4136fa1b..a0f7273dad 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -2084,6 +2084,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
>          }
>      }
>
> +    /*
> +     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
> +     * disable the AMD pmu virtualization.
> +     *
> +     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
> +     * indicates the KVM side has already disabled the pmu virtualization.
> +     */
> +    if (IS_AMD_CPU(env) && !cs->kvm_state->pmu_cap_disabled) {
> +        int64_t family;
> +
> +        family = (env->cpuid_version >> 8) & 0xf;
> +        if (family == 0xf) {
> +            family += (env->cpuid_version >> 20) & 0xff;
> +        }
> +
> +        if (family >= 6) {
> +            has_architectural_pmu_version = 1;
> +
> +            if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
> +                num_architectural_pmu_gp_counters = 6;

Please make the code a little more readable with some macro definitions.

#define AMD64_NUM_COUNTERS      4
#define AMD64_NUM_COUNTERS_CORE 6

> +            } else {
> +                num_architectural_pmu_gp_counters = 4;
> +            }
> +        }
> +    }
> +
>      cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
>
>      for (i = 0x80000000; i <= limit; i++) {
> @@ -3438,7 +3464,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>          kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
>      }
>
> -    if (has_architectural_pmu_version > 0) {
> +    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {

It seems to be saying here that it is possible to perform kvm_msr_entry_add()
for the Intel-related pmu on AMD platforms, so the reverse is expected to be
feasible as well. Why is the distinction needed?

>          if (has_architectural_pmu_version > 1) {
>              /* Stop the counter.  */
>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> @@ -3469,6 +3495,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>                                env->msr_global_ctrl);
>          }
>      }
> +
> +    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
> +        uint32_t step = 1;
> +
> +        if (num_architectural_pmu_gp_counters == 6) {
> +            sel_base = MSR_F15H_PERF_CTL0;
> +            ctr_base = MSR_F15H_PERF_CTR0;
> +            step = 2;
> +        }
> +
> +        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
> +            kvm_msr_entry_add(cpu, ctr_base + i * step,
> +                              env->msr_gp_counters[i]);
> +            kvm_msr_entry_add(cpu, sel_base + i * step,
> +                              env->msr_gp_evtsel[i]);
> +        }
> +    }
> +
>      /*
>       * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>       * only sync them to KVM on the first cpu
> @@ -3929,7 +3975,7 @@ static int kvm_get_msrs(X86CPU *cpu)
>      if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>          kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>      }
> -    if (has_architectural_pmu_version > 0) {
> +    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
>          if (has_architectural_pmu_version > 1) {
>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
> @@ -3945,6 +3991,25 @@ static int kvm_get_msrs(X86CPU *cpu)
>          }
>      }
>
> +    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
> +        uint32_t step = 1;
> +
> +        if (num_architectural_pmu_gp_counters == 6) {
> +            sel_base = MSR_F15H_PERF_CTL0;
> +            ctr_base = MSR_F15H_PERF_CTR0;
> +            step = 2;

Perhaps it would be more helpful to add a little commentary on why
step 2 is needed.

> +        }
> +
> +        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
> +            kvm_msr_entry_add(cpu, ctr_base + i * step,
> +                              env->msr_gp_counters[i]);
> +            kvm_msr_entry_add(cpu, sel_base + i * step,
> +                              env->msr_gp_evtsel[i]);
> +        }
> +    }
> +
>      if (env->mcg_cap) {
>          kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>          kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
> @@ -4230,6 +4295,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>          case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>              env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>              break;
> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
> +            break;
> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
> +            break;
> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
> +            index = index - MSR_F15H_PERF_CTL0;
> +            if (index & 0x1) {
> +                env->msr_gp_counters[index] = msrs[i].data;
> +            } else {
> +                env->msr_gp_evtsel[index] = msrs[i].data;
> +            }
> +            break;
>          case HV_X64_MSR_HYPERCALL:
>              env->msr_hv_hypercall = msrs[i].data;
>              break;
> --
> 2.34.1
>
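An aside on the step-2 question raised in the review above: on AMD, the
legacy K7 counters live in two contiguous MSR banks, while the F15H core
extensions interleave select/counter pairs, so the stride between pairs is
2. The sketch below is a standalone illustration, not QEMU code; the
addresses come from the MSR definitions in the patch, and the helper name
show_gp_counter_msrs() is ours. It compiles as C99 and simply prints the
MSR pair backing each gp counter.

    #include <stdint.h>
    #include <stdio.h>

    #define MSR_K7_EVNTSEL0    0xc0010000  /* selects 0..3 are contiguous */
    #define MSR_K7_PERFCTR0    0xc0010004  /* counters 0..3 are contiguous */
    #define MSR_F15H_PERF_CTL0 0xc0010200  /* CTL(i)/CTR(i) pairs interleave */
    #define MSR_F15H_PERF_CTR0 0xc0010201

    /* Hypothetical helper: print the MSR pair that backs gp counter i. */
    static void show_gp_counter_msrs(int num_counters)
    {
        uint32_t sel_base = MSR_K7_EVNTSEL0, ctr_base = MSR_K7_PERFCTR0;
        uint32_t step = 1;

        if (num_counters == 6) {      /* PERFCTR_CORE layout */
            sel_base = MSR_F15H_PERF_CTL0;
            ctr_base = MSR_F15H_PERF_CTR0;
            step = 2;                 /* CTL(i) and CTR(i) are adjacent, so
                                         pair i starts two MSRs later */
        }
        for (int i = 0; i < num_counters; i++) {
            printf("counter %d: sel=0x%x ctr=0x%x\n", i,
                   (unsigned)(sel_base + i * step),
                   (unsigned)(ctr_base + i * step));
        }
    }

    int main(void)
    {
        show_gp_counter_msrs(4);  /* legacy K7 bank */
        show_gp_counter_msrs(6);  /* PERFCTR_CORE interleaved pairs */
        return 0;
    }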
Hi Like,

On 7/2/23 07:15, Like Xu wrote:
> On Wed, Jun 21, 2023 at 9:39 AM Dongli Zhang <dongli.zhang@oracle.com> wrote:
>>
>> The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
>> side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
>> the KVM side.
>>
>> However, only the Intel gp/fixed/global pmu registers are involved. There
>> is not any implementation for AMD pmu registers. The
>> 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are
>> calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for
>> AMD. Before AMD PerfMonV2, the number of gp registers is decided based on
>> the CPU version.
>
> Updating the relevant documentation to clarify this part of the deficiency
> would be a good first step.

Would you mind suggesting where to document this TODO/deficiency? The only
place I can find is a new TODO under docs/system/i386, but I am not sure it
is worth it, since this bugfix is not complex.

>
>>
>> This patch is to add the support for AMD version=1 pmu, to get and put AMD
>> pmu registers. Otherwise, there will be a bug:
>
> AMD version=1?
>
> AMD does not have a version 1; it starts directly at version 2, perhaps
> for x86 compatibility. AMD also does not have the so-called architectural
> pmu, so has_architectural_pmu_version may need to be renamed for AMD.

Thank you very much for the explanation. I will use version 2.

> It might be more helpful to add similar support for AMD PerfMonV2.

Yes, I will do that. At that time, the AMD PerfMonV2 KVM patchset (from you)
was still in progress. I see the pull request from Paolo today. I will add
that in v3.

>
>>
>> 1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
>> is running "perf top". The pmu registers are not disabled gracefully.
>>
>> 2. Although the x86_cpu_reset() resets many registers to zero, the
>> kvm_put_msrs() does not put AMD pmu registers to the KVM side. As a result,
>> some pmu events are still enabled at the KVM side.
>
> I agree that we should have done that, especially if the guest pmu is
> enabled on AMD platforms.
>
>>
>> 3. The KVM pmc_speculative_in_use() always returns true so that the events
>> will not be reclaimed. The kvm_pmc->perf_event is still active.
>>
>> 4. After the reboot, the VM kernel reports the error below:
>>
>> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
>> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)
>>
>> 5. In the worst case, the active kvm_pmc->perf_event is still able to
>> inject unknown NMIs randomly to the VM kernel.
>>
>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>
>> The patch is to fix the issue by resetting AMD pmu registers during the
>> reset.
>
> I'm not sure if the qemu_reset or VM kexec will necessarily trigger
> kvm::amd_pmu_reset().

According to the mainline linux kernel:

kvm_vcpu_reset() -> kvm_pmu_reset() -> amd_pmu_reset()

The PMU will not reset when init_event==true, that is, when processing the
INIT: line 12049.

11975 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
11976 {
... ...
12049         if (!init_event) {
12050                 kvm_pmu_reset(vcpu);
12051                 vcpu->arch.smbase = 0x30000;
12052
12053                 vcpu->arch.msr_misc_features_enables = 0;
12054                 vcpu->arch.ia32_misc_enable_msr = MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL |
12055                                                   MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
12056
12057                 __kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
12058                 __kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
12059         }

According to the commit below ...

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d28bc9dd25ce023270d2e039e7c98d38ecbf7758

... "INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC,
PMU, MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID
and BSP."

That's why initially I did not send a KVM patch to remove the 'init_event'.

>
>>
>> Cc: Joe Jin <joe.jin@oracle.com>
>> Cc: Like Xu <likexu@tencent.com>
>> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
>> ---
>>  target/i386/cpu.h     |  5 +++
>>  target/i386/kvm/kvm.c | 83 +++++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>> index cd047e0410..b8ba72e87a 100644
>> --- a/target/i386/cpu.h
>> +++ b/target/i386/cpu.h
>> @@ -471,6 +471,11 @@ typedef enum X86Seg {
>>  #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>>  #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>>
>> +#define MSR_K7_EVNTSEL0                 0xc0010000
>> +#define MSR_K7_PERFCTR0                 0xc0010004
>> +#define MSR_F15H_PERF_CTL0              0xc0010200
>> +#define MSR_F15H_PERF_CTR0              0xc0010201
>> +
>>  #define MSR_MC0_CTL                     0x400
>>  #define MSR_MC0_STATUS                  0x401
>>  #define MSR_MC0_ADDR                    0x402
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index bf4136fa1b..a0f7273dad 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -2084,6 +2084,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
>>          }
>>      }
>>
>> +    /*
>> +     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
>> +     * disable the AMD pmu virtualization.
>> +     *
>> +     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
>> +     * indicates the KVM side has already disabled the pmu virtualization.
>> +     */
>> +    if (IS_AMD_CPU(env) && !cs->kvm_state->pmu_cap_disabled) {
>> +        int64_t family;
>> +
>> +        family = (env->cpuid_version >> 8) & 0xf;
>> +        if (family == 0xf) {
>> +            family += (env->cpuid_version >> 20) & 0xff;
>> +        }
>> +
>> +        if (family >= 6) {
>> +            has_architectural_pmu_version = 1;
>> +
>> +            if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
>> +                num_architectural_pmu_gp_counters = 6;
>
> Please make the code a little more readable with some macro definitions.
>
> #define AMD64_NUM_COUNTERS      4
> #define AMD64_NUM_COUNTERS_CORE 6

I will do that. Thank you very much!

>
>> +            } else {
>> +                num_architectural_pmu_gp_counters = 4;
>> +            }
>> +        }
>> +    }
>> +
>>      cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
>>
>>      for (i = 0x80000000; i <= limit; i++) {
>> @@ -3438,7 +3464,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>          kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
>>      }
>>
>> -    if (has_architectural_pmu_version > 0) {
>> +    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
>
> It seems to be saying here that it is possible to perform kvm_msr_entry_add()
> for the Intel-related pmu on AMD platforms, so the reverse is expected to be
> feasible as well. Why is the distinction needed?

My fault. I always used something like the line below when emulating AMD on
an Intel host:

-cpu EPYC,vendor="AuthenticAMD"

I should consider the AMD-on-Intel and Intel-on-AMD cases more carefully. I
will address that in v3.

>
>>          if (has_architectural_pmu_version > 1) {
>>              /* Stop the counter.  */
>>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>> @@ -3469,6 +3495,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>                                env->msr_global_ctrl);
>>          }
>>      }
>> +
>> +    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
>> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +        uint32_t step = 1;
>> +
>> +        if (num_architectural_pmu_gp_counters == 6) {
>> +            sel_base = MSR_F15H_PERF_CTL0;
>> +            ctr_base = MSR_F15H_PERF_CTR0;
>> +            step = 2;
>> +        }
>> +
>> +        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
>> +            kvm_msr_entry_add(cpu, ctr_base + i * step,
>> +                              env->msr_gp_counters[i]);
>> +            kvm_msr_entry_add(cpu, sel_base + i * step,
>> +                              env->msr_gp_evtsel[i]);
>> +        }
>> +    }
>> +
>>      /*
>>       * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>>       * only sync them to KVM on the first cpu
>> @@ -3929,7 +3975,7 @@ static int kvm_get_msrs(X86CPU *cpu)
>>      if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>>          kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>>      }
>> -    if (has_architectural_pmu_version > 0) {
>> +    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
>>          if (has_architectural_pmu_version > 1) {
>>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
>> @@ -3945,6 +3991,25 @@ static int kvm_get_msrs(X86CPU *cpu)
>>          }
>>      }
>>
>> +    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
>> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +        uint32_t step = 1;
>> +
>> +        if (num_architectural_pmu_gp_counters == 6) {
>> +            sel_base = MSR_F15H_PERF_CTL0;
>> +            ctr_base = MSR_F15H_PERF_CTR0;
>> +            step = 2;
>
> Perhaps it would be more helpful to add a little commentary on why
> step 2 is needed.

Sure, I will do that. The step is 2 because MSR_F15H_PERF_CTL1 is
MSR_F15H_PERF_CTL0 + 2:

#define MSR_F15H_PERF_CTL0 0xc0010200
#define MSR_F15H_PERF_CTR0 0xc0010201

Thank you very much for your feedback. I will address those in v3.

Dongli Zhang

>
>> +        }
>> +
>> +        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
>> +            kvm_msr_entry_add(cpu, ctr_base + i * step,
>> +                              env->msr_gp_counters[i]);
>> +            kvm_msr_entry_add(cpu, sel_base + i * step,
>> +                              env->msr_gp_evtsel[i]);
>> +        }
>> +    }
>> +
>>      if (env->mcg_cap) {
>>          kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>>          kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
>> @@ -4230,6 +4295,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>>          case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>>              env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>>              break;
>> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
>> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
>> +            break;
>> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
>> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
>> +            break;
>> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
>> +            index = index - MSR_F15H_PERF_CTL0;
>> +            if (index & 0x1) {
>> +                env->msr_gp_counters[index] = msrs[i].data;
>> +            } else {
>> +                env->msr_gp_evtsel[index] = msrs[i].data;
>> +            }
>> +            break;
>>          case HV_X64_MSR_HYPERCALL:
>>              env->msr_hv_hypercall = msrs[i].data;
>>              break;
>> --
>> 2.34.1
>>
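As background for the family check discussed above: the patch decodes the
family from CPUID leaf 1 EAX, with the base family in bits 8-11 and the
extended family in bits 20-27 added in when the base family is 0xf. The
standalone sketch below mirrors the patch's computation; the function name
x86_family_from_cpuid() is ours, not QEMU's.

    #include <stdint.h>
    #include <stdio.h>

    /* Mirrors the patch's family computation; cpuid_version is
     * CPUID leaf 1 EAX. */
    static int64_t x86_family_from_cpuid(uint32_t cpuid_version)
    {
        int64_t family = (cpuid_version >> 8) & 0xf;   /* base family */

        if (family == 0xf) {
            /* Families >= 0x10 are encoded via the extended family field. */
            family += (cpuid_version >> 20) & 0xff;
        }
        return family;
    }

    int main(void)
    {
        /* EPYC (Zen, family 0x17): base family 0xf + extended family 0x8. */
        uint32_t eax = (0x8u << 20) | (0xfu << 8) | (0x1u << 4);

        printf("family = 0x%llx\n",
               (unsigned long long)x86_family_from_cpuid(eax));
        return 0;
    }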
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index cd047e0410..b8ba72e87a 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -471,6 +471,11 @@ typedef enum X86Seg {
 #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
 #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
 
+#define MSR_K7_EVNTSEL0                 0xc0010000
+#define MSR_K7_PERFCTR0                 0xc0010004
+#define MSR_F15H_PERF_CTL0              0xc0010200
+#define MSR_F15H_PERF_CTR0              0xc0010201
+
 #define MSR_MC0_CTL                     0x400
 #define MSR_MC0_STATUS                  0x401
 #define MSR_MC0_ADDR                    0x402
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index bf4136fa1b..a0f7273dad 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2084,6 +2084,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
         }
     }
 
+    /*
+     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
+     * disable the AMD pmu virtualization.
+     *
+     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
+     * indicates the KVM side has already disabled the pmu virtualization.
+     */
+    if (IS_AMD_CPU(env) && !cs->kvm_state->pmu_cap_disabled) {
+        int64_t family;
+
+        family = (env->cpuid_version >> 8) & 0xf;
+        if (family == 0xf) {
+            family += (env->cpuid_version >> 20) & 0xff;
+        }
+
+        if (family >= 6) {
+            has_architectural_pmu_version = 1;
+
+            if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
+                num_architectural_pmu_gp_counters = 6;
+            } else {
+                num_architectural_pmu_gp_counters = 4;
+            }
+        }
+    }
+
     cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
 
     for (i = 0x80000000; i <= limit; i++) {
@@ -3438,7 +3464,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
         kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
     }
 
-    if (has_architectural_pmu_version > 0) {
+    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
         if (has_architectural_pmu_version > 1) {
             /* Stop the counter.  */
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
@@ -3469,6 +3495,26 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
                               env->msr_global_ctrl);
         }
     }
+
+    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
+        uint32_t sel_base = MSR_K7_EVNTSEL0;
+        uint32_t ctr_base = MSR_K7_PERFCTR0;
+        uint32_t step = 1;
+
+        if (num_architectural_pmu_gp_counters == 6) {
+            sel_base = MSR_F15H_PERF_CTL0;
+            ctr_base = MSR_F15H_PERF_CTR0;
+            step = 2;
+        }
+
+        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
+            kvm_msr_entry_add(cpu, ctr_base + i * step,
+                              env->msr_gp_counters[i]);
+            kvm_msr_entry_add(cpu, sel_base + i * step,
+                              env->msr_gp_evtsel[i]);
+        }
+    }
+
     /*
      * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
      * only sync them to KVM on the first cpu
@@ -3929,7 +3975,7 @@ static int kvm_get_msrs(X86CPU *cpu)
     if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
         kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
     }
-    if (has_architectural_pmu_version > 0) {
+    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
         if (has_architectural_pmu_version > 1) {
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
             kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
@@ -3945,6 +3991,25 @@ static int kvm_get_msrs(X86CPU *cpu)
         }
     }
 
+    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
+        uint32_t sel_base = MSR_K7_EVNTSEL0;
+        uint32_t ctr_base = MSR_K7_PERFCTR0;
+        uint32_t step = 1;
+
+        if (num_architectural_pmu_gp_counters == 6) {
+            sel_base = MSR_F15H_PERF_CTL0;
+            ctr_base = MSR_F15H_PERF_CTR0;
+            step = 2;
+        }
+
+        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
+            kvm_msr_entry_add(cpu, ctr_base + i * step,
+                              env->msr_gp_counters[i]);
+            kvm_msr_entry_add(cpu, sel_base + i * step,
+                              env->msr_gp_evtsel[i]);
+        }
+    }
+
     if (env->mcg_cap) {
         kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
         kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
@@ -4230,6 +4295,20 @@ static int kvm_get_msrs(X86CPU *cpu)
         case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
             env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
             break;
+        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
+            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
+            break;
+        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
+            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
+            break;
+        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
+            index = index - MSR_F15H_PERF_CTL0;
+            if (index & 0x1) {
+                env->msr_gp_counters[index] = msrs[i].data;
+            } else {
+                env->msr_gp_evtsel[index] = msrs[i].data;
+            }
+            break;
         case HV_X64_MSR_HYPERCALL:
             env->msr_hv_hypercall = msrs[i].data;
             break;
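For context on the KVM_CAP_PMU_CAPABILITY comment added in
kvm_arch_init_vcpu() above: a VMM can ask KVM to disable PMU virtualization
per VM, provided the capability is set before any vCPU is created. The
sketch below is a minimal illustration against the Linux KVM ioctl
interface, not QEMU code; error handling is elided and the helper name
disable_vm_pmu() is ours.

    #include <fcntl.h>
    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    /* Sketch: disable PMU virtualization for a freshly created VM.
     * KVM_CAP_PMU_CAPABILITY must be enabled before any vCPU exists. */
    static int disable_vm_pmu(int vm_fd)
    {
        struct kvm_enable_cap cap = {
            .cap = KVM_CAP_PMU_CAPABILITY,
            .args[0] = KVM_PMU_CAP_DISABLE,
        };

        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
    }

    int main(void)
    {
        int kvm_fd = open("/dev/kvm", O_RDWR);
        int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

        /* Only meaningful if KVM_CHECK_EXTENSION reports the capability. */
        if (ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PMU_CAPABILITY) > 0) {
            disable_vm_pmu(vm_fd);
        }
        return 0;
    }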
The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
the KVM side.

However, only the Intel gp/fixed/global pmu registers are involved. There
is not any implementation for AMD pmu registers. The
'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters' are
calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work for
AMD. Before AMD PerfMonV2, the number of gp registers is decided based on
the CPU version.

This patch is to add the support for AMD version=1 pmu, to get and put AMD
pmu registers. Otherwise, there will be a bug:

1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
is running "perf top". The pmu registers are not disabled gracefully.

2. Although the x86_cpu_reset() resets many registers to zero, the
kvm_put_msrs() does not put AMD pmu registers to the KVM side. As a result,
some pmu events are still enabled at the KVM side.

3. The KVM pmc_speculative_in_use() always returns true so that the events
will not be reclaimed. The kvm_pmc->perf_event is still active.

4. After the reboot, the VM kernel reports the error below:

[    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected, complain to your hardware vendor.
[    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR c0010200 is 530076)

5. In the worst case, the active kvm_pmc->perf_event is still able to
inject unknown NMIs randomly to the VM kernel.

[...] Uhhuh. NMI received for unknown reason 30 on CPU 0.

The patch is to fix the issue by resetting AMD pmu registers during the
reset.

Cc: Joe Jin <joe.jin@oracle.com>
Cc: Like Xu <likexu@tencent.com>
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
---
 target/i386/cpu.h     |  5 +++
 target/i386/kvm/kvm.c | 83 +++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 86 insertions(+), 2 deletions(-)