Message ID | b88fe895e6f71711387ca153f4f1b3fbb0aa2176.1724556967.git.mchehab+huawei@kernel.org |
---|---|
State | New |
Headers | show |
Series | Add ACPI CPER firmware first error injection on ARM emulation | expand |
On Sun, 25 Aug 2024 at 04:46, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > Accurately injecting an ARM Processor error ACPI/APEI GHES > error record requires the value of the ARM Multiprocessor > Affinity Register (mpidr). > > While ARM implements it, this is currently not visible. > > Add a field at CPU storing it, and place it at arm_cpu_properties > as experimental, thus allowing it to be queried via QMP using > qom-get function. > static Property arm_cpu_properties[] = { > DEFINE_PROP_UINT64("midr", ARMCPU, midr, 0), > + DEFINE_PROP_UINT64("x-mpidr", ARMCPU, mpidr, 0), > DEFINE_PROP_UINT64("mp-affinity", ARMCPU, > mp_affinity, ARM64_AFFINITY_INVALID), > DEFINE_PROP_INT32("node-id", ARMCPU, node_id, CPU_UNSET_NUMA_NODE_ID), Why do we need this? Why is it experimental? The later patch seems to use it via QMP, which I'm not super enthusiastic about -- the preexisting mpidr and mp-affinity properties are there for code that is creating CPU objects to configure the CPU object, not as a query interface for QOM. thanks -- PMM
Em Sun, 25 Aug 2024 12:34:14 +0100 Peter Maydell <peter.maydell@linaro.org> escreveu: > On Sun, 25 Aug 2024 at 04:46, Mauro Carvalho Chehab > <mchehab+huawei@kernel.org> wrote: > > > > Accurately injecting an ARM Processor error ACPI/APEI GHES > > error record requires the value of the ARM Multiprocessor > > Affinity Register (mpidr). > > > > While ARM implements it, this is currently not visible. > > > > Add a field at CPU storing it, and place it at arm_cpu_properties > > as experimental, thus allowing it to be queried via QMP using > > qom-get function. > > > static Property arm_cpu_properties[] = { > > DEFINE_PROP_UINT64("midr", ARMCPU, midr, 0), > > + DEFINE_PROP_UINT64("x-mpidr", ARMCPU, mpidr, 0), > > DEFINE_PROP_UINT64("mp-affinity", ARMCPU, > > mp_affinity, ARM64_AFFINITY_INVALID), > > DEFINE_PROP_INT32("node-id", ARMCPU, node_id, CPU_UNSET_NUMA_NODE_ID), > > Why do we need this? The ACPI HEST tables, in particular when using GHESv2 provide several kinds of errors. Among them, we have ARM Processor Error, as defined at UEFI 2.10 spec (and earlier versions), the Common Platform Error Record (CPER) is defined as: https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html?highlight=ghes#arm-processor-error-section There are two fields that are part of the CPER record. One of them is mandatory (MIDR); the other one is optional, but needed to decode another field. So, basically those errors need them. > Why is it experimental? This was a suggestion from Igor. As for now the QAPI for external error injection is experimental, It makes sense to me to keep it experimental as well. > The later patch > seems to use it via QMP, which I'm not super enthusiastic > about -- the preexisting mpidr and mp-affinity properties are > there for code that is creating CPU objects to configure > the CPU object, not as a query interface for QOM. I saw that. Basically the decoding by OS guest depends on MPIDR, as explained at the description of Error affinity level field: "For errors that can be attributed to a specific affinity level, this field defines the affinity level at which the error was produced, detected, and/or consumed. This is a value between 0 and 3. All other values (4-255) are reserved For example, a vendor may choose to define affinity levels as follows: Level 0: errors that can be precisely attributed to a specific CPU (e.g. due to a synchronous external abort) Level 1: Cache parity and/or ECC errors detected at cache of affinity level 1 (e.g. only attributed to higher level cache due to prefetching and/or error propagation) NOTE: Detailed meanings and groupings of affinity level are chip and/or platform specific. The affinity level described here must be consistent with the platform definitions used MPIDR. For cache/TLB errors, the cache/TLB level is provided by the cache/TLB error structure, which may differ from affinity level." Regards, Mauro
On Mon, 26 Aug 2024 at 04:12, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > Em Sun, 25 Aug 2024 12:34:14 +0100 > Peter Maydell <peter.maydell@linaro.org> escreveu: > > > On Sun, 25 Aug 2024 at 04:46, Mauro Carvalho Chehab > > <mchehab+huawei@kernel.org> wrote: > > > > > > Accurately injecting an ARM Processor error ACPI/APEI GHES > > > error record requires the value of the ARM Multiprocessor > > > Affinity Register (mpidr). > > > > > > While ARM implements it, this is currently not visible. > > > > > > Add a field at CPU storing it, and place it at arm_cpu_properties > > > as experimental, thus allowing it to be queried via QMP using > > > qom-get function. > > > > > static Property arm_cpu_properties[] = { > > > DEFINE_PROP_UINT64("midr", ARMCPU, midr, 0), > > > + DEFINE_PROP_UINT64("x-mpidr", ARMCPU, mpidr, 0), > > > DEFINE_PROP_UINT64("mp-affinity", ARMCPU, > > > mp_affinity, ARM64_AFFINITY_INVALID), > > > DEFINE_PROP_INT32("node-id", ARMCPU, node_id, CPU_UNSET_NUMA_NODE_ID), > > > > Why do we need this? > > The ACPI HEST tables, in particular when using GHESv2 provide > several kinds of errors. Among them, we have ARM Processor Error, > as defined at UEFI 2.10 spec (and earlier versions), the Common > Platform Error Record (CPER) is defined as: > > https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html?highlight=ghes#arm-processor-error-section > > There are two fields that are part of the CPER record. One of them is > mandatory (MIDR); the other one is optional, but needed to decode another > field. > > So, basically those errors need them. OK, but why do scripts outside of QEMU need the information, as opposed to telling QEMU "hey, generate an error" and QEMU knowing the format to use? Do we have any other QMP APIs where something external provides raw ACPI data like this? -- PMM
Em Fri, 30 Aug 2024 17:27:27 +0100 Peter Maydell <peter.maydell@linaro.org> escreveu: > On Mon, 26 Aug 2024 at 04:12, Mauro Carvalho Chehab > <mchehab+huawei@kernel.org> wrote: > > > > Em Sun, 25 Aug 2024 12:34:14 +0100 > > Peter Maydell <peter.maydell@linaro.org> escreveu: > > > > > On Sun, 25 Aug 2024 at 04:46, Mauro Carvalho Chehab > > > <mchehab+huawei@kernel.org> wrote: > > > > > > > > Accurately injecting an ARM Processor error ACPI/APEI GHES > > > > error record requires the value of the ARM Multiprocessor > > > > Affinity Register (mpidr). > > > > > > > > While ARM implements it, this is currently not visible. > > > > > > > > Add a field at CPU storing it, and place it at arm_cpu_properties > > > > as experimental, thus allowing it to be queried via QMP using > > > > qom-get function. > > > > > > > static Property arm_cpu_properties[] = { > > > > DEFINE_PROP_UINT64("midr", ARMCPU, midr, 0), > > > > + DEFINE_PROP_UINT64("x-mpidr", ARMCPU, mpidr, 0), > > > > DEFINE_PROP_UINT64("mp-affinity", ARMCPU, > > > > mp_affinity, ARM64_AFFINITY_INVALID), > > > > DEFINE_PROP_INT32("node-id", ARMCPU, node_id, CPU_UNSET_NUMA_NODE_ID), > > > > > > Why do we need this? > > > > The ACPI HEST tables, in particular when using GHESv2 provide > > several kinds of errors. Among them, we have ARM Processor Error, > > as defined at UEFI 2.10 spec (and earlier versions), the Common > > Platform Error Record (CPER) is defined as: > > > > https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html?highlight=ghes#arm-processor-error-section > > > > There are two fields that are part of the CPER record. One of them is > > mandatory (MIDR); the other one is optional, but needed to decode another > > field. > > > > So, basically those errors need them. > > OK, but why do scripts outside of QEMU need the information, > as opposed to telling QEMU "hey, generate an error" and > QEMU knowing the format to use? Do we have any other > QMP APIs where something external provides raw ACPI > data like this? This was discussed during the review of this patch series. See, the ACPI Platform Error Interfaces (APEI) code currently in QEMU implements limited support for ACPI HEST - Hardware Error Source Table [1]. [1] https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#acpi-error-source HEST consists of, currently, 9 error types (plus 3 obsoleted ones). Among them, there is support for generic errors via GHES and GHESv2 types. While not officially obsoleted, GHES is superseded by GHESv2. GHESv2 (and GHES) has a section type field to identify which error type it is [2]. Currently, there are +10 defined UUIDs for the section type. [2] https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#section-descriptor The current code on ghes.c implements GHESv2 support for a single type (memory error), received from the host OS via SIGBUS. Testing such code and injecting such error is not easy, as the host OS needs to send a SIGBUS to the guest, this reflecting an error at the main OS. Such code also has several limitations. - At the first three versions of this patch set, the code was just doing like what you said: it was adding an error injection for a HEST GHESv2 ARM Processor Error. So the error record (CPER) were produced in QEMU using some optional parameters passed via QMP to change fields when needed. With such approach, QEMU could use directly the value from MIDR and MPIDR. The main disadvantage is that, to make full support of HEST, a lot of code will be needed to add support for every GHESv2 type and for every GHESv2 section type. So, the feedback we had were to re-implement it into a generic way. The generic CPER error inject approach (since v4 of this series), has soma advantages: - it is easy to do fuzz testing, as the entire CPER is built via a python script; - no need to modify QEMU to support other GHESv2 types of record and to support other types of processors; - GHESv2 fields can also be dynamically generated; - It shouldn't be hard to change the code to support other types of HEST table (currently, only GHESv2 is supported). The disadvantage is that queries are needed to pick configuration and register values from the current emulation to do error injection. For ARM Processor Error, it means that MPIDR and MIDR, are needed. Other processors and other error types will also require to query other data from QEMU, either using already-existing QMP code or by adding new ones. Yet, the amount of code for such queries seem to be smaller than the amount of code to be added for every single GHESv2/HEST type. - Worth saying that QEMU may still require internal HEST/GHES errors to be able to reflect at the guests hardware problems detected at the host OS. So, for instance, if a host OS memory is poisoned due to hardware errors, QEMU and guests need to know, in order to kill processes affected by a bad memory. Regards, Mauro
diff --git a/target/arm/cpu.c b/target/arm/cpu.c index 19191c239181..30fcf0a10f46 100644 --- a/target/arm/cpu.c +++ b/target/arm/cpu.c @@ -2619,6 +2619,7 @@ static ObjectClass *arm_cpu_class_by_name(const char *cpu_model) static Property arm_cpu_properties[] = { DEFINE_PROP_UINT64("midr", ARMCPU, midr, 0), + DEFINE_PROP_UINT64("x-mpidr", ARMCPU, mpidr, 0), DEFINE_PROP_UINT64("mp-affinity", ARMCPU, mp_affinity, ARM64_AFFINITY_INVALID), DEFINE_PROP_INT32("node-id", ARMCPU, node_id, CPU_UNSET_NUMA_NODE_ID), diff --git a/target/arm/cpu.h b/target/arm/cpu.h index 9a3fd595621f..3ad4de793409 100644 --- a/target/arm/cpu.h +++ b/target/arm/cpu.h @@ -1033,6 +1033,7 @@ struct ArchCPU { uint64_t reset_pmcr_el0; } isar; uint64_t midr; + uint64_t mpidr; uint32_t revidr; uint32_t reset_fpsid; uint64_t ctr; diff --git a/target/arm/helper.c b/target/arm/helper.c index 0a582c1cd3b3..d6e7aa069489 100644 --- a/target/arm/helper.c +++ b/target/arm/helper.c @@ -4690,7 +4690,7 @@ static uint64_t mpidr_read_val(CPUARMState *env) return mpidr; } -static uint64_t mpidr_read(CPUARMState *env, const ARMCPRegInfo *ri) +static uint64_t mpidr_read(CPUARMState *env) { unsigned int cur_el = arm_current_el(env); @@ -4700,6 +4700,11 @@ static uint64_t mpidr_read(CPUARMState *env, const ARMCPRegInfo *ri) return mpidr_read_val(env); } +static uint64_t mpidr_read_ri(CPUARMState *env, const ARMCPRegInfo *ri) +{ + return mpidr_read(env); +} + static const ARMCPRegInfo lpae_cp_reginfo[] = { /* NOP AMAIR0/1 */ { .name = "AMAIR0", .state = ARM_CP_STATE_BOTH, @@ -9721,7 +9726,7 @@ void register_cp_regs_for_features(ARMCPU *cpu) { .name = "MPIDR_EL1", .state = ARM_CP_STATE_BOTH, .opc0 = 3, .crn = 0, .crm = 0, .opc1 = 0, .opc2 = 5, .fgt = FGT_MPIDR_EL1, - .access = PL1_R, .readfn = mpidr_read, .type = ARM_CP_NO_RAW }, + .access = PL1_R, .readfn = mpidr_read_ri, .type = ARM_CP_NO_RAW }, }; #ifdef CONFIG_USER_ONLY static const ARMCPRegUserSpaceInfo mpidr_user_cp_reginfo[] = { @@ -9731,6 +9736,7 @@ void register_cp_regs_for_features(ARMCPU *cpu) modify_arm_cp_regs(mpidr_cp_reginfo, mpidr_user_cp_reginfo); #endif define_arm_cp_regs(cpu, mpidr_cp_reginfo); + cpu->mpidr = mpidr_read(env); } if (arm_feature(env, ARM_FEATURE_AUXCR)) {
Accurately injecting an ARM Processor error ACPI/APEI GHES error record requires the value of the ARM Multiprocessor Affinity Register (mpidr). While ARM implements it, this is currently not visible. Add a field at CPU storing it, and place it at arm_cpu_properties as experimental, thus allowing it to be queried via QMP using qom-get function. Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> --- target/arm/cpu.c | 1 + target/arm/cpu.h | 1 + target/arm/helper.c | 10 ++++++++-- 3 files changed, 10 insertions(+), 2 deletions(-)