Message ID | 6a3542a7d8acfbf88c906ec6f6dc5a697257b461.1721630625.git.mchehab+huawei@kernel.org |
---|---|
State | New |
Headers | show |
Series | Add ACPI CPER firmware first error injection for Arm Processor | expand |
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes: > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > 1. Some GHES functions require handling addresses. Add a helper function > to support it. > > 2. Add support for ACPI CPER (firmware-first) ARM processor error injection. > > Compliance with N.2.4.4 ARM Processor Error Section in UEFI 2.6 and > upper specs, using error type bit encoding as detailed at UEFI 2.9A > errata. > > Error injection examples: > > { "execute": "qmp_capabilities" } > > { "execute": "arm-inject-error", > "arguments": { > "errortypes": ['cache-error'] > } > } > > { "execute": "arm-inject-error", > "arguments": { > "errortypes": ['tlb-error'] > } > } > > { "execute": "arm-inject-error", > "arguments": { > "errortypes": ['bus-error'] > } > } > > { "execute": "arm-inject-error", > "arguments": { > "errortypes": ['cache-error', 'tlb-error'] > } > } > > { "execute": "arm-inject-error", > "arguments": { > "errortypes": ['cache-error', 'tlb-error', 'bus-error', 'micro-arch-error'] > } > } > ... > > Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > Co-authored-by: Shiju Jose <shiju.jose@huawei.com> > For Add a logic to handle block addresses, > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > For FW first ARM processor error injection, > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > Signed-off-by: Shiju Jose <shiju.jose@huawei.com> > --- > configs/targets/aarch64-softmmu.mak | 1 + > hw/acpi/ghes.c | 258 ++++++++++++++++++++++++++-- > hw/arm/Kconfig | 4 + > hw/arm/arm_error_inject.c | 35 ++++ > hw/arm/arm_error_inject_stubs.c | 18 ++ > hw/arm/meson.build | 3 + > include/hw/acpi/ghes.h | 2 + > qapi/arm-error-inject.json | 49 ++++++ > qapi/meson.build | 1 + > qapi/qapi-schema.json | 1 + > 10 files changed, 361 insertions(+), 11 deletions(-) > create mode 100644 hw/arm/arm_error_inject.c > create mode 100644 hw/arm/arm_error_inject_stubs.c > create mode 100644 qapi/arm-error-inject.json Since the new file not covered in MAINTAINERS, get_maintainer.pl will blame it on the QAPI maintainers alone. No good. [...] > diff --git a/qapi/arm-error-inject.json b/qapi/arm-error-inject.json > new file mode 100644 > index 000000000000..430e6cea6b60 > --- /dev/null > +++ b/qapi/arm-error-inject.json > @@ -0,0 +1,49 @@ > +# -*- Mode: Python -*- > +# vim: filetype=python > + > +## > +# = ARM Processor Errors > +## > + > +## > +# @ArmProcessorErrorType: > +# > +# Type of ARM processor error to inject > +# > +# @unknown-error: Unknown error Removed in PATCH 7, and unused until then. Why add it in the first place? > +# > +# @cache-error: Cache error > +# > +# @tlb-error: TLB error > +# > +# @bus-error: Bus error. > +# > +# @micro-arch-error: Micro architectural error. > +# > +# Since: 9.1 > +## > +{ 'enum': 'ArmProcessorErrorType', > + 'data': ['unknown-error', > + 'cache-error', Tab in this line. Please convert to spaces. > + 'tlb-error', > + 'bus-error', > + 'micro-arch-error'] > +} > + > +## > +# @arm-inject-error: > +# > +# Inject ARM Processor error. > +# > +# @errortypes: ARM processor error types to inject > +# > +# Features: > +# > +# @unstable: This command is experimental. > +# > +# Since: 9.1 > +## > +{ 'command': 'arm-inject-error', > + 'data': { 'errortypes': ['ArmProcessorErrorType'] }, Please separate words with dashes: 'error-types'. > + 'features': [ 'unstable' ] > +} Is this used only with TARGET_ARM? Why is being able to inject multiple error types at once useful? I'd expect at least some of these errors to come with additional information. For instance, I imagine a bus error is associated with some address. If we encode the the error to inject as an enum value, adding more will be hard. If we wrap the enum in a struct { 'struct': 'ArmProcessorError', 'data': { 'type': 'ArmProcessorErrorType' } } we can later extend it like { 'union': 'ArmProcessorError', 'base: { 'type': 'ArmProcessorErrorType' } 'data': { 'bus-error': 'ArmProcessorBusErrorData' } } { 'struct': 'ArmProcessorBusErrorData', 'data': ... } > diff --git a/qapi/meson.build b/qapi/meson.build > index e7bc54e5d047..5927932c4be3 100644 > --- a/qapi/meson.build > +++ b/qapi/meson.build > @@ -22,6 +22,7 @@ if have_system or have_tools or have_ga > endif > > qapi_all_modules = [ > + 'arm-error-inject', > 'authz', > 'block', > 'block-core', > diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json > index b1581988e4eb..479a22de7e43 100644 > --- a/qapi/qapi-schema.json > +++ b/qapi/qapi-schema.json > @@ -81,3 +81,4 @@ > { 'include': 'vfio.json' } > { 'include': 'cryptodev.json' } > { 'include': 'cxl.json' } > +{ 'include': 'arm-error-inject.json' }
Em Fri, 26 Jul 2024 13:44:12 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> escreveu: > On Mon, 22 Jul 2024 08:45:56 +0200 > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > 1. Some GHES functions require handling addresses. Add a helper function > > to support it. > > > > 2. Add support for ACPI CPER (firmware-first) ARM processor error injection. > > > > Compliance with N.2.4.4 ARM Processor Error Section in UEFI 2.6 and > > upper specs, using error type bit encoding as detailed at UEFI 2.9A > > errata. > > > > Error injection examples: > > > > { "execute": "qmp_capabilities" } > > > > { "execute": "arm-inject-error", > > "arguments": { > > "errortypes": ['cache-error'] > > } > > } > > > > { "execute": "arm-inject-error", > > "arguments": { > > "errortypes": ['tlb-error'] > > } > > } > > > > { "execute": "arm-inject-error", > > "arguments": { > > "errortypes": ['bus-error'] > > } > > } > > > > { "execute": "arm-inject-error", > > "arguments": { > > "errortypes": ['cache-error', 'tlb-error'] > > } > > } > > > > { "execute": "arm-inject-error", > > "arguments": { > > "errortypes": ['cache-error', 'tlb-error', 'bus-error', 'micro-arch-error'] > > } > > } > > ... > > > > Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > > Co-authored-by: Shiju Jose <shiju.jose@huawei.com> > > For Add a logic to handle block addresses, > # before comments I think? > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > > For FW first ARM processor error injection, > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > > Signed-off-by: Shiju Jose <shiju.jose@huawei.com> > I can't remember what I wrote in here so may well be commenting on > my past self ;) > > > --- > > configs/targets/aarch64-softmmu.mak | 1 + > > hw/acpi/ghes.c | 258 ++++++++++++++++++++++++++-- > > hw/arm/Kconfig | 4 + > > hw/arm/arm_error_inject.c | 35 ++++ > > hw/arm/arm_error_inject_stubs.c | 18 ++ > > hw/arm/meson.build | 3 + > > include/hw/acpi/ghes.h | 2 + > > qapi/arm-error-inject.json | 49 ++++++ > > qapi/meson.build | 1 + > > qapi/qapi-schema.json | 1 + > > 10 files changed, 361 insertions(+), 11 deletions(-) > > create mode 100644 hw/arm/arm_error_inject.c > > create mode 100644 hw/arm/arm_error_inject_stubs.c > > create mode 100644 qapi/arm-error-inject.json > > > > diff --git a/configs/targets/aarch64-softmmu.mak b/configs/targets/aarch64-softmmu.mak > > index 84cb32dc2f4f..b4b3cd97934a 100644 > > --- a/configs/targets/aarch64-softmmu.mak > > +++ b/configs/targets/aarch64-softmmu.mak > > @@ -5,3 +5,4 @@ TARGET_KVM_HAVE_GUEST_DEBUG=y > > TARGET_XML_FILES= gdb-xml/aarch64-core.xml gdb-xml/aarch64-fpu.xml gdb-xml/arm-core.xml gdb-xml/arm-vfp.xml gdb-xml/arm-vfp3.xml gdb-xml/arm-vfp-sysregs.xml gdb-xml/arm-neon.xml gdb-xml/arm-m-profile.xml gdb-xml/arm-m-profile-mve.xml gdb-xml/aarch64-pauth.xml > > # needed by boot.c > > TARGET_NEED_FDT=y > > +CONFIG_ARM_EINJ=y > > diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c > > index 5b8bc6eeb437..6075ef5893ce 100644 > > --- a/hw/acpi/ghes.c > > +++ b/hw/acpi/ghes.c > > @@ -27,6 +27,7 @@ > > #include "hw/acpi/generic_event_device.h" > > #include "hw/nvram/fw_cfg.h" > > #include "qemu/uuid.h" > > +#include "qapi/qapi-types-arm-error-inject.h" > > > > #define ACPI_GHES_ERRORS_FW_CFG_FILE "etc/hardware_errors" > > #define ACPI_GHES_DATA_ADDR_FW_CFG_FILE "etc/hardware_errors_addr" > > @@ -53,6 +54,12 @@ > > /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */ > > #define ACPI_GHES_MEM_CPER_LENGTH 80 > > > > +/* > > + * ARM Processor section CPER size, UEFI 2.10: N.2.4.4 > > + * ARM Processor Error Section > > + */ > > +#define ACPI_GHES_ARM_CPER_LENGTH (72 + 600) > > + > > /* Masks for block_status flags */ > > #define ACPI_GEBS_UNCORRECTABLE 1 > > > > @@ -231,6 +238,142 @@ static int acpi_ghes_record_mem_error(uint64_t error_block_address, > > return 0; > > } > > > > +/* UEFI 2.9: N.2.4.4 ARM Processor Error Section */ > > +static void acpi_ghes_build_append_arm_cper(uint8_t error_types, GArray *table) > > +{ > > + /* > > + * ARM Processor Error Record > > + */ > > + > > + /* Validation Bits */ > > + build_append_int_noprefix(table, > > + (1ULL << 3) | /* Vendor specific info Valid */ > > + (1ULL << 2) | /* Running status Valid */ > > + (1ULL << 1) | /* Error affinity level Valid */ > > + (1ULL << 0), /* MPIDR Valid */ > > + 4); > > + /* Error Info Num */ > > + build_append_int_noprefix(table, 1, 2); > > + /* Context Info Num */ > > + build_append_int_noprefix(table, 1, 2); > > + /* Section length */ > > + build_append_int_noprefix(table, ACPI_GHES_ARM_CPER_LENGTH, 4); > > + /* Error affinity level */ > > + build_append_int_noprefix(table, 2, 1); > > + /* Reserved */ > > + build_append_int_noprefix(table, 0, 3); > > + /* MPIDR_EL1 */ > > + build_append_int_noprefix(table, 0xAB12, 8); > > These need to be real - I see you fix that in later > patches, but I'd be tempted to pull it back here. Or maybe just > add a comment to say you will rewrite this later. > > I know you aren't keen to smash patches with different authorship > together, but here I think you should just have this > correct from the start (so combine this and 5-7) > perhaps with some links back to the version where they are split? I folded this with patch 7. I kept patch 5 as a separate one, as it is a different logical change. After folding, this field is filled from the emulation value for it (by default, as it can be overridden via QMP). > > + /* MIDR_EL1 */ > > + build_append_int_noprefix(table, 0xCD24, 8); > > + /* Running state */ > > + build_append_int_noprefix(table, 0x1, 4); > > + /* PSCI state */ > > + build_append_int_noprefix(table, 0x1234, 4); > > + > > + /* ARM Propcessor error information */ > > + /* Version */ > > + build_append_int_noprefix(table, 0, 1); > > + /* Length */ > > + build_append_int_noprefix(table, 32, 1); > > + /* Validation Bits */ > > + build_append_int_noprefix(table, > > + (1ULL << 4) | /* Physical fault address Valid */ > > Some tabs hiding in here that need to be spaces. Solved when folding with patch 7. > > + (1ULL << 3) | /* Virtual fault address Valid */ > > + (1ULL << 2) | /* Error information Valid */ > > + (1ULL << 1) | /* Flags Valid */ > > + (1ULL << 0), /* Multiple error count Valid */ > > + 2); > > + /* Type */ > > + if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_CACHE_ERROR) || > > + error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_TLB_ERROR) || > > + error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_BUS_ERROR) || > > + error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_MICRO_ARCH_ERROR)) { > > + build_append_int_noprefix(table, error_types, 1); > > + } else { > > + return; > > + } > > + /* Multiple error count */ > > + build_append_int_noprefix(table, 2, 2); > > + /* Flags */ > > + build_append_int_noprefix(table, 0xD, 1); > > + /* Error information */ > > + if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_CACHE_ERROR)) { > > + build_append_int_noprefix(table, 0x0091000F, 8); > > + } else if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_TLB_ERROR)) { > > + build_append_int_noprefix(table, 0x0054007F, 8); > > + } else if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_BUS_ERROR)) { > > + build_append_int_noprefix(table, 0x80D6460FFF, 8); > > + } else if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_MICRO_ARCH_ERROR)) { > > + build_append_int_noprefix(table, 0x78DA03FF, 8); > > + } else { > > + return; > > + } > > + /* Virtual fault address */ > > + build_append_int_noprefix(table, 0x67320230, 8); > > + /* Physical fault address */ > > + build_append_int_noprefix(table, 0x5CDFD492, 8); > > + > > + /* ARM Propcessor error context information */ > > + /* Version */ > > + build_append_int_noprefix(table, 0, 2); > > + /* Validation Bits */ > > + /* AArch64 EL1 context registers Valid */ > > + build_append_int_noprefix(table, 5, 2); > > + /* Register array size */ > > + build_append_int_noprefix(table, 592, 4); > > + /* Register array */ > > + build_append_int_noprefix(table, 0x12ABDE67, 8); > > +} > > + > > +static int acpi_ghes_record_arm_error(uint8_t error_types, > > + uint64_t error_block_address) > > +{ > > + GArray *block; > > + > > + /* ARM processor Error Section Type */ > > + const uint8_t uefi_cper_arm_sec[] = > > + UUID_LE(0xE19E3D16, 0xBC11, 0x11E4, 0x9C, 0xAA, 0xC2, 0x05, \ > > + 0x1D, 0x5D, 0x46, 0xB0); > > + > > + /* > > + * Invalid fru id: ACPI 4.0: 17.3.2.6.1 Generic Error Data, > > + * Table 17-13 Generic Error Data Entry > > + */ > > + QemuUUID fru_id = {}; > > + uint32_t data_length; > > + > > + block = g_array_new(false, true /* clear */, 1); > > + > > + /* This is the length if adding a new generic error data entry*/ > > space before * Fixed. > > > + data_length = ACPI_GHES_DATA_LENGTH + ACPI_GHES_ARM_CPER_LENGTH; > > + /* > > + * It should not run out of the preallocated memory if adding a new generic > > + * error data entry > > + */ > > + assert((data_length + ACPI_GHES_GESB_SIZE) <= > > + ACPI_GHES_MAX_RAW_DATA_LENGTH); > > + > > + /* Build the new generic error status block header */ > > + acpi_ghes_generic_error_status(block, ACPI_GEBS_UNCORRECTABLE, > > + 0, 0, data_length, ACPI_CPER_SEV_RECOVERABLE); > > + > > + /* Build this new generic error data entry header */ > > + acpi_ghes_generic_error_data(block, uefi_cper_arm_sec, > > + ACPI_CPER_SEV_RECOVERABLE, 0, 0, > > + ACPI_GHES_ARM_CPER_LENGTH, fru_id, 0); > > + > > + /* Build the ARM processor error section CPER */ > > + acpi_ghes_build_append_arm_cper(error_types, block); > > + > > + /* Write the generic error data entry into guest memory */ > > + cpu_physical_memory_write(error_block_address, block->data, block->len); > > + > > + g_array_free(block, true); > > + > > + return 0; > > +} > > > > +bool ghes_record_arm_errors(uint8_t error_types, uint32_t notify) > > +{ > > + int read_ack_register = 0; > > + uint64_t read_ack_register_addr = 0; > > + uint64_t error_block_addr = 0; > > + > > + if (!ghes_get_addr(notify, &error_block_addr, &read_ack_register_addr)) { > > + return false; > > + } > > + > > + cpu_physical_memory_read(read_ack_register_addr, > > + &read_ack_register, sizeof(uint64_t)); > > longer but I'd prefer sizeof(read_ack_register) > Maybe we can shorten to read_ack and read_ack_addr? > > > + /* zero means OSPM does not acknowledge the error */ > > + if (!read_ack_register) { > > + error_report("Last time OSPM does not acknowledge the error," > > + " record CPER failed this time, set the ack value to" > > + " avoid blocking next time CPER record! exit"); > > + read_ack_register = 1; > > + cpu_physical_memory_write(read_ack_register_addr, > > + &read_ack_register, sizeof(uint64_t)); > sizeof(read_ack_register) > > > + return false; > > + } > > + > > + read_ack_register = cpu_to_le64(0); > > + cpu_physical_memory_write(read_ack_register_addr, > > + &read_ack_register, sizeof(uint64_t)); > > sizeof(read_ack_register) > > > + return acpi_ghes_record_arm_error(error_types, error_block_addr); > > +} > > + Changed as suggested. > > > diff --git a/qapi/arm-error-inject.json b/qapi/arm-error-inject.json > > new file mode 100644 > > index 000000000000..430e6cea6b60 > > --- /dev/null > > +++ b/qapi/arm-error-inject.json > > > +## > > +# @arm-inject-error: > > +# > > +# Inject ARM Processor error. > > +# > > +# @errortypes: ARM processor error types to inject > > +# > > +# Features: > > +# > > +# @unstable: This command is experimental. > > +# > > +# Since: 9.1 > Update to 9.2 on next version. Ok. Thanks, Mauro
Em Thu, 25 Jul 2024 11:48:12 +0200 Markus Armbruster <armbru@redhat.com> escreveu: > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes: > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > 1. Some GHES functions require handling addresses. Add a helper function > > to support it. > > > > 2. Add support for ACPI CPER (firmware-first) ARM processor error injection. > > > > Compliance with N.2.4.4 ARM Processor Error Section in UEFI 2.6 and > > upper specs, using error type bit encoding as detailed at UEFI 2.9A > > errata. > > > > Error injection examples: > > > > { "execute": "qmp_capabilities" } > > > > { "execute": "arm-inject-error", > > "arguments": { > > "errortypes": ['cache-error'] > > } > > } > > > > { "execute": "arm-inject-error", > > "arguments": { > > "errortypes": ['tlb-error'] > > } > > } > > > > { "execute": "arm-inject-error", > > "arguments": { > > "errortypes": ['bus-error'] > > } > > } > > > > { "execute": "arm-inject-error", > > "arguments": { > > "errortypes": ['cache-error', 'tlb-error'] > > } > > } > > > > { "execute": "arm-inject-error", > > "arguments": { > > "errortypes": ['cache-error', 'tlb-error', 'bus-error', 'micro-arch-error'] > > } > > } > > ... > > > > Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > > Co-authored-by: Shiju Jose <shiju.jose@huawei.com> > > For Add a logic to handle block addresses, > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > > For FW first ARM processor error injection, > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > > Signed-off-by: Shiju Jose <shiju.jose@huawei.com> > > --- > > configs/targets/aarch64-softmmu.mak | 1 + > > hw/acpi/ghes.c | 258 ++++++++++++++++++++++++++-- > > hw/arm/Kconfig | 4 + > > hw/arm/arm_error_inject.c | 35 ++++ > > hw/arm/arm_error_inject_stubs.c | 18 ++ > > hw/arm/meson.build | 3 + > > include/hw/acpi/ghes.h | 2 + > > qapi/arm-error-inject.json | 49 ++++++ > > qapi/meson.build | 1 + > > qapi/qapi-schema.json | 1 + > > 10 files changed, 361 insertions(+), 11 deletions(-) > > create mode 100644 hw/arm/arm_error_inject.c > > create mode 100644 hw/arm/arm_error_inject_stubs.c > > create mode 100644 qapi/arm-error-inject.json > > Since the new file not covered in MAINTAINERS, get_maintainer.pl will > blame it on the QAPI maintainers alone. No good. Added myself there: diff --git a/MAINTAINERS b/MAINTAINERS index 98eddf7ae155..713a104ef901 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2075,6 +2075,13 @@ F: hw/acpi/ghes.c F: include/hw/acpi/ghes.h F: docs/specs/acpi_hest_ghes.rst +ACPI/HEST/GHES/ARM processor CPER +R: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> +S: Maintained +F: hw/arm/arm_error_inject.c +F: hw/arm/arm_error_inject_stubs.c +F: qapi/arm-error-inject.json + ppc4xx L: qemu-ppc@nongnu.org S: Orphan > > [...] > > > diff --git a/qapi/arm-error-inject.json b/qapi/arm-error-inject.json > > new file mode 100644 > > index 000000000000..430e6cea6b60 > > --- /dev/null > > +++ b/qapi/arm-error-inject.json > > @@ -0,0 +1,49 @@ > > +# -*- Mode: Python -*- > > +# vim: filetype=python > > + > > +## > > +# = ARM Processor Errors > > +## > > + > > +## > > +# @ArmProcessorErrorType: > > +# > > +# Type of ARM processor error to inject > > +# > > +# @unknown-error: Unknown error > > Removed in PATCH 7, and unused until then. Why add it in the first > place? I folded this with patch 7, so this was gone now. > > > +# > > +# @cache-error: Cache error > > +# > > +# @tlb-error: TLB error > > +# > > +# @bus-error: Bus error. > > +# > > +# @micro-arch-error: Micro architectural error. > > +# > > +# Since: 9.1 > > +## > > +{ 'enum': 'ArmProcessorErrorType', > > + 'data': ['unknown-error', > > + 'cache-error', > > Tab in this line. Please convert to spaces. Ok. > > > + 'tlb-error', > > + 'bus-error', > > + 'micro-arch-error'] > > +} > > + > > +## > > +# @arm-inject-error: > > +# > > +# Inject ARM Processor error. > > +# > > +# @errortypes: ARM processor error types to inject > > +# > > +# Features: > > +# > > +# @unstable: This command is experimental. > > +# > > +# Since: 9.1 > > +## > > +{ 'command': 'arm-inject-error', > > + 'data': { 'errortypes': ['ArmProcessorErrorType'] }, > > Please separate words with dashes: 'error-types'. Done. Folding with patch 7 broke it on two separate fields: error and type. > > > + 'features': [ 'unstable' ] > > +} > > Is this used only with TARGET_ARM? Yes, as this CPER record is defined only for arm. There are three other processor error info: - for x86; - for ia32; - for "generic cpu". They have different structures, with different fields. > Why is being able to inject multiple error types at once useful? The CPER ARM Processor record is defined at UEFI spec as having from 1 to 255 errors, that can be using the same type or not. The idea behind UEFI spec is that a single root error may be reflected on multiple errors. It may also help to reduce BIOS interrupts to OS, by merging errors altogether, as memory errors usually happen in bursts. Due to that, a single Processor Error Information inside a CPER record for ARM processor can, according with UEFI spec, contain more than one of the following bits set: +-----|---------------------------+ | Bit | Meaning | +=====+===========================+ | 1 | Cache Error | | 2 | TLB Error | | 3 | Bus Error | | 4 | Micro-architectural Error | +-----|---------------------------+ So, the spec allows, for instance, to have a single Processor Error Information (PEI) with micro-arch and tlb-error flags raised at the same time. We need the capability of testing multiple error types in order to check if OS implementation is decoding it the right way. In particular, Linux was not doing it right, as the CPER ARM Processor record handler was written at the time UEFI 2.6 spec was written, while the actual encoding for the error type was only defined at UEFI 2.9A errata and newer. > I'd expect at least some of these errors to come with additional > information. For instance, I imagine a bus error is associated with > some address. It actually depends on the ARM and PEI valid fields: the address may or may not be present, depending if the phy/logical address valid field bit is set or not. > > If we encode the the error to inject as an enum value, adding more will > be hard. > > If we wrap the enum in a struct > > { 'struct': 'ArmProcessorError', > 'data': { 'type': 'ArmProcessorErrorType' } } > > we can later extend it like > > { 'union': 'ArmProcessorError', > 'base: { 'type': 'ArmProcessorErrorType' } > 'data': { > 'bus-error': 'ArmProcessorBusErrorData' } } > > { 'struct': 'ArmProcessorBusErrorData', > 'data': ... } I don't see this working as one might expect. See, the ARM error information data can be repeated from 1 to 255 times. It is given by this struct (see patch 7): { 'struct': 'ArmProcessorErrorInformation', 'data': { '*validation': ['ArmPeiValidationBits'], 'type': ['ArmProcessorErrorType'], '*multiple-error': 'uint16', '*flags': ['ArmProcessorFlags'], '*error-info': 'uint64', '*virt-addr': 'uint64', '*phy-addr': 'uint64'} } According with the UEFI spec, the type is always be present. The other fields are marked as valid or not via the field "validation". So, there's one bit indicating what is valid between the fields at the PEI structure, e. g.: - multiple-error: multiple occurrences of the error; - flags; - error-info: error information; - virt-addr: virtual address; - phy-addr: physical address. There are also other fields that are global for the entire record, also marked as valid or not via another bitmask. The contents of almost all those fields are independent of the error type. The only field which content is affected by the error type is "error-info", and the definition of such field is not fully specified. So, currently, UEFI spec only defines it when: 1. the error type has just one bit set; 2. the error type is either cache, TLB or bus error[1]. If type is micro-arch-specific error, the spec doesn't tell how this field if filled. To make the API simple (yet powerful), I opted to not enforce any encoding for error-info: let userspace fill it as required and use some default that would make sense, if this is not passed via QMP. [1] See https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#arm-processor-error-information > > diff --git a/qapi/meson.build b/qapi/meson.build > > index e7bc54e5d047..5927932c4be3 100644 > > --- a/qapi/meson.build > > +++ b/qapi/meson.build > > @@ -22,6 +22,7 @@ if have_system or have_tools or have_ga > > endif > > > > qapi_all_modules = [ > > + 'arm-error-inject', > > 'authz', > > 'block', > > 'block-core', > > diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json > > index b1581988e4eb..479a22de7e43 100644 > > --- a/qapi/qapi-schema.json > > +++ b/qapi/qapi-schema.json > > @@ -81,3 +81,4 @@ > > { 'include': 'vfio.json' } > > { 'include': 'cryptodev.json' } > > { 'include': 'cxl.json' } > > +{ 'include': 'arm-error-inject.json' } > Thanks, Mauro
Em Fri, 26 Jul 2024 13:46:46 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> escreveu: > A few quick replies from me. > I'm sure Mauro will add more info. > > > > + 'tlb-error', > > > + 'bus-error', > > > + 'micro-arch-error'] > > > +} > > > + > > > +## > > > +# @arm-inject-error: > > > +# > > > +# Inject ARM Processor error. > > > +# > > > +# @errortypes: ARM processor error types to inject > > > +# > > > +# Features: > > > +# > > > +# @unstable: This command is experimental. > > > +# > > > +# Since: 9.1 > > > +## > > > +{ 'command': 'arm-inject-error', > > > + 'data': { 'errortypes': ['ArmProcessorErrorType'] }, > > > > Please separate words with dashes: 'error-types'. > > > > > + 'features': [ 'unstable' ] > > > +} > > > > Is this used only with TARGET_ARM? > > > > Why is being able to inject multiple error types at once useful? > > It pokes a weird corner of the specification that I think previously > tripped up Linux. > > > > > I'd expect at least some of these errors to come with additional > > information. For instance, I imagine a bus error is associated with > > some address. > > Absolutely agree that in sane case you wouldn't have multiple errors > but we want to hit the insane ones :( Yes. > There is only prevision for one set of data in the record despite > it providing a bitmap for the type of error. Well, there isn't anything at the UEFI forbidding to use multiple bits. On a "normal" field with a bitmask, more than one bit set is supported. So, as spec doesn't deny it, it should be valid to have more than one bits filled. Now, when multiple errors bits from this table are set: +-----|---------------------------+ | Bit | Meaning | +=====+===========================+ | 1 | Cache Error | | 2 | TLB Error | | 3 | Bus Error | | 4 | Micro-architectural Error | +-----|---------------------------+ - if bit 4 is set, as specified at the spec, the error-info field is defined by the ARM vendor, according with: "N.2.4.4.1.1. ARM Vendor Specific Micro-Architecture ErrorStructure This is a vendor specific structure. Please refer to your hardware vendor documentation for the format of this structure." So, provided that the vendor-specific documentation explicitly allows setting bit 4 with other bits, I don't see an UEFI compliance problem. - if bit 4 is not set, but multiple bits 1 to 3 are set, the content of error-info is currently undefined, as tables N.18 to N.20 won't apply. Anyway, from spec PoV, IMO UEFI API requires an errata to clearly enforce that just one bit should be set or to define the behavior when multiple ones are set. Thanks, Mauro
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes: > Em Thu, 25 Jul 2024 11:48:12 +0200 > Markus Armbruster <armbru@redhat.com> escreveu: > >> Mauro Carvalho Chehab <mchehab+huawei@kernel.org> writes: >> >> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> >> > >> > 1. Some GHES functions require handling addresses. Add a helper function >> > to support it. >> > >> > 2. Add support for ACPI CPER (firmware-first) ARM processor error injection. >> > >> > Compliance with N.2.4.4 ARM Processor Error Section in UEFI 2.6 and >> > upper specs, using error type bit encoding as detailed at UEFI 2.9A >> > errata. >> > >> > Error injection examples: >> > >> > { "execute": "qmp_capabilities" } >> > >> > { "execute": "arm-inject-error", >> > "arguments": { >> > "errortypes": ['cache-error'] >> > } >> > } >> > >> > { "execute": "arm-inject-error", >> > "arguments": { >> > "errortypes": ['tlb-error'] >> > } >> > } >> > >> > { "execute": "arm-inject-error", >> > "arguments": { >> > "errortypes": ['bus-error'] >> > } >> > } >> > >> > { "execute": "arm-inject-error", >> > "arguments": { >> > "errortypes": ['cache-error', 'tlb-error'] >> > } >> > } >> > >> > { "execute": "arm-inject-error", >> > "arguments": { >> > "errortypes": ['cache-error', 'tlb-error', 'bus-error', 'micro-arch-error'] >> > } >> > } >> > ... >> > >> > Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> >> > Co-authored-by: Shiju Jose <shiju.jose@huawei.com> >> > For Add a logic to handle block addresses, >> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> >> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> >> > For FW first ARM processor error injection, >> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> >> > Signed-off-by: Shiju Jose <shiju.jose@huawei.com> >> > --- >> > configs/targets/aarch64-softmmu.mak | 1 + >> > hw/acpi/ghes.c | 258 ++++++++++++++++++++++++++-- >> > hw/arm/Kconfig | 4 + >> > hw/arm/arm_error_inject.c | 35 ++++ >> > hw/arm/arm_error_inject_stubs.c | 18 ++ >> > hw/arm/meson.build | 3 + >> > include/hw/acpi/ghes.h | 2 + >> > qapi/arm-error-inject.json | 49 ++++++ >> > qapi/meson.build | 1 + >> > qapi/qapi-schema.json | 1 + >> > 10 files changed, 361 insertions(+), 11 deletions(-) >> > create mode 100644 hw/arm/arm_error_inject.c >> > create mode 100644 hw/arm/arm_error_inject_stubs.c >> > create mode 100644 qapi/arm-error-inject.json >> >> Since the new file not covered in MAINTAINERS, get_maintainer.pl will >> blame it on the QAPI maintainers alone. No good. > > Added myself there: > > diff --git a/MAINTAINERS b/MAINTAINERS > index 98eddf7ae155..713a104ef901 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -2075,6 +2075,13 @@ F: hw/acpi/ghes.c > F: include/hw/acpi/ghes.h > F: docs/specs/acpi_hest_ghes.rst > > +ACPI/HEST/GHES/ARM processor CPER > +R: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > +S: Maintained > +F: hw/arm/arm_error_inject.c > +F: hw/arm/arm_error_inject_stubs.c > +F: qapi/arm-error-inject.json > + > ppc4xx > L: qemu-ppc@nongnu.org > S: Orphan > >> >> [...] >> >> > diff --git a/qapi/arm-error-inject.json b/qapi/arm-error-inject.json >> > new file mode 100644 >> > index 000000000000..430e6cea6b60 >> > --- /dev/null >> > +++ b/qapi/arm-error-inject.json >> > @@ -0,0 +1,49 @@ >> > +# -*- Mode: Python -*- >> > +# vim: filetype=python >> > + >> > +## >> > +# = ARM Processor Errors >> > +## >> > + >> > +## >> > +# @ArmProcessorErrorType: >> > +# >> > +# Type of ARM processor error to inject >> > +# >> > +# @unknown-error: Unknown error >> >> Removed in PATCH 7, and unused until then. Why add it in the first >> place? > > I folded this with patch 7, so this was gone now. > >> >> > +# >> > +# @cache-error: Cache error >> > +# >> > +# @tlb-error: TLB error >> > +# >> > +# @bus-error: Bus error. >> > +# >> > +# @micro-arch-error: Micro architectural error. >> > +# >> > +# Since: 9.1 >> > +## >> > +{ 'enum': 'ArmProcessorErrorType', >> > + 'data': ['unknown-error', >> > + 'cache-error', >> >> Tab in this line. Please convert to spaces. > > Ok. > >> >> > + 'tlb-error', >> > + 'bus-error', >> > + 'micro-arch-error'] >> > +} >> > + >> > +## >> > +# @arm-inject-error: >> > +# >> > +# Inject ARM Processor error. >> > +# >> > +# @errortypes: ARM processor error types to inject >> > +# >> > +# Features: >> > +# >> > +# @unstable: This command is experimental. >> > +# >> > +# Since: 9.1 >> > +## >> > +{ 'command': 'arm-inject-error', >> > + 'data': { 'errortypes': ['ArmProcessorErrorType'] }, >> >> Please separate words with dashes: 'error-types'. > > Done. > > Folding with patch 7 broke it on two separate fields: error and > type. > >> >> > + 'features': [ 'unstable' ] >> > +} >> >> Is this used only with TARGET_ARM? > > Yes, as this CPER record is defined only for arm. There are three other > processor error info: > - for x86; > - for ia32; > - for "generic cpu". > > They have different structures, with different fields. A generic inject-error command feels nicer, but coding its arguments in the schema could be more trouble than it's worth. I'm not asking you to try. A target-specific command like this one should be conditional. Try this: { 'command': 'arm-inject-error', 'data': { 'errortypes': ['ArmProcessorErrorType'] }, 'features': [ 'unstable' ], 'if': 'TARGET_ARM' } No need to provide a qmp_arm_inject_error() stub then. >> Why is being able to inject multiple error types at once useful? > > The CPER ARM Processor record is defined at UEFI spec as having from 1 to > 255 errors, that can be using the same type or not. The idea behind UEFI > spec is that a single root error may be reflected on multiple errors. > > It may also help to reduce BIOS interrupts to OS, by merging errors > altogether, as memory errors usually happen in bursts. > > Due to that, a single Processor Error Information inside a CPER record > for ARM processor can, according with UEFI spec, contain more than one > of the following bits set: > > +-----|---------------------------+ > | Bit | Meaning | > +=====+===========================+ > | 1 | Cache Error | > | 2 | TLB Error | > | 3 | Bus Error | > | 4 | Micro-architectural Error | > +-----|---------------------------+ > > So, the spec allows, for instance, to have a single Processor Error > Information (PEI) with micro-arch and tlb-error flags raised at the > same time. > > We need the capability of testing multiple error types in order to check > if OS implementation is decoding it the right way. In particular, Linux > was not doing it right, as the CPER ARM Processor record handler was > written at the time UEFI 2.6 spec was written, while the actual encoding > for the error type was only defined at UEFI 2.9A errata and newer. I see. >> I'd expect at least some of these errors to come with additional >> information. For instance, I imagine a bus error is associated with >> some address. > > It actually depends on the ARM and PEI valid fields: the address may or > may not be present, depending if the phy/logical address valid field bit > is set or not. > >> >> If we encode the the error to inject as an enum value, adding more will >> be hard. >> >> If we wrap the enum in a struct >> >> { 'struct': 'ArmProcessorError', >> 'data': { 'type': 'ArmProcessorErrorType' } } >> >> we can later extend it like >> >> { 'union': 'ArmProcessorError', >> 'base: { 'type': 'ArmProcessorErrorType' } >> 'data': { >> 'bus-error': 'ArmProcessorBusErrorData' } } >> >> { 'struct': 'ArmProcessorBusErrorData', >> 'data': ... } > > I don't see this working as one might expect. See, the ARM error > information data can be repeated from 1 to 255 times. It is given > by this struct (see patch 7): > > { 'struct': 'ArmProcessorErrorInformation', > 'data': { '*validation': ['ArmPeiValidationBits'], > 'type': ['ArmProcessorErrorType'], > '*multiple-error': 'uint16', > '*flags': ['ArmProcessorFlags'], > '*error-info': 'uint64', > '*virt-addr': 'uint64', > '*phy-addr': 'uint64'} > } > > According with the UEFI spec, the type is always be present. > The other fields are marked as valid or not via the field > "validation". So, there's one bit indicating what is valid between > the fields at the PEI structure, e. g.: > > - multiple-error: multiple occurrences of the error; > - flags; > - error-info: error information; > - virt-addr: virtual address; > - phy-addr: physical address. > > There are also other fields that are global for the entire record, > also marked as valid or not via another bitmask. > > The contents of almost all those fields are independent of the error > type. The only field which content is affected by the error type is > "error-info", and the definition of such field is not fully specified. > > So, currently, UEFI spec only defines it when: > > 1. the error type has just one bit set; > 2. the error type is either cache, TLB or bus error[1]. > If type is micro-arch-specific error, the spec doesn't tell how this > field if filled. > > To make the API simple (yet powerful), I opted to not enforce any encoding > for error-info: let userspace fill it as required and use some default > that would make sense, if this is not passed via QMP. > > [1] See https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#arm-processor-error-information I asked because designing for extensibility is good practice. It's not a hard requirement here, because feature 'unstable' gives us lincense to change the interface incompatibly. [...]
On Mon, 22 Jul 2024 08:45:56 +0200 Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: that's quite a bit of code that in 99% won't ever be used (assuming error injection testing scenario), not to mention it's a hw depended one and governed by different specs. Essentially we would need to create _whole_ lot of QAPI commands to cover possible errors for no benefit to QEMU. Let take for example very simple _OST status reporting, QEMU of cause can decode values and present it to users in more 'presentable' form. However instead of translating numbers (aka. spec language) into a made up QEMU language, QEMU just passes values up the stack and users can use well defined spec to interpret its meaning. benefits are: QEMU doesn't have to maintain translation code and QAPI ABI is limited to passing raw values. Can we do similar thing here as well? i.e. simplify error injection commands to a command that takes raw value and passes it to guest (QEMU here acts as proxy, if I'm not mistaken)? Preferably make it generic enough to handle not only ARM but other error formats HEST is able to handle. PS: For user convenience, QEMU can carry a script that could help generate this raw value in user friendly way but at the same time it won't put maintenance burden on QEMU itself. > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > 1. Some GHES functions require handling addresses. Add a helper function > to support it. > > 2. Add support for ACPI CPER (firmware-first) ARM processor error injection. > > Compliance with N.2.4.4 ARM Processor Error Section in UEFI 2.6 and > upper specs, using error type bit encoding as detailed at UEFI 2.9A > errata. > > Error injection examples: > > { "execute": "qmp_capabilities" } > > { "execute": "arm-inject-error", > "arguments": { > "errortypes": ['cache-error'] > } > } > > { "execute": "arm-inject-error", > "arguments": { > "errortypes": ['tlb-error'] > } > } > > { "execute": "arm-inject-error", > "arguments": { > "errortypes": ['bus-error'] > } > } > > { "execute": "arm-inject-error", > "arguments": { > "errortypes": ['cache-error', 'tlb-error'] > } > } > > { "execute": "arm-inject-error", > "arguments": { > "errortypes": ['cache-error', 'tlb-error', 'bus-error', 'micro-arch-error'] > } > } > ... > > Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > Co-authored-by: Shiju Jose <shiju.jose@huawei.com> > For Add a logic to handle block addresses, > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > For FW first ARM processor error injection, > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > Signed-off-by: Shiju Jose <shiju.jose@huawei.com> > --- > configs/targets/aarch64-softmmu.mak | 1 + > hw/acpi/ghes.c | 258 ++++++++++++++++++++++++++-- > hw/arm/Kconfig | 4 + > hw/arm/arm_error_inject.c | 35 ++++ > hw/arm/arm_error_inject_stubs.c | 18 ++ > hw/arm/meson.build | 3 + > include/hw/acpi/ghes.h | 2 + > qapi/arm-error-inject.json | 49 ++++++ > qapi/meson.build | 1 + > qapi/qapi-schema.json | 1 + > 10 files changed, 361 insertions(+), 11 deletions(-) > create mode 100644 hw/arm/arm_error_inject.c > create mode 100644 hw/arm/arm_error_inject_stubs.c > create mode 100644 qapi/arm-error-inject.json > > diff --git a/configs/targets/aarch64-softmmu.mak b/configs/targets/aarch64-softmmu.mak > index 84cb32dc2f4f..b4b3cd97934a 100644 > --- a/configs/targets/aarch64-softmmu.mak > +++ b/configs/targets/aarch64-softmmu.mak > @@ -5,3 +5,4 @@ TARGET_KVM_HAVE_GUEST_DEBUG=y > TARGET_XML_FILES= gdb-xml/aarch64-core.xml gdb-xml/aarch64-fpu.xml gdb-xml/arm-core.xml gdb-xml/arm-vfp.xml gdb-xml/arm-vfp3.xml gdb-xml/arm-vfp-sysregs.xml gdb-xml/arm-neon.xml gdb-xml/arm-m-profile.xml gdb-xml/arm-m-profile-mve.xml gdb-xml/aarch64-pauth.xml > # needed by boot.c > TARGET_NEED_FDT=y > +CONFIG_ARM_EINJ=y > diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c > index 5b8bc6eeb437..6075ef5893ce 100644 > --- a/hw/acpi/ghes.c > +++ b/hw/acpi/ghes.c > @@ -27,6 +27,7 @@ > #include "hw/acpi/generic_event_device.h" > #include "hw/nvram/fw_cfg.h" > #include "qemu/uuid.h" > +#include "qapi/qapi-types-arm-error-inject.h" > > #define ACPI_GHES_ERRORS_FW_CFG_FILE "etc/hardware_errors" > #define ACPI_GHES_DATA_ADDR_FW_CFG_FILE "etc/hardware_errors_addr" > @@ -53,6 +54,12 @@ > /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */ > #define ACPI_GHES_MEM_CPER_LENGTH 80 > > +/* > + * ARM Processor section CPER size, UEFI 2.10: N.2.4.4 > + * ARM Processor Error Section > + */ > +#define ACPI_GHES_ARM_CPER_LENGTH (72 + 600) > + > /* Masks for block_status flags */ > #define ACPI_GEBS_UNCORRECTABLE 1 > > @@ -231,6 +238,142 @@ static int acpi_ghes_record_mem_error(uint64_t error_block_address, > return 0; > } > > +/* UEFI 2.9: N.2.4.4 ARM Processor Error Section */ > +static void acpi_ghes_build_append_arm_cper(uint8_t error_types, GArray *table) > +{ > + /* > + * ARM Processor Error Record > + */ > + > + /* Validation Bits */ > + build_append_int_noprefix(table, > + (1ULL << 3) | /* Vendor specific info Valid */ > + (1ULL << 2) | /* Running status Valid */ > + (1ULL << 1) | /* Error affinity level Valid */ > + (1ULL << 0), /* MPIDR Valid */ > + 4); > + /* Error Info Num */ > + build_append_int_noprefix(table, 1, 2); > + /* Context Info Num */ > + build_append_int_noprefix(table, 1, 2); > + /* Section length */ > + build_append_int_noprefix(table, ACPI_GHES_ARM_CPER_LENGTH, 4); > + /* Error affinity level */ > + build_append_int_noprefix(table, 2, 1); > + /* Reserved */ > + build_append_int_noprefix(table, 0, 3); > + /* MPIDR_EL1 */ > + build_append_int_noprefix(table, 0xAB12, 8); > + /* MIDR_EL1 */ > + build_append_int_noprefix(table, 0xCD24, 8); > + /* Running state */ > + build_append_int_noprefix(table, 0x1, 4); > + /* PSCI state */ > + build_append_int_noprefix(table, 0x1234, 4); > + > + /* ARM Propcessor error information */ > + /* Version */ > + build_append_int_noprefix(table, 0, 1); > + /* Length */ > + build_append_int_noprefix(table, 32, 1); > + /* Validation Bits */ > + build_append_int_noprefix(table, > + (1ULL << 4) | /* Physical fault address Valid */ > + (1ULL << 3) | /* Virtual fault address Valid */ > + (1ULL << 2) | /* Error information Valid */ > + (1ULL << 1) | /* Flags Valid */ > + (1ULL << 0), /* Multiple error count Valid */ > + 2); > + /* Type */ > + if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_CACHE_ERROR) || > + error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_TLB_ERROR) || > + error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_BUS_ERROR) || > + error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_MICRO_ARCH_ERROR)) { > + build_append_int_noprefix(table, error_types, 1); > + } else { > + return; > + } > + /* Multiple error count */ > + build_append_int_noprefix(table, 2, 2); > + /* Flags */ > + build_append_int_noprefix(table, 0xD, 1); > + /* Error information */ > + if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_CACHE_ERROR)) { > + build_append_int_noprefix(table, 0x0091000F, 8); > + } else if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_TLB_ERROR)) { > + build_append_int_noprefix(table, 0x0054007F, 8); > + } else if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_BUS_ERROR)) { > + build_append_int_noprefix(table, 0x80D6460FFF, 8); > + } else if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_MICRO_ARCH_ERROR)) { > + build_append_int_noprefix(table, 0x78DA03FF, 8); > + } else { > + return; > + } > + /* Virtual fault address */ > + build_append_int_noprefix(table, 0x67320230, 8); > + /* Physical fault address */ > + build_append_int_noprefix(table, 0x5CDFD492, 8); > + > + /* ARM Propcessor error context information */ > + /* Version */ > + build_append_int_noprefix(table, 0, 2); > + /* Validation Bits */ > + /* AArch64 EL1 context registers Valid */ > + build_append_int_noprefix(table, 5, 2); > + /* Register array size */ > + build_append_int_noprefix(table, 592, 4); > + /* Register array */ > + build_append_int_noprefix(table, 0x12ABDE67, 8); > +} > + > +static int acpi_ghes_record_arm_error(uint8_t error_types, > + uint64_t error_block_address) > +{ > + GArray *block; > + > + /* ARM processor Error Section Type */ > + const uint8_t uefi_cper_arm_sec[] = > + UUID_LE(0xE19E3D16, 0xBC11, 0x11E4, 0x9C, 0xAA, 0xC2, 0x05, \ > + 0x1D, 0x5D, 0x46, 0xB0); > + > + /* > + * Invalid fru id: ACPI 4.0: 17.3.2.6.1 Generic Error Data, > + * Table 17-13 Generic Error Data Entry > + */ > + QemuUUID fru_id = {}; > + uint32_t data_length; > + > + block = g_array_new(false, true /* clear */, 1); > + > + /* This is the length if adding a new generic error data entry*/ > + data_length = ACPI_GHES_DATA_LENGTH + ACPI_GHES_ARM_CPER_LENGTH; > + /* > + * It should not run out of the preallocated memory if adding a new generic > + * error data entry > + */ > + assert((data_length + ACPI_GHES_GESB_SIZE) <= > + ACPI_GHES_MAX_RAW_DATA_LENGTH); > + > + /* Build the new generic error status block header */ > + acpi_ghes_generic_error_status(block, ACPI_GEBS_UNCORRECTABLE, > + 0, 0, data_length, ACPI_CPER_SEV_RECOVERABLE); > + > + /* Build this new generic error data entry header */ > + acpi_ghes_generic_error_data(block, uefi_cper_arm_sec, > + ACPI_CPER_SEV_RECOVERABLE, 0, 0, > + ACPI_GHES_ARM_CPER_LENGTH, fru_id, 0); > + > + /* Build the ARM processor error section CPER */ > + acpi_ghes_build_append_arm_cper(error_types, block); > + > + /* Write the generic error data entry into guest memory */ > + cpu_physical_memory_write(error_block_address, block->data, block->len); > + > + g_array_free(block, true); > + > + return 0; > +} > + > /* > * Build table for the hardware error fw_cfg blob. > * Initialize "etc/hardware_errors" and "etc/hardware_errors_addr" fw_cfg blobs. > @@ -392,23 +535,22 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s, > ags->present = true; > } > > +static uint64_t ghes_get_state_start_address(void) > +{ > + AcpiGedState *acpi_ged_state = > + ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED, NULL)); > + AcpiGhesState *ags = &acpi_ged_state->ghes_state; > + > + return le64_to_cpu(ags->ghes_addr_le); > +} > + > int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address) > { > uint64_t error_block_addr, read_ack_register_addr, read_ack_register = 0; > - uint64_t start_addr; > + uint64_t start_addr = ghes_get_state_start_address(); > bool ret = -1; > - AcpiGedState *acpi_ged_state; > - AcpiGhesState *ags; > - > assert(source_id < ACPI_HEST_SRC_ID_RESERVED); > > - acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED, > - NULL)); > - g_assert(acpi_ged_state); > - ags = &acpi_ged_state->ghes_state; > - > - start_addr = le64_to_cpu(ags->ghes_addr_le); > - > if (physical_address) { > > if (source_id < ACPI_HEST_SRC_ID_RESERVED) { > @@ -448,6 +590,100 @@ int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address) > return ret; > } > > +/* > + * Error register block data layout > + * > + * | +---------------------+ ges.ghes_addr_le > + * | |error_block_address0 | > + * | +---------------------+ > + * | |error_block_address1 | > + * | +---------------------+ --+-- > + * | | ............. | GHES_ADDRESS_SIZE > + * | +---------------------+ --+-- > + * | |error_block_addressN | > + * | +---------------------+ > + * | | read_ack_register0 | > + * | +---------------------+ --+-- > + * | | read_ack_register1 | GHES_ADDRESS_SIZE > + * | +---------------------+ --+-- > + * | | ............. | > + * | +---------------------+ > + * | | read_ack_registerN | > + * | +---------------------+ --+-- > + * | | CPER | | > + * | | .... | GHES_MAX_RAW_DATA_LENGT > + * | | CPER | | > + * | +---------------------+ --+-- > + * | | .......... | > + * | +---------------------+ > + * | | CPER | > + * | | .... | > + * | | CPER | > + * | +---------------------+ > + */ > + > +/* Map from uint32_t notify to entry offset in GHES */ > +static const uint8_t error_source_to_index[] = { 0xff, 0xff, 0xff, 0xff, > + 0xff, 0xff, 0xff, 1, 0}; > + > +static bool ghes_get_addr(uint32_t notify, uint64_t *error_block_addr, > + uint64_t *read_ack_register_addr) > +{ > + uint64_t base; > + > + if (notify >= ACPI_GHES_NOTIFY_RESERVED) { > + return false; > + } > + > + /* Find and check the source id for this new CPER */ > + if (error_source_to_index[notify] == 0xff) { > + return false; > + } > + > + base = ghes_get_state_start_address(); > + > + *read_ack_register_addr = base + > + ACPI_GHES_ERROR_SOURCE_COUNT * sizeof(uint64_t) + > + error_source_to_index[notify] * sizeof(uint64_t); > + > + /* Could also be read back from the error_block_address register */ > + *error_block_addr = base + > + ACPI_GHES_ERROR_SOURCE_COUNT * sizeof(uint64_t) + > + ACPI_GHES_ERROR_SOURCE_COUNT * sizeof(uint64_t) + > + error_source_to_index[notify] * ACPI_GHES_MAX_RAW_DATA_LENGTH; > + > + return true; > +} > + > +bool ghes_record_arm_errors(uint8_t error_types, uint32_t notify) > +{ > + int read_ack_register = 0; > + uint64_t read_ack_register_addr = 0; > + uint64_t error_block_addr = 0; > + > + if (!ghes_get_addr(notify, &error_block_addr, &read_ack_register_addr)) { > + return false; > + } > + > + cpu_physical_memory_read(read_ack_register_addr, > + &read_ack_register, sizeof(uint64_t)); > + /* zero means OSPM does not acknowledge the error */ > + if (!read_ack_register) { > + error_report("Last time OSPM does not acknowledge the error," > + " record CPER failed this time, set the ack value to" > + " avoid blocking next time CPER record! exit"); > + read_ack_register = 1; > + cpu_physical_memory_write(read_ack_register_addr, > + &read_ack_register, sizeof(uint64_t)); > + return false; > + } > + > + read_ack_register = cpu_to_le64(0); > + cpu_physical_memory_write(read_ack_register_addr, > + &read_ack_register, sizeof(uint64_t)); > + return acpi_ghes_record_arm_error(error_types, error_block_addr); > +} > + > bool acpi_ghes_present(void) > { > AcpiGedState *acpi_ged_state; > diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig > index 1ad60da7aa2d..bafac82f9fd3 100644 > --- a/hw/arm/Kconfig > +++ b/hw/arm/Kconfig > @@ -712,3 +712,7 @@ config ARMSSE > select UNIMP > select SSE_COUNTER > select SSE_TIMER > + > +config ARM_EINJ > + bool > + default y if AARCH64 > diff --git a/hw/arm/arm_error_inject.c b/hw/arm/arm_error_inject.c > new file mode 100644 > index 000000000000..1da97d5d4fdc > --- /dev/null > +++ b/hw/arm/arm_error_inject.c > @@ -0,0 +1,35 @@ > +/* > + * ARM Processor error injection > + * > + * Copyright(C) 2024 Huawei LTD. > + * > + * This code is licensed under the GPL version 2 or later. See the > + * COPYING file in the top-level directory. > + * > + */ > + > +#include "qemu/osdep.h" > +#include "qapi/error.h" > +#include "qapi-commands-arm-error-inject.h" > +#include "hw/boards.h" > +#include "hw/acpi/ghes.h" > + > +/* For ARM processor errors */ > +void qmp_arm_inject_error(ArmProcessorErrorTypeList *errortypes, Error **errp) > +{ > + MachineState *machine = MACHINE(qdev_get_machine()); > + MachineClass *mc = MACHINE_GET_CLASS(machine); > + uint8_t error_types = 0; > + > + while (errortypes) { > + error_types |= BIT(errortypes->value); > + errortypes = errortypes->next; > + } > + > + ghes_record_arm_errors(error_types, ACPI_GHES_NOTIFY_GPIO); > + if (mc->set_error) { > + mc->set_error(); > + } > + > + return; > +} > diff --git a/hw/arm/arm_error_inject_stubs.c b/hw/arm/arm_error_inject_stubs.c > new file mode 100644 > index 000000000000..b51f4202fe64 > --- /dev/null > +++ b/hw/arm/arm_error_inject_stubs.c > @@ -0,0 +1,18 @@ > +/* > + * QMP stub for ARM processor error injection. > + * > + * Copyright(C) 2024 Huawei LTD. > + * > + * This code is licensed under the GPL version 2 or later. See the > + * COPYING file in the top-level directory. > + * > + */ > + > +#include "qemu/osdep.h" > +#include "qapi/error.h" > +#include "qapi-commands-arm-error-inject.h" > + > +void qmp_arm_inject_error(ArmProcessorErrorTypeList *errortypes, Error **errp) > +{ > + error_setg(errp, "ARM processor error support is not compiled in"); > +} > diff --git a/hw/arm/meson.build b/hw/arm/meson.build > index 0c07ab522f4c..cb7fe09fc87b 100644 > --- a/hw/arm/meson.build > +++ b/hw/arm/meson.build > @@ -60,6 +60,7 @@ arm_ss.add(when: 'CONFIG_ARM_SMMUV3', if_true: files('smmuv3.c')) > arm_ss.add(when: 'CONFIG_FSL_IMX6UL', if_true: files('fsl-imx6ul.c', 'mcimx6ul-evk.c')) > arm_ss.add(when: 'CONFIG_NRF51_SOC', if_true: files('nrf51_soc.c')) > arm_ss.add(when: 'CONFIG_XEN', if_true: files('xen_arm.c')) > +arm_ss.add(when: 'CONFIG_ARM_EINJ', if_true: files('arm_error_inject.c')) > > system_ss.add(when: 'CONFIG_ARM_SMMUV3', if_true: files('smmu-common.c')) > system_ss.add(when: 'CONFIG_CHEETAH', if_true: files('palm.c')) > @@ -77,5 +78,7 @@ system_ss.add(when: 'CONFIG_TOSA', if_true: files('tosa.c')) > system_ss.add(when: 'CONFIG_VERSATILE', if_true: files('versatilepb.c')) > system_ss.add(when: 'CONFIG_VEXPRESS', if_true: files('vexpress.c')) > system_ss.add(when: 'CONFIG_Z2', if_true: files('z2.c')) > +system_ss.add(when: 'CONFIG_ARM_EINJ', if_false: files('arm_error_inject_stubs.c')) > +system_ss.add(when: 'CONFIG_ALL', if_true: files('arm_error_inject_stubs.c')) > > hw_arch += {'arm': arm_ss} > diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h > index 4f1ab1a73a06..dc531ffce7ae 100644 > --- a/include/hw/acpi/ghes.h > +++ b/include/hw/acpi/ghes.h > @@ -75,6 +75,8 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s, > GArray *hardware_errors); > int acpi_ghes_record_errors(uint8_t notify, uint64_t error_physical_addr); > > +bool ghes_record_arm_errors(uint8_t error_types, uint32_t notify); > + > /** > * acpi_ghes_present: Report whether ACPI GHES table is present > * > diff --git a/qapi/arm-error-inject.json b/qapi/arm-error-inject.json > new file mode 100644 > index 000000000000..430e6cea6b60 > --- /dev/null > +++ b/qapi/arm-error-inject.json > @@ -0,0 +1,49 @@ > +# -*- Mode: Python -*- > +# vim: filetype=python > + > +## > +# = ARM Processor Errors > +## > + > +## > +# @ArmProcessorErrorType: > +# > +# Type of ARM processor error to inject > +# > +# @unknown-error: Unknown error > +# > +# @cache-error: Cache error > +# > +# @tlb-error: TLB error > +# > +# @bus-error: Bus error. > +# > +# @micro-arch-error: Micro architectural error. > +# > +# Since: 9.1 > +## > +{ 'enum': 'ArmProcessorErrorType', > + 'data': ['unknown-error', > + 'cache-error', > + 'tlb-error', > + 'bus-error', > + 'micro-arch-error'] > +} > + > +## > +# @arm-inject-error: > +# > +# Inject ARM Processor error. > +# > +# @errortypes: ARM processor error types to inject > +# > +# Features: > +# > +# @unstable: This command is experimental. > +# > +# Since: 9.1 > +## > +{ 'command': 'arm-inject-error', > + 'data': { 'errortypes': ['ArmProcessorErrorType'] }, > + 'features': [ 'unstable' ] > +} > diff --git a/qapi/meson.build b/qapi/meson.build > index e7bc54e5d047..5927932c4be3 100644 > --- a/qapi/meson.build > +++ b/qapi/meson.build > @@ -22,6 +22,7 @@ if have_system or have_tools or have_ga > endif > > qapi_all_modules = [ > + 'arm-error-inject', > 'authz', > 'block', > 'block-core', > diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json > index b1581988e4eb..479a22de7e43 100644 > --- a/qapi/qapi-schema.json > +++ b/qapi/qapi-schema.json > @@ -81,3 +81,4 @@ > { 'include': 'vfio.json' } > { 'include': 'cryptodev.json' } > { 'include': 'cxl.json' } > +{ 'include': 'arm-error-inject.json' }
Em Tue, 30 Jul 2024 13:17:09 +0200 Igor Mammedov <imammedo@redhat.com> escreveu: > On Mon, 22 Jul 2024 08:45:56 +0200 > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > that's quite a bit of code that in 99% won't ever be used > (assuming error injection testing scenario), > not to mention it's a hw depended one and governed by different specs. > > Essentially we would need to create _whole_ lot of QAPI > commands to cover possible errors for no benefit to QEMU. > > Let take for example very simple _OST status reporting, > QEMU of cause can decode values and present it to users in > more 'presentable' form. However instead of translating > numbers (aka. spec language) into a made up QEMU language, > QEMU just passes values up the stack and users can use > well defined spec to interpret its meaning. > > benefits are: QEMU doesn't have to maintain translation > code and QAPI ABI is limited to passing raw values. > > Can we do similar thing here as well? > i.e. simplify error injection commands to > a command that takes raw value and passes it > to guest (QEMU here acts as proxy, if I'm not > mistaken)? > > Preferably make it generic enough to handle > not only ARM but other error formats HEST is > able to handle. A too generic interface doesn't sound feasible to me, as the EINJ code needs to check QEMU implementation details before doing the error inject. See, processor is probably the simplest error injection source, as most of the fields there aren't related to how the hardware simulation is done. Yet, if you see patch 7 of this series, you'll notice that some fields should actually be filled based on the emulation. On ARM, we have some IDs that depend on the emulation (MIDR, MPIDR, power state). Doing that on userspace may require a QAPI to query them. The memory layout, however, is the most complex one. Even for an ARM processor CPER (which is the simplest scenario), the physical/virtual address need to be checked against the emulation environment. Other error sources (like memory errors, CXL, etc) will require a deep knowledge about how QEMU mapped such devices. So, in practice, if we move this to an EINJ script, we'll need to add a probably more complex QAPI to allow querying the memory layout and other device and CPU specific bindings. Also, we don't know what newer versions of ACPI spec will reserve us. See, even the HEST table contents is dependent of the HEST revision number, as made clear at the ACPI 6.5 notes: https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#acpi-error-source and at: https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#error-source-structure-header-type-12-onward So, if we're willing to add support for a more generic "raw data" QAPI, I would still do it per-type, and for the fields that won't require knowledge of the device-emulation details. Btw, my proposal on patch 7 of this series is to have raw data for: - the error-info field; - registers dump; - micro-architecture specific data. I don't mind trying to have more raw data there as I see (marginal) benefits of allowing to generate CPER invalid records [1], but some of those fields need to be validated and/or filled internally at QEMU - if not forced to an specific value by the caller. [1] a raw data EINJ can be useful for fuzzy logic fault detection to check if badly formed packages won't cause a Kernel panic or be an exploit. Yet, not really a concern for APEI, as if the hardware is faulty, a Kernel panic is not out of the table. Also, if the the BIOS is already compromised and has malicious code on it, the EINJ interface is not the main concern. > PS: > For user convenience, QEMU can carry a script that > could help generate this raw value in user friendly way > but at the same time it won't put maintenance > burden on QEMU itself. The script will still require reviews, and the same code will be there. So, from maintenance burden, there won't be much difference. Btw, I'm actually using myself a script to test it, currently sitting together with rasdaemon - which is the Linux tool to detect and handle hardware errors: https://github.com/mchehab/rasdaemon/blob/master/contrib/qemu_einj.py as it helps a lot when trying to simulate more complex errors. Once QEMU gains support to inject processor errors, I can prepare a separate patch to move it to QEMU. Thanks, Mauro
Em Wed, 31 Jul 2024 09:57:19 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> escreveu: > On Wed, 31 Jul 2024 09:11:33 +0200 > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > > Em Tue, 30 Jul 2024 13:17:09 +0200 > > Igor Mammedov <imammedo@redhat.com> escreveu: > > > > > On Mon, 22 Jul 2024 08:45:56 +0200 > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > > > > > that's quite a bit of code that in 99% won't ever be used > > > (assuming error injection testing scenario), > > > not to mention it's a hw depended one and governed by different specs. > > > > > > Essentially we would need to create _whole_ lot of QAPI > > > commands to cover possible errors for no benefit to QEMU. > > Fair point. A 'few' error types might be helpful to general > users like the original memory error injection this is built > on which is reduce blast radius of a real error (and used in > production VMM cases), but most are about testing the rest > of the stack, not really QEMU. My concern is that we may end needing QAPI for querying stuff that will be used on such script. On a more concrete example, let's suppose we want to produce a HEST record for a CXL source, using the layout given by a QEMU command line like this: ./qemu-system-aarch64 -M virt,nvdimm=on,gic-version=3,cxl=on -m 4g,maxmem=8G,slots=8 \ -cpu max -smp 4 -kernel Image -drive if=none,file=debian.qcow2,format=qcow2,id=hd \ -device pcie-root-port,id=root_port1 -device virtio-blk-pci,drive=hd \ -netdev type=user,id=mynet,hostfwd=tcp::5555-:22 -qmp tcp:localhost:4445,server=on,wait=off \ -device virtio-net-pci,netdev=mynet,id=bob -nographic -no-reboot \ -append 'earlycon root=/dev/vda1 fsck.mode=skip tp_printk maxcpus=4' \ -monitor telnet:127.0.0.1:1234,server,nowait -bios QEMU_EFI.fd \ -object memory-backend-ram,size=4G,id=mem0 -numa node,nodeid=0,cpus=0-3,memdev=mem0 \ -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest.raw,size=256M,align=256M \ -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M,align=256M \ -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M,align=256M \ -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/cxltest4.raw,size=256M,align=256M \ -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=1M,align=1M \ -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=1M,align=1M \ -object memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=1M,align=1M \ -object memory-backend-file,id=cxl-lsa4,share=on,mem-path=/tmp/lsa4.raw,size=1M,align=1M \ -object memory-backend-file,id=cxl-mem5,share=on,mem-path=/tmp/cxltest5.raw,size=256M,align=256M \ -object memory-backend-file,id=cxl-mem6,share=on,mem-path=/tmp/cxltest6.raw,size=256M,align=256M \ -object memory-backend-file,id=cxl-mem7,share=on,mem-path=/tmp/cxltest7.raw,size=256M,align=256M \ -object memory-backend-file,id=cxl-mem8,share=on,mem-path=/tmp/cxltest8.raw,size=256M,align=256M \ -object memory-backend-file,id=cxl-lsa5,share=on,mem-path=/tmp/lsa5.raw,size=1M,align=1M \ -object memory-backend-file,id=cxl-lsa6,share=on,mem-path=/tmp/lsa6.raw,size=1M,align=1M \ -object memory-backend-file,id=cxl-lsa7,share=on,mem-path=/tmp/lsa7.raw,size=1M,align=1M \ -object memory-backend-file,id=cxl-lsa8,share=on,mem-path=/tmp/lsa8.raw,size=1M,align=1M \ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \ -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=2 \ -device cxl-rp,port=1,bus=cxl.1,id=root_port2,chassis=0,slot=3 \ -device virtio-rng-pci,bus=root_port2 \ -device cxl-upstream,port=33,bus=root_port0,id=us0,multifunction=on,addr=0.0 \ -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \ -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \ -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \ -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \ -device cxl-type3,bus=swport0,memdev=cxl-mem1,id=cxl-pmem0,lsa=cxl-lsa1,sn=3 \ -device cxl-type3,bus=swport1,memdev=cxl-mem2,id=cxl-pmem1,lsa=cxl-lsa2,sn=4 \ -device cxl-type3,bus=swport2,memdev=cxl-mem3,id=cxl-pmem2,lsa=cxl-lsa3,sn=5 \ -device cxl-type3,bus=swport3,memdev=cxl-mem4,id=cxl-pmem3,lsa=cxl-lsa4,sn=6 \ -machine cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=1k \ -machine ras=on There is a complex set of CXL devices there. Doing error injection will need to know exactly how such devices were created and the memory got allocated, as, when testing an userspace application running at the guest OS (like rasdaemon), we need an address that makes sense, otherwise the memory poison of rasdaemon won't be disabling the right address set. So, while a HEST QAPI raw interface will have a very trivial API, and the ghes code being also simpler, we'll need a QAPI interface good enough to describe all CXL specific details for the error injection script to produce a proper HEST record, if this ends being mapped using such QAPI. > > So it's very helpful for a smallish group of users. > My end goal is to be able to use QEMU to validate and identify regressions at the Linux Kernel RAS subsystem and at the userspace daemon responsible to receive, log and take actions based on the error information (on other words, rasdaemon). Not only for ARM, but also for x86 (and maybe in the future for other archs, depending on how they end implementing RAS features). At the end of the day, I'd like to have a github action running QEMU to check rasdaemon proposed patches against a docker container with the latest version of Linux and QEMU. > > > Let take for example very simple _OST status reporting, > > > QEMU of cause can decode values and present it to users in > > > more 'presentable' form. However instead of translating > > > numbers (aka. spec language) into a made up QEMU language, > > > QEMU just passes values up the stack and users can use > > > well defined spec to interpret its meaning. > > > > > > benefits are: QEMU doesn't have to maintain translation > > > code and QAPI ABI is limited to passing raw values. > > > > > > Can we do similar thing here as well? > > > i.e. simplify error injection commands to > > > a command that takes raw value and passes it > > > to guest (QEMU here acts as proxy, if I'm not > > > mistaken)? > > > > > > Preferably make it generic enough to handle > > > not only ARM but other error formats HEST is > > > able to handle. > > > > A too generic interface doesn't sound feasible to me, as the > > EINJ code needs to check QEMU implementation details before > > doing the error inject. > > To be clear we are talking here about a script that > generates 'similar' stuff to ACPI EINJ does and injects > via qapi, not guest injection (which is almost always locked > down on production machines / distros because of the footgun > aspect). + ACPI EINJ interface suffers exactly the same > problems with state discoverability we have with a raw interface here. > (I checked with Mauro offline that I'd interpreted this > comment correctly!) Yes, the end goal is to inject GHESv2 errors using a generic event device, e. g.: https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#generic-hardware-error-source-version-2-ghesv2-type-10 > > See, processor is probably the simplest error injection > > source, as most of the fields there aren't related to how > > the hardware simulation is done. > > > > Yet, if you see patch 7 of this series, you'll notice that some > > fields should actually be filled based on the emulation. > > > > On ARM, we have some IDs that depend on the emulation > > (MIDR, MPIDR, power state). Doing that on userspace may require > > a QAPI to query them. > > We could strip back the QAPI part to only the bits that are > not dependent on state. Works for me. > However, the kicker to that is we'd > need to make sure all that state is available to an external > tool (or fully controllable from initial launch command line). > I'm not sure where the gaps are but, I'm fairly sure there > will be some. Doesn't save much code other than documentation > of the QAPI. For CPU error injection, certainly the amount of code for the error injection won't change much. The code will be split into QEMU C code and Python's code (inside a qemu-einj.py script). As raw data won't be validated, some code may actually be removed. On the other hand, some new query QAPI interfaces will be needed, specially when handling memory-related HEST data, as the memory layout with enough details for the script to properly produce errors will be needed. So, we're talking on adding a couple of additional QAPIs. The advantage is that such QAPIs will be independent of the ghes driver, but some may be arch-specific, like the ones reporting fields like ARM mpidr, midr, power_state, for instance. On an initial implementation, we can live without those, but for the ARM processor error injection. I'll try to craft a proposal with a very minimal QAPI for GHESv2 injection, implementing it for ARM processor (and maybe for some other type, to check if the interface will fit). Let's see how it goes. > > The memory layout, however, is the most complex one. Even for > > an ARM processor CPER (which is the simplest scenario), the > > physical/virtual address need to be checked against the emulation > > environment. > > > > Other error sources (like memory errors, CXL, etc) will require > > a deep knowledge about how QEMU mapped such devices. > > For CXL stuff we'll piggy back on native error injection interfaces > that are already there and couldn't be avoided because they > are writing a bunch of register state (that we elide in the FW > first path). > https://lore.kernel.org/qemu-devel/20240205141940.31111-12-Jonathan.Cameron@huawei.com/ > So we won't be adding new QAPI, but the error record generation logic > will be in QEMU. For background, the CXL FW first error injection > has taken a back seat to the ARM errors because of the obvious > other factor that CXL isn't supported on ARM in upstream QEMU. Makes sense to me. > Once I escape a few near term deadlines I'll add the x86 > support for GHESv2 / SCI interrupt signaling as you'd see on a > typical x86 server. Well, if we go to a generic GHESv2 QAPI interface, the arm/virt.c will have everything in place to generate GHESv2. It could be used to simulate a x86 processor event, as the changes will happen at the script side. We'll still need to add the needed bits at x86 virt code, though, if we want the error injection to be tested against a guest doing x86 emulation. > > > > So, in practice, if we move this to an EINJ script, we'll need > > to add a probably more complex QAPI to allow querying the memory > > layout and other device and CPU specific bindings. > > > > Also, we don't know what newer versions of ACPI spec will reserve > > us. See, even the HEST table contents is dependent of the HEST > > revision number, as made clear at the ACPI 6.5 notes: > > > > https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#acpi-error-source > > > > and at: > > > > https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#error-source-structure-header-type-12-onward > > > > So, if we're willing to add support for a more generic "raw data" > > QAPI, I would still do it per-type, and for the fields that won't > > require knowledge of the device-emulation details. > > Could blend the two options and provide no qapi for the bits > that are QEMU state dependent - if fuzzing, can inject > the full record raw as doesn't have to be valid state anyway. Not sure. I guess it will depend if we'll be using a simple raw data buffer for CPER record(s) or if we'll break it per CPER type. The GHES interface for generic CPER records would be simpler, but it may require some QEMU query QAPIs for the script to be able to get what it is needed.. Let me try to code it and see how it goes. > > Btw, my proposal on patch 7 of this series is to have raw data > > for: > > - the error-info field; > > - registers dump; > > - micro-architecture specific data. > > > > I don't mind trying to have more raw data there as I see (marginal) > > benefits of allowing to generate CPER invalid records [1], but some of > > those fields need to be validated and/or filled internally at QEMU - if > > not forced to an specific value by the caller. > > > > [1] a raw data EINJ can be useful for fuzzy logic fault detection to > > check if badly formed packages won't cause a Kernel panic or be > > an exploit. Yet, not really a concern for APEI, as if the hardware > > is faulty, a Kernel panic is not out of the table. Also, if the > > the BIOS is already compromised and has malicious code on it, > > the EINJ interface is not the main concern. > > > > > PS: > > > For user convenience, QEMU can carry a script that > > > could help generate this raw value in user friendly way > > > but at the same time it won't put maintenance > > > burden on QEMU itself. > > > > The script will still require reviews, and the same code will > > be there. So, from maintenance burden, there won't be much > > difference. > > Agreed. I'd also be very keen that the script is tightly coupled to > QEMU as doesn't make sense to carry with kernel or RAS daemon and > I'd want to ultimately get this stuff into all the appropriate > CI flows. Agreed: placing it together with QEMU is indeed the best location. > > > > Btw, I'm actually using myself a script to test it, currently > > sitting together with rasdaemon - which is the Linux tool to detect > > and handle hardware errors: > > > > https://github.com/mchehab/rasdaemon/blob/master/contrib/qemu_einj.py > > > > as it helps a lot when trying to simulate more complex errors. > > > > Once QEMU gains support to inject processor errors, I can prepare a > > separate patch to move it to QEMU. > > > > Thanks, > > Mauro > > So tricky questions. I'm not sure which way is the least painful! Agreed. Thanks, Mauro
On Wed, 31 Jul 2024 09:57:19 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > On Wed, 31 Jul 2024 09:11:33 +0200 > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > > Em Tue, 30 Jul 2024 13:17:09 +0200 > > Igor Mammedov <imammedo@redhat.com> escreveu: > > > > > On Mon, 22 Jul 2024 08:45:56 +0200 > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: [...] > > > Preferably make it generic enough to handle > > > not only ARM but other error formats HEST is > > > able to handle. > > > > A too generic interface doesn't sound feasible to me, as the > > EINJ code needs to check QEMU implementation details before > > doing the error inject. > > To be clear we are talking here about a script that > generates 'similar' stuff to ACPI EINJ does and injects > via qapi, not guest injection (which is almost always locked > down on production machines / distros because of the footgun > aspect). + ACPI EINJ interface suffers exactly the same > problems with state discoverability we have with a raw interface here. > (I checked with Mauro offline that I'd interpreted this > comment correctly!) > > > > > See, processor is probably the simplest error injection > > source, as most of the fields there aren't related to how > > the hardware simulation is done. > > > > Yet, if you see patch 7 of this series, you'll notice that some > > fields should actually be filled based on the emulation. > > > > On ARM, we have some IDs that depend on the emulation > > (MIDR, MPIDR, power state). Doing that on userspace may require > > a QAPI to query them. QEMU has qmp commands to query QOM tree, device properties is likely what you'd be interested with. Adding new QAPI might be not necessary as long as needed data point are exposed via device's properties. And additional properties are relatively cheap, especially if their names prefixed with 'x-' which by convention means /internal use, not stable, not ABI/ Well whole qmp tree structure hasn't been declared as ABI (as far as I know), but it's relatively stable and we try not to mess with it much (especially for mainstream virt machines), as some external users might (ab)use it anyway (no promises on QEMU side though). On contrary QAPI is mostly considered as ABI QEMU provides to its users with burden to maintain it stability. If injection script is internal tool to QEMU, it should be fine for it to use qom introspection to get data and limit QAPI necessary minimum only. To make sure it won't be broken silently by 'innocent' QEMU contributors, have a CI job to make sure that it still works as intended. > We could strip back the QAPI part to only the bits that are > not dependent on state. However, the kicker to that is we'd > need to make sure all that state is available to an external > tool (or fully controllable from initial launch command line). > I'm not sure where the gaps are but, I'm fairly sure there > will be some. Doesn't save much code other than documentation > of the QAPI. > > > > > The memory layout, however, is the most complex one. Even for > > an ARM processor CPER (which is the simplest scenario), the > > physical/virtual address need to be checked against the emulation > > environment. > > > > Other error sources (like memory errors, CXL, etc) will require > > a deep knowledge about how QEMU mapped such devices. > > For CXL stuff we'll piggy back on native error injection interfaces > that are already there and couldn't be avoided because they > are writing a bunch of register state (that we elide in the FW > first path). > https://lore.kernel.org/qemu-devel/20240205141940.31111-12-Jonathan.Cameron@huawei.com/ > So we won't be adding new QAPI, but the error record generation logic > will be in QEMU. For background, the CXL FW first error injection > has taken a back seat to the ARM errors because of the obvious > other factor that CXL isn't supported on ARM in upstream QEMU. > Once I escape a few near term deadlines I'll add the x86 > support for GHESv2 / SCI interrupt signaling as you'd see on a > typical x86 server. > > > > > So, in practice, if we move this to an EINJ script, we'll need > > to add a probably more complex QAPI to allow querying the memory > > layout and other device and CPU specific bindings. > > > > Also, we don't know what newer versions of ACPI spec will reserve > > us. See, even the HEST table contents is dependent of the HEST > > revision number, as made clear at the ACPI 6.5 notes: > > > > https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#acpi-error-source > > > > and at: > > > > https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#error-source-structure-header-type-12-onward > > > > So, if we're willing to add support for a more generic "raw data" > > QAPI, I would still do it per-type, and for the fields that won't > > require knowledge of the device-emulation details. > > Could blend the two options and provide no qapi for the bits > that are QEMU state dependent - if fuzzing, can inject > the full record raw as doesn't have to be valid state anyway. > > > > > Btw, my proposal on patch 7 of this series is to have raw data > > for: > > - the error-info field; > > - registers dump; > > - micro-architecture specific data. > > > > I don't mind trying to have more raw data there as I see (marginal) > > benefits of allowing to generate CPER invalid records [1], but some of > > those fields need to be validated and/or filled internally at QEMU - if > > not forced to an specific value by the caller. > > > > [1] a raw data EINJ can be useful for fuzzy logic fault detection to > > check if badly formed packages won't cause a Kernel panic or be > > an exploit. Yet, not really a concern for APEI, as if the hardware > > is faulty, a Kernel panic is not out of the table. Also, if the > > the BIOS is already compromised and has malicious code on it, > > the EINJ interface is not the main concern. > > > > > PS: > > > For user convenience, QEMU can carry a script that > > > could help generate this raw value in user friendly way > > > but at the same time it won't put maintenance > > > burden on QEMU itself. > > > > The script will still require reviews, and the same code will > > be there. So, from maintenance burden, there won't be much > > difference. it makes a lot of difference if code is integral part qemu binary, (less people have to spend time on reviewing it, avoid increasing attack surface, ... (other made up reasons)). Implementing shim/proxy in QEMU and putting all error composing logic into a separate script (even if it's a part QEMU source), shifts most of the burden to whomever (I'd assume you'd volunteer yourself) would maintain the script. If script breaks, it doesn't affect QEMU itself (nor I believe it should affect release process), script's maintainer(s) can have their own schedule/process on how to deal with it. > Agreed. I'd also be very keen that the script is tightly coupled to > QEMU as doesn't make sense to carry with kernel or RAS daemon and > I'd want to ultimately get this stuff into all the appropriate > CI flows. Agreed, it makes much more sense to carry such script as a part of QEMU. > > > > Btw, I'm actually using myself a script to test it, currently > > sitting together with rasdaemon - which is the Linux tool to detect > > and handle hardware errors: > > > > https://github.com/mchehab/rasdaemon/blob/master/contrib/qemu_einj.py > > > > as it helps a lot when trying to simulate more complex errors. > > > > Once QEMU gains support to inject processor errors, I can prepare a > > separate patch to move it to QEMU. > > > > Thanks, > > Mauro > > So tricky questions. I'm not sure which way is the least painful! > > Jonathan > >
Em Thu, 1 Aug 2024 10:36:23 +0200 Igor Mammedov <imammedo@redhat.com> escreveu: > On Wed, 31 Jul 2024 09:57:19 +0100 > Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > > > On Wed, 31 Jul 2024 09:11:33 +0200 > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > > > > Em Tue, 30 Jul 2024 13:17:09 +0200 > > > Igor Mammedov <imammedo@redhat.com> escreveu: > > > > > > > On Mon, 22 Jul 2024 08:45:56 +0200 > > > > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > [...] > > > > Preferably make it generic enough to handle > > > > not only ARM but other error formats HEST is > > > > able to handle. > > > > > > A too generic interface doesn't sound feasible to me, as the > > > EINJ code needs to check QEMU implementation details before > > > doing the error inject. > > > > To be clear we are talking here about a script that > > generates 'similar' stuff to ACPI EINJ does and injects > > via qapi, not guest injection (which is almost always locked > > down on production machines / distros because of the footgun > > aspect). + ACPI EINJ interface suffers exactly the same > > problems with state discoverability we have with a raw interface here. > > (I checked with Mauro offline that I'd interpreted this > > comment correctly!) > > > > > > > > See, processor is probably the simplest error injection > > > source, as most of the fields there aren't related to how > > > the hardware simulation is done. > > > > > > Yet, if you see patch 7 of this series, you'll notice that some > > > fields should actually be filled based on the emulation. > > > > > > On ARM, we have some IDs that depend on the emulation > > > (MIDR, MPIDR, power state). Doing that on userspace may require > > > a QAPI to query them. > > QEMU has qmp commands to query QOM tree, device properties is > likely what you'd be interested with. > Adding new QAPI might be not necessary as long as needed > data point are exposed via device's properties. > > And additional properties are relatively cheap, especially if their > names prefixed with 'x-' which by convention means > /internal use, not stable, not ABI/ > > Well whole qmp tree structure hasn't been declared as ABI (as far as I know), > but it's relatively stable and we try not to mess with it much > (especially for mainstream virt machines), as some external users > might (ab)use it anyway (no promises on QEMU side though). > > On contrary QAPI is mostly considered as ABI QEMU provides > to its users with burden to maintain it stability. > > If injection script is internal tool to QEMU, it should be fine > for it to use qom introspection to get data and limit QAPI > necessary minimum only. OK, good. Anyway, after sleeping on it, I decided to not focus at the query for now, as my goal is to validate Linux Kernel and rasdaemon. So, I'll just fill them when calling the error inject script. As a reference, for ARM Processor error injection, there are just 4 fields that would benefit for a query to fill the default values: - mpidr-el1 - midr-el1 - power_state - ARM context registers and its type > To make sure it won't be broken silently by 'innocent' QEMU > contributors, have a CI job to make sure that it still works > as intended. > > > We could strip back the QAPI part to only the bits that are > > not dependent on state. However, the kicker to that is we'd > > need to make sure all that state is available to an external > > tool (or fully controllable from initial launch command line). > > I'm not sure where the gaps are but, I'm fairly sure there > > will be some. Doesn't save much code other than documentation > > of the QAPI. > > > > > > > > The memory layout, however, is the most complex one. Even for > > > an ARM processor CPER (which is the simplest scenario), the > > > physical/virtual address need to be checked against the emulation > > > environment. > > > > > > Other error sources (like memory errors, CXL, etc) will require > > > a deep knowledge about how QEMU mapped such devices. > > > > For CXL stuff we'll piggy back on native error injection interfaces > > that are already there and couldn't be avoided because they > > are writing a bunch of register state (that we elide in the FW > > first path). > > https://lore.kernel.org/qemu-devel/20240205141940.31111-12-Jonathan.Cameron@huawei.com/ > > So we won't be adding new QAPI, but the error record generation logic > > will be in QEMU. For background, the CXL FW first error injection > > has taken a back seat to the ARM errors because of the obvious > > other factor that CXL isn't supported on ARM in upstream QEMU. > > Once I escape a few near term deadlines I'll add the x86 > > support for GHESv2 / SCI interrupt signaling as you'd see on a > > typical x86 server. > > > > > > > > So, in practice, if we move this to an EINJ script, we'll need > > > to add a probably more complex QAPI to allow querying the memory > > > layout and other device and CPU specific bindings. > > > > > > Also, we don't know what newer versions of ACPI spec will reserve > > > us. See, even the HEST table contents is dependent of the HEST > > > revision number, as made clear at the ACPI 6.5 notes: > > > > > > https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#acpi-error-source > > > > > > and at: > > > > > > https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#error-source-structure-header-type-12-onward > > > > > > So, if we're willing to add support for a more generic "raw data" > > > QAPI, I would still do it per-type, and for the fields that won't > > > require knowledge of the device-emulation details. > > > > Could blend the two options and provide no qapi for the bits > > that are QEMU state dependent - if fuzzing, can inject > > the full record raw as doesn't have to be valid state anyway. > > > > > > > > Btw, my proposal on patch 7 of this series is to have raw data > > > for: > > > - the error-info field; > > > - registers dump; > > > - micro-architecture specific data. > > > > > > I don't mind trying to have more raw data there as I see (marginal) > > > benefits of allowing to generate CPER invalid records [1], but some of > > > those fields need to be validated and/or filled internally at QEMU - if > > > not forced to an specific value by the caller. > > > > > > [1] a raw data EINJ can be useful for fuzzy logic fault detection to > > > check if badly formed packages won't cause a Kernel panic or be > > > an exploit. Yet, not really a concern for APEI, as if the hardware > > > is faulty, a Kernel panic is not out of the table. Also, if the > > > the BIOS is already compromised and has malicious code on it, > > > the EINJ interface is not the main concern. > > > > > > > PS: > > > > For user convenience, QEMU can carry a script that > > > > could help generate this raw value in user friendly way > > > > but at the same time it won't put maintenance > > > > burden on QEMU itself. > > > > > > The script will still require reviews, and the same code will > > > be there. So, from maintenance burden, there won't be much > > > difference. > > it makes a lot of difference if code is integral part qemu binary, > (less people have to spend time on reviewing it, avoid increasing > attack surface, ... (other made up reasons)). I see. > Implementing shim/proxy in QEMU and putting all error composing logic > into a separate script (even if it's a part QEMU source), shifts > most of the burden to whomever (I'd assume you'd volunteer yourself) > would maintain the script. Yes, I'll maintain it. > If script breaks, it doesn't affect QEMU itself (nor I believe it > should affect release process), script's maintainer(s) can have their > own schedule/process on how to deal with it. That's good. > > Agreed. I'd also be very keen that the script is tightly coupled to > > QEMU as doesn't make sense to carry with kernel or RAS daemon and > > I'd want to ultimately get this stuff into all the appropriate > > CI flows. > > Agreed, it makes much more sense to carry such script as a part of QEMU. Ok, I'll be submitting a v4 using the CPER raw data plus script approach. I'll be placing the script at the final patch. I opted to make the simplest possible QAPI (keeping it marked as unstable), as we might need/want to improve it to support other features. Btw, as error injection is not trivial, and using the script is the best way to do it, I would prefer to keep such QAPI always marked as unstable, as it is preferred that QEMU users to use it (and submit patches improving it) instead of manually crafting CPER records with their own scripts. Thanks, Mauro
Em Mon, 29 Jul 2024 16:32:41 +0200 Markus Armbruster <armbru@redhat.com> escreveu: > Yes, as this CPER record is defined only for arm. There are three other > > processor error info: > > - for x86; > > - for ia32; > > - for "generic cpu". > > > > They have different structures, with different fields. > > A generic inject-error command feels nicer, but coding its arguments in > the schema could be more trouble than it's worth. I'm not asking you to > try. > > A target-specific command like this one should be conditional. Try > this: > > { 'command': 'arm-inject-error', > 'data': { 'errortypes': ['ArmProcessorErrorType'] }, > 'features': [ 'unstable' ], > 'if': 'TARGET_ARM' } > > No need to provide a qmp_arm_inject_error() stub then. I tried it, but it generates lots of poison errors. Basically, QAPI generation includes poison.h, making it to complain about on non-ARM builds. Anyway, the new version I'm about to submit is not dependent on ARM anymore (as it is a generic GHES error injection that can be used by any arch). Still, as I added a Kconfig symbol for it, I still needed a stub. It would be cool not needing it, but on the other hand it doesn't hurt much. > >> If we encode the the error to inject as an enum value, adding more will > >> be hard. > >> > >> If we wrap the enum in a struct > >> > >> { 'struct': 'ArmProcessorError', > >> 'data': { 'type': 'ArmProcessorErrorType' } } > >> > >> we can later extend it like > >> > >> { 'union': 'ArmProcessorError', > >> 'base: { 'type': 'ArmProcessorErrorType' } > >> 'data': { > >> 'bus-error': 'ArmProcessorBusErrorData' } } > >> > >> { 'struct': 'ArmProcessorBusErrorData', > >> 'data': ... } > > > > I don't see this working as one might expect. See, the ARM error > > information data can be repeated from 1 to 255 times. It is given > > by this struct (see patch 7): > > > > { 'struct': 'ArmProcessorErrorInformation', > > 'data': { '*validation': ['ArmPeiValidationBits'], > > 'type': ['ArmProcessorErrorType'], > > '*multiple-error': 'uint16', > > '*flags': ['ArmProcessorFlags'], > > '*error-info': 'uint64', > > '*virt-addr': 'uint64', > > '*phy-addr': 'uint64'} > > } > > > > According with the UEFI spec, the type is always be present. > > The other fields are marked as valid or not via the field > > "validation". So, there's one bit indicating what is valid between > > the fields at the PEI structure, e. g.: > > > > - multiple-error: multiple occurrences of the error; > > - flags; > > - error-info: error information; > > - virt-addr: virtual address; > > - phy-addr: physical address. > > > > There are also other fields that are global for the entire record, > > also marked as valid or not via another bitmask. > > > > The contents of almost all those fields are independent of the error > > type. The only field which content is affected by the error type is > > "error-info", and the definition of such field is not fully specified. > > > > So, currently, UEFI spec only defines it when: > > > > 1. the error type has just one bit set; > > 2. the error type is either cache, TLB or bus error[1]. > > If type is micro-arch-specific error, the spec doesn't tell how this > > field if filled. > > > > To make the API simple (yet powerful), I opted to not enforce any encoding > > for error-info: let userspace fill it as required and use some default > > that would make sense, if this is not passed via QMP. > > > > [1] See https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#arm-processor-error-information > > I asked because designing for extensibility is good practice. > > It's not a hard requirement here, because feature 'unstable' gives us > lincense to change the interface incompatibly. IMO keeping it as unstable makes sense, as this QAPI is specific for error injection, which is hardly a feature widely used. Also, with the script approach, the actual CPER record generation happens on a script. If we provide it together with QEMU, if the QAPI ever changes, the changes inside the script will happen altogether. So, IMO, no need to make it stable. Thanks, Mauro
diff --git a/configs/targets/aarch64-softmmu.mak b/configs/targets/aarch64-softmmu.mak index 84cb32dc2f4f..b4b3cd97934a 100644 --- a/configs/targets/aarch64-softmmu.mak +++ b/configs/targets/aarch64-softmmu.mak @@ -5,3 +5,4 @@ TARGET_KVM_HAVE_GUEST_DEBUG=y TARGET_XML_FILES= gdb-xml/aarch64-core.xml gdb-xml/aarch64-fpu.xml gdb-xml/arm-core.xml gdb-xml/arm-vfp.xml gdb-xml/arm-vfp3.xml gdb-xml/arm-vfp-sysregs.xml gdb-xml/arm-neon.xml gdb-xml/arm-m-profile.xml gdb-xml/arm-m-profile-mve.xml gdb-xml/aarch64-pauth.xml # needed by boot.c TARGET_NEED_FDT=y +CONFIG_ARM_EINJ=y diff --git a/hw/acpi/ghes.c b/hw/acpi/ghes.c index 5b8bc6eeb437..6075ef5893ce 100644 --- a/hw/acpi/ghes.c +++ b/hw/acpi/ghes.c @@ -27,6 +27,7 @@ #include "hw/acpi/generic_event_device.h" #include "hw/nvram/fw_cfg.h" #include "qemu/uuid.h" +#include "qapi/qapi-types-arm-error-inject.h" #define ACPI_GHES_ERRORS_FW_CFG_FILE "etc/hardware_errors" #define ACPI_GHES_DATA_ADDR_FW_CFG_FILE "etc/hardware_errors_addr" @@ -53,6 +54,12 @@ /* The memory section CPER size, UEFI 2.6: N.2.5 Memory Error Section */ #define ACPI_GHES_MEM_CPER_LENGTH 80 +/* + * ARM Processor section CPER size, UEFI 2.10: N.2.4.4 + * ARM Processor Error Section + */ +#define ACPI_GHES_ARM_CPER_LENGTH (72 + 600) + /* Masks for block_status flags */ #define ACPI_GEBS_UNCORRECTABLE 1 @@ -231,6 +238,142 @@ static int acpi_ghes_record_mem_error(uint64_t error_block_address, return 0; } +/* UEFI 2.9: N.2.4.4 ARM Processor Error Section */ +static void acpi_ghes_build_append_arm_cper(uint8_t error_types, GArray *table) +{ + /* + * ARM Processor Error Record + */ + + /* Validation Bits */ + build_append_int_noprefix(table, + (1ULL << 3) | /* Vendor specific info Valid */ + (1ULL << 2) | /* Running status Valid */ + (1ULL << 1) | /* Error affinity level Valid */ + (1ULL << 0), /* MPIDR Valid */ + 4); + /* Error Info Num */ + build_append_int_noprefix(table, 1, 2); + /* Context Info Num */ + build_append_int_noprefix(table, 1, 2); + /* Section length */ + build_append_int_noprefix(table, ACPI_GHES_ARM_CPER_LENGTH, 4); + /* Error affinity level */ + build_append_int_noprefix(table, 2, 1); + /* Reserved */ + build_append_int_noprefix(table, 0, 3); + /* MPIDR_EL1 */ + build_append_int_noprefix(table, 0xAB12, 8); + /* MIDR_EL1 */ + build_append_int_noprefix(table, 0xCD24, 8); + /* Running state */ + build_append_int_noprefix(table, 0x1, 4); + /* PSCI state */ + build_append_int_noprefix(table, 0x1234, 4); + + /* ARM Propcessor error information */ + /* Version */ + build_append_int_noprefix(table, 0, 1); + /* Length */ + build_append_int_noprefix(table, 32, 1); + /* Validation Bits */ + build_append_int_noprefix(table, + (1ULL << 4) | /* Physical fault address Valid */ + (1ULL << 3) | /* Virtual fault address Valid */ + (1ULL << 2) | /* Error information Valid */ + (1ULL << 1) | /* Flags Valid */ + (1ULL << 0), /* Multiple error count Valid */ + 2); + /* Type */ + if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_CACHE_ERROR) || + error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_TLB_ERROR) || + error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_BUS_ERROR) || + error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_MICRO_ARCH_ERROR)) { + build_append_int_noprefix(table, error_types, 1); + } else { + return; + } + /* Multiple error count */ + build_append_int_noprefix(table, 2, 2); + /* Flags */ + build_append_int_noprefix(table, 0xD, 1); + /* Error information */ + if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_CACHE_ERROR)) { + build_append_int_noprefix(table, 0x0091000F, 8); + } else if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_TLB_ERROR)) { + build_append_int_noprefix(table, 0x0054007F, 8); + } else if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_BUS_ERROR)) { + build_append_int_noprefix(table, 0x80D6460FFF, 8); + } else if (error_types & BIT(ARM_PROCESSOR_ERROR_TYPE_MICRO_ARCH_ERROR)) { + build_append_int_noprefix(table, 0x78DA03FF, 8); + } else { + return; + } + /* Virtual fault address */ + build_append_int_noprefix(table, 0x67320230, 8); + /* Physical fault address */ + build_append_int_noprefix(table, 0x5CDFD492, 8); + + /* ARM Propcessor error context information */ + /* Version */ + build_append_int_noprefix(table, 0, 2); + /* Validation Bits */ + /* AArch64 EL1 context registers Valid */ + build_append_int_noprefix(table, 5, 2); + /* Register array size */ + build_append_int_noprefix(table, 592, 4); + /* Register array */ + build_append_int_noprefix(table, 0x12ABDE67, 8); +} + +static int acpi_ghes_record_arm_error(uint8_t error_types, + uint64_t error_block_address) +{ + GArray *block; + + /* ARM processor Error Section Type */ + const uint8_t uefi_cper_arm_sec[] = + UUID_LE(0xE19E3D16, 0xBC11, 0x11E4, 0x9C, 0xAA, 0xC2, 0x05, \ + 0x1D, 0x5D, 0x46, 0xB0); + + /* + * Invalid fru id: ACPI 4.0: 17.3.2.6.1 Generic Error Data, + * Table 17-13 Generic Error Data Entry + */ + QemuUUID fru_id = {}; + uint32_t data_length; + + block = g_array_new(false, true /* clear */, 1); + + /* This is the length if adding a new generic error data entry*/ + data_length = ACPI_GHES_DATA_LENGTH + ACPI_GHES_ARM_CPER_LENGTH; + /* + * It should not run out of the preallocated memory if adding a new generic + * error data entry + */ + assert((data_length + ACPI_GHES_GESB_SIZE) <= + ACPI_GHES_MAX_RAW_DATA_LENGTH); + + /* Build the new generic error status block header */ + acpi_ghes_generic_error_status(block, ACPI_GEBS_UNCORRECTABLE, + 0, 0, data_length, ACPI_CPER_SEV_RECOVERABLE); + + /* Build this new generic error data entry header */ + acpi_ghes_generic_error_data(block, uefi_cper_arm_sec, + ACPI_CPER_SEV_RECOVERABLE, 0, 0, + ACPI_GHES_ARM_CPER_LENGTH, fru_id, 0); + + /* Build the ARM processor error section CPER */ + acpi_ghes_build_append_arm_cper(error_types, block); + + /* Write the generic error data entry into guest memory */ + cpu_physical_memory_write(error_block_address, block->data, block->len); + + g_array_free(block, true); + + return 0; +} + /* * Build table for the hardware error fw_cfg blob. * Initialize "etc/hardware_errors" and "etc/hardware_errors_addr" fw_cfg blobs. @@ -392,23 +535,22 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *ags, FWCfgState *s, ags->present = true; } +static uint64_t ghes_get_state_start_address(void) +{ + AcpiGedState *acpi_ged_state = + ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED, NULL)); + AcpiGhesState *ags = &acpi_ged_state->ghes_state; + + return le64_to_cpu(ags->ghes_addr_le); +} + int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address) { uint64_t error_block_addr, read_ack_register_addr, read_ack_register = 0; - uint64_t start_addr; + uint64_t start_addr = ghes_get_state_start_address(); bool ret = -1; - AcpiGedState *acpi_ged_state; - AcpiGhesState *ags; - assert(source_id < ACPI_HEST_SRC_ID_RESERVED); - acpi_ged_state = ACPI_GED(object_resolve_path_type("", TYPE_ACPI_GED, - NULL)); - g_assert(acpi_ged_state); - ags = &acpi_ged_state->ghes_state; - - start_addr = le64_to_cpu(ags->ghes_addr_le); - if (physical_address) { if (source_id < ACPI_HEST_SRC_ID_RESERVED) { @@ -448,6 +590,100 @@ int acpi_ghes_record_errors(uint8_t source_id, uint64_t physical_address) return ret; } +/* + * Error register block data layout + * + * | +---------------------+ ges.ghes_addr_le + * | |error_block_address0 | + * | +---------------------+ + * | |error_block_address1 | + * | +---------------------+ --+-- + * | | ............. | GHES_ADDRESS_SIZE + * | +---------------------+ --+-- + * | |error_block_addressN | + * | +---------------------+ + * | | read_ack_register0 | + * | +---------------------+ --+-- + * | | read_ack_register1 | GHES_ADDRESS_SIZE + * | +---------------------+ --+-- + * | | ............. | + * | +---------------------+ + * | | read_ack_registerN | + * | +---------------------+ --+-- + * | | CPER | | + * | | .... | GHES_MAX_RAW_DATA_LENGT + * | | CPER | | + * | +---------------------+ --+-- + * | | .......... | + * | +---------------------+ + * | | CPER | + * | | .... | + * | | CPER | + * | +---------------------+ + */ + +/* Map from uint32_t notify to entry offset in GHES */ +static const uint8_t error_source_to_index[] = { 0xff, 0xff, 0xff, 0xff, + 0xff, 0xff, 0xff, 1, 0}; + +static bool ghes_get_addr(uint32_t notify, uint64_t *error_block_addr, + uint64_t *read_ack_register_addr) +{ + uint64_t base; + + if (notify >= ACPI_GHES_NOTIFY_RESERVED) { + return false; + } + + /* Find and check the source id for this new CPER */ + if (error_source_to_index[notify] == 0xff) { + return false; + } + + base = ghes_get_state_start_address(); + + *read_ack_register_addr = base + + ACPI_GHES_ERROR_SOURCE_COUNT * sizeof(uint64_t) + + error_source_to_index[notify] * sizeof(uint64_t); + + /* Could also be read back from the error_block_address register */ + *error_block_addr = base + + ACPI_GHES_ERROR_SOURCE_COUNT * sizeof(uint64_t) + + ACPI_GHES_ERROR_SOURCE_COUNT * sizeof(uint64_t) + + error_source_to_index[notify] * ACPI_GHES_MAX_RAW_DATA_LENGTH; + + return true; +} + +bool ghes_record_arm_errors(uint8_t error_types, uint32_t notify) +{ + int read_ack_register = 0; + uint64_t read_ack_register_addr = 0; + uint64_t error_block_addr = 0; + + if (!ghes_get_addr(notify, &error_block_addr, &read_ack_register_addr)) { + return false; + } + + cpu_physical_memory_read(read_ack_register_addr, + &read_ack_register, sizeof(uint64_t)); + /* zero means OSPM does not acknowledge the error */ + if (!read_ack_register) { + error_report("Last time OSPM does not acknowledge the error," + " record CPER failed this time, set the ack value to" + " avoid blocking next time CPER record! exit"); + read_ack_register = 1; + cpu_physical_memory_write(read_ack_register_addr, + &read_ack_register, sizeof(uint64_t)); + return false; + } + + read_ack_register = cpu_to_le64(0); + cpu_physical_memory_write(read_ack_register_addr, + &read_ack_register, sizeof(uint64_t)); + return acpi_ghes_record_arm_error(error_types, error_block_addr); +} + bool acpi_ghes_present(void) { AcpiGedState *acpi_ged_state; diff --git a/hw/arm/Kconfig b/hw/arm/Kconfig index 1ad60da7aa2d..bafac82f9fd3 100644 --- a/hw/arm/Kconfig +++ b/hw/arm/Kconfig @@ -712,3 +712,7 @@ config ARMSSE select UNIMP select SSE_COUNTER select SSE_TIMER + +config ARM_EINJ + bool + default y if AARCH64 diff --git a/hw/arm/arm_error_inject.c b/hw/arm/arm_error_inject.c new file mode 100644 index 000000000000..1da97d5d4fdc --- /dev/null +++ b/hw/arm/arm_error_inject.c @@ -0,0 +1,35 @@ +/* + * ARM Processor error injection + * + * Copyright(C) 2024 Huawei LTD. + * + * This code is licensed under the GPL version 2 or later. See the + * COPYING file in the top-level directory. + * + */ + +#include "qemu/osdep.h" +#include "qapi/error.h" +#include "qapi-commands-arm-error-inject.h" +#include "hw/boards.h" +#include "hw/acpi/ghes.h" + +/* For ARM processor errors */ +void qmp_arm_inject_error(ArmProcessorErrorTypeList *errortypes, Error **errp) +{ + MachineState *machine = MACHINE(qdev_get_machine()); + MachineClass *mc = MACHINE_GET_CLASS(machine); + uint8_t error_types = 0; + + while (errortypes) { + error_types |= BIT(errortypes->value); + errortypes = errortypes->next; + } + + ghes_record_arm_errors(error_types, ACPI_GHES_NOTIFY_GPIO); + if (mc->set_error) { + mc->set_error(); + } + + return; +} diff --git a/hw/arm/arm_error_inject_stubs.c b/hw/arm/arm_error_inject_stubs.c new file mode 100644 index 000000000000..b51f4202fe64 --- /dev/null +++ b/hw/arm/arm_error_inject_stubs.c @@ -0,0 +1,18 @@ +/* + * QMP stub for ARM processor error injection. + * + * Copyright(C) 2024 Huawei LTD. + * + * This code is licensed under the GPL version 2 or later. See the + * COPYING file in the top-level directory. + * + */ + +#include "qemu/osdep.h" +#include "qapi/error.h" +#include "qapi-commands-arm-error-inject.h" + +void qmp_arm_inject_error(ArmProcessorErrorTypeList *errortypes, Error **errp) +{ + error_setg(errp, "ARM processor error support is not compiled in"); +} diff --git a/hw/arm/meson.build b/hw/arm/meson.build index 0c07ab522f4c..cb7fe09fc87b 100644 --- a/hw/arm/meson.build +++ b/hw/arm/meson.build @@ -60,6 +60,7 @@ arm_ss.add(when: 'CONFIG_ARM_SMMUV3', if_true: files('smmuv3.c')) arm_ss.add(when: 'CONFIG_FSL_IMX6UL', if_true: files('fsl-imx6ul.c', 'mcimx6ul-evk.c')) arm_ss.add(when: 'CONFIG_NRF51_SOC', if_true: files('nrf51_soc.c')) arm_ss.add(when: 'CONFIG_XEN', if_true: files('xen_arm.c')) +arm_ss.add(when: 'CONFIG_ARM_EINJ', if_true: files('arm_error_inject.c')) system_ss.add(when: 'CONFIG_ARM_SMMUV3', if_true: files('smmu-common.c')) system_ss.add(when: 'CONFIG_CHEETAH', if_true: files('palm.c')) @@ -77,5 +78,7 @@ system_ss.add(when: 'CONFIG_TOSA', if_true: files('tosa.c')) system_ss.add(when: 'CONFIG_VERSATILE', if_true: files('versatilepb.c')) system_ss.add(when: 'CONFIG_VEXPRESS', if_true: files('vexpress.c')) system_ss.add(when: 'CONFIG_Z2', if_true: files('z2.c')) +system_ss.add(when: 'CONFIG_ARM_EINJ', if_false: files('arm_error_inject_stubs.c')) +system_ss.add(when: 'CONFIG_ALL', if_true: files('arm_error_inject_stubs.c')) hw_arch += {'arm': arm_ss} diff --git a/include/hw/acpi/ghes.h b/include/hw/acpi/ghes.h index 4f1ab1a73a06..dc531ffce7ae 100644 --- a/include/hw/acpi/ghes.h +++ b/include/hw/acpi/ghes.h @@ -75,6 +75,8 @@ void acpi_ghes_add_fw_cfg(AcpiGhesState *vms, FWCfgState *s, GArray *hardware_errors); int acpi_ghes_record_errors(uint8_t notify, uint64_t error_physical_addr); +bool ghes_record_arm_errors(uint8_t error_types, uint32_t notify); + /** * acpi_ghes_present: Report whether ACPI GHES table is present * diff --git a/qapi/arm-error-inject.json b/qapi/arm-error-inject.json new file mode 100644 index 000000000000..430e6cea6b60 --- /dev/null +++ b/qapi/arm-error-inject.json @@ -0,0 +1,49 @@ +# -*- Mode: Python -*- +# vim: filetype=python + +## +# = ARM Processor Errors +## + +## +# @ArmProcessorErrorType: +# +# Type of ARM processor error to inject +# +# @unknown-error: Unknown error +# +# @cache-error: Cache error +# +# @tlb-error: TLB error +# +# @bus-error: Bus error. +# +# @micro-arch-error: Micro architectural error. +# +# Since: 9.1 +## +{ 'enum': 'ArmProcessorErrorType', + 'data': ['unknown-error', + 'cache-error', + 'tlb-error', + 'bus-error', + 'micro-arch-error'] +} + +## +# @arm-inject-error: +# +# Inject ARM Processor error. +# +# @errortypes: ARM processor error types to inject +# +# Features: +# +# @unstable: This command is experimental. +# +# Since: 9.1 +## +{ 'command': 'arm-inject-error', + 'data': { 'errortypes': ['ArmProcessorErrorType'] }, + 'features': [ 'unstable' ] +} diff --git a/qapi/meson.build b/qapi/meson.build index e7bc54e5d047..5927932c4be3 100644 --- a/qapi/meson.build +++ b/qapi/meson.build @@ -22,6 +22,7 @@ if have_system or have_tools or have_ga endif qapi_all_modules = [ + 'arm-error-inject', 'authz', 'block', 'block-core', diff --git a/qapi/qapi-schema.json b/qapi/qapi-schema.json index b1581988e4eb..479a22de7e43 100644 --- a/qapi/qapi-schema.json +++ b/qapi/qapi-schema.json @@ -81,3 +81,4 @@ { 'include': 'vfio.json' } { 'include': 'cryptodev.json' } { 'include': 'cxl.json' } +{ 'include': 'arm-error-inject.json' }