Message ID | 20220804004411.1343158-1-Jason@zx2c4.com |
---|---|
State | New |
Headers | show |
Series | [v2] hw/i386: place setup_data at fixed place in memory | expand |
On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote: > The boot parameter header refers to setup_data at an absolute address, > and each setup_data refers to the next setup_data at an absolute address > too. Currently QEMU simply puts the setup_datas right after the kernel > image, and since the kernel_image is loaded at prot_addr -- a fixed > address knowable to QEMU apriori -- the setup_data absolute address > winds up being just `prot_addr + a_fixed_offset_into_kernel_image`. > > This mostly works fine, so long as the kernel image really is loaded at > prot_addr. However, OVMF doesn't load the kernel at prot_addr, and > generally EFI doesn't give a good way of predicting where it's going to > load the kernel. So when it loads it at some address != prot_addr, the > absolute addresses in setup_data now point somewhere bogus, causing > crashes when EFI stub tries to follow the next link. > > Fix this by placing setup_data at some fixed place in memory, relative > to real_addr, not as part of the kernel image, and then pointing the > setup_data absolute address to that fixed place in memory. This way, > even if OVMF or other chains relocate the kernel image, the boot > parameter still points to the correct absolute address. > > Fixes: 3cbeb52467 ("hw/i386: add device tree support") > Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> > Cc: Paolo Bonzini <pbonzini@redhat.com> > Cc: Richard Henderson <richard.henderson@linaro.org> > Cc: Peter Maydell <peter.maydell@linaro.org> > Cc: Michael S. Tsirkin <mst@redhat.com> > Cc: Daniel P. Berrangé <berrange@redhat.com> > Cc: Gerd Hoffmann <kraxel@redhat.com> > Cc: Ard Biesheuvel <ardb@kernel.org> > Cc: linux-efi@vger.kernel.org > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Didn't read the patch yet. Adding Laszlo. > --- > hw/i386/x86.c | 38 ++++++++++++++++++++------------------ > 1 file changed, 20 insertions(+), 18 deletions(-) > > diff --git a/hw/i386/x86.c b/hw/i386/x86.c > index 050eedc0c8..8b853abf38 100644 > --- a/hw/i386/x86.c > +++ b/hw/i386/x86.c > @@ -760,36 +760,36 @@ static bool load_elfboot(const char *kernel_filename, > fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ENTRY, pvh_start_addr); > fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, mh_load_addr); > fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, elf_kernel_size); > > return true; > } > > void x86_load_linux(X86MachineState *x86ms, > FWCfgState *fw_cfg, > int acpi_data_size, > bool pvh_enabled, > bool legacy_no_rng_seed) > { > bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled; > uint16_t protocol; > int setup_size, kernel_size, cmdline_size; > - int dtb_size, setup_data_offset; > + int dtb_size, setup_data_item_len, setup_data_total_len = 0; > uint32_t initrd_max; > - uint8_t header[8192], *setup, *kernel; > - hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0; > + uint8_t header[8192], *setup, *kernel, *setup_datas = NULL; > + hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0, setup_data_base; > FILE *f; > char *vmode; > MachineState *machine = MACHINE(x86ms); > struct setup_data *setup_data; > const char *kernel_filename = machine->kernel_filename; > const char *initrd_filename = machine->initrd_filename; > const char *dtb_filename = machine->dtb; > const char *kernel_cmdline = machine->kernel_cmdline; > SevKernelLoaderContext sev_load_ctx = {}; > enum { RNG_SEED_LENGTH = 32 }; > > /* Align to 16 bytes as a paranoia measure */ > cmdline_size = (strlen(kernel_cmdline) + 16) & ~15; > > /* load the kernel header */ > f = fopen(kernel_filename, "rb"); > @@ -886,32 +886,33 @@ void x86_load_linux(X86MachineState *x86ms, > if (protocol < 0x200 || !(header[0x211] & 0x01)) { > /* Low kernel */ > real_addr = 0x90000; > cmdline_addr = 0x9a000 - cmdline_size; > prot_addr = 0x10000; > } else if (protocol < 0x202) { > /* High but ancient kernel */ > real_addr = 0x90000; > cmdline_addr = 0x9a000 - cmdline_size; > prot_addr = 0x100000; > } else { > /* High and recent kernel */ > real_addr = 0x10000; > cmdline_addr = 0x20000; > prot_addr = 0x100000; > } > + setup_data_base = real_addr + 0x8000; > > /* highest address for loading the initrd */ > if (protocol >= 0x20c && > lduw_p(header + 0x236) & XLF_CAN_BE_LOADED_ABOVE_4G) { > /* > * Linux has supported initrd up to 4 GB for a very long time (2007, > * long before XLF_CAN_BE_LOADED_ABOVE_4G which was added in 2013), > * though it only sets initrd_max to 2 GB to "work around bootloader > * bugs". Luckily, QEMU firmware(which does something like bootloader) > * has supported this. > * > * It's believed that if XLF_CAN_BE_LOADED_ABOVE_4G is set, initrd can > * be loaded into any address. > * > * In addition, initrd_max is uint32_t simply because QEMU doesn't > * support the 64-bit boot protocol (specifically the ext_ramdisk_image > @@ -1049,60 +1050,61 @@ void x86_load_linux(X86MachineState *x86ms, > fclose(f); > > /* append dtb to kernel */ > if (dtb_filename) { > if (protocol < 0x209) { > fprintf(stderr, "qemu: Linux kernel too old to load a dtb\n"); > exit(1); > } > > dtb_size = get_image_size(dtb_filename); > if (dtb_size <= 0) { > fprintf(stderr, "qemu: error reading dtb %s: %s\n", > dtb_filename, strerror(errno)); > exit(1); > } > > - setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16); > - kernel_size = setup_data_offset + sizeof(struct setup_data) + dtb_size; > - kernel = g_realloc(kernel, kernel_size); > - > - > - setup_data = (struct setup_data *)(kernel + setup_data_offset); > + setup_data_item_len = sizeof(struct setup_data) + dtb_size; > + setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len); > + setup_data = (struct setup_data *)(setup_datas + setup_data_total_len); > setup_data->next = cpu_to_le64(first_setup_data); > - first_setup_data = prot_addr + setup_data_offset; > + first_setup_data = setup_data_base + setup_data_total_len; > + setup_data_total_len += setup_data_item_len; > setup_data->type = cpu_to_le32(SETUP_DTB); > setup_data->len = cpu_to_le32(dtb_size); > - > load_image_size(dtb_filename, setup_data->data, dtb_size); > } > > if (!legacy_no_rng_seed) { > - setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16); > - kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH; > - kernel = g_realloc(kernel, kernel_size); > - setup_data = (struct setup_data *)(kernel + setup_data_offset); > + setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH; > + setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len); > + setup_data = (struct setup_data *)(setup_datas + setup_data_total_len); > setup_data->next = cpu_to_le64(first_setup_data); > - first_setup_data = prot_addr + setup_data_offset; > + first_setup_data = setup_data_base + setup_data_total_len; > + setup_data_total_len += setup_data_item_len; > setup_data->type = cpu_to_le32(SETUP_RNG_SEED); > setup_data->len = cpu_to_le32(RNG_SEED_LENGTH); > qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH); > } > > - /* Offset 0x250 is a pointer to the first setup_data link. */ > - stq_p(header + 0x250, first_setup_data); > + if (first_setup_data) { > + /* Offset 0x250 is a pointer to the first setup_data link. */ > + stq_p(header + 0x250, first_setup_data); > + rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len, > + setup_data_base, NULL, NULL, NULL, NULL, false); > + } > > /* > * If we're starting an encrypted VM, it will be OVMF based, which uses the > * efi stub for booting and doesn't require any values to be placed in the > * kernel header. We therefore don't update the header so the hash of the > * kernel on the other side of the fw_cfg interface matches the hash of the > * file the user passed in. > */ > if (!sev_enabled()) { > memcpy(setup, header, MIN(sizeof(header), setup_size)); > } > > fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr); > fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size); > fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size); > sev_load_ctx.kernel_data = (char *)kernel; > -- > 2.35.1
On 08/04/22 09:03, Michael S. Tsirkin wrote: > On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote: >> The boot parameter header refers to setup_data at an absolute address, >> and each setup_data refers to the next setup_data at an absolute address >> too. Currently QEMU simply puts the setup_datas right after the kernel >> image, and since the kernel_image is loaded at prot_addr -- a fixed >> address knowable to QEMU apriori -- the setup_data absolute address >> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`. >> >> This mostly works fine, so long as the kernel image really is loaded at >> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and >> generally EFI doesn't give a good way of predicting where it's going to >> load the kernel. So when it loads it at some address != prot_addr, the >> absolute addresses in setup_data now point somewhere bogus, causing >> crashes when EFI stub tries to follow the next link. >> >> Fix this by placing setup_data at some fixed place in memory, relative >> to real_addr, not as part of the kernel image, and then pointing the >> setup_data absolute address to that fixed place in memory. This way, >> even if OVMF or other chains relocate the kernel image, the boot >> parameter still points to the correct absolute address. >> >> Fixes: 3cbeb52467 ("hw/i386: add device tree support") >> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> >> Cc: Paolo Bonzini <pbonzini@redhat.com> >> Cc: Richard Henderson <richard.henderson@linaro.org> >> Cc: Peter Maydell <peter.maydell@linaro.org> >> Cc: Michael S. Tsirkin <mst@redhat.com> >> Cc: Daniel P. Berrangé <berrange@redhat.com> >> Cc: Gerd Hoffmann <kraxel@redhat.com> >> Cc: Ard Biesheuvel <ardb@kernel.org> >> Cc: linux-efi@vger.kernel.org >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > > Didn't read the patch yet. > Adding Laszlo. As I said in <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>, I don't believe that the setup_data chaining described in <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work for UEFI guests at all, with QEMU pre-populating the links with GPAs. However, rather than introducing a new info channel, or reusing an existent one (ACPI linker/loader, GUID-ed structure chaining in pflash), for the sake of this feature, I suggest simply disabling this feature for UEFI guests. setup_data chaining has not been necessary for UEFI guests for years (this is the first time I've heard about it in more than a decade), and the particular use case (provide guests with good random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL. ... Now, here's my problem: microvm, and Xen. As far as I can tell, QEMU can determine (it already does determine) whether the guest uses UEFI or not, for the "pc" and "q35" machine types. But not for microvm or Xen! Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can derive that pflash_cfi01_get_blk(pcms->flash[0]) returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this is only valid after the inital part of pc_system_firmware_init() has run ("Map legacy -drive if=pflash to machine properties"), but that is not a problem, given the following call tree: pc_memory_init() [hw/i386/pc.c] pc_system_firmware_init() [hw/i386/pc_sysfw.c] x86_load_linux() [hw/i386/x86.c] That is, when we call x86_load_linux() from pc_memory_init(), we can set the last argument of x86_load_linux() from *both* the compat knob *and* pflash_cfi01_get_blk(pcms->flash[0]). Unluckily however, we have two more x86_load_linux() calls where this does not work. The first is xen_load_linux() [hw/i386/pc.c], which is used for "xen HVM direct kernel boot" [hw/i386/pc_piix.c] -- there we certainly don't use flash. I don't know if Xen HVM direct kernel boot is possible with UEFI. If so, we have a problem, if UEFI is out of the question there, then enabling setup_data passing is fine. The other problematic x86_load_linux() call is in MicroVM -- microvm_memory_init() [hw/i386/microvm.c]. This is a real problem unfortunately, as MicroVM can be used with SeaBIOS, QBoot, and even OVMF, *but* even in the last case, it runs OVMF *without* flash (just from ROM). So not only is MicroVM not a descendant of the PC Machine Class (so it has no access to PC Machine State either -- such as pcms->flash), it never *uses* flash in fact. Which is a big problem for my idea, because QEMU has no way of identifying whether microvm is going to boot a custom SeaBIOS binary (where the current setup_data chaining is OK) or a custom OVMF binary (where the current setup_data chaining crashes the guest kernel). So I thought that for pc and q35, I should be able to propose a fix, based on: pflash_cfi01_get_blk(pcms->flash[0]) but it turns out I don't know what to do about Xen; and worse, for MicroVM, it's currently *impossible* for QEMU to tell apart UEFI from other guest firmwares. For now I suggest either reverting the original patch, or at least not enabling the knob by default for any machine types. In particular, when using MicroVM, the user must leave the knob disabled when direct booting a kernel on OVMF, and the user may or may not enable the knob when direct booting a kernel on SeaBIOS. Thanks Laszlo
On Thu, Aug 04, 2022 at 10:58:36AM +0200, Laszlo Ersek wrote: > On 08/04/22 09:03, Michael S. Tsirkin wrote: > > On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote: > >> The boot parameter header refers to setup_data at an absolute address, > >> and each setup_data refers to the next setup_data at an absolute address > >> too. Currently QEMU simply puts the setup_datas right after the kernel > >> image, and since the kernel_image is loaded at prot_addr -- a fixed > >> address knowable to QEMU apriori -- the setup_data absolute address > >> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`. > >> > >> This mostly works fine, so long as the kernel image really is loaded at > >> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and > >> generally EFI doesn't give a good way of predicting where it's going to > >> load the kernel. So when it loads it at some address != prot_addr, the > >> absolute addresses in setup_data now point somewhere bogus, causing > >> crashes when EFI stub tries to follow the next link. > >> > >> Fix this by placing setup_data at some fixed place in memory, relative > >> to real_addr, not as part of the kernel image, and then pointing the > >> setup_data absolute address to that fixed place in memory. This way, > >> even if OVMF or other chains relocate the kernel image, the boot > >> parameter still points to the correct absolute address. > >> > >> Fixes: 3cbeb52467 ("hw/i386: add device tree support") > >> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> > >> Cc: Paolo Bonzini <pbonzini@redhat.com> > >> Cc: Richard Henderson <richard.henderson@linaro.org> > >> Cc: Peter Maydell <peter.maydell@linaro.org> > >> Cc: Michael S. Tsirkin <mst@redhat.com> > >> Cc: Daniel P. Berrangé <berrange@redhat.com> > >> Cc: Gerd Hoffmann <kraxel@redhat.com> > >> Cc: Ard Biesheuvel <ardb@kernel.org> > >> Cc: linux-efi@vger.kernel.org > >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > > > > Didn't read the patch yet. > > Adding Laszlo. > > As I said in > <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>, > I don't believe that the setup_data chaining described in > <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work > for UEFI guests at all, with QEMU pre-populating the links with GPAs. > > However, rather than introducing a new info channel, or reusing an > existent one (ACPI linker/loader, GUID-ed structure chaining in pflash), > for the sake of this feature, I suggest simply disabling this feature > for UEFI guests. setup_data chaining has not been necessary for UEFI > guests for years (this is the first time I've heard about it in more > than a decade), and the particular use case (provide guests with good > random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL. > > ... Now, here's my problem: microvm, and Xen. > > As far as I can tell, QEMU can determine (it already does determine) > whether the guest uses UEFI or not, for the "pc" and "q35" machine > types. But not for microvm or Xen! > > Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can > derive that > > pflash_cfi01_get_blk(pcms->flash[0]) > > returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this > is only valid after the inital part of pc_system_firmware_init() has run > ("Map legacy -drive if=pflash to machine properties"), but that is not a > problem, given the following call tree: I don't beleve that's a valid check anymore since Gerd introduced the ability to load UEFI via -bios, for UEFI builds without persistent variables. ( a8152c4e4613c70c2f0573a82babbc8acc00cf90 ) > Which is a big problem for my idea, because QEMU has no way of > identifying whether microvm is going to boot a custom SeaBIOS binary > (where the current setup_data chaining is OK) or a custom OVMF binary > (where the current setup_data chaining crashes the guest kernel). > > So I thought that for pc and q35, I should be able to propose a fix, > based on: > > pflash_cfi01_get_blk(pcms->flash[0]) > > but it turns out I don't know what to do about Xen; and worse, for > MicroVM, it's currently *impossible* for QEMU to tell apart UEFI from > other guest firmwares. Yep, and ultimately the inability to distinguish UEFI vs other firmware is arguably correct by design, as the QEMU <-> firmware interface is supposed to be arbitrarily pluggable for any firmware implementation not limited to merely UEFI + seabios. > For now I suggest either reverting the original patch, or at least not > enabling the knob by default for any machine types. In particular, when > using MicroVM, the user must leave the knob disabled when direct booting > a kernel on OVMF, and the user may or may not enable the knob when > direct booting a kernel on SeaBIOS. Having it opt-in via a knob would defeat Jason's goal of having the seed available automatically. With regards, Daniel
On Thu, 4 Aug 2022 at 11:25, Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Thu, Aug 04, 2022 at 10:58:36AM +0200, Laszlo Ersek wrote: > > On 08/04/22 09:03, Michael S. Tsirkin wrote: > > > On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote: > > >> The boot parameter header refers to setup_data at an absolute address, > > >> and each setup_data refers to the next setup_data at an absolute address > > >> too. Currently QEMU simply puts the setup_datas right after the kernel > > >> image, and since the kernel_image is loaded at prot_addr -- a fixed > > >> address knowable to QEMU apriori -- the setup_data absolute address > > >> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`. > > >> > > >> This mostly works fine, so long as the kernel image really is loaded at > > >> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and > > >> generally EFI doesn't give a good way of predicting where it's going to > > >> load the kernel. So when it loads it at some address != prot_addr, the > > >> absolute addresses in setup_data now point somewhere bogus, causing > > >> crashes when EFI stub tries to follow the next link. > > >> > > >> Fix this by placing setup_data at some fixed place in memory, relative > > >> to real_addr, not as part of the kernel image, and then pointing the > > >> setup_data absolute address to that fixed place in memory. This way, > > >> even if OVMF or other chains relocate the kernel image, the boot > > >> parameter still points to the correct absolute address. > > >> > > >> Fixes: 3cbeb52467 ("hw/i386: add device tree support") > > >> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> > > >> Cc: Paolo Bonzini <pbonzini@redhat.com> > > >> Cc: Richard Henderson <richard.henderson@linaro.org> > > >> Cc: Peter Maydell <peter.maydell@linaro.org> > > >> Cc: Michael S. Tsirkin <mst@redhat.com> > > >> Cc: Daniel P. Berrangé <berrange@redhat.com> > > >> Cc: Gerd Hoffmann <kraxel@redhat.com> > > >> Cc: Ard Biesheuvel <ardb@kernel.org> > > >> Cc: linux-efi@vger.kernel.org > > >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > > > > > > Didn't read the patch yet. > > > Adding Laszlo. > > > > As I said in > > <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>, > > I don't believe that the setup_data chaining described in > > <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work > > for UEFI guests at all, with QEMU pre-populating the links with GPAs. > > > > However, rather than introducing a new info channel, or reusing an > > existent one (ACPI linker/loader, GUID-ed structure chaining in pflash), > > for the sake of this feature, I suggest simply disabling this feature > > for UEFI guests. setup_data chaining has not been necessary for UEFI > > guests for years (this is the first time I've heard about it in more > > than a decade), and the particular use case (provide guests with good > > random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL. > > > > ... Now, here's my problem: microvm, and Xen. > > > > As far as I can tell, QEMU can determine (it already does determine) > > whether the guest uses UEFI or not, for the "pc" and "q35" machine > > types. But not for microvm or Xen! > > > > Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can > > derive that > > > > pflash_cfi01_get_blk(pcms->flash[0]) > > > > returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this > > is only valid after the inital part of pc_system_firmware_init() has run > > ("Map legacy -drive if=pflash to machine properties"), but that is not a > > problem, given the following call tree: > > I don't beleve that's a valid check anymore since Gerd introduced the > ability to load UEFI via -bios, for UEFI builds without persistent > variables. ( a8152c4e4613c70c2f0573a82babbc8acc00cf90 ) > I think there is a fundamental flaw in the QEMU logic where it adds setup_data nodes to the *file* representation of the kernel image. IOW, loading the kernei image at address x, creating setup_data nodes in memory relative to x and then invoking the kernel image directly (as kexec does as well, AIUI) is perfectly fine. Managing a file system abstraction and a generic interface to load its contents, and using it to load the kernel image anywhere in memory is also fine, and OVMF -kernel relies on this. It is the combination of the two that explodes, unsurprisingly. Making inferences about which kind of firmware is invoking the file loading interface, and gating this behavior accordingly just papers over the problem. Jason and I have been discussing this over IRC, and there are essentially 3 places we could address this: 1) in the Linux/x86 EFI stub, which has a 'pure EFI' entrypoint and a 'EFI handover protocol' entrypoint, and we could simply wipe the setup_data pointer in the former; 2) in OVMF's kernel loader, where we could 'fix up' the setup_data field 3) in QEMU, which [as I laid out above] does something it shouldn't be doing. I strongly object to 2), as it would involve teaching OVMF's 'pure EFI' boot path about the intricacies of struct boot_params etc, which defeats the purpose of having a generic EFI boot path (Note that much of my work on the Linux/EFI subsystem over the past years has been focused on getting rid of all the arch-specific hacks and layering violations and making as much of the EFI code in Linux arch-agnostic and adhere to EFI principles and APIs) Given that neither SETUP_DTB nor SETUP_RNG_SEED are particularly relevant for pure EFI boot, clearing setup_data in the pure EFI entrypoint in the EFI stub is not unreasonable, but not very future proof, given that it limits how we might use setup_data for other purposes in the future (although one might argue we could just drop code again from the stub when new uses of setup_data are introduced into the kernel) However, our conclusion was that it is really QEMU that is doing something strange here, so IMHO, fixing QEMU is really the most suitable approach.
On 08/04/22 12:26, Ard Biesheuvel wrote: > On Thu, 4 Aug 2022 at 11:25, Daniel P. Berrangé <berrange@redhat.com> wrote: >> >> On Thu, Aug 04, 2022 at 10:58:36AM +0200, Laszlo Ersek wrote: >>> On 08/04/22 09:03, Michael S. Tsirkin wrote: >>>> On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote: >>>>> The boot parameter header refers to setup_data at an absolute address, >>>>> and each setup_data refers to the next setup_data at an absolute address >>>>> too. Currently QEMU simply puts the setup_datas right after the kernel >>>>> image, and since the kernel_image is loaded at prot_addr -- a fixed >>>>> address knowable to QEMU apriori -- the setup_data absolute address >>>>> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`. >>>>> >>>>> This mostly works fine, so long as the kernel image really is loaded at >>>>> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and >>>>> generally EFI doesn't give a good way of predicting where it's going to >>>>> load the kernel. So when it loads it at some address != prot_addr, the >>>>> absolute addresses in setup_data now point somewhere bogus, causing >>>>> crashes when EFI stub tries to follow the next link. >>>>> >>>>> Fix this by placing setup_data at some fixed place in memory, relative >>>>> to real_addr, not as part of the kernel image, and then pointing the >>>>> setup_data absolute address to that fixed place in memory. This way, >>>>> even if OVMF or other chains relocate the kernel image, the boot >>>>> parameter still points to the correct absolute address. >>>>> >>>>> Fixes: 3cbeb52467 ("hw/i386: add device tree support") >>>>> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> >>>>> Cc: Paolo Bonzini <pbonzini@redhat.com> >>>>> Cc: Richard Henderson <richard.henderson@linaro.org> >>>>> Cc: Peter Maydell <peter.maydell@linaro.org> >>>>> Cc: Michael S. Tsirkin <mst@redhat.com> >>>>> Cc: Daniel P. Berrangé <berrange@redhat.com> >>>>> Cc: Gerd Hoffmann <kraxel@redhat.com> >>>>> Cc: Ard Biesheuvel <ardb@kernel.org> >>>>> Cc: linux-efi@vger.kernel.org >>>>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> >>>> >>>> Didn't read the patch yet. >>>> Adding Laszlo. >>> >>> As I said in >>> <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>, >>> I don't believe that the setup_data chaining described in >>> <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work >>> for UEFI guests at all, with QEMU pre-populating the links with GPAs. >>> >>> However, rather than introducing a new info channel, or reusing an >>> existent one (ACPI linker/loader, GUID-ed structure chaining in pflash), >>> for the sake of this feature, I suggest simply disabling this feature >>> for UEFI guests. setup_data chaining has not been necessary for UEFI >>> guests for years (this is the first time I've heard about it in more >>> than a decade), and the particular use case (provide guests with good >>> random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL. >>> >>> ... Now, here's my problem: microvm, and Xen. >>> >>> As far as I can tell, QEMU can determine (it already does determine) >>> whether the guest uses UEFI or not, for the "pc" and "q35" machine >>> types. But not for microvm or Xen! >>> >>> Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can >>> derive that >>> >>> pflash_cfi01_get_blk(pcms->flash[0]) >>> >>> returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this >>> is only valid after the inital part of pc_system_firmware_init() has run >>> ("Map legacy -drive if=pflash to machine properties"), but that is not a >>> problem, given the following call tree: >> >> I don't beleve that's a valid check anymore since Gerd introduced the >> ability to load UEFI via -bios, for UEFI builds without persistent >> variables. ( a8152c4e4613c70c2f0573a82babbc8acc00cf90 ) >> > > I think there is a fundamental flaw in the QEMU logic where it adds > setup_data nodes to the *file* representation of the kernel image. > > IOW, loading the kernei image at address x, creating setup_data nodes > in memory relative to x and then invoking the kernel image directly > (as kexec does as well, AIUI) is perfectly fine. > > Managing a file system abstraction and a generic interface to load its > contents, and using it to load the kernel image anywhere in memory is > also fine, and OVMF -kernel relies on this. > > It is the combination of the two that explodes, unsurprisingly. Making > inferences about which kind of firmware is invoking the file loading > interface, and gating this behavior accordingly just papers over the > problem. > > Jason and I have been discussing this over IRC, and there are > essentially 3 places we could address this: > 1) in the Linux/x86 EFI stub, which has a 'pure EFI' entrypoint and a > 'EFI handover protocol' entrypoint, and we could simply wipe the > setup_data pointer in the former; > 2) in OVMF's kernel loader, where we could 'fix up' the setup_data field > 3) in QEMU, which [as I laid out above] does something it shouldn't be doing. > > I strongly object to 2), as it would involve teaching OVMF's 'pure > EFI' boot path about the intricacies of struct boot_params etc, which > defeats the purpose of having a generic EFI boot path (Note that much > of my work on the Linux/EFI subsystem over the past years has been > focused on getting rid of all the arch-specific hacks and layering > violations and making as much of the EFI code in Linux arch-agnostic > and adhere to EFI principles and APIs) > > Given that neither SETUP_DTB nor SETUP_RNG_SEED are particularly > relevant for pure EFI boot, clearing setup_data in the pure EFI > entrypoint in the EFI stub is not unreasonable, but not very future > proof, given that it limits how we might use setup_data for other > purposes in the future (although one might argue we could just drop > code again from the stub when new uses of setup_data are introduced > into the kernel) > > However, our conclusion was that it is really QEMU that is doing > something strange here, so IMHO, fixing QEMU is really the most > suitable approach. > I'm not sure if this can be solved arch-independently and firmware-independently. ACPI and SMBIOS look "independent" of arch and firmware if you squint, but that's only because the specs describe the entry points for different firmwares differently, thereby hiding the problem from the rest of the world. Also because the various virtual firmwares receive the ACPI and SMBIOS contents from the VMM(s) differently. However, ACPI and SMBIOS are probably too late for the kernel to consume the random seed. A different example for information that's needed really early is the serial port's location (see "earlyprintk" e.g.). That gets passed on the command line (and the usual values strongly differ between arm64 and x864_64). But, I guess passing the random seed on the kernel cmdline is considered insecure. None of the existing info passing methods seem early enough, generic enough, and secure enough (at the same time)...
Hi Daniel, On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote: > Yep, and ultimately the inability to distinguish UEFI vs other firmware > is arguably correct by design, as the QEMU <-> firmware interface is > supposed to be arbitrarily pluggable for any firmware implementation > not limited to merely UEFI + seabios. Indeed, I agree with this. > > > For now I suggest either reverting the original patch, or at least not > > enabling the knob by default for any machine types. In particular, when > > using MicroVM, the user must leave the knob disabled when direct booting > > a kernel on OVMF, and the user may or may not enable the knob when > > direct booting a kernel on SeaBIOS. > > Having it opt-in via a knob would defeat Jason's goal of having the seed > available automatically. Yes, adding a knob is absolutely out of the question. It also doesn't actually solve the problem: this triggers when QEMU passes a DTB too. It's not just for the new RNG seed thing. This bug isn't new. Jason
Hi Laszlo, On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote: > None of the existing info passing methods seem early enough, generic > enough, and secure enough (at the same time)... Can you look at the v2 patch? It seems to work on every configuration I throw at it. Keep in mind that setup_data is only used very, very early. I can think of a few other places to put it too, looking at the x86 memory map, that will survive long enough. I think this might actually be a straightforwardly solvable problem if you think about it more basically. Jason
On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote: > Hi Daniel, > > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote: > > Yep, and ultimately the inability to distinguish UEFI vs other firmware > > is arguably correct by design, as the QEMU <-> firmware interface is > > supposed to be arbitrarily pluggable for any firmware implementation > > not limited to merely UEFI + seabios. > > Indeed, I agree with this. > > > > > > For now I suggest either reverting the original patch, or at least not > > > enabling the knob by default for any machine types. In particular, when > > > using MicroVM, the user must leave the knob disabled when direct booting > > > a kernel on OVMF, and the user may or may not enable the knob when > > > direct booting a kernel on SeaBIOS. > > > > Having it opt-in via a knob would defeat Jason's goal of having the seed > > available automatically. > > Yes, adding a knob is absolutely out of the question. > > It also doesn't actually solve the problem: this triggers when QEMU > passes a DTB too. It's not just for the new RNG seed thing. This bug > isn't new. In the other thread I also mentioned that this RNG Seed addition has caused a bug with AMD SEV too, making boot measurement attestation fail because the kernel blob passed to the firmware no longer matches what the tenant expects, due to the injected seed. With regards, Daniel
On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote: > > Hi Daniel, > > > > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote: > > > Yep, and ultimately the inability to distinguish UEFI vs other firmware > > > is arguably correct by design, as the QEMU <-> firmware interface is > > > supposed to be arbitrarily pluggable for any firmware implementation > > > not limited to merely UEFI + seabios. > > > > Indeed, I agree with this. > > > > > > > > > For now I suggest either reverting the original patch, or at least not > > > > enabling the knob by default for any machine types. In particular, when > > > > using MicroVM, the user must leave the knob disabled when direct booting > > > > a kernel on OVMF, and the user may or may not enable the knob when > > > > direct booting a kernel on SeaBIOS. > > > > > > Having it opt-in via a knob would defeat Jason's goal of having the seed > > > available automatically. > > > > Yes, adding a knob is absolutely out of the question. > > > > It also doesn't actually solve the problem: this triggers when QEMU > > passes a DTB too. It's not just for the new RNG seed thing. This bug > > isn't new. > > In the other thread I also mentioned that this RNG Seed addition has > caused a bug with AMD SEV too, making boot measurement attestation > fail because the kernel blob passed to the firmware no longer matches > what the tenant expects, due to the injected seed. > I was actually expecting this to be an issue in the signing/attestation department as well, and you just confirmed my suspicion. But does this mean that populating the setup_data pointer is out of the question altogether? Or only that putting the setup_data linked list nodes inside the image is a problem?
Hi Ard, On Thu, Aug 4, 2022 at 2:16 PM Ard Biesheuvel <ardb@kernel.org> wrote: > > On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote: > > > Hi Daniel, > > > > > > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote: > > > > Yep, and ultimately the inability to distinguish UEFI vs other firmware > > > > is arguably correct by design, as the QEMU <-> firmware interface is > > > > supposed to be arbitrarily pluggable for any firmware implementation > > > > not limited to merely UEFI + seabios. > > > > > > Indeed, I agree with this. > > > > > > > > > > > > For now I suggest either reverting the original patch, or at least not > > > > > enabling the knob by default for any machine types. In particular, when > > > > > using MicroVM, the user must leave the knob disabled when direct booting > > > > > a kernel on OVMF, and the user may or may not enable the knob when > > > > > direct booting a kernel on SeaBIOS. > > > > > > > > Having it opt-in via a knob would defeat Jason's goal of having the seed > > > > available automatically. > > > > > > Yes, adding a knob is absolutely out of the question. > > > > > > It also doesn't actually solve the problem: this triggers when QEMU > > > passes a DTB too. It's not just for the new RNG seed thing. This bug > > > isn't new. > > > > In the other thread I also mentioned that this RNG Seed addition has > > caused a bug with AMD SEV too, making boot measurement attestation > > fail because the kernel blob passed to the firmware no longer matches > > what the tenant expects, due to the injected seed. > > > > I was actually expecting this to be an issue in the > signing/attestation department as well, and you just confirmed my > suspicion. > > But does this mean that populating the setup_data pointer is out of > the question altogether? Or only that putting the setup_data linked > list nodes inside the image is a problem? If you look at the v2 patch, populating boot_param->setup_data winds up being a fixed value. So even if that part was a problem (though I don't think it is), it won't be with the v2 patch, since it's always the same. Jason
On Thu, Aug 4, 2022 at 2:17 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > Hi Ard, > > On Thu, Aug 4, 2022 at 2:16 PM Ard Biesheuvel <ardb@kernel.org> wrote: > > > > On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote: > > > > > > On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote: > > > > Hi Daniel, > > > > > > > > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote: > > > > > Yep, and ultimately the inability to distinguish UEFI vs other firmware > > > > > is arguably correct by design, as the QEMU <-> firmware interface is > > > > > supposed to be arbitrarily pluggable for any firmware implementation > > > > > not limited to merely UEFI + seabios. > > > > > > > > Indeed, I agree with this. > > > > > > > > > > > > > > > For now I suggest either reverting the original patch, or at least not > > > > > > enabling the knob by default for any machine types. In particular, when > > > > > > using MicroVM, the user must leave the knob disabled when direct booting > > > > > > a kernel on OVMF, and the user may or may not enable the knob when > > > > > > direct booting a kernel on SeaBIOS. > > > > > > > > > > Having it opt-in via a knob would defeat Jason's goal of having the seed > > > > > available automatically. > > > > > > > > Yes, adding a knob is absolutely out of the question. > > > > > > > > It also doesn't actually solve the problem: this triggers when QEMU > > > > passes a DTB too. It's not just for the new RNG seed thing. This bug > > > > isn't new. > > > > > > In the other thread I also mentioned that this RNG Seed addition has > > > caused a bug with AMD SEV too, making boot measurement attestation > > > fail because the kernel blob passed to the firmware no longer matches > > > what the tenant expects, due to the injected seed. > > > > > > > I was actually expecting this to be an issue in the > > signing/attestation department as well, and you just confirmed my > > suspicion. > > > > But does this mean that populating the setup_data pointer is out of > > the question altogether? Or only that putting the setup_data linked > > list nodes inside the image is a problem? > > If you look at the v2 patch, populating boot_param->setup_data winds > up being a fixed value. So even if that part was a problem (though I > don't think it is), it won't be with the v2 patch, since it's always > the same. Actually, `setup` isn't even modified if SEV is being used anyway. So really, the approach of this v2 -- of not modifying the kernel image -- should fix that issue, no matter what. Jason
On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote: > > Hi Laszlo, > > On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote: > > None of the existing info passing methods seem early enough, generic > > enough, and secure enough (at the same time)... > > Can you look at the v2 patch? It seems to work on every configuration I > throw at it. Keep in mind that setup_data is only used very, very early. > I can think of a few other places to put it too, looking at the x86 > memory map, that will survive long enough. > > I think this might actually be a straightforwardly solvable problem if > you think about it more basically. And just to put things in perspective here... We only need like 48 bytes or something at some easy fixed address. That's not much. That's *got* to be a fairly tractable problem. If v2 has issues, I can't see why there wouldn't be a different easy place to put a meger 48 bytes of stuff that then is allowed to be wiped out after early boot. Jason
On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote: > The boot parameter header refers to setup_data at an absolute address, > and each setup_data refers to the next setup_data at an absolute address > too. Currently QEMU simply puts the setup_datas right after the kernel > image, and since the kernel_image is loaded at prot_addr -- a fixed > address knowable to QEMU apriori -- the setup_data absolute address > winds up being just `prot_addr + a_fixed_offset_into_kernel_image`. > > This mostly works fine, so long as the kernel image really is loaded at > prot_addr. However, OVMF doesn't load the kernel at prot_addr, and > generally EFI doesn't give a good way of predicting where it's going to > load the kernel. So when it loads it at some address != prot_addr, the > absolute addresses in setup_data now point somewhere bogus, causing > crashes when EFI stub tries to follow the next link. > > Fix this by placing setup_data at some fixed place in memory, relative > to real_addr, not as part of the kernel image, and then pointing the > setup_data absolute address to that fixed place in memory. This way, > even if OVMF or other chains relocate the kernel image, the boot > parameter still points to the correct absolute address. > > Fixes: 3cbeb52467 ("hw/i386: add device tree support") > Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> > Cc: Paolo Bonzini <pbonzini@redhat.com> > Cc: Richard Henderson <richard.henderson@linaro.org> > Cc: Peter Maydell <peter.maydell@linaro.org> > Cc: Michael S. Tsirkin <mst@redhat.com> > Cc: Daniel P. Berrangé <berrange@redhat.com> > Cc: Gerd Hoffmann <kraxel@redhat.com> > Cc: Ard Biesheuvel <ardb@kernel.org> > Cc: linux-efi@vger.kernel.org > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > --- > hw/i386/x86.c | 38 ++++++++++++++++++++------------------ > 1 file changed, 20 insertions(+), 18 deletions(-) > > diff --git a/hw/i386/x86.c b/hw/i386/x86.c > index 050eedc0c8..8b853abf38 100644 > --- a/hw/i386/x86.c > +++ b/hw/i386/x86.c > if (!legacy_no_rng_seed) { > - setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16); > - kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH; > - kernel = g_realloc(kernel, kernel_size); > - setup_data = (struct setup_data *)(kernel + setup_data_offset); > + setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH; > + setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len); > + setup_data = (struct setup_data *)(setup_datas + setup_data_total_len); > setup_data->next = cpu_to_le64(first_setup_data); > - first_setup_data = prot_addr + setup_data_offset; > + first_setup_data = setup_data_base + setup_data_total_len; > + setup_data_total_len += setup_data_item_len; > setup_data->type = cpu_to_le32(SETUP_RNG_SEED); > setup_data->len = cpu_to_le32(RNG_SEED_LENGTH); > qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH); > } > > - /* Offset 0x250 is a pointer to the first setup_data link. */ > - stq_p(header + 0x250, first_setup_data); > + if (first_setup_data) { > + /* Offset 0x250 is a pointer to the first setup_data link. */ > + stq_p(header + 0x250, first_setup_data); > + rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len, > + setup_data_base, NULL, NULL, NULL, NULL, false); > + } The boot measurements with AMD SEV now succeed, but I'm a little worried about the implications of adding this ROM, when a few lines later here we're discarding the 'header' changes for AMD SEV. Is this still going to operate correctly in the guest OS if we've discarded the header changes below ? > /* > * If we're starting an encrypted VM, it will be OVMF based, which uses the > * efi stub for booting and doesn't require any values to be placed in the > * kernel header. We therefore don't update the header so the hash of the > * kernel on the other side of the fw_cfg interface matches the hash of the > * file the user passed in. > */ > if (!sev_enabled()) { > memcpy(setup, header, MIN(sizeof(header), setup_size)); > } > > fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr); > fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size); > fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size); > sev_load_ctx.kernel_data = (char *)kernel; > -- > 2.35.1 > > With regards, Daniel
Hi Daniel, On Thu, Aug 4, 2022 at 2:54 PM Daniel P. Berrangé <berrange@redhat.com> wrote: > > On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote: > > The boot parameter header refers to setup_data at an absolute address, > > and each setup_data refers to the next setup_data at an absolute address > > too. Currently QEMU simply puts the setup_datas right after the kernel > > image, and since the kernel_image is loaded at prot_addr -- a fixed > > address knowable to QEMU apriori -- the setup_data absolute address > > winds up being just `prot_addr + a_fixed_offset_into_kernel_image`. > > > > This mostly works fine, so long as the kernel image really is loaded at > > prot_addr. However, OVMF doesn't load the kernel at prot_addr, and > > generally EFI doesn't give a good way of predicting where it's going to > > load the kernel. So when it loads it at some address != prot_addr, the > > absolute addresses in setup_data now point somewhere bogus, causing > > crashes when EFI stub tries to follow the next link. > > > > Fix this by placing setup_data at some fixed place in memory, relative > > to real_addr, not as part of the kernel image, and then pointing the > > setup_data absolute address to that fixed place in memory. This way, > > even if OVMF or other chains relocate the kernel image, the boot > > parameter still points to the correct absolute address. > > > > Fixes: 3cbeb52467 ("hw/i386: add device tree support") > > Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> > > Cc: Paolo Bonzini <pbonzini@redhat.com> > > Cc: Richard Henderson <richard.henderson@linaro.org> > > Cc: Peter Maydell <peter.maydell@linaro.org> > > Cc: Michael S. Tsirkin <mst@redhat.com> > > Cc: Daniel P. Berrangé <berrange@redhat.com> > > Cc: Gerd Hoffmann <kraxel@redhat.com> > > Cc: Ard Biesheuvel <ardb@kernel.org> > > Cc: linux-efi@vger.kernel.org > > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > > --- > > hw/i386/x86.c | 38 ++++++++++++++++++++------------------ > > 1 file changed, 20 insertions(+), 18 deletions(-) > > > > diff --git a/hw/i386/x86.c b/hw/i386/x86.c > > index 050eedc0c8..8b853abf38 100644 > > --- a/hw/i386/x86.c > > +++ b/hw/i386/x86.c > > > > if (!legacy_no_rng_seed) { > > - setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16); > > - kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH; > > - kernel = g_realloc(kernel, kernel_size); > > - setup_data = (struct setup_data *)(kernel + setup_data_offset); > > + setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH; > > + setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len); > > + setup_data = (struct setup_data *)(setup_datas + setup_data_total_len); > > setup_data->next = cpu_to_le64(first_setup_data); > > - first_setup_data = prot_addr + setup_data_offset; > > + first_setup_data = setup_data_base + setup_data_total_len; > > + setup_data_total_len += setup_data_item_len; > > setup_data->type = cpu_to_le32(SETUP_RNG_SEED); > > setup_data->len = cpu_to_le32(RNG_SEED_LENGTH); > > qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH); > > } > > > > - /* Offset 0x250 is a pointer to the first setup_data link. */ > > - stq_p(header + 0x250, first_setup_data); > > + if (first_setup_data) { > > + /* Offset 0x250 is a pointer to the first setup_data link. */ > > + stq_p(header + 0x250, first_setup_data); > > + rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len, > > + setup_data_base, NULL, NULL, NULL, NULL, false); > > + } > > The boot measurements with AMD SEV now succeed, but I'm a little > worried about the implications of adding this ROM, when a few lines > later here we're discarding the 'header' changes for AMD SEV. Is > this still going to operate correctly in the guest OS if we've > discarded the header changes below ? I'll add a !sev_enabled() condition to that block too, so it also skips adding the ROM, for v3. Jason
On 08/04/22 14:47, Jason A. Donenfeld wrote: > On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote: >> >> Hi Laszlo, >> >> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote: >>> None of the existing info passing methods seem early enough, generic >>> enough, and secure enough (at the same time)... >> >> Can you look at the v2 patch? It seems to work on every configuration I >> throw at it. Keep in mind that setup_data is only used very, very early. >> I can think of a few other places to put it too, looking at the x86 >> memory map, that will survive long enough. >> >> I think this might actually be a straightforwardly solvable problem if >> you think about it more basically. > > And just to put things in perspective here... We only need like 48 > bytes or something at some easy fixed address. That's not much. That's > *got* to be a fairly tractable problem. If v2 has issues, I can't see > why there wouldn't be a different easy place to put a meger 48 bytes > of stuff that then is allowed to be wiped out after early boot. I've looked at v2. It still relies on passing information from QEMU to the guest kernel through guest RAM such that the whole firmware execution takes place in-between, without the firmware knowing anything about that particular area -- effectively treating it as free system RAM. Such exceptions are time bombs. We *have* used hard-coded addresses, sometimes they are unavoidable, but then they are open-coded in both QEMU and the firmware, and some early part of the firmware takes care to either move the data to a "safe" place, or to cover it in-place with a kind of reservation that prevents other parts of the firmware from trampling over it. I've debugged mistakes (memory corruption) when such reservation was forgotten; it's not fun. In short, I have nothing against the QEMU patch, but then the current OvmfPkg maintainers should accept a patch for the firmware too, for protecting the area from later firmware components, as early as possible. Laszlo
Hi Laszlo, On Thu, Aug 4, 2022 at 3:17 PM Laszlo Ersek <lersek@redhat.com> wrote: > > On 08/04/22 14:47, Jason A. Donenfeld wrote: > > On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote: > >> > >> Hi Laszlo, > >> > >> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote: > >>> None of the existing info passing methods seem early enough, generic > >>> enough, and secure enough (at the same time)... > >> > >> Can you look at the v2 patch? It seems to work on every configuration I > >> throw at it. Keep in mind that setup_data is only used very, very early. > >> I can think of a few other places to put it too, looking at the x86 > >> memory map, that will survive long enough. > >> > >> I think this might actually be a straightforwardly solvable problem if > >> you think about it more basically. > > > > And just to put things in perspective here... We only need like 48 > > bytes or something at some easy fixed address. That's not much. That's > > *got* to be a fairly tractable problem. If v2 has issues, I can't see > > why there wouldn't be a different easy place to put a meger 48 bytes > > of stuff that then is allowed to be wiped out after early boot. > > I've looked at v2. It still relies on passing information from QEMU to > the guest kernel through guest RAM such that the whole firmware > execution takes place in-between, without the firmware knowing anything > about that particular area -- effectively treating it as free system > RAM. Such exceptions are time bombs. > > We *have* used hard-coded addresses, sometimes they are unavoidable, but > then they are open-coded in both QEMU and the firmware, and some early > part of the firmware takes care to either move the data to a "safe" > place, or to cover it in-place with a kind of reservation that prevents > other parts of the firmware from trampling over it. I've debugged > mistakes (memory corruption) when such reservation was forgotten; it's > not fun. > > In short, I have nothing against the QEMU patch, but then the current > OvmfPkg maintainers should accept a patch for the firmware too, for > protecting the area from later firmware components, as early as possible. What you say mostly makes sense. Though, I was wondering if there's some unused space (maybe a historical low address) that nothing touches because it's always been considered reserved. Some kind of ancient video data region, for example. We only need 48 bytes after all... My hope was that somebody with a lot of deep x86 knowledge would be able to smell their way to something that's always good like that. (Alternatively, there's my real_addr thing from v2.) Jason
On 08/04/22 14:16, Ard Biesheuvel wrote: > On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote: >> >> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote: >>> Hi Daniel, >>> >>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote: >>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware >>>> is arguably correct by design, as the QEMU <-> firmware interface is >>>> supposed to be arbitrarily pluggable for any firmware implementation >>>> not limited to merely UEFI + seabios. >>> >>> Indeed, I agree with this. >>> >>>> >>>>> For now I suggest either reverting the original patch, or at least not >>>>> enabling the knob by default for any machine types. In particular, when >>>>> using MicroVM, the user must leave the knob disabled when direct booting >>>>> a kernel on OVMF, and the user may or may not enable the knob when >>>>> direct booting a kernel on SeaBIOS. >>>> >>>> Having it opt-in via a knob would defeat Jason's goal of having the seed >>>> available automatically. >>> >>> Yes, adding a knob is absolutely out of the question. >>> >>> It also doesn't actually solve the problem: this triggers when QEMU >>> passes a DTB too. It's not just for the new RNG seed thing. This bug >>> isn't new. >> >> In the other thread I also mentioned that this RNG Seed addition has >> caused a bug with AMD SEV too, making boot measurement attestation >> fail because the kernel blob passed to the firmware no longer matches >> what the tenant expects, due to the injected seed. >> > > I was actually expecting this to be an issue in the > signing/attestation department as well, and you just confirmed my > suspicion. > > But does this mean that populating the setup_data pointer is out of > the question altogether? Or only that putting the setup_data linked > list nodes inside the image is a problem? QEMU already has to inject a whole bunch of stuff into confidential computing guests. The way it's done (IIRC) is that the non-compressed, trailing part of pflash (basically where the reset vector code lives too) is populated at OVMF build time with a chain of GUID-ed structures, and fields of those structures are filled in (at OVMF build time) from various fixed PCDs. The fixed PCDs in turn are populated from the FD files, using various MEMFD regions. When QEMU launches the guest, it can parse the GPAs from the on-disk pflash image (traversing the list of GUID-ed structs), at which addresses the guest firmware will then expect the various crypto artifacts to be injected. The point is that "who's in control" is reversed. The firmware exposes (at build time) at what GPAs it can accept data injections, and QEMU follows that. Of course the firmware ensures that nothing else in the firmware will try to own those GPAs. The only thing that needed to be hard-coded when this feature was introduced was the "entry point", that is, the flash offset at which QEMU starts the GUID-ed structure traversal. AMD and IBM developers can likely much better describe this mechanism, as I've not dealt with it in over a year. The QEMU side code is in "hw/i386/pc_sysfw_ovmf.c". We can make setup_data chaining work with OVMF, but the whole chain should be located in a GPA range that OVMF dictates.
On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote: > > On 08/04/22 14:16, Ard Biesheuvel wrote: > > On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote: > >> > >> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote: > >>> Hi Daniel, > >>> > >>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote: > >>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware > >>>> is arguably correct by design, as the QEMU <-> firmware interface is > >>>> supposed to be arbitrarily pluggable for any firmware implementation > >>>> not limited to merely UEFI + seabios. > >>> > >>> Indeed, I agree with this. > >>> > >>>> > >>>>> For now I suggest either reverting the original patch, or at least not > >>>>> enabling the knob by default for any machine types. In particular, when > >>>>> using MicroVM, the user must leave the knob disabled when direct booting > >>>>> a kernel on OVMF, and the user may or may not enable the knob when > >>>>> direct booting a kernel on SeaBIOS. > >>>> > >>>> Having it opt-in via a knob would defeat Jason's goal of having the seed > >>>> available automatically. > >>> > >>> Yes, adding a knob is absolutely out of the question. > >>> > >>> It also doesn't actually solve the problem: this triggers when QEMU > >>> passes a DTB too. It's not just for the new RNG seed thing. This bug > >>> isn't new. > >> > >> In the other thread I also mentioned that this RNG Seed addition has > >> caused a bug with AMD SEV too, making boot measurement attestation > >> fail because the kernel blob passed to the firmware no longer matches > >> what the tenant expects, due to the injected seed. > >> > > > > I was actually expecting this to be an issue in the > > signing/attestation department as well, and you just confirmed my > > suspicion. > > > > But does this mean that populating the setup_data pointer is out of > > the question altogether? Or only that putting the setup_data linked > > list nodes inside the image is a problem? > > QEMU already has to inject a whole bunch of stuff into confidential > computing guests. The way it's done (IIRC) is that the non-compressed, > trailing part of pflash (basically where the reset vector code lives > too) is populated at OVMF build time with a chain of GUID-ed structures, > and fields of those structures are filled in (at OVMF build time) from > various fixed PCDs. The fixed PCDs in turn are populated from the FD > files, using various MEMFD regions. When QEMU launches the guest, it can > parse the GPAs from the on-disk pflash image (traversing the list of > GUID-ed structs), at which addresses the guest firmware will then expect > the various crypto artifacts to be injected. > > The point is that "who's in control" is reversed. The firmware exposes > (at build time) at what GPAs it can accept data injections, and QEMU > follows that. Of course the firmware ensures that nothing else in the > firmware will try to own those GPAs. > > The only thing that needed to be hard-coded when this feature was > introduced was the "entry point", that is, the flash offset at which > QEMU starts the GUID-ed structure traversal. > > AMD and IBM developers can likely much better describe this mechanism, > as I've not dealt with it in over a year. The QEMU side code is in > "hw/i386/pc_sysfw_ovmf.c". > > We can make setup_data chaining work with OVMF, but the whole chain > should be located in a GPA range that OVMF dictates. It sounds like what you describe is pretty OVMF-specific though, right? Do we want to tie things together so tightly like that? Given we only need 48 bytes or so, isn't there a more subtle place we could just throw this in ram that doesn't need such complex coordination? Jason
On 08/04/22 15:16, Laszlo Ersek wrote: > On 08/04/22 14:47, Jason A. Donenfeld wrote: >> On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote: >>> >>> Hi Laszlo, >>> >>> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote: >>>> None of the existing info passing methods seem early enough, generic >>>> enough, and secure enough (at the same time)... >>> >>> Can you look at the v2 patch? It seems to work on every configuration I >>> throw at it. Keep in mind that setup_data is only used very, very early. >>> I can think of a few other places to put it too, looking at the x86 >>> memory map, that will survive long enough. >>> >>> I think this might actually be a straightforwardly solvable problem if >>> you think about it more basically. >> >> And just to put things in perspective here... We only need like 48 >> bytes or something at some easy fixed address. That's not much. That's >> *got* to be a fairly tractable problem. If v2 has issues, I can't see >> why there wouldn't be a different easy place to put a meger 48 bytes >> of stuff that then is allowed to be wiped out after early boot. > > I've looked at v2. It still relies on passing information from QEMU to > the guest kernel through guest RAM such that the whole firmware > execution takes place in-between, without the firmware knowing anything > about that particular area -- effectively treating it as free system > RAM. Such exceptions are time bombs. > > We *have* used hard-coded addresses, sometimes they are unavoidable, but > then they are open-coded in both QEMU and the firmware, and some early > part of the firmware takes care to either move the data to a "safe" > place, or to cover it in-place with a kind of reservation that prevents > other parts of the firmware from trampling over it. I've debugged > mistakes (memory corruption) when such reservation was forgotten; it's > not fun. Reference: https://github.com/tianocore/edk2/commit/ad43bc6b2e > > In short, I have nothing against the QEMU patch, but then the current > OvmfPkg maintainers should accept a patch for the firmware too, for > protecting the area from later firmware components, as early as possible. > > Laszlo >
On 08/04/22 15:28, Jason A. Donenfeld wrote: > On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote: >> >> On 08/04/22 14:16, Ard Biesheuvel wrote: >>> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote: >>>> >>>> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote: >>>>> Hi Daniel, >>>>> >>>>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote: >>>>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware >>>>>> is arguably correct by design, as the QEMU <-> firmware interface is >>>>>> supposed to be arbitrarily pluggable for any firmware implementation >>>>>> not limited to merely UEFI + seabios. >>>>> >>>>> Indeed, I agree with this. >>>>> >>>>>> >>>>>>> For now I suggest either reverting the original patch, or at least not >>>>>>> enabling the knob by default for any machine types. In particular, when >>>>>>> using MicroVM, the user must leave the knob disabled when direct booting >>>>>>> a kernel on OVMF, and the user may or may not enable the knob when >>>>>>> direct booting a kernel on SeaBIOS. >>>>>> >>>>>> Having it opt-in via a knob would defeat Jason's goal of having the seed >>>>>> available automatically. >>>>> >>>>> Yes, adding a knob is absolutely out of the question. >>>>> >>>>> It also doesn't actually solve the problem: this triggers when QEMU >>>>> passes a DTB too. It's not just for the new RNG seed thing. This bug >>>>> isn't new. >>>> >>>> In the other thread I also mentioned that this RNG Seed addition has >>>> caused a bug with AMD SEV too, making boot measurement attestation >>>> fail because the kernel blob passed to the firmware no longer matches >>>> what the tenant expects, due to the injected seed. >>>> >>> >>> I was actually expecting this to be an issue in the >>> signing/attestation department as well, and you just confirmed my >>> suspicion. >>> >>> But does this mean that populating the setup_data pointer is out of >>> the question altogether? Or only that putting the setup_data linked >>> list nodes inside the image is a problem? >> >> QEMU already has to inject a whole bunch of stuff into confidential >> computing guests. The way it's done (IIRC) is that the non-compressed, >> trailing part of pflash (basically where the reset vector code lives >> too) is populated at OVMF build time with a chain of GUID-ed structures, >> and fields of those structures are filled in (at OVMF build time) from >> various fixed PCDs. The fixed PCDs in turn are populated from the FD >> files, using various MEMFD regions. When QEMU launches the guest, it can >> parse the GPAs from the on-disk pflash image (traversing the list of >> GUID-ed structs), at which addresses the guest firmware will then expect >> the various crypto artifacts to be injected. >> >> The point is that "who's in control" is reversed. The firmware exposes >> (at build time) at what GPAs it can accept data injections, and QEMU >> follows that. Of course the firmware ensures that nothing else in the >> firmware will try to own those GPAs. >> >> The only thing that needed to be hard-coded when this feature was >> introduced was the "entry point", that is, the flash offset at which >> QEMU starts the GUID-ed structure traversal. >> >> AMD and IBM developers can likely much better describe this mechanism, >> as I've not dealt with it in over a year. The QEMU side code is in >> "hw/i386/pc_sysfw_ovmf.c". >> >> We can make setup_data chaining work with OVMF, but the whole chain >> should be located in a GPA range that OVMF dictates. > > It sounds like what you describe is pretty OVMF-specific though, > right? Yes, completely OVMF specific. > Do we want to tie things together so tightly like that? It depends on what the goal is: - do we want setup_data chaining work generally? - or do we want only the random seed injection to stop crashing OVMF guests? In the latter case, we only need to teach QEMU to reliably recognize OVMF from the on-disk firmware binary (regardless of pflash use), and then not expose any setup_data in guest RAM. The UEFI guest kernel will use EFI_RNG_PROTOCOL (when available, from virtio-rng-pci), and no particular seed otherwise. (This is the "papering over" case, which I don't think is necessarily wrong; only it should be robust. I thought pflash would be usable for that; turns out it isn't. Thus, we could perhaps try identifying a firmware binary as "OVMF" by contents.) In the former case though, I think the "tight coupling" is unavoidable. As long as setup_data is needed by the kernel extremely early, no "firmware hiding" or "firmware independence" is in place yet. > Given we only need 48 bytes or so, isn't there a more subtle place we > could just throw this in ram that doesn't need such complex > coordination? These tricks add up and go wrong after a while. The pedantic reservations in the firmware have proved necessary. IIUC, with v2, the setup_data_base address would (most frequently) be 96 KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does not reserve away the area, UefiCpuPkg or other drivers might allocate an overlapping chunk, even if only temporarily. That might not break the firmware, but it could overwrite the random seed. If the guest kernel somehow consumed that seed (rather than getting one from EFI_RNG_PROTOCOL -- what if no virtio-rng-pci device was configured?), that could potentially destroy the guest's security, without any obvious/immediate signs. Laszlo
On Thu, Aug 04, 2022 at 03:56:54PM +0200, Laszlo Ersek wrote: > On 08/04/22 15:28, Jason A. Donenfeld wrote: > > On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote: > >> > >> On 08/04/22 14:16, Ard Biesheuvel wrote: > >>> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote: > >>>> > >>>> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote: > >>>>> Hi Daniel, > >>>>> > >>>>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote: > >>>>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware > >>>>>> is arguably correct by design, as the QEMU <-> firmware interface is > >>>>>> supposed to be arbitrarily pluggable for any firmware implementation > >>>>>> not limited to merely UEFI + seabios. > >>>>> > >>>>> Indeed, I agree with this. > >>>>> > >>>>>> > >>>>>>> For now I suggest either reverting the original patch, or at least not > >>>>>>> enabling the knob by default for any machine types. In particular, when > >>>>>>> using MicroVM, the user must leave the knob disabled when direct booting > >>>>>>> a kernel on OVMF, and the user may or may not enable the knob when > >>>>>>> direct booting a kernel on SeaBIOS. > >>>>>> > >>>>>> Having it opt-in via a knob would defeat Jason's goal of having the seed > >>>>>> available automatically. > >>>>> > >>>>> Yes, adding a knob is absolutely out of the question. > >>>>> > >>>>> It also doesn't actually solve the problem: this triggers when QEMU > >>>>> passes a DTB too. It's not just for the new RNG seed thing. This bug > >>>>> isn't new. > >>>> > >>>> In the other thread I also mentioned that this RNG Seed addition has > >>>> caused a bug with AMD SEV too, making boot measurement attestation > >>>> fail because the kernel blob passed to the firmware no longer matches > >>>> what the tenant expects, due to the injected seed. > >>>> > >>> > >>> I was actually expecting this to be an issue in the > >>> signing/attestation department as well, and you just confirmed my > >>> suspicion. > >>> > >>> But does this mean that populating the setup_data pointer is out of > >>> the question altogether? Or only that putting the setup_data linked > >>> list nodes inside the image is a problem? > >> > >> QEMU already has to inject a whole bunch of stuff into confidential > >> computing guests. The way it's done (IIRC) is that the non-compressed, > >> trailing part of pflash (basically where the reset vector code lives > >> too) is populated at OVMF build time with a chain of GUID-ed structures, > >> and fields of those structures are filled in (at OVMF build time) from > >> various fixed PCDs. The fixed PCDs in turn are populated from the FD > >> files, using various MEMFD regions. When QEMU launches the guest, it can > >> parse the GPAs from the on-disk pflash image (traversing the list of > >> GUID-ed structs), at which addresses the guest firmware will then expect > >> the various crypto artifacts to be injected. > >> > >> The point is that "who's in control" is reversed. The firmware exposes > >> (at build time) at what GPAs it can accept data injections, and QEMU > >> follows that. Of course the firmware ensures that nothing else in the > >> firmware will try to own those GPAs. > >> > >> The only thing that needed to be hard-coded when this feature was > >> introduced was the "entry point", that is, the flash offset at which > >> QEMU starts the GUID-ed structure traversal. > >> > >> AMD and IBM developers can likely much better describe this mechanism, > >> as I've not dealt with it in over a year. The QEMU side code is in > >> "hw/i386/pc_sysfw_ovmf.c". > >> > >> We can make setup_data chaining work with OVMF, but the whole chain > >> should be located in a GPA range that OVMF dictates. > > > > It sounds like what you describe is pretty OVMF-specific though, > > right? > > Yes, completely OVMF specific. > > > Do we want to tie things together so tightly like that? > > It depends on what the goal is: > > - do we want setup_data chaining work generally? > > - or do we want only the random seed injection to stop crashing OVMF guests? > > In the latter case, we only need to teach QEMU to reliably recognize > OVMF from the on-disk firmware binary (regardless of pflash use), and > then not expose any setup_data in guest RAM. The UEFI guest kernel will > use EFI_RNG_PROTOCOL (when available, from virtio-rng-pci), and no > particular seed otherwise. > > (This is the "papering over" case, which I don't think is necessarily > wrong; only it should be robust. I thought pflash would be usable for > that; turns out it isn't. Thus, we could perhaps try identifying a > firmware binary as "OVMF" by contents.) > > In the former case though, I think the "tight coupling" is unavoidable. > As long as setup_data is needed by the kernel extremely early, no > "firmware hiding" or "firmware independence" is in place yet. > > > Given we only need 48 bytes or so, isn't there a more subtle place we > > could just throw this in ram that doesn't need such complex > > coordination? > > These tricks add up and go wrong after a while. The pedantic > reservations in the firmware have proved necessary. > > IIUC, with v2, the setup_data_base address would (most frequently) be 96 > KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does > not reserve away the area, UefiCpuPkg or other drivers might allocate an > overlapping chunk, even if only temporarily. That might not break the > firmware, but it could overwrite the random seed. If the guest kernel > somehow consumed that seed (rather than getting one from > EFI_RNG_PROTOCOL -- what if no virtio-rng-pci device was configured?), > that could potentially destroy the guest's security, without any > obvious/immediate signs. The guest kernel shouldn't load this random seed even if it is provided by the hypervisor, when running in a confidential virutalization environment like SEV/TDX or equiv for other arches. It can't trust the hypervisor to have actually used random data for populating the seed value. On x86 it would have to rely on the hardware RDRAND/RDSEED for an initial seed. With regards, Daniel
On 08/04/22 15:56, Laszlo Ersek wrote: > On 08/04/22 15:28, Jason A. Donenfeld wrote: >> On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote: >>> >>> On 08/04/22 14:16, Ard Biesheuvel wrote: >>>> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote: >>>>> >>>>> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote: >>>>>> Hi Daniel, >>>>>> >>>>>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote: >>>>>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware >>>>>>> is arguably correct by design, as the QEMU <-> firmware interface is >>>>>>> supposed to be arbitrarily pluggable for any firmware implementation >>>>>>> not limited to merely UEFI + seabios. >>>>>> >>>>>> Indeed, I agree with this. >>>>>> >>>>>>> >>>>>>>> For now I suggest either reverting the original patch, or at least not >>>>>>>> enabling the knob by default for any machine types. In particular, when >>>>>>>> using MicroVM, the user must leave the knob disabled when direct booting >>>>>>>> a kernel on OVMF, and the user may or may not enable the knob when >>>>>>>> direct booting a kernel on SeaBIOS. >>>>>>> >>>>>>> Having it opt-in via a knob would defeat Jason's goal of having the seed >>>>>>> available automatically. >>>>>> >>>>>> Yes, adding a knob is absolutely out of the question. >>>>>> >>>>>> It also doesn't actually solve the problem: this triggers when QEMU >>>>>> passes a DTB too. It's not just for the new RNG seed thing. This bug >>>>>> isn't new. >>>>> >>>>> In the other thread I also mentioned that this RNG Seed addition has >>>>> caused a bug with AMD SEV too, making boot measurement attestation >>>>> fail because the kernel blob passed to the firmware no longer matches >>>>> what the tenant expects, due to the injected seed. >>>>> >>>> >>>> I was actually expecting this to be an issue in the >>>> signing/attestation department as well, and you just confirmed my >>>> suspicion. >>>> >>>> But does this mean that populating the setup_data pointer is out of >>>> the question altogether? Or only that putting the setup_data linked >>>> list nodes inside the image is a problem? >>> >>> QEMU already has to inject a whole bunch of stuff into confidential >>> computing guests. The way it's done (IIRC) is that the non-compressed, >>> trailing part of pflash (basically where the reset vector code lives >>> too) is populated at OVMF build time with a chain of GUID-ed structures, >>> and fields of those structures are filled in (at OVMF build time) from >>> various fixed PCDs. The fixed PCDs in turn are populated from the FD >>> files, using various MEMFD regions. When QEMU launches the guest, it can >>> parse the GPAs from the on-disk pflash image (traversing the list of >>> GUID-ed structs), at which addresses the guest firmware will then expect >>> the various crypto artifacts to be injected. >>> >>> The point is that "who's in control" is reversed. The firmware exposes >>> (at build time) at what GPAs it can accept data injections, and QEMU >>> follows that. Of course the firmware ensures that nothing else in the >>> firmware will try to own those GPAs. >>> >>> The only thing that needed to be hard-coded when this feature was >>> introduced was the "entry point", that is, the flash offset at which >>> QEMU starts the GUID-ed structure traversal. >>> >>> AMD and IBM developers can likely much better describe this mechanism, >>> as I've not dealt with it in over a year. The QEMU side code is in >>> "hw/i386/pc_sysfw_ovmf.c". >>> >>> We can make setup_data chaining work with OVMF, but the whole chain >>> should be located in a GPA range that OVMF dictates. >> >> It sounds like what you describe is pretty OVMF-specific though, >> right? > > Yes, completely OVMF specific. > >> Do we want to tie things together so tightly like that? > > It depends on what the goal is: > > - do we want setup_data chaining work generally? > > - or do we want only the random seed injection to stop crashing OVMF guests? > > In the latter case, we only need to teach QEMU to reliably recognize > OVMF from the on-disk firmware binary (regardless of pflash use), and > then not expose any setup_data in guest RAM. The UEFI guest kernel will > use EFI_RNG_PROTOCOL (when available, from virtio-rng-pci), and no > particular seed otherwise. > > (This is the "papering over" case, which I don't think is necessarily > wrong; only it should be robust. I thought pflash would be usable for > that; turns out it isn't. Thus, we could perhaps try identifying a > firmware binary as "OVMF" by contents.) > > In the former case though, I think the "tight coupling" is unavoidable. > As long as setup_data is needed by the kernel extremely early, no > "firmware hiding" or "firmware independence" is in place yet. > >> Given we only need 48 bytes or so, isn't there a more subtle place we >> could just throw this in ram that doesn't need such complex >> coordination? > > These tricks add up and go wrong after a while. The pedantic > reservations in the firmware have proved necessary. > > IIUC, with v2, the setup_data_base address would (most frequently) be 96 > KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does > not reserve away the area, UefiCpuPkg or other drivers might allocate an > overlapping chunk, even if only temporarily. That might not break the > firmware, but it could overwrite the random seed. If the guest kernel > somehow consumed that seed (rather than getting one from > EFI_RNG_PROTOCOL -- what if no virtio-rng-pci device was configured?), > that could potentially destroy the guest's security, without any > obvious/immediate signs. I must add however that I feel very much out of place making authoritative-sounding statements like this. The above is my honest opinion, and my (unpleasant) writing style, but it's just that: my opinion & style. So much has happened in edk2 and QEMU since last summer (without me following them closely) that I feel uncomfortable speaking on them. Take whatever I say with a grain of salt, and definitely research all options. Laszlo
Hey Laszlo, On Thu, Aug 04, 2022 at 03:56:54PM +0200, Laszlo Ersek wrote: > - do we want setup_data chaining work generally? > > - or do we want only the random seed injection to stop crashing OVMF guests? Preferably the first - generally. Which brings us to your point: > > Given we only need 48 bytes or so, isn't there a more subtle place we > > could just throw this in ram that doesn't need such complex > > coordination? > > These tricks add up and go wrong after a while. The pedantic > reservations in the firmware have proved necessary. > > IIUC, with v2, the setup_data_base address would (most frequently) be 96 > KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does > not reserve away the area, UefiCpuPkg or other drivers might allocate an > overlapping chunk, even if only temporarily. That might not break the > firmware, but it could overwrite the random seed. Yea, so we don't want an address that something else might overwrite. So my question is: isn't there some 48 bytes or so available in some low address (or maybe a high one?) that is traditionally reserved for some hardware function, and so software doesn't use it, but it turns out QEMU doesn't use it for anything either, so we can get away placing it at that address? It seems like there *ought* to be something like that. I just don't (yet) know what it is... Jason
On 08/05/22 00:56, Jason A. Donenfeld wrote: > Hey Laszlo, > > On Thu, Aug 04, 2022 at 03:56:54PM +0200, Laszlo Ersek wrote: >> - do we want setup_data chaining work generally? >> >> - or do we want only the random seed injection to stop crashing OVMF guests? > > Preferably the first - generally. Which brings us to your point: > >>> Given we only need 48 bytes or so, isn't there a more subtle place we >>> could just throw this in ram that doesn't need such complex >>> coordination? >> >> These tricks add up and go wrong after a while. The pedantic >> reservations in the firmware have proved necessary. >> >> IIUC, with v2, the setup_data_base address would (most frequently) be 96 >> KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does >> not reserve away the area, UefiCpuPkg or other drivers might allocate an >> overlapping chunk, even if only temporarily. That might not break the >> firmware, but it could overwrite the random seed. > > Yea, so we don't want an address that something else might overwrite. So > my question is: isn't there some 48 bytes or so available in some low > address (or maybe a high one?) that is traditionally reserved for some > hardware function, and so software doesn't use it, but it turns out QEMU > doesn't use it for anything either, so we can get away placing it at > that address? It seems like there *ought* to be something like that. I > just don't (yet) know what it is... I don't know of any such "hidden pocket", unfortunately. On the other hand, low-level edk2 drivers (usually dealing with x86 intricacies, such as MTRRs, CPU bringup, ...) have repeatedly surprised me with their handling of low memory. Laszlo
Hi, > > We can make setup_data chaining work with OVMF, but the whole chain > > should be located in a GPA range that OVMF dictates. > > It sounds like what you describe is pretty OVMF-specific though, > right? Do we want to tie things together so tightly like that? > > Given we only need 48 bytes or so, isn't there a more subtle place we > could just throw this in ram that doesn't need such complex > coordination? Joining the party late (and still catching up the thread). Given we don't need that anyway with EFI, only with legacy BIOS: Can't that just be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass those 48 bytes random seed? take care, Gerd
Hey Gerd, On Tue, Aug 16, 2022 at 10:55:11AM +0200, Gerd Hoffmann wrote: > Hi, > > > > We can make setup_data chaining work with OVMF, but the whole chain > > > should be located in a GPA range that OVMF dictates. > > > > It sounds like what you describe is pretty OVMF-specific though, > > right? Do we want to tie things together so tightly like that? > > > > Given we only need 48 bytes or so, isn't there a more subtle place we > > could just throw this in ram that doesn't need such complex > > coordination? > > Joining the party late (and still catching up the thread). Given we > don't need that anyway with EFI, only with legacy BIOS: Can't that just > be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass > those 48 bytes random seed? Actually, I want this to work with EFI, very much so. If our objective was to just not break EFI, the solution would be simple: in the kernel we can have EFISTUB ignore the setup_data field from the image, and then bump the boot header protocol number. If QEMU sees the boot protocol number is below this one, then it won't set setup_data. Done, fixed. Except I think there's value in passing seeds even through with EFI. Your option ROM idea is interesting; somebody mentioned that elsewhere too I think. I'm wondering, though: do option ROMs still run when EFI/OVMF is being used? Jason
On Thu, Aug 18, 2022 at 05:38:37PM +0200, Jason A. Donenfeld wrote: > Hey Gerd, > > > Joining the party late (and still catching up the thread). Given we > > don't need that anyway with EFI, only with legacy BIOS: Can't that just > > be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass > > those 48 bytes random seed? > > Actually, I want this to work with EFI, very much so. With EFI the kernel stub gets some random seed via EFI_RNG_PROTOCOL. I can't see any good reason to derive from that. It works no matter how the kernel gets loaded. OVMF ships with a driver for the virtio-rng device. So just add that to your virtual machine and seeding works fine ... > If our objective was to just not break EFI, the solution would be > simple: in the kernel we can have EFISTUB ignore the setup_data field > from the image, and then bump the boot header protocol number. If QEMU > sees the boot protocol number is below this one, then it won't set > setup_data. Done, fixed. As mentioned elsewhere in the thread patching in physical addresses on qemu side isn't going to fly due to the different load methods we have. > Your option ROM idea is interesting; somebody mentioned that elsewhere > too I think. Doing the setup_data patching in the option rom has the advantage that it'll only happen with that specific load method being used. Also the option rom knows where it places stuff in memory so it is in a much better position to find a good & non-conflicting place for the random seed. Also reserve/allocate memory if needed etc. > I'm wondering, though: do option ROMs still run when > EFI/OVMF is being used? No, they are not used with EFI. OVMF has a completely independent implementation for direct kernel boot. The options I see for EFI are: (1) Do nothing and continue to depend on virtio-rng. (2) Implement an efi driver which gets those 48 seed bytes from qemu by whatever means we'll define and hands them out via EFI_RNG_PROTOCOL. take care, Gerd
On Fri, 19 Aug 2022 at 08:41, Gerd Hoffmann <kraxel@redhat.com> wrote: > > On Thu, Aug 18, 2022 at 05:38:37PM +0200, Jason A. Donenfeld wrote: > > Hey Gerd, > > > > > Joining the party late (and still catching up the thread). Given we > > > don't need that anyway with EFI, only with legacy BIOS: Can't that just > > > be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass > > > those 48 bytes random seed? > > > > Actually, I want this to work with EFI, very much so. Even if we wire this up for EFI in some way, it will only affect direct kernel boot using -kernel/-initrd etc, which is a niche use case at best (AIUI libvirt uses it for the initial boot only) I personally rely on it heavily for development, and I imagine others might too, but that is hardly relevant here. > With EFI the kernel stub gets some random seed via EFI_RNG_PROTOCOL. > I can't see any good reason to derive from that. It works no matter > how the kernel gets loaded. > > OVMF ships with a driver for the virtio-rng device. So just add that > to your virtual machine and seeding works fine ... > ... or we find other ways for the platform to speak to the OS, using EFI protocols or other EFI methods. Currently, the 'pure EFI' boot code is arch agnostic - it can be built and run on any architecture that supports EFI boot. Adding Linux+x86 specific hacks to it is out of the question. So that means that setup_data provided by QEMU will be consumed directly by the kernel (when doing direct kernel boot only), using an out of band channel that EFI/OVMF are completely unaware of, putting it outside the scope of secure boot, measure boot, etc. This is not acceptable to me. > > If our objective was to just not break EFI, the solution would be > > simple: in the kernel we can have EFISTUB ignore the setup_data field > > from the image, and then bump the boot header protocol number. If QEMU > > sees the boot protocol number is below this one, then it won't set > > setup_data. Done, fixed. > > As mentioned elsewhere in the thread patching in physical addresses on > qemu side isn't going to fly due to the different load methods we have. > And it conflates the file representation with the in-memory representation, which I object to fundamentally - setup_data is part of the file image, and becomes part of the in-memory representation when it gets staged in memory for booting, which only happens in the EFI stub when using pure EFI boot. Using setup_data as a hidden comms channel between QEMU and the core kernel breaks that distinction. > > Your option ROM idea is interesting; somebody mentioned that elsewhere > > too I think. > > Doing the setup_data patching in the option rom has the advantage that > it'll only happen with that specific load method being used. Also the > option rom knows where it places stuff in memory so it is in a much > better position to find a good & non-conflicting place for the random > seed. Also reserve/allocate memory if needed etc. > Exactly. This is the only sensible place to do this. > > I'm wondering, though: do option ROMs still run when > > EFI/OVMF is being used? > > No, they are not used with EFI. OVMF has a completely independent > implementation for direct kernel boot. > > The options I see for EFI are: > > (1) Do nothing and continue to depend on virtio-rng. > (2) Implement an efi driver which gets those 48 seed bytes from > qemu by whatever means we'll define and hands them out via > EFI_RNG_PROTOCOL. > We could explore other options, but SETUP_RNG_SEED is fundamentally incompatible with EFI boot (or any other boot method where the image is treated as an opaque file by the firmware/loader), so that is not an acceptable approach to me.
diff --git a/hw/i386/x86.c b/hw/i386/x86.c index 050eedc0c8..8b853abf38 100644 --- a/hw/i386/x86.c +++ b/hw/i386/x86.c @@ -760,36 +760,36 @@ static bool load_elfboot(const char *kernel_filename, fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ENTRY, pvh_start_addr); fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, mh_load_addr); fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, elf_kernel_size); return true; } void x86_load_linux(X86MachineState *x86ms, FWCfgState *fw_cfg, int acpi_data_size, bool pvh_enabled, bool legacy_no_rng_seed) { bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled; uint16_t protocol; int setup_size, kernel_size, cmdline_size; - int dtb_size, setup_data_offset; + int dtb_size, setup_data_item_len, setup_data_total_len = 0; uint32_t initrd_max; - uint8_t header[8192], *setup, *kernel; - hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0; + uint8_t header[8192], *setup, *kernel, *setup_datas = NULL; + hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0, setup_data_base; FILE *f; char *vmode; MachineState *machine = MACHINE(x86ms); struct setup_data *setup_data; const char *kernel_filename = machine->kernel_filename; const char *initrd_filename = machine->initrd_filename; const char *dtb_filename = machine->dtb; const char *kernel_cmdline = machine->kernel_cmdline; SevKernelLoaderContext sev_load_ctx = {}; enum { RNG_SEED_LENGTH = 32 }; /* Align to 16 bytes as a paranoia measure */ cmdline_size = (strlen(kernel_cmdline) + 16) & ~15; /* load the kernel header */ f = fopen(kernel_filename, "rb"); @@ -886,32 +886,33 @@ void x86_load_linux(X86MachineState *x86ms, if (protocol < 0x200 || !(header[0x211] & 0x01)) { /* Low kernel */ real_addr = 0x90000; cmdline_addr = 0x9a000 - cmdline_size; prot_addr = 0x10000; } else if (protocol < 0x202) { /* High but ancient kernel */ real_addr = 0x90000; cmdline_addr = 0x9a000 - cmdline_size; prot_addr = 0x100000; } else { /* High and recent kernel */ real_addr = 0x10000; cmdline_addr = 0x20000; prot_addr = 0x100000; } + setup_data_base = real_addr + 0x8000; /* highest address for loading the initrd */ if (protocol >= 0x20c && lduw_p(header + 0x236) & XLF_CAN_BE_LOADED_ABOVE_4G) { /* * Linux has supported initrd up to 4 GB for a very long time (2007, * long before XLF_CAN_BE_LOADED_ABOVE_4G which was added in 2013), * though it only sets initrd_max to 2 GB to "work around bootloader * bugs". Luckily, QEMU firmware(which does something like bootloader) * has supported this. * * It's believed that if XLF_CAN_BE_LOADED_ABOVE_4G is set, initrd can * be loaded into any address. * * In addition, initrd_max is uint32_t simply because QEMU doesn't * support the 64-bit boot protocol (specifically the ext_ramdisk_image @@ -1049,60 +1050,61 @@ void x86_load_linux(X86MachineState *x86ms, fclose(f); /* append dtb to kernel */ if (dtb_filename) { if (protocol < 0x209) { fprintf(stderr, "qemu: Linux kernel too old to load a dtb\n"); exit(1); } dtb_size = get_image_size(dtb_filename); if (dtb_size <= 0) { fprintf(stderr, "qemu: error reading dtb %s: %s\n", dtb_filename, strerror(errno)); exit(1); } - setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16); - kernel_size = setup_data_offset + sizeof(struct setup_data) + dtb_size; - kernel = g_realloc(kernel, kernel_size); - - - setup_data = (struct setup_data *)(kernel + setup_data_offset); + setup_data_item_len = sizeof(struct setup_data) + dtb_size; + setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len); + setup_data = (struct setup_data *)(setup_datas + setup_data_total_len); setup_data->next = cpu_to_le64(first_setup_data); - first_setup_data = prot_addr + setup_data_offset; + first_setup_data = setup_data_base + setup_data_total_len; + setup_data_total_len += setup_data_item_len; setup_data->type = cpu_to_le32(SETUP_DTB); setup_data->len = cpu_to_le32(dtb_size); - load_image_size(dtb_filename, setup_data->data, dtb_size); } if (!legacy_no_rng_seed) { - setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16); - kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH; - kernel = g_realloc(kernel, kernel_size); - setup_data = (struct setup_data *)(kernel + setup_data_offset); + setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH; + setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len); + setup_data = (struct setup_data *)(setup_datas + setup_data_total_len); setup_data->next = cpu_to_le64(first_setup_data); - first_setup_data = prot_addr + setup_data_offset; + first_setup_data = setup_data_base + setup_data_total_len; + setup_data_total_len += setup_data_item_len; setup_data->type = cpu_to_le32(SETUP_RNG_SEED); setup_data->len = cpu_to_le32(RNG_SEED_LENGTH); qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH); } - /* Offset 0x250 is a pointer to the first setup_data link. */ - stq_p(header + 0x250, first_setup_data); + if (first_setup_data) { + /* Offset 0x250 is a pointer to the first setup_data link. */ + stq_p(header + 0x250, first_setup_data); + rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len, + setup_data_base, NULL, NULL, NULL, NULL, false); + } /* * If we're starting an encrypted VM, it will be OVMF based, which uses the * efi stub for booting and doesn't require any values to be placed in the * kernel header. We therefore don't update the header so the hash of the * kernel on the other side of the fw_cfg interface matches the hash of the * file the user passed in. */ if (!sev_enabled()) { memcpy(setup, header, MIN(sizeof(header), setup_size)); } fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr); fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size); fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size); sev_load_ctx.kernel_data = (char *)kernel;
The boot parameter header refers to setup_data at an absolute address, and each setup_data refers to the next setup_data at an absolute address too. Currently QEMU simply puts the setup_datas right after the kernel image, and since the kernel_image is loaded at prot_addr -- a fixed address knowable to QEMU apriori -- the setup_data absolute address winds up being just `prot_addr + a_fixed_offset_into_kernel_image`. This mostly works fine, so long as the kernel image really is loaded at prot_addr. However, OVMF doesn't load the kernel at prot_addr, and generally EFI doesn't give a good way of predicting where it's going to load the kernel. So when it loads it at some address != prot_addr, the absolute addresses in setup_data now point somewhere bogus, causing crashes when EFI stub tries to follow the next link. Fix this by placing setup_data at some fixed place in memory, relative to real_addr, not as part of the kernel image, and then pointing the setup_data absolute address to that fixed place in memory. This way, even if OVMF or other chains relocate the kernel image, the boot parameter still points to the correct absolute address. Fixes: 3cbeb52467 ("hw/i386: add device tree support") Reported-by: Xiaoyao Li <xiaoyao.li@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Richard Henderson <richard.henderson@linaro.org> Cc: Peter Maydell <peter.maydell@linaro.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Daniel P. Berrangé <berrange@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: linux-efi@vger.kernel.org Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> --- hw/i386/x86.c | 38 ++++++++++++++++++++------------------ 1 file changed, 20 insertions(+), 18 deletions(-)