[v2] hw/i386: place setup_data at fixed place in memory

Message ID	20220804004411.1343158-1-Jason@zx2c4.com
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> From: "Jason A. Donenfeld" <Jason@zx2c4.com> To: qemu-devel@nongnu.org Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>, Xiaoyao Li <xiaoyao.li@intel.com>, Paolo Bonzini <pbonzini@redhat.com>, Richard Henderson <richard.henderson@linaro.org>, Peter Maydell <peter.maydell@linaro.org>, "Michael S . Tsirkin" <mst@redhat.com>, =?utf-8?q?Daniel_P_=2E_Berrang?= =?utf-8?q?=C3=A9?= <berrange@redhat.com>, Gerd Hoffmann <kraxel@redhat.com>, Ard Biesheuvel <ardb@kernel.org>, linux-efi@vger.kernel.org Subject: [PATCH v2] hw/i386: place setup_data at fixed place in memory Date: Thu, 4 Aug 2022 02:44:11 +0200 Message-Id: <20220804004411.1343158-1-Jason@zx2c4.com> In-Reply-To: <YusVVLNbLgsk49PK@zx2c4.com> References: <YusVVLNbLgsk49PK@zx2c4.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Received-SPF: pass client-ip=145.40.68.75; envelope-from=SRS0=nI4E=YI=zx2c4.com=Jason@kernel.org; helo=ams.source.kernel.org X-Spam_score_int: -17 X-Spam_score: -1.8 X-Spam_bar: - X-Spam_report: (-1.8 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HEADER_FROM_DIFFERENT_DOMAINS=0.249, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=no autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
Series	[v2] hw/i386: place setup_data at fixed place in memory \| expand [v2] hw/i386: place setup_data at fixed place in memory

Jason A. Donenfeld Aug. 4, 2022, 12:44 a.m. UTC

The boot parameter header refers to setup_data at an absolute address,
and each setup_data refers to the next setup_data at an absolute address
too. Currently QEMU simply puts the setup_datas right after the kernel
image, and since the kernel_image is loaded at prot_addr -- a fixed
address knowable to QEMU apriori -- the setup_data absolute address
winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.

This mostly works fine, so long as the kernel image really is loaded at
prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
generally EFI doesn't give a good way of predicting where it's going to
load the kernel. So when it loads it at some address != prot_addr, the
absolute addresses in setup_data now point somewhere bogus, causing
crashes when EFI stub tries to follow the next link.

Fix this by placing setup_data at some fixed place in memory, relative
to real_addr, not as part of the kernel image, and then pointing the
setup_data absolute address to that fixed place in memory. This way,
even if OVMF or other chains relocate the kernel image, the boot
parameter still points to the correct absolute address.

Fixes: 3cbeb52467 ("hw/i386: add device tree support")
Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Daniel P. Berrangé <berrange@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: linux-efi@vger.kernel.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 hw/i386/x86.c | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

Michael S. Tsirkin Aug. 4, 2022, 7:03 a.m. UTC | #1

On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> The boot parameter header refers to setup_data at an absolute address,
> and each setup_data refers to the next setup_data at an absolute address
> too. Currently QEMU simply puts the setup_datas right after the kernel
> image, and since the kernel_image is loaded at prot_addr -- a fixed
> address knowable to QEMU apriori -- the setup_data absolute address
> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> 
> This mostly works fine, so long as the kernel image really is loaded at
> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> generally EFI doesn't give a good way of predicting where it's going to
> load the kernel. So when it loads it at some address != prot_addr, the
> absolute addresses in setup_data now point somewhere bogus, causing
> crashes when EFI stub tries to follow the next link.
> 
> Fix this by placing setup_data at some fixed place in memory, relative
> to real_addr, not as part of the kernel image, and then pointing the
> setup_data absolute address to that fixed place in memory. This way,
> even if OVMF or other chains relocate the kernel image, the boot
> parameter still points to the correct absolute address.
> 
> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: linux-efi@vger.kernel.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

Didn't read the patch yet.
Adding Laszlo.

> ---
>  hw/i386/x86.c | 38 ++++++++++++++++++++------------------
>  1 file changed, 20 insertions(+), 18 deletions(-)
> 
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index 050eedc0c8..8b853abf38 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c
> @@ -760,36 +760,36 @@ static bool load_elfboot(const char *kernel_filename,
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ENTRY, pvh_start_addr);
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, mh_load_addr);
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, elf_kernel_size);
>  
>      return true;
>  }
>  
>  void x86_load_linux(X86MachineState *x86ms,
>                      FWCfgState *fw_cfg,
>                      int acpi_data_size,
>                      bool pvh_enabled,
>                      bool legacy_no_rng_seed)
>  {
>      bool linuxboot_dma_enabled = X86_MACHINE_GET_CLASS(x86ms)->fwcfg_dma_enabled;
>      uint16_t protocol;
>      int setup_size, kernel_size, cmdline_size;
> -    int dtb_size, setup_data_offset;
> +    int dtb_size, setup_data_item_len, setup_data_total_len = 0;
>      uint32_t initrd_max;
> -    uint8_t header[8192], *setup, *kernel;
> -    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0;
> +    uint8_t header[8192], *setup, *kernel, *setup_datas = NULL;
> +    hwaddr real_addr, prot_addr, cmdline_addr, initrd_addr = 0, first_setup_data = 0, setup_data_base;
>      FILE *f;
>      char *vmode;
>      MachineState *machine = MACHINE(x86ms);
>      struct setup_data *setup_data;
>      const char *kernel_filename = machine->kernel_filename;
>      const char *initrd_filename = machine->initrd_filename;
>      const char *dtb_filename = machine->dtb;
>      const char *kernel_cmdline = machine->kernel_cmdline;
>      SevKernelLoaderContext sev_load_ctx = {};
>      enum { RNG_SEED_LENGTH = 32 };
>  
>      /* Align to 16 bytes as a paranoia measure */
>      cmdline_size = (strlen(kernel_cmdline) + 16) & ~15;
>  
>      /* load the kernel header */
>      f = fopen(kernel_filename, "rb");
> @@ -886,32 +886,33 @@ void x86_load_linux(X86MachineState *x86ms,
>      if (protocol < 0x200 || !(header[0x211] & 0x01)) {
>          /* Low kernel */
>          real_addr    = 0x90000;
>          cmdline_addr = 0x9a000 - cmdline_size;
>          prot_addr    = 0x10000;
>      } else if (protocol < 0x202) {
>          /* High but ancient kernel */
>          real_addr    = 0x90000;
>          cmdline_addr = 0x9a000 - cmdline_size;
>          prot_addr    = 0x100000;
>      } else {
>          /* High and recent kernel */
>          real_addr    = 0x10000;
>          cmdline_addr = 0x20000;
>          prot_addr    = 0x100000;
>      }
> +    setup_data_base = real_addr + 0x8000;
>  
>      /* highest address for loading the initrd */
>      if (protocol >= 0x20c &&
>          lduw_p(header + 0x236) & XLF_CAN_BE_LOADED_ABOVE_4G) {
>          /*
>           * Linux has supported initrd up to 4 GB for a very long time (2007,
>           * long before XLF_CAN_BE_LOADED_ABOVE_4G which was added in 2013),
>           * though it only sets initrd_max to 2 GB to "work around bootloader
>           * bugs". Luckily, QEMU firmware(which does something like bootloader)
>           * has supported this.
>           *
>           * It's believed that if XLF_CAN_BE_LOADED_ABOVE_4G is set, initrd can
>           * be loaded into any address.
>           *
>           * In addition, initrd_max is uint32_t simply because QEMU doesn't
>           * support the 64-bit boot protocol (specifically the ext_ramdisk_image
> @@ -1049,60 +1050,61 @@ void x86_load_linux(X86MachineState *x86ms,
>      fclose(f);
>  
>      /* append dtb to kernel */
>      if (dtb_filename) {
>          if (protocol < 0x209) {
>              fprintf(stderr, "qemu: Linux kernel too old to load a dtb\n");
>              exit(1);
>          }
>  
>          dtb_size = get_image_size(dtb_filename);
>          if (dtb_size <= 0) {
>              fprintf(stderr, "qemu: error reading dtb %s: %s\n",
>                      dtb_filename, strerror(errno));
>              exit(1);
>          }
>  
> -        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
> -        kernel_size = setup_data_offset + sizeof(struct setup_data) + dtb_size;
> -        kernel = g_realloc(kernel, kernel_size);
> -
> -
> -        setup_data = (struct setup_data *)(kernel + setup_data_offset);
> +        setup_data_item_len = sizeof(struct setup_data) + dtb_size;
> +        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
> +        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
>          setup_data->next = cpu_to_le64(first_setup_data);
> -        first_setup_data = prot_addr + setup_data_offset;
> +        first_setup_data = setup_data_base + setup_data_total_len;
> +        setup_data_total_len += setup_data_item_len;
>          setup_data->type = cpu_to_le32(SETUP_DTB);
>          setup_data->len = cpu_to_le32(dtb_size);
> -
>          load_image_size(dtb_filename, setup_data->data, dtb_size);
>      }
>  
>      if (!legacy_no_rng_seed) {
> -        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
> -        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
> -        kernel = g_realloc(kernel, kernel_size);
> -        setup_data = (struct setup_data *)(kernel + setup_data_offset);
> +        setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH;
> +        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
> +        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
>          setup_data->next = cpu_to_le64(first_setup_data);
> -        first_setup_data = prot_addr + setup_data_offset;
> +        first_setup_data = setup_data_base + setup_data_total_len;
> +        setup_data_total_len += setup_data_item_len;
>          setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
>          setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
>          qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
>      }
>  
> -    /* Offset 0x250 is a pointer to the first setup_data link. */
> -    stq_p(header + 0x250, first_setup_data);
> +    if (first_setup_data) {
> +            /* Offset 0x250 is a pointer to the first setup_data link. */
> +            stq_p(header + 0x250, first_setup_data);
> +            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
> +                         setup_data_base, NULL, NULL, NULL, NULL, false);
> +    }
>  
>      /*
>       * If we're starting an encrypted VM, it will be OVMF based, which uses the
>       * efi stub for booting and doesn't require any values to be placed in the
>       * kernel header.  We therefore don't update the header so the hash of the
>       * kernel on the other side of the fw_cfg interface matches the hash of the
>       * file the user passed in.
>       */
>      if (!sev_enabled()) {
>          memcpy(setup, header, MIN(sizeof(header), setup_size));
>      }
>  
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
>      fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
>      sev_load_ctx.kernel_data = (char *)kernel;
> -- 
> 2.35.1

Laszlo Ersek Aug. 4, 2022, 8:58 a.m. UTC | #2

On 08/04/22 09:03, Michael S. Tsirkin wrote:
> On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
>> The boot parameter header refers to setup_data at an absolute address,
>> and each setup_data refers to the next setup_data at an absolute address
>> too. Currently QEMU simply puts the setup_datas right after the kernel
>> image, and since the kernel_image is loaded at prot_addr -- a fixed
>> address knowable to QEMU apriori -- the setup_data absolute address
>> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
>>
>> This mostly works fine, so long as the kernel image really is loaded at
>> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
>> generally EFI doesn't give a good way of predicting where it's going to
>> load the kernel. So when it loads it at some address != prot_addr, the
>> absolute addresses in setup_data now point somewhere bogus, causing
>> crashes when EFI stub tries to follow the next link.
>>
>> Fix this by placing setup_data at some fixed place in memory, relative
>> to real_addr, not as part of the kernel image, and then pointing the
>> setup_data absolute address to that fixed place in memory. This way,
>> even if OVMF or other chains relocate the kernel image, the boot
>> parameter still points to the correct absolute address.
>>
>> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
>> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>> Cc: Richard Henderson <richard.henderson@linaro.org>
>> Cc: Peter Maydell <peter.maydell@linaro.org>
>> Cc: Michael S. Tsirkin <mst@redhat.com>
>> Cc: Daniel P. Berrangé <berrange@redhat.com>
>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>> Cc: Ard Biesheuvel <ardb@kernel.org>
>> Cc: linux-efi@vger.kernel.org
>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> 
> Didn't read the patch yet.
> Adding Laszlo.

As I said in
<http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>,
I don't believe that the setup_data chaining described in
<https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work
for UEFI guests at all, with QEMU pre-populating the links with GPAs.

However, rather than introducing a new info channel, or reusing an
existent one (ACPI linker/loader, GUID-ed structure chaining in pflash),
for the sake of this feature, I suggest simply disabling this feature
for UEFI guests. setup_data chaining has not been necessary for UEFI
guests for years (this is the first time I've heard about it in more
than a decade), and the particular use case (provide guests with good
random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL.

... Now, here's my problem: microvm, and Xen.

As far as I can tell, QEMU can determine (it already does determine)
whether the guest uses UEFI or not, for the "pc" and "q35" machine
types. But not for microvm or Xen!

Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can
derive that

  pflash_cfi01_get_blk(pcms->flash[0])

returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this
is only valid after the inital part of pc_system_firmware_init() has run
("Map legacy -drive if=pflash to machine properties"), but that is not a
problem, given the following call tree:

  pc_memory_init()            [hw/i386/pc.c]
    pc_system_firmware_init() [hw/i386/pc_sysfw.c]
    x86_load_linux()          [hw/i386/x86.c]

That is, when we call x86_load_linux() from pc_memory_init(), we can set
the last argument of x86_load_linux() from *both* the compat knob *and*
pflash_cfi01_get_blk(pcms->flash[0]).

Unluckily however, we have two more x86_load_linux() calls where this
does not work.

The first is xen_load_linux() [hw/i386/pc.c], which is used for "xen HVM
direct kernel boot" [hw/i386/pc_piix.c] -- there we certainly don't use
flash. I don't know if Xen HVM direct kernel boot is possible with UEFI.
If so, we have a problem, if UEFI is out of the question there, then
enabling setup_data passing is fine.

The other problematic x86_load_linux() call is in MicroVM --
microvm_memory_init() [hw/i386/microvm.c]. This is a real problem
unfortunately, as MicroVM can be used with SeaBIOS, QBoot, and even
OVMF, *but* even in the last case, it runs OVMF *without* flash (just
from ROM). So not only is MicroVM not a descendant of the PC Machine
Class (so it has no access to PC Machine State either -- such as
pcms->flash), it never *uses* flash in fact.

Which is a big problem for my idea, because QEMU has no way of
identifying whether microvm is going to boot a custom SeaBIOS binary
(where the current setup_data chaining is OK) or a custom OVMF binary
(where the current setup_data chaining crashes the guest kernel).

So I thought that for pc and q35, I should be able to propose a fix,
based on:

  pflash_cfi01_get_blk(pcms->flash[0])

but it turns out I don't know what to do about Xen; and worse, for
MicroVM, it's currently *impossible* for QEMU to tell apart UEFI from
other guest firmwares.

For now I suggest either reverting the original patch, or at least not
enabling the knob by default for any machine types. In particular, when
using MicroVM, the user must leave the knob disabled when direct booting
a kernel on OVMF, and the user may or may not enable the knob when
direct booting a kernel on SeaBIOS.

Thanks
Laszlo

Daniel P. Berrangé Aug. 4, 2022, 9:25 a.m. UTC | #3

On Thu, Aug 04, 2022 at 10:58:36AM +0200, Laszlo Ersek wrote:
> On 08/04/22 09:03, Michael S. Tsirkin wrote:
> > On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> >> The boot parameter header refers to setup_data at an absolute address,
> >> and each setup_data refers to the next setup_data at an absolute address
> >> too. Currently QEMU simply puts the setup_datas right after the kernel
> >> image, and since the kernel_image is loaded at prot_addr -- a fixed
> >> address knowable to QEMU apriori -- the setup_data absolute address
> >> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> >>
> >> This mostly works fine, so long as the kernel image really is loaded at
> >> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> >> generally EFI doesn't give a good way of predicting where it's going to
> >> load the kernel. So when it loads it at some address != prot_addr, the
> >> absolute addresses in setup_data now point somewhere bogus, causing
> >> crashes when EFI stub tries to follow the next link.
> >>
> >> Fix this by placing setup_data at some fixed place in memory, relative
> >> to real_addr, not as part of the kernel image, and then pointing the
> >> setup_data absolute address to that fixed place in memory. This way,
> >> even if OVMF or other chains relocate the kernel image, the boot
> >> parameter still points to the correct absolute address.
> >>
> >> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> >> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> >> Cc: Paolo Bonzini <pbonzini@redhat.com>
> >> Cc: Richard Henderson <richard.henderson@linaro.org>
> >> Cc: Peter Maydell <peter.maydell@linaro.org>
> >> Cc: Michael S. Tsirkin <mst@redhat.com>
> >> Cc: Daniel P. Berrangé <berrange@redhat.com>
> >> Cc: Gerd Hoffmann <kraxel@redhat.com>
> >> Cc: Ard Biesheuvel <ardb@kernel.org>
> >> Cc: linux-efi@vger.kernel.org
> >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> > 
> > Didn't read the patch yet.
> > Adding Laszlo.
> 
> As I said in
> <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>,
> I don't believe that the setup_data chaining described in
> <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work
> for UEFI guests at all, with QEMU pre-populating the links with GPAs.
> 
> However, rather than introducing a new info channel, or reusing an
> existent one (ACPI linker/loader, GUID-ed structure chaining in pflash),
> for the sake of this feature, I suggest simply disabling this feature
> for UEFI guests. setup_data chaining has not been necessary for UEFI
> guests for years (this is the first time I've heard about it in more
> than a decade), and the particular use case (provide guests with good
> random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL.
> 
> ... Now, here's my problem: microvm, and Xen.
> 
> As far as I can tell, QEMU can determine (it already does determine)
> whether the guest uses UEFI or not, for the "pc" and "q35" machine
> types. But not for microvm or Xen!
> 
> Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can
> derive that
> 
>   pflash_cfi01_get_blk(pcms->flash[0])
> 
> returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this
> is only valid after the inital part of pc_system_firmware_init() has run
> ("Map legacy -drive if=pflash to machine properties"), but that is not a
> problem, given the following call tree:

I don't beleve that's a valid check anymore since Gerd introduced the
ability to load UEFI via -bios, for UEFI builds without persistent
variables. ( a8152c4e4613c70c2f0573a82babbc8acc00cf90 )


> Which is a big problem for my idea, because QEMU has no way of
> identifying whether microvm is going to boot a custom SeaBIOS binary
> (where the current setup_data chaining is OK) or a custom OVMF binary
> (where the current setup_data chaining crashes the guest kernel).
> 
> So I thought that for pc and q35, I should be able to propose a fix,
> based on:
> 
>   pflash_cfi01_get_blk(pcms->flash[0])
> 
> but it turns out I don't know what to do about Xen; and worse, for
> MicroVM, it's currently *impossible* for QEMU to tell apart UEFI from
> other guest firmwares.

Yep, and ultimately the inability to distinguish UEFI vs other firmware
is arguably correct by design, as the QEMU <-> firmware interface is
supposed to be arbitrarily pluggable for any firmware implementation
not  limited to merely UEFI + seabios.

> For now I suggest either reverting the original patch, or at least not
> enabling the knob by default for any machine types. In particular, when
> using MicroVM, the user must leave the knob disabled when direct booting
> a kernel on OVMF, and the user may or may not enable the knob when
> direct booting a kernel on SeaBIOS.

Having it opt-in via a knob would defeat Jason's goal of having the seed
available automatically.



With regards,
Daniel

Ard Biesheuvel Aug. 4, 2022, 10:26 a.m. UTC | #4

On Thu, 4 Aug 2022 at 11:25, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Aug 04, 2022 at 10:58:36AM +0200, Laszlo Ersek wrote:
> > On 08/04/22 09:03, Michael S. Tsirkin wrote:
> > > On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> > >> The boot parameter header refers to setup_data at an absolute address,
> > >> and each setup_data refers to the next setup_data at an absolute address
> > >> too. Currently QEMU simply puts the setup_datas right after the kernel
> > >> image, and since the kernel_image is loaded at prot_addr -- a fixed
> > >> address knowable to QEMU apriori -- the setup_data absolute address
> > >> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> > >>
> > >> This mostly works fine, so long as the kernel image really is loaded at
> > >> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> > >> generally EFI doesn't give a good way of predicting where it's going to
> > >> load the kernel. So when it loads it at some address != prot_addr, the
> > >> absolute addresses in setup_data now point somewhere bogus, causing
> > >> crashes when EFI stub tries to follow the next link.
> > >>
> > >> Fix this by placing setup_data at some fixed place in memory, relative
> > >> to real_addr, not as part of the kernel image, and then pointing the
> > >> setup_data absolute address to that fixed place in memory. This way,
> > >> even if OVMF or other chains relocate the kernel image, the boot
> > >> parameter still points to the correct absolute address.
> > >>
> > >> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> > >> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > >> Cc: Paolo Bonzini <pbonzini@redhat.com>
> > >> Cc: Richard Henderson <richard.henderson@linaro.org>
> > >> Cc: Peter Maydell <peter.maydell@linaro.org>
> > >> Cc: Michael S. Tsirkin <mst@redhat.com>
> > >> Cc: Daniel P. Berrangé <berrange@redhat.com>
> > >> Cc: Gerd Hoffmann <kraxel@redhat.com>
> > >> Cc: Ard Biesheuvel <ardb@kernel.org>
> > >> Cc: linux-efi@vger.kernel.org
> > >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> > >
> > > Didn't read the patch yet.
> > > Adding Laszlo.
> >
> > As I said in
> > <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>,
> > I don't believe that the setup_data chaining described in
> > <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work
> > for UEFI guests at all, with QEMU pre-populating the links with GPAs.
> >
> > However, rather than introducing a new info channel, or reusing an
> > existent one (ACPI linker/loader, GUID-ed structure chaining in pflash),
> > for the sake of this feature, I suggest simply disabling this feature
> > for UEFI guests. setup_data chaining has not been necessary for UEFI
> > guests for years (this is the first time I've heard about it in more
> > than a decade), and the particular use case (provide guests with good
> > random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL.
> >
> > ... Now, here's my problem: microvm, and Xen.
> >
> > As far as I can tell, QEMU can determine (it already does determine)
> > whether the guest uses UEFI or not, for the "pc" and "q35" machine
> > types. But not for microvm or Xen!
> >
> > Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can
> > derive that
> >
> >   pflash_cfi01_get_blk(pcms->flash[0])
> >
> > returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this
> > is only valid after the inital part of pc_system_firmware_init() has run
> > ("Map legacy -drive if=pflash to machine properties"), but that is not a
> > problem, given the following call tree:
>
> I don't beleve that's a valid check anymore since Gerd introduced the
> ability to load UEFI via -bios, for UEFI builds without persistent
> variables. ( a8152c4e4613c70c2f0573a82babbc8acc00cf90 )
>

I think there is a fundamental flaw in the QEMU logic where it adds
setup_data nodes to the *file* representation of the kernel image.

IOW, loading the kernei image at address x, creating setup_data nodes
in memory relative to x and then invoking the kernel image directly
(as kexec does as well, AIUI) is perfectly fine.

Managing a file system abstraction and a generic interface to load its
contents, and using it to load the kernel image anywhere in memory is
also fine, and OVMF -kernel relies on this.

It is the combination of the two that explodes, unsurprisingly. Making
inferences about which kind of firmware is invoking the file loading
interface, and gating this behavior accordingly just papers over the
problem.

Jason and I have been discussing this over IRC, and there are
essentially 3 places we could address this:
1) in the Linux/x86 EFI stub, which has a 'pure EFI' entrypoint and a
'EFI handover protocol' entrypoint, and we could simply wipe the
setup_data pointer in the former;
2) in OVMF's kernel loader, where we could 'fix up' the setup_data field
3) in QEMU, which [as I laid out above] does something it shouldn't be doing.

I strongly object to 2), as it would involve teaching OVMF's 'pure
EFI' boot path about the intricacies of struct boot_params etc, which
defeats the purpose of having a generic EFI boot path (Note that much
of my work on the Linux/EFI subsystem over the past years has been
focused on getting rid of all the arch-specific hacks and layering
violations and making as much of the EFI code in Linux arch-agnostic
and adhere to EFI principles and APIs)

Given that neither SETUP_DTB nor SETUP_RNG_SEED are particularly
relevant for pure EFI boot, clearing setup_data in the pure EFI
entrypoint in the EFI stub is not unreasonable, but not very future
proof, given that it limits how we might use setup_data for other
purposes in the future (although one might argue we could just drop
code again from the stub when new uses of setup_data are introduced
into the kernel)

However, our conclusion was that it is really QEMU that is doing
something strange here, so IMHO, fixing QEMU is really the most
suitable approach.

Laszlo Ersek Aug. 4, 2022, 11:31 a.m. UTC | #5

On 08/04/22 12:26, Ard Biesheuvel wrote:
> On Thu, 4 Aug 2022 at 11:25, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>
>> On Thu, Aug 04, 2022 at 10:58:36AM +0200, Laszlo Ersek wrote:
>>> On 08/04/22 09:03, Michael S. Tsirkin wrote:
>>>> On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
>>>>> The boot parameter header refers to setup_data at an absolute address,
>>>>> and each setup_data refers to the next setup_data at an absolute address
>>>>> too. Currently QEMU simply puts the setup_datas right after the kernel
>>>>> image, and since the kernel_image is loaded at prot_addr -- a fixed
>>>>> address knowable to QEMU apriori -- the setup_data absolute address
>>>>> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
>>>>>
>>>>> This mostly works fine, so long as the kernel image really is loaded at
>>>>> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
>>>>> generally EFI doesn't give a good way of predicting where it's going to
>>>>> load the kernel. So when it loads it at some address != prot_addr, the
>>>>> absolute addresses in setup_data now point somewhere bogus, causing
>>>>> crashes when EFI stub tries to follow the next link.
>>>>>
>>>>> Fix this by placing setup_data at some fixed place in memory, relative
>>>>> to real_addr, not as part of the kernel image, and then pointing the
>>>>> setup_data absolute address to that fixed place in memory. This way,
>>>>> even if OVMF or other chains relocate the kernel image, the boot
>>>>> parameter still points to the correct absolute address.
>>>>>
>>>>> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
>>>>> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>>> Cc: Richard Henderson <richard.henderson@linaro.org>
>>>>> Cc: Peter Maydell <peter.maydell@linaro.org>
>>>>> Cc: Michael S. Tsirkin <mst@redhat.com>
>>>>> Cc: Daniel P. Berrangé <berrange@redhat.com>
>>>>> Cc: Gerd Hoffmann <kraxel@redhat.com>
>>>>> Cc: Ard Biesheuvel <ardb@kernel.org>
>>>>> Cc: linux-efi@vger.kernel.org
>>>>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>>>>
>>>> Didn't read the patch yet.
>>>> Adding Laszlo.
>>>
>>> As I said in
>>> <http://mid.mail-archive.com/8bcc7826-91ab-855e-7151-2e9add88025a@redhat.com>,
>>> I don't believe that the setup_data chaining described in
>>> <https://www.kernel.org/doc/Documentation/x86/boot.rst> can be made work
>>> for UEFI guests at all, with QEMU pre-populating the links with GPAs.
>>>
>>> However, rather than introducing a new info channel, or reusing an
>>> existent one (ACPI linker/loader, GUID-ed structure chaining in pflash),
>>> for the sake of this feature, I suggest simply disabling this feature
>>> for UEFI guests. setup_data chaining has not been necessary for UEFI
>>> guests for years (this is the first time I've heard about it in more
>>> than a decade), and the particular use case (provide guests with good
>>> random seed) is solved for UEFI guests via virtio-rng / EFI_RNG_PROTOCOL.
>>>
>>> ... Now, here's my problem: microvm, and Xen.
>>>
>>> As far as I can tell, QEMU can determine (it already does determine)
>>> whether the guest uses UEFI or not, for the "pc" and "q35" machine
>>> types. But not for microvm or Xen!
>>>
>>> Namely, from pc_system_firmware_init() [hw/i386/pc_sysfw.c], we can
>>> derive that
>>>
>>>   pflash_cfi01_get_blk(pcms->flash[0])
>>>
>>> returning NULL vs. non-NULL stands for "BIOS vs. UEFI". Note that this
>>> is only valid after the inital part of pc_system_firmware_init() has run
>>> ("Map legacy -drive if=pflash to machine properties"), but that is not a
>>> problem, given the following call tree:
>>
>> I don't beleve that's a valid check anymore since Gerd introduced the
>> ability to load UEFI via -bios, for UEFI builds without persistent
>> variables. ( a8152c4e4613c70c2f0573a82babbc8acc00cf90 )
>>
> 
> I think there is a fundamental flaw in the QEMU logic where it adds
> setup_data nodes to the *file* representation of the kernel image.
> 
> IOW, loading the kernei image at address x, creating setup_data nodes
> in memory relative to x and then invoking the kernel image directly
> (as kexec does as well, AIUI) is perfectly fine.
> 
> Managing a file system abstraction and a generic interface to load its
> contents, and using it to load the kernel image anywhere in memory is
> also fine, and OVMF -kernel relies on this.
> 
> It is the combination of the two that explodes, unsurprisingly. Making
> inferences about which kind of firmware is invoking the file loading
> interface, and gating this behavior accordingly just papers over the
> problem.
> 
> Jason and I have been discussing this over IRC, and there are
> essentially 3 places we could address this:
> 1) in the Linux/x86 EFI stub, which has a 'pure EFI' entrypoint and a
> 'EFI handover protocol' entrypoint, and we could simply wipe the
> setup_data pointer in the former;
> 2) in OVMF's kernel loader, where we could 'fix up' the setup_data field
> 3) in QEMU, which [as I laid out above] does something it shouldn't be doing.
> 
> I strongly object to 2), as it would involve teaching OVMF's 'pure
> EFI' boot path about the intricacies of struct boot_params etc, which
> defeats the purpose of having a generic EFI boot path (Note that much
> of my work on the Linux/EFI subsystem over the past years has been
> focused on getting rid of all the arch-specific hacks and layering
> violations and making as much of the EFI code in Linux arch-agnostic
> and adhere to EFI principles and APIs)
> 
> Given that neither SETUP_DTB nor SETUP_RNG_SEED are particularly
> relevant for pure EFI boot, clearing setup_data in the pure EFI
> entrypoint in the EFI stub is not unreasonable, but not very future
> proof, given that it limits how we might use setup_data for other
> purposes in the future (although one might argue we could just drop
> code again from the stub when new uses of setup_data are introduced
> into the kernel)
> 
> However, our conclusion was that it is really QEMU that is doing
> something strange here, so IMHO, fixing QEMU is really the most
> suitable approach.
> 

I'm not sure if this can be solved arch-independently and
firmware-independently.

ACPI and SMBIOS look "independent" of arch and firmware if you squint,
but that's only because the specs describe the entry points for
different firmwares differently, thereby hiding the problem from the
rest of the world. Also because the various virtual firmwares receive
the ACPI and SMBIOS contents from the VMM(s) differently. However, ACPI
and SMBIOS are probably too late for the kernel to consume the random seed.

A different example for information that's needed really early is the
serial port's location (see "earlyprintk" e.g.). That gets passed on the
command line (and the usual values strongly differ between arm64 and
x864_64). But, I guess passing the random seed on the kernel cmdline is
considered insecure.

None of the existing info passing methods seem early enough, generic
enough, and secure enough (at the same time)...

Jason A. Donenfeld Aug. 4, 2022, 12:03 p.m. UTC | #6

Hi Daniel,

On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> Yep, and ultimately the inability to distinguish UEFI vs other firmware
> is arguably correct by design, as the QEMU <-> firmware interface is
> supposed to be arbitrarily pluggable for any firmware implementation
> not  limited to merely UEFI + seabios.

Indeed, I agree with this.

> 
> > For now I suggest either reverting the original patch, or at least not
> > enabling the knob by default for any machine types. In particular, when
> > using MicroVM, the user must leave the knob disabled when direct booting
> > a kernel on OVMF, and the user may or may not enable the knob when
> > direct booting a kernel on SeaBIOS.
> 
> Having it opt-in via a knob would defeat Jason's goal of having the seed
> available automatically.

Yes, adding a knob is absolutely out of the question.

It also doesn't actually solve the problem: this triggers when QEMU
passes a DTB too. It's not just for the new RNG seed thing. This bug
isn't new.

Jason

Jason A. Donenfeld Aug. 4, 2022, 12:11 p.m. UTC | #7

Hi Laszlo,

On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote:
> None of the existing info passing methods seem early enough, generic
> enough, and secure enough (at the same time)...

Can you look at the v2 patch? It seems to work on every configuration I
throw at it. Keep in mind that setup_data is only used very, very early.
I can think of a few other places to put it too, looking at the x86
memory map, that will survive long enough.

I think this might actually be a straightforwardly solvable problem if
you think about it more basically.

Jason

Daniel P. Berrangé Aug. 4, 2022, 12:11 p.m. UTC | #8

On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> Hi Daniel,
> 
> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> > Yep, and ultimately the inability to distinguish UEFI vs other firmware
> > is arguably correct by design, as the QEMU <-> firmware interface is
> > supposed to be arbitrarily pluggable for any firmware implementation
> > not  limited to merely UEFI + seabios.
> 
> Indeed, I agree with this.
> 
> > 
> > > For now I suggest either reverting the original patch, or at least not
> > > enabling the knob by default for any machine types. In particular, when
> > > using MicroVM, the user must leave the knob disabled when direct booting
> > > a kernel on OVMF, and the user may or may not enable the knob when
> > > direct booting a kernel on SeaBIOS.
> > 
> > Having it opt-in via a knob would defeat Jason's goal of having the seed
> > available automatically.
> 
> Yes, adding a knob is absolutely out of the question.
> 
> It also doesn't actually solve the problem: this triggers when QEMU
> passes a DTB too. It's not just for the new RNG seed thing. This bug
> isn't new.

In the other thread I also mentioned that this RNG Seed addition has
caused a bug with AMD SEV too, making boot measurement attestation
fail because the kernel blob passed to the firmware no longer matches
what the tenant expects, due to the injected seed.

With regards,
Daniel

Ard Biesheuvel Aug. 4, 2022, 12:16 p.m. UTC | #9

On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> > Hi Daniel,
> >
> > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> > > Yep, and ultimately the inability to distinguish UEFI vs other firmware
> > > is arguably correct by design, as the QEMU <-> firmware interface is
> > > supposed to be arbitrarily pluggable for any firmware implementation
> > > not  limited to merely UEFI + seabios.
> >
> > Indeed, I agree with this.
> >
> > >
> > > > For now I suggest either reverting the original patch, or at least not
> > > > enabling the knob by default for any machine types. In particular, when
> > > > using MicroVM, the user must leave the knob disabled when direct booting
> > > > a kernel on OVMF, and the user may or may not enable the knob when
> > > > direct booting a kernel on SeaBIOS.
> > >
> > > Having it opt-in via a knob would defeat Jason's goal of having the seed
> > > available automatically.
> >
> > Yes, adding a knob is absolutely out of the question.
> >
> > It also doesn't actually solve the problem: this triggers when QEMU
> > passes a DTB too. It's not just for the new RNG seed thing. This bug
> > isn't new.
>
> In the other thread I also mentioned that this RNG Seed addition has
> caused a bug with AMD SEV too, making boot measurement attestation
> fail because the kernel blob passed to the firmware no longer matches
> what the tenant expects, due to the injected seed.
>

I was actually expecting this to be an issue in the
signing/attestation department as well, and you just confirmed my
suspicion.

But does this mean that populating the setup_data pointer is out of
the question altogether? Or only that putting the setup_data linked
list nodes inside the image is a problem?

Jason A. Donenfeld Aug. 4, 2022, 12:17 p.m. UTC | #10

Hi Ard,

On Thu, Aug 4, 2022 at 2:16 PM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> > > Hi Daniel,
> > >
> > > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> > > > Yep, and ultimately the inability to distinguish UEFI vs other firmware
> > > > is arguably correct by design, as the QEMU <-> firmware interface is
> > > > supposed to be arbitrarily pluggable for any firmware implementation
> > > > not  limited to merely UEFI + seabios.
> > >
> > > Indeed, I agree with this.
> > >
> > > >
> > > > > For now I suggest either reverting the original patch, or at least not
> > > > > enabling the knob by default for any machine types. In particular, when
> > > > > using MicroVM, the user must leave the knob disabled when direct booting
> > > > > a kernel on OVMF, and the user may or may not enable the knob when
> > > > > direct booting a kernel on SeaBIOS.
> > > >
> > > > Having it opt-in via a knob would defeat Jason's goal of having the seed
> > > > available automatically.
> > >
> > > Yes, adding a knob is absolutely out of the question.
> > >
> > > It also doesn't actually solve the problem: this triggers when QEMU
> > > passes a DTB too. It's not just for the new RNG seed thing. This bug
> > > isn't new.
> >
> > In the other thread I also mentioned that this RNG Seed addition has
> > caused a bug with AMD SEV too, making boot measurement attestation
> > fail because the kernel blob passed to the firmware no longer matches
> > what the tenant expects, due to the injected seed.
> >
>
> I was actually expecting this to be an issue in the
> signing/attestation department as well, and you just confirmed my
> suspicion.
>
> But does this mean that populating the setup_data pointer is out of
> the question altogether? Or only that putting the setup_data linked
> list nodes inside the image is a problem?

If you look at the v2 patch, populating boot_param->setup_data winds
up being a fixed value. So even if that part was a problem (though I
don't think it is), it won't be with the v2 patch, since it's always
the same.

Jason

Jason A. Donenfeld Aug. 4, 2022, 12:28 p.m. UTC | #11

On Thu, Aug 4, 2022 at 2:17 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Ard,
>
> On Thu, Aug 4, 2022 at 2:16 PM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> > > > Hi Daniel,
> > > >
> > > > On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> > > > > Yep, and ultimately the inability to distinguish UEFI vs other firmware
> > > > > is arguably correct by design, as the QEMU <-> firmware interface is
> > > > > supposed to be arbitrarily pluggable for any firmware implementation
> > > > > not  limited to merely UEFI + seabios.
> > > >
> > > > Indeed, I agree with this.
> > > >
> > > > >
> > > > > > For now I suggest either reverting the original patch, or at least not
> > > > > > enabling the knob by default for any machine types. In particular, when
> > > > > > using MicroVM, the user must leave the knob disabled when direct booting
> > > > > > a kernel on OVMF, and the user may or may not enable the knob when
> > > > > > direct booting a kernel on SeaBIOS.
> > > > >
> > > > > Having it opt-in via a knob would defeat Jason's goal of having the seed
> > > > > available automatically.
> > > >
> > > > Yes, adding a knob is absolutely out of the question.
> > > >
> > > > It also doesn't actually solve the problem: this triggers when QEMU
> > > > passes a DTB too. It's not just for the new RNG seed thing. This bug
> > > > isn't new.
> > >
> > > In the other thread I also mentioned that this RNG Seed addition has
> > > caused a bug with AMD SEV too, making boot measurement attestation
> > > fail because the kernel blob passed to the firmware no longer matches
> > > what the tenant expects, due to the injected seed.
> > >
> >
> > I was actually expecting this to be an issue in the
> > signing/attestation department as well, and you just confirmed my
> > suspicion.
> >
> > But does this mean that populating the setup_data pointer is out of
> > the question altogether? Or only that putting the setup_data linked
> > list nodes inside the image is a problem?
>
> If you look at the v2 patch, populating boot_param->setup_data winds
> up being a fixed value. So even if that part was a problem (though I
> don't think it is), it won't be with the v2 patch, since it's always
> the same.

Actually, `setup` isn't even modified if SEV is being used anyway. So
really, the approach of this v2 -- of not modifying the kernel image
-- should fix that issue, no matter what.

Jason

Jason A. Donenfeld Aug. 4, 2022, 12:47 p.m. UTC | #12

On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Laszlo,
>
> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote:
> > None of the existing info passing methods seem early enough, generic
> > enough, and secure enough (at the same time)...
>
> Can you look at the v2 patch? It seems to work on every configuration I
> throw at it. Keep in mind that setup_data is only used very, very early.
> I can think of a few other places to put it too, looking at the x86
> memory map, that will survive long enough.
>
> I think this might actually be a straightforwardly solvable problem if
> you think about it more basically.

And just to put things in perspective here... We only need like 48
bytes or something at some easy fixed address. That's not much. That's
*got* to be a fairly tractable problem. If v2 has issues, I can't see
why there wouldn't be a different easy place to put a meger 48 bytes
of stuff that then is allowed to be wiped out after early boot.

Jason

Daniel P. Berrangé Aug. 4, 2022, 12:54 p.m. UTC | #13

On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> The boot parameter header refers to setup_data at an absolute address,
> and each setup_data refers to the next setup_data at an absolute address
> too. Currently QEMU simply puts the setup_datas right after the kernel
> image, and since the kernel_image is loaded at prot_addr -- a fixed
> address knowable to QEMU apriori -- the setup_data absolute address
> winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> 
> This mostly works fine, so long as the kernel image really is loaded at
> prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> generally EFI doesn't give a good way of predicting where it's going to
> load the kernel. So when it loads it at some address != prot_addr, the
> absolute addresses in setup_data now point somewhere bogus, causing
> crashes when EFI stub tries to follow the next link.
> 
> Fix this by placing setup_data at some fixed place in memory, relative
> to real_addr, not as part of the kernel image, and then pointing the
> setup_data absolute address to that fixed place in memory. This way,
> even if OVMF or other chains relocate the kernel image, the boot
> parameter still points to the correct absolute address.
> 
> Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Richard Henderson <richard.henderson@linaro.org>
> Cc: Peter Maydell <peter.maydell@linaro.org>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Ard Biesheuvel <ardb@kernel.org>
> Cc: linux-efi@vger.kernel.org
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
>  hw/i386/x86.c | 38 ++++++++++++++++++++------------------
>  1 file changed, 20 insertions(+), 18 deletions(-)
> 
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index 050eedc0c8..8b853abf38 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c


>      if (!legacy_no_rng_seed) {
> -        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
> -        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
> -        kernel = g_realloc(kernel, kernel_size);
> -        setup_data = (struct setup_data *)(kernel + setup_data_offset);
> +        setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH;
> +        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
> +        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
>          setup_data->next = cpu_to_le64(first_setup_data);
> -        first_setup_data = prot_addr + setup_data_offset;
> +        first_setup_data = setup_data_base + setup_data_total_len;
> +        setup_data_total_len += setup_data_item_len;
>          setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
>          setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
>          qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
>      }
>  
> -    /* Offset 0x250 is a pointer to the first setup_data link. */
> -    stq_p(header + 0x250, first_setup_data);
> +    if (first_setup_data) {
> +            /* Offset 0x250 is a pointer to the first setup_data link. */
> +            stq_p(header + 0x250, first_setup_data);
> +            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
> +                         setup_data_base, NULL, NULL, NULL, NULL, false);
> +    }

The boot measurements with AMD SEV now succeed, but I'm a little
worried about the implications of adding this ROM, when a few lines
later here we're discarding the 'header' changes for AMD SEV. Is
this still going to operate correctly in the guest OS if we've
discarded the header changes below ?

>      /*
>       * If we're starting an encrypted VM, it will be OVMF based, which uses the
>       * efi stub for booting and doesn't require any values to be placed in the
>       * kernel header.  We therefore don't update the header so the hash of the
>       * kernel on the other side of the fw_cfg interface matches the hash of the
>       * file the user passed in.
>       */
>      if (!sev_enabled()) {
>          memcpy(setup, header, MIN(sizeof(header), setup_size));
>      }
>  
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_ADDR, prot_addr);
>      fw_cfg_add_i32(fw_cfg, FW_CFG_KERNEL_SIZE, kernel_size);
>      fw_cfg_add_bytes(fw_cfg, FW_CFG_KERNEL_DATA, kernel, kernel_size);
>      sev_load_ctx.kernel_data = (char *)kernel;
> -- 
> 2.35.1
> 
> 

With regards,
Daniel

Jason A. Donenfeld Aug. 4, 2022, 1:07 p.m. UTC | #14

Hi Daniel,

On Thu, Aug 4, 2022 at 2:54 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> On Thu, Aug 04, 2022 at 02:44:11AM +0200, Jason A. Donenfeld wrote:
> > The boot parameter header refers to setup_data at an absolute address,
> > and each setup_data refers to the next setup_data at an absolute address
> > too. Currently QEMU simply puts the setup_datas right after the kernel
> > image, and since the kernel_image is loaded at prot_addr -- a fixed
> > address knowable to QEMU apriori -- the setup_data absolute address
> > winds up being just `prot_addr + a_fixed_offset_into_kernel_image`.
> >
> > This mostly works fine, so long as the kernel image really is loaded at
> > prot_addr. However, OVMF doesn't load the kernel at prot_addr, and
> > generally EFI doesn't give a good way of predicting where it's going to
> > load the kernel. So when it loads it at some address != prot_addr, the
> > absolute addresses in setup_data now point somewhere bogus, causing
> > crashes when EFI stub tries to follow the next link.
> >
> > Fix this by placing setup_data at some fixed place in memory, relative
> > to real_addr, not as part of the kernel image, and then pointing the
> > setup_data absolute address to that fixed place in memory. This way,
> > even if OVMF or other chains relocate the kernel image, the boot
> > parameter still points to the correct absolute address.
> >
> > Fixes: 3cbeb52467 ("hw/i386: add device tree support")
> > Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Richard Henderson <richard.henderson@linaro.org>
> > Cc: Peter Maydell <peter.maydell@linaro.org>
> > Cc: Michael S. Tsirkin <mst@redhat.com>
> > Cc: Daniel P. Berrangé <berrange@redhat.com>
> > Cc: Gerd Hoffmann <kraxel@redhat.com>
> > Cc: Ard Biesheuvel <ardb@kernel.org>
> > Cc: linux-efi@vger.kernel.org
> > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> > ---
> >  hw/i386/x86.c | 38 ++++++++++++++++++++------------------
> >  1 file changed, 20 insertions(+), 18 deletions(-)
> >
> > diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> > index 050eedc0c8..8b853abf38 100644
> > --- a/hw/i386/x86.c
> > +++ b/hw/i386/x86.c
>
>
> >      if (!legacy_no_rng_seed) {
> > -        setup_data_offset = QEMU_ALIGN_UP(kernel_size, 16);
> > -        kernel_size = setup_data_offset + sizeof(struct setup_data) + RNG_SEED_LENGTH;
> > -        kernel = g_realloc(kernel, kernel_size);
> > -        setup_data = (struct setup_data *)(kernel + setup_data_offset);
> > +        setup_data_item_len = sizeof(struct setup_data) + RNG_SEED_LENGTH;
> > +        setup_datas = g_realloc(setup_datas, setup_data_total_len + setup_data_item_len);
> > +        setup_data = (struct setup_data *)(setup_datas + setup_data_total_len);
> >          setup_data->next = cpu_to_le64(first_setup_data);
> > -        first_setup_data = prot_addr + setup_data_offset;
> > +        first_setup_data = setup_data_base + setup_data_total_len;
> > +        setup_data_total_len += setup_data_item_len;
> >          setup_data->type = cpu_to_le32(SETUP_RNG_SEED);
> >          setup_data->len = cpu_to_le32(RNG_SEED_LENGTH);
> >          qemu_guest_getrandom_nofail(setup_data->data, RNG_SEED_LENGTH);
> >      }
> >
> > -    /* Offset 0x250 is a pointer to the first setup_data link. */
> > -    stq_p(header + 0x250, first_setup_data);
> > +    if (first_setup_data) {
> > +            /* Offset 0x250 is a pointer to the first setup_data link. */
> > +            stq_p(header + 0x250, first_setup_data);
> > +            rom_add_blob("setup_data", setup_datas, setup_data_total_len, setup_data_total_len,
> > +                         setup_data_base, NULL, NULL, NULL, NULL, false);
> > +    }
>
> The boot measurements with AMD SEV now succeed, but I'm a little
> worried about the implications of adding this ROM, when a few lines
> later here we're discarding the 'header' changes for AMD SEV. Is
> this still going to operate correctly in the guest OS if we've
> discarded the header changes below ?

I'll add a !sev_enabled() condition to that block too, so it also
skips adding the ROM, for v3.

Jason

Laszlo Ersek Aug. 4, 2022, 1:16 p.m. UTC | #15

On 08/04/22 14:47, Jason A. Donenfeld wrote:
> On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>>
>> Hi Laszlo,
>>
>> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote:
>>> None of the existing info passing methods seem early enough, generic
>>> enough, and secure enough (at the same time)...
>>
>> Can you look at the v2 patch? It seems to work on every configuration I
>> throw at it. Keep in mind that setup_data is only used very, very early.
>> I can think of a few other places to put it too, looking at the x86
>> memory map, that will survive long enough.
>>
>> I think this might actually be a straightforwardly solvable problem if
>> you think about it more basically.
> 
> And just to put things in perspective here... We only need like 48
> bytes or something at some easy fixed address. That's not much. That's
> *got* to be a fairly tractable problem. If v2 has issues, I can't see
> why there wouldn't be a different easy place to put a meger 48 bytes
> of stuff that then is allowed to be wiped out after early boot.

I've looked at v2. It still relies on passing information from QEMU to
the guest kernel through guest RAM such that the whole firmware
execution takes place in-between, without the firmware knowing anything
about that particular area -- effectively treating it as free system
RAM. Such exceptions are time bombs.

We *have* used hard-coded addresses, sometimes they are unavoidable, but
then they are open-coded in both QEMU and the firmware, and some early
part of the firmware takes care to either move the data to a "safe"
place, or to cover it in-place with a kind of reservation that prevents
other parts of the firmware from trampling over it. I've debugged
mistakes (memory corruption) when such reservation was forgotten; it's
not fun.

In short, I have nothing against the QEMU patch, but then the current
OvmfPkg maintainers should accept a patch for the firmware too, for
protecting the area from later firmware components, as early as possible.

Laszlo

Jason A. Donenfeld Aug. 4, 2022, 1:25 p.m. UTC | #16

Hi Laszlo,

On Thu, Aug 4, 2022 at 3:17 PM Laszlo Ersek <lersek@redhat.com> wrote:
>
> On 08/04/22 14:47, Jason A. Donenfeld wrote:
> > On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >>
> >> Hi Laszlo,
> >>
> >> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote:
> >>> None of the existing info passing methods seem early enough, generic
> >>> enough, and secure enough (at the same time)...
> >>
> >> Can you look at the v2 patch? It seems to work on every configuration I
> >> throw at it. Keep in mind that setup_data is only used very, very early.
> >> I can think of a few other places to put it too, looking at the x86
> >> memory map, that will survive long enough.
> >>
> >> I think this might actually be a straightforwardly solvable problem if
> >> you think about it more basically.
> >
> > And just to put things in perspective here... We only need like 48
> > bytes or something at some easy fixed address. That's not much. That's
> > *got* to be a fairly tractable problem. If v2 has issues, I can't see
> > why there wouldn't be a different easy place to put a meger 48 bytes
> > of stuff that then is allowed to be wiped out after early boot.
>
> I've looked at v2. It still relies on passing information from QEMU to
> the guest kernel through guest RAM such that the whole firmware
> execution takes place in-between, without the firmware knowing anything
> about that particular area -- effectively treating it as free system
> RAM. Such exceptions are time bombs.
>
> We *have* used hard-coded addresses, sometimes they are unavoidable, but
> then they are open-coded in both QEMU and the firmware, and some early
> part of the firmware takes care to either move the data to a "safe"
> place, or to cover it in-place with a kind of reservation that prevents
> other parts of the firmware from trampling over it. I've debugged
> mistakes (memory corruption) when such reservation was forgotten; it's
> not fun.
>
> In short, I have nothing against the QEMU patch, but then the current
> OvmfPkg maintainers should accept a patch for the firmware too, for
> protecting the area from later firmware components, as early as possible.

What you say mostly makes sense. Though, I was wondering if there's
some unused space (maybe a historical low address) that nothing
touches because it's always been considered reserved. Some kind of
ancient video data region, for example. We only need 48 bytes after
all... My hope was that somebody with a lot of deep x86 knowledge
would be able to smell their way to something that's always good like
that. (Alternatively, there's my real_addr thing from v2.)

Jason

Laszlo Ersek Aug. 4, 2022, 1:25 p.m. UTC | #17

On 08/04/22 14:16, Ard Biesheuvel wrote:
> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>
>> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
>>> Hi Daniel,
>>>
>>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
>>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware
>>>> is arguably correct by design, as the QEMU <-> firmware interface is
>>>> supposed to be arbitrarily pluggable for any firmware implementation
>>>> not  limited to merely UEFI + seabios.
>>>
>>> Indeed, I agree with this.
>>>
>>>>
>>>>> For now I suggest either reverting the original patch, or at least not
>>>>> enabling the knob by default for any machine types. In particular, when
>>>>> using MicroVM, the user must leave the knob disabled when direct booting
>>>>> a kernel on OVMF, and the user may or may not enable the knob when
>>>>> direct booting a kernel on SeaBIOS.
>>>>
>>>> Having it opt-in via a knob would defeat Jason's goal of having the seed
>>>> available automatically.
>>>
>>> Yes, adding a knob is absolutely out of the question.
>>>
>>> It also doesn't actually solve the problem: this triggers when QEMU
>>> passes a DTB too. It's not just for the new RNG seed thing. This bug
>>> isn't new.
>>
>> In the other thread I also mentioned that this RNG Seed addition has
>> caused a bug with AMD SEV too, making boot measurement attestation
>> fail because the kernel blob passed to the firmware no longer matches
>> what the tenant expects, due to the injected seed.
>>
> 
> I was actually expecting this to be an issue in the
> signing/attestation department as well, and you just confirmed my
> suspicion.
> 
> But does this mean that populating the setup_data pointer is out of
> the question altogether? Or only that putting the setup_data linked
> list nodes inside the image is a problem?

QEMU already has to inject a whole bunch of stuff into confidential
computing guests. The way it's done (IIRC) is that the non-compressed,
trailing part of pflash (basically where the reset vector code lives
too) is populated at OVMF build time with a chain of GUID-ed structures,
and fields of those structures are filled in (at OVMF build time) from
various fixed PCDs. The fixed PCDs in turn are populated from the FD
files, using various MEMFD regions. When QEMU launches the guest, it can
parse the GPAs from the on-disk pflash image (traversing the list of
GUID-ed structs), at which addresses the guest firmware will then expect
the various crypto artifacts to be injected.

The point is that "who's in control" is reversed. The firmware exposes
(at build time) at what GPAs it can accept data injections, and QEMU
follows that. Of course the firmware ensures that nothing else in the
firmware will try to own those GPAs.

The only thing that needed to be hard-coded when this feature was
introduced was the "entry point", that is, the flash offset at which
QEMU starts the GUID-ed structure traversal.

AMD and IBM developers can likely much better describe this mechanism,
as I've not dealt with it in over a year. The QEMU side code is in
"hw/i386/pc_sysfw_ovmf.c".

We can make setup_data chaining work with OVMF, but the whole chain
should be located in a GPA range that OVMF dictates.

Jason A. Donenfeld Aug. 4, 2022, 1:28 p.m. UTC | #18

On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote:
>
> On 08/04/22 14:16, Ard Biesheuvel wrote:
> > On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>
> >> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> >>> Hi Daniel,
> >>>
> >>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> >>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware
> >>>> is arguably correct by design, as the QEMU <-> firmware interface is
> >>>> supposed to be arbitrarily pluggable for any firmware implementation
> >>>> not  limited to merely UEFI + seabios.
> >>>
> >>> Indeed, I agree with this.
> >>>
> >>>>
> >>>>> For now I suggest either reverting the original patch, or at least not
> >>>>> enabling the knob by default for any machine types. In particular, when
> >>>>> using MicroVM, the user must leave the knob disabled when direct booting
> >>>>> a kernel on OVMF, and the user may or may not enable the knob when
> >>>>> direct booting a kernel on SeaBIOS.
> >>>>
> >>>> Having it opt-in via a knob would defeat Jason's goal of having the seed
> >>>> available automatically.
> >>>
> >>> Yes, adding a knob is absolutely out of the question.
> >>>
> >>> It also doesn't actually solve the problem: this triggers when QEMU
> >>> passes a DTB too. It's not just for the new RNG seed thing. This bug
> >>> isn't new.
> >>
> >> In the other thread I also mentioned that this RNG Seed addition has
> >> caused a bug with AMD SEV too, making boot measurement attestation
> >> fail because the kernel blob passed to the firmware no longer matches
> >> what the tenant expects, due to the injected seed.
> >>
> >
> > I was actually expecting this to be an issue in the
> > signing/attestation department as well, and you just confirmed my
> > suspicion.
> >
> > But does this mean that populating the setup_data pointer is out of
> > the question altogether? Or only that putting the setup_data linked
> > list nodes inside the image is a problem?
>
> QEMU already has to inject a whole bunch of stuff into confidential
> computing guests. The way it's done (IIRC) is that the non-compressed,
> trailing part of pflash (basically where the reset vector code lives
> too) is populated at OVMF build time with a chain of GUID-ed structures,
> and fields of those structures are filled in (at OVMF build time) from
> various fixed PCDs. The fixed PCDs in turn are populated from the FD
> files, using various MEMFD regions. When QEMU launches the guest, it can
> parse the GPAs from the on-disk pflash image (traversing the list of
> GUID-ed structs), at which addresses the guest firmware will then expect
> the various crypto artifacts to be injected.
>
> The point is that "who's in control" is reversed. The firmware exposes
> (at build time) at what GPAs it can accept data injections, and QEMU
> follows that. Of course the firmware ensures that nothing else in the
> firmware will try to own those GPAs.
>
> The only thing that needed to be hard-coded when this feature was
> introduced was the "entry point", that is, the flash offset at which
> QEMU starts the GUID-ed structure traversal.
>
> AMD and IBM developers can likely much better describe this mechanism,
> as I've not dealt with it in over a year. The QEMU side code is in
> "hw/i386/pc_sysfw_ovmf.c".
>
> We can make setup_data chaining work with OVMF, but the whole chain
> should be located in a GPA range that OVMF dictates.

It sounds like what you describe is pretty OVMF-specific though,
right? Do we want to tie things together so tightly like that?

Given we only need 48 bytes or so, isn't there a more subtle place we
could just throw this in ram that doesn't need such complex
coordination?

Jason

Laszlo Ersek Aug. 4, 2022, 1:29 p.m. UTC | #19

On 08/04/22 15:16, Laszlo Ersek wrote:
> On 08/04/22 14:47, Jason A. Donenfeld wrote:
>> On Thu, Aug 4, 2022 at 2:11 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>>>
>>> Hi Laszlo,
>>>
>>> On Thu, Aug 04, 2022 at 01:31:36PM +0200, Laszlo Ersek wrote:
>>>> None of the existing info passing methods seem early enough, generic
>>>> enough, and secure enough (at the same time)...
>>>
>>> Can you look at the v2 patch? It seems to work on every configuration I
>>> throw at it. Keep in mind that setup_data is only used very, very early.
>>> I can think of a few other places to put it too, looking at the x86
>>> memory map, that will survive long enough.
>>>
>>> I think this might actually be a straightforwardly solvable problem if
>>> you think about it more basically.
>>
>> And just to put things in perspective here... We only need like 48
>> bytes or something at some easy fixed address. That's not much. That's
>> *got* to be a fairly tractable problem. If v2 has issues, I can't see
>> why there wouldn't be a different easy place to put a meger 48 bytes
>> of stuff that then is allowed to be wiped out after early boot.
> 
> I've looked at v2. It still relies on passing information from QEMU to
> the guest kernel through guest RAM such that the whole firmware
> execution takes place in-between, without the firmware knowing anything
> about that particular area -- effectively treating it as free system
> RAM. Such exceptions are time bombs.
> 
> We *have* used hard-coded addresses, sometimes they are unavoidable, but
> then they are open-coded in both QEMU and the firmware, and some early
> part of the firmware takes care to either move the data to a "safe"
> place, or to cover it in-place with a kind of reservation that prevents
> other parts of the firmware from trampling over it. I've debugged
> mistakes (memory corruption) when such reservation was forgotten; it's
> not fun.

Reference:

https://github.com/tianocore/edk2/commit/ad43bc6b2e

> 
> In short, I have nothing against the QEMU patch, but then the current
> OvmfPkg maintainers should accept a patch for the firmware too, for
> protecting the area from later firmware components, as early as possible.
> 
> Laszlo
>

Laszlo Ersek Aug. 4, 2022, 1:56 p.m. UTC | #20

On 08/04/22 15:28, Jason A. Donenfeld wrote:
> On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote:
>>
>> On 08/04/22 14:16, Ard Biesheuvel wrote:
>>> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>>>
>>>> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
>>>>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware
>>>>>> is arguably correct by design, as the QEMU <-> firmware interface is
>>>>>> supposed to be arbitrarily pluggable for any firmware implementation
>>>>>> not  limited to merely UEFI + seabios.
>>>>>
>>>>> Indeed, I agree with this.
>>>>>
>>>>>>
>>>>>>> For now I suggest either reverting the original patch, or at least not
>>>>>>> enabling the knob by default for any machine types. In particular, when
>>>>>>> using MicroVM, the user must leave the knob disabled when direct booting
>>>>>>> a kernel on OVMF, and the user may or may not enable the knob when
>>>>>>> direct booting a kernel on SeaBIOS.
>>>>>>
>>>>>> Having it opt-in via a knob would defeat Jason's goal of having the seed
>>>>>> available automatically.
>>>>>
>>>>> Yes, adding a knob is absolutely out of the question.
>>>>>
>>>>> It also doesn't actually solve the problem: this triggers when QEMU
>>>>> passes a DTB too. It's not just for the new RNG seed thing. This bug
>>>>> isn't new.
>>>>
>>>> In the other thread I also mentioned that this RNG Seed addition has
>>>> caused a bug with AMD SEV too, making boot measurement attestation
>>>> fail because the kernel blob passed to the firmware no longer matches
>>>> what the tenant expects, due to the injected seed.
>>>>
>>>
>>> I was actually expecting this to be an issue in the
>>> signing/attestation department as well, and you just confirmed my
>>> suspicion.
>>>
>>> But does this mean that populating the setup_data pointer is out of
>>> the question altogether? Or only that putting the setup_data linked
>>> list nodes inside the image is a problem?
>>
>> QEMU already has to inject a whole bunch of stuff into confidential
>> computing guests. The way it's done (IIRC) is that the non-compressed,
>> trailing part of pflash (basically where the reset vector code lives
>> too) is populated at OVMF build time with a chain of GUID-ed structures,
>> and fields of those structures are filled in (at OVMF build time) from
>> various fixed PCDs. The fixed PCDs in turn are populated from the FD
>> files, using various MEMFD regions. When QEMU launches the guest, it can
>> parse the GPAs from the on-disk pflash image (traversing the list of
>> GUID-ed structs), at which addresses the guest firmware will then expect
>> the various crypto artifacts to be injected.
>>
>> The point is that "who's in control" is reversed. The firmware exposes
>> (at build time) at what GPAs it can accept data injections, and QEMU
>> follows that. Of course the firmware ensures that nothing else in the
>> firmware will try to own those GPAs.
>>
>> The only thing that needed to be hard-coded when this feature was
>> introduced was the "entry point", that is, the flash offset at which
>> QEMU starts the GUID-ed structure traversal.
>>
>> AMD and IBM developers can likely much better describe this mechanism,
>> as I've not dealt with it in over a year. The QEMU side code is in
>> "hw/i386/pc_sysfw_ovmf.c".
>>
>> We can make setup_data chaining work with OVMF, but the whole chain
>> should be located in a GPA range that OVMF dictates.
> 
> It sounds like what you describe is pretty OVMF-specific though,
> right?

Yes, completely OVMF specific.

> Do we want to tie things together so tightly like that?

It depends on what the goal is:

- do we want setup_data chaining work generally?

- or do we want only the random seed injection to stop crashing OVMF guests?

In the latter case, we only need to teach QEMU to reliably recognize
OVMF from the on-disk firmware binary (regardless of pflash use), and
then not expose any setup_data in guest RAM. The UEFI guest kernel will
use EFI_RNG_PROTOCOL (when available, from virtio-rng-pci), and no
particular seed otherwise.

(This is the "papering over" case, which I don't think is necessarily
wrong; only it should be robust. I thought pflash would be usable for
that; turns out it isn't. Thus, we could perhaps try identifying a
firmware binary as "OVMF" by contents.)

In the former case though, I think the "tight coupling" is unavoidable.
As long as setup_data is needed by the kernel extremely early, no
"firmware hiding" or "firmware independence" is in place yet.

> Given we only need 48 bytes or so, isn't there a more subtle place we
> could just throw this in ram that doesn't need such complex
> coordination?

These tricks add up and go wrong after a while. The pedantic
reservations in the firmware have proved necessary.

IIUC, with v2, the setup_data_base address would (most frequently) be 96
KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does
not reserve away the area, UefiCpuPkg or other drivers might allocate an
overlapping chunk, even if only temporarily. That might not break the
firmware, but it could overwrite the random seed. If the guest kernel
somehow consumed that seed (rather than getting one from
EFI_RNG_PROTOCOL -- what if no virtio-rng-pci device was configured?),
that could potentially destroy the guest's security, without any
obvious/immediate signs.

Laszlo

Daniel P. Berrangé Aug. 4, 2022, 2:03 p.m. UTC | #21

On Thu, Aug 04, 2022 at 03:56:54PM +0200, Laszlo Ersek wrote:
> On 08/04/22 15:28, Jason A. Donenfeld wrote:
> > On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote:
> >>
> >> On 08/04/22 14:16, Ard Biesheuvel wrote:
> >>> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>>>
> >>>> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
> >>>>> Hi Daniel,
> >>>>>
> >>>>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
> >>>>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware
> >>>>>> is arguably correct by design, as the QEMU <-> firmware interface is
> >>>>>> supposed to be arbitrarily pluggable for any firmware implementation
> >>>>>> not  limited to merely UEFI + seabios.
> >>>>>
> >>>>> Indeed, I agree with this.
> >>>>>
> >>>>>>
> >>>>>>> For now I suggest either reverting the original patch, or at least not
> >>>>>>> enabling the knob by default for any machine types. In particular, when
> >>>>>>> using MicroVM, the user must leave the knob disabled when direct booting
> >>>>>>> a kernel on OVMF, and the user may or may not enable the knob when
> >>>>>>> direct booting a kernel on SeaBIOS.
> >>>>>>
> >>>>>> Having it opt-in via a knob would defeat Jason's goal of having the seed
> >>>>>> available automatically.
> >>>>>
> >>>>> Yes, adding a knob is absolutely out of the question.
> >>>>>
> >>>>> It also doesn't actually solve the problem: this triggers when QEMU
> >>>>> passes a DTB too. It's not just for the new RNG seed thing. This bug
> >>>>> isn't new.
> >>>>
> >>>> In the other thread I also mentioned that this RNG Seed addition has
> >>>> caused a bug with AMD SEV too, making boot measurement attestation
> >>>> fail because the kernel blob passed to the firmware no longer matches
> >>>> what the tenant expects, due to the injected seed.
> >>>>
> >>>
> >>> I was actually expecting this to be an issue in the
> >>> signing/attestation department as well, and you just confirmed my
> >>> suspicion.
> >>>
> >>> But does this mean that populating the setup_data pointer is out of
> >>> the question altogether? Or only that putting the setup_data linked
> >>> list nodes inside the image is a problem?
> >>
> >> QEMU already has to inject a whole bunch of stuff into confidential
> >> computing guests. The way it's done (IIRC) is that the non-compressed,
> >> trailing part of pflash (basically where the reset vector code lives
> >> too) is populated at OVMF build time with a chain of GUID-ed structures,
> >> and fields of those structures are filled in (at OVMF build time) from
> >> various fixed PCDs. The fixed PCDs in turn are populated from the FD
> >> files, using various MEMFD regions. When QEMU launches the guest, it can
> >> parse the GPAs from the on-disk pflash image (traversing the list of
> >> GUID-ed structs), at which addresses the guest firmware will then expect
> >> the various crypto artifacts to be injected.
> >>
> >> The point is that "who's in control" is reversed. The firmware exposes
> >> (at build time) at what GPAs it can accept data injections, and QEMU
> >> follows that. Of course the firmware ensures that nothing else in the
> >> firmware will try to own those GPAs.
> >>
> >> The only thing that needed to be hard-coded when this feature was
> >> introduced was the "entry point", that is, the flash offset at which
> >> QEMU starts the GUID-ed structure traversal.
> >>
> >> AMD and IBM developers can likely much better describe this mechanism,
> >> as I've not dealt with it in over a year. The QEMU side code is in
> >> "hw/i386/pc_sysfw_ovmf.c".
> >>
> >> We can make setup_data chaining work with OVMF, but the whole chain
> >> should be located in a GPA range that OVMF dictates.
> > 
> > It sounds like what you describe is pretty OVMF-specific though,
> > right?
> 
> Yes, completely OVMF specific.
> 
> > Do we want to tie things together so tightly like that?
> 
> It depends on what the goal is:
> 
> - do we want setup_data chaining work generally?
> 
> - or do we want only the random seed injection to stop crashing OVMF guests?
> 
> In the latter case, we only need to teach QEMU to reliably recognize
> OVMF from the on-disk firmware binary (regardless of pflash use), and
> then not expose any setup_data in guest RAM. The UEFI guest kernel will
> use EFI_RNG_PROTOCOL (when available, from virtio-rng-pci), and no
> particular seed otherwise.
> 
> (This is the "papering over" case, which I don't think is necessarily
> wrong; only it should be robust. I thought pflash would be usable for
> that; turns out it isn't. Thus, we could perhaps try identifying a
> firmware binary as "OVMF" by contents.)
> 
> In the former case though, I think the "tight coupling" is unavoidable.
> As long as setup_data is needed by the kernel extremely early, no
> "firmware hiding" or "firmware independence" is in place yet.
> 
> > Given we only need 48 bytes or so, isn't there a more subtle place we
> > could just throw this in ram that doesn't need such complex
> > coordination?
> 
> These tricks add up and go wrong after a while. The pedantic
> reservations in the firmware have proved necessary.
> 
> IIUC, with v2, the setup_data_base address would (most frequently) be 96
> KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does
> not reserve away the area, UefiCpuPkg or other drivers might allocate an
> overlapping chunk, even if only temporarily. That might not break the
> firmware, but it could overwrite the random seed. If the guest kernel
> somehow consumed that seed (rather than getting one from
> EFI_RNG_PROTOCOL -- what if no virtio-rng-pci device was configured?),
> that could potentially destroy the guest's security, without any
> obvious/immediate signs.

The guest kernel shouldn't load this random seed even if it is
provided by the hypervisor, when running in a confidential
virutalization environment like SEV/TDX or equiv for other arches.
It can't trust the hypervisor to have actually used random data
for populating the seed value. On x86 it would have to rely on
the hardware RDRAND/RDSEED for an initial seed.

With regards,
Daniel

Laszlo Ersek Aug. 4, 2022, 2:04 p.m. UTC | #22

On 08/04/22 15:56, Laszlo Ersek wrote:
> On 08/04/22 15:28, Jason A. Donenfeld wrote:
>> On Thu, Aug 4, 2022 at 3:25 PM Laszlo Ersek <lersek@redhat.com> wrote:
>>>
>>> On 08/04/22 14:16, Ard Biesheuvel wrote:
>>>> On Thu, 4 Aug 2022 at 14:11, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>>>>
>>>>> On Thu, Aug 04, 2022 at 02:03:29PM +0200, Jason A. Donenfeld wrote:
>>>>>> Hi Daniel,
>>>>>>
>>>>>> On Thu, Aug 04, 2022 at 10:25:36AM +0100, Daniel P. Berrangé wrote:
>>>>>>> Yep, and ultimately the inability to distinguish UEFI vs other firmware
>>>>>>> is arguably correct by design, as the QEMU <-> firmware interface is
>>>>>>> supposed to be arbitrarily pluggable for any firmware implementation
>>>>>>> not  limited to merely UEFI + seabios.
>>>>>>
>>>>>> Indeed, I agree with this.
>>>>>>
>>>>>>>
>>>>>>>> For now I suggest either reverting the original patch, or at least not
>>>>>>>> enabling the knob by default for any machine types. In particular, when
>>>>>>>> using MicroVM, the user must leave the knob disabled when direct booting
>>>>>>>> a kernel on OVMF, and the user may or may not enable the knob when
>>>>>>>> direct booting a kernel on SeaBIOS.
>>>>>>>
>>>>>>> Having it opt-in via a knob would defeat Jason's goal of having the seed
>>>>>>> available automatically.
>>>>>>
>>>>>> Yes, adding a knob is absolutely out of the question.
>>>>>>
>>>>>> It also doesn't actually solve the problem: this triggers when QEMU
>>>>>> passes a DTB too. It's not just for the new RNG seed thing. This bug
>>>>>> isn't new.
>>>>>
>>>>> In the other thread I also mentioned that this RNG Seed addition has
>>>>> caused a bug with AMD SEV too, making boot measurement attestation
>>>>> fail because the kernel blob passed to the firmware no longer matches
>>>>> what the tenant expects, due to the injected seed.
>>>>>
>>>>
>>>> I was actually expecting this to be an issue in the
>>>> signing/attestation department as well, and you just confirmed my
>>>> suspicion.
>>>>
>>>> But does this mean that populating the setup_data pointer is out of
>>>> the question altogether? Or only that putting the setup_data linked
>>>> list nodes inside the image is a problem?
>>>
>>> QEMU already has to inject a whole bunch of stuff into confidential
>>> computing guests. The way it's done (IIRC) is that the non-compressed,
>>> trailing part of pflash (basically where the reset vector code lives
>>> too) is populated at OVMF build time with a chain of GUID-ed structures,
>>> and fields of those structures are filled in (at OVMF build time) from
>>> various fixed PCDs. The fixed PCDs in turn are populated from the FD
>>> files, using various MEMFD regions. When QEMU launches the guest, it can
>>> parse the GPAs from the on-disk pflash image (traversing the list of
>>> GUID-ed structs), at which addresses the guest firmware will then expect
>>> the various crypto artifacts to be injected.
>>>
>>> The point is that "who's in control" is reversed. The firmware exposes
>>> (at build time) at what GPAs it can accept data injections, and QEMU
>>> follows that. Of course the firmware ensures that nothing else in the
>>> firmware will try to own those GPAs.
>>>
>>> The only thing that needed to be hard-coded when this feature was
>>> introduced was the "entry point", that is, the flash offset at which
>>> QEMU starts the GUID-ed structure traversal.
>>>
>>> AMD and IBM developers can likely much better describe this mechanism,
>>> as I've not dealt with it in over a year. The QEMU side code is in
>>> "hw/i386/pc_sysfw_ovmf.c".
>>>
>>> We can make setup_data chaining work with OVMF, but the whole chain
>>> should be located in a GPA range that OVMF dictates.
>>
>> It sounds like what you describe is pretty OVMF-specific though,
>> right?
> 
> Yes, completely OVMF specific.
> 
>> Do we want to tie things together so tightly like that?
> 
> It depends on what the goal is:
> 
> - do we want setup_data chaining work generally?
> 
> - or do we want only the random seed injection to stop crashing OVMF guests?
> 
> In the latter case, we only need to teach QEMU to reliably recognize
> OVMF from the on-disk firmware binary (regardless of pflash use), and
> then not expose any setup_data in guest RAM. The UEFI guest kernel will
> use EFI_RNG_PROTOCOL (when available, from virtio-rng-pci), and no
> particular seed otherwise.
> 
> (This is the "papering over" case, which I don't think is necessarily
> wrong; only it should be robust. I thought pflash would be usable for
> that; turns out it isn't. Thus, we could perhaps try identifying a
> firmware binary as "OVMF" by contents.)
> 
> In the former case though, I think the "tight coupling" is unavoidable.
> As long as setup_data is needed by the kernel extremely early, no
> "firmware hiding" or "firmware independence" is in place yet.
> 
>> Given we only need 48 bytes or so, isn't there a more subtle place we
>> could just throw this in ram that doesn't need such complex
>> coordination?
> 
> These tricks add up and go wrong after a while. The pedantic
> reservations in the firmware have proved necessary.
> 
> IIUC, with v2, the setup_data_base address would (most frequently) be 96
> KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does
> not reserve away the area, UefiCpuPkg or other drivers might allocate an
> overlapping chunk, even if only temporarily. That might not break the
> firmware, but it could overwrite the random seed. If the guest kernel
> somehow consumed that seed (rather than getting one from
> EFI_RNG_PROTOCOL -- what if no virtio-rng-pci device was configured?),
> that could potentially destroy the guest's security, without any
> obvious/immediate signs.

I must add however that I feel very much out of place making
authoritative-sounding statements like this. The above is my honest
opinion, and my (unpleasant) writing style, but it's just that: my
opinion & style. So much has happened in edk2 and QEMU since last summer
(without me following them closely) that I feel uncomfortable speaking
on them. Take whatever I say with a grain of salt, and definitely
research all options.

Laszlo

Jason A. Donenfeld Aug. 4, 2022, 10:56 p.m. UTC | #23

Hey Laszlo,

On Thu, Aug 04, 2022 at 03:56:54PM +0200, Laszlo Ersek wrote:
> - do we want setup_data chaining work generally?
> 
> - or do we want only the random seed injection to stop crashing OVMF guests?

Preferably the first - generally. Which brings us to your point:

> > Given we only need 48 bytes or so, isn't there a more subtle place we
> > could just throw this in ram that doesn't need such complex
> > coordination?
> 
> These tricks add up and go wrong after a while. The pedantic
> reservations in the firmware have proved necessary.
> 
> IIUC, with v2, the setup_data_base address would (most frequently) be 96
> KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does
> not reserve away the area, UefiCpuPkg or other drivers might allocate an
> overlapping chunk, even if only temporarily. That might not break the
> firmware, but it could overwrite the random seed. 

Yea, so we don't want an address that something else might overwrite. So
my question is: isn't there some 48 bytes or so available in some low
address (or maybe a high one?) that is traditionally reserved for some
hardware function, and so software doesn't use it, but it turns out QEMU
doesn't use it for anything either, so we can get away placing it at
that address? It seems like there *ought* to be something like that. I
just don't (yet) know what it is...

Jason

Laszlo Ersek Aug. 5, 2022, 6:26 a.m. UTC | #24

On 08/05/22 00:56, Jason A. Donenfeld wrote:
> Hey Laszlo,
> 
> On Thu, Aug 04, 2022 at 03:56:54PM +0200, Laszlo Ersek wrote:
>> - do we want setup_data chaining work generally?
>>
>> - or do we want only the random seed injection to stop crashing OVMF guests?
> 
> Preferably the first - generally. Which brings us to your point:
>  
>>> Given we only need 48 bytes or so, isn't there a more subtle place we
>>> could just throw this in ram that doesn't need such complex
>>> coordination?
>>
>> These tricks add up and go wrong after a while. The pedantic
>> reservations in the firmware have proved necessary.
>>
>> IIUC, with v2, the setup_data_base address would (most frequently) be 96
>> KB. edk2 does have uses for very low memory. If OVMF's PlatformPei does
>> not reserve away the area, UefiCpuPkg or other drivers might allocate an
>> overlapping chunk, even if only temporarily. That might not break the
>> firmware, but it could overwrite the random seed. 
> 
> Yea, so we don't want an address that something else might overwrite. So
> my question is: isn't there some 48 bytes or so available in some low
> address (or maybe a high one?) that is traditionally reserved for some
> hardware function, and so software doesn't use it, but it turns out QEMU
> doesn't use it for anything either, so we can get away placing it at
> that address? It seems like there *ought* to be something like that. I
> just don't (yet) know what it is...

I don't know of any such "hidden pocket", unfortunately. On the other
hand, low-level edk2 drivers (usually dealing with x86 intricacies, such
as MTRRs, CPU bringup, ...) have repeatedly surprised me with their
handling of low memory.

Laszlo

Gerd Hoffmann Aug. 16, 2022, 8:55 a.m. UTC | #25

Hi,

> > We can make setup_data chaining work with OVMF, but the whole chain
> > should be located in a GPA range that OVMF dictates.
> 
> It sounds like what you describe is pretty OVMF-specific though,
> right? Do we want to tie things together so tightly like that?
> 
> Given we only need 48 bytes or so, isn't there a more subtle place we
> could just throw this in ram that doesn't need such complex
> coordination?

Joining the party late (and still catching up the thread).  Given we
don't need that anyway with EFI, only with legacy BIOS:  Can't that just
be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass
those 48 bytes random seed?

take care,
  Gerd

Jason A. Donenfeld Aug. 18, 2022, 3:38 p.m. UTC | #26

Hey Gerd,

On Tue, Aug 16, 2022 at 10:55:11AM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> > > We can make setup_data chaining work with OVMF, but the whole chain
> > > should be located in a GPA range that OVMF dictates.
> > 
> > It sounds like what you describe is pretty OVMF-specific though,
> > right? Do we want to tie things together so tightly like that?
> > 
> > Given we only need 48 bytes or so, isn't there a more subtle place we
> > could just throw this in ram that doesn't need such complex
> > coordination?
> 
> Joining the party late (and still catching up the thread).  Given we
> don't need that anyway with EFI, only with legacy BIOS:  Can't that just
> be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass
> those 48 bytes random seed?

Actually, I want this to work with EFI, very much so.

If our objective was to just not break EFI, the solution would be
simple: in the kernel we can have EFISTUB ignore the setup_data field
from the image, and then bump the boot header protocol number. If QEMU
sees the boot protocol number is below this one, then it won't set
setup_data. Done, fixed.

Except I think there's value in passing seeds even through with EFI.

Your option ROM idea is interesting; somebody mentioned that elsewhere
too I think. I'm wondering, though: do option ROMs still run when
EFI/OVMF is being used?

Jason

Gerd Hoffmann Aug. 19, 2022, 6:40 a.m. UTC | #27

On Thu, Aug 18, 2022 at 05:38:37PM +0200, Jason A. Donenfeld wrote:
> Hey Gerd,
> 
> > Joining the party late (and still catching up the thread).  Given we
> > don't need that anyway with EFI, only with legacy BIOS:  Can't that just
> > be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass
> > those 48 bytes random seed?
> 
> Actually, I want this to work with EFI, very much so.

With EFI the kernel stub gets some random seed via EFI_RNG_PROTOCOL.
I can't see any good reason to derive from that.  It works no matter
how the kernel gets loaded.

OVMF ships with a driver for the virtio-rng device.  So just add that
to your virtual machine and seeding works fine ...

> If our objective was to just not break EFI, the solution would be
> simple: in the kernel we can have EFISTUB ignore the setup_data field
> from the image, and then bump the boot header protocol number. If QEMU
> sees the boot protocol number is below this one, then it won't set
> setup_data. Done, fixed.

As mentioned elsewhere in the thread patching in physical addresses on
qemu side isn't going to fly due to the different load methods we have.

> Your option ROM idea is interesting; somebody mentioned that elsewhere
> too I think.

Doing the setup_data patching in the option rom has the advantage that
it'll only happen with that specific load method being used.  Also the
option rom knows where it places stuff in memory so it is in a much
better position to find a good & non-conflicting place for the random
seed.  Also reserve/allocate memory if needed etc.

> I'm wondering, though: do option ROMs still run when
> EFI/OVMF is being used?

No, they are not used with EFI.  OVMF has a completely independent
implementation for direct kernel boot.

The options I see for EFI are:

  (1) Do nothing and continue to depend on virtio-rng.
  (2) Implement an efi driver which gets those 48 seed bytes from
      qemu by whatever means we'll define and hands them out via
      EFI_RNG_PROTOCOL.

take care,
  Gerd

Ard Biesheuvel Aug. 19, 2022, 7:16 a.m. UTC | #28

On Fri, 19 Aug 2022 at 08:41, Gerd Hoffmann <kraxel@redhat.com> wrote:
>
> On Thu, Aug 18, 2022 at 05:38:37PM +0200, Jason A. Donenfeld wrote:
> > Hey Gerd,
> >
> > > Joining the party late (and still catching up the thread).  Given we
> > > don't need that anyway with EFI, only with legacy BIOS:  Can't that just
> > > be a protocol between qemu and pc-bios/optionrom/*boot*.S on how to pass
> > > those 48 bytes random seed?
> >
> > Actually, I want this to work with EFI, very much so.

Even if we wire this up for EFI in some way, it will only affect
direct kernel boot using -kernel/-initrd etc, which is a niche use
case at best (AIUI libvirt uses it for the initial boot only)

I personally rely on it heavily for development, and I imagine others
might too, but that is hardly relevant here.

> With EFI the kernel stub gets some random seed via EFI_RNG_PROTOCOL.
> I can't see any good reason to derive from that.  It works no matter
> how the kernel gets loaded.
>
> OVMF ships with a driver for the virtio-rng device.  So just add that
> to your virtual machine and seeding works fine ...
>

... or we find other ways for the platform to speak to the OS, using
EFI protocols or other EFI methods.

Currently, the 'pure EFI' boot code is arch agnostic - it can be built
and run on any architecture that supports EFI boot. Adding Linux+x86
specific hacks to it is out of the question.

So that means that setup_data provided by QEMU will be consumed
directly by the kernel (when doing direct kernel boot only), using an
out of band channel that EFI/OVMF are completely unaware of, putting
it outside the scope of secure boot, measure boot, etc.

This is not acceptable to me.

> > If our objective was to just not break EFI, the solution would be
> > simple: in the kernel we can have EFISTUB ignore the setup_data field
> > from the image, and then bump the boot header protocol number. If QEMU
> > sees the boot protocol number is below this one, then it won't set
> > setup_data. Done, fixed.
>
> As mentioned elsewhere in the thread patching in physical addresses on
> qemu side isn't going to fly due to the different load methods we have.
>

And it conflates the file representation with the in-memory
representation, which I object to fundamentally - setup_data is part
of the file image, and becomes part of the in-memory representation
when it gets staged in memory for booting, which only happens in the
EFI stub when using pure EFI boot.

Using setup_data as a hidden comms channel between QEMU and the core
kernel breaks that distinction.

> > Your option ROM idea is interesting; somebody mentioned that elsewhere
> > too I think.
>
> Doing the setup_data patching in the option rom has the advantage that
> it'll only happen with that specific load method being used.  Also the
> option rom knows where it places stuff in memory so it is in a much
> better position to find a good & non-conflicting place for the random
> seed.  Also reserve/allocate memory if needed etc.
>

Exactly. This is the only sensible place to do this.

> > I'm wondering, though: do option ROMs still run when
> > EFI/OVMF is being used?
>
> No, they are not used with EFI.  OVMF has a completely independent
> implementation for direct kernel boot.
>
> The options I see for EFI are:
>
>   (1) Do nothing and continue to depend on virtio-rng.
>   (2) Implement an efi driver which gets those 48 seed bytes from
>       qemu by whatever means we'll define and hands them out via
>       EFI_RNG_PROTOCOL.
>

We could explore other options, but SETUP_RNG_SEED is fundamentally
incompatible with EFI boot (or any other boot method where the image
is treated as an opaque file by the firmware/loader), so that is not
an acceptable approach to me.

[v2] hw/i386: place setup_data at fixed place in memory

Commit Message

Comments

Patch