[qemu,v5] spapr: Kill SLOF

Message ID 20200110020925.98711-1-aik@ozlabs.ru
State New

Commit Message

Alexey Kardashevskiy Jan. 10, 2020, 2:09 a.m. UTC
The Petitboot bootloader is way more advanced than SLOF is ever going to
be as Petitboot comes with the full-featured Linux kernel with all
the drivers, and initramdisk with quite user friendly interface.
The problem with ditching SLOF is that an unmodified pseries kernel can
either start via:
1. kexec, this requires presence of RTAS and skips
ibm,client-architecture-support entirely;
2. normal boot, this heavily relies on the OF1275 client interface to
fetch the device tree and do early setup (claim memory).

This adds a new bios-less mode to the pseries machine: "bios=on|off".
When enabled, QEMU does not load SLOF and jumps to the kernel from
"-kernel".

The client interface is implemented exactly as RTAS - a 20 bytes blob,
right next after the RTAS blob. The entry point is passed to the kernel
via GPR5.

This implements a handful of client interface methods just to get going.
In particular, this implements the device tree fetching,
ibm,client-architecture-support and instantiate-rtas.

This implements changing FDT properties for RTAS (for vmlinux and zImage)
and initramdisk location (for zImage). To make this work, this skips
fdt_pack() when bios=off as not packing the blob leaves some room for
appending.

This assigns "phandles" to device tree nodes, as SLOF is no longer there
and it was the addresses of its OF nodes that served as phandle values.
This keeps predefined nodes (such as XICS/NVLINK/...) unchanged.
phandles are regenerated at every FDT rebuild.
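
As a rough illustration of the phandle assignment described above, the
following self-contained C sketch gives every node that lacks a phandle the
lowest value not already in use, and leaves predefined ones untouched.
`Node`, `phandle_in_use` and `assign_phandles` are hypothetical names used
only for this example; the patch itself walks the tree with libfdt, which is
not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an FDT node; phandle == 0 means "not set". */
typedef struct {
    const char *name;
    uint32_t phandle;
} Node;

/* Linear search: is this phandle value already used by any node? */
static bool phandle_in_use(const Node *nodes, size_t n, uint32_t phandle)
{
    for (size_t i = 0; i < n; ++i) {
        if (nodes[i].phandle == phandle) {
            return true;
        }
    }
    return false;
}

/* Give every node without a phandle the lowest value not yet in use. */
static void assign_phandles(Node *nodes, size_t n)
{
    uint32_t next = 1;

    for (size_t i = 0; i < n; ++i) {
        if (nodes[i].phandle) {
            continue; /* predefined (e.g. XICS), keep unchanged */
        }
        while (phandle_in_use(nodes, n, next)) {
            ++next; /* skip values taken by predefined nodes */
        }
        nodes[i].phandle = next++;
    }
}
```

Since the values are derived only from the tree contents, running this again
on a freshly built tree regenerates the same phandles.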

When bios=off, this adds "/chosen" every time QEMU builds a tree.

This implements "claim" which the client (Linux) uses for memory
allocation; this is also used by QEMU for claiming kernel/initrd images,
client interface entry point, RTAS and the initial stack.

While at this, add a "kernel-addr" machine parameter to allow moving
the kernel in memory. This is useful for debugging if the kernel is
loaded at @0, although not necessary.

This adds basic instances support which are managed by a hashmap
ihandle->[phandle, DeviceState, Chardev].

Note that a 64bit PCI fix is required for Linux:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735e

The test command line:

qemu-system-ppc64 \
-nodefaults \
-chardev stdio,id=STDIO0,signal=off,mux=on \
-device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
-mon id=MON0,chardev=STDIO0,mode=readline \
-nographic \
-vga none \
-kernel pbuild/kernel-le-guest/arch/powerpc/boot/zImage.pseries \
-machine pseries,bios=off,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken \
-m 4G \
-enable-kvm \
-initrd pb/rootfs.cpio.xz \
-device nec-usb-xhci,id=nec-usb-xhci0 \
-netdev tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0 \
-device virtio-net-pci,id=vnet0,netdev=TAP0 img/f30le.qcow2 \
-snapshot \
-smp 8,threads=8 \
-trace events=qemu_trace_events \
-d guest_errors \
-chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.ssh54088 \
-mon chardev=SOCKET0,mode=control

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
Changes:
v5:
* made instances keep device and chardev pointers
* removed VIO dependencies
* print error if RTAS memory is not claimed as it should have been
* pack FDT as "quiesce"

v4:
* fixed open
* validate ihandles in "call-method"

v3:
* fixed phandles allocation
* s/__be32/uint32_t/ as we do not normally have __be32 type in qemu
* fixed size of /chosen/stdout
* bunch of renames
* do not create rtas properties at all, let the client deal with it;
instead setprop allows changing these in the FDT
* no more packing FDT when bios=off - nobody needs it and getprop does not
work otherwise
* allow updating initramdisk device tree properties (for zImage)
* added instances
* fixed stdout on OF's "write"
* removed special handling for stdout in OF client, spapr-vty handles it
instead

v2:
* fixed claim()
* added "setprop"
* cleaner client interface and RTAS blobs management
* boots to petitboot and further to the target system
* more trace points
---
 hw/ppc/Makefile.objs     |   1 +
 include/hw/ppc/spapr.h   |  28 +-
 hw/ppc/spapr.c           | 266 ++++++++++++++--
 hw/ppc/spapr_hcall.c     |  74 +++--
 hw/ppc/spapr_of_client.c | 633 +++++++++++++++++++++++++++++++++++++++
 hw/ppc/trace-events      |  12 +
 6 files changed, 959 insertions(+), 55 deletions(-)
 create mode 100644 hw/ppc/spapr_of_client.c

Comments

David Gibson Jan. 21, 2020, 5:11 a.m. UTC | #1
On Fri, Jan 10, 2020 at 01:09:25PM +1100, Alexey Kardashevskiy wrote:
> The Petitboot bootloader is way more advanced than SLOF is ever going to
> be as Petitboot comes with the full-featured Linux kernel with all
> the drivers, and initramdisk with quite user friendly interface.
> The problem with ditching SLOF is that an unmodified pseries kernel can
> either start via:
> 1. kexec, this requires presence of RTAS and skips
> ibm,client-architecture-support entirely;
> 2. normal boot, this heavily relies on the OF1275 client interface to
> fetch the device tree and do early setup (claim memory).
> 
> This adds a new bios-less mode to the pseries machine: "bios=on|off".
> When enabled, QEMU does not load SLOF and jumps to the kernel from
> "-kernel".

I don't love the name "bios" for this flag, since BIOS tends to refer
to old-school x86 firmware.  Given the various plans we're considering
for the future, I'd suggest "firmware=slof" for the current in-guest SLOF
mode, and say "firmware=vof" (Virtual Open Firmware) for the new
model.  We can consider firmware=petitboot or firmware=none (for
direct kexec-style boot into -kernel) or whatever in the future.
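
A minimal sketch of how a string-valued option along these lines could map
onto an enum. This is plain C with hypothetical names (`FirmwareMode`,
`parse_firmware_mode`); it does not use the QOM property API and does not
reflect any merged code.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical enum; the value names follow the suggestion above. */
typedef enum {
    FW_SLOF, /* in-guest SLOF, the current behaviour */
    FW_VOF,  /* "Virtual Open Firmware": client interface served by QEMU */
} FirmwareMode;

/* Returns true and stores the mode if 'value' is a known firmware name. */
static bool parse_firmware_mode(const char *value, FirmwareMode *mode)
{
    if (strcmp(value, "slof") == 0) {
        *mode = FW_SLOF;
        return true;
    }
    if (strcmp(value, "vof") == 0) {
        *mode = FW_VOF;
        return true;
    }
    return false; /* unknown value; further modes would be added here */
}
```

A string option leaves room for additional values later, which a boolean
"bios=on|off" cannot express.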

> The client interface is implemented exactly as RTAS - a 20 bytes blob,
> right next after the RTAS blob. The entry point is passed to the kernel
> via GPR5.
> 
> This implements a handful of client interface methods just to get going.
> In particular, this implements the device tree fetching,
> ibm,client-architecture-support and instantiate-rtas.
> 
> This implements changing FDT properties for RTAS (for vmlinux and zImage)
> and initramdisk location (for zImage). To make this work, this skips
> fdt_pack() when bios=off as not packing the blob leaves some room for
> appending.
> 
> This assigns "phandles" to device tree nodes, as SLOF is no longer there
> and it was the addresses of its OF nodes that served as phandle values.
> This keeps predefined nodes (such as XICS/NVLINK/...) unchanged.
> phandles are regenerated at every FDT rebuild.
> 
> When bios=off, this adds "/chosen" every time QEMU builds a tree.
> 
> This implements "claim" which the client (Linux) uses for memory
> allocation; this is also used by QEMU for claiming kernel/initrd images,
> client interface entry point, RTAS and the initial stack.
> 
> While at this, add a "kernel-addr" machine parameter to allow moving
> the kernel in memory. This is useful for debugging if the kernel is
> loaded at @0, although not necessary.
> 
> This adds basic instances support which are managed by a hashmap
> ihandle->[phandle, DeviceState, Chardev].
> 
> Note that a 64bit PCI fix is required for Linux:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735e
> 
> The test command line:
> 
> qemu-system-ppc64 \
> -nodefaults \
> -chardev stdio,id=STDIO0,signal=off,mux=on \
> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> -mon id=MON0,chardev=STDIO0,mode=readline \
> -nographic \
> -vga none \
> -kernel pbuild/kernel-le-guest/arch/powerpc/boot/zImage.pseries \
> -machine pseries,bios=off,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken \
> -m 4G \
> -enable-kvm \
> -initrd pb/rootfs.cpio.xz \
> -device nec-usb-xhci,id=nec-usb-xhci0 \
> -netdev tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0 \
> -device virtio-net-pci,id=vnet0,netdev=TAP0 img/f30le.qcow2 \
> -snapshot \
> -smp 8,threads=8 \
> -trace events=qemu_trace_events \
> -d guest_errors \
> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.ssh54088 \
> -mon chardev=SOCKET0,mode=control
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

It'd be nice to split this patch up a bit, though I'll admit it's not
very obvious where to do so.

> ---
> Changes:
> v5:
> * made instances keep device and chardev pointers
> * removed VIO dependencies
> * print error if RTAS memory is not claimed as it should have been
> * pack FDT as "quiesce"
> 
> v4:
> * fixed open
> * validate ihandles in "call-method"
> 
> v3:
> * fixed phandles allocation
> * s/__be32/uint32_t/ as we do not normally have __be32 type in qemu
> * fixed size of /chosen/stdout
> * bunch of renames
> * do not create rtas properties at all, let the client deal with it;
> instead setprop allows changing these in the FDT
> * no more packing FDT when bios=off - nobody needs it and getprop does not
> work otherwise
> * allow updating initramdisk device tree properties (for zImage)
> * added instances
> * fixed stdout on OF's "write"
> * removed special handling for stdout in OF client, spapr-vty handles it
> instead
> 
> v2:
> * fixed claim()
> * added "setprop"
> * cleaner client interface and RTAS blobs management
> * boots to petitboot and further to the target system
> * more trace points
> ---
>  hw/ppc/Makefile.objs     |   1 +
>  include/hw/ppc/spapr.h   |  28 +-
>  hw/ppc/spapr.c           | 266 ++++++++++++++--
>  hw/ppc/spapr_hcall.c     |  74 +++--
>  hw/ppc/spapr_of_client.c | 633 +++++++++++++++++++++++++++++++++++++++
>  hw/ppc/trace-events      |  12 +
>  6 files changed, 959 insertions(+), 55 deletions(-)
>  create mode 100644 hw/ppc/spapr_of_client.c
> 
> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> index 101e9fc59185..20efc0aa6f9b 100644
> --- a/hw/ppc/Makefile.objs
> +++ b/hw/ppc/Makefile.objs
> @@ -6,6 +6,7 @@ obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
>  obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
>  obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o spapr_irq.o
>  obj-$(CONFIG_PSERIES) += spapr_tpm_proxy.o
> +obj-$(CONFIG_PSERIES) += spapr_of_client.o
>  obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
>  # IBM PowerNV
>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 61f005c6f686..efc2c70abf99 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -105,6 +105,11 @@ struct SpaprCapabilities {
>      uint8_t caps[SPAPR_CAP_NUM];
>  };
>  
> +typedef struct {
> +    uint64_t start;
> +    uint64_t size;
> +} SpaprOfClaimed;
> +

Can we split more of the fake-OF code into a new file?

>  /**
>   * SpaprMachineClass:
>   */
> @@ -160,6 +165,13 @@ struct SpaprMachineState {
>      void *fdt_blob;
>      long kernel_size;
>      bool kernel_le;
> +    uint64_t kernel_addr;
> +    bool bios_enabled;
> +    uint32_t rtas_base;
> +    GArray *claimed; /* array of SpaprOfClaimed */
> +    uint64_t claimed_base;
> +    GHashTable *of_instances; /* ihandle -> SpaprOfInstance */
> +    uint32_t of_instance_last;
>      uint32_t initrd_base;
>      long initrd_size;
>      uint64_t rtc_offset; /* Now used only during incoming migration */
> @@ -510,7 +522,8 @@ struct SpaprMachineState {
>  /* Client Architecture support */
>  #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
>  #define KVMPPC_H_UPDATE_DT      (KVMPPC_HCALL_BASE + 0x3)
> -#define KVMPPC_HCALL_MAX        KVMPPC_H_UPDATE_DT
> +#define KVMPPC_H_CLIENT         (KVMPPC_HCALL_BASE + 0x5)
> +#define KVMPPC_HCALL_MAX        KVMPPC_H_CLIENT
>  
>  /*
>   * The hcall range 0xEF00 to 0xEF80 is reserved for use in facilitating
> @@ -538,6 +551,11 @@ void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
>  target_ulong spapr_hypercall(PowerPCCPU *cpu, target_ulong opcode,
>                               target_ulong *args);
>  
> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
> +                                            SpaprMachineState *spapr,
> +                                            target_ulong addr,
> +                                            target_ulong fdt_bufsize);
> +
>  /* Virtual Processor Area structure constants */
>  #define VPA_MIN_SIZE           640
>  #define VPA_SIZE_OFFSET        0x4
> @@ -769,6 +787,11 @@ struct SpaprEventLogEntry {
>  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
>  void spapr_events_init(SpaprMachineState *sm);
>  void spapr_dt_events(SpaprMachineState *sm, void *fdt);
> +uint64_t spapr_do_of_client_claim(SpaprMachineState *spapr, uint64_t virt,
> +                                  uint64_t size, uint64_t align);
> +
> +uint32_t spapr_of_client_open(SpaprMachineState *spapr, const char *path);
> +int spapr_h_client(SpaprMachineState *spapr, target_ulong client_args);
>  void close_htab_fd(SpaprMachineState *spapr);
>  void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
>  void spapr_free_hpt(SpaprMachineState *spapr);
> @@ -891,4 +914,7 @@ void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
>  #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
>  
>  void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
> +
> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base);
> +
>  #endif /* HW_SPAPR_H */
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index e62c89b3dd40..76ce8b973082 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -896,6 +896,55 @@ out:
>      return ret;
>  }
>  
> +/*
> + * Below is a compiled version of RTAS blob and OF client interface entry point.
> + *
> + * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
> + * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
> + *
> + *   .globl  _start
> + *   _start:
> + *           mr      4,3
> + *           lis     3,KVMPPC_H_RTAS@h
> + *           ori     3,3,KVMPPC_H_RTAS@l
> + *           sc      1
> + *           blr
> + *           mr      4,3
> + *           lis     3,KVMPPC_H_CLIENT@h
> + *           ori     3,3,KVMPPC_H_CLIENT@l
> + *           sc      1
> + *           blr
> + */
> +static struct {

Should be able to add a 'const' here.

> +    uint8_t rtas[20], client[20];
> +} QEMU_PACKED rtas_client_blob = {
> +    .rtas = {
> +        0x7c, 0x64, 0x1b, 0x78,
> +        0x3c, 0x60, 0x00, 0x00,
> +        0x60, 0x63, 0xf0, 0x00,
> +        0x44, 0x00, 0x00, 0x22,
> +        0x4e, 0x80, 0x00, 0x20
> +    },
> +    .client = {
> +        0x7c, 0x64, 0x1b, 0x78,
> +        0x3c, 0x60, 0x00, 0x00,
> +        0x60, 0x63, 0xf0, 0x05,
> +        0x44, 0x00, 0x00, 0x22,
> +        0x4e, 0x80, 0x00, 0x20
> +    }
> +};

I'd split this into two variables - there's not really any connection
between the two, AFAICT.

Note that I'm getting closer to merging the fwnmi stuff at which point
you'll need to pad the RTAS blob with a bunch of extra space for
taking the fwnmi dumps.
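
One possible shape for the split and the 'const' suggested above, as a
sketch only. The bytes are copied from the patch; the array names
`rtas_blob` and `of_client_blob` and the helper `blob_hcall` are
hypothetical. The helper rebuilds the hcall number from the immediates of
the lis/ori pair, which shows that the two stubs differ only in the number
they load into r3.

```c
#include <stdint.h>

/* Same bytes as in the patch, split into two read-only arrays. */
static const uint8_t rtas_blob[20] = {
    0x7c, 0x64, 0x1b, 0x78, /* mr   4,3 */
    0x3c, 0x60, 0x00, 0x00, /* lis  3,KVMPPC_H_RTAS@h */
    0x60, 0x63, 0xf0, 0x00, /* ori  3,3,KVMPPC_H_RTAS@l */
    0x44, 0x00, 0x00, 0x22, /* sc   1 */
    0x4e, 0x80, 0x00, 0x20, /* blr */
};

static const uint8_t of_client_blob[20] = {
    0x7c, 0x64, 0x1b, 0x78, /* mr   4,3 */
    0x3c, 0x60, 0x00, 0x00, /* lis  3,KVMPPC_H_CLIENT@h */
    0x60, 0x63, 0xf0, 0x05, /* ori  3,3,KVMPPC_H_CLIENT@l */
    0x44, 0x00, 0x00, 0x22, /* sc   1 */
    0x4e, 0x80, 0x00, 0x20, /* blr */
};

/*
 * Rebuild the hcall number from the 16-bit immediates of the second
 * (lis) and third (ori) big-endian instruction words of a stub.
 */
static uint32_t blob_hcall(const uint8_t blob[20])
{
    uint32_t hi = ((uint32_t)blob[6] << 8) | blob[7];   /* lis immediate */
    uint32_t lo = ((uint32_t)blob[10] << 8) | blob[11]; /* ori immediate */

    return (hi << 16) | lo;
}
```

With two separate arrays, padding the RTAS one later would not move or
resize the client interface stub.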

> +
> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base)
> +{
> +    if (spapr_do_of_client_claim(spapr, base, sizeof(rtas_client_blob.rtas),
> +                                 0) != -1) {

Wait.. == -1 is the success case?  That's a very surprising interface.
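
Elsewhere in this patch -1 is treated as the failure value of the claim
call, and the v5 changelog says "print error if RTAS memory is not claimed
as it should have been", so the check here appears to fire when the claim
unexpectedly succeeds. A separate predicate might make that intent easier to
read. The sketch below is self-contained C under those assumptions;
`ClaimMap`, `is_claimed` and `claim_fixed` are hypothetical names and this is
not the code from spapr_of_client.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CLAIM_FAILED ((uint64_t)-1)
#define MAX_CLAIMS   16

typedef struct {
    uint64_t start;
    uint64_t size;
} Claimed; /* mirrors SpaprOfClaimed from the patch */

typedef struct {
    Claimed ranges[MAX_CLAIMS];
    size_t count;
} ClaimMap;

/* True if [start, start + size) overlaps an already claimed range. */
static bool is_claimed(const ClaimMap *map, uint64_t start, uint64_t size)
{
    for (size_t i = 0; i < map->count; ++i) {
        const Claimed *c = &map->ranges[i];

        if (start < c->start + c->size && c->start < start + size) {
            return true;
        }
    }
    return false;
}

/* Claim a fixed range; returns its start, or CLAIM_FAILED if unavailable. */
static uint64_t claim_fixed(ClaimMap *map, uint64_t start, uint64_t size)
{
    if (map->count == MAX_CLAIMS || is_claimed(map, start, size)) {
        return CLAIM_FAILED;
    }
    map->ranges[map->count++] = (Claimed){ start, size };
    return start;
}
```

With such a helper the RTAS path could test "is this range claimed?"
directly instead of attempting a claim and treating success as the error.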

> +        error_report("The OF client did not claim RTAS memory at 0x%x", base);

Error message is hard to follow.  Maybe "Could not claim memory for RTAS"

> +    }
> +    spapr->rtas_base = base;
> +    cpu_physical_memory_write(base, rtas_client_blob.rtas,
> +                              sizeof(rtas_client_blob.rtas));
> +}
> +
>  static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>  {
>      MachineState *ms = MACHINE(spapr);
> @@ -980,6 +1029,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>      _FDT(fdt_setprop(fdt, rtas, "ibm,lrdr-capacity",
>                       lrdr_capacity, sizeof(lrdr_capacity)));
>  
> +    if (!spapr->bios_enabled) {
> +        _FDT(fdt_setprop_cell(fdt, rtas, "rtas-size",
> +                              sizeof(rtas_client_blob.rtas)));
> +    }
> +
>      spapr_dt_rtas_tokens(fdt, rtas);
>  }
>  
> @@ -1057,7 +1111,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
>      }
>  
>      if (spapr->kernel_size) {
> -        uint64_t kprop[2] = { cpu_to_be64(KERNEL_LOAD_ADDR),
> +        uint64_t kprop[2] = { cpu_to_be64(spapr->kernel_addr),

Hrm, I really think I would like to see the change to adjustable
kernel_addr split out - it puts a bunch of noise into the main kill
slof patch.

>                                cpu_to_be64(spapr->kernel_size) };
>  
>          _FDT(fdt_setprop(fdt, chosen, "qemu,boot-kernel",
> @@ -1245,7 +1299,8 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>      /* Build memory reserve map */
>      if (reset) {
>          if (spapr->kernel_size) {
> -            _FDT((fdt_add_mem_rsv(fdt, KERNEL_LOAD_ADDR, spapr->kernel_size)));
> +            _FDT((fdt_add_mem_rsv(fdt, spapr->kernel_addr,
> +                                  spapr->kernel_size)));
>          }
>          if (spapr->initrd_size) {
>              _FDT((fdt_add_mem_rsv(fdt, spapr->initrd_base,
> @@ -1268,12 +1323,56 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>          }
>      }
>  
> +    if (!spapr->bios_enabled) {
> +        uint32_t phandle;
> +        int i, offset, proplen = 0;
> +        const void *prop;
> +        bool found = false;
> +        GArray *phandles = g_array_new(false, false, sizeof(uint32_t));
> +
> +        /* Find all predefined phandles */
> +        for (offset = fdt_next_node(fdt, -1, NULL);
> +             offset >= 0;
> +             offset = fdt_next_node(fdt, offset, NULL)) {
> +            prop = fdt_getprop_namelen(fdt, offset, "phandle", 7, &proplen);

You can just use fdt_getprop() rather than the namelen variant (that's
only really useful when you don't have a \0-terminated string with the
name).

> +            if (prop && proplen == sizeof(uint32_t)) {
> +                phandle = fdt32_ld(prop);
> +                g_array_append_val(phandles, phandle);
> +            }
> +        }
> +
> +        /* Assign phandles skipping the predefined ones */
> +        for (offset = fdt_next_node(fdt, -1, NULL), phandle = 1;
> +             offset >= 0;
> +             offset = fdt_next_node(fdt, offset, NULL), ++phandle) {
> +            prop = fdt_getprop_namelen(fdt, offset, "phandle", 7, &proplen);
> +            if (prop) {
> +                continue;
> +            }
> +            /* Check if the current phandle is not allocated already */
> +            for ( ; ; ++phandle) {
> +                for (i = 0, found = false; i < phandles->len; ++i) {
> +                    if (phandle == g_array_index(phandles, uint32_t, i)) {
> +                        found = true;
> +                        break;
> +                    }
> +                }
> +                if (!found) {
> +                    break;
> +                }
> +            }
> +            _FDT(fdt_setprop_cell(fdt, offset, "phandle", phandle));
> +        }
> +        g_array_unref(phandles);
> +    }
> +
>      return fdt;
>  }
>  
>  static uint64_t translate_kernel_address(void *opaque, uint64_t addr)
>  {
> -    return (addr & 0x0fffffff) + KERNEL_LOAD_ADDR;
> +    SpaprMachineState *spapr = opaque;
> +    return (addr & 0x0fffffff) + spapr->kernel_addr;
>  }
>  
>  static void emulate_spapr_hypercall(PPCVirtualHypervisor *vhyp,
> @@ -1660,24 +1759,89 @@ static void spapr_machine_reset(MachineState *machine)
>       */
>      fdt_addr = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FDT_MAX_SIZE;
>  
> +    /* Set up the entry state */
> +    if (!spapr->bios_enabled) {
> +        if (spapr->claimed) {
> +            g_array_unref(spapr->claimed);
> +        }
> +        if (spapr->of_instances) {
> +            g_hash_table_unref(spapr->of_instances);
> +        }
> +
> +        spapr->claimed = g_array_new(false, false, sizeof(SpaprOfClaimed));
> +        spapr->of_instances = g_hash_table_new(g_direct_hash, g_direct_equal);
> +
> +        spapr->claimed_base = 0x10000; /* Avoid using the first system page */
> +
> +        spapr_cpu_set_entry_state(first_ppc_cpu, spapr->kernel_addr,
> +                                  spapr->initrd_base);
> +        first_ppc_cpu->env.gpr[4] = spapr->initrd_size;
> +
> +        if (spapr_do_of_client_claim(spapr, spapr->kernel_addr,
> +                                  spapr->kernel_size, 0) == -1) {
> +            error_report("Memory for kernel is in use");
> +            exit(1);
> +        }
> +        if (spapr_do_of_client_claim(spapr, spapr->initrd_base,
> +                                  spapr->initrd_size, 0) == -1) {
> +            error_report("Memory for initramdisk is in use");
> +            exit(1);
> +        }
> +        first_ppc_cpu->env.gpr[1] = spapr_do_of_client_claim(spapr, 0, 0x40000,
> +                                                             0x10000);
> +        if (first_ppc_cpu->env.gpr[1] == -1) {
> +            error_report("Memory allocation for stack failed");
> +            exit(1);
> +        }
> +
> +        first_ppc_cpu->env.gpr[5] =
> +            spapr_do_of_client_claim(spapr, 0, sizeof(rtas_client_blob.client),
> +                                     sizeof(rtas_client_blob.client));
> +        if (first_ppc_cpu->env.gpr[5] == -1) {
> +            error_report("Memory allocation for OF client failed");
> +            exit(1);
> +        }
> +        cpu_physical_memory_write(first_ppc_cpu->env.gpr[5],
> +                                  rtas_client_blob.client,
> +                                  sizeof(rtas_client_blob.client));
> +    } else {
> +        spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
> +        first_ppc_cpu->env.gpr[5] = 0; /* 0 = kexec !0 = prom_init */
> +    }
> +
>      fdt = spapr_build_fdt(spapr, true, FDT_MAX_SIZE);
>  
> -    rc = fdt_pack(fdt);
> -
> -    /* Should only fail if we've built a corrupted tree */
> -    assert(rc == 0);
> -
> -    /* Load the fdt */
> -    qemu_fdt_dumpdtb(fdt, fdt_totalsize(fdt));
> -    cpu_physical_memory_write(fdt_addr, fdt, fdt_totalsize(fdt));
>      g_free(spapr->fdt_blob);
>      spapr->fdt_size = fdt_totalsize(fdt);
>      spapr->fdt_initial_size = spapr->fdt_size;
>      spapr->fdt_blob = fdt;
>  
> -    /* Set up the entry state */
> -    spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
> -    first_ppc_cpu->env.gpr[5] = 0;
> +    if (spapr->bios_enabled) {
> +        /* Load the fdt */
> +        rc = fdt_pack(spapr->fdt_blob);
> +        /* Should only fail if we've built a corrupted tree */
> +        assert(rc == 0);
> +
> +        spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
> +        spapr->fdt_initial_size = spapr->fdt_size;
> +        qemu_fdt_dumpdtb(spapr->fdt_blob, spapr->fdt_size);

I think we should still have a dumpdtb call on the !bios path.

> +        cpu_physical_memory_write(fdt_addr, spapr->fdt_blob, spapr->fdt_size);
> +    } else {
> +        char *stdout_path = spapr_vio_stdout_path(spapr->vio_bus);
> +        int offset = fdt_path_offset(fdt, "/chosen");
> +
> +        /*
> +         * SLOF-less setup requires an open instance of stdout for early
> +         * kernel printk. By now all phandles are settled so we can open
> +         * the default serial console.
> +         * We skip writing FDT as nothing expects it; OF client interface is
> +         * going to be used for reading the device tree.
> +         */
> +        if (stdout_path) {
> +            _FDT(fdt_setprop_cell(fdt, offset, "stdout",
> +                                  spapr_of_client_open(spapr, stdout_path)));
> +        }
> +    }
>  
>      spapr->cas_reboot = false;
>  }
> @@ -2897,12 +3061,12 @@ static void spapr_machine_init(MachineState *machine)
>          uint64_t lowaddr = 0;
>  
>          spapr->kernel_size = load_elf(kernel_filename, NULL,
> -                                      translate_kernel_address, NULL,
> +                                      translate_kernel_address, spapr,
>                                        NULL, &lowaddr, NULL, 1,
>                                        PPC_ELF_MACHINE, 0, 0);
>          if (spapr->kernel_size == ELF_LOAD_WRONG_ENDIAN) {
>              spapr->kernel_size = load_elf(kernel_filename, NULL,
> -                                          translate_kernel_address, NULL, NULL,
> +                                          translate_kernel_address, spapr, NULL,
>                                            &lowaddr, NULL, 0, PPC_ELF_MACHINE,
>                                            0, 0);
>              spapr->kernel_le = spapr->kernel_size > 0;
> @@ -2918,7 +3082,7 @@ static void spapr_machine_init(MachineState *machine)
>              /* Try to locate the initrd in the gap between the kernel
>               * and the firmware. Add a bit of space just in case
>               */
> -            spapr->initrd_base = (KERNEL_LOAD_ADDR + spapr->kernel_size
> +            spapr->initrd_base = (spapr->kernel_addr + spapr->kernel_size
>                                    + 0x1ffff) & ~0xffff;
>              spapr->initrd_size = load_image_targphys(initrd_filename,
>                                                       spapr->initrd_base,
> @@ -2932,20 +3096,22 @@ static void spapr_machine_init(MachineState *machine)
>          }
>      }
>  
> -    if (bios_name == NULL) {
> -        bios_name = FW_FILE_NAME;
> +    if (spapr->bios_enabled) {
> +        if (bios_name == NULL) {
> +            bios_name = FW_FILE_NAME;
> +        }
> +        filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> +        if (!filename) {
> +            error_report("Could not find LPAR firmware '%s'", bios_name);
> +            exit(1);
> +        }
> +        fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> +        if (fw_size <= 0) {
> +            error_report("Could not load LPAR firmware '%s'", filename);
> +            exit(1);
> +        }
> +        g_free(filename);
>      }
> -    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> -    if (!filename) {
> -        error_report("Could not find LPAR firmware '%s'", bios_name);
> -        exit(1);
> -    }
> -    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> -    if (fw_size <= 0) {
> -        error_report("Could not load LPAR firmware '%s'", filename);
> -        exit(1);
> -    }
> -    g_free(filename);
>  
>      /* FIXME: Should register things through the MachineState's qdev
>       * interface, this is a legacy from the sPAPREnvironment structure
> @@ -3162,6 +3328,32 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
>      visit_type_uint32(v, name, (uint32_t *)opaque, errp);
>  }
>  
> +static void spapr_get_kernel_addr(Object *obj, Visitor *v, const char *name,
> +                                  void *opaque, Error **errp)
> +{
> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
> +}
> +
> +static void spapr_set_kernel_addr(Object *obj, Visitor *v, const char *name,
> +                                  void *opaque, Error **errp)
> +{
> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
> +}
> +
> +static bool spapr_get_bios_enabled(Object *obj, Error **errp)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> +
> +    return spapr->bios_enabled;
> +}
> +
> +static void spapr_set_bios_enabled(Object *obj, bool value, Error **errp)
> +{
> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> +
> +    spapr->bios_enabled = value;
> +}
> +
>  static char *spapr_get_ic_mode(Object *obj, Error **errp)
>  {
>      SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> @@ -3267,6 +3459,20 @@ static void spapr_instance_init(Object *obj)
>      object_property_add_bool(obj, "vfio-no-msix-emulation",
>                               spapr_get_msix_emulation, NULL, NULL);
>  
> +    object_property_add(obj, "kernel-addr", "uint64", spapr_get_kernel_addr,
> +                        spapr_set_kernel_addr, NULL, &spapr->kernel_addr,
> +                        &error_abort);
> +    object_property_set_description(obj, "kernel-addr",
> +                                    stringify(KERNEL_LOAD_ADDR)
> +                                    " for -kernel is the default",
> +                                    NULL);
> +    spapr->kernel_addr = KERNEL_LOAD_ADDR;
> +    object_property_add_bool(obj, "bios", spapr_get_bios_enabled,
> +                            spapr_set_bios_enabled, NULL);
> +    object_property_set_description(obj, "bios", "Controls whether to load bios",
> +                                    NULL);
> +    spapr->bios_enabled = true;
> +
>      /* The machine class defines the default interrupt controller mode */
>      spapr->irq = smc->irq;
>      object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index f1799b1b707d..f2d8823d2c3a 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1660,15 +1660,11 @@ static bool spapr_hotplugged_dev_before_cas(void)
>      return false;
>  }
>  
> -static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> -                                                  SpaprMachineState *spapr,
> -                                                  target_ulong opcode,
> -                                                  target_ulong *args)
> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
> +                                            SpaprMachineState *spapr,
> +                                            target_ulong addr,
> +                                            target_ulong fdt_bufsize)
>  {
> -    /* Working address in data buffer */
> -    target_ulong addr = ppc64_phys_to_real(args[0]);
> -    target_ulong fdt_buf = args[1];
> -    target_ulong fdt_bufsize = args[2];
>      target_ulong ov_table;
>      uint32_t cas_pvr;
>      SpaprOptionVector *ov1_guest, *ov5_guest, *ov5_cas_old;
> @@ -1816,7 +1812,6 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>  
>      if (!spapr->cas_reboot) {
>          void *fdt;
> -        SpaprDeviceTreeUpdateHeader hdr = { .version_id = 1 };
>  
>          /* If spapr_machine_reset() did not set up a HPT but one is necessary
>           * (because the guest isn't going to use radix) then set it up here. */
> @@ -1825,21 +1820,7 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>              spapr_setup_hpt_and_vrma(spapr);
>          }
>  
> -        if (fdt_bufsize < sizeof(hdr)) {
> -            error_report("SLOF provided insufficient CAS buffer "
> -                         TARGET_FMT_lu " (min: %zu)", fdt_bufsize, sizeof(hdr));
> -            exit(EXIT_FAILURE);
> -        }
> -
> -        fdt_bufsize -= sizeof(hdr);
> -
> -        fdt = spapr_build_fdt(spapr, false, fdt_bufsize);
> -        _FDT((fdt_pack(fdt)));
> -
> -        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
> -        cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
> -                                  fdt_totalsize(fdt));
> -        trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
> +        fdt = spapr_build_fdt(spapr, !spapr->bios_enabled, fdt_bufsize);
>  
>          g_free(spapr->fdt_blob);
>          spapr->fdt_size = fdt_totalsize(fdt);
> @@ -1854,6 +1835,41 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>      return H_SUCCESS;
>  }
>  
> +static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> +                                                  SpaprMachineState *spapr,
> +                                                  target_ulong opcode,
> +                                                  target_ulong *args)
> +{
> +    /* Working address in data buffer */
> +    target_ulong addr = ppc64_phys_to_real(args[0]);
> +    target_ulong fdt_buf = args[1];
> +    target_ulong fdt_bufsize = args[2];
> +    target_ulong ret;
> +    SpaprDeviceTreeUpdateHeader hdr = { .version_id = 1 };
> +
> +    if (fdt_bufsize < sizeof(hdr)) {
> +        error_report("SLOF provided insufficient CAS buffer "
> +                     TARGET_FMT_lu " (min: %zu)", fdt_bufsize, sizeof(hdr));
> +        exit(EXIT_FAILURE);
> +    }
> +
> +    fdt_bufsize -= sizeof(hdr);
> +
> +    ret = do_client_architecture_support(cpu, spapr, addr, fdt_bufsize);
> +    if (ret == H_SUCCESS) {
> +        _FDT((fdt_pack(spapr->fdt_blob)));
> +        spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
> +        spapr->fdt_initial_size = spapr->fdt_size;
> +
> +        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
> +        cpu_physical_memory_write(fdt_buf + sizeof(hdr), spapr->fdt_blob,
> +                                  spapr->fdt_size);
> +        trace_spapr_cas_continue(spapr->fdt_size + sizeof(hdr));
> +    }
> +
> +    return ret;
> +}
> +
>  static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>                                                SpaprMachineState *spapr,
>                                                target_ulong opcode,
> @@ -1998,6 +2014,14 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +static target_ulong h_client(PowerPCCPU *cpu, SpaprMachineState *spapr,
> +                             target_ulong opcode, target_ulong *args)

As I said in an earlier revision, please expand these names beyond just
"client", for readability by people who aren't already thinking about
Open Firmware.

> +{
> +    target_ulong client_args = ppc64_phys_to_real(args[0]);
> +
> +    return spapr_h_client(spapr, client_args);
> +}
> +
>  static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
>  static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - KVMPPC_HCALL_BASE + 1];
>  static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 4 + 1];
> @@ -2121,6 +2145,8 @@ static void hypercall_register_types(void)
>  
>      spapr_register_hypercall(KVMPPC_H_UPDATE_DT, h_update_dt);
>  
> +    spapr_register_hypercall(KVMPPC_H_CLIENT, h_client);
> +
>      /* Virtual Processor Home Node */
>      spapr_register_hypercall(H_HOME_NODE_ASSOCIATIVITY,
>                               h_home_node_associativity);
> diff --git a/hw/ppc/spapr_of_client.c b/hw/ppc/spapr_of_client.c
> new file mode 100644
> index 000000000000..24d854b76e51
> --- /dev/null
> +++ b/hw/ppc/spapr_of_client.c

I'd suggest expanding this file to cover as much as you can of the
virtual OF stuff, not just the client interface.

> @@ -0,0 +1,633 @@
> +#include "qemu/osdep.h"
> +#include "qemu-common.h"
> +#include "qapi/error.h"
> +#include "exec/memory.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/ppc/spapr_vio.h"
> +#include "chardev/char.h"
> +#include "qom/qom-qobject.h"
> +#include "trace.h"
> +
> +typedef struct {
> +    DeviceState *dev;
> +    Chardev *cdev;
> +    uint32_t phandle;
> +} SpaprOfInstance;
> +
> +/*
> + * OF 1275 "nextprop" description suggests is it 32 bytes max but
> + * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 chars long.
> + */
> +#define OF_PROPNAME_LEN_MAX 64
> +
> +/* Defined as Big Endian */
> +struct prom_args {
> +    uint32_t service;
> +    uint32_t nargs;
> +    uint32_t nret;
> +    uint32_t args[10];
> +};
> +
> +static void readstr(hwaddr pa, char *buf, int size)
> +{
> +    cpu_physical_memory_read(pa, buf, size - 1);
> +    buf[size - 1] = 0;
> +}

I'd still like to see this return some kind of error if it had to
truncate what was passed by the client.
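
As a rough, untested sketch of what I mean: the helper could copy up to
the full buffer, look for the terminator, and return an error when it is
missing. The name readstr_checked() and the flat guest_mem array are made
up for illustration; the real helper would keep reading through
cpu_physical_memory_read().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical variant of readstr(): returns 0 if the whole string
 * (including its NUL) fit into buf, -1 if it had to be truncated or the
 * address is out of range. guest_mem/guest_len stand in for guest RAM.
 */
static int readstr_checked(const uint8_t *guest_mem, size_t guest_len,
                           size_t pa, char *buf, size_t size)
{
    size_t avail, n;

    assert(guest_mem && buf);
    if (size == 0 || pa >= guest_len) {
        return -1;
    }
    avail = guest_len - pa;
    n = avail < size ? avail : size;
    memcpy(buf, guest_mem + pa, n);

    /* A NUL within the copied bytes means the string fit entirely. */
    if (memchr(buf, 0, n)) {
        return 0;
    }

    /* No terminator seen: terminate what we have and report it. */
    buf[n - 1] = 0;
    return -1;
}
```

Callers such as finddevice or open could then fail the request instead
of silently looking up a shortened path.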

> +
> +static bool _cmpservice(const char *s, size_t len,

Don't use leading _ please - in userland those are reserved for the
system libraries.

> +                        unsigned nargs, unsigned nret,
> +                        const char *s1, size_t len1,
> +                        unsigned nargscheck, unsigned nretcheck)
> +{
> +    if (strcmp(s, s1)) {
> +        return false;
> +    }
> +    if (nargscheck == 0 && nretcheck == 0) {
> +        return true;
> +    }
> +    if (nargs != nargscheck || nret != nretcheck) {
> +        trace_spapr_of_client_error_param(s, nargscheck, nretcheck, nargs,
> +                                          nret);
> +        return false;
> +    }
> +
> +    return true;
> +}
> +
> +static uint32_t of_client_finddevice(const void *fdt, uint32_t nodeaddr)
> +{
> +    char node[256];

Is 256 enough?  OF paths can get pretty long...

> +    int ret;
> +
> +    readstr(nodeaddr, node, sizeof(node));
> +    ret = fdt_path_offset(fdt, node);
> +    if (ret >= 0) {
> +        ret = fdt_get_phandle(fdt, ret);
> +    }
> +
> +    return (uint32_t) ret;
> +}
> +
> +static uint32_t of_client_getprop(const void *fdt, uint32_t nodeph,
> +                                  uint32_t pname, uint32_t valaddr,
> +                                  uint32_t vallen)
> +{
> +    char propname[OF_PROPNAME_LEN_MAX + 1];
> +    uint32_t ret = 0;
> +    int proplen = 0;
> +    const void *prop;
> +
> +    readstr(pname, propname, sizeof(propname));
> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
> +                               propname, strlen(propname), &proplen);

Again, you don't need _namelen.

> +    if (prop) {
> +        int cb = MIN(proplen, vallen);
> +
> +        cpu_physical_memory_write(valaddr, prop, cb);
> +        ret = cb;

If I'm reading 1275 correctly, the return value should be the
untruncated length of the property.
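
Something like the following is what I have in mind, assuming my reading
of 1275 is right. It is only a sketch: getprop_copy() is a made-up name
and dst stands in for the guest buffer that the patch writes with
cpu_physical_memory_write().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy at most vallen bytes of the property into the client buffer, but
 * report the full property length so the client can detect truncation.
 * Returns (uint32_t)-1 if the property does not exist.
 */
static uint32_t getprop_copy(const void *prop, int proplen,
                             void *dst, uint32_t vallen)
{
    uint32_t cb;

    if (!prop || proplen < 0) {
        return (uint32_t)-1;
    }
    cb = (uint32_t)proplen < vallen ? (uint32_t)proplen : vallen;
    if (cb) {
        assert(dst);
        memcpy(dst, prop, cb);
    }

    return (uint32_t)proplen; /* untruncated length, not cb */
}
```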

> +    } else {
> +        ret = -1;
> +    }
> +    trace_spapr_of_client_getprop(nodeph, propname, ret);
> +
> +    return ret;
> +}
> +
> +static uint32_t of_client_getproplen(const void *fdt, uint32_t nodeph,
> +                                     uint32_t pname)
> +{
> +    char propname[OF_PROPNAME_LEN_MAX + 1];
> +    uint32_t ret = 0;
> +    int proplen = 0;
> +    const void *prop;
> +
> +    readstr(pname, propname, sizeof(propname));
> +
> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
> +                               propname, strlen(propname), &proplen);

No _namelen.

> +    if (prop) {
> +        ret = proplen;
> +    } else {
> +        ret = -1;
> +    }
> +
> +    return ret;
> +}
> +
> +static uint32_t of_client_setprop(SpaprMachineState *spapr,
> +                                  uint32_t nodeph, uint32_t pname,
> +                                  uint32_t valaddr, uint32_t vallen)
> +{
> +    char propname[OF_PROPNAME_LEN_MAX + 1];
> +    uint32_t ret = -1;
> +    int offset;

A comment noting that you're only allowing a very restricted set of
setprops would be good.

> +    readstr(pname, propname, sizeof(propname));
> +    if (vallen == sizeof(uint32_t)) {
> +        uint32_t val32 = ldl_be_phys(first_cpu->as, valaddr);
> +
> +        if ((strcmp(propname, "linux,rtas-base") == 0) ||
> +            (strcmp(propname, "linux,rtas-entry") == 0)) {
> +            spapr->rtas_base = val32;
> +        } else if (strcmp(propname, "linux,initrd-start") == 0) {
> +            spapr->initrd_base = val32;
> +        } else if (strcmp(propname, "linux,initrd-end") == 0) {
> +            spapr->initrd_size = val32 - spapr->initrd_base;
> +        } else {
> +            goto trace_exit;
> +        }
> +    } else if (vallen == sizeof(uint64_t)) {
> +        uint64_t val64 = ldq_be_phys(first_cpu->as, valaddr);
> +
> +        if (strcmp(propname, "linux,initrd-start") == 0) {
> +            spapr->initrd_base = val64;
> +        } else if (strcmp(propname, "linux,initrd-end") == 0) {
> +            spapr->initrd_size = val64 - spapr->initrd_base;
> +        } else {
> +            goto trace_exit;
> +        }
> +    } else {
> +        goto trace_exit;
> +    }
> +
> +    offset = fdt_node_offset_by_phandle(spapr->fdt_blob, nodeph);
> +    if (offset >= 0) {
> +        uint8_t data[vallen];
> +
> +        cpu_physical_memory_read(valaddr, data, vallen);
> +        if (!fdt_setprop(spapr->fdt_blob, offset, propname, data, vallen)) {
> +            ret = vallen;
> +        }
> +    }
> +
> +trace_exit:
> +    trace_spapr_of_client_setprop(nodeph, propname, ret);
> +
> +    return ret;
> +}
> +
> +static uint32_t of_client_nextprop(const void *fdt, uint32_t phandle,
> +                                   uint32_t prevaddr, uint32_t nameaddr)
> +{
> +    int offset = fdt_node_offset_by_phandle(fdt, phandle);
> +    char prev[OF_PROPNAME_LEN_MAX + 1];
> +    const char *tmp;
> +
> +    readstr(prevaddr, prev, sizeof(prev));
> +    for (offset = fdt_first_property_offset(fdt, offset);
> +         offset >= 0;
> +         offset = fdt_next_property_offset(fdt, offset)) {
> +
> +        if (!fdt_getprop_by_offset(fdt, offset, &tmp, NULL)) {
> +            return 0;
> +        }
> +        if (prev[0] == '\0' || strcmp(prev, tmp) == 0) {
> +            if (prev[0] != '\0') {
> +                offset = fdt_next_property_offset(fdt, offset);
> +                if (offset < 0) {
> +                    return 0;
> +                }
> +            }
> +            if (!fdt_getprop_by_offset(fdt, offset, &tmp, NULL)) {
> +                return 0;
> +            }
> +            cpu_physical_memory_write(nameaddr, tmp, strlen(tmp) + 1);
> +            return 1;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static uint32_t of_client_peer(const void *fdt, uint32_t phandle)
> +{
> +    int ret;
> +
> +    if (phandle == 0) {
> +        ret = fdt_path_offset(fdt, "/");
> +    } else {
> +        ret = fdt_next_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> +    }
> +
> +    if (ret < 0) {
> +        ret = 0;
> +    } else {
> +        ret = fdt_get_phandle(fdt, ret);
> +    }
> +
> +    return ret;
> +}
> +
> +static uint32_t of_client_child(const void *fdt, uint32_t phandle)
> +{
> +    int ret = fdt_first_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> +
> +    if (ret < 0) {
> +        ret = 0;
> +    } else {
> +        ret = fdt_get_phandle(fdt, ret);
> +    }
> +
> +    return ret;
> +}
> +
> +static uint32_t of_client_parent(const void *fdt, uint32_t phandle)
> +{
> +    int ret = fdt_parent_offset(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> +
> +    if (ret < 0) {
> +        ret = 0;
> +    } else {
> +        ret = fdt_get_phandle(fdt, ret);
> +    }
> +
> +    return ret;
> +}
> +
> +static DeviceState *of_client_find_qom_dev(BusState *bus, const char *path)
> +{
> +    BusChild *kid;
> +
> +    QTAILQ_FOREACH(kid, &bus->children, sibling) {
> +        const char *p = qdev_get_fw_dev_path(kid->child);
> +        BusState *child;
> +
> +        if (p && strcmp(path, p) == 0) {
> +            return kid->child;
> +        }
> +        QLIST_FOREACH(child, &kid->child->child_bus, sibling) {
> +            DeviceState *d = of_client_find_qom_dev(child, path);
> +
> +            if (d) {
> +                return d;
> +            }
> +        }
> +    }
> +    return NULL;
> +}
> +
> +uint32_t spapr_of_client_open(SpaprMachineState *spapr, const char *path)
> +{
> +    int offset;
> +    uint32_t ret = 0;
> +    SpaprOfInstance *inst;
> +
> +    if (spapr->of_instance_last == 0xFFFFFFFF) {
> +        /* We do not recycle ihandles yet */
> +        goto trace_exit;
> +    }
> +    offset = fdt_path_offset(spapr->fdt_blob, path);
> +    if (offset < 0) {
> +        trace_spapr_of_client_error_unknown_path(path);
> +        goto trace_exit;
> +    }
> +
> +    inst = g_new(SpaprOfInstance, 1);
> +    inst->phandle = fdt_get_phandle(spapr->fdt_blob, offset);
> +    g_assert(inst->phandle);
> +    ++spapr->of_instance_last;
> +    inst->dev = of_client_find_qom_dev(sysbus_get_default(), path);
> +    g_hash_table_insert(spapr->of_instances,
> +                        GINT_TO_POINTER(spapr->of_instance_last),
> +                        inst);
> +    ret = spapr->of_instance_last;
> +
> +    if (inst->dev) {
> +        const char *cdevstr = object_property_get_str(OBJECT(inst->dev),
> +                                                      "chardev", NULL);
> +
> +        if (cdevstr) {
> +            inst->cdev = qemu_chr_find(cdevstr);
> +        }
> +    }
> +
> +trace_exit:
> +    trace_spapr_of_client_open(path, inst ? inst->phandle : 0, ret);
> +
> +    return ret;
> +}
> +
> +static uint32_t of_client_open(SpaprMachineState *spapr, uint32_t pathaddr)
> +{
> +    char path[256];
> +
> +    readstr(pathaddr, path, sizeof(path));
> +
> +    return spapr_of_client_open(spapr, path);
> +}
> +
> +static void of_client_close(SpaprMachineState *spapr, uint32_t ihandle)
> +{
> +    if (!g_hash_table_remove(spapr->of_instances, GINT_TO_POINTER(ihandle))) {
> +        trace_spapr_of_client_error_unknown_ihandle_close(ihandle);
> +    }
> +}
> +
> +static uint32_t of_client_instance_to_package(SpaprMachineState *spapr,
> +                                              uint32_t ihandle)
> +{
> +    gpointer instp = g_hash_table_lookup(spapr->of_instances,
> +                                        GINT_TO_POINTER(ihandle));
> +
> +    if (!instp) {
> +        return -1;
> +    }
> +
> +    return ((SpaprOfInstance *)instp)->phandle;
> +}
> +
> +static uint32_t of_client_package_to_path(const void *fdt, uint32_t phandle,
> +                                          uint32_t buf, uint32_t len)
> +{
> +    char tmp[256];
> +
> +    if (0 == fdt_get_path(fdt, fdt_node_offset_by_phandle(fdt, phandle), tmp,
> +                          sizeof(tmp))) {
> +        tmp[sizeof(tmp) - 1] = 0;
> +        cpu_physical_memory_write(buf, tmp, MIN(len, strlen(tmp)));
> +    }
> +    return len;
> +}
> +
> +static uint32_t of_client_instance_to_path(SpaprMachineState *spapr,
> +                                           uint32_t ihandle, uint32_t buf,
> +                                           uint32_t len)
> +{
> +    uint32_t phandle = of_client_instance_to_package(spapr, ihandle);
> +
> +    if (phandle != -1) {
> +        return of_client_package_to_path(spapr->fdt_blob, phandle, buf, len);
> +    }
> +
> +    return 0;
> +}
> +
> +static uint32_t of_client_write(SpaprMachineState *spapr, uint32_t ihandle,
> +                                uint32_t buf, uint32_t len)
> +{
> +    char tmp[256];
> +    int toread, toprint, cb = MIN(len, 1024);
> +    SpaprOfInstance *inst = (SpaprOfInstance *)
> +        g_hash_table_lookup(spapr->of_instances, GINT_TO_POINTER(ihandle));
> +
> +    while (cb > 0) {
> +        toread = MIN(cb + 1, sizeof(tmp));
> +        readstr(buf, tmp, toread);
> +        toprint = strlen(tmp);
> +        if (inst && inst->cdev) {
> +            toprint = qemu_chr_write(inst->cdev, (uint8_t *) tmp, toprint,
> +                                     true);
> +        } else {
> +            /* We normally open stdout so this is fallback */
> +            printf("DBG[%d]%s", ihandle, tmp);
> +        }
> +        buf += toprint;
> +        cb -= toprint;
> +    }
> +
> +    return len;
> +}
> +
> +static bool of_client_claim_avail(GArray *claimed, uint64_t virt, uint64_t size)
> +{
> +    int i;
> +    SpaprOfClaimed *c;
> +
> +    for (i = 0; i < claimed->len; ++i) {
> +        c = &g_array_index(claimed, SpaprOfClaimed, i);
> +        if ((c->start <= virt && virt < c->start + c->size) ||
> +            (virt <= c->start && c->start < virt + size)) {
> +            return false;
> +        }
> +    }
> +
> +    return true;
> +}
> +
> +static void of_client_claim_add(GArray *claimed, uint64_t virt, uint64_t size)
> +{
> +    SpaprOfClaimed newclaim;
> +
> +    newclaim.start = virt;
> +    newclaim.size = size;
> +    g_array_append_val(claimed, newclaim);
> +}
> +
> +/*
> + * "claim" claims memory at @virt if @align==0; otherwise it allocates
> + * memory at the requested alignment.
> + */
> +uint64_t spapr_do_of_client_claim(SpaprMachineState *spapr, uint64_t virt,
> +                                  uint64_t size, uint64_t align)
> +{
> +    uint32_t ret;
> +
> +    if (align == 0) {
> +        if (!of_client_claim_avail(spapr->claimed, virt, size)) {
> +            return -1;
> +        }
> +        ret = virt;
> +    } else {
> +        align = pow2ceil(align);

Should this be a pow2ceil, or should it just return an error if align
is not a power of 2?  Note that aligning something to 4 bytes will
(probably) make it *not* aligned to 3 bytes.
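
A sketch of the "return an error" option, for comparison. The helper
names here are invented for the example; in the tree I believe
is_power_of_2() from qemu/host-utils.h would do for the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True for 1, 2, 4, 8, ...; false for 0 and anything with two bits set. */
static bool is_pow2(uint64_t v)
{
    return v && !(v & (v - 1));
}

/*
 * Round base up to a multiple of align. Refuses (returns false) when
 * align is not a power of 2 instead of quietly rounding it up.
 * Overflow of base + align - 1 is not handled in this sketch.
 */
static bool align_up_pow2(uint64_t base, uint64_t align, uint64_t *out)
{
    assert(out);
    if (!is_pow2(align)) {
        return false;
    }
    *out = (base + align - 1) & ~(align - 1);
    return true;
}
```

With this, a request for align=3 fails visibly rather than getting a
4-byte aligned address that is usually not a multiple of 3.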

> +        spapr->claimed_base = (spapr->claimed_base + align - 1) & ~(align - 1);
> +        while (1) {
> +            if (spapr->claimed_base >= spapr->rma_size) {
> +                perror("Out of memory");

error_report() or qemu_log() or something and a message with some more
specificity, please.

> +                return -1;
> +            }
> +            if (of_client_claim_avail(spapr->claimed, spapr->claimed_base,
> +                                      size)) {
> +                break;
> +            }
> +            spapr->claimed_base += size;
> +        }
> +        ret = spapr->claimed_base;
> +    }
> +
> +    spapr->claimed_base = MAX(spapr->claimed_base, ret + size);
> +    of_client_claim_add(spapr->claimed, virt, size);
> +    trace_spapr_of_client_claim(virt, size, align, ret);
> +
> +    return ret;
> +}
> +
> +static uint32_t of_client_claim(SpaprMachineState *spapr, uint32_t virt,
> +                                uint32_t size, uint32_t align)
> +{
> +    if (align) {
> +        return -1;
> +    }
> +    if (!of_client_claim_avail(spapr->claimed, virt, size)) {
> +        return -1;
> +    }
> +
> +    spapr->claimed_base = MAX(spapr->claimed_base, virt + size);
> +    of_client_claim_add(spapr->claimed, virt, size);
> +    trace_spapr_of_client_claim(virt, size, align, virt);

Huh.  So do_of_client_claim() is never used from of_client_claim(),
only from "internal" claimers.  It definitely needs a different name.

> +    return virt;
> +}
> +
> +static uint32_t of_client_call_method(SpaprMachineState *spapr,
> +                                      uint32_t methodaddr, uint32_t ihandle,
> +                                      uint32_t param, uint32_t *ret2)
> +{
> +    uint32_t ret = -1;
> +    char path[256] = "", method[256] = "";
> +    uint32_t phandle = of_client_instance_to_package(spapr, ihandle);
> +    int offset;
> +
> +    if (!ihandle) {
> +        goto trace_exit;
> +    }
> +
> +    readstr(methodaddr, method, sizeof(method));
> +    phandle = of_client_instance_to_package(spapr, ihandle);
> +    if (!phandle) {
> +        goto trace_exit;
> +    }
> +
> +    offset = fdt_node_offset_by_phandle(spapr->fdt_blob, phandle);
> +    if (offset < 0) {
> +        goto trace_exit;
> +    }
> +
> +    if (fdt_get_path(spapr->fdt_blob, offset, path, sizeof(path))) {
> +        goto trace_exit;
> +    }
> +
> +    if (strcmp(path, "/") == 0) {
> +        if (strcmp(method, "ibm,client-architecture-support") == 0) {
> +
> +#define FDT_MAX_SIZE            0x100000
> +            ret = do_client_architecture_support(POWERPC_CPU(first_cpu), spapr,
> +                                                 param, FDT_MAX_SIZE);
> +            *ret2 = 0;
> +        }
> +    } else if (strcmp(path, "/rtas") == 0) {
> +        if (strcmp(method, "instantiate-rtas") == 0) {
> +            spapr_instantiate_rtas(spapr, param);
> +            ret = 0;
> +            *ret2 = param; /* rtasbase */
> +        }
> +    } else {
> +        trace_spapr_of_client_error_unknown_method(method);
> +    }
> +
> +trace_exit:
> +    trace_spapr_of_client_method(ihandle, method, param, phandle, path, ret);
> +
> +    return ret;
> +}
> +
> +static void of_client_quiesce(SpaprMachineState *spapr)
> +{
> +    int rc = fdt_pack(spapr->fdt_blob);
> +    /* Should only fail if we've built a corrupted tree */
> +    assert(rc == 0);
> +
> +    spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
> +    spapr->fdt_initial_size = spapr->fdt_size;
> +}
> +
> +int spapr_h_client(SpaprMachineState *spapr, target_ulong of_client_args)
> +{
> +    struct prom_args args = { 0 };
> +    char service[64];
> +    unsigned nargs, nret;
> +    int i, servicelen;
> +
> +    cpu_physical_memory_read(of_client_args, &args, sizeof(args));
> +    nargs = be32_to_cpu(args.nargs);
> +    nret = be32_to_cpu(args.nret);
> +    readstr(be32_to_cpu(args.service), service, sizeof(service));
> +    servicelen = strlen(service);
> +
> +#define cmpservice(s, a, r) \
> +    _cmpservice(service, servicelen, nargs, nret, (s), sizeof(s), (a), (r))
> +
> +    if (cmpservice("finddevice", 1, 1)) {
> +        args.args[nargs] = of_client_finddevice(spapr->fdt_blob,
> +                                                be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("getprop", 4, 1)) {
> +        args.args[nargs] = of_client_getprop(spapr->fdt_blob,
> +                                             be32_to_cpu(args.args[0]),
> +                                             be32_to_cpu(args.args[1]),
> +                                             be32_to_cpu(args.args[2]),
> +                                             be32_to_cpu(args.args[3]));
> +    } else if (cmpservice("getproplen", 2, 1)) {
> +        args.args[nargs] = of_client_getproplen(spapr->fdt_blob,
> +                                                be32_to_cpu(args.args[0]),
> +                                                be32_to_cpu(args.args[1]));
> +    } else if (cmpservice("setprop", 4, 1)) {
> +        args.args[nargs] = of_client_setprop(spapr,
> +                                             be32_to_cpu(args.args[0]),
> +                                             be32_to_cpu(args.args[1]),
> +                                             be32_to_cpu(args.args[2]),
> +                                             be32_to_cpu(args.args[3]));
> +    } else if (cmpservice("nextprop", 3, 1)) {
> +        args.args[nargs] = of_client_nextprop(spapr->fdt_blob,
> +                                              be32_to_cpu(args.args[0]),
> +                                              be32_to_cpu(args.args[1]),
> +                                              be32_to_cpu(args.args[2]));
> +    } else if (cmpservice("peer", 1, 1)) {
> +        args.args[nargs] = of_client_peer(spapr->fdt_blob,
> +                                          be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("child", 1, 1)) {
> +        args.args[nargs] = of_client_child(spapr->fdt_blob,
> +                                           be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("parent", 1, 1)) {
> +        args.args[nargs] = of_client_parent(spapr->fdt_blob,
> +                                            be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("open", 1, 1)) {
> +        args.args[nargs] = of_client_open(spapr, be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("close", 1, 0)) {
> +        of_client_close(spapr, be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("instance-to-package", 1, 1)) {
> +        args.args[nargs] =
> +            of_client_instance_to_package(spapr,
> +                                          be32_to_cpu(args.args[0]));
> +    } else if (cmpservice("package-to-path", 3, 1)) {
> +        args.args[nargs] = of_client_package_to_path(spapr->fdt_blob,
> +                                                     be32_to_cpu(args.args[0]),
> +                                                     be32_to_cpu(args.args[1]),
> +                                                     be32_to_cpu(args.args[2]));
> +    } else if (cmpservice("instance-to-path", 3, 1)) {
> +        args.args[nargs] =
> +            of_client_instance_to_path(spapr,
> +                                       be32_to_cpu(args.args[0]),
> +                                       be32_to_cpu(args.args[1]),
> +                                       be32_to_cpu(args.args[2]));
> +    } else if (cmpservice("write", 3, 1)) {
> +        args.args[nargs] = of_client_write(spapr,
> +                                           be32_to_cpu(args.args[0]),
> +                                           be32_to_cpu(args.args[1]),
> +                                           be32_to_cpu(args.args[2]));
> +    } else if (cmpservice("claim", 3, 1)) {
> +        args.args[nargs] = of_client_claim(spapr,
> +                                           be32_to_cpu(args.args[0]),
> +                                           be32_to_cpu(args.args[1]),
> +                                           be32_to_cpu(args.args[2]));
> +    } else if (cmpservice("call-method", 3, 2)) {
> +        args.args[nargs] = of_client_call_method(spapr,
> +                                                 be32_to_cpu(args.args[0]),
> +                                                 be32_to_cpu(args.args[1]),
> +                                                 be32_to_cpu(args.args[2]),
> +                                                 &args.args[nargs + 1]);
> +    } else if (cmpservice("quiesce", 0, 0)) {
> +        of_client_quiesce(spapr);
> +    } else if (cmpservice("exit", 0, 0)) {
> +        error_report("Stopped as the VM requested \"exit\"");
> +        vm_stop(RUN_STATE_PAUSED);
> +    } else {
> +        trace_spapr_of_client_error_unknown_service(service, nargs, nret);
> +        args.args[nargs] = -1;

You've never bounds checked nargs at this point.

> +    }
> +
> +    for (i = 0; i < nret; ++i) {

And likewise you might not have bounds checked nret.
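
For both of these, a check right after reading nargs and nret from the
guest would do. A minimal sketch, with made-up names (PROM_ARGS_MAX,
prom_args_valid(), prom_set_ret()), assuming the args[10] array from
struct prom_args above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PROM_ARGS_MAX 10 /* matches uint32_t args[10] in struct prom_args */

/*
 * nargs and nret come straight from the guest. Compare them separately
 * so that nargs + nret cannot wrap around and pass the check.
 */
static bool prom_args_valid(uint32_t nargs, uint32_t nret)
{
    return nargs <= PROM_ARGS_MAX && nret <= PROM_ARGS_MAX - nargs;
}

/* Store return value number idx; only legal after validation. */
static void prom_set_ret(uint32_t *args, uint32_t nargs, uint32_t nret,
                         uint32_t idx, uint32_t val)
{
    assert(prom_args_valid(nargs, nret) && idx < nret);
    args[nargs + idx] = val;
}
```

If the check fails, the hcall should presumably return an error to the
guest before touching args.args[] at all.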

> +        args.args[nargs + i] = be32_to_cpu(args.args[nargs + i]);
> +    }
> +    cpu_physical_memory_write(of_client_args, &args, sizeof(args));
> +
> +    return H_SUCCESS;
> +}
> diff --git a/hw/ppc/trace-events b/hw/ppc/trace-events
> index 9ea620f23c85..e2d1e58d07c3 100644
> --- a/hw/ppc/trace-events
> +++ b/hw/ppc/trace-events
> @@ -21,6 +21,18 @@ spapr_update_dt(unsigned cb) "New blob %u bytes"
>  spapr_update_dt_failed_size(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
>  spapr_update_dt_failed_check(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
>  
> +# spapr_client.c
> +spapr_of_client_error_param(const char *method, int nargscheck, int nretcheck, int nargs, int nret) "%s takes/returns %d/%d, not %d/%d"
> +spapr_of_client_error_unknown_service(const char *service, int nargs, int nret) "%s args=%d rets=%d"
> +spapr_of_client_error_unknown_method(const char *method) "%s"
> +spapr_of_client_error_unknown_ihandle_close(uint32_t ihandle) "0x%x"
> +spapr_of_client_error_unknown_path(const char *path) "%s"
> +spapr_of_client_claim(uint32_t virt, uint32_t size, uint32_t align, uint32_t ret) "virt=0x%x size=0x%x align=0x%x => 0x%x"
> +spapr_of_client_method(uint32_t ihandle, const char *method, uint32_t param, uint32_t phandle, const char *path, uint32_t ret) "0x%x \"%s\" param=0x%x ph=0x%x \"%s\" => 0x%x"
> +spapr_of_client_getprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
> +spapr_of_client_setprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
> +spapr_of_client_open(const char *path, uint32_t phandle, uint32_t ihandle) "%s 0x%x => 0x%x"
> +
>  # spapr_hcall_tpm.c
>  spapr_h_tpm_comm(const char *device_path, uint64_t operation) "tpm_device_path=%s operation=0x%"PRIu64
>  spapr_tpm_execute(uint64_t data_in, uint64_t data_in_sz, uint64_t data_out, uint64_t data_out_sz) "data_in=0x%"PRIx64", data_in_sz=%"PRIu64", data_out=0x%"PRIx64", data_out_sz=%"PRIu64
Alexey Kardashevskiy Jan. 21, 2020, 7:25 a.m. UTC | #2
On 21/01/2020 16:11, David Gibson wrote:
> On Fri, Jan 10, 2020 at 01:09:25PM +1100, Alexey Kardashevskiy wrote:
>> The Petitboot bootloader is way more advanced than SLOF is ever going to
>> be as Petitboot comes with the full-featured Linux kernel with all
>> the drivers, and initramdisk with quite user friendly interface.
>> The problem with ditching SLOF is that an unmodified pseries kernel can
>> either start via:
>> 1. kexec, this requires presence of RTAS and skips
>> ibm,client-architecture-support entirely;
>> 2. normal boot, this heavily relies on the OF1275 client interface to
>> fetch the device tree and do early setup (claim memory).
>>
>> This adds a new bios-less mode to the pseries machine: "bios=on|off".
>> When enabled, QEMU does not load SLOF and jumps to the kernel from
>> "-kernel".
> 
> I don't love the name "bios" for this flag, since BIOS tends to refer
> to old-school x86 firmware.  Given the various plans we're considering
> for the future, I'd suggest "firmware=slof" for the current in-guest
> SLOF mode, and say "firmware=vof" (Virtual Open Firmware) for the new
> model.  We can consider firmware=petitboot or firmware=none (for
> direct kexec-style boot into -kernel) or whatever in the future.

Ok. We could also enforce default loading addresses for SLOF/kernel/grub
and drop "kernel-addr", although it is going to be confusing if the
address changes in a not so obvious way...

In fact, I will ideally need 3 flags:
-bios: on|off to stop loading SLOF;
-kernel-addr: 0x0 for slof/kernel; 0x20000 for grub;
-kernel-translate-hack: on|off - as grub is linked to run from 0x20000
and it only works when placed there, the hack breaks it.

Or we can pass grub via -bios and not via -kernel, but strictly speaking
there is still a firmware (that new 20-byte blob), so it would not be
accurate.

We can put this all into a single
-firmware slof|vof|grub|linux. Not sure.


>> The client interface is implemented exactly as RTAS - a 20 bytes blob,
>> right next after the RTAS blob. The entry point is passed to the kernel
>> via GPR5.
>>
>> This implements a handful of client interface methods just to get going.
>> In particular, this implements the device tree fetching,
>> ibm,client-architecture-support and instantiate-rtas.
>>
>> This implements changing FDT properties for RTAS (for vmlinux and zImage)
>> and initramdisk location (for zImage). To make this work, this skips
>> fdt_pack() when bios=off as not packing the blob leaves some room for
>> appending.
>>
>> This assigns "phandles" to device tree nodes as there is no more SLOF
>> and OF nodes addresses of which served as phandle values.
>> This keeps predefined nodes (such as XICS/NVLINK/...) unchanged.
>> phandles are regenerated at every FDT rebuild.
>>
>> When bios=off, this adds "/chosen" every time QEMU builds a tree.
>>
>> This implements "claim" which the client (Linux) uses for memory
>> allocation; this is also  used by QEMU for claiming kernel/initrd images,
>> client interface entry point, RTAS and the initial stack.
>>
>> While at this, add a "kernel-addr" machine parameter to allow moving
>> the kernel in memory. This is useful for debugging if the kernel is
>> loaded at @0, although not necessary.
>>
>> This adds basic instances support which are managed by a hashmap
>> ihandle->[phandle, DeviceState, Chardev].
>>
>> Note that a 64bit PCI fix is required for Linux:
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735e
>>
>> The test command line:
>>
>> qemu-system-ppc64 \
>> -nodefaults \
>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>> -mon id=MON0,chardev=STDIO0,mode=readline \
>> -nographic \
>> -vga none \
>> -kernel pbuild/kernel-le-guest/arch/powerpc/boot/zImage.pseries \
>> -machine pseries,bios=off,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken \
>> -m 4G \
>> -enable-kvm \
>> -initrd pb/rootfs.cpio.xz \
>> -device nec-usb-xhci,id=nec-usb-xhci0 \
>> -netdev tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0 \
>> -device virtio-net-pci,id=vnet0,netdev=TAP0 img/f30le.qcow2 \
>> -snapshot \
>> -smp 8,threads=8 \
>> -trace events=qemu_trace_events \
>> -d guest_errors \
>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.ssh54088 \
>> -mon chardev=SOCKET0,mode=control
>>
>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> It'd be nice to split this patch up a bit, though I'll admit it's not
> very obvious where to do so.


v6 is a patchset.

>> ---
>> Changes:
>> v5:
>> * made instances keep device and chardev pointers
>> * removed VIO dependencies
>> * print error if RTAS memory is not claimed as it should have been
>> * pack FDT as "quiesce"
>>
>> v4:
>> * fixed open
>> * validate ihandles in "call-method"
>>
>> v3:
>> * fixed phandles allocation
>> * s/__be32/uint32_t/ as we do not normally have __be32 type in qemu
>> * fixed size of /chosen/stdout
>> * bunch of renames
>> * do not create rtas properties at all, let the client deal with it;
>> instead setprop allows changing these in the FDT
>> * no more packing FDT when bios=off - nobody needs it and getprop does not
>> work otherwise
>> * allow updating initramdisk device tree properties (for zImage)
>> * added instances
>> * fixed stdout on OF's "write"
>> * removed special handling for stdout in OF client, spapr-vty handles it
>> instead
>>
>> v2:
>> * fixed claim()
>> * added "setprop"
>> * cleaner client interface and RTAS blobs management
>> * boots to petitboot and further to the target system
>> * more trace points
>> ---
>>  hw/ppc/Makefile.objs     |   1 +
>>  include/hw/ppc/spapr.h   |  28 +-
>>  hw/ppc/spapr.c           | 266 ++++++++++++++--
>>  hw/ppc/spapr_hcall.c     |  74 +++--
>>  hw/ppc/spapr_of_client.c | 633 +++++++++++++++++++++++++++++++++++++++
>>  hw/ppc/trace-events      |  12 +
>>  6 files changed, 959 insertions(+), 55 deletions(-)
>>  create mode 100644 hw/ppc/spapr_of_client.c
>>
>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>> index 101e9fc59185..20efc0aa6f9b 100644
>> --- a/hw/ppc/Makefile.objs
>> +++ b/hw/ppc/Makefile.objs
>> @@ -6,6 +6,7 @@ obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
>>  obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
>>  obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o spapr_irq.o
>>  obj-$(CONFIG_PSERIES) += spapr_tpm_proxy.o
>> +obj-$(CONFIG_PSERIES) += spapr_of_client.o
>>  obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
>>  # IBM PowerNV
>>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 61f005c6f686..efc2c70abf99 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -105,6 +105,11 @@ struct SpaprCapabilities {
>>      uint8_t caps[SPAPR_CAP_NUM];
>>  };
>>  
>> +typedef struct {
>> +    uint64_t start;
>> +    uint64_t size;
>> +} SpaprOfClaimed;
>> +
> 
> Can we split more of the fake-OF code into a new file?


Done in v6. I reworked it quite a bit, which is why I asked you to ping me
before reviewing this one :)

> 
>>  /**
>>   * SpaprMachineClass:
>>   */
>> @@ -160,6 +165,13 @@ struct SpaprMachineState {
>>      void *fdt_blob;
>>      long kernel_size;
>>      bool kernel_le;
>> +    uint64_t kernel_addr;
>> +    bool bios_enabled;
>> +    uint32_t rtas_base;
>> +    GArray *claimed; /* array of SpaprOfClaimed */
>> +    uint64_t claimed_base;
>> +    GHashTable *of_instances; /* ihandle -> SpaprOfInstance */
>> +    uint32_t of_instance_last;
>>      uint32_t initrd_base;
>>      long initrd_size;
>>      uint64_t rtc_offset; /* Now used only during incoming migration */
>> @@ -510,7 +522,8 @@ struct SpaprMachineState {
>>  /* Client Architecture support */
>>  #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
>>  #define KVMPPC_H_UPDATE_DT      (KVMPPC_HCALL_BASE + 0x3)
>> -#define KVMPPC_HCALL_MAX        KVMPPC_H_UPDATE_DT
>> +#define KVMPPC_H_CLIENT         (KVMPPC_HCALL_BASE + 0x5)
>> +#define KVMPPC_HCALL_MAX        KVMPPC_H_CLIENT
>>  
>>  /*
>>   * The hcall range 0xEF00 to 0xEF80 is reserved for use in facilitating
>> @@ -538,6 +551,11 @@ void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
>>  target_ulong spapr_hypercall(PowerPCCPU *cpu, target_ulong opcode,
>>                               target_ulong *args);
>>  
>> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>> +                                            SpaprMachineState *spapr,
>> +                                            target_ulong addr,
>> +                                            target_ulong fdt_bufsize);
>> +
>>  /* Virtual Processor Area structure constants */
>>  #define VPA_MIN_SIZE           640
>>  #define VPA_SIZE_OFFSET        0x4
>> @@ -769,6 +787,11 @@ struct SpaprEventLogEntry {
>>  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
>>  void spapr_events_init(SpaprMachineState *sm);
>>  void spapr_dt_events(SpaprMachineState *sm, void *fdt);
>> +uint64_t spapr_do_of_client_claim(SpaprMachineState *spapr, uint64_t virt,
>> +                                  uint64_t size, uint64_t align);
>> +
>> +uint32_t spapr_of_client_open(SpaprMachineState *spapr, const char *path);
>> +int spapr_h_client(SpaprMachineState *spapr, target_ulong client_args);
>>  void close_htab_fd(SpaprMachineState *spapr);
>>  void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
>>  void spapr_free_hpt(SpaprMachineState *spapr);
>> @@ -891,4 +914,7 @@ void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
>>  #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
>>  
>>  void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
>> +
>> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base);
>> +
>>  #endif /* HW_SPAPR_H */
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index e62c89b3dd40..76ce8b973082 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -896,6 +896,55 @@ out:
>>      return ret;
>>  }
>>  
>> +/*
>> + * Below is a compiled version of RTAS blob and OF client interface entry point.
>> + *
>> + * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
>> + * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
>> + *
>> + *   .globl  _start
>> + *   _start:
>> + *           mr      4,3
>> + *           lis     3,KVMPPC_H_RTAS@h
>> + *           ori     3,3,KVMPPC_H_RTAS@l
>> + *           sc      1
>> + *           blr
>> + *           mr      4,3
>> + *           lis     3,KVMPPC_H_CLIENT@h
>> + *           ori     3,3,KVMPPC_H_CLIENT@l
>> + *           sc      1
>> + *           blr
>> + */
>> +static struct {
> 
> Should be able to add a 'const' here.
> 
>> +    uint8_t rtas[20], client[20];
>> +} QEMU_PACKED rtas_client_blob = {
>> +    .rtas = {
>> +        0x7c, 0x64, 0x1b, 0x78,
>> +        0x3c, 0x60, 0x00, 0x00,
>> +        0x60, 0x63, 0xf0, 0x00,
>> +        0x44, 0x00, 0x00, 0x22,
>> +        0x4e, 0x80, 0x00, 0x20
>> +    },
>> +    .client = {
>> +        0x7c, 0x64, 0x1b, 0x78,
>> +        0x3c, 0x60, 0x00, 0x00,
>> +        0x60, 0x63, 0xf0, 0x05,
>> +        0x44, 0x00, 0x00, 0x22,
>> +        0x4e, 0x80, 0x00, 0x20
>> +    }
>> +};
> 
> I'd split this into two variables - there's not really any connection
> between the two, AFAICT.
> 
> Note that I'm getting closer to merging the fwnmi stuff at which point
> you'll need to pad the RTAS blob with a bunch of extra space for
> taking the fwnmi dumps.
> 
>> +
>> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base)
>> +{
>> +    if (spapr_do_of_client_claim(spapr, base, sizeof(rtas_client_blob.rtas),
>> +                                 0) != -1) {
> 
> Wait.. == -1 is the success case?  That's a very surprising interface.


This is a sort of assert. spapr_do_of_client_claim() returns an address,
and the client is expected to have claimed the memory it wants RTAS
copied to. This check makes sure that either the client already did so or
we claim it here.
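
For illustration, a minimal sketch of that "claim as assert" idea. It is
not the code from the patch: ClaimMap, claim_fixed() and
rtas_claimed_by_client() are hypothetical names, and the fixed-size table
stands in for the GArray of SpaprOfClaimed. The assumption is that a
fixed-address claim returns the address on success and -1 when the range
overlaps an existing claim, so a failed claim here means the client
already owns the range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CLAIM_MAX 16 /* hypothetical table size, for the sketch only */

typedef struct {
    uint64_t start;
    uint64_t size;
} Claimed;

typedef struct {
    Claimed range[CLAIM_MAX];
    size_t count;
} ClaimMap;

/* Returns 1 if [start, start + size) intersects an existing claim. */
static int claim_overlaps(const ClaimMap *map, uint64_t start, uint64_t size)
{
    for (size_t i = 0; i < map->count; i++) {
        const Claimed *c = &map->range[i];

        if (start < c->start + c->size && c->start < start + size) {
            return 1;
        }
    }
    return 0;
}

/*
 * Fixed-address claim: returns the address on success, or (uint64_t)-1
 * if the range is already claimed or the table is full.
 */
static uint64_t claim_fixed(ClaimMap *map, uint64_t start, uint64_t size)
{
    assert(size > 0);
    if (map->count == CLAIM_MAX || claim_overlaps(map, start, size)) {
        return (uint64_t)-1;
    }
    map->range[map->count++] = (Claimed){ start, size };
    return start;
}

/*
 * The "assert": returns 1 if the client had already claimed the range
 * (our own claim fails), 0 if nobody had and we claimed it just now.
 */
static int rtas_claimed_by_client(ClaimMap *map, uint64_t base, uint64_t size)
{
    return claim_fixed(map, base, size) == (uint64_t)-1;
}
```

With this shape, the caller reports an error only when
rtas_claimed_by_client() returns 0, and the range ends up claimed either
way.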


> 
>> +        error_report("The OF client did not claim RTAS memory at 0x%x", base);
> 
> Error message is hard to follow.  Maybe "Could not claim memory for RTAS"
> 
>> +    }
>> +    spapr->rtas_base = base;
>> +    cpu_physical_memory_write(base, rtas_client_blob.rtas,
>> +                              sizeof(rtas_client_blob.rtas));
>> +}
>> +
>>  static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>>  {
>>      MachineState *ms = MACHINE(spapr);
>> @@ -980,6 +1029,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>>      _FDT(fdt_setprop(fdt, rtas, "ibm,lrdr-capacity",
>>                       lrdr_capacity, sizeof(lrdr_capacity)));
>>  
>> +    if (!spapr->bios_enabled) {
>> +        _FDT(fdt_setprop_cell(fdt, rtas, "rtas-size",
>> +                              sizeof(rtas_client_blob.rtas)));
>> +    }
>> +
>>      spapr_dt_rtas_tokens(fdt, rtas);
>>  }
>>  
>> @@ -1057,7 +1111,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
>>      }
>>  
>>      if (spapr->kernel_size) {
>> -        uint64_t kprop[2] = { cpu_to_be64(KERNEL_LOAD_ADDR),
>> +        uint64_t kprop[2] = { cpu_to_be64(spapr->kernel_addr),
> 
> Hrm, I really think I would like to see the change to adjustable
> kernel_addr split out - it puts a bunch of noise into the main kill
> slof patch.

Sure, I'll do that if we decide to proceed with this.


> 
>>                                cpu_to_be64(spapr->kernel_size) };
>>  
>>          _FDT(fdt_setprop(fdt, chosen, "qemu,boot-kernel",
>> @@ -1245,7 +1299,8 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>>      /* Build memory reserve map */
>>      if (reset) {
>>          if (spapr->kernel_size) {
>> -            _FDT((fdt_add_mem_rsv(fdt, KERNEL_LOAD_ADDR, spapr->kernel_size)));
>> +            _FDT((fdt_add_mem_rsv(fdt, spapr->kernel_addr,
>> +                                  spapr->kernel_size)));
>>          }
>>          if (spapr->initrd_size) {
>>              _FDT((fdt_add_mem_rsv(fdt, spapr->initrd_base,
>> @@ -1268,12 +1323,56 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>>          }
>>      }
>>  
>> +    if (!spapr->bios_enabled) {
>> +        uint32_t phandle;
>> +        int i, offset, proplen = 0;
>> +        const void *prop;
>> +        bool found = false;
>> +        GArray *phandles = g_array_new(false, false, sizeof(uint32_t));
>> +
>> +        /* Find all predefined phandles */
>> +        for (offset = fdt_next_node(fdt, -1, NULL);
>> +             offset >= 0;
>> +             offset = fdt_next_node(fdt, offset, NULL)) {
>> +            prop = fdt_getprop_namelen(fdt, offset, "phandle", 7, &proplen);
> 
> You can just use fdt_getprop() rather than the namelen variant (that's
> only really useful when you don't have a \0-terminated string with the
> name).

Ok, will fix. There are just too many similar functions in libfdt.h and
fdt_getprop() could be inlined, probably.


>> +            if (prop && proplen == sizeof(uint32_t)) {
>> +                phandle = fdt32_ld(prop);
>> +                g_array_append_val(phandles, phandle);
>> +            }
>> +        }
>> +
>> +        /* Assign phandles skipping the predefined ones */
>> +        for (offset = fdt_next_node(fdt, -1, NULL), phandle = 1;
>> +             offset >= 0;
>> +             offset = fdt_next_node(fdt, offset, NULL), ++phandle) {
>> +            prop = fdt_getprop_namelen(fdt, offset, "phandle", 7, &proplen);
>> +            if (prop) {
>> +                continue;
>> +            }
>> +            /* Check if the current phandle is not allocated already */
>> +            for ( ; ; ++phandle) {
>> +                for (i = 0, found = false; i < phandles->len; ++i) {
>> +                    if (phandle == g_array_index(phandles, uint32_t, i)) {
>> +                        found = true;
>> +                        break;
>> +                    }
>> +                }
>> +                if (!found) {
>> +                    break;
>> +                }
>> +            }
>> +            _FDT(fdt_setprop_cell(fdt, offset, "phandle", phandle));
>> +        }
>> +        g_array_unref(phandles);
>> +    }
>> +
>>      return fdt;
>>  }
>>  
>>  static uint64_t translate_kernel_address(void *opaque, uint64_t addr)
>>  {
>> -    return (addr & 0x0fffffff) + KERNEL_LOAD_ADDR;
>> +    SpaprMachineState *spapr = opaque;
>> +    return (addr & 0x0fffffff) + spapr->kernel_addr;
>>  }
>>  
>>  static void emulate_spapr_hypercall(PPCVirtualHypervisor *vhyp,
>> @@ -1660,24 +1759,89 @@ static void spapr_machine_reset(MachineState *machine)
>>       */
>>      fdt_addr = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FDT_MAX_SIZE;
>>  
>> +    /* Set up the entry state */
>> +    if (!spapr->bios_enabled) {
>> +        if (spapr->claimed) {
>> +            g_array_unref(spapr->claimed);
>> +        }
>> +        if (spapr->of_instances) {
>> +            g_hash_table_unref(spapr->of_instances);
>> +        }
>> +
>> +        spapr->claimed = g_array_new(false, false, sizeof(SpaprOfClaimed));
>> +        spapr->of_instances = g_hash_table_new(g_direct_hash, g_direct_equal);
>> +
>> +        spapr->claimed_base = 0x10000; /* Avoid using the first system page */
>> +
>> +        spapr_cpu_set_entry_state(first_ppc_cpu, spapr->kernel_addr,
>> +                                  spapr->initrd_base);
>> +        first_ppc_cpu->env.gpr[4] = spapr->initrd_size;
>> +
>> +        if (spapr_do_of_client_claim(spapr, spapr->kernel_addr,
>> +                                  spapr->kernel_size, 0) == -1) {
>> +            error_report("Memory for kernel is in use");
>> +            exit(1);
>> +        }
>> +        if (spapr_do_of_client_claim(spapr, spapr->initrd_base,
>> +                                  spapr->initrd_size, 0) == -1) {
>> +            error_report("Memory for initramdisk is in use");
>> +            exit(1);
>> +        }
>> +        first_ppc_cpu->env.gpr[1] = spapr_do_of_client_claim(spapr, 0, 0x40000,
>> +                                                             0x10000);
>> +        if (first_ppc_cpu->env.gpr[1] == -1) {
>> +            error_report("Memory allocation for stack failed");
>> +            exit(1);
>> +        }
>> +
>> +        first_ppc_cpu->env.gpr[5] =
>> +            spapr_do_of_client_claim(spapr, 0, sizeof(rtas_client_blob.client),
>> +                                     sizeof(rtas_client_blob.client));
>> +        if (first_ppc_cpu->env.gpr[5] == -1) {
>> +            error_report("Memory allocation for OF client failed");
>> +            exit(1);
>> +        }
>> +        cpu_physical_memory_write(first_ppc_cpu->env.gpr[5],
>> +                                  rtas_client_blob.client,
>> +                                  sizeof(rtas_client_blob.client));
>> +    } else {
>> +        spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
>> +        first_ppc_cpu->env.gpr[5] = 0; /* 0 = kexec !0 = prom_init */
>> +    }
>> +
>>      fdt = spapr_build_fdt(spapr, true, FDT_MAX_SIZE);
>>  
>> -    rc = fdt_pack(fdt);
>> -
>> -    /* Should only fail if we've built a corrupted tree */
>> -    assert(rc == 0);
>> -
>> -    /* Load the fdt */
>> -    qemu_fdt_dumpdtb(fdt, fdt_totalsize(fdt));
>> -    cpu_physical_memory_write(fdt_addr, fdt, fdt_totalsize(fdt));
>>      g_free(spapr->fdt_blob);
>>      spapr->fdt_size = fdt_totalsize(fdt);
>>      spapr->fdt_initial_size = spapr->fdt_size;
>>      spapr->fdt_blob = fdt;
>>  
>> -    /* Set up the entry state */
>> -    spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
>> -    first_ppc_cpu->env.gpr[5] = 0;
>> +    if (spapr->bios_enabled) {
>> +        /* Load the fdt */
>> +        rc = fdt_pack(spapr->fdt_blob);
>> +        /* Should only fail if we've built a corrupted tree */
>> +        assert(rc == 0);
>> +
>> +        spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
>> +        spapr->fdt_initial_size = spapr->fdt_size;
>> +        qemu_fdt_dumpdtb(spapr->fdt_blob, spapr->fdt_size);
> 
> I think we should still have a dumpdtb call on the !bios path.
> 
>> +        cpu_physical_memory_write(fdt_addr, spapr->fdt_blob, spapr->fdt_size);
>> +    } else {
>> +        char *stdout_path = spapr_vio_stdout_path(spapr->vio_bus);
>> +        int offset = fdt_path_offset(fdt, "/chosen");
>> +
>> +        /*
>> +         * SLOF-less setup requires an open instance of stdout for early
>> +         * kernel printk. By now all phandles are settled so we can open
>> +         * the default serial console.
>> +         * We skip writing FDT as nothing expects it; OF client interface is
>> +         * going to be used for reading the device tree.
>> +         */
>> +        if (stdout_path) {
>> +            _FDT(fdt_setprop_cell(fdt, offset, "stdout",
>> +                                  spapr_of_client_open(spapr, stdout_path)));
>> +        }
>> +    }
>>  
>>      spapr->cas_reboot = false;
>>  }
>> @@ -2897,12 +3061,12 @@ static void spapr_machine_init(MachineState *machine)
>>          uint64_t lowaddr = 0;
>>  
>>          spapr->kernel_size = load_elf(kernel_filename, NULL,
>> -                                      translate_kernel_address, NULL,
>> +                                      translate_kernel_address, spapr,
>>                                        NULL, &lowaddr, NULL, 1,
>>                                        PPC_ELF_MACHINE, 0, 0);
>>          if (spapr->kernel_size == ELF_LOAD_WRONG_ENDIAN) {
>>              spapr->kernel_size = load_elf(kernel_filename, NULL,
>> -                                          translate_kernel_address, NULL, NULL,
>> +                                          translate_kernel_address, spapr, NULL,
>>                                            &lowaddr, NULL, 0, PPC_ELF_MACHINE,
>>                                            0, 0);
>>              spapr->kernel_le = spapr->kernel_size > 0;
>> @@ -2918,7 +3082,7 @@ static void spapr_machine_init(MachineState *machine)
>>              /* Try to locate the initrd in the gap between the kernel
>>               * and the firmware. Add a bit of space just in case
>>               */
>> -            spapr->initrd_base = (KERNEL_LOAD_ADDR + spapr->kernel_size
>> +            spapr->initrd_base = (spapr->kernel_addr + spapr->kernel_size
>>                                    + 0x1ffff) & ~0xffff;
>>              spapr->initrd_size = load_image_targphys(initrd_filename,
>>                                                       spapr->initrd_base,
>> @@ -2932,20 +3096,22 @@ static void spapr_machine_init(MachineState *machine)
>>          }
>>      }
>>  
>> -    if (bios_name == NULL) {
>> -        bios_name = FW_FILE_NAME;
>> +    if (spapr->bios_enabled) {
>> +        if (bios_name == NULL) {
>> +            bios_name = FW_FILE_NAME;
>> +        }
>> +        filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
>> +        if (!filename) {
>> +            error_report("Could not find LPAR firmware '%s'", bios_name);
>> +            exit(1);
>> +        }
>> +        fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
>> +        if (fw_size <= 0) {
>> +            error_report("Could not load LPAR firmware '%s'", filename);
>> +            exit(1);
>> +        }
>> +        g_free(filename);
>>      }
>> -    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
>> -    if (!filename) {
>> -        error_report("Could not find LPAR firmware '%s'", bios_name);
>> -        exit(1);
>> -    }
>> -    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
>> -    if (fw_size <= 0) {
>> -        error_report("Could not load LPAR firmware '%s'", filename);
>> -        exit(1);
>> -    }
>> -    g_free(filename);
>>  
>>      /* FIXME: Should register things through the MachineState's qdev
>>       * interface, this is a legacy from the sPAPREnvironment structure
>> @@ -3162,6 +3328,32 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
>>      visit_type_uint32(v, name, (uint32_t *)opaque, errp);
>>  }
>>  
>> +static void spapr_get_kernel_addr(Object *obj, Visitor *v, const char *name,
>> +                                  void *opaque, Error **errp)
>> +{
>> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
>> +}
>> +
>> +static void spapr_set_kernel_addr(Object *obj, Visitor *v, const char *name,
>> +                                  void *opaque, Error **errp)
>> +{
>> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
>> +}
>> +
>> +static bool spapr_get_bios_enabled(Object *obj, Error **errp)
>> +{
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>> +
>> +    return spapr->bios_enabled;
>> +}
>> +
>> +static void spapr_set_bios_enabled(Object *obj, bool value, Error **errp)
>> +{
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>> +
>> +    spapr->bios_enabled = value;
>> +}
>> +
>>  static char *spapr_get_ic_mode(Object *obj, Error **errp)
>>  {
>>      SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>> @@ -3267,6 +3459,20 @@ static void spapr_instance_init(Object *obj)
>>      object_property_add_bool(obj, "vfio-no-msix-emulation",
>>                               spapr_get_msix_emulation, NULL, NULL);
>>  
>> +    object_property_add(obj, "kernel-addr", "uint64", spapr_get_kernel_addr,
>> +                        spapr_set_kernel_addr, NULL, &spapr->kernel_addr,
>> +                        &error_abort);
>> +    object_property_set_description(obj, "kernel-addr",
>> +                                    stringify(KERNEL_LOAD_ADDR)
>> +                                    " for -kernel is the default",
>> +                                    NULL);
>> +    spapr->kernel_addr = KERNEL_LOAD_ADDR;
>> +    object_property_add_bool(obj, "bios", spapr_get_bios_enabled,
>> +                            spapr_set_bios_enabled, NULL);
>> +    object_property_set_description(obj, "bios", "Conrols whether to load bios",
>> +                                    NULL);
>> +    spapr->bios_enabled = true;
>> +
>>      /* The machine class defines the default interrupt controller mode */
>>      spapr->irq = smc->irq;
>>      object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>> index f1799b1b707d..f2d8823d2c3a 100644
>> --- a/hw/ppc/spapr_hcall.c
>> +++ b/hw/ppc/spapr_hcall.c
>> @@ -1660,15 +1660,11 @@ static bool spapr_hotplugged_dev_before_cas(void)
>>      return false;
>>  }
>>  
>> -static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>> -                                                  SpaprMachineState *spapr,
>> -                                                  target_ulong opcode,
>> -                                                  target_ulong *args)
>> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>> +                                            SpaprMachineState *spapr,
>> +                                            target_ulong addr,
>> +                                            target_ulong fdt_bufsize)
>>  {
>> -    /* Working address in data buffer */
>> -    target_ulong addr = ppc64_phys_to_real(args[0]);
>> -    target_ulong fdt_buf = args[1];
>> -    target_ulong fdt_bufsize = args[2];
>>      target_ulong ov_table;
>>      uint32_t cas_pvr;
>>      SpaprOptionVector *ov1_guest, *ov5_guest, *ov5_cas_old;
>> @@ -1816,7 +1812,6 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>  
>>      if (!spapr->cas_reboot) {
>>          void *fdt;
>> -        SpaprDeviceTreeUpdateHeader hdr = { .version_id = 1 };
>>  
>>          /* If spapr_machine_reset() did not set up a HPT but one is necessary
>>           * (because the guest isn't going to use radix) then set it up here. */
>> @@ -1825,21 +1820,7 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>              spapr_setup_hpt_and_vrma(spapr);
>>          }
>>  
>> -        if (fdt_bufsize < sizeof(hdr)) {
>> -            error_report("SLOF provided insufficient CAS buffer "
>> -                         TARGET_FMT_lu " (min: %zu)", fdt_bufsize, sizeof(hdr));
>> -            exit(EXIT_FAILURE);
>> -        }
>> -
>> -        fdt_bufsize -= sizeof(hdr);
>> -
>> -        fdt = spapr_build_fdt(spapr, false, fdt_bufsize);
>> -        _FDT((fdt_pack(fdt)));
>> -
>> -        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
>> -        cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
>> -                                  fdt_totalsize(fdt));
>> -        trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
>> +        fdt = spapr_build_fdt(spapr, !spapr->bios_enabled, fdt_bufsize);
>>  
>>          g_free(spapr->fdt_blob);
>>          spapr->fdt_size = fdt_totalsize(fdt);
>> @@ -1854,6 +1835,41 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>      return H_SUCCESS;
>>  }
>>  
>> +static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>> +                                                  SpaprMachineState *spapr,
>> +                                                  target_ulong opcode,
>> +                                                  target_ulong *args)
>> +{
>> +    /* Working address in data buffer */
>> +    target_ulong addr = ppc64_phys_to_real(args[0]);
>> +    target_ulong fdt_buf = args[1];
>> +    target_ulong fdt_bufsize = args[2];
>> +    target_ulong ret;
>> +    SpaprDeviceTreeUpdateHeader hdr = { .version_id = 1 };
>> +
>> +    if (fdt_bufsize < sizeof(hdr)) {
>> +        error_report("SLOF provided insufficient CAS buffer "
>> +                     TARGET_FMT_lu " (min: %zu)", fdt_bufsize, sizeof(hdr));
>> +        exit(EXIT_FAILURE);
>> +    }
>> +
>> +    fdt_bufsize -= sizeof(hdr);
>> +
>> +    ret = do_client_architecture_support(cpu, spapr, addr, fdt_bufsize);
>> +    if (ret == H_SUCCESS) {
>> +        _FDT((fdt_pack(spapr->fdt_blob)));
>> +        spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
>> +        spapr->fdt_initial_size = spapr->fdt_size;
>> +
>> +        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
>> +        cpu_physical_memory_write(fdt_buf + sizeof(hdr), spapr->fdt_blob,
>> +                                  spapr->fdt_size);
>> +        trace_spapr_cas_continue(spapr->fdt_size + sizeof(hdr));
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>>  static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>>                                                SpaprMachineState *spapr,
>>                                                target_ulong opcode,
>> @@ -1998,6 +2014,14 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>      return H_SUCCESS;
>>  }
>>  
>> +static target_ulong h_client(PowerPCCPU *cpu, SpaprMachineState *spapr,
>> +                             target_ulong opcode, target_ulong *args)
> 
> As I said in an earlier revision, please expand these names from just
> "client", for readability by people who aren't already thinking about
> open firmware.

Yeah, I missed this one.


> 
>> +{
>> +    target_ulong client_args = ppc64_phys_to_real(args[0]);
>> +
>> +    return spapr_h_client(spapr, client_args);
>> +}
>> +
>>  static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
>>  static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - KVMPPC_HCALL_BASE + 1];
>>  static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 4 + 1];
>> @@ -2121,6 +2145,8 @@ static void hypercall_register_types(void)
>>  
>>      spapr_register_hypercall(KVMPPC_H_UPDATE_DT, h_update_dt);
>>  
>> +    spapr_register_hypercall(KVMPPC_H_CLIENT, h_client);
>> +
>>      /* Virtual Processor Home Node */
>>      spapr_register_hypercall(H_HOME_NODE_ASSOCIATIVITY,
>>                               h_home_node_associativity);
>> diff --git a/hw/ppc/spapr_of_client.c b/hw/ppc/spapr_of_client.c
>> new file mode 100644
>> index 000000000000..24d854b76e51
>> --- /dev/null
>> +++ b/hw/ppc/spapr_of_client.c
> 
> I'd suggest expanding this file to cover as much as you can of the
> virtual OF stuff, not just the client interface.

This is done in v6.


> 
>> @@ -0,0 +1,633 @@
>> +#include "qemu/osdep.h"
>> +#include "qemu-common.h"
>> +#include "qapi/error.h"
>> +#include "exec/memory.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/ppc/spapr_vio.h"
>> +#include "chardev/char.h"
>> +#include "qom/qom-qobject.h"
>> +#include "trace.h"
>> +
>> +typedef struct {
>> +    DeviceState *dev;
>> +    Chardev *cdev;
>> +    uint32_t phandle;
>> +} SpaprOfInstance;
>> +
>> +/*
>> + * OF 1275 "nextprop" description suggests is it 32 bytes max but
>> + * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 chars long.
>> + */
>> +#define OF_PROPNAME_LEN_MAX 64
>> +
>> +/* Defined as Big Endian */
>> +struct prom_args {
>> +    uint32_t service;
>> +    uint32_t nargs;
>> +    uint32_t nret;
>> +    uint32_t args[10];
>> +};
>> +
>> +static void readstr(hwaddr pa, char *buf, int size)
>> +{
>> +    cpu_physical_memory_read(pa, buf, size - 1);
>> +    buf[size - 1] = 0;
>> +}
> 
> I'd still like to see this return some kind of error if it had to
> truncate what was passed by the client.


But truncating will produce an error anyway: libfdt won't find stuff, etc.
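
For reference, a minimal sketch of a readstr() variant that does report
truncation. It is an assumption-laden illustration, not the patch's code:
readstr_checked() is a hypothetical name, and a plain host buffer (src,
src_len) stands in for guest memory that the real code would read with
cpu_physical_memory_read().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy a NUL-terminated string from src (src_len readable bytes, standing
 * in for guest memory) into buf of the given size. The result is always
 * NUL-terminated. Returns 0 if the whole string fit, -1 if it was cut.
 */
static int readstr_checked(const char *src, size_t src_len,
                           char *buf, size_t size)
{
    size_t n;

    assert(size > 0);
    n = src_len < size - 1 ? src_len : size - 1;
    memcpy(buf, src, n);
    buf[n] = '\0';

    /* A terminator inside the copied bytes means nothing was lost. */
    if (memchr(src, '\0', n)) {
        return 0;
    }
    /* The string may end exactly where the copy stopped. */
    if (n < src_len && src[n] == '\0') {
        return 0;
    }
    return -1;
}
```

A caller such as the finddevice handler could then fail early instead of
passing a shortened path to libfdt.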


>> +
>> +static bool _cmpservice(const char *s, size_t len,
> 
> Don't use leading _ please - in userland those are reserved for the
> system libraries.
> 
>> +                        unsigned nargs, unsigned nret,
>> +                        const char *s1, size_t len1,
>> +                        unsigned nargscheck, unsigned nretcheck)
>> +{
>> +    if (strcmp(s, s1)) {
>> +        return false;
>> +    }
>> +    if (nargscheck == 0 && nretcheck == 0) {
>> +        return true;
>> +    }
>> +    if (nargs != nargscheck || nret != nretcheck) {
>> +        trace_spapr_of_client_error_param(s, nargscheck, nretcheck, nargs,
>> +                                          nret);
>> +        return false;
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static uint32_t of_client_finddevice(const void *fdt, uint32_t nodeaddr)
>> +{
>> +    char node[256];
> 
> Is 256 enough?  OF paths can get pretty long...


Hard to imagine that 255 is not enough though. The long parts of a path
would be the SCSI drive ID and the PHB; in between we can only have a
bunch of PCI bridges, and those are not so long.

What do you think is an appropriate limit?


> 
>> +    int ret;
>> +
>> +    readstr(nodeaddr, node, sizeof(node));
>> +    ret = fdt_path_offset(fdt, node);
>> +    if (ret >= 0) {
>> +        ret = fdt_get_phandle(fdt, ret);
>> +    }
>> +
>> +    return (uint32_t) ret;
>> +}
>> +
>> +static uint32_t of_client_getprop(const void *fdt, uint32_t nodeph,
>> +                                  uint32_t pname, uint32_t valaddr,
>> +                                  uint32_t vallen)
>> +{
>> +    char propname[OF_PROPNAME_LEN_MAX + 1];
>> +    uint32_t ret = 0;
>> +    int proplen = 0;
>> +    const void *prop;
>> +
>> +    readstr(pname, propname, sizeof(propname));
>> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
>> +                               propname, strlen(propname), &proplen);
> 
> Again, you don't need _namelen.
> 
>> +    if (prop) {
>> +        int cb = MIN(proplen, vallen);
>> +
>> +        cpu_physical_memory_write(valaddr, prop, cb);
>> +        ret = cb;
> 
> If I'm reading 1275 correctly, the return value should be the
> untruncated length of the property.


"Size is either the actual size of the property". I'd think the actual
size is what we actually copied to the buffer, but @proplen is probably
what they meant. I'll change to that and see what breaks.
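
A minimal sketch of the behaviour under discussion, assuming the reading
that "getprop" copies at most the buffer length but returns the full
property length. getprop_copy() is a hypothetical helper; in the real code
the property pointer and length would come from libfdt and the copy would
go to guest memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy at most buflen bytes of the property into buf and return the full
 * (untruncated) property length, or -1 if the property does not exist.
 */
static int32_t getprop_copy(const void *prop, int proplen,
                            void *buf, uint32_t buflen)
{
    uint32_t cb;

    assert(buf || buflen == 0);
    if (!prop || proplen < 0) {
        return -1;
    }
    cb = (uint32_t)proplen < buflen ? (uint32_t)proplen : buflen;
    memcpy(buf, prop, cb);

    /* The caller learns the real size even when the buffer was too small */
    return proplen;
}
```

Returning the full length lets the client detect truncation by comparing
the result with the buffer size it passed in.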



>> +    } else {
>> +        ret = -1;
>> +    }
>> +    trace_spapr_of_client_getprop(nodeph, propname, ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t of_client_getproplen(const void *fdt, uint32_t nodeph,
>> +                                     uint32_t pname)
>> +{
>> +    char propname[OF_PROPNAME_LEN_MAX + 1];
>> +    uint32_t ret = 0;
>> +    int proplen = 0;
>> +    const void *prop;
>> +
>> +    readstr(pname, propname, sizeof(propname));
>> +
>> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
>> +                               propname, strlen(propname), &proplen);
> 
> No _namelen.
> 
>> +    if (prop) {
>> +        ret = proplen;
>> +    } else {
>> +        ret = -1;
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t of_client_setprop(SpaprMachineState *spapr,
>> +                                  uint32_t nodeph, uint32_t pname,
>> +                                  uint32_t valaddr, uint32_t vallen)
>> +{
>> +    char propname[OF_PROPNAME_LEN_MAX + 1];
>> +    uint32_t ret = -1;
>> +    int offset;
> 
> A comment noting that you're only allowing a very restricted set of
> setprops would be good.


Isn't that quite clear from the code itself? Okay...
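
One way to make the restriction explicit would be a small table of
allowed properties. This is only a sketch: setprop_allowed() and
struct setprop_rule are hypothetical names, and the table simply
mirrors the name/size checks in of_client_setprop() as posted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical allowlist entry: which value sizes a property accepts. */
struct setprop_rule {
    const char *name;
    bool allow32;   /* 4-byte value accepted */
    bool allow64;   /* 8-byte value accepted */
};

/* Mirrors the checks in of_client_setprop(); everything else is refused. */
static const struct setprop_rule setprop_rules[] = {
    { "linux,rtas-base",    true, false },
    { "linux,rtas-entry",   true, false },
    { "linux,initrd-start", true, true  },
    { "linux,initrd-end",   true, true  },
};

bool setprop_allowed(const char *name, uint32_t vallen)
{
    size_t i;

    for (i = 0; i < sizeof(setprop_rules) / sizeof(setprop_rules[0]); ++i) {
        if (strcmp(name, setprop_rules[i].name) == 0) {
            return (vallen == sizeof(uint32_t) && setprop_rules[i].allow32) ||
                   (vallen == sizeof(uint64_t) && setprop_rules[i].allow64);
        }
    }

    return false;
}
```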


>> +    readstr(pname, propname, sizeof(propname));
>> +    if (vallen == sizeof(uint32_t)) {
>> +        uint32_t val32 = ldl_be_phys(first_cpu->as, valaddr);
>> +
>> +        if ((strcmp(propname, "linux,rtas-base") == 0) ||
>> +            (strcmp(propname, "linux,rtas-entry") == 0)) {
>> +            spapr->rtas_base = val32;
>> +        } else if (strcmp(propname, "linux,initrd-start") == 0) {
>> +            spapr->initrd_base = val32;
>> +        } else if (strcmp(propname, "linux,initrd-end") == 0) {
>> +            spapr->initrd_size = val32 - spapr->initrd_base;
>> +        } else {
>> +            goto trace_exit;
>> +        }
>> +    } else if (vallen == sizeof(uint64_t)) {
>> +        uint64_t val64 = ldq_be_phys(first_cpu->as, valaddr);
>> +
>> +        if (strcmp(propname, "linux,initrd-start") == 0) {
>> +            spapr->initrd_base = val64;
>> +        } else if (strcmp(propname, "linux,initrd-end") == 0) {
>> +            spapr->initrd_size = val64 - spapr->initrd_base;
>> +        } else {
>> +            goto trace_exit;
>> +        }
>> +    } else {
>> +        goto trace_exit;
>> +    }
>> +
>> +    offset = fdt_node_offset_by_phandle(spapr->fdt_blob, nodeph);
>> +    if (offset >= 0) {
>> +        uint8_t data[vallen];
>> +
>> +        cpu_physical_memory_read(valaddr, data, vallen);
>> +        if (!fdt_setprop(spapr->fdt_blob, offset, propname, data, vallen)) {
>> +            ret = vallen;
>> +        }
>> +    }
>> +
>> +trace_exit:
>> +    trace_spapr_of_client_setprop(nodeph, propname, ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t of_client_nextprop(const void *fdt, uint32_t phandle,
>> +                                   uint32_t prevaddr, uint32_t nameaddr)
>> +{
>> +    int offset = fdt_node_offset_by_phandle(fdt, phandle);
>> +    char prev[OF_PROPNAME_LEN_MAX + 1];
>> +    const char *tmp;
>> +
>> +    readstr(prevaddr, prev, sizeof(prev));
>> +    for (offset = fdt_first_property_offset(fdt, offset);
>> +         offset >= 0;
>> +         offset = fdt_next_property_offset(fdt, offset)) {
>> +
>> +        if (!fdt_getprop_by_offset(fdt, offset, &tmp, NULL)) {
>> +            return 0;
>> +        }
>> +        if (prev[0] == '\0' || strcmp(prev, tmp) == 0) {
>> +            if (prev[0] != '\0') {
>> +                offset = fdt_next_property_offset(fdt, offset);
>> +                if (offset < 0) {
>> +                    return 0;
>> +                }
>> +            }
>> +            if (!fdt_getprop_by_offset(fdt, offset, &tmp, NULL)) {
>> +                return 0;
>> +            }
>> +            cpu_physical_memory_write(nameaddr, tmp, strlen(tmp) + 1);
>> +            return 1;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static uint32_t of_client_peer(const void *fdt, uint32_t phandle)
>> +{
>> +    int ret;
>> +
>> +    if (phandle == 0) {
>> +        ret = fdt_path_offset(fdt, "/");
>> +    } else {
>> +        ret = fdt_next_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
>> +    }
>> +
>> +    if (ret < 0) {
>> +        ret = 0;
>> +    } else {
>> +        ret = fdt_get_phandle(fdt, ret);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t of_client_child(const void *fdt, uint32_t phandle)
>> +{
>> +    int ret = fdt_first_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
>> +
>> +    if (ret < 0) {
>> +        ret = 0;
>> +    } else {
>> +        ret = fdt_get_phandle(fdt, ret);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t of_client_parent(const void *fdt, uint32_t phandle)
>> +{
>> +    int ret = fdt_parent_offset(fdt, fdt_node_offset_by_phandle(fdt, phandle));
>> +
>> +    if (ret < 0) {
>> +        ret = 0;
>> +    } else {
>> +        ret = fdt_get_phandle(fdt, ret);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static DeviceState *of_client_find_qom_dev(BusState *bus, const char *path)
>> +{
>> +    BusChild *kid;
>> +
>> +    QTAILQ_FOREACH(kid, &bus->children, sibling) {
>> +        const char *p = qdev_get_fw_dev_path(kid->child);
>> +        BusState *child;
>> +
>> +        if (p && strcmp(path, p) == 0) {
>> +            return kid->child;
>> +        }
>> +        QLIST_FOREACH(child, &kid->child->child_bus, sibling) {
>> +            DeviceState *d = of_client_find_qom_dev(child, path);
>> +
>> +            if (d) {
>> +                return d;
>> +            }
>> +        }
>> +    }
>> +    return NULL;
>> +}
>> +
>> +uint32_t spapr_of_client_open(SpaprMachineState *spapr, const char *path)
>> +{
>> +    int offset;
>> +    uint32_t ret = 0;
>> +    SpaprOfInstance *inst;
>> +
>> +    if (spapr->of_instance_last == 0xFFFFFFFF) {
>> +        /* We do not recycle ihandles yet */
>> +        goto trace_exit;
>> +    }
>> +    offset = fdt_path_offset(spapr->fdt_blob, path);
>> +    if (offset < 0) {
>> +        trace_spapr_of_client_error_unknown_path(path);
>> +        goto trace_exit;
>> +    }
>> +
>> +    inst = g_new(SpaprOfInstance, 1);
>> +    inst->phandle = fdt_get_phandle(spapr->fdt_blob, offset);
>> +    g_assert(inst->phandle);
>> +    ++spapr->of_instance_last;
>> +    inst->dev = of_client_find_qom_dev(sysbus_get_default(), path);
>> +    g_hash_table_insert(spapr->of_instances,
>> +                        GINT_TO_POINTER(spapr->of_instance_last),
>> +                        inst);
>> +    ret = spapr->of_instance_last;
>> +
>> +    if (inst->dev) {
>> +        const char *cdevstr = object_property_get_str(OBJECT(inst->dev),
>> +                                                      "chardev", NULL);
>> +
>> +        if (cdevstr) {
>> +            inst->cdev = qemu_chr_find(cdevstr);
>> +        }
>> +    }
>> +
>> +trace_exit:
>> +    trace_spapr_of_client_open(path, inst ? inst->phandle : 0, ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t of_client_open(SpaprMachineState *spapr, uint32_t pathaddr)
>> +{
>> +    char path[256];
>> +
>> +    readstr(pathaddr, path, sizeof(path));
>> +
>> +    return spapr_of_client_open(spapr, path);
>> +}
>> +
>> +static void of_client_close(SpaprMachineState *spapr, uint32_t ihandle)
>> +{
>> +    if (!g_hash_table_remove(spapr->of_instances, GINT_TO_POINTER(ihandle))) {
>> +        trace_spapr_of_client_error_unknown_ihandle_close(ihandle);
>> +    }
>> +}
>> +
>> +static uint32_t of_client_instance_to_package(SpaprMachineState *spapr,
>> +                                              uint32_t ihandle)
>> +{
>> +    gpointer instp = g_hash_table_lookup(spapr->of_instances,
>> +                                        GINT_TO_POINTER(ihandle));
>> +
>> +    if (!instp) {
>> +        return -1;
>> +    }
>> +
>> +    return ((SpaprOfInstance *)instp)->phandle;
>> +}
>> +
>> +static uint32_t of_client_package_to_path(const void *fdt, uint32_t phandle,
>> +                                          uint32_t buf, uint32_t len)
>> +{
>> +    char tmp[256];
>> +
>> +    if (0 == fdt_get_path(fdt, fdt_node_offset_by_phandle(fdt, phandle), tmp,
>> +                          sizeof(tmp))) {
>> +        tmp[sizeof(tmp) - 1] = 0;
>> +        cpu_physical_memory_write(buf, tmp, MIN(len, strlen(tmp)));
>> +    }
>> +    return len;
>> +}
>> +
>> +static uint32_t of_client_instance_to_path(SpaprMachineState *spapr,
>> +                                           uint32_t ihandle, uint32_t buf,
>> +                                           uint32_t len)
>> +{
>> +    uint32_t phandle = of_client_instance_to_package(spapr, ihandle);
>> +
>> +    if (phandle != -1) {
>> +        return of_client_package_to_path(spapr->fdt_blob, phandle, buf, len);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static uint32_t of_client_write(SpaprMachineState *spapr, uint32_t ihandle,
>> +                                uint32_t buf, uint32_t len)
>> +{
>> +    char tmp[256];
>> +    int toread, toprint, cb = MIN(len, 1024);
>> +    SpaprOfInstance *inst = (SpaprOfInstance *)
>> +        g_hash_table_lookup(spapr->of_instances, GINT_TO_POINTER(ihandle));
>> +
>> +    while (cb > 0) {
>> +        toread = MIN(cb + 1, sizeof(tmp));
>> +        readstr(buf, tmp, toread);
>> +        toprint = strlen(tmp);
>> +        if (inst && inst->cdev) {
>> +            toprint = qemu_chr_write(inst->cdev, (uint8_t *) tmp, toprint,
>> +                                     true);
>> +        } else {
>> +            /* We normally open stdout so this is fallback */
>> +            printf("DBG[%d]%s", ihandle, tmp);
>> +        }
>> +        buf += toprint;
>> +        cb -= toprint;
>> +    }
>> +
>> +    return len;
>> +}
>> +
>> +static bool of_client_claim_avail(GArray *claimed, uint64_t virt, uint64_t size)
>> +{
>> +    int i;
>> +    SpaprOfClaimed *c;
>> +
>> +    for (i = 0; i < claimed->len; ++i) {
>> +        c = &g_array_index(claimed, SpaprOfClaimed, i);
>> +        if ((c->start <= virt && virt < c->start + c->size) ||
>> +            (virt <= c->start && c->start < virt + size)) {
>> +            return false;
>> +        }
>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +static void of_client_claim_add(GArray *claimed, uint64_t virt, uint64_t size)
>> +{
>> +    SpaprOfClaimed newclaim;
>> +
>> +    newclaim.start = virt;
>> +    newclaim.size = size;
>> +    g_array_append_val(claimed, newclaim);
>> +}
>> +
>> +/*
>> + * "claim" claims memory at @virt if @align==0; otherwise it allocates
>> + * memory at the requested alignment.
>> + */
>> +uint64_t spapr_do_of_client_claim(SpaprMachineState *spapr, uint64_t virt,
>> +                                  uint64_t size, uint64_t align)
>> +{
>> +    uint32_t ret;
>> +
>> +    if (align == 0) {
>> +        if (!of_client_claim_avail(spapr->claimed, virt, size)) {
>> +            return -1;
>> +        }
>> +        ret = virt;
>> +    } else {
>> +        align = pow2ceil(align);
> 
> Should this be a pow2ceil, or should it just return an error if align
> is not a power of 2?  Note that aligning something to 4 bytes will
> (probably) make it *not* aligned to 3 bytes.

I did not see any notes about specific alignment requirements here. The
idea is that clients may simply not expect unaligned memory at all; I
could probably just drop it and see what happens...
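
For comparison, the "return an error" variant could look roughly like
the sketch below. align_up_checked() and is_pow2() are hypothetical
names used for illustration only; the rounding expression is the same
one the patch applies to claimed_base.

```c
#include <stdbool.h>
#include <stdint.h>

/* True only for 1, 2, 4, 8, ... (zero is not a power of two). */
bool is_pow2(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/*
 * Hypothetical helper: round @addr up to @align and store it in @out.
 * Fails instead of silently rounding @align up when it is not a power
 * of two, and also fails if the rounding would overflow 64 bits.
 */
bool align_up_checked(uint64_t addr, uint64_t align, uint64_t *out)
{
    if (!is_pow2(align)) {
        return false;
    }
    if (addr > UINT64_MAX - (align - 1)) {
        return false;
    }

    *out = (addr + align - 1) & ~(align - 1);
    return true;
}
```

The caller would then return -1 from "claim" when the helper fails,
rather than handing back memory with an alignment nobody asked for.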


> 
>> +        spapr->claimed_base = (spapr->claimed_base + align - 1) & ~(align - 1);
>> +        while (1) {
>> +            if (spapr->claimed_base >= spapr->rma_size) {
>> +                perror("Out of memory");
> 
> error_report() or qemu_log() or something and a message with some more
> specificity, please.


What kind of specificity is missing here?



> 
>> +                return -1;
>> +            }
>> +            if (of_client_claim_avail(spapr->claimed, spapr->claimed_base,
>> +                                      size)) {
>> +                break;
>> +            }
>> +            spapr->claimed_base += size;
>> +        }
>> +        ret = spapr->claimed_base;
>> +    }
>> +
>> +    spapr->claimed_base = MAX(spapr->claimed_base, ret + size);
>> +    of_client_claim_add(spapr->claimed, virt, size);
>> +    trace_spapr_of_client_claim(virt, size, align, ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static uint32_t of_client_claim(SpaprMachineState *spapr, uint32_t virt,
>> +                                uint32_t size, uint32_t align)
>> +{
>> +    if (align) {
>> +        return -1;
>> +    }
>> +    if (!of_client_claim_avail(spapr->claimed, virt, size)) {
>> +        return -1;
>> +    }
>> +
>> +    spapr->claimed_base = MAX(spapr->claimed_base, virt + size);
>> +    of_client_claim_add(spapr->claimed, virt, size);
>> +    trace_spapr_of_client_claim(virt, size, align, virt);
> 
> Huh.  So do_of_client_claim() is never used from of_client_claim(),
> only from "internal" claimers.  It definitely needs a different name.


It is used in v6. Linux always passes non-zero @virt but grub does not.


>> +    return virt;
>> +}
>> +
>> +static uint32_t of_client_call_method(SpaprMachineState *spapr,
>> +                                      uint32_t methodaddr, uint32_t ihandle,
>> +                                      uint32_t param, uint32_t *ret2)
>> +{
>> +    uint32_t ret = -1;
>> +    char path[256] = "", method[256] = "";
>> +    uint32_t phandle = of_client_instance_to_package(spapr, ihandle);
>> +    int offset;
>> +
>> +    if (!ihandle) {
>> +        goto trace_exit;
>> +    }
>> +
>> +    readstr(methodaddr, method, sizeof(method));
>> +    phandle = of_client_instance_to_package(spapr, ihandle);
>> +    if (!phandle) {
>> +        goto trace_exit;
>> +    }
>> +
>> +    offset = fdt_node_offset_by_phandle(spapr->fdt_blob, phandle);
>> +    if (offset < 0) {
>> +        goto trace_exit;
>> +    }
>> +
>> +    if (fdt_get_path(spapr->fdt_blob, offset, path, sizeof(path))) {
>> +        goto trace_exit;
>> +    }
>> +
>> +    if (strcmp(path, "/") == 0) {
>> +        if (strcmp(method, "ibm,client-architecture-support") == 0) {
>> +
>> +#define FDT_MAX_SIZE            0x100000
>> +            ret = do_client_architecture_support(POWERPC_CPU(first_cpu), spapr,
>> +                                                 param, FDT_MAX_SIZE);
>> +            *ret2 = 0;
>> +        }
>> +    } else if (strcmp(path, "/rtas") == 0) {
>> +        if (strcmp(method, "instantiate-rtas") == 0) {
>> +            spapr_instantiate_rtas(spapr, param);
>> +            ret = 0;
>> +            *ret2 = param; /* rtasbase */
>> +        }
>> +    } else {
>> +        trace_spapr_of_client_error_unknown_method(method);
>> +    }
>> +
>> +trace_exit:
>> +    trace_spapr_of_client_method(ihandle, method, param, phandle, path, ret);
>> +
>> +    return ret;
>> +}
>> +
>> +static void of_client_quiesce(SpaprMachineState *spapr)
>> +{
>> +    int rc = fdt_pack(spapr->fdt_blob);
>> +    /* Should only fail if we've built a corrupted tree */
>> +    assert(rc == 0);
>> +
>> +    spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
>> +    spapr->fdt_initial_size = spapr->fdt_size;
>> +}
>> +
>> +int spapr_h_client(SpaprMachineState *spapr, target_ulong of_client_args)
>> +{
>> +    struct prom_args args = { 0 };
>> +    char service[64];
>> +    unsigned nargs, nret;
>> +    int i, servicelen;
>> +
>> +    cpu_physical_memory_read(of_client_args, &args, sizeof(args));
>> +    nargs = be32_to_cpu(args.nargs);
>> +    nret = be32_to_cpu(args.nret);
>> +    readstr(be32_to_cpu(args.service), service, sizeof(service));
>> +    servicelen = strlen(service);
>> +
>> +#define cmpservice(s, a, r) \
>> +    _cmpservice(service, servicelen, nargs, nret, (s), sizeof(s), (a), (r))
>> +
>> +    if (cmpservice("finddevice", 1, 1)) {
>> +        args.args[nargs] = of_client_finddevice(spapr->fdt_blob,
>> +                                                be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("getprop", 4, 1)) {
>> +        args.args[nargs] = of_client_getprop(spapr->fdt_blob,
>> +                                             be32_to_cpu(args.args[0]),
>> +                                             be32_to_cpu(args.args[1]),
>> +                                             be32_to_cpu(args.args[2]),
>> +                                             be32_to_cpu(args.args[3]));
>> +    } else if (cmpservice("getproplen", 2, 1)) {
>> +        args.args[nargs] = of_client_getproplen(spapr->fdt_blob,
>> +                                                be32_to_cpu(args.args[0]),
>> +                                                be32_to_cpu(args.args[1]));
>> +    } else if (cmpservice("setprop", 4, 1)) {
>> +        args.args[nargs] = of_client_setprop(spapr,
>> +                                             be32_to_cpu(args.args[0]),
>> +                                             be32_to_cpu(args.args[1]),
>> +                                             be32_to_cpu(args.args[2]),
>> +                                             be32_to_cpu(args.args[3]));
>> +    } else if (cmpservice("nextprop", 3, 1)) {
>> +        args.args[nargs] = of_client_nextprop(spapr->fdt_blob,
>> +                                              be32_to_cpu(args.args[0]),
>> +                                              be32_to_cpu(args.args[1]),
>> +                                              be32_to_cpu(args.args[2]));
>> +    } else if (cmpservice("peer", 1, 1)) {
>> +        args.args[nargs] = of_client_peer(spapr->fdt_blob,
>> +                                          be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("child", 1, 1)) {
>> +        args.args[nargs] = of_client_child(spapr->fdt_blob,
>> +                                           be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("parent", 1, 1)) {
>> +        args.args[nargs] = of_client_parent(spapr->fdt_blob,
>> +                                            be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("open", 1, 1)) {
>> +        args.args[nargs] = of_client_open(spapr, be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("close", 1, 0)) {
>> +        of_client_close(spapr, be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("instance-to-package", 1, 1)) {
>> +        args.args[nargs] =
>> +            of_client_instance_to_package(spapr,
>> +                                          be32_to_cpu(args.args[0]));
>> +    } else if (cmpservice("package-to-path", 3, 1)) {
>> +        args.args[nargs] = of_client_package_to_path(spapr->fdt_blob,
>> +                                                     be32_to_cpu(args.args[0]),
>> +                                                     be32_to_cpu(args.args[1]),
>> +                                                     be32_to_cpu(args.args[2]));
>> +    } else if (cmpservice("instance-to-path", 3, 1)) {
>> +        args.args[nargs] =
>> +            of_client_instance_to_path(spapr,
>> +                                       be32_to_cpu(args.args[0]),
>> +                                       be32_to_cpu(args.args[1]),
>> +                                       be32_to_cpu(args.args[2]));
>> +    } else if (cmpservice("write", 3, 1)) {
>> +        args.args[nargs] = of_client_write(spapr,
>> +                                           be32_to_cpu(args.args[0]),
>> +                                           be32_to_cpu(args.args[1]),
>> +                                           be32_to_cpu(args.args[2]));
>> +    } else if (cmpservice("claim", 3, 1)) {
>> +        args.args[nargs] = of_client_claim(spapr,
>> +                                           be32_to_cpu(args.args[0]),
>> +                                           be32_to_cpu(args.args[1]),
>> +                                           be32_to_cpu(args.args[2]));
>> +    } else if (cmpservice("call-method", 3, 2)) {
>> +        args.args[nargs] = of_client_call_method(spapr,
>> +                                                 be32_to_cpu(args.args[0]),
>> +                                                 be32_to_cpu(args.args[1]),
>> +                                                 be32_to_cpu(args.args[2]),
>> +                                                 &args.args[nargs + 1]);
>> +    } else if (cmpservice("quiesce", 0, 0)) {
>> +        of_client_quiesce(spapr);
>> +    } else if (cmpservice("exit", 0, 0)) {
>> +        error_report("Stopped as the VM requested \"exit\"");
>> +        vm_stop(RUN_STATE_PAUSED);
>> +    } else {
>> +        trace_spapr_of_client_error_unknown_service(service, nargs, nret);
>> +        args.args[nargs] = -1;
> 
> You've never bounds checked nargs at this point.
> 
>> +    }
>> +
>> +    for (i = 0; i < nret; ++i) {
> 
> And likewise you might not have bounds checked nret.

Oh, true. Thanks,
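
Something along these lines before touching args.args[] should cover
both cases. It is a sketch only: PROM_ARGS_MAX is an assumed name and
value for the number of slots in prom_args.args[] (the struct
definition is not quoted here), and prom_args_in_bounds() is a
hypothetical helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: number of 32-bit slots in prom_args.args[]. */
#define PROM_ARGS_MAX 10

/*
 * Hypothetical check: the guest-supplied nargs/nret must together fit
 * into args[]. Written so the sum cannot wrap around.
 */
bool prom_args_in_bounds(uint32_t nargs, uint32_t nret)
{
    return nargs <= PROM_ARGS_MAX && nret <= PROM_ARGS_MAX - nargs;
}
```

If the check fails, the handler would bail out before any
args.args[nargs] store and before the byte-swap loop over nret.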


> 
>> +        args.args[nargs + i] = be32_to_cpu(args.args[nargs + i]);
>> +    }
>> +    cpu_physical_memory_write(of_client_args, &args, sizeof(args));
>> +
>> +    return H_SUCCESS;
>> +}
>> diff --git a/hw/ppc/trace-events b/hw/ppc/trace-events
>> index 9ea620f23c85..e2d1e58d07c3 100644
>> --- a/hw/ppc/trace-events
>> +++ b/hw/ppc/trace-events
>> @@ -21,6 +21,18 @@ spapr_update_dt(unsigned cb) "New blob %u bytes"
>>  spapr_update_dt_failed_size(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
>>  spapr_update_dt_failed_check(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
>>  
>> +# spapr_client.c
>> +spapr_of_client_error_param(const char *method, int nargscheck, int nretcheck, int nargs, int nret) "%s takes/returns %d/%d, not %d/%d"
>> +spapr_of_client_error_unknown_service(const char *service, int nargs, int nret) "%s args=%d rets=%d"
>> +spapr_of_client_error_unknown_method(const char *method) "%s"
>> +spapr_of_client_error_unknown_ihandle_close(uint32_t ihandle) "0x%x"
>> +spapr_of_client_error_unknown_path(const char *path) "%s"
>> +spapr_of_client_claim(uint32_t virt, uint32_t size, uint32_t align, uint32_t ret) "virt=0x%x size=0x%x align=0x%x => 0x%x"
>> +spapr_of_client_method(uint32_t ihandle, const char *method, uint32_t param, uint32_t phandle, const char *path, uint32_t ret) "0x%x \"%s\" param=0x%x ph=0x%x \"%s\" => 0x%x"
>> +spapr_of_client_getprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
>> +spapr_of_client_setprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
>> +spapr_of_client_open(const char *path, uint32_t phandle, uint32_t ihandle) "%s 0x%x => 0x%x"
>> +
>>  # spapr_hcall_tpm.c
>>  spapr_h_tpm_comm(const char *device_path, uint64_t operation) "tpm_device_path=%s operation=0x%"PRIu64
>>  spapr_tpm_execute(uint64_t data_in, uint64_t data_in_sz, uint64_t data_out, uint64_t data_out_sz) "data_in=0x%"PRIx64", data_in_sz=%"PRIu64", data_out=0x%"PRIx64", data_out_sz=%"PRIu64
>
David Gibson Jan. 22, 2020, 6:32 a.m. UTC | #3
On Tue, Jan 21, 2020 at 06:25:36PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 21/01/2020 16:11, David Gibson wrote:
> > On Fri, Jan 10, 2020 at 01:09:25PM +1100, Alexey Kardashevskiy wrote:
> >> The Petitboot bootloader is way more advanced than SLOF is ever going to
> >> be as Petitboot comes with the full-featured Linux kernel with all
> >> the drivers, and initramdisk with quite user friendly interface.
> >> The problem with ditching SLOF is that an unmodified pseries kernel can
> >> either start via:
> >> 1. kexec, this requires presence of RTAS and skips
> >> ibm,client-architecture-support entirely;
> >> 2. normal boot, this heavily relies on the OF1275 client interface to
> >> fetch the device tree and do early setup (claim memory).
> >>
> >> This adds a new bios-less mode to the pseries machine: "bios=on|off".
> >> When enabled, QEMU does not load SLOF and jumps to the kernel from
> >> "-kernel".
> > 
> > I don't love the name "bios" for this flag, since BIOS tends to refer
> > to old-school x86 firmware.  Given the various plans we're considering
> > for the future, I'd suggest "firmware=slof" for the current in-guest SLOF
> > mode, and say "firmware=vof" (Virtual Open Firmware) for the new
> > model.  We can consider firmware=petitboot or firmware=none (for
> > direct kexec-style boot into -kernel) or whatever in the future.
> 
> Ok. We could also enforce default loading addresses for SLOF/kernel/grub
> and drop "kernel-addr", although it is going to be confusing if it
> changes in a not so obvious way...

Yes, I think that would be confusing, so I think adding the
kernel-addr override is a good idea; I'd just like it split out for
clarity.

> In fact, I will ideally need 3 flags:
> -bios: on|off to stop loading SLOF;
> -kernel-addr: 0x0 for slof/kernel; 0x20000 for grub;

I'm happy for that one to be separate from the "firmware style"
option.

> -kernel-translate-hack: on|off - as grub is linked to run from 0x20000
> and it only works when placed there, the hack breaks it.

Hrm.  I don't really understand what this one is about.  That doesn't
really seem like something the user would ever want to fiddle with
directly.

> Or we can pass grub via -bios and not via -kernel, but strictly speaking
> there is still a firmware (that new 20-byte blob), so it would not be
> accurate.
> 
> We can put this all into a single
> -firmware slof|vof|grub|linux. Not sure.

I'm not thinking of "grub" as a separate option - that would be the
same as "vof".  Using vof + no -kernel we'd need to scan the disks in
the same way SLOF does, and look for a boot partition, which will
probably contain a GRUB image.  Then we'd need enough faked OF client
calls to let GRUB operate.

> >> The client interface is implemented exactly as RTAS - a 20 bytes blob,
> >> right next after the RTAS blob. The entry point is passed to the kernel
> >> via GPR5.
> >>
> >> This implements a handful of client interface methods just to get going.
> >> In particular, this implements the device tree fetching,
> >> ibm,client-architecture-support and instantiate-rtas.
> >>
> >> This implements changing FDT properties for RTAS (for vmlinux and zImage)
> >> and initramdisk location (for zImage). To make this work, this skips
> >> fdt_pack() when bios=off as not packing the blob leaves some room for
> >> appending.
> >>
> >> This assigns "phandles" to device tree nodes, as there is no more SLOF
> >> whose OF node addresses served as phandle values.
> >> This keeps predefined nodes (such as XICS/NVLINK/...) unchanged.
> >> phandles are regenerated at every FDT rebuild.
> >>
> >> When bios=off, this adds "/chosen" every time QEMU builds a tree.
> >>
> >> This implements "claim" which the client (Linux) uses for memory
> >> allocation; this is also used by QEMU for claiming kernel/initrd images,
> >> client interface entry point, RTAS and the initial stack.
> >>
> >> While at this, add a "kernel-addr" machine parameter to allow moving
> >> the kernel in memory. This is useful for debugging if the kernel is
> >> loaded at @0, although not necessary.
> >>
> >> This adds basic instances support which are managed by a hashmap
> >> ihandle->[phandle, DeviceState, Chardev].
> >>
> >> Note that a 64bit PCI fix is required for Linux:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735e
> >>
> >> The test command line:
> >>
> >> qemu-system-ppc64 \
> >> -nodefaults \
> >> -chardev stdio,id=STDIO0,signal=off,mux=on \
> >> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> >> -mon id=MON0,chardev=STDIO0,mode=readline \
> >> -nographic \
> >> -vga none \
> >> -kernel pbuild/kernel-le-guest/arch/powerpc/boot/zImage.pseries \
> >> -machine pseries,bios=off,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken \
> >> -m 4G \
> >> -enable-kvm \
> >> -initrd pb/rootfs.cpio.xz \
> >> -device nec-usb-xhci,id=nec-usb-xhci0 \
> >> -netdev tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0 \
> >> -device virtio-net-pci,id=vnet0,netdev=TAP0 img/f30le.qcow2 \
> >> -snapshot \
> >> -smp 8,threads=8 \
> >> -trace events=qemu_trace_events \
> >> -d guest_errors \
> >> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.ssh54088 \
> >> -mon chardev=SOCKET0,mode=control
> >>
> >> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > 
> > It'd be nice to split this patch up a bit, though I'll admit it's not
> > very obvious where to do so.
> 
> 
> v6 is a patchset.
> 
> >> ---
> >> Changes:
> >> v5:
> >> * made instances keep device and chardev pointers
> >> * removed VIO dependencies
> >> * print error if RTAS memory is not claimed as it should have been
> >> * pack FDT as "quiesce"
> >>
> >> v4:
> >> * fixed open
> >> * validate ihandles in "call-method"
> >>
> >> v3:
> >> * fixed phandles allocation
> >> * s/__be32/uint32_t/ as we do not normally have __be32 type in qemu
> >> * fixed size of /chosen/stdout
> >> * bunch of renames
> >> * do not create rtas properties at all, let the client deal with it;
> >> instead setprop allows changing these in the FDT
> >> * no more packing FDT when bios=off - nobody needs it and getprop does not
> >> work otherwise
> >> * allow updating initramdisk device tree properties (for zImage)
> >> * added instances
> >> * fixed stdout on OF's "write"
> >> * removed special handling for stdout in OF client, spapr-vty handles it
> >> instead
> >>
> >> v2:
> >> * fixed claim()
> >> * added "setprop"
> >> * cleaner client interface and RTAS blobs management
> >> * boots to petitboot and further to the target system
> >> * more trace points
> >> ---
> >>  hw/ppc/Makefile.objs     |   1 +
> >>  include/hw/ppc/spapr.h   |  28 +-
> >>  hw/ppc/spapr.c           | 266 ++++++++++++++--
> >>  hw/ppc/spapr_hcall.c     |  74 +++--
> >>  hw/ppc/spapr_of_client.c | 633 +++++++++++++++++++++++++++++++++++++++
> >>  hw/ppc/trace-events      |  12 +
> >>  6 files changed, 959 insertions(+), 55 deletions(-)
> >>  create mode 100644 hw/ppc/spapr_of_client.c
> >>
> >> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
> >> index 101e9fc59185..20efc0aa6f9b 100644
> >> --- a/hw/ppc/Makefile.objs
> >> +++ b/hw/ppc/Makefile.objs
> >> @@ -6,6 +6,7 @@ obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
> >>  obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
> >>  obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o spapr_irq.o
> >>  obj-$(CONFIG_PSERIES) += spapr_tpm_proxy.o
> >> +obj-$(CONFIG_PSERIES) += spapr_of_client.o
> >>  obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
> >>  # IBM PowerNV
> >>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index 61f005c6f686..efc2c70abf99 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -105,6 +105,11 @@ struct SpaprCapabilities {
> >>      uint8_t caps[SPAPR_CAP_NUM];
> >>  };
> >>  
> >> +typedef struct {
> >> +    uint64_t start;
> >> +    uint64_t size;
> >> +} SpaprOfClaimed;
> >> +
> > 
> > Can we split more of the fake-OF code into a new file?
> 
> 
> Done in v6, I quite reworked it, this is why I told you to ping me
> before you review this one :)

Oops, I forgot.  Sorry.

> >>  /**
> >>   * SpaprMachineClass:
> >>   */
> >> @@ -160,6 +165,13 @@ struct SpaprMachineState {
> >>      void *fdt_blob;
> >>      long kernel_size;
> >>      bool kernel_le;
> >> +    uint64_t kernel_addr;
> >> +    bool bios_enabled;
> >> +    uint32_t rtas_base;
> >> +    GArray *claimed; /* array of SpaprOfClaimed */
> >> +    uint64_t claimed_base;
> >> +    GHashTable *of_instances; /* ihandle -> SpaprOfInstance */
> >> +    uint32_t of_instance_last;
> >>      uint32_t initrd_base;
> >>      long initrd_size;
> >>      uint64_t rtc_offset; /* Now used only during incoming migration */
> >> @@ -510,7 +522,8 @@ struct SpaprMachineState {
> >>  /* Client Architecture support */
> >>  #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
> >>  #define KVMPPC_H_UPDATE_DT      (KVMPPC_HCALL_BASE + 0x3)
> >> -#define KVMPPC_HCALL_MAX        KVMPPC_H_UPDATE_DT
> >> +#define KVMPPC_H_CLIENT         (KVMPPC_HCALL_BASE + 0x5)
> >> +#define KVMPPC_HCALL_MAX        KVMPPC_H_CLIENT
> >>  
> >>  /*
> >>   * The hcall range 0xEF00 to 0xEF80 is reserved for use in facilitating
> >> @@ -538,6 +551,11 @@ void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
> >>  target_ulong spapr_hypercall(PowerPCCPU *cpu, target_ulong opcode,
> >>                               target_ulong *args);
> >>  
> >> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
> >> +                                            SpaprMachineState *spapr,
> >> +                                            target_ulong addr,
> >> +                                            target_ulong fdt_bufsize);
> >> +
> >>  /* Virtual Processor Area structure constants */
> >>  #define VPA_MIN_SIZE           640
> >>  #define VPA_SIZE_OFFSET        0x4
> >> @@ -769,6 +787,11 @@ struct SpaprEventLogEntry {
> >>  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
> >>  void spapr_events_init(SpaprMachineState *sm);
> >>  void spapr_dt_events(SpaprMachineState *sm, void *fdt);
> >> +uint64_t spapr_do_of_client_claim(SpaprMachineState *spapr, uint64_t virt,
> >> +                                  uint64_t size, uint64_t align);
> >> +
> >> +uint32_t spapr_of_client_open(SpaprMachineState *spapr, const char *path);
> >> +int spapr_h_client(SpaprMachineState *spapr, target_ulong client_args);
> >>  void close_htab_fd(SpaprMachineState *spapr);
> >>  void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
> >>  void spapr_free_hpt(SpaprMachineState *spapr);
> >> @@ -891,4 +914,7 @@ void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
> >>  #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
> >>  
> >>  void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
> >> +
> >> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base);
> >> +
> >>  #endif /* HW_SPAPR_H */
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index e62c89b3dd40..76ce8b973082 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -896,6 +896,55 @@ out:
> >>      return ret;
> >>  }
> >>  
> >> +/*
> >> + * Below is a compiled version of RTAS blob and OF client interface entry point.
> >> + *
> >> + * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
> >> + * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
> >> + *
> >> + *   .globl  _start
> >> + *   _start:
> >> + *           mr      4,3
> >> + *           lis     3,KVMPPC_H_RTAS@h
> >> + *           ori     3,3,KVMPPC_H_RTAS@l
> >> + *           sc      1
> >> + *           blr
> >> + *           mr      4,3
> >> + *           lis     3,KVMPPC_H_CLIENT@h
> >> + *           ori     3,3,KVMPPC_H_CLIENT@l
> >> + *           sc      1
> >> + *           blr
> >> + */
> >> +static struct {
> > 
> > Should be able to add a 'const' here.
> > 
> >> +    uint8_t rtas[20], client[20];
> >> +} QEMU_PACKED rtas_client_blob = {
> >> +    .rtas = {
> >> +        0x7c, 0x64, 0x1b, 0x78,
> >> +        0x3c, 0x60, 0x00, 0x00,
> >> +        0x60, 0x63, 0xf0, 0x00,
> >> +        0x44, 0x00, 0x00, 0x22,
> >> +        0x4e, 0x80, 0x00, 0x20
> >> +    },
> >> +    .client = {
> >> +        0x7c, 0x64, 0x1b, 0x78,
> >> +        0x3c, 0x60, 0x00, 0x00,
> >> +        0x60, 0x63, 0xf0, 0x05,
> >> +        0x44, 0x00, 0x00, 0x22,
> >> +        0x4e, 0x80, 0x00, 0x20
> >> +    }
> >> +};
> > 
> > I'd split this into two variables - there's not really any connection
> > between the two, AFAICT.
> > 
> > Note that I'm getting closer to merging the fwnmi stuff at which point
> > you'll need to pad the RTAS blob with a bunch of extra space for
> > taking the fwnmi dumps.
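
To illustrate why I think these are independent: as far as I can tell
the two stubs differ only in the hcall number loaded into r3. Below is
a rough standalone sketch that generates either one. The helper names
(build_hcall_stub, put_be32) are made up for illustration and are not
in the patch; the instruction encodings are simply copied from the
bytes above, where the hcall numbers appear as 0xf000 and 0xf005.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HCALL_STUB_SIZE 20 /* five 4-byte instructions */

/* Store a 32-bit value big-endian, matching the -mbig build in the patch */
static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Hypothetical helper: emit the "mr; lis; ori; sc 1; blr" stub for one hcall */
static void build_hcall_stub(uint8_t stub[HCALL_STUB_SIZE], uint32_t hcall)
{
    const uint32_t insns[] = {
        0x7c641b78,                            /* mr   4,3 */
        0x3c600000 | ((hcall >> 16) & 0xffff), /* lis  3,hcall@h */
        0x60630000 | (hcall & 0xffff),         /* ori  3,3,hcall@l */
        0x44000022,                            /* sc   1 */
        0x4e800020,                            /* blr */
    };
    size_t n = sizeof(insns) / sizeof(insns[0]);

    assert(n * 4 == HCALL_STUB_SIZE);
    for (size_t i = 0; i < n; i++) {
        put_be32(stub + 4 * i, insns[i]);
    }
}
```

Whether generating the bytes at runtime is nicer than two literal
arrays is a matter of taste; the point is only that nothing ties the
two blobs together.
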
> > 
> >> +
> >> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base)
> >> +{
> >> +    if (spapr_do_of_client_claim(spapr, base, sizeof(rtas_client_blob.rtas),
> >> +                                 0) != -1) {
> > 
> > Wait.. == -1 is the success case?  That's a very surprising interface.
> 
> 
> This is a sort of assert. spapr_do_of_client_claim() returns an
> address, and the client is expected to claim the memory it wants
> RTAS to be copied to; this makes sure that either that happened or
> we claim it here.

Ah!  Ok, I understand.
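
For anyone else who trips over the same thing, here is how I now read
the contract, as a minimal standalone sketch. Everything in it
(ClaimTable, claim_range, CLAIM_FAIL, the fixed-size table) is invented
for illustration and is not the patch's code; I'm assuming -1 means
"the range overlaps an existing claim".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CLAIM_FAIL ((uint64_t)-1)
#define MAX_CLAIMS 16

typedef struct {
    uint64_t start;
    uint64_t size;
} Claimed;

typedef struct {
    Claimed ranges[MAX_CLAIMS];
    size_t count;
} ClaimTable;

/* True if [start, start + size) overlaps any existing claim */
static bool claim_overlaps(const ClaimTable *t, uint64_t start, uint64_t size)
{
    for (size_t i = 0; i < t->count; i++) {
        const Claimed *c = &t->ranges[i];

        if (start < c->start + c->size && c->start < start + size) {
            return true;
        }
    }
    return false;
}

/* Claim a fixed range; returns its address, or CLAIM_FAIL if already taken */
static uint64_t claim_range(ClaimTable *t, uint64_t start, uint64_t size)
{
    assert(t != NULL);
    if (t->count == MAX_CLAIMS || claim_overlaps(t, start, size)) {
        return CLAIM_FAIL;
    }
    t->ranges[t->count++] = (Claimed){ .start = start, .size = size };
    return start;
}

/*
 * "Claim unless the client already did": if the claim succeeds here,
 * the client had NOT claimed the range, which is the case worth reporting.
 */
static bool client_forgot_to_claim(ClaimTable *t, uint64_t base, uint64_t size)
{
    return claim_range(t, base, size) != CLAIM_FAIL;
}
```

Seen this way the "!= -1" test reads naturally: a successful claim at
this point is the surprising outcome.
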

> >> +        error_report("The OF client did not claim RTAS memory at 0x%x", base);
> > 
> > Error message is hard to follow.  Maybe "Could not claim memory
> > for RTAS"

Which makes my suggestion here a bad one too.

> > 
> >> +    }
> >> +    spapr->rtas_base = base;
> >> +    cpu_physical_memory_write(base, rtas_client_blob.rtas,
> >> +                              sizeof(rtas_client_blob.rtas));
> >> +}
> >> +
> >>  static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> >>  {
> >>      MachineState *ms = MACHINE(spapr);
> >> @@ -980,6 +1029,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
> >>      _FDT(fdt_setprop(fdt, rtas, "ibm,lrdr-capacity",
> >>                       lrdr_capacity, sizeof(lrdr_capacity)));
> >>  
> >> +    if (!spapr->bios_enabled) {
> >> +        _FDT(fdt_setprop_cell(fdt, rtas, "rtas-size",
> >> +                              sizeof(rtas_client_blob.rtas)));
> >> +    }
> >> +
> >>      spapr_dt_rtas_tokens(fdt, rtas);
> >>  }
> >>  
> >> @@ -1057,7 +1111,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
> >>      }
> >>  
> >>      if (spapr->kernel_size) {
> >> -        uint64_t kprop[2] = { cpu_to_be64(KERNEL_LOAD_ADDR),
> >> +        uint64_t kprop[2] = { cpu_to_be64(spapr->kernel_addr),
> > 
> > Hrm, I really think I would like to see the change to adjustable
> > kernel_addr split out - it puts a bunch of noise into the main kill
> > slof patch.
> 
> Sure, I'll do that if we decide to proceed with this.
> 
> 
> > 
> >>                                cpu_to_be64(spapr->kernel_size) };
> >>  
> >>          _FDT(fdt_setprop(fdt, chosen, "qemu,boot-kernel",
> >> @@ -1245,7 +1299,8 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
> >>      /* Build memory reserve map */
> >>      if (reset) {
> >>          if (spapr->kernel_size) {
> >> -            _FDT((fdt_add_mem_rsv(fdt, KERNEL_LOAD_ADDR, spapr->kernel_size)));
> >> +            _FDT((fdt_add_mem_rsv(fdt, spapr->kernel_addr,
> >> +                                  spapr->kernel_size)));
> >>          }
> >>          if (spapr->initrd_size) {
> >>              _FDT((fdt_add_mem_rsv(fdt, spapr->initrd_base,
> >> @@ -1268,12 +1323,56 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
> >>          }
> >>      }
> >>  
> >> +    if (!spapr->bios_enabled) {
> >> +        uint32_t phandle;
> >> +        int i, offset, proplen = 0;
> >> +        const void *prop;
> >> +        bool found = false;
> >> +        GArray *phandles = g_array_new(false, false, sizeof(uint32_t));
> >> +
> >> +        /* Find all predefined phandles */
> >> +        for (offset = fdt_next_node(fdt, -1, NULL);
> >> +             offset >= 0;
> >> +             offset = fdt_next_node(fdt, offset, NULL)) {
> >> +            prop = fdt_getprop_namelen(fdt, offset, "phandle", 7, &proplen);
> > 
> > You can just use fdt_getprop() rather than the namelen variant (that's
> > only really useful when you don't have a \0-terminated string with the
> > name).
> 
> Ok, will fix. There are just too many similar functions in libfdt.h and
> fdt_getprop() could be inlined, probably.

It won't be inlined, but I think it will be tail-call optimized so it
might as well be.  That is, I think the .o will look something like:

fdt_getprop:
	jiggle some registers
	bl	strlen
	jiggle some regs
fdt_getprop_namelen:
	...
	blr
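
In C terms the shape I have in mind is a thin wrapper whose only job is
to compute the length and then call the _namelen variant in tail
position. A self-contained sketch follows; lookup() and
lookup_namelen() are hypothetical stand-ins operating on a plain array
of names, not the libfdt functions themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the _namelen variant: takes an explicit name length */
static int lookup_namelen(const char *const *names, size_t count,
                          const char *name, size_t namelen)
{
    for (size_t i = 0; i < count; i++) {
        if (strlen(names[i]) == namelen &&
            memcmp(names[i], name, namelen) == 0) {
            return (int)i;
        }
    }
    return -1;
}

/* Thin wrapper: one strlen(), then a call in tail position */
static int lookup(const char *const *names, size_t count, const char *name)
{
    assert(name != NULL);
    return lookup_namelen(names, count, name, strlen(name));
}
```

So a caller holding a \0-terminated name gains nothing by spelling out
the length itself.
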

> >> +            if (prop && proplen == sizeof(uint32_t)) {
> >> +                phandle = fdt32_ld(prop);
> >> +                g_array_append_val(phandles, phandle);
> >> +            }
> >> +        }
> >> +
> >> +        /* Assign phandles skipping the predefined ones */
> >> +        for (offset = fdt_next_node(fdt, -1, NULL), phandle = 1;
> >> +             offset >= 0;
> >> +             offset = fdt_next_node(fdt, offset, NULL), ++phandle) {
> >> +            prop = fdt_getprop_namelen(fdt, offset, "phandle", 7, &proplen);
> >> +            if (prop) {
> >> +                continue;
> >> +            }
> >> +            /* Check if the current phandle is not allocated already */
> >> +            for ( ; ; ++phandle) {
> >> +                for (i = 0, found = false; i < phandles->len; ++i) {
> >> +                    if (phandle == g_array_index(phandles, uint32_t, i)) {
> >> +                        found = true;
> >> +                        break;
> >> +                    }
> >> +                }
> >> +                if (!found) {
> >> +                    break;
> >> +                }
> >> +            }
> >> +            _FDT(fdt_setprop_cell(fdt, offset, "phandle", phandle));
> >> +        }
> >> +        g_array_unref(phandles);
> >> +    }
> >> +
> >>      return fdt;
> >>  }
> >>  
> >>  static uint64_t translate_kernel_address(void *opaque, uint64_t addr)
> >>  {
> >> -    return (addr & 0x0fffffff) + KERNEL_LOAD_ADDR;
> >> +    SpaprMachineState *spapr = opaque;
> >> +    return (addr & 0x0fffffff) + spapr->kernel_addr;
> >>  }
> >>  
> >>  static void emulate_spapr_hypercall(PPCVirtualHypervisor *vhyp,
> >> @@ -1660,24 +1759,89 @@ static void spapr_machine_reset(MachineState *machine)
> >>       */
> >>      fdt_addr = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FDT_MAX_SIZE;
> >>  
> >> +    /* Set up the entry state */
> >> +    if (!spapr->bios_enabled) {
> >> +        if (spapr->claimed) {
> >> +            g_array_unref(spapr->claimed);
> >> +        }
> >> +        if (spapr->of_instances) {
> >> +            g_hash_table_unref(spapr->of_instances);
> >> +        }
> >> +
> >> +        spapr->claimed = g_array_new(false, false, sizeof(SpaprOfClaimed));
> >> +        spapr->of_instances = g_hash_table_new(g_direct_hash, g_direct_equal);
> >> +
> >> +        spapr->claimed_base = 0x10000; /* Avoid using the first system page */
> >> +
> >> +        spapr_cpu_set_entry_state(first_ppc_cpu, spapr->kernel_addr,
> >> +                                  spapr->initrd_base);
> >> +        first_ppc_cpu->env.gpr[4] = spapr->initrd_size;
> >> +
> >> +        if (spapr_do_of_client_claim(spapr, spapr->kernel_addr,
> >> +                                  spapr->kernel_size, 0) == -1) {
> >> +            error_report("Memory for kernel is in use");
> >> +            exit(1);
> >> +        }
> >> +        if (spapr_do_of_client_claim(spapr, spapr->initrd_base,
> >> +                                  spapr->initrd_size, 0) == -1) {
> >> +            error_report("Memory for initramdisk is in use");
> >> +            exit(1);
> >> +        }
> >> +        first_ppc_cpu->env.gpr[1] = spapr_do_of_client_claim(spapr, 0, 0x40000,
> >> +                                                             0x10000);
> >> +        if (first_ppc_cpu->env.gpr[1] == -1) {
> >> +            error_report("Memory allocation for stack failed");
> >> +            exit(1);
> >> +        }
> >> +
> >> +        first_ppc_cpu->env.gpr[5] =
> >> +            spapr_do_of_client_claim(spapr, 0, sizeof(rtas_client_blob.client),
> >> +                                     sizeof(rtas_client_blob.client));
> >> +        if (first_ppc_cpu->env.gpr[5] == -1) {
> >> +            error_report("Memory allocation for OF client failed");
> >> +            exit(1);
> >> +        }
> >> +        cpu_physical_memory_write(first_ppc_cpu->env.gpr[5],
> >> +                                  rtas_client_blob.client,
> >> +                                  sizeof(rtas_client_blob.client));
> >> +    } else {
> >> +        spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
> >> +        first_ppc_cpu->env.gpr[5] = 0; /* 0 = kexec !0 = prom_init */
> >> +    }
> >> +
> >>      fdt = spapr_build_fdt(spapr, true, FDT_MAX_SIZE);
> >>  
> >> -    rc = fdt_pack(fdt);
> >> -
> >> -    /* Should only fail if we've built a corrupted tree */
> >> -    assert(rc == 0);
> >> -
> >> -    /* Load the fdt */
> >> -    qemu_fdt_dumpdtb(fdt, fdt_totalsize(fdt));
> >> -    cpu_physical_memory_write(fdt_addr, fdt, fdt_totalsize(fdt));
> >>      g_free(spapr->fdt_blob);
> >>      spapr->fdt_size = fdt_totalsize(fdt);
> >>      spapr->fdt_initial_size = spapr->fdt_size;
> >>      spapr->fdt_blob = fdt;
> >>  
> >> -    /* Set up the entry state */
> >> -    spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
> >> -    first_ppc_cpu->env.gpr[5] = 0;
> >> +    if (spapr->bios_enabled) {
> >> +        /* Load the fdt */
> >> +        rc = fdt_pack(spapr->fdt_blob);
> >> +        /* Should only fail if we've built a corrupted tree */
> >> +        assert(rc == 0);
> >> +
> >> +        spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
> >> +        spapr->fdt_initial_size = spapr->fdt_size;
> >> +        qemu_fdt_dumpdtb(spapr->fdt_blob, spapr->fdt_size);
> > 
> > I think we should still have a dumpdtb call on the !bios path.
> > 
> >> +        cpu_physical_memory_write(fdt_addr, spapr->fdt_blob, spapr->fdt_size);
> >> +    } else {
> >> +        char *stdout_path = spapr_vio_stdout_path(spapr->vio_bus);
> >> +        int offset = fdt_path_offset(fdt, "/chosen");
> >> +
> >> +        /*
> >> +         * SLOF-less setup requires an open instance of stdout for early
> >> +         * kernel printk. By now all phandles are settled so we can open
> >> +         * the default serial console.
> >> +         * We skip writing FDT as nothing expects it; OF client interface is
> >> +         * going to be used for reading the device tree.
> >> +         */
> >> +        if (stdout_path) {
> >> +            _FDT(fdt_setprop_cell(fdt, offset, "stdout",
> >> +                                  spapr_of_client_open(spapr, stdout_path)));
> >> +        }
> >> +    }
> >>  
> >>      spapr->cas_reboot = false;
> >>  }
> >> @@ -2897,12 +3061,12 @@ static void spapr_machine_init(MachineState *machine)
> >>          uint64_t lowaddr = 0;
> >>  
> >>          spapr->kernel_size = load_elf(kernel_filename, NULL,
> >> -                                      translate_kernel_address, NULL,
> >> +                                      translate_kernel_address, spapr,
> >>                                        NULL, &lowaddr, NULL, 1,
> >>                                        PPC_ELF_MACHINE, 0, 0);
> >>          if (spapr->kernel_size == ELF_LOAD_WRONG_ENDIAN) {
> >>              spapr->kernel_size = load_elf(kernel_filename, NULL,
> >> -                                          translate_kernel_address, NULL, NULL,
> >> +                                          translate_kernel_address, spapr, NULL,
> >>                                            &lowaddr, NULL, 0, PPC_ELF_MACHINE,
> >>                                            0, 0);
> >>              spapr->kernel_le = spapr->kernel_size > 0;
> >> @@ -2918,7 +3082,7 @@ static void spapr_machine_init(MachineState *machine)
> >>              /* Try to locate the initrd in the gap between the kernel
> >>               * and the firmware. Add a bit of space just in case
> >>               */
> >> -            spapr->initrd_base = (KERNEL_LOAD_ADDR + spapr->kernel_size
> >> +            spapr->initrd_base = (spapr->kernel_addr + spapr->kernel_size
> >>                                    + 0x1ffff) & ~0xffff;
> >>              spapr->initrd_size = load_image_targphys(initrd_filename,
> >>                                                       spapr->initrd_base,
> >> @@ -2932,20 +3096,22 @@ static void spapr_machine_init(MachineState *machine)
> >>          }
> >>      }
> >>  
> >> -    if (bios_name == NULL) {
> >> -        bios_name = FW_FILE_NAME;
> >> +    if (spapr->bios_enabled) {
> >> +        if (bios_name == NULL) {
> >> +            bios_name = FW_FILE_NAME;
> >> +        }
> >> +        filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> >> +        if (!filename) {
> >> +            error_report("Could not find LPAR firmware '%s'", bios_name);
> >> +            exit(1);
> >> +        }
> >> +        fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> >> +        if (fw_size <= 0) {
> >> +            error_report("Could not load LPAR firmware '%s'", filename);
> >> +            exit(1);
> >> +        }
> >> +        g_free(filename);
> >>      }
> >> -    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
> >> -    if (!filename) {
> >> -        error_report("Could not find LPAR firmware '%s'", bios_name);
> >> -        exit(1);
> >> -    }
> >> -    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
> >> -    if (fw_size <= 0) {
> >> -        error_report("Could not load LPAR firmware '%s'", filename);
> >> -        exit(1);
> >> -    }
> >> -    g_free(filename);
> >>  
> >>      /* FIXME: Should register things through the MachineState's qdev
> >>       * interface, this is a legacy from the sPAPREnvironment structure
> >> @@ -3162,6 +3328,32 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
> >>      visit_type_uint32(v, name, (uint32_t *)opaque, errp);
> >>  }
> >>  
> >> +static void spapr_get_kernel_addr(Object *obj, Visitor *v, const char *name,
> >> +                                  void *opaque, Error **errp)
> >> +{
> >> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
> >> +}
> >> +
> >> +static void spapr_set_kernel_addr(Object *obj, Visitor *v, const char *name,
> >> +                                  void *opaque, Error **errp)
> >> +{
> >> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
> >> +}
> >> +
> >> +static bool spapr_get_bios_enabled(Object *obj, Error **errp)
> >> +{
> >> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> >> +
> >> +    return spapr->bios_enabled;
> >> +}
> >> +
> >> +static void spapr_set_bios_enabled(Object *obj, bool value, Error **errp)
> >> +{
> >> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> >> +
> >> +    spapr->bios_enabled = value;
> >> +}
> >> +
> >>  static char *spapr_get_ic_mode(Object *obj, Error **errp)
> >>  {
> >>      SpaprMachineState *spapr = SPAPR_MACHINE(obj);
> >> @@ -3267,6 +3459,20 @@ static void spapr_instance_init(Object *obj)
> >>      object_property_add_bool(obj, "vfio-no-msix-emulation",
> >>                               spapr_get_msix_emulation, NULL, NULL);
> >>  
> >> +    object_property_add(obj, "kernel-addr", "uint64", spapr_get_kernel_addr,
> >> +                        spapr_set_kernel_addr, NULL, &spapr->kernel_addr,
> >> +                        &error_abort);
> >> +    object_property_set_description(obj, "kernel-addr",
> >> +                                    stringify(KERNEL_LOAD_ADDR)
> >> +                                    " for -kernel is the default",
> >> +                                    NULL);
> >> +    spapr->kernel_addr = KERNEL_LOAD_ADDR;
> >> +    object_property_add_bool(obj, "bios", spapr_get_bios_enabled,
> >> +                            spapr_set_bios_enabled, NULL);
> >> +    object_property_set_description(obj, "bios", "Controls whether to load bios",
> >> +                                    NULL);
> >> +    spapr->bios_enabled = true;
> >> +
> >>      /* The machine class defines the default interrupt controller mode */
> >>      spapr->irq = smc->irq;
> >>      object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
> >> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> >> index f1799b1b707d..f2d8823d2c3a 100644
> >> --- a/hw/ppc/spapr_hcall.c
> >> +++ b/hw/ppc/spapr_hcall.c
> >> @@ -1660,15 +1660,11 @@ static bool spapr_hotplugged_dev_before_cas(void)
> >>      return false;
> >>  }
> >>  
> >> -static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> >> -                                                  SpaprMachineState *spapr,
> >> -                                                  target_ulong opcode,
> >> -                                                  target_ulong *args)
> >> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
> >> +                                            SpaprMachineState *spapr,
> >> +                                            target_ulong addr,
> >> +                                            target_ulong fdt_bufsize)
> >>  {
> >> -    /* Working address in data buffer */
> >> -    target_ulong addr = ppc64_phys_to_real(args[0]);
> >> -    target_ulong fdt_buf = args[1];
> >> -    target_ulong fdt_bufsize = args[2];
> >>      target_ulong ov_table;
> >>      uint32_t cas_pvr;
> >>      SpaprOptionVector *ov1_guest, *ov5_guest, *ov5_cas_old;
> >> @@ -1816,7 +1812,6 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> >>  
> >>      if (!spapr->cas_reboot) {
> >>          void *fdt;
> >> -        SpaprDeviceTreeUpdateHeader hdr = { .version_id = 1 };
> >>  
> >>          /* If spapr_machine_reset() did not set up a HPT but one is necessary
> >>           * (because the guest isn't going to use radix) then set it up here. */
> >> @@ -1825,21 +1820,7 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> >>              spapr_setup_hpt_and_vrma(spapr);
> >>          }
> >>  
> >> -        if (fdt_bufsize < sizeof(hdr)) {
> >> -            error_report("SLOF provided insufficient CAS buffer "
> >> -                         TARGET_FMT_lu " (min: %zu)", fdt_bufsize, sizeof(hdr));
> >> -            exit(EXIT_FAILURE);
> >> -        }
> >> -
> >> -        fdt_bufsize -= sizeof(hdr);
> >> -
> >> -        fdt = spapr_build_fdt(spapr, false, fdt_bufsize);
> >> -        _FDT((fdt_pack(fdt)));
> >> -
> >> -        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
> >> -        cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
> >> -                                  fdt_totalsize(fdt));
> >> -        trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
> >> +        fdt = spapr_build_fdt(spapr, !spapr->bios_enabled, fdt_bufsize);
> >>  
> >>          g_free(spapr->fdt_blob);
> >>          spapr->fdt_size = fdt_totalsize(fdt);
> >> @@ -1854,6 +1835,41 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> >>      return H_SUCCESS;
> >>  }
> >>  
> >> +static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> >> +                                                  SpaprMachineState *spapr,
> >> +                                                  target_ulong opcode,
> >> +                                                  target_ulong *args)
> >> +{
> >> +    /* Working address in data buffer */
> >> +    target_ulong addr = ppc64_phys_to_real(args[0]);
> >> +    target_ulong fdt_buf = args[1];
> >> +    target_ulong fdt_bufsize = args[2];
> >> +    target_ulong ret;
> >> +    SpaprDeviceTreeUpdateHeader hdr = { .version_id = 1 };
> >> +
> >> +    if (fdt_bufsize < sizeof(hdr)) {
> >> +        error_report("SLOF provided insufficient CAS buffer "
> >> +                     TARGET_FMT_lu " (min: %zu)", fdt_bufsize, sizeof(hdr));
> >> +        exit(EXIT_FAILURE);
> >> +    }
> >> +
> >> +    fdt_bufsize -= sizeof(hdr);
> >> +
> >> +    ret = do_client_architecture_support(cpu, spapr, addr, fdt_bufsize);
> >> +    if (ret == H_SUCCESS) {
> >> +        _FDT((fdt_pack(spapr->fdt_blob)));
> >> +        spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
> >> +        spapr->fdt_initial_size = spapr->fdt_size;
> >> +
> >> +        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
> >> +        cpu_physical_memory_write(fdt_buf + sizeof(hdr), spapr->fdt_blob,
> >> +                                  spapr->fdt_size);
> >> +        trace_spapr_cas_continue(spapr->fdt_size + sizeof(hdr));
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >>  static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
> >>                                                SpaprMachineState *spapr,
> >>                                                target_ulong opcode,
> >> @@ -1998,6 +2014,14 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
> >>      return H_SUCCESS;
> >>  }
> >>  
> >> +static target_ulong h_client(PowerPCCPU *cpu, SpaprMachineState *spapr,
> >> +                             target_ulong opcode, target_ulong *args)
> > 
> > As I said in an earlier revision, please expand these names beyond
> > just "client", for readability by people who aren't already thinking
> > about open firmware.
> 
> Yeah, I missed this one.
> 
> 
> > 
> >> +{
> >> +    target_ulong client_args = ppc64_phys_to_real(args[0]);
> >> +
> >> +    return spapr_h_client(spapr, client_args);
> >> +}
> >> +
> >>  static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
> >>  static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - KVMPPC_HCALL_BASE + 1];
> >>  static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 4 + 1];
> >> @@ -2121,6 +2145,8 @@ static void hypercall_register_types(void)
> >>  
> >>      spapr_register_hypercall(KVMPPC_H_UPDATE_DT, h_update_dt);
> >>  
> >> +    spapr_register_hypercall(KVMPPC_H_CLIENT, h_client);
> >> +
> >>      /* Virtual Processor Home Node */
> >>      spapr_register_hypercall(H_HOME_NODE_ASSOCIATIVITY,
> >>                               h_home_node_associativity);
> >> diff --git a/hw/ppc/spapr_of_client.c b/hw/ppc/spapr_of_client.c
> >> new file mode 100644
> >> index 000000000000..24d854b76e51
> >> --- /dev/null
> >> +++ b/hw/ppc/spapr_of_client.c
> > 
> > I'd suggest expanding this file to cover as much as you can of the
> > virtual OF stuff, not just the client interface.
> 
> This is done in v6.
> 
> 
> > 
> >> @@ -0,0 +1,633 @@
> >> +#include "qemu/osdep.h"
> >> +#include "qemu-common.h"
> >> +#include "qapi/error.h"
> >> +#include "exec/memory.h"
> >> +#include "hw/ppc/spapr.h"
> >> +#include "hw/ppc/spapr_vio.h"
> >> +#include "chardev/char.h"
> >> +#include "qom/qom-qobject.h"
> >> +#include "trace.h"
> >> +
> >> +typedef struct {
> >> +    DeviceState *dev;
> >> +    Chardev *cdev;
> >> +    uint32_t phandle;
> >> +} SpaprOfInstance;
> >> +
> >> +/*
> >> + * OF 1275 "nextprop" description suggests it is 32 bytes max but
> >> + * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 chars long.
> >> + */
> >> +#define OF_PROPNAME_LEN_MAX 64
> >> +
> >> +/* Defined as Big Endian */
> >> +struct prom_args {
> >> +    uint32_t service;
> >> +    uint32_t nargs;
> >> +    uint32_t nret;
> >> +    uint32_t args[10];
> >> +};
> >> +
> >> +static void readstr(hwaddr pa, char *buf, int size)
> >> +{
> >> +    cpu_physical_memory_read(pa, buf, size - 1);
> >> +    buf[size - 1] = 0;
> >> +}
> > 
> > I'd still like to see this return some kind of error if it had to
> > truncate what was passed by the client.
> 
> 
> But truncating will produce an error anyway - libfdt won't find stuff,
> etc.

Probably, but I think the error will be much more comprehensible if we
catch it here.
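
Something along these lines is what I mean, as a rough standalone
sketch. FakeGuestMem and readstr_checked() are made-up names standing
in for guest physical memory and the helper; the real thing would go
through cpu_physical_memory_read() and probably report the failure via
a trace point or an error return to the client.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for guest physical memory */
typedef struct {
    const uint8_t *bytes;
    size_t len;
} FakeGuestMem;

/*
 * Copy a NUL-terminated string from guest memory at 'pa' into 'buf'.
 * Returns true only if the terminator was found within 'size' bytes.
 * Otherwise 'buf' holds a truncated but still terminated copy and the
 * function returns false so the caller can fail the request early.
 */
static bool readstr_checked(const FakeGuestMem *mem, size_t pa,
                            char *buf, size_t size)
{
    assert(size > 0);
    for (size_t i = 0; i < size; i++) {
        if (pa + i >= mem->len) {
            buf[i] = '\0';
            return false; /* ran off the end of memory */
        }
        buf[i] = (char)mem->bytes[pa + i];
        if (buf[i] == '\0') {
            return true;
        }
    }
    buf[size - 1] = '\0';
    return false; /* string did not fit in the buffer */
}
```
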

> >> +
> >> +static bool _cmpservice(const char *s, size_t len,
> > 
> > Don't use leading _ please - in userland those are reserved for the
> > system libraries.
> > 
> >> +                        unsigned nargs, unsigned nret,
> >> +                        const char *s1, size_t len1,
> >> +                        unsigned nargscheck, unsigned nretcheck)
> >> +{
> >> +    if (strcmp(s, s1)) {
> >> +        return false;
> >> +    }
> >> +    if (nargscheck == 0 && nretcheck == 0) {
> >> +        return true;
> >> +    }
> >> +    if (nargs != nargscheck || nret != nretcheck) {
> >> +        trace_spapr_of_client_error_param(s, nargscheck, nretcheck, nargs,
> >> +                                          nret);
> >> +        return false;
> >> +    }
> >> +
> >> +    return true;
> >> +}
> >> +
> >> +static uint32_t of_client_finddevice(const void *fdt, uint32_t nodeaddr)
> >> +{
> >> +    char node[256];
> > 
> > Is 256 enough?  OF paths can get pretty long...
> 
> 
> Hard to imagine that 255 is not enough though. Long parts of the path
> would be scsi drive id, PHB but in between we can only have a bunch of
> PCI bridges and these are not so long.

Hm, ok.  I had a look on a Boston and the longest path I see there is
75 characters; I had thought it might be a lot more.

> What do you think is an appropriate limit?
> 
> 
> > 
> >> +    int ret;
> >> +
> >> +    readstr(nodeaddr, node, sizeof(node));
> >> +    ret = fdt_path_offset(fdt, node);
> >> +    if (ret >= 0) {
> >> +        ret = fdt_get_phandle(fdt, ret);
> >> +    }
> >> +
> >> +    return (uint32_t) ret;
> >> +}
> >> +
> >> +static uint32_t of_client_getprop(const void *fdt, uint32_t nodeph,
> >> +                                  uint32_t pname, uint32_t valaddr,
> >> +                                  uint32_t vallen)
> >> +{
> >> +    char propname[OF_PROPNAME_LEN_MAX + 1];
> >> +    uint32_t ret = 0;
> >> +    int proplen = 0;
> >> +    const void *prop;
> >> +
> >> +    readstr(pname, propname, sizeof(propname));
> >> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
> >> +                               propname, strlen(propname), &proplen);
> > 
> > Again, you don't need _namelen.
> > 
> >> +    if (prop) {
> >> +        int cb = MIN(proplen, vallen);
> >> +
> >> +        cpu_physical_memory_write(valaddr, prop, cb);
> >> +        ret = cb;
> > 
> > If I'm reading 1275 correctly, the return value should be the
> > untruncated length of the property.
> 
> 
> "Size is either the actual size of the property". I'd think the actual
> size is what we actually copied to the buffer but @proplen is probably
> what they meant, I'll change to that and see what breaks.
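
To spell out the reading I have in mind (and I may be misreading 1275
here): copy at most the buffer length, but return the full property
length so the client can tell its buffer was too small. A minimal
sketch, with getprop_sketch() as a hypothetical name and plain host
pointers in place of guest addresses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy at most 'buflen' bytes of the property into 'buf', but report
 * the untruncated property length. Returns (uint32_t)-1 if there is
 * no such property.
 */
static uint32_t getprop_sketch(const void *prop, int proplen,
                               void *buf, uint32_t buflen)
{
    size_t cb;

    if (!prop || proplen < 0) {
        return (uint32_t)-1;
    }
    cb = (uint32_t)proplen < buflen ? (size_t)proplen : (size_t)buflen;
    memcpy(buf, prop, cb);
    return (uint32_t)proplen;
}
```
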
> 
> 
> 
> >> +    } else {
> >> +        ret = -1;
> >> +    }
> >> +    trace_spapr_of_client_getprop(nodeph, propname, ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t of_client_getproplen(const void *fdt, uint32_t nodeph,
> >> +                                     uint32_t pname)
> >> +{
> >> +    char propname[OF_PROPNAME_LEN_MAX + 1];
> >> +    uint32_t ret = 0;
> >> +    int proplen = 0;
> >> +    const void *prop;
> >> +
> >> +    readstr(pname, propname, sizeof(propname));
> >> +
> >> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
> >> +                               propname, strlen(propname), &proplen);
> > 
> > No _namelen.
> > 
> >> +    if (prop) {
> >> +        ret = proplen;
> >> +    } else {
> >> +        ret = -1;
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t of_client_setprop(SpaprMachineState *spapr,
> >> +                                  uint32_t nodeph, uint32_t pname,
> >> +                                  uint32_t valaddr, uint32_t vallen)
> >> +{
> >> +    char propname[OF_PROPNAME_LEN_MAX + 1];
> >> +    uint32_t ret = -1;
> >> +    int offset;
> > 
> > A comment noting that you're only allowing a very restricted set of
> > setprops would be good.
> 
> 
> Is not that quite clear from the code itself? Okay...

Well, kinda.  The rationale for it would be useful here though.

> >> +    readstr(pname, propname, sizeof(propname));
> >> +    if (vallen == sizeof(uint32_t)) {
> >> +        uint32_t val32 = ldl_be_phys(first_cpu->as, valaddr);
> >> +
> >> +        if ((strcmp(propname, "linux,rtas-base") == 0) ||
> >> +            (strcmp(propname, "linux,rtas-entry") == 0)) {
> >> +            spapr->rtas_base = val32;
> >> +        } else if (strcmp(propname, "linux,initrd-start") == 0) {
> >> +            spapr->initrd_base = val32;
> >> +        } else if (strcmp(propname, "linux,initrd-end") == 0) {
> >> +            spapr->initrd_size = val32 - spapr->initrd_base;
> >> +        } else {
> >> +            goto trace_exit;
> >> +        }
> >> +    } else if (vallen == sizeof(uint64_t)) {
> >> +        uint64_t val64 = ldq_be_phys(first_cpu->as, valaddr);
> >> +
> >> +        if (strcmp(propname, "linux,initrd-start") == 0) {
> >> +            spapr->initrd_base = val64;
> >> +        } else if (strcmp(propname, "linux,initrd-end") == 0) {
> >> +            spapr->initrd_size = val64 - spapr->initrd_base;
> >> +        } else {
> >> +            goto trace_exit;
> >> +        }
> >> +    } else {
> >> +        goto trace_exit;
> >> +    }
> >> +
> >> +    offset = fdt_node_offset_by_phandle(spapr->fdt_blob, nodeph);
> >> +    if (offset >= 0) {
> >> +        uint8_t data[vallen];
> >> +
> >> +        cpu_physical_memory_read(valaddr, data, vallen);
> >> +        if (!fdt_setprop(spapr->fdt_blob, offset, propname, data, vallen)) {
> >> +            ret = vallen;
> >> +        }
> >> +    }
> >> +
> >> +trace_exit:
> >> +    trace_spapr_of_client_setprop(nodeph, propname, ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t of_client_nextprop(const void *fdt, uint32_t phandle,
> >> +                                   uint32_t prevaddr, uint32_t nameaddr)
> >> +{
> >> +    int offset = fdt_node_offset_by_phandle(fdt, phandle);
> >> +    char prev[OF_PROPNAME_LEN_MAX + 1];
> >> +    const char *tmp;
> >> +
> >> +    readstr(prevaddr, prev, sizeof(prev));
> >> +    for (offset = fdt_first_property_offset(fdt, offset);
> >> +         offset >= 0;
> >> +         offset = fdt_next_property_offset(fdt, offset)) {
> >> +
> >> +        if (!fdt_getprop_by_offset(fdt, offset, &tmp, NULL)) {
> >> +            return 0;
> >> +        }
> >> +        if (prev[0] == '\0' || strcmp(prev, tmp) == 0) {
> >> +            if (prev[0] != '\0') {
> >> +                offset = fdt_next_property_offset(fdt, offset);
> >> +                if (offset < 0) {
> >> +                    return 0;
> >> +                }
> >> +            }
> >> +            if (!fdt_getprop_by_offset(fdt, offset, &tmp, NULL)) {
> >> +                return 0;
> >> +            }
> >> +            cpu_physical_memory_write(nameaddr, tmp, strlen(tmp) + 1);
> >> +            return 1;
> >> +        }
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static uint32_t of_client_peer(const void *fdt, uint32_t phandle)
> >> +{
> >> +    int ret;
> >> +
> >> +    if (phandle == 0) {
> >> +        ret = fdt_path_offset(fdt, "/");
> >> +    } else {
> >> +        ret = fdt_next_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> >> +    }
> >> +
> >> +    if (ret < 0) {
> >> +        ret = 0;
> >> +    } else {
> >> +        ret = fdt_get_phandle(fdt, ret);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t of_client_child(const void *fdt, uint32_t phandle)
> >> +{
> >> +    int ret = fdt_first_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> >> +
> >> +    if (ret < 0) {
> >> +        ret = 0;
> >> +    } else {
> >> +        ret = fdt_get_phandle(fdt, ret);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t of_client_parent(const void *fdt, uint32_t phandle)
> >> +{
> >> +    int ret = fdt_parent_offset(fdt, fdt_node_offset_by_phandle(fdt, phandle));
> >> +
> >> +    if (ret < 0) {
> >> +        ret = 0;
> >> +    } else {
> >> +        ret = fdt_get_phandle(fdt, ret);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static DeviceState *of_client_find_qom_dev(BusState *bus, const char *path)
> >> +{
> >> +    BusChild *kid;
> >> +
> >> +    QTAILQ_FOREACH(kid, &bus->children, sibling) {
> >> +        const char *p = qdev_get_fw_dev_path(kid->child);
> >> +        BusState *child;
> >> +
> >> +        if (p && strcmp(path, p) == 0) {
> >> +            return kid->child;
> >> +        }
> >> +        QLIST_FOREACH(child, &kid->child->child_bus, sibling) {
> >> +            DeviceState *d = of_client_find_qom_dev(child, path);
> >> +
> >> +            if (d) {
> >> +                return d;
> >> +            }
> >> +        }
> >> +    }
> >> +    return NULL;
> >> +}
> >> +
> >> +uint32_t spapr_of_client_open(SpaprMachineState *spapr, const char *path)
> >> +{
> >> +    int offset;
> >> +    uint32_t ret = 0;
> >> +    SpaprOfInstance *inst;
> >> +
> >> +    if (spapr->of_instance_last == 0xFFFFFFFF) {
> >> +        /* We do not recycle ihandles yet */
> >> +        goto trace_exit;
> >> +    }
> >> +    offset = fdt_path_offset(spapr->fdt_blob, path);
> >> +    if (offset < 0) {
> >> +        trace_spapr_of_client_error_unknown_path(path);
> >> +        goto trace_exit;
> >> +    }
> >> +
> >> +    inst = g_new(SpaprOfInstance, 1);
> >> +    inst->phandle = fdt_get_phandle(spapr->fdt_blob, offset);
> >> +    g_assert(inst->phandle);
> >> +    ++spapr->of_instance_last;
> >> +    inst->dev = of_client_find_qom_dev(sysbus_get_default(), path);
> >> +    g_hash_table_insert(spapr->of_instances,
> >> +                        GINT_TO_POINTER(spapr->of_instance_last),
> >> +                        inst);
> >> +    ret = spapr->of_instance_last;
> >> +
> >> +    if (inst->dev) {
> >> +        const char *cdevstr = object_property_get_str(OBJECT(inst->dev),
> >> +                                                      "chardev", NULL);
> >> +
> >> +        if (cdevstr) {
> >> +            inst->cdev = qemu_chr_find(cdevstr);
> >> +        }
> >> +    }
> >> +
> >> +trace_exit:
> >> +    trace_spapr_of_client_open(path, inst ? inst->phandle : 0, ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t of_client_open(SpaprMachineState *spapr, uint32_t pathaddr)
> >> +{
> >> +    char path[256];
> >> +
> >> +    readstr(pathaddr, path, sizeof(path));
> >> +
> >> +    return spapr_of_client_open(spapr, path);
> >> +}
> >> +
> >> +static void of_client_close(SpaprMachineState *spapr, uint32_t ihandle)
> >> +{
> >> +    if (!g_hash_table_remove(spapr->of_instances, GINT_TO_POINTER(ihandle))) {
> >> +        trace_spapr_of_client_error_unknown_ihandle_close(ihandle);
> >> +    }
> >> +}
> >> +
> >> +static uint32_t of_client_instance_to_package(SpaprMachineState *spapr,
> >> +                                              uint32_t ihandle)
> >> +{
> >> +    gpointer instp = g_hash_table_lookup(spapr->of_instances,
> >> +                                        GINT_TO_POINTER(ihandle));
> >> +
> >> +    if (!instp) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    return ((SpaprOfInstance *)instp)->phandle;
> >> +}
> >> +
> >> +static uint32_t of_client_package_to_path(const void *fdt, uint32_t phandle,
> >> +                                          uint32_t buf, uint32_t len)
> >> +{
> >> +    char tmp[256];
> >> +
> >> +    if (0 == fdt_get_path(fdt, fdt_node_offset_by_phandle(fdt, phandle), tmp,
> >> +                          sizeof(tmp))) {
> >> +        tmp[sizeof(tmp) - 1] = 0;
> >> +        cpu_physical_memory_write(buf, tmp, MIN(len, strlen(tmp)));
> >> +    }
> >> +    return len;
> >> +}
> >> +
> >> +static uint32_t of_client_instance_to_path(SpaprMachineState *spapr,
> >> +                                           uint32_t ihandle, uint32_t buf,
> >> +                                           uint32_t len)
> >> +{
> >> +    uint32_t phandle = of_client_instance_to_package(spapr, ihandle);
> >> +
> >> +    if (phandle != -1) {
> >> +        return of_client_package_to_path(spapr->fdt_blob, phandle, buf, len);
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static uint32_t of_client_write(SpaprMachineState *spapr, uint32_t ihandle,
> >> +                                uint32_t buf, uint32_t len)
> >> +{
> >> +    char tmp[256];
> >> +    int toread, toprint, cb = MIN(len, 1024);
> >> +    SpaprOfInstance *inst = (SpaprOfInstance *)
> >> +        g_hash_table_lookup(spapr->of_instances, GINT_TO_POINTER(ihandle));
> >> +
> >> +    while (cb > 0) {
> >> +        toread = MIN(cb + 1, sizeof(tmp));
> >> +        readstr(buf, tmp, toread);
> >> +        toprint = strlen(tmp);
> >> +        if (inst && inst->cdev) {
> >> +            toprint = qemu_chr_write(inst->cdev, (uint8_t *) tmp, toprint,
> >> +                                     true);
> >> +        } else {
> >> +            /* We normally open stdout so this is fallback */
> >> +            printf("DBG[%d]%s", ihandle, tmp);
> >> +        }
> >> +        buf += toprint;
> >> +        cb -= toprint;
> >> +    }
> >> +
> >> +    return len;
> >> +}
> >> +
> >> +static bool of_client_claim_avail(GArray *claimed, uint64_t virt, uint64_t size)
> >> +{
> >> +    int i;
> >> +    SpaprOfClaimed *c;
> >> +
> >> +    for (i = 0; i < claimed->len; ++i) {
> >> +        c = &g_array_index(claimed, SpaprOfClaimed, i);
> >> +        if ((c->start <= virt && virt < c->start + c->size) ||
> >> +            (virt <= c->start && c->start < virt + size)) {
> >> +            return false;
> >> +        }
> >> +    }
> >> +
> >> +    return true;
> >> +}
> >> +
> >> +static void of_client_claim_add(GArray *claimed, uint64_t virt, uint64_t size)
> >> +{
> >> +    SpaprOfClaimed newclaim;
> >> +
> >> +    newclaim.start = virt;
> >> +    newclaim.size = size;
> >> +    g_array_append_val(claimed, newclaim);
> >> +}
> >> +
> >> +/*
> >> + * "claim" claims memory at @virt if @align==0; otherwise it allocates
> >> + * memory at the requested alignment.
> >> + */
> >> +uint64_t spapr_do_of_client_claim(SpaprMachineState *spapr, uint64_t virt,
> >> +                                  uint64_t size, uint64_t align)
> >> +{
> >> +    uint32_t ret;
> >> +
> >> +    if (align == 0) {
> >> +        if (!of_client_claim_avail(spapr->claimed, virt, size)) {
> >> +            return -1;
> >> +        }
> >> +        ret = virt;
> >> +    } else {
> >> +        align = pow2ceil(align);
> > 
> > Should this be a pow2ceil, or should it just return an error if align
> > is not a power of 2?  Note that aligning something to 4 bytes will
> > (probably) make it *not* aligned to 3 bytes.
> 
> I did not see any notes about the specific alignment requirements here,
> the idea is that clients may just not expect unaligned memory at all; I
> could probably just drop it and see what happens...

I don't follow you.  Isn't the align value coming from the client?
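
To make the "return an error" option concrete, here is a minimal sketch. The
helper names (is_pow2, align_up, claim_align) are hypothetical and nothing
here is QEMU API; it only shows rejecting a client-supplied alignment that is
not a power of two instead of rounding it up.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if @x is a non-zero power of two. */
static bool is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Round @addr up to @align; only valid when @align is a power of two. */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: stores the aligned address in *out and returns 0,
 * or returns -1 if the client passed an alignment we cannot honour.
 */
static int claim_align(uint64_t base, uint64_t align, uint64_t *out)
{
    if (!is_pow2(align)) {
        return -1;
    }
    *out = align_up(base, align);
    return 0;
}
```

With this, a request for align=3 fails outright rather than silently becoming
align=4.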

> >> +        spapr->claimed_base = (spapr->claimed_base + align - 1) & ~(align - 1);
> >> +        while (1) {
> >> +            if (spapr->claimed_base >= spapr->rma_size) {
> >> +                perror("Out of memory");
> > 
> > error_report() or qemu_log() or something and a message with some more
> > specificity, please.
> 
> 
> What kind of specificity is missing here?

That it's on the OF claim interface specifically, and how much they
were trying to claim.
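
Something along these lines, as a sketch only. In QEMU the string would go
through error_report(); format_claim_error() is a made-up stand-alone helper
used here so the message format can be shown without QEMU headers, and the
exact wording is just a suggestion.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: builds a message that names the OF claim interface
 * and the request that could not be satisfied.
 */
static int format_claim_error(char *buf, size_t buflen, uint64_t size,
                              uint64_t align, uint64_t rma_size)
{
    return snprintf(buf, buflen,
                    "OF client claim: out of memory: size=0x%" PRIx64
                    " align=0x%" PRIx64 " rma_size=0x%" PRIx64,
                    size, align, rma_size);
}
```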

> >> +                return -1;
> >> +            }
> >> +            if (of_client_claim_avail(spapr->claimed, spapr->claimed_base,
> >> +                                      size)) {
> >> +                break;
> >> +            }
> >> +            spapr->claimed_base += size;
> >> +        }
> >> +        ret = spapr->claimed_base;
> >> +    }
> >> +
> >> +    spapr->claimed_base = MAX(spapr->claimed_base, ret + size);
> >> +    of_client_claim_add(spapr->claimed, virt, size);
> >> +    trace_spapr_of_client_claim(virt, size, align, ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static uint32_t of_client_claim(SpaprMachineState *spapr, uint32_t virt,
> >> +                                uint32_t size, uint32_t align)
> >> +{
> >> +    if (align) {
> >> +        return -1;
> >> +    }
> >> +    if (!of_client_claim_avail(spapr->claimed, virt, size)) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    spapr->claimed_base = MAX(spapr->claimed_base, virt + size);
> >> +    of_client_claim_add(spapr->claimed, virt, size);
> >> +    trace_spapr_of_client_claim(virt, size, align, virt);
> > 
> > Huh.  So do_of_client_claim() is never used from of_client_claim(),
> > only from "internal" claimers.  It definitely needs a different name.
> 
> 
> It is used in v6. Linux always passes non-zero @virt but grub does not.
> 
> 
> >> +    return virt;
> >> +}
> >> +
> >> +static uint32_t of_client_call_method(SpaprMachineState *spapr,
> >> +                                      uint32_t methodaddr, uint32_t ihandle,
> >> +                                      uint32_t param, uint32_t *ret2)
> >> +{
> >> +    uint32_t ret = -1;
> >> +    char path[256] = "", method[256] = "";
> >> +    uint32_t phandle = of_client_instance_to_package(spapr, ihandle);
> >> +    int offset;
> >> +
> >> +    if (!ihandle) {
> >> +        goto trace_exit;
> >> +    }
> >> +
> >> +    readstr(methodaddr, method, sizeof(method));
> >> +    phandle = of_client_instance_to_package(spapr, ihandle);
> >> +    if (!phandle) {
> >> +        goto trace_exit;
> >> +    }
> >> +
> >> +    offset = fdt_node_offset_by_phandle(spapr->fdt_blob, phandle);
> >> +    if (offset < 0) {
> >> +        goto trace_exit;
> >> +    }
> >> +
> >> +    if (fdt_get_path(spapr->fdt_blob, offset, path, sizeof(path))) {
> >> +        goto trace_exit;
> >> +    }
> >> +
> >> +    if (strcmp(path, "/") == 0) {
> >> +        if (strcmp(method, "ibm,client-architecture-support") == 0) {
> >> +
> >> +#define FDT_MAX_SIZE            0x100000
> >> +            ret = do_client_architecture_support(POWERPC_CPU(first_cpu), spapr,
> >> +                                                 param, FDT_MAX_SIZE);
> >> +            *ret2 = 0;
> >> +        }
> >> +    } else if (strcmp(path, "/rtas") == 0) {
> >> +        if (strcmp(method, "instantiate-rtas") == 0) {
> >> +            spapr_instantiate_rtas(spapr, param);
> >> +            ret = 0;
> >> +            *ret2 = param; /* rtasbase */
> >> +        }
> >> +    } else {
> >> +        trace_spapr_of_client_error_unknown_method(method);
> >> +    }
> >> +
> >> +trace_exit:
> >> +    trace_spapr_of_client_method(ihandle, method, param, phandle, path, ret);
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static void of_client_quiesce(SpaprMachineState *spapr)
> >> +{
> >> +    int rc = fdt_pack(spapr->fdt_blob);
> >> +    /* Should only fail if we've built a corrupted tree */
> >> +    assert(rc == 0);
> >> +
> >> +    spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
> >> +    spapr->fdt_initial_size = spapr->fdt_size;
> >> +}
> >> +
> >> +int spapr_h_client(SpaprMachineState *spapr, target_ulong of_client_args)
> >> +{
> >> +    struct prom_args args = { 0 };
> >> +    char service[64];
> >> +    unsigned nargs, nret;
> >> +    int i, servicelen;
> >> +
> >> +    cpu_physical_memory_read(of_client_args, &args, sizeof(args));
> >> +    nargs = be32_to_cpu(args.nargs);
> >> +    nret = be32_to_cpu(args.nret);
> >> +    readstr(be32_to_cpu(args.service), service, sizeof(service));
> >> +    servicelen = strlen(service);
> >> +
> >> +#define cmpservice(s, a, r) \
> >> +    _cmpservice(service, servicelen, nargs, nret, (s), sizeof(s), (a), (r))
> >> +
> >> +    if (cmpservice("finddevice", 1, 1)) {
> >> +        args.args[nargs] = of_client_finddevice(spapr->fdt_blob,
> >> +                                                be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("getprop", 4, 1)) {
> >> +        args.args[nargs] = of_client_getprop(spapr->fdt_blob,
> >> +                                             be32_to_cpu(args.args[0]),
> >> +                                             be32_to_cpu(args.args[1]),
> >> +                                             be32_to_cpu(args.args[2]),
> >> +                                             be32_to_cpu(args.args[3]));
> >> +    } else if (cmpservice("getproplen", 2, 1)) {
> >> +        args.args[nargs] = of_client_getproplen(spapr->fdt_blob,
> >> +                                                be32_to_cpu(args.args[0]),
> >> +                                                be32_to_cpu(args.args[1]));
> >> +    } else if (cmpservice("setprop", 4, 1)) {
> >> +        args.args[nargs] = of_client_setprop(spapr,
> >> +                                             be32_to_cpu(args.args[0]),
> >> +                                             be32_to_cpu(args.args[1]),
> >> +                                             be32_to_cpu(args.args[2]),
> >> +                                             be32_to_cpu(args.args[3]));
> >> +    } else if (cmpservice("nextprop", 3, 1)) {
> >> +        args.args[nargs] = of_client_nextprop(spapr->fdt_blob,
> >> +                                              be32_to_cpu(args.args[0]),
> >> +                                              be32_to_cpu(args.args[1]),
> >> +                                              be32_to_cpu(args.args[2]));
> >> +    } else if (cmpservice("peer", 1, 1)) {
> >> +        args.args[nargs] = of_client_peer(spapr->fdt_blob,
> >> +                                          be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("child", 1, 1)) {
> >> +        args.args[nargs] = of_client_child(spapr->fdt_blob,
> >> +                                           be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("parent", 1, 1)) {
> >> +        args.args[nargs] = of_client_parent(spapr->fdt_blob,
> >> +                                            be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("open", 1, 1)) {
> >> +        args.args[nargs] = of_client_open(spapr, be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("close", 1, 0)) {
> >> +        of_client_close(spapr, be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("instance-to-package", 1, 1)) {
> >> +        args.args[nargs] =
> >> +            of_client_instance_to_package(spapr,
> >> +                                          be32_to_cpu(args.args[0]));
> >> +    } else if (cmpservice("package-to-path", 3, 1)) {
> >> +        args.args[nargs] = of_client_package_to_path(spapr->fdt_blob,
> >> +                                                     be32_to_cpu(args.args[0]),
> >> +                                                     be32_to_cpu(args.args[1]),
> >> +                                                     be32_to_cpu(args.args[2]));
> >> +    } else if (cmpservice("instance-to-path", 3, 1)) {
> >> +        args.args[nargs] =
> >> +            of_client_instance_to_path(spapr,
> >> +                                       be32_to_cpu(args.args[0]),
> >> +                                       be32_to_cpu(args.args[1]),
> >> +                                       be32_to_cpu(args.args[2]));
> >> +    } else if (cmpservice("write", 3, 1)) {
> >> +        args.args[nargs] = of_client_write(spapr,
> >> +                                           be32_to_cpu(args.args[0]),
> >> +                                           be32_to_cpu(args.args[1]),
> >> +                                           be32_to_cpu(args.args[2]));
> >> +    } else if (cmpservice("claim", 3, 1)) {
> >> +        args.args[nargs] = of_client_claim(spapr,
> >> +                                           be32_to_cpu(args.args[0]),
> >> +                                           be32_to_cpu(args.args[1]),
> >> +                                           be32_to_cpu(args.args[2]));
> >> +    } else if (cmpservice("call-method", 3, 2)) {
> >> +        args.args[nargs] = of_client_call_method(spapr,
> >> +                                                 be32_to_cpu(args.args[0]),
> >> +                                                 be32_to_cpu(args.args[1]),
> >> +                                                 be32_to_cpu(args.args[2]),
> >> +                                                 &args.args[nargs + 1]);
> >> +    } else if (cmpservice("quiesce", 0, 0)) {
> >> +        of_client_quiesce(spapr);
> >> +    } else if (cmpservice("exit", 0, 0)) {
> >> +        error_report("Stopped as the VM requested \"exit\"");
> >> +        vm_stop(RUN_STATE_PAUSED);
> >> +    } else {
> >> +        trace_spapr_of_client_error_unknown_service(service, nargs, nret);
> >> +        args.args[nargs] = -1;
> > 
> > You've never bounds checked nargs at this point.
> > 
> >> +    }
> >> +
> >> +    for (i = 0; i < nret; ++i) {
> > 
> > And likewise you might not have bounds checked nret.
> 
> Oh, true. Thanks,
> 
> 
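
A sketch of the kind of check meant here. OF_ARGS_MAX is an assumed size for
the args[] array (the definition of struct prom_args is not in this hunk) and
of_args_in_bounds() is a hypothetical name. The subtraction form avoids
overflow if the guest passes a huge nargs.

```c
#include <stdbool.h>
#include <stdint.h>

#define OF_ARGS_MAX 10 /* assumed number of slots in args[] */

/*
 * True if @nargs inputs followed by @nret outputs fit in args[].
 * Written as a subtraction so that nargs + nret cannot wrap around.
 */
static bool of_args_in_bounds(uint32_t nargs, uint32_t nret)
{
    return nargs <= OF_ARGS_MAX && nret <= OF_ARGS_MAX - nargs;
}
```

The check would need to run right after reading nargs/nret from guest memory,
before any args.args[nargs] access.
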
> > 
> >> +        args.args[nargs + i] = be32_to_cpu(args.args[nargs + i]);
> >> +    }
> >> +    cpu_physical_memory_write(of_client_args, &args, sizeof(args));
> >> +
> >> +    return H_SUCCESS;
> >> +}
> >> diff --git a/hw/ppc/trace-events b/hw/ppc/trace-events
> >> index 9ea620f23c85..e2d1e58d07c3 100644
> >> --- a/hw/ppc/trace-events
> >> +++ b/hw/ppc/trace-events
> >> @@ -21,6 +21,18 @@ spapr_update_dt(unsigned cb) "New blob %u bytes"
> >>  spapr_update_dt_failed_size(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
> >>  spapr_update_dt_failed_check(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
> >>  
> >> +# spapr_client.c
> >> +spapr_of_client_error_param(const char *method, int nargscheck, int nretcheck, int nargs, int nret) "%s takes/returns %d/%d, not %d/%d"
> >> +spapr_of_client_error_unknown_service(const char *service, int nargs, int nret) "%s args=%d rets=%d"
> >> +spapr_of_client_error_unknown_method(const char *method) "%s"
> >> +spapr_of_client_error_unknown_ihandle_close(uint32_t ihandle) "0x%x"
> >> +spapr_of_client_error_unknown_path(const char *path) "%s"
> >> +spapr_of_client_claim(uint32_t virt, uint32_t size, uint32_t align, uint32_t ret) "virt=0x%x size=0x%x align=0x%x => 0x%x"
> >> +spapr_of_client_method(uint32_t ihandle, const char *method, uint32_t param, uint32_t phandle, const char *path, uint32_t ret) "0x%x \"%s\" param=0x%x ph=0x%x \"%s\" => 0x%x"
> >> +spapr_of_client_getprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
> >> +spapr_of_client_setprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
> >> +spapr_of_client_open(const char *path, uint32_t phandle, uint32_t ihandle) "%s 0x%x => 0x%x"
> >> +
> >>  # spapr_hcall_tpm.c
> >>  spapr_h_tpm_comm(const char *device_path, uint64_t operation) "tpm_device_path=%s operation=0x%"PRIu64
> >>  spapr_tpm_execute(uint64_t data_in, uint64_t data_in_sz, uint64_t data_out, uint64_t data_out_sz) "data_in=0x%"PRIx64", data_in_sz=%"PRIu64", data_out=0x%"PRIx64", data_out_sz=%"PRIu64
> > 
>
Alexey Kardashevskiy Jan. 22, 2020, 7:14 a.m. UTC | #4
On 22/01/2020 17:32, David Gibson wrote:
> On Tue, Jan 21, 2020 at 06:25:36PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 21/01/2020 16:11, David Gibson wrote:
>>> On Fri, Jan 10, 2020 at 01:09:25PM +1100, Alexey Kardashevskiy wrote:
>>>> The Petitboot bootloader is way more advanced than SLOF is ever going to
>>>> be as Petitboot comes with the full-featured Linux kernel with all
>>>> the drivers, and initramdisk with quite user friendly interface.
>>>> The problem with ditching SLOF is that an unmodified pseries kernel can
>>>> either start via:
>>>> 1. kexec, this requires presence of RTAS and skips
>>>> ibm,client-architecture-support entirely;
>>>> 2. normal boot, this heavily relies on the OF1275 client interface to
>>>> fetch the device tree and do early setup (claim memory).
>>>>
>>>> This adds a new bios-less mode to the pseries machine: "bios=on|off".
>>>> When enabled, QEMU does not load SLOF and jumps to the kernel from
>>>> "-kernel".
>>>
>>> I don't love the name "bios" for this flag, since BIOS tends to refer
>>> to old-school x86 firmware.  Given the various plans we're considering
>>> for the future, I'd suggest "firmware=slof" for the current in-guest SLOF
>>> mode, and say "firmware=vof" (Virtual Open Firmware) for the new
>>> model.  We can consider firmware=petitboot or firmware=none (for
>>> direct kexec-style boot into -kernel) or whatever in the future.
>>
>> Ok. We could also enforce default loading addresses for SLOF/kernel/grub
>> and drop "kernel-addr", although it is going to be confusing if it
>> changes in a not so obvious way...
> 
> Yes, I think that would be confusing, so I think adding the
> kernel-addr override is a good idea, I'd just like it split out for
> clarity.
> 
>> In fact, I will ideally need 3 flags:
>> -bios: on|off to stop loading SLOF;
>> -kernel-addr: 0x0 for slof/kernel; 0x20000 for grub;
> 
> I'm happy for that one to be separate from the "firmware style"
> option.
> 
>> -kernel-translate-hack: on|off - as grub is linked to run from 0x20000
>> and it only works when placed there, the hack breaks it.
> 
> Hrm.  I don't really understand what this one is about.  That doesn't
> really seem like something the user would ever want to fiddle with
> directly.

This allows loading grub, or actually any ELF (not that I have anything
else in mind other than grub, but still) which is not capable of
relocating itself.
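
To illustrate the problem, a sketch that assumes the translation hack has the
shape "(addr & 0x0fffffff) + base". That formula, the function names and the
0x400000 sample base are assumptions for illustration and are not taken from
this patch; only the 0x20000 link address comes from the discussion above.

```c
#include <stdint.h>

/*
 * Assumed form of the hack: keep the low 28 bits of the ELF link address
 * and add the load base. This suits a relocatable kernel linked at a high
 * virtual address.
 */
static uint64_t translate_kernel_address(uint64_t link_addr,
                                         uint64_t load_base)
{
    return (link_addr & 0x0fffffffULL) + load_base;
}

/* With the hack off, the image is placed at its link address. */
static uint64_t identity_address(uint64_t link_addr)
{
    return link_addr;
}
```

Under that assumption, an image linked at 0x20000 and loaded with a base of
0x20000 ends up at 0x40000, which is not where it expects to run; with the
translation disabled it stays at 0x20000.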


>> Or we can pass grub via -bios and not via -kernel but strictly speaking
>> there is still a firmware - that new 20 bytes blob so it would not be
>> accurate.
>>
>> We can put this all into a single
>> -firmware slof|vof|grub|linux. Not sure.
> 
> I'm not thinking of "grub" as a separate option - that would be the
> same as "vof".  Using vof + no -kernel we'd need to scan the disks in
> the same way SLOF does, and look for a boot partition, which will
> probably contain a GRUB image. 

I was hoping we can avoid that by allowing
"-kernel grub" and let grub do filesystems and MBR/GPT.

> Then we'd need enough faked OF client
> calls to let GRUB operate.

v6 does very basic block devices. Now I need to learn how to build grub
properly: it is 32bit and it is not straightforward to build it
100% properly on a ppc64 machine. I see occasional issues such as
uint32->uint64 extension with garbage in the top 32 bits, things like
this... But it can definitely read MBR/GPT, parse FS, etc.


> 
>>>> The client interface is implemented exactly as RTAS - a 20 bytes blob,
>>>> right next after the RTAS blob. The entry point is passed to the kernel
>>>> via GPR5.
>>>>
>>>> This implements a handful of client interface methods just to get going.
>>>> In particular, this implements the device tree fetching,
>>>> ibm,client-architecture-support and instantiate-rtas.
>>>>
>>>> This implements changing FDT properties for RTAS (for vmlinux and zImage)
>>>> and initramdisk location (for zImage). To make this work, this skips
>>>> fdt_pack() when bios=off as not packing the blob leaves some room for
>>>> appending.
>>>>
>>>> This assigns "phandles" to device tree nodes as there is no more SLOF
>>>> and OF nodes addresses of which served as phandle values.
>>>> This keeps predefined nodes (such as XICS/NVLINK/...) unchanged.
>>>> phandles are regenerated at every FDT rebuild.
>>>>
>>>> When bios=off, this adds "/chosen" every time QEMU builds a tree.
>>>>
>>>> This implements "claim" which the client (Linux) uses for memory
>>>> allocation; this is also  used by QEMU for claiming kernel/initrd images,
>>>> client interface entry point, RTAS and the initial stack.
>>>>
>>>> While at this, add a "kernel-addr" machine parameter to allow moving
>>>> the kernel in memory. This is useful for debugging if the kernel is
>>>> loaded at @0, although not necessary.
>>>>
>>>> This adds basic instances support which are managed by a hashmap
>>>> ihandle->[phandle, DeviceState, Chardev].
>>>>
>>>> Note that a 64bit PCI fix is required for Linux:
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735e
>>>>
>>>> The test command line:
>>>>
>>>> qemu-system-ppc64 \
>>>> -nodefaults \
>>>> -chardev stdio,id=STDIO0,signal=off,mux=on \
>>>> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
>>>> -mon id=MON0,chardev=STDIO0,mode=readline \
>>>> -nographic \
>>>> -vga none \
>>>> -kernel pbuild/kernel-le-guest/arch/powerpc/boot/zImage.pseries \
>>>> -machine pseries,bios=off,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken \
>>>> -m 4G \
>>>> -enable-kvm \
>>>> -initrd pb/rootfs.cpio.xz \
>>>> -device nec-usb-xhci,id=nec-usb-xhci0 \
>>>> -netdev tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0 \
>>>> -device virtio-net-pci,id=vnet0,netdev=TAP0 img/f30le.qcow2 \
>>>> -snapshot \
>>>> -smp 8,threads=8 \
>>>> -trace events=qemu_trace_events \
>>>> -d guest_errors \
>>>> -chardev socket,id=SOCKET0,server,nowait,path=qemu.mon.ssh54088 \
>>>> -mon chardev=SOCKET0,mode=control
>>>>
>>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>>
>>> It'd be nice to split this patch up a bit, though I'll admit it's not
>>> very obvious where to do so.
>>
>>
>> v6 is a patchset.
>>
>>>> ---
>>>> Changes:
>>>> v5:
>>>> * made instances keep device and chardev pointers
>>>> * removed VIO dependencies
>>>> * print error if RTAS memory is not claimed as it should have been
>>>> * pack FDT as "quiesce"
>>>>
>>>> v4:
>>>> * fixed open
>>>> * validate ihandles in "call-method"
>>>>
>>>> v3:
>>>> * fixed phandles allocation
>>>> * s/__be32/uint32_t/ as we do not normally have __be32 type in qemu
>>>> * fixed size of /chosen/stdout
>>>> * bunch of renames
>>>> * do not create rtas properties at all, let the client deal with it;
>>>> instead setprop allows changing these in the FDT
>>>> * no more packing FDT when bios=off - nobody needs it and getprop does not
>>>> work otherwise
>>>> * allow updating initramdisk device tree properties (for zImage)
>>>> * added instances
>>>> * fixed stdout on OF's "write"
>>>> * removed special handling for stdout in OF client, spapr-vty handles it
>>>> instead
>>>>
>>>> v2:
>>>> * fixed claim()
>>>> * added "setprop"
>>>> * cleaner client interface and RTAS blobs management
>>>> * boots to petitboot and further to the target system
>>>> * more trace points
>>>> ---
>>>>  hw/ppc/Makefile.objs     |   1 +
>>>>  include/hw/ppc/spapr.h   |  28 +-
>>>>  hw/ppc/spapr.c           | 266 ++++++++++++++--
>>>>  hw/ppc/spapr_hcall.c     |  74 +++--
>>>>  hw/ppc/spapr_of_client.c | 633 +++++++++++++++++++++++++++++++++++++++
>>>>  hw/ppc/trace-events      |  12 +
>>>>  6 files changed, 959 insertions(+), 55 deletions(-)
>>>>  create mode 100644 hw/ppc/spapr_of_client.c
>>>>
>>>> diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
>>>> index 101e9fc59185..20efc0aa6f9b 100644
>>>> --- a/hw/ppc/Makefile.objs
>>>> +++ b/hw/ppc/Makefile.objs
>>>> @@ -6,6 +6,7 @@ obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
>>>>  obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
>>>>  obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o spapr_irq.o
>>>>  obj-$(CONFIG_PSERIES) += spapr_tpm_proxy.o
>>>> +obj-$(CONFIG_PSERIES) += spapr_of_client.o
>>>>  obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
>>>>  # IBM PowerNV
>>>>  obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>> index 61f005c6f686..efc2c70abf99 100644
>>>> --- a/include/hw/ppc/spapr.h
>>>> +++ b/include/hw/ppc/spapr.h
>>>> @@ -105,6 +105,11 @@ struct SpaprCapabilities {
>>>>      uint8_t caps[SPAPR_CAP_NUM];
>>>>  };
>>>>  
>>>> +typedef struct {
>>>> +    uint64_t start;
>>>> +    uint64_t size;
>>>> +} SpaprOfClaimed;
>>>> +
>>>
>>> Can we split more of the fake-OF code into a new file?
>>
>>
>> Done in v6, I quite reworked it, this is why I told you to ping me
>> before you review this one :)
> 
> Oops, I forgot.  Sorry.
> 
>>>>  /**
>>>>   * SpaprMachineClass:
>>>>   */
>>>> @@ -160,6 +165,13 @@ struct SpaprMachineState {
>>>>      void *fdt_blob;
>>>>      long kernel_size;
>>>>      bool kernel_le;
>>>> +    uint64_t kernel_addr;
>>>> +    bool bios_enabled;
>>>> +    uint32_t rtas_base;
>>>> +    GArray *claimed; /* array of SpaprOfClaimed */
>>>> +    uint64_t claimed_base;
>>>> +    GHashTable *of_instances; /* ihandle -> SpaprOfInstance */
>>>> +    uint32_t of_instance_last;
>>>>      uint32_t initrd_base;
>>>>      long initrd_size;
>>>>      uint64_t rtc_offset; /* Now used only during incoming migration */
>>>> @@ -510,7 +522,8 @@ struct SpaprMachineState {
>>>>  /* Client Architecture support */
>>>>  #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
>>>>  #define KVMPPC_H_UPDATE_DT      (KVMPPC_HCALL_BASE + 0x3)
>>>> -#define KVMPPC_HCALL_MAX        KVMPPC_H_UPDATE_DT
>>>> +#define KVMPPC_H_CLIENT         (KVMPPC_HCALL_BASE + 0x5)
>>>> +#define KVMPPC_HCALL_MAX        KVMPPC_H_CLIENT
>>>>  
>>>>  /*
>>>>   * The hcall range 0xEF00 to 0xEF80 is reserved for use in facilitating
>>>> @@ -538,6 +551,11 @@ void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
>>>>  target_ulong spapr_hypercall(PowerPCCPU *cpu, target_ulong opcode,
>>>>                               target_ulong *args);
>>>>  
>>>> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>>>> +                                            SpaprMachineState *spapr,
>>>> +                                            target_ulong addr,
>>>> +                                            target_ulong fdt_bufsize);
>>>> +
>>>>  /* Virtual Processor Area structure constants */
>>>>  #define VPA_MIN_SIZE           640
>>>>  #define VPA_SIZE_OFFSET        0x4
>>>> @@ -769,6 +787,11 @@ struct SpaprEventLogEntry {
>>>>  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
>>>>  void spapr_events_init(SpaprMachineState *sm);
>>>>  void spapr_dt_events(SpaprMachineState *sm, void *fdt);
>>>> +uint64_t spapr_do_of_client_claim(SpaprMachineState *spapr, uint64_t virt,
>>>> +                                  uint64_t size, uint64_t align);
>>>> +
>>>> +uint32_t spapr_of_client_open(SpaprMachineState *spapr, const char *path);
>>>> +int spapr_h_client(SpaprMachineState *spapr, target_ulong client_args);
>>>>  void close_htab_fd(SpaprMachineState *spapr);
>>>>  void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
>>>>  void spapr_free_hpt(SpaprMachineState *spapr);
>>>> @@ -891,4 +914,7 @@ void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
>>>>  #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
>>>>  
>>>>  void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
>>>> +
>>>> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base);
>>>> +
>>>>  #endif /* HW_SPAPR_H */
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index e62c89b3dd40..76ce8b973082 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -896,6 +896,55 @@ out:
>>>>      return ret;
>>>>  }
>>>>  
>>>> +/*
>>>> + * Below is a compiled version of RTAS blob and OF client interface entry point.
>>>> + *
>>>> + * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
>>>> + * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
>>>> + *
>>>> + *   .globl  _start
>>>> + *   _start:
>>>> + *           mr      4,3
>>>> + *           lis     3,KVMPPC_H_RTAS@h
>>>> + *           ori     3,3,KVMPPC_H_RTAS@l
>>>> + *           sc      1
>>>> + *           blr
>>>> + *           mr      4,3
>>>> + *           lis     3,KVMPPC_H_CLIENT@h
>>>> + *           ori     3,3,KVMPPC_H_CLIENT@l
>>>> + *           sc      1
>>>> + *           blr
>>>> + */
>>>> +static struct {
>>>
>>> Should be able to add a 'const' here.
>>>
>>>> +    uint8_t rtas[20], client[20];
>>>> +} QEMU_PACKED rtas_client_blob = {
>>>> +    .rtas = {
>>>> +        0x7c, 0x64, 0x1b, 0x78,
>>>> +        0x3c, 0x60, 0x00, 0x00,
>>>> +        0x60, 0x63, 0xf0, 0x00,
>>>> +        0x44, 0x00, 0x00, 0x22,
>>>> +        0x4e, 0x80, 0x00, 0x20
>>>> +    },
>>>> +    .client = {
>>>> +        0x7c, 0x64, 0x1b, 0x78,
>>>> +        0x3c, 0x60, 0x00, 0x00,
>>>> +        0x60, 0x63, 0xf0, 0x05,
>>>> +        0x44, 0x00, 0x00, 0x22,
>>>> +        0x4e, 0x80, 0x00, 0x20
>>>> +    }
>>>> +};
>>>
>>> I'd split this into two variables - there's not really any connection
>>> between the two, AFAICT.
>>>
>>> Note that I'm getting closer to merging the fwnmi stuff at which point
>>> you'll need to pad the RTAS blob with a bunch of extra space for
>>> taking the fwnmi dumps.
>>>
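For readers checking the two blobs by hand, here is a minimal sketch that decodes the lis/ori pair from each stub and rebuilds the hcall number loaded into r3. The bytes are copied from the patch; the helper names (insn_at, stub_hcall_number) are made up for illustration, and the value 0xf000 for KVMPPC_HCALL_BASE is an assumption inferred from the blob bytes rather than something this hunk states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes copied from the patch: RTAS entry stub. */
static const uint8_t rtas_stub[20] = {
    0x7c, 0x64, 0x1b, 0x78,   /* mr   4,3           */
    0x3c, 0x60, 0x00, 0x00,   /* lis  3,<high half> */
    0x60, 0x63, 0xf0, 0x00,   /* ori  3,3,<low half> */
    0x44, 0x00, 0x00, 0x22,   /* sc   1             */
    0x4e, 0x80, 0x00, 0x20    /* blr                */
};

/* Bytes copied from the patch: OF client interface entry stub. */
static const uint8_t client_stub[20] = {
    0x7c, 0x64, 0x1b, 0x78,
    0x3c, 0x60, 0x00, 0x00,
    0x60, 0x63, 0xf0, 0x05,
    0x44, 0x00, 0x00, 0x22,
    0x4e, 0x80, 0x00, 0x20
};

/* Read the index-th big-endian 32-bit instruction word (hypothetical helper). */
static uint32_t insn_at(const uint8_t *blob, size_t index)
{
    const uint8_t *p = blob + index * 4;

    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Rebuild the 32-bit value that the lis/ori pair loads into r3. */
static uint32_t stub_hcall_number(const uint8_t *blob)
{
    uint32_t lis = insn_at(blob, 1);
    uint32_t ori = insn_at(blob, 2);

    assert((lis >> 26) == 15);   /* addis primary opcode; lis is addis rD,0,imm */
    assert((ori >> 26) == 24);   /* ori primary opcode */
    return ((lis & 0xffff) << 16) | (ori & 0xffff);
}
```

Under that assumption the RTAS stub yields 0xf000 (base + 0x0) and the client stub yields 0xf005 (base + 0x5), which matches the KVMPPC_H_CLIENT definition in the header hunk.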
>>>> +
>>>> +void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base)
>>>> +{
>>>> +    if (spapr_do_of_client_claim(spapr, base, sizeof(rtas_client_blob.rtas),
>>>> +                                 0) != -1) {
>>>
>>> Wait.. == -1 is the success case?  That's a very surprising interface.
>>
>>
>> This is a sort of assert. spapr_do_of_client_claim() returns an
>> address, and the client is expected to claim the memory which it wants
>> RTAS to be copied to; this makes sure that either that happened or we
>> claimed it here.
> 
> Ah!  Ok, I understand.
> 
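To make the "-1 means already claimed" convention concrete, here is a rough, self-contained sketch of a claim allocator with the same return convention. It is not the code from the patch: the struct layout, MAX_CLAIMS, and the exact meaning of align == 0 (claim at exactly the requested address) are assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CLAIM_FAILED ((uint64_t)-1)
#define MAX_CLAIMS   16          /* arbitrary limit for the sketch */

struct claim { uint64_t start, size; };

struct claims {
    struct claim r[MAX_CLAIMS];
    size_t n;
    uint64_t base;               /* lowest address handed out automatically */
};

/* True if [start, start + size) intersects any recorded range. */
static bool overlaps(const struct claims *c, uint64_t start, uint64_t size)
{
    for (size_t i = 0; i < c->n; i++) {
        if (start < c->r[i].start + c->r[i].size &&
            c->r[i].start < start + size) {
            return true;
        }
    }
    return false;
}

/*
 * align == 0: claim exactly at virt, fail if the range is already taken.
 * align != 0: ignore virt and pick the first free aligned address >= base.
 * Returns the claimed address, or CLAIM_FAILED.
 */
static uint64_t claim(struct claims *c, uint64_t virt, uint64_t size,
                      uint64_t align)
{
    uint64_t start = virt;

    if (c->n == MAX_CLAIMS) {
        return CLAIM_FAILED;
    }
    if (align == 0) {
        if (overlaps(c, start, size)) {
            return CLAIM_FAILED;
        }
    } else {
        assert((align & (align - 1)) == 0);   /* power of two only */
        start = (c->base + align - 1) & ~(align - 1);
        while (overlaps(c, start, size)) {
            start += align;
        }
    }
    c->r[c->n++] = (struct claim){ start, size };
    return start;
}
```

With this shape, an exact-address claim that returns something other than CLAIM_FAILED tells the caller that nobody had claimed the range before, which is the condition spapr_instantiate_rtas() reports on.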
>>>> +        error_report("The OF client did not claim RTAS memory at 0x%x", base);
>>>
>>> Error message is hard to follow.  Maybe "Could not claim memory
>>> for RTAS"
> 
> Which makes my suggestion here a bad one too.
> 
>>>
>>>> +    }
>>>> +    spapr->rtas_base = base;
>>>> +    cpu_physical_memory_write(base, rtas_client_blob.rtas,
>>>> +                              sizeof(rtas_client_blob.rtas));
>>>> +}
>>>> +
>>>>  static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>>>>  {
>>>>      MachineState *ms = MACHINE(spapr);
>>>> @@ -980,6 +1029,11 @@ static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
>>>>      _FDT(fdt_setprop(fdt, rtas, "ibm,lrdr-capacity",
>>>>                       lrdr_capacity, sizeof(lrdr_capacity)));
>>>>  
>>>> +    if (!spapr->bios_enabled) {
>>>> +        _FDT(fdt_setprop_cell(fdt, rtas, "rtas-size",
>>>> +                              sizeof(rtas_client_blob.rtas)));
>>>> +    }
>>>> +
>>>>      spapr_dt_rtas_tokens(fdt, rtas);
>>>>  }
>>>>  
>>>> @@ -1057,7 +1111,7 @@ static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
>>>>      }
>>>>  
>>>>      if (spapr->kernel_size) {
>>>> -        uint64_t kprop[2] = { cpu_to_be64(KERNEL_LOAD_ADDR),
>>>> +        uint64_t kprop[2] = { cpu_to_be64(spapr->kernel_addr),
>>>
>>> Hrm, I really think I would like to see the change to adjustable
>>> kernel_addr split out - it puts a bunch of noise into the main kill
>>> slof patch.
>>
>> Sure, I'll do that if we decide to proceed with this.
>>
>>
>>>
>>>>                                cpu_to_be64(spapr->kernel_size) };
>>>>  
>>>>          _FDT(fdt_setprop(fdt, chosen, "qemu,boot-kernel",
>>>> @@ -1245,7 +1299,8 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>>>>      /* Build memory reserve map */
>>>>      if (reset) {
>>>>          if (spapr->kernel_size) {
>>>> -            _FDT((fdt_add_mem_rsv(fdt, KERNEL_LOAD_ADDR, spapr->kernel_size)));
>>>> +            _FDT((fdt_add_mem_rsv(fdt, spapr->kernel_addr,
>>>> +                                  spapr->kernel_size)));
>>>>          }
>>>>          if (spapr->initrd_size) {
>>>>              _FDT((fdt_add_mem_rsv(fdt, spapr->initrd_base,
>>>> @@ -1268,12 +1323,56 @@ void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
>>>>          }
>>>>      }
>>>>  
>>>> +    if (!spapr->bios_enabled) {
>>>> +        uint32_t phandle;
>>>> +        int i, offset, proplen = 0;
>>>> +        const void *prop;
>>>> +        bool found = false;
>>>> +        GArray *phandles = g_array_new(false, false, sizeof(uint32_t));
>>>> +
>>>> +        /* Find all predefined phandles */
>>>> +        for (offset = fdt_next_node(fdt, -1, NULL);
>>>> +             offset >= 0;
>>>> +             offset = fdt_next_node(fdt, offset, NULL)) {
>>>> +            prop = fdt_getprop_namelen(fdt, offset, "phandle", 7, &proplen);
>>>
>>> You can just use fdt_getprop() rather than the namelen variant (that's
>>> only really useful when you don't have a \0-terminated string with the
>>> name).
>>
>> Ok, will fix. There are just too many similar functions in libfdt.h and
>> fdt_getprop() could be inlined, probably.
> 
> It won't be inlined, but I think it will be tail-call optimized so it
> might as well be.  That is, I think the .o will look something like:
> 
> fdt_getprop:
> 	jiggle some registers
> 	bl	strlen
> 	jiggle some regs
> fdt_getprop_namelen:
> 	...
> 	blr
> 
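The relationship between the two lookups can be sketched without libfdt. The following is a stand-in with invented names (struct prop, find_prop, find_prop_namelen), not the libfdt implementation: the plain variant only computes strlen() and defers to the length-taking variant, which is what makes the latter useful solely when the name is not NUL-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy property table standing in for a device tree node. */
struct prop {
    const char *name;
    int value;
};

/* Length-taking lookup: name need not be NUL-terminated. */
static const struct prop *find_prop_namelen(const struct prop *props,
                                            size_t nprops,
                                            const char *name, size_t namelen)
{
    assert(props != NULL || nprops == 0);
    for (size_t i = 0; i < nprops; i++) {
        if (strlen(props[i].name) == namelen &&
            memcmp(props[i].name, name, namelen) == 0) {
            return &props[i];
        }
    }
    return NULL;
}

/* Convenience wrapper for ordinary C strings: just adds strlen(). */
static const struct prop *find_prop(const struct prop *props, size_t nprops,
                                    const char *name)
{
    return find_prop_namelen(props, nprops, name, strlen(name));
}
```

For a literal such as "phandle" the wrapper and the explicit length 7 give the same result; the length variant only pays off when looking up a slice of a longer buffer.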
>>>> +            if (prop && proplen == sizeof(uint32_t)) {
>>>> +                phandle = fdt32_ld(prop);
>>>> +                g_array_append_val(phandles, phandle);
>>>> +            }
>>>> +        }
>>>> +
>>>> +        /* Assign phandles skipping the predefined ones */
>>>> +        for (offset = fdt_next_node(fdt, -1, NULL), phandle = 1;
>>>> +             offset >= 0;
>>>> +             offset = fdt_next_node(fdt, offset, NULL), ++phandle) {
>>>> +            prop = fdt_getprop_namelen(fdt, offset, "phandle", 7, &proplen);
>>>> +            if (prop) {
>>>> +                continue;
>>>> +            }
>>>> +            /* Check if the current phandle is not allocated already */
>>>> +            for ( ; ; ++phandle) {
>>>> +                for (i = 0, found = false; i < phandles->len; ++i) {
>>>> +                    if (phandle == g_array_index(phandles, uint32_t, i)) {
>>>> +                        found = true;
>>>> +                        break;
>>>> +                    }
>>>> +                }
>>>> +                if (!found) {
>>>> +                    break;
>>>> +                }
>>>> +            }
>>>> +            _FDT(fdt_setprop_cell(fdt, offset, "phandle", phandle));
>>>> +        }
>>>> +        g_array_unref(phandles);
>>>> +    }
>>>> +
>>>>      return fdt;
>>>>  }
>>>>  
>>>>  static uint64_t translate_kernel_address(void *opaque, uint64_t addr)
>>>>  {
>>>> -    return (addr & 0x0fffffff) + KERNEL_LOAD_ADDR;
>>>> +    SpaprMachineState *spapr = opaque;
>>>> +    return (addr & 0x0fffffff) + spapr->kernel_addr;
>>>>  }
>>>>  
>>>>  static void emulate_spapr_hypercall(PPCVirtualHypervisor *vhyp,
>>>> @@ -1660,24 +1759,89 @@ static void spapr_machine_reset(MachineState *machine)
>>>>       */
>>>>      fdt_addr = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FDT_MAX_SIZE;
>>>>  
>>>> +    /* Set up the entry state */
>>>> +    if (!spapr->bios_enabled) {
>>>> +        if (spapr->claimed) {
>>>> +            g_array_unref(spapr->claimed);
>>>> +        }
>>>> +        if (spapr->of_instances) {
>>>> +            g_hash_table_unref(spapr->of_instances);
>>>> +        }
>>>> +
>>>> +        spapr->claimed = g_array_new(false, false, sizeof(SpaprOfClaimed));
>>>> +        spapr->of_instances = g_hash_table_new(g_direct_hash, g_direct_equal);
>>>> +
>>>> +        spapr->claimed_base = 0x10000; /* Avoid using the first system page */
>>>> +
>>>> +        spapr_cpu_set_entry_state(first_ppc_cpu, spapr->kernel_addr,
>>>> +                                  spapr->initrd_base);
>>>> +        first_ppc_cpu->env.gpr[4] = spapr->initrd_size;
>>>> +
>>>> +        if (spapr_do_of_client_claim(spapr, spapr->kernel_addr,
>>>> +                                  spapr->kernel_size, 0) == -1) {
>>>> +            error_report("Memory for kernel is in use");
>>>> +            exit(1);
>>>> +        }
>>>> +        if (spapr_do_of_client_claim(spapr, spapr->initrd_base,
>>>> +                                  spapr->initrd_size, 0) == -1) {
>>>> +            error_report("Memory for initramdisk is in use");
>>>> +            exit(1);
>>>> +        }
>>>> +        first_ppc_cpu->env.gpr[1] = spapr_do_of_client_claim(spapr, 0, 0x40000,
>>>> +                                                             0x10000);
>>>> +        if (first_ppc_cpu->env.gpr[1] == -1) {
>>>> +            error_report("Memory allocation for stack failed");
>>>> +            exit(1);
>>>> +        }
>>>> +
>>>> +        first_ppc_cpu->env.gpr[5] =
>>>> +            spapr_do_of_client_claim(spapr, 0, sizeof(rtas_client_blob.client),
>>>> +                                     sizeof(rtas_client_blob.client));
>>>> +        if (first_ppc_cpu->env.gpr[5] == -1) {
>>>> +            error_report("Memory allocation for OF client failed");
>>>> +            exit(1);
>>>> +        }
>>>> +        cpu_physical_memory_write(first_ppc_cpu->env.gpr[5],
>>>> +                                  rtas_client_blob.client,
>>>> +                                  sizeof(rtas_client_blob.client));
>>>> +    } else {
>>>> +        spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
>>>> +        first_ppc_cpu->env.gpr[5] = 0; /* 0 = kexec !0 = prom_init */
>>>> +    }
>>>> +
>>>>      fdt = spapr_build_fdt(spapr, true, FDT_MAX_SIZE);
>>>>  
>>>> -    rc = fdt_pack(fdt);
>>>> -
>>>> -    /* Should only fail if we've built a corrupted tree */
>>>> -    assert(rc == 0);
>>>> -
>>>> -    /* Load the fdt */
>>>> -    qemu_fdt_dumpdtb(fdt, fdt_totalsize(fdt));
>>>> -    cpu_physical_memory_write(fdt_addr, fdt, fdt_totalsize(fdt));
>>>>      g_free(spapr->fdt_blob);
>>>>      spapr->fdt_size = fdt_totalsize(fdt);
>>>>      spapr->fdt_initial_size = spapr->fdt_size;
>>>>      spapr->fdt_blob = fdt;
>>>>  
>>>> -    /* Set up the entry state */
>>>> -    spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
>>>> -    first_ppc_cpu->env.gpr[5] = 0;
>>>> +    if (spapr->bios_enabled) {
>>>> +        /* Load the fdt */
>>>> +        rc = fdt_pack(spapr->fdt_blob);
>>>> +        /* Should only fail if we've built a corrupted tree */
>>>> +        assert(rc == 0);
>>>> +
>>>> +        spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
>>>> +        spapr->fdt_initial_size = spapr->fdt_size;
>>>> +        qemu_fdt_dumpdtb(spapr->fdt_blob, spapr->fdt_size);
>>>
>>> I think we should still have a dumpdtb call on the !bios path.
>>>
>>>> +        cpu_physical_memory_write(fdt_addr, spapr->fdt_blob, spapr->fdt_size);
>>>> +    } else {
>>>> +        char *stdout_path = spapr_vio_stdout_path(spapr->vio_bus);
>>>> +        int offset = fdt_path_offset(fdt, "/chosen");
>>>> +
>>>> +        /*
>>>> +         * SLOF-less setup requires an open instance of stdout for early
>>>> +         * kernel printk. By now all phandles are settled so we can open
>>>> +         * the default serial console.
>>>> +         * We skip writing FDT as nothing expects it; OF client interface is
>>>> +         * going to be used for reading the device tree.
>>>> +         */
>>>> +        if (stdout_path) {
>>>> +            _FDT(fdt_setprop_cell(fdt, offset, "stdout",
>>>> +                                  spapr_of_client_open(spapr, stdout_path)));
>>>> +        }
>>>> +    }
>>>>  
>>>>      spapr->cas_reboot = false;
>>>>  }
>>>> @@ -2897,12 +3061,12 @@ static void spapr_machine_init(MachineState *machine)
>>>>          uint64_t lowaddr = 0;
>>>>  
>>>>          spapr->kernel_size = load_elf(kernel_filename, NULL,
>>>> -                                      translate_kernel_address, NULL,
>>>> +                                      translate_kernel_address, spapr,
>>>>                                        NULL, &lowaddr, NULL, 1,
>>>>                                        PPC_ELF_MACHINE, 0, 0);
>>>>          if (spapr->kernel_size == ELF_LOAD_WRONG_ENDIAN) {
>>>>              spapr->kernel_size = load_elf(kernel_filename, NULL,
>>>> -                                          translate_kernel_address, NULL, NULL,
>>>> +                                          translate_kernel_address, spapr, NULL,
>>>>                                            &lowaddr, NULL, 0, PPC_ELF_MACHINE,
>>>>                                            0, 0);
>>>>              spapr->kernel_le = spapr->kernel_size > 0;
>>>> @@ -2918,7 +3082,7 @@ static void spapr_machine_init(MachineState *machine)
>>>>              /* Try to locate the initrd in the gap between the kernel
>>>>               * and the firmware. Add a bit of space just in case
>>>>               */
>>>> -            spapr->initrd_base = (KERNEL_LOAD_ADDR + spapr->kernel_size
>>>> +            spapr->initrd_base = (spapr->kernel_addr + spapr->kernel_size
>>>>                                    + 0x1ffff) & ~0xffff;
>>>>              spapr->initrd_size = load_image_targphys(initrd_filename,
>>>>                                                       spapr->initrd_base,
>>>> @@ -2932,20 +3096,22 @@ static void spapr_machine_init(MachineState *machine)
>>>>          }
>>>>      }
>>>>  
>>>> -    if (bios_name == NULL) {
>>>> -        bios_name = FW_FILE_NAME;
>>>> +    if (spapr->bios_enabled) {
>>>> +        if (bios_name == NULL) {
>>>> +            bios_name = FW_FILE_NAME;
>>>> +        }
>>>> +        filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
>>>> +        if (!filename) {
>>>> +            error_report("Could not find LPAR firmware '%s'", bios_name);
>>>> +            exit(1);
>>>> +        }
>>>> +        fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
>>>> +        if (fw_size <= 0) {
>>>> +            error_report("Could not load LPAR firmware '%s'", filename);
>>>> +            exit(1);
>>>> +        }
>>>> +        g_free(filename);
>>>>      }
>>>> -    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
>>>> -    if (!filename) {
>>>> -        error_report("Could not find LPAR firmware '%s'", bios_name);
>>>> -        exit(1);
>>>> -    }
>>>> -    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
>>>> -    if (fw_size <= 0) {
>>>> -        error_report("Could not load LPAR firmware '%s'", filename);
>>>> -        exit(1);
>>>> -    }
>>>> -    g_free(filename);
>>>>  
>>>>      /* FIXME: Should register things through the MachineState's qdev
>>>>       * interface, this is a legacy from the sPAPREnvironment structure
>>>> @@ -3162,6 +3328,32 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
>>>>      visit_type_uint32(v, name, (uint32_t *)opaque, errp);
>>>>  }
>>>>  
>>>> +static void spapr_get_kernel_addr(Object *obj, Visitor *v, const char *name,
>>>> +                                  void *opaque, Error **errp)
>>>> +{
>>>> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
>>>> +}
>>>> +
>>>> +static void spapr_set_kernel_addr(Object *obj, Visitor *v, const char *name,
>>>> +                                  void *opaque, Error **errp)
>>>> +{
>>>> +    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
>>>> +}
>>>> +
>>>> +static bool spapr_get_bios_enabled(Object *obj, Error **errp)
>>>> +{
>>>> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>>>> +
>>>> +    return spapr->bios_enabled;
>>>> +}
>>>> +
>>>> +static void spapr_set_bios_enabled(Object *obj, bool value, Error **errp)
>>>> +{
>>>> +    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>>>> +
>>>> +    spapr->bios_enabled = value;
>>>> +}
>>>> +
>>>>  static char *spapr_get_ic_mode(Object *obj, Error **errp)
>>>>  {
>>>>      SpaprMachineState *spapr = SPAPR_MACHINE(obj);
>>>> @@ -3267,6 +3459,20 @@ static void spapr_instance_init(Object *obj)
>>>>      object_property_add_bool(obj, "vfio-no-msix-emulation",
>>>>                               spapr_get_msix_emulation, NULL, NULL);
>>>>  
>>>> +    object_property_add(obj, "kernel-addr", "uint64", spapr_get_kernel_addr,
>>>> +                        spapr_set_kernel_addr, NULL, &spapr->kernel_addr,
>>>> +                        &error_abort);
>>>> +    object_property_set_description(obj, "kernel-addr",
>>>> +                                    stringify(KERNEL_LOAD_ADDR)
>>>> +                                    " for -kernel is the default",
>>>> +                                    NULL);
>>>> +    spapr->kernel_addr = KERNEL_LOAD_ADDR;
>>>> +    object_property_add_bool(obj, "bios", spapr_get_bios_enabled,
>>>> +                            spapr_set_bios_enabled, NULL);
>>>> +    object_property_set_description(obj, "bios", "Controls whether to load bios",
>>>> +                                    NULL);
>>>> +    spapr->bios_enabled = true;
>>>> +
>>>>      /* The machine class defines the default interrupt controller mode */
>>>>      spapr->irq = smc->irq;
>>>>      object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
>>>> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
>>>> index f1799b1b707d..f2d8823d2c3a 100644
>>>> --- a/hw/ppc/spapr_hcall.c
>>>> +++ b/hw/ppc/spapr_hcall.c
>>>> @@ -1660,15 +1660,11 @@ static bool spapr_hotplugged_dev_before_cas(void)
>>>>      return false;
>>>>  }
>>>>  
>>>> -static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>>> -                                                  SpaprMachineState *spapr,
>>>> -                                                  target_ulong opcode,
>>>> -                                                  target_ulong *args)
>>>> +target_ulong do_client_architecture_support(PowerPCCPU *cpu,
>>>> +                                            SpaprMachineState *spapr,
>>>> +                                            target_ulong addr,
>>>> +                                            target_ulong fdt_bufsize)
>>>>  {
>>>> -    /* Working address in data buffer */
>>>> -    target_ulong addr = ppc64_phys_to_real(args[0]);
>>>> -    target_ulong fdt_buf = args[1];
>>>> -    target_ulong fdt_bufsize = args[2];
>>>>      target_ulong ov_table;
>>>>      uint32_t cas_pvr;
>>>>      SpaprOptionVector *ov1_guest, *ov5_guest, *ov5_cas_old;
>>>> @@ -1816,7 +1812,6 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>>>  
>>>>      if (!spapr->cas_reboot) {
>>>>          void *fdt;
>>>> -        SpaprDeviceTreeUpdateHeader hdr = { .version_id = 1 };
>>>>  
>>>>          /* If spapr_machine_reset() did not set up a HPT but one is necessary
>>>>           * (because the guest isn't going to use radix) then set it up here. */
>>>> @@ -1825,21 +1820,7 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>>>              spapr_setup_hpt_and_vrma(spapr);
>>>>          }
>>>>  
>>>> -        if (fdt_bufsize < sizeof(hdr)) {
>>>> -            error_report("SLOF provided insufficient CAS buffer "
>>>> -                         TARGET_FMT_lu " (min: %zu)", fdt_bufsize, sizeof(hdr));
>>>> -            exit(EXIT_FAILURE);
>>>> -        }
>>>> -
>>>> -        fdt_bufsize -= sizeof(hdr);
>>>> -
>>>> -        fdt = spapr_build_fdt(spapr, false, fdt_bufsize);
>>>> -        _FDT((fdt_pack(fdt)));
>>>> -
>>>> -        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
>>>> -        cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
>>>> -                                  fdt_totalsize(fdt));
>>>> -        trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
>>>> +        fdt = spapr_build_fdt(spapr, !spapr->bios_enabled, fdt_bufsize);
>>>>  
>>>>          g_free(spapr->fdt_blob);
>>>>          spapr->fdt_size = fdt_totalsize(fdt);
>>>> @@ -1854,6 +1835,41 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>>>      return H_SUCCESS;
>>>>  }
>>>>  
>>>> +static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>>>> +                                                  SpaprMachineState *spapr,
>>>> +                                                  target_ulong opcode,
>>>> +                                                  target_ulong *args)
>>>> +{
>>>> +    /* Working address in data buffer */
>>>> +    target_ulong addr = ppc64_phys_to_real(args[0]);
>>>> +    target_ulong fdt_buf = args[1];
>>>> +    target_ulong fdt_bufsize = args[2];
>>>> +    target_ulong ret;
>>>> +    SpaprDeviceTreeUpdateHeader hdr = { .version_id = 1 };
>>>> +
>>>> +    if (fdt_bufsize < sizeof(hdr)) {
>>>> +        error_report("SLOF provided insufficient CAS buffer "
>>>> +                     TARGET_FMT_lu " (min: %zu)", fdt_bufsize, sizeof(hdr));
>>>> +        exit(EXIT_FAILURE);
>>>> +    }
>>>> +
>>>> +    fdt_bufsize -= sizeof(hdr);
>>>> +
>>>> +    ret = do_client_architecture_support(cpu, spapr, addr, fdt_bufsize);
>>>> +    if (ret == H_SUCCESS) {
>>>> +        _FDT((fdt_pack(spapr->fdt_blob)));
>>>> +        spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
>>>> +        spapr->fdt_initial_size = spapr->fdt_size;
>>>> +
>>>> +        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
>>>> +        cpu_physical_memory_write(fdt_buf + sizeof(hdr), spapr->fdt_blob,
>>>> +                                  spapr->fdt_size);
>>>> +        trace_spapr_cas_continue(spapr->fdt_size + sizeof(hdr));
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>>  static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
>>>>                                                SpaprMachineState *spapr,
>>>>                                                target_ulong opcode,
>>>> @@ -1998,6 +2014,14 @@ static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>>>      return H_SUCCESS;
>>>>  }
>>>>  
>>>> +static target_ulong h_client(PowerPCCPU *cpu, SpaprMachineState *spapr,
>>>> +                             target_ulong opcode, target_ulong *args)
>>>
>>> As I said in an earlier revision, please expand these names from just
>>> "client", for readability by people who aren't already thinking about
>>> open firmware.
>>
>> Yeah, I missed this one.
>>
>>
>>>
>>>> +{
>>>> +    target_ulong client_args = ppc64_phys_to_real(args[0]);
>>>> +
>>>> +    return spapr_h_client(spapr, client_args);
>>>> +}
>>>> +
>>>>  static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
>>>>  static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - KVMPPC_HCALL_BASE + 1];
>>>>  static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 4 + 1];
>>>> @@ -2121,6 +2145,8 @@ static void hypercall_register_types(void)
>>>>  
>>>>      spapr_register_hypercall(KVMPPC_H_UPDATE_DT, h_update_dt);
>>>>  
>>>> +    spapr_register_hypercall(KVMPPC_H_CLIENT, h_client);
>>>> +
>>>>      /* Virtual Processor Home Node */
>>>>      spapr_register_hypercall(H_HOME_NODE_ASSOCIATIVITY,
>>>>                               h_home_node_associativity);
>>>> diff --git a/hw/ppc/spapr_of_client.c b/hw/ppc/spapr_of_client.c
>>>> new file mode 100644
>>>> index 000000000000..24d854b76e51
>>>> --- /dev/null
>>>> +++ b/hw/ppc/spapr_of_client.c
>>>
>>> I'd suggest expanding this file to cover as much as you can of the
>>> virtual OF stuff, not just the client interface.
>>
>> This is done in v6.
>>
>>
>>>
>>>> @@ -0,0 +1,633 @@
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu-common.h"
>>>> +#include "qapi/error.h"
>>>> +#include "exec/memory.h"
>>>> +#include "hw/ppc/spapr.h"
>>>> +#include "hw/ppc/spapr_vio.h"
>>>> +#include "chardev/char.h"
>>>> +#include "qom/qom-qobject.h"
>>>> +#include "trace.h"
>>>> +
>>>> +typedef struct {
>>>> +    DeviceState *dev;
>>>> +    Chardev *cdev;
>>>> +    uint32_t phandle;
>>>> +} SpaprOfInstance;
>>>> +
>>>> +/*
>>>> + * OF 1275 "nextprop" description suggests it is 32 bytes max but
>>>> + * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 chars long.
>>>> + */
>>>> +#define OF_PROPNAME_LEN_MAX 64
>>>> +
>>>> +/* Defined as Big Endian */
>>>> +struct prom_args {
>>>> +    uint32_t service;
>>>> +    uint32_t nargs;
>>>> +    uint32_t nret;
>>>> +    uint32_t args[10];
>>>> +};
>>>> +
>>>> +static void readstr(hwaddr pa, char *buf, int size)
>>>> +{
>>>> +    cpu_physical_memory_read(pa, buf, size - 1);
>>>> +    buf[size - 1] = 0;
>>>> +}
>>>
>>> I'd still like to see this return some kind of error if it had to
>>> truncate what was passed by the client.
>>
>>
>> But truncating will produce error anyway - libfdt won't find stuff,
>> etc.
> 
> Probably, but I think the error will be much more comprehensible if we
> catch it here.

I cannot really print an error, as the guest could then flood QEMU's log;
returning an error is done by other means. If I make this one return an
error, then every single caller will need an error_exit label. How
about this:

static void readstr(hwaddr pa, char *buf, int size)
{
    cpu_physical_memory_read(pa, buf, size);
    if (buf[size - 1] != '\0') {
        buf[size - 1] = '\0';
        trace_spapr_of_client_error_strbuf(buf, size);
    }
}

and a tracepoint:

spapr_of_client_error_strbuf(const char *s, int len) "%s is longer than %d"


and let the callers of this fail?
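One caveat with testing only the last byte: a short string followed by nonzero bytes in guest memory also trips the check, even though nothing was truncated. A possible variation, sketched here with a plain array standing in for guest RAM, looks for a terminator anywhere in the buffer. The names guest_ram, mock_physical_memory_read and readstr_checked are hypothetical, and the bool result is optional for callers to use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for guest RAM; not a QEMU interface. */
static uint8_t guest_ram[64];

/* Mock of a physical memory read, bounds-checked for the sketch. */
static void mock_physical_memory_read(size_t pa, void *buf, size_t size)
{
    assert(pa <= sizeof(guest_ram) && size <= sizeof(guest_ram) - pa);
    memcpy(buf, guest_ram + pa, size);
}

/*
 * Read up to size bytes and guarantee NUL termination.
 * Returns true if the guest string fit, false if it had to be truncated
 * (the place where a tracepoint could fire).
 */
static bool readstr_checked(size_t pa, char *buf, size_t size)
{
    assert(size > 0);
    mock_physical_memory_read(pa, buf, size);
    if (memchr(buf, '\0', size)) {
        return true;             /* terminator found, nothing lost */
    }
    buf[size - 1] = '\0';        /* force termination, report truncation */
    return false;
}
```

Callers that do not care can ignore the return value and still rely on the later libfdt lookup failing.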


> 
>>>> +
>>>> +static bool _cmpservice(const char *s, size_t len,
>>>
>>> Don't use leading _ please - in userland those are reserved for the
>>> system libraries.
>>>
>>>> +                        unsigned nargs, unsigned nret,
>>>> +                        const char *s1, size_t len1,
>>>> +                        unsigned nargscheck, unsigned nretcheck)
>>>> +{
>>>> +    if (strcmp(s, s1)) {
>>>> +        return false;
>>>> +    }
>>>> +    if (nargscheck == 0 && nretcheck == 0) {
>>>> +        return true;
>>>> +    }
>>>> +    if (nargs != nargscheck || nret != nretcheck) {
>>>> +        trace_spapr_of_client_error_param(s, nargscheck, nretcheck, nargs,
>>>> +                                          nret);
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_finddevice(const void *fdt, uint32_t nodeaddr)
>>>> +{
>>>> +    char node[256];
>>>
>>> Is 256 enough?  OF paths can get pretty long...
>>
>>
>> Hard to imagine that 255 is not enough though. Long parts of the path
>> would be scsi drive id, PHB but in between we can only have a bunch of
>> PCI bridges and these are not so long.
> 
> Hm, ok.  I had a look on a Boston and the longest path I see there is
> 75 characters, I thought it might be a lot more.


Maybe with a bunch of scsi/pci buses, or when the network is involved, with
all the IP+DNS+imagename - these can be long, probably... Although 256
is still pretty long.


> 
>> What do you think is an appropriate limit?
>>
>>
>>>
>>>> +    int ret;
>>>> +
>>>> +    readstr(nodeaddr, node, sizeof(node));
>>>> +    ret = fdt_path_offset(fdt, node);
>>>> +    if (ret >= 0) {
>>>> +        ret = fdt_get_phandle(fdt, ret);
>>>> +    }
>>>> +
>>>> +    return (uint32_t) ret;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_getprop(const void *fdt, uint32_t nodeph,
>>>> +                                  uint32_t pname, uint32_t valaddr,
>>>> +                                  uint32_t vallen)
>>>> +{
>>>> +    char propname[OF_PROPNAME_LEN_MAX + 1];
>>>> +    uint32_t ret = 0;
>>>> +    int proplen = 0;
>>>> +    const void *prop;
>>>> +
>>>> +    readstr(pname, propname, sizeof(propname));
>>>> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
>>>> +                               propname, strlen(propname), &proplen);
>>>
>>> Again, you don't need _namelen.
>>>
>>>> +    if (prop) {
>>>> +        int cb = MIN(proplen, vallen);
>>>> +
>>>> +        cpu_physical_memory_write(valaddr, prop, cb);
>>>> +        ret = cb;
>>>
>>> If I'm reading 1275 correctly, the return value should be the
>>> untruncated length of the property.
>>
>>
>> "Size is either the actual size of the property". I'd think the actual
>> size is what we actually copied to the buffer but @proplen is probably
>> what they meant, I'll change to that and see what breaks.
>>
>>
>>
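A small sketch of the behaviour discussed above, where the copy is bounded by the caller's buffer but the return value is the full property length. It uses memcpy in place of the guest memory write, and the function name getprop_copy is invented for the example; treat it as one reading of the 1275 wording, not a confirmed interpretation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy at most buflen bytes of the property into buf, but report the
 * untruncated property length. Returns (uint32_t)-1 if there is no
 * such property.
 */
static uint32_t getprop_copy(const void *prop, int proplen,
                             void *buf, uint32_t buflen)
{
    uint32_t cb;

    if (!prop || proplen < 0) {
        return (uint32_t)-1;
    }
    assert(buf != NULL || buflen == 0);
    cb = (uint32_t)proplen < buflen ? (uint32_t)proplen : buflen;
    if (cb) {
        memcpy(buf, prop, cb);
    }
    return (uint32_t)proplen;    /* full size, even when truncated */
}
```

A client can then compare the return value with its buffer size to detect truncation and retry with a larger buffer.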
>>>> +    } else {
>>>> +        ret = -1;
>>>> +    }
>>>> +    trace_spapr_of_client_getprop(nodeph, propname, ret);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_getproplen(const void *fdt, uint32_t nodeph,
>>>> +                                     uint32_t pname)
>>>> +{
>>>> +    char propname[OF_PROPNAME_LEN_MAX + 1];
>>>> +    uint32_t ret = 0;
>>>> +    int proplen = 0;
>>>> +    const void *prop;
>>>> +
>>>> +    readstr(pname, propname, sizeof(propname));
>>>> +
>>>> +    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
>>>> +                               propname, strlen(propname), &proplen);
>>>
>>> No _namelen.
>>>
>>>> +    if (prop) {
>>>> +        ret = proplen;
>>>> +    } else {
>>>> +        ret = -1;
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_setprop(SpaprMachineState *spapr,
>>>> +                                  uint32_t nodeph, uint32_t pname,
>>>> +                                  uint32_t valaddr, uint32_t vallen)
>>>> +{
>>>> +    char propname[OF_PROPNAME_LEN_MAX + 1];
>>>> +    uint32_t ret = -1;
>>>> +    int offset;
>>>
>>> A comment noting that you're only allowing a very restricted set of
>>> setprops would be good.
>>
>>
>> Is not that quite clear from the code itself? Okay...
> 
> Well, kinda.  The rationale for it would be useful here though.

I am adding:

 /*
  * We only allow changing properties which we know how to update on
  * the QEMU side.
  */
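
The restriction can also be read as a table. This is only a sketch of the checks visible in the hunk below (property name plus permitted value lengths); the table type and the helper name setprop_allowed are made up and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One row per property the client may change (illustrative type). */
struct allowed_prop {
    const char *name;
    bool allow32;                /* 4-byte value accepted */
    bool allow64;                /* 8-byte value accepted */
};

static const struct allowed_prop allowed[] = {
    { "linux,rtas-base",    true, false },
    { "linux,rtas-entry",   true, false },
    { "linux,initrd-start", true, true  },
    { "linux,initrd-end",   true, true  },
};

/* True if a setprop of this name and value length would be accepted. */
static bool setprop_allowed(const char *name, uint32_t vallen)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if (strcmp(name, allowed[i].name) == 0) {
            return (vallen == sizeof(uint32_t) && allowed[i].allow32) ||
                   (vallen == sizeof(uint64_t) && allowed[i].allow64);
        }
    }
    return false;
}
```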



> 
>>>> +    readstr(pname, propname, sizeof(propname));
>>>> +    if (vallen == sizeof(uint32_t)) {
>>>> +        uint32_t val32 = ldl_be_phys(first_cpu->as, valaddr);
>>>> +
>>>> +        if ((strcmp(propname, "linux,rtas-base") == 0) ||
>>>> +            (strcmp(propname, "linux,rtas-entry") == 0)) {
>>>> +            spapr->rtas_base = val32;
>>>> +        } else if (strcmp(propname, "linux,initrd-start") == 0) {
>>>> +            spapr->initrd_base = val32;
>>>> +        } else if (strcmp(propname, "linux,initrd-end") == 0) {
>>>> +            spapr->initrd_size = val32 - spapr->initrd_base;
>>>> +        } else {
>>>> +            goto trace_exit;
>>>> +        }
>>>> +    } else if (vallen == sizeof(uint64_t)) {
>>>> +        uint64_t val64 = ldq_be_phys(first_cpu->as, valaddr);
>>>> +
>>>> +        if (strcmp(propname, "linux,initrd-start") == 0) {
>>>> +            spapr->initrd_base = val64;
>>>> +        } else if (strcmp(propname, "linux,initrd-end") == 0) {
>>>> +            spapr->initrd_size = val64 - spapr->initrd_base;
>>>> +        } else {
>>>> +            goto trace_exit;
>>>> +        }
>>>> +    } else {
>>>> +        goto trace_exit;
>>>> +    }
>>>> +
>>>> +    offset = fdt_node_offset_by_phandle(spapr->fdt_blob, nodeph);
>>>> +    if (offset >= 0) {
>>>> +        uint8_t data[vallen];
>>>> +
>>>> +        cpu_physical_memory_read(valaddr, data, vallen);
>>>> +        if (!fdt_setprop(spapr->fdt_blob, offset, propname, data, vallen)) {
>>>> +            ret = vallen;
>>>> +        }
>>>> +    }
>>>> +
>>>> +trace_exit:
>>>> +    trace_spapr_of_client_setprop(nodeph, propname, ret);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_nextprop(const void *fdt, uint32_t phandle,
>>>> +                                   uint32_t prevaddr, uint32_t nameaddr)
>>>> +{
>>>> +    int offset = fdt_node_offset_by_phandle(fdt, phandle);
>>>> +    char prev[OF_PROPNAME_LEN_MAX + 1];
>>>> +    const char *tmp;
>>>> +
>>>> +    readstr(prevaddr, prev, sizeof(prev));
>>>> +    for (offset = fdt_first_property_offset(fdt, offset);
>>>> +         offset >= 0;
>>>> +         offset = fdt_next_property_offset(fdt, offset)) {
>>>> +
>>>> +        if (!fdt_getprop_by_offset(fdt, offset, &tmp, NULL)) {
>>>> +            return 0;
>>>> +        }
>>>> +        if (prev[0] == '\0' || strcmp(prev, tmp) == 0) {
>>>> +            if (prev[0] != '\0') {
>>>> +                offset = fdt_next_property_offset(fdt, offset);
>>>> +                if (offset < 0) {
>>>> +                    return 0;
>>>> +                }
>>>> +            }
>>>> +            if (!fdt_getprop_by_offset(fdt, offset, &tmp, NULL)) {
>>>> +                return 0;
>>>> +            }
>>>> +            cpu_physical_memory_write(nameaddr, tmp, strlen(tmp) + 1);
>>>> +            return 1;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_peer(const void *fdt, uint32_t phandle)
>>>> +{
>>>> +    int ret;
>>>> +
>>>> +    if (phandle == 0) {
>>>> +        ret = fdt_path_offset(fdt, "/");
>>>> +    } else {
>>>> +        ret = fdt_next_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
>>>> +    }
>>>> +
>>>> +    if (ret < 0) {
>>>> +        ret = 0;
>>>> +    } else {
>>>> +        ret = fdt_get_phandle(fdt, ret);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_child(const void *fdt, uint32_t phandle)
>>>> +{
>>>> +    int ret = fdt_first_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
>>>> +
>>>> +    if (ret < 0) {
>>>> +        ret = 0;
>>>> +    } else {
>>>> +        ret = fdt_get_phandle(fdt, ret);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_parent(const void *fdt, uint32_t phandle)
>>>> +{
>>>> +    int ret = fdt_parent_offset(fdt, fdt_node_offset_by_phandle(fdt, phandle));
>>>> +
>>>> +    if (ret < 0) {
>>>> +        ret = 0;
>>>> +    } else {
>>>> +        ret = fdt_get_phandle(fdt, ret);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static DeviceState *of_client_find_qom_dev(BusState *bus, const char *path)
>>>> +{
>>>> +    BusChild *kid;
>>>> +
>>>> +    QTAILQ_FOREACH(kid, &bus->children, sibling) {
>>>> +        const char *p = qdev_get_fw_dev_path(kid->child);
>>>> +        BusState *child;
>>>> +
>>>> +        if (p && strcmp(path, p) == 0) {
>>>> +            return kid->child;
>>>> +        }
>>>> +        QLIST_FOREACH(child, &kid->child->child_bus, sibling) {
>>>> +            DeviceState *d = of_client_find_qom_dev(child, path);
>>>> +
>>>> +            if (d) {
>>>> +                return d;
>>>> +            }
>>>> +        }
>>>> +    }
>>>> +    return NULL;
>>>> +}
>>>> +
>>>> +uint32_t spapr_of_client_open(SpaprMachineState *spapr, const char *path)
>>>> +{
>>>> +    int offset;
>>>> +    uint32_t ret = 0;
>>>> +    SpaprOfInstance *inst;
>>>> +
>>>> +    if (spapr->of_instance_last == 0xFFFFFFFF) {
>>>> +        /* We do not recycle ihandles yet */
>>>> +        goto trace_exit;
>>>> +    }
>>>> +    offset = fdt_path_offset(spapr->fdt_blob, path);
>>>> +    if (offset < 0) {
>>>> +        trace_spapr_of_client_error_unknown_path(path);
>>>> +        goto trace_exit;
>>>> +    }
>>>> +
>>>> +    inst = g_new(SpaprOfInstance, 1);
>>>> +    inst->phandle = fdt_get_phandle(spapr->fdt_blob, offset);
>>>> +    g_assert(inst->phandle);
>>>> +    ++spapr->of_instance_last;
>>>> +    inst->dev = of_client_find_qom_dev(sysbus_get_default(), path);
>>>> +    g_hash_table_insert(spapr->of_instances,
>>>> +                        GINT_TO_POINTER(spapr->of_instance_last),
>>>> +                        inst);
>>>> +    ret = spapr->of_instance_last;
>>>> +
>>>> +    if (inst->dev) {
>>>> +        const char *cdevstr = object_property_get_str(OBJECT(inst->dev),
>>>> +                                                      "chardev", NULL);
>>>> +
>>>> +        if (cdevstr) {
>>>> +            inst->cdev = qemu_chr_find(cdevstr);
>>>> +        }
>>>> +    }
>>>> +
>>>> +trace_exit:
>>>> +    trace_spapr_of_client_open(path, inst ? inst->phandle : 0, ret);
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_open(SpaprMachineState *spapr, uint32_t pathaddr)
>>>> +{
>>>> +    char path[256];
>>>> +
>>>> +    readstr(pathaddr, path, sizeof(path));
>>>> +
>>>> +    return spapr_of_client_open(spapr, path);
>>>> +}
>>>> +
>>>> +static void of_client_close(SpaprMachineState *spapr, uint32_t ihandle)
>>>> +{
>>>> +    if (!g_hash_table_remove(spapr->of_instances, GINT_TO_POINTER(ihandle))) {
>>>> +        trace_spapr_of_client_error_unknown_ihandle_close(ihandle);
>>>> +    }
>>>> +}
>>>> +
>>>> +static uint32_t of_client_instance_to_package(SpaprMachineState *spapr,
>>>> +                                              uint32_t ihandle)
>>>> +{
>>>> +    gpointer instp = g_hash_table_lookup(spapr->of_instances,
>>>> +                                        GINT_TO_POINTER(ihandle));
>>>> +
>>>> +    if (!instp) {
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    return ((SpaprOfInstance *)instp)->phandle;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_package_to_path(const void *fdt, uint32_t phandle,
>>>> +                                          uint32_t buf, uint32_t len)
>>>> +{
>>>> +    char tmp[256];
>>>> +
>>>> +    if (0 == fdt_get_path(fdt, fdt_node_offset_by_phandle(fdt, phandle), tmp,
>>>> +                          sizeof(tmp))) {
>>>> +        tmp[sizeof(tmp) - 1] = 0;
>>>> +        cpu_physical_memory_write(buf, tmp, MIN(len, strlen(tmp)));
>>>> +    }
>>>> +    return len;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_instance_to_path(SpaprMachineState *spapr,
>>>> +                                           uint32_t ihandle, uint32_t buf,
>>>> +                                           uint32_t len)
>>>> +{
>>>> +    uint32_t phandle = of_client_instance_to_package(spapr, ihandle);
>>>> +
>>>> +    if (phandle != -1) {
>>>> +        return of_client_package_to_path(spapr->fdt_blob, phandle, buf, len);
>>>> +    }
>>>> +
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static uint32_t of_client_write(SpaprMachineState *spapr, uint32_t ihandle,
>>>> +                                uint32_t buf, uint32_t len)
>>>> +{
>>>> +    char tmp[256];
>>>> +    int toread, toprint, cb = MIN(len, 1024);
>>>> +    SpaprOfInstance *inst = (SpaprOfInstance *)
>>>> +        g_hash_table_lookup(spapr->of_instances, GINT_TO_POINTER(ihandle));
>>>> +
>>>> +    while (cb > 0) {
>>>> +        toread = MIN(cb + 1, sizeof(tmp));
>>>> +        readstr(buf, tmp, toread);
>>>> +        toprint = strlen(tmp);
>>>> +        if (inst && inst->cdev) {
>>>> +            toprint = qemu_chr_write(inst->cdev, (uint8_t *) tmp, toprint,
>>>> +                                     true);
>>>> +        } else {
>>>> +            /* We normally open stdout so this is fallback */
>>>> +            printf("DBG[%d]%s", ihandle, tmp);
>>>> +        }
>>>> +        buf += toprint;
>>>> +        cb -= toprint;
>>>> +    }
>>>> +
>>>> +    return len;
>>>> +}
>>>> +
>>>> +static bool of_client_claim_avail(GArray *claimed, uint64_t virt, uint64_t size)
>>>> +{
>>>> +    int i;
>>>> +    SpaprOfClaimed *c;
>>>> +
>>>> +    for (i = 0; i < claimed->len; ++i) {
>>>> +        c = &g_array_index(claimed, SpaprOfClaimed, i);
>>>> +        if ((c->start <= virt && virt < c->start + c->size) ||
>>>> +            (virt <= c->start && c->start < virt + size)) {
>>>> +            return false;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static void of_client_claim_add(GArray *claimed, uint64_t virt, uint64_t size)
>>>> +{
>>>> +    SpaprOfClaimed newclaim;
>>>> +
>>>> +    newclaim.start = virt;
>>>> +    newclaim.size = size;
>>>> +    g_array_append_val(claimed, newclaim);
>>>> +}
>>>> +
>>>> +/*
>>>> + * "claim" claims memory at @virt if @align==0; otherwise it allocates
>>>> + * memory at the requested alignment.
>>>> + */
>>>> +uint64_t spapr_do_of_client_claim(SpaprMachineState *spapr, uint64_t virt,
>>>> +                                  uint64_t size, uint64_t align)
>>>> +{
>>>> +    uint32_t ret;
>>>> +
>>>> +    if (align == 0) {
>>>> +        if (!of_client_claim_avail(spapr->claimed, virt, size)) {
>>>> +            return -1;
>>>> +        }
>>>> +        ret = virt;
>>>> +    } else {
>>>> +        align = pow2ceil(align);
>>>
>>> Should this be a pow2ceil, or should it just return an error if align
>>> is not a power of 2?  Note that aligning something to 4 bytes will
>>> (probably) make it *not* aligned to 3 bytes.
>>
>> I did not see any notes about the specific alignment requirements here,
>> the idea is that clients may just not expect unaligned memory at all; I
>> could probably just drop it and see what happens...
> 
> I don't follow you.  Isn't the align value coming from the client?


This code is used by both the client and QEMU. For the QEMU callers, I would
have to do the alignment myself everywhere, which is not a huge deal. And
since it is an old interface which nobody follows 100%, I can imagine clients
(grub/yaboot) asking to claim with an alignment that is not a power of two,
expecting the firmware to round it up.
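
For illustration only, the two policies under discussion (rounding a
non-power-of-two alignment up versus rejecting it) can be sketched as below.
The helper names are hypothetical stand-ins, not QEMU functions;
round_up_pow2() only approximates what pow2ceil() does for values >= 1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pow2ceil(): smallest power of 2 >= v (v >= 1). */
static inline uint64_t round_up_pow2(uint64_t v)
{
    uint64_t p = 1;

    /* p becomes 0 on overflow, which also ends the loop. */
    while (p < v && p != 0) {
        p <<= 1;
    }
    return p;
}

/* The stricter policy would reject anything for which this is false. */
static inline bool is_pow2(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Round addr up to a multiple of align; align must be a power of 2. */
static inline uint64_t align_up(uint64_t addr, uint64_t align)
{
    return (addr + align - 1) & ~(align - 1);
}
```

With these helpers, a request for 3-byte alignment is served as a 4-byte
alignment, so the returned address is not necessarily a multiple of 3, which
is the concern raised in the review.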


> 
>>>> +        spapr->claimed_base = (spapr->claimed_base + align - 1) & ~(align - 1);
>>>> +        while (1) {
>>>> +            if (spapr->claimed_base >= spapr->rma_size) {
>>>> +                perror("Out of memory");
>>>
>>> error_report() or qemu_log() or something and a message with some more
>>> specificity, please.
>>
>>
>> What kind of specificity is missing here?
> 
> That it's on the OF claim interface specifically, and how much they
> were trying to claim.

Changing it to

error_report("Out of RMA memory for the OF client")
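
A message that also carries the request details asked for in the review could
look roughly like the sketch below. format_claim_oom() is a hypothetical
helper and the wording is only an assumption; in QEMU the resulting text
would be handed to error_report() rather than built with snprintf().

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: build an out-of-memory message for a failed OF
 * "claim" request, including how much the caller tried to claim.
 */
static inline int format_claim_oom(char *buf, size_t len, uint64_t size,
                                   uint64_t align, uint64_t rma_size)
{
    return snprintf(buf, len,
                    "OF client claim: out of RMA memory (size=0x%" PRIx64
                    ", align=0x%" PRIx64 ", RMA size=0x%" PRIx64 ")",
                    size, align, rma_size);
}
```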

Thanks for the review! I'll post it when we settle on the new bios/vof
machine parameter.
David Gibson Jan. 23, 2020, 5:11 a.m. UTC | #5
On Wed, Jan 22, 2020 at 06:14:37PM +1100, Alexey Kardashevskiy wrote:
> 
> 
> On 22/01/2020 17:32, David Gibson wrote:
> > On Tue, Jan 21, 2020 at 06:25:36PM +1100, Alexey Kardashevskiy wrote:
> >>
> >>
> >> On 21/01/2020 16:11, David Gibson wrote:
> >>> On Fri, Jan 10, 2020 at 01:09:25PM +1100, Alexey Kardashevskiy wrote:
> >>>> The Petitboot bootloader is way more advanced than SLOF is ever going to
> >>>> be as Petitboot comes with the full-featured Linux kernel with all
> >>>> the drivers, and initramdisk with quite user friendly interface.
> >>>> The problem with ditching SLOF is that an unmodified pseries kernel can
> >>>> either start via:
> >>>> 1. kexec, this requires presence of RTAS and skips
> >>>> ibm,client-architecture-support entirely;
> >>>> 2. normal boot, this heavily relies on the OF1275 client interface to
> >>>> fetch the device tree and do early setup (claim memory).
> >>>>
> >>>> This adds a new bios-less mode to the pseries machine: "bios=on|off".
> >>>> When enabled, QEMU does not load SLOF and jumps to the kernel from
> >>>> "-kernel".
> >>>
> >>> I don't love the name "bios" for this flag, since BIOS tends to refer
> >>> to old-school x86 firmware.  Given the various plans we're considering
> >>> for the future, I'd suggest "firmware=slof" for the current in-guest SLOF
> >>> mode, and say "firmware=vof" (Virtual Open Firmware) for the new
> >>> model.  We can consider firmware=petitboot or firmware=none (for
> >>> direct kexec-style boot into -kernel) or whatever in the future
> >>
> >> Ok. We could also enforce default loading addresses for SLOF/kernel/grub
> >> and drop "kernel-addr", although it is going to be confusing if it
> >> changes in not so obvious way...
> > 
> > Yes, I think that would be confusing, so I think adding the
> > kernel-addr override is a good idea, I'd just like it split out for
> > clarity.
> > 
> >> In fact, I will ideally need 3 flags:
> >> -bios: on|off to stop loading SLOF;
> >> -kernel-addr: 0x0 for slof/kernel; 0x20000 for grub;
> > 
> > I'm happy for that one to be separate from the "firmware style"
> > option.
> > 
> >> -kernel-translate-hack: on|off - as grub is linked to run from 0x20000
> >> and it only works when placed there, the hack breaks it.
> > 
> > Hrm.  I don't really understand what this one is about.  That doesn't
> > really seem like something the user would ever want to fiddle with
> > directly.
> 
> This allows loading grub, or actually any ELF (not that I have anything
> else in mind than just grub but still) which is not capable of
> relocating itself.

Ok, why would we ever not want that?

> >> Or we can pass grub via -bios and not via -kernel but strictly speaking
> >> there is still a firmware - that new 20 bytes blob so it would not be
> >> accurate.
> >>
> >> We can put this all into a single
> >> -firmware slof|vof|grub|linux. Not sure.
> > 
> > I'm not thinking of "grub" as a separate option - that would be the
> > same as "vof".  Using vof + no -kernel we'd need to scan the disks in
> > the same way SLOF does, and look for a boot partition, which will
> > probably contain a GRUB image. 
> 
> I was hoping we can avoid that by allowing
> "-kernel grub" and let grub do filesystems and MBR/GPT.

I don't want that to be the only way, because I want the GRUB
installed by the OS installer to be the GRUB we use.

> > Then we'd need enough faked OF client
> > calls to let GRUB operate.
> 
> v6 does very basic block devices. Now I need to learn how to build grub
> properly, it is 32bit and it is not straightforward how to build it
> 100% properly on a ppc64 machine, I see occasional issues such as
> uint32->uint64 extension with garbage in the top 32bits, things like
> this... But it can definitely read MBR/GPT, parse FS, etc.

Ok.

[snip]
> >>>> +/*
> >>>> + * "claim" claims memory at @virt if @align==0; otherwise it allocates
> >>>> + * memory at the requested alignment.
> >>>> + */
> >>>> +uint64_t spapr_do_of_client_claim(SpaprMachineState *spapr, uint64_t virt,
> >>>> +                                  uint64_t size, uint64_t align)
> >>>> +{
> >>>> +    uint32_t ret;
> >>>> +
> >>>> +    if (align == 0) {
> >>>> +        if (!of_client_claim_avail(spapr->claimed, virt, size)) {
> >>>> +            return -1;
> >>>> +        }
> >>>> +        ret = virt;
> >>>> +    } else {
> >>>> +        align = pow2ceil(align);
> >>>
> >>> Should this be a pow2ceil, or should it just return an error if align
> >>> is not a power of 2?  Note that aligning something to 4 bytes will
> >>> (probably) make it *not* aligned to 3 bytes.
> >>
> >> I did not see any notes about the specific alignment requirements here,
> >> the idea is that clients may just not expect unaligned memory at all; I
> >> could probably just drop it and see what happens...
> > 
> > I don't follow you.  Isn't the align value coming from the client?
> 
> This code is used by both the client and QEMU. For the QEMU callers, I would
> have to do the alignment myself everywhere, which is not a huge deal. And
> since it is an old interface which nobody follows 100%, I can imagine clients
> (grub/yaboot) asking to claim with an alignment that is not a power of two,
> expecting the firmware to round it up.
> 
> 
> > 
> >>>> +        spapr->claimed_base = (spapr->claimed_base + align - 1) & ~(align - 1);
> >>>> +        while (1) {
> >>>> +            if (spapr->claimed_base >= spapr->rma_size) {
> >>>> +                perror("Out of memory");
> >>>
> >>> error_report() or qemu_log() or something and a message with some more
> >>> specificity, please.
> >>
> >>
> >> What kind of specificity is missing here?
> > 
> > That it's on the OF claim interface specifically, and how much they
> > were trying to claim.
> 
> Changing it to
> 
> error_report("Out of RMA memory for the OF client")
> 
> Thanks for the review! I'll post it when we settle on the new bios/vof
> machine parameter.
> 
> 
>
Alexey Kardashevskiy Jan. 23, 2020, 8:43 a.m. UTC | #6
On 23/01/2020 16:11, David Gibson wrote:
> On Wed, Jan 22, 2020 at 06:14:37PM +1100, Alexey Kardashevskiy wrote:
>>
>>
>> On 22/01/2020 17:32, David Gibson wrote:
>>> On Tue, Jan 21, 2020 at 06:25:36PM +1100, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 21/01/2020 16:11, David Gibson wrote:
>>>>> On Fri, Jan 10, 2020 at 01:09:25PM +1100, Alexey Kardashevskiy wrote:
>>>>>> The Petitboot bootloader is way more advanced than SLOF is ever going to
>>>>>> be as Petitboot comes with the full-featured Linux kernel with all
>>>>>> the drivers, and initramdisk with quite user friendly interface.
>>>>>> The problem with ditching SLOF is that an unmodified pseries kernel can
>>>>>> either start via:
>>>>>> 1. kexec, this requires presence of RTAS and skips
>>>>>> ibm,client-architecture-support entirely;
>>>>>> 2. normal boot, this heavily relies on the OF1275 client interface to
>>>>>> fetch the device tree and do early setup (claim memory).
>>>>>>
>>>>>> This adds a new bios-less mode to the pseries machine: "bios=on|off".
>>>>>> When enabled, QEMU does not load SLOF and jumps to the kernel from
>>>>>> "-kernel".
>>>>>
>>>>> I don't love the name "bios" for this flag, since BIOS tends to refer
>>>>> to old-school x86 firmware.  Given the various plans we're considering
>>>>> for the future, I'd suggest "firmware=slof" for the current in-guest SLOF
>>>>> mode, and say "firmware=vof" (Virtual Open Firmware) for the new
>>>>> model.  We can consider firmware=petitboot or firmware=none (for
>>>>> direct kexec-style boot into -kernel) or whatever in the future
>>>>
>>>> Ok. We could also enforce default loading addresses for SLOF/kernel/grub
>>>> and drop "kernel-addr", although it is going to be confusing if it
>>>> changes in not so obvious way...
>>>
>>> Yes, I think that would be confusing, so I think adding the
>>> kernel-addr override is a good idea, I'd just like it split out for
>>> clarity.
>>>
>>>> In fact, I will ideally need 3 flags:
>>>> -bios: on|off to stop loading SLOF;
>>>> -kernel-addr: 0x0 for slof/kernel; 0x20000 for grub;
>>>
>>> I'm happy for that one to be separate from the "firmware style"
>>> option.
>>>
>>>> -kernel-translate-hack: on|off - as grub is linked to run from 0x20000
>>>> and it only works when placed there, the hack breaks it.
>>>
>>> Hrm.  I don't really understand what this one is about.  That doesn't
>>> really seem like something the user would ever want to fiddle with
>>> directly.
>>
>> This allows loading grub, or actually any ELF (not that I have anything
>> else in mind than just grub but still) which is not capable of
>> relocating itself.
> 
> Ok, why would we ever not want that?


Typical vmlinux is:

[fstn1-p1 kernel]$ readelf --sections ~/pbuild/kernel-le-guest/vmlinux |
head -n 100
There are 54 section headers, starting at offset 0x1027d0b8:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .head.text        PROGBITS         c000000000000000  00010000
       0000000000008000  0000000000000000  AX       0     0     128
  [ 2] .text             PROGBITS         c000000000008000  00018000
       0000000000ea2c50  0000000000000000  AX       0     0     256
  [ 3] .rodata           PROGBITS         c000000000eb0000  00ec0000
       00000000002f4b58  0000000000000000  WA       0     0     128
  [ 4] .gnu.hash         GNU_HASH         c0000000011a4b58  011b4b58
       000000000000001c  0000000000000000   A      27     0     8

  [ 5] .pci_fixup        PROGBITS         c0000000011a4b78  011b4b78
       0000000000003438  0000000000000000   A       0     0     8
  [ 6] __param           PROGBITS         c0000000011a7fb0  011b7fb0
       0000000000003fe8  0000000000000000  WA       0     0     8

  [ 7] __modver          PROGBITS         c0000000011abf98  011bbf98
       0000000000000118  0000000000000000  WA       0     0     8



This - c000000000xxxxxx - is where QEMU would try loading the kernel if
we did not have translate_kernel_address.
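
As a rough illustration, the arithmetic of translate_kernel_address() from
the patch can be reproduced standalone. translate_addr() below is a
hypothetical free-standing copy of that expression, and the load address
0x400000 used in the note after it is only an example value.

```c
#include <stdint.h>

/*
 * Same arithmetic as translate_kernel_address() in the patch: keep the
 * low 28 bits of the ELF address and rebase them onto kernel_addr.
 */
static inline uint64_t translate_addr(uint64_t kernel_addr, uint64_t addr)
{
    return (addr & 0x0fffffff) + kernel_addr;
}
```

With kernel_addr = 0x400000, the .text address 0xc000000000008000 from the
listing maps to 0x408000. For an image linked at 0x20000 and loaded with
kernel_addr = 0x20000, the same translation yields 0x40000, which is why a
non-relocatable grub needs the translation switched off.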


> 
>>>> Or we can pass grub via -bios and not via -kernel but strictly speaking
>>>> there is still a firmware - that new 20 bytes blob so it would not be
>>>> accurate.
>>>>
>>>> We can put this all into a single
>>>> -firmware slof|vof|grub|linux. Not sure.
>>>
>>> I'm not thinking of "grub" as a separate option - that would be the
>>> same as "vof".  Using vof + no -kernel we'd need to scan the disks in
>>> the same way SLOF does, and look for a boot partition, which will
>>> probably contain a GRUB image. 
>>
>> I was hoping we can avoid that by allowing
>> "-kernel grub" and let grub do filesystems and MBR/GPT.
> 
> I don't want that to be the only way, because I want the GRUB
> installed by the OS installer to be the GRUB we use.


Then it means implementing filesystems in the OF client in QEMU. Quite a
lot.
Andrea Bolognani Jan. 28, 2020, 11:31 a.m. UTC | #7
On Thu, 2020-01-23 at 16:11 +1100, David Gibson wrote:
> On Wed, Jan 22, 2020 at 06:14:37PM +1100, Alexey Kardashevskiy wrote:
> > On 22/01/2020 17:32, David Gibson wrote:
> > > I'm not thinking of "grub" as a separate option - that would be the
> > > same as "vof".  Using vof + no -kernel we'd need to scan the disks in
> > > the same way SLOF does, and look for a boot partition, which will
> > > probably contain a GRUB image. 
> > 
> > I was hoping we can avoid that by allowing
> > "-kernel grub" and let grub do filesystems and MBR/GPT.
> 
> I don't want that to be the only way, because I want the GRUB
> installed by the OS installer to be the GRUB we use.

Agreed, the bootloader and the kernel should live inside the guest
image and not on the host's filesystem.
Alexey Kardashevskiy Jan. 30, 2020, 5:57 a.m. UTC | #8
On 28/01/2020 22:31, Andrea Bolognani wrote:
> On Thu, 2020-01-23 at 16:11 +1100, David Gibson wrote:
>> On Wed, Jan 22, 2020 at 06:14:37PM +1100, Alexey Kardashevskiy wrote:
>>> On 22/01/2020 17:32, David Gibson wrote:
>>>> I'm not thinking of "grub" as a separate option - that would be the
>>>> same as "vof".  Using vof + no -kernel we'd need to scan the disks in
>>>> the same way SLOF does, and look for a boot partition, which will
>>>> probably contain a GRUB image. 
>>>
>>> I was hoping we can avoid that by allowing
>>> "-kernel grub" and let grub do filesystems and MBR/GPT.
>>
>> I don't want that to be the only way, because I want the GRUB
>> installed by the OS installer to be the GRUB we use.
> 
> Agreed, the bootloader and the kernel should live inside the guest
> image and not on the host's filesystem.


Well, I tried. I added a simple MBR+GPT parser, loaded the ELF, and discovered
that load_elf32/64 does not parse ELFs from memory, only from files.
Anyone keen on fixing that? :) My current workaround is to load grub
from the disk, store it in /tmp, and then load it as if it had been passed
via -kernel, which is ugly.
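
For readers unfamiliar with the on-disk formats, the first step of such a
parser is usually a pair of signature checks. The sketch below is not the
code from this series; the function names are hypothetical and it assumes
512-byte sectors that have already been read into memory.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* An MBR sector (LBA 0) ends with the boot signature bytes 0x55 0xAA. */
static inline bool has_mbr_signature(const uint8_t *lba0)
{
    return lba0[510] == 0x55 && lba0[511] == 0xAA;
}

/* A GPT header (normally in LBA 1) starts with the ASCII text "EFI PART". */
static inline bool has_gpt_signature(const uint8_t *lba1)
{
    return memcmp(lba1, "EFI PART", 8) == 0;
}
```

A real parser would go on to walk the partition entries and validate the GPT
header CRC; none of that is shown here.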

Patch

diff --git a/hw/ppc/Makefile.objs b/hw/ppc/Makefile.objs
index 101e9fc59185..20efc0aa6f9b 100644
--- a/hw/ppc/Makefile.objs
+++ b/hw/ppc/Makefile.objs
@@ -6,6 +6,7 @@  obj-$(CONFIG_PSERIES) += spapr_hcall.o spapr_iommu.o spapr_rtas.o
 obj-$(CONFIG_PSERIES) += spapr_pci.o spapr_rtc.o spapr_drc.o
 obj-$(CONFIG_PSERIES) += spapr_cpu_core.o spapr_ovec.o spapr_irq.o
 obj-$(CONFIG_PSERIES) += spapr_tpm_proxy.o
+obj-$(CONFIG_PSERIES) += spapr_of_client.o
 obj-$(CONFIG_SPAPR_RNG) +=  spapr_rng.o
 # IBM PowerNV
 obj-$(CONFIG_POWERNV) += pnv.o pnv_xscom.o pnv_core.o pnv_lpc.o pnv_psi.o pnv_occ.o pnv_bmc.o
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 61f005c6f686..efc2c70abf99 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -105,6 +105,11 @@  struct SpaprCapabilities {
     uint8_t caps[SPAPR_CAP_NUM];
 };
 
+typedef struct {
+    uint64_t start;
+    uint64_t size;
+} SpaprOfClaimed;
+
 /**
  * SpaprMachineClass:
  */
@@ -160,6 +165,13 @@  struct SpaprMachineState {
     void *fdt_blob;
     long kernel_size;
     bool kernel_le;
+    uint64_t kernel_addr;
+    bool bios_enabled;
+    uint32_t rtas_base;
+    GArray *claimed; /* array of SpaprOfClaimed */
+    uint64_t claimed_base;
+    GHashTable *of_instances; /* ihandle -> SpaprOfInstance */
+    uint32_t of_instance_last;
     uint32_t initrd_base;
     long initrd_size;
     uint64_t rtc_offset; /* Now used only during incoming migration */
@@ -510,7 +522,8 @@  struct SpaprMachineState {
 /* Client Architecture support */
 #define KVMPPC_H_CAS            (KVMPPC_HCALL_BASE + 0x2)
 #define KVMPPC_H_UPDATE_DT      (KVMPPC_HCALL_BASE + 0x3)
-#define KVMPPC_HCALL_MAX        KVMPPC_H_UPDATE_DT
+#define KVMPPC_H_CLIENT         (KVMPPC_HCALL_BASE + 0x5)
+#define KVMPPC_HCALL_MAX        KVMPPC_H_CLIENT
 
 /*
  * The hcall range 0xEF00 to 0xEF80 is reserved for use in facilitating
@@ -538,6 +551,11 @@  void spapr_register_hypercall(target_ulong opcode, spapr_hcall_fn fn);
 target_ulong spapr_hypercall(PowerPCCPU *cpu, target_ulong opcode,
                              target_ulong *args);
 
+target_ulong do_client_architecture_support(PowerPCCPU *cpu,
+                                            SpaprMachineState *spapr,
+                                            target_ulong addr,
+                                            target_ulong fdt_bufsize);
+
 /* Virtual Processor Area structure constants */
 #define VPA_MIN_SIZE           640
 #define VPA_SIZE_OFFSET        0x4
@@ -769,6 +787,11 @@  struct SpaprEventLogEntry {
 void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space);
 void spapr_events_init(SpaprMachineState *sm);
 void spapr_dt_events(SpaprMachineState *sm, void *fdt);
+uint64_t spapr_do_of_client_claim(SpaprMachineState *spapr, uint64_t virt,
+                                  uint64_t size, uint64_t align);
+
+uint32_t spapr_of_client_open(SpaprMachineState *spapr, const char *path);
+int spapr_h_client(SpaprMachineState *spapr, target_ulong client_args);
 void close_htab_fd(SpaprMachineState *spapr);
 void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
 void spapr_free_hpt(SpaprMachineState *spapr);
@@ -891,4 +914,7 @@  void spapr_check_pagesize(SpaprMachineState *spapr, hwaddr pagesize,
 #define SPAPR_OV5_XIVE_BOTH     0x80 /* Only to advertise on the platform */
 
 void spapr_set_all_lpcrs(target_ulong value, target_ulong mask);
+
+void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base);
+
 #endif /* HW_SPAPR_H */
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index e62c89b3dd40..76ce8b973082 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -896,6 +896,55 @@  out:
     return ret;
 }
 
+/*
+ * Below is a compiled version of RTAS blob and OF client interface entry point.
+ *
+ * gcc -nostdlib  -mbig -o spapr-rtas.img spapr-rtas.S
+ * objcopy  -O binary -j .text  spapr-rtas.img spapr-rtas.bin
+ *
+ *   .globl  _start
+ *   _start:
+ *           mr      4,3
+ *           lis     3,KVMPPC_H_RTAS@h
+ *           ori     3,3,KVMPPC_H_RTAS@l
+ *           sc      1
+ *           blr
+ *           mr      4,3
+ *           lis     3,KVMPPC_H_CLIENT@h
+ *           ori     3,3,KVMPPC_H_CLIENT@l
+ *           sc      1
+ *           blr
+ */
+static struct {
+    uint8_t rtas[20], client[20];
+} QEMU_PACKED rtas_client_blob = {
+    .rtas = {
+        0x7c, 0x64, 0x1b, 0x78,
+        0x3c, 0x60, 0x00, 0x00,
+        0x60, 0x63, 0xf0, 0x00,
+        0x44, 0x00, 0x00, 0x22,
+        0x4e, 0x80, 0x00, 0x20
+    },
+    .client = {
+        0x7c, 0x64, 0x1b, 0x78,
+        0x3c, 0x60, 0x00, 0x00,
+        0x60, 0x63, 0xf0, 0x05,
+        0x44, 0x00, 0x00, 0x22,
+        0x4e, 0x80, 0x00, 0x20
+    }
+};
+
+void spapr_instantiate_rtas(SpaprMachineState *spapr, uint32_t base)
+{
+    if (spapr_do_of_client_claim(spapr, base, sizeof(rtas_client_blob.rtas),
+                                 0) != -1) {
+        error_report("The OF client did not claim RTAS memory at 0x%x", base);
+    }
+    spapr->rtas_base = base;
+    cpu_physical_memory_write(base, rtas_client_blob.rtas,
+                              sizeof(rtas_client_blob.rtas));
+}
+
 static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
 {
     MachineState *ms = MACHINE(spapr);
@@ -980,6 +1029,11 @@  static void spapr_dt_rtas(SpaprMachineState *spapr, void *fdt)
     _FDT(fdt_setprop(fdt, rtas, "ibm,lrdr-capacity",
                      lrdr_capacity, sizeof(lrdr_capacity)));
 
+    if (!spapr->bios_enabled) {
+        _FDT(fdt_setprop_cell(fdt, rtas, "rtas-size",
+                              sizeof(rtas_client_blob.rtas)));
+    }
+
     spapr_dt_rtas_tokens(fdt, rtas);
 }
 
@@ -1057,7 +1111,7 @@  static void spapr_dt_chosen(SpaprMachineState *spapr, void *fdt)
     }
 
     if (spapr->kernel_size) {
-        uint64_t kprop[2] = { cpu_to_be64(KERNEL_LOAD_ADDR),
+        uint64_t kprop[2] = { cpu_to_be64(spapr->kernel_addr),
                               cpu_to_be64(spapr->kernel_size) };
 
         _FDT(fdt_setprop(fdt, chosen, "qemu,boot-kernel",
@@ -1245,7 +1299,8 @@  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
     /* Build memory reserve map */
     if (reset) {
         if (spapr->kernel_size) {
-            _FDT((fdt_add_mem_rsv(fdt, KERNEL_LOAD_ADDR, spapr->kernel_size)));
+            _FDT((fdt_add_mem_rsv(fdt, spapr->kernel_addr,
+                                  spapr->kernel_size)));
         }
         if (spapr->initrd_size) {
             _FDT((fdt_add_mem_rsv(fdt, spapr->initrd_base,
@@ -1268,12 +1323,56 @@  void *spapr_build_fdt(SpaprMachineState *spapr, bool reset, size_t space)
         }
     }
 
+    if (!spapr->bios_enabled) {
+        uint32_t phandle;
+        int i, offset, proplen = 0;
+        const void *prop;
+        bool found = false;
+        GArray *phandles = g_array_new(false, false, sizeof(uint32_t));
+
+        /* Find all predefined phandles */
+        for (offset = fdt_next_node(fdt, -1, NULL);
+             offset >= 0;
+             offset = fdt_next_node(fdt, offset, NULL)) {
+            prop = fdt_getprop_namelen(fdt, offset, "phandle", 7, &proplen);
+            if (prop && proplen == sizeof(uint32_t)) {
+                phandle = fdt32_ld(prop);
+                g_array_append_val(phandles, phandle);
+            }
+        }
+
+        /* Assign phandles skipping the predefined ones */
+        for (offset = fdt_next_node(fdt, -1, NULL), phandle = 1;
+             offset >= 0;
+             offset = fdt_next_node(fdt, offset, NULL), ++phandle) {
+            prop = fdt_getprop_namelen(fdt, offset, "phandle", 7, &proplen);
+            if (prop) {
+                continue;
+            }
+            /* Check if the current phandle is not allocated already */
+            for ( ; ; ++phandle) {
+                for (i = 0, found = false; i < phandles->len; ++i) {
+                    if (phandle == g_array_index(phandles, uint32_t, i)) {
+                        found = true;
+                        break;
+                    }
+                }
+                if (!found) {
+                    break;
+                }
+            }
+            _FDT(fdt_setprop_cell(fdt, offset, "phandle", phandle));
+        }
+        g_array_unref(phandles);
+    }
+
     return fdt;
 }
 
 static uint64_t translate_kernel_address(void *opaque, uint64_t addr)
 {
-    return (addr & 0x0fffffff) + KERNEL_LOAD_ADDR;
+    SpaprMachineState *spapr = opaque;
+    return (addr & 0x0fffffff) + spapr->kernel_addr;
 }
 
 static void emulate_spapr_hypercall(PPCVirtualHypervisor *vhyp,
@@ -1660,24 +1759,89 @@  static void spapr_machine_reset(MachineState *machine)
      */
     fdt_addr = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FDT_MAX_SIZE;
 
+    /* Set up the entry state */
+    if (!spapr->bios_enabled) {
+        if (spapr->claimed) {
+            g_array_unref(spapr->claimed);
+        }
+        if (spapr->of_instances) {
+            g_hash_table_unref(spapr->of_instances);
+        }
+
+        spapr->claimed = g_array_new(false, false, sizeof(SpaprOfClaimed));
+        spapr->of_instances = g_hash_table_new(g_direct_hash, g_direct_equal);
+
+        spapr->claimed_base = 0x10000; /* Avoid using the first system page */
+
+        spapr_cpu_set_entry_state(first_ppc_cpu, spapr->kernel_addr,
+                                  spapr->initrd_base);
+        first_ppc_cpu->env.gpr[4] = spapr->initrd_size;
+
+        if (spapr_do_of_client_claim(spapr, spapr->kernel_addr,
+                                  spapr->kernel_size, 0) == -1) {
+            error_report("Memory for kernel is in use");
+            exit(1);
+        }
+        if (spapr_do_of_client_claim(spapr, spapr->initrd_base,
+                                  spapr->initrd_size, 0) == -1) {
+            error_report("Memory for initramdisk is in use");
+            exit(1);
+        }
+        first_ppc_cpu->env.gpr[1] = spapr_do_of_client_claim(spapr, 0, 0x40000,
+                                                             0x10000);
+        if (first_ppc_cpu->env.gpr[1] == -1) {
+            error_report("Memory allocation for stack failed");
+            exit(1);
+        }
+
+        first_ppc_cpu->env.gpr[5] =
+            spapr_do_of_client_claim(spapr, 0, sizeof(rtas_client_blob.client),
+                                     sizeof(rtas_client_blob.client));
+        if (first_ppc_cpu->env.gpr[5] == -1) {
+            error_report("Memory allocation for OF client failed");
+            exit(1);
+        }
+        cpu_physical_memory_write(first_ppc_cpu->env.gpr[5],
+                                  rtas_client_blob.client,
+                                  sizeof(rtas_client_blob.client));
+    } else {
+        spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
+        first_ppc_cpu->env.gpr[5] = 0; /* 0 = kexec, !0 = prom_init */
+    }
+
     fdt = spapr_build_fdt(spapr, true, FDT_MAX_SIZE);
 
-    rc = fdt_pack(fdt);
-
-    /* Should only fail if we've built a corrupted tree */
-    assert(rc == 0);
-
-    /* Load the fdt */
-    qemu_fdt_dumpdtb(fdt, fdt_totalsize(fdt));
-    cpu_physical_memory_write(fdt_addr, fdt, fdt_totalsize(fdt));
     g_free(spapr->fdt_blob);
     spapr->fdt_size = fdt_totalsize(fdt);
     spapr->fdt_initial_size = spapr->fdt_size;
     spapr->fdt_blob = fdt;
 
-    /* Set up the entry state */
-    spapr_cpu_set_entry_state(first_ppc_cpu, SPAPR_ENTRY_POINT, fdt_addr);
-    first_ppc_cpu->env.gpr[5] = 0;
+    if (spapr->bios_enabled) {
+        /* Load the fdt */
+        rc = fdt_pack(spapr->fdt_blob);
+        /* Should only fail if we've built a corrupted tree */
+        assert(rc == 0);
+
+        spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
+        spapr->fdt_initial_size = spapr->fdt_size;
+        qemu_fdt_dumpdtb(spapr->fdt_blob, spapr->fdt_size);
+        cpu_physical_memory_write(fdt_addr, spapr->fdt_blob, spapr->fdt_size);
+    } else {
+        char *stdout_path = spapr_vio_stdout_path(spapr->vio_bus);
+        int offset = fdt_path_offset(fdt, "/chosen");
+
+        /*
+         * SLOF-less setup requires an open instance of stdout for early
+         * kernel printk. By now all phandles are settled so we can open
+         * the default serial console.
+         * We skip writing FDT as nothing expects it; OF client interface is
+         * going to be used for reading the device tree.
+         */
+        if (stdout_path) {
+            _FDT(fdt_setprop_cell(fdt, offset, "stdout",
+                                  spapr_of_client_open(spapr, stdout_path)));
+        }
+    }
 
     spapr->cas_reboot = false;
 }
@@ -2897,12 +3061,12 @@  static void spapr_machine_init(MachineState *machine)
         uint64_t lowaddr = 0;
 
         spapr->kernel_size = load_elf(kernel_filename, NULL,
-                                      translate_kernel_address, NULL,
+                                      translate_kernel_address, spapr,
                                       NULL, &lowaddr, NULL, 1,
                                       PPC_ELF_MACHINE, 0, 0);
         if (spapr->kernel_size == ELF_LOAD_WRONG_ENDIAN) {
             spapr->kernel_size = load_elf(kernel_filename, NULL,
-                                          translate_kernel_address, NULL, NULL,
+                                          translate_kernel_address, spapr, NULL,
                                           &lowaddr, NULL, 0, PPC_ELF_MACHINE,
                                           0, 0);
             spapr->kernel_le = spapr->kernel_size > 0;
@@ -2918,7 +3082,7 @@  static void spapr_machine_init(MachineState *machine)
             /* Try to locate the initrd in the gap between the kernel
              * and the firmware. Add a bit of space just in case
              */
-            spapr->initrd_base = (KERNEL_LOAD_ADDR + spapr->kernel_size
+            spapr->initrd_base = (spapr->kernel_addr + spapr->kernel_size
                                   + 0x1ffff) & ~0xffff;
             spapr->initrd_size = load_image_targphys(initrd_filename,
                                                      spapr->initrd_base,
@@ -2932,20 +3096,22 @@  static void spapr_machine_init(MachineState *machine)
         }
     }
 
-    if (bios_name == NULL) {
-        bios_name = FW_FILE_NAME;
+    if (spapr->bios_enabled) {
+        if (bios_name == NULL) {
+            bios_name = FW_FILE_NAME;
+        }
+        filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
+        if (!filename) {
+            error_report("Could not find LPAR firmware '%s'", bios_name);
+            exit(1);
+        }
+        fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
+        if (fw_size <= 0) {
+            error_report("Could not load LPAR firmware '%s'", filename);
+            exit(1);
+        }
+        g_free(filename);
     }
-    filename = qemu_find_file(QEMU_FILE_TYPE_BIOS, bios_name);
-    if (!filename) {
-        error_report("Could not find LPAR firmware '%s'", bios_name);
-        exit(1);
-    }
-    fw_size = load_image_targphys(filename, 0, FW_MAX_SIZE);
-    if (fw_size <= 0) {
-        error_report("Could not load LPAR firmware '%s'", filename);
-        exit(1);
-    }
-    g_free(filename);
 
     /* FIXME: Should register things through the MachineState's qdev
      * interface, this is a legacy from the sPAPREnvironment structure
@@ -3162,6 +3328,32 @@  static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
     visit_type_uint32(v, name, (uint32_t *)opaque, errp);
 }
 
+static void spapr_get_kernel_addr(Object *obj, Visitor *v, const char *name,
+                                  void *opaque, Error **errp)
+{
+    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
+}
+
+static void spapr_set_kernel_addr(Object *obj, Visitor *v, const char *name,
+                                  void *opaque, Error **errp)
+{
+    visit_type_uint64(v, name, (uint64_t *)opaque, errp);
+}
+
+static bool spapr_get_bios_enabled(Object *obj, Error **errp)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
+
+    return spapr->bios_enabled;
+}
+
+static void spapr_set_bios_enabled(Object *obj, bool value, Error **errp)
+{
+    SpaprMachineState *spapr = SPAPR_MACHINE(obj);
+
+    spapr->bios_enabled = value;
+}
+
 static char *spapr_get_ic_mode(Object *obj, Error **errp)
 {
     SpaprMachineState *spapr = SPAPR_MACHINE(obj);
@@ -3267,6 +3459,20 @@  static void spapr_instance_init(Object *obj)
     object_property_add_bool(obj, "vfio-no-msix-emulation",
                              spapr_get_msix_emulation, NULL, NULL);
 
+    object_property_add(obj, "kernel-addr", "uint64", spapr_get_kernel_addr,
+                        spapr_set_kernel_addr, NULL, &spapr->kernel_addr,
+                        &error_abort);
+    object_property_set_description(obj, "kernel-addr",
+                                    stringify(KERNEL_LOAD_ADDR)
+                                    " for -kernel is the default",
+                                    NULL);
+    spapr->kernel_addr = KERNEL_LOAD_ADDR;
+    object_property_add_bool(obj, "bios", spapr_get_bios_enabled,
+                            spapr_set_bios_enabled, NULL);
+    object_property_set_description(obj, "bios",
+                                    "Controls whether to load bios", NULL);
+    spapr->bios_enabled = true;
+
     /* The machine class defines the default interrupt controller mode */
     spapr->irq = smc->irq;
     object_property_add_str(obj, "ic-mode", spapr_get_ic_mode,
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index f1799b1b707d..f2d8823d2c3a 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1660,15 +1660,11 @@  static bool spapr_hotplugged_dev_before_cas(void)
     return false;
 }
 
-static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
-                                                  SpaprMachineState *spapr,
-                                                  target_ulong opcode,
-                                                  target_ulong *args)
+target_ulong do_client_architecture_support(PowerPCCPU *cpu,
+                                            SpaprMachineState *spapr,
+                                            target_ulong addr,
+                                            target_ulong fdt_bufsize)
 {
-    /* Working address in data buffer */
-    target_ulong addr = ppc64_phys_to_real(args[0]);
-    target_ulong fdt_buf = args[1];
-    target_ulong fdt_bufsize = args[2];
     target_ulong ov_table;
     uint32_t cas_pvr;
     SpaprOptionVector *ov1_guest, *ov5_guest, *ov5_cas_old;
@@ -1816,7 +1812,6 @@  static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
 
     if (!spapr->cas_reboot) {
         void *fdt;
-        SpaprDeviceTreeUpdateHeader hdr = { .version_id = 1 };
 
         /* If spapr_machine_reset() did not set up a HPT but one is necessary
          * (because the guest isn't going to use radix) then set it up here. */
@@ -1825,21 +1820,7 @@  static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
             spapr_setup_hpt_and_vrma(spapr);
         }
 
-        if (fdt_bufsize < sizeof(hdr)) {
-            error_report("SLOF provided insufficient CAS buffer "
-                         TARGET_FMT_lu " (min: %zu)", fdt_bufsize, sizeof(hdr));
-            exit(EXIT_FAILURE);
-        }
-
-        fdt_bufsize -= sizeof(hdr);
-
-        fdt = spapr_build_fdt(spapr, false, fdt_bufsize);
-        _FDT((fdt_pack(fdt)));
-
-        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
-        cpu_physical_memory_write(fdt_buf + sizeof(hdr), fdt,
-                                  fdt_totalsize(fdt));
-        trace_spapr_cas_continue(fdt_totalsize(fdt) + sizeof(hdr));
+        fdt = spapr_build_fdt(spapr, !spapr->bios_enabled, fdt_bufsize);
 
         g_free(spapr->fdt_blob);
         spapr->fdt_size = fdt_totalsize(fdt);
@@ -1854,6 +1835,41 @@  static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
     return H_SUCCESS;
 }
 
+static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
+                                                  SpaprMachineState *spapr,
+                                                  target_ulong opcode,
+                                                  target_ulong *args)
+{
+    /* Working address in data buffer */
+    target_ulong addr = ppc64_phys_to_real(args[0]);
+    target_ulong fdt_buf = args[1];
+    target_ulong fdt_bufsize = args[2];
+    target_ulong ret;
+    SpaprDeviceTreeUpdateHeader hdr = { .version_id = 1 };
+
+    if (fdt_bufsize < sizeof(hdr)) {
+        error_report("SLOF provided insufficient CAS buffer "
+                     TARGET_FMT_lu " (min: %zu)", fdt_bufsize, sizeof(hdr));
+        exit(EXIT_FAILURE);
+    }
+
+    fdt_bufsize -= sizeof(hdr);
+
+    ret = do_client_architecture_support(cpu, spapr, addr, fdt_bufsize);
+    if (ret == H_SUCCESS) {
+        _FDT((fdt_pack(spapr->fdt_blob)));
+        spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
+        spapr->fdt_initial_size = spapr->fdt_size;
+
+        cpu_physical_memory_write(fdt_buf, &hdr, sizeof(hdr));
+        cpu_physical_memory_write(fdt_buf + sizeof(hdr), spapr->fdt_blob,
+                                  spapr->fdt_size);
+        trace_spapr_cas_continue(spapr->fdt_size + sizeof(hdr));
+    }
+
+    return ret;
+}
+
 static target_ulong h_home_node_associativity(PowerPCCPU *cpu,
                                               SpaprMachineState *spapr,
                                               target_ulong opcode,
@@ -1998,6 +2014,14 @@  static target_ulong h_update_dt(PowerPCCPU *cpu, SpaprMachineState *spapr,
     return H_SUCCESS;
 }
 
+static target_ulong h_client(PowerPCCPU *cpu, SpaprMachineState *spapr,
+                             target_ulong opcode, target_ulong *args)
+{
+    target_ulong client_args = ppc64_phys_to_real(args[0]);
+
+    return spapr_h_client(spapr, client_args);
+}
+
 static spapr_hcall_fn papr_hypercall_table[(MAX_HCALL_OPCODE / 4) + 1];
 static spapr_hcall_fn kvmppc_hypercall_table[KVMPPC_HCALL_MAX - KVMPPC_HCALL_BASE + 1];
 static spapr_hcall_fn svm_hypercall_table[(SVM_HCALL_MAX - SVM_HCALL_BASE) / 4 + 1];
@@ -2121,6 +2145,8 @@  static void hypercall_register_types(void)
 
     spapr_register_hypercall(KVMPPC_H_UPDATE_DT, h_update_dt);
 
+    spapr_register_hypercall(KVMPPC_H_CLIENT, h_client);
+
     /* Virtual Processor Home Node */
     spapr_register_hypercall(H_HOME_NODE_ASSOCIATIVITY,
                              h_home_node_associativity);
diff --git a/hw/ppc/spapr_of_client.c b/hw/ppc/spapr_of_client.c
new file mode 100644
index 000000000000..24d854b76e51
--- /dev/null
+++ b/hw/ppc/spapr_of_client.c
@@ -0,0 +1,633 @@ 
+#include "qemu/osdep.h"
+#include "qemu-common.h"
+#include "qapi/error.h"
+#include "exec/memory.h"
+#include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_vio.h"
+#include "chardev/char.h"
+#include "qom/qom-qobject.h"
+#include "trace.h"
+
+typedef struct {
+    DeviceState *dev;
+    Chardev *cdev;
+    uint32_t phandle;
+} SpaprOfInstance;
+
+/*
+ * OF 1275 "nextprop" description suggests it is 32 bytes max but
+ * LoPAPR defines "ibm,query-interrupt-source-number" which is 33 chars long.
+ */
+#define OF_PROPNAME_LEN_MAX 64
+
+/* Defined as Big Endian */
+struct prom_args {
+    uint32_t service;
+    uint32_t nargs;
+    uint32_t nret;
+    uint32_t args[10];
+};
+
+static void readstr(hwaddr pa, char *buf, int size)
+{
+    cpu_physical_memory_read(pa, buf, size - 1);
+    buf[size - 1] = 0;
+}
+
+static bool _cmpservice(const char *s, size_t len,
+                        unsigned nargs, unsigned nret,
+                        const char *s1, size_t len1,
+                        unsigned nargscheck, unsigned nretcheck)
+{
+    if (strcmp(s, s1)) {
+        return false;
+    }
+    if (nargscheck == 0 && nretcheck == 0) {
+        return true;
+    }
+    if (nargs != nargscheck || nret != nretcheck) {
+        trace_spapr_of_client_error_param(s, nargscheck, nretcheck, nargs,
+                                          nret);
+        return false;
+    }
+
+    return true;
+}
+
+static uint32_t of_client_finddevice(const void *fdt, uint32_t nodeaddr)
+{
+    char node[256];
+    int ret;
+
+    readstr(nodeaddr, node, sizeof(node));
+    ret = fdt_path_offset(fdt, node);
+    if (ret >= 0) {
+        ret = fdt_get_phandle(fdt, ret);
+    }
+
+    return (uint32_t) ret;
+}
+
+static uint32_t of_client_getprop(const void *fdt, uint32_t nodeph,
+                                  uint32_t pname, uint32_t valaddr,
+                                  uint32_t vallen)
+{
+    char propname[OF_PROPNAME_LEN_MAX + 1];
+    uint32_t ret = 0;
+    int proplen = 0;
+    const void *prop;
+
+    readstr(pname, propname, sizeof(propname));
+    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
+                               propname, strlen(propname), &proplen);
+    if (prop) {
+        int cb = MIN(proplen, vallen);
+
+        cpu_physical_memory_write(valaddr, prop, cb);
+        ret = cb;
+    } else {
+        ret = -1;
+    }
+    trace_spapr_of_client_getprop(nodeph, propname, ret);
+
+    return ret;
+}
+
+static uint32_t of_client_getproplen(const void *fdt, uint32_t nodeph,
+                                     uint32_t pname)
+{
+    char propname[OF_PROPNAME_LEN_MAX + 1];
+    uint32_t ret = 0;
+    int proplen = 0;
+    const void *prop;
+
+    readstr(pname, propname, sizeof(propname));
+
+    prop = fdt_getprop_namelen(fdt, fdt_node_offset_by_phandle(fdt, nodeph),
+                               propname, strlen(propname), &proplen);
+    if (prop) {
+        ret = proplen;
+    } else {
+        ret = -1;
+    }
+
+    return ret;
+}
+
+static uint32_t of_client_setprop(SpaprMachineState *spapr,
+                                  uint32_t nodeph, uint32_t pname,
+                                  uint32_t valaddr, uint32_t vallen)
+{
+    char propname[OF_PROPNAME_LEN_MAX + 1];
+    uint32_t ret = -1;
+    int offset;
+
+    readstr(pname, propname, sizeof(propname));
+    if (vallen == sizeof(uint32_t)) {
+        uint32_t val32 = ldl_be_phys(first_cpu->as, valaddr);
+
+        if ((strcmp(propname, "linux,rtas-base") == 0) ||
+            (strcmp(propname, "linux,rtas-entry") == 0)) {
+            spapr->rtas_base = val32;
+        } else if (strcmp(propname, "linux,initrd-start") == 0) {
+            spapr->initrd_base = val32;
+        } else if (strcmp(propname, "linux,initrd-end") == 0) {
+            spapr->initrd_size = val32 - spapr->initrd_base;
+        } else {
+            goto trace_exit;
+        }
+    } else if (vallen == sizeof(uint64_t)) {
+        uint64_t val64 = ldq_be_phys(first_cpu->as, valaddr);
+
+        if (strcmp(propname, "linux,initrd-start") == 0) {
+            spapr->initrd_base = val64;
+        } else if (strcmp(propname, "linux,initrd-end") == 0) {
+            spapr->initrd_size = val64 - spapr->initrd_base;
+        } else {
+            goto trace_exit;
+        }
+    } else {
+        goto trace_exit;
+    }
+
+    offset = fdt_node_offset_by_phandle(spapr->fdt_blob, nodeph);
+    if (offset >= 0) {
+        uint8_t data[vallen];
+
+        cpu_physical_memory_read(valaddr, data, vallen);
+        if (!fdt_setprop(spapr->fdt_blob, offset, propname, data, vallen)) {
+            ret = vallen;
+        }
+    }
+
+trace_exit:
+    trace_spapr_of_client_setprop(nodeph, propname, ret);
+
+    return ret;
+}
+
+static uint32_t of_client_nextprop(const void *fdt, uint32_t phandle,
+                                   uint32_t prevaddr, uint32_t nameaddr)
+{
+    int offset = fdt_node_offset_by_phandle(fdt, phandle);
+    char prev[OF_PROPNAME_LEN_MAX + 1];
+    const char *tmp;
+
+    readstr(prevaddr, prev, sizeof(prev));
+    for (offset = fdt_first_property_offset(fdt, offset);
+         offset >= 0;
+         offset = fdt_next_property_offset(fdt, offset)) {
+
+        if (!fdt_getprop_by_offset(fdt, offset, &tmp, NULL)) {
+            return 0;
+        }
+        if (prev[0] == '\0' || strcmp(prev, tmp) == 0) {
+            if (prev[0] != '\0') {
+                offset = fdt_next_property_offset(fdt, offset);
+                if (offset < 0) {
+                    return 0;
+                }
+            }
+            if (!fdt_getprop_by_offset(fdt, offset, &tmp, NULL)) {
+                return 0;
+            }
+            cpu_physical_memory_write(nameaddr, tmp, strlen(tmp) + 1);
+            return 1;
+        }
+    }
+
+    return 0;
+}
+
+static uint32_t of_client_peer(const void *fdt, uint32_t phandle)
+{
+    int ret;
+
+    if (phandle == 0) {
+        ret = fdt_path_offset(fdt, "/");
+    } else {
+        ret = fdt_next_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
+    }
+
+    if (ret < 0) {
+        ret = 0;
+    } else {
+        ret = fdt_get_phandle(fdt, ret);
+    }
+
+    return ret;
+}
+
+static uint32_t of_client_child(const void *fdt, uint32_t phandle)
+{
+    int ret = fdt_first_subnode(fdt, fdt_node_offset_by_phandle(fdt, phandle));
+
+    if (ret < 0) {
+        ret = 0;
+    } else {
+        ret = fdt_get_phandle(fdt, ret);
+    }
+
+    return ret;
+}
+
+static uint32_t of_client_parent(const void *fdt, uint32_t phandle)
+{
+    int ret = fdt_parent_offset(fdt, fdt_node_offset_by_phandle(fdt, phandle));
+
+    if (ret < 0) {
+        ret = 0;
+    } else {
+        ret = fdt_get_phandle(fdt, ret);
+    }
+
+    return ret;
+}
+
+static DeviceState *of_client_find_qom_dev(BusState *bus, const char *path)
+{
+    BusChild *kid;
+
+    QTAILQ_FOREACH(kid, &bus->children, sibling) {
+        const char *p = qdev_get_fw_dev_path(kid->child);
+        BusState *child;
+
+        if (p && strcmp(path, p) == 0) {
+            return kid->child;
+        }
+        QLIST_FOREACH(child, &kid->child->child_bus, sibling) {
+            DeviceState *d = of_client_find_qom_dev(child, path);
+
+            if (d) {
+                return d;
+            }
+        }
+    }
+    return NULL;
+}
+
+uint32_t spapr_of_client_open(SpaprMachineState *spapr, const char *path)
+{
+    int offset;
+    uint32_t ret = 0;
+    SpaprOfInstance *inst = NULL;
+
+    if (spapr->of_instance_last == 0xFFFFFFFF) {
+        /* We do not recycle ihandles yet */
+        goto trace_exit;
+    }
+    offset = fdt_path_offset(spapr->fdt_blob, path);
+    if (offset < 0) {
+        trace_spapr_of_client_error_unknown_path(path);
+        goto trace_exit;
+    }
+
+    inst = g_new0(SpaprOfInstance, 1);
+    inst->phandle = fdt_get_phandle(spapr->fdt_blob, offset);
+    g_assert(inst->phandle);
+    ++spapr->of_instance_last;
+    inst->dev = of_client_find_qom_dev(sysbus_get_default(), path);
+    g_hash_table_insert(spapr->of_instances,
+                        GINT_TO_POINTER(spapr->of_instance_last),
+                        inst);
+    ret = spapr->of_instance_last;
+
+    if (inst->dev) {
+        const char *cdevstr = object_property_get_str(OBJECT(inst->dev),
+                                                      "chardev", NULL);
+
+        if (cdevstr) {
+            inst->cdev = qemu_chr_find(cdevstr);
+        }
+    }
+
+trace_exit:
+    trace_spapr_of_client_open(path, inst ? inst->phandle : 0, ret);
+
+    return ret;
+}
+
+static uint32_t of_client_open(SpaprMachineState *spapr, uint32_t pathaddr)
+{
+    char path[256];
+
+    readstr(pathaddr, path, sizeof(path));
+
+    return spapr_of_client_open(spapr, path);
+}
+
+static void of_client_close(SpaprMachineState *spapr, uint32_t ihandle)
+{
+    if (!g_hash_table_remove(spapr->of_instances, GINT_TO_POINTER(ihandle))) {
+        trace_spapr_of_client_error_unknown_ihandle_close(ihandle);
+    }
+}
+
+static uint32_t of_client_instance_to_package(SpaprMachineState *spapr,
+                                              uint32_t ihandle)
+{
+    gpointer instp = g_hash_table_lookup(spapr->of_instances,
+                                        GINT_TO_POINTER(ihandle));
+
+    if (!instp) {
+        return -1;
+    }
+
+    return ((SpaprOfInstance *)instp)->phandle;
+}
+
+static uint32_t of_client_package_to_path(const void *fdt, uint32_t phandle,
+                                          uint32_t buf, uint32_t len)
+{
+    char tmp[256];
+
+    if (0 == fdt_get_path(fdt, fdt_node_offset_by_phandle(fdt, phandle), tmp,
+                          sizeof(tmp))) {
+        tmp[sizeof(tmp) - 1] = 0;
+        cpu_physical_memory_write(buf, tmp, MIN(len, strlen(tmp)));
+    }
+    return len;
+}
+
+static uint32_t of_client_instance_to_path(SpaprMachineState *spapr,
+                                           uint32_t ihandle, uint32_t buf,
+                                           uint32_t len)
+{
+    uint32_t phandle = of_client_instance_to_package(spapr, ihandle);
+
+    if (phandle != -1) {
+        return of_client_package_to_path(spapr->fdt_blob, phandle, buf, len);
+    }
+
+    return 0;
+}
+
+static uint32_t of_client_write(SpaprMachineState *spapr, uint32_t ihandle,
+                                uint32_t buf, uint32_t len)
+{
+    char tmp[256];
+    int toread, toprint, cb = MIN(len, 1024);
+    SpaprOfInstance *inst = (SpaprOfInstance *)
+        g_hash_table_lookup(spapr->of_instances, GINT_TO_POINTER(ihandle));
+
+    while (cb > 0) {
+        toread = MIN(cb + 1, sizeof(tmp));
+        readstr(buf, tmp, toread);
+        toprint = strlen(tmp);
+        if (inst && inst->cdev) {
+            toprint = qemu_chr_write(inst->cdev, (uint8_t *) tmp, toprint,
+                                     true);
+        } else {
+            /* We normally open stdout so this is a fallback */
+            printf("DBG[%d]%s", ihandle, tmp);
+        }
+        buf += toprint;
+        cb -= toprint;
+    }
+
+    return len;
+}
+
+static bool of_client_claim_avail(GArray *claimed, uint64_t virt, uint64_t size)
+{
+    int i;
+    SpaprOfClaimed *c;
+
+    for (i = 0; i < claimed->len; ++i) {
+        c = &g_array_index(claimed, SpaprOfClaimed, i);
+        if ((c->start <= virt && virt < c->start + c->size) ||
+            (virt <= c->start && c->start < virt + size)) {
+            return false;
+        }
+    }
+
+    return true;
+}
+
+static void of_client_claim_add(GArray *claimed, uint64_t virt, uint64_t size)
+{
+    SpaprOfClaimed newclaim;
+
+    newclaim.start = virt;
+    newclaim.size = size;
+    g_array_append_val(claimed, newclaim);
+}
+
+/*
+ * "claim" claims memory at @virt if @align==0; otherwise it allocates
+ * memory at the requested alignment.
+ */
+uint64_t spapr_do_of_client_claim(SpaprMachineState *spapr, uint64_t virt,
+                                  uint64_t size, uint64_t align)
+{
+    uint64_t ret;
+
+    if (align == 0) {
+        if (!of_client_claim_avail(spapr->claimed, virt, size)) {
+            return -1;
+        }
+        ret = virt;
+    } else {
+        align = pow2ceil(align);
+        spapr->claimed_base = (spapr->claimed_base + align - 1) & ~(align - 1);
+        while (1) {
+            if (spapr->claimed_base >= spapr->rma_size) {
+                perror("Out of memory");
+                return -1;
+            }
+            if (of_client_claim_avail(spapr->claimed, spapr->claimed_base,
+                                      size)) {
+                break;
+            }
+            spapr->claimed_base += size;
+        }
+        ret = spapr->claimed_base;
+    }
+
+    spapr->claimed_base = MAX(spapr->claimed_base, ret + size);
+    of_client_claim_add(spapr->claimed, ret, size);
+    trace_spapr_of_client_claim(virt, size, align, ret);
+
+    return ret;
+}
+
+static uint32_t of_client_claim(SpaprMachineState *spapr, uint32_t virt,
+                                uint32_t size, uint32_t align)
+{
+    if (align) {
+        return -1;
+    }
+    if (!of_client_claim_avail(spapr->claimed, virt, size)) {
+        return -1;
+    }
+
+    spapr->claimed_base = MAX(spapr->claimed_base, virt + size);
+    of_client_claim_add(spapr->claimed, virt, size);
+    trace_spapr_of_client_claim(virt, size, align, virt);
+
+    return virt;
+}
+
+static uint32_t of_client_call_method(SpaprMachineState *spapr,
+                                      uint32_t methodaddr, uint32_t ihandle,
+                                      uint32_t param, uint32_t *ret2)
+{
+    uint32_t ret = -1;
+    char path[256] = "", method[256] = "";
+    uint32_t phandle = of_client_instance_to_package(spapr, ihandle);
+    int offset;
+
+    if (!ihandle) {
+        goto trace_exit;
+    }
+
+    readstr(methodaddr, method, sizeof(method));
+    phandle = of_client_instance_to_package(spapr, ihandle);
+    if (!phandle) {
+        goto trace_exit;
+    }
+
+    offset = fdt_node_offset_by_phandle(spapr->fdt_blob, phandle);
+    if (offset < 0) {
+        goto trace_exit;
+    }
+
+    if (fdt_get_path(spapr->fdt_blob, offset, path, sizeof(path))) {
+        goto trace_exit;
+    }
+
+    if (strcmp(path, "/") == 0) {
+        if (strcmp(method, "ibm,client-architecture-support") == 0) {
+
+#define FDT_MAX_SIZE            0x100000
+            ret = do_client_architecture_support(POWERPC_CPU(first_cpu), spapr,
+                                                 param, FDT_MAX_SIZE);
+            *ret2 = 0;
+        }
+    } else if (strcmp(path, "/rtas") == 0) {
+        if (strcmp(method, "instantiate-rtas") == 0) {
+            spapr_instantiate_rtas(spapr, param);
+            ret = 0;
+            *ret2 = param; /* rtasbase */
+        }
+    } else {
+        trace_spapr_of_client_error_unknown_method(method);
+    }
+
+trace_exit:
+    trace_spapr_of_client_method(ihandle, method, param, phandle, path, ret);
+
+    return ret;
+}
+
+static void of_client_quiesce(SpaprMachineState *spapr)
+{
+    int rc = fdt_pack(spapr->fdt_blob);
+    /* Should only fail if we've built a corrupted tree */
+    assert(rc == 0);
+
+    spapr->fdt_size = fdt_totalsize(spapr->fdt_blob);
+    spapr->fdt_initial_size = spapr->fdt_size;
+}
+
+int spapr_h_client(SpaprMachineState *spapr, target_ulong of_client_args)
+{
+    struct prom_args args = { 0 };
+    char service[64];
+    unsigned nargs, nret;
+    int i, servicelen;
+
+    cpu_physical_memory_read(of_client_args, &args, sizeof(args));
+    nargs = be32_to_cpu(args.nargs);
+    nret = be32_to_cpu(args.nret);
+    readstr(be32_to_cpu(args.service), service, sizeof(service));
+    servicelen = strlen(service);
+
+#define cmpservice(s, a, r) \
+    _cmpservice(service, servicelen, nargs, nret, (s), sizeof(s), (a), (r))
+
+    if (cmpservice("finddevice", 1, 1)) {
+        args.args[nargs] = of_client_finddevice(spapr->fdt_blob,
+                                                be32_to_cpu(args.args[0]));
+    } else if (cmpservice("getprop", 4, 1)) {
+        args.args[nargs] = of_client_getprop(spapr->fdt_blob,
+                                             be32_to_cpu(args.args[0]),
+                                             be32_to_cpu(args.args[1]),
+                                             be32_to_cpu(args.args[2]),
+                                             be32_to_cpu(args.args[3]));
+    } else if (cmpservice("getproplen", 2, 1)) {
+        args.args[nargs] = of_client_getproplen(spapr->fdt_blob,
+                                                be32_to_cpu(args.args[0]),
+                                                be32_to_cpu(args.args[1]));
+    } else if (cmpservice("setprop", 4, 1)) {
+        args.args[nargs] = of_client_setprop(spapr,
+                                             be32_to_cpu(args.args[0]),
+                                             be32_to_cpu(args.args[1]),
+                                             be32_to_cpu(args.args[2]),
+                                             be32_to_cpu(args.args[3]));
+    } else if (cmpservice("nextprop", 3, 1)) {
+        args.args[nargs] = of_client_nextprop(spapr->fdt_blob,
+                                              be32_to_cpu(args.args[0]),
+                                              be32_to_cpu(args.args[1]),
+                                              be32_to_cpu(args.args[2]));
+    } else if (cmpservice("peer", 1, 1)) {
+        args.args[nargs] = of_client_peer(spapr->fdt_blob,
+                                          be32_to_cpu(args.args[0]));
+    } else if (cmpservice("child", 1, 1)) {
+        args.args[nargs] = of_client_child(spapr->fdt_blob,
+                                           be32_to_cpu(args.args[0]));
+    } else if (cmpservice("parent", 1, 1)) {
+        args.args[nargs] = of_client_parent(spapr->fdt_blob,
+                                            be32_to_cpu(args.args[0]));
+    } else if (cmpservice("open", 1, 1)) {
+        args.args[nargs] = of_client_open(spapr, be32_to_cpu(args.args[0]));
+    } else if (cmpservice("close", 1, 0)) {
+        of_client_close(spapr, be32_to_cpu(args.args[0]));
+    } else if (cmpservice("instance-to-package", 1, 1)) {
+        args.args[nargs] =
+            of_client_instance_to_package(spapr,
+                                          be32_to_cpu(args.args[0]));
+    } else if (cmpservice("package-to-path", 3, 1)) {
+        args.args[nargs] = of_client_package_to_path(spapr->fdt_blob,
+                                                     be32_to_cpu(args.args[0]),
+                                                     be32_to_cpu(args.args[1]),
+                                                     be32_to_cpu(args.args[2]));
+    } else if (cmpservice("instance-to-path", 3, 1)) {
+        args.args[nargs] =
+            of_client_instance_to_path(spapr,
+                                       be32_to_cpu(args.args[0]),
+                                       be32_to_cpu(args.args[1]),
+                                       be32_to_cpu(args.args[2]));
+    } else if (cmpservice("write", 3, 1)) {
+        args.args[nargs] = of_client_write(spapr,
+                                           be32_to_cpu(args.args[0]),
+                                           be32_to_cpu(args.args[1]),
+                                           be32_to_cpu(args.args[2]));
+    } else if (cmpservice("claim", 3, 1)) {
+        args.args[nargs] = of_client_claim(spapr,
+                                           be32_to_cpu(args.args[0]),
+                                           be32_to_cpu(args.args[1]),
+                                           be32_to_cpu(args.args[2]));
+    } else if (cmpservice("call-method", 3, 2)) {
+        args.args[nargs] = of_client_call_method(spapr,
+                                                 be32_to_cpu(args.args[0]),
+                                                 be32_to_cpu(args.args[1]),
+                                                 be32_to_cpu(args.args[2]),
+                                                 &args.args[nargs + 1]);
+    } else if (cmpservice("quiesce", 0, 0)) {
+        of_client_quiesce(spapr);
+    } else if (cmpservice("exit", 0, 0)) {
+        error_report("Stopped as the VM requested \"exit\"");
+        vm_stop(RUN_STATE_PAUSED);
+    } else {
+        trace_spapr_of_client_error_unknown_service(service, nargs, nret);
+        args.args[nargs] = -1;
+    }
+
+    for (i = 0; i < nret; ++i) {
+        args.args[nargs + i] = cpu_to_be32(args.args[nargs + i]);
+    }
+    cpu_physical_memory_write(of_client_args, &args, sizeof(args));
+
+    return H_SUCCESS;
+}
diff --git a/hw/ppc/trace-events b/hw/ppc/trace-events
index 9ea620f23c85..e2d1e58d07c3 100644
--- a/hw/ppc/trace-events
+++ b/hw/ppc/trace-events
@@ -21,6 +21,18 @@  spapr_update_dt(unsigned cb) "New blob %u bytes"
 spapr_update_dt_failed_size(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
 spapr_update_dt_failed_check(unsigned cbold, unsigned cbnew, unsigned magic) "Old blob %u bytes, new blob %u bytes, magic 0x%x"
 
+# spapr_client.c
+spapr_of_client_error_param(const char *method, int nargscheck, int nretcheck, int nargs, int nret) "%s takes/returns %d/%d, not %d/%d"
+spapr_of_client_error_unknown_service(const char *service, int nargs, int nret) "%s args=%d rets=%d"
+spapr_of_client_error_unknown_method(const char *method) "%s"
+spapr_of_client_error_unknown_ihandle_close(uint32_t ihandle) "0x%x"
+spapr_of_client_error_unknown_path(const char *path) "%s"
+spapr_of_client_claim(uint32_t virt, uint32_t size, uint32_t align, uint32_t ret) "virt=0x%x size=0x%x align=0x%x => 0x%x"
+spapr_of_client_method(uint32_t ihandle, const char *method, uint32_t param, uint32_t phandle, const char *path, uint32_t ret) "0x%x \"%s\" param=0x%x ph=0x%x \"%s\" => 0x%x"
+spapr_of_client_getprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
+spapr_of_client_setprop(uint32_t ph, const char *prop, uint32_t ret) "phandle=0x%x \"%s\" => 0x%x"
+spapr_of_client_open(const char *path, uint32_t phandle, uint32_t ihandle) "%s 0x%x => 0x%x"
+
 # spapr_hcall_tpm.c
 spapr_h_tpm_comm(const char *device_path, uint64_t operation) "tpm_device_path=%s operation=0x%"PRIu64
 spapr_tpm_execute(uint64_t data_in, uint64_t data_in_sz, uint64_t data_out, uint64_t data_out_sz) "data_in=0x%"PRIx64", data_in_sz=%"PRIu64", data_out=0x%"PRIx64", data_out_sz=%"PRIu64