[RFC,v2,00/26] KVM/arm64: A stage 2 for the host

Message ID	20210108121524.656872-1-qperret@google.com
Headers	show Return-Path: <devicetree-owner@vger.kernel.org> Sender: "qperret via sendgmr" <qperret@r2d2-qp.c.googlers.com> Date: Fri, 8 Jan 2021 12:14:58 +0000 Message-Id: <20210108121524.656872-1-qperret@google.com> Mime-Version: 1.0 Subject: [RFC PATCH v2 00/26] KVM/arm64: A stage 2 for the host From: Quentin Perret <qperret@google.com> To: Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>, Julien Thierry <julien.thierry.kdev@gmail.com>, Suzuki K Poulose <suzuki.poulose@arm.com>, Rob Herring <robh+dt@kernel.org>, Frank Rowand <frowand.list@gmail.com> Cc: devicetree@vger.kernel.org, android-kvm@google.com, linux-kernel@vger.kernel.org, kernel-team@android.com, kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org, Fuad Tabba <tabba@google.com>, Mark Rutland <mark.rutland@arm.com>, David Brazdil <dbrazdil@google.com> Content-Type: text/plain; charset="UTF-8" Precedence: bulk
Series	KVM/arm64: A stage 2 for the host \| expand [RFC,v2,00/26] KVM/arm64: A stage 2 for the host [RFC,v2,15/26] of/fdt: Introduce early_init_dt_add_memory_hyp()

Quentin Perret Jan. 8, 2021, 12:14 p.m. UTC

Hi all,

This is the v2 of the series previously posted here:

  https://lore.kernel.org/kvmarm/20201117181607.1761516-1-qperret@google.com/

This basically allows us to wrap the host with a stage 2 when running in
nVHE, hence paving the way for protecting guest memory from the host in
the future (among other use-cases). For more details about the
motivation and the design angle taken here, I would recommend to have a
look at the cover letter of v1, and/or to watch these presentations at
LPC [1] and KVM forum 2020 [2].

In short, the changes since v1 include:

 - Renamed most pkvm-specific pgtable functions as pkvm_* to avoid
   confusion with the host's (Fuad)

 - Added an IC flush when switching pgtables (Fuad, Mark)

 - Cleaned-up the PI aliasing in image-vars.h (David)

 - Added a TLB flush when enabling the host stage 2 to avoid stale TLBs
   from bootloader

 - Fixed the early memory reservation by using NR_CPUS instead of
   num_possible_cpus() (which is always 1 that early)

 - Added missing preempt_{dis,en}able() guards in
   kvm_hyp_enable_protection()

 - Rebased on latest kvmarm/next

And if you'd like a branch that has all the goodies, there it is:

    https://android-kvm.googlesource.com/linux qperret/host-stage2-v2

Thanks!
Quentin

[1] https://youtu.be/54q6RzS9BpQ?t=10859
[2] https://kvmforum2020.sched.com/event/eE24/virtualization-for-the-masses-exposing-kvm-on-android-will-deacon-google

Quentin Perret (23):
  KVM: arm64: Initialize kvm_nvhe_init_params early
  KVM: arm64: Avoid free_page() in page-table allocator
  KVM: arm64: Factor memory allocation out of pgtable.c
  KVM: arm64: Introduce a BSS section for use at Hyp
  KVM: arm64: Make kvm_call_hyp() a function call at Hyp
  KVM: arm64: Allow using kvm_nvhe_sym() in hyp code
  KVM: arm64: Introduce an early Hyp page allocator
  KVM: arm64: Stub CONFIG_DEBUG_LIST at Hyp
  KVM: arm64: Introduce a Hyp buddy page allocator
  KVM: arm64: Enable access to sanitized CPU features at EL2
  KVM: arm64: Factor out vector address calculation
  of/fdt: Introduce early_init_dt_add_memory_hyp()
  KVM: arm64: Prepare Hyp memory protection
  KVM: arm64: Elevate Hyp mappings creation at EL2
  KVM: arm64: Use kvm_arch for stage 2 pgtable
  KVM: arm64: Use kvm_arch in kvm_s2_mmu
  KVM: arm64: Set host stage 2 using kvm_nvhe_init_params
  KVM: arm64: Refactor kvm_arm_setup_stage2()
  KVM: arm64: Refactor __load_guest_stage2()
  KVM: arm64: Refactor __populate_fault_info()
  KVM: arm64: Make memcache anonymous in pgtable allocator
  KVM: arm64: Reserve memory for host stage 2
  KVM: arm64: Wrap the host with a stage 2

Will Deacon (3):
  arm64: lib: Annotate {clear,copy}_page() as position-independent
  KVM: arm64: Link position-independent string routines into .hyp.text
  arm64: kvm: Add standalone ticket spinlock implementation for use at
    hyp

 arch/arm64/include/asm/cpufeature.h           |   1 +
 arch/arm64/include/asm/hyp_image.h            |   7 +
 arch/arm64/include/asm/kvm_asm.h              |   7 +
 arch/arm64/include/asm/kvm_cpufeature.h       |  19 ++
 arch/arm64/include/asm/kvm_host.h             |  16 +-
 arch/arm64/include/asm/kvm_hyp.h              |   8 +
 arch/arm64/include/asm/kvm_mmu.h              |  69 +++++-
 arch/arm64/include/asm/kvm_pgtable.h          |  41 +++-
 arch/arm64/include/asm/sections.h             |   1 +
 arch/arm64/kernel/asm-offsets.c               |   3 +
 arch/arm64/kernel/cpufeature.c                |  12 +
 arch/arm64/kernel/image-vars.h                |  33 +++
 arch/arm64/kernel/vmlinux.lds.S               |   7 +
 arch/arm64/kvm/arm.c                          | 144 ++++++++++--
 arch/arm64/kvm/hyp/Makefile                   |   2 +-
 arch/arm64/kvm/hyp/include/hyp/switch.h       |  36 +--
 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h |  14 ++
 arch/arm64/kvm/hyp/include/nvhe/gfp.h         |  32 +++
 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  33 +++
 arch/arm64/kvm/hyp/include/nvhe/memory.h      |  55 +++++
 arch/arm64/kvm/hyp/include/nvhe/mm.h          | 107 +++++++++
 arch/arm64/kvm/hyp/include/nvhe/spinlock.h    |  92 ++++++++
 arch/arm64/kvm/hyp/nvhe/Makefile              |   9 +-
 arch/arm64/kvm/hyp/nvhe/cache.S               |  13 ++
 arch/arm64/kvm/hyp/nvhe/cpufeature.c          |   8 +
 arch/arm64/kvm/hyp/nvhe/early_alloc.c         |  60 +++++
 arch/arm64/kvm/hyp/nvhe/hyp-init.S            |  41 ++++
 arch/arm64/kvm/hyp/nvhe/hyp-main.c            |  48 ++++
 arch/arm64/kvm/hyp/nvhe/hyp.lds.S             |   1 +
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 191 ++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/mm.c                  | 174 ++++++++++++++
 arch/arm64/kvm/hyp/nvhe/page_alloc.c          | 185 +++++++++++++++
 arch/arm64/kvm/hyp/nvhe/psci-relay.c          |   4 +-
 arch/arm64/kvm/hyp/nvhe/setup.c               | 214 ++++++++++++++++++
 arch/arm64/kvm/hyp/nvhe/stub.c                |  22 ++
 arch/arm64/kvm/hyp/nvhe/switch.c              |  12 +-
 arch/arm64/kvm/hyp/nvhe/tlb.c                 |   4 +-
 arch/arm64/kvm/hyp/pgtable.c                  |  98 ++++----
 arch/arm64/kvm/hyp/reserved_mem.c             | 104 +++++++++
 arch/arm64/kvm/mmu.c                          | 114 +++++++++-
 arch/arm64/kvm/reset.c                        |  42 +---
 arch/arm64/lib/clear_page.S                   |   4 +-
 arch/arm64/lib/copy_page.S                    |   4 +-
 arch/arm64/mm/init.c                          |   3 +
 drivers/of/fdt.c                              |   5 +
 45 files changed, 1954 insertions(+), 145 deletions(-)
 create mode 100644 arch/arm64/include/asm/kvm_cpufeature.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/early_alloc.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/memory.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mm.h
 create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h
 create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
 create mode 100644 arch/arm64/kvm/hyp/nvhe/cpufeature.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/early_alloc.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/mm.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/setup.c
 create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
 create mode 100644 arch/arm64/kvm/hyp/reserved_mem.c

Marc Zyngier Jan. 13, 2021, 11:33 a.m. UTC | #1

Hi Quentin,

On 2021-01-08 12:15, Quentin Perret wrote:
> Introduce the infrastructure in KVM enabling to copy CPU feature
> registers into EL2-owned data-structures, to allow reading sanitised
> values directly at EL2 in nVHE.
> 
> Given that only a subset of these features are being read by the
> hypervisor, the ones that need to be copied are to be listed under
> <asm/kvm_cpufeature.h> together with the name of the nVHE variable that
> will hold the copy.
> 
> While at it, introduce the first user of this infrastructure by
> implementing __flush_dcache_area at EL2, which needs
> arm64_ftr_reg_ctrel0.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/cpufeature.h     |  1 +
>  arch/arm64/include/asm/kvm_cpufeature.h | 17 ++++++++++++++
>  arch/arm64/kernel/cpufeature.c          | 12 ++++++++++
>  arch/arm64/kvm/arm.c                    | 31 +++++++++++++++++++++++++
>  arch/arm64/kvm/hyp/nvhe/Makefile        |  3 ++-
>  arch/arm64/kvm/hyp/nvhe/cache.S         | 13 +++++++++++
>  arch/arm64/kvm/hyp/nvhe/cpufeature.c    |  8 +++++++
>  7 files changed, 84 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/include/asm/kvm_cpufeature.h
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/cache.S
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/cpufeature.c
> 
> diff --git a/arch/arm64/include/asm/cpufeature.h
> b/arch/arm64/include/asm/cpufeature.h
> index 16063c813dcd..742e9bcc051b 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -600,6 +600,7 @@ void __init setup_cpu_features(void);
>  void check_local_cpu_capabilities(void);
> 
>  u64 read_sanitised_ftr_reg(u32 id);
> +int copy_ftr_reg(u32 id, struct arm64_ftr_reg *dst);
> 
>  static inline bool cpu_supports_mixed_endian_el0(void)
>  {
> diff --git a/arch/arm64/include/asm/kvm_cpufeature.h
> b/arch/arm64/include/asm/kvm_cpufeature.h
> new file mode 100644
> index 000000000000..d34f85cba358
> --- /dev/null
> +++ b/arch/arm64/include/asm/kvm_cpufeature.h
> @@ -0,0 +1,17 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2020 - Google LLC
> + * Author: Quentin Perret <qperret@google.com>
> + */
> +
> +#include <asm/cpufeature.h>
> +
> +#ifndef KVM_HYP_CPU_FTR_REG
> +#if defined(__KVM_NVHE_HYPERVISOR__)
> +#define KVM_HYP_CPU_FTR_REG(id, name) extern struct arm64_ftr_reg 
> name;
> +#else
> +#define KVM_HYP_CPU_FTR_REG(id, name) DECLARE_KVM_NVHE_SYM(name);
> +#endif
> +#endif
> +
> +KVM_HYP_CPU_FTR_REG(SYS_CTR_EL0, arm64_ftr_reg_ctrel0)
> diff --git a/arch/arm64/kernel/cpufeature.c 
> b/arch/arm64/kernel/cpufeature.c
> index bc3549663957..c2019aaaadc3 100644
> --- a/arch/arm64/kernel/cpufeature.c
> +++ b/arch/arm64/kernel/cpufeature.c
> @@ -1113,6 +1113,18 @@ u64 read_sanitised_ftr_reg(u32 id)
>  }
>  EXPORT_SYMBOL_GPL(read_sanitised_ftr_reg);
> 
> +int copy_ftr_reg(u32 id, struct arm64_ftr_reg *dst)
> +{
> +	struct arm64_ftr_reg *regp = get_arm64_ftr_reg(id);
> +
> +	if (!regp)
> +		return -EINVAL;
> +
> +	memcpy(dst, regp, sizeof(*regp));
> +
> +	return 0;
> +}
> +
>  #define read_sysreg_case(r)	\
>  	case r:		return read_sysreg_s(r)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 51b53ca36dc5..9fd769349e9e 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -34,6 +34,7 @@
>  #include <asm/virt.h>
>  #include <asm/kvm_arm.h>
>  #include <asm/kvm_asm.h>
> +#include <asm/kvm_cpufeature.h>
>  #include <asm/kvm_mmu.h>
>  #include <asm/kvm_emulate.h>
>  #include <asm/sections.h>
> @@ -1697,6 +1698,29 @@ static void teardown_hyp_mode(void)
>  	}
>  }
> 
> +#undef KVM_HYP_CPU_FTR_REG
> +#define KVM_HYP_CPU_FTR_REG(id, name) \
> +	{ .sys_id = id, .dst = (struct arm64_ftr_reg *)&kvm_nvhe_sym(name) },
> +static const struct __ftr_reg_copy_entry {
> +	u32			sys_id;
> +	struct arm64_ftr_reg	*dst;

Why do we need the whole data structure? Can't we just live with 
sys_val?

> +} hyp_ftr_regs[] = {
> +	#include <asm/kvm_cpufeature.h>
> +};

Can't this be made __initdata?

> +
> +static int copy_cpu_ftr_regs(void)
> +{
> +	int i, ret;
> +
> +	for (i = 0; i < ARRAY_SIZE(hyp_ftr_regs); i++) {
> +		ret = copy_ftr_reg(hyp_ftr_regs[i].sys_id, hyp_ftr_regs[i].dst);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
>  /**
>   * Inits Hyp-mode on all online CPUs
>   */
> @@ -1705,6 +1729,13 @@ static int init_hyp_mode(void)
>  	int cpu;
>  	int err = 0;
> 
> +	/*
> +	 * Copy the required CPU feature register in their EL2 counterpart
> +	 */
> +	err = copy_cpu_ftr_regs();
> +	if (err)
> +		return err;
> +

Just to keep things together, please move any sysreg manipulation into
sys_regs.c, most probably into kvm_sys_reg_table_init().

Thanks,

         M.

Quentin Perret Jan. 13, 2021, 2:23 p.m. UTC | #2

Hey Marc,

On Wednesday 13 Jan 2021 at 11:33:13 (+0000), Marc Zyngier wrote:
> > +#undef KVM_HYP_CPU_FTR_REG
> > +#define KVM_HYP_CPU_FTR_REG(id, name) \
> > +	{ .sys_id = id, .dst = (struct arm64_ftr_reg *)&kvm_nvhe_sym(name) },
> > +static const struct __ftr_reg_copy_entry {
> > +	u32			sys_id;
> > +	struct arm64_ftr_reg	*dst;
> 
> Why do we need the whole data structure? Can't we just live with sys_val?

I don't have a use-case for anything else than sys_val, so yes I think I
should be able to simplify. I'll try that for v3.

> 
> > +} hyp_ftr_regs[] = {
> > +	#include <asm/kvm_cpufeature.h>
> > +};
> 
> Can't this be made __initdata?

Good point, that would be nice indeed. Can I use that from outside an
__init function? If not, I'll need to rework the code a bit more, but
that should be simple enough either way.

> > +
> > +static int copy_cpu_ftr_regs(void)
> > +{
> > +	int i, ret;
> > +
> > +	for (i = 0; i < ARRAY_SIZE(hyp_ftr_regs); i++) {
> > +		ret = copy_ftr_reg(hyp_ftr_regs[i].sys_id, hyp_ftr_regs[i].dst);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  /**
> >   * Inits Hyp-mode on all online CPUs
> >   */
> > @@ -1705,6 +1729,13 @@ static int init_hyp_mode(void)
> >  	int cpu;
> >  	int err = 0;
> > 
> > +	/*
> > +	 * Copy the required CPU feature register in their EL2 counterpart
> > +	 */
> > +	err = copy_cpu_ftr_regs();
> > +	if (err)
> > +		return err;
> > +
> 
> Just to keep things together, please move any sysreg manipulation into
> sys_regs.c, most probably into kvm_sys_reg_table_init().

Will do.

Thanks,
Quentin

Quentin Perret Jan. 13, 2021, 2:35 p.m. UTC | #3

On Wednesday 13 Jan 2021 at 14:23:03 (+0000), Quentin Perret wrote:
> Good point, that would be nice indeed. Can I use that from outside an
> __init function?

Just gave it a go, and the answer to this appears to be yes,
surprisingly -- I was expecting a compile-time warning similar to what
we get when non-__init code calls into __init, but that doesn't seem to
trigger here. Anyways, I'll add the annotation in v3.

Thanks,
Quentin

Marc Zyngier Jan. 13, 2021, 5:27 p.m. UTC | #4

On 2021-01-13 14:35, Quentin Perret wrote:
> On Wednesday 13 Jan 2021 at 14:23:03 (+0000), Quentin Perret wrote:
>> Good point, that would be nice indeed. Can I use that from outside an
>> __init function?
> 
> Just gave it a go, and the answer to this appears to be yes,
> surprisingly -- I was expecting a compile-time warning similar to what
> we get when non-__init code calls into __init, but that doesn't seem to
> trigger here. Anyways, I'll add the annotation in v3.

That's surprising. I'd definitely expect something to explode...
Do you have CONFIG_DEBUG_SECTION_MISMATCH=y?

         M.

Quentin Perret Jan. 13, 2021, 6:28 p.m. UTC | #5

On Wednesday 13 Jan 2021 at 17:27:49 (+0000), Marc Zyngier wrote:
> On 2021-01-13 14:35, Quentin Perret wrote:
> > On Wednesday 13 Jan 2021 at 14:23:03 (+0000), Quentin Perret wrote:
> > > Good point, that would be nice indeed. Can I use that from outside an
> > > __init function?
> > 
> > Just gave it a go, and the answer to this appears to be yes,
> > surprisingly -- I was expecting a compile-time warning similar to what
> > we get when non-__init code calls into __init, but that doesn't seem to
> > trigger here. Anyways, I'll add the annotation in v3.
> 
> That's surprising. I'd definitely expect something to explode...
> Do you have CONFIG_DEBUG_SECTION_MISMATCH=y?

Yes I do, so, that doesn't seem to be it. Now, the plot thickens: I
_do_ get a warning if I remove the 'const' qualifier. But interestingly,
in both cases hyp_ftr_regs is placed in .init.data:

  $ objdump -t vmlinux | grep hyp_ftr_regs
  ffff8000116c17b0 g     O .init.data     0000000000000030 hyp_ftr_regs

The warning is silenced only if I mark hyp_ftr_regs as const. modpost
bug? I'll double check my findings and follow up in a separate series.

Thanks,
Quentin

Will Deacon Feb. 1, 2021, 5:28 p.m. UTC | #6

On Fri, Jan 08, 2021 at 12:15:01PM +0000, Quentin Perret wrote:
> From: Will Deacon <will@kernel.org>
> 
> We will soon need to synchronise multiple CPUs in the hyp text at EL2.
> The qspinlock-based locking used by the host is overkill for this purpose
> and relies on the kernel's "percpu" implementation for the MCS nodes.
> 
> Implement a simple ticket locking scheme based heavily on the code removed
> by commit c11090474d70 ("arm64: locking: Replace ticket lock implementation
> with qspinlock").
> 
> Signed-off-by: Will Deacon <will@kernel.org>
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/kvm/hyp/include/nvhe/spinlock.h | 92 ++++++++++++++++++++++
>  1 file changed, 92 insertions(+)
>  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h
> 
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/spinlock.h b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
> new file mode 100644
> index 000000000000..7584c397bbac
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
> @@ -0,0 +1,92 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * A stand-alone ticket spinlock implementation for use by the non-VHE
> + * KVM hypervisor code running at EL2.
> + *
> + * Copyright (C) 2020 Google LLC
> + * Author: Will Deacon <will@kernel.org>
> + *
> + * Heavily based on the implementation removed by c11090474d70 which was:
> + * Copyright (C) 2012 ARM Ltd.
> + */
> +
> +#ifndef __ARM64_KVM_NVHE_SPINLOCK_H__
> +#define __ARM64_KVM_NVHE_SPINLOCK_H__
> +
> +#include <asm/alternative.h>
> +#include <asm/lse.h>
> +
> +typedef union hyp_spinlock {
> +	u32	__val;
> +	struct {
> +#ifdef __AARCH64EB__
> +		u16 next, owner;
> +#else
> +		u16 owner, next;
> +	};
> +#endif

Looks like I put this #endif in the wrong place; probably needs to be a line
higher.

Will

Quentin Perret Feb. 1, 2021, 5:40 p.m. UTC | #7

On Monday 01 Feb 2021 at 17:28:34 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:01PM +0000, Quentin Perret wrote:
> > From: Will Deacon <will@kernel.org>
> > 
> > We will soon need to synchronise multiple CPUs in the hyp text at EL2.
> > The qspinlock-based locking used by the host is overkill for this purpose
> > and relies on the kernel's "percpu" implementation for the MCS nodes.
> > 
> > Implement a simple ticket locking scheme based heavily on the code removed
> > by commit c11090474d70 ("arm64: locking: Replace ticket lock implementation
> > with qspinlock").
> > 
> > Signed-off-by: Will Deacon <will@kernel.org>
> > Signed-off-by: Quentin Perret <qperret@google.com>
> > ---
> >  arch/arm64/kvm/hyp/include/nvhe/spinlock.h | 92 ++++++++++++++++++++++
> >  1 file changed, 92 insertions(+)
> >  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/spinlock.h
> > 
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/spinlock.h b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
> > new file mode 100644
> > index 000000000000..7584c397bbac
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/spinlock.h
> > @@ -0,0 +1,92 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * A stand-alone ticket spinlock implementation for use by the non-VHE
> > + * KVM hypervisor code running at EL2.
> > + *
> > + * Copyright (C) 2020 Google LLC
> > + * Author: Will Deacon <will@kernel.org>
> > + *
> > + * Heavily based on the implementation removed by c11090474d70 which was:
> > + * Copyright (C) 2012 ARM Ltd.
> > + */
> > +
> > +#ifndef __ARM64_KVM_NVHE_SPINLOCK_H__
> > +#define __ARM64_KVM_NVHE_SPINLOCK_H__
> > +
> > +#include <asm/alternative.h>
> > +#include <asm/lse.h>
> > +
> > +typedef union hyp_spinlock {
> > +	u32	__val;
> > +	struct {
> > +#ifdef __AARCH64EB__
> > +		u16 next, owner;
> > +#else
> > +		u16 owner, next;
> > +	};
> > +#endif
> 
> Looks like I put this #endif in the wrong place; probably needs to be a line
> higher.

Uh oh, missed that too. Fix now merged locally, thanks.

Quentin

Will Deacon Feb. 1, 2021, 5:41 p.m. UTC | #8

On Fri, Jan 08, 2021 at 12:15:02PM +0000, Quentin Perret wrote:
> Move the initialization of kvm_nvhe_init_params in a dedicated function
> that is run early, and only once during KVM init, rather than every time
> the KVM vectors are set and reset.
> 
> This also opens the opportunity for the hypervisor to change the init
> structs during boot, hence simplifying the replacement of host-provided
> page-tables and stacks by the ones the hypervisor will create for
> itself.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/kvm/arm.c | 28 ++++++++++++++++++++--------
>  1 file changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 04c44853b103..3ac0f3425833 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c

[...]

> @@ -1807,6 +1813,12 @@ static int init_hyp_mode(void)
>  			goto out_err;
>  	}
>  
> +	/*
> +	 * Prepare the CPU initialization parameters
> +	 */
> +	for_each_possible_cpu(cpu)
> +		cpu_prepare_hyp_mode(cpu);
> +

This is the fifth for_each_possible_cpu() loop in this function; can any of
them be merged together?

Will

Will Deacon Feb. 1, 2021, 5:46 p.m. UTC | #9

On Fri, Jan 08, 2021 at 12:15:03PM +0000, Quentin Perret wrote:
> Currently, the KVM page-table allocator uses a mix of put_page() and
> free_page() calls depending on the context even though page-allocation
> is always achieved using variants of __get_free_page().
> 
> Make the code consitent by using put_page() throughout, and reduce the

typo: consistent

> memory management API surface used by the page-table code. This will
> ease factoring out page-alloction from pgtable.c, which is a
> pre-requisite to creating page-tables at EL2.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/kvm/hyp/pgtable.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

Will

Will Deacon Feb. 1, 2021, 6:16 p.m. UTC | #10

On Fri, Jan 08, 2021 at 12:15:04PM +0000, Quentin Perret wrote:
> In preparation for enabling the creation of page-tables at EL2, factor
> all memory allocation out of the page-table code, hence making it
> re-usable with any compatible memory allocator.
> 
> No functional changes intended.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h | 32 +++++++++-
>  arch/arm64/kvm/hyp/pgtable.c         | 90 +++++++++++++++++-----------
>  arch/arm64/kvm/mmu.c                 | 70 +++++++++++++++++++++-
>  3 files changed, 154 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 52ab38db04c7..45acc9dc6c45 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -13,17 +13,41 @@
>  
>  typedef u64 kvm_pte_t;
>  
> +/**
> + * struct kvm_pgtable_mm_ops - Memory management callbacks.
> + * @zalloc_page:	Allocate a zeroed memory page.

Please describe the 'arg' parameter.

> + * @zalloc_pages_exact:	Allocate an exact number of zeroed memory pages.

I think this comment coulld be expanded somewhat to make it clear that (a)
the 'size' parameter is in bytes rather than pages (b) the rounding
behaviour applied if 'size' is not page-aligned and (c) that the resulting
allocation is physically contiguous.

> + * @free_pages_exact:	Free an exact number of memory pages.
> + * @get_page:		Increment the refcount on a page.
> + * @put_page:		Decrement the refcount on a page.
> + * @page_count:		Returns the refcount of a page.
> + * @phys_to_virt:	Convert a physical address into a virtual address.
> + * @virt_to_phys:	Convert a virtual address into a physical address.

I think it would be good to be explicit about the nature of the virtual
address here. We've dealing with virtual addresses that are mapped in the
current context rather than e.g. guest virtual addresses.

> + */
> +struct kvm_pgtable_mm_ops {
> +	void*		(*zalloc_page)(void *arg);
> +	void*		(*zalloc_pages_exact)(size_t size);
> +	void		(*free_pages_exact)(void *addr, size_t size);
> +	void		(*get_page)(void *addr);
> +	void		(*put_page)(void *addr);
> +	int		(*page_count)(void *addr);
> +	void*		(*phys_to_virt)(phys_addr_t phys);
> +	phys_addr_t	(*virt_to_phys)(void *addr);
> +};

[...]

> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 1f41173e6149..278e163beda4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -88,6 +88,48 @@ static bool kvm_is_device_pfn(unsigned long pfn)
>  	return !pfn_valid(pfn);
>  }
>  
> +static void *stage2_memcache_alloc_page(void *arg)
> +{
> +	struct kvm_mmu_memory_cache *mc = arg;
> +	kvm_pte_t *ptep = NULL;
> +
> +	/* Allocated with GFP_KERNEL_ACCOUNT, so no need to zero */

I couldn't spot where GFP_KERNEL_ACCOUNT implies __GFP_ZERO. Please can you
elaborate?

> +	if (mc && mc->nobjs)
> +		ptep = mc->objects[--mc->nobjs];
> +
> +	return ptep;
> +}

Why can't we use kvm_mmu_memory_cache_alloc() directly instead of opening up
the memory_cache?

> +static void *kvm_host_zalloc_pages_exact(size_t size)
> +{
> +	return alloc_pages_exact(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);

Hmm, so now we're passing __GFP_ZERO? ;)

> +static void kvm_host_get_page(void *addr)
> +{
> +	get_page(virt_to_page(addr));
> +}
> +
> +static void kvm_host_put_page(void *addr)
> +{
> +	put_page(virt_to_page(addr));
> +}
> +
> +static int kvm_host_page_count(void *addr)
> +{
> +	return page_count(virt_to_page(addr));
> +}
> +
> +static phys_addr_t kvm_host_pa(void *addr)
> +{
> +	return __pa(addr);
> +}
> +
> +static void *kvm_host_va(phys_addr_t phys)
> +{
> +	return __va(phys);
> +}
> +
>  /*
>   * Unmapping vs dcache management:
>   *
> @@ -351,6 +393,17 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
>  	return 0;
>  }
>  
> +static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> +	.zalloc_page		= stage2_memcache_alloc_page,
> +	.zalloc_pages_exact	= kvm_host_zalloc_pages_exact,
> +	.free_pages_exact	= free_pages_exact,
> +	.get_page		= kvm_host_get_page,
> +	.put_page		= kvm_host_put_page,
> +	.page_count		= kvm_host_page_count,
> +	.phys_to_virt		= kvm_host_va,
> +	.virt_to_phys		= kvm_host_pa,
> +};

Idle thought, but I wonder whether it would be better to have these
implementations as the default and make the mm_ops structure parameter
to kvm_pgtable_stage2_init() optional? I guess you don't gain an awful
lot though, so feel free to ignore me.

Will

Will Deacon Feb. 1, 2021, 6:32 p.m. UTC | #11

On Fri, Jan 08, 2021 at 12:15:05PM +0000, Quentin Perret wrote:
> Currently, the hyp code cannot make full use of a bss, as the kernel
> section is mapped read-only.
> 
> While this mapping could simply be changed to read-write, it would
> intermingle even more the hyp and kernel state than they currently are.
> Instead, introduce a __hyp_bss section, that uses reserved pages, and
> create the appropriate RW hyp mappings during KVM init.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/sections.h |  1 +
>  arch/arm64/kernel/vmlinux.lds.S   |  7 +++++++
>  arch/arm64/kvm/arm.c              | 11 +++++++++++
>  arch/arm64/kvm/hyp/nvhe/hyp.lds.S |  1 +
>  4 files changed, 20 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/sections.h b/arch/arm64/include/asm/sections.h
> index 8ff579361731..f58cf493de16 100644
> --- a/arch/arm64/include/asm/sections.h
> +++ b/arch/arm64/include/asm/sections.h
> @@ -12,6 +12,7 @@ extern char __hibernate_exit_text_start[], __hibernate_exit_text_end[];
>  extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[];
>  extern char __hyp_text_start[], __hyp_text_end[];
>  extern char __hyp_data_ro_after_init_start[], __hyp_data_ro_after_init_end[];
> +extern char __hyp_bss_start[], __hyp_bss_end[];
>  extern char __idmap_text_start[], __idmap_text_end[];
>  extern char __initdata_begin[], __initdata_end[];
>  extern char __inittext_begin[], __inittext_end[];
> diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
> index 43af13968dfd..3eca35d5a7cf 100644
> --- a/arch/arm64/kernel/vmlinux.lds.S
> +++ b/arch/arm64/kernel/vmlinux.lds.S
> @@ -8,6 +8,13 @@
>  #define RO_EXCEPTION_TABLE_ALIGN	8
>  #define RUNTIME_DISCARD_EXIT
>  
> +#define BSS_FIRST_SECTIONS				\
> +	. = ALIGN(PAGE_SIZE);				\
> +	__hyp_bss_start = .;				\
> +	*(.hyp.bss)					\

Use HYP_SECTION_NAME() here?

> +	. = ALIGN(PAGE_SIZE);				\
> +	__hyp_bss_end = .;

Should this be gated on CONFIG_KVM like the other hyp sections are? In fact,
it might be nice to define all of those together. Yeah, it means moving
things higher up in the file, but I think it will be easier to read.

>  #include <asm-generic/vmlinux.lds.h>
>  #include <asm/cache.h>
>  #include <asm/hyp_image.h>
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 3ac0f3425833..51b53ca36dc5 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1770,7 +1770,18 @@ static int init_hyp_mode(void)
>  		goto out_err;
>  	}
>  
> +	/*
> +	 * .hyp.bss is placed at the beginning of the .bss section, so map that
> +	 * part RW, and the rest RO as the hyp shouldn't be touching it.
> +	 */
>  	err = create_hyp_mappings(kvm_ksym_ref(__bss_start),

I think it would be clearer to refer to __hyp_bss_start here ^^.
You could always add an ASSERT in the linker script if you want to catch
anybody adding something before the hyp bss in future.

Will

Quentin Perret Feb. 1, 2021, 6:32 p.m. UTC | #12

On Monday 01 Feb 2021 at 18:16:08 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:04PM +0000, Quentin Perret wrote:
> > In preparation for enabling the creation of page-tables at EL2, factor
> > all memory allocation out of the page-table code, hence making it
> > re-usable with any compatible memory allocator.
> > 
> > No functional changes intended.
> > 
> > Signed-off-by: Quentin Perret <qperret@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h | 32 +++++++++-
> >  arch/arm64/kvm/hyp/pgtable.c         | 90 +++++++++++++++++-----------
> >  arch/arm64/kvm/mmu.c                 | 70 +++++++++++++++++++++-
> >  3 files changed, 154 insertions(+), 38 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 52ab38db04c7..45acc9dc6c45 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -13,17 +13,41 @@
> >  
> >  typedef u64 kvm_pte_t;
> >  
> > +/**
> > + * struct kvm_pgtable_mm_ops - Memory management callbacks.
> > + * @zalloc_page:	Allocate a zeroed memory page.
> 
> Please describe the 'arg' parameter.
> 
> > + * @zalloc_pages_exact:	Allocate an exact number of zeroed memory pages.
> 
> I think this comment coulld be expanded somewhat to make it clear that (a)
> the 'size' parameter is in bytes rather than pages (b) the rounding
> behaviour applied if 'size' is not page-aligned and (c) that the resulting
> allocation is physically contiguous.
> 
> > + * @free_pages_exact:	Free an exact number of memory pages.
> > + * @get_page:		Increment the refcount on a page.
> > + * @put_page:		Decrement the refcount on a page.
> > + * @page_count:		Returns the refcount of a page.
> > + * @phys_to_virt:	Convert a physical address into a virtual address.
> > + * @virt_to_phys:	Convert a virtual address into a physical address.
> 
> I think it would be good to be explicit about the nature of the virtual
> address here. We've dealing with virtual addresses that are mapped in the
> current context rather than e.g. guest virtual addresses.

Ack to all the above.

> > + */
> > +struct kvm_pgtable_mm_ops {
> > +	void*		(*zalloc_page)(void *arg);
> > +	void*		(*zalloc_pages_exact)(size_t size);
> > +	void		(*free_pages_exact)(void *addr, size_t size);
> > +	void		(*get_page)(void *addr);
> > +	void		(*put_page)(void *addr);
> > +	int		(*page_count)(void *addr);
> > +	void*		(*phys_to_virt)(phys_addr_t phys);
> > +	phys_addr_t	(*virt_to_phys)(void *addr);
> > +};
> 
> [...]
> 
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 1f41173e6149..278e163beda4 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -88,6 +88,48 @@ static bool kvm_is_device_pfn(unsigned long pfn)
> >  	return !pfn_valid(pfn);
> >  }
> >  
> > +static void *stage2_memcache_alloc_page(void *arg)
> > +{
> > +	struct kvm_mmu_memory_cache *mc = arg;
> > +	kvm_pte_t *ptep = NULL;
> > +
> > +	/* Allocated with GFP_KERNEL_ACCOUNT, so no need to zero */
> 
> I couldn't spot where GFP_KERNEL_ACCOUNT implies __GFP_ZERO.

I'm not suprised, it doesn't. Broken comment clearly, I'll fix with
s/GFP_KERNEL_ACCOUNT/__GFP_ZERO

> Please can you elaborate?
> 
> > +	if (mc && mc->nobjs)
> > +		ptep = mc->objects[--mc->nobjs];
> > +
> > +	return ptep;
> > +}
> 
> Why can't we use kvm_mmu_memory_cache_alloc() directly instead of opening up
> the memory_cache?

I think we can -- that function didn't exist when I first wrote this,
but no good reason not to use it now.

> > +static void *kvm_host_zalloc_pages_exact(size_t size)
> > +{
> > +	return alloc_pages_exact(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> 
> Hmm, so now we're passing __GFP_ZERO? ;)

:-)

> > +static void kvm_host_get_page(void *addr)
> > +{
> > +	get_page(virt_to_page(addr));
> > +}
> > +
> > +static void kvm_host_put_page(void *addr)
> > +{
> > +	put_page(virt_to_page(addr));
> > +}
> > +
> > +static int kvm_host_page_count(void *addr)
> > +{
> > +	return page_count(virt_to_page(addr));
> > +}
> > +
> > +static phys_addr_t kvm_host_pa(void *addr)
> > +{
> > +	return __pa(addr);
> > +}
> > +
> > +static void *kvm_host_va(phys_addr_t phys)
> > +{
> > +	return __va(phys);
> > +}
> > +
> >  /*
> >   * Unmapping vs dcache management:
> >   *
> > @@ -351,6 +393,17 @@ int create_hyp_exec_mappings(phys_addr_t phys_addr, size_t size,
> >  	return 0;
> >  }
> >  
> > +static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> > +	.zalloc_page		= stage2_memcache_alloc_page,
> > +	.zalloc_pages_exact	= kvm_host_zalloc_pages_exact,
> > +	.free_pages_exact	= free_pages_exact,
> > +	.get_page		= kvm_host_get_page,
> > +	.put_page		= kvm_host_put_page,
> > +	.page_count		= kvm_host_page_count,
> > +	.phys_to_virt		= kvm_host_va,
> > +	.virt_to_phys		= kvm_host_pa,
> > +};
> 
> Idle thought, but I wonder whether it would be better to have these
> implementations as the default and make the mm_ops structure parameter
> to kvm_pgtable_stage2_init() optional? I guess you don't gain an awful
> lot though, so feel free to ignore me.

No strong opinion really, but I suppose I could do something as simple
as having static inline wrappers which provide kvm_s2_mm_ops to the
pgtable API for me. I'll probably want to make sure these are not
defined when compiling EL2 code, though, to avoid confusion.

Or maybe you had something else in mind?

Cheers,
Quentin

Will Deacon Feb. 1, 2021, 6:39 p.m. UTC | #13

On Mon, Feb 01, 2021 at 06:32:52PM +0000, Quentin Perret wrote:
> On Monday 01 Feb 2021 at 18:16:08 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:04PM +0000, Quentin Perret wrote:
> > > +static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
> > > +	.zalloc_page		= stage2_memcache_alloc_page,
> > > +	.zalloc_pages_exact	= kvm_host_zalloc_pages_exact,
> > > +	.free_pages_exact	= free_pages_exact,
> > > +	.get_page		= kvm_host_get_page,
> > > +	.put_page		= kvm_host_put_page,
> > > +	.page_count		= kvm_host_page_count,
> > > +	.phys_to_virt		= kvm_host_va,
> > > +	.virt_to_phys		= kvm_host_pa,
> > > +};
> > 
> > Idle thought, but I wonder whether it would be better to have these
> > implementations as the default and make the mm_ops structure parameter
> > to kvm_pgtable_stage2_init() optional? I guess you don't gain an awful
> > lot though, so feel free to ignore me.
> 
> No strong opinion really, but I suppose I could do something as simple
> as having static inline wrappers which provide kvm_s2_mm_ops to the
> pgtable API for me. I'll probably want to make sure these are not
> defined when compiling EL2 code, though, to avoid confusion.
> 
> Or maybe you had something else in mind?

No, just food for thought. If we can reduce the changes for normal KVM then
it's probably worth considering if it doesn't add divergent code paths. But
I'm also fine with the proposal you have here, so if it doesn't work then
don't get hung up on it.

Will

Will Deacon Feb. 1, 2021, 6:41 p.m. UTC | #14

On Fri, Jan 08, 2021 at 12:15:06PM +0000, Quentin Perret wrote:
> kvm_call_hyp() has some logic to issue a function call or a hypercall
> depending the EL at which the kernel is running. However, all the code
> compiled under __KVM_NVHE_HYPERVISOR__ is guaranteed to run only at EL2,
> and in this case a simple function call is needed.
> 
> Add ifdefery to kvm_host.h to symplify kvm_call_hyp() in .hyp.text.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h | 6 ++++++
>  1 file changed, 6 insertions(+)

Acked-by: Will Deacon <will@kernel.org>

Will

Will Deacon Feb. 1, 2021, 6:43 p.m. UTC | #15

On Fri, Jan 08, 2021 at 12:15:07PM +0000, Quentin Perret wrote:
> In order to allow the usage of code shared by the host and the hyp in
> static inline library function, allow the usage of kvm_nvhe_sym() at el2

typo: functions

> by defaulting to the raw symbol name.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/hyp_image.h | 4 ++++
>  1 file changed, 4 insertions(+)

Acked-by: Will Deacon <will@kernel.org>

Will

Will Deacon Feb. 1, 2021, 7 p.m. UTC | #16

On Fri, Jan 08, 2021 at 12:15:08PM +0000, Quentin Perret wrote:
> diff --git a/arch/arm64/kvm/hyp/nvhe/early_alloc.c b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
> new file mode 100644
> index 000000000000..de4c45662970
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
> @@ -0,0 +1,60 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Google LLC
> + * Author: Quentin Perret <qperret@google.com>
> + */
> +
> +#include <asm/kvm_pgtable.h>
> +
> +#include <nvhe/memory.h>
> +
> +struct kvm_pgtable_mm_ops hyp_early_alloc_mm_ops;
> +s64 __ro_after_init hyp_physvirt_offset;
> +
> +static unsigned long base;
> +static unsigned long end;
> +static unsigned long cur;
> +
> +unsigned long hyp_early_alloc_nr_pages(void)
> +{
> +	return (cur - base) >> PAGE_SHIFT;
> +}

nit: but I find this function name confusing (it's returning the number of
_allocated_ pages, not the number of _free_ pages!). How about something
like hyp_early_alloc_size() to match hyp_s1_pgtable_size() which you add
later? [and move the shift out to the caller]?

> +
> +extern void clear_page(void *to);

Stick this in a header?

> +
> +void *hyp_early_alloc_contig(unsigned int nr_pages)

I think order might make more sense, or do you need to allocate
non-power-of-2 batches of pages?

> +{
> +	unsigned long ret = cur, i, p;
> +
> +	if (!nr_pages)
> +		return NULL;
> +
> +	cur += nr_pages << PAGE_SHIFT;
> +	if (cur > end) {

This would mean that concurrent hyp_early_alloc_nr_pages() would transiently
give the wrong answer. Might be worth sticking the locking expectations with
the function prototypes.

That said, maybe it would be better to write this check as:

	if (end - cur < (nr_pages << PAGE_SHIFT))

as that also removes the need to worry about overflow if nr_pages is huge
(which would be a bug in the hypervisor, which we would then catch here).

> +		cur = ret;
> +		return NULL;
> +	}
> +
> +	for (i = 0; i < nr_pages; i++) {
> +		p = ret + (i << PAGE_SHIFT);
> +		clear_page((void *)(p));
> +	}
> +
> +	return (void *)ret;
> +}
> +
> +void *hyp_early_alloc_page(void *arg)
> +{
> +	return hyp_early_alloc_contig(1);
> +}
> +
> +void hyp_early_alloc_init(unsigned long virt, unsigned long size)
> +{
> +	base = virt;
> +	end = virt + size;
> +	cur = virt;

nit: base = cur = virt;

Will

Will Deacon Feb. 1, 2021, 7:06 p.m. UTC | #17

On Fri, Jan 08, 2021 at 12:15:09PM +0000, Quentin Perret wrote:
> In order to use the kernel list library at EL2, introduce stubs for the
> CONFIG_DEBUG_LIST out-of-lines calls.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/kvm/hyp/nvhe/Makefile |  2 +-
>  arch/arm64/kvm/hyp/nvhe/stub.c   | 22 ++++++++++++++++++++++
>  2 files changed, 23 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
> 
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> index 1fc0684a7678..33bd381d8f73 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
>  lib-objs := $(addprefix ../../../lib/, $(lib-objs))
>  
>  obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> -	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
> +	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
>  obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
>  	 ../fpsimd.o ../hyp-entry.o ../exception.o
>  obj-y += $(lib-objs)
> diff --git a/arch/arm64/kvm/hyp/nvhe/stub.c b/arch/arm64/kvm/hyp/nvhe/stub.c
> new file mode 100644
> index 000000000000..c0aa6bbfd79d
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/stub.c
> @@ -0,0 +1,22 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Stubs for out-of-line function calls caused by re-using kernel
> + * infrastructure at EL2.
> + *
> + * Copyright (C) 2020 - Google LLC
> + */
> +
> +#include <linux/list.h>
> +
> +#ifdef CONFIG_DEBUG_LIST
> +bool __list_add_valid(struct list_head *new, struct list_head *prev,
> +		      struct list_head *next)
> +{
> +		return true;
> +}
> +
> +bool __list_del_entry_valid(struct list_head *entry)
> +{
> +		return true;
> +}
> +#endif

Can we get away with defining our own CHECK_DATA_CORRUPTION macro instead?

Will

Quentin Perret Feb. 2, 2021, 9:44 a.m. UTC | #18

On Monday 01 Feb 2021 at 19:00:08 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:08PM +0000, Quentin Perret wrote:
> > diff --git a/arch/arm64/kvm/hyp/nvhe/early_alloc.c b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
> > new file mode 100644
> > index 000000000000..de4c45662970
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/nvhe/early_alloc.c
> > @@ -0,0 +1,60 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2020 Google LLC
> > + * Author: Quentin Perret <qperret@google.com>
> > + */
> > +
> > +#include <asm/kvm_pgtable.h>
> > +
> > +#include <nvhe/memory.h>
> > +
> > +struct kvm_pgtable_mm_ops hyp_early_alloc_mm_ops;
> > +s64 __ro_after_init hyp_physvirt_offset;
> > +
> > +static unsigned long base;
> > +static unsigned long end;
> > +static unsigned long cur;
> > +
> > +unsigned long hyp_early_alloc_nr_pages(void)
> > +{
> > +	return (cur - base) >> PAGE_SHIFT;
> > +}
> 
> nit: but I find this function name confusing (it's returning the number of
> _allocated_ pages, not the number of _free_ pages!). How about something
> like hyp_early_alloc_size() to match hyp_s1_pgtable_size() which you add
> later? [and move the shift out to the caller]?

Works for me.

> > +extern void clear_page(void *to);
> 
> Stick this in a header?

Right, that, or perhaps just use asm/page.h directly -- I _think_ that
should work fine assuming with have the correct symbol aliasing in
place.

> > +
> > +void *hyp_early_alloc_contig(unsigned int nr_pages)
> 
> I think order might make more sense, or do you need to allocate
> non-power-of-2 batches of pages?

Indeed, I allocate page-aligned blobs of arbitrary size (e.g.
divide_memory_pool() in patch 16), so I prefer it that way.

> > +{
> > +	unsigned long ret = cur, i, p;
> > +
> > +	if (!nr_pages)
> > +		return NULL;
> > +
> > +	cur += nr_pages << PAGE_SHIFT;
> > +	if (cur > end) {
> 
> This would mean that concurrent hyp_early_alloc_nr_pages() would transiently
> give the wrong answer. Might be worth sticking the locking expectations with
> the function prototypes.

This is only called from a single CPU from a non-preemptible section, so
that is not a problem. But yes, I'll stick a comment.

> That said, maybe it would be better to write this check as:
> 
> 	if (end - cur < (nr_pages << PAGE_SHIFT))
> 
> as that also removes the need to worry about overflow if nr_pages is huge
> (which would be a bug in the hypervisor, which we would then catch here).

Sounds good.

> > +		cur = ret;
> > +		return NULL;
> > +	}
> > +
> > +	for (i = 0; i < nr_pages; i++) {
> > +		p = ret + (i << PAGE_SHIFT);
> > +		clear_page((void *)(p));
> > +	}
> > +
> > +	return (void *)ret;
> > +}
> > +
> > +void *hyp_early_alloc_page(void *arg)
> > +{
> > +	return hyp_early_alloc_contig(1);
> > +}
> > +
> > +void hyp_early_alloc_init(unsigned long virt, unsigned long size)
> > +{
> > +	base = virt;
> > +	end = virt + size;
> > +	cur = virt;
> 
> nit: base = cur = virt;

Ack.

Thanks for the review,
Quentin

Quentin Perret Feb. 2, 2021, 9:57 a.m. UTC | #19

On Monday 01 Feb 2021 at 19:06:20 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:09PM +0000, Quentin Perret wrote:
> > In order to use the kernel list library at EL2, introduce stubs for the
> > CONFIG_DEBUG_LIST out-of-lines calls.
> > 
> > Signed-off-by: Quentin Perret <qperret@google.com>
> > ---
> >  arch/arm64/kvm/hyp/nvhe/Makefile |  2 +-
> >  arch/arm64/kvm/hyp/nvhe/stub.c   | 22 ++++++++++++++++++++++
> >  2 files changed, 23 insertions(+), 1 deletion(-)
> >  create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
> > 
> > diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> > index 1fc0684a7678..33bd381d8f73 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> > +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> > @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
> >  lib-objs := $(addprefix ../../../lib/, $(lib-objs))
> >  
> >  obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> > -	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
> > +	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
> >  obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> >  	 ../fpsimd.o ../hyp-entry.o ../exception.o
> >  obj-y += $(lib-objs)
> > diff --git a/arch/arm64/kvm/hyp/nvhe/stub.c b/arch/arm64/kvm/hyp/nvhe/stub.c
> > new file mode 100644
> > index 000000000000..c0aa6bbfd79d
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/nvhe/stub.c
> > @@ -0,0 +1,22 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Stubs for out-of-line function calls caused by re-using kernel
> > + * infrastructure at EL2.
> > + *
> > + * Copyright (C) 2020 - Google LLC
> > + */
> > +
> > +#include <linux/list.h>
> > +
> > +#ifdef CONFIG_DEBUG_LIST
> > +bool __list_add_valid(struct list_head *new, struct list_head *prev,
> > +		      struct list_head *next)
> > +{
> > +		return true;
> > +}
> > +
> > +bool __list_del_entry_valid(struct list_head *entry)
> > +{
> > +		return true;
> > +}
> > +#endif
> 
> Can we get away with defining our own CHECK_DATA_CORRUPTION macro instead?

Yes I think eventually it'd be nice to get there, but that has other
implications (e.g. how do you report something in dmesg from EL2?) so
perhaps we can keep that a separate series?

Cheers,
Quentin

Will Deacon Feb. 2, 2021, 10 a.m. UTC | #20

On Tue, Feb 02, 2021 at 09:57:36AM +0000, Quentin Perret wrote:
> On Monday 01 Feb 2021 at 19:06:20 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:09PM +0000, Quentin Perret wrote:
> > > In order to use the kernel list library at EL2, introduce stubs for the
> > > CONFIG_DEBUG_LIST out-of-lines calls.
> > > 
> > > Signed-off-by: Quentin Perret <qperret@google.com>
> > > ---
> > >  arch/arm64/kvm/hyp/nvhe/Makefile |  2 +-
> > >  arch/arm64/kvm/hyp/nvhe/stub.c   | 22 ++++++++++++++++++++++
> > >  2 files changed, 23 insertions(+), 1 deletion(-)
> > >  create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
> > > 
> > > diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> > > index 1fc0684a7678..33bd381d8f73 100644
> > > --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> > > +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> > > @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
> > >  lib-objs := $(addprefix ../../../lib/, $(lib-objs))
> > >  
> > >  obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> > > -	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
> > > +	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
> > >  obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> > >  	 ../fpsimd.o ../hyp-entry.o ../exception.o
> > >  obj-y += $(lib-objs)
> > > diff --git a/arch/arm64/kvm/hyp/nvhe/stub.c b/arch/arm64/kvm/hyp/nvhe/stub.c
> > > new file mode 100644
> > > index 000000000000..c0aa6bbfd79d
> > > --- /dev/null
> > > +++ b/arch/arm64/kvm/hyp/nvhe/stub.c
> > > @@ -0,0 +1,22 @@
> > > +// SPDX-License-Identifier: GPL-2.0-only
> > > +/*
> > > + * Stubs for out-of-line function calls caused by re-using kernel
> > > + * infrastructure at EL2.
> > > + *
> > > + * Copyright (C) 2020 - Google LLC
> > > + */
> > > +
> > > +#include <linux/list.h>
> > > +
> > > +#ifdef CONFIG_DEBUG_LIST
> > > +bool __list_add_valid(struct list_head *new, struct list_head *prev,
> > > +		      struct list_head *next)
> > > +{
> > > +		return true;
> > > +}
> > > +
> > > +bool __list_del_entry_valid(struct list_head *entry)
> > > +{
> > > +		return true;
> > > +}
> > > +#endif
> > 
> > Can we get away with defining our own CHECK_DATA_CORRUPTION macro instead?
> 
> Yes I think eventually it'd be nice to get there, but that has other
> implications (e.g. how do you report something in dmesg from EL2?) so
> perhaps we can keep that a separate series?

We wouldn't necessarily have to report anything, but having the return value
of these functions be based off the generic checks would be great if we can
do it (i.e. we'd avoid corrupting the list).

Will

Quentin Perret Feb. 2, 2021, 10:14 a.m. UTC | #21

On Tuesday 02 Feb 2021 at 10:00:29 (+0000), Will Deacon wrote:
> On Tue, Feb 02, 2021 at 09:57:36AM +0000, Quentin Perret wrote:
> > On Monday 01 Feb 2021 at 19:06:20 (+0000), Will Deacon wrote:
> > > On Fri, Jan 08, 2021 at 12:15:09PM +0000, Quentin Perret wrote:
> > > > In order to use the kernel list library at EL2, introduce stubs for the
> > > > CONFIG_DEBUG_LIST out-of-lines calls.
> > > > 
> > > > Signed-off-by: Quentin Perret <qperret@google.com>
> > > > ---
> > > >  arch/arm64/kvm/hyp/nvhe/Makefile |  2 +-
> > > >  arch/arm64/kvm/hyp/nvhe/stub.c   | 22 ++++++++++++++++++++++
> > > >  2 files changed, 23 insertions(+), 1 deletion(-)
> > > >  create mode 100644 arch/arm64/kvm/hyp/nvhe/stub.c
> > > > 
> > > > diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> > > > index 1fc0684a7678..33bd381d8f73 100644
> > > > --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> > > > +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> > > > @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
> > > >  lib-objs := $(addprefix ../../../lib/, $(lib-objs))
> > > >  
> > > >  obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> > > > -	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o
> > > > +	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
> > > >  obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> > > >  	 ../fpsimd.o ../hyp-entry.o ../exception.o
> > > >  obj-y += $(lib-objs)
> > > > diff --git a/arch/arm64/kvm/hyp/nvhe/stub.c b/arch/arm64/kvm/hyp/nvhe/stub.c
> > > > new file mode 100644
> > > > index 000000000000..c0aa6bbfd79d
> > > > --- /dev/null
> > > > +++ b/arch/arm64/kvm/hyp/nvhe/stub.c
> > > > @@ -0,0 +1,22 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > +/*
> > > > + * Stubs for out-of-line function calls caused by re-using kernel
> > > > + * infrastructure at EL2.
> > > > + *
> > > > + * Copyright (C) 2020 - Google LLC
> > > > + */
> > > > +
> > > > +#include <linux/list.h>
> > > > +
> > > > +#ifdef CONFIG_DEBUG_LIST
> > > > +bool __list_add_valid(struct list_head *new, struct list_head *prev,
> > > > +		      struct list_head *next)
> > > > +{
> > > > +		return true;
> > > > +}
> > > > +
> > > > +bool __list_del_entry_valid(struct list_head *entry)
> > > > +{
> > > > +		return true;
> > > > +}
> > > > +#endif
> > > 
> > > Can we get away with defining our own CHECK_DATA_CORRUPTION macro instead?
> > 
> > Yes I think eventually it'd be nice to get there, but that has other
> > implications (e.g. how do you report something in dmesg from EL2?) so
> > perhaps we can keep that a separate series?
> 
> We wouldn't necessarily have to report anything, but having the return value
> of these functions be based off the generic checks would be great if we can
> do it (i.e. we'd avoid corrupting the list).

Ah, I see what you mean. Happy to have a go a it, there are a few other
small things that make that it a bit annoying e.g. CHECK_DATA_CORRUPTION
is unconditionally defined in bug.h, and I'll need to stub EXPORT_SYMBOL
as well, which may both require changing core files, but maybe that's
fine. And if that is too painful I think it would make sense to keep
this a separate and self-contained series which would be a nice
incremental improvement over the simple approach I have here :)

Cheers,
Quentin

Will Deacon Feb. 2, 2021, 6:13 p.m. UTC | #22

Hi Quentin,

Sorry for the delay, this one took me a while to grok.

On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> When memory protection is enabled, the hyp code will require a basic
> form of memory management in order to allocate and free memory pages at
> EL2. This is needed for various use-cases, including the creation of hyp
> mappings or the allocation of stage 2 page tables.
> 
> To address these use-case, introduce a simple memory allocator in the
> hyp code. The allocator is designed as a conventional 'buddy allocator',
> working with a page granularity. It allows to allocate and free
> physically contiguous pages from memory 'pools', with a guaranteed order
> alignment in the PA space. Each page in a memory pool is associated
> with a struct hyp_page which holds the page's metadata, including its
> refcount, as well as its current order, hence mimicking the kernel's
> buddy system in the GFP infrastructure. The hyp_page metadata are made
> accessible through a hyp_vmemmap, following the concept of
> SPARSE_VMEMMAP in the kernel.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/kvm/hyp/include/nvhe/gfp.h    |  32 ++++
>  arch/arm64/kvm/hyp/include/nvhe/memory.h |  25 +++
>  arch/arm64/kvm/hyp/nvhe/Makefile         |   2 +-
>  arch/arm64/kvm/hyp/nvhe/page_alloc.c     | 185 +++++++++++++++++++++++
>  4 files changed, 243 insertions(+), 1 deletion(-)
>  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c
> 
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> new file mode 100644
> index 000000000000..95587faee171
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __KVM_HYP_GFP_H
> +#define __KVM_HYP_GFP_H
> +
> +#include <linux/list.h>
> +
> +#include <nvhe/memory.h>
> +#include <nvhe/spinlock.h>
> +
> +#define HYP_MAX_ORDER	11U

Could we just use MAX_ORDER here?

> +#define HYP_NO_ORDER	UINT_MAX
> +
> +struct hyp_pool {
> +	hyp_spinlock_t lock;

A comment about what this lock protects would be handy, especially as the
'refcount' field of 'struct hyp_page' isn't updated atomically. I think it
also means that we don't have a safe way to move a page from one pool to
another; it's fixed forever once the page has been made available for
allocation.

> +	struct list_head free_area[HYP_MAX_ORDER + 1];
> +	phys_addr_t range_start;
> +	phys_addr_t range_end;
> +};
> +
> +/* GFP flags */
> +#define HYP_GFP_NONE	0
> +#define HYP_GFP_ZERO	1
> +
> +/* Allocation */
> +void *hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask, unsigned int order);
> +void hyp_get_page(void *addr);
> +void hyp_put_page(void *addr);
> +
> +/* Used pages cannot be freed */
> +int hyp_pool_init(struct hyp_pool *pool, phys_addr_t phys,
> +		  unsigned int nr_pages, unsigned int used_pages);

Maybe "reserved_pages" would be a better name than "used_pages"?

> +#endif /* __KVM_HYP_GFP_H */
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> index 64c44c142c95..ed47674bc988 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/memory.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> @@ -6,7 +6,17 @@
>  
>  #include <linux/types.h>
>  
> +struct hyp_pool;
> +struct hyp_page {
> +	unsigned int refcount;
> +	unsigned int order;
> +	struct hyp_pool *pool;
> +	struct list_head node;
> +};
> +
>  extern s64 hyp_physvirt_offset;
> +extern u64 __hyp_vmemmap;
> +#define hyp_vmemmap ((struct hyp_page *)__hyp_vmemmap)
>  
>  #define __hyp_pa(virt)	((phys_addr_t)(virt) + hyp_physvirt_offset)
>  #define __hyp_va(virt)	((void *)((phys_addr_t)(virt) - hyp_physvirt_offset))
> @@ -21,4 +31,19 @@ static inline phys_addr_t hyp_virt_to_phys(void *addr)
>  	return __hyp_pa(addr);
>  }
>  
> +#define hyp_phys_to_pfn(phys)	((phys) >> PAGE_SHIFT)
> +#define hyp_phys_to_page(phys)	(&hyp_vmemmap[hyp_phys_to_pfn(phys)])
> +#define hyp_virt_to_page(virt)	hyp_phys_to_page(__hyp_pa(virt))
> +
> +#define hyp_page_to_phys(page)  ((phys_addr_t)((page) - hyp_vmemmap) << PAGE_SHIFT)

Maybe implement this in terms of a new hyp_page_to_pfn() macro?

> +#define hyp_page_to_virt(page)	__hyp_va(hyp_page_to_phys(page))
> +#define hyp_page_to_pool(page)	(((struct hyp_page *)page)->pool)
> +
> +static inline int hyp_page_count(void *addr)
> +{
> +	struct hyp_page *p = hyp_virt_to_page(addr);
> +
> +	return p->refcount;
> +}
> +
>  #endif /* __KVM_HYP_MEMORY_H */
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> index 33bd381d8f73..9e5eacfec6ec 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
>  lib-objs := $(addprefix ../../../lib/, $(lib-objs))
>  
>  obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> -	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
> +	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o
>  obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
>  	 ../fpsimd.o ../hyp-entry.o ../exception.o
>  obj-y += $(lib-objs)
> diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> new file mode 100644
> index 000000000000..6de6515f0432
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> @@ -0,0 +1,185 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Google LLC
> + * Author: Quentin Perret <qperret@google.com>
> + */
> +
> +#include <asm/kvm_hyp.h>
> +#include <nvhe/gfp.h>
> +
> +u64 __hyp_vmemmap;
> +
> +/*
> + * Example buddy-tree for a 4-pages physically contiguous pool:
> + *
> + *                 o : Page 3
> + *                /
> + *               o-o : Page 2
> + *              /
> + *             /   o : Page 1
> + *            /   /
> + *           o---o-o : Page 0
> + *    Order  2   1 0
> + *
> + * Example of requests on this zon:

typo: zone

> + *   __find_buddy(pool, page 0, order 0) => page 1
> + *   __find_buddy(pool, page 0, order 1) => page 2
> + *   __find_buddy(pool, page 1, order 0) => page 0
> + *   __find_buddy(pool, page 2, order 0) => page 3
> + */
> +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> +				     unsigned int order)
> +{
> +	phys_addr_t addr = hyp_page_to_phys(p);
> +
> +	addr ^= (PAGE_SIZE << order);
> +	if (addr < pool->range_start || addr >= pool->range_end)
> +		return NULL;

Are these range checks only needed because the pool isn't required to be
an exact power-of-2 pages in size? If so, maybe it would be more
straightforward to limit the max order on a per-pool basis depending upon
its size?

> +
> +	return hyp_phys_to_page(addr);
> +}
> +
> +static void __hyp_attach_page(struct hyp_pool *pool,
> +			      struct hyp_page *p)
> +{
> +	unsigned int order = p->order;
> +	struct hyp_page *buddy;
> +
> +	p->order = HYP_NO_ORDER;

Why is this needed?

> +	for (; order < HYP_MAX_ORDER; order++) {
> +		/* Nothing to do if the buddy isn't in a free-list */
> +		buddy = __find_buddy(pool, p, order);
> +		if (!buddy || list_empty(&buddy->node) || buddy->order != order)

Could we move the "buddy->order" check into __find_buddy()?

> +			break;
> +
> +		/* Otherwise, coalesce the buddies and go one level up */
> +		list_del_init(&buddy->node);
> +		buddy->order = HYP_NO_ORDER;
> +		p = (p < buddy) ? p : buddy;
> +	}
> +
> +	p->order = order;
> +	list_add_tail(&p->node, &pool->free_area[order]);
> +}
> +
> +void hyp_put_page(void *addr)
> +{
> +	struct hyp_page *p = hyp_virt_to_page(addr);
> +	struct hyp_pool *pool = hyp_page_to_pool(p);
> +
> +	hyp_spin_lock(&pool->lock);
> +	if (!p->refcount)
> +		hyp_panic();
> +	p->refcount--;
> +	if (!p->refcount)
> +		__hyp_attach_page(pool, p);
> +	hyp_spin_unlock(&pool->lock);
> +}
> +
> +void hyp_get_page(void *addr)
> +{
> +	struct hyp_page *p = hyp_virt_to_page(addr);
> +	struct hyp_pool *pool = hyp_page_to_pool(p);
> +
> +	hyp_spin_lock(&pool->lock);
> +	p->refcount++;
> +	hyp_spin_unlock(&pool->lock);

We should probably have a proper atomic refcount type for this along the
lines of refcount_t. Even if initially that is implemented with a lock, it
would be good to hide that behind a refcount API.

> +}
> +
> +/* Extract a page from the buddy tree, at a specific order */
> +static struct hyp_page *__hyp_extract_page(struct hyp_pool *pool,
> +					   struct hyp_page *p,
> +					   unsigned int order)
> +{
> +	struct hyp_page *buddy;
> +
> +	if (p->order == HYP_NO_ORDER || p->order < order)
> +		return NULL;

Can you drop the explicit HYP_NO_ORDER check here?

> +
> +	list_del_init(&p->node);
> +
> +	/* Split the page in two until reaching the requested order */
> +	while (p->order > order) {
> +		p->order--;
> +		buddy = __find_buddy(pool, p, p->order);
> +		buddy->order = p->order;
> +		list_add_tail(&buddy->node, &pool->free_area[buddy->order]);
> +	}
> +
> +	p->refcount = 1;
> +
> +	return p;
> +}
> +
> +static void clear_hyp_page(struct hyp_page *p)
> +{
> +	unsigned long i;
> +
> +	for (i = 0; i < (1 << p->order); i++)
> +		clear_page(hyp_page_to_virt(p) + (i << PAGE_SHIFT));

I wonder if this is actually any better than a memset(0)? That should use
DC ZCA as appropriate afaict.

> +static void *__hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask,
> +			       unsigned int order)
> +{
> +	unsigned int i = order;
> +	struct hyp_page *p;
> +
> +	/* Look for a high-enough-order page */
> +	while (i <= HYP_MAX_ORDER && list_empty(&pool->free_area[i]))
> +		i++;
> +	if (i > HYP_MAX_ORDER)
> +		return NULL;
> +
> +	/* Extract it from the tree at the right order */
> +	p = list_first_entry(&pool->free_area[i], struct hyp_page, node);
> +	p = __hyp_extract_page(pool, p, order);
> +
> +	if (mask & HYP_GFP_ZERO)
> +		clear_hyp_page(p);

Do we have a use-case where skipping the zeroing is worthwhile? If not,
it might make some sense to zero on the freeing path instead.

> +
> +	return p;
> +}
> +
> +void *hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask, unsigned int order)
> +{
> +	struct hyp_page *p;
> +
> +	hyp_spin_lock(&pool->lock);
> +	p = __hyp_alloc_pages(pool, mask, order);
> +	hyp_spin_unlock(&pool->lock);
> +
> +	return p ? hyp_page_to_virt(p) : NULL;

It looks weird not having __hyp_alloc_pages return the VA, but I guess later
patches will use __hyp_alloc_pages() for something else.

> +}
> +
> +/* hyp_vmemmap must be backed beforehand */
> +int hyp_pool_init(struct hyp_pool *pool, phys_addr_t phys,
> +		  unsigned int nr_pages, unsigned int used_pages)
> +{
> +	struct hyp_page *p;
> +	int i;
> +
> +	if (phys % PAGE_SIZE)
> +		return -EINVAL;

Maybe just take a pfn instead?

> +	hyp_spin_lock_init(&pool->lock);
> +	for (i = 0; i <= HYP_MAX_ORDER; i++)
> +		INIT_LIST_HEAD(&pool->free_area[i]);
> +	pool->range_start = phys;
> +	pool->range_end = phys + (nr_pages << PAGE_SHIFT);
> +
> +	/* Init the vmemmap portion */
> +	p = hyp_phys_to_page(phys);
> +	memset(p, 0, sizeof(*p) * nr_pages);
> +	for (i = 0; i < nr_pages; i++, p++) {
> +		p->pool = pool;
> +		INIT_LIST_HEAD(&p->node);
> +	}

Maybe index p like an array (e.g. p[i]) instead of maintaining two loop
increments?

> +
> +	/* Attach the unused pages to the buddy tree */
> +	p = hyp_phys_to_page(phys + (used_pages << PAGE_SHIFT));
> +	for (i = used_pages; i < nr_pages; i++, p++)
> +		__hyp_attach_page(pool, p);

Likewise.

Will

Will Deacon Feb. 2, 2021, 6:24 p.m. UTC | #23

On Fri, Jan 08, 2021 at 12:15:12PM +0000, Quentin Perret wrote:
> In order to re-map the guest vectors at EL2 when pKVM is enabled,
> refactor __kvm_vector_slot2idx() and kvm_init_vector_slot() to move all
> the address calculation logic in a static inline function.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_mmu.h | 8 ++++++++
>  arch/arm64/kvm/arm.c             | 9 +--------
>  2 files changed, 9 insertions(+), 8 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

Will

Will Deacon Feb. 3, 2021, 2:37 p.m. UTC | #24

On Fri, Jan 08, 2021 at 12:15:14PM +0000, Quentin Perret wrote:
> When memory protection is enabled, the Hyp code needs the ability to
> create and manage its own page-table. To do so, introduce a new set of
> hypercalls to initialize Hyp memory protection.
> 
> During the init hcall, the hypervisor runs with the host-provided
> page-table and uses the trivial early page allocator to create its own
> set of page-tables, using a memory pool that was donated by the host.
> Specifically, the hypervisor creates its own mappings for __hyp_text,
> the Hyp memory pool, the __hyp_bss, the portion of hyp_vmemmap
> corresponding to the Hyp pool, among other things. It then jumps back in
> the idmap page, switches to use the newly-created pgd (instead of the
> temporary one provided by the host) and then installs the full-fledged
> buddy allocator which will then be the only one in used from then on.
> 
> Note that for the sake of symplifying the review, this only introduces
> the code doing this operation, without actually being called by anyhing
> yet. This will be done in a subsequent patch, which will introduce the
> necessary host kernel changes.
> 
> Credits to Will for __pkvm_init_switch_pgd.
> 
> Co-authored-by: Will Deacon <will@kernel.org>
> Signed-off-by: Will Deacon <will@kernel.org>
> Signed-off-by: Quentin Perret <qperret@google.com>

[...]

> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
> index c0450828378b..a0e113734b20 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -100,4 +100,12 @@ void __noreturn hyp_panic(void);
>  void __noreturn __hyp_do_panic(bool restore_host, u64 spsr, u64 elr, u64 par);
>  #endif
>  
> +#ifdef __KVM_NVHE_HYPERVISOR__
> +void __pkvm_init_switch_pgd(phys_addr_t phys, unsigned long size,
> +			    phys_addr_t pgd, void *sp, void *cont_fn);
> +int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
> +		unsigned long *per_cpu_base);
> +void __noreturn __host_enter(struct kvm_cpu_context *host_ctxt);
> +#endif
> +
>  #endif /* __ARM64_KVM_HYP_H__ */
> diff --git a/arch/arm64/kernel/image-vars.h b/arch/arm64/kernel/image-vars.h
> index 43f3a1d6e92d..366d837f0d39 100644
> --- a/arch/arm64/kernel/image-vars.h
> +++ b/arch/arm64/kernel/image-vars.h
> @@ -113,6 +113,25 @@ KVM_NVHE_ALIAS_HYP(__memcpy, __pi_memcpy);
>  KVM_NVHE_ALIAS_HYP(__memset, __pi_memset);
>  #endif
>  
> +/* Hypevisor VA size */

typo: Hypervisor

> +KVM_NVHE_ALIAS(hyp_va_bits);
> +
> +/* Kernel memory sections */
> +KVM_NVHE_ALIAS(__start_rodata);
> +KVM_NVHE_ALIAS(__end_rodata);
> +KVM_NVHE_ALIAS(__bss_start);
> +KVM_NVHE_ALIAS(__bss_stop);
> +
> +/* Hyp memory sections */
> +KVM_NVHE_ALIAS(__hyp_idmap_text_start);
> +KVM_NVHE_ALIAS(__hyp_idmap_text_end);
> +KVM_NVHE_ALIAS(__hyp_text_start);
> +KVM_NVHE_ALIAS(__hyp_text_end);
> +KVM_NVHE_ALIAS(__hyp_data_ro_after_init_start);
> +KVM_NVHE_ALIAS(__hyp_data_ro_after_init_end);
> +KVM_NVHE_ALIAS(__hyp_bss_start);
> +KVM_NVHE_ALIAS(__hyp_bss_end);
> +
>  #endif /* CONFIG_KVM */
>  
>  #endif /* __ARM64_KERNEL_IMAGE_VARS_H */
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index 687598e41b21..b726332eec49 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -10,4 +10,4 @@ subdir-ccflags-y := -I$(incdir)				\
>  		    -DDISABLE_BRANCH_PROFILING		\
>  		    $(DISABLE_STACKLEAK_PLUGIN)
>  
> -obj-$(CONFIG_KVM) += vhe/ nvhe/ pgtable.o
> +obj-$(CONFIG_KVM) += vhe/ nvhe/ pgtable.o reserved_mem.o
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> index ed47674bc988..c8af6fe87bfb 100644
> --- a/arch/arm64/kvm/hyp/include/nvhe/memory.h
> +++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> @@ -6,6 +6,12 @@
>  
>  #include <linux/types.h>
>  
> +#define HYP_MEMBLOCK_REGIONS 128
> +struct hyp_memblock_region {
> +	phys_addr_t start;
> +	phys_addr_t end;
> +};
> +
>  struct hyp_pool;
>  struct hyp_page {
>  	unsigned int refcount;
> diff --git a/arch/arm64/kvm/hyp/include/nvhe/mm.h b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> new file mode 100644
> index 000000000000..f0cc09b127a5
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/include/nvhe/mm.h
> @@ -0,0 +1,79 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __KVM_HYP_MM_H
> +#define __KVM_HYP_MM_H
> +
> +#include <asm/kvm_pgtable.h>
> +#include <asm/spectre.h>
> +#include <linux/types.h>
> +
> +#include <nvhe/memory.h>
> +#include <nvhe/spinlock.h>
> +
> +extern struct hyp_memblock_region kvm_nvhe_sym(hyp_memory)[];
> +extern int kvm_nvhe_sym(hyp_memblock_nr);
> +extern struct kvm_pgtable pkvm_pgtable;
> +extern hyp_spinlock_t pkvm_pgd_lock;
> +extern struct hyp_pool hpool;
> +extern u64 __io_map_base;
> +extern u32 hyp_va_bits;
> +
> +int hyp_create_idmap(void);
> +int hyp_map_vectors(void);
> +int hyp_back_vmemmap(phys_addr_t phys, unsigned long size, phys_addr_t back);
> +int pkvm_cpu_set_vector(enum arm64_hyp_spectre_vector slot);
> +int pkvm_create_mappings(void *from, void *to, enum kvm_pgtable_prot prot);
> +int __pkvm_create_mappings(unsigned long start, unsigned long size,
> +			   unsigned long phys, unsigned long prot);
> +unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
> +					    unsigned long prot);
> +
> +static inline void hyp_vmemmap_range(phys_addr_t phys, unsigned long size,
> +				     unsigned long *start, unsigned long *end)
> +{
> +	unsigned long nr_pages = size >> PAGE_SHIFT;
> +	struct hyp_page *p = hyp_phys_to_page(phys);
> +
> +	*start = (unsigned long)p;
> +	*end = *start + nr_pages * sizeof(struct hyp_page);
> +	*start = ALIGN_DOWN(*start, PAGE_SIZE);
> +	*end = ALIGN(*end, PAGE_SIZE);
> +}
> +
> +static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages)
> +{
> +	unsigned long total = 0, i;
> +
> +	/* Provision the worst case scenario with 4 levels of page-table */
> +	for (i = 0; i < 4; i++) {

Looks like you want KVM_PGTABLE_MAX_LEVELS, so maybe move that into a
header?

> +		nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE);
> +		total += nr_pages;
> +	}

... that said, I'm not sure this needs to iterate at all. What exactly are
you trying to compute?

> +
> +	return total;
> +}
> +
> +static inline unsigned long hyp_s1_pgtable_size(void)
> +{
> +	struct hyp_memblock_region *reg;
> +	unsigned long nr_pages, res = 0;
> +	int i;
> +
> +	if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
> +		return 0;

It's a bit grotty having this be signed. Why do we need to encode the error
case differently from the 0 case?

> +
> +	for (i = 0; i < kvm_nvhe_sym(hyp_memblock_nr); i++) {
> +		reg = &kvm_nvhe_sym(hyp_memory)[i];

You could declare reg in the loop body.

> +		nr_pages = (reg->end - reg->start) >> PAGE_SHIFT;
> +		nr_pages = __hyp_pgtable_max_pages(nr_pages);

Maybe it would make more sense for __hyp_pgtable_max_pages to take the
size in bytes rather than pages, since most callers seem to have to do the
conversion?

> +		res += nr_pages << PAGE_SHIFT;
> +	}
> +
> +	/* Allow 1 GiB for private mappings */
> +	nr_pages = (1 << 30) >> PAGE_SHIFT;

SZ_1G >> PAGE_SHIFT

> +	nr_pages = __hyp_pgtable_max_pages(nr_pages);
> +	res += nr_pages << PAGE_SHIFT;
> +
> +	return res;

Might make more sense to keep res in pages until here, then just shift when
returning.

> +}
> +
> +#endif /* __KVM_HYP_MM_H */
> diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> index 72cfe53f106f..d7381a503182 100644
> --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> @@ -11,9 +11,9 @@ lib-objs := $(addprefix ../../../lib/, $(lib-objs))
>  
>  obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
>  	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
> -	 cache.o cpufeature.o
> +	 cache.o cpufeature.o setup.o mm.o
>  obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> -	 ../fpsimd.o ../hyp-entry.o ../exception.o
> +	 ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
>  obj-y += $(lib-objs)
>  
>  ##
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> index 31b060a44045..ad943966c39f 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> @@ -251,4 +251,35 @@ alternative_else_nop_endif
>  
>  SYM_CODE_END(__kvm_handle_stub_hvc)
>  
> +SYM_FUNC_START(__pkvm_init_switch_pgd)
> +	/* Turn the MMU off */
> +	pre_disable_mmu_workaround
> +	mrs	x2, sctlr_el2
> +	bic	x3, x2, #SCTLR_ELx_M
> +	msr	sctlr_el2, x3
> +	isb
> +
> +	tlbi	alle2
> +
> +	/* Install the new pgtables */
> +	ldr	x3, [x0, #NVHE_INIT_PGD_PA]
> +	phys_to_ttbr x4, x3
> +alternative_if ARM64_HAS_CNP
> +	orr	x4, x4, #TTBR_CNP_BIT
> +alternative_else_nop_endif
> +	msr	ttbr0_el2, x4
> +
> +	/* Set the new stack pointer */
> +	ldr	x0, [x0, #NVHE_INIT_STACK_HYP_VA]
> +	mov	sp, x0
> +
> +	/* And turn the MMU back on! */
> +	dsb	nsh
> +	isb
> +	msr	sctlr_el2, x2
> +	ic	iallu
> +	isb
> +	ret	x1
> +SYM_FUNC_END(__pkvm_init_switch_pgd)
> +
>  	.popsection
> diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> index a906f9e2ff34..3075f117651c 100644
> --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> @@ -6,12 +6,14 @@
>  
>  #include <hyp/switch.h>
>  
> +#include <asm/pgtable-types.h>
>  #include <asm/kvm_asm.h>
>  #include <asm/kvm_emulate.h>
>  #include <asm/kvm_host.h>
>  #include <asm/kvm_hyp.h>
>  #include <asm/kvm_mmu.h>
>  
> +#include <nvhe/mm.h>
>  #include <nvhe/trap_handler.h>
>  
>  DEFINE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
> @@ -106,6 +108,42 @@ static void handle___vgic_v3_restore_aprs(struct kvm_cpu_context *host_ctxt)
>  	__vgic_v3_restore_aprs(kern_hyp_va(cpu_if));
>  }
>  
> +static void handle___pkvm_init(struct kvm_cpu_context *host_ctxt)
> +{
> +	DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> +	DECLARE_REG(unsigned long, size, host_ctxt, 2);
> +	DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3);
> +	DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4);
> +
> +	cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base);

__pkvm_init() doesn't return, so I think this assignment back into host_ctxt
is confusing.

Also, I wonder if these bare numbers would be better hidden behind, e.g.

 #define DECLARE_ARG0(...)	DECLARE_REG(__VA_ARGS__, 1)
 ...
 #define DECLARE_RET(...)	DECLARE_REG(__VA_ARGS__, 1)

but it's cosmetic, so no need to change your patch. Just worried about
off-by-1s causing interesting behaviour!

> +
> +static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
> +{
> +	DECLARE_REG(enum arm64_hyp_spectre_vector, slot, host_ctxt, 1);
> +
> +	cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot);
> +}
> +
> +static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt)
> +{
> +	DECLARE_REG(unsigned long, start, host_ctxt, 1);
> +	DECLARE_REG(unsigned long, size, host_ctxt, 2);
> +	DECLARE_REG(unsigned long, phys, host_ctxt, 3);
> +	DECLARE_REG(unsigned long, prot, host_ctxt, 4);
> +
> +	cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, prot);
> +}
> +
> +static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ctxt)
> +{
> +	DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> +	DECLARE_REG(size_t, size, host_ctxt, 2);

Why the size_t vs unsigned long discrepancy with pkvm_create_mappings?
Same with phys_addr_t, although that one probably doesn't matter.

Also, the pgtable API uses an enum type for the prot bits.

> +	DECLARE_REG(unsigned long, prot, host_ctxt, 3);
> +
> +	cpu_reg(host_ctxt, 1) = __pkvm_create_private_mapping(phys, size, prot);
> +}
> +
>  typedef void (*hcall_t)(struct kvm_cpu_context *);
>  
>  #define HANDLE_FUNC(x)	[__KVM_HOST_SMCCC_FUNC_##x] = kimg_fn_ptr(handle_##x)
> @@ -125,6 +163,10 @@ static const hcall_t *host_hcall[] = {
>  	HANDLE_FUNC(__kvm_get_mdcr_el2),
>  	HANDLE_FUNC(__vgic_v3_save_aprs),
>  	HANDLE_FUNC(__vgic_v3_restore_aprs),
> +	HANDLE_FUNC(__pkvm_init),
> +	HANDLE_FUNC(__pkvm_cpu_set_vector),
> +	HANDLE_FUNC(__pkvm_create_mappings),
> +	HANDLE_FUNC(__pkvm_create_private_mapping),
>  };
>  
>  static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
> diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
> new file mode 100644
> index 000000000000..f3481646a94e
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/nvhe/mm.c
> @@ -0,0 +1,174 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2020 Google LLC
> + * Author: Quentin Perret <qperret@google.com>
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <asm/kvm_hyp.h>
> +#include <asm/kvm_mmu.h>
> +#include <asm/kvm_pgtable.h>
> +#include <asm/spectre.h>
> +
> +#include <nvhe/early_alloc.h>
> +#include <nvhe/gfp.h>
> +#include <nvhe/memory.h>
> +#include <nvhe/mm.h>
> +#include <nvhe/spinlock.h>
> +
> +struct kvm_pgtable pkvm_pgtable;
> +hyp_spinlock_t pkvm_pgd_lock;
> +u64 __io_map_base;
> +
> +struct hyp_memblock_region hyp_memory[HYP_MEMBLOCK_REGIONS];
> +int hyp_memblock_nr;
> +
> +int __pkvm_create_mappings(unsigned long start, unsigned long size,
> +			  unsigned long phys, unsigned long prot)
> +{
> +	int err;
> +
> +	hyp_spin_lock(&pkvm_pgd_lock);
> +	err = kvm_pgtable_hyp_map(&pkvm_pgtable, start, size, phys, prot);
> +	hyp_spin_unlock(&pkvm_pgd_lock);
> +
> +	return err;
> +}
> +
> +unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
> +					   unsigned long prot)
> +{
> +	unsigned long addr;
> +	int ret;
> +
> +	hyp_spin_lock(&pkvm_pgd_lock);
> +
> +	size = PAGE_ALIGN(size + offset_in_page(phys));

It might just be simpler to require page-aligned size and phys in the
caller. At least, for the vectors that should be straightforward because
I think they're guaranteed not to span a page boundary.

> +	addr = __io_map_base;
> +	__io_map_base += size;
> +
> +	/* Are we overflowing on the vmemmap ? */
> +	if (__io_map_base > __hyp_vmemmap) {
> +		__io_map_base -= size;
> +		addr = 0;

Can we use ERR_PTR(), or does that fail miserably at EL2?

> +		goto out;
> +	}
> +
> +	ret = kvm_pgtable_hyp_map(&pkvm_pgtable, addr, size, phys, prot);
> +	if (ret) {
> +		addr = 0;
> +		goto out;
> +	}
> +
> +	addr = addr + offset_in_page(phys);
> +out:
> +	hyp_spin_unlock(&pkvm_pgd_lock);
> +
> +	return addr;
> +}

[...]

> +static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
> +				 unsigned long *per_cpu_base)
> +{
> +	void *start, *end, *virt = hyp_phys_to_virt(phys);
> +	int ret, i;
> +
> +	/* Recreate the hyp page-table using the early page allocator */
> +	hyp_early_alloc_init(hyp_pgt_base, hyp_s1_pgtable_size());
> +	ret = kvm_pgtable_hyp_init(&pkvm_pgtable, hyp_va_bits,
> +				   &hyp_early_alloc_mm_ops);
> +	if (ret)
> +		return ret;
> +
> +	ret = hyp_create_idmap();
> +	if (ret)
> +		return ret;
> +
> +	ret = hyp_map_vectors();
> +	if (ret)
> +		return ret;
> +
> +	ret = hyp_back_vmemmap(phys, size, hyp_virt_to_phys(vmemmap_base));
> +	if (ret)
> +		return ret;
> +
> +	ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_text_start),
> +				  hyp_symbol_addr(__hyp_text_end),
> +				  PAGE_HYP_EXEC);
> +	if (ret)
> +		return ret;
> +
> +	ret = pkvm_create_mappings(hyp_symbol_addr(__start_rodata),
> +				   hyp_symbol_addr(__end_rodata), PAGE_HYP_RO);
> +	if (ret)
> +		return ret;
> +
> +	ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_data_ro_after_init_start),
> +				   hyp_symbol_addr(__hyp_data_ro_after_init_end),
> +				   PAGE_HYP_RO);
> +	if (ret)
> +		return ret;
> +
> +	ret = pkvm_create_mappings(hyp_symbol_addr(__bss_start),

__hyp_bss_start

> +				   hyp_symbol_addr(__hyp_bss_end), PAGE_HYP);
> +	if (ret)
> +		return ret;
> +
> +	ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_bss_end),
> +				   hyp_symbol_addr(__bss_stop), PAGE_HYP_RO);
> +	if (ret)
> +		return ret;
> +
> +	ret = pkvm_create_mappings(virt, virt + size - 1, PAGE_HYP);

Why is the range inclusive here?

> +	if (ret)
> +		return ret;
> +
> +	for (i = 0; i < hyp_nr_cpus; i++) {
> +		start = (void *)kern_hyp_va(per_cpu_base[i]);
> +		end = start + PAGE_ALIGN(hyp_percpu_size);
> +		ret = pkvm_create_mappings(start, end, PAGE_HYP);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void update_nvhe_init_params(void)
> +{
> +	struct kvm_nvhe_init_params *params;
> +	unsigned long i, stack;
> +
> +	for (i = 0; i < hyp_nr_cpus; i++) {
> +		stack = (unsigned long)stacks_base + (i << PAGE_SHIFT);
> +		params = per_cpu_ptr(&kvm_init_params, i);
> +		params->stack_hyp_va = stack + PAGE_SIZE;
> +		params->pgd_pa = __hyp_pa(pkvm_pgtable.pgd);
> +		__flush_dcache_area(params, sizeof(*params));
> +	}
> +}
> +
> +static void *hyp_zalloc_hyp_page(void *arg)
> +{
> +	return hyp_alloc_pages(&hpool, HYP_GFP_ZERO, 0);
> +}
> +
> +void __noreturn __pkvm_init_finalise(void)
> +{
> +	struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
> +	struct kvm_cpu_context *host_ctxt = &host_data->host_ctxt;
> +	unsigned long nr_pages, used_pages;
> +	int ret;
> +
> +	/* Now that the vmemmap is backed, install the full-fledged allocator */
> +	nr_pages = hyp_s1_pgtable_size() >> PAGE_SHIFT;
> +	used_pages = hyp_early_alloc_nr_pages();
> +	ret = hyp_pool_init(&hpool, __hyp_pa(hyp_pgt_base), nr_pages, used_pages);
> +	if (ret)
> +		goto out;
> +
> +	pkvm_pgtable_mm_ops.zalloc_page = hyp_zalloc_hyp_page;
> +	pkvm_pgtable_mm_ops.phys_to_virt = hyp_phys_to_virt;
> +	pkvm_pgtable_mm_ops.virt_to_phys = hyp_virt_to_phys;
> +	pkvm_pgtable_mm_ops.get_page = hyp_get_page;
> +	pkvm_pgtable_mm_ops.put_page = hyp_put_page;
> +	pkvm_pgtable.mm_ops = &pkvm_pgtable_mm_ops;
> +
> +out:
> +	host_ctxt->regs.regs[0] = SMCCC_RET_SUCCESS;
> +	host_ctxt->regs.regs[1] = ret;

Use the cpu_reg() helper for these?

> +
> +	__host_enter(host_ctxt);
> +}
> +
> +int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
> +		unsigned long *per_cpu_base)
> +{
> +	struct kvm_nvhe_init_params *params;
> +	void *virt = hyp_phys_to_virt(phys);
> +	void (*fn)(phys_addr_t params_pa, void *finalize_fn_va);
> +	int ret;
> +
> +	if (phys % PAGE_SIZE || size % PAGE_SIZE || (u64)virt % PAGE_SIZE)
> +		return -EINVAL;
> +
> +	hyp_spin_lock_init(&pkvm_pgd_lock);
> +	hyp_nr_cpus = nr_cpus;
> +
> +	ret = divide_memory_pool(virt, size);
> +	if (ret)
> +		return ret;
> +
> +	ret = recreate_hyp_mappings(phys, size, per_cpu_base);
> +	if (ret)
> +		return ret;
> +
> +	update_nvhe_init_params();
> +
> +	/* Jump in the idmap page to switch to the new page-tables */
> +	params = this_cpu_ptr(&kvm_init_params);
> +	fn = (typeof(fn))__hyp_pa(hyp_symbol_addr(__pkvm_init_switch_pgd));
> +	fn(__hyp_pa(params), hyp_symbol_addr(__pkvm_init_finalise));
> +
> +	unreachable();
> +}
> diff --git a/arch/arm64/kvm/hyp/reserved_mem.c b/arch/arm64/kvm/hyp/reserved_mem.c
> new file mode 100644
> index 000000000000..32f648992835
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/reserved_mem.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (C) 2020 - Google LLC
> + * Author: Quentin Perret <qperret@google.com>
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <linux/memblock.h>
> +#include <linux/sort.h>
> +
> +#include <asm/kvm_host.h>
> +
> +#include <nvhe/memory.h>
> +#include <nvhe/mm.h>
> +
> +phys_addr_t hyp_mem_base;
> +phys_addr_t hyp_mem_size;
> +
> +int __init early_init_dt_add_memory_hyp(u64 base, u64 size)
> +{
> +	struct hyp_memblock_region *reg;
> +
> +	if (kvm_nvhe_sym(hyp_memblock_nr) >= HYP_MEMBLOCK_REGIONS)
> +		kvm_nvhe_sym(hyp_memblock_nr) = -1;
> +
> +	if (kvm_nvhe_sym(hyp_memblock_nr) < 0)
> +		return -ENOMEM;
> +
> +	reg = kvm_nvhe_sym(hyp_memory);
> +	reg[kvm_nvhe_sym(hyp_memblock_nr)].start = base;
> +	reg[kvm_nvhe_sym(hyp_memblock_nr)].end = base + size;
> +	kvm_nvhe_sym(hyp_memblock_nr)++;
> +
> +	return 0;
> +}

This isn't called by anything in this patch afaict, so it's a bit tricky to
review, especially as I was trying to see how it interacts with
kvm_hyp_reserve(), which reads hyp_memblock_nr.

> +
> +static int cmp_hyp_memblock(const void *p1, const void *p2)
> +{
> +	const struct hyp_memblock_region *r1 = p1;
> +	const struct hyp_memblock_region *r2 = p2;
> +
> +	return r1->start < r2->start ? -1 : (r1->start > r2->start);
> +}
> +
> +static void __init sort_memblock_regions(void)
> +{
> +	sort(kvm_nvhe_sym(hyp_memory),
> +	     kvm_nvhe_sym(hyp_memblock_nr),
> +	     sizeof(struct hyp_memblock_region),
> +	     cmp_hyp_memblock,
> +	     NULL);
> +}
> +
> +void __init kvm_hyp_reserve(void)
> +{
> +	u64 nr_pages, prev;
> +
> +	if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
> +		return;
> +
> +	if (kvm_get_mode() != KVM_MODE_PROTECTED)
> +		return;
> +
> +	if (kvm_nvhe_sym(hyp_memblock_nr) < 0) {
> +		kvm_err("Failed to register hyp memblocks\n");
> +		return;
> +	}
> +
> +	sort_memblock_regions();
> +
> +	/*
> +	 * We don't know the number of possible CPUs yet, so allocate for the
> +	 * worst case.
> +	 */
> +	hyp_mem_size += NR_CPUS << PAGE_SHIFT;

There was a recent patch bumping NR_CPUs to 512, so this would be 32MB
with 64k pages. Is it possible to return memory to the host later on once
we have a better handle on the number of CPUs in the system?

> +	hyp_mem_size += hyp_s1_pgtable_size();
> +
> +	/*
> +	 * The hyp_vmemmap needs to be backed by pages, but these pages
> +	 * themselves need to be present in the vmemmap, so compute the number
> +	 * of pages needed by looking for a fixed point.
> +	 */
> +	nr_pages = 0;
> +	do {
> +		prev = nr_pages;
> +		nr_pages = (hyp_mem_size >> PAGE_SHIFT) + prev;
> +		nr_pages = DIV_ROUND_UP(nr_pages * sizeof(struct hyp_page), PAGE_SIZE);
> +		nr_pages += __hyp_pgtable_max_pages(nr_pages);
> +	} while (nr_pages != prev);
> +	hyp_mem_size += nr_pages << PAGE_SHIFT;
> +
> +	hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
> +					      hyp_mem_size, SZ_2M);

Why SZ_2M? Guessing you might mean PMD_SIZE, although then we will probably
want to retry with smaller alignment if the allocation fails as this can
again be large with e.g. 64k pages.

> +	if (!hyp_mem_base) {
> +		kvm_err("Failed to reserve hyp memory\n");
> +		return;
> +	}
> +	memblock_reserve(hyp_mem_base, hyp_mem_size);
> +
> +	kvm_info("Reserved %lld MiB at 0x%llx\n", hyp_mem_size >> 20,
> +		 hyp_mem_base);
> +}
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 278e163beda4..3cf9397dabdb 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1264,10 +1264,10 @@ static struct kvm_pgtable_mm_ops kvm_hyp_mm_ops = {
>  	.virt_to_phys		= kvm_host_pa,
>  };
>  
> +u32 hyp_va_bits;

Perhaps it would be better to pass this to __pkvm_init() instead of making
it global?

Will

Will Deacon Feb. 3, 2021, 3:31 p.m. UTC | #25

On Fri, Jan 08, 2021 at 12:15:15PM +0000, Quentin Perret wrote:
> Previous commits have introduced infrastructure at EL2 to enable the Hyp
> code to manage its own memory, and more specifically its stage 1 page
> tables. However, this was preliminary work, and none of it is currently
> in use.
> 
> Put all of this together by elevating the hyp mappings creation at EL2
> when memory protection is enabled. In this case, the host kernel running
> at EL1 still creates _temporary_ Hyp mappings, only used while
> initializing the hypervisor, but frees them right after.
> 
> As such, all calls to create_hyp_mappings() after kvm init has finished
> turn into hypercalls, as the host now has no 'legal' way to modify the
> hypevisor page tables directly.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_mmu.h |  1 -
>  arch/arm64/kvm/arm.c             | 62 +++++++++++++++++++++++++++++---
>  arch/arm64/kvm/mmu.c             | 34 ++++++++++++++++++
>  3 files changed, 92 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index d7ebd73ec86f..6c8466a042a9 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -309,6 +309,5 @@ static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
>  	 */
>  	asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
>  }
> -
>  #endif /* __ASSEMBLY__ */
>  #endif /* __ARM64_KVM_MMU_H__ */
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 6af9204bcd5b..e524682c2ccf 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -1421,7 +1421,7 @@ static void cpu_prepare_hyp_mode(int cpu)
>  	kvm_flush_dcache_to_poc(params, sizeof(*params));
>  }
>  
> -static void cpu_init_hyp_mode(void)
> +static void kvm_set_hyp_vector(void)

Please do something about the naming: now we have both cpu_set_hyp_vector()
and kvm_set_hyp_vector()!

>  {
>  	struct kvm_nvhe_init_params *params;
>  	struct arm_smccc_res res;
> @@ -1439,6 +1439,11 @@ static void cpu_init_hyp_mode(void)
>  	params = this_cpu_ptr_nvhe_sym(kvm_init_params);
>  	arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(__kvm_hyp_init), virt_to_phys(params), &res);
>  	WARN_ON(res.a0 != SMCCC_RET_SUCCESS);
> +}
> +
> +static void cpu_init_hyp_mode(void)
> +{
> +	kvm_set_hyp_vector();
>  
>  	/*
>  	 * Disabling SSBD on a non-VHE system requires us to enable SSBS
> @@ -1481,7 +1486,10 @@ static void cpu_set_hyp_vector(void)
>  	struct bp_hardening_data *data = this_cpu_ptr(&bp_hardening_data);
>  	void *vector = hyp_spectre_vector_selector[data->slot];
>  
> -	*this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> +	if (!is_protected_kvm_enabled())
> +		*this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> +	else
> +		kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);

*Very* minor nit, but it might be cleaner to have static inline functions
with the same prototypes as the hypercalls, just to make the code even
easier to read. e.g

	if (!is_protected_kvm_enabled())
		_cpu_set_vector(data->slot);
	else
		kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);

you could then conceivably wrap that in a macro and avoid having the
"is_protected_kvm_enabled()" checks explicit every time.

>  }
>  
>  static void cpu_hyp_reinit(void)
> @@ -1489,13 +1497,14 @@ static void cpu_hyp_reinit(void)
>  	kvm_init_host_cpu_context(&this_cpu_ptr_hyp_sym(kvm_host_data)->host_ctxt);
>  
>  	cpu_hyp_reset();
> -	cpu_set_hyp_vector();
>  
>  	if (is_kernel_in_hyp_mode())
>  		kvm_timer_init_vhe();
>  	else
>  		cpu_init_hyp_mode();
>  
> +	cpu_set_hyp_vector();
> +
>  	kvm_arm_init_debug();
>  
>  	if (vgic_present)
> @@ -1714,13 +1723,52 @@ static int copy_cpu_ftr_regs(void)
>  	return 0;
>  }
>  
> +static int kvm_hyp_enable_protection(void)
> +{
> +	void *per_cpu_base = kvm_ksym_ref(kvm_arm_hyp_percpu_base);
> +	int ret, cpu;
> +	void *addr;
> +
> +	if (!is_protected_kvm_enabled())
> +		return 0;

Maybe I'm hung up on my previous suggestion, but I feel like we shouldn't
get here if protected kvm isn't enabled.

> +	if (!hyp_mem_base)
> +		return -ENOMEM;
> +
> +	addr = phys_to_virt(hyp_mem_base);
> +	ret = create_hyp_mappings(addr, addr + hyp_mem_size - 1, PAGE_HYP);
> +	if (ret)
> +		return ret;
> +
> +	preempt_disable();
> +	kvm_set_hyp_vector();
> +	ret = kvm_call_hyp_nvhe(__pkvm_init, hyp_mem_base, hyp_mem_size,
> +				num_possible_cpus(), kern_hyp_va(per_cpu_base));

Would it make sense for the __pkvm_init() hypercall to set the vector as
well, so that we wouldn't need to disable preemption over two hypercalls?

Failing that, maybe move the whole preempt_disable/enable sequence into
another function.

> +	preempt_enable();
> +	if (ret)
> +		return ret;
> +
> +	free_hyp_pgds();
> +	for_each_possible_cpu(cpu)
> +		free_page(per_cpu(kvm_arm_hyp_stack_page, cpu));
> +
> +	return 0;
> +}
> +
>  /**
>   * Inits Hyp-mode on all online CPUs
>   */
>  static int init_hyp_mode(void)
>  {
>  	int cpu;
> -	int err = 0;
> +	int err = -ENOMEM;
> +
> +	/*
> +	 * The protected Hyp-mode cannot be initialized if the memory pool
> +	 * allocation has failed.
> +	 */
> +	if (is_protected_kvm_enabled() && !hyp_mem_base)
> +		return err;
>  
>  	/*
>  	 * Copy the required CPU feature register in their EL2 counterpart
> @@ -1854,6 +1902,12 @@ static int init_hyp_mode(void)
>  	for_each_possible_cpu(cpu)
>  		cpu_prepare_hyp_mode(cpu);
>  
> +	err = kvm_hyp_enable_protection();
> +	if (err) {
> +		kvm_err("Failed to enable hyp memory protection: %d\n", err);
> +		goto out_err;
> +	}
> +
>  	return 0;
>  
>  out_err:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index 3cf9397dabdb..9d4c9251208e 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -225,15 +225,39 @@ void free_hyp_pgds(void)
>  	if (hyp_pgtable) {
>  		kvm_pgtable_hyp_destroy(hyp_pgtable);
>  		kfree(hyp_pgtable);
> +		hyp_pgtable = NULL;
>  	}
>  	mutex_unlock(&kvm_hyp_pgd_mutex);
>  }
>  
> +static bool kvm_host_owns_hyp_mappings(void)
> +{
> +	if (static_branch_likely(&kvm_protected_mode_initialized))
> +		return false;
> +
> +	/*
> +	 * This can happen at boot time when __create_hyp_mappings() is called
> +	 * after the hyp protection has been enabled, but the static key has
> +	 * not been flipped yet.
> +	 */
> +	if (!hyp_pgtable && is_protected_kvm_enabled())
> +		return false;
> +
> +	BUG_ON(!hyp_pgtable);

Can we fail more gracefully, e.g. by continuing without KVM?

Will

Will Deacon Feb. 3, 2021, 3:34 p.m. UTC | #26

On Fri, Jan 08, 2021 at 12:15:16PM +0000, Quentin Perret wrote:
> In order to make use of the stage 2 pgtable code for the host stage 2,
> use struct kvm_arch in lieu of struct kvm as the host will have the
> former but not the latter.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h | 5 +++--
>  arch/arm64/kvm/hyp/pgtable.c         | 6 +++---
>  arch/arm64/kvm/mmu.c                 | 2 +-
>  3 files changed, 7 insertions(+), 6 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

Will

Will Deacon Feb. 3, 2021, 3:38 p.m. UTC | #27

On Fri, Jan 08, 2021 at 12:15:17PM +0000, Quentin Perret wrote:
> In order to make use of the stage 2 pgtable code for the host stage 2,
> change kvm_s2_mmu to use a kvm_arch pointer in lieu of the kvm pointer,
> as the host will have the former but not the latter.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_host.h | 2 +-
>  arch/arm64/include/asm/kvm_mmu.h  | 7 ++++++-
>  arch/arm64/kvm/mmu.c              | 8 ++++----
>  3 files changed, 11 insertions(+), 6 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

Will

Will Deacon Feb. 3, 2021, 3:53 p.m. UTC | #28

On Fri, Jan 08, 2021 at 12:15:19PM +0000, Quentin Perret wrote:
> In order to re-use some of the stage 2 setup at EL2, factor parts of
> kvm_arm_setup_stage2() out into static inline functions.
> 
> No functional change intended.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_mmu.h | 48 ++++++++++++++++++++++++++++++++
>  arch/arm64/kvm/reset.c           | 42 +++-------------------------
>  2 files changed, 52 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 662f0415344e..83b4c5cf4768 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -280,6 +280,54 @@ static inline int kvm_write_guest_lock(struct kvm *kvm, gpa_t gpa,
>  	return ret;
>  }
>  
> +static inline u64 kvm_get_parange(u64 mmfr0)
> +{
> +	u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
> +				ID_AA64MMFR0_PARANGE_SHIFT);
> +	if (parange > ID_AA64MMFR0_PARANGE_MAX)
> +		parange = ID_AA64MMFR0_PARANGE_MAX;
> +
> +	return parange;
> +}
> +
> +/*
> + * The VTCR value is common across all the physical CPUs on the system.
> + * We use system wide sanitised values to fill in different fields,
> + * except for Hardware Management of Access Flags. HA Flag is set
> + * unconditionally on all CPUs, as it is safe to run with or without
> + * the feature and the bit is RES0 on CPUs that don't support it.
> + */
> +static inline u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
> +{
> +	u64 vtcr = VTCR_EL2_FLAGS;
> +	u8 lvls;
> +
> +	vtcr |= kvm_get_parange(mmfr0) << VTCR_EL2_PS_SHIFT;
> +	vtcr |= VTCR_EL2_T0SZ(phys_shift);
> +	/*
> +	 * Use a minimum 2 level page table to prevent splitting
> +	 * host PMD huge pages at stage2.
> +	 */
> +	lvls = stage2_pgtable_levels(phys_shift);
> +	if (lvls < 2)
> +		lvls = 2;
> +	vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
> +
> +	/*
> +	 * Enable the Hardware Access Flag management, unconditionally
> +	 * on all CPUs. The features is RES0 on CPUs without the support
> +	 * and must be ignored by the CPUs.
> +	 */
> +	vtcr |= VTCR_EL2_HA;
> +
> +	/* Set the vmid bits */
> +	vtcr |= (get_vmid_bits(mmfr1) == 16) ?
> +		VTCR_EL2_VS_16BIT :
> +		VTCR_EL2_VS_8BIT;
> +
> +	return vtcr;
> +}

Although I think this is functionally fine, I think it's unusual to see
large "static inline" functions like this in shared header files. One
alternative approach would be to follow the example of
kernel/locking/qspinlock_paravirt.h, where the header is guarded in such a
way that is only ever included by kernel/locking/qspinlock.c and therefore
doesn't need the "inline" at all. That separation really helps, I think.

Will

Will Deacon Feb. 3, 2021, 3:54 p.m. UTC | #29

On Fri, Jan 08, 2021 at 12:15:20PM +0000, Quentin Perret wrote:
> Refactor __load_guest_stage2() to introduce __load_stage2() which will
> be re-used when loading the host stage 2.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_mmu.h | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

Will

Will Deacon Feb. 3, 2021, 3:58 p.m. UTC | #30

On Fri, Jan 08, 2021 at 12:15:21PM +0000, Quentin Perret wrote:
> Refactor __populate_fault_info() to introduce __get_fault_info() which
> will be used once the host is wrapped in a stage 2.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/kvm/hyp/include/hyp/switch.h | 36 +++++++++++++++----------
>  1 file changed, 22 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
> index 84473574c2e7..e9005255d639 100644
> --- a/arch/arm64/kvm/hyp/include/hyp/switch.h
> +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
> @@ -157,19 +157,9 @@ static inline bool __translate_far_to_hpfar(u64 far, u64 *hpfar)
>  	return true;
>  }
>  
> -static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
> +static inline bool __get_fault_info(u64 esr, u64 *far, u64 *hpfar)

Could this take a pointer to a struct kvm_vcpu_fault_info instead?

Will

Will Deacon Feb. 3, 2021, 3:59 p.m. UTC | #31

On Fri, Jan 08, 2021 at 12:15:22PM +0000, Quentin Perret wrote:
> The current stage2 page-table allocator uses a memcache to get
> pre-allocated pages when it needs any. To allow re-using this code at
> EL2 which uses a concept of memory pools, make the memcache argument to
> kvm_pgtable_stage2_map() anonymous. and let the mm_ops zalloc_page()
> callbacks use it the way they need to.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_pgtable.h | 6 +++---
>  arch/arm64/kvm/hyp/pgtable.c         | 4 ++--
>  2 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> index 8e8f1d2c5e0e..d846bc3d3b77 100644
> --- a/arch/arm64/include/asm/kvm_pgtable.h
> +++ b/arch/arm64/include/asm/kvm_pgtable.h
> @@ -176,8 +176,8 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
>   * @size:	Size of the mapping.
>   * @phys:	Physical address of the memory to map.
>   * @prot:	Permissions and attributes for the mapping.
> - * @mc:		Cache of pre-allocated GFP_PGTABLE_USER memory from which to
> - *		allocate page-table pages.
> + * @mc:		Cache of pre-allocated memory from which to allocate page-table
> + *		pages.

We should probably mention that this memory must be zeroed, since I don't
think the page-table code takes care of that.

Will

Will Deacon Feb. 3, 2021, 4:05 p.m. UTC | #32

On Fri, Jan 08, 2021 at 12:15:18PM +0000, Quentin Perret wrote:
> Move the registers relevant to host stage 2 enablement to
> kvm_nvhe_init_params to prepare the ground for enabling it in later
> patches.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_asm.h   | 3 +++
>  arch/arm64/kernel/asm-offsets.c    | 3 +++
>  arch/arm64/kvm/arm.c               | 5 +++++
>  arch/arm64/kvm/hyp/nvhe/hyp-init.S | 9 +++++++++
>  arch/arm64/kvm/hyp/nvhe/switch.c   | 5 +----
>  5 files changed, 21 insertions(+), 4 deletions(-)

Acked-by: Will Deacon <will@kernel.org>

Will

Will Deacon Feb. 3, 2021, 4:11 p.m. UTC | #33

On Fri, Jan 08, 2021 at 12:15:24PM +0000, Quentin Perret wrote:
> When KVM runs in protected nVHE mode, make use of a stage 2 page-table
> to give the hypervisor some control over the host memory accesses. At
> the moment all memory aborts from the host will be instantly idmapped
> RWX at stage 2 in a lazy fashion. Later patches will make use of that
> infrastructure to implement access control restrictions to e.g. protect
> guest memory from the host.
> 
> Signed-off-by: Quentin Perret <qperret@google.com>
> ---
>  arch/arm64/include/asm/kvm_cpufeature.h       |   2 +
>  arch/arm64/kernel/image-vars.h                |   3 +
>  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  33 +++
>  arch/arm64/kvm/hyp/nvhe/Makefile              |   2 +-
>  arch/arm64/kvm/hyp/nvhe/hyp-init.S            |   1 +
>  arch/arm64/kvm/hyp/nvhe/hyp-main.c            |   6 +
>  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 191 ++++++++++++++++++
>  arch/arm64/kvm/hyp/nvhe/setup.c               |   6 +
>  arch/arm64/kvm/hyp/nvhe/switch.c              |   7 +-
>  arch/arm64/kvm/hyp/nvhe/tlb.c                 |   4 +-
>  10 files changed, 248 insertions(+), 7 deletions(-)
>  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
>  create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c

[...]

> +void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
> +{
> +	enum kvm_pgtable_prot prot;
> +	u64 far, hpfar, esr, ipa;
> +	int ret;
> +
> +	esr = read_sysreg_el2(SYS_ESR);
> +	if (!__get_fault_info(esr, &far, &hpfar))
> +		hyp_panic();
> +
> +	prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W | KVM_PGTABLE_PROT_X;
> +	ipa = (hpfar & HPFAR_MASK) << 8;
> +	ret = host_stage2_map(ipa, PAGE_SIZE, prot);

Can we try to put down a block mapping if the whole thing falls within
memory?

Will

Quentin Perret Feb. 3, 2021, 6:33 p.m. UTC | #34

Hey Will,

On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> Hi Quentin,
> 
> Sorry for the delay, this one took me a while to grok.

No need to be sorry, thanks for having a look!

> On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > When memory protection is enabled, the hyp code will require a basic
> > form of memory management in order to allocate and free memory pages at
> > EL2. This is needed for various use-cases, including the creation of hyp
> > mappings or the allocation of stage 2 page tables.
> > 
> > To address these use-case, introduce a simple memory allocator in the
> > hyp code. The allocator is designed as a conventional 'buddy allocator',
> > working with a page granularity. It allows to allocate and free
> > physically contiguous pages from memory 'pools', with a guaranteed order
> > alignment in the PA space. Each page in a memory pool is associated
> > with a struct hyp_page which holds the page's metadata, including its
> > refcount, as well as its current order, hence mimicking the kernel's
> > buddy system in the GFP infrastructure. The hyp_page metadata are made
> > accessible through a hyp_vmemmap, following the concept of
> > SPARSE_VMEMMAP in the kernel.
> > 
> > Signed-off-by: Quentin Perret <qperret@google.com>
> > ---
> >  arch/arm64/kvm/hyp/include/nvhe/gfp.h    |  32 ++++
> >  arch/arm64/kvm/hyp/include/nvhe/memory.h |  25 +++
> >  arch/arm64/kvm/hyp/nvhe/Makefile         |   2 +-
> >  arch/arm64/kvm/hyp/nvhe/page_alloc.c     | 185 +++++++++++++++++++++++
> >  4 files changed, 243 insertions(+), 1 deletion(-)
> >  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/gfp.h
> >  create mode 100644 arch/arm64/kvm/hyp/nvhe/page_alloc.c
> > 
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/gfp.h b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> > new file mode 100644
> > index 000000000000..95587faee171
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/gfp.h
> > @@ -0,0 +1,32 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +#ifndef __KVM_HYP_GFP_H
> > +#define __KVM_HYP_GFP_H
> > +
> > +#include <linux/list.h>
> > +
> > +#include <nvhe/memory.h>
> > +#include <nvhe/spinlock.h>
> > +
> > +#define HYP_MAX_ORDER	11U
> 
> Could we just use MAX_ORDER here?

Sure, that would work too. I just figured we might decide to set this to
a lower value in the future -- this effectively limits the size of the
largest portion of memory we can allocate, so maybe it would make sense
to set that to match the size of the largest concatenated pgd for ex,
hence minimizing the overhead of struct hyp_pool. But I suppose we
can also do that later, so...

> > +#define HYP_NO_ORDER	UINT_MAX
> > +
> > +struct hyp_pool {
> > +	hyp_spinlock_t lock;
> 
> A comment about what this lock protects would be handy, especially as the
> 'refcount' field of 'struct hyp_page' isn't updated atomically. I think it
> also means that we don't have a safe way to move a page from one pool to
> another; it's fixed forever once the page has been made available for
> allocation.

Indeed, there is currently no good way to do this. I'll stick a comment.

> > +	struct list_head free_area[HYP_MAX_ORDER + 1];
> > +	phys_addr_t range_start;
> > +	phys_addr_t range_end;
> > +};
> > +
> > +/* GFP flags */
> > +#define HYP_GFP_NONE	0
> > +#define HYP_GFP_ZERO	1
> > +
> > +/* Allocation */
> > +void *hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask, unsigned int order);
> > +void hyp_get_page(void *addr);
> > +void hyp_put_page(void *addr);
> > +
> > +/* Used pages cannot be freed */
> > +int hyp_pool_init(struct hyp_pool *pool, phys_addr_t phys,
> > +		  unsigned int nr_pages, unsigned int used_pages);
> 
> Maybe "reserved_pages" would be a better name than "used_pages"?

That works too. These pages could maybe use a bit of love as well,
they're the pages that have been allocated by the early allocator before
we hand over the memory pool to this allocator. So we might want to do
something about them (such as fixup their refcount).

> > +#endif /* __KVM_HYP_GFP_H */
> > diff --git a/arch/arm64/kvm/hyp/include/nvhe/memory.h b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> > index 64c44c142c95..ed47674bc988 100644
> > --- a/arch/arm64/kvm/hyp/include/nvhe/memory.h
> > +++ b/arch/arm64/kvm/hyp/include/nvhe/memory.h
> > @@ -6,7 +6,17 @@
> >  
> >  #include <linux/types.h>
> >  
> > +struct hyp_pool;
> > +struct hyp_page {
> > +	unsigned int refcount;
> > +	unsigned int order;
> > +	struct hyp_pool *pool;
> > +	struct list_head node;
> > +};
> > +
> >  extern s64 hyp_physvirt_offset;
> > +extern u64 __hyp_vmemmap;
> > +#define hyp_vmemmap ((struct hyp_page *)__hyp_vmemmap)
> >  
> >  #define __hyp_pa(virt)	((phys_addr_t)(virt) + hyp_physvirt_offset)
> >  #define __hyp_va(virt)	((void *)((phys_addr_t)(virt) - hyp_physvirt_offset))
> > @@ -21,4 +31,19 @@ static inline phys_addr_t hyp_virt_to_phys(void *addr)
> >  	return __hyp_pa(addr);
> >  }
> >  
> > +#define hyp_phys_to_pfn(phys)	((phys) >> PAGE_SHIFT)
> > +#define hyp_phys_to_page(phys)	(&hyp_vmemmap[hyp_phys_to_pfn(phys)])
> > +#define hyp_virt_to_page(virt)	hyp_phys_to_page(__hyp_pa(virt))
> > +
> > +#define hyp_page_to_phys(page)  ((phys_addr_t)((page) - hyp_vmemmap) << PAGE_SHIFT)
> 
> Maybe implement this in terms of a new hyp_page_to_pfn() macro?

Sure, should be easy enough.

> > +#define hyp_page_to_virt(page)	__hyp_va(hyp_page_to_phys(page))
> > +#define hyp_page_to_pool(page)	(((struct hyp_page *)page)->pool)
> > +
> > +static inline int hyp_page_count(void *addr)
> > +{
> > +	struct hyp_page *p = hyp_virt_to_page(addr);
> > +
> > +	return p->refcount;
> > +}
> > +
> >  #endif /* __KVM_HYP_MEMORY_H */
> > diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> > index 33bd381d8f73..9e5eacfec6ec 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> > +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> > @@ -10,7 +10,7 @@ lib-objs := clear_page.o copy_page.o memcpy.o memset.o
> >  lib-objs := $(addprefix ../../../lib/, $(lib-objs))
> >  
> >  obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> > -	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o
> > +	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o
> >  obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> >  	 ../fpsimd.o ../hyp-entry.o ../exception.o
> >  obj-y += $(lib-objs)
> > diff --git a/arch/arm64/kvm/hyp/nvhe/page_alloc.c b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> > new file mode 100644
> > index 000000000000..6de6515f0432
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/nvhe/page_alloc.c
> > @@ -0,0 +1,185 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2020 Google LLC
> > + * Author: Quentin Perret <qperret@google.com>
> > + */
> > +
> > +#include <asm/kvm_hyp.h>
> > +#include <nvhe/gfp.h>
> > +
> > +u64 __hyp_vmemmap;
> > +
> > +/*
> > + * Example buddy-tree for a 4-pages physically contiguous pool:
> > + *
> > + *                 o : Page 3
> > + *                /
> > + *               o-o : Page 2
> > + *              /
> > + *             /   o : Page 1
> > + *            /   /
> > + *           o---o-o : Page 0
> > + *    Order  2   1 0
> > + *
> > + * Example of requests on this zon:
> 
> typo: zone

s/zon/pool, even :)

> > + *   __find_buddy(pool, page 0, order 0) => page 1
> > + *   __find_buddy(pool, page 0, order 1) => page 2
> > + *   __find_buddy(pool, page 1, order 0) => page 0
> > + *   __find_buddy(pool, page 2, order 0) => page 3
> > + */
> > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > +				     unsigned int order)
> > +{
> > +	phys_addr_t addr = hyp_page_to_phys(p);
> > +
> > +	addr ^= (PAGE_SIZE << order);
> > +	if (addr < pool->range_start || addr >= pool->range_end)
> > +		return NULL;
> 
> Are these range checks only needed because the pool isn't required to be
> an exact power-of-2 pages in size? If so, maybe it would be more
> straightforward to limit the max order on a per-pool basis depending upon
> its size?

More importantly, it is because pages outside of the pool are not
guaranteed to be covered by the hyp_vmemmap, so I really need to make
sure I don't dereference them.

> > +
> > +	return hyp_phys_to_page(addr);
> > +}
> > +
> > +static void __hyp_attach_page(struct hyp_pool *pool,
> > +			      struct hyp_page *p)
> > +{
> > +	unsigned int order = p->order;
> > +	struct hyp_page *buddy;
> > +
> > +	p->order = HYP_NO_ORDER;
> 
> Why is this needed?

If p->order is say 3, I may be able to coalesce with the buddy of order
3 to form a higher order page of order 4. And that higher order page
will be represented by the 'first' of the two order-3 pages (let's call
it the head), and the other order 3 page (let's say the tail) will be
assigned 'HYP_NO_ORDER'.

And basically at this point I don't know if 'p' is going be the head or
the tail, so I set it to HYP_NO_ORDER a priori so I don't have to think
about this in the loop below. Is that helping?

I suppose this could use more comments as well ...

> 
> > +	for (; order < HYP_MAX_ORDER; order++) {
> > +		/* Nothing to do if the buddy isn't in a free-list */
> > +		buddy = __find_buddy(pool, p, order);
> > +		if (!buddy || list_empty(&buddy->node) || buddy->order != order)
> 
> Could we move the "buddy->order" check into __find_buddy()?

I think might break __hyp_extract_page() below. The way I think about
__find_buddy() is as a low level function which gives you the buddy page
blindly if it exists in the hyp_vmemmap, and it's up to the callers to
decide whether the buddy is in the right state for their use or not.

Again, a comment would help I guess.

> 
> > +			break;
> > +
> > +		/* Otherwise, coalesce the buddies and go one level up */
> > +		list_del_init(&buddy->node);
> > +		buddy->order = HYP_NO_ORDER;
> > +		p = (p < buddy) ? p : buddy;
> > +	}
> > +
> > +	p->order = order;
> > +	list_add_tail(&p->node, &pool->free_area[order]);
> > +}
> > +
> > +void hyp_put_page(void *addr)
> > +{
> > +	struct hyp_page *p = hyp_virt_to_page(addr);
> > +	struct hyp_pool *pool = hyp_page_to_pool(p);
> > +
> > +	hyp_spin_lock(&pool->lock);
> > +	if (!p->refcount)
> > +		hyp_panic();
> > +	p->refcount--;
> > +	if (!p->refcount)
> > +		__hyp_attach_page(pool, p);
> > +	hyp_spin_unlock(&pool->lock);
> > +}
> > +
> > +void hyp_get_page(void *addr)
> > +{
> > +	struct hyp_page *p = hyp_virt_to_page(addr);
> > +	struct hyp_pool *pool = hyp_page_to_pool(p);
> > +
> > +	hyp_spin_lock(&pool->lock);
> > +	p->refcount++;
> > +	hyp_spin_unlock(&pool->lock);
> 
> We should probably have a proper atomic refcount type for this along the
> lines of refcount_t. Even if initially that is implemented with a lock, it
> would be good to hide that behind a refcount API.

Makes sense, I'll introduce wrappers around these.
> 
> > +}
> > +
> > +/* Extract a page from the buddy tree, at a specific order */
> > +static struct hyp_page *__hyp_extract_page(struct hyp_pool *pool,
> > +					   struct hyp_page *p,
> > +					   unsigned int order)
> > +{
> > +	struct hyp_page *buddy;
> > +
> > +	if (p->order == HYP_NO_ORDER || p->order < order)
> > +		return NULL;
> 
> Can you drop the explicit HYP_NO_ORDER check here?

I think so, yes.

> > +
> > +	list_del_init(&p->node);
> > +
> > +	/* Split the page in two until reaching the requested order */
> > +	while (p->order > order) {
> > +		p->order--;
> > +		buddy = __find_buddy(pool, p, p->order);
> > +		buddy->order = p->order;
> > +		list_add_tail(&buddy->node, &pool->free_area[buddy->order]);
> > +	}
> > +
> > +	p->refcount = 1;
> > +
> > +	return p;
> > +}
> > +
> > +static void clear_hyp_page(struct hyp_page *p)
> > +{
> > +	unsigned long i;
> > +
> > +	for (i = 0; i < (1 << p->order); i++)
> > +		clear_page(hyp_page_to_virt(p) + (i << PAGE_SHIFT));
> 
> I wonder if this is actually any better than a memset(0)? That should use
> DC ZCA as appropriate afaict.

I think that makes sense, and would allow us to drop the EL2 dependency
on clear_page(), so I'll do the change for v3.

> > +static void *__hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask,
> > +			       unsigned int order)
> > +{
> > +	unsigned int i = order;
> > +	struct hyp_page *p;
> > +
> > +	/* Look for a high-enough-order page */
> > +	while (i <= HYP_MAX_ORDER && list_empty(&pool->free_area[i]))
> > +		i++;
> > +	if (i > HYP_MAX_ORDER)
> > +		return NULL;
> > +
> > +	/* Extract it from the tree at the right order */
> > +	p = list_first_entry(&pool->free_area[i], struct hyp_page, node);
> > +	p = __hyp_extract_page(pool, p, order);
> > +
> > +	if (mask & HYP_GFP_ZERO)
> > +		clear_hyp_page(p);
> 
> Do we have a use-case where skipping the zeroing is worthwhile? If not,
> it might make some sense to zero on the freeing path instead.

And during hyp_pool_init(), but this should be infrequent, so yes I
think this is preferable. I'll get rid of HYP_GFP_ZERO altogether.

> 
> > +
> > +	return p;
> > +}
> > +
> > +void *hyp_alloc_pages(struct hyp_pool *pool, gfp_t mask, unsigned int order)
> > +{
> > +	struct hyp_page *p;
> > +
> > +	hyp_spin_lock(&pool->lock);
> > +	p = __hyp_alloc_pages(pool, mask, order);
> > +	hyp_spin_unlock(&pool->lock);
> > +
> > +	return p ? hyp_page_to_virt(p) : NULL;
> 
> It looks weird not having __hyp_alloc_pages return the VA, but I guess later
> patches will use __hyp_alloc_pages() for something else.

Actually no, this can be simplified.

> > +}
> > +
> > +/* hyp_vmemmap must be backed beforehand */
> > +int hyp_pool_init(struct hyp_pool *pool, phys_addr_t phys,
> > +		  unsigned int nr_pages, unsigned int used_pages)
> > +{
> > +	struct hyp_page *p;
> > +	int i;
> > +
> > +	if (phys % PAGE_SIZE)
> > +		return -EINVAL;
> 
> Maybe just take a pfn instead?
> 
> > +	hyp_spin_lock_init(&pool->lock);
> > +	for (i = 0; i <= HYP_MAX_ORDER; i++)
> > +		INIT_LIST_HEAD(&pool->free_area[i]);
> > +	pool->range_start = phys;
> > +	pool->range_end = phys + (nr_pages << PAGE_SHIFT);
> > +
> > +	/* Init the vmemmap portion */
> > +	p = hyp_phys_to_page(phys);
> > +	memset(p, 0, sizeof(*p) * nr_pages);
> > +	for (i = 0; i < nr_pages; i++, p++) {
> > +		p->pool = pool;
> > +		INIT_LIST_HEAD(&p->node);
> > +	}
> 
> Maybe index p like an array (e.g. p[i]) instead of maintaining two loop
> increments?
> 
> > +
> > +	/* Attach the unused pages to the buddy tree */
> > +	p = hyp_phys_to_page(phys + (used_pages << PAGE_SHIFT));
> > +	for (i = used_pages; i < nr_pages; i++, p++)
> > +		__hyp_attach_page(pool, p);
> 
> Likewise.

And ack for these 3 comments.

Cheers,
Quentin

Quentin Perret Feb. 4, 2021, 10:47 a.m. UTC | #35

On Wednesday 03 Feb 2021 at 14:37:10 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:14PM +0000, Quentin Perret wrote:
> > +static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages)
> > +{
> > +	unsigned long total = 0, i;
> > +
> > +	/* Provision the worst case scenario with 4 levels of page-table */
> > +	for (i = 0; i < 4; i++) {
> 
> Looks like you want KVM_PGTABLE_MAX_LEVELS, so maybe move that into a
> header?

Will do.

> 
> > +		nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE);
> > +		total += nr_pages;
> > +	}
> 
> ... that said, I'm not sure this needs to iterate at all. What exactly are
> you trying to compute?

I'm trying to figure out how many pages I will need to construct a
page-table covering nr_pages contiguous pages. The first iteration tells
me how many level 0 pages I need to cover nr_pages, the second iteration
how many level 1 pages I need to cover the level 0 pages, and so on...

I might be doing this naively though. Got a better idea?

> > +
> > +	return total;
> > +}
> > +
> > +static inline unsigned long hyp_s1_pgtable_size(void)
> > +{
> > +	struct hyp_memblock_region *reg;
> > +	unsigned long nr_pages, res = 0;
> > +	int i;
> > +
> > +	if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
> > +		return 0;
> 
> It's a bit grotty having this be signed. Why do we need to encode the error
> case differently from the 0 case?

Here specifically we don't, but it is needed in early_init_dt_add_memory_hyp()
to distinguish the overflow case from the first memblock being added.

> 
> > +
> > +	for (i = 0; i < kvm_nvhe_sym(hyp_memblock_nr); i++) {
> > +		reg = &kvm_nvhe_sym(hyp_memory)[i];
> 
> You could declare reg in the loop body.

I found it prettier like that and assumed the compiler would optimize it
anyway, but ok.

> > +		nr_pages = (reg->end - reg->start) >> PAGE_SHIFT;
> > +		nr_pages = __hyp_pgtable_max_pages(nr_pages);
> 
> Maybe it would make more sense for __hyp_pgtable_max_pages to take the
> size in bytes rather than pages, since most callers seem to have to do the
> conversion?

Yes, and it seems I can apply small cleanups in other places, so I'll
fix this throughout the patch.

> > +		res += nr_pages << PAGE_SHIFT;
> > +	}
> > +
> > +	/* Allow 1 GiB for private mappings */
> > +	nr_pages = (1 << 30) >> PAGE_SHIFT;
> 
> SZ_1G >> PAGE_SHIFT

Much nicer, thanks. I was also considering adding a Kconfig option for
that, because 1GiB is totally arbitrary. Toughts?

> > +	nr_pages = __hyp_pgtable_max_pages(nr_pages);
> > +	res += nr_pages << PAGE_SHIFT;
> > +
> > +	return res;
> 
> Might make more sense to keep res in pages until here, then just shift when
> returning.
> 
> > +}
> > +
> > +#endif /* __KVM_HYP_MM_H */
> > diff --git a/arch/arm64/kvm/hyp/nvhe/Makefile b/arch/arm64/kvm/hyp/nvhe/Makefile
> > index 72cfe53f106f..d7381a503182 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/Makefile
> > +++ b/arch/arm64/kvm/hyp/nvhe/Makefile
> > @@ -11,9 +11,9 @@ lib-objs := $(addprefix ../../../lib/, $(lib-objs))
> >  
> >  obj-y := timer-sr.o sysreg-sr.o debug-sr.o switch.o tlb.o hyp-init.o host.o \
> >  	 hyp-main.o hyp-smp.o psci-relay.o early_alloc.o stub.o page_alloc.o \
> > -	 cache.o cpufeature.o
> > +	 cache.o cpufeature.o setup.o mm.o
> >  obj-y += ../vgic-v3-sr.o ../aarch32.o ../vgic-v2-cpuif-proxy.o ../entry.o \
> > -	 ../fpsimd.o ../hyp-entry.o ../exception.o
> > +	 ../fpsimd.o ../hyp-entry.o ../exception.o ../pgtable.o
> >  obj-y += $(lib-objs)
> >  
> >  ##
> > diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-init.S b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> > index 31b060a44045..ad943966c39f 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> > +++ b/arch/arm64/kvm/hyp/nvhe/hyp-init.S
> > @@ -251,4 +251,35 @@ alternative_else_nop_endif
> >  
> >  SYM_CODE_END(__kvm_handle_stub_hvc)
> >  
> > +SYM_FUNC_START(__pkvm_init_switch_pgd)
> > +	/* Turn the MMU off */
> > +	pre_disable_mmu_workaround
> > +	mrs	x2, sctlr_el2
> > +	bic	x3, x2, #SCTLR_ELx_M
> > +	msr	sctlr_el2, x3
> > +	isb
> > +
> > +	tlbi	alle2
> > +
> > +	/* Install the new pgtables */
> > +	ldr	x3, [x0, #NVHE_INIT_PGD_PA]
> > +	phys_to_ttbr x4, x3
> > +alternative_if ARM64_HAS_CNP
> > +	orr	x4, x4, #TTBR_CNP_BIT
> > +alternative_else_nop_endif
> > +	msr	ttbr0_el2, x4
> > +
> > +	/* Set the new stack pointer */
> > +	ldr	x0, [x0, #NVHE_INIT_STACK_HYP_VA]
> > +	mov	sp, x0
> > +
> > +	/* And turn the MMU back on! */
> > +	dsb	nsh
> > +	isb
> > +	msr	sctlr_el2, x2
> > +	ic	iallu
> > +	isb
> > +	ret	x1
> > +SYM_FUNC_END(__pkvm_init_switch_pgd)
> > +
> >  	.popsection
> > diff --git a/arch/arm64/kvm/hyp/nvhe/hyp-main.c b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > index a906f9e2ff34..3075f117651c 100644
> > --- a/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > +++ b/arch/arm64/kvm/hyp/nvhe/hyp-main.c
> > @@ -6,12 +6,14 @@
> >  
> >  #include <hyp/switch.h>
> >  
> > +#include <asm/pgtable-types.h>
> >  #include <asm/kvm_asm.h>
> >  #include <asm/kvm_emulate.h>
> >  #include <asm/kvm_host.h>
> >  #include <asm/kvm_hyp.h>
> >  #include <asm/kvm_mmu.h>
> >  
> > +#include <nvhe/mm.h>
> >  #include <nvhe/trap_handler.h>
> >  
> >  DEFINE_PER_CPU(struct kvm_nvhe_init_params, kvm_init_params);
> > @@ -106,6 +108,42 @@ static void handle___vgic_v3_restore_aprs(struct kvm_cpu_context *host_ctxt)
> >  	__vgic_v3_restore_aprs(kern_hyp_va(cpu_if));
> >  }
> >  
> > +static void handle___pkvm_init(struct kvm_cpu_context *host_ctxt)
> > +{
> > +	DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> > +	DECLARE_REG(unsigned long, size, host_ctxt, 2);
> > +	DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3);
> > +	DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4);
> > +
> > +	cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base);
> 
> __pkvm_init() doesn't return, so I think this assignment back into host_ctxt
> is confusing.

Very good point, I'll get rid of this.

> 
> Also, I wonder if these bare numbers would be better hidden behind, e.g.
> 
>  #define DECLARE_ARG0(...)	DECLARE_REG(__VA_ARGS__, 1)
>  ...
>  #define DECLARE_RET(...)	DECLARE_REG(__VA_ARGS__, 1)
> 
> but it's cosmetic, so no need to change your patch. Just worried about
> off-by-1s causing interesting behaviour!

Works for me, but I'll leave this with Marc.

> > +
> > +static void handle___pkvm_cpu_set_vector(struct kvm_cpu_context *host_ctxt)
> > +{
> > +	DECLARE_REG(enum arm64_hyp_spectre_vector, slot, host_ctxt, 1);
> > +
> > +	cpu_reg(host_ctxt, 1) = pkvm_cpu_set_vector(slot);
> > +}
> > +
> > +static void handle___pkvm_create_mappings(struct kvm_cpu_context *host_ctxt)
> > +{
> > +	DECLARE_REG(unsigned long, start, host_ctxt, 1);
> > +	DECLARE_REG(unsigned long, size, host_ctxt, 2);
> > +	DECLARE_REG(unsigned long, phys, host_ctxt, 3);
> > +	DECLARE_REG(unsigned long, prot, host_ctxt, 4);
> > +
> > +	cpu_reg(host_ctxt, 1) = __pkvm_create_mappings(start, size, phys, prot);
> > +}
> > +
> > +static void handle___pkvm_create_private_mapping(struct kvm_cpu_context *host_ctxt)
> > +{
> > +	DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> > +	DECLARE_REG(size_t, size, host_ctxt, 2);
> 
> Why the size_t vs unsigned long discrepancy with pkvm_create_mappings?
> Same with phys_addr_t, although that one probably doesn't matter.
> 
> Also, the pgtable API uses an enum type for the prot bits.

Yes this needs cleaning up.

> > +	DECLARE_REG(unsigned long, prot, host_ctxt, 3);
> > +
> > +	cpu_reg(host_ctxt, 1) = __pkvm_create_private_mapping(phys, size, prot);
> > +}
> > +
> >  typedef void (*hcall_t)(struct kvm_cpu_context *);
> >  
> >  #define HANDLE_FUNC(x)	[__KVM_HOST_SMCCC_FUNC_##x] = kimg_fn_ptr(handle_##x)
> > @@ -125,6 +163,10 @@ static const hcall_t *host_hcall[] = {
> >  	HANDLE_FUNC(__kvm_get_mdcr_el2),
> >  	HANDLE_FUNC(__vgic_v3_save_aprs),
> >  	HANDLE_FUNC(__vgic_v3_restore_aprs),
> > +	HANDLE_FUNC(__pkvm_init),
> > +	HANDLE_FUNC(__pkvm_cpu_set_vector),
> > +	HANDLE_FUNC(__pkvm_create_mappings),
> > +	HANDLE_FUNC(__pkvm_create_private_mapping),
> >  };
> >  
> >  static void handle_host_hcall(struct kvm_cpu_context *host_ctxt)
> > diff --git a/arch/arm64/kvm/hyp/nvhe/mm.c b/arch/arm64/kvm/hyp/nvhe/mm.c
> > new file mode 100644
> > index 000000000000..f3481646a94e
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/nvhe/mm.c
> > @@ -0,0 +1,174 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (C) 2020 Google LLC
> > + * Author: Quentin Perret <qperret@google.com>
> > + */
> > +
> > +#include <linux/kvm_host.h>
> > +#include <asm/kvm_hyp.h>
> > +#include <asm/kvm_mmu.h>
> > +#include <asm/kvm_pgtable.h>
> > +#include <asm/spectre.h>
> > +
> > +#include <nvhe/early_alloc.h>
> > +#include <nvhe/gfp.h>
> > +#include <nvhe/memory.h>
> > +#include <nvhe/mm.h>
> > +#include <nvhe/spinlock.h>
> > +
> > +struct kvm_pgtable pkvm_pgtable;
> > +hyp_spinlock_t pkvm_pgd_lock;
> > +u64 __io_map_base;
> > +
> > +struct hyp_memblock_region hyp_memory[HYP_MEMBLOCK_REGIONS];
> > +int hyp_memblock_nr;
> > +
> > +int __pkvm_create_mappings(unsigned long start, unsigned long size,
> > +			  unsigned long phys, unsigned long prot)
> > +{
> > +	int err;
> > +
> > +	hyp_spin_lock(&pkvm_pgd_lock);
> > +	err = kvm_pgtable_hyp_map(&pkvm_pgtable, start, size, phys, prot);
> > +	hyp_spin_unlock(&pkvm_pgd_lock);
> > +
> > +	return err;
> > +}
> > +
> > +unsigned long __pkvm_create_private_mapping(phys_addr_t phys, size_t size,
> > +					   unsigned long prot)
> > +{
> > +	unsigned long addr;
> > +	int ret;
> > +
> > +	hyp_spin_lock(&pkvm_pgd_lock);
> > +
> > +	size = PAGE_ALIGN(size + offset_in_page(phys));
> 
> It might just be simpler to require page-aligned size and phys in the
> caller. At least, for the vectors that should be straightforward because
> I think they're guaranteed not to span a page boundary.

That is done this way for consistency with the host's equivalent
(__create_hyp_private_mapping()), but I can probably factorize the
size-alignment stuff for both.

> > +	addr = __io_map_base;
> > +	__io_map_base += size;
> > +
> > +	/* Are we overflowing on the vmemmap ? */
> > +	if (__io_map_base > __hyp_vmemmap) {
> > +		__io_map_base -= size;
> > +		addr = 0;
> 
> Can we use ERR_PTR(), or does that fail miserably at EL2?

Good question, I'll give it a go.

> > +		goto out;
> > +	}
> > +
> > +	ret = kvm_pgtable_hyp_map(&pkvm_pgtable, addr, size, phys, prot);
> > +	if (ret) {
> > +		addr = 0;
> > +		goto out;
> > +	}
> > +
> > +	addr = addr + offset_in_page(phys);
> > +out:
> > +	hyp_spin_unlock(&pkvm_pgd_lock);
> > +
> > +	return addr;
> > +}
> 
> [...]
> 
> > +static int recreate_hyp_mappings(phys_addr_t phys, unsigned long size,
> > +				 unsigned long *per_cpu_base)
> > +{
> > +	void *start, *end, *virt = hyp_phys_to_virt(phys);
> > +	int ret, i;
> > +
> > +	/* Recreate the hyp page-table using the early page allocator */
> > +	hyp_early_alloc_init(hyp_pgt_base, hyp_s1_pgtable_size());
> > +	ret = kvm_pgtable_hyp_init(&pkvm_pgtable, hyp_va_bits,
> > +				   &hyp_early_alloc_mm_ops);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = hyp_create_idmap();
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = hyp_map_vectors();
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = hyp_back_vmemmap(phys, size, hyp_virt_to_phys(vmemmap_base));
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_text_start),
> > +				  hyp_symbol_addr(__hyp_text_end),
> > +				  PAGE_HYP_EXEC);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pkvm_create_mappings(hyp_symbol_addr(__start_rodata),
> > +				   hyp_symbol_addr(__end_rodata), PAGE_HYP_RO);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_data_ro_after_init_start),
> > +				   hyp_symbol_addr(__hyp_data_ro_after_init_end),
> > +				   PAGE_HYP_RO);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pkvm_create_mappings(hyp_symbol_addr(__bss_start),
> 
> __hyp_bss_start
>
> > +				   hyp_symbol_addr(__hyp_bss_end), PAGE_HYP);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pkvm_create_mappings(hyp_symbol_addr(__hyp_bss_end),
> > +				   hyp_symbol_addr(__bss_stop), PAGE_HYP_RO);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = pkvm_create_mappings(virt, virt + size - 1, PAGE_HYP);
> 
> Why is the range inclusive here?

It shouldn't be really, I'll fix it.

> > +	if (ret)
> > +		return ret;
> > +
> > +	for (i = 0; i < hyp_nr_cpus; i++) {
> > +		start = (void *)kern_hyp_va(per_cpu_base[i]);
> > +		end = start + PAGE_ALIGN(hyp_percpu_size);
> > +		ret = pkvm_create_mappings(start, end, PAGE_HYP);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static void update_nvhe_init_params(void)
> > +{
> > +	struct kvm_nvhe_init_params *params;
> > +	unsigned long i, stack;
> > +
> > +	for (i = 0; i < hyp_nr_cpus; i++) {
> > +		stack = (unsigned long)stacks_base + (i << PAGE_SHIFT);
> > +		params = per_cpu_ptr(&kvm_init_params, i);
> > +		params->stack_hyp_va = stack + PAGE_SIZE;
> > +		params->pgd_pa = __hyp_pa(pkvm_pgtable.pgd);
> > +		__flush_dcache_area(params, sizeof(*params));
> > +	}
> > +}
> > +
> > +static void *hyp_zalloc_hyp_page(void *arg)
> > +{
> > +	return hyp_alloc_pages(&hpool, HYP_GFP_ZERO, 0);
> > +}
> > +
> > +void __noreturn __pkvm_init_finalise(void)
> > +{
> > +	struct kvm_host_data *host_data = this_cpu_ptr(&kvm_host_data);
> > +	struct kvm_cpu_context *host_ctxt = &host_data->host_ctxt;
> > +	unsigned long nr_pages, used_pages;
> > +	int ret;
> > +
> > +	/* Now that the vmemmap is backed, install the full-fledged allocator */
> > +	nr_pages = hyp_s1_pgtable_size() >> PAGE_SHIFT;
> > +	used_pages = hyp_early_alloc_nr_pages();
> > +	ret = hyp_pool_init(&hpool, __hyp_pa(hyp_pgt_base), nr_pages, used_pages);
> > +	if (ret)
> > +		goto out;
> > +
> > +	pkvm_pgtable_mm_ops.zalloc_page = hyp_zalloc_hyp_page;
> > +	pkvm_pgtable_mm_ops.phys_to_virt = hyp_phys_to_virt;
> > +	pkvm_pgtable_mm_ops.virt_to_phys = hyp_virt_to_phys;
> > +	pkvm_pgtable_mm_ops.get_page = hyp_get_page;
> > +	pkvm_pgtable_mm_ops.put_page = hyp_put_page;
> > +	pkvm_pgtable.mm_ops = &pkvm_pgtable_mm_ops;
> > +
> > +out:
> > +	host_ctxt->regs.regs[0] = SMCCC_RET_SUCCESS;
> > +	host_ctxt->regs.regs[1] = ret;
> 
> Use the cpu_reg() helper for these?

+1

> > +
> > +	__host_enter(host_ctxt);
> > +}
> > +
> > +int __pkvm_init(phys_addr_t phys, unsigned long size, unsigned long nr_cpus,
> > +		unsigned long *per_cpu_base)
> > +{
> > +	struct kvm_nvhe_init_params *params;
> > +	void *virt = hyp_phys_to_virt(phys);
> > +	void (*fn)(phys_addr_t params_pa, void *finalize_fn_va);
> > +	int ret;
> > +
> > +	if (phys % PAGE_SIZE || size % PAGE_SIZE || (u64)virt % PAGE_SIZE)
> > +		return -EINVAL;
> > +
> > +	hyp_spin_lock_init(&pkvm_pgd_lock);
> > +	hyp_nr_cpus = nr_cpus;
> > +
> > +	ret = divide_memory_pool(virt, size);
> > +	if (ret)
> > +		return ret;
> > +
> > +	ret = recreate_hyp_mappings(phys, size, per_cpu_base);
> > +	if (ret)
> > +		return ret;
> > +
> > +	update_nvhe_init_params();
> > +
> > +	/* Jump in the idmap page to switch to the new page-tables */
> > +	params = this_cpu_ptr(&kvm_init_params);
> > +	fn = (typeof(fn))__hyp_pa(hyp_symbol_addr(__pkvm_init_switch_pgd));
> > +	fn(__hyp_pa(params), hyp_symbol_addr(__pkvm_init_finalise));
> > +
> > +	unreachable();
> > +}
> > diff --git a/arch/arm64/kvm/hyp/reserved_mem.c b/arch/arm64/kvm/hyp/reserved_mem.c
> > new file mode 100644
> > index 000000000000..32f648992835
> > --- /dev/null
> > +++ b/arch/arm64/kvm/hyp/reserved_mem.c
> > @@ -0,0 +1,102 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Copyright (C) 2020 - Google LLC
> > + * Author: Quentin Perret <qperret@google.com>
> > + */
> > +
> > +#include <linux/kvm_host.h>
> > +#include <linux/memblock.h>
> > +#include <linux/sort.h>
> > +
> > +#include <asm/kvm_host.h>
> > +
> > +#include <nvhe/memory.h>
> > +#include <nvhe/mm.h>
> > +
> > +phys_addr_t hyp_mem_base;
> > +phys_addr_t hyp_mem_size;
> > +
> > +int __init early_init_dt_add_memory_hyp(u64 base, u64 size)
> > +{
> > +	struct hyp_memblock_region *reg;
> > +
> > +	if (kvm_nvhe_sym(hyp_memblock_nr) >= HYP_MEMBLOCK_REGIONS)
> > +		kvm_nvhe_sym(hyp_memblock_nr) = -1;
> > +
> > +	if (kvm_nvhe_sym(hyp_memblock_nr) < 0)
> > +		return -ENOMEM;
> > +
> > +	reg = kvm_nvhe_sym(hyp_memory);
> > +	reg[kvm_nvhe_sym(hyp_memblock_nr)].start = base;
> > +	reg[kvm_nvhe_sym(hyp_memblock_nr)].end = base + size;
> > +	kvm_nvhe_sym(hyp_memblock_nr)++;
> > +
> > +	return 0;
> > +}
> 
> This isn't called by anything in this patch afaict, so it's a bit tricky to
> review, especially as I was trying to see how it interacts with
> kvm_hyp_reserve(), which reads hyp_memblock_nr.

It's not obvious by the look of it, but this _is_ called -- see the
previous patch. But note that given the outcome of the discussion
with Rob, this is changing in v3 as I'll be using the memblock API
instead.

> > +
> > +static int cmp_hyp_memblock(const void *p1, const void *p2)
> > +{
> > +	const struct hyp_memblock_region *r1 = p1;
> > +	const struct hyp_memblock_region *r2 = p2;
> > +
> > +	return r1->start < r2->start ? -1 : (r1->start > r2->start);
> > +}
> > +
> > +static void __init sort_memblock_regions(void)
> > +{
> > +	sort(kvm_nvhe_sym(hyp_memory),
> > +	     kvm_nvhe_sym(hyp_memblock_nr),
> > +	     sizeof(struct hyp_memblock_region),
> > +	     cmp_hyp_memblock,
> > +	     NULL);
> > +}
> > +
> > +void __init kvm_hyp_reserve(void)
> > +{
> > +	u64 nr_pages, prev;
> > +
> > +	if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
> > +		return;
> > +
> > +	if (kvm_get_mode() != KVM_MODE_PROTECTED)
> > +		return;
> > +
> > +	if (kvm_nvhe_sym(hyp_memblock_nr) < 0) {
> > +		kvm_err("Failed to register hyp memblocks\n");
> > +		return;
> > +	}
> > +
> > +	sort_memblock_regions();
> > +
> > +	/*
> > +	 * We don't know the number of possible CPUs yet, so allocate for the
> > +	 * worst case.
> > +	 */
> > +	hyp_mem_size += NR_CPUS << PAGE_SHIFT;
> 
> There was a recent patch bumping NR_CPUs to 512, so this would be 32MB
> with 64k pages. Is it possible to return memory to the host later on once
> we have a better handle on the number of CPUs in the system?

That's not possible yet, no :/

> 
> > +	hyp_mem_size += hyp_s1_pgtable_size();
> > +
> > +	/*
> > +	 * The hyp_vmemmap needs to be backed by pages, but these pages
> > +	 * themselves need to be present in the vmemmap, so compute the number
> > +	 * of pages needed by looking for a fixed point.
> > +	 */
> > +	nr_pages = 0;
> > +	do {
> > +		prev = nr_pages;
> > +		nr_pages = (hyp_mem_size >> PAGE_SHIFT) + prev;
> > +		nr_pages = DIV_ROUND_UP(nr_pages * sizeof(struct hyp_page), PAGE_SIZE);
> > +		nr_pages += __hyp_pgtable_max_pages(nr_pages);
> > +	} while (nr_pages != prev);
> > +	hyp_mem_size += nr_pages << PAGE_SHIFT;
> > +
> > +	hyp_mem_base = memblock_find_in_range(0, memblock_end_of_DRAM(),
> > +					      hyp_mem_size, SZ_2M);
> 
> Why SZ_2M? Guessing you might mean PMD_SIZE,

Indeed.

> although then we will probably
> want to retry with smaller alignment if the allocation fails as this can
> again be large with e.g. 64k pages.

That can't hurt I guess.

> > +	if (!hyp_mem_base) {
> > +		kvm_err("Failed to reserve hyp memory\n");
> > +		return;
> > +	}
> > +	memblock_reserve(hyp_mem_base, hyp_mem_size);
> > +
> > +	kvm_info("Reserved %lld MiB at 0x%llx\n", hyp_mem_size >> 20,
> > +		 hyp_mem_base);
> > +}
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 278e163beda4..3cf9397dabdb 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -1264,10 +1264,10 @@ static struct kvm_pgtable_mm_ops kvm_hyp_mm_ops = {
> >  	.virt_to_phys		= kvm_host_pa,
> >  };
> >  
> > +u32 hyp_va_bits;
> 
> Perhaps it would be better to pass this to __pkvm_init() instead of making
> it global?

Sure, I'll have init_hyp_mode() pass a pointer to kvm_mmu_init() to
propagate it back.

Cheers,
Quentin

Quentin Perret Feb. 4, 2021, 11:08 a.m. UTC | #36

On Wednesday 03 Feb 2021 at 15:31:39 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:15PM +0000, Quentin Perret wrote:
> > Previous commits have introduced infrastructure at EL2 to enable the Hyp
> > code to manage its own memory, and more specifically its stage 1 page
> > tables. However, this was preliminary work, and none of it is currently
> > in use.
> > 
> > Put all of this together by elevating the hyp mappings creation at EL2
> > when memory protection is enabled. In this case, the host kernel running
> > at EL1 still creates _temporary_ Hyp mappings, only used while
> > initializing the hypervisor, but frees them right after.
> > 
> > As such, all calls to create_hyp_mappings() after kvm init has finished
> > turn into hypercalls, as the host now has no 'legal' way to modify the
> > hypevisor page tables directly.
> > 
> > Signed-off-by: Quentin Perret <qperret@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_mmu.h |  1 -
> >  arch/arm64/kvm/arm.c             | 62 +++++++++++++++++++++++++++++---
> >  arch/arm64/kvm/mmu.c             | 34 ++++++++++++++++++
> >  3 files changed, 92 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index d7ebd73ec86f..6c8466a042a9 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -309,6 +309,5 @@ static __always_inline void __load_guest_stage2(struct kvm_s2_mmu *mmu)
> >  	 */
> >  	asm(ALTERNATIVE("nop", "isb", ARM64_WORKAROUND_SPECULATIVE_AT));
> >  }
> > -
> >  #endif /* __ASSEMBLY__ */
> >  #endif /* __ARM64_KVM_MMU_H__ */
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 6af9204bcd5b..e524682c2ccf 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -1421,7 +1421,7 @@ static void cpu_prepare_hyp_mode(int cpu)
> >  	kvm_flush_dcache_to_poc(params, sizeof(*params));
> >  }
> >  
> > -static void cpu_init_hyp_mode(void)
> > +static void kvm_set_hyp_vector(void)
> 
> Please do something about the naming: now we have both cpu_set_hyp_vector()
> and kvm_set_hyp_vector()!

I'll try to find something different, but no guarantees it'll be much
better :) Suggestions welcome.

> >  {
> >  	struct kvm_nvhe_init_params *params;
> >  	struct arm_smccc_res res;
> > @@ -1439,6 +1439,11 @@ static void cpu_init_hyp_mode(void)
> >  	params = this_cpu_ptr_nvhe_sym(kvm_init_params);
> >  	arm_smccc_1_1_hvc(KVM_HOST_SMCCC_FUNC(__kvm_hyp_init), virt_to_phys(params), &res);
> >  	WARN_ON(res.a0 != SMCCC_RET_SUCCESS);
> > +}
> > +
> > +static void cpu_init_hyp_mode(void)
> > +{
> > +	kvm_set_hyp_vector();
> >  
> >  	/*
> >  	 * Disabling SSBD on a non-VHE system requires us to enable SSBS
> > @@ -1481,7 +1486,10 @@ static void cpu_set_hyp_vector(void)
> >  	struct bp_hardening_data *data = this_cpu_ptr(&bp_hardening_data);
> >  	void *vector = hyp_spectre_vector_selector[data->slot];
> >  
> > -	*this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> > +	if (!is_protected_kvm_enabled())
> > +		*this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> > +	else
> > +		kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);
> 
> *Very* minor nit, but it might be cleaner to have static inline functions
> with the same prototypes as the hypercalls, just to make the code even
> easier to read. e.g
> 
> 	if (!is_protected_kvm_enabled())
> 		_cpu_set_vector(data->slot);
> 	else
> 		kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);
> 
> you could then conceivably wrap that in a macro and avoid having the
> "is_protected_kvm_enabled()" checks explicit every time.

Happy to do this here, but are you suggesting to generalize this pattern
to other places as well?

> >  }
> >  
> >  static void cpu_hyp_reinit(void)
> > @@ -1489,13 +1497,14 @@ static void cpu_hyp_reinit(void)
> >  	kvm_init_host_cpu_context(&this_cpu_ptr_hyp_sym(kvm_host_data)->host_ctxt);
> >  
> >  	cpu_hyp_reset();
> > -	cpu_set_hyp_vector();
> >  
> >  	if (is_kernel_in_hyp_mode())
> >  		kvm_timer_init_vhe();
> >  	else
> >  		cpu_init_hyp_mode();
> >  
> > +	cpu_set_hyp_vector();
> > +
> >  	kvm_arm_init_debug();
> >  
> >  	if (vgic_present)
> > @@ -1714,13 +1723,52 @@ static int copy_cpu_ftr_regs(void)
> >  	return 0;
> >  }
> >  
> > +static int kvm_hyp_enable_protection(void)
> > +{
> > +	void *per_cpu_base = kvm_ksym_ref(kvm_arm_hyp_percpu_base);
> > +	int ret, cpu;
> > +	void *addr;
> > +
> > +	if (!is_protected_kvm_enabled())
> > +		return 0;
> 
> Maybe I'm hung up on my previous suggestion, but I feel like we shouldn't
> get here if protected kvm isn't enabled.

The alternative is to move this check next to the call site, but it
won't help much IMO.

> 
> > +	if (!hyp_mem_base)
> > +		return -ENOMEM;
> > +
> > +	addr = phys_to_virt(hyp_mem_base);
> > +	ret = create_hyp_mappings(addr, addr + hyp_mem_size - 1, PAGE_HYP);
> > +	if (ret)
> > +		return ret;
> > +
> > +	preempt_disable();
> > +	kvm_set_hyp_vector();
> > +	ret = kvm_call_hyp_nvhe(__pkvm_init, hyp_mem_base, hyp_mem_size,
> > +				num_possible_cpus(), kern_hyp_va(per_cpu_base));
> 
> Would it make sense for the __pkvm_init() hypercall to set the vector as
> well, so that we wouldn't need to disable preemption over two hypercalls?

Not sure, kvm_set_hyp_vector() itself already does multiple hypercalls,
and I need it separate from __pkvm_init for secondary CPUs.

> Failing that, maybe move the whole preempt_disable/enable sequence into
> another function.

But that I can do.

> > +	preempt_enable();
> > +	if (ret)
> > +		return ret;
> > +
> > +	free_hyp_pgds();
> > +	for_each_possible_cpu(cpu)
> > +		free_page(per_cpu(kvm_arm_hyp_stack_page, cpu));
> > +
> > +	return 0;
> > +}
> > +
> >  /**
> >   * Inits Hyp-mode on all online CPUs
> >   */
> >  static int init_hyp_mode(void)
> >  {
> >  	int cpu;
> > -	int err = 0;
> > +	int err = -ENOMEM;
> > +
> > +	/*
> > +	 * The protected Hyp-mode cannot be initialized if the memory pool
> > +	 * allocation has failed.
> > +	 */
> > +	if (is_protected_kvm_enabled() && !hyp_mem_base)
> > +		return err;
> >  
> >  	/*
> >  	 * Copy the required CPU feature register in their EL2 counterpart
> > @@ -1854,6 +1902,12 @@ static int init_hyp_mode(void)
> >  	for_each_possible_cpu(cpu)
> >  		cpu_prepare_hyp_mode(cpu);
> >  
> > +	err = kvm_hyp_enable_protection();
> > +	if (err) {
> > +		kvm_err("Failed to enable hyp memory protection: %d\n", err);
> > +		goto out_err;
> > +	}
> > +
> >  	return 0;
> >  
> >  out_err:
> > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > index 3cf9397dabdb..9d4c9251208e 100644
> > --- a/arch/arm64/kvm/mmu.c
> > +++ b/arch/arm64/kvm/mmu.c
> > @@ -225,15 +225,39 @@ void free_hyp_pgds(void)
> >  	if (hyp_pgtable) {
> >  		kvm_pgtable_hyp_destroy(hyp_pgtable);
> >  		kfree(hyp_pgtable);
> > +		hyp_pgtable = NULL;
> >  	}
> >  	mutex_unlock(&kvm_hyp_pgd_mutex);
> >  }
> >  
> > +static bool kvm_host_owns_hyp_mappings(void)
> > +{
> > +	if (static_branch_likely(&kvm_protected_mode_initialized))
> > +		return false;
> > +
> > +	/*
> > +	 * This can happen at boot time when __create_hyp_mappings() is called
> > +	 * after the hyp protection has been enabled, but the static key has
> > +	 * not been flipped yet.
> > +	 */
> > +	if (!hyp_pgtable && is_protected_kvm_enabled())
> > +		return false;
> > +
> > +	BUG_ON(!hyp_pgtable);
> 
> Can we fail more gracefully, e.g. by continuing without KVM?

Got any suggestion as to how that can be done? We could also just remove
that line -- that really should not happen.

Thanks!
Quentin

Quentin Perret Feb. 4, 2021, 2:07 p.m. UTC | #37

On Wednesday 03 Feb 2021 at 15:53:54 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:19PM +0000, Quentin Perret wrote:
> > In order to re-use some of the stage 2 setup at EL2, factor parts of
> > kvm_arm_setup_stage2() out into static inline functions.
> > 
> > No functional change intended.
> > 
> > Signed-off-by: Quentin Perret <qperret@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_mmu.h | 48 ++++++++++++++++++++++++++++++++
> >  arch/arm64/kvm/reset.c           | 42 +++-------------------------
> >  2 files changed, 52 insertions(+), 38 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> > index 662f0415344e..83b4c5cf4768 100644
> > --- a/arch/arm64/include/asm/kvm_mmu.h
> > +++ b/arch/arm64/include/asm/kvm_mmu.h
> > @@ -280,6 +280,54 @@ static inline int kvm_write_guest_lock(struct kvm *kvm, gpa_t gpa,
> >  	return ret;
> >  }
> >  
> > +static inline u64 kvm_get_parange(u64 mmfr0)
> > +{
> > +	u64 parange = cpuid_feature_extract_unsigned_field(mmfr0,
> > +				ID_AA64MMFR0_PARANGE_SHIFT);
> > +	if (parange > ID_AA64MMFR0_PARANGE_MAX)
> > +		parange = ID_AA64MMFR0_PARANGE_MAX;
> > +
> > +	return parange;
> > +}
> > +
> > +/*
> > + * The VTCR value is common across all the physical CPUs on the system.
> > + * We use system wide sanitised values to fill in different fields,
> > + * except for Hardware Management of Access Flags. HA Flag is set
> > + * unconditionally on all CPUs, as it is safe to run with or without
> > + * the feature and the bit is RES0 on CPUs that don't support it.
> > + */
> > +static inline u64 kvm_get_vtcr(u64 mmfr0, u64 mmfr1, u32 phys_shift)
> > +{
> > +	u64 vtcr = VTCR_EL2_FLAGS;
> > +	u8 lvls;
> > +
> > +	vtcr |= kvm_get_parange(mmfr0) << VTCR_EL2_PS_SHIFT;
> > +	vtcr |= VTCR_EL2_T0SZ(phys_shift);
> > +	/*
> > +	 * Use a minimum 2 level page table to prevent splitting
> > +	 * host PMD huge pages at stage2.
> > +	 */
> > +	lvls = stage2_pgtable_levels(phys_shift);
> > +	if (lvls < 2)
> > +		lvls = 2;
> > +	vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
> > +
> > +	/*
> > +	 * Enable the Hardware Access Flag management, unconditionally
> > +	 * on all CPUs. The features is RES0 on CPUs without the support
> > +	 * and must be ignored by the CPUs.
> > +	 */
> > +	vtcr |= VTCR_EL2_HA;
> > +
> > +	/* Set the vmid bits */
> > +	vtcr |= (get_vmid_bits(mmfr1) == 16) ?
> > +		VTCR_EL2_VS_16BIT :
> > +		VTCR_EL2_VS_8BIT;
> > +
> > +	return vtcr;
> > +}
> 
> Although I think this is functionally fine, I think it's unusual to see
> large "static inline" functions like this in shared header files. One
> alternative approach would be to follow the example of
> kernel/locking/qspinlock_paravirt.h, where the header is guarded in such a
> way that is only ever included by kernel/locking/qspinlock.c and therefore
> doesn't need the "inline" at all. That separation really helps, I think.

Alternatively, I might be able to have an mmu.c file in the hyp/ folder,
and to compile it for both the host kernel and the EL2 obj as we do for
a few things already. Or maybe I'll just stick it in pgtable.c. Either
way, it'll add a function call, but I can't really see that having any
measurable impact, so we should be fine.

Cheers,
Quentin

Quentin Perret Feb. 4, 2021, 2:18 p.m. UTC | #38

On Wednesday 03 Feb 2021 at 15:58:32 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:21PM +0000, Quentin Perret wrote:
> > Refactor __populate_fault_info() to introduce __get_fault_info() which
> > will be used once the host is wrapped in a stage 2.
> > 
> > Signed-off-by: Quentin Perret <qperret@google.com>
> > ---
> >  arch/arm64/kvm/hyp/include/hyp/switch.h | 36 +++++++++++++++----------
> >  1 file changed, 22 insertions(+), 14 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
> > index 84473574c2e7..e9005255d639 100644
> > --- a/arch/arm64/kvm/hyp/include/hyp/switch.h
> > +++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
> > @@ -157,19 +157,9 @@ static inline bool __translate_far_to_hpfar(u64 far, u64 *hpfar)
> >  	return true;
> >  }
> >  
> > -static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)
> > +static inline bool __get_fault_info(u64 esr, u64 *far, u64 *hpfar)
> 
> Could this take a pointer to a struct kvm_vcpu_fault_info instead?

The disr_el1 field will be unused in this case, but yes, that should
work.

Cheers,
Quentin

Quentin Perret Feb. 4, 2021, 2:24 p.m. UTC | #39

On Wednesday 03 Feb 2021 at 15:59:44 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:22PM +0000, Quentin Perret wrote:
> > The current stage2 page-table allocator uses a memcache to get
> > pre-allocated pages when it needs any. To allow re-using this code at
> > EL2 which uses a concept of memory pools, make the memcache argument to
> > kvm_pgtable_stage2_map() anonymous. and let the mm_ops zalloc_page()
> > callbacks use it the way they need to.
> > 
> > Signed-off-by: Quentin Perret <qperret@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_pgtable.h | 6 +++---
> >  arch/arm64/kvm/hyp/pgtable.c         | 4 ++--
> >  2 files changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > index 8e8f1d2c5e0e..d846bc3d3b77 100644
> > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > @@ -176,8 +176,8 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
> >   * @size:	Size of the mapping.
> >   * @phys:	Physical address of the memory to map.
> >   * @prot:	Permissions and attributes for the mapping.
> > - * @mc:		Cache of pre-allocated GFP_PGTABLE_USER memory from which to
> > - *		allocate page-table pages.
> > + * @mc:		Cache of pre-allocated memory from which to allocate page-table
> > + *		pages.
> 
> We should probably mention that this memory must be zeroed, since I don't
> think the page-table code takes care of that.

OK, though I think this is unrelated to this change -- this is already
true today I believe. Anyhow, I'll pile a change on top.

Cheers,
Quentin

Quentin Perret Feb. 4, 2021, 2:26 p.m. UTC | #40

On Wednesday 03 Feb 2021 at 16:11:47 (+0000), Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:24PM +0000, Quentin Perret wrote:
> > When KVM runs in protected nVHE mode, make use of a stage 2 page-table
> > to give the hypervisor some control over the host memory accesses. At
> > the moment all memory aborts from the host will be instantly idmapped
> > RWX at stage 2 in a lazy fashion. Later patches will make use of that
> > infrastructure to implement access control restrictions to e.g. protect
> > guest memory from the host.
> > 
> > Signed-off-by: Quentin Perret <qperret@google.com>
> > ---
> >  arch/arm64/include/asm/kvm_cpufeature.h       |   2 +
> >  arch/arm64/kernel/image-vars.h                |   3 +
> >  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  33 +++
> >  arch/arm64/kvm/hyp/nvhe/Makefile              |   2 +-
> >  arch/arm64/kvm/hyp/nvhe/hyp-init.S            |   1 +
> >  arch/arm64/kvm/hyp/nvhe/hyp-main.c            |   6 +
> >  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 191 ++++++++++++++++++
> >  arch/arm64/kvm/hyp/nvhe/setup.c               |   6 +
> >  arch/arm64/kvm/hyp/nvhe/switch.c              |   7 +-
> >  arch/arm64/kvm/hyp/nvhe/tlb.c                 |   4 +-
> >  10 files changed, 248 insertions(+), 7 deletions(-)
> >  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> >  create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c
> 
> [...]
> 
> > +void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
> > +{
> > +	enum kvm_pgtable_prot prot;
> > +	u64 far, hpfar, esr, ipa;
> > +	int ret;
> > +
> > +	esr = read_sysreg_el2(SYS_ESR);
> > +	if (!__get_fault_info(esr, &far, &hpfar))
> > +		hyp_panic();
> > +
> > +	prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W | KVM_PGTABLE_PROT_X;
> > +	ipa = (hpfar & HPFAR_MASK) << 8;
> > +	ret = host_stage2_map(ipa, PAGE_SIZE, prot);
> 
> Can we try to put down a block mapping if the whole thing falls within
> memory?

Yes we can! And in fact we can do that outside of memory too. It's
queued for v3 already, so stay tuned ... :)

Thanks,
Quentin

Will Deacon Feb. 4, 2021, 2:31 p.m. UTC | #41

On Wed, Feb 03, 2021 at 06:33:30PM +0000, Quentin Perret wrote:
> On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > > + *   __find_buddy(pool, page 0, order 0) => page 1
> > > + *   __find_buddy(pool, page 0, order 1) => page 2
> > > + *   __find_buddy(pool, page 1, order 0) => page 0
> > > + *   __find_buddy(pool, page 2, order 0) => page 3
> > > + */
> > > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > > +				     unsigned int order)
> > > +{
> > > +	phys_addr_t addr = hyp_page_to_phys(p);
> > > +
> > > +	addr ^= (PAGE_SIZE << order);
> > > +	if (addr < pool->range_start || addr >= pool->range_end)
> > > +		return NULL;
> > 
> > Are these range checks only needed because the pool isn't required to be
> > an exact power-of-2 pages in size? If so, maybe it would be more
> > straightforward to limit the max order on a per-pool basis depending upon
> > its size?
> 
> More importantly, it is because pages outside of the pool are not
> guaranteed to be covered by the hyp_vmemmap, so I really need to make
> sure I don't dereference them.

Wouldn't having a per-pool max order help with that?

> > > +	return hyp_phys_to_page(addr);
> > > +}
> > > +
> > > +static void __hyp_attach_page(struct hyp_pool *pool,
> > > +			      struct hyp_page *p)
> > > +{
> > > +	unsigned int order = p->order;
> > > +	struct hyp_page *buddy;
> > > +
> > > +	p->order = HYP_NO_ORDER;
> > 
> > Why is this needed?
> 
> If p->order is say 3, I may be able to coalesce with the buddy of order
> 3 to form a higher order page of order 4. And that higher order page
> will be represented by the 'first' of the two order-3 pages (let's call
> it the head), and the other order 3 page (let's say the tail) will be
> assigned 'HYP_NO_ORDER'.
> 
> And basically at this point I don't know if 'p' is going be the head or
> the tail, so I set it to HYP_NO_ORDER a priori so I don't have to think
> about this in the loop below. Is that helping?
> 
> I suppose this could use more comments as well ...

Comments would definitely help, but perhaps even having a simple function to
do the coalescing, which you could call from the loop body and which would
deal with marking the tail pages as HYP_NO_ORDER?

> > > +	for (; order < HYP_MAX_ORDER; order++) {
> > > +		/* Nothing to do if the buddy isn't in a free-list */
> > > +		buddy = __find_buddy(pool, p, order);
> > > +		if (!buddy || list_empty(&buddy->node) || buddy->order != order)
> > 
> > Could we move the "buddy->order" check into __find_buddy()?
> 
> I think might break __hyp_extract_page() below. The way I think about
> __find_buddy() is as a low level function which gives you the buddy page
> blindly if it exists in the hyp_vmemmap, and it's up to the callers to
> decide whether the buddy is in the right state for their use or not.

Just feels a bit backwards having __find_buddy() take an order parameter,
yet then return a page of the wrong order! __hyp_extract_page() always
passes the p->order as the order, so I think it would be worth having a
separate function that just takes the pool and the page for that.

Will

Will Deacon Feb. 4, 2021, 2:36 p.m. UTC | #42

On Thu, Feb 04, 2021 at 02:24:44PM +0000, Quentin Perret wrote:
> On Wednesday 03 Feb 2021 at 15:59:44 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:22PM +0000, Quentin Perret wrote:
> > > The current stage2 page-table allocator uses a memcache to get
> > > pre-allocated pages when it needs any. To allow re-using this code at
> > > EL2 which uses a concept of memory pools, make the memcache argument to
> > > kvm_pgtable_stage2_map() anonymous. and let the mm_ops zalloc_page()
> > > callbacks use it the way they need to.
> > > 
> > > Signed-off-by: Quentin Perret <qperret@google.com>
> > > ---
> > >  arch/arm64/include/asm/kvm_pgtable.h | 6 +++---
> > >  arch/arm64/kvm/hyp/pgtable.c         | 4 ++--
> > >  2 files changed, 5 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/arm64/include/asm/kvm_pgtable.h b/arch/arm64/include/asm/kvm_pgtable.h
> > > index 8e8f1d2c5e0e..d846bc3d3b77 100644
> > > --- a/arch/arm64/include/asm/kvm_pgtable.h
> > > +++ b/arch/arm64/include/asm/kvm_pgtable.h
> > > @@ -176,8 +176,8 @@ void kvm_pgtable_stage2_destroy(struct kvm_pgtable *pgt);
> > >   * @size:	Size of the mapping.
> > >   * @phys:	Physical address of the memory to map.
> > >   * @prot:	Permissions and attributes for the mapping.
> > > - * @mc:		Cache of pre-allocated GFP_PGTABLE_USER memory from which to
> > > - *		allocate page-table pages.
> > > + * @mc:		Cache of pre-allocated memory from which to allocate page-table
> > > + *		pages.
> > 
> > We should probably mention that this memory must be zeroed, since I don't
> > think the page-table code takes care of that.
> 
> OK, though I think this is unrelated to this change -- this is already
> true today I believe. Anyhow, I'll pile a change on top.

It is, but GFP_PGTABLE_USER implies __GFP_ZERO, so the existing comment
captures that.

Will

Will Deacon Feb. 4, 2021, 2:37 p.m. UTC | #43

On Thu, Feb 04, 2021 at 02:26:35PM +0000, Quentin Perret wrote:
> On Wednesday 03 Feb 2021 at 16:11:47 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:24PM +0000, Quentin Perret wrote:
> > > When KVM runs in protected nVHE mode, make use of a stage 2 page-table
> > > to give the hypervisor some control over the host memory accesses. At
> > > the moment all memory aborts from the host will be instantly idmapped
> > > RWX at stage 2 in a lazy fashion. Later patches will make use of that
> > > infrastructure to implement access control restrictions to e.g. protect
> > > guest memory from the host.
> > > 
> > > Signed-off-by: Quentin Perret <qperret@google.com>
> > > ---
> > >  arch/arm64/include/asm/kvm_cpufeature.h       |   2 +
> > >  arch/arm64/kernel/image-vars.h                |   3 +
> > >  arch/arm64/kvm/hyp/include/nvhe/mem_protect.h |  33 +++
> > >  arch/arm64/kvm/hyp/nvhe/Makefile              |   2 +-
> > >  arch/arm64/kvm/hyp/nvhe/hyp-init.S            |   1 +
> > >  arch/arm64/kvm/hyp/nvhe/hyp-main.c            |   6 +
> > >  arch/arm64/kvm/hyp/nvhe/mem_protect.c         | 191 ++++++++++++++++++
> > >  arch/arm64/kvm/hyp/nvhe/setup.c               |   6 +
> > >  arch/arm64/kvm/hyp/nvhe/switch.c              |   7 +-
> > >  arch/arm64/kvm/hyp/nvhe/tlb.c                 |   4 +-
> > >  10 files changed, 248 insertions(+), 7 deletions(-)
> > >  create mode 100644 arch/arm64/kvm/hyp/include/nvhe/mem_protect.h
> > >  create mode 100644 arch/arm64/kvm/hyp/nvhe/mem_protect.c
> > 
> > [...]
> > 
> > > +void handle_host_mem_abort(struct kvm_cpu_context *host_ctxt)
> > > +{
> > > +	enum kvm_pgtable_prot prot;
> > > +	u64 far, hpfar, esr, ipa;
> > > +	int ret;
> > > +
> > > +	esr = read_sysreg_el2(SYS_ESR);
> > > +	if (!__get_fault_info(esr, &far, &hpfar))
> > > +		hyp_panic();
> > > +
> > > +	prot = KVM_PGTABLE_PROT_R | KVM_PGTABLE_PROT_W | KVM_PGTABLE_PROT_X;
> > > +	ipa = (hpfar & HPFAR_MASK) << 8;
> > > +	ret = host_stage2_map(ipa, PAGE_SIZE, prot);
> > 
> > Can we try to put down a block mapping if the whole thing falls within
> > memory?
> 
> Yes we can! And in fact we can do that outside of memory too. It's
> queued for v3 already, so stay tuned ... :)

Awesome! The Stage-2 TLB thanks you.

Will

Quentin Perret Feb. 4, 2021, 2:52 p.m. UTC | #44

On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> On Wed, Feb 03, 2021 at 06:33:30PM +0000, Quentin Perret wrote:
> > On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> > > On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > > > + *   __find_buddy(pool, page 0, order 0) => page 1
> > > > + *   __find_buddy(pool, page 0, order 1) => page 2
> > > > + *   __find_buddy(pool, page 1, order 0) => page 0
> > > > + *   __find_buddy(pool, page 2, order 0) => page 3
> > > > + */
> > > > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > > > +				     unsigned int order)
> > > > +{
> > > > +	phys_addr_t addr = hyp_page_to_phys(p);
> > > > +
> > > > +	addr ^= (PAGE_SIZE << order);
> > > > +	if (addr < pool->range_start || addr >= pool->range_end)
> > > > +		return NULL;
> > > 
> > > Are these range checks only needed because the pool isn't required to be
> > > an exact power-of-2 pages in size? If so, maybe it would be more
> > > straightforward to limit the max order on a per-pool basis depending upon
> > > its size?
> > 
> > More importantly, it is because pages outside of the pool are not
> > guaranteed to be covered by the hyp_vmemmap, so I really need to make
> > sure I don't dereference them.
> 
> Wouldn't having a per-pool max order help with that?

The issue is, I have no alignment guarantees for the pools, so I may end
up with max_order = 0 ...

Will Deacon Feb. 4, 2021, 5:48 p.m. UTC | #45

On Thu, Feb 04, 2021 at 02:52:52PM +0000, Quentin Perret wrote:
> On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> > On Wed, Feb 03, 2021 at 06:33:30PM +0000, Quentin Perret wrote:
> > > On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> > > > On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > > > > + *   __find_buddy(pool, page 0, order 0) => page 1
> > > > > + *   __find_buddy(pool, page 0, order 1) => page 2
> > > > > + *   __find_buddy(pool, page 1, order 0) => page 0
> > > > > + *   __find_buddy(pool, page 2, order 0) => page 3
> > > > > + */
> > > > > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > > > > +				     unsigned int order)
> > > > > +{
> > > > > +	phys_addr_t addr = hyp_page_to_phys(p);
> > > > > +
> > > > > +	addr ^= (PAGE_SIZE << order);
> > > > > +	if (addr < pool->range_start || addr >= pool->range_end)
> > > > > +		return NULL;
> > > > 
> > > > Are these range checks only needed because the pool isn't required to be
> > > > an exact power-of-2 pages in size? If so, maybe it would be more
> > > > straightforward to limit the max order on a per-pool basis depending upon
> > > > its size?
> > > 
> > > More importantly, it is because pages outside of the pool are not
> > > guaranteed to be covered by the hyp_vmemmap, so I really need to make
> > > sure I don't dereference them.
> > 
> > Wouldn't having a per-pool max order help with that?
> 
> The issue is, I have no alignment guarantees for the pools, so I may end
> up with max_order = 0 ...

Yeah, so you would still need the range tracking, but it would at least help
to reduce HYP_MAX_ORDER failed searches each time. Still, we can always do
that later.

Will

Quentin Perret Feb. 4, 2021, 6:01 p.m. UTC | #46

On Thursday 04 Feb 2021 at 17:48:49 (+0000), Will Deacon wrote:
> On Thu, Feb 04, 2021 at 02:52:52PM +0000, Quentin Perret wrote:
> > On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> > > On Wed, Feb 03, 2021 at 06:33:30PM +0000, Quentin Perret wrote:
> > > > On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> > > > > On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > > > > > + *   __find_buddy(pool, page 0, order 0) => page 1
> > > > > > + *   __find_buddy(pool, page 0, order 1) => page 2
> > > > > > + *   __find_buddy(pool, page 1, order 0) => page 0
> > > > > > + *   __find_buddy(pool, page 2, order 0) => page 3
> > > > > > + */
> > > > > > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > > > > > +				     unsigned int order)
> > > > > > +{
> > > > > > +	phys_addr_t addr = hyp_page_to_phys(p);
> > > > > > +
> > > > > > +	addr ^= (PAGE_SIZE << order);
> > > > > > +	if (addr < pool->range_start || addr >= pool->range_end)
> > > > > > +		return NULL;
> > > > > 
> > > > > Are these range checks only needed because the pool isn't required to be
> > > > > an exact power-of-2 pages in size? If so, maybe it would be more
> > > > > straightforward to limit the max order on a per-pool basis depending upon
> > > > > its size?
> > > > 
> > > > More importantly, it is because pages outside of the pool are not
> > > > guaranteed to be covered by the hyp_vmemmap, so I really need to make
> > > > sure I don't dereference them.
> > > 
> > > Wouldn't having a per-pool max order help with that?
> > 
> > The issue is, I have no alignment guarantees for the pools, so I may end
> > up with max_order = 0 ...
> 
> Yeah, so you would still need the range tracking,

Hmm actually I don't think I would, but that would essentially mean the
'buddy' allocator is now turned into a free list of single pages
(because we cannot create pages of order 1).

> but it would at least help
> to reduce HYP_MAX_ORDER failed searches each time. Still, we can always do
> that later.

Sorry but I am not following. In which case do we have HYP_MAX_ORDER
failed searches?

Thanks,
Quentin

Will Deacon Feb. 4, 2021, 6:13 p.m. UTC | #47

On Thu, Feb 04, 2021 at 06:01:12PM +0000, Quentin Perret wrote:
> On Thursday 04 Feb 2021 at 17:48:49 (+0000), Will Deacon wrote:
> > On Thu, Feb 04, 2021 at 02:52:52PM +0000, Quentin Perret wrote:
> > > On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> > > > On Wed, Feb 03, 2021 at 06:33:30PM +0000, Quentin Perret wrote:
> > > > > On Tuesday 02 Feb 2021 at 18:13:08 (+0000), Will Deacon wrote:
> > > > > > On Fri, Jan 08, 2021 at 12:15:10PM +0000, Quentin Perret wrote:
> > > > > > > + *   __find_buddy(pool, page 0, order 0) => page 1
> > > > > > > + *   __find_buddy(pool, page 0, order 1) => page 2
> > > > > > > + *   __find_buddy(pool, page 1, order 0) => page 0
> > > > > > > + *   __find_buddy(pool, page 2, order 0) => page 3
> > > > > > > + */
> > > > > > > +static struct hyp_page *__find_buddy(struct hyp_pool *pool, struct hyp_page *p,
> > > > > > > +				     unsigned int order)
> > > > > > > +{
> > > > > > > +	phys_addr_t addr = hyp_page_to_phys(p);
> > > > > > > +
> > > > > > > +	addr ^= (PAGE_SIZE << order);
> > > > > > > +	if (addr < pool->range_start || addr >= pool->range_end)
> > > > > > > +		return NULL;
> > > > > > 
> > > > > > Are these range checks only needed because the pool isn't required to be
> > > > > > an exact power-of-2 pages in size? If so, maybe it would be more
> > > > > > straightforward to limit the max order on a per-pool basis depending upon
> > > > > > its size?
> > > > > 
> > > > > More importantly, it is because pages outside of the pool are not
> > > > > guaranteed to be covered by the hyp_vmemmap, so I really need to make
> > > > > sure I don't dereference them.
> > > > 
> > > > Wouldn't having a per-pool max order help with that?
> > > 
> > > The issue is, I have no alignment guarantees for the pools, so I may end
> > > up with max_order = 0 ...
> > 
> > Yeah, so you would still need the range tracking,
> 
> Hmm actually I don't think I would, but that would essentially mean the
> 'buddy' allocator is now turned into a free list of single pages
> (because we cannot create pages of order 1).

Right, I'm not suggesting we do that.

> > but it would at least help
> > to reduce HYP_MAX_ORDER failed searches each time. Still, we can always do
> > that later.
> 
> Sorry but I am not following. In which case do we have HYP_MAX_ORDER
> failed searches?

I was going from memory, but the loop in __hyp_alloc_pages() searches up to
HYP_MAX_ORDER, whereas this is _never_ going to succeed beyond some per-pool
order determined by the size of the pool. But I doubt it matters -- I
thought we did more than just check a list.

Will

Quentin Perret Feb. 4, 2021, 6:19 p.m. UTC | #48

On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> Just feels a bit backwards having __find_buddy() take an order parameter,
> yet then return a page of the wrong order! __hyp_extract_page() always
> passes the p->order as the order,

Gotcha, so maybe this is just a naming problem. __find_buddy() is simply
a helper to lookup/index the vmemmap, but it's perfectly possible that
the 'destination' page that is being indexed has already been allocated,
and split up multiple time (and so at a different order), etc ... And
that is the caller's job to decide.

How about __lookup_potential_buddy() ? Any suggestion?

Will Deacon Feb. 4, 2021, 6:24 p.m. UTC | #49

On Thu, Feb 04, 2021 at 06:19:36PM +0000, Quentin Perret wrote:
> On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> > Just feels a bit backwards having __find_buddy() take an order parameter,
> > yet then return a page of the wrong order! __hyp_extract_page() always
> > passes the p->order as the order,
> 
> Gotcha, so maybe this is just a naming problem. __find_buddy() is simply
> a helper to lookup/index the vmemmap, but it's perfectly possible that
> the 'destination' page that is being indexed has already been allocated,
> and split up multiple time (and so at a different order), etc ... And
> that is the caller's job to decide.
> 
> How about __lookup_potential_buddy() ? Any suggestion?

Hey, my job here is to waffle incoherently and hope that you find bugs in
your own code. Now you want me to _name_ something! Jeez...

Ok, how about __find_buddy() does what it does today but doesn't take an
order argument, whereas __find_buddy_of_order() takes the order argument
and checks the page order before returning?

Will

Quentin Perret Feb. 4, 2021, 6:24 p.m. UTC | #50

On Thursday 04 Feb 2021 at 18:13:18 (+0000), Will Deacon wrote:
> I was going from memory, but the loop in __hyp_alloc_pages() searches up to
> HYP_MAX_ORDER, whereas this is _never_ going to succeed beyond some per-pool
> order determined by the size of the pool. But I doubt it matters -- I
> thought we did more than just check a list.

Ah, I see -- I was looking at the __hyp_attach_page() loop.

I think it's a good point, I should be able to figure out a max order
based on the size and alignment of the pool, and cache that in struct
hyp_pool to optimize cases where this is < HYP_MAX_ORDER.
Should be easy enough, I'll see what I can do in v3.

Thanks!
Quentin

Quentin Perret Feb. 4, 2021, 6:32 p.m. UTC | #51

On Thursday 04 Feb 2021 at 18:24:05 (+0000), Will Deacon wrote:
> On Thu, Feb 04, 2021 at 06:19:36PM +0000, Quentin Perret wrote:
> > On Thursday 04 Feb 2021 at 14:31:08 (+0000), Will Deacon wrote:
> > > Just feels a bit backwards having __find_buddy() take an order parameter,
> > > yet then return a page of the wrong order! __hyp_extract_page() always
> > > passes the p->order as the order,
> > 
> > Gotcha, so maybe this is just a naming problem. __find_buddy() is simply
> > a helper to lookup/index the vmemmap, but it's perfectly possible that
> > the 'destination' page that is being indexed has already been allocated,
> > and split up multiple time (and so at a different order), etc ... And
> > that is the caller's job to decide.
> > 
> > How about __lookup_potential_buddy() ? Any suggestion?
> 
> Hey, my job here is to waffle incoherently and hope that you find bugs in
> your own code. Now you want me to _name_ something! Jeez...

Hey, that's my special -- I already got Marc to make a suggestion on v1
and it's been my favorite function name so far, so why not try again?

https://lore.kernel.org/kvmarm/d6a674a0e8e259161ab741d78924c756@kernel.org/

> Ok, how about __find_buddy() does what it does today but doesn't take an
> order argument, whereas __find_buddy_of_order() takes the order argument
> and checks the page order before returning?

Sounds like a plan!

Cheers,
Quentin

Will Deacon Feb. 5, 2021, 5:56 p.m. UTC | #52

On Thu, Feb 04, 2021 at 10:47:08AM +0000, Quentin Perret wrote:
> On Wednesday 03 Feb 2021 at 14:37:10 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:14PM +0000, Quentin Perret wrote:
> > > +static inline unsigned long __hyp_pgtable_max_pages(unsigned long nr_pages)
> > > +{
> > > +	unsigned long total = 0, i;
> > > +
> > > +	/* Provision the worst case scenario with 4 levels of page-table */
> > > +	for (i = 0; i < 4; i++) {
> > 
> > Looks like you want KVM_PGTABLE_MAX_LEVELS, so maybe move that into a
> > header?
> 
> Will do.
> 
> > 
> > > +		nr_pages = DIV_ROUND_UP(nr_pages, PTRS_PER_PTE);
> > > +		total += nr_pages;
> > > +	}
> > 
> > ... that said, I'm not sure this needs to iterate at all. What exactly are
> > you trying to compute?
> 
> I'm trying to figure out how many pages I will need to construct a
> page-table covering nr_pages contiguous pages. The first iteration tells
> me how many level 0 pages I need to cover nr_pages, the second iteration
> how many level 1 pages I need to cover the level 0 pages, and so on...

Ah, you iterate from leaves back to the root. Got it, thanks.

> I might be doing this naively though. Got a better idea?

I thought I did, but I ended up with something based on a geometric series
and it looks terrible to code-up in C without, err, iterating like you do.

So yeah, ignore me :)

> > > +
> > > +	return total;
> > > +}
> > > +
> > > +static inline unsigned long hyp_s1_pgtable_size(void)
> > > +{
> > > +	struct hyp_memblock_region *reg;
> > > +	unsigned long nr_pages, res = 0;
> > > +	int i;
> > > +
> > > +	if (kvm_nvhe_sym(hyp_memblock_nr) <= 0)
> > > +		return 0;
> > 
> > It's a bit grotty having this be signed. Why do we need to encode the error
> > case differently from the 0 case?
> 
> Here specifically we don't, but it is needed in early_init_dt_add_memory_hyp()
> to distinguish the overflow case from the first memblock being added.

Fair enough, but if you figure out a way for hyp_memblock_nr to be unsigned,
I think that would be preferable.

Will

Will Deacon Feb. 5, 2021, 6:01 p.m. UTC | #53

On Thu, Feb 04, 2021 at 11:08:33AM +0000, Quentin Perret wrote:
> On Wednesday 03 Feb 2021 at 15:31:39 (+0000), Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:15PM +0000, Quentin Perret wrote:
> > > @@ -1481,7 +1486,10 @@ static void cpu_set_hyp_vector(void)
> > >  	struct bp_hardening_data *data = this_cpu_ptr(&bp_hardening_data);
> > >  	void *vector = hyp_spectre_vector_selector[data->slot];
> > >  
> > > -	*this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> > > +	if (!is_protected_kvm_enabled())
> > > +		*this_cpu_ptr_hyp_sym(kvm_hyp_vector) = (unsigned long)vector;
> > > +	else
> > > +		kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);
> > 
> > *Very* minor nit, but it might be cleaner to have static inline functions
> > with the same prototypes as the hypercalls, just to make the code even
> > easier to read. e.g
> > 
> > 	if (!is_protected_kvm_enabled())
> > 		_cpu_set_vector(data->slot);
> > 	else
> > 		kvm_call_hyp_nvhe(__pkvm_cpu_set_vector, data->slot);
> > 
> > you could then conceivably wrap that in a macro and avoid having the
> > "is_protected_kvm_enabled()" checks explicit every time.
> 
> Happy to do this here, but are you suggesting to generalize this pattern
> to other places as well?

I think it's probably a good pattern to follow, but no need to generalise it
prematurely.

> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index 3cf9397dabdb..9d4c9251208e 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -225,15 +225,39 @@ void free_hyp_pgds(void)
> > >  	if (hyp_pgtable) {
> > >  		kvm_pgtable_hyp_destroy(hyp_pgtable);
> > >  		kfree(hyp_pgtable);
> > > +		hyp_pgtable = NULL;
> > >  	}
> > >  	mutex_unlock(&kvm_hyp_pgd_mutex);
> > >  }
> > >  
> > > +static bool kvm_host_owns_hyp_mappings(void)
> > > +{
> > > +	if (static_branch_likely(&kvm_protected_mode_initialized))
> > > +		return false;
> > > +
> > > +	/*
> > > +	 * This can happen at boot time when __create_hyp_mappings() is called
> > > +	 * after the hyp protection has been enabled, but the static key has
> > > +	 * not been flipped yet.
> > > +	 */
> > > +	if (!hyp_pgtable && is_protected_kvm_enabled())
> > > +		return false;
> > > +
> > > +	BUG_ON(!hyp_pgtable);
> > 
> > Can we fail more gracefully, e.g. by continuing without KVM?
> 
> Got any suggestion as to how that can be done? We could also just remove
> that line -- that really should not happen.

Or downgrade to WARN_ON.

Will

Quentin Perret Feb. 9, 2021, 10 a.m. UTC | #54

On Thursday 04 Feb 2021 at 10:47:08 (+0000), Quentin Perret wrote:
> On Wednesday 03 Feb 2021 at 14:37:10 (+0000), Will Deacon wrote:
> > > +static void handle___pkvm_init(struct kvm_cpu_context *host_ctxt)
> > > +{
> > > +	DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> > > +	DECLARE_REG(unsigned long, size, host_ctxt, 2);
> > > +	DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3);
> > > +	DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4);
> > > +
> > > +	cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base);
> > 
> > __pkvm_init() doesn't return, so I think this assignment back into host_ctxt
> > is confusing.
> 
> Very good point, I'll get rid of this.

Actually not, I think I'll leave it like that. __pkvm_init can return an
error, which is why I did this in the first place And it is useful for
debugging to have it propagated back to the host.

Thanks,
Quentin

Will Deacon Feb. 9, 2021, 12:23 p.m. UTC | #55

On Tue, Feb 09, 2021 at 10:00:29AM +0000, Quentin Perret wrote:
> On Thursday 04 Feb 2021 at 10:47:08 (+0000), Quentin Perret wrote:
> > On Wednesday 03 Feb 2021 at 14:37:10 (+0000), Will Deacon wrote:
> > > > +static void handle___pkvm_init(struct kvm_cpu_context *host_ctxt)
> > > > +{
> > > > +	DECLARE_REG(phys_addr_t, phys, host_ctxt, 1);
> > > > +	DECLARE_REG(unsigned long, size, host_ctxt, 2);
> > > > +	DECLARE_REG(unsigned long, nr_cpus, host_ctxt, 3);
> > > > +	DECLARE_REG(unsigned long *, per_cpu_base, host_ctxt, 4);
> > > > +
> > > > +	cpu_reg(host_ctxt, 1) = __pkvm_init(phys, size, nr_cpus, per_cpu_base);
> > > 
> > > __pkvm_init() doesn't return, so I think this assignment back into host_ctxt
> > > is confusing.
> > 
> > Very good point, I'll get rid of this.
> 
> Actually not, I think I'll leave it like that. __pkvm_init can return an
> error, which is why I did this in the first place And it is useful for
> debugging to have it propagated back to the host.

Good point, but please add a comment!

Will

Quentin Perret Feb. 17, 2021, 5:24 p.m. UTC | #56

Hi Mate,

On Wednesday 17 Feb 2021 at 17:27:07 (+0100), Mate Toth-Pal wrote:
> We tested the pKVM changes pulled from here:
> 
> 
> >      https://android-kvm.googlesource.com/linux qperret/host-stage2-v2
> 
> 
> We were using a target with Arm architecture with FEAT_S2FWB, and found that
> there is a bug in the patch.
> 
> 
> It turned out that the Kernel checks for the extension, and sets up the
> stage 2 translation so that it forces the host memory type to write-through.
> However it seems that the code doesn't turn on the feature in the HCR_EL2
> register.
> 
> 
> We were able to fix the issue by applying the following patch:
> 
> 
> diff --git a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> index 0cd3eb178f3b..e8521a072ea6 100644
> --- a/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> +++ b/arch/arm64/kvm/hyp/nvhe/mem_protect.c
> @@ -105,6 +105,8 @@ int kvm_host_prepare_stage2(void *mem_pgt_pool, void
> *dev_pgt_pool)
>                 params->vttbr = kvm_get_vttbr(mmu);
>                 params->vtcr = host_kvm.arch.vtcr;
>                 params->hcr_el2 |= HCR_VM;
> +               if (cpus_have_const_cap(ARM64_HAS_STAGE2_FWB))
> +                       params->hcr_el2 |= HCR_FWB;
>                 __flush_dcache_area(params, sizeof(*params));
>         }

Aha, indeed, this looks right. I'll double check HCR_EL2 to see if I'm
missing any other, and I'll add this to v3.

Thanks for testing, and the for the report.
Quentin

Sean Christopherson Feb. 19, 2021, 5:54 p.m. UTC | #57

On Fri, Jan 08, 2021, Quentin Perret wrote:
> [2] https://kvmforum2020.sched.com/event/eE24/virtualization-for-the-masses-exposing-kvm-on-android-will-deacon-google

I couldn't find any slides on the official KVM forum site linked above.  I was
able to track down a mirror[1] and the recorded presentation[2].

[1] https://mirrors.edge.kernel.org/pub/linux/kernel/people/will/slides/kvmforum-2020-edited.pdf
[2] https://youtu.be/wY-u6n75iXc

Quentin Perret Feb. 19, 2021, 5:57 p.m. UTC | #58

On Friday 19 Feb 2021 at 09:54:38 (-0800), Sean Christopherson wrote:
> On Fri, Jan 08, 2021, Quentin Perret wrote:
> > [2] https://kvmforum2020.sched.com/event/eE24/virtualization-for-the-masses-exposing-kvm-on-android-will-deacon-google
> 
> I couldn't find any slides on the official KVM forum site linked above.  I was
> able to track down a mirror[1] and the recorded presentation[2].
> 
> [1] https://mirrors.edge.kernel.org/pub/linux/kernel/people/will/slides/kvmforum-2020-edited.pdf
> [2] https://youtu.be/wY-u6n75iXc

Much nicer, I'll make sure to link those in the next cover letter.

Thanks Sean!
Quentin

Sean Christopherson Feb. 19, 2021, 6:32 p.m. UTC | #59

On Wed, Feb 03, 2021, Will Deacon wrote:
> On Fri, Jan 08, 2021 at 12:15:14PM +0000, Quentin Perret wrote:

...

> > +static inline unsigned long hyp_s1_pgtable_size(void)
> > +{

...

> > +		res += nr_pages << PAGE_SHIFT;
> > +	}
> > +
> > +	/* Allow 1 GiB for private mappings */
> > +	nr_pages = (1 << 30) >> PAGE_SHIFT;
> 
> SZ_1G >> PAGE_SHIFT

Where does the 1gb magic number come from?  IIUC, this is calculating the number
of pages needed for the hypervisor's Stage-1 page tables.  The amount of memory
needed for those page tables should be easily calculated, and assuming huge
pages can be used, should be far less the 1gb.

> > +	nr_pages = __hyp_pgtable_max_pages(nr_pages);
> > +	res += nr_pages << PAGE_SHIFT;
> > +
> > +	return res;

...

> > +void __init kvm_hyp_reserve(void)
> > +{
> > +	u64 nr_pages, prev;
> > +
> > +	if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
> > +		return;
> > +
> > +	if (kvm_get_mode() != KVM_MODE_PROTECTED)
> > +		return;
> > +
> > +	if (kvm_nvhe_sym(hyp_memblock_nr) < 0) {
> > +		kvm_err("Failed to register hyp memblocks\n");
> > +		return;
> > +	}
> > +
> > +	sort_memblock_regions();
> > +
> > +	/*
> > +	 * We don't know the number of possible CPUs yet, so allocate for the
> > +	 * worst case.
> > +	 */
> > +	hyp_mem_size += NR_CPUS << PAGE_SHIFT;

Is this for per-cpu stack?

If so, what guarantees a single page is sufficient?  Mostly a curiosity question,
since it looks like this is an existing assumption by init_hyp_mode().  Shouldn't
the required stack size be defined in bytes and converted to pages, or is there a
guarantee that 64kb pages will be used?

> There was a recent patch bumping NR_CPUs to 512, so this would be 32MB
> with 64k pages. Is it possible to return memory to the host later on once
> we have a better handle on the number of CPUs in the system?

Does kvm_hyp_reserve() really need to be called during bootmem_init()?  What
prevents doing the reservation during init_hyp_mode()?  If the problem is that
pKVM needs a single contiguous chunk of memory, then it might be worth solving
_that_ problem, e.g. letting the host donate memory in N-byte chunks instead of
requiring a single huge blob of memory.

> > +	hyp_mem_size += hyp_s1_pgtable_size();

Quentin Perret Feb. 22, 2021, 11:04 a.m. UTC | #60

Hi Sean,

On Friday 19 Feb 2021 at 10:32:58 (-0800), Sean Christopherson wrote:
> On Wed, Feb 03, 2021, Will Deacon wrote:
> > On Fri, Jan 08, 2021 at 12:15:14PM +0000, Quentin Perret wrote:
> 
> ...
> 
> > > +static inline unsigned long hyp_s1_pgtable_size(void)
> > > +{
> 
> ...
> 
> > > +		res += nr_pages << PAGE_SHIFT;
> > > +	}
> > > +
> > > +	/* Allow 1 GiB for private mappings */
> > > +	nr_pages = (1 << 30) >> PAGE_SHIFT;
> > 
> > SZ_1G >> PAGE_SHIFT
> 
> Where does the 1gb magic number come from?

Admittedly it is arbitrary. It needs to be enough to cover all the
so-called 'private' mappings that EL2 needs, and which can vary a little
depending on the hardware.

> IIUC, this is calculating the number
> of pages needed for the hypervisor's Stage-1 page tables.

Correct. The thing worth noting is that the hypervisor VA space is
essentially split in half. One half is reserved to map portions of
memory with a fixed offset, and the other half is used for a whole bunch
of other things: we have a vmemmap, the 'private' mappings and the idmap
page.

> The amount of memory
> needed for those page tables should be easily calculated

As mentioned above, that is true for pretty much everything in the hyp
VA space except the private mappings as that depends on e.g. the CPU
uarch and such.

> and assuming huge pages can be used, should be far less the 1gb.

Ack, though this is no supported for the EL2 mappings yet. Historically
the amount of contiguous portions of memory mapped at EL2 has been
rather small, so there wasn't really a need, but we might want to
revisit this at some point.

> > > +	nr_pages = __hyp_pgtable_max_pages(nr_pages);
> > > +	res += nr_pages << PAGE_SHIFT;
> > > +
> > > +	return res;
> 
> ...
> 
> > > +void __init kvm_hyp_reserve(void)
> > > +{
> > > +	u64 nr_pages, prev;
> > > +
> > > +	if (!is_hyp_mode_available() || is_kernel_in_hyp_mode())
> > > +		return;
> > > +
> > > +	if (kvm_get_mode() != KVM_MODE_PROTECTED)
> > > +		return;
> > > +
> > > +	if (kvm_nvhe_sym(hyp_memblock_nr) < 0) {
> > > +		kvm_err("Failed to register hyp memblocks\n");
> > > +		return;
> > > +	}
> > > +
> > > +	sort_memblock_regions();
> > > +
> > > +	/*
> > > +	 * We don't know the number of possible CPUs yet, so allocate for the
> > > +	 * worst case.
> > > +	 */
> > > +	hyp_mem_size += NR_CPUS << PAGE_SHIFT;
> 
> Is this for per-cpu stack?

Correct.

> If so, what guarantees a single page is sufficient? Mostly a curiosity question,
> since it looks like this is an existing assumption by init_hyp_mode().  Shouldn't
> the required stack size be defined in bytes and converted to pages, or is there a
> guarantee that 64kb pages will be used?

Nope, we have no such guarantees, but 4K has been more than enough for
EL2 so far. The hyp code doesn't use recursion much (I think the only
occurence we have is Will's pgtable code, and that is architecturally
limited to 4 levels of recursion for obvious reasons) and doesn't have
use stack allocations.

It's on my todo list to remap the stack pages in the 'private' range, to
surround them with guard pages so we can at least run-time check this
assumption, so stay tuned :)

> > There was a recent patch bumping NR_CPUs to 512, so this would be 32MB
> > with 64k pages. Is it possible to return memory to the host later on once
> > we have a better handle on the number of CPUs in the system?
> 
> Does kvm_hyp_reserve() really need to be called during bootmem_init()?  What
> prevents doing the reservation during init_hyp_mode()?  If the problem is that
> pKVM needs a single contiguous chunk of memory, then it might be worth solving
> _that_ problem, e.g. letting the host donate memory in N-byte chunks instead of
> requiring a single huge blob of memory.

Right, I've been thinking about this over the weekend and that might
actually be fairly straightforward for stack pages. I'll try to move this
allocation to init_hyp_mode() where it belongs (or better, re-use the
existing one) in the nest version.

Thanks,
Quentin

[RFC,v2,00/26] KVM/arm64: A stage 2 for the host

Message

Comments