[v6,00/18] kvm: arm64: Dynamic IPA and 52bit IPA

Message ID 20180926163258.20218-1-suzuki.poulose@arm.com

Suzuki K Poulose Sept. 26, 2018, 4:32 p.m. UTC
The physical address space size for a VM (IPA size) on arm/arm64 is
currently fixed at 40bits. This series adds support for a per-VM IPA
size, allowing any size supported by the host (based on the host
kernel configuration and CPU support). The default size remains
40bits. On arm64, the limit can also be lowered (while keeping at
least 2 levels in stage2, to prevent splitting the host PMD huge pages
at stage2). We also add support for handling 52bit IPA addresses
(where supported), as added by the Arm v8.2 extensions.

We need to set the IPA limit as early as VM creation, to keep the code
simple and avoid sprinkling checks everywhere to ensure that the IPA
is configured. We encode the IPA size in the machine_type argument to
the KVM_CREATE_VM ioctl. Bits [7:0] of the type are reserved for the
IPA size. The availability of this feature is advertised by a new cap,
KVM_CAP_ARM_VM_IPA_SIZE. When supported, this capability returns the
maximum IPA shift supported by the host. The supported IPA size on a
host could be different from the system's PARange indicated by the
CPUs (e.g., due to a kernel limit on the PA size).
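
As a rough illustration of the encoding described above, userspace could
derive the machine type along the following lines. This is only a sketch:
ipa_to_vm_type() is a hypothetical helper, the mask mirrors the definition
quoted from the kvmtool patch in this thread, and the 32bit lower bound is
taken from the same kvmtool check. Real code should take the constants
from <linux/kvm.h>.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the mask used in this series; hypothetical local names. */
#define IPA_SIZE_MASK	0xffULL	/* bits [7:0] of the VM type */
#define DEFAULT_IPA	40	/* type 0 selects the legacy 40bit IPA */
#define MIN_IPA		32	/* lower bound assumed from the kvmtool check */

/*
 * Build the machine type argument for KVM_CREATE_VM.
 * max_ipa is the value returned by KVM_CHECK_EXTENSION for
 * KVM_CAP_ARM_VM_IPA_SIZE (0 when the capability is absent).
 * Returns 0 on success, -1 if the request cannot be honoured.
 */
static int ipa_to_vm_type(unsigned int ipa_bits, int max_ipa, uint64_t *type)
{
	assert(type);

	/* 0 (or the default) is encoded as type 0 and needs no capability. */
	if (ipa_bits == 0 || ipa_bits == DEFAULT_IPA) {
		*type = 0;
		return 0;
	}

	/* Any other size needs the capability and must fit the host limit. */
	if (max_ipa <= 0)
		return -1;
	if (ipa_bits < MIN_IPA || ipa_bits > (unsigned int)max_ipa)
		return -1;

	*type = ipa_bits & IPA_SIZE_MASK;
	return 0;
}
```

The resulting value would then be passed as the argument of the
KVM_CREATE_VM ioctl on the /dev/kvm file descriptor.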

Supporting different IPA sizes requires modifications to the stage2
page table code. The arm64 page table level helpers are defined based
on the page table levels used by the host VA. So, the accessors may
not work if the guest uses more levels in stage2 than the host uses in
stage1. The previous versions (v1 & v2) of this series refactored
the stage1 page table accessors to reuse the low-level accessors for an
independent stage2 table. However, due to the level folding in the
generic code, the types are redefined as well, i.e., if the PUD is
folded, the pud_t could be defined as:

 typedef struct { pgd_t pgd; } pud_t;

and similarly for pmd_t. So, without stage1-independent page table
entry types for stage2, we could be dealing with a different type for
level 0-2 entries. This is fine in practice on arm/arm64, as the
entries have a similar format and size and we always use the
appropriate accessors to get the raw value (i.e., pud_val/pmd_val
etc.), but it is not ideal for an upstream solution. So, this version
caps the stage2 page table levels to that of the stage1. This has the
following impact on the IPA support for various pagesize/host-va
combinations:


x-----------------------------------------------------x
| host\ipa    | 40bit | 42bit | 44bit | 48bit | 52bit |
-------------------------------------------------------
| 39bit-4K    |  y    |   y   |  n    |   n   |  n/a  |
-------------------------------------------------------
| 48bit-4K    |  y    |   y   |  y    |   y   |  n/a  |
-------------------------------------------------------
| 36bit-16K   |  y    |   n   |  n    |   n   |  n/a  |
-------------------------------------------------------
| 47bit-16K   |  y    |   y   |  y    |   y   |  n/a  |
-------------------------------------------------------
| 48bit-16K   |  y    |   y   |  y    |   y   |  n/a  |
-------------------------------------------------------
| 42bit-64K   |  y    |   y   |  y    |   n   |  n    |
-------------------------------------------------------
| 48bit-64K   |  y    |   y   |  y    |   y   |  y    |
x-----------------------------------------------------x

Or, put differently, the following IPA sizes cannot be supported:

 39bit-4K host  | [44 - 48]
 36bit-16K host | [41 - 48]
 42bit-64K host | [47 - 52]

which is not too bad. We can pursue independent stage2 page table
support and lift the restriction once we get there. Given there is a
proposal for a new generic page table walker [0], it would make sense
to keep our efforts in sync with it to avoid diverging from a common
API.
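
The restriction in the table can be reproduced with a small calculation.
The sketch below assumes that stage2 may concatenate up to 16 entry-level
tables (i.e. 4 extra bits resolved at the entry level) and that each level
resolves (PAGE_SHIFT - 3) bits; the helper names are hypothetical and not
taken from the series. It does not model the hardware restriction that
makes 52bit "n/a" for 4K and 16K pages.

```c
#include <assert.h>
#include <stdbool.h>

/* Levels needed to map 'bits' of address space for a given page shift. */
static unsigned int pgtable_levels(unsigned int bits, unsigned int page_shift)
{
	unsigned int per_level = page_shift - 3;	/* 8-byte descriptors */

	assert(bits > page_shift);
	return (bits - page_shift + per_level - 1) / per_level;
}

/* Stage2 levels, assuming up to 16 concatenated entry-level tables. */
static unsigned int stage2_levels(unsigned int ipa_bits, unsigned int page_shift)
{
	return pgtable_levels(ipa_bits - 4, page_shift);
}

/* An IPA size is usable only if stage2 needs no more levels than the host. */
static bool ipa_supported(unsigned int host_va_bits, unsigned int page_shift,
			  unsigned int ipa_bits)
{
	return stage2_levels(ipa_bits, page_shift) <=
	       pgtable_levels(host_va_bits, page_shift);
}
```

For example, a 39bit-4K host has 3 levels, a 43bit IPA needs 3 stage2
levels and a 44bit IPA needs 4, which matches the [44 - 48] entry in the
list above.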

52bit support is added for VGIC (including ITS emulation) and handling
of PAR, HPFAR registers.

The series applies on 4.19-rc4. A tree is available here:

	 git://linux-arm.org/linux-skp.git ipa52/v6

Tested with
  - Modified kvmtool (patches included in the series for reference /
    testing), which can only be used:
    * with virtio-pci up to 44bit PA (due to the 4K page size for the
      virtio-pci legacy interface implemented by kvmtool)
    * with virtio-mmio up to 48bit PA, due to the 32bit PFN limitation.
  - Hacked Qemu (boot loader support for highmem, IPA size support)
    * with virtio-pci, GIC-v3 ITS & MSI up to 52bit on the Foundation model.
    Also see [1] for Qemu support.

[0] https://lkml.org/lkml/2018/4/24/777
[1] https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg05759.html

Changes since v5:
 - Don't raise the IPA Limit to 40bits on systems with lower PA size.
   Doesn't break backward compatibility, we still allow KVM_CREATE_VM
   to succeed with "0" as the IPA size (40bits). But prevent specifying
   40bit explicitly, when the limit is lower.
 - Rename CAP, KVM_CAP_ARM_VM_PHYS_SHIFT => KVM_CAP_ARM_VM_IPA_SIZE
   and helper, KVM_VM_TYPE_ARM_VM_PHY_SHIFT => KVM_VM_TYPE_ARM_VM_IPA_SIZE
 - Update Documentation of the API
 - Update comments and commit description as reported by Eric
 - Set the missing TCR_T0SZ in patch "kvm: arm64: Configure VTCR_EL2 per VM"
 - Fix bits for CBASER_ADDRESS mask, GITS_CBASER_ADDRESS()

Changes since V4:
 - Rebased on v4.19-rc3
 - Dropped virtio patches queued already by mst.
 - Collect Acks from Christoffer
 - Restrict IPA configuration support to arm64 only
 - Use KVM_CAP_ARM_VM_PHYS_SHIFT for detecting the support for
   IPA size configuration along with the limit on the IPA for the host.
 - Update comments on __load_guest_stage2
 - Add comment about the default value for unknown PARange values.
 - Update Documentation of the API

Changes since V3:
 - Use per-VM VTCR instead of per-VM private VTCR bits
 - Allow IPA less than 40bits
 - Split the patch adding support for stage2 dynamic page tables
 - Rearrange the series to keep the userspace API at the end, which
   needs further discussion.
 - Collect Reviews/Acks from Eric & Marc

Changes since V2:
 - Drop "refactoring of host page table helpers" and restrict the IPA size
   to make sure stage2 doesn't use more page table levels than that of the host.
 - Load VTCR for TLB operations on behalf of the VM (Pointed-by: James Morse)
 - Split a couple of patches to make them easier to review.
 - Fall back to normal (non-concatenated) entry level page table support if
   possible.
 - Bump the IOCTL number

Changes since V1:
 - Change the userspace API for configuring VM to encode the IPA
   size in the VM type.  (suggested by Christoffer)
 - Expose the IPA limit on the host via ioctl on /dev/kvm
 - Handle 52bit addresses in PAR & HPFAR
 - Drop patch changing the life time of stage2 PGD
 - Rename macros for 48-to-52 bit conversion for GIC ITS BASER.
   (suggested by Christoffer)
 - Split virtio PFN check patches and address comments.


Kristina Martsenko (1):
  vgic: Add support for 52bit guest physical address

Suzuki K Poulose (17):
  kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table
  kvm: arm/arm64: Remove spurious WARN_ON
  kvm: arm64: Add helper for loading the stage2 setting for a VM
  arm64: Add a helper for PARange to physical shift conversion
  kvm: arm64: Clean up VTCR_EL2 initialisation
  kvm: arm/arm64: Allow arch specific configurations for VM
  kvm: arm64: Configure VTCR_EL2 per VM
  kvm: arm/arm64: Prepare for VM specific stage2 translations
  kvm: arm64: Prepare for dynamic stage2 page table layout
  kvm: arm64: Make stage2 page table layout dynamic
  kvm: arm64: Dynamic configuration of VTTBR mask
  kvm: arm64: Configure VTCR_EL2.SL0 per VM
  kvm: arm64: Switch to per VM IPA limit
  kvm: arm64: Add 52bit support for PAR to HPFAR conversoin
  kvm: arm64: Set a limit on the IPA size
  kvm: arm64: Limit the minimum number of page table levels
  kvm: arm64: Allow tuning the physical address size for VM

 Documentation/virtual/kvm/api.txt             |  31 +++
 arch/arm/include/asm/kvm_arm.h                |   3 +-
 arch/arm/include/asm/kvm_host.h               |   7 +
 arch/arm/include/asm/kvm_mmu.h                |  15 +-
 arch/arm/include/asm/stage2_pgtable.h         |  50 ++--
 arch/arm64/include/asm/cpufeature.h           |  20 ++
 arch/arm64/include/asm/kvm_arm.h              | 157 +++++++++---
 arch/arm64/include/asm/kvm_asm.h              |   2 -
 arch/arm64/include/asm/kvm_host.h             |  16 +-
 arch/arm64/include/asm/kvm_hyp.h              |  10 +
 arch/arm64/include/asm/kvm_mmu.h              |  42 +++-
 arch/arm64/include/asm/stage2_pgtable-nopmd.h |  42 ----
 arch/arm64/include/asm/stage2_pgtable-nopud.h |  39 ---
 arch/arm64/include/asm/stage2_pgtable.h       | 236 +++++++++++++-----
 arch/arm64/kvm/hyp/Makefile                   |   1 -
 arch/arm64/kvm/hyp/s2-setup.c                 |  90 -------
 arch/arm64/kvm/hyp/switch.c                   |   4 +-
 arch/arm64/kvm/hyp/tlb.c                      |   4 +-
 arch/arm64/kvm/reset.c                        | 103 ++++++++
 include/linux/irqchip/arm-gic-v3.h            |   5 +
 include/uapi/linux/kvm.h                      |  10 +
 virt/kvm/arm/arm.c                            |   9 +-
 virt/kvm/arm/mmu.c                            | 120 ++++-----
 virt/kvm/arm/vgic/vgic-its.c                  |  36 +--
 virt/kvm/arm/vgic/vgic-kvm-device.c           |   2 +-
 virt/kvm/arm/vgic/vgic-mmio-v3.c              |   2 -
 26 files changed, 648 insertions(+), 408 deletions(-)
 delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
 delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h
 delete mode 100644 arch/arm64/kvm/hyp/s2-setup.c

kvmtool changes:

Suzuki K Poulose (4):
  kvmtool: Allow backends to run checks on the KVM device fd
  kvmtool: arm64: Add support for guest physical address size
  kvmtool: arm64: Switch memory layout
  kvmtool: arm: Add support for creating VM with PA size

 arm/aarch32/include/kvm/kvm-arch.h        |  6 ++--
 arm/aarch64/include/kvm/kvm-arch.h        | 15 ++++++++--
 arm/aarch64/include/kvm/kvm-config-arch.h |  5 +++-
 arm/include/arm-common/kvm-arch.h         | 17 ++++++++----
 arm/include/arm-common/kvm-config-arch.h  |  1 +
 arm/kvm.c                                 | 34 ++++++++++++++++++++++-
 include/kvm/kvm.h                         |  4 +++
 kvm.c                                     |  2 ++
 8 files changed, 71 insertions(+), 13 deletions(-)

Comments

Marc Zyngier Sept. 28, 2018, 5:27 p.m. UTC | #1
Hi Suzuki,

On 26/09/18 17:32, Suzuki K Poulose wrote:
> Allow the arch backends to perform VM specific initialisation.
> This will be later used to handle IPA size configuration and per-VM
> VTCR configuration on arm64.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>   arch/arm/include/asm/kvm_host.h   | 7 +++++++
>   arch/arm64/include/asm/kvm_host.h | 2 ++
>   arch/arm64/kvm/reset.c            | 7 +++++++
>   virt/kvm/arm/arm.c                | 5 +++--
>   4 files changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 3ad482d2f1eb..72d46418e1ef 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -354,4 +354,11 @@ static inline void kvm_vcpu_put_sysregs(struct kvm_vcpu *vcpu) {}
>   struct kvm *kvm_arch_alloc_vm(void);
>   void kvm_arch_free_vm(struct kvm *kvm);
>   
> +static inline int kvm_arm_config_vm(struct kvm *kvm, unsigned long type)

This is a bit of a nit, but VM is a bit of an overloaded term in this 
context. Given what we do in the following patch (moving the global 
stage-2 init to be on a per VM -- virtual machine), I'd like to rename 
this to something less ambiguous.

How about kvm_arm_config_stage2? Or something along those lines?

No need to respin the series on this account, we can address it in a 
separate patch. But I think it would help understanding what is done where.

Thanks,

	M.

Suzuki K Poulose Sept. 29, 2018, 8:30 a.m. UTC | #2
Hi Marc,

On 09/28/2018 06:27 PM, Marc Zyngier wrote:
> Hi Suzuki,
> 
> On 26/09/18 17:32, Suzuki K Poulose wrote:
>> Allow the arch backends to perform VM specific initialisation.
>> This will be later used to handle IPA size configuration and per-VM
>> VTCR configuration on arm64.
>>
>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>> Cc: Christoffer Dall <cdall@kernel.org>
>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>>   arch/arm/include/asm/kvm_host.h   | 7 +++++++
>>   arch/arm64/include/asm/kvm_host.h | 2 ++
>>   arch/arm64/kvm/reset.c            | 7 +++++++
>>   virt/kvm/arm/arm.c                | 5 +++--
>>   4 files changed, 19 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_host.h 
>> b/arch/arm/include/asm/kvm_host.h
>> index 3ad482d2f1eb..72d46418e1ef 100644
>> --- a/arch/arm/include/asm/kvm_host.h
>> +++ b/arch/arm/include/asm/kvm_host.h
>> @@ -354,4 +354,11 @@ static inline void kvm_vcpu_put_sysregs(struct 
>> kvm_vcpu *vcpu) {}
>>   struct kvm *kvm_arch_alloc_vm(void);
>>   void kvm_arch_free_vm(struct kvm *kvm);
>> +static inline int kvm_arm_config_vm(struct kvm *kvm, unsigned long type)
> 
> This is a bit of a nit, but VM is a bit of an overloaded term in this 
> context. Given what we do in the following patch (moving the global 
> stage-2 init to be on a per VM -- virtual machine), I'd like to rename 
> this to something less ambiguous.
> 
> How about kvm_arm_config_stage2? Or something along those lines?

I had something similar in the earlier versions, where we supported
the "IPA" size for both arm and arm64. But since we restricted this
feature to arm64, I changed the name to make it more generic to
let the archs parse the "vm_type" parameter which could potentially have
more bits defined, not just the IPA size. Hence the change. On arm64, we
only use it for stage2 configuration and on arm32 we make sure the type
is empty.

> 
> No need to respin the series on this account, we can address it in a 
> separate patch. But I think it would help understanding what is done where.
> 

As such, I am fine with the suggestion.

Cheers
Suzuki

Catalin Marinas Oct. 1, 2018, 12:05 p.m. UTC | #3
On Wed, Sep 26, 2018 at 05:32:40PM +0100, Suzuki K. Poulose wrote:
> On arm64, ID_AA64MMFR0_EL1.PARange encodes the maximum Physical
> Address range supported by the CPU. Add a helper to decode this
> to actual physical shift. If we hit an unallocated value, return
> the maximum range supported by the kernel.
> This will be used by KVM to set the VTCR_EL2.T0SZ, as it
> is about to move its place. Having this helper keeps the code
> movement cleaner.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: James Morse <james.morse@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>

Acked-by: Catalin Marinas <catalin.marinas@arm.com>
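
For reference, the decoding described in the quoted commit message can be
sketched as below. The PARange encodings are the architectural ones; the
function name and the kernel_max_bits parameter (standing in for the
kernel's configured PA limit) are hypothetical, not the helper added by
the patch.

```c
#include <assert.h>

/*
 * Decode an ID_AA64MMFR0_EL1.PARange value into a physical address
 * shift. Unallocated encodings fall back to the caller-supplied
 * maximum supported by the kernel.
 */
static unsigned int parange_to_phys_shift(unsigned int parange,
					  unsigned int kernel_max_bits)
{
	assert(kernel_max_bits >= 32 && kernel_max_bits <= 52);

	switch (parange) {
	case 0: return 32;
	case 1: return 36;
	case 2: return 40;
	case 3: return 42;
	case 4: return 44;
	case 5: return 48;
	case 6: return 52;
	default:
		/* Unallocated value: report the kernel's own limit. */
		return kernel_max_bits;
	}
}
```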

Marc Zyngier Oct. 1, 2018, 2:13 p.m. UTC | #4
On 26/09/18 17:32, Suzuki K Poulose wrote:
> Specify the physical size for the VM encoded in the vm type.
> 
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>   arm/include/arm-common/kvm-arch.h |  6 +++++-
>   arm/kvm.c                         | 32 +++++++++++++++++++++++++++++++
>   2 files changed, 37 insertions(+), 1 deletion(-)
> 
> diff --git a/arm/include/arm-common/kvm-arch.h b/arm/include/arm-common/kvm-arch.h
> index b29b4b1..d77f3ac 100644
> --- a/arm/include/arm-common/kvm-arch.h
> +++ b/arm/include/arm-common/kvm-arch.h
> @@ -44,7 +44,11 @@
>   
>   #define KVM_IRQ_OFFSET		GIC_SPI_IRQ_BASE
>   
> -#define KVM_VM_TYPE		0
> +extern unsigned long		kvm_arm_type;
> +extern void kvm__arch_init_hyp(struct kvm *kvm);
> +
> +#define KVM_VM_TYPE		kvm_arm_type
> +#define kvm__arch_init_hyp	kvm__arch_init_hyp
>   
>   #define VIRTIO_DEFAULT_TRANS(kvm)	\
>   	((kvm)->cfg.arch.virtio_trans_pci ? VIRTIO_PCI : VIRTIO_MMIO)
> diff --git a/arm/kvm.c b/arm/kvm.c
> index eac2ad2..c8db6b3 100644
> --- a/arm/kvm.c
> +++ b/arm/kvm.c
> @@ -11,6 +11,8 @@
>   #include <linux/kvm.h>
>   #include <linux/sizes.h>
>   
> +unsigned long kvm_arm_type;
> +
>   struct kvm_ext kvm_req_ext[] = {
>   	{ DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
>   	{ DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
> @@ -18,6 +20,36 @@ struct kvm_ext kvm_req_ext[] = {
>   	{ 0, 0 },
>   };
>   
> +#ifndef KVM_CAP_ARM_VM_IPA_SIZE
> +#define KVM_CAP_ARM_VM_IPA_SIZE	159
> +#endif

Note that when merged on top of 4.19-rc5, this is already 160. I assume 
this will be even different once this lands in mainline.

Thanks,

	M.

> +
> +#ifndef KVM_VM_TYPE_ARM_IPA_SIZE
> +#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK	0xffULL
> +#define KVM_VM_TYPE_ARM_IPA_SIZE(x)	\
> +	((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK)
> +#endif
> +
> +void kvm__arch_init_hyp(struct kvm *kvm)
> +{
> +	int max_ipa;
> +
> +	if (!kvm->cfg.arch.phys_shift)
> +		kvm->cfg.arch.phys_shift = 40;
> +	if (kvm->cfg.arch.phys_shift == 40)
> +		return;
> +	max_ipa = ioctl(kvm->sys_fd,
> +			KVM_CHECK_EXTENSION, KVM_CAP_ARM_VM_IPA_SIZE);
> +	if (!max_ipa)
> +		die("Kernel doesn't support IPA size configuration\n");
> +	if ((kvm->cfg.arch.phys_shift > max_ipa) ||
> +	    (kvm->cfg.arch.phys_shift < 32))
> +		die("Requested PA size (%u) is not supported by the host"
> +		    " [32 - %u]bit\n",
> +		    kvm->cfg.arch.phys_shift, max_ipa);
> +	kvm_arm_type = KVM_VM_TYPE_ARM_IPA_SIZE(kvm->cfg.arch.phys_shift);
> +}
> +
>   bool kvm__arch_cpu_supports_vm(void)
>   {
>   	/* The KVM capability check is enough. */
>

Eric Auger Oct. 2, 2018, 7:48 a.m. UTC | #5
Hi Suzuki,

On 9/26/18 6:32 PM, Suzuki K Poulose wrote:
> Add support for setting the VTCR_EL2 per VM, rather than hard
> coding a value at boot time per CPU. This would allow us to tune
> the stage2 page table parameters per VM in the later changes.
> 
> We compute the VTCR fields based on the system wide sanitised
> feature registers, except for the hardware management of Access
> Flags (VTCR_EL2.HA). It is fine to run a system with a mix of
> CPUs that may or may not update the page table Access Flags.
> Since the bit is RES0 on CPUs that don't support it, the bit
> should be ignored on them.
> 
> Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
> Acked-by: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> ---
> Changes since v5:
>  - Set the missing TCR_T0SZ initialisation (Eric Auger)
>    and limit the T0SZ to the real CPU limit or KVM_PHYS_SHIFT
>    whichever is lower.
> ---
>  arch/arm64/include/asm/kvm_arm.h  |  3 +-
>  arch/arm64/include/asm/kvm_asm.h  |  2 -
>  arch/arm64/include/asm/kvm_host.h | 12 ++++--
>  arch/arm64/include/asm/kvm_hyp.h  |  1 +
>  arch/arm64/kvm/hyp/Makefile       |  1 -
>  arch/arm64/kvm/hyp/s2-setup.c     | 72 -------------------------------
>  arch/arm64/kvm/reset.c            | 35 +++++++++++++++
>  7 files changed, 45 insertions(+), 81 deletions(-)
>  delete mode 100644 arch/arm64/kvm/hyp/s2-setup.c
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 5f807b680a5f..14317b3a1820 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -135,8 +135,7 @@
>   * 40 bits wide (T0SZ = 24).  Systems with a PARange smaller than 40 bits are
>   * not known to exist and will break with this configuration.
>   *
> - * VTCR_EL2.PS is extracted from ID_AA64MMFR0_EL1.PARange at boot time
> - * (see hyp-init.S).
> + * The VTCR_EL2 is configured per VM and is initialised in kvm_arm_config_vm().
>   *
>   * Note that when using 4K pages, we concatenate two first level page tables
>   * together. With 16K pages, we concatenate 16 first level page tables.
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 102b5a5c47b6..0b53c72e7591 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -72,8 +72,6 @@ extern void __vgic_v3_init_lrs(void);
>  
>  extern u32 __kvm_get_mdcr_el2(void);
>  
> -extern u32 __init_stage2_translation(void);
> -
>  /* Home-grown __this_cpu_{ptr,read} variants that always work at HYP */
>  #define __hyp_this_cpu_ptr(sym)						\
>  	({								\
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index b04280ae1be0..5ecd457bce7d 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -61,11 +61,13 @@ struct kvm_arch {
>  	u64    vmid_gen;
>  	u32    vmid;
>  
> -	/* 1-level 2nd stage table, protected by kvm->mmu_lock */
> +	/* stage2 entry level table */
>  	pgd_t *pgd;
>  
>  	/* VTTBR value associated with above pgd and vmid */
>  	u64    vttbr;
> +	/* VTCR_EL2 value for this VM */
> +	u64    vtcr;
>  
>  	/* The last vcpu id that ran on each physical CPU */
>  	int __percpu *last_vcpu_ran;
> @@ -442,10 +444,12 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>  
>  static inline void __cpu_init_stage2(void)
>  {
> -	u32 parange = kvm_call_hyp(__init_stage2_translation);
> +	u32 ps;
>  
> -	WARN_ONCE(parange < 40,
> -		  "PARange is %d bits, unsupported configuration!", parange);
> +	/* Sanity check for minimum IPA size support */
> +	ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
> +	WARN_ONCE(ps < 40,
> +		  "PARange is %d bits, unsupported configuration!", ps);
>  }
>  
>  /* Guest/host FPSIMD coordination helpers */
> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
> index d1bd1e0f14d7..23aca66767f9 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -161,6 +161,7 @@ void __noreturn __hyp_do_panic(unsigned long, ...);
>   */
>  static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
>  {
> +	write_sysreg(kvm->arch.vtcr, vtcr_el2);
>  	write_sysreg(kvm->arch.vttbr, vttbr_el2);
>  }
>  
> diff --git a/arch/arm64/kvm/hyp/Makefile b/arch/arm64/kvm/hyp/Makefile
> index 2fabc2dc1966..82d1904328ad 100644
> --- a/arch/arm64/kvm/hyp/Makefile
> +++ b/arch/arm64/kvm/hyp/Makefile
> @@ -19,7 +19,6 @@ obj-$(CONFIG_KVM_ARM_HOST) += switch.o
>  obj-$(CONFIG_KVM_ARM_HOST) += fpsimd.o
>  obj-$(CONFIG_KVM_ARM_HOST) += tlb.o
>  obj-$(CONFIG_KVM_ARM_HOST) += hyp-entry.o
> -obj-$(CONFIG_KVM_ARM_HOST) += s2-setup.o
>  
>  # KVM code is run at a different exception code with a different map, so
>  # compiler instrumentation that inserts callbacks or checks into the code may
> diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
> deleted file mode 100644
> index e1ca672e937a..000000000000
> --- a/arch/arm64/kvm/hyp/s2-setup.c
> +++ /dev/null
> @@ -1,72 +0,0 @@
> -/*
> - * Copyright (C) 2016 - ARM Ltd
> - * Author: Marc Zyngier <marc.zyngier@arm.com>
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> - * GNU General Public License for more details.
> - *
> - * You should have received a copy of the GNU General Public License
> - * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> - */
> -
> -#include <linux/types.h>
> -#include <asm/kvm_arm.h>
> -#include <asm/kvm_asm.h>
> -#include <asm/kvm_hyp.h>
> -#include <asm/cpufeature.h>
> -
> -u32 __hyp_text __init_stage2_translation(void)
> -{
> -	u64 val = VTCR_EL2_FLAGS;
> -	u64 parange;
> -	u32 phys_shift;
> -	u64 tmp;
> -
> -	/*
> -	 * Read the PARange bits from ID_AA64MMFR0_EL1 and set the PS
> -	 * bits in VTCR_EL2. Amusingly, the PARange is 4 bits, but the
> -	 * allocated values are limited to 3bits.
> -	 */
> -	parange = read_sysreg(id_aa64mmfr0_el1) & 7;
> -	if (parange > ID_AA64MMFR0_PARANGE_MAX)
> -		parange = ID_AA64MMFR0_PARANGE_MAX;
> -	val |= parange << VTCR_EL2_PS_SHIFT;
> -
> -	/* Compute the actual PARange... */
> -	phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
> -
> -	/*
> -	 * ... and clamp it to 40 bits, unless we have some braindead
> -	 * HW that implements less than that. In all cases, we'll
> -	 * return that value for the rest of the kernel to decide what
> -	 * to do.
> -	 */
> -	val |= VTCR_EL2_T0SZ(phys_shift > 40 ? 40 : phys_shift);
> -
> -	/*
> -	 * Check the availability of Hardware Access Flag / Dirty Bit
> -	 * Management in ID_AA64MMFR1_EL1 and enable the feature in VTCR_EL2.
> -	 */
> -	tmp = (read_sysreg(id_aa64mmfr1_el1) >> ID_AA64MMFR1_HADBS_SHIFT) & 0xf;
> -	if (tmp)
> -		val |= VTCR_EL2_HA;
> -
> -	/*
> -	 * Read the VMIDBits bits from ID_AA64MMFR1_EL1 and set the VS
> -	 * bit in VTCR_EL2.
> -	 */
> -	tmp = (read_sysreg(id_aa64mmfr1_el1) >> ID_AA64MMFR1_VMIDBITS_SHIFT) & 0xf;
> -	val |= (tmp == ID_AA64MMFR1_VMIDBITS_16) ?
> -			VTCR_EL2_VS_16BIT :
> -			VTCR_EL2_VS_8BIT;
> -
> -	write_sysreg(val, vtcr_el2);
> -
> -	return phys_shift;
> -}
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index b0c07dab5cb3..616120c4176b 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -26,6 +26,7 @@
>  
>  #include <kvm/arm_arch_timer.h>
>  
> +#include <asm/cpufeature.h>
>  #include <asm/cputype.h>
>  #include <asm/ptrace.h>
>  #include <asm/kvm_arm.h>
> @@ -134,9 +135,43 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>  	return kvm_timer_vcpu_reset(vcpu);
>  }
>  
> +/*
> + * Configure the VTCR_EL2 for this VM. The VTCR value is common
> + * across all the physical CPUs on the system. We use system wide
> + * sanitised values to fill in different fields, except for Hardware
> + * Management of Access Flags. HA Flag is set unconditionally on
> + * all CPUs, as it is safe to run with or without the feature and
> + * the bit is RES0 on CPUs that don't support it.
> + */
>  int kvm_arm_config_vm(struct kvm *kvm, unsigned long type)
>  {
> +	u64 vtcr = VTCR_EL2_FLAGS;
> +	u32 parange, phys_shift;
> +
>  	if (type)
>  		return -EINVAL;
> +
> +	parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7;
> +	if (parange > ID_AA64MMFR0_PARANGE_MAX)
> +		parange = ID_AA64MMFR0_PARANGE_MAX;
> +	vtcr |= parange << VTCR_EL2_PS_SHIFT;
> +
> +	phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
> +	if (phys_shift > KVM_PHYS_SHIFT)
> +		phys_shift = KVM_PHYS_SHIFT;
> +	vtcr |= VTCR_EL2_T0SZ(phys_shift);
> +
> +	/*
> +	 * Enable the Hardware Access Flag management, unconditionally
> +	 * on all CPUs. The features is RES0 on CPUs without the support
> +	 * and must be ignored by the CPUs.
> +	 */
> +	vtcr |= VTCR_EL2_HA;
> +
> +	/* Set the vmid bits */
> +	vtcr |= (kvm_get_vmid_bits() == 16) ?
> +		VTCR_EL2_VS_16BIT :
> +		VTCR_EL2_VS_8BIT;
> +	kvm->arch.vtcr = vtcr;
>  	return 0;
>  }
>
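
The per-VM computation in the patch above can be modelled outside the
kernel roughly as follows. This is a simplified sketch: it only covers the
PS, T0SZ, HA and VS fields, leaves out the granule, SL0 and cacheability
bits carried by VTCR_EL2_FLAGS, and uses hypothetical names. The bit
positions follow the architectural VTCR_EL2 layout.

```c
#include <assert.h>
#include <stdint.h>

#define VTCR_PS_SHIFT	16			/* VTCR_EL2.PS, bits [18:16] */
#define VTCR_VS_16BIT	(UINT64_C(1) << 19)	/* VTCR_EL2.VS */
#define VTCR_HA		(UINT64_C(1) << 21)	/* VTCR_EL2.HA */
#define VTCR_T0SZ(bits)	((uint64_t)(64 - (bits)) & 0x3f)

/*
 * parange:        sanitised ID_AA64MMFR0_EL1.PARange value
 * cpu_phys_shift: PA bits decoded from parange
 * kvm_phys_shift: IPA limit for the VM (40 in the quoted patch)
 * vmid_bits:      8 or 16
 */
static uint64_t vtcr_vm_fields(unsigned int parange,
			       unsigned int cpu_phys_shift,
			       unsigned int kvm_phys_shift,
			       unsigned int vmid_bits)
{
	unsigned int shift = cpu_phys_shift;
	uint64_t vtcr;

	assert(parange <= 7);
	vtcr = (uint64_t)parange << VTCR_PS_SHIFT;

	/* T0SZ covers the smaller of the CPU limit and the VM limit. */
	if (shift > kvm_phys_shift)
		shift = kvm_phys_shift;
	vtcr |= VTCR_T0SZ(shift);

	/* HA is set unconditionally; it is RES0 where unsupported. */
	vtcr |= VTCR_HA;

	if (vmid_bits == 16)
		vtcr |= VTCR_VS_16BIT;

	return vtcr;
}
```

With a 48bit PARange (encoding 5) and the 40bit limit, T0SZ comes out as
24, which matches the "T0SZ = 24" comment in kvm_arm.h.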

Eric Auger Oct. 2, 2018, 7:54 a.m. UTC | #6
Hi Suzuki,

On 9/26/18 6:32 PM, Suzuki K Poulose wrote:
> On arm64 VTTBR_EL2:BADDR holds the base address for the stage2
> translation table. The Arm ARM mandates that the bits BADDR[x-1:0]
> should be 0, where 'x' is defined for a given IPA Size and the
> number of levels for a translation granule size. It is defined
> using some magical constants. This patch is a reverse engineered
> implementation to calculate the 'x' at runtime for a given ipa and
> number of page table levels. See patch for more details.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
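
As a quick cross-check of the formula in the patch quoted below, the
runtime calculation reproduces the values of the removed constants for the
default 40bit IPA. The sketch takes PAGE_SHIFT as a parameter so that all
three granules can be checked in one build; the function names are
hypothetical, and the mask follows the shape of the removed
VTTBR_BADDR_MASK macro.

```c
#include <assert.h>
#include <stdint.h>

/* x = IPA_SHIFT - (PAGE_SHIFT - 3) * number_of_levels */
static unsigned int vttbr_x(unsigned int ipa_bits, unsigned int levels,
			    unsigned int page_shift)
{
	assert(ipa_bits > levels * (page_shift - 3));
	return ipa_bits - levels * (page_shift - 3);
}

/* Mask of the valid BADDR bits, [pa_bits - 1 : x]. */
static uint64_t vttbr_baddr_mask(unsigned int pa_bits, unsigned int x)
{
	assert(x < pa_bits && pa_bits < 64);
	return ((UINT64_C(1) << (pa_bits - x)) - 1) << x;
}
```

For a 40bit IPA this gives x = 13 with 4K pages and 3 levels (37 - 24),
x = 18 with 16K pages and 2 levels (42 - 24), and x = 14 with 64K pages
and 2 levels (38 - 24).
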
> ---
> Changes since v5:
>  - Update comment about the Magic_N for VTTBR_X calculation
>  - Remove the obsolete VTTBR_TGRAN_MAGIC value definitions
> Changes since V3:
>  - Update reference to latest ARM ARM and improve commentary
> ---
>  arch/arm64/include/asm/kvm_arm.h | 73 ++++++++++++++++++++++++++++----
>  arch/arm64/include/asm/kvm_mmu.h | 25 ++++++++++-
>  2 files changed, 88 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 14317b3a1820..b236d90ca056 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -123,7 +123,6 @@
>  #define VTCR_EL2_SL0_MASK	(3 << VTCR_EL2_SL0_SHIFT)
>  #define VTCR_EL2_SL0_LVL1	(1 << VTCR_EL2_SL0_SHIFT)
>  #define VTCR_EL2_T0SZ_MASK	0x3f
> -#define VTCR_EL2_T0SZ_40B	24
>  #define VTCR_EL2_VS_SHIFT	19
>  #define VTCR_EL2_VS_8BIT	(0 << VTCR_EL2_VS_SHIFT)
>  #define VTCR_EL2_VS_16BIT	(1 << VTCR_EL2_VS_SHIFT)
> @@ -140,11 +139,8 @@
>   * Note that when using 4K pages, we concatenate two first level page tables
>   * together. With 16K pages, we concatenate 16 first level page tables.
>   *
> - * The magic numbers used for VTTBR_X in this patch can be found in Tables
> - * D4-23 and D4-25 in ARM DDI 0487A.b.
>   */
>  
> -#define VTCR_EL2_T0SZ_IPA	VTCR_EL2_T0SZ_40B
>  #define VTCR_EL2_COMMON_BITS	(VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
>  				 VTCR_EL2_IRGN0_WBWA | VTCR_EL2_RES1)
>  
> @@ -155,7 +151,6 @@
>   * 2 level page tables (SL = 1)
>   */
>  #define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
> -#define VTTBR_X_TGRAN_MAGIC		38
>  #elif defined(CONFIG_ARM64_16K_PAGES)
>  /*
>   * Stage2 translation configuration:
> @@ -163,7 +158,6 @@
>   * 2 level page tables (SL = 1)
>   */
>  #define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
> -#define VTTBR_X_TGRAN_MAGIC		42
>  #else	/* 4K */
>  /*
>   * Stage2 translation configuration:
> @@ -171,13 +165,74 @@
>   * 3 level page tables (SL = 1)
>   */
>  #define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
> -#define VTTBR_X_TGRAN_MAGIC		37
>  #endif
>  
>  #define VTCR_EL2_FLAGS			(VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
> -#define VTTBR_X				(VTTBR_X_TGRAN_MAGIC - VTCR_EL2_T0SZ_IPA)
> +/*
> + * ARM VMSAv8-64 defines an algorithm for finding the translation table
> + * descriptors in section D4.2.8 in ARM DDI 0487C.a.
> + *
> + * The algorithm defines the expectations on the translation table
> + * addresses for each level, based on PAGE_SIZE, entry level
> + * and the translation table size (T0SZ). The variable "x" in the
> + * algorithm determines the alignment of a table base address at a given
> + * level and thus determines the alignment of VTTBR:BADDR for stage2
> + * page table entry level.
> + * Since the number of bits resolved at the entry level could vary
> + * depending on the T0SZ, the value of "x" is defined based on a
> + * Magic constant for a given PAGE_SIZE and Entry Level. The
> + * intermediate levels must be always aligned to the PAGE_SIZE (i.e,
> + * x = PAGE_SHIFT).
> + *
> + * The value of "x" for entry level is calculated as :
> + *    x = Magic_N - T0SZ
> + *
> + * where Magic_N is an integer depending on the page size and the entry
> + * level of the page table as below:
> + *
> + *	--------------------------------------------
> + *	| Entry level		|  4K    16K   64K |
> + *	--------------------------------------------
> + *	| Level: 0 (4 levels)	| 28   |  -  |  -  |
> + *	--------------------------------------------
> + *	| Level: 1 (3 levels)	| 37   | 31  | 25  |
> + *	--------------------------------------------
> + *	| Level: 2 (2 levels)	| 46   | 42  | 38  |
> + *	--------------------------------------------
> + *	| Level: 3 (1 level)	| -    | 53  | 51  |
> + *	--------------------------------------------
> + *
> + * We have a magic formula for the Magic_N below:
> + *
> + *  Magic_N(PAGE_SIZE, Level) = 64 - ((PAGE_SHIFT - 3) * Number_of_levels)
> + *
> + * where Number_of_levels = (4 - Level). We are only interested in the
> + * value for Entry_Level for the stage2 page table.
> + *
> + * So, given that T0SZ = (64 - IPA_SHIFT), we can compute 'x' as follows:
> + *
> + *	x = (64 - ((PAGE_SHIFT - 3) * Number_of_levels)) - (64 - IPA_SHIFT)
> + *	  = IPA_SHIFT - ((PAGE_SHIFT - 3) * Number of levels)
> + *
> + * Here is one way to explain the Magic Formula:
> + *
> + *  x = log2(Size_of_Entry_Level_Table)
> + *
> + * Since, we can resolve (PAGE_SHIFT - 3) bits at each level, and another
> + * PAGE_SHIFT bits in the PTE, we have :
> + *
> + *  Bits_Entry_level = IPA_SHIFT - ((PAGE_SHIFT - 3) * (n - 1) + PAGE_SHIFT)
> + *		     = IPA_SHIFT - (PAGE_SHIFT - 3) * n - 3
> + *  where n = number of levels, and since each pointer is 8bytes, we have:
> + *
> + *  x = Bits_Entry_Level + 3
> + *    = IPA_SHIFT - (PAGE_SHIFT - 3) * n
> + *
> + * The only constraint here is that, we have to find the number of page table
> + * levels for a given IPA size (which we do, see stage2_pt_levels())
> + */
> +#define ARM64_VTTBR_X(ipa, levels)	((ipa) - ((levels) * (PAGE_SHIFT - 3)))
>  
> -#define VTTBR_BADDR_MASK  (((UL(1) << (PHYS_MASK_SHIFT - VTTBR_X)) - 1) << VTTBR_X)
>  #define VTTBR_VMID_SHIFT  (UL(48))
>  #define VTTBR_VMID_MASK(size) (_AT(u64, (1 << size) - 1) << VTTBR_VMID_SHIFT)
>  
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 7342d2c51773..ac3ca9690bad 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -145,7 +145,6 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
>  #define kvm_phys_shift(kvm)		KVM_PHYS_SHIFT
>  #define kvm_phys_size(kvm)		(_AC(1, ULL) << kvm_phys_shift(kvm))
>  #define kvm_phys_mask(kvm)		(kvm_phys_size(kvm) - _AC(1, ULL))
> -#define kvm_vttbr_baddr_mask(kvm)	VTTBR_BADDR_MASK
>  
>  static inline bool kvm_page_empty(void *ptr)
>  {
> @@ -520,5 +519,29 @@ static inline int hyp_map_aux_data(void)
>  
>  #define kvm_phys_to_vttbr(addr)		phys_to_ttbr(addr)
>  
> +/*
> + * Get the magic number 'x' for VTTBR:BADDR of this KVM instance.
> + * With v8.2 LVA extensions, 'x' should be a minimum of 6 with
> + * 52bit IPS.
> + */
> +static inline int arm64_vttbr_x(u32 ipa_shift, u32 levels)
> +{
> +	int x = ARM64_VTTBR_X(ipa_shift, levels);
> +
> +	return (IS_ENABLED(CONFIG_ARM64_PA_BITS_52) && x < 6) ? 6 : x;
> +}
> +
> +static inline u64 vttbr_baddr_mask(u32 ipa_shift, u32 levels)
> +{
> +	unsigned int x = arm64_vttbr_x(ipa_shift, levels);
> +
> +	return GENMASK_ULL(PHYS_MASK_SHIFT - 1, x);
> +}
> +
> +static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
> +{
> +	return vttbr_baddr_mask(kvm_phys_shift(kvm), kvm_stage2_levels(kvm));
> +}
> +
>  #endif /* __ASSEMBLY__ */
>  #endif /* __ARM64_KVM_MMU_H__ */
>
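For readers following the VTTBR "x" derivation in the comment above, here is a minimal standalone sketch of the arithmetic. It is not the kernel code: the helper names (vttbr_x, magic_n, vttbr_x_clamped, baddr_mask) are made up for illustration, PAGE_SHIFT is passed as a parameter instead of being a build-time constant, and baddr_mask() is an open-coded stand-in for GENMASK_ULL(PHYS_MASK_SHIFT - 1, x).

```c
#include <assert.h>

/* Magic_N(PAGE_SIZE, Level) = 64 - ((PAGE_SHIFT - 3) * Number_of_levels) */
static inline int magic_n(int page_shift, int entry_level)
{
	assert(entry_level >= 0 && entry_level <= 3);
	return 64 - (page_shift - 3) * (4 - entry_level);
}

/* x = IPA_SHIFT - ((PAGE_SHIFT - 3) * Number_of_levels) */
static inline int vttbr_x(int ipa_shift, int levels, int page_shift)
{
	assert(levels >= 1 && levels <= 4);
	return ipa_shift - levels * (page_shift - 3);
}

/* Mirrors the "minimum of 6 with 52bit" clamp; pa_bits_52 stands in
 * for IS_ENABLED(CONFIG_ARM64_PA_BITS_52). */
static inline int vttbr_x_clamped(int ipa_shift, int levels, int page_shift,
				  int pa_bits_52)
{
	int x = vttbr_x(ipa_shift, levels, page_shift);

	return (pa_bits_52 && x < 6) ? 6 : x;
}

/* Bits [phys_mask_shift - 1 : x] set, everything else clear. */
static inline unsigned long long baddr_mask(int phys_mask_shift, int x)
{
	assert(x >= 0 && x < phys_mask_shift && phys_mask_shift < 64);
	return ((1ULL << phys_mask_shift) - 1) & ~((1ULL << x) - 1);
}
```

With a 40bit IPA, 4K pages and 3 levels, vttbr_x(40, 3, 12) gives 13, which is the same value the old definition produced as VTTBR_X_TGRAN_MAGIC - T0SZ = 37 - 24.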
Eric Auger Oct. 2, 2018, 7:58 a.m. UTC | #7
Hi Suzuki,
On 9/26/18 6:32 PM, Suzuki K Poulose wrote:
> Now that we can manage the stage2 page table per VM, switch the
> configuration details to per VM instance. The VTCR is updated
> with the values specific to the VM based on the configuration.
> We store the IPA size and the number of stage2 page table levels
> for the guest already in VTCR. Decode it back from the vtcr
> field wherever we need it.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> ---
>  arch/arm64/include/asm/kvm_arm.h        | 2 ++
>  arch/arm64/include/asm/kvm_mmu.h        | 2 +-
>  arch/arm64/include/asm/stage2_pgtable.h | 2 +-
>  arch/arm64/kvm/reset.c                  | 2 +-
>  4 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index f913adb44f93..e4240568cc18 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -197,6 +197,8 @@
>  	VTCR_EL2_SL0_TO_LVLS(((vtcr) & VTCR_EL2_SL0_MASK) >> VTCR_EL2_SL0_SHIFT)
>  
>  #define VTCR_EL2_FLAGS			(VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
> +#define VTCR_EL2_IPA(vtcr)		(64 - ((vtcr) & VTCR_EL2_T0SZ_MASK))
> +
>  /*
>   * ARM VMSAv8-64 defines an algorithm for finding the translation table
>   * descriptors in section D4.2.8 in ARM DDI 0487C.a.
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index ac3ca9690bad..77b1af9e64db 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -142,7 +142,7 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
>   */
>  #define KVM_PHYS_SHIFT	(40)
>  
> -#define kvm_phys_shift(kvm)		KVM_PHYS_SHIFT
> +#define kvm_phys_shift(kvm)		VTCR_EL2_IPA(kvm->arch.vtcr)
>  #define kvm_phys_size(kvm)		(_AC(1, ULL) << kvm_phys_shift(kvm))
>  #define kvm_phys_mask(kvm)		(kvm_phys_size(kvm) - _AC(1, ULL))
>  
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index 36a0a1165003..c62fe118a898 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -43,7 +43,7 @@
>   */
>  #define stage2_pgtable_levels(ipa)	ARM64_HW_PGTABLE_LEVELS((ipa) - 4)
>  #define STAGE2_PGTABLE_LEVELS		stage2_pgtable_levels(KVM_PHYS_SHIFT)
> -#define kvm_stage2_levels(kvm)		stage2_pgtable_levels(kvm_phys_shift(kvm))
> +#define kvm_stage2_levels(kvm)		VTCR_EL2_LVLS(kvm->arch.vtcr)
>  
>  /*
>   * With all the supported VA_BITs and 40bit guest IPA, the following condition
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 1ced1e37374e..2bf41e007390 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -160,7 +160,7 @@ int kvm_arm_config_vm(struct kvm *kvm, unsigned long type)
>  	if (phys_shift > KVM_PHYS_SHIFT)
>  		phys_shift = KVM_PHYS_SHIFT;
>  	vtcr |= VTCR_EL2_T0SZ(phys_shift);
> -	vtcr |= VTCR_EL2_LVLS_TO_SL0(kvm_stage2_levels(kvm));
> +	vtcr |= VTCR_EL2_LVLS_TO_SL0(stage2_pgtable_levels(phys_shift));
>  
>  	/*
>  	 * Enable the Hardware Access Flag management, unconditionally
>
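As a rough illustration of "store it in VTCR, decode it back", the sketch below round-trips the IPA size and the number of levels through a VTCR-like value. The field positions (T0SZ in bits [5:0], SL0 in bits [7:6]) and the SL0 base (2 for 4K, 3 for 16K/64K) are assumptions based on the architecture and on earlier patches in the series, not something shown in this patch; all names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define T0SZ_MASK	0x3fULL			/* assumed: T0SZ in bits [5:0] */
#define SL0_SHIFT	6			/* assumed: SL0 in bits [7:6] */
#define SL0_MASK	(3ULL << SL0_SHIFT)

/* Encode IPA size and level count; sl0_base is 2 for 4K pages and
 * 3 for 16K/64K pages (assumption, see lead-in). */
static inline uint64_t vtcr_encode(unsigned int ipa, unsigned int levels,
				   unsigned int sl0_base)
{
	uint64_t vtcr;

	assert(ipa >= 1 && ipa <= 64);
	assert(levels >= 1 && levels <= 4 && sl0_base >= 4 - levels);

	vtcr = (64 - ipa) & T0SZ_MASK;		/* T0SZ = 64 - IPA_SHIFT */
	vtcr |= ((uint64_t)(sl0_base - (4 - levels)) << SL0_SHIFT) & SL0_MASK;
	return vtcr;
}

/* Same shape as VTCR_EL2_IPA(): 64 - T0SZ */
static inline unsigned int vtcr_ipa(uint64_t vtcr)
{
	return 64 - (unsigned int)(vtcr & T0SZ_MASK);
}

/* Inverse of the SL0 encoding above. */
static inline unsigned int vtcr_levels(uint64_t vtcr, unsigned int sl0_base)
{
	unsigned int sl0 = (unsigned int)((vtcr & SL0_MASK) >> SL0_SHIFT);

	return 4 - (sl0_base - sl0);
}
```

The point of the patch is that once both values live in the per-VM vtcr, kvm_phys_shift() and kvm_stage2_levels() no longer need a separate field; the round trip above is the property they rely on.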
Eric Auger Oct. 2, 2018, 8:20 a.m. UTC | #8
Hi,

On 9/26/18 6:32 PM, Suzuki K Poulose wrote:
> So far we have restricted the IPA size of the VM to the default
> value (40bits). Now that we can manage the IPA size per VM and
> support dynamic stage2 page tables, we can allow VMs to have
> larger IPA. This patch introduces a the maximum IPA size
> supported on the host.
nit: a the -> computes the max IPA size that can be supported for any VM ?

Besides
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

> This is decided by the following factors :
> 
>  1) Maximum PARange supported by the CPUs - This can be inferred
>     from the system wide safe value.
>  2) Maximum PA size supported by the host kernel (48 vs 52)
>  3) Number of levels in the host page table (as we base our
>     stage2 tables on the host table helpers).
> 
> Since the stage2 page table code is dependent on the stage1
> page table, we always ensure that :
> 
>   Number of Levels at Stage1 >= Number of Levels at Stage2
> 
> So we limit the IPA to make sure that the above condition
> is satisfied. This will affect the following combinations
> of VA_BITS and IPA for different page sizes.
> 
>   Host configuration | Unsupported IPA ranges
>   39bit VA, 4K       | [44, 48]
>   36bit VA, 16K      | [41, 48]
>   42bit VA, 64K      | [47, 52]
> 
> Supporting the above combinations needs independent stage2
> page table manipulation code, which would need substantial
> changes. We could pursue the solution independently and
> switch the page table code once we have it ready.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since v5:
>  - Don't raise the IPA limit to 40bits
>  - Print the KVM IPA limit, WARN if the limit is less than the
>    default size. Drop the per-CPU PARange check
>  - If the limit was reduced due to kernel configuration,
>    report the limiting factor. i.e, kernel virtual vs physical
>    address limit.
> Changes since V2:
>  - Restrict the IPA size to limit the number of page table
>    levels in stage2 to that of stage1 or less.
> ---
>  arch/arm/include/asm/kvm_mmu.h    |  2 ++
>  arch/arm64/include/asm/kvm_host.h | 12 +++------
>  arch/arm64/kvm/reset.c            | 43 +++++++++++++++++++++++++++++++
>  virt/kvm/arm/arm.c                |  2 ++
>  4 files changed, 50 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 12ae5fbbcf01..5ad1a54f98dc 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -358,6 +358,8 @@ static inline int hyp_map_aux_data(void)
>  
>  #define kvm_phys_to_vttbr(addr)		(addr)
>  
> +static inline void kvm_set_ipa_limit(void) {}
> +
>  #endif	/* !__ASSEMBLY__ */
>  
>  #endif /* __ARM_KVM_MMU_H__ */
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 5ecd457bce7d..f008f8866b2a 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -442,15 +442,7 @@ int kvm_arm_vcpu_arch_get_attr(struct kvm_vcpu *vcpu,
>  int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>  			       struct kvm_device_attr *attr);
>  
> -static inline void __cpu_init_stage2(void)
> -{
> -	u32 ps;
> -
> -	/* Sanity check for minimum IPA size support */
> -	ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
> -	WARN_ONCE(ps < 40,
> -		  "PARange is %d bits, unsupported configuration!", ps);
> -}
> +static inline void __cpu_init_stage2(void) {}
>  
>  /* Guest/host FPSIMD coordination helpers */
>  int kvm_arch_vcpu_run_map_fp(struct kvm_vcpu *vcpu);
> @@ -513,6 +505,8 @@ static inline int kvm_arm_have_ssbd(void)
>  void kvm_vcpu_load_sysregs(struct kvm_vcpu *vcpu);
>  void kvm_vcpu_put_sysregs(struct kvm_vcpu *vcpu);
>  
> +void kvm_set_ipa_limit(void);
> +
>  #define __KVM_HAVE_ARCH_VM_ALLOC
>  struct kvm *kvm_arch_alloc_vm(void);
>  void kvm_arch_free_vm(struct kvm *kvm);
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 2bf41e007390..96b3f50101bc 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -34,6 +34,9 @@
>  #include <asm/kvm_coproc.h>
>  #include <asm/kvm_mmu.h>
>  
> +/* Maximum phys_shift supported for any VM on this host */
> +static u32 kvm_ipa_limit;
> +
>  /*
>   * ARMv8 Reset Values
>   */
> @@ -135,6 +138,46 @@ int kvm_reset_vcpu(struct kvm_vcpu *vcpu)
>  	return kvm_timer_vcpu_reset(vcpu);
>  }
>  
> +void kvm_set_ipa_limit(void)
> +{
> +	unsigned int ipa_max, pa_max, va_max, parange;
> +
> +	parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 0x7;
> +	pa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
> +
> +	/* Clamp the IPA limit to the PA size supported by the kernel */
> +	ipa_max = (pa_max > PHYS_MASK_SHIFT) ? PHYS_MASK_SHIFT : pa_max;
> +	/*
> +	 * Since our stage2 table is dependent on the stage1 page table code,
> +	 * we must always honor the following condition:
> +	 *
> +	 *  Number of levels in Stage1 >= Number of levels in Stage2.
> +	 *
> +	 * So clamp the ipa limit further down to limit the number of levels.
> +	 * Since we can concatenate up to 16 tables at entry level, we could
> +	 * go up to 4bits above the maximum VA addressable with the current
> +	 * number of levels.
> +	 */
> +	va_max = PGDIR_SHIFT + PAGE_SHIFT - 3;
> +	va_max += 4;
> +
> +	if (va_max < ipa_max)
> +		ipa_max = va_max;
> +
> +	/*
> +	 * If the final limit is lower than the real physical address
> +	 * limit of the CPUs, report the reason.
> +	 */
> +	if (ipa_max < pa_max)
> +		pr_info("kvm: Limiting the IPA size due to kernel %s Address limit\n",
> +			(va_max < pa_max) ? "Virtual" : "Physical");
> +
> +	WARN(ipa_max < KVM_PHYS_SHIFT,
> +	     "KVM IPA limit (%d bit) is smaller than default size\n", ipa_max);
> +	kvm_ipa_limit = ipa_max;
> +	kvm_info("IPA Size Limit: %dbits\n", kvm_ipa_limit);
> +}
> +
>  /*
>   * Configure the VTCR_EL2 for this VM. The VTCR value is common
>   * across all the physical CPUs on the system. We use system wide
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 43e716bc3f08..631f9a3ad99a 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -1413,6 +1413,8 @@ static int init_common_resources(void)
>  	kvm_vmid_bits = kvm_get_vmid_bits();
>  	kvm_info("%d-bit VMID\n", kvm_vmid_bits);
>  
> +	kvm_set_ipa_limit();
> +
>  	return 0;
>  }
>  
>
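The clamping in kvm_set_ipa_limit() can be checked against the "Unsupported IPA ranges" table with a small standalone sketch. This is only an approximation of the kernel logic: ipa_limit() is a hypothetical helper, the host page size and level count are parameters rather than build-time constants, and PGDIR_SHIFT is assumed to be (PAGE_SHIFT - 3) * levels + 3.

```c
#include <assert.h>

/*
 * pa_max:          PA bits reported by the CPUs (sanitised PARange)
 * phys_mask_shift: PA bits supported by the kernel (48 or 52)
 * page_shift:      12, 14 or 16
 * host_levels:     number of stage1 page table levels on the host
 */
static inline unsigned int ipa_limit(unsigned int pa_max,
				     unsigned int phys_mask_shift,
				     unsigned int page_shift,
				     unsigned int host_levels)
{
	unsigned int ipa_max, pgdir_shift, va_max;

	assert(host_levels >= 2 && host_levels <= 4);

	/* Clamp to the PA size supported by the kernel */
	ipa_max = (pa_max > phys_mask_shift) ? phys_mask_shift : pa_max;

	/* Assumed PGDIR_SHIFT for the given number of levels */
	pgdir_shift = (page_shift - 3) * host_levels + 3;

	/* Maximum VA with these levels, plus 4 bits for concatenation */
	va_max = pgdir_shift + page_shift - 3;
	va_max += 4;

	return (va_max < ipa_max) ? va_max : ipa_max;
}
```

For a 39bit VA, 4K host (3 levels) this yields 43, so IPA sizes 44 to 48 are out of reach; 36bit VA, 16K (2 levels) yields 40; 42bit VA, 64K (2 levels) yields 46. Those match the three rows in the commit message.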
Eric Auger Oct. 2, 2018, 8:22 a.m. UTC | #9
Hi,
On 9/26/18 6:32 PM, Suzuki K Poulose wrote:
> Since we are about to remove the lower limit on the IPA size,
> make sure that we do not go to 1 level page table (e.g, with
> 32bit IPA on 64K host with concatenation) to avoid splitting
> the host PMD huge pages at stage2.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> ---
> Change since v5:
>  - Cosmetic changes to the comment
>  - Remove unnecessary new line
> ---
>  arch/arm64/include/asm/stage2_pgtable.h |  7 ++++++-
>  arch/arm64/kvm/reset.c                  | 10 +++++++++-
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index c62fe118a898..2cce769ba4c6 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -72,8 +72,13 @@
>  /*
>   * The number of PTRS across all concatenated stage2 tables given by the
>   * number of bits resolved at the initial level.
> + * If we force more levels than necessary, we may have (stage2_pgdir_shift > IPA),
> + * in which case, stage2_pgd_ptrs will have one entry.
>   */
> -#define __s2_pgd_ptrs(ipa, lvls)	(1 << ((ipa) - pt_levels_pgdir_shift((lvls))))
> +#define pgd_ptrs_shift(ipa, pgdir_shift)	\
> +	((ipa) > (pgdir_shift) ? ((ipa) - (pgdir_shift)) : 0)
> +#define __s2_pgd_ptrs(ipa, lvls)		\
> +	(1 << (pgd_ptrs_shift((ipa), pt_levels_pgdir_shift(lvls))))
>  #define __s2_pgd_size(ipa, lvls)	(__s2_pgd_ptrs((ipa), (lvls)) * sizeof(pgd_t))
>  
>  #define stage2_pgd_ptrs(kvm)		__s2_pgd_ptrs(kvm_phys_shift(kvm), kvm_stage2_levels(kvm))
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 96b3f50101bc..f156e45760bc 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -190,6 +190,7 @@ int kvm_arm_config_vm(struct kvm *kvm, unsigned long type)
>  {
>  	u64 vtcr = VTCR_EL2_FLAGS;
>  	u32 parange, phys_shift;
> +	u8 lvls;
>  
>  	if (type)
>  		return -EINVAL;
> @@ -203,7 +204,14 @@ int kvm_arm_config_vm(struct kvm *kvm, unsigned long type)
>  	if (phys_shift > KVM_PHYS_SHIFT)
>  		phys_shift = KVM_PHYS_SHIFT;
>  	vtcr |= VTCR_EL2_T0SZ(phys_shift);
> -	vtcr |= VTCR_EL2_LVLS_TO_SL0(stage2_pgtable_levels(phys_shift));
> +	/*
> +	 * Use a minimum 2 level page table to prevent splitting
> +	 * host PMD huge pages at stage2.
> +	 */
> +	lvls = stage2_pgtable_levels(phys_shift);
> +	if (lvls < 2)
> +		lvls = 2;
> +	vtcr |= VTCR_EL2_LVLS_TO_SL0(lvls);
>  
>  	/*
>  	 * Enable the Hardware Access Flag management, unconditionally
>
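A small sketch of the two pieces in this patch, the 2-level minimum and the clamped entry-level pointer count. It assumes ARM64_HW_PGTABLE_LEVELS(va_bits) is ((va_bits) - 4) / (PAGE_SHIFT - 3) and that pt_levels_pgdir_shift(lvls) is (PAGE_SHIFT - 3) * lvls + 3; the function names are hypothetical and PAGE_SHIFT is a parameter here.

```c
#include <assert.h>

/* stage2_pgtable_levels(ipa) with the minimum of 2 levels applied */
static inline unsigned int stage2_levels(unsigned int ipa,
					 unsigned int page_shift)
{
	unsigned int lvls;

	assert(ipa > 8);
	/* ARM64_HW_PGTABLE_LEVELS((ipa) - 4), see assumption above */
	lvls = ((ipa - 4) - 4) / (page_shift - 3);

	/* Avoid a 1 level table, which would split host PMD huge pages */
	return (lvls < 2) ? 2 : lvls;
}

/* Assumed shape of pt_levels_pgdir_shift() */
static inline unsigned int pgdir_shift(unsigned int lvls,
				       unsigned int page_shift)
{
	return (page_shift - 3) * lvls + 3;
}

/* __s2_pgd_ptrs() with the shift clamped at 0 */
static inline unsigned long pgd_ptrs(unsigned int ipa, unsigned int lvls,
				     unsigned int page_shift)
{
	unsigned int shift = pgdir_shift(lvls, page_shift);

	return 1UL << ((ipa > shift) ? (ipa - shift) : 0);
}
```

With a 32bit IPA on a 64K host the unclamped level count is 1, so it is raised to 2 and the entry level ends up with 8 pointers. A purely hypothetical 28bit IPA with the same forced 2 levels would have pgdir_shift (29) above the IPA, which is the case the clamp handles by returning a single entry.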
Eric Auger Oct. 4, 2018, 8:40 a.m. UTC | #10
Hi Suzuki,

On 9/26/18 6:32 PM, Suzuki K Poulose wrote:
> 
> The physical address space size for a VM (IPA size) on arm/arm64 is
> limited to a static limit of 40bits. This series adds support for
> using an IPA size specific to a VM, allowing to use a size supported
> by the host (based on the host kernel configuration and CPU support).
> The default size is fixed to 40bits. On arm64, we can allow the limit
> to be lowered (limiting the number of levels in stage2 to 2, to prevent
> splitting the host PMD huge pages at stage2). We also add support for
> handling 52bit IPA addresses (where supported) added by Arm v8.2
> extensions.
> 
> We need to set the IPA limit as early as the VM creation to keep the
> code simpler to avoid sprinkling checks everywhere to ensure that the
> IPA is configured. We encode the IPA size in the machine_type
> argument to KVM_CREATE_VM ioctl. Bits [7-0] of the type are reserved
> for the IPA size. The availability of this feature is advertised by a
> new cap KVM_CAP_ARM_VM_IPA_SIZE. When supported, this capability
> returns the maximum IPA shift supported by the host. The supported IPA
> size on a host could be different from the system's PARange indicated
> by the CPUs (e.g, kernel limit on the PA size).
> 
> Supporting different IPA size requires modification to the stage2 page
> table code. The arm64 page table level helpers are defined based on the
> page table levels used by the host VA. So, the accessors may not work
> if the guest uses more number of levels in stage2 than the stage1
> of the host.  The previous versions (v1 & v2) of this series refactored
> the stage1 page table accessors to reuse the low-level accessors for an
> independent stage2 table. However, due to the level folding in the
> generic code, the types are redefined as well. i.e, if the PUD is
> folded, the pud_t could be defined as :
> 
>  typedef struct { pgd_t pgd; } pud_t;
> 
> similarly for pmd_t.  So, without stage1 independent page table entry
> types for stage2, we could be dealing with a different type for level
>  0-2 entries. This is practically fine on arm/arm64 as the entries
> have similar format and size and we always use the appropriate
> accessors to get the raw value (i.e, pud_val/pmd_val etc). But not
> ideal for a solution upstream. So, this version caps the stage2 page
> table levels to that of the stage1. This has the following impact on
> the IPA support for various pagesize/host-va combinations :
> 
> 
> x-----------------------------------------------------x
> | host\ipa    | 40bit | 42bit | 44bit | 48bit | 52bit |
> -------------------------------------------------------
> | 39bit-4K    |  y    |   y   |  n    |   n   |  n/a  |
> -------------------------------------------------------
> | 48bit-4K    |  y    |   y   |  y    |   y   |  n/a  |
> -------------------------------------------------------
> | 36bit-16K   |  y    |   n   |  n    |   n   |  n/a  |
> -------------------------------------------------------
> | 47bit-16K   |  y    |   y   |  y    |   y   |  n/a  |
> -------------------------------------------------------
> | 48bit-16K   |  y    |   y   |  y    |   y   |  n/a  |
> -------------------------------------------------------
> | 42bit-64K   |  y    |   y   |  y    |   n   |  n    |
> -------------------------------------------------------
> | 48bit-64K   |  y    |   y   |  y    |   y   |  y    |
> x-----------------------------------------------------x
> 
> Or the following list shows what cannot be supported :
> 
>  39bit-4K host  | [44 - 48]
>  36bit-16K host | [41 - 48]
>  42bit-64K host | [47 - 52]
> 
> which is not really bad. We can pursue the independent stage2
> page table support and lift the restriction once we get there.
> Given there is a proposal for new generic page table walker [0],
> it would make sense to make our efforts in sync with it to avoid
> diverting from a common API.
> 
> 52bit support is added for VGIC (including ITS emulation) and handling
> of PAR, HPFAR registers.
> 
> The series applies on 4.19-rc4. A tree is available here:
> 
> 	 git://linux-arm.org/linux-skp.git ipa52/v6
> 
> Tested with
>   - Modified kvmtool, which can only be used for (patches included in
>     the series for reference / testing):
>     * with virtio-pci upto 44bit PA (Due to 4K page size for virtio-pci
>       legacy implemented by kvmtool)
>     * Upto 48bit PA with virtio-mmio, due to 32bit PFN limitation.
>   - Hacked Qemu (boot loader support for highmem, IPA size support)
>     * with virtio-pci GIC-v3 ITS & MSI upto 52bit on Foundation model.
>     Also see [1] for Qemu support.
> 
> [0] https://lkml.org/lkml/2018/4/24/777
> [1] https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg05759.html
> 
> Change since v5:
>  - Don't raise the IPA Limit to 40bits on systems with lower PA size.
>    Doesn't break backward compatibility, we still allow KVM_CREATE_VM
>    to succeed with "0" as the IPA size (40bits). But prevent specifying
>    40bit explicitly, when the limit is lower.
>  - Rename CAP, KVM_CAP_ARM_VM_PHYS_SHIFT => KVM_CAP_ARM_VM_IPA_SIZE
>    and helper, KVM_VM_TYPE_ARM_VM_PHY_SHIFT => KVM_VM_TYPE_ARM_VM_IPA_SIZE
>  - Update Documentation of the API
>  - Update comments and commit description as reported by Eric
>  - Set the missing TCR_T0SZ in patch "kvm: arm64: Configure VTCR_EL2 per VM"
>  - Fix bits for CBASER_ADDRESS mask, GITS_CBASER_ADDRESS()
> 
> Changes since V4:
>  - Rebased on v4.19-rc3
>  - Dropped virtio patches queued already by mst.
>  - Collect Acks from Christoffer
>  - Restrict IPA configuration support to arm64 only
>  - Use KVM_CAP_ARM_VM_PHYS_SHIFT for detecting the support for
>    IPA size configuration along with the limit on the IPA for the host.
>  - Update comments on __load_guest_stage2
>  - Add comment about the default value for unknown PARange values.
>  - Update Documentation of the API
> 
> Changes since V3:
>  - Use per-VM VTCR instead per-VM private VTCR bits
>  - Allow IPA less than 40bits
>  - Split the patch adding support for stage2 dynamic page tables
>  - Rearrange the series to keep the userspace API at the end, which
>    needs further discussion.
>  - Collect Reviews/Acks from Eric & Marc
> 
> Changes since V2:
>  - Drop "refactoring of host page table helpers" and restrict the IPA size
>    to make sure stage2 doesn't use more page table levels than that of the host.
>  - Load VTCR for TLB operations on behalf of the VM (Pointed-by: James Morse)
>  - Split a couple of patches to make them easier to review.
>  - Fall back to normal (non-concatenated) entry level page table support if
>    possible.
>  - Bump the IOCTL number
> 
> Changes since V1:
>  - Change the userspace API for configuring VM to encode the IPA
>    size in the VM type.  (suggested by Christoffer)
>  - Expose the IPA limit on the host via ioctl on /dev/kvm
>  - Handle 52bit addresses in PAR & HPFAR
>  - Drop patch changing the life time of stage2 PGD
>  - Rename macros for 48-to-52 bit conversion for GIC ITS BASER.
>    (suggested by Christoffer)
>  - Split virtio PFN check patches and address comments.
> 
> 
> Kristina Martsenko (1):
>   vgic: Add support for 52bit guest physical address
> 
> Suzuki K Poulose (17):
>   kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table
>   kvm: arm/arm64: Remove spurious WARN_ON
>   kvm: arm64: Add helper for loading the stage2 setting for a VM
>   arm64: Add a helper for PARange to physical shift conversion
>   kvm: arm64: Clean up VTCR_EL2 initialisation
>   kvm: arm/arm64: Allow arch specific configurations for VM
>   kvm: arm64: Configure VTCR_EL2 per VM
>   kvm: arm/arm64: Prepare for VM specific stage2 translations
>   kvm: arm64: Prepare for dynamic stage2 page table layout
>   kvm: arm64: Make stage2 page table layout dynamic
>   kvm: arm64: Dynamic configuration of VTTBR mask
>   kvm: arm64: Configure VTCR_EL2.SL0 per VM
>   kvm: arm64: Switch to per VM IPA limit
>   kvm: arm64: Add 52bit support for PAR to HPFAR conversoin
>   kvm: arm64: Set a limit on the IPA size
>   kvm: arm64: Limit the minimum number of page table levels
>   kvm: arm64: Allow tuning the physical address size for VM
> 
>  Documentation/virtual/kvm/api.txt             |  31 +++
>  arch/arm/include/asm/kvm_arm.h                |   3 +-
>  arch/arm/include/asm/kvm_host.h               |   7 +
>  arch/arm/include/asm/kvm_mmu.h                |  15 +-
>  arch/arm/include/asm/stage2_pgtable.h         |  50 ++--
>  arch/arm64/include/asm/cpufeature.h           |  20 ++
>  arch/arm64/include/asm/kvm_arm.h              | 157 +++++++++---
>  arch/arm64/include/asm/kvm_asm.h              |   2 -
>  arch/arm64/include/asm/kvm_host.h             |  16 +-
>  arch/arm64/include/asm/kvm_hyp.h              |  10 +
>  arch/arm64/include/asm/kvm_mmu.h              |  42 +++-
>  arch/arm64/include/asm/stage2_pgtable-nopmd.h |  42 ----
>  arch/arm64/include/asm/stage2_pgtable-nopud.h |  39 ---
>  arch/arm64/include/asm/stage2_pgtable.h       | 236 +++++++++++++-----
>  arch/arm64/kvm/hyp/Makefile                   |   1 -
>  arch/arm64/kvm/hyp/s2-setup.c                 |  90 -------
>  arch/arm64/kvm/hyp/switch.c                   |   4 +-
>  arch/arm64/kvm/hyp/tlb.c                      |   4 +-
>  arch/arm64/kvm/reset.c                        | 103 ++++++++
>  include/linux/irqchip/arm-gic-v3.h            |   5 +
>  include/uapi/linux/kvm.h                      |  10 +
>  virt/kvm/arm/arm.c                            |   9 +-
>  virt/kvm/arm/mmu.c                            | 120 ++++-----
>  virt/kvm/arm/vgic/vgic-its.c                  |  36 +--
>  virt/kvm/arm/vgic/vgic-kvm-device.c           |   2 +-
>  virt/kvm/arm/vgic/vgic-mmio-v3.c              |   2 -
>  26 files changed, 648 insertions(+), 408 deletions(-)
>  delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
>  delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h
>  delete mode 100644 arch/arm64/kvm/hyp/s2-setup.c
> 
> kvmtool changes:
> 
> Suzuki K Poulose (4):
>   kvmtool: Allow backends to run checks on the KVM device fd
>   kvmtool: arm64: Add support for guest physical address size
>   kvmtool: arm64: Switch memory layout
>   kvmtool: arm: Add support for creating VM with PA size
> 
>  arm/aarch32/include/kvm/kvm-arch.h        |  6 ++--
>  arm/aarch64/include/kvm/kvm-arch.h        | 15 ++++++++--
>  arm/aarch64/include/kvm/kvm-config-arch.h |  5 +++-
>  arm/include/arm-common/kvm-arch.h         | 17 ++++++++----
>  arm/include/arm-common/kvm-config-arch.h  |  1 +
>  arm/kvm.c                                 | 34 ++++++++++++++++++++++-
>  include/kvm/kvm.h                         |  4 +++
>  kvm.c                                     |  2 ++
>  8 files changed, 71 insertions(+), 13 deletions(-)
> 

Feel free to add
Tested-by: Eric Auger <eric.auger@redhat.com>

I tested this series with QEMU, using cold plugged 4GB PC-DIMM at 2TB on
a Gigabyte machine. The VM is created with 43 IPA bits. I ran memtester
on guest at 2TB using "memtester -p 20000000000 1G 1" and it succeeds.

Thanks

Eric
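
For anyone wanting to reproduce a setup like the one above, here is a hedged sketch of how a VMM might pick the value to pass in bits [7-0] of the KVM_CREATE_VM type, as described in the cover letter. It does not call into KVM: host_limit stands for whatever the KVM_CAP_ARM_VM_IPA_SIZE check returned, and the macro and function names below are hypothetical, not the uapi names.

```c
#include <assert.h>

#define VM_TYPE_IPA_SIZE_MASK	0xffUL	/* hypothetical: bits [7-0] of the type */

/*
 * Returns the type value to hand to KVM_CREATE_VM, or -1 if the
 * requested size exceeds what the host advertised. 0 keeps the
 * default (40bit) behaviour.
 */
static inline long vm_type_for_ipa(unsigned int ipa_bits,
				   unsigned int host_limit)
{
	if (ipa_bits == 0)
		return 0;
	if (ipa_bits > host_limit)
		return -1;
	return (long)(ipa_bits & VM_TYPE_IPA_SIZE_MASK);
}

/* Does guest memory ending at 'end' (exclusive) fit in ipa_bits? */
static inline int ipa_covers(unsigned int ipa_bits, unsigned long long end)
{
	assert(ipa_bits < 64);
	return end <= (1ULL << ipa_bits);
}
```

A 4GB DIMM placed at 2TB ends at 0x20100000000, which does not fit in 41 bits but does fit in the 43 IPA bits used in the test above.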