[v3,00/20] arm64: Dynamic & 52bit IPA support

Message ID	1530270944-11351-1-git-send-email-suzuki.poulose@arm.com
Headers	show Return-Path: <linux-arm-kernel-bounces+incoming-imx=patchwork.ozlabs.org@lists.infradead.org> From: Suzuki K Poulose <suzuki.poulose@arm.com> To: linux-arm-kernel@lists.infradead.org Subject: [PATCH v3 00/20] arm64: Dynamic & 52bit IPA support Date: Fri, 29 Jun 2018 12:15:20 +0100 Message-Id: <1530270944-11351-1-git-send-email-suzuki.poulose@arm.com> MIME-Version: 1.0 summary: Content analysis details: (-5.0 points) pts rule name description ---- ---------------------- -------------------------------------------------- -5.0 RCVD_IN_DNSWL_HI RBL: Sender listed at http://www.dnswl.org/, high trust [217.140.101.70 listed in list.dnswl.org] -0.0 SPF_PASS SPF: sender matches SPF record Precedence: list Cc: cdall@kernel.org, kvm@vger.kernel.org, Suzuki K Poulose <suzuki.poulose@arm.com>, marc.zyngier@arm.com, catalin.marinas@arm.com, punit.agrawal@arm.com, will.deacon@arm.com, linux-kernel@vger.kernel.org, qemu-devel@nongnu.org, eric.auger@redhat.com, julien.grall@arm.com, james.morse@arm.com, kvmarm@lists.cs.columbia.edu Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+incoming-imx=patchwork.ozlabs.org@lists.infradead.org
Series	arm64: Dynamic & 52bit IPA support \| expand [v3,00/20] arm64: Dynamic & 52bit IPA support [v3,15/20] kvm: arm/arm64: Allow tuning the physical address size for VM

Suzuki K Poulose June 29, 2018, 11:15 a.m. UTC

The physical address space size for a VM (IPA size) on arm/arm64 is
limited to a static limit of 40bits. This series adds support for
using an IPA size specific to a VM, allowing to use a limit supported
by the host (based on the host kernel configuration and CPU support).
The default and the minimum size is fixed to 40bits. We also add
support for handling 52bit IPA addresses added by Arm v8.2 extensions.

As mentioned above, the supported IPA size on a host could be different
from the system's PARange indicated by the CPUs (e.g, kernel limit
on the PA size). So we expose the limit via a new system ioctl request
 - KVM_ARM_GET_MAX_VM_PHYS_SHIFT - on arm/arm64. This can then be
passed on to the KVM_CREATE_VM ioctl, encoded in the "type" field.
Bits [7-0] of the type are reserved for the IPA size. This approach
allows simpler management of the stage2 page table and guest memory
slots.

The arm64 page table level helpers are defined based on the page
table levels used by the host VA. So, the accessors may not work
if the guest uses more number of levels in stage2 than the stage1
of the host.  The previous versions (v1 & v2) of this series refactored
the stage1 page table accessors to reuse the low-level accessors for an
independent stage2 table. However, due to the level folding in the
generic code, the types are redefined as well. i.e, if the PUD is
folded, the pud_t could be defined as :

 typedef struct { pgd_t pgd; } pud_t;

similarly for pmd_t.  So, without stage1 independent page table entry
types for stage2, we could be dealing with a different type for level
 0-2 entries. This is practically fine on arm/arm64 as the entries
have similar format and size and we always use the appropriate
accessors to get the raw value (i.e, pud_val/pmd_val etc). But not
ideal for a solution upstream. So, this version caps the stage2 page
table levels to that of the stage1. This has the following impact on
the IPA support for various pagesize/host-va combinations :


x-----------------------------------------------------x
| host\ipa    | 40bit | 42bit | 44bit | 48bit | 52bit |
-------------------------------------------------------
| 39bit-4K    |  y    |   y   |  n    |   n   |  n/a  |
-------------------------------------------------------
| 48bit-4K    |  y    |   y   |  y    |   y   |  n/a  |
-------------------------------------------------------
| 36bit-16K   |  y    |   n   |  n    |   n   |  n/a  |
-------------------------------------------------------
| 47bit-16K   |  y    |   y   |  y    |   y   |  n/a  |
-------------------------------------------------------
| 48bit-4K    |  y    |   y   |  y    |   y   |  n/a  |
-------------------------------------------------------
| 42bit-64K   |  y    |   y   |  y    |   n   |  n    |
-------------------------------------------------------
| 48bit-64K   |  y    |   y   |  y    |   y   |  y    |
x-----------------------------------------------------x

Or the following list shows what cannot be supported :

 39bit-4K host supporting IPA > 43bit (upto 48bit)
 36bit-16K host for IPA > 40bit (upto 48bit)
 42bit-64K host for IPA > 46bit (upto 52bit)

which is not really bad. We can pursue the independent stage2
page table support and lift the restriction once we get there.
Given there is a proposal for new generic page table walker [0],
it would make sense to make our efforts in sync with it to avoid
diverting from a common API.

52bit support is added for VGIC (including ITS emulation) and handling
of PAR, HPFAR registers.

The series applies on 4.18-rc2. A tree is available here:

	 git://linux-arm.org/linux-skp.git ipa52/v3

Tested with
  - Modified kvmtool, which can only be used for (patches included in
    the series for reference / testing):
    * with virtio-pci upto 44bit PA (Due to 4K page size for virtio-pci
      legacy implemented by kvmtool)
    * Upto 48bit PA with virtio-mmio, due to 32bit PFN limitation.
  - Hacked Qemu (boot loader support for highmem, phys-shift support)
    * with virtio-pci GIC-v3 ITS & MSI upto 52bit on Foundation model.
    Also see [1] for Qemu support.

[0] https://lkml.org/lkml/2018/4/24/777
[1] https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg05759.html

Changes since V2:
 - Drop "refactoring of host page table helpers" and restrict the IPA size
   to make sure stage2 doesn't use more page table levels than that of the host.
 - Load VTCR for TLB operations on behalf of the VM (Pointed-by: James Morse)
 - Split a couple of patches to make them easier to review.
 - Fall back to normal (non-concatenated) entry level page table support if
   possible.
 - Bump the IOCTL number

Changes since V1:
 - Change the userspace API for configuring VM to encode the IPA
   size in the VM type.  (suggested by Christoffer)
 - Expose the IPA limit on the host via ioctl on /dev/kvm
 - Handle 52bit addresses in PAR & HPFAR
 - Drop patch changing the life time of stage2 PGD
 - Rename macros for 48-to-52 bit conversion for GIC ITS BASER.
   (suggested by Christoffer)
 - Split virtio PFN check patches and address comments.

Kristina Martsenko (1):
  vgic: Add support for 52bit guest physical address

Suzuki K Poulose (19):
  virtio: mmio-v1: Validate queue PFN
  virtio: pci-legacy: Validate queue pfn
  arm64: Add a helper for PARange to physical shift conversion
  kvm: arm64: Clean up VTCR_EL2 initialisation
  kvm: arm/arm64: Fix stage2_flush_memslot for 4 level page table
  kvm: arm/arm64: Remove spurious WARN_ON
  kvm: arm/arm64: Prepare for VM specific stage2 translations
  kvm: arm/arm64: Abstract stage2 pgd table allocation
  kvm: arm64: Make stage2 page table layout dynamic
  kvm: arm64: Dynamic configuration of VTTBR mask
  kvm: arm64: Helper for computing VTCR_EL2.SL0
  kvm: arm64: Add helper for loading the stage2 setting for a VM
  kvm: arm64: Configure VTCR per VM
  kvm: arm/arm64: Expose supported physical address limit for VM
  kvm: arm/arm64: Allow tuning the physical address size for VM
  kvm: arm64: Switch to per VM IPA limit
  kvm: arm64: Add support for handling 52bit IPA
  kvm: arm64: Allow IPA size supported by the system
  kvm: arm64: Fall back to normal stage2 entry level

 Documentation/virtual/kvm/api.txt             |  15 ++
 arch/arm/include/asm/kvm_arm.h                |   3 +-
 arch/arm/include/asm/kvm_mmu.h                |  28 +++-
 arch/arm/include/asm/stage2_pgtable.h         |  42 ++---
 arch/arm64/include/asm/cpufeature.h           |  13 ++
 arch/arm64/include/asm/kvm_arm.h              | 137 ++++++++++++++---
 arch/arm64/include/asm/kvm_asm.h              |   2 +-
 arch/arm64/include/asm/kvm_host.h             |  19 ++-
 arch/arm64/include/asm/kvm_hyp.h              |  16 ++
 arch/arm64/include/asm/kvm_mmu.h              |  92 ++++++++++-
 arch/arm64/include/asm/stage2_pgtable-nopmd.h |  42 -----
 arch/arm64/include/asm/stage2_pgtable-nopud.h |  39 -----
 arch/arm64/include/asm/stage2_pgtable.h       | 213 +++++++++++++++++++-------
 arch/arm64/kvm/guest.c                        |  42 +++++
 arch/arm64/kvm/hyp/s2-setup.c                 |  37 +----
 arch/arm64/kvm/hyp/switch.c                   |   4 +-
 arch/arm64/kvm/hyp/tlb.c                      |   4 +-
 drivers/virtio/virtio_mmio.c                  |  18 ++-
 drivers/virtio/virtio_pci_legacy.c            |  12 +-
 include/linux/irqchip/arm-gic-v3.h            |   5 +
 include/uapi/linux/kvm.h                      |  16 ++
 virt/kvm/arm/arm.c                            |  32 +++-
 virt/kvm/arm/mmu.c                            | 124 ++++++++-------
 virt/kvm/arm/vgic/vgic-its.c                  |  36 ++---
 virt/kvm/arm/vgic/vgic-kvm-device.c           |   2 +-
 virt/kvm/arm/vgic/vgic-mmio-v3.c              |   2 -
 26 files changed, 663 insertions(+), 332 deletions(-)
 delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
 delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h


kvmtool patches :

Suzuki K Poulose (4):
  kvmtool: Allow backends to run checks on the KVM device fd
  kvmtool: arm64: Add support for guest physical address size
  kvmtool: arm64: Switch memory layout
  kvmtool: arm: Add support for creating VM with PA size

 arm/aarch32/include/kvm/kvm-arch.h        |  6 ++++--
 arm/aarch64/include/kvm/kvm-arch.h        | 15 ++++++++++++---
 arm/aarch64/include/kvm/kvm-config-arch.h |  5 ++++-
 arm/include/arm-common/kvm-arch.h         | 17 +++++++++++------
 arm/include/arm-common/kvm-config-arch.h  |  1 +
 arm/kvm.c                                 | 24 +++++++++++++++++++++++-
 include/kvm/kvm.h                         |  4 ++++
 kvm.c                                     |  2 ++
 8 files changed, 61 insertions(+), 13 deletions(-)

Eric Auger June 29, 2018, 2:50 p.m. UTC | #1

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> On arm64, ID_AA64MMFR0_EL1.PARange encodes the maximum Physical
> Address range supported by the CPU. Add a helper to decode this
> to actual physical shift. If we hit an unallocated value, return
> the maximum range supported by the kernel.
> This is will be used by the KVM to set the VTCR_EL2.T0SZ, as it
s/is// and s/the KVM/KVM
> is about to move its place. Having this helper keeps the code
> movement cleaner.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: James Morse <james.morse@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since V2:
>  - Split the patch
>  - Limit the physical shift only for values unrecognized.
> ---
>  arch/arm64/include/asm/cpufeature.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index 1717ba1..855cf0e 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -530,6 +530,19 @@ void arm64_set_ssbd_mitigation(bool state);
>  static inline void arm64_set_ssbd_mitigation(bool state) {}
>  #endif
>  
> +static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
> +{
> +	switch (parange) {
> +	case 0: return 32;
> +	case 1: return 36;
> +	case 2: return 40;
> +	case 3: return 42;
> +	case 4: return 44;
> +	case 5: return 48;
> +	case 6: return 52;
> +	default: return CONFIG_ARM64_PA_BITS;
> +	}
> +}
>  #endif /* __ASSEMBLY__ */
>  
>  #endif
>

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

Eric Auger June 29, 2018, 2:50 p.m. UTC | #2

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> Use the new helper for converting the parange to the physical shift.
> Also, add the missing definitions for the VTCR_EL2 register fields
> and use them instead of hard coding numbers.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since V2
>  - Part 2 of the split from original patch.
>  - Also add missing VTCR field helpers and use them.
> ---
>  arch/arm64/include/asm/kvm_arm.h |  3 +++
>  arch/arm64/kvm/hyp/s2-setup.c    | 30 ++++++------------------------
>  2 files changed, 9 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 6dd285e..3dffd38 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -106,6 +106,7 @@
>  #define VTCR_EL2_RES1		(1 << 31)
>  #define VTCR_EL2_HD		(1 << 22)
>  #define VTCR_EL2_HA		(1 << 21)
> +#define VTCR_EL2_PS_SHIFT	TCR_EL2_PS_SHIFT
>  #define VTCR_EL2_PS_MASK	TCR_EL2_PS_MASK
>  #define VTCR_EL2_TG0_MASK	TCR_TG0_MASK
>  #define VTCR_EL2_TG0_4K		TCR_TG0_4K
> @@ -126,6 +127,8 @@
>  #define VTCR_EL2_VS_8BIT	(0 << VTCR_EL2_VS_SHIFT)
>  #define VTCR_EL2_VS_16BIT	(1 << VTCR_EL2_VS_SHIFT)
>  
> +#define VTCR_EL2_T0SZ(x)	TCR_T0SZ(x)
> +
>  /*
>   * We configure the Stage-2 page tables to always restrict the IPA space to be
>   * 40 bits wide (T0SZ = 24).  Systems with a PARange smaller than 40 bits are
> diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
> index 603e1ee..81094f1 100644
> --- a/arch/arm64/kvm/hyp/s2-setup.c
> +++ b/arch/arm64/kvm/hyp/s2-setup.c
> @@ -19,11 +19,13 @@
>  #include <asm/kvm_arm.h>
>  #include <asm/kvm_asm.h>
>  #include <asm/kvm_hyp.h>
> +#include <asm/cpufeature.h>
>  
>  u32 __hyp_text __init_stage2_translation(void)
>  {
>  	u64 val = VTCR_EL2_FLAGS;
>  	u64 parange;
> +	u32 phys_shift;
>  	u64 tmp;

Not related to this patch but the comment reporting that bit 19 of
VTCR_EL2 is RES0 is not fully valid anymore as it now corresponds to
VMID size in ARM ARM >= 8.1.
>  
>  	/*
> @@ -34,30 +36,10 @@ u32 __hyp_text __init_stage2_translation(void)
>  	parange = read_sysreg(id_aa64mmfr0_el1) & 7;
>  	if (parange > ID_AA64MMFR0_PARANGE_MAX)
>  		parange = ID_AA64MMFR0_PARANGE_MAX;
> -	val |= parange << 16;
> +	val |= parange << VTCR_EL2_PS_SHIFT;
>  
>  	/* Compute the actual PARange... */
> -	switch (parange) {
> -	case 0:
> -		parange = 32;
> -		break;
> -	case 1:
> -		parange = 36;
> -		break;
> -	case 2:
> -		parange = 40;
> -		break;
> -	case 3:
> -		parange = 42;
> -		break;
> -	case 4:
> -		parange = 44;
> -		break;
> -	case 5:
> -	default:
> -		parange = 48;
> -		break;
> -	}
> +	phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
>  
>  	/*
>  	 * ... and clamp it to 40 bits, unless we have some braindead
> @@ -65,7 +47,7 @@ u32 __hyp_text __init_stage2_translation(void)
>  	 * return that value for the rest of the kernel to decide what
>  	 * to do.
>  	 */
> -	val |= 64 - (parange > 40 ? 40 : parange);
> +	val |= VTCR_EL2_T0SZ(phys_shift > 40 ? 40 : phys_shift);
>  
>  	/*
>  	 * Check the availability of Hardware Access Flag / Dirty Bit
> @@ -86,5 +68,5 @@ u32 __hyp_text __init_stage2_translation(void)
>  
>  	write_sysreg(val, vtcr_el2);
>  
> -	return parange;
> +	return phys_shift;
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

>  }
>

Eric Auger June 29, 2018, 2:50 p.m. UTC | #3

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> So far we have only supported 3 level page table with fixed IPA of 40bits.
> Fix stage2_flush_memslot() to accommodate for 4 level tables.
in 06/30 you add the justification for this change I think. worth to put
in here as well?

> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Acked-by: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  virt/kvm/arm/mmu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 1d90d79..061e6b3 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -379,7 +379,8 @@ static void stage2_flush_memslot(struct kvm *kvm,
>  	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>  	do {
>  		next = stage2_pgd_addr_end(addr, end);
> -		stage2_flush_puds(kvm, pgd, addr, next);
> +		if (!stage2_pgd_none(*pgd))
> +			stage2_flush_puds(kvm, pgd, addr, next);
>  	} while (pgd++, addr = next, addr != end);
>  }
>  
> 

Besides
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

Eric Auger June 29, 2018, 2:51 p.m. UTC | #4

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> On a 4-level page table pgd entry can be empty, unlike a 3-level
> page table. Remove the spurious WARN_ON() in stage_get_pud().
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Acked-by: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  virt/kvm/arm/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 061e6b3..308171c 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -976,7 +976,7 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>  	pud_t *pud;
>  
>  	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> -	if (WARN_ON(stage2_pgd_none(*pgd))) {
> +	if (stage2_pgd_none(*pgd)) {
>  		if (!cache)
>  			return NULL;
>  		pud = mmu_memory_cache_alloc(cache);
> 

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

Michael S. Tsirkin June 29, 2018, 5:42 p.m. UTC | #5

On Fri, Jun 29, 2018 at 12:15:21PM +0100, Suzuki K Poulose wrote:
> virtio-mmio with virtio-v1 uses a 32bit PFN for the queue.
> If the queue pfn is too large to fit in 32bits, which
> we could hit on arm64 systems with 52bit physical addresses
> (even with 64K page size), we simply miss out a proper link
> to the other side of the queue.
> 
> Add a check to validate the PFN, rather than silently breaking
> the devices.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Cc: Peter Maydel <peter.maydell@linaro.org>
> Cc: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since v2:
>  - Change errno to -E2BIG
> ---
>  drivers/virtio/virtio_mmio.c | 18 ++++++++++++++++--
>  1 file changed, 16 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> index 67763d3..82cedc8 100644
> --- a/drivers/virtio/virtio_mmio.c
> +++ b/drivers/virtio/virtio_mmio.c
> @@ -397,9 +397,21 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
>  	/* Activate the queue */
>  	writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
>  	if (vm_dev->version == 1) {
> +		u64 q_pfn = virtqueue_get_desc_addr(vq) >> PAGE_SHIFT;
> +
> +		/*
> +		 * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something
> +		 * that doesn't fit in 32bit, fail the setup rather than
> +		 * pretending to be successful.
> +		 */
> +		if (q_pfn >> 32) {
> +			dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");

How about:
	"hypervisor bug: legacy virtio-mmio must not be used with more than 0x%llx Gigabytes of memory",
	0x1ULL << (32 - 30) << PAGE_SHIFT

> +			err = -E2BIG;
> +			goto error_bad_pfn;
> +		}
> +
>  		writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN);
> -		writel(virtqueue_get_desc_addr(vq) >> PAGE_SHIFT,
> -				vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
> +		writel(q_pfn, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
>  	} else {
>  		u64 addr;
>  
> @@ -430,6 +442,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
>  
>  	return vq;
>  
> +error_bad_pfn:
> +	vring_del_virtqueue(vq);
>  error_new_virtqueue:
>  	if (vm_dev->version == 1) {
>  		writel(0, vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
> -- 
> 2.7.4

Michael S. Tsirkin June 29, 2018, 5:42 p.m. UTC | #6

On Fri, Jun 29, 2018 at 12:15:22PM +0100, Suzuki K Poulose wrote:
> Legacy PCI over virtio uses a 32bit PFN for the queue. If the
> queue pfn is too large to fit in 32bits, which we could hit on
> arm64 systems with 52bit physical addresses (even with 64K page
> size), we simply miss out a proper link to the other side of
> the queue.
> 
> Add a check to validate the PFN, rather than silently breaking
> the devices.
> 
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Cc: Peter Maydel <peter.maydell@linaro.org>
> Cc: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since v2:
>  - Change errno to -E2BIG
> ---
>  drivers/virtio/virtio_pci_legacy.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c
> index 2780886..c0d6987a 100644
> --- a/drivers/virtio/virtio_pci_legacy.c
> +++ b/drivers/virtio/virtio_pci_legacy.c
> @@ -122,6 +122,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>  	struct virtqueue *vq;
>  	u16 num;
>  	int err;
> +	u64 q_pfn;
>  
>  	/* Select the queue we're interested in */
>  	iowrite16(index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL);
> @@ -141,9 +142,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>  	if (!vq)
>  		return ERR_PTR(-ENOMEM);
>  
> +	q_pfn = virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT;
> +	if (q_pfn >> 32) {
> +		dev_err(&vp_dev->pci_dev->dev, "virtio-pci queue PFN too large\n");
> +		err = -E2BIG;

Same comment here. Let's make it clear it's host not guest problem.

> +		goto out_del_vq;
> +	}
> +
>  	/* activate the queue */
> -	iowrite32(virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT,
> -		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> +	iowrite32(q_pfn, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
>  
>  	vq->priv = (void __force *)vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY;
>  
> @@ -160,6 +167,7 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev,
>  
>  out_deactivate:
>  	iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
> +out_del_vq:
>  	vring_del_virtqueue(vq);
>  	return ERR_PTR(err);
>  }
> -- 
> 2.7.4

Marc Zyngier July 2, 2018, 9:59 a.m. UTC | #7

On 29/06/18 12:15, Suzuki K Poulose wrote:
> So far we have only supported 3 level page table with fixed IPA of 40bits.
> Fix stage2_flush_memslot() to accommodate for 4 level tables.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Acked-by: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  virt/kvm/arm/mmu.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 1d90d79..061e6b3 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -379,7 +379,8 @@ static void stage2_flush_memslot(struct kvm *kvm,
>  	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>  	do {
>  		next = stage2_pgd_addr_end(addr, end);
> -		stage2_flush_puds(kvm, pgd, addr, next);
> +		if (!stage2_pgd_none(*pgd))
> +			stage2_flush_puds(kvm, pgd, addr, next);
>  	} while (pgd++, addr = next, addr != end);
>  }
>  
> 

Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>

	M.

Marc Zyngier July 2, 2018, 10:01 a.m. UTC | #8

On 29/06/18 12:15, Suzuki K Poulose wrote:
> On a 4-level page table pgd entry can be empty, unlike a 3-level
> page table. Remove the spurious WARN_ON() in stage_get_pud().
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Acked-by: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  virt/kvm/arm/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 061e6b3..308171c 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -976,7 +976,7 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>  	pud_t *pud;
>  
>  	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> -	if (WARN_ON(stage2_pgd_none(*pgd))) {
> +	if (stage2_pgd_none(*pgd)) {
>  		if (!cache)
>  			return NULL;
>  		pud = mmu_memory_cache_alloc(cache);
> 

Acked-by: Marc Zyngier <marc.zyngier@arm.com>

	M.

Marc Zyngier July 2, 2018, 10:12 a.m. UTC | #9

On 29/06/18 12:15, Suzuki K Poulose wrote:
> Right now the stage2 page table for a VM is hard coded, assuming
> an IPA of 40bits. As we are about to add support for per VM IPA,
> prepare the stage2 page table helpers to accept the kvm instance
> to make the right decision for the VM. No functional changes.
> Adds stage2_pgd_size(kvm) to replace S2_PGD_SIZE. Also, moves
> some of the definitions dependent on kvm instance to asm/kvm_mmu.h
> for arm32. In that process drop the _AC() specifier constants
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since V2:
>  - Update commit description abuot the movement to asm/kvm_mmu.h
>    for arm32
>  - Drop _AC() specifiers
> ---
>  arch/arm/include/asm/kvm_arm.h                |   3 +-
>  arch/arm/include/asm/kvm_mmu.h                |  15 +++-
>  arch/arm/include/asm/stage2_pgtable.h         |  42 ++++-----
>  arch/arm64/include/asm/kvm_mmu.h              |   7 +-
>  arch/arm64/include/asm/stage2_pgtable-nopmd.h |  18 ++--
>  arch/arm64/include/asm/stage2_pgtable-nopud.h |  16 ++--
>  arch/arm64/include/asm/stage2_pgtable.h       |  49 ++++++-----
>  virt/kvm/arm/arm.c                            |   2 +-
>  virt/kvm/arm/mmu.c                            | 119 +++++++++++++-------------
>  virt/kvm/arm/vgic/vgic-kvm-device.c           |   2 +-
>  10 files changed, 148 insertions(+), 125 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
> index 3ab8b37..c3f1f9b 100644
> --- a/arch/arm/include/asm/kvm_arm.h
> +++ b/arch/arm/include/asm/kvm_arm.h
> @@ -133,8 +133,7 @@
>   * space.
>   */
>  #define KVM_PHYS_SHIFT	(40)
> -#define KVM_PHYS_SIZE	(_AC(1, ULL) << KVM_PHYS_SHIFT)
> -#define KVM_PHYS_MASK	(KVM_PHYS_SIZE - _AC(1, ULL))
> +
>  #define PTRS_PER_S2_PGD	(_AC(1, ULL) << (KVM_PHYS_SHIFT - 30))
>  
>  /* Virtualization Translation Control Register (VTCR) bits */
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 8553d68..f36eb20 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -36,15 +36,19 @@
>  	})
>  
>  /*
> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels.
> + * kvm_mmu_cache_min_pages() is the number of stage2 page
> + * table translation levels, excluding the top level, for
> + * the given VM. Since we have a 3 level page-table, this
> + * is fixed.

I find this comment quite confusing: number of levels, but excluding the
top one? The original one was just as bad, to be honest.

Can't we just say: "kvm_mmu_cache_min_page() is the number of pages
required to install a stage-2 translation"?

>   */
> -#define KVM_MMU_CACHE_MIN_PAGES	2
> +#define kvm_mmu_cache_min_pages(kvm)	2
>  
>  #ifndef __ASSEMBLY__
>  
>  #include <linux/highmem.h>
>  #include <asm/cacheflush.h>
>  #include <asm/cputype.h>
> +#include <asm/kvm_arm.h>
>  #include <asm/kvm_hyp.h>
>  #include <asm/pgalloc.h>
>  #include <asm/stage2_pgtable.h>
> @@ -52,6 +56,13 @@
>  /* Ensure compatibility with arm64 */
>  #define VA_BITS			32
>  
> +#define kvm_phys_shift(kvm)		KVM_PHYS_SHIFT
> +#define kvm_phys_size(kvm)		(1ULL << kvm_phys_shift(kvm))
> +#define kvm_phys_mask(kvm)		(kvm_phys_size(kvm) - 1ULL)
> +#define kvm_vttbr_baddr_mask(kvm)	VTTBR_BADDR_MASK
> +
> +#define stage2_pgd_size(kvm)		(PTRS_PER_S2_PGD * sizeof(pgd_t))
> +
>  int create_hyp_mappings(void *from, void *to, pgprot_t prot);
>  int create_hyp_io_mappings(phys_addr_t phys_addr, size_t size,
>  			   void __iomem **kaddr,
> diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
> index 460d616..e22ae94 100644
> --- a/arch/arm/include/asm/stage2_pgtable.h
> +++ b/arch/arm/include/asm/stage2_pgtable.h
> @@ -19,43 +19,45 @@
>  #ifndef __ARM_S2_PGTABLE_H_
>  #define __ARM_S2_PGTABLE_H_
>  
> -#define stage2_pgd_none(pgd)			pgd_none(pgd)
> -#define stage2_pgd_clear(pgd)			pgd_clear(pgd)
> -#define stage2_pgd_present(pgd)			pgd_present(pgd)
> -#define stage2_pgd_populate(pgd, pud)		pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(pgd, address)		pud_offset(pgd, address)
> -#define stage2_pud_free(pud)			pud_free(NULL, pud)
> +#define stage2_pgd_none(kvm, pgd)		pgd_none(pgd)
> +#define stage2_pgd_clear(kvm, pgd)		pgd_clear(pgd)
> +#define stage2_pgd_present(kvm, pgd)		pgd_present(pgd)
> +#define stage2_pgd_populate(kvm, pgd, pud)	pgd_populate(NULL, pgd, pud)
> +#define stage2_pud_offset(kvm, pgd, address)	pud_offset(pgd, address)
> +#define stage2_pud_free(kvm, pud)		pud_free(NULL, pud)
>  
> -#define stage2_pud_none(pud)			pud_none(pud)
> -#define stage2_pud_clear(pud)			pud_clear(pud)
> -#define stage2_pud_present(pud)			pud_present(pud)
> -#define stage2_pud_populate(pud, pmd)		pud_populate(NULL, pud, pmd)
> -#define stage2_pmd_offset(pud, address)		pmd_offset(pud, address)
> -#define stage2_pmd_free(pmd)			pmd_free(NULL, pmd)
> +#define stage2_pud_none(kvm, pud)		pud_none(pud)
> +#define stage2_pud_clear(kvm, pud)		pud_clear(pud)
> +#define stage2_pud_present(kvm, pud)		pud_present(pud)
> +#define stage2_pud_populate(kvm, pud, pmd)	pud_populate(NULL, pud, pmd)
> +#define stage2_pmd_offset(kvm, pud, address)	pmd_offset(pud, address)
> +#define stage2_pmd_free(kvm, pmd)		pmd_free(NULL, pmd)
>  
> -#define stage2_pud_huge(pud)			pud_huge(pud)
> +#define stage2_pud_huge(kvm, pud)		pud_huge(pud)
>  
>  /* Open coded p*d_addr_end that can deal with 64bit addresses */
> -static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
>  	phys_addr_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
>  
>  	return (boundary - 1 < end - 1) ? boundary : end;
>  }
>  
> -#define stage2_pud_addr_end(addr, end)		(end)
> +#define stage2_pud_addr_end(kvm, addr, end)	(end)
>  
> -static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
>  	phys_addr_t boundary = (addr + PMD_SIZE) & PMD_MASK;
>  
>  	return (boundary - 1 < end - 1) ? boundary : end;
>  }
>  
> -#define stage2_pgd_index(addr)				pgd_index(addr)
> +#define stage2_pgd_index(kvm, addr)		pgd_index(addr)
>  
> -#define stage2_pte_table_empty(ptep)			kvm_page_empty(ptep)
> -#define stage2_pmd_table_empty(pmdp)			kvm_page_empty(pmdp)
> -#define stage2_pud_table_empty(pudp)			false
> +#define stage2_pte_table_empty(kvm, ptep)	kvm_page_empty(ptep)
> +#define stage2_pmd_table_empty(kvm, pmdp)	kvm_page_empty(pmdp)
> +#define stage2_pud_table_empty(kvm, pudp)	false
>  
>  #endif	/* __ARM_S2_PGTABLE_H_ */
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index fb9a712..5da8f52 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -141,8 +141,11 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
>   * We currently only support a 40bit IPA.
>   */
>  #define KVM_PHYS_SHIFT	(40)
> -#define KVM_PHYS_SIZE	(1UL << KVM_PHYS_SHIFT)
> -#define KVM_PHYS_MASK	(KVM_PHYS_SIZE - 1UL)
> +
> +#define kvm_phys_shift(kvm)		KVM_PHYS_SHIFT
> +#define kvm_phys_size(kvm)		(_AC(1, ULL) << kvm_phys_shift(kvm))
> +#define kvm_phys_mask(kvm)		(kvm_phys_size(kvm) - _AC(1, ULL))
> +#define kvm_vttbr_baddr_mask(kvm)	VTTBR_BADDR_MASK
>  
>  #include <asm/stage2_pgtable.h>
>  
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> index 2656a0f..0280ded 100644
> --- a/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> +++ b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> @@ -26,17 +26,17 @@
>  #define S2_PMD_SIZE		(1UL << S2_PMD_SHIFT)
>  #define S2_PMD_MASK		(~(S2_PMD_SIZE-1))
>  
> -#define stage2_pud_none(pud)			(0)
> -#define stage2_pud_present(pud)			(1)
> -#define stage2_pud_clear(pud)			do { } while (0)
> -#define stage2_pud_populate(pud, pmd)		do { } while (0)
> -#define stage2_pmd_offset(pud, address)		((pmd_t *)(pud))
> +#define stage2_pud_none(kvm, pud)		(0)
> +#define stage2_pud_present(kvm, pud)		(1)
> +#define stage2_pud_clear(kvm, pud)		do { } while (0)
> +#define stage2_pud_populate(kvm, pud, pmd)	do { } while (0)
> +#define stage2_pmd_offset(kvm, pud, address)	((pmd_t *)(pud))
>  
> -#define stage2_pmd_free(pmd)			do { } while (0)
> +#define stage2_pmd_free(kvm, pmd)		do { } while (0)
>  
> -#define stage2_pmd_addr_end(addr, end)		(end)
> +#define stage2_pmd_addr_end(kvm, addr, end)	(end)
>  
> -#define stage2_pud_huge(pud)			(0)
> -#define stage2_pmd_table_empty(pmdp)		(0)
> +#define stage2_pud_huge(kvm, pud)		(0)
> +#define stage2_pmd_table_empty(kvm, pmdp)	(0)
>  
>  #endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> index 5ee87b5..cd6304e 100644
> --- a/arch/arm64/include/asm/stage2_pgtable-nopud.h
> +++ b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> @@ -24,16 +24,16 @@
>  #define S2_PUD_SIZE		(_AC(1, UL) << S2_PUD_SHIFT)
>  #define S2_PUD_MASK		(~(S2_PUD_SIZE-1))
>  
> -#define stage2_pgd_none(pgd)			(0)
> -#define stage2_pgd_present(pgd)			(1)
> -#define stage2_pgd_clear(pgd)			do { } while (0)
> -#define stage2_pgd_populate(pgd, pud)	do { } while (0)
> +#define stage2_pgd_none(kvm, pgd)		(0)
> +#define stage2_pgd_present(kvm, pgd)		(1)
> +#define stage2_pgd_clear(kvm, pgd)		do { } while (0)
> +#define stage2_pgd_populate(kvm, pgd, pud)	do { } while (0)
>  
> -#define stage2_pud_offset(pgd, address)		((pud_t *)(pgd))
> +#define stage2_pud_offset(kvm, pgd, address)	((pud_t *)(pgd))
>  
> -#define stage2_pud_free(x)			do { } while (0)
> +#define stage2_pud_free(kvm, x)			do { } while (0)
>  
> -#define stage2_pud_addr_end(addr, end)		(end)
> -#define stage2_pud_table_empty(pmdp)		(0)
> +#define stage2_pud_addr_end(kvm, addr, end)	(end)
> +#define stage2_pud_table_empty(kvm, pmdp)	(0)
>  
>  #endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index 8b68099..057a405 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -65,10 +65,10 @@
>  #define PTRS_PER_S2_PGD			(1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
>  
>  /*
> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
> + * kvm_mmmu_cache_min_pages is the number of stage2 page table translation
>   * levels in addition to the PGD.
>   */
> -#define KVM_MMU_CACHE_MIN_PAGES		(STAGE2_PGTABLE_LEVELS - 1)
> +#define kvm_mmu_cache_min_pages(kvm)	(STAGE2_PGTABLE_LEVELS - 1)

Same comment as for the 32bit case.

>  
>  
>  #if STAGE2_PGTABLE_LEVELS > 3
> @@ -77,16 +77,17 @@
>  #define S2_PUD_SIZE			(_AC(1, UL) << S2_PUD_SHIFT)
>  #define S2_PUD_MASK			(~(S2_PUD_SIZE - 1))
>  
> -#define stage2_pgd_none(pgd)				pgd_none(pgd)
> -#define stage2_pgd_clear(pgd)				pgd_clear(pgd)
> -#define stage2_pgd_present(pgd)				pgd_present(pgd)
> -#define stage2_pgd_populate(pgd, pud)			pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(pgd, address)			pud_offset(pgd, address)
> -#define stage2_pud_free(pud)				pud_free(NULL, pud)
> +#define stage2_pgd_none(kvm, pgd)		pgd_none(pgd)
> +#define stage2_pgd_clear(kvm, pgd)		pgd_clear(pgd)
> +#define stage2_pgd_present(kvm, pgd)		pgd_present(pgd)
> +#define stage2_pgd_populate(kvm, pgd, pud)	pgd_populate(NULL, pgd, pud)
> +#define stage2_pud_offset(kvm, pgd, address)	pud_offset(pgd, address)
> +#define stage2_pud_free(kvm, pud)		pud_free(NULL, pud)
>  
> -#define stage2_pud_table_empty(pudp)			kvm_page_empty(pudp)
> +#define stage2_pud_table_empty(kvm, pudp)	kvm_page_empty(pudp)
>  
> -static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pud_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
>  	phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;
>  
> @@ -102,17 +103,18 @@ static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
>  #define S2_PMD_SIZE			(_AC(1, UL) << S2_PMD_SHIFT)
>  #define S2_PMD_MASK			(~(S2_PMD_SIZE - 1))
>  
> -#define stage2_pud_none(pud)				pud_none(pud)
> -#define stage2_pud_clear(pud)				pud_clear(pud)
> -#define stage2_pud_present(pud)				pud_present(pud)
> -#define stage2_pud_populate(pud, pmd)			pud_populate(NULL, pud, pmd)
> -#define stage2_pmd_offset(pud, address)			pmd_offset(pud, address)
> -#define stage2_pmd_free(pmd)				pmd_free(NULL, pmd)
> +#define stage2_pud_none(kvm, pud)		pud_none(pud)
> +#define stage2_pud_clear(kvm, pud)		pud_clear(pud)
> +#define stage2_pud_present(kvm, pud)		pud_present(pud)
> +#define stage2_pud_populate(kvm, pud, pmd)	pud_populate(NULL, pud, pmd)
> +#define stage2_pmd_offset(kvm, pud, address)	pmd_offset(pud, address)
> +#define stage2_pmd_free(kvm, pmd)		pmd_free(NULL, pmd)
>  
> -#define stage2_pud_huge(pud)				pud_huge(pud)
> -#define stage2_pmd_table_empty(pmdp)			kvm_page_empty(pmdp)
> +#define stage2_pud_huge(kvm, pud)		pud_huge(pud)
> +#define stage2_pmd_table_empty(kvm, pmdp)	kvm_page_empty(pmdp)
>  
> -static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
>  	phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;
>  
> @@ -121,7 +123,7 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
>  
>  #endif		/* STAGE2_PGTABLE_LEVELS > 2 */
>  
> -#define stage2_pte_table_empty(ptep)			kvm_page_empty(ptep)
> +#define stage2_pte_table_empty(kvm, ptep)	kvm_page_empty(ptep)
>  
>  #if STAGE2_PGTABLE_LEVELS == 2
>  #include <asm/stage2_pgtable-nopmd.h>
> @@ -129,10 +131,13 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
>  #include <asm/stage2_pgtable-nopud.h>
>  #endif
>  
> +#define stage2_pgd_size(kvm)	(PTRS_PER_S2_PGD * sizeof(pgd_t))
>  
> -#define stage2_pgd_index(addr)				(((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
> +#define stage2_pgd_index(kvm, addr) \
> +	(((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
>  
> -static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
>  	phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
>  
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 04e554c..d2637bb 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -538,7 +538,7 @@ static void update_vttbr(struct kvm *kvm)
>  
>  	/* update vttbr to be used with the new vmid */
>  	pgd_phys = virt_to_phys(kvm->arch.pgd);
> -	BUG_ON(pgd_phys & ~VTTBR_BADDR_MASK);
> +	BUG_ON(pgd_phys & ~kvm_vttbr_baddr_mask(kvm));
>  	vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK(kvm_vmid_bits);
>  	kvm->arch.vttbr = kvm_phys_to_vttbr(pgd_phys) | vmid;
>  
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 308171c..82dd571 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -45,7 +45,6 @@ static phys_addr_t hyp_idmap_vector;
>  
>  static unsigned long io_map_base;
>  
> -#define S2_PGD_SIZE	(PTRS_PER_S2_PGD * sizeof(pgd_t))
>  #define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))
>  
>  #define KVM_S2PTE_FLAG_IS_IOMAP		(1UL << 0)
> @@ -150,20 +149,20 @@ static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
>  
>  static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr)
>  {
> -	pud_t *pud_table __maybe_unused = stage2_pud_offset(pgd, 0UL);
> -	stage2_pgd_clear(pgd);
> +	pud_t *pud_table __maybe_unused = stage2_pud_offset(kvm, pgd, 0UL);
> +	stage2_pgd_clear(kvm, pgd);
>  	kvm_tlb_flush_vmid_ipa(kvm, addr);
> -	stage2_pud_free(pud_table);
> +	stage2_pud_free(kvm, pud_table);
>  	put_page(virt_to_page(pgd));
>  }
>  
>  static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
>  {
> -	pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);
> -	VM_BUG_ON(stage2_pud_huge(*pud));
> -	stage2_pud_clear(pud);
> +	pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(kvm, pud, 0);
> +	VM_BUG_ON(stage2_pud_huge(kvm, *pud));
> +	stage2_pud_clear(kvm, pud);
>  	kvm_tlb_flush_vmid_ipa(kvm, addr);
> -	stage2_pmd_free(pmd_table);
> +	stage2_pmd_free(kvm, pmd_table);
>  	put_page(virt_to_page(pud));
>  }
>  
> @@ -219,7 +218,7 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd,
>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  
> -	if (stage2_pte_table_empty(start_pte))
> +	if (stage2_pte_table_empty(kvm, start_pte))
>  		clear_stage2_pmd_entry(kvm, pmd, start_addr);
>  }
>  
> @@ -229,9 +228,9 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
>  	phys_addr_t next, start_addr = addr;
>  	pmd_t *pmd, *start_pmd;
>  
> -	start_pmd = pmd = stage2_pmd_offset(pud, addr);
> +	start_pmd = pmd = stage2_pmd_offset(kvm, pud, addr);
>  	do {
> -		next = stage2_pmd_addr_end(addr, end);
> +		next = stage2_pmd_addr_end(kvm, addr, end);
>  		if (!pmd_none(*pmd)) {
>  			if (pmd_thp_or_huge(*pmd)) {
>  				pmd_t old_pmd = *pmd;
> @@ -248,7 +247,7 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
>  		}
>  	} while (pmd++, addr = next, addr != end);
>  
> -	if (stage2_pmd_table_empty(start_pmd))
> +	if (stage2_pmd_table_empty(kvm, start_pmd))
>  		clear_stage2_pud_entry(kvm, pud, start_addr);
>  }
>  
> @@ -258,14 +257,14 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
>  	phys_addr_t next, start_addr = addr;
>  	pud_t *pud, *start_pud;
>  
> -	start_pud = pud = stage2_pud_offset(pgd, addr);
> +	start_pud = pud = stage2_pud_offset(kvm, pgd, addr);
>  	do {
> -		next = stage2_pud_addr_end(addr, end);
> -		if (!stage2_pud_none(*pud)) {
> -			if (stage2_pud_huge(*pud)) {
> +		next = stage2_pud_addr_end(kvm, addr, end);
> +		if (!stage2_pud_none(kvm, *pud)) {
> +			if (stage2_pud_huge(kvm, *pud)) {
>  				pud_t old_pud = *pud;
>  
> -				stage2_pud_clear(pud);
> +				stage2_pud_clear(kvm, pud);
>  				kvm_tlb_flush_vmid_ipa(kvm, addr);
>  				kvm_flush_dcache_pud(old_pud);
>  				put_page(virt_to_page(pud));
> @@ -275,7 +274,7 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
>  		}
>  	} while (pud++, addr = next, addr != end);
>  
> -	if (stage2_pud_table_empty(start_pud))
> +	if (stage2_pud_table_empty(kvm, start_pud))
>  		clear_stage2_pgd_entry(kvm, pgd, start_addr);
>  }
>  
> @@ -299,7 +298,7 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>  	assert_spin_locked(&kvm->mmu_lock);
>  	WARN_ON(size & ~PAGE_MASK);
>  
> -	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> +	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
>  	do {
>  		/*
>  		 * Make sure the page table is still active, as another thread
> @@ -308,8 +307,8 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>  		 */
>  		if (!READ_ONCE(kvm->arch.pgd))
>  			break;
> -		next = stage2_pgd_addr_end(addr, end);
> -		if (!stage2_pgd_none(*pgd))
> +		next = stage2_pgd_addr_end(kvm, addr, end);
> +		if (!stage2_pgd_none(kvm, *pgd))
>  			unmap_stage2_puds(kvm, pgd, addr, next);
>  		/*
>  		 * If the range is too large, release the kvm->mmu_lock
> @@ -338,9 +337,9 @@ static void stage2_flush_pmds(struct kvm *kvm, pud_t *pud,
>  	pmd_t *pmd;
>  	phys_addr_t next;
>  
> -	pmd = stage2_pmd_offset(pud, addr);
> +	pmd = stage2_pmd_offset(kvm, pud, addr);
>  	do {
> -		next = stage2_pmd_addr_end(addr, end);
> +		next = stage2_pmd_addr_end(kvm, addr, end);
>  		if (!pmd_none(*pmd)) {
>  			if (pmd_thp_or_huge(*pmd))
>  				kvm_flush_dcache_pmd(*pmd);
> @@ -356,11 +355,11 @@ static void stage2_flush_puds(struct kvm *kvm, pgd_t *pgd,
>  	pud_t *pud;
>  	phys_addr_t next;
>  
> -	pud = stage2_pud_offset(pgd, addr);
> +	pud = stage2_pud_offset(kvm, pgd, addr);
>  	do {
> -		next = stage2_pud_addr_end(addr, end);
> -		if (!stage2_pud_none(*pud)) {
> -			if (stage2_pud_huge(*pud))
> +		next = stage2_pud_addr_end(kvm, addr, end);
> +		if (!stage2_pud_none(kvm, *pud)) {
> +			if (stage2_pud_huge(kvm, *pud))
>  				kvm_flush_dcache_pud(*pud);
>  			else
>  				stage2_flush_pmds(kvm, pud, addr, next);
> @@ -376,10 +375,10 @@ static void stage2_flush_memslot(struct kvm *kvm,
>  	phys_addr_t next;
>  	pgd_t *pgd;
>  
> -	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> +	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
>  	do {
> -		next = stage2_pgd_addr_end(addr, end);
> -		if (!stage2_pgd_none(*pgd))
> +		next = stage2_pgd_addr_end(kvm, addr, end);
> +		if (!stage2_pgd_none(kvm, *pgd))
>  			stage2_flush_puds(kvm, pgd, addr, next);
>  	} while (pgd++, addr = next, addr != end);
>  }
> @@ -869,7 +868,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
>  	}
>  
>  	/* Allocate the HW PGD, making sure that each page gets its own refcount */
> -	pgd = alloc_pages_exact(S2_PGD_SIZE, GFP_KERNEL | __GFP_ZERO);
> +	pgd = alloc_pages_exact(stage2_pgd_size(kvm), GFP_KERNEL | __GFP_ZERO);
>  	if (!pgd)
>  		return -ENOMEM;
>  
> @@ -958,7 +957,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>  
>  	spin_lock(&kvm->mmu_lock);
>  	if (kvm->arch.pgd) {
> -		unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> +		unmap_stage2_range(kvm, 0, kvm_phys_size(kvm));
>  		pgd = READ_ONCE(kvm->arch.pgd);
>  		kvm->arch.pgd = NULL;
>  	}
> @@ -966,7 +965,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>  
>  	/* Free the HW pgd, one page at a time */
>  	if (pgd)
> -		free_pages_exact(pgd, S2_PGD_SIZE);
> +		free_pages_exact(pgd, stage2_pgd_size(kvm));
>  }
>  
>  static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> @@ -975,16 +974,16 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>  	pgd_t *pgd;
>  	pud_t *pud;
>  
> -	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> -	if (stage2_pgd_none(*pgd)) {
> +	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
> +	if (stage2_pgd_none(kvm, *pgd)) {
>  		if (!cache)
>  			return NULL;
>  		pud = mmu_memory_cache_alloc(cache);
> -		stage2_pgd_populate(pgd, pud);
> +		stage2_pgd_populate(kvm, pgd, pud);
>  		get_page(virt_to_page(pgd));
>  	}
>  
> -	return stage2_pud_offset(pgd, addr);
> +	return stage2_pud_offset(kvm, pgd, addr);
>  }
>  
>  static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> @@ -997,15 +996,15 @@ static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>  	if (!pud)
>  		return NULL;
>  
> -	if (stage2_pud_none(*pud)) {
> +	if (stage2_pud_none(kvm, *pud)) {
>  		if (!cache)
>  			return NULL;
>  		pmd = mmu_memory_cache_alloc(cache);
> -		stage2_pud_populate(pud, pmd);
> +		stage2_pud_populate(kvm, pud, pmd);
>  		get_page(virt_to_page(pud));
>  	}
>  
> -	return stage2_pmd_offset(pud, addr);
> +	return stage2_pmd_offset(kvm, pud, addr);
>  }
>  
>  static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
> @@ -1159,8 +1158,9 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  		if (writable)
>  			pte = kvm_s2pte_mkwrite(pte);
>  
> -		ret = mmu_topup_memory_cache(&cache, KVM_MMU_CACHE_MIN_PAGES,
> -						KVM_NR_MEM_OBJS);
> +		ret = mmu_topup_memory_cache(&cache,
> +					     kvm_mmu_cache_min_pages(kvm),
> +					     KVM_NR_MEM_OBJS);
>  		if (ret)
>  			goto out;
>  		spin_lock(&kvm->mmu_lock);
> @@ -1248,19 +1248,21 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
>  
>  /**
>   * stage2_wp_pmds - write protect PUD range
> + * kvm:		kvm instance for the VM
>   * @pud:	pointer to pud entry
>   * @addr:	range start address
>   * @end:	range end address
>   */
> -static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
> +static void stage2_wp_pmds(struct kvm *kvm, pud_t *pud,
> +			   phys_addr_t addr, phys_addr_t end)
>  {
>  	pmd_t *pmd;
>  	phys_addr_t next;
>  
> -	pmd = stage2_pmd_offset(pud, addr);
> +	pmd = stage2_pmd_offset(kvm, pud, addr);
>  
>  	do {
> -		next = stage2_pmd_addr_end(addr, end);
> +		next = stage2_pmd_addr_end(kvm, addr, end);
>  		if (!pmd_none(*pmd)) {
>  			if (pmd_thp_or_huge(*pmd)) {
>  				if (!kvm_s2pmd_readonly(pmd))
> @@ -1280,18 +1282,19 @@ static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
>    *
>    * Process PUD entries, for a huge PUD we cause a panic.
>    */
> -static void  stage2_wp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
> +static void  stage2_wp_puds(struct kvm *kvm, pgd_t *pgd,
> +			    phys_addr_t addr, phys_addr_t end)
>  {
>  	pud_t *pud;
>  	phys_addr_t next;
>  
> -	pud = stage2_pud_offset(pgd, addr);
> +	pud = stage2_pud_offset(kvm, pgd, addr);
>  	do {
> -		next = stage2_pud_addr_end(addr, end);
> -		if (!stage2_pud_none(*pud)) {
> +		next = stage2_pud_addr_end(kvm, addr, end);
> +		if (!stage2_pud_none(kvm, *pud)) {
>  			/* TODO:PUD not supported, revisit later if supported */
> -			BUG_ON(stage2_pud_huge(*pud));
> -			stage2_wp_pmds(pud, addr, next);
> +			BUG_ON(stage2_pud_huge(kvm, *pud));
> +			stage2_wp_pmds(kvm, pud, addr, next);
>  		}
>  	} while (pud++, addr = next, addr != end);
>  }
> @@ -1307,7 +1310,7 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  	pgd_t *pgd;
>  	phys_addr_t next;
>  
> -	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> +	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
>  	do {
>  		/*
>  		 * Release kvm_mmu_lock periodically if the memory region is
> @@ -1321,9 +1324,9 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  		cond_resched_lock(&kvm->mmu_lock);
>  		if (!READ_ONCE(kvm->arch.pgd))
>  			break;
> -		next = stage2_pgd_addr_end(addr, end);
> -		if (stage2_pgd_present(*pgd))
> -			stage2_wp_puds(pgd, addr, next);
> +		next = stage2_pgd_addr_end(kvm, addr, end);
> +		if (stage2_pgd_present(kvm, *pgd))
> +			stage2_wp_puds(kvm, pgd, addr, next);
>  	} while (pgd++, addr = next, addr != end);
>  }
>  
> @@ -1472,7 +1475,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	up_read(&current->mm->mmap_sem);
>  
>  	/* We need minimum second+third level pages */
> -	ret = mmu_topup_memory_cache(memcache, KVM_MMU_CACHE_MIN_PAGES,
> +	ret = mmu_topup_memory_cache(memcache, kvm_mmu_cache_min_pages(kvm),
>  				     KVM_NR_MEM_OBJS);
>  	if (ret)
>  		return ret;
> @@ -1715,7 +1718,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  	}
>  
>  	/* Userspace should not be able to register out-of-bounds IPAs */
> -	VM_BUG_ON(fault_ipa >= KVM_PHYS_SIZE);
> +	VM_BUG_ON(fault_ipa >= kvm_phys_size(vcpu->kvm));
>  
>  	if (fault_status == FSC_ACCESS) {
>  		handle_access_fault(vcpu, fault_ipa);
> @@ -2019,7 +2022,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  	 * space addressable by the KVM guest IPA space.
>  	 */
>  	if (memslot->base_gfn + memslot->npages >=
> -	    (KVM_PHYS_SIZE >> PAGE_SHIFT))
> +	    (kvm_phys_size(kvm) >> PAGE_SHIFT))
>  		return -EFAULT;
>  
>  	down_read(&current->mm->mmap_sem);
> diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c b/virt/kvm/arm/vgic/vgic-kvm-device.c
> index 6ada243..114dce9 100644
> --- a/virt/kvm/arm/vgic/vgic-kvm-device.c
> +++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
> @@ -25,7 +25,7 @@
>  int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t *ioaddr,
>  		      phys_addr_t addr, phys_addr_t alignment)
>  {
> -	if (addr & ~KVM_PHYS_MASK)
> +	if (addr & ~kvm_phys_mask(kvm))
>  		return -E2BIG;
>  
>  	if (!IS_ALIGNED(addr, alignment))
> 

Otherwise:

Acked-by: Marc Zyngier <marc.zyngier@arm.com>

	M.

Suzuki K Poulose July 2, 2018, 10:25 a.m. UTC | #10

On 02/07/18 11:12, Marc Zyngier wrote:
> On 29/06/18 12:15, Suzuki K Poulose wrote:
>> Right now the stage2 page table for a VM is hard coded, assuming
>> an IPA of 40bits. As we are about to add support for per VM IPA,
>> prepare the stage2 page table helpers to accept the kvm instance
>> to make the right decision for the VM. No functional changes.
>> Adds stage2_pgd_size(kvm) to replace S2_PGD_SIZE. Also, moves
>> some of the definitions dependent on kvm instance to asm/kvm_mmu.h
>> for arm32. In that process drop the _AC() specifier constants
>>
>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>> Cc: Christoffer Dall <cdall@kernel.org>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>> Changes since V2:
>>   - Update commit description abuot the movement to asm/kvm_mmu.h
>>     for arm32
>>   - Drop _AC() specifiers
>> ---
>>   arch/arm/include/asm/kvm_arm.h                |   3 +-
>>   arch/arm/include/asm/kvm_mmu.h                |  15 +++-
>>   arch/arm/include/asm/stage2_pgtable.h         |  42 ++++-----
>>   arch/arm64/include/asm/kvm_mmu.h              |   7 +-
>>   arch/arm64/include/asm/stage2_pgtable-nopmd.h |  18 ++--
>>   arch/arm64/include/asm/stage2_pgtable-nopud.h |  16 ++--
>>   arch/arm64/include/asm/stage2_pgtable.h       |  49 ++++++-----
>>   virt/kvm/arm/arm.c                            |   2 +-
>>   virt/kvm/arm/mmu.c                            | 119 +++++++++++++-------------
>>   virt/kvm/arm/vgic/vgic-kvm-device.c           |   2 +-
>>   10 files changed, 148 insertions(+), 125 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
>> index 3ab8b37..c3f1f9b 100644
>> --- a/arch/arm/include/asm/kvm_arm.h
>> +++ b/arch/arm/include/asm/kvm_arm.h
>> @@ -133,8 +133,7 @@
>>    * space.
>>    */
>>   #define KVM_PHYS_SHIFT	(40)
>> -#define KVM_PHYS_SIZE	(_AC(1, ULL) << KVM_PHYS_SHIFT)
>> -#define KVM_PHYS_MASK	(KVM_PHYS_SIZE - _AC(1, ULL))
>> +
>>   #define PTRS_PER_S2_PGD	(_AC(1, ULL) << (KVM_PHYS_SHIFT - 30))
>>   
>>   /* Virtualization Translation Control Register (VTCR) bits */
>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>> index 8553d68..f36eb20 100644
>> --- a/arch/arm/include/asm/kvm_mmu.h
>> +++ b/arch/arm/include/asm/kvm_mmu.h
>> @@ -36,15 +36,19 @@
>>   	})
>>   
>>   /*
>> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels.
>> + * kvm_mmu_cache_min_pages() is the number of stage2 page
>> + * table translation levels, excluding the top level, for
>> + * the given VM. Since we have a 3 level page-table, this
>> + * is fixed.
> 
> I find this comment quite confusing: number of levels, but excluding the
> top one? The original one was just as bad, to be honest.
> 
> Can't we just say: "kvm_mmu_cache_min_page() is the number of pages
> required to install a stage-2 translation"?

Yes, that is much better.  Will change it.

>> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
>> index 8b68099..057a405 100644
>> --- a/arch/arm64/include/asm/stage2_pgtable.h
>> +++ b/arch/arm64/include/asm/stage2_pgtable.h
>> @@ -65,10 +65,10 @@
>>   #define PTRS_PER_S2_PGD			(1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
>>   
>>   /*
>> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
>> + * kvm_mmmu_cache_min_pages is the number of stage2 page table translation
>>    * levels in addition to the PGD.
>>    */
>> -#define KVM_MMU_CACHE_MIN_PAGES		(STAGE2_PGTABLE_LEVELS - 1)
>> +#define kvm_mmu_cache_min_pages(kvm)	(STAGE2_PGTABLE_LEVELS - 1)
> 
> Same comment as for the 32bit case.
> 
>>   

>>
> 
> Otherwise:
> 
> Acked-by: Marc Zyngier <marc.zyngier@arm.com>

Thanks
Suzuki

Eric Auger July 2, 2018, 10:51 a.m. UTC | #11

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> Right now the stage2 page table for a VM is hard coded, assuming
> an IPA of 40bits. As we are about to add support for per VM IPA,
> prepare the stage2 page table helpers to accept the kvm instance
> to make the right decision for the VM. No functional changes.
> Adds stage2_pgd_size(kvm) to replace S2_PGD_SIZE. Also, moves
> some of the definitions dependent on kvm instance to asm/kvm_mmu.h
> for arm32. In that process drop the _AC() specifier constants
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since V2:
>  - Update commit description abuot the movement to asm/kvm_mmu.h
>    for arm32
>  - Drop _AC() specifiers
> ---
>  arch/arm/include/asm/kvm_arm.h                |   3 +-
>  arch/arm/include/asm/kvm_mmu.h                |  15 +++-
>  arch/arm/include/asm/stage2_pgtable.h         |  42 ++++-----
>  arch/arm64/include/asm/kvm_mmu.h              |   7 +-
>  arch/arm64/include/asm/stage2_pgtable-nopmd.h |  18 ++--
>  arch/arm64/include/asm/stage2_pgtable-nopud.h |  16 ++--
>  arch/arm64/include/asm/stage2_pgtable.h       |  49 ++++++-----
>  virt/kvm/arm/arm.c                            |   2 +-
>  virt/kvm/arm/mmu.c                            | 119 +++++++++++++-------------
>  virt/kvm/arm/vgic/vgic-kvm-device.c           |   2 +-
>  10 files changed, 148 insertions(+), 125 deletions(-)
> 
> diff --git a/arch/arm/include/asm/kvm_arm.h b/arch/arm/include/asm/kvm_arm.h
> index 3ab8b37..c3f1f9b 100644
> --- a/arch/arm/include/asm/kvm_arm.h
> +++ b/arch/arm/include/asm/kvm_arm.h
> @@ -133,8 +133,7 @@
>   * space.
>   */
>  #define KVM_PHYS_SHIFT	(40)
> -#define KVM_PHYS_SIZE	(_AC(1, ULL) << KVM_PHYS_SHIFT)
> -#define KVM_PHYS_MASK	(KVM_PHYS_SIZE - _AC(1, ULL))
> +
>  #define PTRS_PER_S2_PGD	(_AC(1, ULL) << (KVM_PHYS_SHIFT - 30))
>  
>  /* Virtualization Translation Control Register (VTCR) bits */
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index 8553d68..f36eb20 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -36,15 +36,19 @@
>  	})
>  
>  /*
> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels.
> + * kvm_mmu_cache_min_pages() is the number of stage2 page
> + * table translation levels, excluding the top level, for
> + * the given VM. Since we have a 3 level page-table, this
> + * is fixed.
>   */
> -#define KVM_MMU_CACHE_MIN_PAGES	2
> +#define kvm_mmu_cache_min_pages(kvm)	2
nit: In addition to Marc'c comment, I can see it defined in
stage2_pgtable.h on arm64 side. Can't we align?
>  
>  #ifndef __ASSEMBLY__
>  
>  #include <linux/highmem.h>
>  #include <asm/cacheflush.h>
>  #include <asm/cputype.h>
> +#include <asm/kvm_arm.h>
>  #include <asm/kvm_hyp.h>
>  #include <asm/pgalloc.h>
>  #include <asm/stage2_pgtable.h>
> @@ -52,6 +56,13 @@
>  /* Ensure compatibility with arm64 */
>  #define VA_BITS			32
>  
> +#define kvm_phys_shift(kvm)		KVM_PHYS_SHIFT
> +#define kvm_phys_size(kvm)		(1ULL << kvm_phys_shift(kvm))
> +#define kvm_phys_mask(kvm)		(kvm_phys_size(kvm) - 1ULL)
> +#define kvm_vttbr_baddr_mask(kvm)	VTTBR_BADDR_MASK
> +
> +#define stage2_pgd_size(kvm)		(PTRS_PER_S2_PGD * sizeof(pgd_t))
> +
>  int create_hyp_mappings(void *from, void *to, pgprot_t prot);
>  int create_hyp_io_mappings(phys_addr_t phys_addr, size_t size,
>  			   void __iomem **kaddr,
> diff --git a/arch/arm/include/asm/stage2_pgtable.h b/arch/arm/include/asm/stage2_pgtable.h
> index 460d616..e22ae94 100644
> --- a/arch/arm/include/asm/stage2_pgtable.h
> +++ b/arch/arm/include/asm/stage2_pgtable.h
> @@ -19,43 +19,45 @@
>  #ifndef __ARM_S2_PGTABLE_H_
>  #define __ARM_S2_PGTABLE_H_
>  
> -#define stage2_pgd_none(pgd)			pgd_none(pgd)
> -#define stage2_pgd_clear(pgd)			pgd_clear(pgd)
> -#define stage2_pgd_present(pgd)			pgd_present(pgd)
> -#define stage2_pgd_populate(pgd, pud)		pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(pgd, address)		pud_offset(pgd, address)
> -#define stage2_pud_free(pud)			pud_free(NULL, pud)
> +#define stage2_pgd_none(kvm, pgd)		pgd_none(pgd)
> +#define stage2_pgd_clear(kvm, pgd)		pgd_clear(pgd)
> +#define stage2_pgd_present(kvm, pgd)		pgd_present(pgd)
> +#define stage2_pgd_populate(kvm, pgd, pud)	pgd_populate(NULL, pgd, pud)
> +#define stage2_pud_offset(kvm, pgd, address)	pud_offset(pgd, address)
> +#define stage2_pud_free(kvm, pud)		pud_free(NULL, pud)
>  
> -#define stage2_pud_none(pud)			pud_none(pud)
> -#define stage2_pud_clear(pud)			pud_clear(pud)
> -#define stage2_pud_present(pud)			pud_present(pud)
> -#define stage2_pud_populate(pud, pmd)		pud_populate(NULL, pud, pmd)
> -#define stage2_pmd_offset(pud, address)		pmd_offset(pud, address)
> -#define stage2_pmd_free(pmd)			pmd_free(NULL, pmd)
> +#define stage2_pud_none(kvm, pud)		pud_none(pud)
> +#define stage2_pud_clear(kvm, pud)		pud_clear(pud)
> +#define stage2_pud_present(kvm, pud)		pud_present(pud)
> +#define stage2_pud_populate(kvm, pud, pmd)	pud_populate(NULL, pud, pmd)
> +#define stage2_pmd_offset(kvm, pud, address)	pmd_offset(pud, address)
> +#define stage2_pmd_free(kvm, pmd)		pmd_free(NULL, pmd)
>  
> -#define stage2_pud_huge(pud)			pud_huge(pud)
> +#define stage2_pud_huge(kvm, pud)		pud_huge(pud)
>  
>  /* Open coded p*d_addr_end that can deal with 64bit addresses */
> -static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
>  	phys_addr_t boundary = (addr + PGDIR_SIZE) & PGDIR_MASK;
>  
>  	return (boundary - 1 < end - 1) ? boundary : end;
>  }
>  
> -#define stage2_pud_addr_end(addr, end)		(end)
> +#define stage2_pud_addr_end(kvm, addr, end)	(end)
>  
> -static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
>  	phys_addr_t boundary = (addr + PMD_SIZE) & PMD_MASK;
>  
>  	return (boundary - 1 < end - 1) ? boundary : end;
>  }
>  
> -#define stage2_pgd_index(addr)				pgd_index(addr)
> +#define stage2_pgd_index(kvm, addr)		pgd_index(addr)
>  
> -#define stage2_pte_table_empty(ptep)			kvm_page_empty(ptep)
> -#define stage2_pmd_table_empty(pmdp)			kvm_page_empty(pmdp)
> -#define stage2_pud_table_empty(pudp)			false
> +#define stage2_pte_table_empty(kvm, ptep)	kvm_page_empty(ptep)
> +#define stage2_pmd_table_empty(kvm, pmdp)	kvm_page_empty(pmdp)
> +#define stage2_pud_table_empty(kvm, pudp)	false
>  
>  #endif	/* __ARM_S2_PGTABLE_H_ */
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index fb9a712..5da8f52 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -141,8 +141,11 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
>   * We currently only support a 40bit IPA.
>   */
>  #define KVM_PHYS_SHIFT	(40)
> -#define KVM_PHYS_SIZE	(1UL << KVM_PHYS_SHIFT)
> -#define KVM_PHYS_MASK	(KVM_PHYS_SIZE - 1UL)
> +
> +#define kvm_phys_shift(kvm)		KVM_PHYS_SHIFT
> +#define kvm_phys_size(kvm)		(_AC(1, ULL) << kvm_phys_shift(kvm))
Can't you get rid of _AC() also in arm64 case?

> +#define kvm_phys_mask(kvm)		(kvm_phys_size(kvm) - _AC(1, ULL))
> +#define kvm_vttbr_baddr_mask(kvm)	VTTBR_BADDR_MASK
>  
>  #include <asm/stage2_pgtable.h>
>  
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> index 2656a0f..0280ded 100644
> --- a/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> +++ b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> @@ -26,17 +26,17 @@
>  #define S2_PMD_SIZE		(1UL << S2_PMD_SHIFT)
>  #define S2_PMD_MASK		(~(S2_PMD_SIZE-1))
>  
> -#define stage2_pud_none(pud)			(0)
> -#define stage2_pud_present(pud)			(1)
> -#define stage2_pud_clear(pud)			do { } while (0)
> -#define stage2_pud_populate(pud, pmd)		do { } while (0)
> -#define stage2_pmd_offset(pud, address)		((pmd_t *)(pud))
> +#define stage2_pud_none(kvm, pud)		(0)
> +#define stage2_pud_present(kvm, pud)		(1)
> +#define stage2_pud_clear(kvm, pud)		do { } while (0)
> +#define stage2_pud_populate(kvm, pud, pmd)	do { } while (0)
> +#define stage2_pmd_offset(kvm, pud, address)	((pmd_t *)(pud))
>  
> -#define stage2_pmd_free(pmd)			do { } while (0)
> +#define stage2_pmd_free(kvm, pmd)		do { } while (0)
>  
> -#define stage2_pmd_addr_end(addr, end)		(end)
> +#define stage2_pmd_addr_end(kvm, addr, end)	(end)
>  
> -#define stage2_pud_huge(pud)			(0)
> -#define stage2_pmd_table_empty(pmdp)		(0)
> +#define stage2_pud_huge(kvm, pud)		(0)
> +#define stage2_pmd_table_empty(kvm, pmdp)	(0)
>  
>  #endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> index 5ee87b5..cd6304e 100644
> --- a/arch/arm64/include/asm/stage2_pgtable-nopud.h
> +++ b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> @@ -24,16 +24,16 @@
>  #define S2_PUD_SIZE		(_AC(1, UL) << S2_PUD_SHIFT)
>  #define S2_PUD_MASK		(~(S2_PUD_SIZE-1))
>  
> -#define stage2_pgd_none(pgd)			(0)
> -#define stage2_pgd_present(pgd)			(1)
> -#define stage2_pgd_clear(pgd)			do { } while (0)
> -#define stage2_pgd_populate(pgd, pud)	do { } while (0)
> +#define stage2_pgd_none(kvm, pgd)		(0)
> +#define stage2_pgd_present(kvm, pgd)		(1)
> +#define stage2_pgd_clear(kvm, pgd)		do { } while (0)
> +#define stage2_pgd_populate(kvm, pgd, pud)	do { } while (0)
>  
> -#define stage2_pud_offset(pgd, address)		((pud_t *)(pgd))
> +#define stage2_pud_offset(kvm, pgd, address)	((pud_t *)(pgd))
>  
> -#define stage2_pud_free(x)			do { } while (0)
> +#define stage2_pud_free(kvm, x)			do { } while (0)
>  
> -#define stage2_pud_addr_end(addr, end)		(end)
> -#define stage2_pud_table_empty(pmdp)		(0)
> +#define stage2_pud_addr_end(kvm, addr, end)	(end)
> +#define stage2_pud_table_empty(kvm, pmdp)	(0)
>  
>  #endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index 8b68099..057a405 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -65,10 +65,10 @@
>  #define PTRS_PER_S2_PGD			(1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
>  
>  /*
> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation
> + * kvm_mmmu_cache_min_pages is the number of stage2 page table translation
>   * levels in addition to the PGD.
>   */
> -#define KVM_MMU_CACHE_MIN_PAGES		(STAGE2_PGTABLE_LEVELS - 1)
> +#define kvm_mmu_cache_min_pages(kvm)	(STAGE2_PGTABLE_LEVELS - 1)
>  
>  
>  #if STAGE2_PGTABLE_LEVELS > 3
> @@ -77,16 +77,17 @@
>  #define S2_PUD_SIZE			(_AC(1, UL) << S2_PUD_SHIFT)
>  #define S2_PUD_MASK			(~(S2_PUD_SIZE - 1))
>  
> -#define stage2_pgd_none(pgd)				pgd_none(pgd)
> -#define stage2_pgd_clear(pgd)				pgd_clear(pgd)
> -#define stage2_pgd_present(pgd)				pgd_present(pgd)
> -#define stage2_pgd_populate(pgd, pud)			pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(pgd, address)			pud_offset(pgd, address)
> -#define stage2_pud_free(pud)				pud_free(NULL, pud)
> +#define stage2_pgd_none(kvm, pgd)		pgd_none(pgd)
> +#define stage2_pgd_clear(kvm, pgd)		pgd_clear(pgd)
> +#define stage2_pgd_present(kvm, pgd)		pgd_present(pgd)
> +#define stage2_pgd_populate(kvm, pgd, pud)	pgd_populate(NULL, pgd, pud)
> +#define stage2_pud_offset(kvm, pgd, address)	pud_offset(pgd, address)
> +#define stage2_pud_free(kvm, pud)		pud_free(NULL, pud)
>  
> -#define stage2_pud_table_empty(pudp)			kvm_page_empty(pudp)
> +#define stage2_pud_table_empty(kvm, pudp)	kvm_page_empty(pudp)
>  
> -static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pud_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
>  	phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;
>  
> @@ -102,17 +103,18 @@ static inline phys_addr_t stage2_pud_addr_end(phys_addr_t addr, phys_addr_t end)
>  #define S2_PMD_SIZE			(_AC(1, UL) << S2_PMD_SHIFT)
>  #define S2_PMD_MASK			(~(S2_PMD_SIZE - 1))
>  
> -#define stage2_pud_none(pud)				pud_none(pud)
> -#define stage2_pud_clear(pud)				pud_clear(pud)
> -#define stage2_pud_present(pud)				pud_present(pud)
> -#define stage2_pud_populate(pud, pmd)			pud_populate(NULL, pud, pmd)
> -#define stage2_pmd_offset(pud, address)			pmd_offset(pud, address)
> -#define stage2_pmd_free(pmd)				pmd_free(NULL, pmd)
> +#define stage2_pud_none(kvm, pud)		pud_none(pud)
> +#define stage2_pud_clear(kvm, pud)		pud_clear(pud)
> +#define stage2_pud_present(kvm, pud)		pud_present(pud)
> +#define stage2_pud_populate(kvm, pud, pmd)	pud_populate(NULL, pud, pmd)
> +#define stage2_pmd_offset(kvm, pud, address)	pmd_offset(pud, address)
> +#define stage2_pmd_free(kvm, pmd)		pmd_free(NULL, pmd)
>  
> -#define stage2_pud_huge(pud)				pud_huge(pud)
> -#define stage2_pmd_table_empty(pmdp)			kvm_page_empty(pmdp)
> +#define stage2_pud_huge(kvm, pud)		pud_huge(pud)
> +#define stage2_pmd_table_empty(kvm, pmdp)	kvm_page_empty(pmdp)
>  
> -static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
>  	phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;
>  
> @@ -121,7 +123,7 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
>  
>  #endif		/* STAGE2_PGTABLE_LEVELS > 2 */
>  
> -#define stage2_pte_table_empty(ptep)			kvm_page_empty(ptep)
> +#define stage2_pte_table_empty(kvm, ptep)	kvm_page_empty(ptep)
>  
>  #if STAGE2_PGTABLE_LEVELS == 2
>  #include <asm/stage2_pgtable-nopmd.h>
> @@ -129,10 +131,13 @@ static inline phys_addr_t stage2_pmd_addr_end(phys_addr_t addr, phys_addr_t end)
>  #include <asm/stage2_pgtable-nopud.h>
>  #endif
>  
> +#define stage2_pgd_size(kvm)	(PTRS_PER_S2_PGD * sizeof(pgd_t))
>  
> -#define stage2_pgd_index(addr)				(((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
> +#define stage2_pgd_index(kvm, addr) \
> +	(((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
>  
> -static inline phys_addr_t stage2_pgd_addr_end(phys_addr_t addr, phys_addr_t end)
> +static inline phys_addr_t
> +stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
>  	phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
>  
> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 04e554c..d2637bb 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -538,7 +538,7 @@ static void update_vttbr(struct kvm *kvm)
>  
>  	/* update vttbr to be used with the new vmid */
>  	pgd_phys = virt_to_phys(kvm->arch.pgd);
> -	BUG_ON(pgd_phys & ~VTTBR_BADDR_MASK);
> +	BUG_ON(pgd_phys & ~kvm_vttbr_baddr_mask(kvm));
>  	vmid = ((u64)(kvm->arch.vmid) << VTTBR_VMID_SHIFT) & VTTBR_VMID_MASK(kvm_vmid_bits);
>  	kvm->arch.vttbr = kvm_phys_to_vttbr(pgd_phys) | vmid;
>  
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 308171c..82dd571 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -45,7 +45,6 @@ static phys_addr_t hyp_idmap_vector;
>  
>  static unsigned long io_map_base;
>  
> -#define S2_PGD_SIZE	(PTRS_PER_S2_PGD * sizeof(pgd_t))
>  #define hyp_pgd_order get_order(PTRS_PER_PGD * sizeof(pgd_t))
>  
>  #define KVM_S2PTE_FLAG_IS_IOMAP		(1UL << 0)
> @@ -150,20 +149,20 @@ static void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc)
>  
>  static void clear_stage2_pgd_entry(struct kvm *kvm, pgd_t *pgd, phys_addr_t addr)
>  {
> -	pud_t *pud_table __maybe_unused = stage2_pud_offset(pgd, 0UL);
> -	stage2_pgd_clear(pgd);
> +	pud_t *pud_table __maybe_unused = stage2_pud_offset(kvm, pgd, 0UL);
> +	stage2_pgd_clear(kvm, pgd);
>  	kvm_tlb_flush_vmid_ipa(kvm, addr);
> -	stage2_pud_free(pud_table);
> +	stage2_pud_free(kvm, pud_table);
>  	put_page(virt_to_page(pgd));
>  }
>  
>  static void clear_stage2_pud_entry(struct kvm *kvm, pud_t *pud, phys_addr_t addr)
>  {
> -	pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(pud, 0);
> -	VM_BUG_ON(stage2_pud_huge(*pud));
> -	stage2_pud_clear(pud);
> +	pmd_t *pmd_table __maybe_unused = stage2_pmd_offset(kvm, pud, 0);
> +	VM_BUG_ON(stage2_pud_huge(kvm, *pud));
> +	stage2_pud_clear(kvm, pud);
>  	kvm_tlb_flush_vmid_ipa(kvm, addr);
> -	stage2_pmd_free(pmd_table);
> +	stage2_pmd_free(kvm, pmd_table);
>  	put_page(virt_to_page(pud));
>  }
>  
> @@ -219,7 +218,7 @@ static void unmap_stage2_ptes(struct kvm *kvm, pmd_t *pmd,
>  		}
>  	} while (pte++, addr += PAGE_SIZE, addr != end);
>  
> -	if (stage2_pte_table_empty(start_pte))
> +	if (stage2_pte_table_empty(kvm, start_pte))
>  		clear_stage2_pmd_entry(kvm, pmd, start_addr);
>  }
>  
> @@ -229,9 +228,9 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
>  	phys_addr_t next, start_addr = addr;
>  	pmd_t *pmd, *start_pmd;
>  
> -	start_pmd = pmd = stage2_pmd_offset(pud, addr);
> +	start_pmd = pmd = stage2_pmd_offset(kvm, pud, addr);
>  	do {
> -		next = stage2_pmd_addr_end(addr, end);
> +		next = stage2_pmd_addr_end(kvm, addr, end);
>  		if (!pmd_none(*pmd)) {
>  			if (pmd_thp_or_huge(*pmd)) {
>  				pmd_t old_pmd = *pmd;
> @@ -248,7 +247,7 @@ static void unmap_stage2_pmds(struct kvm *kvm, pud_t *pud,
>  		}
>  	} while (pmd++, addr = next, addr != end);
>  
> -	if (stage2_pmd_table_empty(start_pmd))
> +	if (stage2_pmd_table_empty(kvm, start_pmd))
>  		clear_stage2_pud_entry(kvm, pud, start_addr);
>  }
>  
> @@ -258,14 +257,14 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
>  	phys_addr_t next, start_addr = addr;
>  	pud_t *pud, *start_pud;
>  
> -	start_pud = pud = stage2_pud_offset(pgd, addr);
> +	start_pud = pud = stage2_pud_offset(kvm, pgd, addr);
>  	do {
> -		next = stage2_pud_addr_end(addr, end);
> -		if (!stage2_pud_none(*pud)) {
> -			if (stage2_pud_huge(*pud)) {
> +		next = stage2_pud_addr_end(kvm, addr, end);
> +		if (!stage2_pud_none(kvm, *pud)) {
> +			if (stage2_pud_huge(kvm, *pud)) {
>  				pud_t old_pud = *pud;
>  
> -				stage2_pud_clear(pud);
> +				stage2_pud_clear(kvm, pud);
>  				kvm_tlb_flush_vmid_ipa(kvm, addr);
>  				kvm_flush_dcache_pud(old_pud);
>  				put_page(virt_to_page(pud));
> @@ -275,7 +274,7 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
>  		}
>  	} while (pud++, addr = next, addr != end);
>  
> -	if (stage2_pud_table_empty(start_pud))
> +	if (stage2_pud_table_empty(kvm, start_pud))
>  		clear_stage2_pgd_entry(kvm, pgd, start_addr);
>  }
>  
> @@ -299,7 +298,7 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>  	assert_spin_locked(&kvm->mmu_lock);
>  	WARN_ON(size & ~PAGE_MASK);
>  
> -	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> +	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
>  	do {
>  		/*
>  		 * Make sure the page table is still active, as another thread
> @@ -308,8 +307,8 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
>  		 */
>  		if (!READ_ONCE(kvm->arch.pgd))
>  			break;
> -		next = stage2_pgd_addr_end(addr, end);
> -		if (!stage2_pgd_none(*pgd))
> +		next = stage2_pgd_addr_end(kvm, addr, end);
> +		if (!stage2_pgd_none(kvm, *pgd))
>  			unmap_stage2_puds(kvm, pgd, addr, next);
>  		/*
>  		 * If the range is too large, release the kvm->mmu_lock
> @@ -338,9 +337,9 @@ static void stage2_flush_pmds(struct kvm *kvm, pud_t *pud,
>  	pmd_t *pmd;
>  	phys_addr_t next;
>  
> -	pmd = stage2_pmd_offset(pud, addr);
> +	pmd = stage2_pmd_offset(kvm, pud, addr);
>  	do {
> -		next = stage2_pmd_addr_end(addr, end);
> +		next = stage2_pmd_addr_end(kvm, addr, end);
>  		if (!pmd_none(*pmd)) {
>  			if (pmd_thp_or_huge(*pmd))
>  				kvm_flush_dcache_pmd(*pmd);
> @@ -356,11 +355,11 @@ static void stage2_flush_puds(struct kvm *kvm, pgd_t *pgd,
>  	pud_t *pud;
>  	phys_addr_t next;
>  
> -	pud = stage2_pud_offset(pgd, addr);
> +	pud = stage2_pud_offset(kvm, pgd, addr);
>  	do {
> -		next = stage2_pud_addr_end(addr, end);
> -		if (!stage2_pud_none(*pud)) {
> -			if (stage2_pud_huge(*pud))
> +		next = stage2_pud_addr_end(kvm, addr, end);
> +		if (!stage2_pud_none(kvm, *pud)) {
> +			if (stage2_pud_huge(kvm, *pud))
>  				kvm_flush_dcache_pud(*pud);
>  			else
>  				stage2_flush_pmds(kvm, pud, addr, next);
> @@ -376,10 +375,10 @@ static void stage2_flush_memslot(struct kvm *kvm,
>  	phys_addr_t next;
>  	pgd_t *pgd;
>  
> -	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> +	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
>  	do {
> -		next = stage2_pgd_addr_end(addr, end);
> -		if (!stage2_pgd_none(*pgd))
> +		next = stage2_pgd_addr_end(kvm, addr, end);
> +		if (!stage2_pgd_none(kvm, *pgd))
>  			stage2_flush_puds(kvm, pgd, addr, next);
>  	} while (pgd++, addr = next, addr != end);
>  }
> @@ -869,7 +868,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
>  	}
>  
>  	/* Allocate the HW PGD, making sure that each page gets its own refcount */
> -	pgd = alloc_pages_exact(S2_PGD_SIZE, GFP_KERNEL | __GFP_ZERO);
> +	pgd = alloc_pages_exact(stage2_pgd_size(kvm), GFP_KERNEL | __GFP_ZERO);
>  	if (!pgd)
>  		return -ENOMEM;
>  
> @@ -958,7 +957,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>  
>  	spin_lock(&kvm->mmu_lock);
>  	if (kvm->arch.pgd) {
> -		unmap_stage2_range(kvm, 0, KVM_PHYS_SIZE);
> +		unmap_stage2_range(kvm, 0, kvm_phys_size(kvm));
>  		pgd = READ_ONCE(kvm->arch.pgd);
>  		kvm->arch.pgd = NULL;
>  	}
> @@ -966,7 +965,7 @@ void kvm_free_stage2_pgd(struct kvm *kvm)
>  
>  	/* Free the HW pgd, one page at a time */
>  	if (pgd)
> -		free_pages_exact(pgd, S2_PGD_SIZE);
> +		free_pages_exact(pgd, stage2_pgd_size(kvm));
>  }
>  
>  static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> @@ -975,16 +974,16 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>  	pgd_t *pgd;
>  	pud_t *pud;
>  
> -	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> -	if (stage2_pgd_none(*pgd)) {
> +	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
> +	if (stage2_pgd_none(kvm, *pgd)) {
>  		if (!cache)
>  			return NULL;
>  		pud = mmu_memory_cache_alloc(cache);
> -		stage2_pgd_populate(pgd, pud);
> +		stage2_pgd_populate(kvm, pgd, pud);
>  		get_page(virt_to_page(pgd));
>  	}
>  
> -	return stage2_pud_offset(pgd, addr);
> +	return stage2_pud_offset(kvm, pgd, addr);
>  }
>  
>  static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache,
> @@ -997,15 +996,15 @@ static pmd_t *stage2_get_pmd(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
>  	if (!pud)
>  		return NULL;
>  
> -	if (stage2_pud_none(*pud)) {
> +	if (stage2_pud_none(kvm, *pud)) {
>  		if (!cache)
>  			return NULL;
>  		pmd = mmu_memory_cache_alloc(cache);
> -		stage2_pud_populate(pud, pmd);
> +		stage2_pud_populate(kvm, pud, pmd);
>  		get_page(virt_to_page(pud));
>  	}
>  
> -	return stage2_pmd_offset(pud, addr);
> +	return stage2_pmd_offset(kvm, pud, addr);
>  }
>  
>  static int stage2_set_pmd_huge(struct kvm *kvm, struct kvm_mmu_memory_cache
> @@ -1159,8 +1158,9 @@ int kvm_phys_addr_ioremap(struct kvm *kvm, phys_addr_t guest_ipa,
>  		if (writable)
>  			pte = kvm_s2pte_mkwrite(pte);
>  
> -		ret = mmu_topup_memory_cache(&cache, KVM_MMU_CACHE_MIN_PAGES,
> -						KVM_NR_MEM_OBJS);
> +		ret = mmu_topup_memory_cache(&cache,
> +					     kvm_mmu_cache_min_pages(kvm),
> +					     KVM_NR_MEM_OBJS);
>  		if (ret)
>  			goto out;
>  		spin_lock(&kvm->mmu_lock);
> @@ -1248,19 +1248,21 @@ static void stage2_wp_ptes(pmd_t *pmd, phys_addr_t addr, phys_addr_t end)
>  
>  /**
>   * stage2_wp_pmds - write protect PUD range
> + * kvm:		kvm instance for the VM
>   * @pud:	pointer to pud entry
>   * @addr:	range start address
>   * @end:	range end address
>   */
> -static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
> +static void stage2_wp_pmds(struct kvm *kvm, pud_t *pud,
> +			   phys_addr_t addr, phys_addr_t end)
>  {
>  	pmd_t *pmd;
>  	phys_addr_t next;
>  
> -	pmd = stage2_pmd_offset(pud, addr);
> +	pmd = stage2_pmd_offset(kvm, pud, addr);
>  
>  	do {
> -		next = stage2_pmd_addr_end(addr, end);
> +		next = stage2_pmd_addr_end(kvm, addr, end);
>  		if (!pmd_none(*pmd)) {
>  			if (pmd_thp_or_huge(*pmd)) {
>  				if (!kvm_s2pmd_readonly(pmd))
> @@ -1280,18 +1282,19 @@ static void stage2_wp_pmds(pud_t *pud, phys_addr_t addr, phys_addr_t end)
>    *
>    * Process PUD entries, for a huge PUD we cause a panic.
>    */
> -static void  stage2_wp_puds(pgd_t *pgd, phys_addr_t addr, phys_addr_t end)
> +static void  stage2_wp_puds(struct kvm *kvm, pgd_t *pgd,
> +			    phys_addr_t addr, phys_addr_t end)
>  {
>  	pud_t *pud;
>  	phys_addr_t next;
>  
> -	pud = stage2_pud_offset(pgd, addr);
> +	pud = stage2_pud_offset(kvm, pgd, addr);
>  	do {
> -		next = stage2_pud_addr_end(addr, end);
> -		if (!stage2_pud_none(*pud)) {
> +		next = stage2_pud_addr_end(kvm, addr, end);
> +		if (!stage2_pud_none(kvm, *pud)) {
>  			/* TODO:PUD not supported, revisit later if supported */
> -			BUG_ON(stage2_pud_huge(*pud));
> -			stage2_wp_pmds(pud, addr, next);
> +			BUG_ON(stage2_pud_huge(kvm, *pud));
> +			stage2_wp_pmds(kvm, pud, addr, next);
>  		}
>  	} while (pud++, addr = next, addr != end);
>  }
> @@ -1307,7 +1310,7 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  	pgd_t *pgd;
>  	phys_addr_t next;
>  
> -	pgd = kvm->arch.pgd + stage2_pgd_index(addr);
> +	pgd = kvm->arch.pgd + stage2_pgd_index(kvm, addr);
>  	do {
>  		/*
>  		 * Release kvm_mmu_lock periodically if the memory region is
> @@ -1321,9 +1324,9 @@ static void stage2_wp_range(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  		cond_resched_lock(&kvm->mmu_lock);
>  		if (!READ_ONCE(kvm->arch.pgd))
>  			break;
> -		next = stage2_pgd_addr_end(addr, end);
> -		if (stage2_pgd_present(*pgd))
> -			stage2_wp_puds(pgd, addr, next);
> +		next = stage2_pgd_addr_end(kvm, addr, end);
> +		if (stage2_pgd_present(kvm, *pgd))
> +			stage2_wp_puds(kvm, pgd, addr, next);
>  	} while (pgd++, addr = next, addr != end);
>  }
>  
> @@ -1472,7 +1475,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  	up_read(&current->mm->mmap_sem);
>  
>  	/* We need minimum second+third level pages */
> -	ret = mmu_topup_memory_cache(memcache, KVM_MMU_CACHE_MIN_PAGES,
> +	ret = mmu_topup_memory_cache(memcache, kvm_mmu_cache_min_pages(kvm),
>  				     KVM_NR_MEM_OBJS);
>  	if (ret)
>  		return ret;
> @@ -1715,7 +1718,7 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, struct kvm_run *run)
>  	}
>  
>  	/* Userspace should not be able to register out-of-bounds IPAs */
> -	VM_BUG_ON(fault_ipa >= KVM_PHYS_SIZE);
> +	VM_BUG_ON(fault_ipa >= kvm_phys_size(vcpu->kvm));
>  
>  	if (fault_status == FSC_ACCESS) {
>  		handle_access_fault(vcpu, fault_ipa);
> @@ -2019,7 +2022,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
>  	 * space addressable by the KVM guest IPA space.
>  	 */
>  	if (memslot->base_gfn + memslot->npages >=
> -	    (KVM_PHYS_SIZE >> PAGE_SHIFT))
> +	    (kvm_phys_size(kvm) >> PAGE_SHIFT))
>  		return -EFAULT;
>  
>  	down_read(&current->mm->mmap_sem);
> diff --git a/virt/kvm/arm/vgic/vgic-kvm-device.c b/virt/kvm/arm/vgic/vgic-kvm-device.c
> index 6ada243..114dce9 100644
> --- a/virt/kvm/arm/vgic/vgic-kvm-device.c
> +++ b/virt/kvm/arm/vgic/vgic-kvm-device.c
> @@ -25,7 +25,7 @@
>  int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t *ioaddr,
>  		      phys_addr_t addr, phys_addr_t alignment)
>  {
> -	if (addr & ~KVM_PHYS_MASK)
> +	if (addr & ~kvm_phys_mask(kvm))
>  		return -E2BIG;
>  
>  	if (!IS_ALIGNED(addr, alignment))
> 

Thanks

Eric

Suzuki K Poulose July 2, 2018, 10:57 a.m. UTC | #12

On 29/06/18 12:15, Suzuki K Poulose wrote:
> So far we had a static stage2 page table handling code, based on a
> fixed IPA of 40bits. As we prepare for a configurable IPA size per
> VM, make our stage2 page table code dynamic, to do the right thing
> for a given VM. We ensure the existing condition is always true even
> when we lift the limit on the IPA. i.e,
> 
> 	page table levels in stage1 >= page table levels in stage2
> 
> Support for the IPA size configuration needs other changes in the way
> we configure the EL2 registers (VTTBR and VTCR). So, the IPA is still
> fixed to 40bits. The patch also moves the kvm_page_empty() in asm/kvm_mmu.h
> to the top, before including the asm/stage2_pgtable.h to avoid a forward
> declaration.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since V2
>   - Restrict the stage2 page table to allow reusing the host page table
>     helpers for now, until we get stage1 independent page table helpers.

...

> -#define stage2_pgd_none(kvm, pgd)		pgd_none(pgd)
> -#define stage2_pgd_clear(kvm, pgd)		pgd_clear(pgd)
> -#define stage2_pgd_present(kvm, pgd)		pgd_present(pgd)
> -#define stage2_pgd_populate(kvm, pgd, pud)	pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(kvm, pgd, address)	pud_offset(pgd, address)
> -#define stage2_pud_free(kvm, pud)		pud_free(NULL, pud)
> +#define __s2_pud_index(addr) \
> +	(((addr) >> __S2_PUD_SHIFT) & (PTRS_PER_PTE - 1))
> +#define __s2_pmd_index(addr) \
> +	(((addr) >> __S2_PMD_SHIFT) & (PTRS_PER_PTE - 1))
>   
> -#define stage2_pud_table_empty(kvm, pudp)	kvm_page_empty(pudp)
> +#define __kvm_has_stage2_levels(kvm, min_levels)	\
> +  ((CONFIG_PGTABLE_LEVELS >= min_levels) && (kvm_stage2_levels(kvm) >= min_levels))

On another look, I have renamed the helpers as follows :

kvm_stage2_has_pud(kvm) => kvm_stage2_has_pmd(kvm)
kvm_stage2_has_pgd(kvm) => kvm_stage2_has_pud(kvm)

below and everywhere.

> +
> +#define kvm_stage2_has_pgd(kvm)	__kvm_has_stage2_levels(kvm, 4)
> +#define kvm_stage2_has_pud(kvm) __kvm_has_stage2_levels(kvm, 3)


Suzuki

Suzuki K Poulose July 2, 2018, 10:59 a.m. UTC | #13

Hi Eric,

On 02/07/18 11:51, Auger Eric wrote:
> Hi Suzuki,
> 
> On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
>> Right now the stage2 page table for a VM is hard coded, assuming
>> an IPA of 40bits. As we are about to add support for per VM IPA,
>> prepare the stage2 page table helpers to accept the kvm instance
>> to make the right decision for the VM. No functional changes.
>> Adds stage2_pgd_size(kvm) to replace S2_PGD_SIZE. Also, moves
>> some of the definitions dependent on kvm instance to asm/kvm_mmu.h
>> for arm32. In that process drop the _AC() specifier constants
>>
>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>> Cc: Christoffer Dall <cdall@kernel.org>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>> Changes since V2:
>>   - Update commit description abuot the movement to asm/kvm_mmu.h
>>     for arm32
>>   - Drop _AC() specifiers


>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>> index 8553d68..f36eb20 100644
>> --- a/arch/arm/include/asm/kvm_mmu.h
>> +++ b/arch/arm/include/asm/kvm_mmu.h
>> @@ -36,15 +36,19 @@
>>   	})
>>   
>>   /*
>> - * KVM_MMU_CACHE_MIN_PAGES is the number of stage2 page table translation levels.
>> + * kvm_mmu_cache_min_pages() is the number of stage2 page
>> + * table translation levels, excluding the top level, for
>> + * the given VM. Since we have a 3 level page-table, this
>> + * is fixed.
>>    */
>> -#define KVM_MMU_CACHE_MIN_PAGES	2
>> +#define kvm_mmu_cache_min_pages(kvm)	2
> nit: In addition to Marc'c comment, I can see it defined in
> stage2_pgtable.h on arm64 side. Can't we align?

Sure, will do that.

>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>> index fb9a712..5da8f52 100644
>> --- a/arch/arm64/include/asm/kvm_mmu.h
>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>> @@ -141,8 +141,11 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
>>    * We currently only support a 40bit IPA.
>>    */
>>   #define KVM_PHYS_SHIFT	(40)
>> -#define KVM_PHYS_SIZE	(1UL << KVM_PHYS_SHIFT)
>> -#define KVM_PHYS_MASK	(KVM_PHYS_SIZE - 1UL)
>> +
>> +#define kvm_phys_shift(kvm)		KVM_PHYS_SHIFT
>> +#define kvm_phys_size(kvm)		(_AC(1, ULL) << kvm_phys_shift(kvm))
> Can't you get rid of _AC() also in arm64 case?
> 

>> +#define kvm_phys_mask(kvm)		(kvm_phys_size(kvm) - _AC(1, ULL))

Yes, that missed. I will do it. Thanks for spotting.

Cheers
Suzuki

Eric Auger July 2, 2018, 12:14 p.m. UTC | #14

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> So far we had a static stage2 page table handling code, based on a
> fixed IPA of 40bits. As we prepare for a configurable IPA size per
> VM, make our stage2 page table code dynamic, to do the right thing
> for a given VM. We ensure the existing condition is always true even
> when we lift the limit on the IPA. i.e,
> 
> 	page table levels in stage1 >= page table levels in stage2
> 
> Support for the IPA size configuration needs other changes in the way
> we configure the EL2 registers (VTTBR and VTCR). So, the IPA is still
> fixed to 40bits. The patch also moves the kvm_page_empty() in asm/kvm_mmu.h
> to the top, before including the asm/stage2_pgtable.h to avoid a forward
> declaration.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since V2
>  - Restrict the stage2 page table to allow reusing the host page table
>    helpers for now, until we get stage1 independent page table helpers.
I would move this up in the commit msg to motivate the fact we enforce
the able condition.
> ---
>  arch/arm64/include/asm/kvm_mmu.h              |  14 +-
>  arch/arm64/include/asm/stage2_pgtable-nopmd.h |  42 ------
>  arch/arm64/include/asm/stage2_pgtable-nopud.h |  39 -----
>  arch/arm64/include/asm/stage2_pgtable.h       | 207 +++++++++++++++++++-------
>  4 files changed, 159 insertions(+), 143 deletions(-)
>  delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
>  delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h

with my very limited knowledge of S2 page table walkers I fail to
understand why we now can get rid of stage2_pgtable-nopmd.h and
stage2_pgtable-nopud.h and associated FOLDED config. Please could you
explain it in the commit message?
> 
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index dbaf513..a351722 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -21,6 +21,7 @@
>  #include <asm/page.h>
>  #include <asm/memory.h>
>  #include <asm/cpufeature.h>
> +#include <asm/kvm_arm.h>
>  
>  /*
>   * As ARMv8.0 only has the TTBR0_EL2 register, we cannot express
> @@ -147,6 +148,13 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
>  #define kvm_phys_mask(kvm)		(kvm_phys_size(kvm) - _AC(1, ULL))
>  #define kvm_vttbr_baddr_mask(kvm)	VTTBR_BADDR_MASK
>  
> +static inline bool kvm_page_empty(void *ptr)
> +{
> +	struct page *ptr_page = virt_to_page(ptr);
> +
> +	return page_count(ptr_page) == 1;
> +}
> +
>  #include <asm/stage2_pgtable.h>
>  
>  int create_hyp_mappings(void *from, void *to, pgprot_t prot);
> @@ -237,12 +245,6 @@ static inline bool kvm_s2pmd_exec(pmd_t *pmdp)
>  	return !(READ_ONCE(pmd_val(*pmdp)) & PMD_S2_XN);
>  }
>  
> -static inline bool kvm_page_empty(void *ptr)
> -{
> -	struct page *ptr_page = virt_to_page(ptr);
> -	return page_count(ptr_page) == 1;
> -}
> -
>  #define hyp_pte_table_empty(ptep) kvm_page_empty(ptep)
>  
>  #ifdef __PAGETABLE_PMD_FOLDED
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopmd.h b/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> deleted file mode 100644
> index 0280ded..0000000
> --- a/arch/arm64/include/asm/stage2_pgtable-nopmd.h
> +++ /dev/null
> @@ -1,42 +0,0 @@
> -/*
> - * Copyright (C) 2016 - ARM Ltd
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> - * GNU General Public License for more details.
> - *
> - * You should have received a copy of the GNU General Public License
> - * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> - */
> -
> -#ifndef __ARM64_S2_PGTABLE_NOPMD_H_
> -#define __ARM64_S2_PGTABLE_NOPMD_H_
> -
> -#include <asm/stage2_pgtable-nopud.h>
> -
> -#define __S2_PGTABLE_PMD_FOLDED
> -
> -#define S2_PMD_SHIFT		S2_PUD_SHIFT
> -#define S2_PTRS_PER_PMD		1
> -#define S2_PMD_SIZE		(1UL << S2_PMD_SHIFT)
> -#define S2_PMD_MASK		(~(S2_PMD_SIZE-1))
> -
> -#define stage2_pud_none(kvm, pud)		(0)
> -#define stage2_pud_present(kvm, pud)		(1)
> -#define stage2_pud_clear(kvm, pud)		do { } while (0)
> -#define stage2_pud_populate(kvm, pud, pmd)	do { } while (0)
> -#define stage2_pmd_offset(kvm, pud, address)	((pmd_t *)(pud))
> -
> -#define stage2_pmd_free(kvm, pmd)		do { } while (0)
> -
> -#define stage2_pmd_addr_end(kvm, addr, end)	(end)
> -
> -#define stage2_pud_huge(kvm, pud)		(0)
> -#define stage2_pmd_table_empty(kvm, pmdp)	(0)
> -
> -#endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable-nopud.h b/arch/arm64/include/asm/stage2_pgtable-nopud.h
> deleted file mode 100644
> index cd6304e..0000000
> --- a/arch/arm64/include/asm/stage2_pgtable-nopud.h
> +++ /dev/null
> @@ -1,39 +0,0 @@
> -/*
> - * Copyright (C) 2016 - ARM Ltd
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> - * GNU General Public License for more details.
> - *
> - * You should have received a copy of the GNU General Public License
> - * along with this program.  If not, see <http://www.gnu.org/licenses/>.
> - */
> -
> -#ifndef __ARM64_S2_PGTABLE_NOPUD_H_
> -#define __ARM64_S2_PGTABLE_NOPUD_H_
> -
> -#define __S2_PGTABLE_PUD_FOLDED
> -
> -#define S2_PUD_SHIFT		S2_PGDIR_SHIFT
> -#define S2_PTRS_PER_PUD		1
> -#define S2_PUD_SIZE		(_AC(1, UL) << S2_PUD_SHIFT)
> -#define S2_PUD_MASK		(~(S2_PUD_SIZE-1))
> -
> -#define stage2_pgd_none(kvm, pgd)		(0)
> -#define stage2_pgd_present(kvm, pgd)		(1)
> -#define stage2_pgd_clear(kvm, pgd)		do { } while (0)
> -#define stage2_pgd_populate(kvm, pgd, pud)	do { } while (0)
> -
> -#define stage2_pud_offset(kvm, pgd, address)	((pud_t *)(pgd))
> -
> -#define stage2_pud_free(kvm, x)			do { } while (0)
> -
> -#define stage2_pud_addr_end(kvm, addr, end)	(end)
> -#define stage2_pud_table_empty(kvm, pmdp)	(0)
> -
> -#endif
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index 057a405..ffc37cc 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -19,8 +19,12 @@
>  #ifndef __ARM64_S2_PGTABLE_H_
>  #define __ARM64_S2_PGTABLE_H_
>  
> +#include <linux/hugetlb.h>
>  #include <asm/pgtable.h>
>  
> +/* The PGDIR shift for a given page table with "n" levels. */
> +#define pt_levels_pgdir_shift(n)	ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - (n))
> +
>  /*
>   * The hardware supports concatenation of up to 16 tables at stage2 entry level
>   * and we use the feature whenever possible.
> @@ -29,118 +33,209 @@
>   * On arm64, the smallest PAGE_SIZE supported is 4k, which means
>   *             (PAGE_SHIFT - 3) > 4 holds for all page sizes.
Trying to understand that comment. Why do we compare to 4?
>   * This implies, the total number of page table levels at stage2 expected
> - * by the hardware is actually the number of levels required for (KVM_PHYS_SHIFT - 4)
> + * by the hardware is actually the number of levels required for (IPA_SHIFT - 4)
although understandable, is IPA_SHIFT defined somewhere?
>   * in normal translations(e.g, stage1), since we cannot have another level in
> - * the range (KVM_PHYS_SHIFT, KVM_PHYS_SHIFT - 4).
> + * the range (IPA_SHIFT, IPA_SHIFT - 4).
I fail to understand the above comment. Could you give a pointer to the
spec?
>   */
> -#define STAGE2_PGTABLE_LEVELS		ARM64_HW_PGTABLE_LEVELS(KVM_PHYS_SHIFT - 4)
> +#define stage2_pt_levels(ipa_shift)	ARM64_HW_PGTABLE_LEVELS((ipa_shift) - 4)
>  
>  /*
> - * With all the supported VA_BITs and 40bit guest IPA, the following condition
> - * is always true:
> + * With all the supported VA_BITs and guest IPA, the following condition
> + * must be always true:
>   *
> - *       STAGE2_PGTABLE_LEVELS <= CONFIG_PGTABLE_LEVELS
> + *       stage2_pt_levels <= CONFIG_PGTABLE_LEVELS
>   *
>   * We base our stage-2 page table walker helpers on this assumption and
>   * fall back to using the host version of the helper wherever possible.
>   * i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back
>   * to using the host version, since it is guaranteed it is not folded at host.
>   *
> - * If the condition breaks in the future, we can rearrange the host level
> - * definitions and reuse them for stage2. Till then...
> + * If the condition breaks in the future, we need completely independent
> + * page table helpers. Till then...
>   */
> -#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS
> +
> +#if stage2_pt_levels(KVM_PHYS_SHIFT) > CONFIG_PGTABLE_LEVELS
>  #error "Unsupported combination of guest IPA and host VA_BITS."
>  #endif
>  
> -/* S2_PGDIR_SHIFT is the size mapped by top-level stage2 entry */
> -#define S2_PGDIR_SHIFT			ARM64_HW_PGTABLE_LEVEL_SHIFT(4 - STAGE2_PGTABLE_LEVELS)
> -#define S2_PGDIR_SIZE			(_AC(1, UL) << S2_PGDIR_SHIFT)
> -#define S2_PGDIR_MASK			(~(S2_PGDIR_SIZE - 1))
> -
>  /*
>   * The number of PTRS across all concatenated stage2 tables given by the
>   * number of bits resolved at the initial level.
>   */
> -#define PTRS_PER_S2_PGD			(1 << (KVM_PHYS_SHIFT - S2_PGDIR_SHIFT))
> +#define __s2_pgd_ptrs(pa, lvls)	(1 << ((pa) - pt_levels_pgdir_shift((lvls))))
> +#define __s2_pgd_size(pa, lvls)	(__s2_pgd_ptrs((pa), (lvls)) * sizeof(pgd_t))
> +
> +#define kvm_stage2_levels(kvm)		stage2_pt_levels(kvm_phys_shift(kvm))
> +#define stage2_pgdir_shift(kvm)	\
> +		pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
> +#define stage2_pgdir_size(kvm)		(_AC(1, UL) << stage2_pgdir_shift((kvm)))
> +#define stage2_pgdir_mask(kvm)		(~(stage2_pgdir_size((kvm)) - 1))
> +#define stage2_pgd_ptrs(kvm)	\
> +	__s2_pgd_ptrs(kvm_phys_shift(kvm), kvm_stage2_levels(kvm))
> +
> +#define stage2_pgd_size(kvm)	__s2_pgd_size(kvm_phys_shift(kvm), kvm_stage2_levels(kvm))
>  
>  /*
>   * kvm_mmmu_cache_min_pages is the number of stage2 page table translation
>   * levels in addition to the PGD.
>   */
> -#define kvm_mmu_cache_min_pages(kvm)	(STAGE2_PGTABLE_LEVELS - 1)
> +#define kvm_mmu_cache_min_pages(kvm)	(kvm_stage2_levels(kvm) - 1)
>  
>  
> -#if STAGE2_PGTABLE_LEVELS > 3
> +/* PUD/PMD definitions if present */
> +#define __S2_PUD_SHIFT			ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
> +#define __S2_PUD_SIZE			(_AC(1, UL) << __S2_PUD_SHIFT)
> +#define __S2_PUD_MASK			(~(__S2_PUD_SIZE - 1))
>  
> -#define S2_PUD_SHIFT			ARM64_HW_PGTABLE_LEVEL_SHIFT(1)
> -#define S2_PUD_SIZE			(_AC(1, UL) << S2_PUD_SHIFT)
> -#define S2_PUD_MASK			(~(S2_PUD_SIZE - 1))
> +#define __S2_PMD_SHIFT			ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
> +#define __S2_PMD_SIZE			(_AC(1, UL) << __S2_PMD_SHIFT)
> +#define __S2_PMD_MASK			(~(__S2_PMD_SIZE - 1))
Is this renaming mandatory?
>  
> -#define stage2_pgd_none(kvm, pgd)		pgd_none(pgd)
> -#define stage2_pgd_clear(kvm, pgd)		pgd_clear(pgd)
> -#define stage2_pgd_present(kvm, pgd)		pgd_present(pgd)
> -#define stage2_pgd_populate(kvm, pgd, pud)	pgd_populate(NULL, pgd, pud)
> -#define stage2_pud_offset(kvm, pgd, address)	pud_offset(pgd, address)
> -#define stage2_pud_free(kvm, pud)		pud_free(NULL, pud)
> +#define __s2_pud_index(addr) \
> +	(((addr) >> __S2_PUD_SHIFT) & (PTRS_PER_PTE - 1))
> +#define __s2_pmd_index(addr) \
> +	(((addr) >> __S2_PMD_SHIFT) & (PTRS_PER_PTE - 1))
>  
> -#define stage2_pud_table_empty(kvm, pudp)	kvm_page_empty(pudp)
> +#define __kvm_has_stage2_levels(kvm, min_levels)	\
> +  ((CONFIG_PGTABLE_LEVELS >= min_levels) && (kvm_stage2_levels(kvm) >= min_levels))
kvm_stage2_levels <= CONFIG_PGTABLE_LEVELS so you should just need to
check kvm_stage2_levels?
> +
> +#define kvm_stage2_has_pgd(kvm)	__kvm_has_stage2_levels(kvm, 4)
> +#define kvm_stage2_has_pud(kvm) __kvm_has_stage2_levels(kvm, 3)
> +
> +static inline int stage2_pgd_none(struct kvm *kvm, pgd_t pgd)
> +{
> +	return kvm_stage2_has_pgd(kvm) ? pgd_none(pgd) : 0;
> +}
> +
> +static inline void stage2_pgd_clear(struct kvm *kvm, pgd_t *pgdp)
> +{
> +	if (kvm_stage2_has_pgd(kvm))
> +		pgd_clear(pgdp);
> +}
> +
> +static inline int stage2_pgd_present(struct kvm *kvm, pgd_t pgd)
> +{
> +	return kvm_stage2_has_pgd(kvm) ? pgd_present(pgd) : 1;
> +}
> +
> +static inline void stage2_pgd_populate(struct kvm *kvm, pgd_t *pgdp, pud_t *pud)
> +{
> +	if (kvm_stage2_has_pgd(kvm))
> +		pgd_populate(NULL, pgdp, pud);
> +	else
> +		BUG();
> +}
> +
> +static inline pud_t *stage2_pud_offset(struct kvm *kvm,
> +				       pgd_t *pgd, unsigned long address)
> +{
> +	if (kvm_stage2_has_pgd(kvm)) {
> +		phys_addr_t pud_phys = pgd_page_paddr(*pgd);
> +
> +		pud_phys += __s2_pud_index(address) * sizeof(pud_t);
> +		return __va(pud_phys);
> +	}
> +	return (pud_t *)pgd;
> +}
> +
> +static inline void stage2_pud_free(struct kvm *kvm, pud_t *pud)
> +{
> +	if (kvm_stage2_has_pgd(kvm))
> +		pud_free(NULL, pud);
> +}
> +
> +static inline int stage2_pud_table_empty(struct kvm *kvm, pud_t *pudp)
> +{
> +	return kvm_stage2_has_pgd(kvm) && kvm_page_empty(pudp);
> +}
>  
>  static inline phys_addr_t
>  stage2_pud_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
> -	phys_addr_t boundary = (addr + S2_PUD_SIZE) & S2_PUD_MASK;
> +	if (kvm_stage2_has_pgd(kvm)) {
> +		phys_addr_t boundary = (addr + __S2_PUD_SIZE) & __S2_PUD_MASK;
>  
> -	return (boundary - 1 < end - 1) ? boundary : end;
> +		return (boundary - 1 < end - 1) ? boundary : end;
> +	}
> +	return end;
>  }
>  
> -#endif		/* STAGE2_PGTABLE_LEVELS > 3 */
> +static inline int stage2_pud_none(struct kvm *kvm, pud_t pud)
> +{
> +	return kvm_stage2_has_pud(kvm) ? pud_none(pud) : 0;
> +}
>  
> +static inline void stage2_pud_clear(struct kvm *kvm, pud_t *pudp)
> +{
> +	if (kvm_stage2_has_pud(kvm))
> +		pud_clear(pudp);
> +}
>  
> -#if STAGE2_PGTABLE_LEVELS > 2
> +static inline int stage2_pud_present(struct kvm *kvm, pud_t pud)
> +{
> +	return kvm_stage2_has_pud(kvm) ? pud_present(pud) : 1;
> +}
>  
> -#define S2_PMD_SHIFT			ARM64_HW_PGTABLE_LEVEL_SHIFT(2)
> -#define S2_PMD_SIZE			(_AC(1, UL) << S2_PMD_SHIFT)
> -#define S2_PMD_MASK			(~(S2_PMD_SIZE - 1))
> +static inline void stage2_pud_populate(struct kvm *kvm, pud_t *pudp, pmd_t *pmd)
> +{
> +	if (kvm_stage2_has_pud(kvm))
> +		pud_populate(NULL, pudp, pmd);
> +	else
> +		BUG();
> +}
>  
> -#define stage2_pud_none(kvm, pud)		pud_none(pud)
> -#define stage2_pud_clear(kvm, pud)		pud_clear(pud)
> -#define stage2_pud_present(kvm, pud)		pud_present(pud)
> -#define stage2_pud_populate(kvm, pud, pmd)	pud_populate(NULL, pud, pmd)
> -#define stage2_pmd_offset(kvm, pud, address)	pmd_offset(pud, address)
> -#define stage2_pmd_free(kvm, pmd)		pmd_free(NULL, pmd)
> +static inline pmd_t *stage2_pmd_offset(struct kvm *kvm,
> +					 pud_t *pud, unsigned long address)
> +{
> +	if (kvm_stage2_has_pud(kvm)) {
> +		phys_addr_t pmd_phys = pud_page_paddr(*pud);
>  
> -#define stage2_pud_huge(kvm, pud)		pud_huge(pud)
> -#define stage2_pmd_table_empty(kvm, pmdp)	kvm_page_empty(pmdp)
> +		pmd_phys += __s2_pmd_index(address) * sizeof(pmd_t);
> +		return __va(pmd_phys);
> +	}
> +	return (pmd_t *)pud;
> +}
> +
> +static inline void stage2_pmd_free(struct kvm *kvm, pmd_t *pmd)
> +{
> +	if (kvm_stage2_has_pud(kvm))
> +		pmd_free(NULL, pmd);
> +}
> +
> +static inline int stage2_pmd_table_empty(struct kvm *kvm, pmd_t *pmdp)
> +{
> +	return kvm_stage2_has_pud(kvm) && kvm_page_empty(pmdp);
> +}
>  
>  static inline phys_addr_t
>  stage2_pmd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
> -	phys_addr_t boundary = (addr + S2_PMD_SIZE) & S2_PMD_MASK;
> +	if (kvm_stage2_has_pud(kvm)) {
> +		phys_addr_t boundary = (addr + __S2_PMD_SIZE) & __S2_PMD_MASK;
>  
> -	return (boundary - 1 < end - 1) ? boundary : end;
> +		return (boundary - 1 < end - 1) ? boundary : end;
> +	}
> +	return end;
>  }
>  
> -#endif		/* STAGE2_PGTABLE_LEVELS > 2 */
> +static inline int stage2_pud_huge(struct kvm *kvm, pud_t pud)
> +{
> +	return kvm_stage2_has_pud(kvm) ? pud_huge(pud) : 0;
> +}
>  
>  #define stage2_pte_table_empty(kvm, ptep)	kvm_page_empty(ptep)
>  
> -#if STAGE2_PGTABLE_LEVELS == 2
> -#include <asm/stage2_pgtable-nopmd.h>
> -#elif STAGE2_PGTABLE_LEVELS == 3
> -#include <asm/stage2_pgtable-nopud.h>
> -#endif
> -
> -#define stage2_pgd_size(kvm)	(PTRS_PER_S2_PGD * sizeof(pgd_t))
> -
> -#define stage2_pgd_index(kvm, addr) \
> -	(((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
> +static inline unsigned long stage2_pgd_index(struct kvm *kvm, phys_addr_t addr)
> +{
> +	return (addr >> stage2_pgdir_shift(kvm)) & (stage2_pgd_ptrs(kvm) - 1);
> +}
>  
>  static inline phys_addr_t
>  stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>  {
> -	phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
> +	phys_addr_t boundary;
>  
> +	boundary = (addr + stage2_pgdir_size(kvm)) & stage2_pgdir_mask(kvm);
>  	return (boundary - 1 < end - 1) ? boundary : end;
>  }
>  
> 

Globally this patch is pretty hard to review. I don't know if it is
possible to split into 2. 1) Addition of some helper macros. 2) removal
of nopud and nopmd and implementation of the corresponding macros?

Thanks

Eric

Marc Zyngier July 2, 2018, 12:16 p.m. UTC | #15

On 29/06/18 12:15, Suzuki K Poulose wrote:
> We set VTCR_EL2 very early during the stage2 init and don't
> touch it ever. This is fine as we had a fixed IPA size. This
> patch changes the behavior to set the VTCR for a given VM,
> depending on its stage2 table. The common configuration for
> VTCR is still performed during the early init as we have to
> retain the hardware access flag update bits (VTCR_EL2_HA)
> per CPU (as they are only set for the CPUs which are capabile).

capable

> The bits defining the number of levels in the page table (SL0)
> and and the size of the Input address to the translation (T0SZ)
> are programmed for each VM upon entry to the guest.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Change since V2:
>  - Load VTCR for TLB operations
> ---
>  arch/arm64/include/asm/kvm_arm.h  | 19 +++++++++----------
>  arch/arm64/include/asm/kvm_asm.h  |  2 +-
>  arch/arm64/include/asm/kvm_host.h |  9 ++++++---
>  arch/arm64/include/asm/kvm_hyp.h  | 11 +++++++++++
>  arch/arm64/kvm/hyp/s2-setup.c     | 17 +----------------
>  5 files changed, 28 insertions(+), 30 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 11a7db0..b02c316 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -120,9 +120,7 @@
>  #define VTCR_EL2_IRGN0_WBWA	TCR_IRGN0_WBWA
>  #define VTCR_EL2_SL0_SHIFT	6
>  #define VTCR_EL2_SL0_MASK	(3 << VTCR_EL2_SL0_SHIFT)
> -#define VTCR_EL2_SL0_LVL1	(1 << VTCR_EL2_SL0_SHIFT)
>  #define VTCR_EL2_T0SZ_MASK	0x3f
> -#define VTCR_EL2_T0SZ_40B	24
>  #define VTCR_EL2_VS_SHIFT	19
>  #define VTCR_EL2_VS_8BIT	(0 << VTCR_EL2_VS_SHIFT)
>  #define VTCR_EL2_VS_16BIT	(1 << VTCR_EL2_VS_SHIFT)
> @@ -137,43 +135,44 @@
>   * VTCR_EL2.PS is extracted from ID_AA64MMFR0_EL1.PARange at boot time
>   * (see hyp-init.S).
>   *
> + * VTCR_EL2.SL0 and T0SZ are configured per VM at runtime before switching to
> + * the VM.
> + *
>   * Note that when using 4K pages, we concatenate two first level page tables
>   * together. With 16K pages, we concatenate 16 first level page tables.
>   *
>   */
>  
> -#define VTCR_EL2_T0SZ_IPA	VTCR_EL2_T0SZ_40B
>  #define VTCR_EL2_COMMON_BITS	(VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
>  				 VTCR_EL2_IRGN0_WBWA | VTCR_EL2_RES1)
> +#define VTCR_EL2_PRIVATE_MASK	(VTCR_EL2_SL0_MASK | VTCR_EL2_T0SZ_MASK)

What does "private" mean here? It really is the IPA configuration, so
I'd rather have a naming that reflects that.

>  #ifdef CONFIG_ARM64_64K_PAGES
>  /*
>   * Stage2 translation configuration:
>   * 64kB pages (TG0 = 1)
> - * 2 level page tables (SL = 1)
>   */
> -#define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN			VTCR_EL2_TG0_64K
>  #define VTCR_EL2_TGRAN_SL0_BASE		3UL
>  
>  #elif defined(CONFIG_ARM64_16K_PAGES)
>  /*
>   * Stage2 translation configuration:
>   * 16kB pages (TG0 = 2)
> - * 2 level page tables (SL = 1)
>   */
> -#define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN			VTCR_EL2_TG0_16K
>  #define VTCR_EL2_TGRAN_SL0_BASE		3UL
>  #else	/* 4K */
>  /*
>   * Stage2 translation configuration:
>   * 4kB pages (TG0 = 0)
> - * 3 level page tables (SL = 1)
>   */
> -#define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
> +#define VTCR_EL2_TGRAN			VTCR_EL2_TG0_4K
>  #define VTCR_EL2_TGRAN_SL0_BASE		2UL
>  #endif
>  
> -#define VTCR_EL2_FLAGS			(VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
> +#define VTCR_EL2_FLAGS		(VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
> +
>  /*
>   * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
>   * Interestingly, it depends on the page size.
> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
> index 102b5a5..91372eb 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -72,7 +72,7 @@ extern void __vgic_v3_init_lrs(void);
>  
>  extern u32 __kvm_get_mdcr_el2(void);
>  
> -extern u32 __init_stage2_translation(void);
> +extern void __init_stage2_translation(void);
>  
>  /* Home-grown __this_cpu_{ptr,read} variants that always work at HYP */
>  #define __hyp_this_cpu_ptr(sym)						\
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index fe8777b..328f472 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -442,10 +442,13 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>  
>  static inline void __cpu_init_stage2(void)
>  {
> -	u32 parange = kvm_call_hyp(__init_stage2_translation);
> +	u32 ps;
>  
> -	WARN_ONCE(parange < 40,
> -		  "PARange is %d bits, unsupported configuration!", parange);
> +	kvm_call_hyp(__init_stage2_translation);
> +	/* Sanity check for minimum IPA size support */
> +	ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
> +	WARN_ONCE(ps < 40,
> +		  "PARange is %d bits, unsupported configuration!", ps);
>  }
>  
>  /* Guest/host FPSIMD coordination helpers */
> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
> index 82f9994..3e8052d1 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -20,6 +20,7 @@
>  
>  #include <linux/compiler.h>
>  #include <linux/kvm_host.h>
> +#include <asm/kvm_mmu.h>
>  #include <asm/sysreg.h>
>  
>  #define __hyp_text __section(.hyp.text) notrace
> @@ -158,6 +159,16 @@ void __noreturn __hyp_do_panic(unsigned long, ...);
>  /* Must be called from hyp code running at EL2 */
>  static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
>  {
> +	/*
> +	 * Configure the VTCR translation control bits
> +	 * for this VM.
> +	 */
> +	u64 vtcr = read_sysreg(vtcr_el2);
> +
> +	vtcr &= ~VTCR_EL2_PRIVATE_MASK;
> +	vtcr |= VTCR_EL2_SL0(kvm_stage2_levels(kvm)) |
> +		VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
> +	write_sysreg(vtcr, vtcr_el2);

Can't we generate the whole vtcr value in one go, without reading it
back? Specially given that on patch 16, you're actually switching to a
per-VM variable, and it would make a lot of sense to start with that here.

>  	write_sysreg(kvm->arch.vttbr, vttbr_el2);
>  }
>  
> diff --git a/arch/arm64/kvm/hyp/s2-setup.c b/arch/arm64/kvm/hyp/s2-setup.c
> index 81094f1..6567315 100644
> --- a/arch/arm64/kvm/hyp/s2-setup.c
> +++ b/arch/arm64/kvm/hyp/s2-setup.c
> @@ -19,13 +19,11 @@
>  #include <asm/kvm_arm.h>
>  #include <asm/kvm_asm.h>
>  #include <asm/kvm_hyp.h>
> -#include <asm/cpufeature.h>
>  
> -u32 __hyp_text __init_stage2_translation(void)
> +void __hyp_text __init_stage2_translation(void)
>  {
>  	u64 val = VTCR_EL2_FLAGS;
>  	u64 parange;
> -	u32 phys_shift;
>  	u64 tmp;
>  
>  	/*
> @@ -38,17 +36,6 @@ u32 __hyp_text __init_stage2_translation(void)
>  		parange = ID_AA64MMFR0_PARANGE_MAX;
>  	val |= parange << VTCR_EL2_PS_SHIFT;
>  
> -	/* Compute the actual PARange... */
> -	phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange);
> -
> -	/*
> -	 * ... and clamp it to 40 bits, unless we have some braindead
> -	 * HW that implements less than that. In all cases, we'll
> -	 * return that value for the rest of the kernel to decide what
> -	 * to do.
> -	 */
> -	val |= VTCR_EL2_T0SZ(phys_shift > 40 ? 40 : phys_shift);
> -
>  	/*
>  	 * Check the availability of Hardware Access Flag / Dirty Bit
>  	 * Management in ID_AA64MMFR1_EL1 and enable the feature in VTCR_EL2.
> @@ -67,6 +54,4 @@ u32 __hyp_text __init_stage2_translation(void)
>  			VTCR_EL2_VS_8BIT;
>  
>  	write_sysreg(val, vtcr_el2);

And then most of the code here could run on a per-VM basis.

> -
> -	return phys_shift;
>  }
> 

Thanks,

	M.

Suzuki K Poulose July 2, 2018, 1:24 p.m. UTC | #16

Hi Eric,


On 02/07/18 13:14, Auger Eric wrote:
> Hi Suzuki,
> 
> On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
>> So far we had a static stage2 page table handling code, based on a
>> fixed IPA of 40bits. As we prepare for a configurable IPA size per
>> VM, make our stage2 page table code dynamic, to do the right thing
>> for a given VM. We ensure the existing condition is always true even
>> when we lift the limit on the IPA. i.e,
>>
>> 	page table levels in stage1 >= page table levels in stage2
>>
>> Support for the IPA size configuration needs other changes in the way
>> we configure the EL2 registers (VTTBR and VTCR). So, the IPA is still
>> fixed to 40bits. The patch also moves the kvm_page_empty() in asm/kvm_mmu.h
>> to the top, before including the asm/stage2_pgtable.h to avoid a forward
>> declaration.
>>
>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>> Cc: Christoffer Dall <cdall@kernel.org>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>> Changes since V2
>>   - Restrict the stage2 page table to allow reusing the host page table
>>     helpers for now, until we get stage1 independent page table helpers.
> I would move this up in the commit msg to motivate the fact we enforce
> the able condition.

This is mentioned in the commit message for the patch which lifts the limitation
on the IPA. This patch only deals with the dynamic page table level handling,
with the restriction on the levels. Nevertheless, I could add it to the
description.

>> ---
>>   arch/arm64/include/asm/kvm_mmu.h              |  14 +-
>>   arch/arm64/include/asm/stage2_pgtable-nopmd.h |  42 ------
>>   arch/arm64/include/asm/stage2_pgtable-nopud.h |  39 -----
>>   arch/arm64/include/asm/stage2_pgtable.h       | 207 +++++++++++++++++++-------
>>   4 files changed, 159 insertions(+), 143 deletions(-)
>>   delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
>>   delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h
> 
> with my very limited knowledge of S2 page table walkers I fail to
> understand why we now can get rid of stage2_pgtable-nopmd.h and
> stage2_pgtable-nopud.h and associated FOLDED config. Please could you
> explain it in the commit message?

As mentioned above, we have static page table helpers, which are decided
at compile time (just like the stage1). So these files hold the definitions
for the cases where PUD/PMD is folded and included for a given stage1 VA.
But since we are now doing this check per VM, we make the decision
by checking the kvm_stage2_levels(), instead of hard coding it.

Does that help ? A short version of that is already there. May be I could
elaborate that a bit.

>> -
>> -#define stage2_pgd_index(kvm, addr) \
>> -	(((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
>> +static inline unsigned long stage2_pgd_index(struct kvm *kvm, phys_addr_t addr)
>> +{
>> +	return (addr >> stage2_pgdir_shift(kvm)) & (stage2_pgd_ptrs(kvm) - 1);
>> +}
>>   
>>   static inline phys_addr_t
>>   stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t end)
>>   {
>> -	phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
>> +	phys_addr_t boundary;
>>   
>> +	boundary = (addr + stage2_pgdir_size(kvm)) & stage2_pgdir_mask(kvm);
>>   	return (boundary - 1 < end - 1) ? boundary : end;
>>   }
>>   
>>
> 
> Globally this patch is pretty hard to review. I don't know if it is
> possible to split into 2. 1) Addition of some helper macros. 2) removal
> of nopud and nopmd and implementation of the corresponding macros?

I acknowledge that. The patch redefines the "existing" macros to make the
decision at runtime based on the VM's setting. I will see if there is a
better way to do it.

Cheers
Suzuki

Marc Zyngier July 2, 2018, 1:32 p.m. UTC | #17

On 29/06/18 12:15, Suzuki K Poulose wrote:
> Now that we can manage the stage2 page table per VM, switch the
> configuration details to per VM instance. We keep track of the
> IPA bits, number of page table levels and the VTCR bits (which
> depends on the IPA and the number of levels). While at it, remove
> unused pgd_lock field from kvm_arch for arm64.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  arch/arm64/include/asm/kvm_host.h       | 14 ++++++++++++--
>  arch/arm64/include/asm/kvm_hyp.h        |  3 +--
>  arch/arm64/include/asm/kvm_mmu.h        | 20 ++++++++++++++++++--
>  arch/arm64/include/asm/stage2_pgtable.h |  1 -
>  virt/kvm/arm/mmu.c                      |  4 ++++
>  5 files changed, 35 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 328f472..9a15860 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -61,13 +61,23 @@ struct kvm_arch {
>  	u64    vmid_gen;
>  	u32    vmid;
>  
> -	/* 1-level 2nd stage table and lock */
> -	spinlock_t pgd_lock;
> +	/* stage-2 page table */
>  	pgd_t *pgd;
>  
>  	/* VTTBR value associated with above pgd and vmid */
>  	u64    vttbr;
>  
> +	/* Private bits of VTCR_EL2 for this VM */
> +	u64    vtcr_private;

As I said in another email, this should become a full VTCR_EL2 copy.

> +	/* Size of the PA size for this guest */
> +	u8     phys_shift;
> +	/*
> +	 * Number of levels in page table. We could always calculate
> +	 * it from phys_shift above. We cache it for faster switches
> +	 * in stage2 page table helpers.
> +	 */
> +	u8     s2_levels;

And these two fields feel like they should be derived from the VTCR
itself, instead of being there on their own. Any chance you could look
into this?

> +
>  	/* The last vcpu id that ran on each physical CPU */
>  	int __percpu *last_vcpu_ran;
>  
> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
> index 3e8052d1..699f678 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -166,8 +166,7 @@ static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
>  	u64 vtcr = read_sysreg(vtcr_el2);
>  
>  	vtcr &= ~VTCR_EL2_PRIVATE_MASK;
> -	vtcr |= VTCR_EL2_SL0(kvm_stage2_levels(kvm)) |
> -		VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
> +	vtcr |= kvm->arch.vtcr_private;
>  	write_sysreg(vtcr, vtcr_el2);
>  	write_sysreg(kvm->arch.vttbr, vttbr_el2);
>  }
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index f3fb05a3..a291cdc 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -143,9 +143,10 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
>   */
>  #define KVM_PHYS_SHIFT	(40)
>  
> -#define kvm_phys_shift(kvm)		KVM_PHYS_SHIFT
> +#define kvm_phys_shift(kvm)		(kvm->arch.phys_shift)
>  #define kvm_phys_size(kvm)		(_AC(1, ULL) << kvm_phys_shift(kvm))
>  #define kvm_phys_mask(kvm)		(kvm_phys_size(kvm) - _AC(1, ULL))
> +#define kvm_stage2_levels(kvm)		(kvm->arch.s2_levels)
>  
>  static inline bool kvm_page_empty(void *ptr)
>  {
> @@ -528,6 +529,18 @@ static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
>  
>  static inline void *stage2_alloc_pgd(struct kvm *kvm)
>  {
> +	u32 ipa, lvls;
> +
> +	/*
> +	 * Stage2 page table can support concatenation of (upto 16) tables
> +	 * at the entry level, thereby reducing the number of levels.
> +	 */
> +	ipa = kvm_phys_shift(kvm);
> +	lvls = stage2_pt_levels(ipa);
> +
> +	kvm->arch.s2_levels = lvls;
> +	kvm->arch.vtcr_private = VTCR_EL2_SL0(lvls) | TCR_T0SZ(ipa);
> +
>  	return alloc_pages_exact(stage2_pgd_size(kvm),
>  				 GFP_KERNEL | __GFP_ZERO);
>  }
> @@ -537,7 +550,10 @@ static inline u32 kvm_get_ipa_limit(void)
>  	return KVM_PHYS_SHIFT;
>  }
>  
> -static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift) {}
> +static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift)
> +{
> +	kvm->arch.phys_shift = ipa_shift;
> +}
>  
>  #endif /* __ASSEMBLY__ */
>  #endif /* __ARM64_KVM_MMU_H__ */
> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
> index ffc37cc..91d7936 100644
> --- a/arch/arm64/include/asm/stage2_pgtable.h
> +++ b/arch/arm64/include/asm/stage2_pgtable.h
> @@ -65,7 +65,6 @@
>  #define __s2_pgd_ptrs(pa, lvls)	(1 << ((pa) - pt_levels_pgdir_shift((lvls))))
>  #define __s2_pgd_size(pa, lvls)	(__s2_pgd_ptrs((pa), (lvls)) * sizeof(pgd_t))
>  
> -#define kvm_stage2_levels(kvm)		stage2_pt_levels(kvm_phys_shift(kvm))
>  #define stage2_pgdir_shift(kvm)	\
>  		pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
>  #define stage2_pgdir_size(kvm)		(_AC(1, UL) << stage2_pgdir_shift((kvm)))
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index a339e00..d7822e1 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -867,6 +867,10 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
>  		return -EINVAL;
>  	}
>  
> +	/* Make sure we have the stage2 configured for this VM */
> +	if (WARN_ON(!kvm_phys_shift(kvm)))

Can this be triggered from userspace?

> +		return -EINVAL;
> +
>  	/* Allocate the HW PGD, making sure that each page gets its own refcount */
>  	pgd = stage2_alloc_pgd(kvm);
>  	if (!pgd)
> 

Thanks,

	M.

Marc Zyngier July 2, 2018, 1:43 p.m. UTC | #18

On 29/06/18 12:15, Suzuki K Poulose wrote:
> Add support for handling the 52bit IPA. 52bit IPA
> support needs changes to the following :
> 
>  1) Page-table entries - We use kernel page table helpers for setting
>  up the stage2. Hence we don't explicit changes here
> 
>  2) VTTBR:BADDR - This is already supported with :
>    commit 529c4b05a3cb2f324aa ("arm64: handle 52-bit addresses in TTBR")
> 
>  3) VGIC support for 52bit: Supported with a patch in this series.
> 
> That leaves us with the handling for PAR and HPAR. This patch adds

HPFAR?

> support for handling the 52bit addresses in PAR and HPFAR,
> which are used while handling the permission faults in stage1.

Overall, this is a pretty confusing commit message. Can you just call it:

KVM/arm64: Add 52bit support for PAR to HPFAR conversion

and just describe that it now uses PHYS_MASK_SHIFT instead of a
hardcoded constant?

> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Kristina Martsenko <kristina.martsenko@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  arch/arm64/include/asm/kvm_arm.h | 7 +++++++
>  arch/arm64/kvm/hyp/switch.c      | 2 +-
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 2e90942..cb6a2ee 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -301,6 +301,13 @@
>  
>  /* Hyp Prefetch Fault Address Register (HPFAR/HDFAR) */
>  #define HPFAR_MASK	(~UL(0xf))
> +/*
> + * We have
> + *	PAR	[PA_Shift - 1	: 12] = PA	[PA_Shift - 1 : 12]
> + *	HPFAR	[PA_Shift - 9	: 4]  = FIPA	[PA_Shift - 1 : 12]
> + */
> +#define PAR_TO_HPFAR(par)		\
> +	(((par) & GENMASK_ULL(PHYS_MASK_SHIFT - 1, 12)) >> 8)
>  
>  #define kvm_arm_exception_type	\
>  	{0, "IRQ" }, 		\
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index 355fb25..fb66320 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -260,7 +260,7 @@ static bool __hyp_text __translate_far_to_hpfar(u64 far, u64 *hpfar)
>  		return false; /* Translation failed, back to guest */
>  
>  	/* Convert PAR to HPFAR format */
> -	*hpfar = ((tmp >> 12) & ((1UL << 36) - 1)) << 4;
> +	*hpfar = PAR_TO_HPFAR(tmp);
>  	return true;
>  }
>  
> 

Otherwise:

Acked-by: Marc Zyngier <marc.zyngier@arm.com>

	M.

Marc Zyngier July 2, 2018, 1:50 p.m. UTC | #19

On 29/06/18 12:15, Suzuki K Poulose wrote:
> So far we have restricted the IPA size of the VM to the default
> value (40bits). Now that we can manage the IPA size per VM and
> support dynamic stage2 page tables, allow VMs to have larger IPA.
> This is done by setting the IPA limit to the one supported by
> the hardware and kernel. This patch also moves the check for
> the default IPA size support to kvm_get_ipa_limit().
> 
> Since the stage2 page table code is dependent on the stage1
> page table, we always ensure that :
> 
>   Number of Levels at Stage1 >= Number of Levels at Stage2
> 
> So we limit the IPA to make sure that the above condition
> is satisfied. This will affect the following combinations
> of VA_BITS and IPA for different page sizes.
> 
>   39bit VA, 4K  - IPA > 43 (Upto 48)
>   36bit VA, 16K - IPA > 40 (Upto 48)
>   42bit VA, 64K - IPA > 46 (Upto 52)

I'm not sure I get it. Are these the IPA sizes that we forbid based on
the host VA size and page size configuration? If so, can you rewrite
this as:

   host configuration | unsupported IPA range
   39bit VA, 4k       | [44, 48]
   36bit VA, 16K      | [41, 48]
   42bit VA, 64k      | [47, 52]

and say that all the other combinations are supported?

> 
> Supporting the above combinations need independent stage2
> page table manipulation code, which would need substantial
> changes. We could purse the solution independently and
> switch the page table code once we have it ready.
> 
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since V2:
>  - Restrict the IPA size to limit the number of page table
>    levels in stage2 to that of stage1 or less.
> ---
>  arch/arm64/include/asm/kvm_host.h |  6 ------
>  arch/arm64/include/asm/kvm_mmu.h  | 37 ++++++++++++++++++++++++++++++++++++-
>  2 files changed, 36 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 9a15860..e858e49 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -452,13 +452,7 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>  
>  static inline void __cpu_init_stage2(void)
>  {
> -	u32 ps;
> -
>  	kvm_call_hyp(__init_stage2_translation);
> -	/* Sanity check for minimum IPA size support */
> -	ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
> -	WARN_ONCE(ps < 40,
> -		  "PARange is %d bits, unsupported configuration!", ps);
>  }
>  
>  /* Guest/host FPSIMD coordination helpers */
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index a291cdc..d38f395 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -547,7 +547,42 @@ static inline void *stage2_alloc_pgd(struct kvm *kvm)
>  
>  static inline u32 kvm_get_ipa_limit(void)
>  {
> -	return KVM_PHYS_SHIFT;
> +	unsigned int ipa_max, va_max, parange;
> +
> +	parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 0x7;
> +	ipa_max = id_aa64mmfr0_parange_to_phys_shift(parange);
> +
> +	/* Raise the limit to the default size for backward compatibility */
> +	if (ipa_max < KVM_PHYS_SHIFT) {
> +		WARN_ONCE(1,
> +			  "PARange is %d bits, unsupported configuration!",
> +			  ipa_max);
> +		ipa_max = KVM_PHYS_SHIFT;
> +	}
> +
> +	/* Clamp it to the PA size supported by the kernel */
> +	ipa_max = (ipa_max > PHYS_MASK_SHIFT) ? PHYS_MASK_SHIFT : ipa_max;
> +	/*
> +	 * Since our stage2 table is dependent on the stage1 page table code,
> +	 * we must always honor the following condition:
> +	 *
> +	 *  Number of levels in Stage1 >= Number of levels in Stage2.
> +	 *
> +	 * So clamp the ipa limit further down to limit the number of levels.
> +	 * Since we can concatenate upto 16 tables at entry level, we could
> +	 * go upto 4bits above the maximum VA addressible with the current
> +	 * number of levels.
> +	 */
> +	va_max = PGDIR_SHIFT + PAGE_SHIFT - 3;
> +	va_max += 4;
> +
> +	if (va_max < ipa_max) {
> +		kvm_info("Limiting IPA limit to %dbytes due to host VA bits limitation\n",
> +			 va_max);
> +		ipa_max = va_max;
> +	}
> +
> +	return ipa_max;
>  }
>  
>  static inline void kvm_config_stage2(struct kvm *kvm, u32 ipa_shift)
> 

Otherwise looks good.

	M.

Suzuki K Poulose July 2, 2018, 1:53 p.m. UTC | #20

Hi Marc,

On 02/07/18 14:32, Marc Zyngier wrote:
> On 29/06/18 12:15, Suzuki K Poulose wrote:
>> Now that we can manage the stage2 page table per VM, switch the
>> configuration details to per VM instance. We keep track of the
>> IPA bits, number of page table levels and the VTCR bits (which
>> depends on the IPA and the number of levels). While at it, remove
>> unused pgd_lock field from kvm_arch for arm64.
>>
>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>> Cc: Christoffer Dall <cdall@kernel.org>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>


>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 328f472..9a15860 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -61,13 +61,23 @@ struct kvm_arch {
>>   	u64    vmid_gen;
>>   	u32    vmid;
>>   
>> -	/* 1-level 2nd stage table and lock */
>> -	spinlock_t pgd_lock;
>> +	/* stage-2 page table */
>>   	pgd_t *pgd;
>>   
>>   	/* VTTBR value associated with above pgd and vmid */
>>   	u64    vttbr;
>>   
>> +	/* Private bits of VTCR_EL2 for this VM */
>> +	u64    vtcr_private;
> 
> As I said in another email, this should become a full VTCR_EL2 copy.
> 

OK

>> +	/* Size of the PA size for this guest */
>> +	u8     phys_shift;
>> +	/*
>> +	 * Number of levels in page table. We could always calculate
>> +	 * it from phys_shift above. We cache it for faster switches
>> +	 * in stage2 page table helpers.
>> +	 */
>> +	u8     s2_levels;
> 
> And these two fields feel like they should be derived from the VTCR
> itself, instead of being there on their own. Any chance you could look
> into this?

Yes, the VTCR is computed from the above two values and we could compute
them back from the VTCR. I will give it a try.

>> diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h
>> index ffc37cc..91d7936 100644
>> --- a/arch/arm64/include/asm/stage2_pgtable.h
>> +++ b/arch/arm64/include/asm/stage2_pgtable.h
>> @@ -65,7 +65,6 @@
>>   #define __s2_pgd_ptrs(pa, lvls)	(1 << ((pa) - pt_levels_pgdir_shift((lvls))))
>>   #define __s2_pgd_size(pa, lvls)	(__s2_pgd_ptrs((pa), (lvls)) * sizeof(pgd_t))
>>   
>> -#define kvm_stage2_levels(kvm)		stage2_pt_levels(kvm_phys_shift(kvm))
>>   #define stage2_pgdir_shift(kvm)	\
>>   		pt_levels_pgdir_shift(kvm_stage2_levels(kvm))
>>   #define stage2_pgdir_size(kvm)		(_AC(1, UL) << stage2_pgdir_shift((kvm)))
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index a339e00..d7822e1 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -867,6 +867,10 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
>>   		return -EINVAL;
>>   	}
>>   
>> +	/* Make sure we have the stage2 configured for this VM */
>> +	if (WARN_ON(!kvm_phys_shift(kvm)))
> 
> Can this be triggered from userspace?

No. As we initialise the phys shift before we get here. If type is left
blank (i.e, 0), we default to 40bits. So there should be something there.
The check is to make sure we have indeed past the configuration step.

>> +		return -EINVAL;
>> +
>>   	/* Allocate the HW PGD, making sure that each page gets its own refcount */
>>   	pgd = stage2_alloc_pgd(kvm);
>>   	if (!pgd)
>>
> 

Cheers
Suzuki

Suzuki K Poulose July 2, 2018, 1:54 p.m. UTC | #21

On 02/07/18 14:50, Marc Zyngier wrote:
> On 29/06/18 12:15, Suzuki K Poulose wrote:
>> So far we have restricted the IPA size of the VM to the default
>> value (40bits). Now that we can manage the IPA size per VM and
>> support dynamic stage2 page tables, allow VMs to have larger IPA.
>> This is done by setting the IPA limit to the one supported by
>> the hardware and kernel. This patch also moves the check for
>> the default IPA size support to kvm_get_ipa_limit().
>>
>> Since the stage2 page table code is dependent on the stage1
>> page table, we always ensure that :
>>
>>    Number of Levels at Stage1 >= Number of Levels at Stage2
>>
>> So we limit the IPA to make sure that the above condition
>> is satisfied. This will affect the following combinations
>> of VA_BITS and IPA for different page sizes.
>>
>>    39bit VA, 4K  - IPA > 43 (Upto 48)
>>    36bit VA, 16K - IPA > 40 (Upto 48)
>>    42bit VA, 64K - IPA > 46 (Upto 52)
> 
> I'm not sure I get it. Are these the IPA sizes that we forbid based on
> the host VA size and page size configuration?

Yes, thats right.

> If so, can you rewrite
> this as:
> 
>     host configuration | unsupported IPA range
>     39bit VA, 4k       | [44, 48]
>     36bit VA, 16K      | [41, 48]
>     42bit VA, 64k      | [47, 52]
> 
> and say that all the other combinations are supported?

Sure, that looks much better. Thanks

Suzuki

Eric Auger July 2, 2018, 2:41 p.m. UTC | #22

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> On arm64 VTTBR_EL2:BADDR holds the base address for the stage2
> translation table. The Arm ARM mandates that the bits BADDR[x-1:0]
> should be 0, where 'x' is defined for a given IPA Size and the
> number of levels for a translation granule size. It is defined
> using some magical constants. This patch is a reverse engineered
> implementation to calculate the 'x' at runtime for a given ipa and
> number of page table levels. See patch for more details.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since V2:
>  - Part 1 of spilt from VTCR & VTTBR dynamic configuration
> ---
>  arch/arm64/include/asm/kvm_arm.h | 60 +++++++++++++++++++++++++++++++++++++---
>  arch/arm64/include/asm/kvm_mmu.h | 25 ++++++++++++++++-
>  2 files changed, 80 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index 3dffd38..c557f45 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -140,8 +140,6 @@
>   * Note that when using 4K pages, we concatenate two first level page tables
>   * together. With 16K pages, we concatenate 16 first level page tables.
>   *
> - * The magic numbers used for VTTBR_X in this patch can be found in Tables
> - * D4-23 and D4-25 in ARM DDI 0487A.b.
Isn't it a pretty old reference? Could you refer to C.a?

>   */
>  
>  #define VTCR_EL2_T0SZ_IPA	VTCR_EL2_T0SZ_40B
> @@ -175,9 +173,63 @@
>  #endif
>  
>  #define VTCR_EL2_FLAGS			(VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
> -#define VTTBR_X				(VTTBR_X_TGRAN_MAGIC - VTCR_EL2_T0SZ_IPA)
> +/*
> + * ARM VMSAv8-64 defines an algorithm for finding the translation table
> + * descriptors in section D4.2.8 in ARM DDI 0487B.b.
another one ;-)
> + *
> + * The algorithm defines the expectations on the BaseAddress (for the page
> + * table) bits resolved at each level based on the page size, entry level
> + * and T0SZ. The variable "x" in the algorithm also affects the VTTBR:BADDR
> + * for stage2 page table.
> + *
> + * The value of "x" is calculated as :
> + *	x = Magic_N - T0SZ
> + *
> + * where Magic_N is an integer depending on the page size and the entry
> + * level of the page table as below:
> + *
> + *	--------------------------------------------
> + *	| Entry level		|  4K    16K   64K |
> + *	--------------------------------------------
> + *	| Level: 0 (4 levels)	| 28   |  -  |  -  |
> + *	--------------------------------------------
> + *	| Level: 1 (3 levels)	| 37   | 31  | 25  |
> + *	--------------------------------------------
> + *	| Level: 2 (2 levels)	| 46   | 42  | 38  |
> + *	--------------------------------------------
> + *	| Level: 3 (1 level)	| -    | 53  | 51  |
> + *	--------------------------------------------
I understand entry level = Lookup level in the table.
But you may want to compute x for BaseAddress matching lookup level 2
with number of levels = 4.
So shouldn't you s/Number of levels/4 - entry_level?
for BADDR we want the BaseAddr of the initial lookup level so
effectively the entry level we are interested in is 4 - number of levels
and we don't care or d) condition. At least this is my understanding ;-)
If correct you may slightly reword the explanation?
> + *
> + * We have a magic formula for the Magic_N below.
> + *
> + *  Magic_N(PAGE_SIZE, Entry_Level) = 64 - ((PAGE_SHIFT - 3) * Number of levels)
> + *
> + * where number of levels = (4 - Entry_Level).
> + *
> + * So, given that T0SZ = (64 - PA_SHIFT), we can compute 'x' as follows:
Isn't it IPA_SHIFT instead?
> + *
> + *	x = (64 - ((PAGE_SHIFT - 3) * Number_of_levels)) - (64 - PA_SHIFT)
> + *	  = PA_SHIFT - ((PAGE_SHIFT - 3) * Number of levels)
> + *
> + * Here is one way to explain the Magic Formula:
> + *
> + *  x = log2(Size_of_Entry_Level_Table)
> + *
> + * Since, we can resolve (PAGE_SHIFT - 3) bits at each level, and another
> + * PAGE_SHIFT bits in the PTE, we have :
> + *
> + *  Bits_Entry_level = PA_SHIFT - ((PAGE_SHIFT - 3) * (n - 1) + PAGE_SHIFT)
> + *		     = PA_SHIFT - (PAGE_SHIFT - 3) * n - 3
> + *  where n = number of levels, and since each pointer is 8bytes, we have:
> + *
> + *  x = Bits_Entry_Level + 3
> + *    = PA_SHIFT - (PAGE_SHIFT - 3) * n
> + *
> + * The only constraint here is that, we have to find the number of page table
> + * levels for a given IPA size (which we do, see stage2_pt_levels())
> + */
> +#define ARM64_VTTBR_X(ipa, levels)	((ipa) - ((levels) * (PAGE_SHIFT - 3)))
>  
> -#define VTTBR_BADDR_MASK  (((UL(1) << (PHYS_MASK_SHIFT - VTTBR_X)) - 1) << VTTBR_X)
>  #define VTTBR_VMID_SHIFT  (UL(48))
>  #define VTTBR_VMID_MASK(size) (_AT(u64, (1 << size) - 1) << VTTBR_VMID_SHIFT)
>  
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index a351722..813a72a 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -146,7 +146,6 @@ static inline unsigned long __kern_hyp_va(unsigned long v)
>  #define kvm_phys_shift(kvm)		KVM_PHYS_SHIFT
>  #define kvm_phys_size(kvm)		(_AC(1, ULL) << kvm_phys_shift(kvm))
>  #define kvm_phys_mask(kvm)		(kvm_phys_size(kvm) - _AC(1, ULL))
> -#define kvm_vttbr_baddr_mask(kvm)	VTTBR_BADDR_MASK
>  
>  static inline bool kvm_page_empty(void *ptr)
>  {
> @@ -503,6 +502,30 @@ static inline int hyp_map_aux_data(void)
>  
>  #define kvm_phys_to_vttbr(addr)		phys_to_ttbr(addr)
>  
> +/*
> + * Get the magic number 'x' for VTTBR:BADDR of this KVM instance.
> + * With v8.2 LVA extensions, 'x' should be a minimum of 6 with
> + * 52bit IPS.
Link to the spec?
> + */
> +static inline int arm64_vttbr_x(u32 ipa_shift, u32 levels)
> +{
> +	int x = ARM64_VTTBR_X(ipa_shift, levels);
> +
> +	return (IS_ENABLED(CONFIG_ARM64_PA_BITS_52) && x < 6) ? 6 : x;
> +}
> +
> +static inline u64 vttbr_baddr_mask(u32 ipa_shift, u32 levels)
> +{
> +	unsigned int x = arm64_vttbr_x(ipa_shift, levels);
> +
> +	return GENMASK_ULL(PHYS_MASK_SHIFT - 1, x);
> +}
> +
> +static inline u64 kvm_vttbr_baddr_mask(struct kvm *kvm)
> +{
> +	return vttbr_baddr_mask(kvm_phys_shift(kvm), kvm_stage2_levels(kvm));
> +}
> +
>  static inline void *stage2_alloc_pgd(struct kvm *kvm)
>  {
>  	return alloc_pages_exact(stage2_pgd_size(kvm),
> 

Thanks

Eric

Eric Auger July 2, 2018, 2:46 p.m. UTC | #23

Hi Suzuki,

On 07/02/2018 03:24 PM, Suzuki K Poulose wrote:
> Hi Eric,
> 
> 
> On 02/07/18 13:14, Auger Eric wrote:
>> Hi Suzuki,
>>
>> On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
>>> So far we had a static stage2 page table handling code, based on a
>>> fixed IPA of 40bits. As we prepare for a configurable IPA size per
>>> VM, make our stage2 page table code dynamic, to do the right thing
>>> for a given VM. We ensure the existing condition is always true even
>>> when we lift the limit on the IPA. i.e,
>>>
>>>     page table levels in stage1 >= page table levels in stage2
>>>
>>> Support for the IPA size configuration needs other changes in the way
>>> we configure the EL2 registers (VTTBR and VTCR). So, the IPA is still
>>> fixed to 40bits. The patch also moves the kvm_page_empty() in
>>> asm/kvm_mmu.h
>>> to the top, before including the asm/stage2_pgtable.h to avoid a forward
>>> declaration.
>>>
>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>>> Cc: Christoffer Dall <cdall@kernel.org>
>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>> ---
>>> Changes since V2
>>>   - Restrict the stage2 page table to allow reusing the host page table
>>>     helpers for now, until we get stage1 independent page table helpers.
>> I would move this up in the commit msg to motivate the fact we enforce
>> the able condition.
> 
> This is mentioned in the commit message for the patch which lifts the
> limitation
> on the IPA. This patch only deals with the dynamic page table level
> handling,
> with the restriction on the levels. Nevertheless, I could add it to the
> description.
> 
>>> ---
>>>   arch/arm64/include/asm/kvm_mmu.h              |  14 +-
>>>   arch/arm64/include/asm/stage2_pgtable-nopmd.h |  42 ------
>>>   arch/arm64/include/asm/stage2_pgtable-nopud.h |  39 -----
>>>   arch/arm64/include/asm/stage2_pgtable.h       | 207
>>> +++++++++++++++++++-------
>>>   4 files changed, 159 insertions(+), 143 deletions(-)
>>>   delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopmd.h
>>>   delete mode 100644 arch/arm64/include/asm/stage2_pgtable-nopud.h
>>
>> with my very limited knowledge of S2 page table walkers I fail to
>> understand why we now can get rid of stage2_pgtable-nopmd.h and
>> stage2_pgtable-nopud.h and associated FOLDED config. Please could you
>> explain it in the commit message?
> 
> As mentioned above, we have static page table helpers, which are decided
> at compile time (just like the stage1). So these files hold the definitions
> for the cases where PUD/PMD is folded and included for a given stage1 VA.
> But since we are now doing this check per VM, we make the decision
> by checking the kvm_stage2_levels(), instead of hard coding it.
> 
> Does that help ? A short version of that is already there. May be I could
> elaborate that a bit.

not totally to be honest. But that's not your fault. I need to spend
more time studying the code to get what the FOLDED case does ;-)

Thanks

Eric
> 
>>> -
>>> -#define stage2_pgd_index(kvm, addr) \
>>> -    (((addr) >> S2_PGDIR_SHIFT) & (PTRS_PER_S2_PGD - 1))
>>> +static inline unsigned long stage2_pgd_index(struct kvm *kvm,
>>> phys_addr_t addr)
>>> +{
>>> +    return (addr >> stage2_pgdir_shift(kvm)) & (stage2_pgd_ptrs(kvm)
>>> - 1);
>>> +}
>>>     static inline phys_addr_t
>>>   stage2_pgd_addr_end(struct kvm *kvm, phys_addr_t addr, phys_addr_t
>>> end)
>>>   {
>>> -    phys_addr_t boundary = (addr + S2_PGDIR_SIZE) & S2_PGDIR_MASK;
>>> +    phys_addr_t boundary;
>>>   +    boundary = (addr + stage2_pgdir_size(kvm)) &
>>> stage2_pgdir_mask(kvm);
>>>       return (boundary - 1 < end - 1) ? boundary : end;
>>>   }
>>>  
>>
>> Globally this patch is pretty hard to review. I don't know if it is
>> possible to split into 2. 1) Addition of some helper macros. 2) removal
>> of nopud and nopmd and implementation of the corresponding macros?
> 
> I acknowledge that. The patch redefines the "existing" macros to make the
> decision at runtime based on the VM's setting. I will see if there is a
> better way to do it.
> 
> Cheers
> Suzuki
>

Eric Auger July 2, 2018, 2:59 p.m. UTC | #24

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> VTCR_EL2 holds the following key stage2 translation table
> parameters:
>   SL0  - Entry level in the page table lookup.
>   T0SZ - Denotes the size of the memory addressed by the table.
> 
> We have been using fixed values for the SL0 depending on the
> page size as we have a fixed IPA size. But since we are about
> to make it dynamic, we need to calculate the SL0 at runtime
> per VM. This patch adds a helper to comput the value of SL0 for
compute
> a given IPA.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@arm.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since v2:
>  - Part 2 of split from VTCR & VTTBR dynamic configuration
> ---
>  arch/arm64/include/asm/kvm_arm.h | 35 ++++++++++++++++++++++++++++++++---
>  1 file changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
> index c557f45..11a7db0 100644
> --- a/arch/arm64/include/asm/kvm_arm.h
> +++ b/arch/arm64/include/asm/kvm_arm.h
> @@ -153,7 +153,8 @@
>   * 2 level page tables (SL = 1)
>   */
>  #define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
> -#define VTTBR_X_TGRAN_MAGIC		38
> +#define VTCR_EL2_TGRAN_SL0_BASE		3UL
> +
>  #elif defined(CONFIG_ARM64_16K_PAGES)
>  /*
>   * Stage2 translation configuration:
> @@ -161,7 +162,7 @@
>   * 2 level page tables (SL = 1)
>   */
>  #define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
> -#define VTTBR_X_TGRAN_MAGIC		42
> +#define VTCR_EL2_TGRAN_SL0_BASE		3UL
>  #else	/* 4K */
>  /*
>   * Stage2 translation configuration:
> @@ -169,11 +170,39 @@
>   * 3 level page tables (SL = 1)
>   */
>  #define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
> -#define VTTBR_X_TGRAN_MAGIC		37
> +#define VTCR_EL2_TGRAN_SL0_BASE		2UL
>  #endif
>  
>  #define VTCR_EL2_FLAGS			(VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
>  /*
> + * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
> + * Interestingly, it depends on the page size.
> + * See D.10.2.110, VTCR_EL2, in ARM DDI 0487B.b
update ref to the last one?
> + *
> + *	-----------------------------------------
> + *	| Entry level		|  4K  | 16K/64K |
> + *	------------------------------------------
> + *	| Level: 0		|  2   |   -     |
> + *	------------------------------------------
> + *	| Level: 1		|  1   |   2     |
> + *	------------------------------------------
> + *	| Level: 2		|  0   |   1     |
> + *	------------------------------------------
> + *	| Level: 3		|  -   |   0     |
> + *	------------------------------------------
> + *
> + * That table roughly translates to :
> + *
> + *	SL0(PAGE_SIZE, Entry_level) = SL0_BASE(PAGE_SIZE) - Entry_Level
> + *
> + * Where SL0_BASE(4K) = 2 and SL0_BASE(16K) = 3, SL0_BASE(64K) = 3, provided
> + * we take care of ruling out the unsupported cases and
> + * Entry_Level = 4 - Number_of_levels.
> + *
> + */
> +#define VTCR_EL2_SL0(levels) \
> +	((VTCR_EL2_TGRAN_SL0_BASE - (4 - (levels))) << VTCR_EL2_SL0_SHIFT)
> +/*
>   * ARM VMSAv8-64 defines an algorithm for finding the translation table
>   * descriptors in section D4.2.8 in ARM DDI 0487B.b.
>   *
> 
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

Eric Auger July 2, 2018, 3:01 p.m. UTC | #25

Hi Suzuki

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> Abstract the allocation of stage2 entry level tables for
> given VM, so that later we can choose to fall back to the
> normal page table levels (i.e, avoid entry level table
> concatenation) on arm64.

the justification is not crystal clear to me but it does no harm I think.
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
> Changes since V2:
>  - New patch
> ---
>  arch/arm/include/asm/kvm_mmu.h   | 6 ++++++
>  arch/arm64/include/asm/kvm_mmu.h | 6 ++++++
>  virt/kvm/arm/mmu.c               | 2 +-
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
> index f36eb20..b2da5a4 100644
> --- a/arch/arm/include/asm/kvm_mmu.h
> +++ b/arch/arm/include/asm/kvm_mmu.h
> @@ -372,6 +372,12 @@ static inline int hyp_map_aux_data(void)
>  	return 0;
>  }
>  
> +static inline void *stage2_alloc_pgd(struct kvm *kvm)
> +{
> +	return alloc_pages_exact(stage2_pgd_size(kvm),
> +				 GFP_KERNEL | __GFP_ZERO);
> +}
> +
>  #define kvm_phys_to_vttbr(addr)		(addr)
>  
>  #endif	/* !__ASSEMBLY__ */
> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
> index 5da8f52..dbaf513 100644
> --- a/arch/arm64/include/asm/kvm_mmu.h
> +++ b/arch/arm64/include/asm/kvm_mmu.h
> @@ -501,5 +501,11 @@ static inline int hyp_map_aux_data(void)
>  
>  #define kvm_phys_to_vttbr(addr)		phys_to_ttbr(addr)
>  
> +static inline void *stage2_alloc_pgd(struct kvm *kvm)
> +{
> +	return alloc_pages_exact(stage2_pgd_size(kvm),
> +				 GFP_KERNEL | __GFP_ZERO);
> +}
> +
>  #endif /* __ASSEMBLY__ */
>  #endif /* __ARM64_KVM_MMU_H__ */
> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> index 82dd571..a339e00 100644
> --- a/virt/kvm/arm/mmu.c
> +++ b/virt/kvm/arm/mmu.c
> @@ -868,7 +868,7 @@ int kvm_alloc_stage2_pgd(struct kvm *kvm)
>  	}
>  
>  	/* Allocate the HW PGD, making sure that each page gets its own refcount */
> -	pgd = alloc_pages_exact(stage2_pgd_size(kvm), GFP_KERNEL | __GFP_ZERO);
> +	pgd = stage2_alloc_pgd(kvm);
>  	if (!pgd)
>  		return -ENOMEM;
>  
> 
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

Eric Auger July 2, 2018, 7:13 p.m. UTC | #26

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> We load the stage2 context of a guest for different operations,
> including running the guest and tlb maintenance on behalf of the
> guest. As of now only the vttbr is private to the guest, but this
> is about to change with IPA per VM. Add a helper to load the stage2
> configuration for a VM, which could do the right thing with the
> future changes.
> 
> Cc: Christoffer Dall <cdall@kernel.org>
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> ---
> Changes since v2:
>  - New patch
> ---
>  arch/arm64/include/asm/kvm_hyp.h | 6 ++++++
>  arch/arm64/kvm/hyp/switch.c      | 2 +-
>  arch/arm64/kvm/hyp/tlb.c         | 4 ++--
>  3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
> index 384c343..82f9994 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -155,5 +155,11 @@ void deactivate_traps_vhe_put(void);
>  u64 __guest_enter(struct kvm_vcpu *vcpu, struct kvm_cpu_context *host_ctxt);
>  void __noreturn __hyp_do_panic(unsigned long, ...);
>  
> +/* Must be called from hyp code running at EL2 */
> +static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
> +{
> +	write_sysreg(kvm->arch.vttbr, vttbr_el2);
> +}
> +
>  #endif /* __ARM64_KVM_HYP_H__ */
>  
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index d496ef5..355fb25 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -195,7 +195,7 @@ void deactivate_traps_vhe_put(void)
>  
>  static void __hyp_text __activate_vm(struct kvm *kvm)
>  {
> -	write_sysreg(kvm->arch.vttbr, vttbr_el2);
> +	__load_guest_stage2(kvm);
>  }
>  
>  static void __hyp_text __deactivate_vm(struct kvm_vcpu *vcpu)
> diff --git a/arch/arm64/kvm/hyp/tlb.c b/arch/arm64/kvm/hyp/tlb.c
> index 131c777..4dbd9c6 100644
> --- a/arch/arm64/kvm/hyp/tlb.c
> +++ b/arch/arm64/kvm/hyp/tlb.c
> @@ -30,7 +30,7 @@ static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm *kvm)
>  	 * bits. Changing E2H is impossible (goodbye TTBR1_EL2), so
>  	 * let's flip TGE before executing the TLB operation.
>  	 */
> -	write_sysreg(kvm->arch.vttbr, vttbr_el2);
> +	__load_guest_stage2(kvm);
>  	val = read_sysreg(hcr_el2);
>  	val &= ~HCR_TGE;
>  	write_sysreg(val, hcr_el2);
> @@ -39,7 +39,7 @@ static void __hyp_text __tlb_switch_to_guest_vhe(struct kvm *kvm)
>  
>  static void __hyp_text __tlb_switch_to_guest_nvhe(struct kvm *kvm)
>  {
> -	write_sysreg(kvm->arch.vttbr, vttbr_el2);
> +	__load_guest_stage2(kvm);
>  	isb();
>  }
>  
>

Suzuki K Poulose July 3, 2018, 8:04 a.m. UTC | #27

Hi Michael,

On 06/29/2018 06:42 PM, Michael S. Tsirkin wrote:
> On Fri, Jun 29, 2018 at 12:15:21PM +0100, Suzuki K Poulose wrote:
>> virtio-mmio with virtio-v1 uses a 32bit PFN for the queue.
>> If the queue pfn is too large to fit in 32bits, which
>> we could hit on arm64 systems with 52bit physical addresses
>> (even with 64K page size), we simply miss out a proper link
>> to the other side of the queue.
>>
>> Add a check to validate the PFN, rather than silently breaking
>> the devices.
>>
>> Cc: "Michael S. Tsirkin" <mst@redhat.com>
>> Cc: Jason Wang <jasowang@redhat.com>
>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>> Cc: Christoffer Dall <cdall@kernel.org>
>> Cc: Peter Maydel <peter.maydell@linaro.org>
>> Cc: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>> Changes since v2:
>>   - Change errno to -E2BIG
>> ---
>>   drivers/virtio/virtio_mmio.c | 18 ++++++++++++++++--
>>   1 file changed, 16 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
>> index 67763d3..82cedc8 100644
>> --- a/drivers/virtio/virtio_mmio.c
>> +++ b/drivers/virtio/virtio_mmio.c
>> @@ -397,9 +397,21 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
>>   	/* Activate the queue */
>>   	writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
>>   	if (vm_dev->version == 1) {
>> +		u64 q_pfn = virtqueue_get_desc_addr(vq) >> PAGE_SHIFT;
>> +
>> +		/*
>> +		 * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something
>> +		 * that doesn't fit in 32bit, fail the setup rather than
>> +		 * pretending to be successful.
>> +		 */
>> +		if (q_pfn >> 32) {
>> +			dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");
> 
> How about:
> 	"hypervisor bug: legacy virtio-mmio must not be used with more than 0x%llx Gigabytes of memory",
> 	0x1ULL << (32 - 30) << PAGE_SHIFT

nit : Do we need change "hypervisor" => "platform" ? Virtio is used by 
other tools (e.g, emulators) and not just virtual machines.

Suzuki

Suzuki K Poulose July 3, 2018, 10:48 a.m. UTC | #28

On 02/07/18 13:16, Marc Zyngier wrote:
> On 29/06/18 12:15, Suzuki K Poulose wrote:
>> We set VTCR_EL2 very early during the stage2 init and don't
>> touch it ever. This is fine as we had a fixed IPA size. This
>> patch changes the behavior to set the VTCR for a given VM,
>> depending on its stage2 table. The common configuration for
>> VTCR is still performed during the early init as we have to
>> retain the hardware access flag update bits (VTCR_EL2_HA)
>> per CPU (as they are only set for the CPUs which are capabile).
> 
> capable
> 
>> The bits defining the number of levels in the page table (SL0)
>> and and the size of the Input address to the translation (T0SZ)
>> are programmed for each VM upon entry to the guest.
>>
>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>> Cc: Christoffer Dall <cdall@kernel.org>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>> Change since V2:
>>   - Load VTCR for TLB operations
>> ---
>>   arch/arm64/include/asm/kvm_arm.h  | 19 +++++++++----------
>>   arch/arm64/include/asm/kvm_asm.h  |  2 +-
>>   arch/arm64/include/asm/kvm_host.h |  9 ++++++---
>>   arch/arm64/include/asm/kvm_hyp.h  | 11 +++++++++++
>>   arch/arm64/kvm/hyp/s2-setup.c     | 17 +----------------
>>   5 files changed, 28 insertions(+), 30 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>> index 11a7db0..b02c316 100644
>> --- a/arch/arm64/include/asm/kvm_arm.h
>> +++ b/arch/arm64/include/asm/kvm_arm.h
>> @@ -120,9 +120,7 @@
>>   #define VTCR_EL2_IRGN0_WBWA	TCR_IRGN0_WBWA
>>   #define VTCR_EL2_SL0_SHIFT	6
>>   #define VTCR_EL2_SL0_MASK	(3 << VTCR_EL2_SL0_SHIFT)
>> -#define VTCR_EL2_SL0_LVL1	(1 << VTCR_EL2_SL0_SHIFT)
>>   #define VTCR_EL2_T0SZ_MASK	0x3f
>> -#define VTCR_EL2_T0SZ_40B	24
>>   #define VTCR_EL2_VS_SHIFT	19
>>   #define VTCR_EL2_VS_8BIT	(0 << VTCR_EL2_VS_SHIFT)
>>   #define VTCR_EL2_VS_16BIT	(1 << VTCR_EL2_VS_SHIFT)
>> @@ -137,43 +135,44 @@
>>    * VTCR_EL2.PS is extracted from ID_AA64MMFR0_EL1.PARange at boot time
>>    * (see hyp-init.S).
>>    *
>> + * VTCR_EL2.SL0 and T0SZ are configured per VM at runtime before switching to
>> + * the VM.
>> + *
>>    * Note that when using 4K pages, we concatenate two first level page tables
>>    * together. With 16K pages, we concatenate 16 first level page tables.
>>    *
>>    */
>>   
>> -#define VTCR_EL2_T0SZ_IPA	VTCR_EL2_T0SZ_40B
>>   #define VTCR_EL2_COMMON_BITS	(VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
>>   				 VTCR_EL2_IRGN0_WBWA | VTCR_EL2_RES1)
>> +#define VTCR_EL2_PRIVATE_MASK	(VTCR_EL2_SL0_MASK | VTCR_EL2_T0SZ_MASK)
> 
> What does "private" mean here? It really is the IPA configuration, so
> I'd rather have a naming that reflects that.
> 
>>   #ifdef CONFIG_ARM64_64K_PAGES
>>   /*
>>    * Stage2 translation configuration:
>>    * 64kB pages (TG0 = 1)
>> - * 2 level page tables (SL = 1)
>>    */
>> -#define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
>> +#define VTCR_EL2_TGRAN			VTCR_EL2_TG0_64K
>>   #define VTCR_EL2_TGRAN_SL0_BASE		3UL
>>   
>>   #elif defined(CONFIG_ARM64_16K_PAGES)
>>   /*
>>    * Stage2 translation configuration:
>>    * 16kB pages (TG0 = 2)
>> - * 2 level page tables (SL = 1)
>>    */
>> -#define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
>> +#define VTCR_EL2_TGRAN			VTCR_EL2_TG0_16K
>>   #define VTCR_EL2_TGRAN_SL0_BASE		3UL
>>   #else	/* 4K */
>>   /*
>>    * Stage2 translation configuration:
>>    * 4kB pages (TG0 = 0)
>> - * 3 level page tables (SL = 1)
>>    */
>> -#define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
>> +#define VTCR_EL2_TGRAN			VTCR_EL2_TG0_4K
>>   #define VTCR_EL2_TGRAN_SL0_BASE		2UL
>>   #endif
>>   
>> -#define VTCR_EL2_FLAGS			(VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
>> +#define VTCR_EL2_FLAGS		(VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
>> +
>>   /*
>>    * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
>>    * Interestingly, it depends on the page size.
>> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
>> index 102b5a5..91372eb 100644
>> --- a/arch/arm64/include/asm/kvm_asm.h
>> +++ b/arch/arm64/include/asm/kvm_asm.h
>> @@ -72,7 +72,7 @@ extern void __vgic_v3_init_lrs(void);
>>   
>>   extern u32 __kvm_get_mdcr_el2(void);
>>   
>> -extern u32 __init_stage2_translation(void);
>> +extern void __init_stage2_translation(void);
>>   
>>   /* Home-grown __this_cpu_{ptr,read} variants that always work at HYP */
>>   #define __hyp_this_cpu_ptr(sym)						\
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index fe8777b..328f472 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -442,10 +442,13 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>   
>>   static inline void __cpu_init_stage2(void)
>>   {
>> -	u32 parange = kvm_call_hyp(__init_stage2_translation);
>> +	u32 ps;
>>   
>> -	WARN_ONCE(parange < 40,
>> -		  "PARange is %d bits, unsupported configuration!", parange);
>> +	kvm_call_hyp(__init_stage2_translation);
>> +	/* Sanity check for minimum IPA size support */
>> +	ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
>> +	WARN_ONCE(ps < 40,
>> +		  "PARange is %d bits, unsupported configuration!", ps);
>>   }
>>   
>>   /* Guest/host FPSIMD coordination helpers */
>> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
>> index 82f9994..3e8052d1 100644
>> --- a/arch/arm64/include/asm/kvm_hyp.h
>> +++ b/arch/arm64/include/asm/kvm_hyp.h
>> @@ -20,6 +20,7 @@
>>   
>>   #include <linux/compiler.h>
>>   #include <linux/kvm_host.h>
>> +#include <asm/kvm_mmu.h>
>>   #include <asm/sysreg.h>
>>   
>>   #define __hyp_text __section(.hyp.text) notrace
>> @@ -158,6 +159,16 @@ void __noreturn __hyp_do_panic(unsigned long, ...);
>>   /* Must be called from hyp code running at EL2 */

Marc,

>>   static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
>>   {
>> +	/*
>> +	 * Configure the VTCR translation control bits
>> +	 * for this VM.
>> +	 */
>> +	u64 vtcr = read_sysreg(vtcr_el2);
>> +
>> +	vtcr &= ~VTCR_EL2_PRIVATE_MASK;
>> +	vtcr |= VTCR_EL2_SL0(kvm_stage2_levels(kvm)) |
>> +		VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
>> +	write_sysreg(vtcr, vtcr_el2);
> 
> Can't we generate the whole vtcr value in one go, without reading it
> back? Specially given that on patch 16, you're actually switching to a
> per-VM variable, and it would make a lot of sense to start with that here.

...

>> -u32 __hyp_text __init_stage2_translation(void)
>> +void __hyp_text __init_stage2_translation(void)
..

> 
> And then most of the code here could run on a per-VM basis.

There is one problem with generating the entire vtcr for a VM.
On a system with mismatched CPU features, we need to have either :

  - Per CPU VTCR fixed bits
     OR
  - Track system wide safe VTCR bits. (Not ideal with dirty bit and access
  flag updates, if and when we support them ).

So far the only fields of interest are HA & HD, which may be turned on
for CPUs that can support the feature. Rest can be filled in from the
sanitised EL1 system registers and IPA limit and the others would need
to be filled as RES0. This could potentially have some issues on
newer versions of the architecture running on older kernels.

What do you think ?

Suzuki

Marc Zyngier July 3, 2018, 10:58 a.m. UTC | #29

On 03/07/18 11:48, Suzuki K Poulose wrote:
> On 02/07/18 13:16, Marc Zyngier wrote:
>> On 29/06/18 12:15, Suzuki K Poulose wrote:
>>> We set VTCR_EL2 very early during the stage2 init and don't
>>> touch it ever. This is fine as we had a fixed IPA size. This
>>> patch changes the behavior to set the VTCR for a given VM,
>>> depending on its stage2 table. The common configuration for
>>> VTCR is still performed during the early init as we have to
>>> retain the hardware access flag update bits (VTCR_EL2_HA)
>>> per CPU (as they are only set for the CPUs which are capabile).
>>
>> capable
>>
>>> The bits defining the number of levels in the page table (SL0)
>>> and and the size of the Input address to the translation (T0SZ)
>>> are programmed for each VM upon entry to the guest.
>>>
>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>>> Cc: Christoffer Dall <cdall@kernel.org>
>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>> ---
>>> Change since V2:
>>>   - Load VTCR for TLB operations
>>> ---
>>>   arch/arm64/include/asm/kvm_arm.h  | 19 +++++++++----------
>>>   arch/arm64/include/asm/kvm_asm.h  |  2 +-
>>>   arch/arm64/include/asm/kvm_host.h |  9 ++++++---
>>>   arch/arm64/include/asm/kvm_hyp.h  | 11 +++++++++++
>>>   arch/arm64/kvm/hyp/s2-setup.c     | 17 +----------------
>>>   5 files changed, 28 insertions(+), 30 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>>> index 11a7db0..b02c316 100644
>>> --- a/arch/arm64/include/asm/kvm_arm.h
>>> +++ b/arch/arm64/include/asm/kvm_arm.h
>>> @@ -120,9 +120,7 @@
>>>   #define VTCR_EL2_IRGN0_WBWA	TCR_IRGN0_WBWA
>>>   #define VTCR_EL2_SL0_SHIFT	6
>>>   #define VTCR_EL2_SL0_MASK	(3 << VTCR_EL2_SL0_SHIFT)
>>> -#define VTCR_EL2_SL0_LVL1	(1 << VTCR_EL2_SL0_SHIFT)
>>>   #define VTCR_EL2_T0SZ_MASK	0x3f
>>> -#define VTCR_EL2_T0SZ_40B	24
>>>   #define VTCR_EL2_VS_SHIFT	19
>>>   #define VTCR_EL2_VS_8BIT	(0 << VTCR_EL2_VS_SHIFT)
>>>   #define VTCR_EL2_VS_16BIT	(1 << VTCR_EL2_VS_SHIFT)
>>> @@ -137,43 +135,44 @@
>>>    * VTCR_EL2.PS is extracted from ID_AA64MMFR0_EL1.PARange at boot time
>>>    * (see hyp-init.S).
>>>    *
>>> + * VTCR_EL2.SL0 and T0SZ are configured per VM at runtime before switching to
>>> + * the VM.
>>> + *
>>>    * Note that when using 4K pages, we concatenate two first level page tables
>>>    * together. With 16K pages, we concatenate 16 first level page tables.
>>>    *
>>>    */
>>>   
>>> -#define VTCR_EL2_T0SZ_IPA	VTCR_EL2_T0SZ_40B
>>>   #define VTCR_EL2_COMMON_BITS	(VTCR_EL2_SH0_INNER | VTCR_EL2_ORGN0_WBWA | \
>>>   				 VTCR_EL2_IRGN0_WBWA | VTCR_EL2_RES1)
>>> +#define VTCR_EL2_PRIVATE_MASK	(VTCR_EL2_SL0_MASK | VTCR_EL2_T0SZ_MASK)
>>
>> What does "private" mean here? It really is the IPA configuration, so
>> I'd rather have a naming that reflects that.
>>
>>>   #ifdef CONFIG_ARM64_64K_PAGES
>>>   /*
>>>    * Stage2 translation configuration:
>>>    * 64kB pages (TG0 = 1)
>>> - * 2 level page tables (SL = 1)
>>>    */
>>> -#define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_64K | VTCR_EL2_SL0_LVL1)
>>> +#define VTCR_EL2_TGRAN			VTCR_EL2_TG0_64K
>>>   #define VTCR_EL2_TGRAN_SL0_BASE		3UL
>>>   
>>>   #elif defined(CONFIG_ARM64_16K_PAGES)
>>>   /*
>>>    * Stage2 translation configuration:
>>>    * 16kB pages (TG0 = 2)
>>> - * 2 level page tables (SL = 1)
>>>    */
>>> -#define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_16K | VTCR_EL2_SL0_LVL1)
>>> +#define VTCR_EL2_TGRAN			VTCR_EL2_TG0_16K
>>>   #define VTCR_EL2_TGRAN_SL0_BASE		3UL
>>>   #else	/* 4K */
>>>   /*
>>>    * Stage2 translation configuration:
>>>    * 4kB pages (TG0 = 0)
>>> - * 3 level page tables (SL = 1)
>>>    */
>>> -#define VTCR_EL2_TGRAN_FLAGS		(VTCR_EL2_TG0_4K | VTCR_EL2_SL0_LVL1)
>>> +#define VTCR_EL2_TGRAN			VTCR_EL2_TG0_4K
>>>   #define VTCR_EL2_TGRAN_SL0_BASE		2UL
>>>   #endif
>>>   
>>> -#define VTCR_EL2_FLAGS			(VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN_FLAGS)
>>> +#define VTCR_EL2_FLAGS		(VTCR_EL2_COMMON_BITS | VTCR_EL2_TGRAN)
>>> +
>>>   /*
>>>    * VTCR_EL2:SL0 indicates the entry level for Stage2 translation.
>>>    * Interestingly, it depends on the page size.
>>> diff --git a/arch/arm64/include/asm/kvm_asm.h b/arch/arm64/include/asm/kvm_asm.h
>>> index 102b5a5..91372eb 100644
>>> --- a/arch/arm64/include/asm/kvm_asm.h
>>> +++ b/arch/arm64/include/asm/kvm_asm.h
>>> @@ -72,7 +72,7 @@ extern void __vgic_v3_init_lrs(void);
>>>   
>>>   extern u32 __kvm_get_mdcr_el2(void);
>>>   
>>> -extern u32 __init_stage2_translation(void);
>>> +extern void __init_stage2_translation(void);
>>>   
>>>   /* Home-grown __this_cpu_{ptr,read} variants that always work at HYP */
>>>   #define __hyp_this_cpu_ptr(sym)						\
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index fe8777b..328f472 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -442,10 +442,13 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>>>   
>>>   static inline void __cpu_init_stage2(void)
>>>   {
>>> -	u32 parange = kvm_call_hyp(__init_stage2_translation);
>>> +	u32 ps;
>>>   
>>> -	WARN_ONCE(parange < 40,
>>> -		  "PARange is %d bits, unsupported configuration!", parange);
>>> +	kvm_call_hyp(__init_stage2_translation);
>>> +	/* Sanity check for minimum IPA size support */
>>> +	ps = id_aa64mmfr0_parange_to_phys_shift(read_sysreg(id_aa64mmfr0_el1) & 0x7);
>>> +	WARN_ONCE(ps < 40,
>>> +		  "PARange is %d bits, unsupported configuration!", ps);
>>>   }
>>>   
>>>   /* Guest/host FPSIMD coordination helpers */
>>> diff --git a/arch/arm64/include/asm/kvm_hyp.h b/arch/arm64/include/asm/kvm_hyp.h
>>> index 82f9994..3e8052d1 100644
>>> --- a/arch/arm64/include/asm/kvm_hyp.h
>>> +++ b/arch/arm64/include/asm/kvm_hyp.h
>>> @@ -20,6 +20,7 @@
>>>   
>>>   #include <linux/compiler.h>
>>>   #include <linux/kvm_host.h>
>>> +#include <asm/kvm_mmu.h>
>>>   #include <asm/sysreg.h>
>>>   
>>>   #define __hyp_text __section(.hyp.text) notrace
>>> @@ -158,6 +159,16 @@ void __noreturn __hyp_do_panic(unsigned long, ...);
>>>   /* Must be called from hyp code running at EL2 */
> 
> Marc,
> 
>>>   static __always_inline void __hyp_text __load_guest_stage2(struct kvm *kvm)
>>>   {
>>> +	/*
>>> +	 * Configure the VTCR translation control bits
>>> +	 * for this VM.
>>> +	 */
>>> +	u64 vtcr = read_sysreg(vtcr_el2);
>>> +
>>> +	vtcr &= ~VTCR_EL2_PRIVATE_MASK;
>>> +	vtcr |= VTCR_EL2_SL0(kvm_stage2_levels(kvm)) |
>>> +		VTCR_EL2_T0SZ(kvm_phys_shift(kvm));
>>> +	write_sysreg(vtcr, vtcr_el2);
>>
>> Can't we generate the whole vtcr value in one go, without reading it
>> back? Specially given that on patch 16, you're actually switching to a
>> per-VM variable, and it would make a lot of sense to start with that here.
> 
> ...
> 
>>> -u32 __hyp_text __init_stage2_translation(void)
>>> +void __hyp_text __init_stage2_translation(void)
> ..
> 
>>
>> And then most of the code here could run on a per-VM basis.
> 
> There is one problem with generating the entire vtcr for a VM.
> On a system with mismatched CPU features, we need to have either :
> 
>   - Per CPU VTCR fixed bits
>      OR
>   - Track system wide safe VTCR bits. (Not ideal with dirty bit and access
>   flag updates, if and when we support them ).
> 
> So far the only fields of interest are HA & HD, which may be turned on
> for CPUs that can support the feature. Rest can be filled in from the
> sanitised EL1 system registers and IPA limit and the others would need
> to be filled as RES0. This could potentially have some issues on
> newer versions of the architecture running on older kernels.

For HA and HD, we can perfectly set them if if only one CPU in the
system has it. We already do this for other system registers on the
ground that if the CPU doesn't honour the RES0 behaviour, then it is
terminally broken.

Thanks,

	M.

Suzuki K Poulose July 3, 2018, 11:54 a.m. UTC | #30

Hi Eric,

On 02/07/18 15:41, Auger Eric wrote:
> Hi Suzuki,
> 
> On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
>> On arm64 VTTBR_EL2:BADDR holds the base address for the stage2
>> translation table. The Arm ARM mandates that the bits BADDR[x-1:0]
>> should be 0, where 'x' is defined for a given IPA Size and the
>> number of levels for a translation granule size. It is defined
>> using some magical constants. This patch is a reverse engineered
>> implementation to calculate the 'x' at runtime for a given ipa and
>> number of page table levels. See patch for more details.
>>
>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>> Cc: Christoffer Dall <cdall@kernel.org>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>> Changes since V2:
>>   - Part 1 of spilt from VTCR & VTTBR dynamic configuration
>> ---
>>   arch/arm64/include/asm/kvm_arm.h | 60 +++++++++++++++++++++++++++++++++++++---
>>   arch/arm64/include/asm/kvm_mmu.h | 25 ++++++++++++++++-
>>   2 files changed, 80 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
>> index 3dffd38..c557f45 100644
>> --- a/arch/arm64/include/asm/kvm_arm.h
>> +++ b/arch/arm64/include/asm/kvm_arm.h
>> @@ -140,8 +140,6 @@
>>    * Note that when using 4K pages, we concatenate two first level page tables
>>    * together. With 16K pages, we concatenate 16 first level page tables.
>>    *
>> - * The magic numbers used for VTTBR_X in this patch can be found in Tables
>> - * D4-23 and D4-25 in ARM DDI 0487A.b.
> Isn't it a pretty old reference? Could you refer to C.a?

Sure, I will update the references everywhere.

>> + *
>> + * The algorithm defines the expectations on the BaseAddress (for the page
>> + * table) bits resolved at each level based on the page size, entry level
>> + * and T0SZ. The variable "x" in the algorithm also affects the VTTBR:BADDR
>> + * for stage2 page table.
>> + *
>> + * The value of "x" is calculated as :
>> + *	x = Magic_N - T0SZ
>> + *
>> + * where Magic_N is an integer depending on the page size and the entry
>> + * level of the page table as below:
>> + *
>> + *	--------------------------------------------
>> + *	| Entry level		|  4K    16K   64K |
>> + *	--------------------------------------------
>> + *	| Level: 0 (4 levels)	| 28   |  -  |  -  |
>> + *	--------------------------------------------
>> + *	| Level: 1 (3 levels)	| 37   | 31  | 25  |
>> + *	--------------------------------------------
>> + *	| Level: 2 (2 levels)	| 46   | 42  | 38  |
>> + *	--------------------------------------------
>> + *	| Level: 3 (1 level)	| -    | 53  | 51  |
>> + *	--------------------------------------------
> I understand entry level = Lookup level in the table.

Entry level => The level at which we start the page table walk for
a given address (This is in line with the ARM ARM). So,

Entry_level = (4 - Number_of_Page_table_levels)

> But you may want to compute x for BaseAddress matching lookup level 2
> with number of levels = 4.

No, the BaseAddress is only calcualted for the "Entry_level". So the
above case doesn't exist at all.

> So shouldn't you s/Number of levels/4 - entry_level?

Ok, I now understood what you are referring to [0]
> for BADDR we want the BaseAddr of the initial lookup level so
> effectively the entry level we are interested in is 4 - number of levels
> and we don't care or d) condition. At least this is my understanding ;-)
> If correct you may slightly reword the explanation?


>> + *
>> + * We have a magic formula for the Magic_N below.
>> + *
>> + *  Magic_N(PAGE_SIZE, Entry_Level) = 64 - ((PAGE_SHIFT - 3) * Number of levels)

[0] ^^^



>> + *
>> + * where number of levels = (4 - Entry_Level).

^^^ Doesn't this help make it clear ? Using the expansion makes it a bit more
unreadable below.

>>   
>> +/*
>> + * Get the magic number 'x' for VTTBR:BADDR of this KVM instance.
>> + * With v8.2 LVA extensions, 'x' should be a minimum of 6 with
>> + * 52bit IPS.
> Link to the spec?

Sure, will add it.

Thanks for the patience to review :-)

Cheers
Suzuki

Michael S. Tsirkin July 4, 2018, 5:37 a.m. UTC | #31

On Tue, Jul 03, 2018 at 09:04:01AM +0100, Suzuki K Poulose wrote:
> Hi Michael,
> 
> On 06/29/2018 06:42 PM, Michael S. Tsirkin wrote:
> > On Fri, Jun 29, 2018 at 12:15:21PM +0100, Suzuki K Poulose wrote:
> > > virtio-mmio with virtio-v1 uses a 32bit PFN for the queue.
> > > If the queue pfn is too large to fit in 32bits, which
> > > we could hit on arm64 systems with 52bit physical addresses
> > > (even with 64K page size), we simply miss out a proper link
> > > to the other side of the queue.
> > > 
> > > Add a check to validate the PFN, rather than silently breaking
> > > the devices.
> > > 
> > > Cc: "Michael S. Tsirkin" <mst@redhat.com>
> > > Cc: Jason Wang <jasowang@redhat.com>
> > > Cc: Marc Zyngier <marc.zyngier@arm.com>
> > > Cc: Christoffer Dall <cdall@kernel.org>
> > > Cc: Peter Maydel <peter.maydell@linaro.org>
> > > Cc: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> > > Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> > > ---
> > > Changes since v2:
> > >   - Change errno to -E2BIG
> > > ---
> > >   drivers/virtio/virtio_mmio.c | 18 ++++++++++++++++--
> > >   1 file changed, 16 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
> > > index 67763d3..82cedc8 100644
> > > --- a/drivers/virtio/virtio_mmio.c
> > > +++ b/drivers/virtio/virtio_mmio.c
> > > @@ -397,9 +397,21 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index,
> > >   	/* Activate the queue */
> > >   	writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM);
> > >   	if (vm_dev->version == 1) {
> > > +		u64 q_pfn = virtqueue_get_desc_addr(vq) >> PAGE_SHIFT;
> > > +
> > > +		/*
> > > +		 * virtio-mmio v1 uses a 32bit QUEUE PFN. If we have something
> > > +		 * that doesn't fit in 32bit, fail the setup rather than
> > > +		 * pretending to be successful.
> > > +		 */
> > > +		if (q_pfn >> 32) {
> > > +			dev_err(&vdev->dev, "virtio-mmio: queue address too large\n");
> > 
> > How about:
> > 	"hypervisor bug: legacy virtio-mmio must not be used with more than 0x%llx Gigabytes of memory",
> > 	0x1ULL << (32 - 30) << PAGE_SHIFT
> 
> nit : Do we need change "hypervisor" => "platform" ? Virtio is used by other
> tools (e.g, emulators) and not just virtual machines.
> 
> Suzuki

OK.

Eric Auger July 4, 2018, 8:09 a.m. UTC | #32

Hi Suzuki,

On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
> From: Kristina Martsenko <kristina.martsenko@arm.com>
> 
> Add support for handling 52bit guest physical address to the
> VGIC layer. So far we have limited the guest physical address
> to 48bits, by explicitly masking the upper bits. This patch
> removes the restriction. We do not have to check if the host
> supports 52bit as the gpa is always validated during an access.
> (e.g, kvm_{read/write}_guest, kvm_is_visible_gfn()).
> Also, the ITS table save-restore is also not affected with
> the enhancement. The DTE entries already store the bits[51:8]
> of the ITT_addr (with a 256byte alignment).
> 
> Cc: Marc Zyngier <marc.zyngier@arm.com>
> Cc: Christoffer Dall <cdall@kernel.org>
> Signed-off-by: Kristina Martsenko <kristina.martsenko@arm.com>
> [ Macro clean ups, fix PROPBASER and PENDBASER accesses ]
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  include/linux/irqchip/arm-gic-v3.h |  5 +++++
>  virt/kvm/arm/vgic/vgic-its.c       | 36 ++++++++++--------------------------
>  virt/kvm/arm/vgic/vgic-mmio-v3.c   |  2 --
>  3 files changed, 15 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h
> index cbb872c..bc4b95b 100644
> --- a/include/linux/irqchip/arm-gic-v3.h
> +++ b/include/linux/irqchip/arm-gic-v3.h
> @@ -346,6 +346,8 @@
>  #define GITS_CBASER_RaWaWt	GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, RaWaWt)
>  #define GITS_CBASER_RaWaWb	GIC_BASER_CACHEABILITY(GITS_CBASER, INNER, RaWaWb)
>  
> +#define GITS_CBASER_ADDRESS(cbaser)	((cbaser) & GENMASK_ULL(52, 12))
> +
>  #define GITS_BASER_NR_REGS		8
>  
>  #define GITS_BASER_VALID			(1ULL << 63)
> @@ -377,6 +379,9 @@
>  #define GITS_BASER_ENTRY_SIZE_MASK	GENMASK_ULL(52, 48)
>  #define GITS_BASER_PHYS_52_to_48(phys)					\
>  	(((phys) & GENMASK_ULL(47, 16)) | (((phys) >> 48) & 0xf) << 12)
> +#define GITS_BASER_ADDR_48_to_52(baser)					\
> +	(((baser) & GENMASK_ULL(47, 16)) | (((baser) >> 12) & 0xf) << 48)
only works if page_size = 64kB which is the case in vITS but as it is in
irqchip header, may be worth a comment?
> +
>  #define GITS_BASER_SHAREABILITY_SHIFT	(10)
>  #define GITS_BASER_InnerShareable					\
>  	GIC_BASER_SHAREABILITY(GITS_BASER, InnerShareable)
> diff --git a/virt/kvm/arm/vgic/vgic-its.c b/virt/kvm/arm/vgic/vgic-its.c
> index 4ed79c9..c6eb390 100644
> --- a/virt/kvm/arm/vgic/vgic-its.c
> +++ b/virt/kvm/arm/vgic/vgic-its.c
> @@ -234,13 +234,6 @@ static struct its_ite *find_ite(struct vgic_its *its, u32 device_id,
>  	list_for_each_entry(dev, &(its)->device_list, dev_list) \
>  		list_for_each_entry(ite, &(dev)->itt_head, ite_list)
>  
> -/*
> - * We only implement 48 bits of PA at the moment, although the ITS
> - * supports more. Let's be restrictive here.
> - */
> -#define BASER_ADDRESS(x)	((x) & GENMASK_ULL(47, 16))
> -#define CBASER_ADDRESS(x)	((x) & GENMASK_ULL(47, 12))
> -
>  #define GIC_LPI_OFFSET 8192
>  
>  #define VITS_TYPER_IDBITS 16
> @@ -752,6 +745,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
>  {
>  	int l1_tbl_size = GITS_BASER_NR_PAGES(baser) * SZ_64K;
>  	u64 indirect_ptr, type = GITS_BASER_TYPE(baser);
> +	phys_addr_t base = GITS_BASER_ADDR_48_to_52(baser);
>  	int esz = GITS_BASER_ENTRY_SIZE(baser);
>  	int index;
>  	gfn_t gfn;
> @@ -776,7 +770,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
>  		if (id >= (l1_tbl_size / esz))
>  			return false;
>  
> -		addr = BASER_ADDRESS(baser) + id * esz;
> +		addr = base + id * esz;
>  		gfn = addr >> PAGE_SHIFT;
>  
>  		if (eaddr)
> @@ -791,7 +785,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
>  
>  	/* Each 1st level entry is represented by a 64-bit value. */
>  	if (kvm_read_guest_lock(its->dev->kvm,
> -			   BASER_ADDRESS(baser) + index * sizeof(indirect_ptr),
> +			   base + index * sizeof(indirect_ptr),
>  			   &indirect_ptr, sizeof(indirect_ptr)))
>  		return false;
>  
> @@ -801,11 +795,7 @@ static bool vgic_its_check_id(struct vgic_its *its, u64 baser, u32 id,
>  	if (!(indirect_ptr & BIT_ULL(63)))
>  		return false;
>  
> -	/*
> -	 * Mask the guest physical address and calculate the frame number.
> -	 * Any address beyond our supported 48 bits of PA will be caught
> -	 * by the actual check in the final step.
> -	 */
> +	/* Mask the guest physical address and calculate the frame number. */
>  	indirect_ptr &= GENMASK_ULL(51, 16);
>  
>  	/* Find the address of the actual entry */
> @@ -1297,9 +1287,6 @@ static u64 vgic_sanitise_its_baser(u64 reg)
>  				  GITS_BASER_OUTER_CACHEABILITY_SHIFT,
>  				  vgic_sanitise_outer_cacheability);
>  
> -	/* Bits 15:12 contain bits 51:48 of the PA, which we don't support. */
> -	reg &= ~GENMASK_ULL(15, 12);
> -
>  	/* We support only one (ITS) page size: 64K */
>  	reg = (reg & ~GITS_BASER_PAGE_SIZE_MASK) | GITS_BASER_PAGE_SIZE_64K;
>  
> @@ -1318,11 +1305,8 @@ static u64 vgic_sanitise_its_cbaser(u64 reg)
>  				  GITS_CBASER_OUTER_CACHEABILITY_SHIFT,
>  				  vgic_sanitise_outer_cacheability);
>  
> -	/*
> -	 * Sanitise the physical address to be 64k aligned.
> -	 * Also limit the physical addresses to 48 bits.
> -	 */
> -	reg &= ~(GENMASK_ULL(51, 48) | GENMASK_ULL(15, 12));
> +	/* Sanitise the physical address to be 64k aligned. */
> +	reg &= ~GENMASK_ULL(15, 12);
>  
>  	return reg;
>  }
> @@ -1368,7 +1352,7 @@ static void vgic_its_process_commands(struct kvm *kvm, struct vgic_its *its)
>  	if (!its->enabled)
>  		return;
>  
> -	cbaser = CBASER_ADDRESS(its->cbaser);
> +	cbaser = GITS_CBASER_ADDRESS(its->cbaser);
>  
>  	while (its->cwriter != its->creadr) {
>  		int ret = kvm_read_guest_lock(kvm, cbaser + its->creadr,
> @@ -2226,7 +2210,7 @@ static int vgic_its_restore_device_tables(struct vgic_its *its)
>  	if (!(baser & GITS_BASER_VALID))
>  		return 0;
>  
> -	l1_gpa = BASER_ADDRESS(baser);
> +	l1_gpa = GITS_BASER_ADDR_48_to_52(baser);
>  
>  	if (baser & GITS_BASER_INDIRECT) {
>  		l1_esz = GITS_LVL1_ENTRY_SIZE;
> @@ -2298,7 +2282,7 @@ static int vgic_its_save_collection_table(struct vgic_its *its)
>  {
>  	const struct vgic_its_abi *abi = vgic_its_get_abi(its);
>  	u64 baser = its->baser_coll_table;
> -	gpa_t gpa = BASER_ADDRESS(baser);
> +	gpa_t gpa = GITS_BASER_ADDR_48_to_52(baser);
>  	struct its_collection *collection;
>  	u64 val;
>  	size_t max_size, filled = 0;
> @@ -2347,7 +2331,7 @@ static int vgic_its_restore_collection_table(struct vgic_its *its)
>  	if (!(baser & GITS_BASER_VALID))
>  		return 0;
>  
> -	gpa = BASER_ADDRESS(baser);
> +	gpa = GITS_BASER_ADDR_48_to_52(baser);
>  
>  	max_size = GITS_BASER_NR_PAGES(baser) * SZ_64K;
>  
> diff --git a/virt/kvm/arm/vgic/vgic-mmio-v3.c b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> index 2877840..64647be 100644
> --- a/virt/kvm/arm/vgic/vgic-mmio-v3.c
> +++ b/virt/kvm/arm/vgic/vgic-mmio-v3.c
> @@ -338,7 +338,6 @@ static u64 vgic_sanitise_pendbaser(u64 reg)
>  				  vgic_sanitise_outer_cacheability);
>  
>  	reg &= ~PENDBASER_RES0_MASK;
> -	reg &= ~GENMASK_ULL(51, 48);
>  
>  	return reg;
>  }
> @@ -356,7 +355,6 @@ static u64 vgic_sanitise_propbaser(u64 reg)
>  				  vgic_sanitise_outer_cacheability);
>  
>  	reg &= ~PROPBASER_RES0_MASK;
> -	reg &= ~GENMASK_ULL(51, 48);
>  	return reg;
>  }
>  
> 
Besides it looks good to me.

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

Eric Auger July 4, 2018, 8:24 a.m. UTC | #33

Hi Suzuki,

On 07/03/2018 01:54 PM, Suzuki K Poulose wrote:
> Hi Eric,
> 
> On 02/07/18 15:41, Auger Eric wrote:
>> Hi Suzuki,
>>
>> On 06/29/2018 01:15 PM, Suzuki K Poulose wrote:
>>> On arm64 VTTBR_EL2:BADDR holds the base address for the stage2
>>> translation table. The Arm ARM mandates that the bits BADDR[x-1:0]
>>> should be 0, where 'x' is defined for a given IPA Size and the
>>> number of levels for a translation granule size. It is defined
>>> using some magical constants. This patch is a reverse engineered
>>> implementation to calculate the 'x' at runtime for a given ipa and
>>> number of page table levels. See patch for more details.
>>>
>>> Cc: Marc Zyngier <marc.zyngier@arm.com>
>>> Cc: Christoffer Dall <cdall@kernel.org>
>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>> ---
>>> Changes since V2:
>>>   - Part 1 of spilt from VTCR & VTTBR dynamic configuration
>>> ---
>>>   arch/arm64/include/asm/kvm_arm.h | 60
>>> +++++++++++++++++++++++++++++++++++++---
>>>   arch/arm64/include/asm/kvm_mmu.h | 25 ++++++++++++++++-
>>>   2 files changed, 80 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_arm.h
>>> b/arch/arm64/include/asm/kvm_arm.h
>>> index 3dffd38..c557f45 100644
>>> --- a/arch/arm64/include/asm/kvm_arm.h
>>> +++ b/arch/arm64/include/asm/kvm_arm.h
>>> @@ -140,8 +140,6 @@
>>>    * Note that when using 4K pages, we concatenate two first level
>>> page tables
>>>    * together. With 16K pages, we concatenate 16 first level page
>>> tables.
>>>    *
>>> - * The magic numbers used for VTTBR_X in this patch can be found in
>>> Tables
>>> - * D4-23 and D4-25 in ARM DDI 0487A.b.
>> Isn't it a pretty old reference? Could you refer to C.a?
> 
> Sure, I will update the references everywhere.
> 
>>> + *
>>> + * The algorithm defines the expectations on the BaseAddress (for
>>> the page
>>> + * table) bits resolved at each level based on the page size, entry
>>> level
>>> + * and T0SZ. The variable "x" in the algorithm also affects the
>>> VTTBR:BADDR
>>> + * for stage2 page table.
>>> + *
>>> + * The value of "x" is calculated as :
>>> + *    x = Magic_N - T0SZ
>>> + *
>>> + * where Magic_N is an integer depending on the page size and the entry
>>> + * level of the page table as below:
>>> + *
>>> + *    --------------------------------------------
>>> + *    | Entry level        |  4K    16K   64K |
>>> + *    --------------------------------------------
>>> + *    | Level: 0 (4 levels)    | 28   |  -  |  -  |
>>> + *    --------------------------------------------
>>> + *    | Level: 1 (3 levels)    | 37   | 31  | 25  |
>>> + *    --------------------------------------------
>>> + *    | Level: 2 (2 levels)    | 46   | 42  | 38  |
>>> + *    --------------------------------------------
>>> + *    | Level: 3 (1 level)    | -    | 53  | 51  |
>>> + *    --------------------------------------------
>> I understand entry level = Lookup level in the table.
> 
> Entry level => The level at which we start the page table walk for
> a given address (This is in line with the ARM ARM). So,
> 
> Entry_level = (4 - Number_of_Page_table_levels)
> 
>> But you may want to compute x for BaseAddress matching lookup level 2
>> with number of levels = 4.
> 
> No, the BaseAddress is only calcualted for the "Entry_level". So the
> above case doesn't exist at all.
> 
>> So shouldn't you s/Number of levels/4 - entry_level?
> 
> Ok, I now understood what you are referring to [0]
>> for BADDR we want the BaseAddr of the initial lookup level so
>> effectively the entry level we are interested in is 4 - number of levels
>> and we don't care or d) condition. At least this is my understanding ;-)
>> If correct you may slightly reword the explanation?
> 
> 
>>> + *
>>> + * We have a magic formula for the Magic_N below.
>>> + *
>>> + *  Magic_N(PAGE_SIZE, Entry_Level) = 64 - ((PAGE_SHIFT - 3) *
>>> Number of levels)
> 
> [0] ^^^
> 
> 
> 
>>> + *
>>> + * where number of levels = (4 - Entry_Level).
> 
> ^^^ Doesn't this help make it clear ? Using the expansion makes it a bit
> more
> unreadable below.

I just wanted to mention the tables you refer (D4-23 and D4-25) give
Magic_N for a larger scope as they deal with any lookup level while we
only care about the entry level for BADDR. So I was a little bit
confused when reading the explanation but that's not a big deal.

> 
>>>   +/*
>>> + * Get the magic number 'x' for VTTBR:BADDR of this KVM instance.
>>> + * With v8.2 LVA extensions, 'x' should be a minimum of 6 with
>>> + * 52bit IPS.
>> Link to the spec?
> 
> Sure, will add it.
> 
> Thanks for the patience to review :-)
you're welcome ;-)

Eric
> 
> Cheers
> Suzuki

Suzuki K Poulose July 4, 2018, 8:29 a.m. UTC | #34

On 07/04/2018 09:24 AM, Auger Eric wrote:
>>>> + *
>>>> + * We have a magic formula for the Magic_N below.
>>>> + *
>>>> + *  Magic_N(PAGE_SIZE, Entry_Level) = 64 - ((PAGE_SHIFT - 3) *
>>>> Number of levels)
>>
>> [0] ^^^
>>
>>
>>
>>>> + *
>>>> + * where number of levels = (4 - Entry_Level).
>>
>> ^^^ Doesn't this help make it clear ? Using the expansion makes it a bit
>> more
>> unreadable below.
> 
> I just wanted to mention the tables you refer (D4-23 and D4-25) give
> Magic_N for a larger scope as they deal with any lookup level while we
> only care about the entry level for BADDR. So I was a little bit
> confused when reading the explanation but that's not a big deal.

Ah, ok. I will try to clarify it.

Cheers
Suzuki

Will Deacon July 4, 2018, 2:09 p.m. UTC | #35

On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
> Add an option to specify the physical address size used by this
> VM.
> 
> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> ---
>  arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>  arm/include/arm-common/kvm-config-arch.h  | 1 +
>  2 files changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
> index 04be43d..dabd22c 100644
> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
> @@ -8,7 +8,10 @@
>  			"Create PMUv3 device"),				\
>  	OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed,			\
>  			"Specify random seed for Kernel Address Space "	\
> -			"Layout Randomization (KASLR)"),
> +			"Layout Randomization (KASLR)"),		\
> +	OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift,		\
> +			"Specify maximum physical address size (not "	\
> +			"the amount of memory)"),

Given that this is a shift value, I think the help message could be more
informative. Something like:

	"Specify maximum number of bits in a guest physical address"

I think I'd actually leave out any mention of memory, because this does
actually have an effect on the amount of addressable memory in a way that I
don't think we want to describe in half of a usage message line :)

Will

Will Deacon July 4, 2018, 2:22 p.m. UTC | #36

On Fri, Jun 29, 2018 at 12:15:44PM +0100, Suzuki K Poulose wrote:
> diff --git a/arm/kvm.c b/arm/kvm.c
> index 5701d41..b1969be 100644
> --- a/arm/kvm.c
> +++ b/arm/kvm.c
> @@ -11,6 +11,8 @@
>  #include <linux/kvm.h>
>  #include <linux/sizes.h>
>  
> +unsigned long kvm_arm_type;
> +
>  struct kvm_ext kvm_req_ext[] = {
>  	{ DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
>  	{ DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
> @@ -18,6 +20,26 @@ struct kvm_ext kvm_req_ext[] = {
>  	{ 0, 0 },
>  };
>  
> +#ifndef KVM_ARM_GET_MAX_VM_PHYS_SHIFT
> +#define KVM_ARM_GET_MAX_VM_PHYS_SHIFT		_IO(KVMIO, 0x0b)
> +#endif
> +
> +void kvm__arch_init_hyp(struct kvm *kvm)
> +{
> +	int max_ipa;
> +
> +	max_ipa = ioctl(kvm->sys_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT);
> +	if (max_ipa < 0)
> +		max_ipa = 40;
> +	if (!kvm->cfg.arch.phys_shift)
> +		kvm->cfg.arch.phys_shift = 40;
> +	if (kvm->cfg.arch.phys_shift > max_ipa)
> +		die("Requested PA size (%u) is not supported by the host (%ubits)\n",
> +		    kvm->cfg.arch.phys_shift, max_ipa);
> +	if (kvm->cfg.arch.phys_shift != 40)
> +		kvm_arm_type = kvm->cfg.arch.phys_shift;
> +}

Seems a bit weird that the "machine type identifier" to KVM_CREATE_VM is
dedicated entirely to holding the physical address shift verbatim. Is this
really the ABI?

Also, couldn't KVM figure it out automatically if you add memslots at high
addresses, making this a niche tunable outside of testing?

Will

Marc Zyngier July 4, 2018, 2:41 p.m. UTC | #37

On Wed, 04 Jul 2018 15:22:42 +0100,
Will Deacon <will.deacon@arm.com> wrote:
> 
> On Fri, Jun 29, 2018 at 12:15:44PM +0100, Suzuki K Poulose wrote:
> > diff --git a/arm/kvm.c b/arm/kvm.c
> > index 5701d41..b1969be 100644
> > --- a/arm/kvm.c
> > +++ b/arm/kvm.c
> > @@ -11,6 +11,8 @@
> >  #include <linux/kvm.h>
> >  #include <linux/sizes.h>
> >  
> > +unsigned long kvm_arm_type;
> > +
> >  struct kvm_ext kvm_req_ext[] = {
> >  	{ DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
> >  	{ DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
> > @@ -18,6 +20,26 @@ struct kvm_ext kvm_req_ext[] = {
> >  	{ 0, 0 },
> >  };
> >  
> > +#ifndef KVM_ARM_GET_MAX_VM_PHYS_SHIFT
> > +#define KVM_ARM_GET_MAX_VM_PHYS_SHIFT		_IO(KVMIO, 0x0b)
> > +#endif
> > +
> > +void kvm__arch_init_hyp(struct kvm *kvm)
> > +{
> > +	int max_ipa;
> > +
> > +	max_ipa = ioctl(kvm->sys_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT);
> > +	if (max_ipa < 0)
> > +		max_ipa = 40;
> > +	if (!kvm->cfg.arch.phys_shift)
> > +		kvm->cfg.arch.phys_shift = 40;
> > +	if (kvm->cfg.arch.phys_shift > max_ipa)
> > +		die("Requested PA size (%u) is not supported by the host (%ubits)\n",
> > +		    kvm->cfg.arch.phys_shift, max_ipa);
> > +	if (kvm->cfg.arch.phys_shift != 40)
> > +		kvm_arm_type = kvm->cfg.arch.phys_shift;
> > +}
> 
> Seems a bit weird that the "machine type identifier" to KVM_CREATE_VM is
> dedicated entirely to holding the physical address shift verbatim. Is this
> really the ABI?
> 
> Also, couldn't KVM figure it out automatically if you add memslots at high
> addresses, making this a niche tunable outside of testing?

Not really. Let's say I want my IPA space split in two: memory covers
the low 47 bit, and I want MMIO spanning the top 47 bit. With your
scheme, you'd end-up with a 47bit IPA space, while you really want 48
bits (MMIO space implemented by userspace isn't registered to the
kernel).

	M.

Julien Grall July 4, 2018, 3 p.m. UTC | #38

Hi,

On 04/07/18 15:09, Will Deacon wrote:
> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>> Add an option to specify the physical address size used by this
>> VM.
>>
>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>> ---
>>   arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>   arm/include/arm-common/kvm-config-arch.h  | 1 +
>>   2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>> index 04be43d..dabd22c 100644
>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>> @@ -8,7 +8,10 @@
>>   			"Create PMUv3 device"),				\
>>   	OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed,			\
>>   			"Specify random seed for Kernel Address Space "	\
>> -			"Layout Randomization (KASLR)"),
>> +			"Layout Randomization (KASLR)"),		\
>> +	OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift,		\
>> +			"Specify maximum physical address size (not "	\
>> +			"the amount of memory)"),
> 
> Given that this is a shift value, I think the help message could be more
> informative. Something like:
> 
> 	"Specify maximum number of bits in a guest physical address"
> 
> I think I'd actually leave out any mention of memory, because this does
> actually have an effect on the amount of addressable memory in a way that I
> don't think we want to describe in half of a usage message line :)
Is there any particular reasons to expose this option to the user?

I have recently sent a series to allow the user to specify the position
of the RAM [1]. With that series in mind, I think the user would not 
really need to specify the maximum physical shift. Instead we could 
automatically find it.

Cheers,

[1] 
http://archive.armlinux.org.uk/lurker/message/20180510.140428.1c295b5b.en.html

> 
> Will
>

Will Deacon July 4, 2018, 3:51 p.m. UTC | #39

On Wed, Jul 04, 2018 at 03:41:18PM +0100, Marc Zyngier wrote:
> On Wed, 04 Jul 2018 15:22:42 +0100,
> Will Deacon <will.deacon@arm.com> wrote:
> > 
> > On Fri, Jun 29, 2018 at 12:15:44PM +0100, Suzuki K Poulose wrote:
> > > diff --git a/arm/kvm.c b/arm/kvm.c
> > > index 5701d41..b1969be 100644
> > > --- a/arm/kvm.c
> > > +++ b/arm/kvm.c
> > > @@ -11,6 +11,8 @@
> > >  #include <linux/kvm.h>
> > >  #include <linux/sizes.h>
> > >  
> > > +unsigned long kvm_arm_type;
> > > +
> > >  struct kvm_ext kvm_req_ext[] = {
> > >  	{ DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
> > >  	{ DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
> > > @@ -18,6 +20,26 @@ struct kvm_ext kvm_req_ext[] = {
> > >  	{ 0, 0 },
> > >  };
> > >  
> > > +#ifndef KVM_ARM_GET_MAX_VM_PHYS_SHIFT
> > > +#define KVM_ARM_GET_MAX_VM_PHYS_SHIFT		_IO(KVMIO, 0x0b)
> > > +#endif
> > > +
> > > +void kvm__arch_init_hyp(struct kvm *kvm)
> > > +{
> > > +	int max_ipa;
> > > +
> > > +	max_ipa = ioctl(kvm->sys_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT);
> > > +	if (max_ipa < 0)
> > > +		max_ipa = 40;
> > > +	if (!kvm->cfg.arch.phys_shift)
> > > +		kvm->cfg.arch.phys_shift = 40;
> > > +	if (kvm->cfg.arch.phys_shift > max_ipa)
> > > +		die("Requested PA size (%u) is not supported by the host (%ubits)\n",
> > > +		    kvm->cfg.arch.phys_shift, max_ipa);
> > > +	if (kvm->cfg.arch.phys_shift != 40)
> > > +		kvm_arm_type = kvm->cfg.arch.phys_shift;
> > > +}
> > 
> > Seems a bit weird that the "machine type identifier" to KVM_CREATE_VM is
> > dedicated entirely to holding the physical address shift verbatim. Is this
> > really the ABI?
> > 
> > Also, couldn't KVM figure it out automatically if you add memslots at high
> > addresses, making this a niche tunable outside of testing?
> 
> Not really. Let's say I want my IPA space split in two: memory covers
> the low 47 bit, and I want MMIO spanning the top 47 bit. With your
> scheme, you'd end-up with a 47bit IPA space, while you really want 48
> bits (MMIO space implemented by userspace isn't registered to the
> kernel).

That still sounds quite niche for a VM. Does QEMU do that? In any case,
having KVM automatically increase the IPA bits to cover the memslots it
knows about would make sense to me, and also be sufficient for kvmtool
without us having to add an extra command-line argument.

The MMIO case might be better dealt with by having a way to register MMIO
regions rather than having the PA bits exposed directly.

Will

Will Deacon July 4, 2018, 3:52 p.m. UTC | #40

On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
> On 04/07/18 15:09, Will Deacon wrote:
> >On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
> >>Add an option to specify the physical address size used by this
> >>VM.
> >>
> >>Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
> >>---
> >>  arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
> >>  arm/include/arm-common/kvm-config-arch.h  | 1 +
> >>  2 files changed, 5 insertions(+), 1 deletion(-)
> >>
> >>diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
> >>index 04be43d..dabd22c 100644
> >>--- a/arm/aarch64/include/kvm/kvm-config-arch.h
> >>+++ b/arm/aarch64/include/kvm/kvm-config-arch.h
> >>@@ -8,7 +8,10 @@
> >>  			"Create PMUv3 device"),				\
> >>  	OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed,			\
> >>  			"Specify random seed for Kernel Address Space "	\
> >>-			"Layout Randomization (KASLR)"),
> >>+			"Layout Randomization (KASLR)"),		\
> >>+	OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift,		\
> >>+			"Specify maximum physical address size (not "	\
> >>+			"the amount of memory)"),
> >
> >Given that this is a shift value, I think the help message could be more
> >informative. Something like:
> >
> >	"Specify maximum number of bits in a guest physical address"
> >
> >I think I'd actually leave out any mention of memory, because this does
> >actually have an effect on the amount of addressable memory in a way that I
> >don't think we want to describe in half of a usage message line :)
> Is there any particular reasons to expose this option to the user?
> 
> I have recently sent a series to allow the user to specify the position
> of the RAM [1]. With that series in mind, I think the user would not really
> need to specify the maximum physical shift. Instead we could automatically
> find it.

Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
to understand whether we can do something differently there and avoid
sacrificing the type parameter.

Will

Suzuki K Poulose July 4, 2018, 3:58 p.m. UTC | #41

Hi Will,

On 07/04/2018 03:22 PM, Will Deacon wrote:
> On Fri, Jun 29, 2018 at 12:15:44PM +0100, Suzuki K Poulose wrote:
>> diff --git a/arm/kvm.c b/arm/kvm.c
>> index 5701d41..b1969be 100644
>> --- a/arm/kvm.c
>> +++ b/arm/kvm.c
>> @@ -11,6 +11,8 @@
>>   #include <linux/kvm.h>
>>   #include <linux/sizes.h>
>>   
>> +unsigned long kvm_arm_type;
>> +
>>   struct kvm_ext kvm_req_ext[] = {
>>   	{ DEFINE_KVM_EXT(KVM_CAP_IRQCHIP) },
>>   	{ DEFINE_KVM_EXT(KVM_CAP_ONE_REG) },
>> @@ -18,6 +20,26 @@ struct kvm_ext kvm_req_ext[] = {
>>   	{ 0, 0 },
>>   };
>>   
>> +#ifndef KVM_ARM_GET_MAX_VM_PHYS_SHIFT
>> +#define KVM_ARM_GET_MAX_VM_PHYS_SHIFT		_IO(KVMIO, 0x0b)
>> +#endif
>> +
>> +void kvm__arch_init_hyp(struct kvm *kvm)
>> +{
>> +	int max_ipa;
>> +
>> +	max_ipa = ioctl(kvm->sys_fd, KVM_ARM_GET_MAX_VM_PHYS_SHIFT);
>> +	if (max_ipa < 0)
>> +		max_ipa = 40;
>> +	if (!kvm->cfg.arch.phys_shift)
>> +		kvm->cfg.arch.phys_shift = 40;
>> +	if (kvm->cfg.arch.phys_shift > max_ipa)
>> +		die("Requested PA size (%u) is not supported by the host (%ubits)\n",
>> +		    kvm->cfg.arch.phys_shift, max_ipa);
>> +	if (kvm->cfg.arch.phys_shift != 40)
>> +		kvm_arm_type = kvm->cfg.arch.phys_shift;
>> +}
> 
> Seems a bit weird that the "machine type identifier" to KVM_CREATE_VM is
> dedicated entirely to holding the physical address shift verbatim. Is this
> really the ABI?

The bits[7:0] of the machine type has been reserved for the IPA shift.
This version is missing the updates to the ABI documentation, I have it
for the next version.

> 
> Also, couldn't KVM figure it out automatically if you add memslots at high
> addresses, making this a niche tunable outside of testing?

The stage2 pgd size is really dependent on the max IPA. Also, unlike the 
stage1 (where the maximum size will be 1 page), the size can go upto 16
pages (and different number of levels due to concatenation), so we need
to finalize this at least before the first memory gets mapped (RAM or 
Device). That implies, we cannot wait until all the memory slots are
created.

The first version of the series added a separate ioctl for specifying
the limit, which had its own complexities. So, this ABI was suggested
to keep things simpler.

Suzuki

Peter Maydell July 5, 2018, 7:51 a.m. UTC | #42

On 4 July 2018 at 16:51, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Jul 04, 2018 at 03:41:18PM +0100, Marc Zyngier wrote:
>> Not really. Let's say I want my IPA space split in two: memory covers
>> the low 47 bit, and I want MMIO spanning the top 47 bit. With your
>> scheme, you'd end-up with a 47bit IPA space, while you really want 48
>> bits (MMIO space implemented by userspace isn't registered to the
>> kernel).
>
> That still sounds quite niche for a VM. Does QEMU do that?

Not at 47 bits, but we have RAM up to the 256GB mark, and
MMIO above that (including a large PCI window), so the general
arrangement of having the top end of the IPA space not
necessarily be things we've told the kernel about definitely
exists.

thanks
-- PMM

Eric Auger July 5, 2018, 7:58 a.m. UTC | #43

Hi,

On 07/05/2018 09:51 AM, Peter Maydell wrote:
> On 4 July 2018 at 16:51, Will Deacon <will.deacon@arm.com> wrote:
>> On Wed, Jul 04, 2018 at 03:41:18PM +0100, Marc Zyngier wrote:
>>> Not really. Let's say I want my IPA space split in two: memory covers
>>> the low 47 bit, and I want MMIO spanning the top 47 bit. With your
>>> scheme, you'd end-up with a 47bit IPA space, while you really want 48
>>> bits (MMIO space implemented by userspace isn't registered to the
>>> kernel).
>>
>> That still sounds quite niche for a VM. Does QEMU do that?
> 
> Not at 47 bits, but we have RAM up to the 256GB mark, and
> MMIO above that (including a large PCI window), so the general
> arrangement of having the top end of the IPA space not
> necessarily be things we've told the kernel about definitely
> exists.

Is this document (2012) still a reference document?
http://infocenter.arm.com/help/topic/com.arm.doc.den0001c/DEN0001C_principles_of_arm_memory_maps.pdf
(especially Fig 5?)

Peter, comments in QEMU hw/arm/virt.c suggested next RAM chunk should be
added at 2TB. This doc suggests to put it at 8TB. I understand the PA
memory map only is suggested but shouldn't we align?

Thanks

Eric


> 
> thanks
> -- PMM
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
>

Julien Grall July 5, 2018, 12:47 p.m. UTC | #44

Hi Will,

On 04/07/18 16:52, Will Deacon wrote:
> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>> On 04/07/18 15:09, Will Deacon wrote:
>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>> Add an option to specify the physical address size used by this
>>>> VM.
>>>>
>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>>> ---
>>>>   arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>>   arm/include/arm-common/kvm-config-arch.h  | 1 +
>>>>   2 files changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>> index 04be43d..dabd22c 100644
>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>> @@ -8,7 +8,10 @@
>>>>   			"Create PMUv3 device"),				\
>>>>   	OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed,			\
>>>>   			"Specify random seed for Kernel Address Space "	\
>>>> -			"Layout Randomization (KASLR)"),
>>>> +			"Layout Randomization (KASLR)"),		\
>>>> +	OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift,		\
>>>> +			"Specify maximum physical address size (not "	\
>>>> +			"the amount of memory)"),
>>>
>>> Given that this is a shift value, I think the help message could be more
>>> informative. Something like:
>>>
>>> 	"Specify maximum number of bits in a guest physical address"
>>>
>>> I think I'd actually leave out any mention of memory, because this does
>>> actually have an effect on the amount of addressable memory in a way that I
>>> don't think we want to describe in half of a usage message line :)
>> Is there any particular reasons to expose this option to the user?
>>
>> I have recently sent a series to allow the user to specify the position
>> of the RAM [1]. With that series in mind, I think the user would not really
>> need to specify the maximum physical shift. Instead we could automatically
>> find it.
> 
> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
> to understand whether we can do something differently there and avoid
> sacrificing the type parameter.

I am not sure to understand this. kvmtools knows the memory layout 
(including MMIOs) of the guest, so couldn't it guess the maximum 
physical shift for that?

Cheers,

Marc Zyngier July 5, 2018, 1:20 p.m. UTC | #45

On 05/07/18 13:47, Julien Grall wrote:
> Hi Will,
> 
> On 04/07/18 16:52, Will Deacon wrote:
>> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>>> On 04/07/18 15:09, Will Deacon wrote:
>>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>>> Add an option to specify the physical address size used by this
>>>>> VM.
>>>>>
>>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>>>> ---
>>>>>   arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>>>   arm/include/arm-common/kvm-config-arch.h  | 1 +
>>>>>   2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>> index 04be43d..dabd22c 100644
>>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>> @@ -8,7 +8,10 @@
>>>>>   			"Create PMUv3 device"),				\
>>>>>   	OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed,			\
>>>>>   			"Specify random seed for Kernel Address Space "	\
>>>>> -			"Layout Randomization (KASLR)"),
>>>>> +			"Layout Randomization (KASLR)"),		\
>>>>> +	OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift,		\
>>>>> +			"Specify maximum physical address size (not "	\
>>>>> +			"the amount of memory)"),
>>>>
>>>> Given that this is a shift value, I think the help message could be more
>>>> informative. Something like:
>>>>
>>>> 	"Specify maximum number of bits in a guest physical address"
>>>>
>>>> I think I'd actually leave out any mention of memory, because this does
>>>> actually have an effect on the amount of addressable memory in a way that I
>>>> don't think we want to describe in half of a usage message line :)
>>> Is there any particular reasons to expose this option to the user?
>>>
>>> I have recently sent a series to allow the user to specify the position
>>> of the RAM [1]. With that series in mind, I think the user would not really
>>> need to specify the maximum physical shift. Instead we could automatically
>>> find it.
>>
>> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
>> to understand whether we can do something differently there and avoid
>> sacrificing the type parameter.
> 
> I am not sure to understand this. kvmtools knows the memory layout 
> (including MMIOs) of the guest, so couldn't it guess the maximum 
> physical shift for that?

That's exactly what Will was trying to avoid, by having KVM to compute
the size of the IPA space based on the registered memslots. We've now
established that it doesn't work, so what we need to define is:

- whether we need another ioctl(), or do we carry on piggy-backing on
the CPU type,
- assuming the latter, whether we can reduce the number of bits used in
the ioctl parameter by subtly encoding the IPA size.

Thanks,

	M.

Eric Auger July 5, 2018, 1:46 p.m. UTC | #46

Hi Marc,

On 07/05/2018 03:20 PM, Marc Zyngier wrote:
> On 05/07/18 13:47, Julien Grall wrote:
>> Hi Will,
>>
>> On 04/07/18 16:52, Will Deacon wrote:
>>> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>>>> On 04/07/18 15:09, Will Deacon wrote:
>>>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>>>> Add an option to specify the physical address size used by this
>>>>>> VM.
>>>>>>
>>>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>>>>> ---
>>>>>>   arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>>>>   arm/include/arm-common/kvm-config-arch.h  | 1 +
>>>>>>   2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>> index 04be43d..dabd22c 100644
>>>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>> @@ -8,7 +8,10 @@
>>>>>>   			"Create PMUv3 device"),				\
>>>>>>   	OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed,			\
>>>>>>   			"Specify random seed for Kernel Address Space "	\
>>>>>> -			"Layout Randomization (KASLR)"),
>>>>>> +			"Layout Randomization (KASLR)"),		\
>>>>>> +	OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift,		\
>>>>>> +			"Specify maximum physical address size (not "	\
>>>>>> +			"the amount of memory)"),
>>>>>
>>>>> Given that this is a shift value, I think the help message could be more
>>>>> informative. Something like:
>>>>>
>>>>> 	"Specify maximum number of bits in a guest physical address"
>>>>>
>>>>> I think I'd actually leave out any mention of memory, because this does
>>>>> actually have an effect on the amount of addressable memory in a way that I
>>>>> don't think we want to describe in half of a usage message line :)
>>>> Is there any particular reasons to expose this option to the user?
>>>>
>>>> I have recently sent a series to allow the user to specify the position
>>>> of the RAM [1]. With that series in mind, I think the user would not really
>>>> need to specify the maximum physical shift. Instead we could automatically
>>>> find it.
>>>
>>> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
>>> to understand whether we can do something differently there and avoid
>>> sacrificing the type parameter.
>>
>> I am not sure to understand this. kvmtools knows the memory layout 
>> (including MMIOs) of the guest, so couldn't it guess the maximum 
>> physical shift for that?
> 
> That's exactly what Will was trying to avoid, by having KVM to compute
> the size of the IPA space based on the registered memslots. We've now
> established that it doesn't work, so what we need to define is:
> 
> - whether we need another ioctl(), or do we carry on piggy-backing on
> the CPU type,
kvm type I guess
> - assuming the latter, whether we can reduce the number of bits used in
> the ioctl parameter by subtly encoding the IPA size.
Getting benefit from your Freudian slip, how should guest CPU PARange
and maximum number of bits in a guest physical address relate?

My understanding is they are not correlated at the moment and our guest
PARange is fixed at the moment. But shouldn't they?

On Intel there is
   qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,phys-bits=36
or
   qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,host-phys-bits=true

where phys-bits, as far as I understand has a a similar semantics as the
PARange.

Thanks

Eric
> 
> Thanks,
> 
> 	M.
>

Suzuki K Poulose July 5, 2018, 2:12 p.m. UTC | #47

On 05/07/18 14:46, Auger Eric wrote:
> Hi Marc,
> 
> On 07/05/2018 03:20 PM, Marc Zyngier wrote:
>> On 05/07/18 13:47, Julien Grall wrote:
>>> Hi Will,
>>>
>>> On 04/07/18 16:52, Will Deacon wrote:
>>>> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>>>>> On 04/07/18 15:09, Will Deacon wrote:
>>>>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>>>>> Add an option to specify the physical address size used by this
>>>>>>> VM.
>>>>>>>
>>>>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>>>>>> ---
>>>>>>>    arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>>>>>    arm/include/arm-common/kvm-config-arch.h  | 1 +
>>>>>>>    2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> index 04be43d..dabd22c 100644
>>>>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> @@ -8,7 +8,10 @@
>>>>>>>    			"Create PMUv3 device"),				\
>>>>>>>    	OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed,			\
>>>>>>>    			"Specify random seed for Kernel Address Space "	\
>>>>>>> -			"Layout Randomization (KASLR)"),
>>>>>>> +			"Layout Randomization (KASLR)"),		\
>>>>>>> +	OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift,		\
>>>>>>> +			"Specify maximum physical address size (not "	\
>>>>>>> +			"the amount of memory)"),
>>>>>>
>>>>>> Given that this is a shift value, I think the help message could be more
>>>>>> informative. Something like:
>>>>>>
>>>>>> 	"Specify maximum number of bits in a guest physical address"
>>>>>>
>>>>>> I think I'd actually leave out any mention of memory, because this does
>>>>>> actually have an effect on the amount of addressable memory in a way that I
>>>>>> don't think we want to describe in half of a usage message line :)
>>>>> Is there any particular reasons to expose this option to the user?
>>>>>
>>>>> I have recently sent a series to allow the user to specify the position
>>>>> of the RAM [1]. With that series in mind, I think the user would not really
>>>>> need to specify the maximum physical shift. Instead we could automatically
>>>>> find it.
>>>>
>>>> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
>>>> to understand whether we can do something differently there and avoid
>>>> sacrificing the type parameter.
>>>
>>> I am not sure to understand this. kvmtools knows the memory layout
>>> (including MMIOs) of the guest, so couldn't it guess the maximum
>>> physical shift for that?
>>
>> That's exactly what Will was trying to avoid, by having KVM to compute
>> the size of the IPA space based on the registered memslots. We've now
>> established that it doesn't work, so what we need to define is:
>>
>> - whether we need another ioctl(), or do we carry on piggy-backing on
>> the CPU type,
> kvm type I guess

machine type is more appropriate, going by the existing users.

>> - assuming the latter, whether we can reduce the number of bits used in
>> the ioctl parameter by subtly encoding the IPA size.
> Getting benefit from your Freudian slip, how should guest CPU PARange
> and maximum number of bits in a guest physical address relate?
> 
> My understanding is they are not correlated at the moment and our guest
> PARange is fixed at the moment. But shouldn't they?
> 
> On Intel there is
>     qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,phys-bits=36
> or
>     qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,host-phys-bits=true
> 
> where phys-bits, as far as I understand has a a similar semantics as the
> PARange.


AFAIT, PARange tells you the maximum (I)Physcial Address that can be handled
by the CPU. But your IPA limit tells you where the guest RAM is placed.
So they need not be the same. e.g, on Juno, A57's have a PARange of 42 if I am
not wrong (but definitely > 40), while A53's have it at 40 and the system RAM
is at 40bits.

So, if we were to only use the A57s on Juno, we could run a KVM instance with 42
bits IPA or anything lower. So, PARange can be inferred as the maximum limit
of the CPU's capability while the IPA is where the RAM is placed for a given
system.
One could keep them in sync for a VM by emulating, but then nobody
uses the PARange, except the KVM. The other problem with capping PARange in the VM
to IPA is restricting the IPA size of a nested VM. So, I don't think this is
really beneficial.

Cheers
Suzuki


> 
> Thanks
> 
> Eric
>>
>> Thanks,
>>
>> 	M.
>>

Marc Zyngier July 5, 2018, 2:15 p.m. UTC | #48

Hi Eric,

On 05/07/18 14:46, Auger Eric wrote:
> Hi Marc,
> 
> On 07/05/2018 03:20 PM, Marc Zyngier wrote:
>> On 05/07/18 13:47, Julien Grall wrote:
>>> Hi Will,
>>>
>>> On 04/07/18 16:52, Will Deacon wrote:
>>>> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>>>>> On 04/07/18 15:09, Will Deacon wrote:
>>>>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>>>>> Add an option to specify the physical address size used by this
>>>>>>> VM.
>>>>>>>
>>>>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>>>>>> ---
>>>>>>>   arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>>>>>   arm/include/arm-common/kvm-config-arch.h  | 1 +
>>>>>>>   2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> index 04be43d..dabd22c 100644
>>>>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>> @@ -8,7 +8,10 @@
>>>>>>>   			"Create PMUv3 device"),				\
>>>>>>>   	OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed,			\
>>>>>>>   			"Specify random seed for Kernel Address Space "	\
>>>>>>> -			"Layout Randomization (KASLR)"),
>>>>>>> +			"Layout Randomization (KASLR)"),		\
>>>>>>> +	OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift,		\
>>>>>>> +			"Specify maximum physical address size (not "	\
>>>>>>> +			"the amount of memory)"),
>>>>>>
>>>>>> Given that this is a shift value, I think the help message could be more
>>>>>> informative. Something like:
>>>>>>
>>>>>> 	"Specify maximum number of bits in a guest physical address"
>>>>>>
>>>>>> I think I'd actually leave out any mention of memory, because this does
>>>>>> actually have an effect on the amount of addressable memory in a way that I
>>>>>> don't think we want to describe in half of a usage message line :)
>>>>> Is there any particular reasons to expose this option to the user?
>>>>>
>>>>> I have recently sent a series to allow the user to specify the position
>>>>> of the RAM [1]. With that series in mind, I think the user would not really
>>>>> need to specify the maximum physical shift. Instead we could automatically
>>>>> find it.
>>>>
>>>> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
>>>> to understand whether we can do something differently there and avoid
>>>> sacrificing the type parameter.
>>>
>>> I am not sure to understand this. kvmtools knows the memory layout 
>>> (including MMIOs) of the guest, so couldn't it guess the maximum 
>>> physical shift for that?
>>
>> That's exactly what Will was trying to avoid, by having KVM to compute
>> the size of the IPA space based on the registered memslots. We've now
>> established that it doesn't work, so what we need to define is:
>>
>> - whether we need another ioctl(), or do we carry on piggy-backing on
>> the CPU type,
> kvm type I guess

I really meant target here. Whatever you pass as a "-cpu" on your QEMU
command line.

>> - assuming the latter, whether we can reduce the number of bits used in
>> the ioctl parameter by subtly encoding the IPA size.
> Getting benefit from your Freudian slip, how should guest CPU PARange
> and maximum number of bits in a guest physical address relate?

Freudian? I'm not on the sofa yet... ;-)

> My understanding is they are not correlated at the moment and our guest
> PARange is fixed at the moment. But shouldn't they?
> 
> On Intel there is
>    qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,phys-bits=36
> or
>    qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,host-phys-bits=true
> 
> where phys-bits, as far as I understand has a a similar semantics as the
> PARange.

I think there is value in having it global, just like on x86. We don't
really support heterogeneous guests anyway.

Independently, we should also repaint/satinize PARange so that the guest
observes the same thing, no matter what CPU it runs on (an A53/A57
system could be confusing in that respect).

Thanks,

	M.

Eric Auger July 5, 2018, 2:37 p.m. UTC | #49

Hi Suzuki, Marc,

On 07/05/2018 04:15 PM, Marc Zyngier wrote:
> Hi Eric,
> 
> On 05/07/18 14:46, Auger Eric wrote:
>> Hi Marc,
>>
>> On 07/05/2018 03:20 PM, Marc Zyngier wrote:
>>> On 05/07/18 13:47, Julien Grall wrote:
>>>> Hi Will,
>>>>
>>>> On 04/07/18 16:52, Will Deacon wrote:
>>>>> On Wed, Jul 04, 2018 at 04:00:11PM +0100, Julien Grall wrote:
>>>>>> On 04/07/18 15:09, Will Deacon wrote:
>>>>>>> On Fri, Jun 29, 2018 at 12:15:42PM +0100, Suzuki K Poulose wrote:
>>>>>>>> Add an option to specify the physical address size used by this
>>>>>>>> VM.
>>>>>>>>
>>>>>>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
>>>>>>>> ---
>>>>>>>>   arm/aarch64/include/kvm/kvm-config-arch.h | 5 ++++-
>>>>>>>>   arm/include/arm-common/kvm-config-arch.h  | 1 +
>>>>>>>>   2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/arm/aarch64/include/kvm/kvm-config-arch.h b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>>> index 04be43d..dabd22c 100644
>>>>>>>> --- a/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>>> +++ b/arm/aarch64/include/kvm/kvm-config-arch.h
>>>>>>>> @@ -8,7 +8,10 @@
>>>>>>>>   			"Create PMUv3 device"),				\
>>>>>>>>   	OPT_U64('\0', "kaslr-seed", &(cfg)->kaslr_seed,			\
>>>>>>>>   			"Specify random seed for Kernel Address Space "	\
>>>>>>>> -			"Layout Randomization (KASLR)"),
>>>>>>>> +			"Layout Randomization (KASLR)"),		\
>>>>>>>> +	OPT_INTEGER('\0', "phys-shift", &(cfg)->phys_shift,		\
>>>>>>>> +			"Specify maximum physical address size (not "	\
>>>>>>>> +			"the amount of memory)"),
>>>>>>>
>>>>>>> Given that this is a shift value, I think the help message could be more
>>>>>>> informative. Something like:
>>>>>>>
>>>>>>> 	"Specify maximum number of bits in a guest physical address"
>>>>>>>
>>>>>>> I think I'd actually leave out any mention of memory, because this does
>>>>>>> actually have an effect on the amount of addressable memory in a way that I
>>>>>>> don't think we want to describe in half of a usage message line :)
>>>>>> Is there any particular reasons to expose this option to the user?
>>>>>>
>>>>>> I have recently sent a series to allow the user to specify the position
>>>>>> of the RAM [1]. With that series in mind, I think the user would not really
>>>>>> need to specify the maximum physical shift. Instead we could automatically
>>>>>> find it.
>>>>>
>>>>> Marc makes a good point that it doesn't help for MMIO regions, so I'm trying
>>>>> to understand whether we can do something differently there and avoid
>>>>> sacrificing the type parameter.
>>>>
>>>> I am not sure to understand this. kvmtools knows the memory layout 
>>>> (including MMIOs) of the guest, so couldn't it guess the maximum 
>>>> physical shift for that?
>>>
>>> That's exactly what Will was trying to avoid, by having KVM to compute
>>> the size of the IPA space based on the registered memslots. We've now
>>> established that it doesn't work, so what we need to define is:
>>>
>>> - whether we need another ioctl(), or do we carry on piggy-backing on
>>> the CPU type,
>> kvm type I guess
> 
> I really meant target here. Whatever you pass as a "-cpu" on your QEMU
> command line.
Oh OK. It was not a slip then ;-)
> 
>>> - assuming the latter, whether we can reduce the number of bits used in
>>> the ioctl parameter by subtly encoding the IPA size.
>> Getting benefit from your Freudian slip, how should guest CPU PARange
>> and maximum number of bits in a guest physical address relate?
> 
> Freudian? I'm not on the sofa yet... ;-)
> 
>> My understanding is they are not correlated at the moment and our guest
>> PARange is fixed at the moment. But shouldn't they?
>>
>> On Intel there is
>>    qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,phys-bits=36
>> or
>>    qemu-system-x86_64 -M pc,accel=kvm -cpu SandyBridge,host-phys-bits=true
>>
>> where phys-bits, as far as I understand has a a similar semantics as the
>> PARange.
> 
> I think there is value in having it global, just like on x86. We don't
> really support heterogeneous guests anyway.

Assuming we would use such a ",phys-bits=n" cpu option, is my
understanding correct that it would set both
- guest CPU PARange an
- maximum number of bits in a guest physical address
to n?

Thanks

Eric
> 
> Independently, we should also repaint/satinize PARange so that the guest
> observes the same thing, no matter what CPU it runs on (an A53/A57
> system could be confusing in that respect).

> 
> Thanks,
> 
> 	M.
>

[v3,00/20] arm64: Dynamic & 52bit IPA support

Message

Comments