Message ID | 20180926163258.20218-19-suzuki.poulose@arm.com |
---|---|
State | New |
Headers | show |
Series | kvm: arm64: Dynamic IPA and 52bit IPA | expand |
Hi, On 9/26/18 6:32 PM, Suzuki K Poulose wrote: > Allow specifying the physical address size limit for a new > VM via the kvm_type argument for the KVM_CREATE_VM ioctl. This > allows us to finalise the stage2 page table as early as possible > and hence perform the right checks on the memory slots > without complication. The size is encoded as Log2(PA_Size) in > bits[7:0] of the type field. For backward compatibility the > value 0 is reserved and implies 40bits. Also, lift the limit > of the IPA to host limit and allow lower IPA sizes (e.g, 32). > > The userspace could check the extension KVM_CAP_ARM_VM_IPA_SIZE > for the availability of this feature. The cap check returns the > maximum limit for the physical address shift supported by the host. > > Cc: Marc Zyngier <marc.zyngier@arm.com> > Cc: Christoffer Dall <cdall@kernel.org> > Cc: Peter Maydell <peter.maydell@linaro.org> > Cc: Paolo Bonzini <pbonzini@redhat.com> > Cc: Radim Krčmář <rkrcmar@redhat.com> > Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> > --- > Changes since v5: > - Rename the capability to KVM_CAP_ARM_VM_IPA_SIZE > - Update Documentation of the API (Peter Maydell) > - Fix comment/commit-description as spotted by Eric > Changes since v4: > - Fold the introduction of the KVM_CAP_ARM_VM_PHYS_SHIFT to this > patch to allow detection of the availability of the feature for > userspace. > - Document the API > - Restrict the feature only to arm64. > Changes since V3: > - Switch to a CAP, that can be checkd via EXTENSIONS on KVM device > fd, rather than a dedicated ioctl. > --- > Documentation/virtual/kvm/api.txt | 31 +++++++++++++++++++++++++ > arch/arm64/include/asm/stage2_pgtable.h | 20 ---------------- > arch/arm64/kvm/reset.c | 17 ++++++++++---- > include/uapi/linux/kvm.h | 10 ++++++++ > 4 files changed, 54 insertions(+), 24 deletions(-) > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt > index c664064f76fb..54eb7c763c89 100644 > --- a/Documentation/virtual/kvm/api.txt > +++ b/Documentation/virtual/kvm/api.txt > @@ -123,6 +123,37 @@ memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the > flag KVM_VM_MIPS_VZ. > > > +On arm64, the physical address size for a VM (IPA Size limit) is limited > +to 40bits by default. The limit can be configured if the host supports the > +extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use > +KVM_VM_TYPE_ARM_IPA_SIZE(IPA_Bits) to set the size in the machine type > +identifier, where IPA_Bits is the maximum width of any physical > +address used by the VM. The IPA_Bits is encoded in bits[7-0] of the > +machine type identifier. > + > +e.g, to configure a guest to use 48bit physical address size : > + > + vm_fd = ioctl(dev_fd, KVM_CREATE_VM, KVM_VM_TYPE_ARM_IPA_SIZE(48)); > + > +The requested size (IPA_Bits) must be : > + 0 - Implies default size, 40bits (for backward compatibility) > + > + or > + > + N - Implies N bits, where N is a positive integer such that, > + 32 <= N <= Host_IPA_Limit > + > +Host_IPA_Limit is the maximum possible value for IPA_Bits on the host and > +is dependent on the CPU capability and the kernel configuration. The limit can > +be retrieved using KVM_CAP_ARM_VM_IPA_SIZE of the KVM_CHECK_EXTENSION > +ioctl() at run-time. > + > +Please note that configuring the IPA size does not affect the capability > +exposed by the guest CPUs in ID_AA64MMFR0_EL1[PARange]. It only affects > +size of the address translated by the stage2 level (guest physical to nit: the size of the input address Besides Reviewed-by: Eric Auger <eric.auger@redhat.com> Thanks Eric > +host physical address translations). > + > + > 4.3 KVM_GET_MSR_INDEX_LIST, KVM_GET_MSR_FEATURE_INDEX_LIST > > Capability: basic, KVM_CAP_GET_MSR_FEATURES for KVM_GET_MSR_FEATURE_INDEX_LIST > diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h > index 2cce769ba4c6..d352f6df8d2c 100644 > --- a/arch/arm64/include/asm/stage2_pgtable.h > +++ b/arch/arm64/include/asm/stage2_pgtable.h > @@ -42,28 +42,8 @@ > * the range (IPA_SHIFT, IPA_SHIFT - 4). > */ > #define stage2_pgtable_levels(ipa) ARM64_HW_PGTABLE_LEVELS((ipa) - 4) > -#define STAGE2_PGTABLE_LEVELS stage2_pgtable_levels(KVM_PHYS_SHIFT) > #define kvm_stage2_levels(kvm) VTCR_EL2_LVLS(kvm->arch.vtcr) > > -/* > - * With all the supported VA_BITs and 40bit guest IPA, the following condition > - * is always true: > - * > - * STAGE2_PGTABLE_LEVELS <= CONFIG_PGTABLE_LEVELS > - * > - * We base our stage-2 page table walker helpers on this assumption and > - * fall back to using the host version of the helper wherever possible. > - * i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back > - * to using the host version, since it is guaranteed it is not folded at host. > - * > - * If the condition breaks in the future, we can rearrange the host level > - * definitions and reuse them for stage2. Till then... > - */ > -#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS > -#error "Unsupported combination of guest IPA and host VA_BITS." > -#endif > - > - > /* stage2_pgdir_shift() is the size mapped by top-level stage2 entry for the VM */ > #define stage2_pgdir_shift(kvm) pt_levels_pgdir_shift(kvm_stage2_levels(kvm)) > #define stage2_pgdir_size(kvm) (1ULL << stage2_pgdir_shift(kvm)) > diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c > index f156e45760bc..95f28d5950e0 100644 > --- a/arch/arm64/kvm/reset.c > +++ b/arch/arm64/kvm/reset.c > @@ -89,6 +89,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext) > case KVM_CAP_VCPU_EVENTS: > r = 1; > break; > + case KVM_CAP_ARM_VM_IPA_SIZE: > + r = kvm_ipa_limit; > + break; > default: > r = 0; > } > @@ -192,17 +195,23 @@ int kvm_arm_config_vm(struct kvm *kvm, unsigned long type) > u32 parange, phys_shift; > u8 lvls; > > - if (type) > + if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK) > return -EINVAL; > > + phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type); > + if (phys_shift) { > + if (phys_shift > kvm_ipa_limit || > + phys_shift < 32) > + return -EINVAL; > + } else { > + phys_shift = KVM_PHYS_SHIFT; > + } > + > parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7; > if (parange > ID_AA64MMFR0_PARANGE_MAX) > parange = ID_AA64MMFR0_PARANGE_MAX; > vtcr |= parange << VTCR_EL2_PS_SHIFT; > > - phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange); > - if (phys_shift > KVM_PHYS_SHIFT) > - phys_shift = KVM_PHYS_SHIFT; > vtcr |= VTCR_EL2_T0SZ(phys_shift); > /* > * Use a minimum 2 level page table to prevent splitting > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index 07548de5c988..9b949efcfd32 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -750,6 +750,15 @@ struct kvm_ppc_resize_hpt { > > #define KVM_S390_SIE_PAGE_OFFSET 1 > > +/* > + * On arm64, machine type can be used to request the physical > + * address size for the VM. Bits[7-0] are reserved for the guest > + * PA size shift (i.e, log2(PA_Size)). For backward compatibility, > + * value 0 implies the default IPA size, 40bits. > + */ > +#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL > +#define KVM_VM_TYPE_ARM_IPA_SIZE(x) \ > + ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK) > /* > * ioctls for /dev/kvm fds: > */ > @@ -952,6 +961,7 @@ struct kvm_ppc_resize_hpt { > #define KVM_CAP_S390_HPAGE_1M 156 > #define KVM_CAP_NESTED_STATE 157 > #define KVM_CAP_ARM_INJECT_SERROR_ESR 158 > +#define KVM_CAP_ARM_VM_IPA_SIZE 159 /* returns maximum IPA bits for a VM */ > > #ifdef KVM_CAP_IRQ_ROUTING > >
On Wed, Sep 26, 2018 at 05:32:54PM +0100, Suzuki K Poulose wrote: > Allow specifying the physical address size limit for a new > VM via the kvm_type argument for the KVM_CREATE_VM ioctl. This > allows us to finalise the stage2 page table as early as possible > and hence perform the right checks on the memory slots > without complication. The size is encoded as Log2(PA_Size) in > bits[7:0] of the type field. For backward compatibility the > value 0 is reserved and implies 40bits. Also, lift the limit > of the IPA to host limit and allow lower IPA sizes (e.g, 32). > > The userspace could check the extension KVM_CAP_ARM_VM_IPA_SIZE > for the availability of this feature. The cap check returns the > maximum limit for the physical address shift supported by the host. > > Cc: Marc Zyngier <marc.zyngier@arm.com> > Cc: Christoffer Dall <cdall@kernel.org> > Cc: Peter Maydell <peter.maydell@linaro.org> > Cc: Paolo Bonzini <pbonzini@redhat.com> > Cc: Radim Krčmář <rkrcmar@redhat.com> > Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> > --- > Changes since v5: > - Rename the capability to KVM_CAP_ARM_VM_IPA_SIZE > - Update Documentation of the API (Peter Maydell) > - Fix comment/commit-description as spotted by Eric > Changes since v4: > - Fold the introduction of the KVM_CAP_ARM_VM_PHYS_SHIFT to this > patch to allow detection of the availability of the feature for > userspace. > - Document the API > - Restrict the feature only to arm64. > Changes since V3: > - Switch to a CAP, that can be checkd via EXTENSIONS on KVM device > fd, rather than a dedicated ioctl. > --- > Documentation/virtual/kvm/api.txt | 31 +++++++++++++++++++++++++ > arch/arm64/include/asm/stage2_pgtable.h | 20 ---------------- > arch/arm64/kvm/reset.c | 17 ++++++++++---- > include/uapi/linux/kvm.h | 10 ++++++++ > 4 files changed, 54 insertions(+), 24 deletions(-) > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt > index c664064f76fb..54eb7c763c89 100644 > --- a/Documentation/virtual/kvm/api.txt > +++ b/Documentation/virtual/kvm/api.txt > @@ -123,6 +123,37 @@ memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the > flag KVM_VM_MIPS_VZ. > > > +On arm64, the physical address size for a VM (IPA Size limit) is limited > +to 40bits by default. The limit can be configured if the host supports the > +extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use > +KVM_VM_TYPE_ARM_IPA_SIZE(IPA_Bits) to set the size in the machine type > +identifier, where IPA_Bits is the maximum width of any physical > +address used by the VM. The IPA_Bits is encoded in bits[7-0] of the > +machine type identifier. > + > +e.g, to configure a guest to use 48bit physical address size : > + > + vm_fd = ioctl(dev_fd, KVM_CREATE_VM, KVM_VM_TYPE_ARM_IPA_SIZE(48)); > + > +The requested size (IPA_Bits) must be : > + 0 - Implies default size, 40bits (for backward compatibility) > + > + or > + > + N - Implies N bits, where N is a positive integer such that, > + 32 <= N <= Host_IPA_Limit > + > +Host_IPA_Limit is the maximum possible value for IPA_Bits on the host and > +is dependent on the CPU capability and the kernel configuration. The limit can > +be retrieved using KVM_CAP_ARM_VM_IPA_SIZE of the KVM_CHECK_EXTENSION > +ioctl() at run-time. > + > +Please note that configuring the IPA size does not affect the capability > +exposed by the guest CPUs in ID_AA64MMFR0_EL1[PARange]. It only affects > +size of the address translated by the stage2 level (guest physical to > +host physical address translations). > + > + > 4.3 KVM_GET_MSR_INDEX_LIST, KVM_GET_MSR_FEATURE_INDEX_LIST > > Capability: basic, KVM_CAP_GET_MSR_FEATURES for KVM_GET_MSR_FEATURE_INDEX_LIST > diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h > index 2cce769ba4c6..d352f6df8d2c 100644 > --- a/arch/arm64/include/asm/stage2_pgtable.h > +++ b/arch/arm64/include/asm/stage2_pgtable.h > @@ -42,28 +42,8 @@ > * the range (IPA_SHIFT, IPA_SHIFT - 4). > */ > #define stage2_pgtable_levels(ipa) ARM64_HW_PGTABLE_LEVELS((ipa) - 4) > -#define STAGE2_PGTABLE_LEVELS stage2_pgtable_levels(KVM_PHYS_SHIFT) > #define kvm_stage2_levels(kvm) VTCR_EL2_LVLS(kvm->arch.vtcr) > > -/* > - * With all the supported VA_BITs and 40bit guest IPA, the following condition > - * is always true: > - * > - * STAGE2_PGTABLE_LEVELS <= CONFIG_PGTABLE_LEVELS > - * > - * We base our stage-2 page table walker helpers on this assumption and > - * fall back to using the host version of the helper wherever possible. > - * i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back > - * to using the host version, since it is guaranteed it is not folded at host. > - * > - * If the condition breaks in the future, we can rearrange the host level > - * definitions and reuse them for stage2. Till then... > - */ > -#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS > -#error "Unsupported combination of guest IPA and host VA_BITS." > -#endif > - > - > /* stage2_pgdir_shift() is the size mapped by top-level stage2 entry for the VM */ > #define stage2_pgdir_shift(kvm) pt_levels_pgdir_shift(kvm_stage2_levels(kvm)) > #define stage2_pgdir_size(kvm) (1ULL << stage2_pgdir_shift(kvm)) > diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c > index f156e45760bc..95f28d5950e0 100644 > --- a/arch/arm64/kvm/reset.c > +++ b/arch/arm64/kvm/reset.c > @@ -89,6 +89,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext) > case KVM_CAP_VCPU_EVENTS: > r = 1; > break; > + case KVM_CAP_ARM_VM_IPA_SIZE: > + r = kvm_ipa_limit; > + break; > default: > r = 0; > } > @@ -192,17 +195,23 @@ int kvm_arm_config_vm(struct kvm *kvm, unsigned long type) > u32 parange, phys_shift; > u8 lvls; > > - if (type) > + if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK) > return -EINVAL; > > + phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type); > + if (phys_shift) { > + if (phys_shift > kvm_ipa_limit || > + phys_shift < 32) > + return -EINVAL; I am concerned here that if we allow the user to set the phys_size to 32 bits, then we end up with 2 levels of stage2 page tables, which means that the size of a stage2 pmd mapping becomes the size of a stage2 pgd mapping, yet we can still decide in user_mem_abort() that a stage2 fault is backed by PMD size mappings on the host, and attempt a huge mapping at stage2, which then becomes a PGD level block map, I think. Is this handled somehow? If so, how? I can't see user_mem_abort() being modified to explicitly handle this in your code, but perhaps the stage2_set_pmd_huge() call ends up actually mapping at the stage2 pte level, but I can't tell that it does. In any case, I think user_mem_abort() should give up on pmd/pud huge mappings if the size mapped by the stage2/stage1 pmd/pud levels don't line up. What do you think? Thanks, Christoffer > + } else { > + phys_shift = KVM_PHYS_SHIFT; > + } > + > parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7; > if (parange > ID_AA64MMFR0_PARANGE_MAX) > parange = ID_AA64MMFR0_PARANGE_MAX; > vtcr |= parange << VTCR_EL2_PS_SHIFT; > > - phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange); > - if (phys_shift > KVM_PHYS_SHIFT) > - phys_shift = KVM_PHYS_SHIFT; > vtcr |= VTCR_EL2_T0SZ(phys_shift); > /* > * Use a minimum 2 level page table to prevent splitting > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h > index 07548de5c988..9b949efcfd32 100644 > --- a/include/uapi/linux/kvm.h > +++ b/include/uapi/linux/kvm.h > @@ -750,6 +750,15 @@ struct kvm_ppc_resize_hpt { > > #define KVM_S390_SIE_PAGE_OFFSET 1 > > +/* > + * On arm64, machine type can be used to request the physical > + * address size for the VM. Bits[7-0] are reserved for the guest > + * PA size shift (i.e, log2(PA_Size)). For backward compatibility, > + * value 0 implies the default IPA size, 40bits. > + */ > +#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL > +#define KVM_VM_TYPE_ARM_IPA_SIZE(x) \ > + ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK) > /* > * ioctls for /dev/kvm fds: > */ > @@ -952,6 +961,7 @@ struct kvm_ppc_resize_hpt { > #define KVM_CAP_S390_HPAGE_1M 156 > #define KVM_CAP_NESTED_STATE 157 > #define KVM_CAP_ARM_INJECT_SERROR_ESR 158 > +#define KVM_CAP_ARM_VM_IPA_SIZE 159 /* returns maximum IPA bits for a VM */ > > #ifdef KVM_CAP_IRQ_ROUTING > > -- > 2.19.0 > > _______________________________________________ > kvmarm mailing list > kvmarm@lists.cs.columbia.edu > https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
Hi Christoffer, On 31/10/18 14:22, Christoffer Dall wrote: > On Wed, Sep 26, 2018 at 05:32:54PM +0100, Suzuki K Poulose wrote: >> Allow specifying the physical address size limit for a new >> VM via the kvm_type argument for the KVM_CREATE_VM ioctl. This >> allows us to finalise the stage2 page table as early as possible >> and hence perform the right checks on the memory slots >> without complication. The size is encoded as Log2(PA_Size) in >> bits[7:0] of the type field. For backward compatibility the >> value 0 is reserved and implies 40bits. Also, lift the limit >> of the IPA to host limit and allow lower IPA sizes (e.g, 32). >> >> The userspace could check the extension KVM_CAP_ARM_VM_IPA_SIZE >> for the availability of this feature. The cap check returns the >> maximum limit for the physical address shift supported by the host. >> >> Cc: Marc Zyngier <marc.zyngier@arm.com> >> Cc: Christoffer Dall <cdall@kernel.org> >> Cc: Peter Maydell <peter.maydell@linaro.org> >> Cc: Paolo Bonzini <pbonzini@redhat.com> >> Cc: Radim Krčmář <rkrcmar@redhat.com> >> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> >> --- >> @@ -192,17 +195,23 @@ int kvm_arm_config_vm(struct kvm *kvm, unsigned long type) >> u32 parange, phys_shift; >> u8 lvls; >> >> - if (type) >> + if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK) >> return -EINVAL; >> >> + phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type); >> + if (phys_shift) { >> + if (phys_shift > kvm_ipa_limit || >> + phys_shift < 32) >> + return -EINVAL; > > I am concerned here that if we allow the user to set the phys_size to 32 > bits, then we end up with 2 levels of stage2 page tables, which means > that the size of a stage2 pmd mapping becomes the size of a stage2 pgd > mapping, yet we can still decide in user_mem_abort() that a stage2 fault > is backed by PMD size mappings on the host, and attempt a huge mapping > at stage2, which then becomes a PGD level block map, I think. Yes, you're right. We will have a pgd-level block map in that case. This should work transparently as PMD at stage2 is folded into PGD and we endup marking the PGD entry as huge and the stage2 accessors deal with it appropriately. This is similar to having a PMD huge page with 64K + 42bit VA (2 level page table) on stage1. > > Is this handled somehow? If so, how? We don't prevent this. We have a guaranteed minimum number of levels at 2, which implies you can map a stage1 PMD huge page at stage2. I acknowledge that the Linux naming convention does cause some confusion for a "level" at stage1 and stage2 levels. But if you think of it from the hardware level (and like the ARM ARM defines it , Level 0-3), it is much simpler. i.e, you can map a huge page at level N in stage1 into stage2 if you have that level N. It doesn't matter if stage2 has more or less number of levels than stage1, as long as stage2 table can deal with it. > > I can't see user_mem_abort() being modified to explicitly handle this > in your code, but perhaps the stage2_set_pmd_huge() call ends up > actually mapping at the stage2 pte level, but I can't tell that it does. The stage2_set_pmd_huge() installs it at the PGD (level 2, which would have been PMD if we had levels > 2) slot. pmd = stage2_get_pmd(addr) \-> pud = stage2_get_pud(addr) \-> pgd = kvm->arch.pgd + stage2_pgd_index(addr); \-> (we have stage2_pgd_none(x) = 0 and \-> stage2_pud_offset(pgd, addr) = pgd \->returns (kvm->arch.pgd + stage2_pgd_index(addr); \-> stage_pud_none(x) = 0 & stage2_pmd_offset(pud, addr) = pud \-> returns pud (kvm->arch.pgd + stage2_pgd_index(addr)) and we install the PMD huge mapping at the location. > In any case, I think user_mem_abort() should give up on pmd/pud huge > mappings if the size mapped by the stage2/stage1 pmd/pud levels don't > line up. What do you think? Does it matter ? Personally I don't think it matters much as long as we are able to map the "huge" page at stage1 in the stage2 as huge, even if the stage2 has lesser levels and manage it well. Given that PMD huge pages are quite common, it would be good to exploit it when we can. On the other hand, for stage2 PUD we are checking if the stage2 has a PUD level (kvm_has_stage2_pud()). May be we should relax it just like we do for PMD to check (kvm_stage2_levels > 2). Thanks Suzuki > > Thanks, > > Christoffer > >> + } else { >> + phys_shift = KVM_PHYS_SHIFT; >> + } >> + >> parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7; >> if (parange > ID_AA64MMFR0_PARANGE_MAX) >> parange = ID_AA64MMFR0_PARANGE_MAX; >> vtcr |= parange << VTCR_EL2_PS_SHIFT; >> >> - phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange); >> - if (phys_shift > KVM_PHYS_SHIFT) >> - phys_shift = KVM_PHYS_SHIFT; >> vtcr |= VTCR_EL2_T0SZ(phys_shift); >> /* >> * Use a minimum 2 level page table to prevent splitting >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h >> index 07548de5c988..9b949efcfd32 100644 >> --- a/include/uapi/linux/kvm.h >> +++ b/include/uapi/linux/kvm.h >> @@ -750,6 +750,15 @@ struct kvm_ppc_resize_hpt { >> >> #define KVM_S390_SIE_PAGE_OFFSET 1 >> >> +/* >> + * On arm64, machine type can be used to request the physical >> + * address size for the VM. Bits[7-0] are reserved for the guest >> + * PA size shift (i.e, log2(PA_Size)). For backward compatibility, >> + * value 0 implies the default IPA size, 40bits. >> + */ >> +#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL >> +#define KVM_VM_TYPE_ARM_IPA_SIZE(x) \ >> + ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK) >> /* >> * ioctls for /dev/kvm fds: >> */ >> @@ -952,6 +961,7 @@ struct kvm_ppc_resize_hpt { >> #define KVM_CAP_S390_HPAGE_1M 156 >> #define KVM_CAP_NESTED_STATE 157 >> #define KVM_CAP_ARM_INJECT_SERROR_ESR 158 >> +#define KVM_CAP_ARM_VM_IPA_SIZE 159 /* returns maximum IPA bits for a VM */ >> >> #ifdef KVM_CAP_IRQ_ROUTING >> >> -- >> 2.19.0 >> >> _______________________________________________ >> kvmarm mailing list >> kvmarm@lists.cs.columbia.edu >> https://lists.cs.columbia.edu/mailman/listinfo/kvmarm
On Wed, Oct 31, 2018 at 05:55:13PM +0000, Suzuki K Poulose wrote: > Hi Christoffer, > > On 31/10/18 14:22, Christoffer Dall wrote: > >On Wed, Sep 26, 2018 at 05:32:54PM +0100, Suzuki K Poulose wrote: > >>Allow specifying the physical address size limit for a new > >>VM via the kvm_type argument for the KVM_CREATE_VM ioctl. This > >>allows us to finalise the stage2 page table as early as possible > >>and hence perform the right checks on the memory slots > >>without complication. The size is encoded as Log2(PA_Size) in > >>bits[7:0] of the type field. For backward compatibility the > >>value 0 is reserved and implies 40bits. Also, lift the limit > >>of the IPA to host limit and allow lower IPA sizes (e.g, 32). > >> > >>The userspace could check the extension KVM_CAP_ARM_VM_IPA_SIZE > >>for the availability of this feature. The cap check returns the > >>maximum limit for the physical address shift supported by the host. > >> > >>Cc: Marc Zyngier <marc.zyngier@arm.com> > >>Cc: Christoffer Dall <cdall@kernel.org> > >>Cc: Peter Maydell <peter.maydell@linaro.org> > >>Cc: Paolo Bonzini <pbonzini@redhat.com> > >>Cc: Radim Krčmář <rkrcmar@redhat.com> > >>Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> > >>--- > > >>@@ -192,17 +195,23 @@ int kvm_arm_config_vm(struct kvm *kvm, unsigned long type) > >> u32 parange, phys_shift; > >> u8 lvls; > >>- if (type) > >>+ if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK) > >> return -EINVAL; > >>+ phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type); > >>+ if (phys_shift) { > >>+ if (phys_shift > kvm_ipa_limit || > >>+ phys_shift < 32) > >>+ return -EINVAL; > > > >I am concerned here that if we allow the user to set the phys_size to 32 > >bits, then we end up with 2 levels of stage2 page tables, which means > >that the size of a stage2 pmd mapping becomes the size of a stage2 pgd > >mapping, yet we can still decide in user_mem_abort() that a stage2 fault > >is backed by PMD size mappings on the host, and attempt a huge mapping > >at stage2, which then becomes a PGD level block map, I think. > > Yes, you're right. We will have a pgd-level block map in that case. > This should work transparently as PMD at stage2 is folded into PGD and > we endup marking the PGD entry as huge and the stage2 accessors deal > with it appropriately. This is similar to having a PMD huge page with > 64K + 42bit VA (2 level page table) on stage1. > > > > >Is this handled somehow? If so, how? > > We don't prevent this. We have a guaranteed minimum number of levels > at 2, which implies you can map a stage1 PMD huge page at stage2. > I acknowledge that the Linux naming convention does cause some confusion > for a "level" at stage1 and stage2 levels. But if you think of it > from the hardware level (and like the ARM ARM defines it , Level 0-3), > it is much simpler. i.e, you can map a huge page at level N in stage1 > into stage2 if you have that level N. It doesn't matter if stage2 has > more or less number of levels than stage1, as long as stage2 table can > deal with it. > That is indeed a good way to reason about it. > > > >I can't see user_mem_abort() being modified to explicitly handle this > >in your code, but perhaps the stage2_set_pmd_huge() call ends up > >actually mapping at the stage2 pte level, but I can't tell that it does. > > The stage2_set_pmd_huge() installs it at the PGD (level 2, which would > have been PMD if we had levels > 2) slot. > > pmd = stage2_get_pmd(addr) > \-> pud = stage2_get_pud(addr) > \-> pgd = kvm->arch.pgd + stage2_pgd_index(addr); > \-> (we have stage2_pgd_none(x) = 0 and > \-> stage2_pud_offset(pgd, addr) = pgd > \->returns (kvm->arch.pgd + stage2_pgd_index(addr); > \-> stage_pud_none(x) = 0 & stage2_pmd_offset(pud, addr) = pud > \-> returns pud (kvm->arch.pgd + stage2_pgd_index(addr)) > > and we install the PMD huge mapping at the location. > > >In any case, I think user_mem_abort() should give up on pmd/pud huge > >mappings if the size mapped by the stage2/stage1 pmd/pud levels don't > >line up. What do you think? > > Does it matter ? Personally I don't think it matters much as long as we > are able to map the "huge" page at stage1 in the stage2 as huge, even if > the stage2 has lesser levels and manage it well. Given that PMD huge > pages are quite common, it would be good to exploit it when we can. What I couldn't convince myself of was whether having 2 levels at stage2 implied the entry level block mapping being of the same size as the stage1 block mapping, but given your explanation above, I think that's fine. > > On the other hand, for stage2 PUD we are checking if the stage2 has a > PUD level (kvm_has_stage2_pud()). May be we should relax it just like > we do for PMD to check (kvm_stage2_levels > 2). > Depends on how the code ends up looking like I suppose, but the more symmetry we can have between the approach for host PMD and host PUD and host PTE mappings, the better. Thanks, Christoffer
On 01/11/18 08:36, Christoffer Dall wrote: > On Wed, Oct 31, 2018 at 05:55:13PM +0000, Suzuki K Poulose wrote: >> Hi Christoffer, >> >> On 31/10/18 14:22, Christoffer Dall wrote: >>> On Wed, Sep 26, 2018 at 05:32:54PM +0100, Suzuki K Poulose wrote: >>>> Allow specifying the physical address size limit for a new >>>> VM via the kvm_type argument for the KVM_CREATE_VM ioctl. This >>>> allows us to finalise the stage2 page table as early as possible >>>> and hence perform the right checks on the memory slots >>>> without complication. The size is encoded as Log2(PA_Size) in >>>> bits[7:0] of the type field. For backward compatibility the >>>> value 0 is reserved and implies 40bits. Also, lift the limit >>>> of the IPA to host limit and allow lower IPA sizes (e.g, 32). >>>> >>>> The userspace could check the extension KVM_CAP_ARM_VM_IPA_SIZE >>>> for the availability of this feature. The cap check returns the >>>> maximum limit for the physical address shift supported by the host. >>>> >>>> Cc: Marc Zyngier <marc.zyngier@arm.com> >>>> Cc: Christoffer Dall <cdall@kernel.org> >>>> Cc: Peter Maydell <peter.maydell@linaro.org> >>>> Cc: Paolo Bonzini <pbonzini@redhat.com> >>>> Cc: Radim Krčmář <rkrcmar@redhat.com> >>>> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> >>>> --- >> >>>> @@ -192,17 +195,23 @@ int kvm_arm_config_vm(struct kvm *kvm, unsigned long type) >>>> u32 parange, phys_shift; >>>> u8 lvls; >>>> - if (type) >>>> + if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK) >>>> return -EINVAL; >>>> + phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type); >>>> + if (phys_shift) { >>>> + if (phys_shift > kvm_ipa_limit || >>>> + phys_shift < 32) >>>> + return -EINVAL; >>> >>> I am concerned here that if we allow the user to set the phys_size to 32 >>> bits, then we end up with 2 levels of stage2 page tables, which means >>> that the size of a stage2 pmd mapping becomes the size of a stage2 pgd >>> mapping, yet we can still decide in user_mem_abort() that a stage2 fault >>> is backed by PMD size mappings on the host, and attempt a huge mapping >>> at stage2, which then becomes a PGD level block map, I think. >> >> Yes, you're right. We will have a pgd-level block map in that case. >> This should work transparently as PMD at stage2 is folded into PGD and >> we endup marking the PGD entry as huge and the stage2 accessors deal >> with it appropriately. This is similar to having a PMD huge page with >> 64K + 42bit VA (2 level page table) on stage1. >> >>> >>> Is this handled somehow? If so, how? >> >> We don't prevent this. We have a guaranteed minimum number of levels >> at 2, which implies you can map a stage1 PMD huge page at stage2. >> I acknowledge that the Linux naming convention does cause some confusion >> for a "level" at stage1 and stage2 levels. But if you think of it >> from the hardware level (and like the ARM ARM defines it , Level 0-3), >> it is much simpler. i.e, you can map a huge page at level N in stage1 >> into stage2 if you have that level N. It doesn't matter if stage2 has >> more or less number of levels than stage1, as long as stage2 table can >> deal with it. >> > > That is indeed a good way to reason about it. > >>> >>> I can't see user_mem_abort() being modified to explicitly handle this >>> in your code, but perhaps the stage2_set_pmd_huge() call ends up >>> actually mapping at the stage2 pte level, but I can't tell that it does. >> >> The stage2_set_pmd_huge() installs it at the PGD (level 2, which would >> have been PMD if we had levels > 2) slot. >> >> pmd = stage2_get_pmd(addr) >> \-> pud = stage2_get_pud(addr) >> \-> pgd = kvm->arch.pgd + stage2_pgd_index(addr); >> \-> (we have stage2_pgd_none(x) = 0 and >> \-> stage2_pud_offset(pgd, addr) = pgd >> \->returns (kvm->arch.pgd + stage2_pgd_index(addr); >> \-> stage_pud_none(x) = 0 & stage2_pmd_offset(pud, addr) = pud >> \-> returns pud (kvm->arch.pgd + stage2_pgd_index(addr)) >> >> and we install the PMD huge mapping at the location. >> >>> In any case, I think user_mem_abort() should give up on pmd/pud huge >>> mappings if the size mapped by the stage2/stage1 pmd/pud levels don't >>> line up. What do you think? >> >> Does it matter ? Personally I don't think it matters much as long as we >> are able to map the "huge" page at stage1 in the stage2 as huge, even if >> the stage2 has lesser levels and manage it well. Given that PMD huge >> pages are quite common, it would be good to exploit it when we can. > > What I couldn't convince myself of was whether having 2 levels at stage2 > implied the entry level block mapping being of the same size as the The point worth noting is, PMD (in ARM ARM terms) is always Level 2, irrespective of whether it is also PGD(the entry level). So as long as we deal with PMD or PUD (and not PGD which varies) and make sure you have sufficient number of levels we are fine. For PMD it is guaranteed on all architectures and we do that explicitly for stage2 in arm64 KVM. Similarly for PUD we need to make sure we have 3 levels to use a huge page or else fallback to splitting the pages. > stage1 block mapping, but given your explanation above, I think that's > fine. > >> >> On the other hand, for stage2 PUD we are checking if the stage2 has a >> PUD level (kvm_has_stage2_pud()). May be we should relax it just like >> we do for PMD to check (kvm_stage2_levels > 2). >> > > Depends on how the code ends up looking like I suppose, but the more > symmetry we can have between the approach for host PMD and host PUD and > host PTE mappings, the better. Sure. It is quite easy to confuse ourselves even later when we look at it. We could put in a big fat comment explaining this to avoid scratching our heads later. Cheers Suzuki
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index c664064f76fb..54eb7c763c89 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -123,6 +123,37 @@ memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the flag KVM_VM_MIPS_VZ. +On arm64, the physical address size for a VM (IPA Size limit) is limited +to 40bits by default. The limit can be configured if the host supports the +extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use +KVM_VM_TYPE_ARM_IPA_SIZE(IPA_Bits) to set the size in the machine type +identifier, where IPA_Bits is the maximum width of any physical +address used by the VM. The IPA_Bits is encoded in bits[7-0] of the +machine type identifier. + +e.g, to configure a guest to use 48bit physical address size : + + vm_fd = ioctl(dev_fd, KVM_CREATE_VM, KVM_VM_TYPE_ARM_IPA_SIZE(48)); + +The requested size (IPA_Bits) must be : + 0 - Implies default size, 40bits (for backward compatibility) + + or + + N - Implies N bits, where N is a positive integer such that, + 32 <= N <= Host_IPA_Limit + +Host_IPA_Limit is the maximum possible value for IPA_Bits on the host and +is dependent on the CPU capability and the kernel configuration. The limit can +be retrieved using KVM_CAP_ARM_VM_IPA_SIZE of the KVM_CHECK_EXTENSION +ioctl() at run-time. + +Please note that configuring the IPA size does not affect the capability +exposed by the guest CPUs in ID_AA64MMFR0_EL1[PARange]. It only affects +size of the address translated by the stage2 level (guest physical to +host physical address translations). + + 4.3 KVM_GET_MSR_INDEX_LIST, KVM_GET_MSR_FEATURE_INDEX_LIST Capability: basic, KVM_CAP_GET_MSR_FEATURES for KVM_GET_MSR_FEATURE_INDEX_LIST diff --git a/arch/arm64/include/asm/stage2_pgtable.h b/arch/arm64/include/asm/stage2_pgtable.h index 2cce769ba4c6..d352f6df8d2c 100644 --- a/arch/arm64/include/asm/stage2_pgtable.h +++ b/arch/arm64/include/asm/stage2_pgtable.h @@ -42,28 +42,8 @@ * the range (IPA_SHIFT, IPA_SHIFT - 4). */ #define stage2_pgtable_levels(ipa) ARM64_HW_PGTABLE_LEVELS((ipa) - 4) -#define STAGE2_PGTABLE_LEVELS stage2_pgtable_levels(KVM_PHYS_SHIFT) #define kvm_stage2_levels(kvm) VTCR_EL2_LVLS(kvm->arch.vtcr) -/* - * With all the supported VA_BITs and 40bit guest IPA, the following condition - * is always true: - * - * STAGE2_PGTABLE_LEVELS <= CONFIG_PGTABLE_LEVELS - * - * We base our stage-2 page table walker helpers on this assumption and - * fall back to using the host version of the helper wherever possible. - * i.e, if a particular level is not folded (e.g, PUD) at stage2, we fall back - * to using the host version, since it is guaranteed it is not folded at host. - * - * If the condition breaks in the future, we can rearrange the host level - * definitions and reuse them for stage2. Till then... - */ -#if STAGE2_PGTABLE_LEVELS > CONFIG_PGTABLE_LEVELS -#error "Unsupported combination of guest IPA and host VA_BITS." -#endif - - /* stage2_pgdir_shift() is the size mapped by top-level stage2 entry for the VM */ #define stage2_pgdir_shift(kvm) pt_levels_pgdir_shift(kvm_stage2_levels(kvm)) #define stage2_pgdir_size(kvm) (1ULL << stage2_pgdir_shift(kvm)) diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c index f156e45760bc..95f28d5950e0 100644 --- a/arch/arm64/kvm/reset.c +++ b/arch/arm64/kvm/reset.c @@ -89,6 +89,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_VCPU_EVENTS: r = 1; break; + case KVM_CAP_ARM_VM_IPA_SIZE: + r = kvm_ipa_limit; + break; default: r = 0; } @@ -192,17 +195,23 @@ int kvm_arm_config_vm(struct kvm *kvm, unsigned long type) u32 parange, phys_shift; u8 lvls; - if (type) + if (type & ~KVM_VM_TYPE_ARM_IPA_SIZE_MASK) return -EINVAL; + phys_shift = KVM_VM_TYPE_ARM_IPA_SIZE(type); + if (phys_shift) { + if (phys_shift > kvm_ipa_limit || + phys_shift < 32) + return -EINVAL; + } else { + phys_shift = KVM_PHYS_SHIFT; + } + parange = read_sanitised_ftr_reg(SYS_ID_AA64MMFR0_EL1) & 7; if (parange > ID_AA64MMFR0_PARANGE_MAX) parange = ID_AA64MMFR0_PARANGE_MAX; vtcr |= parange << VTCR_EL2_PS_SHIFT; - phys_shift = id_aa64mmfr0_parange_to_phys_shift(parange); - if (phys_shift > KVM_PHYS_SHIFT) - phys_shift = KVM_PHYS_SHIFT; vtcr |= VTCR_EL2_T0SZ(phys_shift); /* * Use a minimum 2 level page table to prevent splitting diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 07548de5c988..9b949efcfd32 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -750,6 +750,15 @@ struct kvm_ppc_resize_hpt { #define KVM_S390_SIE_PAGE_OFFSET 1 +/* + * On arm64, machine type can be used to request the physical + * address size for the VM. Bits[7-0] are reserved for the guest + * PA size shift (i.e, log2(PA_Size)). For backward compatibility, + * value 0 implies the default IPA size, 40bits. + */ +#define KVM_VM_TYPE_ARM_IPA_SIZE_MASK 0xffULL +#define KVM_VM_TYPE_ARM_IPA_SIZE(x) \ + ((x) & KVM_VM_TYPE_ARM_IPA_SIZE_MASK) /* * ioctls for /dev/kvm fds: */ @@ -952,6 +961,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_S390_HPAGE_1M 156 #define KVM_CAP_NESTED_STATE 157 #define KVM_CAP_ARM_INJECT_SERROR_ESR 158 +#define KVM_CAP_ARM_VM_IPA_SIZE 159 /* returns maximum IPA bits for a VM */ #ifdef KVM_CAP_IRQ_ROUTING
Allow specifying the physical address size limit for a new VM via the kvm_type argument for the KVM_CREATE_VM ioctl. This allows us to finalise the stage2 page table as early as possible and hence perform the right checks on the memory slots without complication. The size is encoded as Log2(PA_Size) in bits[7:0] of the type field. For backward compatibility the value 0 is reserved and implies 40bits. Also, lift the limit of the IPA to host limit and allow lower IPA sizes (e.g, 32). The userspace could check the extension KVM_CAP_ARM_VM_IPA_SIZE for the availability of this feature. The cap check returns the maximum limit for the physical address shift supported by the host. Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoffer Dall <cdall@kernel.org> Cc: Peter Maydell <peter.maydell@linaro.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> --- Changes since v5: - Rename the capability to KVM_CAP_ARM_VM_IPA_SIZE - Update Documentation of the API (Peter Maydell) - Fix comment/commit-description as spotted by Eric Changes since v4: - Fold the introduction of the KVM_CAP_ARM_VM_PHYS_SHIFT to this patch to allow detection of the availability of the feature for userspace. - Document the API - Restrict the feature only to arm64. Changes since V3: - Switch to a CAP, that can be checkd via EXTENSIONS on KVM device fd, rather than a dedicated ioctl. --- Documentation/virtual/kvm/api.txt | 31 +++++++++++++++++++++++++ arch/arm64/include/asm/stage2_pgtable.h | 20 ---------------- arch/arm64/kvm/reset.c | 17 ++++++++++---- include/uapi/linux/kvm.h | 10 ++++++++ 4 files changed, 54 insertions(+), 24 deletions(-)