Message ID: 20190621163422.6127-11-drjones@redhat.com
Series: target/arm/kvm: enable SVE in guests
On Fri, Jun 21, 2019 at 05:34:18PM +0100, Andrew Jones wrote: > These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the > swabbing is different than it is for fpsmid because the vector format > is a little-endian stream of words. Note, on big-endian hosts the FPSIMD view Vn and the SVE view Zn[127:0] of the FPSIMD/SVE common register bits have the opposite endianness for SVE_{GET,SET}_ONE_REG. This only matters if mixing the two views: just from this patch I don't know whether this is an issue for QEMU or not. The kernel and gdb were recently found to be broken in this regard for userspace [1] but the KVM interface should be unaffected. Cheers ---Dave [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kernel?id=41040cf7c5f0f26c368bc5d3016fed3a9ca6dba4
On Mon, Jun 24, 2019 at 12:05:35PM +0100, Dave Martin wrote: > On Fri, Jun 21, 2019 at 05:34:18PM +0100, Andrew Jones wrote: > > These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the > > swabbing is different than it is for fpsmid because the vector format > > is a little-endian stream of words. > > Note, on big-endian hosts the FPSIMD view Vn and the SVE view Zn[127:0] > of the FPSIMD/SVE common register bits has the opposite endianness for > SVE_{GET,SET}_ONE_REG. > > This only matters if mixing the two views: just from this patch I don't > know whether this is an issue for QEMU or not. I don't know either. My experience with the emulation side of QEMU is mostly the zcr_write tweak in this series. And, TBH, I didn't put too much thought into the endianness stuff, nor test this series with big endian. Hopefully Richard can chime in on this. Thanks, drew > > The kernel and gdb were recently found to be broken in this regard for > userspace [1] but the KVM interface should be unaffected. > > Cheers > ---Dave > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kernel?id=41040cf7c5f0f26c368bc5d3016fed3a9ca6dba4
On Mon, Jun 24, 2019 at 12:55:53PM +0100, Andrew Jones wrote: > On Mon, Jun 24, 2019 at 12:05:35PM +0100, Dave Martin wrote: > > On Fri, Jun 21, 2019 at 05:34:18PM +0100, Andrew Jones wrote: > > > These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the > > > swabbing is different than it is for fpsmid because the vector format > > > is a little-endian stream of words. > > > > Note, on big-endian hosts the FPSIMD view Vn and the SVE view Zn[127:0] > > of the FPSIMD/SVE common register bits has the opposite endianness for > > SVE_{GET,SET}_ONE_REG. > > > > This only matters if mixing the two views: just from this patch I don't > > know whether this is an issue for QEMU or not. > > I don't know either. My experience with the emulation side of QEMU is > mostly the zcr_write tweak in this series. And, TBH, I didn't put too > much thought into the endianness stuff, nor test this series with big > endian. Neither did I (at least beyond the "does it boot" level) -- hence the bug ;) And of course, few people are using big-endian, so nobody complained. Just flagging it up so it doesn't get missed! > Hopefully Richard can chime in on this. It would be interesting to know whether you do hit this issue somewhere, or whether you have a strong view about the clarification to the KVM ABI. Cheers ---Dave
On 6/21/19 6:34 PM, Andrew Jones wrote: > +/* > + * If ARM_MAX_VQ is increased to be greater than 16, then we can no > + * longer hard code slices to 1 in kvm_arch_put/get_sve(). > + */ > +QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16); This seems easy to fix, or simply drop the slices entirely for now, as otherwise they are a teeny bit confusing. It's a shame that these slices exist at all. It seems like the kernel could use the negotiated max sve size to grab the data all at once. > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > + uint64_t *q = aa64_vfp_qreg(env, n); > +#ifdef HOST_WORDS_BIGENDIAN > + uint64_t d[ARM_MAX_VQ * 2]; > + int j; > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > + d[j] = bswap64(q[j]); > + } > + reg.addr = (uintptr_t)d; > +#else > + reg.addr = (uintptr_t)q; > +#endif > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); It might be worth splitting this... > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > + uint64_t *q = &env->vfp.pregs[n].p[0]; > +#ifdef HOST_WORDS_BIGENDIAN > + uint64_t d[ARM_MAX_VQ * 2 / 8]; > + int j; > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > + d[j] = bswap64(q[j]); > + } > + reg.addr = (uintptr_t)d; > +#else > + reg.addr = (uintptr_t)q; > +#endif > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); ... and this (unified w/ reg + size parameters?) to a function because ... > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); ... you forgot to apply the bswap here. Likewise for the other direction. 
r~

PS: It's also tempting to drop the ifdefs and, since we know the host supports SVE instructions and supports sve_max_vq, do the reformatting as

    uint64_t scratch[ARM_MAX_VQ * 2];
    asm("whilelo p0.d, xzr, %2\n\t"
        "ld1d z0.d, p0/z, [%1]\n\t"
        "str z0, [%0]"
        : "=Q"(scratch)
        : "Q"(*aa64_vfp_qreg(env, n)),
          "r"(cpu->sve_max_vq)
        : "p0", "v0");

PPS: Ideally, this would be further cleaned up with ACLE builtins, but those are still under development for GCC.
Hi, On 6/21/19 6:34 PM, Andrew Jones wrote: > These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the > swabbing is different than it is for fpsmid because the vector format > is a little-endian stream of words. some cosmetic changes besides Richard's comments > > Signed-off-by: Andrew Jones <drjones@redhat.com> > --- > target/arm/kvm64.c | 135 +++++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 131 insertions(+), 4 deletions(-) > > diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c > index a2485d447e6a..706541327491 100644 > --- a/target/arm/kvm64.c > +++ b/target/arm/kvm64.c > @@ -673,11 +673,12 @@ int kvm_arch_destroy_vcpu(CPUState *cs) > bool kvm_arm_reg_syncs_via_cpreg_list(uint64_t regidx) > { > /* Return true if the regidx is a register we should synchronize > - * via the cpreg_tuples array (ie is not a core reg we sync by > - * hand in kvm_arch_get/put_registers()) > + * via the cpreg_tuples array (ie is not a core or sve reg that > + * we sync by hand in kvm_arch_get/put_registers()) > */ > switch (regidx & KVM_REG_ARM_COPROC_MASK) { > case KVM_REG_ARM_CORE: > + case KVM_REG_ARM64_SVE: > return false; > default: > return true; > @@ -763,6 +764,70 @@ static int kvm_arch_put_fpsimd(CPUState *cs) > return 0; > } > > +/* > + * If ARM_MAX_VQ is increased to be greater than 16, then we can no > + * longer hard code slices to 1 in kvm_arch_put/get_sve(). > + */ > +QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16); if the code is ready to support slices, I guess you could have a define and compute the slice number from ARM_MAX_VQ? 
> + > +static int kvm_arch_put_sve(CPUState *cs) > +{ > + ARMCPU *cpu = ARM_CPU(cs); > + CPUARMState *env = &cpu->env; > + struct kvm_one_reg reg; > + int slices = 1; > + int i, n, ret; > + > + for (i = 0; i < slices; i++) { > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > + uint64_t *q = aa64_vfp_qreg(env, n); > +#ifdef HOST_WORDS_BIGENDIAN > + uint64_t d[ARM_MAX_VQ * 2]; > + int j; line to be added > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > + d[j] = bswap64(q[j]); > + } > + reg.addr = (uintptr_t)d; > +#else > + reg.addr = (uintptr_t)q; > +#endif > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > + if (ret) { > + return ret; > + } > + } > + > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > + uint64_t *q = &env->vfp.pregs[n].p[0]; > +#ifdef HOST_WORDS_BIGENDIAN > + uint64_t d[ARM_MAX_VQ * 2 / 8]; > + int j; line > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > + d[j] = bswap64(q[j]); > + } > + reg.addr = (uintptr_t)d; > +#else > + reg.addr = (uintptr_t)q; > +#endif > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > + if (ret) { > + return ret; > + } > + } > + > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > + if (ret) { > + return ret; > + } > + } > + > + return 0; > +} > + > int kvm_arch_put_registers(CPUState *cs, int level) > { > struct kvm_one_reg reg; > @@ -857,7 +922,11 @@ int kvm_arch_put_registers(CPUState *cs, int level) > } > } > > - ret = kvm_arch_put_fpsimd(cs); > + if (!cpu->sve_max_vq) { > + ret = kvm_arch_put_fpsimd(cs); > + } else { > + ret = kvm_arch_put_sve(cs); > + } > if (ret) { > return ret; > } > @@ -920,6 +989,60 @@ static int kvm_arch_get_fpsimd(CPUState *cs) > return 0; > } > > +static int kvm_arch_get_sve(CPUState *cs) > +{ > + ARMCPU *cpu = ARM_CPU(cs); > + CPUARMState *env = &cpu->env; > + struct kvm_one_reg reg; > + int slices 
= 1; > + int i, n, ret; > + > + for (i = 0; i < slices; i++) { > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > + uint64_t *q = aa64_vfp_qreg(env, n); extra line needed > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > + reg.addr = (uintptr_t)q; > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > + if (ret) { > + return ret; > + } else {> +#ifdef HOST_WORDS_BIGENDIAN > + int j; line > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > + q[j] = bswap64(q[j]); > + } > +#endif > + } > + } > + > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > + uint64_t *q = &env->vfp.pregs[n].p[0]; extra line needed > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > + reg.addr = (uintptr_t)q; > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > + if (ret) { > + return ret; > + } else {> +#ifdef HOST_WORDS_BIGENDIAN > + int j; line > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > + q[j] = bswap64(q[j]); > + } > +#endif > + } > + } > + > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > + if (ret) { > + return ret; > + } > + } > + > + return 0; > +} > + > int kvm_arch_get_registers(CPUState *cs) > { > struct kvm_one_reg reg; > @@ -1014,7 +1137,11 @@ int kvm_arch_get_registers(CPUState *cs) > env->spsr = env->banked_spsr[i]; > } > > - ret = kvm_arch_get_fpsimd(cs); > + if (!cpu->sve_max_vq) { > + ret = kvm_arch_get_fpsimd(cs); > + } else { > + ret = kvm_arch_get_sve(cs); > + } > if (ret) { > return ret; > } > Thanks Eric
On Wed, Jun 26, 2019 at 04:22:34PM +0100, Richard Henderson wrote: > On 6/21/19 6:34 PM, Andrew Jones wrote: > > +/* > > + * If ARM_MAX_VQ is increased to be greater than 16, then we can no > > + * longer hard code slices to 1 in kvm_arch_put/get_sve(). > > + */ > > +QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16); > > This seems easy to fix, or simply drop the slices entirely for now, as > otherwise they are a teeny bit confusing. > > It's a shame that these slices exist at all. It seems like the kernel could > use the negotiated max sve size to grab the data all at once. The aim here was to be forwards compatible while fitting within the existing ABI. The ABI doesn't allow variable-sized registers, and if the vq can someday grow above 16 then the individual registers could become pretty big. Inside the kernel, we took the view that if that ever happens, it's sufficiently far out that we can skip implementing the support today, providing that the ABI can accommodate the change. For qemu, if you don't actually care what's in the regs, you could just enumerate them using KVM_GET_REG_LIST instead of manufacturing the IDs by hand. That way, you don't have to care what slices exist. For save/restore/migrate purposes, the regs are just data, so that's probably enough. Debug, and exchanging vector registers between the guest and, say, an SMC trapped to userspace, would still need to examine specific regs. I don't know what QEMU does in this area though. > > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > > + uint64_t *q = aa64_vfp_qreg(env, n); > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2]; > > + int j; > > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg); > > It might be worth splitting this...
> > > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > > + uint64_t *q = &env->vfp.pregs[n].p[0]; > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2 / 8]; > > + int j; > > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); It's for QEMU to choose, but does it actually matter what byteorder you store a Z-reg or P-reg in? Maybe the byteswap isn't really needed. I don't know how this works when migrating from a little-endian to a big-endian host or vice-versa (or if that is even supported...) [...] Cheers ---Dave
On Thu, Jun 27, 2019 at 08:56:21AM +0200, Auger Eric wrote: > Hi, > > On 6/21/19 6:34 PM, Andrew Jones wrote: > > These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the > > swabbing is different than it is for fpsmid because the vector format > > is a little-endian stream of words. > > some cosmetic changes besides Richard's comments > > > > Signed-off-by: Andrew Jones <drjones@redhat.com> > > --- > > target/arm/kvm64.c | 135 +++++++++++++++++++++++++++++++++++++++++++-- > > 1 file changed, 131 insertions(+), 4 deletions(-) > > > > diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c > > index a2485d447e6a..706541327491 100644 > > --- a/target/arm/kvm64.c > > +++ b/target/arm/kvm64.c > > @@ -673,11 +673,12 @@ int kvm_arch_destroy_vcpu(CPUState *cs) > > bool kvm_arm_reg_syncs_via_cpreg_list(uint64_t regidx) > > { > > /* Return true if the regidx is a register we should synchronize > > - * via the cpreg_tuples array (ie is not a core reg we sync by > > - * hand in kvm_arch_get/put_registers()) > > + * via the cpreg_tuples array (ie is not a core or sve reg that > > + * we sync by hand in kvm_arch_get/put_registers()) > > */ > > switch (regidx & KVM_REG_ARM_COPROC_MASK) { > > case KVM_REG_ARM_CORE: > > + case KVM_REG_ARM64_SVE: > > return false; > > default: > > return true; > > @@ -763,6 +764,70 @@ static int kvm_arch_put_fpsimd(CPUState *cs) > > return 0; > > } > > > > +/* > > + * If ARM_MAX_VQ is increased to be greater than 16, then we can no > > + * longer hard code slices to 1 in kvm_arch_put/get_sve(). > > + */ > > +QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16); > if the code is ready to support slices, I guess you could have a define > and compute the slice number from ARM_MAX_VQ? Yeah, that should be do-able. Thanks for the suggestion. 
> > + > > +static int kvm_arch_put_sve(CPUState *cs) > > +{ > > + ARMCPU *cpu = ARM_CPU(cs); > > + CPUARMState *env = &cpu->env; > > + struct kvm_one_reg reg; > > + int slices = 1; > > + int i, n, ret; > > + > > + for (i = 0; i < slices; i++) { > > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > > + uint64_t *q = aa64_vfp_qreg(env, n); > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2]; > > + int j; > line to be added > > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } > > + } > > + > > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > > + uint64_t *q = &env->vfp.pregs[n].p[0]; > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2 / 8]; > > + int j; > line > > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } > > + } > > + > > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } > > + } > > + > > + return 0; > > +} > > + > > int kvm_arch_put_registers(CPUState *cs, int level) > > { > > struct kvm_one_reg reg; > > @@ -857,7 +922,11 @@ int kvm_arch_put_registers(CPUState *cs, int level) > > } > > } > > > > - ret = kvm_arch_put_fpsimd(cs); > > + if (!cpu->sve_max_vq) { > > + ret = kvm_arch_put_fpsimd(cs); > > + } else { > > + ret = kvm_arch_put_sve(cs); > > + } > > if (ret) { > > return ret; > > } > > @@ -920,6 +989,60 @@ static int kvm_arch_get_fpsimd(CPUState *cs) > > return 0; > > } 
> > > > +static int kvm_arch_get_sve(CPUState *cs) > > +{ > > + ARMCPU *cpu = ARM_CPU(cs); > > + CPUARMState *env = &cpu->env; > > + struct kvm_one_reg reg; > > + int slices = 1; > > + int i, n, ret; > > + > > + for (i = 0; i < slices; i++) { > > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > > + uint64_t *q = aa64_vfp_qreg(env, n); > extra line needed > > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > > + reg.addr = (uintptr_t)q; > > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } else {> +#ifdef HOST_WORDS_BIGENDIAN > > + int j; > line > > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > > + q[j] = bswap64(q[j]); > > + } > > +#endif > > + } > > + } > > + > > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > > + uint64_t *q = &env->vfp.pregs[n].p[0]; > extra line needed > > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > > + reg.addr = (uintptr_t)q; > > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } else {> +#ifdef HOST_WORDS_BIGENDIAN > > + int j; > line > > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > > + q[j] = bswap64(q[j]); > > + } > > +#endif > > + } > > + } > > + > > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } > > + } > > + > > + return 0; > > +} > > + > > int kvm_arch_get_registers(CPUState *cs) > > { > > struct kvm_one_reg reg; > > @@ -1014,7 +1137,11 @@ int kvm_arch_get_registers(CPUState *cs) > > env->spsr = env->banked_spsr[i]; > > } > > > > - ret = kvm_arch_get_fpsimd(cs); > > + if (!cpu->sve_max_vq) { > > + ret = kvm_arch_get_fpsimd(cs); > > + } else { > > + ret = kvm_arch_get_sve(cs); > > + } > > if (ret) { > > return ret; > > } > > > > I'll see if I can clean/improve this patch with yours and Richard's suggestions. Thanks, drew
On 6/27/19 12:59 PM, Dave Martin wrote: >> It's a shame that these slices exist at all. It seems like the kernel could >> use the negotiated max sve size to grab the data all at once. > > The aim here was to be forwards compatible while fitting within the > existing ABI. > > The ABI doesn't allow variable-sized registers, and if the vq can > someday grow above 16 then the individual registers could become pretty > big. The ABI doesn't appear to have fixed sized data blocks. Since that's the case, it didn't seem to me that variable sized blocks was so different, given that the actual size is constrained by the hardware on which we're running. And if VQ does grow large, then do we really want oodles of syscalls in order to transfer the data for each register? With the 9 bits reserved for this field, we could require a maximum of 1024 syscalls to transfer the whole register set. > It's for QEMU to choose, but does it actually matter what byteorder you > store a Z-reg or P-reg in? Maybe the byteswap isn't really needed. I think the only sensible order for the kernel is that in which LDR/STR itself saves the data. Which is a byte stream. Within QEMU, it has so far made sense to keep the data in 64-bit hunks in host-endian order. That's how the AdvSIMD code was written originally, and it turned out to be easy enough to continue that for SVE. Which does mean that if we want to pass data to the kernel as the aforementioned byte stream that a big-endian host does need to bswap the data in 64-bit hunks. > I don't know how this works when migrating from a little-endian to a > big-endian host or vice-versa (or if that is even supported...) The data is stored canonically within the vmsave, so such migrations are supposed to work. r~
On Thu, Jun 27, 2019 at 12:26:06PM +0100, Richard Henderson wrote: > On 6/27/19 12:59 PM, Dave Martin wrote: > >> It's a shame that these slices exist at all. It seems like the kernel could > >> use the negotiated max sve size to grab the data all at once. > > > > The aim here was to be forwards compatible while fitting within the > > existing ABI. > > > > The ABI doesn't allow variable-sized registers, and if the vq can > > someday grow above 16 then the individual registers could become pretty > > big. > > The ABI doesn't appear to have fixed sized data blocks. Since that's > the case, it didn't seem to me that variable sized blocks was so > different, given that the actual size is constrained by the hardware > on which we're running. I'm not sure what you mean here. For KVM_GET_ONE_REG, the size is determined by the reg size field in the register ID, so size is deemed to be a fixed property of a particular register. Having the register IDs vary according to the vector length seemed a step too far. > And if VQ does grow large, then do we really want oodles of syscalls in order > to transfer the data for each register? With the 9 bits reserved for this > field, we could require a maximum of 1024 syscalls to transfer the whole > register set. A save/restore requires oodles of syscalls in any case, and for SVE there is a rapid dropoff of probabilities: VQ < 16 is much likelier than VQ == 32 is likelier than VQ == 64 etc. The reg access API has some shortcomings, and we might find at some point that the whole thing needs redesigning. I suppose we could have taken the view that the KVM ABI would not even try to support VQ > 16 in a forwards compatible way. In the end we decided to at least have things workable. Either way, it's entirely reasonable for userspace not to try to support additional slices for now. We'll have plenty of time to plan away across that bridge when we spot it on the horizon... 
> > It's for QEMU to choose, but does it actually matter what byteorder you > > store a Z-reg or P-reg in? Maybe the byteswap isn't really needed. > > I think the only sensible order for the kernel is that in which LDR/STR itself > saves the data. Which is a byte stream. We have a choice of STRs though. Anyway, yes, it is the way it is, now. > Within QEMU, it has so far made sense to keep the data in 64-bit hunks in > host-endian order. That's how the AdvSIMD code was written originally, and it > turned out to be easy enough to continue that for SVE. Fair enough. It's entirely up to QEMU to decide -- I just wanted to check that there was no misunderstanding about this issue in the ABI. > Which does mean that if we want to pass data to the kernel as the > aforementioned byte stream that a big-endian host does need to bswap the data > in 64-bit hunks. > > > I don't know how this works when migrating from a little-endian to a > > big-endian host or vice-versa (or if that is even supported...) > > The data is stored canonically within the vmsave, so such migrations are > supposed to work. Right, I was wondering about that. Could be fun to test :) Cheers ---Dave
On Thu, Jun 27, 2019 at 04:02:24PM +0100, Dave Martin wrote: > Either way, it's entirely reasonable for userspace not to try to support > additional slices for now. We'll have plenty of time to plan away > across that bridge when we spot it on the horizon... Which makes me inclined to keep the get/put register code the way it is in this patch, at least with regards to the hard coded number of slices and the build-bug. The way it's written (to me) serves to document the state of things, rather than truly implement anything, but also (to me) it's easier to understand that code than would be a couple of paragraphs of actual documentation trying to explain it. > > Within QEMU, it has so far made sense to keep the data in 64-bit hunks in > host-endian order. That's how the AdvSIMD code was written originally, and it > turned out to be easy enough to continue that for SVE. > > Fair enough. It's entirely up to QEMU to decide -- I just wanted to > check that there was no misunderstanding about this issue in the ABI. We do need to use/swap-to host-endian when we implement the monitor's dump-guest-memory command, as it also creates ELF notes for the general and VFP (and, coming soon, SVE) registers. Implementing those ELF notes for SVE is on my TODO, right after this series. Thanks, drew
On Wed, Jun 26, 2019 at 05:22:34PM +0200, Richard Henderson wrote: > On 6/21/19 6:34 PM, Andrew Jones wrote: > > +/* > > + * If ARM_MAX_VQ is increased to be greater than 16, then we can no > > + * longer hard code slices to 1 in kvm_arch_put/get_sve(). > > + */ > > +QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16); > > This seems easy to fix, or simply drop the slices entirely for now, as > otherwise they are a teeny bit confusing. I can do that, but as I replied down thread, I sort of like it this way for documentation purposes. Anyway, I don't have a strong opinion here, so I'm happy to make reviewers happy :-) > > It's a shame that these slices exist at all. It seems like the kernel could > use the negotiated max sve size to grab the data all at once. > > > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > > + uint64_t *q = aa64_vfp_qreg(env, n); > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2]; > > + int j; > > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > It might be worth splitting this... > > > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > > + uint64_t *q = &env->vfp.pregs[n].p[0]; > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2 / 8]; > > + int j; > > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > ... and this (unified w/ reg + size parameters?) to a function because ... > > > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > ... you forgot to apply the bswap here. 
Ah, thanks for catching this. I'll fix it for v3, possibly with the factoring, as you suggest. > > Likewise for the other direction. > > > r~ > > > PS: It's also tempting to drop the ifdefs and, since we know the host supports > sve instructions, and that the host supports sve_max_vq, do the reformatting as > > uint64_t scratch[ARM_MAX_VQ * 2]; > asm("whilelo p0.d, xzr, %2\n\t" > "ld1d z0.d, p0/z [%1]\n\t" > "str z0, [%0]" > : "=Q"(scratch) > : "Q"(*aa64_vfp_qreg(env, n)), > "r"(cpu->sve_max_vq) > : "p0", "v0"); This is nice, but as we don't have any other asm's in this file, I'm inclined to leave it with the loops/swaps until we can use a builtin, as you suggest below. > > PPS: Ideally, this would be further cleaned up with acle builtins, but those > are still under development for GCC. > Thanks, drew
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index a2485d447e6a..706541327491 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -673,11 +673,12 @@ int kvm_arch_destroy_vcpu(CPUState *cs)
 bool kvm_arm_reg_syncs_via_cpreg_list(uint64_t regidx)
 {
     /* Return true if the regidx is a register we should synchronize
-     * via the cpreg_tuples array (ie is not a core reg we sync by
-     * hand in kvm_arch_get/put_registers())
+     * via the cpreg_tuples array (ie is not a core or sve reg that
+     * we sync by hand in kvm_arch_get/put_registers())
      */
     switch (regidx & KVM_REG_ARM_COPROC_MASK) {
     case KVM_REG_ARM_CORE:
+    case KVM_REG_ARM64_SVE:
         return false;
     default:
         return true;
@@ -763,6 +764,70 @@ static int kvm_arch_put_fpsimd(CPUState *cs)
     return 0;
 }
 
+/*
+ * If ARM_MAX_VQ is increased to be greater than 16, then we can no
+ * longer hard code slices to 1 in kvm_arch_put/get_sve().
+ */
+QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16);
+
+static int kvm_arch_put_sve(CPUState *cs)
+{
+    ARMCPU *cpu = ARM_CPU(cs);
+    CPUARMState *env = &cpu->env;
+    struct kvm_one_reg reg;
+    int slices = 1;
+    int i, n, ret;
+
+    for (i = 0; i < slices; i++) {
+        for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) {
+            uint64_t *q = aa64_vfp_qreg(env, n);
+#ifdef HOST_WORDS_BIGENDIAN
+            uint64_t d[ARM_MAX_VQ * 2];
+            int j;
+            for (j = 0; j < cpu->sve_max_vq * 2; j++) {
+                d[j] = bswap64(q[j]);
+            }
+            reg.addr = (uintptr_t)d;
+#else
+            reg.addr = (uintptr_t)q;
+#endif
+            reg.id = KVM_REG_ARM64_SVE_ZREG(n, i);
+            ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+            if (ret) {
+                return ret;
+            }
+        }
+
+        for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) {
+            uint64_t *q = &env->vfp.pregs[n].p[0];
+#ifdef HOST_WORDS_BIGENDIAN
+            uint64_t d[ARM_MAX_VQ * 2 / 8];
+            int j;
+            for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) {
+                d[j] = bswap64(q[j]);
+            }
+            reg.addr = (uintptr_t)d;
+#else
+            reg.addr = (uintptr_t)q;
+#endif
+            reg.id = KVM_REG_ARM64_SVE_PREG(n, i);
+            ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+            if (ret) {
+                return ret;
+            }
+        }
+
+        reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0];
+        reg.id = KVM_REG_ARM64_SVE_FFR(i);
+        ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 int kvm_arch_put_registers(CPUState *cs, int level)
 {
     struct kvm_one_reg reg;
@@ -857,7 +922,11 @@ int kvm_arch_put_registers(CPUState *cs, int level)
         }
     }
 
-    ret = kvm_arch_put_fpsimd(cs);
+    if (!cpu->sve_max_vq) {
+        ret = kvm_arch_put_fpsimd(cs);
+    } else {
+        ret = kvm_arch_put_sve(cs);
+    }
     if (ret) {
         return ret;
     }
@@ -920,6 +989,60 @@ static int kvm_arch_get_fpsimd(CPUState *cs)
     return 0;
 }
 
+static int kvm_arch_get_sve(CPUState *cs)
+{
+    ARMCPU *cpu = ARM_CPU(cs);
+    CPUARMState *env = &cpu->env;
+    struct kvm_one_reg reg;
+    int slices = 1;
+    int i, n, ret;
+
+    for (i = 0; i < slices; i++) {
+        for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) {
+            uint64_t *q = aa64_vfp_qreg(env, n);
+            reg.id = KVM_REG_ARM64_SVE_ZREG(n, i);
+            reg.addr = (uintptr_t)q;
+            ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
+            if (ret) {
+                return ret;
+            } else {
+#ifdef HOST_WORDS_BIGENDIAN
+                int j;
+                for (j = 0; j < cpu->sve_max_vq * 2; j++) {
+                    q[j] = bswap64(q[j]);
+                }
+#endif
+            }
+        }
+
+        for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) {
+            uint64_t *q = &env->vfp.pregs[n].p[0];
+            reg.id = KVM_REG_ARM64_SVE_PREG(n, i);
+            reg.addr = (uintptr_t)q;
+            ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
+            if (ret) {
+                return ret;
+            } else {
+#ifdef HOST_WORDS_BIGENDIAN
+                int j;
+                for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) {
+                    q[j] = bswap64(q[j]);
+                }
+#endif
+            }
+        }
+
+        reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0];
+        reg.id = KVM_REG_ARM64_SVE_FFR(i);
+        ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 int kvm_arch_get_registers(CPUState *cs)
 {
     struct kvm_one_reg reg;
@@ -1014,7 +1137,11 @@ int kvm_arch_get_registers(CPUState *cs)
         env->spsr = env->banked_spsr[i];
     }
 
-    ret = kvm_arch_get_fpsimd(cs);
+    if (!cpu->sve_max_vq) {
+        ret = kvm_arch_get_fpsimd(cs);
+    } else {
+        ret = kvm_arch_get_sve(cs);
+    }
     if (ret) {
         return ret;
     }
These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the
swabbing is different than it is for fpsimd because the vector format
is a little-endian stream of words.

Signed-off-by: Andrew Jones <drjones@redhat.com>
---
 target/arm/kvm64.c | 135 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 131 insertions(+), 4 deletions(-)