Message ID: 20190621163422.6127-11-drjones@redhat.com
Series: target/arm/kvm: enable SVE in guests
On Fri, Jun 21, 2019 at 05:34:18PM +0100, Andrew Jones wrote: > These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the > swabbing is different than it is for fpsmid because the vector format > is a little-endian stream of words. Note, on big-endian hosts the FPSIMD view Vn and the SVE view Zn[127:0] of the FPSIMD/SVE common register bits have the opposite endianness for SVE_{GET,SET}_ONE_REG. This only matters if mixing the two views: just from this patch I don't know whether this is an issue for QEMU or not. The kernel and gdb were recently found to be broken in this regard for userspace [1] but the KVM interface should be unaffected. Cheers ---Dave [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kernel?id=41040cf7c5f0f26c368bc5d3016fed3a9ca6dba4
On Mon, Jun 24, 2019 at 12:05:35PM +0100, Dave Martin wrote: > On Fri, Jun 21, 2019 at 05:34:18PM +0100, Andrew Jones wrote: > > These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the > > swabbing is different than it is for fpsmid because the vector format > > is a little-endian stream of words. > > Note, on big-endian hosts the FPSIMD view Vn and the SVE view Zn[127:0] > of the FPSIMD/SVE common register bits has the opposite endianness for > SVE_{GET,SET}_ONE_REG. > > This only matters if mixing the two views: just from this patch I don't > know whether this is an issue for QEMU or not. I don't know either. My experience with the emulation side of QEMU is mostly the zcr_write tweak in this series. And, TBH, I didn't put too much thought into the endianness stuff, nor test this series with big endian. Hopefully Richard can chime in on this. Thanks, drew > > The kernel and gdb were recently found to be broken in this regard for > userspace [1] but the KVM interface should be unaffected. > > Cheers > ---Dave > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kernel?id=41040cf7c5f0f26c368bc5d3016fed3a9ca6dba4
On Mon, Jun 24, 2019 at 12:55:53PM +0100, Andrew Jones wrote: > On Mon, Jun 24, 2019 at 12:05:35PM +0100, Dave Martin wrote: > > On Fri, Jun 21, 2019 at 05:34:18PM +0100, Andrew Jones wrote: > > > These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the > > > swabbing is different than it is for fpsmid because the vector format > > > is a little-endian stream of words. > > > > Note, on big-endian hosts the FPSIMD view Vn and the SVE view Zn[127:0] > > of the FPSIMD/SVE common register bits has the opposite endianness for > > SVE_{GET,SET}_ONE_REG. > > > > This only matters if mixing the two views: just from this patch I don't > > know whether this is an issue for QEMU or not. > > I don't know either. My experience with the emulation side of QEMU is > mostly the zcr_write tweak in this series. And, TBH, I didn't put too > much thought into the endianness stuff, nor test this series with big > endian. Neither did I (at least beyond the "does it boot" level) -- hence the bug ;) And of course, few people are using big-endian, so nobody complained. Just flagging it up so it doesn't get missed! > Hopefully Richard can chime in on this. It would be interesting to know whether you do hit this issue somewhere, or whether you have a strong view about the clarification to the KVM ABI. Cheers ---Dave
On 6/21/19 6:34 PM, Andrew Jones wrote: > +/* > + * If ARM_MAX_VQ is increased to be greater than 16, then we can no > + * longer hard code slices to 1 in kvm_arch_put/get_sve(). > + */ > +QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16); This seems easy to fix, or simply drop the slices entirely for now, as otherwise they are a teeny bit confusing. It's a shame that these slices exist at all. It seems like the kernel could use the negotiated max sve size to grab the data all at once. > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > + uint64_t *q = aa64_vfp_qreg(env, n); > +#ifdef HOST_WORDS_BIGENDIAN > + uint64_t d[ARM_MAX_VQ * 2]; > + int j; > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > + d[j] = bswap64(q[j]); > + } > + reg.addr = (uintptr_t)d; > +#else > + reg.addr = (uintptr_t)q; > +#endif > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); It might be worth splitting this... > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > + uint64_t *q = &env->vfp.pregs[n].p[0]; > +#ifdef HOST_WORDS_BIGENDIAN > + uint64_t d[ARM_MAX_VQ * 2 / 8]; > + int j; > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > + d[j] = bswap64(q[j]); > + } > + reg.addr = (uintptr_t)d; > +#else > + reg.addr = (uintptr_t)q; > +#endif > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); ... and this (unified w/ reg + size parameters?) to a function because ... > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); ... you forgot to apply the bswap here. Likewise for the other direction. 
r~

PS: It's also tempting to drop the ifdefs and, since we know the host supports SVE instructions and supports sve_max_vq, do the reformatting as

    uint64_t scratch[ARM_MAX_VQ * 2];
    asm("whilelo p0.d, xzr, %2\n\t"
        "ld1d z0.d, p0/z, [%1]\n\t"
        "str z0, [%0]"
        : "=Q"(scratch)
        : "Q"(*aa64_vfp_qreg(env, n)),
          "r"(cpu->sve_max_vq)
        : "p0", "v0");

PPS: Ideally, this would be further cleaned up with ACLE builtins, but those are still under development for GCC.
Hi, On 6/21/19 6:34 PM, Andrew Jones wrote: > These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the > swabbing is different than it is for fpsmid because the vector format > is a little-endian stream of words. some cosmetic changes besides Richard's comments > > Signed-off-by: Andrew Jones <drjones@redhat.com> > --- > target/arm/kvm64.c | 135 +++++++++++++++++++++++++++++++++++++++++++-- > 1 file changed, 131 insertions(+), 4 deletions(-) > > diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c > index a2485d447e6a..706541327491 100644 > --- a/target/arm/kvm64.c > +++ b/target/arm/kvm64.c > @@ -673,11 +673,12 @@ int kvm_arch_destroy_vcpu(CPUState *cs) > bool kvm_arm_reg_syncs_via_cpreg_list(uint64_t regidx) > { > /* Return true if the regidx is a register we should synchronize > - * via the cpreg_tuples array (ie is not a core reg we sync by > - * hand in kvm_arch_get/put_registers()) > + * via the cpreg_tuples array (ie is not a core or sve reg that > + * we sync by hand in kvm_arch_get/put_registers()) > */ > switch (regidx & KVM_REG_ARM_COPROC_MASK) { > case KVM_REG_ARM_CORE: > + case KVM_REG_ARM64_SVE: > return false; > default: > return true; > @@ -763,6 +764,70 @@ static int kvm_arch_put_fpsimd(CPUState *cs) > return 0; > } > > +/* > + * If ARM_MAX_VQ is increased to be greater than 16, then we can no > + * longer hard code slices to 1 in kvm_arch_put/get_sve(). > + */ > +QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16); if the code is ready to support slices, I guess you could have a define and compute the slice number from ARM_MAX_VQ? 
> + > +static int kvm_arch_put_sve(CPUState *cs) > +{ > + ARMCPU *cpu = ARM_CPU(cs); > + CPUARMState *env = &cpu->env; > + struct kvm_one_reg reg; > + int slices = 1; > + int i, n, ret; > + > + for (i = 0; i < slices; i++) { > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > + uint64_t *q = aa64_vfp_qreg(env, n); > +#ifdef HOST_WORDS_BIGENDIAN > + uint64_t d[ARM_MAX_VQ * 2]; > + int j; line to be added > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > + d[j] = bswap64(q[j]); > + } > + reg.addr = (uintptr_t)d; > +#else > + reg.addr = (uintptr_t)q; > +#endif > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > + if (ret) { > + return ret; > + } > + } > + > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > + uint64_t *q = &env->vfp.pregs[n].p[0]; > +#ifdef HOST_WORDS_BIGENDIAN > + uint64_t d[ARM_MAX_VQ * 2 / 8]; > + int j; line > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > + d[j] = bswap64(q[j]); > + } > + reg.addr = (uintptr_t)d; > +#else > + reg.addr = (uintptr_t)q; > +#endif > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > + if (ret) { > + return ret; > + } > + } > + > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > + if (ret) { > + return ret; > + } > + } > + > + return 0; > +} > + > int kvm_arch_put_registers(CPUState *cs, int level) > { > struct kvm_one_reg reg; > @@ -857,7 +922,11 @@ int kvm_arch_put_registers(CPUState *cs, int level) > } > } > > - ret = kvm_arch_put_fpsimd(cs); > + if (!cpu->sve_max_vq) { > + ret = kvm_arch_put_fpsimd(cs); > + } else { > + ret = kvm_arch_put_sve(cs); > + } > if (ret) { > return ret; > } > @@ -920,6 +989,60 @@ static int kvm_arch_get_fpsimd(CPUState *cs) > return 0; > } > > +static int kvm_arch_get_sve(CPUState *cs) > +{ > + ARMCPU *cpu = ARM_CPU(cs); > + CPUARMState *env = &cpu->env; > + struct kvm_one_reg reg; > + int slices 
= 1; > + int i, n, ret; > + > + for (i = 0; i < slices; i++) { > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > + uint64_t *q = aa64_vfp_qreg(env, n); extra line needed > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > + reg.addr = (uintptr_t)q; > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > + if (ret) { > + return ret; > + } else {> +#ifdef HOST_WORDS_BIGENDIAN > + int j; line > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > + q[j] = bswap64(q[j]); > + } > +#endif > + } > + } > + > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > + uint64_t *q = &env->vfp.pregs[n].p[0]; extra line needed > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > + reg.addr = (uintptr_t)q; > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > + if (ret) { > + return ret; > + } else {> +#ifdef HOST_WORDS_BIGENDIAN > + int j; line > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > + q[j] = bswap64(q[j]); > + } > +#endif > + } > + } > + > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > + if (ret) { > + return ret; > + } > + } > + > + return 0; > +} > + > int kvm_arch_get_registers(CPUState *cs) > { > struct kvm_one_reg reg; > @@ -1014,7 +1137,11 @@ int kvm_arch_get_registers(CPUState *cs) > env->spsr = env->banked_spsr[i]; > } > > - ret = kvm_arch_get_fpsimd(cs); > + if (!cpu->sve_max_vq) { > + ret = kvm_arch_get_fpsimd(cs); > + } else { > + ret = kvm_arch_get_sve(cs); > + } > if (ret) { > return ret; > } > Thanks Eric
On Wed, Jun 26, 2019 at 04:22:34PM +0100, Richard Henderson wrote: > On 6/21/19 6:34 PM, Andrew Jones wrote: > > +/* > > + * If ARM_MAX_VQ is increased to be greater than 16, then we can no > > + * longer hard code slices to 1 in kvm_arch_put/get_sve(). > > + */ > > +QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16); > > This seems easy to fix, or simply drop the slices entirely for now, as > otherwise they are a teeny bit confusing. > > It's a shame that these slices exist at all. It seems like the kernel could > use the negotiated max sve size to grab the data all at once. The aim here was to be forwards compatible while fitting within the existing ABI. The ABI doesn't allow variable-sized registers, and if the vq can someday grow above 16 then the individual registers could become pretty big. Inside the kernel, we took the view that if that ever happens, it's sufficiently far out that we can skip implementing the support today, providing that the ABI can accommodate the change. For qemu, if you don't actually care what's in the regs, you could just enumerate them using KVM_GET_REG_LIST instead of manufacturing the IDs by hand. That way, you don't have to care what slices exist. For save/restore/migrate purposes, the regs are just data, so that's probably enough. Debug, and exchanging vector registers between the guest and, say, an SMC trapped to userspace, would still need to examine specific regs. I don't know what QEMU does in this area though. > > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > > + uint64_t *q = aa64_vfp_qreg(env, n); > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2]; > > + int j; > > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg); > > It might be worth splitting this...
> > > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > > + uint64_t *q = &env->vfp.pregs[n].p[0]; > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2 / 8]; > > + int j; > > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); It's for QEMU to choose, but does it actually matter what byteorder you store a Z-reg or P-reg in? Maybe the byteswap isn't really needed. I don't know how this works when migrating from a little-endian to a big-endian host or vice-versa (or if that is even supported...) [...] Cheers ---Dave
On Thu, Jun 27, 2019 at 08:56:21AM +0200, Auger Eric wrote: > Hi, > > On 6/21/19 6:34 PM, Andrew Jones wrote: > > These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the > > swabbing is different than it is for fpsmid because the vector format > > is a little-endian stream of words. > > some cosmetic changes besides Richard's comments > > > > Signed-off-by: Andrew Jones <drjones@redhat.com> > > --- > > target/arm/kvm64.c | 135 +++++++++++++++++++++++++++++++++++++++++++-- > > 1 file changed, 131 insertions(+), 4 deletions(-) > > > > diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c > > index a2485d447e6a..706541327491 100644 > > --- a/target/arm/kvm64.c > > +++ b/target/arm/kvm64.c > > @@ -673,11 +673,12 @@ int kvm_arch_destroy_vcpu(CPUState *cs) > > bool kvm_arm_reg_syncs_via_cpreg_list(uint64_t regidx) > > { > > /* Return true if the regidx is a register we should synchronize > > - * via the cpreg_tuples array (ie is not a core reg we sync by > > - * hand in kvm_arch_get/put_registers()) > > + * via the cpreg_tuples array (ie is not a core or sve reg that > > + * we sync by hand in kvm_arch_get/put_registers()) > > */ > > switch (regidx & KVM_REG_ARM_COPROC_MASK) { > > case KVM_REG_ARM_CORE: > > + case KVM_REG_ARM64_SVE: > > return false; > > default: > > return true; > > @@ -763,6 +764,70 @@ static int kvm_arch_put_fpsimd(CPUState *cs) > > return 0; > > } > > > > +/* > > + * If ARM_MAX_VQ is increased to be greater than 16, then we can no > > + * longer hard code slices to 1 in kvm_arch_put/get_sve(). > > + */ > > +QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16); > if the code is ready to support slices, I guess you could have a define > and compute the slice number from ARM_MAX_VQ? Yeah, that should be do-able. Thanks for the suggestion. 
> > + > > +static int kvm_arch_put_sve(CPUState *cs) > > +{ > > + ARMCPU *cpu = ARM_CPU(cs); > > + CPUARMState *env = &cpu->env; > > + struct kvm_one_reg reg; > > + int slices = 1; > > + int i, n, ret; > > + > > + for (i = 0; i < slices; i++) { > > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > > + uint64_t *q = aa64_vfp_qreg(env, n); > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2]; > > + int j; > line to be added > > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } > > + } > > + > > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > > + uint64_t *q = &env->vfp.pregs[n].p[0]; > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2 / 8]; > > + int j; > line > > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } > > + } > > + > > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } > > + } > > + > > + return 0; > > +} > > + > > int kvm_arch_put_registers(CPUState *cs, int level) > > { > > struct kvm_one_reg reg; > > @@ -857,7 +922,11 @@ int kvm_arch_put_registers(CPUState *cs, int level) > > } > > } > > > > - ret = kvm_arch_put_fpsimd(cs); > > + if (!cpu->sve_max_vq) { > > + ret = kvm_arch_put_fpsimd(cs); > > + } else { > > + ret = kvm_arch_put_sve(cs); > > + } > > if (ret) { > > return ret; > > } > > @@ -920,6 +989,60 @@ static int kvm_arch_get_fpsimd(CPUState *cs) > > return 0; > > } 
> > > > +static int kvm_arch_get_sve(CPUState *cs) > > +{ > > + ARMCPU *cpu = ARM_CPU(cs); > > + CPUARMState *env = &cpu->env; > > + struct kvm_one_reg reg; > > + int slices = 1; > > + int i, n, ret; > > + > > + for (i = 0; i < slices; i++) { > > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > > + uint64_t *q = aa64_vfp_qreg(env, n); > extra line needed > > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > > + reg.addr = (uintptr_t)q; > > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } else {> +#ifdef HOST_WORDS_BIGENDIAN > > + int j; > line > > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > > + q[j] = bswap64(q[j]); > > + } > > +#endif > > + } > > + } > > + > > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > > + uint64_t *q = &env->vfp.pregs[n].p[0]; > extra line needed > > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > > + reg.addr = (uintptr_t)q; > > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } else {> +#ifdef HOST_WORDS_BIGENDIAN > > + int j; > line > > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > > + q[j] = bswap64(q[j]); > > + } > > +#endif > > + } > > + } > > + > > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > > + ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, ®); > > + if (ret) { > > + return ret; > > + } > > + } > > + > > + return 0; > > +} > > + > > int kvm_arch_get_registers(CPUState *cs) > > { > > struct kvm_one_reg reg; > > @@ -1014,7 +1137,11 @@ int kvm_arch_get_registers(CPUState *cs) > > env->spsr = env->banked_spsr[i]; > > } > > > > - ret = kvm_arch_get_fpsimd(cs); > > + if (!cpu->sve_max_vq) { > > + ret = kvm_arch_get_fpsimd(cs); > > + } else { > > + ret = kvm_arch_get_sve(cs); > > + } > > if (ret) { > > return ret; > > } > > > > I'll see if I can clean/improve this patch with yours and Richard's suggestions. Thanks, drew
On 6/27/19 12:59 PM, Dave Martin wrote: >> It's a shame that these slices exist at all. It seems like the kernel could >> use the negotiated max sve size to grab the data all at once. > > The aim here was to be forwards compatible while fitting within the > existing ABI. > > The ABI doesn't allow variable-sized registers, and if the vq can > someday grow above 16 then the individual registers could become pretty > big. The ABI doesn't appear to have fixed sized data blocks. Since that's the case, it didn't seem to me that variable sized blocks was so different, given that the actual size is constrained by the hardware on which we're running. And if VQ does grow large, then do we really want oodles of syscalls in order to transfer the data for each register? With the 9 bits reserved for this field, we could require a maximum of 1024 syscalls to transfer the whole register set. > It's for QEMU to choose, but does it actually matter what byteorder you > store a Z-reg or P-reg in? Maybe the byteswap isn't really needed. I think the only sensible order for the kernel is that in which LDR/STR itself saves the data. Which is a byte stream. Within QEMU, it has so far made sense to keep the data in 64-bit hunks in host-endian order. That's how the AdvSIMD code was written originally, and it turned out to be easy enough to continue that for SVE. Which does mean that if we want to pass data to the kernel as the aforementioned byte stream that a big-endian host does need to bswap the data in 64-bit hunks. > I don't know how this works when migrating from a little-endian to a > big-endian host or vice-versa (or if that is even supported...) The data is stored canonically within the vmsave, so such migrations are supposed to work. r~
On Thu, Jun 27, 2019 at 12:26:06PM +0100, Richard Henderson wrote: > On 6/27/19 12:59 PM, Dave Martin wrote: > >> It's a shame that these slices exist at all. It seems like the kernel could > >> use the negotiated max sve size to grab the data all at once. > > > > The aim here was to be forwards compatible while fitting within the > > existing ABI. > > > > The ABI doesn't allow variable-sized registers, and if the vq can > > someday grow above 16 then the individual registers could become pretty > > big. > > The ABI doesn't appear to have fixed sized data blocks. Since that's > the case, it didn't seem to me that variable sized blocks was so > different, given that the actual size is constrained by the hardware > on which we're running. I'm not sure what you mean here. For KVM_GET_ONE_REG, the size is determined by the reg size field in the register ID, so size is deemed to be a fixed property of a particular register. Having the register IDs vary according to the vector length seemed a step too far. > And if VQ does grow large, then do we really want oodles of syscalls in order > to transfer the data for each register? With the 9 bits reserved for this > field, we could require a maximum of 1024 syscalls to transfer the whole > register set. A save/restore requires oodles of syscalls in any case, and for SVE there is a rapid dropoff of probabilities: VQ < 16 is much likelier than VQ == 32 is likelier than VQ == 64 etc. The reg access API has some shortcomings, and we might find at some point that the whole thing needs redesigning. I suppose we could have taken the view that the KVM ABI would not even try to support VQ > 16 in a forwards compatible way. In the end we decided to at least have things workable. Either way, it's entirely reasonable for userspace not to try to support additional slices for now. We'll have plenty of time to plan away across that bridge when we spot it on the horizon... 
> > It's for QEMU to choose, but does it actually matter what byteorder you > > store a Z-reg or P-reg in? Maybe the byteswap isn't really needed. > > I think the only sensible order for the kernel is that in which LDR/STR itself > saves the data. Which is a byte stream. We have a choice of STRs though. Anyway, yes, it is the way it is, now. > Within QEMU, it has so far made sense to keep the data in 64-bit hunks in > host-endian order. That's how the AdvSIMD code was written originally, and it > turned out to be easy enough to continue that for SVE. Fair enough. It's entirely up to QEMU to decide -- I just wanted to check that there was no misunderstanding about this issue in the ABI. > Which does mean that if we want to pass data to the kernel as the > aforementioned byte stream that a big-endian host does need to bswap the data > in 64-bit hunks. > > > I don't know how this works when migrating from a little-endian to a > > big-endian host or vice-versa (or if that is even supported...) > > The data is stored canonically within the vmsave, so such migrations are > supposed to work. Right, I was wondering about that. Could be fun to test :) Cheers ---Dave
On Thu, Jun 27, 2019 at 04:02:24PM +0100, Dave Martin wrote: > Either way, it's entirely reasonable for userspace not to try to support > additional slices for now. We'll have plenty of time to plan away > across that bridge when we spot it on the horizon... Which makes me inclined to keep the get/put register code the way it is in this patch, at least with regards to the hard coded number of slices and the build-bug. The way it's written (to me) serves to document the state of things, rather than truly implement anything, but also (to me) it's easier to understand that code than would be a couple of paragraphs of actual documentation trying to explain it. > > Within QEMU, it has so far made sense to keep the data in 64-bit hunks in > host-endian order. That's how the AdvSIMD code was written originally, and it > turned out to be easy enough to continue that for SVE. > > Fair enough. It's entirely up to QEMU to decide -- I just wanted to > check that there was no misunderstanding about this issue in the ABI. We do need to use/swap-to host-endian when we implement the monitor's dump-guest-memory command, as it also creates ELF notes for the general and VFP (and, coming soon, SVE) registers. Implementing those ELF notes for SVE is on my TODO, right after this series. Thanks, drew
On Wed, Jun 26, 2019 at 05:22:34PM +0200, Richard Henderson wrote: > On 6/21/19 6:34 PM, Andrew Jones wrote: > > +/* > > + * If ARM_MAX_VQ is increased to be greater than 16, then we can no > > + * longer hard code slices to 1 in kvm_arch_put/get_sve(). > > + */ > > +QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16); > > This seems easy to fix, or simply drop the slices entirely for now, as > otherwise they are a teeny bit confusing. I can do that, but as I replied down thread, I sort of like it this way for documentation purposes. Anyway, I don't have a strong opinion here, so I'm happy to make reviewers happy :-) > > It's a shame that these slices exist at all. It seems like the kernel could > use the negotiated max sve size to grab the data all at once. > > > + for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) { > > + uint64_t *q = aa64_vfp_qreg(env, n); > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2]; > > + int j; > > + for (j = 0; j < cpu->sve_max_vq * 2; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_ZREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > It might be worth splitting this... > > > + for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) { > > + uint64_t *q = &env->vfp.pregs[n].p[0]; > > +#ifdef HOST_WORDS_BIGENDIAN > > + uint64_t d[ARM_MAX_VQ * 2 / 8]; > > + int j; > > + for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) { > > + d[j] = bswap64(q[j]); > > + } > > + reg.addr = (uintptr_t)d; > > +#else > > + reg.addr = (uintptr_t)q; > > +#endif > > + reg.id = KVM_REG_ARM64_SVE_PREG(n, i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > ... and this (unified w/ reg + size parameters?) to a function because ... > > > + reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0]; > > + reg.id = KVM_REG_ARM64_SVE_FFR(i); > > + ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, ®); > > ... you forgot to apply the bswap here. 
Ah, thanks for catching this. I'll fix it for v3, possibly with the factoring, as you suggest. > > Likewise for the other direction. > > > r~ > > > PS: It's also tempting to drop the ifdefs and, since we know the host supports > sve instructions, and that the host supports sve_max_vq, do the reformatting as > > uint64_t scratch[ARM_MAX_VQ * 2]; > asm("whilelo p0.d, xzr, %2\n\t" > "ld1d z0.d, p0/z [%1]\n\t" > "str z0, [%0]" > : "=Q"(scratch) > : "Q"(*aa64_vfp_qreg(env, n)), > "r"(cpu->sve_max_vq) > : "p0", "v0"); This is nice, but as we don't have any other asm's in this file, I'm inclined to leave it with the loops/swaps until we can use a builtin, as you suggest below. > > PPS: Ideally, this would be further cleaned up with acle builtins, but those > are still under development for GCC. > Thanks, drew
diff --git a/target/arm/kvm64.c b/target/arm/kvm64.c
index a2485d447e6a..706541327491 100644
--- a/target/arm/kvm64.c
+++ b/target/arm/kvm64.c
@@ -673,11 +673,12 @@ int kvm_arch_destroy_vcpu(CPUState *cs)
 bool kvm_arm_reg_syncs_via_cpreg_list(uint64_t regidx)
 {
     /* Return true if the regidx is a register we should synchronize
-     * via the cpreg_tuples array (ie is not a core reg we sync by
-     * hand in kvm_arch_get/put_registers())
+     * via the cpreg_tuples array (ie is not a core or sve reg that
+     * we sync by hand in kvm_arch_get/put_registers())
      */
     switch (regidx & KVM_REG_ARM_COPROC_MASK) {
     case KVM_REG_ARM_CORE:
+    case KVM_REG_ARM64_SVE:
         return false;
     default:
         return true;
@@ -763,6 +764,70 @@ static int kvm_arch_put_fpsimd(CPUState *cs)
     return 0;
 }
 
+/*
+ * If ARM_MAX_VQ is increased to be greater than 16, then we can no
+ * longer hard code slices to 1 in kvm_arch_put/get_sve().
+ */
+QEMU_BUILD_BUG_ON(ARM_MAX_VQ > 16);
+
+static int kvm_arch_put_sve(CPUState *cs)
+{
+    ARMCPU *cpu = ARM_CPU(cs);
+    CPUARMState *env = &cpu->env;
+    struct kvm_one_reg reg;
+    int slices = 1;
+    int i, n, ret;
+
+    for (i = 0; i < slices; i++) {
+        for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) {
+            uint64_t *q = aa64_vfp_qreg(env, n);
+#ifdef HOST_WORDS_BIGENDIAN
+            uint64_t d[ARM_MAX_VQ * 2];
+            int j;
+            for (j = 0; j < cpu->sve_max_vq * 2; j++) {
+                d[j] = bswap64(q[j]);
+            }
+            reg.addr = (uintptr_t)d;
+#else
+            reg.addr = (uintptr_t)q;
+#endif
+            reg.id = KVM_REG_ARM64_SVE_ZREG(n, i);
+            ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+            if (ret) {
+                return ret;
+            }
+        }
+
+        for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) {
+            uint64_t *q = &env->vfp.pregs[n].p[0];
+#ifdef HOST_WORDS_BIGENDIAN
+            uint64_t d[ARM_MAX_VQ * 2 / 8];
+            int j;
+            for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) {
+                d[j] = bswap64(q[j]);
+            }
+            reg.addr = (uintptr_t)d;
+#else
+            reg.addr = (uintptr_t)q;
+#endif
+            reg.id = KVM_REG_ARM64_SVE_PREG(n, i);
+            ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+            if (ret) {
+                return ret;
+            }
+        }
+
+        reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0];
+        reg.id = KVM_REG_ARM64_SVE_FFR(i);
+        ret = kvm_vcpu_ioctl(cs, KVM_SET_ONE_REG, &reg);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 int kvm_arch_put_registers(CPUState *cs, int level)
 {
     struct kvm_one_reg reg;
@@ -857,7 +922,11 @@ int kvm_arch_put_registers(CPUState *cs, int level)
         }
     }
 
-    ret = kvm_arch_put_fpsimd(cs);
+    if (!cpu->sve_max_vq) {
+        ret = kvm_arch_put_fpsimd(cs);
+    } else {
+        ret = kvm_arch_put_sve(cs);
+    }
     if (ret) {
         return ret;
     }
@@ -920,6 +989,60 @@ static int kvm_arch_get_fpsimd(CPUState *cs)
     return 0;
 }
 
+static int kvm_arch_get_sve(CPUState *cs)
+{
+    ARMCPU *cpu = ARM_CPU(cs);
+    CPUARMState *env = &cpu->env;
+    struct kvm_one_reg reg;
+    int slices = 1;
+    int i, n, ret;
+
+    for (i = 0; i < slices; i++) {
+        for (n = 0; n < KVM_ARM64_SVE_NUM_ZREGS; n++) {
+            uint64_t *q = aa64_vfp_qreg(env, n);
+            reg.id = KVM_REG_ARM64_SVE_ZREG(n, i);
+            reg.addr = (uintptr_t)q;
+            ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
+            if (ret) {
+                return ret;
+            } else {
+#ifdef HOST_WORDS_BIGENDIAN
+                int j;
+                for (j = 0; j < cpu->sve_max_vq * 2; j++) {
+                    q[j] = bswap64(q[j]);
+                }
+#endif
+            }
+        }
+
+        for (n = 0; n < KVM_ARM64_SVE_NUM_PREGS; n++) {
+            uint64_t *q = &env->vfp.pregs[n].p[0];
+            reg.id = KVM_REG_ARM64_SVE_PREG(n, i);
+            reg.addr = (uintptr_t)q;
+            ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
+            if (ret) {
+                return ret;
+            } else {
+#ifdef HOST_WORDS_BIGENDIAN
+                int j;
+                for (j = 0; j < cpu->sve_max_vq * 2 / 8; j++) {
+                    q[j] = bswap64(q[j]);
+                }
+#endif
+            }
+        }
+
+        reg.addr = (uintptr_t)&env->vfp.pregs[FFR_PRED_NUM].p[0];
+        reg.id = KVM_REG_ARM64_SVE_FFR(i);
+        ret = kvm_vcpu_ioctl(cs, KVM_GET_ONE_REG, &reg);
+        if (ret) {
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
 int kvm_arch_get_registers(CPUState *cs)
 {
     struct kvm_one_reg reg;
@@ -1014,7 +1137,11 @@ int kvm_arch_get_registers(CPUState *cs)
         env->spsr = env->banked_spsr[i];
     }
 
-    ret = kvm_arch_get_fpsimd(cs);
+    if (!cpu->sve_max_vq) {
+        ret = kvm_arch_get_fpsimd(cs);
+    } else {
+        ret = kvm_arch_get_sve(cs);
+    }
     if (ret) {
         return ret;
     }
These are the SVE equivalents to kvm_arch_get/put_fpsimd. Note, the
swabbing is different than it is for fpsimd because the vector format
is a little-endian stream of words.

Signed-off-by: Andrew Jones <drjones@redhat.com>
---
 target/arm/kvm64.c | 135 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 131 insertions(+), 4 deletions(-)