
[3/4] AArch64: enable zero-extends using TBLs for Adv. SIMD

Message ID Zwz4zvHBkkn7iOZ9@arm.com
State New
Headers show
Series [1/4] middle-end: support multi-step zero-extends using VEC_PERM_EXPR

Commit Message

Tamar Christina Oct. 14, 2024, 10:56 a.m. UTC
Hi All,

In this patch series I'm adding support for zero extending using permutes
instead of requiring multi-step decomposition.

This codegen has the benefit of needing fewer instructions and having much
higher throughput than uxtl.  We previously replaced pairs of uxtl/uxtl2s with
ZIPs to increase throughput, but latency was still an issue due to the
dependency chain created.

To fix this we can now use TBLs.  The indexes are hoisted out of the loop and
can be shared amongst zero extends of the same type.  The additional benefit
of this is that if the values are being permuted after the extensions, the
permutes will be simplified and merged, leading to better overall code.

e.g. an LD4 can be replaced with LDR since the permutes being performed for
the extensions can be merged with the load permutes.  The way LOAD_LANES is
currently implemented means this can't be done in GCC yet, but I'm aiming for
this in GCC 15.

I've additionally only added support for non-VLA.  The problem with VLA is that
the index registers are hard or impossible to construct.

On Adv. SIMD we use -1 to indicate an out-of-range index so we can transform
a two-register TBL into a single-register one.  However for e.g. a byte vector
under VLA, 255 would be a valid index at VL > 2048 bits, which means that's
already not a transformation we can make.

Secondly the actual mask looks something like {x,x,x,n,n+1, x,x,x, n+2, n+3}
and while I think I can represent this in vec_perm_builder, I couldn't think of
any really efficient VLA way to build such masks.  It would require a lot of
setup code.

Lastly I don't think this transformation would make much sense for SVE, as SVE
has loads and converts that can do multi-step conversions.  For instance the
loop below is already pretty good for SVE (though it misses that the load can
do more than one step, presumably because a single extend is only merged in RTL).

While I tried hard, for these reasons I don't support VLA, which I hope is OK.

Concretely on AArch64 this changes:

void test4(unsigned char *x, long long *y, int n) {
    for(int i = 0; i < n; i++) {
        y[i] = x[i];
    }
}

from generating:

.L4:
        ldr     q30, [x4], 16
        add     x3, x3, 128
        zip1    v1.16b, v30.16b, v31.16b
        zip2    v30.16b, v30.16b, v31.16b
        zip1    v2.8h, v1.8h, v31.8h
        zip1    v0.8h, v30.8h, v31.8h
        zip2    v1.8h, v1.8h, v31.8h
        zip2    v30.8h, v30.8h, v31.8h
        zip1    v26.4s, v2.4s, v31.4s
        zip1    v29.4s, v0.4s, v31.4s
        zip1    v28.4s, v1.4s, v31.4s
        zip1    v27.4s, v30.4s, v31.4s
        zip2    v2.4s, v2.4s, v31.4s
        zip2    v0.4s, v0.4s, v31.4s
        zip2    v1.4s, v1.4s, v31.4s
        zip2    v30.4s, v30.4s, v31.4s
        stp     q26, q2, [x3, -128]
        stp     q28, q1, [x3, -96]
        stp     q29, q0, [x3, -64]
        stp     q27, q30, [x3, -32]
        cmp     x4, x5
        bne     .L4

and instead we get:

.L4:
        add     x3, x3, 128
        ldr     q23, [x4], 16
        tbl     v5.16b, {v23.16b}, v31.16b
        tbl     v4.16b, {v23.16b}, v30.16b
        tbl     v3.16b, {v23.16b}, v29.16b
        tbl     v2.16b, {v23.16b}, v28.16b
        tbl     v1.16b, {v23.16b}, v27.16b
        tbl     v0.16b, {v23.16b}, v26.16b
        tbl     v22.16b, {v23.16b}, v25.16b
        tbl     v23.16b, {v23.16b}, v24.16b
        stp     q5, q4, [x3, -128]
        stp     q3, q2, [x3, -96]
        stp     q1, q0, [x3, -64]
        stp     q22, q23, [x3, -32]
        cmp     x4, x5
        bne     .L4
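
Each tbl above reads a loop-invariant index vector in which in-range entries
select a source byte and out-of-range entries (255) produce zero, which is what
provides the zero extension.  As a rough illustration (not code from the patch;
the helper name and constants are made up and assume little-endian lane order),
a single such mask doing a byte to 32-bit zero extend of the low four lanes
looks like:

#include <arm_neon.h>
#include <stdint.h>

/* Sketch only: zero extend the low 4 bytes of V into four uint32 lanes with
   one TBL.  Index 255 is out of range for a 16-byte table, so TBL writes 0
   for those byte positions, which is exactly the zero extension.  */
static inline uint32x4_t
zext_lo4_u8_to_u32 (uint8x16_t v)
{
  static const uint8_t idx[16] = { 0, 255, 255, 255,
                                   1, 255, 255, 255,
                                   2, 255, 255, 255,
                                   3, 255, 255, 255 };
  return vreinterpretq_u32_u8 (vqtbl1q_u8 (v, vld1q_u8 (idx)));
}

The eight index vectors in the loop above are the byte-to-64-bit analogue of
this and, being loop invariant, only need to be materialised once.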

This results in up to a 40% performance uplift on certain workloads.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (aarch64_use_permute_for_promotion): New.
	(TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION): Use it.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/vect-tbl-zero-extend_1.c: New test.

---




--

Comments

Kyrylo Tkachov Oct. 14, 2024, 11:15 a.m. UTC | #1
Hi Tamar,

> On 14 Oct 2024, at 12:56, Tamar Christina <tamar.christina@arm.com> wrote:
> 
> Hi All,
> 
> In this patch series I'm adding support for zero extending using permutes
> instead of requiring multi-step decomposition.
> 
> This codegen has the benefit of needing fewer instructions and having much
> higher throughput than uxtl.  We previously replaced pairs of uxtl/uxtl2s with
> ZIPs to increase throughput, but latency was still an issue due to the
> dependency chain created.
> 
> To fix this we can now use TBLs.  The indexes are hoisted out of the loop and
> can be shared amongst zero extends of the same type.  The additional benefit
> of this is that if the values are being permuted after the extensions, the
> permutes will be simplified and merged, leading to better overall code.
> 
> e.g. an LD4 can be replaced with LDR since the permutes being performed for
> the extensions can be merged with the load permutes.  The way LOAD_LANES is
> currently implemented means this can't be done in GCC yet, but I'm aiming for
> this in GCC 15.
> 
> I've additionally only added support for non-VLA.  The problem with VLA is that
> the index registers are hard or impossible to construct.
> 
> On Adv. SIMD we use -1 to indicate an out-of-range index so we can transform
> a two-register TBL into a single-register one.  However for e.g. a byte vector
> under VLA, 255 would be a valid index at VL > 2048 bits, which means that's
> already not a transformation we can make.
> 
> Secondly the actual mask looks something like {x,x,x,n,n+1, x,x,x, n+2, n+3}
> and while I think I can represent this in vec_perm_builder, I couldn't think of
> any really efficient VLA way to build such masks.  It would require a lot of
> setup code.
> 
> Lastly I don't think this transformation would make much sense for SVE, as SVE
> has loads and converts that can do multi-step conversions.  For instance the
> loop below is already pretty good for SVE (though it misses that the load can
> do more than one step, presumably because a single extend is only merged in RTL).

Thanks for working on this, it’s certainly an area that can be improved from the current state!
On the SVE front, one thing we noticed recently is that on Neoverse V2 the SVE single-register TBL instructions
have twice the throughput of the equivalent Neon ones. One has to be careful with the indexes as SVE TBL instructions
can access lanes outside the 128-bit granule (and thus are not truly VLA) but if the compiler controls the permute mask
as it does here they should be safe to emit.

I won’t block this patch/series on that point, just a consideration if you’re interested.
Thanks,
Kyrill  

> 
> While I tried hard, for these reasons I don't support VLA, which I hope is OK.
> 
> Concretely on AArch64 this changes:
> 
> void test4(unsigned char *x, long long *y, int n) {
>    for(int i = 0; i < n; i++) {
>        y[i] = x[i];
>    }
> }
> 
> from generating:
> 
> .L4:
>        ldr     q30, [x4], 16
>        add     x3, x3, 128
>        zip1    v1.16b, v30.16b, v31.16b
>        zip2    v30.16b, v30.16b, v31.16b
>        zip1    v2.8h, v1.8h, v31.8h
>        zip1    v0.8h, v30.8h, v31.8h
>        zip2    v1.8h, v1.8h, v31.8h
>        zip2    v30.8h, v30.8h, v31.8h
>        zip1    v26.4s, v2.4s, v31.4s
>        zip1    v29.4s, v0.4s, v31.4s
>        zip1    v28.4s, v1.4s, v31.4s
>        zip1    v27.4s, v30.4s, v31.4s
>        zip2    v2.4s, v2.4s, v31.4s
>        zip2    v0.4s, v0.4s, v31.4s
>        zip2    v1.4s, v1.4s, v31.4s
>        zip2    v30.4s, v30.4s, v31.4s
>        stp     q26, q2, [x3, -128]
>        stp     q28, q1, [x3, -96]
>        stp     q29, q0, [x3, -64]
>        stp     q27, q30, [x3, -32]
>        cmp     x4, x5
>        bne     .L4
> 
> and instead we get:
> 
> .L4:
>        add     x3, x3, 128
>        ldr     q23, [x4], 16
>        tbl     v5.16b, {v23.16b}, v31.16b
>        tbl     v4.16b, {v23.16b}, v30.16b
>        tbl     v3.16b, {v23.16b}, v29.16b
>        tbl     v2.16b, {v23.16b}, v28.16b
>        tbl     v1.16b, {v23.16b}, v27.16b
>        tbl     v0.16b, {v23.16b}, v26.16b
>        tbl     v22.16b, {v23.16b}, v25.16b
>        tbl     v23.16b, {v23.16b}, v24.16b
>        stp     q5, q4, [x3, -128]
>        stp     q3, q2, [x3, -96]
>        stp     q1, q0, [x3, -64]
>        stp     q22, q23, [x3, -32]
>        cmp     x4, x5
>        bne     .L4
> 
> This results in up to a 40% performance uplift on certain workloads.
> 
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> 
> Ok for master?
> 
> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
>        * config/aarch64/aarch64.cc (aarch64_use_permute_for_promotion): New.
>        (TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION): Use it.
> 
> gcc/testsuite/ChangeLog:
> 
>        * gcc.target/aarch64/vect-tbl-zero-extend_1.c: New test.
> 
> ---
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 102680a0efca1ce928e6945033c01cfb68a65152..b90577f4fc8157b3e02936256c8af8b2b7fac144 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -28404,6 +28404,29 @@ aarch64_empty_mask_is_expensive (unsigned)
>   return false;
> }
> 
> +/* Implement TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION.  Assume that
> +   predicated operations when available are beneficial when doing more than
> +   one step conversion.  */
> +
> +static bool
> +aarch64_use_permute_for_promotion (const_tree in_type, const_tree out_type)
> +{
> +  /* AArch64's vect_perm_constant doesn't currently support two 64 bit shuffles
> +     into a 128 bit vector type.  So for now reject it.  */
> +  if (maybe_ne (GET_MODE_BITSIZE (TYPE_MODE (in_type)),
> +               GET_MODE_BITSIZE (TYPE_MODE (out_type))))
> +    return false;
> +
> +  auto bitsize_in = element_precision (in_type);
> +  auto bitsize_out = element_precision (out_type);
> +
> +  /* We don't want to use the permutes for a single widening step because we're
> +     picking there between two zip and tbl sequences with the same throughput
> +     and latencies.  However the zip doesn't require a mask and uses less
> +     registers so we prefer that.  */
> +  return (bitsize_out / bitsize_in) > 2;
> +}
> +
> /* Return 1 if pseudo register should be created and used to hold
>    GOT address for PIC code.  */
> 
> @@ -31113,6 +31136,9 @@ aarch64_libgcc_floating_mode_supported_p
> #undef TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE
> #define TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE \
>   aarch64_conditional_operation_is_expensive
> +#undef TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> +#define TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION \
> +  aarch64_use_permute_for_promotion
> #undef TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> #define TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE \
>   aarch64_empty_mask_is_expensive
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..3c088ced63543c203d1cc020de5d67807b48b3fb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> @@ -0,0 +1,52 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> +
> +void test1(unsigned short *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        unsigned short a = x[i*4+0];
> +        unsigned short b = x[i*4+1];
> +        unsigned short c = x[i*4+2];
> +        unsigned short d = x[i*4+3];
> +        y[i] = (double)a + (double)b + (double)c + (double)d;
> +    }
> +}
> +
> +void test2(unsigned char *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        unsigned short a = x[i*4+0];
> +        unsigned short b = x[i*4+1];
> +        unsigned short c = x[i*4+2];
> +        unsigned short d = x[i*4+3];
> +        y[i] = (double)a + (double)b + (double)c + (double)d;
> +    }
> +}
> +
> +void test3(unsigned short *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        unsigned int a = x[i];
> +        y[i] = (double)a;
> +    }
> +}
> +
> +void test4(unsigned short *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] = x[i];
> +    }
> +}
> +
> +void test5(unsigned int *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] = x[i];
> +    }
> +}
> +
> +void test6(unsigned char *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] = x[i];
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-times {\tzip1} 1 } } */
> +/* { dg-final { scan-assembler-times {\tzip2} 1 } } */
> +/* { dg-final { scan-assembler-times {\ttbl} 64 } } */
> +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */
> 
> 
> 
> 
> --
> <rb18855.patch>
Tamar Christina Oct. 14, 2024, 5:58 p.m. UTC | #2
Hi Kyrill,

> -----Original Message-----
> From: Kyrylo Tkachov <ktkachov@nvidia.com>
> Sent: Monday, October 14, 2024 12:15 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; nd <nd@arm.com>; Richard
> Earnshaw <Richard.Earnshaw@arm.com>; ktkachov@gcc.gnu.org; Richard
> Sandiford <Richard.Sandiford@arm.com>
> Subject: Re: [PATCH 3/4]AArch64: enable zero-extends using TBLs for Adv. SIMD
> 
> Hi Tamar,
> 
> > On 14 Oct 2024, at 12:56, Tamar Christina <tamar.christina@arm.com> wrote:
> >
> > Hi All,
> >
> > In this patch series I'm adding support for zero extending using permutes
> > instead of requiring multi-step decomposition.
> >
> > This codegen has the benefit of needing fewer instructions and having much
> > higher throughput than uxtl.  We previously replaced pairs of uxtl/uxtl2s with
> > ZIPs to increase throughput, but latency was still an issue due to the
> > dependency chain created.
> >
> > To fix this we can now use TBLs.  The indexes are hoisted out of the loop and
> > can be shared amongst zero extends of the same type.  The additional benefit
> > of this is that if the values are being permuted after the extensions, the
> > permutes will be simplified and merged, leading to better overall code.
> >
> > e.g. an LD4 can be replaced with LDR since the permutes being performed for
> > the extensions can be merged with the load permutes.  The way LOAD_LANES is
> > currently implemented means this can't be done in GCC yet, but I'm aiming for
> > this in GCC 15.
> >
> > I've additionally only added support for non-VLA.  The problem with VLA is that
> > the index registers are hard or impossible to construct.
> >
> > On Adv. SIMD we use -1 to indicate an out-of-range index so we can transform
> > a two-register TBL into a single-register one.  However for e.g. a byte vector
> > under VLA, 255 would be a valid index at VL > 2048 bits, which means that's
> > already not a transformation we can make.
> >
> > Secondly the actual mask looks something like {x,x,x,n,n+1, x,x,x, n+2, n+3}
> > and while I think I can represent this in vec_perm_builder, I couldn't think of
> > any really efficient VLA way to build such masks.  It would require a lot of
> > setup code.
> >
> > Lastly I don't think this transformation would make much sense for SVE, as SVE
> > has loads and converts that can do multi-step conversions.  For instance the
> > loop below is already pretty good for SVE (though it misses that the load can
> > do more than one step, presumably because a single extend is only merged in RTL).
> 
> Thanks for working on this, it’s certainly an area that can be improved from the
> current state!
> On the SVE front, one thing we noticed recently is that on Neoverse V2 the SVE
> single-register TBL instructions
> have twice the throughput of the equivalent Neon ones. One has to be careful
> with the indexes as SVE TBL instructions
> can access lanes outside the 128-bit granule (and thus are not truly VLA) but if the
> compiler controls the permute mask
> as it does here they should be safe to emit.

I've checked internally with the CPU teams and verified on simulators; this is likely an
error in the SWoG of several cores.  These will be corrected in the next releases.

> 
> I won’t block this patch/series on that point, just a consideration if you’re
> interested.

I did think about it for a long time, but the problem is that the single-register version of TBL
for SVE won't work, as you mentioned and as I noted in the cover letter.  So you'd need the
two-register version for the byte case, and that's SVE2 only.  I think short and above
would work though.

I was struggling with how to even represent the sequence in VLA.  I realized that you'd
have to get the VL with rdvl, and then use a predicated mov to fill the index register with
the indices to the zero register. 

The problem though is how to create the "selection" mask.  An INDEX gets you the sequential
numbers, but you then need to space them out, e.g. for byte to int you want
{x,x,x,x,x,i,x,x,x,x,x,i+1,...}, which I really can't think of any way to represent in gimple but also
can't think of a way to generate in SVE code that doesn't require a lot of setup.  For instance
there's no predicated INDEX.  The only thing I can think of is INDEX + unpacks + a predicated mov.

Anyway, I'd be happy to consider this for VLA in general, though I do hope we can get non-VLA
in first, but I really don't know of a good sequence here.

Although back to your original point, yes, we could have used SVE for the Adv. SIMD version by
creating the mask using LD1R instead of LDR.  But I'm concerned that, since there's no predicated
version of TBL, on longer VLs we'd do a lot of work that is just ignored.

It's a good question that gave me lots to think about though.

Thanks,
Tamar

> Thanks,
> Kyrill
> 
> >
> > While I tried hard, for these reasons I don't support VLA, which I hope is OK.
> >
> > Concretely on AArch64 this changes:
> >
> > void test4(unsigned char *x, long long *y, int n) {
> >    for(int i = 0; i < n; i++) {
> >        y[i] = x[i];
> >    }
> > }
> >
> > from generating:
> >
> > .L4:
> >        ldr     q30, [x4], 16
> >        add     x3, x3, 128
> >        zip1    v1.16b, v30.16b, v31.16b
> >        zip2    v30.16b, v30.16b, v31.16b
> >        zip1    v2.8h, v1.8h, v31.8h
> >        zip1    v0.8h, v30.8h, v31.8h
> >        zip2    v1.8h, v1.8h, v31.8h
> >        zip2    v30.8h, v30.8h, v31.8h
> >        zip1    v26.4s, v2.4s, v31.4s
> >        zip1    v29.4s, v0.4s, v31.4s
> >        zip1    v28.4s, v1.4s, v31.4s
> >        zip1    v27.4s, v30.4s, v31.4s
> >        zip2    v2.4s, v2.4s, v31.4s
> >        zip2    v0.4s, v0.4s, v31.4s
> >        zip2    v1.4s, v1.4s, v31.4s
> >        zip2    v30.4s, v30.4s, v31.4s
> >        stp     q26, q2, [x3, -128]
> >        stp     q28, q1, [x3, -96]
> >        stp     q29, q0, [x3, -64]
> >        stp     q27, q30, [x3, -32]
> >        cmp     x4, x5
> >        bne     .L4
> >
> > and instead we get:
> >
> > .L4:
> >        add     x3, x3, 128
> >        ldr     q23, [x4], 16
> >        tbl     v5.16b, {v23.16b}, v31.16b
> >        tbl     v4.16b, {v23.16b}, v30.16b
> >        tbl     v3.16b, {v23.16b}, v29.16b
> >        tbl     v2.16b, {v23.16b}, v28.16b
> >        tbl     v1.16b, {v23.16b}, v27.16b
> >        tbl     v0.16b, {v23.16b}, v26.16b
> >        tbl     v22.16b, {v23.16b}, v25.16b
> >        tbl     v23.16b, {v23.16b}, v24.16b
> >        stp     q5, q4, [x3, -128]
> >        stp     q3, q2, [x3, -96]
> >        stp     q1, q0, [x3, -64]
> >        stp     q22, q23, [x3, -32]
> >        cmp     x4, x5
> >        bne     .L4
> >
> > This results in up to a 40% performance uplift on certain workloads.
> >
> > Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >        * config/aarch64/aarch64.cc (aarch64_use_permute_for_promotion): New.
> >        (TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION): Use it.
> >
> > gcc/testsuite/ChangeLog:
> >
> >        * gcc.target/aarch64/vect-tbl-zero-extend_1.c: New test.
> >
> > ---
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index
> 102680a0efca1ce928e6945033c01cfb68a65152..b90577f4fc8157b3e02936256
> c8af8b2b7fac144 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -28404,6 +28404,29 @@ aarch64_empty_mask_is_expensive (unsigned)
> >   return false;
> > }
> >
> > +/* Implement TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION.
> Assume that
> > +   predicated operations when available are beneficial when doing more than
> > +   one step conversion.  */
> > +
> > +static bool
> > +aarch64_use_permute_for_promotion (const_tree in_type, const_tree
> out_type)
> > +{
> > +  /* AArch64's vect_perm_constant doesn't currently support two 64 bit
> shuffles
> > +     into a 128 bit vector type.  So for now reject it.  */
> > +  if (maybe_ne (GET_MODE_BITSIZE (TYPE_MODE (in_type)),
> > +               GET_MODE_BITSIZE (TYPE_MODE (out_type))))
> > +    return false;
> > +
> > +  auto bitsize_in = element_precision (in_type);
> > +  auto bitsize_out = element_precision (out_type);
> > +
> > +  /* We don't want to use the permutes for a single widening step because we're
> > +     picking there between two zip and tbl sequences with the same throughput
> > +     and latencies.  However the zip doesn't require a mask and uses less
> > +     registers so we prefer that.  */
> > +  return (bitsize_out / bitsize_in) > 2;
> > +}
> > +
> > /* Return 1 if pseudo register should be created and used to hold
> >    GOT address for PIC code.  */
> >
> > @@ -31113,6 +31136,9 @@ aarch64_libgcc_floating_mode_supported_p
> > #undef TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE
> > #define TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE \
> >   aarch64_conditional_operation_is_expensive
> > +#undef TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> > +#define TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION \
> > +  aarch64_use_permute_for_promotion
> > #undef TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> > #define TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE \
> >   aarch64_empty_mask_is_expensive
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..3c088ced63543c203d1cc0
> 20de5d67807b48b3fb
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> > @@ -0,0 +1,52 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> > +
> > +void test1(unsigned short *x, double *y, int n) {
> > +    for(int i = 0; i < (n & -8); i++) {
> > +        unsigned short a = x[i*4+0];
> > +        unsigned short b = x[i*4+1];
> > +        unsigned short c = x[i*4+2];
> > +        unsigned short d = x[i*4+3];
> > +        y[i] = (double)a + (double)b + (double)c + (double)d;
> > +    }
> > +}
> > +
> > +void test2(unsigned char *x, double *y, int n) {
> > +    for(int i = 0; i < (n & -8); i++) {
> > +        unsigned short a = x[i*4+0];
> > +        unsigned short b = x[i*4+1];
> > +        unsigned short c = x[i*4+2];
> > +        unsigned short d = x[i*4+3];
> > +        y[i] = (double)a + (double)b + (double)c + (double)d;
> > +    }
> > +}
> > +
> > +void test3(unsigned short *x, double *y, int n) {
> > +    for(int i = 0; i < (n & -8); i++) {
> > +        unsigned int a = x[i];
> > +        y[i] = (double)a;
> > +    }
> > +}
> > +
> > +void test4(unsigned short *x, long long *y, int n) {
> > +    for(int i = 0; i < (n & -8); i++) {
> > +        y[i] = x[i];
> > +    }
> > +}
> > +
> > +void test5(unsigned int *x, long long *y, int n) {
> > +    for(int i = 0; i < (n & -8); i++) {
> > +        y[i] = x[i];
> > +    }
> > +}
> > +
> > +void test6(unsigned char *x, long long *y, int n) {
> > +    for(int i = 0; i < (n & -8); i++) {
> > +        y[i] = x[i];
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-assembler-times {\tzip1} 1 } } */
> > +/* { dg-final { scan-assembler-times {\tzip2} 1 } } */
> > +/* { dg-final { scan-assembler-times {\ttbl} 64 } } */
> > +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */
> >
> >
> >
> >
> > --
> > <rb18855.patch>
Richard Sandiford Oct. 14, 2024, 6:49 p.m. UTC | #3
Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> In this patch series I'm adding support for zero extending using permutes
> instead of requiring multi-step decomposition.
>
> This codegen has the benefit of needing fewer instructions and having much
> higher throughput than uxtl.  We previously replaced pairs of uxtl/uxtl2s with
> ZIPs to increase throughput, but latency was still an issue due to the
> dependency chain created.
>
> To fix this we can now use TBLs.  The indexes are hoisted out of the loop and
> can be shared amongst zero extends of the same type.  The additional benefit
> of this is that if the values are being permuted after the extensions, the
> permutes will be simplified and merged, leading to better overall code.
>
> e.g. an LD4 can be replaced with LDR since the permutes being performed for
> the extensions can be merged with the load permutes.  The way LOAD_LANES is
> currently implemented means this can't be done in GCC yet, but I'm aiming for
> this in GCC 15.
>
> I've additionally only added support for non-VLA.  The problem with VLA is that
> the index registers are hard or impossible to construct.
>
> On Adv. SIMD we use -1 to indicate an out-of-range index so we can transform
> a two-register TBL into a single-register one.  However for e.g. a byte vector
> under VLA, 255 would be a valid index at VL > 2048 bits, which means that's
> already not a transformation we can make.
>
> Secondly the actual mask looks something like {x,x,x,n,n+1, x,x,x, n+2, n+3}
> and while I think I can represent this in vec_perm_builder, I couldn't think of
> any really efficient VLA way to build such masks.  It would require a lot of
> setup code.
>
> Lastly I don't think this transformation would make much sense for SVE, as SVE
> has loads and converts that can do multi-step conversions.  For instance the
> loop below is already pretty good for SVE (though it misses that the load can
> do more than one step, presumably because a single extend is only merged in RTL).
>
> While I tried hard, for these reasons I don't support VLA, which I hope is OK.

Yeah, I agree we can skip VLA.  For .B->.D, we could in principle use
a TBL .B with a mask created by:

        INDEX   Zmask.D, #0, #1

and using INCD to generate subsequent vectors.
We could then mask the result with:

        AND     Zresult.D, Zresult.D, #0xff

which still saves one level of unpacks.  But like you say, it doesn't
scale to VL>2048 bits, and setting up each index, although relatively
simple individually, is going to mount up.
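
For illustration only (this is not what the patch emits), an ACLE sketch of that
shape for the first subvector might look like the following, with INCD on the
index vector producing the masks for the subsequent ones:

#include <arm_sve.h>

/* Sketch only: .B->.D zero extend of the first group of bytes of V.  The .D
   INDEX, viewed as bytes, makes TBL pull source byte I into the low byte of
   double-word I (the other byte positions select byte 0), and the AND with
   0xff discards that junk.  */
svuint64_t
zext_u8_to_u64_first (svuint8_t v)
{
  svuint64_t idx = svindex_u64 (0, 1);            /* 0, 1, 2, ...  */
  svuint8_t sel = svreinterpret_u8_u64 (idx);     /* byte view     */
  svuint8_t perm = svtbl_u8 (v, sel);             /* TBL .B        */
  return svand_n_u64_x (svptrue_b64 (),
                        svreinterpret_u64_u8 (perm), 0xff);
}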
    
> Concretely on AArch64 this changes:
>
> void test4(unsigned char *x, long long *y, int n) {
>     for(int i = 0; i < n; i++) {
>         y[i] = x[i];
>     }
> }
>
> from generating:
>
> .L4:
>         ldr     q30, [x4], 16
>         add     x3, x3, 128
>         zip1    v1.16b, v30.16b, v31.16b
>         zip2    v30.16b, v30.16b, v31.16b
>         zip1    v2.8h, v1.8h, v31.8h
>         zip1    v0.8h, v30.8h, v31.8h
>         zip2    v1.8h, v1.8h, v31.8h
>         zip2    v30.8h, v30.8h, v31.8h
>         zip1    v26.4s, v2.4s, v31.4s
>         zip1    v29.4s, v0.4s, v31.4s
>         zip1    v28.4s, v1.4s, v31.4s
>         zip1    v27.4s, v30.4s, v31.4s
>         zip2    v2.4s, v2.4s, v31.4s
>         zip2    v0.4s, v0.4s, v31.4s
>         zip2    v1.4s, v1.4s, v31.4s
>         zip2    v30.4s, v30.4s, v31.4s
>         stp     q26, q2, [x3, -128]
>         stp     q28, q1, [x3, -96]
>         stp     q29, q0, [x3, -64]
>         stp     q27, q30, [x3, -32]
>         cmp     x4, x5
>         bne     .L4
>
> and instead we get:
>
> .L4:
>         add     x3, x3, 128
>         ldr     q23, [x4], 16
>         tbl     v5.16b, {v23.16b}, v31.16b
>         tbl     v4.16b, {v23.16b}, v30.16b
>         tbl     v3.16b, {v23.16b}, v29.16b
>         tbl     v2.16b, {v23.16b}, v28.16b
>         tbl     v1.16b, {v23.16b}, v27.16b
>         tbl     v0.16b, {v23.16b}, v26.16b
>         tbl     v22.16b, {v23.16b}, v25.16b
>         tbl     v23.16b, {v23.16b}, v24.16b
>         stp     q5, q4, [x3, -128]
>         stp     q3, q2, [x3, -96]
>         stp     q1, q0, [x3, -64]
>         stp     q22, q23, [x3, -32]
>         cmp     x4, x5
>         bne     .L4
>
> This results in up to a 40% performance uplift on certain workloads.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64.cc (aarch64_use_permute_for_promotion): New.
> 	(TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION): Use it.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/vect-tbl-zero-extend_1.c: New test.

LGTM, but...

> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 102680a0efca1ce928e6945033c01cfb68a65152..b90577f4fc8157b3e02936256c8af8b2b7fac144 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -28404,6 +28404,29 @@ aarch64_empty_mask_is_expensive (unsigned)
>    return false;
>  }
>  
> +/* Implement TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION.  Assume that
> +   predicated operations when available are beneficial when doing more than
> +   one step conversion.  */
> +
> +static bool
> +aarch64_use_permute_for_promotion (const_tree in_type, const_tree out_type)
> +{
> +  /* AArch64's vect_perm_constant doesn't currently support two 64 bit shuffles
> +     into a 128 bit vector type.  So for now reject it.  */
> +  if (maybe_ne (GET_MODE_BITSIZE (TYPE_MODE (in_type)),
> +		GET_MODE_BITSIZE (TYPE_MODE (out_type))))
> +    return false;

...I think we should explicitly reject VLA vectors here, since it isn't
a documented part of the interface that the vectors must be constant sized
(and IMO that's a good thing).

Also, it would be good to have SVE tests with -msve-vector-bits=512 or
something.

Thanks,
Richard

> +
> +  auto bitsize_in = element_precision (in_type);
> +  auto bitsize_out = element_precision (out_type);
> +
> +  /* We don't want to use the permutes for a single widening step because we're
> +     picking there between two zip and tbl sequences with the same throughput
> +     and latencies.  However the zip doesn't require a mask and uses less
> +     registers so we prefer that.  */
> +  return (bitsize_out / bitsize_in) > 2;
> +}
> +
>  /* Return 1 if pseudo register should be created and used to hold
>     GOT address for PIC code.  */
>  
> @@ -31113,6 +31136,9 @@ aarch64_libgcc_floating_mode_supported_p
>  #undef TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE
>  #define TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE \
>    aarch64_conditional_operation_is_expensive
> +#undef TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> +#define TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION \
> +  aarch64_use_permute_for_promotion
>  #undef TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
>  #define TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE \
>    aarch64_empty_mask_is_expensive
> diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..3c088ced63543c203d1cc020de5d67807b48b3fb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
> @@ -0,0 +1,52 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
> +
> +void test1(unsigned short *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        unsigned short a = x[i*4+0];
> +        unsigned short b = x[i*4+1];
> +        unsigned short c = x[i*4+2];
> +        unsigned short d = x[i*4+3];
> +        y[i] = (double)a + (double)b + (double)c + (double)d;
> +    }
> +}
> +
> +void test2(unsigned char *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        unsigned short a = x[i*4+0];
> +        unsigned short b = x[i*4+1];
> +        unsigned short c = x[i*4+2];
> +        unsigned short d = x[i*4+3];
> +        y[i] = (double)a + (double)b + (double)c + (double)d;
> +    }
> +}
> +
> +void test3(unsigned short *x, double *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        unsigned int a = x[i];
> +        y[i] = (double)a;
> +    }
> +}
> +
> +void test4(unsigned short *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] = x[i];
> +    }
> +}
> +
> +void test5(unsigned int *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] = x[i];
> +    }
> +}
> +
> +void test6(unsigned char *x, long long *y, int n) {
> +    for(int i = 0; i < (n & -8); i++) {
> +        y[i] = x[i];
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-times {\tzip1} 1 } } */
> +/* { dg-final { scan-assembler-times {\tzip2} 1 } } */
> +/* { dg-final { scan-assembler-times {\ttbl} 64 } } */
> +/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */

Patch

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 102680a0efca1ce928e6945033c01cfb68a65152..b90577f4fc8157b3e02936256c8af8b2b7fac144 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -28404,6 +28404,29 @@  aarch64_empty_mask_is_expensive (unsigned)
   return false;
 }
 
+/* Implement TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION.  Assume that
+   predicated operations when available are beneficial when doing more than
+   one step conversion.  */
+
+static bool
+aarch64_use_permute_for_promotion (const_tree in_type, const_tree out_type)
+{
+  /* AArch64's vect_perm_constant doesn't currently support two 64 bit shuffles
+     into a 128 bit vector type.  So for now reject it.  */
+  if (maybe_ne (GET_MODE_BITSIZE (TYPE_MODE (in_type)),
+		GET_MODE_BITSIZE (TYPE_MODE (out_type))))
+    return false;
+
+  auto bitsize_in = element_precision (in_type);
+  auto bitsize_out = element_precision (out_type);
+
+  /* We don't want to use the permutes for a single widening step because we're
+     picking there between two zip and tbl sequences with the same throughput
+     and latencies.  However the zip doesn't require a mask and uses less
+     registers so we prefer that.  */
+  return (bitsize_out / bitsize_in) > 2;
+}
+
 /* Return 1 if pseudo register should be created and used to hold
    GOT address for PIC code.  */
 
@@ -31113,6 +31136,9 @@  aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE
 #define TARGET_VECTORIZE_CONDITIONAL_OPERATION_IS_EXPENSIVE \
   aarch64_conditional_operation_is_expensive
+#undef TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
+#define TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION \
+  aarch64_use_permute_for_promotion
 #undef TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
 #define TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE \
   aarch64_empty_mask_is_expensive
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..3c088ced63543c203d1cc020de5d67807b48b3fb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/vect-tbl-zero-extend_1.c
@@ -0,0 +1,52 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -std=c99 -march=armv8-a" } */
+
+void test1(unsigned short *x, double *y, int n) {
+    for(int i = 0; i < (n & -8); i++) {
+        unsigned short a = x[i*4+0];
+        unsigned short b = x[i*4+1];
+        unsigned short c = x[i*4+2];
+        unsigned short d = x[i*4+3];
+        y[i] = (double)a + (double)b + (double)c + (double)d;
+    }
+}
+
+void test2(unsigned char *x, double *y, int n) {
+    for(int i = 0; i < (n & -8); i++) {
+        unsigned short a = x[i*4+0];
+        unsigned short b = x[i*4+1];
+        unsigned short c = x[i*4+2];
+        unsigned short d = x[i*4+3];
+        y[i] = (double)a + (double)b + (double)c + (double)d;
+    }
+}
+
+void test3(unsigned short *x, double *y, int n) {
+    for(int i = 0; i < (n & -8); i++) {
+        unsigned int a = x[i];
+        y[i] = (double)a;
+    }
+}
+
+void test4(unsigned short *x, long long *y, int n) {
+    for(int i = 0; i < (n & -8); i++) {
+        y[i] = x[i];
+    }
+}
+
+void test5(unsigned int *x, long long *y, int n) {
+    for(int i = 0; i < (n & -8); i++) {
+        y[i] = x[i];
+    }
+}
+
+void test6(unsigned char *x, long long *y, int n) {
+    for(int i = 0; i < (n & -8); i++) {
+        y[i] = x[i];
+    }
+}
+
+/* { dg-final { scan-assembler-times {\tzip1} 1 } } */
+/* { dg-final { scan-assembler-times {\tzip2} 1 } } */
+/* { dg-final { scan-assembler-times {\ttbl} 64 } } */
+/* { dg-final { scan-assembler-times {\.LC[0-9]+:} 12 } } */