Message ID | 5AF99160.6080802@foss.arm.com
---|---
State | New
Series | [AArch64] Implement usadv16qi and ssadv16qi standard names
I realised I had forgotten to copy the maintainers...
https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00613.html

Thanks,
Kyrill

On 14/05/18 14:38, Kyrill Tkachov wrote:
> Hi all,
>
> This patch implements the usadv16qi and ssadv16qi standard names.
> See the thread on gcc@gcc.gnu.org [1] for background.
>
> The V16QImode variant is important to get right as it is the most commonly
> used pattern: reducing vectors of bytes into an int.
> The midend expects the optab to compute the absolute differences of
> operands 1 and 2 and reduce them while widening along the way up to SImode.
> So the inputs are V16QImode and the output is V4SImode.
>
> I've tried out a few different strategies for that; the one I settled on is to emit:
> UABDL2  tmp.8h, op1.16b, op2.16b
> UABAL   tmp.8h, op1.16b, op2.16b
> UADALP  op3.4s, tmp.8h
>
> To work through the semantics, let's say operands 1 and 2 are:
> op1 { a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
> op2 { b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
> op3 { c0, c1, c2, c3 }
>
> The UABDL2 takes the upper V8QI elements, computes their absolute
> differences, widens them and stores them into the V8HImode tmp:
>
> tmp { ABS(a[8]-b[8]), ABS(a[9]-b[9]), ABS(a[10]-b[10]), ABS(a[11]-b[11]),
>       ABS(a[12]-b[12]), ABS(a[13]-b[13]), ABS(a[14]-b[14]), ABS(a[15]-b[15]) }
>
> The UABAL after that takes the lower V8QI elements, computes their absolute
> differences, widens them and accumulates them into the V8HImode tmp from the
> previous step:
>
> tmp { ABS(a[8]-b[8])+ABS(a[0]-b[0]), ABS(a[9]-b[9])+ABS(a[1]-b[1]),
>       ABS(a[10]-b[10])+ABS(a[2]-b[2]), ABS(a[11]-b[11])+ABS(a[3]-b[3]),
>       ABS(a[12]-b[12])+ABS(a[4]-b[4]), ABS(a[13]-b[13])+ABS(a[5]-b[5]),
>       ABS(a[14]-b[14])+ABS(a[6]-b[6]), ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> Finally the UADALP does a pairwise widening reduction and accumulation into
> the V4SImode op3:
>
> op3 { c0+ABS(a[8]-b[8])+ABS(a[0]-b[0])+ABS(a[9]-b[9])+ABS(a[1]-b[1]),
>       c1+ABS(a[10]-b[10])+ABS(a[2]-b[2])+ABS(a[11]-b[11])+ABS(a[3]-b[3]),
>       c2+ABS(a[12]-b[12])+ABS(a[4]-b[4])+ABS(a[13]-b[13])+ABS(a[5]-b[5]),
>       c3+ABS(a[14]-b[14])+ABS(a[6]-b[6])+ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> (sorry for the text dump)
>
> Remember, according to [1] the exact reduction sequence doesn't matter
> (for integer arithmetic at least).
> I've considered other sequences as well (thanks Wilco), for example:
> * UABD + UADDLP + UADALP
> * UABDL2 + UABDL + UADALP + UADALP
>
> I ended up settling on the sequence in this patch as it's short
> (3 instructions) and in the future we can potentially look to optimise
> multiple occurrences of these into something even faster (for example,
> accumulating into H registers for longer before doing a single UADALP at
> the end to accumulate into the final S register).
>
> If your microarchitecture has some strong preferences for a particular
> sequence, please let me know or, even better, propose a patch to
> parametrise the generation sequence by code (or the appropriate RTX cost).
>
> This expansion allows the vectoriser to avoid unpacking the bytes in two
> steps and performing V4SI arithmetic on them.
> So, for the code:
>
> unsigned char pix1[N], pix2[N];
>
> int foo (void)
> {
>   int i_sum = 0;
>   int i;
>
>   for (i = 0; i < 16; i++)
>     i_sum += __builtin_abs (pix1[i] - pix2[i]);
>
>   return i_sum;
> }
>
> we now generate on aarch64:
> foo:
>         adrp    x1, pix1
>         add     x1, x1, :lo12:pix1
>         movi    v0.4s, 0
>         adrp    x0, pix2
>         add     x0, x0, :lo12:pix2
>         ldr     q2, [x1]
>         ldr     q3, [x0]
>         uabdl2  v1.8h, v2.16b, v3.16b
>         uabal   v1.8h, v2.8b, v3.8b
>         uadalp  v0.4s, v1.8h
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
> instead of:
> foo:
>         adrp    x1, pix1
>         adrp    x0, pix2
>         add     x1, x1, :lo12:pix1
>         add     x0, x0, :lo12:pix2
>         ldr     q0, [x1]
>         ldr     q4, [x0]
>         ushll   v1.8h, v0.8b, 0
>         ushll2  v0.8h, v0.16b, 0
>         ushll   v2.8h, v4.8b, 0
>         ushll2  v4.8h, v4.16b, 0
>         usubl   v3.4s, v1.4h, v2.4h
>         usubl2  v1.4s, v1.8h, v2.8h
>         usubl   v2.4s, v0.4h, v4.4h
>         usubl2  v0.4s, v0.8h, v4.8h
>         abs     v3.4s, v3.4s
>         abs     v1.4s, v1.4s
>         abs     v2.4s, v2.4s
>         abs     v0.4s, v0.4s
>         add     v1.4s, v3.4s, v1.4s
>         add     v1.4s, v2.4s, v1.4s
>         add     v0.4s, v0.4s, v1.4s
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
> So I expect this new expansion to be better than the status quo in any case.
> Bootstrapped and tested on aarch64-none-linux-gnu.
> This gives about 8% on 525.x264_r from SPEC2017 on a Cortex-A72.
>
> Ok for trunk?
>
> Thanks,
> Kyrill
>
> [1] https://gcc.gnu.org/ml/gcc/2018-05/msg00070.html
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * config/aarch64/aarch64.md ("unspec"): Define UNSPEC_SABAL,
>     UNSPEC_SABDL2, UNSPEC_SADALP, UNSPEC_UABAL, UNSPEC_UABDL2,
>     UNSPEC_UADALP values.
>     * config/aarch64/iterators.md (ABAL): New int iterator.
>     (ABDL2): Likewise.
>     (ADALP): Likewise.
>     (sur): Add mappings for the above.
>     * config/aarch64/aarch64-simd.md (aarch64_<sur>abdl2<mode>_3):
>     New define_insn.
>     (aarch64_<sur>abal<mode>_4): Likewise.
>     (aarch64_<sur>adalp<mode>_3): Likewise.
>     (<sur>sadv16qi): New define_expand.
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * gcc.c-torture/execute/ssad-run.c: New test.
>     * gcc.c-torture/execute/usad-run.c: Likewise.
>     * gcc.target/aarch64/ssadv16qi.c: Likewise.
>     * gcc.target/aarch64/usadv16qi.c: Likewise.
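To make the walkthrough above concrete, here is a scalar C model of what the
<sur>sadv16qi expansion computes (the function name and signature are purely
illustrative, not part of the patch; the signed variant is analogous):

#include <stdlib.h>

/* Illustrative scalar model of the UABDL2 + UABAL + UADALP sequence
   emitted for usadv16qi.  op1/op2 are the V16QI inputs, op3 the V4SI
   accumulator and op0 the V4SI result.  */
void
usadv16qi_model (int op0[4], const unsigned char op1[16],
                 const unsigned char op2[16], int op3[4])
{
  int tmp[8];

  /* UABDL2: widened absolute differences of the upper eight bytes.  */
  for (int i = 0; i < 8; i++)
    tmp[i] = abs (op1[i + 8] - op2[i + 8]);

  /* UABAL: accumulate the widened absolute differences of the lower
     eight bytes on top of them.  */
  for (int i = 0; i < 8; i++)
    tmp[i] += abs (op1[i] - op2[i]);

  /* UADALP: pairwise widening accumulation into the V4SI accumulator.  */
  for (int i = 0; i < 4; i++)
    op3[i] += tmp[2 * i] + tmp[2 * i + 1];

  /* The expander finally copies the accumulator into the result.  */
  for (int i = 0; i < 4; i++)
    op0[i] = op3[i];
}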
On Mon, May 14, 2018 at 08:38:40AM -0500, Kyrill Tkachov wrote:
> Hi all,
>
> This patch implements the usadv16qi and ssadv16qi standard names.
> See the thread on gcc@gcc.gnu.org [1] for background.
>
> [...]
>
> So I expect this new expansion to be better than the status quo in any case.
> Bootstrapped and tested on aarch64-none-linux-gnu.
> This gives about 8% on 525.x264_r from SPEC2017 on a Cortex-A72.
>
> Ok for trunk?

You don't say it explicitly here, but I presume the mid-end takes care of
zeroing the accumulator register before the loop (i.e. op3 in your sequence
in aarch64-simd.md)?

If so, looks good to me.

Ok for trunk.

By the way, now you have the patterns, presumably you could also wire them
up in arm_neon.h.

Thanks for the patch!

James

> [...]
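As an illustration of the presumption above, a rough C sketch of the loop
structure the vectoriser would build around the optab, with the accumulator
zeroed before the loop and fully reduced after it (usad_step is a hypothetical
stand-in for one optab invocation; which lane each |a[j]-b[j]| lands in does
not matter, since only the final sum is used):

#include <stdlib.h>

/* Hypothetical helper: one usadv16qi step over 16 bytes.  */
static void
usad_step (int acc[4], const unsigned char *a, const unsigned char *b)
{
  for (int j = 0; j < 16; j++)
    acc[j & 3] += abs (a[j] - b[j]);
}

/* n is assumed to be a multiple of 16.  */
int
sad (const unsigned char *a, const unsigned char *b, int n)
{
  int acc[4] = { 0, 0, 0, 0 };               /* accumulator zeroed up front  */
  for (int i = 0; i < n; i += 16)
    usad_step (acc, a + i, b + i);
  return acc[0] + acc[1] + acc[2] + acc[3];  /* full reduction after the loop  */
}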
On 19/05/18 02:09, James Greenhalgh wrote:
> On Mon, May 14, 2018 at 08:38:40AM -0500, Kyrill Tkachov wrote:
>> [...]
>>
>> Ok for trunk?
>
> You don't say it explicitly here, but I presume the mid-end takes care of
> zeroing the accumulator register before the loop (i.e. op3 in your sequence
> in aarch64-simd.md)?

Yes, the midend takes care of zeroing the accumulator register and doing the
full reduction at the end of the loop.

> If so, looks good to me.
>
> Ok for trunk.

Thanks, committed with r260437.

> By the way, now you have the patterns, presumably you could also wire them
> up in arm_neon.h.

Yeah, it should be simple to wire them up.

Thanks,
Kyrill

> Thanks for the patch!
>
> James
>
>> [...]
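For reference, the same computation written by hand with existing arm_neon.h
intrinsics (a sketch of the unsigned variant only, not the arm_neon.h wiring
discussed above):

#include <arm_neon.h>

/* Sum of absolute differences of two 16-byte blocks using the
   uabdl2/uabal/uadalp sequence, with the accumulator zeroed up front
   and reduced with addv at the end.  */
unsigned int
sad16 (const uint8_t *p1, const uint8_t *p2)
{
  uint8x16_t a = vld1q_u8 (p1);
  uint8x16_t b = vld1q_u8 (p2);
  uint32x4_t acc = vdupq_n_u32 (0);                    /* op3 zeroed  */

  uint16x8_t t = vabdl_high_u8 (a, b);                 /* UABDL2  */
  t = vabal_u8 (t, vget_low_u8 (a), vget_low_u8 (b));  /* UABAL  */
  acc = vpadalq_u16 (acc, t);                          /* UADALP  */

  return vaddvq_u32 (acc);                             /* ADDV  */
}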
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 9a6ed304432af0ca23ec7d3797783a3128776a6e..97f8dbf1c219e2df2653804f2c1f83c123cdf2d6 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -612,6 +612,67 @@ (define_insn "abd<mode>_3"
   [(set_attr "type" "neon_abd<q>")]
 )
 
+(define_insn "aarch64_<sur>abdl2<mode>_3"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+        (unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+                         (match_operand:VDQV_S 2 "register_operand" "w")]
+        ABDL2))]
+  "TARGET_SIMD"
+  "<sur>abdl2\t%0.<Vwtype>, %1.<Vtype>, %2.<Vtype>"
+  [(set_attr "type" "neon_abd<q>")]
+)
+
+(define_insn "aarch64_<sur>abal<mode>_4"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+        (unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+                         (match_operand:VDQV_S 2 "register_operand" "w")
+                         (match_operand:<VDBLW> 3 "register_operand" "0")]
+        ABAL))]
+  "TARGET_SIMD"
+  "<sur>abal\t%0.<Vwtype>, %1.<Vhalftype>, %2.<Vhalftype>"
+  [(set_attr "type" "neon_arith_acc<q>")]
+)
+
+(define_insn "aarch64_<sur>adalp<mode>_3"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+        (unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+                         (match_operand:<VDBLW> 2 "register_operand" "0")]
+        ADALP))]
+  "TARGET_SIMD"
+  "<sur>adalp\t%0.<Vwtype>, %1.<Vtype>"
+  [(set_attr "type" "neon_reduc_add<q>")]
+)
+
+;; Emit a sequence to produce a sum-of-absolute-differences of the V16QI
+;; inputs in operands 1 and 2.  The sequence also has to perform a widening
+;; reduction of the difference into a V4SI vector and accumulate that into
+;; operand 3 before copying that into the result operand 0.
+;; Perform that with a sequence of:
+;; UABDL2 tmp.8h, op1.16b, op2.16b
+;; UABAL  tmp.8h, op1.16b, op2.16b
+;; UADALP op3.4s, tmp.8h
+;; MOV    op0, op3 // should be eliminated in later passes.
+;; The signed version just uses the signed variants of the above instructions.
+
+(define_expand "<sur>sadv16qi"
+  [(use (match_operand:V4SI 0 "register_operand"))
+   (unspec:V16QI [(use (match_operand:V16QI 1 "register_operand"))
+                  (use (match_operand:V16QI 2 "register_operand"))] ABAL)
+   (use (match_operand:V4SI 3 "register_operand"))]
+  "TARGET_SIMD"
+  {
+    rtx reduc = gen_reg_rtx (V8HImode);
+    emit_insn (gen_aarch64_<sur>abdl2v16qi_3 (reduc, operands[1],
+                                              operands[2]));
+    emit_insn (gen_aarch64_<sur>abalv16qi_4 (reduc, operands[1],
+                                             operands[2], reduc));
+    emit_insn (gen_aarch64_<sur>adalpv8hi_3 (operands[3], reduc,
+                                             operands[3]));
+    emit_move_insn (operands[0], operands[3]);
+    DONE;
+  }
+)
+
 (define_insn "aba<mode>_3"
   [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
         (plus:VDQ_BHSI (abs:VDQ_BHSI (minus:VDQ_BHSI
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 953edb7b943b9acb6fe65db93f67ce73e4498dcb..079385c58ea201225ecf54c752b3c9e3756eab49 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -141,6 +141,9 @@ (define_c_enum "unspec" [
     UNSPEC_PRLG_STK
     UNSPEC_REV
     UNSPEC_RBIT
+    UNSPEC_SABAL
+    UNSPEC_SABDL2
+    UNSPEC_SADALP
     UNSPEC_SCVTF
     UNSPEC_SISD_NEG
     UNSPEC_SISD_SSHL
@@ -159,6 +162,9 @@ (define_c_enum "unspec" [
     UNSPEC_TLSLE24
     UNSPEC_TLSLE32
     UNSPEC_TLSLE48
+    UNSPEC_UABAL
+    UNSPEC_UABDL2
+    UNSPEC_UADALP
     UNSPEC_UCVTF
     UNSPEC_USHL_2S
     UNSPEC_VSTRUCTDUMMY
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 2642de74bcfb0c729d8309cde14b76cf233ad7ab..e994e58ffb38cee2a00fae4216ae90e33e5563e1 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -1392,6 +1392,16 @@ (define_code_attr sve_imm_con [(eq "vsc")
 ;; -------------------------------------------------------------------
 ;; Int Iterators.
 ;; -------------------------------------------------------------------
+
+;; The unspec codes for the SABAL, UABAL AdvancedSIMD instructions.
+(define_int_iterator ABAL [UNSPEC_SABAL UNSPEC_UABAL])
+
+;; The unspec codes for the SABDL2, UABDL2 AdvancedSIMD instructions.
+(define_int_iterator ABDL2 [UNSPEC_SABDL2 UNSPEC_UABDL2])
+
+;; The unspec codes for the SADALP, UADALP AdvancedSIMD instructions.
+(define_int_iterator ADALP [UNSPEC_SADALP UNSPEC_UADALP])
+
 (define_int_iterator MAXMINV [UNSPEC_UMAXV UNSPEC_UMINV
                               UNSPEC_SMAXV UNSPEC_SMINV])
 
@@ -1599,6 +1609,9 @@ (define_int_attr sur [(UNSPEC_SHADD "s") (UNSPEC_UHADD "u")
                       (UNSPEC_SHSUB "s") (UNSPEC_UHSUB "u")
                       (UNSPEC_SRHSUB "sr") (UNSPEC_URHSUB "ur")
                       (UNSPEC_ADDHN "") (UNSPEC_RADDHN "r")
+                      (UNSPEC_SABAL "s") (UNSPEC_UABAL "u")
+                      (UNSPEC_SABDL2 "s") (UNSPEC_UABDL2 "u")
+                      (UNSPEC_SADALP "s") (UNSPEC_UADALP "u")
                       (UNSPEC_SUBHN "") (UNSPEC_RSUBHN "r")
                       (UNSPEC_ADDHN2 "") (UNSPEC_RADDHN2 "r")
                       (UNSPEC_SUBHN2 "") (UNSPEC_RSUBHN2 "r")
diff --git a/gcc/testsuite/gcc.c-torture/execute/ssad-run.c b/gcc/testsuite/gcc.c-torture/execute/ssad-run.c
new file mode 100644
index 0000000000000000000000000000000000000000..f15f85f5753769a492cc066ac1ff8a82f39fcc30
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/ssad-run.c
@@ -0,0 +1,49 @@
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (signed char *w, int i, signed char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+        tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (signed char *w, signed char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+int
+main (void)
+{
+  signed char m[256];
+  signed char n[256];
+  int sum, i;
+
+  for (i = 0; i < 256; ++i)
+    if (i % 2 == 0)
+      {
+        m[i] = (i % 8) * 2 + 1;
+        n[i] = -(i % 8);
+      }
+    else
+      {
+        m[i] = -((i % 8) * 2 + 2);
+        n[i] = -((i % 8) >> 1);
+      }
+
+  bar (m, n, 16, &sum);
+
+  if (sum != 2368)
+    abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.c-torture/execute/usad-run.c b/gcc/testsuite/gcc.c-torture/execute/usad-run.c
new file mode 100644
index 0000000000000000000000000000000000000000..904a634a497688eda6331845e2bf2805aa8a7991
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/usad-run.c
@@ -0,0 +1,49 @@
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+        tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+int
+main (void)
+{
+  unsigned char m[256];
+  unsigned char n[256];
+  int sum, i;
+
+  for (i = 0; i < 256; ++i)
+    if (i % 2 == 0)
+      {
+        m[i] = (i % 8) * 2 + 1;
+        n[i] = -(i % 8);
+      }
+    else
+      {
+        m[i] = -((i % 8) * 2 + 2);
+        n[i] = -((i % 8) >> 1);
+      }
+
+  bar (m, n, 16, &sum);
+
+  if (sum != 32384)
+    abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
new file mode 100644
index 0000000000000000000000000000000000000000..bab75992986865389dff8f9ca43c58e947ef94a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+#define N 1024
+
+signed char pix1[N], pix2[N];
+
+int foo (void)
+{
+  int i_sum = 0;
+  int i;
+
+  for (i = 0; i < N; i++)
+    i_sum += __builtin_abs (pix1[i] - pix2[i]);
+
+  return i_sum;
+}
+
+/* { dg-final { scan-assembler-not {\tsshll\t} } } */
+/* { dg-final { scan-assembler-not {\tsshll2\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl2\t} } } */
+/* { dg-final { scan-assembler-not {\tabs\t} } } */
+
+/* { dg-final { scan-assembler {\tsabdl2\t} } } */
+/* { dg-final { scan-assembler {\tsabal\t} } } */
+/* { dg-final { scan-assembler {\tsadalp\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/usadv16qi.c b/gcc/testsuite/gcc.target/aarch64/usadv16qi.c
new file mode 100644
index 0000000000000000000000000000000000000000..b7c08ee1e1182dadba0048bb96b006f2db61ffe0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/usadv16qi.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+#define N 1024
+
+unsigned char pix1[N], pix2[N];
+
+int foo (void)
+{
+  int i_sum = 0;
+  int i;
+
+  for (i = 0; i < N; i++)
+    i_sum += __builtin_abs (pix1[i] - pix2[i]);
+
+  return i_sum;
+}
+
+/* { dg-final { scan-assembler-not {\tushll\t} } } */
+/* { dg-final { scan-assembler-not {\tushll2\t} } } */
+/* { dg-final { scan-assembler-not {\tusubl\t} } } */
+/* { dg-final { scan-assembler-not {\tusubl2\t} } } */
+/* { dg-final { scan-assembler-not {\tabs\t} } } */
+
+/* { dg-final { scan-assembler {\tuabdl2\t} } } */
+/* { dg-final { scan-assembler {\tuabal\t} } } */
+/* { dg-final { scan-assembler {\tuadalp\t} } } */