Message ID | 5AF99160.6080802@foss.arm.com
---|---
State | New
Series | [AArch64] Implement usadv16qi and ssadv16qi standard names
I realised I had forgotten to copy the maintainers...
https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00613.html

Thanks,
Kyrill

On 14/05/18 14:38, Kyrill Tkachov wrote:
> Hi all,
>
> This patch implements the usadv16qi and ssadv16qi standard names.
> See the thread on gcc@gcc.gnu.org [1] for background.
>
> The V16QImode variant is important to get right as it is the most commonly
> used pattern: reducing vectors of bytes into an int.
> The midend expects the optab to compute the absolute differences of
> operands 1 and 2 and reduce them while widening along the way up to SImode.
> So the inputs are V16QImode and the output is V4SImode.
>
> I've tried out a few different strategies for that; the one I settled on is to emit:
> UABDL2  tmp.8h, op1.16b, op2.16b
> UABAL   tmp.8h, op1.16b, op2.16b
> UADALP  op3.4s, tmp.8h
>
> To work through the semantics, let's say operands 1 and 2 are:
> op1 { a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13, a14, a15 }
> op2 { b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15 }
> op3 { c0, c1, c2, c3 }
>
> The UABDL2 takes the upper V8QI elements, computes their absolute
> differences, widens them and stores them into the V8HImode tmp:
>
> tmp { ABS(a[8]-b[8]), ABS(a[9]-b[9]), ABS(a[10]-b[10]), ABS(a[11]-b[11]),
>       ABS(a[12]-b[12]), ABS(a[13]-b[13]), ABS(a[14]-b[14]), ABS(a[15]-b[15]) }
>
> The UABAL after that takes the lower V8QI elements, computes their absolute
> differences, widens them and accumulates them into the V8HImode tmp from the
> previous step:
>
> tmp { ABS(a[8]-b[8])+ABS(a[0]-b[0]), ABS(a[9]-b[9])+ABS(a[1]-b[1]),
>       ABS(a[10]-b[10])+ABS(a[2]-b[2]), ABS(a[11]-b[11])+ABS(a[3]-b[3]),
>       ABS(a[12]-b[12])+ABS(a[4]-b[4]), ABS(a[13]-b[13])+ABS(a[5]-b[5]),
>       ABS(a[14]-b[14])+ABS(a[6]-b[6]), ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> Finally the UADALP does a pairwise widening reduction and accumulation into
> the V4SImode op3:
>
> op3 { c0+ABS(a[8]-b[8])+ABS(a[0]-b[0])+ABS(a[9]-b[9])+ABS(a[1]-b[1]),
>       c1+ABS(a[10]-b[10])+ABS(a[2]-b[2])+ABS(a[11]-b[11])+ABS(a[3]-b[3]),
>       c2+ABS(a[12]-b[12])+ABS(a[4]-b[4])+ABS(a[13]-b[13])+ABS(a[5]-b[5]),
>       c3+ABS(a[14]-b[14])+ABS(a[6]-b[6])+ABS(a[15]-b[15])+ABS(a[7]-b[7]) }
>
> (sorry for the text dump)
>
> Remember, according to [1] the exact reduction sequence doesn't matter
> (for integer arithmetic at least).
> I've considered other sequences as well (thanks Wilco), for example:
> * UABD + UADDLP + UADALP
> * UABDL2 + UABDL + UADALP + UADALP
>
> I ended up settling on the sequence in this patch as it's short
> (3 instructions) and in the future we can potentially look to optimise
> multiple occurrences of these into something even faster (for example,
> accumulating into H registers for longer before doing a single UADALP at
> the end to accumulate into the final S register).
>
> If your microarchitecture has some strong preferences for a particular
> sequence, please let me know or, even better, propose a patch to
> parametrise the generation sequence by code (or the appropriate RTX cost).
>
> This expansion allows the vectoriser to avoid unpacking the bytes in two
> steps and performing V4SI arithmetic on them.
> So, for the code:
>
> unsigned char pix1[N], pix2[N];
>
> int foo (void)
> {
>   int i_sum = 0;
>   int i;
>
>   for (i = 0; i < 16; i++)
>     i_sum += __builtin_abs (pix1[i] - pix2[i]);
>
>   return i_sum;
> }
>
> we now generate on aarch64:
> foo:
>         adrp    x1, pix1
>         add     x1, x1, :lo12:pix1
>         movi    v0.4s, 0
>         adrp    x0, pix2
>         add     x0, x0, :lo12:pix2
>         ldr     q2, [x1]
>         ldr     q3, [x0]
>         uabdl2  v1.8h, v2.16b, v3.16b
>         uabal   v1.8h, v2.8b, v3.8b
>         uadalp  v0.4s, v1.8h
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
> instead of:
> foo:
>         adrp    x1, pix1
>         adrp    x0, pix2
>         add     x1, x1, :lo12:pix1
>         add     x0, x0, :lo12:pix2
>         ldr     q0, [x1]
>         ldr     q4, [x0]
>         ushll   v1.8h, v0.8b, 0
>         ushll2  v0.8h, v0.16b, 0
>         ushll   v2.8h, v4.8b, 0
>         ushll2  v4.8h, v4.16b, 0
>         usubl   v3.4s, v1.4h, v2.4h
>         usubl2  v1.4s, v1.8h, v2.8h
>         usubl   v2.4s, v0.4h, v4.4h
>         usubl2  v0.4s, v0.8h, v4.8h
>         abs     v3.4s, v3.4s
>         abs     v1.4s, v1.4s
>         abs     v2.4s, v2.4s
>         abs     v0.4s, v0.4s
>         add     v1.4s, v3.4s, v1.4s
>         add     v1.4s, v2.4s, v1.4s
>         add     v0.4s, v0.4s, v1.4s
>         addv    s0, v0.4s
>         umov    w0, v0.s[0]
>         ret
>
> So I expect this new expansion to be better than the status quo in any case.
> Bootstrapped and tested on aarch64-none-linux-gnu.
> This gives about 8% on 525.x264_r from SPEC2017 on a Cortex-A72.
>
> Ok for trunk?
>
> Thanks,
> Kyrill
>
> [1] https://gcc.gnu.org/ml/gcc/2018-05/msg00070.html
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * config/aarch64/aarch64.md ("unspec"): Define UNSPEC_SABAL,
>     UNSPEC_SABDL2, UNSPEC_SADALP, UNSPEC_UABAL, UNSPEC_UABDL2,
>     UNSPEC_UADALP values.
>     * config/aarch64/iterators.md (ABAL): New int iterator.
>     (ABDL2): Likewise.
>     (ADALP): Likewise.
>     (sur): Add mappings for the above.
>     * config/aarch64/aarch64-simd.md (aarch64_<sur>abdl2<mode>_3):
>     New define_insn.
>     (aarch64_<sur>abal<mode>_4): Likewise.
>     (aarch64_<sur>adalp<mode>_3): Likewise.
>     (<sur>sadv16qi): New define_expand.
>
> 2018-05-11  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     * gcc.c-torture/execute/ssad-run.c: New test.
>     * gcc.c-torture/execute/usad-run.c: Likewise.
>     * gcc.target/aarch64/ssadv16qi.c: Likewise.
>     * gcc.target/aarch64/usadv16qi.c: Likewise.
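To make the walkthrough above concrete, here is a scalar C model of what the
<sur>sadv16qi expansion computes (the function name and signature are purely
illustrative, not part of the patch; the signed variant is analogous):

#include <stdlib.h>

/* Illustrative scalar model of the UABDL2 + UABAL + UADALP sequence
   emitted for usadv16qi.  op1/op2 are the V16QI inputs, op3 the V4SI
   accumulator and op0 the V4SI result.  */
void
usadv16qi_model (int op0[4], const unsigned char op1[16],
                 const unsigned char op2[16], int op3[4])
{
  int tmp[8];

  /* UABDL2: widened absolute differences of the upper eight bytes.  */
  for (int i = 0; i < 8; i++)
    tmp[i] = abs (op1[i + 8] - op2[i + 8]);

  /* UABAL: accumulate the widened absolute differences of the lower
     eight bytes on top of them.  */
  for (int i = 0; i < 8; i++)
    tmp[i] += abs (op1[i] - op2[i]);

  /* UADALP: pairwise widening accumulation into the V4SI accumulator.  */
  for (int i = 0; i < 4; i++)
    op3[i] += tmp[2 * i] + tmp[2 * i + 1];

  /* The expander finally copies the accumulator into the result.  */
  for (int i = 0; i < 4; i++)
    op0[i] = op3[i];
}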
On Mon, May 14, 2018 at 08:38:40AM -0500, Kyrill Tkachov wrote:
> Hi all,
>
> This patch implements the usadv16qi and ssadv16qi standard names.
> See the thread on gcc@gcc.gnu.org [1] for background.
>
> [...]
>
> So I expect this new expansion to be better than the status quo in any case.
> Bootstrapped and tested on aarch64-none-linux-gnu.
> This gives about 8% on 525.x264_r from SPEC2017 on a Cortex-A72.
>
> Ok for trunk?

You don't say it explicitly here, but I presume the mid-end takes care of
zeroing the accumulator register before the loop (i.e. op3 in your sequence
in aarch64-simd.md)?

If so, looks good to me.

Ok for trunk.

By the way, now you have the patterns, presumably you could also wire them
up in arm_neon.h.

Thanks for the patch!

James

> [...]
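As an illustration of the presumption above, a rough C sketch of the loop
structure the vectoriser would build around the optab, with the accumulator
zeroed before the loop and fully reduced after it (usad_step is a hypothetical
stand-in for one optab invocation; which lane each |a[j]-b[j]| lands in does
not matter, since only the final sum is used):

#include <stdlib.h>

/* Hypothetical helper: one usadv16qi step over 16 bytes.  */
static void
usad_step (int acc[4], const unsigned char *a, const unsigned char *b)
{
  for (int j = 0; j < 16; j++)
    acc[j & 3] += abs (a[j] - b[j]);
}

/* n is assumed to be a multiple of 16.  */
int
sad (const unsigned char *a, const unsigned char *b, int n)
{
  int acc[4] = { 0, 0, 0, 0 };               /* accumulator zeroed up front  */
  for (int i = 0; i < n; i += 16)
    usad_step (acc, a + i, b + i);
  return acc[0] + acc[1] + acc[2] + acc[3];  /* full reduction after the loop  */
}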
On 19/05/18 02:09, James Greenhalgh wrote:
> On Mon, May 14, 2018 at 08:38:40AM -0500, Kyrill Tkachov wrote:
>> [...]
>>
>> Ok for trunk?
>
> You don't say it explicitly here, but I presume the mid-end takes care of
> zeroing the accumulator register before the loop (i.e. op3 in your sequence
> in aarch64-simd.md)?

Yes, the midend takes care of zeroing the accumulator register and doing the
full reduction at the end of the loop.

> If so, looks good to me.
>
> Ok for trunk.

Thanks, committed with r260437.

> By the way, now you have the patterns, presumably you could also wire them
> up in arm_neon.h.

Yeah, it should be simple to wire them up.

Thanks,
Kyrill

> Thanks for the patch!
>
> James
>
>> [...]
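For reference, the same computation written by hand with existing arm_neon.h
intrinsics (a sketch of the unsigned variant only, not the arm_neon.h wiring
discussed above):

#include <arm_neon.h>

/* Sum of absolute differences of two 16-byte blocks using the
   uabdl2/uabal/uadalp sequence, with the accumulator zeroed up front
   and reduced with addv at the end.  */
unsigned int
sad16 (const uint8_t *p1, const uint8_t *p2)
{
  uint8x16_t a = vld1q_u8 (p1);
  uint8x16_t b = vld1q_u8 (p2);
  uint32x4_t acc = vdupq_n_u32 (0);                    /* op3 zeroed  */

  uint16x8_t t = vabdl_high_u8 (a, b);                 /* UABDL2  */
  t = vabal_u8 (t, vget_low_u8 (a), vget_low_u8 (b));  /* UABAL  */
  acc = vpadalq_u16 (acc, t);                          /* UADALP  */

  return vaddvq_u32 (acc);                             /* ADDV  */
}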
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 9a6ed304432af0ca23ec7d3797783a3128776a6e..97f8dbf1c219e2df2653804f2c1f83c123cdf2d6 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -612,6 +612,67 @@ (define_insn "abd<mode>_3"
   [(set_attr "type" "neon_abd<q>")]
 )
 
+(define_insn "aarch64_<sur>abdl2<mode>_3"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+        (unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+                         (match_operand:VDQV_S 2 "register_operand" "w")]
+        ABDL2))]
+  "TARGET_SIMD"
+  "<sur>abdl2\t%0.<Vwtype>, %1.<Vtype>, %2.<Vtype>"
+  [(set_attr "type" "neon_abd<q>")]
+)
+
+(define_insn "aarch64_<sur>abal<mode>_4"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+        (unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+                         (match_operand:VDQV_S 2 "register_operand" "w")
+                         (match_operand:<VDBLW> 3 "register_operand" "0")]
+        ABAL))]
+  "TARGET_SIMD"
+  "<sur>abal\t%0.<Vwtype>, %1.<Vhalftype>, %2.<Vhalftype>"
+  [(set_attr "type" "neon_arith_acc<q>")]
+)
+
+(define_insn "aarch64_<sur>adalp<mode>_3"
+  [(set (match_operand:<VDBLW> 0 "register_operand" "=w")
+        (unspec:<VDBLW> [(match_operand:VDQV_S 1 "register_operand" "w")
+                         (match_operand:<VDBLW> 2 "register_operand" "0")]
+        ADALP))]
+  "TARGET_SIMD"
+  "<sur>adalp\t%0.<Vwtype>, %1.<Vtype>"
+  [(set_attr "type" "neon_reduc_add<q>")]
+)
+
+;; Emit a sequence to produce a sum-of-absolute-differences of the V16QI
+;; inputs in operands 1 and 2.  The sequence also has to perform a widening
+;; reduction of the difference into a V4SI vector and accumulate that into
+;; operand 3 before copying that into the result operand 0.
+;; Perform that with a sequence of:
+;; UABDL2 tmp.8h, op1.16b, op2.16b
+;; UABAL  tmp.8h, op1.16b, op2.16b
+;; UADALP op3.4s, tmp.8h
+;; MOV    op0, op3 // should be eliminated in later passes.
+;; The signed version just uses the signed variants of the above instructions.
+
+(define_expand "<sur>sadv16qi"
+  [(use (match_operand:V4SI 0 "register_operand"))
+   (unspec:V16QI [(use (match_operand:V16QI 1 "register_operand"))
+                  (use (match_operand:V16QI 2 "register_operand"))] ABAL)
+   (use (match_operand:V4SI 3 "register_operand"))]
+  "TARGET_SIMD"
+  {
+    rtx reduc = gen_reg_rtx (V8HImode);
+    emit_insn (gen_aarch64_<sur>abdl2v16qi_3 (reduc, operands[1],
+                                              operands[2]));
+    emit_insn (gen_aarch64_<sur>abalv16qi_4 (reduc, operands[1],
+                                             operands[2], reduc));
+    emit_insn (gen_aarch64_<sur>adalpv8hi_3 (operands[3], reduc,
+                                             operands[3]));
+    emit_move_insn (operands[0], operands[3]);
+    DONE;
+  }
+)
+
 (define_insn "aba<mode>_3"
   [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
         (plus:VDQ_BHSI (abs:VDQ_BHSI (minus:VDQ_BHSI
diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
index 953edb7b943b9acb6fe65db93f67ce73e4498dcb..079385c58ea201225ecf54c752b3c9e3756eab49 100644
--- a/gcc/config/aarch64/aarch64.md
+++ b/gcc/config/aarch64/aarch64.md
@@ -141,6 +141,9 @@ (define_c_enum "unspec" [
     UNSPEC_PRLG_STK
     UNSPEC_REV
     UNSPEC_RBIT
+    UNSPEC_SABAL
+    UNSPEC_SABDL2
+    UNSPEC_SADALP
     UNSPEC_SCVTF
     UNSPEC_SISD_NEG
     UNSPEC_SISD_SSHL
@@ -159,6 +162,9 @@ (define_c_enum "unspec" [
     UNSPEC_TLSLE24
     UNSPEC_TLSLE32
     UNSPEC_TLSLE48
+    UNSPEC_UABAL
+    UNSPEC_UABDL2
+    UNSPEC_UADALP
     UNSPEC_UCVTF
     UNSPEC_USHL_2S
     UNSPEC_VSTRUCTDUMMY
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 2642de74bcfb0c729d8309cde14b76cf233ad7ab..e994e58ffb38cee2a00fae4216ae90e33e5563e1 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -1392,6 +1392,16 @@ (define_code_attr sve_imm_con [(eq "vsc")
 ;; -------------------------------------------------------------------
 ;; Int Iterators.
 ;; -------------------------------------------------------------------
+
+;; The unspec codes for the SABAL, UABAL AdvancedSIMD instructions.
+(define_int_iterator ABAL [UNSPEC_SABAL UNSPEC_UABAL])
+
+;; The unspec codes for the SABDL2, UABDL2 AdvancedSIMD instructions.
+(define_int_iterator ABDL2 [UNSPEC_SABDL2 UNSPEC_UABDL2])
+
+;; The unspec codes for the SADALP, UADALP AdvancedSIMD instructions.
+(define_int_iterator ADALP [UNSPEC_SADALP UNSPEC_UADALP])
+
 (define_int_iterator MAXMINV [UNSPEC_UMAXV UNSPEC_UMINV
                               UNSPEC_SMAXV UNSPEC_SMINV])
 
@@ -1599,6 +1609,9 @@ (define_int_attr sur [(UNSPEC_SHADD "s") (UNSPEC_UHADD "u")
                       (UNSPEC_SHSUB "s") (UNSPEC_UHSUB "u")
                       (UNSPEC_SRHSUB "sr") (UNSPEC_URHSUB "ur")
                       (UNSPEC_ADDHN "") (UNSPEC_RADDHN "r")
+                      (UNSPEC_SABAL "s") (UNSPEC_UABAL "u")
+                      (UNSPEC_SABDL2 "s") (UNSPEC_UABDL2 "u")
+                      (UNSPEC_SADALP "s") (UNSPEC_UADALP "u")
                       (UNSPEC_SUBHN "") (UNSPEC_RSUBHN "r")
                       (UNSPEC_ADDHN2 "") (UNSPEC_RADDHN2 "r")
                       (UNSPEC_SUBHN2 "") (UNSPEC_RSUBHN2 "r")
diff --git a/gcc/testsuite/gcc.c-torture/execute/ssad-run.c b/gcc/testsuite/gcc.c-torture/execute/ssad-run.c
new file mode 100644
index 0000000000000000000000000000000000000000..f15f85f5753769a492cc066ac1ff8a82f39fcc30
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/ssad-run.c
@@ -0,0 +1,49 @@
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (signed char *w, int i, signed char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+        tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (signed char *w, signed char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+int
+main (void)
+{
+  signed char m[256];
+  signed char n[256];
+  int sum, i;
+
+  for (i = 0; i < 256; ++i)
+    if (i % 2 == 0)
+      {
+        m[i] = (i % 8) * 2 + 1;
+        n[i] = -(i % 8);
+      }
+    else
+      {
+        m[i] = -((i % 8) * 2 + 2);
+        n[i] = -((i % 8) >> 1);
+      }
+
+  bar (m, n, 16, &sum);
+
+  if (sum != 2368)
+    abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.c-torture/execute/usad-run.c b/gcc/testsuite/gcc.c-torture/execute/usad-run.c
new file mode 100644
index 0000000000000000000000000000000000000000..904a634a497688eda6331845e2bf2805aa8a7991
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/usad-run.c
@@ -0,0 +1,49 @@
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+        tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+int
+main (void)
+{
+  unsigned char m[256];
+  unsigned char n[256];
+  int sum, i;
+
+  for (i = 0; i < 256; ++i)
+    if (i % 2 == 0)
+      {
+        m[i] = (i % 8) * 2 + 1;
+        n[i] = -(i % 8);
+      }
+    else
+      {
+        m[i] = -((i % 8) * 2 + 2);
+        n[i] = -((i % 8) >> 1);
+      }
+
+  bar (m, n, 16, &sum);
+
+  if (sum != 32384)
+    abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
new file mode 100644
index 0000000000000000000000000000000000000000..bab75992986865389dff8f9ca43c58e947ef94a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+#define N 1024
+
+signed char pix1[N], pix2[N];
+
+int foo (void)
+{
+  int i_sum = 0;
+  int i;
+
+  for (i = 0; i < N; i++)
+    i_sum += __builtin_abs (pix1[i] - pix2[i]);
+
+  return i_sum;
+}
+
+/* { dg-final { scan-assembler-not {\tsshll\t} } } */
+/* { dg-final { scan-assembler-not {\tsshll2\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl2\t} } } */
+/* { dg-final { scan-assembler-not {\tabs\t} } } */
+
+/* { dg-final { scan-assembler {\tsabdl2\t} } } */
+/* { dg-final { scan-assembler {\tsabal\t} } } */
+/* { dg-final { scan-assembler {\tsadalp\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/usadv16qi.c b/gcc/testsuite/gcc.target/aarch64/usadv16qi.c
new file mode 100644
index 0000000000000000000000000000000000000000..b7c08ee1e1182dadba0048bb96b006f2db61ffe0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/usadv16qi.c
@@ -0,0 +1,27 @@
+/* { dg-do compile } */
+/* { dg-options "-O3" } */
+
+#define N 1024
+
+unsigned char pix1[N], pix2[N];
+
+int foo (void)
+{
+  int i_sum = 0;
+  int i;
+
+  for (i = 0; i < N; i++)
+    i_sum += __builtin_abs (pix1[i] - pix2[i]);
+
+  return i_sum;
+}
+
+/* { dg-final { scan-assembler-not {\tushll\t} } } */
+/* { dg-final { scan-assembler-not {\tushll2\t} } } */
+/* { dg-final { scan-assembler-not {\tusubl\t} } } */
+/* { dg-final { scan-assembler-not {\tusubl2\t} } } */
+/* { dg-final { scan-assembler-not {\tabs\t} } } */
+
+/* { dg-final { scan-assembler {\tuabdl2\t} } } */
+/* { dg-final { scan-assembler {\tuabal\t} } } */
+/* { dg-final { scan-assembler {\tuadalp\t} } } */