From patchwork Fri Dec 21 12:30:49 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
X-Patchwork-Id: 1017493
Message-ID: <5C1CDCF9.7030001@foss.arm.com>
Date: Fri, 21 Dec 2018 12:30:49 +0000
From: Kyrill  Tkachov <kyrylo.tkachov@foss.arm.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
CC: "Richard Earnshaw (lists)" <richard.earnshaw@arm.com>,
	James Greenhalgh <james.greenhalgh@arm.com>,
	Marcus Shawcroft <marcus.shawcroft@arm.com>
Subject: [PATCH][AArch64] Use Q-reg loads/stores in movmem expansion

Hi all,

Our movmem expansion currently emits TImode loads and stores when copying 128-bit chunks.
This generates X-register LDP/STP sequences, as X registers are the preferred register class for that mode.

For the purpose of copying memory, however, we want to prefer Q-registers.
Each 128-bit chunk then needs one Q-register instead of two X-registers, which helps with register pressure.
It also allows 256-bit and larger copies to be merged into Q-reg LDP/STP, further helping code size.

The implementation is straightforward: we just use a 128-bit vector mode (V4SImode in this patch)
rather than TImode.
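
As a rough illustration of the selection logic, here is a standalone toy model in C. It is only a sketch: the toy_* names and the enum are hypothetical stand-ins for GCC's machine modes and are not part of the patch or of GCC. It picks the widest integer mode that fits within both the remaining size and the copy limit, then swaps the 128-bit integer mode for the vector mode when SIMD is assumed to be available.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for GCC machine modes; not GCC's real enum.  */
enum toy_mode { TOY_QI, TOY_HI, TOY_SI, TOY_DI, TOY_TI, TOY_V4SI };

/* Width in bits of each toy mode, indexed by enum toy_mode.  */
static const unsigned toy_mode_bits[] = { 8, 16, 32, 64, 128, 128 };

/* Pick the mode for the next chunk of a copy.  REMAINING_BITS is what is
   left to copy and COPY_LIMIT_BITS is the widest chunk allowed.  */
enum toy_mode
toy_pick_mode (unsigned remaining_bits, unsigned copy_limit_bits,
               bool have_simd)
{
  unsigned limit = remaining_bits < copy_limit_bits
                   ? remaining_bits : copy_limit_bits;

  /* Plays the role of gcc_assert (cur_mode != BLKmode): some mode must fit.  */
  assert (limit >= toy_mode_bits[TOY_QI]);

  /* Widest integer mode that still fits within the limit.  */
  enum toy_mode cur = TOY_QI;
  for (int m = TOY_QI; m <= TOY_TI; m++)
    if (toy_mode_bits[m] <= limit)
      cur = (enum toy_mode) m;

  /* The change in this patch: prefer the vector mode for 128-bit chunks.  */
  if (have_simd && toy_mode_bits[cur] == toy_mode_bits[TOY_TI])
    cur = TOY_V4SI;

  return cur;
}
```

In this model, toy_pick_mode (256, 128, true) selects TOY_V4SI, while the same call with have_simd set to false stays on TOY_TI, which corresponds to the old behaviour.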

With this patch the testcase:
#define N 8
int src[N], dst[N];

void
foo (void)
{
   __builtin_memcpy (dst, src, N * sizeof (int));
}

generates:
foo:
         adrp    x1, src
         add     x1, x1, :lo12:src
         adrp    x0, dst
         add     x0, x0, :lo12:dst
         ldp     q1, q0, [x1]
         stp     q1, q0, [x0]
         ret

instead of:
foo:
         adrp    x1, src
         add     x1, x1, :lo12:src
         adrp    x0, dst
         add     x0, x0, :lo12:dst
         ldp     x2, x3, [x1]
         stp     x2, x3, [x0]
         ldp     x2, x3, [x1, 16]
         stp     x2, x3, [x0, 16]
         ret
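
The difference between the two listings can be checked with a little arithmetic. The helper below is a hypothetical sketch, not part of the patch: it counts the loads and stores for a copy under the simplifying assumption that every access is a full LDP/STP and that the size is a multiple of the pair width.

```c
#include <assert.h>

/* Bytes moved by one LDP/STP: two 8-byte X registers or two 16-byte
   Q registers.  */
enum { TOY_X_PAIR_BYTES = 16, TOY_Q_PAIR_BYTES = 32 };

/* Loads plus stores needed to copy BYTES, assuming every access is a
   full pair of PAIR_BYTES (an idealisation; real expansion also handles
   tails and unpaired accesses).  */
unsigned
toy_copy_insns (unsigned bytes, unsigned pair_bytes)
{
  assert (pair_bytes > 0 && bytes % pair_bytes == 0);
  return 2 * (bytes / pair_bytes);
}
```

For the 32-byte testcase, toy_copy_insns (32, TOY_X_PAIR_BYTES) is 4 and toy_copy_insns (32, TOY_Q_PAIR_BYTES) is 2, matching the two listings above. With 4-byte A64 instructions that is 8 bytes of code saved for this function.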

Bootstrapped and tested on aarch64-none-linux-gnu.
I hope this is a small enough change for GCC 9.
One could argue that it finishes up the work done this cycle to support Q-register LDP/STP.

I've seen this give about a 1.8% improvement on 541.leela_r on Cortex-A57, with the other changes in SPEC2017 in the noise,
but there is a reduction in code size everywhere (due to more LDP/STP-Q pairs being formed).

Ok for trunk?

Thanks,
Kyrill

2018-12-21  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/aarch64.c (aarch64_expand_movmem): Use V4SImode for
     128-bit moves.

2018-12-21  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/movmem-q-reg_1.c: New test.
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 88b14179a4cbc5357dfabe21227ff9c8a111804c..a8dcdd4c9e22a7583a197372e500c787c91fe459 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -16448,6 +16448,16 @@ aarch64_expand_movmem (rtx *operands)
 	if (GET_MODE_BITSIZE (mode_iter.require ()) <= MIN (n, copy_limit))
 	  cur_mode = mode_iter.require ();
 
+      /* If we want to use 128-bit chunks use a vector mode to prefer the use
+	 of Q registers.  This is preferable to using load/store-pairs of X
+	 registers as we need 1 Q-register vs 2 X-registers.
+	 Also, for targets that prefer it, further passes can create
+	 LDP/STP of Q-regs to further reduce the code size.  */
+      if (TARGET_SIMD
+	  && known_eq (GET_MODE_SIZE (cur_mode), GET_MODE_SIZE (TImode)))
+	cur_mode = V4SImode;
+
+
       gcc_assert (cur_mode != BLKmode);
 
       mode_bits = GET_MODE_BITSIZE (cur_mode).to_constant ();
diff --git a/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c b/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..09afad59712b939e25519f02153b5156ddacbf5a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/movmem-q-reg_1.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#define N 8
+int src[N], dst[N];
+
+void
+foo (void)
+{
+  __builtin_memcpy (dst, src, N * sizeof (int));
+}
+
+/* { dg-final { scan-assembler {ld[rp]\tq[0-9]*} } } */
+/* { dg-final { scan-assembler-not {ld[rp]\tx[0-9]*} } } */
+/* { dg-final { scan-assembler {st[rp]\tq[0-9]*} } } */
+/* { dg-final { scan-assembler-not {st[rp]\tx[0-9]*} } } */
\ No newline at end of file