From patchwork Mon Apr 20 15:12:13 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
X-Patchwork-Id: 462802
Return-Path: 
 <gcc-patches-return-395501-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 504A6140133
	for <incoming@patchwork.ozlabs.org>;
	Tue, 21 Apr 2015 01:12:29 +1000 (AEST)
Authentication-Results: ozlabs.org; dkim=pass
	reason="1024-bit key; unprotected key"
	header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b=mozHxP2V;
	dkim-adsp=none (unprotected policy); dkim-atps=neutral
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:message-id:date:from:mime-version:to:cc:subject:references
	:in-reply-to:content-type; q=dns; s=default; b=KYKXkVuTd0Zl1UiZA
	VV3Xld+PKkMpZftLMeBJLxknCSSuw7crdPwaCAeDU86LnjhCDJTzIUXnhcW2Mc4a
	CaQUf7ahJvv9VYmiqHl5nm/6B/oJVJaLTCPs7etIcIILKkxDHyJVhIvMZ8WN9tdi
	sAhIZymsYiHUfPos7tf976+fBo=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:message-id:date:from:mime-version:to:cc:subject:references
	:in-reply-to:content-type; s=default; bh=M+Xd3ZcUdeyUNFUxksot4EF
	1IN8=; b=mozHxP2VtqTgyLdKhlp0aLQ583VCzCN63/NkfSz0/iP6wCFj4r6dRsR
	BuI/NnU2KSDJMcInBbVwdyd37Z0TaXFoJFSQSDI+guRk5jBcrswfwpUQdZ2+Bp2W
	8wj4xxToTH68ZfKuNNWJnw7adxWNNFIJ0PNRWfOVZMhsSbAr06rw=
Received: (qmail 95213 invoked by alias); 20 Apr 2015 15:12:21 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 95198 invoked by uid 89); 20 Apr 2015 15:12:19 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.8 required=5.0 tests=AWL, BAYES_00,
	SPF_PASS autolearn=ham version=3.3.2
X-HELO: eu-smtp-delivery-143.mimecast.com
Received: from eu-smtp-delivery-143.mimecast.com (HELO
	eu-smtp-delivery-143.mimecast.com) (146.101.78.143) by
	sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP;
	Mon, 20 Apr 2015 15:12:17 +0000
Received: from cam-owa1.Emea.Arm.com (fw-tnat.cambridge.arm.com
	[217.140.96.140]) by uk-mta-29.uk.mimecast.lan;
	Mon, 20 Apr 2015 16:12:14 +0100
Received: from [10.2.207.50] ([10.1.2.79]) by cam-owa1.Emea.Arm.com with
	Microsoft SMTPSVC(6.0.3790.3959); Mon, 20 Apr 2015 16:12:14 +0100
Message-ID: <5535174D.3010103@arm.com>
Date: Mon, 20 Apr 2015 16:12:13 +0100
From: Kyrill Tkachov <kyrylo.tkachov@arm.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: Jeff Law <law@redhat.com>, GCC Patches <gcc-patches@gcc.gnu.org>
CC: "rth@redhat.com" <rth@redhat.com>,
	Marcus Shawcroft <marcus.shawcroft@arm.com>,
	Richard Earnshaw <Richard.Earnshaw@arm.com>,
	James Greenhalgh <james.greenhalgh@arm.com>
Subject: Re: [PATCH][expmed] Properly account for the cost and latency of
	shift+add ops when synthesizing mults
References: <552CCBCF.2060502@arm.com> <5531611F.7060505@redhat.com>
	<5534DE71.6060904@arm.com> <553515FD.4030203@redhat.com>
In-Reply-To: <553515FD.4030203@redhat.com>
X-MC-Unique: dA_uybhhTRa4xrPzNhZjKQ-1
X-IsSubscribed: yes

On 20/04/15 16:06, Jeff Law wrote:
> On 04/20/2015 05:09 AM, Kyrill Tkachov wrote:
>> Hi Jeff,
>>
>> On 17/04/15 20:38, Jeff Law wrote:
>>> On 04/14/2015 02:11 AM, Kyrill Tkachov wrote:
>>>> Of course the effect on codegen of this patch depends a lot on the rtx
>>>> costs in the backend.
>>>> On aarch64 with -mcpu=cortex-a57 tuning I see the cost limit being
>>>> exceeded in more cases and the
>>>> expansion code choosing instead to do a move-immediate and a mul
>>>> instruction.
>>>> No regressions on SPEC2006 on a Cortex-A57.
>>>>
>>>> For example, for code:
>>>> long f0 (int x, int y)
>>>> {
>>>>      return (long)x * 6L;
>>>> }
>>>>
>>>>
>>>> int f1(int x)
>>>> {
>>>>      return x * 10;
>>>> }
>>>>
>>>> int f2(int x)
>>>> {
>>>>        return x * 100;
>>>> }
>>>>
>>>> int f3(int x)
>>>> {
>>>>        return x * 20;
>>>> }
>>>>
>>>> int f4(int x)
>>>> {
>>>>        return x * 25;
>>>> }
>>>>
>>>> int f5(int x)
>>>> {
>>>>          return x * 11;
>>>> }
>>> Please turn this into a test for the testsuite.  It's fine if this the
>>> test is specific to AArch64.  You may need to break it into 6 individual
>>> tests since what you want to check for in each one may be significantly
>>> different.  For example, f0, f4 and f5 you'd probably check for the
>>> constant load & multiply instructions.  Not sure how to best test for
>>> what you want in f1-f3.
>> f1/f3 still end up synthesising the mult, but prefer a different
>> algorithm. I don't think the algorithm chosen in f1/f3 is worse or
>> better than what it was producing before, so I don't think there's
>> much point in testing for it.
> Yea, when I looked at the differences, it wasn't immediately clear if
> there was a real improvement or not.  The new sequences use the single
> register, so one might claim they're marginally better due to that.
>
>    If you think it's really better to
>> test for something, I propose testing that only two instructions are
>> generated, and neither of them are a 'mul'. I'll repost a patch with
>> my proposed testcases for f0,f2,f4,f5.
> Your call on f1/f3.  IIRC f2 didn't change at all, so I'm not sure if
> you need a test for that (perhaps you should make sure it continues to
> use a mul rather than a synthesized sequence).
>
> Pre-approved with whatever you decide on the testing side.

Thanks,
I could've sworn I had sent this version out a couple hours ago.
My mail client has been playing up.

Here it is with 6 tests. For the tests corresponding to f1/f3 in my
example above I scan that we don't use the 'w1' reg.

I'll give the AArch64 maintainers to comment on the tests for a day or two
before committing.

Thanks,
Kyrill

2015-04-20  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * expmed.c: (synth_mult): Only assume overlapping
     shift with previous steps in alg_sub_t_m2 case.

2015-04-20  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/mult-synth_1.c: New test.
     * gcc.target/aarch64/mult-synth_2.c: Likewise.
     * gcc.target/aarch64/mult-synth_3.c: Likewise.
     * gcc.target/aarch64/mult-synth_4.c: Likewise.
     * gcc.target/aarch64/mult-synth_5.c: Likewise.
     * gcc.target/aarch64/mult-synth_6.c: Likewise.
>
> jeff
>

commit 6dddd785f2839811c9f8e59e3e36f64cdb3a4ebe
Author: Kyrylo Tkachov <kyrylo.tkachov@arm.com>
Date:   Thu Mar 12 17:38:20 2015 +0000

    [expmed] Properly account for the cost and latency of shift+sub ops when synthesizing mults

diff --git a/gcc/expmed.c b/gcc/expmed.c
index f570612..e890a1e 100644
--- a/gcc/expmed.c
+++ b/gcc/expmed.c
@@ -2664,14 +2664,28 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
       m = exact_log2 (-orig_t + 1);
       if (m >= 0 && m < maxm)
 	{
-	  op_cost = shiftsub1_cost (speed, mode, m);
+	  op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
+	  /* If the target has a cheap shift-and-subtract insn use
+	     that in preference to a shift insn followed by a sub insn.
+	     Assume that the shift-and-sub is "atomic" with a latency
+	     equal to it's cost, otherwise assume that on superscalar
+	     hardware the shift may be executed concurrently with the
+	     earlier steps in the algorithm.  */
+	  if (shiftsub1_cost (speed, mode, m) <= op_cost)
+	    {
+	      op_cost = shiftsub1_cost (speed, mode, m);
+	      op_latency = op_cost;
+	    }
+	  else
+	    op_latency = add_cost (speed, mode);
+
 	  new_limit.cost = best_cost.cost - op_cost;
-	  new_limit.latency = best_cost.latency - op_cost;
+	  new_limit.latency = best_cost.latency - op_latency;
 	  synth_mult (alg_in, (unsigned HOST_WIDE_INT) (-orig_t + 1) >> m,
 		      &new_limit, mode);
 
 	  alg_in->cost.cost += op_cost;
-	  alg_in->cost.latency += op_cost;
+	  alg_in->cost.latency += op_latency;
 	  if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost))
 	    {
 	      best_cost = alg_in->cost;
@@ -2704,20 +2718,12 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
       if (t % d == 0 && t > d && m < maxm
 	  && (!cache_hit || cache_alg == alg_add_factor))
 	{
-	  /* If the target has a cheap shift-and-add instruction use
-	     that in preference to a shift insn followed by an add insn.
-	     Assume that the shift-and-add is "atomic" with a latency
-	     equal to its cost, otherwise assume that on superscalar
-	     hardware the shift may be executed concurrently with the
-	     earlier steps in the algorithm.  */
 	  op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
-	  if (shiftadd_cost (speed, mode, m) < op_cost)
-	    {
-	      op_cost = shiftadd_cost (speed, mode, m);
-	      op_latency = op_cost;
-	    }
-	  else
-	    op_latency = add_cost (speed, mode);
+	  if (shiftadd_cost (speed, mode, m) <= op_cost)
+	    op_cost = shiftadd_cost (speed, mode, m);
+
+	  op_latency = op_cost;
+
 
 	  new_limit.cost = best_cost.cost - op_cost;
 	  new_limit.latency = best_cost.latency - op_latency;
@@ -2742,20 +2748,11 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t,
       if (t % d == 0 && t > d && m < maxm
 	  && (!cache_hit || cache_alg == alg_sub_factor))
 	{
-	  /* If the target has a cheap shift-and-subtract insn use
-	     that in preference to a shift insn followed by a sub insn.
-	     Assume that the shift-and-sub is "atomic" with a latency
-	     equal to it's cost, otherwise assume that on superscalar
-	     hardware the shift may be executed concurrently with the
-	     earlier steps in the algorithm.  */
 	  op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m);
-	  if (shiftsub0_cost (speed, mode, m) < op_cost)
-	    {
-	      op_cost = shiftsub0_cost (speed, mode, m);
-	      op_latency = op_cost;
-	    }
-	  else
-	    op_latency = add_cost (speed, mode);
+	  if (shiftsub0_cost (speed, mode, m) <= op_cost)
+	    op_cost = shiftsub0_cost (speed, mode, m);
+
+	  op_latency = op_cost;
 
 	  new_limit.cost = best_cost.cost - op_cost;
 	  new_limit.latency = best_cost.latency - op_latency;
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c
new file mode 100644
index 0000000..8b6ee70
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+  return x * 100;
+}
+
+/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c
new file mode 100644
index 0000000..9278817
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+  return x * 25;
+}
+
+/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c
new file mode 100644
index 0000000..6898e78
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+  return x * 11;
+}
+
+/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c
new file mode 100644
index 0000000..1088a75
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+long
+foo (int x, int y)
+{
+   return (long)x * 6L;
+}
+
+/* { dg-final { scan-assembler "smull\tx\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c
new file mode 100644
index 0000000..8cf3314
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+  return x * 10;
+}
+
+/* { dg-final { scan-assembler-not "\tw1" } } */
+/* { dg-final { cleanup-saved-temps } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c
new file mode 100644
index 0000000..e941b72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */
+
+int
+foo (int x)
+{
+  return x * 20;
+}
+
+/* { dg-final { scan-assembler-not "\tw1" } } */
+/* { dg-final { cleanup-saved-temps } } */