From patchwork Mon Apr 20 15:12:13 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kyrylo Tkachov X-Patchwork-Id: 462802 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 504A6140133 for ; Tue, 21 Apr 2015 01:12:29 +1000 (AEST) Authentication-Results: ozlabs.org; dkim=pass reason="1024-bit key; unprotected key" header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b=mozHxP2V; dkim-adsp=none (unprotected policy); dkim-atps=neutral DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :message-id:date:from:mime-version:to:cc:subject:references :in-reply-to:content-type; q=dns; s=default; b=KYKXkVuTd0Zl1UiZA VV3Xld+PKkMpZftLMeBJLxknCSSuw7crdPwaCAeDU86LnjhCDJTzIUXnhcW2Mc4a CaQUf7ahJvv9VYmiqHl5nm/6B/oJVJaLTCPs7etIcIILKkxDHyJVhIvMZ8WN9tdi sAhIZymsYiHUfPos7tf976+fBo= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :message-id:date:from:mime-version:to:cc:subject:references :in-reply-to:content-type; s=default; bh=M+Xd3ZcUdeyUNFUxksot4EF 1IN8=; b=mozHxP2VtqTgyLdKhlp0aLQ583VCzCN63/NkfSz0/iP6wCFj4r6dRsR BuI/NnU2KSDJMcInBbVwdyd37Z0TaXFoJFSQSDI+guRk5jBcrswfwpUQdZ2+Bp2W 8wj4xxToTH68ZfKuNNWJnw7adxWNNFIJ0PNRWfOVZMhsSbAr06rw= Received: (qmail 95213 invoked by alias); 20 Apr 2015 15:12:21 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 95198 invoked by uid 89); 20 Apr 2015 15:12:19 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.8 required=5.0 tests=AWL, BAYES_00, SPF_PASS autolearn=ham version=3.3.2 X-HELO: eu-smtp-delivery-143.mimecast.com Received: from eu-smtp-delivery-143.mimecast.com (HELO eu-smtp-delivery-143.mimecast.com) (146.101.78.143) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 20 Apr 2015 15:12:17 +0000 Received: from cam-owa1.Emea.Arm.com (fw-tnat.cambridge.arm.com [217.140.96.140]) by uk-mta-29.uk.mimecast.lan; Mon, 20 Apr 2015 16:12:14 +0100 Received: from [10.2.207.50] ([10.1.2.79]) by cam-owa1.Emea.Arm.com with Microsoft SMTPSVC(6.0.3790.3959); Mon, 20 Apr 2015 16:12:14 +0100 Message-ID: <5535174D.3010103@arm.com> Date: Mon, 20 Apr 2015 16:12:13 +0100 From: Kyrill Tkachov User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Jeff Law , GCC Patches CC: "rth@redhat.com" , Marcus Shawcroft , Richard Earnshaw , James Greenhalgh Subject: Re: [PATCH][expmed] Properly account for the cost and latency of shift+add ops when synthesizing mults References: <552CCBCF.2060502@arm.com> <5531611F.7060505@redhat.com> <5534DE71.6060904@arm.com> <553515FD.4030203@redhat.com> In-Reply-To: <553515FD.4030203@redhat.com> X-MC-Unique: dA_uybhhTRa4xrPzNhZjKQ-1 X-IsSubscribed: yes On 20/04/15 16:06, Jeff Law wrote: > On 04/20/2015 05:09 AM, Kyrill Tkachov wrote: >> Hi Jeff, >> >> On 17/04/15 20:38, Jeff Law wrote: >>> On 04/14/2015 02:11 AM, Kyrill Tkachov wrote: >>>> Of course the effect on codegen of this patch depends a lot on the rtx >>>> costs in the backend. >>>> On aarch64 with -mcpu=cortex-a57 tuning I see the cost limit being >>>> exceeded in more cases and the >>>> expansion code choosing instead to do a move-immediate and a mul >>>> instruction. >>>> No regressions on SPEC2006 on a Cortex-A57. >>>> >>>> For example, for code: >>>> long f0 (int x, int y) >>>> { >>>> return (long)x * 6L; >>>> } >>>> >>>> >>>> int f1(int x) >>>> { >>>> return x * 10; >>>> } >>>> >>>> int f2(int x) >>>> { >>>> return x * 100; >>>> } >>>> >>>> int f3(int x) >>>> { >>>> return x * 20; >>>> } >>>> >>>> int f4(int x) >>>> { >>>> return x * 25; >>>> } >>>> >>>> int f5(int x) >>>> { >>>> return x * 11; >>>> } >>> Please turn this into a test for the testsuite. It's fine if this the >>> test is specific to AArch64. You may need to break it into 6 individual >>> tests since what you want to check for in each one may be significantly >>> different. For example, f0, f4 and f5 you'd probably check for the >>> constant load & multiply instructions. Not sure how to best test for >>> what you want in f1-f3. >> f1/f3 still end up synthesising the mult, but prefer a different >> algorithm. I don't think the algorithm chosen in f1/f3 is worse or >> better than what it was producing before, so I don't think there's >> much point in testing for it. > Yea, when I looked at the differences, it wasn't immediately clear if > there was a real improvement or not. The new sequences use the single > register, so one might claim they're marginally better due to that. > > If you think it's really better to >> test for something, I propose testing that only two instructions are >> generated, and neither of them are a 'mul'. I'll repost a patch with >> my proposed testcases for f0,f2,f4,f5. > Your call on f1/f3. IIRC f2 didn't change at all, so I'm not sure if > you need a test for that (perhaps you should make sure it continues to > use a mul rather than a synthesized sequence). > > Pre-approved with whatever you decide on the testing side. Thanks, I could've sworn I had sent this version out a couple hours ago. My mail client has been playing up. Here it is with 6 tests. For the tests corresponding to f1/f3 in my example above I scan that we don't use the 'w1' reg. I'll give the AArch64 maintainers to comment on the tests for a day or two before committing. Thanks, Kyrill 2015-04-20 Kyrylo Tkachov * expmed.c: (synth_mult): Only assume overlapping shift with previous steps in alg_sub_t_m2 case. 2015-04-20 Kyrylo Tkachov * gcc.target/aarch64/mult-synth_1.c: New test. * gcc.target/aarch64/mult-synth_2.c: Likewise. * gcc.target/aarch64/mult-synth_3.c: Likewise. * gcc.target/aarch64/mult-synth_4.c: Likewise. * gcc.target/aarch64/mult-synth_5.c: Likewise. * gcc.target/aarch64/mult-synth_6.c: Likewise. > > jeff > commit 6dddd785f2839811c9f8e59e3e36f64cdb3a4ebe Author: Kyrylo Tkachov Date: Thu Mar 12 17:38:20 2015 +0000 [expmed] Properly account for the cost and latency of shift+sub ops when synthesizing mults diff --git a/gcc/expmed.c b/gcc/expmed.c index f570612..e890a1e 100644 --- a/gcc/expmed.c +++ b/gcc/expmed.c @@ -2664,14 +2664,28 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t, m = exact_log2 (-orig_t + 1); if (m >= 0 && m < maxm) { - op_cost = shiftsub1_cost (speed, mode, m); + op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m); + /* If the target has a cheap shift-and-subtract insn use + that in preference to a shift insn followed by a sub insn. + Assume that the shift-and-sub is "atomic" with a latency + equal to it's cost, otherwise assume that on superscalar + hardware the shift may be executed concurrently with the + earlier steps in the algorithm. */ + if (shiftsub1_cost (speed, mode, m) <= op_cost) + { + op_cost = shiftsub1_cost (speed, mode, m); + op_latency = op_cost; + } + else + op_latency = add_cost (speed, mode); + new_limit.cost = best_cost.cost - op_cost; - new_limit.latency = best_cost.latency - op_cost; + new_limit.latency = best_cost.latency - op_latency; synth_mult (alg_in, (unsigned HOST_WIDE_INT) (-orig_t + 1) >> m, &new_limit, mode); alg_in->cost.cost += op_cost; - alg_in->cost.latency += op_cost; + alg_in->cost.latency += op_latency; if (CHEAPER_MULT_COST (&alg_in->cost, &best_cost)) { best_cost = alg_in->cost; @@ -2704,20 +2718,12 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t, if (t % d == 0 && t > d && m < maxm && (!cache_hit || cache_alg == alg_add_factor)) { - /* If the target has a cheap shift-and-add instruction use - that in preference to a shift insn followed by an add insn. - Assume that the shift-and-add is "atomic" with a latency - equal to its cost, otherwise assume that on superscalar - hardware the shift may be executed concurrently with the - earlier steps in the algorithm. */ op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m); - if (shiftadd_cost (speed, mode, m) < op_cost) - { - op_cost = shiftadd_cost (speed, mode, m); - op_latency = op_cost; - } - else - op_latency = add_cost (speed, mode); + if (shiftadd_cost (speed, mode, m) <= op_cost) + op_cost = shiftadd_cost (speed, mode, m); + + op_latency = op_cost; + new_limit.cost = best_cost.cost - op_cost; new_limit.latency = best_cost.latency - op_latency; @@ -2742,20 +2748,11 @@ synth_mult (struct algorithm *alg_out, unsigned HOST_WIDE_INT t, if (t % d == 0 && t > d && m < maxm && (!cache_hit || cache_alg == alg_sub_factor)) { - /* If the target has a cheap shift-and-subtract insn use - that in preference to a shift insn followed by a sub insn. - Assume that the shift-and-sub is "atomic" with a latency - equal to it's cost, otherwise assume that on superscalar - hardware the shift may be executed concurrently with the - earlier steps in the algorithm. */ op_cost = add_cost (speed, mode) + shift_cost (speed, mode, m); - if (shiftsub0_cost (speed, mode, m) < op_cost) - { - op_cost = shiftsub0_cost (speed, mode, m); - op_latency = op_cost; - } - else - op_latency = add_cost (speed, mode); + if (shiftsub0_cost (speed, mode, m) <= op_cost) + op_cost = shiftsub0_cost (speed, mode, m); + + op_latency = op_cost; new_limit.cost = best_cost.cost - op_cost; new_limit.latency = best_cost.latency - op_latency; diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c new file mode 100644 index 0000000..8b6ee70 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_1.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */ + +int +foo (int x) +{ + return x * 100; +} + +/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */ +/* { dg-final { cleanup-saved-temps } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c new file mode 100644 index 0000000..9278817 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_2.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */ + +int +foo (int x) +{ + return x * 25; +} + +/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */ +/* { dg-final { cleanup-saved-temps } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c new file mode 100644 index 0000000..6898e78 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_3.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */ + +int +foo (int x) +{ + return x * 11; +} + +/* { dg-final { scan-assembler "mul\tw\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */ +/* { dg-final { cleanup-saved-temps } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c new file mode 100644 index 0000000..1088a75 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_4.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */ + +long +foo (int x, int y) +{ + return (long)x * 6L; +} + +/* { dg-final { scan-assembler "smull\tx\[0-9\]+, w\[0-9\]+, w\[0-9\]+" } } */ +/* { dg-final { cleanup-saved-temps } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c new file mode 100644 index 0000000..8cf3314 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_5.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */ + +int +foo (int x) +{ + return x * 10; +} + +/* { dg-final { scan-assembler-not "\tw1" } } */ +/* { dg-final { cleanup-saved-temps } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c b/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c new file mode 100644 index 0000000..e941b72 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/mult-synth_6.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mcpu=cortex-a57 -save-temps" } */ + +int +foo (int x) +{ + return x * 20; +} + +/* { dg-final { scan-assembler-not "\tw1" } } */ +/* { dg-final { cleanup-saved-temps } } */