[ARM] PR target/70473: Reduce size of Cortex-A8 automaton

Message ID	57C0166B.1090901@foss.arm.com
State	New
Headers
Return-Path: <gcc-patches-return-434719-incoming=patchwork.ozlabs.org@gcc.gnu.org>
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:message-id:date:from:mime-version:to:cc:subject:content-type;
	q=dns; s=default; b=Twxp6hHFL12xgE8N8wA8yWlLN5rlKRbkKPIciRtd2M3
	SZ3jRBfTAGfzpTL4vxX9qQbPxtzTrdp7vpHexgFtEkqDfSTUuP/G7eL+B5Us2AMY
	QgeQ0WNIRH01Nyvqc1Ikb5/MerlR4259jXxJh2PfWvz9a54JKV3vPHlci+DkoZAM
	=
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
Sender: gcc-patches-owner@gcc.gnu.org
Message-ID: <57C0166B.1090901@foss.arm.com>
Date: Fri, 26 Aug 2016 11:14:03 +0100
From: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:31.0) Gecko/20100101 Thunderbird/31.2.0
MIME-Version: 1.0
To: GCC Patches <gcc-patches@gcc.gnu.org>
CC: Ramana Radhakrishnan <ramana.radhakrishnan@arm.com>,
	Richard Earnshaw <Richard.Earnshaw@arm.com>
Subject: [PATCH][ARM] PR target/70473: Reduce size of Cortex-A8 automaton
Content-Type: multipart/mixed;
	boundary="------------030507000105080909010403"

Commit Message

Kyrill Tkachov Aug. 26, 2016, 10:14 a.m. UTC

Hi all,

The scheduling automata sizes are getting a bit out of control (as the PR complains about), and the Cortex-A8
one is one of the largest offenders. Some low-hanging fruit here is the set of FP/NEON operations
that have very large reservation durations specified for them. They bloat the state space considerably, and it's
unlikely that a program has enough parallelism to fill the (for example) 64 cycles that are modelled
for the double-precision division. In the past we've dealt with this by decreasing the modelled reservation duration
to keep the state space down.
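
As a rough illustration of why long reservations inflate the state count, here is a toy model in Python. It is only a sketch under stated assumptions: it is not how genautomata builds its automata, the function name `reachable_states` is made up for this example, and the second unit with its 4-cycle reservation is hypothetical. A state is just the number of cycles each unit stays busy.

```python
from collections import deque


def reachable_states(durations_per_unit):
    """Count reachable states of a toy pipeline model (not genautomata).

    A state is a tuple of remaining busy cycles, one entry per unit.
    In each cycle we may issue at most one instruction on a free unit,
    after which every counter ticks down by one.
    """
    start = (0,) * len(durations_per_unit)
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        # Either issue nothing, or reserve one free unit for d cycles.
        candidates = [state]
        for unit, durations in enumerate(durations_per_unit):
            if state[unit] == 0:
                for d in durations:
                    busy = list(state)
                    busy[unit] = d
                    candidates.append(tuple(busy))
        for cand in candidates:
            nxt = tuple(max(c - 1, 0) for c in cand)  # advance one cycle
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


# vfplite repeat counts from cortex-a8-neon.md before capping, and the same
# list with everything above 15 clamped to 15.
UNCAPPED = [11, 16, 20, 25, 36, 64]
CAPPED = sorted({min(d, 15) for d in UNCAPPED})
OTHER_UNIT = [4]  # hypothetical second unit, for illustration only

if __name__ == "__main__":
    print("one unit, uncapped: ", reachable_states([UNCAPPED]))
    print("one unit, capped:   ", reachable_states([CAPPED]))
    print("two units, uncapped:", reachable_states([UNCAPPED, OTHER_UNIT]))
    print("two units, capped:  ", reachable_states([CAPPED, OTHER_UNIT]))
```

In this toy the state count of a single unit grows with its longest reservation, and every additional unit multiplies it, which is the effect the cap is meant to limit.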

This patch does that for the cortex_a8_neon automaton: it caps the cortex_a8_vfplite part of each affected
reservation at 15 cycles and leaves the instruction latencies unchanged. That should still be plenty to show
that these are high-latency instructions.
With this patch the number of NDFA states drops by more than 77% (26796 -> 6020).
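
The numbers in this paragraph can be checked with a few lines of Python. This is a sketch for the reader, not part of the patch: the repeat counts are copied from cortex-a8-neon.md as shown in the diff, while the helper names `cap_reservations` and `reduction_percent` are made up for this example.

```python
# cortex_a8_vfplite repeat counts before the patch, copied from the diff.
# cortex_a8_vfp_muls is listed only to show that the cap leaves it alone.
OLD_VFPLITE_CYCLES = {
    "cortex_a8_vfp_muls": 11,
    "cortex_a8_vfp_muld": 16,
    "cortex_a8_vfp_macs": 20,
    "cortex_a8_vfp_macd": 25,
    "cortex_a8_vfp_divs": 36,
    "cortex_a8_vfp_divd": 64,
}
CAP = 15  # cycles, the cap chosen in the patch


def cap_reservations(cycles, cap=CAP):
    """Clamp every repeat count to at most `cap` cycles."""
    return {name: min(n, cap) for name, n in cycles.items()}


def reduction_percent(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    new_cycles = cap_reservations(OLD_VFPLITE_CYCLES)
    for name, old in OLD_VFPLITE_CYCLES.items():
        print(f"{name}: cortex_a8_vfplite*{old} -> cortex_a8_vfplite*{new_cycles[name]}")
    # NDFA state counts as reported in this message.
    print(f"NDFA states: 26796 -> 6020 ({reduction_percent(26796, 6020):.1f}% fewer)")
```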

As I don't have access to reasonable Cortex-A8 hardware, I benchmarked it on SPEC2000 on a Cortex-A15.
The idea (from Ramana) is that, since Cortex-A8 tuning is the default tuning for armv7-a, the patch should at least
not hurt the more widely accessible Cortex-A15 targets. There were no performance regressions there.

Bootstrapped and tested on arm-none-linux-gnueabihf.
Ok for trunk?

Thanks,
Kyrill

2016-08-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

    PR target/70473
    * config/arm/cortex-a8-neon.md (cortex_a8_vfp_muld): Reduce
    reservation duration to 15 cycles.
    (cortex_a8_vfp_macs): Likewise.
    (cortex_a8_vfp_macd): Likewise.
    (cortex_a8_vfp_divs): Likewise.
    (cortex_a8_vfp_divd): Likewise.

Comments

Richard Earnshaw (lists) Aug. 26, 2016, 10:17 a.m. UTC | #1

On 26/08/16 11:14, Kyrill Tkachov wrote:
> Ok for trunk?

OK.

R.

Ramana Radhakrishnan Aug. 26, 2016, 10:18 a.m. UTC | #2

On Fri, Aug 26, 2016 at 11:14 AM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Ok for trunk?

OK,

regards
Ramana

diff --git a/gcc/config/arm/cortex-a8-neon.md b/gcc/config/arm/cortex-a8-neon.md
index 45f861f6c6f840bd113e468eeec5373e06058f6d..b16c29974a7278e70d64dc83b5b388aebb51718b 100644
--- a/gcc/config/arm/cortex-a8-neon.md
+++ b/gcc/config/arm/cortex-a8-neon.md
@@ -357,30 +357,34 @@  (define_insn_reservation "cortex_a8_vfp_muls" 12
        (eq_attr "type" "fmuls"))
   "cortex_a8_vfp,cortex_a8_vfplite*11")
 
+;; Don't model a reservation for more than 15 cycles as this explodes the
+;; state space of the automaton for little gain.  It is unlikely that the
+;; scheduler will find enough instructions to hide the full latency of the
+;; instructions.
 (define_insn_reservation "cortex_a8_vfp_muld" 17
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fmuld"))
-  "cortex_a8_vfp,cortex_a8_vfplite*16")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_macs" 21
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fmacs,ffmas"))
-  "cortex_a8_vfp,cortex_a8_vfplite*20")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_macd" 26
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fmacd,ffmad"))
-  "cortex_a8_vfp,cortex_a8_vfplite*25")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_divs" 37
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fdivs, fsqrts"))
-  "cortex_a8_vfp,cortex_a8_vfplite*36")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_divd" 65
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fdivd, fsqrtd"))
-  "cortex_a8_vfp,cortex_a8_vfplite*64")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 ;; Comparisons can actually take 7 cycles sometimes instead of four,
 ;; but given all the other instructions lumped into type=ffarith that