[ARM] PR target/70473: Reduce size of Cortex-A8 automaton

Message ID 57C0166B.1090901@foss.arm.com
State New

Commit Message

Kyrill Tkachov Aug. 26, 2016, 10:14 a.m. UTC
Hi all,

The scheduling automata sizes are getting a bit out of control (as the PR complains about) and the Cortex-A8
one is one of the largest offenders. Some easy, low-hanging fruit here is the set of FP/NEON operations
that have very large reservation durations specified for them. They bloat the state space considerably, and it's
unlikely that a program has enough parallelism to fill the (for example) 64 cycles that are modelled
for the double-precision division. In the past we've dealt with this by decreasing the modelled reservation duration
to keep the state space down.
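
As a rough illustration of why long reservations inflate the state count, here is a toy model in Python. It is not how genautomata builds its automata: it tracks only two units, and the function names and the bitmask encoding are assumptions made for this sketch. The cycle counts are the cortex_a8_vfplite durations from the reservations in cortex-a8-neon.md (cortex_a8_vfp_muls included), compared against the same set with everything above 15 shortened to 15.

```python
from collections import deque


def reservation(vfplite_cycles):
    """Occupancy masks for a reservation shaped like
    "cortex_a8_vfp,cortex_a8_vfplite*N" (bit k = busy k cycles from now)."""
    vfp = 1                                      # cortex_a8_vfp in cycle 0
    vfplite = ((1 << vfplite_cycles) - 1) << 1   # cortex_a8_vfplite in cycles 1..N
    return vfp, vfplite


def count_states(durations):
    """Count the occupancy states reachable by issuing insns and advancing cycles."""
    insns = [reservation(n) for n in sorted(set(durations))]
    start = (0, 0)
    seen = {start}
    queue = deque([start])
    while queue:
        vfp, vfplite = queue.popleft()
        # Advancing one cycle shifts every future reservation one step closer.
        successors = [(vfp >> 1, vfplite >> 1)]
        # An insn can issue only if its reservation does not clash with the state.
        for r_vfp, r_lite in insns:
            if (vfp & r_vfp) == 0 and (vfplite & r_lite) == 0:
                successors.append((vfp | r_vfp, vfplite | r_lite))
        for state in successors:
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return len(seen)


long_durations = [11, 16, 20, 25, 36, 64]            # vfplite cycles as modelled today
short_durations = [min(n, 15) for n in long_durations]

print("states with long reservations: ", count_states(long_durations))
print("states with short reservations:", count_states(short_durations))
```

In this toy the state count follows the longest reservation. The real automaton combines many more units and instruction classes, so the absolute numbers here say nothing about the actual NDFA sizes.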

This patch does that for the cortex_a8_neon automaton and caps the duration of each affected reservation
at 15 cycles. This should be plenty to demonstrate that these are high-latency instructions.
With this patch the number of NDFA states drops by about 77% (26796 -> 6020).
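
For reference, the cap and the quoted state counts can be checked with a few lines of Python. This is only a sketch: the dictionary simply restates the cortex_a8_vfplite cycle counts from the five reservations touched in cortex-a8-neon.md, and the variable names are made up for the example.

```python
CAP = 15  # maximum cortex_a8_vfplite cycles modelled after the patch

# cortex_a8_vfplite cycle counts before the patch, per reservation.
vfplite_cycles_before = {
    "cortex_a8_vfp_muld": 16,
    "cortex_a8_vfp_macs": 20,
    "cortex_a8_vfp_macd": 25,
    "cortex_a8_vfp_divs": 36,
    "cortex_a8_vfp_divd": 64,
}

# Apply the cap to every reservation.
vfplite_cycles_after = {
    name: min(cycles, CAP) for name, cycles in vfplite_cycles_before.items()
}

# NDFA state counts quoted in this message.
states_before, states_after = 26796, 6020
reduction_pct = 100.0 * (states_before - states_after) / states_before

for name, cycles in vfplite_cycles_after.items():
    print(f"{name}: {vfplite_cycles_before[name]} -> {cycles}")
print(f"NDFA state reduction: {reduction_pct:.1f}%")
```

Only the reservation durations change; the latency values on the define_insn_reservation lines (17, 21, 26, 37, 65) are left as they are.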

As I don't have access to reasonable Cortex-A8 hardware, I benchmarked the patch with SPEC2000 on a Cortex-A15 instead.
The idea (from Ramana) is that, since Cortex-A8 tuning is the default tuning for armv7-a, the patch shouldn't hurt
the more widely accessible Cortex-A15 targets. There were no performance regressions there.

Bootstrapped and tested on arm-none-linux-gnueabihf.
Ok for trunk?

Thanks,
Kyrill

2016-08-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     PR target/70473
     * config/arm/cortex-a8-neon.md (cortex_a8_vfp_muld): Reduce
     reservation duration to 15 cycles.
     (cortex_a8_vfp_macs): Likewise.
     (cortex_a8_vfp_macd): Likewise.
     (cortex_a8_vfp_divs): Likewise.
     (cortex_a8_vfp_divd): Likewise.

Comments

Richard Earnshaw (lists) Aug. 26, 2016, 10:17 a.m. UTC | #1
On 26/08/16 11:14, Kyrill Tkachov wrote:
> Hi all,
> 
> The scheduling automata sizes are getting a bit out of control (as the
> PR complains about) and the Cortex-A8
> one is one of the largest offenders. An easy, low-hanging fruit in
> dealing with this are some of the FP/NEON operations
> that have very large reservation durations specified for them. They
> bloat the state space by quite a lot and it's not
> likely that there is enough parallelism present in the program to fill
> the (for example) 64 cycles that are modelled
> for the double-precision division. In the past we've dealt with this by
> decreasing the modelled reservation duration
> to keep the state space down.
> 
> This patch does that for the cortex_a8_neon automaton and caps the
> reservation duration for a particular reservation
> to 15 cycles. This should be plenty to demonstrate that these are high
> latency instructions.
> With this patch the number of NDFA states is massively reduced by more
> than 70% (26796 -> 6020).
> 
> As I don't have access to reasonable Cortex-A8 hardware I benchmarked it
> on SPEC2000 on a Cortex-A15.
> The idea (from Ramana) is that since Cortex-A8 tuning is the default
> tuning for armv7-a the patch shouldn't hurt
> the more widely accessible Cortex-A15 targets. There were no regressions
> in performance there.
> 
> Bootstrapped and tested on arm-none-linux-gnueabihf.
> Ok for trunk?
> 
> Thanks,
> Kyrill
> 
> 2016-08-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
> 
>     PR target/70473
>     * config/arm/cortex-a8-neon.md (cortex_a8_vfp_muld): Reduce
>     reservation duration to 15 cycles.
>     (cortex_a8_vfp_macs): Likewise.
>     (cortex_a8_vfp_macd): Likewise.
>     (cortex_a8_vfp_divs): Likewise.
>     (cortex_a8_vfp_divd): Likewise.
> 

OK.

R.

> arm-a8-automaton.patch
> 
> 
> diff --git a/gcc/config/arm/cortex-a8-neon.md b/gcc/config/arm/cortex-a8-neon.md
> index 45f861f6c6f840bd113e468eeec5373e06058f6d..b16c29974a7278e70d64dc83b5b388aebb51718b 100644
> --- a/gcc/config/arm/cortex-a8-neon.md
> +++ b/gcc/config/arm/cortex-a8-neon.md
> @@ -357,30 +357,34 @@ (define_insn_reservation "cortex_a8_vfp_muls" 12
>         (eq_attr "type" "fmuls"))
>    "cortex_a8_vfp,cortex_a8_vfplite*11")
>  
> +;; Don't model a reservation for more than 15 cycles as this explodes the
> +;; state space of the automaton for little gain.  It is unlikely that the
> +;; scheduler will find enough instructions to hide the full latency of the
> +;; instructions.
>  (define_insn_reservation "cortex_a8_vfp_muld" 17
>    (and (eq_attr "tune" "cortexa8")
>         (eq_attr "type" "fmuld"))
> -  "cortex_a8_vfp,cortex_a8_vfplite*16")
> +  "cortex_a8_vfp,cortex_a8_vfplite*15")
>  
>  (define_insn_reservation "cortex_a8_vfp_macs" 21
>    (and (eq_attr "tune" "cortexa8")
>         (eq_attr "type" "fmacs,ffmas"))
> -  "cortex_a8_vfp,cortex_a8_vfplite*20")
> +  "cortex_a8_vfp,cortex_a8_vfplite*15")
>  
>  (define_insn_reservation "cortex_a8_vfp_macd" 26
>    (and (eq_attr "tune" "cortexa8")
>         (eq_attr "type" "fmacd,ffmad"))
> -  "cortex_a8_vfp,cortex_a8_vfplite*25")
> +  "cortex_a8_vfp,cortex_a8_vfplite*15")
>  
>  (define_insn_reservation "cortex_a8_vfp_divs" 37
>    (and (eq_attr "tune" "cortexa8")
>         (eq_attr "type" "fdivs, fsqrts"))
> -  "cortex_a8_vfp,cortex_a8_vfplite*36")
> +  "cortex_a8_vfp,cortex_a8_vfplite*15")
>  
>  (define_insn_reservation "cortex_a8_vfp_divd" 65
>    (and (eq_attr "tune" "cortexa8")
>         (eq_attr "type" "fdivd, fsqrtd"))
> -  "cortex_a8_vfp,cortex_a8_vfplite*64")
> +  "cortex_a8_vfp,cortex_a8_vfplite*15")
>  
>  ;; Comparisons can actually take 7 cycles sometimes instead of four,
>  ;; but given all the other instructions lumped into type=ffarith that
>

Ramana Radhakrishnan Aug. 26, 2016, 10:18 a.m. UTC | #2
On Fri, Aug 26, 2016 at 11:14 AM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi all,
>
> The scheduling automata sizes are getting a bit out of control (as the PR
> complains about) and the Cortex-A8
> one is one of the largest offenders. An easy, low-hanging fruit in dealing
> with this are some of the FP/NEON operations
> that have very large reservation durations specified for them. They bloat
> the state space by quite a lot and it's not
> likely that there is enough parallelism present in the program to fill the
> (for example) 64 cycles that are modelled
> for the double-precision division. In the past we've dealt with this by
> decreasing the modelled reservation duration
> to keep the state space down.
>
> This patch does that for the cortex_a8_neon automaton and caps the
> reservation duration for a particular reservation
> to 15 cycles. This should be plenty to demonstrate that these are high
> latency instructions.
> With this patch the number of NDFA states is massively reduced by more than
> 70% (26796 -> 6020).
>
> As I don't have access to reasonable Cortex-A8 hardware I benchmarked it on
> SPEC2000 on a Cortex-A15.
> The idea (from Ramana) is that since Cortex-A8 tuning is the default tuning
> for armv7-a the patch shouldn't hurt
> the more widely accessible Cortex-A15 targets. There were no regressions in
> performance there.
>
> Bootstrapped and tested on arm-none-linux-gnueabihf.
> Ok for trunk?
>


OK,

regards
Ramana
> Thanks,
> Kyrill
>
> 2016-08-26  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     PR target/70473
>     * config/arm/cortex-a8-neon.md (cortex_a8_vfp_muld): Reduce
>     reservation duration to 15 cycles.
>     (cortex_a8_vfp_macs): Likewise.
>     (cortex_a8_vfp_macd): Likewise.
>     (cortex_a8_vfp_divs): Likewise.
>     (cortex_a8_vfp_divd): Likewise.

Patch

diff --git a/gcc/config/arm/cortex-a8-neon.md b/gcc/config/arm/cortex-a8-neon.md
index 45f861f6c6f840bd113e468eeec5373e06058f6d..b16c29974a7278e70d64dc83b5b388aebb51718b 100644
--- a/gcc/config/arm/cortex-a8-neon.md
+++ b/gcc/config/arm/cortex-a8-neon.md
@@ -357,30 +357,34 @@  (define_insn_reservation "cortex_a8_vfp_muls" 12
        (eq_attr "type" "fmuls"))
   "cortex_a8_vfp,cortex_a8_vfplite*11")
 
+;; Don't model a reservation for more than 15 cycles as this explodes the
+;; state space of the automaton for little gain.  It is unlikely that the
+;; scheduler will find enough instructions to hide the full latency of the
+;; instructions.
 (define_insn_reservation "cortex_a8_vfp_muld" 17
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fmuld"))
-  "cortex_a8_vfp,cortex_a8_vfplite*16")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_macs" 21
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fmacs,ffmas"))
-  "cortex_a8_vfp,cortex_a8_vfplite*20")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_macd" 26
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fmacd,ffmad"))
-  "cortex_a8_vfp,cortex_a8_vfplite*25")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_divs" 37
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fdivs, fsqrts"))
-  "cortex_a8_vfp,cortex_a8_vfplite*36")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 (define_insn_reservation "cortex_a8_vfp_divd" 65
   (and (eq_attr "tune" "cortexa8")
        (eq_attr "type" "fdivd, fsqrtd"))
-  "cortex_a8_vfp,cortex_a8_vfplite*64")
+  "cortex_a8_vfp,cortex_a8_vfplite*15")
 
 ;; Comparisons can actually take 7 cycles sometimes instead of four,
 ;; but given all the other instructions lumped into type=ffarith that