[AArch64] Tweak Cortex-A57 vector cost

Message ID	AM5PR0802MB2610A35A5419228F752E3B9583B80@AM5PR0802MB2610.eurprd08.prod.outlook.com
State	New
Headers

Return-Path: <gcc-patches-return-440985-incoming=patchwork.ozlabs.org@gcc.gnu.org>
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender:from
	:to:cc:subject:date:message-id:content-type
	:content-transfer-encoding:mime-version; q=dns; s=default; b=DKW
	6dhJY8xLUjuqAJiOf37JMZLiKuwrI2XR5VJvvPnaonp6eXTWHlSfP7IPvY9+Ct2v
	EqA+cOE49dR7t1GA0hx1gHrEhQ8Ao3SaoVJu1diE4dwZcnl2Ozgu6bvVYhQfXZs+
	xe44X2DzUyuT3dOMJGuWlCZG5YQ39+COoDdUe04k=
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
Sender: gcc-patches-owner@gcc.gnu.org
From: Wilco Dijkstra <Wilco.Dijkstra@arm.com>
To: GCC Patches <gcc-patches@gcc.gnu.org>
CC: nd <nd@arm.com>
Subject: [PATCH][AArch64] Tweak Cortex-A57 vector cost
Date: Thu, 10 Nov 2016 17:10:00 +0000
Message-ID: <AM5PR0802MB2610A35A5419228F752E3B9583B80@AM5PR0802MB2610.eurprd08.prod.outlook.com>
nodisclaimer: True
received-spf: None (protection.outlook.com: arm.com does not designate
	permitted sender hosts)
spamdiagnosticoutput: 1:99
spamdiagnosticmetadata: NSPM
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-MS-Exchange-CrossTenant-originalarrivaltime: 10 Nov 2016 17:10:00.8625
	(UTC)
X-MS-Exchange-CrossTenant-fromentityheader: Hosted
X-MS-Exchange-CrossTenant-id: f34e5979-57d9-4aaa-ad4d-b122a662184d
X-MS-Exchange-Transport-CrossTenantHeadersStamped: AM5PR0802MB2610

Commit Message

Wilco Dijkstra Nov. 10, 2016, 5:10 p.m. UTC

The existing vector costs stop some beneficial vectorization.  This is mostly due
to the vector statement cost being set to 3 and to vector loads having a higher
cost than scalar loads.  As a result, even when we vectorize 4x, the cost of a
vectorized loop can come out similar to the scalar version, and we fail to
vectorize.  For example, for a particular loop the costs for -mcpu=generic are:

note: Cost model analysis: 
  Vector inside of loop cost: 146
  Vector prologue cost: 5
  Vector epilogue cost: 0
  Scalar iteration cost: 50
  Scalar outside cost: 0
  Vector outside cost: 5
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 1
note:   Runtime profitability threshold = 3
note:   Static estimate profitability threshold = 3
note: loop vectorized


While -mcpu=cortex-a57 reports:

note: Cost model analysis: 
  Vector inside of loop cost: 294
  Vector prologue cost: 15
  Vector epilogue cost: 0
  Scalar iteration cost: 74
  Scalar outside cost: 0
  Vector outside cost: 15
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 31
note:   Runtime profitability threshold = 30
note:   Static estimate profitability threshold = 30
note: not vectorized: vectorization not profitable.
note: not vectorized: iteration count smaller than user specified loop bound parameter or minimum profitable iterations (whichever is more conservative).
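
The two thresholds in the dumps above can be reproduced with a little
arithmetic.  The following is a minimal sketch, not GCC's actual code: the
function names are illustrative, the formula is a simplification that assumes
no prologue/epilogue peeling and a scalar outside cost of 0 (as in both dumps),
and the vectorization factor of 4 is an assumption taken from the "vectorize 4x"
remark.

```c
#include <assert.h>

/* Smallest iteration count at which the vector loop is strictly cheaper
   than the scalar loop, in a simplified model with no peeling and no
   scalar outside cost.  Returns -1 if the vector body is never cheaper.  */
int min_profitable_iters(int vec_inside, int vec_outside,
                         int scalar_iter, int vf)
{
  assert(vf > 0);

  /* Cost of the vf scalar iterations that one vector iteration replaces.  */
  int scalar_per_vec_iter = scalar_iter * vf;
  if (scalar_per_vec_iter <= vec_inside)
    return -1;

  /* Iterations needed to amortize the vector setup cost.  */
  int iters = (vec_outside * vf) / (scalar_per_vec_iter - vec_inside);

  /* On a tie (or shortfall from integer division) ask for one more.  */
  if (scalar_per_vec_iter * iters <= vec_inside * iters + vec_outside * vf)
    iters++;
  return iters;
}

/* Threshold used in an "iterations <= threshold: stay scalar" check:
   never below one full vector iteration, then minus one.  */
int runtime_threshold(int min_iters, int vf)
{
  if (min_iters < vf)
    min_iters = vf;
  return min_iters - 1;
}
```

With a vectorization factor of 4, the -mcpu=generic numbers (146, 5, 50) give a
minimum of 1 and a threshold of 3, and the -mcpu=cortex-a57 numbers (294, 15,
74) give 31 and 30.  In the Cortex-A57 case the vector body costs 294 against
4 * 74 = 296 for the scalar iterations it replaces, so each vector iteration
saves only 2 cost units and the setup cost takes about 30 iterations to recover.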


Using a cost of 3 for a vector operation suggests that vector operations are
3 times as expensive as scalar operations.  Since most vector operations have
a throughput similar to that of scalar operations, this is not correct.

Using slightly lower values for these heuristics (vec_stmt_cost 3 -> 2, vector
load costs 5 -> 4) now allows this loop and many others to be vectorized.  On a
proprietary benchmark the gain from vectorizing this loop is around 15-30%,
which shows that vectorizing it is indeed beneficial.

ChangeLog:
2016-11-10  Wilco Dijkstra  <wdijkstr@arm.com>

	* config/aarch64/aarch64.c (cortexa57_vector_cost):
	Change vec_stmt_cost, vec_align_load_cost and vec_unalign_load_cost.

--

Comments

Richard Earnshaw Nov. 11, 2016, 10:16 a.m. UTC | #1

OK.

R.

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 279a6dfaa4a9c306bc7a8dba9f4f53704f61fefe..cff2e8fc6e9309e6aa4f68a5aba3bfac3b737283 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -382,12 +382,12 @@  static const struct cpu_vector_cost cortexa57_vector_cost =
   1, /* scalar_stmt_cost  */
   4, /* scalar_load_cost  */
   1, /* scalar_store_cost  */
-  3, /* vec_stmt_cost  */
+  2, /* vec_stmt_cost  */
   3, /* vec_permute_cost  */
   8, /* vec_to_scalar_cost  */
   8, /* scalar_to_vec_cost  */
-  5, /* vec_align_load_cost  */
-  5, /* vec_unalign_load_cost  */
+  4, /* vec_align_load_cost  */
+  4, /* vec_unalign_load_cost  */
   1, /* vec_unalign_store_cost  */
   1, /* vec_store_cost  */
   1, /* cond_taken_branch_cost  */
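
To get a feel for how the three changed entries move a loop-body cost, here is
a rough sketch under loud assumptions: the struct is a cut-down stand-in for
struct cpu_vector_cost, the statement mix is hypothetical, and the purely
additive model ignores permutes, scalar/vector transfers and branches that the
real vectorizer also costs.  It does not reproduce the dump figures above.

```c
#include <assert.h>

/* Hypothetical subset of the cost table; one field stands in for both the
   aligned and unaligned vector load costs, which the patch sets equal.  */
struct vec_cost_sketch {
  int scalar_stmt_cost;
  int scalar_load_cost;
  int scalar_store_cost;
  int vec_stmt_cost;
  int vec_load_cost;
  int vec_store_cost;
};

/* Hypothetical statement counts for one loop body.  */
struct stmt_mix {
  int loads;
  int stmts;
  int stores;
};

/* Cortex-A57 values before and after the patch (from the diff).  */
static const struct vec_cost_sketch a57_old = { 1, 4, 1, 3, 5, 1 };
static const struct vec_cost_sketch a57_new = { 1, 4, 1, 2, 4, 1 };

/* Cost of one scalar iteration of the body.  */
int scalar_body_cost(const struct vec_cost_sketch *c, struct stmt_mix m)
{
  assert(c != 0);
  return m.loads * c->scalar_load_cost
         + m.stmts * c->scalar_stmt_cost
         + m.stores * c->scalar_store_cost;
}

/* Cost of one vector iteration of the same body.  */
int vector_body_cost(const struct vec_cost_sketch *c, struct stmt_mix m)
{
  assert(c != 0);
  return m.loads * c->vec_load_cost
         + m.stmts * c->vec_stmt_cost
         + m.stores * c->vec_store_cost;
}
```

For a made-up body with 2 loads, 10 arithmetic statements and 1 store, the
scalar iteration costs 19 under both tables, while the vector iteration drops
from 41 to 29.  Statement-heavy loops benefit most, because vec_stmt_cost is
the entry that changes by the largest factor.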