From patchwork Fri Mar 26 16:15:04 2021
X-Patchwork-Submitter: Richard Sandiford
X-Patchwork-Id: 1458871
From: Richard Sandiford
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 03/13] aarch64: Add costs for LD[234]/ST[234] permutes
Date: Fri, 26 Mar 2021 16:15:04 +0000
In-Reply-To: (Richard Sandiford's message of "Fri, 26 Mar 2021 16:12:42 +0000")

At the moment, we cost LD[234] and ST[234] as N vector loads or stores,
which effectively treats the implied permute as free.
This patch adds additional costs for the permutes, which apply on top
of the costs for the loads and stores.

As with the previous patches, this one only becomes active if a CPU
selects use_new_vector_costs.  It should therefore have a very low
impact on other CPUs.

gcc/
	* config/aarch64/aarch64-protos.h (simd_vec_cost::ld2_st2_permute_cost)
	(simd_vec_cost::ld3_st3_permute_cost): New member variables.
	(simd_vec_cost::ld4_st4_permute_cost): Likewise.
	* config/aarch64/aarch64.c (generic_advsimd_vector_cost): Update
	accordingly, using zero for the new costs.
	(generic_sve_vector_cost, a64fx_advsimd_vector_cost): Likewise.
	(a64fx_sve_vector_cost, qdf24xx_advsimd_vector_cost): Likewise.
	(thunderx_advsimd_vector_cost, tsv110_advsimd_vector_cost): Likewise.
	(cortexa57_advsimd_vector_cost, exynosm1_advsimd_vector_cost)
	(xgene1_advsimd_vector_cost, thunderx2t99_advsimd_vector_cost)
	(thunderx3t110_advsimd_vector_cost): Likewise.
	(aarch64_ld234_st234_vectors): New function.
	(aarch64_adjust_stmt_cost): Likewise.
	(aarch64_add_stmt_cost): Call aarch64_adjust_stmt_cost if using
	the new vector costs.
---
 gcc/config/aarch64/aarch64-protos.h |  7 +++
 gcc/config/aarch64/aarch64.c        | 94 +++++++++++++++++++++++++++++
 2 files changed, 101 insertions(+)

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index bfcab72b122..3d152754981 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -202,6 +202,13 @@ struct simd_vec_cost
      specially below.  */
   const int fp_stmt_cost;
 
+  /* Per-vector cost of permuting vectors after an LD2, LD3 or LD4,
+     as well as the per-vector cost of permuting vectors before
+     an ST2, ST3 or ST4.  */
+  const int ld2_st2_permute_cost;
+  const int ld3_st3_permute_cost;
+  const int ld4_st4_permute_cost;
+
   /* Cost of a permute operation.  */
   const int permute_cost;
 
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index b62169a267a..8fb723dabd2 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -590,6 +590,9 @@ static const advsimd_vec_cost generic_advsimd_vector_cost =
 {
   1, /* int_stmt_cost */
   1, /* fp_stmt_cost */
+  0, /* ld2_st2_permute_cost */
+  0, /* ld3_st3_permute_cost */
+  0, /* ld4_st4_permute_cost */
   2, /* permute_cost */
   2, /* reduc_i8_cost */
   2, /* reduc_i16_cost */
@@ -612,6 +615,9 @@ static const sve_vec_cost generic_sve_vector_cost =
   {
     1, /* int_stmt_cost */
     1, /* fp_stmt_cost */
+    0, /* ld2_st2_permute_cost */
+    0, /* ld3_st3_permute_cost */
+    0, /* ld4_st4_permute_cost */
     2, /* permute_cost */
     2, /* reduc_i8_cost */
     2, /* reduc_i16_cost */
@@ -650,6 +656,9 @@ static const advsimd_vec_cost a64fx_advsimd_vector_cost =
 {
   2, /* int_stmt_cost */
   5, /* fp_stmt_cost */
+  0, /* ld2_st2_permute_cost */
+  0, /* ld3_st3_permute_cost */
+  0, /* ld4_st4_permute_cost */
   3, /* permute_cost */
   13, /* reduc_i8_cost */
   13, /* reduc_i16_cost */
@@ -671,6 +680,9 @@ static const sve_vec_cost a64fx_sve_vector_cost =
   {
     2, /* int_stmt_cost */
     5, /* fp_stmt_cost */
+    0, /* ld2_st2_permute_cost */
+    0, /* ld3_st3_permute_cost */
+    0, /* ld4_st4_permute_cost */
     3, /* permute_cost */
     13, /* reduc_i8_cost */
     13, /* reduc_i16_cost */
@@ -708,6 +720,9 @@ static const advsimd_vec_cost qdf24xx_advsimd_vector_cost =
 {
   1, /* int_stmt_cost */
   3, /* fp_stmt_cost */
+  0, /* ld2_st2_permute_cost */
+  0, /* ld3_st3_permute_cost */
+  0, /* ld4_st4_permute_cost */
   2, /* permute_cost */
   1, /* reduc_i8_cost */
   1, /* reduc_i16_cost */
@@ -742,6 +757,9 @@ static const advsimd_vec_cost thunderx_advsimd_vector_cost =
 {
   4, /* int_stmt_cost */
   1, /* fp_stmt_cost */
+  0, /* ld2_st2_permute_cost */
+  0, /* ld3_st3_permute_cost */
+  0, /* ld4_st4_permute_cost */
   4, /* permute_cost */
   2, /* reduc_i8_cost */
   2, /* reduc_i16_cost */
@@ -775,6 +793,9 @@ static const advsimd_vec_cost tsv110_advsimd_vector_cost =
 {
   2, /* int_stmt_cost */
   2, /* fp_stmt_cost */
+  0, /* ld2_st2_permute_cost */
+  0, /* ld3_st3_permute_cost */
+  0, /* ld4_st4_permute_cost */
   2, /* permute_cost */
   3, /* reduc_i8_cost */
   3, /* reduc_i16_cost */
@@ -807,6 +828,9 @@ static const advsimd_vec_cost cortexa57_advsimd_vector_cost =
 {
   2, /* int_stmt_cost */
   2, /* fp_stmt_cost */
+  0, /* ld2_st2_permute_cost */
+  0, /* ld3_st3_permute_cost */
+  0, /* ld4_st4_permute_cost */
   3, /* permute_cost */
   8, /* reduc_i8_cost */
   8, /* reduc_i16_cost */
@@ -840,6 +864,9 @@ static const advsimd_vec_cost exynosm1_advsimd_vector_cost =
 {
   3, /* int_stmt_cost */
   3, /* fp_stmt_cost */
+  0, /* ld2_st2_permute_cost */
+  0, /* ld3_st3_permute_cost */
+  0, /* ld4_st4_permute_cost */
   3, /* permute_cost */
   3, /* reduc_i8_cost */
   3, /* reduc_i16_cost */
@@ -872,6 +899,9 @@ static const advsimd_vec_cost xgene1_advsimd_vector_cost =
 {
   2, /* int_stmt_cost */
   2, /* fp_stmt_cost */
+  0, /* ld2_st2_permute_cost */
+  0, /* ld3_st3_permute_cost */
+  0, /* ld4_st4_permute_cost */
   2, /* permute_cost */
   4, /* reduc_i8_cost */
   4, /* reduc_i16_cost */
@@ -905,6 +935,9 @@ static const advsimd_vec_cost thunderx2t99_advsimd_vector_cost =
 {
   4, /* int_stmt_cost */
   5, /* fp_stmt_cost */
+  0, /* ld2_st2_permute_cost */
+  0, /* ld3_st3_permute_cost */
+  0, /* ld4_st4_permute_cost */
   10, /* permute_cost */
   6, /* reduc_i8_cost */
   6, /* reduc_i16_cost */
@@ -938,6 +971,9 @@ static const advsimd_vec_cost thunderx3t110_advsimd_vector_cost =
 {
   5, /* int_stmt_cost */
   5, /* fp_stmt_cost */
+  0, /* ld2_st2_permute_cost */
+  0, /* ld3_st3_permute_cost */
+  0, /* ld4_st4_permute_cost */
   10, /* permute_cost */
   5, /* reduc_i8_cost */
   5, /* reduc_i16_cost */
@@ -14086,6 +14122,26 @@ aarch64_reduc_type (vec_info *vinfo, stmt_vec_info stmt_info)
   return -1;
 }
 
+/* Return true if an access of kind KIND for STMT_INFO represents one
+   vector of an LD[234] or ST[234] operation.  Return the total number of
+   vectors (2, 3 or 4) if so, otherwise return a value outside that range.  */
+static int
+aarch64_ld234_st234_vectors (vect_cost_for_stmt kind, stmt_vec_info stmt_info)
+{
+  if ((kind == vector_load
+       || kind == unaligned_load
+       || kind == vector_store
+       || kind == unaligned_store)
+      && STMT_VINFO_DATA_REF (stmt_info))
+    {
+      stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info);
+      if (stmt_info
+	  && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_LOAD_STORE_LANES)
+	return DR_GROUP_SIZE (stmt_info);
+    }
+  return 0;
+}
+
 /* Return true if creating multiple copies of STMT_INFO for Advanced SIMD
    vectors would produce a series of LDP or STP operations.  KIND is the
    kind of statement that STMT_INFO represents.  */
@@ -14320,6 +14376,38 @@ aarch64_sve_adjust_stmt_cost (class vec_info *vinfo, vect_cost_for_stmt kind,
   return stmt_cost;
 }
 
+/* STMT_COST is the cost calculated for STMT_INFO, which has cost kind KIND
+   and which when vectorized would operate on vector type VECTYPE.  Add the
+   cost of any embedded operations.  */
+static unsigned int
+aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
+			  tree vectype, unsigned int stmt_cost)
+{
+  if (vectype)
+    {
+      const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
+
+      /* Detect cases in which a vector load or store represents an
+	 LD[234] or ST[234] instruction.  */
+      switch (aarch64_ld234_st234_vectors (kind, stmt_info))
+	{
+	case 2:
+	  stmt_cost += simd_costs->ld2_st2_permute_cost;
+	  break;
+
+	case 3:
+	  stmt_cost += simd_costs->ld3_st3_permute_cost;
+	  break;
+
+	case 4:
+	  stmt_cost += simd_costs->ld4_st4_permute_cost;
+	  break;
+	}
+    }
+
+  return stmt_cost;
+}
+
 /* Implement targetm.vectorize.add_stmt_cost.  */
 static unsigned
 aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
@@ -14347,6 +14435,12 @@ aarch64_add_stmt_cost (class vec_info *vinfo, void *data, int count,
 	stmt_cost = aarch64_sve_adjust_stmt_cost (vinfo, kind, stmt_info,
 						  vectype, stmt_cost);
 
+      if (stmt_info && aarch64_use_new_vector_costs_p ())
+	/* Account for any extra "embedded" costs that apply additively
+	   to the base cost calculated above.  */
+	stmt_cost = aarch64_adjust_stmt_cost (kind, stmt_info, vectype,
+					      stmt_cost);
+
       /* Statements in an inner loop relative to the loop being vectorized
 	 are weighted more heavily.  The value here is arbitrary and could
 	 potentially be improved with analysis.  */
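
[Not part of the patch: a small self-contained sketch for reviewers who
want a concrete feel for the arithmetic.  It mirrors the switch in
aarch64_adjust_stmt_cost, but the struct, function names and cost values
below are invented for the example, since every tuning updated above
currently leaves the new fields at zero.]

#include <stdio.h>

/* Hypothetical per-vector permute costs; the field names match the new
   simd_vec_cost members, but the values are made up.  */
struct example_permute_costs
{
  int ld2_st2_permute_cost;
  int ld3_st3_permute_cost;
  int ld4_st4_permute_cost;
};

/* GROUP_SIZE stands in for the value that aarch64_ld234_st234_vectors
   would return: 2, 3 or 4 for an LD[234]/ST[234] group, 0 otherwise.  */
static unsigned int
example_adjust_stmt_cost (const struct example_permute_costs *costs,
			  int group_size, unsigned int stmt_cost)
{
  switch (group_size)
    {
    case 2: stmt_cost += costs->ld2_st2_permute_cost; break;
    case 3: stmt_cost += costs->ld3_st3_permute_cost; break;
    case 4: stmt_cost += costs->ld4_st4_permute_cost; break;
    }
  return stmt_cost;
}

int
main (void)
{
  /* Say a tuning charges 2 extra units per vector for an LD3/ST3 permute
     and the plain vector load cost is 4.  */
  struct example_permute_costs costs = { 1, 2, 3 };
  unsigned int load_cost = 4;

  /* Previously an LD3 group was costed as 3 loads: 3 * 4 = 12 units.
     With the per-vector permute cost it becomes 3 * (4 + 2) = 18.  */
  unsigned int per_vector = example_adjust_stmt_cost (&costs, 3, load_cost);
  printf ("per-vector cost %u, LD3 group cost %u\n",
	  per_vector, 3 * per_vector);
  return 0;
}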