Adjust vectorized cost for reduction.

Message ID

20231212061208.234184-1-hongtao.liu@intel.com

State

New

Headers

DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 5E984385841A
From: liuhongt <hongtao.liu@intel.com>
To: gcc-patches@gcc.gnu.org
Cc: crazylht@gmail.com,
	hjl.tools@gmail.com
Subject: [PATCH] Adjust vectorized cost for reduction.
Date: Tue, 12 Dec 2023 14:12:08 +0800
Message-Id: <20231212061208.234184-1-hongtao.liu@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org

Series

Adjust vectorized cost for reduction.

Commit Message

liuhongt Dec. 12, 2023, 6:12 a.m. UTC

x86 doesn't support horizontal reduction instructions; reduc_op_scal_m
is emulated with vec_extract_half + op (at half vector length).
Take that into account when calculating the cost for vectorization.
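
To make the emulation concrete, here is a minimal scalar sketch of the halving scheme described above. It is an illustration only, not GCC code: the function name `reduce_plus_by_halving` and the use of `std::vector<int>` as a stand-in for a vector register are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative model only, not GCC code: reduce a power-of-two number of
// lanes by repeatedly folding the high half onto the low half.
// `steps` receives the number of (extract-high-half, op) pairs performed.
int reduce_plus_by_halving (std::vector<int> lanes, unsigned &steps)
{
  const std::size_t n = lanes.size ();
  assert (n != 0 && (n & (n - 1)) == 0);  // lane count must be a power of two
  steps = 0;
  for (std::size_t width = n / 2; width >= 1; width /= 2)
    {
      // "vec_extract_half" selects lanes [width, 2 * width); the "op"
      // then combines them with lanes [0, width) at half vector length.
      for (std::size_t i = 0; i < width; ++i)
        lanes[i] += lanes[i + width];
      ++steps;
    }
  return lanes[0];
}
```

For 8 lanes the loop runs log2(8) = 3 times, and each iteration stands for two instructions (one extract, one op), which is where a factor of `2 * log2(nunits)` in a cost estimate would come from.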

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
No big performance impact on SPEC2017 as measured on ICX.
Ok for trunk?

gcc/ChangeLog:

	PR target/112325
	* config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
	Handle reduction vec_to_scalar.
	(ix86_vector_costs::ix86_vect_reduc_cost): New function.
---
 gcc/config/i386/i386.cc | 45 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

Comments

Richard Biener Dec. 12, 2023, noon UTC | #1

On Tue, Dec 12, 2023 at 7:12 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> x86 doesn't support horizontal reduction instructions; reduc_op_scal_m
> is emulated with vec_extract_half + op (at half vector length).
> Take that into account when calculating the cost for vectorization.
>
> Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> No big performance impact on SPEC2017 as measured on ICX.
> Ok for trunk?

I don't think keying only on vec_to_scalar is good, since
vect_model_reduction_cost will always use that when
extracting the scalar result element from the final vector
as well, so you'll get double-counting here.
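
The double-counting concern can be illustrated with a toy model. This is not GCC code: `cost_entry`, `total_cost` and the numbers used are invented for the example, and the shape of the recorded epilogue (explicit shift/op steps recorded as vector_stmt, followed by one vec_to_scalar for the final extract) is an assumption about what the vectorizer records, not something this thread spells out.

```cpp
#include <cassert>
#include <vector>

// Toy model, not GCC code: the enumerator names mirror the thread, but
// the types are invented for illustration.
enum cost_kind { vector_stmt, vec_to_scalar };

struct cost_entry
{
  cost_kind kind;
  bool from_reduction;  // stands in for what vect_is_reduction reports
};

// Policy resembling the patch: every reduction vec_to_scalar is priced
// as a full emulated reduction; everything else costs `base`.
unsigned total_cost (const std::vector<cost_entry> &entries,
                     unsigned base, unsigned emulated_reduc)
{
  unsigned total = 0;
  for (const cost_entry &e : entries)
    total += (e.kind == vec_to_scalar && e.from_reduction)
             ? emulated_reduc : base;
  return total;
}
```

Under these assumptions, an 8-lane epilogue already recorded as six vector_stmt entries plus one vec_to_scalar extract would total 6 * 4 + 24 = 48 with base = 4 and emulated_reduc = 24, whereas pricing the extract as an ordinary statement would give 28. The shift/op steps are paid for twice.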

There is currently no good way of identifying the cases
where the vectorizer chose reduc_*_scal; this operation
is identified as vector_stmt.

There is STMT_VINFO_REDUC_FN though, but I'm
not 100% positive the stmt_info you get passed has
this set (it's probably on the info_for_reduction node).

It should be possible to invent a new accessor like
vect_reduc_type () computing REDUC_FN though.

Richard.

> gcc/ChangeLog:
>
>         PR target/112325
>         * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost):
>         Handle reduction vec_to_scalar.
>         (ix86_vector_costs::ix86_vect_reduc_cost): New function.
> ---
>  gcc/config/i386/i386.cc | 45 +++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
>
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 4b6bad37c8f..02c9a5004a1 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -24603,6 +24603,7 @@ private:
>
>    /* Estimate register pressure of the vectorized code.  */
>    void ix86_vect_estimate_reg_pressure ();
> +  unsigned ix86_vect_reduc_cost (stmt_vec_info, tree);
>    /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for
>       estimation of register pressure.
>       ??? Currently it's only used by vec_construct/scalar_to_vec
> @@ -24845,6 +24846,12 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>         if (TREE_CODE (op) == SSA_NAME)
>           TREE_VISITED (op) = 0;
>      }
> +  /* This is a reduc_*_scal_m; x86 supports reduc_*_scal_m by emulation.  */
> +  else if (kind == vec_to_scalar
> +          && stmt_info
> +          && vect_is_reduction (stmt_info))
> +    stmt_cost = ix86_vect_reduc_cost (stmt_info, vectype);
> +
>    if (stmt_cost == -1)
>      stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
>
> @@ -24875,6 +24882,44 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>    return retval;
>  }
>
> +/* x86 doesn't support horizontal reduction instructions,
> +   reduc_op_scal_m is emulated with vec_extract_hi + op.  */
> +unsigned
> +ix86_vector_costs::ix86_vect_reduc_cost (stmt_vec_info stmt_info,
> +                                        tree vectype)
> +{
> +  gcc_assert (vectype);
> +  unsigned cost = 0;
> +  machine_mode mode = TYPE_MODE (vectype);
> +  unsigned len = GET_MODE_SIZE (mode);
> +
> +  /* PSADBW is used for reduc_plus_scal_{v16qi, v8qi, v4qi}.  */
> +  if (GET_MODE_INNER (mode) == E_QImode
> +      && stmt_info
> +      && stmt_info->stmt && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN
> +      && gimple_assign_rhs_code (stmt_info->stmt) == PLUS_EXPR)
> +    {
> +      cost = ix86_cost->sse_op;
> +      /* vec_extract_hi + vpaddb for 256/512-bit reduc_plus_scal_v*qi.  */
> +      if (len > 16)
> +       cost += exact_log2 (len >> 4) * ix86_cost->sse_op * 2;
> +    }
> +  else
> +    /* vec_extract_hi + op.  */
> +    cost = ix86_cost->sse_op * exact_log2 (TYPE_VECTOR_SUBPARTS (vectype)) * 2;
> +
> +  /* Count extra uops for TARGET_*_SPLIT_REGS.  NB: There's no target which
> +     supports 512-bit vectors but has TARGET_AVX256/128_SPLIT_REGS.
> +     ix86_vect_cost is not used since the reduction instruction sequence
> +     consists of mixed vector-length instructions after vec_extract_hi.  */
> +  if ((len == 64 && TARGET_AVX512_SPLIT_REGS)
> +      || (len == 32 && TARGET_AVX256_SPLIT_REGS)
> +      || (len == 16 && TARGET_AVX256_SPLIT_REGS))
> +    cost += ix86_cost->sse_op;
> +
> +  return cost;
> +}
> +
>  void
>  ix86_vector_costs::ix86_vect_estimate_reg_pressure ()
>  {
> --
> 2.31.1
>

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 4b6bad37c8f..02c9a5004a1 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -24603,6 +24603,7 @@  private:
 
   /* Estimate register pressure of the vectorized code.  */
   void ix86_vect_estimate_reg_pressure ();
+  unsigned ix86_vect_reduc_cost (stmt_vec_info, tree);
   /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for
      estimation of register pressure.
      ??? Currently it's only used by vec_construct/scalar_to_vec
@@ -24845,6 +24846,12 @@  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	if (TREE_CODE (op) == SSA_NAME)
 	  TREE_VISITED (op) = 0;
     }
+  /* This is a reduc_*_scal_m; x86 supports reduc_*_scal_m by emulation.  */
+  else if (kind == vec_to_scalar
+	   && stmt_info
+	   && vect_is_reduction (stmt_info))
+    stmt_cost = ix86_vect_reduc_cost (stmt_info, vectype);
+
   if (stmt_cost == -1)
     stmt_cost = ix86_builtin_vectorization_cost (kind, vectype, misalign);
 
@@ -24875,6 +24882,44 @@  ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
   return retval;
 }
 
+/* x86 doesn't support horizontal reduction instructions,
+   reduc_op_scal_m is emulated with vec_extract_hi + op.  */
+unsigned
+ix86_vector_costs::ix86_vect_reduc_cost (stmt_vec_info stmt_info,
+					 tree vectype)
+{
+  gcc_assert (vectype);
+  unsigned cost = 0;
+  machine_mode mode = TYPE_MODE (vectype);
+  unsigned len = GET_MODE_SIZE (mode);
+
+  /* PSADBW is used for reduc_plus_scal_{v16qi, v8qi, v4qi}.  */
+  if (GET_MODE_INNER (mode) == E_QImode
+      && stmt_info
+      && stmt_info->stmt && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN
+      && gimple_assign_rhs_code (stmt_info->stmt) == PLUS_EXPR)
+    {
+      cost = ix86_cost->sse_op;
+      /* vec_extract_hi + vpaddb for 256/512-bit reduc_plus_scal_v*qi.  */
+      if (len > 16)
+	cost += exact_log2 (len >> 4) * ix86_cost->sse_op * 2;
+    }
+  else
+    /* vec_extract_hi + op.  */
+    cost = ix86_cost->sse_op * exact_log2 (TYPE_VECTOR_SUBPARTS (vectype)) * 2;
+
+  /* Count extra uops for TARGET_*_SPLIT_REGS.  NB: There's no target which
+     supports 512-bit vectors but has TARGET_AVX256/128_SPLIT_REGS.
+     ix86_vect_cost is not used since the reduction instruction sequence
+     consists of mixed vector-length instructions after vec_extract_hi.  */
+  if ((len == 64 && TARGET_AVX512_SPLIT_REGS)
+      || (len == 32 && TARGET_AVX256_SPLIT_REGS)
+      || (len == 16 && TARGET_AVX256_SPLIT_REGS))
+    cost += ix86_cost->sse_op;
+
+  return cost;
+}
+
 void
 ix86_vector_costs::ix86_vect_estimate_reg_pressure ()
 {
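
The arithmetic in ix86_vect_reduc_cost above can be checked outside GCC with a small standalone model. This is a sketch under stated assumptions, not the patch itself: `reduc_cost_inputs`, `log2_pow2` and `model_reduc_cost` are hypothetical names, the three TARGET_*_SPLIT_REGS tests are collapsed into a single `split_regs` flag, and the value 4 used for sse_op in the note below is only an example.

```cpp
#include <cassert>

// Illustrative stand-ins; none of these names exist in GCC.
struct reduc_cost_inputs
{
  unsigned sse_op;        // stands in for ix86_cost->sse_op
  unsigned vector_bytes;  // stands in for GET_MODE_SIZE (mode)
  unsigned nunits;        // stands in for TYPE_VECTOR_SUBPARTS (vectype)
  bool qi_plus;           // QImode elements reduced with PLUS_EXPR
  bool split_regs;        // the SPLIT_REGS test for this length holds
};

// log2 of a power of two, which is how exact_log2 is used in the patch.
static unsigned log2_pow2 (unsigned x)
{
  assert (x != 0 && (x & (x - 1)) == 0);
  unsigned n = 0;
  while (x >>= 1)
    ++n;
  return n;
}

unsigned model_reduc_cost (const reduc_cost_inputs &in)
{
  unsigned cost;
  if (in.qi_plus)
    {
      // One PSADBW-based step for the 128-bit (or narrower) vector.
      cost = in.sse_op;
      // Each halving from 256/512 bits adds one extract and one add.
      if (in.vector_bytes > 16)
        cost += log2_pow2 (in.vector_bytes >> 4) * in.sse_op * 2;
    }
  else
    // One extract plus one op per halving step.
    cost = in.sse_op * log2_pow2 (in.nunits) * 2;

  // One extra uop when the target splits registers of this length.
  if (in.split_regs)
    cost += in.sse_op;
  return cost;
}
```

With sse_op = 4, a 32-byte vector of 8 elements costs 4 * 3 * 2 = 24 in this model, a 32-byte QImode plus-reduction costs 4 + 1 * 4 * 2 = 12, and a 64-byte vector of 16 elements on a split-register target costs 4 * 4 * 2 + 4 = 36.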
