
[1/4] middle-end: support multi-step zero-extends using VEC_PERM_EXPR

Message ID patch-18853-tamar@arm.com
State New
Series [1/4] middle-end: support multi-step zero-extends using VEC_PERM_EXPR

Commit Message

Tamar Christina Oct. 14, 2024, 10:55 a.m. UTC
Hi All,

This patch series adds support for a target to do a direct conversion for zero
extends using permutes.

To do this it uses a target hook, use_permute_for_promotion, which must be
implemented by targets.  This hook is used to indicate:

 1. can a target do this for the given modes.
 2. is it profitable for the target to do it.
 3. can the target convert between various vector modes with a VIEW_CONVERT.
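
As an illustration only (the actual AArch64 implementation lives in a later
patch of this series and is not shown here), a target hook answering these
three questions might look roughly like the following; the name and the
profitability cut-off are hypothetical:

/* Hypothetical sketch: prefer permutes only when the extension would
   otherwise need more than one unpack step.  */
static bool
example_use_permute_for_promotion (const_tree in_type, const_tree out_type)
{
  /* The permute indices depend on a fixed vector length, so reject
     variable-length vectors.  */
  if (!TYPE_VECTOR_SUBPARTS (in_type).is_constant ()
      || !TYPE_VECTOR_SUBPARTS (out_type).is_constant ())
    return false;

  /* A single doubling step is fine as a regular unpack; multi-step
     extensions (e.g. char -> int or char -> long long) are where the
     shorter dependency chain pays off.  */
  return element_precision (out_type) >= 4 * element_precision (in_type);
}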

Using permutations has a big benefit for multi-step zero extensions because it
both reduces the number of needed instructions and increases throughput, as
the dependency chain is removed.
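
As a rough sketch of the gimple this produces: zero-extending a V16QI vector
to V2DI takes, per output vector, one byte permute that pulls two input bytes
into the low byte of each 64-bit lane and fills every other byte from a zero
vector (little-endian indices shown; index 16 selects the zero operand),
followed by a view-convert to the wider element type:

  _1 = VEC_PERM_EXPR <in_, { 0, ... }, { 0, 16, 16, 16, 16, 16, 16, 16,
                                         1, 16, 16, 16, 16, 16, 16, 16 }>;
  _2 = VIEW_CONVERT_EXPR<vector(2) long long unsigned int>(_1);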

Concretely on AArch64 this changes:

void test4(unsigned char *x, long long *y, int n) {
    for(int i = 0; i < n; i++) {
        y[i] = x[i];
    }
}

from generating:

.L4:
        ldr     q30, [x4], 16
        add     x3, x3, 128
        zip1    v1.16b, v30.16b, v31.16b
        zip2    v30.16b, v30.16b, v31.16b
        zip1    v2.8h, v1.8h, v31.8h
        zip1    v0.8h, v30.8h, v31.8h
        zip2    v1.8h, v1.8h, v31.8h
        zip2    v30.8h, v30.8h, v31.8h
        zip1    v26.4s, v2.4s, v31.4s
        zip1    v29.4s, v0.4s, v31.4s
        zip1    v28.4s, v1.4s, v31.4s
        zip1    v27.4s, v30.4s, v31.4s
        zip2    v2.4s, v2.4s, v31.4s
        zip2    v0.4s, v0.4s, v31.4s
        zip2    v1.4s, v1.4s, v31.4s
        zip2    v30.4s, v30.4s, v31.4s
        stp     q26, q2, [x3, -128]
        stp     q28, q1, [x3, -96]
        stp     q29, q0, [x3, -64]
        stp     q27, q30, [x3, -32]
        cmp     x4, x5
        bne     .L4

and instead we get:

.L4:
        add     x3, x3, 128
        ldr     q23, [x4], 16
        tbl     v5.16b, {v23.16b}, v31.16b
        tbl     v4.16b, {v23.16b}, v30.16b
        tbl     v3.16b, {v23.16b}, v29.16b
        tbl     v2.16b, {v23.16b}, v28.16b
        tbl     v1.16b, {v23.16b}, v27.16b
        tbl     v0.16b, {v23.16b}, v26.16b
        tbl     v22.16b, {v23.16b}, v25.16b
        tbl     v23.16b, {v23.16b}, v24.16b
        stp     q5, q4, [x3, -128]
        stp     q3, q2, [x3, -96]
        stp     q1, q0, [x3, -64]
        stp     q22, q23, [x3, -32]
        cmp     x4, x5
        bne     .L4

Tests are added in the AArch64 patch introducing the hook.  The testsuite
also already contains about 800 runtime tests that are affected by this.

Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf,
x86_64-pc-linux-gnu -m32, -m64 and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* target.def (use_permute_for_promotion): New.
	* doc/tm.texi.in: Document it.
	* doc/tm.texi: Regenerate.
	* targhooks.cc (default_use_permute_for_promotion): New.
	* targhooks.h (default_use_permute_for_promotion): New.
	* tree-vect-stmts.cc (vectorizable_conversion): Support direct
	conversion with permute.
	(vect_create_vectorized_promotion_stmts): Likewise.
	(supportable_widening_operation): Likewise.
	(vect_gen_perm_mask_any): Allow vector permutes where input registers
	are half the width of the result per the GCC 14 relaxation of
	VEC_PERM_EXPR.

---

Comments

Richard Sandiford Oct. 14, 2024, 6:33 p.m. UTC | #1
Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> This patch series adds support for a target to do a direct conversion for zero
> extends using permutes.
>
> To do this it uses a target hook, use_permute_for_promotion, which must be
> implemented by targets.  This hook is used to indicate:
>
>  1. can a target do this for the given modes.
>  2. is it profitable for the target to do it.
>  3. can the target convert between various vector modes with a VIEW_CONVERT.
>
> Using permutations has a big benefit for multi-step zero extensions because it
> both reduces the number of needed instructions and increases throughput, as
> the dependency chain is removed.
>
> Concretely on AArch64 this changes:
>
> void test4(unsigned char *x, long long *y, int n) {
>     for(int i = 0; i < n; i++) {
>         y[i] = x[i];
>     }
> }
>
> from generating:
>
> .L4:
>         ldr     q30, [x4], 16
>         add     x3, x3, 128
>         zip1    v1.16b, v30.16b, v31.16b
>         zip2    v30.16b, v30.16b, v31.16b
>         zip1    v2.8h, v1.8h, v31.8h
>         zip1    v0.8h, v30.8h, v31.8h
>         zip2    v1.8h, v1.8h, v31.8h
>         zip2    v30.8h, v30.8h, v31.8h
>         zip1    v26.4s, v2.4s, v31.4s
>         zip1    v29.4s, v0.4s, v31.4s
>         zip1    v28.4s, v1.4s, v31.4s
>         zip1    v27.4s, v30.4s, v31.4s
>         zip2    v2.4s, v2.4s, v31.4s
>         zip2    v0.4s, v0.4s, v31.4s
>         zip2    v1.4s, v1.4s, v31.4s
>         zip2    v30.4s, v30.4s, v31.4s
>         stp     q26, q2, [x3, -128]
>         stp     q28, q1, [x3, -96]
>         stp     q29, q0, [x3, -64]
>         stp     q27, q30, [x3, -32]
>         cmp     x4, x5
>         bne     .L4
>
> and instead we get:
>
> .L4:
>         add     x3, x3, 128
>         ldr     q23, [x4], 16
>         tbl     v5.16b, {v23.16b}, v31.16b
>         tbl     v4.16b, {v23.16b}, v30.16b
>         tbl     v3.16b, {v23.16b}, v29.16b
>         tbl     v2.16b, {v23.16b}, v28.16b
>         tbl     v1.16b, {v23.16b}, v27.16b
>         tbl     v0.16b, {v23.16b}, v26.16b
>         tbl     v22.16b, {v23.16b}, v25.16b
>         tbl     v23.16b, {v23.16b}, v24.16b
>         stp     q5, q4, [x3, -128]
>         stp     q3, q2, [x3, -96]
>         stp     q1, q0, [x3, -64]
>         stp     q22, q23, [x3, -32]
>         cmp     x4, x5
>         bne     .L4
>
> Tests are added in the AArch64 patch introducing the hook.  The testsuite
> also already contains about 800 runtime tests that are affected by this.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf,
> x86_64-pc-linux-gnu -m32, -m64 and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* target.def (use_permute_for_promotion): New.
> 	* doc/tm.texi.in: Document it.
> 	* doc/tm.texi: Regenerate.
> 	* targhooks.cc (default_use_permute_for_promotion): New.
> 	* targhooks.h (default_use_permute_for_promotion): New.
> 	* tree-vect-stmts.cc (vectorizable_conversion): Support direct
> 	conversion with permute.
> 	(vect_create_vectorized_promotion_stmts): Likewise.
> 	(supportable_widening_operation): Likewise.
> 	(vect_gen_perm_mask_any): Allow vector permutes where input registers
> 	are half the width of the result per the GCC 14 relaxation of
> 	VEC_PERM_EXPR.
>
> ---
>
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 4deb3d2c283a2964972b94f434370a6f57ea816a..e8192590ac14005bf7cb5f731c16ee7eacb78143 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -6480,6 +6480,15 @@ type @code{internal_fn}) should be considered expensive when the mask is
>  all zeros.  GCC can then try to branch around the instruction instead.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} bool TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree @var{in_type}, const_tree @var{out_type})
> +This hook returns true if the operation promoting @var{in_type} to
> +@var{out_type} should be done as a vector permute.  If @var{out_type} is
> +a signed type the operation will be done as the related unsigned type and
> +converted to @var{out_type}.  If the target supports the needed permute,
> +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is
> +beneficial to do so, the hook should return true, else false should be returned.
> +@end deftypefn

Just a review of the documentation, but: is a two-step process really
necessary for signed out_types?  I thought it could be done directly,
since it's in_type rather than out_type that determines the type of
extension.

Thanks,
Richard

> +
>  @deftypefn {Target Hook} {class vector_costs *} TARGET_VECTORIZE_CREATE_COSTS (vec_info *@var{vinfo}, bool @var{costing_for_scalar})
>  This hook should initialize target-specific data structures in preparation
>  for modeling the costs of vectorizing a loop or basic block.  The default
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 9f147ccb95cc6d4e79cdf5b265666ad502492145..c007bc707372dd374e8effc52d29b76f5bc283a1 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4303,6 +4303,8 @@ address;  but often a machine-dependent strategy can generate better code.
>  
>  @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
>  
> +@hook TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> +
>  @hook TARGET_VECTORIZE_CREATE_COSTS
>  
>  @hook TARGET_VECTORIZE_BUILTIN_GATHER
> diff --git a/gcc/target.def b/gcc/target.def
> index b31550108883c5c3f5ffc7e46a1e8a7b839ebe83..58545d5ef4248da5850edec8f4db9f2636973598 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -2056,6 +2056,20 @@ all zeros.  GCC can then try to branch around the instruction instead.",
>   (unsigned ifn),
>   default_empty_mask_is_expensive)
>  
> +/* Function to say whether a target supports and prefers to use permutes for
> +   zero extensions or truncates.  */
> +DEFHOOK
> +(use_permute_for_promotion,
> + "This hook returns true if the operation promoting @var{in_type} to\n\
> +@var{out_type} should be done as a vector permute.  If @var{out_type} is\n\
> +a signed type the operation will be done as the related unsigned type and\n\
> +converted to @var{out_type}.  If the target supports the needed permute,\n\
> +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is\n\
> +beneficial to do so, the hook should return true, else false should be returned.",
> + bool,
> + (const_tree in_type, const_tree out_type),
> + default_use_permute_for_promotion)
> +
>  /* Target builtin that implements vector gather operation.  */
>  DEFHOOK
>  (builtin_gather,
> diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> index 2704d6008f14d2aa65671f002af886d3b802effa..723f8f4fda7808b6899f10f8b3fafad74d3c536f 100644
> --- a/gcc/targhooks.h
> +++ b/gcc/targhooks.h
> @@ -124,6 +124,7 @@ extern opt_machine_mode default_vectorize_related_mode (machine_mode,
>  extern opt_machine_mode default_get_mask_mode (machine_mode);
>  extern bool default_empty_mask_is_expensive (unsigned);
>  extern bool default_conditional_operation_is_expensive (unsigned);
> +extern bool default_use_permute_for_promotion (const_tree, const_tree);
>  extern vector_costs *default_vectorize_create_costs (vec_info *, bool);
>  
>  /* OpenACC hooks.  */
> diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> index dc040df9fcd1182b62d83088ee7fb3a248c99f51..a487eab794fe9f1089ecb58fdfc881fdb19d28f3 100644
> --- a/gcc/targhooks.cc
> +++ b/gcc/targhooks.cc
> @@ -1615,6 +1615,14 @@ default_conditional_operation_is_expensive (unsigned ifn)
>    return ifn == IFN_MASK_STORE;
>  }
>  
> +/* By default no targets prefer permutes over multi step extension.  */
> +
> +bool
> +default_use_permute_for_promotion (const_tree, const_tree)
> +{
> +  return false;
> +}
> +
>  /* By default consider masked stores to be expensive.  */
>  
>  bool
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index 4f6905f15417f90c6f36e1711a7a25071f0f507c..f2939655e4ec34111baa8894eaf769d29b1c5b82 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -5129,6 +5129,111 @@ vect_create_vectorized_promotion_stmts (vec_info *vinfo,
>    gimple *new_stmt1, *new_stmt2;
>    vec<tree> vec_tmp = vNULL;
>  
> +  /* If we're using a VEC_PERM_EXPR then we're widening to the final type in
> +     one go.  */
> +  if (ch1 == VEC_PERM_EXPR
> +      && op_type == unary_op)
> +    {
> +      vec_tmp.create (vec_oprnds0->length () * 2);
> +      bool failed_p = false;
> +
> +      /* Extending with a vec-perm requires 2 instructions per step.  */
> +      FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> +	{
> +	  tree vectype_in = TREE_TYPE (vop0);
> +	  tree vectype_out = TREE_TYPE (vec_dest);
> +	  machine_mode mode_in = TYPE_MODE (vectype_in);
> +	  machine_mode mode_out = TYPE_MODE (vectype_out);
> +	  unsigned bitsize_in = element_precision (vectype_in);
> +	  unsigned tot_in, tot_out;
> +	  unsigned HOST_WIDE_INT count;
> +
> +	  /* We can't really support VLA here as the indexes depend on the VL.
> +	     VLA should really use widening instructions like widening
> +	     loads.  */
> +	  if (!GET_MODE_BITSIZE (mode_in).is_constant (&tot_in)
> +	      || !GET_MODE_BITSIZE (mode_out).is_constant (&tot_out)
> +	      || !TYPE_VECTOR_SUBPARTS (vectype_in).is_constant (&count)
> +	      || !TYPE_UNSIGNED (vectype_in)
> +	      || !targetm.vectorize.use_permute_for_promotion (vectype_in,
> +							       vectype_out))
> +	    {
> +	      failed_p = true;
> +	      break;
> +	    }
> +
> +	  unsigned steps = tot_out / bitsize_in;
> +	  tree zero = build_zero_cst (vectype_in);
> +
> +	  unsigned chunk_size
> +	    = exact_div (TYPE_VECTOR_SUBPARTS (vectype_in),
> +			 TYPE_VECTOR_SUBPARTS (vectype_out)).to_constant ();
> +	  unsigned step_size = chunk_size * (tot_out / tot_in);
> +	  unsigned nunits = tot_out / bitsize_in;
> +
> +	  vec_perm_builder sel (steps, 1, 1);
> +	  sel.quick_grow (steps);
> +
> +	  /* Flood fill with the out of range value first.  */
> +	  for (unsigned long i = 0; i < steps; ++i)
> +	    sel[i] = count;
> +
> +	  tree var;
> +	  tree elem_in = TREE_TYPE (vectype_in);
> +	  machine_mode elem_mode_in = TYPE_MODE (elem_in);
> +	  unsigned long idx = 0;
> +	  tree vc_in = get_related_vectype_for_scalar_type (elem_mode_in,
> +							    elem_in, nunits);
> +
> +	  for (unsigned long j = 0; j < chunk_size; j++)
> +	    {
> +	      if (WORDS_BIG_ENDIAN)
> +		for (int i = steps - 1; i >= 0; i -= step_size, idx++)
> +		  sel[i] = idx;
> +	      else
> +		for (int i = 0; i < (int)steps; i += step_size, idx++)
> +		  sel[i] = idx;
> +
> +	      vec_perm_indices indices (sel, 2, steps);
> +
> +	      tree perm_mask = vect_gen_perm_mask_checked (vc_in, indices);
> +	      auto vec_oprnd = make_ssa_name (vc_in);
> +	      auto new_stmt = gimple_build_assign (vec_oprnd, VEC_PERM_EXPR,
> +						   vop0, zero, perm_mask);
> +	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> +
> +	      tree intvect_out = unsigned_type_for (vectype_out);
> +	      var = make_ssa_name (intvect_out);
> +	      new_stmt = gimple_build_assign (var, build1 (VIEW_CONVERT_EXPR,
> +							   intvect_out,
> +							   vec_oprnd));
> +	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> +
> +	      gcc_assert (ch2.is_tree_code ());
> +
> +	      var = make_ssa_name (vectype_out);
> +	      if (ch2 == VIEW_CONVERT_EXPR)
> +		  new_stmt = gimple_build_assign (var,
> +						  build1 (VIEW_CONVERT_EXPR,
> +							  vectype_out,
> +							  vec_oprnd));
> +	      else
> +		  new_stmt = gimple_build_assign (var, (tree_code)ch2,
> +						  vec_oprnd);
> +
> +	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> +	      vec_tmp.safe_push (var);
> +	    }
> +	}
> +
> +      if (!failed_p)
> +	{
> +	  vec_oprnds0->release ();
> +	  *vec_oprnds0 = vec_tmp;
> +	  return;
> +	}
> +    }
> +
>    vec_tmp.create (vec_oprnds0->length () * 2);
>    FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
>      {
> @@ -5495,6 +5600,20 @@ vectorizable_conversion (vec_info *vinfo,
>  	  || GET_MODE_SIZE (lhs_mode) <= GET_MODE_SIZE (rhs_mode))
>  	goto unsupported;
>  
> +      /* Check to see if the target can use a permute to perform the zero
> +	 extension.  */
> +      intermediate_type = unsigned_type_for (vectype_out);
> +      if (TYPE_UNSIGNED (vectype_in)
> +	  && VECTOR_TYPE_P (intermediate_type)
> +	  && TYPE_VECTOR_SUBPARTS (intermediate_type).is_constant ()
> +	  && targetm.vectorize.use_permute_for_promotion (vectype_in,
> +							  intermediate_type))
> +	{
> +	  code1 = VEC_PERM_EXPR;
> +	  code2 = FLOAT_EXPR;
> +	  break;
> +	}
> +
>        fltsz = GET_MODE_SIZE (lhs_mode);
>        FOR_EACH_2XWIDER_MODE (rhs_mode_iter, rhs_mode)
>  	{
> @@ -9804,7 +9923,8 @@ vect_gen_perm_mask_any (tree vectype, const vec_perm_indices &sel)
>    tree mask_type;
>  
>    poly_uint64 nunits = sel.length ();
> -  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype)));
> +  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype))
> +	      || known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype) * 2));
>  
>    mask_type = build_vector_type (ssizetype, nunits);
>    return vec_perm_indices_to_tree (mask_type, sel);
> @@ -14397,8 +14517,20 @@ supportable_widening_operation (vec_info *vinfo,
>        break;
>  
>      CASE_CONVERT:
> -      c1 = VEC_UNPACK_LO_EXPR;
> -      c2 = VEC_UNPACK_HI_EXPR;
> +      {
> +	tree cvt_type = unsigned_type_for (vectype_out);
> +	if (TYPE_UNSIGNED (vectype_in)
> +	  && VECTOR_TYPE_P (cvt_type)
> +	  && TYPE_VECTOR_SUBPARTS (cvt_type).is_constant ()
> +	  && targetm.vectorize.use_permute_for_promotion (vectype_in, cvt_type))
> +	  {
> +	    *code1 = VEC_PERM_EXPR;
> +	    *code2 = VIEW_CONVERT_EXPR;
> +	    return true;
> +	  }
> +	c1 = VEC_UNPACK_LO_EXPR;
> +	c2 = VEC_UNPACK_HI_EXPR;
> +      }
>        break;
>  
>      case FLOAT_EXPR:
Tamar Christina Oct. 14, 2024, 6:44 p.m. UTC | #2
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, October 14, 2024 7:34 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; rguenther@suse.de
> Subject: Re: [PATCH 1/4]middle-end: support multi-step zero-extends using
> VEC_PERM_EXPR
> 
> Tamar Christina <tamar.christina@arm.com> writes:
> > [...]
> >
> > +@deftypefn {Target Hook} bool TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree @var{in_type}, const_tree @var{out_type})
> > +This hook returns true if the operation promoting @var{in_type} to
> > +@var{out_type} should be done as a vector permute.  If @var{out_type} is
> > +a signed type the operation will be done as the related unsigned type and
> > +converted to @var{out_type}.  If the target supports the needed permute,
> > +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is
> > +beneficial to do so, the hook should return true, else false should be returned.
> > +@end deftypefn
> 
> Just a review of the documentation, but: is a two-step process really
> necessary for signed out_types?  I thought it could be done directly,
> since it's in_type rather than out_type that determines the type of
> extension.

Thanks!

I think this is an indication the text is ambiguous.  The intention was to say
that if out_type is signed, we still keep the type as signed, but insert an
intermediate cast to (unsigned type(out_type)).

The optimization only looks at in_type as you correctly point out.
I think you're right in that the documentation is explaining too much of
how the optimization does the transform, rather than explaining just
the transform.
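
Roughly, the emitted gimple for a signed out_type is then (a sketch with
invented names; the permute stays in the unsigned element type and the
result is reinterpreted as the signed vector type at the end):

  perm_u = VEC_PERM_EXPR <in, { 0, ... }, sel>;
  tmp_u  = VIEW_CONVERT_EXPR<unsigned (out_vectype)>(perm_u);
  out    = VIEW_CONVERT_EXPR<out_vectype>(tmp_u);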

Would it be clearer if I just delete the

> > +If @var{out_type} is
> > +a signed type the operation will be done as the related unsigned type and
> > +converted to @var{out_type}.

part?

Thanks for raising this :)

Thanks,
Tamar
> 
> Thanks,
> Richard
> > [...]
Richard Biener Oct. 15, 2024, 8:54 a.m. UTC | #3
On Mon, 14 Oct 2024, Tamar Christina wrote:

> Hi All,
> 
> This patch series adds support for a target to do a direct conversion for zero
> extends using permutes.
> 
> To do this it uses a target hook, use_permute_for_promotion, which must be
> implemented by targets.  This hook is used to indicate:
> 
>  1. can a target do this for the given modes.

can_vec_perm_const_p?

>  2. is it profitable for the target to do it.

So you say the target can do it both ways, but both zip and tbl are
permute instructions, so I really fail to see the point, and why
the target itself doesn't choose to use tbl for unpack.

Is the intent in the end to have VEC_PERM in the IL rather than
VEC_UNPACK_* so it combines with other VEC_PERMs?

That said, I'm not against supporting VEC_PERM code gen from
unsigned promotion but I don't see why we should do this when
the target advertises VEC_UNPACK_* support or direct conversion
support?

Esp. with adding a "local" cost-related hook which cannot take
into account context.

>  3. can the target convert between various vector modes with a VIEW_CONVERT.

We have modes_tieable_p for this I think.

> [...]
Richard Sandiford Oct. 15, 2024, 9:05 a.m. UTC | #4
Tamar Christina <Tamar.Christina@arm.com> writes:
>> [...]
>> Just a review of the documentation, but: is a two-step process really
>> necessary for signed out_types?  I thought it could be done directly,
>> since it's in_type rather than out_type that determines the type of
>> extension.
>
> Thanks!
>
> I think this is an indication the text is ambiguous.  The intention was to say
> that if out_type is signed, we still keep the type as signed, but insert an
> intermediate cast to (unsigned type(out_type)).

Yeah, the documentation explained that well.

I was simply confused, sorry.  I was still thinking in terms of the
type requirements for conversions (where going directly from unsigned
to signed would be ok).  But of course, that isn't true for VEC_PERM_EXPR.

So ignore my earlier comment.

Thanks,
Richard
Tamar Christina Oct. 15, 2024, 9:48 a.m. UTC | #5
Hi,

Thanks for the look,

The 10/15/2024 09:54, Richard Biener wrote:
> On Mon, 14 Oct 2024, Tamar Christina wrote:
> 
> > Hi All,
> > 
> > This patch series adds support for a target to do a direct conversion for zero
> > extends using permutes.
> > 
> > To do this it uses a target hook, use_permute_for_promotion, which must be
> > implemented by targets.  This hook is used to indicate:
> > 
> >  1. can a target do this for the given modes.
> 
> can_vec_perm_const_p?
> 
> >  3. can the target convert between various vector modes with a VIEW_CONVERT.
> 
> We have modes_tieable_p for this I think.
> 

Yes, though the reason I didn't use either of them is that they report
a capability of the backend.  The hook, which is already backend-specific,
should answer these two questions itself.

I initially had these checks there, but they didn't seem to add value: for
promotions the masks depend only on the input and output modes, so they
really don't change.

When you have, say, a loop that does lots of conversions from char to int, it
seemed like a waste to retest the same permute constants over and over again.

I can add them back in if you prefer...

> >  2. is it profitable for the target to do it.
> 
> So you say the target can do both ways but both zip and tbl are
> permute instructions so I really fail to see the point and why
> the target itself doesn't choose to use tbl for unpack.
> 
> Is the intent in the end to have VEC_PERM in the IL rather than
> VEC_UNPACK_* so it combines with other VEC_PERMs?
> 

Yes, and this happens quite often, e.g. load permutes or lane shuffles etc.
The reason for exposing them as VEC_PERM was to trigger further optimizations.

If you remember the ticket about LOAD_LANES: with this optimization and an open
encoding of LOAD_LANES we stop using it in cases where there's a zero extend
after the LOAD_LANES, because then you're doing effectively two permutes and
the LOAD_LANES is no longer beneficial.  There are other examples, load and
replicate etc.
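
As a sketch of the kind of folding this enables, two stacked permutes such as

  tmp = VEC_PERM_EXPR <x, zero, sel1>;
  res = VEC_PERM_EXPR <tmp, y, sel2>;

can be collapsed into a single VEC_PERM_EXPR with a composed index vector by
the existing match.pd simplifications, whereas an opaque vec_unpack or a
convert optab in between would block that.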

> That said, I'm not against supporting VEC_PERM code gen from
> unsigned promotion but I don't see why we should do this when
> the target advertises VEC_UNPACK_* support or direct conversion
> support?
> 
> Esp. with adding a "local" cost related hook which cannot take
> into accout context.
> 

To summarize a long story:

  yes, I open-encode zero extends as permutes to allow further optimizations.  One could convert
  vec_unpacks to convert optabs and use that, but that is an opaque value that can't be further
  optimized.

  The hook isn't really a costing thing in the general sense.  It's literally just "do you want
  permutes, yes or no".  The reason it gets the modes is simply that I don't think a single-level
  extend is worth it, but I can just change it to never try this on more than one level.

I think there's a lot of merit in open-encoding zero extends, but one reason this is
beneficial on AArch64 for instance is that we can consume the zero register and rewrite the
indices to a single-register TBL.  Two-register TBLs are slower on some implementations.
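
For illustration: a one-register tbl returns 0 for any out-of-range index
(>= 16 for a 16-byte table), so with an index vector like

  { 0, 255, 255, 255, 255, 255, 255, 255, 1, 255, 255, 255, 255, 255, 255, 255 }

a single tbl both permutes and zero-fills, zero-extending two bytes into two
64-bit lanes without needing a second source register.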

Thanks,
Tamar

> > [ cover letter repeated from the original posting snipped ]
> > 
> > ---
> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > index 4deb3d2c283a2964972b94f434370a6f57ea816a..e8192590ac14005bf7cb5f731c16ee7eacb78143 100644
> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi
> > @@ -6480,6 +6480,15 @@ type @code{internal_fn}) should be considered expensive when the mask is
> >  all zeros.  GCC can then try to branch around the instruction instead.
> >  @end deftypefn
> >  
> > +@deftypefn {Target Hook} bool TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree @var{in_type}, const_tree @var{out_type})
> > +This hook returns true if the operation promoting @var{in_type} to
> > +@var{out_type} should be done as a vector permute.  If @var{out_type} is
> > +a signed type the operation will be done as the related unsigned type and
> > +converted to @var{out_type}.  If the target supports the needed permute,
> > +is able to convert unsigned(@var{out_type}) to @var{out_type}, and it is
> > +beneficial to do so, the hook should return true; otherwise return false.
> > +@end deftypefn
> > +
> >  @deftypefn {Target Hook} {class vector_costs *} TARGET_VECTORIZE_CREATE_COSTS (vec_info *@var{vinfo}, bool @var{costing_for_scalar})
> >  This hook should initialize target-specific data structures in preparation
> >  for modeling the costs of vectorizing a loop or basic block.  The default
> > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > index 9f147ccb95cc6d4e79cdf5b265666ad502492145..c007bc707372dd374e8effc52d29b76f5bc283a1 100644
> > --- a/gcc/doc/tm.texi.in
> > +++ b/gcc/doc/tm.texi.in
> > @@ -4303,6 +4303,8 @@ address;  but often a machine-dependent strategy can generate better code.
> >  
> >  @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> >  
> > +@hook TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> > +
> >  @hook TARGET_VECTORIZE_CREATE_COSTS
> >  
> >  @hook TARGET_VECTORIZE_BUILTIN_GATHER
> > diff --git a/gcc/target.def b/gcc/target.def
> > index b31550108883c5c3f5ffc7e46a1e8a7b839ebe83..58545d5ef4248da5850edec8f4db9f2636973598 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -2056,6 +2056,20 @@ all zeros.  GCC can then try to branch around the instruction instead.",
> >   (unsigned ifn),
> >   default_empty_mask_is_expensive)
> >  
> > +/* Function to say whether a target supports and prefers to use permutes for
> > +   zero extensions or truncates.  */
> > +DEFHOOK
> > +(use_permute_for_promotion,
> > + "This hook returns true if the operation promoting @var{in_type} to\n\
> > +@var{out_type} should be done as a vector permute.  If @var{out_type} is\n\
> > +a signed type the operation will be done as the related unsigned type and\n\
> > +converted to @var{out_type}.  If the target supports the needed permute,\n\
> > +is able to convert unsigned(@var{out_type}) to @var{out_type}, and it is\n\
> > +beneficial to do so, the hook should return true; otherwise return false.",
> > + bool,
> > + (const_tree in_type, const_tree out_type),
> > + default_use_permute_for_promotion)
> > +
> >  /* Target builtin that implements vector gather operation.  */
> >  DEFHOOK
> >  (builtin_gather,
> > diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> > index 2704d6008f14d2aa65671f002af886d3b802effa..723f8f4fda7808b6899f10f8b3fafad74d3c536f 100644
> > --- a/gcc/targhooks.h
> > +++ b/gcc/targhooks.h
> > @@ -124,6 +124,7 @@ extern opt_machine_mode default_vectorize_related_mode (machine_mode,
> >  extern opt_machine_mode default_get_mask_mode (machine_mode);
> >  extern bool default_empty_mask_is_expensive (unsigned);
> >  extern bool default_conditional_operation_is_expensive (unsigned);
> > +extern bool default_use_permute_for_promotion (const_tree, const_tree);
> >  extern vector_costs *default_vectorize_create_costs (vec_info *, bool);
> >  
> >  /* OpenACC hooks.  */
> > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> > index dc040df9fcd1182b62d83088ee7fb3a248c99f51..a487eab794fe9f1089ecb58fdfc881fdb19d28f3 100644
> > --- a/gcc/targhooks.cc
> > +++ b/gcc/targhooks.cc
> > @@ -1615,6 +1615,14 @@ default_conditional_operation_is_expensive (unsigned ifn)
> >    return ifn == IFN_MASK_STORE;
> >  }
> >  
> > +/* By default no targets prefer permutes over multi-step extension.  */
> > +
> > +bool
> > +default_use_permute_for_promotion (const_tree, const_tree)
> > +{
> > +  return false;
> > +}
> > +
> >  /* By default consider masked stores to be expensive.  */
> >  
> >  bool
> > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > index 4f6905f15417f90c6f36e1711a7a25071f0f507c..f2939655e4ec34111baa8894eaf769d29b1c5b82 100644
> > --- a/gcc/tree-vect-stmts.cc
> > +++ b/gcc/tree-vect-stmts.cc
> > @@ -5129,6 +5129,111 @@ vect_create_vectorized_promotion_stmts (vec_info *vinfo,
> >    gimple *new_stmt1, *new_stmt2;
> >    vec<tree> vec_tmp = vNULL;
> >  
> > +  /* If we're using a VEC_PERM_EXPR then we're widening to the final type in
> > +     one go.  */
> > +  if (ch1 == VEC_PERM_EXPR
> > +      && op_type == unary_op)
> > +    {
> > +      vec_tmp.create (vec_oprnds0->length () * 2);
> > +      bool failed_p = false;
> > +
> > +      /* Extending with a vec-perm requires 2 instructions per step.  */
> > +      FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > +	{
> > +	  tree vectype_in = TREE_TYPE (vop0);
> > +	  tree vectype_out = TREE_TYPE (vec_dest);
> > +	  machine_mode mode_in = TYPE_MODE (vectype_in);
> > +	  machine_mode mode_out = TYPE_MODE (vectype_out);
> > +	  unsigned bitsize_in = element_precision (vectype_in);
> > +	  unsigned tot_in, tot_out;
> > +	  unsigned HOST_WIDE_INT count;
> > +
> > +	  /* We can't really support VLA here as the indexes depend on the VL.
> > +	     VLA should really use widening instructions like widening
> > +	     loads.  */
> > +	  if (!GET_MODE_BITSIZE (mode_in).is_constant (&tot_in)
> > +	      || !GET_MODE_BITSIZE (mode_out).is_constant (&tot_out)
> > +	      || !TYPE_VECTOR_SUBPARTS (vectype_in).is_constant (&count)
> > +	      || !TYPE_UNSIGNED (vectype_in)
> > +	      || !targetm.vectorize.use_permute_for_promotion (vectype_in,
> > +							       vectype_out))
> > +	    {
> > +	      failed_p = true;
> > +	      break;
> > +	    }
> > +
> > +	  unsigned steps = tot_out / bitsize_in;
> > +	  tree zero = build_zero_cst (vectype_in);
> > +
> > +	  unsigned chunk_size
> > +	    = exact_div (TYPE_VECTOR_SUBPARTS (vectype_in),
> > +			 TYPE_VECTOR_SUBPARTS (vectype_out)).to_constant ();
> > +	  unsigned step_size = chunk_size * (tot_out / tot_in);
> > +	  unsigned nunits = tot_out / bitsize_in;
> > +
> > +	  vec_perm_builder sel (steps, 1, 1);
> > +	  sel.quick_grow (steps);
> > +
> > +	  /* Flood fill with the out of range value first.  */
> > +	  for (unsigned long i = 0; i < steps; ++i)
> > +	    sel[i] = count;
> > +
> > +	  tree var;
> > +	  tree elem_in = TREE_TYPE (vectype_in);
> > +	  machine_mode elem_mode_in = TYPE_MODE (elem_in);
> > +	  unsigned long idx = 0;
> > +	  tree vc_in = get_related_vectype_for_scalar_type (elem_mode_in,
> > +							    elem_in, nunits);
> > +
> > +	  for (unsigned long j = 0; j < chunk_size; j++)
> > +	    {
> > +	      if (WORDS_BIG_ENDIAN)
> > +		for (int i = steps - 1; i >= 0; i -= step_size, idx++)
> > +		  sel[i] = idx;
> > +	      else
> > +		for (int i = 0; i < (int)steps; i += step_size, idx++)
> > +		  sel[i] = idx;
> > +
> > +	      vec_perm_indices indices (sel, 2, steps);
> > +
> > +	      tree perm_mask = vect_gen_perm_mask_checked (vc_in, indices);
> > +	      auto vec_oprnd = make_ssa_name (vc_in);
> > +	      auto new_stmt = gimple_build_assign (vec_oprnd, VEC_PERM_EXPR,
> > +						   vop0, zero, perm_mask);
> > +	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > +
> > +	      tree intvect_out = unsigned_type_for (vectype_out);
> > +	      var = make_ssa_name (intvect_out);
> > +	      new_stmt = gimple_build_assign (var, build1 (VIEW_CONVERT_EXPR,
> > +							   intvect_out,
> > +							   vec_oprnd));
> > +	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > +
> > +	      gcc_assert (ch2.is_tree_code ());
> > +
> > +	      var = make_ssa_name (vectype_out);
> > +	      if (ch2 == VIEW_CONVERT_EXPR)
> > +		  new_stmt = gimple_build_assign (var,
> > +						  build1 (VIEW_CONVERT_EXPR,
> > +							  vectype_out,
> > +							  vec_oprnd));
> > +	      else
> > +		  new_stmt = gimple_build_assign (var, (tree_code)ch2,
> > +						  vec_oprnd);
> > +
> > +	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > +	      vec_tmp.safe_push (var);
> > +	    }
> > +	}
> > +
> > +      if (!failed_p)
> > +	{
> > +	  vec_oprnds0->release ();
> > +	  *vec_oprnds0 = vec_tmp;
> > +	  return;
> > +	}
> > +    }
> > +
> >    vec_tmp.create (vec_oprnds0->length () * 2);
> >    FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> >      {
> > @@ -5495,6 +5600,20 @@ vectorizable_conversion (vec_info *vinfo,
> >  	  || GET_MODE_SIZE (lhs_mode) <= GET_MODE_SIZE (rhs_mode))
> >  	goto unsupported;
> >  
> > +      /* Check to see if the target can use a permute to perform the zero
> > +	 extension.  */
> > +      intermediate_type = unsigned_type_for (vectype_out);
> > +      if (TYPE_UNSIGNED (vectype_in)
> > +	  && VECTOR_TYPE_P (intermediate_type)
> > +	  && TYPE_VECTOR_SUBPARTS (intermediate_type).is_constant ()
> > +	  && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > +							  intermediate_type))
> > +	{
> > +	  code1 = VEC_PERM_EXPR;
> > +	  code2 = FLOAT_EXPR;
> > +	  break;
> > +	}
> > +
> >        fltsz = GET_MODE_SIZE (lhs_mode);
> >        FOR_EACH_2XWIDER_MODE (rhs_mode_iter, rhs_mode)
> >  	{
> > @@ -9804,7 +9923,8 @@ vect_gen_perm_mask_any (tree vectype, const vec_perm_indices &sel)
> >    tree mask_type;
> >  
> >    poly_uint64 nunits = sel.length ();
> > -  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype)));
> > +  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype))
> > +	      || known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype) * 2));
> >  
> >    mask_type = build_vector_type (ssizetype, nunits);
> >    return vec_perm_indices_to_tree (mask_type, sel);
> > @@ -14397,8 +14517,20 @@ supportable_widening_operation (vec_info *vinfo,
> >        break;
> >  
> >      CASE_CONVERT:
> > -      c1 = VEC_UNPACK_LO_EXPR;
> > -      c2 = VEC_UNPACK_HI_EXPR;
> > +      {
> > +	tree cvt_type = unsigned_type_for (vectype_out);
> > +	if (TYPE_UNSIGNED (vectype_in)
> > +	  && VECTOR_TYPE_P (cvt_type)
> > +	  && TYPE_VECTOR_SUBPARTS (cvt_type).is_constant ()
> > +	  && targetm.vectorize.use_permute_for_promotion (vectype_in, cvt_type))
> > +	  {
> > +	    *code1 = VEC_PERM_EXPR;
> > +	    *code2 = VIEW_CONVERT_EXPR;
> > +	    return true;
> > +	  }
> > +	c1 = VEC_UNPACK_LO_EXPR;
> > +	c2 = VEC_UNPACK_HI_EXPR;
> > +      }
> >        break;
> >  
> >      case FLOAT_EXPR:
> 
--
Richard Biener Oct. 15, 2024, 11:12 a.m. UTC | #6
On Tue, 15 Oct 2024, Tamar Christina wrote:

> Hi,
> 
> Thanks for the look,
> 
> The 10/15/2024 09:54, Richard Biener wrote:
> > On Mon, 14 Oct 2024, Tamar Christina wrote:
> > 
> > > Hi All,
> > > 
> > > This patch series adds support for a target to do a direct conversion for zero
> > > extends using permutes.
> > > 
> > > To do this it uses a target hook use_permute_for_promotion which must be
> > > implemented by targets.  This hook is used to indicate:
> > > 
> > >  1. can a target do this for the given modes.
> > 
> > can_vec_perm_const_p?
> > 
> > >  3. can the target convert between various vector modes with a VIEW_CONVERT.
> > 
> > We have modes_tieable_p for this I think.
> > 
> 
> Yes, though the reason I didn't use either of them was because they are reporting
> a capability of the backend.  In which case the hook, which is already backend
> specific, should answer these two.
> 
> I initially had these checks there, but they didn't seem to add value: for
> promotions the masks depend only on the input and output modes, so they really
> don't change.
> 
> When you have, say, a loop that does lots of conversions from char to int, it seemed
> wasteful to retest the same permute constants over and over again.
> 
> I can add them back in if you prefer...
> 
> > >  2. is it profitable for the target to do it.
> > 
> > So you say the target can do it both ways, but both zip and tbl are
> > permute instructions, so I really fail to see the point and why
> > the target itself doesn't choose to use tbl for unpack.
> > 
> > Is the intent in the end to have VEC_PERM in the IL rather than
> > VEC_UNPACK_* so it combines with other VEC_PERMs?
> > 
> 
> Yes, and this happens quite often, e.g. load permutes or lane shuffles etc.
> The reason for exposing them as VEC_PERM was to trigger further optimizations.
> 
> If you remember the ticket about LOAD_LANES: with this optimization and an open
> encoding of LOAD_LANES we stop using it in cases where there's a zero extend after
> the LOAD_LANES, because then you're effectively doing two permutes and the
> LOAD_LANES is no longer beneficial.  There are other examples, load and replicate, etc.
> 
> > That said, I'm not against supporting VEC_PERM code gen from
> > unsigned promotion but I don't see why we should do this when
> > the target advertises VEC_UNPACK_* support or direct conversion
> > support?
> > 
> > Esp. with adding a "local" cost related hook which cannot take
> > into account context.
> > 
> 
> To summarize a long story:
> 
>   yes, I open-code zero extends as permutes to allow further optimizations.  One could convert
>   vec_unpacks to convert optabs and use that, but that is an opaque value that can't be further
>   optimized.
> 
>   The hook isn't really a costing thing in the general sense.  It's literally just "do you want
>   permutes, yes or no".  The reason it gets the modes is simply that I don't think a single-level
>   extend is worth it, but I can just change it to never try to do this on more than one level.

When you mention LOAD_LANES, note we do not expose "permutes" in them on
GIMPLE either, so why should we for VEC_UNPACK_*?

At what level are the simplifications you see happening then?

I do realize we have two ways of expressing zero-extending widenings
(also truncations btw) and that's always bad - so we could decide to
_always_ use VEC_PERMs as the canonical representation because those
combine more easily.  And either match VEC_PERMs back to vec_unpack
at RTL expansion time or require targets to expose those as constant
vec_perms as well.  There are targets like GCN where you can't do
unpacking with permutes of course, so we can't do away with them
(we could possibly force those targets to expose widening/truncation
solely with [us]ext and trunc patterns of course).
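
For reference, the two forms for, say, a V8HI -> V4SI zero extend would be
(little-endian sketch, modulo the endian-dependent lo/hi mapping):

  lo = VEC_UNPACK_LO_EXPR <x>;
  hi = VEC_UNPACK_HI_EXPR <x>;

versus

  lo = VIEW_CONVERT <v4si> (VEC_PERM_EXPR <x, { 0, ... }, { 0, 8, 1, 8, 2, 8, 3, 8 }>);
  hi = VIEW_CONVERT <v4si> (VEC_PERM_EXPR <x, { 0, ... }, { 4, 8, 5, 8, 6, 8, 7, 8 }>);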

> I think there's a lot of merit in open-encoding zero extends, but one reason this is
> beneficial on AArch64 for instance is that we can consume the zero register and rewrite the
> indices to a single-register TBL.  Two-register TBLs are slower on some implementations.

But this latter transform can be done by optimizing the RTL?

Richard.

> Thanks,
> Tamar
> 
> > > [ quoted patch snipped ]
Tamar Christina Oct. 15, 2024, 12:03 p.m. UTC | #7
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, October 15, 2024 12:13 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> Subject: Re: [PATCH 1/4]middle-end: support multi-step zero-extends using
> VEC_PERM_EXPR
> 
> On Tue, 15 Oct 2024, Tamar Christina wrote:
> 
> > Hi,
> >
> > Thanks for the look,
> >
> > The 10/15/2024 09:54, Richard Biener wrote:
> > > On Mon, 14 Oct 2024, Tamar Christina wrote:
> > >
> > > > Hi All,
> > > >
> > > > This patch series adds support for a target to do a direct conversion for zero
> > > > extends using permutes.
> > > >
> > > > To do this it uses a target hook use_permute_for_promotion which must be
> > > > implemented by targets.  This hook is used to indicate:
> > > >
> > > >  1. can a target do this for the given modes.
> > >
> > > can_vec_perm_const_p?
> > >
> > > >  3. can the target convert between various vector modes with a VIEW_CONVERT.
> > >
> > > We have modes_tieable_p for this I think.
> > >
> >
> > Yes, though the reason I didn't use either of them was because they are reporting
> > a capability of the backend.  In which case the hook, which is already backend
> > specific, should answer these two.
> >
> > I initially had these checks there, but they didn't seem to add value: for
> > promotions the masks depend only on the input and output modes, so they really
> > don't change.
> >
> > When you have, say, a loop that does lots of conversions from char to int, it seemed
> > wasteful to retest the same permute constants over and over again.
> >
> > I can add them back in if you prefer...
> >
> > > >  2. is it profitable for the target to do it.
> > >
> > > So you say the target can do it both ways, but both zip and tbl are
> > > permute instructions, so I really fail to see the point and why
> > > the target itself doesn't choose to use tbl for unpack.
> > >
> > > Is the intent in the end to have VEC_PERM in the IL rather than
> > > VEC_UNPACK_* so it combines with other VEC_PERMs?
> > >
> >
> > Yes, and this happens quite often, e.g. load permutes or lane shuffles etc.
> > The reason for exposing them as VEC_PERM was to trigger further optimizations.
> >
> > If you remember the ticket about LOAD_LANES: with this optimization and an open
> > encoding of LOAD_LANES we stop using it in cases where there's a zero extend after
> > the LOAD_LANES, because then you're effectively doing two permutes and the
> > LOAD_LANES is no longer beneficial.  There are other examples, load and replicate, etc.
> >
> > > That said, I'm not against supporting VEC_PERM code gen from
> > > unsigned promotion but I don't see why we should do this when
> > > the target advertises VEC_UNPACK_* support or direct conversion
> > > support?
> > >
> > > Esp. with adding a "local" cost related hook which cannot take
> > > into account context.
> > >
> >
> > To summarize a long story:
> >
> >   yes, I open-code zero extends as permutes to allow further optimizations.  One could convert
> >   vec_unpacks to convert optabs and use that, but that is an opaque value that can't be further
> >   optimized.
> >
> >   The hook isn't really a costing thing in the general sense.  It's literally just "do you want
> >   permutes, yes or no".  The reason it gets the modes is simply that I don't think a single-level
> >   extend is worth it, but I can just change it to never try to do this on more than one level.
> 
> When you mention LOAD_LANES, note we do not expose "permutes" in them on
> GIMPLE either, so why should we for VEC_UNPACK_*?

I think not exposing LOAD_LANES in GIMPLE *is* an actual mistake that I hope to correct in GCC-16.
Or at least the time we pick LOAD_LANES is too early.  So I don't think pointing to this is a convincing
argument.  It's only VLA that I think needs the IL, because you have to mask the group of operations and
it may be hard to reconcile that later on.

> At what level are the simplifications you see happening then?

Well, they are currently happening outside of the vectorizer passes themselves,
more specifically in this case because VN runs match simplifications.

If the concern is that that's late, I can lift it to a pattern, I suppose.
I didn't use a pattern because similar changes in this area always just happened
at codegen.
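
For example, roughly (hypothetical GIMPLE, the fold is the nested-VEC_PERM
simplification in match.pd):

  t = VEC_PERM_EXPR <a, a, sel1>;           /* even-lane extract  */
  r = VEC_PERM_EXPR <t, { 0, ... }, sel2>;  /* the zero extend    */

becomes

  r = VEC_PERM_EXPR <a, { 0, ... }, sel3>;  /* sel3 = sel1 o sel2 */

which is the kind of merge VN's match simplifications give us.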

> 
> I do realize we have two ways of expressing zero-extending widenings
> (also truncations btw) and that's always bad - so we could decide to
> _always_ use VEC_PERMs as the canonical representation because those
> combine more easily.  And either match VEC_PERMs back to vec_unpack
> at RTL expansion time or require targets to expose those as constant
> vec_perms as well.  There are targets like GCN where you can't do
> unpacking with permutes of course, so we can't do away with them
> (we could possibly force those targets to expose widening/truncation
> solely with [us]ext and trunc patterns of course).

Ok, so your objection is that you don't want to have a different way of doing
a single-step zero extend vs. a multi-step zero extend.

At the moment my patch doesn't care: if you return an unconditional true
then for that target you get VEC_PERM for everything and the vectorizer
won't ever spit out VEC_UNPACKU.

You're arguing that this should be the default, even if the target does not
support it, and then we have to somehow undo it during vec_lowering?

Otherwise, if the target doesn't support the permute it'll be scalarized...

I guess sure...  But then...

> There are targets like GCN where you can't do
> unpacking with permutes of course, so we can't do away with them
> (we could possibly force those targets to expose widening/truncation
> solely with [us]ext and trunc patterns of course).

I guess if can_vec_perm_const_p fails we can undo it... But it feels like
we lose an element of preference here.  A target *could* do the permute,
but not do it efficiently.

> 
> > I think there's a lot of merit in open-encoding zero extends, but one reason this is
> > beneficial on AArch64 for instance is that we can consume the zero register and rewrite the
> > indices to a single-register TBL.  Two-register TBLs are slower on some implementations.
> 
> But this latter transform can be done by optimizing the RTL?

Sure, and we do so today.  That's why the example output in the cover letter
has only one input register.  The point of this blurb was more to point out that
whether the optimization is beneficial may depend on a specific uarch, and as such
I believe a certain element of target buy-in is needed.

If you want me to do it unconditionally, sure, I can do that...

If so, can I get a review on the other patches anyway?  They are mostly
independent, with only some dependencies on the output of the tests.

Thanks,
Tamar

> 
> Richard.
> 
> > Thanks,
> > Tamar
> > [ quoted patch and signatures snipped ]
Richard Biener Oct. 15, 2024, 12:19 p.m. UTC | #8
On Tue, 15 Oct 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Tuesday, October 15, 2024 12:13 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > Subject: Re: [PATCH 1/4]middle-end: support multi-step zero-extends using
> > VEC_PERM_EXPR
> > 
> > On Tue, 15 Oct 2024, Tamar Christina wrote:
> > 
> > > Hi,
> > >
> > > Thanks for the look,
> > >
> > > The 10/15/2024 09:54, Richard Biener wrote:
> > > > On Mon, 14 Oct 2024, Tamar Christina wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > This patch series adds support for a target to do a direct convertion for zero
> > > > > extends using permutes.
> > > > >
> > > > > To do this it uses a target hook use_permute_for_promotio which must be
> > > > > implemented by targets.  This hook is used to indicate:
> > > > >
> > > > >  1. can a target do this for the given modes.
> > > >
> > > > can_vec_perm_const_p?
> > > >
> > > > >  3. can the target convert between various vector modes with a
> > VIEW_CONVERT.
> > > >
> > > > We have modes_tieable_p for this I think.
> > > >
> > >
> > > Yes, though the reason I didn't use either of them was because they are reporting
> > > a capability of the backend.  In which case the hook, which is already backend
> > > specific already should answer these two.
> > >
> > > I initially had these checks there, but they didn't seem to add value, for
> > > promotions the masks are only dependent on the input and output modes. So
> > they really
> > > don't change.
> > >
> > > When you have say a loop that does lots of conversions from say char to int, it
> > seemed
> > > like a waste to retest the same permute constants over and over again.
> > >
> > > I can add them back in if you prefer...
> > >
> > > > >  2. is it profitable for the target to do it.
> > > >
> > > > So you say the target can do both ways but both zip and tbl are
> > > > permute instructions so I really fail to see the point and why
> > > > the target itself doesn't choose to use tbl for unpack.
> > > >
> > > > Is the intent in the end to have VEC_PERM in the IL rather than
> > > > VEC_UNPACK_* so it combines with other VEC_PERMs?
> > > >
> > >
> > > Yes, and this happens quite often, e.g. load permutes or lane shuffles etc.
> > > The reason for exposing them as VEC_PERM was to trigger further optimizations.
> > >
> > > If you remember the ticket about LOAD_LANES, with this optimization and an
> > open
> > > encoding of LOAD_LANES we stop using it in cases where there's a zero extend
> > after
> > > the LOAD_LANES, because then you're doing effectively two permutes and the
> > LOAD_LANES
> > > is no longer beneficial. There are other examples, load and replicate etc.
> > >
> > > > That said, I'm not against supporting VEC_PERM code gen from
> > > > unsigned promotion but I don't see why we should do this when
> > > > the target advertises VEC_UNPACK_* support or direct conversion
> > > > support?
> > > >
> > > > Esp. with adding a "local" cost related hook which cannot take
> > > > into account context.
> > > >
> > >
> > > To summarize a long story:
> > >
> > >   yes I open encode zero extends as permutes to allow further optimizations.
> > One could convert
> > >   vec_unpacks to convert optabs and use that, but that is an opaque value that
> > can't be further
> > >   optimized.
> > >
> > >   The hook isn't really a costing thing in the general sense. It's literally just "do you
> > want
> > >   permutes yes or no".  The reason it gets the modes is simply that I don't think a
> > single level
> > >   extend is worth it, but I can just change it to never try to do this on more than
> > one level.
> > 
> > When you mention LOAD_LANES we do not expose "permutes" in them on
> > GIMPLE
> > either, so why should we for VEC_UNPACK_*.
> 
> I think not exposing LOAD_LANES in GIMPLE *is* an actual mistake that I hope to correct in GCC-16.
> Or at least the time we pick LOAD_LANES is too early.  So I don't think pointing to this is a convincing
> argument.  It's only VLA that I think needs the IL because you have to mask the group of operations and
> may be hard to reconcile that later on.
> 
> > At what level are the simplifications you see happening then?
> 
> Well, they are currently happening outside of the vectorizer passes itself,
> more specifically in this case because VN runs match simplifications.

But match doesn't simplify permutes against .LOAD_LANES?  So it's about
"other" permutes (from loads) that get simplified?

> If the concern is that that's late, I can lift it to a pattern I suppose.
> I didn't use a pattern because similar changes in this area always just happened
> at codegen.

I was wondering how this plays with my idea of having us "lower"
or rather "code generate" to an intermediate SLP representation where
we split SLP groups on vector boundaries and are then free to
perform permute optimizations that need to know the vector type.

That said - match could as well combine VEC_UNPACK_* with a VEC_PERMUTE
with the catch that this duplicates patterns for the 
VEC_UNPACK_*/VEC_PERMUTE duality we have.
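
To make the duality concrete: the lo-half zero-extend that
VEC_UNPACK_LO_EXPR performs on an unsigned V16QI input can be built as a
permute against a zero vector plus a view-convert, using the same
helpers the patch already calls.  A minimal sketch, assuming
little-endian element order; v16qi_type, v8hi_type, vop0, tmp and res
are illustrative names, not from the patch:

  /* Selector { 0,16, 1,17, ..., 7,23 }: interleave the low eight input
     bytes with bytes from the zero vector (indices 16 and up select the
     second operand).  Encoded as two interleaved series of step 1.  */
  vec_perm_builder sel (16, 2, 3);
  for (int i = 0; i < 3; i++)
    {
      sel.quick_push (i);	/* input byte i.  */
      sel.quick_push (16 + i);	/* a zero byte.  */
    }
  vec_perm_indices indices (sel, 2, 16);
  tree mask = vect_gen_perm_mask_checked (v16qi_type, indices);
  tree zero = build_zero_cst (v16qi_type);
  tree tmp = make_ssa_name (v16qi_type);
  tree res = make_ssa_name (v8hi_type);
  gimple *g1 = gimple_build_assign (tmp, VEC_PERM_EXPR, vop0, zero, mask);
  gimple *g2 = gimple_build_assign (res, build1 (VIEW_CONVERT_EXPR,
						 v8hi_type, tmp));

Which of the two forms is canonical is exactly the question.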

> > 
> > I do realize we have two ways of expressing zero-extending widenings
> > (also truncations btw) and that's always bad - so we could decide to
> > _always_ use VEC_PERMs as the canonical representation because those
> > combine more easily.  And either match VEC_PERMs back to vec_unpack
> > at RTL expansion time or require targets to expose those as constant
> > vec_perms as well.  There are targets like GCN where you can't do
> > unpacking with permutes of course, so we can't do away with them
> > (we could possibly force those targets to expose widening/truncation
> > solely with [us]ext and trunc patterns of course).
> 
> Ok, so your objection is that you don't want to have a different way of doing
> a single step zero extend vs a multi-step zero extend.

My objection is mainly that we do this based on a target decision and
without immediate effect on the vector loop and its costing - it's not
that we are then able to see we can combine the permutes with others,
say in SLP permute optimization.

> At the moment my patch doesn't care: if you return an unconditional true
> then for that target you get VEC_PERM for everything and the vectorizer
> won't ever spit out VEC_UNPACKU.
> 
> You're arguing that this should be the default, even if the target does not
> support it and then we have to somehow undo it during vec_lowering?

I argued that we possibly should do this by default and all targets
that can vec_unpack but not vec_perm_const with such a permute can
either implement the missing vec_perm_const or they are of the kind
that cannot use a permute for this (!modes_tieable_p).
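
A minimal sketch of that gate using only existing queries (hypothetical
helper name, not proposed code; INDICES would be the zero-extending
permute the vectorizer builds anyway):

  /* Return true if VECTYPE_IN can be zero-extended to VECTYPE_OUT via a
     constant permute against zero followed by a VIEW_CONVERT_EXPR.  */
  static bool
  permute_promotion_supported_p (tree vectype_in, tree vectype_out,
				 const vec_perm_indices &indices)
  {
    machine_mode in_mode = TYPE_MODE (vectype_in);
    machine_mode out_mode = TYPE_MODE (vectype_out);
    return (TYPE_UNSIGNED (vectype_in)
	    /* The target can emit the constant permute natively...  */
	    && can_vec_perm_const_p (in_mode, in_mode, indices, false)
	    /* ...and the permuted result can be reinterpreted in the
	       wider mode; this is what fails on targets like GCN.  */
	    && targetm.modes_tieable_p (in_mode, out_mode));
  }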

> Otherwise if the target doesn't support the permute it'll be scalarized..
> 
> I guess sure..  But then...
> 
> > There are targets like GCN where you can't do
> > unpacking with permutes of course, so we can't do away with them
> > (we could possibly force those targets to expose widening/truncation
> > solely with [us]ext and trunc patterns of course).
> 
> I guess if can_vec_perm_const_p fails we can undo it.. But it feels like
> we lose an element of preference here.  A target *could* do the permute,
> but not do it efficiently.

It can do it the same way it would do the vec_unpack?  Or what am I
missing here?  Does your permute not exactly replicate vec_unpack_lo/hi?

> > 
> > > I think there's a lot of merit in open-encoding zero extends, but one
> > reason this is
> > > beneficial on AArch64 for instance is that we can consume the zero register and
> > rewrite the
> > > indices to a single-register TBL.  Two-register TBLs are slower on some
> > implementations.
> > 
> > But this latter fact can be done by optimizing the RTL?
> 
> Sure, and we do so today.  That's why the example output in the cover letter
> has only one input register.  The point of this blurb was more to point out that
> the optimization being beneficial may depend on a specific uarch, and as such
> I believe that a certain element of target buy-in is needed.

If it's dependent on uarch then even more so - why not simply
expand vec_unpack as tbl then?

> If you want me to do it unconditionally sure, I can do that...
> 
> If so can I get a review on the other patches anyway? They are
> mostly independent, with only some dependencies on the output of the
> tests.

Sure, I'm behind stuff - sorry.

Richard.

> Thanks,
> Tamar
> 
> > 
> > Richard.
> > 
> > > Thanks,
> > > Tamar
> > >
> > > > > Using permutations has a big benefit for multi-step zero extensions because
> > they
> > > > > both reduce the number of needed instructions, but also increase throughput
> > as
> > > > > the dependency chain is removed.
> > > > >
> > > > > Concretely on AArch64 this changes:
> > > > >
> > > > > void test4(unsigned char *x, long long *y, int n) {
> > > > >     for(int i = 0; i < n; i++) {
> > > > >         y[i] = x[i];
> > > > >     }
> > > > > }
> > > > >
> > > > > from generating:
> > > > >
> > > > > .L4:
> > > > >         ldr     q30, [x4], 16
> > > > >         add     x3, x3, 128
> > > > >         zip1    v1.16b, v30.16b, v31.16b
> > > > >         zip2    v30.16b, v30.16b, v31.16b
> > > > >         zip1    v2.8h, v1.8h, v31.8h
> > > > >         zip1    v0.8h, v30.8h, v31.8h
> > > > >         zip2    v1.8h, v1.8h, v31.8h
> > > > >         zip2    v30.8h, v30.8h, v31.8h
> > > > >         zip1    v26.4s, v2.4s, v31.4s
> > > > >         zip1    v29.4s, v0.4s, v31.4s
> > > > >         zip1    v28.4s, v1.4s, v31.4s
> > > > >         zip1    v27.4s, v30.4s, v31.4s
> > > > >         zip2    v2.4s, v2.4s, v31.4s
> > > > >         zip2    v0.4s, v0.4s, v31.4s
> > > > >         zip2    v1.4s, v1.4s, v31.4s
> > > > >         zip2    v30.4s, v30.4s, v31.4s
> > > > >         stp     q26, q2, [x3, -128]
> > > > >         stp     q28, q1, [x3, -96]
> > > > >         stp     q29, q0, [x3, -64]
> > > > >         stp     q27, q30, [x3, -32]
> > > > >         cmp     x4, x5
> > > > >         bne     .L4
> > > > >
> > > > > and instead we get:
> > > > >
> > > > > .L4:
> > > > >         add     x3, x3, 128
> > > > >         ldr     q23, [x4], 16
> > > > >         tbl     v5.16b, {v23.16b}, v31.16b
> > > > >         tbl     v4.16b, {v23.16b}, v30.16b
> > > > >         tbl     v3.16b, {v23.16b}, v29.16b
> > > > >         tbl     v2.16b, {v23.16b}, v28.16b
> > > > >         tbl     v1.16b, {v23.16b}, v27.16b
> > > > >         tbl     v0.16b, {v23.16b}, v26.16b
> > > > >         tbl     v22.16b, {v23.16b}, v25.16b
> > > > >         tbl     v23.16b, {v23.16b}, v24.16b
> > > > >         stp     q5, q4, [x3, -128]
> > > > >         stp     q3, q2, [x3, -96]
> > > > >         stp     q1, q0, [x3, -64]
> > > > >         stp     q22, q23, [x3, -32]
> > > > >         cmp     x4, x5
> > > > >         bne     .L4
> > > > >
> > > > > Tests are added in the AArch64 patch introducing the hook.  The testsuite also
> > > > > already had about 800 runtime tests that get affected by this.
> > > > >
> > > > > Bootstrapped Regtested on aarch64-none-linux-gnu, arm-none-linux-
> > gnueabihf,
> > > > > x86_64-pc-linux-gnu -m32, -m64 and no issues.
> > > > >
> > > > > Ok for master?
> > > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > > 	* target.def (use_permute_for_promotion): New.
> > > > > 	* doc/tm.texi.in: Document it.
> > > > > 	* doc/tm.texi: Regenerate.
> > > > > 	* targhooks.cc (default_use_permute_for_promotion): New.
> > > > > 	* targhooks.h (default_use_permute_for_promotion): New.
> > > > > 	(vectorizable_conversion): Support direct conversion with permute.
> > > > > 	* tree-vect-stmts.cc (vect_create_vectorized_promotion_stmts): Likewise.
> > > > > 	(supportable_widening_operation): Likewise.
> > > > > 	(vect_gen_perm_mask_any): Allow vector permutes where input registers
> > > > > 	are half the width of the result per the GCC 14 relaxation of
> > > > > 	VEC_PERM_EXPR.
> > > > >
> > > > > ---
> > > > > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > > > > index
> > 4deb3d2c283a2964972b94f434370a6f57ea816a..e8192590ac14005bf7cb5f73
> > 1c16ee7eacb78143 100644
> > > > > --- a/gcc/doc/tm.texi
> > > > > +++ b/gcc/doc/tm.texi
> > > > > @@ -6480,6 +6480,15 @@ type @code{internal_fn}) should be considered
> > expensive when the mask is
> > > > >  all zeros.  GCC can then try to branch around the instruction instead.
> > > > >  @end deftypefn
> > > > >
> > > > > +@deftypefn {Target Hook} bool
> > TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree
> > @var{in_type}, const_tree @var{out_type})
> > > > > +This hook returns true if the operation promoting @var{in_type} to
> > > > > +@var{out_type} should be done as a vector permute.  If @var{out_type} is
> > > > > +a signed type the operation will be done as the related unsigned type and
> > > > > +converted to @var{out_type}.  If the target supports the needed permute,
> > > > > +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is
> > > > > +beneficial to do so, the hook should return true, else false should be returned.
> > > > > +@end deftypefn
> > > > > +
> > > > >  @deftypefn {Target Hook} {class vector_costs *}
> > TARGET_VECTORIZE_CREATE_COSTS (vec_info *@var{vinfo}, bool
> > @var{costing_for_scalar})
> > > > >  This hook should initialize target-specific data structures in preparation
> > > > >  for modeling the costs of vectorizing a loop or basic block.  The default
> > > > > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > > > > index
> > 9f147ccb95cc6d4e79cdf5b265666ad502492145..c007bc707372dd374e8effc52
> > d29b76f5bc283a1 100644
> > > > > --- a/gcc/doc/tm.texi.in
> > > > > +++ b/gcc/doc/tm.texi.in
> > > > > @@ -4303,6 +4303,8 @@ address;  but often a machine-dependent strategy
> > can generate better code.
> > > > >
> > > > >  @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
> > > > >
> > > > > +@hook TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
> > > > > +
> > > > >  @hook TARGET_VECTORIZE_CREATE_COSTS
> > > > >
> > > > >  @hook TARGET_VECTORIZE_BUILTIN_GATHER
> > > > > diff --git a/gcc/target.def b/gcc/target.def
> > > > > index
> > b31550108883c5c3f5ffc7e46a1e8a7b839ebe83..58545d5ef4248da5850edec8f
> > 4db9f2636973598 100644
> > > > > --- a/gcc/target.def
> > > > > +++ b/gcc/target.def
> > > > > @@ -2056,6 +2056,20 @@ all zeros.  GCC can then try to branch around the
> > instruction instead.",
> > > > >   (unsigned ifn),
> > > > >   default_empty_mask_is_expensive)
> > > > >
> > > > > +/* Function to say whether a target supports and prefers to use permutes
> > for
> > > > > +   zero extensions or truncates.  */
> > > > > +DEFHOOK
> > > > > +(use_permute_for_promotion,
> > > > > + "This hook returns true if the operation promoting @var{in_type} to\n\
> > > > > +@var{out_type} should be done as a vector permute.  If @var{out_type}
> > is\n\
> > > > > +a signed type the operation will be done as the related unsigned type and\n\
> > > > > +converted to @var{out_type}.  If the target supports the needed
> > permute,\n\
> > > > > +is able to convert unsigned(@var{out_type}) to @var{out_type} and it is\n\
> > > > > +beneficial to the hook should return true, else false should be returned.",
> > > > > + bool,
> > > > > + (const_tree in_type, const_tree out_type),
> > > > > + default_use_permute_for_promotion)
> > > > > +
> > > > >  /* Target builtin that implements vector gather operation.  */
> > > > >  DEFHOOK
> > > > >  (builtin_gather,
> > > > > diff --git a/gcc/targhooks.h b/gcc/targhooks.h
> > > > > index
> > 2704d6008f14d2aa65671f002af886d3b802effa..723f8f4fda7808b6899f10f8b
> > 3fafad74d3c536f 100644
> > > > > --- a/gcc/targhooks.h
> > > > > +++ b/gcc/targhooks.h
> > > > > @@ -124,6 +124,7 @@ extern opt_machine_mode
> > default_vectorize_related_mode (machine_mode,
> > > > >  extern opt_machine_mode default_get_mask_mode (machine_mode);
> > > > >  extern bool default_empty_mask_is_expensive (unsigned);
> > > > >  extern bool default_conditional_operation_is_expensive (unsigned);
> > > > > +extern bool default_use_permute_for_promotion (const_tree, const_tree);
> > > > >  extern vector_costs *default_vectorize_create_costs (vec_info *, bool);
> > > > >
> > > > >  /* OpenACC hooks.  */
> > > > > diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
> > > > > index
> > dc040df9fcd1182b62d83088ee7fb3a248c99f51..a487eab794fe9f1089ecb58fdf
> > c881fdb19d28f3 100644
> > > > > --- a/gcc/targhooks.cc
> > > > > +++ b/gcc/targhooks.cc
> > > > > @@ -1615,6 +1615,14 @@ default_conditional_operation_is_expensive
> > (unsigned ifn)
> > > > >    return ifn == IFN_MASK_STORE;
> > > > >  }
> > > > >
> > > > > +/* By default no targets prefer permutes over multi step extension.  */
> > > > > +
> > > > > +bool
> > > > > +default_use_permute_for_promotion (const_tree, const_tree)
> > > > > +{
> > > > > +  return false;
> > > > > +}
> > > > > +
> > > > >  /* By default consider masked stores to be expensive.  */
> > > > >
> > > > >  bool
> > > > > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> > > > > index
> > 4f6905f15417f90c6f36e1711a7a25071f0f507c..f2939655e4ec34111baa8894e
> > af769d29b1c5b82 100644
> > > > > --- a/gcc/tree-vect-stmts.cc
> > > > > +++ b/gcc/tree-vect-stmts.cc
> > > > > @@ -5129,6 +5129,111 @@ vect_create_vectorized_promotion_stmts
> > (vec_info *vinfo,
> > > > >    gimple *new_stmt1, *new_stmt2;
> > > > >    vec<tree> vec_tmp = vNULL;
> > > > >
> > > > > +  /* If we're using a VEC_PERM_EXPR then we're widening to the final type in
> > > > > +     one go.  */
> > > > > +  if (ch1 == VEC_PERM_EXPR
> > > > > +      && op_type == unary_op)
> > > > > +    {
> > > > > +      vec_tmp.create (vec_oprnds0->length () * 2);
> > > > > +      bool failed_p = false;
> > > > > +
> > > > > +      /* Extending with a vec-perm requires 2 instructions per step.  */
> > > > > +      FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > > > > +	{
> > > > > +	  tree vectype_in = TREE_TYPE (vop0);
> > > > > +	  tree vectype_out = TREE_TYPE (vec_dest);
> > > > > +	  machine_mode mode_in = TYPE_MODE (vectype_in);
> > > > > +	  machine_mode mode_out = TYPE_MODE (vectype_out);
> > > > > +	  unsigned bitsize_in = element_precision (vectype_in);
> > > > > +	  unsigned tot_in, tot_out;
> > > > > +	  unsigned HOST_WIDE_INT count;
> > > > > +
> > > > > +	  /* We can't really support VLA here as the indexes depend on the VL.
> > > > > +	     VLA should really use widening instructions like widening
> > > > > +	     loads.  */
> > > > > +	  if (!GET_MODE_BITSIZE (mode_in).is_constant (&tot_in)
> > > > > +	      || !GET_MODE_BITSIZE (mode_out).is_constant (&tot_out)
> > > > > +	      || !TYPE_VECTOR_SUBPARTS (vectype_in).is_constant (&count)
> > > > > +	      || !TYPE_UNSIGNED (vectype_in)
> > > > > +	      || !targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > > > +							       vectype_out))
> > > > > +	    {
> > > > > +	      failed_p = true;
> > > > > +	      break;
> > > > > +	    }
> > > > > +
> > > > > +	  unsigned steps = tot_out / bitsize_in;
> > > > > +	  tree zero = build_zero_cst (vectype_in);
> > > > > +
> > > > > +	  unsigned chunk_size
> > > > > +	    = exact_div (TYPE_VECTOR_SUBPARTS (vectype_in),
> > > > > +			 TYPE_VECTOR_SUBPARTS (vectype_out)).to_constant ();
> > > > > +	  unsigned step_size = chunk_size * (tot_out / tot_in);
> > > > > +	  unsigned nunits = tot_out / bitsize_in;
> > > > > +
> > > > > +	  vec_perm_builder sel (steps, 1, 1);
> > > > > +	  sel.quick_grow (steps);
> > > > > +
> > > > > +	  /* Flood fill with the out of range value first.  */
> > > > > +	  for (unsigned long i = 0; i < steps; ++i)
> > > > > +	    sel[i] = count;
> > > > > +
> > > > > +	  tree var;
> > > > > +	  tree elem_in = TREE_TYPE (vectype_in);
> > > > > +	  machine_mode elem_mode_in = TYPE_MODE (elem_in);
> > > > > +	  unsigned long idx = 0;
> > > > > +	  tree vc_in = get_related_vectype_for_scalar_type (elem_mode_in,
> > > > > +							    elem_in, nunits);
> > > > > +
> > > > > +	  for (unsigned long j = 0; j < chunk_size; j++)
> > > > > +	    {
> > > > > +	      if (WORDS_BIG_ENDIAN)
> > > > > +		for (int i = steps - 1; i >= 0; i -= step_size, idx++)
> > > > > +		  sel[i] = idx;
> > > > > +	      else
> > > > > +		for (int i = 0; i < (int)steps; i += step_size, idx++)
> > > > > +		  sel[i] = idx;
> > > > > +
> > > > > +	      vec_perm_indices indices (sel, 2, steps);
> > > > > +
> > > > > +	      tree perm_mask = vect_gen_perm_mask_checked (vc_in, indices);
> > > > > +	      auto vec_oprnd = make_ssa_name (vc_in);
> > > > > +	      auto new_stmt = gimple_build_assign (vec_oprnd, VEC_PERM_EXPR,
> > > > > +						   vop0, zero, perm_mask);
> > > > > +	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > +
> > > > > +	      tree intvect_out = unsigned_type_for (vectype_out);
> > > > > +	      var = make_ssa_name (intvect_out);
> > > > > +	      new_stmt = gimple_build_assign (var, build1 (VIEW_CONVERT_EXPR,
> > > > > +							   intvect_out,
> > > > > +							   vec_oprnd));
> > > > > +	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > +
> > > > > +	      gcc_assert (ch2.is_tree_code ());
> > > > > +
> > > > > +	      var = make_ssa_name (vectype_out);
> > > > > +	      if (ch2 == VIEW_CONVERT_EXPR)
> > > > > +		  new_stmt = gimple_build_assign (var,
> > > > > +						  build1 (VIEW_CONVERT_EXPR,
> > > > > +							  vectype_out,
> > > > > +							  vec_oprnd));
> > > > > +	      else
> > > > > +		  new_stmt = gimple_build_assign (var, (tree_code)ch2,
> > > > > +						  vec_oprnd);
> > > > > +
> > > > > +	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
> > > > > +	      vec_tmp.safe_push (var);
> > > > > +	    }
> > > > > +	}
> > > > > +
> > > > > +      if (!failed_p)
> > > > > +	{
> > > > > +	  vec_oprnds0->release ();
> > > > > +	  *vec_oprnds0 = vec_tmp;
> > > > > +	  return;
> > > > > +	}
> > > > > +    }
> > > > > +
> > > > >    vec_tmp.create (vec_oprnds0->length () * 2);
> > > > >    FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> > > > >      {
> > > > > @@ -5495,6 +5600,20 @@ vectorizable_conversion (vec_info *vinfo,
> > > > >  	  || GET_MODE_SIZE (lhs_mode) <= GET_MODE_SIZE (rhs_mode))
> > > > >  	goto unsupported;
> > > > >
> > > > > +      /* Check to see if the target can use a permute to perform the zero
> > > > > +	 extension.  */
> > > > > +      intermediate_type = unsigned_type_for (vectype_out);
> > > > > +      if (TYPE_UNSIGNED (vectype_in)
> > > > > +	  && VECTOR_TYPE_P (intermediate_type)
> > > > > +	  && TYPE_VECTOR_SUBPARTS (intermediate_type).is_constant ()
> > > > > +	  && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > > > > +							  intermediate_type))
> > > > > +	{
> > > > > +	  code1 = VEC_PERM_EXPR;
> > > > > +	  code2 = FLOAT_EXPR;
> > > > > +	  break;
> > > > > +	}
> > > > > +
> > > > >        fltsz = GET_MODE_SIZE (lhs_mode);
> > > > >        FOR_EACH_2XWIDER_MODE (rhs_mode_iter, rhs_mode)
> > > > >  	{
> > > > > @@ -9804,7 +9923,8 @@ vect_gen_perm_mask_any (tree vectype, const
> > vec_perm_indices &sel)
> > > > >    tree mask_type;
> > > > >
> > > > >    poly_uint64 nunits = sel.length ();
> > > > > -  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype)));
> > > > > +  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype))
> > > > > +	      || known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype) * 2));
> > > > >
> > > > >    mask_type = build_vector_type (ssizetype, nunits);
> > > > >    return vec_perm_indices_to_tree (mask_type, sel);
> > > > > @@ -14397,8 +14517,20 @@ supportable_widening_operation (vec_info
> > *vinfo,
> > > > >        break;
> > > > >
> > > > >      CASE_CONVERT:
> > > > > -      c1 = VEC_UNPACK_LO_EXPR;
> > > > > -      c2 = VEC_UNPACK_HI_EXPR;
> > > > > +      {
> > > > > +	tree cvt_type = unsigned_type_for (vectype_out);
> > > > > +	if (TYPE_UNSIGNED (vectype_in)
> > > > > +	  && VECTOR_TYPE_P (cvt_type)
> > > > > +	  && TYPE_VECTOR_SUBPARTS (cvt_type).is_constant ()
> > > > > +	  && targetm.vectorize.use_permute_for_promotion (vectype_in,
> > cvt_type))
> > > > > +	  {
> > > > > +	    *code1 = VEC_PERM_EXPR;
> > > > > +	    *code2 = VIEW_CONVERT_EXPR;
> > > > > +	    return true;
> > > > > +	  }
> > > > > +	c1 = VEC_UNPACK_LO_EXPR;
> > > > > +	c2 = VEC_UNPACK_HI_EXPR;
> > > > > +      }
> > > > >        break;
> > > > >
> > > > >      case FLOAT_EXPR:
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > > --
> > > > Richard Biener <rguenther@suse.de>
> > > > SUSE Software Solutions Germany GmbH,
> > > > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG
> > Nuernberg)
> > >
> > >
> > 
> > --
> > Richard Biener <rguenther@suse.de>
> > SUSE Software Solutions Germany GmbH,
> > Frankenstrasse 146, 90461 Nuernberg, Germany;
> > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
>
Tamar Christina Oct. 15, 2024, 12:43 p.m. UTC | #9
> -----Original Message-----
> From: Richard Biener <rguenther@suse.de>
> Sent: Tuesday, October 15, 2024 1:20 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> Subject: RE: [PATCH 1/4]middle-end: support multi-step zero-extends using
> VEC_PERM_EXPR
> 
> On Tue, 15 Oct 2024, Tamar Christina wrote:
> 
> > > -----Original Message-----
> > > From: Richard Biener <rguenther@suse.de>
> > > Sent: Tuesday, October 15, 2024 12:13 PM
> > > To: Tamar Christina <Tamar.Christina@arm.com>
> > > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > > Subject: Re: [PATCH 1/4]middle-end: support multi-step zero-extends using
> > > VEC_PERM_EXPR
> > >
> > > On Tue, 15 Oct 2024, Tamar Christina wrote:
> > >
> > > > Hi,
> > > >
> > > > Thanks for the look,
> > > >
> > > > The 10/15/2024 09:54, Richard Biener wrote:
> > > > > On Mon, 14 Oct 2024, Tamar Christina wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > This patch series adds support for a target to do a direct conversion for
> zero
> > > > > > extends using permutes.
> > > > > >
> > > > > > To do this it uses a target hook use_permute_for_promotion which must be
> > > > > > implemented by targets.  This hook is used to indicate:
> > > > > >
> > > > > >  1. can a target do this for the given modes.
> > > > >
> > > > > can_vec_perm_const_p?
> > > > >
> > > > > >  3. can the target convert between various vector modes with a
> > > VIEW_CONVERT.
> > > > >
> > > > > We have modes_tieable_p for this I think.
> > > > >
> > > >
> > > > Yes, though the reason I didn't use either of them was because they are
> reporting
> > > > a capability of the backend.  In which case the hook, which is already backend
> > > > specific, should answer these two.
> > > >
> > > > I initially had these checks there, but they didn't seem to add value; for
> > > > promotions the masks are only dependent on the input and output modes.
> So
> > > they really
> > > > don't change.
> > > >
> > > > When you have, say, a loop that does lots of conversions from char to int,
> it
> > > seemed
> > > > like a waste to retest the same permute constants over and over again.
> > > >
> > > > I can add them back in if you prefer...
> > > >
> > > > > >  2. is it profitable for the target to do it.
> > > > >
> > > > > So you say the target can do both ways but both zip and tbl are
> > > > > permute instructions so I really fail to see the point and why
> > > > > the target itself doesn't choose to use tbl for unpack.
> > > > >
> > > > > Is the intent in the end to have VEC_PERM in the IL rather than
> > > > > VEC_UNPACK_* so it combines with other VEC_PERMs?
> > > > >
> > > >
> > > > Yes, and this happens quite often, e.g. load permutes or lane shuffles etc.
> > > > The reason for exposing them as VEC_PERM was to trigger further
> optimizations.
> > > >
> > > > If you remember the ticket about LOAD_LANES, with this optimization and an
> > > open
> > > > encoding of LOAD_LANES we stop using it in cases where there's a zero extend
> > > after
> > > > the LOAD_LANES, because then you're doing effectively two permutes and
> the
> > > LOAD_LANES
> > > > is no longer beneficial. There are other examples, load and replicate etc.
> > > >
> > > > > That said, I'm not against supporting VEC_PERM code gen from
> > > > > unsigned promotion but I don't see why we should do this when
> > > > > the target advertises VEC_UNPACK_* support or direct conversion
> > > > > support?
> > > > >
> > > > > Esp. with adding a "local" cost related hook which cannot take
> > > > > into account context.
> > > > >
> > > >
> > > > To summarize a long story:
> > > >
> > > >   yes I open encode zero extends as permutes to allow further optimizations.
> > > One could convert
> > > >   vec_unpacks to convert optabs and use that, but that is an opaque value
> that
> > > can't be further
> > > >   optimized.
> > > >
> > > >   The hook isn't really a costing thing in the general sense. It's literally just "do
> you
> > > want
> > > >   permutes yes or no".  The reason it gets the modes is simply that I don't
> think a
> > > single level
> > > >   extend is worth it, but I can just change it to never try to do this on more
> than
> > > one level.
> > >
> > > When you mention LOAD_LANES we do not expose "permutes" in them on
> > > GIMPLE
> > > either, so why should we for VEC_UNPACK_*.
> >
> > I think not exposing LOAD_LANES in GIMPLE *is* an actual mistake that I hope to
> correct in GCC-16.
> > Or at least the time we pick LOAD_LANES is too early.  So I don't think pointing to
> this is a convincing
> > argument.  It's only VLA that I think needs the IL because you have to mask the
> group of operations and
> > may be hard to reconcile that later on.
> >
> > > At what level are the simplifications you see happening then?
> >
> > Well, they are currently happening outside of the vectorizer passes itself,
> > more specifically in this case because VN runs match simplifications.
> 
> But match doesn't simplify permutes against .LOAD_LANES?  So it's about
> "other" permutes (from loads) that get simplified?
> 

Yes, or other permutes after the zero extend.  I shouldn't have mentioned LOAD_LANES;
I think that moved the discussion in the wrong direction.
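
An illustrative example (made-up four-element little-endian case, not
from the testsuite): if the input of the widening is itself a constant
permute, say from a load shuffle, exposing the widening as a permute
lets match compose the two constant masks, e.g.

  _1 = VEC_PERM_EXPR <x_2(D), { 0, 0, 0, 0 }, { 0, 4, 1, 5 }>;
  _3 = VEC_PERM_EXPR <_1, _1, { 2, 3, 0, 1 }>;

folds to

  _3 = VEC_PERM_EXPR <x_2(D), { 0, 0, 0, 0 }, { 1, 5, 0, 4 }>;

whereas a VEC_UNPACK_LO_EXPR in place of the first statement is opaque
to that simplification.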

> > If the concern is that that's late, I can lift it to a pattern I suppose.
> > I didn't use a pattern because similar changes in this area always just happened
> > at codegen.
> 
> I was wondering how this plays with my idea of having us "lower"
> or rather "code generate" to an intermediate SLP representation where
> we split SLP groups on vector boundaries and are then free to
> perform permute optimizations that need to know the vector type.
> 
> That said - match could as well combine VEC_UNPACK_* with a VEC_PERMUTE
> with the catch that this duplicates patterns for the
> VEC_UNPACK_*/VEC_PERMUTE duality we have.
> 
> > >
> > > I do realize we have two ways of expressing zero-extending widenings
> > > (also truncations btw) and that's always bad - so we could decide to
> > > _always_ use VEC_PERMs as the canonical representation because those
> > > combine more easily.  And either match VEC_PERMs back to vec_unpack
> > > at RTL expansion time or require targets to expose those as constant
> > > vec_perms as well.  There are targets like GCN where you can't do
> > > unpacking with permutes of course, so we can't do away with them
> > > (we could possibly force those targets to expose widening/truncation
> > > solely with [us]ext and trunc patterns of course).
> >
> > Ok, so your objection is that you don't want to have a different way of doing
> > a single step zero extend vs a multi-step zero extend.
> 
> My objection is mainly that we do this based on a target decision and
> without immediate effect on the vector loop and its costing - it's not
> that we are then able to see we can combine the permutes with others,
> say in SLP permute optimization.

I can fix that by lifting the code up as a pattern so it does affect costing
directly and also gets seen by the vectorizer's permute simplification.

I agree that that would be a better place for it.  Does that address the
issue?  Then at least the target decision directly affects vectorization
like other patterns.
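
For reference, the lifted form would follow the usual recognizer shape
in tree-vect-patterns.cc.  A rough sketch only -- the name and the
elided body are hypothetical, not part of this series:

  /* Detect lhs = (T) rhs, an unsigned widening we want to open-code as
     VEC_PERM_EXPR + VIEW_CONVERT_EXPR so it is costed and visible to
     SLP like any other pattern.  */
  static gimple *
  vect_recog_zext_perm_pattern (vec_info *vinfo, stmt_vec_info stmt_vinfo,
				tree *type_out)
  {
    gassign *stmt = dyn_cast <gassign *> (stmt_vinfo->stmt);
    if (!stmt || !CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (stmt)))
      return NULL;
    tree lhs = gimple_assign_lhs (stmt);
    tree rhs = gimple_assign_rhs1 (stmt);
    if (!INTEGRAL_TYPE_P (TREE_TYPE (lhs))
	|| !TYPE_UNSIGNED (TREE_TYPE (rhs))
	|| (TYPE_PRECISION (TREE_TYPE (lhs))
	    <= TYPE_PRECISION (TREE_TYPE (rhs))))
      return NULL;
    /* Build the permutes against zero here, mirroring
       vect_create_vectorized_promotion_stmts, set *type_out and return
       the final pattern statement.  */
    return NULL;
  }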

> 
> > At the moment my patch doesn't care, if you return an unconditional true
> > then for that target you get VEC_PERM or everything and the vectorizer
> > won't ever spit out VEC_UNPACKU.
> >
> > You're arguing that this should be the default, even if the target does not
> > support it and then we have to somehow undo it during vec_lowering?
> 
> I argued that we possibly should do this by default and all targets
> that can vec_unpack but not vec_perm_const with such a permute can
> either implement the missing vec_perm_const or they are of the kind
> that cannot use a permute for this (!modes_tieable_p).

Ok, and I assume this would catch targets like GCN?  I don't know much about
what can be converted or not there.  I'll go check their modes_tieable_p.

> > Otherwise if the target doesn't support the permute it'll be scalarized..
> >
> > I guess sure..  But then...
> >
> > > There are targets like GCN where you can't do
> > > unpacking with permutes of course, so we can't do away with them
> > > (we could possibly force those targets to expose widening/truncation
> > > solely with [us]ext and trunc patterns of course).
> >
> > I guess if can_vec_perm_const_p fails we can undo it.. But it feels like
> > we lose an element of preference here.  A target *could* do the permute,
> > but not do it efficiently.
> 
> It can do it the same way it would do the vec_unpack?  Or what am I
> missing here?  Does your permute not exactly replicate vec_unpack_lo/hi?

It replicates a series of them, yeah.  What I meant with the above is what should
happen for targets that haven't implemented vec_perm_const,  but I suppose
the previous paragraph addresses this.
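
As a worked example of that series (assuming the little-endian
V16QI -> V2DI case from the cover letter): each of the eight permutes
the patch emits picks two adjacent input bytes and takes every other
byte from the zero vector (operand two, so indices 16 and up).  The
first of the eight masks is

  { 0, 16, 16, 16, 16, 16, 16, 16,  1, 16, 16, 16, 16, 16, 16, 16 }

and view-converting that V16QI result to V2DI yields x[0] and x[1]
zero-extended to 64 bits in one step, i.e. the combined effect of the
whole zip1/zip2 chain through the 8-bit, 16-bit and 32-bit element
sizes shown earlier.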

> > >
> > > > I think there's a lot of merit in open-encoding zero extends, but one
> > > reason this is
> > > > beneficial on AArch64 for instance is that we can consume the zero register
> and
> > > rewrite the
> > > > indices to a single-register TBL.  Two-register TBLs are slower on some
> > > implementations.
> > >
> > > But this latter fact can be done by optimizing the RTL?
> >
> > Sure, and we do so today.  That's why the example output in the cover letter
> > has only one input register.  The point of this blurb was more to point out that
> > the optimization being beneficial may depend on a specific uarch, and as such
> > I believe that a certain element of target buy-in is needed.
> 
> If it's dependent on uarch then even more so - why not simply
> expand vec_unpack as tbl then?

We expand them as ZIPs, because these don't require a lookup table index.
However, again, these are only single-level unpacks; that doesn't work for this
case of multi-level unpacks.  For something like byte -> long, or worse byte -> double,
the number of instructions to match in combine would exceed combine's limit.

Additionally they require a lot of patterns.  So simply, we cannot recombine multi-level
unpacks in RTL.

The backend however will do something sensible given a VEC_PERM_EXPR.

But I think this is just a detail we're getting into.

It sounds like you're ok with doing it unconditionally for any target that supports
the permutes, and lifting it pre-analysis (like in a pattern) so it's costed?

Did I understand that right?

Thanks for the discussion so far.

Tamar

Richard Biener Oct. 17, 2024, 12:49 p.m. UTC | #10
On Tue, 15 Oct 2024, Tamar Christina wrote:

> > -----Original Message-----
> > From: Richard Biener <rguenther@suse.de>
> > Sent: Tuesday, October 15, 2024 1:20 PM
> > To: Tamar Christina <Tamar.Christina@arm.com>
> > Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>
> > Subject: RE: [PATCH 1/4]middle-end: support multi-step zero-extends using
> > VEC_PERM_EXPR
> > 
> > On Tue, 15 Oct 2024, Tamar Christina wrote:
> > 
> > [...]
> > > Ok, so your objection is that you don't want to have a different way of doing
> > > a single step zero extend vs a multi-step zero extend.
> > 
> > My objection is mainly that we do this based on a target decision and
> > without immediate effect on the vector loop and its costing - it's not
> > that we are then able to see we can combine the permutes with others,
> > say in SLP permute optimization.
> 
> I can fix that by lifting the code up as a pattern so it does affect costing
> directly and also gets seen by the vectorizer's permute simplification.
> 
> I agree that that would be a better place for it.  Does that address the
> issue?  Then at least the target decision directly affects vectorization
> like other patterns.

So - how can you teach the SLP permute optimization to treat converts
as permutes?  I think since you can't really do this as a pattern,
it doesn't fit a VEC_PERM SLP node either?  Or maybe you can have
VEC_PERM <{a}, {0}, { [0:0], [1:0] }> followed by a node with a
VIEW_CONVERT_EXPR to a wider element type?  So it might be fully
implementable in SLP permute optimization?
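
Concretely, for a u8 -> u32 extend of the low quarter of a V16QI on
little-endian, that pair of nodes could look like (hypothetical SSA names):

  perm_4 = VEC_PERM_EXPR <a_2, { 0, ... }, { 0, 16, 16, 16, 1, 16, 16, 16,
                                             2, 16, 16, 16, 3, 16, 16, 16 }>;
  ext_5 = VIEW_CONVERT_EXPR<vector(4) unsigned int>(perm_4);

where index 16 selects from the zero operand.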

> > 
> > > At the moment my patch doesn't care: if you return an unconditional true
> > > then for that target you get VEC_PERM for everything and the vectorizer
> > > won't ever spit out VEC_UNPACKU.
> > >
> > > You're arguing that this should be the default, even if the target does not
> > > support it and then we have to somehow undo it during vec_lowering?
> > 
> > I argued that we possibly should do this by default and all targets
> > that can vec_unpack but not vec_perm_const with such a permute can
> > either implement the missing vec_perm_const or they are of the kind
> > that cannot use a permute for this (!modes_tieable_p).
> 
> Ok, and I assume this would catch targets like GCN?  I don't know much about
> what can be converted or not there.  I'll go check their modes_tieable_p.

GCN can't pun a V8HI to a V4SI vector, yes.

> > > Otherwise if the target doesn't support the permute it'll be scalarized...
> > >
> > > I guess sure...  But then...
> > >
> > > > There are targets like GCN where you can't do
> > > > unpacking with permutes of course, so we can't do away with them
> > > > (we could possibly force those targets to expose widening/truncation
> > > > solely with [us]ext and trunc patterns of course).
> > >
> > > I guess if can_vec_perm_const_p fails we can undo it...  But it feels like
> > > we lose an element of preference here.  A target *could* do the permute,
> > > but not do it efficiently.
> > 
> > It can do it the same way it would do the vec_unpack?  Or what am I
> > missing here?  Does your permute not exactly replicate vec_unpack_lo/hi?
> 
> It replicates a series of them, yeah.  What I meant with the above is what
> should happen for targets that haven't implemented vec_perm_const, but I
> suppose the previous paragraph addresses this.
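>
> To make "a series" concrete, a sketch for V16QI -> V2DI on little-endian:
> each 16-byte input yields eight permutes whose masks are
>
>   {  0, 16, 16, 16, 16, 16, 16, 16,  1, 16, 16, 16, 16, 16, 16, 16 }
>   {  2, 16, 16, 16, 16, 16, 16, 16,  3, 16, 16, 16, 16, 16, 16, 16 }
>   ...
>   { 14, 16, 16, 16, 16, 16, 16, 16, 15, 16, 16, 16, 16, 16, 16, 16 }
>
> with index 16 picking a zero lane, each followed by a VIEW_CONVERT_EXPR to
> V2DI - the eight tbls in the cover letter.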
> 
> > > >
> > > > > I think there's a lot of merit in open-encoding zero extends, but one
> > > > > reason this is beneficial on AArch64 for instance is that we can
> > > > > consume the zero register and rewrite the indices to a single-register
> > > > > TBL.  Two-register TBLs are slower on some implementations.
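> > > > >
> > > > >   A sketch of that rewrite (hypothetical function; TBL returns zero
> > > > >   for out-of-range indices, so the zero lanes become index 255):
> > > > >
> > > > >     #include <arm_neon.h>
> > > > >
> > > > >     uint32x4_t
> > > > >     zext_low_u8_to_u32 (uint8x16_t x)
> > > > >     {
> > > > >       /* Bytes 0..3 of x, each padded to 32 bits with zeros
> > > > >          (little-endian).  */
> > > > >       const uint8x16_t idx = { 0, 255, 255, 255, 1, 255, 255, 255,
> > > > >                                2, 255, 255, 255, 3, 255, 255, 255 };
> > > > >       return vreinterpretq_u32_u8 (vqtbl1q_u8 (x, idx));
> > > > >     }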
> > > >
> > > > But this latter part can be done by optimizing the RTL?
> > >
> > > Sure, and we do so today.  That's why the example output in the cover
> > > letter has only one input register.  The point of this blurb was more to
> > > point out that the optimization being beneficial may depend on a specific
> > > uarch, and as such I believe that a certain element of target buy-in is
> > > needed.
> > 
> > If it's dependent on uarch then even more so - why not simply
> > expand vec_unpack as tbl then?
> 
> We expand them as ZIPs, because these don't require a lookup table index.
> However again these are only single level unpacks.  It doesn't work for this
> case of multi-level unpacks.  For something like byte -> long, or worse byte -> double
> the number of instructions to match in combine would exceed its limit.
> 
> Additionally they require a lot of patterns.  So simply, we cannot recombine multi-level
> unpacks in RTL.
> 
> The backend however will do something sensible given a VEC_PERM_EXPR.
> 
> But I think this is just a detail we're getting into.
> 
> It sounds like you're ok with doing it unconditionally for any target that
> supports the permutes, and lifting it pre-analysis (like in a pattern) so
> it's costed?

I _think_ that I'd be OK to do this as canonicalization, but as said it
requires buy-in and work in all targets.  We should be able to get
rid of VEC_UNPACK_HI/LO as a tree code then; GCN doesn't (cannot)
implement any of those but uses [sz]ext/trunc exclusively IIRC.

Richard.

> Did I understand that right?
> 
> Thanks for the discussion so far.
> 
> Tamar
> 
> > > If you want me to do it unconditionally, sure, I can do that...
> > >
> > > If so, can I get a review on the other patches anyway?  They are mostly
> > > independent; they only have some dependencies on the output of the tests.
> > 
> > Sure, I'm behind stuff - sorry.
> > 
> > Richard.
> > 
> > > Thanks,
> > > Tamar
> > >
> > > >
> > > > Richard.
> > > >
> > > > > Thanks,
> > > > > Tamar
> > > > >
> > > > > > > [...]
>

Patch

diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 4deb3d2c283a2964972b94f434370a6f57ea816a..e8192590ac14005bf7cb5f731c16ee7eacb78143 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -6480,6 +6480,15 @@  type @code{internal_fn}) should be considered expensive when the mask is
 all zeros.  GCC can then try to branch around the instruction instead.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION (const_tree @var{in_type}, const_tree @var{out_type})
+This hook returns true if the operation promoting @var{in_type} to
+@var{out_type} should be done as a vector permute.  If @var{out_type} is
+a signed type the operation will be done as the related unsigned type and
+converted to @var{out_type}.  If the target supports the needed permute,
+is able to convert unsigned(@var{out_type}) to @var{out_type}, and it is
+beneficial to do so, the hook should return true; otherwise return false.
+@end deftypefn
+
 @deftypefn {Target Hook} {class vector_costs *} TARGET_VECTORIZE_CREATE_COSTS (vec_info *@var{vinfo}, bool @var{costing_for_scalar})
 This hook should initialize target-specific data structures in preparation
 for modeling the costs of vectorizing a loop or basic block.  The default
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 9f147ccb95cc6d4e79cdf5b265666ad502492145..c007bc707372dd374e8effc52d29b76f5bc283a1 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4303,6 +4303,8 @@  address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_EMPTY_MASK_IS_EXPENSIVE
 
+@hook TARGET_VECTORIZE_USE_PERMUTE_FOR_PROMOTION
+
 @hook TARGET_VECTORIZE_CREATE_COSTS
 
 @hook TARGET_VECTORIZE_BUILTIN_GATHER
diff --git a/gcc/target.def b/gcc/target.def
index b31550108883c5c3f5ffc7e46a1e8a7b839ebe83..58545d5ef4248da5850edec8f4db9f2636973598 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -2056,6 +2056,20 @@  all zeros.  GCC can then try to branch around the instruction instead.",
  (unsigned ifn),
  default_empty_mask_is_expensive)
 
+/* Function to say whether a target supports and prefers to use permutes for
+   zero extensions or truncates.  */
+DEFHOOK
+(use_permute_for_promotion,
+ "This hook returns true if the operation promoting @var{in_type} to\n\
+@var{out_type} should be done as a vector permute.  If @var{out_type} is\n\
+a signed type the operation will be done as the related unsigned type and\n\
+converted to @var{out_type}.  If the target supports the needed permute,\n\
+is able to convert unsigned(@var{out_type}) to @var{out_type}, and it is\n\
+beneficial to do so, the hook should return true; otherwise return false.",
+ bool,
+ (const_tree in_type, const_tree out_type),
+ default_use_permute_for_promotion)
+
 /* Target builtin that implements vector gather operation.  */
 DEFHOOK
 (builtin_gather,
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 2704d6008f14d2aa65671f002af886d3b802effa..723f8f4fda7808b6899f10f8b3fafad74d3c536f 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -124,6 +124,7 @@  extern opt_machine_mode default_vectorize_related_mode (machine_mode,
 extern opt_machine_mode default_get_mask_mode (machine_mode);
 extern bool default_empty_mask_is_expensive (unsigned);
 extern bool default_conditional_operation_is_expensive (unsigned);
+extern bool default_use_permute_for_promotion (const_tree, const_tree);
 extern vector_costs *default_vectorize_create_costs (vec_info *, bool);
 
 /* OpenACC hooks.  */
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index dc040df9fcd1182b62d83088ee7fb3a248c99f51..a487eab794fe9f1089ecb58fdfc881fdb19d28f3 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -1615,6 +1615,14 @@  default_conditional_operation_is_expensive (unsigned ifn)
   return ifn == IFN_MASK_STORE;
 }
 
+/* By default no targets prefer permutes over multi-step extensions.  */
+
+bool
+default_use_permute_for_promotion (const_tree, const_tree)
+{
+  return false;
+}
+
 /* By default consider masked stores to be expensive.  */
 
 bool
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 4f6905f15417f90c6f36e1711a7a25071f0f507c..f2939655e4ec34111baa8894eaf769d29b1c5b82 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -5129,6 +5129,111 @@  vect_create_vectorized_promotion_stmts (vec_info *vinfo,
   gimple *new_stmt1, *new_stmt2;
   vec<tree> vec_tmp = vNULL;
 
+  /* If we're using a VEC_PERM_EXPR then we're widening to the final type in
+     one go.  */
+  if (ch1 == VEC_PERM_EXPR
+      && op_type == unary_op)
+    {
+      vec_tmp.create (vec_oprnds0->length () * 2);
+      bool failed_p = false;
+
+      /* Extending with a vec-perm requires 2 instructions per step.  */
+      FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
+	{
+	  tree vectype_in = TREE_TYPE (vop0);
+	  tree vectype_out = TREE_TYPE (vec_dest);
+	  machine_mode mode_in = TYPE_MODE (vectype_in);
+	  machine_mode mode_out = TYPE_MODE (vectype_out);
+	  unsigned bitsize_in = element_precision (vectype_in);
+	  unsigned tot_in, tot_out;
+	  unsigned HOST_WIDE_INT count;
+
+	  /* We can't really support VLA here as the indexes depend on the VL.
+	     VLA should really use widening instructions like widening
+	     loads.  */
+	  if (!GET_MODE_BITSIZE (mode_in).is_constant (&tot_in)
+	      || !GET_MODE_BITSIZE (mode_out).is_constant (&tot_out)
+	      || !TYPE_VECTOR_SUBPARTS (vectype_in).is_constant (&count)
+	      || !TYPE_UNSIGNED (vectype_in)
+	      || !targetm.vectorize.use_permute_for_promotion (vectype_in,
+							       vectype_out))
+	    {
+	      failed_p = true;
+	      break;
+	    }
+
+	  unsigned steps = tot_out / bitsize_in;
+	  tree zero = build_zero_cst (vectype_in);
+
+	  unsigned chunk_size
+	    = exact_div (TYPE_VECTOR_SUBPARTS (vectype_in),
+			 TYPE_VECTOR_SUBPARTS (vectype_out)).to_constant ();
+	  unsigned step_size = chunk_size * (tot_out / tot_in);
+	  unsigned nunits = tot_out / bitsize_in;
+
+	  vec_perm_builder sel (steps, 1, 1);
+	  sel.quick_grow (steps);
+
+	  /* Flood fill with the out of range value first.  */
+	  for (unsigned long i = 0; i < steps; ++i)
+	    sel[i] = count;
+
+	  tree var;
+	  tree elem_in = TREE_TYPE (vectype_in);
+	  machine_mode elem_mode_in = TYPE_MODE (elem_in);
+	  unsigned long idx = 0;
+	  tree vc_in = get_related_vectype_for_scalar_type (elem_mode_in,
+							    elem_in, nunits);
+
+	  for (unsigned long j = 0; j < chunk_size; j++)
+	    {
+	      if (WORDS_BIG_ENDIAN)
+		for (int i = steps - 1; i >= 0; i -= step_size, idx++)
+		  sel[i] = idx;
+	      else
+		for (int i = 0; i < (int)steps; i += step_size, idx++)
+		  sel[i] = idx;
+
+	      vec_perm_indices indices (sel, 2, steps);
+
+	      tree perm_mask = vect_gen_perm_mask_checked (vc_in, indices);
+	      auto vec_oprnd = make_ssa_name (vc_in);
+	      auto new_stmt = gimple_build_assign (vec_oprnd, VEC_PERM_EXPR,
+						   vop0, zero, perm_mask);
+	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
+
+	      tree intvect_out = unsigned_type_for (vectype_out);
+	      var = make_ssa_name (intvect_out);
+	      new_stmt = gimple_build_assign (var, build1 (VIEW_CONVERT_EXPR,
+							   intvect_out,
+							   vec_oprnd));
+	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
+
+	      gcc_assert (ch2.is_tree_code ());
+
+	      var = make_ssa_name (vectype_out);
+	      if (ch2 == VIEW_CONVERT_EXPR)
+		  new_stmt = gimple_build_assign (var,
+						  build1 (VIEW_CONVERT_EXPR,
+							  vectype_out,
+							  vec_oprnd));
+	      else
+		  new_stmt = gimple_build_assign (var, (tree_code)ch2,
+						  vec_oprnd);
+
+	      vect_finish_stmt_generation (vinfo, stmt_info, new_stmt, gsi);
+	      vec_tmp.safe_push (var);
+	    }
+	}
+
+      if (!failed_p)
+	{
+	  vec_oprnds0->release ();
+	  *vec_oprnds0 = vec_tmp;
+	  return;
+	}
+    }
+
   vec_tmp.create (vec_oprnds0->length () * 2);
   FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
     {
@@ -5495,6 +5600,20 @@  vectorizable_conversion (vec_info *vinfo,
 	  || GET_MODE_SIZE (lhs_mode) <= GET_MODE_SIZE (rhs_mode))
 	goto unsupported;
 
+      /* Check to see if the target can use a permute to perform the zero
+	 extension.  */
+      intermediate_type = unsigned_type_for (vectype_out);
+      if (TYPE_UNSIGNED (vectype_in)
+	  && VECTOR_TYPE_P (intermediate_type)
+	  && TYPE_VECTOR_SUBPARTS (intermediate_type).is_constant ()
+	  && targetm.vectorize.use_permute_for_promotion (vectype_in,
+							  intermediate_type))
+	{
+	  code1 = VEC_PERM_EXPR;
+	  code2 = FLOAT_EXPR;
+	  break;
+	}
+
       fltsz = GET_MODE_SIZE (lhs_mode);
       FOR_EACH_2XWIDER_MODE (rhs_mode_iter, rhs_mode)
 	{
@@ -9804,7 +9923,8 @@  vect_gen_perm_mask_any (tree vectype, const vec_perm_indices &sel)
   tree mask_type;
 
   poly_uint64 nunits = sel.length ();
-  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype)));
+  gcc_assert (known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype))
+	      || known_eq (nunits, TYPE_VECTOR_SUBPARTS (vectype) * 2));
 
   mask_type = build_vector_type (ssizetype, nunits);
   return vec_perm_indices_to_tree (mask_type, sel);
@@ -14397,8 +14517,20 @@  supportable_widening_operation (vec_info *vinfo,
       break;
 
     CASE_CONVERT:
-      c1 = VEC_UNPACK_LO_EXPR;
-      c2 = VEC_UNPACK_HI_EXPR;
+      {
+	tree cvt_type = unsigned_type_for (vectype_out);
+	if (TYPE_UNSIGNED (vectype_in)
+	    && VECTOR_TYPE_P (cvt_type)
+	    && TYPE_VECTOR_SUBPARTS (cvt_type).is_constant ()
+	    && targetm.vectorize.use_permute_for_promotion (vectype_in, cvt_type))
+	  {
+	    *code1 = VEC_PERM_EXPR;
+	    *code2 = VIEW_CONVERT_EXPR;
+	    return true;
+	  }
+	c1 = VEC_UNPACK_LO_EXPR;
+	c2 = VEC_UNPACK_HI_EXPR;
+      }
       break;
 
     case FLOAT_EXPR: