diff mbox series

[v1] Internal-fn: Add new IFN mask_len_strided_load/store

Message ID 20240528031452.2706461-1-pan2.li@intel.com
State New
Headers show
Series [v1] Internal-fn: Add new IFN mask_len_strided_load/store | expand

Commit Message

Li, Pan2 May 28, 2024, 3:14 a.m. UTC
From: Pan Li <pan2.li@intel.com>

This patch would like to add new internal functions for the below 2 IFNs.
* mask_len_strided_load
* mask_len_strided_store

The GIMPLE v = MASK_LEN_STRIDED_LOAD (ptr, stride, mask, len, bias) will
be expanded into v = mask_len_strided_load (ptr, stride, mask, len, bias).

The GIMPLE MASK_LEN_STRIDED_STORE (ptr, stride, v, mask, len, bias) will
be expanded into mask_len_strided_store (ptr, stride, v, mask, len, bias).
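
As a rough illustration (not part of this patch), the kind of strided access
the new IFNs are meant to cover, and the calls the vectorizer would be
expected to build for it, might look like below.  The byte_stride and
v_plus_100 names are only for illustration, assuming the Pmode stride
operand is a byte stride.

/* Illustration only: a loop with a runtime element stride.  On a target
   providing the new optabs the vectorized body would conceptually use
     v = MASK_LEN_STRIDED_LOAD (&b[0], byte_stride, mask, len, bias);
     MASK_LEN_STRIDED_STORE (&a[0], byte_stride, v_plus_100, mask, len, bias);
   with byte_stride = stride * sizeof (int).  */
void
foo (int *__restrict a, int *__restrict b, int stride, int n)
{
  for (int i = 0; i < n; i++)
    a[i * stride] = b[i * stride] + 100;
}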

The below test suites are passed for this patch:
* The x86 bootstrap test.
* The x86 full regression test.
* The riscv full regression test.

gcc/ChangeLog:

	* doc/md.texi: Add description for mask_len_strided_load/store.
	* internal-fn.cc (strided_load_direct): New internal_fn define
	for strided_load_direct.
	(strided_store_direct): Ditto but for store.
	(expand_strided_load_optab_fn): New expand func for
	mask_len_strided_load.
	(expand_strided_store_optab_fn): Ditto but for store.
	(direct_strided_load_optab_supported_p): New define for load
	direct optab supported.
	(direct_strided_store_optab_supported_p): Ditto but for store.
	(internal_fn_len_index): Add len index for both load and store.
	(internal_fn_mask_index): Ditto but for mask index.
	(internal_fn_stored_value_index): Add stored index.
	* internal-fn.def (MASK_LEN_STRIDED_LOAD): New direct fn define
	for strided_load.
	(MASK_LEN_STRIDED_STORE): Ditto but for stride_store.
	* optabs.def (OPTAB_D): New optab define for load and store.

Signed-off-by: Pan Li <pan2.li@intel.com>
Co-Authored-By: Juzhe-Zhong <juzhe.zhong@rivai.ai>
---
 gcc/doc/md.texi     | 27 ++++++++++++++++
 gcc/internal-fn.cc  | 75 +++++++++++++++++++++++++++++++++++++++++++++
 gcc/internal-fn.def |  6 ++++
 gcc/optabs.def      |  2 ++
 4 files changed, 110 insertions(+)

Comments

Richard Biener June 4, 2024, 1:22 p.m. UTC | #1
On Tue, May 28, 2024 at 5:15 AM <pan2.li@intel.com> wrote:
>
> From: Pan Li <pan2.li@intel.com>
>
> This patch would like to add new internal functions for the below 2 IFNs.
> * mask_len_strided_load
> * mask_len_strided_store
>
> The GIMPLE v = MASK_LEN_STRIDED_LOAD (ptr, stride, mask, len, bias) will
> be expanded into v = mask_len_strided_load (ptr, stride, mask, len, bias).
>
> The GIMPLE MASK_LEN_STRIDED_STORE (ptr, stride, v, mask, len, bias) will
> be expanded into mask_len_strided_store (ptr, stride, v, mask, len, bias).
>
> The below test suites are passed for this patch:
> * The x86 bootstrap test.
> * The x86 full regression test.
> * The riscv full regression test.

Sorry if we have discussed this last year already - is there anything wrong
with using a gather/scatter with a VEC_SERIES gimple/rtl def for the offset?

Richard.

> gcc/ChangeLog:
>
>         * doc/md.texi: Add description for mask_len_strided_load/store.
>         * internal-fn.cc (strided_load_direct): New internal_fn define
>         for strided_load_direct.
>         (strided_store_direct): Ditto but for store.
>         (expand_strided_load_optab_fn): New expand func for
>         mask_len_strided_load.
>         (expand_strided_store_optab_fn): Ditto but for store.
>         (direct_strided_load_optab_supported_p): New define for load
>         direct optab supported.
>         (direct_strided_store_optab_supported_p): Ditto but for store.
>         (internal_fn_len_index): Add len index for both load and store.
>         (internal_fn_mask_index): Ditto but for mask index.
>         (internal_fn_stored_value_index): Add stored index.
>         * internal-fn.def (MASK_LEN_STRIDED_LOAD): New direct fn define
>         for strided_load.
>         (MASK_LEN_STRIDED_STORE): Ditto but for stride_store.
>         * optabs.def (OPTAB_D): New optab define for load and store.
>
> Signed-off-by: Pan Li <pan2.li@intel.com>
> Co-Authored-By: Juzhe-Zhong <juzhe.zhong@rivai.ai>
> ---
>  gcc/doc/md.texi     | 27 ++++++++++++++++
>  gcc/internal-fn.cc  | 75 +++++++++++++++++++++++++++++++++++++++++++++
>  gcc/internal-fn.def |  6 ++++
>  gcc/optabs.def      |  2 ++
>  4 files changed, 110 insertions(+)
>
> diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
> index 5730bda80dc..3d242675c63 100644
> --- a/gcc/doc/md.texi
> +++ b/gcc/doc/md.texi
> @@ -5138,6 +5138,20 @@ Bit @var{i} of the mask is set if element @var{i} of the result should
>  be loaded from memory and clear if element @var{i} of the result should be undefined.
>  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
>
> +@cindex @code{mask_len_strided_load@var{m}} instruction pattern
> +@item @samp{mask_len_strided_load@var{m}}
> +Load several separate memory locations into a destination vector of mode @var{m}.
> +Operand 0 is a destination vector of mode @var{m}.
> +Operand 1 is a scalar base address and operand 2 is a scalar stride of Pmode.
> +operand 3 is mask operand, operand 4 is length operand and operand 5 is bias operand.
> +The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
> +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
> +For each element index i load address is operand 1 + @var{i} * operand 2.
> +Similar to mask_len_load, the instruction loads at most (operand 4 + operand 5) elements from memory.
> +Element @var{i} of the mask (operand 3) is set if element @var{i} of the result should
> +be loaded from memory and clear if element @var{i} of the result should be zero.
> +Mask elements @var{i} with @var{i} > (operand 4 + operand 5) are ignored.
> +
>  @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
>  @item @samp{scatter_store@var{m}@var{n}}
>  Store a vector of mode @var{m} into several distinct memory locations.
> @@ -5175,6 +5189,19 @@ at most (operand 6 + operand 7) elements of (operand 4) to memory.
>  Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
>  Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
>
> +@cindex @code{mask_len_strided_store@var{m}} instruction pattern
> +@item @samp{mask_len_strided_store@var{m}}
> +Store a vector of mode m into several distinct memory locations.
> +Operand 0 is a scalar base address and operand 1 is scalar stride of Pmode.
> +Operand 2 is the vector of values that should be stored, which is of mode @var{m}.
> +operand 3 is mask operand, operand 4 is length operand and operand 5 is bias operand.
> +The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
> +with an offset vector that is a @code{vec_series} with operand 1 as base and operand 1 as step.
> +For each element index i store address is operand 0 + @var{i} * operand 1.
> +Similar to mask_len_store, the instruction stores at most (operand 4 + operand 5) elements of mask (operand 3) to memory.
> +Element @var{i} of the mask is set if element @var{i} of (operand 3) should be stored.
> +Mask elements @var{i} with @var{i} > (operand 4 + operand 5) are ignored.
> +
>  @cindex @code{vec_set@var{m}} instruction pattern
>  @item @samp{vec_set@var{m}}
>  Set given field in the vector value.  Operand 0 is the vector to modify,
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 9c09026793f..f6e5329cd84 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -159,6 +159,7 @@ init_internal_fns ()
>  #define load_lanes_direct { -1, -1, false }
>  #define mask_load_lanes_direct { -1, -1, false }
>  #define gather_load_direct { 3, 1, false }
> +#define strided_load_direct { -1, -1, false }
>  #define len_load_direct { -1, -1, false }
>  #define mask_len_load_direct { -1, 4, false }
>  #define mask_store_direct { 3, 2, false }
> @@ -168,6 +169,7 @@ init_internal_fns ()
>  #define vec_cond_mask_len_direct { 1, 1, false }
>  #define vec_cond_direct { 2, 0, false }
>  #define scatter_store_direct { 3, 1, false }
> +#define strided_store_direct { 1, 1, false }
>  #define len_store_direct { 3, 3, false }
>  #define mask_len_store_direct { 4, 5, false }
>  #define vec_set_direct { 3, 3, false }
> @@ -3668,6 +3670,68 @@ expand_gather_load_optab_fn (internal_fn, gcall *stmt, direct_optab optab)
>      emit_move_insn (lhs_rtx, ops[0].value);
>  }
>
> +/* Expand MASK_LEN_STRIDED_LOAD call CALL by optab OPTAB.  */
> +
> +static void
> +expand_strided_load_optab_fn (ATTRIBUTE_UNUSED internal_fn, gcall *stmt,
> +                             direct_optab optab)
> +{
> +  tree lhs = gimple_call_lhs (stmt);
> +  tree base = gimple_call_arg (stmt, 0);
> +  tree stride = gimple_call_arg (stmt, 1);
> +
> +  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  rtx base_rtx = expand_normal (base);
> +  rtx stride_rtx = expand_normal (stride);
> +
> +  unsigned i = 0;
> +  class expand_operand ops[6];
> +  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
> +
> +  create_output_operand (&ops[i++], lhs_rtx, mode);
> +  create_address_operand (&ops[i++], base_rtx);
> +  create_address_operand (&ops[i++], stride_rtx);
> +
> +  insn_code icode = direct_optab_handler (optab, mode);
> +
> +  i = add_mask_and_len_args (ops, i, stmt);
> +  expand_insn (icode, i, ops);
> +
> +  if (!rtx_equal_p (lhs_rtx, ops[0].value))
> +    emit_move_insn (lhs_rtx, ops[0].value);
> +}
> +
> +/* Expand MASK_LEN_STRIDED_STORE call CALL by optab OPTAB.  */
> +
> +static void
> +expand_strided_store_optab_fn (ATTRIBUTE_UNUSED internal_fn, gcall *stmt,
> +                              direct_optab optab)
> +{
> +  internal_fn fn = gimple_call_internal_fn (stmt);
> +  int rhs_index = internal_fn_stored_value_index (fn);
> +
> +  tree base = gimple_call_arg (stmt, 0);
> +  tree stride = gimple_call_arg (stmt, 1);
> +  tree rhs = gimple_call_arg (stmt, rhs_index);
> +
> +  rtx base_rtx = expand_normal (base);
> +  rtx stride_rtx = expand_normal (stride);
> +  rtx rhs_rtx = expand_normal (rhs);
> +
> +  unsigned i = 0;
> +  class expand_operand ops[6];
> +  machine_mode mode = TYPE_MODE (TREE_TYPE (rhs));
> +
> +  create_address_operand (&ops[i++], base_rtx);
> +  create_address_operand (&ops[i++], stride_rtx);
> +  create_input_operand (&ops[i++], rhs_rtx, mode);
> +
> +  insn_code icode = direct_optab_handler (optab, mode);
> +  i = add_mask_and_len_args (ops, i, stmt);
> +
> +  expand_insn (icode, i, ops);
> +}
> +
>  /* Helper for expand_DIVMOD.  Return true if the sequence starting with
>     INSN contains any call insns or insns with {,U}{DIV,MOD} rtxes.  */
>
> @@ -4058,6 +4122,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
>  #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
>  #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
>  #define direct_gather_load_optab_supported_p convert_optab_supported_p
> +#define direct_strided_load_optab_supported_p direct_optab_supported_p
>  #define direct_len_load_optab_supported_p direct_optab_supported_p
>  #define direct_mask_len_load_optab_supported_p convert_optab_supported_p
>  #define direct_mask_store_optab_supported_p convert_optab_supported_p
> @@ -4066,6 +4131,7 @@ multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
>  #define direct_vec_cond_mask_optab_supported_p convert_optab_supported_p
>  #define direct_vec_cond_optab_supported_p convert_optab_supported_p
>  #define direct_scatter_store_optab_supported_p convert_optab_supported_p
> +#define direct_strided_store_optab_supported_p direct_optab_supported_p
>  #define direct_len_store_optab_supported_p direct_optab_supported_p
>  #define direct_mask_len_store_optab_supported_p convert_optab_supported_p
>  #define direct_while_optab_supported_p convert_optab_supported_p
> @@ -4723,6 +4789,8 @@ internal_fn_len_index (internal_fn fn)
>      case IFN_COND_LEN_XOR:
>      case IFN_COND_LEN_SHL:
>      case IFN_COND_LEN_SHR:
> +    case IFN_MASK_LEN_STRIDED_LOAD:
> +    case IFN_MASK_LEN_STRIDED_STORE:
>        return 4;
>
>      case IFN_COND_LEN_NEG:
> @@ -4817,6 +4885,10 @@ internal_fn_mask_index (internal_fn fn)
>      case IFN_MASK_LEN_STORE:
>        return 2;
>
> +    case IFN_MASK_LEN_STRIDED_LOAD:
> +    case IFN_MASK_LEN_STRIDED_STORE:
> +      return 3;
> +
>      case IFN_MASK_GATHER_LOAD:
>      case IFN_MASK_SCATTER_STORE:
>      case IFN_MASK_LEN_GATHER_LOAD:
> @@ -4840,6 +4912,9 @@ internal_fn_stored_value_index (internal_fn fn)
>  {
>    switch (fn)
>      {
> +    case IFN_MASK_LEN_STRIDED_STORE:
> +      return 2;
> +
>      case IFN_MASK_STORE:
>      case IFN_MASK_STORE_LANES:
>      case IFN_SCATTER_STORE:
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 25badbb86e5..b30a7a5b009 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -56,6 +56,7 @@ along with GCC; see the file COPYING3.  If not see
>     - mask_load_lanes: currently just vec_mask_load_lanes
>     - mask_len_load_lanes: currently just vec_mask_len_load_lanes
>     - gather_load: used for {mask_,mask_len_,}gather_load
> +   - strided_load: currently just mask_len_strided_load
>     - len_load: currently just len_load
>     - mask_len_load: currently just mask_len_load
>
> @@ -64,6 +65,7 @@ along with GCC; see the file COPYING3.  If not see
>     - mask_store_lanes: currently just vec_mask_store_lanes
>     - mask_len_store_lanes: currently just vec_mask_len_store_lanes
>     - scatter_store: used for {mask_,mask_len_,}scatter_store
> +   - strided_store: currently just mask_len_strided_store
>     - len_store: currently just len_store
>     - mask_len_store: currently just mask_len_store
>
> @@ -212,6 +214,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
>                        mask_gather_load, gather_load)
>  DEF_INTERNAL_OPTAB_FN (MASK_LEN_GATHER_LOAD, ECF_PURE,
>                        mask_len_gather_load, gather_load)
> +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_LOAD, ECF_PURE,
> +                      mask_len_strided_load, strided_load)
>
>  DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
>  DEF_INTERNAL_OPTAB_FN (MASK_LEN_LOAD, ECF_PURE, mask_len_load, mask_len_load)
> @@ -221,6 +225,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
>                        mask_scatter_store, scatter_store)
>  DEF_INTERNAL_OPTAB_FN (MASK_LEN_SCATTER_STORE, 0,
>                        mask_len_scatter_store, scatter_store)
> +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_STORE, 0,
> +                      mask_len_strided_store, strided_store)
>
>  DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
>  DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 3f2cb46aff8..630b1de8f97 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -539,4 +539,6 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
>  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
>  OPTAB_D (len_load_optab, "len_load_$a")
>  OPTAB_D (len_store_optab, "len_store_$a")
> +OPTAB_D (mask_len_strided_load_optab, "mask_len_strided_load_$a")
> +OPTAB_D (mask_len_strided_store_optab, "mask_len_strided_store_$a")
>  OPTAB_D (select_vl_optab, "select_vl$a")
> --
> 2.34.1
>
Li, Pan2 June 5, 2024, 1:18 a.m. UTC | #2
> Sorry if we have discussed this last year already - is there anything wrong
> with using a gather/scatter with a VEC_SERIES gimple/rtl def for the offset?

Thanks for the comments, it has been quite a while since the last discussion. Let me recall a little about it and keep you posted.

Pan

Li, Pan2 June 5, 2024, 7:50 a.m. UTC | #3
It is not easy to get the original context/history; I can only catch some shadow of it from the below patch, but not the full picture.

https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634683.html

It is reasonable to me to use gather/scatter with a VEC_SERIES offset, for example as below; I will have a try for this.

operand_0 = mask_gather_loadmn (ptr, offset, 1/0(sign/unsign), multiply, mask)
  offset = (vec_series:m base step) => base + i * step
  op_0[i] = memory[ptr + offset[i] * multiply] && mask[i]

operand_0 = mask_len_strided_load (ptr, stride, mask, len, bias).
  op_0[i] = memory[ptr + stride * i] && mask[i] && i < (len + bias)
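
In plain C, the element-wise semantics of the strided form sketched above
would be roughly as below.  This is only a hand-written reference sketch,
assuming a byte-granular stride and (for concreteness) zero for elements
that are masked off or beyond len + bias; the function and parameter names
are made up for illustration.

/* Reference semantics sketch for
     v = MASK_LEN_STRIDED_LOAD (ptr, stride, mask, len, bias);
   with STRIDE taken as a byte stride.  Masked-off elements are zeroed as
   in the md.texi text; elements beyond LEN + BIAS are zeroed here only
   for concreteness (the pattern leaves them unspecified).  */
static void
mask_len_strided_load_ref (int *dst, const char *ptr, long stride,
                           const unsigned char *mask, long len, long bias,
                           long nunits)
{
  for (long i = 0; i < nunits; i++)
    {
      if (i < len + bias && mask[i])
        dst[i] = *(const int *) (ptr + i * stride);
      else
        dst[i] = 0;
    }
}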

Pan

Li, Pan2 Oct. 17, 2024, 6:38 a.m. UTC | #4
It has been quite a while since the last discussion.
I recalled these materials recently and had a try in the risc-v backend.

void foo (int * __restrict a, int * __restrict b, int stride, int n)
{
    for (int i = 0; i < n; i++)
      a[i*stride] = b[i*stride] + 100;
}

We will have an expansion similar to the below for VEC_SERIES_EXPR + MASK_LEN_GATHER_LOAD.
There will be 8 insns after expand, which is not something try_combine can handle (at most 4 insns), if
my understanding is correct.

Thus, are there any other approaches instead of adding a new IFN? If we need to add a new IFN, can
we leverage match.pd to try to match the MASK_LEN_GATHER_LOAD (base, VEC_SERIES_EXPR, ...)
pattern and then emit the new IFN, like the sat alu patterns do?

Thanks a lot.

;; _58 = VEC_SERIES_EXPR <0, _57>;

(insn 17 16 18 (set (reg:DI 156 [ _56 ])
        (ashiftrt:DI (reg:DI 141 [ _54 ])
            (const_int 2 [0x2]))) -1
     (expr_list:REG_EQUAL (div:DI (reg:DI 141 [ _54 ])
            (const_int 4 [0x4]))
        (nil)))

(insn 18 17 19 (set (reg:DI 158)
        (unspec:DI [
                (const_int 32 [0x20])
            ] UNSPEC_VLMAX)) -1
     (nil))

(insn 19 18 20 (set (reg:RVVM1SI 157)
        (if_then_else:RVVM1SI (unspec:RVVMF32BI [
                    (const_vector:RVVMF32BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (reg:DI 158)
                    (const_int 2 [0x2]) repeated x2
                    (const_int 1 [0x1])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (vec_series:RVVM1SI (const_int 0 [0])
                (const_int 1 [0x1]))
            (unspec:RVVM1SI [
                    (reg:DI 0 zero)
                ] UNSPEC_VUNDEF))) -1
     (nil))

(insn 20 19 21 (set (reg:DI 160)
        (unspec:DI [
                (const_int 32 [0x20])
            ] UNSPEC_VLMAX)) -1
     (nil))

(insn 21 20 22 (set (reg:RVVM1SI 159)
        (if_then_else:RVVM1SI (unspec:RVVMF32BI [
                    (const_vector:RVVMF32BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (reg:DI 160)
                    (const_int 2 [0x2]) repeated x2
                    (const_int 1 [0x1])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (mult:RVVM1SI (vec_duplicate:RVVM1SI (subreg:SI (reg:DI 156 [ _56 ]) 0))
                (reg:RVVM1SI 157))
            (unspec:RVVM1SI [
                    (reg:DI 0 zero)
                ] UNSPEC_VUNDEF))) -1
     (nil))
...
;; vect__5.16_61 = .MASK_LEN_GATHER_LOAD (vectp_b.14_59, _58, 4, { 0, ... }, { -1, ... }, _73, 0);

(insn 27 26 28 (set (reg:RVVM2DI 161)
        (sign_extend:RVVM2DI (reg:RVVM1SI 145 [ _58 ]))) "strided_ld-st.c":4:22 -1
     (nil))

(insn 28 27 29 (set (reg:RVVM2DI 162)
        (ashift:RVVM2DI (reg:RVVM2DI 161)
            (const_int 2 [0x2]))) "strided_ld-st.c":4:22 -1
     (nil))

(insn 29 28 0 (set (reg:RVVM1SI 146 [ vect__5.16 ])
        (if_then_else:RVVM1SI (unspec:RVVMF32BI [
                    (const_vector:RVVMF32BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (reg:DI 149 [ _73 ])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (unspec:RVVM1SI [
                    (reg/v/f:DI 151 [ b ])
                    (mem:BLK (scratch) [0  A8])
                    (reg:RVVM2DI 162)
                ] UNSPEC_UNORDERED)
            (unspec:RVVM1SI [
                    (reg:DI 0 zero)
                ] UNSPEC_VUNDEF))) "strided_ld-st.c":4:22 -1
     (nil))
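
For comparison, with the IFN proposed in this patch the same access would
remain a single GIMPLE call, conceptually something like
vect__5.16_61 = .MASK_LEN_STRIDED_LOAD (vectp_b.14_59, _54, { -1, ... }, _73, 0)
(assuming _54 here is the byte stride that the shift and multiply above
reconstruct), which should expand to essentially one strided load
instruction instead of the insn sequence above.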

Pan


> @@ -4723,6 +4789,8 @@ internal_fn_len_index (internal_fn fn)
>      case IFN_COND_LEN_XOR:
>      case IFN_COND_LEN_SHL:
>      case IFN_COND_LEN_SHR:
> +    case IFN_MASK_LEN_STRIDED_LOAD:
> +    case IFN_MASK_LEN_STRIDED_STORE:
>        return 4;
>
>      case IFN_COND_LEN_NEG:
> @@ -4817,6 +4885,10 @@ internal_fn_mask_index (internal_fn fn)
>      case IFN_MASK_LEN_STORE:
>        return 2;
>
> +    case IFN_MASK_LEN_STRIDED_LOAD:
> +    case IFN_MASK_LEN_STRIDED_STORE:
> +      return 3;
> +
>      case IFN_MASK_GATHER_LOAD:
>      case IFN_MASK_SCATTER_STORE:
>      case IFN_MASK_LEN_GATHER_LOAD:
> @@ -4840,6 +4912,9 @@ internal_fn_stored_value_index (internal_fn fn)
>  {
>    switch (fn)
>      {
> +    case IFN_MASK_LEN_STRIDED_STORE:
> +      return 2;
> +
>      case IFN_MASK_STORE:
>      case IFN_MASK_STORE_LANES:
>      case IFN_SCATTER_STORE:
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 25badbb86e5..b30a7a5b009 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -56,6 +56,7 @@ along with GCC; see the file COPYING3.  If not see
>     - mask_load_lanes: currently just vec_mask_load_lanes
>     - mask_len_load_lanes: currently just vec_mask_len_load_lanes
>     - gather_load: used for {mask_,mask_len_,}gather_load
> +   - strided_load: currently just mask_len_strided_load
>     - len_load: currently just len_load
>     - mask_len_load: currently just mask_len_load
>
> @@ -64,6 +65,7 @@ along with GCC; see the file COPYING3.  If not see
>     - mask_store_lanes: currently just vec_mask_store_lanes
>     - mask_len_store_lanes: currently just vec_mask_len_store_lanes
>     - scatter_store: used for {mask_,mask_len_,}scatter_store
> +   - strided_store: currently just mask_len_strided_store
>     - len_store: currently just len_store
>     - mask_len_store: currently just mask_len_store
>
> @@ -212,6 +214,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
>                        mask_gather_load, gather_load)
>  DEF_INTERNAL_OPTAB_FN (MASK_LEN_GATHER_LOAD, ECF_PURE,
>                        mask_len_gather_load, gather_load)
> +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_LOAD, ECF_PURE,
> +                      mask_len_strided_load, strided_load)
>
>  DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
>  DEF_INTERNAL_OPTAB_FN (MASK_LEN_LOAD, ECF_PURE, mask_len_load, mask_len_load)
> @@ -221,6 +225,8 @@ DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
>                        mask_scatter_store, scatter_store)
>  DEF_INTERNAL_OPTAB_FN (MASK_LEN_SCATTER_STORE, 0,
>                        mask_len_scatter_store, scatter_store)
> +DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_STORE, 0,
> +                      mask_len_strided_store, strided_store)
>
>  DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
>  DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index 3f2cb46aff8..630b1de8f97 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -539,4 +539,6 @@ OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
>  OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
>  OPTAB_D (len_load_optab, "len_load_$a")
>  OPTAB_D (len_store_optab, "len_store_$a")
> +OPTAB_D (mask_len_strided_load_optab, "mask_len_strided_load_$a")
> +OPTAB_D (mask_len_strided_store_optab, "mask_len_strided_store_$a")
>  OPTAB_D (select_vl_optab, "select_vl$a")
> --
> 2.34.1
>
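To make the operand layout in the md.texi hunks above concrete, the scalar
reference model below sketches the intended load semantics.  It is
illustrative only and not part of the patch; the element type and the
function name are made up, and the zeroing of masked-off lanes and the
len/bias bound follow the documentation text.

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Sketch of v = MASK_LEN_STRIDED_LOAD (ptr, stride, mask, len, bias) for
   32-bit elements.  Element I is loaded from PTR + I * STRIDE (the stride
   is a byte offset, since the addresses are formed directly from the Pmode
   operand) and zeroed when MASK[I] is clear.  Elements at or beyond
   LEN + BIAS are left untouched here; the documentation only says the
   corresponding mask elements are ignored.  In GCC the bias is 0 or -1.  */
static void
strided_load_ref (int32_t *dest, const char *ptr, ptrdiff_t stride,
                  const bool *mask, ptrdiff_t len, ptrdiff_t bias,
                  ptrdiff_t nunits)
{
  for (ptrdiff_t i = 0; i < nunits && i < len + bias; i++)
    dest[i] = mask[i] ? *(const int32_t *) (ptr + i * stride) : 0;
}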
Richard Biener Oct. 17, 2024, 7:13 a.m. UTC | #5
On Thu, Oct 17, 2024 at 8:38 AM Li, Pan2 <pan2.li@intel.com> wrote:
>
> It is quite a while since the last discussion.
> I recalled these materials recently and had a try in the risc-v backend.
>
>    1   │ void foo (int * __restrict a, int * __restrict b, int stride, int n)
>    2   │ {
>    3   │     for (int i = 0; i < n; i++)
>    4   │       a[i*stride] = b[i*stride] + 100;
>    5   │ }
>
> We will have an expand similar to the below for VEC_SERIES_EXPR + MASK_LEN_GATHER_LOAD.
> There will be 8 insns after expand, which is beyond what try_combine can handle (at most 4 insns),
> if my understanding is correct.
>
> Thus, are there any other approaches instead of adding a new IFN?  If we do need a new IFN, can
> we leverage match.pd to try to match the MASK_LEN_GATHER_LOAD (base, VEC_SERIES_EXPR, ...)
> pattern and then emit the new IFN, like the SAT alu patterns do?

Adding an optab (and a direct internal fn) is fine I guess - it should
be modeled after the gather optab, with the vec_series offset made
implicit by the scalar stride.

Enabling it via match.pd looks possible, but possibly sub-optimal for
the costing side of the vectorizer - supporting it directly in the
vectorizer can be done later, though.

Richard.
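For reference, once the direct internal function and optab exist, a pass
could emit the call roughly as in the sketch below.  This is illustrative
only: gimple_build_call_internal and gimple_call_set_lhs are existing GCC
APIs, but the wrapper and its name are not part of the patch, and the
argument order simply follows the IFN as defined there.

/* Assumes the usual GCC internal headers.  */
#include "config.h"
#include "system.h"
#include "coretypes.h"
#include "tree.h"
#include "gimple.h"
#include "internal-fn.h"

/* Build v = .MASK_LEN_STRIDED_LOAD (base, stride, mask, len, bias).  */
static gcall *
build_mask_len_strided_load (tree lhs, tree base, tree stride, tree mask,
                             tree len, tree bias)
{
  gcall *call = gimple_build_call_internal (IFN_MASK_LEN_STRIDED_LOAD, 5,
                                            base, stride, mask, len, bias);
  gimple_call_set_lhs (call, lhs);
  return call;
}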

> Thanks a lot.
>
>  316   │ ;; _58 = VEC_SERIES_EXPR <0, _57>;
>  317   │
>  318   │ (insn 17 16 18 (set (reg:DI 156 [ _56 ])
>  319   │         (ashiftrt:DI (reg:DI 141 [ _54 ])
>  320   │             (const_int 2 [0x2]))) -1
>  321   │      (expr_list:REG_EQUAL (div:DI (reg:DI 141 [ _54 ])
>  322   │             (const_int 4 [0x4]))
>  323   │         (nil)))
>  324   │
>  325   │ (insn 18 17 19 (set (reg:DI 158)
>  326   │         (unspec:DI [
>  327   │                 (const_int 32 [0x20])
>  328   │             ] UNSPEC_VLMAX)) -1
>  329   │      (nil))
>  330   │
>  331   │ (insn 19 18 20 (set (reg:RVVM1SI 157)
>  332   │         (if_then_else:RVVM1SI (unspec:RVVMF32BI [
>  333   │                     (const_vector:RVVMF32BI repeat [
>  334   │                             (const_int 1 [0x1])
>  335   │                         ])
>  336   │                     (reg:DI 158)
>  337   │                     (const_int 2 [0x2]) repeated x2
>  338   │                     (const_int 1 [0x1])
>  339   │                     (reg:SI 66 vl)
>  340   │                     (reg:SI 67 vtype)
>  341   │                 ] UNSPEC_VPREDICATE)
>  342   │             (vec_series:RVVM1SI (const_int 0 [0])
>  343   │                 (const_int 1 [0x1]))
>  344   │             (unspec:RVVM1SI [
>  345   │                     (reg:DI 0 zero)
>  346   │                 ] UNSPEC_VUNDEF))) -1
>  347   │      (nil))
>  348   │
>  349   │ (insn 20 19 21 (set (reg:DI 160)
>  350   │         (unspec:DI [
>  351   │                 (const_int 32 [0x20])
>  352   │             ] UNSPEC_VLMAX)) -1
>  353   │      (nil))
>  354   │
>  355   │ (insn 21 20 22 (set (reg:RVVM1SI 159)
>  356   │         (if_then_else:RVVM1SI (unspec:RVVMF32BI [
>  357   │                     (const_vector:RVVMF32BI repeat [
>  358   │                             (const_int 1 [0x1])
>  359   │                         ])
>  360   │                     (reg:DI 160)
>  361   │                     (const_int 2 [0x2]) repeated x2
>  362   │                     (const_int 1 [0x1])
>  363   │                     (reg:SI 66 vl)
>  364   │                     (reg:SI 67 vtype)
>  365   │                 ] UNSPEC_VPREDICATE)
>  366   │             (mult:RVVM1SI (vec_duplicate:RVVM1SI (subreg:SI (reg:DI 156 [ _56 ]) 0))
>  367   │                 (reg:RVVM1SI 157))
>  368   │             (unspec:RVVM1SI [
>  369   │                     (reg:DI 0 zero)
>  370   │                 ] UNSPEC_VUNDEF))) -1
>  371   │      (nil))
>  ...
>  403   │ ;; vect__5.16_61 = .MASK_LEN_GATHER_LOAD (vectp_b.14_59, _58, 4, { 0, ... }, { -1, ... }, _73, 0);
>  404   │
>  405   │ (insn 27 26 28 (set (reg:RVVM2DI 161)
>  406   │         (sign_extend:RVVM2DI (reg:RVVM1SI 145 [ _58 ]))) "strided_ld-st.c":4:22 -1
>  407   │      (nil))
>  408   │
>  409   │ (insn 28 27 29 (set (reg:RVVM2DI 162)
>  410   │         (ashift:RVVM2DI (reg:RVVM2DI 161)
>  411   │             (const_int 2 [0x2]))) "strided_ld-st.c":4:22 -1
>  412   │      (nil))
>  413   │
>  414   │ (insn 29 28 0 (set (reg:RVVM1SI 146 [ vect__5.16 ])
>  415   │         (if_then_else:RVVM1SI (unspec:RVVMF32BI [
>  416   │                     (const_vector:RVVMF32BI repeat [
>  417   │                             (const_int 1 [0x1])
>  418   │                         ])
>  419   │                     (reg:DI 149 [ _73 ])
>  420   │                     (const_int 2 [0x2]) repeated x2
>  421   │                     (const_int 0 [0])
>  422   │                     (reg:SI 66 vl)
>  423   │                     (reg:SI 67 vtype)
>  424   │                 ] UNSPEC_VPREDICATE)
>  425   │             (unspec:RVVM1SI [
>  426   │                     (reg/v/f:DI 151 [ b ])
>  427   │                     (mem:BLK (scratch) [0  A8])
>  428   │                     (reg:RVVM2DI 162)
>  429   │                 ] UNSPEC_UNORDERED)
>  430   │             (unspec:RVVM1SI [
>  431   │                     (reg:DI 0 zero)
>  432   │                 ] UNSPEC_VUNDEF))) "strided_ld-st.c":4:22 -1
>  433   │      (nil))
>
> Pan
>
>
> -----Original Message-----
> From: Li, Pan2 <pan2.li@intel.com>
> Sent: Wednesday, June 5, 2024 3:50 PM
> To: Richard Biener <richard.guenther@gmail.com>; Richard Sandiford <richard.sandiford@arm.com>
> Cc: gcc-patches@gcc.gnu.org; juzhe.zhong@rivai.ai; kito.cheng@gmail.com; tamar.christina@arm.com
> Subject: RE: [PATCH v1] Internal-fn: Add new IFN mask_len_strided_load/store
>
> It is not easy to get the original context/history; I can only catch some shadow of it from the patch below, not the full picture.
>
> https://gcc.gnu.org/pipermail/gcc-patches/2023-October/634683.html
>
> It is reasonable to me to use gather/scatter with a VEC_SERIES, for example as below; I will have a try with this.
>
> operand_0 = mask_gather_loadmn (ptr, offset, 1/0(sign/unsign), multiply, mask)
>   offset = (vec_series:m base step) => base + i * step
>   op_0[i] = memory[ptr + offset[i] * multiply] && mask[i]
>
> operand_0 = mask_len_strided_load (ptr, stride, mask, len, bias)
>   op_0[i] = memory[ptr + stride * i] && mask[i] && i < (len + bias)
>
> Pan
>
> -----Original Message-----
> From: Li, Pan2
> Sent: Wednesday, June 5, 2024 9:18 AM
> To: Richard Biener <richard.guenther@gmail.com>; Richard Sandiford <richard.sandiford@arm.com>
> Cc: gcc-patches@gcc.gnu.org; juzhe.zhong@rivai.ai; kito.cheng@gmail.com; tamar.christina@arm.com
> Subject: RE: [PATCH v1] Internal-fn: Add new IFN mask_len_strided_load/store
>
> > Sorry if we have discussed this last year already - is there anything wrong
> > with using a gather/scatter with a VEC_SERIES gimple/rtl def for the offset?
>
> Thanks for the comments, it is quite a while since the last discussion. Let me recall a little about it and keep you posted.
>
> Pan
>
> -----Original Message-----
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Tuesday, June 4, 2024 9:22 PM
> To: Li, Pan2 <pan2.li@intel.com>; Richard Sandiford <richard.sandiford@arm.com>
> Cc: gcc-patches@gcc.gnu.org; juzhe.zhong@rivai.ai; kito.cheng@gmail.com; tamar.christina@arm.com
> Subject: Re: [PATCH v1] Internal-fn: Add new IFN mask_len_strided_load/store
>
> On Tue, May 28, 2024 at 5:15 AM <pan2.li@intel.com> wrote:
> >
> > From: Pan Li <pan2.li@intel.com>
> >
> > This patch would like to add new internal fun for the below 2 IFN.
> > * mask_len_strided_load
> > * mask_len_strided_store
> >
> > The GIMPLE v = MASK_LEN_STRIDED_LOAD (ptr, stride, mask, len, bias) will
> > be expanded into v = mask_len_strided_load (ptr, stried, mask, len, bias).
> >
> > The GIMPLE MASK_LEN_STRIED_STORE (ptr, stride, v, mask, len, bias)
> > be expanded into mask_len_stried_store (ptr, stride, v, mask, len, bias).
> >
> > The below test suites are passed for this patch:
> > * The x86 bootstrap test.
> > * The x86 fully regression test.
> > * The riscv fully regression test.
>
> Sorry if we have discussed this last year already - is there anything wrong
> with using a gather/scatter with a VEC_SERIES gimple/rtl def for the offset?
>
> Richard.
>
> > gcc/ChangeLog:
> >
> >         * doc/md.texi: Add description for mask_len_strided_load/store.
> >         * internal-fn.cc (strided_load_direct): New internal_fn define
> >         for strided_load_direct.
> >         (strided_store_direct): Ditto but for store.
> >         (expand_strided_load_optab_fn): New expand func for
> >         mask_len_strided_load.
> >         (expand_strided_store_optab_fn): Ditto but for store.
> >         (direct_strided_load_optab_supported_p): New define for load
> >         direct optab supported.
> >         (direct_strided_store_optab_supported_p): Ditto but for store.
> >         (internal_fn_len_index): Add len index for both load and store.
> >         (internal_fn_mask_index): Ditto but for mask index.
> >         (internal_fn_stored_value_index): Add stored index.
> >         * internal-fn.def (MASK_LEN_STRIDED_LOAD): New direct fn define
> >         for strided_load.
> >         (MASK_LEN_STRIDED_STORE): Ditto but for stride_store.
> >         * optabs.def (OPTAB_D): New optab define for load and store.
> >
> > Signed-off-by: Pan Li <pan2.li@intel.com>
> > Co-Authored-By: Juzhe-Zhong <juzhe.zhong@rivai.ai>
> > ---
> >  gcc/doc/md.texi     | 27 ++++++++++++++++
> >  gcc/internal-fn.cc  | 75 +++++++++++++++++++++++++++++++++++++++++++++
> >  gcc/internal-fn.def |  6 ++++
> >  gcc/optabs.def      |  2 ++
> >  4 files changed, 110 insertions(+)
> >
Li, Pan2 Oct. 17, 2024, 9:46 a.m. UTC | #6
Thanks Richard for the comments.

> Enabling it via match.pd looks possible, but possibly sub-optimal for
> the costing side of the vectorizer - supporting it directly in the
> vectorizer can be done later, though.

Sure, I will have a try in v2.

Pan

-----Original Message-----
From: Richard Biener <richard.guenther@gmail.com> 
Sent: Thursday, October 17, 2024 3:13 PM
To: Li, Pan2 <pan2.li@intel.com>
Cc: Richard Sandiford <richard.sandiford@arm.com>; gcc-patches@gcc.gnu.org; juzhe.zhong@rivai.ai; kito.cheng@gmail.com; tamar.christina@arm.com
Subject: Re: [PATCH v1] Internal-fn: Add new IFN mask_len_strided_load/store

On Thu, Oct 17, 2024 at 8:38 AM Li, Pan2 <pan2.li@intel.com> wrote:
>
> It is quite a while since the last discussion.
> I recalled these materials recently and had a try in the risc-v backend.
>
>    1   │ void foo (int * __restrict a, int * __restrict b, int stride, int n)
>    2   │ {
>    3   │     for (int i = 0; i < n; i++)
>    4   │       a[i*stride] = b[i*stride] + 100;
>    5   │ }
>
> We will have an expand similar to the below for VEC_SERIES_EXPR + MASK_LEN_GATHER_LOAD.
> There will be 8 insns after expand, which is beyond what try_combine can handle (at most 4 insns),
> if my understanding is correct.
>
> Thus, are there any other approaches instead of adding a new IFN?  If we do need a new IFN, can
> we leverage match.pd to try to match the MASK_LEN_GATHER_LOAD (base, VEC_SERIES_EXPR, ...)
> pattern and then emit the new IFN, like the SAT alu patterns do?

Adding an optab (and a direct internal fn) is fine I guess - it should
be modeled after the gather optab, with the vec_series offset made
implicit by the scalar stride.

Enabling it via match.pd looks possible, but possibly sub-optimal for
the costing side of the vectorizer - supporting it directly in the
vectorizer can be done later, though.

Richard.

diff mbox series

Patch

diff --git a/gcc/doc/md.texi b/gcc/doc/md.texi
index 5730bda80dc..3d242675c63 100644
--- a/gcc/doc/md.texi
+++ b/gcc/doc/md.texi
@@ -5138,6 +5138,20 @@  Bit @var{i} of the mask is set if element @var{i} of the result should
 be loaded from memory and clear if element @var{i} of the result should be undefined.
 Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
 
+@cindex @code{mask_len_strided_load@var{m}} instruction pattern
+@item @samp{mask_len_strided_load@var{m}}
+Load several separate memory locations into a destination vector of mode @var{m}.
+Operand 0 is a destination vector of mode @var{m}.
+Operand 1 is a scalar base address and operand 2 is a scalar stride of Pmode.
+Operand 3 is the mask operand, operand 4 is the length operand and operand 5 is the bias operand.
+The instruction can be seen as a special case of @code{mask_len_gather_load@var{m}@var{n}}
+with an offset vector that is a @code{vec_series} with operand 1 as base and operand 2 as step.
+For each element index @var{i} the load address is operand 1 + @var{i} * operand 2.
+Similar to @code{mask_len_load}, the instruction loads at most (operand 4 + operand 5) elements from memory.
+Element @var{i} of the mask (operand 3) is set if element @var{i} of the result should
+be loaded from memory and clear if element @var{i} of the result should be zero.
+Mask elements @var{i} with @var{i} > (operand 4 + operand 5) are ignored.
+
 @cindex @code{scatter_store@var{m}@var{n}} instruction pattern
 @item @samp{scatter_store@var{m}@var{n}}
 Store a vector of mode @var{m} into several distinct memory locations.
@@ -5175,6 +5189,19 @@  at most (operand 6 + operand 7) elements of (operand 4) to memory.
 Bit @var{i} of the mask is set if element @var{i} of (operand 4) should be stored.
 Mask elements @var{i} with @var{i} > (operand 6 + operand 7) are ignored.
 
+@cindex @code{mask_len_strided_store@var{m}} instruction pattern
+@item @samp{mask_len_strided_store@var{m}}
+Store a vector of mode @var{m} into several distinct memory locations.
+Operand 0 is a scalar base address and operand 1 is a scalar stride of Pmode.
+Operand 2 is the vector of values that should be stored, which is of mode @var{m}.
+Operand 3 is the mask operand, operand 4 is the length operand and operand 5 is the bias operand.
+The instruction can be seen as a special case of @code{mask_len_scatter_store@var{m}@var{n}}
+with an offset vector that is a @code{vec_series} with operand 0 as base and operand 1 as step.
+For each element index @var{i} the store address is operand 0 + @var{i} * operand 1.
+Similar to @code{mask_len_store}, the instruction stores at most (operand 4 + operand 5) elements of (operand 2) to memory.
+Element @var{i} of the mask (operand 3) is set if element @var{i} of (operand 2) should be stored.
+Mask elements @var{i} with @var{i} > (operand 4 + operand 5) are ignored.
+
 @cindex @code{vec_set@var{m}} instruction pattern
 @item @samp{vec_set@var{m}}
 Set given field in the vector value.  Operand 0 is the vector to modify,
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 9c09026793f..f6e5329cd84 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -159,6 +159,7 @@  init_internal_fns ()
 #define load_lanes_direct { -1, -1, false }
 #define mask_load_lanes_direct { -1, -1, false }
 #define gather_load_direct { 3, 1, false }
+#define strided_load_direct { -1, -1, false }
 #define len_load_direct { -1, -1, false }
 #define mask_len_load_direct { -1, 4, false }
 #define mask_store_direct { 3, 2, false }
@@ -168,6 +169,7 @@  init_internal_fns ()
 #define vec_cond_mask_len_direct { 1, 1, false }
 #define vec_cond_direct { 2, 0, false }
 #define scatter_store_direct { 3, 1, false }
+#define strided_store_direct { 1, 1, false }
 #define len_store_direct { 3, 3, false }
 #define mask_len_store_direct { 4, 5, false }
 #define vec_set_direct { 3, 3, false }
@@ -3668,6 +3670,68 @@  expand_gather_load_optab_fn (internal_fn, gcall *stmt, direct_optab optab)
     emit_move_insn (lhs_rtx, ops[0].value);
 }
 
+/* Expand MASK_LEN_STRIDED_LOAD call CALL by optab OPTAB.  */
+
+static void
+expand_strided_load_optab_fn (ATTRIBUTE_UNUSED internal_fn, gcall *stmt,
+			      direct_optab optab)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  tree base = gimple_call_arg (stmt, 0);
+  tree stride = gimple_call_arg (stmt, 1);
+
+  rtx lhs_rtx = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  rtx base_rtx = expand_normal (base);
+  rtx stride_rtx = expand_normal (stride);
+
+  unsigned i = 0;
+  class expand_operand ops[6];
+  machine_mode mode = TYPE_MODE (TREE_TYPE (lhs));
+
+  create_output_operand (&ops[i++], lhs_rtx, mode);
+  create_address_operand (&ops[i++], base_rtx);
+  create_address_operand (&ops[i++], stride_rtx);
+
+  insn_code icode = direct_optab_handler (optab, mode);
+
+  i = add_mask_and_len_args (ops, i, stmt);
+  expand_insn (icode, i, ops);
+
+  if (!rtx_equal_p (lhs_rtx, ops[0].value))
+    emit_move_insn (lhs_rtx, ops[0].value);
+}
+
+/* Expand MASK_LEN_STRIDED_STORE call CALL by optab OPTAB.  */
+
+static void
+expand_strided_store_optab_fn (ATTRIBUTE_UNUSED internal_fn, gcall *stmt,
+			       direct_optab optab)
+{
+  internal_fn fn = gimple_call_internal_fn (stmt);
+  int rhs_index = internal_fn_stored_value_index (fn);
+
+  tree base = gimple_call_arg (stmt, 0);
+  tree stride = gimple_call_arg (stmt, 1);
+  tree rhs = gimple_call_arg (stmt, rhs_index);
+
+  rtx base_rtx = expand_normal (base);
+  rtx stride_rtx = expand_normal (stride);
+  rtx rhs_rtx = expand_normal (rhs);
+
+  unsigned i = 0;
+  class expand_operand ops[6];
+  machine_mode mode = TYPE_MODE (TREE_TYPE (rhs));
+
+  create_address_operand (&ops[i++], base_rtx);
+  create_address_operand (&ops[i++], stride_rtx);
+  create_input_operand (&ops[i++], rhs_rtx, mode);
+
+  insn_code icode = direct_optab_handler (optab, mode);
+  i = add_mask_and_len_args (ops, i, stmt);
+
+  expand_insn (icode, i, ops);
+}
+
 /* Helper for expand_DIVMOD.  Return true if the sequence starting with
    INSN contains any call insns or insns with {,U}{DIV,MOD} rtxes.  */
 
@@ -4058,6 +4122,7 @@  multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_mask_load_lanes_optab_supported_p multi_vector_optab_supported_p
 #define direct_gather_load_optab_supported_p convert_optab_supported_p
+#define direct_strided_load_optab_supported_p direct_optab_supported_p
 #define direct_len_load_optab_supported_p direct_optab_supported_p
 #define direct_mask_len_load_optab_supported_p convert_optab_supported_p
 #define direct_mask_store_optab_supported_p convert_optab_supported_p
@@ -4066,6 +4131,7 @@  multi_vector_optab_supported_p (convert_optab optab, tree_pair types,
 #define direct_vec_cond_mask_optab_supported_p convert_optab_supported_p
 #define direct_vec_cond_optab_supported_p convert_optab_supported_p
 #define direct_scatter_store_optab_supported_p convert_optab_supported_p
+#define direct_strided_store_optab_supported_p direct_optab_supported_p
 #define direct_len_store_optab_supported_p direct_optab_supported_p
 #define direct_mask_len_store_optab_supported_p convert_optab_supported_p
 #define direct_while_optab_supported_p convert_optab_supported_p
@@ -4723,6 +4789,8 @@  internal_fn_len_index (internal_fn fn)
     case IFN_COND_LEN_XOR:
     case IFN_COND_LEN_SHL:
     case IFN_COND_LEN_SHR:
+    case IFN_MASK_LEN_STRIDED_LOAD:
+    case IFN_MASK_LEN_STRIDED_STORE:
       return 4;
 
     case IFN_COND_LEN_NEG:
@@ -4817,6 +4885,10 @@  internal_fn_mask_index (internal_fn fn)
     case IFN_MASK_LEN_STORE:
       return 2;
 
+    case IFN_MASK_LEN_STRIDED_LOAD:
+    case IFN_MASK_LEN_STRIDED_STORE:
+      return 3;
+
     case IFN_MASK_GATHER_LOAD:
     case IFN_MASK_SCATTER_STORE:
     case IFN_MASK_LEN_GATHER_LOAD:
@@ -4840,6 +4912,9 @@  internal_fn_stored_value_index (internal_fn fn)
 {
   switch (fn)
     {
+    case IFN_MASK_LEN_STRIDED_STORE:
+      return 2;
+
     case IFN_MASK_STORE:
     case IFN_MASK_STORE_LANES:
     case IFN_SCATTER_STORE:
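(Summary, not part of the patch text: with direct_optab_supported_p as the
support predicate, each new optab is queried per vector mode, and the
expanders above hand the target pattern a fixed operand order.  For the load,
operand 0 is the output vector, 1 the base address, 2 the stride, followed by
mask, length and bias; for the store, operand 0 is the base address, 1 the
stride, 2 the stored vector, again followed by mask, length and bias.  A
hedged sketch of the support query a middle-end user could make, where
vectype is an assumed vector type:)

  /* Illustrative only: asks whether the target provides
     mask_len_strided_store_<mode> for the mode of VECTYPE.  */
  if (direct_internal_fn_supported_p (IFN_MASK_LEN_STRIDED_STORE,
				      vectype, OPTIMIZE_FOR_SPEED))
    ...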
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 25badbb86e5..b30a7a5b009 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -56,6 +56,7 @@  along with GCC; see the file COPYING3.  If not see
    - mask_load_lanes: currently just vec_mask_load_lanes
    - mask_len_load_lanes: currently just vec_mask_len_load_lanes
    - gather_load: used for {mask_,mask_len_,}gather_load
+   - strided_load: currently just mask_len_strided_load
    - len_load: currently just len_load
    - mask_len_load: currently just mask_len_load
 
@@ -64,6 +65,7 @@  along with GCC; see the file COPYING3.  If not see
    - mask_store_lanes: currently just vec_mask_store_lanes
    - mask_len_store_lanes: currently just vec_mask_len_store_lanes
    - scatter_store: used for {mask_,mask_len_,}scatter_store
+   - strided_store: currently just mask_len_strided_store
    - len_store: currently just len_store
    - mask_len_store: currently just mask_len_store
 
@@ -212,6 +214,8 @@  DEF_INTERNAL_OPTAB_FN (MASK_GATHER_LOAD, ECF_PURE,
 		       mask_gather_load, gather_load)
 DEF_INTERNAL_OPTAB_FN (MASK_LEN_GATHER_LOAD, ECF_PURE,
 		       mask_len_gather_load, gather_load)
+DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_LOAD, ECF_PURE,
+		       mask_len_strided_load, strided_load)
 
 DEF_INTERNAL_OPTAB_FN (LEN_LOAD, ECF_PURE, len_load, len_load)
 DEF_INTERNAL_OPTAB_FN (MASK_LEN_LOAD, ECF_PURE, mask_len_load, mask_len_load)
@@ -221,6 +225,8 @@  DEF_INTERNAL_OPTAB_FN (MASK_SCATTER_STORE, 0,
 		       mask_scatter_store, scatter_store)
 DEF_INTERNAL_OPTAB_FN (MASK_LEN_SCATTER_STORE, 0,
 		       mask_len_scatter_store, scatter_store)
+DEF_INTERNAL_OPTAB_FN (MASK_LEN_STRIDED_STORE, 0,
+		       mask_len_strided_store, strided_store)
 
 DEF_INTERNAL_OPTAB_FN (MASK_STORE, 0, maskstore, mask_store)
 DEF_INTERNAL_OPTAB_FN (STORE_LANES, ECF_CONST, vec_store_lanes, store_lanes)
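(Note, not part of the patch: DEF_INTERNAL_OPTAB_FN ties
IFN_MASK_LEN_STRIDED_LOAD/STORE to the mask_len_strided_load/store optabs and
to the strided_load/strided_store direct types, so expansion is routed through
the new expand_strided_{load,store}_optab_fn routines above.  A backend then
only needs to provide patterns named after the optab with the "$a" mode
substitution, e.g. mask_len_strided_load_vnx4si and
mask_len_strided_store_vnx4si for an assumed VNx4SI vector mode.)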
diff --git a/gcc/optabs.def b/gcc/optabs.def
index 3f2cb46aff8..630b1de8f97 100644
--- a/gcc/optabs.def
+++ b/gcc/optabs.def
@@ -539,4 +539,6 @@  OPTAB_DC (vec_series_optab, "vec_series$a", VEC_SERIES)
 OPTAB_D (vec_shl_insert_optab, "vec_shl_insert_$a")
 OPTAB_D (len_load_optab, "len_load_$a")
 OPTAB_D (len_store_optab, "len_store_$a")
+OPTAB_D (mask_len_strided_load_optab, "mask_len_strided_load_$a")
+OPTAB_D (mask_len_strided_store_optab, "mask_len_strided_store_$a")
 OPTAB_D (select_vl_optab, "select_vl$a")
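
(An illustrative source-level example, not taken from the patch or its tests:
the kind of constant-stride access these optabs are meant to cover once a
vectorizer emits the strided IFNs under mask and length control.)

  /* Illustrative only: each dst[i * stride] / src[i * stride] access is a
     candidate for a strided store / strided load after vectorization.  */
  void
  scale_column (float *dst, const float *src, long n, long stride, float k)
  {
    for (long i = 0; i < n; i++)
      dst[i * stride] = src[i * stride] * k;
  }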