diff mbox series

[PATCHv4,expand] Add const0 move checking for CLEAR_BY_PIECES optabs

Message ID ce9ad908-d2b0-4a84-aa14-6ed25e0111bb@linux.ibm.com
State New
Headers show
Series [PATCHv4,expand] Add const0 move checking for CLEAR_BY_PIECES optabs | expand

Commit Message

HAO CHEN GUI Aug. 19, 2024, 2:20 a.m. UTC
Hi,
  This patch adds const0 move checking for CLEAR_BY_PIECES. The original
vec_duplicate handles duplicates of non-constant inputs. But 0 is a
constant. So even a platform doesn't support vec_duplicate, it could
still do clear by pieces if it supports const0 move by that mode.

  Compared to the previous version, the main change is to set up a
new function to generate const0 for certain modes and use the function
as by_pieces_constfn for CLEAR_BY_PIECES.
https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html

  Bootstrapped and tested on powerpc64-linux BE and LE with no
regressions.

  On i386, it got several regressions. One issue is the predicate of
V16QI move expand doesn't include const0. Thus V16QI mode can't be used
for clear by pieces with the patch. The second issue is the const0 is
passed directly to the move expand with the patch. Originally it is
forced to a pseudo and i386 can leverage the previous data to do
optimization.

  The patch also raises several regressions on aarch64. The V2x8QImode
replaces TImode to do 16-byte clear by pieces as V2x8QImode move expand
supports const0 and vector mode is preferable. I drafted a patch to
address the issue. It will be sent for review in a separate email.
Another problem is V8QImode replaces DImode to do 8-byte clear by pieces.
It seems cause different sequences of instructions but the actually
instructions are the same.

Thanks
Gui Haochen

ChangeLog
expand: Add const0 move checking for CLEAR_BY_PIECES optabs

vec_duplicate handles duplicates of non-constant inputs.  The 0 is a
constant.  So even a platform doesn't support vec_duplicate, it could
still do clear by pieces if it supports const0 move.  This patch adds
the checking.

gcc/
	* expr.cc (by_pieces_mode_supported_p): Add const0 move checking
	for CLEAR_BY_PIECES.
	(set_zero): New.
	(clear_by_pieces): Pass set_zero as by_pieces_constfn.

patch.diff

Comments

Richard Sandiford Aug. 19, 2024, 10:32 p.m. UTC | #1
HAO CHEN GUI <guihaoc@linux.ibm.com> writes:
> Hi,
>   This patch adds const0 move checking for CLEAR_BY_PIECES. The original
> vec_duplicate handles duplicates of non-constant inputs. But 0 is a
> constant. So even a platform doesn't support vec_duplicate, it could
> still do clear by pieces if it supports const0 move by that mode.
>
>   Compared to the previous version, the main change is to set up a
> new function to generate const0 for certain modes and use the function
> as by_pieces_constfn for CLEAR_BY_PIECES.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html
>
>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> regressions.
>
>   On i386, it got several regressions. One issue is the predicate of
> V16QI move expand doesn't include const0. Thus V16QI mode can't be used
> for clear by pieces with the patch. The second issue is the const0 is
> passed directly to the move expand with the patch. Originally it is
> forced to a pseudo and i386 can leverage the previous data to do
> optimization.

The patch looks good to me, but I suppose we'll need to decide what
to do about x86.

It's not obvious to me why movv16qi requires a nonimmediate_operand
source, especially since ix86_expand_vector_mode does have code to
cope with constant operand[1]s.  emit_move_insn_1 doesn't check the
predicates anyway, so the predicate will have little effect.

A workaround would be to check legitimate_constant_p instead of the
predicate, but I'm not sure that that should be necessary.

Has this already been discussed?  If not, we should loop in the x86
maintainers (but I didn't do that here in case it would be a repeat).

As far as the second issue goes, I suppose there are at least three
ways of handling shared constants:

(1) Force the zero into a register and leave later optimisations to
    propagate the zero where profitable.

(2) Emit stores of zero and expect a later pass to share constants
    where beneficial.

(3) Generate stores of zero and leave the target expanders to force
    constants into registers on the fly if reuse seems plausibly
    beneficial.

where (3) is a middle ground between (1) and (2).

Thanks,
Richard

>   The patch also raises several regressions on aarch64. The V2x8QImode
> replaces TImode to do 16-byte clear by pieces as V2x8QImode move expand
> supports const0 and vector mode is preferable. I drafted a patch to
> address the issue. It will be sent for review in a separate email.
> Another problem is V8QImode replaces DImode to do 8-byte clear by pieces.
> It seems cause different sequences of instructions but the actually
> instructions are the same.
>
> Thanks
> Gui Haochen
>
> ChangeLog
> expand: Add const0 move checking for CLEAR_BY_PIECES optabs
>
> vec_duplicate handles duplicates of non-constant inputs.  The 0 is a
> constant.  So even a platform doesn't support vec_duplicate, it could
> still do clear by pieces if it supports const0 move.  This patch adds
> the checking.
>
> gcc/
> 	* expr.cc (by_pieces_mode_supported_p): Add const0 move checking
> 	for CLEAR_BY_PIECES.
> 	(set_zero): New.
> 	(clear_by_pieces): Pass set_zero as by_pieces_constfn.
>
> patch.diff
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index ffbac513692..7199e0956f8 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -1014,14 +1014,20 @@ can_use_qi_vectors (by_pieces_operation op)
>  static bool
>  by_pieces_mode_supported_p (fixed_size_mode mode, by_pieces_operation op)
>  {
> -  if (optab_handler (mov_optab, mode) == CODE_FOR_nothing)
> +  enum insn_code icode = optab_handler (mov_optab, mode);
> +  if (icode == CODE_FOR_nothing)
>      return false;
>
> -  if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES)
> +  if (op == SET_BY_PIECES
>        && VECTOR_MODE_P (mode)
>        && optab_handler (vec_duplicate_optab, mode) == CODE_FOR_nothing)
>      return false;
>
> +  if (op == CLEAR_BY_PIECES
> +      && VECTOR_MODE_P (mode)
> +      && !insn_operand_matches (icode, 1, CONST0_RTX (mode)))
> +   return false;
> +
>    if (op == COMPARE_BY_PIECES
>        && !can_compare_p (EQ, mode, ccp_jump))
>      return false;
> @@ -1840,16 +1846,20 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
>      return to;
>  }
>
> +static rtx
> +set_zero (void *, void *, HOST_WIDE_INT, fixed_size_mode mode)
> +{
> +  return CONST0_RTX (mode);
> +}
> +
>  void
>  clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
>  {
>    if (len == 0)
>      return;
>
> -  /* Use builtin_memset_read_str to support vector mode broadcast.  */
> -  char c = 0;
> -  store_by_pieces_d data (to, builtin_memset_read_str, &c, len, align,
> -			  CLEAR_BY_PIECES);
> +  /* Use set_zero to generate const0 of centain mode.  */
> +  store_by_pieces_d data (to, set_zero, NULL, len, align, CLEAR_BY_PIECES);
>    data.run ();
>  }
HAO CHEN GUI Aug. 20, 2024, 6:12 a.m. UTC | #2
Hi,
  Add Hongtao Liu as the patch affects x86.

在 2024/8/20 6:32, Richard Sandiford 写道:
> HAO CHEN GUI <guihaoc@linux.ibm.com> writes:
>> Hi,
>>   This patch adds const0 move checking for CLEAR_BY_PIECES. The original
>> vec_duplicate handles duplicates of non-constant inputs. But 0 is a
>> constant. So even a platform doesn't support vec_duplicate, it could
>> still do clear by pieces if it supports const0 move by that mode.
>>
>>   Compared to the previous version, the main change is to set up a
>> new function to generate const0 for certain modes and use the function
>> as by_pieces_constfn for CLEAR_BY_PIECES.
>> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html
>>
>>   Bootstrapped and tested on powerpc64-linux BE and LE with no
>> regressions.
>>
>>   On i386, it got several regressions. One issue is the predicate of
>> V16QI move expand doesn't include const0. Thus V16QI mode can't be used
>> for clear by pieces with the patch. The second issue is the const0 is
>> passed directly to the move expand with the patch. Originally it is
>> forced to a pseudo and i386 can leverage the previous data to do
>> optimization.
> 
> The patch looks good to me, but I suppose we'll need to decide what
> to do about x86.
> 
> It's not obvious to me why movv16qi requires a nonimmediate_operand
> source, especially since ix86_expand_vector_mode does have code to
> cope with constant operand[1]s.  emit_move_insn_1 doesn't check the
> predicates anyway, so the predicate will have little effect.
> 
> A workaround would be to check legitimate_constant_p instead of the
> predicate, but I'm not sure that that should be necessary.
> 
> Has this already been discussed?  If not, we should loop in the x86
> maintainers (but I didn't do that here in case it would be a repeat).

I also noticed it. Not sure why movv16qi requires a
nonimmediate_operand, while ix86_expand_vector_mode could deal with
constant op. Looking forward to Hongtao's comments.

> 
> As far as the second issue goes, I suppose there are at least three
> ways of handling shared constants:
> 
> (1) Force the zero into a register and leave later optimisations to
>     propagate the zero where profitable.
The zero can be propagated into the store, but the address adjustment
may not be combined into insn properly. For instance, if zero is
forced to a register, "movv2x8qi" insn is generated. The address
adjustment becomes a separate insn as "movv2x8qi" insn doesn't support
d-from address. When zero is propagated, it converts "movv2x8qi" to
"movti". "movti" supports d-from as well as post/inc address. Probably,
the auto_inc_dec pass combines address adjustment insn into previous
"movti" to generate a post inc "movti". The expected optimization might
be to combine address adjustment insn into second "movit" and generate a
d-form "movti". It's a regression issue I found in aarch64.

Also we checks if const0 is supported for mov optab. But finally we
force the const0 to a register and generate a store with the register.
Seems it's not reasonable.

> 
> (2) Emit stores of zero and expect a later pass to share constants
>     where beneficial.
Not sure which pass can optimize it.

> 
> (3) Generate stores of zero and leave the target expanders to force
>     constants into registers on the fly if reuse seems plausibly
>     beneficial.
> 
The constant zero with different modes are not relevant. Not sure
which pass can optimize it. The compiler should be taught that
reg 102 can be expressed as a subreg of reg 100.

(insn 6 5 7 2 (set (reg:V32QI 100)
        (const_vector:V32QI [
                (const_int 0 [0]) repeated x32
            ]))

(insn 8 7 0 2 (set (reg:V16QI 102)
        (const_vector:V16QI [
                (const_int 0 [0]) repeated x16
            ]))

I tested a case with one 32-byte and one 16-byte memory clear on x86.
These two sets can't be optimized. They can be optimized only when they
are in the same memory clear operation (for example, a 48-byte memory
clear).

Thanks
Gui Haochen

> where (3) is a middle ground between (1) and (2).
> 
> Thanks,
> Richard
> 
>>   The patch also raises several regressions on aarch64. The V2x8QImode
>> replaces TImode to do 16-byte clear by pieces as V2x8QImode move expand
>> supports const0 and vector mode is preferable. I drafted a patch to
>> address the issue. It will be sent for review in a separate email.
>> Another problem is V8QImode replaces DImode to do 8-byte clear by pieces.
>> It seems cause different sequences of instructions but the actually
>> instructions are the same.
>>
>> Thanks
>> Gui Haochen
>>
>> ChangeLog
>> expand: Add const0 move checking for CLEAR_BY_PIECES optabs
>>
>> vec_duplicate handles duplicates of non-constant inputs.  The 0 is a
>> constant.  So even a platform doesn't support vec_duplicate, it could
>> still do clear by pieces if it supports const0 move.  This patch adds
>> the checking.
>>
>> gcc/
>> 	* expr.cc (by_pieces_mode_supported_p): Add const0 move checking
>> 	for CLEAR_BY_PIECES.
>> 	(set_zero): New.
>> 	(clear_by_pieces): Pass set_zero as by_pieces_constfn.
>>
>> patch.diff
>> diff --git a/gcc/expr.cc b/gcc/expr.cc
>> index ffbac513692..7199e0956f8 100644
>> --- a/gcc/expr.cc
>> +++ b/gcc/expr.cc
>> @@ -1014,14 +1014,20 @@ can_use_qi_vectors (by_pieces_operation op)
>>  static bool
>>  by_pieces_mode_supported_p (fixed_size_mode mode, by_pieces_operation op)
>>  {
>> -  if (optab_handler (mov_optab, mode) == CODE_FOR_nothing)
>> +  enum insn_code icode = optab_handler (mov_optab, mode);
>> +  if (icode == CODE_FOR_nothing)
>>      return false;
>>
>> -  if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES)
>> +  if (op == SET_BY_PIECES
>>        && VECTOR_MODE_P (mode)
>>        && optab_handler (vec_duplicate_optab, mode) == CODE_FOR_nothing)
>>      return false;
>>
>> +  if (op == CLEAR_BY_PIECES
>> +      && VECTOR_MODE_P (mode)
>> +      && !insn_operand_matches (icode, 1, CONST0_RTX (mode)))
>> +   return false;
>> +
>>    if (op == COMPARE_BY_PIECES
>>        && !can_compare_p (EQ, mode, ccp_jump))
>>      return false;
>> @@ -1840,16 +1846,20 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
>>      return to;
>>  }
>>
>> +static rtx
>> +set_zero (void *, void *, HOST_WIDE_INT, fixed_size_mode mode)
>> +{
>> +  return CONST0_RTX (mode);
>> +}
>> +
>>  void
>>  clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
>>  {
>>    if (len == 0)
>>      return;
>>
>> -  /* Use builtin_memset_read_str to support vector mode broadcast.  */
>> -  char c = 0;
>> -  store_by_pieces_d data (to, builtin_memset_read_str, &c, len, align,
>> -			  CLEAR_BY_PIECES);
>> +  /* Use set_zero to generate const0 of centain mode.  */
>> +  store_by_pieces_d data (to, set_zero, NULL, len, align, CLEAR_BY_PIECES);
>>    data.run ();
>>  }
Hongtao Liu Aug. 20, 2024, 6:50 a.m. UTC | #3
On Tue, Aug 20, 2024 at 2:12 PM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
>
> Hi,
>   Add Hongtao Liu as the patch affects x86.
>
> 在 2024/8/20 6:32, Richard Sandiford 写道:
> > HAO CHEN GUI <guihaoc@linux.ibm.com> writes:
> >> Hi,
> >>   This patch adds const0 move checking for CLEAR_BY_PIECES. The original
> >> vec_duplicate handles duplicates of non-constant inputs. But 0 is a
> >> constant. So even a platform doesn't support vec_duplicate, it could
> >> still do clear by pieces if it supports const0 move by that mode.
> >>
> >>   Compared to the previous version, the main change is to set up a
> >> new function to generate const0 for certain modes and use the function
> >> as by_pieces_constfn for CLEAR_BY_PIECES.
> >> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html
> >>
> >>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> >> regressions.
> >>
> >>   On i386, it got several regressions. One issue is the predicate of
> >> V16QI move expand doesn't include const0. Thus V16QI mode can't be used
> >> for clear by pieces with the patch. The second issue is the const0 is
> >> passed directly to the move expand with the patch. Originally it is
> >> forced to a pseudo and i386 can leverage the previous data to do
> >> optimization.
> >
> > The patch looks good to me, but I suppose we'll need to decide what
> > to do about x86.
> >
> > It's not obvious to me why movv16qi requires a nonimmediate_operand
> > source, especially since ix86_expand_vector_mode does have code to
> > cope with constant operand[1]s.  emit_move_insn_1 doesn't check the
> > predicates anyway, so the predicate will have little effect.
> >
> > A workaround would be to check legitimate_constant_p instead of the
> > predicate, but I'm not sure that that should be necessary.
> >
> > Has this already been discussed?  If not, we should loop in the x86
> > maintainers (but I didn't do that here in case it would be a repeat).
>
> I also noticed it. Not sure why movv16qi requires a
> nonimmediate_operand, while ix86_expand_vector_mode could deal with
> constant op. Looking forward to Hongtao's comments.
The code has been there since 2005 before I'm involved.
 It looks to me at the beginning both mov<mode> and
*mov<mode>_internal only support nonimmediate_operand for the
operands[1].
And r0-75606-g5656a184e83983 adjusted the nonimmediate_operand to
nonimmediate_or_sse_const_operand for *mov<mode>_internal, but not for
mov<mode>.
I think we can align the predicate between mov<mode> and *mov<mode>_internal.
I'll do some tests and reach back to you.
>
> >
> > As far as the second issue goes, I suppose there are at least three
> > ways of handling shared constants:
> >
> > (1) Force the zero into a register and leave later optimisations to
> >     propagate the zero where profitable.
> The zero can be propagated into the store, but the address adjustment
> may not be combined into insn properly. For instance, if zero is
> forced to a register, "movv2x8qi" insn is generated. The address
> adjustment becomes a separate insn as "movv2x8qi" insn doesn't support
> d-from address. When zero is propagated, it converts "movv2x8qi" to
> "movti". "movti" supports d-from as well as post/inc address. Probably,
> the auto_inc_dec pass combines address adjustment insn into previous
> "movti" to generate a post inc "movti". The expected optimization might
> be to combine address adjustment insn into second "movit" and generate a
> d-form "movti". It's a regression issue I found in aarch64.
>
> Also we checks if const0 is supported for mov optab. But finally we
> force the const0 to a register and generate a store with the register.
> Seems it's not reasonable.
>
> >
> > (2) Emit stores of zero and expect a later pass to share constants
> >     where beneficial.
> Not sure which pass can optimize it.
>
> >
> > (3) Generate stores of zero and leave the target expanders to force
> >     constants into registers on the fly if reuse seems plausibly
> >     beneficial.
> >
> The constant zero with different modes are not relevant. Not sure
> which pass can optimize it. The compiler should be taught that
> reg 102 can be expressed as a subreg of reg 100.
>
> (insn 6 5 7 2 (set (reg:V32QI 100)
>         (const_vector:V32QI [
>                 (const_int 0 [0]) repeated x32
>             ]))
>
> (insn 8 7 0 2 (set (reg:V16QI 102)
>         (const_vector:V16QI [
>                 (const_int 0 [0]) repeated x16
>             ]))
>
It's https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080, I have some
experimental patch which tries to eliminate those redundancy in cse.

> I tested a case with one 32-byte and one 16-byte memory clear on x86.
> These two sets can't be optimized. They can be optimized only when they
> are in the same memory clear operation (for example, a 48-byte memory
> clear).
>
> Thanks
> Gui Haochen
>
> > where (3) is a middle ground between (1) and (2).
> >
> > Thanks,
> > Richard
> >
> >>   The patch also raises several regressions on aarch64. The V2x8QImode
> >> replaces TImode to do 16-byte clear by pieces as V2x8QImode move expand
> >> supports const0 and vector mode is preferable. I drafted a patch to
> >> address the issue. It will be sent for review in a separate email.
> >> Another problem is V8QImode replaces DImode to do 8-byte clear by pieces.
> >> It seems cause different sequences of instructions but the actually
> >> instructions are the same.
> >>
> >> Thanks
> >> Gui Haochen
> >>
> >> ChangeLog
> >> expand: Add const0 move checking for CLEAR_BY_PIECES optabs
> >>
> >> vec_duplicate handles duplicates of non-constant inputs.  The 0 is a
> >> constant.  So even a platform doesn't support vec_duplicate, it could
> >> still do clear by pieces if it supports const0 move.  This patch adds
> >> the checking.
> >>
> >> gcc/
> >>      * expr.cc (by_pieces_mode_supported_p): Add const0 move checking
> >>      for CLEAR_BY_PIECES.
> >>      (set_zero): New.
> >>      (clear_by_pieces): Pass set_zero as by_pieces_constfn.
> >>
> >> patch.diff
> >> diff --git a/gcc/expr.cc b/gcc/expr.cc
> >> index ffbac513692..7199e0956f8 100644
> >> --- a/gcc/expr.cc
> >> +++ b/gcc/expr.cc
> >> @@ -1014,14 +1014,20 @@ can_use_qi_vectors (by_pieces_operation op)
> >>  static bool
> >>  by_pieces_mode_supported_p (fixed_size_mode mode, by_pieces_operation op)
> >>  {
> >> -  if (optab_handler (mov_optab, mode) == CODE_FOR_nothing)
> >> +  enum insn_code icode = optab_handler (mov_optab, mode);
> >> +  if (icode == CODE_FOR_nothing)
> >>      return false;
> >>
> >> -  if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES)
> >> +  if (op == SET_BY_PIECES
> >>        && VECTOR_MODE_P (mode)
> >>        && optab_handler (vec_duplicate_optab, mode) == CODE_FOR_nothing)
> >>      return false;
> >>
> >> +  if (op == CLEAR_BY_PIECES
> >> +      && VECTOR_MODE_P (mode)
> >> +      && !insn_operand_matches (icode, 1, CONST0_RTX (mode)))
> >> +   return false;
> >> +
> >>    if (op == COMPARE_BY_PIECES
> >>        && !can_compare_p (EQ, mode, ccp_jump))
> >>      return false;
> >> @@ -1840,16 +1846,20 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
> >>      return to;
> >>  }
> >>
> >> +static rtx
> >> +set_zero (void *, void *, HOST_WIDE_INT, fixed_size_mode mode)
> >> +{
> >> +  return CONST0_RTX (mode);
> >> +}
> >> +
> >>  void
> >>  clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
> >>  {
> >>    if (len == 0)
> >>      return;
> >>
> >> -  /* Use builtin_memset_read_str to support vector mode broadcast.  */
> >> -  char c = 0;
> >> -  store_by_pieces_d data (to, builtin_memset_read_str, &c, len, align,
> >> -                      CLEAR_BY_PIECES);
> >> +  /* Use set_zero to generate const0 of centain mode.  */
> >> +  store_by_pieces_d data (to, set_zero, NULL, len, align, CLEAR_BY_PIECES);
> >>    data.run ();
> >>  }
Hongtao Liu Aug. 21, 2024, 3:21 a.m. UTC | #4
On Tue, Aug 20, 2024 at 2:50 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Tue, Aug 20, 2024 at 2:12 PM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
> >
> > Hi,
> >   Add Hongtao Liu as the patch affects x86.
> >
> > 在 2024/8/20 6:32, Richard Sandiford 写道:
> > > HAO CHEN GUI <guihaoc@linux.ibm.com> writes:
> > >> Hi,
> > >>   This patch adds const0 move checking for CLEAR_BY_PIECES. The original
> > >> vec_duplicate handles duplicates of non-constant inputs. But 0 is a
> > >> constant. So even a platform doesn't support vec_duplicate, it could
> > >> still do clear by pieces if it supports const0 move by that mode.
> > >>
> > >>   Compared to the previous version, the main change is to set up a
> > >> new function to generate const0 for certain modes and use the function
> > >> as by_pieces_constfn for CLEAR_BY_PIECES.
> > >> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html
> > >>
> > >>   Bootstrapped and tested on powerpc64-linux BE and LE with no
> > >> regressions.
> > >>
> > >>   On i386, it got several regressions. One issue is the predicate of
> > >> V16QI move expand doesn't include const0. Thus V16QI mode can't be used
> > >> for clear by pieces with the patch. The second issue is the const0 is
> > >> passed directly to the move expand with the patch. Originally it is
> > >> forced to a pseudo and i386 can leverage the previous data to do
> > >> optimization.
> > >
> > > The patch looks good to me, but I suppose we'll need to decide what
> > > to do about x86.
> > >
> > > It's not obvious to me why movv16qi requires a nonimmediate_operand
> > > source, especially since ix86_expand_vector_mode does have code to
> > > cope with constant operand[1]s.  emit_move_insn_1 doesn't check the
> > > predicates anyway, so the predicate will have little effect.
> > >
> > > A workaround would be to check legitimate_constant_p instead of the
> > > predicate, but I'm not sure that that should be necessary.
> > >
> > > Has this already been discussed?  If not, we should loop in the x86
> > > maintainers (but I didn't do that here in case it would be a repeat).
> >
> > I also noticed it. Not sure why movv16qi requires a
> > nonimmediate_operand, while ix86_expand_vector_mode could deal with
> > constant op. Looking forward to Hongtao's comments.
> The code has been there since 2005 before I'm involved.
>  It looks to me at the beginning both mov<mode> and
> *mov<mode>_internal only support nonimmediate_operand for the
> operands[1].
> And r0-75606-g5656a184e83983 adjusted the nonimmediate_operand to
> nonimmediate_or_sse_const_operand for *mov<mode>_internal, but not for
> mov<mode>.
> I think we can align the predicate between mov<mode> and *mov<mode>_internal.
> I'll do some tests and reach back to you.
r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
rebase your patch and see if there's still the regressions.
> >
> > >
> > > As far as the second issue goes, I suppose there are at least three
> > > ways of handling shared constants:
> > >
> > > (1) Force the zero into a register and leave later optimisations to
> > >     propagate the zero where profitable.
> > The zero can be propagated into the store, but the address adjustment
> > may not be combined into insn properly. For instance, if zero is
> > forced to a register, "movv2x8qi" insn is generated. The address
> > adjustment becomes a separate insn as "movv2x8qi" insn doesn't support
> > d-from address. When zero is propagated, it converts "movv2x8qi" to
> > "movti". "movti" supports d-from as well as post/inc address. Probably,
> > the auto_inc_dec pass combines address adjustment insn into previous
> > "movti" to generate a post inc "movti". The expected optimization might
> > be to combine address adjustment insn into second "movit" and generate a
> > d-form "movti". It's a regression issue I found in aarch64.
> >
> > Also we checks if const0 is supported for mov optab. But finally we
> > force the const0 to a register and generate a store with the register.
> > Seems it's not reasonable.
> >
> > >
> > > (2) Emit stores of zero and expect a later pass to share constants
> > >     where beneficial.
> > Not sure which pass can optimize it.
> >
> > >
> > > (3) Generate stores of zero and leave the target expanders to force
> > >     constants into registers on the fly if reuse seems plausibly
> > >     beneficial.
> > >
> > The constant zero with different modes are not relevant. Not sure
> > which pass can optimize it. The compiler should be taught that
> > reg 102 can be expressed as a subreg of reg 100.
> >
> > (insn 6 5 7 2 (set (reg:V32QI 100)
> >         (const_vector:V32QI [
> >                 (const_int 0 [0]) repeated x32
> >             ]))
> >
> > (insn 8 7 0 2 (set (reg:V16QI 102)
> >         (const_vector:V16QI [
> >                 (const_int 0 [0]) repeated x16
> >             ]))
> >
> It's https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080, I have some
> experimental patch which tries to eliminate those redundancy in cse.
>
> > I tested a case with one 32-byte and one 16-byte memory clear on x86.
> > These two sets can't be optimized. They can be optimized only when they
> > are in the same memory clear operation (for example, a 48-byte memory
> > clear).
> >
> > Thanks
> > Gui Haochen
> >
> > > where (3) is a middle ground between (1) and (2).
> > >
> > > Thanks,
> > > Richard
> > >
> > >>   The patch also raises several regressions on aarch64. The V2x8QImode
> > >> replaces TImode to do 16-byte clear by pieces as V2x8QImode move expand
> > >> supports const0 and vector mode is preferable. I drafted a patch to
> > >> address the issue. It will be sent for review in a separate email.
> > >> Another problem is V8QImode replaces DImode to do 8-byte clear by pieces.
> > >> It seems cause different sequences of instructions but the actually
> > >> instructions are the same.
> > >>
> > >> Thanks
> > >> Gui Haochen
> > >>
> > >> ChangeLog
> > >> expand: Add const0 move checking for CLEAR_BY_PIECES optabs
> > >>
> > >> vec_duplicate handles duplicates of non-constant inputs.  The 0 is a
> > >> constant.  So even a platform doesn't support vec_duplicate, it could
> > >> still do clear by pieces if it supports const0 move.  This patch adds
> > >> the checking.
> > >>
> > >> gcc/
> > >>      * expr.cc (by_pieces_mode_supported_p): Add const0 move checking
> > >>      for CLEAR_BY_PIECES.
> > >>      (set_zero): New.
> > >>      (clear_by_pieces): Pass set_zero as by_pieces_constfn.
> > >>
> > >> patch.diff
> > >> diff --git a/gcc/expr.cc b/gcc/expr.cc
> > >> index ffbac513692..7199e0956f8 100644
> > >> --- a/gcc/expr.cc
> > >> +++ b/gcc/expr.cc
> > >> @@ -1014,14 +1014,20 @@ can_use_qi_vectors (by_pieces_operation op)
> > >>  static bool
> > >>  by_pieces_mode_supported_p (fixed_size_mode mode, by_pieces_operation op)
> > >>  {
> > >> -  if (optab_handler (mov_optab, mode) == CODE_FOR_nothing)
> > >> +  enum insn_code icode = optab_handler (mov_optab, mode);
> > >> +  if (icode == CODE_FOR_nothing)
> > >>      return false;
> > >>
> > >> -  if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES)
> > >> +  if (op == SET_BY_PIECES
> > >>        && VECTOR_MODE_P (mode)
> > >>        && optab_handler (vec_duplicate_optab, mode) == CODE_FOR_nothing)
> > >>      return false;
> > >>
> > >> +  if (op == CLEAR_BY_PIECES
> > >> +      && VECTOR_MODE_P (mode)
> > >> +      && !insn_operand_matches (icode, 1, CONST0_RTX (mode)))
> > >> +   return false;
> > >> +
> > >>    if (op == COMPARE_BY_PIECES
> > >>        && !can_compare_p (EQ, mode, ccp_jump))
> > >>      return false;
> > >> @@ -1840,16 +1846,20 @@ store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
> > >>      return to;
> > >>  }
> > >>
> > >> +static rtx
> > >> +set_zero (void *, void *, HOST_WIDE_INT, fixed_size_mode mode)
> > >> +{
> > >> +  return CONST0_RTX (mode);
> > >> +}
> > >> +
> > >>  void
> > >>  clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
> > >>  {
> > >>    if (len == 0)
> > >>      return;
> > >>
> > >> -  /* Use builtin_memset_read_str to support vector mode broadcast.  */
> > >> -  char c = 0;
> > >> -  store_by_pieces_d data (to, builtin_memset_read_str, &c, len, align,
> > >> -                      CLEAR_BY_PIECES);
> > >> +  /* Use set_zero to generate const0 of centain mode.  */
> > >> +  store_by_pieces_d data (to, set_zero, NULL, len, align, CLEAR_BY_PIECES);
> > >>    data.run ();
> > >>  }
>
>
>
> --
> BR,
> Hongtao
HAO CHEN GUI Aug. 22, 2024, 8:06 a.m. UTC | #5
Hi Hongtao,

在 2024/8/21 11:21, Hongtao Liu 写道:
> r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
> rebase your patch and see if there's still the regressions.

There's still regressions. The patch enables V16QI const0 store, but
it also enables V8QI const0 store. The vector mode is preferable than
scalar mode so that V8QI is used for 8-byte memory clear instead of
DI. It's sub-optimal.

Another issue is it takes lots of subreg to generate an all-zero
V16QI register sometime. As PR92080 has been fixed, it can't reuse
existing all-zero V16QI register.

(insn 16 15 17 (set (reg:V4SI 118)
        (const_vector:V4SI [
                (const_int 0 [0]) repeated x4
            ])) "auto-init-7.c":25:12 -1
     (nil))

(insn 17 16 18 (set (reg:V8HI 117)
        (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1
     (nil))

(insn 18 17 19 (set (reg:V16QI 116)
        (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1
     (nil))

(insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114)
                (const_int 12 [0xc])) [0 MEM <char[1:28]> [(void *)&temp3]+12 S16 A32])
        (reg:V16QI 116)) "auto-init-7.c":25:12 -1
     (nil))

Thanks
Gui Haochen
Hongtao Liu Aug. 23, 2024, 1:47 a.m. UTC | #6
On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
>
> Hi Hongtao,
>
> 在 2024/8/21 11:21, Hongtao Liu 写道:
> > r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
> > rebase your patch and see if there's still the regressions.
>
> There's still regressions. The patch enables V16QI const0 store, but
> it also enables V8QI const0 store. The vector mode is preferable than
> scalar mode so that V8QI is used for 8-byte memory clear instead of
> DI. It's sub-optimal.
Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT?
>
> Another issue is it takes lots of subreg to generate an all-zero
> V16QI register sometime. As PR92080 has been fixed, it can't reuse
> existing all-zero V16QI register.
>
> (insn 16 15 17 (set (reg:V4SI 118)
>         (const_vector:V4SI [
>                 (const_int 0 [0]) repeated x4
>             ])) "auto-init-7.c":25:12 -1
>      (nil))
>
> (insn 17 16 18 (set (reg:V8HI 117)
>         (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1
>      (nil))
>
> (insn 18 17 19 (set (reg:V16QI 116)
>         (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1
>      (nil))
>
> (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114)
>                 (const_int 12 [0xc])) [0 MEM <char[1:28]> [(void *)&temp3]+12 S16 A32])
>         (reg:V16QI 116)) "auto-init-7.c":25:12 -1
>      (nil))
I think those subregs can be simplified by later rtl passes?
>
> Thanks
> Gui Haochen
HAO CHEN GUI Aug. 23, 2024, 3:02 a.m. UTC | #7
Hi Hongtao,

在 2024/8/23 9:47, Hongtao Liu 写道:
> On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
>>
>> Hi Hongtao,
>>
>> 在 2024/8/21 11:21, Hongtao Liu 写道:
>>> r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
>>> rebase your patch and see if there's still the regressions.
>>
>> There's still regressions. The patch enables V16QI const0 store, but
>> it also enables V8QI const0 store. The vector mode is preferable than
>> scalar mode so that V8QI is used for 8-byte memory clear instead of
>> DI. It's sub-optimal.
> Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT?
Not sure if all targets prefer it. Richard & Jeff, what's your opinion?

IMHO, could we disable it from predicate or convert it to DI mode store
if V8QI const0 store is sub-optimal on i386?


>>
>> Another issue is it takes lots of subreg to generate an all-zero
>> V16QI register sometime. As PR92080 has been fixed, it can't reuse
>> existing all-zero V16QI register.
>>
>> (insn 16 15 17 (set (reg:V4SI 118)
>>         (const_vector:V4SI [
>>                 (const_int 0 [0]) repeated x4
>>             ])) "auto-init-7.c":25:12 -1
>>      (nil))
>>
>> (insn 17 16 18 (set (reg:V8HI 117)
>>         (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1
>>      (nil))
>>
>> (insn 18 17 19 (set (reg:V16QI 116)
>>         (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1
>>      (nil))
>>
>> (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114)
>>                 (const_int 12 [0xc])) [0 MEM <char[1:28]> [(void *)&temp3]+12 S16 A32])
>>         (reg:V16QI 116)) "auto-init-7.c":25:12 -1
>>      (nil))
> I think those subregs can be simplified by later rtl passes?

Here is the final dump. There are two all-zero 16-byte vector
registers. It can't figure out V4SI could be a subreg of V16QI.

(insn 14 56 15 2 (set (reg:V16QI 20 xmm0 [115])
        (const_vector:V16QI [
                (const_int 0 [0]) repeated x16
            ])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
     (nil))
(insn 15 14 16 2 (set (mem/c:V16QI (reg:DI 0 ax [114]) [0 MEM <char[1:28]> [(void *)&temp3]+0 S16 A128])
        (reg:V16QI 20 xmm0 [115])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
     (nil))
(insn 16 15 19 2 (set (reg:V4SI 20 xmm0 [118])
        (const_vector:V4SI [
                (const_int 0 [0]) repeated x4
            ])) "auto-init-7.c":25:12 2160 {movv4si_internal}
     (nil))
(insn 19 16 57 2 (set (mem/c:V16QI (plus:DI (reg:DI 0 ax [114])
                (const_int 12 [0xc])) [0 MEM <char[1:28]> [(void *)&temp3]+12 S16 A32])
        (reg:V16QI 20 xmm0 [116])) "auto-init-7.c":25:12 2154 {movv16qi_internal}

Thanks
Gui Haochen

>>
>> Thanks
>> Gui Haochen
> 
> 
>
Hongtao Liu Aug. 23, 2024, 3:47 a.m. UTC | #8
On Fri, Aug 23, 2024 at 11:03 AM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
>
> Hi Hongtao,
>
> 在 2024/8/23 9:47, Hongtao Liu 写道:
> > On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
> >>
> >> Hi Hongtao,
> >>
> >> 在 2024/8/21 11:21, Hongtao Liu 写道:
> >>> r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
> >>> rebase your patch and see if there's still the regressions.
> >>
> >> There's still regressions. The patch enables V16QI const0 store, but
> >> it also enables V8QI const0 store. The vector mode is preferable than
> >> scalar mode so that V8QI is used for 8-byte memory clear instead of
> >> DI. It's sub-optimal.
> > Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT?
> Not sure if all targets prefer it. Richard & Jeff, what's your opinion?
>
> IMHO, could we disable it from predicate or convert it to DI mode store
> if V8QI const0 store is sub-optimal on i386?
>
>
> >>
> >> Another issue is it takes lots of subreg to generate an all-zero
> >> V16QI register sometime. As PR92080 has been fixed, it can't reuse
> >> existing all-zero V16QI register.
Backend rtx_cost needs to be adjusted to prevent const0 propagation.
The current rtx_cost for const0 for i386 is 0, which will enable
propagation of const0.

   /* If MODE2 is appropriate for an MMX register, then tie
@@ -21588,10 +21590,12 @@ ix86_rtx_costs (rtx x, machine_mode mode,
int outer_code_i, int opno,
        case 0:
          break;
        case 1:  /* 0: xor eliminates false dependency */
-         *total = 0;
+         /* Add extra cost 1 to prevent propagation of CONST_VECTOR
+            for SET, which will enable more CSE optimization.  */
+         *total = 0 + (outer_code == SET);
          return true;
        default: /* -1: cmp contains false dependency */
-         *total = 1;
+         *total = 1 + (outer_code == SET);
          return true;
        }

the upper hunk should help for that.
> >>
> >> (insn 16 15 17 (set (reg:V4SI 118)
> >>         (const_vector:V4SI [
> >>                 (const_int 0 [0]) repeated x4
> >>             ])) "auto-init-7.c":25:12 -1
> >>      (nil))
> >>
> >> (insn 17 16 18 (set (reg:V8HI 117)
> >>         (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1
> >>      (nil))
> >>
> >> (insn 18 17 19 (set (reg:V16QI 116)
> >>         (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1
> >>      (nil))
> >>
> >> (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114)
> >>                 (const_int 12 [0xc])) [0 MEM <char[1:28]> [(void *)&temp3]+12 S16 A32])
> >>         (reg:V16QI 116)) "auto-init-7.c":25:12 -1
> >>      (nil))
> > I think those subregs can be simplified by later rtl passes?
>
> Here is the final dump. There are two all-zero 16-byte vector
> registers. It can't figure out V4SI could be a subreg of V16QI.
>
> (insn 14 56 15 2 (set (reg:V16QI 20 xmm0 [115])
>         (const_vector:V16QI [
>                 (const_int 0 [0]) repeated x16
>             ])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
>      (nil))
> (insn 15 14 16 2 (set (mem/c:V16QI (reg:DI 0 ax [114]) [0 MEM <char[1:28]> [(void *)&temp3]+0 S16 A128])
>         (reg:V16QI 20 xmm0 [115])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
>      (nil))
> (insn 16 15 19 2 (set (reg:V4SI 20 xmm0 [118])
>         (const_vector:V4SI [
>                 (const_int 0 [0]) repeated x4
>             ])) "auto-init-7.c":25:12 2160 {movv4si_internal}
>      (nil))
> (insn 19 16 57 2 (set (mem/c:V16QI (plus:DI (reg:DI 0 ax [114])
>                 (const_int 12 [0xc])) [0 MEM <char[1:28]> [(void *)&temp3]+12 S16 A32])
>         (reg:V16QI 20 xmm0 [116])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
>
> Thanks
> Gui Haochen
>
> >>
> >> Thanks
> >> Gui Haochen
> >
> >
> >
HAO CHEN GUI Aug. 23, 2024, 9:46 a.m. UTC | #9
Hi Hongtao,

在 2024/8/23 11:47, Hongtao Liu 写道:
> On Fri, Aug 23, 2024 at 11:03 AM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
>>
>> Hi Hongtao,
>>
>> 在 2024/8/23 9:47, Hongtao Liu 写道:
>>> On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
>>>>
>>>> Hi Hongtao,
>>>>
>>>> 在 2024/8/21 11:21, Hongtao Liu 写道:
>>>>> r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
>>>>> rebase your patch and see if there's still the regressions.
>>>>
>>>> There's still regressions. The patch enables V16QI const0 store, but
>>>> it also enables V8QI const0 store. The vector mode is preferable than
>>>> scalar mode so that V8QI is used for 8-byte memory clear instead of
>>>> DI. It's sub-optimal.
>>> Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT?
>> Not sure if all targets prefer it. Richard & Jeff, what's your opinion?
>>
>> IMHO, could we disable it from predicate or convert it to DI mode store
>> if V8QI const0 store is sub-optimal on i386?
>>
>>
>>>>
>>>> Another issue is it takes lots of subreg to generate an all-zero
>>>> V16QI register sometime. As PR92080 has been fixed, it can't reuse
>>>> existing all-zero V16QI register.
> Backend rtx_cost needs to be adjusted to prevent const0 propagation.
> The current rtx_cost for const0 for i386 is 0, which will enable
> propagation of const0.
> 
>    /* If MODE2 is appropriate for an MMX register, then tie
> @@ -21588,10 +21590,12 @@ ix86_rtx_costs (rtx x, machine_mode mode,
> int outer_code_i, int opno,
>         case 0:
>           break;
>         case 1:  /* 0: xor eliminates false dependency */
> -         *total = 0;
> +         /* Add extra cost 1 to prevent propagation of CONST_VECTOR
> +            for SET, which will enable more CSE optimization.  */
> +         *total = 0 + (outer_code == SET);
>           return true;
>         default: /* -1: cmp contains false dependency */
> -         *total = 1;
> +         *total = 1 + (outer_code == SET);
>           return true;
>         }
> 
> the upper hunk should help for that.
Sorry, I didn't get your point. Which problem it will fix? I tested
upper code. Nothing changed. Which kind of const0 propagation you want
to prevent?

Thanks
Gui Haochen

>>>>
>>>> (insn 16 15 17 (set (reg:V4SI 118)
>>>>         (const_vector:V4SI [
>>>>                 (const_int 0 [0]) repeated x4
>>>>             ])) "auto-init-7.c":25:12 -1
>>>>      (nil))
>>>>
>>>> (insn 17 16 18 (set (reg:V8HI 117)
>>>>         (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1
>>>>      (nil))
>>>>
>>>> (insn 18 17 19 (set (reg:V16QI 116)
>>>>         (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1
>>>>      (nil))
>>>>
>>>> (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114)
>>>>                 (const_int 12 [0xc])) [0 MEM <char[1:28]> [(void *)&temp3]+12 S16 A32])
>>>>         (reg:V16QI 116)) "auto-init-7.c":25:12 -1
>>>>      (nil))
>>> I think those subregs can be simplified by later rtl passes?
>>
>> Here is the final dump. There are two all-zero 16-byte vector
>> registers. It can't figure out V4SI could be a subreg of V16QI.
>>
>> (insn 14 56 15 2 (set (reg:V16QI 20 xmm0 [115])
>>         (const_vector:V16QI [
>>                 (const_int 0 [0]) repeated x16
>>             ])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
>>      (nil))
>> (insn 15 14 16 2 (set (mem/c:V16QI (reg:DI 0 ax [114]) [0 MEM <char[1:28]> [(void *)&temp3]+0 S16 A128])
>>         (reg:V16QI 20 xmm0 [115])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
>>      (nil))
>> (insn 16 15 19 2 (set (reg:V4SI 20 xmm0 [118])
>>         (const_vector:V4SI [
>>                 (const_int 0 [0]) repeated x4
>>             ])) "auto-init-7.c":25:12 2160 {movv4si_internal}
>>      (nil))
>> (insn 19 16 57 2 (set (mem/c:V16QI (plus:DI (reg:DI 0 ax [114])
>>                 (const_int 12 [0xc])) [0 MEM <char[1:28]> [(void *)&temp3]+12 S16 A32])
>>         (reg:V16QI 20 xmm0 [116])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
>>
>> Thanks
>> Gui Haochen
>>
>>>>
>>>> Thanks
>>>> Gui Haochen
>>>
>>>
>>>
> 
> 
>
Jeff Law Aug. 23, 2024, 1:37 p.m. UTC | #10
On 8/22/24 9:02 PM, HAO CHEN GUI wrote:
> Hi Hongtao,
> 
> 在 2024/8/23 9:47, Hongtao Liu 写道:
>> On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
>>>
>>> Hi Hongtao,
>>>
>>> 在 2024/8/21 11:21, Hongtao Liu 写道:
>>>> r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
>>>> rebase your patch and see if there's still the regressions.
>>>
>>> There's still regressions. The patch enables V16QI const0 store, but
>>> it also enables V8QI const0 store. The vector mode is preferable than
>>> scalar mode so that V8QI is used for 8-byte memory clear instead of
>>> DI. It's sub-optimal.
>> Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT?
> Not sure if all targets prefer it. Richard & Jeff, what's your opinion?
Sorry, I haven't been following.  That doesn't seem like a good test at 
the surface (why would HOST_BITS_PER_WIDE_INT matter here, that's a 
property of the host, not the target).

Additionally, selection of the "optimal" mode may be impossible as 
there's just not going to be enough context.  For a given target there 
may be cases where something like V16QI is good and for the same target 
cases where doing a series of DI accesses would be better.

So we have to pick sensible modes and give the targets ways to turn the 
knobs to hopefully get better code depending on the desired behavior of 
each (sub)target.

So how's that for a non-answer?  :-)


> 
> IMHO, could we disable it from predicate or convert it to DI mode store
> if V8QI const0 store is sub-optimal on i386?
I'd look for ways to allow the x86 port to control behavior.  Presumably 
the problem is the move-by-pieces code is emitting stores directly 
rather than going through an expander?


Jeff
Hongtao Liu Aug. 26, 2024, 2:11 a.m. UTC | #11
On Fri, Aug 23, 2024 at 5:46 PM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
>
> Hi Hongtao,
>
> 在 2024/8/23 11:47, Hongtao Liu 写道:
> > On Fri, Aug 23, 2024 at 11:03 AM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
> >>
> >> Hi Hongtao,
> >>
> >> 在 2024/8/23 9:47, Hongtao Liu 写道:
> >>> On Thu, Aug 22, 2024 at 4:06 PM HAO CHEN GUI <guihaoc@linux.ibm.com> wrote:
> >>>>
> >>>> Hi Hongtao,
> >>>>
> >>>> 在 2024/8/21 11:21, Hongtao Liu 写道:
> >>>>> r15-3058-gbb42c551905024 support const0 operand for movv16qi, please
> >>>>> rebase your patch and see if there's still the regressions.
> >>>>
> >>>> There's still regressions. The patch enables V16QI const0 store, but
> >>>> it also enables V8QI const0 store. The vector mode is preferable than
> >>>> scalar mode so that V8QI is used for 8-byte memory clear instead of
> >>>> DI. It's sub-optimal.
> >>> Could we check if mode_size is greater than HOST_BITS_PER_WIDE_INT?
> >> Not sure if all targets prefer it. Richard & Jeff, what's your opinion?
> >>
> >> IMHO, could we disable it from predicate or convert it to DI mode store
> >> if V8QI const0 store is sub-optimal on i386?
> >>
> >>
> >>>>
> >>>> Another issue is it takes lots of subreg to generate an all-zero
> >>>> V16QI register sometime. As PR92080 has been fixed, it can't reuse
> >>>> existing all-zero V16QI register.
> > Backend rtx_cost needs to be adjusted to prevent const0 propagation.
> > The current rtx_cost for const0 for i386 is 0, which will enable
> > propagation of const0.
> >
> >    /* If MODE2 is appropriate for an MMX register, then tie
> > @@ -21588,10 +21590,12 @@ ix86_rtx_costs (rtx x, machine_mode mode,
> > int outer_code_i, int opno,
> >         case 0:
> >           break;
> >         case 1:  /* 0: xor eliminates false dependency */
> > -         *total = 0;
> > +         /* Add extra cost 1 to prevent propagation of CONST_VECTOR
> > +            for SET, which will enable more CSE optimization.  */
> > +         *total = 0 + (outer_code == SET);
> >           return true;
> >         default: /* -1: cmp contains false dependency */
> > -         *total = 1;
> > +         *total = 1 + (outer_code == SET);
> >           return true;
> >         }
> >
> > the upper hunk should help for that.
> Sorry, I didn't get your point. Which problem it will fix? I tested
> upper code. Nothing changed. Which kind of const0 propagation you want
> to prevent?
The patch itself doesn't enable CSE for const0_rtx, but it's needed
after cse_insn recognizes CONST0_RTX with a different mode and
replaces them with subreg.
I thought you had changed the cse_insn part.
 On the other hand, pxor is cheap, what matters more is the CSE of
broadcasting the same value to different modes. i.e.

__m512i sinkz;
__m256i sinky;
void foo(char c) {
    sinkz = _mm512_set1_epi8(c);
    sinky = _mm256_set1_epi8(c);
}

>
> Thanks
> Gui Haochen
>
> >>>>
> >>>> (insn 16 15 17 (set (reg:V4SI 118)
> >>>>         (const_vector:V4SI [
> >>>>                 (const_int 0 [0]) repeated x4
> >>>>             ])) "auto-init-7.c":25:12 -1
> >>>>      (nil))
> >>>>
> >>>> (insn 17 16 18 (set (reg:V8HI 117)
> >>>>         (subreg:V8HI (reg:V4SI 118) 0)) "auto-init-7.c":25:12 -1
> >>>>      (nil))
> >>>>
> >>>> (insn 18 17 19 (set (reg:V16QI 116)
> >>>>         (subreg:V16QI (reg:V8HI 117) 0)) "auto-init-7.c":25:12 -1
> >>>>      (nil))
> >>>>
> >>>> (insn 19 18 0 (set (mem/c:V16QI (plus:DI (reg:DI 114)
> >>>>                 (const_int 12 [0xc])) [0 MEM <char[1:28]> [(void *)&temp3]+12 S16 A32])
> >>>>         (reg:V16QI 116)) "auto-init-7.c":25:12 -1
> >>>>      (nil))
> >>> I think those subregs can be simplified by later rtl passes?
> >>
> >> Here is the final dump. There are two all-zero 16-byte vector
> >> registers. It can't figure out V4SI could be a subreg of V16QI.
> >>
> >> (insn 14 56 15 2 (set (reg:V16QI 20 xmm0 [115])
> >>         (const_vector:V16QI [
> >>                 (const_int 0 [0]) repeated x16
> >>             ])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
> >>      (nil))
> >> (insn 15 14 16 2 (set (mem/c:V16QI (reg:DI 0 ax [114]) [0 MEM <char[1:28]> [(void *)&temp3]+0 S16 A128])
> >>         (reg:V16QI 20 xmm0 [115])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
> >>      (nil))
> >> (insn 16 15 19 2 (set (reg:V4SI 20 xmm0 [118])
> >>         (const_vector:V4SI [
> >>                 (const_int 0 [0]) repeated x4
> >>             ])) "auto-init-7.c":25:12 2160 {movv4si_internal}
> >>      (nil))
> >> (insn 19 16 57 2 (set (mem/c:V16QI (plus:DI (reg:DI 0 ax [114])
> >>                 (const_int 12 [0xc])) [0 MEM <char[1:28]> [(void *)&temp3]+12 S16 A32])
> >>         (reg:V16QI 20 xmm0 [116])) "auto-init-7.c":25:12 2154 {movv16qi_internal}
> >>
> >> Thanks
> >> Gui Haochen
> >>
> >>>>
> >>>> Thanks
> >>>> Gui Haochen
> >>>
> >>>
> >>>
> >
> >
> >
diff mbox series

Patch

diff --git a/gcc/expr.cc b/gcc/expr.cc
index ffbac513692..7199e0956f8 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -1014,14 +1014,20 @@  can_use_qi_vectors (by_pieces_operation op)
 static bool
 by_pieces_mode_supported_p (fixed_size_mode mode, by_pieces_operation op)
 {
-  if (optab_handler (mov_optab, mode) == CODE_FOR_nothing)
+  enum insn_code icode = optab_handler (mov_optab, mode);
+  if (icode == CODE_FOR_nothing)
     return false;

-  if ((op == SET_BY_PIECES || op == CLEAR_BY_PIECES)
+  if (op == SET_BY_PIECES
       && VECTOR_MODE_P (mode)
       && optab_handler (vec_duplicate_optab, mode) == CODE_FOR_nothing)
     return false;

+  if (op == CLEAR_BY_PIECES
+      && VECTOR_MODE_P (mode)
+      && !insn_operand_matches (icode, 1, CONST0_RTX (mode)))
+   return false;
+
   if (op == COMPARE_BY_PIECES
       && !can_compare_p (EQ, mode, ccp_jump))
     return false;
@@ -1840,16 +1846,20 @@  store_by_pieces (rtx to, unsigned HOST_WIDE_INT len,
     return to;
 }

+static rtx
+set_zero (void *, void *, HOST_WIDE_INT, fixed_size_mode mode)
+{
+  return CONST0_RTX (mode);
+}
+
 void
 clear_by_pieces (rtx to, unsigned HOST_WIDE_INT len, unsigned int align)
 {
   if (len == 0)
     return;

-  /* Use builtin_memset_read_str to support vector mode broadcast.  */
-  char c = 0;
-  store_by_pieces_d data (to, builtin_memset_read_str, &c, len, align,
-			  CLEAR_BY_PIECES);
+  /* Use set_zero to generate const0 of centain mode.  */
+  store_by_pieces_d data (to, set_zero, NULL, len, align, CLEAR_BY_PIECES);
   data.run ();
 }