diff mbox series

[RFC,aarch64] Implement 16-byte vector mode const0 store by TImode

Message ID 93661e31-a4dd-4c18-a3ad-d2d42010e5af@linux.ibm.com
State New
Headers show
Series [RFC,aarch64] Implement 16-byte vector mode const0 store by TImode | expand

Commit Message

HAO CHEN GUI Aug. 14, 2024, 6:26 a.m. UTC
Hi,
  I submitted a patch to change the mode checking for
CLEAR_BY_PIECES.
https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html

  It causes some regressions on aarch64. With the patch,
V2x8QImode is used to do clear by pieces instead of TImode as
vector mode is preferable and V2x8QImode supports const0 store.
Thus the efficient "stp" instructions can't be generated.

  I drafted following patch to fix the problem. It can fix
regressions found in memset-corner-cases.c, memset-q-reg.c,
auto-init-padding-11.c and auto-init-padding-5.c.

  Not sure if it should be done on all 16-byte vector modes.
Also not sure if the patch is proper. So I send this RFC email.

Thanks
Gui Haochen

ChangeLog
aarch64: Implement 16-byte vector mode const0 store by TImode

gcc/
	* config/aarch64/aarch64-simd.md (mov<mode> for VSTRUCT_QD):
	Expand V2x8QImode const0 store by TImode.


patch.diff

Comments

Richard Sandiford Aug. 14, 2024, 8:20 a.m. UTC | #1
HAO CHEN GUI <guihaoc@linux.ibm.com> writes:
> Hi,
>   I submitted a patch to change the mode checking for
> CLEAR_BY_PIECES.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html
>
>   It causes some regressions on aarch64. With the patch,
> V2x8QImode is used to do clear by pieces instead of TImode as
> vector mode is preferable and V2x8QImode supports const0 store.
> Thus the efficient "stp" instructions can't be generated.
>
>   I drafted following patch to fix the problem. It can fix
> regressions found in memset-corner-cases.c, memset-q-reg.c,
> auto-init-padding-11.c and auto-init-padding-5.c.
>
>   Not sure if it should be done on all 16-byte vector modes.
> Also not sure if the patch is proper. So I send this RFC email.
>
> Thanks
> Gui Haochen
>
> ChangeLog
> aarch64: Implement 16-byte vector mode const0 store by TImode
>
> gcc/
> 	* config/aarch64/aarch64-simd.md (mov<mode> for VSTRUCT_QD):
> 	Expand V2x8QImode const0 store by TImode.
>
>
> patch.diff
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 01b084d8ccb..8aa72940b12 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -7766,7 +7766,14 @@ (define_expand "mov<mode>"
>  	(match_operand:VSTRUCT_QD 1 "general_operand"))]
>    "TARGET_FLOAT"
>  {
> -  if (can_create_pseudo_p ())
> +  if (<MODE>mode == V2x8QImode
> +      && operands[1] == CONST0_RTX (V2x8QImode)
> +      && MEM_P (operands[0]))
> +    {
> +      operands[0] = adjust_address (operands[0], TImode, 0);
> +      operands[1] = CONST0_RTX (TImode);
> +    }

Interesting idea.  And the current handling of zeros certainly isn't
optimised.  For:

  void f(int8x8x2_t *ptr) { *ptr = (int8x8x2_t) {}; }

the patch changes:

        adrp    x1, .LC0
        add     x1, x1, :lo12:.LC0
        ld1     {v30.8b - v31.8b}, [x1]
        st1     {v30.8b - v31.8b}, [x0]
        ret
        ...
.LC0:
        ...lots of zeros...

to:

        stp     xzr, xzr, [x0]
        ret

which is a vast improvement.  We could of course fix that in the move
patterns (and maybe we should), but the point remains that zeroing N
bytes doesn't carry any real mode information.  We should just use the
best N-byte mode.

The only difficulty I can see is that, for big-endian targets, we
allow V8xQImode addresses to be any 7-bit scaled offset:

	  if (aarch64_advsimd_partial_struct_mode_p (mode)
	      && known_eq (GET_MODE_SIZE (mode), 16))
	    return aarch64_offset_7bit_signed_scaled_p (DImode, offset);

whereas for TImode we require:

	  if (mode == TImode || mode == TFmode || mode == TDmode)
	    return (aarch64_offset_7bit_signed_scaled_p (DImode, offset)
		    && (aarch64_offset_9bit_signed_unscaled_p (mode, offset)
			|| offset_12bit_unsigned_scaled_p (mode, offset)));

So for big-endian, there are some immediate offsets that are valid
for V8xQImode but not for TImode.  This isn't a problem before register
allocation, because the adjust_address will take care of it.  But it
could lead to an ICE after register allocation.  Testing:

      && (can_create_pseudo_p ()
	  || memory_address_p (TImode, XEXP (operands[0], 0))))

would avoid that.  (TBH I'm not sure this would ever trigger, i.e.
whether the move patterns would ever be asked to store zero to
memory after register allocation, but it does test the precondition
on using adjust_address.)

Like you say, the same approach should work for all 16-byte modes.
And since it's such an improvement, I think we should use it :)

Taking all that together, could you change the condition to:

  if (known_eq (GET_MODE_SIZE (<MODE>mode), 16)
      && operands[1] == CONST0_RTX (<MODE>mode)
      && MEM_P (operands[0])
      && (can_create_pseudo_p ()
	  || memory_address_p (TImode, XEXP (operands[0], 0))))

The patch is OK from my POV with that change, independently of the
CLEAR_BY_PIECES patch, but please give others 24 hours to comment.

(Once the patch is in, I'll follow up with some tests for arm_neon.h,
to defend the improvement above.)

Thanks,
Richard

> +  else if (can_create_pseudo_p ())
>      {
>        if (GET_CODE (operands[0]) != REG)
>  	operands[1] = force_reg (<MODE>mode, operands[1]);
diff mbox series

Patch

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 01b084d8ccb..8aa72940b12 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -7766,7 +7766,14 @@  (define_expand "mov<mode>"
 	(match_operand:VSTRUCT_QD 1 "general_operand"))]
   "TARGET_FLOAT"
 {
-  if (can_create_pseudo_p ())
+  if (<MODE>mode == V2x8QImode
+      && operands[1] == CONST0_RTX (V2x8QImode)
+      && MEM_P (operands[0]))
+    {
+      operands[0] = adjust_address (operands[0], TImode, 0);
+      operands[1] = CONST0_RTX (TImode);
+    }
+  else if (can_create_pseudo_p ())
     {
       if (GET_CODE (operands[0]) != REG)
 	operands[1] = force_reg (<MODE>mode, operands[1]);