Message ID | 93661e31-a4dd-4c18-a3ad-d2d42010e5af@linux.ibm.com
---|---
State | New
Series | [RFC,aarch64] Implement 16-byte vector mode const0 store by TImode
HAO CHEN GUI <guihaoc@linux.ibm.com> writes:
> Hi,
>   I submitted a patch to change the mode checking for
> CLEAR_BY_PIECES.
> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660344.html
>
> It causes some regressions on aarch64.  With the patch,
> V2x8QImode is used to do clear by pieces instead of TImode as
> vector mode is preferable and V2x8QImode supports const0 store.
> Thus the efficient "stp" instructions can't be generated.
>
> I drafted following patch to fix the problem.  It can fix
> regressions found in memset-corner-cases.c, memset-q-reg.c,
> auto-init-padding-11.c and auto-init-padding-5.c.
>
> Not sure if it should be done on all 16-byte vector modes.
> Also not sure if the patch is proper.  So I send this RFC email.
>
> Thanks
> Gui Haochen
>
> ChangeLog
> aarch64: Implement 16-byte vector mode const0 store by TImode
>
> gcc/
> 	* config/aarch64/aarch64-simd.md (mov<mode> for VSTRUCT_QD):
> 	Expand V2x8QImode const0 store by TImode.
>
>
> patch.diff
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 01b084d8ccb..8aa72940b12 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -7766,7 +7766,14 @@ (define_expand "mov<mode>"
> 	(match_operand:VSTRUCT_QD 1 "general_operand"))]
>   "TARGET_FLOAT"
> {
> -  if (can_create_pseudo_p ())
> +  if (<MODE>mode == V2x8QImode
> +      && operands[1] == CONST0_RTX (V2x8QImode)
> +      && MEM_P (operands[0]))
> +    {
> +      operands[0] = adjust_address (operands[0], TImode, 0);
> +      operands[1] = CONST0_RTX (TImode);
> +    }

Interesting idea.  And the current handling of zeros certainly isn't
optimised.  For:

void f(int8x8x2_t *ptr) { *ptr = (int8x8x2_t) {}; }

the patch changes:

	adrp	x1, .LC0
	add	x1, x1, :lo12:.LC0
	ld1	{v30.8b - v31.8b}, [x1]
	st1	{v30.8b - v31.8b}, [x0]
	ret
	...
.LC0:
	...lots of zeros...

to:

	stp	xzr, xzr, [x0]
	ret

which is a vast improvement.
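For readers without an aarch64 toolchain to hand, the operation being optimised can be sketched in plain C.  The `pair16` type and helpers below are hypothetical stand-ins for `int8x8x2_t` (they only model its 16-byte layout, not the real arm_neon.h type): zero-initialising such an object is simply a 16-byte store of zeros, which on aarch64 can be a single "stp xzr, xzr".

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 16-byte stand-in for int8x8x2_t: two 8-byte halves.  */
typedef struct { int8_t val[2][8]; } pair16;

/* The operation the patch improves: store 16 bytes of zeros.
   With the patch, GCC can emit this as one "stp xzr, xzr, [ptr]"
   instead of loading a block of zeros from the constant pool.  */
void clear_pair (pair16 *ptr)
{
  memset (ptr, 0, sizeof *ptr);  /* equivalent to *ptr = (pair16) {};  */
}

/* Check that every byte of *ptr is zero.  */
bool pair_is_zero (const pair16 *ptr)
{
  static const pair16 zero;  /* statics are zero-initialised */
  return memcmp (ptr, &zero, sizeof zero) == 0;
}
```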
We could of course fix that in the move patterns (and maybe we should),
but the point remains that zeroing N bytes doesn't carry any real mode
information.  We should just use the best N-byte mode.

The only difficulty I can see is that, for big-endian targets, we allow
V2x8QImode addresses to be any 7-bit scaled offset:

  if (aarch64_advsimd_partial_struct_mode_p (mode)
      && known_eq (GET_MODE_SIZE (mode), 16))
    return aarch64_offset_7bit_signed_scaled_p (DImode, offset);

whereas for TImode we require:

  if (mode == TImode || mode == TFmode || mode == TDmode)
    return (aarch64_offset_7bit_signed_scaled_p (DImode, offset)
	    && (aarch64_offset_9bit_signed_unscaled_p (mode, offset)
		|| offset_12bit_unsigned_scaled_p (mode, offset)));

So for big-endian, there are some immediate offsets that are valid for
V2x8QImode but not for TImode.  This isn't a problem before register
allocation, because the adjust_address will take care of it.  But it
could lead to an ICE after register allocation.  Testing:

      && (can_create_pseudo_p ()
	  || memory_address_p (TImode, XEXP (operands[0], 0))))

would avoid that.  (TBH I'm not sure this would ever trigger, i.e.
whether the move patterns would ever be asked to store zero to memory
after register allocation, but it does test the precondition on using
adjust_address.)

Like you say, the same approach should work for all 16-byte modes.
And since it's such an improvement, I think we should use it :)

Taking all that together, could you change the condition to:

  if (known_eq (GET_MODE_SIZE (<MODE>mode), 16)
      && operands[1] == CONST0_RTX (<MODE>mode)
      && MEM_P (operands[0])
      && (can_create_pseudo_p ()
	  || memory_address_p (TImode, XEXP (operands[0], 0))))

The patch is OK from my POV with that change, independently of the
CLEAR_BY_PIECES patch, but please give others 24 hours to comment.

(Once the patch is in, I'll follow up with some tests for arm_neon.h,
to defend the improvement above.)
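The big-endian corner case above comes down to integer ranges, which can be modelled in a few lines of plain C.  The functions below are simplified re-implementations of the quoted predicates, written for illustration only (the names echo the real aarch64 ones, but this is not the GCC code): a 7-bit signed offset scaled by the 8-byte DImode unit covers [-512, 504] in steps of 8, while TImode additionally requires the offset to be 9-bit signed unscaled ([-256, 255]) or 12-bit unsigned scaled by 16.

```c
#include <stdbool.h>

/* 7-bit signed offset, scaled by the 8-byte DImode unit:
   a multiple of 8 with offset/8 in [-64, 63].  */
bool offset_7bit_signed_scaled_di_p (long offset)
{
  return offset % 8 == 0 && offset / 8 >= -64 && offset / 8 <= 63;
}

/* 9-bit signed unscaled offset: [-256, 255].  */
bool offset_9bit_signed_unscaled_p (long offset)
{
  return offset >= -256 && offset <= 255;
}

/* 12-bit unsigned offset, scaled by the 16-byte TImode unit:
   a multiple of 16 with offset/16 in [0, 4095].  */
bool offset_12bit_unsigned_scaled_ti_p (long offset)
{
  return offset % 16 == 0 && offset / 16 >= 0 && offset / 16 <= 4095;
}

/* Big-endian V2x8QImode rule: any 7-bit scaled offset.  */
bool v2x8qi_offset_ok_p (long offset)
{
  return offset_7bit_signed_scaled_di_p (offset);
}

/* TImode rule: 7-bit scaled AND (9-bit unscaled OR 12-bit scaled).  */
bool timode_offset_ok_p (long offset)
{
  return offset_7bit_signed_scaled_di_p (offset)
	 && (offset_9bit_signed_unscaled_p (offset)
	     || offset_12bit_unsigned_scaled_ti_p (offset));
}
```

For example, an offset of 264 (= 33 * 8) satisfies the V2x8QImode rule but neither of the extra TImode conditions (it exceeds 255 and is not a multiple of 16) — exactly the kind of address that adjust_address has to rewrite before register allocation.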
Thanks,
Richard

> +  else if (can_create_pseudo_p ())
>     {
>       if (GET_CODE (operands[0]) != REG)
> 	operands[1] = force_reg (<MODE>mode, operands[1]);