[1/2] AArch64: make aarch64_simd_vec_unpack<su>_lo_/_hi_ consistent.

Message ID patch-18605-tamar@arm.com
State New
Series [1/2] AArch64: make aarch64_simd_vec_unpack<su>_lo_/_hi_ consistent.

Commit Message

Tamar Christina July 4, 2024, 11:25 a.m. UTC
Hi All,

The fix for PR18127 reworked the uxtl to zip optimization.
In doing so it undid the changes in aarch64_simd_vec_unpack<su>_lo_, which now
no longer matches aarch64_simd_vec_unpack<su>_hi_.  Things still work because
the RTL generated by aarch64_simd_vec_unpack<su>_lo_ overlaps with the general
zero-extend RTL, and since that pattern is listed before the lo one, recog
picks it instead.

This just makes aarch64_simd_vec_unpack<su>_lo_ mirror
aarch64_simd_vec_unpack<su>_hi_ for consistency, so that we're not relying on
the order of the patterns.
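
As a rough illustration (my own sketch, not part of the patch): for a
V4SI -> V2DI widening on little-endian, the two equivalent forms involved are

  (zero_extend:V2DI
    (vec_select:V2SI (reg:V4SI x)
                     (parallel [(const_int 0) (const_int 1)])))

and

  (zero_extend:V2DI (subreg:V2SI (reg:V4SI x) 0))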

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(aarch64_simd_vec_unpack<su>_lo_<mode>): Add same split as
	aarch64_simd_vec_unpack<su>_hi_.
	* config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
	comment.

---

Comments

Richard Sandiford July 4, 2024, 11:45 a.m. UTC | #1
Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> The fix for PR18127 reworked the uxtl to zip optimization.
> In doing so it undid the changes in aarch64_simd_vec_unpack<su>_lo_ and this now
> no longer matches aarch64_simd_vec_unpack<su>_hi_.  It still works because the
> RTL generated by aarch64_simd_vec_unpack<su>_lo_ overlaps with the general zero
> extend RTL and so because that one is listed before the lo pattern recog picks
> it instead.
>
> This just makes aarch64_simd_vec_unpack<su>_lo_ mirror
> aarch64_simd_vec_unpack<su>_hi_ for consistency and so we're not relying on the
> order of patterns.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64-simd.md
> 	(aarch64_simd_vec_unpack<su>_lo_<mode>): Add same split as
> 	aarch64_simd_vec_unpack<su>_hi_.
> 	* config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
> 	comment.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca294faaf8d5ba847 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1904,7 +1904,7 @@ (define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_be"
>  
>  ;; Widening operations.
>  
> -(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
> +(define_insn_and_split "aarch64_simd_vec_unpack<su>_lo_<mode>"
>    [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
>          (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
>  			       (match_operand:VQW 1 "register_operand" "w")
> @@ -1912,6 +1912,19 @@ (define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
>  			    )))]
>    "TARGET_SIMD"
>    "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
> +  "&& <CODE> == ZERO_EXTEND
> +   && aarch64_split_simd_shift_p (insn)"
> +  [(const_int 0)]
> +  {
> +    /* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero,
> +       provided that the cost of the zero can be amortized over several
> +       operations.  We'll later recombine the zero and zip if there are
> +       not sufficient uses of the zero to make the split worthwhile.  */
> +    rtx res = simplify_gen_subreg (<MODE>mode, operands[0], <VWIDE>mode, 0);
> +    rtx zero = aarch64_gen_shareable_zero (<MODE>mode);
> +    emit_insn (gen_aarch64_zip1<mode> (res, operands[1], zero));
> +    DONE;
> +  }
>    [(set_attr "type" "neon_shift_imm_long")]
>  )

I think we should just remove the pattern instead, and update users
to generate an extension of the 64-bit lowpart.  That form is now the
canonical one, so the current aarch64_simd_vec_unpack<su>_lo_<mode>
pattern hard-codes unsimplified rtl.
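
(Roughly, and only as a sketch: in the vec_unpack<su>_lo_<mode> expander that
would amount to something like

   operands[1] = lowpart_subreg (<VHALF>mode, operands[1], <MODE>mode);

and then letting the extension be matched as a normal (ANY_EXTEND ...) of the
64-bit half.)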

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index ae7e21d90b2aeec51b7626471ccf7f036fa9b3db..6f49d1482042efabedbe723aa59ecf129b84f4ad 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -23159,7 +23159,8 @@ aarch64_gen_shareable_zero (machine_mode mode)
>     to split without that restriction and instead recombine shared zeros
>     if they turn out not to be worthwhile.  This would allow splits in
>     single-block functions and would also cope more naturally with
> -   rematerialization.  */
> +   rematerialization.  The downside of not doing this is that we lose the
> +   optimizations for vector epilogues as well.  */
>  
>  bool
>  aarch64_split_simd_shift_p (rtx_insn *insn)

This part is ok.  late-combine was supposed to help with this,
but we might need to update the propagation code so that it handles
changes in mode.

Thanks,
Richard
Tamar Christina July 4, 2024, 11:54 a.m. UTC | #2
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Thursday, July 4, 2024 12:46 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
> <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
> Subject: Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack<su>_lo_/_hi_
> consistent.
> 
> Tamar Christina <tamar.christina@arm.com> writes:
> > Hi All,
> >
> > The fix for PR18127 reworked the uxtl to zip optimization.
> > In doing so it undid the changes in aarch64_simd_vec_unpack<su>_lo_ and this
> now
> > no longer matches aarch64_simd_vec_unpack<su>_hi_.  It still works because
> the
> > RTL generated by aarch64_simd_vec_unpack<su>_lo_ overlaps with the general
> zero
> > extend RTL and so because that one is listed before the lo pattern recog picks
> > it instead.
> >
> > This just makes aarch64_simd_vec_unpack<su>_lo_ mirror
> > aarch64_simd_vec_unpack<su>_hi_ for consistency and so we're not relying on
> the
> > order of patterns.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* config/aarch64/aarch64-simd.md
> > 	(aarch64_simd_vec_unpack<su>_lo_<mode>): Add same split as
> > 	aarch64_simd_vec_unpack<su>_hi_.
> > 	* config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
> > 	comment.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> b/gcc/config/aarch64/aarch64-simd.md
> > index
> 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca
> 294faaf8d5ba847 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -1904,7 +1904,7 @@ (define_insn
> "*aarch64_<srn_op>topbits_shuffle<mode>_be"
> >
> >  ;; Widening operations.
> >
> > -(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
> > +(define_insn_and_split "aarch64_simd_vec_unpack<su>_lo_<mode>"
> >    [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
> >          (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
> >  			       (match_operand:VQW 1 "register_operand" "w")
> > @@ -1912,6 +1912,19 @@ (define_insn
> "aarch64_simd_vec_unpack<su>_lo_<mode>"
> >  			    )))]
> >    "TARGET_SIMD"
> >    "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
> > +  "&& <CODE> == ZERO_EXTEND
> > +   && aarch64_split_simd_shift_p (insn)"
> > +  [(const_int 0)]
> > +  {
> > +    /* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero,
> > +       provided that the cost of the zero can be amortized over several
> > +       operations.  We'll later recombine the zero and zip if there are
> > +       not sufficient uses of the zero to make the split worthwhile.  */
> > +    rtx res = simplify_gen_subreg (<MODE>mode, operands[0], <VWIDE>mode,
> 0);
> > +    rtx zero = aarch64_gen_shareable_zero (<MODE>mode);
> > +    emit_insn (gen_aarch64_zip1<mode> (res, operands[1], zero));
> > +    DONE;
> > +  }
> >    [(set_attr "type" "neon_shift_imm_long")]
> >  )
> 
> I think we should just remove the pattern instead, and update users
> to generate an extension of the 64-bit lowpart.  That form is now the
> canonical one, so the current aarch64_simd_vec_unpack<su>_lo_<mode>
> pattern hard-codes unsimplified rtl.
> 

Ok, but then it's very weird to have _hi still.  The reason for these
indirections seems to be to model the hi/lo split for RTL simplifications.

The aarch64_simd_vec_unpack<su>_lo pattern models a vec_select as well.  This
RTL is different from the one it gets rewritten to (which drops the
vec_select).

So dropping it means we lose this, and I guess maybe fold-rtx does it now?
It's not immediately clear to me that dropping the additional RTL for recog
is beneficial.

Thanks,
Tamar

> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index
> ae7e21d90b2aeec51b7626471ccf7f036fa9b3db..6f49d1482042efabedbe723aa
> 59ecf129b84f4ad 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -23159,7 +23159,8 @@ aarch64_gen_shareable_zero (machine_mode
> mode)
> >     to split without that restriction and instead recombine shared zeros
> >     if they turn out not to be worthwhile.  This would allow splits in
> >     single-block functions and would also cope more naturally with
> > -   rematerialization.  */
> > +   rematerialization.  The downside of not doing this is that we lose the
> > +   optimizations for vector epilogues as well.  */
> >
> >  bool
> >  aarch64_split_simd_shift_p (rtx_insn *insn)
> 
> This part is ok.  late-combine was supposed to help with this,
> but we might need to update the propagation code so that it handle
> changes in mode.
> 
> Thanks,
> Richard
Richard Sandiford July 4, 2024, 12:13 p.m. UTC | #3
Tamar Christina <Tamar.Christina@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Thursday, July 4, 2024 12:46 PM
>> To: Tamar Christina <Tamar.Christina@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
>> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
>> <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
>> Subject: Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack<su>_lo_/_hi_
>> consistent.
>> 
>> Tamar Christina <tamar.christina@arm.com> writes:
>> > Hi All,
>> >
>> > The fix for PR18127 reworked the uxtl to zip optimization.
>> > In doing so it undid the changes in aarch64_simd_vec_unpack<su>_lo_ and this
>> now
>> > no longer matches aarch64_simd_vec_unpack<su>_hi_.  It still works because
>> the
>> > RTL generated by aarch64_simd_vec_unpack<su>_lo_ overlaps with the general
>> zero
>> > extend RTL and so because that one is listed before the lo pattern recog picks
>> > it instead.
>> >
>> > This just makes aarch64_simd_vec_unpack<su>_lo_ mirror
>> > aarch64_simd_vec_unpack<su>_hi_ for consistency and so we're not relying on
>> the
>> > order of patterns.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> >
>> > Ok for master?
>> >
>> > Thanks,
>> > Tamar
>> >
>> > gcc/ChangeLog:
>> >
>> > 	* config/aarch64/aarch64-simd.md
>> > 	(aarch64_simd_vec_unpack<su>_lo_<mode>): Add same split as
>> > 	aarch64_simd_vec_unpack<su>_hi_.
>> > 	* config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
>> > 	comment.
>> >
>> > ---
>> >
>> > diff --git a/gcc/config/aarch64/aarch64-simd.md
>> b/gcc/config/aarch64/aarch64-simd.md
>> > index
>> 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca
>> 294faaf8d5ba847 100644
>> > --- a/gcc/config/aarch64/aarch64-simd.md
>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>> > @@ -1904,7 +1904,7 @@ (define_insn
>> "*aarch64_<srn_op>topbits_shuffle<mode>_be"
>> >
>> >  ;; Widening operations.
>> >
>> > -(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
>> > +(define_insn_and_split "aarch64_simd_vec_unpack<su>_lo_<mode>"
>> >    [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
>> >          (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
>> >  			       (match_operand:VQW 1 "register_operand" "w")
>> > @@ -1912,6 +1912,19 @@ (define_insn
>> "aarch64_simd_vec_unpack<su>_lo_<mode>"
>> >  			    )))]
>> >    "TARGET_SIMD"
>> >    "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
>> > +  "&& <CODE> == ZERO_EXTEND
>> > +   && aarch64_split_simd_shift_p (insn)"
>> > +  [(const_int 0)]
>> > +  {
>> > +    /* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero,
>> > +       provided that the cost of the zero can be amortized over several
>> > +       operations.  We'll later recombine the zero and zip if there are
>> > +       not sufficient uses of the zero to make the split worthwhile.  */
>> > +    rtx res = simplify_gen_subreg (<MODE>mode, operands[0], <VWIDE>mode,
>> 0);
>> > +    rtx zero = aarch64_gen_shareable_zero (<MODE>mode);
>> > +    emit_insn (gen_aarch64_zip1<mode> (res, operands[1], zero));
>> > +    DONE;
>> > +  }
>> >    [(set_attr "type" "neon_shift_imm_long")]
>> >  )
>> 
>> I think we should just remove the pattern instead, and update users
>> to generate an extension of the 64-bit lowpart.  That form is now the
>> canonical one, so the current aarch64_simd_vec_unpack<su>_lo_<mode>
>> pattern hard-codes unsimplified rtl.
>> 
>
> Ok, but then it's very weird to have _hi still.  The reason for these indirections
> seems to be to model the hi/lo split for RTL simplifications. 
>
> The aarch64_simd_vec_unpack<su>_lo models a vec_select as well.  This RTL is
> different from the one it gets rewritten to (dropping the vec_select).
>
> So dropping it means we lose this, and I guess maybe fold-rtx does it now?
> It's not immediately clear to me that dropping the additional RTL for recog
> is beneficially.

The principle is that, say:

  (vec_select:V2SI (reg:V2DI R) (parallel [(const_int 0) (const_int 1)]))

is (for little-endian) equivalent to:

  (subreg:V2SI (reg:V2DI R) 0)

and similarly for the equivalent big-endian pair.  Simplification rules
are now supposed to ensure that only the second (simpler) form is generated
by target-independent code.  We should fix any cases where that doesn't
happen, since it would be a missed optimisation for any instructions
that take (in this case) V2SI inputs.

There's no equivalent simplification for _hi because it isn't possible
to refer directly to the upper 64 bits of a 128-bit register using subregs.
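
(So, reusing the little-endian V4SI example above, the high half has to stay
in the vec_select form:

  (vec_select:V2SI (reg:V4SI R) (parallel [(const_int 2) (const_int 3)]))

since there is no subreg that names just those bits.)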

Thanks,
Richard
Richard Sandiford July 4, 2024, 12:28 p.m. UTC | #4
Richard Sandiford <richard.sandiford@arm.com> writes:
> Tamar Christina <Tamar.Christina@arm.com> writes:
>>> -----Original Message-----
>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>> Sent: Thursday, July 4, 2024 12:46 PM
>>> To: Tamar Christina <Tamar.Christina@arm.com>
>>> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>; Richard Earnshaw
>>> <Richard.Earnshaw@arm.com>; Marcus Shawcroft
>>> <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
>>> Subject: Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack<su>_lo_/_hi_
>>> consistent.
>>> 
>>> Tamar Christina <tamar.christina@arm.com> writes:
>>> > Hi All,
>>> >
>>> > The fix for PR18127 reworked the uxtl to zip optimization.
>>> > In doing so it undid the changes in aarch64_simd_vec_unpack<su>_lo_ and this
>>> now
>>> > no longer matches aarch64_simd_vec_unpack<su>_hi_.  It still works because
>>> the
>>> > RTL generated by aarch64_simd_vec_unpack<su>_lo_ overlaps with the general
>>> zero
>>> > extend RTL and so because that one is listed before the lo pattern recog picks
>>> > it instead.
>>> >
>>> > This just makes aarch64_simd_vec_unpack<su>_lo_ mirror
>>> > aarch64_simd_vec_unpack<su>_hi_ for consistency and so we're not relying on
>>> the
>>> > order of patterns.
>>> >
>>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>>> >
>>> > Ok for master?
>>> >
>>> > Thanks,
>>> > Tamar
>>> >
>>> > gcc/ChangeLog:
>>> >
>>> > 	* config/aarch64/aarch64-simd.md
>>> > 	(aarch64_simd_vec_unpack<su>_lo_<mode>): Add same split as
>>> > 	aarch64_simd_vec_unpack<su>_hi_.
>>> > 	* config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
>>> > 	comment.
>>> >
>>> > ---
>>> >
>>> > diff --git a/gcc/config/aarch64/aarch64-simd.md
>>> b/gcc/config/aarch64/aarch64-simd.md
>>> > index
>>> 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca
>>> 294faaf8d5ba847 100644
>>> > --- a/gcc/config/aarch64/aarch64-simd.md
>>> > +++ b/gcc/config/aarch64/aarch64-simd.md
>>> > @@ -1904,7 +1904,7 @@ (define_insn
>>> "*aarch64_<srn_op>topbits_shuffle<mode>_be"
>>> >
>>> >  ;; Widening operations.
>>> >
>>> > -(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
>>> > +(define_insn_and_split "aarch64_simd_vec_unpack<su>_lo_<mode>"
>>> >    [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
>>> >          (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
>>> >  			       (match_operand:VQW 1 "register_operand" "w")
>>> > @@ -1912,6 +1912,19 @@ (define_insn
>>> "aarch64_simd_vec_unpack<su>_lo_<mode>"
>>> >  			    )))]
>>> >    "TARGET_SIMD"
>>> >    "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
>>> > +  "&& <CODE> == ZERO_EXTEND
>>> > +   && aarch64_split_simd_shift_p (insn)"
>>> > +  [(const_int 0)]
>>> > +  {
>>> > +    /* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero,
>>> > +       provided that the cost of the zero can be amortized over several
>>> > +       operations.  We'll later recombine the zero and zip if there are
>>> > +       not sufficient uses of the zero to make the split worthwhile.  */
>>> > +    rtx res = simplify_gen_subreg (<MODE>mode, operands[0], <VWIDE>mode,
>>> 0);
>>> > +    rtx zero = aarch64_gen_shareable_zero (<MODE>mode);
>>> > +    emit_insn (gen_aarch64_zip1<mode> (res, operands[1], zero));
>>> > +    DONE;
>>> > +  }
>>> >    [(set_attr "type" "neon_shift_imm_long")]
>>> >  )
>>> 
>>> I think we should just remove the pattern instead, and update users
>>> to generate an extension of the 64-bit lowpart.  That form is now the
>>> canonical one, so the current aarch64_simd_vec_unpack<su>_lo_<mode>
>>> pattern hard-codes unsimplified rtl.
>>> 
>>
>> Ok, but then it's very weird to have _hi still.  The reason for these indirections
>> seems to be to model the hi/lo split for RTL simplifications. 
>>
>> The aarch64_simd_vec_unpack<su>_lo models a vec_select as well.  This RTL is
>> different from the one it gets rewritten to (dropping the vec_select).
>>
>> So dropping it means we lose this, and I guess maybe fold-rtx does it now?
>> It's not immediately clear to me that dropping the additional RTL for recog
>> is beneficially.
>
> The principle is that, say:
>
>   (vec_select:V2SI (reg:V2DI R) (parallel [(const_int 0) (const_int 1)]))
>
> is (for little-endian) equivalent to:
>
>   (subreg:V2SI (reg:V2DI R) 0)

Sigh, of course I meant V4SI rather than V2DI in the above :)

> and similarly for the equivalent big-endian pair.  Simplification rules
> are now supposed to ensure that only the second (simpler) form is generated
> by target-independent code.  We should fix any cases where that doesn't
> happen, since it would be a missed optimisation for any instructions
> that take (in this case) V2SI inputs.
>
> There's no equivalent simplification for _hi because it isn't possible
> to refer directly to the upper 64 bits of a 128-bit register using subregs.
>
> Thanks,
> Richard
Tamar Christina July 5, 2024, 8:13 a.m. UTC | #5
> > The principle is that, say:
> >
> >   (vec_select:V2SI (reg:V2DI R) (parallel [(const_int 0) (const_int 1)]))
> >
> > is (for little-endian) equivalent to:
> >
> >   (subreg:V2SI (reg:V2DI R) 0)
> 
> Sigh, of course I meant V4SI rather than V2DI in the above :)
> 
> > and similarly for the equivalent big-endian pair.  Simplification rules
> > are now supposed to ensure that only the second (simpler) form is generated
> > by target-independent code.  We should fix any cases where that doesn't
> > happen, since it would be a missed optimisation for any instructions
> > that take (in this case) V2SI inputs.
> >
> > There's no equivalent simplification for _hi because it isn't possible
> > to refer directly to the upper 64 bits of a 128-bit register using subregs.
> >

This removes aarch64_simd_vec_unpack<su>_lo_.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-simd.md
	(aarch64_simd_vec_unpack<su>_lo_<mode>): Remove.
	(vec_unpack<su>_lo_<mode>): Simplify.
	* config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
	comment.

-- inline copy of patch --

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index fd10039f9a27d0da51624d6d3a6d8b2a532f5625..bbeee221f37c4875056cdf52932a787f8ac1c2aa 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1904,17 +1904,6 @@ (define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_be"
 
 ;; Widening operations.
 
-(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
-  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
-        (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
-			       (match_operand:VQW 1 "register_operand" "w")
-			       (match_operand:VQW 2 "vect_par_cnst_lo_half" "")
-			    )))]
-  "TARGET_SIMD"
-  "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
-  [(set_attr "type" "neon_shift_imm_long")]
-)
-
 (define_insn_and_split "aarch64_simd_vec_unpack<su>_hi_<mode>"
   [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
         (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
@@ -1952,14 +1941,11 @@ (define_expand "vec_unpack<su>_hi_<mode>"
 )
 
 (define_expand "vec_unpack<su>_lo_<mode>"
-  [(match_operand:<VWIDE> 0 "register_operand")
-   (ANY_EXTEND:<VWIDE> (match_operand:VQW 1 "register_operand"))]
+  [(set (match_operand:<VWIDE> 0 "register_operand")
+	(ANY_EXTEND:<VWIDE> (match_operand:VQW 1 "register_operand")))]
   "TARGET_SIMD"
   {
-    rtx p = aarch64_simd_vect_par_cnst_half (<MODE>mode, <nunits>, false);
-    emit_insn (gen_aarch64_simd_vec_unpack<su>_lo_<mode> (operands[0],
-							  operands[1], p));
-    DONE;
+    operands[1] = lowpart_subreg (<VHALF>mode, operands[1], <MODE>mode);
   }
 )
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 6b106a72e49f11b8128485baceaaaddcbf139863..469eb938953a70bc6b0ce3d4aa16f773e40ee03e 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23188,7 +23188,8 @@ aarch64_gen_shareable_zero (machine_mode mode)
    to split without that restriction and instead recombine shared zeros
    if they turn out not to be worthwhile.  This would allow splits in
    single-block functions and would also cope more naturally with
-   rematerialization.  */
+   rematerialization.  The downside of not doing this is that we lose the
+   optimizations for vector epilogues as well.  */
 
 bool
 aarch64_split_simd_shift_p (rtx_insn *insn)
Richard Sandiford July 5, 2024, 8:29 a.m. UTC | #6
Tamar Christina <Tamar.Christina@arm.com> writes:
>> > The principle is that, say:
>> >
>> >   (vec_select:V2SI (reg:V2DI R) (parallel [(const_int 0) (const_int 1)]))
>> >
>> > is (for little-endian) equivalent to:
>> >
>> >   (subreg:V2SI (reg:V2DI R) 0)
>> 
>> Sigh, of course I meant V4SI rather than V2DI in the above :)
>> 
>> > and similarly for the equivalent big-endian pair.  Simplification rules
>> > are now supposed to ensure that only the second (simpler) form is generated
>> > by target-independent code.  We should fix any cases where that doesn't
>> > happen, since it would be a missed optimisation for any instructions
>> > that take (in this case) V2SI inputs.
>> >
>> > There's no equivalent simplification for _hi because it isn't possible
>> > to refer directly to the upper 64 bits of a 128-bit register using subregs.
>> >
>
> This removes aarch64_simd_vec_unpack<su>_lo_.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* config/aarch64/aarch64-simd.md
> 	(aarch64_simd_vec_unpack<su>_lo_<mode>): Remove.
> 	(vec_unpack<su>_lo_<mode): Simplify.
> 	* config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
> 	comment.

OK, thanks.

Richard

> -- inline copy of patch --
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index fd10039f9a27d0da51624d6d3a6d8b2a532f5625..bbeee221f37c4875056cdf52932a787f8ac1c2aa 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1904,17 +1904,6 @@ (define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_be"
>  
>  ;; Widening operations.
>  
> -(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
> -  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
> -        (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
> -			       (match_operand:VQW 1 "register_operand" "w")
> -			       (match_operand:VQW 2 "vect_par_cnst_lo_half" "")
> -			    )))]
> -  "TARGET_SIMD"
> -  "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
> -  [(set_attr "type" "neon_shift_imm_long")]
> -)
> -
>  (define_insn_and_split "aarch64_simd_vec_unpack<su>_hi_<mode>"
>    [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
>          (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
> @@ -1952,14 +1941,11 @@ (define_expand "vec_unpack<su>_hi_<mode>"
>  )
>  
>  (define_expand "vec_unpack<su>_lo_<mode>"
> -  [(match_operand:<VWIDE> 0 "register_operand")
> -   (ANY_EXTEND:<VWIDE> (match_operand:VQW 1 "register_operand"))]
> +  [(set (match_operand:<VWIDE> 0 "register_operand")
> +	(ANY_EXTEND:<VWIDE> (match_operand:VQW 1 "register_operand")))]
>    "TARGET_SIMD"
>    {
> -    rtx p = aarch64_simd_vect_par_cnst_half (<MODE>mode, <nunits>, false);
> -    emit_insn (gen_aarch64_simd_vec_unpack<su>_lo_<mode> (operands[0],
> -							  operands[1], p));
> -    DONE;
> +    operands[1] = lowpart_subreg (<VHALF>mode, operands[1], <MODE>mode);
>    }
>  )
>  
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 6b106a72e49f11b8128485baceaaaddcbf139863..469eb938953a70bc6b0ce3d4aa16f773e40ee03e 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -23188,7 +23188,8 @@ aarch64_gen_shareable_zero (machine_mode mode)
>     to split without that restriction and instead recombine shared zeros
>     if they turn out not to be worthwhile.  This would allow splits in
>     single-block functions and would also cope more naturally with
> -   rematerialization.  */
> +   rematerialization.  The downside of not doing this is that we lose the
> +   optimizations for vector epilogues as well.  */
>  
>  bool
>  aarch64_split_simd_shift_p (rtx_insn *insn)
Patch

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca294faaf8d5ba847 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1904,7 +1904,7 @@  (define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_be"
 
 ;; Widening operations.
 
-(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
+(define_insn_and_split "aarch64_simd_vec_unpack<su>_lo_<mode>"
   [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
         (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
 			       (match_operand:VQW 1 "register_operand" "w")
@@ -1912,6 +1912,19 @@  (define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
 			    )))]
   "TARGET_SIMD"
   "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
+  "&& <CODE> == ZERO_EXTEND
+   && aarch64_split_simd_shift_p (insn)"
+  [(const_int 0)]
+  {
+    /* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero,
+       provided that the cost of the zero can be amortized over several
+       operations.  We'll later recombine the zero and zip if there are
+       not sufficient uses of the zero to make the split worthwhile.  */
+    rtx res = simplify_gen_subreg (<MODE>mode, operands[0], <VWIDE>mode, 0);
+    rtx zero = aarch64_gen_shareable_zero (<MODE>mode);
+    emit_insn (gen_aarch64_zip1<mode> (res, operands[1], zero));
+    DONE;
+  }
   [(set_attr "type" "neon_shift_imm_long")]
 )
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ae7e21d90b2aeec51b7626471ccf7f036fa9b3db..6f49d1482042efabedbe723aa59ecf129b84f4ad 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23159,7 +23159,8 @@  aarch64_gen_shareable_zero (machine_mode mode)
    to split without that restriction and instead recombine shared zeros
    if they turn out not to be worthwhile.  This would allow splits in
    single-block functions and would also cope more naturally with
-   rematerialization.  */
+   rematerialization.  The downside of not doing this is that we lose the
+   optimizations for vector epilogues as well.  */
 
 bool
 aarch64_split_simd_shift_p (rtx_insn *insn)