Message ID | patch-18605-tamar@arm.com |
---|---|
State | New |
Series | [1/2] AArch64: make aarch64_simd_vec_unpack<su>_lo_/_hi_ consistent. |
Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> The fix for PR18127 reworked the uxtl to zip optimization.
> In doing so it undid the changes in aarch64_simd_vec_unpack<su>_lo_ and this now
> no longer matches aarch64_simd_vec_unpack<su>_hi_.  It still works because the
> RTL generated by aarch64_simd_vec_unpack<su>_lo_ overlaps with the general zero
> extend RTL and so, because that one is listed before the lo pattern, recog picks
> it instead.
>
> This just makes aarch64_simd_vec_unpack<su>_lo_ mirror
> aarch64_simd_vec_unpack<su>_hi_ for consistency and so we're not relying on the
> order of patterns.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64-simd.md
>         (aarch64_simd_vec_unpack<su>_lo_<mode>): Add same split as
>         aarch64_simd_vec_unpack<su>_hi_.
>         * config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
>         comment.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
> index 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca294faaf8d5ba847 100644
> --- a/gcc/config/aarch64/aarch64-simd.md
> +++ b/gcc/config/aarch64/aarch64-simd.md
> @@ -1904,7 +1904,7 @@ (define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_be"
>
>  ;; Widening operations.
>
> -(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
> +(define_insn_and_split "aarch64_simd_vec_unpack<su>_lo_<mode>"
>    [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
>          (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
>                                 (match_operand:VQW 1 "register_operand" "w")
> @@ -1912,6 +1912,19 @@ (define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
>                              )))]
>    "TARGET_SIMD"
>    "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
> +  "&& <CODE> == ZERO_EXTEND
> +   && aarch64_split_simd_shift_p (insn)"
> +  [(const_int 0)]
> +  {
> +    /* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero,
> +       provided that the cost of the zero can be amortized over several
> +       operations.  We'll later recombine the zero and zip if there are
> +       not sufficient uses of the zero to make the split worthwhile.  */
> +    rtx res = simplify_gen_subreg (<MODE>mode, operands[0], <VWIDE>mode, 0);
> +    rtx zero = aarch64_gen_shareable_zero (<MODE>mode);
> +    emit_insn (gen_aarch64_zip1<mode> (res, operands[1], zero));
> +    DONE;
> +  }
>    [(set_attr "type" "neon_shift_imm_long")]
>  )

I think we should just remove the pattern instead, and update users
to generate an extension of the 64-bit lowpart.  That form is now the
canonical one, so the current aarch64_simd_vec_unpack<su>_lo_<mode>
pattern hard-codes unsimplified rtl.

> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index ae7e21d90b2aeec51b7626471ccf7f036fa9b3db..6f49d1482042efabedbe723aa59ecf129b84f4ad 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -23159,7 +23159,8 @@ aarch64_gen_shareable_zero (machine_mode mode)
>     to split without that restriction and instead recombine shared zeros
>     if they turn out not to be worthwhile.  This would allow splits in
>     single-block functions and would also cope more naturally with
> -   rematerialization.  */
> +   rematerialization.  The downside of not doing this is that we lose the
> +   optimizations for vector epilogues as well.  */
>
>  bool
>  aarch64_split_simd_shift_p (rtx_insn *insn)

This part is ok.  late-combine was supposed to help with this,
but we might need to update the propagation code so that it handles
changes in mode.

Thanks,
Richard

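(For readers following along: the UXTL-to-ZIP1 rewrite that the split above performs can be pictured at the ACLE NEON intrinsics level.  The sketch below is purely illustrative — the function names are made up and the equivalence only holds for little-endian — but it shows why a shared zero makes the ZIP form attractive.)

    #include <arm_neon.h>

    /* Zero-extend the low 8 bytes of a 128-bit vector to 8 halfwords.  */
    uint16x8_t
    widen_lo_uxtl (uint8x16_t x)
    {
      /* Usually compiles to a single UXTL.  */
      return vmovl_u8 (vget_low_u8 (x));
    }

    uint16x8_t
    widen_lo_zip (uint8x16_t x)
    {
      /* Interleaving with zero gives the same little-endian bit pattern;
         the MOVI #0 can be shared by many such widenings, which is the
         point of the split.  */
      uint8x16_t zero = vdupq_n_u8 (0);
      return vreinterpretq_u16_u8 (vzip1q_u8 (x, zero));
    }
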
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Thursday, July 4, 2024 12:46 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <nd@arm.com>;
> Richard Earnshaw <Richard.Earnshaw@arm.com>;
> Marcus Shawcroft <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
> Subject: Re: [PATCH 1/2]AArch64: make aarch64_simd_vec_unpack<su>_lo_/_hi_ consistent.
>
> I think we should just remove the pattern instead, and update users
> to generate an extension of the 64-bit lowpart.  That form is now the
> canonical one, so the current aarch64_simd_vec_unpack<su>_lo_<mode>
> pattern hard-codes unsimplified rtl.

Ok, but then it's very weird to have _hi still.  The reason for these
indirections seems to be to model the hi/lo split for RTL simplifications.

The aarch64_simd_vec_unpack<su>_lo models a vec_select as well.  This RTL is
different from the one it gets rewritten to (dropping the vec_select).

So dropping it means we lose this, and I guess maybe fold-rtx does it now?
It's not immediately clear to me that dropping the additional RTL for recog
is beneficial.

Thanks,
Tamar

Tamar Christina <Tamar.Christina@arm.com> writes:
> Ok, but then it's very weird to have _hi still.  The reason for these
> indirections seems to be to model the hi/lo split for RTL simplifications.
>
> The aarch64_simd_vec_unpack<su>_lo models a vec_select as well.  This RTL is
> different from the one it gets rewritten to (dropping the vec_select).
>
> So dropping it means we lose this, and I guess maybe fold-rtx does it now?
> It's not immediately clear to me that dropping the additional RTL for recog
> is beneficial.

The principle is that, say:

  (vec_select:V2SI (reg:V2DI R) (parallel [(const_int 0) (const_int 1)]))

is (for little-endian) equivalent to:

  (subreg:V2SI (reg:V2DI R) 0)

and similarly for the equivalent big-endian pair.  Simplification rules
are now supposed to ensure that only the second (simpler) form is generated
by target-independent code.  We should fix any cases where that doesn't
happen, since it would be a missed optimisation for any instructions
that take (in this case) V2SI inputs.

There's no equivalent simplification for _hi because it isn't possible
to refer directly to the upper 64 bits of a 128-bit register using subregs.

Thanks,
Richard

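(To make the lowpart/highpart asymmetry concrete, here is a rough intrinsics-level sketch — the function names are illustrative and not part of the patch: the low half of a vector can be named directly as a 64-bit value, so its extension needs no vec_select, whereas the high half cannot.)

    #include <arm_neon.h>

    uint32x4_t
    widen_lo (uint16x8_t x)
    {
      /* The low 64 bits are directly addressable (a subreg in RTL terms),
         so this is just a zero_extend of the lowpart; typically UXTL.  */
      return vmovl_u16 (vget_low_u16 (x));
    }

    uint32x4_t
    widen_hi (uint16x8_t x)
    {
      /* The high 64 bits have no subreg equivalent, so the _hi pattern
         keeps its vec_select form; typically UXTL2.  */
      return vmovl_high_u16 (x);
    }
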
Richard Sandiford <richard.sandiford@arm.com> writes:
> The principle is that, say:
>
>   (vec_select:V2SI (reg:V2DI R) (parallel [(const_int 0) (const_int 1)]))
>
> is (for little-endian) equivalent to:
>
>   (subreg:V2SI (reg:V2DI R) 0)

Sigh, of course I meant V4SI rather than V2DI in the above :)

> and similarly for the equivalent big-endian pair.  Simplification rules
> are now supposed to ensure that only the second (simpler) form is generated
> by target-independent code.  We should fix any cases where that doesn't
> happen, since it would be a missed optimisation for any instructions
> that take (in this case) V2SI inputs.
>
> There's no equivalent simplification for _hi because it isn't possible
> to refer directly to the upper 64 bits of a 128-bit register using subregs.
>
> Thanks,
> Richard

> > There's no equivalent simplification for _hi because it isn't possible
> > to refer directly to the upper 64 bits of a 128-bit register using subregs.

This removes aarch64_simd_vec_unpack<su>_lo_.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

        * config/aarch64/aarch64-simd.md
        (aarch64_simd_vec_unpack<su>_lo_<mode>): Remove.
        (vec_unpack<su>_lo_<mode>): Simplify.
        * config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
        comment.

-- inline copy of patch --

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index fd10039f9a27d0da51624d6d3a6d8b2a532f5625..bbeee221f37c4875056cdf52932a787f8ac1c2aa 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1904,17 +1904,6 @@ (define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_be"
 
 ;; Widening operations.
 
-(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
-  [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
-        (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
-                               (match_operand:VQW 1 "register_operand" "w")
-                               (match_operand:VQW 2 "vect_par_cnst_lo_half" "")
-                            )))]
-  "TARGET_SIMD"
-  "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
-  [(set_attr "type" "neon_shift_imm_long")]
-)
-
 (define_insn_and_split "aarch64_simd_vec_unpack<su>_hi_<mode>"
   [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
         (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
@@ -1952,14 +1941,11 @@ (define_expand "vec_unpack<su>_hi_<mode>"
 )
 
 (define_expand "vec_unpack<su>_lo_<mode>"
-  [(match_operand:<VWIDE> 0 "register_operand")
-   (ANY_EXTEND:<VWIDE> (match_operand:VQW 1 "register_operand"))]
+  [(set (match_operand:<VWIDE> 0 "register_operand")
+        (ANY_EXTEND:<VWIDE> (match_operand:VQW 1 "register_operand")))]
   "TARGET_SIMD"
   {
-    rtx p = aarch64_simd_vect_par_cnst_half (<MODE>mode, <nunits>, false);
-    emit_insn (gen_aarch64_simd_vec_unpack<su>_lo_<mode> (operands[0],
-                                                          operands[1], p));
-    DONE;
+    operands[1] = lowpart_subreg (<VHALF>mode, operands[1], <MODE>mode);
   }
 )
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 6b106a72e49f11b8128485baceaaaddcbf139863..469eb938953a70bc6b0ce3d4aa16f773e40ee03e 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23188,7 +23188,8 @@ aarch64_gen_shareable_zero (machine_mode mode)
    to split without that restriction and instead recombine shared zeros
    if they turn out not to be worthwhile.  This would allow splits in
    single-block functions and would also cope more naturally with
-   rematerialization.  */
+   rematerialization.  The downside of not doing this is that we lose the
+   optimizations for vector epilogues as well.  */
 
 bool
 aarch64_split_simd_shift_p (rtx_insn *insn)

Tamar Christina <Tamar.Christina@arm.com> writes:
> This removes aarch64_simd_vec_unpack<su>_lo_.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>         * config/aarch64/aarch64-simd.md
>         (aarch64_simd_vec_unpack<su>_lo_<mode>): Remove.
>         (vec_unpack<su>_lo_<mode>): Simplify.
>         * config/aarch64/aarch64.cc (aarch64_gen_shareable_zero): Update
>         comment.

OK, thanks.

Richard

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 01b084d8ccb53fc0f1c7b0dd8f23546c331f020a..d4026cbf0b66995104e8e40ca294faaf8d5ba847 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -1904,7 +1904,7 @@ (define_insn "*aarch64_<srn_op>topbits_shuffle<mode>_be"
 
 ;; Widening operations.
 
-(define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
+(define_insn_and_split "aarch64_simd_vec_unpack<su>_lo_<mode>"
   [(set (match_operand:<VWIDE> 0 "register_operand" "=w")
         (ANY_EXTEND:<VWIDE> (vec_select:<VHALF>
                                (match_operand:VQW 1 "register_operand" "w")
@@ -1912,6 +1912,19 @@ (define_insn "aarch64_simd_vec_unpack<su>_lo_<mode>"
                             )))]
   "TARGET_SIMD"
   "<su>xtl\t%0.<Vwtype>, %1.<Vhalftype>"
+  "&& <CODE> == ZERO_EXTEND
+   && aarch64_split_simd_shift_p (insn)"
+  [(const_int 0)]
+  {
+    /* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero,
+       provided that the cost of the zero can be amortized over several
+       operations.  We'll later recombine the zero and zip if there are
+       not sufficient uses of the zero to make the split worthwhile.  */
+    rtx res = simplify_gen_subreg (<MODE>mode, operands[0], <VWIDE>mode, 0);
+    rtx zero = aarch64_gen_shareable_zero (<MODE>mode);
+    emit_insn (gen_aarch64_zip1<mode> (res, operands[1], zero));
+    DONE;
+  }
   [(set_attr "type" "neon_shift_imm_long")]
 )
 
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index ae7e21d90b2aeec51b7626471ccf7f036fa9b3db..6f49d1482042efabedbe723aa59ecf129b84f4ad 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -23159,7 +23159,8 @@ aarch64_gen_shareable_zero (machine_mode mode)
    to split without that restriction and instead recombine shared zeros
    if they turn out not to be worthwhile.  This would allow splits in
    single-block functions and would also cope more naturally with
-   rematerialization.  */
+   rematerialization.  The downside of not doing this is that we lose the
+   optimizations for vector epilogues as well.  */
 
 bool
 aarch64_split_simd_shift_p (rtx_insn *insn)
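
(As a rough way to see these patterns in action — not taken from the patch or its testsuite — a simple widening loop such as the one below is expected to exercise vec_unpack<su>_lo/_hi when vectorized at -O3 on aarch64, producing either the uxtl/uxtl2 form or the zip-with-shared-zero form depending on whether the split applies.)

    /* Illustrative only: a ushort -> uint widening loop that the
       vectorizer implements with vector unpacks.  */
    void
    widen (unsigned int *restrict out, const unsigned short *restrict in, int n)
    {
      for (int i = 0; i < n; i++)
        out[i] = in[i];
    }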