From patchwork Fri Jan 5 16:30:48 2024
X-Patchwork-Submitter: Richard Sandiford
X-Patchwork-Id: 1883069
From: Richard Sandiford
To: gcc-patches@gcc.gnu.org
Subject: [PATCH] aarch64: Rework uxtl->zip optimisation [PR113196]
Date: Fri, 05 Jan 2024 16:30:48 +0000

g:f26f92b534f9 implemented unsigned extensions using ZIPs rather than UXTL{,2}, since the former has a higher throughput than the latter on many cores. The optimisation worked by lowering directly to ZIP during expand, so that the zero input could be hoisted and shared.
However, changing to ZIP means that zero extensions no longer benefit from some existing combine patterns. The patch included new patterns for UADDW and USUBW, but the PR shows that other patterns were affected as well.

This patch instead introduces the ZIPs during a pre-reload split and forcibly hoists the zero move to the outermost scope. This has the disadvantage of executing the move even for a shrink-wrapped function, which I suppose could be a problem if it causes a kernel to trap and enable Advanced SIMD unnecessarily. In other circumstances, an unused move shouldn't affect things much. Also, the RA should be able to rematerialise the move at an appropriate point if necessary, such as if there is an intervening call. uxtl-combine-13.c contains a test for this.

The patch then tries to ensure that the post-RA late-combine pass can recombine zeros and ZIPs back into UXTLs if there wasn't sufficient use of the zero to make it worthwhile. The cut-off used by the patch is that 1 UXTL is better than 1 MOVI + 1 ZIP, but that 1 MOVI + 2 ZIPs are better than 2 UXTLs (assuming all instructions have equal execution frequency). Any other uses of the shared zero would count in its favour too; it's not limited to ZIPs.

In order to do that, the patch relaxes the ZIP patterns so that the inputs can have any mode. This allows the V4SI zero to be propagated into any kind of ZIP, rather than just V4SI ones. I think that's logically consistent, since it's the mode of the unspec that ultimately determines the mode of the operation. (And we don't need to be overly defensive about which modes are acceptable, since ZIPs are only generated by code that knows/ought to know what it's doing.)

Also, the original optimisation contained a big-endian correction that I don't think is needed/correct. Even on big-endian targets, we want the ZIP to take the low half of an element from the input vector and the high half from the zero vector.
And the patterns map directly to the underlying Advanced SIMD instructions: the use of unspecs means that there's no need to adjust for the difference between GCC and Arm lane numbering.

Tested on aarch64-linux-gnu and aarch64_be-elf (fixing some execution failures for the latter). The patch depends on the late-combine pass and on the FUNCTION_BEG patch that I just posted. I'll commit once those are in, if there are no objections.

Richard

gcc/
	PR target/113196
	* config/aarch64/aarch64.h (machine_function::advsimd_zero_insn):
	New member variable.
	* config/aarch64/iterators.md (Vnarrowq2): New mode attribute.
	* config/aarch64/predicates.md (aarch64_any_register_operand):
	Accept subregs too.
	* config/aarch64/aarch64-simd.md (aarch64_): Change the input
	operand predicates to aarch64_any_register_operand.
	(vec_unpacku_hi_, vec_unpacks_hi_): Recombine into...
	(vec_unpack_hi_): ...this.  Move the generation of zip2 for
	zero-extends to...
	(aarch64_simd_vec_unpack_hi_): ...a split of this instruction.
	Fix big-endian handling.
	(*aarch64_zip2_uxtl2): New pattern.
	(vec_unpacku_lo_, vec_unpacks_lo_): Recombine into...
	(vec_unpack_lo_): ...this.  Move the generation of zip1 for
	zero-extends to...
	(2): ...a split of this instruction.  Fix big-endian handling.
	(*aarch64_zip1_uxtl): New pattern.
	(aarch64_usubw_lo_zip, aarch64_uaddw_lo_zip): Delete.
	(aarch64_usubw_hi_zip, aarch64_uaddw_hi_zip): Likewise.
	* config/aarch64/aarch64.cc (aarch64_rtx_costs): Recognize ZIP1s
	and ZIP2s that can be implemented using UXTL{,2}.  Make them half
	an instruction more expensive than a normal zip.
	(aarch64_get_shareable_reg): New function.
	(aarch64_gen_shareable_zero): Use it.

gcc/testsuite/
	PR target/113196
	* gcc.target/aarch64/pr103350-1.c: Disable split1.
	* gcc.target/aarch64/pr103350-2.c: Likewise.
	* gcc.target/aarch64/simd/vmovl_high_1.c: Remove double include.
	Expect uxtl2 rather than zip2.
	* gcc.target/aarch64/vect_mixed_sizes_8.c: Expect zip1 rather
	than uxtl.
	* gcc.target/aarch64/vect_mixed_sizes_9.c: Likewise.
	* gcc.target/aarch64/vect_mixed_sizes_10.c: Likewise.
	* gcc.target/aarch64/uxtl-combine-7.c: New test.
	* gcc.target/aarch64/uxtl-combine-8.c: Likewise.
	* gcc.target/aarch64/uxtl-combine-9.c: Likewise.
	* gcc.target/aarch64/uxtl-combine-10.c: Likewise.
	* gcc.target/aarch64/uxtl-combine-11.c: Likewise.
	* gcc.target/aarch64/uxtl-combine-12.c: Likewise.
	* gcc.target/aarch64/uxtl-combine-13.c: Likewise.
---
 gcc/config/aarch64/aarch64-simd.md            | 157 +++++-----
 gcc/config/aarch64/aarch64.cc                 |  47 +++++-
 gcc/config/aarch64/aarch64.h                  |   6 +
 gcc/config/aarch64/iterators.md               |   2 +
 gcc/config/aarch64/predicates.md              |   4 +-
 gcc/testsuite/gcc.target/aarch64/pr103350-1.c |   2 +-
 gcc/testsuite/gcc.target/aarch64/pr103350-2.c |   2 +-
 .../gcc.target/aarch64/simd/vmovl_high_1.c    |   8 +-
 .../gcc.target/aarch64/uxtl-combine-10.c      |  24 +++
 .../gcc.target/aarch64/uxtl-combine-11.c      | 127 ++++++++++++++
 .../gcc.target/aarch64/uxtl-combine-12.c      | 130 +++++++++++++++
 .../gcc.target/aarch64/uxtl-combine-13.c      |  26 +++
 .../gcc.target/aarch64/uxtl-combine-7.c       | 136 +++++++++++++++
 .../gcc.target/aarch64/uxtl-combine-8.c       | 136 +++++++++++++++
 .../gcc.target/aarch64/uxtl-combine-9.c       |  32 ++++
 .../gcc.target/aarch64/vect_mixed_sizes_10.c  |   2 +-
 .../gcc.target/aarch64/vect_mixed_sizes_8.c   |   2 +-
 .../gcc.target/aarch64/vect_mixed_sizes_9.c   |   2 +-
 18 files changed, 732 insertions(+), 113 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/uxtl-combine-10.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/uxtl-combine-11.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/uxtl-combine-12.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/uxtl-combine-13.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/uxtl-combine-7.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/uxtl-combine-8.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/uxtl-combine-9.c
diff --git a/gcc/config/aarch64/aarch64-simd.md
b/gcc/config/aarch64/aarch64-simd.md index 3cd184f46fa..66cf6a71fad 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -1958,7 +1958,7 @@ (define_insn "aarch64_simd_vec_unpack_lo_" [(set_attr "type" "neon_shift_imm_long")] ) -(define_insn "aarch64_simd_vec_unpack_hi_" +(define_insn_and_split "aarch64_simd_vec_unpack_hi_" [(set (match_operand: 0 "register_operand" "=w") (ANY_EXTEND: (vec_select: (match_operand:VQW 1 "register_operand" "w") @@ -1966,63 +1966,54 @@ (define_insn "aarch64_simd_vec_unpack_hi_" )))] "TARGET_SIMD" "xtl2\t%0., %1." - [(set_attr "type" "neon_shift_imm_long")] -) - -(define_expand "vec_unpacku_hi_" - [(match_operand: 0 "register_operand") - (match_operand:VQW 1 "register_operand")] - "TARGET_SIMD" + "&& == ZERO_EXTEND + && can_create_pseudo_p () + && optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn))" + [(const_int 0)] { - rtx res = gen_reg_rtx (mode); - rtx tmp = aarch64_gen_shareable_zero (mode); - if (BYTES_BIG_ENDIAN) - emit_insn (gen_aarch64_zip2 (res, tmp, operands[1])); - else - emit_insn (gen_aarch64_zip2 (res, operands[1], tmp)); - emit_move_insn (operands[0], - simplify_gen_subreg (mode, res, mode, 0)); + /* On many cores, it is cheaper to implement UXTL2 using a ZIP2 with zero, + provided that the cost of the zero can be amortized over several + operations. We'll later recombine the zero and zip if there are + not sufficient uses of the zero to make the split worthwhile. 
*/ + rtx res = simplify_gen_subreg (mode, operands[0], mode, 0); + rtx zero = aarch64_gen_shareable_zero (V4SImode); + emit_insn (gen_aarch64_zip2 (res, operands[1], zero)); DONE; } + [(set_attr "type" "neon_shift_imm_long")] ) -(define_expand "vec_unpacks_hi_" - [(match_operand: 0 "register_operand") - (match_operand:VQW 1 "register_operand")] +(define_insn "*aarch64_zip2_uxtl2" + [(set (match_operand:VQW 0 "register_operand" "=w") + (unspec:VQW + [(match_operand 1 "aarch64_any_register_operand" "w") + (match_operand 2 "aarch64_simd_imm_zero")] + UNSPEC_ZIP2))] "TARGET_SIMD" - { - rtx p = aarch64_simd_vect_par_cnst_half (mode, , true); - emit_insn (gen_aarch64_simd_vec_unpacks_hi_ (operands[0], - operands[1], p)); - DONE; - } + "uxtl2\t%0., %1." + [(set_attr "type" "neon_shift_imm_long")] ) -(define_expand "vec_unpacku_lo_" +(define_expand "vec_unpack_hi_" [(match_operand: 0 "register_operand") - (match_operand:VQW 1 "register_operand")] + (ANY_EXTEND: (match_operand:VQW 1 "register_operand"))] "TARGET_SIMD" { - rtx res = gen_reg_rtx (mode); - rtx tmp = aarch64_gen_shareable_zero (mode); - if (BYTES_BIG_ENDIAN) - emit_insn (gen_aarch64_zip1 (res, tmp, operands[1])); - else - emit_insn (gen_aarch64_zip1 (res, operands[1], tmp)); - emit_move_insn (operands[0], - simplify_gen_subreg (mode, res, mode, 0)); + rtx p = aarch64_simd_vect_par_cnst_half (mode, , true); + emit_insn (gen_aarch64_simd_vec_unpack_hi_ (operands[0], + operands[1], p)); DONE; } ) -(define_expand "vec_unpacks_lo_" +(define_expand "vec_unpack_lo_" [(match_operand: 0 "register_operand") - (match_operand:VQW 1 "register_operand")] + (ANY_EXTEND: (match_operand:VQW 1 "register_operand"))] "TARGET_SIMD" { rtx p = aarch64_simd_vect_par_cnst_half (mode, , false); - emit_insn (gen_aarch64_simd_vec_unpacks_lo_ (operands[0], - operands[1], p)); + emit_insn (gen_aarch64_simd_vec_unpack_lo_ (operands[0], + operands[1], p)); DONE; } ) @@ -4792,62 +4783,6 @@ (define_insn "aarch64_subw2_internal" [(set_attr 
"type" "neon_sub_widen")] ) -(define_insn "aarch64_usubw_lo_zip" - [(set (match_operand: 0 "register_operand" "=w") - (minus: - (match_operand: 1 "register_operand" "w") - (subreg: - (unspec: [ - (match_operand:VQW 2 "register_operand" "w") - (match_operand:VQW 3 "aarch64_simd_imm_zero") - ] UNSPEC_ZIP1) 0)))] - "TARGET_SIMD" - "usubw\\t%0., %1., %2." - [(set_attr "type" "neon_sub_widen")] -) - -(define_insn "aarch64_uaddw_lo_zip" - [(set (match_operand: 0 "register_operand" "=w") - (plus: - (subreg: - (unspec: [ - (match_operand:VQW 2 "register_operand" "w") - (match_operand:VQW 3 "aarch64_simd_imm_zero") - ] UNSPEC_ZIP1) 0) - (match_operand: 1 "register_operand" "w")))] - "TARGET_SIMD" - "uaddw\\t%0., %1., %2." - [(set_attr "type" "neon_add_widen")] -) - -(define_insn "aarch64_usubw_hi_zip" - [(set (match_operand: 0 "register_operand" "=w") - (minus: - (match_operand: 1 "register_operand" "w") - (subreg: - (unspec: [ - (match_operand:VQW 2 "register_operand" "w") - (match_operand:VQW 3 "aarch64_simd_imm_zero") - ] UNSPEC_ZIP2) 0)))] - "TARGET_SIMD" - "usubw2\\t%0., %1., %2." - [(set_attr "type" "neon_sub_widen")] -) - -(define_insn "aarch64_uaddw_hi_zip" - [(set (match_operand: 0 "register_operand" "=w") - (plus: - (subreg: - (unspec: [ - (match_operand:VQW 2 "register_operand" "w") - (match_operand:VQW 3 "aarch64_simd_imm_zero") - ] UNSPEC_ZIP2) 0) - (match_operand: 1 "register_operand" "w")))] - "TARGET_SIMD" - "uaddw2\\t%0., %1., %2." - [(set_attr "type" "neon_add_widen")] -) - (define_insn "aarch64_addw" [(set (match_operand: 0 "register_operand" "=w") (plus: @@ -8615,8 +8550,8 @@ (define_insn_and_split "aarch64_combinev16qi" ;; need corresponding changes there. 
(define_insn "aarch64_" [(set (match_operand:VALL_F16 0 "register_operand" "=w") - (unspec:VALL_F16 [(match_operand:VALL_F16 1 "register_operand" "w") - (match_operand:VALL_F16 2 "register_operand" "w")] + (unspec:VALL_F16 [(match_operand:VALL_F16 1 "aarch64_any_register_operand" "w") + (match_operand:VALL_F16 2 "aarch64_any_register_operand" "w")] PERMUTE))] "TARGET_SIMD" "\\t%0., %1., %2." @@ -9788,11 +9723,37 @@ (define_insn "aarch64_crypto_pmullv2di" ) ;; Sign- or zero-extend a 64-bit integer vector to a 128-bit vector. -(define_insn "2" +(define_insn_and_split "2" [(set (match_operand:VQN 0 "register_operand" "=w") (ANY_EXTEND:VQN (match_operand: 1 "register_operand" "w")))] "TARGET_SIMD" "xtl\t%0., %1." + "&& == ZERO_EXTEND + && can_create_pseudo_p () + && optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn))" + [(const_int 0)] + { + /* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero, + provided that the cost of the zero can be amortized over several + operations. We'll later recombine the zero and zip if there are + not sufficient uses of the zero to make the split worthwhile. */ + rtx res = simplify_gen_subreg (mode, operands[0], + mode, 0); + rtx zero = aarch64_gen_shareable_zero (V4SImode); + emit_insn (gen_aarch64_zip1 (res, operands[1], zero)); + DONE; + } + [(set_attr "type" "neon_shift_imm_long")] +) + +(define_insn "*aarch64_zip1_uxtl" + [(set (match_operand:VQW 0 "register_operand" "=w") + (unspec:VQW + [(match_operand 1 "aarch64_any_register_operand" "w") + (match_operand 2 "aarch64_simd_imm_zero")] + UNSPEC_ZIP1))] + "TARGET_SIMD" + "uxtl\t%0., %1." [(set_attr "type" "neon_shift_imm_long")] ) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index a5a6b52730d..a3a1a0a7466 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -15202,6 +15202,20 @@ cost_plus: return false; } + /* Recognize ZIPs of zero that can be implemented using UXTL{,2}. 
+ On many cores, ZIPs have a higher throughput than UXTL, + and the zero feeding the ZIPs can be eliminated during rename. + We therefore prefer 1 MOVI + 2 ZIPs over 2 UXTLs, assuming all + five instructions have equal execution frequency. + + This could be put behind a tuning property if other cores prefer + a different approach. */ + if (speed + && (XINT (x, 1) == UNSPEC_ZIP1 || XINT (x, 1) == UNSPEC_ZIP2) + && (mode == V16QImode || mode == V8HImode || mode == V4SImode) + && aarch64_const_zero_rtx_p (XVECEXP (x, 0, 1))) + *cost += COSTS_N_INSNS (1); + if (XINT (x, 1) == UNSPEC_RBIT) { if (speed) @@ -22873,16 +22887,41 @@ aarch64_mov_operand_p (rtx x, machine_mode mode) == SYMBOL_TINY_ABSOLUTE; } +/* Return a function-invariant register that contains VALUE. *CACHED_INSN + caches instructions that set up such registers, so that they can be + reused by future calls. */ + +static rtx +aarch64_get_shareable_reg (rtx_insn **cached_insn, rtx value) +{ + rtx_insn *insn = *cached_insn; + if (insn && INSN_P (insn) && !insn->deleted ()) + { + rtx pat = PATTERN (insn); + if (GET_CODE (pat) == SET) + { + rtx dest = SET_DEST (pat); + if (REG_P (dest) + && !HARD_REGISTER_P (dest) + && rtx_equal_p (SET_SRC (pat), value)) + return dest; + } + } + rtx reg = gen_reg_rtx (GET_MODE (value)); + *cached_insn = emit_insn_before (gen_rtx_SET (reg, value), + function_beg_insn); + return reg; +} + /* Create a 0 constant that is based on V4SI to allow CSE to optimally share the constant creation. */ rtx aarch64_gen_shareable_zero (machine_mode mode) { - machine_mode zmode = V4SImode; - rtx tmp = gen_reg_rtx (zmode); - emit_move_insn (tmp, CONST0_RTX (zmode)); - return lowpart_subreg (mode, tmp, zmode); + rtx reg = aarch64_get_shareable_reg (&cfun->machine->advsimd_zero_insn, + CONST0_RTX (V4SImode)); + return lowpart_subreg (mode, reg, GET_MODE (reg)); } /* Return a const_int vector of VAL. 
*/ diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 0a4e152c9bd..157a0b9dfa5 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -1056,6 +1056,12 @@ typedef struct GTY (()) machine_function /* A set of all decls that have been passed to a vld1 intrinsic in the current function. This is used to help guide the vector cost model. */ hash_set *vector_load_decls; + + /* An instruction that was emitted at the start of the function to + set an Advanced SIMD pseudo register to zero. If the instruction + still exists and still fulfils its original purpose, the same register + can be reused by other code. */ + rtx_insn *advsimd_zero_insn; } machine_function; #endif #endif diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md index 89767eecdf8..942270e99d6 100644 --- a/gcc/config/aarch64/iterators.md +++ b/gcc/config/aarch64/iterators.md @@ -1656,6 +1656,8 @@ (define_mode_attr Vnarrowq [(V8HI "v8qi") (V4SI "v4hi") ;; Narrowed quad-modes for VQN (Used for XTN2). (define_mode_attr VNARROWQ2 [(V8HI "V16QI") (V4SI "V8HI") (V2DI "V4SI")]) +(define_mode_attr Vnarrowq2 [(V8HI "v16qi") (V4SI "v8hi") + (V2DI "v4si")]) ;; Narrowed modes of vector modes. (define_mode_attr VNARROW [(VNx8HI "VNx16QI") diff --git a/gcc/config/aarch64/predicates.md b/gcc/config/aarch64/predicates.md index 8a204e48bb5..71faa8624a5 100644 --- a/gcc/config/aarch64/predicates.md +++ b/gcc/config/aarch64/predicates.md @@ -1042,7 +1042,9 @@ (define_predicate "aarch64_gather_scale_operand_d" ;; A special predicate that doesn't match a particular mode.
(define_special_predicate "aarch64_any_register_operand" - (match_code "reg")) + (ior (match_code "reg") + (and (match_code "subreg") + (match_code "reg" "0")))) (define_predicate "aarch64_sve_any_binary_operator" (match_code "plus,minus,mult,div,udiv,smax,umax,smin,umin,and,ior,xor")) diff --git a/gcc/testsuite/gcc.target/aarch64/pr103350-1.c b/gcc/testsuite/gcc.target/aarch64/pr103350-1.c index a0e764e8653..151d27d6c62 100644 --- a/gcc/testsuite/gcc.target/aarch64/pr103350-1.c +++ b/gcc/testsuite/gcc.target/aarch64/pr103350-1.c @@ -1,5 +1,5 @@ /* { dg-do run { target le } } */ -/* { dg-additional-options "-Os -fno-tree-ter -save-temps -fdump-rtl-ree-all -free -std=c99 -w" } */ +/* { dg-additional-options "-Os -fno-tree-ter -save-temps -fdump-rtl-ree-all -free -std=c99 -w -fdisable-rtl-split1" } */ typedef unsigned char u8; typedef unsigned char __attribute__((__vector_size__ (8))) v64u8; diff --git a/gcc/testsuite/gcc.target/aarch64/pr103350-2.c b/gcc/testsuite/gcc.target/aarch64/pr103350-2.c index f799dfc77ce..79c807cadc0 100644 --- a/gcc/testsuite/gcc.target/aarch64/pr103350-2.c +++ b/gcc/testsuite/gcc.target/aarch64/pr103350-2.c @@ -1,5 +1,5 @@ /* { dg-do run { target le } } */ -/* { dg-additional-options "-O2 -save-temps -fdump-rtl-ree-all -free -std=c99 -w" } */ +/* { dg-additional-options "-O2 -save-temps -fdump-rtl-ree-all -free -std=c99 -w -fdisable-rtl-split1" } */ typedef unsigned char __attribute__((__vector_size__ (8))) v64u8; typedef unsigned char __attribute__((__vector_size__ (16))) v128u8; diff --git a/gcc/testsuite/gcc.target/aarch64/simd/vmovl_high_1.c b/gcc/testsuite/gcc.target/aarch64/simd/vmovl_high_1.c index a2d09eaee0d..9519062e6d7 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd/vmovl_high_1.c +++ b/gcc/testsuite/gcc.target/aarch64/simd/vmovl_high_1.c @@ -3,8 +3,6 @@ #include -#include - #define FUNC(IT, OT, S) \ OT \ foo_##S (IT a) \ @@ -22,11 +20,11 @@ FUNC (int32x4_t, int64x2_t, s32) /* { dg-final { scan-assembler-times 
{sxtl2\tv0\.2d, v0\.4s} 1} } */ FUNC (uint8x16_t, uint16x8_t, u8) -/* { dg-final { scan-assembler-times {zip2\tv0\.16b, v0\.16b} 1} } */ +/* { dg-final { scan-assembler-times {uxtl2\tv0\.8h, v0\.16b} 1} } */ FUNC (uint16x8_t, uint32x4_t, u16) -/* { dg-final { scan-assembler-times {zip2\tv0\.8h, v0\.8h} 1} } */ +/* { dg-final { scan-assembler-times {uxtl2\tv0\.4s, v0\.8h} 1} } */ FUNC (uint32x4_t, uint64x2_t, u32) -/* { dg-final { scan-assembler-times {zip2\tv0\.4s, v0\.4s} 1} } */ +/* { dg-final { scan-assembler-times {uxtl2\tv0\.2d, v0\.4s} 1} } */ diff --git a/gcc/testsuite/gcc.target/aarch64/uxtl-combine-10.c b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-10.c new file mode 100644 index 00000000000..283257135ef --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-10.c @@ -0,0 +1,24 @@ +/* { dg-options "-O2 -ftree-vectorize --param aarch64-vect-compare-costs=0" } */ +/* { dg-do run } */ + +#pragma GCC target "+nosve" + +void __attribute__((noipa)) +f (unsigned int *__restrict x, unsigned short *__restrict y, int n) +{ + for (int i = 0; i < n; ++i) + x[i] = y[i]; +} + +unsigned short y[] = { 1, 2, 3, 4, 5, 6, 7, 8, -1, -2, -3, -4, -5, -6, -7, -8 }; +volatile unsigned int x[16]; + +int +main (void) +{ + f ((unsigned int *) x, y, 16); + for (int i = 0; i < 8; ++i) + if (x[i] != i + 1 || x[i + 8] != 0xffff - i) + __builtin_abort (); + return 0; +} diff --git a/gcc/testsuite/gcc.target/aarch64/uxtl-combine-11.c b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-11.c new file mode 100644 index 00000000000..bb209d2d63d --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-11.c @@ -0,0 +1,127 @@ +/* { dg-options "-Os -fno-schedule-insns -fno-schedule-insns2" } */ +/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */ + +typedef __UINT8_TYPE__ v8qi __attribute__((vector_size(8))); +typedef __UINT16_TYPE__ v4hi __attribute__((vector_size(8))); +typedef __UINT32_TYPE__ v2si __attribute__((vector_size(8))); + +typedef __UINT16_TYPE__ 
v8hi __attribute__((vector_size(16))); +typedef __UINT32_TYPE__ v4si __attribute__((vector_size(16))); +typedef __UINT64_TYPE__ v2di __attribute__((vector_size(16))); + +/* +** f1: +** uxtl v0\.2d, v0\.2s +** ret +*/ +v2di f1 (v2si x) { return __builtin_convertvector (x, v2di); } + +/* +** f2: +** uxtl v0\.4s, v0\.4h +** ret +*/ +v4si f2 (v4hi x) { return __builtin_convertvector (x, v4si); } + +/* +** f3: +** uxtl v0\.8h, v0\.8b +** ret +*/ +v8hi f3 (v8qi x) { return __builtin_convertvector (x, v8hi); } + +/* +** g1: +** uxtl v[0-9]+\.2d, v[0-9]+\.2s +** uxtl v[0-9]+\.2d, v[0-9]+\.2s +** stp [^\n]+ +** ret +*/ +void +g1 (v2di *__restrict a, v2si b, v2si c) +{ + a[0] = __builtin_convertvector (b, v2di); + a[1] = __builtin_convertvector (c, v2di); +} + +/* +** g2: +** uxtl v[0-9]+\.4s, v[0-9]+\.4h +** uxtl v[0-9]+\.4s, v[0-9]+\.4h +** stp [^\n]+ +** ret +*/ +void +g2 (v4si *__restrict a, v4hi b, v4hi c) +{ + a[0] = __builtin_convertvector (b, v4si); + a[1] = __builtin_convertvector (c, v4si); +} + +/* +** g3: +** uxtl v[0-9]+\.8h, v[0-9]+\.8b +** uxtl v[0-9]+\.8h, v[0-9]+\.8b +** stp [^\n]+ +** ret +*/ +void +g3 (v8hi *__restrict a, v8qi b, v8qi c) +{ + a[0] = __builtin_convertvector (b, v8hi); + a[1] = __builtin_convertvector (c, v8hi); +} + +/* +** h1: +** uxtl v[0-9]+\.2d, v[0-9]+\.2s +** ... +** uxtl v[0-9]+\.2d, v[0-9]+\.2s +** ... +** uxtl v[0-9]+\.2d, v[0-9]+\.2s +** ... +** ret +*/ +void +h1 (v2di *__restrict a, v2si b, v2si c, v2si d) +{ + a[0] = __builtin_convertvector (b, v2di); + a[1] = __builtin_convertvector (c, v2di); + a[2] = __builtin_convertvector (d, v2di); +} + +/* +** h2: +** uxtl v[0-9]+\.4s, v[0-9]+\.4h +** ... +** uxtl v[0-9]+\.4s, v[0-9]+\.4h +** ... +** uxtl v[0-9]+\.4s, v[0-9]+\.4h +** ... 
+** ret +*/ +void +h2 (v4si *__restrict a, v4hi b, v4hi c, v4hi d) +{ + a[0] = __builtin_convertvector (b, v4si); + a[1] = __builtin_convertvector (c, v4si); + a[2] = __builtin_convertvector (d, v4si); +} + +/* +** h3: +** uxtl v[0-9]+\.8h, v[0-9]+\.8b +** ... +** uxtl v[0-9]+\.8h, v[0-9]+\.8b +** ... +** uxtl v[0-9]+\.8h, v[0-9]+\.8b +** ... +** ret +*/ +void +h3 (v8hi *__restrict a, v8qi b, v8qi c, v8qi d) +{ + a[0] = __builtin_convertvector (b, v8hi); + a[1] = __builtin_convertvector (c, v8hi); + a[2] = __builtin_convertvector (d, v8hi); +} diff --git a/gcc/testsuite/gcc.target/aarch64/uxtl-combine-12.c b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-12.c new file mode 100644 index 00000000000..4de8200a8c9 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-12.c @@ -0,0 +1,130 @@ +/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2" } */ +/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */ + +#include + +/* +** f1: +** uxtl2 v0\.2d, v0\.4s +** ret +*/ +uint64x2_t f1 (uint32x4_t x) { return vshll_high_n_u32 (x, 0); } + +/* +** f2: +** uxtl2 v0\.4s, v0\.8h +** ret +*/ +uint32x4_t f2 (uint16x8_t x) { return vshll_high_n_u16 (x, 0); } + +/* +** f3: +** uxtl2 v0\.8h, v0\.16b +** ret +*/ +uint16x8_t f3 (uint8x16_t x) { return vshll_high_n_u8 (x, 0); } + +/* +** g1: +** movi (v[0-9]+)\.4s, #?0 +** zip2 v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s +** zip2 v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s +** stp [^\n]+ +** ret +*/ +void +g1 (uint64x2_t *__restrict a, uint32x4_t b, uint32x4_t c) +{ + a[0] = vshll_high_n_u32 (b, 0); + a[1] = vshll_high_n_u32 (c, 0); +} + +/* +** g2: +** movi (v[0-9]+)\.4s, #?0 +** zip2 v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h +** zip2 v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h +** stp [^\n]+ +** ret +*/ +void +g2 (uint32x4_t *__restrict a, uint16x8_t b, uint16x8_t c) +{ + a[0] = vshll_high_n_u16 (b, 0); + a[1] = vshll_high_n_u16 (c, 0); +} + +/* +** g3: +** movi (v[0-9]+)\.4s, #?0 +** zip2 v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b +** zip2 
v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b +** stp [^\n]+ +** ret +*/ +void +g3 (uint16x8_t *__restrict a, uint8x16_t b, uint8x16_t c) +{ + a[0] = vshll_high_n_u8 (b, 0); + a[1] = vshll_high_n_u8 (c, 0); +} + +/* +** h1: +** movi (v[0-9]+)\.4s, #?0 +** ... +** zip2 v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s +** ... +** zip2 v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s +** ... +** zip2 v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s +** ... +** ret +*/ +void +h1 (uint64x2_t *__restrict a, uint32x4_t b, uint32x4_t c, uint32x4_t d) +{ + a[0] = vshll_high_n_u32 (b, 0); + a[1] = vshll_high_n_u32 (c, 0); + a[2] = vshll_high_n_u32 (d, 0); +} + +/* +** h2: +** movi (v[0-9]+)\.4s, #?0 +** ... +** zip2 v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h +** ... +** zip2 v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h +** ... +** zip2 v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h +** ... +** ret +*/ +void +h2 (uint32x4_t *__restrict a, uint16x8_t b, uint16x8_t c, uint16x8_t d) +{ + a[0] = vshll_high_n_u16 (b, 0); + a[1] = vshll_high_n_u16 (c, 0); + a[2] = vshll_high_n_u16 (d, 0); +} + +/* +** h3: +** movi (v[0-9]+)\.4s, #?0 +** ... +** zip2 v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b +** ... +** zip2 v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b +** ... +** zip2 v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b +** ... 
+**	ret
+*/
+void
+h3 (uint16x8_t *__restrict a, uint8x16_t b, uint8x16_t c, uint8x16_t d)
+{
+  a[0] = vshll_high_n_u8 (b, 0);
+  a[1] = vshll_high_n_u8 (c, 0);
+  a[2] = vshll_high_n_u8 (d, 0);
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/uxtl-combine-13.c b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-13.c
new file mode 100644
index 00000000000..0de589cb5c9
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-13.c
@@ -0,0 +1,26 @@
+/* { dg-options "-O2" } */
+
+#include <arm_neon.h>
+
+void foo ();
+
+void
+f (uint16x8_t *__restrict a, uint8x16_t *__restrict b)
+{
+  a[0] = vshll_high_n_u8 (b[0], 0);
+  a[1] = vshll_high_n_u8 (b[1], 0);
+  a[2] = vshll_high_n_u8 (b[2], 0);
+  a[3] = vshll_high_n_u8 (b[3], 0);
+  foo ();
+  a[4] = vshll_high_n_u8 (b[4], 0);
+  a[5] = vshll_high_n_u8 (b[5], 0);
+  a[6] = vshll_high_n_u8 (b[6], 0);
+  a[7] = vshll_high_n_u8 (b[7], 0);
+}
+
+/* The zero should be rematerialized after the call to foo.  */
+/* { dg-final { scan-assembler-times {\tmovi\tv[0-9]+\.4s, #?0\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tldp\tq} 4 } } */
+/* { dg-final { scan-assembler-times {\tzip2\t} 8 } } */
+/* { dg-final { scan-assembler-times {\tstp\tq} 4 } } */
+/* { dg-final { scan-assembler-not {\t[bhsdqv](?:[89]|1[0-5])} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/uxtl-combine-7.c b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-7.c
new file mode 100644
index 00000000000..278804685b0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-7.c
@@ -0,0 +1,136 @@
+/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2 -mlittle-endian" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+typedef __UINT8_TYPE__ v8qi __attribute__((vector_size(8)));
+typedef __UINT16_TYPE__ v4hi __attribute__((vector_size(8)));
+typedef __UINT32_TYPE__ v2si __attribute__((vector_size(8)));
+
+typedef __UINT16_TYPE__ v8hi __attribute__((vector_size(16)));
+typedef __UINT32_TYPE__ v4si __attribute__((vector_size(16)));
+typedef __UINT64_TYPE__ v2di __attribute__((vector_size(16)));
+
+/*
+** f1:
+**	uxtl	v0\.2d, v0\.2s
+**	ret
+*/
+v2di f1 (v2si x) { return __builtin_convertvector (x, v2di); }
+
+/*
+** f2:
+**	uxtl	v0\.4s, v0\.4h
+**	ret
+*/
+v4si f2 (v4hi x) { return __builtin_convertvector (x, v4si); }
+
+/*
+** f3:
+**	uxtl	v0\.8h, v0\.8b
+**	ret
+*/
+v8hi f3 (v8qi x) { return __builtin_convertvector (x, v8hi); }
+
+/*
+** g1:
+**	movi	(v[0-9]+)\.4s, #?0
+**	zip1	v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s
+**	zip1	v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s
+**	stp	[^\n]+
+**	ret
+*/
+void
+g1 (v2di *__restrict a, v2si b, v2si c)
+{
+  a[0] = __builtin_convertvector (b, v2di);
+  a[1] = __builtin_convertvector (c, v2di);
+}
+
+/*
+** g2:
+**	movi	(v[0-9]+)\.4s, #?0
+**	zip1	v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h
+**	zip1	v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h
+**	stp	[^\n]+
+**	ret
+*/
+void
+g2 (v4si *__restrict a, v4hi b, v4hi c)
+{
+  a[0] = __builtin_convertvector (b, v4si);
+  a[1] = __builtin_convertvector (c, v4si);
+}
+
+/*
+** g3:
+**	movi	(v[0-9]+)\.4s, #?0
+**	zip1	v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b
+**	zip1	v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b
+**	stp	[^\n]+
+**	ret
+*/
+void
+g3 (v8hi *__restrict a, v8qi b, v8qi c)
+{
+  a[0] = __builtin_convertvector (b, v8hi);
+  a[1] = __builtin_convertvector (c, v8hi);
+}
+
+/*
+** h1:
+**	movi	(v[0-9]+)\.4s, #?0
+** ...
+**	zip1	v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s
+** ...
+**	zip1	v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s
+** ...
+**	zip1	v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s
+** ...
+**	ret
+*/
+void
+h1 (v2di *__restrict a, v2si b, v2si c, v2si d)
+{
+  a[0] = __builtin_convertvector (b, v2di);
+  a[1] = __builtin_convertvector (c, v2di);
+  a[2] = __builtin_convertvector (d, v2di);
+}
+
+/*
+** h2:
+**	movi	(v[0-9]+)\.4s, #?0
+** ...
+**	zip1	v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h
+** ...
+**	zip1	v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h
+** ...
+**	zip1	v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h
+** ...
+**	ret
+*/
+void
+h2 (v4si *__restrict a, v4hi b, v4hi c, v4hi d)
+{
+  a[0] = __builtin_convertvector (b, v4si);
+  a[1] = __builtin_convertvector (c, v4si);
+  a[2] = __builtin_convertvector (d, v4si);
+}
+
+/*
+** h3:
+**	movi	(v[0-9]+)\.4s, #?0
+** ...
+**	zip1	v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b
+** ...
+**	zip1	v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b
+** ...
+**	zip1	v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b
+** ...
+**	ret
+*/
+void
+h3 (v8hi *__restrict a, v8qi b, v8qi c, v8qi d)
+{
+  a[0] = __builtin_convertvector (b, v8hi);
+  a[1] = __builtin_convertvector (c, v8hi);
+  a[2] = __builtin_convertvector (d, v8hi);
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/uxtl-combine-8.c b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-8.c
new file mode 100644
index 00000000000..dc68477738b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-8.c
@@ -0,0 +1,136 @@
+/* { dg-options "-O2 -fno-schedule-insns -fno-schedule-insns2 -mbig-endian" } */
+/* { dg-final { check-function-bodies "**" "" "" { target lp64 } } } */
+
+typedef __UINT8_TYPE__ v8qi __attribute__((vector_size(8)));
+typedef __UINT16_TYPE__ v4hi __attribute__((vector_size(8)));
+typedef __UINT32_TYPE__ v2si __attribute__((vector_size(8)));
+
+typedef __UINT16_TYPE__ v8hi __attribute__((vector_size(16)));
+typedef __UINT32_TYPE__ v4si __attribute__((vector_size(16)));
+typedef __UINT64_TYPE__ v2di __attribute__((vector_size(16)));
+
+/*
+** f1:
+**	uxtl	v0\.2d, v0\.2s
+**	ret
+*/
+v2di f1 (v2si x) { return __builtin_convertvector (x, v2di); }
+
+/*
+** f2:
+**	uxtl	v0\.4s, v0\.4h
+**	ret
+*/
+v4si f2 (v4hi x) { return __builtin_convertvector (x, v4si); }
+
+/*
+** f3:
+**	uxtl	v0\.8h, v0\.8b
+**	ret
+*/
+v8hi f3 (v8qi x) { return __builtin_convertvector (x, v8hi); }
+
+/*
+** g1:
+**	movi	(v[0-9]+)\.4s, #?0
+**	zip1	v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s
+**	zip1	v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s
+**	stp	[^\n]+
+**	ret
+*/
+void
+g1 (v2di *__restrict a, v2si b, v2si c)
+{
+  a[0] = __builtin_convertvector (b, v2di);
+  a[1] = __builtin_convertvector (c, v2di);
+}
+
+/*
+** g2:
+**	movi	(v[0-9]+)\.4s, #?0
+**	zip1	v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h
+**	zip1	v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h
+**	stp	[^\n]+
+**	ret
+*/
+void
+g2 (v4si *__restrict a, v4hi b, v4hi c)
+{
+  a[0] = __builtin_convertvector (b, v4si);
+  a[1] = __builtin_convertvector (c, v4si);
+}
+
+/*
+** g3:
+**	movi	(v[0-9]+)\.4s, #?0
+**	zip1	v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b
+**	zip1	v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b
+**	stp	[^\n]+
+**	ret
+*/
+void
+g3 (v8hi *__restrict a, v8qi b, v8qi c)
+{
+  a[0] = __builtin_convertvector (b, v8hi);
+  a[1] = __builtin_convertvector (c, v8hi);
+}
+
+/*
+** h1:
+**	movi	(v[0-9]+)\.4s, #?0
+** ...
+**	zip1	v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s
+** ...
+**	zip1	v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s
+** ...
+**	zip1	v[0-9]+\.4s, v[0-9]+\.4s, \1\.4s
+** ...
+**	ret
+*/
+void
+h1 (v2di *__restrict a, v2si b, v2si c, v2si d)
+{
+  a[0] = __builtin_convertvector (b, v2di);
+  a[1] = __builtin_convertvector (c, v2di);
+  a[2] = __builtin_convertvector (d, v2di);
+}
+
+/*
+** h2:
+**	movi	(v[0-9]+)\.4s, #?0
+** ...
+**	zip1	v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h
+** ...
+**	zip1	v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h
+** ...
+**	zip1	v[0-9]+\.8h, v[0-9]+\.8h, \1\.8h
+** ...
+**	ret
+*/
+void
+h2 (v4si *__restrict a, v4hi b, v4hi c, v4hi d)
+{
+  a[0] = __builtin_convertvector (b, v4si);
+  a[1] = __builtin_convertvector (c, v4si);
+  a[2] = __builtin_convertvector (d, v4si);
+}
+
+/*
+** h3:
+**	movi	(v[0-9]+)\.4s, #?0
+** ...
+**	zip1	v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b
+** ...
+**	zip1	v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b
+** ...
+**	zip1	v[0-9]+\.16b, v[0-9]+\.16b, \1\.16b
+** ...
+**	ret
+*/
+void
+h3 (v8hi *__restrict a, v8qi b, v8qi c, v8qi d)
+{
+  a[0] = __builtin_convertvector (b, v8hi);
+  a[1] = __builtin_convertvector (c, v8hi);
+  a[2] = __builtin_convertvector (d, v8hi);
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/uxtl-combine-9.c b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-9.c
new file mode 100644
index 00000000000..34fb6239c23
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/uxtl-combine-9.c
@@ -0,0 +1,32 @@
+/* { dg-options "-O2" } */
+/* { dg-do run } */
+
+#pragma GCC target "+nosve"
+
+typedef __UINT8_TYPE__ v8qi __attribute__((vector_size(8)));
+typedef __UINT16_TYPE__ v8hi __attribute__((vector_size(16)));
+
+void __attribute__((noipa))
+f (v8hi *__restrict a, v8qi b, v8qi c, v8qi d)
+{
+  a[0] = __builtin_convertvector (b, v8hi);
+  a[1] = __builtin_convertvector (c, v8hi);
+  a[2] = __builtin_convertvector (d, v8hi);
+}
+
+v8hi a[3];
+v8qi b = { 1, 2, 3, 4, 5, 6, 7, 8 };
+v8qi c = { -1, -2, -3, -4, -5, -6, -7, -8 };
+
+v8hi bconv = { 1, 2, 3, 4, 5, 6, 7, 8 };
+v8hi cconv = { 0xff, 0xfe, 0xfd, 0xfc, 0xfb, 0xfa, 0xf9, 0xf8 };
+
+int
+main (void)
+{
+  f (a, b, c, b);
+  if (__builtin_memcmp (&a[0], &bconv, sizeof (bconv)) != 0
+      || __builtin_memcmp (&a[1], &cconv, sizeof (cconv)) != 0)
+    __builtin_abort ();
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_10.c b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_10.c
index 81e77a8bb04..a741919b924 100644
--- a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_10.c
+++ b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_10.c
@@ -14,5 +14,5 @@ f (int16_t *x, int16_t *y, uint8_t *z, int n)
     }
 }
 
-/* { dg-final { scan-assembler-times {\tuxtl\tv[0-9]+\.8h, v[0-9]+\.8b\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tv[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tadd\tv[0-9]+\.8h,} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_8.c b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_8.c
index 9531966c294..835eef32f50 100644
--- a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_8.c
+++ b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_8.c
@@ -14,5 +14,5 @@ f (int64_t *x, int64_t *y, uint32_t *z, int n)
     }
 }
 
-/* { dg-final { scan-assembler-times {\tuxtl\tv[0-9]+\.2d, v[0-9]+\.2s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.4s\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tadd\tv[0-9]+\.2d,} 1 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_9.c b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_9.c
index de8f6988685..77ff691da1c 100644
--- a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_9.c
+++ b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_9.c
@@ -14,5 +14,5 @@ f (int32_t *x, int32_t *y, uint16_t *z, int n)
     }
 }
 
-/* { dg-final { scan-assembler-times {\tuxtl\tv[0-9]+\.4s, v[0-9]+\.4h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tzip1\tv[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h\n} 1 } } */
 /* { dg-final { scan-assembler-times {\tadd\tv[0-9]+\.4s,} 1 } } */