From patchwork Fri Jan 12 12:39:34 2024
X-Patchwork-Submitter: Richard Sandiford
X-Patchwork-Id: 1886007
From: Richard Sandiford <richard.sandiford@arm.com>
To: gcc-patches@gcc.gnu.org
Subject: [pushed] aarch64: Rework uxtl->zip optimisation [PR113196]
Date: Fri, 12 Jan 2024 12:39:34 +0000

g:f26f92b534f9 implemented unsigned extensions using ZIPs rather than
UXTL{,2}, since the former has a higher throughput than the latter on
many cores.  The optimisation worked by lowering directly to ZIP during
expand, so that the zero input could be hoisted and shared.
However, changing to ZIP means that zero extensions no longer benefit
from some existing combine patterns.  That patch included new patterns
for UADDW and USUBW, but the PR shows that other patterns were affected
as well.

This patch instead introduces the ZIPs during a pre-reload split and
forcibly hoists the zero move to the outermost scope.  This has the
disadvantage of executing the move even for a shrink-wrapped function,
which I suppose could be a problem if it causes a kernel to trap and
enable Advanced SIMD unnecessarily.  In other circumstances, an unused
move shouldn't affect things much.

Also, the RA should be able to rematerialise the move at an
appropriate point if necessary, such as if there is an intervening
call.

In https://gcc.gnu.org/pipermail/gcc-patches/2024-January/641948.html
I then tried to allow a zero to be recombined back into a solitary
ZIP.  However, that relied on late-combine, which didn't make it into
GCC 14.  This version instead restricts the split to cases where the
UXTL executes more frequently than the entry block (which is where we
plan to put the zero).

Also, the original optimisation contained a big-endian correction that
I don't think is needed/correct.  Even on big-endian targets, we want
the ZIP to take the low half of an element from the input vector and
the high half from the zero vector.  And the patterns map directly to
the underlying Advanced SIMD instructions: the use of unspecs means
that there's no need to adjust for the difference between GCC and Arm
lane numbering.

Tested on aarch64-linux-gnu & pushed (after checking with Tamar
off-list).

Richard


gcc/
	PR target/113196
	* config/aarch64/aarch64.h (machine_function::advsimd_zero_insn):
	New member variable.
	* config/aarch64/aarch64-protos.h (aarch64_split_simd_shift_p):
	Declare.
	* config/aarch64/iterators.md (Vnarrowq2): New mode attribute.
	* config/aarch64/aarch64-simd.md (vec_unpacku_hi_<mode>)
	(vec_unpacks_hi_<mode>): Recombine into...
	(vec_unpack<su>_hi_<mode>): ...this.
	Move the generation of zip2 for zero-extends to...
	(aarch64_simd_vec_unpack<su>_hi_<mode>): ...a split of this
	instruction.  Fix big-endian handling.
	(vec_unpacku_lo_<mode>, vec_unpacks_lo_<mode>): Recombine into...
	(vec_unpack<su>_lo_<mode>): ...this.  Move the generation of zip1
	for zero-extends to...
	(<optab><Vnarrowq><mode>2): ...a split of this instruction.
	Fix big-endian handling.
	(*aarch64_zip1_uxtl): New pattern.
	(aarch64_usubw<mode>_lo_zip, aarch64_uaddw<mode>_lo_zip): Delete.
	(aarch64_usubw<mode>_hi_zip, aarch64_uaddw<mode>_hi_zip): Likewise.
	* config/aarch64/aarch64.cc (aarch64_get_shareable_reg): New
	function.
	(aarch64_gen_shareable_zero): Use it.
	(aarch64_split_simd_shift_p): New function.

gcc/testsuite/
	PR target/113196
	* gcc.target/aarch64/pr113196.c: New test.
	* gcc.target/aarch64/simd/vmovl_high_1.c: Remove double include.
	Expect uxtl2 rather than zip2.
	* gcc.target/aarch64/vect_mixed_sizes_8.c: Expect zip1 rather
	than uxtl.
	* gcc.target/aarch64/vect_mixed_sizes_9.c: Likewise.
	* gcc.target/aarch64/vect_mixed_sizes_10.c: Likewise.
---
 gcc/config/aarch64/aarch64-protos.h           |   1 +
 gcc/config/aarch64/aarch64-simd.md            | 134 +++++------------
 gcc/config/aarch64/aarch64.cc                 |  53 ++++++-
 gcc/config/aarch64/aarch64.h                  |   6 +
 gcc/config/aarch64/iterators.md               |   2 +
 gcc/testsuite/gcc.target/aarch64/pr113196.c   |  23 +++
 .../gcc.target/aarch64/simd/vmovl_high_1.c    |   8 +-
 .../gcc.target/aarch64/vect_mixed_sizes_10.c  |   2 +-
 .../gcc.target/aarch64/vect_mixed_sizes_8.c   |   2 +-
 .../gcc.target/aarch64/vect_mixed_sizes_9.c   |   2 +-
 10 files changed, 123 insertions(+), 110 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr113196.c

diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index ce9bec79cec..4c70e8a4963 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -880,6 +880,7 @@ rtx aarch64_return_addr_rtx (void);
 rtx aarch64_return_addr (int, rtx);
 rtx aarch64_simd_gen_const_vector_dup (machine_mode, HOST_WIDE_INT);
 rtx aarch64_gen_shareable_zero (machine_mode);
+bool
aarch64_split_simd_shift_p (rtx_insn *); bool aarch64_simd_mem_operand_p (rtx); bool aarch64_sve_ld1r_operand_p (rtx); bool aarch64_sve_ld1rq_operand_p (rtx); diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 3cd184f46fa..6f48b4d5f21 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -1958,7 +1958,7 @@ (define_insn "aarch64_simd_vec_unpack_lo_" [(set_attr "type" "neon_shift_imm_long")] ) -(define_insn "aarch64_simd_vec_unpack_hi_" +(define_insn_and_split "aarch64_simd_vec_unpack_hi_" [(set (match_operand: 0 "register_operand" "=w") (ANY_EXTEND: (vec_select: (match_operand:VQW 1 "register_operand" "w") @@ -1966,63 +1966,42 @@ (define_insn "aarch64_simd_vec_unpack_hi_" )))] "TARGET_SIMD" "xtl2\t%0., %1." - [(set_attr "type" "neon_shift_imm_long")] -) - -(define_expand "vec_unpacku_hi_" - [(match_operand: 0 "register_operand") - (match_operand:VQW 1 "register_operand")] - "TARGET_SIMD" + "&& == ZERO_EXTEND + && aarch64_split_simd_shift_p (insn)" + [(const_int 0)] { - rtx res = gen_reg_rtx (mode); - rtx tmp = aarch64_gen_shareable_zero (mode); - if (BYTES_BIG_ENDIAN) - emit_insn (gen_aarch64_zip2 (res, tmp, operands[1])); - else - emit_insn (gen_aarch64_zip2 (res, operands[1], tmp)); - emit_move_insn (operands[0], - simplify_gen_subreg (mode, res, mode, 0)); + /* On many cores, it is cheaper to implement UXTL2 using a ZIP2 with zero, + provided that the cost of the zero can be amortized over several + operations. We'll later recombine the zero and zip if there are + not sufficient uses of the zero to make the split worthwhile. 
*/ + rtx res = simplify_gen_subreg (mode, operands[0], mode, 0); + rtx zero = aarch64_gen_shareable_zero (mode); + emit_insn (gen_aarch64_zip2 (res, operands[1], zero)); DONE; } + [(set_attr "type" "neon_shift_imm_long")] ) -(define_expand "vec_unpacks_hi_" +(define_expand "vec_unpack_hi_" [(match_operand: 0 "register_operand") - (match_operand:VQW 1 "register_operand")] + (ANY_EXTEND: (match_operand:VQW 1 "register_operand"))] "TARGET_SIMD" { rtx p = aarch64_simd_vect_par_cnst_half (mode, , true); - emit_insn (gen_aarch64_simd_vec_unpacks_hi_ (operands[0], - operands[1], p)); + emit_insn (gen_aarch64_simd_vec_unpack_hi_ (operands[0], + operands[1], p)); DONE; } ) -(define_expand "vec_unpacku_lo_" +(define_expand "vec_unpack_lo_" [(match_operand: 0 "register_operand") - (match_operand:VQW 1 "register_operand")] - "TARGET_SIMD" - { - rtx res = gen_reg_rtx (mode); - rtx tmp = aarch64_gen_shareable_zero (mode); - if (BYTES_BIG_ENDIAN) - emit_insn (gen_aarch64_zip1 (res, tmp, operands[1])); - else - emit_insn (gen_aarch64_zip1 (res, operands[1], tmp)); - emit_move_insn (operands[0], - simplify_gen_subreg (mode, res, mode, 0)); - DONE; - } -) - -(define_expand "vec_unpacks_lo_" - [(match_operand: 0 "register_operand") - (match_operand:VQW 1 "register_operand")] + (ANY_EXTEND: (match_operand:VQW 1 "register_operand"))] "TARGET_SIMD" { rtx p = aarch64_simd_vect_par_cnst_half (mode, , false); - emit_insn (gen_aarch64_simd_vec_unpacks_lo_ (operands[0], - operands[1], p)); + emit_insn (gen_aarch64_simd_vec_unpack_lo_ (operands[0], + operands[1], p)); DONE; } ) @@ -4792,62 +4771,6 @@ (define_insn "aarch64_subw2_internal" [(set_attr "type" "neon_sub_widen")] ) -(define_insn "aarch64_usubw_lo_zip" - [(set (match_operand: 0 "register_operand" "=w") - (minus: - (match_operand: 1 "register_operand" "w") - (subreg: - (unspec: [ - (match_operand:VQW 2 "register_operand" "w") - (match_operand:VQW 3 "aarch64_simd_imm_zero") - ] UNSPEC_ZIP1) 0)))] - "TARGET_SIMD" - "usubw\\t%0., %1., 
%2." - [(set_attr "type" "neon_sub_widen")] -) - -(define_insn "aarch64_uaddw_lo_zip" - [(set (match_operand: 0 "register_operand" "=w") - (plus: - (subreg: - (unspec: [ - (match_operand:VQW 2 "register_operand" "w") - (match_operand:VQW 3 "aarch64_simd_imm_zero") - ] UNSPEC_ZIP1) 0) - (match_operand: 1 "register_operand" "w")))] - "TARGET_SIMD" - "uaddw\\t%0., %1., %2." - [(set_attr "type" "neon_add_widen")] -) - -(define_insn "aarch64_usubw_hi_zip" - [(set (match_operand: 0 "register_operand" "=w") - (minus: - (match_operand: 1 "register_operand" "w") - (subreg: - (unspec: [ - (match_operand:VQW 2 "register_operand" "w") - (match_operand:VQW 3 "aarch64_simd_imm_zero") - ] UNSPEC_ZIP2) 0)))] - "TARGET_SIMD" - "usubw2\\t%0., %1., %2." - [(set_attr "type" "neon_sub_widen")] -) - -(define_insn "aarch64_uaddw_hi_zip" - [(set (match_operand: 0 "register_operand" "=w") - (plus: - (subreg: - (unspec: [ - (match_operand:VQW 2 "register_operand" "w") - (match_operand:VQW 3 "aarch64_simd_imm_zero") - ] UNSPEC_ZIP2) 0) - (match_operand: 1 "register_operand" "w")))] - "TARGET_SIMD" - "uaddw2\\t%0., %1., %2." - [(set_attr "type" "neon_add_widen")] -) - (define_insn "aarch64_addw" [(set (match_operand: 0 "register_operand" "=w") (plus: @@ -9788,11 +9711,26 @@ (define_insn "aarch64_crypto_pmullv2di" ) ;; Sign- or zero-extend a 64-bit integer vector to a 128-bit vector. -(define_insn "2" +(define_insn_and_split "2" [(set (match_operand:VQN 0 "register_operand" "=w") (ANY_EXTEND:VQN (match_operand: 1 "register_operand" "w")))] "TARGET_SIMD" "xtl\t%0., %1." + "&& == ZERO_EXTEND + && aarch64_split_simd_shift_p (insn)" + [(const_int 0)] + { + /* On many cores, it is cheaper to implement UXTL using a ZIP1 with zero, + provided that the cost of the zero can be amortized over several + operations. We'll later recombine the zero and zip if there are + not sufficient uses of the zero to make the split worthwhile. 
*/ + rtx res = simplify_gen_subreg (mode, operands[0], + mode, 0); + rtx zero = aarch64_gen_shareable_zero (mode); + rtx op = lowpart_subreg (mode, operands[1], mode); + emit_insn (gen_aarch64_zip1 (res, op, zero)); + DONE; + } [(set_attr "type" "neon_shift_imm_long")] ) diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index 32c7317f360..7d1f8c65ce4 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -22882,16 +22882,61 @@ aarch64_mov_operand_p (rtx x, machine_mode mode) == SYMBOL_TINY_ABSOLUTE; } +/* Return a function-invariant register that contains VALUE. *CACHED_INSN + caches instructions that set up such registers, so that they can be + reused by future calls. */ + +static rtx +aarch64_get_shareable_reg (rtx_insn **cached_insn, rtx value) +{ + rtx_insn *insn = *cached_insn; + if (insn && INSN_P (insn) && !insn->deleted ()) + { + rtx pat = PATTERN (insn); + if (GET_CODE (pat) == SET) + { + rtx dest = SET_DEST (pat); + if (REG_P (dest) + && !HARD_REGISTER_P (dest) + && rtx_equal_p (SET_SRC (pat), value)) + return dest; + } + } + rtx reg = gen_reg_rtx (GET_MODE (value)); + *cached_insn = emit_insn_before (gen_rtx_SET (reg, value), + function_beg_insn); + return reg; +} + /* Create a 0 constant that is based on V4SI to allow CSE to optimally share the constant creation. */ rtx aarch64_gen_shareable_zero (machine_mode mode) { - machine_mode zmode = V4SImode; - rtx tmp = gen_reg_rtx (zmode); - emit_move_insn (tmp, CONST0_RTX (zmode)); - return lowpart_subreg (mode, tmp, zmode); + rtx reg = aarch64_get_shareable_reg (&cfun->machine->advsimd_zero_insn, + CONST0_RTX (V4SImode)); + return lowpart_subreg (mode, reg, GET_MODE (reg)); +} + +/* INSN is some form of extension or shift that can be split into a + permutation involving a shared zero. Return true if we should + perform such a split. + + ??? For now, make sure that the split instruction executes more + frequently than the zero that feeds it. 
In future it would be good + to split without that restriction and instead recombine shared zeros + if they turn out not to be worthwhile. This would allow splits in + single-block functions and would also cope more naturally with + rematerialization. */ + +bool +aarch64_split_simd_shift_p (rtx_insn *insn) +{ + return (can_create_pseudo_p () + && optimize_bb_for_speed_p (BLOCK_FOR_INSN (insn)) + && (ENTRY_BLOCK_PTR_FOR_FN (cfun)->count + < BLOCK_FOR_INSN (insn)->count)); } /* Return a const_int vector of VAL. */ diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h index 0a4e152c9bd..157a0b9dfa5 100644 --- a/gcc/config/aarch64/aarch64.h +++ b/gcc/config/aarch64/aarch64.h @@ -1056,6 +1056,12 @@ typedef struct GTY (()) machine_function /* A set of all decls that have been passed to a vld1 intrinsic in the current function. This is used to help guide the vector cost model. */ hash_set *vector_load_decls; + + /* An instruction that was emitted at the start of the function to + set an Advanced SIMD pseudo register to zero. If the instruction + still exists and still fulfils its original purpose, the same register + can be reused by other code. */ + rtx_insn *advsimd_zero_insn; } machine_function; #endif #endif diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md index 89767eecdf8..942270e99d6 100644 --- a/gcc/config/aarch64/iterators.md +++ b/gcc/config/aarch64/iterators.md @@ -1656,6 +1656,8 @@ (define_mode_attr Vnarrowq [(V8HI "v8qi") (V4SI "v4hi") ;; Narrowed quad-modes for VQN (Used for XTN2). (define_mode_attr VNARROWQ2 [(V8HI "V16QI") (V4SI "V8HI") (V2DI "V4SI")]) +(define_mode_attr Vnarrowq2 [(V8HI "v16qi") (V4SI "v8hi") + (V2DI "v4si")]) ;; Narrowed modes of vector modes.
(define_mode_attr VNARROW [(VNx8HI "VNx16QI") diff --git a/gcc/testsuite/gcc.target/aarch64/pr113196.c b/gcc/testsuite/gcc.target/aarch64/pr113196.c new file mode 100644 index 00000000000..8982cc50282 --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr113196.c @@ -0,0 +1,23 @@ +/* { dg-options "-O3" } */ + +#pragma GCC target "+nosve" + +int test(unsigned array[4][4]); + +int foo(unsigned short *a, unsigned long n) +{ + unsigned array[4][4]; + + for (unsigned i = 0; i < 4; i++, a += 4) + { + array[i][0] = a[0] << 6; + array[i][1] = a[1] << 6; + array[i][2] = a[2] << 6; + array[i][3] = a[3] << 6; + } + + return test(array); +} + +/* { dg-final { scan-assembler-times {\tushll\t} 2 } } */ +/* { dg-final { scan-assembler-times {\tushll2\t} 2 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/simd/vmovl_high_1.c b/gcc/testsuite/gcc.target/aarch64/simd/vmovl_high_1.c index a2d09eaee0d..9519062e6d7 100644 --- a/gcc/testsuite/gcc.target/aarch64/simd/vmovl_high_1.c +++ b/gcc/testsuite/gcc.target/aarch64/simd/vmovl_high_1.c @@ -3,8 +3,6 @@ #include -#include - #define FUNC(IT, OT, S) \ OT \ foo_##S (IT a) \ @@ -22,11 +20,11 @@ FUNC (int32x4_t, int64x2_t, s32) /* { dg-final { scan-assembler-times {sxtl2\tv0\.2d, v0\.4s} 1} } */ FUNC (uint8x16_t, uint16x8_t, u8) -/* { dg-final { scan-assembler-times {zip2\tv0\.16b, v0\.16b} 1} } */ +/* { dg-final { scan-assembler-times {uxtl2\tv0\.8h, v0\.16b} 1} } */ FUNC (uint16x8_t, uint32x4_t, u16) -/* { dg-final { scan-assembler-times {zip2\tv0\.8h, v0\.8h} 1} } */ +/* { dg-final { scan-assembler-times {uxtl2\tv0\.4s, v0\.8h} 1} } */ FUNC (uint32x4_t, uint64x2_t, u32) -/* { dg-final { scan-assembler-times {zip2\tv0\.4s, v0\.4s} 1} } */ +/* { dg-final { scan-assembler-times {uxtl2\tv0\.2d, v0\.4s} 1} } */ diff --git a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_10.c b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_10.c index 81e77a8bb04..a741919b924 100644 --- a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_10.c +++ 
b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_10.c @@ -14,5 +14,5 @@ f (int16_t *x, int16_t *y, uint8_t *z, int n) } } -/* { dg-final { scan-assembler-times {\tuxtl\tv[0-9]+\.8h, v[0-9]+\.8b\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tzip1\tv[0-9]+\.16b, v[0-9]+\.16b, v[0-9]+\.16b\n} 1 } } */ /* { dg-final { scan-assembler-times {\tadd\tv[0-9]+\.8h,} 1 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_8.c b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_8.c index 9531966c294..835eef32f50 100644 --- a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_8.c +++ b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_8.c @@ -14,5 +14,5 @@ f (int64_t *x, int64_t *y, uint32_t *z, int n) } } -/* { dg-final { scan-assembler-times {\tuxtl\tv[0-9]+\.2d, v[0-9]+\.2s\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tzip1\tv[0-9]+\.4s, v[0-9]+\.4s, v[0-9]+\.4s\n} 1 } } */ /* { dg-final { scan-assembler-times {\tadd\tv[0-9]+\.2d,} 1 } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_9.c b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_9.c index de8f6988685..77ff691da1c 100644 --- a/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_9.c +++ b/gcc/testsuite/gcc.target/aarch64/vect_mixed_sizes_9.c @@ -14,5 +14,5 @@ f (int32_t *x, int32_t *y, uint16_t *z, int n) } } -/* { dg-final { scan-assembler-times {\tuxtl\tv[0-9]+\.4s, v[0-9]+\.4h\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tzip1\tv[0-9]+\.8h, v[0-9]+\.8h, v[0-9]+\.8h\n} 1 } } */ /* { dg-final { scan-assembler-times {\tadd\tv[0-9]+\.4s,} 1 } } */