From patchwork Mon Feb 27 12:34:20 2023
From: Tamar Christina
Date: Mon, 27 Feb 2023 12:34:20 +0000
To: gcc-patches@gcc.gnu.org
Cc: nd@arm.com, Richard.Earnshaw@arm.com, Marcus.Shawcroft@arm.com,
    Kyrylo.Tkachov@arm.com, richard.sandiford@arm.com
Subject: [PATCH 4/4] AArch64 Update div-bitmask to implement new optab
 instead of target hook [PR108583]
Hi All,

This replaces the custom division hook with an implementation based on the
new add_highpart optab.

For NEON we implement the add highpart (addition + extraction of the upper
half of the register in the same precision) as ADD + LSR.  This
representation allows us to easily optimize the sequence using existing
patterns, and gets us a pretty decent sequence using SRA:

        umull   v1.8h, v0.8b, v3.8b
        umull2  v0.8h, v0.16b, v3.16b
        add     v5.8h, v1.8h, v2.8h
        add     v4.8h, v0.8h, v2.8h
        usra    v1.8h, v5.8h, 8
        usra    v0.8h, v4.8h, 8
        uzp2    v1.16b, v1.16b, v0.16b

To get the most optimal sequence, however, we match (a + ((b + c) >> n)),
where n is half the precision of the mode of the operation, into
addhn + uaddw.  This is a generally good optimization on its own and gets
us back to:

.L4:
        ldr     q0, [x3]
        umull   v1.8h, v0.8b, v5.8b
        umull2  v0.8h, v0.16b, v5.16b
        addhn   v3.8b, v1.8h, v4.8h
        addhn   v2.8b, v0.8h, v4.8h
        uaddw   v1.8h, v1.8h, v3.8b
        uaddw   v0.8h, v0.8h, v2.8b
        uzp2    v1.16b, v1.16b, v0.16b
        str     q1, [x3], 16
        cmp     x3, x4
        bne     .L4

For SVE2 we optimize the initial sequence to the same ADD + LSR, which gets
us:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        add     z1.h, z0.h, z3.h
        usra    z0.h, z1.h, #8
        lsr     z0.h, z0.h, #8
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3
.L1:
        ret

and to get the most optimal sequence I match (a + b) >> n (same constraint
on n) to addhnb, which gets us to:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        addhnb  z1.b, z0.h, z3.h
        addhnb  z0.b, z0.h, z1.h
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3

There are multiple possible RTL representations for these optimizations.
I did not represent them using a zero_extend because we seem very
inconsistent about this in the backend.  Since they are unspecs, we won't
match them from vector ops anyway.  I figured maintainers would prefer
this, but my maintainer ouija board is still out for repairs :)
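For reference, the arithmetic can be sanity-checked in plain C.  Below is a
quick scalar sketch (not part of the patch; addhn_model and the harness are
just illustrative) of the identity documented in the comments being removed
below, restricted to 16-bit products of two bytes, which is what the
vectorized loop actually divides by 255:

    #include <assert.h>
    #include <stdint.h>

    /* Scalar model of addhn on 8-bit elements: add two 16-bit values and
       return the high half of the (wrapping) 16-bit sum.  */
    static uint8_t
    addhn_model (uint16_t a, uint16_t b)
    {
      return (uint8_t) ((uint16_t) (a + b) >> 8);
    }

    int
    main (void)
    {
      for (uint32_t a = 0; a <= 255; a++)
        for (uint32_t b = 0; b <= 255; b++)
          {
            uint16_t x = (uint16_t) (a * b);   /* what umull produces */

            /* The identity from the removed comments: evaluated in double
               precision, x / 255 == (x + ((x + 257) >> 8)) >> 8.  */
            assert (((x + ((x + 257u) >> 8)) >> 8) == x / 255);

            /* The addhn + uaddw shape the new NEON pattern matches:
               tmp = addhn (x, 257), then result = (x + tmp) >> 8.  */
            uint8_t tmp = addhn_model (x, 257);
            assert (((uint16_t) (x + tmp) >> 8) == x / 255);
          }
      return 0;
    }

The same check covers the SVE2 form, since addhnb is the bottom-element
variant of the same add-high-narrow operation.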
There are no new tests, as new correctness tests were added to the mid-end
and the existing codegen tests for this already exist.

Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	PR target/108583
	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
	(*bitmask_shift_plus<mode>): New.
	* config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
	(@aarch64_bitmask_udiv<mode>3): Remove.
	* config/aarch64/aarch64.cc
	(aarch64_vectorize_can_special_div_by_constant,
	TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
	(TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
	aarch64_vectorize_preferred_div_as_shifts_over_mult): New.

--- inline copy of patch --
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 7f212bf37cd2c120dceb7efa733c9fa76226f029..e1ecb88634f93d380ef534093ea6599dc7278108 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -4867,60 +4867,27 @@ (define_expand "aarch64_<sur><addsub>hn2<mode>"
 }
 )
 
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; If we imagine a short as being composed of two blocks of bytes then
-;; adding 257 or 0b0000_0001_0000_0001 to the number is equivalent to
-;; adding 1 to each sub component:
-;;
-;;      short value of 16-bits
-;; ┌──────────────┬────────────────┐
-;; │              │                │
-;; └──────────────┴────────────────┘
-;;   8-bit part1 ▲  8-bit part2   ▲
-;;               │                │
-;;               │                │
-;;              +1               +1
-;;
-;; after the first addition, we have to shift right by 8, and narrow the
-;; results back to a byte.  Remember that the addition must be done in
-;; double the precision of the input.  Since 8 is half the size of a short
-;; we can use a narrowing halfing instruction in AArch64, addhn which also
-;; does the addition in a wider precision and narrows back to a byte.  The
-;; shift itself is implicit in the operation as it writes back only the top
-;; half of the result. i.e. bits 2*esize-1:esize.
-;;
-;; Since we have narrowed the result of the first part back to a byte, for
-;; the second addition we can use a widening addition, uaddw.
-;;
-;; For the final shift, since it's unsigned arithmetic we emit an ushr by 8.
-;;
-;; The shift is later optimized by combine to a uzp2 with movi #0.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
-  [(match_operand:VQN 0 "register_operand")
-   (match_operand:VQN 1 "register_operand")
-   (match_operand:VQN 2 "immediate_operand")]
+;; Optimize ((a + b) >> n) + c where n is half the bitsize of the vector
+(define_insn_and_split "*bitmask_shift_plus<mode>"
+  [(set (match_operand:VQN 0 "register_operand" "=&w")
+	(plus:VQN
+	  (lshiftrt:VQN
+	    (plus:VQN (match_operand:VQN 1 "register_operand" "w")
+		      (match_operand:VQN 2 "register_operand" "w"))
+	    (match_operand:VQN 3 "aarch64_simd_shift_imm_vec_exact_top" "Dr"))
+	  (match_operand:VQN 4 "register_operand" "w")))]
   "TARGET_SIMD"
+  "#"
+  "&& true"
+  [(const_int 0)]
 {
-  unsigned HOST_WIDE_INT size
-    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode)) - 1;
-  rtx elt = unwrap_const_vec_duplicate (operands[2]);
-  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
-    FAIL;
-
-  rtx addend = gen_reg_rtx (<MODE>mode);
-  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROWQ2>mode, 1);
-  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROWQ2>mode));
-  rtx tmp1 = gen_reg_rtx (<VNARROWQ>mode);
-  rtx tmp2 = gen_reg_rtx (<MODE>mode);
-  emit_insn (gen_aarch64_addhn<mode> (tmp1, operands[1], addend));
-  unsigned bitsize = GET_MODE_UNIT_BITSIZE (<VNARROWQ>mode);
-  rtx shift_vector = aarch64_simd_gen_const_vector_dup (<MODE>mode, bitsize);
-  emit_insn (gen_aarch64_uaddw<Vnarrowq> (tmp2, operands[1], tmp1));
-  emit_insn (gen_aarch64_simd_lshr<mode> (operands[0], tmp2, shift_vector));
+  rtx tmp;
+  if (can_create_pseudo_p ())
+    tmp = gen_reg_rtx (<VNARROWQ>mode);
+  else
+    tmp = gen_rtx_REG (<VNARROWQ>mode, REGNO (operands[0]));
+  emit_insn (gen_aarch64_addhn<mode> (tmp, operands[1], operands[2]));
+  emit_insn (gen_aarch64_uaddw<Vnarrowq> (operands[0], operands[4], tmp));
   DONE;
 })
 
diff --git a/gcc/config/aarch64/aarch64-sve2.md b/gcc/config/aarch64/aarch64-sve2.md
index 40c0728a7e6f00c395c360ce7625bc2e4a018809..bed44d7d6873877386222d56144cc115e3953a61 100644
--- a/gcc/config/aarch64/aarch64-sve2.md
+++ b/gcc/config/aarch64/aarch64-sve2.md
@@ -2317,41 +2317,24 @@ (define_insn "@aarch64_sve_<sve_int_op><mode>"
 ;; ---- [INT] Misc optab implementations
 ;; -------------------------------------------------------------------------
 ;; Includes:
-;; - aarch64_bitmask_udiv
+;; - bitmask_shift_plus
 ;; -------------------------------------------------------------------------
 
-;; div optimizations using narrowings
-;; we can do the division e.g. shorts by 255 faster by calculating it as
-;; (x + ((x + 257) >> 8)) >> 8 assuming the operation is done in
-;; double the precision of x.
-;;
-;; See aarch64-simd.md for bigger explanation.
-(define_expand "@aarch64_bitmask_udiv<mode>3"
-  [(match_operand:SVE_FULL_HSDI 0 "register_operand")
-   (match_operand:SVE_FULL_HSDI 1 "register_operand")
-   (match_operand:SVE_FULL_HSDI 2 "immediate_operand")]
+;; Optimize ((a + b) >> n) where n is half the bitsize of the vector
+(define_insn "*bitmask_shift_plus<mode>"
+  [(set (match_operand:SVE_FULL_HSDI 0 "register_operand" "=w")
+	(unspec:SVE_FULL_HSDI
+	  [(match_operand:<VPRED> 1)
+	   (lshiftrt:SVE_FULL_HSDI
+	     (plus:SVE_FULL_HSDI
+	       (match_operand:SVE_FULL_HSDI 2 "register_operand" "w")
+	       (match_operand:SVE_FULL_HSDI 3 "register_operand" "w"))
+	     (match_operand:SVE_FULL_HSDI 4
+		"aarch64_simd_shift_imm_vec_exact_top" "Dr"))]
+	  UNSPEC_PRED_X))]
   "TARGET_SVE2"
-{
-  unsigned HOST_WIDE_INT size
-    = (1ULL << GET_MODE_UNIT_BITSIZE (<VNARROW>mode)) - 1;
-  rtx elt = unwrap_const_vec_duplicate (operands[2]);
-  if (!CONST_INT_P (elt) || UINTVAL (elt) != size)
-    FAIL;
-
-  rtx addend = gen_reg_rtx (<MODE>mode);
-  rtx tmp1 = gen_reg_rtx (<VNARROW>mode);
-  rtx tmp2 = gen_reg_rtx (<VNARROW>mode);
-  rtx val = aarch64_simd_gen_const_vector_dup (<VNARROW>mode, 1);
-  emit_move_insn (addend, lowpart_subreg (<MODE>mode, val, <VNARROW>mode));
-  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp1, operands[1],
-			      addend));
-  emit_insn (gen_aarch64_sve (UNSPEC_ADDHNB, <MODE>mode, tmp2, operands[1],
-			      lowpart_subreg (<MODE>mode, tmp1,
-					      <VNARROW>mode)));
-  emit_move_insn (operands[0],
-		  lowpart_subreg (<MODE>mode, tmp2, <VNARROW>mode));
-  DONE;
-})
+  "addhnb\t%0.<Ventype>, %2.<Vetype>, %3.<Vetype>"
+)
 
 ;; =========================================================================
 ;; == Permutation
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index e6f47cbbb0d04a6f33b9a741ebb614cabd0204b9..2728fb347c0df1756b237f4d6268908eef6bdd2a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -3849,6 +3849,13 @@ aarch64_vectorize_related_mode (machine_mode vector_mode,
   return default_vectorize_related_mode (vector_mode, element_mode, nunits);
 }
 
+/* Implement TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT.  */
+
+static bool aarch64_vectorize_preferred_div_as_shifts_over_mult (void)
+{
+  return true;
+}
+
 /* Implement TARGET_PREFERRED_ELSE_VALUE.  For binary operations, prefer
    to use the first arithmetic operand as the else value if the else value
    doesn't matter, since that exactly matches the SVE
@@ -24363,46 +24370,6 @@ aarch64_vectorize_vec_perm_const (machine_mode vmode, machine_mode op_mode,
   return ret;
 }
 
-
-/* Implement TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST.  */
-
-bool
-aarch64_vectorize_can_special_div_by_constant (enum tree_code code,
-					       tree vectype, wide_int cst,
-					       rtx *output, rtx in0, rtx in1)
-{
-  if (code != TRUNC_DIV_EXPR
-      || !TYPE_UNSIGNED (vectype))
-    return false;
-
-  machine_mode mode = TYPE_MODE (vectype);
-  unsigned int flags = aarch64_classify_vector_mode (mode);
-  if ((flags & VEC_ANY_SVE) && !TARGET_SVE2)
-    return false;
-
-  int pow = wi::exact_log2 (cst + 1);
-  auto insn_code = maybe_code_for_aarch64_bitmask_udiv3 (TYPE_MODE (vectype));
-  /* SVE actually has a div operator, we may have gotten here through
-     that route.  */
-  if (pow != (int) (element_precision (vectype) / 2)
-      || insn_code == CODE_FOR_nothing)
-    return false;
-
-  /* We can use the optimized pattern.  */
-  if (in0 == NULL_RTX && in1 == NULL_RTX)
-    return true;
-
-  gcc_assert (output);
-
-  expand_operand ops[3];
-  create_output_operand (&ops[0], *output, mode);
-  create_input_operand (&ops[1], in0, mode);
-  create_fixed_operand (&ops[2], in1);
-  expand_insn (insn_code, 3, ops);
-  *output = ops[0].value;
-  return true;
-}
-
 /* Generate a byte permute mask for a register of mode MODE,
    which has NUNITS units.  */
@@ -27904,13 +27871,13 @@ aarch64_libgcc_floating_mode_supported_p
 #undef TARGET_MAX_ANCHOR_OFFSET
 #define TARGET_MAX_ANCHOR_OFFSET 4095
 
+#undef TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT
+#define TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT \
+  aarch64_vectorize_preferred_div_as_shifts_over_mult
+
 #undef TARGET_VECTOR_ALIGNMENT
 #define TARGET_VECTOR_ALIGNMENT aarch64_simd_vector_alignment
 
-#undef TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST
-#define TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST \
-  aarch64_vectorize_can_special_div_by_constant
-
 #undef TARGET_VECTORIZE_PREFERRED_VECTOR_ALIGNMENT
 #define TARGET_VECTORIZE_PREFERRED_VECTOR_ALIGNMENT \
   aarch64_vectorize_preferred_vector_alignment