From patchwork Mon Jan 8 17:55:26 2018
X-Patchwork-Submitter: Kyrill Tkachov
X-Patchwork-Id: 856995
Message-ID: <5A53B08E.6030702@foss.arm.com>
Date: Mon, 08 Jan 2018 17:55:26 +0000
From: Kyrill Tkachov
To: "gcc-patches@gcc.gnu.org"
CC: Christophe Lyon
Subject: [PATCH][arm][1/3] Add -march=armv8.4-a option

[resending due to mailer problems...]

Hi all,

This patch adds support for the Armv8.4-A architecture [1] in the arm
backend, through the new -march=armv8.4-a option.  With this patch
armv8.4-a is recognised as an argument and supports the extensions
simd, fp16, crypto, nocrypto and nofp, with the familiar meaning of
these options.  It is worth noting that there is no dotprod option,
unlike armv8.2-a and armv8.3-a, because Dot Product support is
mandatory in Armv8.4-A when simd is available; so when using +simd
(or fp16, which enables +simd), +dotprod is implied.
The various multilib selection makefile fragments are updated too, and
the multilib.exp test gets a few armv8.4-a combination tests.

Bootstrapped and tested on arm-none-linux-gnueabihf.

Christophe: Can I ask you a huge favour: could you give these 3 patches
a run through your testing infrastructure if you get the chance?  The
changes should be fairly self-contained (i.e. touching only
-march=armv8.4-a support), but I've gotten various edge cases in
testsuite setup wrong in the past...

Thanks,
Kyrill

[1] https://community.arm.com/processors/b/blog/posts/introducing-2017s-extensions-to-the-arm-architecture

2018-01-08  Kyrylo Tkachov

    * config/arm/arm-cpus.in (armv8_4): New feature.
    (ARMv8_4a): New fgroup.
    (armv8.4-a): New arch.
    * config/arm/arm-tables.opt: Regenerate.
    * config/arm/t-aprofile: Add matching rules for -march=armv8.4-a.
    * config/arm/t-arm-elf (all_v8_archs): Add armv8.4-a.
    * config/arm/t-multilib (v8_4_a_simd_variants): New variable.
    Add matching rules for -march=armv8.4-a and extensions.
    * doc/invoke.texi (ARM Options): Document -march=armv8.4-a.

2018-01-08  Kyrylo Tkachov

    * gcc.target/arm/multilib.exp: Add some -march=armv8.4-a
    combination tests.

diff --git a/gcc/config/arm/arm-cpus.in b/gcc/config/arm/arm-cpus.in
index 281ec162db8c982128462d8efac2be1d21959cf7..0967b9d2277a0d211452b7cd4d579db1774f29b3 100644
--- a/gcc/config/arm/arm-cpus.in
+++ b/gcc/config/arm/arm-cpus.in
@@ -120,6 +120,9 @@ define feature armv8_2
 # Architecture rel 8.3.
 define feature armv8_3
 
+# Architecture rel 8.4.
+define feature armv8_4
+
 # M-Profile security extensions.
 define feature cmse

@@ -242,6 +245,7 @@ define fgroup ARMv8a ARMv7ve armv8
 define fgroup ARMv8_1a ARMv8a crc32 armv8_1
 define fgroup ARMv8_2a ARMv8_1a armv8_2
 define fgroup ARMv8_3a ARMv8_2a armv8_3
+define fgroup ARMv8_4a ARMv8_3a armv8_4
 define fgroup ARMv8m_base ARMv6m armv8 cmse tdiv
 define fgroup ARMv8m_main ARMv7m armv8 cmse
 define fgroup ARMv8r ARMv8a
@@ -597,6 +601,19 @@ begin arch armv8.3-a
  option dotprod add FP_ARMv8 DOTPROD
 end arch armv8.3-a
 
+begin arch armv8.4-a
+ tune for cortex-a53
+ tune flags CO_PROC
+ base 8A
+ profile A
+ isa ARMv8_4a
+ option simd add FP_ARMv8 DOTPROD
+ option fp16 add fp16 FP_ARMv8 DOTPROD
+ option crypto add FP_ARMv8 CRYPTO DOTPROD
+ option nocrypto remove ALL_CRYPTO
+ option nofp remove ALL_FP
+end arch armv8.4-a
+
 begin arch armv8-m.base
  tune for cortex-m23
  base 8M_BASE

diff --git a/gcc/config/arm/arm-tables.opt b/gcc/config/arm/arm-tables.opt
index f7937256cd79296ba33d109232bcf0d6f7b03917..b8ebec668b1404fd3f9a71dd1f0d48d1261bcf53 100644
--- a/gcc/config/arm/arm-tables.opt
+++ b/gcc/config/arm/arm-tables.opt
@@ -455,19 +455,22 @@ EnumValue
 Enum(arm_arch) String(armv8.3-a) Value(29)
 
 EnumValue
-Enum(arm_arch) String(armv8-m.base) Value(30)
+Enum(arm_arch) String(armv8.4-a) Value(30)
 
 EnumValue
-Enum(arm_arch) String(armv8-m.main) Value(31)
+Enum(arm_arch) String(armv8-m.base) Value(31)
 
 EnumValue
-Enum(arm_arch) String(armv8-r) Value(32)
+Enum(arm_arch) String(armv8-m.main) Value(32)
 
 EnumValue
-Enum(arm_arch) String(iwmmxt) Value(33)
+Enum(arm_arch) String(armv8-r) Value(33)
 
 EnumValue
-Enum(arm_arch) String(iwmmxt2) Value(34)
+Enum(arm_arch) String(iwmmxt) Value(34)
+
+EnumValue
+Enum(arm_arch) String(iwmmxt2) Value(35)
 
 Enum
 Name(arm_fpu) Type(enum fpu_type)

diff --git a/gcc/config/arm/t-aprofile b/gcc/config/arm/t-aprofile
index a4bf04794e71381256e1489cdad71e966306477f..167a49d16e468be3c222a50abec57b6a68bc561e 100644
--- a/gcc/config/arm/t-aprofile
+++ b/gcc/config/arm/t-aprofile
@@ -96,6 +96,13 @@
 MULTILIB_MATCHES += $(foreach ARCH, $(v8_2_a_simd_variants), \
 		    march?armv8-a+simd=march?armv8.2-a$(ARCH) \
 		    march?armv8-a+simd=march?armv8.3-a$(ARCH))
 
+# Baseline v8.4-a: map down to baseline v8-a
+MULTILIB_MATCHES += march?armv8-a=march?armv8.4-a
+
+# Map all v8.4-a SIMD variants to v8-a+simd
+MULTILIB_MATCHES += $(foreach ARCH, $(v8_4_a_simd_variants), \
+		     march?armv8-a+simd=march?armv8.4-a$(ARCH))
+
 # Use Thumb libraries for everything.
 
 MULTILIB_REUSE += mthumb/march.armv7-a/mfloat-abi.soft=marm/march.armv7-a/mfloat-abi.soft

diff --git a/gcc/config/arm/t-arm-elf b/gcc/config/arm/t-arm-elf
index a15fb2df12f7b0d637976f3912432740ecd104bd..3e721ec789806335c6097d4088642150abf1003a 100644
--- a/gcc/config/arm/t-arm-elf
+++ b/gcc/config/arm/t-arm-elf
@@ -46,7 +46,7 @@ all_early_arch := armv5e armv5tej armv6 armv6j armv6k armv6z armv6kz \
 
 all_v7_a_r := armv7-a armv7ve armv7-r
 
-all_v8_archs := armv8-a armv8-a+crc armv8.1-a armv8.2-a armv8.3-a
+all_v8_archs := armv8-a armv8-a+crc armv8.1-a armv8.2-a armv8.3-a armv8.4-a
 
 # No floating point variants, require thumb1 softfp
 all_nofp_t := armv6-m armv6s-m armv8-m.base

diff --git a/gcc/config/arm/t-multilib b/gcc/config/arm/t-multilib
index cc8caa45e118890c5dbe4adbd1a83b8c856ab22e..26b8ae15da74b275e5617b3054572d2a7e8cfe49 100644
--- a/gcc/config/arm/t-multilib
+++ b/gcc/config/arm/t-multilib
@@ -69,7 +69,7 @@ v8_a_nosimd_variants := +crc
 v8_a_simd_variants := $(call all_feat_combs, simd crypto)
 v8_1_a_simd_variants := $(call all_feat_combs, simd crypto)
 v8_2_a_simd_variants := $(call all_feat_combs, simd fp16 crypto dotprod)
-
+v8_4_a_simd_variants := $(call all_feat_combs, simd fp16 crypto)
 
 ifneq (,$(HAS_APROFILE))
 include $(srcdir)/config/arm/t-aprofile
@@ -147,6 +147,13 @@
 MULTILIB_MATCHES += $(foreach ARCH, $(v8_2_a_simd_variants), \
 		    march?armv7+fp=march?armv8.2-a$(ARCH) \
 		    march?armv7+fp=march?armv8.3-a$(ARCH))
 
+# Baseline v8.4-a: map down to baseline v8-a
+MULTILIB_MATCHES += march?armv7=march?armv8.4-a
+
+# Map all v8.4-a SIMD variants
+MULTILIB_MATCHES += $(foreach ARCH, $(v8_4_a_simd_variants), \
+		     march?armv7+fp=march?armv8.4-a$(ARCH))
+
 # Use Thumb libraries for everything.
 
 MULTILIB_REUSE += mthumb/march.armv7/mfloat-abi.soft=marm/march.armv7/mfloat-abi.soft

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 283eab82d0533f427bb1821d1e29341f367ae690..9c2388aae2b813c675bf4b697cfd80e79cbfdb78 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -15655,6 +15655,7 @@ Permissible names are: @samp{armv6z}, @samp{armv6zk},
 @samp{armv7}, @samp{armv7-a}, @samp{armv7ve},
 @samp{armv8-a}, @samp{armv8.1-a}, @samp{armv8.2-a}, @samp{armv8.3-a},
+@samp{armv8.4-a},
 @samp{armv7-r},
 @samp{armv8-r},
 @samp{armv6-m}, @samp{armv6s-m},
@@ -15876,6 +15877,28 @@ Disable the cryptographic extension.
 Disable the floating-point, Advanced SIMD and cryptographic instructions.
 @end table
 
+@item armv8.4-a
+@table @samp
+@item +fp16
+The half-precision floating-point data processing instructions.
+This also enables the Advanced SIMD and floating-point instructions as well
+as the Dot Product extension.
+
+@item +simd
+The ARMv8.3-A Advanced SIMD and floating-point instructions as well as the
+Dot Product extension.
+
+@item +crypto
+The cryptographic instructions.  This also enables the Advanced SIMD and
+floating-point instructions as well as the Dot Product extension.
+
+@item +nocrypto
+Disable the cryptographic extension.
+
+@item +nofp
+Disable the floating-point, Advanced SIMD and cryptographic instructions.
+@end table
+
 @item armv7-r
 @table @samp
 @item +fp.sp

diff --git a/gcc/testsuite/gcc.target/arm/multilib.exp b/gcc/testsuite/gcc.target/arm/multilib.exp
index 8ab7ca8853c1228c1cdfe0d80930165b7e56350b..b210f32f680a673bedd3dc16ae74fefe70a403e4 100644
--- a/gcc/testsuite/gcc.target/arm/multilib.exp
+++ b/gcc/testsuite/gcc.target/arm/multilib.exp
@@ -92,6 +92,14 @@ if {[multilib_config "aprofile"] } {
 	{-march=armv8.3-a+simd+dotprod -mfloat-abi=softfp} "thumb/v8-a+simd/softfp"
 	{-march=armv8.3-a+simd+dotprod+nofp -mfloat-abi=softfp} "thumb/v8-a/nofp"
 	{-march=armv8.3-a+simd+nofp+dotprod -mfloat-abi=softfp} "thumb/v8-a+simd/softfp"
+	{-march=armv8.4-a+crypto -mfloat-abi=soft} "thumb/v8-a/nofp"
+	{-march=armv8.4-a+simd+crypto -mfloat-abi=softfp} "thumb/v8-a+simd/softfp"
+	{-march=armv8.4-a+simd+crypto+nofp -mfloat-abi=softfp} "thumb/v8-a/nofp"
+	{-march=armv8.4-a+simd+nofp+crypto -mfloat-abi=softfp} "thumb/v8-a+simd/softfp"
+	{-march=armv8.4-a+fp16 -mfloat-abi=soft} "thumb/v8-a/nofp"
+	{-march=armv8.4-a+simd+fp16 -mfloat-abi=softfp} "thumb/v8-a+simd/softfp"
+	{-march=armv8.4-a+simd+fp16+nofp -mfloat-abi=softfp} "thumb/v8-a/nofp"
+	{-march=armv8.4-a+simd+nofp+fp16 -mfloat-abi=softfp} "thumb/v8-a+simd/softfp"
 	{-mcpu=cortex-a53+crypto -mfloat-abi=hard} "thumb/v8-a+simd/hard"
 	{-mcpu=cortex-a53+nofp -mfloat-abi=softfp} "thumb/v8-a/nofp"
 	{-march=armv8-a+crc -mfloat-abi=hard -mfpu=vfp} "thumb/v8-a+simd/hard"

From patchwork Mon Jan 8 17:52:26 2018
X-Patchwork-Submitter: Kyrill Tkachov
X-Patchwork-Id: 856987
Message-ID: <5A53AFDA.3020507@foss.arm.com>
Date: Mon, 08 Jan 2018 17:52:26 +0000
From: Kyrill Tkachov
To: "gcc-patches@gcc.gnu.org"
CC: Christophe Lyon
Subject: [PATCH][arm][2/3] Implement fp16fml extension for ARMv8.4-A

Hi all,

This patch adds the +fp16fml extension, which enables some half-precision
floating-point Advanced SIMD instructions, available through arm_neon.h
intrinsics.  This extension is on by default for armv8.4-a if fp16 is
available, so it can be enabled by -march=armv8.4-a+fp16.  fp16fml is
also available for armv8.2-a and armv8.3-a through the +fp16fml option
that is added for these architectures.

The new instructions that this patch adds support for are:

vfmal.f16 Dr, Sm, Sn
vfmal.f16 Qr, Dm, Dn
vfmsl.f16 Dr, Sm, Sn
vfmsl.f16 Qr, Dm, Dn

They interpret their input registers as vectors of half-precision
floating-point values, extend them to single-precision vectors and
perform a fused multiply-add or multiply-subtract of them with the
destination vector.  This patch exposes these instructions through
arm_neon.h intrinsics.  The set of intrinsics allows us to perform the
multiply-add/subtract operation on the low or top half of float16x4_t
and float16x8_t values.  This maps naturally on aarch64 to the FMLAL
and FMLAL2 instructions, but on arm we have to use the fact that
consecutive NEON registers overlap the wider register (i.e. d0 is s0
plus s1, q0 is d0 plus d1, etc.).
This just means we have to be careful to use the right subreg operand
print code.  New arm-specific builtins are defined to expand to the new
patterns.  I've managed to compress the define_expands using code, mode
and int iterators, but the define_insns don't compress very well without
two-tiered iterators (iterator attributes expanding to iterators), which
we don't support.

Bootstrapped and tested on arm-none-linux-gnueabihf and also on
armeb-none-eabi.

Thanks,
Kyrill

2018-01-08  Kyrylo Tkachov

    * config/arm/arm-cpus.in (fp16fml): New feature.
    (ALL_SIMD): Add fp16fml.
    (armv8.2-a): Add fp16fml as an option.
    (armv8.3-a): Likewise.
    (armv8.4-a): Add fp16fml as part of fp16.
    * config/arm/arm.h (TARGET_FP16FML): Define.
    * config/arm/arm-c.c (arm_cpu_builtins): Define
    __ARM_FEATURE_FP16_FML when appropriate.
    * config/arm/arm-modes.def (V2HF): Define.
    * config/arm/arm_neon.h (vfmlal_low_u32, vfmlsl_low_u32,
    vfmlal_high_u32, vfmlsl_high_u32, vfmlalq_low_u32,
    vfmlslq_low_u32, vfmlalq_high_u32, vfmlslq_high_u32): Define.
    * config/arm/arm_neon_builtins.def (vfmal_low, vfmal_high,
    vfmsl_low, vfmsl_high): New set of builtins.
    * config/arm/iterators.md (PLUSMINUS): New code iterator.
    (vfml_op): New code attribute.
    (VFMLHALVES): New int iterator.
    (VFML, VFMLSEL): New mode attributes.
    (V_reg): Define mapping for V2HF.
    (V_hi, V_lo): New mode attributes.
    (VF_constraint): Likewise.
    (vfml_half, vfml_half_selector): New int attributes.
    * config/arm/neon.md (neon_vfml_): New define_expand.
    (vfmal_low_intrinsic, vfmsl_high_intrinsic,
    vfmal_high_intrinsic, vfmsl_low_intrinsic): New define_insn.
    * config/arm/t-arm-elf (v8_fps): Add fp16fml.
    * config/arm/t-multilib (v8_2_a_simd_variants): Add fp16fml.
    * config/arm/unspecs.md (UNSPEC_VFML_LO, UNSPEC_VFML_HI): New
    unspecs.
    * doc/invoke.texi (ARM Options): Document fp16fml.  Update
    armv8.4-a documentation.
    * doc/sourcebuild.texi (arm_fp16fml_neon_ok, arm_fp16fml_neon):
    Document new effective target and option set.
2018-01-08  Kyrylo Tkachov

    * gcc.target/arm/multilib.exp: Add combination tests for fp16fml.
    * gcc.target/arm/simd/fp16fml_high.c: New test.
    * gcc.target/arm/simd/fp16fml_low.c: Likewise.
    * lib/target-supports.exp
    (check_effective_target_arm_fp16fml_neon_ok_nocache,
    check_effective_target_arm_fp16fml_neon_ok,
    add_options_for_arm_fp16fml_neon): New procedures.

diff --git a/gcc/config/arm/arm-c.c b/gcc/config/arm/arm-c.c
index 635bc3c1c38de79802041fc50229b90defd2e467..46dc8d51ffcd80983a70f1bd283caa3688648c9b 100644
--- a/gcc/config/arm/arm-c.c
+++ b/gcc/config/arm/arm-c.c
@@ -160,6 +160,7 @@ arm_cpu_builtins (struct cpp_reader* pfile)
 		      TARGET_VFP_FP16INST);
   def_or_undef_macro (pfile, "__ARM_FEATURE_FP16_VECTOR_ARITHMETIC",
 		      TARGET_NEON_FP16INST);
+  def_or_undef_macro (pfile, "__ARM_FEATURE_FP16_FML", TARGET_FP16FML);
 
   def_or_undef_macro (pfile, "__ARM_FEATURE_FMA", TARGET_FMA);
   def_or_undef_macro (pfile, "__ARM_NEON__", TARGET_NEON);

diff --git a/gcc/config/arm/arm-cpus.in b/gcc/config/arm/arm-cpus.in
index 0967b9d2277a0d211452b7cd4d579db1774f29b3..7b9224b6b0791a9a7a315e1807b439604a3c0929 100644
--- a/gcc/config/arm/arm-cpus.in
+++ b/gcc/config/arm/arm-cpus.in
@@ -165,6 +165,9 @@ define feature fp16
 # Dot Product instructions extension to ARMv8.2-a.
 define feature dotprod
 
+# Half-precision floating-point instructions in ARMv8.4-A.
+define feature fp16fml
+
 # ISA Quirks (errata?).  Don't forget to add this to the fgroup
 # ALL_QUIRKS below.
 
@@ -202,7 +205,7 @@ define fgroup ALL_CRYPTO crypto
 # strip off 32 D-registers, but does not remove support for
 # double-precision FP.
 define fgroup ALL_SIMD_INTERNAL fp_d32 neon ALL_CRYPTO
-define fgroup ALL_SIMD ALL_SIMD_INTERNAL dotprod
+define fgroup ALL_SIMD ALL_SIMD_INTERNAL dotprod fp16fml
 
 # List of all FPU bits to strip out if -mfpu is used to override the
 # default.  fp16 is deliberately missing from this list.
@@ -581,6 +584,7 @@ begin arch armv8.2-a
  isa ARMv8_2a
  option simd add FP_ARMv8 NEON
  option fp16 add fp16 FP_ARMv8 NEON
+ option fp16fml add fp16fml fp16 FP_ARMv8 NEON
  option crypto add FP_ARMv8 CRYPTO
  option nocrypto remove ALL_CRYPTO
  option nofp remove ALL_FP
@@ -595,6 +599,7 @@ begin arch armv8.3-a
  isa ARMv8_3a
  option simd add FP_ARMv8 NEON
  option fp16 add fp16 FP_ARMv8 NEON
+ option fp16fml add fp16fml fp16 FP_ARMv8 NEON
  option crypto add FP_ARMv8 CRYPTO
  option nocrypto remove ALL_CRYPTO
  option nofp remove ALL_FP
@@ -608,7 +613,7 @@ begin arch armv8.4-a
  profile A
  isa ARMv8_4a
  option simd add FP_ARMv8 DOTPROD
- option fp16 add fp16 FP_ARMv8 DOTPROD
+ option fp16 add fp16 fp16fml FP_ARMv8 DOTPROD
  option crypto add FP_ARMv8 CRYPTO DOTPROD
  option nocrypto remove ALL_CRYPTO
  option nofp remove ALL_FP

diff --git a/gcc/config/arm/arm-modes.def b/gcc/config/arm/arm-modes.def
index f58a159a89852a82587ad974387468dab6c9be80..0fb22111cbb35fadd6642a5042779b7586870ceb 100644
--- a/gcc/config/arm/arm-modes.def
+++ b/gcc/config/arm/arm-modes.def
@@ -67,6 +67,7 @@
 VECTOR_MODES (INT, 8);        /* V8QI V4HI V2SI */
 VECTOR_MODES (INT, 16);       /* V16QI V8HI V4SI V2DI */
 VECTOR_MODES (FLOAT, 8);      /* V4HF V2SF */
 VECTOR_MODES (FLOAT, 16);     /* V8HF V4SF V2DF */
+VECTOR_MODE (FLOAT, HF, 2);   /* V2HF */
 
 /* Fraction and accumulator vector modes.  */
 VECTOR_MODES (FRACT, 4);      /* V4QQ V2HQ */

diff --git a/gcc/config/arm/arm.h b/gcc/config/arm/arm.h
index 410bfb998419dd3b10d47cc143def5cfdc1b02a0..fbb5e9f38af50dbefb465b6ed370c050ae2f6274 100644
--- a/gcc/config/arm/arm.h
+++ b/gcc/config/arm/arm.h
@@ -216,10 +216,18 @@ extern tree arm_fp16_type_node;
 		      isa_bit_dotprod) \
   && arm_arch8_2)
 
-/* FPU supports the floating point FP16 instructions for ARMv8.2 and later.  */
+/* FPU supports the floating point FP16 instructions for ARMv8.2-A
+   and later.  */
 #define TARGET_VFP_FP16INST \
   (TARGET_32BIT && TARGET_HARD_FLOAT && TARGET_VFP5 && arm_fp16_inst)
 
+/* Target supports the floating point FP16 instructions from ARMv8.2-A
+   and later.  */
+#define TARGET_FP16FML (TARGET_NEON \
+			&& bitmap_bit_p (arm_active_target.isa, \
+					 isa_bit_fp16fml) \
+			&& arm_arch8_2)
+
 /* FPU supports the AdvSIMD FP16 instructions for ARMv8.2 and later.  */
 #define TARGET_NEON_FP16INST (TARGET_VFP_FP16INST && TARGET_NEON_RDMA)

diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
index 649182b6a776220cf3aba1b5b1b023e4ccf7857f..01324096d673187b504b89a6d68785275b445b1b 100644
--- a/gcc/config/arm/arm_neon.h
+++ b/gcc/config/arm/arm_neon.h
@@ -18104,6 +18104,69 @@ vdotq_lane_s32 (int32x4_t __r, int8x16_t __a, int8x8_t __b, const int __index)
 #pragma GCC pop_options
 #endif
 
+#if defined (__ARM_FP16_FORMAT_IEEE) || defined (__ARM_FP16_FORMAT_ALTERNATIVE)
+#pragma GCC push_options
+#pragma GCC target ("arch=armv8.2-a+fp16fml")
+
+__extension__ extern __inline float32x2_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vfmlal_low_u32 (float32x2_t __r, float16x4_t __a, float16x4_t __b)
+{
+  return __builtin_neon_vfmal_lowv2sf (__r, __a, __b);
+}
+
+__extension__ extern __inline float32x2_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vfmlsl_low_u32 (float32x2_t __r, float16x4_t __a, float16x4_t __b)
+{
+  return __builtin_neon_vfmsl_lowv2sf (__r, __a, __b);
+}
+
+__extension__ extern __inline float32x2_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vfmlal_high_u32 (float32x2_t __r, float16x4_t __a, float16x4_t __b)
+{
+  return __builtin_neon_vfmal_highv2sf (__r, __a, __b);
+}
+
+__extension__ extern __inline float32x2_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vfmlsl_high_u32 (float32x2_t __r, float16x4_t __a, float16x4_t __b)
+{
+  return __builtin_neon_vfmsl_highv2sf (__r, __a, __b);
+}
+
+__extension__ extern __inline float32x4_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vfmlalq_low_u32 (float32x4_t __r, float16x8_t __a, float16x8_t __b)
+{
+  return __builtin_neon_vfmal_lowv4sf (__r, __a, __b);
+}
+
+__extension__ extern __inline float32x4_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vfmlslq_low_u32 (float32x4_t __r, float16x8_t __a, float16x8_t __b)
+{
+  return __builtin_neon_vfmsl_lowv4sf (__r, __a, __b);
+}
+
+__extension__ extern __inline float32x4_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vfmlalq_high_u32 (float32x4_t __r, float16x8_t __a, float16x8_t __b)
+{
+  return __builtin_neon_vfmal_highv4sf (__r, __a, __b);
+}
+
+__extension__ extern __inline float32x4_t
+__attribute__ ((__always_inline__, __gnu_inline__, __artificial__))
+vfmlslq_high_u32 (float32x4_t __r, float16x8_t __a, float16x8_t __b)
+{
+  return __builtin_neon_vfmsl_highv4sf (__r, __a, __b);
+}
+
+#pragma GCC pop_options
+#endif
+
 #ifdef __cplusplus
 }
 #endif

diff --git a/gcc/config/arm/arm_neon_builtins.def b/gcc/config/arm/arm_neon_builtins.def
index 982eec810dafb5ec955273099853f8842020d104..d4fe33b0502f46b9d6303a08003ede2c69574e29 100644
--- a/gcc/config/arm/arm_neon_builtins.def
+++ b/gcc/config/arm/arm_neon_builtins.def
@@ -51,6 +51,10 @@ VAR2 (TERNOP, vqdmlal, v4hi, v2si)
 VAR2 (TERNOP, vqdmlsl, v4hi, v2si)
 VAR4 (TERNOP, vqrdmlah, v4hi, v2si, v8hi, v4si)
 VAR4 (TERNOP, vqrdmlsh, v4hi, v2si, v8hi, v4si)
+VAR2 (TERNOP, vfmal_low, v2sf, v4sf)
+VAR2 (TERNOP, vfmal_high, v2sf, v4sf)
+VAR2 (TERNOP, vfmsl_low, v2sf, v4sf)
+VAR2 (TERNOP, vfmsl_high, v2sf, v4sf)
 VAR3 (BINOP, vmullp, v8qi, v4hi, v2si)
 VAR3 (BINOP, vmulls, v8qi, v4hi, v2si)
 VAR3 (BINOP, vmullu, v8qi, v4hi, v2si)

diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md
index a4fb234a846795e1c0dd5bf7de76ff7da487be23..efa410e4fbd301e0e43d5364bb3bd59e676962bf 100644
--- a/gcc/config/arm/iterators.md
+++ b/gcc/config/arm/iterators.md
@@ -247,6 +247,9 @@
 (define_code_iterator SHIFTABLE_OPS [plus minus ior xor and])
 
 ;; Operations on the sign of a number.
 (define_code_iterator ABSNEG [abs neg])
 
+;; The PLUS and MINUS operators.
+(define_code_iterator PLUSMINUS [plus minus])
+
 ;; Conversions.
 (define_code_iterator FCVT [unsigned_float float])
 
@@ -266,6 +269,8 @@ (define_code_attr cmp_op [(eq "eq") (gt "gt") (ge "ge") (lt "lt") (le "le")
 (define_code_attr cmp_type [(eq "i") (gt "s") (ge "s") (lt "s") (le "s")])
 
+(define_code_attr vfml_op [(plus "a") (minus "s")])
+
 ;;----------------------------------------------------------------------------
 ;; Int iterators
 ;;----------------------------------------------------------------------------
@@ -412,6 +417,8 @@ (define_int_iterator VFM_LANE_AS [UNSPEC_VFMA_LANE UNSPEC_VFMS_LANE])
 (define_int_iterator DOTPROD [UNSPEC_DOT_S UNSPEC_DOT_U])
 
+(define_int_iterator VFMLHALVES [UNSPEC_VFML_LO UNSPEC_VFML_HI])
+
 ;;----------------------------------------------------------------------------
 ;; Mode attributes
 ;;----------------------------------------------------------------------------
@@ -471,6 +478,12 @@ (define_mode_attr V_two_elem [(V8QI "HI") (V16QI "HI")
 			      (V2SF "V2SF") (V4SF "V2SF")
 			      (DI "V2DI")   (V2DI "V2DI")])
 
+;; Mode mapping for VFM[A,S]L instructions.
+(define_mode_attr VFML [(V2SF "V4HF") (V4SF "V8HF")])
+
+;; Mode mapping for VFM[A,S]L instructions for the vec_select result.
+(define_mode_attr VFMLSEL [(V2SF "V2HF") (V4SF "V4HF")])
+
 ;; Similar, for three elements.
 (define_mode_attr V_three_elem [(V8QI "BLK") (V16QI "BLK")
 				(V4HI "BLK") (V8HI "BLK")
@@ -494,8 +507,14 @@ (define_mode_attr V_reg [(V8QI "P") (V16QI "q")
 			 (V2SI "P") (V4SI  "q")
 			 (V2SF "P") (V4SF  "q")
 			 (DI   "P") (V2DI  "q")
-			 (SF   "")  (DF    "P")
-			 (HF   "")])
+			 (V2HF "")  (SF   "")
+			 (DF   "P") (HF   "")])
+
+;; Output template to select the high VFP register of a mult-register value.
+(define_mode_attr V_hi [(V2SF "p") (V4SF "f")])
+
+;; Output template to select the low VFP register of a mult-register value.
+(define_mode_attr V_lo [(V2SF "") (V4SF "e")])
 
 ;; Wider modes with the same number of elements.
 (define_mode_attr V_widen [(V8QI "V8HI") (V4HI "V4SI") (V2SI "V2DI")])
@@ -708,6 +727,7 @@ (define_mode_attr V_innermode [(V8QI "QI") (V4HI "HI") (V2SI "SI")])
 (define_mode_attr F_constraint [(SF "t") (DF "w")])
 (define_mode_attr vfp_type [(SF "s") (DF "d")])
 (define_mode_attr vfp_double_cond [(SF "") (DF "&& TARGET_VFP_DOUBLE")])
+(define_mode_attr VF_constraint [(V2SF "t") (V4SF "w")])
 
 ;; Mode attribute used to build the "type" attribute.
 (define_mode_attr q [(V8QI "") (V16QI "_q")
@@ -824,6 +844,12 @@ (define_int_attr sup [
   (UNSPEC_DOT_S "s") (UNSPEC_DOT_U "u")
 ])
 
+(define_int_attr vfml_half
+ [(UNSPEC_VFML_HI "high") (UNSPEC_VFML_LO "low")])
+
+(define_int_attr vfml_half_selector
+ [(UNSPEC_VFML_HI "true") (UNSPEC_VFML_LO "false")])
+
 (define_int_attr vcvth_op
  [(UNSPEC_VCVTA_S "a") (UNSPEC_VCVTA_U "a")
   (UNSPEC_VCVTM_S "m") (UNSPEC_VCVTM_U "m")

diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 073c26580dd317a01cd0e275965fee2ef83ae3f9..75be5aca8b57ad744d70253989073756bf4ca1fe 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -2290,6 +2290,98 @@ (define_expand "neon_vfms"
   DONE;
 })
 
+;; The expand RTL structure here is not important.
+;; We use the gen_* functions anyway.
+;; We just need something to wrap the iterators around.
+
+(define_expand "neon_vfm<vfml_op>l_<vfml_half><mode>"
+  [(set (match_operand:VCVTF 0 "s_register_operand")
+	(unspec:VCVTF
+	 [(match_operand:VCVTF 1 "s_register_operand")
+	   (PLUSMINUS:<VFML>
+	    (match_operand:<VFML> 2 "s_register_operand")
+	    (match_operand:<VFML> 3 "s_register_operand"))] VFMLHALVES))]
+  "TARGET_FP16FML"
+{
+  rtx half = arm_simd_vect_par_cnst_half (<VFML>mode, <vfml_half_selector>);
+  emit_insn (gen_vfm<vfml_op>l_<vfml_half><mode>_intrinsic (operands[0],
+							    operands[1],
+							    operands[2],
+							    operands[3],
+							    half, half));
+  DONE;
+})
+
+(define_insn "vfmal_low<mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+   (fma:VCVTF
+    (float_extend:VCVTF
+     (vec_select:<VFMLSEL>
+      (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+      (match_operand:<VFML> 4 "vect_par_constant_low" "")))
+    (float_extend:VCVTF
+     (vec_select:<VFMLSEL>
+      (match_operand:<VFML> 3 "s_register_operand" "<VF_constraint>")
+      (match_operand:<VFML> 5 "vect_par_constant_low" "")))
+    (match_operand:VCVTF 1 "s_register_operand" "0")))]
+ "TARGET_FP16FML"
+ "vfmal.f16\\t%<V_reg>0, %<V_lo>2, %<V_lo>3"
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
+(define_insn "vfmsl_high<mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+   (fma:VCVTF
+    (float_extend:VCVTF
+     (neg:<VFMLSEL>
+      (vec_select:<VFMLSEL>
+       (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+       (match_operand:<VFML> 4 "vect_par_constant_high" ""))))
+    (float_extend:VCVTF
+     (vec_select:<VFMLSEL>
+      (match_operand:<VFML> 3 "s_register_operand" "<VF_constraint>")
+      (match_operand:<VFML> 5 "vect_par_constant_high" "")))
+    (match_operand:VCVTF 1 "s_register_operand" "0")))]
+ "TARGET_FP16FML"
+ "vfmsl.f16\\t%<V_reg>0, %<V_hi>2, %<V_hi>3"
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
+(define_insn "vfmal_high<mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+   (fma:VCVTF
+    (float_extend:VCVTF
+     (vec_select:<VFMLSEL>
+      (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+      (match_operand:<VFML> 4 "vect_par_constant_high" "")))
+    (float_extend:VCVTF
+     (vec_select:<VFMLSEL>
+      (match_operand:<VFML> 3 "s_register_operand" "<VF_constraint>")
+      (match_operand:<VFML> 5 "vect_par_constant_high" "")))
+    (match_operand:VCVTF 1 "s_register_operand" "0")))]
+ "TARGET_FP16FML"
+ "vfmal.f16\\t%<V_reg>0, %<V_hi>2, %<V_hi>3"
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
+(define_insn "vfmsl_low<mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+   (fma:VCVTF
+    (float_extend:VCVTF
+     (neg:<VFMLSEL>
+      (vec_select:<VFMLSEL>
+       (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+       (match_operand:<VFML> 4 "vect_par_constant_low" ""))))
+    (float_extend:VCVTF
+     (vec_select:<VFMLSEL>
+      (match_operand:<VFML> 3 "s_register_operand" "<VF_constraint>")
+      (match_operand:<VFML> 5 "vect_par_constant_low" "")))
+    (match_operand:VCVTF 1 "s_register_operand" "0")))]
+ "TARGET_FP16FML"
+ "vfmsl.f16\\t%<V_reg>0, %<V_lo>2, %<V_lo>3"
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
 ; Used for intrinsics when flag_unsafe_math_optimizations is false.
 
 (define_insn "neon_vmla_unspec"

diff --git a/gcc/config/arm/t-arm-elf b/gcc/config/arm/t-arm-elf
index 3e721ec789806335c6097d4088642150abf1003a..d9fc9f1a5615ffb036dde18cd8e34c22b5c874cd 100644
--- a/gcc/config/arm/t-arm-elf
+++ b/gcc/config/arm/t-arm-elf
@@ -36,7 +36,7 @@ v7ve_fps := vfpv3-d16 vfpv3 vfpv3-d16-fp16 vfpv3-fp16 vfpv4 neon \
 
 # Not all these permutations exist for all architecture variants, but
 # it seems to work ok.
-v8_fps := simd fp16 crypto fp16+crypto dotprod
+v8_fps := simd fp16 crypto fp16+crypto dotprod fp16fml
 
 # We don't do anything special with these.  Pre-v4t probably doesn't work.
all_early_nofp := armv2 armv2a armv3 armv3m armv4 armv4t armv5 armv5t diff --git a/gcc/config/arm/t-multilib b/gcc/config/arm/t-multilib index 26b8ae15da74b275e5617b3054572d2a7e8cfe49..09c451e8ca6ee128989a331ef5870031c56b3e3b 100644 --- a/gcc/config/arm/t-multilib +++ b/gcc/config/arm/t-multilib @@ -68,7 +68,7 @@ v7ve_vfpv4_simd_variants := +simd v8_a_nosimd_variants := +crc v8_a_simd_variants := $(call all_feat_combs, simd crypto) v8_1_a_simd_variants := $(call all_feat_combs, simd crypto) -v8_2_a_simd_variants := $(call all_feat_combs, simd fp16 crypto dotprod) +v8_2_a_simd_variants := $(call all_feat_combs, simd fp16 fp16fml crypto dotprod) v8_4_a_simd_variants := $(call all_feat_combs, simd fp16 crypto) ifneq (,$(HAS_APROFILE)) diff --git a/gcc/config/arm/unspecs.md b/gcc/config/arm/unspecs.md index c474f4bb5db995b60f464f098e478f0398ce15f9..cc2309e751676ea7a15e6c4709e9bc55a32b51b3 100644 --- a/gcc/config/arm/unspecs.md +++ b/gcc/config/arm/unspecs.md @@ -412,4 +412,6 @@ (define_c_enum "unspec" [ UNSPEC_VRNDX UNSPEC_DOT_S UNSPEC_DOT_U + UNSPEC_VFML_LO + UNSPEC_VFML_HI ]) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 9c2388aae2b813c675bf4b697cfd80e79cbfdb78..0de15e70d86b2a9752911d1e3c2b62aea414d7ad 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -15860,6 +15860,11 @@ Disable the floating-point, Advanced SIMD and cryptographic instructions. The half-precision floating-point data processing instructions. This also enables the Advanced SIMD and floating-point instructions. +@item +fp16fml +The half-precision floating-point fmla extension. This also enables +the half-precision floating-point extension and Advanced SIMD and +floating-point instructions. + @item +simd The ARMv8.1-A Advanced SIMD and floating-point instructions. @@ -15882,7 +15887,8 @@ Disable the floating-point, Advanced SIMD and cryptographic instructions. @item +fp16 The half-precision floating-point data processing instructions. 
 This also enables the Advanced SIMD and floating-point instructions as well
-as the Dot Product extension.
+as the Dot Product extension and the half-precision floating-point fmla
+extension.
 
 @item +simd
 The ARMv8.3-A Advanced SIMD and floating-point instructions as well as the
diff --git a/gcc/doc/sourcebuild.texi b/gcc/doc/sourcebuild.texi
index d5da39334e4a4d1aae9a6222883a413d1a315020..2c9438c3491f71abed5deb71f70ba5f6aae918ac 100644
--- a/gcc/doc/sourcebuild.texi
+++ b/gcc/doc/sourcebuild.texi
@@ -1769,6 +1769,12 @@ ARM target supports executing instructions from ARMv8.2-A with the Dot
 Product extension. Some multilibs may be incompatible with these options.
 Implies arm_v8_2a_dotprod_neon_ok.
 
+@item arm_fp16fml_neon_ok
+@anchor{arm_fp16fml_neon_ok}
+ARM target supports extensions to generate the @code{VFMAL} and @code{VFMSL}
+half-precision floating-point instructions available from ARMv8.2-A and
+onwards.  Some multilibs may be incompatible with these options.
+
 @item arm_prefer_ldrd_strd
 ARM target prefers @code{LDRD} and @code{STRD} instructions over
 @code{LDM} and @code{STM} instructions.
@@ -2384,6 +2390,11 @@ Add options for ARMv8.2-A with Adv.SIMD Dot Product support, if this is
 supported by the target; see the @ref{arm_v8_2a_dotprod_neon_ok} effective
 target keyword.
 
+@item arm_fp16fml_neon
+Add options to enable generation of the @code{VFMAL} and @code{VFMSL}
+instructions, if this is supported by the target; see the
+@ref{arm_fp16fml_neon_ok} effective target keyword.
+
 @item bind_pic_locally
 Add the target-specific flags needed to enable functions to bind locally
 when using pic/PIC passes in the testsuite.
diff --git a/gcc/testsuite/gcc.target/arm/multilib.exp b/gcc/testsuite/gcc.target/arm/multilib.exp index b210f32f680a673bedd3dc16ae74fefe70a403e4..4e3324f23114c751b99d41d85dd14e0cb5d79145 100644 --- a/gcc/testsuite/gcc.target/arm/multilib.exp +++ b/gcc/testsuite/gcc.target/arm/multilib.exp @@ -92,6 +92,14 @@ if {[multilib_config "aprofile"] } { {-march=armv8.3-a+simd+dotprod -mfloat-abi=softfp} "thumb/v8-a+simd/softfp" {-march=armv8.3-a+simd+dotprod+nofp -mfloat-abi=softfp} "thumb/v8-a/nofp" {-march=armv8.3-a+simd+nofp+dotprod -mfloat-abi=softfp} "thumb/v8-a+simd/softfp" + {-march=armv8.2-a+fp16fml -mfloat-abi=soft} "thumb/v8-a/nofp" + {-march=armv8.2-a+simd+fp16fml -mfloat-abi=softfp} "thumb/v8-a+simd/softfp" + {-march=armv8.2-a+simd+fp16fml+nofp -mfloat-abi=softfp} "thumb/v8-a/nofp" + {-march=armv8.2-a+simd+nofp+fp16fml -mfloat-abi=softfp} "thumb/v8-a+simd/softfp" + {-march=armv8.3-a+fp16fml -mfloat-abi=soft} "thumb/v8-a/nofp" + {-march=armv8.3-a+simd+fp16fml -mfloat-abi=softfp} "thumb/v8-a+simd/softfp" + {-march=armv8.3-a+simd+fp16fml+nofp -mfloat-abi=softfp} "thumb/v8-a/nofp" + {-march=armv8.3-a+simd+nofp+fp16fml -mfloat-abi=softfp} "thumb/v8-a+simd/softfp" {-march=armv8.4-a+crypto -mfloat-abi=soft} "thumb/v8-a/nofp" {-march=armv8.4-a+simd+crypto -mfloat-abi=softfp} "thumb/v8-a+simd/softfp" {-march=armv8.4-a+simd+crypto+nofp -mfloat-abi=softfp} "thumb/v8-a/nofp" diff --git a/gcc/testsuite/gcc.target/arm/simd/fp16fml_high.c b/gcc/testsuite/gcc.target/arm/simd/fp16fml_high.c new file mode 100644 index 0000000000000000000000000000000000000000..0f50a57f42836dfd93d9dd2b52001fc6d6356744 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/simd/fp16fml_high.c @@ -0,0 +1,34 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target arm_fp16fml_neon_ok } */ +/* { dg-add-options arm_fp16fml_neon } */ + +#include "arm_neon.h" + +float32x2_t +test_vfmlal_high_u32 (float32x2_t r, float16x4_t a, float16x4_t b) +{ + return vfmlal_high_u32 (r, a, b); +} + +float32x4_t 
+test_vfmlalq_high_u32 (float32x4_t r, float16x8_t a, float16x8_t b) +{ + return vfmlalq_high_u32 (r, a, b); +} + +float32x2_t +test_vfmlsl_high_u32 (float32x2_t r, float16x4_t a, float16x4_t b) +{ + return vfmlsl_high_u32 (r, a, b); +} + +float32x4_t +test_vfmlslq_high_u32 (float32x4_t r, float16x8_t a, float16x8_t b) +{ + return vfmlslq_high_u32 (r, a, b); +} + +/* { dg-final { scan-assembler-times {vfmal.f16\td[0-9]+, s[123]?[13579], s[123]?[13579]} 1 } } */ +/* { dg-final { scan-assembler-times {vfmal.f16\tq[0-9]+, d[123]?[13579], d[123]?[13579]} 1 } } */ +/* { dg-final { scan-assembler-times {vfmsl.f16\td[0-9]+, s[123]?[13579], s[123]?[13579]} 1 } } */ +/* { dg-final { scan-assembler-times {vfmsl.f16\tq[0-9]+, d[123]?[13579], d[123]?[13579]} 1 } } */ diff --git a/gcc/testsuite/gcc.target/arm/simd/fp16fml_low.c b/gcc/testsuite/gcc.target/arm/simd/fp16fml_low.c new file mode 100644 index 0000000000000000000000000000000000000000..427331c8684ca5f0cc47272e4c30e23908995f33 --- /dev/null +++ b/gcc/testsuite/gcc.target/arm/simd/fp16fml_low.c @@ -0,0 +1,34 @@ +/* { dg-do compile } */ +/* { dg-require-effective-target arm_fp16fml_neon_ok } */ +/* { dg-add-options arm_fp16fml_neon } */ + +#include "arm_neon.h" + +float32x2_t +test_vfmlal_low_u32 (float32x2_t r, float16x4_t a, float16x4_t b) +{ + return vfmlal_low_u32 (r, a, b); +} + +float32x4_t +test_vfmlalq_low_u32 (float32x4_t r, float16x8_t a, float16x8_t b) +{ + return vfmlalq_low_u32 (r, a, b); +} + +float32x2_t +test_vfmlsl_low_u32 (float32x2_t r, float16x4_t a, float16x4_t b) +{ + return vfmlsl_low_u32 (r, a, b); +} + +float32x4_t +test_vfmlslq_low_u32 (float32x4_t r, float16x8_t a, float16x8_t b) +{ + return vfmlslq_low_u32 (r, a, b); +} + +/* { dg-final { scan-assembler-times {vfmal.f16\td[0-9]+, s[123]?[02468], s[123]?[02468]} 1 } } */ +/* { dg-final { scan-assembler-times {vfmal.f16\tq[0-9]+, d[123]?[02468], d[123]?[02468]} 1 } } */ +/* { dg-final { scan-assembler-times {vfmsl.f16\td[0-9]+, s[123]?[02468], 
s[123]?[02468]} 1 } } */ +/* { dg-final { scan-assembler-times {vfmsl.f16\tq[0-9]+, d[123]?[02468], d[123]?[02468]} 1 } } */ diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp index 114c1f11ccc00fb50a595e613122429e09c5925f..14517a840ea8870c8fab496637fa03c9c01634b9 100644 --- a/gcc/testsuite/lib/target-supports.exp +++ b/gcc/testsuite/lib/target-supports.exp @@ -4442,6 +4442,48 @@ proc add_options_for_arm_v8_2a_dotprod_neon { flags } { return "$flags $et_arm_v8_2a_dotprod_neon_flags" } +# Return 1 if the target supports FP16 VFMAL and VFMSL +# instructions, 0 otherwise. +# Record the command line options needed. + +proc check_effective_target_arm_fp16fml_neon_ok_nocache { } { + global et_arm_fp16fml_neon_flags + set et_arm_fp16fml_neon_flags "" + + if { ![istarget arm*-*-*] } { + return 0; + } + + # Iterate through sets of options to find the compiler flags that + # need to be added to the -march option. + foreach flags {"" "-mfloat-abi=softfp -mfpu=neon-fp-armv8" "-mfloat-abi=hard -mfpu=neon-fp-armv8"} { + if { [check_no_compiler_messages_nocache \ + arm_fp16fml_neon_ok object { + #if !defined (__ARM_FEATURE_FP16_FML) + #error "__ARM_FEATURE_FP16_FML not defined" + #endif + } "$flags -march=armv8.2-a+fp16fml"] } { + set et_arm_fp16fml_neon_flags "$flags -march=armv8.2-a+fp16fml" + return 1 + } + } + + return 0; +} + +proc check_effective_target_arm_fp16fml_neon_ok { } { + return [check_cached_effective_target arm_fp16fml_neon_ok \ + check_effective_target_arm_fp16fml_neon_ok_nocache] +} + +proc add_options_for_arm_fp16fml_neon { flags } { + if { ! [check_effective_target_arm_fp16fml_neon_ok] } { + return "$flags" + } + global et_arm_fp16fml_neon_flags + return "$flags $et_arm_fp16fml_neon_flags" +} + # Return 1 if the target supports executing ARMv8 NEON instructions, 0 # otherwise. 
From patchwork Mon Jan 8 17:52:30 2018
X-Patchwork-Submitter: Kyrill Tkachov
X-Patchwork-Id: 856989
Message-ID: <5A53AFDE.8080207@foss.arm.com>
Date: Mon, 08 Jan 2018 17:52:30 +0000
From: Kyrill Tkachov
To: "gcc-patches@gcc.gnu.org"
CC: Christophe Lyon
Subject: [PATCH][arm][3/3] Implement fp16fml lane intrinsics

Hi all,

This patch implements the lane-wise fp16fml intrinsics. There are quite a
few of them, so I've split them out from the other, simpler fp16fml
intrinsics. These intrinsics expose instructions such as:

vfmal.f16 Dd, Sn, Sm[<index>]    0 <= index <= 1
vfmal.f16 Qd, Dn, Dm[<index>]    0 <= index <= 3
vfmsl.f16 Dd, Sn, Sm[<index>]    0 <= index <= 1
vfmsl.f16 Qd, Dn, Dm[<index>]    0 <= index <= 3

These instructions extract a single half-precision floating-point value
from one of the source registers and perform a vfmal/vfmsl operation, as
per the normal variant, with that value.
The nuance here is that some of the intrinsics want to do things like:

float32x2_t vfmlal_laneq_low_u32 (float32x2_t __r, float16x4_t __a, float16x8_t __b, const int __index)

where the float16x8_t value of '__b' is held in a Q register, so we need to
be a bit smart about finding the right D or S sub-register and translating
the lane number to a lane in that sub-register, rather than just passing
the language-level const-int down to the assembly instruction. That's where
most of the complexity of this patch comes from, but hopefully it's
orthogonal enough to make sense.

Bootstrapped and tested on arm-none-linux-gnueabihf as well as
armeb-none-eabi.

Thanks,
Kyrill

2018-01-08  Kyrylo Tkachov

	* config/arm/arm_neon.h (vfmlal_lane_low_u32, vfmlal_lane_high_u32,
	vfmlalq_laneq_low_u32, vfmlalq_lane_low_u32, vfmlal_laneq_low_u32,
	vfmlalq_laneq_high_u32, vfmlalq_lane_high_u32, vfmlal_laneq_high_u32,
	vfmlsl_lane_low_u32, vfmlsl_lane_high_u32, vfmlslq_laneq_low_u32,
	vfmlslq_lane_low_u32, vfmlsl_laneq_low_u32, vfmlslq_laneq_high_u32,
	vfmlslq_lane_high_u32, vfmlsl_laneq_high_u32): Define.
	* config/arm/arm_neon_builtins.def (vfmal_lane_low,
	vfmal_lane_lowv4hf, vfmal_lane_lowv8hf, vfmal_lane_high,
	vfmal_lane_highv4hf, vfmal_lane_highv8hf, vfmsl_lane_low,
	vfmsl_lane_lowv4hf, vfmsl_lane_lowv8hf, vfmsl_lane_high,
	vfmsl_lane_highv4hf, vfmsl_lane_highv8hf): New sets of builtins.
	* config/arm/iterators.md (VFMLSEL2, vfmlsel2): New mode attributes.
	(V_lane_reg): Likewise.
	* config/arm/neon.md (neon_vfm<vfml_op>l_lane_<vfml_half><mode>):
	New define_expand.
	(neon_vfm<vfml_op>l_lane_<vfml_half><vfmlsel2><mode>): Likewise.
	(vfmal_lane_low<mode>_intrinsic,
	vfmal_lane_low<vfmlsel2><mode>_intrinsic,
	vfmal_lane_high<vfmlsel2><mode>_intrinsic,
	vfmal_lane_high<mode>_intrinsic,
	vfmsl_lane_low<mode>_intrinsic,
	vfmsl_lane_low<vfmlsel2><mode>_intrinsic,
	vfmsl_lane_high<vfmlsel2><mode>_intrinsic,
	vfmsl_lane_high<mode>_intrinsic): New define_insns.

2018-01-08  Kyrylo Tkachov

	* gcc.target/arm/simd/fp16fml_lane_high.c: New test.
	* gcc.target/arm/simd/fp16fml_lane_low.c: New test.
diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h index 01324096d673187b504b89a6d68785275b445b1b..a8aae4464aa02b4286751c116fae493517056e99 100644 --- a/gcc/config/arm/arm_neon.h +++ b/gcc/config/arm/arm_neon.h @@ -18164,6 +18164,150 @@ vfmlslq_high_u32 (float32x4_t __r, float16x8_t __a, float16x8_t __b) return __builtin_neon_vfmsl_highv4sf (__r, __a, __b); } +__extension__ extern __inline float32x2_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlal_lane_low_u32 (float32x2_t __r, float16x4_t __a, float16x4_t __b, + const int __index) +{ + __builtin_arm_lane_check (4, __index); + return __builtin_neon_vfmal_lane_lowv2sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x2_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlal_lane_high_u32 (float32x2_t __r, float16x4_t __a, float16x4_t __b, + const int __index) +{ + __builtin_arm_lane_check (4, __index); + return __builtin_neon_vfmal_lane_highv2sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x4_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlalq_laneq_low_u32 (float32x4_t __r, float16x8_t __a, float16x8_t __b, + const int __index) +{ + __builtin_arm_lane_check (8, __index); + return __builtin_neon_vfmal_lane_lowv4sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x4_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlalq_lane_low_u32 (float32x4_t __r, float16x8_t __a, float16x4_t __b, + const int __index) +{ + __builtin_arm_lane_check (4, __index); + return __builtin_neon_vfmal_lane_lowv4hfv4sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x2_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlal_laneq_low_u32 (float32x2_t __r, float16x4_t __a, float16x8_t __b, + const int __index) +{ + __builtin_arm_lane_check (8, __index); + return __builtin_neon_vfmal_lane_lowv8hfv2sf (__r, __a, __b, 
__index); +} + +__extension__ extern __inline float32x4_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlalq_laneq_high_u32 (float32x4_t __r, float16x8_t __a, float16x8_t __b, + const int __index) +{ + __builtin_arm_lane_check (8, __index); + return __builtin_neon_vfmal_lane_highv4sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x4_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlalq_lane_high_u32 (float32x4_t __r, float16x8_t __a, float16x4_t __b, + const int __index) +{ + __builtin_arm_lane_check (4, __index); + return __builtin_neon_vfmal_lane_highv4hfv4sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x2_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlal_laneq_high_u32 (float32x2_t __r, float16x4_t __a, float16x8_t __b, + const int __index) +{ + __builtin_arm_lane_check (8, __index); + return __builtin_neon_vfmal_lane_highv8hfv2sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x2_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlsl_lane_low_u32 (float32x2_t __r, float16x4_t __a, float16x4_t __b, + const int __index) +{ + __builtin_arm_lane_check (4, __index); + return __builtin_neon_vfmsl_lane_lowv2sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x2_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlsl_lane_high_u32 (float32x2_t __r, float16x4_t __a, float16x4_t __b, + const int __index) +{ + __builtin_arm_lane_check (4, __index); + return __builtin_neon_vfmsl_lane_highv2sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x4_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlslq_laneq_low_u32 (float32x4_t __r, float16x8_t __a, float16x8_t __b, + const int __index) +{ + __builtin_arm_lane_check (8, __index); + return __builtin_neon_vfmsl_lane_lowv4sf (__r, __a, __b, __index); +} + +__extension__ 
extern __inline float32x4_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlslq_lane_low_u32 (float32x4_t __r, float16x8_t __a, float16x4_t __b, + const int __index) +{ + __builtin_arm_lane_check (4, __index); + return __builtin_neon_vfmsl_lane_lowv4hfv4sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x2_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlsl_laneq_low_u32 (float32x2_t __r, float16x4_t __a, float16x8_t __b, + const int __index) +{ + __builtin_arm_lane_check (8, __index); + return __builtin_neon_vfmsl_lane_lowv8hfv2sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x4_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlslq_laneq_high_u32 (float32x4_t __r, float16x8_t __a, float16x8_t __b, + const int __index) +{ + __builtin_arm_lane_check (8, __index); + return __builtin_neon_vfmsl_lane_highv4sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x4_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlslq_lane_high_u32 (float32x4_t __r, float16x8_t __a, float16x4_t __b, + const int __index) +{ + __builtin_arm_lane_check (4, __index); + return __builtin_neon_vfmsl_lane_highv4hfv4sf (__r, __a, __b, __index); +} + +__extension__ extern __inline float32x2_t +__attribute__ ((__always_inline__, __gnu_inline__, __artificial__)) +vfmlsl_laneq_high_u32 (float32x2_t __r, float16x4_t __a, float16x8_t __b, + const int __index) +{ + __builtin_arm_lane_check (8, __index); + return __builtin_neon_vfmsl_lane_highv8hfv2sf (__r, __a, __b, __index); +} + #pragma GCC pop_options #endif diff --git a/gcc/config/arm/arm_neon_builtins.def b/gcc/config/arm/arm_neon_builtins.def index d4fe33b0502f46b9d6303a08003ede2c69574e29..d134ffd38b33062d32aaceafa169012cda7524d8 100644 --- a/gcc/config/arm/arm_neon_builtins.def +++ b/gcc/config/arm/arm_neon_builtins.def @@ -55,6 +55,18 @@ VAR2 (TERNOP, vfmal_low, v2sf, v4sf) VAR2 
(TERNOP, vfmal_high, v2sf, v4sf) VAR2 (TERNOP, vfmsl_low, v2sf, v4sf) VAR2 (TERNOP, vfmsl_high, v2sf, v4sf) +VAR2 (MAC_LANE, vfmal_lane_low, v2sf, v4sf) +VAR1 (MAC_LANE, vfmal_lane_lowv4hf, v4sf) +VAR1 (MAC_LANE, vfmal_lane_lowv8hf, v2sf) +VAR2 (MAC_LANE, vfmal_lane_high, v2sf, v4sf) +VAR1 (MAC_LANE, vfmal_lane_highv4hf, v4sf) +VAR1 (MAC_LANE, vfmal_lane_highv8hf, v2sf) +VAR2 (MAC_LANE, vfmsl_lane_low, v2sf, v4sf) +VAR1 (MAC_LANE, vfmsl_lane_lowv4hf, v4sf) +VAR1 (MAC_LANE, vfmsl_lane_lowv8hf, v2sf) +VAR2 (MAC_LANE, vfmsl_lane_high, v2sf, v4sf) +VAR1 (MAC_LANE, vfmsl_lane_highv4hf, v4sf) +VAR1 (MAC_LANE, vfmsl_lane_highv8hf, v2sf) VAR3 (BINOP, vmullp, v8qi, v4hi, v2si) VAR3 (BINOP, vmulls, v8qi, v4hi, v2si) VAR3 (BINOP, vmullu, v8qi, v4hi, v2si) diff --git a/gcc/config/arm/iterators.md b/gcc/config/arm/iterators.md index efa410e4fbd301e0e43d5364bb3bd59e676962bf..f4bc83aa08cd8db8264c01c74d7c86c3adf98773 100644 --- a/gcc/config/arm/iterators.md +++ b/gcc/config/arm/iterators.md @@ -484,6 +484,12 @@ (define_mode_attr VFML [(V2SF "V4HF") (V4SF "V8HF")]) ;; Mode mapping for VFM[A,S]L instructions for the vec_select result. (define_mode_attr VFMLSEL [(V2SF "V2HF") (V4SF "V4HF")]) +;; Mode mapping for VFM[A,S]L instructions for some awkward lane-wise forms. +(define_mode_attr VFMLSEL2 [(V2SF "V8HF") (V4SF "V4HF")]) + +;; Same as the above, but lowercase. +(define_mode_attr vfmlsel2 [(V2SF "v8hf") (V4SF "v4hf")]) + ;; Similar, for three elements. (define_mode_attr V_three_elem [(V8QI "BLK") (V16QI "BLK") (V4HI "BLK") (V8HI "BLK") @@ -516,6 +522,10 @@ (define_mode_attr V_hi [(V2SF "p") (V4SF "f")]) ;; Output template to select the low VFP register of a mult-register value. (define_mode_attr V_lo [(V2SF "") (V4SF "e")]) +;; Helper attribute for printing output templates for awkward forms of +;; vfmlal/vfmlsl intrinsics. +(define_mode_attr V_lane_reg [(V2SF "") (V4SF "P")]) + ;; Wider modes with the same number of elements. 
(define_mode_attr V_widen [(V8QI "V8HI") (V4HI "V4SI") (V2SI "V2DI")])
diff --git a/gcc/config/arm/neon.md b/gcc/config/arm/neon.md
index 75be5aca8b57ad744d70253989073756bf4ca1fe..24e5fe7f7d2d53744cd7b28943541e2aefcd59c8 100644
--- a/gcc/config/arm/neon.md
+++ b/gcc/config/arm/neon.md
@@ -2382,6 +2382,314 @@ (define_insn "vfmsl_low<mode>_intrinsic"
  [(set_attr "type" "neon_fp_mla_s")]
 )
 
+(define_expand "neon_vfm<vfml_op>l_lane_<vfml_half><mode>"
+  [(set:VCVTF (match_operand:VCVTF 0 "s_register_operand")
+	(unspec:VCVTF
+	 [(match_operand:VCVTF 1 "s_register_operand")
+	  (PLUSMINUS:<VFML>
+	   (match_operand:<VFML> 2 "s_register_operand")
+	   (match_operand:<VFML> 3 "s_register_operand"))
+	  (match_operand:SI 4 "const_int_operand")] VFMLHALVES))]
+  "TARGET_FP16FML"
+{
+  rtx lane = GEN_INT (NEON_ENDIAN_LANE_N (<VFML>mode, INTVAL (operands[4])));
+  rtx half = arm_simd_vect_par_cnst_half (<VFML>mode, <vfml_half_selector>);
+  emit_insn (gen_vfm<vfml_op>l_lane_<vfml_half><mode>_intrinsic
+	      (operands[0], operands[1],
+	       operands[2], operands[3],
+	       half, lane));
+  DONE;
+})
+
+(define_insn "vfmal_lane_low<mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+	(fma:VCVTF
+	 (float_extend:VCVTF
+	  (vec_select:<VFMLSEL>
+	   (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+	   (match_operand:<VFML> 4 "vect_par_constant_low" "")))
+	 (float_extend:VCVTF
+	  (vec_duplicate:<VFMLSEL>
+	   (vec_select:HF
+	    (match_operand:<VFML> 3 "s_register_operand" "x")
+	    (parallel [(match_operand:SI 5 "const_int_operand" "n")]))))
+	 (match_operand:VCVTF 1 "s_register_operand" "0")))]
+  "TARGET_FP16FML"
+  {
+    int lane = NEON_ENDIAN_LANE_N (<VFML>mode, INTVAL (operands[5]));
+    if (lane > GET_MODE_NUNITS (<VFMLSEL>mode) - 1)
+      {
+	operands[5] = GEN_INT (lane - GET_MODE_NUNITS (<VFMLSEL>mode));
+	return "vfmal.f16\\t%<V_reg>0, %<V_lo>2, %<V_hi>3[%c5]";
+      }
+    else
+      {
+	operands[5] = GEN_INT (lane);
+	return "vfmal.f16\\t%<V_reg>0, %<V_lo>2, %<V_lo>3[%c5]";
+      }
+  }
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
+(define_expand "neon_vfm<vfml_op>l_lane_<vfml_half><vfmlsel2><mode>"
+  [(set:VCVTF (match_operand:VCVTF 0 "s_register_operand")
+	(unspec:VCVTF
+	 [(match_operand:VCVTF 1 "s_register_operand")
+	  (PLUSMINUS:<VFML>
+	   (match_operand:<VFML> 2 "s_register_operand")
+	   (match_operand:<VFMLSEL2> 3 "s_register_operand"))
+	  (match_operand:SI 4 "const_int_operand")] VFMLHALVES))]
+  "TARGET_FP16FML"
+{
+  rtx lane
+    = GEN_INT (NEON_ENDIAN_LANE_N (<VFMLSEL2>mode, INTVAL (operands[4])));
+  rtx half = arm_simd_vect_par_cnst_half (<VFML>mode, <vfml_half_selector>);
+  emit_insn (gen_vfm<vfml_op>l_lane_<vfml_half><vfmlsel2><mode>_intrinsic
+	      (operands[0], operands[1], operands[2], operands[3],
+	       half, lane));
+  DONE;
+})
+
+;; Used to implement the intrinsics:
+;; float32x4_t vfmlalq_lane_low_u32 (float32x4_t r, float16x8_t a, float16x4_t b, const int lane)
+;; float32x2_t vfmlal_laneq_low_u32 (float32x2_t r, float16x4_t a, float16x8_t b, const int lane)
+;; Needs a bit of care to get the modes of the different sub-expressions right
+;; due to 'a' and 'b' having different sizes and to make sure we use the right
+;; S or D subregister to select the appropriate lane from.
+
+(define_insn "vfmal_lane_low<vfmlsel2><mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+	(fma:VCVTF
+	 (float_extend:VCVTF
+	  (vec_select:<VFMLSEL>
+	   (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+	   (match_operand:<VFML> 4 "vect_par_constant_low" "")))
+	 (float_extend:VCVTF
+	  (vec_duplicate:<VFMLSEL>
+	   (vec_select:HF
+	    (match_operand:<VFMLSEL2> 3 "s_register_operand" "x")
+	    (parallel [(match_operand:SI 5 "const_int_operand" "n")]))))
+	 (match_operand:VCVTF 1 "s_register_operand" "0")))]
+  "TARGET_FP16FML"
+  {
+    int lane = NEON_ENDIAN_LANE_N (<VFMLSEL2>mode, INTVAL (operands[5]));
+    int elts_per_reg = GET_MODE_NUNITS (<VFMLSEL>mode);
+    int new_lane = lane % elts_per_reg;
+    int regdiff = lane / elts_per_reg;
+    operands[5] = GEN_INT (new_lane);
+    /* We re-create operands[2] and operands[3] in the halved VFMLSEL modes
+       because we want the print_operand code to print the appropriate
+       S or D register prefix.  */
+    operands[3] = gen_rtx_REG (<VFMLSEL>mode, REGNO (operands[3]) + regdiff);
+    operands[2] = gen_rtx_REG (<VFMLSEL>mode, REGNO (operands[2]));
+    return "vfmal.f16\\t%<V_reg>0, %<V_lane_reg>2, %<V_lane_reg>3[%c5]";
+  }
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
+;; Used to implement the intrinsics:
+;; float32x4_t vfmlalq_lane_high_u32 (float32x4_t r, float16x8_t a, float16x4_t b, const int lane)
+;; float32x2_t vfmlal_laneq_high_u32 (float32x2_t r, float16x4_t a, float16x8_t b, const int lane)
+;; Needs a bit of care to get the modes of the different sub-expressions right
+;; due to 'a' and 'b' having different sizes and to make sure we use the right
+;; S or D subregister to select the appropriate lane from.
+
+(define_insn "vfmal_lane_high<vfmlsel2><mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+	(fma:VCVTF
+	 (float_extend:VCVTF
+	  (vec_select:<VFMLSEL>
+	   (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+	   (match_operand:<VFML> 4 "vect_par_constant_high" "")))
+	 (float_extend:VCVTF
+	  (vec_duplicate:<VFMLSEL>
+	   (vec_select:HF
+	    (match_operand:<VFMLSEL2> 3 "s_register_operand" "x")
+	    (parallel [(match_operand:SI 5 "const_int_operand" "n")]))))
+	 (match_operand:VCVTF 1 "s_register_operand" "0")))]
+  "TARGET_FP16FML"
+  {
+    int lane = NEON_ENDIAN_LANE_N (<VFMLSEL2>mode, INTVAL (operands[5]));
+    int elts_per_reg = GET_MODE_NUNITS (<VFMLSEL>mode);
+    int new_lane = lane % elts_per_reg;
+    int regdiff = lane / elts_per_reg;
+    operands[5] = GEN_INT (new_lane);
+    /* We re-create operands[3] in the halved VFMLSEL mode
+       because we've calculated the correct half-width subreg to extract
+       the lane from and we want to print *that* subreg instead.  */
+    operands[3] = gen_rtx_REG (<VFMLSEL>mode, REGNO (operands[3]) + regdiff);
+    return "vfmal.f16\\t%<V_reg>0, %<V_hi>2, %<V_lane_reg>3[%c5]";
+  }
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
+(define_insn "vfmal_lane_high<mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+	(fma:VCVTF
+	 (float_extend:VCVTF
+	  (vec_select:<VFMLSEL>
+	   (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+	   (match_operand:<VFML> 4 "vect_par_constant_high" "")))
+	 (float_extend:VCVTF
+	  (vec_duplicate:<VFMLSEL>
+	   (vec_select:HF
+	    (match_operand:<VFML> 3 "s_register_operand" "x")
+	    (parallel [(match_operand:SI 5 "const_int_operand" "n")]))))
+	 (match_operand:VCVTF 1 "s_register_operand" "0")))]
+  "TARGET_FP16FML"
+  {
+    int lane = NEON_ENDIAN_LANE_N (<VFML>mode, INTVAL (operands[5]));
+    if (lane > GET_MODE_NUNITS (<VFMLSEL>mode) - 1)
+      {
+	operands[5] = GEN_INT (lane - GET_MODE_NUNITS (<VFMLSEL>mode));
+	return "vfmal.f16\\t%<V_reg>0, %<V_hi>2, %<V_hi>3[%c5]";
+      }
+    else
+      {
+	operands[5] = GEN_INT (lane);
+	return "vfmal.f16\\t%<V_reg>0, %<V_hi>2, %<V_lo>3[%c5]";
+      }
+  }
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
+(define_insn "vfmsl_lane_low<mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+	(fma:VCVTF
+	 (float_extend:VCVTF
+	  (neg:<VFMLSEL>
+	   (vec_select:<VFMLSEL>
+	    (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+	    (match_operand:<VFML> 4 "vect_par_constant_low" ""))))
+	 (float_extend:VCVTF
+	  (vec_duplicate:<VFMLSEL>
+	   (vec_select:HF
+	    (match_operand:<VFML> 3 "s_register_operand" "x")
+	    (parallel [(match_operand:SI 5 "const_int_operand" "n")]))))
+	 (match_operand:VCVTF 1 "s_register_operand" "0")))]
+  "TARGET_FP16FML"
+  {
+    int lane = NEON_ENDIAN_LANE_N (<VFML>mode, INTVAL (operands[5]));
+    if (lane > GET_MODE_NUNITS (<VFMLSEL>mode) - 1)
+      {
+	operands[5] = GEN_INT (lane - GET_MODE_NUNITS (<VFMLSEL>mode));
+	return "vfmsl.f16\\t%<V_reg>0, %<V_lo>2, %<V_hi>3[%c5]";
+      }
+    else
+      {
+	operands[5] = GEN_INT (lane);
+	return "vfmsl.f16\\t%<V_reg>0, %<V_lo>2, %<V_lo>3[%c5]";
+      }
+  }
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
+;; Used to implement the intrinsics:
+;; float32x4_t vfmlslq_lane_low_u32 (float32x4_t r, float16x8_t a, float16x4_t b, const int lane)
+;; float32x2_t vfmlsl_laneq_low_u32 (float32x2_t r, float16x4_t a, float16x8_t b, const int lane)
+;; Needs a bit of care to get the modes of the different sub-expressions right
+;; due to 'a' and 'b' having different sizes and to make sure we use the right
+;; S or D subregister to select the appropriate lane from.
+
+(define_insn "vfmsl_lane_low<vfmlsel2><mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+	(fma:VCVTF
+	 (float_extend:VCVTF
+	  (neg:<VFMLSEL>
+	   (vec_select:<VFMLSEL>
+	    (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+	    (match_operand:<VFML> 4 "vect_par_constant_low" ""))))
+	 (float_extend:VCVTF
+	  (vec_duplicate:<VFMLSEL>
+	   (vec_select:HF
+	    (match_operand:<VFMLSEL2> 3 "s_register_operand" "x")
+	    (parallel [(match_operand:SI 5 "const_int_operand" "n")]))))
+	 (match_operand:VCVTF 1 "s_register_operand" "0")))]
+  "TARGET_FP16FML"
+  {
+    int lane = NEON_ENDIAN_LANE_N (<VFMLSEL2>mode, INTVAL (operands[5]));
+    int elts_per_reg = GET_MODE_NUNITS (<VFMLSEL>mode);
+    int new_lane = lane % elts_per_reg;
+    int regdiff = lane / elts_per_reg;
+    operands[5] = GEN_INT (new_lane);
+    /* We re-create operands[2] and operands[3] in the halved VFMLSEL modes
+       because we want the print_operand code to print the appropriate
+       S or D register prefix.  */
+    operands[3] = gen_rtx_REG (<VFMLSEL>mode, REGNO (operands[3]) + regdiff);
+    operands[2] = gen_rtx_REG (<VFMLSEL>mode, REGNO (operands[2]));
+    return "vfmsl.f16\\t%<V_reg>0, %<V_lane_reg>2, %<V_lane_reg>3[%c5]";
+  }
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
+;; Used to implement the intrinsics:
+;; float32x4_t vfmlslq_lane_high_u32 (float32x4_t r, float16x8_t a, float16x4_t b, const int lane)
+;; float32x2_t vfmlsl_laneq_high_u32 (float32x2_t r, float16x4_t a, float16x8_t b, const int lane)
+;; Needs a bit of care to get the modes of the different sub-expressions right
+;; due to 'a' and 'b' having different sizes and to make sure we use the right
+;; S or D subregister to select the appropriate lane from.
+
+(define_insn "vfmsl_lane_high<vfmlsel2><mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+	(fma:VCVTF
+	 (float_extend:VCVTF
+	  (neg:<VFMLSEL>
+	   (vec_select:<VFMLSEL>
+	    (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+	    (match_operand:<VFML> 4 "vect_par_constant_high" ""))))
+	 (float_extend:VCVTF
+	  (vec_duplicate:<VFMLSEL>
+	   (vec_select:HF
+	    (match_operand:<VFMLSEL2> 3 "s_register_operand" "x")
+	    (parallel [(match_operand:SI 5 "const_int_operand" "n")]))))
+	 (match_operand:VCVTF 1 "s_register_operand" "0")))]
+  "TARGET_FP16FML"
+  {
+    int lane = NEON_ENDIAN_LANE_N (<VFMLSEL2>mode, INTVAL (operands[5]));
+    int elts_per_reg = GET_MODE_NUNITS (<VFMLSEL>mode);
+    int new_lane = lane % elts_per_reg;
+    int regdiff = lane / elts_per_reg;
+    operands[5] = GEN_INT (new_lane);
+    /* We re-create operands[3] in the halved VFMLSEL mode
+       because we've calculated the correct half-width subreg to extract
+       the lane from and we want to print *that* subreg instead.  */
+    operands[3] = gen_rtx_REG (<VFMLSEL>mode, REGNO (operands[3]) + regdiff);
+    return "vfmsl.f16\\t%<V_reg>0, %<V_hi>2, %<V_lane_reg>3[%c5]";
+  }
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
+(define_insn "vfmsl_lane_high<mode>_intrinsic"
+ [(set (match_operand:VCVTF 0 "s_register_operand" "=w")
+	(fma:VCVTF
+	 (float_extend:VCVTF
+	  (neg:<VFMLSEL>
+	   (vec_select:<VFMLSEL>
+	    (match_operand:<VFML> 2 "s_register_operand" "<VF_constraint>")
+	    (match_operand:<VFML> 4 "vect_par_constant_high" ""))))
+	 (float_extend:VCVTF
+	  (vec_duplicate:<VFMLSEL>
+	   (vec_select:HF
+	    (match_operand:<VFML> 3 "s_register_operand" "x")
+	    (parallel [(match_operand:SI 5 "const_int_operand" "n")]))))
+	 (match_operand:VCVTF 1 "s_register_operand" "0")))]
+  "TARGET_FP16FML"
+  {
+    int lane = NEON_ENDIAN_LANE_N (<VFML>mode, INTVAL (operands[5]));
+    if (lane > GET_MODE_NUNITS (<VFMLSEL>mode) - 1)
+      {
+	operands[5] = GEN_INT (lane - GET_MODE_NUNITS (<VFMLSEL>mode));
+	return "vfmsl.f16\\t%<V_reg>0, %<V_hi>2, %<V_hi>3[%c5]";
+      }
+    else
+      {
+	operands[5] = GEN_INT (lane);
+	return "vfmsl.f16\\t%<V_reg>0, %<V_hi>2, %<V_lo>3[%c5]";
+      }
+  }
+ [(set_attr "type" "neon_fp_mla_s")]
+)
+
 ; Used for intrinsics when flag_unsafe_math_optimizations is false.
 (define_insn "neon_vmla_unspec"
diff --git a/gcc/testsuite/gcc.target/arm/simd/fp16fml_lane_high.c b/gcc/testsuite/gcc.target/arm/simd/fp16fml_lane_high.c
new file mode 100644
index 0000000000000000000000000000000000000000..67f5fa5f04f3458704d2d539d41aa029036fc680
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/fp16fml_lane_high.c
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_fp16fml_neon_ok } */
+/* { dg-add-options arm_fp16fml_neon } */
+
+#include "arm_neon.h"
+
+float32x2_t
+test_vfmlal_lane_high_u32 (float32x2_t r, float16x4_t a, float16x4_t b)
+{
+  return vfmlal_lane_high_u32 (r, a, b, 0);
+}
+
+float32x2_t
+test_vfmlsl_lane_high_u32 (float32x2_t r, float16x4_t a, float16x4_t b)
+{
+  return vfmlsl_lane_high_u32 (r, a, b, 0);
+}
+
+float32x2_t
+test_vfmlal_laneq_high_u32 (float32x2_t r, float16x4_t a, float16x8_t b)
+{
+  return vfmlal_laneq_high_u32 (r, a, b, 6);
+}
+
+float32x2_t
+test_vfmlsl_laneq_high_u32 (float32x2_t r, float16x4_t a, float16x8_t b)
+{
+  return vfmlsl_laneq_high_u32 (r, a, b, 6);
+}
+
+float32x4_t
+test_vfmlalq_lane_high_u32 (float32x4_t r, float16x8_t a, float16x4_t b)
+{
+  return vfmlalq_lane_high_u32 (r, a, b, 1);
+}
+
+float32x4_t
+test_vfmlslq_lane_high_u32 (float32x4_t r, float16x8_t a, float16x4_t b)
+{
+  return vfmlslq_lane_high_u32 (r, a, b, 1);
+}
+
+float32x4_t
+test_vfmlalq_laneq_high_u32 (float32x4_t r, float16x8_t a, float16x8_t b)
+{
+  return vfmlalq_laneq_high_u32 (r, a, b, 7);
+}
+
+float32x4_t
+test_vfmlslq_laneq_high_u32 (float32x4_t r, float16x8_t a, float16x8_t b)
+{
+  return vfmlslq_laneq_high_u32 (r, a, b, 7);
+}
+
+/* { dg-final { scan-assembler-times {vfmal.f16\td[0-9]+, s[123]?[13579], s[123]?[02468]\[0\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmal.f16\td[0-9]+, s[123]?[13579], s[123]?[13579]\[0\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmal.f16\tq[0-9]+, d[123]?[13579], d[0-9]+\[1\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmal.f16\tq[0-9]+, d[123]?[13579], d[123]?[13579]\[3\]} 1 } } */
+
+/* { dg-final { scan-assembler-times {vfmsl.f16\td[0-9]+, s[123]?[13579], s[123]?[02468]\[0\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmsl.f16\td[0-9]+, s[123]?[13579], s[123]?[13579]\[0\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmsl.f16\tq[0-9]+, d[123]?[13579], d[0-9]+\[1\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmsl.f16\tq[0-9]+, d[123]?[13579], d[123]?[13579]\[3\]} 1 } } */
diff --git a/gcc/testsuite/gcc.target/arm/simd/fp16fml_lane_low.c b/gcc/testsuite/gcc.target/arm/simd/fp16fml_lane_low.c
new file mode 100644
index 0000000000000000000000000000000000000000..585f775fb57a2f1d479eb66b2728f279cf3e4faf
--- /dev/null
+++ b/gcc/testsuite/gcc.target/arm/simd/fp16fml_lane_low.c
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_fp16fml_neon_ok } */
+/* { dg-add-options arm_fp16fml_neon } */
+
+#include "arm_neon.h"
+
+float32x2_t
+test_vfmlal_lane_low_u32 (float32x2_t r, float16x4_t a, float16x4_t b)
+{
+  return vfmlal_lane_low_u32 (r, a, b, 0);
+}
+
+float32x2_t
+test_vfmlsl_lane_low_u32 (float32x2_t r, float16x4_t a, float16x4_t b)
+{
+  return vfmlsl_lane_low_u32 (r, a, b, 0);
+}
+
+float32x2_t
+test_vfmlal_laneq_low_u32 (float32x2_t r, float16x4_t a, float16x8_t b)
+{
+  return vfmlal_laneq_low_u32 (r, a, b, 6);
+}
+
+float32x2_t
+test_vfmlsl_laneq_low_u32 (float32x2_t r, float16x4_t a, float16x8_t b)
+{
+  return vfmlsl_laneq_low_u32 (r, a, b, 6);
+}
+
+float32x4_t
+test_vfmlalq_lane_low_u32 (float32x4_t r, float16x8_t a, float16x4_t b)
+{
+  return vfmlalq_lane_low_u32 (r, a, b, 1);
+}
+
+float32x4_t
+test_vfmlslq_lane_low_u32 (float32x4_t r, float16x8_t a, float16x4_t b)
+{
+  return vfmlslq_lane_low_u32 (r, a, b, 1);
+}
+
+float32x4_t
+test_vfmlalq_laneq_low_u32 (float32x4_t r, float16x8_t a, float16x8_t b)
+{
+  return vfmlalq_laneq_low_u32 (r, a, b, 7);
+}
+
+float32x4_t
+test_vfmlslq_laneq_low_u32 (float32x4_t r, float16x8_t a, float16x8_t b)
+{
+  return vfmlslq_laneq_low_u32 (r, a, b, 7);
+}
+
+/* { dg-final { scan-assembler-times {vfmal.f16\td[0-9]+, s[123]?[02468], s[123]?[02468]\[0\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmal.f16\td[0-9]+, s[123]?[02468], s[123]?[13579]\[0\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmal.f16\tq[0-9]+, d[123]?[02468], d[0-9]+\[1\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmal.f16\tq[0-9]+, d[123]?[02468], d[123]?[13579]\[3\]} 1 } } */
+
+/* { dg-final { scan-assembler-times {vfmsl.f16\td[0-9]+, s[123]?[02468], s[123]?[02468]\[0\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmsl.f16\td[0-9]+, s[123]?[02468], s[123]?[13579]\[0\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmsl.f16\tq[0-9]+, d[123]?[02468], d[0-9]+\[1\]} 1 } } */
+/* { dg-final { scan-assembler-times {vfmsl.f16\tq[0-9]+, d[123]?[02468], d[123]?[13579]\[3\]} 1 } } */