From: Alan Lawrence
Date: Mon, 08 Sep 2014 17:39:28 +0100
To: "gcc-patches@gcc.gnu.org"
Subject: [PATCH 2/2][AArch64] Replace temporary inline assembler for vset_lane
Message-ID: <540DDBC0.9040204@arm.com>
References: <540DD9DA.5000304@arm.com>
In-Reply-To: <540DD9DA.5000304@arm.com>

The vset(q?)_lane_XXX intrinsics are presently implemented using inline asm
blocks containing "ins" instructions, which are opaque to the mid-end.  This
patch replaces them with simple writes using GCC vector extension operations,
with a lane flip on big-endian (where the ARM intrinsic lanes are indexed the
opposite way around to GCC vector extensions).  Being transparent to the
mid-end, this should enable more optimization.

No significant changes to the assembly output for the vset_lane_1.c test from
the previous patch.

Tested with check-gcc on aarch64-none-elf and aarch64_be-none-elf, including
the vset_lane_1.c test from the previous patch.
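
For illustration, the new implementation boils down to an ordinary GCC
vector-extension element write, with the lane index mirrored on big-endian.
A rough standalone sketch (the *_sketch names are invented for this example,
and the real macro additionally checks that the index is an in-range
compile-time immediate via __builtin_aarch64_im_lane_boundsi):

/* Sketch only, not part of the patch: what __aarch64_vset_lane_any amounts
   to for a two-lane float vector, using plain GCC vector extensions.  */
typedef float float32x2_sketch __attribute__ ((vector_size (8)));

static inline float32x2_sketch
vset_lane_f32_sketch (float elem, float32x2_sketch vec, int index)
{
#ifdef __AARCH64EB__
  /* On big-endian the architectural lane indices run the opposite way to
     GCC's element indices, so mirror the index (2 lanes here).  */
  vec[2 - 1 - index] = elem;
#else
  vec[index] = elem;
#endif
  return vec;
}

Either way the store is visible to the mid-end as a normal vector element
write, so e.g. vset_lane_f32 (42.0f, v, 0) updates architectural lane 0 on
both endiannesses while remaining open to optimization.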
gcc/ChangeLog:

        * config/aarch64/arm_neon.h (__aarch64_vset_lane_any): New (*2).
        (vset_lane_f32, vset_lane_f64, vset_lane_p8, vset_lane_p16,
        vset_lane_s8, vset_lane_s16, vset_lane_s32, vset_lane_s64,
        vset_lane_u8, vset_lane_u16, vset_lane_u32, vset_lane_u64,
        vsetq_lane_f32, vsetq_lane_f64, vsetq_lane_p8, vsetq_lane_p16,
        vsetq_lane_s8, vsetq_lane_s16, vsetq_lane_s32, vsetq_lane_s64,
        vsetq_lane_u8, vsetq_lane_u16, vsetq_lane_u32, vsetq_lane_u64):
        Replace inline assembler with __aarch64_vset_lane_any.

OK for trunk?

Alan Lawrence wrote:
> This adds a test that checks the result of a vset_lane intrinsic is identical
> to the input apart from one value being changed.
>
> The test checks only one index per vset_lane_xxx, in a somewhat ad hoc
> fashion, as the index has to be a compile-time immediate and I felt that
> doing a loop using macros did not add enough to justify the complexity.
>
> Passing on aarch64-none-elf and aarch64_be-none-elf (cross-tested).
>
> gcc/testsuite/ChangeLog:
>
> gcc.target/aarch64/vset_lane_1.c: New test.
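
A check of the shape described above might look roughly like the following
hypothetical sketch (this is not the actual vset_lane_1.c from the previous
patch, which may be structured differently):

/* Hypothetical example of a "one lane changed, others untouched" check.  */
#include <arm_neon.h>
#include <stdlib.h>

int
main (void)
{
  /* Lane 0 = 1, lane 1 = 2 (vcreate places the low lane in the low bits).  */
  int32x2_t in = vcreate_s32 (0x0000000200000001ULL);
  /* The lane index must be a compile-time immediate.  */
  int32x2_t out = vset_lane_s32 (42, in, 1);
  /* Exactly one lane changes; the other must be unchanged.  */
  if (vget_lane_s32 (out, 1) != 42 || vget_lane_s32 (out, 0) != 1)
    abort ();
  return 0;
}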
diff --git a/gcc/config/aarch64/arm_neon.h b/gcc/config/aarch64/arm_neon.h
index 0e087a86b3307e36fb2854a2c1d878c12aadff74..a30556d04ff30d6061249037ad016858af182286 100644
--- a/gcc/config/aarch64/arm_neon.h
+++ b/gcc/config/aarch64/arm_neon.h
@@ -673,6 +673,174 @@ typedef struct poly16x8x4_t
 #define __aarch64_vdupq_laneq_u64(__a, __b) \
   __aarch64_vdup_lane_any (u64, q, q, __a, __b)
 
+/* vset_lane internal macro.  */
+
+#ifdef __AARCH64EB__
+/* For big-endian, GCC's vector indices are the opposite way around
+   to the architectural lane indices used by Neon intrinsics.  */
+#define __aarch64_vset_lane_any(__vec, __index, __val, __lanes) \
+  __extension__ \
+  ({ \
+    __builtin_aarch64_im_lane_boundsi (__index, __lanes); \
+    __vec[__lanes - 1 - __index] = __val; \
+    __vec; \
+  })
+#else
+#define __aarch64_vset_lane_any(__vec, __index, __val, __lanes) \
+  __extension__ \
+  ({ \
+    __builtin_aarch64_im_lane_boundsi (__index, __lanes); \
+    __vec[__index] = __val; \
+    __vec; \
+  })
+#endif
+
+/* vset_lane  */
+
+__extension__ static __inline float32x2_t __attribute__ ((__always_inline__))
+vset_lane_f32 (float32_t __elem, float32x2_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 2);
+}
+
+__extension__ static __inline float64x1_t __attribute__ ((__always_inline__))
+vset_lane_f64 (float64_t __elem, float64x1_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 1);
+}
+
+__extension__ static __inline poly8x8_t __attribute__ ((__always_inline__))
+vset_lane_p8 (poly8_t __elem, poly8x8_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 8);
+}
+
+__extension__ static __inline poly16x4_t __attribute__ ((__always_inline__))
+vset_lane_p16 (poly16_t __elem, poly16x4_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 4);
+}
+
+__extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
+vset_lane_s8 (int8_t __elem, int8x8_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 8);
+}
+
+__extension__ static __inline int16x4_t __attribute__ ((__always_inline__))
+vset_lane_s16 (int16_t __elem, int16x4_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 4);
+}
+
+__extension__ static __inline int32x2_t __attribute__ ((__always_inline__))
+vset_lane_s32 (int32_t __elem, int32x2_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 2);
+}
+
+__extension__ static __inline int64x1_t __attribute__ ((__always_inline__))
+vset_lane_s64 (int64_t __elem, int64x1_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 1);
+}
+
+__extension__ static __inline uint8x8_t __attribute__ ((__always_inline__))
+vset_lane_u8 (uint8_t __elem, uint8x8_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 8);
+}
+
+__extension__ static __inline uint16x4_t __attribute__ ((__always_inline__))
+vset_lane_u16 (uint16_t __elem, uint16x4_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 4);
+}
+
+__extension__ static __inline uint32x2_t __attribute__ ((__always_inline__))
+vset_lane_u32 (uint32_t __elem, uint32x2_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 2);
+}
+
+__extension__ static __inline uint64x1_t __attribute__ ((__always_inline__))
+vset_lane_u64 (uint64_t __elem, uint64x1_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 1);
+}
+
+__extension__ static __inline float32x4_t __attribute__ ((__always_inline__))
+vsetq_lane_f32 (float32_t __elem, float32x4_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 4);
+}
+
+__extension__ static __inline float64x2_t __attribute__ ((__always_inline__))
+vsetq_lane_f64 (float64_t __elem, float64x2_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 2);
+}
+
+__extension__ static __inline poly8x16_t __attribute__ ((__always_inline__))
+vsetq_lane_p8 (poly8_t __elem, poly8x16_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 16);
+}
+
+__extension__ static __inline poly16x8_t __attribute__ ((__always_inline__))
+vsetq_lane_p16 (poly16_t __elem, poly16x8_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 8);
+}
+
+__extension__ static __inline int8x16_t __attribute__ ((__always_inline__))
+vsetq_lane_s8 (int8_t __elem, int8x16_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 16);
+}
+
+__extension__ static __inline int16x8_t __attribute__ ((__always_inline__))
+vsetq_lane_s16 (int16_t __elem, int16x8_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 8);
+}
+
+__extension__ static __inline int32x4_t __attribute__ ((__always_inline__))
+vsetq_lane_s32 (int32_t __elem, int32x4_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 4);
+}
+
+__extension__ static __inline int64x2_t __attribute__ ((__always_inline__))
+vsetq_lane_s64 (int64_t __elem, int64x2_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 2);
+}
+
+__extension__ static __inline uint8x16_t __attribute__ ((__always_inline__))
+vsetq_lane_u8 (uint8_t __elem, uint8x16_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 16);
+}
+
+__extension__ static __inline uint16x8_t __attribute__ ((__always_inline__))
+vsetq_lane_u16 (uint16_t __elem, uint16x8_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 8);
+}
+
+__extension__ static __inline uint32x4_t __attribute__ ((__always_inline__))
+vsetq_lane_u32 (uint32_t __elem, uint32x4_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 4);
+}
+
+__extension__ static __inline uint64x2_t __attribute__ ((__always_inline__))
+vsetq_lane_u64 (uint64_t __elem, uint64x2_t __vec, const int __index)
+{
+  return __aarch64_vset_lane_any (__vec, __index, __elem, 2);
+}
+
 /* vadd */
 
 __extension__ static __inline int8x8_t __attribute__ ((__always_inline__))
 vadd_s8 (int8x8_t __a, int8x8_t __b)
@@ -11156,318 +11324,6 @@ vrsubhn_u64 (uint64x2_t a, uint64x2_t b)
   return result;
 }
 
-#define vset_lane_f32(a, b, c) \
-  __extension__ \
-    ({ \
-       float32x2_t b_ = (b); \
-       float32_t a_ = (a); \
-       float32x2_t result; \
-       __asm__ ("ins %0.s[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vset_lane_f64(a, b, c) \
-  __extension__ \
-    ({ \
-       float64x1_t b_ = (b); \
-       float64_t a_ = (a); \
-       float64x1_t result; \
-       __asm__ ("ins %0.d[%3], %x1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vset_lane_p8(a, b, c) \
-  __extension__ \
-    ({ \
-       poly8x8_t b_ = (b); \
-       poly8_t a_ = (a); \
-       poly8x8_t result; \
-       __asm__ ("ins %0.b[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vset_lane_p16(a, b, c) \
-  __extension__ \
-    ({ \
-       poly16x4_t b_ = (b); \
-       poly16_t a_ = (a); \
-       poly16x4_t result; \
-       __asm__ ("ins %0.h[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vset_lane_s8(a, b, c) \
-  __extension__ \
-    ({ \
-       int8x8_t b_ = (b); \
-       int8_t a_ = (a); \
-       int8x8_t result; \
-       __asm__ ("ins %0.b[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vset_lane_s16(a, b, c) \
-  __extension__ \
-    ({ \
-       int16x4_t b_ = (b); \
-       int16_t a_ = (a); \
-       int16x4_t result; \
-       __asm__ ("ins %0.h[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vset_lane_s32(a, b, c) \
-  __extension__ \
-    ({ \
-       int32x2_t b_ = (b); \
-       int32_t a_ = (a); \
-       int32x2_t result; \
-       __asm__ ("ins %0.s[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vset_lane_s64(a, b, c) \
-  __extension__ \
-    ({ \
-       int64x1_t b_ = (b); \
-       int64_t a_ = (a); \
-       int64x1_t result; \
-       __asm__ ("ins %0.d[%3], %x1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vset_lane_u8(a, b, c) \
-  __extension__ \
-    ({ \
-       uint8x8_t b_ = (b); \
-       uint8_t a_ = (a); \
-       uint8x8_t result; \
-       __asm__ ("ins %0.b[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vset_lane_u16(a, b, c) \
-  __extension__ \
-    ({ \
-       uint16x4_t b_ = (b); \
-       uint16_t a_ = (a); \
-       uint16x4_t result; \
-       __asm__ ("ins %0.h[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vset_lane_u32(a, b, c) \
-  __extension__ \
-    ({ \
-       uint32x2_t b_ = (b); \
-       uint32_t a_ = (a); \
-       uint32x2_t result; \
-       __asm__ ("ins %0.s[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vset_lane_u64(a, b, c) \
-  __extension__ \
-    ({ \
-       uint64x1_t b_ = (b); \
-       uint64_t a_ = (a); \
-       uint64x1_t result; \
-       __asm__ ("ins %0.d[%3], %x1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_f32(a, b, c) \
-  __extension__ \
-    ({ \
-       float32x4_t b_ = (b); \
-       float32_t a_ = (a); \
-       float32x4_t result; \
-       __asm__ ("ins %0.s[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_f64(a, b, c) \
-  __extension__ \
-    ({ \
-       float64x2_t b_ = (b); \
-       float64_t a_ = (a); \
-       float64x2_t result; \
-       __asm__ ("ins %0.d[%3], %x1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_p8(a, b, c) \
-  __extension__ \
-    ({ \
-       poly8x16_t b_ = (b); \
-       poly8_t a_ = (a); \
-       poly8x16_t result; \
-       __asm__ ("ins %0.b[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_p16(a, b, c) \
-  __extension__ \
-    ({ \
-       poly16x8_t b_ = (b); \
-       poly16_t a_ = (a); \
-       poly16x8_t result; \
-       __asm__ ("ins %0.h[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_s8(a, b, c) \
-  __extension__ \
-    ({ \
-       int8x16_t b_ = (b); \
-       int8_t a_ = (a); \
-       int8x16_t result; \
-       __asm__ ("ins %0.b[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_s16(a, b, c) \
-  __extension__ \
-    ({ \
-       int16x8_t b_ = (b); \
-       int16_t a_ = (a); \
-       int16x8_t result; \
-       __asm__ ("ins %0.h[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_s32(a, b, c) \
-  __extension__ \
-    ({ \
-       int32x4_t b_ = (b); \
-       int32_t a_ = (a); \
-       int32x4_t result; \
-       __asm__ ("ins %0.s[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_s64(a, b, c) \
-  __extension__ \
-    ({ \
-       int64x2_t b_ = (b); \
-       int64_t a_ = (a); \
-       int64x2_t result; \
-       __asm__ ("ins %0.d[%3], %x1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_u8(a, b, c) \
-  __extension__ \
-    ({ \
-       uint8x16_t b_ = (b); \
-       uint8_t a_ = (a); \
-       uint8x16_t result; \
-       __asm__ ("ins %0.b[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_u16(a, b, c) \
-  __extension__ \
-    ({ \
-       uint16x8_t b_ = (b); \
-       uint16_t a_ = (a); \
-       uint16x8_t result; \
-       __asm__ ("ins %0.h[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_u32(a, b, c) \
-  __extension__ \
-    ({ \
-       uint32x4_t b_ = (b); \
-       uint32_t a_ = (a); \
-       uint32x4_t result; \
-       __asm__ ("ins %0.s[%3], %w1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
-#define vsetq_lane_u64(a, b, c) \
-  __extension__ \
-    ({ \
-       uint64x2_t b_ = (b); \
-       uint64_t a_ = (a); \
-       uint64x2_t result; \
-       __asm__ ("ins %0.d[%3], %x1" \
-                : "=w"(result) \
-                : "r"(a_), "0"(b_), "i"(c) \
-                : /* No clobbers */); \
-       result; \
-     })
-
 #define vshrn_high_n_s16(a, b, c) \
   __extension__ \
     ({ \