[ARM] PR66791: Gate comparison in vca intrinsics on __FAST_MATH__

Message ID CAAgBjMm6nhfME6xVBDZTkHz3ObLvFD5-_mAnWftuM5jixbrYkg@mail.gmail.com
State New

Commit Message

Prathamesh Kulkarni June 22, 2021, 9:34 a.m. UTC
Hi,
The attached patch gates the open-coded abs(__a) cmp abs(__b) form of the
vca intrinsics on __FAST_MATH__, keeping the opaque built-ins otherwise. I
moved the vabs intrinsics before vcage_f32 since the vca intrinsics now use
them.
Bootstrapped+tested on arm-linux-gnueabihf.
OK to commit?

Thanks,
Prathamesh
2021-06-22  Prathamesh Kulkarni  <prathamesh.kulkarni@linaro.org>

	PR target/66791
	* gcc/config/arm/arm_neon.h: Move vabs intrinsics before vcage_f32.
	(vcage_f32): Gate comparison conditionally on __FAST_MATH__.
	(vcageq_f32): Likewise.
	(vcale_f32): Likewise.
	(vcaleq_f32): Likewise.
	(vcagt_f32): Likewise.
	(vcagtq_f32): Likewise.
	(vcalt_f32): Likewise.
	(vcaltq_f32): Likewise.
	(vcage_f16): Likewise.
	(vcageq_f16): Likewise.
	(vcale_f16): Likewise.
	(vcaleq_f16): Likewise.
	(vcagt_f16): Likewise.
	(vcagtq_f16): Likewise.
	(vcalt_f16): Likewise.
	(vcaltq_f16): Likewise.
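
For a concrete picture of what this changes for users, here is a minimal
sketch (the function name, file name and compile flags are illustrative
assumptions, not part of the patch):

/* sketch.c -- compile with something like:
   arm-linux-gnueabihf-gcc -O2 -mfpu=neon -S sketch.c [-ffast-math]
   -ffast-math defines __FAST_MATH__, which selects the open-coded
   branch added by this patch.  */
#include <arm_neon.h>

uint32x2_t
cage (float32x2_t a, float32x2_t b)
{
  /* With __FAST_MATH__ defined this now expands to
     vabs_f32 (a) >= vabs_f32 (b), which the midend can see through;
     otherwise it still calls the opaque __builtin_neon_vcagev2sf
     built-in, preserving the exact instruction semantics.  */
  return vcage_f32 (a, b);
}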

Comments

Prathamesh Kulkarni June 29, 2021, 7:20 a.m. UTC | #1
On Tue, 22 Jun 2021 at 15:04, Prathamesh Kulkarni
<prathamesh.kulkarni@linaro.org> wrote:
>
> Hi,
> The attached patch gates abs(__a) cmp abs(__b) for vca intrinsics on
> __FAST_MATH__. I moved vabs intrinsics before vcage_f32 since vca
> intrinsics use those.
> Bootstrapped+tested on arm-linux-gnueabihf.
> OK to commit ?
ping https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573384.html

Thanks,
Prathamesh
>
> Thanks,
> Prathamesh
Kyrylo Tkachov June 30, 2021, 8:30 a.m. UTC | #2
> -----Original Message-----
> From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
> Sent: 29 June 2021 08:21
> To: gcc Patches <gcc-patches@gcc.gnu.org>; Kyrylo Tkachov
> <Kyrylo.Tkachov@arm.com>
> Subject: Re: [ARM] PR66791: Gate comparison in vca intrinsics on
> __FAST_MATH__
> 
> On Tue, 22 Jun 2021 at 15:04, Prathamesh Kulkarni
> <prathamesh.kulkarni@linaro.org> wrote:
> >
> > Hi,
> > The attached patch gates abs(__a) cmp abs(__b) for vca intrinsics on
> > __FAST_MATH__. I moved vabs intrinsics before vcage_f32 since vca
> > intrinsics use those.
> > Bootstrapped+tested on arm-linux-gnueabihf.
> > OK to commit ?
> ping https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573384.html

Hmm, does this result in better optimisation? I guess it's expressing the operation at a higher level, but there's now conceptually three operations (2xvabs + 1 comparison) that would need to be folded away by the optimisers...

Thanks,
Kyrill
> 
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Prathamesh
Prathamesh Kulkarni June 30, 2021, 9:05 a.m. UTC | #3
On Wed, 30 Jun 2021 at 14:00, Kyrylo Tkachov <Kyrylo.Tkachov@arm.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
> > Sent: 29 June 2021 08:21
> > To: gcc Patches <gcc-patches@gcc.gnu.org>; Kyrylo Tkachov
> > <Kyrylo.Tkachov@arm.com>
> > Subject: Re: [ARM] PR66791: Gate comparison in vca intrinsics on
> > __FAST_MATH__
> >
> > On Tue, 22 Jun 2021 at 15:04, Prathamesh Kulkarni
> > <prathamesh.kulkarni@linaro.org> wrote:
> > >
> > > Hi,
> > > The attached patch gates abs(__a) cmp abs(__b) for vca intrinsics on
> > > __FAST_MATH__. I moved vabs intrinsics before vcage_f32 since vca
> > > intrinsics use those.
> > > Bootstrapped+tested on arm-linux-gnueabihf.
> > > OK to commit ?
> > ping https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573384.html
>
> Hmm, does this result in better optimisation? I guess it's expressing the operation at a higher level, but there's now conceptually three operations (2xvabs + 1 comparison) that would need to be folded away by the optimisers...
Hi Kyrill,
That was my motivation for PR97906 ;-)
With that fix, it now folds c = vabs(a) >= vabs(b) to vacle z, b, a
with __FAST_MATH__ defined.

Thanks,
Prathamesh
>
> Thanks,
> Kyrill
> >
> > Thanks,
> > Prathamesh
> > >
> > > Thanks,
> > > Prathamesh
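
To make the PR97906 point above concrete, a minimal sketch of the fold
described (the function name and flags are illustrative; the expected
vacle output is taken from Prathamesh's message, not from verified
compiler output):

/* fold.c -- arm-linux-gnueabihf-gcc -O2 -mfpu=neon -ffast-math -S fold.c  */
#include <arm_neon.h>

uint32x2_t
fold (float32x2_t a, float32x2_t b)
{
  /* The open-coded form that vcage_f32 expands to under __FAST_MATH__.
     With the PR97906 fix, GCC recognises the abs-compare idiom and
     emits a single vacle of (b, a) instead of two vabs plus a compare.  */
  return (uint32x2_t) (vabs_f32 (a) >= vabs_f32 (b));
}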
Kyrylo Tkachov June 30, 2021, 9:08 a.m. UTC | #4
> -----Original Message-----
> From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
> Sent: 30 June 2021 10:05
> To: Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> Cc: gcc Patches <gcc-patches@gcc.gnu.org>
> Subject: Re: [ARM] PR66791: Gate comparison in vca intrinsics on
> __FAST_MATH__
> 
> On Wed, 30 Jun 2021 at 14:00, Kyrylo Tkachov <Kyrylo.Tkachov@arm.com>
> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Prathamesh Kulkarni <prathamesh.kulkarni@linaro.org>
> > > Sent: 29 June 2021 08:21
> > > To: gcc Patches <gcc-patches@gcc.gnu.org>; Kyrylo Tkachov
> > > <Kyrylo.Tkachov@arm.com>
> > > Subject: Re: [ARM] PR66791: Gate comparison in vca intrinsics on
> > > __FAST_MATH__
> > >
> > > On Tue, 22 Jun 2021 at 15:04, Prathamesh Kulkarni
> > > <prathamesh.kulkarni@linaro.org> wrote:
> > > >
> > > > Hi,
> > > > The attached patch gates abs(__a) cmp abs(__b) for vca intrinsics on
> > > > __FAST_MATH__. I moved vabs intrinsics before vcage_f32 since vca
> > > > intrinsics use those.
> > > > Bootstrapped+tested on arm-linux-gnueabihf.
> > > > OK to commit ?
> > > ping https://gcc.gnu.org/pipermail/gcc-patches/2021-June/573384.html
> >
> > Hmm, does this result in better optimisation? I guess it's expressing the
> operation at a higher level, but there's now conceptually three operations
> (2xvabs + 1 comparison) that would need to be folded away by the
> optimisers...
> Hi Kyrill,
> That was my motivation for PR97906 ;-)
> With that fix, it now folds c = vabs(a) >= vabs(b) to vacle z, b, a
> with __FAST_MATH__ defined.

Ah right. Ok for trunk then.
Kyrill

> 
> Thanks,
> Prathamesh
> >
> > Thanks,
> > Kyrill
> > >
> > > Thanks,
> > > Prathamesh
> > > >
> > > > Thanks,
> > > > Prathamesh

Patch

diff --git a/gcc/config/arm/arm_neon.h b/gcc/config/arm/arm_neon.h
index dcd533fd003..4f81a55234c 100644
--- a/gcc/config/arm/arm_neon.h
+++ b/gcc/config/arm/arm_neon.h
@@ -2859,60 +2859,189 @@  vcltq_u32 (uint32x4_t __a, uint32x4_t __b)
   return (__a < __b);
 }
 
+__extension__ extern __inline int8x8_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vabs_s8 (int8x8_t __a)
+{
+  return (int8x8_t)__builtin_neon_vabsv8qi (__a);
+}
+
+__extension__ extern __inline int16x4_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vabs_s16 (int16x4_t __a)
+{
+  return (int16x4_t)__builtin_neon_vabsv4hi (__a);
+}
+
+__extension__ extern __inline int32x2_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vabs_s32 (int32x2_t __a)
+{
+  return (int32x2_t)__builtin_neon_vabsv2si (__a);
+}
+
+__extension__ extern __inline float32x2_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vabs_f32 (float32x2_t __a)
+{
+  return (float32x2_t)__builtin_neon_vabsv2sf (__a);
+}
+
+__extension__ extern __inline int8x16_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vabsq_s8 (int8x16_t __a)
+{
+  return (int8x16_t)__builtin_neon_vabsv16qi (__a);
+}
+
+__extension__ extern __inline int16x8_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vabsq_s16 (int16x8_t __a)
+{
+  return (int16x8_t)__builtin_neon_vabsv8hi (__a);
+}
+
+__extension__ extern __inline int32x4_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vabsq_s32 (int32x4_t __a)
+{
+  return (int32x4_t)__builtin_neon_vabsv4si (__a);
+}
+
+__extension__ extern __inline float32x4_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vabsq_f32 (float32x4_t __a)
+{
+  return (float32x4_t)__builtin_neon_vabsv4sf (__a);
+}
+
+__extension__ extern __inline int8x8_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vqabs_s8 (int8x8_t __a)
+{
+  return (int8x8_t)__builtin_neon_vqabsv8qi (__a);
+}
+
+__extension__ extern __inline int16x4_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vqabs_s16 (int16x4_t __a)
+{
+  return (int16x4_t)__builtin_neon_vqabsv4hi (__a);
+}
+
+__extension__ extern __inline int32x2_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vqabs_s32 (int32x2_t __a)
+{
+  return (int32x2_t)__builtin_neon_vqabsv2si (__a);
+}
+
+__extension__ extern __inline int8x16_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vqabsq_s8 (int8x16_t __a)
+{
+  return (int8x16_t)__builtin_neon_vqabsv16qi (__a);
+}
+
+__extension__ extern __inline int16x8_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vqabsq_s16 (int16x8_t __a)
+{
+  return (int16x8_t)__builtin_neon_vqabsv8hi (__a);
+}
+
+__extension__ extern __inline int32x4_t
+__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
+vqabsq_s32 (int32x4_t __a)
+{
+  return (int32x4_t)__builtin_neon_vqabsv4si (__a);
+}
 __extension__ extern __inline uint32x2_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcage_f32 (float32x2_t __a, float32x2_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint32x2_t) (vabs_f32 (__a) >= vabs_f32 (__b));
+#else
   return (uint32x2_t)__builtin_neon_vcagev2sf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint32x4_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcageq_f32 (float32x4_t __a, float32x4_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint32x4_t) (vabsq_f32 (__a) >= vabsq_f32 (__b));
+#else
   return (uint32x4_t)__builtin_neon_vcagev4sf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint32x2_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcale_f32 (float32x2_t __a, float32x2_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint32x2_t) (vabs_f32 (__a) <= vabs_f32 (__b));
+#else
   return (uint32x2_t)__builtin_neon_vcagev2sf (__b, __a);
+#endif
 }
 
 __extension__ extern __inline uint32x4_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcaleq_f32 (float32x4_t __a, float32x4_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint32x4_t) (vabsq_f32 (__a) <= vabsq_f32 (__b));
+#else
   return (uint32x4_t)__builtin_neon_vcagev4sf (__b, __a);
+#endif
 }
 
 __extension__ extern __inline uint32x2_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcagt_f32 (float32x2_t __a, float32x2_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint32x2_t) (vabs_f32 (__a) > vabs_f32 (__b));
+#else
   return (uint32x2_t)__builtin_neon_vcagtv2sf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint32x4_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcagtq_f32 (float32x4_t __a, float32x4_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint32x4_t) (vabsq_f32 (__a) > vabsq_f32 (__b));
+#else
   return (uint32x4_t)__builtin_neon_vcagtv4sf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint32x2_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcalt_f32 (float32x2_t __a, float32x2_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint32x2_t) (vabs_f32 (__a) < vabs_f32 (__b));
+#else
   return (uint32x2_t)__builtin_neon_vcagtv2sf (__b, __a);
+#endif
 }
 
 __extension__ extern __inline uint32x4_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcaltq_f32 (float32x4_t __a, float32x4_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint32x4_t) (vabsq_f32 (__a) < vabsq_f32 (__b));
+#else
   return (uint32x4_t)__builtin_neon_vcagtv4sf (__b, __a);
+#endif
 }
 
 __extension__ extern __inline uint8x8_t
@@ -5612,104 +5741,6 @@  vsliq_n_p16 (poly16x8_t __a, poly16x8_t __b, const int __c)
   return (poly16x8_t)__builtin_neon_vsli_nv8hi ((int16x8_t) __a, (int16x8_t) __b, __c);
 }
 
-__extension__ extern __inline int8x8_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vabs_s8 (int8x8_t __a)
-{
-  return (int8x8_t)__builtin_neon_vabsv8qi (__a);
-}
-
-__extension__ extern __inline int16x4_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vabs_s16 (int16x4_t __a)
-{
-  return (int16x4_t)__builtin_neon_vabsv4hi (__a);
-}
-
-__extension__ extern __inline int32x2_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vabs_s32 (int32x2_t __a)
-{
-  return (int32x2_t)__builtin_neon_vabsv2si (__a);
-}
-
-__extension__ extern __inline float32x2_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vabs_f32 (float32x2_t __a)
-{
-  return (float32x2_t)__builtin_neon_vabsv2sf (__a);
-}
-
-__extension__ extern __inline int8x16_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vabsq_s8 (int8x16_t __a)
-{
-  return (int8x16_t)__builtin_neon_vabsv16qi (__a);
-}
-
-__extension__ extern __inline int16x8_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vabsq_s16 (int16x8_t __a)
-{
-  return (int16x8_t)__builtin_neon_vabsv8hi (__a);
-}
-
-__extension__ extern __inline int32x4_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vabsq_s32 (int32x4_t __a)
-{
-  return (int32x4_t)__builtin_neon_vabsv4si (__a);
-}
-
-__extension__ extern __inline float32x4_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vabsq_f32 (float32x4_t __a)
-{
-  return (float32x4_t)__builtin_neon_vabsv4sf (__a);
-}
-
-__extension__ extern __inline int8x8_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vqabs_s8 (int8x8_t __a)
-{
-  return (int8x8_t)__builtin_neon_vqabsv8qi (__a);
-}
-
-__extension__ extern __inline int16x4_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vqabs_s16 (int16x4_t __a)
-{
-  return (int16x4_t)__builtin_neon_vqabsv4hi (__a);
-}
-
-__extension__ extern __inline int32x2_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vqabs_s32 (int32x2_t __a)
-{
-  return (int32x2_t)__builtin_neon_vqabsv2si (__a);
-}
-
-__extension__ extern __inline int8x16_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vqabsq_s8 (int8x16_t __a)
-{
-  return (int8x16_t)__builtin_neon_vqabsv16qi (__a);
-}
-
-__extension__ extern __inline int16x8_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vqabsq_s16 (int16x8_t __a)
-{
-  return (int16x8_t)__builtin_neon_vqabsv8hi (__a);
-}
-
-__extension__ extern __inline int32x4_t
-__attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
-vqabsq_s32 (int32x4_t __a)
-{
-  return (int32x4_t)__builtin_neon_vqabsv4si (__a);
-}
-
 __extension__ extern __inline int8x8_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vneg_s8 (int8x8_t __a)
@@ -17139,56 +17170,88 @@  __extension__ extern __inline uint16x4_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcage_f16 (float16x4_t __a, float16x4_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint16x4_t) (vabs_f16 (__a) >= vabs_f16 (__b));
+#else
   return (uint16x4_t)__builtin_neon_vcagev4hf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint16x8_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcageq_f16 (float16x8_t __a, float16x8_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint16x8_t) (vabsq_f16 (__a) >= vabsq_f16 (__b));
+#else
   return (uint16x8_t)__builtin_neon_vcagev8hf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint16x4_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcagt_f16 (float16x4_t __a, float16x4_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint16x4_t) (vabs_f16 (__a) > vabs_f16 (__b));
+#else
   return (uint16x4_t)__builtin_neon_vcagtv4hf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint16x8_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcagtq_f16 (float16x8_t __a, float16x8_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint16x8_t) (vabsq_f16 (__a) > vabsq_f16 (__b));
+#else
   return (uint16x8_t)__builtin_neon_vcagtv8hf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint16x4_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcale_f16 (float16x4_t __a, float16x4_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint16x4_t) (vabs_f16 (__a) <= vabs_f16 (__b));
+#else
   return (uint16x4_t)__builtin_neon_vcalev4hf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint16x8_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcaleq_f16 (float16x8_t __a, float16x8_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint16x8_t) (vabsq_f16 (__a) <= vabsq_f16 (__b));
+#else
   return (uint16x8_t)__builtin_neon_vcalev8hf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint16x4_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcalt_f16 (float16x4_t __a, float16x4_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint16x4_t) (vabs_f16 (__a) < vabs_f16 (__b));
+#else
   return (uint16x4_t)__builtin_neon_vcaltv4hf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint16x8_t
 __attribute__  ((__always_inline__, __gnu_inline__, __artificial__))
 vcaltq_f16 (float16x8_t __a, float16x8_t __b)
 {
+#ifdef __FAST_MATH__
+  return (uint16x8_t) (vabsq_f16 (__a) < vabsq_f16 (__b));
+#else
   return (uint16x8_t)__builtin_neon_vcaltv8hf (__a, __b);
+#endif
 }
 
 __extension__ extern __inline uint16x4_t