Message ID: alpine.DEB.2.02.1404112137530.19663@stedding.saclay.inria.fr
State: New
Ping http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00590.html

(note that ARM seems to be doing the same thing for their neon intrinsics,
see Ramana's patch series posted today)

On Fri, 11 Apr 2014, Marc Glisse wrote:

> Hello,
>
> the previous discussion on the topic was before we added all those #pragma
> target in *mmintrin.h:
>
> http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html
>
> I believe that removes a large part of the arguments against it. Note that I
> only did a few of the more obvious intrinsics, I am waiting to see if this
> patch is accepted before doing more.
>
> Bootstrap+testsuite on x86_64-linux-gnu.
>
> 2014-04-11  Marc Glisse  <marc.glisse@inria.fr>
>
> 	* config/i386/xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps,
> 	_mm_div_ps, _mm_store_ss, _mm_cvtss_f32): Use vector extensions
> 	instead of builtins.
> 	* config/i386/emmintrin.h (_mm_store_sd, _mm_cvtsd_f64, _mm_storeh_pd,
> 	_mm_cvtsi128_si64, _mm_cvtsi128_si64x, _mm_add_pd, _mm_sub_pd,
> 	_mm_mul_pd, _mm_div_pd, _mm_storel_epi64, _mm_movepi64_pi64,
> 	_mm_loadh_pd, _mm_loadl_pd): Likewise.
> 	(_mm_sqrt_sd): Fix comment.
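For readers without the headers at hand, the #pragma target guards Marc refers to wrap every definition in *mmintrin.h so that it is compiled as if the corresponding -m option were in effect, even in a translation unit built without it. The sketch below paraphrases the 4.9-era emmintrin.h from memory; the exact preprocessor spelling may differ slightly between releases.

/* Guard at the top of emmintrin.h: everything between push and pop is
   compiled with SSE2 enabled, so a vector-extension body still has the
   SSE2 instructions available.  */
#ifndef __SSE2__
#pragma GCC push_options
#pragma GCC target("sse2")
#define __DISABLE_SSE2__
#endif

/* ... intrinsic definitions, e.g. the proposed
   _mm_add_pd returning __A + __B, go here ...  */

#ifdef __DISABLE_SSE2__
#undef __DISABLE_SSE2__
#pragma GCC pop_options
#endif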
Ping

On Mon, 28 Apr 2014, Marc Glisse wrote:

> Ping
> http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00590.html
>
> (note that ARM seems to be doing the same thing for their neon intrinsics,
> see Ramana's patch series posted today)
>
> On Fri, 11 Apr 2014, Marc Glisse wrote:
>
>> Hello,
>>
>> the previous discussion on the topic was before we added all those #pragma
>> target in *mmintrin.h:
>>
>> http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html
>>
>> I believe that removes a large part of the arguments against it. Note that
>> I only did a few of the more obvious intrinsics, I am waiting to see if
>> this patch is accepted before doing more.
>>
>> Bootstrap+testsuite on x86_64-linux-gnu.
>>
>> 2014-04-11  Marc Glisse  <marc.glisse@inria.fr>
>>
>> 	* config/i386/xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps,
>> 	_mm_div_ps, _mm_store_ss, _mm_cvtss_f32): Use vector extensions
>> 	instead of builtins.
>> 	* config/i386/emmintrin.h (_mm_store_sd, _mm_cvtsd_f64, _mm_storeh_pd,
>> 	_mm_cvtsi128_si64, _mm_cvtsi128_si64x, _mm_add_pd, _mm_sub_pd,
>> 	_mm_mul_pd, _mm_div_pd, _mm_storel_epi64, _mm_movepi64_pi64,
>> 	_mm_loadh_pd, _mm_loadl_pd): Likewise.
>> 	(_mm_sqrt_sd): Fix comment.
Ping, nobody has an opinion on this? Or some explanation why I am mistaken
to believe that #pragma target makes it safer now?

It would enable a number of optimizations, like constant propagation, FMA
contraction, etc. It would also allow us to remove several builtins.

On Sat, 17 May 2014, Marc Glisse wrote:

> Ping
>
> On Mon, 28 Apr 2014, Marc Glisse wrote:
>
>> Ping
>> http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00590.html
>>
>> (note that ARM seems to be doing the same thing for their neon intrinsics,
>> see Ramana's patch series posted today)
>>
>> On Fri, 11 Apr 2014, Marc Glisse wrote:
>>
>>> Hello,
>>>
>>> the previous discussion on the topic was before we added all those #pragma
>>> target in *mmintrin.h:
>>>
>>> http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html
>>>
>>> I believe that removes a large part of the arguments against it. Note that
>>> I only did a few of the more obvious intrinsics, I am waiting to see if
>>> this patch is accepted before doing more.
>>>
>>> Bootstrap+testsuite on x86_64-linux-gnu.
>>>
>>> 2014-04-11  Marc Glisse  <marc.glisse@inria.fr>
>>>
>>> 	* config/i386/xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps,
>>> 	_mm_div_ps, _mm_store_ss, _mm_cvtss_f32): Use vector extensions
>>> 	instead of builtins.
>>> 	* config/i386/emmintrin.h (_mm_store_sd, _mm_cvtsd_f64, _mm_storeh_pd,
>>> 	_mm_cvtsi128_si64, _mm_cvtsi128_si64x, _mm_add_pd, _mm_sub_pd,
>>> 	_mm_mul_pd, _mm_div_pd, _mm_storel_epi64, _mm_movepi64_pi64,
>>> 	_mm_loadh_pd, _mm_loadl_pd): Likewise.
>>> 	(_mm_sqrt_sd): Fix comment.
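A minimal illustration of the kind of optimization mentioned above (my own example, not part of the patch; the helper name muladd_ps is invented): once _mm_mul_ps and _mm_add_ps expand to plain * and + on vector types, the multiply-add is visible to the GIMPLE optimizers, so constant operands can be folded and, with -mfma -ffp-contract=fast, the two operations may be contracted into a single FMA. With the builtin implementation the same calls stay opaque until RTL expansion.

#include <xmmintrin.h>

/* With the patched header this body is just (a * b) + c on vector
   types, so constant propagation and FMA contraction can apply.  */
static inline __m128
muladd_ps (__m128 a, __m128 b, __m128 c)
{
  return _mm_add_ps (_mm_mul_ps (a, b), c);
}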
On Sat, Jun 28, 2014 at 6:42 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
> Ping,
>
> nobody has an opinion on this? Or some explanation why I am mistaken to
> believe that #pragma target makes it safer now?
>
> It would enable a number of optimizations, like constant propagation, FMA
> contraction, etc. It would also allow us to remove several builtins.

I see no problem with using the array-type access to the registers.

As for replacing the builtins with arithmetic operators: I appreciate the
possibility for optimization. But is there any chance the calls could end up
not being implemented with a vector instruction? I think that would be bad.
The intrinsics should be a way to guarantee that the programmer can create
vector instructions. Otherwise we might just as well not support them.
On Sat, 28 Jun 2014, Ulrich Drepper wrote:

> On Sat, Jun 28, 2014 at 6:42 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
>> Ping,
>>
>> nobody has an opinion on this? Or some explanation why I am mistaken to
>> believe that #pragma target makes it safer now?
>>
>> It would enable a number of optimizations, like constant propagation, FMA
>> contraction, etc. It would also allow us to remove several builtins.
>
> I see no problem with using the array-type access to the registers.
>
> As for replacing the builtins with arithmetic operators: I appreciate the
> possibility for optimization. But is there any chance the calls could end
> up not being implemented with a vector instruction? I think that would be
> bad. The intrinsics should be a way to guarantee that the programmer can
> create vector instructions. Otherwise we might just as well not support
> them.

There is always a risk, but then even with builtins I think there was a
small risk that an RTL optimization would mess things up. It is indeed
higher if we expose the operation to the optimizers earlier, but it would be
a bug if an "optimization" replaced a vector operation by something worse.
Also, I am only proposing to handle the most trivial operations this way,
not more complicated ones (like v[0] += s) where we would be likely to fail
to generate the right instruction. And the pragma should ensure that the
function will always be compiled in a mode where the vector instruction is
available.

ARM did the same and I don't think I have seen a bug report about a
regression from it (I haven't really looked, though).

Thanks,
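To make the v[0] += s remark concrete, the sketch below (my own, not code from the patch; the helper name add_sd_elementwise is invented) shows the kind of elementwise rewrite that is deliberately excluded: there is no guarantee that GCC turns it into a single addsd rather than an extract/add/insert sequence, which is why only whole-vector operations like _mm_add_pd are converted.

#include <emmintrin.h>

/* Hypothetical elementwise spelling of _mm_add_sd: only element 0 is
   modified.  Emitting exactly one addsd for this is not guaranteed, so
   the patch keeps the builtin for the _sd/_ss forms.  */
static inline __m128d
add_sd_elementwise (__m128d a, __m128d b)
{
  a[0] = a[0] + b[0];   /* the "v[0] += s" shape Marc refers to */
  return a;
}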
On Sat, Jun 28, 2014 at 6:53 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
> There is always a risk, but then even with builtins I think there was a
> small risk that an RTL optimization would mess things up. It is indeed
> higher if we expose the operation to the optimizers earlier, but it would
> be a bug if an "optimization" replaced a vector operation by something
> worse. Also, I am only proposing to handle the most trivial operations
> this way, not more complicated ones (like v[0] += s) where we would be
> likely to fail to generate the right instruction. And the pragma should
> ensure that the function will always be compiled in a mode where the
> vector instruction is available.
>
> ARM did the same and I don't think I have seen a bug report about a
> regression from it (I haven't really looked, though).

I think the ARM definitions come from a different angle. It's new,
there is no assumed semantics. For the x86 intrinsics Intel defines that
_mm_xxx() generates one of a given set of opcodes if there is a match.
If I want to generate a specific code sequence I use the intrinsics.
Otherwise I could already today use the vector type semantics myself.

Don't get me wrong, I like the idea of having the optimization of the
intrinsics happen. But perhaps not unconditionally, or at least not
without a way to prevent it.

I know this will look ugly, but how about a macro
__GCC_X86_HONOR_INTRINSICS to enable the current code, and have your
proposed use of the vector arithmetic in place by default? This wouldn't
allow removing support for the built-ins, but it would also open the door
to some more risky optimizations being enabled by default.
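A sketch of what that suggestion could look like in the header. __GCC_X86_HONOR_INTRINSICS is only the name proposed in the message above, not something GCC actually defines, and the layout is illustrative rather than a patch.

extern __inline __m128d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
#ifdef __GCC_X86_HONOR_INTRINSICS
  /* Opt-in legacy behaviour: keep the opaque builtin so the addpd
     instruction is requested directly.  */
  return (__m128d) __builtin_ia32_addpd ((__v2df)__A, (__v2df)__B);
#else
  /* Default: expose the addition to the optimizers.  */
  return __A + __B;
#endif
}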
On Sun, 29 Jun 2014, Ulrich Drepper wrote:

> I think the ARM definitions come from a different angle. It's new,
> there is no assumed semantics.

Is it that new? I thought it was implemented based on a rather precise
specification by ARM. Again, I don't really know ARM.

> For the x86 intrinsics Intel defines that _mm_xxx() generates one of a
> given set of opcodes if there is a match. If I want to generate a specific
> code sequence I use the intrinsics.

We already sometimes generate a different instruction than the name of the
intrinsic suggests, or combine consecutive intrinsics into something else.
I use inline asm when I want a specific code sequence.

> Otherwise I could already today use the vector type semantics myself.

Well, the main reasons I use the intrinsics are:
1) the code compiles with Visual Studio;
2) to use the esoteric instructions (anything without a trivial mapping in C).

> Don't get me wrong, I like the idea of having the optimization of the
> intrinsics happen. But perhaps not unconditionally, or at least not
> without a way to prevent it.
>
> I know this will look ugly, but how about a macro
> __GCC_X86_HONOR_INTRINSICS to enable the current code, and have your
> proposed use of the vector arithmetic in place by default? This wouldn't
> allow removing support for the built-ins, but it would also open the door
> to some more risky optimizations being enabled by default.

That's a pretty big drawback. Instead of simplifying the implementation, it
makes it more complicated. We also have to document the macro, update the
testsuite so it tests the intrinsics in both modes, etc.

I understand the concern, and I would probably implement
__GCC_X86_HONOR_INTRINSICS (though the testsuite part scares me, as I have
so little understanding of how it works, so I may need help), but I'd like
to make sure first that the simpler approach is not acceptable, possibly
with strong constraints on which operations are ok (_mm_load[hl]_pd could be
removed from the patch, for instance).

As another comparison, clang's version of *intrin.h uses the vector
extensions much more than I am proposing.
Hello Marc,

On 28 Jun 12:42, Marc Glisse wrote:
> It would enable a number of optimizations, like constant
> propagation, FMA contraction, etc. It would also allow us to remove
> several builtins.

This should be the main motivation for replacing built-ins. But IMHO this
approach should be used only for the `obvious' cases, i.e. + - / * and
friends. I think it should not apply to shuffles or broadcasts. But we have
to define the border between the `obvious' intrinsics and the rest.

On the other hand, an intrinsic updated in this way may actually generate
a different instruction than intended (e.g. the FMA case). For ICC it is
generally OK to generate different instructions; only the semantics must be
obeyed.

--
Thanks, K
On Thu, 3 Jul 2014, Kirill Yukhin wrote:

> Hello Marc,
> On 28 Jun 12:42, Marc Glisse wrote:
>> It would enable a number of optimizations, like constant
>> propagation, FMA contraction, etc. It would also allow us to remove
>> several builtins.
> This should be the main motivation for replacing built-ins. But IMHO this
> approach should be used only for the `obvious' cases, i.e. + - / * and
> friends. I think it should not apply to shuffles or broadcasts. But we
> have to define the border between the `obvious' intrinsics and the rest.

We don't have a syntax in the front-end for broadcasts anyway, but are you
sure about shuffles? __builtin_shuffle directly translates to VEC_PERM_EXPR,
on which we are careful to avoid optimizations like combining 2 shuffles
unless the result is the identity. And expanding shuffles that can be done
in a single instruction works well.

But I am happy not doing them yet. To be very specific, could you list which
intrinsics you would like to remove from the posted patch?

> On the other hand, an intrinsic updated in this way may actually generate
> a different instruction than intended (e.g. the FMA case).

It is the same with scalars; we have -ffp-contract for that.

> For ICC it is generally OK to generate different instructions; only the
> semantics must be obeyed.
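For readers unfamiliar with it, __builtin_shuffle is the GNU C spelling of VEC_PERM_EXPR. A shuffle-style intrinsic could in principle be written with it, as in the sketch below; this is my illustration of the point above, not part of the posted patch (which deliberately stops at + - * / and element accesses), and the names unpacklo_pd_shuffle and __v2di_mask are invented.

#include <emmintrin.h>

typedef long long __v2di_mask __attribute__ ((__vector_size__ (16)));

/* Hypothetical vector-extension spelling of _mm_unpacklo_pd: in the
   two-operand form, indices 0..1 select from __A and 2..3 from __B, so
   { 0, 2 } interleaves the low halves.  The front end turns this into
   a VEC_PERM_EXPR with a constant mask, which GCC can expand to a
   single unpcklpd.  */
static inline __m128d
unpacklo_pd_shuffle (__m128d __A, __m128d __B)
{
  return __builtin_shuffle (__A, __B, (__v2di_mask) { 0, 2 });
}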
Hello Marc,

On 04 Jul 21:11, Marc Glisse wrote:
> On Thu, 3 Jul 2014, Kirill Yukhin wrote:
> like combining 2 shuffles unless the result is the identity. And
> expanding shuffles that can be done in a single instruction works
> well.
>
> But I am happy not doing them yet. To be very specific, could you
> list which intrinsics you would like to remove from the posted
> patch?

I am not an x86 maintainer; however, since such replacements produce correct
semantics and probably enable optimizations, I support your patch.

Perhaps you could try your approach on AVX2 and AVX-512, whose intrinsics
are well covered by tests?

>> On the other hand, an intrinsic updated in this way may actually generate
>> a different instruction than intended (e.g. the FMA case).
>
> It is the same with scalars; we have -ffp-contract for that.

Agreed.

--
Thanks, K
On Tue, Jul 08, 2014 at 03:14:04PM +0400, Kirill Yukhin wrote:
>>> On the other hand, an intrinsic updated in this way may actually generate
>>> a different instruction than intended (e.g. the FMA case).
>>
>> It is the same with scalars; we have -ffp-contract for that.
> Agreed.

I don't think we actually always guarantee using the particular instructions
for the intrinsics even when they are implemented using builtins, at least
if they don't use UNSPECs; e.g. if the combiner or peephole2 manages to
combine something into some other insn, we'll happily do that.

	Jakub
On Jul 8, 2014, at 4:17 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Jul 08, 2014 at 03:14:04PM +0400, Kirill Yukhin wrote:
>>>> On the other hand, an intrinsic updated in this way may actually
>>>> generate a different instruction than intended (e.g. the FMA case).
>>>
>>> It is the same with scalars; we have -ffp-contract for that.
>> Agreed.
>
> I don't think we actually always guarantee using the particular
> instructions for the intrinsics even when they are implemented using
> builtins, at least if they don't use UNSPECs; e.g. if the combiner or
> peephole2 manages to combine something into some other insn, we'll
> happily do that.

In a testcase, one is free to hide the inputs and the output from the
optimizer using standard tricks and take one step closer to having a 1-1
mapping. Of course, whether or not the port even offers a 1-1 mapping for
any particular builtin is completely dependent upon the port.
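One common form of those "standard tricks" (an illustration, not a quote from the GCC testsuite; the function name test_mm_add_pd is invented): an empty asm with an SSE-register "+x" constraint keeps the operands and result live, so the compiler cannot fold or combine the operation away and a scan-assembler test can still insist on the expected instruction.

#include <emmintrin.h>

/* Illustrative testcase shape: the empty asms hide the values from
   the optimizer, so the vector addition itself has to be emitted and
   can be matched by a scan-assembler directive such as "addpd".  */
__m128d
test_mm_add_pd (__m128d a, __m128d b)
{
  __asm__ ("" : "+x" (a), "+x" (b));  /* hide the inputs */
  __m128d r = _mm_add_pd (a, b);
  __asm__ ("" : "+x" (r));            /* hide the result */
  return r;
}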
Index: gcc/config/i386/emmintrin.h
===================================================================
--- gcc/config/i386/emmintrin.h	(revision 209323)
+++ gcc/config/i386/emmintrin.h	(working copy)
@@ -161,40 +161,40 @@ _mm_store_pd (double *__P, __m128d __A)
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_storeu_pd (double *__P, __m128d __A)
 {
   __builtin_ia32_storeupd (__P, __A);
 }
 
 /* Stores the lower DPFP value.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_store_sd (double *__P, __m128d __A)
 {
-  *__P = __builtin_ia32_vec_ext_v2df (__A, 0);
+  *__P = __A[0];
 }
 
 extern __inline double __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_cvtsd_f64 (__m128d __A)
 {
-  return __builtin_ia32_vec_ext_v2df (__A, 0);
+  return __A[0];
 }
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_storel_pd (double *__P, __m128d __A)
 {
   _mm_store_sd (__P, __A);
 }
 
 /* Stores the upper DPFP value.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_storeh_pd (double *__P, __m128d __A)
 {
-  *__P = __builtin_ia32_vec_ext_v2df (__A, 1);
+  *__P = __A[1];
 }
 
 /* Store the lower DPFP value across two words.
    The address must be 16-byte aligned.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_store1_pd (double *__P, __m128d __A)
 {
   _mm_store_pd (__P, __builtin_ia32_shufpd (__A, __A, _MM_SHUFFLE2 (0,0)));
 }
@@ -215,86 +215,86 @@ extern __inline int __attribute__((__gnu
 _mm_cvtsi128_si32 (__m128i __A)
 {
   return __builtin_ia32_vec_ext_v4si ((__v4si)__A, 0);
 }
 
 #ifdef __x86_64__
 /* Intel intrinsic.  */
 extern __inline long long __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_cvtsi128_si64 (__m128i __A)
 {
-  return __builtin_ia32_vec_ext_v2di ((__v2di)__A, 0);
+  return __A[0];
 }
 
 /* Microsoft intrinsic.  */
 extern __inline long long __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_cvtsi128_si64x (__m128i __A)
 {
-  return __builtin_ia32_vec_ext_v2di ((__v2di)__A, 0);
+  return __A[0];
 }
 #endif
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_addpd ((__v2df)__A, (__v2df)__B);
+  return __A + __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_subpd ((__v2df)__A, (__v2df)__B);
+  return __A - __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_subsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_mulpd ((__v2df)__A, (__v2df)__B);
+  return __A * __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_mulsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_divpd ((__v2df)__A, (__v2df)__B);
+  return __A / __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_divsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sqrt_pd (__m128d __A)
 {
   return (__m128d)__builtin_ia32_sqrtpd ((__v2df)__A);
 }
 
-/* Return pair {sqrt (A[0), B[1]}.  */
+/* Return pair {sqrt (B[0]), A[1]}.  */
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sqrt_sd (__m128d __A, __m128d __B)
 {
   __v2df __tmp = __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
   return (__m128d)__builtin_ia32_sqrtsd ((__v2df)__tmp);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_min_pd (__m128d __A, __m128d __B)
 {
@@ -708,27 +708,27 @@ _mm_store_si128 (__m128i *__P, __m128i _
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_storeu_si128 (__m128i *__P, __m128i __B)
 {
   __builtin_ia32_storedqu ((char *)__P, (__v16qi)__B);
 }
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_storel_epi64 (__m128i *__P, __m128i __B)
 {
-  *(long long *)__P = __builtin_ia32_vec_ext_v2di ((__v2di)__B, 0);
+  *(long long *)__P = __B[0];
 }
 
 extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_movepi64_pi64 (__m128i __B)
 {
-  return (__m64) __builtin_ia32_vec_ext_v2di ((__v2di)__B, 0);
+  return (__m64) __B[0];
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_movpi64_epi64 (__m64 __A)
 {
   return _mm_set_epi64 ((__m64)0LL, __A);
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_move_epi64 (__m128i __A)
@@ -915,27 +915,27 @@ _mm_unpackhi_pd (__m128d __A, __m128d __
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_unpacklo_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_unpcklpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_loadh_pd (__m128d __A, double const *__B)
 {
-  return (__m128d)__builtin_ia32_loadhpd ((__v2df)__A, __B);
+  return __extension__ (__m128d){ __A[0], __B[0] };
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_loadl_pd (__m128d __A, double const *__B)
 {
-  return (__m128d)__builtin_ia32_loadlpd ((__v2df)__A, __B);
+  return __extension__ (__m128d){ __B[0], __A[1] };
 }
 
 extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_movemask_pd (__m128d __A)
 {
   return __builtin_ia32_movmskpd ((__v2df)__A);
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_packs_epi16 (__m128i __A, __m128i __B)
Index: gcc/config/i386/xmmintrin.h
===================================================================
--- gcc/config/i386/xmmintrin.h	(revision 209323)
+++ gcc/config/i386/xmmintrin.h	(working copy)
@@ -173,39 +173,39 @@ extern __inline __m128 __attribute__((__
 _mm_max_ss (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_maxss ((__v4sf)__A, (__v4sf)__B);
 }
 
 /* Perform the respective operation on the four SPFP values in A and B.  */
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
+  return __A + __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_subps ((__v4sf)__A, (__v4sf)__B);
+  return __A - __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_mulps ((__v4sf)__A, (__v4sf)__B);
+  return __A * __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_divps ((__v4sf)__A, (__v4sf)__B);
+  return __A / __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sqrt_ps (__m128 __A)
 {
   return (__m128) __builtin_ia32_sqrtps ((__v4sf)__A);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_rcp_ps (__m128 __A)
@@ -950,27 +950,27 @@ _mm_set_ps (const float __Z, const float
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_setr_ps (float __Z, float __Y, float __X, float __W)
 {
   return __extension__ (__m128)(__v4sf){ __Z, __Y, __X, __W };
 }
 
 /* Stores the lower SPFP value.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_store_ss (float *__P, __m128 __A)
 {
-  *__P = __builtin_ia32_vec_ext_v4sf ((__v4sf)__A, 0);
+  *__P = __A[0];
 }
 
 extern __inline float __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_cvtss_f32 (__m128 __A)
 {
-  return __builtin_ia32_vec_ext_v4sf ((__v4sf)__A, 0);
+  return __A[0];
 }
 
 /* Store four SPFP values.  The address must be 16-byte aligned.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_store_ps (float *__P, __m128 __A)
 {
   *(__v4sf *)__P = (__v4sf)__A;
 }
 
 /* Store four SPFP values.  The address need not be 16-byte aligned.  */