Message ID: alpine.DEB.2.02.1212041715370.22183@stedding.saclay.inria.fr
State: New
On Tue, Dec 4, 2012 at 5:28 PM, Marc Glisse <marc.glisse@inria.fr> wrote: > On Tue, 4 Dec 2012, Marc Glisse wrote: > >> Do you agree? > > > Like this ? (only tested on the new testcases, and then I'd need to ask Eric > his opinion) > > 2012-12-04 Marc Glisse <marc.glisse@inria.fr> > > PR target/54855 > gcc/ > * simplify-rtx.c (simplify_binary_operation_1) <VEC_CONCAT>: Replace > with VEC_MERGE. > > * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite > pattern. > * config/i386/i386-builtin-types.def: New function types. > * config/i386/i386.c (ix86_expand_args_builtin): Likewise. > (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss, > __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype. > * config/i386/xmmintrin.h: Adapt to new builtin prototype. > * config/i386/emmintrin.h: Likewise. > * doc/extend.texi (X86 Built-in Functions): Document changed > prototype. > > > testsuite/ > * gcc.target/i386/pr54855-1.c: New testcase. > * gcc.target/i386/pr54855-2.c: New testcase. Yes, the approach taken in this patch looks really good to me. There should be no code differences with your patch, but let's ask HJ for his opinion on intrinsics header changes. A little nit below: > + rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0); > + rtx newop1 = XEXP (trueop1, 0); > + return simplify_gen_ternary (VEC_MERGE, mode, GET_MODE (newop0), > + newop0, newop1, GEN_INT (1)); You can use const1_rtx here. Thanks, Uros.
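[For reference, a sketch of that fragment with the nit applied (untested; only the last argument changes, since const1_rtx is the shared CONST_INT for 1):

    rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0);
    rtx newop1 = XEXP (trueop1, 0);
    /* Use the shared rtx for the constant 1 instead of allocating a
       new CONST_INT with GEN_INT (1).  */
    return simplify_gen_ternary (VEC_MERGE, mode, GET_MODE (newop0),
                                 newop0, newop1, const1_rtx);
]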
On Tue, Dec 4, 2012 at 10:06 AM, Uros Bizjak <ubizjak@gmail.com> wrote: > On Tue, Dec 4, 2012 at 5:28 PM, Marc Glisse <marc.glisse@inria.fr> wrote: >> On Tue, 4 Dec 2012, Marc Glisse wrote: >> >>> Do you agree? >> >> >> Like this ? (only tested on the new testcases, and then I'd need to ask Eric >> his opinion) >> >> 2012-12-04 Marc Glisse <marc.glisse@inria.fr> >> >> PR target/54855 >> gcc/ >> * simplify-rtx.c (simplify_binary_operation_1) <VEC_CONCAT>: Replace >> with VEC_MERGE. >> >> * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite >> pattern. >> * config/i386/i386-builtin-types.def: New function types. >> * config/i386/i386.c (ix86_expand_args_builtin): Likewise. >> (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss, >> __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype. >> * config/i386/xmmintrin.h: Adapt to new builtin prototype. >> * config/i386/emmintrin.h: Likewise. >> * doc/extend.texi (X86 Built-in Functions): Document changed >> prototype. >> >> >> testsuite/ >> * gcc.target/i386/pr54855-1.c: New testcase. >> * gcc.target/i386/pr54855-2.c: New testcase. > > Yes, the approach taken in this patch looks really good to me. There > should be no code differences with your patch, but let's ask HJ for > his opinion on intrinsics header changes. Hi Kirill, Can you take a look? Thanks. > A little nit below: > >> + rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0); >> + rtx newop1 = XEXP (trueop1, 0); >> + return simplify_gen_ternary (VEC_MERGE, mode, GET_MODE (newop0), >> + newop0, newop1, GEN_INT (1)); > > You can use const1_rtx here. > > Thanks, > Uros.
>> Yes, the approach taken in this patch looks really good to me. There
>> should be no code differences with your patch, but let's ask HJ for
>> his opinion on intrinsics header changes.
>
> Hi Kirill,
>
> Can you take a look? Thanks.

Hi guys, I like the changes to the intrinsics headers.

Thanks, K
Hi guys,
Could I ask several questions just to clarify things?

1) Does the root problem lie in the fact that even for scalar additions we perform the addition on the whole vector and only then drop the higher parts of the vector? That is, to fix the test from the PR, do we need to replace the plus on vector mode with a plus on scalar mode?

2) Is one of the main requirements having the same pattern for the V4SF and V2DF versions?

3) I don't see vec_concat in the patterns from your patches; is it explicitly generated by some x86 expander?

Anyway, I really like the idea of having some uniformity in describing patterns for scalar instructions, so thank you for the work!

On 6 December 2012 17:42, Kirill Yukhin <kirill.yukhin@gmail.com> wrote:
>>> Yes, the approach taken in this patch looks really good to me. There
>>> should be no code differences with your patch, but let's ask HJ for
>>> his opinion on intrinsics header changes.
>>
>> Hi Kirill,
>>
>> Can you take a look? Thanks.
>
> Hi guys, I like changes in intrinsics header.
>
> Thanks, K
On Fri, Dec 7, 2012 at 7:49 AM, Michael Zolotukhin <michael.v.zolotukhin@gmail.com> wrote:
> Hi guys,
> Could I ask several questions just to clarify the things up?
>
> 1) Does the root problem lay in the fact that even for scalar
> additions we perform the addition on the whole vector and only then
> drop the higher parts of the vector? I.e. to fix the test from the PR
> we need to replace plus on vector mode with plus on scalar mode?

Yes, the existing pattern is used to implement the intrinsics, and it is modelled with a vector operand 2. But we in fact emit a scalar operation, so we would like to model the pattern with a scalar operand 2. This way, the same pattern can be used to emit the intrinsics _and_ to optimize the code from the testcase at the same time. Also, please note that the alignment requirements for vector and scalar operands are different.

> 2) Is one of the main requirements having the same pattern for V4SF
> and V2DF version?

It is not required, but having a macroized pattern avoids pattern explosion and eases maintenance (it is easier to understand similar functionality when it is described in a uniform way), and in some cases macroization opportunities force the author to rethink the RTL description, making the patterns more "universal".

Uros.
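[To make the dual use concrete, both of the following should go through the same vm pattern and compile to a single addsd. This is only an illustrative sketch: the function names are invented, and the second function is adapted from the new pr54855-1.c testcase.

    #include <emmintrin.h>

    /* The intrinsic: operand 2 of the builtin is now a plain double.  */
    __m128d via_intrinsic (__m128d a, __m128d b)
    {
      return _mm_add_sd (a, b);
    }

    /* The code from the PR: a scalar add into element 0 of a vector.  */
    typedef double vec __attribute__ ((vector_size (16)));

    vec via_subscript (vec x)
    {
      x[0] += 2;
      return x;
    }
]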
On Fri, 7 Dec 2012, Michael Zolotukhin wrote:
> 1) Does the root problem lay in the fact that even for scalar
> additions we perform the addition on the whole vector and only then
> drop the higher parts of the vector? I.e. to fix the test from the PR
> we need to replace plus on vector mode with plus on scalar mode?

The root problem is that we model the subs[sd] instructions as taking a 128-bit second operand, when Intel's documentation says they take a 32/64-bit operand, which is an important difference for memory operands (and constants). Writing a pattern that reconstructs the result from a scalar operation also seems more natural than pretending we are doing a parallel operation and dropping most of it (easier for recog and friends).

(Note: I think the insn was written to support the intrinsic, which does take a 128-bit argument, so it did a good job for that.)

> 2) Is one of the main requirements having the same pattern for V4SF
> and V2DF version?

Uros seems to think that would be best.

> 3) I don't see vec_concat in patterns from your patches, is it
> explicitly generated by some x86-expander?

It is generated by ix86_expand_vector_set.

> Anyway, I really like the idea of having some uniformity in describing
> patterns for scalar instructions, so thank you for the work!

For 2-element vectors, vec_concat does seem more natural than vec_merge. If we choose vec_merge as the canonical representation, we should choose it for setting an element in a vector (ix86_expand_vector_set) everywhere, not just for those scalarish operations.

So it would be good to have rth's opinion on this (svn blame seems to indicate he is the one who chose to use vec_concat specifically for V2DF instead of vec_merge).
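[A small example of why the operand size matters for memory operands. This is an illustrative sketch, not part of the posted patch; the variable and function names are invented.

    #include <emmintrin.h>

    double d;   /* only 8-byte aligned, possibly the last 8 bytes of a page */

    __m128d f (__m128d x)
    {
      /* With operand 2 modelled as V2DF, folding the load of 'd' into the
         addsd memory operand would claim a 16-byte read in the RTL, which
         is not what the instruction does; with a DFmode operand it is just
         the 8-byte load that addsd actually performs.  */
      return _mm_add_sd (x, _mm_set_sd (d));
    }
]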
Thanks for the explanation!
By the way, if we decide to have one pattern for V4SF instructions and another for V2DF, we could try to use the recently introduced define_subst here. It won't reduce the number of actual patterns (I mean the number of patterns after iterator and subst expansion), but it could help to make sse.md more compact.

On 7 December 2012 12:49, Marc Glisse <marc.glisse@inria.fr> wrote:
> On Fri, 7 Dec 2012, Michael Zolotukhin wrote:
>
>> 1) Does the root problem lay in the fact that even for scalar
>> additions we perform the addition on the whole vector and only then
>> drop the higher parts of the vector? I.e. to fix the test from the PR
>> we need to replace plus on vector mode with plus on scalar mode?
>
>
> The root problem is that we model the subs[sd] instructions as taking a
> 128-bit second operand, when Intel's documentation says they take a
> 32/64-bit operand, which is an important difference for memory operands (and
> constants). Writing a pattern that reconstructs the result from a scalar
> operation also seems more natural than pretending we are doing a parallel
> operation and dropping most of it (easier for recog and friends).
>
> (note: I think the insn was written to support the intrinsic, which does
> take a 128-bit argument, so it did a good job for that)
>
>
>> 2) Is one of the main requirements having the same pattern for V4SF
>> and V2DF version?
>
>
> Uros seems to think that would be best.
>
>
>> 3) I don't see vec_concat in patterns from your patches, is it
>> explicitly generated by some x86-expander?
>
>
> It is generated by ix86_expand_vector_set.
>
>
>> Anyway, I really like the idea of having some uniformity in describing
>> patterns for scalar instructions, so thank you for the work!
>
>
> For 2-element vectors, vec_concat does seem more natural than vec_merge. If
> we chose vec_merge as the canonical representation, we should chose it for
> setting an element in a vector (ix86_expand_vector_set) everywhere, not just
> those scalarish operations.
>
> So it would be good to have rth's opinion on this (svn blame seems to
> indicate he is the one who chose to use vec_concat specifically for V2DF
> instead of vec_merge).
>
> --
> Marc Glisse
On 2012-12-07 02:49, Marc Glisse wrote: > The root problem is that we model the subs[sd] instructions as taking > a 128-bit second operand, when Intel's documentation says they take a > 32/64-bit operand, which is an important difference for memory > operands (and constants). Writing a pattern that reconstructs the > result from a scalar operation also seems more natural than > pretending we are doing a parallel operation and dropping most of it > (easier for recog and friends). I agree this is a problem with the current representation. > For 2-element vectors, vec_concat does seem more natural than > vec_merge. If we chose vec_merge as the canonical representation, we > should chose it for setting an element in a vector > (ix86_expand_vector_set) everywhere, not just those scalarish > operations. I'd hate to enshrine vec_merge over vec_concat for the benefit of x86, and to the detriment of e.g. mips. There are plenty of embedded simd implementations that are V2xx only. If we simply pull the various x86 patterns into one common form, set and extract included, does that buy us most of what we'd get for playing games in combine? As for your xmmintrin.h changes, I'd like to see a test case that verifies that _mm_add_ss(a, b) does not add extra insns to extract __B[0]. > +(define_insn "<sse>_vm<plusminus_insn><mode>3<vec_merge_or_concat>" > [(set (match_operand:VF_128 0 "register_operand" "=x,x") > (vec_merge:VF_128 > - (plusminus:VF_128 > - (match_operand:VF_128 1 "register_operand" "0,x") > - (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm")) > + (vec_duplicate:VF_128 > + (plusminus:<ssescalarmode> > + (vec_select:<ssescalarmode> > + (match_operand:VF_128 1 "register_operand" "0,x") > + (parallel [(const_int 0)])) > + (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm"))) > (match_dup 1) > (const_int 1)))] > "TARGET_SSE" > "@ > <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2} > v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}" > [(set_attr "isa" "noavx,avx") > (set_attr "type" "sseadd") > (set_attr "prefix" "orig,vex") > (set_attr "mode" "<ssescalarmode>")]) Did this really trigger as a substitution? It's not supposed to have, since you didn't add (set_attr "replace_vec_merge_with_vec_concat" "yes")... r~
On Fri, Dec 07, 2012 at 08:43:05AM -0600, Richard Henderson wrote: > > +(define_insn "<sse>_vm<plusminus_insn><mode>3<vec_merge_or_concat>" > > [(set (match_operand:VF_128 0 "register_operand" "=x,x") > > (vec_merge:VF_128 > > - (plusminus:VF_128 > > - (match_operand:VF_128 1 "register_operand" "0,x") > > - (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm")) > > + (vec_duplicate:VF_128 > > + (plusminus:<ssescalarmode> > > + (vec_select:<ssescalarmode> > > + (match_operand:VF_128 1 "register_operand" "0,x") > > + (parallel [(const_int 0)])) > > + (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm"))) > > (match_dup 1) > > (const_int 1)))] > > "TARGET_SSE" > > "@ > > <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2} > > v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}" > > [(set_attr "isa" "noavx,avx") > > (set_attr "type" "sseadd") > > (set_attr "prefix" "orig,vex") > > (set_attr "mode" "<ssescalarmode>")]) > > Did this really trigger as a substitution? It's not supposed to have, since > you didn't add (set_attr "replace_vec_merge_with_vec_concat" "yes")... That was the older proposal, the current way to trigger it is using the substitution attr somewhere, typically in pattern name - <vec_merge_or_concat> in the above case. Jakub
On 2012-12-07 08:47, Jakub Jelinek wrote: > That was the older proposal, the current way to trigger it is using > the substitution attr somewhere, typically in pattern name > - <vec_merge_or_concat> in the above case. Ah right. ("My mind is going, Dave. I can feel it.") r~
On Fri, 7 Dec 2012, Richard Henderson wrote: > On 2012-12-07 02:49, Marc Glisse wrote: >> For 2-element vectors, vec_concat does seem more natural than >> vec_merge. If we chose vec_merge as the canonical representation, we >> should chose it for setting an element in a vector >> (ix86_expand_vector_set) everywhere, not just those scalarish >> operations. > > I'd hate to enshrine vec_merge over vec_concat for the benefit of x86, > and to the detriment of e.g. mips. There are plenty of embedded simd > implementations that are V2xx only. > > If we simply pull the various x86 patterns into one common form, set > and extract included, does that buy us most of what we'd get for > playing games in combine? I'm sorry, could you be more precise? I don't see clearly what you are suggesting. > As for your xmmintrin.h changes, I'd like to see a test case that verifies > that _mm_add_ss(a, b) does not add extra insns to extract __B[0]. Yes, good idea, thanks.
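[Something along these lines might do. This is a hypothetical sketch mirroring the style of the pr54855 tests; the exact scan-assembler patterns are guesses and no file name has been chosen.

    /* { dg-do compile } */
    /* { dg-options "-O -msse" } */

    #include <xmmintrin.h>

    __m128 f (__m128 a, __m128 b)
    {
      /* Must compile to a single addss; taking __B[0] in the intrinsic
         should not introduce any extra extract or shuffle instructions.  */
      return _mm_add_ss (a, b);
    }

    /* { dg-final { scan-assembler-times "addss" 1 } } */
    /* { dg-final { scan-assembler-not "unpcklps" } } */
    /* { dg-final { scan-assembler-not "shufps" } } */
]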
On 2012-12-07 09:00, Marc Glisse wrote: > On Fri, 7 Dec 2012, Richard Henderson wrote: > >> On 2012-12-07 02:49, Marc Glisse wrote: >>> For 2-element vectors, vec_concat does seem more natural than >>> vec_merge. If we chose vec_merge as the canonical representation, we >>> should chose it for setting an element in a vector >>> (ix86_expand_vector_set) everywhere, not just those scalarish >>> operations. >> >> I'd hate to enshrine vec_merge over vec_concat for the benefit of x86, >> and to the detriment of e.g. mips. There are plenty of embedded simd >> implementations that are V2xx only. >> >> If we simply pull the various x86 patterns into one common form, set >> and extract included, does that buy us most of what we'd get for >> playing games in combine? > > I'm sorry, could you be more precise? I don't see clearly what you are suggesting. Don't change combine? Have I lost the plot somewhere here? r~
On Fri, 7 Dec 2012, Richard Henderson wrote: > On 2012-12-07 09:00, Marc Glisse wrote: >> On Fri, 7 Dec 2012, Richard Henderson wrote: >> >>> On 2012-12-07 02:49, Marc Glisse wrote: >>>> For 2-element vectors, vec_concat does seem more natural than >>>> vec_merge. If we chose vec_merge as the canonical representation, we >>>> should chose it for setting an element in a vector >>>> (ix86_expand_vector_set) everywhere, not just those scalarish >>>> operations. >>> >>> I'd hate to enshrine vec_merge over vec_concat for the benefit of x86, >>> and to the detriment of e.g. mips. There are plenty of embedded simd >>> implementations that are V2xx only. >>> >>> If we simply pull the various x86 patterns into one common form, set >>> and extract included, does that buy us most of what we'd get for >>> playing games in combine? >> >> I'm sorry, could you be more precise? I don't see clearly what you are suggesting. > > Don't change combine? but change ix86_expand_vector_set and others to generate vec_merge and have only the vec_merge define_insn in sse.md? I guess it would buy a large part of it. That's a pretty invasive change, I'll have to try...
On 2012-12-07 09:12, Marc Glisse wrote: > but change ix86_expand_vector_set and others to generate vec_merge > and have only the vec_merge define_insn in sse.md? I guess it would > buy a large part of it. That's a pretty invasive change, I'll have to > try... Is it really that invasive? Anyway, it's something worth trying for 4.9... r~
On Fri, 7 Dec 2012, Richard Henderson wrote: > On 2012-12-07 09:12, Marc Glisse wrote: >> but change ix86_expand_vector_set and others to generate vec_merge >> and have only the vec_merge define_insn in sse.md? I guess it would >> buy a large part of it. That's a pretty invasive change, I'll have to >> try... > > Is it really that invasive? No, changing only V2DF, I seem to have the basic pieces in place changing just 6 patterns in sse.md and a couple functions in i386.c. Now I need to test it and see how much it affects the generated code... > Anyway, it's something worth trying for 4.9... Should I take it that if it ends up looking manageable, you prefer changing all the V2DF operations to vec_merge, rather than just adapting the addsd pattern? Won't it complicate things for generic optimizations if different platforms have different canonical patterns for the same operation on V2DF? It seems to me that it would be good if we agreed on a pattern common to all platforms so we could canonicalize as in the last part of: http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00243.html (or possibly the reverse transformation depending on the pattern we settle on). (those 2 patches didn't touch the generic simplify-rtx code either, if I remember just the "don't touch combine" part of your suggestion ;-) http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00028.html http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00492.html )
Index: doc/extend.texi =================================================================== --- doc/extend.texi (revision 194150) +++ doc/extend.texi (working copy) @@ -9843,22 +9843,22 @@ int __builtin_ia32_comige (v4sf, v4sf) int __builtin_ia32_ucomieq (v4sf, v4sf) int __builtin_ia32_ucomineq (v4sf, v4sf) int __builtin_ia32_ucomilt (v4sf, v4sf) int __builtin_ia32_ucomile (v4sf, v4sf) int __builtin_ia32_ucomigt (v4sf, v4sf) int __builtin_ia32_ucomige (v4sf, v4sf) v4sf __builtin_ia32_addps (v4sf, v4sf) v4sf __builtin_ia32_subps (v4sf, v4sf) v4sf __builtin_ia32_mulps (v4sf, v4sf) v4sf __builtin_ia32_divps (v4sf, v4sf) -v4sf __builtin_ia32_addss (v4sf, v4sf) -v4sf __builtin_ia32_subss (v4sf, v4sf) +v4sf __builtin_ia32_addss (v4sf, float) +v4sf __builtin_ia32_subss (v4sf, float) v4sf __builtin_ia32_mulss (v4sf, v4sf) v4sf __builtin_ia32_divss (v4sf, v4sf) v4si __builtin_ia32_cmpeqps (v4sf, v4sf) v4si __builtin_ia32_cmpltps (v4sf, v4sf) v4si __builtin_ia32_cmpleps (v4sf, v4sf) v4si __builtin_ia32_cmpgtps (v4sf, v4sf) v4si __builtin_ia32_cmpgeps (v4sf, v4sf) v4si __builtin_ia32_cmpunordps (v4sf, v4sf) v4si __builtin_ia32_cmpneqps (v4sf, v4sf) v4si __builtin_ia32_cmpnltps (v4sf, v4sf) @@ -9964,22 +9964,22 @@ v2df __builtin_ia32_cmpunordsd (v2df, v2 v2df __builtin_ia32_cmpneqsd (v2df, v2df) v2df __builtin_ia32_cmpnltsd (v2df, v2df) v2df __builtin_ia32_cmpnlesd (v2df, v2df) v2df __builtin_ia32_cmpordsd (v2df, v2df) v2di __builtin_ia32_paddq (v2di, v2di) v2di __builtin_ia32_psubq (v2di, v2di) v2df __builtin_ia32_addpd (v2df, v2df) v2df __builtin_ia32_subpd (v2df, v2df) v2df __builtin_ia32_mulpd (v2df, v2df) v2df __builtin_ia32_divpd (v2df, v2df) -v2df __builtin_ia32_addsd (v2df, v2df) -v2df __builtin_ia32_subsd (v2df, v2df) +v2df __builtin_ia32_addsd (v2df, double) +v2df __builtin_ia32_subsd (v2df, double) v2df __builtin_ia32_mulsd (v2df, v2df) v2df __builtin_ia32_divsd (v2df, v2df) v2df __builtin_ia32_minpd (v2df, v2df) v2df __builtin_ia32_maxpd (v2df, v2df) v2df __builtin_ia32_minsd (v2df, v2df) v2df __builtin_ia32_maxsd (v2df, v2df) v2df __builtin_ia32_andpd (v2df, v2df) v2df __builtin_ia32_andnpd (v2df, v2df) v2df __builtin_ia32_orpd (v2df, v2df) v2df __builtin_ia32_xorpd (v2df, v2df) Index: testsuite/gcc.target/i386/pr54855-2.c =================================================================== --- testsuite/gcc.target/i386/pr54855-2.c (revision 0) +++ testsuite/gcc.target/i386/pr54855-2.c (revision 0) @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O -msse" } */ + +typedef float vec __attribute__((vector_size(16))); + +vec f (vec x) +{ + x[0] += 2; + return x; +} + +vec g (vec x) +{ + x[0] -= 1; + return x; +} + +/* { dg-final { scan-assembler-not "mov" } } */ Property changes on: testsuite/gcc.target/i386/pr54855-2.c ___________________________________________________________________ Added: svn:keywords + Author Date Id Revision URL Added: svn:eol-style + native Index: testsuite/gcc.target/i386/pr54855-1.c =================================================================== --- testsuite/gcc.target/i386/pr54855-1.c (revision 0) +++ testsuite/gcc.target/i386/pr54855-1.c (revision 0) @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-O -msse2" } */ + +typedef double vec __attribute__((vector_size(16))); + +vec f (vec x) +{ + x[0] += 2; + return x; +} + +vec g (vec x) +{ + x[0] -= 1; + return x; +} + +/* { dg-final { scan-assembler-not "mov" } } */ Property changes on: testsuite/gcc.target/i386/pr54855-1.c ___________________________________________________________________ 
Added: svn:eol-style + native Added: svn:keywords + Author Date Id Revision URL Index: config/i386/xmmintrin.h =================================================================== --- config/i386/xmmintrin.h (revision 194150) +++ config/i386/xmmintrin.h (working copy) @@ -92,27 +92,27 @@ _mm_setzero_ps (void) return __extension__ (__m128){ 0.0f, 0.0f, 0.0f, 0.0f }; } /* Perform the respective operation on the lower SPFP (single-precision floating-point) values of A and B; the upper three SPFP values are passed through from A. */ extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_add_ss (__m128 __A, __m128 __B) { - return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B); + return (__m128) __builtin_ia32_addss ((__v4sf)__A, __B[0]); } extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_sub_ss (__m128 __A, __m128 __B) { - return (__m128) __builtin_ia32_subss ((__v4sf)__A, (__v4sf)__B); + return (__m128) __builtin_ia32_subss ((__v4sf)__A, __B[0]); } extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_mul_ss (__m128 __A, __m128 __B) { return (__m128) __builtin_ia32_mulss ((__v4sf)__A, (__v4sf)__B); } extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_div_ss (__m128 __A, __m128 __B) Index: config/i386/emmintrin.h =================================================================== --- config/i386/emmintrin.h (revision 194150) +++ config/i386/emmintrin.h (working copy) @@ -226,33 +226,33 @@ _mm_cvtsi128_si64x (__m128i __A) extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_add_pd (__m128d __A, __m128d __B) { return (__m128d)__builtin_ia32_addpd ((__v2df)__A, (__v2df)__B); } extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_add_sd (__m128d __A, __m128d __B) { - return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B); + return (__m128d)__builtin_ia32_addsd ((__v2df)__A, __B[0]); } extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_sub_pd (__m128d __A, __m128d __B) { return (__m128d)__builtin_ia32_subpd ((__v2df)__A, (__v2df)__B); } extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_sub_sd (__m128d __A, __m128d __B) { - return (__m128d)__builtin_ia32_subsd ((__v2df)__A, (__v2df)__B); + return (__m128d)__builtin_ia32_subsd ((__v2df)__A, __B[0]); } extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_mul_pd (__m128d __A, __m128d __B) { return (__m128d)__builtin_ia32_mulpd ((__v2df)__A, (__v2df)__B); } extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__)) _mm_mul_sd (__m128d __A, __m128d __B) Index: config/i386/sse.md =================================================================== --- config/i386/sse.md (revision 194150) +++ config/i386/sse.md (working copy) @@ -858,23 +858,26 @@ <plusminus_mnemonic><ssemodesuffix>\t{%2, %0|%0, %2} v<plusminus_mnemonic><ssemodesuffix>\t{%2, %1, %0|%0, %1, %2}" [(set_attr "isa" "noavx,avx") (set_attr "type" "sseadd") (set_attr "prefix" "orig,vex") (set_attr "mode" "<MODE>")]) (define_insn "<sse>_vm<plusminus_insn><mode>3" [(set (match_operand:VF_128 0 "register_operand" "=x,x") (vec_merge:VF_128 - (plusminus:VF_128 - (match_operand:VF_128 1 "register_operand" "0,x") - (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm")) + 
(vec_duplicate:VF_128 + (plusminus:<ssescalarmode> + (vec_select:<ssescalarmode> + (match_operand:VF_128 1 "register_operand" "0,x") + (parallel [(const_int 0)])) + (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm"))) (match_dup 1) (const_int 1)))] "TARGET_SSE" "@ <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2} v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}" [(set_attr "isa" "noavx,avx") (set_attr "type" "sseadd") (set_attr "prefix" "orig,vex") (set_attr "mode" "<ssescalarmode>")]) Index: config/i386/i386-builtin-types.def =================================================================== --- config/i386/i386-builtin-types.def (revision 194150) +++ config/i386/i386-builtin-types.def (working copy) @@ -263,20 +263,21 @@ DEF_FUNCTION_TYPE (UINT64, UINT64, UINT6 DEF_FUNCTION_TYPE (UINT8, UINT8, INT) DEF_FUNCTION_TYPE (V16QI, V16QI, SI) DEF_FUNCTION_TYPE (V16QI, V16QI, V16QI) DEF_FUNCTION_TYPE (V16QI, V8HI, V8HI) DEF_FUNCTION_TYPE (V1DI, V1DI, SI) DEF_FUNCTION_TYPE (V1DI, V1DI, V1DI) DEF_FUNCTION_TYPE (V1DI, V2SI, V2SI) DEF_FUNCTION_TYPE (V1DI, V8QI, V8QI) DEF_FUNCTION_TYPE (V2DF, PCV2DF, V2DI) DEF_FUNCTION_TYPE (V2DF, V2DF, DI) +DEF_FUNCTION_TYPE (V2DF, V2DF, DOUBLE) DEF_FUNCTION_TYPE (V2DF, V2DF, INT) DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE) DEF_FUNCTION_TYPE (V2DF, V2DF, SI) DEF_FUNCTION_TYPE (V2DF, V2DF, V2DF) DEF_FUNCTION_TYPE (V2DF, V2DF, V2DI) DEF_FUNCTION_TYPE (V2DF, V2DF, V4SF) DEF_FUNCTION_TYPE (V2DF, V4DF, INT) DEF_FUNCTION_TYPE (V2DI, V16QI, V16QI) DEF_FUNCTION_TYPE (V2DI, V2DF, V2DF) DEF_FUNCTION_TYPE (V2DI, V2DI, INT) @@ -296,20 +297,21 @@ DEF_FUNCTION_TYPE (V4DF, PCV4DF, V4DI) DEF_FUNCTION_TYPE (V4DF, V4DF, INT) DEF_FUNCTION_TYPE (V4DF, V4DF, V4DF) DEF_FUNCTION_TYPE (V4DF, V4DF, V4DI) DEF_FUNCTION_TYPE (V4HI, V2SI, V2SI) DEF_FUNCTION_TYPE (V4HI, V4HI, INT) DEF_FUNCTION_TYPE (V4HI, V4HI, SI) DEF_FUNCTION_TYPE (V4HI, V4HI, V4HI) DEF_FUNCTION_TYPE (V4HI, V8QI, V8QI) DEF_FUNCTION_TYPE (V4SF, PCV4SF, V4SI) DEF_FUNCTION_TYPE (V4SF, V4SF, DI) +DEF_FUNCTION_TYPE (V4SF, V4SF, FLOAT) DEF_FUNCTION_TYPE (V4SF, V4SF, INT) DEF_FUNCTION_TYPE (V4SF, V4SF, PCV2SF) DEF_FUNCTION_TYPE (V4SF, V4SF, SI) DEF_FUNCTION_TYPE (V4SF, V4SF, V2DF) DEF_FUNCTION_TYPE (V4SF, V4SF, V2SI) DEF_FUNCTION_TYPE (V4SF, V4SF, V4SF) DEF_FUNCTION_TYPE (V4SF, V4SF, V4SI) DEF_FUNCTION_TYPE (V4SF, V8SF, INT) DEF_FUNCTION_TYPE (V4SI, V2DF, V2DF) DEF_FUNCTION_TYPE (V4SI, V4SF, V4SF) Index: config/i386/i386.c =================================================================== --- config/i386/i386.c (revision 194150) +++ config/i386/i386.c (working copy) @@ -27059,22 +27059,22 @@ static const struct builtin_description { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttps2pi, "__builtin_ia32_cvttps2pi", IX86_BUILTIN_CVTTPS2PI, UNKNOWN, (int) V2SI_FTYPE_V4SF }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttss2si, "__builtin_ia32_cvttss2si", IX86_BUILTIN_CVTTSS2SI, UNKNOWN, (int) INT_FTYPE_V4SF }, { OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_64BIT, CODE_FOR_sse_cvttss2siq, "__builtin_ia32_cvttss2si64", IX86_BUILTIN_CVTTSS2SI64, UNKNOWN, (int) INT64_FTYPE_V4SF }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_shufps, "__builtin_ia32_shufps", IX86_BUILTIN_SHUFPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF_INT }, { OPTION_MASK_ISA_SSE, CODE_FOR_addv4sf3, "__builtin_ia32_addps", IX86_BUILTIN_ADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF }, { OPTION_MASK_ISA_SSE, CODE_FOR_subv4sf3, "__builtin_ia32_subps", IX86_BUILTIN_SUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF }, { OPTION_MASK_ISA_SSE, CODE_FOR_mulv4sf3, "__builtin_ia32_mulps", 
IX86_BUILTIN_MULPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_divv4sf3, "__builtin_ia32_divps", IX86_BUILTIN_DIVPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF }, - { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3, "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF }, - { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3, "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF }, + { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3, "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT }, + { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3, "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmmulv4sf3, "__builtin_ia32_mulss", IX86_BUILTIN_MULSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmdivv4sf3, "__builtin_ia32_divss", IX86_BUILTIN_DIVSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpeqps", IX86_BUILTIN_CMPEQPS, EQ, (int) V4SF_FTYPE_V4SF_V4SF }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpltps", IX86_BUILTIN_CMPLTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpleps", IX86_BUILTIN_CMPLEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgtps", IX86_BUILTIN_CMPGTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF_SWAP }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgeps", IX86_BUILTIN_CMPGEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF_SWAP }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpunordps", IX86_BUILTIN_CMPUNORDPS, UNORDERED, (int) V4SF_FTYPE_V4SF_V4SF }, { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpneqps", IX86_BUILTIN_CMPNEQPS, NE, (int) V4SF_FTYPE_V4SF_V4SF }, @@ -27163,22 +27163,22 @@ static const struct builtin_description { OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_64BIT, CODE_FOR_sse2_cvttsd2siq, "__builtin_ia32_cvttsd2si64", IX86_BUILTIN_CVTTSD2SI64, UNKNOWN, (int) INT64_FTYPE_V2DF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2dq, "__builtin_ia32_cvtps2dq", IX86_BUILTIN_CVTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2pd, "__builtin_ia32_cvtps2pd", IX86_BUILTIN_CVTPS2PD, UNKNOWN, (int) V2DF_FTYPE_V4SF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_fix_truncv4sfv4si2, "__builtin_ia32_cvttps2dq", IX86_BUILTIN_CVTTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_addv2df3, "__builtin_ia32_addpd", IX86_BUILTIN_ADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_subv2df3, "__builtin_ia32_subpd", IX86_BUILTIN_SUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_mulv2df3, "__builtin_ia32_mulpd", IX86_BUILTIN_MULPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_divv2df3, "__builtin_ia32_divpd", IX86_BUILTIN_DIVPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF }, - { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3, "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF }, - { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3, "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF }, + { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3, "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE }, + { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3, "__builtin_ia32_subsd", 
IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE }, { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmmulv2df3, "__builtin_ia32_mulsd", IX86_BUILTIN_MULSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmdivv2df3, "__builtin_ia32_divsd", IX86_BUILTIN_DIVSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpeqpd", IX86_BUILTIN_CMPEQPD, EQ, (int) V2DF_FTYPE_V2DF_V2DF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpltpd", IX86_BUILTIN_CMPLTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmplepd", IX86_BUILTIN_CMPLEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgtpd", IX86_BUILTIN_CMPGTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF_SWAP }, { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgepd", IX86_BUILTIN_CMPGEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF_SWAP}, { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpunordpd", IX86_BUILTIN_CMPUNORDPD, UNORDERED, (int) V2DF_FTYPE_V2DF_V2DF }, { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpneqpd", IX86_BUILTIN_CMPNEQPD, NE, (int) V2DF_FTYPE_V2DF_V2DF }, @@ -30790,34 +30790,36 @@ ix86_expand_args_builtin (const struct b case V4HI_FTYPE_V8QI_V8QI: case V4HI_FTYPE_V2SI_V2SI: case V4DF_FTYPE_V4DF_V4DF: case V4DF_FTYPE_V4DF_V4DI: case V4SF_FTYPE_V4SF_V4SF: case V4SF_FTYPE_V4SF_V4SI: case V4SF_FTYPE_V4SF_V2SI: case V4SF_FTYPE_V4SF_V2DF: case V4SF_FTYPE_V4SF_DI: case V4SF_FTYPE_V4SF_SI: + case V4SF_FTYPE_V4SF_FLOAT: case V2DI_FTYPE_V2DI_V2DI: case V2DI_FTYPE_V16QI_V16QI: case V2DI_FTYPE_V4SI_V4SI: case V2UDI_FTYPE_V4USI_V4USI: case V2DI_FTYPE_V2DI_V16QI: case V2DI_FTYPE_V2DF_V2DF: case V2SI_FTYPE_V2SI_V2SI: case V2SI_FTYPE_V4HI_V4HI: case V2SI_FTYPE_V2SF_V2SF: case V2DF_FTYPE_V2DF_V2DF: case V2DF_FTYPE_V2DF_V4SF: case V2DF_FTYPE_V2DF_V2DI: case V2DF_FTYPE_V2DF_DI: case V2DF_FTYPE_V2DF_SI: + case V2DF_FTYPE_V2DF_DOUBLE: case V2SF_FTYPE_V2SF_V2SF: case V1DI_FTYPE_V1DI_V1DI: case V1DI_FTYPE_V8QI_V8QI: case V1DI_FTYPE_V2SI_V2SI: case V32QI_FTYPE_V16HI_V16HI: case V16HI_FTYPE_V8SI_V8SI: case V32QI_FTYPE_V32QI_V32QI: case V16HI_FTYPE_V32QI_V32QI: case V16HI_FTYPE_V16HI_V16HI: case V8SI_FTYPE_V4DF_V4DF: Index: simplify-rtx.c =================================================================== --- simplify-rtx.c (revision 194150) +++ simplify-rtx.c (working copy) @@ -3588,20 +3588,32 @@ simplify_binary_operation_1 (enum rtx_co int len0 = XVECLEN (par0, 0); int len1 = XVECLEN (par1, 0); rtvec vec = rtvec_alloc (len0 + len1); for (int i = 0; i < len0; i++) RTVEC_ELT (vec, i) = XVECEXP (par0, 0, i); for (int i = 0; i < len1; i++) RTVEC_ELT (vec, len0 + i) = XVECEXP (par1, 0, i); return simplify_gen_binary (VEC_SELECT, mode, XEXP (trueop0, 0), gen_rtx_PARALLEL (VOIDmode, vec)); } + + /* Recognize a simple form of VEC_MERGE. */ + if (GET_CODE (trueop1) == VEC_SELECT + && GET_MODE (XEXP (trueop1, 0)) == mode + && XVECLEN (XEXP (trueop1, 1), 0) == 1 + && INTVAL (XVECEXP (XEXP (trueop1, 1), 0, 0)) == 1) + { + rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0); + rtx newop1 = XEXP (trueop1, 0); + return simplify_gen_ternary (VEC_MERGE, mode, GET_MODE (newop0), + newop0, newop1, GEN_INT (1)); + } } return 0; default: gcc_unreachable (); } return 0; }
On Tue, 4 Dec 2012, Marc Glisse wrote:
> Do you agree?

Like this? (only tested on the new testcases, and then I'd need to ask Eric his opinion)

2012-12-04  Marc Glisse  <marc.glisse@inria.fr>

        PR target/54855
gcc/
        * simplify-rtx.c (simplify_binary_operation_1) <VEC_CONCAT>: Replace
        with VEC_MERGE.
        * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
        pattern.
        * config/i386/i386-builtin-types.def: New function types.
        * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
        (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
        __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
        * config/i386/xmmintrin.h: Adapt to new builtin prototype.
        * config/i386/emmintrin.h: Likewise.
        * doc/extend.texi (X86 Built-in Functions): Document changed
        prototype.

testsuite/
        * gcc.target/i386/pr54855-1.c: New testcase.
        * gcc.target/i386/pr54855-2.c: New testcase.