Message ID | CAFULd4abm7fZrKOYWMibFDM=uBk1TET0vSn7=5=-tYhcVrRdUA@mail.gmail.com
---|---
State | New
Series | [RFC] i386: Do not sanitize upper part of V2SFmode reg with -fno-trapping-math [PR110832]
On Sun, 30 Jul 2023, Uros Bizjak wrote:

> Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> named patterns in order to avoid generation of partial vector V4SFmode
> trapping instructions.
>
> The new option is enabled by default, because even with sanitization,
> a small but consistent speed up of 2 to 3% with the Polyhedron capacita
> benchmark can be achieved vs. scalar code.
>
> Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> vs. scalar code.  This is what clang does by default, as it defaults
> to -fno-trapping-math.

I like the new option; note you lack invoke.texi documentation, where
I'd also elaborate a bit on the interaction with -fno-trapping-math
and the possible performance impact when NaNs or denormals leak
into the upper halves, and cross-reference -mdaz-ftz.

Thanks,
Richard.

>	PR target/110832
>
> gcc/ChangeLog:
>
>	* config/i386/i386.h (TARGET_MMXFP_WITH_SSE): New macro.
>	* config/i386/i386.opt (mmmxfp-with-sse): New option.
>	* config/i386/mmx.md (movq_<mode>_to_sse): Do not sanitize
>	upper part of V2SFmode register with -fno-trapping-math.
>	(<plusminusmult:insn>v2sf3): Enable for TARGET_MMXFP_WITH_SSE.
>	(divv2sf3): Ditto.
>	(<smaxmin:code>v2sf3): Ditto.
>	(sqrtv2sf2): Ditto.
>	(*mmx_haddv2sf3_low): Ditto.
>	(*mmx_hsubv2sf3_low): Ditto.
>	(vec_addsubv2sf3): Ditto.
>	(vec_cmpv2sfv2si): Ditto.
>	(vcond<V2FI:mode>v2sf): Ditto.
>	(fmav2sf4): Ditto.
>	(fmsv2sf4): Ditto.
>	(fnmav2sf4): Ditto.
>	(fnmsv2sf4): Ditto.
>	(fix_truncv2sfv2si2): Ditto.
>	(fixuns_truncv2sfv2si2): Ditto.
>	(floatv2siv2sf2): Ditto.
>	(floatunsv2siv2sf2): Ditto.
>	(nearbyintv2sf2): Ditto.
>	(rintv2sf2): Ditto.
>	(lrintv2sfv2si2): Ditto.
>	(ceilv2sf2): Ditto.
>	(lceilv2sfv2si2): Ditto.
>	(floorv2sf2): Ditto.
>	(lfloorv2sfv2si2): Ditto.
>	(btruncv2sf2): Ditto.
>	(roundv2sf2): Ditto.
>	(lroundv2sfv2si2): Ditto.
>
> Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
>
> Uros.
On Mon, Jul 31, 2023 at 11:40 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Sun, 30 Jul 2023, Uros Bizjak wrote:
>
> > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > named patterns in order to avoid generation of partial vector V4SFmode
> > trapping instructions.
> >
> > The new option is enabled by default, because even with sanitization,
> > a small but consistent speed up of 2 to 3% with the Polyhedron capacita
> > benchmark can be achieved vs. scalar code.
> >
> > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > vs. scalar code.  This is what clang does by default, as it defaults
> > to -fno-trapping-math.
>
> I like the new option; note you lack invoke.texi documentation, where
> I'd also elaborate a bit on the interaction with -fno-trapping-math
> and the possible performance impact when NaNs or denormals leak
> into the upper halves, and cross-reference -mdaz-ftz.

Yes, this is my plan (the lack of documentation is due to the RFC status
of the patch).  OTOH, Hongtao has some other ideas in the PR, so I'll
wait with the patch a bit.

Thanks,
Uros.

> Thanks,
> Richard.
>
> >	PR target/110832
> >
> > gcc/ChangeLog:
> >
> >	* config/i386/i386.h (TARGET_MMXFP_WITH_SSE): New macro.
> >	* config/i386/i386.opt (mmmxfp-with-sse): New option.
> >	* config/i386/mmx.md (movq_<mode>_to_sse): Do not sanitize
> >	upper part of V2SFmode register with -fno-trapping-math.
> >	(<plusminusmult:insn>v2sf3): Enable for TARGET_MMXFP_WITH_SSE.
> >	(divv2sf3): Ditto.
> >	(<smaxmin:code>v2sf3): Ditto.
> >	(sqrtv2sf2): Ditto.
> >	(*mmx_haddv2sf3_low): Ditto.
> >	(*mmx_hsubv2sf3_low): Ditto.
> >	(vec_addsubv2sf3): Ditto.
> >	(vec_cmpv2sfv2si): Ditto.
> >	(vcond<V2FI:mode>v2sf): Ditto.
> >	(fmav2sf4): Ditto.
> >	(fmsv2sf4): Ditto.
> >	(fnmav2sf4): Ditto.
> >	(fnmsv2sf4): Ditto.
> >	(fix_truncv2sfv2si2): Ditto.
> >	(fixuns_truncv2sfv2si2): Ditto.
> >	(floatv2siv2sf2): Ditto.
> >	(floatunsv2siv2sf2): Ditto.
> >	(nearbyintv2sf2): Ditto.
> >	(rintv2sf2): Ditto.
> >	(lrintv2sfv2si2): Ditto.
> >	(ceilv2sf2): Ditto.
> >	(lceilv2sfv2si2): Ditto.
> >	(floorv2sf2): Ditto.
> >	(lfloorv2sfv2si2): Ditto.
> >	(btruncv2sf2): Ditto.
> >	(roundv2sf2): Ditto.
> >	(lroundv2sfv2si2): Ditto.
> >
> > Bootstrapped and regression tested on x86_64-linux-gnu {,-m32}.
> >
> > Uros.
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
On Mon, Jul 31, 2023 at 11:40 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Sun, 30 Jul 2023, Uros Bizjak wrote:
>
> > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > named patterns in order to avoid generation of partial vector V4SFmode
> > trapping instructions.
> >
> > The new option is enabled by default, because even with sanitization,
> > a small but consistent speed up of 2 to 3% with the Polyhedron capacita
> > benchmark can be achieved vs. scalar code.
> >
> > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > vs. scalar code.  This is what clang does by default, as it defaults
> > to -fno-trapping-math.
>
> I like the new option; note you lack invoke.texi documentation, where
> I'd also elaborate a bit on the interaction with -fno-trapping-math
> and the possible performance impact when NaNs or denormals leak
> into the upper halves, and cross-reference -mdaz-ftz.

The attached doc patch is the invoke.texi entry for the -mmmxfp-with-sse
option.  It is written in a way to also cover half-float vectors.  WDYT?

Uros.

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index fa765d5a0dd..99093172abe 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1417,6 +1417,7 @@ See RS/6000 and PowerPC Options.
 -mcld -mcx16 -msahf -mmovbe -mcrc32 -mmwait
 -mrecip -mrecip=@var{opt}
 -mvzeroupper -mprefer-avx128 -mprefer-vector-width=@var{opt}
+-mmmxfp-with-sse
 -mmove-max=@var{bits} -mstore-max=@var{bits}
 -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4 -mavx
 -mavx2 -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl
@@ -33708,6 +33709,22 @@ This option instructs GCC to use 128-bit AVX instructions instead of
 This option instructs GCC to use @var{opt}-bit vector width in instructions
 instead of default on the selected platform.
 
+@opindex -mmmxfp-with-sse
+@item -mmmxfp-with-sse
+This option enables GCC to generate trapping floating-point operations on
+partial vectors, where vector elements reside in the low part of the 128-bit
+SSE register.  Unless @option{-fno-trapping-math} is specified, the compiler
+guarantees correct trapping behavior by sanitizing all input operands to
+have zeroes in the upper part of the vector register.  Note that by using
+built-in functions or inline assembly with partial vector arguments, NaNs,
+denormal or invalid values can leak into the upper part of the vector,
+causing possible performance issues when @option{-fno-trapping-math} is in
+effect.  These issues can be mitigated by manually sanitizing the upper part
+of the partial vector argument register or by using @option{-mdaz-ftz} to set
+denormals-are-zero (DAZ) flag in the MXCSR register.
+
+This option is enabled by default.
+
 @opindex mmove-max
 @item -mmove-max=@var{bits}
 This option instructs GCC to set the maximum number of bits can be
On Mon, 7 Aug 2023, Uros Bizjak wrote:

> On Mon, Jul 31, 2023 at 11:40 AM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Sun, 30 Jul 2023, Uros Bizjak wrote:
> >
> > > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > > named patterns in order to avoid generation of partial vector V4SFmode
> > > trapping instructions.
> > >
> > > The new option is enabled by default, because even with sanitization,
> > > a small but consistent speed up of 2 to 3% with the Polyhedron capacita
> > > benchmark can be achieved vs. scalar code.
> > >
> > > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > > vs. scalar code.  This is what clang does by default, as it defaults
> > > to -fno-trapping-math.
> >
> > I like the new option; note you lack invoke.texi documentation, where
> > I'd also elaborate a bit on the interaction with -fno-trapping-math
> > and the possible performance impact when NaNs or denormals leak
> > into the upper halves, and cross-reference -mdaz-ftz.
>
> The attached doc patch is the invoke.texi entry for the -mmmxfp-with-sse
> option.  It is written in a way to also cover half-float vectors.  WDYT?

"generate trapping floating-point operations"

I'd say "generate floating-point operations that might affect the
set of floating point status flags"; the word "trapping" is IMHO
misleading.  Not sure if "set of floating point status flags" is the
correct term, but it's what the C standard seems to refer to when
talking about things you get with fegetexceptflag.  feraiseexcept
refers to "floating-point exceptions".  Unfortunately the
-fno-trapping-math documentation is similarly confusing (and maybe
even wrong; I read it to conform to 'non-stop' IEEE arithmetic).

I'd maybe give an example of an FP operation that's _not_ affected
by the flag (copysign?).

Otherwise it looks OK to me.

Thanks,
Richard.
On Tue, Aug 8, 2023 at 10:07 AM Richard Biener <rguenther@suse.de> wrote:
>
> On Mon, 7 Aug 2023, Uros Bizjak wrote:
>
> > On Mon, Jul 31, 2023 at 11:40 AM Richard Biener <rguenther@suse.de> wrote:
> > >
> > > On Sun, 30 Jul 2023, Uros Bizjak wrote:
> > >
> > > > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > > > named patterns in order to avoid generation of partial vector V4SFmode
> > > > trapping instructions.
> > > >
> > > > The new option is enabled by default, because even with sanitization,
> > > > a small but consistent speed up of 2 to 3% with the Polyhedron capacita
> > > > benchmark can be achieved vs. scalar code.
> > > >
> > > > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > > > vs. scalar code.  This is what clang does by default, as it defaults
> > > > to -fno-trapping-math.
> > >
> > > I like the new option; note you lack invoke.texi documentation, where
> > > I'd also elaborate a bit on the interaction with -fno-trapping-math
> > > and the possible performance impact when NaNs or denormals leak
> > > into the upper halves, and cross-reference -mdaz-ftz.
> >
> > The attached doc patch is the invoke.texi entry for the -mmmxfp-with-sse
> > option.  It is written in a way to also cover half-float vectors.  WDYT?
>
> "generate trapping floating-point operations"
>
> I'd say "generate floating-point operations that might affect the
> set of floating point status flags"; the word "trapping" is IMHO
> misleading.  Not sure if "set of floating point status flags" is the
> correct term, but it's what the C standard seems to refer to when
> talking about things you get with fegetexceptflag.  feraiseexcept
> refers to "floating-point exceptions".  Unfortunately the
> -fno-trapping-math documentation is similarly confusing (and maybe
> even wrong; I read it to conform to 'non-stop' IEEE arithmetic).

Thanks for suggesting the right terminology.  I think that:

+@opindex mpartial-vector-math
+@item -mpartial-vector-math
+This option enables GCC to generate floating-point operations that might
+affect the set of floating point status flags on partial vectors, where
+vector elements reside in the low part of the 128-bit SSE register.  Unless
+@option{-fno-trapping-math} is specified, the compiler guarantees correct
+behavior by sanitizing all input operands to have zeroes in the unused
+upper part of the vector register.  Note that by using built-in functions
+or inline assembly with partial vector arguments, NaNs, denormal or invalid
+values can leak into the upper part of the vector, causing possible
+performance issues when @option{-fno-trapping-math} is in effect.  These
+issues can be mitigated by manually sanitizing the upper part of the partial
+vector argument register or by using @option{-mdaz-ftz} to set
+denormals-are-zero (DAZ) flag in the MXCSR register.

now explains in adequate detail what the option does.  IMO,
"floating-point operations that might affect the set of floating point
status flags" correctly identifies the affected operations, so an
example, as suggested below, is not necessary.

> I'd maybe give an example of an FP operation that's _not_ affected
> by the flag (copysign?).

Please note that I have renamed the option to "-mpartial-vector-math",
with a short target-specific description:

+partial-vector-math
+Target Var(ix86_partial_vec_math) Init(1)
+Enable floating-point status flags setting SSE vector operations on partial vectors

which I think summarises the option (without the word "trapping").  The
same approach will be taken for Float16 operations, so the approach is
not specific to MMX vectors.

> Otherwise it looks OK to me.

Thanks, I have attached the RFC V2 patch; I plan to submit a formal
patch later today.

Uros.
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 1cc8563477a..8d9a1ae93f3 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -632,6 +632,10 @@ Enum(prefer_vector_width) String(256) Value(PVW_AVX256)
 EnumValue
 Enum(prefer_vector_width) String(512) Value(PVW_AVX512)
 
+partial-vector-math
+Target Var(ix86_partial_vec_math) Init(1)
+Enable floating-point status flags setting SSE vector operations on partial vectors
+
 mmove-max=
 Target RejectNegative Joined Var(ix86_move_max) Enum(prefer_vector_width) Init(PVW_NONE) Save
 Maximum number of bits that can be moved from memory to memory efficiently.
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index b49554e9b8f..95f7a0113e7 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -595,7 +595,18 @@ (define_expand "movq_<mode>_to_sse"
 	  (match_operand:V2FI_V4HF 1 "nonimmediate_operand")
 	  (match_dup 2)))]
   "TARGET_SSE2"
-  "operands[2] = CONST0_RTX (<MODE>mode);")
+{
+  if (<MODE>mode == V2SFmode
+      && !flag_trapping_math)
+    {
+      rtx op1 = force_reg (<MODE>mode, operands[1]);
+      emit_move_insn (operands[0], lowpart_subreg (<mmxdoublevecmode>mode,
+						   op1, <MODE>mode));
+      DONE;
+    }
+
+  operands[2] = CONST0_RTX (<MODE>mode);
+})
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;
@@ -648,7 +659,7 @@ (define_expand "<insn>v2sf3"
 	(plusminusmult:V2SF
 	  (match_operand:V2SF 1 "nonimmediate_operand")
 	  (match_operand:V2SF 2 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -726,7 +737,7 @@ (define_expand "divv2sf3"
   [(set (match_operand:V2SF 0 "register_operand")
 	(div:V2SF (match_operand:V2SF 1 "register_operand")
 		  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -748,7 +759,7 @@ (define_expand "<code>v2sf3"
 	(smaxmin:V2SF
 	  (match_operand:V2SF 1 "register_operand")
 	  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -850,7 +861,7 @@ (define_insn "mmx_rcpit2v2sf3"
 (define_expand "sqrtv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(sqrt:V2SF (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -931,7 +942,7 @@ (define_insn_and_split "*mmx_haddv2sf3_low"
 	  (vec_select:SF (match_dup 1)
 	    (parallel [(match_operand:SI 3 "const_0_to_1_operand")]))))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math
    && INTVAL (operands[2]) != INTVAL (operands[3])
    && ix86_pre_reload_split ()"
   "#"
@@ -977,7 +988,7 @@ (define_insn_and_split "*mmx_hsubv2sf3_low"
 	  (vec_select:SF (match_dup 1)
 	    (parallel [(const_int 1)]))))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
@@ -1039,7 +1050,7 @@ (define_expand "vec_addsubv2sf3"
 	    (match_operand:V2SF 2 "nonimmediate_operand"))
 	  (plus:V2SF (match_dup 1) (match_dup 2))
 	  (const_int 1)))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE3 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -1102,7 +1113,7 @@ (define_expand "vec_cmpv2sfv2si"
 	(match_operator:V2SI 1 ""
 	  [(match_operand:V2SF 2 "nonimmediate_operand")
 	   (match_operand:V2SF 3 "nonimmediate_operand")]))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx ops[4];
   ops[3] = gen_reg_rtx (V4SFmode);
@@ -1128,7 +1139,7 @@ (define_expand "vcond<mode>v2sf"
 	     (match_operand:V2SF 5 "nonimmediate_operand")])
 	  (match_operand:V2FI 1 "general_operand")
 	  (match_operand:V2FI 2 "general_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx ops[6];
   ops[5] = gen_reg_rtx (V4SFmode);
@@ -1318,7 +1329,7 @@ (define_expand "fmav2sf4"
 	  (match_operand:V2SF 2 "nonimmediate_operand")
 	  (match_operand:V2SF 3 "nonimmediate_operand")))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1343,7 +1354,7 @@ (define_expand "fmsv2sf4"
 	  (neg:V2SF (match_operand:V2SF 3 "nonimmediate_operand"))))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1368,7 +1379,7 @@ (define_expand "fnmav2sf4"
 	  (match_operand:V2SF 2 "nonimmediate_operand")
 	  (match_operand:V2SF 3 "nonimmediate_operand")))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1394,7 +1405,7 @@ (define_expand "fnmsv2sf4"
 	  (neg:V2SF (match_operand:V2SF 3 "nonimmediate_operand"))))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1420,7 +1431,7 @@ (define_expand "fnmsv2sf4"
 (define_expand "fix_truncv2sfv2si2"
   [(set (match_operand:V2SI 0 "register_operand")
 	(fix:V2SI (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1436,7 +1447,7 @@ (define_expand "fix_truncv2sfv2si2"
 (define_expand "fixuns_truncv2sfv2si2"
   [(set (match_operand:V2SI 0 "register_operand")
 	(unsigned_fix:V2SI (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
+  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1461,7 +1472,7 @@ (define_insn "mmx_fix_truncv2sfv2si2"
 (define_expand "floatv2siv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(float:V2SF (match_operand:V2SI 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SImode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1477,7 +1488,7 @@ (define_expand "floatv2siv2sf2"
 (define_expand "floatunsv2siv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(unsigned_float:V2SF (match_operand:V2SI 1 "nonimmediate_operand")))]
-  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
+  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SImode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1754,7 +1765,7 @@ (define_expand "vec_initv2sfsf"
 (define_expand "nearbyintv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1770,7 +1781,7 @@ (define_expand "nearbyintv2sf2"
 (define_expand "rintv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1786,8 +1797,8 @@ (define_expand "rintv2sf2"
 (define_expand "lrintv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1804,7 +1815,7 @@ (define_expand "ceilv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1820,8 +1831,8 @@ (define_expand "ceilv2sf2"
 (define_expand "lceilv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1838,7 +1849,7 @@ (define_expand "floorv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1854,8 +1865,8 @@ (define_expand "floorv2sf2"
 (define_expand "lfloorv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1872,7 +1883,7 @@ (define_expand "btruncv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1889,7 +1900,7 @@ (define_expand "roundv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1905,8 +1916,8 @@ (define_expand "roundv2sf2"
 (define_expand "lroundv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMX_WITH_SSE && ix86_partial_vec_math"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 674f956f4b8..f5081c0cfb9 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -1419,6 +1419,7 @@ See RS/6000 and PowerPC Options.
 -mcld -mcx16 -msahf -mmovbe -mcrc32 -mmwait
 -mrecip -mrecip=@var{opt}
 -mvzeroupper -mprefer-avx128 -mprefer-vector-width=@var{opt}
+-mpartial-vector-math
 -mmove-max=@var{bits} -mstore-max=@var{bits}
 -mmmx -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -msse4 -mavx
 -mavx2 -mavx512f -mavx512pf -mavx512er -mavx512cd -mavx512vl
@@ -33754,6 +33755,23 @@ This option instructs GCC to use 128-bit AVX instructions instead of
 
 This option instructs GCC to use @var{opt}-bit vector width in instructions
 instead of default on the selected platform.
 
+@opindex mpartial-vector-math
+@item -mpartial-vector-math
+This option enables GCC to generate floating-point operations that might
+affect the set of floating point status flags on partial vectors, where
+vector elements reside in the low part of the 128-bit SSE register.  Unless
+@option{-fno-trapping-math} is specified, the compiler guarantees correct
+behavior by sanitizing all input operands to have zeroes in the unused
+upper part of the vector register.  Note that by using built-in functions
+or inline assembly with partial vector arguments, NaNs, denormal or invalid
+values can leak into the upper part of the vector, causing possible
+performance issues when @option{-fno-trapping-math} is in effect.  These
+issues can be mitigated by manually sanitizing the upper part of the partial
+vector argument register or by using @option{-mdaz-ftz} to set
+denormals-are-zero (DAZ) flag in the MXCSR register.
+
+This option is enabled by default.
+
 @opindex mmove-max
 @item -mmove-max=@var{bits}
 This option instructs GCC to set the maximum number of bits can be
diff --git a/gcc/testsuite/gcc.target/i386/pr110832-1.c b/gcc/testsuite/gcc.target/i386/pr110832-1.c
new file mode 100644
index 00000000000..3df22e3b5a7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110832-1.c
@@ -0,0 +1,12 @@
+/* PR target/110832 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse2 -mno-partial-vector-math" } */
+
+typedef float __attribute__((vector_size(8))) v2sf;
+
+v2sf test (v2sf a, v2sf b)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-assembler-not "addps" } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr110832-2.c b/gcc/testsuite/gcc.target/i386/pr110832-2.c
new file mode 100644
index 00000000000..4d16488b4fb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110832-2.c
@@ -0,0 +1,13 @@
+/* PR target/110832 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -ftrapping-math -msse2 -mpartial-vector-math -dp" } */
+
+typedef float __attribute__((vector_size(8))) v2sf;
+
+v2sf test (v2sf a, v2sf b)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-assembler "addps" } } */
+/* { dg-final { scan-assembler-times "\\*vec_concatv4sf_0" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr110832-3.c b/gcc/testsuite/gcc.target/i386/pr110832-3.c
new file mode 100644
index 00000000000..02cb4fc8100
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr110832-3.c
@@ -0,0 +1,13 @@
+/* PR target/110832 */
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -fno-trapping-math -msse2 -mpartial-vector-math -dp" } */
+
+typedef float __attribute__((vector_size(8))) v2sf;
+
+v2sf test (v2sf a, v2sf b)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-assembler "addps" } } */
+/* { dg-final { scan-assembler-not "\\*vec_concatv4sf_0" } } */
On Tue, 8 Aug 2023, Uros Bizjak wrote:

> On Tue, Aug 8, 2023 at 10:07 AM Richard Biener <rguenther@suse.de> wrote:
> >
> > On Mon, 7 Aug 2023, Uros Bizjak wrote:
> >
> > > On Mon, Jul 31, 2023 at 11:40 AM Richard Biener <rguenther@suse.de> wrote:
> > > >
> > > > On Sun, 30 Jul 2023, Uros Bizjak wrote:
> > > >
> > > > > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > > > > named patterns in order to avoid generation of partial vector V4SFmode
> > > > > trapping instructions.
> > > > >
> > > > > The new option is enabled by default, because even with sanitization,
> > > > > a small but consistent speed up of 2 to 3% with the Polyhedron capacita
> > > > > benchmark can be achieved vs. scalar code.
> > > > >
> > > > > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > > > > vs. scalar code.  This is what clang does by default, as it defaults
> > > > > to -fno-trapping-math.
> > > >
> > > > I like the new option; note you lack invoke.texi documentation, where
> > > > I'd also elaborate a bit on the interaction with -fno-trapping-math
> > > > and the possible performance impact when NaNs or denormals leak
> > > > into the upper halves, and cross-reference -mdaz-ftz.
> > >
> > > The attached doc patch is the invoke.texi entry for the -mmmxfp-with-sse
> > > option.  It is written in a way to also cover half-float vectors.  WDYT?
> >
> > "generate trapping floating-point operations"
> >
> > I'd say "generate floating-point operations that might affect the
> > set of floating point status flags"; the word "trapping" is IMHO
> > misleading.  Not sure if "set of floating point status flags" is the
> > correct term, but it's what the C standard seems to refer to when
> > talking about things you get with fegetexceptflag.  feraiseexcept
> > refers to "floating-point exceptions".  Unfortunately the
> > -fno-trapping-math documentation is similarly confusing (and maybe
> > even wrong; I read it to conform to 'non-stop' IEEE arithmetic).
>
> Thanks for suggesting the right terminology.  I think that:
>
> +@opindex mpartial-vector-math
> +@item -mpartial-vector-math
> +This option enables GCC to generate floating-point operations that might
> +affect the set of floating point status flags on partial vectors, where
> +vector elements reside in the low part of the 128-bit SSE register.  Unless
> +@option{-fno-trapping-math} is specified, the compiler guarantees correct
> +behavior by sanitizing all input operands to have zeroes in the unused
> +upper part of the vector register.  Note that by using built-in functions
> +or inline assembly with partial vector arguments, NaNs, denormal or invalid
> +values can leak into the upper part of the vector, causing possible
> +performance issues when @option{-fno-trapping-math} is in effect.  These
> +issues can be mitigated by manually sanitizing the upper part of the partial
> +vector argument register or by using @option{-mdaz-ftz} to set
> +denormals-are-zero (DAZ) flag in the MXCSR register.
>
> now explains in adequate detail what the option does.  IMO,
> "floating-point operations that might affect the set of floating point
> status flags" correctly identifies the affected operations, so an
> example, as suggested below, is not necessary.
>
> > I'd maybe give an example of an FP operation that's _not_ affected
> > by the flag (copysign?).
>
> Please note that I have renamed the option to "-mpartial-vector-math",
> with a short target-specific description:

Ah yes, that's a less confusing name, but then it might suggest that
-mno-partial-vector-math would disable all of that, including integer
ops, not only the patterns possibly affecting the exception flags?
Note I don't have a better suggestion, and this is clearly better than
the one mentioning MMX.

> +partial-vector-math
> +Target Var(ix86_partial_vec_math) Init(1)
> +Enable floating-point status flags setting SSE vector operations on
> partial vectors
>
> which I think summarises the option (without the word "trapping").  The
> same approach will be taken for Float16 operations, so the approach is
> not specific to MMX vectors.
>
> > Otherwise it looks OK to me.
>
> Thanks, I have attached the RFC V2 patch; I plan to submit a formal
> patch later today.

Thanks.  With AVX512VL there might also be the option to use a mask
(with the penalty of a very much larger instruction encoding).

Richard.
On Tue, Aug 8, 2023 at 12:08 PM Richard Biener <rguenther@suse.de> wrote:
>
> > > > > > Also introduce -m[no-]mmxfp-with-sse option to disable trapping V2SF
> > > > > > named patterns in order to avoid generation of partial vector V4SFmode
> > > > > > trapping instructions.
> > > > > >
> > > > > > The new option is enabled by default, because even with sanitization,
> > > > > > a small but consistent speed up of 2 to 3% with the Polyhedron capacita
> > > > > > benchmark can be achieved vs. scalar code.
> > > > > >
> > > > > > Using -fno-trapping-math improves Polyhedron capacita runtime 8 to 9%
> > > > > > vs. scalar code.  This is what clang does by default, as it defaults
> > > > > > to -fno-trapping-math.
> > > > >
> > > > > I like the new option; note you lack invoke.texi documentation, where
> > > > > I'd also elaborate a bit on the interaction with -fno-trapping-math
> > > > > and the possible performance impact when NaNs or denormals leak
> > > > > into the upper halves, and cross-reference -mdaz-ftz.
> > > >
> > > > The attached doc patch is the invoke.texi entry for the -mmmxfp-with-sse
> > > > option.  It is written in a way to also cover half-float vectors.  WDYT?
> > >
> > > "generate trapping floating-point operations"
> > >
> > > I'd say "generate floating-point operations that might affect the
> > > set of floating point status flags"; the word "trapping" is IMHO
> > > misleading.  Not sure if "set of floating point status flags" is the
> > > correct term, but it's what the C standard seems to refer to when
> > > talking about things you get with fegetexceptflag.  feraiseexcept
> > > refers to "floating-point exceptions".  Unfortunately the
> > > -fno-trapping-math documentation is similarly confusing (and maybe
> > > even wrong; I read it to conform to 'non-stop' IEEE arithmetic).
> >
> > Thanks for suggesting the right terminology.  I think that:
> >
> > +@opindex mpartial-vector-math
> > +@item -mpartial-vector-math
> > +This option enables GCC to generate floating-point operations that might
> > +affect the set of floating point status flags on partial vectors, where
> > +vector elements reside in the low part of the 128-bit SSE register.  Unless
> > +@option{-fno-trapping-math} is specified, the compiler guarantees correct
> > +behavior by sanitizing all input operands to have zeroes in the unused
> > +upper part of the vector register.  Note that by using built-in functions
> > +or inline assembly with partial vector arguments, NaNs, denormal or invalid
> > +values can leak into the upper part of the vector, causing possible
> > +performance issues when @option{-fno-trapping-math} is in effect.  These
> > +issues can be mitigated by manually sanitizing the upper part of the partial
> > +vector argument register or by using @option{-mdaz-ftz} to set
> > +denormals-are-zero (DAZ) flag in the MXCSR register.
> >
> > now explains in adequate detail what the option does.  IMO,
> > "floating-point operations that might affect the set of floating point
> > status flags" correctly identifies the affected operations, so an
> > example, as suggested below, is not necessary.
> >
> > > I'd maybe give an example of an FP operation that's _not_ affected
> > > by the flag (copysign?).
> >
> > Please note that I have renamed the option to "-mpartial-vector-math"
> > with a short target-specific description:
>
> Ah yes, that's a less confusing name, but then it might suggest
> that -mno-partial-vector-math would disable all of that, including
> integer ops, not only the patterns possibly affecting the exception
> flags?  Note I don't have a better suggestion, and this is clearly
> better than the one mentioning MMX.

You are right; I think I'll rename the option to
-mpartial-vector-fp-math.

Thanks,
Uros.
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index ef342fcee9b..af72b6c48a9 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -50,6 +50,7 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 #define TARGET_16BIT_P(x) TARGET_CODE16_P(x)
 
 #define TARGET_MMX_WITH_SSE	(TARGET_64BIT && TARGET_SSE2)
+#define TARGET_MMXFP_WITH_SSE	(TARGET_MMX_WITH_SSE && ix86_mmxfp_with_sse)
 
 #include "config/vxworks-dummy.h"
 
diff --git a/gcc/config/i386/i386.opt b/gcc/config/i386/i386.opt
index 1cc8563477a..1b65fed5daf 100644
--- a/gcc/config/i386/i386.opt
+++ b/gcc/config/i386/i386.opt
@@ -670,6 +670,10 @@ m3dnowa
 Target Mask(ISA_3DNOW_A) Var(ix86_isa_flags) Save
 Support Athlon 3Dnow! built-in functions.
 
+mmmxfp-with-sse
+Target Var(ix86_mmxfp_with_sse) Init(1)
+Enable MMX floating point vectors in SSE registers
+
 msse
 Target Mask(ISA_SSE) Var(ix86_isa_flags) Save
 Support MMX and SSE built-in functions and code generation.
 
diff --git a/gcc/config/i386/mmx.md b/gcc/config/i386/mmx.md
index 896af76a33f..0555da9022b 100644
--- a/gcc/config/i386/mmx.md
+++ b/gcc/config/i386/mmx.md
@@ -597,7 +597,18 @@ (define_expand "movq_<mode>_to_sse"
 	  (match_operand:V2FI 1 "nonimmediate_operand")
 	  (match_dup 2)))]
   "TARGET_SSE2"
-  "operands[2] = CONST0_RTX (<MODE>mode);")
+{
+  if (<MODE>mode == V2SFmode
+      && !flag_trapping_math)
+    {
+      rtx op1 = force_reg (<MODE>mode, operands[1]);
+      emit_move_insn (operands[0], lowpart_subreg (<mmxdoublevecmode>mode,
+						   op1, <MODE>mode));
+      DONE;
+    }
+
+  operands[2] = CONST0_RTX (<MODE>mode);
+})
 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;
@@ -650,7 +661,7 @@ (define_expand "<insn>v2sf3"
 	(plusminusmult:V2SF
 	  (match_operand:V2SF 1 "nonimmediate_operand")
 	  (match_operand:V2SF 2 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -728,7 +739,7 @@ (define_expand "divv2sf3"
   [(set (match_operand:V2SF 0 "register_operand")
 	(div:V2SF (match_operand:V2SF 1 "register_operand")
 		  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -750,7 +761,7 @@ (define_expand "<code>v2sf3"
 	(smaxmin:V2SF
 	  (match_operand:V2SF 1 "register_operand")
 	  (match_operand:V2SF 2 "register_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -852,7 +863,7 @@ (define_insn "mmx_rcpit2v2sf3"
 (define_expand "sqrtv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(sqrt:V2SF (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -933,7 +944,7 @@ (define_insn_and_split "*mmx_haddv2sf3_low"
 	  (vec_select:SF
 	    (match_dup 1)
 	    (parallel [(match_operand:SI 3 "const_0_to_1_operand")]))))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMXFP_WITH_SSE
    && INTVAL (operands[2]) != INTVAL (operands[3])
    && ix86_pre_reload_split ()"
   "#"
@@ -979,7 +990,7 @@ (define_insn_and_split "*mmx_hsubv2sf3_low"
 	  (vec_select:SF
 	    (match_dup 1)
 	    (parallel [(const_int 1)]))))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE
+  "TARGET_SSE3 && TARGET_MMXFP_WITH_SSE
    && ix86_pre_reload_split ()"
   "#"
   "&& 1"
@@ -1041,7 +1052,7 @@ (define_expand "vec_addsubv2sf3"
 	    (match_operand:V2SF 2 "nonimmediate_operand"))
 	  (plus:V2SF (match_dup 1) (match_dup 2))
 	  (const_int 1)))]
-  "TARGET_SSE3 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE3 && TARGET_MMXFP_WITH_SSE"
 {
   rtx op2 = gen_reg_rtx (V4SFmode);
   rtx op1 = gen_reg_rtx (V4SFmode);
@@ -1104,7 +1115,7 @@ (define_expand "vec_cmpv2sfv2si"
 	(match_operator:V2SI 1 ""
 	  [(match_operand:V2SF 2 "nonimmediate_operand")
 	   (match_operand:V2SF 3 "nonimmediate_operand")]))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx ops[4];
   ops[3] = gen_reg_rtx (V4SFmode);
@@ -1130,7 +1141,7 @@ (define_expand "vcond<mode>v2sf"
 	   (match_operand:V2SF 5 "nonimmediate_operand")])
 	  (match_operand:V2FI 1 "general_operand")
 	  (match_operand:V2FI 2 "general_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx ops[6];
   ops[5] = gen_reg_rtx (V4SFmode);
@@ -1320,7 +1331,7 @@ (define_expand "fmav2sf4"
 	  (match_operand:V2SF 2 "nonimmediate_operand")
 	  (match_operand:V2SF 3 "nonimmediate_operand")))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1345,7 +1356,7 @@ (define_expand "fmsv2sf4"
 	  (neg:V2SF
 	    (match_operand:V2SF 3 "nonimmediate_operand"))))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1370,7 +1381,7 @@ (define_expand "fnmav2sf4"
 	  (match_operand:V2SF 2 "nonimmediate_operand")
 	  (match_operand:V2SF 3 "nonimmediate_operand")))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1396,7 +1407,7 @@ (define_expand "fnmsv2sf4"
 	  (neg:V2SF
 	    (match_operand:V2SF 3 "nonimmediate_operand"))))]
   "(TARGET_FMA || TARGET_FMA4 || TARGET_AVX512VL)
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op3 = gen_reg_rtx (V4SFmode);
   rtx op2 = gen_reg_rtx (V4SFmode);
@@ -1422,7 +1433,7 @@ (define_expand "fnmsv2sf4"
 (define_expand "fix_truncv2sfv2si2"
   [(set (match_operand:V2SI 0 "register_operand")
 	(fix:V2SI (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1438,7 +1449,7 @@ (define_expand "fix_truncv2sfv2si2"
 (define_expand "fixuns_truncv2sfv2si2"
   [(set (match_operand:V2SI 0 "register_operand")
 	(unsigned_fix:V2SI (match_operand:V2SF 1 "nonimmediate_operand")))]
-  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
+  "TARGET_AVX512VL && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1463,7 +1474,7 @@ (define_insn "mmx_fix_truncv2sfv2si2"
 (define_expand "floatv2siv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(float:V2SF (match_operand:V2SI 1 "nonimmediate_operand")))]
-  "TARGET_MMX_WITH_SSE"
+  "TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SImode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1479,7 +1490,7 @@ (define_expand "floatv2siv2sf2"
 (define_expand "floatunsv2siv2sf2"
   [(set (match_operand:V2SF 0 "register_operand")
 	(unsigned_float:V2SF (match_operand:V2SI 1 "nonimmediate_operand")))]
-  "TARGET_AVX512VL && TARGET_MMX_WITH_SSE"
+  "TARGET_AVX512VL && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SImode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1756,7 +1767,7 @@ (define_expand "vec_initv2sfsf"
 (define_expand "nearbyintv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1772,7 +1783,7 @@ (define_expand "nearbyintv2sf2"
 (define_expand "rintv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1788,8 +1799,8 @@ (define_expand "rintv2sf2"
 (define_expand "lrintv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1806,7 +1817,7 @@ (define_expand "ceilv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1822,8 +1833,8 @@ (define_expand "ceilv2sf2"
 (define_expand "lceilv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1840,7 +1851,7 @@ (define_expand "floorv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1856,8 +1867,8 @@ (define_expand "floorv2sf2"
 (define_expand "lfloorv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);
@@ -1874,7 +1885,7 @@ (define_expand "btruncv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1891,7 +1902,7 @@ (define_expand "roundv2sf2"
   [(match_operand:V2SF 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
   "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SFmode);
@@ -1907,8 +1918,8 @@ (define_expand "roundv2sf2"
 (define_expand "lroundv2sfv2si2"
   [(match_operand:V2SI 0 "register_operand")
    (match_operand:V2SF 1 "nonimmediate_operand")]
-  "TARGET_SSE4_1 && !flag_trapping_math
-   && TARGET_MMX_WITH_SSE"
+  "TARGET_SSE4_1 && !flag_trapping_math
+   && TARGET_MMXFP_WITH_SSE"
 {
   rtx op1 = gen_reg_rtx (V4SFmode);
   rtx op0 = gen_reg_rtx (V4SImode);