Message ID | CAFULd4ZN1jGnbRdjhKTjNVDqEAMYdU3nEKxuAfB=AhyWRuPkOg@mail.gmail.com |
---|---|
State | New |
Headers | show |
Series | [i386] Fix PR 88998, bad codegen with mmx instructions | expand |
On Wed, Jan 23, 2019 at 11:22 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > Attached patch adds SSE alternatives to sse2_cvtpi2pd, sse2_cvtpd2pi > and sse2_cvttpd2pi to avoid MMX registers when e.g. _mm_cvtepi32_pd > intrinsics is used. Without the patch, the testcase compiles to (-O2 > -mavx): > > _Z7prepareii: > vmovd %edi, %xmm1 > vpinsrd $1, %esi, %xmm1, %xmm0 > movdq2q %xmm0, %mm0 > cvtpi2pd %mm0, %xmm0 > vhaddpd %xmm0, %xmm0, %xmm0 > ret > > while patched gcc generates: > > vmovd %edi, %xmm1 > vpinsrd $1, %esi, %xmm1, %xmm0 > vcvtdq2pd %xmm0, %xmm0 > vhaddpd %xmm0, %xmm0, %xmm0 > ret > > The later avoids transition of FPU to MMX mode. > Is that possible to support 64-bit vectors, like V2SI, with SSE instead of MMX for x86-64 under a command-line switch?
On Wed, Jan 23, 2019 at 8:52 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Wed, Jan 23, 2019 at 11:22 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > Attached patch adds SSE alternatives to sse2_cvtpi2pd, sse2_cvtpd2pi > > and sse2_cvttpd2pi to avoid MMX registers when e.g. _mm_cvtepi32_pd > > intrinsics is used. Without the patch, the testcase compiles to (-O2 > > -mavx): > > > > _Z7prepareii: > > vmovd %edi, %xmm1 > > vpinsrd $1, %esi, %xmm1, %xmm0 > > movdq2q %xmm0, %mm0 > > cvtpi2pd %mm0, %xmm0 > > vhaddpd %xmm0, %xmm0, %xmm0 > > ret > > > > while patched gcc generates: > > > > vmovd %edi, %xmm1 > > vpinsrd $1, %esi, %xmm1, %xmm0 > > vcvtdq2pd %xmm0, %xmm0 > > vhaddpd %xmm0, %xmm0, %xmm0 > > ret > > > > The later avoids transition of FPU to MMX mode. > > > > Is that possible to support 64-bit vectors, like V2SI, with SSE > instead of MMX for x86-64 under a command-line switch? SSE registers are preferred for 64bit vectors (see number of exclamation marks in *mov<mode>_internal in mmx.md), so the value will be passed in SSE regs unless there is pure MMX instruction, where due to missing SSE alternatives, RA will need to allocate MMX register. Uros.
On Wed, Jan 23, 2019 at 12:27 PM Uros Bizjak <ubizjak@gmail.com> wrote: > > On Wed, Jan 23, 2019 at 8:52 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > On Wed, Jan 23, 2019 at 11:22 AM Uros Bizjak <ubizjak@gmail.com> wrote: > > > > > > Attached patch adds SSE alternatives to sse2_cvtpi2pd, sse2_cvtpd2pi > > > and sse2_cvttpd2pi to avoid MMX registers when e.g. _mm_cvtepi32_pd > > > intrinsics is used. Without the patch, the testcase compiles to (-O2 > > > -mavx): > > > > > > _Z7prepareii: > > > vmovd %edi, %xmm1 > > > vpinsrd $1, %esi, %xmm1, %xmm0 > > > movdq2q %xmm0, %mm0 > > > cvtpi2pd %mm0, %xmm0 > > > vhaddpd %xmm0, %xmm0, %xmm0 > > > ret > > > > > > while patched gcc generates: > > > > > > vmovd %edi, %xmm1 > > > vpinsrd $1, %esi, %xmm1, %xmm0 > > > vcvtdq2pd %xmm0, %xmm0 > > > vhaddpd %xmm0, %xmm0, %xmm0 > > > ret > > > > > > The later avoids transition of FPU to MMX mode. > > > > > > > Is that possible to support 64-bit vectors, like V2SI, with SSE > > instead of MMX for x86-64 under a command-line switch? > > SSE registers are preferred for 64bit vectors (see number of > exclamation marks in *mov<mode>_internal in mmx.md), so the value will > be passed in SSE regs unless there is pure MMX instruction, where due > to missing SSE alternatives, RA will need to allocate MMX register. > [hjl@gnu-cfl-1 v64-1]$ cat x.i typedef float __v2sf __attribute__ ((__vector_size__ (8))); extern __v2sf z; void add (__v2sf x, __v2sf y) { z = x + y; } [hjl@gnu-cfl-1 v64-1]$ make x.s /export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/tools-build/gcc-debug/build-x86_64-linux/gcc/ -O2 -S x.i [hjl@gnu-cfl-1 v64-1]$ cat x.s .file "x.i" .text .p2align 4 .globl add .type add, @function add: .LFB0: .cfi_startproc movlps %xmm0, -32(%rsp) movss -32(%rsp), %xmm0 movlps %xmm1, -40(%rsp) addss -40(%rsp), %xmm0 movss %xmm0, -56(%rsp) movss -28(%rsp), %xmm0 addss -36(%rsp), %xmm0 movss %xmm0, -52(%rsp) movq -56(%rsp), %rax movq %rax, z(%rip) ret .cfi_endproc .LFE0: .size add, .-add .ident "GCC: (GNU) 9.0.0 20190122 (experimental)" .section .note.GNU-stack,"",@progbits [hjl@gnu-cfl-1 v64-1]$ I am expecting: addps %xmm1, %xmm0 movlps %xmm0, z(%rip) retq
Index: config/i386/sse.md =================================================================== --- config/i386/sse.md (revision 268188) +++ config/i386/sse.md (working copy) @@ -4997,37 +4997,49 @@ ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; (define_insn "sse2_cvtpi2pd" - [(set (match_operand:V2DF 0 "register_operand" "=x,x") - (float:V2DF (match_operand:V2SI 1 "nonimmediate_operand" "y,m")))] + [(set (match_operand:V2DF 0 "register_operand" "=v,x") + (float:V2DF (match_operand:V2SI 1 "nonimmediate_operand" "vBm,?!y")))] "TARGET_SSE2" - "cvtpi2pd\t{%1, %0|%0, %1}" + "@ + %vcvtdq2pd\t{%1, %0|%0, %1} + cvtpi2pd\t{%1, %0|%0, %1}" [(set_attr "type" "ssecvt") - (set_attr "unit" "mmx,*") - (set_attr "prefix_data16" "1,*") + (set_attr "unit" "*,mmx") + (set_attr "prefix_data16" "*,1") + (set_attr "prefix" "maybe_vex,*") (set_attr "mode" "V2DF")]) (define_insn "sse2_cvtpd2pi" - [(set (match_operand:V2SI 0 "register_operand" "=y") - (unspec:V2SI [(match_operand:V2DF 1 "nonimmediate_operand" "xm")] + [(set (match_operand:V2SI 0 "register_operand" "=v,?!y") + (unspec:V2SI [(match_operand:V2DF 1 "nonimmediate_operand" "vBm,xm")] UNSPEC_FIX_NOTRUNC))] "TARGET_SSE2" - "cvtpd2pi\t{%1, %0|%0, %1}" + "@ + * return TARGET_AVX ? \"vcvtpd2dq{x}\t{%1, %0|%0, %1}\" : \"cvtpd2dq\t{%1, %0|%0, %1}\"; + cvtpd2pi\t{%1, %0|%0, %1}" [(set_attr "type" "ssecvt") - (set_attr "unit" "mmx") + (set_attr "unit" "*,mmx") + (set_attr "amdfam10_decode" "double") + (set_attr "athlon_decode" "vector") (set_attr "bdver1_decode" "double") - (set_attr "btver2_decode" "direct") - (set_attr "prefix_data16" "1") - (set_attr "mode" "DI")]) + (set_attr "prefix_data16" "*,1") + (set_attr "prefix" "maybe_vex,*") + (set_attr "mode" "TI")]) (define_insn "sse2_cvttpd2pi" - [(set (match_operand:V2SI 0 "register_operand" "=y") - (fix:V2SI (match_operand:V2DF 1 "nonimmediate_operand" "xm")))] + [(set (match_operand:V2SI 0 "register_operand" "=v,?!y") + (fix:V2SI (match_operand:V2DF 1 "nonimmediate_operand" "vBm,xm")))] "TARGET_SSE2" - "cvttpd2pi\t{%1, %0|%0, %1}" + "@ + * return TARGET_AVX ? \"vcvttpd2dq{x}\t{%1, %0|%0, %1}\" : \"cvttpd2dq\t{%1, %0|%0, %1}\"; + cvttpd2pi\t{%1, %0|%0, %1}" [(set_attr "type" "ssecvt") - (set_attr "unit" "mmx") + (set_attr "unit" "*,mmx") + (set_attr "amdfam10_decode" "double") + (set_attr "athlon_decode" "vector") (set_attr "bdver1_decode" "double") - (set_attr "prefix_data16" "1") + (set_attr "prefix_data16" "*,1") + (set_attr "prefix" "maybe_vex,*") (set_attr "mode" "TI")]) (define_insn "sse2_cvtsi2sd" Index: testsuite/g++.target/i386/pr88998.C =================================================================== --- testsuite/g++.target/i386/pr88998.C (nonexistent) +++ testsuite/g++.target/i386/pr88998.C (working copy) @@ -0,0 +1,31 @@ +// PR target/88998 +// { dg-do run { target sse2_runtime } } +// { dg-options "-O2 -msse2 -mfpmath=387" } +// { dg-require-effective-target c++11 } + +#include <cassert> +#include <unordered_map> +#include <x86intrin.h> + +double +__attribute__((noinline)) +prepare (int a, int b) +{ + __m128i is = _mm_setr_epi32 (a, b, 0, 0); + __m128d ds = _mm_cvtepi32_pd (is); + return ds[0] + ds[1]; +} + +int +main (int, char **) +{ + double d = prepare (1, 2); + + std::unordered_map < int, int >m; + m.insert ({0, 0}); + m.insert ({1, 1}); + assert (m.load_factor () <= m.max_load_factor ()); + + assert (d == 3); + return 0; +}