Message ID | 20220718023104.48190-1-hongtao.liu@intel.com |
---|---|
State | New |
Series | [V2,RFC] Support vectorization for Complex type. |
On Mon, Jul 18, 2022 at 4:31 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> V2 update:
> Handle VMAT_ELEMENTWISE, VMAT_CONTIGUOUS_PERMUTE, VMAT_STRIDED_SLP,
> VMAT_CONTIGUOUS_REVERSE, VMAT_CONTIGUOUS_DOWN for complex type.
>
> I've run SPECspeed@2017 627.cam4_s, there are some vectorization cases,
> but no big performance impact (since this patch only handles load/store).
>
> Any comments?

My original comments still stand (it feels like this should be more generic).
Can we go the way of lowering complex loads/stores first? A large part
of the testcases added by the patch should pass after that.

Thanks,
Richard.

> gcc/ChangeLog:
>
> PR tree-optimization/106010
> * tree-vect-data-refs.cc (vect_get_data_access_cost):
> Pass complex_p to vect_get_num_copies to avoid ICE.
> (vect_analyze_data_refs): Support vectorization for Complex
> type with vector scalar types.
> (vect_permute_load_chain): Handle Complex type.
> * tree-vect-loop.cc (vect_determine_vf_for_stmt_1): VF should
> be half of TYPE_VECTOR_SUBPARTS when complex_p.
> * tree-vect-slp.cc (vect_record_max_nunits): nunits should be
> half of TYPE_VECTOR_SUBPARTS when complex_p.
> (vect_optimize_slp): Support permutation for complex type.
> (vect_slp_analyze_node_operations_1): Double nunits in
> vect_get_num_vectors to get right SLP_TREE_NUMBER_OF_VEC_STMTS
> when complex_p.
> (vect_slp_analyze_node_operations): Ditto.
> (vect_create_constant_vectors): Support CTOR for complex type.
> (vect_transform_slp_perm_load): Support permutation for
> complex type.
> * tree-vect-stmts.cc (vect_init_vector): Support complex type.
> (vect_get_vec_defs_for_operand): Get vector type for
> complex type.
> (vectorizable_store): Get right ncopies/nunits and
> elem_type for complex type vector, also return false when
> complex_p and !TYPE_VECTOR_SUBPARTS.is_constant ().
> (vect_truncate_gather_scatter_offset): Return false for
> complex type.
> (vectorizable_load): Ditto.
> (vect_get_vector_types_for_stmt): Get vector type for
> complex type.
> (get_group_load_store_type): Handle complex type for
> nunits.
> (perm_mask_for_reverse): New overload.
> (get_negative_load_store_type): Handle complex type,
> p_offset should be N - 2 before address of DR.
> (vect_check_scalar_mask): Return false for complex type.
> * tree-vectorizer.h (STMT_VINFO_COMPLEX_P): New macro.
> (vect_get_num_copies): New overload.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/i386/pr106010-1a.c: New test.
> * gcc.target/i386/pr106010-1b.c: New test.
> * gcc.target/i386/pr106010-1c.c: New test.
> * gcc.target/i386/pr106010-2a.c: New test.
> * gcc.target/i386/pr106010-2b.c: New test.
> * gcc.target/i386/pr106010-2c.c: New test.
> * gcc.target/i386/pr106010-3a.c: New test.
> * gcc.target/i386/pr106010-3b.c: New test.
> * gcc.target/i386/pr106010-3c.c: New test.
> * gcc.target/i386/pr106010-4a.c: New test.
> * gcc.target/i386/pr106010-4b.c: New test.
> * gcc.target/i386/pr106010-4c.c: New test.
> * gcc.target/i386/pr106010-5a.c: New test.
> * gcc.target/i386/pr106010-5b.c: New test.
> * gcc.target/i386/pr106010-5c.c: New test.
> * gcc.target/i386/pr106010-6a.c: New test.
> * gcc.target/i386/pr106010-6b.c: New test.
> * gcc.target/i386/pr106010-6c.c: New test.
> * gcc.target/i386/pr106010-7a.c: New test.
> * gcc.target/i386/pr106010-7b.c: New test.
> * gcc.target/i386/pr106010-7c.c: New test.
> * gcc.target/i386/pr106010-8a.c: New test.
> * gcc.target/i386/pr106010-8b.c: New test.
> * gcc.target/i386/pr106010-8c.c: New test.
> * gcc.target/i386/pr106010-9a.c: New test.
> * gcc.target/i386/pr106010-9b.c: New test.
> * gcc.target/i386/pr106010-9c.c: New test.
> * gcc.target/i386/pr106010-9d.c: New test.
> --- > gcc/testsuite/gcc.target/i386/pr106010-1a.c | 58 +++++ > gcc/testsuite/gcc.target/i386/pr106010-1b.c | 63 ++++++ > gcc/testsuite/gcc.target/i386/pr106010-1c.c | 41 ++++ > gcc/testsuite/gcc.target/i386/pr106010-2a.c | 82 +++++++ > gcc/testsuite/gcc.target/i386/pr106010-2b.c | 62 ++++++ > gcc/testsuite/gcc.target/i386/pr106010-2c.c | 47 ++++ > gcc/testsuite/gcc.target/i386/pr106010-3a.c | 80 +++++++ > gcc/testsuite/gcc.target/i386/pr106010-3b.c | 126 +++++++++++ > gcc/testsuite/gcc.target/i386/pr106010-3c.c | 69 ++++++ > gcc/testsuite/gcc.target/i386/pr106010-4a.c | 101 +++++++++ > gcc/testsuite/gcc.target/i386/pr106010-4b.c | 67 ++++++ > gcc/testsuite/gcc.target/i386/pr106010-4c.c | 54 +++++ > gcc/testsuite/gcc.target/i386/pr106010-5a.c | 117 ++++++++++ > gcc/testsuite/gcc.target/i386/pr106010-5b.c | 80 +++++++ > gcc/testsuite/gcc.target/i386/pr106010-5c.c | 62 ++++++ > gcc/testsuite/gcc.target/i386/pr106010-6a.c | 115 ++++++++++ > gcc/testsuite/gcc.target/i386/pr106010-6b.c | 157 +++++++++++++ > gcc/testsuite/gcc.target/i386/pr106010-6c.c | 80 +++++++ > gcc/testsuite/gcc.target/i386/pr106010-7a.c | 58 +++++ > gcc/testsuite/gcc.target/i386/pr106010-7b.c | 63 ++++++ > gcc/testsuite/gcc.target/i386/pr106010-7c.c | 41 ++++ > gcc/testsuite/gcc.target/i386/pr106010-8a.c | 58 +++++ > gcc/testsuite/gcc.target/i386/pr106010-8b.c | 53 +++++ > gcc/testsuite/gcc.target/i386/pr106010-8c.c | 38 ++++ > gcc/testsuite/gcc.target/i386/pr106010-9a.c | 89 ++++++++ > gcc/testsuite/gcc.target/i386/pr106010-9b.c | 90 ++++++++ > gcc/testsuite/gcc.target/i386/pr106010-9c.c | 90 ++++++++ > gcc/testsuite/gcc.target/i386/pr106010-9d.c | 92 ++++++++ > gcc/tree-vect-data-refs.cc | 134 +++++++++--- > gcc/tree-vect-loop.cc | 7 +- > gcc/tree-vect-slp.cc | 174 +++++++++++---- > gcc/tree-vect-stmts.cc | 231 +++++++++++++++++--- > gcc/tree-vectorizer.h | 13 ++ > 33 files changed, 2594 insertions(+), 98 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-1a.c > create mode 
100644 gcc/testsuite/gcc.target/i386/pr106010-1b.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-1c.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-2a.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-2b.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-2c.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-3a.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-3b.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-3c.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-4a.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-4b.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-4c.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-5a.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-5b.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-5c.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-6a.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-6b.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-6c.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-7a.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-7b.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-7c.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-8a.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-8b.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-8c.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-9a.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-9b.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-9c.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr106010-9d.c > > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-1a.c b/gcc/testsuite/gcc.target/i386/pr106010-1a.c > new file mode 100644 > index 00000000000..b608f484934 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-1a.c > @@ -0,0 +1,58 @@ > +/* { 
dg-do compile } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-vect-details -mprefer-vector-width=256" } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 6 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 2 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 2 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 2 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) int>} 2 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 2 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 2 "vect" } } */ > + > +#define N 10000 > +void > +__attribute__((noipa)) > +foo_pd (_Complex double* a, _Complex double* b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b[i]; > +} > + > +void > +__attribute__((noipa)) > +foo_ps (_Complex float* a, _Complex float* b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b[i]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi64 (_Complex long long* a, _Complex long long* b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b[i]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi32 (_Complex int* a, _Complex int* b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b[i]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi16 (_Complex short* a, _Complex short* b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b[i]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi8 (_Complex char* a, _Complex char* b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b[i]; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-1b.c b/gcc/testsuite/gcc.target/i386/pr106010-1b.c > new file mode 100644 > index 00000000000..0f377c3a548 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-1b.c > @@ 
-0,0 +1,63 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ > +/* { dg-require-effective-target avx } */ > + > +#include "avx-check.h" > +#include <string.h> > +#include "pr106010-1a.c" > + > +void > +avx_test (void) > +{ > + _Complex double* pd_src = (_Complex double*) malloc (2 * N * sizeof (double)); > + _Complex double* pd_dst = (_Complex double*) malloc (2 * N * sizeof (double)); > + _Complex float* ps_src = (_Complex float*) malloc (2 * N * sizeof (float)); > + _Complex float* ps_dst = (_Complex float*) malloc (2 * N * sizeof (float)); > + _Complex long long* epi64_src = (_Complex long long*) malloc (2 * N * sizeof (long long)); > + _Complex long long* epi64_dst = (_Complex long long*) malloc (2 * N * sizeof (long long)); > + _Complex int* epi32_src = (_Complex int*) malloc (2 * N * sizeof (int)); > + _Complex int* epi32_dst = (_Complex int*) malloc (2 * N * sizeof (int)); > + _Complex short* epi16_src = (_Complex short*) malloc (2 * N * sizeof (short)); > + _Complex short* epi16_dst = (_Complex short*) malloc (2 * N * sizeof (short)); > + _Complex char* epi8_src = (_Complex char*) malloc (2 * N * sizeof (char)); > + _Complex char* epi8_dst = (_Complex char*) malloc (2 * N * sizeof (char)); > + char* p_init = (char*) malloc (2 * N * sizeof (double)); > + > + __builtin_memset (pd_dst, 0, 2 * N * sizeof (double)); > + __builtin_memset (ps_dst, 0, 2 * N * sizeof (float)); > + __builtin_memset (epi64_dst, 0, 2 * N * sizeof (long long)); > + __builtin_memset (epi32_dst, 0, 2 * N * sizeof (int)); > + __builtin_memset (epi16_dst, 0, 2 * N * sizeof (short)); > + __builtin_memset (epi8_dst, 0, 2 * N * sizeof (char)); > + > + for (int i = 0; i != 2 * N * sizeof (double); i++) > + p_init[i] = i; > + > + memcpy (pd_src, p_init, 2 * N * sizeof (double)); > + memcpy (ps_src, p_init, 2 * N * sizeof (float)); > + memcpy (epi64_src, p_init, 2 * N * sizeof (long long)); > + memcpy (epi32_src, 
p_init, 2 * N * sizeof (int)); > + memcpy (epi16_src, p_init, 2 * N * sizeof (short)); > + memcpy (epi8_src, p_init, 2 * N * sizeof (char)); > + > + foo_pd (pd_dst, pd_src); > + foo_ps (ps_dst, ps_src); > + foo_epi64 (epi64_dst, epi64_src); > + foo_epi32 (epi32_dst, epi32_src); > + foo_epi16 (epi16_dst, epi16_src); > + foo_epi8 (epi8_dst, epi8_src); > + if (__builtin_memcmp (pd_dst, pd_src, N * 2 * sizeof (double)) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (ps_dst, ps_src, N * 2 * sizeof (float)) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi64_dst, epi64_src, N * 2 * sizeof (long long)) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi32_dst, epi32_src, N * 2 * sizeof (int)) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi16_dst, epi16_src, N * 2 * sizeof (short)) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi8_dst, epi8_src, N * 2 * sizeof (char)) != 0) > + __builtin_abort (); > + > + return; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-1c.c b/gcc/testsuite/gcc.target/i386/pr106010-1c.c > new file mode 100644 > index 00000000000..f07e9fb2d3d > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-1c.c > @@ -0,0 +1,41 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-vect-details" } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 2 "vect" } } */ > +/* { dg-require-effective-target avx512fp16 } */ > + > +#include <string.h> > + > +static void do_test (void); > + > +#define DO_TEST do_test > +#define AVX512FP16 > +#include "avx512-check.h" > + > +#define N 10000 > + > +void > +__attribute__((noipa)) > +foo_ph (_Complex _Float16* a, _Complex _Float16* b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b[i]; > +} > + > +static void > +do_test (void) > +{ > + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (2 * N * sizeof 
(_Float16)); > + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (2 * N * sizeof (_Float16)); > + char* p_init = (char*) malloc (2 * N * sizeof (_Float16)); > + > + __builtin_memset (ph_dst, 0, 2 * N * sizeof (_Float16)); > + > + for (int i = 0; i != 2 * N * sizeof (_Float16); i++) > + p_init[i] = i; > + > + memcpy (ph_src, p_init, 2 * N * sizeof (_Float16)); > + > + foo_ph (ph_dst, ph_src); > + if (__builtin_memcmp (ph_dst, ph_src, N * 2 * sizeof (_Float16)) != 0) > + __builtin_abort (); > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-2a.c b/gcc/testsuite/gcc.target/i386/pr106010-2a.c > new file mode 100644 > index 00000000000..d2e2f8d4f43 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-2a.c > @@ -0,0 +1,82 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-slp-details -mprefer-vector-width=256" } */ > +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 6 "slp2" } }*/ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 2 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 2 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 2 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) int>} 2 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 2 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 2 "slp2" } } */ > + > +void > +__attribute__((noipa)) > +foo_pd (_Complex double* a, _Complex double* __restrict b) > +{ > + a[0] = b[0]; > + a[1] = b[1]; > +} > + > +void > +__attribute__((noipa)) > +foo_ps (_Complex float* a, _Complex float* __restrict b) > +{ > + a[0] = b[0]; > + a[1] = b[1]; > + a[2] = b[2]; > + a[3] = b[3]; > + > +} > + > +void > 
+__attribute__((noipa)) > +foo_epi64 (_Complex long long* a, _Complex long long* __restrict b) > +{ > + a[0] = b[0]; > + a[1] = b[1]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi32 (_Complex int* a, _Complex int* __restrict b) > +{ > + a[0] = b[0]; > + a[1] = b[1]; > + a[2] = b[2]; > + a[3] = b[3]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi16 (_Complex short* a, _Complex short* __restrict b) > +{ > + a[0] = b[0]; > + a[1] = b[1]; > + a[2] = b[2]; > + a[3] = b[3]; > + a[4] = b[4]; > + a[5] = b[5]; > + a[6] = b[6]; > + a[7] = b[7]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi8 (_Complex char* a, _Complex char* __restrict b) > +{ > + a[0] = b[0]; > + a[1] = b[1]; > + a[2] = b[2]; > + a[3] = b[3]; > + a[4] = b[4]; > + a[5] = b[5]; > + a[6] = b[6]; > + a[7] = b[7]; > + a[8] = b[8]; > + a[9] = b[9]; > + a[10] = b[10]; > + a[11] = b[11]; > + a[12] = b[12]; > + a[13] = b[13]; > + a[14] = b[14]; > + a[15] = b[15]; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-2b.c b/gcc/testsuite/gcc.target/i386/pr106010-2b.c > new file mode 100644 > index 00000000000..ac360752693 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-2b.c > @@ -0,0 +1,62 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ > +/* { dg-require-effective-target avx } */ > + > +#include "avx-check.h" > +#include <string.h> > +#include "pr106010-2a.c" > + > +void > +avx_test (void) > +{ > + _Complex double* pd_src = (_Complex double*) malloc (32); > + _Complex double* pd_dst = (_Complex double*) malloc (32); > + _Complex float* ps_src = (_Complex float*) malloc (32); > + _Complex float* ps_dst = (_Complex float*) malloc (32); > + _Complex long long* epi64_src = (_Complex long long*) malloc (32); > + _Complex long long* epi64_dst = (_Complex long long*) malloc (32); > + _Complex int* epi32_src = (_Complex int*) malloc (32); > + _Complex int* epi32_dst = (_Complex int*) malloc (32); > 
+ _Complex short* epi16_src = (_Complex short*) malloc (32); > + _Complex short* epi16_dst = (_Complex short*) malloc (32); > + _Complex char* epi8_src = (_Complex char*) malloc (32); > + _Complex char* epi8_dst = (_Complex char*) malloc (32); > + char* p = (char* ) malloc (32); > + > + __builtin_memset (pd_dst, 0, 32); > + __builtin_memset (ps_dst, 0, 32); > + __builtin_memset (epi64_dst, 0, 32); > + __builtin_memset (epi32_dst, 0, 32); > + __builtin_memset (epi16_dst, 0, 32); > + __builtin_memset (epi8_dst, 0, 32); > + > + for (int i = 0; i != 32; i++) > + p[i] = i; > + __builtin_memcpy (pd_src, p, 32); > + __builtin_memcpy (ps_src, p, 32); > + __builtin_memcpy (epi64_src, p, 32); > + __builtin_memcpy (epi32_src, p, 32); > + __builtin_memcpy (epi16_src, p, 32); > + __builtin_memcpy (epi8_src, p, 32); > + > + foo_pd (pd_dst, pd_src); > + foo_ps (ps_dst, ps_src); > + foo_epi64 (epi64_dst, epi64_src); > + foo_epi32 (epi32_dst, epi32_src); > + foo_epi16 (epi16_dst, epi16_src); > + foo_epi8 (epi8_dst, epi8_src); > + if (__builtin_memcmp (pd_dst, pd_src, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (ps_dst, ps_src, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi64_dst, epi64_src, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi32_dst, epi32_src, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi16_dst, epi16_src, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi8_dst, epi8_src, 32) != 0) > + __builtin_abort (); > + > + return; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-2c.c b/gcc/testsuite/gcc.target/i386/pr106010-2c.c > new file mode 100644 > index 00000000000..a002f209ec9 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-2c.c > @@ -0,0 +1,47 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-slp-details" } */ > +/* { dg-require-effective-target avx512fp16 } */ > 
+ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 2 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 1 "slp2" } }*/ > + > +#include <string.h> > + > +static void do_test (void); > +#define DO_TEST do_test > +#define AVX512FP16 > +#include "avx512-check.h" > + > +void > +__attribute__((noipa)) > +foo_ph (_Complex _Float16* a, _Complex _Float16* __restrict b) > +{ > + a[0] = b[0]; > + a[1] = b[1]; > + a[2] = b[2]; > + a[3] = b[3]; > + a[4] = b[4]; > + a[5] = b[5]; > + a[6] = b[6]; > + a[7] = b[7]; > +} > + > +void > +do_test (void) > +{ > + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (32); > + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (32); > + char* p = (char* ) malloc (32); > + > + __builtin_memset (ph_dst, 0, 32); > + > + for (int i = 0; i != 32; i++) > + p[i] = i; > + __builtin_memcpy (ph_src, p, 32); > + > + foo_ph (ph_dst, ph_src); > + if (__builtin_memcmp (ph_dst, ph_src, 32) != 0) > + __builtin_abort (); > + > + return; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-3a.c b/gcc/testsuite/gcc.target/i386/pr106010-3a.c > new file mode 100644 > index 00000000000..c1b64b56b1c > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-3a.c > @@ -0,0 +1,80 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mavx2 -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-slp-details" } */ > +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 6 "slp2" } }*/ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 2, 3, 0, 1 \}} 2 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 6, 7, 4, 5, 2, 3, 0, 1 \}} 1 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 2, 3, 0, 1, 6, 7, 4, 5 \}} 1 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 
14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 \}} 1 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1, 30, 31, 28, 29, 26, 27, 24, 25, 22, 23, 20, 21, 18, 19, 16, 17 \}} 1 "slp2" } } */ > + > +void > +__attribute__((noipa)) > +foo_pd (_Complex double* a, _Complex double* __restrict b) > +{ > + a[0] = b[1]; > + a[1] = b[0]; > +} > + > +void > +__attribute__((noipa)) > +foo_ps (_Complex float* a, _Complex float* __restrict b) > +{ > + a[0] = b[1]; > + a[1] = b[0]; > + a[2] = b[3]; > + a[3] = b[2]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi64 (_Complex long long* a, _Complex long long* __restrict b) > +{ > + a[0] = b[1]; > + a[1] = b[0]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi32 (_Complex int* a, _Complex int* __restrict b) > +{ > + a[0] = b[3]; > + a[1] = b[2]; > + a[2] = b[1]; > + a[3] = b[0]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi16 (_Complex short* a, _Complex short* __restrict b) > +{ > + a[0] = b[7]; > + a[1] = b[6]; > + a[2] = b[5]; > + a[3] = b[4]; > + a[4] = b[3]; > + a[5] = b[2]; > + a[6] = b[1]; > + a[7] = b[0]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi8 (_Complex char* a, _Complex char* __restrict b) > +{ > + a[0] = b[7]; > + a[1] = b[6]; > + a[2] = b[5]; > + a[3] = b[4]; > + a[4] = b[3]; > + a[5] = b[2]; > + a[6] = b[1]; > + a[7] = b[0]; > + a[8] = b[15]; > + a[9] = b[14]; > + a[10] = b[13]; > + a[11] = b[12]; > + a[12] = b[11]; > + a[13] = b[10]; > + a[14] = b[9]; > + a[15] = b[8]; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-3b.c b/gcc/testsuite/gcc.target/i386/pr106010-3b.c > new file mode 100644 > index 00000000000..e4fa3f3a541 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-3b.c > @@ -0,0 +1,126 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx2 -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ > +/* { dg-require-effective-target avx2 } 
*/ > + > +#include "avx2-check.h" > +#include <string.h> > +#include "pr106010-3a.c" > + > +void > +avx2_test (void) > +{ > + _Complex double* pd_src = (_Complex double*) malloc (32); > + _Complex double* pd_dst = (_Complex double*) malloc (32); > + _Complex double* pd_exp = (_Complex double*) malloc (32); > + _Complex float* ps_src = (_Complex float*) malloc (32); > + _Complex float* ps_dst = (_Complex float*) malloc (32); > + _Complex float* ps_exp = (_Complex float*) malloc (32); > + _Complex long long* epi64_src = (_Complex long long*) malloc (32); > + _Complex long long* epi64_dst = (_Complex long long*) malloc (32); > + _Complex long long* epi64_exp = (_Complex long long*) malloc (32); > + _Complex int* epi32_src = (_Complex int*) malloc (32); > + _Complex int* epi32_dst = (_Complex int*) malloc (32); > + _Complex int* epi32_exp = (_Complex int*) malloc (32); > + _Complex short* epi16_src = (_Complex short*) malloc (32); > + _Complex short* epi16_dst = (_Complex short*) malloc (32); > + _Complex short* epi16_exp = (_Complex short*) malloc (32); > + _Complex char* epi8_src = (_Complex char*) malloc (32); > + _Complex char* epi8_dst = (_Complex char*) malloc (32); > + _Complex char* epi8_exp = (_Complex char*) malloc (32); > + char* p = (char* ) malloc (32); > + char* q = (char* ) malloc (32); > + > + __builtin_memset (pd_dst, 0, 32); > + __builtin_memset (ps_dst, 0, 32); > + __builtin_memset (epi64_dst, 0, 32); > + __builtin_memset (epi32_dst, 0, 32); > + __builtin_memset (epi16_dst, 0, 32); > + __builtin_memset (epi8_dst, 0, 32); > + > + for (int i = 0; i != 32; i++) > + p[i] = i; > + __builtin_memcpy (pd_src, p, 32); > + __builtin_memcpy (ps_src, p, 32); > + __builtin_memcpy (epi64_src, p, 32); > + __builtin_memcpy (epi32_src, p, 32); > + __builtin_memcpy (epi16_src, p, 32); > + __builtin_memcpy (epi8_src, p, 32); > + > + for (int i = 0; i != 16; i++) > + { > + p[i] = i + 16; > + p[i + 16] = i; > + } > + __builtin_memcpy (pd_exp, p, 32); > + __builtin_memcpy 
(epi64_exp, p, 32); > + > + for (int i = 0; i != 8; i++) > + { > + p[i] = i + 8; > + p[i + 8] = i; > + p[i + 16] = i + 24; > + p[i + 24] = i + 16; > + q[i] = i + 24; > + q[i + 8] = i + 16; > + q[i + 16] = i + 8; > + q[i + 24] = i; > + } > + __builtin_memcpy (ps_exp, p, 32); > + __builtin_memcpy (epi32_exp, q, 32); > + > + > + for (int i = 0; i != 4; i++) > + { > + q[i] = i + 28; > + q[i + 4] = i + 24; > + q[i + 8] = i + 20; > + q[i + 12] = i + 16; > + q[i + 16] = i + 12; > + q[i + 20] = i + 8; > + q[i + 24] = i + 4; > + q[i + 28] = i; > + } > + __builtin_memcpy (epi16_exp, q, 32); > + > + for (int i = 0; i != 2; i++) > + { > + q[i] = i + 14; > + q[i + 2] = i + 12; > + q[i + 4] = i + 10; > + q[i + 6] = i + 8; > + q[i + 8] = i + 6; > + q[i + 10] = i + 4; > + q[i + 12] = i + 2; > + q[i + 14] = i; > + q[i + 16] = i + 30; > + q[i + 18] = i + 28; > + q[i + 20] = i + 26; > + q[i + 22] = i + 24; > + q[i + 24] = i + 22; > + q[i + 26] = i + 20; > + q[i + 28] = i + 18; > + q[i + 30] = i + 16; > + } > + __builtin_memcpy (epi8_exp, q, 32); > + > + foo_pd (pd_dst, pd_src); > + foo_ps (ps_dst, ps_src); > + foo_epi64 (epi64_dst, epi64_src); > + foo_epi32 (epi32_dst, epi32_src); > + foo_epi16 (epi16_dst, epi16_src); > + foo_epi8 (epi8_dst, epi8_src); > + if (__builtin_memcmp (pd_dst, pd_exp, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (ps_dst, ps_exp, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi64_dst, epi64_exp, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi32_dst, epi32_exp, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi16_dst, epi16_exp, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi8_dst, epi8_exp, 32) != 0) > + __builtin_abort (); > + > + return; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-3c.c b/gcc/testsuite/gcc.target/i386/pr106010-3c.c > new file mode 100644 > index 00000000000..5a5a3d4b992 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-3c.c > @@ -0,0 +1,69 
@@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-slp-details" } */ > +/* { dg-require-effective-target avx512fp16 } */ > +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 1 "slp2" } }*/ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 2, 3, 0, 1, 8, 9, 6, 7, 14, 15, 12, 13, 4, 5, 10, 11 \}} 1 "slp2" } } */ > + > +#include <string.h> > + > +static void do_test (void); > +#define DO_TEST do_test > +#define AVX512FP16 > +#include "avx512-check.h" > + > +void > +__attribute__((noipa)) > +foo_ph (_Complex _Float16* a, _Complex _Float16* __restrict b) > +{ > + a[0] = b[1]; > + a[1] = b[0]; > + a[2] = b[4]; > + a[3] = b[3]; > + a[4] = b[7]; > + a[5] = b[6]; > + a[6] = b[2]; > + a[7] = b[5]; > +} > + > +void > +do_test (void) > +{ > + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (32); > + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (32); > + _Complex _Float16* ph_exp = (_Complex _Float16*) malloc (32); > + char* p = (char* ) malloc (32); > + char* q = (char* ) malloc (32); > + > + __builtin_memset (ph_dst, 0, 32); > + > + for (int i = 0; i != 32; i++) > + p[i] = i; > + __builtin_memcpy (ph_src, p, 32); > + > + for (int i = 0; i != 4; i++) > + { > + p[i] = i + 4; > + p[i + 4] = i; > + p[i + 8] = i + 16; > + p[i + 12] = i + 12; > + p[i + 16] = i + 28; > + p[i + 20] = i + 24; > + p[i + 24] = i + 8; > + p[i + 28] = i + 20; > + q[i] = i + 28; > + q[i + 4] = i + 24; > + q[i + 8] = i + 20; > + q[i + 12] = i + 16; > + q[i + 16] = i + 12; > + q[i + 20] = i + 8; > + q[i + 24] = i + 4; > + q[i + 28] = i; > + } > + __builtin_memcpy (ph_exp, p, 32); > + > + foo_ph (ph_dst, ph_src); > + if (__builtin_memcmp (ph_dst, ph_exp, 32) != 0) > + __builtin_abort (); > + > + return; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-4a.c b/gcc/testsuite/gcc.target/i386/pr106010-4a.c > 
new file mode 100644 > index 00000000000..b7b0b532bb1 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-4a.c > @@ -0,0 +1,101 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-slp-details" } */ > +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 6 "slp2" } }*/ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 1 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 1 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 1 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) int>} 1 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 1 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 1 "slp2" } } */ > + > +void > +__attribute__((noipa)) > +foo_pd (_Complex double* a, > + _Complex double b1, > + _Complex double b2) > +{ > + a[0] = b1; > + a[1] = b2; > +} > + > +void > +__attribute__((noipa)) > +foo_ps (_Complex float* a, > + _Complex float b1, _Complex float b2, > + _Complex float b3, _Complex float b4) > +{ > + a[0] = b1; > + a[1] = b2; > + a[2] = b3; > + a[3] = b4; > +} > + > +void > +__attribute__((noipa)) > +foo_epi64 (_Complex long long* a, > + _Complex long long b1, > + _Complex long long b2) > +{ > + a[0] = b1; > + a[1] = b2; > +} > + > +void > +__attribute__((noipa)) > +foo_epi32 (_Complex int* a, > + _Complex int b1, _Complex int b2, > + _Complex int b3, _Complex int b4) > +{ > + a[0] = b1; > + a[1] = b2; > + a[2] = b3; > + a[3] = b4; > +} > + > +void > +__attribute__((noipa)) > +foo_epi16 (_Complex short* a, > + _Complex short b1, _Complex short b2, > + _Complex short b3, _Complex short b4, > + _Complex short b5, _Complex short b6, > + _Complex short b7,_Complex 
short b8) > +{ > + a[0] = b1; > + a[1] = b2; > + a[2] = b3; > + a[3] = b4; > + a[4] = b5; > + a[5] = b6; > + a[6] = b7; > + a[7] = b8; > +} > + > +void > +__attribute__((noipa)) > +foo_epi8 (_Complex char* a, > + _Complex char b1, _Complex char b2, > + _Complex char b3, _Complex char b4, > + _Complex char b5, _Complex char b6, > + _Complex char b7,_Complex char b8, > + _Complex char b9, _Complex char b10, > + _Complex char b11, _Complex char b12, > + _Complex char b13, _Complex char b14, > + _Complex char b15,_Complex char b16) > +{ > + a[0] = b1; > + a[1] = b2; > + a[2] = b3; > + a[3] = b4; > + a[4] = b5; > + a[5] = b6; > + a[6] = b7; > + a[7] = b8; > + a[8] = b9; > + a[9] = b10; > + a[10] = b11; > + a[11] = b12; > + a[12] = b13; > + a[13] = b14; > + a[14] = b15; > + a[15] = b16; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-4b.c b/gcc/testsuite/gcc.target/i386/pr106010-4b.c > new file mode 100644 > index 00000000000..e2e79508c4b > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-4b.c > @@ -0,0 +1,67 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ > +/* { dg-require-effective-target avx } */ > + > +#include "avx-check.h" > +#include <string.h> > +#include "pr106010-4a.c" > + > +void > +avx_test (void) > +{ > + _Complex double* pd_src = (_Complex double*) malloc (32); > + _Complex double* pd_dst = (_Complex double*) malloc (32); > + _Complex float* ps_src = (_Complex float*) malloc (32); > + _Complex float* ps_dst = (_Complex float*) malloc (32); > + _Complex long long* epi64_src = (_Complex long long*) malloc (32); > + _Complex long long* epi64_dst = (_Complex long long*) malloc (32); > + _Complex int* epi32_src = (_Complex int*) malloc (32); > + _Complex int* epi32_dst = (_Complex int*) malloc (32); > + _Complex short* epi16_src = (_Complex short*) malloc (32); > + _Complex short* epi16_dst = (_Complex short*) malloc (32); > + _Complex char* epi8_src = 
(_Complex char*) malloc (32); > + _Complex char* epi8_dst = (_Complex char*) malloc (32); > + char* p = (char* ) malloc (32); > + > + __builtin_memset (pd_dst, 0, 32); > + __builtin_memset (ps_dst, 0, 32); > + __builtin_memset (epi64_dst, 0, 32); > + __builtin_memset (epi32_dst, 0, 32); > + __builtin_memset (epi16_dst, 0, 32); > + __builtin_memset (epi8_dst, 0, 32); > + > + for (int i = 0; i != 32; i++) > + p[i] = i; > + __builtin_memcpy (pd_src, p, 32); > + __builtin_memcpy (ps_src, p, 32); > + __builtin_memcpy (epi64_src, p, 32); > + __builtin_memcpy (epi32_src, p, 32); > + __builtin_memcpy (epi16_src, p, 32); > + __builtin_memcpy (epi8_src, p, 32); > + > + foo_pd (pd_dst, pd_src[0], pd_src[1]); > + foo_ps (ps_dst, ps_src[0], ps_src[1], ps_src[2], ps_src[3]); > + foo_epi64 (epi64_dst, epi64_src[0], epi64_src[1]); > + foo_epi32 (epi32_dst, epi32_src[0], epi32_src[1], epi32_src[2], epi32_src[3]); > + foo_epi16 (epi16_dst, epi16_src[0], epi16_src[1], epi16_src[2], epi16_src[3], > + epi16_src[4], epi16_src[5], epi16_src[6], epi16_src[7]); > + foo_epi8 (epi8_dst, epi8_src[0], epi8_src[1], epi8_src[2], epi8_src[3], > + epi8_src[4], epi8_src[5], epi8_src[6], epi8_src[7], > + epi8_src[8], epi8_src[9], epi8_src[10], epi8_src[11], > + epi8_src[12], epi8_src[13], epi8_src[14], epi8_src[15]); > + > + if (__builtin_memcmp (pd_dst, pd_src, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (ps_dst, ps_src, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi64_dst, epi64_src, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi32_dst, epi32_src, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi16_dst, epi16_src, 32) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi8_dst, epi8_src, 32) != 0) > + __builtin_abort (); > + > + return; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-4c.c b/gcc/testsuite/gcc.target/i386/pr106010-4c.c > new file mode 100644 > index 00000000000..8e02aefe3b5 > --- /dev/null > +++ 
b/gcc/testsuite/gcc.target/i386/pr106010-4c.c > @@ -0,0 +1,54 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -fdump-tree-slp-details -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ > +/* { dg-require-effective-target avx512fp16 } */ > +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 1 "slp2" } }*/ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 1 "slp2" } } */ > + > +#include <string.h> > + > +static void do_test (void); > +#define DO_TEST do_test > +#define AVX512FP16 > +#include "avx512-check.h" > + > +void > +__attribute__((noipa)) > +foo_ph (_Complex _Float16* a, > + _Complex _Float16 b1, _Complex _Float16 b2, > + _Complex _Float16 b3, _Complex _Float16 b4, > + _Complex _Float16 b5, _Complex _Float16 b6, > + _Complex _Float16 b7,_Complex _Float16 b8) > +{ > + a[0] = b1; > + a[1] = b2; > + a[2] = b3; > + a[3] = b4; > + a[4] = b5; > + a[5] = b6; > + a[6] = b7; > + a[7] = b8; > +} > + > +void > +do_test (void) > +{ > + > + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (32); > + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (32); > + > + char* p = (char* ) malloc (32); > + > + __builtin_memset (ph_dst, 0, 32); > + > + for (int i = 0; i != 32; i++) > + p[i] = i; > + > + __builtin_memcpy (ph_src, p, 32); > + > + foo_ph (ph_dst, ph_src[0], ph_src[1], ph_src[2], ph_src[3], > + ph_src[4], ph_src[5], ph_src[6], ph_src[7]); > + > + if (__builtin_memcmp (ph_dst, ph_src, 32) != 0) > + __builtin_abort (); > + return; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-5a.c b/gcc/testsuite/gcc.target/i386/pr106010-5a.c > new file mode 100644 > index 00000000000..9d4a6f9846b > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-5a.c > @@ -0,0 +1,117 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-slp-details 
-mprefer-vector-width=256" } */ > +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 6 "slp2" } }*/ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 4 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 4 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 4 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) int>} 4 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 4 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 4 "slp2" } } */ > + > +void > +__attribute__((noipa)) > +foo_pd (_Complex double* a, _Complex double* __restrict b) > +{ > + a[0] = b[2]; > + a[1] = b[3]; > + a[2] = b[0]; > + a[3] = b[1]; > +} > + > +void > +__attribute__((noipa)) > +foo_ps (_Complex float* a, _Complex float* __restrict b) > +{ > + a[0] = b[4]; > + a[1] = b[5]; > + a[2] = b[6]; > + a[3] = b[7]; > + a[4] = b[0]; > + a[5] = b[1]; > + a[6] = b[2]; > + a[7] = b[3]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi64 (_Complex long long* a, _Complex long long* __restrict b) > +{ > + a[0] = b[2]; > + a[1] = b[3]; > + a[2] = b[0]; > + a[3] = b[1]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi32 (_Complex int* a, _Complex int* __restrict b) > +{ > + a[0] = b[4]; > + a[1] = b[5]; > + a[2] = b[6]; > + a[3] = b[7]; > + a[4] = b[0]; > + a[5] = b[1]; > + a[6] = b[2]; > + a[7] = b[3]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi16 (_Complex short* a, _Complex short* __restrict b) > +{ > + a[0] = b[8]; > + a[1] = b[9]; > + a[2] = b[10]; > + a[3] = b[11]; > + a[4] = b[12]; > + a[5] = b[13]; > + a[6] = b[14]; > + a[7] = b[15]; > + a[8] = b[0]; > + a[9] = b[1]; > + a[10] = b[2]; > + a[11] = b[3]; > + a[12] = b[4]; > + a[13] = b[5]; > + a[14] = b[6]; > + a[15] = 
b[7]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi8 (_Complex char* a, _Complex char* __restrict b) > +{ > + a[0] = b[16]; > + a[1] = b[17]; > + a[2] = b[18]; > + a[3] = b[19]; > + a[4] = b[20]; > + a[5] = b[21]; > + a[6] = b[22]; > + a[7] = b[23]; > + a[8] = b[24]; > + a[9] = b[25]; > + a[10] = b[26]; > + a[11] = b[27]; > + a[12] = b[28]; > + a[13] = b[29]; > + a[14] = b[30]; > + a[15] = b[31]; > + a[16] = b[0]; > + a[17] = b[1]; > + a[18] = b[2]; > + a[19] = b[3]; > + a[20] = b[4]; > + a[21] = b[5]; > + a[22] = b[6]; > + a[23] = b[7]; > + a[24] = b[8]; > + a[25] = b[9]; > + a[26] = b[10]; > + a[27] = b[11]; > + a[28] = b[12]; > + a[29] = b[13]; > + a[30] = b[14]; > + a[31] = b[15]; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-5b.c b/gcc/testsuite/gcc.target/i386/pr106010-5b.c > new file mode 100644 > index 00000000000..d5c6ebeb5cf > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-5b.c > @@ -0,0 +1,80 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ > +/* { dg-require-effective-target avx } */ > + > +#include "avx-check.h" > +#include <string.h> > +#include "pr106010-5a.c" > + > +void > +avx_test (void) > +{ > + _Complex double* pd_src = (_Complex double*) malloc (64); > + _Complex double* pd_dst = (_Complex double*) malloc (64); > + _Complex double* pd_exp = (_Complex double*) malloc (64); > + _Complex float* ps_src = (_Complex float*) malloc (64); > + _Complex float* ps_dst = (_Complex float*) malloc (64); > + _Complex float* ps_exp = (_Complex float*) malloc (64); > + _Complex long long* epi64_src = (_Complex long long*) malloc (64); > + _Complex long long* epi64_dst = (_Complex long long*) malloc (64); > + _Complex long long* epi64_exp = (_Complex long long*) malloc (64); > + _Complex int* epi32_src = (_Complex int*) malloc (64); > + _Complex int* epi32_dst = (_Complex int*) malloc (64); > + _Complex int* epi32_exp = (_Complex int*) malloc 
(64); > + _Complex short* epi16_src = (_Complex short*) malloc (64); > + _Complex short* epi16_dst = (_Complex short*) malloc (64); > + _Complex short* epi16_exp = (_Complex short*) malloc (64); > + _Complex char* epi8_src = (_Complex char*) malloc (64); > + _Complex char* epi8_dst = (_Complex char*) malloc (64); > + _Complex char* epi8_exp = (_Complex char*) malloc (64); > + char* p = (char* ) malloc (64); > + char* q = (char* ) malloc (64); > + > + __builtin_memset (pd_dst, 0, 64); > + __builtin_memset (ps_dst, 0, 64); > + __builtin_memset (epi64_dst, 0, 64); > + __builtin_memset (epi32_dst, 0, 64); > + __builtin_memset (epi16_dst, 0, 64); > + __builtin_memset (epi8_dst, 0, 64); > + > + for (int i = 0; i != 64; i++) > + { > + p[i] = i; > + q[i] = (i + 32) % 64; > + } > + __builtin_memcpy (pd_src, p, 64); > + __builtin_memcpy (ps_src, p, 64); > + __builtin_memcpy (epi64_src, p, 64); > + __builtin_memcpy (epi32_src, p, 64); > + __builtin_memcpy (epi16_src, p, 64); > + __builtin_memcpy (epi8_src, p, 64); > + > + __builtin_memcpy (pd_exp, q, 64); > + __builtin_memcpy (ps_exp, q, 64); > + __builtin_memcpy (epi64_exp, q, 64); > + __builtin_memcpy (epi32_exp, q, 64); > + __builtin_memcpy (epi16_exp, q, 64); > + __builtin_memcpy (epi8_exp, q, 64); > + > + foo_pd (pd_dst, pd_src); > + foo_ps (ps_dst, ps_src); > + foo_epi64 (epi64_dst, epi64_src); > + foo_epi32 (epi32_dst, epi32_src); > + foo_epi16 (epi16_dst, epi16_src); > + foo_epi8 (epi8_dst, epi8_src); > + > + if (__builtin_memcmp (pd_dst, pd_exp, 64) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (ps_dst, ps_exp, 64) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi64_dst, epi64_exp, 64) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi32_dst, epi32_exp, 64) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi16_dst, epi16_exp, 64) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi8_dst, epi8_exp, 64) != 0) > + __builtin_abort (); > + > + return; > +} > diff --git 
a/gcc/testsuite/gcc.target/i386/pr106010-5c.c b/gcc/testsuite/gcc.target/i386/pr106010-5c.c > new file mode 100644 > index 00000000000..9ce4e6dd5c0 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-5c.c > @@ -0,0 +1,62 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-slp-details -mprefer-vector-width=256" } */ > +/* { dg-require-effective-target avx512fp16 } */ > +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 1 "slp2" } }*/ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 4 "slp2" } } */ > + > +#include <string.h> > + > +static void do_test (void); > +#define DO_TEST do_test > +#define AVX512FP16 > +#include "avx512-check.h" > + > +void > +__attribute__((noipa)) > +foo_ph (_Complex _Float16* a, _Complex _Float16* __restrict b) > +{ > + a[0] = b[8]; > + a[1] = b[9]; > + a[2] = b[10]; > + a[3] = b[11]; > + a[4] = b[12]; > + a[5] = b[13]; > + a[6] = b[14]; > + a[7] = b[15]; > + a[8] = b[0]; > + a[9] = b[1]; > + a[10] = b[2]; > + a[11] = b[3]; > + a[12] = b[4]; > + a[13] = b[5]; > + a[14] = b[6]; > + a[15] = b[7]; > +} > + > +void > +do_test (void) > +{ > + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (64); > + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (64); > + _Complex _Float16* ph_exp = (_Complex _Float16*) malloc (64); > + char* p = (char* ) malloc (64); > + char* q = (char* ) malloc (64); > + > + __builtin_memset (ph_dst, 0, 64); > + > + for (int i = 0; i != 64; i++) > + { > + p[i] = i; > + q[i] = (i + 32) % 64; > + } > + __builtin_memcpy (ph_src, p, 64); > + > + __builtin_memcpy (ph_exp, q, 64); > + > + foo_ph (ph_dst, ph_src); > + > + if (__builtin_memcmp (ph_dst, ph_exp, 64) != 0) > + __builtin_abort (); > + > + return; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-6a.c b/gcc/testsuite/gcc.target/i386/pr106010-6a.c > new file mode 100644 > 
index 00000000000..65a90d03684 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-6a.c > @@ -0,0 +1,115 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mavx2 -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-slp-details -mprefer-vector-width=256" } */ > +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 6 "slp2" } }*/ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 2, 3, 0, 1 \}} 4 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 6, 7, 4, 5, 2, 3, 0, 1 \}} 4 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 \}} 2 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 30, 31, 28, 29, 26, 27, 24, 25, 22, 23, 20, 21, 18, 19, 16, 17, 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 \}} 2 "slp2" } } */ > + > +void > +__attribute__((noipa)) > +foo_pd (_Complex double* a, _Complex double* __restrict b) > +{ > + a[0] = b[3]; > + a[1] = b[2]; > + a[2] = b[1]; > + a[3] = b[0]; > +} > + > +void > +__attribute__((noipa)) > +foo_ps (_Complex float* a, _Complex float* __restrict b) > +{ > + a[0] = b[7]; > + a[1] = b[6]; > + a[2] = b[5]; > + a[3] = b[4]; > + a[4] = b[3]; > + a[5] = b[2]; > + a[6] = b[1]; > + a[7] = b[0]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi64 (_Complex long long* a, _Complex long long* __restrict b) > +{ > + a[0] = b[3]; > + a[1] = b[2]; > + a[2] = b[1]; > + a[3] = b[0]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi32 (_Complex int* a, _Complex int* __restrict b) > +{ > + a[0] = b[7]; > + a[1] = b[6]; > + a[2] = b[5]; > + a[3] = b[4]; > + a[4] = b[3]; > + a[5] = b[2]; > + a[6] = b[1]; > + a[7] = b[0]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi16 (_Complex short* a, _Complex short* __restrict b) > +{ > + a[0] = b[15]; > + a[1] = b[14]; > + a[2] = b[13]; 
> + a[3] = b[12]; > + a[4] = b[11]; > + a[5] = b[10]; > + a[6] = b[9]; > + a[7] = b[8]; > + a[8] = b[7]; > + a[9] = b[6]; > + a[10] = b[5]; > + a[11] = b[4]; > + a[12] = b[3]; > + a[13] = b[2]; > + a[14] = b[1]; > + a[15] = b[0]; > +} > + > +void > +__attribute__((noipa)) > +foo_epi8 (_Complex char* a, _Complex char* __restrict b) > +{ > + a[0] = b[31]; > + a[1] = b[30]; > + a[2] = b[29]; > + a[3] = b[28]; > + a[4] = b[27]; > + a[5] = b[26]; > + a[6] = b[25]; > + a[7] = b[24]; > + a[8] = b[23]; > + a[9] = b[22]; > + a[10] = b[21]; > + a[11] = b[20]; > + a[12] = b[19]; > + a[13] = b[18]; > + a[14] = b[17]; > + a[15] = b[16]; > + a[16] = b[15]; > + a[17] = b[14]; > + a[18] = b[13]; > + a[19] = b[12]; > + a[20] = b[11]; > + a[21] = b[10]; > + a[22] = b[9]; > + a[23] = b[8]; > + a[24] = b[7]; > + a[25] = b[6]; > + a[26] = b[5]; > + a[27] = b[4]; > + a[28] = b[3]; > + a[29] = b[2]; > + a[30] = b[1]; > + a[31] = b[0]; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-6b.c b/gcc/testsuite/gcc.target/i386/pr106010-6b.c > new file mode 100644 > index 00000000000..1c5bb020939 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-6b.c > @@ -0,0 +1,157 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx2 -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ > +/* { dg-require-effective-target avx2 } */ > + > +#include "avx2-check.h" > +#include <string.h> > +#include "pr106010-6a.c" > + > +void > +avx2_test (void) > +{ > + _Complex double* pd_src = (_Complex double*) malloc (64); > + _Complex double* pd_dst = (_Complex double*) malloc (64); > + _Complex double* pd_exp = (_Complex double*) malloc (64); > + _Complex float* ps_src = (_Complex float*) malloc (64); > + _Complex float* ps_dst = (_Complex float*) malloc (64); > + _Complex float* ps_exp = (_Complex float*) malloc (64); > + _Complex long long* epi64_src = (_Complex long long*) malloc (64); > + _Complex long long* epi64_dst = (_Complex long long*) malloc (64); > + _Complex 
long long* epi64_exp = (_Complex long long*) malloc (64); > + _Complex int* epi32_src = (_Complex int*) malloc (64); > + _Complex int* epi32_dst = (_Complex int*) malloc (64); > + _Complex int* epi32_exp = (_Complex int*) malloc (64); > + _Complex short* epi16_src = (_Complex short*) malloc (64); > + _Complex short* epi16_dst = (_Complex short*) malloc (64); > + _Complex short* epi16_exp = (_Complex short*) malloc (64); > + _Complex char* epi8_src = (_Complex char*) malloc (64); > + _Complex char* epi8_dst = (_Complex char*) malloc (64); > + _Complex char* epi8_exp = (_Complex char*) malloc (64); > + char* p = (char* ) malloc (64); > + char* q = (char* ) malloc (64); > + > + __builtin_memset (pd_dst, 0, 64); > + __builtin_memset (ps_dst, 0, 64); > + __builtin_memset (epi64_dst, 0, 64); > + __builtin_memset (epi32_dst, 0, 64); > + __builtin_memset (epi16_dst, 0, 64); > + __builtin_memset (epi8_dst, 0, 64); > + > + for (int i = 0; i != 64; i++) > + p[i] = i; > + > + __builtin_memcpy (pd_src, p, 64); > + __builtin_memcpy (ps_src, p, 64); > + __builtin_memcpy (epi64_src, p, 64); > + __builtin_memcpy (epi32_src, p, 64); > + __builtin_memcpy (epi16_src, p, 64); > + __builtin_memcpy (epi8_src, p, 64); > + > + > + for (int i = 0; i != 16; i++) > + { > + q[i] = i + 48; > + q[i + 16] = i + 32; > + q[i + 32] = i + 16; > + q[i + 48] = i; > + } > + > + __builtin_memcpy (pd_exp, q, 64); > + __builtin_memcpy (epi64_exp, q, 64); > + > + for (int i = 0; i != 8; i++) > + { > + q[i] = i + 56; > + q[i + 8] = i + 48; > + q[i + 16] = i + 40; > + q[i + 24] = i + 32; > + q[i + 32] = i + 24; > + q[i + 40] = i + 16; > + q[i + 48] = i + 8; > + q[i + 56] = i; > + } > + > + __builtin_memcpy (ps_exp, q, 64); > + __builtin_memcpy (epi32_exp, q, 64); > + > + for (int i = 0; i != 4; i++) > + { > + q[i] = i + 60; > + q[i + 4] = i + 56; > + q[i + 8] = i + 52; > + q[i + 12] = i + 48; > + q[i + 16] = i + 44; > + q[i + 20] = i + 40; > + q[i + 24] = i + 36; > + q[i + 28] = i + 32; > + q[i + 32] = i + 
28; > + q[i + 36] = i + 24; > + q[i + 40] = i + 20; > + q[i + 44] = i + 16; > + q[i + 48] = i + 12; > + q[i + 52] = i + 8; > + q[i + 56] = i + 4; > + q[i + 60] = i; > + } > + > + __builtin_memcpy (epi16_exp, q, 64); > + > + for (int i = 0; i != 2; i++) > + { > + q[i] = i + 62; > + q[i + 2] = i + 60; > + q[i + 4] = i + 58; > + q[i + 6] = i + 56; > + q[i + 8] = i + 54; > + q[i + 10] = i + 52; > + q[i + 12] = i + 50; > + q[i + 14] = i + 48; > + q[i + 16] = i + 46; > + q[i + 18] = i + 44; > + q[i + 20] = i + 42; > + q[i + 22] = i + 40; > + q[i + 24] = i + 38; > + q[i + 26] = i + 36; > + q[i + 28] = i + 34; > + q[i + 30] = i + 32; > + q[i + 32] = i + 30; > + q[i + 34] = i + 28; > + q[i + 36] = i + 26; > + q[i + 38] = i + 24; > + q[i + 40] = i + 22; > + q[i + 42] = i + 20; > + q[i + 44] = i + 18; > + q[i + 46] = i + 16; > + q[i + 48] = i + 14; > + q[i + 50] = i + 12; > + q[i + 52] = i + 10; > + q[i + 54] = i + 8; > + q[i + 56] = i + 6; > + q[i + 58] = i + 4; > + q[i + 60] = i + 2; > + q[i + 62] = i; > + } > + __builtin_memcpy (epi8_exp, q, 64); > + > + foo_pd (pd_dst, pd_src); > + foo_ps (ps_dst, ps_src); > + foo_epi64 (epi64_dst, epi64_src); > + foo_epi32 (epi32_dst, epi32_src); > + foo_epi16 (epi16_dst, epi16_src); > + foo_epi8 (epi8_dst, epi8_src); > + > + if (__builtin_memcmp (pd_dst, pd_exp, 64) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (ps_dst, ps_exp, 64) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi64_dst, epi64_exp, 64) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi32_dst, epi32_exp, 64) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi16_dst, epi16_exp, 64) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi8_dst, epi8_exp, 64) != 0) > + __builtin_abort (); > + > + return; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-6c.c b/gcc/testsuite/gcc.target/i386/pr106010-6c.c > new file mode 100644 > index 00000000000..b859d884a7f > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-6c.c > 
@@ -0,0 +1,80 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-slp-details" } */ > +/* { dg-require-effective-target avx512fp16 } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 \}} 2 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 1 "slp2" } } */ > + > +#include <string.h> > + > +static void do_test (void); > +#define DO_TEST do_test > +#define AVX512FP16 > +#include "avx512-check.h" > + > +void > +__attribute__((noipa)) > +foo_ph (_Complex _Float16* a, _Complex _Float16* __restrict b) > +{ > + a[0] = b[15]; > + a[1] = b[14]; > + a[2] = b[13]; > + a[3] = b[12]; > + a[4] = b[11]; > + a[5] = b[10]; > + a[6] = b[9]; > + a[7] = b[8]; > + a[8] = b[7]; > + a[9] = b[6]; > + a[10] = b[5]; > + a[11] = b[4]; > + a[12] = b[3]; > + a[13] = b[2]; > + a[14] = b[1]; > + a[15] = b[0]; > +} > + > +void > +do_test (void) > +{ > + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (64); > + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (64); > + _Complex _Float16* ph_exp = (_Complex _Float16*) malloc (64); > + char* p = (char* ) malloc (64); > + char* q = (char* ) malloc (64); > + > + __builtin_memset (ph_dst, 0, 64); > + > + for (int i = 0; i != 64; i++) > + p[i] = i; > + > + __builtin_memcpy (ph_src, p, 64); > + > + for (int i = 0; i != 4; i++) > + { > + q[i] = i + 60; > + q[i + 4] = i + 56; > + q[i + 8] = i + 52; > + q[i + 12] = i + 48; > + q[i + 16] = i + 44; > + q[i + 20] = i + 40; > + q[i + 24] = i + 36; > + q[i + 28] = i + 32; > + q[i + 32] = i + 28; > + q[i + 36] = i + 24; > + q[i + 40] = i + 20; > + q[i + 44] = i + 16; > + q[i + 48] = i + 12; > + q[i + 52] = i + 8; > + q[i + 56] = i + 4; > + q[i + 60] = i; > + } > + > + __builtin_memcpy (ph_exp, q, 64); > + > + foo_ph (ph_dst, ph_src); > + > + if 
(__builtin_memcmp (ph_dst, ph_exp, 64) != 0) > + __builtin_abort (); > + > + return; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-7a.c b/gcc/testsuite/gcc.target/i386/pr106010-7a.c > new file mode 100644 > index 00000000000..2ea01fac927 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-7a.c > @@ -0,0 +1,58 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-vect-details -mprefer-vector-width=256" } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 6 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 1 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 1 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 1 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) int>} 1 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 1 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 1 "vect" } } */ > + > +#define N 10000 > +void > +__attribute__((noipa)) > +foo_pd (_Complex double* a, _Complex double b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b; > +} > + > +void > +__attribute__((noipa)) > +foo_ps (_Complex float* a, _Complex float b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b; > +} > + > +void > +__attribute__((noipa)) > +foo_epi64 (_Complex long long* a, _Complex long long b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b; > +} > + > +void > +__attribute__((noipa)) > +foo_epi32 (_Complex int* a, _Complex int b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b; > +} > + > +void > +__attribute__((noipa)) > +foo_epi16 (_Complex short* a, _Complex short b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b; > +} > + > +void > +__attribute__((noipa)) > +foo_epi8 
(_Complex char* a, _Complex char b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-7b.c b/gcc/testsuite/gcc.target/i386/pr106010-7b.c > new file mode 100644 > index 00000000000..26482cc10f5 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-7b.c > @@ -0,0 +1,63 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ > +/* { dg-require-effective-target avx } */ > + > +#include "avx-check.h" > +#include <string.h> > +#include "pr106010-7a.c" > + > +void > +avx_test (void) > +{ > + _Complex double* pd_src = (_Complex double*) malloc (2 * N * sizeof (double)); > + _Complex double* pd_dst = (_Complex double*) malloc (2 * N * sizeof (double)); > + _Complex float* ps_src = (_Complex float*) malloc (2 * N * sizeof (float)); > + _Complex float* ps_dst = (_Complex float*) malloc (2 * N * sizeof (float)); > + _Complex long long* epi64_src = (_Complex long long*) malloc (2 * N * sizeof (long long)); > + _Complex long long* epi64_dst = (_Complex long long*) malloc (2 * N * sizeof (long long)); > + _Complex int* epi32_src = (_Complex int*) malloc (2 * N * sizeof (int)); > + _Complex int* epi32_dst = (_Complex int*) malloc (2 * N * sizeof (int)); > + _Complex short* epi16_src = (_Complex short*) malloc (2 * N * sizeof (short)); > + _Complex short* epi16_dst = (_Complex short*) malloc (2 * N * sizeof (short)); > + _Complex char* epi8_src = (_Complex char*) malloc (2 * N * sizeof (char)); > + _Complex char* epi8_dst = (_Complex char*) malloc (2 * N * sizeof (char)); > + char* p_init = (char*) malloc (2 * N * sizeof (double)); > + > + __builtin_memset (pd_dst, 0, 2 * N * sizeof (double)); > + __builtin_memset (ps_dst, 0, 2 * N * sizeof (float)); > + __builtin_memset (epi64_dst, 0, 2 * N * sizeof (long long)); > + __builtin_memset (epi32_dst, 0, 2 * N * sizeof (int)); > + __builtin_memset (epi16_dst, 0, 2 * N * sizeof (short)); > + 
__builtin_memset (epi8_dst, 0, 2 * N * sizeof (char)); > + > + for (int i = 0; i != 2 * N * sizeof (double); i++) > + p_init[i] = i % 2 + 3; > + > + memcpy (pd_src, p_init, 2 * N * sizeof (double)); > + memcpy (ps_src, p_init, 2 * N * sizeof (float)); > + memcpy (epi64_src, p_init, 2 * N * sizeof (long long)); > + memcpy (epi32_src, p_init, 2 * N * sizeof (int)); > + memcpy (epi16_src, p_init, 2 * N * sizeof (short)); > + memcpy (epi8_src, p_init, 2 * N * sizeof (char)); > + > + foo_pd (pd_dst, pd_src[0]); > + foo_ps (ps_dst, ps_src[0]); > + foo_epi64 (epi64_dst, epi64_src[0]); > + foo_epi32 (epi32_dst, epi32_src[0]); > + foo_epi16 (epi16_dst, epi16_src[0]); > + foo_epi8 (epi8_dst, epi8_src[0]); > + if (__builtin_memcmp (pd_dst, pd_src, N * 2 * sizeof (double)) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (ps_dst, ps_src, N * 2 * sizeof (float)) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi64_dst, epi64_src, N * 2 * sizeof (long long)) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi32_dst, epi32_src, N * 2 * sizeof (int)) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi16_dst, epi16_src, N * 2 * sizeof (short)) != 0) > + __builtin_abort (); > + if (__builtin_memcmp (epi8_dst, epi8_src, N * 2 * sizeof (char)) != 0) > + __builtin_abort (); > + > + return; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-7c.c b/gcc/testsuite/gcc.target/i386/pr106010-7c.c > new file mode 100644 > index 00000000000..7f4056a5ecc > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-7c.c > @@ -0,0 +1,41 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-vect-details" } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 1 "vect" } } */ > +/* { dg-require-effective-target avx512fp16 } */ > + > +#include <string.h> > + > +static void do_test (void); > + > +#define DO_TEST do_test >
+#define AVX512FP16 > +#include "avx512-check.h" > + > +#define N 10000 > + > +void > +__attribute__((noipa)) > +foo_ph (_Complex _Float16* a, _Complex _Float16 b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b; > +} > + > +static void > +do_test (void) > +{ > + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (2 * N * sizeof (_Float16)); > + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (2 * N * sizeof (_Float16)); > + char* p_init = (char*) malloc (2 * N * sizeof (_Float16)); > + > + __builtin_memset (ph_dst, 0, 2 * N * sizeof (_Float16)); > + > + for (int i = 0; i != 2 * N * sizeof (_Float16); i++) > + p_init[i] = i % 2 + 3; > + > + memcpy (ph_src, p_init, 2 * N * sizeof (_Float16)); > + > + foo_ph (ph_dst, ph_src[0]); > + if (__builtin_memcmp (ph_dst, ph_src, N * 2 * sizeof (_Float16)) != 0) > + __builtin_abort (); > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-8a.c b/gcc/testsuite/gcc.target/i386/pr106010-8a.c > new file mode 100644 > index 00000000000..11054b60d30 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-8a.c > @@ -0,0 +1,58 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-vect-details -mprefer-vector-width=256" } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 6 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 1 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 1 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 1 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) int>} 1 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 1 "vect" } } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 1 "vect" } } */ > + > +#define N 10000 > +void > 
+__attribute__((noipa)) > +foo_pd (_Complex double* a) > +{ > + for (int i = 0; i != N; i++) > + a[i] = 1.0 + 2.0i; > +} > + > +void > +__attribute__((noipa)) > +foo_ps (_Complex float* a) > +{ > + for (int i = 0; i != N; i++) > + a[i] = 1.0f + 2.0fi; > +} > + > +void > +__attribute__((noipa)) > +foo_epi64 (_Complex long long* a) > +{ > + for (int i = 0; i != N; i++) > + a[i] = 1 + 2i; > +} > + > +void > +__attribute__((noipa)) > +foo_epi32 (_Complex int* a) > +{ > + for (int i = 0; i != N; i++) > + a[i] = 1 + 2i; > +} > + > +void > +__attribute__((noipa)) > +foo_epi16 (_Complex short* a) > +{ > + for (int i = 0; i != N; i++) > + a[i] = 1 + 2i; > +} > + > +void > +__attribute__((noipa)) > +foo_epi8 (_Complex char* a) > +{ > + for (int i = 0; i != N; i++) > + a[i] = 1 + 2i; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-8b.c b/gcc/testsuite/gcc.target/i386/pr106010-8b.c > new file mode 100644 > index 00000000000..6bb0073b691 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-8b.c > @@ -0,0 +1,53 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ > +/* { dg-require-effective-target avx } */ > + > +#include "avx-check.h" > +#include <string.h> > +#include "pr106010-8a.c" > + > +void > +avx_test (void) > +{ > + _Complex double pd_src = 1.0 + 2.0i; > + _Complex double* pd_dst = (_Complex double*) malloc (2 * N * sizeof (double)); > + _Complex float ps_src = 1.0 + 2.0i; > + _Complex float* ps_dst = (_Complex float*) malloc (2 * N * sizeof (float)); > + _Complex long long epi64_src = 1 + 2i; > + _Complex long long* epi64_dst = (_Complex long long*) malloc (2 * N * sizeof (long long)); > + _Complex int epi32_src = 1 + 2i; > + _Complex int* epi32_dst = (_Complex int*) malloc (2 * N * sizeof (int)); > + _Complex short epi16_src = 1 + 2i; > + _Complex short* epi16_dst = (_Complex short*) malloc (2 * N * sizeof (short)); > + _Complex char epi8_src = 1 + 2i; > + _Complex
char* epi8_dst = (_Complex char*) malloc (2 * N * sizeof (char)); > + > + __builtin_memset (pd_dst, 0, 2 * N * sizeof (double)); > + __builtin_memset (ps_dst, 0, 2 * N * sizeof (float)); > + __builtin_memset (epi64_dst, 0, 2 * N * sizeof (long long)); > + __builtin_memset (epi32_dst, 0, 2 * N * sizeof (int)); > + __builtin_memset (epi16_dst, 0, 2 * N * sizeof (short)); > + __builtin_memset (epi8_dst, 0, 2 * N * sizeof (char)); > + > + foo_pd (pd_dst); > + foo_ps (ps_dst); > + foo_epi64 (epi64_dst); > + foo_epi32 (epi32_dst); > + foo_epi16 (epi16_dst); > + foo_epi8 (epi8_dst); > + for (int i = 0 ; i != N; i++) > + { > + if (pd_dst[i] != pd_src) > + __builtin_abort (); > + if (ps_dst[i] != ps_src) > + __builtin_abort (); > + if (epi64_dst[i] != epi64_src) > + __builtin_abort (); > + if (epi32_dst[i] != epi32_src) > + __builtin_abort (); > + if (epi16_dst[i] != epi16_src) > + __builtin_abort (); > + if (epi8_dst[i] != epi8_src) > + __builtin_abort (); > + } > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-8c.c b/gcc/testsuite/gcc.target/i386/pr106010-8c.c > new file mode 100644 > index 00000000000..61ae131829d > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-8c.c > @@ -0,0 +1,38 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-vect-details" } */ > +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 1 "vect" } } */ > +/* { dg-require-effective-target avx512fp16 } */ > + > +#include <string.h> > + > +static void do_test (void); > + > +#define DO_TEST do_test > +#define AVX512FP16 > +#include "avx512-check.h" > + > +#define N 10000 > + > +void > +__attribute__((noipa)) > +foo_ph (_Complex _Float16* a) > +{ > + for (int i = 0; i != N; i++) > + a[i] = 1.0f16 + 2.0f16i; > +} > + > +static void > +do_test (void) > +{ > + _Complex _Float16 ph_src = 1.0f16 + 2.0f16i; > + _Complex _Float16* ph_dst = (_Complex 
_Float16*) malloc (2 * N * sizeof (_Float16)); > + > + __builtin_memset (ph_dst, 0, 2 * N * sizeof (_Float16)); > + > + foo_ph (ph_dst); > + for (int i = 0; i != N; i++) > + { > + if (ph_dst[i] != ph_src) > + __builtin_abort (); > + } > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-9a.c b/gcc/testsuite/gcc.target/i386/pr106010-9a.c > new file mode 100644 > index 00000000000..e922f7b5400 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-9a.c > @@ -0,0 +1,89 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O3 -mavx2 -fvect-cost-model=unlimited -fdump-tree-vect-details" } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 6 "vect" } } */ > + > +typedef struct { _Complex double c; double a1; double a2;} > + cdf; > +typedef struct { _Complex double c; double a1; double a2; double a3; double a4;} > + cdf2; > +typedef struct { _Complex double c1; _Complex double c2; double a1; double a2; double a3; double a4;} > + cdf3; > +typedef struct { _Complex double c1; _Complex double c2; double a1; double a2;} > + cdf4; > + > +#define N 100 > +/* VMAT_ELEMENTWISE. */ > +void > +__attribute__((noipa)) > +foo (cdf* a, cdf* __restrict b) > +{ > + for (int i = 0; i < N; ++i) > + { > + a[i].c = b[i].c; > + a[i].a1 = b[i].a1; > + a[i].a2 = b[i].a2; > + } > +} > + > +/* VMAT_CONTIGUOUS_PERMUTE. */ > +void > +__attribute__((noipa)) > +foo1 (cdf2* a, cdf2* __restrict b) > +{ > + for (int i = 0; i < N; ++i) > + { > + a[i].c = b[i].c; > + a[i].a1 = b[i].a1; > + a[i].a2 = b[i].a2; > + a[i].a3 = b[i].a3; > + a[i].a4 = b[i].a4; > + } > +} > + > +/* VMAT_CONTIGUOUS. */ > +void > +__attribute__((noipa)) > +foo2 (cdf3* a, cdf3* __restrict b) > +{ > + for (int i = 0; i < N; ++i) > + { > + a[i].c1 = b[i].c1; > + a[i].c2 = b[i].c2; > + a[i].a1 = b[i].a1; > + a[i].a2 = b[i].a2; > + a[i].a3 = b[i].a3; > + a[i].a4 = b[i].a4; > + } > +} > + > +/* VMAT_STRIDED_SLP. 
*/ > +void > +__attribute__((noipa)) > +foo3 (cdf4* a, cdf4* __restrict b) > +{ > + for (int i = 0; i < N; ++i) > + { > + a[i].c1 = b[i].c1; > + a[i].c2 = b[i].c2; > + a[i].a1 = b[i].a1; > + a[i].a2 = b[i].a2; > + } > +} > + > +/* VMAT_CONTIGUOUS_REVERSE. */ > +void > +__attribute__((noipa)) > +foo4 (_Complex double* a, _Complex double* __restrict b) > +{ > + for (int i = 0; i != N; i++) > + a[i] = b[N-i-1]; > +} > + > +/* VMAT_CONTIGUOUS_DOWN. */ > +void > +__attribute__((noipa)) > +foo5 (_Complex double* a, _Complex double* __restrict b) > +{ > + for (int i = 0; i != N; i++) > + a[N-i-1] = b[0]; > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-9b.c b/gcc/testsuite/gcc.target/i386/pr106010-9b.c > new file mode 100644 > index 00000000000..e220445e6e3 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-9b.c > @@ -0,0 +1,90 @@ > +/* { dg-do run } */ > +/* { dg-options "-O3 -msse2 -fvect-cost-model=unlimited" } */ > +/* { dg-require-effective-target sse2 } */ > + > +#include <string.h> > +#include "sse2-check.h" > +#include "pr106010-9a.c" > + > +static void > +sse2_test (void) > +{ > + _Complex double* pd_src = (_Complex double*) malloc (N * sizeof (_Complex double)); > + _Complex double* pd_dst = (_Complex double*) malloc (N * sizeof (_Complex double)); > + _Complex double* pd_src2 = (_Complex double*) malloc (N * sizeof (_Complex double)); > + _Complex double* pd_dst2 = (_Complex double*) malloc (N * sizeof (_Complex double)); > + cdf* cdf_src = (cdf*) malloc (N * sizeof (cdf)); > + cdf* cdf_dst = (cdf*) malloc (N * sizeof (cdf)); > + cdf2* cdf2_src = (cdf2*) malloc (N * sizeof (cdf2)); > + cdf2* cdf2_dst = (cdf2*) malloc (N * sizeof (cdf2)); > + cdf3* cdf3_src = (cdf3*) malloc (N * sizeof (cdf3)); > + cdf3* cdf3_dst = (cdf3*) malloc (N * sizeof (cdf3)); > + cdf4* cdf4_src = (cdf4*) malloc (N * sizeof (cdf4)); > + cdf4* cdf4_dst = (cdf4*) malloc (N * sizeof (cdf4)); > + > + char* p_init = (char*) malloc (N * sizeof (cdf3)); > + > + 
__builtin_memset (cdf_dst, 0, N * sizeof (cdf)); > + __builtin_memset (cdf2_dst, 0, N * sizeof (cdf2)); > + __builtin_memset (cdf3_dst, 0, N * sizeof (cdf3)); > + __builtin_memset (cdf4_dst, 0, N * sizeof (cdf4)); > + __builtin_memset (pd_dst, 0, N * sizeof (_Complex double)); > + __builtin_memset (pd_dst2, 0, N * sizeof (_Complex double)); > + > + for (int i = 0; i != N * sizeof (cdf3); i++) > + p_init[i] = i; > + > + memcpy (cdf_src, p_init, N * sizeof (cdf)); > + memcpy (cdf2_src, p_init, N * sizeof (cdf2)); > + memcpy (cdf3_src, p_init, N * sizeof (cdf3)); > + memcpy (cdf4_src, p_init, N * sizeof (cdf4)); > + memcpy (pd_src, p_init, N * sizeof (_Complex double)); > + for (int i = 0; i != 2 * N * sizeof (double); i++) > + p_init[i] = i % 16; > + memcpy (pd_src2, p_init, N * sizeof (_Complex double)); > + > + foo (cdf_dst, cdf_src); > + foo1 (cdf2_dst, cdf2_src); > + foo2 (cdf3_dst, cdf3_src); > + foo3 (cdf4_dst, cdf4_src); > + foo4 (pd_dst, pd_src); > + foo5 (pd_dst2, pd_src2); > + for (int i = 0; i != N; i++) > + { > + p_init[(N - i - 1) * 16] = i * 16; > + p_init[(N - i - 1) * 16 + 1] = i * 16 + 1; > + p_init[(N - i - 1) * 16 + 2] = i * 16 + 2; > + p_init[(N - i - 1) * 16 + 3] = i * 16 + 3; > + p_init[(N - i - 1) * 16 + 4] = i * 16 + 4; > + p_init[(N - i - 1) * 16 + 5] = i * 16 + 5; > + p_init[(N - i - 1) * 16 + 6] = i * 16 + 6; > + p_init[(N - i - 1) * 16 + 7] = i * 16 + 7; > + p_init[(N - i - 1) * 16 + 8] = i * 16 + 8; > + p_init[(N - i - 1) * 16 + 9] = i * 16 + 9; > + p_init[(N - i - 1) * 16 + 10] = i * 16 + 10; > + p_init[(N - i - 1) * 16 + 11] = i * 16 + 11; > + p_init[(N - i - 1) * 16 + 12] = i * 16 + 12; > + p_init[(N - i - 1) * 16 + 13] = i * 16 + 13; > + p_init[(N - i - 1) * 16 + 14] = i * 16 + 14; > + p_init[(N - i - 1) * 16 + 15] = i * 16 + 15; > + } > + memcpy (pd_src, p_init, N * 16); > + > + if (__builtin_memcmp (pd_dst, pd_src, N * 2 * sizeof (double)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (pd_dst2, pd_src2, N * 2 * sizeof 
(double)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (cdf_dst, cdf_src, N * sizeof (cdf)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (cdf2_dst, cdf2_src, N * sizeof (cdf2)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (cdf3_dst, cdf3_src, N * sizeof (cdf3)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (cdf4_dst, cdf4_src, N * sizeof (cdf4)) != 0) > + __builtin_abort (); > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-9c.c b/gcc/testsuite/gcc.target/i386/pr106010-9c.c > new file mode 100644 > index 00000000000..ff51f6195b7 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-9c.c > @@ -0,0 +1,90 @@ > +/* { dg-do run } */ > +/* { dg-options "-O3 -mavx2 -fvect-cost-model=unlimited" } */ > +/* { dg-require-effective-target avx2 } */ > + > +#include <string.h> > +#include "avx2-check.h" > +#include "pr106010-9a.c" > + > +static void > +avx2_test (void) > +{ > + _Complex double* pd_src = (_Complex double*) malloc (N * sizeof (_Complex double)); > + _Complex double* pd_dst = (_Complex double*) malloc (N * sizeof (_Complex double)); > + _Complex double* pd_src2 = (_Complex double*) malloc (N * sizeof (_Complex double)); > + _Complex double* pd_dst2 = (_Complex double*) malloc (N * sizeof (_Complex double)); > + cdf* cdf_src = (cdf*) malloc (N * sizeof (cdf)); > + cdf* cdf_dst = (cdf*) malloc (N * sizeof (cdf)); > + cdf2* cdf2_src = (cdf2*) malloc (N * sizeof (cdf2)); > + cdf2* cdf2_dst = (cdf2*) malloc (N * sizeof (cdf2)); > + cdf3* cdf3_src = (cdf3*) malloc (N * sizeof (cdf3)); > + cdf3* cdf3_dst = (cdf3*) malloc (N * sizeof (cdf3)); > + cdf4* cdf4_src = (cdf4*) malloc (N * sizeof (cdf4)); > + cdf4* cdf4_dst = (cdf4*) malloc (N * sizeof (cdf4)); > + > + char* p_init = (char*) malloc (N * sizeof (cdf3)); > + > + __builtin_memset (cdf_dst, 0, N * sizeof (cdf)); > + __builtin_memset (cdf2_dst, 0, N * sizeof (cdf2)); > + __builtin_memset (cdf3_dst, 0, N * sizeof (cdf3)); > + __builtin_memset 
(cdf4_dst, 0, N * sizeof (cdf4)); > + __builtin_memset (pd_dst, 0, N * sizeof (_Complex double)); > + __builtin_memset (pd_dst2, 0, N * sizeof (_Complex double)); > + > + for (int i = 0; i != N * sizeof (cdf3); i++) > + p_init[i] = i; > + > + memcpy (cdf_src, p_init, N * sizeof (cdf)); > + memcpy (cdf2_src, p_init, N * sizeof (cdf2)); > + memcpy (cdf3_src, p_init, N * sizeof (cdf3)); > + memcpy (cdf4_src, p_init, N * sizeof (cdf4)); > + memcpy (pd_src, p_init, N * sizeof (_Complex double)); > + for (int i = 0; i != 2 * N * sizeof (double); i++) > + p_init[i] = i % 16; > + memcpy (pd_src2, p_init, N * sizeof (_Complex double)); > + > + foo (cdf_dst, cdf_src); > + foo1 (cdf2_dst, cdf2_src); > + foo2 (cdf3_dst, cdf3_src); > + foo3 (cdf4_dst, cdf4_src); > + foo4 (pd_dst, pd_src); > + foo5 (pd_dst2, pd_src2); > + for (int i = 0; i != N; i++) > + { > + p_init[(N - i - 1) * 16] = i * 16; > + p_init[(N - i - 1) * 16 + 1] = i * 16 + 1; > + p_init[(N - i - 1) * 16 + 2] = i * 16 + 2; > + p_init[(N - i - 1) * 16 + 3] = i * 16 + 3; > + p_init[(N - i - 1) * 16 + 4] = i * 16 + 4; > + p_init[(N - i - 1) * 16 + 5] = i * 16 + 5; > + p_init[(N - i - 1) * 16 + 6] = i * 16 + 6; > + p_init[(N - i - 1) * 16 + 7] = i * 16 + 7; > + p_init[(N - i - 1) * 16 + 8] = i * 16 + 8; > + p_init[(N - i - 1) * 16 + 9] = i * 16 + 9; > + p_init[(N - i - 1) * 16 + 10] = i * 16 + 10; > + p_init[(N - i - 1) * 16 + 11] = i * 16 + 11; > + p_init[(N - i - 1) * 16 + 12] = i * 16 + 12; > + p_init[(N - i - 1) * 16 + 13] = i * 16 + 13; > + p_init[(N - i - 1) * 16 + 14] = i * 16 + 14; > + p_init[(N - i - 1) * 16 + 15] = i * 16 + 15; > + } > + memcpy (pd_src, p_init, N * 16); > + > + if (__builtin_memcmp (pd_dst, pd_src, N * 2 * sizeof (double)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (pd_dst2, pd_src2, N * 2 * sizeof (double)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (cdf_dst, cdf_src, N * sizeof (cdf)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (cdf2_dst, 
cdf2_src, N * sizeof (cdf2)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (cdf3_dst, cdf3_src, N * sizeof (cdf3)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (cdf4_dst, cdf4_src, N * sizeof (cdf4)) != 0) > + __builtin_abort (); > +} > diff --git a/gcc/testsuite/gcc.target/i386/pr106010-9d.c b/gcc/testsuite/gcc.target/i386/pr106010-9d.c > new file mode 100644 > index 00000000000..d4d8f1dd722 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/pr106010-9d.c > @@ -0,0 +1,92 @@ > +/* { dg-do run } */ > +/* { dg-options "-O3 -mavx512f -mavx512vl -fvect-cost-model=unlimited -mprefer-vector-width=512" } */ > +/* { dg-require-effective-target avx512f } */ > + > +#include <string.h> > +#include <stdlib.h> > +#define AVX512F > +#include "avx512-check.h" > +#include "pr106010-9a.c" > + > +static void > +test_512 (void) > +{ > + _Complex double* pd_src = (_Complex double*) malloc (N * sizeof (_Complex double)); > + _Complex double* pd_dst = (_Complex double*) malloc (N * sizeof (_Complex double)); > + _Complex double* pd_src2 = (_Complex double*) malloc (N * sizeof (_Complex double)); > + _Complex double* pd_dst2 = (_Complex double*) malloc (N * sizeof (_Complex double)); > + cdf* cdf_src = (cdf*) malloc (N * sizeof (cdf)); > + cdf* cdf_dst = (cdf*) malloc (N * sizeof (cdf)); > + cdf2* cdf2_src = (cdf2*) malloc (N * sizeof (cdf2)); > + cdf2* cdf2_dst = (cdf2*) malloc (N * sizeof (cdf2)); > + cdf3* cdf3_src = (cdf3*) malloc (N * sizeof (cdf3)); > + cdf3* cdf3_dst = (cdf3*) malloc (N * sizeof (cdf3)); > + cdf4* cdf4_src = (cdf4*) malloc (N * sizeof (cdf4)); > + cdf4* cdf4_dst = (cdf4*) malloc (N * sizeof (cdf4)); > + > + char* p_init = (char*) malloc (N * sizeof (cdf3)); > + > + __builtin_memset (cdf_dst, 0, N * sizeof (cdf)); > + __builtin_memset (cdf2_dst, 0, N * sizeof (cdf2)); > + __builtin_memset (cdf3_dst, 0, N * sizeof (cdf3)); > + __builtin_memset (cdf4_dst, 0, N * sizeof (cdf4)); > + __builtin_memset (pd_dst, 0, N * sizeof (_Complex 
double)); > + __builtin_memset (pd_dst2, 0, N * sizeof (_Complex double)); > + > + for (int i = 0; i != N * sizeof (cdf3); i++) > + p_init[i] = i; > + > + memcpy (cdf_src, p_init, N * sizeof (cdf)); > + memcpy (cdf2_src, p_init, N * sizeof (cdf2)); > + memcpy (cdf3_src, p_init, N * sizeof (cdf3)); > + memcpy (cdf4_src, p_init, N * sizeof (cdf4)); > + memcpy (pd_src, p_init, N * sizeof (_Complex double)); > + for (int i = 0; i != 2 * N * sizeof (double); i++) > + p_init[i] = i % 16; > + memcpy (pd_src2, p_init, N * sizeof (_Complex double)); > + > + foo (cdf_dst, cdf_src); > + foo1 (cdf2_dst, cdf2_src); > + foo2 (cdf3_dst, cdf3_src); > + foo3 (cdf4_dst, cdf4_src); > + foo4 (pd_dst, pd_src); > + foo5 (pd_dst2, pd_src2); > + for (int i = 0; i != N; i++) > + { > + p_init[(N - i - 1) * 16] = i * 16; > + p_init[(N - i - 1) * 16 + 1] = i * 16 + 1; > + p_init[(N - i - 1) * 16 + 2] = i * 16 + 2; > + p_init[(N - i - 1) * 16 + 3] = i * 16 + 3; > + p_init[(N - i - 1) * 16 + 4] = i * 16 + 4; > + p_init[(N - i - 1) * 16 + 5] = i * 16 + 5; > + p_init[(N - i - 1) * 16 + 6] = i * 16 + 6; > + p_init[(N - i - 1) * 16 + 7] = i * 16 + 7; > + p_init[(N - i - 1) * 16 + 8] = i * 16 + 8; > + p_init[(N - i - 1) * 16 + 9] = i * 16 + 9; > + p_init[(N - i - 1) * 16 + 10] = i * 16 + 10; > + p_init[(N - i - 1) * 16 + 11] = i * 16 + 11; > + p_init[(N - i - 1) * 16 + 12] = i * 16 + 12; > + p_init[(N - i - 1) * 16 + 13] = i * 16 + 13; > + p_init[(N - i - 1) * 16 + 14] = i * 16 + 14; > + p_init[(N - i - 1) * 16 + 15] = i * 16 + 15; > + } > + memcpy (pd_src, p_init, N * 16); > + > + if (__builtin_memcmp (pd_dst, pd_src, N * 2 * sizeof (double)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (pd_dst2, pd_src2, N * 2 * sizeof (double)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (cdf_dst, cdf_src, N * sizeof (cdf)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (cdf2_dst, cdf2_src, N * sizeof (cdf2)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp 
(cdf3_dst, cdf3_src, N * sizeof (cdf3)) != 0) > + __builtin_abort (); > + > + if (__builtin_memcmp (cdf4_dst, cdf4_src, N * sizeof (cdf4)) != 0) > + __builtin_abort (); > +} > diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc > index d20a10a1524..19567bb338a 100644 > --- a/gcc/tree-vect-data-refs.cc > +++ b/gcc/tree-vect-data-refs.cc > @@ -1403,7 +1403,8 @@ vect_get_data_access_cost (vec_info *vinfo, dr_vec_info *dr_info, > if (PURE_SLP_STMT (stmt_info)) > ncopies = 1; > else > - ncopies = vect_get_num_copies (loop_vinfo, STMT_VINFO_VECTYPE (stmt_info)); > + ncopies = vect_get_num_copies (loop_vinfo, STMT_VINFO_VECTYPE (stmt_info), > + STMT_VINFO_COMPLEX_P (stmt_info)); > > if (DR_IS_READ (dr_info->dr)) > vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme, > @@ -4597,8 +4598,22 @@ vect_analyze_data_refs (vec_info *vinfo, poly_uint64 *min_vf, bool *fatal) > > /* Set vectype for STMT. */ > scalar_type = TREE_TYPE (DR_REF (dr)); > - tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type); > - if (!vectype) > + tree adjust_scalar_type = scalar_type; > + /* Support Complex type access. Note that the complex type of load/store > + does not support gather/scatter. */ > + if (TREE_CODE (scalar_type) == COMPLEX_TYPE > + && gatherscatter == SG_NONE) > + { > + adjust_scalar_type = TREE_TYPE (scalar_type); > + STMT_VINFO_COMPLEX_P (stmt_info) = true; > + } > + tree vectype = get_vectype_for_scalar_type (vinfo, adjust_scalar_type); > + unsigned HOST_WIDE_INT constant_nunits; > + if (!vectype > + /* For complex type, V1DI doesn't make sense. */ > + || (STMT_VINFO_COMPLEX_P (stmt_info) > + && (!TYPE_VECTOR_SUBPARTS (vectype).is_constant (&constant_nunits) > + || constant_nunits == 1))) > { > if (dump_enabled_p ()) > { > @@ -4635,8 +4650,11 @@ vect_analyze_data_refs (vec_info *vinfo, poly_uint64 *min_vf, bool *fatal) > } > > /* Adjust the minimal vectorization factor according to the > - vector type. */ > + vector type. 
Note for complex type, VF is half of > + TYPE_VECTOR_SUBPARTS. */ > vf = TYPE_VECTOR_SUBPARTS (vectype); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + vf = exact_div (vf, 2); > *min_vf = upper_bound (*min_vf, vf); > > /* Leave the BB vectorizer to pick the vector type later, based on > @@ -6140,21 +6158,55 @@ vect_permute_load_chain (vec_info *vinfo, vec<tree> dr_chain, > vec_perm_indices indices; > for (k = 0; k < 3; k++) > { > - for (i = 0; i < nelt; i++) > - if (3 * i + k < 2 * nelt) > - sel[i] = 3 * i + k; > - else > - sel[i] = 0; > - indices.new_vector (sel, 2, nelt); > - perm3_mask_low = vect_gen_perm_mask_checked (vectype, indices); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + { > + for (i = 0; i < nelt / 2; i++) > + if (6 * i + 2 * k + 1 < 2 * nelt) > + { > + sel[2 * i] = 6 * i + 2 * k; > + sel[2 * i + 1] = 6 * i + 2 * k + 1; > + } > + else > + { > + sel[2 * i] = 0; > + sel[2 * i + 1] = 0; > + } > > - for (i = 0, j = 0; i < nelt; i++) > - if (3 * i + k < 2 * nelt) > - sel[i] = i; > - else > - sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++); > - indices.new_vector (sel, 2, nelt); > - perm3_mask_high = vect_gen_perm_mask_checked (vectype, indices); > + indices.new_vector (sel, 2, nelt); > + perm3_mask_low = vect_gen_perm_mask_checked (vectype, indices); > + > + for (i = 0, j = 0; i < nelt / 2; i++) > + if (6 * i + 2 * k + 1 < 2 * nelt) > + { > + sel[2 * i] = 2 * i; > + sel[2 * i + 1] = 2 * i + 1; > + } > + else > + { > + sel[2 * i] = nelt + ((nelt + 2 * k) % 6) + 6 * j; > + sel[2 * i + 1] = nelt + ((nelt + 2 * k) % 6) + 6 * (j++) + 1; > + } > + indices.new_vector (sel, 2, nelt); > + perm3_mask_high = vect_gen_perm_mask_checked (vectype, indices); > + } > + else > + { > + for (i = 0; i < nelt; i++) > + if (3 * i + k < 2 * nelt) > + sel[i] = 3 * i + k; > + else > + sel[i] = 0; > + indices.new_vector (sel, 2, nelt); > + perm3_mask_low = vect_gen_perm_mask_checked (vectype, indices); > + > + for (i = 0, j = 0; i < nelt; i++) > + if (3 * i + k < 2 * nelt) > + sel[i] 
= i; > + else > + sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++); > + indices.new_vector (sel, 2, nelt); > + perm3_mask_high = vect_gen_perm_mask_checked (vectype, indices); > + } > > first_vect = dr_chain[0]; > second_vect = dr_chain[1]; > @@ -6186,17 +6238,43 @@ vect_permute_load_chain (vec_info *vinfo, vec<tree> dr_chain, > > /* The encoding has a single stepped pattern. */ > poly_uint64 nelt = TYPE_VECTOR_SUBPARTS (vectype); > - vec_perm_builder sel (nelt, 1, 3); > - sel.quick_grow (3); > - for (i = 0; i < 3; ++i) > - sel[i] = i * 2; > - vec_perm_indices indices (sel, 2, nelt); > - perm_mask_even = vect_gen_perm_mask_checked (vectype, indices); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + { > + vec_perm_builder sel; > + unsigned neltc = nelt.to_constant (); > + sel.new_vector (neltc, neltc, 1); > + sel.quick_grow (neltc); > + for (unsigned i = 0; i != neltc / 2; i++) > + { > + sel[2 * i] = i * 4; > + sel[2 * i + 1] = i * 4 + 1; > + } > + vec_perm_indices indices (sel, 2, nelt); > + perm_mask_even = vect_gen_perm_mask_checked (vectype, indices); > > - for (i = 0; i < 3; ++i) > - sel[i] = i * 2 + 1; > - indices.new_vector (sel, 2, nelt); > - perm_mask_odd = vect_gen_perm_mask_checked (vectype, indices); > + for (unsigned i = 0; i != nelt.to_constant() / 2; i++) > + { > + sel[2 * i] = i * 4 + 2; > + sel[2 * i + 1] = i * 4 + 3; > + } > + indices.new_vector (sel, 2, nelt); > + perm_mask_odd = vect_gen_perm_mask_checked (vectype, indices); > + } > + else > + { > + vec_perm_builder sel (nelt, 1, 3); > + sel.quick_grow (3); > + for (i = 0; i < 3; ++i) > + sel[i] = i * 2; > + > + vec_perm_indices indices (sel, 2, nelt); > + perm_mask_even = vect_gen_perm_mask_checked (vectype, indices); > + > + for (i = 0; i < 3; ++i) > + sel[i] = i * 2 + 1; > + indices.new_vector (sel, 2, nelt); > + perm_mask_odd = vect_gen_perm_mask_checked (vectype, indices); > + } > > for (i = 0; i < log_length; i++) > { > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index 
3a70c15b593..365fa738022 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -200,7 +200,12 @@ vect_determine_vf_for_stmt_1 (vec_info *vinfo, stmt_vec_info stmt_info, > } > > if (nunits_vectype) > - vect_update_max_nunits (vf, nunits_vectype); > + { > + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (nunits_vectype); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + nunits = exact_div (nunits, 2); > + vect_update_max_nunits (vf, nunits); > + } > > return opt_result::success (); > } > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index dab5daddcc5..5d66ea2f286 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -877,10 +877,14 @@ vect_record_max_nunits (vec_info *vinfo, stmt_vec_info stmt_info, > return false; > } > > + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + nunits = exact_div (nunits, 2); > + > /* If populating the vector type requires unrolling then fail > before adjusting *max_nunits for basic-block vectorization. */ > if (is_a <bb_vec_info> (vinfo) > - && !multiple_p (group_size, TYPE_VECTOR_SUBPARTS (vectype))) > + && !multiple_p (group_size , nunits)) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > @@ -891,7 +895,7 @@ vect_record_max_nunits (vec_info *vinfo, stmt_vec_info stmt_info, > } > > /* In case of multiple types we need to detect the smallest type. */ > - vect_update_max_nunits (max_nunits, vectype); > + vect_update_max_nunits (max_nunits, nunits); > return true; > } > > @@ -3720,22 +3724,54 @@ vect_optimize_slp (vec_info *vinfo) > vect_attempt_slp_rearrange_stmts did. This allows us to be lazy > when permuting constants and invariants keeping the permute > bijective. 
*/ > - auto_sbitmap load_index (SLP_TREE_LANES (node)); > - bitmap_clear (load_index); > - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) > - bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin); > - unsigned j; > - for (j = 0; j < SLP_TREE_LANES (node); ++j) > - if (!bitmap_bit_p (load_index, j)) > - break; > - if (j != SLP_TREE_LANES (node)) > - continue; > + /* Permutation of Complex type. */ > + if (STMT_VINFO_COMPLEX_P (dr_stmt)) > + { > + auto_sbitmap load_index (SLP_TREE_LANES (node) * 2); > + bitmap_clear (load_index); > + for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) > + { > + unsigned bit = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin; > + bitmap_set_bit (load_index, 2 * bit); > + bitmap_set_bit (load_index, 2 * bit + 1); > + } > + unsigned j; > + for (j = 0; j < SLP_TREE_LANES (node) * 2; ++j) > + if (!bitmap_bit_p (load_index, j)) > + break; > + if (j != SLP_TREE_LANES (node) * 2) > + continue; > > - vec<unsigned> perm = vNULL; > - perm.safe_grow (SLP_TREE_LANES (node), true); > - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) > - perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin; > - perms.safe_push (perm); > + vec<unsigned> perm = vNULL; > + perm.safe_grow (SLP_TREE_LANES (node) * 2, true); > + for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) > + { > + unsigned cidx = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin; > + perm[2 * j] = 2 * cidx; > + perm[2 * j + 1] = 2 * cidx + 1; > + } > + perms.safe_push (perm); > + } > + else > + { > + auto_sbitmap load_index (SLP_TREE_LANES (node)); > + bitmap_clear (load_index); > + for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) > + bitmap_set_bit (load_index, > + SLP_TREE_LOAD_PERMUTATION (node)[j] - imin); > + unsigned j; > + for (j = 0; j < SLP_TREE_LANES (node); ++j) > + if (!bitmap_bit_p (load_index, j)) > + break; > + if (j != SLP_TREE_LANES (node)) > + continue; > + > + vec<unsigned> perm = vNULL; > + perm.safe_grow (SLP_TREE_LANES (node), true); > + for (unsigned 
j = 0; j < SLP_TREE_LANES (node); ++j) > + perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin; > + perms.safe_push (perm); > + } > vertices[idx].perm_in = perms.length () - 1; > vertices[idx].perm_out = perms.length () - 1; > } > @@ -4518,6 +4554,12 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node, > vf = loop_vinfo->vectorization_factor; > else > vf = 1; > + /* For complex type and SLP, double vf to get the right vectype, > + i.e. for vector(4) double holding complex double with group size 2, > + double vf to map vf * group_size to TYPE_VECTOR_SUBPARTS. */ > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + vf *= 2; > + > unsigned int group_size = SLP_TREE_LANES (node); > tree vectype = SLP_TREE_VECTYPE (node); > SLP_TREE_NUMBER_OF_VEC_STMTS (node) > @@ -4763,10 +4805,17 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node, > } > unsigned group_size = SLP_TREE_LANES (child); > poly_uint64 vf = 1; > + > if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo)) > vf = loop_vinfo->vectorization_factor; > + > + /* V2SF holds just 1 complex element, so multiply by 2 > + to get the real number of vectors. */ > + unsigned cp > + = STMT_VINFO_COMPLEX_P (SLP_TREE_REPRESENTATIVE (node)) ? 2 : 1; > + > SLP_TREE_NUMBER_OF_VEC_STMTS (child) > - = vect_get_num_vectors (vf * group_size, vector_type); > + = vect_get_num_vectors (vf * group_size * cp, vector_type); > /* And cost them. */ > vect_prologue_cost_for_slp (child, cost_vec); > } > @@ -6402,6 +6451,11 @@ vect_create_constant_vectors (vec_info *vinfo, slp_tree op_node) > > /* We always want SLP_TREE_VECTYPE (op_node) here correctly set. */ > vector_type = SLP_TREE_VECTYPE (op_node); > + unsigned int cp = 1; > + /* Handle Complex type vector init. > + SLP_TREE_REPRESENTATIVE (op_node) could be NULL.
*/ > + if (TREE_CODE (TREE_TYPE (op_node->ops[0])) == COMPLEX_TYPE) > + cp = 2; > > unsigned int number_of_vectors = SLP_TREE_NUMBER_OF_VEC_STMTS (op_node); > SLP_TREE_VEC_DEFS (op_node).create (number_of_vectors); > @@ -6426,9 +6480,9 @@ vect_create_constant_vectors (vec_info *vinfo, slp_tree op_node) > /* When using duplicate_and_interleave, we just need one element for > each scalar statement. */ > if (!TYPE_VECTOR_SUBPARTS (vector_type).is_constant (&nunits)) > - nunits = group_size; > + nunits = group_size * cp; > > - number_of_copies = nunits * number_of_vectors / group_size; > + number_of_copies = nunits * number_of_vectors / (group_size * cp); > > number_of_places_left_in_vector = nunits; > constant_p = true; > @@ -6460,8 +6514,23 @@ vect_create_constant_vectors (vec_info *vinfo, slp_tree op_node) > gcc_unreachable (); > } > else > - op = fold_unary (VIEW_CONVERT_EXPR, > - TREE_TYPE (vector_type), op); > + { > + tree scalar_type = TREE_TYPE (vector_type); > + /* For complex type, insert real and imag part > + separately. 
*/ > + if (cp == 2) > + { > + gcc_assert ((TREE_CODE (TREE_TYPE (op)) > + == COMPLEX_TYPE) > + && (scalar_type > + == TREE_TYPE (TREE_TYPE (op)))); > + elts[number_of_places_left_in_vector--] > + = fold_unary (IMAGPART_EXPR, scalar_type, op); > + op = fold_unary (REALPART_EXPR, scalar_type, op); > + } > + else > + op = fold_unary (VIEW_CONVERT_EXPR, scalar_type, op); > + } > gcc_assert (op && CONSTANT_CLASS_P (op)); > } > else > @@ -6481,11 +6550,28 @@ vect_create_constant_vectors (vec_info *vinfo, slp_tree op_node) > } > else > { > - op = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (vector_type), > - op); > - init_stmt > - = gimple_build_assign (new_temp, VIEW_CONVERT_EXPR, > - op); > + tree scalar_type = TREE_TYPE (vector_type); > + if (cp == 2) > + { > + gcc_assert ((TREE_CODE (TREE_TYPE (op)) > + == COMPLEX_TYPE) > + && (scalar_type > + == TREE_TYPE (TREE_TYPE (op)))); > + tree imag = build1 (IMAGPART_EXPR, scalar_type, op); > + op = build1 (REALPART_EXPR, scalar_type, op); > + tree imag_temp = make_ssa_name (scalar_type); > + elts[number_of_places_left_in_vector--] = imag_temp; > + init_stmt = gimple_build_assign (imag_temp, imag); > + gimple_seq_add_stmt (&ctor_seq, init_stmt); > + init_stmt = gimple_build_assign (new_temp, op); > + } > + else > + { > + op = build1 (VIEW_CONVERT_EXPR, scalar_type, op); > + init_stmt > + = gimple_build_assign (new_temp, VIEW_CONVERT_EXPR, > + op); > + } > } > gimple_seq_add_stmt (&ctor_seq, init_stmt); > op = new_temp; > @@ -6696,15 +6782,17 @@ vect_transform_slp_perm_load (vec_info *vinfo, > unsigned int nelts_to_build; > unsigned int nvectors_per_build; > unsigned int in_nlanes; > + unsigned int cp = STMT_VINFO_COMPLEX_P (stmt_info) ? 
2 : 1; > bool repeating_p = (group_size == DR_GROUP_SIZE (stmt_info) > - && multiple_p (nunits, group_size)); > + && multiple_p (nunits, group_size * cp)); > if (repeating_p) > { > /* A single vector contains a whole number of copies of the node, so: > (a) all permutes can use the same mask; and > (b) the permutes only need a single vector input. */ > - mask.new_vector (nunits, group_size, 3); > - nelts_to_build = mask.encoded_nelts (); > + /* For complex type, mask size should be twice nelts_to_build. */ > + mask.new_vector (nunits, group_size * cp, 3); > + nelts_to_build = mask.encoded_nelts () / cp; > nvectors_per_build = SLP_TREE_VEC_STMTS (node).length (); > in_nlanes = DR_GROUP_SIZE (stmt_info) * 3; > } > @@ -6744,8 +6832,8 @@ vect_transform_slp_perm_load (vec_info *vinfo, > { > /* Enforced before the loop when !repeating_p. */ > unsigned int const_nunits = nunits.to_constant (); > - vec_index = i / const_nunits; > - mask_element = i % const_nunits; > + vec_index = i / (const_nunits / cp); > + mask_element = i % (const_nunits / cp); > if (vec_index == first_vec_index > || first_vec_index == -1) > { > @@ -6755,7 +6843,7 @@ vect_transform_slp_perm_load (vec_info *vinfo, > || second_vec_index == -1) > { > second_vec_index = vec_index; > - mask_element += const_nunits; > + mask_element += (const_nunits / cp); > } > else > { > @@ -6768,14 +6856,24 @@ vect_transform_slp_perm_load (vec_info *vinfo, > return false; > } > > - gcc_assert (mask_element < 2 * const_nunits); > + gcc_assert (mask_element < 2 * const_nunits / cp); > } > > if (mask_element != index) > noop_p = false; > - mask[index++] = mask_element; > + /* Set index for complex type. > + i.e. mask like [1,0] is actually [2, 3, 0, 1] > + for vector scalar type.
*/ > + if (cp == 2) > + { > + mask[2 * index] = 2 * mask_element; > + mask[2 * index + 1] = 2 * mask_element + 1; > + } > + else > + mask[index] = mask_element; > + index++; > > - if (index == count && !noop_p) > + if (index * cp == count && !noop_p) > { > indices.new_vector (mask, second_vec_index == -1 ? 1 : 2, nunits); > if (!can_vec_perm_const_p (mode, mode, indices)) > @@ -6799,7 +6897,7 @@ vect_transform_slp_perm_load (vec_info *vinfo, > ++*n_perms; > } > > - if (index == count) > + if (index * cp == count) > { > if (!analyze_only) > { > @@ -6869,7 +6967,7 @@ vect_transform_slp_perm_load (vec_info *vinfo, > bool load_seen = false; > for (unsigned i = 0; i < in_nlanes; ++i) > { > - if (i % const_nunits == 0) > + if (i % (const_nunits * cp) == 0) > { > if (load_seen) > *n_loads += 1; > diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc > index 72107afc883..d6223c28f1c 100644 > --- a/gcc/tree-vect-stmts.cc > +++ b/gcc/tree-vect-stmts.cc > @@ -1397,25 +1397,70 @@ vect_init_vector (vec_info *vinfo, stmt_vec_info stmt_info, tree val, tree type, > { > gimple *init_stmt; > tree new_temp; > + tree scalar_type = TREE_TYPE (type); > + gimple_seq stmts = NULL; > + > + if (TREE_CODE (TREE_TYPE (val)) == COMPLEX_TYPE) > + { > + unsigned HOST_WIDE_INT nunits; > + gcc_assert (TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)); > + > + tree_vector_builder elts (type, nunits, 1); > + tree imag, real; > + if (TREE_CODE (val) == COMPLEX_CST) > + { > + real = fold_unary (REALPART_EXPR, scalar_type, val); > + imag = fold_unary (IMAGPART_EXPR, scalar_type, val); > + } > + else > + { > + real = make_ssa_name (scalar_type); > + imag = make_ssa_name (scalar_type); > + init_stmt > + = gimple_build_assign (real, > + build1 (REALPART_EXPR, scalar_type, val)); > + gimple_seq_add_stmt (&stmts, init_stmt); > + init_stmt > + = gimple_build_assign (imag, > + build1 (IMAGPART_EXPR, scalar_type, val)); > + gimple_seq_add_stmt (&stmts, init_stmt); > + } > > + /* Build vector as 
[real,imag,real,imag,...]. */ > + for (unsigned i = 0; i != nunits; i++) > + { > + if (i % 2) > + elts.quick_push (imag); > + else > + elts.quick_push (real); > + } > + val = gimple_build_vector (&stmts, &elts); > + if (!gimple_seq_empty_p (stmts)) > + { > + if (gsi) > + gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT); > + else > + vinfo->insert_seq_on_entry (stmt_info, stmts); > + } > + } > /* We abuse this function to push sth to a SSA name with initial 'val'. */ > - if (! useless_type_conversion_p (type, TREE_TYPE (val))) > + else if (! useless_type_conversion_p (type, TREE_TYPE (val))) > { > gcc_assert (TREE_CODE (type) == VECTOR_TYPE); > - if (! types_compatible_p (TREE_TYPE (type), TREE_TYPE (val))) > + if (! types_compatible_p (scalar_type, TREE_TYPE (val))) > { > /* Scalar boolean value should be transformed into > all zeros or all ones value before building a vector. */ > if (VECTOR_BOOLEAN_TYPE_P (type)) > { > - tree true_val = build_all_ones_cst (TREE_TYPE (type)); > - tree false_val = build_zero_cst (TREE_TYPE (type)); > + tree true_val = build_all_ones_cst (scalar_type); > + tree false_val = build_zero_cst (scalar_type); > > if (CONSTANT_CLASS_P (val)) > val = integer_zerop (val) ? false_val : true_val; > else > { > - new_temp = make_ssa_name (TREE_TYPE (type)); > + new_temp = make_ssa_name (scalar_type); > init_stmt = gimple_build_assign (new_temp, COND_EXPR, > val, true_val, false_val); > vect_init_vector_1 (vinfo, stmt_info, init_stmt, gsi); > @@ -1424,14 +1469,13 @@ vect_init_vector (vec_info *vinfo, stmt_vec_info stmt_info, tree val, tree type, > } > else > { > - gimple_seq stmts = NULL; > if (! INTEGRAL_TYPE_P (TREE_TYPE (val))) > val = gimple_build (&stmts, VIEW_CONVERT_EXPR, > - TREE_TYPE (type), val); > + scalar_type, val); > else > /* ??? Condition vectorization expects us to do > promotion of invariant/external defs. 
*/ > - val = gimple_convert (&stmts, TREE_TYPE (type), val); > + val = gimple_convert (&stmts, scalar_type, val); > for (gimple_stmt_iterator gsi2 = gsi_start (stmts); > !gsi_end_p (gsi2); ) > { > @@ -1496,7 +1540,12 @@ vect_get_vec_defs_for_operand (vec_info *vinfo, stmt_vec_info stmt_vinfo, > && VECTOR_BOOLEAN_TYPE_P (stmt_vectype)) > vector_type = truth_type_for (stmt_vectype); > else > - vector_type = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op)); > + { > + tree scalar_type = TREE_TYPE (op); > + if (STMT_VINFO_COMPLEX_P (stmt_vinfo)) > + scalar_type = TREE_TYPE (scalar_type); > + vector_type = get_vectype_for_scalar_type (loop_vinfo, scalar_type); > + } > > gcc_assert (vector_type); > tree vop = vect_init_vector (vinfo, stmt_vinfo, op, vector_type, NULL); > @@ -1892,6 +1941,13 @@ vect_truncate_gather_scatter_offset (stmt_vec_info stmt_info, > return false; > } > > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Complex type doesn't support gather_scatter.\n"); > + return false; > + } > /* Get the number of bits in an element. */ > tree vectype = STMT_VINFO_VECTYPE (stmt_info); > scalar_mode element_mode = SCALAR_TYPE_MODE (TREE_TYPE (vectype)); > @@ -2022,6 +2078,30 @@ perm_mask_for_reverse (tree vectype) > return vect_gen_perm_mask_checked (vectype, indices); > } > > +static tree > +perm_mask_for_reverse (tree vectype, bool complex_p) > +{ > + if (!complex_p) > + return perm_mask_for_reverse (vectype); > + > + unsigned HOST_WIDE_INT nunits; > + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&nunits)); > + > + /* The encoding has a single stepped pattern. 
*/ > + vec_perm_builder sel (nunits, nunits, 1); > + for (unsigned i = 0; i < nunits; i+=2) > + { > + sel.quick_push (nunits - 2 - i); > + sel.quick_push (nunits - 1 - i); > + } > + > + vec_perm_indices indices (sel, 1, nunits); > + if (!can_vec_perm_const_p (TYPE_MODE (vectype), TYPE_MODE (vectype), > + indices)) > + return NULL_TREE; > + return vect_gen_perm_mask_checked (vectype, indices); > +} > + > /* A subroutine of get_load_store_type, with a subset of the same > arguments. Handle the case where STMT_INFO is a load or store that > accesses consecutive elements with a negative step. Sets *POFFSET > @@ -2045,8 +2125,12 @@ get_negative_load_store_type (vec_info *vinfo, > } > > /* For backward running DRs the first access in vectype actually is > - N-1 elements before the address of the DR. */ > - *poffset = ((-TYPE_VECTOR_SUBPARTS (vectype) + 1) > + N-1 elements before the address of the DR. > + for Complex type, it's N - 2. */ > + unsigned cp = 1; > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + cp = 2; > + *poffset = ((-TYPE_VECTOR_SUBPARTS (vectype) + cp) > * TREE_INT_CST_LOW (TYPE_SIZE_UNIT (TREE_TYPE (vectype)))); > > int misalignment = dr_misalignment (dr_info, vectype, *poffset); > @@ -2071,7 +2155,7 @@ get_negative_load_store_type (vec_info *vinfo, > return VMAT_CONTIGUOUS_DOWN; > } > > - if (!perm_mask_for_reverse (vectype)) > + if (!perm_mask_for_reverse (vectype, STMT_VINFO_COMPLEX_P (stmt_info))) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > @@ -2188,6 +2272,8 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info, > && !DR_GROUP_NEXT_ELEMENT (stmt_info)); > unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info); > poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + nunits = exact_div (nunits, 2); > > /* True if the vectorized statements would access beyond the last > statement in the group. 
*/ > @@ -2352,7 +2438,11 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info, > { > /* First cope with the degenerate case of a single-element > vector. */ > - if (known_eq (TYPE_VECTOR_SUBPARTS (vectype), 1U)) > + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + nunits = exact_div (nunits, 2); > + > + if (known_eq (nunits, 1U)) > ; > > /* Otherwise try using LOAD/STORE_LANES. */ > @@ -2361,6 +2451,8 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info, > : vect_store_lanes_supported (vectype, group_size, > masked_p)) > { > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + return false; > *memory_access_type = VMAT_LOAD_STORE_LANES; > overrun_p = would_overrun_p; > } > @@ -2620,6 +2712,14 @@ vect_check_scalar_mask (vec_info *vinfo, stmt_vec_info stmt_info, > return false; > } > > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + { > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > + "Complex type doesn't support mask argument.\n"); > + return false; > + } > + > if (!VECT_SCALAR_BOOLEAN_TYPE_P (TREE_TYPE (*mask))) > { > if (dump_enabled_p ()) > @@ -7509,8 +7609,17 @@ vectorizable_store (vec_info *vinfo, > same location twice. 
*/ > gcc_assert (slp == PURE_SLP_STMT (stmt_info)); > > + if (!STMT_VINFO_DATA_REF (stmt_info)) > + return false; > + > tree vectype = STMT_VINFO_VECTYPE (stmt_info), rhs_vectype = NULL_TREE; > poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + { > + if (!nunits.is_constant ()) > + return false; > + nunits = exact_div (nunits, 2); > + } > > if (loop_vinfo) > { > @@ -7526,7 +7635,8 @@ vectorizable_store (vec_info *vinfo, > if (slp) > ncopies = 1; > else > - ncopies = vect_get_num_copies (loop_vinfo, vectype); > + ncopies = vect_get_num_copies (loop_vinfo, vectype, > + STMT_VINFO_COMPLEX_P (stmt_info)); > > gcc_assert (ncopies >= 1); > > @@ -7544,11 +7654,10 @@ vectorizable_store (vec_info *vinfo, > return false; > > elem_type = TREE_TYPE (vectype); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + elem_type = build_complex_type (elem_type); > vec_mode = TYPE_MODE (vectype); > > - if (!STMT_VINFO_DATA_REF (stmt_info)) > - return false; > - > vect_memory_access_type memory_access_type; > enum dr_alignment_support alignment_support_scheme; > int misalignment; > @@ -7951,21 +8060,31 @@ vectorizable_store (vec_info *vinfo, > tree lvectype = vectype; > if (slp) > { > + scalar_mode elmode; > if (group_size < const_nunits > && const_nunits % group_size == 0) > { > nstores = const_nunits / group_size; > - lnel = group_size; > - ltype = build_vector_type (elem_type, group_size); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + { > + lnel = group_size * 2; > + ltype = build_vector_type (TREE_TYPE (elem_type), group_size * 2); > + elmode = SCALAR_TYPE_MODE (TREE_TYPE (elem_type)); > + } > + else > + { > + ltype = build_vector_type (elem_type, group_size); > + lnel = group_size; > + elmode = SCALAR_TYPE_MODE (elem_type); > + } > lvectype = vectype; > > /* First check if vec_extract optab doesn't support extraction > of vector elts directly. 
*/ > - scalar_mode elmode = SCALAR_TYPE_MODE (elem_type); > machine_mode vmode; > if (!VECTOR_MODE_P (TYPE_MODE (vectype)) > || !related_vector_mode (TYPE_MODE (vectype), elmode, > - group_size).exists (&vmode) > + lnel).exists (&vmode) > || (convert_optab_handler (vec_extract_optab, > TYPE_MODE (vectype), vmode) > == CODE_FOR_nothing)) > @@ -8051,6 +8170,8 @@ vectorizable_store (vec_info *vinfo, > unsigned int group_el = 0; > unsigned HOST_WIDE_INT > elsz = tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype))); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + elsz *= 2; > for (j = 0; j < ncopies; j++) > { > vec_oprnd = vec_oprnds[j]; > @@ -8448,7 +8569,9 @@ vectorizable_store (vec_info *vinfo, > > if (memory_access_type == VMAT_CONTIGUOUS_REVERSE) > { > - tree perm_mask = perm_mask_for_reverse (vectype); > + tree perm_mask > + = perm_mask_for_reverse (vectype, > + STMT_VINFO_COMPLEX_P (stmt_info)); > tree perm_dest = vect_create_destination_var > (vect_get_store_rhs (stmt_info), vectype); > tree new_temp = make_ssa_name (perm_dest); > @@ -8778,6 +8901,12 @@ vectorizable_load (vec_info *vinfo, > > tree vectype = STMT_VINFO_VECTYPE (stmt_info); > poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + { > + if (!nunits.is_constant ()) > + return false; > + nunits = exact_div (nunits, 2); > + } > > if (loop_vinfo) > { > @@ -8794,7 +8923,8 @@ vectorizable_load (vec_info *vinfo, > if (slp) > ncopies = 1; > else > - ncopies = vect_get_num_copies (loop_vinfo, vectype); > + ncopies = vect_get_num_copies (loop_vinfo, vectype, > + STMT_VINFO_COMPLEX_P (stmt_info)); > > gcc_assert (ncopies >= 1); > > @@ -8822,6 +8952,8 @@ vectorizable_load (vec_info *vinfo, > } > > elem_type = TREE_TYPE (vectype); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + elem_type = build_complex_type (elem_type); > mode = TYPE_MODE (vectype); > > /* FORNOW. 
In some cases can vectorize even if data-type not supported > @@ -8870,8 +9002,11 @@ vectorizable_load (vec_info *vinfo, > if (k > maxk) > maxk = k; > tree vectype = SLP_TREE_VECTYPE (slp_node); > + /* For complex type, half the nunits. */ > if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant (&nunits) > - || maxk >= (DR_GROUP_SIZE (group_info) & ~(nunits - 1))) > + || maxk >= (DR_GROUP_SIZE (group_info) > + & ~((STMT_VINFO_COMPLEX_P (group_info) > + ? nunits >> 1 : nunits) - 1))) > { > if (dump_enabled_p ()) > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, > @@ -9098,9 +9233,10 @@ vectorizable_load (vec_info *vinfo, > } > else > { > + unsigned cp = STMT_VINFO_COMPLEX_P (stmt_info) ? 2 : 1; > if (grouped_load) > cst_offset > - = (tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype))) > + = (tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype))) * cp > * vect_get_place_in_interleaving_chain (stmt_info, > first_stmt_info)); > group_size = 1; > @@ -9150,6 +9286,8 @@ vectorizable_load (vec_info *vinfo, > int nloads = const_nunits; > int lnel = 1; > tree ltype = TREE_TYPE (vectype); > + if (STMT_VINFO_COMPLEX_P (stmt_info)) > + ltype = build_complex_type (ltype); > tree lvectype = vectype; > auto_vec<tree> dr_chain; > if (memory_access_type == VMAT_STRIDED_SLP) > @@ -10080,7 +10218,9 @@ vectorizable_load (vec_info *vinfo, > > if (memory_access_type == VMAT_CONTIGUOUS_REVERSE) > { > - tree perm_mask = perm_mask_for_reverse (vectype); > + tree perm_mask > + = perm_mask_for_reverse (vectype, > + STMT_VINFO_COMPLEX_P (stmt_info)); > new_temp = permute_vec_elements (vinfo, new_temp, new_temp, > perm_mask, stmt_info, gsi); > new_stmt = SSA_NAME_DEF_STMT (new_temp); > @@ -12499,12 +12639,27 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info, > dump_printf_loc (MSG_NOTE, vect_location, > "get vectype for scalar type: %T\n", scalar_type); > } > + > + tree orig_scalar_type = scalar_type; > + if (TREE_CODE (scalar_type) == COMPLEX_TYPE) > + { > + /* Set 
complex_p for BB vectorizer. */ > + STMT_VINFO_COMPLEX_P (stmt_info) = true; > + scalar_type = TREE_TYPE (scalar_type); > + /* Double group_size for BB vectorizer to make > + following 2 get_vectype_for_scalar_type return wanted vectype. > + Real group size is not changed, just make the "faked" input > + group_size. */ > + group_size *= 2; > + } > vectype = get_vectype_for_scalar_type (vinfo, scalar_type, group_size); > - if (!vectype) > + if (!vectype > + || (STMT_VINFO_COMPLEX_P (stmt_info) > + && !TYPE_VECTOR_SUBPARTS (vectype).is_constant ())) > return opt_result::failure_at (stmt, > "not vectorized:" > " unsupported data-type %T\n", > - scalar_type); > + orig_scalar_type); > > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, "vectype: %T\n", vectype); > @@ -12529,16 +12684,30 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info, > TREE_TYPE (vectype)); > if (scalar_type != TREE_TYPE (vectype)) > { > - if (dump_enabled_p ()) > + tree orig_scalar_type = scalar_type; > + if (TREE_CODE (scalar_type) == COMPLEX_TYPE) > + { > + /* Set complex_p for Loop vectorizer. 
*/ > + STMT_VINFO_COMPLEX_P (stmt_info) = true; > + scalar_type = TREE_TYPE (scalar_type); > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "get complex for smallest scalar type: %T\n", > + scalar_type); > + > + } > + else if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > "get vectype for smallest scalar type: %T\n", > scalar_type); > nunits_vectype = get_vectype_for_scalar_type (vinfo, scalar_type, > group_size); > - if (!nunits_vectype) > + if (!nunits_vectype > + || (STMT_VINFO_COMPLEX_P (stmt_info) > + && !TYPE_VECTOR_SUBPARTS (nunits_vectype).is_constant ())) > return opt_result::failure_at > (stmt, "not vectorized: unsupported data-type %T\n", > - scalar_type); > + orig_scalar_type); > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, "nunits vectype: %T\n", > nunits_vectype); > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h > index e5fdc9e0a14..4a809e492c4 100644 > --- a/gcc/tree-vectorizer.h > +++ b/gcc/tree-vectorizer.h > @@ -1161,6 +1161,9 @@ public: > vectorization. */ > bool vectorizable; > > + /* The scalar type of the LHS of this statement is complex type. */ > + bool complex_p; > + > /* The stmt to which this info struct refers to. 
*/ > gimple *stmt; > > @@ -1395,6 +1398,7 @@ struct gather_scatter_info { > #define STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT(S) (S)->reduc_epilogue_adjustment > #define STMT_VINFO_REDUC_IDX(S) (S)->reduc_idx > #define STMT_VINFO_FORCE_SINGLE_CYCLE(S) (S)->force_single_cycle > +#define STMT_VINFO_COMPLEX_P(S) (S)->complex_p > > #define STMT_VINFO_DR_WRT_VEC_LOOP(S) (S)->dr_wrt_vec_loop > #define STMT_VINFO_DR_BASE_ADDRESS(S) (S)->dr_wrt_vec_loop.base_address > @@ -1970,6 +1974,15 @@ vect_get_num_copies (loop_vec_info loop_vinfo, tree vectype) > return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo), vectype); > } > > +static inline unsigned int > +vect_get_num_copies (loop_vec_info loop_vinfo, tree vectype, bool complex_p) > +{ > + poly_uint64 nunits = LOOP_VINFO_VECT_FACTOR (loop_vinfo); > + if (complex_p) > + nunits *= 2; > + return vect_get_num_vectors (nunits, vectype); > +} > + > /* Update maximum unit count *MAX_NUNITS so that it accounts for > NUNITS. *MAX_NUNITS can be 1 if we haven't yet recorded anything. */ > > -- > 2.18.1 >
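As a reading aid for the complex-aware perm_mask_for_reverse overload in the patch above, here is a standalone sketch of the selector it builds. This is not GCC code: complex_reverse_selector is a hypothetical helper, std::vector stands in for vec_perm_builder, and an even, compile-time-known lane count is assumed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of GCC): build the permutation selector
// that reverses a vector of NUNITS scalar lanes holding NUNITS / 2 complex
// values, while keeping each (real, imag) pair in its original order.
std::vector<unsigned> complex_reverse_selector(unsigned nunits)
{
  // Assumption: lanes come in (real, imag) pairs, so NUNITS must be even.
  assert(nunits % 2 == 0);

  std::vector<unsigned> sel;
  sel.reserve(nunits);
  for (unsigned i = 0; i < nunits; i += 2)
    {
      sel.push_back(nunits - 2 - i);  // real part of the mirrored pair
      sel.push_back(nunits - 1 - i);  // imaginary part of the mirrored pair
    }
  return sel;
}
```

For a 4-lane vector this yields { 2, 3, 0, 1 } and for 8 lanes { 6, 7, 4, 5, 2, 3, 0, 1 }, which are the selectors the VEC_PERM_EXPR scan patterns in pr106010-3a.c look for.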
diff --git a/gcc/testsuite/gcc.target/i386/pr106010-1a.c b/gcc/testsuite/gcc.target/i386/pr106010-1a.c new file mode 100644 index 00000000000..b608f484934 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-1a.c @@ -0,0 +1,58 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-vect-details -mprefer-vector-width=256" } */ +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 6 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 2 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 2 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 2 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) int>} 2 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 2 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 2 "vect" } } */ + +#define N 10000 +void +__attribute__((noipa)) +foo_pd (_Complex double* a, _Complex double* b) +{ + for (int i = 0; i != N; i++) + a[i] = b[i]; +} + +void +__attribute__((noipa)) +foo_ps (_Complex float* a, _Complex float* b) +{ + for (int i = 0; i != N; i++) + a[i] = b[i]; +} + +void +__attribute__((noipa)) +foo_epi64 (_Complex long long* a, _Complex long long* b) +{ + for (int i = 0; i != N; i++) + a[i] = b[i]; +} + +void +__attribute__((noipa)) +foo_epi32 (_Complex int* a, _Complex int* b) +{ + for (int i = 0; i != N; i++) + a[i] = b[i]; +} + +void +__attribute__((noipa)) +foo_epi16 (_Complex short* a, _Complex short* b) +{ + for (int i = 0; i != N; i++) + a[i] = b[i]; +} + +void +__attribute__((noipa)) +foo_epi8 (_Complex char* a, _Complex char* b) +{ + for (int i = 0; i != N; i++) + a[i] = b[i]; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-1b.c b/gcc/testsuite/gcc.target/i386/pr106010-1b.c new 
file mode 100644 index 00000000000..0f377c3a548 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-1b.c @@ -0,0 +1,63 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ +/* { dg-require-effective-target avx } */ + +#include "avx-check.h" +#include <string.h> +#include "pr106010-1a.c" + +void +avx_test (void) +{ + _Complex double* pd_src = (_Complex double*) malloc (2 * N * sizeof (double)); + _Complex double* pd_dst = (_Complex double*) malloc (2 * N * sizeof (double)); + _Complex float* ps_src = (_Complex float*) malloc (2 * N * sizeof (float)); + _Complex float* ps_dst = (_Complex float*) malloc (2 * N * sizeof (float)); + _Complex long long* epi64_src = (_Complex long long*) malloc (2 * N * sizeof (long long)); + _Complex long long* epi64_dst = (_Complex long long*) malloc (2 * N * sizeof (long long)); + _Complex int* epi32_src = (_Complex int*) malloc (2 * N * sizeof (int)); + _Complex int* epi32_dst = (_Complex int*) malloc (2 * N * sizeof (int)); + _Complex short* epi16_src = (_Complex short*) malloc (2 * N * sizeof (short)); + _Complex short* epi16_dst = (_Complex short*) malloc (2 * N * sizeof (short)); + _Complex char* epi8_src = (_Complex char*) malloc (2 * N * sizeof (char)); + _Complex char* epi8_dst = (_Complex char*) malloc (2 * N * sizeof (char)); + char* p_init = (char*) malloc (2 * N * sizeof (double)); + + __builtin_memset (pd_dst, 0, 2 * N * sizeof (double)); + __builtin_memset (ps_dst, 0, 2 * N * sizeof (float)); + __builtin_memset (epi64_dst, 0, 2 * N * sizeof (long long)); + __builtin_memset (epi32_dst, 0, 2 * N * sizeof (int)); + __builtin_memset (epi16_dst, 0, 2 * N * sizeof (short)); + __builtin_memset (epi8_dst, 0, 2 * N * sizeof (char)); + + for (int i = 0; i != 2 * N * sizeof (double); i++) + p_init[i] = i; + + memcpy (pd_src, p_init, 2 * N * sizeof (double)); + memcpy (ps_src, p_init, 2 * N * sizeof (float)); + memcpy (epi64_src, p_init, 2 * N * sizeof 
(long long)); + memcpy (epi32_src, p_init, 2 * N * sizeof (int)); + memcpy (epi16_src, p_init, 2 * N * sizeof (short)); + memcpy (epi8_src, p_init, 2 * N * sizeof (char)); + + foo_pd (pd_dst, pd_src); + foo_ps (ps_dst, ps_src); + foo_epi64 (epi64_dst, epi64_src); + foo_epi32 (epi32_dst, epi32_src); + foo_epi16 (epi16_dst, epi16_src); + foo_epi8 (epi8_dst, epi8_src); + if (__builtin_memcmp (pd_dst, pd_src, N * 2 * sizeof (double)) != 0) + __builtin_abort (); + if (__builtin_memcmp (ps_dst, ps_src, N * 2 * sizeof (float)) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi64_dst, epi64_src, N * 2 * sizeof (long long)) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi32_dst, epi32_src, N * 2 * sizeof (int)) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi16_dst, epi16_src, N * 2 * sizeof (short)) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi8_dst, epi8_src, N * 2 * sizeof (char)) != 0) + __builtin_abort (); + + return; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-1c.c b/gcc/testsuite/gcc.target/i386/pr106010-1c.c new file mode 100644 index 00000000000..f07e9fb2d3d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-1c.c @@ -0,0 +1,41 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-vect-details" } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 2 "vect" } } */ +/* { dg-require-effective-target avx512fp16 } */ + +#include <string.h> + +static void do_test (void); + +#define DO_TEST do_test +#define AVX512FP16 +#include "avx512-check.h" + +#define N 10000 + +void +__attribute__((noipa)) +foo_ph (_Complex _Float16* a, _Complex _Float16* b) +{ + for (int i = 0; i != N; i++) + a[i] = b[i]; +} + +static void +do_test (void) +{ + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (2 * N * sizeof (_Float16)); + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (2 * N * sizeof 
(_Float16)); + char* p_init = (char*) malloc (2 * N * sizeof (_Float16)); + + __builtin_memset (ph_dst, 0, 2 * N * sizeof (_Float16)); + + for (int i = 0; i != 2 * N * sizeof (_Float16); i++) + p_init[i] = i; + + memcpy (ph_src, p_init, 2 * N * sizeof (_Float16)); + + foo_ph (ph_dst, ph_src); + if (__builtin_memcmp (ph_dst, ph_src, N * 2 * sizeof (_Float16)) != 0) + __builtin_abort (); +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-2a.c b/gcc/testsuite/gcc.target/i386/pr106010-2a.c new file mode 100644 index 00000000000..d2e2f8d4f43 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-2a.c @@ -0,0 +1,82 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-slp-details -mprefer-vector-width=256" } */ +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 6 "slp2" } }*/ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) int>} 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 2 "slp2" } } */ + +void +__attribute__((noipa)) +foo_pd (_Complex double* a, _Complex double* __restrict b) +{ + a[0] = b[0]; + a[1] = b[1]; +} + +void +__attribute__((noipa)) +foo_ps (_Complex float* a, _Complex float* __restrict b) +{ + a[0] = b[0]; + a[1] = b[1]; + a[2] = b[2]; + a[3] = b[3]; + +} + +void +__attribute__((noipa)) +foo_epi64 (_Complex long long* a, _Complex long long* __restrict b) +{ + a[0] = b[0]; + a[1] = b[1]; +} + +void +__attribute__((noipa)) +foo_epi32 (_Complex int* a, _Complex int* 
__restrict b) +{ + a[0] = b[0]; + a[1] = b[1]; + a[2] = b[2]; + a[3] = b[3]; +} + +void +__attribute__((noipa)) +foo_epi16 (_Complex short* a, _Complex short* __restrict b) +{ + a[0] = b[0]; + a[1] = b[1]; + a[2] = b[2]; + a[3] = b[3]; + a[4] = b[4]; + a[5] = b[5]; + a[6] = b[6]; + a[7] = b[7]; +} + +void +__attribute__((noipa)) +foo_epi8 (_Complex char* a, _Complex char* __restrict b) +{ + a[0] = b[0]; + a[1] = b[1]; + a[2] = b[2]; + a[3] = b[3]; + a[4] = b[4]; + a[5] = b[5]; + a[6] = b[6]; + a[7] = b[7]; + a[8] = b[8]; + a[9] = b[9]; + a[10] = b[10]; + a[11] = b[11]; + a[12] = b[12]; + a[13] = b[13]; + a[14] = b[14]; + a[15] = b[15]; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-2b.c b/gcc/testsuite/gcc.target/i386/pr106010-2b.c new file mode 100644 index 00000000000..ac360752693 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-2b.c @@ -0,0 +1,62 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ +/* { dg-require-effective-target avx } */ + +#include "avx-check.h" +#include <string.h> +#include "pr106010-2a.c" + +void +avx_test (void) +{ + _Complex double* pd_src = (_Complex double*) malloc (32); + _Complex double* pd_dst = (_Complex double*) malloc (32); + _Complex float* ps_src = (_Complex float*) malloc (32); + _Complex float* ps_dst = (_Complex float*) malloc (32); + _Complex long long* epi64_src = (_Complex long long*) malloc (32); + _Complex long long* epi64_dst = (_Complex long long*) malloc (32); + _Complex int* epi32_src = (_Complex int*) malloc (32); + _Complex int* epi32_dst = (_Complex int*) malloc (32); + _Complex short* epi16_src = (_Complex short*) malloc (32); + _Complex short* epi16_dst = (_Complex short*) malloc (32); + _Complex char* epi8_src = (_Complex char*) malloc (32); + _Complex char* epi8_dst = (_Complex char*) malloc (32); + char* p = (char* ) malloc (32); + + __builtin_memset (pd_dst, 0, 32); + __builtin_memset (ps_dst, 0, 32); + 
__builtin_memset (epi64_dst, 0, 32); + __builtin_memset (epi32_dst, 0, 32); + __builtin_memset (epi16_dst, 0, 32); + __builtin_memset (epi8_dst, 0, 32); + + for (int i = 0; i != 32; i++) + p[i] = i; + __builtin_memcpy (pd_src, p, 32); + __builtin_memcpy (ps_src, p, 32); + __builtin_memcpy (epi64_src, p, 32); + __builtin_memcpy (epi32_src, p, 32); + __builtin_memcpy (epi16_src, p, 32); + __builtin_memcpy (epi8_src, p, 32); + + foo_pd (pd_dst, pd_src); + foo_ps (ps_dst, ps_src); + foo_epi64 (epi64_dst, epi64_src); + foo_epi32 (epi32_dst, epi32_src); + foo_epi16 (epi16_dst, epi16_src); + foo_epi8 (epi8_dst, epi8_src); + if (__builtin_memcmp (pd_dst, pd_src, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (ps_dst, ps_src, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi64_dst, epi64_src, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi32_dst, epi32_src, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi16_dst, epi16_src, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi8_dst, epi8_src, 32) != 0) + __builtin_abort (); + + return; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-2c.c b/gcc/testsuite/gcc.target/i386/pr106010-2c.c new file mode 100644 index 00000000000..a002f209ec9 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-2c.c @@ -0,0 +1,47 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-slp-details" } */ +/* { dg-require-effective-target avx512fp16 } */ + +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 1 "slp2" } }*/ + +#include <string.h> + +static void do_test (void); +#define DO_TEST do_test +#define AVX512FP16 +#include "avx512-check.h" + +void +__attribute__((noipa)) +foo_ph (_Complex _Float16* a, _Complex _Float16* __restrict b) +{ + a[0] = 
b[0]; + a[1] = b[1]; + a[2] = b[2]; + a[3] = b[3]; + a[4] = b[4]; + a[5] = b[5]; + a[6] = b[6]; + a[7] = b[7]; +} + +void +do_test (void) +{ + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (32); + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (32); + char* p = (char* ) malloc (32); + + __builtin_memset (ph_dst, 0, 32); + + for (int i = 0; i != 32; i++) + p[i] = i; + __builtin_memcpy (ph_src, p, 32); + + foo_ph (ph_dst, ph_src); + if (__builtin_memcmp (ph_dst, ph_src, 32) != 0) + __builtin_abort (); + + return; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-3a.c b/gcc/testsuite/gcc.target/i386/pr106010-3a.c new file mode 100644 index 00000000000..c1b64b56b1c --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-3a.c @@ -0,0 +1,80 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx2 -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-slp-details" } */ +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 6 "slp2" } }*/ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 2, 3, 0, 1 \}} 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 6, 7, 4, 5, 2, 3, 0, 1 \}} 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 2, 3, 0, 1, 6, 7, 4, 5 \}} 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 \}} 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1, 30, 31, 28, 29, 26, 27, 24, 25, 22, 23, 20, 21, 18, 19, 16, 17 \}} 1 "slp2" } } */ + +void +__attribute__((noipa)) +foo_pd (_Complex double* a, _Complex double* __restrict b) +{ + a[0] = b[1]; + a[1] = b[0]; +} + +void +__attribute__((noipa)) +foo_ps (_Complex float* a, _Complex float* __restrict b) +{ + a[0] = b[1]; + a[1] = b[0]; + a[2] = b[3]; + a[3] = 
b[2]; +} + +void +__attribute__((noipa)) +foo_epi64 (_Complex long long* a, _Complex long long* __restrict b) +{ + a[0] = b[1]; + a[1] = b[0]; +} + +void +__attribute__((noipa)) +foo_epi32 (_Complex int* a, _Complex int* __restrict b) +{ + a[0] = b[3]; + a[1] = b[2]; + a[2] = b[1]; + a[3] = b[0]; +} + +void +__attribute__((noipa)) +foo_epi16 (_Complex short* a, _Complex short* __restrict b) +{ + a[0] = b[7]; + a[1] = b[6]; + a[2] = b[5]; + a[3] = b[4]; + a[4] = b[3]; + a[5] = b[2]; + a[6] = b[1]; + a[7] = b[0]; +} + +void +__attribute__((noipa)) +foo_epi8 (_Complex char* a, _Complex char* __restrict b) +{ + a[0] = b[7]; + a[1] = b[6]; + a[2] = b[5]; + a[3] = b[4]; + a[4] = b[3]; + a[5] = b[2]; + a[6] = b[1]; + a[7] = b[0]; + a[8] = b[15]; + a[9] = b[14]; + a[10] = b[13]; + a[11] = b[12]; + a[12] = b[11]; + a[13] = b[10]; + a[14] = b[9]; + a[15] = b[8]; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-3b.c b/gcc/testsuite/gcc.target/i386/pr106010-3b.c new file mode 100644 index 00000000000..e4fa3f3a541 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-3b.c @@ -0,0 +1,126 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx2 -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ +/* { dg-require-effective-target avx2 } */ + +#include "avx2-check.h" +#include <string.h> +#include "pr106010-3a.c" + +void +avx2_test (void) +{ + _Complex double* pd_src = (_Complex double*) malloc (32); + _Complex double* pd_dst = (_Complex double*) malloc (32); + _Complex double* pd_exp = (_Complex double*) malloc (32); + _Complex float* ps_src = (_Complex float*) malloc (32); + _Complex float* ps_dst = (_Complex float*) malloc (32); + _Complex float* ps_exp = (_Complex float*) malloc (32); + _Complex long long* epi64_src = (_Complex long long*) malloc (32); + _Complex long long* epi64_dst = (_Complex long long*) malloc (32); + _Complex long long* epi64_exp = (_Complex long long*) malloc (32); + _Complex int* epi32_src = (_Complex int*) malloc (32); + 
_Complex int* epi32_dst = (_Complex int*) malloc (32); + _Complex int* epi32_exp = (_Complex int*) malloc (32); + _Complex short* epi16_src = (_Complex short*) malloc (32); + _Complex short* epi16_dst = (_Complex short*) malloc (32); + _Complex short* epi16_exp = (_Complex short*) malloc (32); + _Complex char* epi8_src = (_Complex char*) malloc (32); + _Complex char* epi8_dst = (_Complex char*) malloc (32); + _Complex char* epi8_exp = (_Complex char*) malloc (32); + char* p = (char* ) malloc (32); + char* q = (char* ) malloc (32); + + __builtin_memset (pd_dst, 0, 32); + __builtin_memset (ps_dst, 0, 32); + __builtin_memset (epi64_dst, 0, 32); + __builtin_memset (epi32_dst, 0, 32); + __builtin_memset (epi16_dst, 0, 32); + __builtin_memset (epi8_dst, 0, 32); + + for (int i = 0; i != 32; i++) + p[i] = i; + __builtin_memcpy (pd_src, p, 32); + __builtin_memcpy (ps_src, p, 32); + __builtin_memcpy (epi64_src, p, 32); + __builtin_memcpy (epi32_src, p, 32); + __builtin_memcpy (epi16_src, p, 32); + __builtin_memcpy (epi8_src, p, 32); + + for (int i = 0; i != 16; i++) + { + p[i] = i + 16; + p[i + 16] = i; + } + __builtin_memcpy (pd_exp, p, 32); + __builtin_memcpy (epi64_exp, p, 32); + + for (int i = 0; i != 8; i++) + { + p[i] = i + 8; + p[i + 8] = i; + p[i + 16] = i + 24; + p[i + 24] = i + 16; + q[i] = i + 24; + q[i + 8] = i + 16; + q[i + 16] = i + 8; + q[i + 24] = i; + } + __builtin_memcpy (ps_exp, p, 32); + __builtin_memcpy (epi32_exp, q, 32); + + + for (int i = 0; i != 4; i++) + { + q[i] = i + 28; + q[i + 4] = i + 24; + q[i + 8] = i + 20; + q[i + 12] = i + 16; + q[i + 16] = i + 12; + q[i + 20] = i + 8; + q[i + 24] = i + 4; + q[i + 28] = i; + } + __builtin_memcpy (epi16_exp, q, 32); + + for (int i = 0; i != 2; i++) + { + q[i] = i + 14; + q[i + 2] = i + 12; + q[i + 4] = i + 10; + q[i + 6] = i + 8; + q[i + 8] = i + 6; + q[i + 10] = i + 4; + q[i + 12] = i + 2; + q[i + 14] = i; + q[i + 16] = i + 30; + q[i + 18] = i + 28; + q[i + 20] = i + 26; + q[i + 22] = i + 24; + q[i + 24] = 
i + 22; + q[i + 26] = i + 20; + q[i + 28] = i + 18; + q[i + 30] = i + 16; + } + __builtin_memcpy (epi8_exp, q, 32); + + foo_pd (pd_dst, pd_src); + foo_ps (ps_dst, ps_src); + foo_epi64 (epi64_dst, epi64_src); + foo_epi32 (epi32_dst, epi32_src); + foo_epi16 (epi16_dst, epi16_src); + foo_epi8 (epi8_dst, epi8_src); + if (__builtin_memcmp (pd_dst, pd_exp, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (ps_dst, ps_exp, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi64_dst, epi64_exp, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi32_dst, epi32_exp, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi16_dst, epi16_exp, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi8_dst, epi8_exp, 32) != 0) + __builtin_abort (); + + return; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-3c.c b/gcc/testsuite/gcc.target/i386/pr106010-3c.c new file mode 100644 index 00000000000..5a5a3d4b992 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-3c.c @@ -0,0 +1,69 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-slp-details" } */ +/* { dg-require-effective-target avx512fp16 } */ +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 1 "slp2" } }*/ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 2, 3, 0, 1, 8, 9, 6, 7, 14, 15, 12, 13, 4, 5, 10, 11 \}} 1 "slp2" } } */ + +#include <string.h> + +static void do_test (void); +#define DO_TEST do_test +#define AVX512FP16 +#include "avx512-check.h" + +void +__attribute__((noipa)) +foo_ph (_Complex _Float16* a, _Complex _Float16* __restrict b) +{ + a[0] = b[1]; + a[1] = b[0]; + a[2] = b[4]; + a[3] = b[3]; + a[4] = b[7]; + a[5] = b[6]; + a[6] = b[2]; + a[7] = b[5]; +} + +void +do_test (void) +{ + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (32); + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (32); 
+ _Complex _Float16* ph_exp = (_Complex _Float16*) malloc (32); + char* p = (char* ) malloc (32); + char* q = (char* ) malloc (32); + + __builtin_memset (ph_dst, 0, 32); + + for (int i = 0; i != 32; i++) + p[i] = i; + __builtin_memcpy (ph_src, p, 32); + + for (int i = 0; i != 4; i++) + { + p[i] = i + 4; + p[i + 4] = i; + p[i + 8] = i + 16; + p[i + 12] = i + 12; + p[i + 16] = i + 28; + p[i + 20] = i + 24; + p[i + 24] = i + 8; + p[i + 28] = i + 20; + q[i] = i + 28; + q[i + 4] = i + 24; + q[i + 8] = i + 20; + q[i + 12] = i + 16; + q[i + 16] = i + 12; + q[i + 20] = i + 8; + q[i + 24] = i + 4; + q[i + 28] = i; + } + __builtin_memcpy (ph_exp, p, 32); + + foo_ph (ph_dst, ph_src); + if (__builtin_memcmp (ph_dst, ph_exp, 32) != 0) + __builtin_abort (); + + return; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-4a.c b/gcc/testsuite/gcc.target/i386/pr106010-4a.c new file mode 100644 index 00000000000..b7b0b532bb1 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-4a.c @@ -0,0 +1,101 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-slp-details" } */ +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 6 "slp2" } }*/ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) int>} 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 1 "slp2" } } */ + +void +__attribute__((noipa)) +foo_pd (_Complex double* a, + _Complex double b1, + _Complex double b2) +{ + a[0] = b1; + a[1] = b2; +} + +void 
+__attribute__((noipa)) +foo_ps (_Complex float* a, + _Complex float b1, _Complex float b2, + _Complex float b3, _Complex float b4) +{ + a[0] = b1; + a[1] = b2; + a[2] = b3; + a[3] = b4; +} + +void +__attribute__((noipa)) +foo_epi64 (_Complex long long* a, + _Complex long long b1, + _Complex long long b2) +{ + a[0] = b1; + a[1] = b2; +} + +void +__attribute__((noipa)) +foo_epi32 (_Complex int* a, + _Complex int b1, _Complex int b2, + _Complex int b3, _Complex int b4) +{ + a[0] = b1; + a[1] = b2; + a[2] = b3; + a[3] = b4; +} + +void +__attribute__((noipa)) +foo_epi16 (_Complex short* a, + _Complex short b1, _Complex short b2, + _Complex short b3, _Complex short b4, + _Complex short b5, _Complex short b6, + _Complex short b7,_Complex short b8) +{ + a[0] = b1; + a[1] = b2; + a[2] = b3; + a[3] = b4; + a[4] = b5; + a[5] = b6; + a[6] = b7; + a[7] = b8; +} + +void +__attribute__((noipa)) +foo_epi8 (_Complex char* a, + _Complex char b1, _Complex char b2, + _Complex char b3, _Complex char b4, + _Complex char b5, _Complex char b6, + _Complex char b7,_Complex char b8, + _Complex char b9, _Complex char b10, + _Complex char b11, _Complex char b12, + _Complex char b13, _Complex char b14, + _Complex char b15,_Complex char b16) +{ + a[0] = b1; + a[1] = b2; + a[2] = b3; + a[3] = b4; + a[4] = b5; + a[5] = b6; + a[6] = b7; + a[7] = b8; + a[8] = b9; + a[9] = b10; + a[10] = b11; + a[11] = b12; + a[12] = b13; + a[13] = b14; + a[14] = b15; + a[15] = b16; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-4b.c b/gcc/testsuite/gcc.target/i386/pr106010-4b.c new file mode 100644 index 00000000000..e2e79508c4b --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-4b.c @@ -0,0 +1,67 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ +/* { dg-require-effective-target avx } */ + +#include "avx-check.h" +#include <string.h> +#include "pr106010-4a.c" + +void +avx_test (void) +{ + _Complex double* pd_src = 
(_Complex double*) malloc (32); + _Complex double* pd_dst = (_Complex double*) malloc (32); + _Complex float* ps_src = (_Complex float*) malloc (32); + _Complex float* ps_dst = (_Complex float*) malloc (32); + _Complex long long* epi64_src = (_Complex long long*) malloc (32); + _Complex long long* epi64_dst = (_Complex long long*) malloc (32); + _Complex int* epi32_src = (_Complex int*) malloc (32); + _Complex int* epi32_dst = (_Complex int*) malloc (32); + _Complex short* epi16_src = (_Complex short*) malloc (32); + _Complex short* epi16_dst = (_Complex short*) malloc (32); + _Complex char* epi8_src = (_Complex char*) malloc (32); + _Complex char* epi8_dst = (_Complex char*) malloc (32); + char* p = (char* ) malloc (32); + + __builtin_memset (pd_dst, 0, 32); + __builtin_memset (ps_dst, 0, 32); + __builtin_memset (epi64_dst, 0, 32); + __builtin_memset (epi32_dst, 0, 32); + __builtin_memset (epi16_dst, 0, 32); + __builtin_memset (epi8_dst, 0, 32); + + for (int i = 0; i != 32; i++) + p[i] = i; + __builtin_memcpy (pd_src, p, 32); + __builtin_memcpy (ps_src, p, 32); + __builtin_memcpy (epi64_src, p, 32); + __builtin_memcpy (epi32_src, p, 32); + __builtin_memcpy (epi16_src, p, 32); + __builtin_memcpy (epi8_src, p, 32); + + foo_pd (pd_dst, pd_src[0], pd_src[1]); + foo_ps (ps_dst, ps_src[0], ps_src[1], ps_src[2], ps_src[3]); + foo_epi64 (epi64_dst, epi64_src[0], epi64_src[1]); + foo_epi32 (epi32_dst, epi32_src[0], epi32_src[1], epi32_src[2], epi32_src[3]); + foo_epi16 (epi16_dst, epi16_src[0], epi16_src[1], epi16_src[2], epi16_src[3], + epi16_src[4], epi16_src[5], epi16_src[6], epi16_src[7]); + foo_epi8 (epi8_dst, epi8_src[0], epi8_src[1], epi8_src[2], epi8_src[3], + epi8_src[4], epi8_src[5], epi8_src[6], epi8_src[7], + epi8_src[8], epi8_src[9], epi8_src[10], epi8_src[11], + epi8_src[12], epi8_src[13], epi8_src[14], epi8_src[15]); + + if (__builtin_memcmp (pd_dst, pd_src, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (ps_dst, ps_src, 32) != 0) + __builtin_abort 
(); + if (__builtin_memcmp (epi64_dst, epi64_src, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi32_dst, epi32_src, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi16_dst, epi16_src, 32) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi8_dst, epi8_src, 32) != 0) + __builtin_abort (); + + return; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-4c.c b/gcc/testsuite/gcc.target/i386/pr106010-4c.c new file mode 100644 index 00000000000..8e02aefe3b5 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-4c.c @@ -0,0 +1,54 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -fdump-tree-slp-details -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ +/* { dg-require-effective-target avx512fp16 } */ +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 1 "slp2" } }*/ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 1 "slp2" } } */ + +#include <string.h> + +static void do_test (void); +#define DO_TEST do_test +#define AVX512FP16 +#include "avx512-check.h" + +void +__attribute__((noipa)) +foo_ph (_Complex _Float16* a, + _Complex _Float16 b1, _Complex _Float16 b2, + _Complex _Float16 b3, _Complex _Float16 b4, + _Complex _Float16 b5, _Complex _Float16 b6, + _Complex _Float16 b7,_Complex _Float16 b8) +{ + a[0] = b1; + a[1] = b2; + a[2] = b3; + a[3] = b4; + a[4] = b5; + a[5] = b6; + a[6] = b7; + a[7] = b8; +} + +void +do_test (void) +{ + + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (32); + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (32); + + char* p = (char* ) malloc (32); + + __builtin_memset (ph_dst, 0, 32); + + for (int i = 0; i != 32; i++) + p[i] = i; + + __builtin_memcpy (ph_src, p, 32); + + foo_ph (ph_dst, ph_src[0], ph_src[1], ph_src[2], ph_src[3], + ph_src[4], ph_src[5], ph_src[6], ph_src[7]); + + if (__builtin_memcmp (ph_dst, ph_src, 32) != 0) + __builtin_abort (); + return; +} diff 
--git a/gcc/testsuite/gcc.target/i386/pr106010-5a.c b/gcc/testsuite/gcc.target/i386/pr106010-5a.c new file mode 100644 index 00000000000..9d4a6f9846b --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-5a.c @@ -0,0 +1,117 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-slp-details -mprefer-vector-width=256" } */ +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 6 "slp2" } }*/ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 4 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 4 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 4 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) int>} 4 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 4 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 4 "slp2" } } */ + +void +__attribute__((noipa)) +foo_pd (_Complex double* a, _Complex double* __restrict b) +{ + a[0] = b[2]; + a[1] = b[3]; + a[2] = b[0]; + a[3] = b[1]; +} + +void +__attribute__((noipa)) +foo_ps (_Complex float* a, _Complex float* __restrict b) +{ + a[0] = b[4]; + a[1] = b[5]; + a[2] = b[6]; + a[3] = b[7]; + a[4] = b[0]; + a[5] = b[1]; + a[6] = b[2]; + a[7] = b[3]; +} + +void +__attribute__((noipa)) +foo_epi64 (_Complex long long* a, _Complex long long* __restrict b) +{ + a[0] = b[2]; + a[1] = b[3]; + a[2] = b[0]; + a[3] = b[1]; +} + +void +__attribute__((noipa)) +foo_epi32 (_Complex int* a, _Complex int* __restrict b) +{ + a[0] = b[4]; + a[1] = b[5]; + a[2] = b[6]; + a[3] = b[7]; + a[4] = b[0]; + a[5] = b[1]; + a[6] = b[2]; + a[7] = b[3]; +} + +void +__attribute__((noipa)) +foo_epi16 (_Complex short* a, _Complex short* __restrict b) +{ + a[0] = b[8]; + a[1] = b[9]; + a[2] = 
b[10]; + a[3] = b[11]; + a[4] = b[12]; + a[5] = b[13]; + a[6] = b[14]; + a[7] = b[15]; + a[8] = b[0]; + a[9] = b[1]; + a[10] = b[2]; + a[11] = b[3]; + a[12] = b[4]; + a[13] = b[5]; + a[14] = b[6]; + a[15] = b[7]; +} + +void +__attribute__((noipa)) +foo_epi8 (_Complex char* a, _Complex char* __restrict b) +{ + a[0] = b[16]; + a[1] = b[17]; + a[2] = b[18]; + a[3] = b[19]; + a[4] = b[20]; + a[5] = b[21]; + a[6] = b[22]; + a[7] = b[23]; + a[8] = b[24]; + a[9] = b[25]; + a[10] = b[26]; + a[11] = b[27]; + a[12] = b[28]; + a[13] = b[29]; + a[14] = b[30]; + a[15] = b[31]; + a[16] = b[0]; + a[17] = b[1]; + a[18] = b[2]; + a[19] = b[3]; + a[20] = b[4]; + a[21] = b[5]; + a[22] = b[6]; + a[23] = b[7]; + a[24] = b[8]; + a[25] = b[9]; + a[26] = b[10]; + a[27] = b[11]; + a[28] = b[12]; + a[29] = b[13]; + a[30] = b[14]; + a[31] = b[15]; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-5b.c b/gcc/testsuite/gcc.target/i386/pr106010-5b.c new file mode 100644 index 00000000000..d5c6ebeb5cf --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-5b.c @@ -0,0 +1,80 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ +/* { dg-require-effective-target avx } */ + +#include "avx-check.h" +#include <string.h> +#include "pr106010-5a.c" + +void +avx_test (void) +{ + _Complex double* pd_src = (_Complex double*) malloc (64); + _Complex double* pd_dst = (_Complex double*) malloc (64); + _Complex double* pd_exp = (_Complex double*) malloc (64); + _Complex float* ps_src = (_Complex float*) malloc (64); + _Complex float* ps_dst = (_Complex float*) malloc (64); + _Complex float* ps_exp = (_Complex float*) malloc (64); + _Complex long long* epi64_src = (_Complex long long*) malloc (64); + _Complex long long* epi64_dst = (_Complex long long*) malloc (64); + _Complex long long* epi64_exp = (_Complex long long*) malloc (64); + _Complex int* epi32_src = (_Complex int*) malloc (64); + _Complex int* epi32_dst = (_Complex 
int*) malloc (64); + _Complex int* epi32_exp = (_Complex int*) malloc (64); + _Complex short* epi16_src = (_Complex short*) malloc (64); + _Complex short* epi16_dst = (_Complex short*) malloc (64); + _Complex short* epi16_exp = (_Complex short*) malloc (64); + _Complex char* epi8_src = (_Complex char*) malloc (64); + _Complex char* epi8_dst = (_Complex char*) malloc (64); + _Complex char* epi8_exp = (_Complex char*) malloc (64); + char* p = (char* ) malloc (64); + char* q = (char* ) malloc (64); + + __builtin_memset (pd_dst, 0, 64); + __builtin_memset (ps_dst, 0, 64); + __builtin_memset (epi64_dst, 0, 64); + __builtin_memset (epi32_dst, 0, 64); + __builtin_memset (epi16_dst, 0, 64); + __builtin_memset (epi8_dst, 0, 64); + + for (int i = 0; i != 64; i++) + { + p[i] = i; + q[i] = (i + 32) % 64; + } + __builtin_memcpy (pd_src, p, 64); + __builtin_memcpy (ps_src, p, 64); + __builtin_memcpy (epi64_src, p, 64); + __builtin_memcpy (epi32_src, p, 64); + __builtin_memcpy (epi16_src, p, 64); + __builtin_memcpy (epi8_src, p, 64); + + __builtin_memcpy (pd_exp, q, 64); + __builtin_memcpy (ps_exp, q, 64); + __builtin_memcpy (epi64_exp, q, 64); + __builtin_memcpy (epi32_exp, q, 64); + __builtin_memcpy (epi16_exp, q, 64); + __builtin_memcpy (epi8_exp, q, 64); + + foo_pd (pd_dst, pd_src); + foo_ps (ps_dst, ps_src); + foo_epi64 (epi64_dst, epi64_src); + foo_epi32 (epi32_dst, epi32_src); + foo_epi16 (epi16_dst, epi16_src); + foo_epi8 (epi8_dst, epi8_src); + + if (__builtin_memcmp (pd_dst, pd_exp, 64) != 0) + __builtin_abort (); + if (__builtin_memcmp (ps_dst, ps_exp, 64) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi64_dst, epi64_exp, 64) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi32_dst, epi32_exp, 64) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi16_dst, epi16_exp, 64) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi8_dst, epi8_exp, 64) != 0) + __builtin_abort (); + + return; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-5c.c 
b/gcc/testsuite/gcc.target/i386/pr106010-5c.c new file mode 100644 index 00000000000..9ce4e6dd5c0 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-5c.c @@ -0,0 +1,62 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-slp-details -mprefer-vector-width=256" } */ +/* { dg-require-effective-target avx512fp16 } */ +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 1 "slp2" } }*/ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 4 "slp2" } } */ + +#include <string.h> + +static void do_test (void); +#define DO_TEST do_test +#define AVX512FP16 +#include "avx512-check.h" + +void +__attribute__((noipa)) +foo_ph (_Complex _Float16* a, _Complex _Float16* __restrict b) +{ + a[0] = b[8]; + a[1] = b[9]; + a[2] = b[10]; + a[3] = b[11]; + a[4] = b[12]; + a[5] = b[13]; + a[6] = b[14]; + a[7] = b[15]; + a[8] = b[0]; + a[9] = b[1]; + a[10] = b[2]; + a[11] = b[3]; + a[12] = b[4]; + a[13] = b[5]; + a[14] = b[6]; + a[15] = b[7]; +} + +void +do_test (void) +{ + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (64); + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (64); + _Complex _Float16* ph_exp = (_Complex _Float16*) malloc (64); + char* p = (char* ) malloc (64); + char* q = (char* ) malloc (64); + + __builtin_memset (ph_dst, 0, 64); + + for (int i = 0; i != 64; i++) + { + p[i] = i; + q[i] = (i + 32) % 64; + } + __builtin_memcpy (ph_src, p, 64); + + __builtin_memcpy (ph_exp, q, 64); + + foo_ph (ph_dst, ph_src); + + if (__builtin_memcmp (ph_dst, ph_exp, 64) != 0) + __builtin_abort (); + + return; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-6a.c b/gcc/testsuite/gcc.target/i386/pr106010-6a.c new file mode 100644 index 00000000000..65a90d03684 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-6a.c @@ -0,0 +1,115 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx2 -ftree-vectorize 
-fvect-cost-model=unlimited -fdump-tree-slp-details -mprefer-vector-width=256" } */ +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 6 "slp2" } }*/ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 2, 3, 0, 1 \}} 4 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 6, 7, 4, 5, 2, 3, 0, 1 \}} 4 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 \}} 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 30, 31, 28, 29, 26, 27, 24, 25, 22, 23, 20, 21, 18, 19, 16, 17, 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 \}} 2 "slp2" } } */ + +void +__attribute__((noipa)) +foo_pd (_Complex double* a, _Complex double* __restrict b) +{ + a[0] = b[3]; + a[1] = b[2]; + a[2] = b[1]; + a[3] = b[0]; +} + +void +__attribute__((noipa)) +foo_ps (_Complex float* a, _Complex float* __restrict b) +{ + a[0] = b[7]; + a[1] = b[6]; + a[2] = b[5]; + a[3] = b[4]; + a[4] = b[3]; + a[5] = b[2]; + a[6] = b[1]; + a[7] = b[0]; +} + +void +__attribute__((noipa)) +foo_epi64 (_Complex long long* a, _Complex long long* __restrict b) +{ + a[0] = b[3]; + a[1] = b[2]; + a[2] = b[1]; + a[3] = b[0]; +} + +void +__attribute__((noipa)) +foo_epi32 (_Complex int* a, _Complex int* __restrict b) +{ + a[0] = b[7]; + a[1] = b[6]; + a[2] = b[5]; + a[3] = b[4]; + a[4] = b[3]; + a[5] = b[2]; + a[6] = b[1]; + a[7] = b[0]; +} + +void +__attribute__((noipa)) +foo_epi16 (_Complex short* a, _Complex short* __restrict b) +{ + a[0] = b[15]; + a[1] = b[14]; + a[2] = b[13]; + a[3] = b[12]; + a[4] = b[11]; + a[5] = b[10]; + a[6] = b[9]; + a[7] = b[8]; + a[8] = b[7]; + a[9] = b[6]; + a[10] = b[5]; + a[11] = b[4]; + a[12] = b[3]; + a[13] = b[2]; + a[14] = b[1]; + a[15] = b[0]; +} + +void +__attribute__((noipa)) +foo_epi8 (_Complex char* a, _Complex char* __restrict b) +{ + a[0] = b[31]; 
+ a[1] = b[30]; + a[2] = b[29]; + a[3] = b[28]; + a[4] = b[27]; + a[5] = b[26]; + a[6] = b[25]; + a[7] = b[24]; + a[8] = b[23]; + a[9] = b[22]; + a[10] = b[21]; + a[11] = b[20]; + a[12] = b[19]; + a[13] = b[18]; + a[14] = b[17]; + a[15] = b[16]; + a[16] = b[15]; + a[17] = b[14]; + a[18] = b[13]; + a[19] = b[12]; + a[20] = b[11]; + a[21] = b[10]; + a[22] = b[9]; + a[23] = b[8]; + a[24] = b[7]; + a[25] = b[6]; + a[26] = b[5]; + a[27] = b[4]; + a[28] = b[3]; + a[29] = b[2]; + a[30] = b[1]; + a[31] = b[0]; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-6b.c b/gcc/testsuite/gcc.target/i386/pr106010-6b.c new file mode 100644 index 00000000000..1c5bb020939 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-6b.c @@ -0,0 +1,157 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx2 -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ +/* { dg-require-effective-target avx2 } */ + +#include "avx2-check.h" +#include <string.h> +#include "pr106010-6a.c" + +void +avx2_test (void) +{ + _Complex double* pd_src = (_Complex double*) malloc (64); + _Complex double* pd_dst = (_Complex double*) malloc (64); + _Complex double* pd_exp = (_Complex double*) malloc (64); + _Complex float* ps_src = (_Complex float*) malloc (64); + _Complex float* ps_dst = (_Complex float*) malloc (64); + _Complex float* ps_exp = (_Complex float*) malloc (64); + _Complex long long* epi64_src = (_Complex long long*) malloc (64); + _Complex long long* epi64_dst = (_Complex long long*) malloc (64); + _Complex long long* epi64_exp = (_Complex long long*) malloc (64); + _Complex int* epi32_src = (_Complex int*) malloc (64); + _Complex int* epi32_dst = (_Complex int*) malloc (64); + _Complex int* epi32_exp = (_Complex int*) malloc (64); + _Complex short* epi16_src = (_Complex short*) malloc (64); + _Complex short* epi16_dst = (_Complex short*) malloc (64); + _Complex short* epi16_exp = (_Complex short*) malloc (64); + _Complex char* epi8_src = (_Complex char*) malloc (64); + 
_Complex char* epi8_dst = (_Complex char*) malloc (64); + _Complex char* epi8_exp = (_Complex char*) malloc (64); + char* p = (char* ) malloc (64); + char* q = (char* ) malloc (64); + + __builtin_memset (pd_dst, 0, 64); + __builtin_memset (ps_dst, 0, 64); + __builtin_memset (epi64_dst, 0, 64); + __builtin_memset (epi32_dst, 0, 64); + __builtin_memset (epi16_dst, 0, 64); + __builtin_memset (epi8_dst, 0, 64); + + for (int i = 0; i != 64; i++) + p[i] = i; + + __builtin_memcpy (pd_src, p, 64); + __builtin_memcpy (ps_src, p, 64); + __builtin_memcpy (epi64_src, p, 64); + __builtin_memcpy (epi32_src, p, 64); + __builtin_memcpy (epi16_src, p, 64); + __builtin_memcpy (epi8_src, p, 64); + + + for (int i = 0; i != 16; i++) + { + q[i] = i + 48; + q[i + 16] = i + 32; + q[i + 32] = i + 16; + q[i + 48] = i; + } + + __builtin_memcpy (pd_exp, q, 64); + __builtin_memcpy (epi64_exp, q, 64); + + for (int i = 0; i != 8; i++) + { + q[i] = i + 56; + q[i + 8] = i + 48; + q[i + 16] = i + 40; + q[i + 24] = i + 32; + q[i + 32] = i + 24; + q[i + 40] = i + 16; + q[i + 48] = i + 8; + q[i + 56] = i; + } + + __builtin_memcpy (ps_exp, q, 64); + __builtin_memcpy (epi32_exp, q, 64); + + for (int i = 0; i != 4; i++) + { + q[i] = i + 60; + q[i + 4] = i + 56; + q[i + 8] = i + 52; + q[i + 12] = i + 48; + q[i + 16] = i + 44; + q[i + 20] = i + 40; + q[i + 24] = i + 36; + q[i + 28] = i + 32; + q[i + 32] = i + 28; + q[i + 36] = i + 24; + q[i + 40] = i + 20; + q[i + 44] = i + 16; + q[i + 48] = i + 12; + q[i + 52] = i + 8; + q[i + 56] = i + 4; + q[i + 60] = i; + } + + __builtin_memcpy (epi16_exp, q, 64); + + for (int i = 0; i != 2; i++) + { + q[i] = i + 62; + q[i + 2] = i + 60; + q[i + 4] = i + 58; + q[i + 6] = i + 56; + q[i + 8] = i + 54; + q[i + 10] = i + 52; + q[i + 12] = i + 50; + q[i + 14] = i + 48; + q[i + 16] = i + 46; + q[i + 18] = i + 44; + q[i + 20] = i + 42; + q[i + 22] = i + 40; + q[i + 24] = i + 38; + q[i + 26] = i + 36; + q[i + 28] = i + 34; + q[i + 30] = i + 32; + q[i + 32] = i + 30; + q[i + 
34] = i + 28; + q[i + 36] = i + 26; + q[i + 38] = i + 24; + q[i + 40] = i + 22; + q[i + 42] = i + 20; + q[i + 44] = i + 18; + q[i + 46] = i + 16; + q[i + 48] = i + 14; + q[i + 50] = i + 12; + q[i + 52] = i + 10; + q[i + 54] = i + 8; + q[i + 56] = i + 6; + q[i + 58] = i + 4; + q[i + 60] = i + 2; + q[i + 62] = i; + } + __builtin_memcpy (epi8_exp, q, 64); + + foo_pd (pd_dst, pd_src); + foo_ps (ps_dst, ps_src); + foo_epi64 (epi64_dst, epi64_src); + foo_epi32 (epi32_dst, epi32_src); + foo_epi16 (epi16_dst, epi16_src); + foo_epi8 (epi8_dst, epi8_src); + + if (__builtin_memcmp (pd_dst, pd_exp, 64) != 0) + __builtin_abort (); + if (__builtin_memcmp (ps_dst, ps_exp, 64) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi64_dst, epi64_exp, 64) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi32_dst, epi32_exp, 64) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi16_dst, epi16_exp, 64) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi8_dst, epi8_exp, 64) != 0) + __builtin_abort (); + + return; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-6c.c b/gcc/testsuite/gcc.target/i386/pr106010-6c.c new file mode 100644 index 00000000000..b859d884a7f --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-6c.c @@ -0,0 +1,80 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-slp-details" } */ +/* { dg-require-effective-target avx512fp16 } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*VEC_PERM_EXPR.*\{ 14, 15, 12, 13, 10, 11, 8, 9, 6, 7, 4, 5, 2, 3, 0, 1 \}} 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "basic block part vectorized using (?:32|64) byte vectors" 1 "slp2" } } */ + +#include <string.h> + +static void do_test (void); +#define DO_TEST do_test +#define AVX512FP16 +#include "avx512-check.h" + +void +__attribute__((noipa)) +foo_ph (_Complex _Float16* a, _Complex _Float16* __restrict b) +{ + a[0] = b[15]; + a[1] = b[14]; + 
a[2] = b[13]; + a[3] = b[12]; + a[4] = b[11]; + a[5] = b[10]; + a[6] = b[9]; + a[7] = b[8]; + a[8] = b[7]; + a[9] = b[6]; + a[10] = b[5]; + a[11] = b[4]; + a[12] = b[3]; + a[13] = b[2]; + a[14] = b[1]; + a[15] = b[0]; +} + +void +do_test (void) +{ + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (64); + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (64); + _Complex _Float16* ph_exp = (_Complex _Float16*) malloc (64); + char* p = (char* ) malloc (64); + char* q = (char* ) malloc (64); + + __builtin_memset (ph_dst, 0, 64); + + for (int i = 0; i != 64; i++) + p[i] = i; + + __builtin_memcpy (ph_src, p, 64); + + for (int i = 0; i != 4; i++) + { + q[i] = i + 60; + q[i + 4] = i + 56; + q[i + 8] = i + 52; + q[i + 12] = i + 48; + q[i + 16] = i + 44; + q[i + 20] = i + 40; + q[i + 24] = i + 36; + q[i + 28] = i + 32; + q[i + 32] = i + 28; + q[i + 36] = i + 24; + q[i + 40] = i + 20; + q[i + 44] = i + 16; + q[i + 48] = i + 12; + q[i + 52] = i + 8; + q[i + 56] = i + 4; + q[i + 60] = i; + } + + __builtin_memcpy (ph_exp, q, 64); + + foo_ph (ph_dst, ph_src); + + if (__builtin_memcmp (ph_dst, ph_exp, 64) != 0) + __builtin_abort (); + + return; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-7a.c b/gcc/testsuite/gcc.target/i386/pr106010-7a.c new file mode 100644 index 00000000000..2ea01fac927 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-7a.c @@ -0,0 +1,58 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-vect-details -mprefer-vector-width=256" } */ +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 6 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM 
<vector\(8\) int>} 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 1 "vect" } } */ + +#define N 10000 +void +__attribute__((noipa)) +foo_pd (_Complex double* a, _Complex double b) +{ + for (int i = 0; i != N; i++) + a[i] = b; +} + +void +__attribute__((noipa)) +foo_ps (_Complex float* a, _Complex float b) +{ + for (int i = 0; i != N; i++) + a[i] = b; +} + +void +__attribute__((noipa)) +foo_epi64 (_Complex long long* a, _Complex long long b) +{ + for (int i = 0; i != N; i++) + a[i] = b; +} + +void +__attribute__((noipa)) +foo_epi32 (_Complex int* a, _Complex int b) +{ + for (int i = 0; i != N; i++) + a[i] = b; +} + +void +__attribute__((noipa)) +foo_epi16 (_Complex short* a, _Complex short b) +{ + for (int i = 0; i != N; i++) + a[i] = b; +} + +void +__attribute__((noipa)) +foo_epi8 (_Complex char* a, _Complex char b) +{ + for (int i = 0; i != N; i++) + a[i] = b; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-7b.c b/gcc/testsuite/gcc.target/i386/pr106010-7b.c new file mode 100644 index 00000000000..26482cc10f5 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-7b.c @@ -0,0 +1,63 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ +/* { dg-require-effective-target avx } */ + +#include "avx-check.h" +#include <string.h> +#include "pr106010-7a.c" + +void +avx_test (void) +{ + _Complex double* pd_src = (_Complex double*) malloc (2 * N * sizeof (double)); + _Complex double* pd_dst = (_Complex double*) malloc (2 * N * sizeof (double)); + _Complex float* ps_src = (_Complex float*) malloc (2 * N * sizeof (float)); + _Complex float* ps_dst = (_Complex float*) malloc (2 * N * sizeof (float)); + _Complex long long* epi64_src = (_Complex long long*) malloc (2 * N * sizeof (long long)); + _Complex long long* epi64_dst = (_Complex long 
long*) malloc (2 * N * sizeof (long long)); + _Complex int* epi32_src = (_Complex int*) malloc (2 * N * sizeof (int)); + _Complex int* epi32_dst = (_Complex int*) malloc (2 * N * sizeof (int)); + _Complex short* epi16_src = (_Complex short*) malloc (2 * N * sizeof (short)); + _Complex short* epi16_dst = (_Complex short*) malloc (2 * N * sizeof (short)); + _Complex char* epi8_src = (_Complex char*) malloc (2 * N * sizeof (char)); + _Complex char* epi8_dst = (_Complex char*) malloc (2 * N * sizeof (char)); + char* p_init = (char*) malloc (2 * N * sizeof (double)); + + __builtin_memset (pd_dst, 0, 2 * N * sizeof (double)); + __builtin_memset (ps_dst, 0, 2 * N * sizeof (float)); + __builtin_memset (epi64_dst, 0, 2 * N * sizeof (long long)); + __builtin_memset (epi32_dst, 0, 2 * N * sizeof (int)); + __builtin_memset (epi16_dst, 0, 2 * N * sizeof (short)); + __builtin_memset (epi8_dst, 0, 2 * N * sizeof (char)); + + for (int i = 0; i != 2 * N * sizeof (double); i++) + p_init[i] = i % 2 + 3; + + memcpy (pd_src, p_init, 2 * N * sizeof (double)); + memcpy (ps_src, p_init, 2 * N * sizeof (float)); + memcpy (epi64_src, p_init, 2 * N * sizeof (long long)); + memcpy (epi32_src, p_init, 2 * N * sizeof (int)); + memcpy (epi16_src, p_init, 2 * N * sizeof (short)); + memcpy (epi8_src, p_init, 2 * N * sizeof (char)); + + foo_pd (pd_dst, pd_src[0]); + foo_ps (ps_dst, ps_src[0]); + foo_epi64 (epi64_dst, epi64_src[0]); + foo_epi32 (epi32_dst, epi32_src[0]); + foo_epi16 (epi16_dst, epi16_src[0]); + foo_epi8 (epi8_dst, epi8_src[0]); + if (__builtin_memcmp (pd_dst, pd_src, N * 2 * sizeof (double)) != 0) + __builtin_abort (); + if (__builtin_memcmp (ps_dst, ps_src, N * 2 * sizeof (float)) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi64_dst, epi64_src, N * 2 * sizeof (long long)) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi32_dst, epi32_src, N * 2 * sizeof (int)) != 0) + __builtin_abort (); + if (__builtin_memcmp (epi16_dst, epi16_src, N * 2 * sizeof (short)) != 0) +
__builtin_abort (); + if (__builtin_memcmp (epi8_dst, epi8_src, N * 2 * sizeof (char)) != 0) + __builtin_abort (); + + return; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-7c.c b/gcc/testsuite/gcc.target/i386/pr106010-7c.c new file mode 100644 index 00000000000..7f4056a5ecc --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-7c.c @@ -0,0 +1,41 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-vect-details" } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 1 "vect" } } */ +/* { dg-require-effective-target avx512fp16 } */ + +#include <string.h> + +static void do_test (void); + +#define DO_TEST do_test +#define AVX512FP16 +#include "avx512-check.h" + +#define N 10000 + +void +__attribute__((noipa)) +foo_ph (_Complex _Float16* a, _Complex _Float16 b) +{ + for (int i = 0; i != N; i++) + a[i] = b; +} + +static void +do_test (void) +{ + _Complex _Float16* ph_src = (_Complex _Float16*) malloc (2 * N * sizeof (_Float16)); + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (2 * N * sizeof (_Float16)); + char* p_init = (char*) malloc (2 * N * sizeof (_Float16)); + + __builtin_memset (ph_dst, 0, 2 * N * sizeof (_Float16)); + + for (int i = 0; i != 2 * N * sizeof (_Float16); i++) + p_init[i] = i % 2 + 3; + + memcpy (ph_src, p_init, 2 * N * sizeof (_Float16)); + + foo_ph (ph_dst, ph_src[0]); + if (__builtin_memcmp (ph_dst, ph_src, N * 2 * sizeof (_Float16)) != 0) + __builtin_abort (); +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-8a.c b/gcc/testsuite/gcc.target/i386/pr106010-8a.c new file mode 100644 index 00000000000..11054b60d30 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-8a.c @@ -0,0 +1,58 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -fdump-tree-vect-details -mprefer-vector-width=256" } */ +/* { dg-final { scan-tree-dump-times 
"vectorized 1 loops" 6 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) double>} 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) float>} 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(4\) long long int>} 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(8\) int>} 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) short int>} 1 "vect" } } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(32\) char>} 1 "vect" } } */ + +#define N 10000 +void +__attribute__((noipa)) +foo_pd (_Complex double* a) +{ + for (int i = 0; i != N; i++) + a[i] = 1.0 + 2.0i; +} + +void +__attribute__((noipa)) +foo_ps (_Complex float* a) +{ + for (int i = 0; i != N; i++) + a[i] = 1.0f + 2.0fi; +} + +void +__attribute__((noipa)) +foo_epi64 (_Complex long long* a) +{ + for (int i = 0; i != N; i++) + a[i] = 1 + 2i; +} + +void +__attribute__((noipa)) +foo_epi32 (_Complex int* a) +{ + for (int i = 0; i != N; i++) + a[i] = 1 + 2i; +} + +void +__attribute__((noipa)) +foo_epi16 (_Complex short* a) +{ + for (int i = 0; i != N; i++) + a[i] = 1 + 2i; +} + +void +__attribute__((noipa)) +foo_epi8 (_Complex char* a) +{ + for (int i = 0; i != N; i++) + a[i] = 1 + 2i; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-8b.c b/gcc/testsuite/gcc.target/i386/pr106010-8b.c new file mode 100644 index 00000000000..6bb0073b691 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-8b.c @@ -0,0 +1,53 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256" } */ +/* { dg-require-effective-target avx } */ + +#include "avx-check.h" +#include <string.h> +#include "pr106010-8a.c" + +void +avx_test (void) +{ + _Complex double pd_src = 1.0 + 2.0i; + _Complex double* pd_dst = (_Complex double*) malloc (2 * N * sizeof (double)); + 
_Complex float ps_src = 1.0 + 2.0i; + _Complex float* ps_dst = (_Complex float*) malloc (2 * N * sizeof (float)); + _Complex long long epi64_src = 1 + 2i; + _Complex long long* epi64_dst = (_Complex long long*) malloc (2 * N * sizeof (long long)); + _Complex int epi32_src = 1 + 2i; + _Complex int* epi32_dst = (_Complex int*) malloc (2 * N * sizeof (int)); + _Complex short epi16_src = 1 + 2i; + _Complex short* epi16_dst = (_Complex short*) malloc (2 * N * sizeof (short)); + _Complex char epi8_src = 1 + 2i; + _Complex char* epi8_dst = (_Complex char*) malloc (2 * N * sizeof (char)); + + __builtin_memset (pd_dst, 0, 2 * N * sizeof (double)); + __builtin_memset (ps_dst, 0, 2 * N * sizeof (float)); + __builtin_memset (epi64_dst, 0, 2 * N * sizeof (long long)); + __builtin_memset (epi32_dst, 0, 2 * N * sizeof (int)); + __builtin_memset (epi16_dst, 0, 2 * N * sizeof (short)); + __builtin_memset (epi8_dst, 0, 2 * N * sizeof (char)); + + foo_pd (pd_dst); + foo_ps (ps_dst); + foo_epi64 (epi64_dst); + foo_epi32 (epi32_dst); + foo_epi16 (epi16_dst); + foo_epi8 (epi8_dst); + for (int i = 0 ; i != N; i++) + { + if (pd_dst[i] != pd_src) + __builtin_abort (); + if (ps_dst[i] != ps_src) + __builtin_abort (); + if (epi64_dst[i] != epi64_src) + __builtin_abort (); + if (epi32_dst[i] != epi32_src) + __builtin_abort (); + if (epi16_dst[i] != epi16_src) + __builtin_abort (); + if (epi8_dst[i] != epi8_src) + __builtin_abort (); + } +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-8c.c b/gcc/testsuite/gcc.target/i386/pr106010-8c.c new file mode 100644 index 00000000000..61ae131829d --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-8c.c @@ -0,0 +1,38 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -mavx512fp16 -mavx512vl -ftree-vectorize -fvect-cost-model=unlimited -mprefer-vector-width=256 -fdump-tree-vect-details" } */ +/* { dg-final { scan-tree-dump-times {(?n)add new stmt:.*MEM <vector\(16\) _Float16>} 1 "vect" } } */ +/* { dg-require-effective-target avx512fp16 } */ +
+#include <string.h> + +static void do_test (void); + +#define DO_TEST do_test +#define AVX512FP16 +#include "avx512-check.h" + +#define N 10000 + +void +__attribute__((noipa)) +foo_ph (_Complex _Float16* a) +{ + for (int i = 0; i != N; i++) + a[i] = 1.0f16 + 2.0f16i; +} + +static void +do_test (void) +{ + _Complex _Float16 ph_src = 1.0f16 + 2.0f16i; + _Complex _Float16* ph_dst = (_Complex _Float16*) malloc (2 * N * sizeof (_Float16)); + + __builtin_memset (ph_dst, 0, 2 * N * sizeof (_Float16)); + + foo_ph (ph_dst); + for (int i = 0; i != N; i++) + { + if (ph_dst[i] != ph_src) + __builtin_abort (); + } +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-9a.c b/gcc/testsuite/gcc.target/i386/pr106010-9a.c new file mode 100644 index 00000000000..e922f7b5400 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-9a.c @@ -0,0 +1,89 @@ +/* { dg-do compile } */ +/* { dg-options "-O3 -mavx2 -fvect-cost-model=unlimited -fdump-tree-vect-details" } */ +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 6 "vect" } } */ + +typedef struct { _Complex double c; double a1; double a2;} + cdf; +typedef struct { _Complex double c; double a1; double a2; double a3; double a4;} + cdf2; +typedef struct { _Complex double c1; _Complex double c2; double a1; double a2; double a3; double a4;} + cdf3; +typedef struct { _Complex double c1; _Complex double c2; double a1; double a2;} + cdf4; + +#define N 100 +/* VMAT_ELEMENTWISE. */ +void +__attribute__((noipa)) +foo (cdf* a, cdf* __restrict b) +{ + for (int i = 0; i < N; ++i) + { + a[i].c = b[i].c; + a[i].a1 = b[i].a1; + a[i].a2 = b[i].a2; + } +} + +/* VMAT_CONTIGUOUS_PERMUTE. */ +void +__attribute__((noipa)) +foo1 (cdf2* a, cdf2* __restrict b) +{ + for (int i = 0; i < N; ++i) + { + a[i].c = b[i].c; + a[i].a1 = b[i].a1; + a[i].a2 = b[i].a2; + a[i].a3 = b[i].a3; + a[i].a4 = b[i].a4; + } +} + +/* VMAT_CONTIGUOUS. 
*/ +void +__attribute__((noipa)) +foo2 (cdf3* a, cdf3* __restrict b) +{ + for (int i = 0; i < N; ++i) + { + a[i].c1 = b[i].c1; + a[i].c2 = b[i].c2; + a[i].a1 = b[i].a1; + a[i].a2 = b[i].a2; + a[i].a3 = b[i].a3; + a[i].a4 = b[i].a4; + } +} + +/* VMAT_STRIDED_SLP. */ +void +__attribute__((noipa)) +foo3 (cdf4* a, cdf4* __restrict b) +{ + for (int i = 0; i < N; ++i) + { + a[i].c1 = b[i].c1; + a[i].c2 = b[i].c2; + a[i].a1 = b[i].a1; + a[i].a2 = b[i].a2; + } +} + +/* VMAT_CONTIGUOUS_REVERSE. */ +void +__attribute__((noipa)) +foo4 (_Complex double* a, _Complex double* __restrict b) +{ + for (int i = 0; i != N; i++) + a[i] = b[N-i-1]; +} + +/* VMAT_CONTIGUOUS_DOWN. */ +void +__attribute__((noipa)) +foo5 (_Complex double* a, _Complex double* __restrict b) +{ + for (int i = 0; i != N; i++) + a[N-i-1] = b[0]; +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-9b.c b/gcc/testsuite/gcc.target/i386/pr106010-9b.c new file mode 100644 index 00000000000..e220445e6e3 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-9b.c @@ -0,0 +1,90 @@ +/* { dg-do run } */ +/* { dg-options "-O3 -msse2 -fvect-cost-model=unlimited" } */ +/* { dg-require-effective-target sse2 } */ + +#include <string.h> +#include "sse2-check.h" +#include "pr106010-9a.c" + +static void +sse2_test (void) +{ + _Complex double* pd_src = (_Complex double*) malloc (N * sizeof (_Complex double)); + _Complex double* pd_dst = (_Complex double*) malloc (N * sizeof (_Complex double)); + _Complex double* pd_src2 = (_Complex double*) malloc (N * sizeof (_Complex double)); + _Complex double* pd_dst2 = (_Complex double*) malloc (N * sizeof (_Complex double)); + cdf* cdf_src = (cdf*) malloc (N * sizeof (cdf)); + cdf* cdf_dst = (cdf*) malloc (N * sizeof (cdf)); + cdf2* cdf2_src = (cdf2*) malloc (N * sizeof (cdf2)); + cdf2* cdf2_dst = (cdf2*) malloc (N * sizeof (cdf2)); + cdf3* cdf3_src = (cdf3*) malloc (N * sizeof (cdf3)); + cdf3* cdf3_dst = (cdf3*) malloc (N * sizeof (cdf3)); + cdf4* cdf4_src = (cdf4*) malloc (N * 
sizeof (cdf4)); + cdf4* cdf4_dst = (cdf4*) malloc (N * sizeof (cdf4)); + + char* p_init = (char*) malloc (N * sizeof (cdf3)); + + __builtin_memset (cdf_dst, 0, N * sizeof (cdf)); + __builtin_memset (cdf2_dst, 0, N * sizeof (cdf2)); + __builtin_memset (cdf3_dst, 0, N * sizeof (cdf3)); + __builtin_memset (cdf4_dst, 0, N * sizeof (cdf4)); + __builtin_memset (pd_dst, 0, N * sizeof (_Complex double)); + __builtin_memset (pd_dst2, 0, N * sizeof (_Complex double)); + + for (int i = 0; i != N * sizeof (cdf3); i++) + p_init[i] = i; + + memcpy (cdf_src, p_init, N * sizeof (cdf)); + memcpy (cdf2_src, p_init, N * sizeof (cdf2)); + memcpy (cdf3_src, p_init, N * sizeof (cdf3)); + memcpy (cdf4_src, p_init, N * sizeof (cdf4)); + memcpy (pd_src, p_init, N * sizeof (_Complex double)); + for (int i = 0; i != 2 * N * sizeof (double); i++) + p_init[i] = i % 16; + memcpy (pd_src2, p_init, N * sizeof (_Complex double)); + + foo (cdf_dst, cdf_src); + foo1 (cdf2_dst, cdf2_src); + foo2 (cdf3_dst, cdf3_src); + foo3 (cdf4_dst, cdf4_src); + foo4 (pd_dst, pd_src); + foo5 (pd_dst2, pd_src2); + for (int i = 0; i != N; i++) + { + p_init[(N - i - 1) * 16] = i * 16; + p_init[(N - i - 1) * 16 + 1] = i * 16 + 1; + p_init[(N - i - 1) * 16 + 2] = i * 16 + 2; + p_init[(N - i - 1) * 16 + 3] = i * 16 + 3; + p_init[(N - i - 1) * 16 + 4] = i * 16 + 4; + p_init[(N - i - 1) * 16 + 5] = i * 16 + 5; + p_init[(N - i - 1) * 16 + 6] = i * 16 + 6; + p_init[(N - i - 1) * 16 + 7] = i * 16 + 7; + p_init[(N - i - 1) * 16 + 8] = i * 16 + 8; + p_init[(N - i - 1) * 16 + 9] = i * 16 + 9; + p_init[(N - i - 1) * 16 + 10] = i * 16 + 10; + p_init[(N - i - 1) * 16 + 11] = i * 16 + 11; + p_init[(N - i - 1) * 16 + 12] = i * 16 + 12; + p_init[(N - i - 1) * 16 + 13] = i * 16 + 13; + p_init[(N - i - 1) * 16 + 14] = i * 16 + 14; + p_init[(N - i - 1) * 16 + 15] = i * 16 + 15; + } + memcpy (pd_src, p_init, N * 16); + + if (__builtin_memcmp (pd_dst, pd_src, N * 2 * sizeof (double)) != 0) + __builtin_abort (); + + if (__builtin_memcmp 
(pd_dst2, pd_src2, N * 2 * sizeof (double)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (cdf_dst, cdf_src, N * sizeof (cdf)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (cdf2_dst, cdf2_src, N * sizeof (cdf2)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (cdf3_dst, cdf3_src, N * sizeof (cdf3)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (cdf4_dst, cdf4_src, N * sizeof (cdf4)) != 0) + __builtin_abort (); +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-9c.c b/gcc/testsuite/gcc.target/i386/pr106010-9c.c new file mode 100644 index 00000000000..ff51f6195b7 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-9c.c @@ -0,0 +1,90 @@ +/* { dg-do run } */ +/* { dg-options "-O3 -mavx2 -fvect-cost-model=unlimited" } */ +/* { dg-require-effective-target avx2 } */ + +#include <string.h> +#include "avx2-check.h" +#include "pr106010-9a.c" + +static void +avx2_test (void) +{ + _Complex double* pd_src = (_Complex double*) malloc (N * sizeof (_Complex double)); + _Complex double* pd_dst = (_Complex double*) malloc (N * sizeof (_Complex double)); + _Complex double* pd_src2 = (_Complex double*) malloc (N * sizeof (_Complex double)); + _Complex double* pd_dst2 = (_Complex double*) malloc (N * sizeof (_Complex double)); + cdf* cdf_src = (cdf*) malloc (N * sizeof (cdf)); + cdf* cdf_dst = (cdf*) malloc (N * sizeof (cdf)); + cdf2* cdf2_src = (cdf2*) malloc (N * sizeof (cdf2)); + cdf2* cdf2_dst = (cdf2*) malloc (N * sizeof (cdf2)); + cdf3* cdf3_src = (cdf3*) malloc (N * sizeof (cdf3)); + cdf3* cdf3_dst = (cdf3*) malloc (N * sizeof (cdf3)); + cdf4* cdf4_src = (cdf4*) malloc (N * sizeof (cdf4)); + cdf4* cdf4_dst = (cdf4*) malloc (N * sizeof (cdf4)); + + char* p_init = (char*) malloc (N * sizeof (cdf3)); + + __builtin_memset (cdf_dst, 0, N * sizeof (cdf)); + __builtin_memset (cdf2_dst, 0, N * sizeof (cdf2)); + __builtin_memset (cdf3_dst, 0, N * sizeof (cdf3)); + __builtin_memset (cdf4_dst, 0, N * sizeof (cdf4)); + __builtin_memset (pd_dst, 0, N * 
sizeof (_Complex double)); + __builtin_memset (pd_dst2, 0, N * sizeof (_Complex double)); + + for (int i = 0; i != N * sizeof (cdf3); i++) + p_init[i] = i; + + memcpy (cdf_src, p_init, N * sizeof (cdf)); + memcpy (cdf2_src, p_init, N * sizeof (cdf2)); + memcpy (cdf3_src, p_init, N * sizeof (cdf3)); + memcpy (cdf4_src, p_init, N * sizeof (cdf4)); + memcpy (pd_src, p_init, N * sizeof (_Complex double)); + for (int i = 0; i != 2 * N * sizeof (double); i++) + p_init[i] = i % 16; + memcpy (pd_src2, p_init, N * sizeof (_Complex double)); + + foo (cdf_dst, cdf_src); + foo1 (cdf2_dst, cdf2_src); + foo2 (cdf3_dst, cdf3_src); + foo3 (cdf4_dst, cdf4_src); + foo4 (pd_dst, pd_src); + foo5 (pd_dst2, pd_src2); + for (int i = 0; i != N; i++) + { + p_init[(N - i - 1) * 16] = i * 16; + p_init[(N - i - 1) * 16 + 1] = i * 16 + 1; + p_init[(N - i - 1) * 16 + 2] = i * 16 + 2; + p_init[(N - i - 1) * 16 + 3] = i * 16 + 3; + p_init[(N - i - 1) * 16 + 4] = i * 16 + 4; + p_init[(N - i - 1) * 16 + 5] = i * 16 + 5; + p_init[(N - i - 1) * 16 + 6] = i * 16 + 6; + p_init[(N - i - 1) * 16 + 7] = i * 16 + 7; + p_init[(N - i - 1) * 16 + 8] = i * 16 + 8; + p_init[(N - i - 1) * 16 + 9] = i * 16 + 9; + p_init[(N - i - 1) * 16 + 10] = i * 16 + 10; + p_init[(N - i - 1) * 16 + 11] = i * 16 + 11; + p_init[(N - i - 1) * 16 + 12] = i * 16 + 12; + p_init[(N - i - 1) * 16 + 13] = i * 16 + 13; + p_init[(N - i - 1) * 16 + 14] = i * 16 + 14; + p_init[(N - i - 1) * 16 + 15] = i * 16 + 15; + } + memcpy (pd_src, p_init, N * 16); + + if (__builtin_memcmp (pd_dst, pd_src, N * 2 * sizeof (double)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (pd_dst2, pd_src2, N * 2 * sizeof (double)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (cdf_dst, cdf_src, N * sizeof (cdf)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (cdf2_dst, cdf2_src, N * sizeof (cdf2)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (cdf3_dst, cdf3_src, N * sizeof (cdf3)) != 0) + __builtin_abort (); + + if (__builtin_memcmp 
(cdf4_dst, cdf4_src, N * sizeof (cdf4)) != 0) + __builtin_abort (); +} diff --git a/gcc/testsuite/gcc.target/i386/pr106010-9d.c b/gcc/testsuite/gcc.target/i386/pr106010-9d.c new file mode 100644 index 00000000000..d4d8f1dd722 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr106010-9d.c @@ -0,0 +1,92 @@ +/* { dg-do run } */ +/* { dg-options "-O3 -mavx512f -mavx512vl -fvect-cost-model=unlimited -mprefer-vector-width=512" } */ +/* { dg-require-effective-target avx512f } */ + +#include <string.h> +#include <stdlib.h> +#define AVX512F +#include "avx512-check.h" +#include "pr106010-9a.c" + +static void +test_512 (void) +{ + _Complex double* pd_src = (_Complex double*) malloc (N * sizeof (_Complex double)); + _Complex double* pd_dst = (_Complex double*) malloc (N * sizeof (_Complex double)); + _Complex double* pd_src2 = (_Complex double*) malloc (N * sizeof (_Complex double)); + _Complex double* pd_dst2 = (_Complex double*) malloc (N * sizeof (_Complex double)); + cdf* cdf_src = (cdf*) malloc (N * sizeof (cdf)); + cdf* cdf_dst = (cdf*) malloc (N * sizeof (cdf)); + cdf2* cdf2_src = (cdf2*) malloc (N * sizeof (cdf2)); + cdf2* cdf2_dst = (cdf2*) malloc (N * sizeof (cdf2)); + cdf3* cdf3_src = (cdf3*) malloc (N * sizeof (cdf3)); + cdf3* cdf3_dst = (cdf3*) malloc (N * sizeof (cdf3)); + cdf4* cdf4_src = (cdf4*) malloc (N * sizeof (cdf4)); + cdf4* cdf4_dst = (cdf4*) malloc (N * sizeof (cdf4)); + + char* p_init = (char*) malloc (N * sizeof (cdf3)); + + __builtin_memset (cdf_dst, 0, N * sizeof (cdf)); + __builtin_memset (cdf2_dst, 0, N * sizeof (cdf2)); + __builtin_memset (cdf3_dst, 0, N * sizeof (cdf3)); + __builtin_memset (cdf4_dst, 0, N * sizeof (cdf4)); + __builtin_memset (pd_dst, 0, N * sizeof (_Complex double)); + __builtin_memset (pd_dst2, 0, N * sizeof (_Complex double)); + + for (int i = 0; i != N * sizeof (cdf3); i++) + p_init[i] = i; + + memcpy (cdf_src, p_init, N * sizeof (cdf)); + memcpy (cdf2_src, p_init, N * sizeof (cdf2)); + memcpy (cdf3_src, p_init, N * sizeof 
(cdf3)); + memcpy (cdf4_src, p_init, N * sizeof (cdf4)); + memcpy (pd_src, p_init, N * sizeof (_Complex double)); + for (int i = 0; i != 2 * N * sizeof (double); i++) + p_init[i] = i % 16; + memcpy (pd_src2, p_init, N * sizeof (_Complex double)); + + foo (cdf_dst, cdf_src); + foo1 (cdf2_dst, cdf2_src); + foo2 (cdf3_dst, cdf3_src); + foo3 (cdf4_dst, cdf4_src); + foo4 (pd_dst, pd_src); + foo5 (pd_dst2, pd_src2); + for (int i = 0; i != N; i++) + { + p_init[(N - i - 1) * 16] = i * 16; + p_init[(N - i - 1) * 16 + 1] = i * 16 + 1; + p_init[(N - i - 1) * 16 + 2] = i * 16 + 2; + p_init[(N - i - 1) * 16 + 3] = i * 16 + 3; + p_init[(N - i - 1) * 16 + 4] = i * 16 + 4; + p_init[(N - i - 1) * 16 + 5] = i * 16 + 5; + p_init[(N - i - 1) * 16 + 6] = i * 16 + 6; + p_init[(N - i - 1) * 16 + 7] = i * 16 + 7; + p_init[(N - i - 1) * 16 + 8] = i * 16 + 8; + p_init[(N - i - 1) * 16 + 9] = i * 16 + 9; + p_init[(N - i - 1) * 16 + 10] = i * 16 + 10; + p_init[(N - i - 1) * 16 + 11] = i * 16 + 11; + p_init[(N - i - 1) * 16 + 12] = i * 16 + 12; + p_init[(N - i - 1) * 16 + 13] = i * 16 + 13; + p_init[(N - i - 1) * 16 + 14] = i * 16 + 14; + p_init[(N - i - 1) * 16 + 15] = i * 16 + 15; + } + memcpy (pd_src, p_init, N * 16); + + if (__builtin_memcmp (pd_dst, pd_src, N * 2 * sizeof (double)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (pd_dst2, pd_src2, N * 2 * sizeof (double)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (cdf_dst, cdf_src, N * sizeof (cdf)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (cdf2_dst, cdf2_src, N * sizeof (cdf2)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (cdf3_dst, cdf3_src, N * sizeof (cdf3)) != 0) + __builtin_abort (); + + if (__builtin_memcmp (cdf4_dst, cdf4_src, N * sizeof (cdf4)) != 0) + __builtin_abort (); +} diff --git a/gcc/tree-vect-data-refs.cc b/gcc/tree-vect-data-refs.cc index d20a10a1524..19567bb338a 100644 --- a/gcc/tree-vect-data-refs.cc +++ b/gcc/tree-vect-data-refs.cc @@ -1403,7 +1403,8 @@ vect_get_data_access_cost 
(vec_info *vinfo, dr_vec_info *dr_info, if (PURE_SLP_STMT (stmt_info)) ncopies = 1; else - ncopies = vect_get_num_copies (loop_vinfo, STMT_VINFO_VECTYPE (stmt_info)); + ncopies = vect_get_num_copies (loop_vinfo, STMT_VINFO_VECTYPE (stmt_info), + STMT_VINFO_COMPLEX_P (stmt_info)); if (DR_IS_READ (dr_info->dr)) vect_get_load_cost (vinfo, stmt_info, ncopies, alignment_support_scheme, @@ -4597,8 +4598,22 @@ vect_analyze_data_refs (vec_info *vinfo, poly_uint64 *min_vf, bool *fatal) /* Set vectype for STMT. */ scalar_type = TREE_TYPE (DR_REF (dr)); - tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type); - if (!vectype) + tree adjust_scalar_type = scalar_type; + /* Support Complex type access. Note that the complex type of load/store + does not support gather/scatter. */ + if (TREE_CODE (scalar_type) == COMPLEX_TYPE + && gatherscatter == SG_NONE) + { + adjust_scalar_type = TREE_TYPE (scalar_type); + STMT_VINFO_COMPLEX_P (stmt_info) = true; + } + tree vectype = get_vectype_for_scalar_type (vinfo, adjust_scalar_type); + unsigned HOST_WIDE_INT constant_nunits; + if (!vectype + /* For complex type, V1DI doesn't make sense. */ + || (STMT_VINFO_COMPLEX_P (stmt_info) + && (!TYPE_VECTOR_SUBPARTS (vectype).is_constant (&constant_nunits) + || constant_nunits == 1))) { if (dump_enabled_p ()) { @@ -4635,8 +4650,11 @@ vect_analyze_data_refs (vec_info *vinfo, poly_uint64 *min_vf, bool *fatal) } /* Adjust the minimal vectorization factor according to the - vector type. */ + vector type. Note for complex type, VF is half of + TYPE_VECTOR_SUBPARTS. 
*/ vf = TYPE_VECTOR_SUBPARTS (vectype); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + vf = exact_div (vf, 2); *min_vf = upper_bound (*min_vf, vf); /* Leave the BB vectorizer to pick the vector type later, based on @@ -6140,21 +6158,55 @@ vect_permute_load_chain (vec_info *vinfo, vec<tree> dr_chain, vec_perm_indices indices; for (k = 0; k < 3; k++) { - for (i = 0; i < nelt; i++) - if (3 * i + k < 2 * nelt) - sel[i] = 3 * i + k; - else - sel[i] = 0; - indices.new_vector (sel, 2, nelt); - perm3_mask_low = vect_gen_perm_mask_checked (vectype, indices); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + { + for (i = 0; i < nelt / 2; i++) + if (6 * i + 2 * k + 1 < 2 * nelt) + { + sel[2 * i] = 6 * i + 2 * k; + sel[2 * i + 1] = 6 * i + 2 * k + 1; + } + else + { + sel[2 * i] = 0; + sel[2 * i + 1] = 0; + } - for (i = 0, j = 0; i < nelt; i++) - if (3 * i + k < 2 * nelt) - sel[i] = i; - else - sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++); - indices.new_vector (sel, 2, nelt); - perm3_mask_high = vect_gen_perm_mask_checked (vectype, indices); + indices.new_vector (sel, 2, nelt); + perm3_mask_low = vect_gen_perm_mask_checked (vectype, indices); + + for (i = 0, j = 0; i < nelt / 2; i++) + if (6 * i + 2 * k + 1 < 2 * nelt) + { + sel[2 * i] = 2 * i; + sel[2 * i + 1] = 2 * i + 1; + } + else + { + sel[2 * i] = nelt + ((nelt + 2 * k) % 6) + 6 * j; + sel[2 * i + 1] = nelt + ((nelt + 2 * k) % 6) + 6 * (j++) + 1; + } + indices.new_vector (sel, 2, nelt); + perm3_mask_high = vect_gen_perm_mask_checked (vectype, indices); + } + else + { + for (i = 0; i < nelt; i++) + if (3 * i + k < 2 * nelt) + sel[i] = 3 * i + k; + else + sel[i] = 0; + indices.new_vector (sel, 2, nelt); + perm3_mask_low = vect_gen_perm_mask_checked (vectype, indices); + + for (i = 0, j = 0; i < nelt; i++) + if (3 * i + k < 2 * nelt) + sel[i] = i; + else + sel[i] = nelt + ((nelt + k) % 3) + 3 * (j++); + indices.new_vector (sel, 2, nelt); + perm3_mask_high = vect_gen_perm_mask_checked (vectype, indices); + } first_vect = dr_chain[0]; 
second_vect = dr_chain[1]; @@ -6186,17 +6238,43 @@ vect_permute_load_chain (vec_info *vinfo, vec<tree> dr_chain, /* The encoding has a single stepped pattern. */ poly_uint64 nelt = TYPE_VECTOR_SUBPARTS (vectype); - vec_perm_builder sel (nelt, 1, 3); - sel.quick_grow (3); - for (i = 0; i < 3; ++i) - sel[i] = i * 2; - vec_perm_indices indices (sel, 2, nelt); - perm_mask_even = vect_gen_perm_mask_checked (vectype, indices); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + { + vec_perm_builder sel; + unsigned neltc = nelt.to_constant (); + sel.new_vector (neltc, neltc, 1); + sel.quick_grow (neltc); + for (unsigned i = 0; i != neltc / 2; i++) + { + sel[2 * i] = i * 4; + sel[2 * i + 1] = i * 4 + 1; + } + vec_perm_indices indices (sel, 2, nelt); + perm_mask_even = vect_gen_perm_mask_checked (vectype, indices); - for (i = 0; i < 3; ++i) - sel[i] = i * 2 + 1; - indices.new_vector (sel, 2, nelt); - perm_mask_odd = vect_gen_perm_mask_checked (vectype, indices); + for (unsigned i = 0; i != nelt.to_constant() / 2; i++) + { + sel[2 * i] = i * 4 + 2; + sel[2 * i + 1] = i * 4 + 3; + } + indices.new_vector (sel, 2, nelt); + perm_mask_odd = vect_gen_perm_mask_checked (vectype, indices); + } + else + { + vec_perm_builder sel (nelt, 1, 3); + sel.quick_grow (3); + for (i = 0; i < 3; ++i) + sel[i] = i * 2; + + vec_perm_indices indices (sel, 2, nelt); + perm_mask_even = vect_gen_perm_mask_checked (vectype, indices); + + for (i = 0; i < 3; ++i) + sel[i] = i * 2 + 1; + indices.new_vector (sel, 2, nelt); + perm_mask_odd = vect_gen_perm_mask_checked (vectype, indices); + } for (i = 0; i < log_length; i++) { diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index 3a70c15b593..365fa738022 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -200,7 +200,12 @@ vect_determine_vf_for_stmt_1 (vec_info *vinfo, stmt_vec_info stmt_info, } if (nunits_vectype) - vect_update_max_nunits (vf, nunits_vectype); + { + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (nunits_vectype); + if 
(STMT_VINFO_COMPLEX_P (stmt_info)) + nunits = exact_div (nunits, 2); + vect_update_max_nunits (vf, nunits); + } return opt_result::success (); } diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index dab5daddcc5..5d66ea2f286 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -877,10 +877,14 @@ vect_record_max_nunits (vec_info *vinfo, stmt_vec_info stmt_info, return false; } + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + nunits = exact_div (nunits, 2); + /* If populating the vector type requires unrolling then fail before adjusting *max_nunits for basic-block vectorization. */ if (is_a <bb_vec_info> (vinfo) - && !multiple_p (group_size, TYPE_VECTOR_SUBPARTS (vectype))) + && !multiple_p (group_size , nunits)) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, @@ -891,7 +895,7 @@ vect_record_max_nunits (vec_info *vinfo, stmt_vec_info stmt_info, } /* In case of multiple types we need to detect the smallest type. */ - vect_update_max_nunits (max_nunits, vectype); + vect_update_max_nunits (max_nunits, nunits); return true; } @@ -3720,22 +3724,54 @@ vect_optimize_slp (vec_info *vinfo) vect_attempt_slp_rearrange_stmts did. This allows us to be lazy when permuting constants and invariants keeping the permute bijective. */ - auto_sbitmap load_index (SLP_TREE_LANES (node)); - bitmap_clear (load_index); - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) - bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin); - unsigned j; - for (j = 0; j < SLP_TREE_LANES (node); ++j) - if (!bitmap_bit_p (load_index, j)) - break; - if (j != SLP_TREE_LANES (node)) - continue; + /* Permutation of Complex type. 
*/ + if (STMT_VINFO_COMPLEX_P (dr_stmt)) + { + auto_sbitmap load_index (SLP_TREE_LANES (node) * 2); + bitmap_clear (load_index); + for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) + { + unsigned bit = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin; + bitmap_set_bit (load_index, 2 * bit); + bitmap_set_bit (load_index, 2 * bit + 1); + } + unsigned j; + for (j = 0; j < SLP_TREE_LANES (node) * 2; ++j) + if (!bitmap_bit_p (load_index, j)) + break; + if (j != SLP_TREE_LANES (node) * 2) + continue; - vec<unsigned> perm = vNULL; - perm.safe_grow (SLP_TREE_LANES (node), true); - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) - perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin; - perms.safe_push (perm); + vec<unsigned> perm = vNULL; + perm.safe_grow (SLP_TREE_LANES (node) * 2, true); + for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) + { + unsigned cidx = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin; + perm[2 * j] = 2 * cidx; + perm[2 * j + 1] = 2 * cidx + 1; + } + perms.safe_push (perm); + } + else + { + auto_sbitmap load_index (SLP_TREE_LANES (node)); + bitmap_clear (load_index); + for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) + bitmap_set_bit (load_index, + SLP_TREE_LOAD_PERMUTATION (node)[j] - imin); + unsigned j; + for (j = 0; j < SLP_TREE_LANES (node); ++j) + if (!bitmap_bit_p (load_index, j)) + break; + if (j != SLP_TREE_LANES (node)) + continue; + + vec<unsigned> perm = vNULL; + perm.safe_grow (SLP_TREE_LANES (node), true); + for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) + perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin; + perms.safe_push (perm); + } vertices[idx].perm_in = perms.length () - 1; vertices[idx].perm_out = perms.length () - 1; } @@ -4518,6 +4554,12 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node, vf = loop_vinfo->vectorization_factor; else vf = 1; + /* For complex type and SLP, double vf to get right vectype. 
+ i.e. vector(4) double for complex double, group size is 2, double vf + to map vf * group_size to TYPE_VECTOR_SUBPARTS. */ + if (STMT_VINFO_COMPLEX_P (stmt_info)) + vf *= 2; + unsigned int group_size = SLP_TREE_LANES (node); tree vectype = SLP_TREE_VECTYPE (node); SLP_TREE_NUMBER_OF_VEC_STMTS (node) @@ -4763,10 +4805,17 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node, } unsigned group_size = SLP_TREE_LANES (child); poly_uint64 vf = 1; + if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo)) vf = loop_vinfo->vectorization_factor; + + /* V2SF is just 1 complex type, so multiply by 2 + to get the right number of vectors. */ + unsigned cp + = STMT_VINFO_COMPLEX_P (SLP_TREE_REPRESENTATIVE (node)) ? 2 : 1; + SLP_TREE_NUMBER_OF_VEC_STMTS (child) - = vect_get_num_vectors (vf * group_size, vector_type); + = vect_get_num_vectors (vf * group_size * cp, vector_type); /* And cost them. */ vect_prologue_cost_for_slp (child, cost_vec); } @@ -6402,6 +6451,11 @@ vect_create_constant_vectors (vec_info *vinfo, slp_tree op_node) /* We always want SLP_TREE_VECTYPE (op_node) here correctly set. */ vector_type = SLP_TREE_VECTYPE (op_node); + unsigned int cp = 1; + /* Handle Complex type vector init. + SLP_TREE_REPRESENTATIVE (op_node) could be NULL. */ + if (TREE_CODE (TREE_TYPE (op_node->ops[0])) == COMPLEX_TYPE) + cp = 2; unsigned int number_of_vectors = SLP_TREE_NUMBER_OF_VEC_STMTS (op_node); SLP_TREE_VEC_DEFS (op_node).create (number_of_vectors); @@ -6426,9 +6480,9 @@ vect_create_constant_vectors (vec_info *vinfo, slp_tree op_node) /* When using duplicate_and_interleave, we just need one element for each scalar statement. 
*/ if (!TYPE_VECTOR_SUBPARTS (vector_type).is_constant (&nunits)) - nunits = group_size; + nunits = group_size * cp; - number_of_copies = nunits * number_of_vectors / group_size; + number_of_copies = nunits * number_of_vectors / (group_size * cp); number_of_places_left_in_vector = nunits; constant_p = true; @@ -6460,8 +6514,23 @@ vect_create_constant_vectors (vec_info *vinfo, slp_tree op_node) gcc_unreachable (); } else - op = fold_unary (VIEW_CONVERT_EXPR, - TREE_TYPE (vector_type), op); + { + tree scalar_type = TREE_TYPE (vector_type); + /* For complex type, insert real and imag part + separately. */ + if (cp == 2) + { + gcc_assert ((TREE_CODE (TREE_TYPE (op)) + == COMPLEX_TYPE) + && (scalar_type + == TREE_TYPE (TREE_TYPE (op)))); + elts[number_of_places_left_in_vector--] + = fold_unary (IMAGPART_EXPR, scalar_type, op); + op = fold_unary (REALPART_EXPR, scalar_type, op); + } + else + op = fold_unary (VIEW_CONVERT_EXPR, scalar_type, op); + } gcc_assert (op && CONSTANT_CLASS_P (op)); } else @@ -6481,11 +6550,28 @@ vect_create_constant_vectors (vec_info *vinfo, slp_tree op_node) } else { - op = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (vector_type), - op); - init_stmt - = gimple_build_assign (new_temp, VIEW_CONVERT_EXPR, - op); + tree scalar_type = TREE_TYPE (vector_type); + if (cp == 2) + { + gcc_assert ((TREE_CODE (TREE_TYPE (op)) + == COMPLEX_TYPE) + && (scalar_type + == TREE_TYPE (TREE_TYPE (op)))); + tree imag = build1 (IMAGPART_EXPR, scalar_type, op); + op = build1 (REALPART_EXPR, scalar_type, op); + tree imag_temp = make_ssa_name (scalar_type); + elts[number_of_places_left_in_vector--] = imag_temp; + init_stmt = gimple_build_assign (imag_temp, imag); + gimple_seq_add_stmt (&ctor_seq, init_stmt); + init_stmt = gimple_build_assign (new_temp, op); + } + else + { + op = build1 (VIEW_CONVERT_EXPR, scalar_type, op); + init_stmt + = gimple_build_assign (new_temp, VIEW_CONVERT_EXPR, + op); + } } gimple_seq_add_stmt (&ctor_seq, init_stmt); op = new_temp; @@ -6696,15 
+6782,17 @@ vect_transform_slp_perm_load (vec_info *vinfo, unsigned int nelts_to_build; unsigned int nvectors_per_build; unsigned int in_nlanes; + unsigned int cp = STMT_VINFO_COMPLEX_P (stmt_info) ? 2 : 1; bool repeating_p = (group_size == DR_GROUP_SIZE (stmt_info) - && multiple_p (nunits, group_size)); + && multiple_p (nunits, group_size * cp)); if (repeating_p) { /* A single vector contains a whole number of copies of the node, so: (a) all permutes can use the same mask; and (b) the permutes only need a single vector input. */ - mask.new_vector (nunits, group_size, 3); - nelts_to_build = mask.encoded_nelts (); + /* For complex type, mask size should be twice nelts_to_build. */ + mask.new_vector (nunits, group_size * cp, 3); + nelts_to_build = mask.encoded_nelts () / cp; nvectors_per_build = SLP_TREE_VEC_STMTS (node).length (); in_nlanes = DR_GROUP_SIZE (stmt_info) * 3; } @@ -6744,8 +6832,8 @@ vect_transform_slp_perm_load (vec_info *vinfo, { /* Enforced before the loop when !repeating_p. */ unsigned int const_nunits = nunits.to_constant (); - vec_index = i / const_nunits; - mask_element = i % const_nunits; + vec_index = i / (const_nunits / cp); + mask_element = i % (const_nunits / cp); if (vec_index == first_vec_index || first_vec_index == -1) { @@ -6755,7 +6843,7 @@ vect_transform_slp_perm_load (vec_info *vinfo, || second_vec_index == -1) { second_vec_index = vec_index; - mask_element += const_nunits; + mask_element += (const_nunits / cp); } else { @@ -6768,14 +6856,24 @@ vect_transform_slp_perm_load (vec_info *vinfo, return false; } - gcc_assert (mask_element < 2 * const_nunits); + gcc_assert (mask_element < 2 * const_nunits / cp); } if (mask_element != index) noop_p = false; - mask[index++] = mask_element; + /* Set index for complex type. + i.e. mask like [1,0] is actually [2, 3, 0, 1] + for vector scalar type. 
*/ + if (cp == 2) + { + mask[2 * index] = 2 * mask_element; + mask[2 * index + 1] = 2 * mask_element + 1; + } + else + mask[index] = mask_element; + index++; - if (index == count && !noop_p) + if (index * cp == count && !noop_p) { indices.new_vector (mask, second_vec_index == -1 ? 1 : 2, nunits); if (!can_vec_perm_const_p (mode, mode, indices)) @@ -6799,7 +6897,7 @@ vect_transform_slp_perm_load (vec_info *vinfo, ++*n_perms; } - if (index == count) + if (index * cp == count) { if (!analyze_only) { @@ -6869,7 +6967,7 @@ vect_transform_slp_perm_load (vec_info *vinfo, bool load_seen = false; for (unsigned i = 0; i < in_nlanes; ++i) { - if (i % const_nunits == 0) + if (i % (const_nunits * cp) == 0) { if (load_seen) *n_loads += 1; diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc index 72107afc883..d6223c28f1c 100644 --- a/gcc/tree-vect-stmts.cc +++ b/gcc/tree-vect-stmts.cc @@ -1397,25 +1397,70 @@ vect_init_vector (vec_info *vinfo, stmt_vec_info stmt_info, tree val, tree type, { gimple *init_stmt; tree new_temp; + tree scalar_type = TREE_TYPE (type); + gimple_seq stmts = NULL; + + if (TREE_CODE (TREE_TYPE (val)) == COMPLEX_TYPE) + { + unsigned HOST_WIDE_INT nunits; + gcc_assert (TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)); + + tree_vector_builder elts (type, nunits, 1); + tree imag, real; + if (TREE_CODE (val) == COMPLEX_CST) + { + real = fold_unary (REALPART_EXPR, scalar_type, val); + imag = fold_unary (IMAGPART_EXPR, scalar_type, val); + } + else + { + real = make_ssa_name (scalar_type); + imag = make_ssa_name (scalar_type); + init_stmt + = gimple_build_assign (real, + build1 (REALPART_EXPR, scalar_type, val)); + gimple_seq_add_stmt (&stmts, init_stmt); + init_stmt + = gimple_build_assign (imag, + build1 (IMAGPART_EXPR, scalar_type, val)); + gimple_seq_add_stmt (&stmts, init_stmt); + } + /* Build vector as [real,imag,real,imag,...]. 
*/ + for (unsigned i = 0; i != nunits; i++) + { + if (i % 2) + elts.quick_push (imag); + else + elts.quick_push (real); + } + val = gimple_build_vector (&stmts, &elts); + if (!gimple_seq_empty_p (stmts)) + { + if (gsi) + gsi_insert_seq_before (gsi, stmts, GSI_SAME_STMT); + else + vinfo->insert_seq_on_entry (stmt_info, stmts); + } + } /* We abuse this function to push sth to a SSA name with initial 'val'. */ - if (! useless_type_conversion_p (type, TREE_TYPE (val))) + else if (! useless_type_conversion_p (type, TREE_TYPE (val))) { gcc_assert (TREE_CODE (type) == VECTOR_TYPE); - if (! types_compatible_p (TREE_TYPE (type), TREE_TYPE (val))) + if (! types_compatible_p (scalar_type, TREE_TYPE (val))) { /* Scalar boolean value should be transformed into all zeros or all ones value before building a vector. */ if (VECTOR_BOOLEAN_TYPE_P (type)) { - tree true_val = build_all_ones_cst (TREE_TYPE (type)); - tree false_val = build_zero_cst (TREE_TYPE (type)); + tree true_val = build_all_ones_cst (scalar_type); + tree false_val = build_zero_cst (scalar_type); if (CONSTANT_CLASS_P (val)) val = integer_zerop (val) ? false_val : true_val; else { - new_temp = make_ssa_name (TREE_TYPE (type)); + new_temp = make_ssa_name (scalar_type); init_stmt = gimple_build_assign (new_temp, COND_EXPR, val, true_val, false_val); vect_init_vector_1 (vinfo, stmt_info, init_stmt, gsi); @@ -1424,14 +1469,13 @@ vect_init_vector (vec_info *vinfo, stmt_vec_info stmt_info, tree val, tree type, } else { - gimple_seq stmts = NULL; if (! INTEGRAL_TYPE_P (TREE_TYPE (val))) val = gimple_build (&stmts, VIEW_CONVERT_EXPR, - TREE_TYPE (type), val); + scalar_type, val); else /* ??? Condition vectorization expects us to do promotion of invariant/external defs. 
*/ - val = gimple_convert (&stmts, TREE_TYPE (type), val); + val = gimple_convert (&stmts, scalar_type, val); for (gimple_stmt_iterator gsi2 = gsi_start (stmts); !gsi_end_p (gsi2); ) { @@ -1496,7 +1540,12 @@ vect_get_vec_defs_for_operand (vec_info *vinfo, stmt_vec_info stmt_vinfo, && VECTOR_BOOLEAN_TYPE_P (stmt_vectype)) vector_type = truth_type_for (stmt_vectype); else - vector_type = get_vectype_for_scalar_type (loop_vinfo, TREE_TYPE (op)); + { + tree scalar_type = TREE_TYPE (op); + if (STMT_VINFO_COMPLEX_P (stmt_vinfo)) + scalar_type = TREE_TYPE (scalar_type); + vector_type = get_vectype_for_scalar_type (loop_vinfo, scalar_type); + } gcc_assert (vector_type); tree vop = vect_init_vector (vinfo, stmt_vinfo, op, vector_type, NULL); @@ -1892,6 +1941,13 @@ vect_truncate_gather_scatter_offset (stmt_vec_info stmt_info, return false; } + if (STMT_VINFO_COMPLEX_P (stmt_info)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Complex type doesn't support gather_scatter.\n"); + return false; + } /* Get the number of bits in an element. */ tree vectype = STMT_VINFO_VECTYPE (stmt_info); scalar_mode element_mode = SCALAR_TYPE_MODE (TREE_TYPE (vectype)); @@ -2022,6 +2078,30 @@ perm_mask_for_reverse (tree vectype) return vect_gen_perm_mask_checked (vectype, indices); } +static tree +perm_mask_for_reverse (tree vectype, bool complex_p) +{ + if (!complex_p) + return perm_mask_for_reverse (vectype); + + unsigned HOST_WIDE_INT nunits; + gcc_assert (TYPE_VECTOR_SUBPARTS (vectype).is_constant (&nunits)); + + /* The encoding has a single stepped pattern. 
*/ + vec_perm_builder sel (nunits, nunits, 1); + for (unsigned i = 0; i < nunits; i+=2) + { + sel.quick_push (nunits - 2 - i); + sel.quick_push (nunits - 1 - i); + } + + vec_perm_indices indices (sel, 1, nunits); + if (!can_vec_perm_const_p (TYPE_MODE (vectype), TYPE_MODE (vectype), + indices)) + return NULL_TREE; + return vect_gen_perm_mask_checked (vectype, indices); +} + /* A subroutine of get_load_store_type, with a subset of the same arguments. Handle the case where STMT_INFO is a load or store that accesses consecutive elements with a negative step. Sets *POFFSET @@ -2045,8 +2125,12 @@ get_negative_load_store_type (vec_info *vinfo, } /* For backward running DRs the first access in vectype actually is - N-1 elements before the address of the DR. */ - *poffset = ((-TYPE_VECTOR_SUBPARTS (vectype) + 1) + N-1 elements before the address of the DR. + for Complex type, it's N - 2. */ + unsigned cp = 1; + if (STMT_VINFO_COMPLEX_P (stmt_info)) + cp = 2; + *poffset = ((-TYPE_VECTOR_SUBPARTS (vectype) + cp) * TREE_INT_CST_LOW (TYPE_SIZE_UNIT (TREE_TYPE (vectype)))); int misalignment = dr_misalignment (dr_info, vectype, *poffset); @@ -2071,7 +2155,7 @@ get_negative_load_store_type (vec_info *vinfo, return VMAT_CONTIGUOUS_DOWN; } - if (!perm_mask_for_reverse (vectype)) + if (!perm_mask_for_reverse (vectype, STMT_VINFO_COMPLEX_P (stmt_info))) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, @@ -2188,6 +2272,8 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info, && !DR_GROUP_NEXT_ELEMENT (stmt_info)); unsigned HOST_WIDE_INT gap = DR_GROUP_GAP (first_stmt_info); poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + nunits = exact_div (nunits, 2); /* True if the vectorized statements would access beyond the last statement in the group. 
*/ @@ -2352,7 +2438,11 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info, { /* First cope with the degenerate case of a single-element vector. */ - if (known_eq (TYPE_VECTOR_SUBPARTS (vectype), 1U)) + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + nunits = exact_div (nunits, 2); + + if (known_eq (nunits, 1U)) ; /* Otherwise try using LOAD/STORE_LANES. */ @@ -2361,6 +2451,8 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info, : vect_store_lanes_supported (vectype, group_size, masked_p)) { + if (STMT_VINFO_COMPLEX_P (stmt_info)) + return false; *memory_access_type = VMAT_LOAD_STORE_LANES; overrun_p = would_overrun_p; } @@ -2620,6 +2712,14 @@ vect_check_scalar_mask (vec_info *vinfo, stmt_vec_info stmt_info, return false; } + if (STMT_VINFO_COMPLEX_P (stmt_info)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "Complex type doesn't support mask argument.\n"); + return false; + } + if (!VECT_SCALAR_BOOLEAN_TYPE_P (TREE_TYPE (*mask))) { if (dump_enabled_p ()) @@ -7509,8 +7609,17 @@ vectorizable_store (vec_info *vinfo, same location twice. 
*/ gcc_assert (slp == PURE_SLP_STMT (stmt_info)); + if (!STMT_VINFO_DATA_REF (stmt_info)) + return false; + tree vectype = STMT_VINFO_VECTYPE (stmt_info), rhs_vectype = NULL_TREE; poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + { + if (!nunits.is_constant ()) + return false; + nunits = exact_div (nunits, 2); + } if (loop_vinfo) { @@ -7526,7 +7635,8 @@ vectorizable_store (vec_info *vinfo, if (slp) ncopies = 1; else - ncopies = vect_get_num_copies (loop_vinfo, vectype); + ncopies = vect_get_num_copies (loop_vinfo, vectype, + STMT_VINFO_COMPLEX_P (stmt_info)); gcc_assert (ncopies >= 1); @@ -7544,11 +7654,10 @@ vectorizable_store (vec_info *vinfo, return false; elem_type = TREE_TYPE (vectype); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + elem_type = build_complex_type (elem_type); vec_mode = TYPE_MODE (vectype); - if (!STMT_VINFO_DATA_REF (stmt_info)) - return false; - vect_memory_access_type memory_access_type; enum dr_alignment_support alignment_support_scheme; int misalignment; @@ -7951,21 +8060,31 @@ vectorizable_store (vec_info *vinfo, tree lvectype = vectype; if (slp) { + scalar_mode elmode; if (group_size < const_nunits && const_nunits % group_size == 0) { nstores = const_nunits / group_size; - lnel = group_size; - ltype = build_vector_type (elem_type, group_size); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + { + lnel = group_size * 2; + ltype = build_vector_type (TREE_TYPE (elem_type), group_size * 2); + elmode = SCALAR_TYPE_MODE (TREE_TYPE (elem_type)); + } + else + { + ltype = build_vector_type (elem_type, group_size); + lnel = group_size; + elmode = SCALAR_TYPE_MODE (elem_type); + } lvectype = vectype; /* First check if vec_extract optab doesn't support extraction of vector elts directly. 
*/ - scalar_mode elmode = SCALAR_TYPE_MODE (elem_type); machine_mode vmode; if (!VECTOR_MODE_P (TYPE_MODE (vectype)) || !related_vector_mode (TYPE_MODE (vectype), elmode, - group_size).exists (&vmode) + lnel).exists (&vmode) || (convert_optab_handler (vec_extract_optab, TYPE_MODE (vectype), vmode) == CODE_FOR_nothing)) @@ -8051,6 +8170,8 @@ vectorizable_store (vec_info *vinfo, unsigned int group_el = 0; unsigned HOST_WIDE_INT elsz = tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype))); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + elsz *= 2; for (j = 0; j < ncopies; j++) { vec_oprnd = vec_oprnds[j]; @@ -8448,7 +8569,9 @@ vectorizable_store (vec_info *vinfo, if (memory_access_type == VMAT_CONTIGUOUS_REVERSE) { - tree perm_mask = perm_mask_for_reverse (vectype); + tree perm_mask + = perm_mask_for_reverse (vectype, + STMT_VINFO_COMPLEX_P (stmt_info)); tree perm_dest = vect_create_destination_var (vect_get_store_rhs (stmt_info), vectype); tree new_temp = make_ssa_name (perm_dest); @@ -8778,6 +8901,12 @@ vectorizable_load (vec_info *vinfo, tree vectype = STMT_VINFO_VECTYPE (stmt_info); poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + { + if (!nunits.is_constant ()) + return false; + nunits = exact_div (nunits, 2); + } if (loop_vinfo) { @@ -8794,7 +8923,8 @@ vectorizable_load (vec_info *vinfo, if (slp) ncopies = 1; else - ncopies = vect_get_num_copies (loop_vinfo, vectype); + ncopies = vect_get_num_copies (loop_vinfo, vectype, + STMT_VINFO_COMPLEX_P (stmt_info)); gcc_assert (ncopies >= 1); @@ -8822,6 +8952,8 @@ vectorizable_load (vec_info *vinfo, } elem_type = TREE_TYPE (vectype); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + elem_type = build_complex_type (elem_type); mode = TYPE_MODE (vectype); /* FORNOW. In some cases can vectorize even if data-type not supported @@ -8870,8 +9002,11 @@ vectorizable_load (vec_info *vinfo, if (k > maxk) maxk = k; tree vectype = SLP_TREE_VECTYPE (slp_node); + /* For complex type, half the nunits. 
*/ if (!TYPE_VECTOR_SUBPARTS (vectype).is_constant (&nunits) - || maxk >= (DR_GROUP_SIZE (group_info) & ~(nunits - 1))) + || maxk >= (DR_GROUP_SIZE (group_info) + & ~((STMT_VINFO_COMPLEX_P (group_info) + ? nunits >> 1 : nunits) - 1))) { if (dump_enabled_p ()) dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, @@ -9098,9 +9233,10 @@ vectorizable_load (vec_info *vinfo, } else { + unsigned cp = STMT_VINFO_COMPLEX_P (stmt_info) ? 2 : 1; if (grouped_load) cst_offset - = (tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype))) + = (tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype))) * cp * vect_get_place_in_interleaving_chain (stmt_info, first_stmt_info)); group_size = 1; @@ -9150,6 +9286,8 @@ vectorizable_load (vec_info *vinfo, int nloads = const_nunits; int lnel = 1; tree ltype = TREE_TYPE (vectype); + if (STMT_VINFO_COMPLEX_P (stmt_info)) + ltype = build_complex_type (ltype); tree lvectype = vectype; auto_vec<tree> dr_chain; if (memory_access_type == VMAT_STRIDED_SLP) @@ -10080,7 +10218,9 @@ vectorizable_load (vec_info *vinfo, if (memory_access_type == VMAT_CONTIGUOUS_REVERSE) { - tree perm_mask = perm_mask_for_reverse (vectype); + tree perm_mask + = perm_mask_for_reverse (vectype, + STMT_VINFO_COMPLEX_P (stmt_info)); new_temp = permute_vec_elements (vinfo, new_temp, new_temp, perm_mask, stmt_info, gsi); new_stmt = SSA_NAME_DEF_STMT (new_temp); @@ -12499,12 +12639,27 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info, dump_printf_loc (MSG_NOTE, vect_location, "get vectype for scalar type: %T\n", scalar_type); } + + tree orig_scalar_type = scalar_type; + if (TREE_CODE (scalar_type) == COMPLEX_TYPE) + { + /* Set complex_p for BB vectorizer. */ + STMT_VINFO_COMPLEX_P (stmt_info) = true; + scalar_type = TREE_TYPE (scalar_type); + /* Double group_size for BB vectorizer to make + following 2 get_vectype_for_scalar_type return wanted vectype. + Real group size is not changed, just make the "faked" input + group_size. 
*/ + group_size *= 2; + } vectype = get_vectype_for_scalar_type (vinfo, scalar_type, group_size); - if (!vectype) + if (!vectype + || (STMT_VINFO_COMPLEX_P (stmt_info) + && !TYPE_VECTOR_SUBPARTS (vectype).is_constant ())) return opt_result::failure_at (stmt, "not vectorized:" " unsupported data-type %T\n", - scalar_type); + orig_scalar_type); if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "vectype: %T\n", vectype); @@ -12529,16 +12684,30 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info, TREE_TYPE (vectype)); if (scalar_type != TREE_TYPE (vectype)) { - if (dump_enabled_p ()) + tree orig_scalar_type = scalar_type; + if (TREE_CODE (scalar_type) == COMPLEX_TYPE) + { + /* Set complex_p for Loop vectorizer. */ + STMT_VINFO_COMPLEX_P (stmt_info) = true; + scalar_type = TREE_TYPE (scalar_type); + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "get complex for smallest scalar type: %T\n", + scalar_type); + + } + else if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "get vectype for smallest scalar type: %T\n", scalar_type); nunits_vectype = get_vectype_for_scalar_type (vinfo, scalar_type, group_size); - if (!nunits_vectype) + if (!nunits_vectype + || (STMT_VINFO_COMPLEX_P (stmt_info) + && !TYPE_VECTOR_SUBPARTS (nunits_vectype).is_constant ())) return opt_result::failure_at (stmt, "not vectorized: unsupported data-type %T\n", - scalar_type); + orig_scalar_type); if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "nunits vectype: %T\n", nunits_vectype); diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index e5fdc9e0a14..4a809e492c4 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -1161,6 +1161,9 @@ public: vectorization. */ bool vectorizable; + /* The scalar type of the LHS of this statement is complex type. */ + bool complex_p; + /* The stmt to which this info struct refers to. 
*/ gimple *stmt; @@ -1395,6 +1398,7 @@ struct gather_scatter_info { #define STMT_VINFO_REDUC_EPILOGUE_ADJUSTMENT(S) (S)->reduc_epilogue_adjustment #define STMT_VINFO_REDUC_IDX(S) (S)->reduc_idx #define STMT_VINFO_FORCE_SINGLE_CYCLE(S) (S)->force_single_cycle +#define STMT_VINFO_COMPLEX_P(S) (S)->complex_p #define STMT_VINFO_DR_WRT_VEC_LOOP(S) (S)->dr_wrt_vec_loop #define STMT_VINFO_DR_BASE_ADDRESS(S) (S)->dr_wrt_vec_loop.base_address @@ -1970,6 +1974,15 @@ vect_get_num_copies (loop_vec_info loop_vinfo, tree vectype) return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo), vectype); } +static inline unsigned int +vect_get_num_copies (loop_vec_info loop_vinfo, tree vectype, bool complex_p) +{ + poly_uint64 nunits = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + if (complex_p) + nunits *= 2; + return vect_get_num_vectors (nunits, vectype); +} + /* Update maximum unit count *MAX_NUNITS so that it accounts for NUNITS. *MAX_NUNITS can be 1 if we haven't yet recorded anything. */