Message ID | 20210812054323.897480-1-hongtao.liu@intel.com
---|---
State | New
Series | [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.
On Thu, Aug 12, 2021 at 01:43:23PM +0800, liuhongt wrote:
> Hi:
>   This is another patch to optimize vec_perm_expr to match vpmov{dw,dq,wb}
> under AVX512.
> For scenarios(like pr101846-2.c) where the upper half is not used, this patch
> generates better code with only one vpmov{wb,dw,qd} instruction. For
> scenarios(like pr101846-3.c) where the upper half is actually used, if the src
> vector length is 256/512bits, the patch can still generate better code, but for
> 128bits, the code generation is worse.
>
> 128 bits upper half not used.
>
> -       vpshufb .LC2(%rip), %xmm0, %xmm0
> +       vpmovdw %xmm0, %xmm0
>
> 128 bits upper half used.
> -       vpshufb .LC2(%rip), %xmm0, %xmm0
> +       vpmovdw %xmm0, %xmm1
> +       vmovq   %xmm1, %rax
> +       vpinsrq $0, %rax, %xmm0, %xmm0
>
> Maybe expand_vec_perm_trunc_vinsert should only deal with 256/512bits of
> vectors, but considering the real use of scenarios like pr101846-3.c
> foo_*_128 possibility is relatively low, I still keep this part of the code.

I actually am not sure if even

foo_dw_512:
.LFB0:
	.cfi_startproc
-	vmovdqa64	%zmm0, %zmm1
-	vmovdqa64	.LC0(%rip), %zmm0
-	vpermi2w	%zmm1, %zmm1, %zmm0
+	vpmovdw	%zmm0, %ymm1
+	vinserti64x4	$0x0, %ymm1, %zmm0, %zmm0
	ret

is always a win, the permutations we should care most about are in loops
and the constant load as well as the first move in that case likely go
away and it is one permutation insn vs. two.
Different case is e.g.

-	vmovdqa64	.LC5(%rip), %zmm2
-	vmovdqa64	%zmm0, %zmm1
-	vmovdqa64	.LC0(%rip), %zmm0
-	vpermi2w	%zmm1, %zmm1, %zmm2
-	vpermi2w	%zmm1, %zmm1, %zmm0
-	vpshufb	.LC6(%rip), %zmm0, %zmm0
-	vpshufb	.LC7(%rip), %zmm2, %zmm1
-	vporq	%zmm1, %zmm0, %zmm0
+	vpmovwb	%zmm0, %ymm1
+	vinserti64x4	$0x0, %ymm1, %zmm0, %zmm0

So, I wonder if your new routine shouldn't be instead done after
in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases
and handle the other vpmovdw etc. cases in combine splitters (see that we
only use low half or quarter of the result and transform whatever
permutation we've used into what we want).

And perhaps make the routine eventually more general, don't handle
just identity permutation in the upper half, but allow there other
permutations too (ones where that half can be represented by a single insn
permutation).

>
> Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
> Ok for trunk?
>
> gcc/ChangeLog:
>
> 	PR target/101846
> 	* config/i386/i386-expand.c (expand_vec_perm_trunc_vinsert):
> 	New function.
> 	(ix86_vectorize_vec_perm_const): Call
> 	expand_vec_perm_trunc_vinsert.
> 	* config/i386/sse.md (vec_set_lo_v32hi): New define_insn.
> 	(vec_set_lo_v64qi): Ditto.
> 	(vec_set_lo_<mode><mask_name>): Extend to no-avx512dq.
>
> gcc/testsuite/ChangeLog:
>
> 	PR target/101846
> 	* gcc.target/i386/pr101846-2.c: New test.
> 	* gcc.target/i386/pr101846-3.c: New test.
> --- > gcc/config/i386/i386-expand.c | 125 +++++++++++++++++++++ > gcc/config/i386/sse.md | 60 +++++++++- > gcc/testsuite/gcc.target/i386/pr101846-2.c | 81 +++++++++++++ > gcc/testsuite/gcc.target/i386/pr101846-3.c | 95 ++++++++++++++++ > 4 files changed, 359 insertions(+), 2 deletions(-) > create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-2.c > create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-3.c > > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c > index bd21efa9530..519caac2e15 100644 > --- a/gcc/config/i386/i386-expand.c > +++ b/gcc/config/i386/i386-expand.c > @@ -18317,6 +18317,126 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) > return false; > } > > +/* A subroutine of ix86_expand_vec_perm_const_1. Try to implement D > + in terms of a pair of vpmovdw + vinserti128 instructions. */ > +static bool > +expand_vec_perm_trunc_vinsert (struct expand_vec_perm_d *d) > +{ > + unsigned i, nelt = d->nelt, mask = d->nelt - 1; > + unsigned half = nelt / 2; > + machine_mode half_mode, trunc_mode; > + > + /* vpmov{wb,dw,qd} only available under AVX512. */ > + if (!d->one_operand_p || !TARGET_AVX512F > + || (!TARGET_AVX512VL && GET_MODE_SIZE (d->vmode) < 64) Too many spaces. > + || GET_MODE_SIZE (GET_MODE_INNER (d->vmode)) > 4) > + return false; > + > + /* TARGET_AVX512BW is needed for vpmovwb. */ > + if (GET_MODE_INNER (d->vmode) == E_QImode && !TARGET_AVX512BW) > + return false; > + > + for (i = 0; i < nelt; i++) > + { > + unsigned idx = d->perm[i] & mask; > + if (idx != i * 2 && i < half) > + return false; > + if (idx != i && i >= half) > + return false; > + } > + > + rtx (*gen_trunc) (rtx, rtx) = NULL; > + rtx (*gen_vec_set_lo) (rtx, rtx, rtx) = NULL; > + switch (d->vmode) > + { > + case E_V16QImode: > + gen_trunc = gen_truncv8hiv8qi2; > + gen_vec_set_lo = gen_vec_setv2di; > + half_mode = V8QImode; > + trunc_mode = V8HImode; > + break; > + case E_V32QImode: > + gen_trunc = gen_truncv16hiv16qi2; > + gen_vec_set_lo = gen_vec_set_lo_v32qi; > + half_mode = V16QImode; > + trunc_mode = V16HImode; > + break; > + case E_V64QImode: > + gen_trunc = gen_truncv32hiv32qi2; > + gen_vec_set_lo = gen_vec_set_lo_v64qi; > + half_mode = V32QImode; > + trunc_mode = V32HImode; > + break; > + case E_V8HImode: > + gen_trunc = gen_truncv4siv4hi2; > + gen_vec_set_lo = gen_vec_setv2di; > + half_mode = V4HImode; > + trunc_mode = V4SImode; > + break; > + case E_V16HImode: > + gen_trunc = gen_truncv8siv8hi2; > + gen_vec_set_lo = gen_vec_set_lo_v16hi; > + half_mode = V8HImode; > + trunc_mode = V8SImode; > + break; > + case E_V32HImode: > + gen_trunc = gen_truncv16siv16hi2; > + gen_vec_set_lo = gen_vec_set_lo_v32hi; > + half_mode = V16HImode; > + trunc_mode = V16SImode; > + break; > + case E_V4SImode: > + gen_trunc = gen_truncv2div2si2; > + gen_vec_set_lo = gen_vec_setv2di; > + half_mode = V2SImode; > + trunc_mode = V2DImode; > + break; > + case E_V8SImode: > + gen_trunc = gen_truncv4div4si2; > + gen_vec_set_lo = gen_vec_set_lo_v8si; > + half_mode = V4SImode; > + trunc_mode = V4DImode; > + break; > + case E_V16SImode: > + gen_trunc = gen_truncv8div8si2; > + gen_vec_set_lo = gen_vec_set_lo_v16si; > + half_mode = V8SImode; > + trunc_mode = V8DImode; > + break; > + > + default: > + break; > + } > + > + if (gen_trunc == NULL) > + return false; Isn't it simpler to return false; in the default: case? 
> @@ -21028,6 +21148,11 @@ ix86_vectorize_vec_perm_const (machine_mode vmode, rtx target, rtx op0,
>        d.op0 = nop0;
>        d.op1 = force_reg (vmode, d.op1);
>
> +  /* Try to match vpmov{wb,dw,qd}, although vinserti128 will be generated,
> +     it's very likely to be optimized off. So let's put the function here. */

Two spaces after full stop.

	Jakub
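For illustration, the two shapes of shuffle being compared above can be written as self-contained functions modelled on the pr101846-2.c and pr101846-3.c tests attached later in this thread (the function names here are illustrative, not the patch's). In the first only the low half of the result is live, so a lone vpmovdw covers it; in the second the upper half is kept unchanged, which is the pattern expand_vec_perm_trunc_vinsert matches with vpmovdw + vinserti64x4.

/* Illustrative only, distilled from pr101846-2.c / pr101846-3.c; compile
   with something like -O2 -mavx512bw -mavx512vl -mavx512dq.  */
typedef short v16hi __attribute__((vector_size (32)));
typedef short v32hi __attribute__((vector_size (64)));

/* Upper half of the source not needed in the result: the permutation is a
   pure narrowing selection of the even elements, i.e. a single vpmovdw.  */
v16hi
low_half_only (v32hi x)
{
  return __builtin_shufflevector (x, x,
				  0, 2, 4, 6, 8, 10, 12, 14,
				  16, 18, 20, 22, 24, 26, 28, 30);
}

/* Low half is the same truncation, upper half is the identity: this is
   what vpmovdw + vinserti64x4 implements in the patch.  */
v32hi
low_trunc_high_identity (v32hi x)
{
  return __builtin_shufflevector (x, x,
				  0, 2, 4, 6, 8, 10, 12, 14,
				  16, 18, 20, 22, 24, 26, 28, 30,
				  16, 17, 18, 19, 20, 21, 22, 23,
				  24, 25, 26, 27, 28, 29, 30, 31);
}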
On Thu, Aug 12, 2021 at 11:22:48AM +0200, Jakub Jelinek via Gcc-patches wrote:
> So, I wonder if your new routine shouldn't be instead done after
> in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases
> and handle the other vpmovdw etc. cases in combine splitters (see that we
> only use low half or quarter of the result and transform whatever
> permutation we've used into what we want).

E.g. in the first function, combine tries:

(set (reg:V16HI 85)
    (vec_select:V16HI (unspec:V32HI [
                (mem/u/c:V32HI (symbol_ref/u:DI ("*.LC0") [flags 0x2]) [0 S64 A512])
                (reg:V32HI 88) repeated x2
            ] UNSPEC_VPERMT2)
        (parallel [
                (const_int 0 [0])
                (const_int 1 [0x1])
                (const_int 2 [0x2])
                (const_int 3 [0x3])
                (const_int 4 [0x4])
                (const_int 5 [0x5])
                (const_int 6 [0x6])
                (const_int 7 [0x7])
                (const_int 8 [0x8])
                (const_int 9 [0x9])
                (const_int 10 [0xa])
                (const_int 11 [0xb])
                (const_int 12 [0xc])
                (const_int 13 [0xd])
                (const_int 14 [0xe])
                (const_int 15 [0xf])
            ])))

A combine splitter could run avoid_constant_pool_reference on the first
UNSPEC_VPERMT2 argument and check the permutation if it can be optimized,
ideally using some function call so that we wouldn't need too many splitters.

	Jakub
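To make the splitter idea concrete, below is a rough sketch of the kind of shared helper such splitters could call. The function name, its exact checks, and the assumption that both UNSPEC_VPERMT2 data operands are the same register (the "repeated x2" in the dump above) are assumptions for illustration, not anything from the posted patch.

/* Hypothetical helper for the combine splitters suggested above (name and
   details are a sketch only).  SEL is the selector operand of an
   UNSPEC_VPERMT2 whose two data operands are the same register and whose
   result is only consumed through a vec_select of its low NELT lanes;
   return true if those lanes just pick the even elements of the source,
   i.e. the whole permutation is a vpmov{wb,dw,qd}-style truncation.  */

static bool
vpermt2_low_lanes_are_trunc_p (rtx sel, unsigned int nelt)
{
  /* At combine time the selector is typically still a load from the
     constant pool; look through it.  */
  sel = avoid_constant_pool_reference (sel);
  if (GET_CODE (sel) != CONST_VECTOR)
    return false;

  /* Only the lanes surviving the outer vec_select matter.  With both data
     operands equal, an index selects element (index % (2 * NELT)) of the
     source, so lane I must map to source element 2 * I.  */
  for (unsigned int i = 0; i < nelt; i++)
    {
      rtx elt = CONST_VECTOR_ELT (sel, i);
      if (!CONST_INT_P (elt)
	  || (UINTVAL (elt) & (2 * nelt - 1)) != 2 * i)
	return false;
    }
  return true;
}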
On Thu, Aug 12, 2021 at 5:23 PM Jakub Jelinek <jakub@redhat.com> wrote: > > On Thu, Aug 12, 2021 at 01:43:23PM +0800, liuhongt wrote: > > Hi: > > This is another patch to optimize vec_perm_expr to match vpmov{dw,dq,wb} > > under AVX512. > > For scenarios(like pr101846-2.c) where the upper half is not used, this patch > > generates better code with only one vpmov{wb,dw,qd} instruction. For > > scenarios(like pr101846-3.c) where the upper half is actually used, if the src > > vector length is 256/512bits, the patch can still generate better code, but for > > 128bits, the code generation is worse. > > > > 128 bits upper half not used. > > > > - vpshufb .LC2(%rip), %xmm0, %xmm0 > > + vpmovdw %xmm0, %xmm0 > > > > 128 bits upper half used. > > - vpshufb .LC2(%rip), %xmm0, %xmm0 > > + vpmovdw %xmm0, %xmm1 > > + vmovq %xmm1, %rax > > + vpinsrq $0, %rax, %xmm0, %xmm0 > > > > Maybe expand_vec_perm_trunc_vinsert should only deal with 256/512bits of > > vectors, but considering the real use of scenarios like pr101846-3.c > > foo_*_128 possibility is relatively low, I still keep this part of the code. > > I actually am not sure if even > foo_dw_512: > .LFB0: > .cfi_startproc > - vmovdqa64 %zmm0, %zmm1 > - vmovdqa64 .LC0(%rip), %zmm0 > - vpermi2w %zmm1, %zmm1, %zmm0 > + vpmovdw %zmm0, %ymm1 > + vinserti64x4 $0x0, %ymm1, %zmm0, %zmm0 > ret > is always a win, the permutations we should care most about are in loops Yes, and vpmov{qd,dw} and vpermi{w,d} are both available under avx512f, so expand_vec_perm_trunc_vinsert will never be matched when it's placed in other 2 insn cases. The only part we need to handle is vpmovwb which is under avx512bw but vpermib require avx512vbmi. > and the constant load as well as the first move in that case likely go > away and it is one permutation insn vs. two. > Different case is e.g. > - vmovdqa64 .LC5(%rip), %zmm2 > - vmovdqa64 %zmm0, %zmm1 > - vmovdqa64 .LC0(%rip), %zmm0 > - vpermi2w %zmm1, %zmm1, %zmm2 > - vpermi2w %zmm1, %zmm1, %zmm0 > - vpshufb .LC6(%rip), %zmm0, %zmm0 > - vpshufb .LC7(%rip), %zmm2, %zmm1 > - vporq %zmm1, %zmm0, %zmm0 > + vpmovwb %zmm0, %ymm1 > + vinserti64x4 $0x0, %ymm1, %zmm0, %zmm0 > So, I wonder if your new routine shouldn't be instead done after > in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases > and handle the other vpmovdw etc. cases in combine splitters (see that we > only use low half or quarter of the result and transform whatever > permutation we've used into what we want). > Got it, i'll try that way. > And perhaps make the routine eventually more general, don't handle > just identity permutation in the upper half, but allow there other > permutations too (ones where that half can be represented by a single insn > permutation). > > > > Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. > > Ok for trunk? > > > > gcc/ChangeLog: > > > > PR target/101846 > > * config/i386/i386-expand.c (expand_vec_perm_trunc_vinsert): > > New function. > > (ix86_vectorize_vec_perm_const): Call > > expand_vec_perm_trunc_vinsert. > > * config/i386/sse.md (vec_set_lo_v32hi): New define_insn. > > (vec_set_lo_v64qi): Ditto. > > (vec_set_lo_<mode><mask_name>): Extend to no-avx512dq. > > > > gcc/testsuite/ChangeLog: > > > > PR target/101846 > > * gcc.target/i386/pr101846-2.c: New test. > > * gcc.target/i386/pr101846-3.c: New test. 
> > --- > > gcc/config/i386/i386-expand.c | 125 +++++++++++++++++++++ > > gcc/config/i386/sse.md | 60 +++++++++- > > gcc/testsuite/gcc.target/i386/pr101846-2.c | 81 +++++++++++++ > > gcc/testsuite/gcc.target/i386/pr101846-3.c | 95 ++++++++++++++++ > > 4 files changed, 359 insertions(+), 2 deletions(-) > > create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-2.c > > create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-3.c > > > > diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c > > index bd21efa9530..519caac2e15 100644 > > --- a/gcc/config/i386/i386-expand.c > > +++ b/gcc/config/i386/i386-expand.c > > @@ -18317,6 +18317,126 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) > > return false; > > } > > > > +/* A subroutine of ix86_expand_vec_perm_const_1. Try to implement D > > + in terms of a pair of vpmovdw + vinserti128 instructions. */ > > +static bool > > +expand_vec_perm_trunc_vinsert (struct expand_vec_perm_d *d) > > +{ > > + unsigned i, nelt = d->nelt, mask = d->nelt - 1; > > + unsigned half = nelt / 2; > > + machine_mode half_mode, trunc_mode; > > + > > + /* vpmov{wb,dw,qd} only available under AVX512. */ > > + if (!d->one_operand_p || !TARGET_AVX512F > > + || (!TARGET_AVX512VL && GET_MODE_SIZE (d->vmode) < 64) > > Too many spaces. > > + || GET_MODE_SIZE (GET_MODE_INNER (d->vmode)) > 4) > > + return false; > > + > > + /* TARGET_AVX512BW is needed for vpmovwb. */ > > + if (GET_MODE_INNER (d->vmode) == E_QImode && !TARGET_AVX512BW) > > + return false; > > + > > + for (i = 0; i < nelt; i++) > > + { > > + unsigned idx = d->perm[i] & mask; > > + if (idx != i * 2 && i < half) > > + return false; > > + if (idx != i && i >= half) > > + return false; > > + } > > + > > + rtx (*gen_trunc) (rtx, rtx) = NULL; > > + rtx (*gen_vec_set_lo) (rtx, rtx, rtx) = NULL; > > + switch (d->vmode) > > + { > > + case E_V16QImode: > > + gen_trunc = gen_truncv8hiv8qi2; > > + gen_vec_set_lo = gen_vec_setv2di; > > + half_mode = V8QImode; > > + trunc_mode = V8HImode; > > + break; > > + case E_V32QImode: > > + gen_trunc = gen_truncv16hiv16qi2; > > + gen_vec_set_lo = gen_vec_set_lo_v32qi; > > + half_mode = V16QImode; > > + trunc_mode = V16HImode; > > + break; > > + case E_V64QImode: > > + gen_trunc = gen_truncv32hiv32qi2; > > + gen_vec_set_lo = gen_vec_set_lo_v64qi; > > + half_mode = V32QImode; > > + trunc_mode = V32HImode; > > + break; > > + case E_V8HImode: > > + gen_trunc = gen_truncv4siv4hi2; > > + gen_vec_set_lo = gen_vec_setv2di; > > + half_mode = V4HImode; > > + trunc_mode = V4SImode; > > + break; > > + case E_V16HImode: > > + gen_trunc = gen_truncv8siv8hi2; > > + gen_vec_set_lo = gen_vec_set_lo_v16hi; > > + half_mode = V8HImode; > > + trunc_mode = V8SImode; > > + break; > > + case E_V32HImode: > > + gen_trunc = gen_truncv16siv16hi2; > > + gen_vec_set_lo = gen_vec_set_lo_v32hi; > > + half_mode = V16HImode; > > + trunc_mode = V16SImode; > > + break; > > + case E_V4SImode: > > + gen_trunc = gen_truncv2div2si2; > > + gen_vec_set_lo = gen_vec_setv2di; > > + half_mode = V2SImode; > > + trunc_mode = V2DImode; > > + break; > > + case E_V8SImode: > > + gen_trunc = gen_truncv4div4si2; > > + gen_vec_set_lo = gen_vec_set_lo_v8si; > > + half_mode = V4SImode; > > + trunc_mode = V4DImode; > > + break; > > + case E_V16SImode: > > + gen_trunc = gen_truncv8div8si2; > > + gen_vec_set_lo = gen_vec_set_lo_v16si; > > + half_mode = V8SImode; > > + trunc_mode = V8DImode; > > + break; > > + > > + default: > > + break; > > + } > > + > > + if (gen_trunc == NULL) > > + return false; > > Isn't it 
simpler to return false; in the default: case? > > @@ -21028,6 +21148,11 @@ ix86_vectorize_vec_perm_const (machine_mode vmode, rtx target, rtx op0, > > d.op0 = nop0; > > d.op1 = force_reg (vmode, d.op1); > > > > + /* Try to match vpmov{wb,dw,qd}, although vinserti128 will be generated, > > + it's very likely to be optimized off. So let's put the function here. */ > > Two spaces after full stop. > > Jakub >
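To put Hongtao's ISA point in source terms: for byte elements the single-instruction alternative, vpermi2b, requires AVX512VBMI, while the truncation itself, vpmovwb, only requires AVX512BW, so a target built with -mavx512bw but without -mavx512vbmi is exactly where the two-insn vpmovwb + vinserti64x4 form still pays off. A reduced example follows, copied from the foo_wb_512 test in pr101846-3.c below; the flag combination named in the comment is an illustration, not the test's own dg-options.

/* With e.g. -O2 -mavx512bw and without -mavx512vbmi there is no single
   vpermi2b covering this, but the low half is a vpmovwb-style truncation
   and the high half is the identity, so vpmovwb + vinserti64x4 applies.  */
typedef char v64qi __attribute__((vector_size (64)));

v64qi
wb_low_trunc_high_identity (v64qi x)
{
  return __builtin_shufflevector (x, x,
				  0, 2, 4, 6, 8, 10, 12, 14,
				  16, 18, 20, 22, 24, 26, 28, 30,
				  32, 34, 36, 38, 40, 42, 44, 46,
				  48, 50, 52, 54, 56, 58, 60, 62,
				  32, 33, 34, 35, 36, 37, 38, 39,
				  40, 41, 42, 43, 44, 45, 46, 47,
				  48, 49, 50, 51, 52, 53, 54, 55,
				  56, 57, 58, 59, 60, 61, 62, 63);
}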
On Fri, Aug 13, 2021 at 09:42:00AM +0800, Hongtao Liu wrote: > > So, I wonder if your new routine shouldn't be instead done after > > in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases > > and handle the other vpmovdw etc. cases in combine splitters (see that we > > only use low half or quarter of the result and transform whatever > > permutation we've used into what we want). > > > Got it, i'll try that way. Note, IMHO the ultimate fix would be to add real support for the __builtin_shufflevector -1 indices (meaning I don't care what will be in that element, perhaps narrowed down to an implementation choice of any element of the input vector(s) or 0). As VEC_PERM_EXPR is currently used for both perms by variable permutation vector and constant, I think we'd need to introduce VEC_PERM_CONST_EXPR, which would be exactly like VEC_PERM_EXPR, except that the last operand would be required to be a VECTOR_CST and that all ones element would mean something different, the I don't care behavior. The GIMPLE side would be fairly easy, except that there should be some optimizations eventually, like when only certain subset of elements of a vector are used later, we can mark the other elements as don't care. The hard part would be backend expansion, especially x86. I guess we could easily canonicalize VEC_PERM_EXPR with constant permutations into VEC_PERM_CONST_EXPR by replacing all ones elements with elements modulo the number of elements (or twice that for 2 operand perms), but then in all the routines that recognize something we'd need to special case the unknown elements to match anything during testing and for expansion replace it by something that would match. That is again a lot of work, but not extremely hard. The hardest would be to deal with the expand_vec_perm_1 handling many cases by trying to recog an instruction. Either we'd need to represent the unknown case by a magic CONST_INT_WILDCARD or CONST_INT_RANGE that recog with the help of the patterns would replace by some CONST_INT that matches it, but note we have all those const_0_to_N_operand and in conditions INTVAL (operands[3]) & 1 == 0 and INTVAL (operands[3]) + 1 == INTVAL (operands[4]) etc., or we'd need to either manually or semi-automatically build some code that would try to guess right values for unknown before trying to recog it. Jakub
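For readers unfamiliar with the -1 form: in clang's __builtin_shufflevector, on which GCC's builtin is modelled, an index of -1 marks a result lane whose value the caller does not care about. Jakub's point is that GIMPLE currently has no way to carry that freedom, since a VEC_PERM_EXPR selector must name a concrete element for every lane, so the latitude expressed by -1 is lost before the x86 backend could exploit it. A sketch of such a source-level shuffle (the function name is illustrative):

/* Don't-care upper lanes, written with clang-style -1 indices.  If the
   "don't care" could survive down to the backend, this whole shuffle is
   just a vpmovdw, with whatever happens to remain in the high lanes.  */
typedef short v32hi __attribute__((vector_size (64)));

v32hi
trunc_low_dont_care_high (v32hi x)
{
  return __builtin_shufflevector (x, x,
				  0, 2, 4, 6, 8, 10, 12, 14,
				  16, 18, 20, 22, 24, 26, 28, 30,
				  -1, -1, -1, -1, -1, -1, -1, -1,
				  -1, -1, -1, -1, -1, -1, -1, -1);
}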
Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > On Fri, Aug 13, 2021 at 09:42:00AM +0800, Hongtao Liu wrote: >> > So, I wonder if your new routine shouldn't be instead done after >> > in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases >> > and handle the other vpmovdw etc. cases in combine splitters (see that we >> > only use low half or quarter of the result and transform whatever >> > permutation we've used into what we want). >> > >> Got it, i'll try that way. > > Note, IMHO the ultimate fix would be to add real support for the > __builtin_shufflevector -1 indices (meaning I don't care what will be in > that element, perhaps narrowed down to an implementation choice of > any element of the input vector(s) or 0). > As VEC_PERM_EXPR is currently used for both perms by variable permutation > vector and constant, I think we'd need to introduce VEC_PERM_CONST_EXPR, > which would be exactly like VEC_PERM_EXPR, except that the last operand > would be required to be a VECTOR_CST and that all ones element would mean > something different, the I don't care behavior. > The GIMPLE side would be fairly easy, except that there should be some > optimizations eventually, like when only certain subset of elements of > a vector are used later, we can mark the other elements as don't care. Another alternative I'd wondered about was keeping a single tree code, but adding an extra operand with a “care/don't care” mask. I think that would fit with variable-length vectors better. Thanks, Richard
On Fri, Aug 13, 2021 at 11:04 AM Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > Jakub Jelinek via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > On Fri, Aug 13, 2021 at 09:42:00AM +0800, Hongtao Liu wrote: > >> > So, I wonder if your new routine shouldn't be instead done after > >> > in ix86_expand_vec_perm_const_1 after vec_perm_1 among other 2 insn cases > >> > and handle the other vpmovdw etc. cases in combine splitters (see that we > >> > only use low half or quarter of the result and transform whatever > >> > permutation we've used into what we want). > >> > > >> Got it, i'll try that way. > > > > Note, IMHO the ultimate fix would be to add real support for the > > __builtin_shufflevector -1 indices (meaning I don't care what will be in > > that element, perhaps narrowed down to an implementation choice of > > any element of the input vector(s) or 0). > > As VEC_PERM_EXPR is currently used for both perms by variable permutation > > vector and constant, I think we'd need to introduce VEC_PERM_CONST_EXPR, > > which would be exactly like VEC_PERM_EXPR, except that the last operand > > would be required to be a VECTOR_CST and that all ones element would mean > > something different, the I don't care behavior. > > The GIMPLE side would be fairly easy, except that there should be some > > optimizations eventually, like when only certain subset of elements of > > a vector are used later, we can mark the other elements as don't care. > > Another alternative I'd wondered about was keeping a single tree code, > but adding an extra operand with a “care/don't care” mask. I think > that would fit with variable-length vectors better. That sounds reasonable. Note I avoided "quaternary" ops for BIT_INSERT_EXPR but I don't see a good way to avoid that for such VEC_PERM_EXPR extension. Richard. > Thanks, > Richard
diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c index bd21efa9530..519caac2e15 100644 --- a/gcc/config/i386/i386-expand.c +++ b/gcc/config/i386/i386-expand.c @@ -18317,6 +18317,126 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d) return false; } +/* A subroutine of ix86_expand_vec_perm_const_1. Try to implement D + in terms of a pair of vpmovdw + vinserti128 instructions. */ +static bool +expand_vec_perm_trunc_vinsert (struct expand_vec_perm_d *d) +{ + unsigned i, nelt = d->nelt, mask = d->nelt - 1; + unsigned half = nelt / 2; + machine_mode half_mode, trunc_mode; + + /* vpmov{wb,dw,qd} only available under AVX512. */ + if (!d->one_operand_p || !TARGET_AVX512F + || (!TARGET_AVX512VL && GET_MODE_SIZE (d->vmode) < 64) + || GET_MODE_SIZE (GET_MODE_INNER (d->vmode)) > 4) + return false; + + /* TARGET_AVX512BW is needed for vpmovwb. */ + if (GET_MODE_INNER (d->vmode) == E_QImode && !TARGET_AVX512BW) + return false; + + for (i = 0; i < nelt; i++) + { + unsigned idx = d->perm[i] & mask; + if (idx != i * 2 && i < half) + return false; + if (idx != i && i >= half) + return false; + } + + rtx (*gen_trunc) (rtx, rtx) = NULL; + rtx (*gen_vec_set_lo) (rtx, rtx, rtx) = NULL; + switch (d->vmode) + { + case E_V16QImode: + gen_trunc = gen_truncv8hiv8qi2; + gen_vec_set_lo = gen_vec_setv2di; + half_mode = V8QImode; + trunc_mode = V8HImode; + break; + case E_V32QImode: + gen_trunc = gen_truncv16hiv16qi2; + gen_vec_set_lo = gen_vec_set_lo_v32qi; + half_mode = V16QImode; + trunc_mode = V16HImode; + break; + case E_V64QImode: + gen_trunc = gen_truncv32hiv32qi2; + gen_vec_set_lo = gen_vec_set_lo_v64qi; + half_mode = V32QImode; + trunc_mode = V32HImode; + break; + case E_V8HImode: + gen_trunc = gen_truncv4siv4hi2; + gen_vec_set_lo = gen_vec_setv2di; + half_mode = V4HImode; + trunc_mode = V4SImode; + break; + case E_V16HImode: + gen_trunc = gen_truncv8siv8hi2; + gen_vec_set_lo = gen_vec_set_lo_v16hi; + half_mode = V8HImode; + trunc_mode = V8SImode; + break; + case E_V32HImode: + gen_trunc = gen_truncv16siv16hi2; + gen_vec_set_lo = gen_vec_set_lo_v32hi; + half_mode = V16HImode; + trunc_mode = V16SImode; + break; + case E_V4SImode: + gen_trunc = gen_truncv2div2si2; + gen_vec_set_lo = gen_vec_setv2di; + half_mode = V2SImode; + trunc_mode = V2DImode; + break; + case E_V8SImode: + gen_trunc = gen_truncv4div4si2; + gen_vec_set_lo = gen_vec_set_lo_v8si; + half_mode = V4SImode; + trunc_mode = V4DImode; + break; + case E_V16SImode: + gen_trunc = gen_truncv8div8si2; + gen_vec_set_lo = gen_vec_set_lo_v16si; + half_mode = V8SImode; + trunc_mode = V8DImode; + break; + + default: + break; + } + + if (gen_trunc == NULL) + return false; + + rtx op_half = gen_reg_rtx (half_mode); + rtx op_trunc = d->op0; + if (d->vmode != trunc_mode) + op_trunc = lowpart_subreg (trunc_mode, op_trunc, d->vmode); + emit_insn (gen_trunc (op_half, op_trunc)); + + if (gen_vec_set_lo == gen_vec_setv2di) + { + op_half = lowpart_subreg (DImode, op_half, half_mode); + rtx op_dest = lowpart_subreg (V2DImode, d->op0, d->vmode); + + /* vec_set<mode> require register_operand. */ + if (MEM_P (op_dest)) + op_dest = force_reg (V2DImode, op_dest); + if (MEM_P (op_half)) + op_half = force_reg (DImode, op_half); + + emit_insn (gen_vec_set_lo (op_dest, op_half, GEN_INT(0))); + op_dest = lowpart_subreg (d->vmode, op_dest, V2DImode); + emit_move_insn (d->target, op_dest); + } + else + emit_insn (gen_vec_set_lo (d->target, d->op0, op_half)); + return true; +} + /* A subroutine of ix86_expand_vec_perm_const_1. 
Try to implement D in terms of a pair of pshuflw + pshufhw instructions. */ @@ -21028,6 +21148,11 @@ ix86_vectorize_vec_perm_const (machine_mode vmode, rtx target, rtx op0, d.op0 = nop0; d.op1 = force_reg (vmode, d.op1); + /* Try to match vpmov{wb,dw,qd}, although vinserti128 will be generated, + it's very likely to be optimized off. So let's put the function here. */ + if (expand_vec_perm_trunc_vinsert (&d)) + return true; + if (ix86_expand_vec_perm_const_1 (&d)) return true; diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md index f631756c829..87e22332c83 100644 --- a/gcc/config/i386/sse.md +++ b/gcc/config/i386/sse.md @@ -15162,8 +15162,12 @@ (define_insn "vec_set_lo_<mode><mask_name>" (const_int 10) (const_int 11) (const_int 12) (const_int 13) (const_int 14) (const_int 15)]))))] - "TARGET_AVX512DQ" - "vinsert<shuffletype>32x8\t{$0x0, %2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2, 0x0}" + "TARGET_AVX512F && <mask_avx512dq_condition>" +{ + if (TARGET_AVX512DQ) + return "vinsert<shuffletype>32x8\t{$0x0, %2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2, 0x0}"; + return "vinsert<shuffletype>64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}"; +} [(set_attr "type" "sselog") (set_attr "length_immediate" "1") (set_attr "prefix" "evex") @@ -22806,6 +22810,28 @@ (define_insn "vec_set_hi_v16hi" (set_attr "prefix" "vex,evex") (set_attr "mode" "OI")]) +(define_insn "vec_set_lo_v32hi" + [(set (match_operand:V32HI 0 "register_operand" "=v") + (vec_concat:V32HI + (match_operand:V16HI 2 "nonimmediate_operand" "vm") + (vec_select:V16HI + (match_operand:V32HI 1 "register_operand" "v") + (parallel [(const_int 16) (const_int 17) + (const_int 18) (const_int 19) + (const_int 20) (const_int 21) + (const_int 22) (const_int 23) + (const_int 24) (const_int 25) + (const_int 26) (const_int 27) + (const_int 28) (const_int 29) + (const_int 30) (const_int 31)]))))] + "TARGET_AVX512F" + "vinserti64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}" + [(set_attr "type" "sselog") + (set_attr "prefix_extra" "1") + (set_attr "length_immediate" "1") + (set_attr "prefix" "evex") + (set_attr "mode" "XI")]) + (define_insn "vec_set_lo_v32qi" [(set (match_operand:V32QI 0 "register_operand" "=x,v") (vec_concat:V32QI @@ -22854,6 +22880,36 @@ (define_insn "vec_set_hi_v32qi" (set_attr "prefix" "vex,evex") (set_attr "mode" "OI")]) +(define_insn "vec_set_lo_v64qi" + [(set (match_operand:V64QI 0 "register_operand" "=v") + (vec_concat:V64QI + (match_operand:V32QI 2 "nonimmediate_operand" "vm") + (vec_select:V32QI + (match_operand:V64QI 1 "register_operand" "v") + (parallel [(const_int 32) (const_int 33) + (const_int 34) (const_int 35) + (const_int 36) (const_int 37) + (const_int 38) (const_int 39) + (const_int 40) (const_int 41) + (const_int 42) (const_int 43) + (const_int 44) (const_int 45) + (const_int 46) (const_int 47) + (const_int 48) (const_int 49) + (const_int 50) (const_int 51) + (const_int 52) (const_int 53) + (const_int 54) (const_int 55) + (const_int 56) (const_int 57) + (const_int 58) (const_int 59) + (const_int 60) (const_int 61) + (const_int 62) (const_int 63)]))))] + "TARGET_AVX512F" + "vinserti64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}" + [(set_attr "type" "sselog") + (set_attr "prefix_extra" "1") + (set_attr "length_immediate" "1") + (set_attr "prefix" "evex") + (set_attr "mode" "XI")]) + (define_insn "<avx_avx2>_maskload<ssemodesuffix><avxsizesuffix>" [(set (match_operand:V48_AVX2 0 "register_operand" "=x") (unspec:V48_AVX2 diff --git a/gcc/testsuite/gcc.target/i386/pr101846-2.c b/gcc/testsuite/gcc.target/i386/pr101846-2.c 
new file mode 100644 index 00000000000..af4ae8ccdd6 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr101846-2.c @@ -0,0 +1,81 @@ +/* { dg-do compile } */ +/* { dg-options "-mavx512bw -mavx512vl -mavx512dq -O2" } */ +/* { dg-final { scan-assembler-times "vpmovwb" "3" } } */ +/* { dg-final { scan-assembler-times "vpmovdw" "3" } } */ +/* { dg-final { scan-assembler-times "vpmovqd" "3" } } */ + +typedef short v4hi __attribute__((vector_size (8))); +typedef short v8hi __attribute__((vector_size (16))); +typedef short v16hi __attribute__((vector_size (32))); +typedef short v32hi __attribute__((vector_size (64))); +typedef char v8qi __attribute__((vector_size (8))); +typedef char v16qi __attribute__((vector_size (16))); +typedef char v32qi __attribute__((vector_size (32))); +typedef char v64qi __attribute__((vector_size (64))); +typedef int v2si __attribute__((vector_size (8))); +typedef int v4si __attribute__((vector_size (16))); +typedef int v8si __attribute__((vector_size (32))); +typedef int v16si __attribute__((vector_size (64))); + +v16hi +foo_dw_512 (v32hi x) +{ + return __builtin_shufflevector (x, x, + 0, 2, 4, 6, 8, 10, 12, 14, + 16, 18, 20, 22, 24, 26, 28, 30); +} + +v8hi +foo_dw_256 (v16hi x) +{ + return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14); +} + +v4hi +foo_dw_128 (v8hi x) +{ + return __builtin_shufflevector (x, x, 0, 2, 4, 6); +} + +v8si +foo_qd_512 (v16si x) +{ + return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14); +} + +v4si +foo_qd_256 (v8si x) +{ + return __builtin_shufflevector (x, x, 0, 2, 4, 6); +} + +v2si +foo_qd_128 (v4si x) +{ + return __builtin_shufflevector (x, x, 0, 2); +} + +v32qi +foo_wb_512 (v64qi x) +{ + return __builtin_shufflevector (x, x, + 0, 2, 4, 6, 8, 10, 12, 14, + 16, 18, 20, 22, 24, 26, 28, 30, + 32, 34, 36, 38, 40, 42, 44, 46, + 48, 50, 52, 54, 56, 58, 60, 62); +} + +v16qi +foo_wb_256 (v32qi x) +{ + return __builtin_shufflevector (x, x, + 0, 2, 4, 6, 8, 10, 12, 14, + 16, 18, 20, 22, 24, 26, 28, 30); +} + +v8qi +foo_wb_128 (v16qi x) +{ + return __builtin_shufflevector (x, x, + 0, 2, 4, 6, 8, 10, 12, 14); +} diff --git a/gcc/testsuite/gcc.target/i386/pr101846-3.c b/gcc/testsuite/gcc.target/i386/pr101846-3.c new file mode 100644 index 00000000000..380b1220327 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/pr101846-3.c @@ -0,0 +1,95 @@ +/* { dg-do compile } */ +/* { dg-options "-mavx512bw -mavx512vl -mavx512dq -O2" } */ +/* { dg-final { scan-assembler-times "vpmovwb" "3" } } */ +/* { dg-final { scan-assembler-times "vpmovdw" "3" } } */ +/* { dg-final { scan-assembler-times "vpmovqd" "3" } } */ + +typedef short v4hi __attribute__((vector_size (8))); +typedef short v8hi __attribute__((vector_size (16))); +typedef short v16hi __attribute__((vector_size (32))); +typedef short v32hi __attribute__((vector_size (64))); +typedef char v8qi __attribute__((vector_size (8))); +typedef char v16qi __attribute__((vector_size (16))); +typedef char v32qi __attribute__((vector_size (32))); +typedef char v64qi __attribute__((vector_size (64))); +typedef int v2si __attribute__((vector_size (8))); +typedef int v4si __attribute__((vector_size (16))); +typedef int v8si __attribute__((vector_size (32))); +typedef int v16si __attribute__((vector_size (64))); + +v32hi +foo_dw_512 (v32hi x) +{ + return __builtin_shufflevector (x, x, + 0, 2, 4, 6, 8, 10, 12, 14, + 16, 18, 20, 22, 24, 26, 28, 30, + 16, 17, 18, 19, 20, 21, 22, 23, + 24, 25, 26, 27, 28, 29, 30, 31); +} + +v16hi +foo_dw_256 (v16hi x) +{ + return __builtin_shufflevector (x, x, + 0, 2, 4, 
6, 8, 10, 12, 14, + 8, 9, 10, 11, 12, 13, 14, 15); +} + +v8hi +foo_dw_128 (v8hi x) +{ + return __builtin_shufflevector (x, x, 0, 2, 4, 6, 4, 5, 6, 7); +} + +v16si +foo_qd_512 (v16si x) +{ + return __builtin_shufflevector (x, x, 0, + 2, 4, 6, 8, 10, 12, 14, + 8, 9, 10, 11, 12, 13, 14, 15); +} + +v8si +foo_qd_256 (v8si x) +{ + return __builtin_shufflevector (x, x, 0, 2, 4, 6, 4, 5, 6, 7); +} + +v4si +foo_qd_128 (v4si x) +{ + return __builtin_shufflevector (x, x, 0, 2, 2, 3); +} + +v64qi +foo_wb_512 (v64qi x) +{ + return __builtin_shufflevector (x, x, + 0, 2, 4, 6, 8, 10, 12, 14, + 16, 18, 20, 22, 24, 26, 28, 30, + 32, 34, 36, 38, 40, 42, 44, 46, + 48, 50, 52, 54, 56, 58, 60, 62, + 32, 33, 34, 35, 36, 37, 38, 39, + 40, 41, 42, 43, 44, 45, 46, 47, + 48, 49, 50, 51, 52, 53, 54, 55, + 56, 57, 58, 59, 60, 61, 62, 63); +} + +v32qi +foo_wb_256 (v32qi x) +{ + return __builtin_shufflevector (x, x, + 0, 2, 4, 6, 8, 10, 12, 14, + 16, 18, 20, 22, 24, 26, 28, 30, + 16, 17, 18, 19, 20, 21, 22, 23, + 24, 25, 26, 27, 28, 29, 30, 31); +} + +v16qi +foo_wb_128 (v16qi x) +{ + return __builtin_shufflevector (x, x, + 0, 2, 4, 6, 8, 10, 12, 14, + 8, 9, 10, 11, 12, 13, 14, 15); +} +