From patchwork Thu Aug 12 05:43:23 2021
X-Patchwork-Submitter: liuhongt
X-Patchwork-Id: 1516106
To: gcc-patches@gcc.gnu.org
Subject: [PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.
Date: Thu, 12 Aug 2021 13:43:23 +0800
Message-Id: <20210812054323.897480-1-hongtao.liu@intel.com>
X-Mailer: git-send-email 2.27.0
From: liuhongt
Cc: jakub@redhat.com

Hi:
This is another patch to optimize vec_perm_expr to match vpmov{dw,qd,wb}
under AVX512.
For scenarios (like pr101846-2.c) where the upper half is not used, this
patch generates better code, with only one vpmov{wb,dw,qd} instruction.
For scenarios (like pr101846-3.c) where the upper half is actually used,
the patch can still generate better code when the source vector is 256
or 512 bits wide, but for 128 bits the code generation is worse.

128 bits, upper half not used:

-	vpshufb	.LC2(%rip), %xmm0, %xmm0
+	vpmovdw	%xmm0, %xmm0

128 bits, upper half used:

-	vpshufb	.LC2(%rip), %xmm0, %xmm0
+	vpmovdw	%xmm0, %xmm1
+	vmovq	%xmm1, %rax
+	vpinsrq	$0, %rax, %xmm0, %xmm0

Maybe expand_vec_perm_trunc_vinsert should only deal with 256/512-bit
vectors, but since real-world uses of scenarios like the foo_*_128
functions in pr101846-3.c are relatively unlikely, I have kept this part
of the code.

Bootstrapped and regtested on x86_64-linux-gnu{-m32,}.
Ok for trunk?

gcc/ChangeLog:

	PR target/101846
	* config/i386/i386-expand.c (expand_vec_perm_trunc_vinsert):
	New function.
	(ix86_vectorize_vec_perm_const): Call
	expand_vec_perm_trunc_vinsert.
	* config/i386/sse.md (vec_set_lo_v32hi): New define_insn.
	(vec_set_lo_v64qi): Ditto.
	(vec_set_lo_<mode><mask_name>): Extend to no-avx512dq.

gcc/testsuite/ChangeLog:

	PR target/101846
	* gcc.target/i386/pr101846-2.c: New test.
	* gcc.target/i386/pr101846-3.c: New test.
---
 gcc/config/i386/i386-expand.c              | 125 +++++++++++++++++++++
 gcc/config/i386/sse.md                     |  60 +++++++++-
 gcc/testsuite/gcc.target/i386/pr101846-2.c |  81 +++++++++++++
 gcc/testsuite/gcc.target/i386/pr101846-3.c |  95 ++++++++++++++++
 4 files changed, 359 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/pr101846-3.c

diff --git a/gcc/config/i386/i386-expand.c b/gcc/config/i386/i386-expand.c
index bd21efa9530..519caac2e15 100644
--- a/gcc/config/i386/i386-expand.c
+++ b/gcc/config/i386/i386-expand.c
@@ -18317,6 +18317,126 @@ expand_vec_perm_1 (struct expand_vec_perm_d *d)
   return false;
 }
 
+/* A subroutine of ix86_expand_vec_perm_const_1.  Try to implement D
+   in terms of a pair of vpmov{wb,dw,qd} + vec_set_lo instructions.  */
+static bool
+expand_vec_perm_trunc_vinsert (struct expand_vec_perm_d *d)
+{
+  unsigned i, nelt = d->nelt, mask = d->nelt - 1;
+  unsigned half = nelt / 2;
+  machine_mode half_mode, trunc_mode;
+
+  /* vpmov{wb,dw,qd} are only available under AVX512.  */
+  if (!d->one_operand_p || !TARGET_AVX512F
+      || (!TARGET_AVX512VL && GET_MODE_SIZE (d->vmode) < 64)
+      || GET_MODE_SIZE (GET_MODE_INNER (d->vmode)) > 4)
+    return false;
+
+  /* TARGET_AVX512BW is needed for vpmovwb.  */
+  if (GET_MODE_INNER (d->vmode) == E_QImode && !TARGET_AVX512BW)
+    return false;
+
+  for (i = 0; i < nelt; i++)
+    {
+      unsigned idx = d->perm[i] & mask;
+      if (idx != i * 2 && i < half)
+	return false;
+      if (idx != i && i >= half)
+	return false;
+    }
+
+  rtx (*gen_trunc) (rtx, rtx) = NULL;
+  rtx (*gen_vec_set_lo) (rtx, rtx, rtx) = NULL;
+  switch (d->vmode)
+    {
+    case E_V16QImode:
+      gen_trunc = gen_truncv8hiv8qi2;
+      gen_vec_set_lo = gen_vec_setv2di;
+      half_mode = V8QImode;
+      trunc_mode = V8HImode;
+      break;
+    case E_V32QImode:
+      gen_trunc = gen_truncv16hiv16qi2;
+      gen_vec_set_lo = gen_vec_set_lo_v32qi;
+      half_mode = V16QImode;
+      trunc_mode = V16HImode;
+      break;
+    case E_V64QImode:
+      gen_trunc = gen_truncv32hiv32qi2;
+      gen_vec_set_lo = gen_vec_set_lo_v64qi;
+      half_mode = V32QImode;
+      trunc_mode = V32HImode;
+      break;
+    case E_V8HImode:
+      gen_trunc = gen_truncv4siv4hi2;
+      gen_vec_set_lo = gen_vec_setv2di;
+      half_mode = V4HImode;
+      trunc_mode = V4SImode;
+      break;
+    case E_V16HImode:
+      gen_trunc = gen_truncv8siv8hi2;
+      gen_vec_set_lo = gen_vec_set_lo_v16hi;
+      half_mode = V8HImode;
+      trunc_mode = V8SImode;
+      break;
+    case E_V32HImode:
+      gen_trunc = gen_truncv16siv16hi2;
+      gen_vec_set_lo = gen_vec_set_lo_v32hi;
+      half_mode = V16HImode;
+      trunc_mode = V16SImode;
+      break;
+    case E_V4SImode:
+      gen_trunc = gen_truncv2div2si2;
+      gen_vec_set_lo = gen_vec_setv2di;
+      half_mode = V2SImode;
+      trunc_mode = V2DImode;
+      break;
+    case E_V8SImode:
+      gen_trunc = gen_truncv4div4si2;
+      gen_vec_set_lo = gen_vec_set_lo_v8si;
+      half_mode = V4SImode;
+      trunc_mode = V4DImode;
+      break;
+    case E_V16SImode:
+      gen_trunc = gen_truncv8div8si2;
+      gen_vec_set_lo = gen_vec_set_lo_v16si;
+      half_mode = V8SImode;
+      trunc_mode = V8DImode;
+      break;
+
+    default:
+      break;
+    }
+
+  if (gen_trunc == NULL)
+    return false;
+
+  rtx op_half = gen_reg_rtx (half_mode);
+  rtx op_trunc = d->op0;
+  if (d->vmode != trunc_mode)
+    op_trunc = lowpart_subreg (trunc_mode, op_trunc, d->vmode);
+  emit_insn (gen_trunc (op_half, op_trunc));
+
+  if (gen_vec_set_lo == gen_vec_setv2di)
+    {
+      op_half = lowpart_subreg (DImode, op_half, half_mode);
+      rtx op_dest = lowpart_subreg (V2DImode, d->op0, d->vmode);
+
+      /* vec_set requires a register_operand.  */
+      if (MEM_P (op_dest))
+	op_dest = force_reg (V2DImode, op_dest);
+      if (MEM_P (op_half))
+	op_half = force_reg (DImode, op_half);
+
+      emit_insn (gen_vec_set_lo (op_dest, op_half, GEN_INT (0)));
+      op_dest = lowpart_subreg (d->vmode, op_dest, V2DImode);
+      emit_move_insn (d->target, op_dest);
+    }
+  else
+    emit_insn (gen_vec_set_lo (d->target, d->op0, op_half));
+  return true;
+}
+
 /* A subroutine of ix86_expand_vec_perm_const_1.  Try to implement D
    in terms of a pair of pshuflw + pshufhw instructions.  */
 
@@ -21028,6 +21148,11 @@ ix86_vectorize_vec_perm_const (machine_mode vmode, rtx target, rtx op0,
       d.op0 = nop0;
       d.op1 = force_reg (vmode, d.op1);
 
+  /* Try to match vpmov{wb,dw,qd}.  Although a vinserti128 will also be
+     generated, it is very likely to be optimized away, so call the
+     function here, before the generic expansion.  */
+  if (expand_vec_perm_trunc_vinsert (&d))
+    return true;
+
   if (ix86_expand_vec_perm_const_1 (&d))
     return true;
 
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index f631756c829..87e22332c83 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -15162,8 +15162,12 @@ (define_insn "vec_set_lo_<mode><mask_name>"
 		     (const_int 10) (const_int 11)
 		     (const_int 12) (const_int 13)
 		     (const_int 14) (const_int 15)]))))]
-  "TARGET_AVX512DQ"
-  "vinsert<shuffletype>32x8\t{$0x0, %2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2, 0x0}"
+  "TARGET_AVX512F && <mask_avx512dq_condition>"
+{
+  if (TARGET_AVX512DQ)
+    return "vinsert<shuffletype>32x8\t{$0x0, %2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2, 0x0}";
+  return "vinsert<shuffletype>64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}";
+}
   [(set_attr "type" "sselog")
    (set_attr "length_immediate" "1")
    (set_attr "prefix" "evex")
@@ -22806,6 +22810,28 @@ (define_insn "vec_set_hi_v16hi"
    (set_attr "prefix" "vex,evex")
    (set_attr "mode" "OI")])
 
+(define_insn "vec_set_lo_v32hi"
+  [(set (match_operand:V32HI 0 "register_operand" "=v")
+	(vec_concat:V32HI
+	  (match_operand:V16HI 2 "nonimmediate_operand" "vm")
+	  (vec_select:V16HI
+	    (match_operand:V32HI 1 "register_operand" "v")
+	    (parallel [(const_int 16) (const_int 17)
+		       (const_int 18) (const_int 19)
+		       (const_int 20) (const_int 21)
+		       (const_int 22) (const_int 23)
+		       (const_int 24) (const_int 25)
+		       (const_int 26) (const_int 27)
+		       (const_int 28) (const_int 29)
+		       (const_int 30) (const_int 31)]))))]
+  "TARGET_AVX512F"
+  "vinserti64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}"
+  [(set_attr "type" "sselog")
+   (set_attr "prefix_extra" "1")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "XI")])
+
 (define_insn "vec_set_lo_v32qi"
   [(set (match_operand:V32QI 0 "register_operand" "=x,v")
 	(vec_concat:V32QI
@@ -22854,6 +22880,36 @@ (define_insn "vec_set_hi_v32qi"
    (set_attr "prefix" "vex,evex")
    (set_attr "mode" "OI")])
 
+(define_insn "vec_set_lo_v64qi"
+  [(set (match_operand:V64QI 0 "register_operand" "=v")
+	(vec_concat:V64QI
+	  (match_operand:V32QI 2 "nonimmediate_operand" "vm")
+	  (vec_select:V32QI
+	    (match_operand:V64QI 1 "register_operand" "v")
+	    (parallel [(const_int 32) (const_int 33)
+		       (const_int 34) (const_int 35)
+		       (const_int 36) (const_int 37)
+		       (const_int 38) (const_int 39)
+		       (const_int 40) (const_int 41)
+		       (const_int 42) (const_int 43)
+		       (const_int 44) (const_int 45)
+		       (const_int 46) (const_int 47)
+		       (const_int 48) (const_int 49)
+		       (const_int 50) (const_int 51)
+		       (const_int 52) (const_int 53)
+		       (const_int 54) (const_int 55)
+		       (const_int 56) (const_int 57)
+		       (const_int 58) (const_int 59)
+		       (const_int 60) (const_int 61)
+		       (const_int 62) (const_int 63)]))))]
+  "TARGET_AVX512F"
+  "vinserti64x4\t{$0x0, %2, %1, %0|%0, %1, %2, 0x0}"
+  [(set_attr "type" "sselog")
+   (set_attr "prefix_extra" "1")
+   (set_attr "length_immediate" "1")
+   (set_attr "prefix" "evex")
+   (set_attr "mode" "XI")])
+
 (define_insn "<avx_avx2>_maskload<ssemodesuffix><avxsizesuffix>"
   [(set (match_operand:V48_AVX2 0 "register_operand" "=x")
 	(unspec:V48_AVX2
diff --git a/gcc/testsuite/gcc.target/i386/pr101846-2.c b/gcc/testsuite/gcc.target/i386/pr101846-2.c
new file mode 100644
index 00000000000..af4ae8ccdd6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101846-2.c
@@ -0,0 +1,81 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512bw -mavx512vl -mavx512dq -O2" } */
+/* { dg-final { scan-assembler-times "vpmovwb" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovdw" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovqd" "3" } } */
+
+typedef short v4hi __attribute__((vector_size (8)));
+typedef short v8hi __attribute__((vector_size (16)));
+typedef short v16hi __attribute__((vector_size (32)));
+typedef short v32hi __attribute__((vector_size (64)));
+typedef char v8qi __attribute__((vector_size (8)));
+typedef char v16qi __attribute__((vector_size (16)));
+typedef char v32qi __attribute__((vector_size (32)));
+typedef char v64qi __attribute__((vector_size (64)));
+typedef int v2si __attribute__((vector_size (8)));
+typedef int v4si __attribute__((vector_size (16)));
+typedef int v8si __attribute__((vector_size (32)));
+typedef int v16si __attribute__((vector_size (64)));
+
+v16hi
+foo_dw_512 (v32hi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30);
+}
+
+v8hi
+foo_dw_256 (v16hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14);
+}
+
+v4hi
+foo_dw_128 (v8hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6);
+}
+
+v8si
+foo_qd_512 (v16si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14);
+}
+
+v4si
+foo_qd_256 (v8si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6);
+}
+
+v2si
+foo_qd_128 (v4si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2);
+}
+
+v32qi
+foo_wb_512 (v64qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  32, 34, 36, 38, 40, 42, 44, 46,
+				  48, 50, 52, 54, 56, 58, 60, 62);
+}
+
+v16qi
+foo_wb_256 (v32qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30);
+}
+
+v8qi
+foo_wb_128 (v16qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14);
+}
diff --git a/gcc/testsuite/gcc.target/i386/pr101846-3.c b/gcc/testsuite/gcc.target/i386/pr101846-3.c
new file mode 100644
index 00000000000..380b1220327
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr101846-3.c
@@ -0,0 +1,95 @@
+/* { dg-do compile } */
+/* { dg-options "-mavx512bw -mavx512vl -mavx512dq -O2" } */
+/* { dg-final { scan-assembler-times "vpmovwb" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovdw" "3" } } */
+/* { dg-final { scan-assembler-times "vpmovqd" "3" } } */
+
+typedef short v4hi __attribute__((vector_size (8)));
+typedef short v8hi __attribute__((vector_size (16)));
+typedef short v16hi __attribute__((vector_size (32)));
+typedef short v32hi __attribute__((vector_size (64)));
+typedef char v8qi __attribute__((vector_size (8)));
+typedef char v16qi __attribute__((vector_size (16)));
+typedef char v32qi __attribute__((vector_size (32)));
+typedef char v64qi __attribute__((vector_size (64)));
+typedef int v2si __attribute__((vector_size (8)));
+typedef int v4si __attribute__((vector_size (16)));
+typedef int v8si __attribute__((vector_size (32)));
+typedef int v16si __attribute__((vector_size (64)));
+
+v32hi
+foo_dw_512 (v32hi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  16, 17, 18, 19, 20, 21, 22, 23,
+				  24, 25, 26, 27, 28, 29, 30, 31);
+}
+
+v16hi
+foo_dw_256 (v16hi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
+
+v8hi
+foo_dw_128 (v8hi x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 4, 5, 6, 7);
+}
+
+v16si
+foo_qd_512 (v16si x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
+
+v8si
+foo_qd_256 (v8si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 4, 6, 4, 5, 6, 7);
+}
+
+v4si
+foo_qd_128 (v4si x)
+{
+  return __builtin_shufflevector (x, x, 0, 2, 2, 3);
+}
+
+v64qi
+foo_wb_512 (v64qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  32, 34, 36, 38, 40, 42, 44, 46,
+				  48, 50, 52, 54, 56, 58, 60, 62,
+				  32, 33, 34, 35, 36, 37, 38, 39,
+				  40, 41, 42, 43, 44, 45, 46, 47,
+				  48, 49, 50, 51, 52, 53, 54, 55,
+				  56, 57, 58, 59, 60, 61, 62, 63);
+}
+
+v32qi
+foo_wb_256 (v32qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  16, 18, 20, 22, 24, 26, 28, 30,
+				  16, 17, 18, 19, 20, 21, 22, 23,
+				  24, 25, 26, 27, 28, 29, 30, 31);
+}
+
+v16qi
+foo_wb_128 (v16qi x)
+{
+  return __builtin_shufflevector (x, x,
+				  0, 2, 4, 6, 8, 10, 12, 14,
+				  8, 9, 10, 11, 12, 13, 14, 15);
+}
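---
For reference, below is a minimal standalone sketch of the kind of
shuffle the patch recognizes.  It is a variant of foo_dw_256 from
pr101846-2.c; the file name, function name and the exact "-O2 -mavx512vl"
options are illustrative assumptions, not part of the patch.

  /* even.c: extract the even word elements of a 256-bit vector.
     Each even word is the low half of a dword lane, so the whole
     permutation is a 32->16 bit truncation per lane; with this patch
     it should match a single vpmovdw (ymm -> xmm) instead of going
     through a constant-mask shuffle.  */
  typedef short v16hi __attribute__ ((vector_size (32)));
  typedef short v8hi __attribute__ ((vector_size (16)));

  v8hi
  even_words (v16hi x)
  {
    return __builtin_shufflevector (x, x, 0, 2, 4, 6, 8, 10, 12, 14);
  }

Because the result is only half as wide as the source, this is the
"upper half not used" case described above, so no vec_set_lo insert is
emitted and the truncation is all that remains.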