From patchwork Thu Nov 20 13:38:32 2014
X-Patchwork-Submitter: Evgeny Stupachenko
X-Patchwork-Id: 412714
Date: Thu, 20 Nov 2014 16:38:32 +0300
Subject: Re: [PATCH x86, PR60451] Expand even/odd permutation using pack insn.
From: Evgeny Stupachenko
To: Uros Bizjak
Cc: GCC Patches, Richard Henderson

Thank you.  Patch with the proposed fixes:

On Thu, Nov 20, 2014 at 3:26 PM, Uros Bizjak wrote:
> On Thu, Nov 20, 2014 at 12:36 PM, Evgeny Stupachenko wrote:
>> Hi,
>>
>> The patch expands even/odd permutations using:
>>   "and, and, pack" in the even case
>>   "shift, shift, pack" in the odd case
>>
>> instead of the current "pshufb, pshufb, or" sequence or a big set of
>> unpack insns.
>>
>> AVX2/CORE bootstrap and make check passed.
>> Expensive tests are in progress.
>>
>> Is it ok for trunk?
>>
>> Evgeny
>>
>> 2014-11-20  Evgeny Stupachenko
>>
>> gcc/testsuite
>> 	PR target/60451
>> 	* gcc.target/i386/pr60451.c: New.
>>
>> gcc/
>> 	PR target/60451
>> 	* config/i386/i386.c (expand_vec_perm_even_odd_pack): New.
>> 	(expand_vec_perm_even_odd_1): Add new expand for SSE cases;
>> 	replace expand_vec_perm_vpshufb2_vpermq_even_odd with it for
>> 	AVX2 cases.
>> 	(ix86_expand_vec_perm_const_1): Add new expand.
>
> OK with a couple of small adjustments below.
>
> Thanks,
> Uros.
>
>> +/* A subroutine of expand_vec_perm_even_odd_1.  Implement extract-even
>> +   and extract-odd permutations of two V16QI, V8HI, V16HI or V32QI operands
>> +   with two "and" and "pack" or two "shift" and "pack" insns.
>> +   We should
>> +   have already failed all two instruction sequences.  */
>> +
>> +static bool
>> +expand_vec_perm_even_odd_pack (struct expand_vec_perm_d *d)
>> +{
>> +  rtx op, dop0, dop1, t, rperm[16];
>> +  unsigned i, odd, c, s, nelt = d->nelt;
>> +  bool end_perm = false;
>> +  machine_mode half_mode;
>> +  rtx (*gen_and) (rtx, rtx, rtx);
>> +  rtx (*gen_pack) (rtx, rtx, rtx);
>> +  rtx (*gen_shift) (rtx, rtx, rtx);
>> +
>> +  /* Required for "pack".  */
>> +  if (!TARGET_SSE4_2 || d->one_operand_p)
>> +    return false;
>> +
>> +  /* Only V8HI, V16QI, V16HI and V32QI modes are more profitable than
>> +     general shuffles.  */
>> +  if (d->vmode == V8HImode)
>
> Use switch, as proposed by Jakub.
>
>> +    {
>> +      c = 0xffff;
>> +      s = 16;
>> +      half_mode = V4SImode;
>> +      gen_and = gen_andv4si3;
>> +      gen_pack = gen_sse4_1_packusdw;
>> +      gen_shift = gen_lshrv4si3;
>> +    }
>> +  else if (d->vmode == V16QImode)
>> +    {
>> +      c = 0xff;
>> +      s = 8;
>> +      half_mode = V8HImode;
>> +      gen_and = gen_andv8hi3;
>> +      gen_pack = gen_sse2_packuswb;
>> +      gen_shift = gen_lshrv8hi3;
>> +    }
>> +  else if (d->vmode == V16HImode)
>> +    {
>> +      c = 0xffff;
>> +      s = 16;
>> +      half_mode = V8SImode;
>> +      gen_and = gen_andv8si3;
>> +      gen_pack = gen_avx2_packusdw;
>> +      gen_shift = gen_lshrv8si3;
>> +      end_perm = true;
>> +    }
>> +  else if (d->vmode == V32QImode)
>> +    {
>> +      c = 0xff;
>> +      s = 8;
>> +      half_mode = V16HImode;
>> +      gen_and = gen_andv16hi3;
>> +      gen_pack = gen_avx2_packuswb;
>> +      gen_shift = gen_lshrv16hi3;
>> +      end_perm = true;
>> +    }
>> +  else
>> +    return false;
>> +
>> +  /* Check that permutation is even or odd.
>> +     */
>> +  odd = d->perm[0];
>> +  if (odd != 0 && odd != 1)
>
> if (odd > 1)
>
>> +    return false;
>> +
>> +  for (i = 1; i < nelt; ++i)
>> +    if (d->perm[i] != 2 * i + odd)
>> +      return false;
>> +
>> +  if (d->testing_p)
>> +    return true;
>> +
>> +  dop0 = gen_reg_rtx (half_mode);
>> +  dop1 = gen_reg_rtx (half_mode);
>> +  if (odd == 0)
>> +    {
>> +      for (i = 0; i < nelt / 2; rperm[i++] = GEN_INT (c));
>
> Please write above as:
>
>   for (i = 0; i < nelt / 2; i++)
>     rperm[i] = GEN_INT (c);
>
>> +      t = gen_rtx_CONST_VECTOR (half_mode, gen_rtvec_v (nelt / 2, rperm));
>> +      t = force_reg (half_mode, t);
>> +      emit_insn (gen_and (dop0, t, gen_lowpart (half_mode, d->op0)));
>> +      emit_insn (gen_and (dop1, t, gen_lowpart (half_mode, d->op1)));
>> +    }
>> +  else
>> +    {
>> +      emit_insn (gen_shift (dop0,
>> +                            gen_lowpart (half_mode, d->op0),
>> +                            GEN_INT (s)));
>> +      emit_insn (gen_shift (dop1,
>> +                            gen_lowpart (half_mode, d->op1),
>> +                            GEN_INT (s)));
>> +    }
>> +  /* In AVX2 for 256 bit case we need to permute pack result.  */
>> +  if (TARGET_AVX2 && end_perm)
>> +    {
>> +      op = gen_reg_rtx (d->vmode);
>> +      t = gen_reg_rtx (V4DImode);
>> +      emit_insn (gen_pack (op, dop0, dop1));
>> +      emit_insn (gen_avx2_permv4di_1 (t, gen_lowpart (V4DImode, op),
>> +                                      const0_rtx, const2_rtx,
>> +                                      const1_rtx, GEN_INT (3)));
>> +      emit_move_insn (d->target, gen_lowpart (d->vmode, t));
>> +    }
>> +  else
>> +    emit_insn (gen_pack (d->target, dop0, dop1));
>> +
>> +  return true;
>> +}
>> +
>>  /* A subroutine of ix86_expand_vec_perm_builtin_1.  Implement extract-even
>>     and extract-odd permutations.  */
>>
>> @@ -48393,6 +48503,8 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
>>        gcc_unreachable ();
>>
>>      case V8HImode:
>> +      if (TARGET_SSE4_2)
>> +        return expand_vec_perm_even_odd_pack (d);
>>        if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
>
> "else if" in the above line, to be consistent with the else below.
>
>>          return expand_vec_perm_pshufb2 (d);
>>        else
>> @@ -48416,6 +48528,8 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
>>        break;
>>
>>      case V16QImode:
>> +      if (TARGET_SSE4_2)
>> +        return expand_vec_perm_even_odd_pack (d);
>>        if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
>
> "else if" in the above line.
>
>>          return expand_vec_perm_pshufb2 (d);
>>        else
>> @@ -48441,7 +48555,7 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
>>
>>      case V16HImode:
>>      case V32QImode:
>> -      return expand_vec_perm_vpshufb2_vpermq_even_odd (d);
>> +      return expand_vec_perm_even_odd_pack (d);
>>
>>      case V4DImode:
>>        if (!TARGET_AVX2)
>> @@ -48814,6 +48928,9 @@ ix86_expand_vec_perm_const_1 (struct expand_vec_perm_d *d)
>>
>>    /* Try sequences of three instructions.  */
>>
>> +  if (expand_vec_perm_even_odd_pack (d))
>> +    return true;
>> +
>>    if (expand_vec_perm_2vperm2f128_vshuf (d))
>>      return true;
>>
>> diff --git a/gcc/testsuite/gcc.target/i386/pr60451.c b/gcc/testsuite/gcc.target/i386/pr60451.c
>> new file mode 100644
>> index 0000000..29f019d
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/i386/pr60451.c
>> @@ -0,0 +1,14 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target sse4 } */
>> +/* { dg-options "-O2 -ftree-vectorize -msse4.2" } */
>> +
>> +void
>> +foo (unsigned char *a, unsigned char *b, unsigned char *c, int size)
>> +{
>> +  int i;
>> +
>> +  for (i = 0; i < size; i++)
>> +    a[i] = (unsigned char) ((unsigned int)1 + b[i] * c[i] * 117);
>> +}
>> +
>> +/* { dg-final { scan-assembler "packuswb|vpunpck" } } */

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 085eb54..09c0057 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -48322,6 +48322,120 @@ expand_vec_perm_vpshufb2_vpermq_even_odd (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* A subroutine of expand_vec_perm_even_odd_1.
+   Implement extract-even
+   and extract-odd permutations of two V16QI, V8HI, V16HI or V32QI operands
+   with two "and" and "pack" or two "shift" and "pack" insns.  We should
+   have already failed all two instruction sequences.  */
+
+static bool
+expand_vec_perm_even_odd_pack (struct expand_vec_perm_d *d)
+{
+  rtx op, dop0, dop1, t, rperm[16];
+  unsigned i, odd, c, s, nelt = d->nelt;
+  bool end_perm = false;
+  machine_mode half_mode;
+  rtx (*gen_and) (rtx, rtx, rtx);
+  rtx (*gen_pack) (rtx, rtx, rtx);
+  rtx (*gen_shift) (rtx, rtx, rtx);
+
+  /* Required for "pack".  */
+  if (!TARGET_SSE4_2 || d->one_operand_p)
+    return false;
+
+  switch (d->vmode)
+    {
+    case V8HImode:
+      c = 0xffff;
+      s = 16;
+      half_mode = V4SImode;
+      gen_and = gen_andv4si3;
+      gen_pack = gen_sse4_1_packusdw;
+      gen_shift = gen_lshrv4si3;
+      break;
+    case V16QImode:
+      c = 0xff;
+      s = 8;
+      half_mode = V8HImode;
+      gen_and = gen_andv8hi3;
+      gen_pack = gen_sse2_packuswb;
+      gen_shift = gen_lshrv8hi3;
+      break;
+    case V16HImode:
+      c = 0xffff;
+      s = 16;
+      half_mode = V8SImode;
+      gen_and = gen_andv8si3;
+      gen_pack = gen_avx2_packusdw;
+      gen_shift = gen_lshrv8si3;
+      end_perm = true;
+      break;
+    case V32QImode:
+      c = 0xff;
+      s = 8;
+      half_mode = V16HImode;
+      gen_and = gen_andv16hi3;
+      gen_pack = gen_avx2_packuswb;
+      gen_shift = gen_lshrv16hi3;
+      end_perm = true;
+      break;
+    default:
+      /* Only V8HI, V16QI, V16HI and V32QI modes are more profitable than
+	 general shuffles.  */
+      return false;
+    }
+
+  /* Check that permutation is even or odd.
+     */
+  odd = d->perm[0];
+  if (odd > 1)
+    return false;
+
+  for (i = 1; i < nelt; ++i)
+    if (d->perm[i] != 2 * i + odd)
+      return false;
+
+  if (d->testing_p)
+    return true;
+
+  dop0 = gen_reg_rtx (half_mode);
+  dop1 = gen_reg_rtx (half_mode);
+  if (odd == 0)
+    {
+      for (i = 0; i < nelt / 2; i++)
+	rperm[i] = GEN_INT (c);
+      t = gen_rtx_CONST_VECTOR (half_mode, gen_rtvec_v (nelt / 2, rperm));
+      t = force_reg (half_mode, t);
+      emit_insn (gen_and (dop0, t, gen_lowpart (half_mode, d->op0)));
+      emit_insn (gen_and (dop1, t, gen_lowpart (half_mode, d->op1)));
+    }
+  else
+    {
+      emit_insn (gen_shift (dop0,
+			    gen_lowpart (half_mode, d->op0),
+			    GEN_INT (s)));
+      emit_insn (gen_shift (dop1,
+			    gen_lowpart (half_mode, d->op1),
+			    GEN_INT (s)));
+    }
+  /* In AVX2 for 256 bit case we need to permute pack result.  */
+  if (TARGET_AVX2 && end_perm)
+    {
+      op = gen_reg_rtx (d->vmode);
+      t = gen_reg_rtx (V4DImode);
+      emit_insn (gen_pack (op, dop0, dop1));
+      emit_insn (gen_avx2_permv4di_1 (t,
+				      gen_lowpart (V4DImode, op),
+				      const0_rtx,
+				      const2_rtx,
+				      const1_rtx,
+				      GEN_INT (3)));
+      emit_move_insn (d->target, gen_lowpart (d->vmode, t));
+    }
+  else
+    emit_insn (gen_pack (d->target, dop0, dop1));
+
+  return true;
+}
+
 /* A subroutine of ix86_expand_vec_perm_builtin_1.  Implement extract-even
    and extract-odd permutations.
    */
 
@@ -48393,7 +48507,9 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
       gcc_unreachable ();
 
     case V8HImode:
-      if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
+      if (TARGET_SSE4_2)
+	return expand_vec_perm_even_odd_pack (d);
+      else if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
 	return expand_vec_perm_pshufb2 (d);
       else
 	{
@@ -48416,7 +48532,9 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
       break;
 
     case V16QImode:
-      if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
+      if (TARGET_SSE4_2)
+	return expand_vec_perm_even_odd_pack (d);
+      else if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
 	return expand_vec_perm_pshufb2 (d);
       else
 	{
@@ -48441,7 +48559,7 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
 
     case V16HImode:
     case V32QImode:
-      return expand_vec_perm_vpshufb2_vpermq_even_odd (d);
+      return expand_vec_perm_even_odd_pack (d);
 
     case V4DImode:
       if (!TARGET_AVX2)
@@ -48814,6 +48932,9 @@ ix86_expand_vec_perm_const_1 (struct expand_vec_perm_d *d)
 
   /* Try sequences of three instructions.  */
 
+  if (expand_vec_perm_even_odd_pack (d))
+    return true;
+
   if (expand_vec_perm_2vperm2f128_vshuf (d))
     return true;
 
diff --git a/gcc/testsuite/gcc.target/i386/pr60451.c b/gcc/testsuite/gcc.target/i386/pr60451.c
new file mode 100644
index 0000000..29f019d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr60451.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target sse4 } */
+/* { dg-options "-O2 -ftree-vectorize -msse4.2" } */
+
+void
+foo (unsigned char *a, unsigned char *b, unsigned char *c, int size)
+{
+  int i;
+
+  for (i = 0; i < size; i++)
+    a[i] = (unsigned char) ((unsigned int)1 + b[i] * c[i] * 117);
+}
+
+/* { dg-final { scan-assembler "packuswb|vpunpck" } } */