From patchwork Thu Nov 20 13:38:32 2014
X-Patchwork-Submitter: Evgeny Stupachenko
X-Patchwork-Id: 412714
Date: Thu, 20 Nov 2014 16:38:32 +0300
Subject: Re: [PATCH x86, PR60451] Expand even/odd permutation using pack insn.
From: Evgeny Stupachenko
To: Uros Bizjak
Cc: GCC Patches, Richard Henderson

Thank you.  Patch with the proposed fixes:

On Thu, Nov 20, 2014 at 3:26 PM, Uros Bizjak wrote:
> On Thu, Nov 20, 2014 at 12:36 PM, Evgeny Stupachenko wrote:
>> Hi,
>>
>> The patch expands even/odd permutations using:
>>   "and, and, pack" in the even case
>>   "shift, shift, pack" in the odd case
>>
>> instead of the current "pshufb, pshufb, or" sequence or a big set of
>> unpack insns.
>>
>> AVX2/CORE bootstrap and make check passed.
>> Expensive tests are in progress.
>>
>> Is it ok for trunk?
>>
>> Evgeny
>>
>> 2014-11-20  Evgeny Stupachenko
>>
>> gcc/testsuite
>> 	PR target/60451
>> 	* gcc.target/i386/pr60451.c: New.
>>
>> gcc/
>> 	PR target/60451
>> 	* config/i386/i386.c (expand_vec_perm_even_odd_pack): New.
>> 	(expand_vec_perm_even_odd_1): Add new expand for SSE cases;
>> 	replace expand_vec_perm_vpshufb2_vpermq_even_odd with it for
>> 	AVX2 cases.
>> 	(ix86_expand_vec_perm_const_1): Add new expand.
>
> OK with a couple of small adjustments below.
>
> Thanks,
> Uros.
>
>> +/* A subroutine of expand_vec_perm_even_odd_1.  Implement extract-even
>> +   and extract-odd permutations of two V16QI, V8HI, V16HI or V32QI operands
>> +   with two "and" and "pack" or two "shift" and "pack" insns.
>> +   We should
>> +   have already failed all two instruction sequences.  */
>> +
>> +static bool
>> +expand_vec_perm_even_odd_pack (struct expand_vec_perm_d *d)
>> +{
>> +  rtx op, dop0, dop1, t, rperm[16];
>> +  unsigned i, odd, c, s, nelt = d->nelt;
>> +  bool end_perm = false;
>> +  machine_mode half_mode;
>> +  rtx (*gen_and) (rtx, rtx, rtx);
>> +  rtx (*gen_pack) (rtx, rtx, rtx);
>> +  rtx (*gen_shift) (rtx, rtx, rtx);
>> +
>> +  /* Required for "pack".  */
>> +  if (!TARGET_SSE4_2 || d->one_operand_p)
>> +    return false;
>> +
>> +  /* Only V8HI, V16QI, V16HI and V32QI modes are more profitable than
>> +     general shuffles.  */
>> +  if (d->vmode == V8HImode)
>
> Use switch, as proposed by Jakub.
>
>> +    {
>> +      c = 0xffff;
>> +      s = 16;
>> +      half_mode = V4SImode;
>> +      gen_and = gen_andv4si3;
>> +      gen_pack = gen_sse4_1_packusdw;
>> +      gen_shift = gen_lshrv4si3;
>> +    }
>> +  else if (d->vmode == V16QImode)
>> +    {
>> +      c = 0xff;
>> +      s = 8;
>> +      half_mode = V8HImode;
>> +      gen_and = gen_andv8hi3;
>> +      gen_pack = gen_sse2_packuswb;
>> +      gen_shift = gen_lshrv8hi3;
>> +    }
>> +  else if (d->vmode == V16HImode)
>> +    {
>> +      c = 0xffff;
>> +      s = 16;
>> +      half_mode = V8SImode;
>> +      gen_and = gen_andv8si3;
>> +      gen_pack = gen_avx2_packusdw;
>> +      gen_shift = gen_lshrv8si3;
>> +      end_perm = true;
>> +    }
>> +  else if (d->vmode == V32QImode)
>> +    {
>> +      c = 0xff;
>> +      s = 8;
>> +      half_mode = V16HImode;
>> +      gen_and = gen_andv16hi3;
>> +      gen_pack = gen_avx2_packuswb;
>> +      gen_shift = gen_lshrv16hi3;
>> +      end_perm = true;
>> +    }
>> +  else
>> +    return false;
>> +
>> +  /* Check that permutation is even or odd.
>> +     */
>> +  odd = d->perm[0];
>> +  if (odd != 0 && odd != 1)
>
> if (odd > 1)
>
>> +    return false;
>> +
>> +  for (i = 1; i < nelt; ++i)
>> +    if (d->perm[i] != 2 * i + odd)
>> +      return false;
>> +
>> +  if (d->testing_p)
>> +    return true;
>> +
>> +  dop0 = gen_reg_rtx (half_mode);
>> +  dop1 = gen_reg_rtx (half_mode);
>> +  if (odd == 0)
>> +    {
>> +      for (i = 0; i < nelt / 2; rperm[i++] = GEN_INT (c));
>
> Please write above as:
>
>   for (i = 0; i < nelt / 2; i++)
>     rperm[i] = GEN_INT (c);
>
>> +      t = gen_rtx_CONST_VECTOR (half_mode, gen_rtvec_v (nelt / 2, rperm));
>> +      t = force_reg (half_mode, t);
>> +      emit_insn (gen_and (dop0, t, gen_lowpart (half_mode, d->op0)));
>> +      emit_insn (gen_and (dop1, t, gen_lowpart (half_mode, d->op1)));
>> +    }
>> +  else
>> +    {
>> +      emit_insn (gen_shift (dop0,
>> +                            gen_lowpart (half_mode, d->op0),
>> +                            GEN_INT (s)));
>> +      emit_insn (gen_shift (dop1,
>> +                            gen_lowpart (half_mode, d->op1),
>> +                            GEN_INT (s)));
>> +    }
>> +  /* In AVX2 for 256 bit case we need to permute pack result.  */
>> +  if (TARGET_AVX2 && end_perm)
>> +    {
>> +      op = gen_reg_rtx (d->vmode);
>> +      t = gen_reg_rtx (V4DImode);
>> +      emit_insn (gen_pack (op, dop0, dop1));
>> +      emit_insn (gen_avx2_permv4di_1 (t, gen_lowpart (V4DImode, op),
>> +                                      const0_rtx, const2_rtx,
>> +                                      const1_rtx, GEN_INT (3)));
>> +      emit_move_insn (d->target, gen_lowpart (d->vmode, t));
>> +    }
>> +  else
>> +    emit_insn (gen_pack (d->target, dop0, dop1));
>> +
>> +  return true;
>> +}
>> +
>>  /* A subroutine of ix86_expand_vec_perm_builtin_1.  Implement extract-even
>>     and extract-odd permutations.  */
>>
>> @@ -48393,6 +48503,8 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
>>        gcc_unreachable ();
>>
>>      case V8HImode:
>> +      if (TARGET_SSE4_2)
>> +        return expand_vec_perm_even_odd_pack (d);
>>        if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
>
> "else if" in the above line, to be consistent with the else below.
>
>>          return expand_vec_perm_pshufb2 (d);
>>        else
>> @@ -48416,6 +48528,8 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
>>        break;
>>
>>      case V16QImode:
>> +      if (TARGET_SSE4_2)
>> +        return expand_vec_perm_even_odd_pack (d);
>>        if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
>
> "else if" in the above line.
>
>>          return expand_vec_perm_pshufb2 (d);
>>        else
>> @@ -48441,7 +48555,7 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
>>
>>      case V16HImode:
>>      case V32QImode:
>> -      return expand_vec_perm_vpshufb2_vpermq_even_odd (d);
>> +      return expand_vec_perm_even_odd_pack (d);
>>
>>      case V4DImode:
>>        if (!TARGET_AVX2)
>> @@ -48814,6 +48928,9 @@ ix86_expand_vec_perm_const_1 (struct expand_vec_perm_d *d)
>>
>>    /* Try sequences of three instructions.  */
>>
>> +  if (expand_vec_perm_even_odd_pack (d))
>> +    return true;
>> +
>>    if (expand_vec_perm_2vperm2f128_vshuf (d))
>>      return true;
>>
>> diff --git a/gcc/testsuite/gcc.target/i386/pr60451.c b/gcc/testsuite/gcc.target/i386/pr60451.c
>> new file mode 100644
>> index 0000000..29f019d
>> --- /dev/null
>> +++ b/gcc/testsuite/gcc.target/i386/pr60451.c
>> @@ -0,0 +1,14 @@
>> +/* { dg-do compile } */
>> +/* { dg-require-effective-target sse4 } */
>> +/* { dg-options "-O2 -ftree-vectorize -msse4.2" } */
>> +
>> +void
>> +foo (unsigned char *a, unsigned char *b, unsigned char *c, int size)
>> +{
>> +  int i;
>> +
>> +  for (i = 0; i < size; i++)
>> +    a[i] = (unsigned char) ((unsigned int)1 + b[i] * c[i] * 117);
>> +}
>> +
>> +/* { dg-final { scan-assembler "packuswb|vpunpck" } } */

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 085eb54..09c0057 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -48322,6 +48322,120 @@ expand_vec_perm_vpshufb2_vpermq_even_odd (struct expand_vec_perm_d *d)
   return true;
 }
 
+/* A subroutine of expand_vec_perm_even_odd_1.
+   Implement extract-even
+   and extract-odd permutations of two V16QI, V8HI, V16HI or V32QI operands
+   with two "and" and "pack" or two "shift" and "pack" insns.  We should
+   have already failed all two instruction sequences.  */
+
+static bool
+expand_vec_perm_even_odd_pack (struct expand_vec_perm_d *d)
+{
+  rtx op, dop0, dop1, t, rperm[16];
+  unsigned i, odd, c, s, nelt = d->nelt;
+  bool end_perm = false;
+  machine_mode half_mode;
+  rtx (*gen_and) (rtx, rtx, rtx);
+  rtx (*gen_pack) (rtx, rtx, rtx);
+  rtx (*gen_shift) (rtx, rtx, rtx);
+
+  /* Required for "pack".  */
+  if (!TARGET_SSE4_2 || d->one_operand_p)
+    return false;
+
+  switch (d->vmode)
+    {
+    case V8HImode:
+      c = 0xffff;
+      s = 16;
+      half_mode = V4SImode;
+      gen_and = gen_andv4si3;
+      gen_pack = gen_sse4_1_packusdw;
+      gen_shift = gen_lshrv4si3;
+      break;
+    case V16QImode:
+      c = 0xff;
+      s = 8;
+      half_mode = V8HImode;
+      gen_and = gen_andv8hi3;
+      gen_pack = gen_sse2_packuswb;
+      gen_shift = gen_lshrv8hi3;
+      break;
+    case V16HImode:
+      c = 0xffff;
+      s = 16;
+      half_mode = V8SImode;
+      gen_and = gen_andv8si3;
+      gen_pack = gen_avx2_packusdw;
+      gen_shift = gen_lshrv8si3;
+      end_perm = true;
+      break;
+    case V32QImode:
+      c = 0xff;
+      s = 8;
+      half_mode = V16HImode;
+      gen_and = gen_andv16hi3;
+      gen_pack = gen_avx2_packuswb;
+      gen_shift = gen_lshrv16hi3;
+      end_perm = true;
+      break;
+    default:
+      /* Only V8HI, V16QI, V16HI and V32QI modes are more profitable than
+	 general shuffles.  */
+      return false;
+    }
+
+  /* Check that permutation is even or odd.
+     */
+  odd = d->perm[0];
+  if (odd > 1)
+    return false;
+
+  for (i = 1; i < nelt; ++i)
+    if (d->perm[i] != 2 * i + odd)
+      return false;
+
+  if (d->testing_p)
+    return true;
+
+  dop0 = gen_reg_rtx (half_mode);
+  dop1 = gen_reg_rtx (half_mode);
+  if (odd == 0)
+    {
+      for (i = 0; i < nelt / 2; i++)
+	rperm[i] = GEN_INT (c);
+      t = gen_rtx_CONST_VECTOR (half_mode, gen_rtvec_v (nelt / 2, rperm));
+      t = force_reg (half_mode, t);
+      emit_insn (gen_and (dop0, t, gen_lowpart (half_mode, d->op0)));
+      emit_insn (gen_and (dop1, t, gen_lowpart (half_mode, d->op1)));
+    }
+  else
+    {
+      emit_insn (gen_shift (dop0,
+			    gen_lowpart (half_mode, d->op0),
+			    GEN_INT (s)));
+      emit_insn (gen_shift (dop1,
+			    gen_lowpart (half_mode, d->op1),
+			    GEN_INT (s)));
+    }
+  /* In AVX2 for 256 bit case we need to permute pack result.  */
+  if (TARGET_AVX2 && end_perm)
+    {
+      op = gen_reg_rtx (d->vmode);
+      t = gen_reg_rtx (V4DImode);
+      emit_insn (gen_pack (op, dop0, dop1));
+      emit_insn (gen_avx2_permv4di_1 (t,
+				      gen_lowpart (V4DImode, op),
+				      const0_rtx,
+				      const2_rtx,
+				      const1_rtx,
+				      GEN_INT (3)));
+      emit_move_insn (d->target, gen_lowpart (d->vmode, t));
+    }
+  else
+    emit_insn (gen_pack (d->target, dop0, dop1));
+
+  return true;
+}
+
 /* A subroutine of ix86_expand_vec_perm_builtin_1.  Implement extract-even
    and extract-odd permutations.
    */
 
@@ -48393,7 +48507,9 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
       gcc_unreachable ();
 
     case V8HImode:
-      if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
+      if (TARGET_SSE4_2)
+	return expand_vec_perm_even_odd_pack (d);
+      else if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
 	return expand_vec_perm_pshufb2 (d);
       else
 	{
@@ -48416,7 +48532,9 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
       break;
 
     case V16QImode:
-      if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
+      if (TARGET_SSE4_2)
+	return expand_vec_perm_even_odd_pack (d);
+      else if (TARGET_SSSE3 && !TARGET_SLOW_PSHUFB)
 	return expand_vec_perm_pshufb2 (d);
       else
 	{
@@ -48441,7 +48559,7 @@ expand_vec_perm_even_odd_1 (struct expand_vec_perm_d *d, unsigned odd)
 
     case V16HImode:
     case V32QImode:
-      return expand_vec_perm_vpshufb2_vpermq_even_odd (d);
+      return expand_vec_perm_even_odd_pack (d);
 
     case V4DImode:
       if (!TARGET_AVX2)
@@ -48814,6 +48932,9 @@ ix86_expand_vec_perm_const_1 (struct expand_vec_perm_d *d)
 
   /* Try sequences of three instructions.  */
 
+  if (expand_vec_perm_even_odd_pack (d))
+    return true;
+
   if (expand_vec_perm_2vperm2f128_vshuf (d))
     return true;
 
diff --git a/gcc/testsuite/gcc.target/i386/pr60451.c b/gcc/testsuite/gcc.target/i386/pr60451.c
new file mode 100644
index 0000000..29f019d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr60451.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target sse4 } */
+/* { dg-options "-O2 -ftree-vectorize -msse4.2" } */
+
+void
+foo (unsigned char *a, unsigned char *b, unsigned char *c, int size)
+{
+  int i;
+
+  for (i = 0; i < size; i++)
+    a[i] = (unsigned char) ((unsigned int)1 + b[i] * c[i] * 117);
+}
+
+/* { dg-final { scan-assembler "packuswb|vpunpck" } } */