Series: [RFC, i386] Autovectorize 8-byte vectors
Message ID: CAFULd4ZqqnAiyXC75NmCo_YbA64PF60PZb3qWJDe+=xu0Oudjg@mail.gmail.com
State: New
On Wed, Jun 26, 2019 at 10:17 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> Now that TARGET_MMX_WITH_SSE is implemented, the compiler should be
> able to auto-vectorize:

On a related note, the following slightly changed testcase:

void
foo (char *restrict r, char *restrict a)
{
  for (int i = 0; i < 24; i++)
    r[i] += a[i];
}

compiles to:

foo:
        vmovdqu (%rdi), %xmm1
        vpaddb  (%rsi), %xmm1, %xmm0
        movzbl  16(%rsi), %eax
        addb    %al, 16(%rdi)
        vmovups %xmm0, (%rdi)
        movzbl  17(%rsi), %eax
        addb    %al, 17(%rdi)
        movzbl  18(%rsi), %eax
        addb    %al, 18(%rdi)
        movzbl  19(%rsi), %eax
        addb    %al, 19(%rdi)
        movzbl  20(%rsi), %eax
        addb    %al, 20(%rdi)
        movzbl  21(%rsi), %eax
        addb    %al, 21(%rdi)
        movzbl  22(%rsi), %eax
        addb    %al, 22(%rdi)
        movzbl  23(%rsi), %eax
        addb    %al, 23(%rdi)
        ret

One would expect that the remaining 8-byte array would also get
vectorized, resulting in one 16-byte operation and one 8-byte
operation.

Uros.
On June 26, 2019 10:17:26 AM GMT+02:00, Uros Bizjak <ubizjak@gmail.com> wrote:
>Now that TARGET_MMX_WITH_SSE is implemented, the compiler should be
>able to auto-vectorize:
>
>void
>foo (char *restrict r, char *restrict a)
>{
>  for (int i = 0; i < 8; i++)
>    r[i] += a[i];
>}
>
>Attached patch enables the conversion and produces:
>
>foo:
>        movq    (%rdi), %xmm1
>        movq    (%rsi), %xmm0
>        paddb   %xmm1, %xmm0
>        movq    %xmm0, (%rdi)
>        ret
>
>Please note that the patch regresses
>
>FAIL: gcc.target/i386/sse2-vect-simd-11.c scan-tree-dump-times vect
>"vectorized [1-3] loops" 2
>FAIL: gcc.target/i386/sse2-vect-simd-15.c scan-tree-dump-times vect
>"vectorized [1-3] loops" 2
>
>For some reason, the compiler decides to vectorize with 8-byte
>vectors, resulting in:
>
>missed: not vectorized: relevant stmt not supported: _8 = (short
>unsigned int) _4;
>missed: bad operation or unsupported loop bound.
>missed: couldn't vectorize loop
>
>However, the unpatched compiler is able to vectorize loop using
>16-byte vectors. It looks that the compiler should re-run
>vectorization with wider vectors, if vectorization with narrower
>vectors fails. Jakub, Richard, do you have any insight in this issue?

Double check the ordering of the vector size pushes - it should already
iterate but first successful wins.

Richard.

>2019-06-26  Uroš Bizjak  <ubizjak@gmail.com>
>
>    * config/i386/i386.c (ix86_autovectorize_vector_sizes):
>    Autovectorize 8-byte vectors for TARGET_MMX_WITH_SSE.
>
>testsuite/ChangeLog:
>
>2019-06-26  Uroš Bizjak  <ubizjak@gmail.com>
>
>    * lib/target-supports.exp (available_vector_sizes)
>    <[istarget i?86-*-*] || [istarget x86_64-*-*]>: Add
>    64-bit vectors for !ia32.
>
>The patch was bootstrapped and regression tested on x86_64-linux-gnu
>{,-m32}.
>
>Uros.
On June 26, 2019 10:25:44 AM GMT+02:00, Uros Bizjak <ubizjak@gmail.com> wrote:
>On Wed, Jun 26, 2019 at 10:17 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>>
>> Now that TARGET_MMX_WITH_SSE is implemented, the compiler should be
>> able to auto-vectorize:
>
>On a related note, the following slightly changed testcase:
>
>void
>foo (char *restrict r, char *restrict a)
>{
>  for (int i = 0; i < 24; i++)
>    r[i] += a[i];
>}
>
>compiles to:
>
>foo:
>        vmovdqu (%rdi), %xmm1
>        vpaddb  (%rsi), %xmm1, %xmm0
>        movzbl  16(%rsi), %eax
>        addb    %al, 16(%rdi)
>        vmovups %xmm0, (%rdi)
>        movzbl  17(%rsi), %eax
>        addb    %al, 17(%rdi)
>        movzbl  18(%rsi), %eax
>        addb    %al, 18(%rdi)
>        movzbl  19(%rsi), %eax
>        addb    %al, 19(%rdi)
>        movzbl  20(%rsi), %eax
>        addb    %al, 20(%rdi)
>        movzbl  21(%rsi), %eax
>        addb    %al, 21(%rdi)
>        movzbl  22(%rsi), %eax
>        addb    %al, 22(%rdi)
>        movzbl  23(%rsi), %eax
>        addb    %al, 23(%rdi)
>        ret
>
>One would expect that the remaining 8-byte array would also get
>vectorized, resulting in one 16-byte operation and one 8-byte
>operation.

Try --param vect-epilogue-nomask=1 (or so).

Richard.

>Uros.
On Wed, Jun 26, 2019 at 10:36 AM Richard Biener <rguenther@suse.de> wrote:
>
> On June 26, 2019 10:25:44 AM GMT+02:00, Uros Bizjak <ubizjak@gmail.com> wrote:
> >On Wed, Jun 26, 2019 at 10:17 AM Uros Bizjak <ubizjak@gmail.com> wrote:
> >>
> >> Now that TARGET_MMX_WITH_SSE is implemented, the compiler should be
> >> able to auto-vectorize:
> >
> >On a related note, the following slightly changed testcase:
> >
> >void
> >foo (char *restrict r, char *restrict a)
> >{
> >  for (int i = 0; i < 24; i++)
> >    r[i] += a[i];
> >}
> >
> >compiles to:
> >
> >foo:
> >        vmovdqu (%rdi), %xmm1
> >        vpaddb  (%rsi), %xmm1, %xmm0
> >        movzbl  16(%rsi), %eax
> >        addb    %al, 16(%rdi)
> >        vmovups %xmm0, (%rdi)
> >        movzbl  17(%rsi), %eax
> >        addb    %al, 17(%rdi)
> >        movzbl  18(%rsi), %eax
> >        addb    %al, 18(%rdi)
> >        movzbl  19(%rsi), %eax
> >        addb    %al, 19(%rdi)
> >        movzbl  20(%rsi), %eax
> >        addb    %al, 20(%rdi)
> >        movzbl  21(%rsi), %eax
> >        addb    %al, 21(%rdi)
> >        movzbl  22(%rsi), %eax
> >        addb    %al, 22(%rdi)
> >        movzbl  23(%rsi), %eax
> >        addb    %al, 23(%rdi)
> >        ret
> >
> >One would expect that the remaining 8-byte array would also get
> >vectorized, resulting in one 16-byte operation and one 8-byte
> >operation.
>
> Try --param vect-epilogue-nomask=1 (or so).

Yes, this (--param vect-epilogues-nomask=1) works!

foo:
        movdqu  (%rdi), %xmm0
        movdqu  (%rsi), %xmm2
        movq    16(%rsi), %xmm1
        paddb   %xmm2, %xmm0
        movups  %xmm0, (%rdi)
        movq    16(%rdi), %xmm0
        paddb   %xmm1, %xmm0
        movq    %xmm0, 16(%rdi)
        ret

Thanks,
Uros.
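With the parameter enabled, the 24-byte loop is thus split into a
16-byte vectorized main body and an 8-byte vectorized epilogue.  For
illustration, a hand-written C equivalent of that shape; the unaligned,
may_alias vector typedefs are assumptions made here so the example is
well-defined, and the vectorizer of course uses its own internal types:

typedef char v16qi __attribute__ ((vector_size (16), aligned (1), may_alias));
typedef char v8qi  __attribute__ ((vector_size (8), aligned (1), may_alias));

void
foo_by_hand (char *restrict r, char *restrict a)
{
  /* Bytes 0..15: the movdqu/paddb/movups main body above.  */
  *(v16qi *) r += *(v16qi *) a;
  /* Bytes 16..23: the movq/paddb/movq epilogue above.  */
  *(v8qi *) (r + 16) += *(v8qi *) (a + 16);
}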
On Wed, Jun 26, 2019 at 10:17:26AM +0200, Uros Bizjak wrote:
> Please note that the patch regresses
>
> FAIL: gcc.target/i386/sse2-vect-simd-11.c scan-tree-dump-times vect
> "vectorized [1-3] loops" 2
> FAIL: gcc.target/i386/sse2-vect-simd-15.c scan-tree-dump-times vect
> "vectorized [1-3] loops" 2
>
> For some reason, the compiler decides to vectorize with 8-byte
> vectors, resulting in:
>
> missed: not vectorized: relevant stmt not supported: _8 = (short
> unsigned int) _4;
> missed: bad operation or unsupported loop bound.
> missed: couldn't vectorize loop
>
> However, the unpatched compiler is able to vectorize loop using
> 16-byte vectors. It looks that the compiler should re-run
> vectorization with wider vectors, if vectorization with narrower
> vectors fails. Jakub, Richard, do you have any insight in this issue?
>
> 2019-06-26  Uroš Bizjak  <ubizjak@gmail.com>
>
>     * config/i386/i386.c (ix86_autovectorize_vector_sizes):
>     Autovectorize 8-byte vectors for TARGET_MMX_WITH_SSE.

The patch isn't correct if TARGET_MMX_WITH_SSE, but not TARGET_AVX, because
in that case it will push only that 8 and nothing else, while you really
want to have 16 and 8 in that order, so that it tries to vectorize first
with 16-byte vectors and fall back to 8-byte.  The hook is supposed to
either push nothing at all, then only one vector size is tried,
one derived from preferred_simd_mode, or push all possible vectorization
sizes to be tried.

The following patch fixes the failures:

--- gcc/config/i386/i386.c.jj	2019-06-26 09:15:53.474869259 +0200
+++ gcc/config/i386/i386.c	2019-06-26 10:42:01.354106012 +0200
@@ -21401,6 +21401,11 @@ ix86_autovectorize_vector_sizes (vector_
       sizes->safe_push (16);
       sizes->safe_push (32);
     }
+  else if (TARGET_MMX_WITH_SSE)
+    sizes->safe_push (16);
+
+  if (TARGET_MMX_WITH_SSE)
+    sizes->safe_push (8);
 }
 
 /* Implemenation of targetm.vectorize.get_mask_mode.  */

	Jakub
On Wed, Jun 26, 2019 at 10:47 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Wed, Jun 26, 2019 at 10:17:26AM +0200, Uros Bizjak wrote:
> > Please note that the patch regresses
> >
> > FAIL: gcc.target/i386/sse2-vect-simd-11.c scan-tree-dump-times vect
> > "vectorized [1-3] loops" 2
> > FAIL: gcc.target/i386/sse2-vect-simd-15.c scan-tree-dump-times vect
> > "vectorized [1-3] loops" 2
> >
> > For some reason, the compiler decides to vectorize with 8-byte
> > vectors, resulting in:
> >
> > missed: not vectorized: relevant stmt not supported: _8 = (short
> > unsigned int) _4;
> > missed: bad operation or unsupported loop bound.
> > missed: couldn't vectorize loop
> >
> > However, the unpatched compiler is able to vectorize loop using
> > 16-byte vectors. It looks that the compiler should re-run
> > vectorization with wider vectors, if vectorization with narrower
> > vectors fails. Jakub, Richard, do you have any insight in this issue?
> >
> > 2019-06-26  Uroš Bizjak  <ubizjak@gmail.com>
> >
> >     * config/i386/i386.c (ix86_autovectorize_vector_sizes):
> >     Autovectorize 8-byte vectors for TARGET_MMX_WITH_SSE.
>
> The patch isn't correct if TARGET_MMX_WITH_SSE, but not TARGET_AVX, because
> in that case it will push only that 8 and nothing else, while you really
> want to have 16 and 8 in that order, so that it tries to vectorize first
> with 16-byte vectors and fall back to 8-byte.  The hook is supposed to
> either push nothing at all, then only one vector size is tried,
> one derived from preferred_simd_mode, or push all possible vectorization
> sizes to be tried.

Thanks for the explanation and the patch!

Yes, the patch works OK. I'll regression test it and push it later today.

Thanks,
Uros.

> The following patch fixes the failures:
>
> --- gcc/config/i386/i386.c.jj	2019-06-26 09:15:53.474869259 +0200
> +++ gcc/config/i386/i386.c	2019-06-26 10:42:01.354106012 +0200
> @@ -21401,6 +21401,11 @@ ix86_autovectorize_vector_sizes (vector_
>        sizes->safe_push (16);
>        sizes->safe_push (32);
>      }
> +  else if (TARGET_MMX_WITH_SSE)
> +    sizes->safe_push (16);
> +
> +  if (TARGET_MMX_WITH_SSE)
> +    sizes->safe_push (8);
>  }
>
>  /* Implemenation of targetm.vectorize.get_mask_mode.  */
>
>
> 	Jakub
On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
> > The patch isn't correct if TARGET_MMX_WITH_SSE, but not TARGET_AVX, because
> > in that case it will push only that 8 and nothing else, while you really
> > want to have 16 and 8 in that order, so that it tries to vectorize first
> > with 16-byte vectors and fall back to 8-byte.  The hook is supposed to
> > either push nothing at all, then only one vector size is tried,
> > one derived from preferred_simd_mode, or push all possible vectorization
> > sizes to be tried.
>
> Thanks for the explanation and the patch!

It is even documented that way:

"If the mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE} is not\n\
the only one that is worth considering, this hook should add all suitable\n\
vector sizes to @var{sizes}, in order of decreasing preference.  The first\n\
one should be the size of @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.\n\
If @var{all} is true, add suitable vector sizes even when they are generally\n\
not expected to be worthwhile.\n\

	Jakub
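Taken together with the diff above, the fixed hook has roughly the
following shape.  This is a sketch rather than the verbatim i386.c
function: the TARGET_AVX branch stands in for the pre-existing
AVX/AVX-512 cases, which are elided here (they also consume the "all"
parameter), and only the TARGET_MMX_WITH_SSE pushes come from Jakub's
patch:

/* Sizes are pushed in decreasing order of preference, per the
   documentation quoted above.  */
static void
ix86_autovectorize_vector_sizes (vector_sizes *sizes, bool all)
{
  if (TARGET_AVX)      /* Stand-in for the elided AVX/AVX-512 branches.  */
    {
      sizes->safe_push (32);
      sizes->safe_push (16);
    }
  else if (TARGET_MMX_WITH_SSE)
    sizes->safe_push (16);   /* No AVX: try 16-byte vectors first...  */

  if (TARGET_MMX_WITH_SSE)
    sizes->safe_push (8);    /* ...and fall back to 8-byte vectors.  */
}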
On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
> Yes, the patch works OK. I'll regression test it and push it later today.
I think it caused
+FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
which admittedly already is xfailed on various targets.
We now newly vectorize those loops and there is no FRE or similar pass
after vectorization to clean it up, in particular optimize the
a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
store:
MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
_13 = a[8];
res_6 = _13 + 140;
_18 = a[9];
res_15 = res_6 + _18;
a ={v} {CLOBBER};
return res_15;
Shall we xfail it, or is there a plan to enable FRE after vectorization,
or similar pass that would be able to do similar memory optimizations?
Note, the RTL passes are able to optimize it in the end in this testcase.
Jakub
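To make the missed optimization concrete, here is a reduced,
hypothetical C analogue of the GIMPLE above; the v2si typedef and foo
are illustrative, not taken from pr84512.c, and the may_alias and
aligned (4) attributes are added so the store is well-defined.  A
value-numbering pass running after vectorization should fold the two
scalar loads against the constant vector store and reduce the function
to return 285, i.e. 64 + 140 + 81:

typedef int v2si __attribute__ ((vector_size (__SIZEOF_INT__ * 2),
				 may_alias, aligned (4)));
int a[10];

int
foo (void)
{
  *(v2si *) &a[8] = (v2si) { 64, 81 };	/* The constant vector store.  */
  return a[8] + 140 + a[9];		/* Loads that should fold away.  */
}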
On Thu, Jun 27, 2019 at 8:05 AM Jakub Jelinek <jakub@redhat.com> wrote:
>
> On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
> > Yes, the patch works OK. I'll regression test it and push it later today.
>
> I think it caused
> +FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> which admittedly already is xfailed on various targets.
> We now newly vectorize those loops and there is no FRE or similar pass
> after vectorization to clean it up, in particular optimize the
> a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
> store:
> MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
> _13 = a[8];
> res_6 = _13 + 140;
> _18 = a[9];
> res_15 = res_6 + _18;
> a ={v} {CLOBBER};
> return res_15;

Yes, I have seen pr84512.c, but the failure is benign. It is caused by
the fact that we now vectorize the loops of the test.

> Shall we xfail it, or is there a plan to enable FRE after vectorization,
> or similar pass that would be able to do similar memory optimizations?
> Note, the RTL passes are able to optimize it in the end in this testcase.

The testcase failure could be solved by -fno-tree-vectorize, but I
think that the value should be propagated through vectors, and tree
optimizers should optimize the vectorized function in the same way as
scalar function.

Uros.
On Thu, 27 Jun 2019, Uros Bizjak wrote:

> On Thu, Jun 27, 2019 at 8:05 AM Jakub Jelinek <jakub@redhat.com> wrote:
> >
> > On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
> > > Yes, the patch works OK. I'll regression test it and push it later today.
> >
> > I think it caused
> > +FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> > which admittedly already is xfailed on various targets.
> > We now newly vectorize those loops and there is no FRE or similar pass
> > after vectorization to clean it up, in particular optimize the
> > a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
> > store:
> > MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
> > _13 = a[8];
> > res_6 = _13 + 140;
> > _18 = a[9];
> > res_15 = res_6 + _18;
> > a ={v} {CLOBBER};
> > return res_15;
>
> Yes, I have seen pr84512.c, but the failure is benign. It is caused by
> the fact that we now vectorize the loops of the test.
>
> > Shall we xfail it, or is there a plan to enable FRE after vectorization,
> > or similar pass that would be able to do similar memory optimizations?
> > Note, the RTL passes are able to optimize it in the end in this testcase.
>
> The testcase failure could be solved by -fno-tree-vectorize, but I
> think that the value should be propagated through vectors, and tree
> optimizers should optimize the vectorized function in the same way as
> scalar function.

FRE needs a simple fix (oops) to handle this case though.  Bootstrap /
regtest running on x86_64-unknown-linux-gnu.

And yes, I think ultimately we want a late FRE...

Richard.

2019-06-27  Richard Biener  <rguenther@suse.de>

	* tree-ssa-sccvn.c (vn_reference_lookup_3): Encode valueized RHS.

	* gcc.dg/tree-ssa/ssa-fre-67.c: New testcase.

Index: gcc/tree-ssa-sccvn.c
===================================================================
--- gcc/tree-ssa-sccvn.c	(revision 272732)
+++ gcc/tree-ssa-sccvn.c	(working copy)
@@ -2242,7 +2242,7 @@ vn_reference_lookup_3 (ao_ref *ref, tree
 	  tree rhs = gimple_assign_rhs1 (def_stmt);
 	  if (TREE_CODE (rhs) == SSA_NAME)
 	    rhs = SSA_VAL (rhs);
-	  len = native_encode_expr (gimple_assign_rhs1 (def_stmt),
+	  len = native_encode_expr (rhs,
				    buffer, sizeof (buffer),
				    (offseti - offset2) / BITS_PER_UNIT);
 	  if (len > 0 && len * BITS_PER_UNIT >= maxsizei)
Index: gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-67.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-67.c	(revision 272732)
+++ gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-67.c	(working copy)
@@ -1,16 +1,32 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fno-tree-ccp -fdump-tree-fre1-stats" } */
+/* { dg-options "-fgimple -O1 -fdump-tree-fre1" } */
 
-int foo()
+int a[10];
+typedef int v2si __attribute__((vector_size(__SIZEOF_INT__*2)));
+int __GIMPLE (ssa,guessed_local(97603132),startwith("fre1"))
+foo ()
 {
-  int i = 0;
-  do
-    {
-      i++;
-    }
-  while (i != 1);
-  return i;
+  int i;
+  int _59;
+  int _44;
+  int _13;
+  int _18;
+  v2si _80;
+  v2si _81;
+  int res;
+
+  __BB(2,guessed_local(97603132)):
+  _59 = 64;
+  i_61 = 9;
+  _44 = i_61 * i_61;
+  _80 = _Literal (v2si) {_59, _44};
+  _81 = _80;
+  __MEM <v2si> ((int *)&a + _Literal (int *) 32) = _81;
+  i_48 = 9;
+  _13 = a[8];
+  _18 = a[i_48];
+  res_15 = _13 + _18;
+  return res_15;
 }
 
-/* { dg-final { scan-tree-dump "RPO iteration over 3 blocks visited 3 blocks" "fre1" } } */
-/* { dg-final { scan-tree-dump "return 1;" "fre1" } } */
+/* { dg-final { scan-tree-dump "return 145;" "fre1" } } */
On 6/27/19 12:05 AM, Jakub Jelinek wrote:
> On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
>> Yes, the patch works OK. I'll regression test it and push it later today.
>
> I think it caused
> +FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> which admittedly already is xfailed on various targets.
> We now newly vectorize those loops and there is no FRE or similar pass
> after vectorization to clean it up, in particular optimize the
> a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
> store:
> MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
> _13 = a[8];
> res_6 = _13 + 140;
> _18 = a[9];
> res_15 = res_6 + _18;
> a ={v} {CLOBBER};
> return res_15;
>
> Shall we xfail it, or is there a plan to enable FRE after vectorization,
> or similar pass that would be able to do similar memory optimizations?
> Note, the RTL passes are able to optimize it in the end in this testcase.
I wonder if we could logically break up the vector store within DOM.  If
we did that we'd end up with a[8] and a[9] in DOM's expression hash
table.  That would allow us to replace the loads into _13 and _18 with
constants and the rest should just fall out.

Care to open a BZ?  If so, go ahead and assign it to me.

jeff
On Thu, Jun 27, 2019 at 09:24:58AM -0600, Jeff Law wrote:
> On 6/27/19 12:05 AM, Jakub Jelinek wrote:
> > On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
> >> Yes, the patch works OK. I'll regression test it and push it later today.
> >
> > I think it caused
> > +FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
> > which admittedly already is xfailed on various targets.
> > We now newly vectorize those loops and there is no FRE or similar pass
> > after vectorization to clean it up, in particular optimize the
> > a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
> > store:
> > MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
> > _13 = a[8];
> > res_6 = _13 + 140;
> > _18 = a[9];
> > res_15 = res_6 + _18;
> > a ={v} {CLOBBER};
> > return res_15;
> >
> > Shall we xfail it, or is there a plan to enable FRE after vectorization,
> > or similar pass that would be able to do similar memory optimizations?
> > Note, the RTL passes are able to optimize it in the end in this testcase.
> I wonder if we could logically break up the vector store within DOM.  If
> we did that we'd end up with a[8] and a[9] in DOM's expression hash
> table.  That would allow us to replace the loads into _13 and _18 with
> constants and the rest should just fall out.
>
> Care to open a BZ?  If so, go ahead and assign it to me.

I think Richi is working on adding fre3 now.

	Jakub
On 6/27/19 9:34 AM, Jakub Jelinek wrote:
> On Thu, Jun 27, 2019 at 09:24:58AM -0600, Jeff Law wrote:
>> On 6/27/19 12:05 AM, Jakub Jelinek wrote:
>>> On Wed, Jun 26, 2019 at 12:19:28PM +0200, Uros Bizjak wrote:
>>>> Yes, the patch works OK. I'll regression test it and push it later today.
>>>
>>> I think it caused
>>> +FAIL: gcc.dg/tree-ssa/pr84512.c scan-tree-dump optimized "return 285;"
>>> which admittedly already is xfailed on various targets.
>>> We now newly vectorize those loops and there is no FRE or similar pass
>>> after vectorization to clean it up, in particular optimize the
>>> a[8] and a[9] loads given the MEM <vector(2) int> [(int *)&a + 32B]
>>> store:
>>> MEM <vector(2) int> [(int *)&a + 32B] = { 64, 81 };
>>> _13 = a[8];
>>> res_6 = _13 + 140;
>>> _18 = a[9];
>>> res_15 = res_6 + _18;
>>> a ={v} {CLOBBER};
>>> return res_15;
>>>
>>> Shall we xfail it, or is there a plan to enable FRE after vectorization,
>>> or similar pass that would be able to do similar memory optimizations?
>>> Note, the RTL passes are able to optimize it in the end in this testcase.
>> I wonder if we could logically break up the vector store within DOM.  If
>> we did that we'd end up with a[8] and a[9] in DOM's expression hash
>> table.  That would allow us to replace the loads into _13 and _18 with
>> constants and the rest should just fall out.
>>
>> Care to open a BZ?  If so, go ahead and assign it to me.
>
> I think Richi is working on adding fre3 now.
Yea, I saw that later.  I think Richi's message indicated he wanted a
late fre pass, so even if DOM were to capture this, it may not eliminate
the desire for a late fre pass.

jeff
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1ca1712183dc..24bd0896f137 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -21401,6 +21401,9 @@ ix86_autovectorize_vector_sizes (vector_sizes *sizes, bool all)
       sizes->safe_push (16);
       sizes->safe_push (32);
     }
+
+  if (TARGET_MMX_WITH_SSE)
+    sizes->safe_push (8);
 }
 
 /* Implemenation of targetm.vectorize.get_mask_mode.  */
diff --git a/gcc/testsuite/lib/target-supports.exp b/gcc/testsuite/lib/target-supports.exp
index 1d4aaa2a87ec..285c32f8cebb 100644
--- a/gcc/testsuite/lib/target-supports.exp
+++ b/gcc/testsuite/lib/target-supports.exp
@@ -6603,9 +6603,14 @@ proc available_vector_sizes { } {
     } elseif { [istarget arm*-*-*]
	       && [check_effective_target_arm_neon_ok] } {
	lappend result 128 64
-    } elseif { (([istarget i?86-*-*] || [istarget x86_64-*-*])
-	       && ([check_avx_available] && ![check_prefer_avx128])) } {
-	lappend result 256 128
+    } elseif { [istarget i?86-*-*] || [istarget x86_64-*-*] } {
+	if { [check_avx_available] && ![check_prefer_avx128] } {
+	    lappend result 256
+	}
+	lappend result 128
+	if { ![is-effective-target ia32] } {
+	    lappend result 64
+	}
     } elseif { [istarget sparc*-*-*] } {
	lappend result 64
     } else {