vect: Fix VLA SLP invariant optimisation [PR98535]

Message ID	mpta6t326y8.fsf@arm.com
State	New
Headers	show Return-Path: <gcc-patches-bounces@gcc.gnu.org> DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org AD0FB3858026 To: gcc-patches@gcc.gnu.org Mail-Followup-To: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com Subject: [PATCH] vect: Fix VLA SLP invariant optimisation [PR98535] Date: Wed, 20 Jan 2021 11:17:35 +0000 Message-ID: <mpta6t326y8.fsf@arm.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Precedence: list From: Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: Richard Sandiford <richard.sandiford@arm.com> Errors-To: gcc-patches-bounces@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces@gcc.gnu.org>
Series	vect: Fix VLA SLP invariant optimisation [PR98535] \| expand vect: Fix VLA SLP invariant optimisation [PR98535]

Message ID

mpta6t326y8.fsf@arm.com

State

New

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org AD0FB3858026
To: gcc-patches@gcc.gnu.org
Mail-Followup-To: gcc-patches@gcc.gnu.org, richard.sandiford@arm.com
Subject: [PATCH] vect: Fix VLA SLP invariant optimisation [PR98535]
Date: Wed, 20 Jan 2021 11:17:35 +0000
Message-ID: <mpta6t326y8.fsf@arm.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Precedence: list
From: Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org>
Reply-To: Richard Sandiford <richard.sandiford@arm.com>
Errors-To: gcc-patches-bounces@gcc.gnu.org
Sender: "Gcc-patches" <gcc-patches-bounces@gcc.gnu.org>

Series

vect: Fix VLA SLP invariant optimisation [PR98535] | expand

Commit Message

Richard Sandiford Jan. 20, 2021, 11:17 a.m. UTC

duplicate_and_interleave is the main fallback way of loading
a repeating sequence of elements into variable-length vectors.
The code handles cases in which the number of elements in the
sequence is potentially several times greater than the number
of elements in a vector.

Let:

- NE be the (compile-time) number of elements in the sequence
- NR be the (compile-time) number of vector results and
- VE be the (run-time) number of elements in each vector

The basic approach is to duplicate each element into a
separate vector, giving NE vectors in total, then use
log2(NE) rows of NE permutes to generate NE results.

In the worst case — when VE has no known compile-time factor
and NR >= NE — all of these permutes are necessary.  However,
if VE is known to be a multiple of 2**F, then each of the
first F permute rows produce duplicate results; specifically,
the high permute for a given pair is the same as the low permute.
The code dealt with this by reusing the low result for the
high result.  This part was OK.

However, having duplicate results from one row meant that the
next row did duplicate work.  The redundancies would be optimised
away by later passes, but the code tried to avoid generating them
in the first place.  This is the part that went wrong.

Specifically, NR is typically less than NE when some permutes are
redundant, so the code tried to use NR to reduce the amount of work
performed.  The problem was that, although it correctly calculated
a conservative bound on how many results were needed in each row,
it chose the wrong results for anything other than the final row.

This doesn't usually matter for fully-packed SVE vectors.  We first
try to coalesce smaller elements into larger ones, so normally
VE ends up being 2**VQ (where VQ is the number of 128-bit blocks
in an SVE vector).  In that situation we'd only apply the faulty
optimisation to the final row, i.e. the case it handled correctly.
E.g. for things like:

  void
  f (long *x)
  {
    for (int i = 0; i < 100; i += 8)
      {
        x[i] += 1;
        x[i + 1] += 2;
        x[i + 2] += 3;
        x[i + 3] += 4;
        x[i + 4] += 5;
        x[i + 5] += 6;
        x[i + 6] += 7;
        x[i + 7] += 8;
      }
  }

(already tested by the testsuite), we'd have 3 rows of permutes
producing 4 vector results.  The schemne produced:

1st row: 8 results from 4 permutes, highs duplicates of lows
2nd row: 8 results from 8 permutes (half of which are actually redundant)
3rd row: 4 results from 4 permutes

However, coalescing elements is trickier for unpacked vectors,
and at the moment we don't try to do it (see the GET_MODE_SIZE
check in can_duplicate_and_interleave_p).  Unpacked vectors
therefore stress the code in ways that packed vectors didn't.

The patch fixes this by removing the redundancies from each row,
rather than trying to work around them later.  This also removes
the redundant work in the second row of the example above.

Tested on aarch64-linux-gnu, aarch64_be-elf and (redundantly)
x86_64-linux-gnu.  OK for trunk and branches?

Thanks,
Richard


gcc/
	PR tree-optimization/98535
	* tree-vect-slp.c (duplicate_and_interleave): Use quick_grow_cleared.
	If the high and low permutes are the same, remove the high permutes
	from the working set and only continue with the low ones.
---
 .../gcc.target/aarch64/sve/pr98535.c          | 18 +++++++
 gcc/tree-vect-slp.c                           | 49 +++++++++++--------
 2 files changed, 46 insertions(+), 21 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/pr98535.c

Comments

Richard Biener Jan. 20, 2021, 11:44 a.m. UTC | #1

On January 20, 2021 12:17:35 PM GMT+01:00, Richard Sandiford via Gcc-patches <gcc-patches@gcc.gnu.org> wrote:
>duplicate_and_interleave is the main fallback way of loading
>a repeating sequence of elements into variable-length vectors.
>The code handles cases in which the number of elements in the
>sequence is potentially several times greater than the number
>of elements in a vector.
>
>Let:
>
>- NE be the (compile-time) number of elements in the sequence
>- NR be the (compile-time) number of vector results and
>- VE be the (run-time) number of elements in each vector
>
>The basic approach is to duplicate each element into a
>separate vector, giving NE vectors in total, then use
>log2(NE) rows of NE permutes to generate NE results.
>
>In the worst case — when VE has no known compile-time factor
>and NR >= NE — all of these permutes are necessary.  However,
>if VE is known to be a multiple of 2**F, then each of the
>first F permute rows produce duplicate results; specifically,
>the high permute for a given pair is the same as the low permute.
>The code dealt with this by reusing the low result for the
>high result.  This part was OK.
>
>However, having duplicate results from one row meant that the
>next row did duplicate work.  The redundancies would be optimised
>away by later passes, but the code tried to avoid generating them
>in the first place.  This is the part that went wrong.
>
>Specifically, NR is typically less than NE when some permutes are
>redundant, so the code tried to use NR to reduce the amount of work
>performed.  The problem was that, although it correctly calculated
>a conservative bound on how many results were needed in each row,
>it chose the wrong results for anything other than the final row.
>
>This doesn't usually matter for fully-packed SVE vectors.  We first
>try to coalesce smaller elements into larger ones, so normally
>VE ends up being 2**VQ (where VQ is the number of 128-bit blocks
>in an SVE vector).  In that situation we'd only apply the faulty
>optimisation to the final row, i.e. the case it handled correctly.
>E.g. for things like:
>
>  void
>  f (long *x)
>  {
>    for (int i = 0; i < 100; i += 8)
>      {
>        x[i] += 1;
>        x[i + 1] += 2;
>        x[i + 2] += 3;
>        x[i + 3] += 4;
>        x[i + 4] += 5;
>        x[i + 5] += 6;
>        x[i + 6] += 7;
>        x[i + 7] += 8;
>      }
>  }
>
>(already tested by the testsuite), we'd have 3 rows of permutes
>producing 4 vector results.  The schemne produced:
>
>1st row: 8 results from 4 permutes, highs duplicates of lows
>2nd row: 8 results from 8 permutes (half of which are actually
>redundant)
>3rd row: 4 results from 4 permutes
>
>However, coalescing elements is trickier for unpacked vectors,
>and at the moment we don't try to do it (see the GET_MODE_SIZE
>check in can_duplicate_and_interleave_p).  Unpacked vectors
>therefore stress the code in ways that packed vectors didn't.
>
>The patch fixes this by removing the redundancies from each row,
>rather than trying to work around them later.  This also removes
>the redundant work in the second row of the example above.
>
>Tested on aarch64-linux-gnu, aarch64_be-elf and (redundantly)
>x86_64-linux-gnu.  OK for trunk and branches?

Ok. 

Thanks, 
Richard. 

>Thanks,
>Richard
>
>
>gcc/
>	PR tree-optimization/98535
>	* tree-vect-slp.c (duplicate_and_interleave): Use quick_grow_cleared.
>	If the high and low permutes are the same, remove the high permutes
>	from the working set and only continue with the low ones.
>---
> .../gcc.target/aarch64/sve/pr98535.c          | 18 +++++++
> gcc/tree-vect-slp.c                           | 49 +++++++++++--------
> 2 files changed, 46 insertions(+), 21 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
>
>diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
>index 1787ad74268..4465cf7494e 100644
>--- a/gcc/tree-vect-slp.c
>+++ b/gcc/tree-vect-slp.c
>@@ -5063,7 +5063,7 @@ duplicate_and_interleave (vec_info *vinfo,
>gimple_seq *seq, tree vector_type,
> 
>   tree_vector_builder partial_elts;
>   auto_vec<tree, 32> pieces (nvectors * 2);
>-  pieces.quick_grow (nvectors * 2);
>+  pieces.quick_grow_cleared (nvectors * 2);
>   for (unsigned int i = 0; i < nvectors; ++i)
>     {
>/* (2) Replace ELTS[0:NELTS] with ELTS'[0:NELTS'], where each element
>of
>@@ -5082,53 +5082,60 @@ duplicate_and_interleave (vec_info *vinfo,
>gimple_seq *seq, tree vector_type,
>   /* (4) Use a tree of VEC_PERM_EXPRs to create a single VM with the
> 	 correct byte contents.
> 
>-     We need to repeat the following operation log2(nvectors) times:
>+     Conceptually, we need to repeat the following operation
>log2(nvectors)
>+     times, where hi_start = nvectors / 2:
> 
> 	out[i * 2] = VEC_PERM_EXPR (in[i], in[i + hi_start], lo_permute);
> 	out[i * 2 + 1] = VEC_PERM_EXPR (in[i], in[i + hi_start], hi_permute);
> 
>      However, if each input repeats every N elements and the VF is
>-     a multiple of N * 2, the HI result is the same as the LO.  */
>+     a multiple of N * 2, the HI result is the same as the LO result.
>+     This will be true for the first N1 iterations of the outer loop,
>+     followed by N2 iterations for which both the LO and HI results
>+     are needed.  I.e.:
>+
>+	N1 + N2 = log2(nvectors)
>+
>+     Each "N1 iteration" doubles the number of redundant vectors and
>the
>+     effect of the process as a whole is to have a sequence of
>nvectors/2**N1
>+     vectors that repeats 2**N1 times.  Rather than generate these
>redundant
>+     vectors, we halve the number of vectors for each N1 iteration. 
>*/
>   unsigned int in_start = 0;
>   unsigned int out_start = nvectors;
>-  unsigned int hi_start = nvectors / 2;
>-  /* A bound on the number of outputs needed to produce NRESULTS
>results
>-     in the final iteration.  */
>-  unsigned int noutputs_bound = nvectors * nresults;
>+  unsigned int new_nvectors = nvectors;
> for (unsigned int in_repeat = 1; in_repeat < nvectors; in_repeat *= 2)
>     {
>-      noutputs_bound /= 2;
>-      unsigned int limit = MIN (noutputs_bound, nvectors);
>-      for (unsigned int i = 0; i < limit; ++i)
>+      unsigned int hi_start = new_nvectors / 2;
>+      unsigned int out_i = 0;
>+      for (unsigned int in_i = 0; in_i < new_nvectors; ++in_i)
> 	{
>-	  if ((i & 1) != 0
>+	  if ((in_i & 1) != 0
> 	      && multiple_p (TYPE_VECTOR_SUBPARTS (new_vector_type),
> 			     2 * in_repeat))
>-	    {
>-	      pieces[out_start + i] = pieces[out_start + i - 1];
>-	      continue;
>-	    }
>+	    continue;
> 
> 	  tree output = make_ssa_name (new_vector_type);
>-	  tree input1 = pieces[in_start + (i / 2)];
>-	  tree input2 = pieces[in_start + (i / 2) + hi_start];
>+	  tree input1 = pieces[in_start + (in_i / 2)];
>+	  tree input2 = pieces[in_start + (in_i / 2) + hi_start];
> 	  gassign *stmt = gimple_build_assign (output, VEC_PERM_EXPR,
> 					       input1, input2,
>-					       permutes[i & 1]);
>+					       permutes[in_i & 1]);
> 	  gimple_seq_add_stmt (seq, stmt);
>-	  pieces[out_start + i] = output;
>+	  pieces[out_start + out_i] = output;
>+	  out_i += 1;
> 	}
>       std::swap (in_start, out_start);
>+      new_nvectors = out_i;
>     }
> 
>/* (5) Use VIEW_CONVERT_EXPR to cast the final VM to the required type.
> */
>   results.reserve (nresults);
>   for (unsigned int i = 0; i < nresults; ++i)
>-    if (i < nvectors)
>+    if (i < new_nvectors)
> results.quick_push (gimple_build (seq, VIEW_CONVERT_EXPR, vector_type,
> 					pieces[in_start + i]));
>     else
>-      results.quick_push (results[i - nvectors]);
>+      results.quick_push (results[i - new_nvectors]);
> }
> 
> 
>diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
>b/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
>new file mode 100644
>index 00000000000..6873a38734d
>--- /dev/null
>+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
>@@ -0,0 +1,18 @@
>+/* { dg-options "-O3 -mtune=neoverse-v1" } */
>+
>+typedef short a;
>+
>+typedef struct {
>+  a b, c, d, e;
>+} f;
>+
>+f *g;
>+
>+long h;
>+
>+void
>+i() {
>+  f j;
>+  for (; h; h++)
>+    *g++ = j;
>+}

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 1787ad74268..4465cf7494e 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -5063,7 +5063,7 @@  duplicate_and_interleave (vec_info *vinfo, gimple_seq *seq, tree vector_type,
 
   tree_vector_builder partial_elts;
   auto_vec<tree, 32> pieces (nvectors * 2);
-  pieces.quick_grow (nvectors * 2);
+  pieces.quick_grow_cleared (nvectors * 2);
   for (unsigned int i = 0; i < nvectors; ++i)
     {
       /* (2) Replace ELTS[0:NELTS] with ELTS'[0:NELTS'], where each element of
@@ -5082,53 +5082,60 @@  duplicate_and_interleave (vec_info *vinfo, gimple_seq *seq, tree vector_type,
   /* (4) Use a tree of VEC_PERM_EXPRs to create a single VM with the
 	 correct byte contents.
 
-     We need to repeat the following operation log2(nvectors) times:
+     Conceptually, we need to repeat the following operation log2(nvectors)
+     times, where hi_start = nvectors / 2:
 
 	out[i * 2] = VEC_PERM_EXPR (in[i], in[i + hi_start], lo_permute);
 	out[i * 2 + 1] = VEC_PERM_EXPR (in[i], in[i + hi_start], hi_permute);
 
      However, if each input repeats every N elements and the VF is
-     a multiple of N * 2, the HI result is the same as the LO.  */
+     a multiple of N * 2, the HI result is the same as the LO result.
+     This will be true for the first N1 iterations of the outer loop,
+     followed by N2 iterations for which both the LO and HI results
+     are needed.  I.e.:
+
+	N1 + N2 = log2(nvectors)
+
+     Each "N1 iteration" doubles the number of redundant vectors and the
+     effect of the process as a whole is to have a sequence of nvectors/2**N1
+     vectors that repeats 2**N1 times.  Rather than generate these redundant
+     vectors, we halve the number of vectors for each N1 iteration.  */
   unsigned int in_start = 0;
   unsigned int out_start = nvectors;
-  unsigned int hi_start = nvectors / 2;
-  /* A bound on the number of outputs needed to produce NRESULTS results
-     in the final iteration.  */
-  unsigned int noutputs_bound = nvectors * nresults;
+  unsigned int new_nvectors = nvectors;
   for (unsigned int in_repeat = 1; in_repeat < nvectors; in_repeat *= 2)
     {
-      noutputs_bound /= 2;
-      unsigned int limit = MIN (noutputs_bound, nvectors);
-      for (unsigned int i = 0; i < limit; ++i)
+      unsigned int hi_start = new_nvectors / 2;
+      unsigned int out_i = 0;
+      for (unsigned int in_i = 0; in_i < new_nvectors; ++in_i)
 	{
-	  if ((i & 1) != 0
+	  if ((in_i & 1) != 0
 	      && multiple_p (TYPE_VECTOR_SUBPARTS (new_vector_type),
 			     2 * in_repeat))
-	    {
-	      pieces[out_start + i] = pieces[out_start + i - 1];
-	      continue;
-	    }
+	    continue;
 
 	  tree output = make_ssa_name (new_vector_type);
-	  tree input1 = pieces[in_start + (i / 2)];
-	  tree input2 = pieces[in_start + (i / 2) + hi_start];
+	  tree input1 = pieces[in_start + (in_i / 2)];
+	  tree input2 = pieces[in_start + (in_i / 2) + hi_start];
 	  gassign *stmt = gimple_build_assign (output, VEC_PERM_EXPR,
 					       input1, input2,
-					       permutes[i & 1]);
+					       permutes[in_i & 1]);
 	  gimple_seq_add_stmt (seq, stmt);
-	  pieces[out_start + i] = output;
+	  pieces[out_start + out_i] = output;
+	  out_i += 1;
 	}
       std::swap (in_start, out_start);
+      new_nvectors = out_i;
     }
 
   /* (5) Use VIEW_CONVERT_EXPR to cast the final VM to the required type.  */
   results.reserve (nresults);
   for (unsigned int i = 0; i < nresults; ++i)
-    if (i < nvectors)
+    if (i < new_nvectors)
       results.quick_push (gimple_build (seq, VIEW_CONVERT_EXPR, vector_type,
 					pieces[in_start + i]));
     else
-      results.quick_push (results[i - nvectors]);
+      results.quick_push (results[i - new_nvectors]);
 }
 
 
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c b/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
new file mode 100644
index 00000000000..6873a38734d
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr98535.c
@@ -0,0 +1,18 @@ 
+/* { dg-options "-O3 -mtune=neoverse-v1" } */
+
+typedef short a;
+
+typedef struct {
+  a b, c, d, e;
+} f;
+
+f *g;
+
+long h;
+
+void
+i() {
+  f j;
+  for (; h; h++)
+    *g++ = j;
+}

vect: Fix VLA SLP invariant optimisation [PR98535]

Commit Message

Comments

Patch