diff mbox series

[1/2] Reduce lane utilization in VEC_PERM_EXPRs for two_operator nodes

Message ID 20241015145650.298139-1-christoph.muellner@vrull.eu
State New
Headers show
Series [1/2] Reduce lane utilization in VEC_PERM_EXPRs for two_operator nodes | expand

Commit Message

Christoph Müllner Oct. 15, 2024, 2:56 p.m. UTC
When two_operator SLP nodes are built, the VEC_PERM_EXPR that merges the result
selects a lane only based on the operator found. If the input nodes have
duplicate elements, there may be more than one way to choose. This commit
changes the policy to reuse an existing lane with the result that we can
possibly free up lanes entirely.

For example, given two vectors with duplicates:
  A = {a1, a1, a2, a2}
  B = {b1, b1, b3, b2}

A two_operator node with operators +, -, +, - is currently built as:
  RES = VEC_PERM_EXPR<A, B>(0, 5, 2, 7)
With this patch, the permutation becomes:
  RES = VEC_PERM_EXPR<A, B>(0, 4, 2, 6)
Lanes 0 and 2 are reused and lanes 1 and 3 are not utilized anymore.

The direct effect of this change can be seen in the AArch64 test case,
where the simpler permutation allows to lower to a TRN1 instead of an
expensive TBL.

Bootstrapped and reg-tested on AArch64 (C, C++, Fortran).

Manolis Tsamis was the patch's initial author before I took it over.

gcc/ChangeLog:

	* tree-vect-slp.cc: Reduce lane utilization in VEC_PERM_EXPRs for two_operators

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/slp-perm-13.c: New test.
	* gcc.target/aarch64/sve/slp-perm-13.c: New test.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
 gcc/testsuite/gcc.dg/vect/slp-perm-13.c       | 29 +++++++++++++++++++
 .../gcc.target/aarch64/sve/slp-perm-13.c      |  4 +++
 gcc/tree-vect-slp.cc                          | 21 +++++++++++++-
 3 files changed, 53 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-perm-13.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c

Comments

Tamar Christina Oct. 17, 2024, 2:49 p.m. UTC | #1
Hi Christoph,

> -----Original Message-----
> From: Christoph Müllner <christoph.muellner@vrull.eu>
> Sent: Tuesday, October 15, 2024 3:57 PM
> To: gcc-patches@gcc.gnu.org; Philipp Tomsich <philipp.tomsich@vrull.eu>; Tamar
> Christina <Tamar.Christina@arm.com>; Richard Biener <rguenther@suse.de>
> Cc: Jeff Law <jeffreyalaw@gmail.com>; Robin Dapp <rdapp@ventanamicro.com>;
> Christoph Müllner <christoph.muellner@vrull.eu>
> Subject: [PATCH 1/2] Reduce lane utilization in VEC_PERM_EXPRs for
> two_operator nodes
> 
> When two_operator SLP nodes are built, the VEC_PERM_EXPR that merges the
> result
> selects a lane only based on the operator found. If the input nodes have
> duplicate elements, there may be more than one way to choose. This commit
> changes the policy to reuse an existing lane with the result that we can
> possibly free up lanes entirely.
> 
> For example, given two vectors with duplicates:
>   A = {a1, a1, a2, a2}
>   B = {b1, b1, b3, b2}
> 
> A two_operator node with operators +, -, +, - is currently built as:
>   RES = VEC_PERM_EXPR<A, B>(0, 5, 2, 7)
> With this patch, the permutation becomes:
>   RES = VEC_PERM_EXPR<A, B>(0, 4, 2, 6)
> Lanes 0 and 2 are reused and lanes 1 and 3 are not utilized anymore.
> 
> The direct effect of this change can be seen in the AArch64 test case,
> where the simpler permutation allows to lower to a TRN1 instead of an
> expensive TBL.

That makes sense,  So this is trying to make the permutes EVEN/EVEN or
ODD/ODD if possible.  Just wondering if....

> 
> Bootstrapped and reg-tested on AArch64 (C, C++, Fortran).
> 
> Manolis Tsamis was the patch's initial author before I took it over.
> 
> gcc/ChangeLog:
> 
> 	* tree-vect-slp.cc: Reduce lane utilization in VEC_PERM_EXPRs for
> two_operators
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.dg/vect/slp-perm-13.c: New test.
> 	* gcc.target/aarch64/sve/slp-perm-13.c: New test.
> 
> Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
> ---
>  gcc/testsuite/gcc.dg/vect/slp-perm-13.c       | 29 +++++++++++++++++++
>  .../gcc.target/aarch64/sve/slp-perm-13.c      |  4 +++
>  gcc/tree-vect-slp.cc                          | 21 +++++++++++++-
>  3 files changed, 53 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-perm-13.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c
> 
> diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-13.c
> b/gcc/testsuite/gcc.dg/vect/slp-perm-13.c
> new file mode 100644
> index 00000000000..08639e72fb0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-13.c
> @@ -0,0 +1,29 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -fdump-tree-slp2-details" } */
> +
> +#define LOAD_VEC(e0, e1, e2, e3, p) \
> +    int e0 = p[0]; \
> +    int e1 = p[1]; \
> +    int e2 = p[2]; \
> +    int e3 = p[3];
> +
> +#define STORE_VEC(p, e0, e1, e2, e3) \
> +    p[0] = e0; \
> +    p[1] = e1; \
> +    p[2] = e2; \
> +    p[3] = e3;
> +
> +void
> +foo (int *p)
> +{
> +  LOAD_VEC(s0, s1, s2, s3, p);
> +
> +  int t0 = s0 + s1;
> +  int t1 = s0 - s1;
> +  int t2 = s2 + s3;
> +  int t3 = s2 - s3;
> +
> +  STORE_VEC(p, t0, t1, t2, t3);
> +}
> +
> +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 0, 4, 2, 6 }" "slp2" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c
> b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c
> new file mode 100644
> index 00000000000..f5839f273e5
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c
> @@ -0,0 +1,4 @@
> +#include "../../../gcc.dg/vect/slp-perm-13.c"
> +
> +/* { dg-final { scan-assembler-not {\ttbl\t} } } */
> +
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 16332e0b6d7..8794c94ef90 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -2921,7 +2921,26 @@ fail:
>  	  gassign *ostmt = as_a <gassign *> (ostmt_info->stmt);
>  	  if (gimple_assign_rhs_code (ostmt) != code0)
>  	    {
> -	      SLP_TREE_LANE_PERMUTATION (node).safe_push (std::make_pair (1,
> i));
> +	      /* If the current element can be found in another lane that has
> +		 been used previously then use that one instead.  This can
> +		 happen when the ONE and TWO contain duplicate elements and
> +		 reduces the number of 'active' lanes.  */
> +	      int idx = i;
> +	      for (int alt_idx = (int) i - 1; alt_idx >= 0; alt_idx--)
> +		{
> +		  gassign *alt_stmt = as_a <gassign *> (stmts[alt_idx]->stmt);
> +		  if (gimple_assign_rhs_code (alt_stmt) == code0
> +		      && gimple_assign_rhs1 (ostmt)
> +			== gimple_assign_rhs1 (alt_stmt)
> +		      && gimple_assign_rhs2 (ostmt)
> +			== gimple_assign_rhs2 (alt_stmt))
> +		    {

.. we may want operand_equal_p (ostmt, alt_stmt, 0) etc here.

This looks good to me though but can't approve. 

Lets see what Richi thinks.

Cheers,
Tamar

> +		      idx = alt_idx;
> +		      break;
> +		    }
> +		}
> +	      SLP_TREE_LANE_PERMUTATION (node)
> +		.safe_push (std::make_pair (1, idx));
>  	      ocode = gimple_assign_rhs_code (ostmt);
>  	      j = i;
>  	    }
> --
> 2.46.0
Richard Biener Oct. 18, 2024, 9:47 a.m. UTC | #2
On Thu, 17 Oct 2024, Tamar Christina wrote:

> Hi Christoph,
> 
> > -----Original Message-----
> > From: Christoph Müllner <christoph.muellner@vrull.eu>
> > Sent: Tuesday, October 15, 2024 3:57 PM
> > To: gcc-patches@gcc.gnu.org; Philipp Tomsich <philipp.tomsich@vrull.eu>; Tamar
> > Christina <Tamar.Christina@arm.com>; Richard Biener <rguenther@suse.de>
> > Cc: Jeff Law <jeffreyalaw@gmail.com>; Robin Dapp <rdapp@ventanamicro.com>;
> > Christoph Müllner <christoph.muellner@vrull.eu>
> > Subject: [PATCH 1/2] Reduce lane utilization in VEC_PERM_EXPRs for
> > two_operator nodes
> > 
> > When two_operator SLP nodes are built, the VEC_PERM_EXPR that merges the
> > result
> > selects a lane only based on the operator found. If the input nodes have
> > duplicate elements, there may be more than one way to choose. This commit
> > changes the policy to reuse an existing lane with the result that we can
> > possibly free up lanes entirely.
> > 
> > For example, given two vectors with duplicates:
> >   A = {a1, a1, a2, a2}
> >   B = {b1, b1, b3, b2}
> > 
> > A two_operator node with operators +, -, +, - is currently built as:
> >   RES = VEC_PERM_EXPR<A, B>(0, 5, 2, 7)
> > With this patch, the permutation becomes:
> >   RES = VEC_PERM_EXPR<A, B>(0, 4, 2, 6)
> > Lanes 0 and 2 are reused and lanes 1 and 3 are not utilized anymore.
> > 
> > The direct effect of this change can be seen in the AArch64 test case,
> > where the simpler permutation allows to lower to a TRN1 instead of an
> > expensive TBL.
> 
> That makes sense,  So this is trying to make the permutes EVEN/EVEN or
> ODD/ODD if possible.  Just wondering if....

The original intent was to allow a blend (aka vec_merge in RTL).  For
RISC-V for example there's only single input general permutes but
two input blend.

Likewise this may end up with a permute not supported (on x86
early SSE2 ISAs lack quite some permutes but always can do blends
via bitwise operation).

I think the information that the lanes are duplicate is there with
and without this alternate canonicalization so I think it should
be part of whoever would prefer from more "unused lanes".  The
possibly pessimizing change here looks premature IMO and should
be defered to code generation where we can at least check whether
the resulting permute is supported (as said, the "blend" can be
emulated with AND[N] and OR).

Thanks,
Richard.

> > 
> > Bootstrapped and reg-tested on AArch64 (C, C++, Fortran).
> > 
> > Manolis Tsamis was the patch's initial author before I took it over.
> > 
> > gcc/ChangeLog:
> > 
> > 	* tree-vect-slp.cc: Reduce lane utilization in VEC_PERM_EXPRs for
> > two_operators
> > 
> > gcc/testsuite/ChangeLog:
> > 
> > 	* gcc.dg/vect/slp-perm-13.c: New test.
> > 	* gcc.target/aarch64/sve/slp-perm-13.c: New test.
> > 
> > Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
> > ---
> >  gcc/testsuite/gcc.dg/vect/slp-perm-13.c       | 29 +++++++++++++++++++
> >  .../gcc.target/aarch64/sve/slp-perm-13.c      |  4 +++
> >  gcc/tree-vect-slp.cc                          | 21 +++++++++++++-
> >  3 files changed, 53 insertions(+), 1 deletion(-)
> >  create mode 100644 gcc/testsuite/gcc.dg/vect/slp-perm-13.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c
> > 
> > diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-13.c
> > b/gcc/testsuite/gcc.dg/vect/slp-perm-13.c
> > new file mode 100644
> > index 00000000000..08639e72fb0
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.dg/vect/slp-perm-13.c
> > @@ -0,0 +1,29 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -fdump-tree-slp2-details" } */
> > +
> > +#define LOAD_VEC(e0, e1, e2, e3, p) \
> > +    int e0 = p[0]; \
> > +    int e1 = p[1]; \
> > +    int e2 = p[2]; \
> > +    int e3 = p[3];
> > +
> > +#define STORE_VEC(p, e0, e1, e2, e3) \
> > +    p[0] = e0; \
> > +    p[1] = e1; \
> > +    p[2] = e2; \
> > +    p[3] = e3;
> > +
> > +void
> > +foo (int *p)
> > +{
> > +  LOAD_VEC(s0, s1, s2, s3, p);
> > +
> > +  int t0 = s0 + s1;
> > +  int t1 = s0 - s1;
> > +  int t2 = s2 + s3;
> > +  int t3 = s2 - s3;
> > +
> > +  STORE_VEC(p, t0, t1, t2, t3);
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 0, 4, 2, 6 }" "slp2" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c
> > b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c
> > new file mode 100644
> > index 00000000000..f5839f273e5
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c
> > @@ -0,0 +1,4 @@
> > +#include "../../../gcc.dg/vect/slp-perm-13.c"
> > +
> > +/* { dg-final { scan-assembler-not {\ttbl\t} } } */
> > +
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index 16332e0b6d7..8794c94ef90 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -2921,7 +2921,26 @@ fail:
> >  	  gassign *ostmt = as_a <gassign *> (ostmt_info->stmt);
> >  	  if (gimple_assign_rhs_code (ostmt) != code0)
> >  	    {
> > -	      SLP_TREE_LANE_PERMUTATION (node).safe_push (std::make_pair (1,
> > i));
> > +	      /* If the current element can be found in another lane that has
> > +		 been used previously then use that one instead.  This can
> > +		 happen when the ONE and TWO contain duplicate elements and
> > +		 reduces the number of 'active' lanes.  */
> > +	      int idx = i;
> > +	      for (int alt_idx = (int) i - 1; alt_idx >= 0; alt_idx--)
> > +		{
> > +		  gassign *alt_stmt = as_a <gassign *> (stmts[alt_idx]->stmt);
> > +		  if (gimple_assign_rhs_code (alt_stmt) == code0
> > +		      && gimple_assign_rhs1 (ostmt)
> > +			== gimple_assign_rhs1 (alt_stmt)
> > +		      && gimple_assign_rhs2 (ostmt)
> > +			== gimple_assign_rhs2 (alt_stmt))
> > +		    {
> 
> .. we may want operand_equal_p (ostmt, alt_stmt, 0) etc here.
> 
> This looks good to me though but can't approve. 
> 
> Lets see what Richi thinks.
> 
> Cheers,
> Tamar
> 
> > +		      idx = alt_idx;
> > +		      break;
> > +		    }
> > +		}
> > +	      SLP_TREE_LANE_PERMUTATION (node)
> > +		.safe_push (std::make_pair (1, idx));
> >  	      ocode = gimple_assign_rhs_code (ostmt);
> >  	      j = i;
> >  	    }
> > --
> > 2.46.0
> 
>
diff mbox series

Patch

diff --git a/gcc/testsuite/gcc.dg/vect/slp-perm-13.c b/gcc/testsuite/gcc.dg/vect/slp-perm-13.c
new file mode 100644
index 00000000000..08639e72fb0
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-perm-13.c
@@ -0,0 +1,29 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-slp2-details" } */
+
+#define LOAD_VEC(e0, e1, e2, e3, p) \
+    int e0 = p[0]; \
+    int e1 = p[1]; \
+    int e2 = p[2]; \
+    int e3 = p[3];
+
+#define STORE_VEC(p, e0, e1, e2, e3) \
+    p[0] = e0; \
+    p[1] = e1; \
+    p[2] = e2; \
+    p[3] = e3;
+
+void
+foo (int *p)
+{
+  LOAD_VEC(s0, s1, s2, s3, p);
+
+  int t0 = s0 + s1;
+  int t1 = s0 - s1;
+  int t2 = s2 + s3;
+  int t3 = s2 - s3;
+
+  STORE_VEC(p, t0, t1, t2, t3);
+}
+
+/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 0, 4, 2, 6 }" "slp2" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c
new file mode 100644
index 00000000000..f5839f273e5
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/slp-perm-13.c
@@ -0,0 +1,4 @@ 
+#include "../../../gcc.dg/vect/slp-perm-13.c"
+
+/* { dg-final { scan-assembler-not {\ttbl\t} } } */
+
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 16332e0b6d7..8794c94ef90 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -2921,7 +2921,26 @@  fail:
 	  gassign *ostmt = as_a <gassign *> (ostmt_info->stmt);
 	  if (gimple_assign_rhs_code (ostmt) != code0)
 	    {
-	      SLP_TREE_LANE_PERMUTATION (node).safe_push (std::make_pair (1, i));
+	      /* If the current element can be found in another lane that has
+		 been used previously then use that one instead.  This can
+		 happen when the ONE and TWO contain duplicate elements and
+		 reduces the number of 'active' lanes.  */
+	      int idx = i;
+	      for (int alt_idx = (int) i - 1; alt_idx >= 0; alt_idx--)
+		{
+		  gassign *alt_stmt = as_a <gassign *> (stmts[alt_idx]->stmt);
+		  if (gimple_assign_rhs_code (alt_stmt) == code0
+		      && gimple_assign_rhs1 (ostmt)
+			== gimple_assign_rhs1 (alt_stmt)
+		      && gimple_assign_rhs2 (ostmt)
+			== gimple_assign_rhs2 (alt_stmt))
+		    {
+		      idx = alt_idx;
+		      break;
+		    }
+		}
+	      SLP_TREE_LANE_PERMUTATION (node)
+		.safe_push (std::make_pair (1, idx));
 	      ocode = gimple_assign_rhs_code (ostmt);
 	      j = i;
 	    }