Message ID | 20240604135317.3536415-1-manolis.tsamis@vrull.eu |
---|---|
State | New |
Headers | show |
Series | Rearrange SLP nodes with duplicate statements. [PR98138] | expand |
On Tue, 4 Jun 2024, Manolis Tsamis wrote: > This change adds a function that checks for SLP nodes with multiple occurrences > of the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the node > so that there are no duplicates. A vec_perm is then introduced to recreate the > original ordering. These duplicates can appear due to how two_operators nodes > are handled, and they prevent vectorization in some cases. So the trick is that when we have two operands we elide duplicate lanes so we can do discovery for a single combined operand instead which we then decompose into the required two again. That's a nice one. But as implemented this will fail SLP discovery if the combined operand fails discovery possibly because of divergence in downstream defs. That is, it doesn't fall back to separate discovery. I suspect the situation of duplicate lanes isn't common but then I would also suspect that divergence _is_ common. The discovery code is already quite complex with the way it possibly swaps operands of lanes, fitting in this as another variant to try (first) is likely going to be a bit awkward. A way out might be to split the function or to make the re-try in the caller which could indicate whether to apply this pattern trick or not. That said - can you try to get data on how often the trick applies and discovery succeeds and how often discovery fails but discovery would suceed without applying the pattern (say, on SPEC)? I also suppose instead of hardcoding three patterns for a fixed size it should be possible to see there's only (at most) half unique lanes in both operands (and one less in one operand if the number of lanes is odd) and compute the un-swizzling lane permutes during this discovery, removing the need of the explicit enum and open-coding each case? Another general note is that trying (and then undo on fail) such ticks eats at the discovery limit we have in place to avoid exponential run-off in exactly this degenerate cases. Thanks, Richard. > This targets the vectorization of the SPEC2017 x264 pixel_satd functions. > In some processors a larger than 10% improvement on x264 has been observed. > > See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138 > > gcc/ChangeLog: > > * tree-vect-slp.cc (enum slp_oprnd_pattern): new enum for rearrangement > patterns. > (try_rearrange_oprnd_info): Detect if a node corresponds to one of the > patterns. > > gcc/testsuite/ChangeLog: > > * gcc.target/aarch64/vect-slp-two-operator.c: New test. > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > --- > > .../aarch64/vect-slp-two-operator.c | 42 ++++ > gcc/tree-vect-slp.cc | 234 ++++++++++++++++++ > 2 files changed, 276 insertions(+) > create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > new file mode 100644 > index 00000000000..2db066a0b6e > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > @@ -0,0 +1,42 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect -fdump-tree-vect-details" } */ > + > +typedef unsigned char uint8_t; > +typedef unsigned int uint32_t; > + > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\ > + int t0 = s0 + s1;\ > + int t1 = s0 - s1;\ > + int t2 = s2 + s3;\ > + int t3 = s2 - s3;\ > + d0 = t0 + t2;\ > + d1 = t1 + t3;\ > + d2 = t0 - t2;\ > + d3 = t1 - t3;\ > +} > + > +static uint32_t abs2( uint32_t a ) > +{ > + uint32_t s = ((a>>15)&0x10001)*0xffff; > + return (a+s)^s; > +} > + > +void sink(uint32_t tmp[4][4]); > + > +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 ) > +{ > + uint32_t tmp[4][4]; > + int sum = 0; > + for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 ) > + { > + uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16); > + uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16); > + uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16); > + uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16); > + HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 ); > + } > + sink(tmp); > +} > + > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index bf1f467f53f..e395db0e185 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3. If not see > #include "tree-vectorizer.h" > #include "langhooks.h" > #include "gimple-walk.h" > +#include "gimple-pretty-print.h" > #include "dbgcnt.h" > #include "tree-vector-builder.h" > #include "vec-perm-indices.h" > @@ -1829,6 +1830,141 @@ vect_slp_build_two_operator_nodes (slp_tree perm, tree vectype, > SLP_TREE_CHILDREN (perm).quick_push (child2); > } > > +enum slp_oprnd_pattern > +{ > + SLP_OPRND_PATTERN_NONE, > + SLP_OPRND_PATTERN_ABAB, > + SLP_OPRND_PATTERN_AABB, > + SLP_OPRND_PATTERN_ABBA > +}; > + > +/* Check if OPRNDS_INFO has duplicated nodes that correspond to a predefined > + pattern described by SLP_OPRND_PATTERN and return it. */ > + > +static int > +try_rearrange_oprnd_info (vec<slp_oprnd_info> &oprnds_info, unsigned group_size) > +{ > + unsigned i; > + slp_oprnd_info info; > + > + if (oprnds_info.length () != 2 || group_size % 4 != 0) > + return SLP_OPRND_PATTERN_NONE; > + > + if (!oprnds_info[0]->def_stmts[0] > + || !is_a<gassign *> (oprnds_info[0]->def_stmts[0]->stmt)) > + return SLP_OPRND_PATTERN_NONE; > + > + enum tree_code code > + = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt); > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > + for (unsigned int j = 0; j < group_size; j += 1) > + { > + if (!info->def_stmts[j] > + || !is_a<gassign *> (info->def_stmts[j]->stmt) > + || STMT_VINFO_DATA_REF (info->def_stmts[j])) > + return SLP_OPRND_PATTERN_NONE; > + /* Don't mix different operations. */ > + if (gimple_assign_rhs_code (info->def_stmts[j]->stmt) != code) > + return SLP_OPRND_PATTERN_NONE; > + } > + > + if (gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt) > + != gimple_assign_rhs_code (oprnds_info[1]->def_stmts[0]->stmt)) > + return SLP_OPRND_PATTERN_NONE; > + > + int pattern = SLP_OPRND_PATTERN_NONE; > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > + for (unsigned int j = 0; j < group_size; j += 4) > + { > + int cur_pattern = SLP_OPRND_PATTERN_NONE; > + /* Check for an ABAB... pattern. */ > + if ((info->def_stmts[j] == info->def_stmts[j + 2]) > + && (info->def_stmts[j + 1] == info->def_stmts[j + 3]) > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > + cur_pattern = SLP_OPRND_PATTERN_ABAB; > + /* Check for an AABB... pattern. */ > + else if ((info->def_stmts[j] == info->def_stmts[j + 1]) > + && (info->def_stmts[j + 2] == info->def_stmts[j + 3]) > + && (info->def_stmts[j] != info->def_stmts[j + 2])) > + cur_pattern = SLP_OPRND_PATTERN_AABB; > + /* Check for an ABBA... pattern. */ > + else if ((info->def_stmts[j] == info->def_stmts[j + 3]) > + && (info->def_stmts[j + 1] == info->def_stmts[j + 2]) > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > + cur_pattern = SLP_OPRND_PATTERN_ABBA; > + /* Unrecognised pattern. */ > + else > + return SLP_OPRND_PATTERN_NONE; > + > + if (pattern == SLP_OPRND_PATTERN_NONE) > + pattern = cur_pattern; > + /* Multiple patterns detected. */ > + else if (cur_pattern != pattern) > + return SLP_OPRND_PATTERN_NONE; > + } > + > + gcc_checking_assert (pattern != SLP_OPRND_PATTERN_NONE); > + > + if (dump_enabled_p ()) > + { > + if (pattern == SLP_OPRND_PATTERN_ABAB) > + dump_printf (MSG_NOTE, "ABAB"); > + else if (pattern == SLP_OPRND_PATTERN_AABB) > + dump_printf (MSG_NOTE, "AABB"); > + else if (pattern == SLP_OPRND_PATTERN_ABBA) > + dump_printf (MSG_NOTE, "ABBA"); > + dump_printf (MSG_NOTE, " pattern detected.\n"); > + } > + > + if (pattern == SLP_OPRND_PATTERN_ABAB || pattern == SLP_OPRND_PATTERN_ABBA) > + for (unsigned int j = 0; j < group_size; j += 4) > + { > + /* Given oprnd[0] -> A1, B1, A1, B1, A2, B2, A2, B2, ... > + Given oprnd[1] -> C1, D1, C1, D1, C2, D2, C2, D2, ... > + Create a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... */ > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[1]->def_stmts[j]; > + oprnds_info[0]->ops[j+2] = oprnds_info[1]->ops[j]; > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+1]; > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+1]; > + } > + else if (pattern == SLP_OPRND_PATTERN_AABB) > + for (unsigned int j = 0; j < group_size; j += 4) > + { > + /* Given oprnd[0] -> A1, A1, B1, B1, A2, A2, B2, B2, ... > + Given oprnd[1] -> C1, C1, D1, D1, C2, C2, D2, D2, ... > + Create a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... */ > + > + /* The ordering here is at least to some extent arbitrary. > + A generilized version needs to use some explicit ordering. */ > + oprnds_info[0]->def_stmts[j+1] = oprnds_info[1]->def_stmts[j]; > + oprnds_info[0]->ops[j+1] = oprnds_info[1]->ops[j]; > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[0]->def_stmts[j+2]; > + oprnds_info[0]->ops[j+2] = oprnds_info[0]->ops[j+2]; > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+2]; > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+2]; > + } > + > + if (dump_enabled_p ()) > + { > + dump_printf (MSG_NOTE, "Recurse with:\n"); > + for (unsigned int j = 0; j < group_size; j++) > + { > + dump_printf (MSG_NOTE, " "); > + print_gimple_stmt (dump_file, oprnds_info[0]->def_stmts[j]->stmt, 0); > + } > + } > + > + /* Since we've merged the two nodes in one, make the second one a copy of > + the first. */ > + for (unsigned int j = 0; j < group_size; j++) > + { > + oprnds_info[1]->def_stmts[j] = oprnds_info[0]->def_stmts[j]; > + oprnds_info[1]->ops[j] = oprnds_info[0]->ops[j]; > + } > + > + return pattern; > +} > + > /* Recursively build an SLP tree starting from NODE. > Fail (and return a value not equal to zero) if def-stmts are not > isomorphic, require data permutation or are of unsupported types of > @@ -2409,6 +2545,10 @@ out: > > stmt_info = stmts[0]; > > + int rearrange_pattern = SLP_OPRND_PATTERN_NONE; > + if (is_a<gassign *> (stmt_info->stmt)) > + rearrange_pattern = try_rearrange_oprnd_info (oprnds_info, group_size); > + > /* Create SLP_TREE nodes for the definition node/s. */ > FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info) > { > @@ -2669,6 +2809,100 @@ fail: > *tree_size += this_tree_size + 1; > *max_nunits = this_max_nunits; > > + /* If we applied any rearrangmenets then we need to reconstruct the original > + elements with an additional permutation layer. */ > + if (rearrange_pattern != SLP_OPRND_PATTERN_NONE) > + { > + slp_tree one = new _slp_tree; > + slp_tree two = new _slp_tree; > + SLP_TREE_DEF_TYPE (one) = vect_internal_def; > + SLP_TREE_DEF_TYPE (two) = vect_internal_def; > + SLP_TREE_VECTYPE (one) = vectype; > + SLP_TREE_VECTYPE (two) = vectype; > + SLP_TREE_CHILDREN (one).safe_splice (children); > + SLP_TREE_CHILDREN (two).safe_splice (children); > + > + SLP_TREE_CODE (one) = VEC_PERM_EXPR; > + SLP_TREE_CODE (two) = VEC_PERM_EXPR; > + SLP_TREE_REPRESENTATIVE (one) = stmts[0]; > + SLP_TREE_REPRESENTATIVE (two) = stmts[2]; > + lane_permutation_t &perm_one = SLP_TREE_LANE_PERMUTATION (one); > + lane_permutation_t &perm_two = SLP_TREE_LANE_PERMUTATION (two); > + > + if (rearrange_pattern == SLP_OPRND_PATTERN_ABAB) > + { > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > + Create node "one" -> A1, B1, A1, B1, A2, B2, A2, B2, ... > + Create node "two" -> C1, D1, C1, D1, C2, D2, C2, D2, ... */ > + > + for (unsigned int j = 0; j < group_size; j += 4) > + { > + perm_one.safe_push (std::make_pair (0, j + 0)); > + perm_one.safe_push (std::make_pair (0, j + 1)); > + perm_one.safe_push (std::make_pair (0, j + 0)); > + perm_one.safe_push (std::make_pair (0, j + 1)); > + > + perm_two.safe_push (std::make_pair (0, j + 2)); > + perm_two.safe_push (std::make_pair (0, j + 3)); > + perm_two.safe_push (std::make_pair (0, j + 2)); > + perm_two.safe_push (std::make_pair (0, j + 3)); > + } > + } > + else if (rearrange_pattern == SLP_OPRND_PATTERN_AABB) > + { > + /* Given a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... > + Create node "one" -> A1, A1, B1, B1, A2, A2, B2, B2, ... > + Create node "two" -> C1, C1, D1, D1, C2, C2, D2, D2, ... */ > + > + for (unsigned int j = 0; j < group_size; j += 4) > + { > + perm_one.safe_push (std::make_pair (0, j + 0)); > + perm_one.safe_push (std::make_pair (0, j + 0)); > + perm_one.safe_push (std::make_pair (0, j + 2)); > + perm_one.safe_push (std::make_pair (0, j + 2)); > + > + perm_two.safe_push (std::make_pair (0, j + 1)); > + perm_two.safe_push (std::make_pair (0, j + 1)); > + perm_two.safe_push (std::make_pair (0, j + 3)); > + perm_two.safe_push (std::make_pair (0, j + 3)); > + } > + } > + else if (rearrange_pattern == SLP_OPRND_PATTERN_ABBA) > + { > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > + Create node "one" -> A1, B1, B1, A1, A2, B2, B2, A2, ... > + Create node "two" -> C1, D1, D1, C1, C2, D2, D2, C2, ... */ > + > + for (unsigned int j = 0; j < group_size; j += 4) > + { > + perm_one.safe_push (std::make_pair (0, j + 0)); > + perm_one.safe_push (std::make_pair (0, j + 1)); > + perm_one.safe_push (std::make_pair (0, j + 1)); > + perm_one.safe_push (std::make_pair (0, j + 0)); > + > + perm_two.safe_push (std::make_pair (0, j + 2)); > + perm_two.safe_push (std::make_pair (0, j + 3)); > + perm_two.safe_push (std::make_pair (0, j + 3)); > + perm_two.safe_push (std::make_pair (0, j + 2)); > + } > + } > + > + slp_tree child; > + FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (two), i, child) > + SLP_TREE_REF_COUNT (child)++; > + > + node = vect_create_new_slp_node (node, stmts, 2); > + SLP_TREE_VECTYPE (node) = vectype; > + SLP_TREE_CHILDREN (node).quick_push (one); > + SLP_TREE_CHILDREN (node).quick_push (two); > + > + SLP_TREE_LANES (one) = stmts.length (); > + SLP_TREE_LANES (two) = stmts.length (); > + > + children.truncate (0); > + children.safe_splice (SLP_TREE_CHILDREN (node)); > + } > + > if (two_operators) > { > /* ??? We'd likely want to either cache in bst_map sth like >
> -----Original Message----- > From: Richard Biener <rguenther@suse.de> > Sent: Wednesday, June 5, 2024 9:07 AM > To: Manolis Tsamis <manolis.tsamis@vrull.eu> > Cc: gcc-patches@gcc.gnu.org; Christoph Müllner <christoph.muellner@vrull.eu>; > Kewen . Lin <linkw@linux.ibm.com>; Philipp Tomsich <philipp.tomsich@vrull.eu>; > Tamar Christina <Tamar.Christina@arm.com>; Jiangning Liu > <jiangning.liu@amperecomputing.com> > Subject: Re: [PATCH] Rearrange SLP nodes with duplicate statements. [PR98138] > > On Tue, 4 Jun 2024, Manolis Tsamis wrote: > > > This change adds a function that checks for SLP nodes with multiple occurrences > > of the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the node > > so that there are no duplicates. A vec_perm is then introduced to recreate the > > original ordering. These duplicates can appear due to how two_operators nodes > > are handled, and they prevent vectorization in some cases. > > So the trick is that when we have two operands we elide duplicate lanes > so we can do discovery for a single combined operand instead which we > then decompose into the required two again. That's a nice one. > > But as implemented this will fail SLP discovery if the combined operand > fails discovery possibly because of divergence in downstream defs. That > is, it doesn't fall back to separate discovery. I suspect the situation > of duplicate lanes isn't common but then I would also suspect that > divergence _is_ common. I think we should also look at the cases where vectorization itself also failed because the generated tree ends up with an unsupported load. i.e. in this particular case we would have failed SLP at a later step. > > The discovery code is already quite complex with the way it possibly > swaps operands of lanes, fitting in this as another variant to try (first) > is likely going to be a bit awkward. A way out might be to split the > function or to make the re-try in the caller which could indicate whether > to apply this pattern trick or not. That said - can you try to get > data on how often the trick applies and discovery succeeds and how > often discovery fails but discovery would suceed without applying the > pattern (say, on SPEC)? > > I also suppose instead of hardcoding three patterns for a fixed > size it should be possible to see there's > only (at most) half unique lanes in both operands (and one less in one > operand if the number of lanes is odd) and compute the un-swizzling lane > permutes during this discovery, removing the need of the explicit enum > and open-coding each case? > > Another general note is that trying (and then undo on fail) such ticks > eats at the discovery limit we have in place to avoid exponential run-off > in exactly this degenerate cases. I suppose this is typically a case where changing to merging multiple single lane SLPs instead of creating the multiline graph in one go would make things easier? Isn't SLP discovery computationally expensive since it has to create the full graph in one go, whereas with merging you just rotate some subgraphs or eventually just keep the single lane separate? Cheers, Tamar > > Thanks, > Richard. > > > This targets the vectorization of the SPEC2017 x264 pixel_satd functions. > > In some processors a larger than 10% improvement on x264 has been observed. > > > > See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138 > > > > gcc/ChangeLog: > > > > * tree-vect-slp.cc (enum slp_oprnd_pattern): new enum for > rearrangement > > patterns. > > (try_rearrange_oprnd_info): Detect if a node corresponds to one of the > > patterns. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/aarch64/vect-slp-two-operator.c: New test. > > > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > > --- > > > > .../aarch64/vect-slp-two-operator.c | 42 ++++ > > gcc/tree-vect-slp.cc | 234 ++++++++++++++++++ > > 2 files changed, 276 insertions(+) > > create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > new file mode 100644 > > index 00000000000..2db066a0b6e > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > @@ -0,0 +1,42 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect -fdump-tree-vect- > details" } */ > > + > > +typedef unsigned char uint8_t; > > +typedef unsigned int uint32_t; > > + > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\ > > + int t0 = s0 + s1;\ > > + int t1 = s0 - s1;\ > > + int t2 = s2 + s3;\ > > + int t3 = s2 - s3;\ > > + d0 = t0 + t2;\ > > + d1 = t1 + t3;\ > > + d2 = t0 - t2;\ > > + d3 = t1 - t3;\ > > +} > > + > > +static uint32_t abs2( uint32_t a ) > > +{ > > + uint32_t s = ((a>>15)&0x10001)*0xffff; > > + return (a+s)^s; > > +} > > + > > +void sink(uint32_t tmp[4][4]); > > + > > +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 ) > > +{ > > + uint32_t tmp[4][4]; > > + int sum = 0; > > + for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 ) > > + { > > + uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16); > > + uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16); > > + uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16); > > + uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16); > > + HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 ); > > + } > > + sink(tmp); > > +} > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > index bf1f467f53f..e395db0e185 100644 > > --- a/gcc/tree-vect-slp.cc > > +++ b/gcc/tree-vect-slp.cc > > @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3. If not see > > #include "tree-vectorizer.h" > > #include "langhooks.h" > > #include "gimple-walk.h" > > +#include "gimple-pretty-print.h" > > #include "dbgcnt.h" > > #include "tree-vector-builder.h" > > #include "vec-perm-indices.h" > > @@ -1829,6 +1830,141 @@ vect_slp_build_two_operator_nodes (slp_tree > perm, tree vectype, > > SLP_TREE_CHILDREN (perm).quick_push (child2); > > } > > > > +enum slp_oprnd_pattern > > +{ > > + SLP_OPRND_PATTERN_NONE, > > + SLP_OPRND_PATTERN_ABAB, > > + SLP_OPRND_PATTERN_AABB, > > + SLP_OPRND_PATTERN_ABBA > > +}; > > + > > +/* Check if OPRNDS_INFO has duplicated nodes that correspond to a > predefined > > + pattern described by SLP_OPRND_PATTERN and return it. */ > > + > > +static int > > +try_rearrange_oprnd_info (vec<slp_oprnd_info> &oprnds_info, unsigned > group_size) > > +{ > > + unsigned i; > > + slp_oprnd_info info; > > + > > + if (oprnds_info.length () != 2 || group_size % 4 != 0) > > + return SLP_OPRND_PATTERN_NONE; > > + > > + if (!oprnds_info[0]->def_stmts[0] > > + || !is_a<gassign *> (oprnds_info[0]->def_stmts[0]->stmt)) > > + return SLP_OPRND_PATTERN_NONE; > > + > > + enum tree_code code > > + = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt); > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > + for (unsigned int j = 0; j < group_size; j += 1) > > + { > > + if (!info->def_stmts[j] > > + || !is_a<gassign *> (info->def_stmts[j]->stmt) > > + || STMT_VINFO_DATA_REF (info->def_stmts[j])) > > + return SLP_OPRND_PATTERN_NONE; > > + /* Don't mix different operations. */ > > + if (gimple_assign_rhs_code (info->def_stmts[j]->stmt) != code) > > + return SLP_OPRND_PATTERN_NONE; > > + } > > + > > + if (gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt) > > + != gimple_assign_rhs_code (oprnds_info[1]->def_stmts[0]->stmt)) > > + return SLP_OPRND_PATTERN_NONE; > > + > > + int pattern = SLP_OPRND_PATTERN_NONE; > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + int cur_pattern = SLP_OPRND_PATTERN_NONE; > > + /* Check for an ABAB... pattern. */ > > + if ((info->def_stmts[j] == info->def_stmts[j + 2]) > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 3]) > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > + cur_pattern = SLP_OPRND_PATTERN_ABAB; > > + /* Check for an AABB... pattern. */ > > + else if ((info->def_stmts[j] == info->def_stmts[j + 1]) > > + && (info->def_stmts[j + 2] == info->def_stmts[j + 3]) > > + && (info->def_stmts[j] != info->def_stmts[j + 2])) > > + cur_pattern = SLP_OPRND_PATTERN_AABB; > > + /* Check for an ABBA... pattern. */ > > + else if ((info->def_stmts[j] == info->def_stmts[j + 3]) > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 2]) > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > + cur_pattern = SLP_OPRND_PATTERN_ABBA; > > + /* Unrecognised pattern. */ > > + else > > + return SLP_OPRND_PATTERN_NONE; > > + > > + if (pattern == SLP_OPRND_PATTERN_NONE) > > + pattern = cur_pattern; > > + /* Multiple patterns detected. */ > > + else if (cur_pattern != pattern) > > + return SLP_OPRND_PATTERN_NONE; > > + } > > + > > + gcc_checking_assert (pattern != SLP_OPRND_PATTERN_NONE); > > + > > + if (dump_enabled_p ()) > > + { > > + if (pattern == SLP_OPRND_PATTERN_ABAB) > > + dump_printf (MSG_NOTE, "ABAB"); > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > + dump_printf (MSG_NOTE, "AABB"); > > + else if (pattern == SLP_OPRND_PATTERN_ABBA) > > + dump_printf (MSG_NOTE, "ABBA"); > > + dump_printf (MSG_NOTE, " pattern detected.\n"); > > + } > > + > > + if (pattern == SLP_OPRND_PATTERN_ABAB || pattern == > SLP_OPRND_PATTERN_ABBA) > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + /* Given oprnd[0] -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > + Given oprnd[1] -> C1, D1, C1, D1, C2, D2, C2, D2, ... > > + Create a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... */ > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[1]->def_stmts[j]; > > + oprnds_info[0]->ops[j+2] = oprnds_info[1]->ops[j]; > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+1]; > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+1]; > > + } > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + /* Given oprnd[0] -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > + Given oprnd[1] -> C1, C1, D1, D1, C2, C2, D2, D2, ... > > + Create a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... */ > > + > > + /* The ordering here is at least to some extent arbitrary. > > + A generilized version needs to use some explicit ordering. */ > > + oprnds_info[0]->def_stmts[j+1] = oprnds_info[1]->def_stmts[j]; > > + oprnds_info[0]->ops[j+1] = oprnds_info[1]->ops[j]; > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[0]->def_stmts[j+2]; > > + oprnds_info[0]->ops[j+2] = oprnds_info[0]->ops[j+2]; > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+2]; > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+2]; > > + } > > + > > + if (dump_enabled_p ()) > > + { > > + dump_printf (MSG_NOTE, "Recurse with:\n"); > > + for (unsigned int j = 0; j < group_size; j++) > > + { > > + dump_printf (MSG_NOTE, " "); > > + print_gimple_stmt (dump_file, oprnds_info[0]->def_stmts[j]->stmt, 0); > > + } > > + } > > + > > + /* Since we've merged the two nodes in one, make the second one a copy of > > + the first. */ > > + for (unsigned int j = 0; j < group_size; j++) > > + { > > + oprnds_info[1]->def_stmts[j] = oprnds_info[0]->def_stmts[j]; > > + oprnds_info[1]->ops[j] = oprnds_info[0]->ops[j]; > > + } > > + > > + return pattern; > > +} > > + > > /* Recursively build an SLP tree starting from NODE. > > Fail (and return a value not equal to zero) if def-stmts are not > > isomorphic, require data permutation or are of unsupported types of > > @@ -2409,6 +2545,10 @@ out: > > > > stmt_info = stmts[0]; > > > > + int rearrange_pattern = SLP_OPRND_PATTERN_NONE; > > + if (is_a<gassign *> (stmt_info->stmt)) > > + rearrange_pattern = try_rearrange_oprnd_info (oprnds_info, group_size); > > + > > /* Create SLP_TREE nodes for the definition node/s. */ > > FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info) > > { > > @@ -2669,6 +2809,100 @@ fail: > > *tree_size += this_tree_size + 1; > > *max_nunits = this_max_nunits; > > > > + /* If we applied any rearrangmenets then we need to reconstruct the original > > + elements with an additional permutation layer. */ > > + if (rearrange_pattern != SLP_OPRND_PATTERN_NONE) > > + { > > + slp_tree one = new _slp_tree; > > + slp_tree two = new _slp_tree; > > + SLP_TREE_DEF_TYPE (one) = vect_internal_def; > > + SLP_TREE_DEF_TYPE (two) = vect_internal_def; > > + SLP_TREE_VECTYPE (one) = vectype; > > + SLP_TREE_VECTYPE (two) = vectype; > > + SLP_TREE_CHILDREN (one).safe_splice (children); > > + SLP_TREE_CHILDREN (two).safe_splice (children); > > + > > + SLP_TREE_CODE (one) = VEC_PERM_EXPR; > > + SLP_TREE_CODE (two) = VEC_PERM_EXPR; > > + SLP_TREE_REPRESENTATIVE (one) = stmts[0]; > > + SLP_TREE_REPRESENTATIVE (two) = stmts[2]; > > + lane_permutation_t &perm_one = SLP_TREE_LANE_PERMUTATION (one); > > + lane_permutation_t &perm_two = SLP_TREE_LANE_PERMUTATION (two); > > + > > + if (rearrange_pattern == SLP_OPRND_PATTERN_ABAB) > > + { > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > + Create node "one" -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > + Create node "two" -> C1, D1, C1, D1, C2, D2, C2, D2, ... */ > > + > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + } > > + } > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_AABB) > > + { > > + /* Given a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... > > + Create node "one" -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > + Create node "two" -> C1, C1, D1, D1, C2, C2, D2, D2, ... */ > > + > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > + > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + } > > + } > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_ABBA) > > + { > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > + Create node "one" -> A1, B1, B1, A1, A2, B2, B2, A2, ... > > + Create node "two" -> C1, D1, D1, C1, C2, D2, D2, C2, ... */ > > + > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + } > > + } > > + > > + slp_tree child; > > + FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (two), i, child) > > + SLP_TREE_REF_COUNT (child)++; > > + > > + node = vect_create_new_slp_node (node, stmts, 2); > > + SLP_TREE_VECTYPE (node) = vectype; > > + SLP_TREE_CHILDREN (node).quick_push (one); > > + SLP_TREE_CHILDREN (node).quick_push (two); > > + > > + SLP_TREE_LANES (one) = stmts.length (); > > + SLP_TREE_LANES (two) = stmts.length (); > > + > > + children.truncate (0); > > + children.safe_splice (SLP_TREE_CHILDREN (node)); > > + } > > + > > if (two_operators) > > { > > /* ??? We'd likely want to either cache in bst_map sth like > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
On Wed, 5 Jun 2024, Tamar Christina wrote: > > -----Original Message----- > > From: Richard Biener <rguenther@suse.de> > > Sent: Wednesday, June 5, 2024 9:07 AM > > To: Manolis Tsamis <manolis.tsamis@vrull.eu> > > Cc: gcc-patches@gcc.gnu.org; Christoph Müllner <christoph.muellner@vrull.eu>; > > Kewen . Lin <linkw@linux.ibm.com>; Philipp Tomsich <philipp.tomsich@vrull.eu>; > > Tamar Christina <Tamar.Christina@arm.com>; Jiangning Liu > > <jiangning.liu@amperecomputing.com> > > Subject: Re: [PATCH] Rearrange SLP nodes with duplicate statements. [PR98138] > > > > On Tue, 4 Jun 2024, Manolis Tsamis wrote: > > > > > This change adds a function that checks for SLP nodes with multiple occurrences > > > of the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the node > > > so that there are no duplicates. A vec_perm is then introduced to recreate the > > > original ordering. These duplicates can appear due to how two_operators nodes > > > are handled, and they prevent vectorization in some cases. > > > > So the trick is that when we have two operands we elide duplicate lanes > > so we can do discovery for a single combined operand instead which we > > then decompose into the required two again. That's a nice one. > > > > But as implemented this will fail SLP discovery if the combined operand > > fails discovery possibly because of divergence in downstream defs. That > > is, it doesn't fall back to separate discovery. I suspect the situation > > of duplicate lanes isn't common but then I would also suspect that > > divergence _is_ common. > > I think we should also look at the cases where vectorization itself also failed > because the generated tree ends up with an unsupported load. > > i.e. in this particular case we would have failed SLP at a later step. > > > > > The discovery code is already quite complex with the way it possibly > > swaps operands of lanes, fitting in this as another variant to try (first) > > is likely going to be a bit awkward. A way out might be to split the > > function or to make the re-try in the caller which could indicate whether > > to apply this pattern trick or not. That said - can you try to get > > data on how often the trick applies and discovery succeeds and how > > often discovery fails but discovery would suceed without applying the > > pattern (say, on SPEC)? > > > > I also suppose instead of hardcoding three patterns for a fixed > > size it should be possible to see there's > > only (at most) half unique lanes in both operands (and one less in one > > operand if the number of lanes is odd) and compute the un-swizzling lane > > permutes during this discovery, removing the need of the explicit enum > > and open-coding each case? > > > > Another general note is that trying (and then undo on fail) such ticks > > eats at the discovery limit we have in place to avoid exponential run-off > > in exactly this degenerate cases. > > I suppose this is typically a case where changing to merging multiple single > lane SLPs instead of creating the multiline graph in one go would make things > easier? I think that's the way forward at some point, not sure if it's really helping this particular case though ... > Isn't SLP discovery computationally expensive since it has to create the full > graph in one go, whereas with merging you just rotate some subgraphs or > eventually just keep the single lane separate? Yes, the current SLP discovery has this issue. But of course all of the merging / rotating is still going to be greedy and prone to get stuck in local optima where backtracking to find the global optimal SLP is necessary. At least merging the grouped store and load nodes first is what would give you the most immediate advantage over the current scheme. In the mean time a trick like the one proposed can help if it's not too ugly to interwind into the current search. Richard. > Cheers, > Tamar > > > > > Thanks, > > Richard. > > > > > This targets the vectorization of the SPEC2017 x264 pixel_satd functions. > > > In some processors a larger than 10% improvement on x264 has been observed. > > > > > > See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138 > > > > > > gcc/ChangeLog: > > > > > > * tree-vect-slp.cc (enum slp_oprnd_pattern): new enum for > > rearrangement > > > patterns. > > > (try_rearrange_oprnd_info): Detect if a node corresponds to one of the > > > patterns. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.target/aarch64/vect-slp-two-operator.c: New test. > > > > > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > > > --- > > > > > > .../aarch64/vect-slp-two-operator.c | 42 ++++ > > > gcc/tree-vect-slp.cc | 234 ++++++++++++++++++ > > > 2 files changed, 276 insertions(+) > > > create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > new file mode 100644 > > > index 00000000000..2db066a0b6e > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > @@ -0,0 +1,42 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect -fdump-tree-vect- > > details" } */ > > > + > > > +typedef unsigned char uint8_t; > > > +typedef unsigned int uint32_t; > > > + > > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\ > > > + int t0 = s0 + s1;\ > > > + int t1 = s0 - s1;\ > > > + int t2 = s2 + s3;\ > > > + int t3 = s2 - s3;\ > > > + d0 = t0 + t2;\ > > > + d1 = t1 + t3;\ > > > + d2 = t0 - t2;\ > > > + d3 = t1 - t3;\ > > > +} > > > + > > > +static uint32_t abs2( uint32_t a ) > > > +{ > > > + uint32_t s = ((a>>15)&0x10001)*0xffff; > > > + return (a+s)^s; > > > +} > > > + > > > +void sink(uint32_t tmp[4][4]); > > > + > > > +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 ) > > > +{ > > > + uint32_t tmp[4][4]; > > > + int sum = 0; > > > + for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 ) > > > + { > > > + uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16); > > > + uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16); > > > + uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16); > > > + uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16); > > > + HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 ); > > > + } > > > + sink(tmp); > > > +} > > > + > > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ > > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > > index bf1f467f53f..e395db0e185 100644 > > > --- a/gcc/tree-vect-slp.cc > > > +++ b/gcc/tree-vect-slp.cc > > > @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3. If not see > > > #include "tree-vectorizer.h" > > > #include "langhooks.h" > > > #include "gimple-walk.h" > > > +#include "gimple-pretty-print.h" > > > #include "dbgcnt.h" > > > #include "tree-vector-builder.h" > > > #include "vec-perm-indices.h" > > > @@ -1829,6 +1830,141 @@ vect_slp_build_two_operator_nodes (slp_tree > > perm, tree vectype, > > > SLP_TREE_CHILDREN (perm).quick_push (child2); > > > } > > > > > > +enum slp_oprnd_pattern > > > +{ > > > + SLP_OPRND_PATTERN_NONE, > > > + SLP_OPRND_PATTERN_ABAB, > > > + SLP_OPRND_PATTERN_AABB, > > > + SLP_OPRND_PATTERN_ABBA > > > +}; > > > + > > > +/* Check if OPRNDS_INFO has duplicated nodes that correspond to a > > predefined > > > + pattern described by SLP_OPRND_PATTERN and return it. */ > > > + > > > +static int > > > +try_rearrange_oprnd_info (vec<slp_oprnd_info> &oprnds_info, unsigned > > group_size) > > > +{ > > > + unsigned i; > > > + slp_oprnd_info info; > > > + > > > + if (oprnds_info.length () != 2 || group_size % 4 != 0) > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + if (!oprnds_info[0]->def_stmts[0] > > > + || !is_a<gassign *> (oprnds_info[0]->def_stmts[0]->stmt)) > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + enum tree_code code > > > + = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt); > > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > > + for (unsigned int j = 0; j < group_size; j += 1) > > > + { > > > + if (!info->def_stmts[j] > > > + || !is_a<gassign *> (info->def_stmts[j]->stmt) > > > + || STMT_VINFO_DATA_REF (info->def_stmts[j])) > > > + return SLP_OPRND_PATTERN_NONE; > > > + /* Don't mix different operations. */ > > > + if (gimple_assign_rhs_code (info->def_stmts[j]->stmt) != code) > > > + return SLP_OPRND_PATTERN_NONE; > > > + } > > > + > > > + if (gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt) > > > + != gimple_assign_rhs_code (oprnds_info[1]->def_stmts[0]->stmt)) > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + int pattern = SLP_OPRND_PATTERN_NONE; > > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + int cur_pattern = SLP_OPRND_PATTERN_NONE; > > > + /* Check for an ABAB... pattern. */ > > > + if ((info->def_stmts[j] == info->def_stmts[j + 2]) > > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 3]) > > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > > + cur_pattern = SLP_OPRND_PATTERN_ABAB; > > > + /* Check for an AABB... pattern. */ > > > + else if ((info->def_stmts[j] == info->def_stmts[j + 1]) > > > + && (info->def_stmts[j + 2] == info->def_stmts[j + 3]) > > > + && (info->def_stmts[j] != info->def_stmts[j + 2])) > > > + cur_pattern = SLP_OPRND_PATTERN_AABB; > > > + /* Check for an ABBA... pattern. */ > > > + else if ((info->def_stmts[j] == info->def_stmts[j + 3]) > > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 2]) > > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > > + cur_pattern = SLP_OPRND_PATTERN_ABBA; > > > + /* Unrecognised pattern. */ > > > + else > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + if (pattern == SLP_OPRND_PATTERN_NONE) > > > + pattern = cur_pattern; > > > + /* Multiple patterns detected. */ > > > + else if (cur_pattern != pattern) > > > + return SLP_OPRND_PATTERN_NONE; > > > + } > > > + > > > + gcc_checking_assert (pattern != SLP_OPRND_PATTERN_NONE); > > > + > > > + if (dump_enabled_p ()) > > > + { > > > + if (pattern == SLP_OPRND_PATTERN_ABAB) > > > + dump_printf (MSG_NOTE, "ABAB"); > > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > > + dump_printf (MSG_NOTE, "AABB"); > > > + else if (pattern == SLP_OPRND_PATTERN_ABBA) > > > + dump_printf (MSG_NOTE, "ABBA"); > > > + dump_printf (MSG_NOTE, " pattern detected.\n"); > > > + } > > > + > > > + if (pattern == SLP_OPRND_PATTERN_ABAB || pattern == > > SLP_OPRND_PATTERN_ABBA) > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + /* Given oprnd[0] -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > > + Given oprnd[1] -> C1, D1, C1, D1, C2, D2, C2, D2, ... > > > + Create a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... */ > > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[1]->def_stmts[j]; > > > + oprnds_info[0]->ops[j+2] = oprnds_info[1]->ops[j]; > > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+1]; > > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+1]; > > > + } > > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + /* Given oprnd[0] -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > > + Given oprnd[1] -> C1, C1, D1, D1, C2, C2, D2, D2, ... > > > + Create a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... */ > > > + > > > + /* The ordering here is at least to some extent arbitrary. > > > + A generilized version needs to use some explicit ordering. */ > > > + oprnds_info[0]->def_stmts[j+1] = oprnds_info[1]->def_stmts[j]; > > > + oprnds_info[0]->ops[j+1] = oprnds_info[1]->ops[j]; > > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[0]->def_stmts[j+2]; > > > + oprnds_info[0]->ops[j+2] = oprnds_info[0]->ops[j+2]; > > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+2]; > > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+2]; > > > + } > > > + > > > + if (dump_enabled_p ()) > > > + { > > > + dump_printf (MSG_NOTE, "Recurse with:\n"); > > > + for (unsigned int j = 0; j < group_size; j++) > > > + { > > > + dump_printf (MSG_NOTE, " "); > > > + print_gimple_stmt (dump_file, oprnds_info[0]->def_stmts[j]->stmt, 0); > > > + } > > > + } > > > + > > > + /* Since we've merged the two nodes in one, make the second one a copy of > > > + the first. */ > > > + for (unsigned int j = 0; j < group_size; j++) > > > + { > > > + oprnds_info[1]->def_stmts[j] = oprnds_info[0]->def_stmts[j]; > > > + oprnds_info[1]->ops[j] = oprnds_info[0]->ops[j]; > > > + } > > > + > > > + return pattern; > > > +} > > > + > > > /* Recursively build an SLP tree starting from NODE. > > > Fail (and return a value not equal to zero) if def-stmts are not > > > isomorphic, require data permutation or are of unsupported types of > > > @@ -2409,6 +2545,10 @@ out: > > > > > > stmt_info = stmts[0]; > > > > > > + int rearrange_pattern = SLP_OPRND_PATTERN_NONE; > > > + if (is_a<gassign *> (stmt_info->stmt)) > > > + rearrange_pattern = try_rearrange_oprnd_info (oprnds_info, group_size); > > > + > > > /* Create SLP_TREE nodes for the definition node/s. */ > > > FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info) > > > { > > > @@ -2669,6 +2809,100 @@ fail: > > > *tree_size += this_tree_size + 1; > > > *max_nunits = this_max_nunits; > > > > > > + /* If we applied any rearrangmenets then we need to reconstruct the original > > > + elements with an additional permutation layer. */ > > > + if (rearrange_pattern != SLP_OPRND_PATTERN_NONE) > > > + { > > > + slp_tree one = new _slp_tree; > > > + slp_tree two = new _slp_tree; > > > + SLP_TREE_DEF_TYPE (one) = vect_internal_def; > > > + SLP_TREE_DEF_TYPE (two) = vect_internal_def; > > > + SLP_TREE_VECTYPE (one) = vectype; > > > + SLP_TREE_VECTYPE (two) = vectype; > > > + SLP_TREE_CHILDREN (one).safe_splice (children); > > > + SLP_TREE_CHILDREN (two).safe_splice (children); > > > + > > > + SLP_TREE_CODE (one) = VEC_PERM_EXPR; > > > + SLP_TREE_CODE (two) = VEC_PERM_EXPR; > > > + SLP_TREE_REPRESENTATIVE (one) = stmts[0]; > > > + SLP_TREE_REPRESENTATIVE (two) = stmts[2]; > > > + lane_permutation_t &perm_one = SLP_TREE_LANE_PERMUTATION (one); > > > + lane_permutation_t &perm_two = SLP_TREE_LANE_PERMUTATION (two); > > > + > > > + if (rearrange_pattern == SLP_OPRND_PATTERN_ABAB) > > > + { > > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > > + Create node "one" -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > > + Create node "two" -> C1, D1, C1, D1, C2, D2, C2, D2, ... */ > > > + > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + } > > > + } > > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_AABB) > > > + { > > > + /* Given a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... > > > + Create node "one" -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > > + Create node "two" -> C1, C1, D1, D1, C2, C2, D2, D2, ... */ > > > + > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > > + > > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + } > > > + } > > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_ABBA) > > > + { > > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > > + Create node "one" -> A1, B1, B1, A1, A2, B2, B2, A2, ... > > > + Create node "two" -> C1, D1, D1, C1, C2, D2, D2, C2, ... */ > > > + > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + } > > > + } > > > + > > > + slp_tree child; > > > + FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (two), i, child) > > > + SLP_TREE_REF_COUNT (child)++; > > > + > > > + node = vect_create_new_slp_node (node, stmts, 2); > > > + SLP_TREE_VECTYPE (node) = vectype; > > > + SLP_TREE_CHILDREN (node).quick_push (one); > > > + SLP_TREE_CHILDREN (node).quick_push (two); > > > + > > > + SLP_TREE_LANES (one) = stmts.length (); > > > + SLP_TREE_LANES (two) = stmts.length (); > > > + > > > + children.truncate (0); > > > + children.safe_splice (SLP_TREE_CHILDREN (node)); > > > + } > > > + > > > if (two_operators) > > > { > > > /* ??? We'd likely want to either cache in bst_map sth like > > > > > > > -- > > Richard Biener <rguenther@suse.de> > > SUSE Software Solutions Germany GmbH, > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) >
On Wed, Jun 5, 2024 at 11:07 AM Richard Biener <rguenther@suse.de> wrote: > > On Tue, 4 Jun 2024, Manolis Tsamis wrote: > > > This change adds a function that checks for SLP nodes with multiple occurrences > > of the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the node > > so that there are no duplicates. A vec_perm is then introduced to recreate the > > original ordering. These duplicates can appear due to how two_operators nodes > > are handled, and they prevent vectorization in some cases. > > So the trick is that when we have two operands we elide duplicate lanes > so we can do discovery for a single combined operand instead which we > then decompose into the required two again. That's a nice one. > > But as implemented this will fail SLP discovery if the combined operand > fails discovery possibly because of divergence in downstream defs. That > is, it doesn't fall back to separate discovery. I suspect the situation > of duplicate lanes isn't common but then I would also suspect that > divergence _is_ common. > That's a good point; I checked out and at least for the x264 testcase provided SLP discovery succeeds in both cases but in one case vectorization fails later on due to the unsupported load permutations among others. I think that's what Tamar also mentioned and it makes it hard to decide whether to apply the pattern based on if discovery fails. > The discovery code is already quite complex with the way it possibly > swaps operands of lanes, fitting in this as another variant to try (first) > is likely going to be a bit awkward. A way out might be to split the > function or to make the re-try in the caller which could indicate whether > to apply this pattern trick or not. That said - can you try to get > data on how often the trick applies and discovery succeeds and how > often discovery fails but discovery would suceed without applying the > pattern (say, on SPEC)? > I checked out SPEC and this pattern only triggers on x264 and in that case discovery succeeds. So we don't have any data on the pattern applying but discovery failing. > I also suppose instead of hardcoding three patterns for a fixed > size it should be possible to see there's > only (at most) half unique lanes in both operands (and one less in one > operand if the number of lanes is odd) and compute the un-swizzling lane > permutes during this discovery, removing the need of the explicit enum > and open-coding each case? > Yes, that's a fair point. I will change that in the next iteration. > Another general note is that trying (and then undo on fail) such ticks > eats at the discovery limit we have in place to avoid exponential run-off > in exactly this degenerate cases. > So, most importantly, the points you and Tamar mentioned got me thinking about the transformation again, why it is useful and when it applies. In this initial implementation I tried to make this independant from the two_operators logic and apply it when possible, which brings up all these issues about discovery and usefulness of the pattern in general. E.g. If we had just [a, b, a, b] + [c, d, c, d] without two_operators I sort of doubt it would be worth it to apply the transformation in most cases (except of course if it enables vectorization, but as I understand it it is hard to tell when that happens). On the other hand, if we know that we're dealing with two_operators nodes then the argument changes, as we know that we'll duplicate these nodes. In turn, it may be best to try to see this as a 'two_operators lowering strategy' improvement instead of a generic rearrangement pattern. Specifically for x264, we're given code like int t0 = s0 + s1; int t1 = s0 - s1; int t2 = s2 + s3; int t3 = s2 - s3; and currently we lower that to VEC_PERM<(A + B), (A - B)>(...) with A = [s0, s0, s2, s2], B = [s1, s1, s3, s3] which doesn't work very well (due to element duplication). With this patch we do VEC_PERM<(A + B), (A - B)>(...) with A = VEC_PERM<C, C>(...), B = VEC_PERM<C, C>(...), C = [s0, s1, s2, s3] instead which works good. But it is obvious that there are other strategies to lower this too and they may be even better (by taking advantage of the fact that we know we're dealing with a two_operators node *and* have duplicate elements). For example doing VEC_PERM<(A + B), (A - B)>(...) with A = [s0, s1, s2, s3] and B = VEC_PERM<A, A>(1, 0, 3, 2) looks interesting too and is only possible because we combine two_operators and rearrangement. Do you believe that narrowing this to a "two_operators lowering improvement" makes more sense and addresses at least some of the issues mentioned? I'm currently testing to see the code that we generate with other strategies and will reach out once I have new results. Thanks, Manolis > Thanks, > Richard. > > > This targets the vectorization of the SPEC2017 x264 pixel_satd functions. > > In some processors a larger than 10% improvement on x264 has been observed. > > > > See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138 > > > > gcc/ChangeLog: > > > > * tree-vect-slp.cc (enum slp_oprnd_pattern): new enum for rearrangement > > patterns. > > (try_rearrange_oprnd_info): Detect if a node corresponds to one of the > > patterns. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/aarch64/vect-slp-two-operator.c: New test. > > > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > > --- > > > > .../aarch64/vect-slp-two-operator.c | 42 ++++ > > gcc/tree-vect-slp.cc | 234 ++++++++++++++++++ > > 2 files changed, 276 insertions(+) > > create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > new file mode 100644 > > index 00000000000..2db066a0b6e > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > @@ -0,0 +1,42 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect -fdump-tree-vect-details" } */ > > + > > +typedef unsigned char uint8_t; > > +typedef unsigned int uint32_t; > > + > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\ > > + int t0 = s0 + s1;\ > > + int t1 = s0 - s1;\ > > + int t2 = s2 + s3;\ > > + int t3 = s2 - s3;\ > > + d0 = t0 + t2;\ > > + d1 = t1 + t3;\ > > + d2 = t0 - t2;\ > > + d3 = t1 - t3;\ > > +} > > + > > +static uint32_t abs2( uint32_t a ) > > +{ > > + uint32_t s = ((a>>15)&0x10001)*0xffff; > > + return (a+s)^s; > > +} > > + > > +void sink(uint32_t tmp[4][4]); > > + > > +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 ) > > +{ > > + uint32_t tmp[4][4]; > > + int sum = 0; > > + for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 ) > > + { > > + uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16); > > + uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16); > > + uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16); > > + uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16); > > + HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 ); > > + } > > + sink(tmp); > > +} > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > index bf1f467f53f..e395db0e185 100644 > > --- a/gcc/tree-vect-slp.cc > > +++ b/gcc/tree-vect-slp.cc > > @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3. If not see > > #include "tree-vectorizer.h" > > #include "langhooks.h" > > #include "gimple-walk.h" > > +#include "gimple-pretty-print.h" > > #include "dbgcnt.h" > > #include "tree-vector-builder.h" > > #include "vec-perm-indices.h" > > @@ -1829,6 +1830,141 @@ vect_slp_build_two_operator_nodes (slp_tree perm, tree vectype, > > SLP_TREE_CHILDREN (perm).quick_push (child2); > > } > > > > +enum slp_oprnd_pattern > > +{ > > + SLP_OPRND_PATTERN_NONE, > > + SLP_OPRND_PATTERN_ABAB, > > + SLP_OPRND_PATTERN_AABB, > > + SLP_OPRND_PATTERN_ABBA > > +}; > > + > > +/* Check if OPRNDS_INFO has duplicated nodes that correspond to a predefined > > + pattern described by SLP_OPRND_PATTERN and return it. */ > > + > > +static int > > +try_rearrange_oprnd_info (vec<slp_oprnd_info> &oprnds_info, unsigned group_size) > > +{ > > + unsigned i; > > + slp_oprnd_info info; > > + > > + if (oprnds_info.length () != 2 || group_size % 4 != 0) > > + return SLP_OPRND_PATTERN_NONE; > > + > > + if (!oprnds_info[0]->def_stmts[0] > > + || !is_a<gassign *> (oprnds_info[0]->def_stmts[0]->stmt)) > > + return SLP_OPRND_PATTERN_NONE; > > + > > + enum tree_code code > > + = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt); > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > + for (unsigned int j = 0; j < group_size; j += 1) > > + { > > + if (!info->def_stmts[j] > > + || !is_a<gassign *> (info->def_stmts[j]->stmt) > > + || STMT_VINFO_DATA_REF (info->def_stmts[j])) > > + return SLP_OPRND_PATTERN_NONE; > > + /* Don't mix different operations. */ > > + if (gimple_assign_rhs_code (info->def_stmts[j]->stmt) != code) > > + return SLP_OPRND_PATTERN_NONE; > > + } > > + > > + if (gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt) > > + != gimple_assign_rhs_code (oprnds_info[1]->def_stmts[0]->stmt)) > > + return SLP_OPRND_PATTERN_NONE; > > + > > + int pattern = SLP_OPRND_PATTERN_NONE; > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + int cur_pattern = SLP_OPRND_PATTERN_NONE; > > + /* Check for an ABAB... pattern. */ > > + if ((info->def_stmts[j] == info->def_stmts[j + 2]) > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 3]) > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > + cur_pattern = SLP_OPRND_PATTERN_ABAB; > > + /* Check for an AABB... pattern. */ > > + else if ((info->def_stmts[j] == info->def_stmts[j + 1]) > > + && (info->def_stmts[j + 2] == info->def_stmts[j + 3]) > > + && (info->def_stmts[j] != info->def_stmts[j + 2])) > > + cur_pattern = SLP_OPRND_PATTERN_AABB; > > + /* Check for an ABBA... pattern. */ > > + else if ((info->def_stmts[j] == info->def_stmts[j + 3]) > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 2]) > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > + cur_pattern = SLP_OPRND_PATTERN_ABBA; > > + /* Unrecognised pattern. */ > > + else > > + return SLP_OPRND_PATTERN_NONE; > > + > > + if (pattern == SLP_OPRND_PATTERN_NONE) > > + pattern = cur_pattern; > > + /* Multiple patterns detected. */ > > + else if (cur_pattern != pattern) > > + return SLP_OPRND_PATTERN_NONE; > > + } > > + > > + gcc_checking_assert (pattern != SLP_OPRND_PATTERN_NONE); > > + > > + if (dump_enabled_p ()) > > + { > > + if (pattern == SLP_OPRND_PATTERN_ABAB) > > + dump_printf (MSG_NOTE, "ABAB"); > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > + dump_printf (MSG_NOTE, "AABB"); > > + else if (pattern == SLP_OPRND_PATTERN_ABBA) > > + dump_printf (MSG_NOTE, "ABBA"); > > + dump_printf (MSG_NOTE, " pattern detected.\n"); > > + } > > + > > + if (pattern == SLP_OPRND_PATTERN_ABAB || pattern == SLP_OPRND_PATTERN_ABBA) > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + /* Given oprnd[0] -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > + Given oprnd[1] -> C1, D1, C1, D1, C2, D2, C2, D2, ... > > + Create a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... */ > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[1]->def_stmts[j]; > > + oprnds_info[0]->ops[j+2] = oprnds_info[1]->ops[j]; > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+1]; > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+1]; > > + } > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + /* Given oprnd[0] -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > + Given oprnd[1] -> C1, C1, D1, D1, C2, C2, D2, D2, ... > > + Create a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... */ > > + > > + /* The ordering here is at least to some extent arbitrary. > > + A generilized version needs to use some explicit ordering. */ > > + oprnds_info[0]->def_stmts[j+1] = oprnds_info[1]->def_stmts[j]; > > + oprnds_info[0]->ops[j+1] = oprnds_info[1]->ops[j]; > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[0]->def_stmts[j+2]; > > + oprnds_info[0]->ops[j+2] = oprnds_info[0]->ops[j+2]; > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+2]; > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+2]; > > + } > > + > > + if (dump_enabled_p ()) > > + { > > + dump_printf (MSG_NOTE, "Recurse with:\n"); > > + for (unsigned int j = 0; j < group_size; j++) > > + { > > + dump_printf (MSG_NOTE, " "); > > + print_gimple_stmt (dump_file, oprnds_info[0]->def_stmts[j]->stmt, 0); > > + } > > + } > > + > > + /* Since we've merged the two nodes in one, make the second one a copy of > > + the first. */ > > + for (unsigned int j = 0; j < group_size; j++) > > + { > > + oprnds_info[1]->def_stmts[j] = oprnds_info[0]->def_stmts[j]; > > + oprnds_info[1]->ops[j] = oprnds_info[0]->ops[j]; > > + } > > + > > + return pattern; > > +} > > + > > /* Recursively build an SLP tree starting from NODE. > > Fail (and return a value not equal to zero) if def-stmts are not > > isomorphic, require data permutation or are of unsupported types of > > @@ -2409,6 +2545,10 @@ out: > > > > stmt_info = stmts[0]; > > > > + int rearrange_pattern = SLP_OPRND_PATTERN_NONE; > > + if (is_a<gassign *> (stmt_info->stmt)) > > + rearrange_pattern = try_rearrange_oprnd_info (oprnds_info, group_size); > > + > > /* Create SLP_TREE nodes for the definition node/s. */ > > FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info) > > { > > @@ -2669,6 +2809,100 @@ fail: > > *tree_size += this_tree_size + 1; > > *max_nunits = this_max_nunits; > > > > + /* If we applied any rearrangmenets then we need to reconstruct the original > > + elements with an additional permutation layer. */ > > + if (rearrange_pattern != SLP_OPRND_PATTERN_NONE) > > + { > > + slp_tree one = new _slp_tree; > > + slp_tree two = new _slp_tree; > > + SLP_TREE_DEF_TYPE (one) = vect_internal_def; > > + SLP_TREE_DEF_TYPE (two) = vect_internal_def; > > + SLP_TREE_VECTYPE (one) = vectype; > > + SLP_TREE_VECTYPE (two) = vectype; > > + SLP_TREE_CHILDREN (one).safe_splice (children); > > + SLP_TREE_CHILDREN (two).safe_splice (children); > > + > > + SLP_TREE_CODE (one) = VEC_PERM_EXPR; > > + SLP_TREE_CODE (two) = VEC_PERM_EXPR; > > + SLP_TREE_REPRESENTATIVE (one) = stmts[0]; > > + SLP_TREE_REPRESENTATIVE (two) = stmts[2]; > > + lane_permutation_t &perm_one = SLP_TREE_LANE_PERMUTATION (one); > > + lane_permutation_t &perm_two = SLP_TREE_LANE_PERMUTATION (two); > > + > > + if (rearrange_pattern == SLP_OPRND_PATTERN_ABAB) > > + { > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > + Create node "one" -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > + Create node "two" -> C1, D1, C1, D1, C2, D2, C2, D2, ... */ > > + > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + } > > + } > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_AABB) > > + { > > + /* Given a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... > > + Create node "one" -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > + Create node "two" -> C1, C1, D1, D1, C2, C2, D2, D2, ... */ > > + > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > + > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + } > > + } > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_ABBA) > > + { > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > + Create node "one" -> A1, B1, B1, A1, A2, B2, B2, A2, ... > > + Create node "two" -> C1, D1, D1, C1, C2, D2, D2, C2, ... */ > > + > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + } > > + } > > + > > + slp_tree child; > > + FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (two), i, child) > > + SLP_TREE_REF_COUNT (child)++; > > + > > + node = vect_create_new_slp_node (node, stmts, 2); > > + SLP_TREE_VECTYPE (node) = vectype; > > + SLP_TREE_CHILDREN (node).quick_push (one); > > + SLP_TREE_CHILDREN (node).quick_push (two); > > + > > + SLP_TREE_LANES (one) = stmts.length (); > > + SLP_TREE_LANES (two) = stmts.length (); > > + > > + children.truncate (0); > > + children.safe_splice (SLP_TREE_CHILDREN (node)); > > + } > > + > > if (two_operators) > > { > > /* ??? We'd likely want to either cache in bst_map sth like > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
On Mon, 10 Jun 2024, Manolis Tsamis wrote: > On Wed, Jun 5, 2024 at 11:07 AM Richard Biener <rguenther@suse.de> wrote: > > > > On Tue, 4 Jun 2024, Manolis Tsamis wrote: > > > > > This change adds a function that checks for SLP nodes with multiple occurrences > > > of the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the node > > > so that there are no duplicates. A vec_perm is then introduced to recreate the > > > original ordering. These duplicates can appear due to how two_operators nodes > > > are handled, and they prevent vectorization in some cases. > > > > So the trick is that when we have two operands we elide duplicate lanes > > so we can do discovery for a single combined operand instead which we > > then decompose into the required two again. That's a nice one. > > > > But as implemented this will fail SLP discovery if the combined operand > > fails discovery possibly because of divergence in downstream defs. That > > is, it doesn't fall back to separate discovery. I suspect the situation > > of duplicate lanes isn't common but then I would also suspect that > > divergence _is_ common. > > > That's a good point; I checked out and at least for the x264 testcase > provided SLP discovery succeeds in both cases but in one case > vectorization fails later on due to the unsupported load permutations > among others. > I think that's what Tamar also mentioned and it makes it hard to > decide whether to apply the pattern based on if discovery fails. > > > The discovery code is already quite complex with the way it possibly > > swaps operands of lanes, fitting in this as another variant to try (first) > > is likely going to be a bit awkward. A way out might be to split the > > function or to make the re-try in the caller which could indicate whether > > to apply this pattern trick or not. That said - can you try to get > > data on how often the trick applies and discovery succeeds and how > > often discovery fails but discovery would suceed without applying the > > pattern (say, on SPEC)? > > > I checked out SPEC and this pattern only triggers on x264 and in that > case discovery succeeds. So we don't have any data on the pattern > applying but discovery failing. > > > I also suppose instead of hardcoding three patterns for a fixed > > size it should be possible to see there's > > only (at most) half unique lanes in both operands (and one less in one > > operand if the number of lanes is odd) and compute the un-swizzling lane > > permutes during this discovery, removing the need of the explicit enum > > and open-coding each case? > > > Yes, that's a fair point. I will change that in the next iteration. > > > Another general note is that trying (and then undo on fail) such ticks > > eats at the discovery limit we have in place to avoid exponential run-off > > in exactly this degenerate cases. > > > > So, most importantly, the points you and Tamar mentioned got me > thinking about the transformation again, why it is useful and when it > applies. > In this initial implementation I tried to make this independant from > the two_operators logic and apply it when possible, which brings up > all these issues about discovery and usefulness of the pattern in > general. > E.g. If we had just [a, b, a, b] + [c, d, c, d] without two_operators > I sort of doubt it would be worth it to apply the transformation in > most cases (except of course if it enables vectorization, but as I > understand it it is hard to tell when that happens). > On the other hand, if we know that we're dealing with two_operators > nodes then the argument changes, as we know that we'll duplicate these > nodes. > > In turn, it may be best to try to see this as a 'two_operators > lowering strategy' improvement instead of a generic rearrangement > pattern. > Specifically for x264, we're given code like > > int t0 = s0 + s1; > int t1 = s0 - s1; > int t2 = s2 + s3; > int t3 = s2 - s3; > > and currently we lower that to VEC_PERM<(A + B), (A - B)>(...) with A > = [s0, s0, s2, s2], B = [s1, s1, s3, s3] which doesn't work very well > (due to element duplication). > With this patch we do VEC_PERM<(A + B), (A - B)>(...) with A = > VEC_PERM<C, C>(...), B = VEC_PERM<C, C>(...), C = [s0, s1, s2, s3] > instead which works good. So I'm not sure I buy the argument that [a, b, a, b] + [c, d, c, d] is much different from the two-operators version. [a, b, a, b] + [c, d, c, d] is one of the variants (the plus) we discover. Isn't this really about us stupidly trying to force the use of a larger vector rather than doing the two-operator handling by doing VEC_PERM <{s0, s2} + {s1, s3}, {s0, s2} - {s1, s3}, { 0, 2, 1, 3 }> and thus recursing with smaller group size and then "interleaving" the result? Which might only work well in exactly the case where we have the same number of + and -. Of course in both forms 'A' and 'B' (or the smaller vectors) should be SLP nodes re-used. > But it is obvious that there are other strategies to lower this too > and they may be even better (by taking advantage of the fact that we > know we're dealing with a two_operators node *and* have duplicate > elements). > For example doing VEC_PERM<(A + B), (A - B)>(...) with A = [s0, s1, > s2, s3] and B = VEC_PERM<A, A>(1, 0, 3, 2) looks interesting too and > is only possible because we combine two_operators and rearrangement. > > Do you believe that narrowing this to a "two_operators lowering > improvement" makes more sense and addresses at least some of the > issues mentioned? > I'm currently testing to see the code that we generate with other > strategies and will reach out once I have new results. I think doing this in the if (two_operators) code in vect_build_slp_tree_2 only where we can take advantage of knowing that SLP build for the original A and B children succeeded would be good. It should side-step the discovery cost issue. If you do your inter-mixing scheme you'll still need to re-discover but FAILing should be possible easily by falling back to the already built children nodes? Richard. > Thanks, > Manolis > > > Thanks, > > Richard. > > > > > This targets the vectorization of the SPEC2017 x264 pixel_satd functions. > > > In some processors a larger than 10% improvement on x264 has been observed. > > > > > > See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138 > > > > > > gcc/ChangeLog: > > > > > > * tree-vect-slp.cc (enum slp_oprnd_pattern): new enum for rearrangement > > > patterns. > > > (try_rearrange_oprnd_info): Detect if a node corresponds to one of the > > > patterns. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.target/aarch64/vect-slp-two-operator.c: New test. > > > > > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > > > --- > > > > > > .../aarch64/vect-slp-two-operator.c | 42 ++++ > > > gcc/tree-vect-slp.cc | 234 ++++++++++++++++++ > > > 2 files changed, 276 insertions(+) > > > create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > new file mode 100644 > > > index 00000000000..2db066a0b6e > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > @@ -0,0 +1,42 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect -fdump-tree-vect-details" } */ > > > + > > > +typedef unsigned char uint8_t; > > > +typedef unsigned int uint32_t; > > > + > > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\ > > > + int t0 = s0 + s1;\ > > > + int t1 = s0 - s1;\ > > > + int t2 = s2 + s3;\ > > > + int t3 = s2 - s3;\ > > > + d0 = t0 + t2;\ > > > + d1 = t1 + t3;\ > > > + d2 = t0 - t2;\ > > > + d3 = t1 - t3;\ > > > +} > > > + > > > +static uint32_t abs2( uint32_t a ) > > > +{ > > > + uint32_t s = ((a>>15)&0x10001)*0xffff; > > > + return (a+s)^s; > > > +} > > > + > > > +void sink(uint32_t tmp[4][4]); > > > + > > > +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 ) > > > +{ > > > + uint32_t tmp[4][4]; > > > + int sum = 0; > > > + for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 ) > > > + { > > > + uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16); > > > + uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16); > > > + uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16); > > > + uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16); > > > + HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 ); > > > + } > > > + sink(tmp); > > > +} > > > + > > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ > > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > > index bf1f467f53f..e395db0e185 100644 > > > --- a/gcc/tree-vect-slp.cc > > > +++ b/gcc/tree-vect-slp.cc > > > @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3. If not see > > > #include "tree-vectorizer.h" > > > #include "langhooks.h" > > > #include "gimple-walk.h" > > > +#include "gimple-pretty-print.h" > > > #include "dbgcnt.h" > > > #include "tree-vector-builder.h" > > > #include "vec-perm-indices.h" > > > @@ -1829,6 +1830,141 @@ vect_slp_build_two_operator_nodes (slp_tree perm, tree vectype, > > > SLP_TREE_CHILDREN (perm).quick_push (child2); > > > } > > > > > > +enum slp_oprnd_pattern > > > +{ > > > + SLP_OPRND_PATTERN_NONE, > > > + SLP_OPRND_PATTERN_ABAB, > > > + SLP_OPRND_PATTERN_AABB, > > > + SLP_OPRND_PATTERN_ABBA > > > +}; > > > + > > > +/* Check if OPRNDS_INFO has duplicated nodes that correspond to a predefined > > > + pattern described by SLP_OPRND_PATTERN and return it. */ > > > + > > > +static int > > > +try_rearrange_oprnd_info (vec<slp_oprnd_info> &oprnds_info, unsigned group_size) > > > +{ > > > + unsigned i; > > > + slp_oprnd_info info; > > > + > > > + if (oprnds_info.length () != 2 || group_size % 4 != 0) > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + if (!oprnds_info[0]->def_stmts[0] > > > + || !is_a<gassign *> (oprnds_info[0]->def_stmts[0]->stmt)) > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + enum tree_code code > > > + = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt); > > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > > + for (unsigned int j = 0; j < group_size; j += 1) > > > + { > > > + if (!info->def_stmts[j] > > > + || !is_a<gassign *> (info->def_stmts[j]->stmt) > > > + || STMT_VINFO_DATA_REF (info->def_stmts[j])) > > > + return SLP_OPRND_PATTERN_NONE; > > > + /* Don't mix different operations. */ > > > + if (gimple_assign_rhs_code (info->def_stmts[j]->stmt) != code) > > > + return SLP_OPRND_PATTERN_NONE; > > > + } > > > + > > > + if (gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt) > > > + != gimple_assign_rhs_code (oprnds_info[1]->def_stmts[0]->stmt)) > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + int pattern = SLP_OPRND_PATTERN_NONE; > > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + int cur_pattern = SLP_OPRND_PATTERN_NONE; > > > + /* Check for an ABAB... pattern. */ > > > + if ((info->def_stmts[j] == info->def_stmts[j + 2]) > > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 3]) > > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > > + cur_pattern = SLP_OPRND_PATTERN_ABAB; > > > + /* Check for an AABB... pattern. */ > > > + else if ((info->def_stmts[j] == info->def_stmts[j + 1]) > > > + && (info->def_stmts[j + 2] == info->def_stmts[j + 3]) > > > + && (info->def_stmts[j] != info->def_stmts[j + 2])) > > > + cur_pattern = SLP_OPRND_PATTERN_AABB; > > > + /* Check for an ABBA... pattern. */ > > > + else if ((info->def_stmts[j] == info->def_stmts[j + 3]) > > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 2]) > > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > > + cur_pattern = SLP_OPRND_PATTERN_ABBA; > > > + /* Unrecognised pattern. */ > > > + else > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + if (pattern == SLP_OPRND_PATTERN_NONE) > > > + pattern = cur_pattern; > > > + /* Multiple patterns detected. */ > > > + else if (cur_pattern != pattern) > > > + return SLP_OPRND_PATTERN_NONE; > > > + } > > > + > > > + gcc_checking_assert (pattern != SLP_OPRND_PATTERN_NONE); > > > + > > > + if (dump_enabled_p ()) > > > + { > > > + if (pattern == SLP_OPRND_PATTERN_ABAB) > > > + dump_printf (MSG_NOTE, "ABAB"); > > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > > + dump_printf (MSG_NOTE, "AABB"); > > > + else if (pattern == SLP_OPRND_PATTERN_ABBA) > > > + dump_printf (MSG_NOTE, "ABBA"); > > > + dump_printf (MSG_NOTE, " pattern detected.\n"); > > > + } > > > + > > > + if (pattern == SLP_OPRND_PATTERN_ABAB || pattern == SLP_OPRND_PATTERN_ABBA) > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + /* Given oprnd[0] -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > > + Given oprnd[1] -> C1, D1, C1, D1, C2, D2, C2, D2, ... > > > + Create a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... */ > > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[1]->def_stmts[j]; > > > + oprnds_info[0]->ops[j+2] = oprnds_info[1]->ops[j]; > > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+1]; > > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+1]; > > > + } > > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + /* Given oprnd[0] -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > > + Given oprnd[1] -> C1, C1, D1, D1, C2, C2, D2, D2, ... > > > + Create a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... */ > > > + > > > + /* The ordering here is at least to some extent arbitrary. > > > + A generilized version needs to use some explicit ordering. */ > > > + oprnds_info[0]->def_stmts[j+1] = oprnds_info[1]->def_stmts[j]; > > > + oprnds_info[0]->ops[j+1] = oprnds_info[1]->ops[j]; > > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[0]->def_stmts[j+2]; > > > + oprnds_info[0]->ops[j+2] = oprnds_info[0]->ops[j+2]; > > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+2]; > > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+2]; > > > + } > > > + > > > + if (dump_enabled_p ()) > > > + { > > > + dump_printf (MSG_NOTE, "Recurse with:\n"); > > > + for (unsigned int j = 0; j < group_size; j++) > > > + { > > > + dump_printf (MSG_NOTE, " "); > > > + print_gimple_stmt (dump_file, oprnds_info[0]->def_stmts[j]->stmt, 0); > > > + } > > > + } > > > + > > > + /* Since we've merged the two nodes in one, make the second one a copy of > > > + the first. */ > > > + for (unsigned int j = 0; j < group_size; j++) > > > + { > > > + oprnds_info[1]->def_stmts[j] = oprnds_info[0]->def_stmts[j]; > > > + oprnds_info[1]->ops[j] = oprnds_info[0]->ops[j]; > > > + } > > > + > > > + return pattern; > > > +} > > > + > > > /* Recursively build an SLP tree starting from NODE. > > > Fail (and return a value not equal to zero) if def-stmts are not > > > isomorphic, require data permutation or are of unsupported types of > > > @@ -2409,6 +2545,10 @@ out: > > > > > > stmt_info = stmts[0]; > > > > > > + int rearrange_pattern = SLP_OPRND_PATTERN_NONE; > > > + if (is_a<gassign *> (stmt_info->stmt)) > > > + rearrange_pattern = try_rearrange_oprnd_info (oprnds_info, group_size); > > > + > > > /* Create SLP_TREE nodes for the definition node/s. */ > > > FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info) > > > { > > > @@ -2669,6 +2809,100 @@ fail: > > > *tree_size += this_tree_size + 1; > > > *max_nunits = this_max_nunits; > > > > > > + /* If we applied any rearrangmenets then we need to reconstruct the original > > > + elements with an additional permutation layer. */ > > > + if (rearrange_pattern != SLP_OPRND_PATTERN_NONE) > > > + { > > > + slp_tree one = new _slp_tree; > > > + slp_tree two = new _slp_tree; > > > + SLP_TREE_DEF_TYPE (one) = vect_internal_def; > > > + SLP_TREE_DEF_TYPE (two) = vect_internal_def; > > > + SLP_TREE_VECTYPE (one) = vectype; > > > + SLP_TREE_VECTYPE (two) = vectype; > > > + SLP_TREE_CHILDREN (one).safe_splice (children); > > > + SLP_TREE_CHILDREN (two).safe_splice (children); > > > + > > > + SLP_TREE_CODE (one) = VEC_PERM_EXPR; > > > + SLP_TREE_CODE (two) = VEC_PERM_EXPR; > > > + SLP_TREE_REPRESENTATIVE (one) = stmts[0]; > > > + SLP_TREE_REPRESENTATIVE (two) = stmts[2]; > > > + lane_permutation_t &perm_one = SLP_TREE_LANE_PERMUTATION (one); > > > + lane_permutation_t &perm_two = SLP_TREE_LANE_PERMUTATION (two); > > > + > > > + if (rearrange_pattern == SLP_OPRND_PATTERN_ABAB) > > > + { > > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > > + Create node "one" -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > > + Create node "two" -> C1, D1, C1, D1, C2, D2, C2, D2, ... */ > > > + > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + } > > > + } > > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_AABB) > > > + { > > > + /* Given a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... > > > + Create node "one" -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > > + Create node "two" -> C1, C1, D1, D1, C2, C2, D2, D2, ... */ > > > + > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > > + > > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + } > > > + } > > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_ABBA) > > > + { > > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > > + Create node "one" -> A1, B1, B1, A1, A2, B2, B2, A2, ... > > > + Create node "two" -> C1, D1, D1, C1, C2, D2, D2, C2, ... */ > > > + > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + } > > > + } > > > + > > > + slp_tree child; > > > + FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (two), i, child) > > > + SLP_TREE_REF_COUNT (child)++; > > > + > > > + node = vect_create_new_slp_node (node, stmts, 2); > > > + SLP_TREE_VECTYPE (node) = vectype; > > > + SLP_TREE_CHILDREN (node).quick_push (one); > > > + SLP_TREE_CHILDREN (node).quick_push (two); > > > + > > > + SLP_TREE_LANES (one) = stmts.length (); > > > + SLP_TREE_LANES (two) = stmts.length (); > > > + > > > + children.truncate (0); > > > + children.safe_splice (SLP_TREE_CHILDREN (node)); > > > + } > > > + > > > if (two_operators) > > > { > > > /* ??? We'd likely want to either cache in bst_map sth like > > > > > > > -- > > Richard Biener <rguenther@suse.de> > > SUSE Software Solutions Germany GmbH, > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) >
On Wed, Jun 5, 2024 at 11:07 AM Richard Biener <rguenther@suse.de> wrote: > > On Tue, 4 Jun 2024, Manolis Tsamis wrote: > > > This change adds a function that checks for SLP nodes with multiple occurrences > > of the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the node > > so that there are no duplicates. A vec_perm is then introduced to recreate the > > original ordering. These duplicates can appear due to how two_operators nodes > > are handled, and they prevent vectorization in some cases. > > So the trick is that when we have two operands we elide duplicate lanes > so we can do discovery for a single combined operand instead which we > then decompose into the required two again. That's a nice one. > > But as implemented this will fail SLP discovery if the combined operand > fails discovery possibly because of divergence in downstream defs. That > is, it doesn't fall back to separate discovery. I suspect the situation > of duplicate lanes isn't common but then I would also suspect that > divergence _is_ common. > > The discovery code is already quite complex with the way it possibly > swaps operands of lanes, fitting in this as another variant to try (first) > is likely going to be a bit awkward. A way out might be to split the > function or to make the re-try in the caller which could indicate whether > to apply this pattern trick or not. That said - can you try to get > data on how often the trick applies and discovery succeeds and how > often discovery fails but discovery would suceed without applying the > pattern (say, on SPEC)? Hi Richard, I have found two other SPEC benchmarks in which the new version of this optimization applies. In these cases discovery "fails" anyway when not doing deduplication. It's not an immediate failure though but rather not producing good SLP trees and then aborting due to cost of other checks (similar to x264). > > I also suppose instead of hardcoding three patterns for a fixed > size it should be possible to see there's > only (at most) half unique lanes in both operands (and one less in one > operand if the number of lanes is odd) and compute the un-swizzling lane > permutes during this discovery, removing the need of the explicit enum > and open-coding each case? > > Another general note is that trying (and then undo on fail) such ticks > eats at the discovery limit we have in place to avoid exponential run-off > in exactly this degenerate cases. > I have sent a new version that doesn't have hardcoded patterns and only works with two_operators nodes among others. Please note that I still haven't addressed all your other feedback as I'm still iterating the implementation. Thanks, Manolis > Thanks, > Richard. > > > This targets the vectorization of the SPEC2017 x264 pixel_satd functions. > > In some processors a larger than 10% improvement on x264 has been observed. > > > > See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138 > > > > gcc/ChangeLog: > > > > * tree-vect-slp.cc (enum slp_oprnd_pattern): new enum for rearrangement > > patterns. > > (try_rearrange_oprnd_info): Detect if a node corresponds to one of the > > patterns. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/aarch64/vect-slp-two-operator.c: New test. > > > > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> > > --- > > > > .../aarch64/vect-slp-two-operator.c | 42 ++++ > > gcc/tree-vect-slp.cc | 234 ++++++++++++++++++ > > 2 files changed, 276 insertions(+) > > create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > new file mode 100644 > > index 00000000000..2db066a0b6e > > --- /dev/null > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > @@ -0,0 +1,42 @@ > > +/* { dg-do compile } */ > > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect -fdump-tree-vect-details" } */ > > + > > +typedef unsigned char uint8_t; > > +typedef unsigned int uint32_t; > > + > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\ > > + int t0 = s0 + s1;\ > > + int t1 = s0 - s1;\ > > + int t2 = s2 + s3;\ > > + int t3 = s2 - s3;\ > > + d0 = t0 + t2;\ > > + d1 = t1 + t3;\ > > + d2 = t0 - t2;\ > > + d3 = t1 - t3;\ > > +} > > + > > +static uint32_t abs2( uint32_t a ) > > +{ > > + uint32_t s = ((a>>15)&0x10001)*0xffff; > > + return (a+s)^s; > > +} > > + > > +void sink(uint32_t tmp[4][4]); > > + > > +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 ) > > +{ > > + uint32_t tmp[4][4]; > > + int sum = 0; > > + for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 ) > > + { > > + uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16); > > + uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16); > > + uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16); > > + uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16); > > + HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 ); > > + } > > + sink(tmp); > > +} > > + > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ > > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > index bf1f467f53f..e395db0e185 100644 > > --- a/gcc/tree-vect-slp.cc > > +++ b/gcc/tree-vect-slp.cc > > @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3. If not see > > #include "tree-vectorizer.h" > > #include "langhooks.h" > > #include "gimple-walk.h" > > +#include "gimple-pretty-print.h" > > #include "dbgcnt.h" > > #include "tree-vector-builder.h" > > #include "vec-perm-indices.h" > > @@ -1829,6 +1830,141 @@ vect_slp_build_two_operator_nodes (slp_tree perm, tree vectype, > > SLP_TREE_CHILDREN (perm).quick_push (child2); > > } > > > > +enum slp_oprnd_pattern > > +{ > > + SLP_OPRND_PATTERN_NONE, > > + SLP_OPRND_PATTERN_ABAB, > > + SLP_OPRND_PATTERN_AABB, > > + SLP_OPRND_PATTERN_ABBA > > +}; > > + > > +/* Check if OPRNDS_INFO has duplicated nodes that correspond to a predefined > > + pattern described by SLP_OPRND_PATTERN and return it. */ > > + > > +static int > > +try_rearrange_oprnd_info (vec<slp_oprnd_info> &oprnds_info, unsigned group_size) > > +{ > > + unsigned i; > > + slp_oprnd_info info; > > + > > + if (oprnds_info.length () != 2 || group_size % 4 != 0) > > + return SLP_OPRND_PATTERN_NONE; > > + > > + if (!oprnds_info[0]->def_stmts[0] > > + || !is_a<gassign *> (oprnds_info[0]->def_stmts[0]->stmt)) > > + return SLP_OPRND_PATTERN_NONE; > > + > > + enum tree_code code > > + = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt); > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > + for (unsigned int j = 0; j < group_size; j += 1) > > + { > > + if (!info->def_stmts[j] > > + || !is_a<gassign *> (info->def_stmts[j]->stmt) > > + || STMT_VINFO_DATA_REF (info->def_stmts[j])) > > + return SLP_OPRND_PATTERN_NONE; > > + /* Don't mix different operations. */ > > + if (gimple_assign_rhs_code (info->def_stmts[j]->stmt) != code) > > + return SLP_OPRND_PATTERN_NONE; > > + } > > + > > + if (gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt) > > + != gimple_assign_rhs_code (oprnds_info[1]->def_stmts[0]->stmt)) > > + return SLP_OPRND_PATTERN_NONE; > > + > > + int pattern = SLP_OPRND_PATTERN_NONE; > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + int cur_pattern = SLP_OPRND_PATTERN_NONE; > > + /* Check for an ABAB... pattern. */ > > + if ((info->def_stmts[j] == info->def_stmts[j + 2]) > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 3]) > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > + cur_pattern = SLP_OPRND_PATTERN_ABAB; > > + /* Check for an AABB... pattern. */ > > + else if ((info->def_stmts[j] == info->def_stmts[j + 1]) > > + && (info->def_stmts[j + 2] == info->def_stmts[j + 3]) > > + && (info->def_stmts[j] != info->def_stmts[j + 2])) > > + cur_pattern = SLP_OPRND_PATTERN_AABB; > > + /* Check for an ABBA... pattern. */ > > + else if ((info->def_stmts[j] == info->def_stmts[j + 3]) > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 2]) > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > + cur_pattern = SLP_OPRND_PATTERN_ABBA; > > + /* Unrecognised pattern. */ > > + else > > + return SLP_OPRND_PATTERN_NONE; > > + > > + if (pattern == SLP_OPRND_PATTERN_NONE) > > + pattern = cur_pattern; > > + /* Multiple patterns detected. */ > > + else if (cur_pattern != pattern) > > + return SLP_OPRND_PATTERN_NONE; > > + } > > + > > + gcc_checking_assert (pattern != SLP_OPRND_PATTERN_NONE); > > + > > + if (dump_enabled_p ()) > > + { > > + if (pattern == SLP_OPRND_PATTERN_ABAB) > > + dump_printf (MSG_NOTE, "ABAB"); > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > + dump_printf (MSG_NOTE, "AABB"); > > + else if (pattern == SLP_OPRND_PATTERN_ABBA) > > + dump_printf (MSG_NOTE, "ABBA"); > > + dump_printf (MSG_NOTE, " pattern detected.\n"); > > + } > > + > > + if (pattern == SLP_OPRND_PATTERN_ABAB || pattern == SLP_OPRND_PATTERN_ABBA) > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + /* Given oprnd[0] -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > + Given oprnd[1] -> C1, D1, C1, D1, C2, D2, C2, D2, ... > > + Create a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... */ > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[1]->def_stmts[j]; > > + oprnds_info[0]->ops[j+2] = oprnds_info[1]->ops[j]; > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+1]; > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+1]; > > + } > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + /* Given oprnd[0] -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > + Given oprnd[1] -> C1, C1, D1, D1, C2, C2, D2, D2, ... > > + Create a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... */ > > + > > + /* The ordering here is at least to some extent arbitrary. > > + A generilized version needs to use some explicit ordering. */ > > + oprnds_info[0]->def_stmts[j+1] = oprnds_info[1]->def_stmts[j]; > > + oprnds_info[0]->ops[j+1] = oprnds_info[1]->ops[j]; > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[0]->def_stmts[j+2]; > > + oprnds_info[0]->ops[j+2] = oprnds_info[0]->ops[j+2]; > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+2]; > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+2]; > > + } > > + > > + if (dump_enabled_p ()) > > + { > > + dump_printf (MSG_NOTE, "Recurse with:\n"); > > + for (unsigned int j = 0; j < group_size; j++) > > + { > > + dump_printf (MSG_NOTE, " "); > > + print_gimple_stmt (dump_file, oprnds_info[0]->def_stmts[j]->stmt, 0); > > + } > > + } > > + > > + /* Since we've merged the two nodes in one, make the second one a copy of > > + the first. */ > > + for (unsigned int j = 0; j < group_size; j++) > > + { > > + oprnds_info[1]->def_stmts[j] = oprnds_info[0]->def_stmts[j]; > > + oprnds_info[1]->ops[j] = oprnds_info[0]->ops[j]; > > + } > > + > > + return pattern; > > +} > > + > > /* Recursively build an SLP tree starting from NODE. > > Fail (and return a value not equal to zero) if def-stmts are not > > isomorphic, require data permutation or are of unsupported types of > > @@ -2409,6 +2545,10 @@ out: > > > > stmt_info = stmts[0]; > > > > + int rearrange_pattern = SLP_OPRND_PATTERN_NONE; > > + if (is_a<gassign *> (stmt_info->stmt)) > > + rearrange_pattern = try_rearrange_oprnd_info (oprnds_info, group_size); > > + > > /* Create SLP_TREE nodes for the definition node/s. */ > > FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info) > > { > > @@ -2669,6 +2809,100 @@ fail: > > *tree_size += this_tree_size + 1; > > *max_nunits = this_max_nunits; > > > > + /* If we applied any rearrangmenets then we need to reconstruct the original > > + elements with an additional permutation layer. */ > > + if (rearrange_pattern != SLP_OPRND_PATTERN_NONE) > > + { > > + slp_tree one = new _slp_tree; > > + slp_tree two = new _slp_tree; > > + SLP_TREE_DEF_TYPE (one) = vect_internal_def; > > + SLP_TREE_DEF_TYPE (two) = vect_internal_def; > > + SLP_TREE_VECTYPE (one) = vectype; > > + SLP_TREE_VECTYPE (two) = vectype; > > + SLP_TREE_CHILDREN (one).safe_splice (children); > > + SLP_TREE_CHILDREN (two).safe_splice (children); > > + > > + SLP_TREE_CODE (one) = VEC_PERM_EXPR; > > + SLP_TREE_CODE (two) = VEC_PERM_EXPR; > > + SLP_TREE_REPRESENTATIVE (one) = stmts[0]; > > + SLP_TREE_REPRESENTATIVE (two) = stmts[2]; > > + lane_permutation_t &perm_one = SLP_TREE_LANE_PERMUTATION (one); > > + lane_permutation_t &perm_two = SLP_TREE_LANE_PERMUTATION (two); > > + > > + if (rearrange_pattern == SLP_OPRND_PATTERN_ABAB) > > + { > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > + Create node "one" -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > + Create node "two" -> C1, D1, C1, D1, C2, D2, C2, D2, ... */ > > + > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + } > > + } > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_AABB) > > + { > > + /* Given a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... > > + Create node "one" -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > + Create node "two" -> C1, C1, D1, D1, C2, C2, D2, D2, ... */ > > + > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > + > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + } > > + } > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_ABBA) > > + { > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > + Create node "one" -> A1, B1, B1, A1, A2, B2, B2, A2, ... > > + Create node "two" -> C1, D1, D1, C1, C2, D2, D2, C2, ... */ > > + > > + for (unsigned int j = 0; j < group_size; j += 4) > > + { > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > + > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > + } > > + } > > + > > + slp_tree child; > > + FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (two), i, child) > > + SLP_TREE_REF_COUNT (child)++; > > + > > + node = vect_create_new_slp_node (node, stmts, 2); > > + SLP_TREE_VECTYPE (node) = vectype; > > + SLP_TREE_CHILDREN (node).quick_push (one); > > + SLP_TREE_CHILDREN (node).quick_push (two); > > + > > + SLP_TREE_LANES (one) = stmts.length (); > > + SLP_TREE_LANES (two) = stmts.length (); > > + > > + children.truncate (0); > > + children.safe_splice (SLP_TREE_CHILDREN (node)); > > + } > > + > > if (two_operators) > > { > > /* ??? We'd likely want to either cache in bst_map sth like > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, > Frankenstrasse 146, 90461 Nuernberg, Germany; > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c new file mode 100644 index 00000000000..2db066a0b6e --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c @@ -0,0 +1,42 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect -fdump-tree-vect-details" } */ + +typedef unsigned char uint8_t; +typedef unsigned int uint32_t; + +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\ + int t0 = s0 + s1;\ + int t1 = s0 - s1;\ + int t2 = s2 + s3;\ + int t3 = s2 - s3;\ + d0 = t0 + t2;\ + d1 = t1 + t3;\ + d2 = t0 - t2;\ + d3 = t1 - t3;\ +} + +static uint32_t abs2( uint32_t a ) +{ + uint32_t s = ((a>>15)&0x10001)*0xffff; + return (a+s)^s; +} + +void sink(uint32_t tmp[4][4]); + +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2 ) +{ + uint32_t tmp[4][4]; + int sum = 0; + for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 ) + { + uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16); + uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16); + uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16); + uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16); + HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0,a1,a2,a3 ); + } + sink(tmp); +} + +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } */ +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index bf1f467f53f..e395db0e185 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3. If not see #include "tree-vectorizer.h" #include "langhooks.h" #include "gimple-walk.h" +#include "gimple-pretty-print.h" #include "dbgcnt.h" #include "tree-vector-builder.h" #include "vec-perm-indices.h" @@ -1829,6 +1830,141 @@ vect_slp_build_two_operator_nodes (slp_tree perm, tree vectype, SLP_TREE_CHILDREN (perm).quick_push (child2); } +enum slp_oprnd_pattern +{ + SLP_OPRND_PATTERN_NONE, + SLP_OPRND_PATTERN_ABAB, + SLP_OPRND_PATTERN_AABB, + SLP_OPRND_PATTERN_ABBA +}; + +/* Check if OPRNDS_INFO has duplicated nodes that correspond to a predefined + pattern described by SLP_OPRND_PATTERN and return it. */ + +static int +try_rearrange_oprnd_info (vec<slp_oprnd_info> &oprnds_info, unsigned group_size) +{ + unsigned i; + slp_oprnd_info info; + + if (oprnds_info.length () != 2 || group_size % 4 != 0) + return SLP_OPRND_PATTERN_NONE; + + if (!oprnds_info[0]->def_stmts[0] + || !is_a<gassign *> (oprnds_info[0]->def_stmts[0]->stmt)) + return SLP_OPRND_PATTERN_NONE; + + enum tree_code code + = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt); + FOR_EACH_VEC_ELT (oprnds_info, i, info) + for (unsigned int j = 0; j < group_size; j += 1) + { + if (!info->def_stmts[j] + || !is_a<gassign *> (info->def_stmts[j]->stmt) + || STMT_VINFO_DATA_REF (info->def_stmts[j])) + return SLP_OPRND_PATTERN_NONE; + /* Don't mix different operations. */ + if (gimple_assign_rhs_code (info->def_stmts[j]->stmt) != code) + return SLP_OPRND_PATTERN_NONE; + } + + if (gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt) + != gimple_assign_rhs_code (oprnds_info[1]->def_stmts[0]->stmt)) + return SLP_OPRND_PATTERN_NONE; + + int pattern = SLP_OPRND_PATTERN_NONE; + FOR_EACH_VEC_ELT (oprnds_info, i, info) + for (unsigned int j = 0; j < group_size; j += 4) + { + int cur_pattern = SLP_OPRND_PATTERN_NONE; + /* Check for an ABAB... pattern. */ + if ((info->def_stmts[j] == info->def_stmts[j + 2]) + && (info->def_stmts[j + 1] == info->def_stmts[j + 3]) + && (info->def_stmts[j] != info->def_stmts[j + 1])) + cur_pattern = SLP_OPRND_PATTERN_ABAB; + /* Check for an AABB... pattern. */ + else if ((info->def_stmts[j] == info->def_stmts[j + 1]) + && (info->def_stmts[j + 2] == info->def_stmts[j + 3]) + && (info->def_stmts[j] != info->def_stmts[j + 2])) + cur_pattern = SLP_OPRND_PATTERN_AABB; + /* Check for an ABBA... pattern. */ + else if ((info->def_stmts[j] == info->def_stmts[j + 3]) + && (info->def_stmts[j + 1] == info->def_stmts[j + 2]) + && (info->def_stmts[j] != info->def_stmts[j + 1])) + cur_pattern = SLP_OPRND_PATTERN_ABBA; + /* Unrecognised pattern. */ + else + return SLP_OPRND_PATTERN_NONE; + + if (pattern == SLP_OPRND_PATTERN_NONE) + pattern = cur_pattern; + /* Multiple patterns detected. */ + else if (cur_pattern != pattern) + return SLP_OPRND_PATTERN_NONE; + } + + gcc_checking_assert (pattern != SLP_OPRND_PATTERN_NONE); + + if (dump_enabled_p ()) + { + if (pattern == SLP_OPRND_PATTERN_ABAB) + dump_printf (MSG_NOTE, "ABAB"); + else if (pattern == SLP_OPRND_PATTERN_AABB) + dump_printf (MSG_NOTE, "AABB"); + else if (pattern == SLP_OPRND_PATTERN_ABBA) + dump_printf (MSG_NOTE, "ABBA"); + dump_printf (MSG_NOTE, " pattern detected.\n"); + } + + if (pattern == SLP_OPRND_PATTERN_ABAB || pattern == SLP_OPRND_PATTERN_ABBA) + for (unsigned int j = 0; j < group_size; j += 4) + { + /* Given oprnd[0] -> A1, B1, A1, B1, A2, B2, A2, B2, ... + Given oprnd[1] -> C1, D1, C1, D1, C2, D2, C2, D2, ... + Create a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... */ + oprnds_info[0]->def_stmts[j+2] = oprnds_info[1]->def_stmts[j]; + oprnds_info[0]->ops[j+2] = oprnds_info[1]->ops[j]; + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+1]; + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+1]; + } + else if (pattern == SLP_OPRND_PATTERN_AABB) + for (unsigned int j = 0; j < group_size; j += 4) + { + /* Given oprnd[0] -> A1, A1, B1, B1, A2, A2, B2, B2, ... + Given oprnd[1] -> C1, C1, D1, D1, C2, C2, D2, D2, ... + Create a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... */ + + /* The ordering here is at least to some extent arbitrary. + A generilized version needs to use some explicit ordering. */ + oprnds_info[0]->def_stmts[j+1] = oprnds_info[1]->def_stmts[j]; + oprnds_info[0]->ops[j+1] = oprnds_info[1]->ops[j]; + oprnds_info[0]->def_stmts[j+2] = oprnds_info[0]->def_stmts[j+2]; + oprnds_info[0]->ops[j+2] = oprnds_info[0]->ops[j+2]; + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+2]; + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+2]; + } + + if (dump_enabled_p ()) + { + dump_printf (MSG_NOTE, "Recurse with:\n"); + for (unsigned int j = 0; j < group_size; j++) + { + dump_printf (MSG_NOTE, " "); + print_gimple_stmt (dump_file, oprnds_info[0]->def_stmts[j]->stmt, 0); + } + } + + /* Since we've merged the two nodes in one, make the second one a copy of + the first. */ + for (unsigned int j = 0; j < group_size; j++) + { + oprnds_info[1]->def_stmts[j] = oprnds_info[0]->def_stmts[j]; + oprnds_info[1]->ops[j] = oprnds_info[0]->ops[j]; + } + + return pattern; +} + /* Recursively build an SLP tree starting from NODE. Fail (and return a value not equal to zero) if def-stmts are not isomorphic, require data permutation or are of unsupported types of @@ -2409,6 +2545,10 @@ out: stmt_info = stmts[0]; + int rearrange_pattern = SLP_OPRND_PATTERN_NONE; + if (is_a<gassign *> (stmt_info->stmt)) + rearrange_pattern = try_rearrange_oprnd_info (oprnds_info, group_size); + /* Create SLP_TREE nodes for the definition node/s. */ FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info) { @@ -2669,6 +2809,100 @@ fail: *tree_size += this_tree_size + 1; *max_nunits = this_max_nunits; + /* If we applied any rearrangmenets then we need to reconstruct the original + elements with an additional permutation layer. */ + if (rearrange_pattern != SLP_OPRND_PATTERN_NONE) + { + slp_tree one = new _slp_tree; + slp_tree two = new _slp_tree; + SLP_TREE_DEF_TYPE (one) = vect_internal_def; + SLP_TREE_DEF_TYPE (two) = vect_internal_def; + SLP_TREE_VECTYPE (one) = vectype; + SLP_TREE_VECTYPE (two) = vectype; + SLP_TREE_CHILDREN (one).safe_splice (children); + SLP_TREE_CHILDREN (two).safe_splice (children); + + SLP_TREE_CODE (one) = VEC_PERM_EXPR; + SLP_TREE_CODE (two) = VEC_PERM_EXPR; + SLP_TREE_REPRESENTATIVE (one) = stmts[0]; + SLP_TREE_REPRESENTATIVE (two) = stmts[2]; + lane_permutation_t &perm_one = SLP_TREE_LANE_PERMUTATION (one); + lane_permutation_t &perm_two = SLP_TREE_LANE_PERMUTATION (two); + + if (rearrange_pattern == SLP_OPRND_PATTERN_ABAB) + { + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... + Create node "one" -> A1, B1, A1, B1, A2, B2, A2, B2, ... + Create node "two" -> C1, D1, C1, D1, C2, D2, C2, D2, ... */ + + for (unsigned int j = 0; j < group_size; j += 4) + { + perm_one.safe_push (std::make_pair (0, j + 0)); + perm_one.safe_push (std::make_pair (0, j + 1)); + perm_one.safe_push (std::make_pair (0, j + 0)); + perm_one.safe_push (std::make_pair (0, j + 1)); + + perm_two.safe_push (std::make_pair (0, j + 2)); + perm_two.safe_push (std::make_pair (0, j + 3)); + perm_two.safe_push (std::make_pair (0, j + 2)); + perm_two.safe_push (std::make_pair (0, j + 3)); + } + } + else if (rearrange_pattern == SLP_OPRND_PATTERN_AABB) + { + /* Given a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... + Create node "one" -> A1, A1, B1, B1, A2, A2, B2, B2, ... + Create node "two" -> C1, C1, D1, D1, C2, C2, D2, D2, ... */ + + for (unsigned int j = 0; j < group_size; j += 4) + { + perm_one.safe_push (std::make_pair (0, j + 0)); + perm_one.safe_push (std::make_pair (0, j + 0)); + perm_one.safe_push (std::make_pair (0, j + 2)); + perm_one.safe_push (std::make_pair (0, j + 2)); + + perm_two.safe_push (std::make_pair (0, j + 1)); + perm_two.safe_push (std::make_pair (0, j + 1)); + perm_two.safe_push (std::make_pair (0, j + 3)); + perm_two.safe_push (std::make_pair (0, j + 3)); + } + } + else if (rearrange_pattern == SLP_OPRND_PATTERN_ABBA) + { + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... + Create node "one" -> A1, B1, B1, A1, A2, B2, B2, A2, ... + Create node "two" -> C1, D1, D1, C1, C2, D2, D2, C2, ... */ + + for (unsigned int j = 0; j < group_size; j += 4) + { + perm_one.safe_push (std::make_pair (0, j + 0)); + perm_one.safe_push (std::make_pair (0, j + 1)); + perm_one.safe_push (std::make_pair (0, j + 1)); + perm_one.safe_push (std::make_pair (0, j + 0)); + + perm_two.safe_push (std::make_pair (0, j + 2)); + perm_two.safe_push (std::make_pair (0, j + 3)); + perm_two.safe_push (std::make_pair (0, j + 3)); + perm_two.safe_push (std::make_pair (0, j + 2)); + } + } + + slp_tree child; + FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (two), i, child) + SLP_TREE_REF_COUNT (child)++; + + node = vect_create_new_slp_node (node, stmts, 2); + SLP_TREE_VECTYPE (node) = vectype; + SLP_TREE_CHILDREN (node).quick_push (one); + SLP_TREE_CHILDREN (node).quick_push (two); + + SLP_TREE_LANES (one) = stmts.length (); + SLP_TREE_LANES (two) = stmts.length (); + + children.truncate (0); + children.safe_splice (SLP_TREE_CHILDREN (node)); + } + if (two_operators) { /* ??? We'd likely want to either cache in bst_map sth like
This change adds a function that checks for SLP nodes with multiple occurrences of the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the node so that there are no duplicates. A vec_perm is then introduced to recreate the original ordering. These duplicates can appear due to how two_operators nodes are handled, and they prevent vectorization in some cases. This targets the vectorization of the SPEC2017 x264 pixel_satd functions. In some processors a larger than 10% improvement on x264 has been observed. See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138 gcc/ChangeLog: * tree-vect-slp.cc (enum slp_oprnd_pattern): new enum for rearrangement patterns. (try_rearrange_oprnd_info): Detect if a node corresponds to one of the patterns. gcc/testsuite/ChangeLog: * gcc.target/aarch64/vect-slp-two-operator.c: New test. Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu> --- .../aarch64/vect-slp-two-operator.c | 42 ++++ gcc/tree-vect-slp.cc | 234 ++++++++++++++++++ 2 files changed, 276 insertions(+) create mode 100644 gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c