
[v2] Adding new pass slp_perm_simplify

Message ID 20241104082501.835594-1-christoph.muellner@vrull.eu
State New
Series [v2] Adding new pass slp_perm_simplify

Commit Message

Christoph Müllner Nov. 4, 2024, 8:25 a.m. UTC
This patch merges two isomorphic vector sequences by exploiting the
redundancy in the lane utilization of these sequences.
This redundancy comes from the way certain scalar statements end up
being vectorized: two VEC_PERMs on top, binary operations on both of
them, and a final VEC_PERM to create the result.
Here is an example of such a sequence:

[...] // v_in = {e0, e1, e2, e3}
v_1 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}> // v_1 = {e1, e3, e1, e3}
v_2 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}> // v_2 = {e0, e2, e0, e2}
v_y = v_2 - v_1                           // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}
v_x = v_2 + v_1                           // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}> // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}

To remove the redundancy, lanes 2 and 3 can be freed, which allows us to
rewrite the last statement as:
v_out = VEC_PERM <v_x, v_y, {0, 1, 4, 5}> // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}

The cost of eliminating this redundancy in the lane utilization is that
lowering the VEC_PERM expression could become more expensive because of
the tighter packing of the lanes.  Therefore this optimization is not
applied on its own, but only if we identify two such sequences that can
be merged.

Once all candidate sequences have been identified, we try to pair them
up, so that the lanes freed in one sequence can carry the data of the
other.  On success, we replace the 2x (2x VEC_PERM + 2x BINOP +
1x VEC_PERM) statements of such a pair by 2x VEC_PERM + 2x BINOP +
2x VEC_PERM, i.e. 2x VEC_PERM and 2x BINOP are eliminated (see the
sketch below).
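
For illustration, pairing the example sequence above with a second,
isomorphic sequence that operates on another input vector w_in =
{f0, f1, f2, f3} would result in roughly the following (the names w_in,
f0..f3 and w_out are only chosen for this sketch):

v_1 = VEC_PERM <v_in, w_in, {1, 3, 5, 7}> // v_1 = {e1, e3, f1, f3}
v_2 = VEC_PERM <v_in, w_in, {0, 2, 4, 6}> // v_2 = {e0, e2, f0, f2}
v_y = v_2 - v_1                           // v_y = {e0-e1, e2-e3, f0-f1, f2-f3}
v_x = v_2 + v_1                           // v_x = {e0+e1, e2+e3, f0+f1, f2+f3}
v_out = VEC_PERM <v_x, v_y, {0, 1, 4, 5}> // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
w_out = VEC_PERM <v_x, v_y, {2, 3, 6, 7}> // w_out = {f0+f1, f2+f3, f0-f1, f2-f3}

The VEC_PERMs and binary operations of the second sequence become dead
and are left for later passes to clean up; the new selector
{ 2, 3, 6, 7 } is what the new test case scans for.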

This targets x264_pixel_satd_8x4, which calculates the sum of absolute
transformed differences (SATD) using a Hadamard transformation.
We have seen an 8% speedup on SPEC's x264 on a 5950X (x86-64) and a 7%
speedup on an AArch64 machine.

Bootstrapped and reg-tested on x86-64 and AArch64 (all languages).

gcc/ChangeLog:

	* Makefile.in (OBJS): Add tree-vect-perm-simplify.o.
	* passes.def: Add pass_slp_perm_simplify after pass_slp_vectorize.
	* tree-pass.h (make_pass_slp_perm_simplify): Add prototype.
	* tree-vectorizer.cc (class pass_slp_perm_simplify): New pass
	class.
	(pass_slp_perm_simplify::execute): New entry function for the
	pass.
	(make_pass_slp_perm_simplify): New helper to instantiate
	pass_slp_perm_simplify.
	* tree-vectorizer.h (vect_slp_permute_simplify): Add prototype.
	* tree-vect-perm-simplify.cc: New file.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/satd-hadamard.c: New test.
	* gcc.target/aarch64/sve/satd-hadamard.c: New test.

Signed-off-by: Christoph Müllner <christoph.muellner@vrull.eu>
---
Manolis Tsamis was the patch's initial author before I took it over.

Changes from v1:
* Converted SLP code into tree code (lane redundancy detection and
  elimination).
* Moved code from tree-vect-slp.cc into a new pass (from where it could
  be moved elsewhere).
* Only deduplicate lanes if sequences will be merged later on.
* Split the functionality more strictly into analysis and transformation
  parts.
* Cleanups, improved dump messages.

 gcc/Makefile.in                               |   1 +
 gcc/passes.def                                |   1 +
 gcc/testsuite/gcc.dg/vect/satd-hadamard.c     |  42 ++
 .../gcc.target/aarch64/sve/satd-hadamard.c    |   3 +
 gcc/tree-pass.h                               |   1 +
 gcc/tree-vect-perm-simplify.cc                | 536 ++++++++++++++++++
 gcc/tree-vectorizer.cc                        |  50 ++
 gcc/tree-vectorizer.h                         |   3 +
 8 files changed, 637 insertions(+)
 create mode 100644 gcc/testsuite/gcc.dg/vect/satd-hadamard.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
 create mode 100644 gcc/tree-vect-perm-simplify.cc

Patch

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index eb578210411..62e37dc7690 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1789,6 +1789,7 @@  OBJS = \
 	tree-vect-stmts.o \
 	tree-vect-loop.o \
 	tree-vect-loop-manip.o \
+	tree-vect-perm-simplify.o \
 	tree-vect-slp.o \
 	tree-vect-slp-patterns.o \
 	tree-vectorizer.o \
diff --git a/gcc/passes.def b/gcc/passes.def
index 7d01227eed1..a022b3fac57 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -318,6 +318,7 @@  along with GCC; see the file COPYING3.  If not see
 	      NEXT_PASS (pass_dse);
 	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_slp_vectorize);
+	  NEXT_PASS (pass_slp_perm_simplify);
 	  NEXT_PASS (pass_loop_prefetch);
 	  /* Run IVOPTs after the last pass that uses data-reference analysis
 	     as that doesn't handle TARGET_MEM_REFs.  */
diff --git a/gcc/testsuite/gcc.dg/vect/satd-hadamard.c b/gcc/testsuite/gcc.dg/vect/satd-hadamard.c
new file mode 100644
index 00000000000..759498ddb73
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/satd-hadamard.c
@@ -0,0 +1,42 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -fdump-tree-slp_perm_simplify-details" } */
+
+#include <stdint.h>
+
+#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\
+    int t0 = s0 + s1;\
+    int t1 = s0 - s1;\
+    int t2 = s2 + s3;\
+    int t3 = s2 - s3;\
+    d0 = t0 + t2;\
+    d1 = t1 + t3;\
+    d2 = t0 - t2;\
+    d3 = t1 - t3;\
+}
+
+int
+x264_pixel_satd_8x4_simplified (uint8_t *pix1, int i_pix1, uint8_t *pix2, int i_pix2)
+{
+  uint32_t tmp[4][4];
+  uint32_t a0, a1, a2, a3;
+  int sum = 0;
+
+  for (int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2)
+    {
+      a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16);
+      a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16);
+      a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16);
+      a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16);
+      HADAMARD4(tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], a0, a1, a2, a3);
+    }
+
+  for (int i = 0; i < 4; i++)
+    {
+      HADAMARD4(a0, a1, a2, a3, tmp[0][i], tmp[1][i], tmp[2][i], tmp[3][i]);
+      sum += a0 + a1 + a2 + a3;
+    }
+
+  return (((uint16_t)sum) + ((uint32_t)sum>>16)) >> 1;
+}
+
+/* { dg-final { scan-tree-dump "VEC_PERM_EXPR.*{ 2, 3, 6, 7 }" "slp_perm_simplify" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
new file mode 100644
index 00000000000..e0ef093afa6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
@@ -0,0 +1,3 @@ 
+#include "../../../gcc.dg/vect/satd-hadamard.c"
+
+/* { dg-final { scan-assembler-not {\ttbl\t} } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index a928cbe4557..555e41a91ce 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -390,6 +390,7 @@  extern gimple_opt_pass *make_pass_if_to_switch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_distribution (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_simduid_cleanup (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_slp_perm_simplify (gcc::context *ctx);
 extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
diff --git a/gcc/tree-vect-perm-simplify.cc b/gcc/tree-vect-perm-simplify.cc
new file mode 100644
index 00000000000..cbfd4050e23
--- /dev/null
+++ b/gcc/tree-vect-perm-simplify.cc
@@ -0,0 +1,536 @@ 
+/* Vector Permutation Simplification
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#define INCLUDE_ALGORITHM
+#define INCLUDE_MEMORY
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "rtl.h"
+#include "tree.h"
+#include "gimple.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "optabs-tree.h"
+#include "insn-config.h"
+#include "fold-const.h"
+#include "function.h"
+#include "gimple-iterator.h"
+#include "cfgloop.h"
+#include "tree-vectorizer.h"
+#include "langhooks.h"
+#include "gimple-walk.h"
+#include "tree-vector-builder.h"
+#include "vec-perm-indices.h"
+#include "gimple-fold.h"
+#include "internal-fn.h"
+#include "cfganal.h"
+#include "tree-cfg.h"
+#include "gimple-pretty-print.h"
+
+typedef hash_map<int_hash <unsigned HOST_WIDE_INT, -1U>,
+			   unsigned HOST_WIDE_INT> lane_map_t;
+
+/* Data structure that describes a simplifiable vectorized permute sequence:
+     v_in = {e0, e1, e2, e3}
+     v_1 = VEC_PERM <v_in, v_in, v_1_sel>
+     v_2 = VEC_PERM <v_in, v_in, v_2_sel>
+     v_y = v_2 - v_1
+     v_x = v_2 + v_1
+     v_out = VEC_PERM <v_x, v_y, v_out_sel>  */
+
+struct _vec_perm_simplify_seq
+{
+  gassign *v_1_stmt;
+  gassign *v_2_stmt;
+  gassign *v_x_stmt;
+  gassign *v_y_stmt;
+  /* Final permute statement.  */
+  gassign *stmt;
+  /* Elements of each vector and selector.  */
+  unsigned HOST_WIDE_INT nelts;
+  /* New selector indices for stmt.  */
+  vec<unsigned HOST_WIDE_INT> new_sel;
+};
+typedef struct _vec_perm_simplify_seq *vec_perm_simplify_seq;
+
+/* If NAME is an SSA_NAME defined by an assignment, return that assignment.
+   If SINGLE_USE_ONLY is true and NAME has multiple uses, return NULL.  */
+
+static gassign *
+get_tree_def (tree name, bool single_use_only)
+{
+  if (TREE_CODE (name) != SSA_NAME)
+    return NULL;
+
+  gimple *def_stmt = SSA_NAME_DEF_STMT (name);
+  gassign *assign = dyn_cast <gassign *> (def_stmt);
+
+  if (!assign)
+    return NULL;
+
+  if (single_use_only && !has_single_use (name))
+    return NULL;
+
+  return assign;
+}
+
+/* Get an index map from the provided vector permute selector
+   and return the number of unique indices.
+   E.g.: { 1, 3, 1, 3 } -> <0, 1, 0, 1>, 2
+         { 0, 2, 0, 2 } -> <0, 1, 0, 1>, 2
+         { 3, 2, 1, 0 } -> <0, 1, 2, 3>, 4  */
+
+static unsigned HOST_WIDE_INT
+get_vect_selector_index_map (tree sel, lane_map_t *index_map)
+{
+  gcc_assert (VECTOR_CST_NELTS (sel).is_constant ());
+  unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant ();
+  unsigned HOST_WIDE_INT n = 0;
+
+  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
+    {
+      /* Extract the i-th value from the selector.  */
+      tree sel_cst_tree = VECTOR_CST_ELT (sel, i);
+      unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree);
+
+      unsigned HOST_WIDE_INT j = 0;
+      for (; j <= i; j++)
+	{
+	  tree prev_sel_cst_tree = VECTOR_CST_ELT (sel, j);
+	  unsigned HOST_WIDE_INT prev_sel_cst
+	    = TREE_INT_CST_LOW (prev_sel_cst_tree);
+	  if (prev_sel_cst == sel_cst)
+	    break;
+	}
+      index_map->put (i, j);
+      n += (i == j) ? 1 : 0;
+    }
+
+  return n;
+}
+
+/* Search for opportunities to free half of the lanes in the following
+   pattern:
+
+     v_in = {e0, e1, e2, e3}
+     v_1 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}>
+     // v_1 = {e1, e3, e1, e3}
+     v_2 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}>
+     // v_2 = {e0, e2, e0, e2}
+
+     v_x = v_2 + v_1
+     // v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
+     v_y = v_2 - v_1
+     // v_y = {e0-e1, e2-e3, e0-e1, e2-e3}
+
+     v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}>
+     // v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
+
+   Characteristic properties:
+   - v_1 and v_2 are created from the same input vector v_in and introduce
+     the lane duplication (in the selection operand) that we can eliminate.
+   - v_x and v_y are results of lane-preserving operations that use v_1
+     and v_2 as inputs.
+   - v_out is created by selecting from the duplicated lanes.  */
+
+static bool
+recognise_vec_perm_simplify_pattern (gassign *stmt, vec_perm_simplify_seq *seq)
+{
+  if (gimple_assign_rhs_code (stmt) != VEC_PERM_EXPR)
+    return false;
+
+  /* Decompose the final vec permute statement.  */
+  tree v_x = gimple_assign_rhs1 (stmt);
+  tree v_y = gimple_assign_rhs2 (stmt);
+  tree sel = gimple_assign_rhs3 (stmt);
+
+  if (!VECTOR_CST_NELTS (sel).is_constant ())
+    return false;
+
+  unsigned HOST_WIDE_INT nelts = VECTOR_CST_NELTS (sel).to_constant ();
+
+  /* Lookup the definition of v_x and v_y.  */
+  gassign *v_x_stmt = get_tree_def (v_x, true);
+  gassign *v_y_stmt = get_tree_def (v_y, true);
+  if (!v_x_stmt || !v_y_stmt)
+    return false;
+
+  /* Check the operations that define v_x and v_y.  */
+  if (TREE_CODE_CLASS (gimple_assign_rhs_code (v_x_stmt)) != tcc_binary
+      || TREE_CODE_CLASS (gimple_assign_rhs_code (v_y_stmt)) != tcc_binary)
+    return false;
+
+  tree v_1 = gimple_assign_rhs1 (v_x_stmt);
+  tree v_2 = gimple_assign_rhs2 (v_x_stmt);
+
+  if (v_x_stmt == v_y_stmt
+      || TREE_CODE (v_1) != SSA_NAME
+      || TREE_CODE (v_2) != SSA_NAME
+      || v_1 != gimple_assign_rhs1 (v_y_stmt)
+      || v_2 != gimple_assign_rhs2 (v_y_stmt)
+      || num_imm_uses (v_1) != 2
+      || num_imm_uses (v_2) != 2)
+    return false;
+
+  gassign *v_1_stmt = get_tree_def (v_1, false);
+  gassign *v_2_stmt = get_tree_def (v_2, false);
+
+  if (!v_1_stmt || gimple_assign_rhs_code (v_1_stmt) != VEC_PERM_EXPR
+      || !v_2_stmt || gimple_assign_rhs_code (v_2_stmt) != VEC_PERM_EXPR)
+    return false;
+
+  /* Decompose initial VEC_PERM_EXPRs.  */
+  tree v_in = gimple_assign_rhs1 (v_1_stmt);
+  tree v_1_sel = gimple_assign_rhs3 (v_1_stmt);
+  tree v_2_sel = gimple_assign_rhs3 (v_2_stmt);
+  if (v_in != gimple_assign_rhs2 (v_1_stmt)
+      || v_in != gimple_assign_rhs1 (v_2_stmt)
+      || v_in != gimple_assign_rhs2 (v_2_stmt))
+    return false;
+
+  if (!VECTOR_CST_NELTS (v_1_sel).is_constant ()
+      || !VECTOR_CST_NELTS (v_2_sel).is_constant ())
+    return false;
+
+  if (nelts != VECTOR_CST_NELTS (v_1_sel).to_constant ()
+      || nelts != VECTOR_CST_NELTS (v_2_sel).to_constant ())
+    return false;
+
+  /* Now check permutation selection operands.  */
+  lane_map_t v_1_lane_map, v_2_lane_map;
+  unsigned HOST_WIDE_INT v_1_lanes, v_2_lanes;
+  v_1_lanes = get_vect_selector_index_map (v_1_sel, &v_1_lane_map);
+  v_2_lanes = get_vect_selector_index_map (v_2_sel, &v_2_lane_map);
+
+  /* Check if we could free up half of the lanes.  */
+  if (v_1_lanes != v_2_lanes || v_1_lanes > (nelts / 2))
+    return false;
+
+  if (dump_enabled_p ())
+    {
+      fprintf (dump_file, "Found matching sequence ending with INSN: ");
+      print_gimple_stmt (dump_file, stmt, 0);
+
+      if (dump_flags & TDF_DETAILS)
+	fprintf (dump_file, "\tPossible new selector: { ");
+    }
+
+  *seq = XNEW (struct _vec_perm_simplify_seq);
+  (*seq)->stmt = stmt;
+  (*seq)->v_1_stmt = v_1_stmt;
+  (*seq)->v_2_stmt = v_2_stmt;
+  (*seq)->v_x_stmt = v_x_stmt;
+  (*seq)->v_y_stmt = v_y_stmt;
+  (*seq)->nelts = nelts;
+  (*seq)->new_sel = vNULL;
+
+  for (unsigned HOST_WIDE_INT i = 0; i < nelts; i++)
+    {
+      /* Extract the i-th value from the selector.  */
+      tree sel_cst_tree = VECTOR_CST_ELT (sel, i);
+      unsigned HOST_WIDE_INT sel_cst = TREE_INT_CST_LOW (sel_cst_tree);
+
+      unsigned HOST_WIDE_INT j;
+      if (sel_cst < nelts)
+	j = *v_1_lane_map.get (sel_cst);
+      else
+	j = *v_2_lane_map.get (sel_cst - nelts) + nelts;
+
+      (*seq)->new_sel.safe_push (j);
+
+      if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, HOST_WIDE_INT_PRINT_UNSIGNED, j);
+	  if (i != (nelts - 1))
+	    fprintf (dump_file, ", ");
+	}
+    }
+
+  if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, " }\n");
+
+  return true;
+}
+
+/* Test if we can merge two simplifiable vec permute sequences.  */
+
+static bool
+vect_perm_seq_can_merge_p (vec_perm_simplify_seq seq1,
+			   vec_perm_simplify_seq seq2)
+{
+  if (seq1->nelts != seq2->nelts)
+    return false;
+
+  if (gimple_assign_rhs_code (seq1->v_x_stmt)
+      != gimple_assign_rhs_code (seq2->v_x_stmt))
+    return false;
+
+  if (gimple_assign_rhs_code (seq1->v_y_stmt)
+      != gimple_assign_rhs_code (seq2->v_y_stmt))
+    return false;
+
+  if (dump_enabled_p ())
+    {
+      fprintf (dump_file, "Found mergable sequence pair.\n");
+    }
+
+  return true;
+}
+
+/* Merge the two given simplifiable vec permute sequences.  */
+
+static void
+vect_perm_seq_merge (vec_perm_simplify_seq seq1, vec_perm_simplify_seq seq2)
+{
+  unsigned HOST_WIDE_INT nelts = seq1->nelts;
+  tree mask1 = gimple_assign_rhs3 (seq1->stmt);
+  tree mask2 = gimple_assign_rhs3 (seq2->stmt);
+
+  /* Calculate the lanes that the first permute needs.  */
+  lane_map_t mask1_lanes;
+  unsigned HOST_WIDE_INT i;
+  for (i = 0; i < nelts; i++)
+    {
+      tree val = VECTOR_CST_ELT (mask1, i);
+      gcc_assert (TREE_CODE (val) == INTEGER_CST);
+      unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) & (nelts - 1);
+      mask1_lanes.put (j, j);
+    }
+
+  /* Calculate the lanes that the second permute needs and rewrite
+     them to use the lanes left unused by the first permute.  */
+  lane_map_t mask2_lanes;
+  unsigned HOST_WIDE_INT lane_gen = 0;
+  for (i = 0; i < nelts; i++)
+    {
+      tree val = VECTOR_CST_ELT (mask2, i);
+      gcc_assert (TREE_CODE (val) == INTEGER_CST);
+      unsigned HOST_WIDE_INT j = TREE_INT_CST_LOW (val) & (nelts - 1);
+      bool existed;
+      unsigned HOST_WIDE_INT &rewrite_lane
+	= mask2_lanes.get_or_insert (j, &existed);
+      if (!existed)
+	{
+	  while (mask1_lanes.get (lane_gen))
+	    lane_gen++;
+	  if (lane_gen >= nelts)
+	    break;
+	  rewrite_lane = lane_gen;
+	  lane_gen++;
+	}
+    }
+
+  /* Ensure we don't need more than one permute.  */
+  gcc_assert (i >= nelts);
+
+  vec_perm_builder sel (nelts, nelts, 1);
+  vec_perm_builder sel_1 (nelts, nelts, 1);
+  vec_perm_builder sel_2 (nelts, nelts, 1);
+
+  /* Rewrite the permutations based on MASK1_LANES/MASK2_LANES.  */
+  for (i = 0; i < nelts; i++)
+    {
+      /* Calculate the new mask for SEQ2's final permute statement.  */
+      tree val = VECTOR_CST_ELT (mask2, i);
+      unsigned HOST_WIDE_INT lane
+	= TREE_INT_CST_LOW (val) & (2 * nelts - 1);
+      unsigned off = (lane < nelts) ? 0 : nelts;
+      unsigned HOST_WIDE_INT new_lane
+	= *mask2_lanes.get (lane - off) + off;
+      sel.quick_push (new_lane);
+
+      /* Calculate the new masks for SEQ1's V_1_STMT and V_2_STMT.  */
+      bool use_src1 = mask1_lanes.get (i);
+      tree mask1 = gimple_assign_rhs3 (use_src1 ? seq1->v_1_stmt
+						: seq2->v_1_stmt);
+      tree mask2 = gimple_assign_rhs3 (use_src1 ? seq1->v_2_stmt
+						: seq2->v_2_stmt);
+      unsigned HOST_WIDE_INT lane1
+	= TREE_INT_CST_LOW (VECTOR_CST_ELT (mask1, i)) & (nelts - 1);
+      unsigned HOST_WIDE_INT lane2
+	= TREE_INT_CST_LOW (VECTOR_CST_ELT (mask2, i)) & (nelts - 1);
+      sel_1.quick_push (lane1 + (use_src1 ? 0 : nelts));
+      sel_2.quick_push (lane2 + (use_src1 ? 0 : nelts));
+    }
+
+  vec_perm_indices indices (sel, 2, nelts);
+  vec_perm_indices indices1_1 (sel_1, 2, nelts);
+  vec_perm_indices indices1_2 (sel_2, 2, nelts);
+
+  tree vectype = TREE_TYPE (gimple_assign_lhs (seq2->stmt));
+  gcc_assert (can_vec_perm_const_p (TYPE_MODE (vectype), TYPE_MODE (vectype),
+				    indices));
+  gcc_assert (can_vec_perm_const_p (TYPE_MODE (vectype), TYPE_MODE (vectype),
+				    indices1_1));
+  gcc_assert (can_vec_perm_const_p (TYPE_MODE (vectype), TYPE_MODE (vectype),
+				    indices1_2));
+
+  gimple_assign_set_rhs1 (seq2->stmt, gimple_assign_rhs1 (seq1->stmt));
+  gimple_assign_set_rhs2 (seq2->stmt, gimple_assign_rhs2 (seq1->stmt));
+  tree m1 = vect_gen_perm_mask_checked (vectype, indices);
+  gimple_assign_set_rhs3 (seq2->stmt, m1);
+
+  gimple_assign_set_rhs2 (seq1->v_1_stmt, gimple_assign_rhs1 (seq2->v_1_stmt));
+  gimple_assign_set_rhs2 (seq1->v_2_stmt, gimple_assign_rhs1 (seq2->v_2_stmt));
+
+  tree m2 = vect_gen_perm_mask_checked (vectype, indices1_1);
+  gimple_assign_set_rhs3 (seq1->v_1_stmt, m2);
+  tree m3 = vect_gen_perm_mask_checked (vectype, indices1_2);
+  gimple_assign_set_rhs3 (seq1->v_2_stmt, m3);
+
+  update_stmt (seq2->stmt);
+  update_stmt (seq1->v_1_stmt);
+  update_stmt (seq1->v_2_stmt);
+
+  /* We need to move the updated statements because otherwise they may
+     end up before the definitions of their new operands.  Since we know
+     that they don't have uses outside the pattern, we can remove them
+     and re-insert them in the correct order.  */
+  gimple_stmt_iterator gsi_rm = gsi_for_stmt (seq1->v_x_stmt);
+  gsi_remove (&gsi_rm, false);
+  gsi_rm = gsi_for_stmt (seq1->v_y_stmt);
+  gsi_remove (&gsi_rm, false);
+  gsi_rm = gsi_for_stmt (seq1->v_1_stmt);
+  gsi_remove (&gsi_rm, false);
+  gsi_rm = gsi_for_stmt (seq1->v_2_stmt);
+  gsi_remove (&gsi_rm, false);
+
+  gimple_stmt_iterator gsi_stmt1 = gsi_for_stmt (seq1->stmt);
+  gsi_insert_before (&gsi_stmt1, seq1->v_1_stmt, GSI_SAME_STMT);
+  gsi_insert_before (&gsi_stmt1, seq1->v_2_stmt, GSI_SAME_STMT);
+  gsi_insert_before (&gsi_stmt1, seq1->v_x_stmt, GSI_SAME_STMT);
+  gsi_insert_before (&gsi_stmt1, seq1->v_y_stmt, GSI_SAME_STMT);
+
+  if (dump_enabled_p ())
+    fprintf (dump_file, "Sequences have been merged.\n\n");
+}
+
+/* Reduce the lane consumption of a simplifiable vec perm statement.  */
+
+static void
+compress_vec_perm_stmt (vec_perm_simplify_seq seq)
+{
+  /* Update the VEC_PERM statement.  */
+  gassign *stmt = seq->stmt;
+  unsigned HOST_WIDE_INT nelts = seq->nelts;
+
+  vec_perm_builder new_sel (nelts, nelts, 1);
+
+  int i;
+  unsigned HOST_WIDE_INT idx;
+  FOR_EACH_VEC_ELT (seq->new_sel, i, idx)
+    {
+      new_sel.quick_push (idx);
+    }
+  vec_perm_indices new_indices (new_sel, 2, nelts);
+
+  tree vectype = TREE_TYPE (gimple_assign_lhs (stmt));
+  gcc_assert (can_vec_perm_const_p (TYPE_MODE (vectype),
+				    TYPE_MODE (vectype), new_indices));
+
+  if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Updating VEC_PERM statment:\n");
+      fprintf (dump_file, "Old stmt: ");
+      print_gimple_stmt (dump_file, stmt, 0);
+    }
+
+  tree m = vect_gen_perm_mask_checked (vectype, new_indices);
+  gimple_assign_set_rhs3 (stmt, m);
+  update_stmt (stmt);
+
+  if (dump_enabled_p () && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "New stmt: ");
+      print_gimple_stmt (dump_file, stmt, 0);
+    }
+}
+
+/* Find vectorized sequences where we can reduce the lane utilization.
+   Pair up such sequences and merge them into a single one.  */
+
+static void
+vect_slp_permute_simplify_fun (function *fun)
+{
+  vec<vec_perm_simplify_seq> candidate_list = vNULL;
+  basic_block bb;
+
+  /* Analysis.  */
+
+  FOR_EACH_BB_FN (bb, fun)
+    for (gimple_stmt_iterator gsi = gsi_start_bb (bb); !gsi_end_p (gsi);
+	 gsi_next_nondebug (&gsi))
+      {
+	gassign *stmt = dyn_cast <gassign *> (gsi_stmt (gsi));
+	if (!stmt)
+	  continue;
+
+	vec_perm_simplify_seq seq;
+	if (recognise_vec_perm_simplify_pattern (stmt, &seq))
+	  candidate_list.safe_push (seq);
+      }
+
+  /* Transformation.  */
+
+  if (dump_enabled_p () && !candidate_list.is_empty ())
+    {
+      fprintf (dump_file, "\nFound %u candidates.  Trying to merge them.\n\n",
+	       candidate_list.length ());
+    }
+
+  while (!candidate_list.is_empty ())
+    {
+      vec_perm_simplify_seq seq1 = candidate_list[0];
+      candidate_list.ordered_remove (0);
+
+      int i;
+      vec_perm_simplify_seq seq2;
+      FOR_EACH_VEC_ELT (candidate_list, i, seq2)
+	{
+	  if (vect_perm_seq_can_merge_p (seq1, seq2))
+	    {
+	      candidate_list.ordered_remove (i);
+	      compress_vec_perm_stmt (seq1);
+	      compress_vec_perm_stmt (seq2);
+
+	      vect_perm_seq_merge (seq1, seq2);
+
+	      seq2->new_sel.release ();
+	      XDELETE (seq2);
+	      break;
+	    }
+	}
+
+      seq1->new_sel.release ();
+      XDELETE (seq1);
+    }
+
+  candidate_list.release ();
+}
+
+/* Main entry point.  */
+
+void
+vect_slp_permute_simplify (function *fun)
+{
+  vect_slp_permute_simplify_fun (fun);
+}
diff --git a/gcc/tree-vectorizer.cc b/gcc/tree-vectorizer.cc
index 16fa0ec1bb7..6bdbde21d70 100644
--- a/gcc/tree-vectorizer.cc
+++ b/gcc/tree-vectorizer.cc
@@ -1473,6 +1473,56 @@  make_pass_simduid_cleanup (gcc::context *ctxt)
   return new pass_simduid_cleanup (ctxt);
 }
 
+/*  Entry point to SLP permutation simplifier.  */
+
+namespace {
+
+const pass_data pass_data_slp_perm_simplify =
+{
+  GIMPLE_PASS, /* type */
+  "slp_perm_simplify", /* name */
+  OPTGROUP_LOOP | OPTGROUP_VEC, /* optinfo_flags */
+  TV_TREE_SLP_VECTORIZATION, /* tv_id */
+  ( PROP_ssa | PROP_cfg ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_update_ssa, /* todo_flags_finish */
+};
+
+class pass_slp_perm_simplify : public gimple_opt_pass
+{
+public:
+  pass_slp_perm_simplify (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_slp_perm_simplify, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  opt_pass * clone () final override
+  {
+    return new pass_slp_perm_simplify (m_ctxt);
+  }
+  bool gate (function *) final override { return flag_tree_slp_vectorize != 0; }
+  unsigned int execute (function *) final override;
+
+}; // class pass_slp_perm_simplify
+
+unsigned int
+pass_slp_perm_simplify::execute (function *fun)
+{
+  vect_slp_permute_simplify (fun);
+
+  return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_slp_perm_simplify (gcc::context *ctxt)
+{
+  return new pass_slp_perm_simplify (ctxt);
+}
+
 
 /*  Entry point to basic block SLP phase.  */
 
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 24227a69d4a..2500ebaae6e 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2548,6 +2548,9 @@  extern tree cse_and_gimplify_to_preheader (loop_vec_info, tree);
 extern tree vect_peel_nonlinear_iv_init (gimple_seq*, tree, tree,
 					 tree, enum vect_induction_op_type);
 
+/* In tree-vect-perm-simplify.cc.  */
+extern void vect_slp_permute_simplify (function *);
+
 /* In tree-vect-slp.cc.  */
 extern void vect_slp_init (void);
 extern void vect_slp_fini (void);