diff mbox series

[v2] gimple: Add limit after which slower switchlower algs are used [PR117091] [PR117352]

Message ID Z1cJ1UrMTB5til3s@localhost.localdomain
State New
Headers show
Series [v2] gimple: Add limit after which slower switchlower algs are used [PR117091] [PR117352] | expand

Commit Message

Filip Kastl Dec. 9, 2024, 3:16 p.m. UTC
Hi Richi,

This is the second version of the patch.  I've lowered the default value of the
switch-lower-slow-alg-max-cases from 10000 to 1000.  I've also noticed that I
forgot to add an entry into the param section of doc/invoke.texi so I fixed
that.

I'm bootstraping and regtesting this again just to be extra sure.  Ok to push
if bootstrap/regtest succeeds?

Cheers,
Filip Kastl


-- 8< --


This patch adds a limit on the number of cases of a switch.  When this
limit is exceeded, switch lowering decides to use faster but less
powerful algorithms.

In particular this means that for finding bit tests switch lowering
decides between the old dynamic programming O(n^2) algorithm and the
new greedy algorithm that Andi Kleen recently added but then reverted
due to PR117352.  It also means that switch lowering may bail out on
finding jump tables if the switch is too large  (Btw it also may not
bail!  It can happen that the greedy algorithms finds some bit tests
which then basically split the switch into multiple smaller switches and
those may be small enough to fit under the limit.)

The limit is implemented as --param switch-lower-slow-alg-max-cases.
Exceeding the limit is reported through -Wdisabled-optimization.

This patch fixes the issue with the greedy algorithm described in
PR117352.  The problem was incorrect usage of the is_beneficial()
heuristic.

gcc/ChangeLog:

	PR middle-end/117091
	PR middle-end/117352
	* doc/invoke.texi: Add switch-lower-slow-alg-max-cases.
	* params.opt: Add switch-lower-slow-alg-max-cases.
	* tree-switch-conversion.cc (jump_table_cluster::find_jump_tables):
	Note in a comment that we are looking for jump tables in
	case sequences delimited by the already found bit tests.
	(bit_test_cluster::find_bit_tests): Decide between
	find_bit_tests_fast() and find_bit_tests_slow().
	(bit_test_cluster::find_bit_tests_fast): New function.
	(bit_test_cluster::find_bit_tests_slow): New function.
	(switch_decision_tree::analyze_switch_statement): Report
	exceeding the limit.
	* tree-switch-conversion.h: Add find_bit_tests_fast() and
	find_bit_tests_slow().

Co-Authored-By: Andi Kleen <ak@gcc.gnu.org>
Signed-off-by: Filip Kastl <fkastl@suse.cz>
---
 gcc/doc/invoke.texi           |   3 +
 gcc/params.opt                |   4 ++
 gcc/tree-switch-conversion.cc | 112 +++++++++++++++++++++++++++++++---
 gcc/tree-switch-conversion.h  |  18 ++++++
 4 files changed, 130 insertions(+), 7 deletions(-)

Comments

Richard Biener Dec. 9, 2024, 3:55 p.m. UTC | #1
> Am 09.12.2024 um 16:19 schrieb Filip Kastl <fkastl@suse.cz>:
> 
> Hi Richi,
> 
> This is the second version of the patch.  I've lowered the default value of the
> switch-lower-slow-alg-max-cases from 10000 to 1000.  I've also noticed that I
> forgot to add an entry into the param section of doc/invoke.texi so I fixed
> that.
> 
> I'm bootstraping and regtesting this again just to be extra sure.  Ok to push
> if bootstrap/regtest succeeds?

Ok

Thanks,
Richard 

> Cheers,
> Filip Kastl
> 
> 
> -- 8< --
> 
> 
> This patch adds a limit on the number of cases of a switch.  When this
> limit is exceeded, switch lowering decides to use faster but less
> powerful algorithms.
> 
> In particular this means that for finding bit tests switch lowering
> decides between the old dynamic programming O(n^2) algorithm and the
> new greedy algorithm that Andi Kleen recently added but then reverted
> due to PR117352.  It also means that switch lowering may bail out on
> finding jump tables if the switch is too large  (Btw it also may not
> bail!  It can happen that the greedy algorithms finds some bit tests
> which then basically split the switch into multiple smaller switches and
> those may be small enough to fit under the limit.)
> 
> The limit is implemented as --param switch-lower-slow-alg-max-cases.
> Exceeding the limit is reported through -Wdisabled-optimization.
> 
> This patch fixes the issue with the greedy algorithm described in
> PR117352.  The problem was incorrect usage of the is_beneficial()
> heuristic.
> 
> gcc/ChangeLog:
> 
>    PR middle-end/117091
>    PR middle-end/117352
>    * doc/invoke.texi: Add switch-lower-slow-alg-max-cases.
>    * params.opt: Add switch-lower-slow-alg-max-cases.
>    * tree-switch-conversion.cc (jump_table_cluster::find_jump_tables):
>    Note in a comment that we are looking for jump tables in
>    case sequences delimited by the already found bit tests.
>    (bit_test_cluster::find_bit_tests): Decide between
>    find_bit_tests_fast() and find_bit_tests_slow().
>    (bit_test_cluster::find_bit_tests_fast): New function.
>    (bit_test_cluster::find_bit_tests_slow): New function.
>    (switch_decision_tree::analyze_switch_statement): Report
>    exceeding the limit.
>    * tree-switch-conversion.h: Add find_bit_tests_fast() and
>    find_bit_tests_slow().
> 
> Co-Authored-By: Andi Kleen <ak@gcc.gnu.org>
> Signed-off-by: Filip Kastl <fkastl@suse.cz>
> ---
> gcc/doc/invoke.texi           |   3 +
> gcc/params.opt                |   4 ++
> gcc/tree-switch-conversion.cc | 112 +++++++++++++++++++++++++++++++---
> gcc/tree-switch-conversion.h  |  18 ++++++
> 4 files changed, 130 insertions(+), 7 deletions(-)
> 
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 14afd1934bd..3cb9a50b690 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -16500,6 +16500,9 @@ Switch initialization conversion refuses to create arrays that are
> bigger than @option{switch-conversion-max-branch-ratio} times the number of
> branches in the switch.
> 
> +@item switch-lower-slow-alg-max-cases
> +Maximum number of cases for slow switch lowering algorithms to be used.
> +
> @item max-partial-antic-length
> Maximum length of the partial antic set computed during the tree
> partial redundancy elimination optimization (@option{-ftree-pre}) when
> diff --git a/gcc/params.opt b/gcc/params.opt
> index 5853bf02f9e..1c88d5212c4 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -1052,6 +1052,10 @@ Maximum number of instruction distance that a small store forwarded to a larger
> Common Joined UInteger Var(param_switch_conversion_branch_ratio) Init(8) IntegerRange(1, 65536) Param Optimization
> The maximum ratio between array size and switch branches for a switch conversion to take place.
> 
> +-param=switch-lower-slow-alg-max-cases=
> +Common Joined UInteger Var(param_switch_lower_slow_alg_max_cases) Init(1000) IntegerRange(1, 1000000000) Param Optimization
> +Maximum number of cases for slow switch lowering algorithms to be used.
> +
> -param=modref-max-bases=
> Common Joined UInteger Var(param_modref_max_bases) Init(32) Param Optimization
> Maximum number of bases stored in each modref tree.
> diff --git a/gcc/tree-switch-conversion.cc b/gcc/tree-switch-conversion.cc
> index 3436c2a8b98..b98e70cf7d1 100644
> --- a/gcc/tree-switch-conversion.cc
> +++ b/gcc/tree-switch-conversion.cc
> @@ -54,6 +54,7 @@ Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA
> #include "tree-cfgcleanup.h"
> #include "hwint.h"
> #include "internal-fn.h"
> +#include "diagnostic-core.h"
> 
> /* ??? For lang_hooks.types.type_for_mode, but is there a word_mode
>    type in the GIMPLE type system that is language-independent?  */
> @@ -1641,6 +1642,11 @@ jump_table_cluster::find_jump_tables (vec<cluster *> &clusters)
>     return clusters.copy ();
> 
>   unsigned l = clusters.length ();
> +
> +  /* Note: l + 1 is the number of cases of the switch.  */
> +  if (l + 1 > (unsigned) param_switch_lower_slow_alg_max_cases)
> +    return clusters.copy ();
> +
>   auto_vec<min_cluster_item> min;
>   min.reserve (l + 1);
> 
> @@ -1771,16 +1777,80 @@ jump_table_cluster::is_beneficial (const vec<cluster *> &,
>   return end - start + 1 >= case_values_threshold ();
> }
> 
> -/* Find bit tests of given CLUSTERS, where all members of the vector
> -   are of type simple_cluster.   MAX_C is the approx max number of cases per
> -   label.  New clusters are returned.  */
> +/* Find bit tests of given CLUSTERS, where all members of the vector are of
> +   type simple_cluster.  Use a fast algorithm that might not find the optimal
> +   solution (minimal number of clusters on the output).  New clusters are
> +   returned.
> +
> +   You should call find_bit_tests () instead of calling this function
> +   directly.  */
> 
> vec<cluster *>
> -bit_test_cluster::find_bit_tests (vec<cluster *> &clusters, int max_c)
> +bit_test_cluster::find_bit_tests_fast (vec<cluster *> &clusters)
> {
> -  if (!is_enabled () || max_c == 1)
> -    return clusters.copy ();
> +  unsigned l = clusters.length ();
> +  vec<cluster *> output;
> +
> +  output.create (l);
> +
> +  /* Look at sliding BITS_PER_WORD sized windows in the switch value space
> +     and determine if they are suitable for a bit test cluster.  Worst case
> +     this can examine every value BITS_PER_WORD-1 times.  */
> +  unsigned k;
> +  for (unsigned i = 0; i < l; i += k)
> +    {
> +      hash_set<basic_block> targets;
> +      cluster *start_cluster = clusters[i];
> +
> +      /* Find the biggest k such that clusters i to i+k-1 can be turned into a
> +     one big bit test cluster.  */
> +      k = 0;
> +      while (i + k < l)
> +    {
> +      cluster *end_cluster = clusters[i + k];
> +
> +      /* Does value range fit into the BITS_PER_WORD window?  */
> +      HOST_WIDE_INT w = cluster::get_range (start_cluster->get_low (),
> +                        end_cluster->get_high ());
> +      if (w == 0 || w > BITS_PER_WORD)
> +        break;
> +
> +      /* Check for max # of targets.  */
> +      if (targets.elements () == m_max_case_bit_tests
> +          && !targets.contains (end_cluster->m_case_bb))
> +        break;
> +
> +      targets.add (end_cluster->m_case_bb);
> +      k++;
> +    }
> 
> +      if (is_beneficial (k, targets.elements ()))
> +    {
> +      output.safe_push (new bit_test_cluster (clusters, i, i + k - 1,
> +                          i == 0 && k == l));
> +    }
> +      else
> +    {
> +      output.safe_push (clusters[i]);
> +      /* ??? Might be able to skip more.  */
> +      k = 1;
> +    }
> +    }
> +
> +  return output;
> +}
> +
> +/* Find bit tests of given CLUSTERS, where all members of the vector
> +   are of type simple_cluster.  Use a slow (quadratic) algorithm that always
> +   finds the optimal solution (minimal number of clusters on the output).  New
> +   clusters are returned.
> +
> +   You should call find_bit_tests () instead of calling this function
> +   directly.  */
> +
> +vec<cluster *>
> +bit_test_cluster::find_bit_tests_slow (vec<cluster *> &clusters)
> +{
>   unsigned l = clusters.length ();
>   auto_vec<min_cluster_item> min;
>   min.reserve (l + 1);
> @@ -1834,6 +1904,25 @@ bit_test_cluster::find_bit_tests (vec<cluster *> &clusters, int max_c)
>   return output;
> }
> 
> +/* Find bit tests of given CLUSTERS, where all members of the vector
> +   are of type simple_cluster.  MAX_C is the approx max number of cases per
> +   label.  New clusters are returned.  */
> +
> +vec<cluster *>
> +bit_test_cluster::find_bit_tests (vec<cluster *> &clusters, int max_c)
> +{
> +  if (!is_enabled () || max_c == 1)
> +    return clusters.copy ();
> +
> +  unsigned l = clusters.length ();
> +
> +  /* Note: l + 1 is the number of cases of the switch.  */
> +  if (l + 1 > (unsigned) param_switch_lower_slow_alg_max_cases)
> +    return find_bit_tests_fast (clusters);
> +  else
> +    return find_bit_tests_slow (clusters);
> +}
> +
> /* Return true when RANGE of case values with UNIQ labels
>    can build a bit test.  */
> 
> @@ -2264,10 +2353,19 @@ switch_decision_tree::analyze_switch_statement ()
> 
>   reset_out_edges_aux (m_switch);
> 
> +  if (l > (unsigned) param_switch_lower_slow_alg_max_cases)
> +    warning_at (gimple_location (m_switch), OPT_Wdisabled_optimization,
> +           "Using faster switch lowering algorithms. "
> +           "Number of switch cases (%d) exceeds "
> +           "%<--param=switch-lower-slow-alg-max-cases=%d%> limit.",
> +           l, param_switch_lower_slow_alg_max_cases);
> +
>   /* Find bit-test clusters.  */
>   vec<cluster *> output = bit_test_cluster::find_bit_tests (clusters, max_c);
> 
> -  /* Find jump table clusters.  */
> +  /* Find jump table clusters.  We are looking for these in the sequences of
> +     simple clusters which we didn't manage to convert into bit-test
> +     clusters.  */
>   vec<cluster *> output2;
>   auto_vec<cluster *> tmp;
>   output2.create (1);
> diff --git a/gcc/tree-switch-conversion.h b/gcc/tree-switch-conversion.h
> index e6a85fa6025..560620903a8 100644
> --- a/gcc/tree-switch-conversion.h
> +++ b/gcc/tree-switch-conversion.h
> @@ -397,6 +397,24 @@ public:
>         tree default_label_expr, basic_block default_bb, location_t loc)
>      final override;
> 
> +  /* Find bit tests of given CLUSTERS, where all members of the vector are of
> +     type simple_cluster.  Use a fast algorithm that might not find the optimal
> +     solution (minimal number of clusters on the output).  New clusters are
> +     returned.
> +
> +     You should call find_bit_tests () instead of calling this function
> +     directly.  */
> +  static vec<cluster *> find_bit_tests_fast (vec<cluster *> &clusters);
> +
> +  /* Find bit tests of given CLUSTERS, where all members of the vector
> +     are of type simple_cluster.  Use a slow (quadratic) algorithm that always
> +     finds the optimal solution (minimal number of clusters on the output).  New
> +     clusters are returned.
> +
> +     You should call find_bit_tests () instead of calling this function
> +     directly.  */
> +  static vec<cluster *> find_bit_tests_slow (vec<cluster *> &clusters);
> +
>   /* Find bit tests of given CLUSTERS, where all members of the vector
>      are of type simple_cluster.  New clusters are returned.  */
>   static vec<cluster *> find_bit_tests (vec<cluster *> &clusters, int max_c);
> --
> 2.46.0
>
diff mbox series

Patch

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 14afd1934bd..3cb9a50b690 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16500,6 +16500,9 @@  Switch initialization conversion refuses to create arrays that are
 bigger than @option{switch-conversion-max-branch-ratio} times the number of
 branches in the switch.
 
+@item switch-lower-slow-alg-max-cases
+Maximum number of cases for slow switch lowering algorithms to be used.
+
 @item max-partial-antic-length
 Maximum length of the partial antic set computed during the tree
 partial redundancy elimination optimization (@option{-ftree-pre}) when
diff --git a/gcc/params.opt b/gcc/params.opt
index 5853bf02f9e..1c88d5212c4 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1052,6 +1052,10 @@  Maximum number of instruction distance that a small store forwarded to a larger
 Common Joined UInteger Var(param_switch_conversion_branch_ratio) Init(8) IntegerRange(1, 65536) Param Optimization
 The maximum ratio between array size and switch branches for a switch conversion to take place.
 
+-param=switch-lower-slow-alg-max-cases=
+Common Joined UInteger Var(param_switch_lower_slow_alg_max_cases) Init(1000) IntegerRange(1, 1000000000) Param Optimization
+Maximum number of cases for slow switch lowering algorithms to be used.
+
 -param=modref-max-bases=
 Common Joined UInteger Var(param_modref_max_bases) Init(32) Param Optimization
 Maximum number of bases stored in each modref tree.
diff --git a/gcc/tree-switch-conversion.cc b/gcc/tree-switch-conversion.cc
index 3436c2a8b98..b98e70cf7d1 100644
--- a/gcc/tree-switch-conversion.cc
+++ b/gcc/tree-switch-conversion.cc
@@ -54,6 +54,7 @@  Software Foundation, 51 Franklin Street, Fifth Floor, Boston, MA
 #include "tree-cfgcleanup.h"
 #include "hwint.h"
 #include "internal-fn.h"
+#include "diagnostic-core.h"
 
 /* ??? For lang_hooks.types.type_for_mode, but is there a word_mode
    type in the GIMPLE type system that is language-independent?  */
@@ -1641,6 +1642,11 @@  jump_table_cluster::find_jump_tables (vec<cluster *> &clusters)
     return clusters.copy ();
 
   unsigned l = clusters.length ();
+
+  /* Note: l + 1 is the number of cases of the switch.  */
+  if (l + 1 > (unsigned) param_switch_lower_slow_alg_max_cases)
+    return clusters.copy ();
+
   auto_vec<min_cluster_item> min;
   min.reserve (l + 1);
 
@@ -1771,16 +1777,80 @@  jump_table_cluster::is_beneficial (const vec<cluster *> &,
   return end - start + 1 >= case_values_threshold ();
 }
 
-/* Find bit tests of given CLUSTERS, where all members of the vector
-   are of type simple_cluster.   MAX_C is the approx max number of cases per
-   label.  New clusters are returned.  */
+/* Find bit tests of given CLUSTERS, where all members of the vector are of
+   type simple_cluster.  Use a fast algorithm that might not find the optimal
+   solution (minimal number of clusters on the output).  New clusters are
+   returned.
+
+   You should call find_bit_tests () instead of calling this function
+   directly.  */
 
 vec<cluster *>
-bit_test_cluster::find_bit_tests (vec<cluster *> &clusters, int max_c)
+bit_test_cluster::find_bit_tests_fast (vec<cluster *> &clusters)
 {
-  if (!is_enabled () || max_c == 1)
-    return clusters.copy ();
+  unsigned l = clusters.length ();
+  vec<cluster *> output;
+
+  output.create (l);
+
+  /* Look at sliding BITS_PER_WORD sized windows in the switch value space
+     and determine if they are suitable for a bit test cluster.  Worst case
+     this can examine every value BITS_PER_WORD-1 times.  */
+  unsigned k;
+  for (unsigned i = 0; i < l; i += k)
+    {
+      hash_set<basic_block> targets;
+      cluster *start_cluster = clusters[i];
+
+      /* Find the biggest k such that clusters i to i+k-1 can be turned into a
+	 one big bit test cluster.  */
+      k = 0;
+      while (i + k < l)
+	{
+	  cluster *end_cluster = clusters[i + k];
+
+	  /* Does value range fit into the BITS_PER_WORD window?  */
+	  HOST_WIDE_INT w = cluster::get_range (start_cluster->get_low (),
+						end_cluster->get_high ());
+	  if (w == 0 || w > BITS_PER_WORD)
+	    break;
+
+	  /* Check for max # of targets.  */
+	  if (targets.elements () == m_max_case_bit_tests
+	      && !targets.contains (end_cluster->m_case_bb))
+	    break;
+
+	  targets.add (end_cluster->m_case_bb);
+	  k++;
+	}
 
+      if (is_beneficial (k, targets.elements ()))
+	{
+	  output.safe_push (new bit_test_cluster (clusters, i, i + k - 1,
+						  i == 0 && k == l));
+	}
+      else
+	{
+	  output.safe_push (clusters[i]);
+	  /* ??? Might be able to skip more.  */
+	  k = 1;
+	}
+    }
+
+  return output;
+}
+
+/* Find bit tests of given CLUSTERS, where all members of the vector
+   are of type simple_cluster.  Use a slow (quadratic) algorithm that always
+   finds the optimal solution (minimal number of clusters on the output).  New
+   clusters are returned.
+
+   You should call find_bit_tests () instead of calling this function
+   directly.  */
+
+vec<cluster *>
+bit_test_cluster::find_bit_tests_slow (vec<cluster *> &clusters)
+{
   unsigned l = clusters.length ();
   auto_vec<min_cluster_item> min;
   min.reserve (l + 1);
@@ -1834,6 +1904,25 @@  bit_test_cluster::find_bit_tests (vec<cluster *> &clusters, int max_c)
   return output;
 }
 
+/* Find bit tests of given CLUSTERS, where all members of the vector
+   are of type simple_cluster.  MAX_C is the approx max number of cases per
+   label.  New clusters are returned.  */
+
+vec<cluster *>
+bit_test_cluster::find_bit_tests (vec<cluster *> &clusters, int max_c)
+{
+  if (!is_enabled () || max_c == 1)
+    return clusters.copy ();
+
+  unsigned l = clusters.length ();
+
+  /* Note: l + 1 is the number of cases of the switch.  */
+  if (l + 1 > (unsigned) param_switch_lower_slow_alg_max_cases)
+    return find_bit_tests_fast (clusters);
+  else
+    return find_bit_tests_slow (clusters);
+}
+
 /* Return true when RANGE of case values with UNIQ labels
    can build a bit test.  */
 
@@ -2264,10 +2353,19 @@  switch_decision_tree::analyze_switch_statement ()
 
   reset_out_edges_aux (m_switch);
 
+  if (l > (unsigned) param_switch_lower_slow_alg_max_cases)
+    warning_at (gimple_location (m_switch), OPT_Wdisabled_optimization,
+	       "Using faster switch lowering algorithms. "
+	       "Number of switch cases (%d) exceeds "
+	       "%<--param=switch-lower-slow-alg-max-cases=%d%> limit.",
+	       l, param_switch_lower_slow_alg_max_cases);
+
   /* Find bit-test clusters.  */
   vec<cluster *> output = bit_test_cluster::find_bit_tests (clusters, max_c);
 
-  /* Find jump table clusters.  */
+  /* Find jump table clusters.  We are looking for these in the sequences of
+     simple clusters which we didn't manage to convert into bit-test
+     clusters.  */
   vec<cluster *> output2;
   auto_vec<cluster *> tmp;
   output2.create (1);
diff --git a/gcc/tree-switch-conversion.h b/gcc/tree-switch-conversion.h
index e6a85fa6025..560620903a8 100644
--- a/gcc/tree-switch-conversion.h
+++ b/gcc/tree-switch-conversion.h
@@ -397,6 +397,24 @@  public:
 	     tree default_label_expr, basic_block default_bb, location_t loc)
      final override;
 
+  /* Find bit tests of given CLUSTERS, where all members of the vector are of
+     type simple_cluster.  Use a fast algorithm that might not find the optimal
+     solution (minimal number of clusters on the output).  New clusters are
+     returned.
+
+     You should call find_bit_tests () instead of calling this function
+     directly.  */
+  static vec<cluster *> find_bit_tests_fast (vec<cluster *> &clusters);
+
+  /* Find bit tests of given CLUSTERS, where all members of the vector
+     are of type simple_cluster.  Use a slow (quadratic) algorithm that always
+     finds the optimal solution (minimal number of clusters on the output).  New
+     clusters are returned.
+
+     You should call find_bit_tests () instead of calling this function
+     directly.  */
+  static vec<cluster *> find_bit_tests_slow (vec<cluster *> &clusters);
+
   /* Find bit tests of given CLUSTERS, where all members of the vector
      are of type simple_cluster.  New clusters are returned.  */
   static vec<cluster *> find_bit_tests (vec<cluster *> &clusters, int max_c);