diff mbox series

[5/5] Allow multiple vectorized epilogs via --param vect-epilogues-nomask=N

Message ID 20241106143242.8B4363858C53@sourceware.org
State New
Series [1/5] Check LOOP_VINFO_PEELING_FOR_GAPS on epilog is supported

Commit Message

Richard Biener Nov. 6, 2024, 2:32 p.m. UTC
The following is a prototype allowing N possible vector epilogues.
In the end I'd like the target to tell us a set of (or no) vector modes
to consider for the epilogue of the main loop or of the epilogue loop
currently being analyzed, similar to how we communicate back
suggested_unroll_factor.

The main motivation is SPEC CPU 2017 525.x264_r, which with AVX512
vectorization ends up using the scalar epilogue in a hot function
because the VF of the AVX2 epilogue is too high.  Using two vector
epilogues mitigates this and also avoids regressing 527.cam4_r, which
has a loop whose iteration count exactly matches the AVX2 epilogue
(one of the original ideas was to always use an SSE2 vector epilogue,
even with an AVX512 main loop).
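
As a rough illustration of the effect described above, the following
sketch distributes a scalar iteration count over a main vector loop and
its vector epilogues.  The helper split_iterations and the example VFs
(16, 8, 4) are assumptions made for illustration, not part of the
patch, and the model ignores peeling and cost-model thresholds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of GCC: distribute N scalar iterations
   over a main vector loop and its vector epilogues.  VF[0] is the main
   loop's vectorization factor, VF[1..NLOOPS-1] those of the epilogues,
   in decreasing order.  VEC_ITERS[i] receives the number of vector
   iterations loop I executes; the return value is the number of
   iterations left for the scalar epilogue.  */
static unsigned
split_iterations (unsigned n, const unsigned *vf, size_t nloops,
                  unsigned *vec_iters)
{
  unsigned remaining = n;
  for (size_t i = 0; i < nloops; ++i)
    {
      assert (vf[i] != 0);
      vec_iters[i] = remaining / vf[i];
      remaining %= vf[i];
    }
  return remaining;
}
```

Under these assumptions, 7 iterations with VFs {16, 8} all fall to the
scalar epilogue, while adding a third loop with VF 4 leaves only 3 for
it.  An iteration count of 8 is handled entirely by the VF 8 epilogue
in both configurations.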

It turns out that two vector epilogues even create smaller code
in some cases since we tend to fully unroll epilogues with fewer
than 16 iterations.  For example, a simple loop (with int x[])

  for (int i = 0; i < n; ++i)
    x[i] *= 3;

has the following code size at -O3 -march=znver4:

N vector epilogues   size
0                    615
1                    429
2                    388
3                    392

I'm unsure how important/effective multiple vector epilogues are
for non-x86 ISAs, which all seem to have only a single vector size
or VLA vectors.  For better target control on x86 I'd like to
tell the vectorizer the array of modes to consider for the
epilogue of the current loop plus a flag whether to consider
using partial vectors (x86 does not have that encoded into the mode).
So I'd add m_epilog_vec_modes[] and m_epilog_vec_mode_partial.
Since x86 currently doesn't do cost compares, the latter can be a
flag, and we'd try that first when set, together with (only?) the
first mode?  Alternatively only hint a single mode, but that would
never scale to targets that do cost compares?

So using --param vect-epilogues-nomask=N is mainly for this RFC;
I'm not sure it needs to stay.

Note I didn't manage to get aarch64 to use more than one epilogue,
not even with -msve-vector-bits=512.

Bootstrapped and tested on x86_64-unknown-linux-gnu.  I've also
built SPEC CPU 2017 with --param vect-epilogues-nomask=2.  As
said, I want the target to have more control: even on x86 we
probably only want two epilogues when doing 512-bit vectorization
for the main loop, possibly depending on its VF.

Any comments so far?

Thanks,
Richard.

	* doc/invoke.texi (vect-epilogues-nomask): Adjust.
	* params.opt (vect-epilogues-nomask): Adjust max value and
	documentation.
	* tree-vect-loop.cc (vect_analyze_loop): Hack in multiple
	vectorized epilogs.
---
 gcc/doc/invoke.texi   |  3 ++-
 gcc/params.opt        |  2 +-
 gcc/tree-vect-loop.cc | 23 +++++++++++++++++------
 3 files changed, 20 insertions(+), 8 deletions(-)

Patch

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index f2555ec83a1..73e54a47381 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -16870,7 +16870,8 @@  The maximum number of insns in loop header duplicated
 by the copy loop headers pass.
 
 @item vect-epilogues-nomask
-Enable loop epilogue vectorization using smaller vector size.
+Enable loop epilogue vectorization using smaller vector size with up to N
+vector epilogue loops.
 
 @item vect-partial-vector-usage
 Controls when the loop vectorizer considers using partial vector loads
diff --git a/gcc/params.opt b/gcc/params.opt
index 4dab7a26f9b..c77472e7ad3 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1175,7 +1175,7 @@  Common Joined UInteger Var(param_use_canonical_types) Init(1) IntegerRange(0, 1)
 Whether to use canonical types.
 
 -param=vect-epilogues-nomask=
-Common Joined UInteger Var(param_vect_epilogues_nomask) Init(1) IntegerRange(0, 1) Param Optimization
+Common Joined UInteger Var(param_vect_epilogues_nomask) Init(1) IntegerRange(0, 8) Param Optimization
 Enable loop epilogue vectorization using smaller vector size.
 
 -param=vect-max-layout-candidates=
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 41875683595..90802675a84 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -3721,6 +3721,10 @@  vect_analyze_loop (class loop *loop, gimple *loop_vectorized_call,
     partial_vectors_supported_p () && param_vect_partial_vector_usage != 0;
   poly_uint64 first_vinfo_vf = LOOP_VINFO_VECT_FACTOR (first_loop_vinfo);
 
+  loop_vec_info orig_loop_vinfo = first_loop_vinfo;
+  unsigned n = param_vect_epilogues_nomask;
+  do
+    {
   while (1)
     {
       /* If the target does not support partial vectors we can shorten the
@@ -3744,7 +3748,7 @@  vect_analyze_loop (class loop *loop, gimple *loop_vectorized_call,
       bool fatal;
       opt_loop_vec_info loop_vinfo
 	= vect_analyze_loop_1 (loop, shared, &loop_form_info,
-			       first_loop_vinfo,
+			       orig_loop_vinfo,
 			       vector_modes, mode_i,
 			       autodetected_vector_mode, fatal);
       if (fatal)
@@ -3769,17 +3773,24 @@  vect_analyze_loop (class loop *loop, gimple *loop_vectorized_call,
 	      loop_vinfo = opt_loop_vec_info::success (NULL);
 	    }
 
-	  /* For now only allow one epilogue loop, but allow
-	     pick_lowest_cost_p to replace it, so commit to the
-	     first epilogue if we have no reason to try alternatives.  */
+	  /* If we do not pick an alternative based on cost we're done.  */
 	  if (!pick_lowest_cost_p)
 	    break;
 	}
 
       if (mode_i == vector_modes.length ())
-	break;
-
+	{
+	  mode_i = 0;
+	  break;
+	}
+    }
+  if (mode_i == vector_modes.length ())
+    break;
+  orig_loop_vinfo = orig_loop_vinfo->epilogue_vinfo;
     }
+  while (orig_loop_vinfo
+	 && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (orig_loop_vinfo)
+	 && --n != 0);
 
   if (first_loop_vinfo->epilogue_vinfo)
     {