
[4/19] middle-end: Fix scale_loop_frequencies segfault on multiple-exits

Message ID ZJw5BcegL6zob2rC@arm.com
State New
Series Support early break/return auto-vectorization

Commit Message

Tamar Christina June 28, 2023, 1:43 p.m. UTC
Hi All,

There's an existing bug in loop frequency scaling where the if statement checks
to see if there's a single exit, and records a dump file note but then
continues.

It then tries to access the null pointer, which of course fails.

For multiple loop exits it's not really clear how to scale the exit
probabilities as it's unknown which exit is the most probable.

For that reason I ignore the exit edges during scaling but still adjust the
loop body.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* cfgloopmanip.cc (scale_loop_frequencies): Fix typo.
	(scale_loop_profile): Don't access null pointer.

--- inline copy of patch -- 
diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
index 6e09dcbb0b1864bc64ffd570a4b923f50c3819b5..b10ef3d2be82902ccd74e52a4318217b2db13bcb 100644
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -501,7 +501,7 @@ scale_loop_frequencies (class loop *loop, profile_probability p)
 /* Scale profile in LOOP by P.
    If ITERATION_BOUND is non-zero, scale even further if loop is predicted
    to iterate too many times.
-   Before caling this function, preheader block profile should be already
+   Before calling this function, preheader block profile should be already
    scaled to final count.  This is necessary because loop iterations are
    determined by comparing header edge count to latch ege count and thus
    they need to be scaled synchronously.  */
@@ -597,14 +597,14 @@ scale_loop_profile (class loop *loop, profile_probability p,
       /* If latch exists, change its count, since we changed
 	 probability of exit.  Theoretically we should update everything from
 	 source of exit edge to latch, but for vectorizer this is enough.  */
-      if (loop->latch && loop->latch != e->src)
+      if (e && loop->latch && loop->latch != e->src)
 	loop->latch->count += count_delta;
 
       /* Scale the probabilities.  */
       scale_loop_frequencies (loop, p);
 
       /* Change latch's count back.  */
-      if (loop->latch && loop->latch != e->src)
+      if (e && loop->latch && loop->latch != e->src)
 	loop->latch->count -= count_delta;
 
       if (dump_file && (dump_flags & TDF_DETAILS))

Comments

Richard Biener July 4, 2023, 11:52 a.m. UTC | #1
On Wed, 28 Jun 2023, Tamar Christina wrote:

> Hi All,
> 
> There's an existing bug in loop frequency scaling where the if statement checks
> to see if there's a single exit, and records an dump file note but then
> continues.
> 
> It then tries to access the null pointer, which of course fails.
> 
> For multiple loop exists it's not really clear how to scale the exit
> probablities as it's really unknown which exit is most probably.
> 
> For that reason I ignore the exit edges during scaling but still adjust the
> loop body.
> 
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> Ok for master?

I can't really make sense of

      /* If latch exists, change its count, since we changed
         probability of exit.  Theoretically we should update everything
         from source of exit edge to latch, but for vectorizer this is
         enough.  */
      if (loop->latch && loop->latch != e->src)
        loop->latch->count += count_delta;

since with simple latches the latch itself is an empty forwarder and
e->src is the block with the conditional eventually exiting the loop.
That means this condition is always true.

So I think for exits the idea is to "remove" them by redirecting
them "count-wise" back into the loop.  So turn

   if (cond) --(exit-count)-- exit
     |
     | in-loop-count
     |
   latch

into

   [cond-blk-count]
   if (cond) -- (zero count) -- exit
     |
     | in-loop-count + exit-count (== cond-blk-count)
     |
   latch (now with cond-blk-count)

and the comment correctly suggests all blocks following from here
would need similar adjustment (and on in-loop branches the delta would be
distributed according to probabilities).

Given the code is quite imperfect I would suggest to change the
updating of the latch block count to read

  profile_count count_delta = profile_count::zero ();
  if (loop->latch
      && single_pred_p (loop->latch)
      && loop_exits_from_bb_p (single_pred (loop->latch)))
    {
      count_delta = single_pred (loop->latch)->count - loop->latch->count;
      loop->latch->count = single_pred (loop->latch)->count;
    }

   scale_loop_frequencies (loop, p);

  if (count_delta != 0)
    loop->latch->count -= count_delta;

which should exactly preserve the exit-before-latch behavior independently
of the number of exits of the loop.

Please leave Honza a chance to comment here.

Thanks,
Richard.


> Thanks,
> Tamar
> 
> gcc/ChangeLog:
> 
> 	* cfgloopmanip.cc (scale_loop_frequencies): Fix typo.
> 	(scale_loop_profile): Don't access null pointer.
> 
> --- inline copy of patch -- 
> diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
> index 6e09dcbb0b1864bc64ffd570a4b923f50c3819b5..b10ef3d2be82902ccd74e52a4318217b2db13bcb 100644
> --- a/gcc/cfgloopmanip.cc
> +++ b/gcc/cfgloopmanip.cc
> @@ -501,7 +501,7 @@ scale_loop_frequencies (class loop *loop, profile_probability p)
>  /* Scale profile in LOOP by P.
>     If ITERATION_BOUND is non-zero, scale even further if loop is predicted
>     to iterate too many times.
> -   Before caling this function, preheader block profile should be already
> +   Before calling this function, preheader block profile should be already
>     scaled to final count.  This is necessary because loop iterations are
>     determined by comparing header edge count to latch ege count and thus
>     they need to be scaled synchronously.  */
> @@ -597,14 +597,14 @@ scale_loop_profile (class loop *loop, profile_probability p,
>        /* If latch exists, change its count, since we changed
>  	 probability of exit.  Theoretically we should update everything from
>  	 source of exit edge to latch, but for vectorizer this is enough.  */
> -      if (loop->latch && loop->latch != e->src)
> +      if (e && loop->latch && loop->latch != e->src)
>  	loop->latch->count += count_delta;
>  
>        /* Scale the probabilities.  */
>        scale_loop_frequencies (loop, p);
>  
>        /* Change latch's count back.  */
> -      if (loop->latch && loop->latch != e->src)
> +      if (e && loop->latch && loop->latch != e->src)
>  	loop->latch->count -= count_delta;
>  
>        if (dump_file && (dump_flags & TDF_DETAILS))
> 
> 
> 
> 
>
Jan Hubicka July 4, 2023, 2:57 p.m. UTC | #2
> On Wed, 28 Jun 2023, Tamar Christina wrote:
> 
> > Hi All,
> > 
> > There's an existing bug in loop frequency scaling where the if statement checks
> > to see if there's a single exit, and records an dump file note but then
> > continues.
> > 
> > It then tries to access the null pointer, which of course fails.
> > 
> > For multiple loop exists it's not really clear how to scale the exit
> > probablities as it's really unknown which exit is most probably.
> > 
> > For that reason I ignore the exit edges during scaling but still adjust the
> > loop body.
> > 
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> > 
> > Ok for master?
> 
> I can't really make sense of
> 
>       /* If latch exists, change its count, since we changed
>          probability of exit.  Theoretically we should update everything 
> from
>          source of exit edge to latch, but for vectorizer this is enough.  
> */
>       if (loop->latch && loop->latch != e->src)
>         loop->latch->count += count_delta;
> 
> since with simple latches the latch itself is an empty forwarder and
> e->src is the block with the conditional eventually exiting the block.
> That means this condition is always true.
> 
> So I think for exits the idea is to "remove" them by redirecting
> them "count-wise" back into the loop.  So turn
> 
>    if (cond) --(exit-count)-- exit
>      |
>      | in-loop-count
>      |
>    latch
> 
> into
> 
>    [cond-blk-count]
>    if (cond) -- (zero count) -- exit
>      |
>      | in-loop-cound + exit-count (== cond-blk-count)
>      |
>    latch (now with cond-blk-count)

This is the opposite situation.  You have a loop predicted to iterate 10
times, but you found it actually iterates at most twice.  So you want to
 1) scale down the profile of every BB in the loop
    so the header count is 2*sum_of_counts_from_entry_edges
    instead of 10*
 2) reduce the probability of the loopback edge and instead increase the
    probability of the exit.

The code attempts to get right only the case where the loop has one exit,
and instead of
  if (cond) -- (original-wrong-exit-probability) -- exit
it does
  if (cond) -- (exit-probability=1/#iterations) -- exit
Now it should adjust the in-loop count on every path from the source of
the exit edge to the latch edge.  It just assumes that there is one basic
block that is the latch and does it there.

I was just looking into using this for the profile update when loop-ch or
complete unrolling proves that a loop iterates fewer times than the
profile says.  I can clean up the function - it was originally written for
the old representation of probabilities and counts and I did not do a very
good job of updating it to the new code.

Honza
> 
> and the comment correctly suggests all blocks following from here
> would need similar adjustment (and on in-loop branches the delta would be
> distributed according to probabilities).
> 
> Given the code is quite imperfect I would suggest to change the
> updating of the latch block count to read
> 
>   profile_count count_delta = profile_count::zero ();
>   if (loop->latch
>       && single_pred_p (loop->latch)
>       && loop_exits_from_bb_p (single_pred (loop->latch)))
>     {
>       count_delta = single_pred (loop->latch)->count - loop->latch->count;
>       loop->latch->count = single_pred (loop->latch)->count;
>     }
> 
>    scale_loop_frequencies (loop, p);
> 
>   if (count_delta != 0)
>     loop->latch->count -= count_delta;
> 
> which should exactly preserve the exit-before-latch behavior independent
> on the number of exits of the loop.
> 
> Please leave Honza a chance to comment here.
> 
> Thanks,
> Richard.
> 
> 
> > Thanks,
> > Tamar
> > 
> > gcc/ChangeLog:
> > 
> > 	* cfgloopmanip.cc (scale_loop_frequencies): Fix typo.
> > 	(scale_loop_profile): Don't access null pointer.
> > 
> > --- inline copy of patch -- 
> > diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
> > index 6e09dcbb0b1864bc64ffd570a4b923f50c3819b5..b10ef3d2be82902ccd74e52a4318217b2db13bcb 100644
> > --- a/gcc/cfgloopmanip.cc
> > +++ b/gcc/cfgloopmanip.cc
> > @@ -501,7 +501,7 @@ scale_loop_frequencies (class loop *loop, profile_probability p)
> >  /* Scale profile in LOOP by P.
> >     If ITERATION_BOUND is non-zero, scale even further if loop is predicted
> >     to iterate too many times.
> > -   Before caling this function, preheader block profile should be already
> > +   Before calling this function, preheader block profile should be already
> >     scaled to final count.  This is necessary because loop iterations are
> >     determined by comparing header edge count to latch ege count and thus
> >     they need to be scaled synchronously.  */
> > @@ -597,14 +597,14 @@ scale_loop_profile (class loop *loop, profile_probability p,
> >        /* If latch exists, change its count, since we changed
> >  	 probability of exit.  Theoretically we should update everything from
> >  	 source of exit edge to latch, but for vectorizer this is enough.  */
> > -      if (loop->latch && loop->latch != e->src)
> > +      if (e && loop->latch && loop->latch != e->src)
> >  	loop->latch->count += count_delta;
> >  
> >        /* Scale the probabilities.  */
> >        scale_loop_frequencies (loop, p);
> >  
> >        /* Change latch's count back.  */
> > -      if (loop->latch && loop->latch != e->src)
> > +      if (e && loop->latch && loop->latch != e->src)
> >  	loop->latch->count -= count_delta;
> >  
> >        if (dump_file && (dump_flags & TDF_DETAILS))
> > 
> > 
> > 
> > 
> > 
> 
> -- 
> Richard Biener <rguenther@suse.de>
> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg,
> Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman;
> HRB 36809 (AG Nuernberg)
Jan Hubicka July 6, 2023, 2:34 p.m. UTC | #3
Hi,
original scale_loop_profile was implemented to handle only the very simple
loops produced by the vectorizer at that time (basically loops with only
one exit and no subloops).  It also has not been updated to the new
profile-count API very carefully.  Since I want to use it from loop
peeling and unlooping, I need the function to at least not make the
profile worse on general loops.

The function does two things
 1) scales down the loop profile by a given probability.
    This is useful, for example, to scale down the profile after peeling
    when the loop body is executed less often than before
 2) after scaling is done and if the profile indicates too large an
    iteration count, update the profile to cap the iteration count by the
    ITERATION_BOUND parameter.

Step 1 is easy and unchanged.

I changed ITERATION_BOUND to be the actual bound on the number of
iterations as used elsewhere (i.e. the number of executions of the latch
edge) rather than the number of iterations + 1 as it was before.

To do 2) one needs to do the following
  a) scale the loop's own profile so the frequency of the header is at
     most the sum of in-edge counts * (iteration_bound + 1)
  b) update the loop exit probabilities so their count is the same
     as before scaling.
  c) reduce the frequencies of basic blocks after the loop exit

The old code did b) by setting the probability to 1 / iteration_bound,
which is correct only if the basic block containing the exit executes
precisely once per iteration (i.e. it is not inside another conditional or
an inner loop).  This is fixed now by using
set_edge_probability_and_rescale_others.

Also, c) was implemented only for the special case when the exit was just
before the latch basic block.  I now use dominance info to get some of the
additional cases right.

I still did not try to do anything for multiple-exit loops, though the
implementation could be generalized.

Bootstrapped/regtested x86_64-linux.  Plan to commit it tonight if there
are no complaints.

gcc/ChangeLog:

	* cfgloopmanip.cc (scale_loop_profile): Rewrite exit edge
	probability update to be safe on loops with subloops.
	Make bound parameter to be iteration bound.
	* tree-ssa-loop-ivcanon.cc (try_peel_loop): Update call
	of scale_loop_profile.
	* tree-vect-loop-manip.cc (vect_do_peeling): Likewise.

diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
index 6e09dcbb0b1..524b979a546 100644
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -499,7 +499,7 @@ scale_loop_frequencies (class loop *loop, profile_probability p)
 }
 
 /* Scale profile in LOOP by P.
-   If ITERATION_BOUND is non-zero, scale even further if loop is predicted
+   If ITERATION_BOUND is not -1, scale even further if loop is predicted
    to iterate too many times.
    Before caling this function, preheader block profile should be already
    scaled to final count.  This is necessary because loop iterations are
@@ -510,106 +510,123 @@ void
 scale_loop_profile (class loop *loop, profile_probability p,
 		    gcov_type iteration_bound)
 {
-  edge e, preheader_e;
-  edge_iterator ei;
-
-  if (dump_file && (dump_flags & TDF_DETAILS))
+  if (!(p == profile_probability::always ()))
     {
-      fprintf (dump_file, ";; Scaling loop %i with scale ",
-	       loop->num);
-      p.dump (dump_file);
-      fprintf (dump_file, " bounding iterations to %i\n",
-	       (int)iteration_bound);
-    }
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, ";; Scaling loop %i with scale ",
+		   loop->num);
+	  p.dump (dump_file);
+	  fprintf (dump_file, "\n");
+	}
 
-  /* Scale the probabilities.  */
-  scale_loop_frequencies (loop, p);
+      /* Scale the probabilities.  */
+      scale_loop_frequencies (loop, p);
+    }
 
-  if (iteration_bound == 0)
+  if (iteration_bound == -1)
     return;
 
   gcov_type iterations = expected_loop_iterations_unbounded (loop, NULL, true);
+  if (iterations == -1)
+    return;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
-      fprintf (dump_file, ";; guessed iterations after scaling %i\n",
-	       (int)iterations);
+      fprintf (dump_file,
+	       ";; guessed iterations of loop %i:%i new upper bound %i:\n",
+	       loop->num,
+	       (int)iterations,
+	       (int)iteration_bound);
     }
 
   /* See if loop is predicted to iterate too many times.  */
   if (iterations <= iteration_bound)
     return;
 
-  preheader_e = loop_preheader_edge (loop);
-
-  /* We could handle also loops without preheaders, but bounding is
-     currently used only by optimizers that have preheaders constructed.  */
-  gcc_checking_assert (preheader_e);
-  profile_count count_in = preheader_e->count ();
+  /* Compute number of invocations of the loop.  */
+  profile_count count_in = profile_count::zero ();
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, loop->header->preds)
+    count_in += e->count ();
 
-  if (count_in > profile_count::zero ()
-      && loop->header->count.initialized_p ())
+  /* Now scale the loop body so header count is
+     count_in * (iteration_bound + 1)  */
+  profile_probability scale_prob
+    = (count_in *= iteration_bound).probability_in (loop->header->count);
+  if (dump_file && (dump_flags & TDF_DETAILS))
     {
-      profile_count count_delta = profile_count::zero ();
-
-      e = single_exit (loop);
-      if (e)
-	{
-	  edge other_e;
-	  FOR_EACH_EDGE (other_e, ei, e->src->succs)
-	    if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE))
-		&& e != other_e)
-	      break;
-
-	  /* Probability of exit must be 1/iterations.  */
-	  count_delta = e->count ();
-	  e->probability = profile_probability::always () / iteration_bound;
-	  other_e->probability = e->probability.invert ();
-
-	  /* In code below we only handle the following two updates.  */
-	  if (other_e->dest != loop->header
-	      && other_e->dest != loop->latch
-	      && (dump_file && (dump_flags & TDF_DETAILS)))
-	    {
-	      fprintf (dump_file, ";; giving up on update of paths from "
-		       "exit condition to latch\n");
-	    }
-	}
+      fprintf (dump_file, ";; Scaling loop %i with scale ",
+	       loop->num);
+      p.dump (dump_file);
+      fprintf (dump_file, " to reach upper bound %i\n",
+	       (int)iteration_bound);
+    }
+  /* Finally attempt to fix exit edge probability.  */
+  auto_vec<edge> exits = get_loop_exit_edges  (loop);
+  edge exit_edge = single_likely_exit (loop, exits);
+
+  /* In a consistent profile unadjusted_exit_count should be same as count_in,
+     however to preserve as much of the original info, avoid recomputing
+     it.  */
+  profile_count unadjusted_exit_count;
+  if (exit_edge)
+    unadjusted_exit_count = exit_edge->count ();
+  scale_loop_frequencies (loop, scale_prob);
+
+  if (exit_edge)
+    {
+      profile_count old_exit_count = exit_edge->count ();
+      profile_probability new_probability;
+      if (iteration_bound > 0)
+	new_probability
+	  = unadjusted_exit_count.probability_in (exit_edge->src->count);
       else
-        if (dump_file && (dump_flags & TDF_DETAILS))
-	  fprintf (dump_file, ";; Loop has multiple exit edges; "
-	      		      "giving up on exit condition update\n");
-
-      /* Roughly speaking we want to reduce the loop body profile by the
-	 difference of loop iterations.  We however can do better if
-	 we look at the actual profile, if it is available.  */
-      p = profile_probability::always ();
-
-      count_in *= iteration_bound;
-      p = count_in.probability_in (loop->header->count);
-      if (!(p > profile_probability::never ()))
-	p = profile_probability::very_unlikely ();
-
-      if (p == profile_probability::always ()
-	  || !p.initialized_p ())
-	return;
-
-      /* If latch exists, change its count, since we changed
-	 probability of exit.  Theoretically we should update everything from
-	 source of exit edge to latch, but for vectorizer this is enough.  */
-      if (loop->latch && loop->latch != e->src)
-	loop->latch->count += count_delta;
-
-      /* Scale the probabilities.  */
-      scale_loop_frequencies (loop, p);
+	new_probability = profile_probability::always ();
+      set_edge_probability_and_rescale_others (exit_edge, new_probability);
+      profile_count new_exit_count = exit_edge->count ();
+
+      /* Rescale the remaining edge probabilities and see if there is only
+	 one.  */
+      edge other_edge = NULL;
+      bool found = false;
+      FOR_EACH_EDGE (e, ei, exit_edge->src->succs)
+	if (!(e->flags & EDGE_FAKE)
+	    && !(e->probability == profile_probability::never ())
+	    && !loop_exit_edge_p (loop, e))
+	  {
+	    if (found)
+	      {
+		other_edge = NULL;
+		break;
+	      }
+	    other_edge = e;
+	    found = true;
+	  }
+      /* If there is only loop latch after other edge,
+	 update its profile.  */
+      if (other_edge && other_edge->dest == loop->latch)
+	loop->latch->count -= new_exit_count - old_exit_count;
+      else
+	{
+	  basic_block *body = get_loop_body (loop);
+	  profile_count new_count = exit_edge->src->count - new_exit_count;
+	  profile_count old_count = exit_edge->src->count - old_exit_count;
 
-      /* Change latch's count back.  */
-      if (loop->latch && loop->latch != e->src)
-	loop->latch->count -= count_delta;
+	  for (unsigned int i = 0; i < loop->num_nodes; i++)
+	    if (body[i] != exit_edge->src
+		&& dominated_by_p (CDI_DOMINATORS, body[i], exit_edge->src))
+	      body[i]->count.apply_scale (new_count, old_count);
 
-      if (dump_file && (dump_flags & TDF_DETAILS))
-	fprintf (dump_file, ";; guessed iterations are now %i\n",
-		 (int)expected_loop_iterations_unbounded (loop, NULL, true));
+	  free (body);
+	}
+    }
+  else if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file,
+	       ";; Loop has mulitple exits;"
+	       " will leave exit probabilities inconsistent\n");
     }
 }
 
diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
index 491b57ec0f1..184c08eec75 100644
--- a/gcc/tree-ssa-loop-ivcanon.cc
+++ b/gcc/tree-ssa-loop-ivcanon.cc
@@ -1173,7 +1179,7 @@ try_peel_loop (class loop *loop,
       }
   profile_probability p;
   p = entry_count.probability_in (loop->header->count);
-  scale_loop_profile (loop, p, 0);
+  scale_loop_profile (loop, p, -1);
   bitmap_set_bit (peeled_loops, loop->num);
   return true;
 }
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index d66d4a6de69..2361cb328ab 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -3191,7 +3191,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       if (prob_vector.initialized_p ())
 	{
 	  scale_bbs_frequencies (&bb_before_loop, 1, prob_vector);
-	  scale_loop_profile (loop, prob_vector, 0);
+	  scale_loop_profile (loop, prob_vector, -1);
 	}
     }
 
@@ -3236,7 +3236,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 	  slpeel_update_phi_nodes_for_guard1 (prolog, loop, guard_e, e);
 
 	  scale_bbs_frequencies (&bb_after_prolog, 1, prob_prolog);
-	  scale_loop_profile (prolog, prob_prolog, bound_prolog);
+	  scale_loop_profile (prolog, prob_prolog, bound_prolog - 1);
 	}
 
       /* Update init address of DRs.  */
@@ -3378,7 +3378,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 
 	      scale_bbs_frequencies (&bb_before_epilog, 1, prob_epilog);
 	    }
-	  scale_loop_profile (epilog, prob_epilog, 0);
+	  scale_loop_profile (epilog, prob_epilog, -1);
 	}
       else
 	slpeel_update_phi_nodes_for_lcssa (epilog);
Richard Biener July 7, 2023, 5:59 a.m. UTC | #4
On Thu, 6 Jul 2023, Jan Hubicka wrote:

> Hi,
> original scale_loop_profile was implemented to only handle very simple loops
> produced by vectorizer at that time (basically loops with only one exit and no
> subloops). It also has not been updated to new profile-count API very carefully.
> Since I want to use it from loop peeling and unlooping, I need the
> function to at least not get profile worse on general loops.
> 
> The function does two thigs
>  1) scales down the loop profile by a given probability.
>     This is useful, for example, to scale down profile after peeling when loop
>     body is executed less often than before
>  2) after scaling is done and if profile indicates too large iteration
>     count update profile to cap iteration count by ITERATION_BOUND parameter.
> 
> Step 1 is easy and unchanged.
> 
> I changed ITERATION_BOUND to be actual bound on number of iterations as
> used elsewhere (i.e. number of executions of latch edge) rather then
> number of iterations + 1 as it was before.
> 
> To do 2) one needs to do the following
>   a) scale own loop profile so frquency o header is at most
>      the sum of in-edge counts * (iteration_bound + 1)
>   b) update loop exit probabilities so their count is the same
>      as before scaling.
>   c) reduce frequencies of basic blocks after loop exit
> 
> old code did b) by setting probability to 1 / iteration_bound which is
> correctly only of the basic block containing exit executes precisely one per
> iteration (it is not insie other conditional or inner loop).  This is fixed
> now by using set_edge_probability_and_rescale_others
> 
> aldo c) was implemented only for special case when the exit was just before
> latch bacis block.  I now use dominance info to get right some of addional
> case.
> 
> I still did not try to do anything for multiple exit loops, though the
> implementatoin could be generalized.
> 
> Bootstrapped/regtested x86_64-linux.  Plan to cmmit it tonight if there
> are no complains.

Looks good, but I wonder what we can do to at least make the
multiple exit case behave reasonably?  The vectorizer keeps track
of a "canonical" exit, would it be possible to pass in the main
exit edge and use that instead of single_exit (), would other
exits then behave somewhat reasonably or would we totally screw
things up here?  That is, the "canonical" exit would be the
counting exit while the other exits are on data-driven conditions
and thus wouldn't change probability when we reduce the number
of iterations(?)

Richard.

> gcc/ChangeLog:
> 
> 	* cfgloopmanip.cc (scale_loop_profile): Rewrite exit edge
> 	probability update to be safe on loops with subloops.
> 	Make bound parameter to be iteration bound.
> 	* tree-ssa-loop-ivcanon.cc (try_peel_loop): Update call
> 	of scale_loop_profile.
> 	* tree-vect-loop-manip.cc (vect_do_peeling): Likewise.
> 
> diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
> index 6e09dcbb0b1..524b979a546 100644
> --- a/gcc/cfgloopmanip.cc
> +++ b/gcc/cfgloopmanip.cc
> @@ -499,7 +499,7 @@ scale_loop_frequencies (class loop *loop, profile_probability p)
>  }
>  
>  /* Scale profile in LOOP by P.
> -   If ITERATION_BOUND is non-zero, scale even further if loop is predicted
> +   If ITERATION_BOUND is not -1, scale even further if loop is predicted
>     to iterate too many times.
>     Before caling this function, preheader block profile should be already
>     scaled to final count.  This is necessary because loop iterations are
> @@ -510,106 +510,123 @@ void
>  scale_loop_profile (class loop *loop, profile_probability p,
>  		    gcov_type iteration_bound)
>  {
> -  edge e, preheader_e;
> -  edge_iterator ei;
> -
> -  if (dump_file && (dump_flags & TDF_DETAILS))
> +  if (!(p == profile_probability::always ()))
>      {
> -      fprintf (dump_file, ";; Scaling loop %i with scale ",
> -	       loop->num);
> -      p.dump (dump_file);
> -      fprintf (dump_file, " bounding iterations to %i\n",
> -	       (int)iteration_bound);
> -    }
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	{
> +	  fprintf (dump_file, ";; Scaling loop %i with scale ",
> +		   loop->num);
> +	  p.dump (dump_file);
> +	  fprintf (dump_file, "\n");
> +	}
>  
> -  /* Scale the probabilities.  */
> -  scale_loop_frequencies (loop, p);
> +      /* Scale the probabilities.  */
> +      scale_loop_frequencies (loop, p);
> +    }
>  
> -  if (iteration_bound == 0)
> +  if (iteration_bound == -1)
>      return;
>  
>    gcov_type iterations = expected_loop_iterations_unbounded (loop, NULL, true);
> +  if (iterations == -1)
> +    return;
>  
>    if (dump_file && (dump_flags & TDF_DETAILS))
>      {
> -      fprintf (dump_file, ";; guessed iterations after scaling %i\n",
> -	       (int)iterations);
> +      fprintf (dump_file,
> +	       ";; guessed iterations of loop %i:%i new upper bound %i:\n",
> +	       loop->num,
> +	       (int)iterations,
> +	       (int)iteration_bound);
>      }
>  
>    /* See if loop is predicted to iterate too many times.  */
>    if (iterations <= iteration_bound)
>      return;
>  
> -  preheader_e = loop_preheader_edge (loop);
> -
> -  /* We could handle also loops without preheaders, but bounding is
> -     currently used only by optimizers that have preheaders constructed.  */
> -  gcc_checking_assert (preheader_e);
> -  profile_count count_in = preheader_e->count ();
> +  /* Compute number of invocations of the loop.  */
> +  profile_count count_in = profile_count::zero ();
> +  edge e;
> +  edge_iterator ei;
> +  FOR_EACH_EDGE (e, ei, loop->header->preds)
> +    count_in += e->count ();
>  
> -  if (count_in > profile_count::zero ()
> -      && loop->header->count.initialized_p ())
> +  /* Now scale the loop body so header count is
> +     count_in * (iteration_bound + 1)  */
> +  profile_probability scale_prob
> +    = (count_in *= iteration_bound).probability_in (loop->header->count);
> +  if (dump_file && (dump_flags & TDF_DETAILS))
>      {
> -      profile_count count_delta = profile_count::zero ();
> -
> -      e = single_exit (loop);
> -      if (e)
> -	{
> -	  edge other_e;
> -	  FOR_EACH_EDGE (other_e, ei, e->src->succs)
> -	    if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE))
> -		&& e != other_e)
> -	      break;
> -
> -	  /* Probability of exit must be 1/iterations.  */
> -	  count_delta = e->count ();
> -	  e->probability = profile_probability::always () / iteration_bound;
> -	  other_e->probability = e->probability.invert ();
> -
> -	  /* In code below we only handle the following two updates.  */
> -	  if (other_e->dest != loop->header
> -	      && other_e->dest != loop->latch
> -	      && (dump_file && (dump_flags & TDF_DETAILS)))
> -	    {
> -	      fprintf (dump_file, ";; giving up on update of paths from "
> -		       "exit condition to latch\n");
> -	    }
> -	}
> +      fprintf (dump_file, ";; Scaling loop %i with scale ",
> +	       loop->num);
> +      p.dump (dump_file);
> +      fprintf (dump_file, " to reach upper bound %i\n",
> +	       (int)iteration_bound);
> +    }
> +  /* Finally attempt to fix exit edge probability.  */
> +  auto_vec<edge> exits = get_loop_exit_edges  (loop);
> +  edge exit_edge = single_likely_exit (loop, exits);
> +
> +  /* In a consistent profile unadjusted_exit_count should be same as count_in,
> +     however to preserve as much of the original info, avoid recomputing
> +     it.  */
> +  profile_count unadjusted_exit_count;
> +  if (exit_edge)
> +    unadjusted_exit_count = exit_edge->count ();
> +  scale_loop_frequencies (loop, scale_prob);
> +
> +  if (exit_edge)
> +    {
> +      profile_count old_exit_count = exit_edge->count ();
> +      profile_probability new_probability;
> +      if (iteration_bound > 0)
> +	new_probability
> +	  = unadjusted_exit_count.probability_in (exit_edge->src->count);
>        else
> -        if (dump_file && (dump_flags & TDF_DETAILS))
> -	  fprintf (dump_file, ";; Loop has multiple exit edges; "
> -	      		      "giving up on exit condition update\n");
> -
> -      /* Roughly speaking we want to reduce the loop body profile by the
> -	 difference of loop iterations.  We however can do better if
> -	 we look at the actual profile, if it is available.  */
> -      p = profile_probability::always ();
> -
> -      count_in *= iteration_bound;
> -      p = count_in.probability_in (loop->header->count);
> -      if (!(p > profile_probability::never ()))
> -	p = profile_probability::very_unlikely ();
> -
> -      if (p == profile_probability::always ()
> -	  || !p.initialized_p ())
> -	return;
> -
> -      /* If latch exists, change its count, since we changed
> -	 probability of exit.  Theoretically we should update everything from
> -	 source of exit edge to latch, but for vectorizer this is enough.  */
> -      if (loop->latch && loop->latch != e->src)
> -	loop->latch->count += count_delta;
> -
> -      /* Scale the probabilities.  */
> -      scale_loop_frequencies (loop, p);
> +	new_probability = profile_probability::always ();
> +      set_edge_probability_and_rescale_others (exit_edge, new_probability);
> +      profile_count new_exit_count = exit_edge->count ();
> +
> +      /* Rescale the remaining edge probabilities and see if there is only
> +	 one.  */
> +      edge other_edge = NULL;
> +      bool found = false;
> +      FOR_EACH_EDGE (e, ei, exit_edge->src->succs)
> +	if (!(e->flags & EDGE_FAKE)
> +	    && !(e->probability == profile_probability::never ())
> +	    && !loop_exit_edge_p (loop, e))
> +	  {
> +	    if (found)
> +	      {
> +		other_edge = NULL;
> +		break;
> +	      }
> +	    other_edge = e;
> +	    found = true;
> +	  }
> +      /* If there is only loop latch after other edge,
> +	 update its profile.  */
> +      if (other_edge && other_edge->dest == loop->latch)
> +	loop->latch->count -= new_exit_count - old_exit_count;
> +      else
> +	{
> +	  basic_block *body = get_loop_body (loop);
> +	  profile_count new_count = exit_edge->src->count - new_exit_count;
> +	  profile_count old_count = exit_edge->src->count - old_exit_count;
>  
> -      /* Change latch's count back.  */
> -      if (loop->latch && loop->latch != e->src)
> -	loop->latch->count -= count_delta;
> +	  for (unsigned int i = 0; i < loop->num_nodes; i++)
> +	    if (body[i] != exit_edge->src
> +		&& dominated_by_p (CDI_DOMINATORS, body[i], exit_edge->src))
> +	      body[i]->count = body[i]->count.apply_scale (new_count, old_count);
>  
> -      if (dump_file && (dump_flags & TDF_DETAILS))
> -	fprintf (dump_file, ";; guessed iterations are now %i\n",
> -		 (int)expected_loop_iterations_unbounded (loop, NULL, true));
> +	  free (body);
> +	}
> +    }
> +  else if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file,
> +	       ";; Loop has multiple exits;"
> +	       " will leave exit probabilities inconsistent\n");
>      }
>  }
>  
> diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
> index 491b57ec0f1..184c08eec75 100644
> --- a/gcc/tree-ssa-loop-ivcanon.cc
> +++ b/gcc/tree-ssa-loop-ivcanon.cc
> @@ -1173,7 +1179,7 @@ try_peel_loop (class loop *loop,
>        }
>    profile_probability p;
>    p = entry_count.probability_in (loop->header->count);
> -  scale_loop_profile (loop, p, 0);
> +  scale_loop_profile (loop, p, -1);
>    bitmap_set_bit (peeled_loops, loop->num);
>    return true;
>  }
> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index d66d4a6de69..2361cb328ab 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -3191,7 +3191,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>        if (prob_vector.initialized_p ())
>  	{
>  	  scale_bbs_frequencies (&bb_before_loop, 1, prob_vector);
> -	  scale_loop_profile (loop, prob_vector, 0);
> +	  scale_loop_profile (loop, prob_vector, -1);
>  	}
>      }
>  
> @@ -3236,7 +3236,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>  	  slpeel_update_phi_nodes_for_guard1 (prolog, loop, guard_e, e);
>  
>  	  scale_bbs_frequencies (&bb_after_prolog, 1, prob_prolog);
> -	  scale_loop_profile (prolog, prob_prolog, bound_prolog);
> +	  scale_loop_profile (prolog, prob_prolog, bound_prolog - 1);
>  	}
>  
>        /* Update init address of DRs.  */
> @@ -3378,7 +3378,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
>  
>  	      scale_bbs_frequencies (&bb_before_epilog, 1, prob_epilog);
>  	    }
> -	  scale_loop_profile (epilog, prob_epilog, 0);
> +	  scale_loop_profile (epilog, prob_epilog, -1);
>  	}
>        else
>  	slpeel_update_phi_nodes_for_lcssa (epilog);
>
Jan Hubicka July 7, 2023, 12:20 p.m. UTC | #5
> 
> Looks good, but I wonder what we can do to at least make the
> multiple exit case behave reasonably?  The vectorizer keeps track

> of a "canonical" exit, would it be possible to pass in the main
> exit edge and use that instead of single_exit (), would other
> exits then behave somewhat reasonable or would we totally screw
> things up here?  That is, the "canonical" exit would be the
> counting exit while the other exits are on data driven conditions
> and thus wouldn't change probability when we reduce the number
> of iterations(?)

I can add a canonical_exit parameter and make the function direct the flow
to it if possible.  However, overall I think the fixup depends on what
transformation led to the change.

Assuming that the vectorizer created no prologues and epilogues and we
vectorized with factor N, then I think the update could be done more
specifically as follows.

We know that the header block count dropped by a factor of N. So we can
start from that, and each time we reach a basic block with an exit edge,
we know the original count of the edge.  This count is unchanged, so one
can rescale the probabilities out of that BB accordingly.  If the loop has
no inner loops, we can just walk the body in RPO, propagate the scales
downwards, and we should arrive at the right result.

I originally added the bound parameter to handle prologues/epilogues,
which get a new artificial bound.  In the prologue I think you are right
that the flow will probably be directed to the condition counting the
iterations.

In the epilogue we add no artificial iteration cap, so maybe it is more
realistic to simply scale up the probability of all exits?

To see what is going on I tried following testcase:

int a[99];
test()
{
  for (int i = 0; i < 99; i++)
      a[i]++;
}

What surprises me is that the vectorizer at -O2 does nothing and we end up
unrolling the loop:

L2:
        addl    $1, (%rax)
        addl    $1, 4(%rax)
        addl    $1, 8(%rax)
        addq    $12, %rax
        cmpq    $a+396, %rax

Which seems a silly thing to do.  A vectorized loop with an epilogue doing
2 and then 1 addition would be better.

With -O3 we vectorize it:


.L2:
        movdqa  (%rax), %xmm0
        addq    $16, %rax
        paddd   %xmm1, %xmm0
        movaps  %xmm0, -16(%rax)
        cmpq    %rax, %rdx
        jne     .L2
        movq    a+384(%rip), %xmm0
        addl    $1, a+392(%rip)
        movq    .LC1(%rip), %xmm1
        paddd   %xmm1, %xmm0
        movq    %xmm0, a+384(%rip)


and correctly drop the vectorized loop body to 24 iterations.  However the
epilogue has a loop for vector size 2 predicted to iterate once (it won't):

;;   basic block 7, loop depth 0, count 10737416 (estimated locally), maybe hot 
;;    prev block 5, next block 8, flags: (NEW, VISITED)                         
;;    pred:       3 [4.0% (adjusted)]  count:10737416 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;    succ:       8 [always]  count:10737416 (estimated locally) (FALLTHRU,EXECUTABLE)
                                                                                
;;   basic block 8, loop depth 1, count 21474835 (estimated locally), maybe hot 
;;    prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED)              
;;    pred:       9 [always]  count:10737417 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                7 [always]  count:10737416 (estimated locally) (FALLTHRU,EXECUTABLE)
  # i_9 = PHI <i_17(9), 96(7)>                                                  
  # ivtmp_13 = PHI <ivtmp_18(9), 3(7)>                                          
  # vectp_a.14_40 = PHI <vectp_a.14_41(9), &MEM <int[99]> [(void *)&a + 384B](7)>
  # vectp_a.18_46 = PHI <vectp_a.18_47(9), &MEM <int[99]> [(void *)&a + 384B](7)>
  # ivtmp_49 = PHI <ivtmp_50(9), 0(7)>                                          
  vect__14.16_42 = MEM <vector(2) int> [(int *)vectp_a.14_40];                  
  _14 = a[i_9];                                                                 
  vect__15.17_44 = vect__14.16_42 + { 1, 1 };                                   
  _15 = _14 + 1;                                                                
  MEM <vector(2) int> [(int *)vectp_a.18_46] = vect__15.17_44;                  
  i_17 = i_9 + 1;                                                               
  ivtmp_18 = ivtmp_13 - 1;                                                      
  vectp_a.14_41 = vectp_a.14_40 + 8;                                            
  vectp_a.18_47 = vectp_a.18_46 + 8;                                            
  ivtmp_50 = ivtmp_49 + 1;                                                      
  if (ivtmp_50 < 1)                                                             
    goto <bb 9>; [50.00%]                                                       
  else                                                                          
    goto <bb 12>; [50.00%]                                                      

and finally the scalar copy

;;   basic block 12, loop depth 0, count 10737416 (estimated locally), maybe hot
;;    prev block 9, next block 13, flags: (NEW, VISITED)                        
;;    pred:       8 [50.0% (adjusted)]  count:10737418 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;    succ:       13 [always]  count:10737416 (estimated locally) (FALLTHRU)    
                                                                                
;;   basic block 13, loop depth 1, count 1063004409 (estimated locally), maybe hot
;;    prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)            
;;    pred:       14 [always]  count:1052266996 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;                12 [always]  count:10737416 (estimated locally) (FALLTHRU)    
  # i_30 = PHI <i_36(14), 98(12)>                                               
  # ivtmp_32 = PHI <ivtmp_37(14), 1(12)>                                        
  _33 = a[i_30];                                                                
  _34 = _33 + 1;                                                                
  a[i_30] = _34;                                                                
  i_36 = i_30 + 1;                                                              
  ivtmp_37 = ivtmp_32 - 1;                                                      
  if (ivtmp_37 != 0)                                                            
    goto <bb 14>; [98.99%]                                                      
  else                                                                          
    goto <bb 4>; [1.01%]                                                        

Also with a small but non-zero iteration probability.  This is papered
over by my patch from yesterday.  But it seems to me that it would be a
lot better if the vectorizer understood that the epilogue will be loopless
and accounted for that in the cost model; that would probably make it easy
to enable it at cheap cost too.

Clang 16 at -O2 is much more aggressive, both vectorizing and unrolling:

test:                                   # @test
        .cfi_startproc
# %bb.0:
        movdqa  a(%rip), %xmm1
        pcmpeqd %xmm0, %xmm0
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a(%rip)
        movdqa  a+16(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+16(%rip)
        movdqa  a+32(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+32(%rip)
        movdqa  a+48(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+48(%rip)
        movdqa  a+64(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+64(%rip)
        movdqa  a+80(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+80(%rip)
        movdqa  a+96(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+96(%rip)
        movdqa  a+112(%rip), %xmm1
        psubd   %xmm0, %xmm1
....
        movdqa  %xmm1, a+240(%rip)
        movdqa  a+256(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+256(%rip)
        movdqa  a+272(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+272(%rip)
        movdqa  a+288(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+288(%rip)
        movdqa  a+304(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+304(%rip)
        movdqa  a+320(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+320(%rip)
        movdqa  a+336(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+336(%rip)
        movdqa  a+352(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+352(%rip)
        movdqa  a+368(%rip), %xmm1
        psubd   %xmm0, %xmm1
        movdqa  %xmm1, a+368(%rip)
        addl    $1, a+384(%rip)
        addl    $1, a+388(%rip)
        addl    $1, a+392(%rip)
        retq

Honza
Tamar Christina July 7, 2023, 12:27 p.m. UTC | #6
Hi Both,

Thanks for all the reviews/patches so far 😊

> >
> > Looks good, but I wonder what we can do to at least make the multiple
> > exit case behave reasonably?  The vectorizer keeps track
> 
> > of a "canonical" exit, would it be possible to pass in the main exit
> > edge and use that instead of single_exit (), would other exits then
> > behave somewhat reasonable or would we totally screw things up here?
> > That is, the "canonical" exit would be the counting exit while the
> > other exits are on data driven conditions and thus wouldn't change
> > probability when we reduce the number of iterations(?)
> 
> I can add canonical_exit parameter and make the function to direct flow to it if
> possible.  However overall I think fixup depends on what transformation led to
> the change.
> 
> Assuming that vectorizer did no prologues and apilogues and we vectorized
> with factor N, then I think the update could be done more specifically as
> follows.
> 

If it helps: this patch series addresses multiple exits by forcing a scalar
epilogue, so all non-canonical exits will have been redirected to that
scalar epilogue and the remaining scalar iteration count will be at most VF.

Regards,
Tamar

> We know that header block count dropped by 4. So we can start from that
> and each time we reach basic block with exit edge, we know the original count
> of the edge.  This count is unchanged, so one can rescale probabilities out of
> that BB accordingly.  If loop has no inner loops, we can just walk the body in
> RPO and propagate scales downwards and we sould arrive to right result
> 
> I originally added the bound parameter to handle prologues/epilogues which
> gets new artificial bound.  In prologue I think you are right that the flow will be
> probably directed to the conditional counting iterations.
> 
> In epilogue we add no artificial iteration cap, so maybe it is more realistic to
> simply scale up probability of all exits?
> 
> To see what is going on I tried following testcase:
> 
> int a[99];
> test()
> {
>   for (int i = 0; i < 99; i++)
>       a[i]++;
> }
> 
> What surprises me is that vectorizer at -O2 does nothing and we end up
> unrolling the loop:
> 
> L2:
>         addl    $1, (%rax)
>         addl    $1, 4(%rax)
>         addl    $1, 8(%rax)
>         addq    $12, %rax
>         cmpq    $a+396, %rax
> 
> Which seems sily thing to do. Vectorized loop with epilogue doing 2 and
> 1 addition would be better.
> 
> With -O3 we vectorize it:
> 
> 
> .L2:
>         movdqa  (%rax), %xmm0
>         addq    $16, %rax
>         paddd   %xmm1, %xmm0
>         movaps  %xmm0, -16(%rax)
>         cmpq    %rax, %rdx
>         jne     .L2
>         movq    a+384(%rip), %xmm0
>         addl    $1, a+392(%rip)
>         movq    .LC1(%rip), %xmm1
>         paddd   %xmm1, %xmm0
>         movq    %xmm0, a+384(%rip)
> 
> 
> and correctly drop vectorized loop body to 24 iterations. However the
> epilogue has loop for vector size 2 predicted to iterate once (it won't)
> 
> ;;   basic block 7, loop depth 0, count 10737416 (estimated locally), maybe
> hot
> ;;    prev block 5, next block 8, flags: (NEW, VISITED)
> ;;    pred:       3 [4.0% (adjusted)]  count:10737416 (estimated locally)
> (FALSE_VALUE,EXECUTABLE)
> ;;    succ:       8 [always]  count:10737416 (estimated locally)
> (FALLTHRU,EXECUTABLE)
> 
> ;;   basic block 8, loop depth 1, count 21474835 (estimated locally), maybe
> hot
> ;;    prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       9 [always]  count:10737417 (estimated locally)
> (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;                7 [always]  count:10737416 (estimated locally)
> (FALLTHRU,EXECUTABLE)
>   # i_9 = PHI <i_17(9), 96(7)>
>   # ivtmp_13 = PHI <ivtmp_18(9), 3(7)>
>   # vectp_a.14_40 = PHI <vectp_a.14_41(9), &MEM <int[99]> [(void *)&a +
> 384B](7)>
>   # vectp_a.18_46 = PHI <vectp_a.18_47(9), &MEM <int[99]> [(void *)&a +
> 384B](7)>
>   # ivtmp_49 = PHI <ivtmp_50(9), 0(7)>
>   vect__14.16_42 = MEM <vector(2) int> [(int *)vectp_a.14_40];
>   _14 = a[i_9];
>   vect__15.17_44 = vect__14.16_42 + { 1, 1 };
>   _15 = _14 + 1;
>   MEM <vector(2) int> [(int *)vectp_a.18_46] = vect__15.17_44;
>   i_17 = i_9 + 1;
>   ivtmp_18 = ivtmp_13 - 1;
>   vectp_a.14_41 = vectp_a.14_40 + 8;
>   vectp_a.18_47 = vectp_a.18_46 + 8;
>   ivtmp_50 = ivtmp_49 + 1;
>   if (ivtmp_50 < 1)
>     goto <bb 9>; [50.00%]
>   else
>     goto <bb 12>; [50.00%]
> 
> and finally the scalar copy
> 
> ;;   basic block 12, loop depth 0, count 10737416 (estimated locally), maybe
> hot
> ;;    prev block 9, next block 13, flags: (NEW, VISITED)
> ;;    pred:       8 [50.0% (adjusted)]  count:10737418 (estimated locally)
> (FALSE_VALUE,EXECUTABLE)
> ;;    succ:       13 [always]  count:10737416 (estimated locally) (FALLTHRU)
> 
> ;;   basic block 13, loop depth 1, count 1063004409 (estimated locally),
> maybe hot
> ;;    prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)
> ;;    pred:       14 [always]  count:1052266996 (estimated locally)
> (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;                12 [always]  count:10737416 (estimated locally) (FALLTHRU)
>   # i_30 = PHI <i_36(14), 98(12)>
>   # ivtmp_32 = PHI <ivtmp_37(14), 1(12)>
>   _33 = a[i_30];
>   _34 = _33 + 1;
>   a[i_30] = _34;
>   i_36 = i_30 + 1;
>   ivtmp_37 = ivtmp_32 - 1;
>   if (ivtmp_37 != 0)
>     goto <bb 14>; [98.99%]
>   else
>     goto <bb 4>; [1.01%]
> 
> With also small but non-zero iteration probability.   This is papered
> over by my yesterday patch. But it seems to me that it would be a lot better if
> vectorizer understood that the epilogue will be loopless and accounted it to
> the cost model that would probably make it easy to enable it at cheap costs
> too.
> 
> Clang 16 at -O2 is much more aggressive by both vectorizing and unroling:
> 
> test:                                   # @test
>         .cfi_startproc
> # %bb.0:
>         movdqa  a(%rip), %xmm1
>         pcmpeqd %xmm0, %xmm0
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a(%rip)
>         movdqa  a+16(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+16(%rip)
>         movdqa  a+32(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+32(%rip)
>         movdqa  a+48(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+48(%rip)
>         movdqa  a+64(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+64(%rip)
>         movdqa  a+80(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+80(%rip)
>         movdqa  a+96(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+96(%rip)
>         movdqa  a+112(%rip), %xmm1
>         psubd   %xmm0, %xmm1
> ....
>         movdqa  %xmm1, a+240(%rip)
>         movdqa  a+256(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+256(%rip)
>         movdqa  a+272(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+272(%rip)
>         movdqa  a+288(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+288(%rip)
>         movdqa  a+304(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+304(%rip)
>         movdqa  a+320(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+320(%rip)
>         movdqa  a+336(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+336(%rip)
>         movdqa  a+352(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+352(%rip)
>         movdqa  a+368(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+368(%rip)
>         addl    $1, a+384(%rip)
>         addl    $1, a+388(%rip)
>         addl    $1, a+392(%rip)
>         retq
> 
> Honza
Jan Hubicka July 7, 2023, 2:10 p.m. UTC | #7
> Hi Both,
> 
> Thanks for all the reviews/patches so far 😊
> 
> > >
> > > Looks good, but I wonder what we can do to at least make the multiple
> > > exit case behave reasonably?  The vectorizer keeps track
> > 
> > > of a "canonical" exit, would it be possible to pass in the main exit
> > > edge and use that instead of single_exit (), would other exits then
> > > behave somewhat reasonable or would we totally screw things up here?
> > > That is, the "canonical" exit would be the counting exit while the
> > > other exits are on data driven conditions and thus wouldn't change
> > > probability when we reduce the number of iterations(?)
> > 
> > I can add canonical_exit parameter and make the function to direct flow to it if
> > possible.  However overall I think fixup depends on what transformation led to
> > the change.
> > 
> > Assuming that vectorizer did no prologues and apilogues and we vectorized
> > with factor N, then I think the update could be done more specifically as
> > follows.
> > 
> 
> If it helps, how this patch series addresses multiple exits by forcing a scalar
> epilogue, all non canonical_exits would have been redirected to this scalar
> epilogue, so the remaining scalar iteration count will be at most VF.

It looks like the profile update after vectorization needs quite some TLC.
My student Ondrej Kubanek also implemented loop histogram profiling,
which gives a better idea of how commonly prologues/epilogues are needed,
and it would also be nice to handle that.
> > ;;   basic block 12, loop depth 0, count 10737416 (estimated locally), maybe
> > hot
> > ;;    prev block 9, next block 13, flags: (NEW, VISITED)
> > ;;    pred:       8 [50.0% (adjusted)]  count:10737418 (estimated locally)
> > (FALSE_VALUE,EXECUTABLE)
> > ;;    succ:       13 [always]  count:10737416 (estimated locally) (FALLTHRU)
> > 
> > ;;   basic block 13, loop depth 1, count 1063004409 (estimated locally),
> > maybe hot
> > ;;    prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)
> > ;;    pred:       14 [always]  count:1052266996 (estimated locally)
> > (FALLTHRU,DFS_BACK,EXECUTABLE)
> > ;;                12 [always]  count:10737416 (estimated locally) (FALLTHRU)
> >   # i_30 = PHI <i_36(14), 98(12)>
> >   # ivtmp_32 = PHI <ivtmp_37(14), 1(12)>
> >   _33 = a[i_30];
> >   _34 = _33 + 1;
> >   a[i_30] = _34;
> >   i_36 = i_30 + 1;
> >   ivtmp_37 = ivtmp_32 - 1;
> >   if (ivtmp_37 != 0)
> >     goto <bb 14>; [98.99%]
> >   else
> >     goto <bb 4>; [1.01%]

Actually it seems that the scalar epilogue loop keeps its original profile
(predicted to iterate 99 times), which is quite wrong.
Looking at the statistics for yesterday's patch, on tramp3d we got an 86%
reduction in cumulative profile mismatches after the whole optimization
pipeline.  More interestingly, however, the overall time estimate
dropped by 18%, so it seems that the profile adjustments done by cunroll
are affecting the profile a lot.

I think the fact that the iteration count of epilogues is not capped is one
of the main problems.

We seem to call scale_loop_profile 3 times:

       scale_loop_profile (loop, prob_vector, -1);

This seems to account for the probability that control flow is
redirected to the prolog/epilog later.  So it only scales down the profile
but is not responsible for setting the bound.

       scale_loop_profile (prolog, prob_prolog, bound_prolog - 1);

This one handles the prolog and sets the bound.

       scale_loop_profile (epilog, prob_epilog, -1);

This scales the epilog but does not set a bound at all.
I think the information is available since we update the loop_info
data structures.

Honza
Richard Biener July 10, 2023, 7:07 a.m. UTC | #8
On Fri, 7 Jul 2023, Jan Hubicka wrote:

> > 
> > Looks good, but I wonder what we can do to at least make the
> > multiple exit case behave reasonably?  The vectorizer keeps track
> 
> > of a "canonical" exit, would it be possible to pass in the main
> > exit edge and use that instead of single_exit (), would other
> > exits then behave somewhat reasonable or would we totally screw
> > things up here?  That is, the "canonical" exit would be the
> > counting exit while the other exits are on data driven conditions
> > and thus wouldn't change probability when we reduce the number
> > of iterations(?)
> 
> I can add canonical_exit parameter and make the function to direct flow
> to it if possible.  However overall I think fixup depends on what
> transformation led to the change.

I think the vectorizer knows there's a single counting IV and all
other exits are dependent on the data processed, so when scaling, the
vectorizer just changes the counting IV.  So I think it makes
sense to pass that exit to the function in all cases.

> Assuming that vectorizer did no prologues and apilogues and we
> vectorized with factor N, then I think the update could be done more
> specifically as follows.
> 
> We know that header block count dropped by 4. So we can start from that
> and each time we reach basic block with exit edge, we know the original
> count of the edge.  This count is unchanged, so one can rescale
> probabilities out of that BB accordingly.  If loop has no inner loops,
> we can just walk the body in RPO and propagate scales downwards and we
> sould arrive to right result

That should work for alternate exits as well, no?

> I originally added the bound parameter to handle prologues/epilogues
> which gets new artificial bound.  In prologue I think you are right that
> the flow will be probably directed to the conditional counting
> iterations.

I suppose we'd need to scale both main and epilogue together since
the epilogue "steals" from the main loop counts.  Likewise if there's
a skip edge around the vector loop.  I think currently we simply
set the edge probability of those skip conds rather than basing
this off the niter values they work on.  Aka if (niter < VF) goto
epilogue; do {} while (niter / VF); epilogue: do {} while (niter);

There's also the cost model which might require niter > VF to enter
the main loop body.

> In epilogue we add no artificial iteration cap, so maybe it is more
> realistic to simply scale up probability of all exits?

Probably.

> To see what is going on I tried following testcase:
> 
> int a[99];
> test()
> {
>   for (int i = 0; i < 99; i++)
>       a[i]++;
> }
> 
> What surprises me is that vectorizer at -O2 does nothing and we end up
> unrolling the loop:
> 
> L2:
>         addl    $1, (%rax)
>         addl    $1, 4(%rax)
>         addl    $1, 8(%rax)
>         addq    $12, %rax
>         cmpq    $a+396, %rax
> 
> Which seems sily thing to do. Vectorized loop with epilogue doing 2 and
> 1 addition would be better.
> 
> With -O3 we vectorize it:
> 
> 
> .L2:
>         movdqa  (%rax), %xmm0
>         addq    $16, %rax
>         paddd   %xmm1, %xmm0
>         movaps  %xmm0, -16(%rax)
>         cmpq    %rax, %rdx
>         jne     .L2
>         movq    a+384(%rip), %xmm0
>         addl    $1, a+392(%rip)
>         movq    .LC1(%rip), %xmm1
>         paddd   %xmm1, %xmm0
>         movq    %xmm0, a+384(%rip)

The -O2 cost model doesn't want to do epilogues:

  /* If using the "very cheap" model. reject cases in which we'd keep
     a copy of the scalar code (even if we might be able to vectorize it).  */
  if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
      && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
          || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
          || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                         "some scalar iterations would need to be peeled\n");
      return 0;
    }

it's because of the code size increase.

> and correctly drop vectorized loop body to 24 iterations. However the
> epilogue has loop for vector size 2 predicted to iterate once (it won't)
> 
> ;;   basic block 7, loop depth 0, count 10737416 (estimated locally), maybe hot 
> ;;    prev block 5, next block 8, flags: (NEW, VISITED)                         
> ;;    pred:       3 [4.0% (adjusted)]  count:10737416 (estimated locally) (FALSE_VALUE,EXECUTABLE)
> ;;    succ:       8 [always]  count:10737416 (estimated locally) (FALLTHRU,EXECUTABLE)
>                                                                                 
> ;;   basic block 8, loop depth 1, count 21474835 (estimated locally), maybe hot 
> ;;    prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED)              
> ;;    pred:       9 [always]  count:10737417 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;                7 [always]  count:10737416 (estimated locally) (FALLTHRU,EXECUTABLE)
>   # i_9 = PHI <i_17(9), 96(7)>                                                  
>   # ivtmp_13 = PHI <ivtmp_18(9), 3(7)>                                          
>   # vectp_a.14_40 = PHI <vectp_a.14_41(9), &MEM <int[99]> [(void *)&a + 384B](7)>
>   # vectp_a.18_46 = PHI <vectp_a.18_47(9), &MEM <int[99]> [(void *)&a + 384B](7)>
>   # ivtmp_49 = PHI <ivtmp_50(9), 0(7)>                                          
>   vect__14.16_42 = MEM <vector(2) int> [(int *)vectp_a.14_40];                  
>   _14 = a[i_9];                                                                 
>   vect__15.17_44 = vect__14.16_42 + { 1, 1 };                                   
>   _15 = _14 + 1;                                                                
>   MEM <vector(2) int> [(int *)vectp_a.18_46] = vect__15.17_44;                  
>   i_17 = i_9 + 1;                                                               
>   ivtmp_18 = ivtmp_13 - 1;                                                      
>   vectp_a.14_41 = vectp_a.14_40 + 8;                                            
>   vectp_a.18_47 = vectp_a.18_46 + 8;                                            
>   ivtmp_50 = ivtmp_49 + 1;                                                      
>   if (ivtmp_50 < 1)                                                             
>     goto <bb 9>; [50.00%]                                                       
>   else                                                                          
>     goto <bb 12>; [50.00%]                                                      
> 
> and finally the scalar copy
> 
> ;;   basic block 12, loop depth 0, count 10737416 (estimated locally), maybe hot
> ;;    prev block 9, next block 13, flags: (NEW, VISITED)                        
> ;;    pred:       8 [50.0% (adjusted)]  count:10737418 (estimated locally) (FALSE_VALUE,EXECUTABLE)
> ;;    succ:       13 [always]  count:10737416 (estimated locally) (FALLTHRU)    
>                                                                                 
> ;;   basic block 13, loop depth 1, count 1063004409 (estimated locally), maybe hot
> ;;    prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)            
> ;;    pred:       14 [always]  count:1052266996 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
> ;;                12 [always]  count:10737416 (estimated locally) (FALLTHRU)    
>   # i_30 = PHI <i_36(14), 98(12)>                                               
>   # ivtmp_32 = PHI <ivtmp_37(14), 1(12)>                                        
>   _33 = a[i_30];                                                                
>   _34 = _33 + 1;                                                                
>   a[i_30] = _34;                                                                
>   i_36 = i_30 + 1;                                                              
>   ivtmp_37 = ivtmp_32 - 1;                                                      
>   if (ivtmp_37 != 0)                                                            
>     goto <bb 14>; [98.99%]                                                      
>   else                                                                          
>     goto <bb 4>; [1.01%]                                                        
> 
> With also small but non-zero iteration probability.   This is papered
> over by my yesterday patch. But it seems to me that it would be a lot
> better if vectorizer understood that the epilogue will be loopless and
> accounted it to the cost model that would probably make it easy to
> enable it at cheap costs too.

The epilogue will be "unrolled" later, I think, because we can correctly
compute that it won't iterate.
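For this testcase the split can be checked with a little arithmetic (a sketch; the main-loop VF of 4 is an assumption, while the vector(2) epilogue and the &a + 384B start offset are taken from the dump above):

```python
# Split the 99 scalar iterations between the main vector loop, the
# vectorized epilogue and the final scalar copy.
# main_vf = 4 is an assumed main-loop vectorization factor;
# epi_vf = 2 matches the vector(2) epilogue in the dump.
n = 99
main_vf, epi_vf = 4, 2

main_iters = n // main_vf               # iterations of the main vector loop
rem = n - main_iters * main_vf          # scalar iterations left over
epi_iters = rem // epi_vf               # iterations of the vector(2) epilogue
scalar_iters = rem - epi_iters * epi_vf # final scalar-copy iterations

print(main_iters, epi_iters, scalar_iters)  # 24 1 1
```

This matches the dump: the vector(2) epilogue starts at element 96 (byte offset 384) and executes exactly once (ivtmp_50 < 1), and the scalar copy handles the single remaining element at i = 98 — so once the iteration counts are known, both tails can indeed be simplified away.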

> Clang 16 at -O2 is much more aggressive by both vectorizing and unroling:
> 
> test:                                   # @test
>         .cfi_startproc
> # %bb.0:
>         movdqa  a(%rip), %xmm1
>         pcmpeqd %xmm0, %xmm0
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a(%rip)
>         movdqa  a+16(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+16(%rip)
>         movdqa  a+32(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+32(%rip)
>         movdqa  a+48(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+48(%rip)
>         movdqa  a+64(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+64(%rip)
>         movdqa  a+80(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+80(%rip)
>         movdqa  a+96(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+96(%rip)
>         movdqa  a+112(%rip), %xmm1
>         psubd   %xmm0, %xmm1
> ....
>         movdqa  %xmm1, a+240(%rip)
>         movdqa  a+256(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+256(%rip)
>         movdqa  a+272(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+272(%rip)
>         movdqa  a+288(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+288(%rip)
>         movdqa  a+304(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+304(%rip)
>         movdqa  a+320(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+320(%rip)
>         movdqa  a+336(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+336(%rip)
>         movdqa  a+352(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+352(%rip)
>         movdqa  a+368(%rip), %xmm1
>         psubd   %xmm0, %xmm1
>         movdqa  %xmm1, a+368(%rip)
>         addl    $1, a+384(%rip)
>         addl    $1, a+388(%rip)
>         addl    $1, a+392(%rip)
>         retq

That's clearly much larger code.  On x86 we're also fighting with
large instruction encodings here, in particular EVEX for AVX512 is
"bad" here.  We hardly get more than two instructions decoded per
cycle due to their size.

Richard.
Jan Hubicka July 10, 2023, 8:33 a.m. UTC | #9
Hi,
over the weekend I found that the vectorizer is missing scale_loop_profile for
epilogues.  It already adjusts loop_info to set max iterations, so
adding it was easy.  However it now predicts the first loop to iterate at
most once (which is too much; I suppose it forgets to divide by the epilogue
unrolling factor) and the second never.
> 
> The -O2 cost model doesn't want to do epilogues:
> 
>   /* If using the "very cheap" model. reject cases in which we'd keep
>      a copy of the scalar code (even if we might be able to vectorize it).  
> */
>   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
>       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
>           || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
>           || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
>     {
>       if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>                          "some scalar iterations would need to be 
> peeled\n");
>       return 0;
>     }
> 
> it's because of the code size increase.

I know, however -O2 is not -Os, and here the tradeoff of
performance versus code size seems a lot better than for other code-expanding
things we do at -O2 (such as unrolling 3 times).
I think we set the very cheap cost model very conservatively in order to
get -ftree-vectorize enabled with -O2, and there is some room for finding
the right balance.

I get:

jan@localhost:~> cat t.c
int a[99];
__attribute((noipa, weak))
void
test()
{
        for (int i = 0 ; i < 99; i++)
                a[i]++;
}
void
main()
{
        for (int j = 0; j < 10000000; j++)
                test();
}
jan@localhost:~> gcc -O2 t.c -fno-unroll-loops ; time ./a.out

real    0m0.529s
user    0m0.528s
sys     0m0.000s

jan@localhost:~> gcc -O2 t.c ; time ./a.out

real    0m0.427s
user    0m0.426s
sys     0m0.000s
jan@localhost:~> gcc -O3 t.c ; time ./a.out

real    0m0.136s
user    0m0.135s
sys     0m0.000s
jan@localhost:~> clang -O2 t.c ; time ./a.out
<warnings>

real    0m0.116s
user    0m0.116s
sys     0m0.000s

Code size (of function test):
 gcc -O2 -fno-unroll-loops 17  bytes
 gcc -O2                   29  bytes
 gcc -O3                   50  bytes
 clang -O2                 510 bytes

So unrolling is 70% code size growth for a 23% speedup.
Vectorizing is 294% code size growth for a 388% speedup.
Clang does 3000% code size growth for a 456% speedup.
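A quick recomputation of the tradeoffs from the sizes and timings quoted above (the larger percentages read most consistently as raw size/time ratios times 100, with the -fno-unroll-loops build as the baseline):

```python
# Re-derive the size/speed tradeoffs from the measurements above.
# Baseline: gcc -O2 -fno-unroll-loops (17 bytes, 0.529 s for 10M calls).
base_size, base_time = 17, 0.529

variants = {
    "gcc -O2":   (29, 0.427),
    "gcc -O3":   (50, 0.136),
    "clang -O2": (510, 0.116),
}

for name, (size, time) in variants.items():
    size_ratio = size / base_size   # code size relative to the 17-byte baseline
    speedup = base_time / time      # runtime ratio vs. the 0.529 s baseline
    print(f"{name}: {size_ratio:.2f}x size, {speedup:.2f}x speed")
```

This prints roughly 1.71x/1.24x for plain -O2, 2.94x/3.89x for -O3 and 30x/4.56x for clang, matching the figures in the message.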
> 
> That's clearly much larger code.  On x86 we're also fighting with
> large instruction encodings here, in particular EVEX for AVX512 is
> "bad" here.  We hardly get more than two instructions decoded per
> cycle due to their size.

Agreed, I found it surprising that clang does that much complete unrolling
at -O2.  However vectorizing without unrolling here seems like it may be
a better default for -O2 than what we do currently...

Honza
> 
> Richard.
Jan Hubicka July 10, 2023, 9:23 a.m. UTC | #10
> On Fri, 7 Jul 2023, Jan Hubicka wrote:
> 
> > > 
> > > Looks good, but I wonder what we can do to at least make the
> > > multiple exit case behave reasonably?  The vectorizer keeps track
> > 
> > > of a "canonical" exit, would it be possible to pass in the main
> > > exit edge and use that instead of single_exit (), would other
> > > exits then behave somewhat reasonable or would we totally screw
> > > things up here?  That is, the "canonical" exit would be the
> > > counting exit while the other exits are on data driven conditions
> > > and thus wouldn't change probability when we reduce the number
> > > of iterations(?)
> > 
> > I can add canonical_exit parameter and make the function to direct flow
> > to it if possible.  However overall I think fixup depends on what
> > transformation led to the change.
> 
> I think the vectorizer knows there's a single counting IV and all
> other exits are dependent on data processed, so the scaling the
> vectorizer just changes the counting IV.  So I think it makes
> sense to pass that exit to the function in all cases.

It really seems to me that a vectorized loop is like N loops happening
in parallel, so the probabilities of alternative exits grow as well.
But the canonical exit is the right thing to do for prologues - here we really
add extra conditions to the iteration-counting exit.
> 
> > Assuming that vectorizer did no prologues and epilogues and we
> > vectorized with factor N, then I think the update could be done more
> > specifically as follows.
> > 
> > We know that header block count dropped by 4. So we can start from that
> > and each time we reach basic block with exit edge, we know the original
> > count of the edge.  This count is unchanged, so one can rescale
> > probabilities out of that BB accordingly.  If loop has no inner loops,
> > we can just walk the body in RPO and propagate scales downwards and we
> > should arrive at the right result
> 
> That should work for alternate exits as well, no?
Yes, I think it could mostly work for acyclic bodies.  I ended up
implementing a special case of this for loop-ch in order to handle
loop-invariant conditionals correctly.  Will send a patch after some
cleanups.  (There seem to be more loop-invariant conditionals in real
code than I would have thought.)

Tampering only with loop exit probabilities is not always enough.
If you have:
  while (1)
    if (test1)
      {
        if (test2)
	  break;
      }
increasing the count of the exit may require increasing the probability of the outer
conditional.  Do we support this in vectorization at all, and if so, do
we know something here?
For example, test1 is triggered if test1 is true in any one of the
iterations packed together, so its probability also increases by the
vectorization factor.
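The "probability increases by the vectorization factor" effect can be made concrete: if a data-dependent exit fires with probability p in each scalar iteration, a vector iteration packing VF scalar iterations exits with probability 1 - (1 - p)^VF (a sketch under the usual assumption that the packed iterations trigger the exit independently):

```python
# Probability that an early exit taken with per-scalar-iteration
# probability p fires in a vector iteration covering vf scalar
# iterations, assuming independent triggers.
def vector_exit_prob(p, vf):
    return 1.0 - (1.0 - p) ** vf

print(vector_exit_prob(0.01, 4))   # ~0.0394: close to, but below, 4 * 0.01
print(vector_exit_prob(0.3, 4))    # ~0.76: far from the naive 4 * 0.3
```

For small p this is approximately VF * p, matching the intuition above; for larger p the naive scaling would exceed 1, so the exit probability saturates instead.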

We run into this in peeling, i.e. when we prove that test1 will trigger
undefined behaviour after one or two iterations but the original
estimated profile believes in a higher iteration count.  I added a special
case for this yesterday to avoid turning if (test2) to 100% in this case,
as that triggers strange codegen in some of the Fortran testcases.

We also can have
  while (1)
    while (test1)
      {
        if (test2)
	  break;
      }
which is harder because changing the probability of test2 affects the number
of iterations of the inner loop.  So I am giving up on this.
I think currently it happens mostly with unlooping.
> 
> > I originally added the bound parameter to handle prologues/epilogues
> > which gets new artificial bound.  In prologue I think you are right that
> > the flow will be probably directed to the conditional counting
> > iterations.
> 
> I suppose we'd need to scale both main and epilogue together since
> the epilogue "steals" from the main loop counts.  Likewise if there's
> a skip edge around the vector loop.  I think currently we simply
> set the edge probability of those skip conds rather than basing
> this off the niter values they work on.  Aka if (niter < VF) goto
> epilogue; do {} while (niter / VF); epilogue: do {} while (niter);
> 
> There's also the cost model which might require niter > VF to enter
> the main loop body.

I think I mostly understand this since we were playing with it with Ondra's
histograms (which can be used to get some of the unknowns in the
transformation right).  The unknowns (how many times we end up jumping to the
epilogue, for instance) probably can't be reasonably well guessed if we
do not know the loop histogram, which currently we know only if we prove
that the loop has a constant number of iterations.  So I am trying to get
at least this case right first.

The theoretically correct approach would be to first determine the entry counts
of the prologue and epilogue, then produce what we believe to be the correct
profile of those and subtract it from the main loop profile, also updating
probabilities in basic blocks where we made nontrivial changes while
updating prologues/epilogues.  Finally, scale down the main loop profile and
increase the exit probabilities.
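Under the simplifying assumption that the loop has a constant trip count (the one case where, as noted above, the histogram is effectively known), that entry-count bookkeeping can be sketched numerically (illustrative counts only, no prologue):

```python
# Illustrative profile split for a loop entered `entries` times that
# always runs `n` scalar iterations, vectorized with factor `vf`.
entries, n, vf = 1000, 99, 4

scalar_header = entries * n                 # original header count
main_header = entries * (n // vf)           # vector-loop body executions
epilogue_entry = entries if n % vf else 0   # how often the epilogue is entered
epilogue_header = entries * (n % vf)        # scalar epilogue body executions

# Sanity check: every scalar iteration is accounted for exactly once.
assert main_header * vf + epilogue_header == scalar_header
print(main_header, epilogue_entry, epilogue_header)  # 24000 1000 3000
```

With an unknown trip-count distribution the epilogue entry count and per-exit probabilities would need the histogram, which is exactly the missing piece discussed above.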

Honza
Richard Biener July 10, 2023, 9:24 a.m. UTC | #11
On Mon, 10 Jul 2023, Jan Hubicka wrote:

> Hi,
> over weekend I found that vectorizer is missing scale_loop_profile for
> epilogues.  It already adjusts loop_info to set max iterations, so
> adding it was easy. However now predicts the first loop to iterate at
> most once (which is too much, I suppose it forgets to divide by epilogue
> unrolling factor) and second never.
> > 
> > The -O2 cost model doesn't want to do epilogues:
> > 
> >   /* If using the "very cheap" model. reject cases in which we'd keep
> >      a copy of the scalar code (even if we might be able to vectorize it).  
> > */
> >   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> >       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> >           || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> >           || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> >     {
> >       if (dump_enabled_p ())
> >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> >                          "some scalar iterations would need to be 
> > peeled\n");
> >       return 0;
> >     }
> > 
> > it's because of the code size increase.
> 
> I know, however -O2 is not -Os and here the tradeoffs of
> performance/code size seems a lot better than other code expanding
> things we do at -O2 (such as the unrolling 3 times).
> I think we set the very cheap cost model very conservatively in order to
> get -ftree-vectorize enabled with -O2 and there is some room for finding
> right balance.
> 
> I get:
> 
> jan@localhost:~> cat t.c
> int a[99];
> __attribute((noipa, weak))
> void
> test()
> {
>         for (int i = 0 ; i < 99; i++)
>                 a[i]++;
> }
> void
> main()
> {
>         for (int j = 0; j < 10000000; j++)
>                 test();
> }
> jan@localhost:~> gcc -O2 t.c -fno-unroll-loops ; time ./a.out
> 
> real    0m0.529s
> user    0m0.528s
> sys     0m0.000s
> 
> jan@localhost:~> gcc -O2 t.c ; time ./a.out
> 
> real    0m0.427s
> user    0m0.426s
> sys     0m0.000s
> jan@localhost:~> gcc -O3 t.c ; time ./a.out
> 
> real    0m0.136s
> user    0m0.135s
> sys     0m0.000s
> jan@localhost:~> clang -O2 t.c ; time ./a.out
> <warnings>
> 
> real    0m0.116s
> user    0m0.116s
> sys     0m0.000s
> 
> Code size (of function test):
>  gcc -O2 -fno-unroll-loops 17  bytes
>  gcc -O2                   29  bytes
>  gcc -O3                   50  bytes
>  clang -O2                 510 bytes
> 
> So unrolling 70% code size growth for 23% speedup.
> Vectorizing is 294% code size growth for 388% speedup
> Clang does 3000% code size growth for 456% speedup
> > 
> > That's clearly much larger code.  On x86 we're also fighting with
> > large instruction encodings here, in particular EVEX for AVX512 is
> > "bad" here.  We hardly get more than two instructions decoded per
> > cycle due to their size.
> 
> Agreed, I found it surprising clang does that much of complete unrolling
> at -O2. However vectorizing and not unrolling here seems like it may be
> a better default for -O2 than what we do currently...

I was also playing with AVX512 fully masked loops here which avoids
the epilogue but due to the instruction encoding size that doesn't
usually win.  I agree that size isn't everything at least for -O2.

Richard.
Richard Biener July 10, 2023, 9:29 a.m. UTC | #12
On Mon, 10 Jul 2023, Jan Hubicka wrote:

> > On Fri, 7 Jul 2023, Jan Hubicka wrote:
> > 
> > > > 
> > > > Looks good, but I wonder what we can do to at least make the
> > > > multiple exit case behave reasonably?  The vectorizer keeps track
> > > 
> > > > of a "canonical" exit, would it be possible to pass in the main
> > > > exit edge and use that instead of single_exit (), would other
> > > > exits then behave somewhat reasonable or would we totally screw
> > > > things up here?  That is, the "canonical" exit would be the
> > > > counting exit while the other exits are on data driven conditions
> > > > and thus wouldn't change probability when we reduce the number
> > > > of iterations(?)
> > > 
> > > I can add canonical_exit parameter and make the function to direct flow
> > > to it if possible.  However overall I think fixup depends on what
> > > transformation led to the change.
> > 
> > I think the vectorizer knows there's a single counting IV and all
> > other exits are dependent on data processed, so the scaling the
> > vectorizer just changes the counting IV.  So I think it makes
> > sense to pass that exit to the function in all cases.
> 
> It really seems to me that vectorized loop is like N loops happening 
> in parallel, so the probabilities of alternative exits grows as well.
> But canonical exit is right thing to do for prologues - here we really
> add extra conditions to the iteration counting exit.
> > 
> > > Assuming that vectorizer did no prologues and epilogues and we
> > > vectorized with factor N, then I think the update could be done more
> > > specifically as follows.
> > > 
> > > We know that header block count dropped by 4. So we can start from that
> > > and each time we reach basic block with exit edge, we know the original
> > > count of the edge.  This count is unchanged, so one can rescale
> > > probabilities out of that BB accordingly.  If loop has no inner loops,
> > > we can just walk the body in RPO and propagate scales downwards and we
> > > should arrive at the right result
> > 
> > That should work for alternate exits as well, no?
> Yes, I think it could mostly work for acyclic bodies. I ended up
> implementing a special case of this for loop-ch in order to handle
> correctly loop invariant conditionals.  Will send patch after some
> cleanups. (There seems to be more loop invariant conditionals in real
> code than I would have thought)
> 
> Tampering only with loop exit probabilities is not always enough.
> If you have:
>   while (1)
>     if (test1)
>       {
>         if (test2)
> 	  break;
>       }
> increasing count of exit may require increasing probability of the outer
> conditional.   Do we support this in vectorization at all and if so, do
> we know something here?

Tamar would need to answer this, but without early-break vectorization
the if-conversion pass will flatten everything, and I think even early
breaks will in the end be a non-nested sequence of BBs with
exit conds at the end (or a loopback branch).

Note the (scalar) epilogue is copied from the original scalar loop
body so it doesn't see any if-conversion.

> For example if the test1 is triggered if test1 is true in one of
> iterations packed together, its probability also increases by
> vectorization factor.  
> 
> We run into this in peeling i.e. when we prove that test1 will trigger
> undefined behaviour after one or two iterations but the original
> estimated profile believes in higher iteration count.  I added special
> case for this yesterday to avoid turning if (test2) to 100% in this case
> as that triggers strange codegen in some of fortran testcases.
> 
> We also can have
>   while (1)
>     while (test1)
>       {
>         if (test2)
> 	  break;
>       }
> Which is harder because changing probability of test2 affects number
> of iterations of the inner loop.  So I am giving up on this.
> I think currently it happens mostly with unlooping.

What I saw most wrecking the profile is when passes turn
if (cond) into if (0/1), leaving the CFG adjustment to CFG cleanup,
which then simply deletes one of the outgoing edges without doing
anything to the (guessed) profile.

> > > I originally added the bound parameter to handle prologues/epilogues
> > > which gets new artificial bound.  In prologue I think you are right that
> > > the flow will be probably directed to the conditional counting
> > > iterations.
> > 
> > I suppose we'd need to scale both main and epilogue together since
> > the epilogue "steals" from the main loop counts.  Likewise if there's
> > a skip edge around the vector loop.  I think currently we simply
> > set the edge probability of those skip conds rather than basing
> > this off the niter values they work on.  Aka if (niter < VF) goto
> > epilogue; do {} while (niter / VF); epilogue: do {} while (niter);
> > 
> > There's also the cost model which might require niter > VF to enter
> > the main loop body.
> 
> I think I mostly understand this since we were playing with it with Ondra's
> histograms (that can be used to get some of the unknowns in the
> transformation right). The unknowns (how many times we end up jumping to
> epilogue, for instance, probably can't be reasonably well guessed if we
> do not know the loop histogram which currently we know only if we prove
> that loop has constant number of iterations.  So I am trying to get
> right at least this case first.
> 
> Theoretically correct approach would be to first determine entry counts
> of prologue and epilogue, then produce what we believe to be correct
> profile of those and subtract it from the main loop profile updating
> also probabilities in basic blocks where we did nontrivial changes while
> updating prologs/epilogs. Finally scale down the main loop profile and
> increase exit probabilities.
Jan Hubicka July 11, 2023, 9:28 a.m. UTC | #13
> 
> What I saw most wrecking the profile is when passes turn
> if (cond) into if (0/1) leaving the CFG adjustment to CFG cleanup
> which then simply deletes one of the outgoing edges without doing
> anything to the (guessed) profile.

Yep, I agree that this is disturbing.  At cfg cleanup time one can
hardly do anything useful, since the knowledge of the transform that caused
the profile inconsistency is forgotten.  However, I think it is not a complete
disaster.

With profile feedback the most common case of this happening is a
situation where we duplicated code (by inlining, unrolling etc.) into a
context where it behaves differently from the typical behaviour
represented by the profile.

So if one ends up zapping an edge with large probability, one also knows
that the code being optimized does not exhibit the typical behaviour from
the train run and thus is not very hot.  So the profile inconsistency should
not affect performance that much.

So doing nothing may IMO end up being safer than trying to get the
in/out counts right without really knowing what is going on.

This is mostly about the scenario "constant propagated this conditional
and profile disagrees with me".  There are other cases where the update is
IMO important, e.g. the vectorizer forgetting to cap the number of iterations
of the epilogue may cause issues, since the epilogue loop looks more frequent
than the main vectorized loop and that may cause IRA to insert spilling
into it.

When we duplicate we have a chance to figure out profile updates.
Also we may try to get as much as possible done early.
I think we should again do loop header copying that does not expand code
at early opts.  I have some more plans on cleaning up loop-ch and
then we can give it a try.

With a guessed profile we always have the option to re-do the propagation.
There is TODO_rebuild_frequencies for that, which we do after inlining.
This is mostly to handle possible overflows in large loop nests
constructed by the inliner.

We can re-propagate once again after late cleanup passes. Looking at the
queue, we have:

      NEXT_PASS (pass_remove_cgraph_callee_edges);
      /* Initial scalar cleanups before alias computation.
         They ensure memory accesses are not indirect wherever possible.  */
      NEXT_PASS (pass_strip_predict_hints, false /* early_p */);
      NEXT_PASS (pass_ccp, true /* nonzero_p */);
      /* After CCP we rewrite no longer addressed locals into SSA
         form if possible.  */
      NEXT_PASS (pass_object_sizes);
      NEXT_PASS (pass_post_ipa_warn);
      /* Must run before loop unrolling.  */
      NEXT_PASS (pass_warn_access, /*early=*/true);
      NEXT_PASS (pass_complete_unrolli);
^^^^ here we care about profile
      NEXT_PASS (pass_backprop);
      NEXT_PASS (pass_phiprop);
      NEXT_PASS (pass_forwprop);
      /* pass_build_alias is a dummy pass that ensures that we
         execute TODO_rebuild_alias at this point.  */
      NEXT_PASS (pass_build_alias);
      NEXT_PASS (pass_return_slot);
      NEXT_PASS (pass_fre, true /* may_iterate */);
      NEXT_PASS (pass_merge_phi);
      NEXT_PASS (pass_thread_jumps_full, /*first=*/true);
^^^^ here

By now we did CCP and FRE, so we have likely optimized out most of the
constant conditionals exposed by inlining.
Honza
Richard Biener July 11, 2023, 10:31 a.m. UTC | #14
On Tue, 11 Jul 2023, Jan Hubicka wrote:

> > 
> > What I saw most wrecking the profile is when passes turn
> > if (cond) into if (0/1) leaving the CFG adjustment to CFG cleanup
> > which then simply deletes one of the outgoing edges without doing
> > anything to the (guessed) profile.
> 
> Yep, I agree that this is disturbing.  At the cfg cleanup time one can
> hardly do anything useful, since the knowledge of transform that caused
> profile inconsistency is forgotten.  however I think it is not a complete
> disaster.
> 
> With profile feedback the most common case of this happening is a
> situation where we duplicated code (by inlining, unrolling etc.) into a
> context where it behaves differently from the typical behaviour
> represented by the profile.
> 
> So if one ends up zapping edge with large probability, one also knows
> that the code being optimized does not exhibit typical behaviour from
> the train run and thus is not very hot.  So profile inconsistency should
> not affect performance that much.
> 
> So doing nothing may IMO end up being safer than trying to get the
> in/out counts right without really knowing what is going on.
> 
> This is mostly about the scenario "constant propagated this conditional
> and profile disagrees with me".  There are other cases where update is
> IMO important.  i.e. vectorizer forgetting to cap #of iterations of
> epilogue may cause issue since the epilogue loop looks more frequent
> than the main vectorized loop and it may cause IRA to insert spilling
> into it or so.
> 
> When we duplicate we have chance to figure out profile updates.
> Also we may try to get as much as possible done early.
> I think we should again do loop header copying that does not expand code
> at early opts again.  I have some more plans on cleaning up loop-ch and
> then we can give it a try.
> 
> With guessed profile we always have option to re-do the propagation.
> There is TODO_rebuild_frequencies for that which we do after inlining.
> This is mostly to handle possible overflows on large loops nests
> constructed by inliner.  
> 
> We can re-propagate once again after late cleanup passes. Looking at the
> queue, we have:
> 
>       NEXT_PASS (pass_remove_cgraph_callee_edges);
>       /* Initial scalar cleanups before alias computation.
>          They ensure memory accesses are not indirect wherever possible.  */
>       NEXT_PASS (pass_strip_predict_hints, false /* early_p */);
>       NEXT_PASS (pass_ccp, true /* nonzero_p */);
>       /* After CCP we rewrite no longer addressed locals into SSA
>          form if possible.  */
>       NEXT_PASS (pass_object_sizes);
>       NEXT_PASS (pass_post_ipa_warn);
>       /* Must run before loop unrolling.  */
>       NEXT_PASS (pass_warn_access, /*early=*/true);
>       NEXT_PASS (pass_complete_unrolli);
> ^^^^ here we care about profile
>       NEXT_PASS (pass_backprop);
>       NEXT_PASS (pass_phiprop);
>       NEXT_PASS (pass_forwprop);
>       /* pass_build_alias is a dummy pass that ensures that we
>          execute TODO_rebuild_alias at this point.  */
>       NEXT_PASS (pass_build_alias);
>       NEXT_PASS (pass_return_slot);
>       NEXT_PASS (pass_fre, true /* may_iterate */);
>       NEXT_PASS (pass_merge_phi);
>       NEXT_PASS (pass_thread_jumps_full, /*first=*/true);
> ^^^^ here
> 
> By now we did CCP and FRE so we likely optimized out most of constant
> conditionals exposed by inline.

So maybe we should simply delay re-propagation of the profile?  I
think cunrolli doesn't so much care about the profile - cunrolli
is (was) about abstraction removal.  Jump threading should be
the first pass to care.

Richard.
Jan Hubicka July 11, 2023, 12:40 p.m. UTC | #15
> > By now we did CCP and FRE so we likely optimized out most of constant
> > conditionals exposed by inline.
> 
> So maybe we should simply delay re-propagation of the profile?  I
> think cunrolli doesn't so much care about the profile - cunrolli
> is (was) about abstraction removal.  Jump threading should be
> the first pass to care.

That is what I was thinking too.  After inlining the profile counts may
be in quite bad shape.  If you inline together loops like in exchange, which
has a large loop nest, we will definitely end up capping counts to avoid
overflow.

cunrolli does:

 ret = tree_unroll_loops_completely (optimize >= 3, false);

which sets may_increase_size to true for -O3 and then

 may_increase_size && optimize_loop_nest_for_speed_p (loop)

which seems a reasonable guard, and it may get random answers on a capped
profile.  It is not a big deal to try propagating before cunrolli and then
again before threading and see how much potential this idea has.
I guess I should also double check that the other passes are indeed
safe, but I think it is quite obvious they should be.

Honza
> 
> Richard.
Richard Biener July 11, 2023, 1:04 p.m. UTC | #16
On Tue, 11 Jul 2023, Jan Hubicka wrote:

> > > By now we did CCP and FRE so we likely optimized out most of constant
> > > conditionals exposed by inline.
> > 
> > So maybe we should simply delay re-propagation of the profile?  I
> > think cunrolli doesn't so much care about the profile - cunrolli
> > is (was) about abstraction removal.  Jump threading should be
> > the first pass to care.
> 
> That is what I was thinking too.  After inlining the profile counts may
> be in quite bad shape. If you inline together loop like in exchange that
> has large loop nest, we will definitely end up capping counts to avoid
> overflow.
> 
> cunrolli does:
> 
>  ret = tree_unroll_loops_completely (optimize >= 3, false);

Ah, yeah - that used to be false, false ...

> which sets may_increase_size to true for -O3 and then
> 
>  may_increase_size && optimize_loop_nest_for_speed_p (loop)
> 
> which seems reasonable guard and it may get random answers on capped
> profile.  It is not big deal to try propagating before cunrolli and then
> again before threading and see how much potential this idea has.
> I guess I should also double check that the other passes are indeed
> safe, but I think it is quite obvious they should be.

Yeah.

Richard.
diff mbox series

Patch

--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -501,7 +501,7 @@  scale_loop_frequencies (class loop *loop, profile_probability p)
 /* Scale profile in LOOP by P.
    If ITERATION_BOUND is non-zero, scale even further if loop is predicted
    to iterate too many times.
-   Before caling this function, preheader block profile should be already
+   Before calling this function, preheader block profile should be already
    scaled to final count.  This is necessary because loop iterations are
    determined by comparing header edge count to latch ege count and thus
    they need to be scaled synchronously.  */
@@ -597,14 +597,14 @@  scale_loop_profile (class loop *loop, profile_probability p,
       /* If latch exists, change its count, since we changed
 	 probability of exit.  Theoretically we should update everything from
 	 source of exit edge to latch, but for vectorizer this is enough.  */
-      if (loop->latch && loop->latch != e->src)
+      if (e && loop->latch && loop->latch != e->src)
 	loop->latch->count += count_delta;
 
       /* Scale the probabilities.  */
       scale_loop_frequencies (loop, p);
 
       /* Change latch's count back.  */
-      if (loop->latch && loop->latch != e->src)
+      if (e && loop->latch && loop->latch != e->src)
 	loop->latch->count -= count_delta;
 
       if (dump_file && (dump_flags & TDF_DETAILS))