Message ID | ZJw5BcegL6zob2rC@arm.com
---|---
State | New
Series | Support early break/return auto-vectorization
On Wed, 28 Jun 2023, Tamar Christina wrote:

> Hi All,
>
> There's an existing bug in loop frequency scaling where the if statement
> checks to see if there's a single exit, records a dump file note, but then
> continues.
>
> It then tries to access the null pointer, which of course fails.
>
> For multiple loop exits it's not really clear how to scale the exit
> probabilities as it's really unknown which exit is most probable.
>
> For that reason I ignore the exit edges during scaling but still adjust the
> loop body.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>
> Ok for master?

I can't really make sense of

  /* If latch exists, change its count, since we changed
     probability of exit.  Theoretically we should update everything from
     source of exit edge to latch, but for vectorizer this is enough.  */
  if (loop->latch && loop->latch != e->src)
    loop->latch->count += count_delta;

since with simple latches the latch itself is an empty forwarder and
e->src is the block with the conditional eventually exiting the block.
That means this condition is always true.

So I think for exits the idea is to "remove" them by redirecting them
"count-wise" back into the loop.  So turn

  if (cond) --(exit-count)-- exit
      |
      | in-loop-count
      |
    latch

into

  [cond-blk-count]
  if (cond) -- (zero count) -- exit
      |
      | in-loop-count + exit-count (== cond-blk-count)
      |
    latch (now with cond-blk-count)

and the comment correctly suggests all blocks following from here
would need similar adjustment (and on in-loop branches the delta would be
distributed according to probabilities).
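Richard's before/after diagram boils down to simple count arithmetic. Below is a minimal toy model of it in C++ (plain integers standing in for GCC's profile_count machinery; the struct and function names are invented for illustration only):

```cpp
#include <cassert>

// Toy model of the "redirect the exit count-wise back into the loop" idea:
// the exit edge's count is folded into the in-loop edge, so the latch ends
// up carrying the conditional block's full count and the exit drops to zero.
struct exit_redirect
{
  long cond_blk_count; // count of the block holding the exit conditional
  long exit_count;     // count leaving the loop through the exit edge
  long in_loop_count;  // count continuing toward the latch
};

// Latch count after the count-wise redirection described above.
long redirected_latch_count (const exit_redirect &r)
{
  // In a consistent profile, in-loop-count + exit-count == cond-blk-count.
  assert (r.in_loop_count + r.exit_count == r.cond_blk_count);
  return r.in_loop_count + r.exit_count;
}
```

For example, with a conditional block executed 100 times, an exit count of 10 and an in-loop count of 90, the latch ends up with the full 100.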
Given the code is quite imperfect I would suggest to change the
updating of the latch block count to read

  profile_count count_delta = profile_count::zero ();
  if (loop->latch
      && single_pred_p (loop->latch)
      && loop_exits_from_bb_p (single_pred (loop->latch)))
    {
      count_delta = single_pred (loop->latch)->count - loop->latch->count;
      loop->latch->count = single_pred (loop->latch)->count;
    }

  scale_loop_frequencies (loop, p);

  if (count_delta != 0)
    loop->latch->count -= count_delta;

which should exactly preserve the exit-before-latch behavior independent
of the number of exits of the loop.

Please leave Honza a chance to comment here.

Thanks,
Richard.

> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	* cfgloopmanip.cc (scale_loop_frequencies): Fix typo.
> 	(scale_loop_profile): Don't access null pointer.
>
> --- inline copy of patch --
> diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
> index 6e09dcbb0b1864bc64ffd570a4b923f50c3819b5..b10ef3d2be82902ccd74e52a4318217b2db13bcb 100644
> --- a/gcc/cfgloopmanip.cc
> +++ b/gcc/cfgloopmanip.cc
> @@ -501,7 +501,7 @@ scale_loop_frequencies (class loop *loop, profile_probability p)
>  /* Scale profile in LOOP by P.
>     If ITERATION_BOUND is non-zero, scale even further if loop is predicted
>     to iterate too many times.
> -   Before caling this function, preheader block profile should be already
> +   Before calling this function, preheader block profile should be already
>     scaled to final count.  This is necessary because loop iterations are
>     determined by comparing header edge count to latch ege count and thus
>     they need to be scaled synchronously.  */
> @@ -597,14 +597,14 @@ scale_loop_profile (class loop *loop, profile_probability p,
>        /* If latch exists, change its count, since we changed
>  	 probability of exit.  Theoretically we should update everything from
>  	 source of exit edge to latch, but for vectorizer this is enough.  */
> -      if (loop->latch && loop->latch != e->src)
> +      if (e && loop->latch && loop->latch != e->src)
>  	loop->latch->count += count_delta;
>
>        /* Scale the probabilities.  */
>        scale_loop_frequencies (loop, p);
>
>        /* Change latch's count back.  */
> -      if (loop->latch && loop->latch != e->src)
> +      if (e && loop->latch && loop->latch != e->src)
>  	loop->latch->count -= count_delta;
>
>        if (dump_file && (dump_flags & TDF_DETAILS))
> On Wed, 28 Jun 2023, Tamar Christina wrote:
>
> > Hi All,
> >
> > There's an existing bug in loop frequency scaling where the if statement
> > checks to see if there's a single exit, and records a dump file note but
> > then continues.
> >
> > It then tries to access the null pointer, which of course fails.
> >
> > For multiple loop exits it's not really clear how to scale the exit
> > probabilities as it's really unknown which exit is most probable.
> >
> > For that reason I ignore the exit edges during scaling but still adjust
> > the loop body.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
>
> I can't really make sense of
>
>   /* If latch exists, change its count, since we changed
>      probability of exit.  Theoretically we should update everything from
>      source of exit edge to latch, but for vectorizer this is enough.  */
>   if (loop->latch && loop->latch != e->src)
>     loop->latch->count += count_delta;
>
> since with simple latches the latch itself is an empty forwarder and
> e->src is the block with the conditional eventually exiting the block.
> That means this condition is always true.
>
> So I think for exits the idea is to "remove" them by redirecting
> them "count-wise" back into the loop.  So turn
>
>   if (cond) --(exit-count)-- exit
>       |
>       | in-loop-count
>       |
>     latch
>
> into
>
>   [cond-blk-count]
>   if (cond) -- (zero count) -- exit
>       |
>       | in-loop-count + exit-count (== cond-blk-count)
>       |
>     latch (now with cond-blk-count)

This is the opposite situation.  You have a loop predicted to iterate 10
times, but you found it actually iterates at most twice.  So you want to
1) scale down the profile of every BB in the loop so the header is
   2*sum_of_counts_from_entry_edges instead of 10*
2) reduce the probability of the loopback edge and instead increase the
   probability of the exit.
The code attempts to get right only the case where the loop has one exit,
and instead of

  if (cond) -- (original-wrong-exit-probability) -- exit

it does

  if (cond) -- (exit-probability=1/#iterations) -- exit

Now it should adjust the in-loop count for every path from the source of
the exit to the latch edge.  It just assumes that there is one basic
block that is the latch and does it there.

I was just looking into using this for the profile update when loop-ch
or complete unrolling proves that a loop iterates fewer times than the
profile says.  I can clean up the function - it was originally written
for the old representation of probabilities and counts and I did not do
a very good job of updating it to the new code.

Honza

> and the comment correctly suggests all blocks following from here
> would need similar adjustment (and on in-loop branches the delta would be
> distributed according to probabilities).
>
> Given the code is quite imperfect I would suggest to change the
> updating of the latch block count to read
>
>   profile_count count_delta = profile_count::zero ();
>   if (loop->latch
>       && single_pred_p (loop->latch)
>       && loop_exits_from_bb_p (single_pred (loop->latch)))
>     {
>       count_delta = single_pred (loop->latch)->count - loop->latch->count;
>       loop->latch->count = single_pred (loop->latch)->count;
>     }
>
>   scale_loop_frequencies (loop, p);
>
>   if (count_delta != 0)
>     loop->latch->count -= count_delta;
>
> which should exactly preserve the exit-before-latch behavior independent
> of the number of exits of the loop.
>
> Please leave Honza a chance to comment here.
>
> Thanks,
> Richard.
>
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> > 	* cfgloopmanip.cc (scale_loop_frequencies): Fix typo.
> > 	(scale_loop_profile): Don't access null pointer.
> > > > --- inline copy of patch -- > > diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc > > index 6e09dcbb0b1864bc64ffd570a4b923f50c3819b5..b10ef3d2be82902ccd74e52a4318217b2db13bcb 100644 > > --- a/gcc/cfgloopmanip.cc > > +++ b/gcc/cfgloopmanip.cc > > @@ -501,7 +501,7 @@ scale_loop_frequencies (class loop *loop, profile_probability p) > > /* Scale profile in LOOP by P. > > If ITERATION_BOUND is non-zero, scale even further if loop is predicted > > to iterate too many times. > > - Before caling this function, preheader block profile should be already > > + Before calling this function, preheader block profile should be already > > scaled to final count. This is necessary because loop iterations are > > determined by comparing header edge count to latch ege count and thus > > they need to be scaled synchronously. */ > > @@ -597,14 +597,14 @@ scale_loop_profile (class loop *loop, profile_probability p, > > /* If latch exists, change its count, since we changed > > probability of exit. Theoretically we should update everything from > > source of exit edge to latch, but for vectorizer this is enough. */ > > - if (loop->latch && loop->latch != e->src) > > + if (e && loop->latch && loop->latch != e->src) > > loop->latch->count += count_delta; > > > > /* Scale the probabilities. */ > > scale_loop_frequencies (loop, p); > > > > /* Change latch's count back. */ > > - if (loop->latch && loop->latch != e->src) > > + if (e && loop->latch && loop->latch != e->src) > > loop->latch->count -= count_delta; > > > > if (dump_file && (dump_flags & TDF_DETAILS)) > > > > > > > > > > > > -- > Richard Biener <rguenther@suse.de> > SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, > Germany; GF: Ivo Totev, Andrew Myers, Andrew McDonald, Boudien Moerman; > HRB 36809 (AG Nuernberg)
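Honza's two-step description (scale the body down, then raise the exit probability so each invocation still leaves the loop once) can be sketched numerically. This is a toy model with plain doubles, not GCC's profile_count/profile_probability API; all names here are invented for illustration:

```cpp
#include <cassert>

// entry: sum of counts on edges entering the loop header from outside.
// header/body: block counts inside the loop; exit_prob: exit probability.
struct loop_profile { double entry, header, body, exit_prob; };

// Cap the predicted iteration count at BOUND: scale every in-loop count so
// the header executes entry * bound times, then set the exit probability so
// the exit edge's absolute count stays equal to entry (one exit per entry).
void cap_iterations (loop_profile &lp, double bound)
{
  double new_header = lp.entry * bound;
  double scale = new_header / lp.header; // uniform scale for all loop BBs
  lp.header = new_header;
  lp.body *= scale;
  lp.exit_prob = lp.entry / lp.header;   // == 1 / bound for this toy model
}
```

For example, a loop entered 16 times with header count 128 (8 predicted iterations) capped at 2 iterations ends up with header count 32 and exit probability 1/2.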
Hi,
original scale_loop_profile was implemented to only handle very simple
loops produced by the vectorizer at that time (basically loops with only
one exit and no subloops).  It also has not been updated to the new
profile-count API very carefully.  Since I want to use it from loop
peeling and unlooping, I need the function to at least not make the
profile worse on general loops.

The function does two things:
1) scales down the loop profile by a given probability.
   This is useful, for example, to scale down the profile after peeling
   when the loop body is executed less often than before.
2) after scaling is done, and if the profile indicates too large an
   iteration count, update the profile to cap the iteration count by the
   ITERATION_BOUND parameter.

Step 1 is easy and unchanged.

I changed ITERATION_BOUND to be the actual bound on the number of
iterations as used elsewhere (i.e. the number of executions of the latch
edge) rather than the number of iterations + 1 as it was before.

To do 2) one needs to do the following:
a) scale the own loop profile so the frequency of the header is at most
   the sum of in-edge counts * (iteration_bound + 1)
b) update loop exit probabilities so their count is the same as before
   scaling
c) reduce frequencies of basic blocks after the loop exit

The old code did b) by setting the probability to 1 / iteration_bound,
which is correct only if the basic block containing the exit executes
precisely once per iteration (it is not inside another conditional or an
inner loop).  This is fixed now by using
set_edge_probability_and_rescale_others.

Also c) was implemented only for the special case when the exit was just
before the latch basic block.  I now use dominance info to get some of
the additional cases right.

I still did not try to do anything for multiple exit loops, though the
implementation could be generalized.

Bootstrapped/regtested x86_64-linux.  I plan to commit it tonight if
there are no complaints.

gcc/ChangeLog:

	* cfgloopmanip.cc (scale_loop_profile): Rewrite exit edge
	probability update to be safe on loops with subloops.
	Make bound parameter to be iteration bound.
	* tree-ssa-loop-ivcanon.cc (try_peel_loop): Update call
	of scale_loop_profile.
	* tree-vect-loop-manip.cc (vect_do_peeling): Likewise.

diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc
index 6e09dcbb0b1..524b979a546 100644
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -499,7 +499,7 @@ scale_loop_frequencies (class loop *loop, profile_probability p)
 }
 
 /* Scale profile in LOOP by P.
-   If ITERATION_BOUND is non-zero, scale even further if loop is predicted
+   If ITERATION_BOUND is not -1, scale even further if loop is predicted
    to iterate too many times.
    Before caling this function, preheader block profile should be already
    scaled to final count.  This is necessary because loop iterations are
@@ -510,106 +510,123 @@ void
 scale_loop_profile (class loop *loop, profile_probability p,
 		    gcov_type iteration_bound)
 {
-  edge e, preheader_e;
-  edge_iterator ei;
-
-  if (dump_file && (dump_flags & TDF_DETAILS))
+  if (!(p == profile_probability::always ()))
     {
-      fprintf (dump_file, ";; Scaling loop %i with scale ",
-	       loop->num);
-      p.dump (dump_file);
-      fprintf (dump_file, " bounding iterations to %i\n",
-	       (int)iteration_bound);
-    }
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, ";; Scaling loop %i with scale ",
+		   loop->num);
+	  p.dump (dump_file);
+	  fprintf (dump_file, "\n");
+	}
 
-  /* Scale the probabilities.  */
-  scale_loop_frequencies (loop, p);
+      /* Scale the probabilities.  */
+      scale_loop_frequencies (loop, p);
+    }
 
-  if (iteration_bound == 0)
+  if (iteration_bound == -1)
     return;
 
   gcov_type iterations = expected_loop_iterations_unbounded (loop, NULL, true);
+  if (iterations == -1)
+    return;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
-      fprintf (dump_file, ";; guessed iterations after scaling %i\n",
-	       (int)iterations);
+      fprintf (dump_file,
+	       ";; guessed iterations of loop %i:%i new upper bound %i:\n",
+	       loop->num,
+	       (int)iterations,
+	       (int)iteration_bound);
     }
 
   /* See if loop is predicted to iterate too many times.  */
   if (iterations <= iteration_bound)
     return;
 
-  preheader_e = loop_preheader_edge (loop);
-
-  /* We could handle also loops without preheaders, but bounding is
-     currently used only by optimizers that have preheaders constructed.  */
-  gcc_checking_assert (preheader_e);
-  profile_count count_in = preheader_e->count ();
+  /* Compute number of invocations of the loop.  */
+  profile_count count_in = profile_count::zero ();
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, loop->header->preds)
+    count_in += e->count ();
 
-  if (count_in > profile_count::zero ()
-      && loop->header->count.initialized_p ())
+  /* Now scale the loop body so header count is
+     count_in * (iteration_bound + 1)  */
+  profile_probability scale_prob
+    = (count_in *= iteration_bound).probability_in (loop->header->count);
+  if (dump_file && (dump_flags & TDF_DETAILS))
     {
-      profile_count count_delta = profile_count::zero ();
-
-      e = single_exit (loop);
-      if (e)
-	{
-	  edge other_e;
-	  FOR_EACH_EDGE (other_e, ei, e->src->succs)
-	    if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE))
-		&& e != other_e)
-	      break;
-
-	  /* Probability of exit must be 1/iterations.  */
-	  count_delta = e->count ();
-	  e->probability = profile_probability::always () / iteration_bound;
-	  other_e->probability = e->probability.invert ();
-
-	  /* In code below we only handle the following two updates.  */
-	  if (other_e->dest != loop->header
-	      && other_e->dest != loop->latch
-	      && (dump_file && (dump_flags & TDF_DETAILS)))
-	    {
-	      fprintf (dump_file, ";; giving up on update of paths from "
-		       "exit condition to latch\n");
-	    }
-	}
+      fprintf (dump_file, ";; Scaling loop %i with scale ",
+	       loop->num);
+      p.dump (dump_file);
+      fprintf (dump_file, " to reach upper bound %i\n",
+	       (int)iteration_bound);
+    }
+  /* Finally attempt to fix exit edge probability.  */
+  auto_vec<edge> exits = get_loop_exit_edges (loop);
+  edge exit_edge = single_likely_exit (loop, exits);
+
+  /* In a consistent profile unadjusted_exit_count should be same as count_in,
+     however to preserve as much of the original info, avoid recomputing
+     it.  */
+  profile_count unadjusted_exit_count;
+  if (exit_edge)
+    unadjusted_exit_count = exit_edge->count ();
+  scale_loop_frequencies (loop, scale_prob);
+
+  if (exit_edge)
+    {
+      profile_count old_exit_count = exit_edge->count ();
+      profile_probability new_probability;
+      if (iteration_bound > 0)
+	new_probability
+	  = unadjusted_exit_count.probability_in (exit_edge->src->count);
       else
-	if (dump_file && (dump_flags & TDF_DETAILS))
-	  fprintf (dump_file, ";; Loop has multiple exit edges; "
-		   "giving up on exit condition update\n");
-
-      /* Roughly speaking we want to reduce the loop body profile by the
-	 difference of loop iterations.  We however can do better if
-	 we look at the actual profile, if it is available.  */
-      p = profile_probability::always ();
-
-      count_in *= iteration_bound;
-      p = count_in.probability_in (loop->header->count);
-      if (!(p > profile_probability::never ()))
-	p = profile_probability::very_unlikely ();
-
-      if (p == profile_probability::always ()
-	  || !p.initialized_p ())
-	return;
-
-      /* If latch exists, change its count, since we changed
-	 probability of exit.  Theoretically we should update everything from
-	 source of exit edge to latch, but for vectorizer this is enough.  */
-      if (loop->latch && loop->latch != e->src)
-	loop->latch->count += count_delta;
-
-      /* Scale the probabilities.  */
-      scale_loop_frequencies (loop, p);
+	new_probability = profile_probability::always ();
+      set_edge_probability_and_rescale_others (exit_edge, new_probability);
+      profile_count new_exit_count = exit_edge->count ();
+
+      /* Rescale the remaining edge probabilities and see if there is only
+	 one.  */
+      edge other_edge = NULL;
+      bool found = false;
+      FOR_EACH_EDGE (e, ei, exit_edge->src->succs)
+	if (!(e->flags & EDGE_FAKE)
+	    && !(e->probability == profile_probability::never ())
+	    && !loop_exit_edge_p (loop, e))
+	  {
+	    if (found)
+	      {
+		other_edge = NULL;
+		break;
+	      }
+	    other_edge = e;
+	    found = true;
+	  }
+      /* If there is only loop latch after other edge,
+	 update its profile.  */
+      if (other_edge && other_edge->dest == loop->latch)
+	loop->latch->count -= new_exit_count - old_exit_count;
+      else
+	{
+	  basic_block *body = get_loop_body (loop);
+	  profile_count new_count = exit_edge->src->count - new_exit_count;
+	  profile_count old_count = exit_edge->src->count - old_exit_count;
 
-      /* Change latch's count back.  */
-      if (loop->latch && loop->latch != e->src)
-	loop->latch->count -= count_delta;
+	  for (unsigned int i = 0; i < loop->num_nodes; i++)
+	    if (body[i] != exit_edge->src
+		&& dominated_by_p (CDI_DOMINATORS, body[i], exit_edge->src))
+	      body[i]->count.apply_scale (new_count, old_count);
 
-      if (dump_file && (dump_flags & TDF_DETAILS))
-	fprintf (dump_file, ";; guessed iterations are now %i\n",
-		 (int)expected_loop_iterations_unbounded (loop, NULL, true));
+	  free (body);
+	}
+    }
+  else if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file,
	       ";; Loop has mulitple exits;"
	       " will leave exit probabilities inconsistent\n");
+    }
 }
diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc
index 491b57ec0f1..184c08eec75 100644
--- a/gcc/tree-ssa-loop-ivcanon.cc
+++ b/gcc/tree-ssa-loop-ivcanon.cc
@@ -1173,7 +1179,7 @@ try_peel_loop (class loop *loop,
     }
   profile_probability p;
   p = entry_count.probability_in (loop->header->count);
-  scale_loop_profile (loop, p, 0);
+  scale_loop_profile (loop, p, -1);
   bitmap_set_bit (peeled_loops, loop->num);
   return true;
 }
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index d66d4a6de69..2361cb328ab 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -3191,7 +3191,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       if (prob_vector.initialized_p ())
 	{
 	  scale_bbs_frequencies (&bb_before_loop, 1, prob_vector);
-	  scale_loop_profile (loop, prob_vector, 0);
+	  scale_loop_profile (loop, prob_vector, -1);
 	}
     }
 
@@ -3236,7 +3236,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
       slpeel_update_phi_nodes_for_guard1 (prolog, loop, guard_e, e);
 
       scale_bbs_frequencies (&bb_after_prolog, 1, prob_prolog);
-      scale_loop_profile (prolog, prob_prolog, bound_prolog);
+      scale_loop_profile (prolog, prob_prolog, bound_prolog - 1);
     }
 
   /* Update init address of DRs.  */
@@ -3378,7 +3378,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1,
 
 	  scale_bbs_frequencies (&bb_before_epilog, 1, prob_epilog);
 	}
-      scale_loop_profile (epilog, prob_epilog, 0);
+      scale_loop_profile (epilog, prob_epilog, -1);
     }
   else
     slpeel_update_phi_nodes_for_lcssa (epilog);
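The patch leans on set_edge_probability_and_rescale_others for step b). Below is a simplified stand-in, hedged as a toy model, for what such a helper does: pin one successor probability and rescale the rest proportionally so the distribution still sums to 1 (plain doubles, invented names, not GCC's actual implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pin probs[idx] to new_prob and rescale the remaining successor
// probabilities proportionally so the whole distribution still sums to 1.
void set_and_rescale (std::vector<double> &probs, std::size_t idx,
                      double new_prob)
{
  double others = 0;
  for (std::size_t i = 0; i < probs.size (); i++)
    if (i != idx)
      others += probs[i];
  for (std::size_t i = 0; i < probs.size (); i++)
    if (i != idx)
      probs[i] *= (1.0 - new_prob) / others; // keep relative ratios
  probs[idx] = new_prob;
}
```

E.g. pinning the first of {0.5, 0.25, 0.25} to 0.75 rescales the rest to {0.125, 0.125}; this is exactly the shape of update needed when the exit sits inside a block with several successors, which the old 1/iteration_bound assignment got wrong.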
On Thu, 6 Jul 2023, Jan Hubicka wrote:

> Hi,
> original scale_loop_profile was implemented to only handle very simple
> loops produced by the vectorizer at that time (basically loops with only
> one exit and no subloops).  It also has not been updated to the new
> profile-count API very carefully.  Since I want to use it from loop
> peeling and unlooping, I need the function to at least not make the
> profile worse on general loops.
>
> The function does two things:
> 1) scales down the loop profile by a given probability.
>    This is useful, for example, to scale down the profile after peeling
>    when the loop body is executed less often than before.
> 2) after scaling is done, and if the profile indicates too large an
>    iteration count, update the profile to cap the iteration count by the
>    ITERATION_BOUND parameter.
>
> Step 1 is easy and unchanged.
>
> I changed ITERATION_BOUND to be the actual bound on the number of
> iterations as used elsewhere (i.e. the number of executions of the latch
> edge) rather than the number of iterations + 1 as it was before.
>
> To do 2) one needs to do the following:
> a) scale the own loop profile so the frequency of the header is at most
>    the sum of in-edge counts * (iteration_bound + 1)
> b) update loop exit probabilities so their count is the same as before
>    scaling
> c) reduce frequencies of basic blocks after the loop exit
>
> The old code did b) by setting the probability to 1 / iteration_bound,
> which is correct only if the basic block containing the exit executes
> precisely once per iteration (it is not inside another conditional or an
> inner loop).  This is fixed now by using
> set_edge_probability_and_rescale_others.
>
> Also c) was implemented only for the special case when the exit was just
> before the latch basic block.  I now use dominance info to get some of
> the additional cases right.
>
> I still did not try to do anything for multiple exit loops, though the
> implementation could be generalized.
>
> Bootstrapped/regtested x86_64-linux.  I plan to commit it tonight if
> there are no complaints.
Looks good, but I wonder what we can do to at least make the multiple exit case behave reasonably? The vectorizer keeps track of a "canonical" exit, would it be possible to pass in the main exit edge and use that instead of single_exit (), would other exits then behave somewhat reasonable or would we totally screw things up here? That is, the "canonical" exit would be the counting exit while the other exits are on data driven conditions and thus wouldn't change probability when we reduce the number of iterations(?) Richard. > gcc/ChangeLog: > > * cfgloopmanip.cc (scale_loop_profile): Rewrite exit edge > probability update to be safe on loops with subloops. > Make bound parameter to be iteration bound. > * tree-ssa-loop-ivcanon.cc (try_peel_loop): Update call > of scale_loop_profile. > * tree-vect-loop-manip.cc (vect_do_peeling): Likewise. > > diff --git a/gcc/cfgloopmanip.cc b/gcc/cfgloopmanip.cc > index 6e09dcbb0b1..524b979a546 100644 > --- a/gcc/cfgloopmanip.cc > +++ b/gcc/cfgloopmanip.cc > @@ -499,7 +499,7 @@ scale_loop_frequencies (class loop *loop, profile_probability p) > } > > /* Scale profile in LOOP by P. > - If ITERATION_BOUND is non-zero, scale even further if loop is predicted > + If ITERATION_BOUND is not -1, scale even further if loop is predicted > to iterate too many times. > Before caling this function, preheader block profile should be already > scaled to final count. 
This is necessary because loop iterations are > @@ -510,106 +510,123 @@ void > scale_loop_profile (class loop *loop, profile_probability p, > gcov_type iteration_bound) > { > - edge e, preheader_e; > - edge_iterator ei; > - > - if (dump_file && (dump_flags & TDF_DETAILS)) > + if (!(p == profile_probability::always ())) > { > - fprintf (dump_file, ";; Scaling loop %i with scale ", > - loop->num); > - p.dump (dump_file); > - fprintf (dump_file, " bounding iterations to %i\n", > - (int)iteration_bound); > - } > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, ";; Scaling loop %i with scale ", > + loop->num); > + p.dump (dump_file); > + fprintf (dump_file, "\n"); > + } > > - /* Scale the probabilities. */ > - scale_loop_frequencies (loop, p); > + /* Scale the probabilities. */ > + scale_loop_frequencies (loop, p); > + } > > - if (iteration_bound == 0) > + if (iteration_bound == -1) > return; > > gcov_type iterations = expected_loop_iterations_unbounded (loop, NULL, true); > + if (iterations == -1) > + return; > > if (dump_file && (dump_flags & TDF_DETAILS)) > { > - fprintf (dump_file, ";; guessed iterations after scaling %i\n", > - (int)iterations); > + fprintf (dump_file, > + ";; guessed iterations of loop %i:%i new upper bound %i:\n", > + loop->num, > + (int)iterations, > + (int)iteration_bound); > } > > /* See if loop is predicted to iterate too many times. */ > if (iterations <= iteration_bound) > return; > > - preheader_e = loop_preheader_edge (loop); > - > - /* We could handle also loops without preheaders, but bounding is > - currently used only by optimizers that have preheaders constructed. */ > - gcc_checking_assert (preheader_e); > - profile_count count_in = preheader_e->count (); > + /* Compute number of invocations of the loop. 
*/ > + profile_count count_in = profile_count::zero (); > + edge e; > + edge_iterator ei; > + FOR_EACH_EDGE (e, ei, loop->header->preds) > + count_in += e->count (); > > - if (count_in > profile_count::zero () > - && loop->header->count.initialized_p ()) > + /* Now scale the loop body so header count is > + count_in * (iteration_bound + 1) */ > + profile_probability scale_prob > + = (count_in *= iteration_bound).probability_in (loop->header->count); > + if (dump_file && (dump_flags & TDF_DETAILS)) > { > - profile_count count_delta = profile_count::zero (); > - > - e = single_exit (loop); > - if (e) > - { > - edge other_e; > - FOR_EACH_EDGE (other_e, ei, e->src->succs) > - if (!(other_e->flags & (EDGE_ABNORMAL | EDGE_FAKE)) > - && e != other_e) > - break; > - > - /* Probability of exit must be 1/iterations. */ > - count_delta = e->count (); > - e->probability = profile_probability::always () / iteration_bound; > - other_e->probability = e->probability.invert (); > - > - /* In code below we only handle the following two updates. */ > - if (other_e->dest != loop->header > - && other_e->dest != loop->latch > - && (dump_file && (dump_flags & TDF_DETAILS))) > - { > - fprintf (dump_file, ";; giving up on update of paths from " > - "exit condition to latch\n"); > - } > - } > + fprintf (dump_file, ";; Scaling loop %i with scale ", > + loop->num); > + p.dump (dump_file); > + fprintf (dump_file, " to reach upper bound %i\n", > + (int)iteration_bound); > + } > + /* Finally attempt to fix exit edge probability. */ > + auto_vec<edge> exits = get_loop_exit_edges (loop); > + edge exit_edge = single_likely_exit (loop, exits); > + > + /* In a consistent profile unadjusted_exit_count should be same as count_in, > + however to preserve as much of the original info, avoid recomputing > + it. 
*/ > + profile_count unadjusted_exit_count; > + if (exit_edge) > + unadjusted_exit_count = exit_edge->count (); > + scale_loop_frequencies (loop, scale_prob); > + > + if (exit_edge) > + { > + profile_count old_exit_count = exit_edge->count (); > + profile_probability new_probability; > + if (iteration_bound > 0) > + new_probability > + = unadjusted_exit_count.probability_in (exit_edge->src->count); > else > - if (dump_file && (dump_flags & TDF_DETAILS)) > - fprintf (dump_file, ";; Loop has multiple exit edges; " > - "giving up on exit condition update\n"); > - > - /* Roughly speaking we want to reduce the loop body profile by the > - difference of loop iterations. We however can do better if > - we look at the actual profile, if it is available. */ > - p = profile_probability::always (); > - > - count_in *= iteration_bound; > - p = count_in.probability_in (loop->header->count); > - if (!(p > profile_probability::never ())) > - p = profile_probability::very_unlikely (); > - > - if (p == profile_probability::always () > - || !p.initialized_p ()) > - return; > - > - /* If latch exists, change its count, since we changed > - probability of exit. Theoretically we should update everything from > - source of exit edge to latch, but for vectorizer this is enough. */ > - if (loop->latch && loop->latch != e->src) > - loop->latch->count += count_delta; > - > - /* Scale the probabilities. */ > - scale_loop_frequencies (loop, p); > + new_probability = profile_probability::always (); > + set_edge_probability_and_rescale_others (exit_edge, new_probability); > + profile_count new_exit_count = exit_edge->count (); > + > + /* Rescale the remaining edge probabilities and see if there is only > + one. 
*/ > + edge other_edge = NULL; > + bool found = false; > + FOR_EACH_EDGE (e, ei, exit_edge->src->succs) > + if (!(e->flags & EDGE_FAKE) > + && !(e->probability == profile_probability::never ()) > + && !loop_exit_edge_p (loop, e)) > + { > + if (found) > + { > + other_edge = NULL; > + break; > + } > + other_edge = e; > + found = true; > + } > + /* If there is only loop latch after other edge, > + update its profile. */ > + if (other_edge && other_edge->dest == loop->latch) > + loop->latch->count -= new_exit_count - old_exit_count; > + else > + { > + basic_block *body = get_loop_body (loop); > + profile_count new_count = exit_edge->src->count - new_exit_count; > + profile_count old_count = exit_edge->src->count - old_exit_count; > > - /* Change latch's count back. */ > - if (loop->latch && loop->latch != e->src) > - loop->latch->count -= count_delta; > + for (unsigned int i = 0; i < loop->num_nodes; i++) > + if (body[i] != exit_edge->src > + && dominated_by_p (CDI_DOMINATORS, body[i], exit_edge->src)) > + body[i]->count.apply_scale (new_count, old_count); > > - if (dump_file && (dump_flags & TDF_DETAILS)) > - fprintf (dump_file, ";; guessed iterations are now %i\n", > - (int)expected_loop_iterations_unbounded (loop, NULL, true)); > + free (body); > + } > + } > + else if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, > + ";; Loop has mulitple exits;" > + " will leave exit probabilities inconsistent\n"); > } > } > > diff --git a/gcc/tree-ssa-loop-ivcanon.cc b/gcc/tree-ssa-loop-ivcanon.cc > index 491b57ec0f1..184c08eec75 100644 > --- a/gcc/tree-ssa-loop-ivcanon.cc > +++ b/gcc/tree-ssa-loop-ivcanon.cc > @@ -1173,7 +1179,7 @@ try_peel_loop (class loop *loop, > } > profile_probability p; > p = entry_count.probability_in (loop->header->count); > - scale_loop_profile (loop, p, 0); > + scale_loop_profile (loop, p, -1); > bitmap_set_bit (peeled_loops, loop->num); > return true; > } > diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc 
> index d66d4a6de69..2361cb328ab 100644 > --- a/gcc/tree-vect-loop-manip.cc > +++ b/gcc/tree-vect-loop-manip.cc > @@ -3191,7 +3191,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, > if (prob_vector.initialized_p ()) > { > scale_bbs_frequencies (&bb_before_loop, 1, prob_vector); > - scale_loop_profile (loop, prob_vector, 0); > + scale_loop_profile (loop, prob_vector, -1); > } > } > > @@ -3236,7 +3236,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, > slpeel_update_phi_nodes_for_guard1 (prolog, loop, guard_e, e); > > scale_bbs_frequencies (&bb_after_prolog, 1, prob_prolog); > - scale_loop_profile (prolog, prob_prolog, bound_prolog); > + scale_loop_profile (prolog, prob_prolog, bound_prolog - 1); > } > > /* Update init address of DRs. */ > @@ -3378,7 +3378,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, > > scale_bbs_frequencies (&bb_before_epilog, 1, prob_epilog); > } > - scale_loop_profile (epilog, prob_epilog, 0); > + scale_loop_profile (epilog, prob_epilog, -1); > } > else > slpeel_update_phi_nodes_for_lcssa (epilog); >
> Looks good, but I wonder what we can do to at least make the
> multiple exit case behave reasonably?  The vectorizer keeps track
> of a "canonical" exit, would it be possible to pass in the main
> exit edge and use that instead of single_exit (), would other
> exits then behave somewhat reasonably or would we totally screw
> things up here?  That is, the "canonical" exit would be the
> counting exit while the other exits are on data driven conditions
> and thus wouldn't change probability when we reduce the number
> of iterations(?)

I can add a canonical_exit parameter and make the function direct flow
to it if possible.  However, overall I think the fixup depends on what
transformation led to the change.

Assuming that the vectorizer did no prologues and epilogues and we
vectorized with factor N, then I think the update could be done more
specifically as follows.  We know that the header block count dropped by
the factor N.  So we can start from that, and each time we reach a basic
block with an exit edge, we know the original count of that edge.  This
count is unchanged, so one can rescale the probabilities out of that BB
accordingly.  If the loop has no inner loops, we can just walk the body
in RPO, propagate the scales downwards, and we should arrive at the
right result.

I originally added the bound parameter to handle prologues/epilogues,
which get a new artificial bound.  In the prologue I think you are right
that the flow will probably be directed to the conditional counting
iterations.  In the epilogue we add no artificial iteration cap, so
maybe it is more realistic to simply scale up the probability of all
exits?

To see what is going on I tried the following testcase:

int a[99];
test()
{
  for (int i = 0; i < 99; i++)
    a[i]++;
}

What surprises me is that the vectorizer at -O2 does nothing and we end
up unrolling the loop:

L2:
	addl	$1, (%rax)
	addl	$1, 4(%rax)
	addl	$1, 8(%rax)
	addq	$12, %rax
	cmpq	$a+396, %rax

Which seems a silly thing to do.  A vectorized loop with an epilogue
doing 2 and 1 additions would be better.
With -O3 we vectorize it:

.L2:
	movdqa	(%rax), %xmm0
	addq	$16, %rax
	paddd	%xmm1, %xmm0
	movaps	%xmm0, -16(%rax)
	cmpq	%rax, %rdx
	jne	.L2
	movq	a+384(%rip), %xmm0
	addl	$1, a+392(%rip)
	movq	.LC1(%rip), %xmm1
	paddd	%xmm1, %xmm0
	movq	%xmm0, a+384(%rip)

and correctly drop the vectorized loop body to 24 iterations.  However
the epilogue has the loop for vector size 2 predicted to iterate once
(it won't):

;; basic block 7, loop depth 0, count 10737416 (estimated locally), maybe hot
;;  prev block 5, next block 8, flags: (NEW, VISITED)
;;  pred:       3 [4.0% (adjusted)]  count:10737416 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;  succ:       8 [always]  count:10737416 (estimated locally) (FALLTHRU,EXECUTABLE)

;; basic block 8, loop depth 1, count 21474835 (estimated locally), maybe hot
;;  prev block 7, next block 9, flags: (NEW, REACHABLE, VISITED)
;;  pred:       9 [always]  count:10737417 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;              7 [always]  count:10737416 (estimated locally) (FALLTHRU,EXECUTABLE)
  # i_9 = PHI <i_17(9), 96(7)>
  # ivtmp_13 = PHI <ivtmp_18(9), 3(7)>
  # vectp_a.14_40 = PHI <vectp_a.14_41(9), &MEM <int[99]> [(void *)&a + 384B](7)>
  # vectp_a.18_46 = PHI <vectp_a.18_47(9), &MEM <int[99]> [(void *)&a + 384B](7)>
  # ivtmp_49 = PHI <ivtmp_50(9), 0(7)>
  vect__14.16_42 = MEM <vector(2) int> [(int *)vectp_a.14_40];
  _14 = a[i_9];
  vect__15.17_44 = vect__14.16_42 + { 1, 1 };
  _15 = _14 + 1;
  MEM <vector(2) int> [(int *)vectp_a.18_46] = vect__15.17_44;
  i_17 = i_9 + 1;
  ivtmp_18 = ivtmp_13 - 1;
  vectp_a.14_41 = vectp_a.14_40 + 8;
  vectp_a.18_47 = vectp_a.18_46 + 8;
  ivtmp_50 = ivtmp_49 + 1;
  if (ivtmp_50 < 1)
    goto <bb 9>; [50.00%]
  else
    goto <bb 12>; [50.00%]

and finally the scalar copy:

;; basic block 12, loop depth 0, count 10737416 (estimated locally), maybe hot
;;  prev block 9, next block 13, flags: (NEW, VISITED)
;;  pred:       8 [50.0% (adjusted)]  count:10737418 (estimated locally) (FALSE_VALUE,EXECUTABLE)
;;  succ:       13 [always]  count:10737416 (estimated locally) (FALLTHRU)

;; basic block 13, loop depth 1, count 1063004409 (estimated locally), maybe hot
;;  prev block 12, next block 14, flags: (NEW, REACHABLE, VISITED)
;;  pred:       14 [always]  count:1052266996 (estimated locally) (FALLTHRU,DFS_BACK,EXECUTABLE)
;;              12 [always]  count:10737416 (estimated locally) (FALLTHRU)
  # i_30 = PHI <i_36(14), 98(12)>
  # ivtmp_32 = PHI <ivtmp_37(14), 1(12)>
  _33 = a[i_30];
  _34 = _33 + 1;
  a[i_30] = _34;
  i_36 = i_30 + 1;
  ivtmp_37 = ivtmp_32 - 1;
  if (ivtmp_37 != 0)
    goto <bb 14>; [98.99%]
  else
    goto <bb 4>; [1.01%]

with an also small but non-zero iteration probability.  This is papered
over by my yesterday patch.  But it seems to me that it would be a lot
better if the vectorizer understood that the epilogue will be loopless
and accounted for that in the cost model; that would probably make it
easy to enable it at cheap costs too.

Clang 16 at -O2 is much more aggressive by both vectorizing and unrolling:

test:                                   # @test
	.cfi_startproc
# %bb.0:
	movdqa	a(%rip), %xmm1
	pcmpeqd	%xmm0, %xmm0
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a(%rip)
	movdqa	a+16(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+16(%rip)
	movdqa	a+32(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+32(%rip)
	movdqa	a+48(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+48(%rip)
	movdqa	a+64(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+64(%rip)
	movdqa	a+80(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+80(%rip)
	movdqa	a+96(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+96(%rip)
	movdqa	a+112(%rip), %xmm1
	psubd	%xmm0, %xmm1
	....
	movdqa	%xmm1, a+240(%rip)
	movdqa	a+256(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+256(%rip)
	movdqa	a+272(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+272(%rip)
	movdqa	a+288(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+288(%rip)
	movdqa	a+304(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+304(%rip)
	movdqa	a+320(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+320(%rip)
	movdqa	a+336(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+336(%rip)
	movdqa	a+352(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+352(%rip)
	movdqa	a+368(%rip), %xmm1
	psubd	%xmm0, %xmm1
	movdqa	%xmm1, a+368(%rip)
	addl	$1, a+384(%rip)
	addl	$1, a+388(%rip)
	addl	$1, a+392(%rip)
	retq

Honza
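The RPO walk Honza sketches above (the header count drops by the vectorization factor, exit-edge counts stay fixed, and the out-probabilities of exit blocks are re-derived) can be modelled on a straight-line loop body as follows. Block and rescale_body are hypothetical illustration names, not GCC code, and a real implementation would walk the CFG in reverse post-order rather than a flat vector:

```cpp
#include <cassert>
#include <vector>

// Toy model of the proposed update: after vectorizing by factor VF the
// header count drops by VF, but the absolute count on each exit edge is
// unchanged.  Walking the body we recompute each block's count from the
// flow reaching it and re-derive the exit probability from the fixed
// exit count.
struct Block {
  double count;       // profile count of the block
  double exit_count;  // absolute count leaving the loop here (0 if none)
  double exit_prob;   // probability of taking the exit edge
};

void rescale_body(std::vector<Block> &body, double new_header_count) {
  double flowing = new_header_count;
  for (Block &b : body) {
    b.count = flowing;
    if (b.exit_count > 0) {
      b.exit_prob = b.exit_count / b.count;  // exit count stays fixed
      flowing = b.count - b.exit_count;      // the rest stays in-loop
    }
  }
}
```

Note how the exit probability grows automatically when the block count shrinks, which matches the intuition that a vectorized loop exits "sooner" in iteration counts.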
Hi Both,

Thanks for all the reviews/patches so far 😊

> > Looks good, but I wonder what we can do to at least make the
> > multiple exit case behave reasonably?  The vectorizer keeps track
> > of a "canonical" exit, would it be possible to pass in the main
> > exit edge and use that instead of single_exit (), would other
> > exits then behave somewhat reasonable or would we totally screw
> > things up here?  That is, the "canonical" exit would be the
> > counting exit while the other exits are on data driven conditions
> > and thus wouldn't change probability when we reduce the number
> > of iterations(?)
>
> I can add a canonical_exit parameter and make the function direct flow
> to it if possible.  However, overall I think the fixup depends on what
> transformation led to the change.
>
> Assuming that the vectorizer did no prologues and epilogues and we
> vectorized with factor N, then I think the update could be done more
> specifically as follows.

If it helps, how this patch series addresses multiple exits is by
forcing a scalar epilogue; all non-canonical exits would have been
redirected to this scalar epilogue, so the remaining scalar iteration
count will be at most VF.

Regards,
Tamar

> We know that the header block count dropped by 4.  So we can start
> from that, and each time we reach a basic block with an exit edge, we
> know the original count of the edge.  This count is unchanged, so one
> can rescale the probabilities out of that BB accordingly.  If the
> loop has no inner loops, we can just walk the body in RPO and
> propagate scales downwards, and we should arrive at the right result.
>
> I originally added the bound parameter to handle prologues/epilogues,
> which get a new artificial bound.  In the prologue I think you are
> right that the flow will probably be directed to the conditional
> counting iterations.
>
> In the epilogue we add no artificial iteration cap, so maybe it is
> more realistic to simply scale up the probability of all exits?
> [...]
>
> Honza
> Hi Both,
>
> Thanks for all the reviews/patches so far 😊
>
> [...]
>
> If it helps, how this patch series addresses multiple exits is by
> forcing a scalar epilogue; all non-canonical exits would have been
> redirected to this scalar epilogue, so the remaining scalar iteration
> count will be at most VF.

It looks like the profile update after vectorization needs quite some
TLC.  My student Ondrej Kubanek also implemented loop histogram
profiling, which gives a better idea of how commonly prologues/epilogues
are needed, and it would also be nice to handle that.
> > ;; basic block 12, loop depth 0, count 10737416 (estimated locally), maybe hot
> > [...]
> >   if (ivtmp_37 != 0)
> >     goto <bb 14>; [98.99%]
> >   else
> >     goto <bb 4>; [1.01%]

Actually it seems that the scalar epilogue loop keeps its original
profile (predicted to iterate 99 times), which is quite wrong.

Looking at the statistics for yesterday's patch, on tramp3d we got an
86% reduction in cumulative profile mismatches after the whole
optimization pipeline.  More interestingly, however, the overall time
estimate dropped by 18%, so it seems that the profile adjustments done
by cunroll are affecting the profile a lot.

I think the fact that the iteration count of epilogues is not capped is
one of the main problems.  We seem to call scale_loop_profile 3 times:

  scale_loop_profile (loop, prob_vector, -1);

This seems to account for the probability that control flow is
redirected to the prolog/epilog later.  So it only scales down the
profile but is not responsible for the bound.

  scale_loop_profile (prolog, prob_prolog, bound_prolog - 1);

This does the prolog and sets the bound.

  scale_loop_profile (epilog, prob_epilog, -1);

This scales the epilog but does not set a bound at all.  I think the
information is available, since we update the loop_info datastructures.

Honza
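The missing epilogue cap Honza points out can be stated numerically: once the main loop runs with factor VF, the epilogue covers at most VF - 1 scalar iterations, whatever the original loop's estimate was. A small sketch of the arithmetic for the 99-iteration testcase discussed above (names hypothetical):

```cpp
#include <cassert>

// For a loop with NITER scalar iterations vectorized by factor VF, the
// main vector loop runs NITER / VF times and the epilogue handles the
// remainder, which is bounded by VF - 1 -- the cap that
// scale_loop_profile (epilog, ..., -1) currently fails to communicate.
struct Split {
  long vector_iters;
  long epilogue_iters;  // at most vf - 1
};

Split split_iterations(long niter, long vf) {
  return {niter / vf, niter % vf};
}
```

For niter = 99 and VF = 4 this gives 24 vector iterations and 3 leftover scalar iterations, matching the dump above; a width-2 vector epilogue then covers one iteration and a scalar copy the last.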
On Fri, 7 Jul 2023, Jan Hubicka wrote:

> > Looks good, but I wonder what we can do to at least make the
> > multiple exit case behave reasonably?  [...]
>
> I can add a canonical_exit parameter and make the function direct flow
> to it if possible.  However, overall I think the fixup depends on what
> transformation led to the change.

I think the vectorizer knows there's a single counting IV and all other
exits are dependent on the data processed, so the scaling in the
vectorizer just changes the counting IV.  So I think it makes sense to
pass that exit to the function in all cases.

> Assuming that the vectorizer did no prologues and epilogues and we
> vectorized with factor N, then I think the update could be done more
> specifically as follows.
>
> We know that the header block count dropped by 4.  So we can start
> from that, and each time we reach a basic block with an exit edge, we
> know the original count of the edge.  This count is unchanged, so one
> can rescale the probabilities out of that BB accordingly.  If the
> loop has no inner loops, we can just walk the body in RPO and
> propagate scales downwards, and we should arrive at the right result.

That should work for alternate exits as well, no?

> I originally added the bound parameter to handle prologues/epilogues,
> which get a new artificial bound.  In the prologue I think you are
> right that the flow will probably be directed to the conditional
> counting iterations.

I suppose we'd need to scale both main and epilogue together since the
epilogue "steals" from the main loop counts.  Likewise if there's a
skip edge around the vector loop.  I think currently we simply set the
edge probability of those skip conds rather than basing this off the
niter values they work on.  Aka

  if (niter < VF) goto epilogue;
  do {} while (niter / VF);
  epilogue:
  do {} while (niter);

There's also the cost model which might require niter > VF to enter
the main loop body.

> In the epilogue we add no artificial iteration cap, so maybe it is
> more realistic to simply scale up the probability of all exits?

Probably.

> To see what is going on I tried the following testcase:
>
> [...]
>
> which seems a silly thing to do.  A vectorized loop with an epilogue
> doing 2 and 1 additions would be better.

The -O2 cost model doesn't want to do epilogues:

  /* If using the "very cheap" model. reject cases in which we'd keep
     a copy of the scalar code (even if we might be able to vectorize it).  */
  if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
      && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
	  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
	  || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
    {
      if (dump_enabled_p ())
	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
			 "some scalar iterations would need to be peeled\n");
      return 0;
    }

it's because of the code size increase.

> and correctly drop the vectorized loop body to 24 iterations.
> However the epilogue has the loop for vector size 2 predicted to
> iterate once (it won't):
>
> [...]
>
> with an also small but non-zero iteration probability.  This is
> papered over by my yesterday patch.  But it seems to me that it would
> be a lot better if the vectorizer understood that the epilogue will
> be loopless and accounted for that in the cost model; that would
> probably make it easy to enable it at cheap costs too.

The epilogue will be "unrolled" later I think, because we can correctly
compute it won't iterate.

> Clang 16 at -O2 is much more aggressive by both vectorizing and
> unrolling:
>
> [...]

That's clearly much larger code.  On x86 we're also fighting with large
instruction encodings here, in particular EVEX for AVX512 is "bad"
here.  We hardly get more than two instructions decoded per cycle due
to their size.

	Richard.
Hi,
over the weekend I found that the vectorizer is missing
scale_loop_profile for epilogues.  It already adjusts loop_info to set
max iterations, so adding it was easy.  However it now predicts the
first loop to iterate at most once (which is too much; I suppose it
forgets to divide by the epilogue unrolling factor) and the second
never.

> The -O2 cost model doesn't want to do epilogues:
>
> [...]
>
> it's because of the code size increase.

I know, however -O2 is not -Os, and here the tradeoffs of
performance/code size seem a lot better than other code-expanding
things we do at -O2 (such as the unrolling 3 times).  I think we set
the very cheap cost model very conservatively in order to get
-ftree-vectorize enabled with -O2, and there is some room for finding
the right balance.

I get:

jan@localhost:~> cat t.c
int a[99];
__attribute((noipa, weak))
void
test()
{
  for (int i = 0; i < 99; i++)
    a[i]++;
}
void
main()
{
  for (int j = 0; j < 10000000; j++)
    test();
}
jan@localhost:~> gcc -O2 t.c -fno-unroll-loops ; time ./a.out

real	0m0.529s
user	0m0.528s
sys	0m0.000s

jan@localhost:~> gcc -O2 t.c ; time ./a.out

real	0m0.427s
user	0m0.426s
sys	0m0.000s
jan@localhost:~> gcc -O3 t.c ; time ./a.out

real	0m0.136s
user	0m0.135s
sys	0m0.000s
jan@localhost:~> clang -O2 t.c ; time ./a.out
<warnings>

real	0m0.116s
user	0m0.116s
sys	0m0.000s

Code size (of function test):
gcc -O2 -fno-unroll-loops   17 bytes
gcc -O2                     29 bytes
gcc -O3                     50 bytes
clang -O2                  510 bytes

So unrolling is 70% code size growth for 23% speedup, vectorizing is
294% code size growth for 388% speedup, and clang does 3000% code size
growth for 456% speedup.

> That's clearly much larger code.  On x86 we're also fighting with
> large instruction encodings here, in particular EVEX for AVX512 is
> "bad" here.  We hardly get more than two instructions decoded per
> cycle due to their size.

Agreed, I found it surprising clang does that much complete unrolling
at -O2.  However vectorizing and not unrolling here seems like it may
be a better default for -O2 than what we do currently...

Honza

>
> 	Richard.
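The growth/speedup figures above are ratios against the gcc -O2 -fno-unroll-loops baseline (17 bytes, 0.529 s), truncated to whole percent. A quick check of the arithmetic, with the helper name made up:

```cpp
#include <cassert>

// Size or time expressed as a truncated percentage of a baseline,
// e.g. 294 means 2.94x the baseline; subtracting 100 gives the growth.
long percent_of_baseline(double value, double baseline) {
  return (long)(value / baseline * 100.0);
}
```

So 50 bytes vs 17 is 294% of baseline, 510 bytes vs 17 is 3000%, and 0.529 s vs 0.136 s is a 388% speedup, matching the numbers in the mail.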
> On Fri, 7 Jul 2023, Jan Hubicka wrote:
>
> > > Looks good, but I wonder what we can do to at least make the
> > > multiple exit case behave reasonably?  [...]
> >
> > I can add a canonical_exit parameter and make the function direct
> > flow to it if possible.  However, overall I think the fixup depends
> > on what transformation led to the change.
>
> I think the vectorizer knows there's a single counting IV and all
> other exits are dependent on the data processed, so the scaling in
> the vectorizer just changes the counting IV.  So I think it makes
> sense to pass that exit to the function in all cases.

It really seems to me that a vectorized loop is like N loops happening
in parallel, so the probabilities of alternative exits grow as well.
But the canonical exit is the right thing for prologues; there we
really add extra conditions to the iteration-counting exit.

> > Assuming that the vectorizer did no prologues and epilogues and we
> > vectorized with factor N, then I think the update could be done
> > more specifically as follows.
> >
> > We know that the header block count dropped by 4.  So we can start
> > from that, and each time we reach a basic block with an exit edge,
> > we know the original count of the edge.  This count is unchanged,
> > so one can rescale the probabilities out of that BB accordingly.
> > If the loop has no inner loops, we can just walk the body in RPO
> > and propagate scales downwards, and we should arrive at the right
> > result.
>
> That should work for alternate exits as well, no?
Yes, I think it could mostly work for acyclic bodies.  I ended up
implementing a special case of this for loop-ch in order to handle
loop-invariant conditionals correctly.  Will send the patch after some
cleanups.  (There seem to be more loop-invariant conditionals in real
code than I would have thought.)

Tampering only with loop exit probabilities is not always enough.
If you have:

  while (1)
    if (test1)
      {
        if (test2)
          break;
      }

increasing the count of the exit may require increasing the probability
of the outer conditional.  Do we support this in vectorization at all,
and if so, do we know something here?  For example, if test1 is true in
one of the iterations packed together, its probability also increases
by the vectorization factor.

We ran into this in peeling, i.e. when we prove that test1 will trigger
undefined behaviour after one or two iterations but the original
estimated profile believes in a higher iteration count.  I added a
special case for this yesterday to avoid turning if (test2) to 100% in
this situation, as that triggers strange codegen in some of the fortran
testcases.

We can also have

  while (1)
    while (test1)
      {
        if (test2)
          break;
      }

which is harder because changing the probability of test2 affects the
number of iterations of the inner loop.  So I am giving up on this.
I think currently it happens mostly with unlooping.

> > I originally added the bound parameter to handle
> > prologues/epilogues, which get a new artificial bound.  In the
> > prologue I think you are right that the flow will probably be
> > directed to the conditional counting iterations.
>
> I suppose we'd need to scale both main and epilogue together since
> the epilogue "steals" from the main loop counts.  Likewise if there's
> a skip edge around the vector loop.  I think currently we simply set
> the edge probability of those skip conds rather than basing this off
> the niter values they work on.
> Aka
>
>   if (niter < VF) goto epilogue;
>   do {} while (niter / VF);
>   epilogue:
>   do {} while (niter);
>
> There's also the cost model which might require niter > VF to enter
> the main loop body.

I think I mostly understand this, since we were playing with it with
Ondra's histograms (which can be used to get some of the unknowns in
the transformation right).  The unknowns (how many times we end up
jumping to the epilogue, for instance) probably can't be reasonably
well guessed if we do not know the loop histogram, which currently we
know only if we prove that the loop has a constant number of
iterations.  So I am trying to get at least this case right first.

The theoretically correct approach would be to first determine the
entry counts of the prologue and epilogue, then produce what we believe
to be the correct profile of those and subtract it from the main loop
profile, updating also probabilities in basic blocks where we did
nontrivial changes while updating prologs/epilogs.  Finally, scale down
the main loop profile and increase the exit probabilities.

Honza
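Honza's "determine entry counts of prologue and epilogue first" approach is exact when the per-entry iteration counts are known, which ties back to the loop histogram profiling mentioned earlier in the thread. A toy model (all names hypothetical) of how the skip edge and the epilogue split up the counts of the structure Richard sketched above:

```cpp
#include <cassert>
#include <vector>

// Toy model of count distribution after peeling with a skip edge:
//   if (niter < VF) goto epilogue;
//   do { ... } while (niter / VF);     // main vector loop
//   epilogue: do { ... } while (rem);  // scalar remainder
// Given the iteration count of each execution of the original loop,
// derive entry counts and total iteration counts for both loops.
struct PeelProfile {
  long vector_entries, vector_iters;
  long epilogue_entries, epilogue_iters;
};

PeelProfile distribute(const std::vector<long> &niters_per_entry, long vf) {
  PeelProfile p{0, 0, 0, 0};
  for (long n : niters_per_entry) {
    if (n >= vf) {
      p.vector_entries++;
      p.vector_iters += n / vf;
    }
    long rem = (n >= vf) ? n % vf : n;  // skip edge: all iterations scalar
    if (rem > 0) {
      p.epilogue_entries++;
      p.epilogue_iters += rem;
    }
  }
  return p;
}
```

With only an average iteration count instead of a histogram, the split between "skipped the vector loop entirely" and "fell through to the remainder" cannot be recovered, which is exactly the unknown described above.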
On Mon, 10 Jul 2023, Jan Hubicka wrote:

> Hi,
> over the weekend I found that the vectorizer is missing
> scale_loop_profile for epilogues.  [...]
>
> I know, however -O2 is not -Os, and here the tradeoffs of
> performance/code size seem a lot better than other code-expanding
> things we do at -O2 (such as the unrolling 3 times).  I think we set
> the very cheap cost model very conservatively in order to get
> -ftree-vectorize enabled with -O2, and there is some room for finding
> the right balance.
>
> [...]
>
> Agreed, I found it surprising clang does that much complete unrolling
> at -O2.  However vectorizing and not unrolling here seems like it may
> be a better default for -O2 than what we do currently...

I was also playing with AVX512 fully masked loops here, which avoid the
epilogue, but due to the instruction encoding size that doesn't usually
win.  I agree that size isn't everything, at least for -O2.

	Richard.
On Mon, 10 Jul 2023, Jan Hubicka wrote:

> > On Fri, 7 Jul 2023, Jan Hubicka wrote:
> > > >
> > > > Looks good, but I wonder what we can do to at least make the
> > > > multiple exit case behave reasonably?  The vectorizer keeps track
> > > > of a "canonical" exit, would it be possible to pass in the main
> > > > exit edge and use that instead of single_exit (), would other
> > > > exits then behave somewhat reasonably or would we totally screw
> > > > things up here?  That is, the "canonical" exit would be the
> > > > counting exit while the other exits are on data-driven conditions
> > > > and thus wouldn't change probability when we reduce the number
> > > > of iterations(?)
> > >
> > > I can add a canonical_exit parameter and make the function direct
> > > the flow to it if possible.  However overall I think the fixup
> > > depends on what transformation led to the change.
> >
> > I think the vectorizer knows there's a single counting IV and all
> > other exits are dependent on data processed, so for the scaling the
> > vectorizer just changes the counting IV.  So I think it makes
> > sense to pass that exit to the function in all cases.
>
> It really seems to me that a vectorized loop is like N loops happening
> in parallel, so the probabilities of alternative exits grow as well.
> But the canonical exit is the right thing to do for prologues - here we
> really add extra conditions to the iteration-counting exit.
>
> > > Assuming that the vectorizer did no prologues and epilogues and we
> > > vectorized with factor N, then I think the update could be done
> > > more specifically as follows.
> > >
> > > We know that the header block count dropped by 4.  So we can start
> > > from that and each time we reach a basic block with an exit edge,
> > > we know the original count of the edge.  This count is unchanged,
> > > so one can rescale the probabilities out of that BB accordingly.
If the loop has no inner loops,
> > > we can just walk the body in RPO and propagate scales downwards
> > > and we should arrive at the right result.
> >
> > That should work for alternate exits as well, no?
>
> Yes, I think it could mostly work for acyclic bodies.  I ended up
> implementing a special case of this for loop-ch in order to handle
> loop-invariant conditionals correctly.  Will send the patch after some
> cleanups.  (There seem to be more loop-invariant conditionals in real
> code than I would have thought.)
>
> Tampering only with the loop exit probabilities is not always enough.
> If you have:
>   while (1)
>     if (test1)
>       {
>         if (test2)
>           break;
>       }
> increasing the count of the exit may require increasing the probability
> of the outer conditional.  Do we support this in vectorization at all
> and if so, do we know something here?

Tamar would need to answer this, but without early-break vectorization
the if-conversion pass will flatten everything and I think even early
breaks will in the end be a non-nested sequence of BBs with exit conds
at the end (or a loopback branch).  Note the (scalar) epilogue is
copied from the original scalar loop body so it doesn't see any
if-conversion.

> For example, if test1 is true in one of the iterations packed together,
> its probability also increases by the vectorization factor.
>
> We run into this in peeling, i.e. when we prove that test1 will trigger
> undefined behaviour after one or two iterations but the original
> estimated profile believes in a higher iteration count.  I added a
> special case for this yesterday to avoid turning if (test2) to 100% in
> this case as that triggers strange codegen in some of the Fortran
> testcases.
>
> We also can have
>   while (1)
>     while (test1)
>       {
>         if (test2)
>           break;
>       }
> which is harder because changing the probability of test2 affects the
> number of iterations of the inner loop.  So I am giving up on this.
> I think currently it happens mostly with unlooping.
What I saw most wrecking the profile is when passes turn
if (cond) into if (0/1), leaving the CFG adjustment to CFG cleanup,
which then simply deletes one of the outgoing edges without doing
anything to the (guessed) profile.

> > > I originally added the bound parameter to handle
> > > prologues/epilogues, which get a new artificial bound.  In the
> > > prologue I think you are right that the flow will probably be
> > > directed to the conditional counting iterations.
> >
> > I suppose we'd need to scale both main and epilogue together since
> > the epilogue "steals" from the main loop counts.  Likewise if there's
> > a skip edge around the vector loop.  I think currently we simply
> > set the edge probability of those skip conds rather than basing
> > this off the niter values they work on.  Aka
> >   if (niter < VF) goto epilogue;
> >   do {} while (niter / VF);
> >   epilogue: do {} while (niter);
> >
> > There's also the cost model which might require niter > VF to enter
> > the main loop body.
>
> I think I mostly understand this since we were playing with it with
> Ondra's histograms (which can be used to get some of the unknowns in
> the transformation right).  The unknowns (how many times we end up
> jumping to the epilogue, for instance) probably can't be reasonably
> well guessed if we do not know the loop histogram, which currently we
> know only if we prove that the loop has a constant number of
> iterations.  So I am trying to get at least this case right first.
>
> The theoretically correct approach would be to first determine the
> entry counts of the prologue and epilogue, then produce what we believe
> to be the correct profile of those and subtract it from the main loop
> profile, updating also probabilities in basic blocks where we did
> nontrivial changes while updating prologues/epilogues.  Finally scale
> down the main loop profile and increase exit probabilities.
>
> What I saw most wrecking the profile is when passes turn
> if (cond) into if (0/1) leaving the CFG adjustment to CFG cleanup
> which then simply deletes one of the outgoing edges without doing
> anything to the (guessed) profile.

Yep, I agree that this is disturbing.  At cfg cleanup time one can
hardly do anything useful, since the knowledge of the transform that
caused the profile inconsistency is forgotten.  However I think it is
not a complete disaster.

With profile feedback the most common case of this happening is a
situation where we duplicated code (by inlining, unrolling etc.) into a
context where it behaves differently than the typical behaviour
represented by the profile.

So if one ends up zapping an edge with large probability, one also
knows that the code being optimized does not exhibit the typical
behaviour from the train run and thus is not very hot.  So the profile
inconsistency should not affect performance that much.

So doing nothing may IMO end up being safer than trying to get the
in/out counts right without really knowing what is going on.

This is mostly about the scenario "constant propagated this conditional
and the profile disagrees with me".  There are other cases where the
update is IMO important, e.g. the vectorizer forgetting to cap the # of
iterations of the epilogue may cause an issue since the epilogue loop
looks more frequent than the main vectorized loop and it may cause IRA
to insert spilling into it or so.

When we duplicate we have a chance to figure out profile updates.
Also we may try to get as much as possible done early.
I think we should again do loop header copying that does not expand
code at early opts.  I have some more plans on cleaning up loop-ch and
then we can give it a try.

With a guessed profile we always have the option to re-do the
propagation.  There is TODO_rebuild_frequencies for that, which we do
after inlining.  This is mostly to handle possible overflows on large
loop nests constructed by the inliner.
We can re-propagate once again after the late cleanup passes.  Looking
at the queue, we have:

      NEXT_PASS (pass_remove_cgraph_callee_edges);
      /* Initial scalar cleanups before alias computation.
         They ensure memory accesses are not indirect wherever possible.  */
      NEXT_PASS (pass_strip_predict_hints, false /* early_p */);
      NEXT_PASS (pass_ccp, true /* nonzero_p */);
      /* After CCP we rewrite no longer addressed locals into SSA
         form if possible.  */
      NEXT_PASS (pass_object_sizes);
      NEXT_PASS (pass_post_ipa_warn);
      /* Must run before loop unrolling.  */
      NEXT_PASS (pass_warn_access, /*early=*/true);
      NEXT_PASS (pass_complete_unrolli);
      ^^^^ here we care about profile
      NEXT_PASS (pass_backprop);
      NEXT_PASS (pass_phiprop);
      NEXT_PASS (pass_forwprop);
      /* pass_build_alias is a dummy pass that ensures that we
         execute TODO_rebuild_alias at this point.  */
      NEXT_PASS (pass_build_alias);
      NEXT_PASS (pass_return_slot);
      NEXT_PASS (pass_fre, true /* may_iterate */);
      NEXT_PASS (pass_merge_phi);
      NEXT_PASS (pass_thread_jumps_full, /*first=*/true);
      ^^^^ here

By now we did CCP and FRE, so we likely optimized out most of the
constant conditionals exposed by inlining.

Honza
On Tue, 11 Jul 2023, Jan Hubicka wrote:

> >
> > What I saw most wrecking the profile is when passes turn
> > if (cond) into if (0/1) leaving the CFG adjustment to CFG cleanup
> > which then simply deletes one of the outgoing edges without doing
> > anything to the (guessed) profile.
>
> Yep, I agree that this is disturbing.  At cfg cleanup time one can
> hardly do anything useful, since the knowledge of the transform that
> caused the profile inconsistency is forgotten.  However I think it is
> not a complete disaster.
>
> With profile feedback the most common case of this happening is a
> situation where we duplicated code (by inlining, unrolling etc.) into
> a context where it behaves differently than the typical behaviour
> represented by the profile.
>
> So if one ends up zapping an edge with large probability, one also
> knows that the code being optimized does not exhibit the typical
> behaviour from the train run and thus is not very hot.  So the profile
> inconsistency should not affect performance that much.
>
> So doing nothing may IMO end up being safer than trying to get the
> in/out counts right without really knowing what is going on.
>
> This is mostly about the scenario "constant propagated this
> conditional and the profile disagrees with me".  There are other cases
> where the update is IMO important, e.g. the vectorizer forgetting to
> cap the # of iterations of the epilogue may cause an issue since the
> epilogue loop looks more frequent than the main vectorized loop and it
> may cause IRA to insert spilling into it or so.
>
> When we duplicate we have a chance to figure out profile updates.
> Also we may try to get as much as possible done early.
> I think we should again do loop header copying that does not expand
> code at early opts.  I have some more plans on cleaning up loop-ch and
> then we can give it a try.
>
> With a guessed profile we always have the option to re-do the
> propagation.  There is TODO_rebuild_frequencies for that, which we do
> after inlining.
> This is mostly to handle possible overflows on large loop nests
> constructed by the inliner.
>
> We can re-propagate once again after the late cleanup passes.  Looking
> at the queue, we have:
>
>       NEXT_PASS (pass_remove_cgraph_callee_edges);
>       /* Initial scalar cleanups before alias computation.
>          They ensure memory accesses are not indirect wherever possible.  */
>       NEXT_PASS (pass_strip_predict_hints, false /* early_p */);
>       NEXT_PASS (pass_ccp, true /* nonzero_p */);
>       /* After CCP we rewrite no longer addressed locals into SSA
>          form if possible.  */
>       NEXT_PASS (pass_object_sizes);
>       NEXT_PASS (pass_post_ipa_warn);
>       /* Must run before loop unrolling.  */
>       NEXT_PASS (pass_warn_access, /*early=*/true);
>       NEXT_PASS (pass_complete_unrolli);
>       ^^^^ here we care about profile
>       NEXT_PASS (pass_backprop);
>       NEXT_PASS (pass_phiprop);
>       NEXT_PASS (pass_forwprop);
>       /* pass_build_alias is a dummy pass that ensures that we
>          execute TODO_rebuild_alias at this point.  */
>       NEXT_PASS (pass_build_alias);
>       NEXT_PASS (pass_return_slot);
>       NEXT_PASS (pass_fre, true /* may_iterate */);
>       NEXT_PASS (pass_merge_phi);
>       NEXT_PASS (pass_thread_jumps_full, /*first=*/true);
>       ^^^^ here
>
> By now we did CCP and FRE, so we likely optimized out most of the
> constant conditionals exposed by inlining.

So maybe we should simply delay the re-propagation of the profile?  I
think cunrolli doesn't so much care about the profile - cunrolli is
(was) about abstraction removal.  Jump threading should be the first
pass to care.

Richard.
> > By now we did CCP and FRE, so we likely optimized out most of the
> > constant conditionals exposed by inlining.
>
> So maybe we should simply delay the re-propagation of the profile?  I
> think cunrolli doesn't so much care about the profile - cunrolli
> is (was) about abstraction removal.  Jump threading should be
> the first pass to care.

That is what I was thinking too.  After inlining the profile counts may
be in quite bad shape.  If you inline together loops like in exchange
that have a large loop nest, we will definitely end up capping counts
to avoid overflow.

cunrolli does:

  ret = tree_unroll_loops_completely (optimize >= 3, false);

which sets may_increase_size to true for -O3 and then

  may_increase_size && optimize_loop_nest_for_speed_p (loop)

which seems a reasonable guard, but it may get random answers on a
capped profile.  It is not a big deal to try propagating before
cunrolli and then again before threading and see how much potential
this idea has.  I guess I should also double check that the other
passes are indeed safe, but I think it is quite obvious they should be.

Honza

>
> Richard.
On Tue, 11 Jul 2023, Jan Hubicka wrote:

> > > By now we did CCP and FRE, so we likely optimized out most of the
> > > constant conditionals exposed by inlining.
> >
> > So maybe we should simply delay the re-propagation of the profile?
> > I think cunrolli doesn't so much care about the profile - cunrolli
> > is (was) about abstraction removal.  Jump threading should be
> > the first pass to care.
>
> That is what I was thinking too.  After inlining the profile counts
> may be in quite bad shape.  If you inline together loops like in
> exchange that have a large loop nest, we will definitely end up
> capping counts to avoid overflow.
>
> cunrolli does:
>
>   ret = tree_unroll_loops_completely (optimize >= 3, false);

Ah, yeah - that used to be false, false ...

> which sets may_increase_size to true for -O3 and then
>
>   may_increase_size && optimize_loop_nest_for_speed_p (loop)
>
> which seems a reasonable guard, but it may get random answers on a
> capped profile.  It is not a big deal to try propagating before
> cunrolli and then again before threading and see how much potential
> this idea has.  I guess I should also double check that the other
> passes are indeed safe, but I think it is quite obvious they should
> be.

Yeah.

Richard.
--- a/gcc/cfgloopmanip.cc
+++ b/gcc/cfgloopmanip.cc
@@ -501,7 +501,7 @@ scale_loop_frequencies (class loop *loop, profile_probability p)
 /* Scale profile in LOOP by P.
    If ITERATION_BOUND is non-zero, scale even further if loop is predicted
    to iterate too many times.
-   Before caling this function, preheader block profile should be already
+   Before calling this function, preheader block profile should be already
    scaled to final count.  This is necessary because loop iterations are
    determined by comparing header edge count to latch ege count and thus
    they need to be scaled synchronously.  */
@@ -597,14 +597,14 @@ scale_loop_profile (class loop *loop, profile_probability p,
   /* If latch exists, change its count, since we changed
      probability of exit.  Theoretically we should update everything from
      source of exit edge to latch, but for vectorizer this is enough.  */
-  if (loop->latch && loop->latch != e->src)
+  if (e && loop->latch && loop->latch != e->src)
     loop->latch->count += count_delta;

   /* Scale the probabilities.  */
   scale_loop_frequencies (loop, p);

   /* Change latch's count back.  */
-  if (loop->latch && loop->latch != e->src)
+  if (e && loop->latch && loop->latch != e->src)
     loop->latch->count -= count_delta;

   if (dump_file && (dump_flags & TDF_DETAILS))