Message ID | 055af750e7a6bd722e55c7046b8b9a38eefa4986.1629805719.git.mjambor@suse.cz
---|---
State | New
Series | IPA-CP profile feedback handling fixes
> For non-local nodes which can have unknown callers, the algorithm just
> takes half of the counts - we may decide that taking just a third or
> some other portion is more reasonable, but I do not think we can
> attempt anything more clever.

Can't you just sum the calling edges and subtract it from callee's
count?

> 2021-08-23  Martin Jambor  <mjambor@suse.cz>
>
>         * ipa-cp.c (struct caller_statistics): New fields rec_count_sum,
>         n_nonrec_calls and itself, document all fields.
>         (init_caller_stats): Initialize the above new fields.
>         (gather_caller_stats): Gather self-recursive counts and calls number.
>         (get_info_about_necessary_edges): Gather counts of self-recursive and
>         other edges bringing in the requested value separately.
>         (dump_profile_updates): Rework to dump info about a single node only.
>         (lenient_count_portion_handling): New function.
>         (struct gather_other_count_struct): New type.
>         (gather_count_of_non_rec_edges): New function.
>         (struct desc_incoming_count_struct): New type.
>         (analyze_clone_icoming_counts): New function.
>         (adjust_clone_incoming_counts): Likewise.
>         (update_counts_for_self_gen_clones): Likewise.
>         (update_profiling_info): Rewritten.
>         (update_specialized_profile): Adjust call to dump_profile_updates.
>         (create_specialized_node): Do not update profiling info.
>         (decide_about_value): New parameter self_gen_clones, either push new
>         clones into it or update their profile counts.  For self-recursively
>         generated values, use a portion of the node count instead of count
>         from self-recursive edges to estimate goodness.
>         (decide_whether_version_node): Gather clones for self-generated values
>         in a new vector, update their profiles at once at the end.
> ---
>  gcc/ipa-cp.c | 543 +++++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 457 insertions(+), 86 deletions(-)
>
> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
> index b987d975793..53cca7aa804 100644
> --- a/gcc/ipa-cp.c
> +++ b/gcc/ipa-cp.c
> @@ -701,20 +701,36 @@ ipcp_versionable_function_p (struct cgraph_node *node)
>
>  struct caller_statistics
>  {
> +  /* If requested (see below), self-recursive call counts are summed into this
> +     field.  */
> +  profile_count rec_count_sum;
> +  /* The sum of all ipa counts of all the other (non-recursive) calls.  */
>    profile_count count_sum;
> +  /* Sum of all frequencies for all calls.  */
>    sreal freq_sum;
> +  /* Number of calls and hot calls respectively.  */
>    int n_calls, n_hot_calls;
> +  /* If itself is set up, also count the number of non-self-recursive
> +     calls.  */
> +  int n_nonrec_calls;
> +  /* If non-NULL, this is the node itself and calls from it should have their
> +     counts included in rec_count_sum and not count_sum.  */
> +  cgraph_node *itself;
>  };
>
> +/* With partial train run we do not want to assume that original's count is
> +   zero whenever we redurect all executed edges to clone.  Simply drop profile
> +   to local one in this case.  In eany case, return the new value.  ORIG_NODE
> +   is the original node and its count has not been updaed yet.  */
> +
> +profile_count
> +lenient_count_portion_handling (profile_count remainder, cgraph_node *orig_node)
> +{
> +  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
> +      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
> +      && opt_for_fn (orig_node->decl, flag_profile_partial_training))
> +    remainder = remainder.guessed_local ();

I do not think you need the partial training flag here.  You should see that
the IPA profile is missing by simply testing the ipa_p predicate on the
relevant counts.

> +
> +/* If caller edge counts of a clone created for a self-recursive arithmetic jump
> +   function must be adjusted, do so.  NODE is the node or its thunk.  */

I would add a comment on why it needs to be adjusted and how.

> +
> +static void
> +adjust_clone_incoming_counts (cgraph_node *node,
> +                              desc_incoming_count_struct *desc)
> +{
> +  for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller)
> +    if (cs->caller->thunk)
> +      {
> +        adjust_clone_incoming_counts (cs->caller, desc);
> +        profile_count sum = profile_count::zero ();
> +        for (cgraph_edge *e = cs->caller->callers; e; e = e->next_caller)
> +          if (e->count.initialized_p ())
> +            sum += e->count.ipa ();
> +        cs->count = cs->count.combine_with_ipa_count (sum);
> +      }
> +    else if (!desc->processed_edges->contains (cs)
> +             && cs->caller->clone_of == desc->orig)
> +      {
> +        cs->count += desc->count;
> +        if (dump_file)
> +          {
> +            fprintf (dump_file, "       Adjusted count of an incoming edge of "
> +                     "a clone %s -> %s to ", cs->caller->dump_name (),
> +                     cs->callee->dump_name ());
> +            cs->count.dump (dump_file);
> +            fprintf (dump_file, "\n");
> +          }
> +      }
> +}

Otherwise the patch looks OK.

Honza
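(Editorial aside for readers following the thread: Honza's suggestion amounts to
roughly the sketch below.  It is not code from the series; it is a minimal
illustration written against the cgraph and profile_count interfaces that already
appear in the patches in this thread.)

/* Sketch only: sum the IPA counts of all incoming edges and subtract that
   sum from the callee's own count, leaving the portion that no known
   caller accounts for.  */

static profile_count
unexplained_portion_sketch (cgraph_node *node)
{
  profile_count sum = profile_count::zero ();
  for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller)
    if (cs->count.ipa ().initialized_p ())
      sum += cs->count.ipa ();
  /* Whatever remains cannot be attributed to any caller we can see.  */
  return node->count.ipa () - sum;
}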
Hi,

On Fri, Oct 08 2021, Jan Hubicka wrote:
>> For non-local nodes which can have unknown callers, the algorithm just
>> takes half of the counts - we may decide that taking just a third or
>> some other portion is more reasonable, but I do not think we can
>> attempt anything more clever.
>
> Can't you just sum the calling edges and subtract it from callee's
> count?

I guess I can.  The patch below changes the handling of this scenario: it
now behaves as if there was another hidden caller with a count of calls
calculated as you suggested.

[...]

>> +/* With partial train run we do not want to assume that original's count is
>> +   zero whenever we redurect all executed edges to clone.  Simply drop profile
>> +   to local one in this case.  In eany case, return the new value.  ORIG_NODE
>> +   is the original node and its count has not been updaed yet.  */
>> +
>> +profile_count
>> +lenient_count_portion_handling (profile_count remainder, cgraph_node *orig_node)
>> +{
>> +  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
>> +      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
>> +      && opt_for_fn (orig_node->decl, flag_profile_partial_training))
>> +    remainder = remainder.guessed_local ();
>
> I do not think you need the partial training flag here.  You should see that
> the IPA profile is missing by simply testing the ipa_p predicate on the
> relevant counts.

I can take that test out, that is not a problem, but I have not done so yet
because I am still wondering whether it is the right thing to do.  The code
attempts to do the same thing that is currently in update_profiling_info,
which you added and which does test the flag.

If I understand the snippet I am replacing correctly, it is supposed to make
sure that, when a clone "steals" all counts from the original node, the
original node is not left with an "adjusted" zero count but with a "locally
guessed" zero count when the partial training flag is provided.  That should
not make it cold but rather switch back to reasoning about it as if we did
not have a profile at all.  That is why I kept the flag check there, but if
you really want me to remove it, I'll do that.

>> +
>> +/* If caller edge counts of a clone created for a self-recursive arithmetic jump
>> +   function must be adjusted, do so.  NODE is the node or its thunk.  */
>
> I would add a comment on why it needs to be adjusted and how.

Done.

The adjusted patch, which I am still testing (but which has already passed an
LTO profiledbootstrap and regression testing), is below.  Let me know what
you think.

Thanks,

Martin


IPA-CP does not do a reasonable job when it is updating profile counts after
it has created clones of recursive functions.  This patch addresses that by:

1. Only updating counts for special-context clones.  When a clone is created
   for all contexts, the original is going to be dead and the cgraph
   machinery has copied counts to the new node, which is the right thing to
   do.  Therefore updating counts has been moved from create_specialized_node
   to decide_about_value and decide_whether_version_node.

2. The current profile updating code artificially increased the assumed old
   count when the sum of counts of incoming edges to both the original and
   new node was bigger than the count of the original node.  This always
   happened when the self-recursive edge from the clone was also redirected
   to the clone, because both the original edge and its clone had their
   original high counts.  This crutch was removed and replaced by the next
   point.

3. When cloning also redirects a self-recursive edge to the clone itself, new
   logic has been added to divide the counts brought by such recursive edges
   between the original node and the clone.  This is impossible to do well
   without special knowledge about the function and about which non-recursive
   entry calls are responsible for what portion of the recursion depth, so
   the approach taken is rather crude.

   For local nodes, we detect the case when the original node is never called
   (in the training run at least) with another value and, if so, steal all
   its counts as if it were dead.  If that is not the case, we try to divide
   the count brought by recursive edges (or rather the count not brought by
   direct edges) proportionally to the counts brought by non-recursive edges,
   but with artificial limits in place so that we do not take too many or too
   few, because that was happening with detrimental effect in mcf_r.  (A
   small numeric sketch of this division follows the ChangeLog below.)

4. When cloning creates extra clones for values brought by a formerly
   self-recursive edge with an arithmetic pass-through jump function on it,
   as it does in exchange2_r, all such clones are processed at once rather
   than one after another.  The counts of all such nodes are distributed
   evenly (modulo counts from formerly-non-recursive edges) and the whole
   situation is then fixed up so that the edge counts fit.  This is what the
   new function update_counts_for_self_gen_clones does.

5. Values brought by a formerly self-recursive edge with an arithmetic
   pass-through jump function on it are evaluated by a heuristic which
   assumes that the vast majority of node counts are the result of recursive
   calls, and so we simply divide those counts by the number of clones there
   would be if we created another one.

6. The mechanisms in init_caller_stats, gather_caller_stats and
   get_info_about_necessary_edges were enhanced to gather the data required
   for the above, and a missing check not to count dead incoming edges was
   also added.

gcc/ChangeLog:

2021-10-15  Martin Jambor  <mjambor@suse.cz>

        * ipa-cp.c (struct caller_statistics): New fields rec_count_sum,
        n_nonrec_calls and itself, document all fields.
        (init_caller_stats): Initialize the above new fields.
        (gather_caller_stats): Gather self-recursive counts and calls number.
        (get_info_about_necessary_edges): Gather counts of self-recursive and
        other edges bringing in the requested value separately.
        (dump_profile_updates): Rework to dump info about a single node only.
        (lenient_count_portion_handling): New function.
        (struct gather_other_count_struct): New type.
        (gather_count_of_non_rec_edges): New function.
        (struct desc_incoming_count_struct): New type.
        (analyze_clone_icoming_counts): New function.
        (adjust_clone_incoming_counts): Likewise.
        (update_counts_for_self_gen_clones): Likewise.
        (update_profiling_info): Rewritten.
        (update_specialized_profile): Adjust call to dump_profile_updates.
        (create_specialized_node): Do not update profiling info.
        (decide_about_value): New parameter self_gen_clones, either push new
        clones into it or update their profile counts.  For self-recursively
        generated values, use a portion of the node count instead of count
        from self-recursive edges to estimate goodness.
        (decide_whether_version_node): Gather clones for self-generated values
        in a new vector, update their profiles at once at the end.
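To make the division described in point 3 concrete, here is a small,
self-contained model with made-up numbers.  It uses plain integers instead of
profile_count and mirrors the clamped proportional formula that
update_profiling_info introduces in the patch below; it is an illustration,
not code from the patch.

#include <algorithm>
#include <cstdio>

/* How many of the "unexplained" (recursive) counts the new clone claims:
   proportional to the non-recursive counts it receives, but clamped so that
   neither node ends up with an implausibly small share.  */
static long
claimed_by_clone (long unexplained, long new_sum, long orig_nonrec_sum,
                  int new_nonrec_calls, int orig_nonrec_calls)
{
  int limit_den = 2 * (orig_nonrec_calls + new_nonrec_calls);
  long proportional = unexplained * new_sum / (new_sum + orig_nonrec_sum);
  long upper = unexplained * (limit_den - 1) / limit_den;
  long lower = unexplained * new_nonrec_calls / limit_den;
  return std::max (std::min (proportional, upper), lower);
}

int
main ()
{
  /* 1000 counts in total, 60 non-recursive counts go to the clone, 40 stay
     with the original; the remaining 900 come from self-recursion.  */
  long unexplained = 1000 - 60 - 40;
  printf ("clone claims %ld of %ld unexplained counts\n",
          claimed_by_clone (unexplained, 60, 40, 1, 1), unexplained);
  return 0;
}

With these numbers the clone claims 540 of the 900 recursive counts and the
original keeps the rest, which is what the proportional split suggests and
which the clamps do not override.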
--- gcc/ipa-cp.c | 534 ++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 448 insertions(+), 86 deletions(-) diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c index b987d975793..b254b9b9de6 100644 --- a/gcc/ipa-cp.c +++ b/gcc/ipa-cp.c @@ -701,20 +701,36 @@ ipcp_versionable_function_p (struct cgraph_node *node) struct caller_statistics { + /* If requested (see below), self-recursive call counts are summed into this + field. */ + profile_count rec_count_sum; + /* The sum of all ipa counts of all the other (non-recursive) calls. */ profile_count count_sum; + /* Sum of all frequencies for all calls. */ sreal freq_sum; + /* Number of calls and hot calls respectively. */ int n_calls, n_hot_calls; + /* If itself is set up, also count the number of non-self-recursive + calls. */ + int n_nonrec_calls; + /* If non-NULL, this is the node itself and calls from it should have their + counts included in rec_count_sum and not count_sum. */ + cgraph_node *itself; }; -/* Initialize fields of STAT to zeroes. */ +/* Initialize fields of STAT to zeroes and optionally set it up so that edges + from IGNORED_CALLER are not counted. */ static inline void -init_caller_stats (struct caller_statistics *stats) +init_caller_stats (caller_statistics *stats, cgraph_node *itself = NULL) { + stats->rec_count_sum = profile_count::zero (); stats->count_sum = profile_count::zero (); stats->n_calls = 0; stats->n_hot_calls = 0; + stats->n_nonrec_calls = 0; stats->freq_sum = 0; + stats->itself = itself; } /* Worker callback of cgraph_for_node_and_aliases accumulating statistics of @@ -729,10 +745,22 @@ gather_caller_stats (struct cgraph_node *node, void *data) for (cs = node->callers; cs; cs = cs->next_caller) if (!cs->caller->thunk) { - if (cs->count.ipa ().initialized_p ()) - stats->count_sum += cs->count.ipa (); + ipa_node_params *info = ipa_node_params_sum->get (cs->caller); + if (info && info->node_dead) + continue; + + if (cs->count.ipa ().initialized_p ()) + { + if (stats->itself && stats->itself == cs->caller) + stats->rec_count_sum += cs->count.ipa (); + else + stats->count_sum += cs->count.ipa (); + } stats->freq_sum += cs->sreal_frequency (); stats->n_calls++; + if (stats->itself && stats->itself != cs->caller) + stats->n_nonrec_calls++; + if (cs->maybe_hot_p ()) stats->n_hot_calls ++; } @@ -4202,19 +4230,22 @@ get_next_cgraph_edge_clone (struct cgraph_edge *cs) /* Given VAL that is intended for DEST, iterate over all its sources and if any of them is viable and hot, return true. In that case, for those that still - hold, add their edge frequency and their number into *FREQUENCY and - *CALLER_COUNT respectively. */ + hold, add their edge frequency and their number and cumulative profile + counts of self-ecursive and other edges into *FREQUENCY, *CALLER_COUNT, + REC_COUNT_SUM and NONREC_COUNT_SUM respectively. 
*/ template <typename valtype> static bool get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest, - sreal *freq_sum, profile_count *count_sum, - int *caller_count) + sreal *freq_sum, int *caller_count, + profile_count *rec_count_sum, + profile_count *nonrec_count_sum) { ipcp_value_source<valtype> *src; sreal freq = 0; int count = 0; - profile_count cnt = profile_count::zero (); + profile_count rec_cnt = profile_count::zero (); + profile_count nonrec_cnt = profile_count::zero (); bool hot = false; bool non_self_recursive = false; @@ -4227,11 +4258,15 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest, { count++; freq += cs->sreal_frequency (); - if (cs->count.ipa ().initialized_p ()) - cnt += cs->count.ipa (); hot |= cs->maybe_hot_p (); if (cs->caller != dest) - non_self_recursive = true; + { + non_self_recursive = true; + if (cs->count.ipa ().initialized_p ()) + rec_cnt += cs->count.ipa (); + } + else if (cs->count.ipa ().initialized_p ()) + nonrec_cnt += cs->count.ipa (); } cs = get_next_cgraph_edge_clone (cs); } @@ -4243,8 +4278,9 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest, return false; *freq_sum = freq; - *count_sum = cnt; *caller_count = count; + *rec_count_sum = rec_cnt; + *nonrec_count_sum = nonrec_cnt; if (!hot && ipa_node_params_sum->get (dest)->node_within_scc) { @@ -4349,112 +4385,399 @@ get_replacement_map (class ipa_node_params *info, tree value, int parm_num, return replace_map; } -/* Dump new profiling counts */ +/* Dump new profiling counts of NODE. SPEC is true when NODE is a specialzied + one, otherwise it will be referred to as the original node. */ static void -dump_profile_updates (struct cgraph_node *orig_node, - struct cgraph_node *new_node) +dump_profile_updates (cgraph_node *node, bool spec) { - struct cgraph_edge *cs; + if (spec) + fprintf (dump_file, " setting count of the specialized node %s to ", + node->dump_name ()); + else + fprintf (dump_file, " setting count of the original node %s to ", + node->dump_name ()); - fprintf (dump_file, " setting count of the specialized node to "); - new_node->count.dump (dump_file); + node->count.dump (dump_file); fprintf (dump_file, "\n"); - for (cs = new_node->callees; cs; cs = cs->next_callee) + for (cgraph_edge *cs = node->callees; cs; cs = cs->next_callee) { - fprintf (dump_file, " edge to %s has count ", - cs->callee->dump_name ()); - cs->count.dump (dump_file); - fprintf (dump_file, "\n"); - } - - fprintf (dump_file, " setting count of the original node to "); - orig_node->count.dump (dump_file); - fprintf (dump_file, "\n"); - for (cs = orig_node->callees; cs; cs = cs->next_callee) - { - fprintf (dump_file, " edge to %s is left with ", + fprintf (dump_file, " edge to %s has count ", cs->callee->dump_name ()); cs->count.dump (dump_file); fprintf (dump_file, "\n"); } } +/* With partial train run we do not want to assume that original's count is + zero whenever we redurect all executed edges to clone. Simply drop profile + to local one in this case. In eany case, return the new value. ORIG_NODE + is the original node and its count has not been updaed yet. 
*/ + +profile_count +lenient_count_portion_handling (profile_count remainder, cgraph_node *orig_node) +{ + if (remainder.ipa_p () && !remainder.ipa ().nonzero_p () + && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p () + && opt_for_fn (orig_node->decl, flag_profile_partial_training)) + remainder = remainder.guessed_local (); + + return remainder; +} + +/* Structure to sum counts coming from nodes other than the original node and + its clones. */ + +struct gather_other_count_struct +{ + cgraph_node *orig; + profile_count other_count; +}; + +/* Worker callback of call_for_symbol_thunks_and_aliases summing the number of + counts that come from non-self-recursive calls.. */ + +static bool +gather_count_of_non_rec_edges (cgraph_node *node, void *data) +{ + gather_other_count_struct *desc = (gather_other_count_struct *) data; + for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller) + if (cs->caller != desc->orig && cs->caller->clone_of != desc->orig) + desc->other_count += cs->count.ipa (); + return false; +} + +/* Structure to help analyze if we need to boost counts of some clones of some + non-recursive edges to match the new callee count. */ + +struct desc_incoming_count_struct +{ + cgraph_node *orig; + hash_set <cgraph_edge *> *processed_edges; + profile_count count; + unsigned unproc_orig_rec_edges; +}; + +/* Go over edges calling NODE and its thunks and gather information about + incoming counts so that we know if we need to make any adjustments. */ + +static void +analyze_clone_icoming_counts (cgraph_node *node, + desc_incoming_count_struct *desc) +{ + for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller) + if (cs->caller->thunk) + { + analyze_clone_icoming_counts (cs->caller, desc); + continue; + } + else + { + if (cs->count.initialized_p ()) + desc->count += cs->count.ipa (); + if (!desc->processed_edges->contains (cs) + && cs->caller->clone_of == desc->orig) + desc->unproc_orig_rec_edges++; + } +} + +/* If caller edge counts of a clone created for a self-recursive arithmetic + jump function must be adjusted because it is coming from a the "seed" clone + for the first value and so has been excessively scaled back as if it was not + a recursive call, adjust it so that the incoming counts of NODE match its + count. NODE is the node or its thunk. */ + +static void +adjust_clone_incoming_counts (cgraph_node *node, + desc_incoming_count_struct *desc) +{ + for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller) + if (cs->caller->thunk) + { + adjust_clone_incoming_counts (cs->caller, desc); + profile_count sum = profile_count::zero (); + for (cgraph_edge *e = cs->caller->callers; e; e = e->next_caller) + if (e->count.initialized_p ()) + sum += e->count.ipa (); + cs->count = cs->count.combine_with_ipa_count (sum); + } + else if (!desc->processed_edges->contains (cs) + && cs->caller->clone_of == desc->orig) + { + cs->count += desc->count; + if (dump_file) + { + fprintf (dump_file, " Adjusted count of an incoming edge of " + "a clone %s -> %s to ", cs->caller->dump_name (), + cs->callee->dump_name ()); + cs->count.dump (dump_file); + fprintf (dump_file, "\n"); + } + } +} + +/* When ORIG_NODE has been cloned for values which have been generated fora + self-recursive call as a result of an arithmetic pass-through + jump-functions, adjust its count together with counts of all such clones in + SELF_GEN_CLONES which also at this point contains ORIG_NODE itself. 
+ + The function sums the counts of the original node and all its clones that + cannot be attributed to a specific clone because it comes from a + non-recursive edge. This sum is then evenly divided between the clones and + on top of that each one gets all the counts which can be attributed directly + to it. */ + +static void +update_counts_for_self_gen_clones (cgraph_node *orig_node, + const vec<cgraph_node *> &self_gen_clones) +{ + profile_count redist_sum = orig_node->count.ipa (); + if (!(redist_sum > profile_count::zero ())) + return; + + if (dump_file) + fprintf (dump_file, " Updating profile of self recursive clone " + "series\n"); + + gather_other_count_struct gocs; + gocs.orig = orig_node; + gocs.other_count = profile_count::zero (); + + auto_vec <profile_count, 8> other_edges_count; + for (cgraph_node *n : self_gen_clones) + { + gocs.other_count = profile_count::zero (); + n->call_for_symbol_thunks_and_aliases (gather_count_of_non_rec_edges, + &gocs, false); + other_edges_count.safe_push (gocs.other_count); + redist_sum -= gocs.other_count; + } + + hash_set<cgraph_edge *> processed_edges; + unsigned i = 0; + for (cgraph_node *n : self_gen_clones) + { + profile_count orig_count = n->count; + profile_count new_count + = (redist_sum.apply_scale (1, self_gen_clones.length ()) + + other_edges_count[i]); + new_count = lenient_count_portion_handling (new_count, orig_node); + n->count = new_count; + profile_count::adjust_for_ipa_scaling (&new_count, &orig_count); + for (cgraph_edge *cs = n->callees; cs; cs = cs->next_callee) + { + cs->count = cs->count.apply_scale (new_count, orig_count); + processed_edges.add (cs); + } + for (cgraph_edge *cs = n->indirect_calls; cs; cs = cs->next_callee) + cs->count = cs->count.apply_scale (new_count, orig_count); + + i++; + } + + /* There are still going to be edges to ORIG_NODE that have one or more + clones coming from another node clone in SELF_GEN_CLONES and which we + scaled by the same amount, which means that the total incoming sum of + counts to ORIG_NODE will be too high, scale such edges back. */ + for (cgraph_edge *cs = orig_node->callees; cs; cs = cs->next_callee) + { + if (cs->callee->ultimate_alias_target () == orig_node) + { + unsigned den = 0; + for (cgraph_edge *e = cs; e; e = get_next_cgraph_edge_clone (e)) + if (e->callee->ultimate_alias_target () == orig_node + && processed_edges.contains (e)) + den++; + if (den > 0) + for (cgraph_edge *e = cs; e; e = get_next_cgraph_edge_clone (e)) + if (e->callee->ultimate_alias_target () == orig_node + && processed_edges.contains (e)) + e->count = e->count.apply_scale (1, den); + } + } + + /* Edges from the seeds of the valus generated for arithmetic jump-functions + along self-recursive edges are likely to have fairly low count and so + edges from them to nodes in the self_gen_clones do not correspond to the + artificially distributed count of the nodes, the total sum of incoming + edges to some clones might be too low. Detect this situation and correct + it. 
*/ + for (cgraph_node *n : self_gen_clones) + { + if (!(n->count.ipa () > profile_count::zero ())) + continue; + + desc_incoming_count_struct desc; + desc.orig = orig_node; + desc.processed_edges = &processed_edges; + desc.count = profile_count::zero (); + desc.unproc_orig_rec_edges = 0; + analyze_clone_icoming_counts (n, &desc); + + if (n->count.differs_from_p (desc.count)) + { + if (n->count > desc.count + && desc.unproc_orig_rec_edges > 0) + { + desc.count = n->count - desc.count; + desc.count + = desc.count.apply_scale (1, desc.unproc_orig_rec_edges); + adjust_clone_incoming_counts (n, &desc); + } + else if (dump_file) + fprintf (dump_file, + " Unable to fix up incoming counts for %s.\n", + n->dump_name ()); + } + } + + if (dump_file) + for (cgraph_node *n : self_gen_clones) + dump_profile_updates (n, n != orig_node); + return; +} + /* After a specialized NEW_NODE version of ORIG_NODE has been created, update - their profile information to reflect this. */ + their profile information to reflect this. This function should not be used + for clones generated for arithmetic pass-through jump functions on a + self-recursive call graph edge, that situation is handled by + update_counts_for_self_gen_clones. */ static void update_profiling_info (struct cgraph_node *orig_node, struct cgraph_node *new_node) { - struct cgraph_edge *cs; struct caller_statistics stats; - profile_count new_sum, orig_sum; - profile_count remainder, orig_node_count = orig_node->count; - profile_count orig_new_node_count = new_node->count; + profile_count new_sum; + profile_count remainder, orig_node_count = orig_node->count.ipa (); - if (!(orig_node_count.ipa () > profile_count::zero ())) + if (!(orig_node_count > profile_count::zero ())) return; - init_caller_stats (&stats); - orig_node->call_for_symbol_thunks_and_aliases (gather_caller_stats, &stats, - false); - orig_sum = stats.count_sum; - init_caller_stats (&stats); + if (dump_file) + { + fprintf (dump_file, " Updating profile from original count: "); + orig_node_count.dump (dump_file); + fprintf (dump_file, "\n"); + } + + init_caller_stats (&stats, new_node); new_node->call_for_symbol_thunks_and_aliases (gather_caller_stats, &stats, false); new_sum = stats.count_sum; - if (orig_node_count < orig_sum + new_sum) + if (new_sum > orig_node_count) { - if (dump_file) - { - fprintf (dump_file, " Problem: node %s has too low count ", - orig_node->dump_name ()); - orig_node_count.dump (dump_file); - fprintf (dump_file, "while the sum of incoming count is "); - (orig_sum + new_sum).dump (dump_file); - fprintf (dump_file, "\n"); - } - - orig_node_count = (orig_sum + new_sum).apply_scale (12, 10); - if (dump_file) - { - fprintf (dump_file, " proceeding by pretending it was "); - orig_node_count.dump (dump_file); - fprintf (dump_file, "\n"); - } + /* TODO: Perhaps this should be gcc_unreachable ()? */ + remainder = profile_count::zero ().guessed_local (); } + else if (stats.rec_count_sum.nonzero_p ()) + { + int new_nonrec_calls = stats.n_nonrec_calls; + /* There are self-recursive edges which are likely to bring in the + majority of calls but which we must divide in between the original and + new node. 
*/ + init_caller_stats (&stats, orig_node); + orig_node->call_for_symbol_thunks_and_aliases (gather_caller_stats, + &stats, false); + int orig_nonrec_calls = stats.n_nonrec_calls; + profile_count orig_nonrec_call_count = stats.count_sum; - remainder = orig_node_count.combine_with_ipa_count (orig_node_count.ipa () - - new_sum.ipa ()); + if (orig_node->local) + { + if (!orig_nonrec_call_count.nonzero_p ()) + { + if (dump_file) + fprintf (dump_file, " The original is local and the only " + "incoming edges from non-dead callers with nonzero " + "counts are self-recursive, assuming it is cold.\n"); + /* The NEW_NODE count and counts of all its outgoing edges + are still unmodified copies of ORIG_NODE's. Just clear + the latter and bail out. */ + profile_count zero; + if (opt_for_fn (orig_node->decl, flag_profile_partial_training)) + zero = profile_count::zero ().guessed_local (); + else + zero = profile_count::adjusted_zero (); + orig_node->count = zero; + for (cgraph_edge *cs = orig_node->callees; + cs; + cs = cs->next_callee) + cs->count = zero; + for (cgraph_edge *cs = orig_node->indirect_calls; + cs; + cs = cs->next_callee) + cs->count = zero; + return; + } + } + else + { + /* Let's behave as if there was another caller that accounts for all + the calls that were either indirect or from other compilation + units. */ + orig_nonrec_calls++; + profile_count pretend_caller_count + = (orig_node_count - new_sum - orig_nonrec_call_count + - stats.rec_count_sum); + orig_nonrec_call_count += pretend_caller_count; + } - /* With partial train run we do not want to assume that original's - count is zero whenever we redurect all executed edges to clone. - Simply drop profile to local one in this case. */ - if (remainder.ipa_p () && !remainder.ipa ().nonzero_p () - && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p () - && flag_profile_partial_training) - remainder = remainder.guessed_local (); + /* Divide all "unexplained" counts roughly proportionally to sums of + counts of non-recursive calls. + + We put rather arbitrary limits on how many counts we claim because the + number of non-self-recursive incoming count is only a rough guideline + and there are cases (such as mcf) where using it blindly just takes + too many. And if lattices are considered in the opposite order we + could also take too few. 
*/ + profile_count unexp = orig_node_count - new_sum - orig_nonrec_call_count; + + int limit_den = 2 * (orig_nonrec_calls + new_nonrec_calls); + profile_count new_part + = MAX(MIN (unexp.apply_scale (new_sum, + new_sum + orig_nonrec_call_count), + unexp.apply_scale (limit_den - 1, limit_den)), + unexp.apply_scale (new_nonrec_calls, limit_den)); + if (dump_file) + { + fprintf (dump_file, " Claiming "); + new_part.dump (dump_file); + fprintf (dump_file, " of unexplained "); + unexp.dump (dump_file); + fprintf (dump_file, " counts because of self-recursive " + "calls\n"); + } + new_sum += new_part; + remainder = lenient_count_portion_handling (orig_node_count - new_sum, + orig_node); + } + else + remainder = lenient_count_portion_handling (orig_node_count - new_sum, + orig_node); new_sum = orig_node_count.combine_with_ipa_count (new_sum); new_node->count = new_sum; orig_node->count = remainder; + profile_count orig_new_node_count = orig_node_count; profile_count::adjust_for_ipa_scaling (&new_sum, &orig_new_node_count); - for (cs = new_node->callees; cs; cs = cs->next_callee) + for (cgraph_edge *cs = new_node->callees; cs; cs = cs->next_callee) cs->count = cs->count.apply_scale (new_sum, orig_new_node_count); - for (cs = new_node->indirect_calls; cs; cs = cs->next_callee) + for (cgraph_edge *cs = new_node->indirect_calls; cs; cs = cs->next_callee) cs->count = cs->count.apply_scale (new_sum, orig_new_node_count); profile_count::adjust_for_ipa_scaling (&remainder, &orig_node_count); - for (cs = orig_node->callees; cs; cs = cs->next_callee) + for (cgraph_edge *cs = orig_node->callees; cs; cs = cs->next_callee) cs->count = cs->count.apply_scale (remainder, orig_node_count); - for (cs = orig_node->indirect_calls; cs; cs = cs->next_callee) + for (cgraph_edge *cs = orig_node->indirect_calls; cs; cs = cs->next_callee) cs->count = cs->count.apply_scale (remainder, orig_node_count); if (dump_file) - dump_profile_updates (orig_node, new_node); + { + dump_profile_updates (new_node, true); + dump_profile_updates (orig_node, false); + } } /* Update the respective profile of specialized NEW_NODE and the original @@ -4495,7 +4818,10 @@ update_specialized_profile (struct cgraph_node *new_node, } if (dump_file) - dump_profile_updates (orig_node, new_node); + { + dump_profile_updates (new_node, true); + dump_profile_updates (orig_node, false); + } } static void adjust_references_in_caller (cgraph_edge *cs, @@ -4795,8 +5121,7 @@ create_specialized_node (struct cgraph_node *node, if (aggvals) ipa_dump_agg_replacement_values (dump_file, aggvals); } - ipa_check_create_node_params (); - update_profiling_info (node, new_node); + new_info = ipa_node_params_sum->get (new_node); new_info->ipcp_orig_node = node; new_node->ipcp_clone = true; @@ -5621,17 +5946,20 @@ ipcp_val_agg_replacement_ok_p (ipa_agg_replacement_value *, /* Decide whether to create a special version of NODE for value VAL of parameter at the given INDEX. If OFFSET is -1, the value is for the parameter itself, otherwise it is stored at the given OFFSET of the - parameter. AVALS describes the other already known values. */ + parameter. AVALS describes the other already known values. SELF_GEN_CLONES + is a vector which contains clones created for self-recursive calls with an + arithmetic pass-through jump function. 
*/ template <typename valtype> static bool decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset, - ipcp_value<valtype> *val, ipa_auto_call_arg_values *avals) + ipcp_value<valtype> *val, ipa_auto_call_arg_values *avals, + vec<cgraph_node *> *self_gen_clones) { struct ipa_agg_replacement_value *aggvals; int caller_count; sreal freq_sum; - profile_count count_sum; + profile_count count_sum, rec_count_sum; vec<cgraph_edge *> callers; if (val->spec_node) @@ -5647,13 +5975,31 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset, val->local_size_cost + overall_size); return false; } - else if (!get_info_about_necessary_edges (val, node, &freq_sum, &count_sum, - &caller_count)) + else if (!get_info_about_necessary_edges (val, node, &freq_sum, &caller_count, + &rec_count_sum, &count_sum)) return false; if (!dbg_cnt (ipa_cp_values)) return false; + if (val->self_recursion_generated_p ()) + { + /* The edge counts in this case might not have been adjusted yet. + Nevertleless, even if they were it would be only a guesswork which we + can do now. The recursive part of the counts can be derived from the + count of the original node anyway. */ + if (node->count.ipa ().nonzero_p ()) + { + unsigned dem = self_gen_clones->length () + 1; + rec_count_sum = node->count.ipa ().apply_scale (1, dem); + } + else + rec_count_sum = profile_count::zero (); + } + + /* get_info_about_necessary_edges only sums up ipa counts. */ + count_sum += rec_count_sum; + if (dump_file && (dump_flags & TDF_DETAILS)) { fprintf (dump_file, " - considering value "); @@ -5694,6 +6040,12 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset, offset, val->value)); val->spec_node = create_specialized_node (node, known_csts, known_contexts, aggvals, callers); + + if (val->self_recursion_generated_p ()) + self_gen_clones->safe_push (val->spec_node); + else + update_profiling_info (node, val->spec_node); + callers.release (); overall_size += val->local_size_cost; if (dump_file && (dump_flags & TDF_DETAILS)) @@ -5722,6 +6074,7 @@ decide_whether_version_node (struct cgraph_node *node) fprintf (dump_file, "\nEvaluating opportunities for %s.\n", node->dump_name ()); + auto_vec <cgraph_node *, 9> self_gen_clones; ipa_auto_call_arg_values avals; gather_context_independent_values (info, &avals, false, NULL); @@ -5736,7 +6089,8 @@ decide_whether_version_node (struct cgraph_node *node) { ipcp_value<tree> *val; for (val = lat->values; val; val = val->next) - ret |= decide_about_value (node, i, -1, val, &avals); + ret |= decide_about_value (node, i, -1, val, &avals, + &self_gen_clones); } if (!plats->aggs_bottom) @@ -5750,7 +6104,8 @@ decide_whether_version_node (struct cgraph_node *node) && (plats->aggs_contain_variable || !aglat->is_single_const ())) for (val = aglat->values; val; val = val->next) - ret |= decide_about_value (node, i, aglat->offset, val, &avals); + ret |= decide_about_value (node, i, aglat->offset, val, &avals, + &self_gen_clones); } if (!ctxlat->bottom @@ -5758,10 +6113,17 @@ decide_whether_version_node (struct cgraph_node *node) { ipcp_value<ipa_polymorphic_call_context> *val; for (val = ctxlat->values; val; val = val->next) - ret |= decide_about_value (node, i, -1, val, &avals); + ret |= decide_about_value (node, i, -1, val, &avals, + &self_gen_clones); } } + if (!self_gen_clones.is_empty ()) + { + self_gen_clones.safe_push (node); + update_counts_for_self_gen_clones (node, self_gen_clones); + } + if (info->do_clone_for_all_contexts) { if (!dbg_cnt 
(ipa_cp_values))
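The redistribution performed by update_counts_for_self_gen_clones in the patch
above (point 4 of the description) can be pictured with a tiny stand-alone
model.  The numbers are made up and plain integers stand in for profile_count;
this is only a sketch of the idea, not GCC code.

#include <cstdio>
#include <vector>

int
main ()
{
  /* IPA count of the original node of the self-generated clone series.  */
  long orig_count = 1200;
  /* Counts that edges from unrelated nodes bring directly to each clone in
     the series (purely illustrative).  */
  std::vector<long> direct = {30, 20, 10};

  /* Whatever is not directly attributable is split evenly.  */
  long redist = orig_count;
  for (long d : direct)
    redist -= d;                        /* 1140 left to distribute.  */

  for (size_t i = 0; i < direct.size (); i++)
    printf ("clone %zu gets %ld\n", i,
            redist / (long) direct.size () + direct[i]);
  return 0;
}

Each clone therefore receives an even share of the unattributed count plus the
counts that non-recursive edges bring to it directly, after which the patch
fixes up edge counts so the sums match.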
On Mon, Oct 18 2021, Martin Jambor wrote:

> [...]
>
> IPA-CP does not do a reasonable job when it is updating profile counts
> after it has created clones of recursive functions.  This patch
> addresses that by:
>
> [...]
>
> gcc/ChangeLog:
>
> 2021-10-15  Martin Jambor  <mjambor@suse.cz>
>
>         * ipa-cp.c (struct caller_statistics): New fields rec_count_sum,
>         n_nonrec_calls and itself, document all fields.
> [...]

Honza approved the patch in a private conversation and I have pushed it to
master as commit d1e2e4f9ce4df50564f1244dcea9befc3066faa8.

Thanks,

Martin
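Point 5 of the description boils down to a one-line heuristic in
decide_about_value; the snippet below only illustrates the arithmetic with
invented numbers and is not GCC code.

#include <cstdio>

int
main ()
{
  /* IPA count of the node being specialized and the number of clones already
     created for self-generated values; both numbers are made up.  */
  long node_count = 900;
  unsigned existing_self_gen_clones = 2;

  /* Assume the count is dominated by recursion and attribute to the
     candidate value one share out of (existing clones + the one we would
     create now).  */
  long rec_count_sum = node_count / (existing_self_gen_clones + 1);
  printf ("estimated recursive count for the value: %ld\n", rec_count_sum);
  return 0;
}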
diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c index b987d975793..53cca7aa804 100644 --- a/gcc/ipa-cp.c +++ b/gcc/ipa-cp.c @@ -701,20 +701,36 @@ ipcp_versionable_function_p (struct cgraph_node *node) struct caller_statistics { + /* If requested (see below), self-recursive call counts are summed into this + field. */ + profile_count rec_count_sum; + /* The sum of all ipa counts of all the other (non-recursive) calls. */ profile_count count_sum; + /* Sum of all frequencies for all calls. */ sreal freq_sum; + /* Number of calls and hot calls respectively. */ int n_calls, n_hot_calls; + /* If itself is set up, also count the number of non-self-recursive + calls. */ + int n_nonrec_calls; + /* If non-NULL, this is the node itself and calls from it should have their + counts included in rec_count_sum and not count_sum. */ + cgraph_node *itself; }; -/* Initialize fields of STAT to zeroes. */ +/* Initialize fields of STAT to zeroes and optionally set it up so that edges + from IGNORED_CALLER are not counted. */ static inline void -init_caller_stats (struct caller_statistics *stats) +init_caller_stats (caller_statistics *stats, cgraph_node *itself = NULL) { + stats->rec_count_sum = profile_count::zero (); stats->count_sum = profile_count::zero (); stats->n_calls = 0; stats->n_hot_calls = 0; + stats->n_nonrec_calls = 0; stats->freq_sum = 0; + stats->itself = itself; } /* Worker callback of cgraph_for_node_and_aliases accumulating statistics of @@ -729,10 +745,22 @@ gather_caller_stats (struct cgraph_node *node, void *data) for (cs = node->callers; cs; cs = cs->next_caller) if (!cs->caller->thunk) { - if (cs->count.ipa ().initialized_p ()) - stats->count_sum += cs->count.ipa (); + ipa_node_params *info = ipa_node_params_sum->get (cs->caller); + if (info && info->node_dead) + continue; + + if (cs->count.ipa ().initialized_p ()) + { + if (stats->itself && stats->itself == cs->caller) + stats->rec_count_sum += cs->count.ipa (); + else + stats->count_sum += cs->count.ipa (); + } stats->freq_sum += cs->sreal_frequency (); stats->n_calls++; + if (stats->itself && stats->itself != cs->caller) + stats->n_nonrec_calls++; + if (cs->maybe_hot_p ()) stats->n_hot_calls ++; } @@ -4202,19 +4230,22 @@ get_next_cgraph_edge_clone (struct cgraph_edge *cs) /* Given VAL that is intended for DEST, iterate over all its sources and if any of them is viable and hot, return true. In that case, for those that still - hold, add their edge frequency and their number into *FREQUENCY and - *CALLER_COUNT respectively. */ + hold, add their edge frequency and their number and cumulative profile + counts of self-ecursive and other edges into *FREQUENCY, *CALLER_COUNT, + REC_COUNT_SUM and NONREC_COUNT_SUM respectively. 
*/ template <typename valtype> static bool get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest, - sreal *freq_sum, profile_count *count_sum, - int *caller_count) + sreal *freq_sum, int *caller_count, + profile_count *rec_count_sum, + profile_count *nonrec_count_sum) { ipcp_value_source<valtype> *src; sreal freq = 0; int count = 0; - profile_count cnt = profile_count::zero (); + profile_count rec_cnt = profile_count::zero (); + profile_count nonrec_cnt = profile_count::zero (); bool hot = false; bool non_self_recursive = false; @@ -4227,11 +4258,15 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest, { count++; freq += cs->sreal_frequency (); - if (cs->count.ipa ().initialized_p ()) - cnt += cs->count.ipa (); hot |= cs->maybe_hot_p (); if (cs->caller != dest) - non_self_recursive = true; + { + non_self_recursive = true; + if (cs->count.ipa ().initialized_p ()) + rec_cnt += cs->count.ipa (); + } + else if (cs->count.ipa ().initialized_p ()) + nonrec_cnt += cs->count.ipa (); } cs = get_next_cgraph_edge_clone (cs); } @@ -4243,8 +4278,9 @@ get_info_about_necessary_edges (ipcp_value<valtype> *val, cgraph_node *dest, return false; *freq_sum = freq; - *count_sum = cnt; *caller_count = count; + *rec_count_sum = rec_cnt; + *nonrec_count_sum = nonrec_cnt; if (!hot && ipa_node_params_sum->get (dest)->node_within_scc) { @@ -4349,112 +4385,408 @@ get_replacement_map (class ipa_node_params *info, tree value, int parm_num, return replace_map; } -/* Dump new profiling counts */ +/* Dump new profiling counts of NODE. SPEC is true when NODE is a specialzied + one, otherwise it will be referred to as the original node. */ static void -dump_profile_updates (struct cgraph_node *orig_node, - struct cgraph_node *new_node) +dump_profile_updates (cgraph_node *node, bool spec) { - struct cgraph_edge *cs; + if (spec) + fprintf (dump_file, " setting count of the specialized node %s to ", + node->dump_name ()); + else + fprintf (dump_file, " setting count of the original node %s to ", + node->dump_name ()); - fprintf (dump_file, " setting count of the specialized node to "); - new_node->count.dump (dump_file); + node->count.dump (dump_file); fprintf (dump_file, "\n"); - for (cs = new_node->callees; cs; cs = cs->next_callee) + for (cgraph_edge *cs = node->callees; cs; cs = cs->next_callee) { - fprintf (dump_file, " edge to %s has count ", - cs->callee->dump_name ()); - cs->count.dump (dump_file); - fprintf (dump_file, "\n"); - } - - fprintf (dump_file, " setting count of the original node to "); - orig_node->count.dump (dump_file); - fprintf (dump_file, "\n"); - for (cs = orig_node->callees; cs; cs = cs->next_callee) - { - fprintf (dump_file, " edge to %s is left with ", + fprintf (dump_file, " edge to %s has count ", cs->callee->dump_name ()); cs->count.dump (dump_file); fprintf (dump_file, "\n"); } } +/* With partial train run we do not want to assume that original's count is + zero whenever we redurect all executed edges to clone. Simply drop profile + to local one in this case. In eany case, return the new value. ORIG_NODE + is the original node and its count has not been updaed yet. 
*/ + +profile_count +lenient_count_portion_handling (profile_count remainder, cgraph_node *orig_node) +{ + if (remainder.ipa_p () && !remainder.ipa ().nonzero_p () + && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p () + && opt_for_fn (orig_node->decl, flag_profile_partial_training)) + remainder = remainder.guessed_local (); + + return remainder; +} + +/* Structure to sum counts coming from nodes other than the original node and + its clones. */ + +struct gather_other_count_struct +{ + cgraph_node *orig; + profile_count other_count; +}; + +/* Worker callback of call_for_symbol_thunks_and_aliases summing the number of + counts that come from non-self-recursive calls.. */ + +static bool +gather_count_of_non_rec_edges (cgraph_node *node, void *data) +{ + gather_other_count_struct *desc = (gather_other_count_struct *) data; + for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller) + if (cs->caller != desc->orig && cs->caller->clone_of != desc->orig) + desc->other_count += cs->count.ipa (); + return false; +} + +/* Structure to help analyze if we need to boost counts of some clones of some + non-recursive edges to match the new callee count. */ + +struct desc_incoming_count_struct +{ + cgraph_node *orig; + hash_set <cgraph_edge *> *processed_edges; + profile_count count; + unsigned unproc_orig_rec_edges; +}; + +/* Go over edges calling NODE and its thunks and gather information about + incoming counts so that we know if we need to make any adjustments. */ + +static void +analyze_clone_icoming_counts (cgraph_node *node, + desc_incoming_count_struct *desc) +{ + for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller) + if (cs->caller->thunk) + { + analyze_clone_icoming_counts (cs->caller, desc); + continue; + } + else + { + if (cs->count.initialized_p ()) + desc->count += cs->count.ipa (); + if (!desc->processed_edges->contains (cs) + && cs->caller->clone_of == desc->orig) + desc->unproc_orig_rec_edges++; + } +} + +/* If caller edge counts of a clone created for a self-recursive arithmetic jump + function must be adjusted, do so. NODE is the node or its thunk. */ + +static void +adjust_clone_incoming_counts (cgraph_node *node, + desc_incoming_count_struct *desc) +{ + for (cgraph_edge *cs = node->callers; cs; cs = cs->next_caller) + if (cs->caller->thunk) + { + adjust_clone_incoming_counts (cs->caller, desc); + profile_count sum = profile_count::zero (); + for (cgraph_edge *e = cs->caller->callers; e; e = e->next_caller) + if (e->count.initialized_p ()) + sum += e->count.ipa (); + cs->count = cs->count.combine_with_ipa_count (sum); + } + else if (!desc->processed_edges->contains (cs) + && cs->caller->clone_of == desc->orig) + { + cs->count += desc->count; + if (dump_file) + { + fprintf (dump_file, " Adjusted count of an incoming edge of " + "a clone %s -> %s to ", cs->caller->dump_name (), + cs->callee->dump_name ()); + cs->count.dump (dump_file); + fprintf (dump_file, "\n"); + } + } +} + +/* When ORIG_NODE has been cloned for values which have been generated fora + self-recursive call as a result of an arithmetic pass-through + jump-functions, adjust its count together with counts of all such clones in + SELF_GEN_CLONES which also at this point contains ORIG_NODE itself. + + The function sums the counts of the original node and all its clones that + cannot be attributed to a specific clone because it comes from a + non-recursive edge. 
This sum is then evenly divided between the clones and + on top of that each one gets all the counts which can be attributed directly + to it. */ + +static void +update_counts_for_self_gen_clones (cgraph_node *orig_node, + const vec<cgraph_node *> &self_gen_clones) +{ + profile_count redist_sum = orig_node->count.ipa (); + if (!(redist_sum > profile_count::zero ())) + return; + + if (dump_file) + fprintf (dump_file, " Updating profile of self recursive clone " + "series\n"); + + gather_other_count_struct gocs; + gocs.orig = orig_node; + gocs.other_count = profile_count::zero (); + + auto_vec <profile_count, 8> other_edges_count; + for (cgraph_node *n : self_gen_clones) + { + gocs.other_count = profile_count::zero (); + n->call_for_symbol_thunks_and_aliases (gather_count_of_non_rec_edges, + &gocs, false); + other_edges_count.safe_push (gocs.other_count); + redist_sum -= gocs.other_count; + } + + hash_set<cgraph_edge *> processed_edges; + unsigned i = 0; + for (cgraph_node *n : self_gen_clones) + { + profile_count orig_count = n->count; + profile_count new_count + = (redist_sum.apply_scale (1, self_gen_clones.length ()) + + other_edges_count[i]); + new_count = lenient_count_portion_handling (new_count, orig_node); + n->count = new_count; + profile_count::adjust_for_ipa_scaling (&new_count, &orig_count); + for (cgraph_edge *cs = n->callees; cs; cs = cs->next_callee) + { + cs->count = cs->count.apply_scale (new_count, orig_count); + processed_edges.add (cs); + } + for (cgraph_edge *cs = n->indirect_calls; cs; cs = cs->next_callee) + cs->count = cs->count.apply_scale (new_count, orig_count); + + i++; + } + + /* There are still going to be edges to ORIG_NODE that have one or more + clones from another clone in SELF_GEN_CLONES to ORIG_NODE and which we + scaled by the same amount, which means that the total incoming sum of + counts to ORIG_NODE will be too high, scale such edges back. */ + for (cgraph_edge *cs = orig_node->callees; cs; cs = cs->next_callee) + { + if (cs->callee->ultimate_alias_target () == orig_node) + { + unsigned den = 0; + for (cgraph_edge *e = cs; e; e = get_next_cgraph_edge_clone (e)) + if (e->callee->ultimate_alias_target () == orig_node + && processed_edges.contains (e)) + den++; + if (den > 0) + for (cgraph_edge *e = cs; e; e = get_next_cgraph_edge_clone (e)) + if (e->callee->ultimate_alias_target () == orig_node + && processed_edges.contains (e)) + e->count = e->count.apply_scale (1, den); + } + } + + /* Edges from the seeds of the valus generated for arithmetic jump-functions + along self-recursive edges are likely to have fairly low count and so + edges from them to nodes in the self_gen_clones do not correspond to the + artificially distributed count of the nodes, the total sum of incoming + edges to some clones might be too low. Detect this situation and correct + it. 
*/ + for (cgraph_node *n : self_gen_clones) + { + if (!(n->count.ipa () > profile_count::zero ())) + continue; + + desc_incoming_count_struct desc; + desc.orig = orig_node; + desc.processed_edges = &processed_edges; + desc.count = profile_count::zero (); + desc.unproc_orig_rec_edges = 0; + analyze_clone_icoming_counts (n, &desc); + + if (n->count.differs_from_p (desc.count)) + { + if (n->count > desc.count + && desc.unproc_orig_rec_edges > 0) + { + desc.count = n->count - desc.count; + desc.count + = desc.count.apply_scale (1, desc.unproc_orig_rec_edges); + adjust_clone_incoming_counts (n, &desc); + } + else if (dump_file) + fprintf (dump_file, + " Unable to fix up incoming counts for %s.\n", + n->dump_name ()); + } + } + + if (dump_file) + for (cgraph_node *n : self_gen_clones) + dump_profile_updates (n, n != orig_node); + return; +} + /* After a specialized NEW_NODE version of ORIG_NODE has been created, update - their profile information to reflect this. */ + their profile information to reflect this. This function should not be used + for clones generated for arithmetic pass-through jump functions on a + self-recursive call graph edge, that situation is handled by + update_counts_for_self_gen_clones. */ static void update_profiling_info (struct cgraph_node *orig_node, struct cgraph_node *new_node) { - struct cgraph_edge *cs; struct caller_statistics stats; - profile_count new_sum, orig_sum; - profile_count remainder, orig_node_count = orig_node->count; - profile_count orig_new_node_count = new_node->count; + profile_count new_sum; + profile_count remainder, orig_node_count = orig_node->count.ipa (); - if (!(orig_node_count.ipa () > profile_count::zero ())) + if (!(orig_node_count > profile_count::zero ())) return; - init_caller_stats (&stats); - orig_node->call_for_symbol_thunks_and_aliases (gather_caller_stats, &stats, - false); - orig_sum = stats.count_sum; - init_caller_stats (&stats); + if (dump_file) + { + fprintf (dump_file, " Updating profile from original count: "); + orig_node_count.dump (dump_file); + fprintf (dump_file, "\n"); + } + + init_caller_stats (&stats, new_node); new_node->call_for_symbol_thunks_and_aliases (gather_caller_stats, &stats, false); new_sum = stats.count_sum; - if (orig_node_count < orig_sum + new_sum) - { - if (dump_file) - { - fprintf (dump_file, " Problem: node %s has too low count ", - orig_node->dump_name ()); - orig_node_count.dump (dump_file); - fprintf (dump_file, "while the sum of incoming count is "); - (orig_sum + new_sum).dump (dump_file); - fprintf (dump_file, "\n"); - } - orig_node_count = (orig_sum + new_sum).apply_scale (12, 10); - if (dump_file) + if (new_sum > orig_node_count) + { + /* TODO: Perhaps this should be gcc_unreachable ()? */ + remainder = profile_count::zero ().guessed_local (); + } + else if (stats.rec_count_sum.nonzero_p ()) + { + int new_nonrec_calls = stats.n_nonrec_calls; + /* There are self-recursive edges which are likely to bring in the + majority of calls but which we must divide in between the original and + new node. 
*/ + init_caller_stats (&stats, orig_node); + orig_node->call_for_symbol_thunks_and_aliases (gather_caller_stats, + &stats, false); + if (orig_node->local) { - fprintf (dump_file, " proceeding by pretending it was "); - orig_node_count.dump (dump_file); - fprintf (dump_file, "\n"); + /* If ORIG_NODE is local, divide all "unexplained" counts roughly + proportionally to sums of counts of non-recursive calls, unless + the orig_node does not seem to have any counts from elsewhere, + then assume it is dead. */ + if (!stats.count_sum.nonzero_p ()) + { + if (dump_file) + fprintf (dump_file, " The original is local and only " + "has self-recursive edges with counts from non-dead " + "callers, assuming it is dead too.\n"); + new_sum = orig_node_count; + if (opt_for_fn (orig_node->decl, flag_profile_partial_training)) + remainder = profile_count::zero ().guessed_local (); + else + { + /* The NEW_NODE count and counts of all its outgoing edges + are still unmodified copies of ORIG_NODE's. Just clear + the latter and bail out. */ + orig_node->count = profile_count::zero ().guessed_local (); + for (cgraph_edge *cs = orig_node->callees; + cs; + cs = cs->next_callee) + cs->count = profile_count::zero ().guessed_local (); + for (cgraph_edge *cs = orig_node->indirect_calls; + cs; + cs = cs->next_callee) + cs->count = profile_count::zero ().guessed_local (); + return; + } + } + else + { + profile_count unexp = orig_node_count - stats.count_sum; + /* We put rather arbitrary limits on how many counts we claim + because the number of non-self-recursive incoming count is + only a rough guideline and there are cases (such as mcf) where + using it blindly just takes too many. And if clattices are + considered in the opposite order we could also take too + few. */ + int limit_den = 2 * (stats.n_nonrec_calls + new_nonrec_calls); + profile_count new_part + = MAX(MIN (unexp.apply_scale (new_sum, + new_sum + stats.count_sum), + unexp.apply_scale (limit_den - new_nonrec_calls, + limit_den)), + unexp.apply_scale (new_nonrec_calls, limit_den)); + + if (dump_file) + { + fprintf (dump_file, " Claiming "); + new_part.dump (dump_file); + fprintf (dump_file, " of unexplained "); + unexp.dump (dump_file); + fprintf (dump_file, " counts because of self-recursive " + "calls\n"); + } + new_sum += new_part; + unexp -= new_part; + remainder = lenient_count_portion_handling (stats.count_sum + + unexp, + orig_node); + } + } + else + { + /* If ORIG_NODE is not local, take a wild guess and simply claim a + half of all unexplained counts. */ + profile_count half + = (orig_node_count - stats.count_sum).apply_scale (1, 2); + if (dump_file) + { + fprintf (dump_file, " Claiming half of unexplained " + "counts ("); + half.dump (dump_file); + fprintf (dump_file, ") because of relf-recursive calls\n"); + } + new_sum += half; + remainder = lenient_count_portion_handling (stats.count_sum + half, + orig_node); } } - - remainder = orig_node_count.combine_with_ipa_count (orig_node_count.ipa () - - new_sum.ipa ()); - - /* With partial train run we do not want to assume that original's - count is zero whenever we redurect all executed edges to clone. - Simply drop profile to local one in this case. 
-
-  remainder = orig_node_count.combine_with_ipa_count (orig_node_count.ipa ()
-                                                       - new_sum.ipa ());
-
-  /* With partial train run we do not want to assume that original's
-     count is zero whenever we redurect all executed edges to clone.
-     Simply drop profile to local one in this case.  */
-  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
-      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
-      && flag_profile_partial_training)
-    remainder = remainder.guessed_local ();
+  else
+    remainder = lenient_count_portion_handling (orig_node_count - new_sum,
+                                                orig_node);

   new_sum = orig_node_count.combine_with_ipa_count (new_sum);
   new_node->count = new_sum;
   orig_node->count = remainder;

+  profile_count orig_new_node_count = orig_node_count;
   profile_count::adjust_for_ipa_scaling (&new_sum, &orig_new_node_count);
-  for (cs = new_node->callees; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = new_node->callees; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (new_sum, orig_new_node_count);
-  for (cs = new_node->indirect_calls; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = new_node->indirect_calls; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (new_sum, orig_new_node_count);

   profile_count::adjust_for_ipa_scaling (&remainder, &orig_node_count);
-  for (cs = orig_node->callees; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = orig_node->callees; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (remainder, orig_node_count);
-  for (cs = orig_node->indirect_calls; cs; cs = cs->next_callee)
+  for (cgraph_edge *cs = orig_node->indirect_calls; cs; cs = cs->next_callee)
     cs->count = cs->count.apply_scale (remainder, orig_node_count);

   if (dump_file)
-    dump_profile_updates (orig_node, new_node);
+    {
+      dump_profile_updates (new_node, true);
+      dump_profile_updates (orig_node, false);
+    }
 }

 /* Update the respective profile of specialized NEW_NODE and the original
@@ -4495,7 +4827,10 @@ update_specialized_profile (struct cgraph_node *new_node,
     }

   if (dump_file)
-    dump_profile_updates (orig_node, new_node);
+    {
+      dump_profile_updates (new_node, true);
+      dump_profile_updates (orig_node, false);
+    }
 }

 static void
 adjust_references_in_caller (cgraph_edge *cs,
@@ -4795,8 +5130,7 @@ create_specialized_node (struct cgraph_node *node,
       if (aggvals)
         ipa_dump_agg_replacement_values (dump_file, aggvals);
     }
-  ipa_check_create_node_params ();
-  update_profiling_info (node, new_node);
+
   new_info = ipa_node_params_sum->get (new_node);
   new_info->ipcp_orig_node = node;
   new_node->ipcp_clone = true;
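As a side note on the outgoing-edge rescaling in update_profiling_info quoted above: once the original node keeps remainder and the clone keeps new_sum, each node's outgoing edge counts are simply multiplied by its new count over the old count.  A rough numeric sketch with made-up values, ignoring profile_count quality and rounding details:

/* Hypothetical sketch of proportional edge-count scaling, not GCC code.  */
#include <stdio.h>

int
main (void)
{
  long orig_count = 1000;  /* node count before the split */
  long new_sum = 700;      /* portion that ends up on the clone */
  long remainder = orig_count - new_sum;
  long edges[] = { 600, 300, 100 };  /* outgoing edge counts before the split */

  for (int i = 0; i < 3; i++)
    printf ("edge %d: %ld stay on the original, %ld go to the clone's copy\n",
            i, edges[i] * remainder / orig_count,
            edges[i] * new_sum / orig_count);
  return 0;
}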
@@ -5621,17 +5955,20 @@ ipcp_val_agg_replacement_ok_p (ipa_agg_replacement_value *,
 /* Decide whether to create a special version of NODE for value VAL of
    parameter at the given INDEX.  If OFFSET is -1, the value is for the
    parameter itself, otherwise it is stored at the given OFFSET of the
-   parameter.  AVALS describes the other already known values.  */
+   parameter.  AVALS describes the other already known values.  SELF_GEN_CLONES
+   is a vector which contains clones created for self-recursive calls with an
+   arithmetic pass-through jump function.  */

 template <typename valtype>
 static bool
 decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
-                    ipcp_value<valtype> *val, ipa_auto_call_arg_values *avals)
+                    ipcp_value<valtype> *val, ipa_auto_call_arg_values *avals,
+                    vec<cgraph_node *> *self_gen_clones)
 {
   struct ipa_agg_replacement_value *aggvals;
   int caller_count;
   sreal freq_sum;
-  profile_count count_sum;
+  profile_count count_sum, rec_count_sum;
   vec<cgraph_edge *> callers;

   if (val->spec_node)
@@ -5647,13 +5984,31 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
                 val->local_size_cost + overall_size);
       return false;
     }
-  else if (!get_info_about_necessary_edges (val, node, &freq_sum, &count_sum,
-                                            &caller_count))
+  else if (!get_info_about_necessary_edges (val, node, &freq_sum, &caller_count,
+                                            &rec_count_sum, &count_sum))
     return false;

   if (!dbg_cnt (ipa_cp_values))
     return false;

+  if (val->self_recursion_generated_p ())
+    {
+      /* The edge counts in this case might not have been adjusted yet.
+         Nevertheless, even if they were, it would be only a guesswork which we
+         can do now.  The recursive part of the counts can be derived from the
+         count of the original node anyway.  */
+      if (node->count.ipa ().nonzero_p ())
+        {
+          unsigned dem = self_gen_clones->length () + 1;
+          rec_count_sum = node->count.ipa ().apply_scale (1, dem);
+        }
+      else
+        rec_count_sum = profile_count::zero ();
+    }
+
+  /* get_info_about_necessary_edges only sums up ipa counts.  */
+  count_sum += rec_count_sum;
+
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
       fprintf (dump_file, " - considering value ");
@@ -5694,6 +6049,12 @@ decide_about_value (struct cgraph_node *node, int index, HOST_WIDE_INT offset,
                                          offset, val->value));
   val->spec_node = create_specialized_node (node, known_csts, known_contexts,
                                             aggvals, callers);
+
+  if (val->self_recursion_generated_p ())
+    self_gen_clones->safe_push (val->spec_node);
+  else
+    update_profiling_info (node, val->spec_node);
+
   callers.release ();
   overall_size += val->local_size_cost;
   if (dump_file && (dump_flags & TDF_DETAILS))
@@ -5722,6 +6083,7 @@ decide_whether_version_node (struct cgraph_node *node)
     fprintf (dump_file, "\nEvaluating opportunities for %s.\n",
              node->dump_name ());

+  auto_vec <cgraph_node *, 9> self_gen_clones;
   ipa_auto_call_arg_values avals;
   gather_context_independent_values (info, &avals, false, NULL);

@@ -5736,7 +6098,8 @@ decide_whether_version_node (struct cgraph_node *node)
         {
           ipcp_value<tree> *val;
           for (val = lat->values; val; val = val->next)
-            ret |= decide_about_value (node, i, -1, val, &avals);
+            ret |= decide_about_value (node, i, -1, val, &avals,
+                                       &self_gen_clones);
         }

       if (!plats->aggs_bottom)
@@ -5750,7 +6113,8 @@ decide_whether_version_node (struct cgraph_node *node)
                 && (plats->aggs_contain_variable
                     || !aglat->is_single_const ()))
               for (val = aglat->values; val; val = val->next)
-                ret |= decide_about_value (node, i, aglat->offset, val, &avals);
+                ret |= decide_about_value (node, i, aglat->offset, val, &avals,
+                                           &self_gen_clones);
         }

       if (!ctxlat->bottom
           && avals.m_known_contexts[i].useless_p ())
         {
           ipcp_value<ipa_polymorphic_call_context> *val;
           for (val = ctxlat->values; val; val = val->next)
-            ret |= decide_about_value (node, i, -1, val, &avals);
+            ret |= decide_about_value (node, i, -1, val, &avals,
+                                       &self_gen_clones);
         }
     }

+  if (!self_gen_clones.is_empty ())
+    {
+      self_gen_clones.safe_push (node);
+      update_counts_for_self_gen_clones (node, self_gen_clones);
+    }
+
   if (info->do_clone_for_all_contexts)
     {
       if (!dbg_cnt (ipa_cp_values))
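To illustrate the estimate used above for self-recursion-generated values: while candidate values are being evaluated, each such value is simply credited with an equal slice of the node's own IPA count, one part in (clones created so far + 1), because the recursive edge counts are not meaningful at that point; the actual redistribution is then done once for the whole group by update_counts_for_self_gen_clones.  A rough arithmetic sketch with made-up numbers, not taken from the patch:

/* Hypothetical sketch of the per-value recursive count estimate,
   not GCC code.  */
#include <stdio.h>

int
main (void)
{
  long node_count = 9000;      /* IPA count of the node being specialized */
  unsigned clones_so_far = 2;  /* self-gen clones already in the vector */

  unsigned dem = clones_so_far + 1;
  long rec_estimate = node_count / dem;  /* apply_scale (1, dem) analogue */

  printf ("this value is credited with %ld of the node's %ld counts\n",
          rec_estimate, node_count);
  return 0;
}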