Message ID: 20191204215228.fuuywt3ef3uiqswh@kam.mff.cuni.cz
State: New
Series: Add -fpartial-profile-training
On Wed, Dec 4, 2019 at 10:52 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hi,
> with recent fixes to profile updating I noticed that we get more regressions
> compared to GCC 9 in Firefox testing.  This is because the Firefox train run
> does not cover all the benchmarks, and GCC 9, thanks to updating bugs,
> sometimes optimized code for speed even if it was not trained.
>
> While in general one should have a reasonable train run, in some cases it is
> not practical to do so.  For example, the skia library has optimized vector
> code for different ISAs, and thus Firefox renders quickly only if it is
> trained on the same CPU it runs on.
>
> This patch adds the flag -fprofile-partial-training, which makes GCC optimize
> untrained functions as if -fprofile-use was not given.  This nullifies the
> code size improvements of FDO, but can be used in cases where full training
> is not quite possible (and one can use it only on portions of programs).
>
> Previously the only good answer was to disable profiling for a given
> function, but that needs to be done quite precisely and in general is hard
> to arrange.
>
> The patch works by
> 1) not setting PROFILE_READ for functions with entry count 0
> 2) making the inliner and ipa-cp drop the profile to a local one when all
>    trained executions are redirected to clones
> 3) reducing the quality of branch probabilities of branches leading to
>    never-executed regions to GUESSED.  This is necessary to prevent GCC
>    from propagating things back.
>
> Bootstrapped/regtested x86_64-linux.  I plan to commit it tomorrow if there
> are no complaints.  Feedback is welcome!

I wonder if the behavior shouldn't be the default?  The only thing we lose
is failing to notice really cold calls (error paths) in programs?

Richard.

> Honza
>
> 	* cgraphclones.c (localize_profile): New function.
> 	(cgraph_node::create_clone): Use it for partial profiles.
> 	* common.opt (fprofile-partial-training): New flag.
> 	* doc/invoke.texi (-fprofile-partial-training): Document.
> 	* ipa-cp.c (update_profiling_info): For partial profiles do not
> 	set function profile to zero.
> 	* profile.c (compute_branch_probabilities): With partial profile
> 	watch if edge count is zero and turn all probabilities to guessed.
> 	(compute_branch_probabilities): For partial profiles do not apply
> 	profile when entry count is zero.
> 	* tree-profile.c (tree_profiling): Only do value_profile_transformations
> 	when profile is read.
>
> Index: cgraphclones.c
> ===================================================================
> --- cgraphclones.c	(revision 278944)
> +++ cgraphclones.c	(working copy)
> @@ -307,6 +307,22 @@ dump_callgraph_transformation (const cgr
>      }
>  }
>
> +/* Turn profile of N to local profile.   */
> +
> +static void
> +localize_profile (cgraph_node *n)
> +{
> +  n->count = n->count.guessed_local ();
> +  for (cgraph_edge *e = n->callees; e; e=e->next_callee)
> +    {
> +      e->count = e->count.guessed_local ();
> +      if (!e->inline_failed)
> +	localize_profile (e->callee);
> +    }
> +  for (cgraph_edge *e = n->indirect_calls; e; e=e->next_callee)
> +    e->count = e->count.guessed_local ();
> +}
> +
>  /* Create node representing clone of N executed COUNT times.  Decrease
>     the execution counts from original node too.
>     The new clone will have decl set to DECL that may or may not be the same
> @@ -340,6 +356,7 @@ cgraph_node::create_clone (tree new_decl
>    cgraph_edge *e;
>    unsigned i;
>    profile_count old_count = count;
> +  bool nonzero = count.ipa ().nonzero_p ();
>
>    if (new_inlined_to)
>      dump_callgraph_transformation (this, new_inlined_to, "inlining to");
> @@ -426,6 +446,15 @@ cgraph_node::create_clone (tree new_decl
>
>    if (call_duplication_hook)
>      symtab->call_cgraph_duplication_hooks (this, new_node);
> +  /* With partial train run we do not want to assume that original's
> +     count is zero whenever we redurect all executed edges to clone.
> +     Simply drop profile to local one in this case.  */
> +  if (update_original
> +      && opt_for_fn (decl, flag_partial_profile_training)
> +      && nonzero
> +      && count.ipa_p ()
> +      && !count.ipa ().nonzero_p ())
> +    localize_profile (this);
>
>    if (!new_inlined_to)
>      dump_callgraph_transformation (this, new_node, suffix);
> Index: common.opt
> ===================================================================
> --- common.opt	(revision 278944)
> +++ common.opt	(working copy)
> @@ -2160,6 +2160,10 @@ fprofile-generate=
>  Common Joined RejectNegative
>  Enable common options for generating profile info for profile feedback directed optimizations, and set -fprofile-dir=.
>
> +fprofile-partial-training
> +Common Report Var(flag_partial_profile_training) Optimization
> +Do not assume that functions never executed during the train run are cold
> +
>  fprofile-use
>  Common Var(flag_profile_use)
>  Enable common options for performing profile feedback directed optimizations.
> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi	(revision 278944)
> +++ doc/invoke.texi	(working copy)
> @@ -453,8 +453,8 @@ Objective-C and Objective-C++ Dialects}.
>  -fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
>  -fprefetch-loop-arrays @gol
>  -fprofile-correction @gol
> --fprofile-use -fprofile-use=@var{path} -fprofile-values @gol
> --fprofile-reorder-functions @gol
> +-fprofile-use -fprofile-use=@var{path} -fprofile-partial-training @gol
> +-fprofile-values -fprofile-reorder-functions @gol
>  -freciprocal-math -free -frename-registers -freorder-blocks @gol
>  -freorder-blocks-algorithm=@var{algorithm} @gol
>  -freorder-blocks-and-partition -freorder-functions @gol
> @@ -10634,6 +10634,17 @@ default, GCC emits an error message when
>
>  This option is enabled by @option{-fauto-profile}.
>
> +@item -fprofile-partial-training
> +@opindex fprofile-use
> +With @code{-fprofile-use} all portions of programs not executed during train
> +run are optimized agressively for size rather than speed.  In some cases it is not
> +practical to train all possible paths hot paths in the program. (For example
> +program may contain functions specific for a given hardware and trianing may
> +not cover all hardware configurations program is run on.)  With
> +@code{-fprofile-partial-training} profile feedback will be ignored for all
> +functions not executed during the train run leading them to be optimized as
> +if they were compiled without profile feedback.
> +
>  @item -fprofile-use
>  @itemx -fprofile-use=@var{path}
>  @opindex fprofile-use
> Index: ipa-cp.c
> ===================================================================
> --- ipa-cp.c	(revision 278944)
> +++ ipa-cp.c	(working copy)
> @@ -4295,6 +4295,15 @@ update_profiling_info (struct cgraph_nod
>
>    remainder = orig_node_count.combine_with_ipa_count (orig_node_count.ipa ()
>  						      - new_sum.ipa ());
> +
> +  /* With partial train run we do not want to assume that original's
> +     count is zero whenever we redurect all executed edges to clone.
> +     Simply drop profile to local one in this case.  */
> +  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
> +      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
> +      && flag_partial_profile_training)
> +    remainder = remainder.guessed_local ();
> +
>    new_sum = orig_node_count.combine_with_ipa_count (new_sum);
>    new_node->count = new_sum;
>    orig_node->count = remainder;
> Index: profile.c
> ===================================================================
> --- profile.c	(revision 278944)
> +++ profile.c	(working copy)
> @@ -635,9 +635,20 @@ compute_branch_probabilities (unsigned c
>  	    }
>  	  if (bb_gcov_count (bb))
>  	    {
> +	      bool set_to_guessed = false;
>  	      FOR_EACH_EDGE (e, ei, bb->succs)
> -		e->probability = profile_probability::probability_in_gcov_type
> -		    (edge_gcov_count (e), bb_gcov_count (bb));
> +		{
> +		  bool prev_never = e->probability == profile_probability::never ();
> +		  e->probability = profile_probability::probability_in_gcov_type
> +		      (edge_gcov_count (e), bb_gcov_count (bb));
> +		  if (e->probability == profile_probability::never ()
> +		      && !prev_never
> +		      && flag_partial_profile_training)
> +		    set_to_guessed = true;
> +		}
> +	      if (set_to_guessed)
> +		FOR_EACH_EDGE (e, ei, bb->succs)
> +		  e->probability = e->probability.guessed ();
>  	      if (bb->index >= NUM_FIXED_BLOCKS
>  		  && block_ends_with_condjump_p (bb)
>  		  && EDGE_COUNT (bb->succs) >= 2)
> @@ -697,17 +708,23 @@ compute_branch_probabilities (unsigned c
>  	}
>      }
>
> -  if (exec_counts)
> +  if (exec_counts
> +      && (bb_gcov_count (ENTRY_BLOCK_PTR_FOR_FN (cfun))
> +	  || !flag_partial_profile_training))
>      profile_status_for_fn (cfun) = PROFILE_READ;
>
>    /* If we have real data, use them!  */
>    if (bb_gcov_count (ENTRY_BLOCK_PTR_FOR_FN (cfun))
>        || !flag_guess_branch_prob)
>      FOR_ALL_BB_FN (bb, cfun)
> -      bb->count = profile_count::from_gcov_type (bb_gcov_count (bb));
> +      if (bb_gcov_count (bb) || !flag_partial_profile_training)
> +	bb->count = profile_count::from_gcov_type (bb_gcov_count (bb));
> +      else
> +	bb->count = profile_count::guessed_zero ();
>    /* If function was not trained, preserve local estimates including statically
>       determined zero counts.  */
> -  else if (profile_status_for_fn (cfun) == PROFILE_READ)
> +  else if (profile_status_for_fn (cfun) == PROFILE_READ
> +	   && !flag_partial_profile_training)
>      FOR_ALL_BB_FN (bb, cfun)
>        if (!(bb->count == profile_count::zero ()))
>  	bb->count = bb->count.global0 ();
> @@ -1417,7 +1434,7 @@ branch_prob (bool thunk)
>    /* At this moment we have precise loop iteration count estimates.
>       Record them to loop structure before the profile gets out of date.  */
>    FOR_EACH_LOOP (loop, 0)
> -    if (loop->header->count > 0)
> +    if (loop->header->count > 0 && loop->header->count.reliable_p ())
>        {
>  	gcov_type nit = expected_loop_iterations_unbounded (loop);
>  	widest_int bound = gcov_type_to_wide_int (nit);
> Index: tree-profile.c
> ===================================================================
> --- tree-profile.c	(revision 278944)
> +++ tree-profile.c	(working copy)
> @@ -785,7 +785,8 @@ tree_profiling (void)
>    if (flag_branch_probabilities
>        && !thunk
>        && flag_profile_values
> -      && flag_value_profile_transformations)
> +      && flag_value_profile_transformations
> +      && profile_status_for_fn (cfun) == PROFILE_READ)
>      gimple_value_profile_transformations ();
>
>    /* The above could hose dominator info.  Currently there is
On 12/5/19 1:30 PM, Richard Biener wrote:
> I wonder if the behavior shouldn't be the default?  The only thing we lose
> is failing to notice really cold calls (error paths) in programs?

I would also consider enabling that by default.

I'm sending a language correction for the option documentation:

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 26a444ac7b2..130529dece1 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -10637,10 +10637,10 @@ This option is enabled by @option{-fauto-profile}.
 @item -fprofile-partial-training
 @opindex fprofile-use
 With @code{-fprofile-use} all portions of programs not executed during train
-run are optimized agressively for size rather than speed. In some cases it is not
+run are optimized aggressively for size rather than speed. In some cases it is not
 practical to train all possible paths hot paths in the program. (For example
-program may contain functions specific for a given hardware and trianing may
-not cover all hardware configurations program is run on.) With
+a program may contain functions specific for a given hardware and training may
+not cover all hardware configurations program can run on). With
 @code{-fprofile-partial-training} profile feedback will be ignored for all
 functions not executed during the train run leading them to be optimized as
 if they were compiled without profile feedback.

Martin
On Thu, Dec 5, 2019 at 1:41 PM Martin Liška <mliska@suse.cz> wrote:
>
> On 12/5/19 1:30 PM, Richard Biener wrote:
> > I wonder if the behavior shouldn't be the default?  The only thing we lose
> > is failing to notice really cold calls (error paths) in programs?
>
> I would also consider enabling that by default.

So I'd add the "reverse" option -fconsider-unprofiled-functions-cold or so.

Your proposed change makes functions not executed during profiling behave
as if the function were built without -fprofile-generate for training but
with -fprofile-use later?  Documentation should somehow relate behavior
to that.

Richard.

> I'm sending a language correction for the option documentation:
>
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 26a444ac7b2..130529dece1 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -10637,10 +10637,10 @@ This option is enabled by @option{-fauto-profile}.
>  @item -fprofile-partial-training
>  @opindex fprofile-use
>  With @code{-fprofile-use} all portions of programs not executed during train
> -run are optimized agressively for size rather than speed. In some cases it is not
> +run are optimized aggressively for size rather than speed. In some cases it is not
>  practical to train all possible paths hot paths in the program. (For example
> -program may contain functions specific for a given hardware and trianing may
> -not cover all hardware configurations program is run on.) With
> +a program may contain functions specific for a given hardware and training may
> +not cover all hardware configurations program can run on). With
>  @code{-fprofile-partial-training} profile feedback will be ignored for all
>  functions not executed during the train run leading them to be optimized as
>  if they were compiled without profile feedback.
>
> Martin
On Wed, Dec 04, 2019 at 10:52:28PM +0100, Jan Hubicka wrote:
> 	* cgraphclones.c (localize_profile): New function.
> 	(cgraph_node::create_clone): Use it for partial profiles.
> 	* common.opt (fprofile-partial-training): New flag.

This FAILs everywhere, with:

Running /usr/src/gcc/gcc/testsuite/gcc.misc-tests/help.exp ...
FAIL: compiler driver --help=common option(s): "^ +-.*[^:.]$" absent from output: "  -fprofile-partial-training  Do not assume that functions never executed during the train run are cold"
FAIL: compiler driver --help=optimizers option(s): "^ +-.*[^:.]$" absent from output: "  -fprofile-partial-training  Do not assume that functions never executed during the train run are cold"

Fixed thusly, tested on x86_64-linux, committed to trunk as obvious:

2019-12-06  Jakub Jelinek  <jakub@redhat.com>

	* common.opt (fprofile-partial-training): Terminate description
	with full stop.

--- gcc/common.opt.jj	2019-12-06 00:40:46.096605346 +0100
+++ gcc/common.opt	2019-12-06 01:24:22.825265282 +0100
@@ -2162,7 +2162,7 @@ Enable common options for generating pro
 
 fprofile-partial-training
 Common Report Var(flag_profile_partial_training) Optimization
-Do not assume that functions never executed during the train run are cold
+Do not assume that functions never executed during the train run are cold.
 
 fprofile-use
 Common Var(flag_profile_use)

	Jakub