
Sanitize block partitioning under -freorder-blocks-and-partition

Message ID CAAe5K+W+VEERrsaCCjCD=n40_MO2EPashDp5qnKoS8SLaSjBjQ@mail.gmail.com
State New

Commit Message

Teresa Johnson Aug. 17, 2013, 6:31 a.m. UTC
On Fri, Aug 9, 2013 at 2:02 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Fri, Aug 9, 2013 at 8:28 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> > Do we sanity check that the cold partition does not contain any blocks of
>>> > count 0?  It may be that the profile is broken enough to make partitioning
>>> > not work.
>>>
>>> Do you mean sanity check that the cold partition does not contain any
>>> blocks of count > 0? (they should all be zero) I don't think that
>>> sanity check is there, but I can try adding that.
>>
>> Thanks, let's start with this - I suppose we need to figure out if
>>  1) the reachable blocks go to the cold section because partitioning decides
>>     so even if they have non-0 count.
>
> Right, this should be easy enough to check and should hopefully never happen.
>
>>  2) the reachable blocks go to the cold section because their count was
>>     incorrectly updated to 0 by someone
>
> A sanity check should find this too. But it can happen now for various
> reasons like the comdat issue I described below. But it will be good
> to flag these and fix them.
>
>>  3) profiling gets some blocks wrong.
>
> This is the one that will be tough to fix, if the training run isn't
> representative.
>
>>>
>>> The issue with such a sanity check may be due to the later fixup I
>>> have in this patch (fixup_partitions). It is invoked after certain
>>> optimizations on the cfg that may make hot blocks previously reached
>>> by both hot and cold edges only reachable by cold blocks. These blocks
>>> are remarked cold. If the profile data hasn't been updated correctly
>>> it is possible that they would still have a non-0 count, although they
>>> are essentially cold after the cfg transformation.
>>
>> Well, or the other possibility is that the edges were updated wrongly
>> and the blocks are really cold.  We need to figure out if that happens
>> commonly enough.
>>
>> I will try to think of some artificial testcases.
>>
>>>
>>> But certainly such a sanity check should always succeed after the
>>> original partitioning.
>>>
>>> > I can think of inlining where the count gets scaled all the way down to 0.  Perhaps
>>> > the count scaling code can be modified to never round towards 0 for blocks executing
>>> > non-0 times...
>>>
>>> This reminds me of why this situation could happen. When I have been
>>> testing this on the google branch I found situations where COMDAT
>>> routines have 0 profile counts (if I remember correctly, this happens
>>> when profile-gen binary has call to out-of-line copy of COMDAT in
>>> module A, linker chooses the out-of-line copy from module B, therefore
>>> the profile data for COMDAT in module A is 0). When the COMDAT gets
>>> inlined, the 0 counts on its bbs are scaled to 0, even though the
>>> callsite is non-zero. I have a patch that I was planning to send as a
>>> follow-up that handles this case by propagating the callsite bb's
>>> count to the inlined code when it has 0 counts, scaling by the edge
>>> frequencies. I can either include that patch in this one, or send it
>>> for review separately right now. Do you want to give it a try with
>>> this one to see if it addresses the issue?
>>
>> This scenario should not happen with the LTO setup: the LTO symbol table contains
>> code before early optimization and should be identical with profiling or
>> without (modulo the new references and calls from profile code).
>>
>> But this patch seems useful as a backup solution for non-LTO, so yes, please
>> send it separately and I can try to double check that it really does not happen
>> with LTO.
>> (Actually the LTO symtab may just choose the COMDAT from the module that has counts with it.
>> It has all the info for it.  I was thinking about it a few weeks back.  It is a
>> bit hard to do - you need to verify that all references from the function are
>> the same or linking might fail if you overwrite the linker's decisions).
>
> I see, yes LTO can deal with this better since it has global
> information. In non-LTO mode (including LIPO) we have the issue.
>
> I take it gimp is built with LTO and therefore shouldn't be hitting
> this comdat issue?
>
> Let me do a couple things:

Here is some status:

> - port over my comdat inlining fix from the google branch to trunk and
> send it for review. If you or Martin could try it to see if it helps
> with function splitting to avoid the hits from the cold code that
> would be great

I have included the cleaned-up patch below. I will send it to trunk
for review after a little more extensive testing that I want to do
following the cleanup, but I am including it here in case you want to
try it out before I send it (it passes bootstrap and some small test
cases I checked manually).

> - I'll add some new sanity checking to try to detect non-zero blocks
> in the cold section, or 0 blocks reached by non-zero edges and see if
> I can flush out any problems with my tests or a profiledbootstrap or
> gimp.

I added both of these and ran into issues due to profile maintenance.
For example, there were non-zero blocks in the cold section because
pro_and_epilogue split a simple return block that was previously reached
by both hot and cold paths. The new return block that was then only
reached via the cold path did not have its count properly updated to
reflect this, and since with this patch, blocks dominated by cold
blocks are remarked cold, we ended up with a non-zero count block in
the cold section. And there were 0 count blocks reached by non-zero
edges because copyprop did not clean up edge weights after removing
some branches and blocks, leading to non-zero edge weights that had
previously targeted a branch that was removed, now targeting a 0 count
block that the removed branch always branched around.

In any case, the good news is that in the cases I looked at, the
splitting code is doing the right thing and these blocks that were
marked cold really were cold. It would be great to fix the profile
maintenance issues, but in the meantime the above sanity checks
are too aggressive.
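
(As an illustration only, here is a rough, self-contained sketch of the two
checks described above - non-zero count blocks placed in the cold partition,
and zero-count blocks reached by non-zero count edges. The block/edge structs
and the data in main() are made-up stand-ins, not the real GCC CFG types; an
actual check would walk cfun's basic blocks and edges and report into the
dump file.)

#include <stdio.h>

/* Simplified stand-ins for basic blocks and edges.  */
struct block { long count; int cold; };
struct cfg_edge { int src, dest; long count; };

/* Report (rather than assert on) the two suspicious situations:
   a block with a non-zero count placed in the cold partition, and a
   zero-count block that is the target of a non-zero count edge.  */
static void
check_partition_sanity (struct block *bbs, int n_bbs,
                        struct cfg_edge *edges, int n_edges)
{
  for (int i = 0; i < n_bbs; i++)
    if (bbs[i].cold && bbs[i].count > 0)
      printf ("block %d: count %ld but in cold partition\n", i, bbs[i].count);
  for (int i = 0; i < n_edges; i++)
    if (edges[i].count > 0 && bbs[edges[i].dest].count == 0)
      printf ("edge %d->%d: count %ld into zero-count block\n",
              edges[i].src, edges[i].dest, edges[i].count);
}

int
main (void)
{
  struct block bbs[4] = { {100, 0}, {90, 0}, {10, 1}, {0, 1} };
  struct cfg_edge edges[3] = { {0, 1, 90}, {1, 2, 10}, {1, 3, 5} };
  check_partition_sanity (bbs, 4, edges, 3);
  return 0;
}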

I think it makes sense to commit the current patch if possible, as it
is making the splitting more sane.

> - I'll try building and profiling gimp myself to see if I can
> reproduce the issue with code executing out of the cold section.

I have spent some time this week trying to get the latest gimp Martin
pointed me to configured and built, but it took awhile to track down
and configure/build all of the required versions of dependent
packages. I'm still hitting some issues trying to get it compiled, so
it may not yet be configured properly. I'll take a look again early
next week.

Teresa

patch for updating counts based on estimated frequencies to address
inlined comdats with 0 profile counts:

2013-08-16  Teresa Johnson  <tejohnson@google.com>

        * tree-inline.c (copy_bb): Compute count based on frequency.
        (copy_edges_for_bb): Ditto.
        (copy_cfg_body): Ditto.
        (copy_body): Pass down frequency.
        (expand_call_inline): Ditto.
        (tree_function_versioning): Ditto.
        * predict.c (init_and_estimate_bb_frequencies): New function.
        (rebuild_frequencies): Invoke init_and_estimate_bb_frequencies.
        * predict.h (init_and_estimate_bb_frequencies): Declare.
        * profile.c (branch_prob): Invoke init_and_estimate_bb_frequencies.
        * ipa-inline-transform.c (update_noncloned_frequencies): Scale edge
        counts.
        (clone_inlined_nodes): Compute edge count scale if needed.



>
> Thanks,
> Teresa
>
>>>
>>> Also, can you send me reproduction instructions for gimp? I don't
>>> think I need Martin's patch, but which version of gimp and what is the
>>> equivalent way for me to train it? I have some scripts to generate a
>>> similar type of instruction heat map graph that I have been using to
>>> tune partitioning and function reordering. Essentially it uses linux
>>> perf to sample on instructions_retired and then munge the data in
>>> several ways to produce various stats and graphs. One thing that has
>>> been useful has been to combine the perf data with nm output to
>>> determine which cold functions are being executed at runtime.
>>
>> Martin?
>>
>>>
>>> However, for this to tell me which split cold bbs are being executed I
>>> need to use a patch that Sri sent for review several months back that
>>> gives the split cold section its own name:
>>>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>>> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>>>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>>> (cc'ing Sri as we should probably revive this patch soon to address
>>> gdb and other issues with detecting split functions properly)
>>
>> Interesting, I used a linker script for this purpose, but that is GNU ld only...
>>
>> Honza
>>>
>>> Thanks!
>>> Teresa
>>>
>>> >
>>> > Honza
>>> >>
>>> >> Thanks,
>>> >> Teresa
>>> >>
>>> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>>> >> > that should not be visited by the training run.  We can then see how to make the heuristic more aggressive?
>>> >> >
>>> >> > Honza
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

Comments

Jan Hubicka Aug. 17, 2013, 8:44 p.m. UTC | #1
> 
> I added both of these and ran into issues due to profile maintenance.
> For example, there were non-zero blocks in the cold section because
> pro_and_epilogue split a simple return block that was previously reached
> by both hot and cold paths. The new return block that was then only
> reached via the cold path did not have its count properly updated to
> reflect this, and since with this patch, blocks dominated by cold
> blocks are remarked cold, we ended up with a non-zero count block in
> the cold section. And there were 0 count blocks reached by non-zero
> edges because copyprop did not clean up edge weights after removing
> some branches and blocks, leading to non-zero edge weights that had
> previously targeted a branch that was removed, now targeting a 0 count
> block that the removed branch always branched around.

I see, can you please send fixes for the problems you identified?
Thanks for working on this!
> 
> In any case, the good news is that in the cases I looked at, the
> splitting code is doing the right thing and these blocks that were
> marked cold really were cold. It would be great to fix the profile
> maintenance issues, but in the meantime the above sanity checks
> are too aggressive.

We can keep them and output info into the dump file - it is what most of
the profile sanity checking does anyway.

Did you try to use Martin's linker script to make the text.unlikely
section non-executable?  I think that way we will easily find what really
causes us to use it during startup of the trained application (just like
Martin does for gimp).
> 
> I think it makes sense to commit the current patch if possible, as it
> is making the splitting more sane.

My only concern about the patch is that I am not convinced the dominator
based code has a chance to work reliably enough that we won't see too many
accesses into the cold section.
We can commit it and work on a better solution incrementally but it will
probably mean replacing it later.  If you think it makes things easier
to work on it incrementally, I think the patch is OK.
> 
> > - I'll try building and profiling gimp myself to see if I can
> > reproduce the issue with code executing out of the cold section.
> 
> I have spent some time this week trying to get the latest gimp Martin
> pointed me to configured and built, but it took awhile to track down
> and configure/build all of the required versions of dependent
> packages. I'm still hitting some issues trying to get it compiled, so
> it may not yet be configured properly. I'll take a look again early
> next week.

I do not think there is anything special about gimp.  You can probably
take any other bigger app, like GCC itself. With profiledbootstrap
and a linker script to lock the unlikely section you should get ICEs where
we jump into the cold section and should not.
> 
> Teresa
> 
> patch for updating counts based on estimated frequencies to address
> inlined comdats with 0 profile counts:
> 
> 2013-08-16  Teresa Johnson  <tejohnson@google.com>
> 
>         * tree-inline.c (copy_bb): Compute count based on frequency.
>         (copy_edges_for_bb): Ditto.
>         (copy_cfg_body): Ditto.
>         (copy_body): Pass down frequency.
>         (expand_call_inline): Ditto.
>         (tree_function_versioning): Ditto.
>         * predict.c (init_and_estimate_bb_frequencies): New function.
>         (rebuild_frequencies): Invoke init_and_estimate_bb_frequencies.
>         * predict.h (init_and_estimate_bb_frequencies): Declare.
>         * profile.c (branch_prob): Invoke init_and_estimate_bb_frequencies.
>         * ipa-inline-transform.c (update_noncloned_frequencies): Scale edge
>         counts.
>         (clone_inlined_nodes): Compute edge count scale if needed.

I do not see why the inliner needs to care about scaling more than it does right
now.  So you have init_and_estimate_bb_frequencies that forces profile guessing
on a given function body. In addition to that I think you need something like
freqs_to_counts that will compute counts based on freqs with a given scale
(actually you can do that as part of propagation before frequencies are scaled
to the usual 0...FREQ_MAX scale and precision is lost).

Because an offline COMDAT function will be produced for every COMDAT used, I think
it is bad to produce any COMDAT (or any function reachable via calls with non-0
count) that has an empty profile (either because it got lost by COMDAT merging
or because of a reading mismatch).

So I guess you can just check functions with 0 count and non-0 count callers
and initialize their guessed profile.
Some capping will probably be needed to not propagate insanely large numbers..

Since new direct calls can be discovered later, the inliner may want to do that
again each time it inlines a non-0 count call of a COMDAT with 0 count...
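
(Purely as an illustration of the freqs_to_counts idea sketched above - this is
a standalone toy with simplified types and invented numbers, not the GCC
implementation: given a target entry count, per-block counts are derived from
the guessed frequencies so that everything stays proportional to the entry
block.)

#include <stdio.h>

#define RDIV(X, Y) (((X) + (Y) / 2) / (Y))   /* rounding division */

struct block { int frequency; long count; };  /* toy basic block */

/* Derive counts from estimated frequencies, scaled so that the entry
   block (bbs[0]) receives ENTRY_COUNT.  */
static void
freqs_to_counts (struct block *bbs, int n_bbs, long entry_count)
{
  long entry_freq = bbs[0].frequency ? bbs[0].frequency : 1;
  for (int i = 0; i < n_bbs; i++)
    bbs[i].count = RDIV ((long) bbs[i].frequency * entry_count, entry_freq);
}

int
main (void)
{
  /* Entry block, a loop body guessed to be 4x hotter, and a cold error path.  */
  struct block bbs[3] = { {1000, 0}, {4000, 0}, {10, 0} };
  freqs_to_counts (bbs, 3, 250);  /* 250 = e.g. sum of incoming call counts */
  for (int i = 0; i < 3; i++)
    printf ("bb %d: frequency %d -> count %ld\n",
            i, bbs[i].frequency, bbs[i].count);
  return 0;
}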

Honza
> 
> Index: tree-inline.c
> ===================================================================
> --- tree-inline.c       (revision 201644)
> +++ tree-inline.c       (working copy)
> @@ -1502,7 +1502,7 @@ remap_gimple_stmt (gimple stmt, copy_body_data *id
> 
>  static basic_block
>  copy_bb (copy_body_data *id, basic_block bb, int frequency_scale,
> -         gcov_type count_scale)
> +         gcov_type count_scale, gcov_type freq_to_count_scale)
>  {
>    gimple_stmt_iterator gsi, copy_gsi, seq_gsi;
>    basic_block copy_basic_block;
> @@ -1519,7 +1519,13 @@ copy_bb (copy_body_data *id, basic_block bb, int f
>       basic_block_info automatically.  */
>    copy_basic_block = create_basic_block (NULL, (void *) 0,
>                                           (basic_block) prev->aux);
> -  copy_basic_block->count = apply_scale (bb->count, count_scale);
> +  copy_basic_block->count
> +      = (count_scale
> +         ? apply_scale (bb->count, count_scale)
> +         /* When the callee bb counts were all zero (e.g. this was a COMDAT
> +            that didn't get profile counts) then we compute the new bb counts
> +            via the statically-estimated frequencies.  */
> +         : RDIV ((gcov_type)bb->frequency * freq_to_count_scale, BB_FREQ_MAX));
> 
>    /* We are going to rebuild frequencies from scratch.  These values
>       have just small importance to drive canonicalize_loop_headers.  */
> @@ -1888,7 +1894,8 @@ update_ssa_across_abnormal_edges (basic_block bb,
>     debug stmts are left after a statement that must end the basic block.  */
> 
>  static bool
> -copy_edges_for_bb (basic_block bb, gcov_type count_scale, basic_block ret_bb,
> +copy_edges_for_bb (basic_block bb, gcov_type count_scale,
> +                   basic_block ret_bb,
>                    bool can_make_abnormal_goto)
>  {
>    basic_block new_bb = (basic_block) bb->aux;
> @@ -1912,7 +1919,14 @@ static bool
>             && old_edge->dest->aux != EXIT_BLOCK_PTR)
>           flags |= EDGE_FALLTHRU;
>         new_edge = make_edge (new_bb, (basic_block) old_edge->dest->aux, flags);
> -       new_edge->count = apply_scale (old_edge->count, count_scale);
> +        basic_block new_src_bb = (basic_block) old_edge->src->aux;
> +       new_edge->count
> +            = (count_scale
> +               ? apply_scale (old_edge->count, count_scale)
> +               // The bb counts have already been scaled with freq_to_count_scale
> +               // when that is non-zero, so just scale that new bb count by
> +               // the edge probability.
> +               : apply_probability (new_src_bb->count, old_edge->probability));
>         new_edge->probability = old_edge->probability;
>        }
> 
> @@ -2282,7 +2296,8 @@ redirect_all_calls (copy_body_data * id, basic_blo
>     another function.  Walks FN via CFG, returns new fndecl.  */
> 
>  static tree
> -copy_cfg_body (copy_body_data * id, gcov_type count, int frequency_scale,
> +copy_cfg_body (copy_body_data * id, gcov_type count,
> +              int frequency, int frequency_scale,
>                basic_block entry_block_map, basic_block exit_block_map,
>                bitmap blocks_to_copy, basic_block new_entry)
>  {
> @@ -2293,15 +2308,20 @@ static tree
>    basic_block bb;
>    tree new_fndecl = NULL;
>    bool need_debug_cleanup = false;
> -  gcov_type count_scale;
> +  gcov_type count_scale = 0;
> +  gcov_type freq_to_count_scale = 0;
>    int last;
>    int incoming_frequency = 0;
>    gcov_type incoming_count = 0;
> 
> -  if (ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count)
> -    count_scale
> -        = GCOV_COMPUTE_SCALE (count,
> -                              ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count);
> +  basic_block entry_bb = ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun);
> +  if (entry_bb->count)
> +    count_scale = GCOV_COMPUTE_SCALE (count, entry_bb->count);
> +  /* When the callee bb counts were all zero (e.g. this was a COMDAT
> +     that didn't get profile counts) then we compute the new bb counts
> +     via the statically-estimated frequencies.  */
> +  else if (entry_bb->frequency)
> +    freq_to_count_scale = RDIV (count * frequency, entry_bb->frequency);
>    else
>      count_scale = REG_BR_PROB_BASE;
> 
> @@ -2323,7 +2343,13 @@ static tree
>             incoming_frequency += EDGE_FREQUENCY (e);
>             incoming_count += e->count;
>           }
> -      incoming_count = apply_scale (incoming_count, count_scale);
> +      incoming_count
> +          = (count_scale
> +             ? apply_scale (incoming_count, count_scale)
> +             /* When the callee bb counts were all zero (e.g. this was a COMDAT
> +                that didn't get profile counts) then we compute the new bb counts
> +                via the statically-estimated frequencies.  */
> +             : RDIV (incoming_frequency * freq_to_count_scale, BB_FREQ_MAX));
>        incoming_frequency
>         = apply_scale ((gcov_type)incoming_frequency, frequency_scale);
>        ENTRY_BLOCK_PTR->count = incoming_count;
> @@ -2350,7 +2376,8 @@ static tree
>    FOR_EACH_BB_FN (bb, cfun_to_copy)
>      if (!blocks_to_copy || bitmap_bit_p (blocks_to_copy, bb->index))
>        {
> -       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale);
> +       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale,
> +                                     freq_to_count_scale);
>         bb->aux = new_bb;
>         new_bb->aux = bb;
>         new_bb->loop_father = entry_block_map->loop_father;
> @@ -2364,7 +2391,8 @@ static tree
>    FOR_ALL_BB_FN (bb, cfun_to_copy)
>      if (!blocks_to_copy
>          || (bb->index > 0 && bitmap_bit_p (blocks_to_copy, bb->index)))
> -      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale, exit_block_map,
> +      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale,
> +                                              exit_block_map,
>                                                can_make_abormal_goto);
> 
>    if (new_entry)
> @@ -2562,7 +2590,8 @@ copy_tree_body (copy_body_data *id)
>     another function.  */
> 
>  static tree
> -copy_body (copy_body_data *id, gcov_type count, int frequency_scale,
> +copy_body (copy_body_data *id, gcov_type count, int frequency,
> +          int frequency_scale,
>            basic_block entry_block_map, basic_block exit_block_map,
>            bitmap blocks_to_copy, basic_block new_entry)
>  {
> @@ -2571,7 +2600,8 @@ static tree
> 
>    /* If this body has a CFG, walk CFG and copy.  */
>    gcc_assert (ENTRY_BLOCK_PTR_FOR_FUNCTION (DECL_STRUCT_FUNCTION (fndecl)));
> -  body = copy_cfg_body (id, count, frequency_scale, entry_block_map, exit_block_map,
> +  body = copy_cfg_body (id, count, frequency, frequency_scale,
> +                       entry_block_map, exit_block_map,
>                         blocks_to_copy, new_entry);
>    copy_debug_stmts (id);
> 
> @@ -4172,7 +4202,7 @@ expand_call_inline (basic_block bb, gimple stmt, c
>       function in any way before this point, as this CALL_EXPR may be
>       a self-referential call; if we're calling ourselves, we need to
>       duplicate our body before altering anything.  */
> -  copy_body (id, bb->count,
> +  copy_body (id, bb->count, bb->frequency,
>              GCOV_COMPUTE_SCALE (cg_edge->frequency, CGRAPH_FREQ_BASE),
>              bb, return_block, NULL, NULL);
> 
> @@ -5299,8 +5329,9 @@ tree_function_versioning (tree old_decl, tree new_
>      }
> 
>    /* Copy the Function's body.  */
> -  copy_body (&id, old_entry_block->count, REG_BR_PROB_BASE,
> -            ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR, blocks_to_copy, new_entry);
> +  copy_body (&id, old_entry_block->count, old_entry_block->frequency,
> +            REG_BR_PROB_BASE, ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR,
> +            blocks_to_copy, new_entry);
> 
>    /* Renumber the lexical scoping (non-code) blocks consecutively.  */
>    number_blocks (new_decl);
> Index: predict.c
> ===================================================================
> --- predict.c   (revision 201644)
> +++ predict.c   (working copy)
> @@ -2976,6 +2976,24 @@ make_pass_strip_predict_hints (gcc::context *ctxt)
>    return new pass_strip_predict_hints (ctxt);
>  }
> 
> +/* Initialize loop edges and compute estimated bb frequencies when there
> +   is no profile data available.  */
> +
> +void
> +init_and_estimate_bb_frequencies (void)
> +{
> +  if (profile_status == PROFILE_READ && counts_to_freqs ())
> +    return;
> +
> +  loop_optimizer_init (0);
> +  add_noreturn_fake_exit_edges ();
> +  mark_irreducible_loops ();
> +  connect_infinite_loops_to_exit ();
> +  estimate_bb_frequencies ();
> +  remove_fake_exit_edges ();
> +  loop_optimizer_finalize ();
> +}
> +
>  /* Rebuild function frequencies.  Passes are in general expected to
>     maintain profile by hand, however in some cases this is not possible:
>     for example when inlining several functions with loops freuqencies might run
> @@ -2986,15 +3004,7 @@ rebuild_frequencies (void)
>  {
>    timevar_push (TV_REBUILD_FREQUENCIES);
>    if (profile_status == PROFILE_GUESSED)
> -    {
> -      loop_optimizer_init (0);
> -      add_noreturn_fake_exit_edges ();
> -      mark_irreducible_loops ();
> -      connect_infinite_loops_to_exit ();
> -      estimate_bb_frequencies ();
> -      remove_fake_exit_edges ();
> -      loop_optimizer_finalize ();
> -    }
> +    init_and_estimate_bb_frequencies ();
>    else if (profile_status == PROFILE_READ)
>      counts_to_freqs ();
>    else
> Index: predict.h
> ===================================================================
> --- predict.h   (revision 201644)
> +++ predict.h   (working copy)
> @@ -38,6 +38,7 @@ enum prediction
>  extern void predict_insn_def (rtx, enum br_predictor, enum prediction);
>  extern int counts_to_freqs (void);
>  extern void estimate_bb_frequencies (void);
> +extern void init_and_estimate_bb_frequencies (void);
>  extern const char *predictor_name (enum br_predictor);
>  extern tree build_predict_expr (enum br_predictor, enum prediction);
>  extern void tree_estimate_probability (void);
> Index: profile.c
> ===================================================================
> --- profile.c   (revision 201644)
> +++ profile.c   (working copy)
> @@ -1305,6 +1305,12 @@ branch_prob (void)
> 
>    values.release ();
>    free_edge_list (el);
> +
> +  /* Call after setting profile_status to PROFILE_READ, will then
> +     invoke counts_to_freqs and if the sum of the counts is zero, will
> +     estimate the frequencies.  */
> +  init_and_estimate_bb_frequencies ();
> +
>    coverage_end_function (lineno_checksum, cfg_checksum);
>  }
>  ^L
> Index: ipa-inline-transform.c
> ===================================================================
> --- ipa-inline-transform.c      (revision 201644)
> +++ ipa-inline-transform.c      (working copy)
> @@ -51,7 +51,7 @@ int nfunctions_inlined;
> 
>  static void
>  update_noncloned_frequencies (struct cgraph_node *node,
> -                             int freq_scale)
> +                             gcov_type count_scale, int freq_scale)
>  {
>    struct cgraph_edge *e;
> 
> @@ -60,14 +60,16 @@ update_noncloned_frequencies (struct cgraph_node *
>      freq_scale = 1;
>    for (e = node->callees; e; e = e->next_callee)
>      {
> +      e->count = apply_scale (e->count, count_scale);
>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>        if (e->frequency > CGRAPH_FREQ_MAX)
>          e->frequency = CGRAPH_FREQ_MAX;
>        if (!e->inline_failed)
> -        update_noncloned_frequencies (e->callee, freq_scale);
> +        update_noncloned_frequencies (e->callee, count_scale, freq_scale);
>      }
>    for (e = node->indirect_calls; e; e = e->next_callee)
>      {
> +      e->count = apply_scale (e->count, count_scale);
>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>        if (e->frequency > CGRAPH_FREQ_MAX)
>          e->frequency = CGRAPH_FREQ_MAX;
> @@ -169,7 +171,13 @@ clone_inlined_nodes (struct cgraph_edge *e, bool d
>             }
>           duplicate = false;
>           e->callee->symbol.externally_visible = false;
> -          update_noncloned_frequencies (e->callee, e->frequency);
> +          // In the case of a COMDAT, the callee's count may be from other
> +          // modules, and we need to scale it for the current module's calls
> +          // (e.g. e->count may be 0 despite e->callee->count > 0).
> +          gcov_type count_scale = REG_BR_PROB_BASE;
> +          if (e->callee->count > e->count)
> +            count_scale = GCOV_COMPUTE_SCALE (e->count, e->callee->count);
> +          update_noncloned_frequencies (e->callee, count_scale, e->frequency);
>         }
>        else
>         {
Teresa Johnson Aug. 19, 2013, 1:34 p.m. UTC | #2
On Sat, Aug 17, 2013 at 1:44 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> I added both of these and ran into issues due to profile maintenance.
>> For example, there were non-zero blocks in the cold section because
>> pro_and_epilogue split a simple return block that was previously reached
>> by both hot and cold paths. The new return block that was then only
>> reached via the cold path did not have its count properly updated to
>> reflect this, and since with this patch, blocks dominated by cold
>> blocks are remarked cold, we ended up with a non-zero count block in
>> the cold section. And there were 0 count blocks reached by non-zero
>> edges because copyprop did not clean up edge weights after removing
>> some branches and blocks, leading to non-zero edge weights that had
>> previously targeted a branch that was removed, now targeting a 0 count
>> block that the removed branch always branched around.
>
> I see, can you please send fixes for the problems you identified?
> Thanks for working on this!

I don't have fixes at this point - I just identified the phase and
transformation from looking at the dump. But I'll try to fix them soon
while I'm working on performance tuning for splitting. I have a
feeling there are probably a bunch of places where the profile isn't
getting updated properly, unfortunately.

>>
>> In any case, the good news is that in the cases I looked at, the
>> splitting code is doing the right thing and these blocks that were
>> marked cold really were cold. It would be great to fix the profile
>> maintenance issues, but in the meantime the above sanity checks
>> are too aggressive.
>
> We can keep them and output info into dump file - it is what most of
> the profile sanity checking does anyway.

Ok, I will add that.

>
> Did you try to use Martin's linker script to make the text.unlikely
> section non-executable?  I think that way we will easily find what really
> causes us to use it during startup of the trained application (just like
> Martin does for gimp).

I haven't - where can I get that script?

>>
>> I think it makes sense to commit the current patch if possible, as it
>> is making the splitting more sane.
>
> My only concern about the patch is that I am not convinced the dominator
> based code has a chance to work reliably enough that we won't see too many
> accesses into the cold section.

Remember it isn't using dominance anymore. The latest patch was
instead ensuring that the most frequent paths between hot blocks and the
entry/exit are marked hot. That should be better than the dominance
approach used in the earlier version.
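
To make that concrete, here is a toy model of the idea (again, not the actual
patch - the predecessor-count matrix and the block numbering are invented, and
the real code also handles the most frequent paths to the exit block, which
this sketch omits): starting from each hot block, repeatedly follow the
highest-count incoming edge back to the entry block, marking every block on
that path hot.

#include <stdio.h>

#define N 5

/* pred[i][j] = count on the edge j -> i, 0 if no such edge; block 0 is entry.  */
static long pred[N][N] = {
  { 0,  0,  0,  0,  0 },   /* edges into block 0 (entry) */
  { 90, 0,  0,  0,  0 },   /* edges into block 1 */
  { 0,  5,  0,  0,  0 },   /* edges into block 2 */
  { 0,  80, 10, 0,  0 },   /* edges into block 3 */
  { 0,  0,  0,  70, 0 },   /* edges into block 4 */
};

static int hot[N] = { 0, 0, 0, 0, 1 };  /* only block 4 starts out hot */

static void
mark_hot_paths_to_entry (void)
{
  for (int b = 0; b < N; b++)
    {
      if (!hot[b])
        continue;
      /* Walk the most frequent incoming edge back towards the entry block;
         the step limit just guards this toy example against cycles.  */
      int cur = b;
      for (int steps = 0; cur != 0 && steps < N; steps++)
        {
          int best = -1;
          for (int p = 0; p < N; p++)
            if (pred[cur][p] > 0 && (best < 0 || pred[cur][p] > pred[cur][best]))
              best = p;
          if (best < 0)
            break;              /* no predecessor: unreachable from entry */
          hot[best] = 1;
          cur = best;
        }
    }
}

int
main (void)
{
  mark_hot_paths_to_entry ();
  for (int b = 0; b < N; b++)
    printf ("block %d: %s\n", b, hot[b] ? "hot" : "cold");
  return 0;
}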

> We can commit it and work on a better solution incrementally but it will
> probably mean replacing it later.  If you think it makes things easier
> to work on it incrementally, I think the patch is OK.

Yes, I think this is a big step forward from what is there now for
splitting, which does the splitting purely based on bb count in
isolation. I don't have a much better solution in mind yet.

>>
>> > - I'll try building and profiling gimp myself to see if I can
>> > reproduce the issue with code executing out of the cold section.
>>
>> I have spent some time this week trying to get the latest gimp Martin
>> pointed me to configured and built, but it took awhile to track down
>> and configure/build all of the required versions of dependent
>> packages. I'm still hitting some issues trying to get it compiled, so
>> it may not yet be configured properly. I'll take a look again early
>> next week.
>
> I do not think there is anything special about gimp.  You can probably
> take any other bigger app, like GCC itself. With profiledbootstrap
> and a linker script to lock the unlikely section you should get ICEs where
> we jump into the cold section and should not.

Ok, please point me to the linker script and I will try gcc
profiledbootstrap as well. I wanted to try gimp if possible as I
haven't seen this much jumping to the cold section in some of the
internal apps I tried.

Thanks,
Teresa

>>
>> Teresa
>>
>> patch for updating counts based on estimated frequencies to address
>> inlined comdats with 0 profile counts:
>>
>> 2013-08-16  Teresa Johnson  <tejohnson@google.com>
>>
>>         * tree-inline.c (copy_bb): Compute count based on frequency.
>>         (copy_edges_for_bb): Ditto.
>>         (copy_cfg_body): Ditto.
>>         (copy_body): Pass down frequency.
>>         (expand_call_inline): Ditto.
>>         (tree_function_versioning): Ditto.
>>         * predict.c (init_and_estimate_bb_frequencies): New function.
>>         (rebuild_frequencies): Invoke init_and_estimate_bb_frequencies.
>>         * predict.h (init_and_estimate_bb_frequencies): Declare.
>>         * profile.c (branch_prob): Invoke init_and_estimate_bb_frequencies.
>>         * ipa-inline-transform.c (update_noncloned_frequencies): Scale edge
>>         counts.
>>         (clone_inlined_nodes): Compute edge count scale if needed.
>
> I do not see why the inliner needs to care about scaling more than it does right
> now.  So you have init_and_estimate_bb_frequencies that forces profile guessing
> on a given function body. In addition to that I think you need something like
> freqs_to_counts that will compute counts based on freqs with a given scale
> (actually you can do that as part of propagation before frequencies are scaled
> to the usual 0...FREQ_MAX scale and precision is lost).
>
> Because an offline COMDAT function will be produced for every COMDAT used, I think
> it is bad to produce any COMDAT (or any function reachable via calls with non-0
> count) that has an empty profile (either because it got lost by COMDAT merging
> or because of a reading mismatch).
>
> So I guess you can just check functions with 0 count and non-0 count callers
> and initialize their guessed profile.
> Some capping will probably be needed to not propagate insanely large numbers..
>
> Since new direct calls can be discovered later, the inliner may want to do that
> again each time it inlines a non-0 count call of a COMDAT with 0 count...
>
> Honza
>>
>> Index: tree-inline.c
>> ===================================================================
>> --- tree-inline.c       (revision 201644)
>> +++ tree-inline.c       (working copy)
>> @@ -1502,7 +1502,7 @@ remap_gimple_stmt (gimple stmt, copy_body_data *id
>>
>>  static basic_block
>>  copy_bb (copy_body_data *id, basic_block bb, int frequency_scale,
>> -         gcov_type count_scale)
>> +         gcov_type count_scale, gcov_type freq_to_count_scale)
>>  {
>>    gimple_stmt_iterator gsi, copy_gsi, seq_gsi;
>>    basic_block copy_basic_block;
>> @@ -1519,7 +1519,13 @@ copy_bb (copy_body_data *id, basic_block bb, int f
>>       basic_block_info automatically.  */
>>    copy_basic_block = create_basic_block (NULL, (void *) 0,
>>                                           (basic_block) prev->aux);
>> -  copy_basic_block->count = apply_scale (bb->count, count_scale);
>> +  copy_basic_block->count
>> +      = (count_scale
>> +         ? apply_scale (bb->count, count_scale)
>> +         /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +            that didn't get profile counts) then we compute the new bb counts
>> +            via the statically-estimated frequencies.  */
>> +         : RDIV ((gcov_type)bb->frequency * freq_to_count_scale, BB_FREQ_MAX));
>>
>>    /* We are going to rebuild frequencies from scratch.  These values
>>       have just small importance to drive canonicalize_loop_headers.  */
>> @@ -1888,7 +1894,8 @@ update_ssa_across_abnormal_edges (basic_block bb,
>>     debug stmts are left after a statement that must end the basic block.  */
>>
>>  static bool
>> -copy_edges_for_bb (basic_block bb, gcov_type count_scale, basic_block ret_bb,
>> +copy_edges_for_bb (basic_block bb, gcov_type count_scale,
>> +                   basic_block ret_bb,
>>                    bool can_make_abnormal_goto)
>>  {
>>    basic_block new_bb = (basic_block) bb->aux;
>> @@ -1912,7 +1919,14 @@ static bool
>>             && old_edge->dest->aux != EXIT_BLOCK_PTR)
>>           flags |= EDGE_FALLTHRU;
>>         new_edge = make_edge (new_bb, (basic_block) old_edge->dest->aux, flags);
>> -       new_edge->count = apply_scale (old_edge->count, count_scale);
>> +        basic_block new_src_bb = (basic_block) old_edge->src->aux;
>> +       new_edge->count
>> +            = (count_scale
>> +               ? apply_scale (old_edge->count, count_scale)
>> +               // The bb counts have already been scaled with freq_to_count_scale
>> +               // when that is non-zero, so just scale that new bb count by
>> +               // the edge probability.
>> +               : apply_probability (new_src_bb->count, old_edge->probability));
>>         new_edge->probability = old_edge->probability;
>>        }
>>
>> @@ -2282,7 +2296,8 @@ redirect_all_calls (copy_body_data * id, basic_blo
>>     another function.  Walks FN via CFG, returns new fndecl.  */
>>
>>  static tree
>> -copy_cfg_body (copy_body_data * id, gcov_type count, int frequency_scale,
>> +copy_cfg_body (copy_body_data * id, gcov_type count,
>> +              int frequency, int frequency_scale,
>>                basic_block entry_block_map, basic_block exit_block_map,
>>                bitmap blocks_to_copy, basic_block new_entry)
>>  {
>> @@ -2293,15 +2308,20 @@ static tree
>>    basic_block bb;
>>    tree new_fndecl = NULL;
>>    bool need_debug_cleanup = false;
>> -  gcov_type count_scale;
>> +  gcov_type count_scale = 0;
>> +  gcov_type freq_to_count_scale = 0;
>>    int last;
>>    int incoming_frequency = 0;
>>    gcov_type incoming_count = 0;
>>
>> -  if (ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count)
>> -    count_scale
>> -        = GCOV_COMPUTE_SCALE (count,
>> -                              ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count);
>> +  basic_block entry_bb = ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun);
>> +  if (entry_bb->count)
>> +    count_scale = GCOV_COMPUTE_SCALE (count, entry_bb->count);
>> +  /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +     that didn't get profile counts) then we compute the new bb counts
>> +     via the statically-estimated frequencies.  */
>> +  else if (entry_bb->frequency)
>> +    freq_to_count_scale = RDIV (count * frequency, entry_bb->frequency);
>>    else
>>      count_scale = REG_BR_PROB_BASE;
>>
>> @@ -2323,7 +2343,13 @@ static tree
>>             incoming_frequency += EDGE_FREQUENCY (e);
>>             incoming_count += e->count;
>>           }
>> -      incoming_count = apply_scale (incoming_count, count_scale);
>> +      incoming_count
>> +          = (count_scale
>> +             ? apply_scale (incoming_count, count_scale)
>> +             /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +                that didn't get profile counts) then we compute the new bb counts
>> +                via the statically-estimated frequencies.  */
>> +             : RDIV (incoming_frequency * freq_to_count_scale, BB_FREQ_MAX));
>>        incoming_frequency
>>         = apply_scale ((gcov_type)incoming_frequency, frequency_scale);
>>        ENTRY_BLOCK_PTR->count = incoming_count;
>> @@ -2350,7 +2376,8 @@ static tree
>>    FOR_EACH_BB_FN (bb, cfun_to_copy)
>>      if (!blocks_to_copy || bitmap_bit_p (blocks_to_copy, bb->index))
>>        {
>> -       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale);
>> +       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale,
>> +                                     freq_to_count_scale);
>>         bb->aux = new_bb;
>>         new_bb->aux = bb;
>>         new_bb->loop_father = entry_block_map->loop_father;
>> @@ -2364,7 +2391,8 @@ static tree
>>    FOR_ALL_BB_FN (bb, cfun_to_copy)
>>      if (!blocks_to_copy
>>          || (bb->index > 0 && bitmap_bit_p (blocks_to_copy, bb->index)))
>> -      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale, exit_block_map,
>> +      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale,
>> +                                              exit_block_map,
>>                                                can_make_abormal_goto);
>>
>>    if (new_entry)
>> @@ -2562,7 +2590,8 @@ copy_tree_body (copy_body_data *id)
>>     another function.  */
>>
>>  static tree
>> -copy_body (copy_body_data *id, gcov_type count, int frequency_scale,
>> +copy_body (copy_body_data *id, gcov_type count, int frequency,
>> +          int frequency_scale,
>>            basic_block entry_block_map, basic_block exit_block_map,
>>            bitmap blocks_to_copy, basic_block new_entry)
>>  {
>> @@ -2571,7 +2600,8 @@ static tree
>>
>>    /* If this body has a CFG, walk CFG and copy.  */
>>    gcc_assert (ENTRY_BLOCK_PTR_FOR_FUNCTION (DECL_STRUCT_FUNCTION (fndecl)));
>> -  body = copy_cfg_body (id, count, frequency_scale, entry_block_map, exit_block_map,
>> +  body = copy_cfg_body (id, count, frequency, frequency_scale,
>> +                       entry_block_map, exit_block_map,
>>                         blocks_to_copy, new_entry);
>>    copy_debug_stmts (id);
>>
>> @@ -4172,7 +4202,7 @@ expand_call_inline (basic_block bb, gimple stmt, c
>>       function in any way before this point, as this CALL_EXPR may be
>>       a self-referential call; if we're calling ourselves, we need to
>>       duplicate our body before altering anything.  */
>> -  copy_body (id, bb->count,
>> +  copy_body (id, bb->count, bb->frequency,
>>              GCOV_COMPUTE_SCALE (cg_edge->frequency, CGRAPH_FREQ_BASE),
>>              bb, return_block, NULL, NULL);
>>
>> @@ -5299,8 +5329,9 @@ tree_function_versioning (tree old_decl, tree new_
>>      }
>>
>>    /* Copy the Function's body.  */
>> -  copy_body (&id, old_entry_block->count, REG_BR_PROB_BASE,
>> -            ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR, blocks_to_copy, new_entry);
>> +  copy_body (&id, old_entry_block->count, old_entry_block->frequency,
>> +            REG_BR_PROB_BASE, ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR,
>> +            blocks_to_copy, new_entry);
>>
>>    /* Renumber the lexical scoping (non-code) blocks consecutively.  */
>>    number_blocks (new_decl);
>> Index: predict.c
>> ===================================================================
>> --- predict.c   (revision 201644)
>> +++ predict.c   (working copy)
>> @@ -2976,6 +2976,24 @@ make_pass_strip_predict_hints (gcc::context *ctxt)
>>    return new pass_strip_predict_hints (ctxt);
>>  }
>>
>> +/* Initialize loop edges and compute estimated bb frequencies when there
>> +   is no profile data available.  */
>> +
>> +void
>> +init_and_estimate_bb_frequencies (void)
>> +{
>> +  if (profile_status == PROFILE_READ && counts_to_freqs ())
>> +    return;
>> +
>> +  loop_optimizer_init (0);
>> +  add_noreturn_fake_exit_edges ();
>> +  mark_irreducible_loops ();
>> +  connect_infinite_loops_to_exit ();
>> +  estimate_bb_frequencies ();
>> +  remove_fake_exit_edges ();
>> +  loop_optimizer_finalize ();
>> +}
>> +
>>  /* Rebuild function frequencies.  Passes are in general expected to
>>     maintain profile by hand, however in some cases this is not possible:
>>     for example when inlining several functions with loops freuqencies might run
>> @@ -2986,15 +3004,7 @@ rebuild_frequencies (void)
>>  {
>>    timevar_push (TV_REBUILD_FREQUENCIES);
>>    if (profile_status == PROFILE_GUESSED)
>> -    {
>> -      loop_optimizer_init (0);
>> -      add_noreturn_fake_exit_edges ();
>> -      mark_irreducible_loops ();
>> -      connect_infinite_loops_to_exit ();
>> -      estimate_bb_frequencies ();
>> -      remove_fake_exit_edges ();
>> -      loop_optimizer_finalize ();
>> -    }
>> +    init_and_estimate_bb_frequencies ();
>>    else if (profile_status == PROFILE_READ)
>>      counts_to_freqs ();
>>    else
>> Index: predict.h
>> ===================================================================
>> --- predict.h   (revision 201644)
>> +++ predict.h   (working copy)
>> @@ -38,6 +38,7 @@ enum prediction
>>  extern void predict_insn_def (rtx, enum br_predictor, enum prediction);
>>  extern int counts_to_freqs (void);
>>  extern void estimate_bb_frequencies (void);
>> +extern void init_and_estimate_bb_frequencies (void);
>>  extern const char *predictor_name (enum br_predictor);
>>  extern tree build_predict_expr (enum br_predictor, enum prediction);
>>  extern void tree_estimate_probability (void);
>> Index: profile.c
>> ===================================================================
>> --- profile.c   (revision 201644)
>> +++ profile.c   (working copy)
>> @@ -1305,6 +1305,12 @@ branch_prob (void)
>>
>>    values.release ();
>>    free_edge_list (el);
>> +
>> +  /* Call after setting profile_status to PROFILE_READ, will then
>> +     invoke counts_to_freqs and if the sum of the counts is zero, will
>> +     estimate the frequencies.  */
>> +  init_and_estimate_bb_frequencies ();
>> +
>>    coverage_end_function (lineno_checksum, cfg_checksum);
>>  }
>>  ^L
>> Index: ipa-inline-transform.c
>> ===================================================================
>> --- ipa-inline-transform.c      (revision 201644)
>> +++ ipa-inline-transform.c      (working copy)
>> @@ -51,7 +51,7 @@ int nfunctions_inlined;
>>
>>  static void
>>  update_noncloned_frequencies (struct cgraph_node *node,
>> -                             int freq_scale)
>> +                             gcov_type count_scale, int freq_scale)
>>  {
>>    struct cgraph_edge *e;
>>
>> @@ -60,14 +60,16 @@ update_noncloned_frequencies (struct cgraph_node *
>>      freq_scale = 1;
>>    for (e = node->callees; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>>        if (!e->inline_failed)
>> -        update_noncloned_frequencies (e->callee, freq_scale);
>> +        update_noncloned_frequencies (e->callee, count_scale, freq_scale);
>>      }
>>    for (e = node->indirect_calls; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>> @@ -169,7 +171,13 @@ clone_inlined_nodes (struct cgraph_edge *e, bool d
>>             }
>>           duplicate = false;
>>           e->callee->symbol.externally_visible = false;
>> -          update_noncloned_frequencies (e->callee, e->frequency);
>> +          // In the case of a COMDAT, the callee's count may be from other
>> +          // modules, and we need to scale it for the current module's calls
>> +          // (e.g. e->count may be 0 despite e->callee->count > 0).
>> +          gcov_type count_scale = REG_BR_PROB_BASE;
>> +          if (e->callee->count > e->count)
>> +            count_scale = GCOV_COMPUTE_SCALE (e->count, e->callee->count);
>> +          update_noncloned_frequencies (e->callee, count_scale, e->frequency);
>>         }
>>        else
>>         {
Jan Hubicka Aug. 19, 2013, 3:09 p.m. UTC | #3
> Remember it isn't using dominance anymore. The latest patch was
> instead ensuring that the most frequent paths between hot blocks and the
> entry/exit are marked hot. That should be better than the dominance
> approach used in the earlier version.

Indeed, that looks like a more reasonable approach.
Can you point me to the last version of the patch? The last one I remember still
walked dominators...
> 
> > We can commit it and work on a better solution incrementally but it will
> > probably mean replacing it later.  If you think it makes things easier
> > to work on it incrementally, I think the patch is OK.
> 
> Yes, I think this is a big step forward from what is there now for
> splitting, which does the splitting purely based on bb count in
> isolation. I don't have a much better solution in mind yet.
> 
> >>
> >> > - I'll try building and profiling gimp myself to see if I can
> >> > reproduce the issue with code executing out of the cold section.
> >>
> >> I have spent some time this week trying to get the latest gimp Martin
> >> pointed me to configured and built, but it took awhile to track down
> >> and configure/build all of the required versions of dependent
> >> packages. I'm still hitting some issues trying to get it compiled, so
> >> it may not yet be configured properly. I'll take a look again early
> >> next week.
> >
> > I do not think there is anything special about gimp.  You can probably
> > take any other bigger app, like GCC itself. With profiledbootstrap
> > and a linker script to lock the unlikely section you should get ICEs where
> > we jump into the cold section and should not.
> 
> Ok, please point me to the linker script and I will try gcc
> profiledbootstrap as well. I wanted to try gimp if possible as I
> haven't seen this much jumping to the cold section in some of the
> internal apps I tried.

You can also discuss with Martin the systemtap script to plot disk accesses
during the startup.  It is very handy for analyzing code layout issues.

It may be interesting to get a similar script taking traces from valgrind
and plotting the most frequent calls in the final layout ;)

Honza
Teresa Johnson Aug. 19, 2013, 3:30 p.m. UTC | #4
On Sat, Aug 17, 2013 at 1:44 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> patch for updating counts based on estimated frequencies to address
>> inlined comdats with 0 profile counts:
>>
>> 2013-08-16  Teresa Johnson  <tejohnson@google.com>
>>
>>         * tree-inline.c (copy_bb): Compute count based on frequency.
>>         (copy_edges_for_bb): Ditto.
>>         (copy_cfg_body): Ditto.
>>         (copy_body): Pass down frequency.
>>         (expand_call_inline): Ditto.
>>         (tree_function_versioning): Ditto.
>>         * predict.c (init_and_estimate_bb_frequencies): New function.
>>         (rebuild_frequencies): Invoke init_and_estimate_bb_frequencies.
>>         * predict.h (init_and_estimate_bb_frequencies): Declare.
>>         * profile.c (branch_prob): Invoke init_and_estimate_bb_frequencies.
>>         * ipa-inline-transform.c (update_noncloned_frequencies): Scale edge
>>         counts.
>>         (clone_inlined_nodes): Compute edge count scale if needed.
>
> I do not see why the inliner needs to care about scaling more than it does right
> now.  So you have init_and_estimate_bb_frequencies that forces profile guessing
> on a given function body. In addition to that I think you need something like
> freqs_to_counts that will compute counts based on freqs with a given scale
> (actually you can do that as part of propagation before frequencies are scaled
> to the usual 0...FREQ_MAX scale and precision is lost).
>
> Because an offline COMDAT function will be produced for every COMDAT used, I think
> it is bad to produce any COMDAT (or any function reachable via calls with non-0
> count) that has an empty profile (either because it got lost by COMDAT merging
> or because of a reading mismatch).

The approach this patch takes is to simply treat those functions the
same as we would if we didn't feed back profile data in the first
place, by using the frequencies. This is sufficient except when one is
inlined, which is why I have the special handling in the inliner
itself.

>
> So I guess you can just check functions with 0 count and non-0 count callers
> and initialize their guessed profile.
> Some capping will probably be needed to not propagate insanely large numbers..

But at profile read time we don't have access to the inter-module
calls. Presumably having guessed profiles for these routines should
help the O2 profile-use case as well (i.e. better optimized
out-of-line copy), so I wouldn't want to limit it to IPO or LIPO
compiles where we can identify inter-module call counts at some point
in the compilation.

>
> Since new direct calls can be discovered later, the inliner may want to do that
> again each time it inlines a non-0 count call of a COMDAT with 0 count...

How about an approach like this:
- Invoke init_and_estimate_bb_frequencies as I am doing to guess the
profiles at profile read time for functions with 0 counts.
- At inline time, invoke some kind of freqs_to_counts routine for any
0-count routine that is reached by non-zero call edges. It would take
the sum of all incoming call edge counts and synthesize counts for the
bbs using the guessed profile frequencies applied earlier by
init_and_estimate_bb_frequencies. Then the inliner can do its normal
bb count scaling.

Does that seem like a reasonable approach?
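
As a standalone sketch of that second step (toy node/edge types, an arbitrary
COUNT_CAP standing in for whatever capping ends up being used - not the real
cgraph API): when a callee still has a zero count but non-zero incoming call
edges, sum the edge counts and use the result as the scale for synthesizing bb
counts from the guessed frequencies, e.g. via a freqs_to_counts-style helper
like the one sketched earlier in the thread.

#include <stdio.h>

#define COUNT_CAP 1000000000L   /* arbitrary cap against insanely large numbers */

struct node { long count; int profile_synthesized; };   /* toy cgraph node */
struct call_edge { struct node *callee; long count; };  /* toy call edge */

/* If CALLEE has no profile but is called with a non-zero count, synthesize
   an entry count from the sum of the incoming call edge counts (capped).
   The real code would then scale the guessed bb frequencies by this value.  */
static void
maybe_synthesize_profile (struct node *callee,
                          struct call_edge *callers, int n_callers)
{
  if (callee->count != 0 || callee->profile_synthesized)
    return;
  long sum = 0;
  for (int i = 0; i < n_callers; i++)
    sum += callers[i].count;
  if (sum == 0)
    return;
  callee->count = sum > COUNT_CAP ? COUNT_CAP : sum;
  callee->profile_synthesized = 1;
}

int
main (void)
{
  struct node comdat = { 0, 0 };
  struct call_edge callers[2] = { { &comdat, 120 }, { &comdat, 30 } };
  maybe_synthesize_profile (&comdat, callers, 2);
  printf ("synthesized callee count: %ld\n", comdat.count);
  return 0;
}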

There is one other fix in this patch:
- The clone_inlined_nodes/update_noncloned_frequencies changes below
are handling a different case: 0-count call edge in this module, with
non-zero callee node count due to calls from other modules. It will
allow update_noncloned_frequencies to scale down the edge counts in
callee's cloned call tree. This was a fix I made for the
callgraph-based linker plugin function reordering, and not splitting
(since it is using both the node and edge weights to make ordering
decisions). Here's a description of the issue when I was debugging it:

----
In this case, because the callee we are inlining does not have any
other callers and is not external, we call
update_noncloned_frequencies from clone_inlined_nodes instead of
creating a clone. This routine does not attempt to scale the outgoing
edge weight counters on the callee, since the assumption must be that
there are no other callers so all the weight is attributed to the
current edge that we are inlining.

In this case this is clearly not correct, because the caller's count
is 0. I'm assuming that this is happening because the callee we are
inlining is a comdat, so its non-zero weights must have come from a
different module. It seems like update_noncloned_frequencies should go
ahead and scale the counts too.
----

For the above case, I think the right place to fix this is probably
during clone_inlined_nodes/update_noncloned_frequencies, as scaling is
handled by cgraph_clone_node in the case where we need cloning (also
called from clone_inlined_nodes).

Teresa

>
> Honza
>>
>> Index: tree-inline.c
>> ===================================================================
>> --- tree-inline.c       (revision 201644)
>> +++ tree-inline.c       (working copy)
>> @@ -1502,7 +1502,7 @@ remap_gimple_stmt (gimple stmt, copy_body_data *id
>>
>>  static basic_block
>>  copy_bb (copy_body_data *id, basic_block bb, int frequency_scale,
>> -         gcov_type count_scale)
>> +         gcov_type count_scale, gcov_type freq_to_count_scale)
>>  {
>>    gimple_stmt_iterator gsi, copy_gsi, seq_gsi;
>>    basic_block copy_basic_block;
>> @@ -1519,7 +1519,13 @@ copy_bb (copy_body_data *id, basic_block bb, int f
>>       basic_block_info automatically.  */
>>    copy_basic_block = create_basic_block (NULL, (void *) 0,
>>                                           (basic_block) prev->aux);
>> -  copy_basic_block->count = apply_scale (bb->count, count_scale);
>> +  copy_basic_block->count
>> +      = (count_scale
>> +         ? apply_scale (bb->count, count_scale)
>> +         /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +            that didn't get profile counts) then we compute the new bb counts
>> +            via the statically-estimated frequencies.  */
>> +         : RDIV ((gcov_type)bb->frequency * freq_to_count_scale, BB_FREQ_MAX));
>>
>>    /* We are going to rebuild frequencies from scratch.  These values
>>       have just small importance to drive canonicalize_loop_headers.  */
>> @@ -1888,7 +1894,8 @@ update_ssa_across_abnormal_edges (basic_block bb,
>>     debug stmts are left after a statement that must end the basic block.  */
>>
>>  static bool
>> -copy_edges_for_bb (basic_block bb, gcov_type count_scale, basic_block ret_bb,
>> +copy_edges_for_bb (basic_block bb, gcov_type count_scale,
>> +                   basic_block ret_bb,
>>                    bool can_make_abnormal_goto)
>>  {
>>    basic_block new_bb = (basic_block) bb->aux;
>> @@ -1912,7 +1919,14 @@ static bool
>>             && old_edge->dest->aux != EXIT_BLOCK_PTR)
>>           flags |= EDGE_FALLTHRU;
>>         new_edge = make_edge (new_bb, (basic_block) old_edge->dest->aux, flags);
>> -       new_edge->count = apply_scale (old_edge->count, count_scale);
>> +        basic_block new_src_bb = (basic_block) old_edge->src->aux;
>> +       new_edge->count
>> +            = (count_scale
>> +               ? apply_scale (old_edge->count, count_scale)
>> +               // The bb counts have already been scaled with freq_to_count_scale
>> +               // when that is non-zero, so just scale that new bb count by
>> +               // the edge probability.
>> +               : apply_probability (new_src_bb->count, old_edge->probability));
>>         new_edge->probability = old_edge->probability;
>>        }
>>
>> @@ -2282,7 +2296,8 @@ redirect_all_calls (copy_body_data * id, basic_blo
>>     another function.  Walks FN via CFG, returns new fndecl.  */
>>
>>  static tree
>> -copy_cfg_body (copy_body_data * id, gcov_type count, int frequency_scale,
>> +copy_cfg_body (copy_body_data * id, gcov_type count,
>> +              int frequency, int frequency_scale,
>>                basic_block entry_block_map, basic_block exit_block_map,
>>                bitmap blocks_to_copy, basic_block new_entry)
>>  {
>> @@ -2293,15 +2308,20 @@ static tree
>>    basic_block bb;
>>    tree new_fndecl = NULL;
>>    bool need_debug_cleanup = false;
>> -  gcov_type count_scale;
>> +  gcov_type count_scale = 0;
>> +  gcov_type freq_to_count_scale = 0;
>>    int last;
>>    int incoming_frequency = 0;
>>    gcov_type incoming_count = 0;
>>
>> -  if (ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count)
>> -    count_scale
>> -        = GCOV_COMPUTE_SCALE (count,
>> -                              ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun)->count);
>> +  basic_block entry_bb = ENTRY_BLOCK_PTR_FOR_FUNCTION (src_cfun);
>> +  if (entry_bb->count)
>> +    count_scale = GCOV_COMPUTE_SCALE (count, entry_bb->count);
>> +  /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +     that didn't get profile counts) then we compute the new bb counts
>> +     via the statically-estimated frequencies.  */
>> +  else if (entry_bb->frequency)
>> +    freq_to_count_scale = RDIV (count * frequency, entry_bb->frequency);
>>    else
>>      count_scale = REG_BR_PROB_BASE;
>>
>> @@ -2323,7 +2343,13 @@ static tree
>>             incoming_frequency += EDGE_FREQUENCY (e);
>>             incoming_count += e->count;
>>           }
>> -      incoming_count = apply_scale (incoming_count, count_scale);
>> +      incoming_count
>> +          = (count_scale
>> +             ? apply_scale (incoming_count, count_scale)
>> +             /* When the callee bb counts were all zero (e.g. this was a COMDAT
>> +                that didn't get profile counts) then we compute the new bb counts
>> +                via the statically-estimated frequencies.  */
>> +             : RDIV (incoming_frequency * freq_to_count_scale, BB_FREQ_MAX));
>>        incoming_frequency
>>         = apply_scale ((gcov_type)incoming_frequency, frequency_scale);
>>        ENTRY_BLOCK_PTR->count = incoming_count;
>> @@ -2350,7 +2376,8 @@ static tree
>>    FOR_EACH_BB_FN (bb, cfun_to_copy)
>>      if (!blocks_to_copy || bitmap_bit_p (blocks_to_copy, bb->index))
>>        {
>> -       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale);
>> +       basic_block new_bb = copy_bb (id, bb, frequency_scale, count_scale,
>> +                                     freq_to_count_scale);
>>         bb->aux = new_bb;
>>         new_bb->aux = bb;
>>         new_bb->loop_father = entry_block_map->loop_father;
>> @@ -2364,7 +2391,8 @@ static tree
>>    FOR_ALL_BB_FN (bb, cfun_to_copy)
>>      if (!blocks_to_copy
>>          || (bb->index > 0 && bitmap_bit_p (blocks_to_copy, bb->index)))
>> -      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale, exit_block_map,
>> +      need_debug_cleanup |= copy_edges_for_bb (bb, count_scale,
>> +                                              exit_block_map,
>>                                                can_make_abnormal_goto);
>>
>>    if (new_entry)
>> @@ -2562,7 +2590,8 @@ copy_tree_body (copy_body_data *id)
>>     another function.  */
>>
>>  static tree
>> -copy_body (copy_body_data *id, gcov_type count, int frequency_scale,
>> +copy_body (copy_body_data *id, gcov_type count, int frequency,
>> +          int frequency_scale,
>>            basic_block entry_block_map, basic_block exit_block_map,
>>            bitmap blocks_to_copy, basic_block new_entry)
>>  {
>> @@ -2571,7 +2600,8 @@ static tree
>>
>>    /* If this body has a CFG, walk CFG and copy.  */
>>    gcc_assert (ENTRY_BLOCK_PTR_FOR_FUNCTION (DECL_STRUCT_FUNCTION (fndecl)));
>> -  body = copy_cfg_body (id, count, frequency_scale, entry_block_map, exit_block_map,
>> +  body = copy_cfg_body (id, count, frequency, frequency_scale,
>> +                       entry_block_map, exit_block_map,
>>                         blocks_to_copy, new_entry);
>>    copy_debug_stmts (id);
>>
>> @@ -4172,7 +4202,7 @@ expand_call_inline (basic_block bb, gimple stmt, c
>>       function in any way before this point, as this CALL_EXPR may be
>>       a self-referential call; if we're calling ourselves, we need to
>>       duplicate our body before altering anything.  */
>> -  copy_body (id, bb->count,
>> +  copy_body (id, bb->count, bb->frequency,
>>              GCOV_COMPUTE_SCALE (cg_edge->frequency, CGRAPH_FREQ_BASE),
>>              bb, return_block, NULL, NULL);
>>
>> @@ -5299,8 +5329,9 @@ tree_function_versioning (tree old_decl, tree new_
>>      }
>>
>>    /* Copy the Function's body.  */
>> -  copy_body (&id, old_entry_block->count, REG_BR_PROB_BASE,
>> -            ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR, blocks_to_copy, new_entry);
>> +  copy_body (&id, old_entry_block->count, old_entry_block->frequency,
>> +            REG_BR_PROB_BASE, ENTRY_BLOCK_PTR, EXIT_BLOCK_PTR,
>> +            blocks_to_copy, new_entry);
>>
>>    /* Renumber the lexical scoping (non-code) blocks consecutively.  */
>>    number_blocks (new_decl);
>> Index: predict.c
>> ===================================================================
>> --- predict.c   (revision 201644)
>> +++ predict.c   (working copy)
>> @@ -2976,6 +2976,24 @@ make_pass_strip_predict_hints (gcc::context *ctxt)
>>    return new pass_strip_predict_hints (ctxt);
>>  }
>>
>> +/* Initialize loop edges and compute estimated bb frequencies when there
>> +   is no profile data available.  */
>> +
>> +void
>> +init_and_estimate_bb_frequencies (void)
>> +{
>> +  if (profile_status == PROFILE_READ && counts_to_freqs ())
>> +    return;
>> +
>> +  loop_optimizer_init (0);
>> +  add_noreturn_fake_exit_edges ();
>> +  mark_irreducible_loops ();
>> +  connect_infinite_loops_to_exit ();
>> +  estimate_bb_frequencies ();
>> +  remove_fake_exit_edges ();
>> +  loop_optimizer_finalize ();
>> +}
>> +
>>  /* Rebuild function frequencies.  Passes are in general expected to
>>     maintain profile by hand, however in some cases this is not possible:
>>     for example when inlining several functions with loops freuqencies might run
>> @@ -2986,15 +3004,7 @@ rebuild_frequencies (void)
>>  {
>>    timevar_push (TV_REBUILD_FREQUENCIES);
>>    if (profile_status == PROFILE_GUESSED)
>> -    {
>> -      loop_optimizer_init (0);
>> -      add_noreturn_fake_exit_edges ();
>> -      mark_irreducible_loops ();
>> -      connect_infinite_loops_to_exit ();
>> -      estimate_bb_frequencies ();
>> -      remove_fake_exit_edges ();
>> -      loop_optimizer_finalize ();
>> -    }
>> +    init_and_estimate_bb_frequencies ();
>>    else if (profile_status == PROFILE_READ)
>>      counts_to_freqs ();
>>    else
>> Index: predict.h
>> ===================================================================
>> --- predict.h   (revision 201644)
>> +++ predict.h   (working copy)
>> @@ -38,6 +38,7 @@ enum prediction
>>  extern void predict_insn_def (rtx, enum br_predictor, enum prediction);
>>  extern int counts_to_freqs (void);
>>  extern void estimate_bb_frequencies (void);
>> +extern void init_and_estimate_bb_frequencies (void);
>>  extern const char *predictor_name (enum br_predictor);
>>  extern tree build_predict_expr (enum br_predictor, enum prediction);
>>  extern void tree_estimate_probability (void);
>> Index: profile.c
>> ===================================================================
>> --- profile.c   (revision 201644)
>> +++ profile.c   (working copy)
>> @@ -1305,6 +1305,12 @@ branch_prob (void)
>>
>>    values.release ();
>>    free_edge_list (el);
>> +
>> +  /* Call after setting profile_status to PROFILE_READ, will then
>> +     invoke counts_to_freqs and if the sum of the counts is zero, will
>> +     estimate the frequencies.  */
>> +  init_and_estimate_bb_frequencies ();
>> +
>>    coverage_end_function (lineno_checksum, cfg_checksum);
>>  }
>>  ^L
>> Index: ipa-inline-transform.c
>> ===================================================================
>> --- ipa-inline-transform.c      (revision 201644)
>> +++ ipa-inline-transform.c      (working copy)
>> @@ -51,7 +51,7 @@ int nfunctions_inlined;
>>
>>  static void
>>  update_noncloned_frequencies (struct cgraph_node *node,
>> -                             int freq_scale)
>> +                             gcov_type count_scale, int freq_scale)
>>  {
>>    struct cgraph_edge *e;
>>
>> @@ -60,14 +60,16 @@ update_noncloned_frequencies (struct cgraph_node *
>>      freq_scale = 1;
>>    for (e = node->callees; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>>        if (!e->inline_failed)
>> -        update_noncloned_frequencies (e->callee, freq_scale);
>> +        update_noncloned_frequencies (e->callee, count_scale, freq_scale);
>>      }
>>    for (e = node->indirect_calls; e; e = e->next_callee)
>>      {
>> +      e->count = apply_scale (e->count, count_scale);
>>        e->frequency = e->frequency * (gcov_type) freq_scale / CGRAPH_FREQ_BASE;
>>        if (e->frequency > CGRAPH_FREQ_MAX)
>>          e->frequency = CGRAPH_FREQ_MAX;
>> @@ -169,7 +171,13 @@ clone_inlined_nodes (struct cgraph_edge *e, bool d
>>             }
>>           duplicate = false;
>>           e->callee->symbol.externally_visible = false;
>> -          update_noncloned_frequencies (e->callee, e->frequency);
>> +          // In the case of a COMDAT, the callee's count may be from other
>> +          // modules, and we need to scale it for the current module's calls
>> +          // (e.g. e->count may be 0 despite e->callee->count > 0).
>> +          gcov_type count_scale = REG_BR_PROB_BASE;
>> +          if (e->callee->count > e->count)
>> +            count_scale = GCOV_COMPUTE_SCALE (e->count, e->callee->count);
>> +          update_noncloned_frequencies (e->callee, count_scale, e->frequency);
>>         }
>>        else
>>         {
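
For reference, the frequency-based fallback the tree-inline.c hunks above introduce
amounts to the arithmetic below. This is a standalone sketch with simplified stand-ins
for gcov_type, RDIV and BB_FREQ_MAX; it only shows how a zero-count callee's estimated
frequencies are turned into counts at the call site, not the actual GCC implementation.

/* Standalone sketch of the copy_cfg_body/copy_bb fallback: when the callee's
   entry count is zero, derive the inlined bb counts from its estimated
   frequencies.  Types and helpers are simplified stand-ins, not GCC's.  */
#include <stdint.h>
#include <stdio.h>

typedef int64_t gcov_type_t;    /* stand-in for gcov_type */
#define FREQ_MAX 10000          /* stand-in for BB_FREQ_MAX */

/* Round-to-nearest division, like RDIV.  */
static gcov_type_t
rdiv (gcov_type_t x, gcov_type_t y)
{
  return (x + y / 2) / y;
}

/* Scale factor mapping callee frequencies to counts, from the call site's
   count and frequency and the callee entry block's estimated frequency.  */
static gcov_type_t
freq_to_count_scale (gcov_type_t call_count, int call_freq, int entry_freq)
{
  return rdiv (call_count * call_freq, entry_freq);
}

/* Count assigned to a copied bb whose estimated frequency is BB_FREQ.  */
static gcov_type_t
copied_bb_count (int bb_freq, gcov_type_t scale)
{
  return rdiv ((gcov_type_t) bb_freq * scale, FREQ_MAX);
}

int
main (void)
{
  /* Call site: count 100 at frequency FREQ_MAX; the COMDAT callee has
     all-zero counts and an entry block estimated at frequency 10000.  */
  gcov_type_t scale = freq_to_count_scale (100, FREQ_MAX, 10000);
  printf ("entry bb: %lld\n", (long long) copied_bb_count (10000, scale)); /* 100 */
  printf ("half-frequency bb: %lld\n", (long long) copied_bb_count (5000, scale)); /* 50 */
  return 0;
}

The copied edges then just take the new source bb count times the old edge
probability, as in the copy_edges_for_bb hunk.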