diff mbox

Sanitize block partitioning under -freorder-blocks-and-partition

Message ID CAAe5K+WToUznwFFfm5beapXAOOrOgxHR8LXmYBTL70C4VVsT+w@mail.gmail.com
State New
Headers show

Commit Message

Teresa Johnson Aug. 5, 2013, 1:35 p.m. UTC
On Fri, Aug 2, 2013 at 9:48 PM, Teresa Johnson <tejohnson@google.com> wrote:
> On Fri, Aug 2, 2013 at 8:05 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>
>>> 2013-08-01  Teresa Johnson  <tejohnson@google.com>
>>>             Steven Bosscher  <steven@gcc.gnu.org>
>>>
>>>         * cfgrtl.c (fixup_bb_partition): New routine.
>>>         (commit_edge_insertions): Invoke fixup_partitions.
>>>         (find_partition_fixes): New routine.
>>>         (fixup_partitions): Ditto.
>>>         (verify_hot_cold_block_grouping): Update comments.
>>>         (rtl_verify_edges): Invoke find_partition_fixes.
>>>         (rtl_verify_bb_pointers): Update comments.
>>>         (rtl_verify_bb_layout): Ditto.
>>>         * basic-block.h (fixup_partitions): Declare.
>>>         * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
>>>         * bb-reorder.c (sanitize_dominator_hotness): New function.
>>>         (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
>>>         sanitize_dominator_hotness.
>>>
>>> Index: cfgrtl.c
>>> ===================================================================
>>> --- cfgrtl.c    (revision 201281)
>>> +++ cfgrtl.c    (working copy)
>>> @@ -1341,6 +1341,34 @@ fixup_partition_crossing (edge e)
>>>      }
>>>  }
>>>
>>> +/* Called when block BB has been reassigned to a different partition,
>>> +   to ensure that the region crossing attributes are updated.  */
>>> +
>>> +static void
>>> +fixup_bb_partition (basic_block bb)
>>> +{
>>> +  edge e;
>>> +  edge_iterator ei;
>>> +
>>> +  /* Now need to make bb's pred edges non-region crossing.  */
>>> +  FOR_EACH_EDGE (e, ei, bb->preds)
>>> +    {
>>> +      fixup_partition_crossing (e);
>>> +    }
>>> +
>>> +  /* Possibly need to make bb's successor edges region crossing,
>>> +     or remove stale region crossing.  */
>>> +  FOR_EACH_EDGE (e, ei, bb->succs)
>>> +    {
>>> +      if ((e->flags & EDGE_FALLTHRU)
>>> +          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
>>> +          && e->dest != EXIT_BLOCK_PTR)
>>> +        force_nonfallthru (e);
>>> +      else
>>> +        fixup_partition_crossing (e);
>>> +    }
>>> +}
>>
>> Is there particular reason why preds can not be fallhtrus and why
>> force_nonfallthru edge does not need partition crossing fixup?
>> (if so, perhpas it could be mentioned in the description, if not,
>> I think force_nonfallthru path has to check if new BB was introduced
>> and do the right thing on the edge.
>
> I need to clarify the comments in this routine, because without the
> context of how this is called it isn't clear. This routine is only
> called when we detect a hot bb that is now dominated by a cold bb and
> needs to become cold. Therefore, its preds will no longer be region
> crossing (any non-dominating blocks that were previously hot would
> have been marked cold in the caller for the same reason, so we will
> not end up adjusting the region crossing-ness or fallthrough-ness of
> those pred edges). Any that were region crossing before but aren't any
> longer could not have been fall through (as Steven noted, you can't
> have a fall through across a partition boundary). I will add some
> better comments here.
>
> Regarding the call to force_nonfallthru, that routine calls
> fixup_partition_crossing as needed, and I will update the comment to
> reflect that too.

Patch with updated comments below. Ok for trunk?

Thanks,
Teresa

2013-08-05  Teresa Johnson  <tejohnson@google.com>
            Steven Bosscher  <steven@gcc.gnu.org>

        * cfgrtl.c (fixup_bb_partition): New routine.
        (commit_edge_insertions): Invoke fixup_partitions.
        (find_partition_fixes): New routine.
        (fixup_partitions): Ditto.
        (verify_hot_cold_block_grouping): Update comments.
        (rtl_verify_edges): Invoke find_partition_fixes.
        (rtl_verify_bb_pointers): Update comments.
        (rtl_verify_bb_layout): Ditto.
        * basic-block.h (fixup_partitions): Declare.
        * cfgcleanup.c (try_optimize_cfg): Invoke fixup_partitions.
        * bb-reorder.c (sanitize_dominator_hotness): New function.
        (find_rarely_executed_basic_blocks_and_crossing_edges): Invoke
        sanitize_dominator_hotness.

   /* The format of .gcc_except_table does not allow landing pads to
      be in a different partition as the throw.  Fix this by either
      moving or duplicating the landing pads.  */

>
>>
>>> +/* Sanity check partition hotness to ensure that basic blocks in
>>> +   the cold partition don't dominate basic blocks in the hot partition.
>>> +   If FLAG_ONLY is true, report violations as errors. Otherwise
>>> +   re-mark the dominated blocks as cold, since this is run after
>>> +   cfg optimizations that may make hot blocks previously reached
>>> +   by both hot and cold blocks now only reachable along cold paths.  */
>>
>> With profile, I suppose we can have cold blocks dominating hot blocks when the
>> hot blocks is in loop whose trip count is high enough.  Indeed for partitioning
>> reasons it does not make sense to push those into different section.
>>
>> I also wonder, if we finally get the pass stable, can we enable it by default
>> and offline probably cold blocks w/o profile? Primarily blocks reachable only
>> by EH + blocks leading to a crash or throw().  For C++ those should be common
>> enough to make a difference...
>
> Yep, as soon as PR57451 is fixed, which I hope to get to next week,
> then I am going to send a patch to turn this on by default, at least
> with profile feedback, which is where I've been doing performance
> tuning. But you are right that there are some cases where it should be
> beneficial without profile data as well.
>
> Thanks,
> Teresa
>
>>
>> Honza
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413

Comments

Jan Hubicka Aug. 5, 2013, 2:11 p.m. UTC | #1
The patch looks OK to me in general (I can not approve it).
Still have one question...
> +
> +/* Ensure that no cold bbs dominate hot bbs along the dominance or
> +   post-dominance DIR, for example as a result of edge weight insanities.
> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
> +   to BBS_IN_HOT_PARTITION.  */
> +
> +static unsigned int
> +sanitize_dominator_hotness (enum cdi_direction dir, unsigned int cold_bb_count,
> +                            vec<basic_block> *bbs_in_hot_partition)
> +{
> +  /* Callers check this.  */
> +  gcc_checking_assert (cold_bb_count);
> +
> +  bool dom_calculated_here = !dom_info_available_p (dir);
> +
> +  if (dom_calculated_here)
> +    calculate_dominance_info (dir);
> +
> +  /* Keep examining hot bbs while we still have some left to check
> +     and there are remaining cold bbs.  */
> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
> +  while (! hot_bbs_to_check.is_empty ()
> +         && cold_bb_count)
> +    {
> +      basic_block bb = hot_bbs_to_check.pop ();
> +      basic_block dom_bb = get_immediate_dominator (dir, bb);
> +
> +      /* If bb's immediate dominator is also hot (or unpartitioned,
> +         e.g. the entry block) then it is ok. If it is cold, it
> +         needs to be adjusted.  */
> +      if (BB_PARTITION (dom_bb) != BB_COLD_PARTITION)
> +        continue;

What will hapepn on

if (t)
  something
else
  something else
for (i=0;i<1000000;i++)
  something else2

I would expect if/something and something else to be cold by profile feedback.
Your dominator code will bring the if into hot partition but both paths
of conditional will be cold, so the number of crossings will actually grow.

If we want to have at least some path to hot blocks in the hot region, I suspect
we could walk back from hot regions to entry and keep those in hot regions rather
than relying on the dominator tree...
But I am sure such things can be dealt with incrementally.

Honza
Teresa Johnson Aug. 5, 2013, 2:57 p.m. UTC | #2
On Mon, Aug 5, 2013 at 7:11 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> The patch looks OK to me in general (I can not approve it).
> Still have one question...
>> +
>> +/* Ensure that no cold bbs dominate hot bbs along the dominance or
>> +   post-dominance DIR, for example as a result of edge weight insanities.
>> +   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
>> +   to BBS_IN_HOT_PARTITION.  */
>> +
>> +static unsigned int
>> +sanitize_dominator_hotness (enum cdi_direction dir, unsigned int cold_bb_count,
>> +                            vec<basic_block> *bbs_in_hot_partition)
>> +{
>> +  /* Callers check this.  */
>> +  gcc_checking_assert (cold_bb_count);
>> +
>> +  bool dom_calculated_here = !dom_info_available_p (dir);
>> +
>> +  if (dom_calculated_here)
>> +    calculate_dominance_info (dir);
>> +
>> +  /* Keep examining hot bbs while we still have some left to check
>> +     and there are remaining cold bbs.  */
>> +  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
>> +  while (! hot_bbs_to_check.is_empty ()
>> +         && cold_bb_count)
>> +    {
>> +      basic_block bb = hot_bbs_to_check.pop ();
>> +      basic_block dom_bb = get_immediate_dominator (dir, bb);
>> +
>> +      /* If bb's immediate dominator is also hot (or unpartitioned,
>> +         e.g. the entry block) then it is ok. If it is cold, it
>> +         needs to be adjusted.  */
>> +      if (BB_PARTITION (dom_bb) != BB_COLD_PARTITION)
>> +        continue;
>
> What will hapepn on
>
> if (t)
>   something
> else
>   something else
> for (i=0;i<1000000;i++)
>   something else2
>
> I would expect if/something and something else to be cold by profile feedback.
> Your dominator code will bring the if into hot partition but both paths
> of conditional will be cold, so the number of crossings will actually grow.

You are right, this case will not be handled well.

>
> If we want to have at least some path to hot blocks in the hot region, I suspect
> we could walk back from hot regions to entry and keep those in hot regions rather
> than relying on the dominator tree...
> But I am sure such things can be dealt with incrementally.

I am going to fix this and will resend the patch. Rather than look at
the immediate dominator of each hot block, we need to ensure that at
least one pred bb is hot. In your example, if that was a 50-50 branch,
then IMO both preds should be marked hot.

Thanks,
Teresa

>
> Honza
Teresa Johnson Aug. 8, 2013, 11:04 p.m. UTC | #3
On Thu, Aug 8, 2013 at 3:23 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> Martin Liska was kind enough to generate disk seeking graph of gimp statrup with his function reordering.
> His code simply measures time of firest execution of a function and orders functions in the given order.
> The functions stay in the subsections (unlikely/startup/exit/hot/normal) that are then glued together
> in this order.
>
> I am attaching disk seeking with and without -freorder-blocks-and-partition (with your patch).
>
> In 2.pdf you can see two increasing sequences in the text segment.  If I am not mistaken the bottom
> one comes for hot and the top one for normal section.  The big unused part on bottom is unlikely
> section since most of gimp is not trained.

2.pdf is reordered with Martin's technique?

>
> Now 1.pdf is with -freorder-blocks-and-partition and your patch.  You can see there is third sequence
> near bottom of the text seciton. that is beggining of unlikely section, so it tracks jumps where we
> fall into cold section of function.

1.pdf is generated using the usual FDO +
-freorder-blocks-and-partition (i.e. not using Martin's technique)?

>
> It still seems rather bad (i.e. good part of unlikely section is actually used).  I think the dominator
> based approach is not going to work too reliably (I can "fix" my testcase to contain multiple nested
> conditionals and then the heuristic about predecestors won't help).

Yes, this doesn't look good. Did you use the latest version of my
patch that doesn't walk the dominators?

Do you know how many training runs are done for this benchmark? I
think a lot of the issues that you pointed out with the hot loop
preceded by non-looping conditional code as in your earlier example,
or multiple nested conditionals, comes from the fact that the cold
cutoff is not 0, but some number less than the number of training
runs. Perhaps the cutoff for splitting should be 0. Then the main
issue that needs to be corrected is profile insanities, not code that
is executed once (since that would not be marked cold).

The only other issue that I can think of here is that the training
data was not representative and didn't execute these blocks.

>
> What about simply walking the CFG from entry through all edges with non-0 counts and making all reachable
> blocks hot + forcingly make any hot blocks not reachable this way reachable?

Is this different than what I currently have + changing the cold
cutoff to 0? In that case any blocks reachable through non-0 edges
should be non-0 and marked hot, and the current patch forces the most
frequent paths to all hot blocks to be hot.

Thanks,
Teresa

> I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>
> Honza
Jan Hubicka Aug. 9, 2013, 9:58 a.m. UTC | #4
> On Thu, Aug 8, 2013 at 3:23 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> > Hi,
> > Martin Liska was kind enough to generate disk seeking graph of gimp statrup with his function reordering.
> > His code simply measures time of firest execution of a function and orders functions in the given order.
> > The functions stay in the subsections (unlikely/startup/exit/hot/normal) that are then glued together
> > in this order.
> >
> > I am attaching disk seeking with and without -freorder-blocks-and-partition (with your patch).
> >
> > In 2.pdf you can see two increasing sequences in the text segment.  If I am not mistaken the bottom
> > one comes for hot and the top one for normal section.  The big unused part on bottom is unlikely
> > section since most of gimp is not trained.
> 
> 2.pdf is reordered with Martin's technique?
> 
> >
> > Now 1.pdf is with -freorder-blocks-and-partition and your patch.  You can see there is third sequence
> > near bottom of the text seciton. that is beggining of unlikely section, so it tracks jumps where we
> > fall into cold section of function.
> 
> 1.pdf is generated using the usual FDO +
> -freorder-blocks-and-partition (i.e. not using Martin's technique)?

2.pdf is Martin's reordering (that works ortoghonally to what we already have -
it just orders the functions inside idividual subsections. This make the
subsections more visible than without his patch).
1.pdf is Marting's rerodering + yours patch (I asked him to double check it is
the latest verision) + -freorder-blocks-and-partition.

He simply trains and measures the gimp startup, nothing else, so there should not
be problem with representativity of the data.
> 
> >
> > It still seems rather bad (i.e. good part of unlikely section is actually used).  I think the dominator
> > based approach is not going to work too reliably (I can "fix" my testcase to contain multiple nested
> > conditionals and then the heuristic about predecestors won't help).
> 
> Yes, this doesn't look good. Did you use the latest version of my
> patch that doesn't walk the dominators?
> 
> Do you know how many training runs are done for this benchmark? I
> think a lot of the issues that you pointed out with the hot loop
> preceded by non-looping conditional code as in your earlier example,
> or multiple nested conditionals, comes from the fact that the cold
> cutoff is not 0, but some number less than the number of training
> runs. Perhaps the cutoff for splitting should be 0. Then the main
> issue that needs to be corrected is profile insanities, not code that
> is executed once (since that would not be marked cold).

Hmm, compute_function_frequency uses probably_never_executed_bb_p that requires
the count of basic block to be less than number of train runs.  In Martin's
setup that will be 0.

This is the same as what -frerorder-block-of-partition does?
> 
> The only other issue that I can think of here is that the training
> data was not representative and didn't execute these blocks.
> 
> >
> > What about simply walking the CFG from entry through all edges with non-0 counts and making all reachable
> > blocks hot + forcingly make any hot blocks not reachable this way reachable?
> 
> Is this different than what I currently have + changing the cold
> cutoff to 0? In that case any blocks reachable through non-0 edges
> should be non-0 and marked hot, and the current patch forces the most
> frequent paths to all hot blocks to be hot.

Do we sanity check that the cold partition does not contain any blocks of
count 0?  It may be that the profile is broken enough to make partitioning
not work.
I can think of inlining where the count gets scaled all way down to 0.  Perhaps
count scaling code can be modified to never round towards 0 for block executing
non-0 times...

Honza
> 
> Thanks,
> Teresa
> 
> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
> >
> > Honza
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Teresa Johnson Aug. 9, 2013, 2:38 p.m. UTC | #5
On Fri, Aug 9, 2013 at 2:58 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Thu, Aug 8, 2013 at 3:23 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Hi,
>> > Martin Liska was kind enough to generate disk seeking graph of gimp statrup with his function reordering.
>> > His code simply measures time of firest execution of a function and orders functions in the given order.
>> > The functions stay in the subsections (unlikely/startup/exit/hot/normal) that are then glued together
>> > in this order.
>> >
>> > I am attaching disk seeking with and without -freorder-blocks-and-partition (with your patch).
>> >
>> > In 2.pdf you can see two increasing sequences in the text segment.  If I am not mistaken the bottom
>> > one comes for hot and the top one for normal section.  The big unused part on bottom is unlikely
>> > section since most of gimp is not trained.
>>
>> 2.pdf is reordered with Martin's technique?
>>
>> >
>> > Now 1.pdf is with -freorder-blocks-and-partition and your patch.  You can see there is third sequence
>> > near bottom of the text seciton. that is beggining of unlikely section, so it tracks jumps where we
>> > fall into cold section of function.
>>
>> 1.pdf is generated using the usual FDO +
>> -freorder-blocks-and-partition (i.e. not using Martin's technique)?
>
> 2.pdf is Martin's reordering (that works ortoghonally to what we already have -
> it just orders the functions inside idividual subsections. This make the
> subsections more visible than without his patch).
> 1.pdf is Marting's rerodering + yours patch (I asked him to double check it is
> the latest verision) + -freorder-blocks-and-partition.
>
> He simply trains and measures the gimp startup, nothing else, so there should not
> be problem with representativity of the data.

Ok, so a single training run, and it is essentially the same as what
is being used to create the graph after optimization.

>>
>> >
>> > It still seems rather bad (i.e. good part of unlikely section is actually used).  I think the dominator
>> > based approach is not going to work too reliably (I can "fix" my testcase to contain multiple nested
>> > conditionals and then the heuristic about predecestors won't help).
>>
>> Yes, this doesn't look good. Did you use the latest version of my
>> patch that doesn't walk the dominators?
>>
>> Do you know how many training runs are done for this benchmark? I
>> think a lot of the issues that you pointed out with the hot loop
>> preceded by non-looping conditional code as in your earlier example,
>> or multiple nested conditionals, comes from the fact that the cold
>> cutoff is not 0, but some number less than the number of training
>> runs. Perhaps the cutoff for splitting should be 0. Then the main
>> issue that needs to be corrected is profile insanities, not code that
>> is executed once (since that would not be marked cold).
>
> Hmm, compute_function_frequency uses probably_never_executed_bb_p that requires
> the count of basic block to be less than number of train runs.  In Martin's
> setup that will be 0.
>
> This is the same as what -frerorder-block-of-partition does?

Right, it simply puts blocks that are probably_never_executed_bb_p
into the cold section. But it sounds like that is not an issue here
since Martin is doing a single training run so the cutoff is
essentially 0.

>>
>> The only other issue that I can think of here is that the training
>> data was not representative and didn't execute these blocks.
>>
>> >
>> > What about simply walking the CFG from entry through all edges with non-0 counts and making all reachable
>> > blocks hot + forcingly make any hot blocks not reachable this way reachable?
>>
>> Is this different than what I currently have + changing the cold
>> cutoff to 0? In that case any blocks reachable through non-0 edges
>> should be non-0 and marked hot, and the current patch forces the most
>> frequent paths to all hot blocks to be hot.
>
> Do we sanity check that the cold partition does not contain any blocks of
> count 0?  It may be that the profile is broken enough to make partitioning
> not work.

Do you mean sanity check that the cold partition does not contain any
blocks of count > 0? (they should all be zero) I don't think that
sanity check is there, but I can try adding that.

The issue with such a sanity check may be due to the later fixup I
have in this patch (fixup_partitions). It is invoked after certain
optimizations on the cfg that may make hot blocks previously reached
by both hot and cold edges only reachable by cold blocks. These blocks
are remarked cold. If the profile data hasn't been updated correctly
it is possible that they would still have a non-0 count, although they
are essentially cold after the cfg transformation.

But certainly such a sanity check should always succeed after the
original partitioning.

> I can think of inlining where the count gets scaled all way down to 0.  Perhaps
> count scaling code can be modified to never round towards 0 for block executing
> non-0 times...

This reminds me of why this situation could happen. When I have been
testing this on the google branch I found situations where COMDAT
routines have 0 profile counts (if I remember correctly, this happens
when profile-gen binary has call to out-of-line copy of COMDAT in
module A, linker chooses the out-of-line copy from module B, therefore
the profile data for COMDAT in module A is 0). When the COMDAT gets
inlined, the 0 counts on its bbs are scaled to 0, even though the
callsite is non-zero. I have a patch that I was planning to send as a
follow-up that handles this case by propagating the callsite bb's
count to the inlined code when it has 0 counts, scaling by the edge
frequencies. I can either include that patch in this one, or send it
for review separately right now. Do you want to give it a try with
this one to see if it addresses the issue?

Also, can you send me reproduction instructions for gimp? I don't
think I need Martin's patch, but which version of gimp and what is the
equivalent way for me to train it? I have some scripts to generate a
similar type of instruction heat map graph that I have been using to
tune partitioning and function reordering. Essentially it uses linux
perf to sample on instructions_retired and then munge the data in
several ways to produce various stats and graphs. One thing that has
been useful has been to combine the perf data with nm output to
determine which cold functions are being executed at runtime.

However, for this to tell me which split cold bbs are being executed I
need to use a patch that Sri sent for review several months back that
gives the split cold section its own name:
  http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
Steven had some follow up comments that Sri hasn't had a chance to address yet:
  http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
(cc'ing Sri as we should probably revive this patch soon to address
gdb and other issues with detecting split functions properly)

Thanks!
Teresa

>
> Honza
>>
>> Thanks,
>> Teresa
>>
>> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>> >
>> > Honza
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Jan Hubicka Aug. 9, 2013, 3:28 p.m. UTC | #6
> > Do we sanity check that the cold partition does not contain any blocks of
> > count 0?  It may be that the profile is broken enough to make partitioning
> > not work.
> 
> Do you mean sanity check that the cold partition does not contain any
> blocks of count > 0? (they should all be zero) I don't think that
> sanity check is there, but I can try adding that.

Thanks, lets start with this - I suppose we need to figure out if
 1) the reachable blocks goes to cold section because partitioning decides
    so even if they have non-0 count.
 2) the reachable blocks goes to cold section because they have incorrectly
    updated count to 0 by someone
 3) profiling gets some blocks wrong.
> 
> The issue with such a sanity check may be due to the later fixup I
> have in this patch (fixup_partitions). It is invoked after certain
> optimizations on the cfg that may make hot blocks previously reached
> by both hot and cold edges only reachable by cold blocks. These blocks
> are remarked cold. If the profile data hasn't been updated correctly
> it is possible that they would still have a non-0 count, although they
> are essentially cold after the cfg transformation.

Well, or the other posibility is that the edges was updated wrong
and the blocks are really cold.  We need to figure out if that happens
commonly enough.

I will try to think of some artificial testcases.

> 
> But certainly such a sanity check should always succeed after the
> original partitioning.
> 
> > I can think of inlining where the count gets scaled all way down to 0.  Perhaps
> > count scaling code can be modified to never round towards 0 for block executing
> > non-0 times...
> 
> This reminds me of why this situation could happen. When I have been
> testing this on the google branch I found situations where COMDAT
> routines have 0 profile counts (if I remember correctly, this happens
> when profile-gen binary has call to out-of-line copy of COMDAT in
> module A, linker chooses the out-of-line copy from module B, therefore
> the profile data for COMDAT in module A is 0). When the COMDAT gets
> inlined, the 0 counts on its bbs are scaled to 0, even though the
> callsite is non-zero. I have a patch that I was planning to send as a
> follow-up that handles this case by propagating the callsite bb's
> count to the inlined code when it has 0 counts, scaling by the edge
> frequencies. I can either include that patch in this one, or send it
> for review separately right now. Do you want to give it a try with
> this one to see if it addresses the issue?

This scenario should not happen with LTO setup: the LTO symbol tables contains
code before early optimization and should be identical with profiling or
without (modulo the new references and call from profile code).

But this patch seems useful as a backup solution for non-LTO, so yes, please
send it separately and I can try to double check that it really do not happen
with LTO.
(acutally LTO symtab may just chose COMDAT from module that has counts with it.
It has all the info for it.  I was thinkin about it few weeks back.  It is
bit hard to do - you need to verify that all references from the function are
the same or linking might fail if you overwrite linker's decisiosns).
> 
> Also, can you send me reproduction instructions for gimp? I don't
> think I need Martin's patch, but which version of gimp and what is the
> equivalent way for me to train it? I have some scripts to generate a
> similar type of instruction heat map graph that I have been using to
> tune partitioning and function reordering. Essentially it uses linux
> perf to sample on instructions_retired and then munge the data in
> several ways to produce various stats and graphs. One thing that has
> been useful has been to combine the perf data with nm output to
> determine which cold functions are being executed at runtime.

Martin?

> 
> However, for this to tell me which split cold bbs are being executed I
> need to use a patch that Sri sent for review several months back that
> gives the split cold section its own name:
>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
> (cc'ing Sri as we should probably revive this patch soon to address
> gdb and other issues with detecting split functions properly)

Intresting, I used linker script for this purposes, but that his GNU ld only...

Honza
> 
> Thanks!
> Teresa
> 
> >
> > Honza
> >>
> >> Thanks,
> >> Teresa
> >>
> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
> >> >
> >> > Honza
> >>
> >>
> >>
> >> --
> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Martin Liška Aug. 9, 2013, 3:54 p.m. UTC | #7
Hi

On 9 August 2013 17:28, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Do we sanity check that the cold partition does not contain any blocks of
>> > count 0?  It may be that the profile is broken enough to make partitioning
>> > not work.
>>
>> Do you mean sanity check that the cold partition does not contain any
>> blocks of count > 0? (they should all be zero) I don't think that
>> sanity check is there, but I can try adding that.
>
> Thanks, lets start with this - I suppose we need to figure out if
>  1) the reachable blocks goes to cold section because partitioning decides
>     so even if they have non-0 count.
>  2) the reachable blocks goes to cold section because they have incorrectly
>     updated count to 0 by someone
>  3) profiling gets some blocks wrong.
>>
>> The issue with such a sanity check may be due to the later fixup I
>> have in this patch (fixup_partitions). It is invoked after certain
>> optimizations on the cfg that may make hot blocks previously reached
>> by both hot and cold edges only reachable by cold blocks. These blocks
>> are remarked cold. If the profile data hasn't been updated correctly
>> it is possible that they would still have a non-0 count, although they
>> are essentially cold after the cfg transformation.
>
> Well, or the other posibility is that the edges was updated wrong
> and the blocks are really cold.  We need to figure out if that happens
> commonly enough.
>
> I will try to think of some artificial testcases.
>
>>
>> But certainly such a sanity check should always succeed after the
>> original partitioning.
>>
>> > I can think of inlining where the count gets scaled all way down to 0.  Perhaps
>> > count scaling code can be modified to never round towards 0 for block executing
>> > non-0 times...
>>
>> This reminds me of why this situation could happen. When I have been
>> testing this on the google branch I found situations where COMDAT
>> routines have 0 profile counts (if I remember correctly, this happens
>> when profile-gen binary has call to out-of-line copy of COMDAT in
>> module A, linker chooses the out-of-line copy from module B, therefore
>> the profile data for COMDAT in module A is 0). When the COMDAT gets
>> inlined, the 0 counts on its bbs are scaled to 0, even though the
>> callsite is non-zero. I have a patch that I was planning to send as a
>> follow-up that handles this case by propagating the callsite bb's
>> count to the inlined code when it has 0 counts, scaling by the edge
>> frequencies. I can either include that patch in this one, or send it
>> for review separately right now. Do you want to give it a try with
>> this one to see if it addresses the issue?
>
> This scenario should not happen with LTO setup: the LTO symbol tables contains
> code before early optimization and should be identical with profiling or
> without (modulo the new references and call from profile code).
>
> But this patch seems useful as a backup solution for non-LTO, so yes, please
> send it separately and I can try to double check that it really do not happen
> with LTO.
> (acutally LTO symtab may just chose COMDAT from module that has counts with it.
> It has all the info for it.  I was thinkin about it few weeks back.  It is
> bit hard to do - you need to verify that all references from the function are
> the same or linking might fail if you overwrite linker's decisiosns).
>>
>> Also, can you send me reproduction instructions for gimp? I don't
>> think I need Martin's patch, but which version of gimp and what is the
>> equivalent way for me to train it? I have some scripts to generate a
>> similar type of instruction heat map graph that I have been using to
>> tune partitioning and function reordering. Essentially it uses linux
>> perf to sample on instructions_retired and then munge the data in
>> several ways to produce various stats and graphs. One thing that has
>> been useful has been to combine the perf data with nm output to
>> determine which cold functions are being executed at runtime.
>
> Martin?

I use gimp from git repository, commit:
88ecd59c3436d302b644a5d25c1938c0e7b60ae0 (from Fet 5 2013)
Link: http://www.gimp.org/source/#gimp_from_git

Martin

>>
>> However, for this to tell me which split cold bbs are being executed I
>> need to use a patch that Sri sent for review several months back that
>> gives the split cold section its own name:
>>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>> (cc'ing Sri as we should probably revive this patch soon to address
>> gdb and other issues with detecting split functions properly)
>
> Intresting, I used linker script for this purposes, but that his GNU ld only...
>
> Honza
>>
>> Thanks!
>> Teresa
>>
>> >
>> > Honza
>> >>
>> >> Thanks,
>> >> Teresa
>> >>
>> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>> >> >
>> >> > Honza
>> >>
>> >>
>> >>
>> >> --
>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Teresa Johnson Aug. 9, 2013, 9:02 p.m. UTC | #8
On Fri, Aug 9, 2013 at 8:28 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> > Do we sanity check that the cold partition does not contain any blocks of
>> > count 0?  It may be that the profile is broken enough to make partitioning
>> > not work.
>>
>> Do you mean sanity check that the cold partition does not contain any
>> blocks of count > 0? (they should all be zero) I don't think that
>> sanity check is there, but I can try adding that.
>
> Thanks, lets start with this - I suppose we need to figure out if
>  1) the reachable blocks goes to cold section because partitioning decides
>     so even if they have non-0 count.

Right, this should be easy enough to check and should hopefully never happen.

>  2) the reachable blocks goes to cold section because they have incorrectly
>     updated count to 0 by someone

A sanity check should find this too. But it can happen now for various
reasons like the comdat issue I described below. But it will be good
to flag these and fix them.

>  3) profiling gets some blocks wrong.

This is the one that will be tough to fix, if the training run isn't
representative.

>>
>> The issue with such a sanity check may be due to the later fixup I
>> have in this patch (fixup_partitions). It is invoked after certain
>> optimizations on the cfg that may make hot blocks previously reached
>> by both hot and cold edges only reachable by cold blocks. These blocks
>> are remarked cold. If the profile data hasn't been updated correctly
>> it is possible that they would still have a non-0 count, although they
>> are essentially cold after the cfg transformation.
>
> Well, or the other posibility is that the edges was updated wrong
> and the blocks are really cold.  We need to figure out if that happens
> commonly enough.
>
> I will try to think of some artificial testcases.
>
>>
>> But certainly such a sanity check should always succeed after the
>> original partitioning.
>>
>> > I can think of inlining where the count gets scaled all way down to 0.  Perhaps
>> > count scaling code can be modified to never round towards 0 for block executing
>> > non-0 times...
>>
>> This reminds me of why this situation could happen. When I have been
>> testing this on the google branch I found situations where COMDAT
>> routines have 0 profile counts (if I remember correctly, this happens
>> when profile-gen binary has call to out-of-line copy of COMDAT in
>> module A, linker chooses the out-of-line copy from module B, therefore
>> the profile data for COMDAT in module A is 0). When the COMDAT gets
>> inlined, the 0 counts on its bbs are scaled to 0, even though the
>> callsite is non-zero. I have a patch that I was planning to send as a
>> follow-up that handles this case by propagating the callsite bb's
>> count to the inlined code when it has 0 counts, scaling by the edge
>> frequencies. I can either include that patch in this one, or send it
>> for review separately right now. Do you want to give it a try with
>> this one to see if it addresses the issue?
>
> This scenario should not happen with LTO setup: the LTO symbol tables contains
> code before early optimization and should be identical with profiling or
> without (modulo the new references and call from profile code).
>
> But this patch seems useful as a backup solution for non-LTO, so yes, please
> send it separately and I can try to double check that it really do not happen
> with LTO.
> (acutally LTO symtab may just chose COMDAT from module that has counts with it.
> It has all the info for it.  I was thinkin about it few weeks back.  It is
> bit hard to do - you need to verify that all references from the function are
> the same or linking might fail if you overwrite linker's decisiosns).

I see, yes LTO can deal with this better since it has global
information. In non-LTO mode (including LIPO) we have the issue.

I take it gimp is built with LTO and therefore shouldn't be hitting
this comdat issue?

Let me do a couple things:
- port over my comdat inlining fix from the google branch to trunk and
send it for review. If you or Martin could try it to see if it helps
with function splitting to avoid the hits from the cold code that
would be great
- I'll add some new sanity checking to try to detect non-zero blocks
in the cold section, or 0 blocks reached by non-zero edges and see if
I can flush out any problems with my tests or a profiledbootstrap or
gimp.
- I'll try building and profiling gimp myself to see if I can
reproduce the issue with code executing out of the cold section.

Thanks,
Teresa

>>
>> Also, can you send me reproduction instructions for gimp? I don't
>> think I need Martin's patch, but which version of gimp and what is the
>> equivalent way for me to train it? I have some scripts to generate a
>> similar type of instruction heat map graph that I have been using to
>> tune partitioning and function reordering. Essentially it uses linux
>> perf to sample on instructions_retired and then munge the data in
>> several ways to produce various stats and graphs. One thing that has
>> been useful has been to combine the perf data with nm output to
>> determine which cold functions are being executed at runtime.
>
> Martin?
>
>>
>> However, for this to tell me which split cold bbs are being executed I
>> need to use a patch that Sri sent for review several months back that
>> gives the split cold section its own name:
>>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>> (cc'ing Sri as we should probably revive this patch soon to address
>> gdb and other issues with detecting split functions properly)
>
> Intresting, I used linker script for this purposes, but that his GNU ld only...
>
> Honza
>>
>> Thanks!
>> Teresa
>>
>> >
>> > Honza
>> >>
>> >> Thanks,
>> >> Teresa
>> >>
>> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>> >> >
>> >> > Honza
>> >>
>> >>
>> >>
>> >> --
>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Teresa Johnson Aug. 9, 2013, 9:03 p.m. UTC | #9
On Fri, Aug 9, 2013 at 8:54 AM, Martin Liška <marxin.liska@gmail.com> wrote:
> Hi
>
> On 9 August 2013 17:28, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> > Do we sanity check that the cold partition does not contain any blocks of
>>> > count 0?  It may be that the profile is broken enough to make partitioning
>>> > not work.
>>>
>>> Do you mean sanity check that the cold partition does not contain any
>>> blocks of count > 0? (they should all be zero) I don't think that
>>> sanity check is there, but I can try adding that.
>>
>> Thanks, lets start with this - I suppose we need to figure out if
>>  1) the reachable blocks goes to cold section because partitioning decides
>>     so even if they have non-0 count.
>>  2) the reachable blocks goes to cold section because they have incorrectly
>>     updated count to 0 by someone
>>  3) profiling gets some blocks wrong.
>>>
>>> The issue with such a sanity check may be due to the later fixup I
>>> have in this patch (fixup_partitions). It is invoked after certain
>>> optimizations on the cfg that may make hot blocks previously reached
>>> by both hot and cold edges only reachable by cold blocks. These blocks
>>> are remarked cold. If the profile data hasn't been updated correctly
>>> it is possible that they would still have a non-0 count, although they
>>> are essentially cold after the cfg transformation.
>>
>> Well, or the other posibility is that the edges was updated wrong
>> and the blocks are really cold.  We need to figure out if that happens
>> commonly enough.
>>
>> I will try to think of some artificial testcases.
>>
>>>
>>> But certainly such a sanity check should always succeed after the
>>> original partitioning.
>>>
>>> > I can think of inlining where the count gets scaled all way down to 0.  Perhaps
>>> > count scaling code can be modified to never round towards 0 for block executing
>>> > non-0 times...
>>>
>>> This reminds me of why this situation could happen. When I have been
>>> testing this on the google branch I found situations where COMDAT
>>> routines have 0 profile counts (if I remember correctly, this happens
>>> when profile-gen binary has call to out-of-line copy of COMDAT in
>>> module A, linker chooses the out-of-line copy from module B, therefore
>>> the profile data for COMDAT in module A is 0). When the COMDAT gets
>>> inlined, the 0 counts on its bbs are scaled to 0, even though the
>>> callsite is non-zero. I have a patch that I was planning to send as a
>>> follow-up that handles this case by propagating the callsite bb's
>>> count to the inlined code when it has 0 counts, scaling by the edge
>>> frequencies. I can either include that patch in this one, or send it
>>> for review separately right now. Do you want to give it a try with
>>> this one to see if it addresses the issue?
>>
>> This scenario should not happen with LTO setup: the LTO symbol tables contains
>> code before early optimization and should be identical with profiling or
>> without (modulo the new references and call from profile code).
>>
>> But this patch seems useful as a backup solution for non-LTO, so yes, please
>> send it separately and I can try to double check that it really do not happen
>> with LTO.
>> (acutally LTO symtab may just chose COMDAT from module that has counts with it.
>> It has all the info for it.  I was thinkin about it few weeks back.  It is
>> bit hard to do - you need to verify that all references from the function are
>> the same or linking might fail if you overwrite linker's decisiosns).
>>>
>>> Also, can you send me reproduction instructions for gimp? I don't
>>> think I need Martin's patch, but which version of gimp and what is the
>>> equivalent way for me to train it? I have some scripts to generate a
>>> similar type of instruction heat map graph that I have been using to
>>> tune partitioning and function reordering. Essentially it uses linux
>>> perf to sample on instructions_retired and then munge the data in
>>> several ways to produce various stats and graphs. One thing that has
>>> been useful has been to combine the perf data with nm output to
>>> determine which cold functions are being executed at runtime.
>>
>> Martin?
>
> I use gimp from git repository, commit:
> 88ecd59c3436d302b644a5d25c1938c0e7b60ae0 (from Fet 5 2013)
> Link: http://www.gimp.org/source/#gimp_from_git

Thanks. Were you building with LTO? And just -O2, or any other options
I should use?

Teresa

>
> Martin
>
>>>
>>> However, for this to tell me which split cold bbs are being executed I
>>> need to use a patch that Sri sent for review several months back that
>>> gives the split cold section its own name:
>>>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>>> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>>>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>>> (cc'ing Sri as we should probably revive this patch soon to address
>>> gdb and other issues with detecting split functions properly)
>>
>> Intresting, I used linker script for this purposes, but that his GNU ld only...
>>
>> Honza
>>>
>>> Thanks!
>>> Teresa
>>>
>>> >
>>> > Honza
>>> >>
>>> >> Thanks,
>>> >> Teresa
>>> >>
>>> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>>> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>>> >> >
>>> >> > Honza
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Jan Hubicka Aug. 9, 2013, 10:43 p.m. UTC | #10
> 
> I see, yes LTO can deal with this better since it has global
> information. In non-LTO mode (including LIPO) we have the issue.

Thinking about it, there is still one problem left: I usually suggest
users to train with -fno-lto to avoid excessive linking time with
instrumentation.  This actually will bring differences in between
compile and link decisions of linker.  
We can't even replace one body by another, since inlining done
at training stage may distribute the profile across multiple comdat
copies.  I suppose only way is to extend lto-symtab to actually
merge profiles.  To merge CFG profiles it will need to read in the body.

Martin is working on identical function merging, I suppose once we
implement CFG profile merging for that purpose, we can use it here.
I will try to look into this.
> 
> I take it gimp is built with LTO and therefore shouldn't be hitting
> this comdat issue?

I think gimp is still mostly C program. (did not double check my last
contribution to it is from 90s :)
> 
> Let me do a couple things:
> - port over my comdat inlining fix from the google branch to trunk and
> send it for review. If you or Martin could try it to see if it helps
> with function splitting to avoid the hits from the cold code that
> would be great
> - I'll add some new sanity checking to try to detect non-zero blocks
> in the cold section, or 0 blocks reached by non-zero edges and see if
> I can flush out any problems with my tests or a profiledbootstrap or
> gimp.
> - I'll try building and profiling gimp myself to see if I can
> reproduce the issue with code executing out of the cold section.
> 
> Thanks,
> Teresa
> 
> >>
> >> Also, can you send me reproduction instructions for gimp? I don't
> >> think I need Martin's patch, but which version of gimp and what is the
> >> equivalent way for me to train it? I have some scripts to generate a
> >> similar type of instruction heat map graph that I have been using to
> >> tune partitioning and function reordering. Essentially it uses linux
> >> perf to sample on instructions_retired and then munge the data in
> >> several ways to produce various stats and graphs. One thing that has
> >> been useful has been to combine the perf data with nm output to
> >> determine which cold functions are being executed at runtime.
> >
> > Martin?
> >
> >>
> >> However, for this to tell me which split cold bbs are being executed I
> >> need to use a patch that Sri sent for review several months back that
> >> gives the split cold section its own name:
> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
> >> (cc'ing Sri as we should probably revive this patch soon to address
> >> gdb and other issues with detecting split functions properly)
> >
> > Intresting, I used linker script for this purposes, but that his GNU ld only...
> >
> > Honza
> >>
> >> Thanks!
> >> Teresa
> >>
> >> >
> >> > Honza
> >> >>
> >> >> Thanks,
> >> >> Teresa
> >> >>
> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
> >> >> >
> >> >> > Honza
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >>
> >>
> >>
> >> --
> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Jan Hubicka Aug. 11, 2013, 12:21 p.m. UTC | #11
> 
> I see, yes LTO can deal with this better since it has global
> information. In non-LTO mode (including LIPO) we have the issue.

Either Martin or me will implement merging of the multiple copies at
LTO link time.  This is needed for Martin's code unification patch anyway.

Theoretically gcov runtime can also have symbol names and cfg checksums of
comdats in the static data and at exit produce buckets based on matching
names+checksums+counter counts, merge all data into in each bucket to one
representative by the existing merging routines and then memcpy them to
all the oriignal copiles.  This way all compilation units will receive same
results.

I am not very keen about making gcov runtime bigger and more complex than it
needs to be, but having sane profile for comdats seems quite important.
Perhaps, in GNU toolchain, ordered subsections can be used to make linker to
produce ordered list of comdats, so the runtime won't need to do hashing +
lookups.

Honza
> 
> I take it gimp is built with LTO and therefore shouldn't be hitting
> this comdat issue?
> 
> Let me do a couple things:
> - port over my comdat inlining fix from the google branch to trunk and
> send it for review. If you or Martin could try it to see if it helps
> with function splitting to avoid the hits from the cold code that
> would be great
> - I'll add some new sanity checking to try to detect non-zero blocks
> in the cold section, or 0 blocks reached by non-zero edges and see if
> I can flush out any problems with my tests or a profiledbootstrap or
> gimp.
> - I'll try building and profiling gimp myself to see if I can
> reproduce the issue with code executing out of the cold section.
> 
> Thanks,
> Teresa
> 
> >>
> >> Also, can you send me reproduction instructions for gimp? I don't
> >> think I need Martin's patch, but which version of gimp and what is the
> >> equivalent way for me to train it? I have some scripts to generate a
> >> similar type of instruction heat map graph that I have been using to
> >> tune partitioning and function reordering. Essentially it uses linux
> >> perf to sample on instructions_retired and then munge the data in
> >> several ways to produce various stats and graphs. One thing that has
> >> been useful has been to combine the perf data with nm output to
> >> determine which cold functions are being executed at runtime.
> >
> > Martin?
> >
> >>
> >> However, for this to tell me which split cold bbs are being executed I
> >> need to use a patch that Sri sent for review several months back that
> >> gives the split cold section its own name:
> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
> >> (cc'ing Sri as we should probably revive this patch soon to address
> >> gdb and other issues with detecting split functions properly)
> >
> > Intresting, I used linker script for this purposes, but that his GNU ld only...
> >
> > Honza
> >>
> >> Thanks!
> >> Teresa
> >>
> >> >
> >> > Honza
> >> >>
> >> >> Thanks,
> >> >> Teresa
> >> >>
> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
> >> >> >
> >> >> > Honza
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >>
> >>
> >>
> >> --
> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Teresa Johnson Aug. 11, 2013, 1:25 p.m. UTC | #12
Cc'ing Rong since he is also working on trying to address the comdat
profile issue. Rong, you may need to see an earlier message for more
context:
http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00558.html

Teresa

On Sun, Aug 11, 2013 at 5:21 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>
>> I see, yes LTO can deal with this better since it has global
>> information. In non-LTO mode (including LIPO) we have the issue.
>
> Either Martin or me will implement merging of the multiple copies at
> LTO link time.  This is needed for Martin's code unification patch anyway.
>
> Theoretically gcov runtime can also have symbol names and cfg checksums of
> comdats in the static data and at exit produce buckets based on matching
> names+checksums+counter counts, merge all data into in each bucket to one
> representative by the existing merging routines and then memcpy them to
> all the oriignal copiles.  This way all compilation units will receive same
> results.
>
> I am not very keen about making gcov runtime bigger and more complex than it
> needs to be, but having sane profile for comdats seems quite important.
> Perhaps, in GNU toolchain, ordered subsections can be used to make linker to
> produce ordered list of comdats, so the runtime won't need to do hashing +
> lookups.
>
> Honza
>>
>> I take it gimp is built with LTO and therefore shouldn't be hitting
>> this comdat issue?
>>
>> Let me do a couple things:
>> - port over my comdat inlining fix from the google branch to trunk and
>> send it for review. If you or Martin could try it to see if it helps
>> with function splitting to avoid the hits from the cold code that
>> would be great
>> - I'll add some new sanity checking to try to detect non-zero blocks
>> in the cold section, or 0 blocks reached by non-zero edges and see if
>> I can flush out any problems with my tests or a profiledbootstrap or
>> gimp.
>> - I'll try building and profiling gimp myself to see if I can
>> reproduce the issue with code executing out of the cold section.
>>
>> Thanks,
>> Teresa
>>
>> >>
>> >> Also, can you send me reproduction instructions for gimp? I don't
>> >> think I need Martin's patch, but which version of gimp and what is the
>> >> equivalent way for me to train it? I have some scripts to generate a
>> >> similar type of instruction heat map graph that I have been using to
>> >> tune partitioning and function reordering. Essentially it uses linux
>> >> perf to sample on instructions_retired and then munge the data in
>> >> several ways to produce various stats and graphs. One thing that has
>> >> been useful has been to combine the perf data with nm output to
>> >> determine which cold functions are being executed at runtime.
>> >
>> > Martin?
>> >
>> >>
>> >> However, for this to tell me which split cold bbs are being executed I
>> >> need to use a patch that Sri sent for review several months back that
>> >> gives the split cold section its own name:
>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>> >> (cc'ing Sri as we should probably revive this patch soon to address
>> >> gdb and other issues with detecting split functions properly)
>> >
>> > Intresting, I used linker script for this purposes, but that his GNU ld only...
>> >
>> > Honza
>> >>
>> >> Thanks!
>> >> Teresa
>> >>
>> >> >
>> >> > Honza
>> >> >>
>> >> >> Thanks,
>> >> >> Teresa
>> >> >>
>> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>> >> >> >
>> >> >> > Honza
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>> >>
>> >>
>> >>
>> >> --
>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Martin Liška Aug. 11, 2013, 3:54 p.m. UTC | #13
Hello,
   I did a collection of systemtap graphs for GIMP.

All these graphs were created with enabled LTO, profiling and -O2.

1) gimp-reordered.pdf - function are reordered according to my newly
created profile that utilizes LTO infrastructure
2) gimp-no-top-level-reorder.pdf - (GCC rev. 201648) -fno-top-level-reorder
3) gimp-top-level-reorder.pdf - (GCC rev. 201648) -ftop-level-reorder

Honza has an idea how to minimize hot text section and I will send new
graphs for the proposed patch.
Moreover, I will send graphs for Inkscape which is written in C++.

Have a nice day,
Martin

On 11 August 2013 15:25, Teresa Johnson <tejohnson@google.com> wrote:
> Cc'ing Rong since he is also working on trying to address the comdat
> profile issue. Rong, you may need to see an earlier message for more
> context:
> http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00558.html
>
> Teresa
>
> On Sun, Aug 11, 2013 at 5:21 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>>
>>> I see, yes LTO can deal with this better since it has global
>>> information. In non-LTO mode (including LIPO) we have the issue.
>>
>> Either Martin or me will implement merging of the multiple copies at
>> LTO link time.  This is needed for Martin's code unification patch anyway.
>>
>> Theoretically gcov runtime can also have symbol names and cfg checksums of
>> comdats in the static data and at exit produce buckets based on matching
>> names+checksums+counter counts, merge all data into in each bucket to one
>> representative by the existing merging routines and then memcpy them to
>> all the oriignal copiles.  This way all compilation units will receive same
>> results.
>>
>> I am not very keen about making gcov runtime bigger and more complex than it
>> needs to be, but having sane profile for comdats seems quite important.
>> Perhaps, in GNU toolchain, ordered subsections can be used to make linker to
>> produce ordered list of comdats, so the runtime won't need to do hashing +
>> lookups.
>>
>> Honza
>>>
>>> I take it gimp is built with LTO and therefore shouldn't be hitting
>>> this comdat issue?
>>>
>>> Let me do a couple things:
>>> - port over my comdat inlining fix from the google branch to trunk and
>>> send it for review. If you or Martin could try it to see if it helps
>>> with function splitting to avoid the hits from the cold code that
>>> would be great
>>> - I'll add some new sanity checking to try to detect non-zero blocks
>>> in the cold section, or 0 blocks reached by non-zero edges and see if
>>> I can flush out any problems with my tests or a profiledbootstrap or
>>> gimp.
>>> - I'll try building and profiling gimp myself to see if I can
>>> reproduce the issue with code executing out of the cold section.
>>>
>>> Thanks,
>>> Teresa
>>>
>>> >>
>>> >> Also, can you send me reproduction instructions for gimp? I don't
>>> >> think I need Martin's patch, but which version of gimp and what is the
>>> >> equivalent way for me to train it? I have some scripts to generate a
>>> >> similar type of instruction heat map graph that I have been using to
>>> >> tune partitioning and function reordering. Essentially it uses linux
>>> >> perf to sample on instructions_retired and then munge the data in
>>> >> several ways to produce various stats and graphs. One thing that has
>>> >> been useful has been to combine the perf data with nm output to
>>> >> determine which cold functions are being executed at runtime.
>>> >
>>> > Martin?
>>> >
>>> >>
>>> >> However, for this to tell me which split cold bbs are being executed I
>>> >> need to use a patch that Sri sent for review several months back that
>>> >> gives the split cold section its own name:
>>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
>>> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
>>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
>>> >> (cc'ing Sri as we should probably revive this patch soon to address
>>> >> gdb and other issues with detecting split functions properly)
>>> >
>>> > Intresting, I used linker script for this purposes, but that his GNU ld only...
>>> >
>>> > Honza
>>> >>
>>> >> Thanks!
>>> >> Teresa
>>> >>
>>> >> >
>>> >> > Honza
>>> >> >>
>>> >> >> Thanks,
>>> >> >> Teresa
>>> >> >>
>>> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
>>> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
>>> >> >> >
>>> >> >> > Honza
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> --
>>> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Jan Hubicka Aug. 11, 2013, 5:55 p.m. UTC | #14
> Hello,
>    I did a collection of systemtap graphs for GIMP.
> 
> All these graphs were created with enabled LTO, profiling and -O2.
> 
> 1) gimp-reordered.pdf - function are reordered according to my newly
> created profile that utilizes LTO infrastructure
> 2) gimp-no-top-level-reorder.pdf - (GCC rev. 201648) -fno-top-level-reorder
> 3) gimp-top-level-reorder.pdf - (GCC rev. 201648) -ftop-level-reorder

Thanks for the graphs! 
gimp-top-level-reorder seems to be bogus (it shows accesses into dynstr only).

To catch the -fno-reorder-blocks-partition problem, perhaps you can modify
the Martin's linker script to make .text.unlikely section non-executable.
This way it will crash application every time we jump into it.

Honza
> 
> Honza has an idea how to minimize hot text section and I will send new
> graphs for the proposed patch.
> Moreover, I will send graphs for Inkscape which is written in C++.
> 
> Have a nice day,
> Martin
> 
> On 11 August 2013 15:25, Teresa Johnson <tejohnson@google.com> wrote:
> > Cc'ing Rong since he is also working on trying to address the comdat
> > profile issue. Rong, you may need to see an earlier message for more
> > context:
> > http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00558.html
> >
> > Teresa
> >
> > On Sun, Aug 11, 2013 at 5:21 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >>>
> >>> I see, yes LTO can deal with this better since it has global
> >>> information. In non-LTO mode (including LIPO) we have the issue.
> >>
> >> Either Martin or me will implement merging of the multiple copies at
> >> LTO link time.  This is needed for Martin's code unification patch anyway.
> >>
> >> Theoretically gcov runtime can also have symbol names and cfg checksums of
> >> comdats in the static data and at exit produce buckets based on matching
> >> names+checksums+counter counts, merge all data into in each bucket to one
> >> representative by the existing merging routines and then memcpy them to
> >> all the oriignal copiles.  This way all compilation units will receive same
> >> results.
> >>
> >> I am not very keen about making gcov runtime bigger and more complex than it
> >> needs to be, but having sane profile for comdats seems quite important.
> >> Perhaps, in GNU toolchain, ordered subsections can be used to make linker to
> >> produce ordered list of comdats, so the runtime won't need to do hashing +
> >> lookups.
> >>
> >> Honza
> >>>
> >>> I take it gimp is built with LTO and therefore shouldn't be hitting
> >>> this comdat issue?
> >>>
> >>> Let me do a couple things:
> >>> - port over my comdat inlining fix from the google branch to trunk and
> >>> send it for review. If you or Martin could try it to see if it helps
> >>> with function splitting to avoid the hits from the cold code that
> >>> would be great
> >>> - I'll add some new sanity checking to try to detect non-zero blocks
> >>> in the cold section, or 0 blocks reached by non-zero edges and see if
> >>> I can flush out any problems with my tests or a profiledbootstrap or
> >>> gimp.
> >>> - I'll try building and profiling gimp myself to see if I can
> >>> reproduce the issue with code executing out of the cold section.
> >>>
> >>> Thanks,
> >>> Teresa
> >>>
> >>> >>
> >>> >> Also, can you send me reproduction instructions for gimp? I don't
> >>> >> think I need Martin's patch, but which version of gimp and what is the
> >>> >> equivalent way for me to train it? I have some scripts to generate a
> >>> >> similar type of instruction heat map graph that I have been using to
> >>> >> tune partitioning and function reordering. Essentially it uses linux
> >>> >> perf to sample on instructions_retired and then munge the data in
> >>> >> several ways to produce various stats and graphs. One thing that has
> >>> >> been useful has been to combine the perf data with nm output to
> >>> >> determine which cold functions are being executed at runtime.
> >>> >
> >>> > Martin?
> >>> >
> >>> >>
> >>> >> However, for this to tell me which split cold bbs are being executed I
> >>> >> need to use a patch that Sri sent for review several months back that
> >>> >> gives the split cold section its own name:
> >>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
> >>> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
> >>> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
> >>> >> (cc'ing Sri as we should probably revive this patch soon to address
> >>> >> gdb and other issues with detecting split functions properly)
> >>> >
> >>> > Intresting, I used linker script for this purposes, but that his GNU ld only...
> >>> >
> >>> > Honza
> >>> >>
> >>> >> Thanks!
> >>> >> Teresa
> >>> >>
> >>> >> >
> >>> >> > Honza
> >>> >> >>
> >>> >> >> Thanks,
> >>> >> >> Teresa
> >>> >> >>
> >>> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> >>> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
> >>> >> >> >
> >>> >> >> > Honza
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> --
> >>> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >>>
> >>>
> >>>
> >>> --
> >>> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >
> >
> >
> > --
> > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Jan Hubicka Aug. 11, 2013, 9:05 p.m. UTC | #15
Hi,
thinking about it a bit more, I suppose easiest way is to
1) make separate sets of counters for each comdat and place them
   into comdat section named as DECL_COMDAT_GROUP (node) + cfg_checksum + individual_counter_counts.
   This will make linker to unify the sections for us.
2) extend API of libgcov initialization so multiple counters can be recorded per file.
3) at merging time, gcov needs to merge all comdat section counters into temporary memory, so multiple
   merging won't produce bad results
4) counter streaming will need to be updated to deal with separate comdat sections...
5) probably we will want to update histogram production to avoid counting same comdat many times
   (this can be done by adding an "processed" flag into the per-function sections

I don't see any obvious problems with this plan, just that it is quite some
work.  If you had chance to implement something along these lines, I think it
would help ;))

Honza
> Cc'ing Rong since he is also working on trying to address the comdat
> profile issue. Rong, you may need to see an earlier message for more
> context:
> http://gcc.gnu.org/ml/gcc-patches/2013-08/msg00558.html
> 
> Teresa
> 
> On Sun, Aug 11, 2013 at 5:21 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >>
> >> I see, yes LTO can deal with this better since it has global
> >> information. In non-LTO mode (including LIPO) we have the issue.
> >
> > Either Martin or me will implement merging of the multiple copies at
> > LTO link time.  This is needed for Martin's code unification patch anyway.
> >
> > Theoretically gcov runtime can also have symbol names and cfg checksums of
> > comdats in the static data and at exit produce buckets based on matching
> > names+checksums+counter counts, merge all data into in each bucket to one
> > representative by the existing merging routines and then memcpy them to
> > all the oriignal copiles.  This way all compilation units will receive same
> > results.
> >
> > I am not very keen about making gcov runtime bigger and more complex than it
> > needs to be, but having sane profile for comdats seems quite important.
> > Perhaps, in GNU toolchain, ordered subsections can be used to make linker to
> > produce ordered list of comdats, so the runtime won't need to do hashing +
> > lookups.
> >
> > Honza
> >>
> >> I take it gimp is built with LTO and therefore shouldn't be hitting
> >> this comdat issue?
> >>
> >> Let me do a couple things:
> >> - port over my comdat inlining fix from the google branch to trunk and
> >> send it for review. If you or Martin could try it to see if it helps
> >> with function splitting to avoid the hits from the cold code that
> >> would be great
> >> - I'll add some new sanity checking to try to detect non-zero blocks
> >> in the cold section, or 0 blocks reached by non-zero edges and see if
> >> I can flush out any problems with my tests or a profiledbootstrap or
> >> gimp.
> >> - I'll try building and profiling gimp myself to see if I can
> >> reproduce the issue with code executing out of the cold section.
> >>
> >> Thanks,
> >> Teresa
> >>
> >> >>
> >> >> Also, can you send me reproduction instructions for gimp? I don't
> >> >> think I need Martin's patch, but which version of gimp and what is the
> >> >> equivalent way for me to train it? I have some scripts to generate a
> >> >> similar type of instruction heat map graph that I have been using to
> >> >> tune partitioning and function reordering. Essentially it uses linux
> >> >> perf to sample on instructions_retired and then munge the data in
> >> >> several ways to produce various stats and graphs. One thing that has
> >> >> been useful has been to combine the perf data with nm output to
> >> >> determine which cold functions are being executed at runtime.
> >> >
> >> > Martin?
> >> >
> >> >>
> >> >> However, for this to tell me which split cold bbs are being executed I
> >> >> need to use a patch that Sri sent for review several months back that
> >> >> gives the split cold section its own name:
> >> >>   http://gcc.gnu.org/ml/gcc-patches/2013-04/msg01571.html
> >> >> Steven had some follow up comments that Sri hasn't had a chance to address yet:
> >> >>   http://gcc.gnu.org/ml/gcc-patches/2013-05/msg00798.html
> >> >> (cc'ing Sri as we should probably revive this patch soon to address
> >> >> gdb and other issues with detecting split functions properly)
> >> >
> >> > Intresting, I used linker script for this purposes, but that his GNU ld only...
> >> >
> >> > Honza
> >> >>
> >> >> Thanks!
> >> >> Teresa
> >> >>
> >> >> >
> >> >> > Honza
> >> >> >>
> >> >> >> Thanks,
> >> >> >> Teresa
> >> >> >>
> >> >> >> > I think we are really looking primarily for dead parts of the functions (sanity checks/error handling)
> >> >> >> > that should not be visited by train run.  We can then see how to make the heuristic more aggressive?
> >> >> >> >
> >> >> >> > Honza
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >> >>
> >> >>
> >> >>
> >> >> --
> >> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> >>
> >>
> >>
> >> --
> >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
> 
> 
> 
> -- 
> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
diff mbox

Patch

Index: cfgrtl.c
===================================================================
--- cfgrtl.c    (revision 201461)
+++ cfgrtl.c    (working copy)
@@ -1341,6 +1341,43 @@  fixup_partition_crossing (edge e)
     }
 }

+/* Called when block BB has been reassigned to the cold partition,
+   because it is now dominated by another cold block,
+   to ensure that the region crossing attributes are updated.  */
+
+static void
+fixup_new_cold_bb (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+
+  /* This is called when a hot bb is found to now be dominated
+     by a cold bb and therefore needs to become cold. Therefore,
+     its preds will no longer be region crossing. Any non-dominating
+     preds that were previously hot would also have become cold
+     in the caller for the same region. Any preds that were previously
+     region-crossing will be adjusted in fixup_partition_crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      fixup_partition_crossing (e);
+    }
+
+  /* Possibly need to make bb's successor edges region crossing,
+     or remove stale region crossing.  */
+  FOR_EACH_EDGE (e, ei, bb->succs)
+    {
+      /* We can't have fall-through edges across partition boundaries.
+         Note that force_nonfallthru will do any necessary partition
+         boundary fixup by calling fixup_partition_crossing itself.  */
+      if ((e->flags & EDGE_FALLTHRU)
+          && BB_PARTITION (bb) != BB_PARTITION (e->dest)
+          && e->dest != EXIT_BLOCK_PTR)
+        force_nonfallthru (e);
+      else
+        fixup_partition_crossing (e);
+    }
+}
+
 /* Attempt to change code to redirect edge E to TARGET.  Don't do that on
    expense of adding new instructions or reordering basic blocks.

@@ -1979,6 +2016,14 @@  commit_edge_insertions (void)
 {
   basic_block bb;

+  /* Optimization passes that invoke this routine can cause hot blocks
+     previously reached by both hot and cold blocks to become dominated only
+     by cold blocks. This will cause the verification below to fail,
+     and lead to now cold code in the hot section. In some cases this
+     may only be visible after newly unreachable blocks are deleted,
+     which will be done by fixup_partitions.  */
+  fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
   verify_flow_info ();
 #endif
@@ -2173,6 +2218,101 @@  get_last_bb_insn (basic_block bb)
   return end;
 }

+/* Sanity check partition hotness to ensure that basic blocks in
+   the cold partition don't dominate basic blocks in the hot partition.
+   If FLAG_ONLY is true, report violations as errors. Otherwise
+   re-mark the dominated blocks as cold, since this is run after
+   cfg optimizations that may make hot blocks previously reached
+   by both hot and cold blocks now only reachable along cold paths.  */
+
+static vec<basic_block>
+find_partition_fixes (bool flag_only)
+{
+  basic_block bb;
+  vec<basic_block> bbs_in_cold_partition = vNULL;
+  vec<basic_block> bbs_to_fix = vNULL;
+
+  /* Callers check this.  */
+  gcc_checking_assert (crtl->has_bb_partition);
+
+  FOR_EACH_BB (bb)
+    if ((BB_PARTITION (bb) == BB_COLD_PARTITION))
+      bbs_in_cold_partition.safe_push (bb);
+
+  if (bbs_in_cold_partition.is_empty ())
+    return vNULL;
+
+  bool dom_calculated_here = !dom_info_available_p (CDI_DOMINATORS);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (CDI_DOMINATORS);
+
+  while (! bbs_in_cold_partition.is_empty  ())
+    {
+      bb = bbs_in_cold_partition.pop ();
+      /* Any blocks dominated by a block in the cold section
+         must also be cold.  */
+      basic_block son;
+      for (son = first_dom_son (CDI_DOMINATORS, bb);
+           son;
+           son = next_dom_son (CDI_DOMINATORS, son))
+        {
+          /* If son is not yet cold, then mark it cold here and
+             enqueue it for further processing.  */
+          if ((BB_PARTITION (son) != BB_COLD_PARTITION))
+            {
+              if (flag_only)
+                error ("non-cold basic block %d dominated "
+                       "by a block in the cold partition", son->index);
+              else
+                BB_SET_PARTITION (son, BB_COLD_PARTITION);
+              bbs_to_fix.safe_push (son);
+              bbs_in_cold_partition.safe_push (son);
+            }
+        }
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (CDI_DOMINATORS);
+
+  return bbs_to_fix;
+}
+
+/* Perform cleanup on the hot/cold bb partitioning after optimization
+   passes that modify the cfg.  */
+
+void
+fixup_partitions (void)
+{
+  basic_block bb;
+
+  if (!crtl->has_bb_partition)
+    return;
+
+  /* Delete any blocks that became unreachable and weren't
+     already cleaned up, for example during edge forwarding
+     and convert_jumps_to_returns. This will expose more
+     opportunities for fixing the partition boundaries here.
+     Also, the calculation of the dominance graph during verification
+     will assert if there are unreachable nodes.  */
+  delete_unreachable_blocks ();
+
+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.
+     Fixup any that now violate this requirement, as a result of edge
+     forwarding and unreachable block deletion.  */
+  vec<basic_block> bbs_to_fix = find_partition_fixes (false);
+
+  /* Do the partition fixup after all necessary blocks have been converted to
+     cold, so that we only update the region crossings the minimum number of
+     places, which can require forcing edges to be non fallthru.  */
+  while (! bbs_to_fix.is_empty ())
+    {
+      bb = bbs_to_fix.pop ();
+      fixup_new_cold_bb (bb);
+    }
+}
+
 /* Verify, in the basic block chain, that there is at most one switch
    between hot/cold partitions. This condition will not be true until
    after reorder_basic_blocks is called.  */
@@ -2219,7 +2359,8 @@  verify_hot_cold_block_grouping (void)
 /* Perform several checks on the edges out of each block, such as
    the consistency of the branch probabilities, the correctness
    of hot/cold partition crossing edges, and the number of expected
-   successor edges.  */
+   successor edges.  Also verify that the dominance relationship
+   between hot/cold blocks is sane.  */

 static int
 rtl_verify_edges (void)
@@ -2382,6 +2523,14 @@  rtl_verify_edges (void)
        }
     }

+  /* If there are partitions, do a sanity check on them: A basic block in
+     a cold partition cannot dominate a basic block in a hot partition.  */
+  if (crtl->has_bb_partition && !err)
+    {
+      vec<basic_block> bbs_to_fix = find_partition_fixes (true);
+      err = !bbs_to_fix.is_empty ();
+    }
+
   /* Clean up.  */
   return err;
 }
@@ -2515,7 +2664,7 @@  rtl_verify_bb_pointers (void)
      and NOTE_INSN_BASIC_BLOCK
    - verify that no fall_thru edge crosses hot/cold partition boundaries
    - verify that there are no pending RTL branch predictions
-   - verify that there is a single hot/cold partition boundary after bbro
+   - verify that hot blocks are not dominated by cold blocks

    In future it can be extended check a lot of other stuff as well
    (reachability of basic blocks, life information, etc. etc.).  */
@@ -2761,7 +2910,8 @@  rtl_verify_bb_layout (void)
    - check that all insns are in the basic blocks
      (except the switch handling code, barriers and notes)
    - check that all returns are followed by barriers
-   - check that all fallthru edge points to the adjacent blocks.  */
+   - check that all fallthru edge points to the adjacent blocks
+   - verify that there is a single hot/cold partition boundary after bbro  */

 static int
 rtl_verify_flow_info (void)
Index: basic-block.h
===================================================================
--- basic-block.h       (revision 201461)
+++ basic-block.h       (working copy)
@@ -797,6 +797,7 @@  extern bool contains_no_active_insn_p (const_basic
 extern bool forwarder_block_p (const_basic_block);
 extern bool can_fallthru (basic_block, basic_block);
 extern void emit_barrier_after_bb (basic_block bb);
+extern void fixup_partitions (void);

 /* In cfgbuild.c.  */
 extern void find_many_sub_basic_blocks (sbitmap);
Index: cfgcleanup.c
===================================================================
--- cfgcleanup.c        (revision 201461)
+++ cfgcleanup.c        (working copy)
@@ -2807,10 +2807,21 @@  try_optimize_cfg (int mode)
              df_analyze ();
            }

+         if (changed)
+            {
+              /* Edge forwarding in particular can cause hot blocks previously
+                 reached by both hot and cold blocks to become dominated only
+                 by cold blocks. This will cause the verification
below to fail,
+                 and lead to now cold code in the hot section. This is not easy
+                 to detect and fix during edge forwarding, and in some cases
+                 is only visible after newly unreachable blocks are deleted,
+                 which will be done in fixup_partitions.  */
+              fixup_partitions ();
+
 #ifdef ENABLE_CHECKING
-         if (changed)
-           verify_flow_info ();
+              verify_flow_info ();
 #endif
+            }

          changed_overall |= changed;
          first_pass = false;
Index: bb-reorder.c
===================================================================
--- bb-reorder.c        (revision 201461)
+++ bb-reorder.c        (working copy)
@@ -1444,6 +1444,57 @@  fix_up_crossing_landing_pad (eh_landing_pad old_lp
       ei_next (&ei);
 }

+
+/* Ensure that no cold bbs dominate hot bbs along the dominance or
+   post-dominance DIR, for example as a result of edge weight insanities.
+   Returns the updated value of COLD_BB_COUNT and adds newly-hot bbs
+   to BBS_IN_HOT_PARTITION.  */
+
+static unsigned int
+sanitize_dominator_hotness (enum cdi_direction dir, unsigned int cold_bb_count,
+                            vec<basic_block> *bbs_in_hot_partition)
+{
+  /* Callers check this.  */
+  gcc_checking_assert (cold_bb_count);
+
+  bool dom_calculated_here = !dom_info_available_p (dir);
+
+  if (dom_calculated_here)
+    calculate_dominance_info (dir);
+
+  /* Keep examining hot bbs while we still have some left to check
+     and there are remaining cold bbs.  */
+  vec<basic_block> hot_bbs_to_check = bbs_in_hot_partition->copy ();
+  while (! hot_bbs_to_check.is_empty ()
+         && cold_bb_count)
+    {
+      basic_block bb = hot_bbs_to_check.pop ();
+      basic_block dom_bb = get_immediate_dominator (dir, bb);
+
+      /* If bb's immediate dominator is also hot (or unpartitioned,
+         e.g. the entry block) then it is ok. If it is cold, it
+         needs to be adjusted.  */
+      if (BB_PARTITION (dom_bb) != BB_COLD_PARTITION)
+        continue;
+
+      /* We have a hot bb with an immediate dominator that is cold.
+         The dominator needs to be re-marked hot.  */
+      BB_SET_PARTITION (dom_bb, BB_HOT_PARTITION);
+      cold_bb_count--;
+
+      /* Now we need to examine newly-hot dom_bb to see if it is also
+         dominated by a cold bb.  */
+      bbs_in_hot_partition->safe_push (dom_bb);
+      hot_bbs_to_check.safe_push (dom_bb);
+    }
+
+  if (dom_calculated_here)
+    free_dominance_info (dir);
+
+  return cold_bb_count;
+}
+
+
 /* Find the basic blocks that are rarely executed and need to be moved to
    a separate section of the .o file (to cut down on paging and improve
    cache locality).  Return a vector of all edges that cross.  */
@@ -1455,16 +1506,42 @@  find_rarely_executed_basic_blocks_and_crossing_edg
   basic_block bb;
   edge e;
   edge_iterator ei;
+  unsigned int cold_bb_count = 0;
+  vec<basic_block> bbs_in_hot_partition = vNULL;

   /* Mark which partition (hot/cold) each basic block belongs in.  */
   FOR_EACH_BB (bb)
     {
       if (probably_never_executed_bb_p (cfun, bb))
-       BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_COLD_PARTITION);
+          cold_bb_count++;
+        }
       else
-       BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+        {
+          BB_SET_PARTITION (bb, BB_HOT_PARTITION);
+          bbs_in_hot_partition.safe_push (bb);
+        }
     }

+  /* Ensure that no cold bbs dominate hot bbs. This could happen as a result of
+     several different possibilities. One is that there are edge
weight insanities
+     due to optimization phases that do not properly update basic block profile
+     counts. The second is that the entry of the function may not be
hot, because
+     it is entered fewer times than the number of profile training
runs, but there
+     is a loop inside the function that causes blocks within the function to be
+     above the threshold for hotness. Then do the same along the post-dominator
+     tree (which could have additional changes required after fixing up
+     dominators).  */
+  if (cold_bb_count)
+    cold_bb_count = sanitize_dominator_hotness (CDI_DOMINATORS,
+                                                cold_bb_count,
+                                                &bbs_in_hot_partition);
+  if (cold_bb_count)
+    sanitize_dominator_hotness (CDI_POST_DOMINATORS,
+                                cold_bb_count,
+                                &bbs_in_hot_partition);
+