Message ID: 20191204215228.fuuywt3ef3uiqswh@kam.mff.cuni.cz
State: New
Series: Add -fpartial-profile-training
On Wed, Dec 4, 2019 at 10:52 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hi,
> with recent fixes to profile updating I noticed that we get more regressions
> compared to GCC 9 in Firefox testing.  This is because the Firefox train run
> does not cover all the benchmarks, and GCC 9, thanks to updating bugs,
> sometimes optimized code for speed even if it was not trained.
>
> While in general one should have a reasonable train run, in some cases it is
> not practical to do so.  For example, the skia library has optimized vector
> code for different ISAs, and thus Firefox renders quickly only if it is
> trained on the same CPU it runs on.
>
> This patch adds the flag -fprofile-partial-training, which makes GCC optimize
> untrained functions as if -fprofile-use was not given.  This nullifies the
> code size improvements of FDO, but can be used in cases where full training
> is not quite possible (and one can use it only on portions of programs).
>
> Previously the only good answer was to disable profiling for a given
> function, but that needs to be done quite precisely and in general is hard
> to arrange.
>
> The patch works by
> 1) not setting PROFILE_READ for functions with entry count 0
> 2) making the inliner and ipa-cp drop the profile to a local one when all
>    trained executions are redirected to clones
> 3) reducing the quality of branch probabilities of branches leading to
>    never-executed regions to GUESSED.  This is necessary to prevent GCC
>    from propagating things back.
>
> Bootstrapped/regtested x86_64-linux.  I plan to commit it tomorrow if there
> are no complaints.  Feedback is welcome!

I wonder if the behavior shouldn't be the default?  The only thing we lose
is failing to notice really cold calls (error paths) in programs?

Richard.

> Honza
>
> 	* cgraphclones.c (localize_profile): New function.
> 	(cgraph_node::create_clone): Use it for partial profiles.
> 	* common.opt (fprofile-partial-training): New flag.
> 	* doc/invoke.texi (-fprofile-partial-training): Document.
> 	* ipa-cp.c (update_profiling_info): For partial profiles do not
> 	set function profile to zero.
> 	* profile.c (compute_branch_probabilities): With partial profile
> 	watch if edge count is zero and turn all probabilities to guessed.
> 	(compute_branch_probabilities): For partial profiles do not apply
> 	profile when entry count is zero.
> 	* tree-profile.c (tree_profiling): Only do value_profile_transformations
> 	when profile is read.
>
> Index: cgraphclones.c
> ===================================================================
> --- cgraphclones.c	(revision 278944)
> +++ cgraphclones.c	(working copy)
> @@ -307,6 +307,22 @@ dump_callgraph_transformation (const cgr
>      }
>  }
>
> +/* Turn profile of N to local profile.   */
> +
> +static void
> +localize_profile (cgraph_node *n)
> +{
> +  n->count = n->count.guessed_local ();
> +  for (cgraph_edge *e = n->callees; e; e=e->next_callee)
> +    {
> +      e->count = e->count.guessed_local ();
> +      if (!e->inline_failed)
> +	localize_profile (e->callee);
> +    }
> +  for (cgraph_edge *e = n->indirect_calls; e; e=e->next_callee)
> +    e->count = e->count.guessed_local ();
> +}
> +
>  /* Create node representing clone of N executed COUNT times.  Decrease
>     the execution counts from original node too.
>     The new clone will have decl set to DECL that may or may not be the same
> @@ -340,6 +356,7 @@ cgraph_node::create_clone (tree new_decl
>    cgraph_edge *e;
>    unsigned i;
>    profile_count old_count = count;
> +  bool nonzero = count.ipa ().nonzero_p ();
>
>    if (new_inlined_to)
>      dump_callgraph_transformation (this, new_inlined_to, "inlining to");
> @@ -426,6 +446,15 @@ cgraph_node::create_clone (tree new_decl
>
>    if (call_duplication_hook)
>      symtab->call_cgraph_duplication_hooks (this, new_node);
> +  /* With partial train run we do not want to assume that original's
> +     count is zero whenever we redurect all executed edges to clone.
> +     Simply drop profile to local one in this case.  */
> +  if (update_original
> +      && opt_for_fn (decl, flag_partial_profile_training)
> +      && nonzero
> +      && count.ipa_p ()
> +      && !count.ipa ().nonzero_p ())
> +    localize_profile (this);
>
>    if (!new_inlined_to)
>      dump_callgraph_transformation (this, new_node, suffix);
> Index: common.opt
> ===================================================================
> --- common.opt	(revision 278944)
> +++ common.opt	(working copy)
> @@ -2160,6 +2160,10 @@ fprofile-generate=
>  Common Joined RejectNegative
>  Enable common options for generating profile info for profile feedback directed optimizations, and set -fprofile-dir=.
>
> +fprofile-partial-training
> +Common Report Var(flag_partial_profile_training) Optimization
> +Do not assume that functions never executed during the train run are cold
> +
>  fprofile-use
>  Common Var(flag_profile_use)
>  Enable common options for performing profile feedback directed optimizations.
> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi	(revision 278944)
> +++ doc/invoke.texi	(working copy)
> @@ -453,8 +453,8 @@ Objective-C and Objective-C++ Dialects}.
>  -fpartial-inlining -fpeel-loops -fpredictive-commoning @gol
>  -fprefetch-loop-arrays @gol
>  -fprofile-correction @gol
> --fprofile-use -fprofile-use=@var{path} -fprofile-values @gol
> --fprofile-reorder-functions @gol
> +-fprofile-use -fprofile-use=@var{path} -fprofile-partial-training @gol
> +-fprofile-values -fprofile-reorder-functions @gol
>  -freciprocal-math -free -frename-registers -freorder-blocks @gol
>  -freorder-blocks-algorithm=@var{algorithm} @gol
>  -freorder-blocks-and-partition -freorder-functions @gol
> @@ -10634,6 +10634,17 @@ default, GCC emits an error message when
>
>  This option is enabled by @option{-fauto-profile}.
>
> +@item -fprofile-partial-training
> +@opindex fprofile-use
> +With @code{-fprofile-use} all portions of programs not executed during train
> +run are optimized agressively for size rather than speed.  In some cases it is not
> +practical to train all possible paths hot paths in the program. (For example
> +program may contain functions specific for a given hardware and trianing may
> +not cover all hardware configurations program is run on.)  With
> +@code{-fprofile-partial-training} profile feedback will be ignored for all
> +functions not executed during the train run leading them to be optimized as
> +if they were compiled without profile feedback.
> +
>  @item -fprofile-use
>  @itemx -fprofile-use=@var{path}
>  @opindex fprofile-use
> Index: ipa-cp.c
> ===================================================================
> --- ipa-cp.c	(revision 278944)
> +++ ipa-cp.c	(working copy)
> @@ -4295,6 +4295,15 @@ update_profiling_info (struct cgraph_nod
>
>    remainder = orig_node_count.combine_with_ipa_count (orig_node_count.ipa ()
>  						      - new_sum.ipa ());
> +
> +  /* With partial train run we do not want to assume that original's
> +     count is zero whenever we redurect all executed edges to clone.
> +     Simply drop profile to local one in this case.  */
> +  if (remainder.ipa_p () && !remainder.ipa ().nonzero_p ()
> +      && orig_node->count.ipa_p () && orig_node->count.ipa ().nonzero_p ()
> +      && flag_partial_profile_training)
> +    remainder = remainder.guessed_local ();
> +
>    new_sum = orig_node_count.combine_with_ipa_count (new_sum);
>    new_node->count = new_sum;
>    orig_node->count = remainder;
> Index: profile.c
> ===================================================================
> --- profile.c	(revision 278944)
> +++ profile.c	(working copy)
> @@ -635,9 +635,20 @@ compute_branch_probabilities (unsigned c
>  	    }
>  	  if (bb_gcov_count (bb))
>  	    {
> +	      bool set_to_guessed = false;
>  	      FOR_EACH_EDGE (e, ei, bb->succs)
> -		e->probability = profile_probability::probability_in_gcov_type
> -		    (edge_gcov_count (e), bb_gcov_count (bb));
> +		{
> +		  bool prev_never = e->probability == profile_probability::never ();
> +		  e->probability = profile_probability::probability_in_gcov_type
> +		      (edge_gcov_count (e), bb_gcov_count (bb));
> +		  if (e->probability == profile_probability::never ()
> +		      && !prev_never
> +		      && flag_partial_profile_training)
> +		    set_to_guessed = true;
> +		}
> +	      if (set_to_guessed)
> +		FOR_EACH_EDGE (e, ei, bb->succs)
> +		  e->probability = e->probability.guessed ();
>  	      if (bb->index >= NUM_FIXED_BLOCKS
>  		  && block_ends_with_condjump_p (bb)
>  		  && EDGE_COUNT (bb->succs) >= 2)
> @@ -697,17 +708,23 @@ compute_branch_probabilities (unsigned c
>  	}
>      }
>
> -  if (exec_counts)
> +  if (exec_counts
> +      && (bb_gcov_count (ENTRY_BLOCK_PTR_FOR_FN (cfun))
> +	  || !flag_partial_profile_training))
>      profile_status_for_fn (cfun) = PROFILE_READ;
>
>    /* If we have real data, use them!  */
>    if (bb_gcov_count (ENTRY_BLOCK_PTR_FOR_FN (cfun))
>        || !flag_guess_branch_prob)
>      FOR_ALL_BB_FN (bb, cfun)
> -      bb->count = profile_count::from_gcov_type (bb_gcov_count (bb));
> +      if (bb_gcov_count (bb) || !flag_partial_profile_training)
> +	bb->count = profile_count::from_gcov_type (bb_gcov_count (bb));
> +      else
> +	bb->count = profile_count::guessed_zero ();
>    /* If function was not trained, preserve local estimates including statically
>       determined zero counts.  */
> -  else if (profile_status_for_fn (cfun) == PROFILE_READ)
> +  else if (profile_status_for_fn (cfun) == PROFILE_READ
> +	   && !flag_partial_profile_training)
>      FOR_ALL_BB_FN (bb, cfun)
>        if (!(bb->count == profile_count::zero ()))
>  	bb->count = bb->count.global0 ();
> @@ -1417,7 +1434,7 @@ branch_prob (bool thunk)
>    /* At this moment we have precise loop iteration count estimates.
>       Record them to loop structure before the profile gets out of date.  */
>    FOR_EACH_LOOP (loop, 0)
> -    if (loop->header->count > 0)
> +    if (loop->header->count > 0 && loop->header->count.reliable_p ())
>        {
>  	gcov_type nit = expected_loop_iterations_unbounded (loop);
>  	widest_int bound = gcov_type_to_wide_int (nit);
> Index: tree-profile.c
> ===================================================================
> --- tree-profile.c	(revision 278944)
> +++ tree-profile.c	(working copy)
> @@ -785,7 +785,8 @@ tree_profiling (void)
>    if (flag_branch_probabilities
>        && !thunk
>        && flag_profile_values
> -      && flag_value_profile_transformations)
> +      && flag_value_profile_transformations
> +      && profile_status_for_fn (cfun) == PROFILE_READ)
>      gimple_value_profile_transformations ();
>
>    /* The above could hose dominator info.  Currently there is
On 12/5/19 1:30 PM, Richard Biener wrote:
> I wonder if the behavior shouldn't be the default?  The only thing we lose
> is failing to notice really cold calls (error paths) in programs?

I would also consider enabling that by default.

I'm sending a language correction for the option documentation:

diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 26a444ac7b2..130529dece1 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -10637,10 +10637,10 @@ This option is enabled by @option{-fauto-profile}.
 @item -fprofile-partial-training
 @opindex fprofile-use
 With @code{-fprofile-use} all portions of programs not executed during train
-run are optimized agressively for size rather than speed. In some cases it is not
+run are optimized aggressively for size rather than speed. In some cases it is not
 practical to train all possible paths hot paths in the program. (For example
-program may contain functions specific for a given hardware and trianing may
-not cover all hardware configurations program is run on.) With
+a program may contain functions specific for a given hardware and training may
+not cover all hardware configurations program can run on). With
 @code{-fprofile-partial-training} profile feedback will be ignored for all
 functions not executed during the train run leading them to be optimized as
 if they were compiled without profile feedback.

Martin
On Thu, Dec 5, 2019 at 1:41 PM Martin Liška <mliska@suse.cz> wrote:
>
> On 12/5/19 1:30 PM, Richard Biener wrote:
> > I wonder if the behavior shouldn't be the default?  The only thing we lose
> > is failing to notice really cold calls (error paths) in programs?
>
> I would also consider enabling that by default.

So I'd add the "reverse" option -fconsider-unprofiled-functions-cold or so.

Your proposed change makes functions not executed during profiling behave
as if the function were built without -fprofile-generate for training but
with -fprofile-use later?  Documentation should somehow relate behavior
to that.

Richard.

> I'm sending a language correction for the option documentation:
>
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 26a444ac7b2..130529dece1 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -10637,10 +10637,10 @@ This option is enabled by @option{-fauto-profile}.
>  @item -fprofile-partial-training
>  @opindex fprofile-use
>  With @code{-fprofile-use} all portions of programs not executed during train
> -run are optimized agressively for size rather than speed. In some cases it is not
> +run are optimized aggressively for size rather than speed. In some cases it is not
>  practical to train all possible paths hot paths in the program. (For example
> -program may contain functions specific for a given hardware and trianing may
> -not cover all hardware configurations program is run on.) With
> +a program may contain functions specific for a given hardware and training may
> +not cover all hardware configurations program can run on). With
>  @code{-fprofile-partial-training} profile feedback will be ignored for all
>  functions not executed during the train run leading them to be optimized as
>  if they were compiled without profile feedback.
>
> Martin
On Wed, Dec 04, 2019 at 10:52:28PM +0100, Jan Hubicka wrote:
> 	* cgraphclones.c (localize_profile): New function.
> 	(cgraph_node::create_clone): Use it for partial profiles.
> 	* common.opt (fprofile-partial-training): New flag.

This FAILs everywhere, with:

Running /usr/src/gcc/gcc/testsuite/gcc.misc-tests/help.exp ...
FAIL: compiler driver --help=common option(s): "^ +-.*[^:.]$" absent from output: "  -fprofile-partial-training  Do not assume that functions never executed during the train run are cold"
FAIL: compiler driver --help=optimizers option(s): "^ +-.*[^:.]$" absent from output: "  -fprofile-partial-training  Do not assume that functions never executed during the train run are cold"

Fixed thusly, tested on x86_64-linux, committed to trunk as obvious:

2019-12-06  Jakub Jelinek  <jakub@redhat.com>

	* common.opt (fprofile-partial-training): Terminate description
	with full stop.

--- gcc/common.opt.jj	2019-12-06 00:40:46.096605346 +0100
+++ gcc/common.opt	2019-12-06 01:24:22.825265282 +0100
@@ -2162,7 +2162,7 @@ Enable common options for generating pro
 
 fprofile-partial-training
 Common Report Var(flag_profile_partial_training) Optimization
-Do not assume that functions never executed during the train run are cold
+Do not assume that functions never executed during the train run are cold.
 
 fprofile-use
 Common Var(flag_profile_use)

	Jakub