Message ID | CAAe5K+XoZMW09BCuC31k1D=xXJj9KMnzThgxcFSEzmvMNg9hnw@mail.gmail.com |
---|---|
State | New |
Headers | show |
On Thu, Dec 19, 2013 at 10:19 PM, Teresa Johnson <tejohnson@google.com> wrote: > On Thu, Dec 12, 2013 at 5:13 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >>> On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote: >>> > Hello, >>> > I prepared a collection of systemtap graphs for GIMP. >>> > >>> > 1) just my profile-based function reordering: 550 pages >>> > 2) just -freorder-blocks-and-partitions: 646 pages >>> > 3) just -fno-reorder-blocks-and-partitions: 638 pages >>> > >>> > Please see attached data. >>> >>> Thanks for the data. A few observations/questions: >>> >>> With both 1) (your (time-based?) reordering) and 2) >>> (-freorder-blocks-and-partitions) there are a fair amount of accesses >>> out of the cold section. I'm not seeing so many accesses out of the >>> cold section in the apps I am looking at with splitting enabled. In >> >> I see you already comitted the patch, so perhaps Martin's measurement assume >> the pass is off by default? >> >> I rebuilded GCC with profiledboostrap and with the linkerscript unmapping >> text.unlikely. I get ICE in: >> (gdb) bt >> #0 diagnostic_set_caret_max_width(diagnostic_context*, int) () at ../../gcc/diagnostic.c:108 >> #1 0x0000000000f68457 in diagnostic_initialize (context=0x18ae000 <global_diagnostic_context>, n_opts=n_opts@entry=1290) at ../../gcc/diagnostic.c:135 >> #2 0x000000000100050e in general_init (argv0=<optimized out>) at ../../gcc/toplev.c:1110 >> #3 toplev_main(int, char**) () at ../../gcc/toplev.c:1922 >> #4 0x00007ffff774cbe5 in __libc_start_main () from /lib64/libc.so.6 >> #5 0x0000000000f7898d in _start () at ../sysdeps/x86_64/start.S:122 >> >> That is relatively early in startup process. The function seems inlined and >> it fails only on second invocation, did not have time to investigate further, >> yet while without -fprofile-use it starts... > > I'll see if I can reproduce this and investigate, although at this > point that might have to wait until after my holiday vacation. I tried the linkerscript with cpu2006 and got quite a lot of failures (using the ref inputs to train, so the behavior should be the same in both profile-gen and profile-use). I investigated the one in bzip2 and found an issue that may not be easy to fix and is perhaps something it is not worth fixing. The case was essentially the following: Function foo was called by callsites A and B, with call counts 148122615 and 18, respectively. Within function foo, there was a basic block that had a very low count (compared to the entry bb count of 148122633), and therefore a 0 frequency: ;; basic block 6, loop depth 0, count 18, freq 0 The ipa inliner decided to inline into callsite A but not B. Because the vast majority of the call count was from callsite A, when we performed execute_fixup_cfg after doing the inline transformation, the count_scale is 0 and the out-of-line copy of foo's blocks all got counts 0. However, most of the bbs still had non-zero frequencies. But bb 6 ended up with a count and frequency of 0, leading us to split it out. It turns out that at least one of the 18 counts for this block were from callsite B, and we ended up trying to execute the split bb in the out-of-line copy from that callsite. I can think of a couple of ways to prevent this to happen (e.g. have execute_fixup_cfg give the block a count or frequency of 1 instead of 0, or mark the bb somehow as not eligible for splitting due to a low confidence in the 0 count/frequency), but they seem a little hacky. I am thinking that the splitting here is something we shouldn't worry about - it is so close to 0 count that the occasional jump to the split section caused by the lack of precision in the frequency is not a big deal. Unfortunately, without fixing this I can't use the linker script without disabling inlining to avoid this problem. I reran cpu2006 with the linker script but with -fno-inline and got 6 more benchmarks to pass. So there are other issues out there. I will take a look at another one and see if it is a similar scaling/precision issue. I'm thinking that I may just use my heatmap scripts (which leverages perf-events profiles) to see if there is any significant execution in the split cold sections, since it seems we can't realistically prevent any and all execution of the cold split sections, and that is more meaningful anyway. Teresa > >> >> On our periodic testers I see off-noise improvement in crafty 2200->2300 >> and regression on Vortex, 2900->2800, plus code size increase. > > I had only run cpu2006, but not cpu2000. I'll see if I can reproduce > this as well. > > I have been investigating a few places where I saw accesses in the > cold split regions in internal benchmarks. Here are a couple, and how > I have addressed them so far: > > 1) loop unswitching > > In this case, loop unswitching hoisted a branch from within the loop > to outside the loop, and in doing so it was effectively speculated > above several other branches. In it's original location it always went > to only one of the successors (biased 0/100%). But when it was hoisted > it sometimes took the previously 0% path. This led to executing out of > the cold region, since we didn't update the branch probability when > hoisting. I worked around this by assigning a small non-zero > probability after hoisting with the following change: > > Index: tree-ssa-loop-unswitch.c > =================================================================== > --- tree-ssa-loop-unswitch.c (revision 205590) > +++ tree-ssa-loop-unswitch.c (working copy) > @@ -384,6 +384,8 @@ tree_unswitch_loop (struct loop *loop, > > extract_true_false_edges_from_block (unswitch_on, &edge_true, &edge_false); > prob_true = edge_true->probability; > + if (!prob_true) > + prob_true = REG_BR_PROB_BASE/10; > return loop_version (loop, unshare_expr (cond), > NULL, prob_true, prob_true, > REG_BR_PROB_BASE - prob_true, false); > > This should probably be refined (if prob_true is 100% we want to > assign a small non-zero probability to the false path), and 10% may be > too high (maybe give it 1%?). > > 2) More COMDAT issues > > My earlier patch handled the case where the comdat had 0 counts since > the linker kept the copy in a different module. In that case we > prevent the guessed frequencies from being dropped by counts_to_freq, > and then later mark any reached via non-zero callgraph edges as > guessed. Finally, when one such 0-count comdat is inlined the call > count is propagated to the callee blocks using the guessed > probabilities. > > However, in this case, there was a comdat that had a very small > non-zero count, that was being inlined to a much hotter callsite. I > believe this could happen when there was a copy that was ipa-inlined > in the profile gen compile, so the copy in that module gets some > non-zero counts from the ipa inlined instance, but when the out of > line copy was eliminated by the linker (selected from a different > module). In this case the inliner was scaling the bb counts up quite a > lot when inlining. The problem is that you most likely can't trust > that the 0 count bbs in such a case are really not executed by the > callsite it is being into, since the counts are very small and > correspond to a different callsite. > > The problem is how to address this. We can't simply suppress > counts_to_freq from overwriting the guessed frequencies in this case, > since the profile counts are non-zero and would not match the guessed > probabilities. But we can't figure out which are called by much hotter > callsites (compared to their entry count) until later when the > callgraph is built, which is when we would know that we want to ignore > the profile counts and use the guessed probabilities instead. The > solution I came up with is to allow the profile counts to overwrite > the guessed probabilites in counts_to_freq. But then when we inline we > re-estimate the probabilities in the callee when the callsite count is > much hotter than the entry count, and then follow the same procedure > we were doing in the 0-count case (propagate the call count into the > callee bb counts via the guessed probabilities). Is there a better > solution? > > Thanks, > Teresa > > >> >> Honza > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
On Tue, Feb 11, 2014 at 2:21 PM, Teresa Johnson <tejohnson@google.com> wrote: > On Thu, Dec 19, 2013 at 10:19 PM, Teresa Johnson <tejohnson@google.com> wrote: >> On Thu, Dec 12, 2013 at 5:13 PM, Jan Hubicka <hubicka@ucw.cz> wrote: >>>> On Wed, Dec 11, 2013 at 1:21 AM, Martin Liška <marxin.liska@gmail.com> wrote: >>>> > Hello, >>>> > I prepared a collection of systemtap graphs for GIMP. >>>> > >>>> > 1) just my profile-based function reordering: 550 pages >>>> > 2) just -freorder-blocks-and-partitions: 646 pages >>>> > 3) just -fno-reorder-blocks-and-partitions: 638 pages >>>> > >>>> > Please see attached data. >>>> >>>> Thanks for the data. A few observations/questions: >>>> >>>> With both 1) (your (time-based?) reordering) and 2) >>>> (-freorder-blocks-and-partitions) there are a fair amount of accesses >>>> out of the cold section. I'm not seeing so many accesses out of the >>>> cold section in the apps I am looking at with splitting enabled. In >>> >>> I see you already comitted the patch, so perhaps Martin's measurement assume >>> the pass is off by default? >>> >>> I rebuilded GCC with profiledboostrap and with the linkerscript unmapping >>> text.unlikely. I get ICE in: >>> (gdb) bt >>> #0 diagnostic_set_caret_max_width(diagnostic_context*, int) () at ../../gcc/diagnostic.c:108 >>> #1 0x0000000000f68457 in diagnostic_initialize (context=0x18ae000 <global_diagnostic_context>, n_opts=n_opts@entry=1290) at ../../gcc/diagnostic.c:135 >>> #2 0x000000000100050e in general_init (argv0=<optimized out>) at ../../gcc/toplev.c:1110 >>> #3 toplev_main(int, char**) () at ../../gcc/toplev.c:1922 >>> #4 0x00007ffff774cbe5 in __libc_start_main () from /lib64/libc.so.6 >>> #5 0x0000000000f7898d in _start () at ../sysdeps/x86_64/start.S:122 >>> >>> That is relatively early in startup process. The function seems inlined and >>> it fails only on second invocation, did not have time to investigate further, >>> yet while without -fprofile-use it starts... >> >> I'll see if I can reproduce this and investigate, although at this >> point that might have to wait until after my holiday vacation. > > I tried the linkerscript with cpu2006 and got quite a lot of failures > (using the ref inputs to train, so the behavior should be the same in > both profile-gen and profile-use). I investigated the one in bzip2 and > found an issue that may not be easy to fix and is perhaps something it > is not worth fixing. The case was essentially the following: Function > foo was called by callsites A and B, with call counts 148122615 and > 18, respectively. > > Within function foo, there was a basic block that had a very low count > (compared to the entry bb count of 148122633), and therefore a 0 > frequency: > > ;; basic block 6, loop depth 0, count 18, freq 0 > > The ipa inliner decided to inline into callsite A but not B. Because > the vast majority of the call count was from callsite A, when we > performed execute_fixup_cfg after doing the inline transformation, the > count_scale is 0 and the out-of-line copy of foo's blocks all got > counts 0. However, most of the bbs still had non-zero frequencies. But > bb 6 ended up with a count and frequency of 0, leading us to split it > out. It turns out that at least one of the 18 counts for this block > were from callsite B, and we ended up trying to execute the split bb > in the out-of-line copy from that callsite. > > I can think of a couple of ways to prevent this to happen (e.g. have > execute_fixup_cfg give the block a count or frequency of 1 instead of > 0, or mark the bb somehow as not eligible for splitting due to a low > confidence in the 0 count/frequency), but they seem a little hacky. I > am thinking that the splitting here is something we shouldn't worry > about - it is so close to 0 count that the occasional jump to the > split section caused by the lack of precision in the frequency is not > a big deal. Unfortunately, without fixing this I can't use the linker > script without disabling inlining to avoid this problem. > > I reran cpu2006 with the linker script but with -fno-inline and got 6 > more benchmarks to pass. So there are other issues out there. I will > take a look at another one and see if it is a similar > scaling/precision issue. I'm thinking that I may just use my heatmap > scripts (which leverages perf-events profiles) to see if there is any > significant execution in the split cold sections, since it seems we > can't realistically prevent any and all execution of the cold split > sections, and that is more meaningful anyway. I collected perf cycle profiles for all of the split cpu2006 binaries (with the patch I sent separately to do more COMDAT profile fixes), and processed the results to see how many samples were in cold functions. There were 13 benchmarks that had non-zero samples in split cold sections, although they were very small as a total percentage of sampled cycles in the benchmark. Here are the results, sorted in reverse order of the percentage of total samples in the cold section: Benchmark Cold samples Total samples Percent cold samples 471.omnetpp 471 380596 .12390660966787241039 400.perlbench 313 348717 .08983823377458352946 403.gcc 245 369530 .06634442232963700123 453.povray 105 210146 .04999024000076175603 454.calculix 326 814160 .04005730898438747951 450.soplex 104 277853 .03744387918588365754 445.gobmk 81 484055 .01673643625484013604 458.sjeng 89 537853 .01655001078540028711 465.tonto 8 554507 .00144274381017819689 437.leslie3d 4 351889 .00113673501285931483 473.astar 4 436143 .00091713880207915366 464.h264ref 3 649895 .00046161516067285025 434.zeusmp 1 563594 .00017743300573286041 I looked at omnetpp and perlbench. Omnetpp's cold samples are exclusively from a single routine, _ZN9TOmnetApp15checkTimeLimitsEv (TOmnetApp::checkTimeLimits). I tracked this down to the RTL expansion for the floating point comparison "opt_simtimelimit!=0" at libs/envir/omnetapp.cc:528. do_compare_rtx_and_jump in dojump.c calls split_comparison and produces two jumps, and we assign the full probability to one path through these two branches. In this case the first split jump actually goes the other way (so we evaluate the second split jump), into the cold section. This should be easy to fix along the same lines as the way I fixed the TRUTH_ANDIF_EXPR/TRUTH_ORIF_EXPR expansions in r203126. I'll make this fix and send a patch soon. Most of perlbench's cold samples are from Perl_pp_helem. I tracked these down to some big profile insanities introduced by jump threading in vrp1, which in turn caused tracer to end up creating a hot path that flowed into a 0-weight path, which was then split and executed. If I disable -ftree-vrp, tracer no longer causes this issue. However, with vrp off I found similar profile insanities created by jump threading called from dom1, which also resulted in hot blocks flowing into a 0-weight block due to a profile insanity. I already fixed some profile insanities created by jump threading back in r203041, but apparently there are more (or new ones). I will take a look. Teresa > > Teresa > >> >>> >>> On our periodic testers I see off-noise improvement in crafty 2200->2300 >>> and regression on Vortex, 2900->2800, plus code size increase. >> >> I had only run cpu2006, but not cpu2000. I'll see if I can reproduce >> this as well. >> >> I have been investigating a few places where I saw accesses in the >> cold split regions in internal benchmarks. Here are a couple, and how >> I have addressed them so far: >> >> 1) loop unswitching >> >> In this case, loop unswitching hoisted a branch from within the loop >> to outside the loop, and in doing so it was effectively speculated >> above several other branches. In it's original location it always went >> to only one of the successors (biased 0/100%). But when it was hoisted >> it sometimes took the previously 0% path. This led to executing out of >> the cold region, since we didn't update the branch probability when >> hoisting. I worked around this by assigning a small non-zero >> probability after hoisting with the following change: >> >> Index: tree-ssa-loop-unswitch.c >> =================================================================== >> --- tree-ssa-loop-unswitch.c (revision 205590) >> +++ tree-ssa-loop-unswitch.c (working copy) >> @@ -384,6 +384,8 @@ tree_unswitch_loop (struct loop *loop, >> >> extract_true_false_edges_from_block (unswitch_on, &edge_true, &edge_false); >> prob_true = edge_true->probability; >> + if (!prob_true) >> + prob_true = REG_BR_PROB_BASE/10; >> return loop_version (loop, unshare_expr (cond), >> NULL, prob_true, prob_true, >> REG_BR_PROB_BASE - prob_true, false); >> >> This should probably be refined (if prob_true is 100% we want to >> assign a small non-zero probability to the false path), and 10% may be >> too high (maybe give it 1%?). >> >> 2) More COMDAT issues >> >> My earlier patch handled the case where the comdat had 0 counts since >> the linker kept the copy in a different module. In that case we >> prevent the guessed frequencies from being dropped by counts_to_freq, >> and then later mark any reached via non-zero callgraph edges as >> guessed. Finally, when one such 0-count comdat is inlined the call >> count is propagated to the callee blocks using the guessed >> probabilities. >> >> However, in this case, there was a comdat that had a very small >> non-zero count, that was being inlined to a much hotter callsite. I >> believe this could happen when there was a copy that was ipa-inlined >> in the profile gen compile, so the copy in that module gets some >> non-zero counts from the ipa inlined instance, but when the out of >> line copy was eliminated by the linker (selected from a different >> module). In this case the inliner was scaling the bb counts up quite a >> lot when inlining. The problem is that you most likely can't trust >> that the 0 count bbs in such a case are really not executed by the >> callsite it is being into, since the counts are very small and >> correspond to a different callsite. >> >> The problem is how to address this. We can't simply suppress >> counts_to_freq from overwriting the guessed frequencies in this case, >> since the profile counts are non-zero and would not match the guessed >> probabilities. But we can't figure out which are called by much hotter >> callsites (compared to their entry count) until later when the >> callgraph is built, which is when we would know that we want to ignore >> the profile counts and use the guessed probabilities instead. The >> solution I came up with is to allow the profile counts to overwrite >> the guessed probabilites in counts_to_freq. But then when we inline we >> re-estimate the probabilities in the callee when the callsite count is >> much hotter than the entry count, and then follow the same procedure >> we were doing in the 0-count case (propagate the call count into the >> callee bb counts via the guessed probabilities). Is there a better >> solution? >> >> Thanks, >> Teresa >> >> >>> >>> Honza >> >> >> >> -- >> Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413 > > > > -- > Teresa Johnson | Software Engineer | tejohnson@google.com | 408-460-2413
Index: tree-ssa-loop-unswitch.c =================================================================== --- tree-ssa-loop-unswitch.c (revision 205590) +++ tree-ssa-loop-unswitch.c (working copy) @@ -384,6 +384,8 @@ tree_unswitch_loop (struct loop *loop, extract_true_false_edges_from_block (unswitch_on, &edge_true, &edge_false); prob_true = edge_true->probability; + if (!prob_true) + prob_true = REG_BR_PROB_BASE/10; return loop_version (loop, unshare_expr (cond), NULL, prob_true, prob_true, REG_BR_PROB_BASE - prob_true, false);