Message ID | CAO2gOZXexPV1rb9ag_Xcv2sfcgOB9Uuyah1uremcMdMu1LG2Lw@mail.gmail.com
---|---
State | New
Hi, Richard,

Could you take a second look at this patch?

Thanks,
Dehao

On Mon, Jun 24, 2013 at 3:45 PM, Dehao Chen <dehao@google.com> wrote:
> The original patch has some flaw. The new patch is attached.
> Bootstrapped and passed regression tests.
>
> Thanks,
> Dehao
>
> On Mon, Jun 24, 2013 at 9:44 AM, Dehao Chen <dehao@google.com> wrote:
>> Hi, Richard,
>>
>> I've updated the patch (as attached) to use sreal to compute badness.
>>
>> Thanks,
>> Dehao
>>
>> On Mon, Jun 24, 2013 at 5:41 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Sat, Jun 22, 2013 at 2:59 AM, Dehao Chen <dehao@google.com> wrote:
>>>> This patch prevents integer-underflow of badness computation in ipa-inline.
>>>>
>>>> Bootstrapped and passed regression tests.
>>>>
>>>> OK for trunk?
>>>>
>>>> Thanks,
>>>> Dehao
>>>>
>>>> gcc/ChangeLog:
>>>> 2013-06-21  Dehao Chen  <dehao@google.com>
>>>>
>>>>         * ipa-inline.c (edge_badness): Fix integer underflow.
>>>>
>>>> Index: gcc/ipa-inline.c
>>>> ===================================================================
>>>> --- gcc/ipa-inline.c (revision 200326)
>>>> +++ gcc/ipa-inline.c (working copy)
>>>> @@ -888,11 +888,9 @@ edge_badness (struct cgraph_edge *edge, bool dump)
>>>>    else if (max_count)
>>>>      {
>>>>        int relbenefit = relative_time_benefit (callee_info, edge, edge_time);
>>>> -      badness =
>>>> -        ((int)
>>>> -         ((double) edge->count * INT_MIN / 2 / max_count /
>>>> -          RELATIVE_TIME_BENEFIT_RANGE) * relbenefit) / growth;
>>>> -
>>>> +      badness = ((int)((double) edge->count / max_count
>>>> +                * relbenefit / RELATIVE_TIME_BENEFIT_RANGE * INT_MIN / 2)) / growth;
>>>> +
>>>
>>> FP operations on the host are frowned upon if code generation depends
>>> on their outcome.  They all should use sreal_* as predict already does.
>>> Other than that I wonder why the final division is int (so we get truncation
>>> and not rounding)?  That increases the effect of doing the multiply by
>>> relbenefit in double.
>>>
>>> Richard.
>>>
>>>>        /* Be sure that insanity of the profile won't lead to increasing counts
>>>>           in the scalling and thus to overflow in the computation above.  */
>>>>        gcc_assert (max_count >= edge->count);
On Tue, Jun 25, 2013 at 12:45 AM, Dehao Chen <dehao@google.com> wrote:
> The original patch has some flaw. The new patch is attached.
> Bootstrapped and passed regression tests.

Ok.

Thanks,
Richard.

> Thanks,
> Dehao
>
> On Mon, Jun 24, 2013 at 9:44 AM, Dehao Chen <dehao@google.com> wrote:
>> Hi, Richard,
>>
>> I've updated the patch (as attached) to use sreal to compute badness.
>>
>> Thanks,
>> Dehao
>>
>> On Mon, Jun 24, 2013 at 5:41 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Sat, Jun 22, 2013 at 2:59 AM, Dehao Chen <dehao@google.com> wrote:
>>>> This patch prevents integer-underflow of badness computation in ipa-inline.
>>>>
>>>> Bootstrapped and passed regression tests.
>>>>
>>>> OK for trunk?
>>>>
>>>> Thanks,
>>>> Dehao
>>>>
>>>> gcc/ChangeLog:
>>>> 2013-06-21  Dehao Chen  <dehao@google.com>
>>>>
>>>>         * ipa-inline.c (edge_badness): Fix integer underflow.
>>>>
>>>> Index: gcc/ipa-inline.c
>>>> ===================================================================
>>>> --- gcc/ipa-inline.c (revision 200326)
>>>> +++ gcc/ipa-inline.c (working copy)
>>>> @@ -888,11 +888,9 @@ edge_badness (struct cgraph_edge *edge, bool dump)
>>>>    else if (max_count)
>>>>      {
>>>>        int relbenefit = relative_time_benefit (callee_info, edge, edge_time);
>>>> -      badness =
>>>> -        ((int)
>>>> -         ((double) edge->count * INT_MIN / 2 / max_count /
>>>> -          RELATIVE_TIME_BENEFIT_RANGE) * relbenefit) / growth;
>>>> -
>>>> +      badness = ((int)((double) edge->count / max_count
>>>> +                * relbenefit / RELATIVE_TIME_BENEFIT_RANGE * INT_MIN / 2)) / growth;
>>>> +
>>>
>>> FP operations on the host are frowned upon if code generation depends
>>> on their outcome.  They all should use sreal_* as predict already does.
>>> Other than that I wonder why the final division is int (so we get truncation
>>> and not rounding)?  That increases the effect of doing the multiply by
>>> relbenefit in double.
>>>
>>> Richard.
>>>
>>>>        /* Be sure that insanity of the profile won't lead to increasing counts
>>>>           in the scalling and thus to overflow in the computation above.  */
>>>>        gcc_assert (max_count >= edge->count);
> >> I've updated the patch (as attached) to use sreal to compute badness.
> >>>> +      badness = ((int)((double) edge->count / max_count
> >>>> +                * relbenefit / RELATIVE_TIME_BENEFIT_RANGE * INT_MIN / 2)) / growth;
> >>>> +
> >>>
> >>> FP operations on the host are frowned upon if code generation depends
> >>> on their outcome.  They all should use sreal_* as predict already does.
> >>> Other than that I wonder why the final division is int (so we get truncation
> >>> and not rounding)?  That increases the effect of doing the multiply by
> >>> relbenefit in double.

Sorry, I missed this patch - I need to update my scripts marking files
interesting to me.

I originally duplicated the trick from global.c that used to do this for ages.
The rationale is that if you do only one FP operation, you will not get a
difference between hosts doing 80-bit FP (x86) and 64-bit FP (others), and you
do the operation a lot faster than sreal does.  While this is on slippery
ground, controlled use of double should not imply host dependency.

Badness computation is on the hot path of the inliner, which is the slowest
WPA optimization we have (well, because it only does something really useful).
I suppose I can try to get Martin to benchmark the patch on Firefox in the
afternoon.

Honza

> >>>
> >>> Richard.
> >>>
> >>>>        /* Be sure that insanity of the profile won't lead to increasing counts
> >>>>           in the scalling and thus to overflow in the computation above.  */
> >>>>        gcc_assert (max_count >= edge->count);
On Tue, Aug 27, 2013 at 2:50 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> I've updated the patch (as attached) to use sreal to compute badness.
>> >>>> +      badness = ((int)((double) edge->count / max_count
>> >>>> +                * relbenefit / RELATIVE_TIME_BENEFIT_RANGE * INT_MIN / 2)) / growth;
>> >>>> +
>> >>>
>> >>> FP operations on the host are frowned upon if code generation depends
>> >>> on their outcome.  They all should use sreal_* as predict already does.
>> >>> Other than that I wonder why the final division is int (so we get truncation
>> >>> and not rounding)?  That increases the effect of doing the multiply by
>> >>> relbenefit in double.
> Sorry, I missed this patch - I need to update my scripts marking files
> interesting to me.
>
> I originally duplicated the trick from global.c that used to do this for ages.
> The rationale is that if you do only one FP operation, you will not get a
> difference between hosts doing 80-bit FP (x86) and 64-bit FP (others), and
> you do the operation a lot faster than sreal does.  While this is on slippery
> ground, controlled use of double should not imply host dependency.

Well ... the above is clearly not "one" FP operation.  Also, I don't follow:
if we can assume all hosts implement proper IEEE double math then of course
the issue is moot.  But fact is we don't (not sure if we can declare such
hosts broken and unsupported as hosts).  Of course, on a perfectly IEEE
compliant host sreal.[ch] could simply be mapped to double.  Oh, and yes, the
80-bit x87 unit isn't holding up to that premise.

> Badness computation is on the hot path of the inliner, which is the slowest
> WPA optimization we have (well, because it only does something really
> useful).  I suppose I can try to get Martin to benchmark the patch on
> Firefox in the afternoon.
>
> Honza
>> >>>
>> >>> Richard.
>> >>>
>> >>>>        /* Be sure that insanity of the profile won't lead to increasing counts
>> >>>>           in the scalling and thus to overflow in the computation above.  */
>> >>>>        gcc_assert (max_count >= edge->count);
> On Tue, Aug 27, 2013 at 2:50 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> >> I've updated the patch (as attached) to use sreal to compute badness.
> >> >>>> +      badness = ((int)((double) edge->count / max_count
> >> >>>> +                * relbenefit / RELATIVE_TIME_BENEFIT_RANGE * INT_MIN / 2)) / growth;
> >> >>>> +
> >> >>>
> >> >>> FP operations on the host are frowned upon if code generation depends
> >> >>> on their outcome.  They all should use sreal_* as predict already does.
> >> >>> Other than that I wonder why the final division is int (so we get truncation
> >> >>> and not rounding)?  That increases the effect of doing the multiply by
> >> >>> relbenefit in double.
> > Sorry, I missed this patch - I need to update my scripts marking files
> > interesting to me.
> >
> > I originally duplicated the trick from global.c that used to do this for ages.
> > The rationale is that if you do only one FP operation, you will not get a
> > difference between hosts doing 80-bit FP (x86) and 64-bit FP (others), and
> > you do the operation a lot faster than sreal does.  While this is on
> > slippery ground, controlled use of double should not imply host dependency.
>
> Well ... the above is clearly not "one" FP operation.  Also, I don't follow:
> if we can assume all hosts implement proper IEEE double math then of course
> the issue is moot.  But fact is we don't (not sure if we can

We are not really consistent across all hosts either, because of HOST_WIDE_INT
differences.  What we really cared about was having a compiler that does not
give different results based on build flags.

> declare such hosts broken and unsupported as hosts).  Of course, on a
> perfectly IEEE compliant host sreal.[ch] could simply be mapped to double.
> Oh, and yes, the 80-bit x87 unit isn't holding up to that premise.

sreal is not IEEE compliant.  We first hit the problem of doing non-trivial
chains of FP operations in predict.c during my school project, and Josef
implemented it as a dirty but reasonably quick FP emulation module.  Hopefully
it will work well enough in the badness computation, too.

I am giving the patch brief benchmarking on profiledbootstrap, and if it won't
cause a major regression, I think we should go ahead with the patch.

I was never really happy about the double use there, and in fact the whole
fixed point arithmetic in the badness computation is a mess.  If we had a
template based Fibonacci heap and an sreal fast enough, turning it all to
reals would save quite some maintenance burden.

Honza
On Wed, Aug 28, 2013 at 2:13 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> On Tue, Aug 27, 2013 at 2:50 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> >> I've updated the patch (as attached) to use sreal to compute badness.
>> >> >>>> +      badness = ((int)((double) edge->count / max_count
>> >> >>>> +                * relbenefit / RELATIVE_TIME_BENEFIT_RANGE * INT_MIN / 2)) / growth;
>> >> >>>> +
>> >> >>>
>> >> >>> FP operations on the host are frowned upon if code generation depends
>> >> >>> on their outcome.  They all should use sreal_* as predict already does.
>> >> >>> Other than that I wonder why the final division is int (so we get truncation
>> >> >>> and not rounding)?  That increases the effect of doing the multiply by
>> >> >>> relbenefit in double.
>> > Sorry, I missed this patch - I need to update my scripts marking files
>> > interesting to me.
>> >
>> > I originally duplicated the trick from global.c that used to do this for ages.
>> > The rationale is that if you do only one FP operation, you will not get a
>> > difference between hosts doing 80-bit FP (x86) and 64-bit FP (others), and
>> > you do the operation a lot faster than sreal does.  While this is on
>> > slippery ground, controlled use of double should not imply host dependency.
>>
>> Well ... the above is clearly not "one" FP operation.  Also, I don't follow:
>> if we can assume all hosts implement proper IEEE double math then of course
>> the issue is moot.  But fact is we don't (not sure if we can
>
> We are not really consistent across all hosts either, because of HOST_WIDE_INT
> differences.  What we really cared about was having a compiler that does not
> give different results based on build flags.

Well, we care that a cross from X to T produces the same code as a cross from
Y to T or a native compiler on T.  Otherwise it's hard to even think of using
icecream like we do internally.

>> declare such hosts broken and unsupported as hosts).  Of course, on a
>> perfectly IEEE compliant host sreal.[ch] could simply be mapped to double.
>> Oh, and yes, the 80-bit x87 unit isn't holding up to that premise.
>
> sreal is not IEEE compliant.  We first hit the problem of doing non-trivial
> chains of FP operations in predict.c during my school project, and Josef
> implemented it as a dirty but reasonably quick FP emulation module.
> Hopefully it will work well enough in the badness computation, too.
>
> I am giving the patch brief benchmarking on profiledbootstrap, and if it
> won't cause a major regression, I think we should go ahead with the patch.
>
> I was never really happy about the double use there, and in fact the whole
> fixed point arithmetic in the badness computation is a mess.  If we had a
> template based Fibonacci heap and an sreal fast enough, turning it all to
> reals would save quite some maintenance burden.

Yeah, well.

Richard.

> Honza
> > I am giving the patch brief benchmarking on profiledbootstrap, and if it
> > won't cause a major regression, I think we should go ahead with the patch.

Uhm, I profiledbootstrapped, and we are a bit too fast to get a reasonable
oprofile.  What I get is:

7443  9.4372  lto1  lto1  lto_end_uncompression(lto_compression_stream*)
4438  5.6271  lto1  lto1  _ZL14DFS_write_treeP12output_blockP4sccsP9tree_nodebb.lto_priv.4993
2351  2.9809  lto1  lto1  lto_output_tree(output_block*, tree_node*, bool, bool)
2179  2.7628  lto1  lto1  _ZL30linemap_macro_loc_to_exp_pointP9line_mapsjPPK8line_map.lto_priv.7860
1910  2.4217  lto1  lto1  _ZL19unpack_value_fieldsP7data_inP9bitpack_dP9tree_node.lto_priv.7292
1855  2.3520  libc-2.11.1.so  libc-2.11.1.so  msort_with_tmp
1531  1.9412  lto1  lto1  streamer_string_index(output_block*, char const*, unsigned int, bool)
1530  1.9399  libc-2.11.1.so  libc-2.11.1.so  _int_malloc
1471  1.8651  lto1  lto1  do_estimate_growth(cgraph_node*)
1306  1.6559  lto1  lto1  pointer_map_insert(pointer_map_t*, void const*)
1238  1.5697  lto1  lto1  _Z28streamer_pack_tree_bitfieldsP12output_blockP9bitpack_dP9tree_node.constprop.1086
1138  1.4429  lto1  lto1  compare_tree_sccs_1(tree_node*, tree_node*, tree_node***)
1082  1.3719  lto1  lto1  streamer_write_tree_body(output_block*, tree_node*, bool)
1044  1.3237  lto1  lto1  _ZL28estimate_calls_size_and_timeP11cgraph_nodePiS1_S1_j3vecIP9tree_node7va_heap6vl_ptrES7_S2_IP21ipa_agg_jump_function

We take 12 seconds of WPA on GCC (with my fork patch):

Execution times (seconds)
 phase setup             :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall    1412 kB ( 0%) ggc
 phase opt and generate  :   4.48 (37%) usr   0.05 ( 6%) sys   4.57 (34%) wall   42983 kB ( 7%) ggc
 phase stream in         :   7.21 (60%) usr   0.26 (32%) sys   7.47 (56%) wall  565102 kB (93%) ggc
 phase stream out        :   0.38 ( 3%) usr   0.50 (62%) sys   1.37 (10%) wall     623 kB ( 0%) ggc
 callgraph optimization  :   0.04 ( 0%) usr   0.00 ( 0%) sys   0.03 ( 0%) wall       6 kB ( 0%) ggc
 ipa dead code removal   :   0.46 ( 4%) usr   0.00 ( 0%) sys   0.46 ( 3%) wall       0 kB ( 0%) ggc
 ipa cp                  :   0.36 ( 3%) usr   0.01 ( 1%) sys   0.41 ( 3%) wall   38261 kB ( 6%) ggc
 ipa inlining heuristics :   2.84 (24%) usr   0.05 ( 6%) sys   2.87 (21%) wall   60263 kB (10%) ggc
 ipa lto gimple in       :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto gimple out      :   0.04 ( 0%) usr   0.02 ( 2%) sys   0.06 ( 0%) wall       0 kB ( 0%) ggc
 ipa lto decl in         :   6.23 (52%) usr   0.18 (22%) sys   6.40 (48%) wall  425731 kB (70%) ggc
 ipa lto decl out        :   0.09 ( 1%) usr   0.01 ( 1%) sys   0.10 ( 1%) wall       0 kB ( 0%) ggc
 ipa lto cgraph I/O      :   0.22 ( 2%) usr   0.02 ( 2%) sys   0.25 ( 2%) wall   60840 kB (10%) ggc
 ipa lto decl merge      :   0.20 ( 2%) usr   0.00 ( 0%) sys   0.20 ( 1%) wall    1051 kB ( 0%) ggc
 ipa lto cgraph merge    :   0.22 ( 2%) usr   0.01 ( 1%) sys   0.25 ( 2%) wall   17676 kB ( 3%) ggc
 whopr wpa               :   0.38 ( 3%) usr   0.00 ( 0%) sys   0.35 ( 3%) wall     626 kB ( 0%) ggc
 whopr wpa I/O           :   0.01 ( 0%) usr   0.47 (58%) sys   0.98 ( 7%) wall       0 kB ( 0%) ggc
 whopr partitioning      :   0.18 ( 1%) usr   0.00 ( 0%) sys   0.19 ( 1%) wall       0 kB ( 0%) ggc
 ipa reference           :   0.31 ( 3%) usr   0.01 ( 1%) sys   0.33 ( 2%) wall       0 kB ( 0%) ggc
 ipa profile             :   0.09 ( 1%) usr   0.01 ( 1%) sys   0.10 ( 1%) wall     150 kB ( 0%) ggc
 ipa pure const          :   0.29 ( 2%) usr   0.00 ( 0%) sys   0.30 ( 2%) wall       0 kB ( 0%) ggc
 tree SSA incremental    :   0.02 ( 0%) usr   0.00 ( 0%) sys   0.00 ( 0%) wall     203 kB ( 0%) ggc
 tree operand scan       :   0.00 ( 0%) usr   0.01 ( 1%) sys   0.00 ( 0%) wall    3512 kB ( 1%) ggc
 dominance computation   :   0.01 ( 0%) usr   0.00 ( 0%) sys   0.02 ( 0%) wall       0 kB ( 0%) ggc
 varconst                :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 unaccounted todo        :   0.06 ( 0%) usr   0.01 ( 1%) sys   0.09 ( 1%) wall       0 kB ( 0%) ggc
 TOTAL                   :  12.08         0.81        13.43              610123 kB

Inlining heuristics was also around 25% w/o your change.  Timing matches my
experience with Firefox - growth estimation tends to be the hot function;
with caching, badness is off the radar.

As such I think the patch is safe to go.  Thank you!
> > I was never really happy about the double use there, and in fact the whole
> > fixed point arithmetic in the badness computation is a mess.  If we had a
> > template based Fibonacci heap and an sreal fast enough, turning it all to
> > reals would save quite some maintenance burden.
>
> Yeah, well.
>
> Richard.
>
> > Honza
Index: gcc/ChangeLog
===================================================================
--- gcc/ChangeLog	(revision 200375)
+++ gcc/ChangeLog	(working copy)
@@ -1,3 +1,7 @@
+2013-06-21  Dehao Chen  <dehao@google.com>
+
+	* ipa-inline.c (edge_badness): Fix integer underflow.
+
 2013-06-24  Martin Jambor  <mjambor@suse.cz>
 
 	PR tree-optimization/57358
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	(revision 200375)
+++ gcc/Makefile.in	(working copy)
@@ -2947,7 +2947,7 @@ ipa-inline.o : ipa-inline.c $(CONFIG_H) $(SYSTEM_H
 	$(DIAGNOSTIC_H) $(FIBHEAP_H) $(PARAMS_H) $(TREE_PASS_H) \
 	$(COVERAGE_H) $(GGC_H) $(TREE_FLOW_H) $(RTL_H) $(IPA_PROP_H) \
 	$(EXCEPT_H) $(GIMPLE_PRETTY_PRINT_H) $(IPA_INLINE_H) $(TARGET_H) \
-	$(IPA_UTILS_H)
+	$(IPA_UTILS_H) sreal.h
 ipa-inline-analysis.o : ipa-inline-analysis.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
 	$(TREE_H) langhooks.h $(TREE_INLINE_H) $(FLAGS_H) $(CGRAPH_H) intl.h \
 	$(DIAGNOSTIC_H) $(PARAMS_H) $(TREE_PASS_H) $(CFGLOOP_H) \
Index: gcc/ipa-inline.c
===================================================================
--- gcc/ipa-inline.c	(revision 200375)
+++ gcc/ipa-inline.c	(working copy)
@@ -113,10 +113,12 @@ along with GCC; see the file COPYING3.  If not see
 #include "target.h"
 #include "ipa-inline.h"
 #include "ipa-utils.h"
+#include "sreal.h"
 
 /* Statistics we collect about inlining algorithm.  */
 static int overall_size;
 static gcov_type max_count;
+static sreal max_count_real, max_relbenefit_real, half_int_min_real;
 
 /* Return false when inlining edge E would lead to violating
    limits on function unit growth or stack usage growth.
@@ -887,12 +889,26 @@ edge_badness (struct cgraph_edge *edge, bool dump)
   else if (max_count)
     {
+      sreal tmp, relbenefit_real, growth_real;
       int relbenefit = relative_time_benefit (callee_info, edge, edge_time);
-      badness =
-	((int)
-	 ((double) edge->count * INT_MIN / 2 / max_count /
-	  RELATIVE_TIME_BENEFIT_RANGE) * relbenefit) / growth;
-
+
+      sreal_init (&relbenefit_real, relbenefit, 0);
+      sreal_init (&growth_real, growth, 0);
+
+      /* relative_edge_count.  */
+      sreal_init (&tmp, edge->count, 0);
+      sreal_div (&tmp, &tmp, &max_count_real);
+
+      /* relative_time_benefit.  */
+      sreal_mul (&tmp, &tmp, &relbenefit_real);
+      sreal_div (&tmp, &tmp, &max_relbenefit_real);
+
+      /* growth_f_caller.  */
+      sreal_mul (&tmp, &tmp, &half_int_min_real);
+      sreal_div (&tmp, &tmp, &growth_real);
+
+      badness = -1 * sreal_to_int (&tmp);
+
       /* Be sure that insanity of the profile won't lead to increasing counts
 	 in the scalling and thus to overflow in the computation above.  */
       gcc_assert (max_count >= edge->count);
@@ -1448,6 +1464,9 @@ inline_small_functions (void)
       if (max_count < edge->count)
 	max_count = edge->count;
     }
+  sreal_init (&max_count_real, max_count, 0);
+  sreal_init (&max_relbenefit_real, RELATIVE_TIME_BENEFIT_RANGE, 0);
+  sreal_init (&half_int_min_real, INT_MAX / 2, 0);
   ipa_free_postorder_info ();
   initialize_growth_caches ();