Message ID | CAAs8HmzTLwD5DVR5HRXE+waH8mUCNpj_PKMtDA8aAKO1yJBHcA@mail.gmail.com |
---|---|
State | New |
Would it be sufficient to

1) get rid of the 'may_increase_size' parameter in all the unroll
interfaces (basically make it true for O2); and
2) set MAX_COMPLETELY_PEELED_INSNS parameter to be a smaller value for
O2? -- this makes O2 and O3's complete unroll behave in the same way
but with different parameter. Note that doing so is very similar to
loop vectorization at O2 -- O2 requires a cheap cost model which
lowers the value for related parameter such as # of alias checks. See
how this is done in opts.c

David

On Wed, Nov 20, 2013 at 7:41 PM, Sriraman Tallam <tmsriram@google.com> wrote:
> Hi,
>
> Currently, tree unrolling pass (cunroll) does not allow any code
> size growth in O2 mode. Code size growth is permitted only if O3 or
> funroll-loops/fpeel-loops is used. I have created a patch to allow
> partial code size increase in O2 mode. With funroll-loops the maximum
> allowed code growth is 100 unrolled insns. For partial growth, I
> experimented with various values of code growth and I have attached
> SPEC 2006 performance numbers for code growth from 20 to 100 insns in
> steps of 20.
>
> For this patch, I have set the partial code growth in O2 mode to be
> 40 insns (tunable via param) where we get performance improvements
> with minimal code size growth. Perf. data shows good improvements in
> a few benchmarks. h264, sjeng and bzip2 get >2% improvement.
> calculix shows a big regression (4.5% on westmere) which I am
> investigating along with the povray regression.
>
> I also ran experiments with -ftree-vectorize turned on with -O2
> both in baseline and with the partial unroll to study the effect of
> unrolling on vectorization. Loop unrolling seems to benefit more
> benchmarks when vectorization is turned on.
>
> I have attached the patch and pdfs of the perf. data and code size growth.
>
> How to read the attached perf data:
>
> There are two data files.
>
> * spec_perf_O2_unroll.txt contains perf data using unrolling with
> various code size growth on O2.
> * spec_perf_O2_vectorize_unroll.txt contains perf data using
> unrolling with various code size growth on O2 + ftree-vectorize.
>
> Each file contains perf. improvements and code size growth data.
> Experiments were done on Ibis-sandybridge and Ikaria-westmere.
>
> Here is a sample from the file (All perf. numbers are in %):
>
> Unroll insns code growth        20     40     60     80    100
> _____________________________________________________
> spec/2006/fp/C++/444.namd     -3.2  -0.13   -0.4  -0.57  -0.31
>
> This data shows that namd regressed by 3.2% over baseline when code
> size growth was set to 20 insns and regressed by 0.57% over baseline
> when growth was 80 insns.
>
> Please let me know what you think.
>
> Thanks
> Sri
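
The suggestion above (drop the boolean and let one insn budget vary by optimization level) could look roughly like the following. This is only a sketch under stated assumptions, not GCC code: the `hyp_` names are hypothetical, the values 100 and 40 are taken from the patch, and the handling of levels below -O2 is a placeholder the mail does not discuss.

```c
#include <stdbool.h>

/* Hypothetical stand-ins: 100 is the max-completely-peeled-insns default
   shown in the patch's params.def context, 40 is the -O2 figure the
   patch proposes.  Neither macro exists in GCC.  */
#define HYP_PEELED_INSNS_DEFAULT 100u
#define HYP_PEELED_INSNS_O2 40u

/* Pick one insn budget for complete unrolling from the optimization
   level and the unroll/peel flags.  Returning 0 below -O2 is an
   assumption made for the sketch only.  */
unsigned
hyp_peeled_insns_limit (int optimize, bool unroll_loops, bool peel_loops)
{
  if (optimize >= 3 || unroll_loops || peel_loops)
    return HYP_PEELED_INSNS_DEFAULT;
  if (optimize == 2)
    return HYP_PEELED_INSNS_O2;
  return 0;
}

/* With a single budget, the size check no longer needs a separate
   "may increase size" argument: a loop is accepted when its estimated
   unrolled size does not exceed the budget.  */
bool
hyp_within_budget (unsigned unrolled_insns, unsigned limit)
{
  return unrolled_insns <= limit;
}
```

Under these assumptions, -O2 and -O3 share the same code path and differ only in the number returned by `hyp_peeled_insns_limit`.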
On Thu, Nov 21, 2013 at 7:05 AM, Xinliang David Li <davidxl@google.com> wrote:
> Would it be sufficient to
>
> 1) get rid of the 'may_increase_size' parameter in all the unroll
> interfaces (basically make it true for O2); and
> 2) set MAX_COMPLETELY_PEELED_INSNS parameter to be a smaller value for
> O2? -- this makes O2 and O3's complete unroll behave in the same way
> but with different parameter. Note that doing so is very similar to
> loop vectorization at O2 -- O2 requires a cheap cost model which
> lowers the value for related parameter such as # of alias checks. See
> how this is done in opts.c

I agree that yet another param is bad.

> David
>
> On Wed, Nov 20, 2013 at 7:41 PM, Sriraman Tallam <tmsriram@google.com> wrote:
>> Hi,
>>
>> Currently, tree unrolling pass (cunroll) does not allow any code
>> size growth in O2 mode. Code size growth is permitted only if O3 or
>> funroll-loops/fpeel-loops is used. I have created a patch to allow
>> partial code size increase in O2 mode. With funroll-loops the maximum
>> allowed code growth is 100 unrolled insns. For partial growth, I
>> experimented with various values of code growth and I have attached
>> SPEC 2006 performance numbers for code growth from 20 to 100 insns in
>> steps of 20.
>>
>> For this patch, I have set the partial code growth in O2 mode to be
>> 40 insns (tunable via param) where we get performance improvements
>> with minimal code size growth. Perf. data shows good improvements in
>> a few benchmarks. h264, sjeng and bzip2 get >2% improvement.
>> calculix shows a big regression (4.5% on westmere) which I am
>> investigating along with the povray regression.

Did you look at compile-time effects?

Note that you should avoid complete peeling here (unrolling based on
max_iter) as well I think. 40 instructions is a lot to allow given the
optimistic unrolling. See PRs we have where even with the current code
we unroll way too much for -O2.

Richard.
Index: params.def
===================================================================
--- params.def (revision 205058)
+++ params.def (working copy)
@@ -304,6 +304,11 @@ DEFPARAM(PARAM_MAX_COMPLETELY_PEELED_INSNS,
         "max-completely-peeled-insns",
         "The maximum number of insns of a completely peeled loop",
         100, 0, 0)
+/* The maximum number of insns in a peeled loop for default unrolling. */
+DEFPARAM(PARAM_MAX_DEFAULT_UNROLL_INSNS,
+        "max-default-unroll-insns",
+        "The maximum number of insns for the default tree unrolling",
+        40, 0, 0)
 /* The maximum number of peelings of a single loop that is peeled completely. */
 DEFPARAM(PARAM_MAX_COMPLETELY_PEEL_TIMES,
         "max-completely-peel-times",
Index: tree-ssa-loop-ivcanon.c
===================================================================
--- tree-ssa-loop-ivcanon.c (revision 205058)
+++ tree-ssa-loop-ivcanon.c (working copy)
@@ -71,9 +71,18 @@ enum unroll_level
                                    iteration. */
   UL_NO_GROWTH,         /* Only loops whose unrolling will not cause increase
                            of code size. */
+  UL_PARTIAL,           /* All suitable loops whose unrolling will not
+                           increase code size by more than 50% of UL_ALL. */
   UL_ALL                /* All suitable loops. */
 };
 
+typedef enum _increase_code_size
+{
+  UNROLL_NO_INCREASE = 0,
+  UNROLL_PARTIAL_INCREASE = 1,
+  UNROLL_FULL_INCREASE = 2
+} increase_code_size;
+
 /* Adds a canonical induction variable to LOOP iterating NITER times. EXIT
    is the exit edge whose condition is replaced. */
 
@@ -651,6 +660,7 @@ try_unroll_loop_completely (struct loop *loop,
                             location_t locus)
 {
   unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
+  unsigned HOST_WIDE_INT max_unroll_insns;
   gimple cond;
   struct loop_size size;
   bool n_unroll_found = false;
@@ -696,6 +706,10 @@ try_unroll_loop_completely (struct loop *loop,
     return false;
 
   max_unroll = PARAM_VALUE (PARAM_MAX_COMPLETELY_PEEL_TIMES);
+  max_unroll_insns = (ul != UL_PARTIAL) ?
+                     PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS) :
+                     PARAM_VALUE (PARAM_MAX_DEFAULT_UNROLL_INSNS);
+
   if (n_unroll > max_unroll)
     return false;
 
@@ -805,8 +819,7 @@ try_unroll_loop_completely (struct loop *loop,
                      loop->num);
           return false;
         }
-      else if (unr_insns
-               > (unsigned) PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS))
+      else if (unr_insns > max_unroll_insns)
         {
           if (dump_file && (dump_flags & TDF_DETAILS))
             fprintf (dump_file, "Not unrolling loop %d: "
@@ -1100,7 +1113,8 @@ propagate_constants_for_unrolling (basic_block bb)
    loop we unrolled. */
 
 static bool
-tree_unroll_loops_completely_1 (bool may_increase_size, bool unroll_outer,
+tree_unroll_loops_completely_1 (increase_code_size may_increase_size,
+                                bool unroll_outer,
                                 vec<loop_p, va_heap>& father_stack,
                                 struct loop *loop)
 {
@@ -1135,7 +1149,7 @@ static bool
           /* Unroll outermost loops only if asked to do so or they do not
              cause code growth. */
           && (unroll_outer || loop_outer (loop_father)))
-        ul = UL_ALL;
+        ul = (may_increase_size == UNROLL_PARTIAL_INCREASE) ? UL_PARTIAL : UL_ALL;
       else
         ul = UL_NO_GROWTH;
 
@@ -1163,7 +1177,8 @@ static bool
    size of the code does not increase. */
 
 unsigned int
-tree_unroll_loops_completely (bool may_increase_size, bool unroll_outer)
+tree_unroll_loops_completely (increase_code_size may_increase_size,
+                              bool unroll_outer)
 {
   stack_vec<loop_p, 16> father_stack;
   bool changed;
@@ -1308,12 +1323,19 @@ make_pass_iv_canon (gcc::context *ctxt)
 static unsigned int
 tree_complete_unroll (void)
 {
+  increase_code_size code_size;
+
   if (number_of_loops (cfun) <= 1)
     return 0;
 
-  return tree_unroll_loops_completely (flag_unroll_loops
-                                       || flag_peel_loops
-                                       || optimize >= 3, true);
+  if (flag_unroll_loops || flag_peel_loops || (optimize >= 3))
+    code_size = UNROLL_FULL_INCREASE;
+  else if (optimize == 2)
+    code_size = UNROLL_PARTIAL_INCREASE;
+  else
+    code_size = UNROLL_NO_INCREASE;
+
+  return tree_unroll_loops_completely (code_size, true);
 }
 
 static bool
@@ -1366,13 +1388,20 @@ static unsigned int
 tree_complete_unroll_inner (void)
 {
   unsigned ret = 0;
+  increase_code_size code_size;
 
   loop_optimizer_init (LOOPS_NORMAL
                        | LOOPS_HAVE_RECORDED_EXITS);
   if (number_of_loops (cfun) > 1)
     {
       scev_initialize ();
-      ret = tree_unroll_loops_completely (optimize >= 3, false);
+
+      if (optimize >= 3)
+        code_size = UNROLL_FULL_INCREASE;
+      else
+        code_size = UNROLL_NO_INCREASE;
+
+      ret = tree_unroll_loops_completely (code_size, false);
       free_numbers_of_iterations_estimates ();
       scev_finalize ();
     }
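
For readers who want the decision flow of the patch in one place, here is a standalone sketch. It reuses the enum names from the diff, but the `sketch_` functions are hypothetical and the two limits are hard-coded to the param defaults (100 and 40) rather than read through `PARAM_VALUE`. The hunk at line 1135 does not show the start of its condition, so the sketch assumes `may_increase_size` is tested for non-zero there and folds every other condition into one boolean argument.

```c
#include <stdbool.h>

/* Same enumerators as the typedef added by the patch.  */
typedef enum
{
  UNROLL_NO_INCREASE = 0,
  UNROLL_PARTIAL_INCREASE = 1,
  UNROLL_FULL_INCREASE = 2
} increase_code_size;

/* Only the unroll levels this sketch needs.  */
enum unroll_level { UL_NO_GROWTH, UL_PARTIAL, UL_ALL };

/* Mirrors the new body of tree_complete_unroll: flags and -O3 allow
   full growth, plain -O2 allows partial growth, anything else none.  */
increase_code_size
sketch_code_size_for_cunroll (int optimize, bool unroll_loops,
                              bool peel_loops)
{
  if (unroll_loops || peel_loops || optimize >= 3)
    return UNROLL_FULL_INCREASE;
  if (optimize == 2)
    return UNROLL_PARTIAL_INCREASE;
  return UNROLL_NO_INCREASE;
}

/* Mirrors the level selection in tree_unroll_loops_completely_1.
   OTHER_CONDITIONS_HOLD stands in for the rest of the condition
   (an assumption, since the hunk shows only its tail).  */
enum unroll_level
sketch_unroll_level (increase_code_size may_increase_size,
                     bool other_conditions_hold)
{
  if (may_increase_size != UNROLL_NO_INCREASE && other_conditions_hold)
    return may_increase_size == UNROLL_PARTIAL_INCREASE ? UL_PARTIAL : UL_ALL;
  return UL_NO_GROWTH;
}

/* Mirrors the max_unroll_insns selection in try_unroll_loop_completely,
   with the param defaults written out as literals.  */
unsigned
sketch_max_unroll_insns (enum unroll_level ul)
{
  return ul != UL_PARTIAL ? 100u : 40u;
}
```

With these stand-ins, plain -O2 maps to `UNROLL_PARTIAL_INCREASE`, then to `UL_PARTIAL`, and finally to the 40-insn limit, while -O3 or the unroll/peel flags keep the existing 100-insn limit.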