Message ID | 5279103F.20906@redhat.com |
---|---|
State | New |
On Tue, Nov 5, 2013 at 4:35 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
> I'd like to add a new experimental optimization to the trunk. This
> optimization was discussed at the RA BOF of this summer's GNU Cauldron.
>
> It is a register pressure relief through live-range shrinkage. It
> is implemented on the scheduler base and uses register-pressure insn
> scheduling infrastructure. By rearranging insns we shorten pseudo
> live-ranges and increase the chance that they are assigned to a hard
> register.
>
> The code looks pretty simple but there is a lot of work behind
> this patch. I've tried about ten different versions of this code
> (different heuristics for two currently existing register-pressure
> algorithms).
>
> I think it is *up to target maintainers* to decide whether to
> use this optimization for their targets. I'd recommend using it at
> least for x86/x86-64. I think any OOO processor with a small or
> moderate register file which does not use the 1st insn scheduling
> might benefit from this too.
>
> On SPEC2000 for x86/x86-64 (I use a Haswell processor, -O3 with
> general tuning), using the optimization results in smaller code size
> on average (for floating point and integer benchmarks in 32- and
> 64-bit mode). The improvement is more visible for SPECFP2000 (although
> I have the same improvement on x86-64 SPECInt2000, it might be
> attributed mostly to mcf benchmark instability). It is about 0.5% for
> 32-bit and 64-bit mode. It is understandable, as the optimization has
> more opportunities to improve the code on longer BBs. Unlike
> other heuristic optimizations, I don't see any significantly worse
> performance. It gives practically the same or better performance (a
> few benchmarks improved by 1% or more, up to 3%).
>
> The single but significant drawback is additional compilation time
> (4%-6%) as the 1st insn scheduling pass is quite expensive. So I'd
> recommend target maintainers to switch it on only for -Ofast.

Generally I'd not recommend viewing -Ofast as -O4 but as -O3
plus generally "unsafe" optimizations. So I'd not enable it for -Ofast
but for -O3 - possibly also with -Os if indeed the main motivation is
also code-size improvements (-Os is a similar beast as -O3, spend
as much time as you can on optimizing size).

Btw, thanks for working on this. How does it relate to
-fsched-pressure? Does it treat all register classes the same?
On x86 mostly the few fixed registers for some of the integer pipeline
instructions hurt, x86_64 has enough general and FP registers?

Richard.

> If
> somebody finds that the optimization works on processors which use
> 1st insn scheduling by default (which I slightly doubt), we could
> improve the compilation time by reusing data for this optimization and
> the 1st insn scheduling.
>
> Any comments, questions, thoughts are appreciated.
>
> 2013-11-05 Vladimir Makarov <vmakarov@redhat.com>
>
>         * tree-pass.h (make_pass_live_range_shrinkage): New external.
>         * timevar.def (TV_LIVE_RANGE_SHRINKAGE): New.
>         * sched-rgn.c (gate_handle_live_range_shrinkage): New.
>         (rest_of_handle_live_range_shrinkage): Ditto.
>         (class pass_live_range_shrinkage): Ditto.
>         (pass_data_live_range_shrinkage): Ditto.
>         (make_pass_live_range_shrinkage): Ditto.
>         * sched-int.h (sched_relief_p): New external.
>         * sched-deps.c (create_insn_reg_set): Make void return value.
>         * passes.def: Add pass_live_range_shrinkage.
>         * ira.c (update_equiv_regs): Don't move if
>         flag_live_range_shrinkage.
>         * haifa-sched.c (sched_relief_p): New.
>         (rank_for_schedule): Add code for pressure relief through live
>         range shrinkage.
>         (schedule_insn): Print more debug info.
>         (sched_init): Setup SCHED_PRESSURE_WEIGHTED for pressure relief
>         through live range shrinkage.
>         * doc/invoke.texi (-flive-range-shrinkage): New.
>         * common.opt (flive-range-shrinkage): New.
>

On 11/6/2013, 4:17 AM, Richard Biener wrote:
> On Tue, Nov 5, 2013 at 4:35 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>> I'd like to add a new experimental optimization to the trunk. This
>> optimization was discussed at the RA BOF of this summer's GNU Cauldron.
>>
>> It is a register pressure relief through live-range shrinkage. It
>> is implemented on the scheduler base and uses register-pressure insn
>> scheduling infrastructure. By rearranging insns we shorten pseudo
>> live-ranges and increase the chance that they are assigned to a hard
>> register.
>>
>> The code looks pretty simple but there is a lot of work behind
>> this patch. I've tried about ten different versions of this code
>> (different heuristics for two currently existing register-pressure
>> algorithms).
>>
>> I think it is *up to target maintainers* to decide whether to
>> use this optimization for their targets. I'd recommend using it at
>> least for x86/x86-64. I think any OOO processor with a small or
>> moderate register file which does not use the 1st insn scheduling
>> might benefit from this too.
>>
>> On SPEC2000 for x86/x86-64 (I use a Haswell processor, -O3 with
>> general tuning), using the optimization results in smaller code size
>> on average (for floating point and integer benchmarks in 32- and
>> 64-bit mode). The improvement is more visible for SPECFP2000 (although
>> I have the same improvement on x86-64 SPECInt2000, it might be
>> attributed mostly to mcf benchmark instability). It is about 0.5% for
>> 32-bit and 64-bit mode. It is understandable, as the optimization has
>> more opportunities to improve the code on longer BBs. Unlike
>> other heuristic optimizations, I don't see any significantly worse
>> performance. It gives practically the same or better performance (a
>> few benchmarks improved by 1% or more, up to 3%).
>>
>> The single but significant drawback is additional compilation time
>> (4%-6%) as the 1st insn scheduling pass is quite expensive. So I'd
>> recommend target maintainers to switch it on only for -Ofast.
> Generally I'd not recommend viewing -Ofast as -O4 but as -O3
> plus generally "unsafe" optimizations. So I'd not enable it for -Ofast
> but for -O3 - possibly also with -Os if indeed the main motivation is
> also code-size improvements (-Os is a similar beast as -O3, spend
> as much time as you can on optimizing size).

Ok. Probably my recommendation is wrong. It is actually up to target
maintainers to decide when to use the optimization and/or whether to use
it at all by default (maybe they will just decide to use it only for
SPEC reporting).

I guess that in some time we will need something like -O4 for greedy
algorithms (there is a lot of research in this area, e.g. I am reading
an article about optimal register-pressure sensitive insn scheduling,
where the optimization can be constrained in time, for example 1ms for
each insn, and still produce better results than the current
heuristics). I am sure such algorithms will be coming.

> Btw, thanks for working on this. How does it relate to
> -fsched-pressure?

It is based on the -fsched-pressure infrastructure but has different
heuristics and goals. GCC with 1st insn scheduling, even with
-fsched-pressure, still produces worse results on mainstream x86/x86-64
processors than GCC without it. I've also tried -flive-range-shrinkage
-fschedule-insns -fsched-pressure, but just -flive-range-shrinkage is
better for x86/x86-64.

By the way, LLVM uses insn scheduling for x86/x86-64 before RA, but its
goal is only register-pressure decrease (for x86; for x86-64 it is a bit
more complicated). So with this optimization we are just catching up
with LLVM (which is unusual for us in the optimization area).

> Does it treat all register classes the same?
> On x86 mostly the few fixed registers for some of the integer pipeline
> instructions hurt, x86_64 has enough general and FP registers?

It treats them the same (although it is different for different classes
as they have different numbers of available regs). It is always some
kind of approximation as we use register pressure classes here, not the
classes which will actually be used for RA. It is even more complicated
as IRA actually uses dynamic classes (only sets of regs which are
profitable, e.g. they can differ from the classes defined in the target
file as regs in classes are caller-saved or some specific hard regs are
used for arg passing). It makes graph coloring better for irregular
register file architectures. On the whole, as I remember, dynamic
classes gave about 1% improvement even for ppc.

I should say that the presence of hard regs in RTL (e.g. for parameter
passing) is still a challenge for live-range shrinkage and
register-pressure scheduling. It should be addressed somehow.

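The effect described in the thread (rearranging insns shortens pseudo live ranges and so lowers register pressure) can be seen on a toy example. The sketch below is only an illustration under stated assumptions: `ToyInsn`, `max_pressure`, `loads_first` and `shrunk` are hypothetical names, the model is a single straight-line block with one pressure class, and none of it is GCC code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Hypothetical toy instruction, not GCC's rtx: it defines at most one
// pseudo register and reads zero or more pseudos.
struct ToyInsn {
  int def;                // pseudo defined here, or -1 for none
  std::vector<int> uses;  // pseudos read here
};

// Peak number of pseudos live after any instruction in a straight-line
// sequence. A pseudo is live from its definition until its last use.
int max_pressure(const std::vector<ToyInsn>& order) {
  std::map<int, std::size_t> last_use;
  for (std::size_t i = 0; i < order.size(); ++i)
    for (int u : order[i].uses) last_use[u] = i;

  std::set<int> live;
  std::size_t peak = 0;
  for (std::size_t i = 0; i < order.size(); ++i) {
    assert(order[i].def >= -1);  // -1 means "defines nothing"
    for (int u : order[i].uses)
      if (last_use[u] == i) live.erase(u);  // dies at its last use
    const int d = order[i].def;
    auto it = last_use.find(d);
    if (d >= 0 && it != last_use.end() && it->second > i)
      live.insert(d);  // defined here and read later
    peak = std::max(peak, live.size());
  }
  return static_cast<int>(peak);
}

// Three loads up front, then the arithmetic: p0, p1 and p2 overlap.
std::vector<ToyInsn> loads_first() {
  return {{0, {}}, {1, {}}, {2, {}}, {3, {0, 1}}, {4, {3, 2}}, {-1, {4}}};
}

// Same computation with the third load moved next to its only use.
std::vector<ToyInsn> shrunk() {
  return {{0, {}}, {1, {}}, {3, {0, 1}}, {2, {}}, {4, {3, 2}}, {-1, {4}}};
}
```

Calling `max_pressure(loads_first())` and `max_pressure(shrunk())` compares the two orders: moving the load of p2 down to its use means it no longer overlaps with p0 and p1. The real pass works on pressure classes with differing numbers of available hard regs, as noted above, which this toy ignores.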
Index: common.opt
===================================================================
--- common.opt (revision 204380)
+++ common.opt (working copy)
@@ -1738,6 +1738,10 @@ fregmove
 Common Ignore
 Does nothing. Preserved for backward compatibility.
 
+flive-range-shrinkage
+Common Report Var(flag_live_range_shrinkage) Init(0) Optimization
+Relief of register pressure through live range shrinkage
+
 frename-registers
 Common Report Var(flag_rename_registers) Init(2) Optimization
 Perform a register renaming optimization pass
Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi (revision 204216)
+++ doc/invoke.texi (working copy)
@@ -378,7 +378,7 @@ Objective-C and Objective-C++ Dialects}.
 -fira-region=@var{region} -fira-hoist-pressure @gol
 -fira-loop-pressure -fno-ira-share-save-slots @gol
 -fno-ira-share-spill-slots -fira-verbose=@var{n} @gol
--fivopts -fkeep-inline-functions -fkeep-static-consts @gol
+-fivopts -fkeep-inline-functions -fkeep-static-consts -flive-range-shrinkage @gol
 -floop-block -floop-interchange -floop-strip-mine -floop-nest-optimize @gol
 -floop-parallelize-all -flto -flto-compression-level @gol
 -flto-partition=@var{alg} -flto-report -flto-report-wpa -fmerge-all-constants @gol
@@ -7257,6 +7257,12 @@ registers after writing to their lower 3
 
 Enabled for x86 at levels @option{-O2}, @option{-O3}.
 
+@item -flive-range-shrinkage
+@opindex flive-range-shrinkage
+Attempt to decrease register pressure through register live range
+shrinkage. This is helpful for fast processors with small or moderate
+size register sets.
+
 @item -fira-algorithm=@var{algorithm}
 Use the specified coloring algorithm for the integrated register
 allocator. The @var{algorithm} argument can be @samp{priority}, which
Index: haifa-sched.c
===================================================================
--- haifa-sched.c (revision 204380)
+++ haifa-sched.c (working copy)
@@ -150,6 +150,9 @@ along with GCC; see the file COPYING3.
 
 #ifdef INSN_SCHEDULING
 
+/* True if we do pressure relief pass. */
+bool sched_relief_p;
+
 /* issue_rate is the number of insns that can be scheduled in the same
    machine cycle. It can be defined in the config/mach/mach.h file,
    otherwise we set it to 1. */
@@ -2519,7 +2522,7 @@ rank_for_schedule (const void *x, const
   rtx tmp = *(const rtx *) y;
   rtx tmp2 = *(const rtx *) x;
   int tmp_class, tmp2_class;
-  int val, priority_val, info_val;
+  int val, priority_val, info_val, diff;
 
   if (MAY_HAVE_DEBUG_INSNS)
     {
@@ -2532,6 +2535,20 @@ rank_for_schedule (const void *x, const
         return INSN_LUID (tmp) - INSN_LUID (tmp2);
     }
 
+  if (sched_relief_p)
+    {
+      gcc_assert (sched_pressure == SCHED_PRESSURE_WEIGHTED);
+      if ((INSN_REG_PRESSURE_EXCESS_COST_CHANGE (tmp) < 0
+           || INSN_REG_PRESSURE_EXCESS_COST_CHANGE (tmp2) < 0)
+          && (diff = (INSN_REG_PRESSURE_EXCESS_COST_CHANGE (tmp)
+                      - INSN_REG_PRESSURE_EXCESS_COST_CHANGE (tmp2))) != 0)
+        return diff;
+      /* Sort by INSN_LUID (original insn order), so that we make the
+         sort stable. This minimizes instruction movement, thus
+         minimizing sched's effect on debugging and cross-jumping. */
+      return INSN_LUID (tmp) - INSN_LUID (tmp2);
+    }
+
   /* The insn in a schedule group should be issued the first. */
   if (flag_sched_group_heuristic &&
       SCHED_GROUP_P (tmp) != SCHED_GROUP_P (tmp2))
@@ -2542,8 +2559,6 @@ rank_for_schedule (const void *x, const
 
   if (sched_pressure != SCHED_PRESSURE_NONE)
     {
-      int diff;
-
       /* Prefer insn whose scheduling results in the smallest register
          pressure excess. */
       if ((diff = (INSN_REG_PRESSURE_EXCESS_COST_CHANGE (tmp)
@@ -3731,7 +3746,10 @@ schedule_insn (rtx insn)
         {
           fputc (':', sched_dump);
           for (i = 0; i < ira_pressure_classes_num; i++)
-            fprintf (sched_dump, "%s%+d(%d)",
+            fprintf (sched_dump, "%s%s%+d(%d)",
+                     scheduled_insns.length () > 1
+                     && INSN_LUID (insn)
+                     < INSN_LUID (scheduled_insns[scheduled_insns.length () - 2]) ? "@" : "",
                      reg_class_names[ira_pressure_classes[i]],
                      pressure_info[i].set_increase, pressure_info[i].change);
         }
@@ -6578,9 +6596,11 @@ sched_init (void)
   if (targetm.sched.dispatch (NULL_RTX, IS_DISPATCH_ON))
     targetm.sched.dispatch_do (NULL_RTX, DISPATCH_INIT);
 
-  if (flag_sched_pressure
-      && !reload_completed
-      && common_sched_info->sched_pass_id == SCHED_RGN_PASS)
+  if (sched_relief_p)
+    sched_pressure = SCHED_PRESSURE_WEIGHTED;
+  else if (flag_sched_pressure
+           && !reload_completed
+           && common_sched_info->sched_pass_id == SCHED_RGN_PASS)
     sched_pressure = ((enum sched_pressure_algorithm)
                       PARAM_VALUE (PARAM_SCHED_PRESSURE_ALGORITHM));
   else
Index: ira.c
===================================================================
--- ira.c (revision 204380)
+++ ira.c (working copy)
@@ -3794,11 +3794,12 @@ update_equiv_regs (void)
 
                   if (! reg_equiv[regno].replace
                       || reg_equiv[regno].loop_depth < loop_depth
-                      /* There is no sense to move insns if we did
-                         register pressure-sensitive scheduling was
-                         done because it will not improve allocation
-                         but worsen insn schedule with a big
-                         probability. */
+                      /* There is no sense to move insns if live range
+                         shrinkage or register pressure-sensitive
+                         scheduling were done because it will not
+                         improve allocation but worsen insn schedule
+                         with a big probability. */
+                      || flag_live_range_shrinkage
                       || (flag_sched_pressure && flag_schedule_insns))
                     continue;
 
Index: passes.def
===================================================================
--- passes.def (revision 204380)
+++ passes.def (working copy)
@@ -358,6 +358,7 @@ along with GCC; see the file COPYING3.
       NEXT_PASS (pass_mode_switching);
       NEXT_PASS (pass_match_asm_constraints);
       NEXT_PASS (pass_sms);
+      NEXT_PASS (pass_live_range_shrinkage);
       NEXT_PASS (pass_sched);
       NEXT_PASS (pass_ira);
       NEXT_PASS (pass_reload);
Index: sched-deps.c
===================================================================
--- sched-deps.c (revision 204380)
+++ sched-deps.c (working copy)
@@ -1938,8 +1938,8 @@ create_insn_reg_use (int regno, rtx insn
   return use;
 }
 
-/* Allocate and return reg_set_data structure for REGNO and INSN. */
-static struct reg_set_data *
+/* Allocate reg_set_data structure for REGNO and INSN. */
+static void
 create_insn_reg_set (int regno, rtx insn)
 {
   struct reg_set_data *set;
@@ -1949,7 +1949,6 @@ create_insn_reg_set (int regno, rtx insn
   set->insn = insn;
   set->next_insn_set = INSN_REG_SET_LIST (insn);
   INSN_REG_SET_LIST (insn) = set;
-  return set;
 }
 
 /* Set up insn register uses for INSN and dependency context DEPS. */
Index: sched-int.h
===================================================================
--- sched-int.h (revision 204380)
+++ sched-int.h (working copy)
@@ -28,6 +28,9 @@ along with GCC; see the file COPYING3.
 #include "df.h"
 #include "basic-block.h"
 
+/* True if we do pressure relief pass. */
+extern bool sched_relief_p;
+
 /* Identificator of a scheduler pass. */
 enum sched_pass_id_t { SCHED_PASS_UNKNOWN, SCHED_RGN_PASS, SCHED_EBB_PASS,
                        SCHED_SMS_PASS, SCHED_SEL_PASS };
Index: sched-rgn.c
===================================================================
--- sched-rgn.c (revision 204380)
+++ sched-rgn.c (working copy)
@@ -3565,6 +3565,33 @@ advance_target_bb (basic_block bb, rtx i
 #endif
 
 static bool
+gate_handle_live_range_shrinkage (void)
+{
+#ifdef INSN_SCHEDULING
+  return flag_live_range_shrinkage;
+#else
+  return 0;
+#endif
+}
+
+/* Run instruction scheduler. */
+static unsigned int
+rest_of_handle_live_range_shrinkage (void)
+{
+#ifdef INSN_SCHEDULING
+  int saved;
+
+  sched_relief_p = true;
+  saved = flag_schedule_interblock;
+  flag_schedule_interblock = false;
+  schedule_insns ();
+  flag_schedule_interblock = saved;
+  sched_relief_p = false;
+#endif
+  return 0;
+}
+
+static bool
 gate_handle_sched (void)
 {
 #ifdef INSN_SCHEDULING
@@ -3621,6 +3648,45 @@ rest_of_handle_sched2 (void)
 }
 
 namespace {
+
+const pass_data pass_data_live_range_shrinkage =
+{
+  RTL_PASS, /* type */
+  "lr_shrinkage", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  true, /* has_gate */
+  true, /* has_execute */
+  TV_LIVE_RANGE_SHRINKAGE, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  ( TODO_df_finish | TODO_verify_rtl_sharing
+    | TODO_verify_flow ), /* todo_flags_finish */
+};
+
+class pass_live_range_shrinkage : public rtl_opt_pass
+{
+public:
+  pass_live_range_shrinkage(gcc::context *ctxt)
+    : rtl_opt_pass(pass_data_live_range_shrinkage, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  bool gate () { return gate_handle_live_range_shrinkage (); }
+  unsigned int execute () { return rest_of_handle_live_range_shrinkage (); }
+
+}; // class pass_live_range_shrinkage
+
+} // anon namespace
+
+rtl_opt_pass *
+make_pass_live_range_shrinkage (gcc::context *ctxt)
+{
+  return new pass_live_range_shrinkage (ctxt);
+}
+
+namespace {
 
 const pass_data pass_data_sched =
 {
Index: timevar.def
===================================================================
--- timevar.def (revision 204380)
+++ timevar.def (working copy)
@@ -223,6 +223,7 @@ DEFTIMEVAR (TV_COMBINE , "
 DEFTIMEVAR (TV_IFCVT                 , "if-conversion")
 DEFTIMEVAR (TV_MODE_SWITCH           , "mode switching")
 DEFTIMEVAR (TV_SMS                   , "sms modulo scheduling")
+DEFTIMEVAR (TV_LIVE_RANGE_SHRINKAGE  , "live range shrinkage")
 DEFTIMEVAR (TV_SCHED                 , "scheduling")
 DEFTIMEVAR (TV_IRA                   , "integrated RA")
 DEFTIMEVAR (TV_LRA                   , "LRA non-specific")
Index: tree-pass.h
===================================================================
--- tree-pass.h (revision 204380)
+++ tree-pass.h (working copy)
@@ -530,6 +530,7 @@ extern rtl_opt_pass *make_pass_lower_sub
 extern rtl_opt_pass *make_pass_mode_switching (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_sms (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_sched (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_live_range_shrinkage (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_ira (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_clean_state (gcc::context *ctxt);
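
For readers who want the ordering rule from the new `sched_relief_p` branch of rank_for_schedule in isolation, here is a minimal standalone sketch. It is one reading of the patch, not the GCC implementation: `ReadyInsn`, `issue_before`, `pick_next` and `preferred_luids` are hypothetical names, the two fields stand in for INSN_LUID and INSN_REG_PRESSURE_EXCESS_COST_CHANGE, and the cost values are taken as a fixed snapshot, whereas the scheduler recomputes pressure information as insns are issued. The real comparator is a qsort callback with its arguments swapped, so the sketch restates it as a plain "issue A before B" predicate.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for a ready-list entry; the fields mirror what the
// patch reads through INSN_LUID and INSN_REG_PRESSURE_EXCESS_COST_CHANGE.
struct ReadyInsn {
  int luid;                // position in the original insn order (unique)
  int excess_cost_change;  // change in register pressure excess cost
};

// True if A should be issued before B in the pressure relief pass.
bool issue_before(const ReadyInsn& a, const ReadyInsn& b) {
  // If either insn lowers the excess cost, prefer the smaller change.
  if ((a.excess_cost_change < 0 || b.excess_cost_change < 0)
      && a.excess_cost_change != b.excess_cost_change)
    return a.excess_cost_change < b.excess_cost_change;
  // Otherwise keep the original order to minimize insn movement.
  return a.luid < b.luid;
}

// Pick the insn this heuristic would issue next from a non-empty list.
ReadyInsn pick_next(const std::vector<ReadyInsn>& ready) {
  assert(!ready.empty());
  return *std::min_element(ready.begin(), ready.end(), issue_before);
}

// Order a snapshot of the ready list by preference, best first.
std::vector<int> preferred_luids(std::vector<ReadyInsn> ready) {
  std::sort(ready.begin(), ready.end(), issue_before);
  std::vector<int> luids;
  for (const ReadyInsn& r : ready) luids.push_back(r.luid);
  return luids;
}
```

With this predicate, insns that reduce the excess cost come first (largest reduction first), and everything else stays in original order. That matches the comment in the patch about keeping the sort stable to limit instruction movement.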