Message ID: 506DAD2E.1010507@redhat.com
On Thu, Oct 4, 2012 at 5:37 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
> The only issue now is PR54146 compilation time for IRA+LRA although it
> was improved significantly.  I will continue work on PR54146.  But now
> I am going to focus on proposals from reviews.

Right, there still are opportunities to improve things.

(The real solution may be to stop SRA from creating so many
simultaneously live pseudos in the first place...)

> +  lra_simple_p
> +    = (ira_use_lra_p && max_reg_num () >= (1 << 26) / last_basic_block);

I think you should use n_basic_blocks here instead of last_basic_block,
in case this runs without compacting the cfg first (n_basic_blocks is
the real number of basic blocks in the cfg, last_basic_block is the
highest index, so last_basic_block >= n_basic_blocks).

Thanks for working on this!

Ciao!
Steven
On Thu, Oct 4, 2012 at 7:07 PM, Steven Bosscher <stevenb.gcc@gmail.com> wrote:
> On Thu, Oct 4, 2012 at 5:37 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
>> The only issue now is PR54146 compilation time for IRA+LRA although it
>> was improved significantly.  I will continue work on PR54146.  But now
>> I am going to focus on proposals from reviews.
>
> Right, there still are opportunities to improve things.
>
> (The real solution may be to stop SRA from creating so many
> simultaneously live pseudos in the first place...)
>
>> +  lra_simple_p
>> +    = (ira_use_lra_p && max_reg_num () >= (1 << 26) / last_basic_block);
>
> I think you should use n_basic_blocks here instead of
> last_basic_block, in case this runs without compacting the cfg first
> (n_basic_blocks is the real number of basic blocks in the cfg,
> last_basic_block is the highest index, so last_basic_block >=
> n_basic_blocks).

I also noticed that switching to IRA_REGION_ONE improves things when we
have a large number of loops (the profile points to some loop code in
IRA).

Note that the magic number above should be a new --param, and once we
have a diagnostic flag that shows whenever we back off like this, it
should notify the user of that fact (and the params we have overflown)
- this just reminded me of that idea from somebody else ;)

> Thanks for working on this!

Indeed ;)  It, btw, also applies to IRA + reload ...

Richard.

> Ciao!
> Steven
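[Editor's note: Richard's --param suggestion could look roughly like the
fragment below, in the params.def style of that era.  The parameter name,
help text, and placement are invented here; nothing like this is part of the
posted patch.]

```c
/* Hypothetical params.def entry for the magic 1 << 26 product of
   pseudos and basic blocks; a diagnostic could then report when the
   compiler backs off to the simplified LRA algorithms.  */
DEFPARAM (PARAM_LRA_SIMPLE_THRESHOLD,
	  "lra-simple-threshold",
	  "If the product of pseudos and basic blocks exceeds this "
	  "value, use simplified and faster algorithms in LRA.",
	  1 << 26, 0, 0)
```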
On Thu, Oct 4, 2012 at 5:37 PM, Vladimir Makarov wrote:
> The following patch solves most of the LRA scalability problems.
>
> It switches on simpler algorithms in LRA.  First, it switches off
> trying to reassign hard registers to spilled pseudos (in such huge
> functions they usually have long live ranges, so the chance of
> assigning them anything is very small, yet trying to reassign them
> hard registers is too expensive), as well as inheritance, live range
> splitting, and memory coalescing optimizations.  It seems that
> rematerialization is too important for performance -- so I don't
> switch it off.  As splitting is also necessary for generation of
> caller-saves code, I switch off caller-saves in IRA and force IRA to
> do non-regional RA.

Hi Vlad,

I've revisited this patch now that parts of the scalability issues
have been resolved.  Something funny happened for our
soon-to-be-legendary PR54146 test case...

lra-branch yesterday (i.e. without the elimination and constraints
speedup patches):
 integrated RA           : 145.26 (18%)
 LRA non-specific        :  46.94 ( 6%)
 LRA virtuals elimination:  51.56 ( 6%)
 LRA reload inheritance  :   0.03 ( 0%)
 LRA create live ranges  :  46.67 ( 6%)
 LRA hard reg assignment :   0.55 ( 0%)

lra-branch today + ira-speedup-1.diff:
 integrated RA           : 111.19 (15%) usr
 LRA non-specific        :  21.16 ( 3%) usr
 LRA virtuals elimination:   0.65 ( 0%) usr
 LRA reload inheritance  :   0.01 ( 0%) usr
 LRA create live ranges  :  56.33 ( 8%) usr
 LRA hard reg assignment :   0.58 ( 0%) usr

lra-branch today + ira-speedup-1.diff + rm-lra_simple_p.diff:
 integrated RA           :  89.43 (11%) usr
 LRA non-specific        :  21.43 ( 3%) usr
 LRA virtuals elimination:   0.61 ( 0%) usr
 LRA reload inheritance  :   6.10 ( 1%) usr
 LRA create live ranges  :  88.64 (11%) usr
 LRA hard reg assignment :  45.17 ( 6%) usr
 LRA coalesce pseudo regs:   2.24 ( 0%) usr

Note how IRA is *faster* without the lra_simple_p patch.  The cost
comes back in "LRA hard reg assignment" and "LRA create live ranges",
where I assume the latter is a consequence of running
lra_create_live_ranges a few more times to work for the hard-reg
assignment phase.

Do you have an idea why IRA might be faster without the lra_simple_p
thing?  Maybe there's a way to get the best of both...

Ciao!
Steven
On 12-10-10 10:53 AM, Steven Bosscher wrote:
> On Thu, Oct 4, 2012 at 5:37 PM, Vladimir Makarov wrote:
>> The following patch solves most of the LRA scalability problems.
>>
>> It switches on simpler algorithms in LRA.  First, it switches off
>> trying to reassign hard registers to spilled pseudos (in such huge
>> functions they usually have long live ranges, so the chance of
>> assigning them anything is very small, yet trying to reassign them
>> hard registers is too expensive), as well as inheritance, live range
>> splitting, and memory coalescing optimizations.  It seems that
>> rematerialization is too important for performance -- so I don't
>> switch it off.  As splitting is also necessary for generation of
>> caller-saves code, I switch off caller-saves in IRA and force IRA to
>> do non-regional RA.
>
> Hi Vlad,
>
> I've revisited this patch now that parts of the scalability issues
> have been resolved.  Something funny happened for our
> soon-to-be-legendary PR54146 test case...
>
> lra-branch yesterday (i.e. without the elimination and constraints
> speedup patches):
>  integrated RA           : 145.26 (18%)
>  LRA non-specific        :  46.94 ( 6%)
>  LRA virtuals elimination:  51.56 ( 6%)
>  LRA reload inheritance  :   0.03 ( 0%)
>  LRA create live ranges  :  46.67 ( 6%)
>  LRA hard reg assignment :   0.55 ( 0%)
>
> lra-branch today + ira-speedup-1.diff:
>  integrated RA           : 111.19 (15%) usr
>  LRA non-specific        :  21.16 ( 3%) usr
>  LRA virtuals elimination:   0.65 ( 0%) usr
>  LRA reload inheritance  :   0.01 ( 0%) usr
>  LRA create live ranges  :  56.33 ( 8%) usr
>  LRA hard reg assignment :   0.58 ( 0%) usr
>
> lra-branch today + ira-speedup-1.diff + rm-lra_simple_p.diff:
>  integrated RA           :  89.43 (11%) usr
>  LRA non-specific        :  21.43 ( 3%) usr
>  LRA virtuals elimination:   0.61 ( 0%) usr
>  LRA reload inheritance  :   6.10 ( 1%) usr
>  LRA create live ranges  :  88.64 (11%) usr
>  LRA hard reg assignment :  45.17 ( 6%) usr
>  LRA coalesce pseudo regs:   2.24 ( 0%) usr
>
> Note how IRA is *faster* without the lra_simple_p patch.  The cost
> comes back in "LRA hard reg assignment" and "LRA create live ranges",
> where I assume the latter is a consequence of running
> lra_create_live_ranges a few more times to work for the hard-reg
> assignment phase.
>
> Do you have an idea why IRA might be faster without the lra_simple_p
> thing?  Maybe there's a way to get the best of both...

I have no idea.  I cannot confirm it on an Intel Core i7 machine.  Here
is my timing.  Removing lra_simple_p gives the worst compilation time
but the best code size.

It is also interesting that your IRA range patch results in different
code generation (I cannot explain that yet either).  I saw the same on
a small test (a black jack playing and betting strategy).  Another
interesting thing is that the IRA times are the same with and without
the simplified allocation for LRA.

--- branch this morning
 integrated RA           : 48.41 (13%) usr  0.25 ( 3%) sys  48.72 (13%) wall  223608 kB (19%) ggc
 LRA non-specific        : 14.47 ( 4%) usr  0.15 ( 2%) sys  14.57 ( 4%) wall   41443 kB ( 4%) ggc
 LRA virtuals elimination:  0.40 ( 0%) usr  0.00 ( 0%) sys   0.41 ( 0%) wall   36037 kB ( 3%) ggc
 LRA reload inheritance  :  0.14 ( 0%) usr  0.00 ( 0%) sys   0.15 ( 0%) wall    1209 kB ( 0%) ggc
 LRA create live ranges  : 17.37 ( 5%) usr  0.21 ( 3%) sys  17.56 ( 5%) wall    5182 kB ( 0%) ggc
 LRA hard reg assignment :  1.77 ( 0%) usr  0.02 ( 0%) sys   1.76 ( 0%) wall       0 kB ( 0%) ggc
 LRA coalesce pseudo regs:  0.01 ( 0%) usr  0.00 ( 0%) sys   0.00 ( 0%) wall       0 kB ( 0%) ggc
 real=377.25 user=367.58 system=8.36 share=99%% maxrss=33540720 ins=280 outs=92544 mfaults=4448012 waits=17
    text    data     bss     dec     hex filename
 6395340      16     607 6395963  61983b s.o

--- branch this morning + ira range patch
 integrated RA           : 36.03 (10%) usr  0.03 ( 0%) sys  36.20 (10%) wall  223608 kB (19%) ggc
 LRA non-specific        : 14.57 ( 4%) usr  0.14 ( 2%) sys  14.89 ( 4%) wall   41453 kB ( 4%) ggc
 LRA virtuals elimination:  0.36 ( 0%) usr  0.01 ( 0%) sys   0.41 ( 0%) wall   36040 kB ( 3%) ggc
 LRA reload inheritance  :  0.11 ( 0%) usr  0.00 ( 0%) sys   0.15 ( 0%) wall    1210 kB ( 0%) ggc
 LRA create live ranges  : 17.36 ( 5%) usr  0.21 ( 3%) sys  17.53 ( 5%) wall    5184 kB ( 0%) ggc
 LRA hard reg assignment :  1.78 ( 1%) usr  0.02 ( 0%) sys   1.79 ( 0%) wall       0 kB ( 0%) ggc
 LRA coalesce pseudo regs:  0.00 ( 0%) usr  0.00 ( 0%) sys   0.01 ( 0%) wall       0 kB ( 0%) ggc
 TOTAL                   : 351.82           7.50            360.52          1149460 kB
 real=362.68 user=353.65 system=7.84 share=99%% maxrss=33540432 ins=224 outs=92544 mfaults=4073281 waits=17
    text    data     bss     dec     hex filename
 6395424      16     607 6396047  61988f s.o

--- branch this morning + ira range patch + removing lra_simple_p
 integrated RA           : 37.87 ( 9%) usr  0.14 ( 2%) sys  38.30 ( 9%) wall  744114 kB (45%) ggc
 LRA non-specific        : 13.52 ( 3%) usr  0.05 ( 1%) sys  13.60 ( 3%) wall   39171 kB ( 2%) ggc
 LRA virtuals elimination:  0.38 ( 0%) usr  0.01 ( 0%) sys   0.40 ( 0%) wall   33096 kB ( 2%) ggc
 LRA reload inheritance  :  3.31 ( 1%) usr  0.00 ( 0%) sys   3.36 ( 1%) wall    5217 kB ( 0%) ggc
 LRA create live ranges  : 39.75 (10%) usr  0.42 ( 5%) sys  40.53 (10%) wall    5694 kB ( 0%) ggc
 LRA hard reg assignment : 31.87 ( 8%) usr  0.03 ( 0%) sys  31.94 ( 8%) wall       0 kB ( 0%) ggc
 LRA coalesce pseudo regs:  1.14 ( 0%) usr  0.00 ( 0%) sys   1.15 ( 0%) wall       0 kB ( 0%) ggc
 real=424.69 user=414.47 system=8.06 share=99%% maxrss=36546048 ins=34992 outs=91528 mfaults=4253004 waits=175
    text    data     bss     dec     hex filename
 6278007      16     607 6278630  5fcde6 s.o
On Wed, Oct 10, 2012 at 10:14 PM, Vladimir Makarov <vmakarov@redhat.com> wrote:
> It is also interesting that your IRA range patch results in
> different code generation (i can not explain it too now). I saw the same
> on a small test (black jack playing and betting strategy).

I haven't looked into this, but I'm guessing this is because functions
whose semantics depend on the number of program points, like
build_conflict_bit_table, will behave differently.  For example, with
my patch there are significantly fewer program points, and as a result
OBJECT_MIN/OBJECT_MAX take on smaller values.  For
build_conflict_bit_table this leads to a different cost metric to
decide whether the conflict table is small enough to build.  Probably
other functions respond similarly to my patch, but I don't know IRA
well enough to say :-)

In any case, bootstrap & test passed (with and without checking
enabled) on x86_64-unknown-linux-gnu and powerpc64-unknown-linux-gnu,
so I intend to commit the patch (with a few comment typos fixed) today
or tomorrow.

Ciao!
Steven
Index: ira.c
===================================================================
--- ira.c	(revision 192048)
+++ ira.c	(working copy)
@@ -4327,8 +4327,26 @@ ira (FILE *f)
   bool loops_p;
   int max_regno_before_ira, ira_max_point_before_emit;
   int rebuild_p;
+  bool saved_flag_caller_saves = flag_caller_saves;
+  enum ira_region saved_flag_ira_region = flag_ira_region;
 
+  ira_conflicts_p = optimize > 0;
   ira_use_lra_p = targetm.lra_p ();
+  /* If there are too many pseudos and/or basic blocks (e.g. 10K
+     pseudos and 10K blocks or 100K pseudos and 1K blocks), we will
+     use simplified and faster algorithms in LRA.  */
+  lra_simple_p
+    = (ira_use_lra_p && max_reg_num () >= (1 << 26) / last_basic_block);
+  if (lra_simple_p)
+    {
+      /* It permits to skip live range splitting in LRA.  */
+      flag_caller_saves = false;
+      /* There is no sense to do regional allocation when we use
+	 simplified LRA.  */
+      flag_ira_region = IRA_REGION_ONE;
+      ira_conflicts_p = false;
+    }
 
 #ifndef IRA_NO_OBSTACK
   gcc_obstack_init (&ira_obstack);
@@ -4349,7 +4367,6 @@ ira (FILE *f)
       ira_dump_file = stderr;
     }
 
-  ira_conflicts_p = optimize > 0;
   setup_prohibited_mode_move_regs ();
 
   df_note_add_problem ();
@@ -4530,6 +4547,13 @@ ira (FILE *f)
   /* See comment for find_moveable_pseudos call.  */
   if (ira_conflicts_p)
     move_unallocated_pseudos ();
+
+  /* Restore original values.  */
+  if (lra_simple_p)
+    {
+      flag_caller_saves = saved_flag_caller_saves;
+      flag_ira_region = saved_flag_ira_region;
+    }
 }
 
 static void
Index: lra-assigns.c
===================================================================
--- lra-assigns.c	(revision 192050)
+++ lra-assigns.c	(working copy)
@@ -1186,46 +1186,50 @@ assign_by_spills (void)
   improve_inheritance (&changed_pseudo_bitmap);
   bitmap_clear (&non_reload_pseudos);
   bitmap_clear (&changed_insns);
-  /* We should not assign to original pseudos of inheritance pseudos
-     or split pseudos if any its inheritance pseudo did not get hard
-     register or any its split pseudo was not split because undo
-     inheritance/split pass will extend live range of such inheritance
-     or split pseudos.  */
-  bitmap_initialize (&do_not_assign_nonreload_pseudos, &reg_obstack);
-  EXECUTE_IF_SET_IN_BITMAP (&lra_inheritance_pseudos, 0, u, bi)
-    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
-	&& reg_renumber[u] < 0 && bitmap_bit_p (&lra_inheritance_pseudos, u))
-      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
-  EXECUTE_IF_SET_IN_BITMAP (&lra_split_pseudos, 0, u, bi)
-    if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
-	&& reg_renumber[u] >= 0 && bitmap_bit_p (&lra_split_pseudos, u))
-      bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
-  for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
-    if (((i < lra_constraint_new_regno_start
-	  && ! bitmap_bit_p (&do_not_assign_nonreload_pseudos, i))
-	 || (bitmap_bit_p (&lra_inheritance_pseudos, i)
-	     && lra_reg_info[i].restore_regno >= 0)
-	 || (bitmap_bit_p (&lra_split_pseudos, i)
-	     && lra_reg_info[i].restore_regno >= 0)
-	 || bitmap_bit_p (&lra_optional_reload_pseudos, i))
-	&& reg_renumber[i] < 0 && lra_reg_info[i].nrefs != 0
-	&& regno_allocno_class_array[i] != NO_REGS)
-      sorted_pseudos[n++] = i;
-  bitmap_clear (&do_not_assign_nonreload_pseudos);
-  if (n != 0 && lra_dump_file != NULL)
-    fprintf (lra_dump_file, "  Reassing non-reload pseudos\n");
-  qsort (sorted_pseudos, n, sizeof (int), pseudo_compare_func);
-  for (i = 0; i < n; i++)
+  if (! lra_simple_p)
     {
-      regno = sorted_pseudos[i];
-      hard_regno = find_hard_regno_for (regno, &cost, -1);
-      if (hard_regno >= 0)
+      /* We should not assign to original pseudos of inheritance
+	 pseudos or split pseudos if any its inheritance pseudo did
+	 not get hard register or any its split pseudo was not split
+	 because undo inheritance/split pass will extend live range of
+	 such inheritance or split pseudos.  */
+      bitmap_initialize (&do_not_assign_nonreload_pseudos, &reg_obstack);
+      EXECUTE_IF_SET_IN_BITMAP (&lra_inheritance_pseudos, 0, u, bi)
+	if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
+	    && reg_renumber[u] < 0
+	    && bitmap_bit_p (&lra_inheritance_pseudos, u))
+	  bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
+      EXECUTE_IF_SET_IN_BITMAP (&lra_split_pseudos, 0, u, bi)
+	if ((restore_regno = lra_reg_info[u].restore_regno) >= 0
+	    && reg_renumber[u] >= 0 && bitmap_bit_p (&lra_split_pseudos, u))
+	  bitmap_set_bit (&do_not_assign_nonreload_pseudos, restore_regno);
+      for (n = 0, i = FIRST_PSEUDO_REGISTER; i < max_reg_num (); i++)
+	if (((i < lra_constraint_new_regno_start
+	      && ! bitmap_bit_p (&do_not_assign_nonreload_pseudos, i))
+	     || (bitmap_bit_p (&lra_inheritance_pseudos, i)
+		 && lra_reg_info[i].restore_regno >= 0)
+	     || (bitmap_bit_p (&lra_split_pseudos, i)
+		 && lra_reg_info[i].restore_regno >= 0)
+	     || bitmap_bit_p (&lra_optional_reload_pseudos, i))
+	    && reg_renumber[i] < 0 && lra_reg_info[i].nrefs != 0
+	    && regno_allocno_class_array[i] != NO_REGS)
+	  sorted_pseudos[n++] = i;
+      bitmap_clear (&do_not_assign_nonreload_pseudos);
+      if (n != 0 && lra_dump_file != NULL)
+	fprintf (lra_dump_file, "  Reassing non-reload pseudos\n");
+      qsort (sorted_pseudos, n, sizeof (int), pseudo_compare_func);
+      for (i = 0; i < n; i++)
	{
-	  regno = sorted_pseudos[i];
-	  hard_regno = find_hard_regno_for (regno, &cost, -1);
-	  if (hard_regno >= 0)
+	  regno = sorted_pseudos[i];
+	  hard_regno = find_hard_regno_for (regno, &cost, -1);
+	  if (hard_regno >= 0)
+	    {
+	      assign_hard_regno (hard_regno, regno);
	      /* We change allocation for non-reload pseudo on this
		 iteration -- mark the pseudo for invalidation of used
		 alternatives of insns containing the pseudo.  */
-	  bitmap_set_bit (&changed_pseudo_bitmap, regno);
+	      bitmap_set_bit (&changed_pseudo_bitmap, regno);
+	    }
	}
     }
   free (update_hard_regno_preference_check);
Index: lra.c
===================================================================
--- lra.c	(revision 192050)
+++ lra.c	(working copy)
@@ -2178,8 +2178,14 @@ setup_reg_spill_flag (void)
   lra_reg_spill_p = false;
 }
 
+/* True if the current function is too big to use regular algorithms
+   in LRA.  In other words, we should use simpler and faster algorithms
+   in LRA.  It also means we should not worry about generation code
+   for caller saves.  The value is set up in IRA.  */
+bool lra_simple_p;
+
 /* Major LRA entry function.  F is a file should be used to dump LRA
-   debug info. */
+   debug info.  */
 void
 lra (FILE *f)
 {
@@ -2266,7 +2272,9 @@ lra (FILE *f)
	 RS6000_PIC_OFFSET_TABLE_REGNUM uneliminable if we started to
	 use a constant pool.  */
       lra_eliminate (false);
-      lra_inheritance ();
+      /* Do inheritance only for regular algorithms.  */
+      if (! lra_simple_p)
+	lra_inheritance ();
       /* We need live ranges for lra_assign -- so build them.  */
       lra_create_live_ranges (true);
       live_p = true;
@@ -2275,10 +2283,16 @@ lra (FILE *f)
	 If inheritance pseudos were spilled, the memory-memory moves
	 involving them will be removed by pass undoing inheritance.  */
-      if (! lra_assign () && lra_coalesce ())
-	live_p = false;
-      if (lra_undo_inheritance ())
-	live_p = false;
+      if (lra_simple_p)
+	lra_assign ();
+      else
+	{
+	  /* Do coalescing only for regular algorithms.  */
+	  if (! lra_assign () && lra_coalesce ())
+	    live_p = false;
+	  if (lra_undo_inheritance ())
+	    live_p = false;
+	}
     }
   bitmap_clear (&lra_inheritance_pseudos);
   bitmap_clear (&lra_split_pseudos);
Index: lra.h
===================================================================
--- lra.h	(revision 192048)
+++ lra.h	(working copy)
@@ -20,6 +20,8 @@ You should have received a copy of the G
 along with GCC; see the file COPYING3.  If not see
 <http://www.gnu.org/licenses/>.  */
 
+extern bool lra_simple_p;
+
 /* Return the allocno reg class of REGNO.  If it is a reload pseudo,
    the pseudo should finally get hard register of the allocno
    class.  */