Message ID | ddd8c186-fc88-96df-b1c0-f99edec654f2@linux.ibm.com
---|---
Series | IVOPTs consider step cost for different forms when unrolling
Hi!

On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
> As we discussed in the thread https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html, I'm working to teach IVOPTs to consider D-form group access during unrolling. The difference on D-form and other forms during unrolling is we can put the stride into displacement field to avoid additional step increment. eg:

<snip>

> Imagining that if the loop get unrolled by 8 times, then 3 step updates with D-form vs. 8 step updates with X-form. Here we only need to check stride meet D-form field requirement, since if OFF doesn't meet, we can construct baseA' with baseA + OFF.

So why doesn't the existing code do this already? Why does it make all the extra induction variables? Is the existing cost model bad, are our target costs bad, or something like that?

Segher
Hi Segher,

on 2020/1/20 8:33 PM, Segher Boessenkool wrote:

<snip>

> So why doesn't the existing code do this already? Why does it make all the extra induction variables? Is the existing cost model bad, are our target costs bad, or something like that?

I think the main cause is that IVOPTs runs before RTL unrolling: when it determines the IV sets, it can only take the normal step cost into account, since its input isn't unrolled yet. After unrolling, the X-form indexing register has to perform UF-1 more updates, but we can probably hide them in the D-form displacement field. The way I proposed here is to adjust the IV cost with an additional cost_step according to the estimated unroll factor. It doesn't introduce new IV candidates but can affect the final optimal set.

BR,
Kewen
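[Editor's note] The step-cost asymmetry described above can be sketched numerically. This is illustrative Python, not GCC code; the function names and the flat per-update cost are assumptions made for the example.

```python
# Illustrative sketch of the step-cost asymmetry described above (not GCC
# code).  With reg+reg (X-form) addressing, one shared index register must
# be bumped after every original iteration, so an unroll factor UF costs
# UF step updates.  With reg+disp (D-form) addressing, k*stride can be
# folded into each copy's displacement, so only the base register of each
# array needs a single bump per unrolled body.

def step_updates(unroll_factor, num_arrays, form):
    """Count IV update instructions per unrolled loop body."""
    if form == "x":
        return unroll_factor      # one bump per original iteration
    if form == "d":
        return num_arrays         # one bump per base register
    raise ValueError("unknown addressing form: " + form)

def adjusted_iv_cost(base_cost, step_cost, est_unroll_factor):
    """Kewen's proposal, roughly: charge a candidate IV for the extra
    UF-1 step updates it will need once RTL unrolling happens."""
    return base_cost + step_cost * (est_unroll_factor - 1)

# The example from the thread: three arrays, loop unrolled 8 times.
print(step_updates(8, 3, "x"))  # 8 updates with X-form
print(step_updates(8, 3, "d"))  # 3 updates with D-form
```

Under these assumptions an X-form candidate's cost grows with the estimated unroll factor while a D-form candidate's does not, which is enough to tip the optimal-set selection without adding any new candidates.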
Hi!

On Mon, Feb 10, 2020 at 02:17:04PM +0800, Kewen.Lin wrote:

<snip>

> I think the main cause is IVOPTs runs before RTL unroll, when it's determining the IV sets, it can only take the normal step cost into account, since its input isn't unrolled yet. After unrolling, the x-form indexing register has to play with more UF-1 times update, but we can probably hide them in d-form displacement field. The way I proposed here is to adjust IV cost with additional cost_step according to estimated unroll. It doesn't introduce new IV cand but can affect the final optimal set.

Yes, we should decide how often we want to unroll things somewhere before ivopts already, and just use that info here.

Or are there advantages to doing it *in* ivopts? It sounds like doing it there is probably expensive, but maybe not, and we need to do similar analysis there anyway.

Segher
on 2020/2/11 5:29 AM, Segher Boessenkool wrote:

<snip>

> Yes, we should decide how often we want to unroll things somewhere before ivopts already, and just use that info here.

Agreed! If some passes are interested in this unroll factor estimation, we can move it even earlier, as long as that is before IVOPTs. As in patch 1/4, once it's set, later passes can just reuse that info. As Richard B. suggested, we can even skip the later RTL unroll factor determination.

> Or are there advantage to doing it *in* ivopts? It sounds like doing it there is probably expensive, but maybe not, and we need to do similar analysis there anyway.

Good question. I didn't consider that; the reason for putting it here is that we need this information in IVOPTs for some cases. :)

BR,
Kewen
On Mon, 10 Feb 2020, Segher Boessenkool wrote:

<snip>

> Yes, we should decide how often we want to unroll things somewhere before ivopts already, and just use that info here.
>
> Or are there advantage to doing it *in* ivopts? It sounds like doing it there is probably expensive, but maybe not, and we need to do similar analysis there anyway.

Well, if the only benefit of doing the unrolling is that IVs get cheaper then yes, IVOPTs should drive it. But usually unrolling exposes redundancies (caught by predictive commoning, which drives some unrolling) or it enables better use of CPU resources via scheduling (only caught later in RTL). For scheduling we have the additional complication that the RTL side doesn't have as much of a fancy data dependence analysis framework as the GIMPLE side. So I'd put my bet on trying to move something like SMS to GIMPLE and combine it with unrolling (IIRC SMS at most interleaves 1 1/2 loop iterations).

Richard.
On Tue, Feb 11, 2020 at 08:34:15AM +0100, Richard Biener wrote:
> Well, if the only benefit of doing the unrolling is that IVs get cheaper then yes, IVOPTs should drive it.

We need to know much earlier in the pass pipeline how often a loop will be unrolled. We don't have to *do* it early.

If we want to know it before ivopts, then obviously it has to be done earlier. Otherwise, maybe it is a good idea to do it in ivopts itself. Or maybe not. It's just an idea :-)

We know we do not want it *later*: ivopts needs to know this to make good decisions of its own.

> But usually unrolling exposes redundancies (catched by predictive commoning which drives some unrolling) or it enables better use of CPU resources via scheduling (only catched later in RTL). For scheduling we have the additional complication that the RTL side doesn't have as much of a fancy data dependence analysis framework as on the GIMPLE side. So I'd put my bet on trying to move something like SMS to GIMPLE and combine it with unrolling (IIRC SMS at most interleaves 1 1/2 loop iterations).

SMS on RTL always was quite disappointing... Do you expect it will be more useful on GIMPLE? Moving it there is a good idea in any case ;-)

I don't quite see the synergy between SMS and loop unrolling, but maybe I need to look harder.

Segher
On Tue, 11 Feb 2020, Segher Boessenkool wrote:

<snip>

> SMS on RTL always was quite disappointing...

It originally came with "data dependence export from GIMPLE to RTL" that never materialized, so I'm not surprised ;) It also relies on doloop detection.

> Do you expect it will be more useful on Gimple? Moving it there is a good idea in any case ;-)
>
> I don't quite see the synergy between SMS and loop unrolling, but maybe I need to look harder.

As said elsewhere, I don't believe in actual unrolling doing much good in itself, but rather in removing data dependences in the CPU pipeline. SMS rotates the loop, peeling N iterations (and somehow I think for N > 1 that should better mean unrolling the loop body). Of course doing "scheduling" on GIMPLE is "interesting" in its own right, but OTOH our pipeline DFAs are imprecise enough that one could even devise some basic GIMPLE <-> "RTL" mapping to make use of it. But then scheduling without IVs or register pressure in mind is somewhat pointless as well.

That said - if I had enough time I'd still think that investigating "scheduling on GIMPLE" as a replacement for sched1 is an interesting thing to do.

Richard.
11.02.2020 11:01, Richard Biener wrote:
> On Tue, 11 Feb 2020, Segher Boessenkool wrote:

<snip>

>>> So I'd put my bet on trying to move something like SMS to GIMPLE and combine it with unrolling (IIRC SMS at most interleaves 1 1/2 loop iterations).

To clarify, without specifying -fmodulo-sched-allow-regmoves it only interleaves 2 iterations. With register moves enabled more iterations can be considered.

>> SMS on RTL always was quite disappointing...

Hmm, even when we tried to move it just a few passes earlier many years ago, we got another opinion: https://gcc.gnu.org/ml/gcc-patches/2011-10/msg01526.html Although without such a move we still have annoying issues which RTL folks can't solve, see e.g. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93264#c2

> It originally came with "data dependence export from GIMPLE to RTL" that never materialized so I'm not surprised ;) It also relies on doloop detection.

My current attempt to drop the doloop dependency is still WIP; hopefully I'll create a branch in refs/users/ in a month or so. But older (gcc-7 and earlier) versions are available, see https://gcc.gnu.org/ml/gcc-patches/2017-02/msg01647.html Doloops are still supported for a kind of backward compatibility, but many more loops (those which loop-iv can analyze) are considered in the new SMS.

> As said elsewhere I don't believe in actual unrolling doing much good but in removing data dependences in the CPU pipeline. SMS rotates the loop, peeling N iterations (and somehow I think for N > 1 that should better mean unrolling the loop body).

Yes, this is what theory tells us.

> Of course doing "scheduling" on GIMPLE is "interesting" in its own but OTOH our pipeline DFAs are imprecise enough that one could even devise some basic GIMPLE <-> "RTL" mapping to make use of it. But then scheduling without IVs or register pressure in mind is somewhat pointless as well.

Unfortunately, even with -fmodulo-sched-allow-regmoves it doesn't interact much with register pressure.

> That said - if I had enough time I'd still thing that investigating "scheduling on GIMPLE" as replacement for sched1 is an interesting thing to do.

Sounds good, but IMHO the modulo scheduler is not the best choice as the first step in implementing such a concept.

Roman
On Tue, 11 Feb 2020, Roman Zhuykov wrote:

<snip>

> > That said - if I had enough time I'd still thing that investigating "scheduling on GIMPLE" as replacement for sched1 is an interesting thing to do.
>
> Sound good, but IMHO modulo scheduler is not the best choice to be the first step implementing such a concept.

True ;) But since the context of this thread is unrolling ... Not sure how you'd figure the unroll factor to apply if you want to do unrolling within a classical scheduling framework? Maybe unroll as much as you can fill slots until the last instruction of the first iteration retires?

Richard.
On Tue, Feb 11, 2020 at 02:58:47PM +0100, Richard Biener wrote:
> True ;) But since the context of this thread is unrolling ... Not sure how you'd figure the unroll factor to apply if you want to do unrolling within a classical scheduling framework? Maybe unroll as much as you can fill slots until the last instruction of the first iteration retires?

That will be terrible on register-rich architectures: it *already* is problematic how often some things are unrolled, blindly unrolling more would make things worse. We need to unroll more where it helps, but less where it does not. For that we need a good cost/benefit estimate.

Segher
Hi!

On Tue, Feb 11, 2020 at 03:46:05PM +0300, Roman Zhuykov wrote:
> Hmm, even when trying to move it just few passes earlier many years ago, got another opinion: https://gcc.gnu.org/ml/gcc-patches/2011-10/msg01526.html Although without such a move we still have annoying issues which RTL folks can't solve, see e.q. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93264#c2

Basic block partitioning has wildly disproportionate fallout in all later passes, both in terms of what those *do* (or don't, if partitioning is enabled), and of impact on the code (not to mention developer time).

Maybe the implementation can be improved, but probably we should do this in a different way altogether. The current situation is not good.

Segher
On Tue, 11 Feb 2020, Segher Boessenkool wrote:

<snip>

> That will be terrible on register-rich architectures: it *already* is problematic how often some things are unrolled, blindly unrolling more would make things worse. We need to unroll more where it helps, but less where it does not. For that we need a good cost/benefit estimate.

True. For x86 we tried but did not come up with a sensible estimate (probably the x86 uarchs are way too complicated to understand).

Richard.
On Tue, 11 Feb 2020, Segher Boessenkool wrote:

<snip>

> Basic block partitioning has wildly disproportionate fallout in all later passes, both in terms of what those *do* (or don't, if partitioning is enabled), and of impact on the code (not to mention developer time).
>
> Maybe the implementation can be improved, but probably we should do this in a different way altogether. The current situation is not good.

I think the expectation that you can go back to CFG layout mode and then work with CFG layout tools after we've lowered to CFG RTL is simply bogus. Yeah, you can probably do analysis things, but I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle can wreck things. Undoubtedly doing CFG manipulations is not going to work since CFG layout does not respect CFG RTL restrictions.

Partitioning simply uncovered latent bugs; there's nothing wrong with it IMHO.

Richard.
On Wed, Feb 12, 2020 at 09:12:58AM +0100, Richard Biener wrote:
> I think the expectation that you can go back to CFG layout mode and then work with CFG layout tools after we've lowered to CFG RTL is simply bogus.

Partitioning is also quite problematic if you do not use cfglayout mode. For example, in shrink-wrapping. It prevents a lot there.

> Yeah, you can probably do analysis things but I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle can wreck things. Undoubtedly doing CFG manipulations is not going to work since CFG layout does not respect CFG RTL restrictions.

Doing CFG manipulations on CFG RTL mode directly is almost impossible to do correctly.

For example, bb-reorder. Which is a much more important optimisation than partitioning, btw.

> Partitioning simply uncovered latent bugs, there's nothing wrong with it IMHO.

I don't agree. The whole way EDGE_CROSSING works hinders all other optimisations a lot.

Segher
On Wed, 12 Feb 2020, Segher Boessenkool wrote:

<snip>

> Doing CFG manipulations on CFG RTL mode directly is almost impossible to do correctly.
>
> For example, bb-reorder. Which is a much more important optimisation than partitioning, btw.

BB reorder switches back and forth as well ... :/ I think both are closely enough related that we probably should do partitioning from within the same framework? OTOH BB reorder happens _much_ later.

> I don't agree. The whole way EDGE_CROSSING works hinders all other optimisations a lot.

I'm not sure if it's there for correctness reasons or just to help checking that nothing "undoes" the partitioning decision.

Richard.
On Wed, Feb 12, 2020 at 09:07:27AM +0100, Richard Biener wrote:

<snip>

> True. For x86 we tried but did not come up with a sensible estimate (probably the x86 uarchs are way too complicated to understand).

There are three main factors at work:

1) The cost for iterating the loop. This is related to what ivopts does, and also on most uarchs every loop iteration has some minimum cost (on many uarchs it needs a fetch redirect, or you can only do one branch per cycle, etc.; this is mainly important for "small" loops, where what is "small" differs per uarch). Unrolling brings this cost down.

2) Cost reduction by better scheduling.

3) The cost for using more registers, when unrolling. This is worse if you also use SMS or variable expansion, or simply because of the better scheduling. This can increase the cost a *lot*.

1) isn't all that hard to estimate. 2) and esp. 3) are though :-/

Segher
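[Editor's note] The three factors above could be combined into a rough benefit estimate along these lines. This is purely illustrative Python; every parameter and cost value is an assumption, and as the message says, factors 2) and 3) are exactly the hard-to-estimate parts.

```python
# Purely illustrative combination of the three unrolling cost factors
# described above; all the numbers a real compiler would need here
# (especially the scheduling saving and the spill cost) are assumed.

def unroll_benefit(uf, iter_cost, sched_saving_per_copy,
                   live_regs_per_copy, free_regs, spill_cost):
    # 1) iteration overhead: paid once per unrolled body instead of once
    #    per original iteration, so UF-1 copies of it are saved.
    saved_overhead = iter_cost * (uf - 1)
    # 2) scheduling: assumed (optimistically) linear in the extra copies.
    saved_sched = sched_saving_per_copy * (uf - 1)
    # 3) register pressure: penalize values live beyond the register file.
    pressure = max(0, uf * live_regs_per_copy - free_regs)
    return saved_overhead + saved_sched - pressure * spill_cost

# With 4 live values per copy and 16 free registers, UF=4 just fits ...
print(unroll_benefit(4, 2, 1, 4, 16, 10))   # 9: all benefit, no spills
# ... while UF=8 overflows the register file and the estimate collapses.
print(unroll_benefit(8, 2, 1, 4, 16, 10))   # -139
```

The sharp cliff once live values exceed the register file is the point of the example: factor 3) can dominate, which is why "unroll more where it helps, less where it does not" needs the pressure term, not just the slot count.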
On Wed, Feb 12, 2020 at 11:53:22AM +0100, Richard Biener wrote:
> On Wed, 12 Feb 2020, Segher Boessenkool wrote:
> > On Wed, Feb 12, 2020 at 09:12:58AM +0100, Richard Biener wrote:
> > > On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> > > > Basic block partitioning has wildly disproportionate fallout in all
> > > > later passes, both in terms of what those *do* (or don't, if partitioning
> > > > is enabled), and of impact on the code (not to mention developer time).
> > > >
> > > > Maybe the implementation can be improved, but probably we should do this
> > > > in a different way altogether.  The current situation is not good.
> > >
> > > I think the expectation that you can go back to CFG layout mode
> > > and then work with CFG layout tools after we've lowered to CFG RTL
> > > is simply bogus.
> >
> > Partitioning is also quite problematic if you do not use cfglayout
> > mode.  For example, in shrink-wrapping.  It prevents a lot there.
> >
> > > Yeah, you can probably do analysis things but
> > > I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle
> > > can wreck things.  Undoubtedly doing CFG manipulations is not going
> > > to work since CFG layout does not respect CFG RTL restrictions.
> >
> > Doing CFG manipulations on CFG RTL mode directly is almost impossible
> > to do correctly.
> >
> > For example, bb-reorder.  Which is a much more important optimisation
> > than partitioning, btw.
>
> BB reorder switches back and forth as well ... :/

Yes.  It is extremely hard to change any jumps in cfgrtl mode.

The goal is to use cfglayout mode more, ideally *always*, not to use
it less!

> I think both are closely enough related that we probably should do
> partitioning from within the same framework?  OTOH BB reorder
> happens _much_ later.

This may be doable.  What we need to do first though is to find a
better thing to use than EDGE_CROSSING.

Maybe we can determine which blocks should be hot and cold early, but
actually make that happen only very late (maybe adjusting the decision
if we have to, to make things work)?  That way, intervening passes do
not have to care (much).

> > > Partitioning simply uncovered latent bugs, there's nothing wrong
> > > with it IMHO.
> >
> > I don't agree.  The whole way EDGE_CROSSING works hinders all other
> > optimisations a lot.
>
> I'm not sure if it's there for correctness reasons or just to
> help checking that nothing "undoes" the partitioning decision.

The latter.  (It may also make it easier for the former, of course,
but that can be solved anyway).  And that makes life very hard for all
later passes, while it is doubtful that it even gives the best
decisions: for example it prevents a lot of shrink-wrapping, which you
dearly *want* to do on cold code!


Segher
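[Editorial note: the "decide early, apply late" idea above can be sketched in
a few lines. This is a hypothetical toy model, not GCC code; the function
names and the 1% threshold are invented for illustration.]

```python
# Decide hot/cold early, as a pure annotation that intermediate passes
# can ignore; physically partition (and fix up decisions that became
# invalid) only at the very end of compilation.

def annotate(counts, threshold=0.01):
    """Early: mark a block hot unless it sees < threshold of all executions."""
    total = sum(counts.values()) or 1
    return {bb: c / total >= threshold for bb, c in counts.items()}

def apply_partition(order, is_hot):
    """Late: lay out hot blocks first, then cold ones, adjusting the
    entry block to hot if the early decision would be invalid."""
    if order and not is_hot.get(order[0], True):
        is_hot[order[0]] = True  # the function entry must stay hot
    hot = [bb for bb in order if is_hot.get(bb, True)]
    cold = [bb for bb in order if not is_hot.get(bb, True)]
    return hot + cold
```

The point of the split is exactly what the mail argues: between `annotate` and
`apply_partition` nothing is marked EDGE_CROSSING, so passes like
shrink-wrapping can still move code freely across the eventual boundary.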
On Wed, 12 Feb 2020, Segher Boessenkool wrote:
> On Wed, Feb 12, 2020 at 11:53:22AM +0100, Richard Biener wrote:
> > On Wed, 12 Feb 2020, Segher Boessenkool wrote:
> > > On Wed, Feb 12, 2020 at 09:12:58AM +0100, Richard Biener wrote:
> > > > On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> > > > > Basic block partitioning has wildly disproportionate fallout in all
> > > > > later passes, both in terms of what those *do* (or don't, if partitioning
> > > > > is enabled), and of impact on the code (not to mention developer time).
> > > > >
> > > > > Maybe the implementation can be improved, but probably we should do this
> > > > > in a different way altogether.  The current situation is not good.
> > > >
> > > > I think the expectation that you can go back to CFG layout mode
> > > > and then work with CFG layout tools after we've lowered to CFG RTL
> > > > is simply bogus.
> > >
> > > Partitioning is also quite problematic if you do not use cfglayout
> > > mode.  For example, in shrink-wrapping.  It prevents a lot there.
> > >
> > > > Yeah, you can probably do analysis things but
> > > > I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle
> > > > can wreck things.  Undoubtedly doing CFG manipulations is not going
> > > > to work since CFG layout does not respect CFG RTL restrictions.
> > >
> > > Doing CFG manipulations on CFG RTL mode directly is almost impossible
> > > to do correctly.
> > >
> > > For example, bb-reorder.  Which is a much more important optimisation
> > > than partitioning, btw.
> >
> > BB reorder switches back and forth as well ... :/
>
> Yes.  It is extremely hard to change any jumps in cfgrtl mode.
>
> The goal is to use cfglayout mode more, ideally *always*, not to use
> it less!

Sure!  It must be that split requires CFG RTL (it doesn't say so, but
we don't have a PROP_rtllayout or a "cannot work with" set of
properties).  Otherwise I'm not sure why we go out of CFG layout mode
so early.  Passes not messing with the CFG at all should be ignorant
of what mode we are in.

> > I think both are closely enough related that we probably should do
> > partitioning from within the same framework?  OTOH BB reorder
> > happens _much_ later.
>
> This may be doable.  What we need to do first though is to find a
> better thing to use than EDGE_CROSSING.
>
> Maybe we can determine which blocks should be hot and cold early, but
> actually make that happen only very late (maybe adjusting the decision
> if we have to, to make things work)?  That way, intervening passes do
> not have to care (much).

The question is whether we can even preserve things like BB counts
and edge frequencies once we've gone out of CFG layout mode for once.

> > > > Partitioning simply uncovered latent bugs, there's nothing wrong
> > > > with it IMHO.
> > >
> > > I don't agree.  The whole way EDGE_CROSSING works hinders all other
> > > optimisations a lot.
> >
> > I'm not sure if it's there for correctness reasons or just to
> > help checking that nothing "undoes" the partitioning decision.
>
> The latter.  (It may also make it easier for the former, of course,
> but that can be solved anyway).  And that makes life very hard for all
> later passes, while it is doubtful that it even gives the best
> decisions: for example it prevents a lot of shrink-wrapping, which you
> dearly *want* to do on cold code!

Sure.  So do that earlier ;)

Richard.
On Thu, Feb 13, 2020 at 08:48:33AM +0100, Richard Biener wrote:
> On Wed, 12 Feb 2020, Segher Boessenkool wrote:
> > On Wed, Feb 12, 2020 at 11:53:22AM +0100, Richard Biener wrote:
> > > BB reorder switches back and forth as well ... :/
> >
> > Yes.  It is extremely hard to change any jumps in cfgrtl mode.
> >
> > The goal is to use cfglayout mode more, ideally *always*, not to use
> > it less!
>
> Sure!  It must be that split requires CFG RTL (it doesn't say so, but
> we don't have a PROP_rtllayout or a "cannot work with" set of
> properties).  Otherwise I'm not sure why we go out of CFG layout mode
> so early.  Passes not messing with the CFG at all should be ignorant
> of what mode we are in.

Well, split can split jump insns as well, so it currently cannot use
cfglayout mode.  We could probably make split not split jumps, and only
do that in split2 and later, which is after reload.

Getting IRA and LRA to work with cfglayout mode will be fun.  Right
after split2 is pro_and_epilogue.  Everything after that doesn't want
cfglayout mode, I'd say.  So we don't even have to move out_of_cfglayout
all that far, but there are some big nasties it has to move over.

> > > I think both are closely enough related that we probably should do
> > > partitioning from within the same framework?  OTOH BB reorder
> > > happens _much_ later.
> >
> > This may be doable.  What we need to do first though is to find a
> > better thing to use than EDGE_CROSSING.
> >
> > Maybe we can determine which blocks should be hot and cold early, but
> > actually make that happen only very late (maybe adjusting the decision
> > if we have to, to make things work)?  That way, intervening passes do
> > not have to care (much).
>
> The question is whether we can even preserve things like BB counts
> and edge frequencies once we've gone out of CFG layout mode for once.

Why not?  cfglayout mode simply does not include simple branches and
jump instructions, and fallthrough is not necessarily to the next bb
in the insn stream, but that is all (basically).  The cfg is just the
same in both rtl modes.  What destroys the bb and edge frequencies?

> > > > I don't agree.  The whole way EDGE_CROSSING works hinders all other
> > > > optimisations a lot.
> > >
> > > I'm not sure if it's there for correctness reasons or just to
> > > help checking that nothing "undoes" the partitioning decision.
> >
> > The latter.  (It may also make it easier for the former, of course,
> > but that can be solved anyway).  And that makes life very hard for all
> > later passes, while it is doubtful that it even gives the best
> > decisions: for example it prevents a lot of shrink-wrapping, which you
> > dearly *want* to do on cold code!
>
> Sure.  So do that earlier ;)

You cannot.  It has to stay after RA, obviously.  And you cannot move
RA earlier, to before combine.  We do shrink-wrapping (and all other
*logue work) pretty much as early as possible already.


Segher
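[Editorial note: the "Why not? The cfg is just the same in both rtl modes"
point above can be illustrated with a toy model. This is hypothetical
illustration, not GCC code; `linearize` loosely stands in for what
out_of_cfglayout does, and the data layout is invented.]

```python
# Toy model: a CFG with edge frequencies, a "linearize" step that adds
# explicit jumps wherever the successor is not the next block in the
# chosen order, and a rebuild showing the graph (counts, frequencies)
# survives the round trip untouched.

class CFG:
    def __init__(self):
        self.edges = {}   # (src, dst) -> edge frequency
        self.counts = {}  # block -> execution count (sum of outgoing freqs)

    def add_edge(self, src, dst, freq):
        self.edges[(src, dst)] = freq
        self.counts[src] = self.counts.get(src, 0) + freq

def linearize(cfg, order):
    """Emit an instruction-stream view: an explicit jump wherever the
    destination is not the next block in `order`, fallthrough otherwise."""
    pos = {bb: i for i, bb in enumerate(order)}
    return [(src, dst,
             "fallthru" if pos[dst] == pos[src] + 1 else "jump",
             freq)
            for (src, dst), freq in sorted(cfg.edges.items())]

def rebuild(stream):
    """Reconstruct the CFG from the stream: same edges, same frequencies,
    regardless of which block order was used to linearize."""
    new = CFG()
    for src, dst, _kind, freq in stream:
        new.add_edge(src, dst, freq)
    return new
```

Only the jump/fallthru classification depends on the block order; the edge
set and its frequencies do not, which is the argument that leaving cfglayout
mode need not destroy BB counts and edge frequencies.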