Message ID | 67c3fd27-e70b-ff39-0fe2-0d419bd74938@redhat.com |
---|---|
State | New |
Headers | show |
On Mon, Nov 28, 2016 at 10:23 PM, Jeff Law <law@redhat.com> wrote: > > > I was digging into issues around the patches for 78120 when I stumbled upon > undesirable bb copying in bb-reorder.c on the m68k. > > The core issue is that the m68k does not define a length attribute and > therefore generic code assumes that the length of all insns is 0 bytes. What other targets behave like this? > That in turn makes bb-reorder think it is infinitely cheap to copy basic > blocks. In the two codebases I looked at (GCC's runtime libraries and > newlib) this leads to a 10% and 15% undesirable increase in code size. > > I've taken a slight variant of this patch and bootstrapped/regression tested > it on x86_64-linux-gnu to verify sanity as well as built the m68k target > libraries noted above. > > OK for the trunk? I wonder if it isn't better to default to a length of 1 instead of zero when there is no length attribute. There are more users of the length attribute in bb-reorder.c (and elsewhere as well I suppose). from get_attr_length_1 it looks like a "cheap" target dependent way would be to define ADJUST_INSN_LENGTH ... Richard. > Jeff > > * bb-reorder.c (copy_bb_p): Sanely handle case where the target > has not defined length attributes for its insns. > > diff --git a/gcc/bb-reorder.c b/gcc/bb-reorder.c > index 6873b4f..0b8d1d9 100644 > --- a/gcc/bb-reorder.c > +++ b/gcc/bb-reorder.c > @@ -115,6 +115,7 @@ > #include "bb-reorder.h" > #include "except.h" > #include "fibonacci_heap.h" > +#include "insn-attr.h" > > /* The number of rounds. In most cases there will only be 4 rounds, but > when partitioning hot and cold basic blocks into separate sections of > @@ -1355,6 +1356,9 @@ copy_bb_p (const_basic_block bb, int code_may_grow) > int max_size = uncond_jump_length; > rtx_insn *insn; > > + if (!HAVE_ATTR_length) > + return false; > + > if (!bb->frequency) > return false; > if (EDGE_COUNT (bb->preds) < 2) >
On 11/29/2016 03:23 AM, Richard Biener wrote: > On Mon, Nov 28, 2016 at 10:23 PM, Jeff Law <law@redhat.com> wrote: >> >> >> I was digging into issues around the patches for 78120 when I stumbled upon >> undesirable bb copying in bb-reorder.c on the m68k. >> >> The core issue is that the m68k does not define a length attribute and >> therefore generic code assumes that the length of all insns is 0 bytes. > > What other targets behave like this? ft32, nvptx, mmix, mn10300, m68k, c6x, rl78, vax, ia64, m32c cris has a hack to define a length, even though no attempt is made to make it accurate. The hack specifically calls out that it's to make bb-reorder happy. > >> That in turn makes bb-reorder think it is infinitely cheap to copy basic >> blocks. In the two codebases I looked at (GCC's runtime libraries and >> newlib) this leads to a 10% and 15% undesirable increase in code size. >> >> I've taken a slight variant of this patch and bootstrapped/regression tested >> it on x86_64-linux-gnu to verify sanity as well as built the m68k target >> libraries noted above. >> >> OK for the trunk? > > I wonder if it isn't better to default to a length of 1 instead of zero when > there is no length attribute. There are more users of the length attribute > in bb-reorder.c (and elsewhere as well I suppose). I pondered that as well, but felt it was riskier given we've had a default length of 0 for ports that don't define lengths since the early 90s. It's certainly easy enough to change that default if you'd prefer. I don't have a strong preference either way. Jeff
On Tue, Nov 29, 2016 at 5:07 PM, Jeff Law <law@redhat.com> wrote: > On 11/29/2016 03:23 AM, Richard Biener wrote: >> >> On Mon, Nov 28, 2016 at 10:23 PM, Jeff Law <law@redhat.com> wrote: >>> >>> >>> >>> I was digging into issues around the patches for 78120 when I stumbled >>> upon >>> undesirable bb copying in bb-reorder.c on the m68k. >>> >>> The core issue is that the m68k does not define a length attribute and >>> therefore generic code assumes that the length of all insns is 0 bytes. >> >> >> What other targets behave like this? > > ft32, nvptx, mmix, mn10300, m68k, c6x, rl78, vax, ia64, m32c Ok. > cris has a hack to define a length, even though no attempt is made to make > it accurate. The hack specifically calls out that it's to make bb-reorder > happy. > >> >>> That in turn makes bb-reorder think it is infinitely cheap to copy basic >>> blocks. In the two codebases I looked at (GCC's runtime libraries and >>> newlib) this leads to a 10% and 15% undesirable increase in code size. >>> >>> I've taken a slight variant of this patch and bootstrapped/regression >>> tested >>> it on x86_64-linux-gnu to verify sanity as well as built the m68k target >>> libraries noted above. >>> >>> OK for the trunk? >> >> >> I wonder if it isn't better to default to a length of 1 instead of zero >> when >> there is no length attribute. There are more users of the length >> attribute >> in bb-reorder.c (and elsewhere as well I suppose). > > I pondered that as well, but felt it was riskier given we've had a default > length of 0 for ports that don't define lengths since the early 90s. It's > certainly easy enough to change that default if you'd prefer. I don't have > a strong preference either way. Thinking about this again maybe targets w/o insn-length should simply always use the 'simple' algorithm instead of the STV one? At least that might be what your change effectively does in some way? Richard. > Jeff
On 11/30/2016 01:38 AM, Richard Biener wrote: > On Tue, Nov 29, 2016 at 5:07 PM, Jeff Law <law@redhat.com> wrote: >> On 11/29/2016 03:23 AM, Richard Biener wrote: >>> >>> On Mon, Nov 28, 2016 at 10:23 PM, Jeff Law <law@redhat.com> wrote: >>>> >>>> >>>> >>>> I was digging into issues around the patches for 78120 when I stumbled >>>> upon >>>> undesirable bb copying in bb-reorder.c on the m68k. >>>> >>>> The core issue is that the m68k does not define a length attribute and >>>> therefore generic code assumes that the length of all insns is 0 bytes. >>> >>> >>> What other targets behave like this? >> >> ft32, nvptx, mmix, mn10300, m68k, c6x, rl78, vax, ia64, m32c > > Ok. > >> cris has a hack to define a length, even though no attempt is made to make >> it accurate. The hack specifically calls out that it's to make bb-reorder >> happy. >> >>> >>>> That in turn makes bb-reorder think it is infinitely cheap to copy basic >>>> blocks. In the two codebases I looked at (GCC's runtime libraries and >>>> newlib) this leads to a 10% and 15% undesirable increase in code size. >>>> >>>> I've taken a slight variant of this patch and bootstrapped/regression >>>> tested >>>> it on x86_64-linux-gnu to verify sanity as well as built the m68k target >>>> libraries noted above. >>>> >>>> OK for the trunk? >>> >>> >>> I wonder if it isn't better to default to a length of 1 instead of zero >>> when >>> there is no length attribute. There are more users of the length >>> attribute >>> in bb-reorder.c (and elsewhere as well I suppose). >> >> I pondered that as well, but felt it was riskier given we've had a default >> length of 0 for ports that don't define lengths since the early 90s. It's >> certainly easy enough to change that default if you'd prefer. I don't have >> a strong preference either way. > > Thinking about this again maybe targets w/o insn-length should simply > always use the 'simple' algorithm instead of the STV one? At least that > might be what your change effectively does in some way? From reading the comments I don't think STC will collapse down into the simple algorithm if block copying is disabled. But Segher would know for sure. WRT the choice of simple vs STC, I doubt it matters much for the processors in question. JEff
On Wed, Nov 30, 2016 at 6:59 PM, Jeff Law <law@redhat.com> wrote: > On 11/30/2016 01:38 AM, Richard Biener wrote: >> >> On Tue, Nov 29, 2016 at 5:07 PM, Jeff Law <law@redhat.com> wrote: >>> >>> On 11/29/2016 03:23 AM, Richard Biener wrote: >>>> >>>> >>>> On Mon, Nov 28, 2016 at 10:23 PM, Jeff Law <law@redhat.com> wrote: >>>>> >>>>> >>>>> >>>>> >>>>> I was digging into issues around the patches for 78120 when I stumbled >>>>> upon >>>>> undesirable bb copying in bb-reorder.c on the m68k. >>>>> >>>>> The core issue is that the m68k does not define a length attribute and >>>>> therefore generic code assumes that the length of all insns is 0 bytes. >>>> >>>> >>>> >>>> What other targets behave like this? >>> >>> >>> ft32, nvptx, mmix, mn10300, m68k, c6x, rl78, vax, ia64, m32c >> >> >> Ok. >> >>> cris has a hack to define a length, even though no attempt is made to >>> make >>> it accurate. The hack specifically calls out that it's to make >>> bb-reorder >>> happy. >>> >>>> >>>>> That in turn makes bb-reorder think it is infinitely cheap to copy >>>>> basic >>>>> blocks. In the two codebases I looked at (GCC's runtime libraries and >>>>> newlib) this leads to a 10% and 15% undesirable increase in code size. >>>>> >>>>> I've taken a slight variant of this patch and bootstrapped/regression >>>>> tested >>>>> it on x86_64-linux-gnu to verify sanity as well as built the m68k >>>>> target >>>>> libraries noted above. >>>>> >>>>> OK for the trunk? >>>> >>>> >>>> >>>> I wonder if it isn't better to default to a length of 1 instead of zero >>>> when >>>> there is no length attribute. There are more users of the length >>>> attribute >>>> in bb-reorder.c (and elsewhere as well I suppose). >>> >>> >>> I pondered that as well, but felt it was riskier given we've had a >>> default >>> length of 0 for ports that don't define lengths since the early 90s. >>> It's >>> certainly easy enough to change that default if you'd prefer. I don't >>> have >>> a strong preference either way. >> >> >> Thinking about this again maybe targets w/o insn-length should simply >> always use the 'simple' algorithm instead of the STV one? At least that >> might be what your change effectively does in some way? > > From reading the comments I don't think STC will collapse down into the > simple algorithm if block copying is disabled. But Segher would know for > sure. > > WRT the choice of simple vs STC, I doubt it matters much for the processors > in question. I guess STC doesn't make much sense if we can't say anything about BB sizes. Richard. > > JEff
On Thu, Dec 01, 2016 at 10:19:42AM +0100, Richard Biener wrote: > >> Thinking about this again maybe targets w/o insn-length should simply > >> always use the 'simple' algorithm instead of the STV one? At least that > >> might be what your change effectively does in some way? > > > > From reading the comments I don't think STC will collapse down into the > > simple algorithm if block copying is disabled. But Segher would know for > > sure. > > > > WRT the choice of simple vs STC, I doubt it matters much for the processors > > in question. > > I guess STC doesn't make much sense if we can't say anything about BB sizes. STC tries to make as large as possible consecutive "traces", mainly to help with instruction cache utilization and hit rate etc. It cannot do a very good job if it isn't allowed to copy blocks. "simple" tries to (dynamically) have as many fall-throughs as possible, i.e. as few jumps as possible. It never copies code; if that means it has to jump every second insn, so be it. It provably is within a factor three of optimal (optimal is NP-hard), under a really weak assumption within a factor two, and it does better than that in practice. STC without block copying makes longer traces which is not a good idea for most architectures, only for those that have a short jump that is much shorter than longer jumps (and thus, cannot cover many jump targets). I do not know how STC behaves when it does not know the insn lengths. Segher
On 12/01/2016 05:04 AM, Segher Boessenkool wrote: > On Thu, Dec 01, 2016 at 10:19:42AM +0100, Richard Biener wrote: >>>> Thinking about this again maybe targets w/o insn-length should simply >>>> always use the 'simple' algorithm instead of the STV one? At least that >>>> might be what your change effectively does in some way? >>> >>> From reading the comments I don't think STC will collapse down into the >>> simple algorithm if block copying is disabled. But Segher would know for >>> sure. >>> >>> WRT the choice of simple vs STC, I doubt it matters much for the processors >>> in question. >> >> I guess STC doesn't make much sense if we can't say anything about BB sizes. > > STC tries to make as large as possible consecutive "traces", mainly to > help with instruction cache utilization and hit rate etc. It cannot do > a very good job if it isn't allowed to copy blocks. > > "simple" tries to (dynamically) have as many fall-throughs as possible, > i.e. as few jumps as possible. It never copies code; if that means it > has to jump every second insn, so be it. It provably is within a factor > three of optimal (optimal is NP-hard), under a really weak assumption > within a factor two, and it does better than that in practice. > > STC without block copying makes longer traces which is not a good idea > for most architectures, only for those that have a short jump that is > much shorter than longer jumps (and thus, cannot cover many jump > targets). > > I do not know how STC behaves when it does not know the insn lengths. mn103 & m68k are definitely sensitive to jump distances, the former more so than the latter. Some of the others probably are as well. I think we've probably discussed this more than is really necessary. We just need to pick an alternative and go with it, I think either alternative is reasonable (avoid copying when STC has no length information or fall back to simple when there is no length information). jeff
On Thu, Dec 1, 2016 at 6:28 PM, Jeff Law <law@redhat.com> wrote: > On 12/01/2016 05:04 AM, Segher Boessenkool wrote: >> >> On Thu, Dec 01, 2016 at 10:19:42AM +0100, Richard Biener wrote: >>>>> >>>>> Thinking about this again maybe targets w/o insn-length should simply >>>>> always use the 'simple' algorithm instead of the STV one? At least >>>>> that >>>>> might be what your change effectively does in some way? >>>> >>>> >>>> From reading the comments I don't think STC will collapse down into the >>>> simple algorithm if block copying is disabled. But Segher would know >>>> for >>>> sure. >>>> >>>> WRT the choice of simple vs STC, I doubt it matters much for the >>>> processors >>>> in question. >>> >>> >>> I guess STC doesn't make much sense if we can't say anything about BB >>> sizes. >> >> >> STC tries to make as large as possible consecutive "traces", mainly to >> help with instruction cache utilization and hit rate etc. It cannot do >> a very good job if it isn't allowed to copy blocks. >> >> "simple" tries to (dynamically) have as many fall-throughs as possible, >> i.e. as few jumps as possible. I hope it special-cases backedges, that is, not make those fallthru. >> It never copies code; if that means it >> has to jump every second insn, so be it. It provably is within a factor >> three of optimal (optimal is NP-hard), under a really weak assumption >> within a factor two, and it does better than that in practice. >> >> STC without block copying makes longer traces which is not a good idea >> for most architectures, only for those that have a short jump that is >> much shorter than longer jumps (and thus, cannot cover many jump >> targets). >> >> I do not know how STC behaves when it does not know the insn lengths. > > mn103 & m68k are definitely sensitive to jump distances, the former more so > than the latter. Some of the others probably are as well. > > I think we've probably discussed this more than is really necessary. We > just need to pick an alternative and go with it, I think either alternative > is reasonable (avoid copying when STC has no length information or fall back > to simple when there is no length information). The description of both algorithms doesn't make it an obvious pick. So lets leave the decision of the algorithm to the target and make STC beave sane with no length information (whether that means disallowing any copying or substituting a "default" length is another question -- but obviously it's the targets fault to not provide lenght information). Richard. > > > jeff
On Fri, Dec 02, 2016 at 09:47:00AM +0100, Richard Biener wrote: > >> STC tries to make as large as possible consecutive "traces", mainly to > >> help with instruction cache utilization and hit rate etc. It cannot do > >> a very good job if it isn't allowed to copy blocks. > >> > >> "simple" tries to (dynamically) have as many fall-throughs as possible, > >> i.e. as few jumps as possible. > > I hope it special-cases backedges, that is, not make those fallthru. It doesn't, and that is a good thing. Firstly, what is classified as backedge doesn't make too much sense in some cases; quite often the cfg isn't reducible. Secondly, for simple loops it does not matter: the backedge has lower frequency than the forward edges, so everything works out as you want. Thirdly, consider this loop: | S<---- | | A | / \ | B C | \ / | D | | | E----- | F where the backedge now has higher frequency than A->B and A->C. With simple this ends up as: S: ...; goto A D: ... E: ...; if (not again) goto F S: ... A: ...; if () goto C B: ...; goto D C: ...; goto D and this is cheaper than the alternative: S: ... A: ...; if () goto C B: ... D: ... E: ...; if (again) goto A F: C: ...; goto D If a fraction f of the loop iterations does C (so 1-f does B), the alternative has 1+2f jumps per iteration; the code simple makes has 1+f. > > I think we've probably discussed this more than is really necessary. We > > just need to pick an alternative and go with it, I think either alternative > > is reasonable (avoid copying when STC has no length information or fall back > > to simple when there is no length information). > > The description of both algorithms doesn't make it an obvious pick. So lets > leave the decision of the algorithm to the target and make STC beave sane > with no length information (whether that means disallowing any copying or > substituting a "default" length is another question -- but obviously it's the > targets fault to not provide lenght information). I agree. Segher
On 12/02/2016 03:22 PM, Segher Boessenkool wrote: > On Fri, Dec 02, 2016 at 09:47:00AM +0100, Richard Biener wrote: >>>> STC tries to make as large as possible consecutive "traces", mainly to >>>> help with instruction cache utilization and hit rate etc. It cannot do >>>> a very good job if it isn't allowed to copy blocks. >>>> >>>> "simple" tries to (dynamically) have as many fall-throughs as possible, >>>> i.e. as few jumps as possible. >> >> I hope it special-cases backedges, that is, not make those fallthru. > > It doesn't, and that is a good thing. > > Firstly, what is classified as backedge doesn't make too much sense in > some cases; quite often the cfg isn't reducible. > > Secondly, for simple loops it does not matter: the backedge has lower > frequency than the forward edges, so everything works out as you want. > > Thirdly, consider this loop: > > | > S<---- > | | > A | > / \ | > B C | > \ / | > D | > | | > E----- > | > F > > where the backedge now has higher frequency than A->B and A->C. > With simple this ends up as: > > S: ...; goto A > > D: ... > E: ...; if (not again) goto F > S: ... > A: ...; if () goto C > B: ...; goto D > > C: ...; goto D > > and this is cheaper than the alternative: > > S: ... > A: ...; if () goto C > B: ... > D: ... > E: ...; if (again) goto A > F: > > C: ...; goto D > > If a fraction f of the loop iterations does C (so 1-f does B), the > alternative has 1+2f jumps per iteration; the code simple makes has 1+f. In fact, the latter can be particularly bad with small icaches. The first sequence is far better. IIRC there were spec92 cases there getting this kind of thing right made a measurable difference on embedded targets. Jeff
diff --git a/gcc/bb-reorder.c b/gcc/bb-reorder.c index 6873b4f..0b8d1d9 100644 --- a/gcc/bb-reorder.c +++ b/gcc/bb-reorder.c @@ -115,6 +115,7 @@ #include "bb-reorder.h" #include "except.h" #include "fibonacci_heap.h" +#include "insn-attr.h" /* The number of rounds. In most cases there will only be 4 rounds, but when partitioning hot and cold basic blocks into separate sections of @@ -1355,6 +1356,9 @@ copy_bb_p (const_basic_block bb, int code_may_grow) int max_size = uncond_jump_length; rtx_insn *insn; + if (!HAVE_ATTR_length) + return false; + if (!bb->frequency) return false; if (EDGE_COUNT (bb->preds) < 2)