diff mbox

[RFA] Handle target with no length attributes sanely in bb-reorder.c

Message ID 67c3fd27-e70b-ff39-0fe2-0d419bd74938@redhat.com
State New
Headers show

Commit Message

Jeff Law Nov. 28, 2016, 9:23 p.m. UTC
I was digging into  issues around the patches for 78120 when I stumbled 
upon undesirable bb copying in bb-reorder.c on the m68k.

The core issue is that the m68k does not define a length attribute and 
therefore generic code assumes that the length of all insns is 0 bytes.

That in turn makes bb-reorder think it is infinitely cheap to copy basic 
blocks.  In the two codebases I looked at (GCC's runtime libraries and 
newlib) this leads to a 10% and 15% undesirable increase in code size.

I've taken a slight variant of this patch and bootstrapped/regression 
tested it on x86_64-linux-gnu to verify sanity as well as built the m68k 
target libraries noted above.

OK for the trunk?

Jeff
* bb-reorder.c (copy_bb_p): Sanely handle case where the target
	has not defined length attributes for its insns.

Comments

Richard Biener Nov. 29, 2016, 10:23 a.m. UTC | #1
On Mon, Nov 28, 2016 at 10:23 PM, Jeff Law <law@redhat.com> wrote:
>
>
> I was digging into  issues around the patches for 78120 when I stumbled upon
> undesirable bb copying in bb-reorder.c on the m68k.
>
> The core issue is that the m68k does not define a length attribute and
> therefore generic code assumes that the length of all insns is 0 bytes.

What other targets behave like this?

> That in turn makes bb-reorder think it is infinitely cheap to copy basic
> blocks.  In the two codebases I looked at (GCC's runtime libraries and
> newlib) this leads to a 10% and 15% undesirable increase in code size.
>
> I've taken a slight variant of this patch and bootstrapped/regression tested
> it on x86_64-linux-gnu to verify sanity as well as built the m68k target
> libraries noted above.
>
> OK for the trunk?

I wonder if it isn't better to default to a length of 1 instead of zero when
there is no length attribute.  There are more users of the length attribute
in bb-reorder.c (and elsewhere as well I suppose).

from get_attr_length_1 it looks like a "cheap" target dependent way
would be to define ADJUST_INSN_LENGTH ...

Richard.

> Jeff
>
>         * bb-reorder.c (copy_bb_p): Sanely handle case where the target
>         has not defined length attributes for its insns.
>
> diff --git a/gcc/bb-reorder.c b/gcc/bb-reorder.c
> index 6873b4f..0b8d1d9 100644
> --- a/gcc/bb-reorder.c
> +++ b/gcc/bb-reorder.c
> @@ -115,6 +115,7 @@
>  #include "bb-reorder.h"
>  #include "except.h"
>  #include "fibonacci_heap.h"
> +#include "insn-attr.h"
>
>  /* The number of rounds.  In most cases there will only be 4 rounds, but
>     when partitioning hot and cold basic blocks into separate sections of
> @@ -1355,6 +1356,9 @@ copy_bb_p (const_basic_block bb, int code_may_grow)
>    int max_size = uncond_jump_length;
>    rtx_insn *insn;
>
> +  if (!HAVE_ATTR_length)
> +    return false;
> +
>    if (!bb->frequency)
>      return false;
>    if (EDGE_COUNT (bb->preds) < 2)
>
Jeff Law Nov. 29, 2016, 4:07 p.m. UTC | #2
On 11/29/2016 03:23 AM, Richard Biener wrote:
> On Mon, Nov 28, 2016 at 10:23 PM, Jeff Law <law@redhat.com> wrote:
>>
>>
>> I was digging into  issues around the patches for 78120 when I stumbled upon
>> undesirable bb copying in bb-reorder.c on the m68k.
>>
>> The core issue is that the m68k does not define a length attribute and
>> therefore generic code assumes that the length of all insns is 0 bytes.
>
> What other targets behave like this?
ft32, nvptx, mmix, mn10300, m68k, c6x, rl78, vax, ia64, m32c

cris has a hack to define a length, even though no attempt is made to 
make it accurate.  The hack specifically calls out that it's to make 
bb-reorder happy.

>
>> That in turn makes bb-reorder think it is infinitely cheap to copy basic
>> blocks.  In the two codebases I looked at (GCC's runtime libraries and
>> newlib) this leads to a 10% and 15% undesirable increase in code size.
>>
>> I've taken a slight variant of this patch and bootstrapped/regression tested
>> it on x86_64-linux-gnu to verify sanity as well as built the m68k target
>> libraries noted above.
>>
>> OK for the trunk?
>
> I wonder if it isn't better to default to a length of 1 instead of zero when
> there is no length attribute.  There are more users of the length attribute
> in bb-reorder.c (and elsewhere as well I suppose).
I pondered that as well, but felt it was riskier given we've had a 
default length of 0 for ports that don't define lengths since the early 
90s.  It's certainly easy enough to change that default if you'd prefer. 
  I don't have a strong preference either way.

Jeff
Richard Biener Nov. 30, 2016, 8:38 a.m. UTC | #3
On Tue, Nov 29, 2016 at 5:07 PM, Jeff Law <law@redhat.com> wrote:
> On 11/29/2016 03:23 AM, Richard Biener wrote:
>>
>> On Mon, Nov 28, 2016 at 10:23 PM, Jeff Law <law@redhat.com> wrote:
>>>
>>>
>>>
>>> I was digging into  issues around the patches for 78120 when I stumbled
>>> upon
>>> undesirable bb copying in bb-reorder.c on the m68k.
>>>
>>> The core issue is that the m68k does not define a length attribute and
>>> therefore generic code assumes that the length of all insns is 0 bytes.
>>
>>
>> What other targets behave like this?
>
> ft32, nvptx, mmix, mn10300, m68k, c6x, rl78, vax, ia64, m32c

Ok.

> cris has a hack to define a length, even though no attempt is made to make
> it accurate.  The hack specifically calls out that it's to make bb-reorder
> happy.
>
>>
>>> That in turn makes bb-reorder think it is infinitely cheap to copy basic
>>> blocks.  In the two codebases I looked at (GCC's runtime libraries and
>>> newlib) this leads to a 10% and 15% undesirable increase in code size.
>>>
>>> I've taken a slight variant of this patch and bootstrapped/regression
>>> tested
>>> it on x86_64-linux-gnu to verify sanity as well as built the m68k target
>>> libraries noted above.
>>>
>>> OK for the trunk?
>>
>>
>> I wonder if it isn't better to default to a length of 1 instead of zero
>> when
>> there is no length attribute.  There are more users of the length
>> attribute
>> in bb-reorder.c (and elsewhere as well I suppose).
>
> I pondered that as well, but felt it was riskier given we've had a default
> length of 0 for ports that don't define lengths since the early 90s.  It's
> certainly easy enough to change that default if you'd prefer.  I don't have
> a strong preference either way.

Thinking about this again maybe targets w/o insn-length should simply
always use the 'simple' algorithm instead of the STV one?  At least that
might be what your change effectively does in some way?

Richard.

> Jeff
Jeff Law Nov. 30, 2016, 5:59 p.m. UTC | #4
On 11/30/2016 01:38 AM, Richard Biener wrote:
> On Tue, Nov 29, 2016 at 5:07 PM, Jeff Law <law@redhat.com> wrote:
>> On 11/29/2016 03:23 AM, Richard Biener wrote:
>>>
>>> On Mon, Nov 28, 2016 at 10:23 PM, Jeff Law <law@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> I was digging into  issues around the patches for 78120 when I stumbled
>>>> upon
>>>> undesirable bb copying in bb-reorder.c on the m68k.
>>>>
>>>> The core issue is that the m68k does not define a length attribute and
>>>> therefore generic code assumes that the length of all insns is 0 bytes.
>>>
>>>
>>> What other targets behave like this?
>>
>> ft32, nvptx, mmix, mn10300, m68k, c6x, rl78, vax, ia64, m32c
>
> Ok.
>
>> cris has a hack to define a length, even though no attempt is made to make
>> it accurate.  The hack specifically calls out that it's to make bb-reorder
>> happy.
>>
>>>
>>>> That in turn makes bb-reorder think it is infinitely cheap to copy basic
>>>> blocks.  In the two codebases I looked at (GCC's runtime libraries and
>>>> newlib) this leads to a 10% and 15% undesirable increase in code size.
>>>>
>>>> I've taken a slight variant of this patch and bootstrapped/regression
>>>> tested
>>>> it on x86_64-linux-gnu to verify sanity as well as built the m68k target
>>>> libraries noted above.
>>>>
>>>> OK for the trunk?
>>>
>>>
>>> I wonder if it isn't better to default to a length of 1 instead of zero
>>> when
>>> there is no length attribute.  There are more users of the length
>>> attribute
>>> in bb-reorder.c (and elsewhere as well I suppose).
>>
>> I pondered that as well, but felt it was riskier given we've had a default
>> length of 0 for ports that don't define lengths since the early 90s.  It's
>> certainly easy enough to change that default if you'd prefer.  I don't have
>> a strong preference either way.
>
> Thinking about this again maybe targets w/o insn-length should simply
> always use the 'simple' algorithm instead of the STV one?  At least that
> might be what your change effectively does in some way?
 From reading the comments I don't think STC will collapse down into the 
simple algorithm if block copying is disabled.  But Segher would know 
for sure.

WRT the choice of simple vs STC, I doubt it matters much for the 
processors in question.

JEff
Richard Biener Dec. 1, 2016, 9:19 a.m. UTC | #5
On Wed, Nov 30, 2016 at 6:59 PM, Jeff Law <law@redhat.com> wrote:
> On 11/30/2016 01:38 AM, Richard Biener wrote:
>>
>> On Tue, Nov 29, 2016 at 5:07 PM, Jeff Law <law@redhat.com> wrote:
>>>
>>> On 11/29/2016 03:23 AM, Richard Biener wrote:
>>>>
>>>>
>>>> On Mon, Nov 28, 2016 at 10:23 PM, Jeff Law <law@redhat.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> I was digging into  issues around the patches for 78120 when I stumbled
>>>>> upon
>>>>> undesirable bb copying in bb-reorder.c on the m68k.
>>>>>
>>>>> The core issue is that the m68k does not define a length attribute and
>>>>> therefore generic code assumes that the length of all insns is 0 bytes.
>>>>
>>>>
>>>>
>>>> What other targets behave like this?
>>>
>>>
>>> ft32, nvptx, mmix, mn10300, m68k, c6x, rl78, vax, ia64, m32c
>>
>>
>> Ok.
>>
>>> cris has a hack to define a length, even though no attempt is made to
>>> make
>>> it accurate.  The hack specifically calls out that it's to make
>>> bb-reorder
>>> happy.
>>>
>>>>
>>>>> That in turn makes bb-reorder think it is infinitely cheap to copy
>>>>> basic
>>>>> blocks.  In the two codebases I looked at (GCC's runtime libraries and
>>>>> newlib) this leads to a 10% and 15% undesirable increase in code size.
>>>>>
>>>>> I've taken a slight variant of this patch and bootstrapped/regression
>>>>> tested
>>>>> it on x86_64-linux-gnu to verify sanity as well as built the m68k
>>>>> target
>>>>> libraries noted above.
>>>>>
>>>>> OK for the trunk?
>>>>
>>>>
>>>>
>>>> I wonder if it isn't better to default to a length of 1 instead of zero
>>>> when
>>>> there is no length attribute.  There are more users of the length
>>>> attribute
>>>> in bb-reorder.c (and elsewhere as well I suppose).
>>>
>>>
>>> I pondered that as well, but felt it was riskier given we've had a
>>> default
>>> length of 0 for ports that don't define lengths since the early 90s.
>>> It's
>>> certainly easy enough to change that default if you'd prefer.  I don't
>>> have
>>> a strong preference either way.
>>
>>
>> Thinking about this again maybe targets w/o insn-length should simply
>> always use the 'simple' algorithm instead of the STV one?  At least that
>> might be what your change effectively does in some way?
>
> From reading the comments I don't think STC will collapse down into the
> simple algorithm if block copying is disabled.  But Segher would know for
> sure.
>
> WRT the choice of simple vs STC, I doubt it matters much for the processors
> in question.

I guess STC doesn't make much sense if we can't say anything about BB sizes.

Richard.

>
> JEff
Segher Boessenkool Dec. 1, 2016, 12:04 p.m. UTC | #6
On Thu, Dec 01, 2016 at 10:19:42AM +0100, Richard Biener wrote:
> >> Thinking about this again maybe targets w/o insn-length should simply
> >> always use the 'simple' algorithm instead of the STV one?  At least that
> >> might be what your change effectively does in some way?
> >
> > From reading the comments I don't think STC will collapse down into the
> > simple algorithm if block copying is disabled.  But Segher would know for
> > sure.
> >
> > WRT the choice of simple vs STC, I doubt it matters much for the processors
> > in question.
> 
> I guess STC doesn't make much sense if we can't say anything about BB sizes.

STC tries to make as large as possible consecutive "traces", mainly to
help with instruction cache utilization and hit rate etc.  It cannot do
a very good job if it isn't allowed to copy blocks.

"simple" tries to (dynamically) have as many fall-throughs as possible,
i.e. as few jumps as possible.  It never copies code; if that means it
has to jump every second insn, so be it.  It provably is within a factor
three of optimal (optimal is NP-hard), under a really weak assumption
within a factor two, and it does better than that in practice.

STC without block copying makes longer traces which is not a good idea
for most architectures, only for those that have a short jump that is
much shorter than longer jumps (and thus, cannot cover many jump
targets).

I do not know how STC behaves when it does not know the insn lengths.


Segher
Jeff Law Dec. 1, 2016, 5:28 p.m. UTC | #7
On 12/01/2016 05:04 AM, Segher Boessenkool wrote:
> On Thu, Dec 01, 2016 at 10:19:42AM +0100, Richard Biener wrote:
>>>> Thinking about this again maybe targets w/o insn-length should simply
>>>> always use the 'simple' algorithm instead of the STV one?  At least that
>>>> might be what your change effectively does in some way?
>>>
>>> From reading the comments I don't think STC will collapse down into the
>>> simple algorithm if block copying is disabled.  But Segher would know for
>>> sure.
>>>
>>> WRT the choice of simple vs STC, I doubt it matters much for the processors
>>> in question.
>>
>> I guess STC doesn't make much sense if we can't say anything about BB sizes.
>
> STC tries to make as large as possible consecutive "traces", mainly to
> help with instruction cache utilization and hit rate etc.  It cannot do
> a very good job if it isn't allowed to copy blocks.
>
> "simple" tries to (dynamically) have as many fall-throughs as possible,
> i.e. as few jumps as possible.  It never copies code; if that means it
> has to jump every second insn, so be it.  It provably is within a factor
> three of optimal (optimal is NP-hard), under a really weak assumption
> within a factor two, and it does better than that in practice.
>
> STC without block copying makes longer traces which is not a good idea
> for most architectures, only for those that have a short jump that is
> much shorter than longer jumps (and thus, cannot cover many jump
> targets).
>
> I do not know how STC behaves when it does not know the insn lengths.
mn103 &  m68k are definitely sensitive to jump distances, the former 
more so than the latter.  Some of the others probably are as well.

I think we've probably discussed this more than is really necessary.  We 
just need to pick an alternative and go with it, I think either 
alternative is reasonable (avoid copying when STC has no length 
information or fall back to simple when there is no length information).



jeff
Richard Biener Dec. 2, 2016, 8:47 a.m. UTC | #8
On Thu, Dec 1, 2016 at 6:28 PM, Jeff Law <law@redhat.com> wrote:
> On 12/01/2016 05:04 AM, Segher Boessenkool wrote:
>>
>> On Thu, Dec 01, 2016 at 10:19:42AM +0100, Richard Biener wrote:
>>>>>
>>>>> Thinking about this again maybe targets w/o insn-length should simply
>>>>> always use the 'simple' algorithm instead of the STV one?  At least
>>>>> that
>>>>> might be what your change effectively does in some way?
>>>>
>>>>
>>>> From reading the comments I don't think STC will collapse down into the
>>>> simple algorithm if block copying is disabled.  But Segher would know
>>>> for
>>>> sure.
>>>>
>>>> WRT the choice of simple vs STC, I doubt it matters much for the
>>>> processors
>>>> in question.
>>>
>>>
>>> I guess STC doesn't make much sense if we can't say anything about BB
>>> sizes.
>>
>>
>> STC tries to make as large as possible consecutive "traces", mainly to
>> help with instruction cache utilization and hit rate etc.  It cannot do
>> a very good job if it isn't allowed to copy blocks.
>>
>> "simple" tries to (dynamically) have as many fall-throughs as possible,
>> i.e. as few jumps as possible.

I hope it special-cases backedges, that is, not make those fallthru.

>>  It never copies code; if that means it
>> has to jump every second insn, so be it.  It provably is within a factor
>> three of optimal (optimal is NP-hard), under a really weak assumption
>> within a factor two, and it does better than that in practice.
>>
>> STC without block copying makes longer traces which is not a good idea
>> for most architectures, only for those that have a short jump that is
>> much shorter than longer jumps (and thus, cannot cover many jump
>> targets).
>>
>> I do not know how STC behaves when it does not know the insn lengths.
>
> mn103 &  m68k are definitely sensitive to jump distances, the former more so
> than the latter.  Some of the others probably are as well.
>
> I think we've probably discussed this more than is really necessary.  We
> just need to pick an alternative and go with it, I think either alternative
> is reasonable (avoid copying when STC has no length information or fall back
> to simple when there is no length information).

The description of both algorithms doesn't make it an obvious pick.  So lets
leave the decision of the algorithm to the target and make STC beave sane
with no length information (whether that means disallowing any copying or
substituting a "default" length is another question -- but obviously it's the
targets fault to not provide lenght information).

Richard.

>
>
> jeff
Segher Boessenkool Dec. 2, 2016, 10:22 p.m. UTC | #9
On Fri, Dec 02, 2016 at 09:47:00AM +0100, Richard Biener wrote:
> >> STC tries to make as large as possible consecutive "traces", mainly to
> >> help with instruction cache utilization and hit rate etc.  It cannot do
> >> a very good job if it isn't allowed to copy blocks.
> >>
> >> "simple" tries to (dynamically) have as many fall-throughs as possible,
> >> i.e. as few jumps as possible.
> 
> I hope it special-cases backedges, that is, not make those fallthru.

It doesn't, and that is a good thing.

Firstly, what is classified as backedge doesn't make too much sense in
some cases; quite often the cfg isn't reducible.

Secondly, for simple loops it does not matter: the backedge has lower
frequency than the forward edges, so everything works out as you want.

Thirdly, consider this loop:

    |
    S<----
    |     |
    A     |
   / \    |
  B   C   |
   \ /    |
    D     |
    |     |
    E-----
    |
    F

where the backedge now has higher frequency than A->B and A->C.
With simple this ends up as:

S: ...; goto A

D: ...
E: ...; if (not again) goto F
S: ...
A: ...; if () goto C
B: ...; goto D

C: ...; goto D

and this is cheaper than the alternative:

S: ...
A: ...; if () goto C
B: ...
D: ...
E: ...; if (again) goto A
F:

C: ...; goto D

If a fraction f of the loop iterations does C (so 1-f does B), the
alternative has 1+2f jumps per iteration; the code simple makes has 1+f.

> > I think we've probably discussed this more than is really necessary.  We
> > just need to pick an alternative and go with it, I think either alternative
> > is reasonable (avoid copying when STC has no length information or fall back
> > to simple when there is no length information).
> 
> The description of both algorithms doesn't make it an obvious pick.  So lets
> leave the decision of the algorithm to the target and make STC beave sane
> with no length information (whether that means disallowing any copying or
> substituting a "default" length is another question -- but obviously it's the
> targets fault to not provide lenght information).

I agree.


Segher
Jeff Law Dec. 3, 2016, 2:20 a.m. UTC | #10
On 12/02/2016 03:22 PM, Segher Boessenkool wrote:
> On Fri, Dec 02, 2016 at 09:47:00AM +0100, Richard Biener wrote:
>>>> STC tries to make as large as possible consecutive "traces", mainly to
>>>> help with instruction cache utilization and hit rate etc.  It cannot do
>>>> a very good job if it isn't allowed to copy blocks.
>>>>
>>>> "simple" tries to (dynamically) have as many fall-throughs as possible,
>>>> i.e. as few jumps as possible.
>>
>> I hope it special-cases backedges, that is, not make those fallthru.
>
> It doesn't, and that is a good thing.
>
> Firstly, what is classified as backedge doesn't make too much sense in
> some cases; quite often the cfg isn't reducible.
>
> Secondly, for simple loops it does not matter: the backedge has lower
> frequency than the forward edges, so everything works out as you want.
>
> Thirdly, consider this loop:
>
>     |
>     S<----
>     |     |
>     A     |
>    / \    |
>   B   C   |
>    \ /    |
>     D     |
>     |     |
>     E-----
>     |
>     F
>
> where the backedge now has higher frequency than A->B and A->C.
> With simple this ends up as:
>
> S: ...; goto A
>
> D: ...
> E: ...; if (not again) goto F
> S: ...
> A: ...; if () goto C
> B: ...; goto D
>
> C: ...; goto D
>
> and this is cheaper than the alternative:
>
> S: ...
> A: ...; if () goto C
> B: ...
> D: ...
> E: ...; if (again) goto A
> F:
>
> C: ...; goto D
>
> If a fraction f of the loop iterations does C (so 1-f does B), the
> alternative has 1+2f jumps per iteration; the code simple makes has 1+f.
In fact, the latter can be particularly bad with small icaches.  The 
first sequence is far better.  IIRC there were spec92 cases there 
getting this kind of thing right made a measurable difference on 
embedded targets.

Jeff
diff mbox

Patch

diff --git a/gcc/bb-reorder.c b/gcc/bb-reorder.c
index 6873b4f..0b8d1d9 100644
--- a/gcc/bb-reorder.c
+++ b/gcc/bb-reorder.c
@@ -115,6 +115,7 @@ 
 #include "bb-reorder.h"
 #include "except.h"
 #include "fibonacci_heap.h"
+#include "insn-attr.h"
 
 /* The number of rounds.  In most cases there will only be 4 rounds, but
    when partitioning hot and cold basic blocks into separate sections of
@@ -1355,6 +1356,9 @@  copy_bb_p (const_basic_block bb, int code_may_grow)
   int max_size = uncond_jump_length;
   rtx_insn *insn;
 
+  if (!HAVE_ATTR_length)
+    return false;
+
   if (!bb->frequency)
     return false;
   if (EDGE_COUNT (bb->preds) < 2)