diff mbox

[0/n] Merge from match-and-simplify

Message ID 20141016203852.GB29134@f1.c.bardezibar.internal
State New
Headers show

Commit Message

Sebastian Pop Oct. 16, 2014, 8:38 p.m. UTC
Richard Biener wrote:
> 
> I have posted 5 patches as part of a larger series to merge
> (parts) from the match-and-simplify branch.  While I think
> there was overall consensus that the idea behind the project
> is sound there are technical questions left for how the
> thing should look in the end.  I've raised them in 3/n
> which is the only patch of the series that contains any
> patterns sofar.
> 
> To re-iterate here (as I expect most people will only look
> at [0/n] patches ;)), the question is whether we are fine
> with making fold-const (thus fold_{unary,binary,ternary})
> not handle some cases it handles currently.

I have tested on aarch64 all the code in the match-and-simplify against trunk as
of the last merge at r216315:

2014-10-16  Richard Biener  <rguenther@suse.de>

        Merge from trunk r216235 through r216315.

Overall, I see a lot of perf regressions (about 2/3 of the tests) than
improvements (1/3 of the tests).  I will try to reduce tests.

For instance, saxpy regresses at -O3 on aarch64:

void saxpy(double* x, double* y, double* z) {
    int i=0;
    for (i = 0 ; i < ARRAY_SIZE; i++) {
        z[i] = x[i] + scalar*y[i];
    }
}

$ diff -u base.s mas.s



Thanks,
Sebastian

Comments

Andrew Pinski Oct. 16, 2014, 8:43 p.m. UTC | #1
On Thu, Oct 16, 2014 at 1:38 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> Richard Biener wrote:
>>
>> I have posted 5 patches as part of a larger series to merge
>> (parts) from the match-and-simplify branch.  While I think
>> there was overall consensus that the idea behind the project
>> is sound there are technical questions left for how the
>> thing should look in the end.  I've raised them in 3/n
>> which is the only patch of the series that contains any
>> patterns sofar.
>>
>> To re-iterate here (as I expect most people will only look
>> at [0/n] patches ;)), the question is whether we are fine
>> with making fold-const (thus fold_{unary,binary,ternary})
>> not handle some cases it handles currently.
>
> I have tested on aarch64 all the code in the match-and-simplify against trunk as
> of the last merge at r216315:
>
> 2014-10-16  Richard Biener  <rguenther@suse.de>
>
>         Merge from trunk r216235 through r216315.
>
> Overall, I see a lot of perf regressions (about 2/3 of the tests) than
> improvements (1/3 of the tests).  I will try to reduce tests.
>
> For instance, saxpy regresses at -O3 on aarch64:
>
> void saxpy(double* x, double* y, double* z) {
>     int i=0;
>     for (i = 0 ; i < ARRAY_SIZE; i++) {
>         z[i] = x[i] + scalar*y[i];
>     }
> }

This looks like a scheduling issue rather than anything else.  The
scheduler for a57 is not complete and does not model some things like
the fusion of the compares and branch which is most likely what you
are seeing.

Thanks,
Andrew Pinski

>
> $ diff -u base.s mas.s
> --- base.s      2014-10-16 15:30:15.351430000 -0500
> +++ mas.s       2014-10-16 15:30:16.183035000 -0500
> @@ -2,12 +2,14 @@
>         add     x1, x2, 800
>         ldr     q0, [x0, x2]
>         add     x3, x2, 1600
> +       cmp     x0, 784
>         ldr     q1, [x0, x1]
> +       add     x1, x0, 16
>         fmla    v0.2d, v1.2d, v2.2d
>         str     q0, [x0, x3]
> -       add     x0, x0, 16
> -       cmp     x0, 800
> +       mov     x0, x1
>         bne     .L140
>  .LBE179:
> -       subs    w4, w4, #1
> +       cmp     w4, 1
> +       sub     w4, w4, #1
>         bne     .L139
>
>
>
> Thanks,
> Sebastian
Ramana Radhakrishnan Oct. 17, 2014, 7:28 a.m. UTC | #2
On 16/10/14 21:43, Andrew Pinski wrote:
> On Thu, Oct 16, 2014 at 1:38 PM, Sebastian Pop <sebpop@gmail.com> wrote:
>> Richard Biener wrote:
>>>
>>> I have posted 5 patches as part of a larger series to merge
>>> (parts) from the match-and-simplify branch.  While I think
>>> there was overall consensus that the idea behind the project
>>> is sound there are technical questions left for how the
>>> thing should look in the end.  I've raised them in 3/n
>>> which is the only patch of the series that contains any
>>> patterns sofar.
>>>
>>> To re-iterate here (as I expect most people will only look
>>> at [0/n] patches ;)), the question is whether we are fine
>>> with making fold-const (thus fold_{unary,binary,ternary})
>>> not handle some cases it handles currently.


>>
>> I have tested on aarch64 all the code in the match-and-simplify against trunk as
>> of the last merge at r216315:
>>
>> 2014-10-16  Richard Biener  <rguenther@suse.de>
>>
>>          Merge from trunk r216235 through r216315.
>>
>> Overall, I see a lot of perf regressions (about 2/3 of the tests) than
>> improvements (1/3 of the tests).  I will try to reduce tests.

>>
>> For instance, saxpy regresses at -O3 on aarch64:
>>
>> void saxpy(double* x, double* y, double* z) {
>>      int i=0;
>>      for (i = 0 ; i < ARRAY_SIZE; i++) {
>>          z[i] = x[i] + scalar*y[i];
>>      }
>> }
>
> This looks like a scheduling issue rather than anything else.  The
> scheduler for a57 is not complete and does not model some things like
> the fusion of the compares and branch which is most likely what you
> are seeing.


Huh !! how is that related to the code generation shown by Seb ?

See the replacement of subs by cmp and sub. Folding cmp into other flag 
setting instructions is a very useful optimization on ARM and AArch64 
and that's what appears missing in fold-const. That maybe what's causing 
the slowdown. I've never known that to be caused by any scheduler 
vagaries !

regards
Ramana




>
> Thanks,
> Andrew Pinski
>
>>
>> $ diff -u base.s mas.s
>> --- base.s      2014-10-16 15:30:15.351430000 -0500
>> +++ mas.s       2014-10-16 15:30:16.183035000 -0500
>> @@ -2,12 +2,14 @@
>>          add     x1, x2, 800
>>          ldr     q0, [x0, x2]
>>          add     x3, x2, 1600
>> +       cmp     x0, 784
>>          ldr     q1, [x0, x1]
>> +       add     x1, x0, 16
>>          fmla    v0.2d, v1.2d, v2.2d
>>          str     q0, [x0, x3]
>> -       add     x0, x0, 16
>> -       cmp     x0, 800
>> +       mov     x0, x1
>>          bne     .L140
>>   .LBE179:
>> -       subs    w4, w4, #1
>> +       cmp     w4, 1
>> +       sub     w4, w4, #1
>>          bne     .L139
>>
>>
>>
>> Thanks,
>> Sebastian
>
Richard Biener Oct. 17, 2014, 7:55 a.m. UTC | #3
On Thu, 16 Oct 2014, Sebastian Pop wrote:

> Richard Biener wrote:
> > 
> > I have posted 5 patches as part of a larger series to merge
> > (parts) from the match-and-simplify branch.  While I think
> > there was overall consensus that the idea behind the project
> > is sound there are technical questions left for how the
> > thing should look in the end.  I've raised them in 3/n
> > which is the only patch of the series that contains any
> > patterns sofar.
> > 
> > To re-iterate here (as I expect most people will only look
> > at [0/n] patches ;)), the question is whether we are fine
> > with making fold-const (thus fold_{unary,binary,ternary})
> > not handle some cases it handles currently.
> 
> I have tested on aarch64 all the code in the match-and-simplify against trunk as
> of the last merge at r216315:
> 
> 2014-10-16  Richard Biener  <rguenther@suse.de>
> 
>         Merge from trunk r216235 through r216315.
> 
> Overall, I see a lot of perf regressions (about 2/3 of the tests) than
> improvements (1/3 of the tests).  I will try to reduce tests.

Note that the branch goes much further in exercising the machinery
than I want to merge at this point (that applies mostly to all
passes using the SSA propagator such as CCP and VRP and passes
exercising value-numbering - FRE and PRE).

It may also simply show the effect of now folding all statements
from tree-ssa-forwprop.c.  I have yet to investigate the testsuite
fallout of [1/n] to [5/n] - testresults have been very noisy lately
due to the C11 change and now ICF.

> For instance, saxpy regresses at -O3 on aarch64:
> 
> void saxpy(double* x, double* y, double* z) {
>     int i=0;
>     for (i = 0 ; i < ARRAY_SIZE; i++) {
>         z[i] = x[i] + scalar*y[i];
>     }
> }
> 
> $ diff -u base.s mas.s
> --- base.s      2014-10-16 15:30:15.351430000 -0500
> +++ mas.s       2014-10-16 15:30:16.183035000 -0500
> @@ -2,12 +2,14 @@
>         add     x1, x2, 800
>         ldr     q0, [x0, x2]
>         add     x3, x2, 1600
> +       cmp     x0, 784
>         ldr     q1, [x0, x1]
> +       add     x1, x0, 16
>         fmla    v0.2d, v1.2d, v2.2d
>         str     q0, [x0, x3]
> -       add     x0, x0, 16
> -       cmp     x0, 800
> +       mov     x0, x1
>         bne     .L140
>  .LBE179:
> -       subs    w4, w4, #1
> +       cmp     w4, 1
> +       sub     w4, w4, #1
>         bne     .L139

I don't understand AARCH64 assembly very well but the above looks like
RTL issues and/or IVOPTs issues?

Thanks for doing performance measurements.

Richard.
Sebastian Pop Oct. 17, 2014, 4:35 p.m. UTC | #4
Richard Biener wrote:
> On Thu, 16 Oct 2014, Sebastian Pop wrote:
> 
> > Richard Biener wrote:
> > > 
> > > I have posted 5 patches as part of a larger series to merge
> > > (parts) from the match-and-simplify branch.  While I think
> > > there was overall consensus that the idea behind the project
> > > is sound there are technical questions left for how the
> > > thing should look in the end.  I've raised them in 3/n
> > > which is the only patch of the series that contains any
> > > patterns sofar.
> > > 
> > > To re-iterate here (as I expect most people will only look
> > > at [0/n] patches ;)), the question is whether we are fine
> > > with making fold-const (thus fold_{unary,binary,ternary})
> > > not handle some cases it handles currently.
> > 
> > I have tested on aarch64 all the code in the match-and-simplify against trunk as
> > of the last merge at r216315:
> > 
> > 2014-10-16  Richard Biener  <rguenther@suse.de>
> > 
> >         Merge from trunk r216235 through r216315.
> > 
> > Overall, I see a lot of perf regressions (about 2/3 of the tests) than
> > improvements (1/3 of the tests).  I will try to reduce tests.
> 
> Note that the branch goes much further in exercising the machinery
> than I want to merge at this point (that applies mostly to all
> passes using the SSA propagator such as CCP and VRP and passes
> exercising value-numbering - FRE and PRE).

I see.  Should I run benchmarks only with the patches that you submitted for
trunk?

> I don't understand AARCH64 assembly very well but the above looks like
> RTL issues and/or IVOPTs issues?

I should have posted the first diff between the compilers with -fdump-tree-all:
that would expose the problem at its root.

I have seen that there is a way to dump the folded expressions from the new
functionality, is there a flag to print the folded expressions in current trunk?
It would be interesting to have the same kind of output, such that we could run
a diff between.

Thanks,
Sebastian
Richard Biener Oct. 17, 2014, 5:34 p.m. UTC | #5
On October 17, 2014 6:35:58 PM CEST, Sebastian Pop <sebpop@gmail.com> wrote:
>Richard Biener wrote:
>> On Thu, 16 Oct 2014, Sebastian Pop wrote:
>> 
>> > Richard Biener wrote:
>> > > 
>> > > I have posted 5 patches as part of a larger series to merge
>> > > (parts) from the match-and-simplify branch.  While I think
>> > > there was overall consensus that the idea behind the project
>> > > is sound there are technical questions left for how the
>> > > thing should look in the end.  I've raised them in 3/n
>> > > which is the only patch of the series that contains any
>> > > patterns sofar.
>> > > 
>> > > To re-iterate here (as I expect most people will only look
>> > > at [0/n] patches ;)), the question is whether we are fine
>> > > with making fold-const (thus fold_{unary,binary,ternary})
>> > > not handle some cases it handles currently.
>> > 
>> > I have tested on aarch64 all the code in the match-and-simplify
>against trunk as
>> > of the last merge at r216315:
>> > 
>> > 2014-10-16  Richard Biener  <rguenther@suse.de>
>> > 
>> >         Merge from trunk r216235 through r216315.
>> > 
>> > Overall, I see a lot of perf regressions (about 2/3 of the tests)
>than
>> > improvements (1/3 of the tests).  I will try to reduce tests.
>> 
>> Note that the branch goes much further in exercising the machinery
>> than I want to merge at this point (that applies mostly to all
>> passes using the SSA propagator such as CCP and VRP and passes
>> exercising value-numbering - FRE and PRE).
>
>I see.  Should I run benchmarks only with the patches that you
>submitted for
>trunk?

Yes. What I posted sofar should be a no-op performance wise.

Benchmarks on the branch are still useful though as eventually all changes should get merged to trunk if enough patterns are implemented.

Richard.

>> I don't understand AARCH64 assembly very well but the above looks
>like
>> RTL issues and/or IVOPTs issues?
>
>I should have posted the first diff between the compilers with
>-fdump-tree-all:
>that would expose the problem at its root.
>
>I have seen that there is a way to dump the folded expressions from the
>new
>functionality, is there a flag to print the folded expressions in
>current trunk?
>It would be interesting to have the same kind of output, such that we
>could run
>a diff between.
>
>Thanks,
>Sebastian
diff mbox

Patch

--- base.s      2014-10-16 15:30:15.351430000 -0500
+++ mas.s       2014-10-16 15:30:16.183035000 -0500
@@ -2,12 +2,14 @@ 
        add     x1, x2, 800
        ldr     q0, [x0, x2]
        add     x3, x2, 1600
+       cmp     x0, 784
        ldr     q1, [x0, x1]
+       add     x1, x0, 16
        fmla    v0.2d, v1.2d, v2.2d
        str     q0, [x0, x3]
-       add     x0, x0, 16
-       cmp     x0, 800
+       mov     x0, x1
        bne     .L140
 .LBE179:
-       subs    w4, w4, #1
+       cmp     w4, 1
+       sub     w4, w4, #1
        bne     .L139