
[0/4,GCC11] IVOPTs consider step cost for different forms when unrolling

Message ID ddd8c186-fc88-96df-b1c0-f99edec654f2@linux.ibm.com
Series IVOPTs consider step cost for different forms when unrolling

Message

Kewen.Lin Jan. 16, 2020, 9:36 a.m. UTC
Hi,

As we discussed in the thread
https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
(original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html),
I'm working to teach IVOPTs to consider D-form group accesses during unrolling.
The difference between D-form and the other forms during unrolling is that we
can put the stride into the displacement field and avoid the additional step
increments, e.g.:

With X-form (uf step increments, where uf is the unroll factor):
  ...
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  LD A = baseA, X
  LD B = baseB, X
  ST C = baseC, X
  X = X + stride
  ...

With D-form (one step increment for each base):
  ...
  LD A = baseA, OFF
  LD B = baseB, OFF
  ST C = baseC, OFF
  LD A = baseA, OFF+stride
  LD B = baseB, OFF+stride
  ST C = baseC, OFF+stride
  LD A = baseA, OFF+2*stride
  LD B = baseB, OFF+2*stride
  ST C = baseC, OFF+2*stride
  ...
  baseA += stride * uf
  baseB += stride * uf
  baseC += stride * uf

Imagine the loop gets unrolled 8 times: that means 3 step updates with
D-form vs. 8 step updates with X-form.  We only need to check that the stride
meets the D-form displacement field requirement, since if OFF doesn't, we can
construct a new base baseA' = baseA + OFF.

This patch set consists of four parts:
     
  [PATCH 1/4 GCC11] Add middle-end unroll factor estimation

     Add unroll factor estimation in the middle end.  It mainly mirrors the
     current RTL unroll factor determination in function decide_unrolling and
     its subcalls.  As Richard B. suggested, we could probably force the
     unroll factor with this and avoid duplicated unroll factor calculation,
     but I think that needs more benchmarking work and should be handled
     separately.

  [PATCH 2/4 GCC11] Add target hook stride_dform_valid_p 

     Add a target hook to determine whether the current memory access, with
     the given mode, stride and other flags, has D-form support available.
     
  [PATCH 3/4 GCC11] IVOPTs Consider cost_step on different forms during unrolling

     Teach IVOPTs to identify address-type IV groups where D-form is
     preferred, and set the dform_p flag on their derived IV candidates.
     Taking the unroll factor into account, increase the IV cost by
     (uf - 1) * cost_step for any candidate that is not a D-form IV cand.
     
  [PATCH 4/4 GCC11] rs6000: P9 D-form test cases

     Add some test cases, mainly copied from Kelvin's patch.

Bootstrapped and regression tested on powerpc64le-linux-gnu.
I'll be on leave for two weeks soon, so please expect delayed responses.
Thanks a lot in advance!

BR,
Kewen

------------

 gcc/cfgloop.h                                       |   3 +
 gcc/config/rs6000/rs6000.c                          |  56 ++++++++++++++++-
 gcc/doc/tm.texi                                     |  14 +++++
 gcc/doc/tm.texi.in                                  |   4 ++
 gcc/target.def                                      |  21 ++++++-
 gcc/testsuite/gcc.target/powerpc/p9-dform-0.c       |  43 +++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-1.c       |  55 +++++++++++++++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-2.c       |  12 ++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-3.c       |  15 +++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-4.c       |  12 ++++
 gcc/testsuite/gcc.target/powerpc/p9-dform-generic.h |  34 +++++++++++
 gcc/tree-ssa-loop-ivopts.c                          |  84 +++++++++++++++++++++++++-
 gcc/tree-ssa-loop-manip.c                           | 254 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 gcc/tree-ssa-loop-manip.h                           |   3 +-
 gcc/tree-ssa-loop.c                                 |  33 ++++++++++
 gcc/tree-ssa-loop.h                                 |   2 +
 16 files changed, 640 insertions(+), 5 deletions(-)

Comments

Segher Boessenkool Jan. 20, 2020, 12:33 p.m. UTC | #1
Hi!

On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
> As we discussed in the thread
> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> The difference on D-form and other forms during unrolling is we can put the
> stride into displacement field to avoid additional step increment. eg:

<snip>

> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> D-form vs. 8 step updates with X-form. Here we only need to check stride
> meet D-form field requirement, since if OFF doesn't meet, we can construct
> baseA' with baseA + OFF.

So why doesn't the existing code do this already?  Why does it make all
the extra induction variables?  Is the existing cost model bad, are our
target costs bad, or something like that?


Segher
Kewen.Lin Feb. 10, 2020, 6:17 a.m. UTC | #2
Hi Segher,

on 2020/1/20 8:33 PM, Segher Boessenkool wrote:
> Hi!
> 
> On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
>> As we discussed in the thread
>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
>> I'm working to teach IVOPTs to consider D-form group access during unrolling.
>> The difference on D-form and other forms during unrolling is we can put the
>> stride into displacement field to avoid additional step increment. eg:
> 
> <snip>
> 
>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
>> D-form vs. 8 step updates with X-form. Here we only need to check stride
>> meet D-form field requirement, since if OFF doesn't meet, we can construct
>> baseA' with baseA + OFF.
> 
> So why doesn't the existing code do this already?  Why does it make all
> the extra induction variables?  Is the existing cost model bad, are our
> target costs bad, or something like that?
> 

I think the main cause is that IVOPTs runs before RTL unrolling; when it
determines the IV sets, it can only take the normal step cost into account,
since its input isn't unrolled yet.  After unrolling, the X-form indexing
register has to perform UF-1 more updates, but we can probably hide those in
the D-form displacement field.  The approach I proposed here is to adjust the
IV cost with an additional cost_step according to the estimated unroll factor.
It doesn't introduce new IV candidates but can affect the final optimal set.

BR,
Kewen
Segher Boessenkool Feb. 10, 2020, 9:29 p.m. UTC | #3
Hi!

On Mon, Feb 10, 2020 at 02:17:04PM +0800, Kewen.Lin wrote:
> on 2020/1/20 8:33 PM, Segher Boessenkool wrote:
> > On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
> >> As we discussed in the thread
> >> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> >> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> >> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> >> The difference on D-form and other forms during unrolling is we can put the
> >> stride into displacement field to avoid additional step increment. eg:
> > 
> > <snip>
> > 
> >> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> >> D-form vs. 8 step updates with X-form. Here we only need to check stride
> >> meet D-form field requirement, since if OFF doesn't meet, we can construct
> >> baseA' with baseA + OFF.
> > 
> > So why doesn't the existing code do this already?  Why does it make all
> > the extra induction variables?  Is the existing cost model bad, are our
> > target costs bad, or something like that?
> > 
> 
> I think the main cause is IVOPTs runs before RTL unroll, when it's determining
> the IV sets, it can only take the normal step cost into account, since its
> input isn't unrolled yet.  After unrolling, the x-form indexing register has to
> play with more UF-1 times update, but we can probably hide them in d-form
> displacement field.  The way I proposed here is to adjust IV cost with
> additional cost_step according to estimated unroll.  It doesn't introduce new
> IV cand but can affect the final optimal set.

Yes, we should decide how often we want to unroll things somewhere before
ivopts already, and just use that info here.

> Or are there advantages to doing it *in* ivopts?  It sounds like doing
it there is probably expensive, but maybe not, and we need to do similar
analysis there anyway.


Segher
Kewen.Lin Feb. 11, 2020, 2:56 a.m. UTC | #4
on 2020/2/11 5:29 AM, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Feb 10, 2020 at 02:17:04PM +0800, Kewen.Lin wrote:
>> on 2020/1/20 8:33 PM, Segher Boessenkool wrote:
>>> On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
>>>> As we discussed in the thread
>>>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
>>>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
>>>> I'm working to teach IVOPTs to consider D-form group access during unrolling.
>>>> The difference on D-form and other forms during unrolling is we can put the
>>>> stride into displacement field to avoid additional step increment. eg:
>>>
>>> <snip>
>>>
>>>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
>>>> D-form vs. 8 step updates with X-form. Here we only need to check stride
>>>> meet D-form field requirement, since if OFF doesn't meet, we can construct
>>>> baseA' with baseA + OFF.
>>>
>>> So why doesn't the existing code do this already?  Why does it make all
>>> the extra induction variables?  Is the existing cost model bad, are our
>>> target costs bad, or something like that?
>>>
>>
>> I think the main cause is IVOPTs runs before RTL unroll, when it's determining
>> the IV sets, it can only take the normal step cost into account, since its
>> input isn't unrolled yet.  After unrolling, the x-form indexing register has to
>> play with more UF-1 times update, but we can probably hide them in d-form
>> displacement field.  The way I proposed here is to adjust IV cost with
>> additional cost_step according to estimated unroll.  It doesn't introduce new
>> IV cand but can affect the final optimal set.
> 
> Yes, we should decide how often we want to unroll things somewhere before
> ivopts already, and just use that info here.

Agreed!  If some passes are interested in this unroll factor estimation, we
can move it backward, as long as it stays before IVOPTs.  As in patch 1/4,
once it's set, later passes can just reuse that info.  As Richard B.
suggested, we can even skip the later RTL unroll factor determination.

> 
> Or are there advantage to doing it *in* ivopts?  It sounds like doing
> it there is probably expensive, but maybe not, and we need to do similar
> analysis there anyway.
> 

Good question.  I hadn't considered that; the reason for putting it here is
that we need this information in IVOPTs for some cases.  :)

BR,
Kewen
Richard Biener Feb. 11, 2020, 7:34 a.m. UTC | #5
On Mon, 10 Feb 2020, Segher Boessenkool wrote:

> Hi!
> 
> On Mon, Feb 10, 2020 at 02:17:04PM +0800, Kewen.Lin wrote:
> > on 2020/1/20 8:33 PM, Segher Boessenkool wrote:
> > > On Thu, Jan 16, 2020 at 05:36:52PM +0800, Kewen.Lin wrote:
> > >> As we discussed in the thread
> > >> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html
> > >> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html,
> > >> I'm working to teach IVOPTs to consider D-form group access during unrolling.
> > >> The difference on D-form and other forms during unrolling is we can put the
> > >> stride into displacement field to avoid additional step increment. eg:
> > > 
> > > <snip>
> > > 
> > >> Imagining that if the loop get unrolled by 8 times, then 3 step updates with
> > >> D-form vs. 8 step updates with X-form. Here we only need to check stride
> > >> meet D-form field requirement, since if OFF doesn't meet, we can construct
> > >> baseA' with baseA + OFF.
> > > 
> > > So why doesn't the existing code do this already?  Why does it make all
> > > the extra induction variables?  Is the existing cost model bad, are our
> > > target costs bad, or something like that?
> > > 
> > 
> > I think the main cause is IVOPTs runs before RTL unroll, when it's determining
> > the IV sets, it can only take the normal step cost into account, since its
> > input isn't unrolled yet.  After unrolling, the x-form indexing register has to
> > play with more UF-1 times update, but we can probably hide them in d-form
> > displacement field.  The way I proposed here is to adjust IV cost with
> > additional cost_step according to estimated unroll.  It doesn't introduce new
> > IV cand but can affect the final optimal set.
> 
> Yes, we should decide how often we want to unroll things somewhere before
> ivopts already, and just use that info here.
> 
> Or are there advantage to doing it *in* ivopts?  It sounds like doing
> it there is probably expensive, but maybe not, and we need to do similar
> analysis there anyway.

Well, if the only benefit of doing the unrolling is that IVs get
cheaper then yes, IVOPTs should drive it.

But usually unrolling exposes redundancies (caught by predictive
commoning, which drives some unrolling) or it enables better use
of CPU resources via scheduling (only caught later in RTL).
For scheduling we have the additional complication that the RTL
side doesn't have as much of a fancy data dependence analysis
framework as on the GIMPLE side.  So I'd put my bet on trying to
move something like SMS to GIMPLE and combine it with unrolling
(IIRC SMS at most interleaves 1 1/2 loop iterations).

Richard.
Segher Boessenkool Feb. 11, 2020, 7:48 a.m. UTC | #6
On Tue, Feb 11, 2020 at 08:34:15AM +0100, Richard Biener wrote:
> On Mon, 10 Feb 2020, Segher Boessenkool wrote:
> > Yes, we should decide how often we want to unroll things somewhere before
> > ivopts already, and just use that info here.
> > 
> > Or are there advantage to doing it *in* ivopts?  It sounds like doing
> > it there is probably expensive, but maybe not, and we need to do similar
> > analysis there anyway.
> 
> Well, if the only benefit of doing the unrolling is that IVs get
> cheaper then yes, IVOPTs should drive it.

We need to know much earlier in the pass pipeline how often a loop will
be unrolled.  We don't have to *do* it early.

If we want to know it before ivopts, then obviously it has to be done
earlier.  Otherwise, maybe it is a good idea to do it in ivopts itself.
Or maybe not.  It's just an idea :-)

We know we do not want it *later*, ivopts needs to know this to make
good decisions of its own.

> But usually unrolling exposes redundancies (catched by predictive
> commoning which drives some unrolling) or it enables better use
> of CPU resources via scheduling (only catched later in RTL).
> For scheduling we have the additional complication that the RTL
> side doesn't have as much of a fancy data dependence analysis
> framework as on the GIMPLE side.  So I'd put my bet on trying to
> move something like SMS to GIMPLE and combine it with unrolling
> (IIRC SMS at most interleaves 1 1/2 loop iterations).

SMS on RTL always was quite disappointing...  Do you expect it will be
more useful on Gimple?  Moving it there is a good idea in any case ;-)

I don't quite see the synergy between SMS and loop unrolling, but maybe
I need to look harder.


Segher
Richard Biener Feb. 11, 2020, 8:01 a.m. UTC | #7
On Tue, 11 Feb 2020, Segher Boessenkool wrote:

> On Tue, Feb 11, 2020 at 08:34:15AM +0100, Richard Biener wrote:
> > On Mon, 10 Feb 2020, Segher Boessenkool wrote:
> > > Yes, we should decide how often we want to unroll things somewhere before
> > > ivopts already, and just use that info here.
> > > 
> > > Or are there advantage to doing it *in* ivopts?  It sounds like doing
> > > it there is probably expensive, but maybe not, and we need to do similar
> > > analysis there anyway.
> > 
> > Well, if the only benefit of doing the unrolling is that IVs get
> > cheaper then yes, IVOPTs should drive it.
> 
> We need to know much earlier in the pass pipeline how often a loop will
> be unrolled.  We don't have to *do* it early.
> 
> If we want to know it before ivopts, then obviously it has to be done
> earlier.  Otherwise, maybe it is a good idea to do it in ivopts itself.
> Or maybe not.  It's just an idea :-)
> 
> We know we do not want it *later*, ivopts needs to know this to make
> good decisions of its own.
> 
> > But usually unrolling exposes redundancies (catched by predictive
> > commoning which drives some unrolling) or it enables better use
> > of CPU resources via scheduling (only catched later in RTL).
> > For scheduling we have the additional complication that the RTL
> > side doesn't have as much of a fancy data dependence analysis
> > framework as on the GIMPLE side.  So I'd put my bet on trying to
> > move something like SMS to GIMPLE and combine it with unrolling
> > (IIRC SMS at most interleaves 1 1/2 loop iterations).
> 
> SMS on RTL always was quite disappointing...

It originally came with "data dependence export from GIMPLE to RTL"
that never materialized so I'm not surprised ;)  It also relies
on doloop detection.

> Do you expect it will be more useful on Gimple?  Moving it there is a 
> good idea in any case ;-)
>
> I don't quite see the synergy between SMS and loop unrolling, but maybe
> I need to look harder.

As said elsewhere I don't believe in actual unrolling doing much good
but in removing data dependences in the CPU pipeline.  SMS rotates
the loop, peeling N iterations (and somehow I think for N > 1 that
should better mean unrolling the loop body).

Of course doing "scheduling" on GIMPLE is "interesting" in its own
but OTOH our pipeline DFAs are imprecise enough that one could even
devise some basic GIMPLE <-> "RTL" mapping to make use of it.  But
then scheduling without IVs or register pressure in mind is somewhat
pointless as well.

That said - if I had enough time I'd still thing that investigating
"scheduling on GIMPLE" as replacement for sched1 is an interesting
thing to do.

Richard.
Roman Zhuykov Feb. 11, 2020, 12:46 p.m. UTC | #8
11.02.2020 11:01, Richard Biener wrote:
> On Tue, 11 Feb 2020, Segher Boessenkool wrote:
>
>> On Tue, Feb 11, 2020 at 08:34:15AM +0100, Richard Biener wrote:
>>> On Mon, 10 Feb 2020, Segher Boessenkool wrote:
>>>> Yes, we should decide how often we want to unroll things somewhere before
>>>> ivopts already, and just use that info here.
>>>>
>>>> Or are there advantage to doing it *in* ivopts?  It sounds like doing
>>>> it there is probably expensive, but maybe not, and we need to do similar
>>>> analysis there anyway.
>>> Well, if the only benefit of doing the unrolling is that IVs get
>>> cheaper then yes, IVOPTs should drive it.
>> We need to know much earlier in the pass pipeline how often a loop will
>> be unrolled.  We don't have to *do* it early.
>>
>> If we want to know it before ivopts, then obviously it has to be done
>> earlier.  Otherwise, maybe it is a good idea to do it in ivopts itself.
>> Or maybe not.  It's just an idea :-)
>>
>> We know we do not want it *later*, ivopts needs to know this to make
>> good decisions of its own.
>>
>>> But usually unrolling exposes redundancies (catched by predictive
>>> commoning which drives some unrolling) or it enables better use
>>> of CPU resources via scheduling (only catched later in RTL).
>>> For scheduling we have the additional complication that the RTL
>>> side doesn't have as much of a fancy data dependence analysis
>>> framework as on the GIMPLE side.  So I'd put my bet on trying to
>>> move something like SMS to GIMPLE and combine it with unrolling
>>> (IIRC SMS at most interleaves 1 1/2 loop iterations).
To clarify, without specifying -fmodulo-sched-allow-regmoves it only
interleaves 2 iterations.  With register moves enabled more iterations
can be considered.
> SMS on RTL always was quite disappointing...
Hmm, even when trying to move it just few passes earlier many years ago,
got another opinion:
https://gcc.gnu.org/ml/gcc-patches/2011-10/msg01526.html
Although without such a move we still have annoying issues which RTL
folks can't solve, see e.g.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93264#c2
> It originally came with "data dependence export from GIMPLE to RTL"
> that never materialized so I'm not surprised ;)  It also relies
> on doloop detection.
My current attempt to drop doloop dependency is still WIP, hopefully
I'll create branch in refs/users/ in a month or so.  But older (gcc-7
and earlier) versions are available, see
https://gcc.gnu.org/ml/gcc-patches/2017-02/msg01647.html
Doloops are still supported for some kind of backward compatibility, but
many more loops (those which loop-iv can analyze) are considered in the new SMS.
>> Do you expect it will be more useful on Gimple?  Moving it there is a 
>> good idea in any case ;-)
>>
>> I don't quite see the synergy between SMS and loop unrolling, but maybe
>> I need to look harder.
> As said elsewhere I don't believe in actual unrolling doing much good
> but in removing data dependences in the CPU pipeline.  SMS rotates
> the loop, peeling N iterations (and somehow I think for N > 1 that
> should better mean unrolling the loop body).
Yes, this is what theory tells us.
> Of course doing "scheduling" on GIMPLE is "interesting" in its own
> but OTOH our pipeline DFAs are imprecise enough that one could even
> devise some basic GIMPLE <-> "RTL" mapping to make use of it.  But
> then scheduling without IVs or register pressure in mind is somewhat
> pointless as well.
Unfortunately, even with -fmodulo-sched-allow-regmoves it doesn't
interact much with register pressure.
> That said - if I had enough time I'd still thing that investigating
> "scheduling on GIMPLE" as replacement for sched1 is an interesting
> thing to do.
Sounds good, but IMHO the modulo scheduler is not the best choice as the
first step in implementing such a concept.

Roman
Richard Biener Feb. 11, 2020, 1:58 p.m. UTC | #9
On Tue, 11 Feb 2020, Roman Zhuykov wrote:

> 11.02.2020 11:01, Richard Biener wrote:
> > On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> >
> >> On Tue, Feb 11, 2020 at 08:34:15AM +0100, Richard Biener wrote:
> >>> On Mon, 10 Feb 2020, Segher Boessenkool wrote:
> >>>> Yes, we should decide how often we want to unroll things somewhere before
> >>>> ivopts already, and just use that info here.
> >>>>
> >>>> Or are there advantage to doing it *in* ivopts?  It sounds like doing
> >>>> it there is probably expensive, but maybe not, and we need to do similar
> >>>> analysis there anyway.
> >>> Well, if the only benefit of doing the unrolling is that IVs get
> >>> cheaper then yes, IVOPTs should drive it.
> >> We need to know much earlier in the pass pipeline how often a loop will
> >> be unrolled.  We don't have to *do* it early.
> >>
> >> If we want to know it before ivopts, then obviously it has to be done
> >> earlier.  Otherwise, maybe it is a good idea to do it in ivopts itself.
> >> Or maybe not.  It's just an idea :-)
> >>
> >> We know we do not want it *later*, ivopts needs to know this to make
> >> good decisions of its own.
> >>
> >>> But usually unrolling exposes redundancies (catched by predictive
> >>> commoning which drives some unrolling) or it enables better use
> >>> of CPU resources via scheduling (only catched later in RTL).
> >>> For scheduling we have the additional complication that the RTL
> >>> side doesn't have as much of a fancy data dependence analysis
> >>> framework as on the GIMPLE side.  So I'd put my bet on trying to
> >>> move something like SMS to GIMPLE and combine it with unrolling
> >>> (IIRC SMS at most interleaves 1 1/2 loop iterations).
> To clarify, without specifying -fmodulo-sched-allow-regmoves it only
> interleaves 2 iterations.  With register moves enabled more iterations
> can be considered.
> > SMS on RTL always was quite disappointing...
> Hmm, even when trying to move it just few passes earlier many years ago,
> got another opinion:
> https://gcc.gnu.org/ml/gcc-patches/2011-10/msg01526.html
> Although without such a move we still have annoying issues which RTL
> folks can't solve, see e.q.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93264#c2
> > It originally came with "data dependence export from GIMPLE to RTL"
> > that never materialized so I'm not surprised ;)  It also relies
> > on doloop detection.
> My current attempt to drop doloop dependency is still WIP, hopefully
> I'll create branch in refs/users/ in a month or so.  But older (gcc-7
> and earlier) versions are available, see
> https://gcc.gnu.org/ml/gcc-patches/2017-02/msg01647.html
> Doloops are still supported for some kind of backward compatibility, but
> much more loops (which loop-iv can analyze) are considered in new SMS.
> >> Do you expect it will be more useful on Gimple?  Moving it there is a 
> >> good idea in any case ;-)
> >>
> >> I don't quite see the synergy between SMS and loop unrolling, but maybe
> >> I need to look harder.
> > As said elsewhere I don't believe in actual unrolling doing much good
> > but in removing data dependences in the CPU pipeline.  SMS rotates
> > the loop, peeling N iterations (and somehow I think for N > 1 that
> > should better mean unrolling the loop body).
> Yes, this is what theory tells us.
> > Of course doing "scheduling" on GIMPLE is "interesting" in its own
> > but OTOH our pipeline DFAs are imprecise enough that one could even
> > devise some basic GIMPLE <-> "RTL" mapping to make use of it.  But
> > then scheduling without IVs or register pressure in mind is somewhat
> > pointless as well.
> Unfortunately, even with -fmodulo-sched-allow-regmoves it doesn't
> interact much with register pressure.
> > That said - if I had enough time I'd still thing that investigating
> > "scheduling on GIMPLE" as replacement for sched1 is an interesting
> > thing to do.
> Sound good, but IMHO modulo scheduler is not the best choice to be the
> first step implementing such a concept.

True ;)   But since the context of this thread is unrolling ...
Not sure how you'd figure the unroll factor to apply if you want
to do unrolling within a classical scheduling framework?  Maybe
unroll as much as you can fill slots until the last instruction
of the first iteration retires?

Richard.
Segher Boessenkool Feb. 11, 2020, 6 p.m. UTC | #10
On Tue, Feb 11, 2020 at 02:58:47PM +0100, Richard Biener wrote:
> On Tue, 11 Feb 2020, Roman Zhuykov wrote:
> > 11.02.2020 11:01, Richard Biener wrote:
> > Sound good, but IMHO modulo scheduler is not the best choice to be the
> > first step implementing such a concept.
> 
> True ;)   But since the context of this thread is unrolling ...
> Not sure how you'd figure the unroll factor to apply if you want
> to do unrolling within a classical scheduling framework?  Maybe
> unroll as much as you can fill slots until the last instruction
> of the first iteration retires?

That will be terrible on register-rich architectures: it *already* is
problematic how often some things are unrolled, blindly unrolling more
would make things worse.  We need to unroll more where it helps, but
less where it does not.  For that we need a good cost/benefit estimate.


Segher
Segher Boessenkool Feb. 11, 2020, 6:12 p.m. UTC | #11
Hi!

On Tue, Feb 11, 2020 at 03:46:05PM +0300, Roman Zhuykov wrote:
> Hmm, even when trying to move it just few passes earlier many years ago,
> got another opinion:
> https://gcc.gnu.org/ml/gcc-patches/2011-10/msg01526.html
> Although without such a move we still have annoying issues which RTL
> folks can't solve, see e.q.
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93264#c2

Basic block partitioning has wildly disproportionate fallout in all
later passes, both in terms of what those *do* (or don't, if partitioning
is enabled), and of impact on the code (not to mention developer time).

Maybe the implementation can be improved, but probably we should do this
in a different way altogether.  The current situation is not good.


Segher
Richard Biener Feb. 12, 2020, 8:07 a.m. UTC | #12
On Tue, 11 Feb 2020, Segher Boessenkool wrote:

> On Tue, Feb 11, 2020 at 02:58:47PM +0100, Richard Biener wrote:
> > On Tue, 11 Feb 2020, Roman Zhuykov wrote:
> > > 11.02.2020 11:01, Richard Biener wrote:
> > > Sound good, but IMHO modulo scheduler is not the best choice to be the
> > > first step implementing such a concept.
> > 
> > True ;)   But since the context of this thread is unrolling ...
> > Not sure how you'd figure the unroll factor to apply if you want
> > to do unrolling within a classical scheduling framework?  Maybe
> > unroll as much as you can fill slots until the last instruction
> > of the first iteration retires?
> 
> That will be terrible on register-rich architectures: it *already* is
> problematic how often some things are unrolled, blindly unrolling more
> would make things worse.  We need to unroll more where it helps, but
> less where it does not.  For that we need a good cost/benefit estimate.

True.  For x86 we tried but did not come up with a sensible estimate
(probably the x86 uarchs are way too complicated to understand).

Richard.
Richard Biener Feb. 12, 2020, 8:12 a.m. UTC | #13
On Tue, 11 Feb 2020, Segher Boessenkool wrote:

> Hi!
> 
> On Tue, Feb 11, 2020 at 03:46:05PM +0300, Roman Zhuykov wrote:
> > Hmm, even when trying to move it just few passes earlier many years ago,
> > got another opinion:
> > https://gcc.gnu.org/ml/gcc-patches/2011-10/msg01526.html
> > Although without such a move we still have annoying issues which RTL
> > folks can't solve, see e.q.
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93264#c2
> 
> Basic block partitioning has wildly disproportionate fallout in all
> later passes, both in terms of what those *do* (or don't, if partitioning
> is enabled), and of impact on the code (not to mention developer time).
> 
> Maybe the implementation can be improved, but probably we should do this
> in a different way altogether.  The current situation is not good.

I think the expectation that you can go back to CFG layout mode
and then work with CFG layout tools after we've lowered to CFG RTL
is simply bogus.  Yeah, you can probably do analysis things but
I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle 
can wreck things.  Undoubtedly doing CFG manipulations is not going
to work since CFG layout does not respect CFG RTL restrictions.

Partitioning simply uncovered latent bugs, there's nothing wrong
with it IMHO.

Richard.
Segher Boessenkool Feb. 12, 2020, 10:01 a.m. UTC | #14
On Wed, Feb 12, 2020 at 09:12:58AM +0100, Richard Biener wrote:
> On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> > Basic block partitioning has wildly disproportionate fallout in all
> > later passes, both in terms of what those *do* (or don't, if partitioning
> > is enabled), and of impact on the code (not to mention developer time).
> > 
> > Maybe the implementation can be improved, but probably we should do this
> > in a different way altogether.  The current situation is not good.
> 
> I think the expectation that you can go back to CFG layout mode
> and then work with CFG layout tools after we've lowered to CFG RTL
> is simply bogus.

Partitioning is also quite problematic if you do not use cfglayout
mode.  For example, in shrink-wrapping.  It prevents a lot there.

> Yeah, you can probably do analysis things but
> I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle 
> can wreck things.  Undoubtedly doing CFG manipulations is not going
> to work since CFG layout does not respect CFG RTL restrictions.

Doing CFG manipulations on CFG RTL mode directly is almost impossible
to do correctly.

For example, bb-reorder.  Which is a much more important optimisation
than partitioning, btw.

> Partitioning simply uncovered latent bugs, there's nothing wrong
> with it IMHO.

I don't agree.  The whole way EDGE_CROSSING works hinders all other
optimisations a lot.


Segher
Richard Biener Feb. 12, 2020, 10:53 a.m. UTC | #15
On Wed, 12 Feb 2020, Segher Boessenkool wrote:

> On Wed, Feb 12, 2020 at 09:12:58AM +0100, Richard Biener wrote:
> > On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> > > Basic block partitioning has wildly disproportionate fallout in all
> > > later passes, both in terms of what those *do* (or don't, if partitioning
> > > is enabled), and of impact on the code (not to mention developer time).
> > > 
> > > Maybe the implementation can be improved, but probably we should do this
> > > in a different way altogether.  The current situation is not good.
> > 
> > I think the expectation that you can go back to CFG layout mode
> > and then work with CFG layout tools after we've lowered to CFG RTL
> > is simply bogus.
> 
> Partitioning is also quite problematic if you do not use cfglayout
> mode.  For example, in shrink-wrapping.  It prevents a lot there.
> 
> > Yeah, you can probably do analysis things but
> > I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle 
> > can wreck things.  Undoubtedly doing CFG manipulations is not going
> > to work since CFG layout does not respect CFG RTL restrictions.
> 
> Doing CFG manipulations on CFG RTL mode directly is almost impossible
> to do correctly.
> 
> For example, bb-reorder.  Which is a much more important optimisation
> than partitioning, btw.

BB reorder switches back and forth as well ... :/

I think both are closely enough related that we probably should do
partitioning from within the same framework?  OTOH BB reorder
happens _much_ later.

> > Partitioning simply uncovered latent bugs, there's nothing wrong
> > with it IMHO.
> 
> I don't agree.  The whole way EDGE_CROSSING works hinders all other
> optimisations a lot.

I'm not sure if it's there for correctness reasons or just to
help checking that nothing "undoes" the partitioning decision.

Richard.

> 
> Segher
Segher Boessenkool Feb. 12, 2020, 9:52 p.m. UTC | #16
On Wed, Feb 12, 2020 at 09:07:27AM +0100, Richard Biener wrote:
> On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> 
> > On Tue, Feb 11, 2020 at 02:58:47PM +0100, Richard Biener wrote:
> > > On Tue, 11 Feb 2020, Roman Zhuykov wrote:
> > > > 11.02.2020 11:01, Richard Biener wrote:
> > > > Sound good, but IMHO modulo scheduler is not the best choice to be the
> > > > first step implementing such a concept.
> > > 
> > > True ;)   But since the context of this thread is unrolling ...
> > > Not sure how you'd figure the unroll factor to apply if you want
> > > to do unrolling within a classical scheduling framework?  Maybe
> > > unroll as much as you can fill slots until the last instruction
> > > of the first iteration retires?
> > 
> > That will be terrible on register-rich architectures: it *already* is
> > problematic how often some things are unrolled, blindly unrolling more
> > would make things worse.  We need to unroll more where it helps, but
> > less where it does not.  For that we need a good cost/benefit estimate.
> 
> True.  For x86 we tried but did not come up with a sensible estimate
> (probably the x86 uarchs are way too complicated to understand).

There are three main factors at work:

1) The cost for iterating the loop.  This is related to what ivopts
does, and also on most uarchs every loop iteration has some minimum cost
(on many uarchs it needs a fetch redirect, or you can only do one branch
per cycle, etc.; this is mainly important for "small" loops, where what
is "small" differs per uarch).  Unrolling brings this cost down.
2) Cost reduction by better scheduling.
3) The cost for using more registers, when unrolling.  This is worse if
you also use SMS or variable expansion, or simply because of the better
scheduling.  This can increase the cost a *lot*.

1) isn't all that hard to estimate.  2) and esp. 3) are though :-/
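
[Editor's note: the three factors above can be put into a toy cost/benefit
model.  This is purely illustrative -- all constants and function names are
made up, nothing here reflects a real uarch or GCC's actual cost model.]

```python
# Toy model of the unrolling trade-off described above: factor 1 (fixed
# per-iteration overhead) is amortized by unrolling, while factor 3
# (register pressure) eventually punishes large unroll factors.  Factor 2
# (scheduling) is omitted, since it is the hard-to-estimate part.

def unroll_cost(body_cost, iter_overhead, live_regs, avail_regs, uf):
    """Estimated cost per *original* iteration when unrolled `uf` times.

    body_cost     - cost of one copy of the loop body
    iter_overhead - fixed cost per hardware loop iteration (factor 1)
    live_regs     - registers live across one body copy
    avail_regs    - registers the target provides
    """
    # Factor 1: the fixed overhead is paid once per unrolled iteration,
    # so it is spread over `uf` original iterations.
    amortized_overhead = iter_overhead / uf
    # Factor 3: once the unrolled body needs more registers than exist,
    # spills make things worse; model that with a steep linear penalty.
    pressure = live_regs * uf
    spill_penalty = max(0, pressure - avail_regs) * 2.0
    return body_cost + amortized_overhead + spill_penalty

def best_unroll_factor(body_cost, iter_overhead, live_regs, avail_regs,
                       max_uf=8):
    """Pick the unroll factor with the lowest modeled cost."""
    return min(range(1, max_uf + 1),
               key=lambda uf: unroll_cost(body_cost, iter_overhead,
                                          live_regs, avail_regs, uf))
```

With plenty of registers the model unrolls as far as allowed; with a
register-poor target it backs off well before hitting the cap, which is
the "unroll more where it helps, less where it does not" behaviour argued
for above.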


Segher
Segher Boessenkool Feb. 12, 2020, 10:05 p.m. UTC | #17
On Wed, Feb 12, 2020 at 11:53:22AM +0100, Richard Biener wrote:
> On Wed, 12 Feb 2020, Segher Boessenkool wrote:
> > On Wed, Feb 12, 2020 at 09:12:58AM +0100, Richard Biener wrote:
> > > On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> > > > Basic block partitioning has wildly disproportionate fallout in all
> > > > later passes, both in terms of what those *do* (or don't, if partitioning
> > > > is enabled), and of impact on the code (not to mention developer time).
> > > > 
> > > > Maybe the implementation can be improved, but probably we should do this
> > > > in a different way altogether.  The current situation is not good.
> > > 
> > > I think the expectation that you can go back to CFG layout mode
> > > and then work with CFG layout tools after we've lowered to CFG RTL
> > > is simply bogus.
> > 
> > Partitioning is also quite problematic if you do not use cfglayout
> > mode.  For example, in shrink-wrapping.  It prevents a lot there.
> > 
> > > Yeah, you can probably do analysis things but
> > > I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle 
> > > can wreck things.  Undoubtedly doing CFG manipulations is not going
> > > to work since CFG layout does not respect CFG RTL restrictions.
> > 
> > Doing CFG manipulations on CFG RTL mode directly is almost impossible
> > to do correctly.
> > 
> > For example, bb-reorder.  Which is a much more important optimisation
> > than partitioning, btw.
> 
> BB reorder switches back and forth as well ... :/

Yes.  It is extremely hard to change any jumps in cfgrtl mode.

The goal is to use cfglayout mode more, ideally *always*, not to use
it less!

> I think both are closely enough related that we probably should do
> partitioning from within the same framework?  OTOH BB reorder
> happens _much_ later.

This may be doable.  What we need to do first though is to find a better
thing to use than EDGE_CROSSING.

Maybe we can determine which blocks should be hot and cold early, but
actually make that happen only very late (maybe adjusting the decision
if we have to to make things work)?  That way, intervening passes do not
have to care (much).
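
[Editor's note: one way to read this proposal, as a pure sketch with
hypothetical names -- this is not GCC code: record the hot/cold decision
early as advisory metadata, and only materialize (and revalidate) the
section placement at the very end.]

```python
# Sketch of "decide early, commit late" partitioning.  Intervening passes
# see only advisory tags and never a hard EDGE_CROSSING-style barrier.

def decide_partitions(bb_counts, hot_threshold):
    """Early, advisory decision: tag each block hot or cold by count."""
    return {bb: ("hot" if count >= hot_threshold else "cold")
            for bb, count in bb_counts.items()}

def materialize_partitions(advisory, bb_counts, hot_threshold):
    """Very late: re-check the advisory tags (passes in between may have
    changed the blocks or their counts) and emit the final layout."""
    final = {bb: ("hot" if bb_counts[bb] >= hot_threshold else "cold")
             for bb in advisory}
    hot = [bb for bb, part in final.items() if part == "hot"]
    cold = [bb for bb, part in final.items() if part == "cold"]
    return hot, cold
```

The point of the sketch is only the pass ordering: because the early tag is
advisory, a pass like shrink-wrapping is free to move code regardless of the
eventual partition, and the final decision adjusts if needed.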

> > > Partitioning simply uncovered latent bugs, there's nothing wrong
> > > with it IMHO.
> > 
> > I don't agree.  The whole way EDGE_CROSSING works hinders all other
> > optimisations a lot.
> 
> I'm not sure if it's there for correctness reasons or just to
> help checking that nothing "undoes" the partitioning decision.

The latter.  (It may also make it easier for the former, of course, but
that can be solved anyway).  And that makes life very hard for all later
passes, while it is doubtful that it even gives the best decisions: for
example it prevents a lot of shrink-wrapping, which you dearly *want* to
do on cold code!


Segher
Richard Biener Feb. 13, 2020, 7:48 a.m. UTC | #18
On Wed, 12 Feb 2020, Segher Boessenkool wrote:

> On Wed, Feb 12, 2020 at 11:53:22AM +0100, Richard Biener wrote:
> > On Wed, 12 Feb 2020, Segher Boessenkool wrote:
> > > On Wed, Feb 12, 2020 at 09:12:58AM +0100, Richard Biener wrote:
> > > > On Tue, 11 Feb 2020, Segher Boessenkool wrote:
> > > > > Basic block partitioning has wildly disproportionate fallout in all
> > > > > later passes, both in terms of what those *do* (or don't, if partitioning
> > > > > is enabled), and of impact on the code (not to mention developer time).
> > > > > 
> > > > > Maybe the implementation can be improved, but probably we should do this
> > > > > in a different way altogether.  The current situation is not good.
> > > > 
> > > > I think the expectation that you can go back to CFG layout mode
> > > > and then work with CFG layout tools after we've lowered to CFG RTL
> > > > is simply bogus.
> > > 
> > > Partitioning is also quite problematic if you do not use cfglayout
> > > mode.  For example, in shrink-wrapping.  It prevents a lot there.
> > > 
> > > > Yeah, you can probably do analysis things but
> > > > I wouldn't be surprised if a CFG RTL -> CFG layout -> CFG RTL cycle 
> > > > can wreck things.  Undoubtedly doing CFG manipulations is not going
> > > > to work since CFG layout does not respect CFG RTL restrictions.
> > > 
> > > Doing CFG manipulations on CFG RTL mode directly is almost impossible
> > > to do correctly.
> > > 
> > > For example, bb-reorder.  Which is a much more important optimisation
> > > than partitioning, btw.
> > 
> > BB reorder switches back and forth as well ... :/
> 
> Yes.  It is extremely hard to change any jumps in cfgrtl mode.
> 
> The goal is to use cfglayout mode more, ideally *always*, not to use
> it less!

Sure!  It must be that split requires CFG RTL (it doesn't say so, but
we don't have a PROP_rtllayout or a "cannot work with" set of
properties).  Otherwise I'm not sure why we go out of CFG layout
mode so early.  Passes not messing with the CFG at all should be
ignorant of what mode we are in.

> > I think both are closely enough related that we probably should do
> > partitioning from within the same framework?  OTOH BB reorder
> > happens _much_ later.
> 
> This may be doable.  What we need to do first though is to find a better
> thing to use than EDGE_CROSSING.
> 
> Maybe we can determine which blocks should be hot and cold early, but
> actually make that happen only very late (maybe adjusting the decision
> if we have to to make things work)?  That way, intervening passes do not
> have to care (much).

The question is whether we can even preserve things like BB counts
and edge frequencies once we've gone out of CFG layout mode for once.

> > > > Partitioning simply uncovered latent bugs, there's nothing wrong
> > > > with it IMHO.
> > > 
> > > I don't agree.  The whole way EDGE_CROSSING works hinders all other
> > > optimisations a lot.
> > 
> > I'm not sure if it's there for correctness reasons or just to
> > help checking that nothing "undoes" the partitioning decision.
> 
> The latter.  (It may also make it easier for the former, of course, but
> that can be solved anyway).  And that makes life very hard for all later
> passes, while it is doubtful that it even gives the best decisions: for
> example it prevents a lot of shrink-wrapping, which you dearly *want* to
> do on cold code!

Sure.  So do that earlier ;)

Richard.
Segher Boessenkool Feb. 13, 2020, 9:01 a.m. UTC | #19
On Thu, Feb 13, 2020 at 08:48:33AM +0100, Richard Biener wrote:
> On Wed, 12 Feb 2020, Segher Boessenkool wrote:
> > On Wed, Feb 12, 2020 at 11:53:22AM +0100, Richard Biener wrote:
> > > BB reorder switches back and forth as well ... :/
> > 
> > Yes.  It is extremely hard to change any jumps in cfgrtl mode.
> > 
> > The goal is to use cfglayout mode more, ideally *always*, not to use
> > it less!
> 
> Sure!  It must be that split requires CFG RTL (it doesn't say so, but
> we don't have a PROP_rtllayout or a "cannot work with" set of
> properties).  Otherwise I'm not sure why we go out of CFG layout
> mode so early.  Passes not messing with the CFG at all should be
> ignorant of what mode we are in.

Well, split can split jump insns as well, so it currently cannot use
cfglayout mode.

We could probably make split not split jumps, only do that in split2
and later, which is after reload.  Getting IRA and LRA to work with
cfglayout mode will be fun.

Right after split2 is pro_and_epilogue.  Everything after that doesn't
want cfglayout mode I'd say.  So we don't even have to move
out_of_cfglayout all that far, but there are some big nasties it has
to move over.

> > > I think both are closely enough related that we probably should do
> > > partitioning from within the same framework?  OTOH BB reorder
> > > happens _much_ later.
> > 
> > This may be doable.  What we need to do first though is to find a better
> > thing to use than EDGE_CROSSING.
> > 
> > Maybe we can determine which blocks should be hot and cold early, but
> > actually make that happen only very late (maybe adjusting the decision
> > if we have to to make things work)?  That way, intervening passes do not
> > have to care (much).
> 
> The question is whether we can even preserve things like BB counts
> and edge frequencies once we've gone out of CFG layout mode for once.

Why not?  cfglayout mode simply does not include simple branches and
jump instructions, and fallthrough is not necessarily to the next bb in
the insn stream, but that is all (basically).  The cfg is just the same
in both rtl modes.  What destroys the bb and edge frequencies?
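
[Editor's note: a minimal model of why the counts should survive -- these
are hypothetical structures, not GCC's actual cfglayout API.  Edge counts
live on the CFG edges, and cfglayout only drops the constraint that a
fallthrough edge targets the textually next block.]

```python
from dataclasses import dataclass

@dataclass
class Edge:
    src: str
    dst: str
    count: int          # profile count; independent of insn ordering
    fallthru: bool = False

@dataclass
class CFG:
    edges: list

    def bb_count(self, bb):
        """A block's count is the sum of its incoming edge counts."""
        return sum(e.count for e in self.edges if e.dst == bb)

def linearize(cfg, order):
    """Model of leaving cfglayout mode: pick an insn-stream order and
    insert explicit jumps where a fallthru edge no longer targets the
    textually next block.  The counts themselves are never touched."""
    jumps = []
    nxt = {order[i]: order[i + 1] for i in range(len(order) - 1)}
    for e in cfg.edges:
        if e.fallthru and nxt.get(e.src) != e.dst:
            jumps.append((e.src, e.dst))  # would need an explicit jump insn
            e.fallthru = False
    return jumps
```

In this model linearization is purely a property of edges versus block
order, so block and edge frequencies are unchanged by going out of (and
back into) layout mode -- which is the question being raised above.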

> > > > I don't agree.  The whole way EDGE_CROSSING works hinders all other
> > > > optimisations a lot.
> > > 
> > > I'm not sure if it's there for correctness reasons or just to
> > > help checking that nothing "undoes" the partitioning decision.
> > 
> > The latter.  (It may also make it easier for the former, of course, but
> > that can be solved anyway).  And that makes life very hard for all later
> > passes, while it is doubtful that it even gives the best decisions: for
> > example it prevents a lot of shrink-wrapping, which you dearly *want* to
> > do on cold code!
> 
> Sure.  So do that earlier ;)

You cannot.  It has to stay after RA, obviously.  And you cannot move RA
earlier, to before combine.

We do shrink-wrapping (and all other *logue work) pretty much as early
as possible, already.


Segher