[OpenACC,1/11] UNIQUE internal function

Message ID	5627E0DF.9050507@acm.org
State	New
Headers	show Return-Path: <gcc-patches-return-410917-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :subject:to:references:cc:from:message-id:date:mime-version :in-reply-to:content-type; q=dns; s=default; b=m8owhA+1vJusv73jl jv/TApGaXYoT68Eaawa9FEu6coimbMAR2s7JHZjCS42aOPVCGflZbUvlzdhn5Ioj Q5zXBDIyTx6J0tBbSpSNF60mEhFtyK2NT+DscyRdKGa8y9qFNu7YlS1yIbB6eMIC xqo4zX6zZjyaL1m8SNcuRmHLV8= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org Subject: Re: [OpenACC 1/11] UNIQUE internal function To: GCC Patches <gcc-patches@gcc.gnu.org> References: <5627DD78.9040302@acm.org> Cc: Jakub Jelinek <jakub@redhat.com>, Bernd Schmidt <bschmidt@redhat.com>, Jason Merrill <jason@redhat.com>, "Joseph S. Myers" <joseph@codesourcery.com> From: Nathan Sidwell <nathan@acm.org> Message-ID: <5627E0DF.9050507@acm.org> Date: Wed, 21 Oct 2015 15:00:47 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0 MIME-Version: 1.0 In-Reply-To: <5627DD78.9040302@acm.org> Content-Type: multipart/mixed; boundary="------------060905010707070004040206"

Nathan Sidwell Oct. 21, 2015, 7 p.m. UTC

This patch implements a new internal function that has a 'uniqueness' property. 
   Jump-threading cannot clone it and tail-merging cannot combine multiple 
instances.

The uniqueness is implemented by a new gimple fn, gimple_call_internal_unique_p. 
  Routines that check for identical or cloneable calls are augmented to check 
this property.  These are:

* tree-ssa-threadedge, which is figuring out if jump threading is a win.  Jump 
threading is inhibited.

* gimple_call_same_target_p, used for tail merging and similar transforms.  Two 
calls of IFN_UNIQUE will never be  the same target.

* tracer.c, which is determining whether to clone a region.

Interestingly jump threading avoids cloning volatile asms (which it admits is 
conservatively safe), but the tracer does not. I wonder if there's a latent 
problem in tracer?

The reason I needed a function with this property is to  preserve the looping 
structure of a function's CFG.  As mentioned in the intro, we mark up loops 
(using this builtin), so the example I gave has the following inserts:

#pragma acc parallel ...
{
  // single mode here
#pragma acc loop ...
IFN_UNIQUE (FORKING  ...)
for (i = 0; i < N; i++) // loop 1
   ... // partitioned mode here
IFN_UNIQUE (JOINING ...)

if (expr) // single mode here
#pragma acc loop ...
   IFN_UNIQUE (FORKING ...)
   for (i = 0; i < N; i++) // loop 2
     ... // partitioned mode here
   IFN_UNIQUE (JOINING ...)
}

The properly nested loop property of the CFG is preserved through the 
compilation.  This is important as (a) it allows later passes to reconstruct 
this looping structure and (b) hardware constraints require a partioned region 
end for all partitioned threads at a single instruction.

Until I added this unique property, original bring-up  of partitioned execution 
would hit cases of split loops ending in multiple cloned JOINING markers and 
similar cases.

To distinguish different uses of the UNIQUE function, I use the first argument, 
which is expected to be an INTEGER_CST.  I figured this better than using 
multiple new internal fns, all with the unique property, as the latter would 
need (at least) a range check in gimple_call_internal_unique_p rather than a 
simple equality.

Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal fns. 
  This replaces that scheme.

ok?

nathan

Richard Biener Oct. 22, 2015, 7:48 a.m. UTC | #1

On Wed, Oct 21, 2015 at 9:00 PM, Nathan Sidwell <nathan@acm.org> wrote:
> This patch implements a new internal function that has a 'uniqueness'
> property.   Jump-threading cannot clone it and tail-merging cannot combine
> multiple instances.
>
> The uniqueness is implemented by a new gimple fn,
> gimple_call_internal_unique_p.  Routines that check for identical or
> cloneable calls are augmented to check this property.  These are:
>
> * tree-ssa-threadedge, which is figuring out if jump threading is a win.
> Jump threading is inhibited.
>
> * gimple_call_same_target_p, used for tail merging and similar transforms.
> Two calls of IFN_UNIQUE will never be  the same target.
>
> * tracer.c, which is determining whether to clone a region.
>
> Interestingly jump threading avoids cloning volatile asms (which it admits
> is conservatively safe), but the tracer does not. I wonder if there's a
> latent problem in tracer?
>
> The reason I needed a function with this property is to  preserve the
> looping structure of a function's CFG.  As mentioned in the intro, we mark
> up loops (using this builtin), so the example I gave has the following
> inserts:
>
> #pragma acc parallel ...
> {
>  // single mode here
> #pragma acc loop ...
> IFN_UNIQUE (FORKING  ...)
> for (i = 0; i < N; i++) // loop 1
>   ... // partitioned mode here
> IFN_UNIQUE (JOINING ...)
>
> if (expr) // single mode here
> #pragma acc loop ...
>   IFN_UNIQUE (FORKING ...)
>   for (i = 0; i < N; i++) // loop 2
>     ... // partitioned mode here
>   IFN_UNIQUE (JOINING ...)
> }
>
> The properly nested loop property of the CFG is preserved through the
> compilation.  This is important as (a) it allows later passes to reconstruct
> this looping structure and (b) hardware constraints require a partioned
> region end for all partitioned threads at a single instruction.
>
> Until I added this unique property, original bring-up  of partitioned
> execution would hit cases of split loops ending in multiple cloned JOINING
> markers and similar cases.
>
> To distinguish different uses of the UNIQUE function, I use the first
> argument, which is expected to be an INTEGER_CST.  I figured this better
> than using multiple new internal fns, all with the unique property, as the
> latter would need (at least) a range check in gimple_call_internal_unique_p
> rather than a simple equality.
>
> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
> fns.  This replaces that scheme.
>
> ok?

Hmm, I'd just have used gimple_has_volatile_ops on the call?  That
should have the
desired effects.

Richard.

> nathan

Richard Biener Oct. 22, 2015, 7:49 a.m. UTC | #2

On Thu, Oct 22, 2015 at 9:48 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Wed, Oct 21, 2015 at 9:00 PM, Nathan Sidwell <nathan@acm.org> wrote:
>> This patch implements a new internal function that has a 'uniqueness'
>> property.   Jump-threading cannot clone it and tail-merging cannot combine
>> multiple instances.
>>
>> The uniqueness is implemented by a new gimple fn,
>> gimple_call_internal_unique_p.  Routines that check for identical or
>> cloneable calls are augmented to check this property.  These are:
>>
>> * tree-ssa-threadedge, which is figuring out if jump threading is a win.
>> Jump threading is inhibited.
>>
>> * gimple_call_same_target_p, used for tail merging and similar transforms.
>> Two calls of IFN_UNIQUE will never be  the same target.
>>
>> * tracer.c, which is determining whether to clone a region.
>>
>> Interestingly jump threading avoids cloning volatile asms (which it admits
>> is conservatively safe), but the tracer does not. I wonder if there's a
>> latent problem in tracer?
>>
>> The reason I needed a function with this property is to  preserve the
>> looping structure of a function's CFG.  As mentioned in the intro, we mark
>> up loops (using this builtin), so the example I gave has the following
>> inserts:
>>
>> #pragma acc parallel ...
>> {
>>  // single mode here
>> #pragma acc loop ...
>> IFN_UNIQUE (FORKING  ...)
>> for (i = 0; i < N; i++) // loop 1
>>   ... // partitioned mode here
>> IFN_UNIQUE (JOINING ...)
>>
>> if (expr) // single mode here
>> #pragma acc loop ...
>>   IFN_UNIQUE (FORKING ...)
>>   for (i = 0; i < N; i++) // loop 2
>>     ... // partitioned mode here
>>   IFN_UNIQUE (JOINING ...)
>> }
>>
>> The properly nested loop property of the CFG is preserved through the
>> compilation.  This is important as (a) it allows later passes to reconstruct
>> this looping structure and (b) hardware constraints require a partioned
>> region end for all partitioned threads at a single instruction.
>>
>> Until I added this unique property, original bring-up  of partitioned
>> execution would hit cases of split loops ending in multiple cloned JOINING
>> markers and similar cases.
>>
>> To distinguish different uses of the UNIQUE function, I use the first
>> argument, which is expected to be an INTEGER_CST.  I figured this better
>> than using multiple new internal fns, all with the unique property, as the
>> latter would need (at least) a range check in gimple_call_internal_unique_p
>> rather than a simple equality.
>>
>> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
>> fns.  This replaces that scheme.
>>
>> ok?
>
> Hmm, I'd just have used gimple_has_volatile_ops on the call?  That
> should have the
> desired effects.

That is, whatever new IFNs you need are ok, but special-casing them is not
necessary if you properly mark the calls as volatile.

Richard.

> Richard.
>
>> nathan

Jakub Jelinek Oct. 22, 2015, 7:59 a.m. UTC | #3

On Thu, Oct 22, 2015 at 09:49:29AM +0200, Richard Biener wrote:
> >> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
> >> fns.  This replaces that scheme.
> >>
> >> ok?
> >
> > Hmm, I'd just have used gimple_has_volatile_ops on the call?  That
> > should have the
> > desired effects.
> 
> That is, whatever new IFNs you need are ok, but special-casing them is not
> necessary if you properly mark the calls as volatile.

I don't see gimple_has_volatile_ops used in tracer.c or
tree-ssa-threadedge.c.  Setting gimple_has_volatile_ops on those IFNs is
fine, but I think they are even stronger than that.

	Jakub

Jakub Jelinek Oct. 22, 2015, 8:04 a.m. UTC | #4

On Wed, Oct 21, 2015 at 03:00:47PM -0400, Nathan Sidwell wrote:
> To distinguish different uses of the UNIQUE function, I use the first
> argument, which is expected to be an INTEGER_CST.  I figured this better
> than using multiple new internal fns, all with the unique property, as the
> latter would need (at least) a range check in gimple_call_internal_unique_p
> rather than a simple equality.
> 
> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
> fns.  This replaces that scheme.
> 
> ok?
> 
> nathan

> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
> 	    Cesar Philippidis  <cesar@codesourcery.com>
> 	
> 	* internal-fn.c (expand_UNIQUE): New.
> 	* internal-fn.def (IFN_UNIQUE): New.
> 	(IFN_UNIQUE_UNSPEC): Define.
> 	* gimple.h (gimple_call_internal_unique_p): New.
> 	* gimple.c (gimple_call_same_target_p): Check internal fn
> 	uniqueness.
> 	* tracer.c (ignore_bb_p): Check for IFN_UNIQUE call.
> 	* tree-ssa-threadedge.c
> 	(record_temporary_equivalences_from_stmts): Likewise.

This is generally fine with me, but please work with Richi to find
something acceptable to him too.

> +DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW | ECF_LEAF, NULL)

Are you sure about the ECF_LEAF?  I mean, while the function can't
call back to your code, I'd expect you want it as kind of strong
optimization barrier too.

> +#define IFN_UNIQUE_UNSPEC 0  /* Undifferentiated UNIQUE.  */
> Index: tracer.c
> ===================================================================
> --- tracer.c	(revision 229096)
> +++ tracer.c	(working copy)
> @@ -93,6 +93,7 @@ bb_seen_p (basic_block bb)
>  static bool
>  ignore_bb_p (const_basic_block bb)
>  {
> +  gimple_stmt_iterator gsi;
>    gimple *g;
>  
>    if (bb->index < NUM_FIXED_BLOCKS)
> @@ -106,6 +107,17 @@ ignore_bb_p (const_basic_block bb)
>    if (g && gimple_code (g) == GIMPLE_TRANSACTION)
>      return true;
>  
> +  /* Ignore blocks containing non-clonable function calls.  */
> +  for (gsi = gsi_start_bb (CONST_CAST_BB (bb));
> +       !gsi_end_p (gsi); gsi_next (&gsi))
> +    {
> +      g = gsi_stmt (gsi);
> +
> +      if (is_gimple_call (g) && gimple_call_internal_p (g)
> +	  && gimple_call_internal_unique_p (as_a <gcall *> (g)))
> +	return true;
> +    }

Do you have to scan the whole bb?  E.g. don't or should not those
unique IFNs force end of bb?

	Jakub

Richard Biener Oct. 22, 2015, 8:05 a.m. UTC | #5

On Thu, Oct 22, 2015 at 9:59 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Thu, Oct 22, 2015 at 09:49:29AM +0200, Richard Biener wrote:
>> >> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
>> >> fns.  This replaces that scheme.
>> >>
>> >> ok?
>> >
>> > Hmm, I'd just have used gimple_has_volatile_ops on the call?  That
>> > should have the
>> > desired effects.
>>
>> That is, whatever new IFNs you need are ok, but special-casing them is not
>> necessary if you properly mark the calls as volatile.
>
> I don't see gimple_has_volatile_ops used in tracer.c or
> tree-ssa-threadedge.c.  Setting gimple_has_volatile_ops on those IFNs is
> fine, but I think they are even stronger than that.

Hmm, indeed.  Now I fail to see how the implemented property "preserves
the CFG looping structure".  And I would have expected can_copy_bbs_p
to be adjusted instead (catching more cases and the threading and tracer
case as well).

As far as I can see nothing would prevent dissolving the loop by completely
unolling it for example.  Or deleting it because it has no side-effects.

So you'd need to be more precise as to what properties you are trying to
preserve by placing a single stmt somewhere.

Richard.

>         Jakub

Richard Biener Oct. 22, 2015, 8:07 a.m. UTC | #6

On Thu, Oct 22, 2015 at 10:04 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Oct 21, 2015 at 03:00:47PM -0400, Nathan Sidwell wrote:
>> To distinguish different uses of the UNIQUE function, I use the first
>> argument, which is expected to be an INTEGER_CST.  I figured this better
>> than using multiple new internal fns, all with the unique property, as the
>> latter would need (at least) a range check in gimple_call_internal_unique_p
>> rather than a simple equality.
>>
>> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such distinct internal
>> fns.  This replaces that scheme.
>>
>> ok?
>>
>> nathan
>
>> 2015-10-20  Nathan Sidwell  <nathan@codesourcery.com>
>>           Cesar Philippidis  <cesar@codesourcery.com>
>>
>>       * internal-fn.c (expand_UNIQUE): New.
>>       * internal-fn.def (IFN_UNIQUE): New.
>>       (IFN_UNIQUE_UNSPEC): Define.
>>       * gimple.h (gimple_call_internal_unique_p): New.
>>       * gimple.c (gimple_call_same_target_p): Check internal fn
>>       uniqueness.
>>       * tracer.c (ignore_bb_p): Check for IFN_UNIQUE call.
>>       * tree-ssa-threadedge.c
>>       (record_temporary_equivalences_from_stmts): Likewise.
>
> This is generally fine with me, but please work with Richi to find
> something acceptable to him too.
>
>> +DEF_INTERNAL_FN (UNIQUE, ECF_NOTHROW | ECF_LEAF, NULL)
>
> Are you sure about the ECF_LEAF?  I mean, while the function can't
> call back to your code, I'd expect you want it as kind of strong
> optimization barrier too.
>
>> +#define IFN_UNIQUE_UNSPEC 0  /* Undifferentiated UNIQUE.  */
>> Index: tracer.c
>> ===================================================================
>> --- tracer.c  (revision 229096)
>> +++ tracer.c  (working copy)
>> @@ -93,6 +93,7 @@ bb_seen_p (basic_block bb)
>>  static bool
>>  ignore_bb_p (const_basic_block bb)
>>  {
>> +  gimple_stmt_iterator gsi;
>>    gimple *g;
>>
>>    if (bb->index < NUM_FIXED_BLOCKS)
>> @@ -106,6 +107,17 @@ ignore_bb_p (const_basic_block bb)
>>    if (g && gimple_code (g) == GIMPLE_TRANSACTION)
>>      return true;
>>
>> +  /* Ignore blocks containing non-clonable function calls.  */
>> +  for (gsi = gsi_start_bb (CONST_CAST_BB (bb));
>> +       !gsi_end_p (gsi); gsi_next (&gsi))
>> +    {
>> +      g = gsi_stmt (gsi);
>> +
>> +      if (is_gimple_call (g) && gimple_call_internal_p (g)
>> +       && gimple_call_internal_unique_p (as_a <gcall *> (g)))
>> +     return true;
>> +    }
>
> Do you have to scan the whole bb?  E.g. don't or should not those
> unique IFNs force end of bb?

Yeah, please make them either end or start a BB so we have to check
at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
it also makes it a code motion barrier.

Richard.

>         Jakub

Julian Brown Oct. 22, 2015, 11:10 a.m. UTC | #7

On Thu, 22 Oct 2015 10:05:30 +0200
Richard Biener <richard.guenther@gmail.com> wrote:

> On Thu, Oct 22, 2015 at 9:59 AM, Jakub Jelinek <jakub@redhat.com>
> wrote:
> > On Thu, Oct 22, 2015 at 09:49:29AM +0200, Richard Biener wrote:  
> >> >> Jakub, IYR I originally had IFN_FORK and IFN_JOIN as such
> >> >> distinct internal fns.  This replaces that scheme.
> >> >>
> >> >> ok?  
> >> >
> >> > Hmm, I'd just have used gimple_has_volatile_ops on the call?
> >> > That should have the
> >> > desired effects.  
> >>
> >> That is, whatever new IFNs you need are ok, but special-casing
> >> them is not necessary if you properly mark the calls as volatile.  
> >
> > I don't see gimple_has_volatile_ops used in tracer.c or
> > tree-ssa-threadedge.c.  Setting gimple_has_volatile_ops on those
> > IFNs is fine, but I think they are even stronger than that.  
> 
> Hmm, indeed.  Now I fail to see how the implemented property
> "preserves the CFG looping structure".  And I would have expected
> can_copy_bbs_p to be adjusted instead (catching more cases and the
> threading and tracer case as well).
> 
> As far as I can see nothing would prevent dissolving the loop by
> completely unolling it for example.  Or deleting it because it has no
> side-effects.
> 
> So you'd need to be more precise as to what properties you are trying
> to preserve by placing a single stmt somewhere.

FWIW an earlier, abandoned attempt at solving the same problem was
discussed in the following thread, continuing through June:

  https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02647.html

Though the details of lowering of OpenACC constructs have changed with
Nathan's current patches, the underlying problem remains the same. PTX
requires certain operations (bar.sync) to be executed uniformly by all
threads in a CTA. IIUC this affects "JOIN" points across all
workers/vectors in a gang, in particular (though this is generic code,
other -- particularly GPU -- targets may have similar restrictions).

HTH,

Julian

Nathan Sidwell Oct. 22, 2015, 1:02 p.m. UTC | #8

On 10/22/15 04:07, Richard Biener wrote:
> On Thu, Oct 22, 2015 at 10:04 AM, Jakub Jelinek <jakub@redhat.com> wrote:

>> Do you have to scan the whole bb?  E.g. don't or should not those
>> unique IFNs force end of bb?
>
> Yeah, please make them either end or start a BB so we have to check
> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
> it also makes it a code motion barrier.

Thanks, I'd not thought of doing it like that.  Will try.

nathan

Nathan Sidwell Oct. 22, 2015, 1:08 p.m. UTC | #9

On 10/22/15 07:10, Julian Brown wrote:
> On Thu, 22 Oct 2015 10:05:30 +0200
> Richard Biener <richard.guenther@gmail.com> wrote:

>> So you'd need to be more precise as to what properties you are trying
>> to preserve by placing a single stmt somewhere.
>
> FWIW an earlier, abandoned attempt at solving the same problem was
> discussed in the following thread, continuing through June:
>
>    https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02647.html
>
> Though the details of lowering of OpenACC constructs have changed with
> Nathan's current patches, the underlying problem remains the same. PTX
> requires certain operations (bar.sync) to be executed uniformly by all
> threads in a CTA. IIUC this affects "JOIN" points across all
> workers/vectors in a gang, in particular (though this is generic code,
> other -- particularly GPU -- targets may have similar restrictions).


Richard, does  this answer your question?

nathan

Jakub Jelinek Oct. 22, 2015, 1:17 p.m. UTC | #10

On Thu, Oct 22, 2015 at 09:08:30AM -0400, Nathan Sidwell wrote:
> On 10/22/15 07:10, Julian Brown wrote:
> >On Thu, 22 Oct 2015 10:05:30 +0200
> >Richard Biener <richard.guenther@gmail.com> wrote:
> 
> >>So you'd need to be more precise as to what properties you are trying
> >>to preserve by placing a single stmt somewhere.
> >
> >FWIW an earlier, abandoned attempt at solving the same problem was
> >discussed in the following thread, continuing through June:
> >
> >   https://gcc.gnu.org/ml/gcc-patches/2015-05/msg02647.html
> >
> >Though the details of lowering of OpenACC constructs have changed with
> >Nathan's current patches, the underlying problem remains the same. PTX
> >requires certain operations (bar.sync) to be executed uniformly by all
> >threads in a CTA. IIUC this affects "JOIN" points across all
> >workers/vectors in a gang, in particular (though this is generic code,
> >other -- particularly GPU -- targets may have similar restrictions).
> 
> 
> Richard, does  this answer your question?

I agree with Richard that it would be better to write more about what kind
of IL changes are acceptable with IFN_UNIQUE in the IL and what are not.
E.g. is inlining ok (I'd hope yes)?  Is function splitting ok (bet as long
as all IFN_UNIQUE calls stay in one or the other part, but not both)?
Various loop optimization, ...

	Jakub

Nathan Sidwell Oct. 22, 2015, 1:24 p.m. UTC | #11

On 10/22/15 09:17, Jakub Jelinek wrote:
> On Thu, Oct 22, 2015 at 09:08:30AM -0400, Nathan Sidwell wrote:

> I agree with Richard that it would be better to write more about what kind
> of IL changes are acceptable with IFN_UNIQUE in the IL and what are not.
> E.g. is inlining ok (I'd hope yes)?  Is function splitting ok (bet as long
> as all IFN_UNIQUE calls stay in one or the other part, but not both)?

Essentially, yes.  a set of IFN_UNIQUE form a group  which must not be separated 
  from each other.  The set is discovered implicitly by following the CFG 
(though I suppose we could add an identifying INT_CST operand or something 
equivalent).

nathan

Nathan Sidwell Oct. 22, 2015, 2:01 p.m. UTC | #12

On 10/22/15 04:07, Richard Biener wrote:

> Yeah, please make them either end or start a BB so we have to check
> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
> it also makes it a code motion barrier.

Just so I'm clear, you're not saying that RETURNS_TWICE will stop the call being 
duplicated though?

thinking a little further, a code motion barrier is stronger than I need (but 
conservatively safe).  For instance:

UNIQUE (HEAD)
for (...)
{
   a = <loop_invariant_expr>
}
UNIQUE (TAIL)

It would be safe and desirable to move that loop invariant to before the UNIQUE. 
  Perhaps it won't matter in practice -- after all having N physical threads 
calculate it in parallel (just after the HEAD marker, but before the loop) will 
probably take no longer than a single thread doing it while the others wait.[*]

nathan

[*] ut it will take more power.

Richard Biener Oct. 22, 2015, 2:26 p.m. UTC | #13

On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
> On 10/22/15 04:07, Richard Biener wrote:
>
>> Yeah, please make them either end or start a BB so we have to check
>> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
>> it also makes it a code motion barrier.
>
>
> Just so I'm clear, you're not saying that RETURNS_TWICE will stop the call
> being duplicated though?

It will in practice.  RETURNS_TWICE will get you an abnormal edge from
entry (I think)

> thinking a little further, a code motion barrier is stronger than I need
> (but conservatively safe).  For instance:
>
> UNIQUE (HEAD)
> for (...)
> {
>   a = <loop_invariant_expr>
> }
> UNIQUE (TAIL)
>
> It would be safe and desirable to move that loop invariant to before the
> UNIQUE.  Perhaps it won't matter in practice -- after all having N physical
> threads calculate it in parallel (just after the HEAD marker, but before the
> loop) will probably take no longer than a single thread doing it while the
> others wait.[*]

RETURNS_TWICE will make the invariant motion stop at UNIQUE (HEAD),
but it would have done that anyway.  It will also be a CSE barrier, thus

tem = global;
UNIQUE(HEAD)
tem2 = global;

will not CSE tem2 to tem.

Richard.

> nathan
>
> [*] ut it will take more power.

Richard Biener Oct. 22, 2015, 2:30 p.m. UTC | #14

On Thu, Oct 22, 2015 at 3:24 PM, Nathan Sidwell <nathan@acm.org> wrote:
> On 10/22/15 09:17, Jakub Jelinek wrote:
>>
>> On Thu, Oct 22, 2015 at 09:08:30AM -0400, Nathan Sidwell wrote:
>
>
>> I agree with Richard that it would be better to write more about what kind
>> of IL changes are acceptable with IFN_UNIQUE in the IL and what are not.
>> E.g. is inlining ok (I'd hope yes)?  Is function splitting ok (bet as long
>> as all IFN_UNIQUE calls stay in one or the other part, but not both)?
>
>
> Essentially, yes.  a set of IFN_UNIQUE form a group  which must not be
> separated  from each other.  The set is discovered implicitly by following
> the CFG (though I suppose we could add an identifying INT_CST operand or
> something equivalent).

I don't see how this is achieved though.  To achieve this you'd need data
dependences between them, sth like

token_1 = IFN_UNIQUE (HEAD);
...
token_2 = IFN_UNIQUE (TAIL, token_1);

not sure if that is enough (what is "separate from each other"?), for example
partial inlining might simply pass token_1 to the split part where only
IFN_UNIQUE (TAIL, token_1) would be in.  At least the above provides
ordering between the two IFN calls (which you achieve by having VDEFs
I guess, but then they are also barriers for memory optimizations).

Richard.

> nathan

Nathan Sidwell Oct. 22, 2015, 2:30 p.m. UTC | #15

On 10/22/15 10:26, Richard Biener wrote:
> On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:

> RETURNS_TWICE will make the invariant motion stop at UNIQUE (HEAD),
> but it would have done that anyway.  It will also be a CSE barrier, thus
>
> tem = global;
> UNIQUE(HEAD)
> tem2 = global;
>
> will not CSE tem2 to tem.

Yes, I can see it would behave like that for something globally visible.  What 
about state that isn't so visible?  (perhaps I'm worrying about something that 
doesn't matter, but I'd like to understand)

nathan

Nathan Sidwell Oct. 22, 2015, 2:42 p.m. UTC | #16

On 10/22/15 10:30, Richard Biener wrote:
> On Thu, Oct 22, 2015 at 3:24 PM, Nathan Sidwell <nathan@acm.org> wrote:
>>
>> Essentially, yes.  a set of IFN_UNIQUE form a group  which must not be
>> separated  from each other.  The set is discovered implicitly by following
>> the CFG (though I suppose we could add an identifying INT_CST operand or
>> something equivalent).
>
> I don't see how this is achieved though.

Well, in practice it does.

>  To achieve this you'd need data
> dependences between them, sth like
>
> token_1 = IFN_UNIQUE (HEAD);
> ...
> token_2 = IFN_UNIQUE (TAIL, token_1);
>
> not sure if that is enough (what is "separate from each other"?), for example
> partial inlining might simply pass token_1 to the split part where only
> IFN_UNIQUE (TAIL, token_1) would be in.

Yeah, such partial inlining will break.  Not encountered it happening though.

> At least the above provides
> ordering between the two IFN calls (which you achieve by having VDEFs
> I guess, but then they are also barriers for memory optimizations).

Right, I think I'm relying on the compiler's lack of knowledge about what global 
state might be affected by the two calls to prevent it reordering them WRT 
eachother.  Is that what you meant?

(I did wonder about the need to add the kind of data dependency you describe, 
but found it unnecessary.)

nathan

Nathan Sidwell Oct. 22, 2015, 5:37 p.m. UTC | #17

On 10/22/15 04:07, Richard Biener wrote:

> Yeah, please make them either end or start a BB so we have to check
> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
> it also makes it a code motion barrier.

I'm having a hard time making  UNIQUE the end of a BB.

I'm emitting code to a gimple sequence, which later gets processed by the OMP 
machinery.  Just doing that doesn't cause the block to be split after the 
ECF_RETURNS_TWICE function.

In all the below, the label is generated by:

   tree label = create_artificial_label (loc);
   FORCED_LABEL (label) = 1;

I tried doing
   UNIQUE (...)
   goto label
   label:

but that is apparently optimized away, leaving UNIQUE in the middle of a bb. 
Next I tried:

   UNIQUE (..., &label)
   goto label
label:

but the goto is elided and label migrates to the start of the bb, again leaving 
UNIQUE in the middle.

Any suggestions?

nathan

Nathan Sidwell Oct. 22, 2015, 6:06 p.m. UTC | #18

On 10/22/15 10:26, Richard Biener wrote:
> On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
>> On 10/22/15 04:07, Richard Biener wrote:
>>
>>> Yeah, please make them either end or start a BB so we have to check
>>> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
>>> it also makes it a code motion barrier.
>>
>>
>> Just so I'm clear, you're not saying that RETURNS_TWICE will stop the call
>> being duplicated though?
>
> It will in practice.  RETURNS_TWICE will get you an abnormal edge from
> entry (I think)

Won't that interfere with the OMP  machinery, which expects correctly nested 
loops?  (no in-to or out-of loop jumps)

nathan

Nathan Sidwell Oct. 22, 2015, 8:17 p.m. UTC | #19

On 10/22/15 04:04, Jakub Jelinek wrote:

>> +  /* Ignore blocks containing non-clonable function calls.  */
>> +  for (gsi = gsi_start_bb (CONST_CAST_BB (bb));
>> +       !gsi_end_p (gsi); gsi_next (&gsi))
>> +    {
>> +      g = gsi_stmt (gsi);
>> +
>> +      if (is_gimple_call (g) && gimple_call_internal_p (g)
>> +	  && gimple_call_internal_unique_p (as_a <gcall *> (g)))
>> +	return true;
>> +    }
>
> Do you have to scan the whole bb?  E.g. don't or should not those
> unique IFNs force end of bb?

What about adding a flag to struct function?

   /* Nonzero if this function contains IFN_UNIQUE markers.  */
   unsigned int has_unique_calls : 1;

Then the tracer could either skip it, or do the search?

(I notice there are cilk flags already in struct function, instead of the above, 
we could add an openacc-specific one with  a similar behaviour?)

nathan

Jakub Jelinek Oct. 23, 2015, 8 a.m. UTC | #20

On Thu, Oct 22, 2015 at 04:17:32PM -0400, Nathan Sidwell wrote:
> On 10/22/15 04:04, Jakub Jelinek wrote:
> 
> >>+  /* Ignore blocks containing non-clonable function calls.  */
> >>+  for (gsi = gsi_start_bb (CONST_CAST_BB (bb));
> >>+       !gsi_end_p (gsi); gsi_next (&gsi))
> >>+    {
> >>+      g = gsi_stmt (gsi);
> >>+
> >>+      if (is_gimple_call (g) && gimple_call_internal_p (g)
> >>+	  && gimple_call_internal_unique_p (as_a <gcall *> (g)))
> >>+	return true;
> >>+    }
> >
> >Do you have to scan the whole bb?  E.g. don't or should not those
> >unique IFNs force end of bb?
> 
> What about adding a flag to struct function?
> 
>   /* Nonzero if this function contains IFN_UNIQUE markers.  */
>   unsigned int has_unique_calls : 1;
> 
> Then the tracer could either skip it, or do the search?
> 
> (I notice there are cilk flags already in struct function, instead of the
> above, we could add an openacc-specific one with  a similar behaviour?)

If you want to force end of a BB after the IFN_UNIQUE call, then you can just
gimple_call_set_ctrl_altering (gcall, true);
on it, and probably tweak gimple_call_initialize_ctrl_altering
so that it does that by default.  Plus of course split the blocks after it
when you emit it.

	Jakub

Jakub Jelinek Oct. 23, 2015, 8:25 a.m. UTC | #21

On Thu, Oct 22, 2015 at 02:06:54PM -0400, Nathan Sidwell wrote:
> On 10/22/15 10:26, Richard Biener wrote:
> >On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
> >>On 10/22/15 04:07, Richard Biener wrote:
> >>
> >>>Yeah, please make them either end or start a BB so we have to check
> >>>at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
> >>>it also makes it a code motion barrier.
> >>
> >>
> >>Just so I'm clear, you're not saying that RETURNS_TWICE will stop the call
> >>being duplicated though?
> >
> >It will in practice.  RETURNS_TWICE will get you an abnormal edge from
> >entry (I think)
> 
> Won't that interfere with the OMP  machinery, which expects correctly nested
> loops?  (no in-to or out-of loop jumps)

I bet it will, the region with the abnormal edges is no longer SESE.

	Jakub

Richard Biener Oct. 23, 2015, 9:29 a.m. UTC | #22

On Thu, Oct 22, 2015 at 8:06 PM, Nathan Sidwell <nathan@acm.org> wrote:
> On 10/22/15 10:26, Richard Biener wrote:
>>
>> On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
>>>
>>> On 10/22/15 04:07, Richard Biener wrote:
>>>
>>>> Yeah, please make them either end or start a BB so we have to check
>>>> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
>>>> it also makes it a code motion barrier.
>>>
>>>
>>>
>>> Just so I'm clear, you're not saying that RETURNS_TWICE will stop the
>>> call
>>> being duplicated though?
>>
>>
>> It will in practice.  RETURNS_TWICE will get you an abnormal edge from
>> entry (I think)
>
>
> Won't that interfere with the OMP  machinery, which expects correctly nested
> loops?  (no in-to or out-of loop jumps)

Probably yes.

> nathan

Nathan Sidwell Oct. 23, 2015, 12:57 p.m. UTC | #23

On 10/23/15 04:25, Jakub Jelinek wrote:
> On Thu, Oct 22, 2015 at 02:06:54PM -0400, Nathan Sidwell wrote:
>> On 10/22/15 10:26, Richard Biener wrote:
>>> On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
>>>> On 10/22/15 04:07, Richard Biener wrote:
>>>>
>>>>> Yeah, please make them either end or start a BB so we have to check
>>>>> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
>>>>> it also makes it a code motion barrier.
>>>>
>>>>
>>>> Just so I'm clear, you're not saying that RETURNS_TWICE will stop the call
>>>> being duplicated though?
>>>
>>> It will in practice.  RETURNS_TWICE will get you an abnormal edge from
>>> entry (I think)
>>
>> Won't that interfere with the OMP  machinery, which expects correctly nested
>> loops?  (no in-to or out-of loop jumps)
>
> I bet it will, the region with the abnormal edges is no longer SESE.

Hm, it seems like a bad plan to try RETURNS_TWICE then.


> If you want to force end of a BB after the IFN_UNIQUE call, then you can just
> gimple_call_set_ctrl_altering (gcall, true);
> on it, and probably tweak gimple_call_initialize_ctrl_altering
> so that it does that by default.  Plus of course split the blocks after it
> when you emit it.

IIUC this won't require RETURNS_TWICE, correct?  We're generate these seqs to a 
gimple sequence that eventually gets attached to the graph at the end of lower 
omp_for with:

   gimple_bind_set_body (new_stmt, body);
   gimple_omp_set_body (stmt, NULL);
   gimple_omp_for_set_pre_body (stmt, NULL);

Presumably that sequence will have to be split in the manner you describe 
somewhere else.  Not sure where that might be?

Any thoughts on the approach of adding a flag to struct function, and having 
tracer to skip such functions?

nathan

Richard Biener Oct. 23, 2015, 1:03 p.m. UTC | #24

On Fri, Oct 23, 2015 at 2:57 PM, Nathan Sidwell <nathan@acm.org> wrote:
> On 10/23/15 04:25, Jakub Jelinek wrote:
>>
>> On Thu, Oct 22, 2015 at 02:06:54PM -0400, Nathan Sidwell wrote:
>>>
>>> On 10/22/15 10:26, Richard Biener wrote:
>>>>
>>>> On Thu, Oct 22, 2015 at 4:01 PM, Nathan Sidwell <nathan@acm.org> wrote:
>>>>>
>>>>> On 10/22/15 04:07, Richard Biener wrote:
>>>>>
>>>>>> Yeah, please make them either end or start a BB so we have to check
>>>>>> at most a single stmt.  ECF_RETURNS_TWICE should achieve that,
>>>>>> it also makes it a code motion barrier.
>>>>>
>>>>>
>>>>>
>>>>> Just so I'm clear, you're not saying that RETURNS_TWICE will stop the
>>>>> call
>>>>> being duplicated though?
>>>>
>>>>
>>>> It will in practice.  RETURNS_TWICE will get you an abnormal edge from
>>>> entry (I think)
>>>
>>>
>>> Won't that interfere with the OMP  machinery, which expects correctly
>>> nested
>>> loops?  (no in-to or out-of loop jumps)
>>
>>
>> I bet it will, the region with the abnormal edges is no longer SESE.
>
>
> Hm, it seems like a bad plan to try RETURNS_TWICE then.
>
>
>> If you want to force end of a BB after the IFN_UNIQUE call, then you can
>> just
>> gimple_call_set_ctrl_altering (gcall, true);
>> on it, and probably tweak gimple_call_initialize_ctrl_altering
>> so that it does that by default.  Plus of course split the blocks after it
>> when you emit it.
>
>
> IIUC this won't require RETURNS_TWICE, correct?  We're generate these seqs
> to a gimple sequence that eventually gets attached to the graph at the end
> of lower omp_for with:
>
>   gimple_bind_set_body (new_stmt, body);
>   gimple_omp_set_body (stmt, NULL);
>   gimple_omp_for_set_pre_body (stmt, NULL);
>
> Presumably that sequence will have to be split in the manner you describe
> somewhere else.  Not sure where that might be?
>
> Any thoughts on the approach of adding a flag to struct function, and having
> tracer to skip such functions?

It's a hack.  I don't like hacks.  I think the requirement "don't duplicate me"
but inlining is ok is somewhat broken.  The requirement seems to be
sth like the "important" paris of such functions need to dominate/post-dominate
each other (technically not even in the same function)?

Richard.

> nathan

Jakub Jelinek Oct. 23, 2015, 1:03 p.m. UTC | #25

On Fri, Oct 23, 2015 at 08:57:17AM -0400, Nathan Sidwell wrote:
> >If you want to force end of a BB after the IFN_UNIQUE call, then you can just
> >gimple_call_set_ctrl_altering (gcall, true);
> >on it, and probably tweak gimple_call_initialize_ctrl_altering
> >so that it does that by default.  Plus of course split the blocks after it
> >when you emit it.
> 
> IIUC this won't require RETURNS_TWICE, correct?  We're generate these seqs

It doesn't require that, sure.

> to a gimple sequence that eventually gets attached to the graph at the end
> of lower omp_for with:
> 
>   gimple_bind_set_body (new_stmt, body);
>   gimple_omp_set_body (stmt, NULL);
>   gimple_omp_for_set_pre_body (stmt, NULL);
> 
> Presumably that sequence will have to be split in the manner you describe
> somewhere else.  Not sure where that might be?

If this is during the omplower pass, then it is before cfg pass and
therefore all you need is tweak the gimple_call_initialize_ctrl_altering
function and the cfg pass will DTRT.

> Any thoughts on the approach of adding a flag to struct function, and having
> tracer to skip such functions?

It could still be expensive if functions with that flag set contain very
large basic blocks.

	Jakub

Nathan Sidwell Oct. 23, 2015, 1:13 p.m. UTC | #26

On 10/23/15 09:03, Richard Biener wrote:

> It's a hack.  I don't like hacks.

One person's hack can be another person's pragmatism :)

>  I think the requirement "don't duplicate me"
> but inlining is ok is somewhat broken.

The requirement is that the SESE region formed by the markers remains as an SESE 
region with those markers as the entry & exit paths. We don't have a way of 
expressing exactly that in the compiler.  What we do have is the ability to say 
'don't duplicate this insn'.

> The requirement seems to be
> sth like the "important" paris of such functions need to dominate/post-dominate
> each other (technically not even in the same function)?

You're correct that the SESE region could be split across a function boundary in 
the manner you describe, but the  complexity of dealing with that in the 
backend's partitioning code would be high.  Let's not try and enable that from 
the get-go.

nathan

Jakub Jelinek Oct. 23, 2015, 1:16 p.m. UTC | #27

On Fri, Oct 23, 2015 at 09:13:43AM -0400, Nathan Sidwell wrote:
> You're correct that the SESE region could be split across a function
> boundary in the manner you describe, but the  complexity of dealing with
> that in the backend's partitioning code would be high.  Let's not try and
> enable that from the get-go.

Sure, but then you probably need to tweak the fnsplit pass to guarantee
that.

	Jakub

Nathan Sidwell Oct. 23, 2015, 1:16 p.m. UTC | #28

On 10/23/15 09:03, Jakub Jelinek wrote:
> On Fri, Oct 23, 2015 at 08:57:17AM -0400, Nathan Sidwell wrote:

> If this is during the omplower pass, then it is before cfg pass and
> therefore all you need is tweak the gimple_call_initialize_ctrl_altering
> function and the cfg pass will DTRT.

ok, thanks

nathan

Nathan Sidwell Oct. 23, 2015, 2:40 p.m. UTC | #29

On 10/23/15 09:16, Jakub Jelinek wrote:
> On Fri, Oct 23, 2015 at 09:13:43AM -0400, Nathan Sidwell wrote:
>> You're correct that the SESE region could be split across a function
>> boundary in the manner you describe, but the  complexity of dealing with
>> that in the backend's partitioning code would be high.  Let's not try and
>> enable that from the get-go.
>
> Sure, but then you probably need to tweak the fnsplit pass to guarantee
> that.

Ok, I'll take a look at that too.

The gimple_call_set_ctrl_altering approach is looking good for the moment.

Richard, if that works out, so we only have to check unique_p on the last insn 
of a bb, does that satisfy your concerns?  (Of course I'll repost patch 1 for 
review).

WRT the other patches I think the status is:

01-trunk-unique.patch
   Internal function with a 'uniqueness' property
   * reworking as described.
02-trunk-nvptx-partition.patch
   NVPTX backend patch set for partitioned execution
   * approved with minor edits
03-trunk-hook.patch
   OpenACC hook
   * approved with minor edit
04-trunk-c.patch
   C FE changes
   * Being addressed by Cesar
05-trunk-cxx.patch
   C++ FE changes
   * Being addressed by Cesar
06-trunk-red-init.patch
   Placeholder to keep reductions functioning
   * Approved
07-trunk-loop-mark.patch
   Annotate OpenACC loops in device-agnostic manner
   * Addressing minor comments
08-trunk-dev-lower.patch
   Device-specific lowering of loop markers
   * Question asked & answered about non-ptx behaviour
09-trunk-lower-gate.patch
   Run oacc_device_lower pass regardless of errors
   * Approved
10-trunk-libgomp.patch
   Libgomp change (remove dimension check)
   * Approved
11-trunk-tests.patch
   Initial set of execution tests
   * Approved, but C& C++ error tests needed

I'll repost:
01-trunk-unique.patch
   Internal function with a 'uniqueness' property

That has some obvious knock on changes to 02, 07 and 08, do you want those 
reposted for review?

Cesar will repost:
04-trunk-c.patch
   C FE changes
05-trunk-cxx.patch
   C++ FE changes

The remaining patch:
08-trunk-dev-lower.patch
   Device-specific lowering of loop markers

seems to be waiting on Jakub?

nathan

[OpenACC,1/11] UNIQUE internal function

Commit Message

Comments

Patch