diff mbox

Handle BUILT_IN_GOACC_PARALLEL in ipa-pta

Message ID 565DADE6.8020908@mentor.com
State New
Headers show

Commit Message

Tom de Vries Dec. 1, 2015, 2:25 p.m. UTC
[ was: Re: [PATCH, PR46032] Handle BUILT_IN_GOMP_PARALLEL in ipa-pta ]

On 30/11/15 17:36, Tom de Vries wrote:
> On 30/11/15 14:24, Richard Biener wrote:
>> On Mon, 30 Nov 2015, Tom de Vries wrote:
>>
>>> On 30/11/15 10:16, Richard Biener wrote:
>>>> On Mon, 30 Nov 2015, Tom de Vries wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> this patch fixes PR46032.
>>>>>
>>>>> It handles a call:
>>>>> ...
>>>>>     __builtin_GOMP_parallel (fn, data, num_threads, flags)
>>>>> ...
>>>>> as:
>>>>> ...
>>>>>     fn (data)
>>>>> ...
>>>>> in ipa-pta.
>>>>>
>>>>> This improves ipa-pta alias analysis in the parallelized function
>>>>> fn,

This follow-up patch does the same for BUILT_IN_GOACC_PARALLEL.

Bootstrapped and reg-tested on x86_64.

OK for stage3 trunk?

Thanks,
- Tom

Comments

Richard Biener Dec. 1, 2015, 2:38 p.m. UTC | #1
On Tue, 1 Dec 2015, Tom de Vries wrote:

> [ was: Re: [PATCH, PR46032] Handle BUILT_IN_GOMP_PARALLEL in ipa-pta ]
> 
> On 30/11/15 17:36, Tom de Vries wrote:
> > On 30/11/15 14:24, Richard Biener wrote:
> > > On Mon, 30 Nov 2015, Tom de Vries wrote:
> > > 
> > > > On 30/11/15 10:16, Richard Biener wrote:
> > > > > On Mon, 30 Nov 2015, Tom de Vries wrote:
> > > > > 
> > > > > > Hi,
> > > > > > 
> > > > > > this patch fixes PR46032.
> > > > > > 
> > > > > > It handles a call:
> > > > > > ...
> > > > > >     __builtin_GOMP_parallel (fn, data, num_threads, flags)
> > > > > > ...
> > > > > > as:
> > > > > > ...
> > > > > >     fn (data)
> > > > > > ...
> > > > > > in ipa-pta.
> > > > > > 
> > > > > > This improves ipa-pta alias analysis in the parallelized function
> > > > > > fn,
> 
> This follow-up patch does the same for BUILT_IN_GOACC_PARALLEL.
> 
> Bootstrapped and reg-tested on x86_64.
> 
> OK for stage3 trunk?

Ok.

Richard.
Jakub Jelinek Dec. 1, 2015, 2:44 p.m. UTC | #2
On Tue, Dec 01, 2015 at 03:25:42PM +0100, Tom de Vries wrote:
> Handle BUILT_IN_GOACC_PARALLEL in ipa-pta
> 
> 2015-12-01  Tom de Vries  <tom@codesourcery.com>
> 
> 	* tree-ssa-structalias.c (find_func_aliases_for_builtin_call)
> 	(find_func_clobbers, ipa_pta_execute): Handle BUILT_IN_GOACC_PARALLEL.

Isn't this cheating though?  The kernel will be called with those addresses
only if doing host fallback (and for GOMP_target_ext even not for that
always - firstprivate vars will have the addresses replaced by addresses of
alloca-ed copies of those objects).
I haven't studied in detail what exactly IPA-PTA does, so maybe it is good
enough to pretend that.

	Jakub
Tom de Vries Dec. 1, 2015, 11:46 p.m. UTC | #3
On 01/12/15 15:44, Jakub Jelinek wrote:
> On Tue, Dec 01, 2015 at 03:25:42PM +0100, Tom de Vries wrote:
>> Handle BUILT_IN_GOACC_PARALLEL in ipa-pta
>>
>> 2015-12-01  Tom de Vries  <tom@codesourcery.com>
>>
>> 	* tree-ssa-structalias.c (find_func_aliases_for_builtin_call)
>> 	(find_func_clobbers, ipa_pta_execute): Handle BUILT_IN_GOACC_PARALLEL.
>
> Isn't this cheating though?  The kernel will be called with those addresses
> only if doing host fallback

Let's take a look at goacc/kernels-alias-ipa-pta.c:
...
unsigned int a[N];
unsigned int b[N];
unsigned int c[N];

#pragma acc kernels pcopyout (a, b, c)
{
   a[0] = 0;
   b[0] = 1;
   c[0] = a[0];
}
...

If we execute on the host, the a, b and c used in the kernels region 
will be the a, b and c declared outside the region.

If we execute on a non-shared mem accelerator, the a, b and c used in 
the kernels region will be copies of a, b and c in the accelerator 
memory: a.1, b.1 and c.1.

This patch tells ipa-pta (which has no notion of a.1, b.1 and c.1) that 
we're using declared a, b, and c, in the kernels region, while on the 
accelerator we're really using a.1, b.1 and c.1, so in that sense it's 
cheating.

However, given that declared a, b and c are disjunct, we know that their 
copies will be disjunct, so by pretending that declared a, b and c are 
used in the kernels region, we get conclusions which are also valid when 
we use a.1, b.1 and c.1 instead in the kernels region.

So, for this patch to be incorrect we have to find an example where 
ipa-pta finds that two memory references are not aliasing, while on the 
accelerator those memory references are really aliasing. AFAICT there 
are no such examples.

> (and for GOMP_target_ext even not for that
> always - firstprivate vars will have the addresses replaced by addresses of
> alloca-ed copies of those objects).

I don't think firstprivate vars is a problem, I think the opposite would 
a problem: merging vars on the accelerator which are disjunct on the host.

> I haven't studied in detail what exactly IPA-PTA does, so maybe it is good
> enough to pretend that.

AFAIU, it's good enough, because the points-to information is only used 
to prove non-aliases.

Does this explanation address your concern?

Thanks,
- Tom
Jakub Jelinek Dec. 2, 2015, 9:31 a.m. UTC | #4
On Wed, Dec 02, 2015 at 12:46:47AM +0100, Tom de Vries wrote:
> Does this explanation address your concern?

Yeah, for now it is fine I hope.

	Jakub
Thomas Schwinge Dec. 2, 2015, 5:58 p.m. UTC | #5
Hi!

On Tue, 1 Dec 2015 15:25:42 +0100, Tom de Vries <Tom_deVries@mentor.com> wrote:
> Handle BUILT_IN_GOACC_PARALLEL in ipa-pta

> 	* c-c++-common/goacc/kernels-alias-ipa-pta-2.c: New test.
> 	* c-c++-common/goacc/kernels-alias-ipa-pta-3.c: New test.
> 	* c-c++-common/goacc/kernels-alias-ipa-pta.c: New test.

I see:

    PASS: c-c++-common/goacc/kernels-alias-ipa-pta-2.c (test for excess errors)
    FAIL: c-c++-common/goacc/kernels-alias-ipa-pta-2.c scan-tree-dump-times optimized "(?n)= 0;$" 2
    PASS: c-c++-common/goacc/kernels-alias-ipa-pta-2.c scan-tree-dump-times optimized "(?n)= 1;$" 1
    FAIL: c-c++-common/goacc/kernels-alias-ipa-pta-2.c scan-tree-dump-times optimized "(?n)= \\*a" 0
    PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c (test for excess errors)
    PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c scan-tree-dump-times optimized "(?n)= 0;$" 1
    PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c scan-tree-dump-times optimized "(?n)= 1;$" 1
    PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c scan-tree-dump-times optimized "(?n)= \\*a" 1
    PASS: c-c++-common/goacc/kernels-alias-ipa-pta.c (test for excess errors)
    FAIL: c-c++-common/goacc/kernels-alias-ipa-pta.c scan-tree-dump-times optimized "(?n)= 0;$" 2
    PASS: c-c++-common/goacc/kernels-alias-ipa-pta.c scan-tree-dump-times optimized "(?n)= 1;$" 1
    FAIL: c-c++-common/goacc/kernels-alias-ipa-pta.c scan-tree-dump-times optimized "(?n)= \\*_[0-9]\\[0\\];$" 0

..., and similar for C++.  Looking at
c-c++-common/goacc/kernels-alias-ipa-pta.c:

> --- /dev/null
> +++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-ipa-pta.c
> @@ -0,0 +1,23 @@
> +/* { dg-additional-options "-O2" } */
> +/* { dg-additional-options "-fipa-pta -fdump-tree-optimized" } */
> +
> +#define N 2
> +
> +void
> +foo (void)
> +{
> +  unsigned int a[N];
> +  unsigned int b[N];
> +  unsigned int c[N];
> +
> +#pragma acc kernels pcopyout (a, b, c)
> +  {
> +    a[0] = 0;
> +    b[0] = 1;
> +    c[0] = a[0];
> +  }
> +}
> +
> +/* { dg-final { scan-tree-dump-times "(?n)= 0;$" 2 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "(?n)= 1;$" 1 "optimized" } } */
> +/* { dg-final { scan-tree-dump-times "(?n)= \\*_\[0-9\]\\\[0\\\];$" 0 "optimized" } } */

..., manually running that one for C, I get:

    ;; Function foo._omp_fn.0 (foo._omp_fn.0, funcdef_no=1, decl_uid=1874, cgraph_uid=1, symbol_order=1)
    
    __attribute__((oacc function (1, 1, 1), omp target entrypoint))
    foo._omp_fn.0 (const struct .omp_data_t.0 & restrict .omp_data_i)
    {
      unsigned int[2] * _3;
      unsigned int[2] * _5;
      unsigned int _7;
      unsigned int[2] * _8;
    
      <bb 2>:
      _3 = *.omp_data_i_2(D).a;
      *_3[0] = 0;
      _5 = *.omp_data_i_2(D).b;
      *_5[0] = 1;
      _7 = *_3[0];
      _8 = *.omp_data_i_2(D).c;
      *_8[0] = _7;
      return;
    
    }
    
    
    
    ;; Function foo (foo, funcdef_no=0, decl_uid=1866, cgraph_uid=0, symbol_order=0)
    
    foo ()
    {
      unsigned int c[2];
      unsigned int b[2];
      unsigned int a[2];
      struct .omp_data_t.0 .omp_data_arr.1;
      static long unsigned int .omp_data_sizes.2[3] = {8, 8, 8};
      static short unsigned int .omp_data_kinds.3[3] = {514, 514, 514};
    
      <bb 2>:
      .omp_data_arr.1.c = &c;
      .omp_data_arr.1.b = &b;
      .omp_data_arr.1.a = &a;
      GOACC_parallel_keyed (-1, foo._omp_fn.0, 3, &.omp_data_arr.1, &.omp_data_sizes.2, &.omp_data_kinds.3, 0);
      .omp_data_arr.1 ={v} {CLOBBER};
      a ={v} {CLOBBER};
      b ={v} {CLOBBER};
      c ={v} {CLOBBER};
      return;
    
    }


Grüße
 Thomas
Tom de Vries Dec. 2, 2015, 11:31 p.m. UTC | #6
On 02/12/15 18:58, Thomas Schwinge wrote:
> Hi!
>
> On Tue, 1 Dec 2015 15:25:42 +0100, Tom de Vries<Tom_deVries@mentor.com>  wrote:
>> >Handle BUILT_IN_GOACC_PARALLEL in ipa-pta
>> >	* c-c++-common/goacc/kernels-alias-ipa-pta-2.c: New test.
>> >	* c-c++-common/goacc/kernels-alias-ipa-pta-3.c: New test.
>> >	* c-c++-common/goacc/kernels-alias-ipa-pta.c: New test.
> I see:
>
>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-2.c (test for excess errors)
>      FAIL: c-c++-common/goacc/kernels-alias-ipa-pta-2.c scan-tree-dump-times optimized "(?n)= 0;$" 2
>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-2.c scan-tree-dump-times optimized "(?n)= 1;$" 1
>      FAIL: c-c++-common/goacc/kernels-alias-ipa-pta-2.c scan-tree-dump-times optimized "(?n)= \\*a" 0
>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c (test for excess errors)
>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c scan-tree-dump-times optimized "(?n)= 0;$" 1
>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c scan-tree-dump-times optimized "(?n)= 1;$" 1
>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c scan-tree-dump-times optimized "(?n)= \\*a" 1
>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta.c (test for excess errors)
>      FAIL: c-c++-common/goacc/kernels-alias-ipa-pta.c scan-tree-dump-times optimized "(?n)= 0;$" 2
>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta.c scan-tree-dump-times optimized "(?n)= 1;$" 1
>      FAIL: c-c++-common/goacc/kernels-alias-ipa-pta.c scan-tree-dump-times optimized "(?n)= \\*_[0-9]\\[0\\];$" 0
>
> ..., and similar for C++.

Curious, I get all passes for both C and C++ (at r231192).

>  Looking at
> c-c++-common/goacc/kernels-alias-ipa-pta.c:
>
>> >--- /dev/null
>> >+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-ipa-pta.c
>> >@@ -0,0 +1,23 @@
>> >+/* { dg-additional-options "-O2" } */
>> >+/* { dg-additional-options "-fipa-pta -fdump-tree-optimized" } */
>> >+
>> >+#define N 2
>> >+
>> >+void
>> >+foo (void)
>> >+{
>> >+  unsigned int a[N];
>> >+  unsigned int b[N];
>> >+  unsigned int c[N];
>> >+
>> >+#pragma acc kernels pcopyout (a, b, c)
>> >+  {
>> >+    a[0] = 0;
>> >+    b[0] = 1;
>> >+    c[0] = a[0];
>> >+  }
>> >+}
>> >+
>> >+/* { dg-final { scan-tree-dump-times "(?n)= 0;$" 2 "optimized" } } */
>> >+/* { dg-final { scan-tree-dump-times "(?n)= 1;$" 1 "optimized" } } */
>> >+/* { dg-final { scan-tree-dump-times "(?n)= \\*_\[0-9\]\\\[0\\\];$" 0 "optimized" } } */
> ..., manually running that one for C, I get:
>
>      ;; Function foo._omp_fn.0 (foo._omp_fn.0, funcdef_no=1, decl_uid=1874, cgraph_uid=1, symbol_order=1)
>
>      __attribute__((oacc function (1, 1, 1), omp target entrypoint))
>      foo._omp_fn.0 (const struct .omp_data_t.0 & restrict .omp_data_i)
>      {
>        unsigned int[2] * _3;
>        unsigned int[2] * _5;
>        unsigned int _7;
>        unsigned int[2] * _8;
>
>        <bb 2>:
>        _3 = *.omp_data_i_2(D).a;
>        *_3[0] = 0;
>        _5 = *.omp_data_i_2(D).b;
>        *_5[0] = 1;
>        _7 = *_3[0];
>        _8 = *.omp_data_i_2(D).c;
>        *_8[0] = _7;
>        return;
>
>      }

Indeed, the optimization hasn't taken place here so it's correct that 
the scan fails.

I've attached the kernels-alias-ipa-pta.c.201t.optimized which I get, 
and it's clear the optimization is happening there (which explains why I 
get all PASSES).

The question is, why is the optimization not happening for you.

Thanks,
- Tom
;; Function foo._omp_fn.0 (foo._omp_fn.0, funcdef_no=1, decl_uid=1874, cgraph_uid=1, symbol_order=1)

__attribute__((oacc function (1, 1, 1), omp target entrypoint))
foo._omp_fn.0 (const struct .omp_data_t.0 & restrict .omp_data_i)
{
  unsigned int[2] * _3;
  unsigned int[2] * _5;
  unsigned int[2] * _8;

  <bb 2>:
  _3 = *.omp_data_i_2(D).a;
  *_3[0] = 0;
  _5 = *.omp_data_i_2(D).b;
  *_5[0] = 1;
  _8 = *.omp_data_i_2(D).c;
  *_8[0] = 0;
  return;

}



;; Function foo (foo, funcdef_no=0, decl_uid=1866, cgraph_uid=0, symbol_order=0)

foo ()
{
  unsigned int c[2];
  unsigned int b[2];
  unsigned int a[2];
  struct .omp_data_t.0 .omp_data_arr.1;
  static long unsigned int .omp_data_sizes.2[3] = {8, 8, 8};
  static short unsigned int .omp_data_kinds.3[3] = {514, 514, 514};

  <bb 2>:
  .omp_data_arr.1.c = &c;
  .omp_data_arr.1.b = &b;
  .omp_data_arr.1.a = &a;
  __builtin_GOACC_parallel_keyed (-1, foo._omp_fn.0, 3, &.omp_data_arr.1, &.omp_data_sizes.2, &.omp_data_kinds.3, 0);
  .omp_data_arr.1 ={v} {CLOBBER};
  a ={v} {CLOBBER};
  b ={v} {CLOBBER};
  c ={v} {CLOBBER};
  return;

}
Tom de Vries Dec. 3, 2015, 12:10 a.m. UTC | #7
On 03/12/15 00:31, Tom de Vries wrote:
> On 02/12/15 18:58, Thomas Schwinge wrote:
>> Hi!
>>
>> On Tue, 1 Dec 2015 15:25:42 +0100, Tom de
>> Vries<Tom_deVries@mentor.com>  wrote:
>>> >Handle BUILT_IN_GOACC_PARALLEL in ipa-pta
>>> >    * c-c++-common/goacc/kernels-alias-ipa-pta-2.c: New test.
>>> >    * c-c++-common/goacc/kernels-alias-ipa-pta-3.c: New test.
>>> >    * c-c++-common/goacc/kernels-alias-ipa-pta.c: New test.
>> I see:
>>
>>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-2.c (test for
>> excess errors)
>>      FAIL: c-c++-common/goacc/kernels-alias-ipa-pta-2.c
>> scan-tree-dump-times optimized "(?n)= 0;$" 2
>>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-2.c
>> scan-tree-dump-times optimized "(?n)= 1;$" 1
>>      FAIL: c-c++-common/goacc/kernels-alias-ipa-pta-2.c
>> scan-tree-dump-times optimized "(?n)= \\*a" 0
>>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c (test for
>> excess errors)
>>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c
>> scan-tree-dump-times optimized "(?n)= 0;$" 1
>>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c
>> scan-tree-dump-times optimized "(?n)= 1;$" 1
>>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta-3.c
>> scan-tree-dump-times optimized "(?n)= \\*a" 1
>>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta.c (test for excess
>> errors)
>>      FAIL: c-c++-common/goacc/kernels-alias-ipa-pta.c
>> scan-tree-dump-times optimized "(?n)= 0;$" 2
>>      PASS: c-c++-common/goacc/kernels-alias-ipa-pta.c
>> scan-tree-dump-times optimized "(?n)= 1;$" 1
>>      FAIL: c-c++-common/goacc/kernels-alias-ipa-pta.c
>> scan-tree-dump-times optimized "(?n)= \\*_[0-9]\\[0\\];$" 0
>>
>> ..., and similar for C++.
>
> Curious, I get all passes for both C and C++ (at r231192).
>

I've managed to reproduce it. The difference between pass and fail is 
whether the compiler is configured with or without accelerator.

I'll look into it.

Thanks,
- Tom
Tom de Vries Dec. 3, 2015, 12:27 a.m. UTC | #8
On 03/12/15 01:10, Tom de Vries wrote:
>
> I've managed to reproduce it. The difference between pass and fail is
> whether the compiler is configured with or without accelerator.
>
> I'll look into it.

In the configuration with accelerator, the flag node->force_output is on 
for foo._omp.fn.

This causes nonlocal_p to be true in ipa_pta_execute, which causes the 
optimization to fail.

The flag is decribed as:
...
   /* The symbol will be assumed to be used in an invisible way (like
      by an toplevel asm statement).  */
  ...

Looks like I have to ignore the force_output flag as well in 
ipa_pta_execute for this sort of node.

Thanks,
- Tom
Richard Biener Dec. 3, 2015, 8:59 a.m. UTC | #9
On Thu, 3 Dec 2015, Tom de Vries wrote:

> On 03/12/15 01:10, Tom de Vries wrote:
> > 
> > I've managed to reproduce it. The difference between pass and fail is
> > whether the compiler is configured with or without accelerator.
> > 
> > I'll look into it.
> 
> In the configuration with accelerator, the flag node->force_output is on for
> foo._omp.fn.
> 
> This causes nonlocal_p to be true in ipa_pta_execute, which causes the
> optimization to fail.
> 
> The flag is decribed as:
> ...
>   /* The symbol will be assumed to be used in an invisible way (like
>      by an toplevel asm statement).  */
>  ...
> 
> Looks like I have to ignore the force_output flag as well in ipa_pta_execute
> for this sort of node.

It rather looks like the flag shouldn't be set.  The fn after all has
its address taken!(?)

Richard.
Tom de Vries Dec. 3, 2015, 11:09 a.m. UTC | #10
On 03/12/15 09:59, Richard Biener wrote:
> On Thu, 3 Dec 2015, Tom de Vries wrote:
>
>> On 03/12/15 01:10, Tom de Vries wrote:
>>>
>>> I've managed to reproduce it. The difference between pass and fail is
>>> whether the compiler is configured with or without accelerator.
>>>
>>> I'll look into it.
>>
>> In the configuration with accelerator, the flag node->force_output is on for
>> foo._omp.fn.
>>
>> This causes nonlocal_p to be true in ipa_pta_execute, which causes the
>> optimization to fail.
>>
>> The flag is decribed as:
>> ...
>>    /* The symbol will be assumed to be used in an invisible way (like
>>       by an toplevel asm statement).  */
>>   ...
>>
>> Looks like I have to ignore the force_output flag as well in ipa_pta_execute
>> for this sort of node.
>
> It rather looks like the flag shouldn't be set.  The fn after all has
> its address taken!(?)
>

The flag is set here in expand_omp_target:
...
12682         /* Prevent IPA from removing child_fn as unreachable,
                  since there are no
12683            refs from the parent function to child_fn in offload
                  LTO mode.  */
12684         if (ENABLE_OFFLOADING)
12685           cgraph_node::get (child_fn)->mark_force_output ();
...

I guess setting forced_by_abi instead would also mean child_fn is not 
removed as unreachable, while still allowing optimizations:
...
   /* Like FORCE_OUTPUT, but in the case it is ABI requiring the symbol
      to be exported.  Unlike FORCE_OUTPUT this flag gets cleared to
      symbols promoted to static and it does not inhibit
      optimization.  */
   unsigned forced_by_abi : 1;
...

But I suspect that other optimizations (than ipa-pta) might break things.

Essentially we have two situations:
- in the host compiler, there is no need for the forced_output flag,
   and it inhibits optimization
- in the accelerator compiler, it (or some equivalent) is needed

I wonder if setting the force_output flag only when streaming the 
bytecode for offloading would work. That way, it wouldn't be set in the 
host compiler, while being set in the accelerator compiler.

Thanks,
- Tom
Richard Biener Dec. 3, 2015, 11:12 a.m. UTC | #11
On Thu, 3 Dec 2015, Tom de Vries wrote:

> On 03/12/15 09:59, Richard Biener wrote:
> > On Thu, 3 Dec 2015, Tom de Vries wrote:
> > 
> > > On 03/12/15 01:10, Tom de Vries wrote:
> > > > 
> > > > I've managed to reproduce it. The difference between pass and fail is
> > > > whether the compiler is configured with or without accelerator.
> > > > 
> > > > I'll look into it.
> > > 
> > > In the configuration with accelerator, the flag node->force_output is on
> > > for
> > > foo._omp.fn.
> > > 
> > > This causes nonlocal_p to be true in ipa_pta_execute, which causes the
> > > optimization to fail.
> > > 
> > > The flag is decribed as:
> > > ...
> > >    /* The symbol will be assumed to be used in an invisible way (like
> > >       by an toplevel asm statement).  */
> > >   ...
> > > 
> > > Looks like I have to ignore the force_output flag as well in
> > > ipa_pta_execute
> > > for this sort of node.
> > 
> > It rather looks like the flag shouldn't be set.  The fn after all has
> > its address taken!(?)
> > 
> 
> The flag is set here in expand_omp_target:
> ...
> 12682         /* Prevent IPA from removing child_fn as unreachable,
>                  since there are no
> 12683            refs from the parent function to child_fn in offload
>                  LTO mode.  */
> 12684         if (ENABLE_OFFLOADING)
> 12685           cgraph_node::get (child_fn)->mark_force_output ();
> ...
> 

How are there no refs from the "parent"?  Are there not refs from
some kind of descriptor that maps fallback CPU and offloaded variants?

I think the above needs sorting out in somw way, making the refs
explicit rather than implicit via force_output.

> I guess setting forced_by_abi instead would also mean child_fn is not removed
> as unreachable, while still allowing optimizations:
> ...
>   /* Like FORCE_OUTPUT, but in the case it is ABI requiring the symbol
>      to be exported.  Unlike FORCE_OUTPUT this flag gets cleared to
>      symbols promoted to static and it does not inhibit
>      optimization.  */
>   unsigned forced_by_abi : 1;
> ...
> 
> But I suspect that other optimizations (than ipa-pta) might break things.

How so?

> Essentially we have two situations:
> - in the host compiler, there is no need for the forced_output flag,
>   and it inhibits optimization
> - in the accelerator compiler, it (or some equivalent) is needed
> 
> I wonder if setting the force_output flag only when streaming the bytecode for
> offloading would work. That way, it wouldn't be set in the host compiler,
> while being set in the accelerator compiler.

Yeah, that was my original thinking btw.

Richard.
Jakub Jelinek Dec. 3, 2015, 11:13 a.m. UTC | #12
On Thu, Dec 03, 2015 at 12:09:04PM +0100, Tom de Vries wrote:
> The flag is set here in expand_omp_target:
> ...
> 12682         /* Prevent IPA from removing child_fn as unreachable,
>                  since there are no
> 12683            refs from the parent function to child_fn in offload
>                  LTO mode.  */
> 12684         if (ENABLE_OFFLOADING)
> 12685           cgraph_node::get (child_fn)->mark_force_output ();
> ...
> 
> I guess setting forced_by_abi instead would also mean child_fn is not
> removed as unreachable, while still allowing optimizations:
> ...
>   /* Like FORCE_OUTPUT, but in the case it is ABI requiring the symbol
>      to be exported.  Unlike FORCE_OUTPUT this flag gets cleared to
>      symbols promoted to static and it does not inhibit
>      optimization.  */
>   unsigned forced_by_abi : 1;
> ...
> 
> But I suspect that other optimizations (than ipa-pta) might break things.
> 
> Essentially we have two situations:
> - in the host compiler, there is no need for the forced_output flag,
>   and it inhibits optimization
> - in the accelerator compiler, it (or some equivalent) is needed
> 
> I wonder if setting the force_output flag only when streaming the bytecode
> for offloading would work. That way, it wouldn't be set in the host
> compiler, while being set in the accelerator compiler.

I believe that the host and offload func (and var) tables need to be in
sync, so there needs to be something both in the host and accel compilers
that prevents the functions and variables that have their accel or host
counterpart in the tables from being optimized away, or say replaced by
a clone with different arguments etc.

	Jakub
diff mbox

Patch

Handle BUILT_IN_GOACC_PARALLEL in ipa-pta

2015-12-01  Tom de Vries  <tom@codesourcery.com>

	* tree-ssa-structalias.c (find_func_aliases_for_builtin_call)
	(find_func_clobbers, ipa_pta_execute): Handle BUILT_IN_GOACC_PARALLEL.

	* c-c++-common/goacc/kernels-alias-ipa-pta-2.c: New test.
	* c-c++-common/goacc/kernels-alias-ipa-pta-3.c: New test.
	* c-c++-common/goacc/kernels-alias-ipa-pta.c: New test.

---
 .../c-c++-common/goacc/kernels-alias-ipa-pta-2.c   | 37 ++++++++++++++++++++++
 .../c-c++-common/goacc/kernels-alias-ipa-pta-3.c   | 36 +++++++++++++++++++++
 .../c-c++-common/goacc/kernels-alias-ipa-pta.c     | 23 ++++++++++++++
 gcc/tree-ssa-structalias.c                         | 28 +++++++++++++---
 .../kernels-alias-ipa-pta-2.c                      | 27 ++++++++++++++++
 .../kernels-alias-ipa-pta-3.c                      | 26 +++++++++++++++
 .../kernels-alias-ipa-pta.c                        | 26 +++++++++++++++
 7 files changed, 199 insertions(+), 4 deletions(-)

diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-ipa-pta-2.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-ipa-pta-2.c
new file mode 100644
index 0000000..f16d698
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-ipa-pta-2.c
@@ -0,0 +1,37 @@ 
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fipa-pta -fdump-tree-optimized" } */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+typedef __SIZE_TYPE__ size_t;
+void *malloc (size_t);
+void free (void *);
+#ifdef __cplusplus
+}
+#endif
+
+#define N 2
+
+void
+foo (void)
+{
+  unsigned int *a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  unsigned int *b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
+  {
+    a[0] = 0;
+    b[0] = 1;
+    c[0] = a[0];
+  }
+
+  free (a);
+  free (b);
+  free (c);
+}
+
+/* { dg-final { scan-tree-dump-times "(?n)= 0;$" 2 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n)= 1;$" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n)= \\*a" 0 "optimized" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-ipa-pta-3.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-ipa-pta-3.c
new file mode 100644
index 0000000..1eb56eb
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-ipa-pta-3.c
@@ -0,0 +1,36 @@ 
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fipa-pta -fdump-tree-optimized" } */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+typedef __SIZE_TYPE__ size_t;
+void *malloc (size_t);
+void free (void *);
+#ifdef __cplusplus
+}
+#endif
+
+#define N 2
+
+void
+foo (void)
+{
+  unsigned int *a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  unsigned int *b = a;
+  unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
+  {
+    a[0] = 0;
+    b[0] = 1;
+    c[0] = a[0];
+  }
+
+  free (a);
+  free (c);
+}
+
+/* { dg-final { scan-tree-dump-times "(?n)= 0;$" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n)= 1;$" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n)= \\*a" 1 "optimized" } } */
diff --git a/gcc/testsuite/c-c++-common/goacc/kernels-alias-ipa-pta.c b/gcc/testsuite/c-c++-common/goacc/kernels-alias-ipa-pta.c
new file mode 100644
index 0000000..969b466
--- /dev/null
+++ b/gcc/testsuite/c-c++-common/goacc/kernels-alias-ipa-pta.c
@@ -0,0 +1,23 @@ 
+/* { dg-additional-options "-O2" } */
+/* { dg-additional-options "-fipa-pta -fdump-tree-optimized" } */
+
+#define N 2
+
+void
+foo (void)
+{
+  unsigned int a[N];
+  unsigned int b[N];
+  unsigned int c[N];
+
+#pragma acc kernels pcopyout (a, b, c)
+  {
+    a[0] = 0;
+    b[0] = 1;
+    c[0] = a[0];
+  }
+}
+
+/* { dg-final { scan-tree-dump-times "(?n)= 0;$" 2 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n)= 1;$" 1 "optimized" } } */
+/* { dg-final { scan-tree-dump-times "(?n)= \\*_\[0-9\]\\\[0\\\];$" 0 "optimized" } } */
diff --git a/gcc/tree-ssa-structalias.c b/gcc/tree-ssa-structalias.c
index 7f4a8ad..060ff3e 100644
--- a/gcc/tree-ssa-structalias.c
+++ b/gcc/tree-ssa-structalias.c
@@ -4507,15 +4507,32 @@  find_func_aliases_for_builtin_call (struct function *fn, gcall *t)
 	  return true;
 	}
       case BUILT_IN_GOMP_PARALLEL:
+      case BUILT_IN_GOACC_PARALLEL:
 	{
-	  /* Handle __builtin_GOMP_parallel (fn, data, num_threads, flags) as
-	     fn (data).  */
 	  if (in_ipa_mode)
 	    {
-	      tree fnarg = gimple_call_arg (t, 0);
+	      unsigned int fnpos, argpos;
+	      switch (DECL_FUNCTION_CODE (fndecl))
+		{
+		case BUILT_IN_GOMP_PARALLEL:
+		  /* __builtin_GOMP_parallel (fn, data, num_threads, flags).  */
+		  fnpos = 0;
+		  argpos = 1;
+		  break;
+		case BUILT_IN_GOACC_PARALLEL:
+		  /* __builtin_GOACC_parallel (device, fn, mapnum, hostaddrs,
+					       sizes, kinds, ...).  */
+		  fnpos = 1;
+		  argpos = 3;
+		  break;
+		default:
+		  gcc_unreachable ();
+		}
+
+	      tree fnarg = gimple_call_arg (t, fnpos);
 	      gcc_assert (TREE_CODE (fnarg) == ADDR_EXPR);
 	      tree fndecl = TREE_OPERAND (fnarg, 0);
-	      tree arg = gimple_call_arg (t, 1);
+	      tree arg = gimple_call_arg (t, argpos);
 	      gcc_assert (TREE_CODE (arg) == ADDR_EXPR);
 
 	      varinfo_t fi = get_vi_for_tree (fndecl);
@@ -5064,6 +5081,7 @@  find_func_clobbers (struct function *fn, gimple *origt)
 	  case BUILT_IN_VA_END:
 	    return;
 	  case BUILT_IN_GOMP_PARALLEL:
+	  case BUILT_IN_GOACC_PARALLEL:
 	    return;
 	  /* printf-style functions may have hooks to set pointers to
 	     point to somewhere into the generated string.  Leave them
@@ -7547,6 +7565,8 @@  ipa_pta_execute (void)
 	      /* Handle direct calls to functions with body.  */
 	      if (gimple_call_builtin_p (stmt, BUILT_IN_GOMP_PARALLEL))
 		decl = TREE_OPERAND (gimple_call_arg (stmt, 0), 0);
+	      else if (gimple_call_builtin_p (stmt, BUILT_IN_GOACC_PARALLEL))
+		decl = TREE_OPERAND (gimple_call_arg (stmt, 1), 0);
 	      else
 		decl = gimple_call_fndecl (stmt);
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
new file mode 100644
index 0000000..0f323c8
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-2.c
@@ -0,0 +1,27 @@ 
+/* { dg-additional-options "-O2 -fipa-pta" } */
+
+#include <stdlib.h>
+
+#define N 2
+
+int
+main (void)
+{
+  unsigned int *a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  unsigned int *b = (unsigned int *)malloc (N * sizeof (unsigned int));
+  unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
+  {
+    a[0] = 0;
+    b[0] = 1;
+    c[0] = a[0];
+  }
+
+  if (a[0] != 0 || b[0] != 1 || c[0] != 0)
+    abort ();
+
+  free (a);
+  free (b);
+  free (c);
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
new file mode 100644
index 0000000..654e750
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta-3.c
@@ -0,0 +1,26 @@ 
+/* { dg-additional-options "-O2 -fipa-pta" } */
+
+#include <stdlib.h>
+
+#define N 2
+
+int
+main (void)
+{
+  unsigned int *a = (unsigned int *)malloc (N * sizeof (unsigned int));
+  unsigned int *b = a;
+  unsigned int *c = (unsigned int *)malloc (N * sizeof (unsigned int));
+
+#pragma acc kernels pcopyout (a[0:N], b[0:N], c[0:N])
+  {
+    a[0] = 0;
+    b[0] = 1;
+    c[0] = a[0];
+  }
+
+  if (a[0] != 1 || b[0] != 1 || c[0] != 1)
+    abort ();
+
+  free (a);
+  free (c);
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
new file mode 100644
index 0000000..44d4fd2
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/kernels-alias-ipa-pta.c
@@ -0,0 +1,26 @@ 
+/* { dg-additional-options "-O2 -fipa-pta" } */
+
+#include <stdlib.h>
+
+#define N 2
+
+int
+main (void)
+{
+  unsigned int a[N];
+  unsigned int b[N];
+  unsigned int c[N];
+
+#pragma acc kernels pcopyout (a, b, c)
+  {
+    a[0] = 0;
+    b[0] = 1;
+    c[0] = a[0];
+  }
+
+  if (a[0] != 0 || b[0] != 1 || c[0] != 0)
+    abort ();
+
+  return 0;
+}
+