mbox series

[0/7,OpenACC,libgomp,v5,stage1] Async re-work

Message ID e1cccd27-4f07-6123-e9ee-82be74002c08@mentor.com
Headers show
Series Async re-work | expand

Message

Chung-Lin Tang Jan. 22, 2019, 2:52 p.m. UTC
Hi, this is a rebase to current trunk and re-submission of the OpenACC Async
re-organization work, aiming to commit when stage1 re-opens. This is technically
the 2nd time I'm sending this whole patch series, but because I've named
partial revisions up to v4 by now, for clarity I will just call this entire set "v5".

Thomas, I hope I resolved all discussed issues in this current patch set. Please
kindly remind if I missed anything, as there were so many emails to re-check :)

The more detailed descriptions are in the individual patch submissions.

Thanks,
Chung-Lin

Comments

Thomas Schwinge Feb. 12, 2019, 3:07 p.m. UTC | #1
Hi Chung-Lin!

Happy New Year now to you, too!  :-)


On Tue, 22 Jan 2019 22:52:09 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> Hi, this is a rebase to current trunk and re-submission of the OpenACC Async
> re-organization work, aiming to commit when stage1 re-opens.

Thanks!

> This is technically
> the 2nd time I'm sending this whole patch series, but because I've named
> partial revisions up to v4 by now, for clarity I will just call this entire set "v5".

As far as I'm concerned, these patches should all (with a few exceptions
to be split out, see below) be merged into one patch, because they
logically all belong together, as one piece: "async re-work".


> Thomas, I hope I resolved all discussed issues in this current patch set. Please
> kindly remind if I missed anything, as there were so many emails to re-check :)

I'm still waiting for you to commit the PR87924 "OpenACC wait clauses
without async-arguments" changes, as a prerequisite to this re-work,
<https://gcc.gnu.org/ml/gcc-patches/2019-01/msg01282.html>.


If we agree that we actually need such a thing (I'll have to re-read
Jakub's comments), please submit the 'GOMP_PLUGIN_IF_VERSION' changes
separately, with 'GOMP_PLUGIN_IF_VERSION' equal to 'GOMP_VERSION'
(initially).  As this then is only a kind of documentation update, this
might then go into trunk right now -- and even if not right now, should
still be done separately as a prerequisite patch to this re-work, which
will then just increment 'GOMP_PLUGIN_IF_VERSION'.

Maybe rename 'GOMP_PLUGIN_IF_VERSION' to 'GOMP_PLUGIN_VERSION', for
similarity with 'GOMP_VERSION'?

And, it's then a bit confusing that 'GOMP_PLUGIN_VERSION' is returned
from 'GOMP_OFFLOAD_version' functions (plus 'host_version'); we there got
"plugin" vs. "offload".  But I suppose we'll just live with that?

The 'GOMP_OFFLOAD_version' functions should then also get their source
code comments updated: "libgomp [plugin] version"?


Now, back to the actual async re-work.

I see you've incorporated some of the incremental patches I provided
(thanks!), but not all of them.  I don't know if you just missed (some
of) these, or actually object?


I had requested that the OpenACC 2.5 'default_async' changes be discussed
separately, after this re-work has gone in, so please remove these
changes from this patch series.  I've again attached "into async re-work:
revert default_async changes".


I had provided changes, "into async re-work: don't create an asyncqueue
just to then test/synchronize with it", again attached.  I had asked that
you 'Please especially review the "libgomp/oacc-parallel.c:goacc_wait"
change, and confirm no corresponding "libgomp/oacc-parallel.c:GOACC_wait"
change to be done, because that code is structured differently'.


I had requested that we maintain the current behavior, that
"acc_async_noval" stays in its own, separate asyncqueue, instead of
aliasing it to 'async(0)'.  I had proposed "into async re-work:
libgomp/oacc-async.c:async2id", again attached.

You said you don't like the 'async2id' function I'm adding there (I still
don't understand why), so I assume you'd then implement this
async-argument to queue ID translation in 'lookup_goacc_asyncqueue'
proper?


I had provided "[WIP] into async re-work: documentation", again attached,
as 'A little bit of documentation starter update for you to include.
Please make sure that all relevant functions have such comments addded'.


I'm again attaching my changes 'into async re-work: replicate
"[PR88407] [OpenACC] Correctly handle unseen async-arguments"', which --
I suppose -- are necessary to maintain the current GCC trunk behavior
(that is, avoid testsuite regressions).


I'm again attaching my changes 'into async re-work: replicate "[PR88370]
acc_get_cuda_stream/acc_set_cuda_stream: acc_async_sync,
acc_async_noval"', which -- I suppose -- are necessary to maintain the
current GCC trunk behavior (that is, avoid testsuite regressions).


I'm again attaching my changes 'into async re-work: adjust for test case
added in "[PR88484] OpenACC wait directive without wait argument but with
async clause"', which -- I suppose -- are necessary to maintain the
current GCC trunk behavior (that is, avoid testsuite regressions).

You suggested that "Instead of fixing it here, will it make more sense to
have the serialize_func hook to accommodate the NULL asyncqueue?", to
which I said "Sure, that may make sense, yes.  Right: if there's no
asyncqueue to serialize with, then serialize/synchronize with the local
(host) thread", but this has not yet been implemented, as far as I can
tell.


I'm again attaching my changes 'into async re-work: don't synchronize
with the local thread unless actually necessary', which is the behavior
that makes most sense to me, and I had asked 'Would you please review the
"TODO" comments, and again also especially review the
"libgomp/oacc-parallel.c:goacc_wait" change, and confirm no corresponding
"libgomp/oacc-parallel.c:GOACC_wait" change to be done, because that code
is structured differently'.


By means of a "TODO" comment that I added, I had asked you to verify in
your 'libgomp/oacc-parallel.c:GOACC_enter_exit_data' translation from
'async_set_async_func' function call to 'async' formal parameter,
whether/why one case deliberately has not been converted; again attaching
this remaining piece of 'into async re-work: more async function usage'.


> The more detailed descriptions are in the individual patch submissions.

I'll respond to a few items individually, later on.


Grüße
 Thomas
From da84271ae48b0d6882be3304b8b97028e08158a1 Mon Sep 17 00:00:00 2001
From: Thomas Schwinge <thomas@codesourcery.com>
Date: Thu, 6 Dec 2018 15:57:46 +0100
Subject: [PATCH 1/9] into async re-work: revert default_async changes

---
 include/gomp-constants.h                      |   1 -
 libgomp/libgomp.map                           |   4 -
 libgomp/oacc-async.c                          |  19 +-
 libgomp/oacc-init.c                           |   2 -
 libgomp/oacc-int.h                            |   3 -
 libgomp/openacc.f90                           |  22 +-
 libgomp/openacc.h                             |   3 -
 libgomp/openacc_lib.h                         |  13 -
 .../libgomp.oacc-c-c++-common/asyncwait-2.c   | 904 ------------------
 9 files changed, 2 insertions(+), 969 deletions(-)
 delete mode 100644 libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c

diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index e37f1f9e9e4..f1e2ca3c75c 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -160,7 +160,6 @@ enum gomp_map_kind
 /* Asynchronous behavior.  Keep in sync with
    libgomp/{openacc.h,openacc.f90,openacc_lib.h}:acc_async_t.  */
 
-#define GOMP_ASYNC_DEFAULT		0
 #define GOMP_ASYNC_NOVAL		-1
 #define GOMP_ASYNC_SYNC			-2
 
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index 8feec91f5a3..d8e2fd1818b 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -464,12 +464,8 @@ OACC_2.5 {
 	acc_delete_finalize_async_32_h_;
 	acc_delete_finalize_async_64_h_;
 	acc_delete_finalize_async_array_h_;
-	acc_get_default_async;
-	acc_get_default_async_h_;
 	acc_memcpy_from_device_async;
 	acc_memcpy_to_device_async;
-	acc_set_default_async;
-	acc_set_default_async_h_;
 	acc_update_device_async;
 	acc_update_device_async_32_h_;
 	acc_update_device_async_64_h_;
diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index 00bed7452af..ea5ae542ac3 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -60,7 +60,7 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
   /* The special value acc_async_noval (-1) maps to the thread-specific
      default async stream.  */
   if (async == acc_async_noval)
-    async = thr->default_async;
+    async = 0; //TODO thr->default_async;
 
   if (async == acc_async_sync)
     return NULL;
@@ -246,23 +246,6 @@ acc_wait_all_async (int async)
     gomp_fatal ("wait all async(%d) failed", async);
 }
 
-int
-acc_get_default_async (void)
-{
-  struct goacc_thread *thr = get_goacc_thread ();
-  return thr->default_async;
-}
-
-void
-acc_set_default_async (int async)
-{
-  if (async < acc_async_sync)
-    gomp_fatal ("invalid async argument: %d", async);
-
-  struct goacc_thread *thr = get_goacc_thread ();
-  thr->default_async = async;
-}
-
 attribute_hidden void
 goacc_async_free (struct gomp_device_descr *devicep,
 		  struct goacc_asyncqueue *aq, void *ptr)
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index a561792b243..28471e40ba0 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -426,8 +426,6 @@ goacc_attach_host_thread_to_device (int ord)
   
   thr->target_tls
     = acc_dev->openacc.create_thread_data_func (ord);
-
-  thr->default_async = acc_async_default;
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
diff --git a/libgomp/oacc-int.h b/libgomp/oacc-int.h
index b343a06472e..e4b6ea6b7db 100644
--- a/libgomp/oacc-int.h
+++ b/libgomp/oacc-int.h
@@ -73,9 +73,6 @@ struct goacc_thread
 
   /* Target-specific data (used by plugin).  */
   void *target_tls;
-
-  /* Default OpenACC async queue for current thread, exported to plugin.  */
-  int default_async;
 };
 
 #if defined HAVE_TLS || defined USE_EMUTLS
diff --git a/libgomp/openacc.f90 b/libgomp/openacc.f90
index 971c16f6239..bc205453f82 100644
--- a/libgomp/openacc.f90
+++ b/libgomp/openacc.f90
@@ -51,10 +51,9 @@ module openacc_kinds
 
   integer, parameter :: acc_handle_kind = int32
 
-  public :: acc_async_default, acc_async_noval, acc_async_sync
+  public :: acc_async_noval, acc_async_sync
 
   ! Keep in sync with include/gomp-constants.h.
-  integer (acc_handle_kind), parameter :: acc_async_default = 0
   integer (acc_handle_kind), parameter :: acc_async_noval = -1
   integer (acc_handle_kind), parameter :: acc_async_sync = -2
 
@@ -93,16 +92,6 @@ module openacc_internal
       integer (acc_device_kind) d
     end function
 
-    subroutine acc_set_default_async_h (a)
-      import
-      integer a
-    end subroutine
-
-    function acc_get_default_async_h ()
-      import
-      integer acc_get_default_async_h
-    end function
-
     function acc_async_test_h (a)
       logical acc_async_test_h
       integer a
@@ -731,7 +720,6 @@ module openacc
 
   public :: acc_get_num_devices, acc_set_device_type, acc_get_device_type
   public :: acc_set_device_num, acc_get_device_num, acc_async_test
-  public :: acc_set_default_async, acc_get_default_async
   public :: acc_async_test_all
   public :: acc_wait, acc_async_wait, acc_wait_async
   public :: acc_wait_all, acc_async_wait_all, acc_wait_all_async
@@ -764,14 +752,6 @@ module openacc
     procedure :: acc_get_device_num_h
   end interface
 
-  interface acc_set_default_async
-    procedure :: acc_set_default_async_h
-  end interface
-
-  interface acc_get_default_async
-    procedure :: acc_get_default_async_h
-  end interface
-
   interface acc_async_test
     procedure :: acc_async_test_h
   end interface
diff --git a/libgomp/openacc.h b/libgomp/openacc.h
index 381f74f39d0..1bbe6c90e7f 100644
--- a/libgomp/openacc.h
+++ b/libgomp/openacc.h
@@ -63,7 +63,6 @@ typedef enum acc_device_t {
 
 typedef enum acc_async_t {
   /* Keep in sync with include/gomp-constants.h.  */
-  acc_async_default = 0,
   acc_async_noval = -1,
   acc_async_sync  = -2
 } acc_async_t;
@@ -73,8 +72,6 @@ void acc_set_device_type (acc_device_t) __GOACC_NOTHROW;
 acc_device_t acc_get_device_type (void) __GOACC_NOTHROW;
 void acc_set_device_num (int, acc_device_t) __GOACC_NOTHROW;
 int acc_get_device_num (acc_device_t) __GOACC_NOTHROW;
-void acc_set_default_async (int) __GOACC_NOTHROW;
-int acc_get_default_async (void) __GOACC_NOTHROW;
 int acc_async_test (int) __GOACC_NOTHROW;
 int acc_async_test_all (void) __GOACC_NOTHROW;
 void acc_wait (int) __GOACC_NOTHROW;
diff --git a/libgomp/openacc_lib.h b/libgomp/openacc_lib.h
index 9fe47bbc48d..fbd8f5e3625 100644
--- a/libgomp/openacc_lib.h
+++ b/libgomp/openacc_lib.h
@@ -46,7 +46,6 @@
       integer, parameter :: acc_handle_kind = 4
 
 !     Keep in sync with include/gomp-constants.h.
-      integer (acc_handle_kind), parameter :: acc_async_default = 0
       integer (acc_handle_kind), parameter :: acc_async_noval = -1
       integer (acc_handle_kind), parameter :: acc_async_sync = -2
 
@@ -90,18 +89,6 @@
         end function
       end interface
 
-      interface acc_set_default_async
-        subroutine acc_set_default_async_h (a)
-          integer a
-        end subroutine
-      end interface
-
-      interface acc_get_default_async
-        function acc_get_default_async_h ()
-          integer acc_get_default_async_h
-        end function
-      end interface
-
       interface acc_async_test
         function acc_async_test_h (a)
           logical acc_async_test_h
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c
deleted file mode 100644
index 94205407d41..00000000000
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c
+++ /dev/null
@@ -1,904 +0,0 @@
-/* { dg-do run { target openacc_nvidia_accel_selected } } */
-/* { dg-additional-options "-lcuda" } */
-
-#include <openacc.h>
-#include <stdlib.h>
-#include <cuda.h>
-
-#include <stdio.h>
-#include <time.h>
-#include <sys/time.h>
-
-int
-main (int argc, char **argv)
-{
-    CUresult r;
-    CUstream stream1;
-    int N = 128; //1024 * 1024;
-    float *a, *b, *c, *d, *e;
-    int i;
-    int nbytes;
-
-    srand (time (NULL));
-    int s = rand () % 100;
-
-    acc_init (acc_device_nvidia);
-
-    nbytes = N * sizeof (float);
-
-    a = (float *) malloc (nbytes);
-    b = (float *) malloc (nbytes);
-    c = (float *) malloc (nbytes);
-    d = (float *) malloc (nbytes);
-    e = (float *) malloc (nbytes);
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-    }
-
-    acc_set_default_async (s);
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 3.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 2.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc parallel wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 4.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 11.0)
-            abort ();
-    }
-
-
-    r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING);
-    if (r != CUDA_SUCCESS)
-    {
-        fprintf (stderr, "cuStreamCreate failed: %d\n", r);
-        abort ();
-    }
-
-    acc_set_cuda_stream (1, stream1);
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 5.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 7.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 7.0)
-            abort ();
-
-        if (b[i] != 49.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc parallel wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 17.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 4.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 4.0)
-            abort ();
-
-        if (b[i] != 16.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc parallel async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) async
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 25.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 3.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 2.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 2.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc kernels wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 2.0)
-            abort ();
-
-        if (b[i] != 4.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 11.0)
-            abort ();
-    }
-
-
-    r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING);
-    if (r != CUDA_SUCCESS)
-    {
-        fprintf (stderr, "cuStreamCreate failed: %d\n", r);
-        abort ();
-    }
-
-    acc_set_cuda_stream (1, stream1);
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 5.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 7.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 7.0)
-            abort ();
-
-        if (b[i] != 49.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 3.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
-    }
-
-#pragma acc kernels wait (s) async (s)
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
-    }
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 3.0)
-            abort ();
-
-        if (b[i] != 9.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-
-        if (d[i] != 1.0)
-            abort ();
-
-        if (e[i] != 17.0)
-            abort ();
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 4.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 4.0)
-            abort ();
-
-        if (b[i] != 16.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-
-    for (i = 0; i < N; i++)
-    {
-        a[i] = 5.0;
-        b[i] = 0.0;
-        c[i] = 0.0;
-        d[i] = 0.0;
-        e[i] = 0.0;
-    }
-
-#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
-    {
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
-    }
-
-#pragma acc kernels async
-    {
-        int ii;
-
-        for (ii = 0; ii < N; ii++)
-            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
-    }
-
-#pragma acc update host (a[0:N], b[0:N], c[0:N]) async
-
-#pragma acc wait (s)
-
-    }
-
-    for (i = 0; i < N; i++)
-    {
-        if (a[i] != 5.0)
-            abort ();
-
-        if (b[i] != 25.0)
-            abort ();
-
-        if (c[i] != 4.0)
-            abort ();
-    }
-
-    acc_shutdown (acc_device_nvidia);
-
-    return 0;
-}