[5/6,OpenACC,libgomp] Async re-work, C/C++ testsuite changes

Message ID	8086c63b-f729-891b-3d21-76871d360734@mentor.com
State	New
Headers
Return-Path: <gcc-patches-return-486327-incoming=patchwork.ozlabs.org@gcc.gnu.org>
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
 :list-unsubscribe:list-archive:list-post:list-help:sender
 :reply-to:from:subject:to:message-id:date:mime-version
 :content-type; q=dns; s=default; b=mOlYcOZKh/d6WtQzTMdI7lIA85eKB
 f//zAkySUb6Zq2OUnDWb3+Ul1KDNylL1IuT1HG/pJrGWeC7bZMlzumQHAYMkadeO
 emp+a2nu2qxKgpncabI5d0n6MZjBDBSD0k0RQRn96myF6yuUYYsaUybGM+gt1hDW
 LxAugw7JjeSxXg=
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
Sender: gcc-patches-owner@gcc.gnu.org
Reply-To: <cltang@codesourcery.com>
From: Chung-Lin Tang <chunglin_tang@mentor.com>
Subject: [PATCH 5/6, OpenACC, libgomp] Async re-work, C/C++ testsuite changes
To: <gcc-patches@gcc.gnu.org>, Thomas Schwinge <Thomas_Schwinge@mentor.com>
Message-ID: <8086c63b-f729-891b-3d21-76871d360734@mentor.com>
Date: Tue, 25 Sep 2018 21:11:42 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------A99011B69AF57E082A6BA82E"
Series	Async re-work
[0/6,OpenACC,libgomp] Async re-work
[1/6,OpenACC,libgomp] Async re-work, interfaces
[2/6,OpenACC,libgomp] Async re-work, oacc-* parts
[3/6,OpenACC,libgomp] acc_get/set_default_async API, Fortran specific parts
[4/6,OpenACC,libgomp] Async re-work, libgomp/target.c changes
[5/6,OpenACC,libgomp] Async re-work, C/C++ testsuite changes
[6/6,OpenACC,libgomp] Async re-work, nvptx changes

Hi Chung-Lin!

On Tue, 11 Dec 2018 21:30:31 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> On 2018/12/7 11:56 PM, Thomas Schwinge wrote:
> >> --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> >> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
> >> @@ -114,6 +114,7 @@ main (int argc, char **argv)
> >>
> >>    for (i = 0; i < N; i++)
> >>      {
> >> +      stream = (CUstream) acc_get_cuda_stream (i & 1);
> >>        r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0);
> > What's the motivation for this change?
>
> To place work on both streams 0 and 1.

That's describing what it does, not the motivation behind it.  ;-)

> > ..., and this change are needed because we're now more strictly
> > synchronizing with the local (host) thread.
> >
> > Regarding the case of "libgomp.oacc-c-c++-common/lib-81.c", as currently
> > present:
> >
> >     [...]
> >       for (i = 0; i < N; i++)
> >         {
> >           r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, streams[i], kargs, 0);
> >           if (r != CUDA_SUCCESS)
> >             {
> >               fprintf (stderr, "cuLaunchKernel failed: %d\n", r);
> >               abort ();
> >             }
> >         }
> >
> > This launches N kernels on N separate async queues/CUDA streams, [0..N).
> >
> >       acc_wait_all_async (N);
> >
> > Then, the "acc_wait_all_async (N)" -- in my understanding! -- should
> > *not* synchronize with the local (host) thread, but instead just set up
> > the additional async queue/CUDA stream N to "depend" on [0..N).
> >
> >       for (i = 0; i <= N; i++)
> >         {
> >           if (acc_async_test (i) != 0)
> >             abort ();
> >         }
> >
> > Thus, all [0..N) should then still be "acc_async_test (i) != 0" (still
> > running).
> >
> >       acc_wait (N);
> >
> > Here, the "acc_wait (N)" would synchronize the local (host) thread with
> > async queue/CUDA stream N and thus recursively with [0..N).
> >
> >       for (i = 0; i <= N; i++)
> >         {
> >           if (acc_async_test (i) != 1)
> >             abort ();
> >         }
> >     [...]
> >
> > So, then all these async queues/CUDA streams here indeed are
> > "acc_async_test (i) != 1", that is, idle.
> >
> >
> > Now, the more strict synchronization with the local (host) thread is not
> > wrong in terms of correctness, but I suppose it will impact performance of
> > otherwise asynchronous operations, which now get synchronized too much?
> >
> > Or, of course, I'm misunderstanding something...
>
> IIRC, we encountered many issues where people misunderstood the meaning of "wait+async",
> using it as if the local host sync happened, where in our original implementation it does not.

..., and that's the right thing, in my opinion.  (Do you disagree?)

> Also some areas of the OpenACC spec were vague on whether the local host synchronization should
> or should not happen; basically, the wording treated it as if it was only an implementation detail
> and didn't matter, and didn't acknowledge that this would be something visible to the user.

I suppose in correct code that correctly uses a different mechanism for
inter-thread synchronization, it shouldn't be visible?  (Well, with the
additional synchronization, it would be visible in terms of performance
degradation.)

For example, OpenACC 2.6, 3.2.11. "acc_wait" explicitly states that "If
two or more threads share the same accelerator, the 'acc_wait' routine
will return only if all matching asynchronous operations initiated by
this thread have completed; there is no guarantee that all matching
asynchronous operations initiated by other threads have completed".

I agree that this could be made more explicit throughout the
specification, and also the reading of OpenACC 2.6, 2.16.1. "async
clause" is a bit confusing regarding multiple host threads, but as I
understand, the idea still is that such wait operations do not
synchronize at the host thread level.  (Let's please assume that, and
then work with the OpenACC technical committee to get that clarified in
the documentation.)

> At the end, IIRC, I decided that adding a local host synchronization is easier for all of us,

Well...

> and took the opportunity of the re-org to make this change.

Well...  Again, a re-org/re-work should not make such functional
changes...

> That said, I didn't notice those tests you listed above were meant to test such delicate behavior.
>
> > (For avoidance of doubt, I would accept the "async re-work" as is, but we
> > should eventually clarify this, and restore the behavior we -- apparently
> > -- had before, where we didn't synchronize so much?  (So, technically,
> > the "async re-work" would constitute a regression for this kind of
> > usage?)
>
> It's not hard to restore the old behavior, just a few lines to delete. Although as described
> above, this change was deliberate.
>
> This might be another issue to raise with the committee. I think I tried on this exact issue
> a long time ago, but never got answers.

OK, I'll try to find that, or send me a pointer to it, if you still have
one.

I propose you include the following.  Would you please review the "TODO"
comments, and again also especially review the
"libgomp/oacc-parallel.c:goacc_wait" change, and confirm no corresponding
"libgomp/oacc-parallel.c:GOACC_wait" change to be done, because that code
is structured differently.

commit e44cc6dc8f76e50c6f905cd408475589dee7b3b1
Author: Thomas Schwinge <thomas@codesourcery.com>
Date:   Thu Dec 13 17:54:35 2018 +0100

    into async re-work: don't synchronize with the local thread unless actually necessary
---
 libgomp/oacc-async.c    | 8 ++++++--
 libgomp/oacc-parallel.c | 1 -
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git libgomp/oacc-async.c libgomp/oacc-async.c
index a38e42781aa0..ec5cbc408d4e 100644
--- libgomp/oacc-async.c
+++ libgomp/oacc-async.c
@@ -195,9 +195,11 @@ acc_wait_async (int async1, int async2)
   if (aq1 == aq2)
     gomp_fatal ("identical parameters");
 
-  thr->dev->openacc.async.synchronize_func (aq1);
   if (aq2)
     thr->dev->openacc.async.serialize_func (aq1, aq2);
+  else
+    //TODO Local thread synchronization.  Necessary for the "async2 == acc_async_sync" case, or can just skip?
+    thr->dev->openacc.async.synchronize_func (aq1);
 }
 
 void
@@ -232,9 +234,11 @@ acc_wait_all_async (int async)
   gomp_mutex_lock (&thr->dev->openacc.async.lock);
   for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
     {
-      thr->dev->openacc.async.synchronize_func (l->aq);
       if (waiting_queue)
         thr->dev->openacc.async.serialize_func (l->aq, waiting_queue);
+      else
+        //TODO Local thread synchronization.  Necessary for the "async == acc_async_sync" case, or can just skip?
+        thr->dev->openacc.async.synchronize_func (l->aq);
     }
   gomp_mutex_unlock (&thr->dev->openacc.async.lock);
 }
diff --git libgomp/oacc-parallel.c libgomp/oacc-parallel.c
index 9519abeccc2c..5a441c9efe38 100644
--- libgomp/oacc-parallel.c
+++ libgomp/oacc-parallel.c
@@ -508,7 +508,6 @@ goacc_wait (int async, int num_waits, va_list *ap)
       else
         {
           goacc_aq aq2 = get_goacc_asyncqueue (async);
-          acc_dev->openacc.async.synchronize_func (aq);
           acc_dev->openacc.async.serialize_func (aq, aq2);
         }
     }

Grüße
 Thomas

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c
new file mode 100644
index 0000000..9420540
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c
@@ -0,0 +1,904 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-lcuda" } */
+
+#include <openacc.h>
+#include <stdlib.h>
+#include <cuda.h>
+
+#include <stdio.h>
+#include <time.h>
+#include <sys/time.h>
+
+int
+main (int argc, char **argv)
+{
+  CUresult r;
+  CUstream stream1;
+  int N = 128; //1024 * 1024;
+  float *a, *b, *c, *d, *e;
+  int i;
+  int nbytes;
+
+  srand (time (NULL));
+  int s = rand () % 100;
+
+  acc_init (acc_device_nvidia);
+
+  nbytes = N * sizeof (float);
+
+  a = (float *) malloc (nbytes);
+  b = (float *) malloc (nbytes);
+  c = (float *) malloc (nbytes);
+  d = (float *) malloc (nbytes);
+  e = (float *) malloc (nbytes);
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 3.0;
+      b[i] = 0.0;
+    }
+
+  acc_set_default_async (s);
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
+  {
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = a[ii];
+    }
+
+#pragma acc wait
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 3.0)
+        abort ();
+
+      if (b[i] != 3.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 2.0;
+      b[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
+  {
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = a[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 2.0)
+        abort ();
+
+      if (b[i] != 2.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 3.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
+  {
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 3.0)
+        abort ();
+
+      if (b[i] != 9.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+
+      if (d[i] != 1.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 2.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+      e[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
+  {
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc parallel wait (s) async (s)
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 2.0)
+        abort ();
+
+      if (b[i] != 4.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+
+      if (d[i] != 1.0)
+        abort ();
+
+      if (e[i] != 11.0)
+        abort ();
+    }
+
+
+  r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING);
+  if (r != CUDA_SUCCESS)
+    {
+      fprintf (stderr, "cuStreamCreate failed: %d\n", r);
+      abort ();
+    }
+
+  acc_set_cuda_stream (1, stream1);
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 5.0;
+      b[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N]) copyin (N)
+  {
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = a[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 5.0)
+        abort ();
+
+      if (b[i] != 5.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 7.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
+  {
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 7.0)
+        abort ();
+
+      if (b[i] != 49.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+
+      if (d[i] != 1.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 3.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+      e[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
+  {
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc parallel wait (s) async (s)
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 3.0)
+        abort ();
+
+      if (b[i] != 9.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+
+      if (d[i] != 1.0)
+        abort ();
+
+      if (e[i] != 17.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 4.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+      e[i] = 0.0;
+    }
+
+#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
+  {
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 4.0)
+        abort ();
+
+      if (b[i] != 16.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+    }
+
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 5.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+      e[i] = 0.0;
+    }
+
+#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
+  {
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc update host (a[0:N], b[0:N], c[0:N]) async
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 5.0)
+        abort ();
+
+      if (b[i] != 25.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 3.0;
+      b[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
+  {
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = a[ii];
+    }
+
+#pragma acc wait
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 3.0)
+        abort ();
+
+      if (b[i] != 3.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 2.0;
+      b[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
+  {
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = a[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 2.0)
+        abort ();
+
+      if (b[i] != 2.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 3.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
+  {
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 3.0)
+        abort ();
+
+      if (b[i] != 9.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+
+      if (d[i] != 1.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 2.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+      e[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
+  {
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc kernels wait (s) async (s)
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 2.0)
+        abort ();
+
+      if (b[i] != 4.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+
+      if (d[i] != 1.0)
+        abort ();
+
+      if (e[i] != 11.0)
+        abort ();
+    }
+
+
+  r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING);
+  if (r != CUDA_SUCCESS)
+    {
+      fprintf (stderr, "cuStreamCreate failed: %d\n", r);
+      abort ();
+    }
+
+  acc_set_cuda_stream (1, stream1);
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 5.0;
+      b[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N]) copyin (N)
+  {
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = a[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 5.0)
+        abort ();
+
+      if (b[i] != 5.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 7.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
+  {
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 7.0)
+        abort ();
+
+      if (b[i] != 49.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+
+      if (d[i] != 1.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 3.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+      e[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
+  {
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc kernels wait (s) async (s)
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
+    }
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 3.0)
+        abort ();
+
+      if (b[i] != 9.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+
+      if (d[i] != 1.0)
+        abort ();
+
+      if (e[i] != 17.0)
+        abort ();
+    }
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 4.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+      e[i] = 0.0;
+    }
+
+#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
+  {
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 4.0)
+        abort ();
+
+      if (b[i] != 16.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+    }
+
+
+  for (i = 0; i < N; i++)
+    {
+      a[i] = 5.0;
+      b[i] = 0.0;
+      c[i] = 0.0;
+      d[i] = 0.0;
+      e[i] = 0.0;
+    }
+
+#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
+  {
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+      int ii;
+
+      for (ii = 0; ii < N; ii++)
+        c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc update host (a[0:N], b[0:N], c[0:N]) async
+
+#pragma acc wait (s)
+
+  }
+
+  for (i = 0; i < N; i++)
+    {
+      if (a[i] != 5.0)
+        abort ();
+
+      if (b[i] != 25.0)
+        abort ();
+
+      if (c[i] != 4.0)
+        abort ();
+    }
+
+  acc_shutdown (acc_device_nvidia);
+
+  return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c
index 2ddfa7d..f553d3d 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c
@@ -153,7 +153,7 @@ main (int argc, char **argv)
     d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
 
 #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N], N) \
-  async (4)
+  wait (1, 2, 3) async (4)
   for (int ii = 0; ii < N; ii++)
     e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c
index 0c6abe6..81d623a 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c
@@ -162,7 +162,7 @@ main (int argc, char **argv)
     d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
 
 #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \
-  wait (1) async (4)
+  wait (1, 2, 3) async (4)
   for (int ii = 0; ii < N; ii++)
     e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c
index 0bf706a..5ec50b8 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c
@@ -138,7 +138,7 @@ main (int argc, char **argv)
     d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
 
 #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \
-  wait (1,5) async (4)
+  wait (1, 2, 3, 5) async (4)
   for (int ii = 0; ii < N; ii++)
     e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c
index c85e824..6afe2a0 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c
@@ -92,16 +92,22 @@ main (int argc, char **argv)
       abort ();
     }
 
-  fprintf (stderr, "CheCKpOInT\n");
-  if (acc_async_test (1) != 0)
+  if (acc_async_test (0) != 0)
     {
       fprintf (stderr, "asynchronous operation not running\n");
       abort ();
     }
 
+  /* Test unseen async number.  */
+  if (acc_async_test (1) != 1)
+    {
+      fprintf (stderr, "acc_async_test failed on unseen number\n");
+      abort ();
+    }
+
   sleep ((int) (dtime / 1000.0f) + 1);
 
-  if (acc_async_test (1) != 1)
+  if (acc_async_test (0) != 1)
     {
       fprintf (stderr, "found asynchronous operation still running\n");
       abort ();
@@ -116,7 +122,3 @@ main (int argc, char **argv)
 
   return 0;
 }
-
-/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */
-/* { dg-output "unknown async \[0-9\]+" } */
-/* { dg-shouldfail "" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c
index f4f196d..2821f88 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c
@@ -111,7 +111,7 @@ main (int argc, char **argv)
 
   start_timer (0);
 
-  acc_wait (1);
+  acc_wait (0);
 
   atime = stop_timer (0);
 
@@ -132,7 +132,3 @@ main (int argc, char **argv)
 
   return 0;
 }
-
-/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */
-/* { dg-output "unknown async \[0-9\]+" } */
-/* { dg-shouldfail "" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
index ef3df13..b22af26 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
@@ -114,6 +114,7 @@ main (int argc, char **argv)
 
   for (i = 0; i < N; i++)
     {
+      stream = (CUstream) acc_get_cuda_stream (i & 1);
       r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0);
       if (r != CUDA_SUCCESS)
         {
@@ -122,11 +123,11 @@ main (int argc, char **argv)
         }
     }
 
-  acc_wait_async (0, 1);
-
   if (acc_async_test (0) != 0)
     abort ();
 
+  acc_wait_async (0, 1);
+
   if (acc_async_test (1) != 0)
     abort ();
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
index d5f18f0..30a4b57 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
@@ -133,7 +133,7 @@ main (int argc, char **argv)
 
   for (i = 0; i <= N; i++)
     {
-      if (acc_async_test (i) != 0)
+      if (acc_async_test (i) == 0)
         abort ();
     }
 
