Message ID | 432c2e58-7bf6-1f7e-457f-32813207b282@mentor.com |
---|---|
Headers | show |
Series | Async re-work | expand |
Hi Chung-Lin! On Tue, 25 Sep 2018 21:09:49 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote: > This patch is a re-organization of OpenACC asynchronous queues. Thanks! > The previous style of implementation > was essentially re-defining the entire async API inside the plugin-interface, and relaying all such > API calls to the target plugin, which is awkward in design; it requires (each) target plugin to > essentially re-implement large portions of the async functionality to support OpenACC, and the > way it uses a state-setting style to "select/de-select" asynchronous queues for operations litters > a lot of code paths. > > The new design proposed here in this patch declares a "struct goacc_asyncqueue*" opaque type in libgomp.h, > and re-defines the plugin interface to a few operations (e.g. construct/destruct/test/synchronize/etc.) > on this async-queue type, all details are target-dependent inside the specific plugin/plugin-<target>.c file. Conceptually, ACK. > Also included in this patch is the code for the acc_get/set_default_async API functions in OpenACC 2.5. > It's a minor part of this patch, but since some code was merge together, I'm submitting it together here. As I requested, I'm reviewing those changes separately, and have backed out those changes in my working copy. > Testing has been done with offloading enabled. The results are mostly okay, but with a few issues > with either yet incomplete submission of our testsuite adjustment patches, or other independent problems. We'll need to understand these. > Seeking permission to commit this to trunk first. A few things will need to be clarified. For example, for the simple program: int main(void) { #pragma acc parallel async(1) ; #pragma acc wait return 0; } ..., I'm seeing memory corruption, which (oaccasionally...) shows up as an abort due to "free" complaining, but also reproduces more reliably with "valgrind". It also reproduces on openacc-gcc-8-branch: $ valgrind ./a.out [...] ==26392== Invalid read of size 8 ==26392== at 0x4E653B0: goacc_async_unmap_tgt (oacc-async.c:368) ==26392== by 0x5C90901: cuda_callback_wrapper (plugin-nvptx.c:1648) ==26392== by 0x6066B8D: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.390.77) ==26392== by 0x607A10F: ??? (in /usr/lib/x86_64-linux-gnu/libcuda.so.390.77) ==26392== by 0x50816DA: start_thread (pthread_create.c:463) ==26392== by 0x53BA88E: clone (clone.S:95) ==26392== Address 0x8d19f50 is 0 bytes inside a block of size 64 free'd ==26392== at 0x4C30D3B: free (vg_replace_malloc.c:530) ==26392== by 0x4E65BEE: goacc_async_copyout_unmap_vars (oacc-async.c:383) ==26392== by 0x4E607C9: GOACC_parallel_keyed_internal (oacc-parallel.c:403) ==26392== by 0x4E60EAA: GOACC_parallel_keyed_v2 (oacc-parallel.c:439) ==26392== by 0x40094F: ??? (in [...]/a.out) ==26392== by 0x52BAB96: (below main) (libc-start.c:310) ==26392== Block was alloc'd at ==26392== at 0x4C2FB0F: malloc (vg_replace_malloc.c:299) ==26392== by 0x4E47538: gomp_malloc (alloc.c:37) ==26392== by 0x4E5AEEB: gomp_map_vars_async (target.c:731) ==26392== by 0x4E60C2B: GOACC_parallel_keyed_internal (oacc-parallel.c:345) ==26392== by 0x4E60EAA: GOACC_parallel_keyed_v2 (oacc-parallel.c:439) ==26392== by 0x40094F: ??? (in [...]/a.out) ==26392== by 0x52BAB96: (below main) (libc-start.c:310) [...] Per my understanding, the problem is that, called from libgomp/oacc-async.c:goacc_async_copyout_unmap_vars, libgomp/target.c:gomp_unmap_vars_async runs into: if (tgt->list_count == 0) { free (tgt); return; } ..., and then goacc_async_copyout_unmap_vars does: devicep->openacc.async.queue_callback_func (aq, goacc_async_unmap_tgt, (void *) tgt); ..., which will then call libgomp/oacc-async.c:goacc_async_unmap_tgt: static void goacc_async_unmap_tgt (void *ptr) { struct target_mem_desc *tgt = (struct target_mem_desc *) ptr; if (tgt->refcount > 1) tgt->refcount--; else gomp_unmap_tgt (tgt); } ..., where the "Invalid read of size 8" happens, and which eventually would try to "free (tgt)" again, via libgomp/target.c:gomp_unmap_tgt: attribute_hidden void gomp_unmap_tgt (struct target_mem_desc *tgt) { /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end region. */ if (tgt->tgt_end) gomp_free_device_memory (tgt->device_descr, tgt->to_free); free (tgt->array); free (tgt); } Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong, or something else? Grüße Thomas
On Thu, 6 Dec 2018 21:42:14 +0100 Thomas Schwinge <thomas@codesourcery.com> wrote: > [...] > ..., where the "Invalid read of size 8" happens, and which eventually > would try to "free (tgt)" again, via libgomp/target.c:gomp_unmap_tgt: > > attribute_hidden void > gomp_unmap_tgt (struct target_mem_desc *tgt) > { > /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end > region. */ if (tgt->tgt_end) > gomp_free_device_memory (tgt->device_descr, tgt->to_free); > > free (tgt->array); > free (tgt); > } > > Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong, > or something else? It might be worth trying this with the refcounting changes in the attach/detach patch. Julian
On Thu, 6 Dec 2018 22:22:46 +0000 Julian Brown <julian@codesourcery.com> wrote: > On Thu, 6 Dec 2018 21:42:14 +0100 > Thomas Schwinge <thomas@codesourcery.com> wrote: > > > [...] > > ..., where the "Invalid read of size 8" happens, and which > > eventually would try to "free (tgt)" again, via > > libgomp/target.c:gomp_unmap_tgt: > > > > attribute_hidden void > > gomp_unmap_tgt (struct target_mem_desc *tgt) > > { > > /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end > > region. */ if (tgt->tgt_end) > > gomp_free_device_memory (tgt->device_descr, tgt->to_free); > > > > free (tgt->array); > > free (tgt); > > } > > > > Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong, > > or something else? > > It might be worth trying this with the refcounting changes in the > attach/detach patch. ...oh, also make sure you have this patch in the series you're testing with: https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01973.html else your "wait" will be ignored, IIUC. Julian
On 2018/12/7 6:26 AM, Julian Brown wrote: > On Thu, 6 Dec 2018 22:22:46 +0000 > Julian Brown <julian@codesourcery.com> wrote: > >> On Thu, 6 Dec 2018 21:42:14 +0100 >> Thomas Schwinge <thomas@codesourcery.com> wrote: >> >>> [...] >>> ..., where the "Invalid read of size 8" happens, and which >>> eventually would try to "free (tgt)" again, via >>> libgomp/target.c:gomp_unmap_tgt: >>> >>> attribute_hidden void >>> gomp_unmap_tgt (struct target_mem_desc *tgt) >>> { >>> /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end >>> region. */ if (tgt->tgt_end) >>> gomp_free_device_memory (tgt->device_descr, tgt->to_free); >>> >>> free (tgt->array); >>> free (tgt); >>> } >>> >>> Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong, >>> or something else? >> >> It might be worth trying this with the refcounting changes in the >> attach/detach patch. > > ...oh, also make sure you have this patch in the series you're testing > with: > > https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01973.html > > else your "wait" will be ignored, IIUC. > > Julian Hi Thomas, just first asking if you tried Julian's patch during this time, and if so did it do anything different? (and apologies for missing responding this part for so long :P ) Chung-Lin
Hi! On Thu, 13 Dec 2018 23:28:49 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote: > On 2018/12/7 6:26 AM, Julian Brown wrote: > > On Thu, 6 Dec 2018 22:22:46 +0000 > > Julian Brown <julian@codesourcery.com> wrote: > > > >> On Thu, 6 Dec 2018 21:42:14 +0100 > >> Thomas Schwinge <thomas@codesourcery.com> wrote: > >> > >>> [...] > >>> ..., where the "Invalid read of size 8" happens, and which > >>> eventually would try to "free (tgt)" again, via > >>> libgomp/target.c:gomp_unmap_tgt: > >>> > >>> attribute_hidden void > >>> gomp_unmap_tgt (struct target_mem_desc *tgt) > >>> { > >>> /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end > >>> region. */ if (tgt->tgt_end) > >>> gomp_free_device_memory (tgt->device_descr, tgt->to_free); > >>> > >>> free (tgt->array); > >>> free (tgt); > >>> } > >>> > >>> Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong, > >>> or something else? > >> > >> It might be worth trying this with the refcounting changes in the > >> attach/detach patch. Well, which exactly? > > ...oh, also make sure you have this patch in the series you're testing > > with: > > > > https://gcc.gnu.org/ml/gcc-patches/2018-08/msg01973.html > > > > else your "wait" will be ignored, IIUC. Thanks, and right, and yes, I got that one included. > just first asking if you tried Julian's patch during this time, and if so did it do anything different? I did not test with all the attach/detach patches on top of this one here. That's too many changes at once. Grüße Thomas
Hi Chung-Lin! On Thu, 06 Dec 2018 21:42:14 +0100, I wrote: > On Tue, 25 Sep 2018 21:09:49 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote: > > Also included in this patch is the code for the acc_get/set_default_async API functions in OpenACC 2.5. > > It's a minor part of this patch, but since some code was merge together, I'm submitting it together here. > > As I requested, I'm reviewing those changes separately, and have backed > out those changes in my working copy. ... as follows: commit 79b89a5214dc2624a52f0593bbfad5cefed0c025 Author: Thomas Schwinge <thomas@codesourcery.com> Date: Thu Dec 6 15:57:46 2018 +0100 into async re-work: revert default_async changes --- include/gomp-constants.h | 1 - libgomp/libgomp.map | 4 - libgomp/oacc-async.c | 19 +- libgomp/oacc-init.c | 2 - libgomp/oacc-int.h | 3 - libgomp/openacc.f90 | 22 +- libgomp/openacc.h | 3 - libgomp/openacc_lib.h | 13 - .../libgomp.oacc-c-c++-common/asyncwait-2.c | 904 --------------------- 9 files changed, 2 insertions(+), 969 deletions(-) diff --git include/gomp-constants.h include/gomp-constants.h index acd25851bcc7..1021306ed661 100644 --- include/gomp-constants.h +++ include/gomp-constants.h @@ -160,7 +160,6 @@ enum gomp_map_kind /* Asynchronous behavior. Keep in sync with libgomp/{openacc.h,openacc.f90,openacc_lib.h}:acc_async_t. */ -#define GOMP_ASYNC_DEFAULT 0 #define GOMP_ASYNC_NOVAL -1 #define GOMP_ASYNC_SYNC -2 diff --git libgomp/libgomp.map libgomp/libgomp.map index c5e1b876fccd..d2381da3bf07 100644 --- libgomp/libgomp.map +++ libgomp/libgomp.map @@ -464,12 +464,8 @@ OACC_2.5 { acc_delete_finalize_async_32_h_; acc_delete_finalize_async_64_h_; acc_delete_finalize_async_array_h_; - acc_get_default_async; - acc_get_default_async_h_; acc_memcpy_from_device_async; acc_memcpy_to_device_async; - acc_set_default_async; - acc_set_default_async_h_; acc_update_device_async; acc_update_device_async_32_h_; acc_update_device_async_64_h_; diff --git libgomp/oacc-async.c libgomp/oacc-async.c index 68aaf199a27e..553082fe3d4a 100644 --- libgomp/oacc-async.c +++ libgomp/oacc-async.c @@ -60,7 +60,7 @@ lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async) /* The special value acc_async_noval (-1) maps to the thread-specific default async stream. */ if (async == acc_async_noval) - async = thr->default_async; + async = 0; //TODO thr->default_async; if (async == acc_async_sync) return NULL; @@ -221,23 +221,6 @@ acc_wait_all_async (int async) gomp_mutex_unlock (&thr->dev->openacc.async.lock); } -int -acc_get_default_async (void) -{ - struct goacc_thread *thr = get_goacc_thread (); - return thr->default_async; -} - -void -acc_set_default_async (int async) -{ - if (async < acc_async_sync) - gomp_fatal ("invalid async argument: %d", async); - - struct goacc_thread *thr = get_goacc_thread (); - thr->default_async = async; -} - static void goacc_async_unmap_tgt (void *ptr) { diff --git libgomp/oacc-init.c libgomp/oacc-init.c index 2c2f91ce3c2c..c40f48829078 100644 --- libgomp/oacc-init.c +++ libgomp/oacc-init.c @@ -426,8 +426,6 @@ goacc_attach_host_thread_to_device (int ord) thr->target_tls = acc_dev->openacc.create_thread_data_func (ord); - - thr->default_async = acc_async_default; } /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of diff --git libgomp/oacc-int.h libgomp/oacc-int.h index 3354eb654ce9..97f3fc8a61ed 100644 --- libgomp/oacc-int.h +++ libgomp/oacc-int.h @@ -73,9 +73,6 @@ struct goacc_thread /* Target-specific data (used by plugin). */ void *target_tls; - - /* Default OpenACC async queue for current thread, exported to plugin. */ - int default_async; }; #if defined HAVE_TLS || defined USE_EMUTLS diff --git libgomp/openacc.f90 libgomp/openacc.f90 index 7d31ee689479..7c809fe00738 100644 --- libgomp/openacc.f90 +++ libgomp/openacc.f90 @@ -51,10 +51,9 @@ module openacc_kinds integer, parameter :: acc_handle_kind = int32 - public :: acc_async_default, acc_async_noval, acc_async_sync + public :: acc_async_noval, acc_async_sync ! Keep in sync with include/gomp-constants.h. - integer (acc_handle_kind), parameter :: acc_async_default = 0 integer (acc_handle_kind), parameter :: acc_async_noval = -1 integer (acc_handle_kind), parameter :: acc_async_sync = -2 @@ -93,16 +92,6 @@ module openacc_internal integer (acc_device_kind) d end function - subroutine acc_set_default_async_h (a) - import - integer a - end subroutine - - function acc_get_default_async_h () - import - integer acc_get_default_async_h - end function - function acc_async_test_h (a) logical acc_async_test_h integer a @@ -731,7 +720,6 @@ module openacc public :: acc_get_num_devices, acc_set_device_type, acc_get_device_type public :: acc_set_device_num, acc_get_device_num, acc_async_test - public :: acc_set_default_async, acc_get_default_async public :: acc_async_test_all public :: acc_wait, acc_async_wait, acc_wait_async public :: acc_wait_all, acc_async_wait_all, acc_wait_all_async @@ -764,14 +752,6 @@ module openacc procedure :: acc_get_device_num_h end interface - interface acc_set_default_async - procedure :: acc_set_default_async_h - end interface - - interface acc_get_default_async - procedure :: acc_get_default_async_h - end interface - interface acc_async_test procedure :: acc_async_test_h end interface diff --git libgomp/openacc.h libgomp/openacc.h index ede59d76c862..f61bb77f9f3e 100644 --- libgomp/openacc.h +++ libgomp/openacc.h @@ -63,7 +63,6 @@ typedef enum acc_device_t { typedef enum acc_async_t { /* Keep in sync with include/gomp-constants.h. */ - acc_async_default = 0, acc_async_noval = -1, acc_async_sync = -2 } acc_async_t; @@ -73,8 +72,6 @@ void acc_set_device_type (acc_device_t) __GOACC_NOTHROW; acc_device_t acc_get_device_type (void) __GOACC_NOTHROW; void acc_set_device_num (int, acc_device_t) __GOACC_NOTHROW; int acc_get_device_num (acc_device_t) __GOACC_NOTHROW; -void acc_set_default_async (int) __GOACC_NOTHROW; -int acc_get_default_async (void) __GOACC_NOTHROW; int acc_async_test (int) __GOACC_NOTHROW; int acc_async_test_all (void) __GOACC_NOTHROW; void acc_wait (int) __GOACC_NOTHROW; diff --git libgomp/openacc_lib.h libgomp/openacc_lib.h index 75a693937967..820d987d72e2 100644 --- libgomp/openacc_lib.h +++ libgomp/openacc_lib.h @@ -46,7 +46,6 @@ integer, parameter :: acc_handle_kind = 4 ! Keep in sync with include/gomp-constants.h. - integer (acc_handle_kind), parameter :: acc_async_default = 0 integer (acc_handle_kind), parameter :: acc_async_noval = -1 integer (acc_handle_kind), parameter :: acc_async_sync = -2 @@ -90,18 +89,6 @@ end function end interface - interface acc_set_default_async - subroutine acc_set_default_async_h (a) - integer a - end subroutine - end interface - - interface acc_get_default_async - function acc_get_default_async_h () - integer acc_get_default_async_h - end function - end interface - interface acc_async_test function acc_async_test_h (a) logical acc_async_test_h diff --git libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c deleted file mode 100644 index 94205407d41d..000000000000 --- libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c +++ /dev/null @@ -1,904 +0,0 @@ -/* { dg-do run { target openacc_nvidia_accel_selected } } */ -/* { dg-additional-options "-lcuda" } */ - -#include <openacc.h> -#include <stdlib.h> -#include <cuda.h> - -#include <stdio.h> -#include <time.h> -#include <sys/time.h> - -int -main (int argc, char **argv) -{ - CUresult r; - CUstream stream1; - int N = 128; //1024 * 1024; - float *a, *b, *c, *d, *e; - int i; - int nbytes; - - srand (time (NULL)); - int s = rand () % 100; - - acc_init (acc_device_nvidia); - - nbytes = N * sizeof (float); - - a = (float *) malloc (nbytes); - b = (float *) malloc (nbytes); - c = (float *) malloc (nbytes); - d = (float *) malloc (nbytes); - e = (float *) malloc (nbytes); - - for (i = 0; i < N; i++) - { - a[i] = 3.0; - b[i] = 0.0; - } - - acc_set_default_async (s); - -#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N) - { - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = a[ii]; - } - -#pragma acc wait - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 3.0) - abort (); - - if (b[i] != 3.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 2.0; - b[i] = 0.0; - } - -#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N) - { - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = a[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 2.0) - abort (); - - if (b[i] != 2.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 3.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - } - -#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N) - { - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 3.0) - abort (); - - if (b[i] != 9.0) - abort (); - - if (c[i] != 4.0) - abort (); - - if (d[i] != 1.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 2.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - e[i] = 0.0; - } - -#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N) - { - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; - } - -#pragma acc parallel wait (s) async (s) - { - int ii; - - for (ii = 0; ii < N; ii++) - e[ii] = a[ii] + b[ii] + c[ii] + d[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 2.0) - abort (); - - if (b[i] != 4.0) - abort (); - - if (c[i] != 4.0) - abort (); - - if (d[i] != 1.0) - abort (); - - if (e[i] != 11.0) - abort (); - } - - - r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING); - if (r != CUDA_SUCCESS) - { - fprintf (stderr, "cuStreamCreate failed: %d\n", r); - abort (); - } - - acc_set_cuda_stream (1, stream1); - - for (i = 0; i < N; i++) - { - a[i] = 5.0; - b[i] = 0.0; - } - -#pragma acc data copy (a[0:N], b[0:N]) copyin (N) - { - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = a[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 5.0) - abort (); - - if (b[i] != 5.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 7.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - } - -#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N) - { - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 7.0) - abort (); - - if (b[i] != 49.0) - abort (); - - if (c[i] != 4.0) - abort (); - - if (d[i] != 1.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 3.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - e[i] = 0.0; - } - -#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N) - { - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; - } - -#pragma acc parallel wait (s) async (s) - { - int ii; - - for (ii = 0; ii < N; ii++) - e[ii] = a[ii] + b[ii] + c[ii] + d[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 3.0) - abort (); - - if (b[i] != 9.0) - abort (); - - if (c[i] != 4.0) - abort (); - - if (d[i] != 1.0) - abort (); - - if (e[i] != 17.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 4.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - e[i] = 0.0; - } - -#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N) - { - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - -#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 4.0) - abort (); - - if (b[i] != 16.0) - abort (); - - if (c[i] != 4.0) - abort (); - } - - - for (i = 0; i < N; i++) - { - a[i] = 5.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - e[i] = 0.0; - } - -#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N) - { - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc parallel async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - -#pragma acc update host (a[0:N], b[0:N], c[0:N]) async - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 5.0) - abort (); - - if (b[i] != 25.0) - abort (); - - if (c[i] != 4.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 3.0; - b[i] = 0.0; - } - -#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N) - { - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = a[ii]; - } - -#pragma acc wait - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 3.0) - abort (); - - if (b[i] != 3.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 2.0; - b[i] = 0.0; - } - -#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N) - { - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = a[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 2.0) - abort (); - - if (b[i] != 2.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 3.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - } - -#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N) - { - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 3.0) - abort (); - - if (b[i] != 9.0) - abort (); - - if (c[i] != 4.0) - abort (); - - if (d[i] != 1.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 2.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - e[i] = 0.0; - } - -#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N) - { - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; - } - -#pragma acc kernels wait (s) async (s) - { - int ii; - - for (ii = 0; ii < N; ii++) - e[ii] = a[ii] + b[ii] + c[ii] + d[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 2.0) - abort (); - - if (b[i] != 4.0) - abort (); - - if (c[i] != 4.0) - abort (); - - if (d[i] != 1.0) - abort (); - - if (e[i] != 11.0) - abort (); - } - - - r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING); - if (r != CUDA_SUCCESS) - { - fprintf (stderr, "cuStreamCreate failed: %d\n", r); - abort (); - } - - acc_set_cuda_stream (1, stream1); - - for (i = 0; i < N; i++) - { - a[i] = 5.0; - b[i] = 0.0; - } - -#pragma acc data copy (a[0:N], b[0:N]) copyin (N) - { - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = a[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 5.0) - abort (); - - if (b[i] != 5.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 7.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - } - -#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N) - { - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 7.0) - abort (); - - if (b[i] != 49.0) - abort (); - - if (c[i] != 4.0) - abort (); - - if (d[i] != 1.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 3.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - e[i] = 0.0; - } - -#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N) - { - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; - } - -#pragma acc kernels wait (s) async (s) - { - int ii; - - for (ii = 0; ii < N; ii++) - e[ii] = a[ii] + b[ii] + c[ii] + d[ii]; - } - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 3.0) - abort (); - - if (b[i] != 9.0) - abort (); - - if (c[i] != 4.0) - abort (); - - if (d[i] != 1.0) - abort (); - - if (e[i] != 17.0) - abort (); - } - - for (i = 0; i < N; i++) - { - a[i] = 4.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - e[i] = 0.0; - } - -#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N) - { - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - -#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 4.0) - abort (); - - if (b[i] != 16.0) - abort (); - - if (c[i] != 4.0) - abort (); - } - - - for (i = 0; i < N; i++) - { - a[i] = 5.0; - b[i] = 0.0; - c[i] = 0.0; - d[i] = 0.0; - e[i] = 0.0; - } - -#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N) - { - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; - } - -#pragma acc kernels async - { - int ii; - - for (ii = 0; ii < N; ii++) - c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; - } - -#pragma acc update host (a[0:N], b[0:N], c[0:N]) async - -#pragma acc wait (s) - - } - - for (i = 0; i < N; i++) - { - if (a[i] != 5.0) - abort (); - - if (b[i] != 25.0) - abort (); - - if (c[i] != 4.0) - abort (); - } - - acc_shutdown (acc_device_nvidia); - - return 0; -} Grüße Thomas
Hi Chung-Lin! A little bit of documentation starter update for you to include. Please make sure that all relevant functions have such comments addded. commit 7e0896281d155e1544751f43c1eaace8e005e019 Author: Thomas Schwinge <thomas@codesourcery.com> Date: Thu Dec 13 17:59:46 2018 +0100 [WIP] into async re-work: documentation --- libgomp/libgomp.h | 3 +++ libgomp/oacc-async.c | 7 +++++++ libgomp/plugin/plugin-nvptx.c | 4 ++-- libgomp/target.c | 3 +++ 4 files changed, 15 insertions(+), 2 deletions(-) diff --git libgomp/libgomp.h libgomp/libgomp.h index 8b74d6368389..574fcd1ee4ad 100644 --- libgomp/libgomp.h +++ libgomp/libgomp.h @@ -949,6 +949,9 @@ typedef struct acc_dispatch_t __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func; struct { + /* Once created and put into the "active" list, asyncqueues are then never + destructed and removed from the "active" list, other than if the TODO + device is shut down. */ gomp_mutex_t lock; int nasyncqueue; struct goacc_asyncqueue **asyncqueue; diff --git libgomp/oacc-async.c libgomp/oacc-async.c index b091ba2460ac..0f5f74bdf836 100644 --- libgomp/oacc-async.c +++ libgomp/oacc-async.c @@ -280,6 +280,10 @@ goacc_async_free (struct gomp_device_descr *devicep, devicep->openacc.async.queue_callback_func (aq, free, ptr); } +/* This function initializes the asyncqueues for the device specified by + DEVICEP. TODO DEVICEP must be locked on entry, and remains locked on + return. */ + attribute_hidden void goacc_init_asyncqueues (struct gomp_device_descr *devicep) { @@ -289,6 +293,9 @@ goacc_init_asyncqueues (struct gomp_device_descr *devicep) devicep->openacc.async.active = NULL; } +/* This function finalizes the asyncqueues for the device specified by DEVICEP. + TODO DEVICEP must be locked on entry, and remains locked on return. */ + attribute_hidden bool goacc_fini_asyncqueues (struct gomp_device_descr *devicep) { diff --git libgomp/plugin/plugin-nvptx.c libgomp/plugin/plugin-nvptx.c index 7b658264b8e7..577ed39ef3f6 100644 --- libgomp/plugin/plugin-nvptx.c +++ libgomp/plugin/plugin-nvptx.c @@ -1340,14 +1340,14 @@ GOMP_OFFLOAD_openacc_cuda_get_current_context (void) return nvptx_get_current_cuda_context (); } -/* NOTE: This returns a CUstream, not a ptx_stream pointer. */ +/* This returns a CUstream. */ void * GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *aq) { return (void *) aq->cuda_stream; } -/* NOTE: This takes a CUstream, not a ptx_stream pointer. */ +/* This takes a CUstream. */ int GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream) { diff --git libgomp/target.c libgomp/target.c index e67d9248ae0b..96df1890a729 100644 --- libgomp/target.c +++ libgomp/target.c @@ -1506,6 +1506,9 @@ gomp_init_device (struct gomp_device_descr *devicep) devicep->state = GOMP_DEVICE_INITIALIZED; } +/* This function finalizes the target device, specified by DEVICEP. DEVICEP + must be locked on entry, and remains locked on return. */ + attribute_hidden bool gomp_fini_device (struct gomp_device_descr *devicep) { Grüße Thomas
On 2018/12/13 11:51 PM, Thomas Schwinge wrote: > On Thu, 13 Dec 2018 23:28:49 +0800, Chung-Lin Tang<chunglin_tang@mentor.com> wrote: >> On 2018/12/7 6:26 AM, Julian Brown wrote: >>> On Thu, 6 Dec 2018 22:22:46 +0000 >>> Julian Brown<julian@codesourcery.com> wrote: >>> >>>> On Thu, 6 Dec 2018 21:42:14 +0100 >>>> Thomas Schwinge<thomas@codesourcery.com> wrote: >>>> >>>>> [...] >>>>> ..., where the "Invalid read of size 8" happens, and which >>>>> eventually would try to "free (tgt)" again, via >>>>> libgomp/target.c:gomp_unmap_tgt: >>>>> >>>>> attribute_hidden void >>>>> gomp_unmap_tgt (struct target_mem_desc *tgt) >>>>> { >>>>> /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end >>>>> region. */ if (tgt->tgt_end) >>>>> gomp_free_device_memory (tgt->device_descr, tgt->to_free); >>>>> >>>>> free (tgt->array); >>>>> free (tgt); >>>>> } >>>>> >>>>> Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong, >>>>> or something else? I think I understand the problem now. In gomp_unmap_vars_async(), in the case of tgt->list_count == 0 (i.e. no map arguments at all) the code should simply free the tgt and return, while the code in goacc_async_copyout_unmap_vars() didn't handle this case and always scheduled an asynchronous free of the tgt later, causing that valgrind error you see. I am still testing the attached patch, but I think it is the right fix: I reviewed what I wrote and it seemed the way I organized things into a goacc_async_copyout_unmap_vars() routine, including the hackish refcount++, etc. is simply unneeded. I have deleted those stuff and consolidated things back into gomp_unmap_vars_async(). I'll update the whole patches later after complete testing, the attached is the patch atop of the prior async patches. (the small program you gave above does pass valgrind now) Julian, I didn't try the OG8 refcount changes, it's just too large a set of changes to reason about in so short time, maybe later when we are prepared to fix things completely as you noted what those patches were capable of. Chung-Lin diff -ru trunk-orig/libgomp/oacc-async.c trunk-work/libgomp/oacc-async.c --- trunk-orig/libgomp/oacc-async.c 2018-12-14 21:06:06.649794724 +0800 +++ trunk-work/libgomp/oacc-async.c 2018-12-14 22:11:29.252251925 +0800 @@ -238,31 +238,6 @@ thr->default_async = async; } -static void -goacc_async_unmap_tgt (void *ptr) -{ - struct target_mem_desc *tgt = (struct target_mem_desc *) ptr; - - if (tgt->refcount > 1) - tgt->refcount--; - else - gomp_unmap_tgt (tgt); -} - -attribute_hidden void -goacc_async_copyout_unmap_vars (struct target_mem_desc *tgt, - struct goacc_asyncqueue *aq) -{ - struct gomp_device_descr *devicep = tgt->device_descr; - - /* Increment reference to delay freeing of device memory until callback - has triggered. */ - tgt->refcount++; - gomp_unmap_vars_async (tgt, true, aq); - devicep->openacc.async.queue_callback_func (aq, goacc_async_unmap_tgt, - (void *) tgt); -} - attribute_hidden void goacc_async_free (struct gomp_device_descr *devicep, struct goacc_asyncqueue *aq, void *ptr) diff -ru trunk-orig/libgomp/oacc-int.h trunk-work/libgomp/oacc-int.h --- trunk-orig/libgomp/oacc-int.h 2018-12-14 21:06:06.649794724 +0800 +++ trunk-work/libgomp/oacc-int.h 2018-12-14 22:11:43.379947915 +0800 @@ -104,8 +104,6 @@ void goacc_init_asyncqueues (struct gomp_device_descr *); bool goacc_fini_asyncqueues (struct gomp_device_descr *); -void goacc_async_copyout_unmap_vars (struct target_mem_desc *, - struct goacc_asyncqueue *); void goacc_async_free (struct gomp_device_descr *, struct goacc_asyncqueue *, void *); struct goacc_asyncqueue *get_goacc_asyncqueue (int); diff -ru trunk-orig/libgomp/oacc-mem.c trunk-work/libgomp/oacc-mem.c --- trunk-orig/libgomp/oacc-mem.c 2018-12-14 21:06:06.649794724 +0800 +++ trunk-work/libgomp/oacc-mem.c 2018-12-14 22:10:08.325998369 +0800 @@ -911,7 +911,7 @@ else { goacc_aq aq = get_goacc_asyncqueue (async); - goacc_async_copyout_unmap_vars (t, aq); + gomp_unmap_vars_async (t, true, aq); } } diff -ru trunk-orig/libgomp/oacc-parallel.c trunk-work/libgomp/oacc-parallel.c --- trunk-orig/libgomp/oacc-parallel.c 2018-12-14 21:06:06.649794724 +0800 +++ trunk-work/libgomp/oacc-parallel.c 2018-12-14 22:09:51.918353575 +0800 @@ -245,7 +245,7 @@ { acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, dims, tgt, aq); - goacc_async_copyout_unmap_vars (tgt, aq); + gomp_unmap_vars_async (tgt, true, aq); } } diff -ru trunk-orig/libgomp/target.c trunk-work/libgomp/target.c --- trunk-orig/libgomp/target.c 2018-12-14 21:06:06.653794622 +0800 +++ trunk-work/libgomp/target.c 2018-12-14 20:42:03.629154346 +0800 @@ -1072,6 +1072,17 @@ return is_tgt_unmapped; } +static void +gomp_unref_tgt (void *ptr) +{ + struct target_mem_desc *tgt = (struct target_mem_desc *) ptr; + + if (tgt->refcount > 1) + tgt->refcount--; + else + gomp_unmap_tgt (tgt); +} + /* Unmap variables described by TGT. If DO_COPYFROM is true, copy relevant variables back from device to host: if it is false, it is assumed that this has been done already. */ @@ -1130,10 +1141,11 @@ gomp_remove_var (devicep, k); } - if (tgt->refcount > 1) - tgt->refcount--; + if (aq) + devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt, + (void *) tgt); else - gomp_unmap_tgt (tgt); + gomp_unref_tgt ((void *) tgt); gomp_mutex_unlock (&devicep->lock); }
Hi Chung-Lin!
On Tue, 25 Sep 2018 21:09:49 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> This patch is a re-organization of OpenACC asynchronous queues.
Again, many thanks for that!
In addition to the review emails I just posted, I've also put all that
stuff into a GitHub branch:
<https://github.com/tschwinge/gcc/tree/wip-async_re-work>.
This also includes some more "into async re-work: replicate [...]"
commits to adjust your work for preparational things that I plan to
commit before. I split these out intentionally, so that you can easily
see/review these changes.
Grüße
Thomas
Hi Chung-Lin! On Fri, 14 Dec 2018 22:28:58 +0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote: > On 2018/12/13 11:51 PM, Thomas Schwinge wrote: > > On Thu, 13 Dec 2018 23:28:49 +0800, Chung-Lin Tang<chunglin_tang@mentor.com> wrote: > >> On 2018/12/7 6:26 AM, Julian Brown wrote: > >>> On Thu, 6 Dec 2018 22:22:46 +0000 > >>> Julian Brown<julian@codesourcery.com> wrote: > >>> > >>>> On Thu, 6 Dec 2018 21:42:14 +0100 > >>>> Thomas Schwinge<thomas@codesourcery.com> wrote: > >>>> > >>>>> [...] > >>>>> ..., where the "Invalid read of size 8" happens, and which > >>>>> eventually would try to "free (tgt)" again, via > >>>>> libgomp/target.c:gomp_unmap_tgt: > >>>>> > >>>>> attribute_hidden void > >>>>> gomp_unmap_tgt (struct target_mem_desc *tgt) > >>>>> { > >>>>> /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end > >>>>> region. */ if (tgt->tgt_end) > >>>>> gomp_free_device_memory (tgt->device_descr, tgt->to_free); > >>>>> > >>>>> free (tgt->array); > >>>>> free (tgt); > >>>>> } > >>>>> > >>>>> Is the "free (tgt)" in libgomp/target.c:gomp_unmap_vars_async wrong, > >>>>> or something else? > > I think I understand the problem now. In gomp_unmap_vars_async(), in the case of > tgt->list_count == 0 (i.e. no map arguments at all) the code should simply free the tgt > and return, while the code in goacc_async_copyout_unmap_vars() didn't handle this case > and always scheduled an asynchronous free of the tgt later, causing that valgrind error > you see. > > I am still testing the attached patch, but I think it is the right fix: I reviewed what I > wrote and it seemed the way I organized things into a goacc_async_copyout_unmap_vars() routine, > including the hackish refcount++, etc. is simply unneeded. I have deleted those stuff > and consolidated things back into gomp_unmap_vars_async(). > > I'll update the whole patches later after complete testing, the attached is the patch atop > of the prior async patches. (the small program you gave above does pass valgrind now) Thanks, confirmed. Grüße Thomas > diff -ru trunk-orig/libgomp/oacc-async.c trunk-work/libgomp/oacc-async.c > --- trunk-orig/libgomp/oacc-async.c 2018-12-14 21:06:06.649794724 +0800 > +++ trunk-work/libgomp/oacc-async.c 2018-12-14 22:11:29.252251925 +0800 > @@ -238,31 +238,6 @@ > thr->default_async = async; > } > > -static void > -goacc_async_unmap_tgt (void *ptr) > -{ > - struct target_mem_desc *tgt = (struct target_mem_desc *) ptr; > - > - if (tgt->refcount > 1) > - tgt->refcount--; > - else > - gomp_unmap_tgt (tgt); > -} > - > -attribute_hidden void > -goacc_async_copyout_unmap_vars (struct target_mem_desc *tgt, > - struct goacc_asyncqueue *aq) > -{ > - struct gomp_device_descr *devicep = tgt->device_descr; > - > - /* Increment reference to delay freeing of device memory until callback > - has triggered. */ > - tgt->refcount++; > - gomp_unmap_vars_async (tgt, true, aq); > - devicep->openacc.async.queue_callback_func (aq, goacc_async_unmap_tgt, > - (void *) tgt); > -} > - > attribute_hidden void > goacc_async_free (struct gomp_device_descr *devicep, > struct goacc_asyncqueue *aq, void *ptr) > diff -ru trunk-orig/libgomp/oacc-int.h trunk-work/libgomp/oacc-int.h > --- trunk-orig/libgomp/oacc-int.h 2018-12-14 21:06:06.649794724 +0800 > +++ trunk-work/libgomp/oacc-int.h 2018-12-14 22:11:43.379947915 +0800 > @@ -104,8 +104,6 @@ > > void goacc_init_asyncqueues (struct gomp_device_descr *); > bool goacc_fini_asyncqueues (struct gomp_device_descr *); > -void goacc_async_copyout_unmap_vars (struct target_mem_desc *, > - struct goacc_asyncqueue *); > void goacc_async_free (struct gomp_device_descr *, struct goacc_asyncqueue *, > void *); > struct goacc_asyncqueue *get_goacc_asyncqueue (int); > diff -ru trunk-orig/libgomp/oacc-mem.c trunk-work/libgomp/oacc-mem.c > --- trunk-orig/libgomp/oacc-mem.c 2018-12-14 21:06:06.649794724 +0800 > +++ trunk-work/libgomp/oacc-mem.c 2018-12-14 22:10:08.325998369 +0800 > @@ -911,7 +911,7 @@ > else > { > goacc_aq aq = get_goacc_asyncqueue (async); > - goacc_async_copyout_unmap_vars (t, aq); > + gomp_unmap_vars_async (t, true, aq); > } > } > > diff -ru trunk-orig/libgomp/oacc-parallel.c trunk-work/libgomp/oacc-parallel.c > --- trunk-orig/libgomp/oacc-parallel.c 2018-12-14 21:06:06.649794724 +0800 > +++ trunk-work/libgomp/oacc-parallel.c 2018-12-14 22:09:51.918353575 +0800 > @@ -245,7 +245,7 @@ > { > acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, > dims, tgt, aq); > - goacc_async_copyout_unmap_vars (tgt, aq); > + gomp_unmap_vars_async (tgt, true, aq); > } > } > > diff -ru trunk-orig/libgomp/target.c trunk-work/libgomp/target.c > --- trunk-orig/libgomp/target.c 2018-12-14 21:06:06.653794622 +0800 > +++ trunk-work/libgomp/target.c 2018-12-14 20:42:03.629154346 +0800 > @@ -1072,6 +1072,17 @@ > return is_tgt_unmapped; > } > > +static void > +gomp_unref_tgt (void *ptr) > +{ > + struct target_mem_desc *tgt = (struct target_mem_desc *) ptr; > + > + if (tgt->refcount > 1) > + tgt->refcount--; > + else > + gomp_unmap_tgt (tgt); > +} > + > /* Unmap variables described by TGT. If DO_COPYFROM is true, copy relevant > variables back from device to host: if it is false, it is assumed that this > has been done already. */ > @@ -1130,10 +1141,11 @@ > gomp_remove_var (devicep, k); > } > > - if (tgt->refcount > 1) > - tgt->refcount--; > + if (aq) > + devicep->openacc.async.queue_callback_func (aq, gomp_unref_tgt, > + (void *) tgt); > else > - gomp_unmap_tgt (tgt); > + gomp_unref_tgt ((void *) tgt); > > gomp_mutex_unlock (&devicep->lock); > }