From patchwork Tue Sep 25 13:10:21 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chung-Lin Tang <chunglin_tang@mentor.com>
X-Patchwork-Id: 974403
Return-Path: 
 <gcc-patches-return-486323-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Authentication-Results: ozlabs.org;
	spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org
	(client-ip=209.132.180.131; helo=sourceware.org;
	envelope-from=gcc-patches-return-486323-incoming=patchwork.ozlabs.org@gcc.gnu.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=none (p=none dis=none) header.from=mentor.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org
	header.b="LqMGlryO"; dkim-atps=neutral
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 42KLzP5tclz9s2P
	for <incoming@patchwork.ozlabs.org>;
	Tue, 25 Sep 2018 23:10:40 +1000 (AEST)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; q=dns; s=default; b=OcHw1Dll5/P2l88XdJh3Vrrx+4hrU
	Wya8pByf4ZiVmIMmCMhCkSzcDmF7eIgD61U0cj7+1CR3sdL0IPpOZYq3EOoxvsGk
	5/nafUENPGiNBojvWFhAJfQvj7ifAQ1VplFgv2R64WU2GrYQCm30P2g7gVjvk2yh
	xjkQ2V+uJBYfYw=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; s=default; bh=G1a4za6bPTRRgOKKlZHz4EhQcKQ=; b=LqM
	GlryOF5Ka3/Bwkl7nmPllInJh1WpkbxhuamQIWPL5ioI9E6aWNXWy84LWHSpjJHc
	3T5Ml7hXg06B6Z2Jzc3V/gT8/91znJ+PQucDKWhEmWY5Nr7WxCtc+rfHjNjriXGu
	9qldoeOiqfuMYimnwF/6Wvgtzd1J2kCl51OMIdPs=
Received: (qmail 116939 invoked by alias); 25 Sep 2018 13:10:33 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 116924 invoked by uid 89); 25 Sep 2018 13:10:32 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-24.7 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3,
	RCVD_IN_DNSWL_NONE,
	SPF_PASS autolearn=ham version=3.3.2 spammy=1997, 199, 7
X-HELO: relay1.mentorg.com
Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131)
	by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with
	ESMTP; Tue, 25 Sep 2018 13:10:30 +0000
Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202])	by
	relay1.mentorg.com with esmtps
	(TLSv1.2:ECDHE-RSA-AES256-SHA384:256)	id 1g4n6i-0003tn-SE
	from ChungLin_Tang@mentor.com ; Tue, 25 Sep 2018 06:10:28 -0700
Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com
	(147.34.90.202) with Microsoft SMTP Server (TLS) id
	15.0.1320.4; Tue, 25 Sep 2018 06:10:25 -0700
Reply-To: <cltang@codesourcery.com>
From: Chung-Lin Tang <chunglin_tang@mentor.com>
Subject: [PATCH 1/6, OpenACC, libgomp] Async re-work, interfaces
To: <gcc-patches@gcc.gnu.org>, Jakub Jelinek <jakub@redhat.com>,
	Thomas Schwinge <Thomas_Schwinge@mentor.com>
Message-ID: <e7a4b16c-995a-8f8f-33f7-1310552b7e26@mentor.com>
Date: Tue, 25 Sep 2018 21:10:21 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13;
	rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0

This patch separates out the header interface changes. GOMP_VERSION has been bumped,
and various changes to the plugin interface, and a few libgomp internal functions
declared. The libgomp linkmap updated as well.

Thanks,
Chung-Lin

	include/
	* gomp-constants.h (GOMP_ASYNC_DEFAULT): Define.
	(GOMP_VERSION): Increment for new plugin interface changes.

	libgomp/
	* libgomp-plugin.h (struct goacc_asyncqueue): Declare.
	(struct goacc_asyncqueue_list): Likewise.
	(goacc_aq): Likewise.
	(goacc_aq_list): Likewise.
	(GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
	(GOMP_OFFLOAD_openacc_async_test): Remove.
	(GOMP_OFFLOAD_openacc_async_test_all): Remove.
	(GOMP_OFFLOAD_openacc_async_wait): Remove.
	(GOMP_OFFLOAD_openacc_async_wait_async): Remove.
	(GOMP_OFFLOAD_openacc_async_wait_all): Remove.
	(GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
	(GOMP_OFFLOAD_openacc_async_set_async): Remove.
	(GOMP_OFFLOAD_openacc_exec): Adjust declaration.
	(GOMP_OFFLOAD_openacc_cuda_get_stream): Likewise.
	(GOMP_OFFLOAD_openacc_cuda_set_stream): Likewise.

	(GOMP_OFFLOAD_openacc_async_exec): Declare.
	(GOMP_OFFLOAD_openacc_async_construct): Declare.
	(GOMP_OFFLOAD_openacc_async_destruct): Declare.
	(GOMP_OFFLOAD_openacc_async_test): Declare.
	(GOMP_OFFLOAD_openacc_async_synchronize): Declare.
	(GOMP_OFFLOAD_openacc_async_serialize): Declare.
	(GOMP_OFFLOAD_openacc_async_queue_callback): Declare.
	(GOMP_OFFLOAD_openacc_async_host2dev): Declare.
	(GOMP_OFFLOAD_openacc_async_dev2host): Declare.

	* libgomp.h (struct acc_dispatch_t): Define 'async' sub-struct.
	(gomp_acc_insert_pointer): Adjust declaration.
	(gomp_copy_host2dev): New declaration.
	(gomp_copy_dev2host): Likewise.
	(gomp_map_vars_async): Likewise.
	(gomp_unmap_tgt): Likewise.
	(gomp_unmap_vars_async): Likewise.
	(gomp_fini_device): Likewise.

	* libgomp.map (OACC_2.5): Add acc_get_default_async,
	acc_get_default_async_h_, acc_set_default_async, and
	acc_set_default_async_h_.
	(GOMP_PLUGIN_1.0): Remove GOMP_PLUGIN_async_unmap_vars.
diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index f1c53c5..697080c 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -160,6 +160,7 @@ enum gomp_map_kind
 /* Asynchronous behavior.  Keep in sync with
    libgomp/{openacc.h,openacc.f90,openacc_lib.h}:acc_async_t.  */
 
+#define GOMP_ASYNC_DEFAULT		0
 #define GOMP_ASYNC_NOVAL		-1
 #define GOMP_ASYNC_SYNC			-2
 
@@ -199,7 +200,7 @@ enum gomp_map_kind
 /* Versions of libgomp and device-specific plugins.  GOMP_VERSION
    should be incremented whenever an ABI-incompatible change is introduced
    to the plugin interface defined in libgomp/libgomp.h.  */
-#define GOMP_VERSION	1
+#define GOMP_VERSION	2
 #define GOMP_VERSION_NVIDIA_PTX 1
 #define GOMP_VERSION_INTEL_MIC 0
 #define GOMP_VERSION_HSA 0
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index 2fc35d56..667ba19 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -53,6 +53,20 @@ enum offload_target_type
   OFFLOAD_TARGET_TYPE_HSA = 7
 };
 
+/* Opaque type to represent plugin-dependent implementation of an
+   OpenACC asynchronous queue.  */
+struct goacc_asyncqueue;
+
+/* Used to keep a list of active asynchronous queues.  */
+struct goacc_asyncqueue_list
+{
+  struct goacc_asyncqueue *aq;
+  struct goacc_asyncqueue_list *next;
+};
+
+typedef struct goacc_asyncqueue *goacc_aq;
+typedef struct goacc_asyncqueue_list *goacc_aq_list;
+
 /* Auxiliary struct, used for transferring pairs of addresses from plugin
    to libgomp.  */
 struct addr_pair
@@ -93,22 +107,31 @@ extern bool GOMP_OFFLOAD_dev2dev (int, void *, const void *, size_t);
 extern bool GOMP_OFFLOAD_can_run (void *);
 extern void GOMP_OFFLOAD_run (int, void *, void *, void **);
 extern void GOMP_OFFLOAD_async_run (int, void *, void *, void **, void *);
+
 extern void GOMP_OFFLOAD_openacc_exec (void (*) (void *), size_t, void **,
-				       void **, int, unsigned *, void *);
-extern void GOMP_OFFLOAD_openacc_register_async_cleanup (void *, int);
-extern int GOMP_OFFLOAD_openacc_async_test (int);
-extern int GOMP_OFFLOAD_openacc_async_test_all (void);
-extern void GOMP_OFFLOAD_openacc_async_wait (int);
-extern void GOMP_OFFLOAD_openacc_async_wait_async (int, int);
-extern void GOMP_OFFLOAD_openacc_async_wait_all (void);
-extern void GOMP_OFFLOAD_openacc_async_wait_all_async (int);
-extern void GOMP_OFFLOAD_openacc_async_set_async (int);
+				       void **, unsigned *, void *);
+extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void **,
+					     void **, unsigned *, void *,
+					     struct goacc_asyncqueue *);
+extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (void);
+extern bool GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *);
+extern int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *,
+						  struct goacc_asyncqueue *);
+extern void GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *,
+						       void (*)(void *), void *);
+extern bool GOMP_OFFLOAD_openacc_async_host2dev (int, void *, const void *, size_t,
+						 struct goacc_asyncqueue *);
+extern bool GOMP_OFFLOAD_openacc_async_dev2host (int, void *, const void *, size_t,
+						 struct goacc_asyncqueue *);
 extern void *GOMP_OFFLOAD_openacc_create_thread_data (int);
 extern void GOMP_OFFLOAD_openacc_destroy_thread_data (void *);
 extern void *GOMP_OFFLOAD_openacc_cuda_get_current_device (void);
 extern void *GOMP_OFFLOAD_openacc_cuda_get_current_context (void);
-extern void *GOMP_OFFLOAD_openacc_cuda_get_stream (int);
-extern int GOMP_OFFLOAD_openacc_cuda_set_stream (int, void *);
+extern void *GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *);
+extern int GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *,
+						 void *);
 
 #ifdef __cplusplus
 }
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index 3a8cc2b..a69faa7 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -888,19 +888,23 @@ typedef struct acc_dispatch_t
   /* Execute.  */
   __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func;
 
-  /* Async cleanup callback registration.  */
-  __typeof (GOMP_OFFLOAD_openacc_register_async_cleanup)
-    *register_async_cleanup_func;
-
-  /* Asynchronous routines.  */
-  __typeof (GOMP_OFFLOAD_openacc_async_test) *async_test_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_test_all) *async_test_all_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_wait) *async_wait_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_wait_async) *async_wait_async_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_wait_all) *async_wait_all_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_wait_all_async)
-    *async_wait_all_async_func;
-  __typeof (GOMP_OFFLOAD_openacc_async_set_async) *async_set_async_func;
+  struct {
+    gomp_mutex_t lock;
+    int nasyncqueue;
+    struct goacc_asyncqueue **asyncqueue;
+    struct goacc_asyncqueue_list *active;
+
+    __typeof (GOMP_OFFLOAD_openacc_async_construct) *construct_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_destruct) *destruct_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_test) *test_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_synchronize) *synchronize_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_serialize) *serialize_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_queue_callback) *queue_callback_func;
+
+    __typeof (GOMP_OFFLOAD_openacc_async_exec) *exec_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_host2dev) *host2dev_func;
+    __typeof (GOMP_OFFLOAD_openacc_async_dev2host) *dev2host_func;
+  } async;
 
   /* Create/destroy TLS data.  */
   __typeof (GOMP_OFFLOAD_openacc_create_thread_data) *create_thread_data_func;
@@ -992,17 +996,33 @@ enum gomp_map_vars_kind
   GOMP_MAP_VARS_ENTER_DATA
 };
 
-extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *);
+extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *, int);
 extern void gomp_acc_remove_pointer (void *, size_t, bool, int, int, int);
 extern void gomp_acc_declare_allocate (bool, size_t, void **, size_t *,
 				       unsigned short *);
+struct gomp_coalesce_buf;
+extern void gomp_copy_host2dev (struct gomp_device_descr *,
+				struct goacc_asyncqueue *, void *, const void *,
+				size_t, struct gomp_coalesce_buf *);
+extern void gomp_copy_dev2host (struct gomp_device_descr *,
+				struct goacc_asyncqueue *, void *, const void *,
+				size_t);
 
 extern struct target_mem_desc *gomp_map_vars (struct gomp_device_descr *,
 					      size_t, void **, void **,
 					      size_t *, void *, bool,
 					      enum gomp_map_vars_kind);
+extern struct target_mem_desc *gomp_map_vars_async (struct gomp_device_descr *,
+						    struct goacc_asyncqueue *,
+						    size_t, void **, void **,
+						    size_t *, void *, bool,
+						    enum gomp_map_vars_kind);
+extern void gomp_unmap_tgt (struct target_mem_desc *);
 extern void gomp_unmap_vars (struct target_mem_desc *, bool);
+extern void gomp_unmap_vars_async (struct target_mem_desc *, bool,
+				   struct goacc_asyncqueue *);
 extern void gomp_init_device (struct gomp_device_descr *);
+extern bool gomp_fini_device (struct gomp_device_descr *);
 extern void gomp_free_memmap (struct splay_tree_s *);
 extern void gomp_unload_device (struct gomp_device_descr *);
 extern bool gomp_remove_var (struct gomp_device_descr *, splay_tree_key);
diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map
index e3f0c64..dd97728 100644
--- a/libgomp/libgomp.map
+++ b/libgomp/libgomp.map
@@ -420,8 +420,12 @@ OACC_2.5 {
 	acc_delete_finalize_async_32_h_;
 	acc_delete_finalize_async_64_h_;
 	acc_delete_finalize_async_array_h_;
+	acc_get_default_async;
+	acc_get_default_async_h_;
 	acc_memcpy_from_device_async;
 	acc_memcpy_to_device_async;
+	acc_set_default_async;
+	acc_set_default_async_h_;
 	acc_update_device_async;
 	acc_update_device_async_32_h_;
 	acc_update_device_async_64_h_;
@@ -458,7 +462,6 @@ GOMP_PLUGIN_1.0 {
 	GOMP_PLUGIN_debug;
 	GOMP_PLUGIN_error;
 	GOMP_PLUGIN_fatal;
-	GOMP_PLUGIN_async_unmap_vars;
 	GOMP_PLUGIN_acc_thread;
 };
 

From patchwork Tue Sep 25 13:10:47 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chung-Lin Tang <chunglin_tang@mentor.com>
X-Patchwork-Id: 974404
Return-Path: 
 <gcc-patches-return-486324-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Authentication-Results: ozlabs.org;
	spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org
	(client-ip=209.132.180.131; helo=sourceware.org;
	envelope-from=gcc-patches-return-486324-incoming=patchwork.ozlabs.org@gcc.gnu.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=none (p=none dis=none) header.from=mentor.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org
	header.b="kFQ2jb/q"; dkim-atps=neutral
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 42KLzw19W5z9s2P
	for <incoming@patchwork.ozlabs.org>;
	Tue, 25 Sep 2018 23:11:07 +1000 (AEST)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; q=dns; s=default; b=LJQXcSp945RKu1UCM70EdmlDO72S1
	pm5y7dD3vJYd39FQnCkOBbr81o5MCAzDMxIOk6K7FOi0WotDzbEkWP3M507F4Wfr
	aKPoq+cTIdh7fD0hcqbCw8KfQQKctuwRJ6h8RXiorUfpu4ToWvfI7/WOQ8D4iHoY
	adhZNX4q7Uy3JE=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; s=default; bh=zsq19aQL3r4rQCOaZeYKngFAW1w=; b=kFQ
	2jb/q3D8tXZZuSOWGlv4MflkXKqlm1eZW4sitn/WEnJVObhvzgR/8fS7apeMjOw3
	d2IBQ8QhGcrY/AcKHhDAcPOYaplCBzBZ7AUgapMHNQ7RIc3FPeirdTxarL0zhlOf
	PkeFicVVZReeXF4zbwKl5bAotz76xg6pOGDIEZ6M=
Received: (qmail 118542 invoked by alias); 25 Sep 2018 13:11:00 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 118516 invoked by uid 89); 25 Sep 2018 13:10:59 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-24.8 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_SHORT,
	RCVD_IN_DNSWL_NONE,
	SPF_PASS autolearn=ham version=3.3.2 spammy=UD:lock, 3017, qid,
	ord
X-HELO: relay1.mentorg.com
Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131)
	by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with
	ESMTP; Tue, 25 Sep 2018 13:10:56 +0000
Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202])	by
	relay1.mentorg.com with esmtps
	(TLSv1.2:ECDHE-RSA-AES256-SHA384:256)	id 1g4n78-000444-I3
	from ChungLin_Tang@mentor.com for gcc-patches@gcc.gnu.org;
	Tue, 25 Sep 2018 06:10:54 -0700
Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com
	(147.34.90.202) with Microsoft SMTP Server (TLS) id
	15.0.1320.4; Tue, 25 Sep 2018 06:10:51 -0700
Reply-To: <cltang@codesourcery.com>
From: Chung-Lin Tang <chunglin_tang@mentor.com>
Subject: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts
To: <gcc-patches@gcc.gnu.org>, Thomas Schwinge <Thomas_Schwinge@mentor.com>
Message-ID: <12319572-dd02-c946-f2b9-9d047be9c707@mentor.com>
Date: Tue, 25 Sep 2018 21:10:47 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13;
	rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0

Hi Thomas,
These are the OpenACC specific changes, mostly the re-implementation of async-related acc_* runtime
library API functions to use the new backend plugin interfaces, in a non-target specific way.

Thanks,
Chung-Lin


	* oacc-async.c (get_goacc_thread): New function.
	(get_goacc_thread_device): New function.
	(lookup_goacc_asyncqueue): New function.
	(get_goacc_asyncqueue): New function.
	(acc_async_test): Adjust code to use new async design.
	(acc_async_test_all): Likewise.
	(acc_wait): Likewise.
	(acc_wait_async): Likewise.
	(acc_wait_all): Likewise.
	(acc_wait_all_async): Likewise.
	(acc_get_default_async): New API function.
	(acc_set_default_async): Likewise.
	(goacc_async_unmap_tgt): New function.
	(goacc_async_copyout_unmap_vars): Likewise.
	(goacc_async_free): Likewise.
	(goacc_init_asyncqueues): Likewise.
	(goacc_fini_asyncqueues): Likewise.
	* oacc-cuda.c (acc_get_cuda_stream): Adjust code to use new async
	design.
	(acc_set_cuda_stream): Likewise.
	* oacc-host.c (host_openacc_exec): Adjust parameters, remove 'async'.
	(host_openacc_register_async_cleanup): Remove.
	(host_openacc_async_exec): New function.
	(host_openacc_async_test): Adjust parameters.
	(host_openacc_async_test_all): Remove.
	(host_openacc_async_wait): Remove.
	(host_openacc_async_wait_async): Remove.
	(host_openacc_async_wait_all): Remove.
	(host_openacc_async_wait_all_async): Remove.
	(host_openacc_async_set_async): Remove.
	(host_openacc_async_synchronize): New function.
	(host_openacc_async_serialize): New function.
	(host_openacc_async_host2dev): New function.
	(host_openacc_async_dev2host): New function.
	(host_openacc_async_queue_callback): New function.
	(host_openacc_async_construct): New function.
	(host_openacc_async_destruct): New function.
	(struct gomp_device_descr host_dispatch): Remove initialization of old
	interface, add intialization of new async sub-struct.
	* oacc-init.c (acc_shutdown_1): Adjust to use gomp_fini_device.
	(goacc_attach_host_thread_to_device): Remove old async code usage, add
	initialization of per-thread default_async.
	* oacc-int.h (struct goacc_thread): Add default_async field.
	(goacc_init_asyncqueues): New declaration.
	(goacc_fini_asyncqueues): Likewise.
	(goacc_async_copyout_unmap_vars): Likewise.
	(goacc_async_free): Likewise.
	(get_goacc_asyncqueue): Likewise.
	(lookup_goacc_asyncqueue): Likewise.

	* oacc-mem.c (memcpy_tofrom_device): Adjust code to use new async
	design.
	(present_create_copy): Likewise.
	(delete_copyout): Likewise.
	(update_dev_host): Likewise.
	(gomp_acc_insert_pointer): Add async parameter, adjust code to use new
	async design.
	(gomp_acc_remove_pointer): Adjust code to use new async design.
	* oacc-parallel.c (GOACC_parallel_keyed): Likewise.
	(GOACC_enter_exit_data): Likewise.
	(goacc_wait): Likewise.
	(GOACC_update): Likewise.
	* oacc-plugin.c (GOMP_PLUGIN_async_unmap_vars): Remove.

diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c
index a4e1863..68aaf19 100644
--- a/libgomp/oacc-async.c
+++ b/libgomp/oacc-async.c
@@ -27,10 +27,87 @@
    <http://www.gnu.org/licenses/>.  */
 
 #include <assert.h>
+#include <string.h>
 #include "openacc.h"
 #include "libgomp.h"
 #include "oacc-int.h"
 
+static struct goacc_thread *
+get_goacc_thread (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr;
+}
+
+static struct gomp_device_descr *
+get_goacc_thread_device (void)
+{
+  struct goacc_thread *thr = goacc_thread ();
+
+  if (!thr || !thr->dev)
+    gomp_fatal ("no device active");
+
+  return thr->dev;
+}
+
+attribute_hidden struct goacc_asyncqueue *
+lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async)
+{
+  /* The special value acc_async_noval (-1) maps to the thread-specific
+     default async stream.  */
+  if (async == acc_async_noval)
+    async = thr->default_async;
+
+  if (async == acc_async_sync)
+    return NULL;
+
+  if (async < 0)
+    gomp_fatal ("bad async %d", async);
+
+  struct gomp_device_descr *dev = thr->dev;
+
+  if (!create
+      && (async >= dev->openacc.async.nasyncqueue
+	  || !dev->openacc.async.asyncqueue[async]))
+    return NULL;
+
+  gomp_mutex_lock (&dev->openacc.async.lock);
+  if (async >= dev->openacc.async.nasyncqueue)
+    {
+      int diff = async + 1 - dev->openacc.async.nasyncqueue;
+      dev->openacc.async.asyncqueue
+	= gomp_realloc (dev->openacc.async.asyncqueue,
+			sizeof (goacc_aq) * (async + 1));
+      memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue,
+	      0, sizeof (goacc_aq) * diff);
+      dev->openacc.async.nasyncqueue = async + 1;
+    }
+
+  if (!dev->openacc.async.asyncqueue[async])
+    {
+      dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func ();
+
+      /* Link new async queue into active list.  */
+      goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list));
+      n->aq = dev->openacc.async.asyncqueue[async];
+      n->next = dev->openacc.async.active;
+      dev->openacc.async.active = n;
+    }
+  gomp_mutex_unlock (&dev->openacc.async.lock);
+  return dev->openacc.async.asyncqueue[async];
+}
+
+attribute_hidden struct goacc_asyncqueue *
+get_goacc_asyncqueue (int async)
+{
+  struct goacc_thread *thr = get_goacc_thread ();
+  return lookup_goacc_asyncqueue (thr, true, async);
+}
+
 int
 acc_async_test (int async)
 {
@@ -42,18 +119,25 @@ acc_async_test (int async)
   if (!thr || !thr->dev)
     gomp_fatal ("no device active");
 
-  return thr->dev->openacc.async_test_func (async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  return thr->dev->openacc.async.test_func (aq);
 }
 
 int
 acc_async_test_all (void)
 {
-  struct goacc_thread *thr = goacc_thread ();
-
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  return thr->dev->openacc.async_test_all_func ();
+  int ret = 1;
+  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+    if (!thr->dev->openacc.async.test_func (l->aq))
+      {
+	ret = 0;
+	break;
+      }
+  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+  return ret;
 }
 
 void
@@ -62,12 +146,10 @@ acc_wait (int async)
   if (!async_valid_p (async))
     gomp_fatal ("invalid async argument: %d", async);
 
-  struct goacc_thread *thr = goacc_thread ();
-
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  thr->dev->openacc.async_wait_func (async);
+  goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async);
+  thr->dev->openacc.async.synchronize_func (aq);
 }
 
 /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait.  */
@@ -84,23 +166,28 @@ acc_async_wait (int async)
 void
 acc_wait_async (int async1, int async2)
 {
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2);
+  goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1);
+  if (!aq1)
+    gomp_fatal ("invalid async 1");
+  if (aq1 == aq2)
+    gomp_fatal ("identical parameters");
 
-  thr->dev->openacc.async_wait_async_func (async1, async2);
+  thr->dev->openacc.async.synchronize_func (aq1);
+  thr->dev->openacc.async.serialize_func (aq1, aq2);
 }
 
 void
 acc_wait_all (void)
 {
-  struct goacc_thread *thr = goacc_thread ();
-
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  struct gomp_device_descr *dev = get_goacc_thread_device ();
 
-  thr->dev->openacc.async_wait_all_func ();
+  gomp_mutex_lock (&dev->openacc.async.lock);
+  for (goacc_aq_list l = dev->openacc.async.active; l; l = l->next)
+    dev->openacc.async.synchronize_func (l->aq);
+  gomp_mutex_unlock (&dev->openacc.async.lock);
 }
 
 /* acc_async_wait_all is an OpenACC 1.0 compatibility name for acc_wait_all.  */
@@ -120,10 +207,99 @@ acc_wait_all_async (int async)
   if (!async_valid_p (async))
     gomp_fatal ("invalid async argument: %d", async);
 
-  struct goacc_thread *thr = goacc_thread ();
+  struct goacc_thread *thr = get_goacc_thread ();
 
-  if (!thr || !thr->dev)
-    gomp_fatal ("no device active");
+  goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async);
 
-  thr->dev->openacc.async_wait_all_async_func (async);
+  gomp_mutex_lock (&thr->dev->openacc.async.lock);
+  for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next)
+    {
+      thr->dev->openacc.async.synchronize_func (l->aq);
+      if (waiting_queue)
+	thr->dev->openacc.async.serialize_func (l->aq, waiting_queue);
+    }
+  gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+}
+
+int
+acc_get_default_async (void)
+{
+  struct goacc_thread *thr = get_goacc_thread ();
+  return thr->default_async;
+}
+
+void
+acc_set_default_async (int async)
+{
+  if (async < acc_async_sync)
+    gomp_fatal ("invalid async argument: %d", async);
+
+  struct goacc_thread *thr = get_goacc_thread ();
+  thr->default_async = async;
+}
+
+static void
+goacc_async_unmap_tgt (void *ptr)
+{
+  struct target_mem_desc *tgt = (struct target_mem_desc *) ptr;
+
+  if (tgt->refcount > 1)
+    tgt->refcount--;
+  else
+    gomp_unmap_tgt (tgt);
+}
+
+attribute_hidden void
+goacc_async_copyout_unmap_vars (struct target_mem_desc *tgt,
+				struct goacc_asyncqueue *aq)
+{
+  struct gomp_device_descr *devicep = tgt->device_descr;
+
+  /* Increment reference to delay freeing of device memory until callback
+     has triggered.  */
+  tgt->refcount++;
+  gomp_unmap_vars_async (tgt, true, aq);
+  devicep->openacc.async.queue_callback_func (aq, goacc_async_unmap_tgt,
+					      (void *) tgt);
+}
+
+attribute_hidden void
+goacc_async_free (struct gomp_device_descr *devicep,
+		  struct goacc_asyncqueue *aq, void *ptr)
+{
+  if (!aq)
+    free (ptr);
+  else
+    devicep->openacc.async.queue_callback_func (aq, free, ptr);
+}
+
+attribute_hidden void
+goacc_init_asyncqueues (struct gomp_device_descr *devicep)
+{
+  gomp_mutex_init (&devicep->openacc.async.lock);
+  devicep->openacc.async.nasyncqueue = 0;
+  devicep->openacc.async.asyncqueue = NULL;
+  devicep->openacc.async.active = NULL;
+}
+
+attribute_hidden bool
+goacc_fini_asyncqueues (struct gomp_device_descr *devicep)
+{
+  bool ret = true;
+  if (devicep->openacc.async.nasyncqueue > 0)
+    {
+      goacc_aq_list next;
+      for (goacc_aq_list l = devicep->openacc.async.active; l; l = next)
+	{
+	  ret &= devicep->openacc.async.destruct_func (l->aq);
+	  next = l->next;
+	  free (l);
+	}
+      free (devicep->openacc.async.asyncqueue);
+      devicep->openacc.async.nasyncqueue = 0;
+      devicep->openacc.async.asyncqueue = NULL;
+      devicep->openacc.async.active = NULL;
+    }
+  gomp_mutex_destroy (&devicep->openacc.async.lock);
+  return ret;
 }
diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c
index 20774c1..0a842ea 100644
--- a/libgomp/oacc-cuda.c
+++ b/libgomp/oacc-cuda.c
@@ -62,7 +62,11 @@ acc_get_cuda_stream (int async)
     return NULL;
 
   if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func)
-    return thr->dev->openacc.cuda.get_stream_func (async);
+    {
+      goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async);
+      if (aq)
+	return thr->dev->openacc.cuda.get_stream_func (aq);
+    }
  
   return NULL;
 }
@@ -79,8 +83,14 @@ acc_set_cuda_stream (int async, void *stream)
 
   thr = goacc_thread ();
 
+  int ret = -1;
   if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func)
-    return thr->dev->openacc.cuda.set_stream_func (async, stream);
-
-  return -1;
+    {
+      goacc_aq aq = get_goacc_asyncqueue (async);
+      gomp_mutex_lock (&thr->dev->openacc.async.lock);
+      ret = thr->dev->openacc.cuda.set_stream_func (aq, stream);
+      gomp_mutex_unlock (&thr->dev->openacc.async.lock);
+    }
+
+  return ret;
 }
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 2de3c37..53658c8 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -140,55 +140,86 @@ host_openacc_exec (void (*fn) (void *),
 		   size_t mapnum __attribute__ ((unused)),
 		   void **hostaddrs,
 		   void **devaddrs __attribute__ ((unused)),
-		   int async __attribute__ ((unused)),
-		   unsigned *dims __attribute ((unused)),
+		   unsigned *dims __attribute__ ((unused)),
 		   void *targ_mem_desc __attribute__ ((unused)))
 {
   fn (hostaddrs);
 }
 
 static void
-host_openacc_register_async_cleanup (void *targ_mem_desc __attribute__ ((unused)),
-				     int async __attribute__ ((unused)))
+host_openacc_async_exec (void (*fn) (void *),
+			 size_t mapnum __attribute__ ((unused)),
+			 void **hostaddrs,
+			 void **devaddrs __attribute__ ((unused)),
+			 unsigned *dims __attribute__ ((unused)),
+			 void *targ_mem_desc __attribute__ ((unused)),
+			 struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
+  fn (hostaddrs);
 }
 
 static int
-host_openacc_async_test (int async __attribute__ ((unused)))
+host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused)))
 {
   return 1;
 }
 
-static int
-host_openacc_async_test_all (void)
+static void
+host_openacc_async_synchronize (struct goacc_asyncqueue *aq
+				__attribute__ ((unused)))
 {
-  return 1;
 }
 
 static void
-host_openacc_async_wait (int async __attribute__ ((unused)))
+host_openacc_async_serialize (struct goacc_asyncqueue *aq1
+			      __attribute__ ((unused)),
+			      struct goacc_asyncqueue *aq2
+			      __attribute__ ((unused)))
 {
 }
 
-static void
-host_openacc_async_wait_async (int async1 __attribute__ ((unused)),
-			       int async2 __attribute__ ((unused)))
+static bool
+host_openacc_async_host2dev (int ord __attribute__ ((unused)),
+			     void *dst __attribute__ ((unused)),
+			     const void *src __attribute__ ((unused)),
+			     size_t n __attribute__ ((unused)),
+			     struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
 {
+  return true;
 }
 
-static void
-host_openacc_async_wait_all (void)
+static bool
+host_openacc_async_dev2host (int ord __attribute__ ((unused)),
+			     void *dst __attribute__ ((unused)),
+			     const void *src __attribute__ ((unused)),
+			     size_t n __attribute__ ((unused)),
+			     struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
 {
+  return true;
 }
 
 static void
-host_openacc_async_wait_all_async (int async __attribute__ ((unused)))
+host_openacc_async_queue_callback (struct goacc_asyncqueue *aq
+				   __attribute__ ((unused)),
+				   void (*callback_fn)(void *)
+				   __attribute__ ((unused)),
+				   void *userptr __attribute__ ((unused)))
 {
 }
 
-static void
-host_openacc_async_set_async (int async __attribute__ ((unused)))
+static struct goacc_asyncqueue *
+host_openacc_async_construct (void)
 {
+  return NULL;
+}
+
+static bool
+host_openacc_async_destruct (struct goacc_asyncqueue *aq
+			     __attribute__ ((unused)))
+{
+  return true;
 }
 
 static void *
@@ -235,15 +266,17 @@ static struct gomp_device_descr host_dispatch =
 
       .exec_func = host_openacc_exec,
 
-      .register_async_cleanup_func = host_openacc_register_async_cleanup,
-
-      .async_test_func = host_openacc_async_test,
-      .async_test_all_func = host_openacc_async_test_all,
-      .async_wait_func = host_openacc_async_wait,
-      .async_wait_async_func = host_openacc_async_wait_async,
-      .async_wait_all_func = host_openacc_async_wait_all,
-      .async_wait_all_async_func = host_openacc_async_wait_all_async,
-      .async_set_async_func = host_openacc_async_set_async,
+      .async = {
+	.construct_func = host_openacc_async_construct,
+	.destruct_func = host_openacc_async_destruct,
+	.test_func = host_openacc_async_test,
+	.synchronize_func = host_openacc_async_synchronize,
+	.serialize_func = host_openacc_async_serialize,
+	.queue_callback_func = host_openacc_async_queue_callback,
+	.exec_func = host_openacc_async_exec,
+	.dev2host_func = host_openacc_async_dev2host,
+	.host2dev_func = host_openacc_async_host2dev,
+      },
 
       .create_thread_data_func = host_openacc_create_thread_data,
       .destroy_thread_data_func = host_openacc_destroy_thread_data,
diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c
index 8db24b1..2c2f91c 100644
--- a/libgomp/oacc-init.c
+++ b/libgomp/oacc-init.c
@@ -309,7 +309,7 @@ acc_shutdown_1 (acc_device_t d)
       if (acc_dev->state == GOMP_DEVICE_INITIALIZED)
         {
 	  devices_active = true;
-	  ret &= acc_dev->fini_device_func (acc_dev->target_id);
+	  ret &= gomp_fini_device (acc_dev);
 	  acc_dev->state = GOMP_DEVICE_UNINITIALIZED;
 	}
       gomp_mutex_unlock (&acc_dev->lock);
@@ -426,8 +426,8 @@ goacc_attach_host_thread_to_device (int ord)
   
   thr->target_tls
     = acc_dev->openacc.create_thread_data_func (ord);
-  
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
+
+  thr->default_async = acc_async_default;
 }
 
 /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of
diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c
index 72414b7..07a2524 100644
--- a/libgomp/oacc-mem.c
+++ b/libgomp/oacc-mem.c
@@ -172,18 +172,11 @@ memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async,
       return;
     }
 
-  if (async > acc_async_sync)
-    thr->dev->openacc.async_set_async_func (async);
-
-  bool ret = (from
-	      ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s)
-	      : thr->dev->host2dev_func (thr->dev->target_id, d, h, s));
-
-  if (async > acc_async_sync)
-    thr->dev->openacc.async_set_async_func (acc_async_sync);
-
-  if (!ret)
-    gomp_fatal ("error in %s", libfnname);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+  if (from)
+    gomp_copy_dev2host (thr->dev, aq, h, d, s);
+  else
+    gomp_copy_host2dev (thr->dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
 }
 
 void
@@ -509,17 +502,13 @@ present_create_copy (unsigned f, void *h, size_t s, int async)
 
       gomp_mutex_unlock (&acc_dev->lock);
 
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (async);
+      goacc_aq aq = get_goacc_asyncqueue (async);
 
-      tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true,
-			   GOMP_MAP_VARS_OPENACC);
+      tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s,
+				 &kinds, true, GOMP_MAP_VARS_OPENACC);
       /* Initialize dynamic refcount.  */
       tgt->list[0].key->dynamic_refcount = 1;
 
-      if (async > acc_async_sync)
-	acc_dev->openacc.async_set_async_func (acc_async_sync);
-
       gomp_mutex_lock (&acc_dev->lock);
 
       d = tgt->to_free;
@@ -673,13 +662,9 @@ delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname)
 
       if (f & FLAG_COPYOUT)
 	{
-	  if (async > acc_async_sync)
-	    acc_dev->openacc.async_set_async_func (async);
-	  acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
-	  if (async > acc_async_sync)
-	    acc_dev->openacc.async_set_async_func (acc_async_sync);
+	  goacc_aq aq = get_goacc_asyncqueue (async);
+	  gomp_copy_dev2host (acc_dev, aq, h, d, s);
 	}
-
       gomp_remove_var (acc_dev, n);
     }
 
@@ -762,16 +747,12 @@ update_dev_host (int is_dev, void *h, size_t s, int async)
   d = (void *) (n->tgt->tgt_start + n->tgt_offset
 		+ (uintptr_t) h - n->host_start);
 
-  if (async > acc_async_sync)
-    acc_dev->openacc.async_set_async_func (async);
+  goacc_aq aq = get_goacc_asyncqueue (async);
 
   if (is_dev)
-    acc_dev->host2dev_func (acc_dev->target_id, d, h, s);
+    gomp_copy_host2dev (acc_dev, aq, d, h, s, /* TODO: cbuf? */ NULL);
   else
-    acc_dev->dev2host_func (acc_dev->target_id, h, d, s);
-
-  if (async > acc_async_sync)
-    acc_dev->openacc.async_set_async_func (acc_async_sync);
+    gomp_copy_dev2host (acc_dev, aq, h, d, s);
 
   gomp_mutex_unlock (&acc_dev->lock);
 }
@@ -802,7 +783,7 @@ acc_update_self_async (void *h, size_t s, int async)
 
 void
 gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
-			 void *kinds)
+			 void *kinds, int async)
 {
   struct target_mem_desc *tgt;
   struct goacc_thread *thr = goacc_thread ();
@@ -832,8 +813,9 @@ gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes,
     }
 
   gomp_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs,
-		       NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs,
+			     NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC);
   gomp_debug (0, "  %s: mappings prepared\n", __FUNCTION__);
 
   /* Initialize dynamic refcount.  */
@@ -927,7 +909,10 @@ gomp_acc_remove_pointer (void *h, size_t s, bool force_copyfrom, int async,
       if (async < acc_async_noval)
 	gomp_unmap_vars (t, true);
       else
-	t->device_descr->openacc.register_async_cleanup_func (t, async);
+	{
+	  goacc_aq aq = get_goacc_asyncqueue (async);
+	  goacc_async_copyout_unmap_vars (t, aq);
+	}
     }
 
   gomp_mutex_unlock (&acc_dev->lock);
diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c
index bfe8876..07d0338 100644
--- a/libgomp/oacc-parallel.c
+++ b/libgomp/oacc-parallel.c
@@ -212,8 +212,6 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
     }
   va_end (ap);
   
-  acc_dev->openacc.async_set_async_func (async);
-
   if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC))
     {
       k.host_start = (uintptr_t) fn;
@@ -230,43 +228,28 @@ GOACC_parallel_keyed (int device, void (*fn) (void *),
   else
     tgt_fn = (void (*)) fn;
 
-  tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true,
-		       GOMP_MAP_VARS_OPENACC);
+  goacc_aq aq = get_goacc_asyncqueue (async);
+
+  tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds,
+			     true, GOMP_MAP_VARS_OPENACC);
 
   devaddrs = gomp_alloca (sizeof (void *) * mapnum);
   for (i = 0; i < mapnum; i++)
     devaddrs[i] = (void *) (tgt->list[i].key->tgt->tgt_start
 			    + tgt->list[i].key->tgt_offset);
-
-  acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
-			      async, dims, tgt);
-
-  /* If running synchronously, unmap immediately.  */
-  bool copyfrom = true;
-  if (async_synchronous_p (async))
-    gomp_unmap_vars (tgt, true);
+  if (aq == NULL)
+    {
+      acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+				  dims, tgt);
+      /* If running synchronously, unmap immediately.  */
+      gomp_unmap_vars (tgt, true);
+    }
   else
     {
-      bool async_unmap = false;
-      for (size_t i = 0; i < tgt->list_count; i++)
-	{
-	  splay_tree_key k = tgt->list[i].key;
-	  if (k && k->refcount == 1)
-	    {
-	      async_unmap = true;
-	      break;
-	    }
-	}
-      if (async_unmap)
-	tgt->device_descr->openacc.register_async_cleanup_func (tgt, async);
-      else
-	{
-	  copyfrom = false;
-	  gomp_unmap_vars (tgt, copyfrom);
-	}
+      acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs,
+					dims, tgt, aq);
+      goacc_async_copyout_unmap_vars (tgt, aq);
     }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 /* Legacy entry point, only provide host execution.  */
@@ -377,8 +360,6 @@ GOACC_enter_exit_data (int device, size_t mapnum,
 	finalize = true;
     }
 
-  acc_dev->openacc.async_set_async_func (async);
-
   /* Determine if this is an "acc enter data".  */
   for (i = 0; i < mapnum; ++i)
     {
@@ -450,7 +431,7 @@ GOACC_enter_exit_data (int device, size_t mapnum,
 	  else
 	    {
 	      gomp_acc_insert_pointer (pointer, &hostaddrs[i],
-				       &sizes[i], &kinds[i]);
+				       &sizes[i], &kinds[i], async);
 	      /* Increment 'i' by two because OpenACC requires fortran
 		 arrays to be contiguous, so each PSET is associated with
 		 one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and
@@ -475,17 +456,17 @@ GOACC_enter_exit_data (int device, size_t mapnum,
 		if (acc_is_present (hostaddrs[i], sizes[i]))
 		  {
 		    if (finalize)
-		      acc_delete_finalize (hostaddrs[i], sizes[i]);
+		      acc_delete_finalize_async (hostaddrs[i], sizes[i], async);
 		    else
-		      acc_delete (hostaddrs[i], sizes[i]);
+		      acc_delete_async (hostaddrs[i], sizes[i], async);
 		  }
 		break;
 	      case GOMP_MAP_FROM:
 	      case GOMP_MAP_FORCE_FROM:
 		if (finalize)
-		  acc_copyout_finalize (hostaddrs[i], sizes[i]);
+		  acc_copyout_finalize_async (hostaddrs[i], sizes[i], async);
 		else
-		  acc_copyout (hostaddrs[i], sizes[i]);
+		  acc_copyout_async (hostaddrs[i], sizes[i], async);
 		break;
 	      default:
 		gomp_fatal (">>>> GOACC_enter_exit_data UNHANDLED kind 0x%.2x",
@@ -503,8 +484,6 @@ GOACC_enter_exit_data (int device, size_t mapnum,
 	    i += pointer - 1;
 	  }
       }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 static void
@@ -517,17 +496,22 @@ goacc_wait (int async, int num_waits, va_list *ap)
     {
       int qid = va_arg (*ap, int);
       
-      if (acc_async_test (qid))
+      goacc_aq aq = get_goacc_asyncqueue (qid);
+      if (acc_dev->openacc.async.test_func (aq))
 	continue;
-
       if (async == acc_async_sync)
-	acc_wait (qid);
+	acc_dev->openacc.async.synchronize_func (aq);
       else if (qid == async)
-	;/* If we're waiting on the same asynchronous queue as we're
-	    launching on, the queue itself will order work as
-	    required, so there's no need to wait explicitly.  */
+	/* If we're waiting on the same asynchronous queue as we're
+	   launching on, the queue itself will order work as
+	   required, so there's no need to wait explicitly.  */
+	;
       else
-	acc_dev->openacc.async_wait_async_func (qid, async);
+	{
+	  goacc_aq aq2 = get_goacc_asyncqueue (async);
+	  acc_dev->openacc.async.synchronize_func (aq);
+	  acc_dev->openacc.async.serialize_func (aq, aq2);
+	}
     }
 }
 
@@ -559,8 +543,6 @@ GOACC_update (int device, size_t mapnum,
   else if (num_waits == acc_async_noval)
     acc_wait_all_async (async);
 
-  acc_dev->openacc.async_set_async_func (async);
-
   bool update_device = false;
   for (i = 0; i < mapnum; ++i)
     {
@@ -600,7 +582,7 @@ GOACC_update (int device, size_t mapnum,
 	  /* Fallthru  */
 	case GOMP_MAP_FORCE_TO:
 	  update_device = true;
-	  acc_update_device (hostaddrs[i], sizes[i]);
+	  acc_update_device_async (hostaddrs[i], sizes[i], async);
 	  break;
 
 	case GOMP_MAP_FROM:
@@ -612,7 +594,7 @@ GOACC_update (int device, size_t mapnum,
 	  /* Fallthru  */
 	case GOMP_MAP_FORCE_FROM:
 	  update_device = false;
-	  acc_update_self (hostaddrs[i], sizes[i]);
+	  acc_update_self_async (hostaddrs[i], sizes[i], async);
 	  break;
 
 	default:
@@ -620,8 +602,6 @@ GOACC_update (int device, size_t mapnum,
 	  break;
 	}
     }
-
-  acc_dev->openacc.async_set_async_func (acc_async_sync);
 }
 
 void
@@ -638,7 +618,7 @@ GOACC_wait (int async, int num_waits, ...)
   else if (async == acc_async_sync)
     acc_wait_all ();
   else if (async == acc_async_noval)
-    goacc_thread ()->dev->openacc.async_wait_all_async_func (acc_async_noval);
+    acc_wait_all_async (async);
 }
 
 int
diff --git a/libgomp/oacc-plugin.c b/libgomp/oacc-plugin.c
index c04db90..a114cc7 100644
--- a/libgomp/oacc-plugin.c
+++ b/libgomp/oacc-plugin.c
@@ -30,17 +30,6 @@
 #include "oacc-plugin.h"
 #include "oacc-int.h"
 
-void
-GOMP_PLUGIN_async_unmap_vars (void *ptr, int async)
-{
-  struct target_mem_desc *tgt = ptr;
-  struct gomp_device_descr *devicep = tgt->device_descr;
-
-  devicep->openacc.async_set_async_func (async);
-  gomp_unmap_vars (tgt, true);
-  devicep->openacc.async_set_async_func (acc_async_sync);
-}
-
 /* Return the target-specific part of the TLS data for the current thread.  */
 
 void *
diff --git a/libgomp/openacc.h b/libgomp/openacc.h
index f61bb77..ede59d7 100644
--- a/libgomp/openacc.h
+++ b/libgomp/openacc.h
@@ -63,6 +63,7 @@ typedef enum acc_device_t {
 
 typedef enum acc_async_t {
   /* Keep in sync with include/gomp-constants.h.  */
+  acc_async_default = 0,
   acc_async_noval = -1,
   acc_async_sync  = -2
 } acc_async_t;
@@ -72,6 +73,8 @@ void acc_set_device_type (acc_device_t) __GOACC_NOTHROW;
 acc_device_t acc_get_device_type (void) __GOACC_NOTHROW;
 void acc_set_device_num (int, acc_device_t) __GOACC_NOTHROW;
 int acc_get_device_num (acc_device_t) __GOACC_NOTHROW;
+void acc_set_default_async (int) __GOACC_NOTHROW;
+int acc_get_default_async (void) __GOACC_NOTHROW;
 int acc_async_test (int) __GOACC_NOTHROW;
 int acc_async_test_all (void) __GOACC_NOTHROW;
 void acc_wait (int) __GOACC_NOTHROW;

From patchwork Tue Sep 25 13:11:06 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chung-Lin Tang <chunglin_tang@mentor.com>
X-Patchwork-Id: 974405
Return-Path: 
 <gcc-patches-return-486325-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Authentication-Results: ozlabs.org;
	spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org
	(client-ip=209.132.180.131; helo=sourceware.org;
	envelope-from=gcc-patches-return-486325-incoming=patchwork.ozlabs.org@gcc.gnu.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=none (p=none dis=none) header.from=mentor.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org
	header.b="ONFAwKg7"; dkim-atps=neutral
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 42KM0K4Qygz9s3C
	for <incoming@patchwork.ozlabs.org>;
	Tue, 25 Sep 2018 23:11:29 +1000 (AEST)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; q=dns; s=default; b=vQ2UnxaaCGe0+EAJkR9g/McGmu32f
	u6IrprossRnLOaCpLmif4rvDxuOg+3mO+sfoiVBaFzW5NCsR7TfUkWSAZxx9MZjr
	yEi93XiLIlrC6e362Qq5Tn70HQvr4xoHFBxYT7yqlP+rSHIp3Ft/8EjPtdyo0FAO
	B6aPaCzOlmyXIk=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; s=default; bh=LWMv32GMYYkWWHvEbyQ+OliyHEE=; b=ONF
	AwKg7JTqtNnwei54dUKNksxeStbBrHFuEHxkvYqU9a6t46zlZFDwAu2r+6uunU/5
	fJ0lNAenPkhQKnAi9tiUQ6m6/TWad7tIzqXAc/C+sfhJ860uL5UO9lHJmt6bAXJt
	HAVolBIRqSW+r6uQjGT7GhCZXixtpI06uTSfSNTg=
Received: (qmail 120293 invoked by alias); 25 Sep 2018 13:11:17 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 120239 invoked by uid 89); 25 Sep 2018 13:11:16 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-24.8 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3,
	RCVD_IN_DNSWL_NONE,
	SPF_PASS autolearn=ham version=3.3.2 spammy=H*r:0700,
	H*UA:Macintosh, H*u:Macintosh
X-HELO: relay1.mentorg.com
Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131)
	by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with
	ESMTP; Tue, 25 Sep 2018 13:11:14 +0000
Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202])	by
	relay1.mentorg.com with esmtps
	(TLSv1.2:ECDHE-RSA-AES256-SHA384:256)	id 1g4n7R-0004D9-3A
	from ChungLin_Tang@mentor.com ; Tue, 25 Sep 2018 06:11:13 -0700
Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com
	(147.34.90.202) with Microsoft SMTP Server (TLS) id
	15.0.1320.4; Tue, 25 Sep 2018 06:11:10 -0700
Reply-To: <cltang@codesourcery.com>
From: Chung-Lin Tang <chunglin_tang@mentor.com>
Subject: [PATCH 3/6, OpenACC, libgomp] acc_get/set_default_async API,
	Fortran specific parts
To: <gcc-patches@gcc.gnu.org>, Thomas Schwinge <Thomas_Schwinge@mentor.com>,
	<fortran@gcc.gnu.org>
Message-ID: <f5f5450c-eed2-80e5-9130-7f5655bf7d27@mentor.com>
Date: Tue, 25 Sep 2018 21:11:06 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13;
	rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0

This patch adds the OpenACC 2.5 new APIs of acc_get_default_async/acc_set_default_async,
contained are the modifications for Fortran in libgomp/openacc.f90, libgomp/openacc_lib.h,
and a small testsuite adjustment.

Thanks,
Chung-Lin

	* openacc.f90 (acc_async_default): Declare.
	(acc_set_default_async): Likewise.
	(acc_get_default_async): Likewise.
  	* openacc_lib.h (acc_async_default): Declare.
  	(acc_set_default_async): Likewise.
	(acc_get_default_async): Likewise.

	* testsuite/libgomp.oacc-fortran/lib-12.f90: Likewise.

diff --git a/libgomp/openacc.f90 b/libgomp/openacc.f90
index 7c809fe..7d31ee6 100644
--- a/libgomp/openacc.f90
+++ b/libgomp/openacc.f90
@@ -51,9 +51,10 @@ module openacc_kinds
 
   integer, parameter :: acc_handle_kind = int32
 
-  public :: acc_async_noval, acc_async_sync
+  public :: acc_async_default, acc_async_noval, acc_async_sync
 
   ! Keep in sync with include/gomp-constants.h.
+  integer (acc_handle_kind), parameter :: acc_async_default = 0
   integer (acc_handle_kind), parameter :: acc_async_noval = -1
   integer (acc_handle_kind), parameter :: acc_async_sync = -2
 
@@ -92,6 +93,16 @@ module openacc_internal
       integer (acc_device_kind) d
     end function
 
+    subroutine acc_set_default_async_h (a)
+      import
+      integer a
+    end subroutine
+
+    function acc_get_default_async_h ()
+      import
+      integer acc_get_default_async_h
+    end function
+
     function acc_async_test_h (a)
       logical acc_async_test_h
       integer a
@@ -720,6 +731,7 @@ module openacc
 
   public :: acc_get_num_devices, acc_set_device_type, acc_get_device_type
   public :: acc_set_device_num, acc_get_device_num, acc_async_test
+  public :: acc_set_default_async, acc_get_default_async
   public :: acc_async_test_all
   public :: acc_wait, acc_async_wait, acc_wait_async
   public :: acc_wait_all, acc_async_wait_all, acc_wait_all_async
@@ -752,6 +764,14 @@ module openacc
     procedure :: acc_get_device_num_h
   end interface
 
+  interface acc_set_default_async
+    procedure :: acc_set_default_async_h
+  end interface
+
+  interface acc_get_default_async
+    procedure :: acc_get_default_async_h
+  end interface
+
   interface acc_async_test
     procedure :: acc_async_test_h
   end interface
diff --git a/libgomp/openacc_lib.h b/libgomp/openacc_lib.h
index 820d987..75a6939 100644
--- a/libgomp/openacc_lib.h
+++ b/libgomp/openacc_lib.h
@@ -46,6 +46,7 @@
       integer, parameter :: acc_handle_kind = 4
 
 !     Keep in sync with include/gomp-constants.h.
+      integer (acc_handle_kind), parameter :: acc_async_default = 0
       integer (acc_handle_kind), parameter :: acc_async_noval = -1
       integer (acc_handle_kind), parameter :: acc_async_sync = -2
 
@@ -89,6 +90,18 @@
         end function
       end interface
 
+      interface acc_set_default_async
+        subroutine acc_set_default_async_h (a)
+          integer a
+        end subroutine
+      end interface
+
+      interface acc_get_default_async
+        function acc_get_default_async_h ()
+          integer acc_get_default_async_h
+        end function
+      end interface
+
       interface acc_async_test
         function acc_async_test_h (a)
           logical acc_async_test_h
diff --git a/libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90 b/libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90
index 6912f67..ffbbf33 100644
--- a/libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90
+++ b/libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90
@@ -1,4 +1,5 @@
 ! { dg-do run }
+! { dg-xfail-run-if "n/a" { openacc_host_selected } }
 
 program main
   use openacc

From patchwork Tue Sep 25 13:11:24 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chung-Lin Tang <chunglin_tang@mentor.com>
X-Patchwork-Id: 974406
Return-Path: 
 <gcc-patches-return-486326-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Authentication-Results: ozlabs.org;
	spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org
	(client-ip=209.132.180.131; helo=sourceware.org;
	envelope-from=gcc-patches-return-486326-incoming=patchwork.ozlabs.org@gcc.gnu.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=none (p=none dis=none) header.from=mentor.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org
	header.b="DYX9zffg"; dkim-atps=neutral
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 42KM0h1FCyz9s3C
	for <incoming@patchwork.ozlabs.org>;
	Tue, 25 Sep 2018 23:11:47 +1000 (AEST)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; q=dns; s=default; b=fGbYcnYQMOLdTIuyXX32tX3Y+fxhx
	0D0M9aRNj+3pnzemdeBy86/ewXnb1yXOqQM3t7cc2/3uV4ncrdGef2VaVBxgaKTc
	OldqLtRqjoGIbT0u2/CY9xSIZ77xH2w43LOKjzOPoKBLSBaAkZDepwBruNU/m3ba
	3Xa2XWifISCA4s=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; s=default; bh=ptiOiKDRhXbR16Sqcm+8fy3fZ9c=; b=DYX
	9zffgnoulGePgTfO1C2pzZnC/maVA4up/oc8Soju8hn/GHW9ly8iunmnzxAQ6Rmx
	UwjadxoxxlCM4kUrmqiqrmfn/QolU4YPwTKT0ruxViuYUrofGM+OsfwOTghCHDLy
	d9si5Ff62q4jX35hC/nv3/Pa8WRfD1WqcwrixEFY=
Received: (qmail 122554 invoked by alias); 25 Sep 2018 13:11:40 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 122533 invoked by uid 89); 25 Sep 2018 13:11:38 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-24.9 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3,
	RCVD_IN_DNSWL_NONE,
	SPF_PASS autolearn=ham version=3.3.2 spammy=10017, 4826, 2799,
	bias
X-HELO: relay1.mentorg.com
Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131)
	by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with
	ESMTP; Tue, 25 Sep 2018 13:11:32 +0000
Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202])	by
	relay1.mentorg.com with esmtps
	(TLSv1.2:ECDHE-RSA-AES256-SHA384:256)	id 1g4n7i-0004JY-Bx
	from ChungLin_Tang@mentor.com ; Tue, 25 Sep 2018 06:11:30 -0700
Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com
	(147.34.90.202) with Microsoft SMTP Server (TLS) id
	15.0.1320.4; Tue, 25 Sep 2018 06:11:27 -0700
Reply-To: <cltang@codesourcery.com>
From: Chung-Lin Tang <chunglin_tang@mentor.com>
Subject: [PATCH 4/6, OpenACC, libgomp] Async re-work,
	libgomp/target.c changes
To: <gcc-patches@gcc.gnu.org>, Jakub Jelinek <jakub@redhat.com>
Message-ID: <f4d87634-f528-b2ed-104c-0acddf208890@mentor.com>
Date: Tue, 25 Sep 2018 21:11:24 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13;
	rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0

Hi Jakub,
This part has changes to 'struct goacc_asyncqueue*' arguments to various
memory copying/mapping functions. To lessen the amount of code changes new 'gomp_map/unmap_vars_async'
functions names are used (with the non-async original names defined with the asyncqueue==NULL).

Inside gomp_target_fini, a 'gomp_fini_device' named function has been (re?)introduced,
since there's also asyncqueue destructing now need before the ->fini_device_func() call.

Thanks,
Chung-Lin


	* target.c (goacc_device_copy_async): New function.
	(gomp_copy_host2dev): Remove 'static', add goacc_asyncqueue parameter,
	add goacc_device_copy_async case.
	(gomp_copy_dev2host): Likewise.
	(gomp_map_vars_existing): Add goacc_asyncqueue parameter, adjust code.
	(gomp_map_pointer): Likewise.
	(gomp_map_fields_existing): Likewise.
	(gomp_map_vars): Add function for compatiblity.
	(gomp_map_vars_async): Adapt from gomp_map_vars, add goacc_asyncqueue
	parameter.
	(gomp_unmap_tgt): Remove statis, add attribute_hidden.
	(gomp_unmap_vars): Add function for compatiblity.
	(gomp_unmap_vars_async): Adapt from gomp_unmap_vars, add
	goacc_asyncqueue parameter.
	(gomp_fini_device): New function.
	(gomp_exit_data): Adjust gomp_copy_dev2host call.
	(gomp_load_plugin_for_device): Remove old interface, adjust to load
	new async interface.
	(gomp_target_fini): Adjust code to call gomp_fini_device.

diff --git a/libgomp/target.c b/libgomp/target.c
index dda041c..ff5b114 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -177,6 +177,22 @@ gomp_device_copy (struct gomp_device_descr *devicep,
     }
 }
 
+static inline void
+goacc_device_copy_async (struct gomp_device_descr *devicep,
+			 bool (*copy_func) (int, void *, const void *, size_t,
+					    struct goacc_asyncqueue *),
+			 const char *dst, void *dstaddr,
+			 const char *src, const void *srcaddr,
+			 size_t size, struct goacc_asyncqueue *aq)
+{
+  if (!copy_func (devicep->target_id, dstaddr, srcaddr, size, aq))
+    {
+      gomp_mutex_unlock (&devicep->lock);
+      gomp_fatal ("Copying of %s object [%p..%p) to %s object [%p..%p) failed",
+		  src, srcaddr, srcaddr + size, dst, dstaddr, dstaddr + size);
+    }
+}
+
 /* Infrastructure for coalescing adjacent or nearly adjacent (in device addresses)
    host to device memory transfers.  */
 
@@ -263,8 +279,9 @@ gomp_to_device_kind_p (int kind)
     }
 }
 
-static void
+attribute_hidden void
 gomp_copy_host2dev (struct gomp_device_descr *devicep,
+		    struct goacc_asyncqueue *aq,
 		    void *d, const void *h, size_t sz,
 		    struct gomp_coalesce_buf *cbuf)
 {
@@ -293,14 +310,23 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep,
 	    }
 	}
     }
-  gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
+  if (aq)
+    goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func,
+			     "dev", d, "host", h, sz, aq);
+  else
+    gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz);
 }
 
-static void
+attribute_hidden void
 gomp_copy_dev2host (struct gomp_device_descr *devicep,
+		    struct goacc_asyncqueue *aq,
 		    void *h, const void *d, size_t sz)
 {
-  gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz);
+  if (aq)
+    goacc_device_copy_async (devicep, devicep->openacc.async.dev2host_func,
+			     "host", h, "dev", d, sz, aq);
+  else
+    gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz);
 }
 
 static void
@@ -318,7 +344,8 @@ gomp_free_device_memory (struct gomp_device_descr *devicep, void *devptr)
    Helper function of gomp_map_vars.  */
 
 static inline void
-gomp_map_vars_existing (struct gomp_device_descr *devicep, splay_tree_key oldn,
+gomp_map_vars_existing (struct gomp_device_descr *devicep,
+			struct goacc_asyncqueue *aq, splay_tree_key oldn,
 			splay_tree_key newn, struct target_var_desc *tgt_var,
 			unsigned char kind, struct gomp_coalesce_buf *cbuf)
 {
@@ -340,7 +367,7 @@ gomp_map_vars_existing (struct gomp_device_descr *devicep, splay_tree_key oldn,
     }
 
   if (GOMP_MAP_ALWAYS_TO_P (kind))
-    gomp_copy_host2dev (devicep,
+    gomp_copy_host2dev (devicep, aq,
 			(void *) (oldn->tgt->tgt_start + oldn->tgt_offset
 				  + newn->host_start - oldn->host_start),
 			(void *) newn->host_start,
@@ -358,8 +385,8 @@ get_kind (bool short_mapkind, void *kinds, int idx)
 }
 
 static void
-gomp_map_pointer (struct target_mem_desc *tgt, uintptr_t host_ptr,
-		  uintptr_t target_offset, uintptr_t bias,
+gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq,
+		  uintptr_t host_ptr, uintptr_t target_offset, uintptr_t bias,
 		  struct gomp_coalesce_buf *cbuf)
 {
   struct gomp_device_descr *devicep = tgt->device_descr;
@@ -370,7 +397,7 @@ gomp_map_pointer (struct target_mem_desc *tgt, uintptr_t host_ptr,
   if (cur_node.host_start == (uintptr_t) NULL)
     {
       cur_node.tgt_offset = (uintptr_t) NULL;
-      gomp_copy_host2dev (devicep,
+      gomp_copy_host2dev (devicep, aq,
 			  (void *) (tgt->tgt_start + target_offset),
 			  (void *) &cur_node.tgt_offset,
 			  sizeof (void *), cbuf);
@@ -392,12 +419,13 @@ gomp_map_pointer (struct target_mem_desc *tgt, uintptr_t host_ptr,
      array section.  Now subtract bias to get what we want
      to initialize the pointer with.  */
   cur_node.tgt_offset -= bias;
-  gomp_copy_host2dev (devicep, (void *) (tgt->tgt_start + target_offset),
+  gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + target_offset),
 		      (void *) &cur_node.tgt_offset, sizeof (void *), cbuf);
 }
 
 static void
-gomp_map_fields_existing (struct target_mem_desc *tgt, splay_tree_key n,
+gomp_map_fields_existing (struct target_mem_desc *tgt,
+			  struct goacc_asyncqueue *aq, splay_tree_key n,
 			  size_t first, size_t i, void **hostaddrs,
 			  size_t *sizes, void *kinds,
 			  struct gomp_coalesce_buf *cbuf)
@@ -417,7 +445,7 @@ gomp_map_fields_existing (struct target_mem_desc *tgt, splay_tree_key n,
       && n2->tgt == n->tgt
       && n2->host_start - n->host_start == n2->tgt_offset - n->tgt_offset)
     {
-      gomp_map_vars_existing (devicep, n2, &cur_node,
+      gomp_map_vars_existing (devicep, aq, n2, &cur_node,
 			      &tgt->list[i], kind & typemask, cbuf);
       return;
     }
@@ -433,8 +461,8 @@ gomp_map_fields_existing (struct target_mem_desc *tgt, splay_tree_key n,
 	      && n2->host_start - n->host_start
 		 == n2->tgt_offset - n->tgt_offset)
 	    {
-	      gomp_map_vars_existing (devicep, n2, &cur_node, &tgt->list[i],
-				      kind & typemask, cbuf);
+	      gomp_map_vars_existing (devicep, aq, n2, &cur_node,
+				      &tgt->list[i], kind & typemask, cbuf);
 	      return;
 	    }
 	}
@@ -445,7 +473,7 @@ gomp_map_fields_existing (struct target_mem_desc *tgt, splay_tree_key n,
 	  && n2->tgt == n->tgt
 	  && n2->host_start - n->host_start == n2->tgt_offset - n->tgt_offset)
 	{
-	  gomp_map_vars_existing (devicep, n2, &cur_node, &tgt->list[i],
+	  gomp_map_vars_existing (devicep, aq, n2, &cur_node, &tgt->list[i],
 				  kind & typemask, cbuf);
 	  return;
 	}
@@ -482,6 +510,18 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	       void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds,
 	       bool short_mapkind, enum gomp_map_vars_kind pragma_kind)
 {
+  struct target_mem_desc *tgt;
+  tgt = gomp_map_vars_async (devicep, NULL, mapnum, hostaddrs, devaddrs,
+			     sizes, kinds, short_mapkind, pragma_kind);
+  return tgt;
+}
+
+attribute_hidden struct target_mem_desc *
+gomp_map_vars_async (struct gomp_device_descr *devicep,
+		     struct goacc_asyncqueue *aq, size_t mapnum,
+		     void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds,
+		     bool short_mapkind, enum gomp_map_vars_kind pragma_kind)
+{
   size_t i, tgt_align, tgt_size, not_found_cnt = 0;
   bool has_firstprivate = false;
   const int rshift = short_mapkind ? 8 : 3;
@@ -594,7 +634,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	      continue;
 	    }
 	  for (i = first; i <= last; i++)
-	    gomp_map_fields_existing (tgt, n, first, i, hostaddrs,
+	    gomp_map_fields_existing (tgt, aq, n, first, i, hostaddrs,
 				      sizes, kinds, NULL);
 	  i--;
 	  continue;
@@ -639,7 +679,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
       else
 	n = splay_tree_lookup (mem_map, &cur_node);
       if (n && n->refcount != REFCOUNT_LINK)
-	gomp_map_vars_existing (devicep, n, &cur_node, &tgt->list[i],
+	gomp_map_vars_existing (devicep, aq, n, &cur_node, &tgt->list[i],
 				kind & typemask, NULL);
       else
 	{
@@ -750,7 +790,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		tgt_size = (tgt_size + align - 1) & ~(align - 1);
 		tgt->list[i].offset = tgt_size;
 		len = sizes[i];
-		gomp_copy_host2dev (devicep,
+		gomp_copy_host2dev (devicep, aq,
 				    (void *) (tgt->tgt_start + tgt_size),
 				    (void *) hostaddrs[i], len, cbufp);
 		tgt_size += len;
@@ -784,7 +824,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		    continue;
 		  }
 		for (i = first; i <= last; i++)
-		  gomp_map_fields_existing (tgt, n, first, i, hostaddrs,
+		  gomp_map_fields_existing (tgt, aq, n, first, i, hostaddrs,
 					    sizes, kinds, cbufp);
 		i--;
 		continue;
@@ -804,7 +844,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		  cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i - 1);
 		if (cur_node.tgt_offset)
 		  cur_node.tgt_offset -= sizes[i];
-		gomp_copy_host2dev (devicep,
+		gomp_copy_host2dev (devicep, aq,
 				    (void *) (n->tgt->tgt_start
 					      + n->tgt_offset
 					      + cur_node.host_start
@@ -825,7 +865,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 	      k->host_end = k->host_start + sizeof (void *);
 	    splay_tree_key n = splay_tree_lookup (mem_map, k);
 	    if (n && n->refcount != REFCOUNT_LINK)
-	      gomp_map_vars_existing (devicep, n, k, &tgt->list[i],
+	      gomp_map_vars_existing (devicep, aq, n, k, &tgt->list[i],
 				      kind & typemask, cbufp);
 	    else
 	      {
@@ -878,18 +918,19 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		  case GOMP_MAP_FORCE_TOFROM:
 		  case GOMP_MAP_ALWAYS_TO:
 		  case GOMP_MAP_ALWAYS_TOFROM:
-		    gomp_copy_host2dev (devicep,
+		    gomp_copy_host2dev (devicep, aq,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
 					k->host_end - k->host_start, cbufp);
 		    break;
 		  case GOMP_MAP_POINTER:
-		    gomp_map_pointer (tgt, (uintptr_t) *(void **) k->host_start,
+		    gomp_map_pointer (tgt, aq,
+				      (uintptr_t) *(void **) k->host_start,
 				      k->tgt_offset, sizes[i], cbufp);
 		    break;
 		  case GOMP_MAP_TO_PSET:
-		    gomp_copy_host2dev (devicep,
+		    gomp_copy_host2dev (devicep, aq,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
@@ -911,7 +952,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 			  tgt->list[j].always_copy_from = false;
 			  if (k->refcount != REFCOUNT_INFINITY)
 			    k->refcount++;
-			  gomp_map_pointer (tgt,
+			  gomp_map_pointer (tgt, aq,
 					    (uintptr_t) *(void **) hostaddrs[j],
 					    k->tgt_offset
 					    + ((uintptr_t) hostaddrs[j]
@@ -940,7 +981,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		    break;
 		  case GOMP_MAP_FORCE_DEVICEPTR:
 		    assert (k->host_end - k->host_start == sizeof (void *));
-		    gomp_copy_host2dev (devicep,
+		    gomp_copy_host2dev (devicep, aq,
 					(void *) (tgt->tgt_start
 						  + k->tgt_offset),
 					(void *) k->host_start,
@@ -957,9 +998,8 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
 		    /* Set link pointer on target to the device address of the
 		       mapped object.  */
 		    void *tgt_addr = (void *) (tgt->tgt_start + k->tgt_offset);
-		    devicep->host2dev_func (devicep->target_id,
-					    (void *) n->tgt_offset,
-					    &tgt_addr, sizeof (void *));
+		    gomp_copy_host2dev (devicep, aq, (void *) n->tgt_offset,
+					&tgt_addr, sizeof (void *), cbufp);
 		  }
 		array++;
 	      }
@@ -971,7 +1011,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
       for (i = 0; i < mapnum; i++)
 	{
 	  cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i);
-	  gomp_copy_host2dev (devicep,
+	  gomp_copy_host2dev (devicep, aq,
 			      (void *) (tgt->tgt_start + i * sizeof (void *)),
 			      (void *) &cur_node.tgt_offset, sizeof (void *),
 			      cbufp);
@@ -982,7 +1022,8 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
     {
       long c = 0;
       for (c = 0; c < cbuf.chunk_cnt; ++c)
-	gomp_copy_host2dev (devicep, (void *) (tgt->tgt_start + cbuf.chunks[2 * c]),
+	gomp_copy_host2dev (devicep, aq,
+			    (void *) (tgt->tgt_start + cbuf.chunks[2 * c]),
 			    (char *) cbuf.buf + (cbuf.chunks[2 * c] - cbuf.chunks[0]),
 			    cbuf.chunks[2 * c + 1] - cbuf.chunks[2 * c], NULL);
       free (cbuf.buf);
@@ -1001,7 +1042,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum,
   return tgt;
 }
 
-static void
+attribute_hidden void
 gomp_unmap_tgt (struct target_mem_desc *tgt)
 {
   /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end region.  */
@@ -1036,6 +1077,13 @@ gomp_remove_var (struct gomp_device_descr *devicep, splay_tree_key k)
 attribute_hidden void
 gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 {
+  gomp_unmap_vars_async (tgt, do_copyfrom, NULL);
+}
+
+attribute_hidden void
+gomp_unmap_vars_async (struct target_mem_desc *tgt, bool do_copyfrom,
+		       struct goacc_asyncqueue *aq)
+{
   struct gomp_device_descr *devicep = tgt->device_descr;
 
   if (tgt->list_count == 0)
@@ -1071,7 +1119,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom)
 
       if ((do_unmap && do_copyfrom && tgt->list[i].copy_from)
 	  || tgt->list[i].always_copy_from)
-	gomp_copy_dev2host (devicep,
+	gomp_copy_dev2host (devicep, aq,
 			    (void *) (k->host_start + tgt->list[i].offset),
 			    (void *) (k->tgt->tgt_start + k->tgt_offset
 				      + tgt->list[i].offset),
@@ -1137,9 +1185,10 @@ gomp_update (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs,
 	    size_t size = cur_node.host_end - cur_node.host_start;
 
 	    if (GOMP_MAP_COPY_TO_P (kind & typemask))
-	      gomp_copy_host2dev (devicep, devaddr, hostaddr, size, NULL);
+	      gomp_copy_host2dev (devicep, NULL, devaddr, hostaddr, size,
+				  NULL);
 	    if (GOMP_MAP_COPY_FROM_P (kind & typemask))
-	      gomp_copy_dev2host (devicep, hostaddr, devaddr, size);
+	      gomp_copy_dev2host (devicep, NULL, hostaddr, devaddr, size);
 	  }
       }
   gomp_mutex_unlock (&devicep->lock);
@@ -1432,9 +1481,21 @@ gomp_init_device (struct gomp_device_descr *devicep)
 				   false);
     }
 
+  /* Initialize OpenACC asynchronous queues.  */
+  goacc_init_asyncqueues (devicep);
+
   devicep->state = GOMP_DEVICE_INITIALIZED;
 }
 
+attribute_hidden bool
+gomp_fini_device (struct gomp_device_descr *devicep)
+{
+  bool ret = goacc_fini_asyncqueues (devicep);
+  ret &= devicep->fini_device_func (devicep->target_id);
+  devicep->state = GOMP_DEVICE_FINALIZED;
+  return ret;
+}
+
 attribute_hidden void
 gomp_unload_device (struct gomp_device_descr *devicep)
 {
@@ -1924,7 +1985,7 @@ gomp_exit_data (struct gomp_device_descr *devicep, size_t mapnum,
 
 	  if ((kind == GOMP_MAP_FROM && k->refcount == 0)
 	      || kind == GOMP_MAP_ALWAYS_FROM)
-	    gomp_copy_dev2host (devicep, (void *) cur_node.host_start,
+	    gomp_copy_dev2host (devicep, NULL, (void *) cur_node.host_start,
 				(void *) (k->tgt->tgt_start + k->tgt_offset
 					  + cur_node.host_start
 					  - k->host_start),
@@ -2557,20 +2618,20 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device,
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
     {
       if (!DLSYM_OPT (openacc.exec, openacc_exec)
-	  || !DLSYM_OPT (openacc.register_async_cleanup,
-			 openacc_register_async_cleanup)
-	  || !DLSYM_OPT (openacc.async_test, openacc_async_test)
-	  || !DLSYM_OPT (openacc.async_test_all, openacc_async_test_all)
-	  || !DLSYM_OPT (openacc.async_wait, openacc_async_wait)
-	  || !DLSYM_OPT (openacc.async_wait_async, openacc_async_wait_async)
-	  || !DLSYM_OPT (openacc.async_wait_all, openacc_async_wait_all)
-	  || !DLSYM_OPT (openacc.async_wait_all_async,
-			 openacc_async_wait_all_async)
-	  || !DLSYM_OPT (openacc.async_set_async, openacc_async_set_async)
 	  || !DLSYM_OPT (openacc.create_thread_data,
 			 openacc_create_thread_data)
 	  || !DLSYM_OPT (openacc.destroy_thread_data,
-			 openacc_destroy_thread_data))
+			 openacc_destroy_thread_data)
+	  || !DLSYM_OPT (openacc.async.construct, openacc_async_construct)
+	  || !DLSYM_OPT (openacc.async.destruct, openacc_async_destruct)
+	  || !DLSYM_OPT (openacc.async.test, openacc_async_test)
+	  || !DLSYM_OPT (openacc.async.synchronize, openacc_async_synchronize)
+	  || !DLSYM_OPT (openacc.async.serialize, openacc_async_serialize)
+	  || !DLSYM_OPT (openacc.async.queue_callback,
+			 openacc_async_queue_callback)
+	  || !DLSYM_OPT (openacc.async.exec, openacc_async_exec)
+	  || !DLSYM_OPT (openacc.async.dev2host, openacc_async_dev2host)
+	  || !DLSYM_OPT (openacc.async.host2dev, openacc_async_host2dev))
 	{
 	  /* Require all the OpenACC handlers if we have
 	     GOMP_OFFLOAD_CAP_OPENACC_200.  */
@@ -2621,10 +2682,7 @@ gomp_target_fini (void)
       struct gomp_device_descr *devicep = &devices[i];
       gomp_mutex_lock (&devicep->lock);
       if (devicep->state == GOMP_DEVICE_INITIALIZED)
-	{
-	  ret = devicep->fini_device_func (devicep->target_id);
-	  devicep->state = GOMP_DEVICE_FINALIZED;
-	}
+	ret = gomp_fini_device (devicep);
       gomp_mutex_unlock (&devicep->lock);
       if (!ret)
 	gomp_fatal ("device finalization failed");

From patchwork Tue Sep 25 13:11:42 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chung-Lin Tang <chunglin_tang@mentor.com>
X-Patchwork-Id: 974407
Return-Path: 
 <gcc-patches-return-486327-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Authentication-Results: ozlabs.org;
	spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org
	(client-ip=209.132.180.131; helo=sourceware.org;
	envelope-from=gcc-patches-return-486327-incoming=patchwork.ozlabs.org@gcc.gnu.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=none (p=none dis=none) header.from=mentor.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org
	header.b="ECJqRXoU"; dkim-atps=neutral
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 42KM0z2Dfmz9s3C
	for <incoming@patchwork.ozlabs.org>;
	Tue, 25 Sep 2018 23:12:03 +1000 (AEST)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; q=dns; s=default; b=mOlYcOZKh/d6WtQzTMdI7lIA85eKB
	f//zAkySUb6Zq2OUnDWb3+Ul1KDNylL1IuT1HG/pJrGWeC7bZMlzumQHAYMkadeO
	emp+a2nu2qxKgpncabI5d0n6MZjBDBSD0k0RQRn96myF6yuUYYsaUybGM+gt1hDW
	LxAugw7JjeSxXg=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; s=default; bh=SXQGVf504W5Cc2N8j4GpK0bA2LA=; b=ECJ
	qRXoUA6tATCnsg57LImBCarB8MjBpXMpWW9m05WG4fb88ddNdrIgfVBfhEGk4Ed4
	xs1wZdP3ektb3DmTh1/es6Qk+IVjQMQ/95ybHCoXeP8xLNLjqegixQANFoCGS4IP
	SW715YJ71ZT4y6g4alyYvxe7VAaWU/05QLnF5erY=
Received: (qmail 124256 invoked by alias); 25 Sep 2018 13:11:55 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 124186 invoked by uid 89); 25 Sep 2018 13:11:54 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-24.9 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3,
	RCVD_IN_DNSWL_NONE,
	SPF_PASS autolearn=ham version=3.3.2 spammy=2.0
X-HELO: relay1.mentorg.com
Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131)
	by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with
	ESMTP; Tue, 25 Sep 2018 13:11:50 +0000
Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202])	by
	relay1.mentorg.com with esmtps
	(TLSv1.2:ECDHE-RSA-AES256-SHA384:256)	id 1g4n80-0004LV-IT
	from ChungLin_Tang@mentor.com for gcc-patches@gcc.gnu.org;
	Tue, 25 Sep 2018 06:11:48 -0700
Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com
	(147.34.90.202) with Microsoft SMTP Server (TLS) id
	15.0.1320.4; Tue, 25 Sep 2018 06:11:45 -0700
Reply-To: <cltang@codesourcery.com>
From: Chung-Lin Tang <chunglin_tang@mentor.com>
Subject: [PATCH 5/6, OpenACC, libgomp] Async re-work, C/C++ testsuite changes
To: <gcc-patches@gcc.gnu.org>, Thomas Schwinge <Thomas_Schwinge@mentor.com>
Message-ID: <8086c63b-f729-891b-3d21-76871d360734@mentor.com>
Date: Tue, 25 Sep 2018 21:11:42 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13;
	rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0

These are the testsuite/libgomp.oacc-c-c++-common/* changes.

Thanks,
Chung-Lin

	* testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c: New testcase.
	* testsuite/libgomp.oacc-c-c++-common/data-2-lib.c: Adjust testcase.
	* testsuite/libgomp.oacc-c-c++-common/data-2.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/data-3.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/lib-71.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/lib-77.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/lib-79.c: Likewise.
	* testsuite/libgomp.oacc-c-c++-common/lib-81.c: Likewise.

diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c
new file mode 100644
index 0000000..9420540
--- /dev/null
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c
@@ -0,0 +1,904 @@
+/* { dg-do run { target openacc_nvidia_accel_selected } } */
+/* { dg-additional-options "-lcuda" } */
+
+#include <openacc.h>
+#include <stdlib.h>
+#include <cuda.h>
+
+#include <stdio.h>
+#include <time.h>
+#include <sys/time.h>
+
+int
+main (int argc, char **argv)
+{
+    CUresult r;
+    CUstream stream1;
+    int N = 128; //1024 * 1024;
+    float *a, *b, *c, *d, *e;
+    int i;
+    int nbytes;
+
+    srand (time (NULL));
+    int s = rand () % 100;
+
+    acc_init (acc_device_nvidia);
+
+    nbytes = N * sizeof (float);
+
+    a = (float *) malloc (nbytes);
+    b = (float *) malloc (nbytes);
+    c = (float *) malloc (nbytes);
+    d = (float *) malloc (nbytes);
+    e = (float *) malloc (nbytes);
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 3.0;
+        b[i] = 0.0;
+    }
+
+    acc_set_default_async (s);
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
+    {
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = a[ii];
+    }
+
+#pragma acc wait
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 3.0)
+            abort ();
+
+        if (b[i] != 3.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 2.0;
+        b[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
+    {
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = a[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 2.0)
+            abort ();
+
+        if (b[i] != 2.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 3.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
+    {
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 3.0)
+            abort ();
+
+        if (b[i] != 9.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+
+        if (d[i] != 1.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 2.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+        e[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
+    {
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc parallel wait (s) async (s)
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 2.0)
+            abort ();
+
+        if (b[i] != 4.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+
+        if (d[i] != 1.0)
+            abort ();
+
+        if (e[i] != 11.0)
+            abort ();
+    }
+
+
+    r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING);
+    if (r != CUDA_SUCCESS)
+    {
+        fprintf (stderr, "cuStreamCreate failed: %d\n", r);
+        abort ();
+    }
+
+    acc_set_cuda_stream (1, stream1);
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 5.0;
+        b[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N]) copyin (N)
+    {
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = a[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 5.0)
+            abort ();
+
+        if (b[i] != 5.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 7.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
+    {
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 7.0)
+            abort ();
+
+        if (b[i] != 49.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+
+        if (d[i] != 1.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 3.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+        e[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
+    {
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc parallel wait (s) async (s)
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 3.0)
+            abort ();
+
+        if (b[i] != 9.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+
+        if (d[i] != 1.0)
+            abort ();
+
+        if (e[i] != 17.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 4.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+        e[i] = 0.0;
+    }
+
+#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
+    {
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 4.0)
+            abort ();
+
+        if (b[i] != 16.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+    }
+
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 5.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+        e[i] = 0.0;
+    }
+
+#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
+    {
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc parallel async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc update host (a[0:N], b[0:N], c[0:N]) async
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 5.0)
+            abort ();
+
+        if (b[i] != 25.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 3.0;
+        b[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
+    {
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = a[ii];
+    }
+
+#pragma acc wait
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 3.0)
+            abort ();
+
+        if (b[i] != 3.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 2.0;
+        b[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N)
+    {
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = a[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 2.0)
+            abort ();
+
+        if (b[i] != 2.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 3.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
+    {
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 3.0)
+            abort ();
+
+        if (b[i] != 9.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+
+        if (d[i] != 1.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 2.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+        e[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
+    {
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc kernels wait (s) async (s)
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 2.0)
+            abort ();
+
+        if (b[i] != 4.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+
+        if (d[i] != 1.0)
+            abort ();
+
+        if (e[i] != 11.0)
+            abort ();
+    }
+
+
+    r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING);
+    if (r != CUDA_SUCCESS)
+    {
+        fprintf (stderr, "cuStreamCreate failed: %d\n", r);
+        abort ();
+    }
+
+    acc_set_cuda_stream (1, stream1);
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 5.0;
+        b[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N]) copyin (N)
+    {
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = a[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 5.0)
+            abort ();
+
+        if (b[i] != 5.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 7.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N)
+    {
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 7.0)
+            abort ();
+
+        if (b[i] != 49.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+
+        if (d[i] != 1.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 3.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+        e[i] = 0.0;
+    }
+
+#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N)
+    {
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
+    }
+
+#pragma acc kernels wait (s) async (s)
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
+    }
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 3.0)
+            abort ();
+
+        if (b[i] != 9.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+
+        if (d[i] != 1.0)
+            abort ();
+
+        if (e[i] != 17.0)
+            abort ();
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 4.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+        e[i] = 0.0;
+    }
+
+#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
+    {
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 4.0)
+            abort ();
+
+        if (b[i] != 16.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+    }
+
+
+    for (i = 0; i < N; i++)
+    {
+        a[i] = 5.0;
+        b[i] = 0.0;
+        c[i] = 0.0;
+        d[i] = 0.0;
+        e[i] = 0.0;
+    }
+
+#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N)
+    {
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii];
+    }
+
+#pragma acc kernels async
+    {
+        int ii;
+
+        for (ii = 0; ii < N; ii++)
+            c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii];
+    }
+
+#pragma acc update host (a[0:N], b[0:N], c[0:N]) async
+
+#pragma acc wait (s)
+
+    }
+
+    for (i = 0; i < N; i++)
+    {
+        if (a[i] != 5.0)
+            abort ();
+
+        if (b[i] != 25.0)
+            abort ();
+
+        if (c[i] != 4.0)
+            abort ();
+    }
+
+    acc_shutdown (acc_device_nvidia);
+
+    return 0;
+}
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c
index 2ddfa7d..f553d3d 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c
@@ -153,7 +153,7 @@ main (int argc, char **argv)
     d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
 
 #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N], N) \
-  async (4)
+  wait (1, 2, 3) async (4)
   for (int ii = 0; ii < N; ii++)
     e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c
index 0c6abe6..81d623a 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c
@@ -162,7 +162,7 @@ main (int argc, char **argv)
     d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
 
 #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \
-  wait (1) async (4)
+  wait (1, 2, 3) async (4)
   for (int ii = 0; ii < N; ii++)
     e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c
index 0bf706a..5ec50b8 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c
@@ -138,7 +138,7 @@ main (int argc, char **argv)
     d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii];
 
 #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \
-  wait (1,5) async (4)
+  wait (1, 2, 3, 5) async (4)
   for (int ii = 0; ii < N; ii++)
     e[ii] = a[ii] + b[ii] + c[ii] + d[ii];
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c
index c85e824..6afe2a0 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c
@@ -92,16 +92,22 @@ main (int argc, char **argv)
       abort ();
     }
 
-  fprintf (stderr, "CheCKpOInT\n");
-  if (acc_async_test (1) != 0)
+  if (acc_async_test (0) != 0)
     {
       fprintf (stderr, "asynchronous operation not running\n");
       abort ();
     }
 
+  /* Test unseen async number.  */
+  if (acc_async_test (1) != 1)
+    {
+      fprintf (stderr, "acc_async_test failed on unseen number\n");
+      abort ();
+    }
+
   sleep ((int) (dtime / 1000.0f) + 1);
 
-  if (acc_async_test (1) != 1)
+  if (acc_async_test (0) != 1)
     {
       fprintf (stderr, "found asynchronous operation still running\n");
       abort ();
@@ -116,7 +122,3 @@ main (int argc, char **argv)
 
   return 0;
 }
-
-/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */
-/* { dg-output "unknown async \[0-9\]+" } */
-/* { dg-shouldfail "" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c
index f4f196d..2821f88 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c
@@ -111,7 +111,7 @@ main (int argc, char **argv)
 
   start_timer (0);
 
-  acc_wait (1);
+  acc_wait (0);
 
   atime = stop_timer (0);
 
@@ -132,7 +132,3 @@ main (int argc, char **argv)
 
   return 0;
 }
-
-/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */
-/* { dg-output "unknown async \[0-9\]+" } */
-/* { dg-shouldfail "" } */
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
index ef3df13..b22af26 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c
@@ -114,6 +114,7 @@ main (int argc, char **argv)
 
   for (i = 0; i < N; i++)
     {
+      stream = (CUstream) acc_get_cuda_stream (i & 1);
       r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0);
       if (r != CUDA_SUCCESS)
 	{
@@ -122,11 +123,11 @@ main (int argc, char **argv)
 	}
     }
 
-  acc_wait_async (0, 1);
-
   if (acc_async_test (0) != 0)
     abort ();
 
+  acc_wait_async (0, 1);
+
   if (acc_async_test (1) != 0)
     abort ();
 
diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
index d5f18f0..30a4b57 100644
--- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
+++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c
@@ -133,7 +133,7 @@ main (int argc, char **argv)
 
   for (i = 0; i <= N; i++)
     {
-      if (acc_async_test (i) != 0)
+      if (acc_async_test (i) == 0)
 	abort ();
     }
 

From patchwork Tue Sep 25 13:11:58 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chung-Lin Tang <chunglin_tang@mentor.com>
X-Patchwork-Id: 974408
Return-Path: 
 <gcc-patches-return-486328-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Authentication-Results: ozlabs.org;
	spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org
	(client-ip=209.132.180.131; helo=sourceware.org;
	envelope-from=gcc-patches-return-486328-incoming=patchwork.ozlabs.org@gcc.gnu.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org;
	dmarc=none (p=none dis=none) header.from=mentor.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org
	header.b="o4eQ8AZw"; dkim-atps=neutral
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 42KM1L2ggwz9s47
	for <incoming@patchwork.ozlabs.org>;
	Tue, 25 Sep 2018 23:12:22 +1000 (AEST)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; q=dns; s=default; b=onrTeyfbY/uR9O6H39tXRuQnQs9DM
	oQcmx+IO2a+ip6VOIfw1NqVvX47uckgisRByjnvWOKKPWSVWHwLssLjb/KJUcK0L
	01uYPdvYsaiWhKye7M4nTaG+7WJeG7jxM8icCpP+GCX0WlReKeqMmuyY3PkahmJR
	ou0TrF6yiETcJU=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender
	:reply-to:from:subject:to:message-id:date:mime-version
	:content-type; s=default; bh=BYBGo9Q5JcQNs+oP860zOtBiNBc=; b=o4e
	Q8AZwPEhVJYOfnJpts2zK3fekhXSpzV8YkN0ytjT4HAXtlQ040AiCP1pOs4qPO3g
	wsXak/C088X9LgNpkCQgdEW5hZ1aKWyKJaMr94G5o886t0Ph7IuKMcfXjhHcyMhK
	8yPO1mJpGXKebkNBMTPDyxw6U+/YOYfZUTDGHwrw=
Received: (qmail 125804 invoked by alias); 25 Sep 2018 13:12:13 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 125784 invoked by uid 89); 25 Sep 2018 13:12:12 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-24.9 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3,
	RCVD_IN_DNSWL_NONE,
	SPF_PASS autolearn=ham version=3.3.2 spammy=claimed, initiated,
	**h, teams
X-HELO: relay1.mentorg.com
Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131)
	by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with
	ESMTP; Tue, 25 Sep 2018 13:12:07 +0000
Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202])	by
	relay1.mentorg.com with esmtps
	(TLSv1.2:ECDHE-RSA-AES256-SHA384:256)	id 1g4n8H-0004Nj-2e
	from ChungLin_Tang@mentor.com ; Tue, 25 Sep 2018 06:12:05 -0700
Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com
	(147.34.90.202) with Microsoft SMTP Server (TLS) id
	15.0.1320.4; Tue, 25 Sep 2018 06:12:01 -0700
Reply-To: <cltang@codesourcery.com>
From: Chung-Lin Tang <chunglin_tang@mentor.com>
Subject: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes
To: <gcc-patches@gcc.gnu.org>, Tom de Vries <tdevries@suse.de>
Message-ID: <9523b49a-0454-e0a9-826d-5eeec2a8c973@mentor.com>
Date: Tue, 25 Sep 2018 21:11:58 +0800
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13;
	rv:52.0) Gecko/20100101 Thunderbird/52.9.1
MIME-Version: 1.0

Hi Tom,
this patch removes large portions of plugin/plugin-nvptx.c, since a lot of it is
now in oacc-async.c now. The new code is essentially a NVPTX/CUDA-specific implementation
of the new-style goacc_asyncqueues.

Also, some needed functions in cuda-lib.def are added. The cuda.h function has also
been updated to build independently without a CUDA installation.

Thanks,
Chung-Lin

	* plugin/plugin-nvptx.c (struct cuda_map): Remove.
	(struct ptx_stream): Remove.
	(struct nvptx_thread): Remove current_stream field.
	(cuda_map_create): Remove.
	(cuda_map_destroy): Remove.
	(map_init): Remove.
	(map_fini): Remove.
	(map_pop): Remove.
	(map_push): Remove.
	(struct goacc_asyncqueue): Define.
	(struct nvptx_callback): Define.
	(struct ptx_free_block): Define.
	(struct ptx_device): Remove null_stream, active_streams, async_streams,
	stream_lock, and next fields.
	(enum ptx_event_type): Remove.
	(struct ptx_event): Remove.
	(ptx_event_lock): Remove.
	(ptx_events): Remove.
	(init_streams_for_device): Remove.
	(fini_streams_for_device): Remove.
	(select_stream_for_async): Remove.
	(nvptx_init): Remove ptx_events and ptx_event_lock references.
	(nvptx_attach_host_thread_to_device): Remove CUDA_ERROR_NOT_PERMITTED
	case.
	(nvptx_open_device): Add free_blocks initialization, remove
	init_streams_for_device call.
	(nvptx_close_device): Remove fini_streams_for_device call, add
	free_blocks destruct code.
	(event_gc): Remove.
	(event_add): Remove.
	(nvptx_exec): Adjust parameters and code.
	(nvptx_free): Likewise.
	(nvptx_host2dev): Remove.
	(nvptx_dev2host): Remove.
	(nvptx_set_async): Remove.
	(nvptx_async_test): Remove.
	(nvptx_async_test_all): Remove.
	(nvptx_wait): Remove.
	(nvptx_wait_async): Remove.
	(nvptx_wait_all): Remove.
	(nvptx_wait_all_async): Remove.
	(nvptx_get_cuda_stream): Remove.
	(nvptx_set_cuda_stream): Remove.
	(GOMP_OFFLOAD_alloc): Adjust code.
	(GOMP_OFFLOAD_free): Likewise.
	(GOMP_OFFLOAD_openacc_register_async_cleanup): Remove.
	(GOMP_OFFLOAD_openacc_exec): Adjust parameters and code.
	(GOMP_OFFLOAD_openacc_async_test_all): Remove.
	(GOMP_OFFLOAD_openacc_async_wait): Remove.
	(GOMP_OFFLOAD_openacc_async_wait_async): Remove.
	(GOMP_OFFLOAD_openacc_async_wait_all): Remove.
	(GOMP_OFFLOAD_openacc_async_wait_all_async): Remove.
	(GOMP_OFFLOAD_openacc_async_set_async): Remove.
	(cuda_free_argmem): New function.
	(GOMP_OFFLOAD_openacc_async_exec): New plugin hook function.
	(GOMP_OFFLOAD_openacc_create_thread_data): Adjust code.
	(GOMP_OFFLOAD_openacc_cuda_get_stream): Adjust code.
	(GOMP_OFFLOAD_openacc_cuda_set_stream): Adjust code.
	(GOMP_OFFLOAD_openacc_async_construct): New plugin hook function.
	(GOMP_OFFLOAD_openacc_async_destruct): New plugin hook function.
	(GOMP_OFFLOAD_openacc_async_test): Remove and re-implement.
	(GOMP_OFFLOAD_openacc_async_synchronize): New plugin hook function.
	(GOMP_OFFLOAD_openacc_async_serialize): New plugin hook function.
	(GOMP_OFFLOAD_openacc_async_queue_callback): New plugin hook function.
	(cuda_callback_wrapper): New function.
	(cuda_memcpy_sanity_check): New function.
	(GOMP_OFFLOAD_host2dev): Remove and re-implement.
	(GOMP_OFFLOAD_dev2host): Remove and re-implement.
	(GOMP_OFFLOAD_openacc_async_host2dev): New plugin hook function.
	(GOMP_OFFLOAD_openacc_async_dev2host): New plugin hook function.

diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def
index b2a4c21..a16badc 100644
--- a/libgomp/plugin/cuda-lib.def
+++ b/libgomp/plugin/cuda-lib.def
@@ -42,6 +42,7 @@ CUDA_ONE_CALL (cuModuleLoad)
 CUDA_ONE_CALL (cuModuleLoadData)
 CUDA_ONE_CALL (cuModuleUnload)
 CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize)
+CUDA_ONE_CALL (cuStreamAddCallback)
 CUDA_ONE_CALL (cuStreamCreate)
 CUDA_ONE_CALL (cuStreamDestroy)
 CUDA_ONE_CALL (cuStreamQuery)
diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h
index b4c1b29..326db54 100644
--- a/libgomp/plugin/cuda/cuda.h
+++ b/libgomp/plugin/cuda/cuda.h
@@ -54,7 +54,11 @@ typedef enum {
   CUDA_ERROR_INVALID_CONTEXT = 201,
   CUDA_ERROR_NOT_FOUND = 500,
   CUDA_ERROR_NOT_READY = 600,
-  CUDA_ERROR_LAUNCH_FAILED = 719
+  CUDA_ERROR_LAUNCH_FAILED = 719,
+  CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE = 720,
+  CUDA_ERROR_NOT_PERMITTED = 800,
+  CUDA_ERROR_NOT_SUPPORTED = 801,
+  CUDA_ERROR_UNKNOWN = 999
 } CUresult;
 
 typedef enum {
@@ -173,6 +177,8 @@ CUresult cuModuleLoadData (CUmodule *, const void *);
 CUresult cuModuleUnload (CUmodule);
 CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction,
 					  CUoccupancyB2DSize, size_t, int);
+typedef void (*CUstreamCallback)(CUstream, CUresult, void *);
+CUresult cuStreamAddCallback(CUstream, CUstreamCallback, void *, unsigned int);
 CUresult cuStreamCreate (CUstream *, unsigned);
 #define cuStreamDestroy cuStreamDestroy_v2
 CUresult cuStreamDestroy (CUstream);
diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index bae1b05..2959df2 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -192,128 +192,30 @@ cuda_error (CUresult r)
 static unsigned int instantiated_devices = 0;
 static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER;
 
-struct ptx_stream
+/* NVPTX/CUDA specific definition of asynchronous queues.  */
+struct goacc_asyncqueue
 {
-  CUstream stream;
-  pthread_t host_thread;
-  bool multithreaded;
+  CUstream cuda_stream;
+};
 
-  CUdeviceptr d;
-  void *h;
-  void *h_begin;
-  void *h_end;
-  void *h_next;
-  void *h_prev;
-  void *h_tail;
-
-  struct ptx_stream *next;
+struct nvptx_callback
+{
+  void (*fn) (void *);
+  void *ptr;
+  struct goacc_asyncqueue *aq;
+  struct nvptx_callback *next;
 };
 
 /* Thread-specific data for PTX.  */
 
 struct nvptx_thread
 {
-  struct ptx_stream *current_stream;
+  /* We currently have this embedded inside the plugin because libgomp manages
+     devices through integer target_ids.  This might be better if using an
+     opaque target-specific pointer directly from gomp_device_descr.  */
   struct ptx_device *ptx_dev;
 };
 
-static bool
-map_init (struct ptx_stream *s)
-{
-  int size = getpagesize ();
-
-  assert (s);
-  assert (!s->d);
-  assert (!s->h);
-
-  CUDA_CALL (cuMemAllocHost, &s->h, size);
-  CUDA_CALL (cuMemHostGetDevicePointer, &s->d, s->h, 0);
-
-  assert (s->h);
-
-  s->h_begin = s->h;
-  s->h_end = s->h_begin + size;
-  s->h_next = s->h_prev = s->h_tail = s->h_begin;
-
-  assert (s->h_next);
-  assert (s->h_end);
-  return true;
-}
-
-static bool
-map_fini (struct ptx_stream *s)
-{
-  CUDA_CALL (cuMemFreeHost, s->h);
-  return true;
-}
-
-static void
-map_pop (struct ptx_stream *s)
-{
-  assert (s != NULL);
-  assert (s->h_next);
-  assert (s->h_prev);
-  assert (s->h_tail);
-
-  s->h_tail = s->h_next;
-
-  if (s->h_tail >= s->h_end)
-    s->h_tail = s->h_begin + (int) (s->h_tail - s->h_end);
-
-  if (s->h_next == s->h_tail)
-    s->h_prev = s->h_next;
-
-  assert (s->h_next >= s->h_begin);
-  assert (s->h_tail >= s->h_begin);
-  assert (s->h_prev >= s->h_begin);
-
-  assert (s->h_next <= s->h_end);
-  assert (s->h_tail <= s->h_end);
-  assert (s->h_prev <= s->h_end);
-}
-
-static void
-map_push (struct ptx_stream *s, size_t size, void **h, void **d)
-{
-  int left;
-  int offset;
-
-  assert (s != NULL);
-
-  left = s->h_end - s->h_next;
-
-  assert (s->h_prev);
-  assert (s->h_next);
-
-  if (size >= left)
-    {
-      assert (s->h_next == s->h_prev);
-      s->h_next = s->h_prev = s->h_tail = s->h_begin;
-    }
-
-  assert (s->h_next);
-
-  offset = s->h_next - s->h;
-
-  *d = (void *)(s->d + offset);
-  *h = (void *)(s->h + offset);
-
-  s->h_prev = s->h_next;
-  s->h_next += size;
-
-  assert (s->h_prev);
-  assert (s->h_next);
-
-  assert (s->h_next >= s->h_begin);
-  assert (s->h_tail >= s->h_begin);
-  assert (s->h_prev >= s->h_begin);
-  assert (s->h_next <= s->h_end);
-  assert (s->h_tail <= s->h_end);
-  assert (s->h_prev <= s->h_end);
-
-  return;
-}
-
 /* Target data function launch information.  */
 
 struct targ_fn_launch
@@ -365,22 +267,18 @@ struct ptx_image_data
   struct ptx_image_data *next;
 };
 
+struct ptx_free_block
+{
+  void *ptr;
+  struct ptx_free_block *next;
+};
+
 struct ptx_device
 {
   CUcontext ctx;
   bool ctx_shared;
   CUdevice dev;
-  struct ptx_stream *null_stream;
-  /* All non-null streams associated with this device (actually context),
-     either created implicitly or passed in from the user (via
-     acc_set_cuda_stream).  */
-  struct ptx_stream *active_streams;
-  struct {
-    struct ptx_stream **arr;
-    int size;
-  } async_streams;
-  /* A lock for use when manipulating the above stream list and array.  */
-  pthread_mutex_t stream_lock;
+
   int ord;
   bool overlap;
   bool map;
@@ -398,32 +296,13 @@ struct ptx_device
 
   struct ptx_image_data *images;  /* Images loaded on device.  */
   pthread_mutex_t image_lock;     /* Lock for above list.  */
-  
-  struct ptx_device *next;
-};
-
-enum ptx_event_type
-{
-  PTX_EVT_MEM,
-  PTX_EVT_KNL,
-  PTX_EVT_SYNC,
-  PTX_EVT_ASYNC_CLEANUP
-};
 
-struct ptx_event
-{
-  CUevent *evt;
-  int type;
-  void *addr;
-  int ord;
-  int val;
+  struct ptx_free_block *free_blocks;
+  pthread_mutex_t free_blocks_lock;
 
-  struct ptx_event *next;
+  struct ptx_device *next;
 };
 
-static pthread_mutex_t ptx_event_lock;
-static struct ptx_event *ptx_events;
-
 static struct ptx_device **ptx_devices;
 
 static inline struct nvptx_thread *
@@ -432,197 +311,6 @@ nvptx_thread (void)
   return (struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
 }
 
-static bool
-init_streams_for_device (struct ptx_device *ptx_dev, int concurrency)
-{
-  int i;
-  struct ptx_stream *null_stream
-    = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream));
-
-  null_stream->stream = NULL;
-  null_stream->host_thread = pthread_self ();
-  null_stream->multithreaded = true;
-  null_stream->d = (CUdeviceptr) NULL;
-  null_stream->h = NULL;
-  if (!map_init (null_stream))
-    return false;
-
-  ptx_dev->null_stream = null_stream;
-  ptx_dev->active_streams = NULL;
-  pthread_mutex_init (&ptx_dev->stream_lock, NULL);
-
-  if (concurrency < 1)
-    concurrency = 1;
-
-  /* This is just a guess -- make space for as many async streams as the
-     current device is capable of concurrently executing.  This can grow
-     later as necessary.  No streams are created yet.  */
-  ptx_dev->async_streams.arr
-    = GOMP_PLUGIN_malloc (concurrency * sizeof (struct ptx_stream *));
-  ptx_dev->async_streams.size = concurrency;
-
-  for (i = 0; i < concurrency; i++)
-    ptx_dev->async_streams.arr[i] = NULL;
-
-  return true;
-}
-
-static bool
-fini_streams_for_device (struct ptx_device *ptx_dev)
-{
-  free (ptx_dev->async_streams.arr);
-
-  bool ret = true;
-  while (ptx_dev->active_streams != NULL)
-    {
-      struct ptx_stream *s = ptx_dev->active_streams;
-      ptx_dev->active_streams = ptx_dev->active_streams->next;
-
-      ret &= map_fini (s);
-
-      CUresult r = CUDA_CALL_NOCHECK (cuStreamDestroy, s->stream);
-      if (r != CUDA_SUCCESS)
-	{
-	  GOMP_PLUGIN_error ("cuStreamDestroy error: %s", cuda_error (r));
-	  ret = false;
-	}
-      free (s);
-    }
-
-  ret &= map_fini (ptx_dev->null_stream);
-  free (ptx_dev->null_stream);
-  return ret;
-}
-
-/* Select a stream for (OpenACC-semantics) ASYNC argument for the current
-   thread THREAD (and also current device/context).  If CREATE is true, create
-   the stream if it does not exist (or use EXISTING if it is non-NULL), and
-   associate the stream with the same thread argument.  Returns stream to use
-   as result.  */
-
-static struct ptx_stream *
-select_stream_for_async (int async, pthread_t thread, bool create,
-			 CUstream existing)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-  /* Local copy of TLS variable.  */
-  struct ptx_device *ptx_dev = nvthd->ptx_dev;
-  struct ptx_stream *stream = NULL;
-  int orig_async = async;
-
-  /* The special value acc_async_noval (-1) maps (for now) to an
-     implicitly-created stream, which is then handled the same as any other
-     numbered async stream.  Other options are available, e.g. using the null
-     stream for anonymous async operations, or choosing an idle stream from an
-     active set.  But, stick with this for now.  */
-  if (async > acc_async_sync)
-    async++;
-
-  if (create)
-    pthread_mutex_lock (&ptx_dev->stream_lock);
-
-  /* NOTE: AFAICT there's no particular need for acc_async_sync to map to the
-     null stream, and in fact better performance may be obtainable if it doesn't
-     (because the null stream enforces overly-strict synchronisation with
-     respect to other streams for legacy reasons, and that's probably not
-     needed with OpenACC).  Maybe investigate later.  */
-  if (async == acc_async_sync)
-    stream = ptx_dev->null_stream;
-  else if (async >= 0 && async < ptx_dev->async_streams.size
-	   && ptx_dev->async_streams.arr[async] && !(create && existing))
-    stream = ptx_dev->async_streams.arr[async];
-  else if (async >= 0 && create)
-    {
-      if (async >= ptx_dev->async_streams.size)
-	{
-	  int i, newsize = ptx_dev->async_streams.size * 2;
-
-	  if (async >= newsize)
-	    newsize = async + 1;
-
-	  ptx_dev->async_streams.arr
-	    = GOMP_PLUGIN_realloc (ptx_dev->async_streams.arr,
-				   newsize * sizeof (struct ptx_stream *));
-
-	  for (i = ptx_dev->async_streams.size; i < newsize; i++)
-	    ptx_dev->async_streams.arr[i] = NULL;
-
-	  ptx_dev->async_streams.size = newsize;
-	}
-
-      /* Create a new stream on-demand if there isn't one already, or if we're
-	 setting a particular async value to an existing (externally-provided)
-	 stream.  */
-      if (!ptx_dev->async_streams.arr[async] || existing)
-        {
-	  CUresult r;
-	  struct ptx_stream *s
-	    = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream));
-
-	  if (existing)
-	    s->stream = existing;
-	  else
-	    {
-	      r = CUDA_CALL_NOCHECK (cuStreamCreate, &s->stream,
-				     CU_STREAM_DEFAULT);
-	      if (r != CUDA_SUCCESS)
-		{
-		  pthread_mutex_unlock (&ptx_dev->stream_lock);
-		  GOMP_PLUGIN_fatal ("cuStreamCreate error: %s",
-				     cuda_error (r));
-		}
-	    }
-
-	  /* If CREATE is true, we're going to be queueing some work on this
-	     stream.  Associate it with the current host thread.  */
-	  s->host_thread = thread;
-	  s->multithreaded = false;
-
-	  s->d = (CUdeviceptr) NULL;
-	  s->h = NULL;
-	  if (!map_init (s))
-	    {
-	      pthread_mutex_unlock (&ptx_dev->stream_lock);
-	      GOMP_PLUGIN_fatal ("map_init fail");
-	    }
-
-	  s->next = ptx_dev->active_streams;
-	  ptx_dev->active_streams = s;
-	  ptx_dev->async_streams.arr[async] = s;
-	}
-
-      stream = ptx_dev->async_streams.arr[async];
-    }
-  else if (async < 0)
-    {
-      if (create)
-	pthread_mutex_unlock (&ptx_dev->stream_lock);
-      GOMP_PLUGIN_fatal ("bad async %d", async);
-    }
-
-  if (create)
-    {
-      assert (stream != NULL);
-
-      /* If we're trying to use the same stream from different threads
-	 simultaneously, set stream->multithreaded to true.  This affects the
-	 behaviour of acc_async_test_all and acc_wait_all, which are supposed to
-	 only wait for asynchronous launches from the same host thread they are
-	 invoked on.  If multiple threads use the same async value, we make note
-	 of that here and fall back to testing/waiting for all threads in those
-	 functions.  */
-      if (thread != stream->host_thread)
-        stream->multithreaded = true;
-
-      pthread_mutex_unlock (&ptx_dev->stream_lock);
-    }
-  else if (stream && !stream->multithreaded
-	   && !pthread_equal (stream->host_thread, thread))
-    GOMP_PLUGIN_fatal ("async %d used on wrong thread", orig_async);
-
-  return stream;
-}
-
 /* Initialize the device.  Return TRUE on success, else FALSE.  PTX_DEV_LOCK
    should be locked on entry and remains locked on exit.  */
 
@@ -634,9 +322,6 @@ nvptx_init (void)
   if (instantiated_devices != 0)
     return true;
 
-  ptx_events = NULL;
-  pthread_mutex_init (&ptx_event_lock, NULL);
-
   if (!init_cuda_lib ())
     return false;
 
@@ -660,6 +345,11 @@ nvptx_attach_host_thread_to_device (int n)
   CUcontext thd_ctx;
 
   r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev);
+  if (r == CUDA_ERROR_NOT_PERMITTED)
+    {
+      /* Assume we're in a CUDA callback, just return true.  */
+      return true;
+    }
   if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT)
     {
       GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r));
@@ -804,8 +494,8 @@ nvptx_open_device (int n)
   ptx_dev->images = NULL;
   pthread_mutex_init (&ptx_dev->image_lock, NULL);
 
-  if (!init_streams_for_device (ptx_dev, async_engines))
-    return NULL;
+  ptx_dev->free_blocks = NULL;
+  pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
 
   return ptx_dev;
 }
@@ -816,9 +506,15 @@ nvptx_close_device (struct ptx_device *ptx_dev)
   if (!ptx_dev)
     return true;
 
-  if (!fini_streams_for_device (ptx_dev))
-    return false;
-  
+  for (struct ptx_free_block *b = ptx_dev->free_blocks; b;)
+    {
+      struct ptx_free_block *b_next = b->next;
+      CUDA_CALL (cuMemFree, (CUdeviceptr) b->ptr);
+      free (b);
+      b = b_next;
+    }
+
+  pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
   pthread_mutex_destroy (&ptx_dev->image_lock);
 
   if (!ptx_dev->ctx_shared)
@@ -998,138 +694,19 @@ link_ptx (CUmodule *module, const struct targ_ptx_obj *ptx_objs,
 }
 
 static void
-event_gc (bool memmap_lockable)
-{
-  struct ptx_event *ptx_event = ptx_events;
-  struct ptx_event *async_cleanups = NULL;
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  pthread_mutex_lock (&ptx_event_lock);
-
-  while (ptx_event != NULL)
-    {
-      CUresult r;
-      struct ptx_event *e = ptx_event;
-
-      ptx_event = ptx_event->next;
-
-      if (e->ord != nvthd->ptx_dev->ord)
-	continue;
-
-      r = CUDA_CALL_NOCHECK (cuEventQuery, *e->evt);
-      if (r == CUDA_SUCCESS)
-	{
-	  bool append_async = false;
-	  CUevent *te;
-
-	  te = e->evt;
-
-	  switch (e->type)
-	    {
-	    case PTX_EVT_MEM:
-	    case PTX_EVT_SYNC:
-	      break;
-
-	    case PTX_EVT_KNL:
-	      map_pop (e->addr);
-	      break;
-
-	    case PTX_EVT_ASYNC_CLEANUP:
-	      {
-		/* The function gomp_plugin_async_unmap_vars needs to claim the
-		   memory-map splay tree lock for the current device, so we
-		   can't call it when one of our callers has already claimed
-		   the lock.  In that case, just delay the GC for this event
-		   until later.  */
-		if (!memmap_lockable)
-		  continue;
-
-		append_async = true;
-	      }
-	      break;
-	    }
-
-	  CUDA_CALL_NOCHECK (cuEventDestroy, *te);
-	  free ((void *)te);
-
-	  /* Unlink 'e' from ptx_events list.  */
-	  if (ptx_events == e)
-	    ptx_events = ptx_events->next;
-	  else
-	    {
-	      struct ptx_event *e_ = ptx_events;
-	      while (e_->next != e)
-		e_ = e_->next;
-	      e_->next = e_->next->next;
-	    }
-
-	  if (append_async)
-	    {
-	      e->next = async_cleanups;
-	      async_cleanups = e;
-	    }
-	  else
-	    free (e);
-	}
-    }
-
-  pthread_mutex_unlock (&ptx_event_lock);
-
-  /* We have to do these here, after ptx_event_lock is released.  */
-  while (async_cleanups)
-    {
-      struct ptx_event *e = async_cleanups;
-      async_cleanups = async_cleanups->next;
-
-      GOMP_PLUGIN_async_unmap_vars (e->addr, e->val);
-      free (e);
-    }
-}
-
-static void
-event_add (enum ptx_event_type type, CUevent *e, void *h, int val)
-{
-  struct ptx_event *ptx_event;
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  assert (type == PTX_EVT_MEM || type == PTX_EVT_KNL || type == PTX_EVT_SYNC
-	  || type == PTX_EVT_ASYNC_CLEANUP);
-
-  ptx_event = GOMP_PLUGIN_malloc (sizeof (struct ptx_event));
-  ptx_event->type = type;
-  ptx_event->evt = e;
-  ptx_event->addr = h;
-  ptx_event->ord = nvthd->ptx_dev->ord;
-  ptx_event->val = val;
-
-  pthread_mutex_lock (&ptx_event_lock);
-
-  ptx_event->next = ptx_events;
-  ptx_events = ptx_event;
-
-  pthread_mutex_unlock (&ptx_event_lock);
-}
-
-static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
-	    int async, unsigned *dims, void *targ_mem_desc)
+	    unsigned *dims, void *targ_mem_desc,
+	    CUdeviceptr dp, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
-  CUresult r;
   int i;
-  struct ptx_stream *dev_str;
   void *kargs[1];
-  void *hp, *dp;
   struct nvptx_thread *nvthd = nvptx_thread ();
   int warp_size = nvthd->ptx_dev->warp_size;
-  const char *maybe_abort_msg = "(perhaps abort was called)";
 
   function = targ_fn->fn;
 
-  dev_str = select_stream_for_async (async, pthread_self (), false, NULL);
-  assert (dev_str == nvthd->current_stream);
-
   /* Initialize the launch dimensions.  Typically this is constant,
      provided by the device compiler, but we must permit runtime
      values.  */
@@ -1292,21 +869,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
 			 suggest_workers, suggest_workers);
     }
 
-  /* This reserves a chunk of a pre-allocated page of memory mapped on both
-     the host and the device. HP is a host pointer to the new chunk, and DP is
-     the corresponding device pointer.  */
-  map_push (dev_str, mapnum * sizeof (void *), &hp, &dp);
-
-  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-
-  /* Copy the array of arguments to the mapped page.  */
-  for (i = 0; i < mapnum; i++)
-    ((void **) hp)[i] = devaddrs[i];
-
-  /* Copy the (device) pointers to arguments to the device (dp and hp might in
-     fact have the same value on a unified-memory system).  */
-  CUDA_CALL_ASSERT (cuMemcpy, (CUdeviceptr) dp, (CUdeviceptr) hp,
-		    mapnum * sizeof (void *));
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: launch"
 		     " gangs=%u, workers=%u, vectors=%u\n",
 		     __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG],
@@ -1317,58 +879,14 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
   // num_gangs		nctaid.x
   // num_workers	ntid.y
   // vector length	ntid.x
-
   kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
 		    dims[GOMP_DIM_GANG], 1, 1,
 		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
-		    0, dev_str->stream, kargs, 0);
-
-#ifndef DISABLE_ASYNC
-  if (async < acc_async_noval)
-    {
-      r = CUDA_CALL_NOCHECK (cuStreamSynchronize, dev_str->stream);
-      if (r == CUDA_ERROR_LAUNCH_FAILED)
-	GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
-			   maybe_abort_msg);
-      else if (r != CUDA_SUCCESS)
-        GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
-    }
-  else
-    {
-      CUevent *e;
-
-      e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
-
-      r = CUDA_CALL_NOCHECK (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-      if (r == CUDA_ERROR_LAUNCH_FAILED)
-	GOMP_PLUGIN_fatal ("cuEventCreate error: %s %s\n", cuda_error (r),
-			   maybe_abort_msg);
-      else if (r != CUDA_SUCCESS)
-        GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r));
-
-      event_gc (true);
-
-      CUDA_CALL_ASSERT (cuEventRecord, *e, dev_str->stream);
-
-      event_add (PTX_EVT_KNL, e, (void *)dev_str, 0);
-    }
-#else
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
-  if (r == CUDA_ERROR_LAUNCH_FAILED)
-    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
-		       maybe_abort_msg);
-  else if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
-#endif
+		    0, stream, kargs, 0);
 
   GOMP_PLUGIN_debug (0, "  %s: kernel %s: finished\n", __FUNCTION__,
 		     targ_fn->launch->fn);
-
-#ifndef DISABLE_ASYNC
-  if (async < acc_async_noval)
-#endif
-    map_pop (dev_str);
 }
 
 void * openacc_get_current_cuda_context (void);
@@ -1383,8 +901,21 @@ nvptx_alloc (size_t s)
 }
 
 static bool
-nvptx_free (void *p)
+nvptx_free (void *p, struct ptx_device *ptx_dev)
 {
+  /* Assume callback context if this is null.  */
+  if (GOMP_PLUGIN_acc_thread () == NULL)
+    {
+      struct ptx_free_block *n
+	= GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block));
+      n->ptr = p;
+      pthread_mutex_lock (&ptx_dev->free_blocks_lock);
+      n->next = ptx_dev->free_blocks;
+      ptx_dev->free_blocks = n;
+      pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
+      return true;
+    }
+
   CUdeviceptr pb;
   size_t ps;
 
@@ -1399,389 +930,27 @@ nvptx_free (void *p)
   return true;
 }
 
-
-static bool
-nvptx_host2dev (void *d, const void *h, size_t s)
+static void *
+nvptx_get_current_cuda_device (void)
 {
-  CUdeviceptr pb;
-  size_t ps;
   struct nvptx_thread *nvthd = nvptx_thread ();
 
-  if (!s)
-    return true;
-  if (!d)
-    {
-      GOMP_PLUGIN_error ("invalid device address");
-      return false;
-    }
-
-  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
-
-  if (!pb)
-    {
-      GOMP_PLUGIN_error ("invalid device address");
-      return false;
-    }
-  if (!h)
-    {
-      GOMP_PLUGIN_error ("invalid host address");
-      return false;
-    }
-  if (d == h)
-    {
-      GOMP_PLUGIN_error ("invalid host or device address");
-      return false;
-    }
-  if ((void *)(d + s) > (void *)(pb + ps))
-    {
-      GOMP_PLUGIN_error ("invalid size");
-      return false;
-    }
-
-#ifndef DISABLE_ASYNC
-  if (nvthd && nvthd->current_stream != nvthd->ptx_dev->null_stream)
-    {
-      CUevent *e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent));
-      CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-      event_gc (false);
-      CUDA_CALL (cuMemcpyHtoDAsync,
-		 (CUdeviceptr) d, h, s, nvthd->current_stream->stream);
-      CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream);
-      event_add (PTX_EVT_MEM, e, (void *)h, 0);
-    }
-  else
-#endif
-    CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) d, h, s);
+  if (!nvthd || !nvthd->ptx_dev)
+    return NULL;
 
-  return true;
+  return &nvthd->ptx_dev->dev;
 }
 
-static bool
-nvptx_dev2host (void *h, const void *d, size_t s)
+static void *
+nvptx_get_current_cuda_context (void)
 {
-  CUdeviceptr pb;
-  size_t ps;
   struct nvptx_thread *nvthd = nvptx_thread ();
 
-  if (!s)
-    return true;
-  if (!d)
-    {
-      GOMP_PLUGIN_error ("invalid device address");
-      return false;
-    }
+  if (!nvthd || !nvthd->ptx_dev)
+    return NULL;
 
-  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
-
-  if (!pb)
-    {
-      GOMP_PLUGIN_error ("invalid device address");
-      return false;
-    }
-  if (!h)
-    {
-      GOMP_PLUGIN_error ("invalid host address");
-      return false;
-    }
-  if (d == h)
-    {
-      GOMP_PLUGIN_error ("invalid host or device address");
-      return false;
-    }
-  if ((void *)(d + s) > (void *)(pb + ps))
-    {
-      GOMP_PLUGIN_error ("invalid size");
-      return false;
-    }
-
-#ifndef DISABLE_ASYNC
-  if (nvthd && nvthd->current_stream != nvthd->ptx_dev->null_stream)
-    {
-      CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
-      CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-      event_gc (false);
-      CUDA_CALL (cuMemcpyDtoHAsync,
-		 h, (CUdeviceptr) d, s, nvthd->current_stream->stream);
-      CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream);
-      event_add (PTX_EVT_MEM, e, (void *)h, 0);
-    }
-  else
-#endif
-    CUDA_CALL (cuMemcpyDtoH, h, (CUdeviceptr) d, s);
-
-  return true;
-}
-
-static void
-nvptx_set_async (int async)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-  nvthd->current_stream
-    = select_stream_for_async (async, pthread_self (), true, NULL);
-}
-
-static int
-nvptx_async_test (int async)
-{
-  CUresult r;
-  struct ptx_stream *s;
-
-  s = select_stream_for_async (async, pthread_self (), false, NULL);
-
-  if (!s)
-    GOMP_PLUGIN_fatal ("unknown async %d", async);
-
-  r = CUDA_CALL_NOCHECK (cuStreamQuery, s->stream);
-  if (r == CUDA_SUCCESS)
-    {
-      /* The oacc-parallel.c:goacc_wait function calls this hook to determine
-	 whether all work has completed on this stream, and if so omits the call
-	 to the wait hook.  If that happens, event_gc might not get called
-	 (which prevents variables from getting unmapped and their associated
-	 device storage freed), so call it here.  */
-      event_gc (true);
-      return 1;
-    }
-  else if (r == CUDA_ERROR_NOT_READY)
-    return 0;
-
-  GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
-
-  return 0;
-}
-
-static int
-nvptx_async_test_all (void)
-{
-  struct ptx_stream *s;
-  pthread_t self = pthread_self ();
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
-  for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next)
-    {
-      if ((s->multithreaded || pthread_equal (s->host_thread, self))
-	  && CUDA_CALL_NOCHECK (cuStreamQuery,
-				s->stream) == CUDA_ERROR_NOT_READY)
-	{
-	  pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-	  return 0;
-	}
-    }
-
-  pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
-  event_gc (true);
-
-  return 1;
-}
-
-static void
-nvptx_wait (int async)
-{
-  struct ptx_stream *s;
-
-  s = select_stream_for_async (async, pthread_self (), false, NULL);
-  if (!s)
-    GOMP_PLUGIN_fatal ("unknown async %d", async);
-
-  CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
-
-  event_gc (true);
-}
-
-static void
-nvptx_wait_async (int async1, int async2)
-{
-  CUevent *e;
-  struct ptx_stream *s1, *s2;
-  pthread_t self = pthread_self ();
-
-  /* The stream that is waiting (rather than being waited for) doesn't
-     necessarily have to exist already.  */
-  s2 = select_stream_for_async (async2, self, true, NULL);
-
-  s1 = select_stream_for_async (async1, self, false, NULL);
-  if (!s1)
-    GOMP_PLUGIN_fatal ("invalid async 1\n");
-
-  if (s1 == s2)
-    GOMP_PLUGIN_fatal ("identical parameters");
-
-  e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
-
-  CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-
-  event_gc (true);
-
-  CUDA_CALL_ASSERT (cuEventRecord, *e, s1->stream);
-
-  event_add (PTX_EVT_SYNC, e, NULL, 0);
-
-  CUDA_CALL_ASSERT (cuStreamWaitEvent, s2->stream, *e, 0);
-}
-
-static void
-nvptx_wait_all (void)
-{
-  CUresult r;
-  struct ptx_stream *s;
-  pthread_t self = pthread_self ();
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
-  /* Wait for active streams initiated by this thread (or by multiple threads)
-     to complete.  */
-  for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next)
-    {
-      if (s->multithreaded || pthread_equal (s->host_thread, self))
-	{
-	  r = CUDA_CALL_NOCHECK (cuStreamQuery, s->stream);
-	  if (r == CUDA_SUCCESS)
-	    continue;
-	  else if (r != CUDA_ERROR_NOT_READY)
-	    GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r));
-
-	  CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream);
-	}
-    }
-
-  pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
-  event_gc (true);
-}
-
-static void
-nvptx_wait_all_async (int async)
-{
-  struct ptx_stream *waiting_stream, *other_stream;
-  CUevent *e;
-  struct nvptx_thread *nvthd = nvptx_thread ();
-  pthread_t self = pthread_self ();
-
-  /* The stream doing the waiting.  This could be the first mention of the
-     stream, so create it if necessary.  */
-  waiting_stream
-    = select_stream_for_async (async, pthread_self (), true, NULL);
-
-  /* Launches on the null stream already block on other streams in the
-     context.  */
-  if (!waiting_stream || waiting_stream == nvthd->ptx_dev->null_stream)
-    return;
-
-  event_gc (true);
-
-  pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
-  for (other_stream = nvthd->ptx_dev->active_streams;
-       other_stream != NULL;
-       other_stream = other_stream->next)
-    {
-      if (!other_stream->multithreaded
-	  && !pthread_equal (other_stream->host_thread, self))
-	continue;
-
-      e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
-
-      CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-
-      /* Record an event on the waited-for stream.  */
-      CUDA_CALL_ASSERT (cuEventRecord, *e, other_stream->stream);
-
-      event_add (PTX_EVT_SYNC, e, NULL, 0);
-
-      CUDA_CALL_ASSERT (cuStreamWaitEvent, waiting_stream->stream, *e, 0);
-   }
-
-  pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-}
-
-static void *
-nvptx_get_current_cuda_device (void)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  if (!nvthd || !nvthd->ptx_dev)
-    return NULL;
-
-  return &nvthd->ptx_dev->dev;
-}
-
-static void *
-nvptx_get_current_cuda_context (void)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  if (!nvthd || !nvthd->ptx_dev)
-    return NULL;
-
-  return nvthd->ptx_dev->ctx;
-}
-
-static void *
-nvptx_get_cuda_stream (int async)
-{
-  struct ptx_stream *s;
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  if (!nvthd || !nvthd->ptx_dev)
-    return NULL;
-
-  s = select_stream_for_async (async, pthread_self (), false, NULL);
-
-  return s ? s->stream : NULL;
-}
-
-static int
-nvptx_set_cuda_stream (int async, void *stream)
-{
-  struct ptx_stream *oldstream;
-  pthread_t self = pthread_self ();
-  struct nvptx_thread *nvthd = nvptx_thread ();
-
-  if (async < 0)
-    GOMP_PLUGIN_fatal ("bad async %d", async);
-
-  pthread_mutex_lock (&nvthd->ptx_dev->stream_lock);
-
-  /* We have a list of active streams and an array mapping async values to
-     entries of that list.  We need to take "ownership" of the passed-in stream,
-     and add it to our list, removing the previous entry also (if there was one)
-     in order to prevent resource leaks.  Note the potential for surprise
-     here: maybe we should keep track of passed-in streams and leave it up to
-     the user to tidy those up, but that doesn't work for stream handles
-     returned from acc_get_cuda_stream above...  */
-
-  oldstream = select_stream_for_async (async, self, false, NULL);
-
-  if (oldstream)
-    {
-      if (nvthd->ptx_dev->active_streams == oldstream)
-	nvthd->ptx_dev->active_streams = nvthd->ptx_dev->active_streams->next;
-      else
-	{
-	  struct ptx_stream *s = nvthd->ptx_dev->active_streams;
-	  while (s->next != oldstream)
-	    s = s->next;
-	  s->next = s->next->next;
-	}
-
-      CUDA_CALL_ASSERT (cuStreamDestroy, oldstream->stream);
-
-      if (!map_fini (oldstream))
-	GOMP_PLUGIN_fatal ("error when freeing host memory");
-
-      free (oldstream);
-    }
-
-  pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock);
-
-  (void) select_stream_for_async (async, self, true, (CUstream) stream);
-
-  return 1;
-}
+  return nvthd->ptx_dev->ctx;
+}
 
 /* Plugin entry points.  */
 
@@ -2016,100 +1185,116 @@ GOMP_OFFLOAD_alloc (int ord, size_t size)
 {
   if (!nvptx_attach_host_thread_to_device (ord))
     return NULL;
-  return nvptx_alloc (size);
-}
 
-bool
-GOMP_OFFLOAD_free (int ord, void *ptr)
-{
-  return (nvptx_attach_host_thread_to_device (ord)
-	  && nvptx_free (ptr));
-}
+  struct ptx_device *ptx_dev = ptx_devices[ord];
+  struct ptx_free_block *blocks, *tmp;
 
-bool
-GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
-{
-  return (nvptx_attach_host_thread_to_device (ord)
-	  && nvptx_dev2host (dst, src, n));
-}
+  pthread_mutex_lock (&ptx_dev->free_blocks_lock);
+  blocks = ptx_dev->free_blocks;
+  ptx_dev->free_blocks = NULL;
+  pthread_mutex_unlock (&ptx_dev->free_blocks_lock);
 
-bool
-GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
-{
-  return (nvptx_attach_host_thread_to_device (ord)
-	  && nvptx_host2dev (dst, src, n));
+  while (blocks)
+    {
+      tmp = blocks->next;
+      nvptx_free (blocks->ptr, ptx_dev);
+      free (blocks);
+      blocks = tmp;
+    }
+
+  return nvptx_alloc (size);
 }
 
 bool
-GOMP_OFFLOAD_dev2dev (int ord, void *dst, const void *src, size_t n)
+GOMP_OFFLOAD_free (int ord, void *ptr)
 {
-  struct ptx_device *ptx_dev = ptx_devices[ord];
-  CUDA_CALL (cuMemcpyDtoDAsync, (CUdeviceptr) dst, (CUdeviceptr) src, n,
-				ptx_dev->null_stream->stream);
-  return true;
+  return (nvptx_attach_host_thread_to_device (ord)
+	  && nvptx_free (ptr, ptx_devices[ord]));
 }
 
-void (*device_run) (int n, void *fn_ptr, void *vars) = NULL;
-
 void
 GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
 			   void **hostaddrs, void **devaddrs,
-			   int async, unsigned *dims, void *targ_mem_desc)
+			   unsigned *dims, void *targ_mem_desc)
 {
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, async, dims, targ_mem_desc);
-}
+  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
-void
-GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc, int async)
-{
-  struct nvptx_thread *nvthd = nvptx_thread ();
-  CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent));
+  void **hp = NULL;
+  CUdeviceptr dp = 0;
 
-  CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING);
-  CUDA_CALL_ASSERT (cuEventRecord, *e, nvthd->current_stream->stream);
-  event_add (PTX_EVT_ASYNC_CLEANUP, e, targ_mem_desc, async);
-}
+  if (mapnum > 0)
+    {
+      hp = alloca (mapnum * sizeof (void *));
+      for (int i = 0; i < mapnum; i++)
+	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+      CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+    }
 
-int
-GOMP_OFFLOAD_openacc_async_test (int async)
-{
-  return nvptx_async_test (async);
-}
+  /* Copy the (device) pointers to arguments to the device (dp and hp might in
+     fact have the same value on a unified-memory system).  */
+  if (mapnum > 0)
+    CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
+		      mapnum * sizeof (void *));
 
-int
-GOMP_OFFLOAD_openacc_async_test_all (void)
-{
-  return nvptx_async_test_all ();
-}
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+	      dp, NULL);
 
-void
-GOMP_OFFLOAD_openacc_async_wait (int async)
-{
-  nvptx_wait (async);
+  CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
+  const char *maybe_abort_msg = "(perhaps abort was called)";
+  if (r == CUDA_ERROR_LAUNCH_FAILED)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
+		       maybe_abort_msg);
+  else if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
+  CUDA_CALL_ASSERT (cuMemFree, dp);
 }
 
-void
-GOMP_OFFLOAD_openacc_async_wait_async (int async1, int async2)
+static void
+cuda_free_argmem (void *ptr)
 {
-  nvptx_wait_async (async1, async2);
+  void **block = (void **) ptr;
+  nvptx_free (block[0], (struct ptx_device *) block[1]);
+  free (block);
 }
 
 void
-GOMP_OFFLOAD_openacc_async_wait_all (void)
+GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
+				 void **hostaddrs, void **devaddrs,
+				 unsigned *dims, void *targ_mem_desc,
+				 struct goacc_asyncqueue *aq)
 {
-  nvptx_wait_all ();
-}
+  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
-void
-GOMP_OFFLOAD_openacc_async_wait_all_async (int async)
-{
-  nvptx_wait_all_async (async);
-}
+  void **hp = NULL;
+  CUdeviceptr dp = 0;
+  void **block = NULL;
 
-void
-GOMP_OFFLOAD_openacc_async_set_async (int async)
-{
-  nvptx_set_async (async);
+  if (mapnum > 0)
+    {
+      block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *));
+      hp = block + 2;
+      for (int i = 0; i < mapnum; i++)
+	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
+      CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *));
+    }
+
+  /* Copy the (device) pointers to arguments to the device (dp and hp might in
+     fact have the same value on a unified-memory system).  */
+  if (mapnum > 0)
+    {
+      CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
+			mapnum * sizeof (void *), aq->cuda_stream);
+      block[0] = (void *) dp;
+
+      struct nvptx_thread *nvthd =
+	(struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
+      block[1] = (void *) nvthd->ptx_dev;
+    }
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
+	      dp, aq->cuda_stream);
+
+  if (mapnum > 0)
+    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
 }
 
 void *
@@ -2131,7 +1316,6 @@ GOMP_OFFLOAD_openacc_create_thread_data (int ord)
   if (!thd_ctx)
     CUDA_CALL_ASSERT (cuCtxPushCurrent, ptx_dev->ctx);
 
-  nvthd->current_stream = ptx_dev->null_stream;
   nvthd->ptx_dev = ptx_dev;
 
   return (void *) nvthd;
@@ -2156,19 +1340,185 @@ GOMP_OFFLOAD_openacc_cuda_get_current_context (void)
 }
 
 /* NOTE: This returns a CUstream, not a ptx_stream pointer.  */
-
 void *
-GOMP_OFFLOAD_openacc_cuda_get_stream (int async)
+GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *aq)
 {
-  return nvptx_get_cuda_stream (async);
+  return (void *) aq->cuda_stream;
 }
 
 /* NOTE: This takes a CUstream, not a ptx_stream pointer.  */
+int
+GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream)
+{
+  if (aq->cuda_stream)
+    {
+      CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
+      CUDA_CALL_ASSERT (cuStreamDestroy, aq->cuda_stream);
+    }
+
+  aq->cuda_stream = (CUstream) stream;
+  return 1;
+}
+
+struct goacc_asyncqueue *
+GOMP_OFFLOAD_openacc_async_construct (void)
+{
+  struct goacc_asyncqueue *aq
+    = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue));
+  aq->cuda_stream = NULL;
+  CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT);
+  if (aq->cuda_stream == NULL)
+    GOMP_PLUGIN_fatal ("CUDA stream create NULL\n");
+
+  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
+
+
+  return aq;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq)
+{
+  CUDA_CALL_ERET (false, cuStreamDestroy, aq->cuda_stream);
+  free (aq);
+  return true;
+}
 
 int
-GOMP_OFFLOAD_openacc_cuda_set_stream (int async, void *stream)
+GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq)
 {
-  return nvptx_set_cuda_stream (async, stream);
+  CUresult r = CUDA_CALL_NOCHECK (cuStreamQuery, aq->cuda_stream);
+  if (r == CUDA_SUCCESS)
+    return 1;
+  if (r == CUDA_ERROR_NOT_READY)
+    return 0;
+
+  GOMP_PLUGIN_error ("cuStreamQuery error: %s", cuda_error (r));
+  return -1;
+}
+
+void
+GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq)
+{
+  CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1,
+				      struct goacc_asyncqueue *aq2)
+{
+  CUevent e;
+  CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING);
+  CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream);
+  CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0);
+}
+
+static void
+cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr)
+{
+  if (res != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
+  struct nvptx_callback *cb = (struct nvptx_callback *) ptr;
+  cb->fn (cb->ptr);
+  free (ptr);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *aq,
+					   void (*callback_fn)(void *),
+					   void *userptr)
+{
+  struct nvptx_callback *b = GOMP_PLUGIN_malloc (sizeof (*b));
+  b->fn = callback_fn;
+  b->ptr = userptr;
+  b->aq = aq;
+  CUDA_CALL_ASSERT (cuStreamAddCallback, aq->cuda_stream,
+		    cuda_callback_wrapper, (void *) b, 0);
+}
+
+static bool
+cuda_memcpy_sanity_check (const void *h, const void *d, size_t s)
+{
+  CUdeviceptr pb;
+  size_t ps;
+  if (!s)
+    return true;
+  if (!d)
+    {
+      GOMP_PLUGIN_error ("invalid device address");
+      return false;
+    }
+  CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d);
+  if (!pb)
+    {
+      GOMP_PLUGIN_error ("invalid device address");
+      return false;
+    }
+  if (!h)
+    {
+      GOMP_PLUGIN_error ("invalid host address");
+      return false;
+    }
+  if (d == h)
+    {
+      GOMP_PLUGIN_error ("invalid host or device address");
+      return false;
+    }
+  if ((void *)(d + s) > (void *)(pb + ps))
+    {
+      GOMP_PLUGIN_error ("invalid size");
+      return false;
+    }
+  return true;
+}
+
+bool
+GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n)
+{
+  if (!nvptx_attach_host_thread_to_device (ord)
+      || !cuda_memcpy_sanity_check (src, dst, n))
+    return false;
+  CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) dst, src, n);
+  return true;
+}
+
+bool
+GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n)
+{
+  if (!nvptx_attach_host_thread_to_device (ord)
+      || !cuda_memcpy_sanity_check (dst, src, n))
+    return false;
+  CUDA_CALL (cuMemcpyDtoH, dst, (CUdeviceptr) src, n);
+  return true;
+}
+
+bool
+GOMP_OFFLOAD_dev2dev (int ord, void *dst, const void *src, size_t n)
+{
+  CUDA_CALL (cuMemcpyDtoDAsync, (CUdeviceptr) dst, (CUdeviceptr) src, n, NULL);
+  return true;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_host2dev (int ord, void *dst, const void *src,
+				     size_t n, struct goacc_asyncqueue *aq)
+{
+  if (!nvptx_attach_host_thread_to_device (ord)
+      || !cuda_memcpy_sanity_check (src, dst, n))
+    return false;
+  CUDA_CALL (cuMemcpyHtoDAsync, (CUdeviceptr) dst, src, n, aq->cuda_stream);
+  return true;
+}
+
+bool
+GOMP_OFFLOAD_openacc_async_dev2host (int ord, void *dst, const void *src,
+				     size_t n, struct goacc_asyncqueue *aq)
+{
+  if (!nvptx_attach_host_thread_to_device (ord)
+      || !cuda_memcpy_sanity_check (dst, src, n))
+    return false;
+  CUDA_CALL (cuMemcpyDtoHAsync, dst, (CUdeviceptr) src, n, aq->cuda_stream);
+  return true;
 }
 
 /* Adjust launch dimensions: pick good values for number of blocks and warps
@@ -2269,8 +1619,7 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
     CU_LAUNCH_PARAM_END
   };
   r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1,
-			 32, threads, 1, 0, ptx_dev->null_stream->stream,
-			 NULL, config);
+			 32, threads, 1, 0, NULL, NULL, config);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));