From patchwork Tue Sep 25 13:10:21 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chung-Lin Tang X-Patchwork-Id: 974403 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=gcc-patches-return-486323-incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=mentor.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b="LqMGlryO"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 42KLzP5tclz9s2P for ; Tue, 25 Sep 2018 23:10:40 +1000 (AEST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; q=dns; s=default; b=OcHw1Dll5/P2l88XdJh3Vrrx+4hrU Wya8pByf4ZiVmIMmCMhCkSzcDmF7eIgD61U0cj7+1CR3sdL0IPpOZYq3EOoxvsGk 5/nafUENPGiNBojvWFhAJfQvj7ifAQ1VplFgv2R64WU2GrYQCm30P2g7gVjvk2yh xjkQ2V+uJBYfYw= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; s=default; bh=G1a4za6bPTRRgOKKlZHz4EhQcKQ=; b=LqM GlryOF5Ka3/Bwkl7nmPllInJh1WpkbxhuamQIWPL5ioI9E6aWNXWy84LWHSpjJHc 3T5Ml7hXg06B6Z2Jzc3V/gT8/91znJ+PQucDKWhEmWY5Nr7WxCtc+rfHjNjriXGu 9qldoeOiqfuMYimnwF/6Wvgtzd1J2kCl51OMIdPs= Received: (qmail 116939 invoked by alias); 25 Sep 2018 13:10:33 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 116924 invoked by uid 89); 25 Sep 2018 13:10:32 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-24.7 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, RCVD_IN_DNSWL_NONE, SPF_PASS autolearn=ham version=3.3.2 spammy=1997, 199, 7 X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 25 Sep 2018 13:10:30 +0000 Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202]) by relay1.mentorg.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-SHA384:256) id 1g4n6i-0003tn-SE from ChungLin_Tang@mentor.com ; Tue, 25 Sep 2018 06:10:28 -0700 Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com (147.34.90.202) with Microsoft SMTP Server (TLS) id 15.0.1320.4; Tue, 25 Sep 2018 06:10:25 -0700 Reply-To: From: Chung-Lin Tang Subject: [PATCH 1/6, OpenACC, libgomp] Async re-work, interfaces To: , Jakub Jelinek , Thomas Schwinge Message-ID: Date: Tue, 25 Sep 2018 21:10:21 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 This patch separates out the header interface changes. GOMP_VERSION has been bumped, and various changes to the plugin interface, and a few libgomp internal functions declared. The libgomp linkmap updated as well. Thanks, Chung-Lin include/ * gomp-constants.h (GOMP_ASYNC_DEFAULT): Define. (GOMP_VERSION): Increment for new plugin interface changes. libgomp/ * libgomp-plugin.h (struct goacc_asyncqueue): Declare. (struct goacc_asyncqueue_list): Likewise. (goacc_aq): Likewise. (goacc_aq_list): Likewise. (GOMP_OFFLOAD_openacc_register_async_cleanup): Remove. (GOMP_OFFLOAD_openacc_async_test): Remove. (GOMP_OFFLOAD_openacc_async_test_all): Remove. (GOMP_OFFLOAD_openacc_async_wait): Remove. (GOMP_OFFLOAD_openacc_async_wait_async): Remove. (GOMP_OFFLOAD_openacc_async_wait_all): Remove. (GOMP_OFFLOAD_openacc_async_wait_all_async): Remove. (GOMP_OFFLOAD_openacc_async_set_async): Remove. (GOMP_OFFLOAD_openacc_exec): Adjust declaration. (GOMP_OFFLOAD_openacc_cuda_get_stream): Likewise. (GOMP_OFFLOAD_openacc_cuda_set_stream): Likewise. (GOMP_OFFLOAD_openacc_async_exec): Declare. (GOMP_OFFLOAD_openacc_async_construct): Declare. (GOMP_OFFLOAD_openacc_async_destruct): Declare. (GOMP_OFFLOAD_openacc_async_test): Declare. (GOMP_OFFLOAD_openacc_async_synchronize): Declare. (GOMP_OFFLOAD_openacc_async_serialize): Declare. (GOMP_OFFLOAD_openacc_async_queue_callback): Declare. (GOMP_OFFLOAD_openacc_async_host2dev): Declare. (GOMP_OFFLOAD_openacc_async_dev2host): Declare. * libgomp.h (struct acc_dispatch_t): Define 'async' sub-struct. (gomp_acc_insert_pointer): Adjust declaration. (gomp_copy_host2dev): New declaration. (gomp_copy_dev2host): Likewise. (gomp_map_vars_async): Likewise. (gomp_unmap_tgt): Likewise. (gomp_unmap_vars_async): Likewise. (gomp_fini_device): Likewise. * libgomp.map (OACC_2.5): Add acc_get_default_async, acc_get_default_async_h_, acc_set_default_async, and acc_set_default_async_h_. (GOMP_PLUGIN_1.0): Remove GOMP_PLUGIN_async_unmap_vars. diff --git a/include/gomp-constants.h b/include/gomp-constants.h index f1c53c5..697080c 100644 --- a/include/gomp-constants.h +++ b/include/gomp-constants.h @@ -160,6 +160,7 @@ enum gomp_map_kind /* Asynchronous behavior. Keep in sync with libgomp/{openacc.h,openacc.f90,openacc_lib.h}:acc_async_t. */ +#define GOMP_ASYNC_DEFAULT 0 #define GOMP_ASYNC_NOVAL -1 #define GOMP_ASYNC_SYNC -2 @@ -199,7 +200,7 @@ enum gomp_map_kind /* Versions of libgomp and device-specific plugins. GOMP_VERSION should be incremented whenever an ABI-incompatible change is introduced to the plugin interface defined in libgomp/libgomp.h. */ -#define GOMP_VERSION 1 +#define GOMP_VERSION 2 #define GOMP_VERSION_NVIDIA_PTX 1 #define GOMP_VERSION_INTEL_MIC 0 #define GOMP_VERSION_HSA 0 diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h index 2fc35d56..667ba19 100644 --- a/libgomp/libgomp-plugin.h +++ b/libgomp/libgomp-plugin.h @@ -53,6 +53,20 @@ enum offload_target_type OFFLOAD_TARGET_TYPE_HSA = 7 }; +/* Opaque type to represent plugin-dependent implementation of an + OpenACC asynchronous queue. */ +struct goacc_asyncqueue; + +/* Used to keep a list of active asynchronous queues. */ +struct goacc_asyncqueue_list +{ + struct goacc_asyncqueue *aq; + struct goacc_asyncqueue_list *next; +}; + +typedef struct goacc_asyncqueue *goacc_aq; +typedef struct goacc_asyncqueue_list *goacc_aq_list; + /* Auxiliary struct, used for transferring pairs of addresses from plugin to libgomp. */ struct addr_pair @@ -93,22 +107,31 @@ extern bool GOMP_OFFLOAD_dev2dev (int, void *, const void *, size_t); extern bool GOMP_OFFLOAD_can_run (void *); extern void GOMP_OFFLOAD_run (int, void *, void *, void **); extern void GOMP_OFFLOAD_async_run (int, void *, void *, void **, void *); + extern void GOMP_OFFLOAD_openacc_exec (void (*) (void *), size_t, void **, - void **, int, unsigned *, void *); -extern void GOMP_OFFLOAD_openacc_register_async_cleanup (void *, int); -extern int GOMP_OFFLOAD_openacc_async_test (int); -extern int GOMP_OFFLOAD_openacc_async_test_all (void); -extern void GOMP_OFFLOAD_openacc_async_wait (int); -extern void GOMP_OFFLOAD_openacc_async_wait_async (int, int); -extern void GOMP_OFFLOAD_openacc_async_wait_all (void); -extern void GOMP_OFFLOAD_openacc_async_wait_all_async (int); -extern void GOMP_OFFLOAD_openacc_async_set_async (int); + void **, unsigned *, void *); +extern void GOMP_OFFLOAD_openacc_async_exec (void (*) (void *), size_t, void **, + void **, unsigned *, void *, + struct goacc_asyncqueue *); +extern struct goacc_asyncqueue *GOMP_OFFLOAD_openacc_async_construct (void); +extern bool GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *); +extern int GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *); +extern void GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *); +extern void GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *, + struct goacc_asyncqueue *); +extern void GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *, + void (*)(void *), void *); +extern bool GOMP_OFFLOAD_openacc_async_host2dev (int, void *, const void *, size_t, + struct goacc_asyncqueue *); +extern bool GOMP_OFFLOAD_openacc_async_dev2host (int, void *, const void *, size_t, + struct goacc_asyncqueue *); extern void *GOMP_OFFLOAD_openacc_create_thread_data (int); extern void GOMP_OFFLOAD_openacc_destroy_thread_data (void *); extern void *GOMP_OFFLOAD_openacc_cuda_get_current_device (void); extern void *GOMP_OFFLOAD_openacc_cuda_get_current_context (void); -extern void *GOMP_OFFLOAD_openacc_cuda_get_stream (int); -extern int GOMP_OFFLOAD_openacc_cuda_set_stream (int, void *); +extern void *GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *); +extern int GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *, + void *); #ifdef __cplusplus } diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h index 3a8cc2b..a69faa7 100644 --- a/libgomp/libgomp.h +++ b/libgomp/libgomp.h @@ -888,19 +888,23 @@ typedef struct acc_dispatch_t /* Execute. */ __typeof (GOMP_OFFLOAD_openacc_exec) *exec_func; - /* Async cleanup callback registration. */ - __typeof (GOMP_OFFLOAD_openacc_register_async_cleanup) - *register_async_cleanup_func; - - /* Asynchronous routines. */ - __typeof (GOMP_OFFLOAD_openacc_async_test) *async_test_func; - __typeof (GOMP_OFFLOAD_openacc_async_test_all) *async_test_all_func; - __typeof (GOMP_OFFLOAD_openacc_async_wait) *async_wait_func; - __typeof (GOMP_OFFLOAD_openacc_async_wait_async) *async_wait_async_func; - __typeof (GOMP_OFFLOAD_openacc_async_wait_all) *async_wait_all_func; - __typeof (GOMP_OFFLOAD_openacc_async_wait_all_async) - *async_wait_all_async_func; - __typeof (GOMP_OFFLOAD_openacc_async_set_async) *async_set_async_func; + struct { + gomp_mutex_t lock; + int nasyncqueue; + struct goacc_asyncqueue **asyncqueue; + struct goacc_asyncqueue_list *active; + + __typeof (GOMP_OFFLOAD_openacc_async_construct) *construct_func; + __typeof (GOMP_OFFLOAD_openacc_async_destruct) *destruct_func; + __typeof (GOMP_OFFLOAD_openacc_async_test) *test_func; + __typeof (GOMP_OFFLOAD_openacc_async_synchronize) *synchronize_func; + __typeof (GOMP_OFFLOAD_openacc_async_serialize) *serialize_func; + __typeof (GOMP_OFFLOAD_openacc_async_queue_callback) *queue_callback_func; + + __typeof (GOMP_OFFLOAD_openacc_async_exec) *exec_func; + __typeof (GOMP_OFFLOAD_openacc_async_host2dev) *host2dev_func; + __typeof (GOMP_OFFLOAD_openacc_async_dev2host) *dev2host_func; + } async; /* Create/destroy TLS data. */ __typeof (GOMP_OFFLOAD_openacc_create_thread_data) *create_thread_data_func; @@ -992,17 +996,33 @@ enum gomp_map_vars_kind GOMP_MAP_VARS_ENTER_DATA }; -extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *); +extern void gomp_acc_insert_pointer (size_t, void **, size_t *, void *, int); extern void gomp_acc_remove_pointer (void *, size_t, bool, int, int, int); extern void gomp_acc_declare_allocate (bool, size_t, void **, size_t *, unsigned short *); +struct gomp_coalesce_buf; +extern void gomp_copy_host2dev (struct gomp_device_descr *, + struct goacc_asyncqueue *, void *, const void *, + size_t, struct gomp_coalesce_buf *); +extern void gomp_copy_dev2host (struct gomp_device_descr *, + struct goacc_asyncqueue *, void *, const void *, + size_t); extern struct target_mem_desc *gomp_map_vars (struct gomp_device_descr *, size_t, void **, void **, size_t *, void *, bool, enum gomp_map_vars_kind); +extern struct target_mem_desc *gomp_map_vars_async (struct gomp_device_descr *, + struct goacc_asyncqueue *, + size_t, void **, void **, + size_t *, void *, bool, + enum gomp_map_vars_kind); +extern void gomp_unmap_tgt (struct target_mem_desc *); extern void gomp_unmap_vars (struct target_mem_desc *, bool); +extern void gomp_unmap_vars_async (struct target_mem_desc *, bool, + struct goacc_asyncqueue *); extern void gomp_init_device (struct gomp_device_descr *); +extern bool gomp_fini_device (struct gomp_device_descr *); extern void gomp_free_memmap (struct splay_tree_s *); extern void gomp_unload_device (struct gomp_device_descr *); extern bool gomp_remove_var (struct gomp_device_descr *, splay_tree_key); diff --git a/libgomp/libgomp.map b/libgomp/libgomp.map index e3f0c64..dd97728 100644 --- a/libgomp/libgomp.map +++ b/libgomp/libgomp.map @@ -420,8 +420,12 @@ OACC_2.5 { acc_delete_finalize_async_32_h_; acc_delete_finalize_async_64_h_; acc_delete_finalize_async_array_h_; + acc_get_default_async; + acc_get_default_async_h_; acc_memcpy_from_device_async; acc_memcpy_to_device_async; + acc_set_default_async; + acc_set_default_async_h_; acc_update_device_async; acc_update_device_async_32_h_; acc_update_device_async_64_h_; @@ -458,7 +462,6 @@ GOMP_PLUGIN_1.0 { GOMP_PLUGIN_debug; GOMP_PLUGIN_error; GOMP_PLUGIN_fatal; - GOMP_PLUGIN_async_unmap_vars; GOMP_PLUGIN_acc_thread; }; From patchwork Tue Sep 25 13:10:47 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chung-Lin Tang X-Patchwork-Id: 974404 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=gcc-patches-return-486324-incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=mentor.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b="kFQ2jb/q"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 42KLzw19W5z9s2P for ; Tue, 25 Sep 2018 23:11:07 +1000 (AEST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; q=dns; s=default; b=LJQXcSp945RKu1UCM70EdmlDO72S1 pm5y7dD3vJYd39FQnCkOBbr81o5MCAzDMxIOk6K7FOi0WotDzbEkWP3M507F4Wfr aKPoq+cTIdh7fD0hcqbCw8KfQQKctuwRJ6h8RXiorUfpu4ToWvfI7/WOQ8D4iHoY adhZNX4q7Uy3JE= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; s=default; bh=zsq19aQL3r4rQCOaZeYKngFAW1w=; b=kFQ 2jb/q3D8tXZZuSOWGlv4MflkXKqlm1eZW4sitn/WEnJVObhvzgR/8fS7apeMjOw3 d2IBQ8QhGcrY/AcKHhDAcPOYaplCBzBZ7AUgapMHNQ7RIc3FPeirdTxarL0zhlOf PkeFicVVZReeXF4zbwKl5bAotz76xg6pOGDIEZ6M= Received: (qmail 118542 invoked by alias); 25 Sep 2018 13:11:00 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 118516 invoked by uid 89); 25 Sep 2018 13:10:59 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-24.8 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_PASS autolearn=ham version=3.3.2 spammy=UD:lock, 3017, qid, ord X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 25 Sep 2018 13:10:56 +0000 Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202]) by relay1.mentorg.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-SHA384:256) id 1g4n78-000444-I3 from ChungLin_Tang@mentor.com for gcc-patches@gcc.gnu.org; Tue, 25 Sep 2018 06:10:54 -0700 Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com (147.34.90.202) with Microsoft SMTP Server (TLS) id 15.0.1320.4; Tue, 25 Sep 2018 06:10:51 -0700 Reply-To: From: Chung-Lin Tang Subject: [PATCH 2/6, OpenACC, libgomp] Async re-work, oacc-* parts To: , Thomas Schwinge Message-ID: <12319572-dd02-c946-f2b9-9d047be9c707@mentor.com> Date: Tue, 25 Sep 2018 21:10:47 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 Hi Thomas, These are the OpenACC specific changes, mostly the re-implementation of async-related acc_* runtime library API functions to use the new backend plugin interfaces, in a non-target specific way. Thanks, Chung-Lin * oacc-async.c (get_goacc_thread): New function. (get_goacc_thread_device): New function. (lookup_goacc_asyncqueue): New function. (get_goacc_asyncqueue): New function. (acc_async_test): Adjust code to use new async design. (acc_async_test_all): Likewise. (acc_wait): Likewise. (acc_wait_async): Likewise. (acc_wait_all): Likewise. (acc_wait_all_async): Likewise. (acc_get_default_async): New API function. (acc_set_default_async): Likewise. (goacc_async_unmap_tgt): New function. (goacc_async_copyout_unmap_vars): Likewise. (goacc_async_free): Likewise. (goacc_init_asyncqueues): Likewise. (goacc_fini_asyncqueues): Likewise. * oacc-cuda.c (acc_get_cuda_stream): Adjust code to use new async design. (acc_set_cuda_stream): Likewise. * oacc-host.c (host_openacc_exec): Adjust parameters, remove 'async'. (host_openacc_register_async_cleanup): Remove. (host_openacc_async_exec): New function. (host_openacc_async_test): Adjust parameters. (host_openacc_async_test_all): Remove. (host_openacc_async_wait): Remove. (host_openacc_async_wait_async): Remove. (host_openacc_async_wait_all): Remove. (host_openacc_async_wait_all_async): Remove. (host_openacc_async_set_async): Remove. (host_openacc_async_synchronize): New function. (host_openacc_async_serialize): New function. (host_openacc_async_host2dev): New function. (host_openacc_async_dev2host): New function. (host_openacc_async_queue_callback): New function. (host_openacc_async_construct): New function. (host_openacc_async_destruct): New function. (struct gomp_device_descr host_dispatch): Remove initialization of old interface, add intialization of new async sub-struct. * oacc-init.c (acc_shutdown_1): Adjust to use gomp_fini_device. (goacc_attach_host_thread_to_device): Remove old async code usage, add initialization of per-thread default_async. * oacc-int.h (struct goacc_thread): Add default_async field. (goacc_init_asyncqueues): New declaration. (goacc_fini_asyncqueues): Likewise. (goacc_async_copyout_unmap_vars): Likewise. (goacc_async_free): Likewise. (get_goacc_asyncqueue): Likewise. (lookup_goacc_asyncqueue): Likewise. * oacc-mem.c (memcpy_tofrom_device): Adjust code to use new async design. (present_create_copy): Likewise. (delete_copyout): Likewise. (update_dev_host): Likewise. (gomp_acc_insert_pointer): Add async parameter, adjust code to use new async design. (gomp_acc_remove_pointer): Adjust code to use new async design. * oacc-parallel.c (GOACC_parallel_keyed): Likewise. (GOACC_enter_exit_data): Likewise. (goacc_wait): Likewise. (GOACC_update): Likewise. * oacc-plugin.c (GOMP_PLUGIN_async_unmap_vars): Remove. diff --git a/libgomp/oacc-async.c b/libgomp/oacc-async.c index a4e1863..68aaf19 100644 --- a/libgomp/oacc-async.c +++ b/libgomp/oacc-async.c @@ -27,10 +27,87 @@ . */ #include +#include #include "openacc.h" #include "libgomp.h" #include "oacc-int.h" +static struct goacc_thread * +get_goacc_thread (void) +{ + struct goacc_thread *thr = goacc_thread (); + + if (!thr || !thr->dev) + gomp_fatal ("no device active"); + + return thr; +} + +static struct gomp_device_descr * +get_goacc_thread_device (void) +{ + struct goacc_thread *thr = goacc_thread (); + + if (!thr || !thr->dev) + gomp_fatal ("no device active"); + + return thr->dev; +} + +attribute_hidden struct goacc_asyncqueue * +lookup_goacc_asyncqueue (struct goacc_thread *thr, bool create, int async) +{ + /* The special value acc_async_noval (-1) maps to the thread-specific + default async stream. */ + if (async == acc_async_noval) + async = thr->default_async; + + if (async == acc_async_sync) + return NULL; + + if (async < 0) + gomp_fatal ("bad async %d", async); + + struct gomp_device_descr *dev = thr->dev; + + if (!create + && (async >= dev->openacc.async.nasyncqueue + || !dev->openacc.async.asyncqueue[async])) + return NULL; + + gomp_mutex_lock (&dev->openacc.async.lock); + if (async >= dev->openacc.async.nasyncqueue) + { + int diff = async + 1 - dev->openacc.async.nasyncqueue; + dev->openacc.async.asyncqueue + = gomp_realloc (dev->openacc.async.asyncqueue, + sizeof (goacc_aq) * (async + 1)); + memset (dev->openacc.async.asyncqueue + dev->openacc.async.nasyncqueue, + 0, sizeof (goacc_aq) * diff); + dev->openacc.async.nasyncqueue = async + 1; + } + + if (!dev->openacc.async.asyncqueue[async]) + { + dev->openacc.async.asyncqueue[async] = dev->openacc.async.construct_func (); + + /* Link new async queue into active list. */ + goacc_aq_list n = gomp_malloc (sizeof (struct goacc_asyncqueue_list)); + n->aq = dev->openacc.async.asyncqueue[async]; + n->next = dev->openacc.async.active; + dev->openacc.async.active = n; + } + gomp_mutex_unlock (&dev->openacc.async.lock); + return dev->openacc.async.asyncqueue[async]; +} + +attribute_hidden struct goacc_asyncqueue * +get_goacc_asyncqueue (int async) +{ + struct goacc_thread *thr = get_goacc_thread (); + return lookup_goacc_asyncqueue (thr, true, async); +} + int acc_async_test (int async) { @@ -42,18 +119,25 @@ acc_async_test (int async) if (!thr || !thr->dev) gomp_fatal ("no device active"); - return thr->dev->openacc.async_test_func (async); + goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async); + return thr->dev->openacc.async.test_func (aq); } int acc_async_test_all (void) { - struct goacc_thread *thr = goacc_thread (); - - if (!thr || !thr->dev) - gomp_fatal ("no device active"); + struct goacc_thread *thr = get_goacc_thread (); - return thr->dev->openacc.async_test_all_func (); + int ret = 1; + gomp_mutex_lock (&thr->dev->openacc.async.lock); + for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next) + if (!thr->dev->openacc.async.test_func (l->aq)) + { + ret = 0; + break; + } + gomp_mutex_unlock (&thr->dev->openacc.async.lock); + return ret; } void @@ -62,12 +146,10 @@ acc_wait (int async) if (!async_valid_p (async)) gomp_fatal ("invalid async argument: %d", async); - struct goacc_thread *thr = goacc_thread (); - - if (!thr || !thr->dev) - gomp_fatal ("no device active"); + struct goacc_thread *thr = get_goacc_thread (); - thr->dev->openacc.async_wait_func (async); + goacc_aq aq = lookup_goacc_asyncqueue (thr, true, async); + thr->dev->openacc.async.synchronize_func (aq); } /* acc_async_wait is an OpenACC 1.0 compatibility name for acc_wait. */ @@ -84,23 +166,28 @@ acc_async_wait (int async) void acc_wait_async (int async1, int async2) { - struct goacc_thread *thr = goacc_thread (); + struct goacc_thread *thr = get_goacc_thread (); - if (!thr || !thr->dev) - gomp_fatal ("no device active"); + goacc_aq aq2 = lookup_goacc_asyncqueue (thr, true, async2); + goacc_aq aq1 = lookup_goacc_asyncqueue (thr, false, async1); + if (!aq1) + gomp_fatal ("invalid async 1"); + if (aq1 == aq2) + gomp_fatal ("identical parameters"); - thr->dev->openacc.async_wait_async_func (async1, async2); + thr->dev->openacc.async.synchronize_func (aq1); + thr->dev->openacc.async.serialize_func (aq1, aq2); } void acc_wait_all (void) { - struct goacc_thread *thr = goacc_thread (); - - if (!thr || !thr->dev) - gomp_fatal ("no device active"); + struct gomp_device_descr *dev = get_goacc_thread_device (); - thr->dev->openacc.async_wait_all_func (); + gomp_mutex_lock (&dev->openacc.async.lock); + for (goacc_aq_list l = dev->openacc.async.active; l; l = l->next) + dev->openacc.async.synchronize_func (l->aq); + gomp_mutex_unlock (&dev->openacc.async.lock); } /* acc_async_wait_all is an OpenACC 1.0 compatibility name for acc_wait_all. */ @@ -120,10 +207,99 @@ acc_wait_all_async (int async) if (!async_valid_p (async)) gomp_fatal ("invalid async argument: %d", async); - struct goacc_thread *thr = goacc_thread (); + struct goacc_thread *thr = get_goacc_thread (); - if (!thr || !thr->dev) - gomp_fatal ("no device active"); + goacc_aq waiting_queue = lookup_goacc_asyncqueue (thr, true, async); - thr->dev->openacc.async_wait_all_async_func (async); + gomp_mutex_lock (&thr->dev->openacc.async.lock); + for (goacc_aq_list l = thr->dev->openacc.async.active; l; l = l->next) + { + thr->dev->openacc.async.synchronize_func (l->aq); + if (waiting_queue) + thr->dev->openacc.async.serialize_func (l->aq, waiting_queue); + } + gomp_mutex_unlock (&thr->dev->openacc.async.lock); +} + +int +acc_get_default_async (void) +{ + struct goacc_thread *thr = get_goacc_thread (); + return thr->default_async; +} + +void +acc_set_default_async (int async) +{ + if (async < acc_async_sync) + gomp_fatal ("invalid async argument: %d", async); + + struct goacc_thread *thr = get_goacc_thread (); + thr->default_async = async; +} + +static void +goacc_async_unmap_tgt (void *ptr) +{ + struct target_mem_desc *tgt = (struct target_mem_desc *) ptr; + + if (tgt->refcount > 1) + tgt->refcount--; + else + gomp_unmap_tgt (tgt); +} + +attribute_hidden void +goacc_async_copyout_unmap_vars (struct target_mem_desc *tgt, + struct goacc_asyncqueue *aq) +{ + struct gomp_device_descr *devicep = tgt->device_descr; + + /* Increment reference to delay freeing of device memory until callback + has triggered. */ + tgt->refcount++; + gomp_unmap_vars_async (tgt, true, aq); + devicep->openacc.async.queue_callback_func (aq, goacc_async_unmap_tgt, + (void *) tgt); +} + +attribute_hidden void +goacc_async_free (struct gomp_device_descr *devicep, + struct goacc_asyncqueue *aq, void *ptr) +{ + if (!aq) + free (ptr); + else + devicep->openacc.async.queue_callback_func (aq, free, ptr); +} + +attribute_hidden void +goacc_init_asyncqueues (struct gomp_device_descr *devicep) +{ + gomp_mutex_init (&devicep->openacc.async.lock); + devicep->openacc.async.nasyncqueue = 0; + devicep->openacc.async.asyncqueue = NULL; + devicep->openacc.async.active = NULL; +} + +attribute_hidden bool +goacc_fini_asyncqueues (struct gomp_device_descr *devicep) +{ + bool ret = true; + if (devicep->openacc.async.nasyncqueue > 0) + { + goacc_aq_list next; + for (goacc_aq_list l = devicep->openacc.async.active; l; l = next) + { + ret &= devicep->openacc.async.destruct_func (l->aq); + next = l->next; + free (l); + } + free (devicep->openacc.async.asyncqueue); + devicep->openacc.async.nasyncqueue = 0; + devicep->openacc.async.asyncqueue = NULL; + devicep->openacc.async.active = NULL; + } + gomp_mutex_destroy (&devicep->openacc.async.lock); + return ret; } diff --git a/libgomp/oacc-cuda.c b/libgomp/oacc-cuda.c index 20774c1..0a842ea 100644 --- a/libgomp/oacc-cuda.c +++ b/libgomp/oacc-cuda.c @@ -62,7 +62,11 @@ acc_get_cuda_stream (int async) return NULL; if (thr && thr->dev && thr->dev->openacc.cuda.get_stream_func) - return thr->dev->openacc.cuda.get_stream_func (async); + { + goacc_aq aq = lookup_goacc_asyncqueue (thr, false, async); + if (aq) + return thr->dev->openacc.cuda.get_stream_func (aq); + } return NULL; } @@ -79,8 +83,14 @@ acc_set_cuda_stream (int async, void *stream) thr = goacc_thread (); + int ret = -1; if (thr && thr->dev && thr->dev->openacc.cuda.set_stream_func) - return thr->dev->openacc.cuda.set_stream_func (async, stream); - - return -1; + { + goacc_aq aq = get_goacc_asyncqueue (async); + gomp_mutex_lock (&thr->dev->openacc.async.lock); + ret = thr->dev->openacc.cuda.set_stream_func (aq, stream); + gomp_mutex_unlock (&thr->dev->openacc.async.lock); + } + + return ret; } diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c index 2de3c37..53658c8 100644 --- a/libgomp/oacc-host.c +++ b/libgomp/oacc-host.c @@ -140,55 +140,86 @@ host_openacc_exec (void (*fn) (void *), size_t mapnum __attribute__ ((unused)), void **hostaddrs, void **devaddrs __attribute__ ((unused)), - int async __attribute__ ((unused)), - unsigned *dims __attribute ((unused)), + unsigned *dims __attribute__ ((unused)), void *targ_mem_desc __attribute__ ((unused))) { fn (hostaddrs); } static void -host_openacc_register_async_cleanup (void *targ_mem_desc __attribute__ ((unused)), - int async __attribute__ ((unused))) +host_openacc_async_exec (void (*fn) (void *), + size_t mapnum __attribute__ ((unused)), + void **hostaddrs, + void **devaddrs __attribute__ ((unused)), + unsigned *dims __attribute__ ((unused)), + void *targ_mem_desc __attribute__ ((unused)), + struct goacc_asyncqueue *aq __attribute__ ((unused))) { + fn (hostaddrs); } static int -host_openacc_async_test (int async __attribute__ ((unused))) +host_openacc_async_test (struct goacc_asyncqueue *aq __attribute__ ((unused))) { return 1; } -static int -host_openacc_async_test_all (void) +static void +host_openacc_async_synchronize (struct goacc_asyncqueue *aq + __attribute__ ((unused))) { - return 1; } static void -host_openacc_async_wait (int async __attribute__ ((unused))) +host_openacc_async_serialize (struct goacc_asyncqueue *aq1 + __attribute__ ((unused)), + struct goacc_asyncqueue *aq2 + __attribute__ ((unused))) { } -static void -host_openacc_async_wait_async (int async1 __attribute__ ((unused)), - int async2 __attribute__ ((unused))) +static bool +host_openacc_async_host2dev (int ord __attribute__ ((unused)), + void *dst __attribute__ ((unused)), + const void *src __attribute__ ((unused)), + size_t n __attribute__ ((unused)), + struct goacc_asyncqueue *aq + __attribute__ ((unused))) { + return true; } -static void -host_openacc_async_wait_all (void) +static bool +host_openacc_async_dev2host (int ord __attribute__ ((unused)), + void *dst __attribute__ ((unused)), + const void *src __attribute__ ((unused)), + size_t n __attribute__ ((unused)), + struct goacc_asyncqueue *aq + __attribute__ ((unused))) { + return true; } static void -host_openacc_async_wait_all_async (int async __attribute__ ((unused))) +host_openacc_async_queue_callback (struct goacc_asyncqueue *aq + __attribute__ ((unused)), + void (*callback_fn)(void *) + __attribute__ ((unused)), + void *userptr __attribute__ ((unused))) { } -static void -host_openacc_async_set_async (int async __attribute__ ((unused))) +static struct goacc_asyncqueue * +host_openacc_async_construct (void) { + return NULL; +} + +static bool +host_openacc_async_destruct (struct goacc_asyncqueue *aq + __attribute__ ((unused))) +{ + return true; } static void * @@ -235,15 +266,17 @@ static struct gomp_device_descr host_dispatch = .exec_func = host_openacc_exec, - .register_async_cleanup_func = host_openacc_register_async_cleanup, - - .async_test_func = host_openacc_async_test, - .async_test_all_func = host_openacc_async_test_all, - .async_wait_func = host_openacc_async_wait, - .async_wait_async_func = host_openacc_async_wait_async, - .async_wait_all_func = host_openacc_async_wait_all, - .async_wait_all_async_func = host_openacc_async_wait_all_async, - .async_set_async_func = host_openacc_async_set_async, + .async = { + .construct_func = host_openacc_async_construct, + .destruct_func = host_openacc_async_destruct, + .test_func = host_openacc_async_test, + .synchronize_func = host_openacc_async_synchronize, + .serialize_func = host_openacc_async_serialize, + .queue_callback_func = host_openacc_async_queue_callback, + .exec_func = host_openacc_async_exec, + .dev2host_func = host_openacc_async_dev2host, + .host2dev_func = host_openacc_async_host2dev, + }, .create_thread_data_func = host_openacc_create_thread_data, .destroy_thread_data_func = host_openacc_destroy_thread_data, diff --git a/libgomp/oacc-init.c b/libgomp/oacc-init.c index 8db24b1..2c2f91c 100644 --- a/libgomp/oacc-init.c +++ b/libgomp/oacc-init.c @@ -309,7 +309,7 @@ acc_shutdown_1 (acc_device_t d) if (acc_dev->state == GOMP_DEVICE_INITIALIZED) { devices_active = true; - ret &= acc_dev->fini_device_func (acc_dev->target_id); + ret &= gomp_fini_device (acc_dev); acc_dev->state = GOMP_DEVICE_UNINITIALIZED; } gomp_mutex_unlock (&acc_dev->lock); @@ -426,8 +426,8 @@ goacc_attach_host_thread_to_device (int ord) thr->target_tls = acc_dev->openacc.create_thread_data_func (ord); - - acc_dev->openacc.async_set_async_func (acc_async_sync); + + thr->default_async = acc_async_default; } /* OpenACC 2.0a (3.2.12, 3.2.13) doesn't specify whether the serialization of diff --git a/libgomp/oacc-mem.c b/libgomp/oacc-mem.c index 72414b7..07a2524 100644 --- a/libgomp/oacc-mem.c +++ b/libgomp/oacc-mem.c @@ -172,18 +172,11 @@ memcpy_tofrom_device (bool from, void *d, void *h, size_t s, int async, return; } - if (async > acc_async_sync) - thr->dev->openacc.async_set_async_func (async); - - bool ret = (from - ? thr->dev->dev2host_func (thr->dev->target_id, h, d, s) - : thr->dev->host2dev_func (thr->dev->target_id, d, h, s)); - - if (async > acc_async_sync) - thr->dev->openacc.async_set_async_func (acc_async_sync); - - if (!ret) - gomp_fatal ("error in %s", libfnname); + goacc_aq aq = get_goacc_asyncqueue (async); + if (from) + gomp_copy_dev2host (thr->dev, aq, h, d, s); + else + gomp_copy_host2dev (thr->dev, aq, d, h, s, /* TODO: cbuf? */ NULL); } void @@ -509,17 +502,13 @@ present_create_copy (unsigned f, void *h, size_t s, int async) gomp_mutex_unlock (&acc_dev->lock); - if (async > acc_async_sync) - acc_dev->openacc.async_set_async_func (async); + goacc_aq aq = get_goacc_asyncqueue (async); - tgt = gomp_map_vars (acc_dev, mapnum, &hostaddrs, NULL, &s, &kinds, true, - GOMP_MAP_VARS_OPENACC); + tgt = gomp_map_vars_async (acc_dev, aq, mapnum, &hostaddrs, NULL, &s, + &kinds, true, GOMP_MAP_VARS_OPENACC); /* Initialize dynamic refcount. */ tgt->list[0].key->dynamic_refcount = 1; - if (async > acc_async_sync) - acc_dev->openacc.async_set_async_func (acc_async_sync); - gomp_mutex_lock (&acc_dev->lock); d = tgt->to_free; @@ -673,13 +662,9 @@ delete_copyout (unsigned f, void *h, size_t s, int async, const char *libfnname) if (f & FLAG_COPYOUT) { - if (async > acc_async_sync) - acc_dev->openacc.async_set_async_func (async); - acc_dev->dev2host_func (acc_dev->target_id, h, d, s); - if (async > acc_async_sync) - acc_dev->openacc.async_set_async_func (acc_async_sync); + goacc_aq aq = get_goacc_asyncqueue (async); + gomp_copy_dev2host (acc_dev, aq, h, d, s); } - gomp_remove_var (acc_dev, n); } @@ -762,16 +747,12 @@ update_dev_host (int is_dev, void *h, size_t s, int async) d = (void *) (n->tgt->tgt_start + n->tgt_offset + (uintptr_t) h - n->host_start); - if (async > acc_async_sync) - acc_dev->openacc.async_set_async_func (async); + goacc_aq aq = get_goacc_asyncqueue (async); if (is_dev) - acc_dev->host2dev_func (acc_dev->target_id, d, h, s); + gomp_copy_host2dev (acc_dev, aq, d, h, s, /* TODO: cbuf? */ NULL); else - acc_dev->dev2host_func (acc_dev->target_id, h, d, s); - - if (async > acc_async_sync) - acc_dev->openacc.async_set_async_func (acc_async_sync); + gomp_copy_dev2host (acc_dev, aq, h, d, s); gomp_mutex_unlock (&acc_dev->lock); } @@ -802,7 +783,7 @@ acc_update_self_async (void *h, size_t s, int async) void gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes, - void *kinds) + void *kinds, int async) { struct target_mem_desc *tgt; struct goacc_thread *thr = goacc_thread (); @@ -832,8 +813,9 @@ gomp_acc_insert_pointer (size_t mapnum, void **hostaddrs, size_t *sizes, } gomp_debug (0, " %s: prepare mappings\n", __FUNCTION__); - tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, - NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC); + goacc_aq aq = get_goacc_asyncqueue (async); + tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, + NULL, sizes, kinds, true, GOMP_MAP_VARS_OPENACC); gomp_debug (0, " %s: mappings prepared\n", __FUNCTION__); /* Initialize dynamic refcount. */ @@ -927,7 +909,10 @@ gomp_acc_remove_pointer (void *h, size_t s, bool force_copyfrom, int async, if (async < acc_async_noval) gomp_unmap_vars (t, true); else - t->device_descr->openacc.register_async_cleanup_func (t, async); + { + goacc_aq aq = get_goacc_asyncqueue (async); + goacc_async_copyout_unmap_vars (t, aq); + } } gomp_mutex_unlock (&acc_dev->lock); diff --git a/libgomp/oacc-parallel.c b/libgomp/oacc-parallel.c index bfe8876..07d0338 100644 --- a/libgomp/oacc-parallel.c +++ b/libgomp/oacc-parallel.c @@ -212,8 +212,6 @@ GOACC_parallel_keyed (int device, void (*fn) (void *), } va_end (ap); - acc_dev->openacc.async_set_async_func (async); - if (!(acc_dev->capabilities & GOMP_OFFLOAD_CAP_NATIVE_EXEC)) { k.host_start = (uintptr_t) fn; @@ -230,43 +228,28 @@ GOACC_parallel_keyed (int device, void (*fn) (void *), else tgt_fn = (void (*)) fn; - tgt = gomp_map_vars (acc_dev, mapnum, hostaddrs, NULL, sizes, kinds, true, - GOMP_MAP_VARS_OPENACC); + goacc_aq aq = get_goacc_asyncqueue (async); + + tgt = gomp_map_vars_async (acc_dev, aq, mapnum, hostaddrs, NULL, sizes, kinds, + true, GOMP_MAP_VARS_OPENACC); devaddrs = gomp_alloca (sizeof (void *) * mapnum); for (i = 0; i < mapnum; i++) devaddrs[i] = (void *) (tgt->list[i].key->tgt->tgt_start + tgt->list[i].key->tgt_offset); - - acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, - async, dims, tgt); - - /* If running synchronously, unmap immediately. */ - bool copyfrom = true; - if (async_synchronous_p (async)) - gomp_unmap_vars (tgt, true); + if (aq == NULL) + { + acc_dev->openacc.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, + dims, tgt); + /* If running synchronously, unmap immediately. */ + gomp_unmap_vars (tgt, true); + } else { - bool async_unmap = false; - for (size_t i = 0; i < tgt->list_count; i++) - { - splay_tree_key k = tgt->list[i].key; - if (k && k->refcount == 1) - { - async_unmap = true; - break; - } - } - if (async_unmap) - tgt->device_descr->openacc.register_async_cleanup_func (tgt, async); - else - { - copyfrom = false; - gomp_unmap_vars (tgt, copyfrom); - } + acc_dev->openacc.async.exec_func (tgt_fn, mapnum, hostaddrs, devaddrs, + dims, tgt, aq); + goacc_async_copyout_unmap_vars (tgt, aq); } - - acc_dev->openacc.async_set_async_func (acc_async_sync); } /* Legacy entry point, only provide host execution. */ @@ -377,8 +360,6 @@ GOACC_enter_exit_data (int device, size_t mapnum, finalize = true; } - acc_dev->openacc.async_set_async_func (async); - /* Determine if this is an "acc enter data". */ for (i = 0; i < mapnum; ++i) { @@ -450,7 +431,7 @@ GOACC_enter_exit_data (int device, size_t mapnum, else { gomp_acc_insert_pointer (pointer, &hostaddrs[i], - &sizes[i], &kinds[i]); + &sizes[i], &kinds[i], async); /* Increment 'i' by two because OpenACC requires fortran arrays to be contiguous, so each PSET is associated with one of MAP_FORCE_ALLOC/MAP_FORCE_PRESET/MAP_FORCE_TO, and @@ -475,17 +456,17 @@ GOACC_enter_exit_data (int device, size_t mapnum, if (acc_is_present (hostaddrs[i], sizes[i])) { if (finalize) - acc_delete_finalize (hostaddrs[i], sizes[i]); + acc_delete_finalize_async (hostaddrs[i], sizes[i], async); else - acc_delete (hostaddrs[i], sizes[i]); + acc_delete_async (hostaddrs[i], sizes[i], async); } break; case GOMP_MAP_FROM: case GOMP_MAP_FORCE_FROM: if (finalize) - acc_copyout_finalize (hostaddrs[i], sizes[i]); + acc_copyout_finalize_async (hostaddrs[i], sizes[i], async); else - acc_copyout (hostaddrs[i], sizes[i]); + acc_copyout_async (hostaddrs[i], sizes[i], async); break; default: gomp_fatal (">>>> GOACC_enter_exit_data UNHANDLED kind 0x%.2x", @@ -503,8 +484,6 @@ GOACC_enter_exit_data (int device, size_t mapnum, i += pointer - 1; } } - - acc_dev->openacc.async_set_async_func (acc_async_sync); } static void @@ -517,17 +496,22 @@ goacc_wait (int async, int num_waits, va_list *ap) { int qid = va_arg (*ap, int); - if (acc_async_test (qid)) + goacc_aq aq = get_goacc_asyncqueue (qid); + if (acc_dev->openacc.async.test_func (aq)) continue; - if (async == acc_async_sync) - acc_wait (qid); + acc_dev->openacc.async.synchronize_func (aq); else if (qid == async) - ;/* If we're waiting on the same asynchronous queue as we're - launching on, the queue itself will order work as - required, so there's no need to wait explicitly. */ + /* If we're waiting on the same asynchronous queue as we're + launching on, the queue itself will order work as + required, so there's no need to wait explicitly. */ + ; else - acc_dev->openacc.async_wait_async_func (qid, async); + { + goacc_aq aq2 = get_goacc_asyncqueue (async); + acc_dev->openacc.async.synchronize_func (aq); + acc_dev->openacc.async.serialize_func (aq, aq2); + } } } @@ -559,8 +543,6 @@ GOACC_update (int device, size_t mapnum, else if (num_waits == acc_async_noval) acc_wait_all_async (async); - acc_dev->openacc.async_set_async_func (async); - bool update_device = false; for (i = 0; i < mapnum; ++i) { @@ -600,7 +582,7 @@ GOACC_update (int device, size_t mapnum, /* Fallthru */ case GOMP_MAP_FORCE_TO: update_device = true; - acc_update_device (hostaddrs[i], sizes[i]); + acc_update_device_async (hostaddrs[i], sizes[i], async); break; case GOMP_MAP_FROM: @@ -612,7 +594,7 @@ GOACC_update (int device, size_t mapnum, /* Fallthru */ case GOMP_MAP_FORCE_FROM: update_device = false; - acc_update_self (hostaddrs[i], sizes[i]); + acc_update_self_async (hostaddrs[i], sizes[i], async); break; default: @@ -620,8 +602,6 @@ GOACC_update (int device, size_t mapnum, break; } } - - acc_dev->openacc.async_set_async_func (acc_async_sync); } void @@ -638,7 +618,7 @@ GOACC_wait (int async, int num_waits, ...) else if (async == acc_async_sync) acc_wait_all (); else if (async == acc_async_noval) - goacc_thread ()->dev->openacc.async_wait_all_async_func (acc_async_noval); + acc_wait_all_async (async); } int diff --git a/libgomp/oacc-plugin.c b/libgomp/oacc-plugin.c index c04db90..a114cc7 100644 --- a/libgomp/oacc-plugin.c +++ b/libgomp/oacc-plugin.c @@ -30,17 +30,6 @@ #include "oacc-plugin.h" #include "oacc-int.h" -void -GOMP_PLUGIN_async_unmap_vars (void *ptr, int async) -{ - struct target_mem_desc *tgt = ptr; - struct gomp_device_descr *devicep = tgt->device_descr; - - devicep->openacc.async_set_async_func (async); - gomp_unmap_vars (tgt, true); - devicep->openacc.async_set_async_func (acc_async_sync); -} - /* Return the target-specific part of the TLS data for the current thread. */ void * diff --git a/libgomp/openacc.h b/libgomp/openacc.h index f61bb77..ede59d7 100644 --- a/libgomp/openacc.h +++ b/libgomp/openacc.h @@ -63,6 +63,7 @@ typedef enum acc_device_t { typedef enum acc_async_t { /* Keep in sync with include/gomp-constants.h. */ + acc_async_default = 0, acc_async_noval = -1, acc_async_sync = -2 } acc_async_t; @@ -72,6 +73,8 @@ void acc_set_device_type (acc_device_t) __GOACC_NOTHROW; acc_device_t acc_get_device_type (void) __GOACC_NOTHROW; void acc_set_device_num (int, acc_device_t) __GOACC_NOTHROW; int acc_get_device_num (acc_device_t) __GOACC_NOTHROW; +void acc_set_default_async (int) __GOACC_NOTHROW; +int acc_get_default_async (void) __GOACC_NOTHROW; int acc_async_test (int) __GOACC_NOTHROW; int acc_async_test_all (void) __GOACC_NOTHROW; void acc_wait (int) __GOACC_NOTHROW; From patchwork Tue Sep 25 13:11:06 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chung-Lin Tang X-Patchwork-Id: 974405 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=gcc-patches-return-486325-incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=mentor.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b="ONFAwKg7"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 42KM0K4Qygz9s3C for ; Tue, 25 Sep 2018 23:11:29 +1000 (AEST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; q=dns; s=default; b=vQ2UnxaaCGe0+EAJkR9g/McGmu32f u6IrprossRnLOaCpLmif4rvDxuOg+3mO+sfoiVBaFzW5NCsR7TfUkWSAZxx9MZjr yEi93XiLIlrC6e362Qq5Tn70HQvr4xoHFBxYT7yqlP+rSHIp3Ft/8EjPtdyo0FAO B6aPaCzOlmyXIk= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; s=default; bh=LWMv32GMYYkWWHvEbyQ+OliyHEE=; b=ONF AwKg7JTqtNnwei54dUKNksxeStbBrHFuEHxkvYqU9a6t46zlZFDwAu2r+6uunU/5 fJ0lNAenPkhQKnAi9tiUQ6m6/TWad7tIzqXAc/C+sfhJ860uL5UO9lHJmt6bAXJt HAVolBIRqSW+r6uQjGT7GhCZXixtpI06uTSfSNTg= Received: (qmail 120293 invoked by alias); 25 Sep 2018 13:11:17 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 120239 invoked by uid 89); 25 Sep 2018 13:11:16 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-24.8 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, RCVD_IN_DNSWL_NONE, SPF_PASS autolearn=ham version=3.3.2 spammy=H*r:0700, H*UA:Macintosh, H*u:Macintosh X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 25 Sep 2018 13:11:14 +0000 Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202]) by relay1.mentorg.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-SHA384:256) id 1g4n7R-0004D9-3A from ChungLin_Tang@mentor.com ; Tue, 25 Sep 2018 06:11:13 -0700 Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com (147.34.90.202) with Microsoft SMTP Server (TLS) id 15.0.1320.4; Tue, 25 Sep 2018 06:11:10 -0700 Reply-To: From: Chung-Lin Tang Subject: [PATCH 3/6, OpenACC, libgomp] acc_get/set_default_async API, Fortran specific parts To: , Thomas Schwinge , Message-ID: Date: Tue, 25 Sep 2018 21:11:06 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 This patch adds the OpenACC 2.5 new APIs of acc_get_default_async/acc_set_default_async, contained are the modifications for Fortran in libgomp/openacc.f90, libgomp/openacc_lib.h, and a small testsuite adjustment. Thanks, Chung-Lin * openacc.f90 (acc_async_default): Declare. (acc_set_default_async): Likewise. (acc_get_default_async): Likewise. * openacc_lib.h (acc_async_default): Declare. (acc_set_default_async): Likewise. (acc_get_default_async): Likewise. * testsuite/libgomp.oacc-fortran/lib-12.f90: Likewise. diff --git a/libgomp/openacc.f90 b/libgomp/openacc.f90 index 7c809fe..7d31ee6 100644 --- a/libgomp/openacc.f90 +++ b/libgomp/openacc.f90 @@ -51,9 +51,10 @@ module openacc_kinds integer, parameter :: acc_handle_kind = int32 - public :: acc_async_noval, acc_async_sync + public :: acc_async_default, acc_async_noval, acc_async_sync ! Keep in sync with include/gomp-constants.h. + integer (acc_handle_kind), parameter :: acc_async_default = 0 integer (acc_handle_kind), parameter :: acc_async_noval = -1 integer (acc_handle_kind), parameter :: acc_async_sync = -2 @@ -92,6 +93,16 @@ module openacc_internal integer (acc_device_kind) d end function + subroutine acc_set_default_async_h (a) + import + integer a + end subroutine + + function acc_get_default_async_h () + import + integer acc_get_default_async_h + end function + function acc_async_test_h (a) logical acc_async_test_h integer a @@ -720,6 +731,7 @@ module openacc public :: acc_get_num_devices, acc_set_device_type, acc_get_device_type public :: acc_set_device_num, acc_get_device_num, acc_async_test + public :: acc_set_default_async, acc_get_default_async public :: acc_async_test_all public :: acc_wait, acc_async_wait, acc_wait_async public :: acc_wait_all, acc_async_wait_all, acc_wait_all_async @@ -752,6 +764,14 @@ module openacc procedure :: acc_get_device_num_h end interface + interface acc_set_default_async + procedure :: acc_set_default_async_h + end interface + + interface acc_get_default_async + procedure :: acc_get_default_async_h + end interface + interface acc_async_test procedure :: acc_async_test_h end interface diff --git a/libgomp/openacc_lib.h b/libgomp/openacc_lib.h index 820d987..75a6939 100644 --- a/libgomp/openacc_lib.h +++ b/libgomp/openacc_lib.h @@ -46,6 +46,7 @@ integer, parameter :: acc_handle_kind = 4 ! Keep in sync with include/gomp-constants.h. + integer (acc_handle_kind), parameter :: acc_async_default = 0 integer (acc_handle_kind), parameter :: acc_async_noval = -1 integer (acc_handle_kind), parameter :: acc_async_sync = -2 @@ -89,6 +90,18 @@ end function end interface + interface acc_set_default_async + subroutine acc_set_default_async_h (a) + integer a + end subroutine + end interface + + interface acc_get_default_async + function acc_get_default_async_h () + integer acc_get_default_async_h + end function + end interface + interface acc_async_test function acc_async_test_h (a) logical acc_async_test_h diff --git a/libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90 b/libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90 index 6912f67..ffbbf33 100644 --- a/libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90 +++ b/libgomp/testsuite/libgomp.oacc-fortran/lib-12.f90 @@ -1,4 +1,5 @@ ! { dg-do run } +! { dg-xfail-run-if "n/a" { openacc_host_selected } } program main use openacc From patchwork Tue Sep 25 13:11:24 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chung-Lin Tang X-Patchwork-Id: 974406 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=gcc-patches-return-486326-incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=mentor.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b="DYX9zffg"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 42KM0h1FCyz9s3C for ; Tue, 25 Sep 2018 23:11:47 +1000 (AEST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; q=dns; s=default; b=fGbYcnYQMOLdTIuyXX32tX3Y+fxhx 0D0M9aRNj+3pnzemdeBy86/ewXnb1yXOqQM3t7cc2/3uV4ncrdGef2VaVBxgaKTc OldqLtRqjoGIbT0u2/CY9xSIZ77xH2w43LOKjzOPoKBLSBaAkZDepwBruNU/m3ba 3Xa2XWifISCA4s= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; s=default; bh=ptiOiKDRhXbR16Sqcm+8fy3fZ9c=; b=DYX 9zffgnoulGePgTfO1C2pzZnC/maVA4up/oc8Soju8hn/GHW9ly8iunmnzxAQ6Rmx UwjadxoxxlCM4kUrmqiqrmfn/QolU4YPwTKT0ruxViuYUrofGM+OsfwOTghCHDLy d9si5Ff62q4jX35hC/nv3/Pa8WRfD1WqcwrixEFY= Received: (qmail 122554 invoked by alias); 25 Sep 2018 13:11:40 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 122533 invoked by uid 89); 25 Sep 2018 13:11:38 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-24.9 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, RCVD_IN_DNSWL_NONE, SPF_PASS autolearn=ham version=3.3.2 spammy=10017, 4826, 2799, bias X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 25 Sep 2018 13:11:32 +0000 Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202]) by relay1.mentorg.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-SHA384:256) id 1g4n7i-0004JY-Bx from ChungLin_Tang@mentor.com ; Tue, 25 Sep 2018 06:11:30 -0700 Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com (147.34.90.202) with Microsoft SMTP Server (TLS) id 15.0.1320.4; Tue, 25 Sep 2018 06:11:27 -0700 Reply-To: From: Chung-Lin Tang Subject: [PATCH 4/6, OpenACC, libgomp] Async re-work, libgomp/target.c changes To: , Jakub Jelinek Message-ID: Date: Tue, 25 Sep 2018 21:11:24 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 Hi Jakub, This part has changes to 'struct goacc_asyncqueue*' arguments to various memory copying/mapping functions. To lessen the amount of code changes new 'gomp_map/unmap_vars_async' functions names are used (with the non-async original names defined with the asyncqueue==NULL). Inside gomp_target_fini, a 'gomp_fini_device' named function has been (re?)introduced, since there's also asyncqueue destructing now need before the ->fini_device_func() call. Thanks, Chung-Lin * target.c (goacc_device_copy_async): New function. (gomp_copy_host2dev): Remove 'static', add goacc_asyncqueue parameter, add goacc_device_copy_async case. (gomp_copy_dev2host): Likewise. (gomp_map_vars_existing): Add goacc_asyncqueue parameter, adjust code. (gomp_map_pointer): Likewise. (gomp_map_fields_existing): Likewise. (gomp_map_vars): Add function for compatiblity. (gomp_map_vars_async): Adapt from gomp_map_vars, add goacc_asyncqueue parameter. (gomp_unmap_tgt): Remove statis, add attribute_hidden. (gomp_unmap_vars): Add function for compatiblity. (gomp_unmap_vars_async): Adapt from gomp_unmap_vars, add goacc_asyncqueue parameter. (gomp_fini_device): New function. (gomp_exit_data): Adjust gomp_copy_dev2host call. (gomp_load_plugin_for_device): Remove old interface, adjust to load new async interface. (gomp_target_fini): Adjust code to call gomp_fini_device. diff --git a/libgomp/target.c b/libgomp/target.c index dda041c..ff5b114 100644 --- a/libgomp/target.c +++ b/libgomp/target.c @@ -177,6 +177,22 @@ gomp_device_copy (struct gomp_device_descr *devicep, } } +static inline void +goacc_device_copy_async (struct gomp_device_descr *devicep, + bool (*copy_func) (int, void *, const void *, size_t, + struct goacc_asyncqueue *), + const char *dst, void *dstaddr, + const char *src, const void *srcaddr, + size_t size, struct goacc_asyncqueue *aq) +{ + if (!copy_func (devicep->target_id, dstaddr, srcaddr, size, aq)) + { + gomp_mutex_unlock (&devicep->lock); + gomp_fatal ("Copying of %s object [%p..%p) to %s object [%p..%p) failed", + src, srcaddr, srcaddr + size, dst, dstaddr, dstaddr + size); + } +} + /* Infrastructure for coalescing adjacent or nearly adjacent (in device addresses) host to device memory transfers. */ @@ -263,8 +279,9 @@ gomp_to_device_kind_p (int kind) } } -static void +attribute_hidden void gomp_copy_host2dev (struct gomp_device_descr *devicep, + struct goacc_asyncqueue *aq, void *d, const void *h, size_t sz, struct gomp_coalesce_buf *cbuf) { @@ -293,14 +310,23 @@ gomp_copy_host2dev (struct gomp_device_descr *devicep, } } } - gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz); + if (aq) + goacc_device_copy_async (devicep, devicep->openacc.async.host2dev_func, + "dev", d, "host", h, sz, aq); + else + gomp_device_copy (devicep, devicep->host2dev_func, "dev", d, "host", h, sz); } -static void +attribute_hidden void gomp_copy_dev2host (struct gomp_device_descr *devicep, + struct goacc_asyncqueue *aq, void *h, const void *d, size_t sz) { - gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz); + if (aq) + goacc_device_copy_async (devicep, devicep->openacc.async.dev2host_func, + "host", h, "dev", d, sz, aq); + else + gomp_device_copy (devicep, devicep->dev2host_func, "host", h, "dev", d, sz); } static void @@ -318,7 +344,8 @@ gomp_free_device_memory (struct gomp_device_descr *devicep, void *devptr) Helper function of gomp_map_vars. */ static inline void -gomp_map_vars_existing (struct gomp_device_descr *devicep, splay_tree_key oldn, +gomp_map_vars_existing (struct gomp_device_descr *devicep, + struct goacc_asyncqueue *aq, splay_tree_key oldn, splay_tree_key newn, struct target_var_desc *tgt_var, unsigned char kind, struct gomp_coalesce_buf *cbuf) { @@ -340,7 +367,7 @@ gomp_map_vars_existing (struct gomp_device_descr *devicep, splay_tree_key oldn, } if (GOMP_MAP_ALWAYS_TO_P (kind)) - gomp_copy_host2dev (devicep, + gomp_copy_host2dev (devicep, aq, (void *) (oldn->tgt->tgt_start + oldn->tgt_offset + newn->host_start - oldn->host_start), (void *) newn->host_start, @@ -358,8 +385,8 @@ get_kind (bool short_mapkind, void *kinds, int idx) } static void -gomp_map_pointer (struct target_mem_desc *tgt, uintptr_t host_ptr, - uintptr_t target_offset, uintptr_t bias, +gomp_map_pointer (struct target_mem_desc *tgt, struct goacc_asyncqueue *aq, + uintptr_t host_ptr, uintptr_t target_offset, uintptr_t bias, struct gomp_coalesce_buf *cbuf) { struct gomp_device_descr *devicep = tgt->device_descr; @@ -370,7 +397,7 @@ gomp_map_pointer (struct target_mem_desc *tgt, uintptr_t host_ptr, if (cur_node.host_start == (uintptr_t) NULL) { cur_node.tgt_offset = (uintptr_t) NULL; - gomp_copy_host2dev (devicep, + gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + target_offset), (void *) &cur_node.tgt_offset, sizeof (void *), cbuf); @@ -392,12 +419,13 @@ gomp_map_pointer (struct target_mem_desc *tgt, uintptr_t host_ptr, array section. Now subtract bias to get what we want to initialize the pointer with. */ cur_node.tgt_offset -= bias; - gomp_copy_host2dev (devicep, (void *) (tgt->tgt_start + target_offset), + gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + target_offset), (void *) &cur_node.tgt_offset, sizeof (void *), cbuf); } static void -gomp_map_fields_existing (struct target_mem_desc *tgt, splay_tree_key n, +gomp_map_fields_existing (struct target_mem_desc *tgt, + struct goacc_asyncqueue *aq, splay_tree_key n, size_t first, size_t i, void **hostaddrs, size_t *sizes, void *kinds, struct gomp_coalesce_buf *cbuf) @@ -417,7 +445,7 @@ gomp_map_fields_existing (struct target_mem_desc *tgt, splay_tree_key n, && n2->tgt == n->tgt && n2->host_start - n->host_start == n2->tgt_offset - n->tgt_offset) { - gomp_map_vars_existing (devicep, n2, &cur_node, + gomp_map_vars_existing (devicep, aq, n2, &cur_node, &tgt->list[i], kind & typemask, cbuf); return; } @@ -433,8 +461,8 @@ gomp_map_fields_existing (struct target_mem_desc *tgt, splay_tree_key n, && n2->host_start - n->host_start == n2->tgt_offset - n->tgt_offset) { - gomp_map_vars_existing (devicep, n2, &cur_node, &tgt->list[i], - kind & typemask, cbuf); + gomp_map_vars_existing (devicep, aq, n2, &cur_node, + &tgt->list[i], kind & typemask, cbuf); return; } } @@ -445,7 +473,7 @@ gomp_map_fields_existing (struct target_mem_desc *tgt, splay_tree_key n, && n2->tgt == n->tgt && n2->host_start - n->host_start == n2->tgt_offset - n->tgt_offset) { - gomp_map_vars_existing (devicep, n2, &cur_node, &tgt->list[i], + gomp_map_vars_existing (devicep, aq, n2, &cur_node, &tgt->list[i], kind & typemask, cbuf); return; } @@ -482,6 +510,18 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds, bool short_mapkind, enum gomp_map_vars_kind pragma_kind) { + struct target_mem_desc *tgt; + tgt = gomp_map_vars_async (devicep, NULL, mapnum, hostaddrs, devaddrs, + sizes, kinds, short_mapkind, pragma_kind); + return tgt; +} + +attribute_hidden struct target_mem_desc * +gomp_map_vars_async (struct gomp_device_descr *devicep, + struct goacc_asyncqueue *aq, size_t mapnum, + void **hostaddrs, void **devaddrs, size_t *sizes, void *kinds, + bool short_mapkind, enum gomp_map_vars_kind pragma_kind) +{ size_t i, tgt_align, tgt_size, not_found_cnt = 0; bool has_firstprivate = false; const int rshift = short_mapkind ? 8 : 3; @@ -594,7 +634,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, continue; } for (i = first; i <= last; i++) - gomp_map_fields_existing (tgt, n, first, i, hostaddrs, + gomp_map_fields_existing (tgt, aq, n, first, i, hostaddrs, sizes, kinds, NULL); i--; continue; @@ -639,7 +679,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, else n = splay_tree_lookup (mem_map, &cur_node); if (n && n->refcount != REFCOUNT_LINK) - gomp_map_vars_existing (devicep, n, &cur_node, &tgt->list[i], + gomp_map_vars_existing (devicep, aq, n, &cur_node, &tgt->list[i], kind & typemask, NULL); else { @@ -750,7 +790,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, tgt_size = (tgt_size + align - 1) & ~(align - 1); tgt->list[i].offset = tgt_size; len = sizes[i]; - gomp_copy_host2dev (devicep, + gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + tgt_size), (void *) hostaddrs[i], len, cbufp); tgt_size += len; @@ -784,7 +824,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, continue; } for (i = first; i <= last; i++) - gomp_map_fields_existing (tgt, n, first, i, hostaddrs, + gomp_map_fields_existing (tgt, aq, n, first, i, hostaddrs, sizes, kinds, cbufp); i--; continue; @@ -804,7 +844,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i - 1); if (cur_node.tgt_offset) cur_node.tgt_offset -= sizes[i]; - gomp_copy_host2dev (devicep, + gomp_copy_host2dev (devicep, aq, (void *) (n->tgt->tgt_start + n->tgt_offset + cur_node.host_start @@ -825,7 +865,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, k->host_end = k->host_start + sizeof (void *); splay_tree_key n = splay_tree_lookup (mem_map, k); if (n && n->refcount != REFCOUNT_LINK) - gomp_map_vars_existing (devicep, n, k, &tgt->list[i], + gomp_map_vars_existing (devicep, aq, n, k, &tgt->list[i], kind & typemask, cbufp); else { @@ -878,18 +918,19 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, case GOMP_MAP_FORCE_TOFROM: case GOMP_MAP_ALWAYS_TO: case GOMP_MAP_ALWAYS_TOFROM: - gomp_copy_host2dev (devicep, + gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + k->tgt_offset), (void *) k->host_start, k->host_end - k->host_start, cbufp); break; case GOMP_MAP_POINTER: - gomp_map_pointer (tgt, (uintptr_t) *(void **) k->host_start, + gomp_map_pointer (tgt, aq, + (uintptr_t) *(void **) k->host_start, k->tgt_offset, sizes[i], cbufp); break; case GOMP_MAP_TO_PSET: - gomp_copy_host2dev (devicep, + gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + k->tgt_offset), (void *) k->host_start, @@ -911,7 +952,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, tgt->list[j].always_copy_from = false; if (k->refcount != REFCOUNT_INFINITY) k->refcount++; - gomp_map_pointer (tgt, + gomp_map_pointer (tgt, aq, (uintptr_t) *(void **) hostaddrs[j], k->tgt_offset + ((uintptr_t) hostaddrs[j] @@ -940,7 +981,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, break; case GOMP_MAP_FORCE_DEVICEPTR: assert (k->host_end - k->host_start == sizeof (void *)); - gomp_copy_host2dev (devicep, + gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + k->tgt_offset), (void *) k->host_start, @@ -957,9 +998,8 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, /* Set link pointer on target to the device address of the mapped object. */ void *tgt_addr = (void *) (tgt->tgt_start + k->tgt_offset); - devicep->host2dev_func (devicep->target_id, - (void *) n->tgt_offset, - &tgt_addr, sizeof (void *)); + gomp_copy_host2dev (devicep, aq, (void *) n->tgt_offset, + &tgt_addr, sizeof (void *), cbufp); } array++; } @@ -971,7 +1011,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, for (i = 0; i < mapnum; i++) { cur_node.tgt_offset = gomp_map_val (tgt, hostaddrs, i); - gomp_copy_host2dev (devicep, + gomp_copy_host2dev (devicep, aq, (void *) (tgt->tgt_start + i * sizeof (void *)), (void *) &cur_node.tgt_offset, sizeof (void *), cbufp); @@ -982,7 +1022,8 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, { long c = 0; for (c = 0; c < cbuf.chunk_cnt; ++c) - gomp_copy_host2dev (devicep, (void *) (tgt->tgt_start + cbuf.chunks[2 * c]), + gomp_copy_host2dev (devicep, aq, + (void *) (tgt->tgt_start + cbuf.chunks[2 * c]), (char *) cbuf.buf + (cbuf.chunks[2 * c] - cbuf.chunks[0]), cbuf.chunks[2 * c + 1] - cbuf.chunks[2 * c], NULL); free (cbuf.buf); @@ -1001,7 +1042,7 @@ gomp_map_vars (struct gomp_device_descr *devicep, size_t mapnum, return tgt; } -static void +attribute_hidden void gomp_unmap_tgt (struct target_mem_desc *tgt) { /* Deallocate on target the tgt->tgt_start .. tgt->tgt_end region. */ @@ -1036,6 +1077,13 @@ gomp_remove_var (struct gomp_device_descr *devicep, splay_tree_key k) attribute_hidden void gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom) { + gomp_unmap_vars_async (tgt, do_copyfrom, NULL); +} + +attribute_hidden void +gomp_unmap_vars_async (struct target_mem_desc *tgt, bool do_copyfrom, + struct goacc_asyncqueue *aq) +{ struct gomp_device_descr *devicep = tgt->device_descr; if (tgt->list_count == 0) @@ -1071,7 +1119,7 @@ gomp_unmap_vars (struct target_mem_desc *tgt, bool do_copyfrom) if ((do_unmap && do_copyfrom && tgt->list[i].copy_from) || tgt->list[i].always_copy_from) - gomp_copy_dev2host (devicep, + gomp_copy_dev2host (devicep, aq, (void *) (k->host_start + tgt->list[i].offset), (void *) (k->tgt->tgt_start + k->tgt_offset + tgt->list[i].offset), @@ -1137,9 +1185,10 @@ gomp_update (struct gomp_device_descr *devicep, size_t mapnum, void **hostaddrs, size_t size = cur_node.host_end - cur_node.host_start; if (GOMP_MAP_COPY_TO_P (kind & typemask)) - gomp_copy_host2dev (devicep, devaddr, hostaddr, size, NULL); + gomp_copy_host2dev (devicep, NULL, devaddr, hostaddr, size, + NULL); if (GOMP_MAP_COPY_FROM_P (kind & typemask)) - gomp_copy_dev2host (devicep, hostaddr, devaddr, size); + gomp_copy_dev2host (devicep, NULL, hostaddr, devaddr, size); } } gomp_mutex_unlock (&devicep->lock); @@ -1432,9 +1481,21 @@ gomp_init_device (struct gomp_device_descr *devicep) false); } + /* Initialize OpenACC asynchronous queues. */ + goacc_init_asyncqueues (devicep); + devicep->state = GOMP_DEVICE_INITIALIZED; } +attribute_hidden bool +gomp_fini_device (struct gomp_device_descr *devicep) +{ + bool ret = goacc_fini_asyncqueues (devicep); + ret &= devicep->fini_device_func (devicep->target_id); + devicep->state = GOMP_DEVICE_FINALIZED; + return ret; +} + attribute_hidden void gomp_unload_device (struct gomp_device_descr *devicep) { @@ -1924,7 +1985,7 @@ gomp_exit_data (struct gomp_device_descr *devicep, size_t mapnum, if ((kind == GOMP_MAP_FROM && k->refcount == 0) || kind == GOMP_MAP_ALWAYS_FROM) - gomp_copy_dev2host (devicep, (void *) cur_node.host_start, + gomp_copy_dev2host (devicep, NULL, (void *) cur_node.host_start, (void *) (k->tgt->tgt_start + k->tgt_offset + cur_node.host_start - k->host_start), @@ -2557,20 +2618,20 @@ gomp_load_plugin_for_device (struct gomp_device_descr *device, if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200) { if (!DLSYM_OPT (openacc.exec, openacc_exec) - || !DLSYM_OPT (openacc.register_async_cleanup, - openacc_register_async_cleanup) - || !DLSYM_OPT (openacc.async_test, openacc_async_test) - || !DLSYM_OPT (openacc.async_test_all, openacc_async_test_all) - || !DLSYM_OPT (openacc.async_wait, openacc_async_wait) - || !DLSYM_OPT (openacc.async_wait_async, openacc_async_wait_async) - || !DLSYM_OPT (openacc.async_wait_all, openacc_async_wait_all) - || !DLSYM_OPT (openacc.async_wait_all_async, - openacc_async_wait_all_async) - || !DLSYM_OPT (openacc.async_set_async, openacc_async_set_async) || !DLSYM_OPT (openacc.create_thread_data, openacc_create_thread_data) || !DLSYM_OPT (openacc.destroy_thread_data, - openacc_destroy_thread_data)) + openacc_destroy_thread_data) + || !DLSYM_OPT (openacc.async.construct, openacc_async_construct) + || !DLSYM_OPT (openacc.async.destruct, openacc_async_destruct) + || !DLSYM_OPT (openacc.async.test, openacc_async_test) + || !DLSYM_OPT (openacc.async.synchronize, openacc_async_synchronize) + || !DLSYM_OPT (openacc.async.serialize, openacc_async_serialize) + || !DLSYM_OPT (openacc.async.queue_callback, + openacc_async_queue_callback) + || !DLSYM_OPT (openacc.async.exec, openacc_async_exec) + || !DLSYM_OPT (openacc.async.dev2host, openacc_async_dev2host) + || !DLSYM_OPT (openacc.async.host2dev, openacc_async_host2dev)) { /* Require all the OpenACC handlers if we have GOMP_OFFLOAD_CAP_OPENACC_200. */ @@ -2621,10 +2682,7 @@ gomp_target_fini (void) struct gomp_device_descr *devicep = &devices[i]; gomp_mutex_lock (&devicep->lock); if (devicep->state == GOMP_DEVICE_INITIALIZED) - { - ret = devicep->fini_device_func (devicep->target_id); - devicep->state = GOMP_DEVICE_FINALIZED; - } + ret = gomp_fini_device (devicep); gomp_mutex_unlock (&devicep->lock); if (!ret) gomp_fatal ("device finalization failed"); From patchwork Tue Sep 25 13:11:42 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chung-Lin Tang X-Patchwork-Id: 974407 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=gcc-patches-return-486327-incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=mentor.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b="ECJqRXoU"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 42KM0z2Dfmz9s3C for ; Tue, 25 Sep 2018 23:12:03 +1000 (AEST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; q=dns; s=default; b=mOlYcOZKh/d6WtQzTMdI7lIA85eKB f//zAkySUb6Zq2OUnDWb3+Ul1KDNylL1IuT1HG/pJrGWeC7bZMlzumQHAYMkadeO emp+a2nu2qxKgpncabI5d0n6MZjBDBSD0k0RQRn96myF6yuUYYsaUybGM+gt1hDW LxAugw7JjeSxXg= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; s=default; bh=SXQGVf504W5Cc2N8j4GpK0bA2LA=; b=ECJ qRXoUA6tATCnsg57LImBCarB8MjBpXMpWW9m05WG4fb88ddNdrIgfVBfhEGk4Ed4 xs1wZdP3ektb3DmTh1/es6Qk+IVjQMQ/95ybHCoXeP8xLNLjqegixQANFoCGS4IP SW715YJ71ZT4y6g4alyYvxe7VAaWU/05QLnF5erY= Received: (qmail 124256 invoked by alias); 25 Sep 2018 13:11:55 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 124186 invoked by uid 89); 25 Sep 2018 13:11:54 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-24.9 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, RCVD_IN_DNSWL_NONE, SPF_PASS autolearn=ham version=3.3.2 spammy=2.0 X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 25 Sep 2018 13:11:50 +0000 Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202]) by relay1.mentorg.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-SHA384:256) id 1g4n80-0004LV-IT from ChungLin_Tang@mentor.com for gcc-patches@gcc.gnu.org; Tue, 25 Sep 2018 06:11:48 -0700 Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com (147.34.90.202) with Microsoft SMTP Server (TLS) id 15.0.1320.4; Tue, 25 Sep 2018 06:11:45 -0700 Reply-To: From: Chung-Lin Tang Subject: [PATCH 5/6, OpenACC, libgomp] Async re-work, C/C++ testsuite changes To: , Thomas Schwinge Message-ID: <8086c63b-f729-891b-3d21-76871d360734@mentor.com> Date: Tue, 25 Sep 2018 21:11:42 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 These are the testsuite/libgomp.oacc-c-c++-common/* changes. Thanks, Chung-Lin * testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c: New testcase. * testsuite/libgomp.oacc-c-c++-common/data-2-lib.c: Adjust testcase. * testsuite/libgomp.oacc-c-c++-common/data-2.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/data-3.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/lib-71.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/lib-77.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/lib-79.c: Likewise. * testsuite/libgomp.oacc-c-c++-common/lib-81.c: Likewise. diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c new file mode 100644 index 0000000..9420540 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/asyncwait-2.c @@ -0,0 +1,904 @@ +/* { dg-do run { target openacc_nvidia_accel_selected } } */ +/* { dg-additional-options "-lcuda" } */ + +#include +#include +#include + +#include +#include +#include + +int +main (int argc, char **argv) +{ + CUresult r; + CUstream stream1; + int N = 128; //1024 * 1024; + float *a, *b, *c, *d, *e; + int i; + int nbytes; + + srand (time (NULL)); + int s = rand () % 100; + + acc_init (acc_device_nvidia); + + nbytes = N * sizeof (float); + + a = (float *) malloc (nbytes); + b = (float *) malloc (nbytes); + c = (float *) malloc (nbytes); + d = (float *) malloc (nbytes); + e = (float *) malloc (nbytes); + + for (i = 0; i < N; i++) + { + a[i] = 3.0; + b[i] = 0.0; + } + + acc_set_default_async (s); + +#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N) + { + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = a[ii]; + } + +#pragma acc wait + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 3.0) + abort (); + + if (b[i] != 3.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 2.0; + b[i] = 0.0; + } + +#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N) + { + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = a[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 2.0) + abort (); + + if (b[i] != 2.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 3.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + } + +#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N) + { + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 3.0) + abort (); + + if (b[i] != 9.0) + abort (); + + if (c[i] != 4.0) + abort (); + + if (d[i] != 1.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 2.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + e[i] = 0.0; + } + +#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N) + { + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; + } + +#pragma acc parallel wait (s) async (s) + { + int ii; + + for (ii = 0; ii < N; ii++) + e[ii] = a[ii] + b[ii] + c[ii] + d[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 2.0) + abort (); + + if (b[i] != 4.0) + abort (); + + if (c[i] != 4.0) + abort (); + + if (d[i] != 1.0) + abort (); + + if (e[i] != 11.0) + abort (); + } + + + r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING); + if (r != CUDA_SUCCESS) + { + fprintf (stderr, "cuStreamCreate failed: %d\n", r); + abort (); + } + + acc_set_cuda_stream (1, stream1); + + for (i = 0; i < N; i++) + { + a[i] = 5.0; + b[i] = 0.0; + } + +#pragma acc data copy (a[0:N], b[0:N]) copyin (N) + { + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = a[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 5.0) + abort (); + + if (b[i] != 5.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 7.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + } + +#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N) + { + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 7.0) + abort (); + + if (b[i] != 49.0) + abort (); + + if (c[i] != 4.0) + abort (); + + if (d[i] != 1.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 3.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + e[i] = 0.0; + } + +#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N) + { + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; + } + +#pragma acc parallel wait (s) async (s) + { + int ii; + + for (ii = 0; ii < N; ii++) + e[ii] = a[ii] + b[ii] + c[ii] + d[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 3.0) + abort (); + + if (b[i] != 9.0) + abort (); + + if (c[i] != 4.0) + abort (); + + if (d[i] != 1.0) + abort (); + + if (e[i] != 17.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 4.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + e[i] = 0.0; + } + +#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N) + { + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + +#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 4.0) + abort (); + + if (b[i] != 16.0) + abort (); + + if (c[i] != 4.0) + abort (); + } + + + for (i = 0; i < N; i++) + { + a[i] = 5.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + e[i] = 0.0; + } + +#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N) + { + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc parallel async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + +#pragma acc update host (a[0:N], b[0:N], c[0:N]) async + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 5.0) + abort (); + + if (b[i] != 25.0) + abort (); + + if (c[i] != 4.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 3.0; + b[i] = 0.0; + } + +#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N) + { + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = a[ii]; + } + +#pragma acc wait + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 3.0) + abort (); + + if (b[i] != 3.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 2.0; + b[i] = 0.0; + } + +#pragma acc data copy (a[0:N]) copy (b[0:N]) copyin (N) + { + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = a[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 2.0) + abort (); + + if (b[i] != 2.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 3.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + } + +#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N) + { + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 3.0) + abort (); + + if (b[i] != 9.0) + abort (); + + if (c[i] != 4.0) + abort (); + + if (d[i] != 1.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 2.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + e[i] = 0.0; + } + +#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N) + { + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; + } + +#pragma acc kernels wait (s) async (s) + { + int ii; + + for (ii = 0; ii < N; ii++) + e[ii] = a[ii] + b[ii] + c[ii] + d[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 2.0) + abort (); + + if (b[i] != 4.0) + abort (); + + if (c[i] != 4.0) + abort (); + + if (d[i] != 1.0) + abort (); + + if (e[i] != 11.0) + abort (); + } + + + r = cuStreamCreate (&stream1, CU_STREAM_NON_BLOCKING); + if (r != CUDA_SUCCESS) + { + fprintf (stderr, "cuStreamCreate failed: %d\n", r); + abort (); + } + + acc_set_cuda_stream (1, stream1); + + for (i = 0; i < N; i++) + { + a[i] = 5.0; + b[i] = 0.0; + } + +#pragma acc data copy (a[0:N], b[0:N]) copyin (N) + { + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = a[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 5.0) + abort (); + + if (b[i] != 5.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 7.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + } + +#pragma acc data copy (a[0:N]) copy (b[0:N]) copy (c[0:N]) copy (d[0:N]) copyin (N) + { + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 7.0) + abort (); + + if (b[i] != 49.0) + abort (); + + if (c[i] != 4.0) + abort (); + + if (d[i] != 1.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 3.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + e[i] = 0.0; + } + +#pragma acc data copy (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) copyin (N) + { + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; + } + +#pragma acc kernels wait (s) async (s) + { + int ii; + + for (ii = 0; ii < N; ii++) + e[ii] = a[ii] + b[ii] + c[ii] + d[ii]; + } + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 3.0) + abort (); + + if (b[i] != 9.0) + abort (); + + if (c[i] != 4.0) + abort (); + + if (d[i] != 1.0) + abort (); + + if (e[i] != 17.0) + abort (); + } + + for (i = 0; i < N; i++) + { + a[i] = 4.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + e[i] = 0.0; + } + +#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N) + { + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + +#pragma acc update host (a[0:N], b[0:N], c[0:N]) wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 4.0) + abort (); + + if (b[i] != 16.0) + abort (); + + if (c[i] != 4.0) + abort (); + } + + + for (i = 0; i < N; i++) + { + a[i] = 5.0; + b[i] = 0.0; + c[i] = 0.0; + d[i] = 0.0; + e[i] = 0.0; + } + +#pragma acc data copyin (a[0:N], b[0:N], c[0:N]) copyin (N) + { + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + b[ii] = (a[ii] * a[ii] * a[ii]) / a[ii]; + } + +#pragma acc kernels async + { + int ii; + + for (ii = 0; ii < N; ii++) + c[ii] = (a[ii] + a[ii] + a[ii] + a[ii]) / a[ii]; + } + +#pragma acc update host (a[0:N], b[0:N], c[0:N]) async + +#pragma acc wait (s) + + } + + for (i = 0; i < N; i++) + { + if (a[i] != 5.0) + abort (); + + if (b[i] != 25.0) + abort (); + + if (c[i] != 4.0) + abort (); + } + + acc_shutdown (acc_device_nvidia); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c index 2ddfa7d..f553d3d 100644 --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2-lib.c @@ -153,7 +153,7 @@ main (int argc, char **argv) d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N], N) \ - async (4) + wait (1, 2, 3) async (4) for (int ii = 0; ii < N; ii++) e[ii] = a[ii] + b[ii] + c[ii] + d[ii]; diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c index 0c6abe6..81d623a 100644 --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-2.c @@ -162,7 +162,7 @@ main (int argc, char **argv) d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \ - wait (1) async (4) + wait (1, 2, 3) async (4) for (int ii = 0; ii < N; ii++) e[ii] = a[ii] + b[ii] + c[ii] + d[ii]; diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c index 0bf706a..5ec50b8 100644 --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/data-3.c @@ -138,7 +138,7 @@ main (int argc, char **argv) d[ii] = ((a[ii] * a[ii] + a[ii]) / a[ii]) - a[ii]; #pragma acc parallel present (a[0:N], b[0:N], c[0:N], d[0:N], e[0:N]) \ - wait (1,5) async (4) + wait (1, 2, 3, 5) async (4) for (int ii = 0; ii < N; ii++) e[ii] = a[ii] + b[ii] + c[ii] + d[ii]; diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c index c85e824..6afe2a0 100644 --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-71.c @@ -92,16 +92,22 @@ main (int argc, char **argv) abort (); } - fprintf (stderr, "CheCKpOInT\n"); - if (acc_async_test (1) != 0) + if (acc_async_test (0) != 0) { fprintf (stderr, "asynchronous operation not running\n"); abort (); } + /* Test unseen async number. */ + if (acc_async_test (1) != 1) + { + fprintf (stderr, "acc_async_test failed on unseen number\n"); + abort (); + } + sleep ((int) (dtime / 1000.0f) + 1); - if (acc_async_test (1) != 1) + if (acc_async_test (0) != 1) { fprintf (stderr, "found asynchronous operation still running\n"); abort (); @@ -116,7 +122,3 @@ main (int argc, char **argv) return 0; } - -/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */ -/* { dg-output "unknown async \[0-9\]+" } */ -/* { dg-shouldfail "" } */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c index f4f196d..2821f88 100644 --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-77.c @@ -111,7 +111,7 @@ main (int argc, char **argv) start_timer (0); - acc_wait (1); + acc_wait (0); atime = stop_timer (0); @@ -132,7 +132,3 @@ main (int argc, char **argv) return 0; } - -/* { dg-output "CheCKpOInT(\n|\r\n|\r).*" } */ -/* { dg-output "unknown async \[0-9\]+" } */ -/* { dg-shouldfail "" } */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c index ef3df13..b22af26 100644 --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-79.c @@ -114,6 +114,7 @@ main (int argc, char **argv) for (i = 0; i < N; i++) { + stream = (CUstream) acc_get_cuda_stream (i & 1); r = cuLaunchKernel (delay, 1, 1, 1, 1, 1, 1, 0, stream, kargs, 0); if (r != CUDA_SUCCESS) { @@ -122,11 +123,11 @@ main (int argc, char **argv) } } - acc_wait_async (0, 1); - if (acc_async_test (0) != 0) abort (); + acc_wait_async (0, 1); + if (acc_async_test (1) != 0) abort (); diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c index d5f18f0..30a4b57 100644 --- a/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/lib-81.c @@ -133,7 +133,7 @@ main (int argc, char **argv) for (i = 0; i <= N; i++) { - if (acc_async_test (i) != 0) + if (acc_async_test (i) == 0) abort (); } From patchwork Tue Sep 25 13:11:58 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chung-Lin Tang X-Patchwork-Id: 974408 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org (client-ip=209.132.180.131; helo=sourceware.org; envelope-from=gcc-patches-return-486328-incoming=patchwork.ozlabs.org@gcc.gnu.org; receiver=) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=mentor.com Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b="o4eQ8AZw"; dkim-atps=neutral Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 42KM1L2ggwz9s47 for ; Tue, 25 Sep 2018 23:12:22 +1000 (AEST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; q=dns; s=default; b=onrTeyfbY/uR9O6H39tXRuQnQs9DM oQcmx+IO2a+ip6VOIfw1NqVvX47uckgisRByjnvWOKKPWSVWHwLssLjb/KJUcK0L 01uYPdvYsaiWhKye7M4nTaG+7WJeG7jxM8icCpP+GCX0WlReKeqMmuyY3PkahmJR ou0TrF6yiETcJU= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :reply-to:from:subject:to:message-id:date:mime-version :content-type; s=default; bh=BYBGo9Q5JcQNs+oP860zOtBiNBc=; b=o4e Q8AZwPEhVJYOfnJpts2zK3fekhXSpzV8YkN0ytjT4HAXtlQ040AiCP1pOs4qPO3g wsXak/C088X9LgNpkCQgdEW5hZ1aKWyKJaMr94G5o886t0Ph7IuKMcfXjhHcyMhK 8yPO1mJpGXKebkNBMTPDyxw6U+/YOYfZUTDGHwrw= Received: (qmail 125804 invoked by alias); 25 Sep 2018 13:12:13 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 125784 invoked by uid 89); 25 Sep 2018 13:12:12 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-24.9 required=5.0 tests=AWL, BAYES_00, GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3, RCVD_IN_DNSWL_NONE, SPF_PASS autolearn=ham version=3.3.2 spammy=claimed, initiated, **h, teams X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 25 Sep 2018 13:12:07 +0000 Received: from svr-orw-mbx-02.mgc.mentorg.com ([147.34.90.202]) by relay1.mentorg.com with esmtps (TLSv1.2:ECDHE-RSA-AES256-SHA384:256) id 1g4n8H-0004Nj-2e from ChungLin_Tang@mentor.com ; Tue, 25 Sep 2018 06:12:05 -0700 Received: from [0.0.0.0] (147.34.91.1) by svr-orw-mbx-02.mgc.mentorg.com (147.34.90.202) with Microsoft SMTP Server (TLS) id 15.0.1320.4; Tue, 25 Sep 2018 06:12:01 -0700 Reply-To: From: Chung-Lin Tang Subject: [PATCH 6/6, OpenACC, libgomp] Async re-work, nvptx changes To: , Tom de Vries Message-ID: <9523b49a-0454-e0a9-826d-5eeec2a8c973@mentor.com> Date: Tue, 25 Sep 2018 21:11:58 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 Hi Tom, this patch removes large portions of plugin/plugin-nvptx.c, since a lot of it is now in oacc-async.c now. The new code is essentially a NVPTX/CUDA-specific implementation of the new-style goacc_asyncqueues. Also, some needed functions in cuda-lib.def are added. The cuda.h function has also been updated to build independently without a CUDA installation. Thanks, Chung-Lin * plugin/plugin-nvptx.c (struct cuda_map): Remove. (struct ptx_stream): Remove. (struct nvptx_thread): Remove current_stream field. (cuda_map_create): Remove. (cuda_map_destroy): Remove. (map_init): Remove. (map_fini): Remove. (map_pop): Remove. (map_push): Remove. (struct goacc_asyncqueue): Define. (struct nvptx_callback): Define. (struct ptx_free_block): Define. (struct ptx_device): Remove null_stream, active_streams, async_streams, stream_lock, and next fields. (enum ptx_event_type): Remove. (struct ptx_event): Remove. (ptx_event_lock): Remove. (ptx_events): Remove. (init_streams_for_device): Remove. (fini_streams_for_device): Remove. (select_stream_for_async): Remove. (nvptx_init): Remove ptx_events and ptx_event_lock references. (nvptx_attach_host_thread_to_device): Remove CUDA_ERROR_NOT_PERMITTED case. (nvptx_open_device): Add free_blocks initialization, remove init_streams_for_device call. (nvptx_close_device): Remove fini_streams_for_device call, add free_blocks destruct code. (event_gc): Remove. (event_add): Remove. (nvptx_exec): Adjust parameters and code. (nvptx_free): Likewise. (nvptx_host2dev): Remove. (nvptx_dev2host): Remove. (nvptx_set_async): Remove. (nvptx_async_test): Remove. (nvptx_async_test_all): Remove. (nvptx_wait): Remove. (nvptx_wait_async): Remove. (nvptx_wait_all): Remove. (nvptx_wait_all_async): Remove. (nvptx_get_cuda_stream): Remove. (nvptx_set_cuda_stream): Remove. (GOMP_OFFLOAD_alloc): Adjust code. (GOMP_OFFLOAD_free): Likewise. (GOMP_OFFLOAD_openacc_register_async_cleanup): Remove. (GOMP_OFFLOAD_openacc_exec): Adjust parameters and code. (GOMP_OFFLOAD_openacc_async_test_all): Remove. (GOMP_OFFLOAD_openacc_async_wait): Remove. (GOMP_OFFLOAD_openacc_async_wait_async): Remove. (GOMP_OFFLOAD_openacc_async_wait_all): Remove. (GOMP_OFFLOAD_openacc_async_wait_all_async): Remove. (GOMP_OFFLOAD_openacc_async_set_async): Remove. (cuda_free_argmem): New function. (GOMP_OFFLOAD_openacc_async_exec): New plugin hook function. (GOMP_OFFLOAD_openacc_create_thread_data): Adjust code. (GOMP_OFFLOAD_openacc_cuda_get_stream): Adjust code. (GOMP_OFFLOAD_openacc_cuda_set_stream): Adjust code. (GOMP_OFFLOAD_openacc_async_construct): New plugin hook function. (GOMP_OFFLOAD_openacc_async_destruct): New plugin hook function. (GOMP_OFFLOAD_openacc_async_test): Remove and re-implement. (GOMP_OFFLOAD_openacc_async_synchronize): New plugin hook function. (GOMP_OFFLOAD_openacc_async_serialize): New plugin hook function. (GOMP_OFFLOAD_openacc_async_queue_callback): New plugin hook function. (cuda_callback_wrapper): New function. (cuda_memcpy_sanity_check): New function. (GOMP_OFFLOAD_host2dev): Remove and re-implement. (GOMP_OFFLOAD_dev2host): Remove and re-implement. (GOMP_OFFLOAD_openacc_async_host2dev): New plugin hook function. (GOMP_OFFLOAD_openacc_async_dev2host): New plugin hook function. diff --git a/libgomp/plugin/cuda-lib.def b/libgomp/plugin/cuda-lib.def index b2a4c21..a16badc 100644 --- a/libgomp/plugin/cuda-lib.def +++ b/libgomp/plugin/cuda-lib.def @@ -42,6 +42,7 @@ CUDA_ONE_CALL (cuModuleLoad) CUDA_ONE_CALL (cuModuleLoadData) CUDA_ONE_CALL (cuModuleUnload) CUDA_ONE_CALL_MAYBE_NULL (cuOccupancyMaxPotentialBlockSize) +CUDA_ONE_CALL (cuStreamAddCallback) CUDA_ONE_CALL (cuStreamCreate) CUDA_ONE_CALL (cuStreamDestroy) CUDA_ONE_CALL (cuStreamQuery) diff --git a/libgomp/plugin/cuda/cuda.h b/libgomp/plugin/cuda/cuda.h index b4c1b29..326db54 100644 --- a/libgomp/plugin/cuda/cuda.h +++ b/libgomp/plugin/cuda/cuda.h @@ -54,7 +54,11 @@ typedef enum { CUDA_ERROR_INVALID_CONTEXT = 201, CUDA_ERROR_NOT_FOUND = 500, CUDA_ERROR_NOT_READY = 600, - CUDA_ERROR_LAUNCH_FAILED = 719 + CUDA_ERROR_LAUNCH_FAILED = 719, + CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE = 720, + CUDA_ERROR_NOT_PERMITTED = 800, + CUDA_ERROR_NOT_SUPPORTED = 801, + CUDA_ERROR_UNKNOWN = 999 } CUresult; typedef enum { @@ -173,6 +177,8 @@ CUresult cuModuleLoadData (CUmodule *, const void *); CUresult cuModuleUnload (CUmodule); CUresult cuOccupancyMaxPotentialBlockSize(int *, int *, CUfunction, CUoccupancyB2DSize, size_t, int); +typedef void (*CUstreamCallback)(CUstream, CUresult, void *); +CUresult cuStreamAddCallback(CUstream, CUstreamCallback, void *, unsigned int); CUresult cuStreamCreate (CUstream *, unsigned); #define cuStreamDestroy cuStreamDestroy_v2 CUresult cuStreamDestroy (CUstream); diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c index bae1b05..2959df2 100644 --- a/libgomp/plugin/plugin-nvptx.c +++ b/libgomp/plugin/plugin-nvptx.c @@ -192,128 +192,30 @@ cuda_error (CUresult r) static unsigned int instantiated_devices = 0; static pthread_mutex_t ptx_dev_lock = PTHREAD_MUTEX_INITIALIZER; -struct ptx_stream +/* NVPTX/CUDA specific definition of asynchronous queues. */ +struct goacc_asyncqueue { - CUstream stream; - pthread_t host_thread; - bool multithreaded; + CUstream cuda_stream; +}; - CUdeviceptr d; - void *h; - void *h_begin; - void *h_end; - void *h_next; - void *h_prev; - void *h_tail; - - struct ptx_stream *next; +struct nvptx_callback +{ + void (*fn) (void *); + void *ptr; + struct goacc_asyncqueue *aq; + struct nvptx_callback *next; }; /* Thread-specific data for PTX. */ struct nvptx_thread { - struct ptx_stream *current_stream; + /* We currently have this embedded inside the plugin because libgomp manages + devices through integer target_ids. This might be better if using an + opaque target-specific pointer directly from gomp_device_descr. */ struct ptx_device *ptx_dev; }; -static bool -map_init (struct ptx_stream *s) -{ - int size = getpagesize (); - - assert (s); - assert (!s->d); - assert (!s->h); - - CUDA_CALL (cuMemAllocHost, &s->h, size); - CUDA_CALL (cuMemHostGetDevicePointer, &s->d, s->h, 0); - - assert (s->h); - - s->h_begin = s->h; - s->h_end = s->h_begin + size; - s->h_next = s->h_prev = s->h_tail = s->h_begin; - - assert (s->h_next); - assert (s->h_end); - return true; -} - -static bool -map_fini (struct ptx_stream *s) -{ - CUDA_CALL (cuMemFreeHost, s->h); - return true; -} - -static void -map_pop (struct ptx_stream *s) -{ - assert (s != NULL); - assert (s->h_next); - assert (s->h_prev); - assert (s->h_tail); - - s->h_tail = s->h_next; - - if (s->h_tail >= s->h_end) - s->h_tail = s->h_begin + (int) (s->h_tail - s->h_end); - - if (s->h_next == s->h_tail) - s->h_prev = s->h_next; - - assert (s->h_next >= s->h_begin); - assert (s->h_tail >= s->h_begin); - assert (s->h_prev >= s->h_begin); - - assert (s->h_next <= s->h_end); - assert (s->h_tail <= s->h_end); - assert (s->h_prev <= s->h_end); -} - -static void -map_push (struct ptx_stream *s, size_t size, void **h, void **d) -{ - int left; - int offset; - - assert (s != NULL); - - left = s->h_end - s->h_next; - - assert (s->h_prev); - assert (s->h_next); - - if (size >= left) - { - assert (s->h_next == s->h_prev); - s->h_next = s->h_prev = s->h_tail = s->h_begin; - } - - assert (s->h_next); - - offset = s->h_next - s->h; - - *d = (void *)(s->d + offset); - *h = (void *)(s->h + offset); - - s->h_prev = s->h_next; - s->h_next += size; - - assert (s->h_prev); - assert (s->h_next); - - assert (s->h_next >= s->h_begin); - assert (s->h_tail >= s->h_begin); - assert (s->h_prev >= s->h_begin); - assert (s->h_next <= s->h_end); - assert (s->h_tail <= s->h_end); - assert (s->h_prev <= s->h_end); - - return; -} - /* Target data function launch information. */ struct targ_fn_launch @@ -365,22 +267,18 @@ struct ptx_image_data struct ptx_image_data *next; }; +struct ptx_free_block +{ + void *ptr; + struct ptx_free_block *next; +}; + struct ptx_device { CUcontext ctx; bool ctx_shared; CUdevice dev; - struct ptx_stream *null_stream; - /* All non-null streams associated with this device (actually context), - either created implicitly or passed in from the user (via - acc_set_cuda_stream). */ - struct ptx_stream *active_streams; - struct { - struct ptx_stream **arr; - int size; - } async_streams; - /* A lock for use when manipulating the above stream list and array. */ - pthread_mutex_t stream_lock; + int ord; bool overlap; bool map; @@ -398,32 +296,13 @@ struct ptx_device struct ptx_image_data *images; /* Images loaded on device. */ pthread_mutex_t image_lock; /* Lock for above list. */ - - struct ptx_device *next; -}; - -enum ptx_event_type -{ - PTX_EVT_MEM, - PTX_EVT_KNL, - PTX_EVT_SYNC, - PTX_EVT_ASYNC_CLEANUP -}; -struct ptx_event -{ - CUevent *evt; - int type; - void *addr; - int ord; - int val; + struct ptx_free_block *free_blocks; + pthread_mutex_t free_blocks_lock; - struct ptx_event *next; + struct ptx_device *next; }; -static pthread_mutex_t ptx_event_lock; -static struct ptx_event *ptx_events; - static struct ptx_device **ptx_devices; static inline struct nvptx_thread * @@ -432,197 +311,6 @@ nvptx_thread (void) return (struct nvptx_thread *) GOMP_PLUGIN_acc_thread (); } -static bool -init_streams_for_device (struct ptx_device *ptx_dev, int concurrency) -{ - int i; - struct ptx_stream *null_stream - = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream)); - - null_stream->stream = NULL; - null_stream->host_thread = pthread_self (); - null_stream->multithreaded = true; - null_stream->d = (CUdeviceptr) NULL; - null_stream->h = NULL; - if (!map_init (null_stream)) - return false; - - ptx_dev->null_stream = null_stream; - ptx_dev->active_streams = NULL; - pthread_mutex_init (&ptx_dev->stream_lock, NULL); - - if (concurrency < 1) - concurrency = 1; - - /* This is just a guess -- make space for as many async streams as the - current device is capable of concurrently executing. This can grow - later as necessary. No streams are created yet. */ - ptx_dev->async_streams.arr - = GOMP_PLUGIN_malloc (concurrency * sizeof (struct ptx_stream *)); - ptx_dev->async_streams.size = concurrency; - - for (i = 0; i < concurrency; i++) - ptx_dev->async_streams.arr[i] = NULL; - - return true; -} - -static bool -fini_streams_for_device (struct ptx_device *ptx_dev) -{ - free (ptx_dev->async_streams.arr); - - bool ret = true; - while (ptx_dev->active_streams != NULL) - { - struct ptx_stream *s = ptx_dev->active_streams; - ptx_dev->active_streams = ptx_dev->active_streams->next; - - ret &= map_fini (s); - - CUresult r = CUDA_CALL_NOCHECK (cuStreamDestroy, s->stream); - if (r != CUDA_SUCCESS) - { - GOMP_PLUGIN_error ("cuStreamDestroy error: %s", cuda_error (r)); - ret = false; - } - free (s); - } - - ret &= map_fini (ptx_dev->null_stream); - free (ptx_dev->null_stream); - return ret; -} - -/* Select a stream for (OpenACC-semantics) ASYNC argument for the current - thread THREAD (and also current device/context). If CREATE is true, create - the stream if it does not exist (or use EXISTING if it is non-NULL), and - associate the stream with the same thread argument. Returns stream to use - as result. */ - -static struct ptx_stream * -select_stream_for_async (int async, pthread_t thread, bool create, - CUstream existing) -{ - struct nvptx_thread *nvthd = nvptx_thread (); - /* Local copy of TLS variable. */ - struct ptx_device *ptx_dev = nvthd->ptx_dev; - struct ptx_stream *stream = NULL; - int orig_async = async; - - /* The special value acc_async_noval (-1) maps (for now) to an - implicitly-created stream, which is then handled the same as any other - numbered async stream. Other options are available, e.g. using the null - stream for anonymous async operations, or choosing an idle stream from an - active set. But, stick with this for now. */ - if (async > acc_async_sync) - async++; - - if (create) - pthread_mutex_lock (&ptx_dev->stream_lock); - - /* NOTE: AFAICT there's no particular need for acc_async_sync to map to the - null stream, and in fact better performance may be obtainable if it doesn't - (because the null stream enforces overly-strict synchronisation with - respect to other streams for legacy reasons, and that's probably not - needed with OpenACC). Maybe investigate later. */ - if (async == acc_async_sync) - stream = ptx_dev->null_stream; - else if (async >= 0 && async < ptx_dev->async_streams.size - && ptx_dev->async_streams.arr[async] && !(create && existing)) - stream = ptx_dev->async_streams.arr[async]; - else if (async >= 0 && create) - { - if (async >= ptx_dev->async_streams.size) - { - int i, newsize = ptx_dev->async_streams.size * 2; - - if (async >= newsize) - newsize = async + 1; - - ptx_dev->async_streams.arr - = GOMP_PLUGIN_realloc (ptx_dev->async_streams.arr, - newsize * sizeof (struct ptx_stream *)); - - for (i = ptx_dev->async_streams.size; i < newsize; i++) - ptx_dev->async_streams.arr[i] = NULL; - - ptx_dev->async_streams.size = newsize; - } - - /* Create a new stream on-demand if there isn't one already, or if we're - setting a particular async value to an existing (externally-provided) - stream. */ - if (!ptx_dev->async_streams.arr[async] || existing) - { - CUresult r; - struct ptx_stream *s - = GOMP_PLUGIN_malloc (sizeof (struct ptx_stream)); - - if (existing) - s->stream = existing; - else - { - r = CUDA_CALL_NOCHECK (cuStreamCreate, &s->stream, - CU_STREAM_DEFAULT); - if (r != CUDA_SUCCESS) - { - pthread_mutex_unlock (&ptx_dev->stream_lock); - GOMP_PLUGIN_fatal ("cuStreamCreate error: %s", - cuda_error (r)); - } - } - - /* If CREATE is true, we're going to be queueing some work on this - stream. Associate it with the current host thread. */ - s->host_thread = thread; - s->multithreaded = false; - - s->d = (CUdeviceptr) NULL; - s->h = NULL; - if (!map_init (s)) - { - pthread_mutex_unlock (&ptx_dev->stream_lock); - GOMP_PLUGIN_fatal ("map_init fail"); - } - - s->next = ptx_dev->active_streams; - ptx_dev->active_streams = s; - ptx_dev->async_streams.arr[async] = s; - } - - stream = ptx_dev->async_streams.arr[async]; - } - else if (async < 0) - { - if (create) - pthread_mutex_unlock (&ptx_dev->stream_lock); - GOMP_PLUGIN_fatal ("bad async %d", async); - } - - if (create) - { - assert (stream != NULL); - - /* If we're trying to use the same stream from different threads - simultaneously, set stream->multithreaded to true. This affects the - behaviour of acc_async_test_all and acc_wait_all, which are supposed to - only wait for asynchronous launches from the same host thread they are - invoked on. If multiple threads use the same async value, we make note - of that here and fall back to testing/waiting for all threads in those - functions. */ - if (thread != stream->host_thread) - stream->multithreaded = true; - - pthread_mutex_unlock (&ptx_dev->stream_lock); - } - else if (stream && !stream->multithreaded - && !pthread_equal (stream->host_thread, thread)) - GOMP_PLUGIN_fatal ("async %d used on wrong thread", orig_async); - - return stream; -} - /* Initialize the device. Return TRUE on success, else FALSE. PTX_DEV_LOCK should be locked on entry and remains locked on exit. */ @@ -634,9 +322,6 @@ nvptx_init (void) if (instantiated_devices != 0) return true; - ptx_events = NULL; - pthread_mutex_init (&ptx_event_lock, NULL); - if (!init_cuda_lib ()) return false; @@ -660,6 +345,11 @@ nvptx_attach_host_thread_to_device (int n) CUcontext thd_ctx; r = CUDA_CALL_NOCHECK (cuCtxGetDevice, &dev); + if (r == CUDA_ERROR_NOT_PERMITTED) + { + /* Assume we're in a CUDA callback, just return true. */ + return true; + } if (r != CUDA_SUCCESS && r != CUDA_ERROR_INVALID_CONTEXT) { GOMP_PLUGIN_error ("cuCtxGetDevice error: %s", cuda_error (r)); @@ -804,8 +494,8 @@ nvptx_open_device (int n) ptx_dev->images = NULL; pthread_mutex_init (&ptx_dev->image_lock, NULL); - if (!init_streams_for_device (ptx_dev, async_engines)) - return NULL; + ptx_dev->free_blocks = NULL; + pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL); return ptx_dev; } @@ -816,9 +506,15 @@ nvptx_close_device (struct ptx_device *ptx_dev) if (!ptx_dev) return true; - if (!fini_streams_for_device (ptx_dev)) - return false; - + for (struct ptx_free_block *b = ptx_dev->free_blocks; b;) + { + struct ptx_free_block *b_next = b->next; + CUDA_CALL (cuMemFree, (CUdeviceptr) b->ptr); + free (b); + b = b_next; + } + + pthread_mutex_destroy (&ptx_dev->free_blocks_lock); pthread_mutex_destroy (&ptx_dev->image_lock); if (!ptx_dev->ctx_shared) @@ -998,138 +694,19 @@ link_ptx (CUmodule *module, const struct targ_ptx_obj *ptx_objs, } static void -event_gc (bool memmap_lockable) -{ - struct ptx_event *ptx_event = ptx_events; - struct ptx_event *async_cleanups = NULL; - struct nvptx_thread *nvthd = nvptx_thread (); - - pthread_mutex_lock (&ptx_event_lock); - - while (ptx_event != NULL) - { - CUresult r; - struct ptx_event *e = ptx_event; - - ptx_event = ptx_event->next; - - if (e->ord != nvthd->ptx_dev->ord) - continue; - - r = CUDA_CALL_NOCHECK (cuEventQuery, *e->evt); - if (r == CUDA_SUCCESS) - { - bool append_async = false; - CUevent *te; - - te = e->evt; - - switch (e->type) - { - case PTX_EVT_MEM: - case PTX_EVT_SYNC: - break; - - case PTX_EVT_KNL: - map_pop (e->addr); - break; - - case PTX_EVT_ASYNC_CLEANUP: - { - /* The function gomp_plugin_async_unmap_vars needs to claim the - memory-map splay tree lock for the current device, so we - can't call it when one of our callers has already claimed - the lock. In that case, just delay the GC for this event - until later. */ - if (!memmap_lockable) - continue; - - append_async = true; - } - break; - } - - CUDA_CALL_NOCHECK (cuEventDestroy, *te); - free ((void *)te); - - /* Unlink 'e' from ptx_events list. */ - if (ptx_events == e) - ptx_events = ptx_events->next; - else - { - struct ptx_event *e_ = ptx_events; - while (e_->next != e) - e_ = e_->next; - e_->next = e_->next->next; - } - - if (append_async) - { - e->next = async_cleanups; - async_cleanups = e; - } - else - free (e); - } - } - - pthread_mutex_unlock (&ptx_event_lock); - - /* We have to do these here, after ptx_event_lock is released. */ - while (async_cleanups) - { - struct ptx_event *e = async_cleanups; - async_cleanups = async_cleanups->next; - - GOMP_PLUGIN_async_unmap_vars (e->addr, e->val); - free (e); - } -} - -static void -event_add (enum ptx_event_type type, CUevent *e, void *h, int val) -{ - struct ptx_event *ptx_event; - struct nvptx_thread *nvthd = nvptx_thread (); - - assert (type == PTX_EVT_MEM || type == PTX_EVT_KNL || type == PTX_EVT_SYNC - || type == PTX_EVT_ASYNC_CLEANUP); - - ptx_event = GOMP_PLUGIN_malloc (sizeof (struct ptx_event)); - ptx_event->type = type; - ptx_event->evt = e; - ptx_event->addr = h; - ptx_event->ord = nvthd->ptx_dev->ord; - ptx_event->val = val; - - pthread_mutex_lock (&ptx_event_lock); - - ptx_event->next = ptx_events; - ptx_events = ptx_event; - - pthread_mutex_unlock (&ptx_event_lock); -} - -static void nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs, - int async, unsigned *dims, void *targ_mem_desc) + unsigned *dims, void *targ_mem_desc, + CUdeviceptr dp, CUstream stream) { struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn; CUfunction function; - CUresult r; int i; - struct ptx_stream *dev_str; void *kargs[1]; - void *hp, *dp; struct nvptx_thread *nvthd = nvptx_thread (); int warp_size = nvthd->ptx_dev->warp_size; - const char *maybe_abort_msg = "(perhaps abort was called)"; function = targ_fn->fn; - dev_str = select_stream_for_async (async, pthread_self (), false, NULL); - assert (dev_str == nvthd->current_stream); - /* Initialize the launch dimensions. Typically this is constant, provided by the device compiler, but we must permit runtime values. */ @@ -1292,21 +869,6 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs, suggest_workers, suggest_workers); } - /* This reserves a chunk of a pre-allocated page of memory mapped on both - the host and the device. HP is a host pointer to the new chunk, and DP is - the corresponding device pointer. */ - map_push (dev_str, mapnum * sizeof (void *), &hp, &dp); - - GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__); - - /* Copy the array of arguments to the mapped page. */ - for (i = 0; i < mapnum; i++) - ((void **) hp)[i] = devaddrs[i]; - - /* Copy the (device) pointers to arguments to the device (dp and hp might in - fact have the same value on a unified-memory system). */ - CUDA_CALL_ASSERT (cuMemcpy, (CUdeviceptr) dp, (CUdeviceptr) hp, - mapnum * sizeof (void *)); GOMP_PLUGIN_debug (0, " %s: kernel %s: launch" " gangs=%u, workers=%u, vectors=%u\n", __FUNCTION__, targ_fn->launch->fn, dims[GOMP_DIM_GANG], @@ -1317,58 +879,14 @@ nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs, // num_gangs nctaid.x // num_workers ntid.y // vector length ntid.x - kargs[0] = &dp; CUDA_CALL_ASSERT (cuLaunchKernel, function, dims[GOMP_DIM_GANG], 1, 1, dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1, - 0, dev_str->stream, kargs, 0); - -#ifndef DISABLE_ASYNC - if (async < acc_async_noval) - { - r = CUDA_CALL_NOCHECK (cuStreamSynchronize, dev_str->stream); - if (r == CUDA_ERROR_LAUNCH_FAILED) - GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r), - maybe_abort_msg); - else if (r != CUDA_SUCCESS) - GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r)); - } - else - { - CUevent *e; - - e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent)); - - r = CUDA_CALL_NOCHECK (cuEventCreate, e, CU_EVENT_DISABLE_TIMING); - if (r == CUDA_ERROR_LAUNCH_FAILED) - GOMP_PLUGIN_fatal ("cuEventCreate error: %s %s\n", cuda_error (r), - maybe_abort_msg); - else if (r != CUDA_SUCCESS) - GOMP_PLUGIN_fatal ("cuEventCreate error: %s", cuda_error (r)); - - event_gc (true); - - CUDA_CALL_ASSERT (cuEventRecord, *e, dev_str->stream); - - event_add (PTX_EVT_KNL, e, (void *)dev_str, 0); - } -#else - r = CUDA_CALL_NOCHECK (cuCtxSynchronize, ); - if (r == CUDA_ERROR_LAUNCH_FAILED) - GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r), - maybe_abort_msg); - else if (r != CUDA_SUCCESS) - GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r)); -#endif + 0, stream, kargs, 0); GOMP_PLUGIN_debug (0, " %s: kernel %s: finished\n", __FUNCTION__, targ_fn->launch->fn); - -#ifndef DISABLE_ASYNC - if (async < acc_async_noval) -#endif - map_pop (dev_str); } void * openacc_get_current_cuda_context (void); @@ -1383,8 +901,21 @@ nvptx_alloc (size_t s) } static bool -nvptx_free (void *p) +nvptx_free (void *p, struct ptx_device *ptx_dev) { + /* Assume callback context if this is null. */ + if (GOMP_PLUGIN_acc_thread () == NULL) + { + struct ptx_free_block *n + = GOMP_PLUGIN_malloc (sizeof (struct ptx_free_block)); + n->ptr = p; + pthread_mutex_lock (&ptx_dev->free_blocks_lock); + n->next = ptx_dev->free_blocks; + ptx_dev->free_blocks = n; + pthread_mutex_unlock (&ptx_dev->free_blocks_lock); + return true; + } + CUdeviceptr pb; size_t ps; @@ -1399,389 +930,27 @@ nvptx_free (void *p) return true; } - -static bool -nvptx_host2dev (void *d, const void *h, size_t s) +static void * +nvptx_get_current_cuda_device (void) { - CUdeviceptr pb; - size_t ps; struct nvptx_thread *nvthd = nvptx_thread (); - if (!s) - return true; - if (!d) - { - GOMP_PLUGIN_error ("invalid device address"); - return false; - } - - CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d); - - if (!pb) - { - GOMP_PLUGIN_error ("invalid device address"); - return false; - } - if (!h) - { - GOMP_PLUGIN_error ("invalid host address"); - return false; - } - if (d == h) - { - GOMP_PLUGIN_error ("invalid host or device address"); - return false; - } - if ((void *)(d + s) > (void *)(pb + ps)) - { - GOMP_PLUGIN_error ("invalid size"); - return false; - } - -#ifndef DISABLE_ASYNC - if (nvthd && nvthd->current_stream != nvthd->ptx_dev->null_stream) - { - CUevent *e = (CUevent *)GOMP_PLUGIN_malloc (sizeof (CUevent)); - CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING); - event_gc (false); - CUDA_CALL (cuMemcpyHtoDAsync, - (CUdeviceptr) d, h, s, nvthd->current_stream->stream); - CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream); - event_add (PTX_EVT_MEM, e, (void *)h, 0); - } - else -#endif - CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) d, h, s); + if (!nvthd || !nvthd->ptx_dev) + return NULL; - return true; + return &nvthd->ptx_dev->dev; } -static bool -nvptx_dev2host (void *h, const void *d, size_t s) +static void * +nvptx_get_current_cuda_context (void) { - CUdeviceptr pb; - size_t ps; struct nvptx_thread *nvthd = nvptx_thread (); - if (!s) - return true; - if (!d) - { - GOMP_PLUGIN_error ("invalid device address"); - return false; - } + if (!nvthd || !nvthd->ptx_dev) + return NULL; - CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d); - - if (!pb) - { - GOMP_PLUGIN_error ("invalid device address"); - return false; - } - if (!h) - { - GOMP_PLUGIN_error ("invalid host address"); - return false; - } - if (d == h) - { - GOMP_PLUGIN_error ("invalid host or device address"); - return false; - } - if ((void *)(d + s) > (void *)(pb + ps)) - { - GOMP_PLUGIN_error ("invalid size"); - return false; - } - -#ifndef DISABLE_ASYNC - if (nvthd && nvthd->current_stream != nvthd->ptx_dev->null_stream) - { - CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent)); - CUDA_CALL (cuEventCreate, e, CU_EVENT_DISABLE_TIMING); - event_gc (false); - CUDA_CALL (cuMemcpyDtoHAsync, - h, (CUdeviceptr) d, s, nvthd->current_stream->stream); - CUDA_CALL (cuEventRecord, *e, nvthd->current_stream->stream); - event_add (PTX_EVT_MEM, e, (void *)h, 0); - } - else -#endif - CUDA_CALL (cuMemcpyDtoH, h, (CUdeviceptr) d, s); - - return true; -} - -static void -nvptx_set_async (int async) -{ - struct nvptx_thread *nvthd = nvptx_thread (); - nvthd->current_stream - = select_stream_for_async (async, pthread_self (), true, NULL); -} - -static int -nvptx_async_test (int async) -{ - CUresult r; - struct ptx_stream *s; - - s = select_stream_for_async (async, pthread_self (), false, NULL); - - if (!s) - GOMP_PLUGIN_fatal ("unknown async %d", async); - - r = CUDA_CALL_NOCHECK (cuStreamQuery, s->stream); - if (r == CUDA_SUCCESS) - { - /* The oacc-parallel.c:goacc_wait function calls this hook to determine - whether all work has completed on this stream, and if so omits the call - to the wait hook. If that happens, event_gc might not get called - (which prevents variables from getting unmapped and their associated - device storage freed), so call it here. */ - event_gc (true); - return 1; - } - else if (r == CUDA_ERROR_NOT_READY) - return 0; - - GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r)); - - return 0; -} - -static int -nvptx_async_test_all (void) -{ - struct ptx_stream *s; - pthread_t self = pthread_self (); - struct nvptx_thread *nvthd = nvptx_thread (); - - pthread_mutex_lock (&nvthd->ptx_dev->stream_lock); - - for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next) - { - if ((s->multithreaded || pthread_equal (s->host_thread, self)) - && CUDA_CALL_NOCHECK (cuStreamQuery, - s->stream) == CUDA_ERROR_NOT_READY) - { - pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock); - return 0; - } - } - - pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock); - - event_gc (true); - - return 1; -} - -static void -nvptx_wait (int async) -{ - struct ptx_stream *s; - - s = select_stream_for_async (async, pthread_self (), false, NULL); - if (!s) - GOMP_PLUGIN_fatal ("unknown async %d", async); - - CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream); - - event_gc (true); -} - -static void -nvptx_wait_async (int async1, int async2) -{ - CUevent *e; - struct ptx_stream *s1, *s2; - pthread_t self = pthread_self (); - - /* The stream that is waiting (rather than being waited for) doesn't - necessarily have to exist already. */ - s2 = select_stream_for_async (async2, self, true, NULL); - - s1 = select_stream_for_async (async1, self, false, NULL); - if (!s1) - GOMP_PLUGIN_fatal ("invalid async 1\n"); - - if (s1 == s2) - GOMP_PLUGIN_fatal ("identical parameters"); - - e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent)); - - CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING); - - event_gc (true); - - CUDA_CALL_ASSERT (cuEventRecord, *e, s1->stream); - - event_add (PTX_EVT_SYNC, e, NULL, 0); - - CUDA_CALL_ASSERT (cuStreamWaitEvent, s2->stream, *e, 0); -} - -static void -nvptx_wait_all (void) -{ - CUresult r; - struct ptx_stream *s; - pthread_t self = pthread_self (); - struct nvptx_thread *nvthd = nvptx_thread (); - - pthread_mutex_lock (&nvthd->ptx_dev->stream_lock); - - /* Wait for active streams initiated by this thread (or by multiple threads) - to complete. */ - for (s = nvthd->ptx_dev->active_streams; s != NULL; s = s->next) - { - if (s->multithreaded || pthread_equal (s->host_thread, self)) - { - r = CUDA_CALL_NOCHECK (cuStreamQuery, s->stream); - if (r == CUDA_SUCCESS) - continue; - else if (r != CUDA_ERROR_NOT_READY) - GOMP_PLUGIN_fatal ("cuStreamQuery error: %s", cuda_error (r)); - - CUDA_CALL_ASSERT (cuStreamSynchronize, s->stream); - } - } - - pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock); - - event_gc (true); -} - -static void -nvptx_wait_all_async (int async) -{ - struct ptx_stream *waiting_stream, *other_stream; - CUevent *e; - struct nvptx_thread *nvthd = nvptx_thread (); - pthread_t self = pthread_self (); - - /* The stream doing the waiting. This could be the first mention of the - stream, so create it if necessary. */ - waiting_stream - = select_stream_for_async (async, pthread_self (), true, NULL); - - /* Launches on the null stream already block on other streams in the - context. */ - if (!waiting_stream || waiting_stream == nvthd->ptx_dev->null_stream) - return; - - event_gc (true); - - pthread_mutex_lock (&nvthd->ptx_dev->stream_lock); - - for (other_stream = nvthd->ptx_dev->active_streams; - other_stream != NULL; - other_stream = other_stream->next) - { - if (!other_stream->multithreaded - && !pthread_equal (other_stream->host_thread, self)) - continue; - - e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent)); - - CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING); - - /* Record an event on the waited-for stream. */ - CUDA_CALL_ASSERT (cuEventRecord, *e, other_stream->stream); - - event_add (PTX_EVT_SYNC, e, NULL, 0); - - CUDA_CALL_ASSERT (cuStreamWaitEvent, waiting_stream->stream, *e, 0); - } - - pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock); -} - -static void * -nvptx_get_current_cuda_device (void) -{ - struct nvptx_thread *nvthd = nvptx_thread (); - - if (!nvthd || !nvthd->ptx_dev) - return NULL; - - return &nvthd->ptx_dev->dev; -} - -static void * -nvptx_get_current_cuda_context (void) -{ - struct nvptx_thread *nvthd = nvptx_thread (); - - if (!nvthd || !nvthd->ptx_dev) - return NULL; - - return nvthd->ptx_dev->ctx; -} - -static void * -nvptx_get_cuda_stream (int async) -{ - struct ptx_stream *s; - struct nvptx_thread *nvthd = nvptx_thread (); - - if (!nvthd || !nvthd->ptx_dev) - return NULL; - - s = select_stream_for_async (async, pthread_self (), false, NULL); - - return s ? s->stream : NULL; -} - -static int -nvptx_set_cuda_stream (int async, void *stream) -{ - struct ptx_stream *oldstream; - pthread_t self = pthread_self (); - struct nvptx_thread *nvthd = nvptx_thread (); - - if (async < 0) - GOMP_PLUGIN_fatal ("bad async %d", async); - - pthread_mutex_lock (&nvthd->ptx_dev->stream_lock); - - /* We have a list of active streams and an array mapping async values to - entries of that list. We need to take "ownership" of the passed-in stream, - and add it to our list, removing the previous entry also (if there was one) - in order to prevent resource leaks. Note the potential for surprise - here: maybe we should keep track of passed-in streams and leave it up to - the user to tidy those up, but that doesn't work for stream handles - returned from acc_get_cuda_stream above... */ - - oldstream = select_stream_for_async (async, self, false, NULL); - - if (oldstream) - { - if (nvthd->ptx_dev->active_streams == oldstream) - nvthd->ptx_dev->active_streams = nvthd->ptx_dev->active_streams->next; - else - { - struct ptx_stream *s = nvthd->ptx_dev->active_streams; - while (s->next != oldstream) - s = s->next; - s->next = s->next->next; - } - - CUDA_CALL_ASSERT (cuStreamDestroy, oldstream->stream); - - if (!map_fini (oldstream)) - GOMP_PLUGIN_fatal ("error when freeing host memory"); - - free (oldstream); - } - - pthread_mutex_unlock (&nvthd->ptx_dev->stream_lock); - - (void) select_stream_for_async (async, self, true, (CUstream) stream); - - return 1; -} + return nvthd->ptx_dev->ctx; +} /* Plugin entry points. */ @@ -2016,100 +1185,116 @@ GOMP_OFFLOAD_alloc (int ord, size_t size) { if (!nvptx_attach_host_thread_to_device (ord)) return NULL; - return nvptx_alloc (size); -} -bool -GOMP_OFFLOAD_free (int ord, void *ptr) -{ - return (nvptx_attach_host_thread_to_device (ord) - && nvptx_free (ptr)); -} + struct ptx_device *ptx_dev = ptx_devices[ord]; + struct ptx_free_block *blocks, *tmp; -bool -GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n) -{ - return (nvptx_attach_host_thread_to_device (ord) - && nvptx_dev2host (dst, src, n)); -} + pthread_mutex_lock (&ptx_dev->free_blocks_lock); + blocks = ptx_dev->free_blocks; + ptx_dev->free_blocks = NULL; + pthread_mutex_unlock (&ptx_dev->free_blocks_lock); -bool -GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n) -{ - return (nvptx_attach_host_thread_to_device (ord) - && nvptx_host2dev (dst, src, n)); + while (blocks) + { + tmp = blocks->next; + nvptx_free (blocks->ptr, ptx_dev); + free (blocks); + blocks = tmp; + } + + return nvptx_alloc (size); } bool -GOMP_OFFLOAD_dev2dev (int ord, void *dst, const void *src, size_t n) +GOMP_OFFLOAD_free (int ord, void *ptr) { - struct ptx_device *ptx_dev = ptx_devices[ord]; - CUDA_CALL (cuMemcpyDtoDAsync, (CUdeviceptr) dst, (CUdeviceptr) src, n, - ptx_dev->null_stream->stream); - return true; + return (nvptx_attach_host_thread_to_device (ord) + && nvptx_free (ptr, ptx_devices[ord])); } -void (*device_run) (int n, void *fn_ptr, void *vars) = NULL; - void GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum, void **hostaddrs, void **devaddrs, - int async, unsigned *dims, void *targ_mem_desc) + unsigned *dims, void *targ_mem_desc) { - nvptx_exec (fn, mapnum, hostaddrs, devaddrs, async, dims, targ_mem_desc); -} + GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__); -void -GOMP_OFFLOAD_openacc_register_async_cleanup (void *targ_mem_desc, int async) -{ - struct nvptx_thread *nvthd = nvptx_thread (); - CUevent *e = (CUevent *) GOMP_PLUGIN_malloc (sizeof (CUevent)); + void **hp = NULL; + CUdeviceptr dp = 0; - CUDA_CALL_ASSERT (cuEventCreate, e, CU_EVENT_DISABLE_TIMING); - CUDA_CALL_ASSERT (cuEventRecord, *e, nvthd->current_stream->stream); - event_add (PTX_EVT_ASYNC_CLEANUP, e, targ_mem_desc, async); -} + if (mapnum > 0) + { + hp = alloca (mapnum * sizeof (void *)); + for (int i = 0; i < mapnum; i++) + hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]); + CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *)); + } -int -GOMP_OFFLOAD_openacc_async_test (int async) -{ - return nvptx_async_test (async); -} + /* Copy the (device) pointers to arguments to the device (dp and hp might in + fact have the same value on a unified-memory system). */ + if (mapnum > 0) + CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp, + mapnum * sizeof (void *)); -int -GOMP_OFFLOAD_openacc_async_test_all (void) -{ - return nvptx_async_test_all (); -} + nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc, + dp, NULL); -void -GOMP_OFFLOAD_openacc_async_wait (int async) -{ - nvptx_wait (async); + CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL); + const char *maybe_abort_msg = "(perhaps abort was called)"; + if (r == CUDA_ERROR_LAUNCH_FAILED) + GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r), + maybe_abort_msg); + else if (r != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r)); + CUDA_CALL_ASSERT (cuMemFree, dp); } -void -GOMP_OFFLOAD_openacc_async_wait_async (int async1, int async2) +static void +cuda_free_argmem (void *ptr) { - nvptx_wait_async (async1, async2); + void **block = (void **) ptr; + nvptx_free (block[0], (struct ptx_device *) block[1]); + free (block); } void -GOMP_OFFLOAD_openacc_async_wait_all (void) +GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum, + void **hostaddrs, void **devaddrs, + unsigned *dims, void *targ_mem_desc, + struct goacc_asyncqueue *aq) { - nvptx_wait_all (); -} + GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__); -void -GOMP_OFFLOAD_openacc_async_wait_all_async (int async) -{ - nvptx_wait_all_async (async); -} + void **hp = NULL; + CUdeviceptr dp = 0; + void **block = NULL; -void -GOMP_OFFLOAD_openacc_async_set_async (int async) -{ - nvptx_set_async (async); + if (mapnum > 0) + { + block = (void **) GOMP_PLUGIN_malloc ((mapnum + 2) * sizeof (void *)); + hp = block + 2; + for (int i = 0; i < mapnum; i++) + hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]); + CUDA_CALL_ASSERT (cuMemAlloc, &dp, mapnum * sizeof (void *)); + } + + /* Copy the (device) pointers to arguments to the device (dp and hp might in + fact have the same value on a unified-memory system). */ + if (mapnum > 0) + { + CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp, + mapnum * sizeof (void *), aq->cuda_stream); + block[0] = (void *) dp; + + struct nvptx_thread *nvthd = + (struct nvptx_thread *) GOMP_PLUGIN_acc_thread (); + block[1] = (void *) nvthd->ptx_dev; + } + nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc, + dp, aq->cuda_stream); + + if (mapnum > 0) + GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block); } void * @@ -2131,7 +1316,6 @@ GOMP_OFFLOAD_openacc_create_thread_data (int ord) if (!thd_ctx) CUDA_CALL_ASSERT (cuCtxPushCurrent, ptx_dev->ctx); - nvthd->current_stream = ptx_dev->null_stream; nvthd->ptx_dev = ptx_dev; return (void *) nvthd; @@ -2156,19 +1340,185 @@ GOMP_OFFLOAD_openacc_cuda_get_current_context (void) } /* NOTE: This returns a CUstream, not a ptx_stream pointer. */ - void * -GOMP_OFFLOAD_openacc_cuda_get_stream (int async) +GOMP_OFFLOAD_openacc_cuda_get_stream (struct goacc_asyncqueue *aq) { - return nvptx_get_cuda_stream (async); + return (void *) aq->cuda_stream; } /* NOTE: This takes a CUstream, not a ptx_stream pointer. */ +int +GOMP_OFFLOAD_openacc_cuda_set_stream (struct goacc_asyncqueue *aq, void *stream) +{ + if (aq->cuda_stream) + { + CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream); + CUDA_CALL_ASSERT (cuStreamDestroy, aq->cuda_stream); + } + + aq->cuda_stream = (CUstream) stream; + return 1; +} + +struct goacc_asyncqueue * +GOMP_OFFLOAD_openacc_async_construct (void) +{ + struct goacc_asyncqueue *aq + = GOMP_PLUGIN_malloc (sizeof (struct goacc_asyncqueue)); + aq->cuda_stream = NULL; + CUDA_CALL_ASSERT (cuStreamCreate, &aq->cuda_stream, CU_STREAM_DEFAULT); + if (aq->cuda_stream == NULL) + GOMP_PLUGIN_fatal ("CUDA stream create NULL\n"); + + CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream); + + + return aq; +} + +bool +GOMP_OFFLOAD_openacc_async_destruct (struct goacc_asyncqueue *aq) +{ + CUDA_CALL_ERET (false, cuStreamDestroy, aq->cuda_stream); + free (aq); + return true; +} int -GOMP_OFFLOAD_openacc_cuda_set_stream (int async, void *stream) +GOMP_OFFLOAD_openacc_async_test (struct goacc_asyncqueue *aq) { - return nvptx_set_cuda_stream (async, stream); + CUresult r = CUDA_CALL_NOCHECK (cuStreamQuery, aq->cuda_stream); + if (r == CUDA_SUCCESS) + return 1; + if (r == CUDA_ERROR_NOT_READY) + return 0; + + GOMP_PLUGIN_error ("cuStreamQuery error: %s", cuda_error (r)); + return -1; +} + +void +GOMP_OFFLOAD_openacc_async_synchronize (struct goacc_asyncqueue *aq) +{ + CUDA_CALL_ASSERT (cuStreamSynchronize, aq->cuda_stream); +} + +void +GOMP_OFFLOAD_openacc_async_serialize (struct goacc_asyncqueue *aq1, + struct goacc_asyncqueue *aq2) +{ + CUevent e; + CUDA_CALL_ASSERT (cuEventCreate, &e, CU_EVENT_DISABLE_TIMING); + CUDA_CALL_ASSERT (cuEventRecord, e, aq1->cuda_stream); + CUDA_CALL_ASSERT (cuStreamWaitEvent, aq2->cuda_stream, e, 0); +} + +static void +cuda_callback_wrapper (CUstream stream, CUresult res, void *ptr) +{ + if (res != CUDA_SUCCESS) + GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res)); + struct nvptx_callback *cb = (struct nvptx_callback *) ptr; + cb->fn (cb->ptr); + free (ptr); +} + +void +GOMP_OFFLOAD_openacc_async_queue_callback (struct goacc_asyncqueue *aq, + void (*callback_fn)(void *), + void *userptr) +{ + struct nvptx_callback *b = GOMP_PLUGIN_malloc (sizeof (*b)); + b->fn = callback_fn; + b->ptr = userptr; + b->aq = aq; + CUDA_CALL_ASSERT (cuStreamAddCallback, aq->cuda_stream, + cuda_callback_wrapper, (void *) b, 0); +} + +static bool +cuda_memcpy_sanity_check (const void *h, const void *d, size_t s) +{ + CUdeviceptr pb; + size_t ps; + if (!s) + return true; + if (!d) + { + GOMP_PLUGIN_error ("invalid device address"); + return false; + } + CUDA_CALL (cuMemGetAddressRange, &pb, &ps, (CUdeviceptr) d); + if (!pb) + { + GOMP_PLUGIN_error ("invalid device address"); + return false; + } + if (!h) + { + GOMP_PLUGIN_error ("invalid host address"); + return false; + } + if (d == h) + { + GOMP_PLUGIN_error ("invalid host or device address"); + return false; + } + if ((void *)(d + s) > (void *)(pb + ps)) + { + GOMP_PLUGIN_error ("invalid size"); + return false; + } + return true; +} + +bool +GOMP_OFFLOAD_host2dev (int ord, void *dst, const void *src, size_t n) +{ + if (!nvptx_attach_host_thread_to_device (ord) + || !cuda_memcpy_sanity_check (src, dst, n)) + return false; + CUDA_CALL (cuMemcpyHtoD, (CUdeviceptr) dst, src, n); + return true; +} + +bool +GOMP_OFFLOAD_dev2host (int ord, void *dst, const void *src, size_t n) +{ + if (!nvptx_attach_host_thread_to_device (ord) + || !cuda_memcpy_sanity_check (dst, src, n)) + return false; + CUDA_CALL (cuMemcpyDtoH, dst, (CUdeviceptr) src, n); + return true; +} + +bool +GOMP_OFFLOAD_dev2dev (int ord, void *dst, const void *src, size_t n) +{ + CUDA_CALL (cuMemcpyDtoDAsync, (CUdeviceptr) dst, (CUdeviceptr) src, n, NULL); + return true; +} + +bool +GOMP_OFFLOAD_openacc_async_host2dev (int ord, void *dst, const void *src, + size_t n, struct goacc_asyncqueue *aq) +{ + if (!nvptx_attach_host_thread_to_device (ord) + || !cuda_memcpy_sanity_check (src, dst, n)) + return false; + CUDA_CALL (cuMemcpyHtoDAsync, (CUdeviceptr) dst, src, n, aq->cuda_stream); + return true; +} + +bool +GOMP_OFFLOAD_openacc_async_dev2host (int ord, void *dst, const void *src, + size_t n, struct goacc_asyncqueue *aq) +{ + if (!nvptx_attach_host_thread_to_device (ord) + || !cuda_memcpy_sanity_check (dst, src, n)) + return false; + CUDA_CALL (cuMemcpyDtoHAsync, dst, (CUdeviceptr) src, n, aq->cuda_stream); + return true; } /* Adjust launch dimensions: pick good values for number of blocks and warps @@ -2269,8 +1619,7 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args) CU_LAUNCH_PARAM_END }; r = CUDA_CALL_NOCHECK (cuLaunchKernel, function, teams, 1, 1, - 32, threads, 1, 0, ptx_dev->null_stream->stream, - NULL, config); + 32, threads, 1, 0, NULL, NULL, config); if (r != CUDA_SUCCESS) GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));