From patchwork Mon Oct 26 14:14:48 2020
X-Patchwork-Submitter: Julian Brown
X-Patchwork-Id: 1387775
From: Julian Brown <julian@codesourcery.com>
To: gcc-patches@gcc.gnu.org
Cc: Jakub Jelinek, Thomas Schwinge
Subject: [PATCH] nvptx: Cache stacks block for OpenMP kernel launch
Date: Mon, 26 Oct 2020 07:14:48 -0700
Message-ID: <20201026141448.109041-1-julian@codesourcery.com>

Hi,

This patch adds caching for the stack block allocated for offloaded
OpenMP kernel launches on NVPTX.

This is a performance optimisation: with this patch we observed an
average performance improvement of roughly 11% across a set of
GPU-accelerated benchmarks on one machine (results vary according to
the individual benchmark and the hardware used).

A given kernel launch reuses the stack block from the previous launch
if that block is large enough; otherwise the old block is freed and a
larger one is allocated.  A slight caveat is that the cached block is
not freed until the device is closed, so code using highly variable
launch geometries and large amounts of GPU RAM may run out of device
memory somewhat sooner with this patch.
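In outline, the reuse policy amounts to the following (a minimal,
stand-alone sketch for illustration only: plain C, with malloc/free
standing in for cuMemAlloc/cuMemFree, and the per-device locking and
device synchronisation of the actual patch omitted):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's ptx_dev->omp_stacks state.  */
struct stacks_cache { void *ptr; size_t size; };

static void *
stacks_acquire (struct stacks_cache *c, size_t size, int num)
{
  size_t bytes = size * num;
  if (c->ptr && c->size >= bytes)
    return c->ptr;	/* Cached block is big enough: reuse it.  */
  free (c->ptr);	/* Too small (or absent): free and grow.  */
  c->ptr = malloc (bytes);
  c->size = bytes;
  return c->ptr;
}

int
main (void)
{
  struct stacks_cache c = { NULL, 0 };
  stacks_acquire (&c, 1024, 32);  /* First launch: allocates.  */
  stacks_acquire (&c, 1024, 16);  /* Smaller launch: reuses the block.  */
  stacks_acquire (&c, 1024, 64);  /* Bigger launch: frees, reallocates.  */
  printf ("cached %zu bytes\n", c.size);
  free (c.ptr);
  return 0;
}

The real nvptx_stacks_acquire in the patch below additionally holds a
per-device mutex across the launch, and synchronises the context before
freeing a block that a previous kernel may still be using.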
Another way this patch gains performance is by omitting the
synchronisation at the end of an OpenMP offload kernel launch: it is
safe for the GPU and CPU to continue executing in parallel at that
point, because e.g. copies back from the device are synchronised
properly with kernel completion anyway.  In turn, omitting that
synchronisation necessitates a change to the way "(perhaps abort was
called)" errors are detected and reported.
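Why omitting the synchronisation is safe comes down to CUDA stream
ordering: a callback registered with cuStreamAddCallback after a kernel
launch on the same stream runs only once that kernel has completed, so
holding the stacks lock until the callback fires means a later launch
wanting the same stacks block simply blocks on the mutex rather than on
the device.  A compilable sketch of the idea (illustrative only, not
the patch itself: launch_no_sync is a hypothetical helper, and context
setup and error checking are omitted):

#include <cuda.h>
#include <pthread.h>

static pthread_mutex_t stacks_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fires only after all work queued on STREAM ahead of it (here, the
   kernel) has finished; only then may the stacks block be reused.  */
static void CUDA_CB
release_cb (CUstream stream, CUresult status, void *data)
{
  pthread_mutex_unlock (&stacks_lock);
}

static void
launch_no_sync (CUfunction fn, void **config)
{
  /* In the patch, the lock is taken in nvptx_stacks_acquire.  */
  pthread_mutex_lock (&stacks_lock);
  cuLaunchKernel (fn, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, config);
  cuStreamAddCallback (NULL, release_cb, NULL, 0);
  /* No cuCtxSynchronize: the CPU runs ahead, and copies back from the
     device are synchronised with kernel completion elsewhere.  */
}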
Tested with offloading to NVPTX.  OK for mainline?

Thanks,

Julian

2020-10-26  Julian Brown  <julian@codesourcery.com>

libgomp/
	* plugin/plugin-nvptx.c (maybe_abort_message): Add function.
	(CUDA_CALL_ERET, CUDA_CALL_ASSERT): Use above function.
	(struct ptx_device): Add omp_stacks struct.
	(nvptx_open_device): Initialise cached-stacks housekeeping info.
	(nvptx_close_device): Free cached stacks block and mutex.
	(nvptx_stacks_alloc): Rename to...
	(nvptx_stacks_acquire): This.  Cache stacks block between runs if
	same size or smaller is required.
	(nvptx_stacks_free): Rename to...
	(nvptx_stacks_release): This.  Do not free stacks block, but release
	mutex.
	(GOMP_OFFLOAD_run): Adjust for changes to above functions, and remove
	special-case "abort" error handling and synchronisation after kernel
	launch.
---
 libgomp/plugin/plugin-nvptx.c | 91 ++++++++++++++++++++++++++---------
 1 file changed, 68 insertions(+), 23 deletions(-)

diff --git a/libgomp/plugin/plugin-nvptx.c b/libgomp/plugin/plugin-nvptx.c
index 11d4ceeae62e..e7ff5d5213e0 100644
--- a/libgomp/plugin/plugin-nvptx.c
+++ b/libgomp/plugin/plugin-nvptx.c
@@ -137,6 +137,15 @@ init_cuda_lib (void)
 #define MIN(X,Y) ((X) < (Y) ? (X) : (Y))
 #define MAX(X,Y) ((X) > (Y) ? (X) : (Y))
 
+static const char *
+maybe_abort_message (unsigned errmsg)
+{
+  if (errmsg == CUDA_ERROR_LAUNCH_FAILED)
+    return " (perhaps abort was called)";
+  else
+    return "";
+}
+
 /* Convenience macros for the frequently used CUDA library call and
    error handling sequence as well as CUDA library calls that
    do the error checking themselves or don't do it at all.  */
@@ -147,8 +156,9 @@ init_cuda_lib (void)
 	= CUDA_CALL_PREFIX FN (__VA_ARGS__);		\
       if (__r != CUDA_SUCCESS)				\
 	{						\
-	  GOMP_PLUGIN_error (#FN " error: %s",		\
-			     cuda_error (__r));		\
+	  GOMP_PLUGIN_error (#FN " error: %s%s",	\
+			     cuda_error (__r),		\
+			     maybe_abort_message (__r)); \
 	  return ERET;					\
 	}						\
     } while (0)
@@ -162,8 +172,9 @@ init_cuda_lib (void)
 	= CUDA_CALL_PREFIX FN (__VA_ARGS__);		\
       if (__r != CUDA_SUCCESS)				\
 	{						\
-	  GOMP_PLUGIN_fatal (#FN " error: %s",		\
-			     cuda_error (__r));		\
+	  GOMP_PLUGIN_fatal (#FN " error: %s%s",	\
+			     cuda_error (__r),		\
+			     maybe_abort_message (__r)); \
 	}						\
     } while (0)
 
@@ -307,6 +318,14 @@ struct ptx_device
   struct ptx_free_block *free_blocks;
   pthread_mutex_t free_blocks_lock;
 
+  /* OpenMP stacks, cached between kernel invocations.  */
+  struct
+    {
+      CUdeviceptr ptr;
+      size_t size;
+      pthread_mutex_t lock;
+    } omp_stacks;
+
   struct ptx_device *next;
 };
 
@@ -514,6 +533,10 @@ nvptx_open_device (int n)
   ptx_dev->free_blocks = NULL;
   pthread_mutex_init (&ptx_dev->free_blocks_lock, NULL);
 
+  ptx_dev->omp_stacks.ptr = 0;
+  ptx_dev->omp_stacks.size = 0;
+  pthread_mutex_init (&ptx_dev->omp_stacks.lock, NULL);
+
   return ptx_dev;
 }
 
@@ -534,6 +557,11 @@ nvptx_close_device (struct ptx_device *ptx_dev)
   pthread_mutex_destroy (&ptx_dev->free_blocks_lock);
   pthread_mutex_destroy (&ptx_dev->image_lock);
 
+  pthread_mutex_destroy (&ptx_dev->omp_stacks.lock);
+
+  if (ptx_dev->omp_stacks.ptr)
+    CUDA_CALL (cuMemFree, ptx_dev->omp_stacks.ptr);
+
   if (!ptx_dev->ctx_shared)
     CUDA_CALL (cuCtxDestroy, ptx_dev->ctx);
 
@@ -1866,26 +1894,49 @@ nvptx_stacks_size ()
   return 128 * 1024;
 }
 
-/* Return contiguous storage for NUM stacks, each SIZE bytes.  */
+/* Return contiguous storage for NUM stacks, each SIZE bytes, and obtain the
+   lock for that storage.  */
 
 static void *
-nvptx_stacks_alloc (size_t size, int num)
+nvptx_stacks_acquire (struct ptx_device *ptx_dev, size_t size, int num)
 {
-  CUdeviceptr stacks;
-  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &stacks, size * num);
+  pthread_mutex_lock (&ptx_dev->omp_stacks.lock);
+
+  if (ptx_dev->omp_stacks.ptr && ptx_dev->omp_stacks.size >= size * num)
+    return (void *) ptx_dev->omp_stacks.ptr;
+
+  /* Free the old, too-small stacks.  */
+  if (ptx_dev->omp_stacks.ptr)
+    {
+      CUresult r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s\n", cuda_error (r));
+      r = CUDA_CALL_NOCHECK (cuMemFree, ptx_dev->omp_stacks.ptr);
+      if (r != CUDA_SUCCESS)
+	GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+    }
+
+  /* Make new and bigger stacks, and remember where we put them and how big
+     they are.  */
+  CUresult r = CUDA_CALL_NOCHECK (cuMemAlloc, &ptx_dev->omp_stacks.ptr,
+				  size * num);
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuMemAlloc error: %s", cuda_error (r));
-  return (void *) stacks;
+
+  ptx_dev->omp_stacks.size = size * num;
+
+  return (void *) ptx_dev->omp_stacks.ptr;
 }
 
-/* Release storage previously allocated by nvptx_stacks_alloc.  */
+/* Release the lock associated with a ptx_device's OpenMP stacks block.  */
 
 static void
-nvptx_stacks_free (void *p, int num)
+nvptx_stacks_release (CUstream stream, CUresult res, void *ptr)
 {
-  CUresult r = CUDA_CALL_NOCHECK (cuMemFree, (CUdeviceptr) p);
-  if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuMemFree error: %s", cuda_error (r));
+  if (res != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("%s error: %s", __FUNCTION__, cuda_error (res));
+  struct ptx_device *ptx_dev = (struct ptx_device *) ptr;
+  pthread_mutex_unlock (&ptx_dev->omp_stacks.lock);
 }
 
 void
@@ -1898,7 +1949,6 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   const char *fn_name = launch->fn;
   CUresult r;
   struct ptx_device *ptx_dev = ptx_devices[ord];
-  const char *maybe_abort_msg = "(perhaps abort was called)";
   int teams = 0, threads = 0;
 
   if (!args)
@@ -1922,7 +1972,7 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   nvptx_adjust_launch_bounds (tgt_fn, ptx_dev, &teams, &threads);
 
   size_t stack_size = nvptx_stacks_size ();
-  void *stacks = nvptx_stacks_alloc (stack_size, teams * threads);
+  void *stacks = nvptx_stacks_acquire (ptx_dev, stack_size, teams * threads);
   void *fn_args[] = {tgt_vars, stacks, (void *) stack_size};
   size_t fn_args_size = sizeof fn_args;
   void *config[] = {
@@ -1938,13 +1988,8 @@ GOMP_OFFLOAD_run (int ord, void *tgt_fn, void *tgt_vars, void **args)
   if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuLaunchKernel error: %s", cuda_error (r));
 
-  r = CUDA_CALL_NOCHECK (cuCtxSynchronize, );
-  if (r == CUDA_ERROR_LAUNCH_FAILED)
-    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s %s\n", cuda_error (r),
-		       maybe_abort_msg);
-  else if (r != CUDA_SUCCESS)
-    GOMP_PLUGIN_fatal ("cuCtxSynchronize error: %s", cuda_error (r));
-  nvptx_stacks_free (stacks, teams * threads);
+  CUDA_CALL_ASSERT (cuStreamAddCallback, NULL, nvptx_stacks_release,
+		    (void *) ptx_dev, 0);
 }
 
 /* TODO: Implement GOMP_OFFLOAD_async_run.  */