diff mbox

[hsa,2/10] Modifications to libgomp proper

Message ID 20151207111957.GC24234@virgil.suse.cz
State New
Headers show

Commit Message

Martin Jambor Dec. 7, 2015, 11:19 a.m. UTC
Hi,

The patch below contains all changes to libgomp files except for the
hsa plugin (which is in the following patch).

The changes can roughly divided into three categories.  First, it
contains changes I that are necessary to support shared-memory
devices.  In majority of cases this means treating them like the host
fallback because there is no need to copy, host malloc can be used for
allocating etc.  It also means that GOMP_target_ext and
gomp_target_task_fn should not be remapping arguments but should pass
to the plugin the same thing host fallback function would receive.

Second, because GCC HSA backend often does not emit HSAIL for function
it knows it cannot handle, these two functions need to gracefully
handle the case when there is no device implementation of a particular
function available by doing host fallback too.

Third, the patch implements libgomp-part of the device-specific
arguments passed to GOMP_target as requested Jakub (well, some are
actually for all devices but that is what we call them).  Because of
nowait target constructs, the arguments have proliferated into tasking
too, as did firstprivate copies.

Any feedback will be greatly appreciated,

Martin


2015-12-04  Martin Jambor  <mjambor@suse.cz>
	    Martin Liska  <mliska@suse.cz>

include/
	* gomp-constants.h (GOMP_DEVICE_HSA): New macro.
	(GOMP_VERSION_HSA): Likewise.
	(GOMP_TARGET_ARG_DEVICE_MASK): Likewise.
	(GOMP_TARGET_ARG_DEVICE_ALL): Likewise.
	(GOMP_TARGET_ARG_SUBSEQUENT_PARAM): Likewise.
	(GOMP_TARGET_ARG_ID_MASK): Likewise.
	(GOMP_TARGET_ARG_NUM_TEAMS): Likewise.
	(GOMP_TARGET_ARG_THREAD_LIMIT): Likewise.
	(GOMP_TARGET_ARG_VALUE_SHIFT): Likewise.
	(GOMP_TARGET_ARG_HSA_KERNEL_ATTRIBUTES): Likewise.
	(GOMP_kernel_launch_attributes): New type.
	(GOMP_hsa_kernel_dispatch): New type.

libgomp/
	* libgomp-plugin.h (offload_target_type): New element
	OFFLOAD_TARGET_TYPE_HSA.
	* libgomp.h (gomp_target_task): New field args.
	(bool gomp_create_target_task): Updated.
	(gomp_device_descr): Extra parameter of run_func and async_run_func,
	new field can_run_func.
	* libgomp_g.h (GOMP_target_ext): Change prototype.
	* oacc-host.c (host_run): Added a new parameter args.
	* target.c (gomp_target_fallback_firstprivate): New function.
	(gomp_target_fallback_firstprivate): Use
	gomp_target_fallback_firstprivate.
	(gomp_get_target_fn_addr): Allow returning NULL for shared memory
	devices.
	(GOMP_target): Do host fallback for all shared memory devices.  Do not
	pass any args to plugins.
	(GOMP_target_ext): Add new parameter args.  Allow host fallback if
	device shares memory.  Do not remap data if device has shared memory.
	(gomp_target_task_fn): Likewise.  Also Treat shared memory devices
	like host fallback for mappings.
	(GOMP_target_data): Treat shared memory devices like host fallback.
	(GOMP_target_data_ext): Likewise.
	(GOMP_target_update): Likewise.
	(GOMP_target_update_ext): Likewise.  Also pass NULL as args to
	gomp_create_target_task.
	(GOMP_target_enter_exit_data): Likewise.
	(omp_target_alloc): Treat shared memory devices like host fallback.
	(omp_target_free): Likewise.
	(omp_target_is_present): Likewise.
	(omp_target_memcpy): Likewise.
	(omp_target_memcpy_rect): Likewise.
	(omp_target_associate_ptr): Likewise.
	(gomp_load_plugin_for_device): Also load can_run.
	* task.c (GOMP_PLUGIN_target_task_completion): Free
	firstprivate_copies.
	(gomp_create_target_task): Accept new argument args and store it to
	ttask.

liboffloadmic/plugin
	* libgomp-plugin-intelmic.cpp (GOMP_OFFLOAD_async_run): New unused
	parameter.
	(GOMP_OFFLOAD_run): Likewise.

Comments

Jakub Jelinek Dec. 9, 2015, 11:35 a.m. UTC | #1
On Mon, Dec 07, 2015 at 12:19:57PM +0100, Martin Jambor wrote:
> +/* Flag set when the subsequent element in the device-specific argument
> +   values.  */
> +#define GOMP_TARGET_ARG_SUBSEQUENT_PARAM	(1 << 7)
> +
> +/* Bitmask to apply to a target argument to find out the value identifier.  */
> +#define GOMP_TARGET_ARG_ID_MASK			(((1 << 8) - 1) << 8)
> +/* Target argument index of NUM_TEAMS.  */
> +#define GOMP_TARGET_ARG_NUM_TEAMS		(1 << 8)
> +/* Target argument index of THREAD_LIMIT.  */
> +#define GOMP_TARGET_ARG_THREAD_LIMIT		(2 << 8)

I meant that these two would be just special, passed as the first two
pointers in the array, without the markup.  Because, otherwise you either
need to use GOMP_TARGET_ARG_SUBSEQUENT_PARAM for these always, or for 32-bit
arches and for 64-bit ones shift often at runtime.  Having the markup even
for them is perhaps cleaner, but less efficient, so if you really want to go
that way, please make sure you handle it properly for 32-bit pointers
architectures though.  num_teams or thread_limit could be > 32767 or >
65535.

> -static void
> -gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
> -				   void **hostaddrs, size_t *sizes,
> -				   unsigned short *kinds)
> +static void *
> +gomp_target_unshare_firstprivate (size_t mapnum, void **hostaddrs,
> +				  size_t *sizes, unsigned short *kinds)
>  {
>    size_t i, tgt_align = 0, tgt_size = 0;
>    char *tgt = NULL;
> @@ -1281,7 +1282,7 @@ gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
>        }
>    if (tgt_align)
>      {
> -      tgt = gomp_alloca (tgt_size + tgt_align - 1);
> +      tgt = gomp_malloc (tgt_size + tgt_align - 1);

I don't like using gomp_malloc here, either copy/paste the function, or
create separate inline functions for the two loops, one for the first loop
which returns you tgt_align and tgt_size, and another for the stuff after
the allocation.  Then you can use those two inline functions to implement
both gomp_target_fallback_firstprivate which will use alloca, and
gomp_target_unshare_firstprivate which will use gomp_malloc instead.

> @@ -1356,6 +1377,11 @@ GOMP_target (int device, void (*fn) (void *), const void *unused,
>     and several arguments have been added:
>     FLAGS is a bitmask, see GOMP_TARGET_FLAG_* in gomp-constants.h.
>     DEPEND is array of dependencies, see GOMP_task for details.
> +   ARGS is a pointer to an array consisting of NUM_TEAMS, THREAD_LIMIT and a
> +   variable number of device-specific arguments, which always take two elements
> +   where the first specifies the type and the second the actual value.  The
> +   last element of the array is a single NULL.

Note, here you document NUM_TEAMS and THREAD_LIMIT as special values, not
encoded.

> @@ -1473,6 +1508,7 @@ GOMP_target_data (int device, const void *unused, size_t mapnum,
>    struct gomp_device_descr *devicep = resolve_device (device);
>  
>    if (devicep == NULL
> +      || (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
>        || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))

Would be nice to have some consistency in the order of capabilities checks.
Usually you check SHARED_MEM after OPENMP_400, so perhaps do it this way
here too.

> @@ -1741,23 +1784,38 @@ gomp_target_task_fn (void *data)
>  
>        if (ttask->state == GOMP_TARGET_TASK_FINISHED)
>  	{
> -	  gomp_unmap_vars (ttask->tgt, true);
> +	  if (ttask->tgt)
> +	    gomp_unmap_vars (ttask->tgt, true);
>  	  return false;
>  	}

This doesn't make sense.  For the GOMP_OFFLOAD_CAP_SHARED_MEM case, unless
you want to run the free (ttask->firstprivate_copies); as a task, you
shouldn't be queing the target task again for further execution, instead
it should just be handled like a finished task at that point.  The reason
why for XeonPhi or PTX gomp_target_task_fn is run with
GOMP_TARGET_TASK_FINISHED state is that gomp_unmap_vars can perform IO
actions and doing it with the tasking lock held is highly undesirable.
>  
> -      void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
> -      ttask->tgt
> -	= gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
> -			 ttask->sizes, ttask->kinds, true,
> -			 GOMP_MAP_VARS_TARGET);
> +      bool shared_mem;
> +      if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
> +	{
> +	  shared_mem = true;
> +	  ttask->tgt = NULL;
> +	  ttask->firstprivate_copies
> +	    = gomp_target_unshare_firstprivate (ttask->mapnum, ttask->hostaddrs,
> +						ttask->sizes, ttask->kinds);
> +	}
> +      else
> +	{
> +	  shared_mem = false;
> +	  ttask->tgt = gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs,
> +				      NULL, ttask->sizes, ttask->kinds, true,
> +				      GOMP_MAP_VARS_TARGET);
> +	}
>        ttask->state = GOMP_TARGET_TASK_READY_TO_RUN;
>  
>        devicep->async_run_func (devicep->target_id, fn_addr,
> -			       (void *) ttask->tgt->tgt_start, (void *) ttask);
> +			       shared_mem ? ttask->hostaddrs
> +			       : (void *) ttask->tgt->tgt_start,
> +			       ttask->args, (void *) ttask);

Replace the shared_mem bool variable with a void *arg or some other name and just
set it to ttask->hostaddrs in the then case and ttask->tgt->tgt_start
otherwise?

> --- a/libgomp/task.c
> +++ b/libgomp/task.c
> @@ -581,6 +581,7 @@ GOMP_PLUGIN_target_task_completion (void *data)
>        gomp_mutex_unlock (&team->task_lock);
>      }
>    ttask->state = GOMP_TARGET_TASK_FINISHED;
> +  free (ttask->firstprivate_copies);
>    gomp_target_task_completion (team, task);
>    gomp_mutex_unlock (&team->task_lock);
>  }

So, this function should have a special case for the SHARED_MEM case, handle
it closely to say how GOMP_taskgroup_end handles the finish_cancelled:
case.  Just note that the target task is missing from certain queues at that
point.

	Jakub
Alexander Monakov Jan. 12, 2016, 1 p.m. UTC | #2
Hello, Martin, Jakub, community,

This part of the patch:

On Mon, 7 Dec 2015, Martin Jambor wrote:
> include/
> 	* gomp-constants.h (GOMP_DEVICE_HSA): New macro.
[snip]
> 	(GOMP_kernel_launch_attributes): New type.
> 	(GOMP_hsa_kernel_dispatch): New type.

is going to break build of NVPTX cross-compiler, because it uses uint32_t,
uint64_t types like below, but those types will not be available when building
nvptx libgcc.  gomp-constants.h is #include'd in libgcc via tm.h and
offload.h.

Note how other files in include/ need to do a special dance with #ifdef
HAVE_STDINT_H to include <stdint.h> and obtain uint64_t.

Shall I move the problematic structs into a separate file, gomp-types.h?

Thanks.
Alexander

> diff --git a/include/gomp-constants.h b/include/gomp-constants.h
> index dffd631..1dae474 100644
> --- a/include/gomp-constants.h
> +++ b/include/gomp-constants.h
[snip]
> +/* Structure describing the run-time and grid properties of an HSA kernel
> +   lauch.  */
> +
> +struct GOMP_kernel_launch_attributes
> +{
> +  /* Number of dimensions the workload has.  Maximum number is 3.  */
> +  uint32_t ndim;
> +  /* Size of the grid in the three respective dimensions.  */
> +  uint32_t gdims[3];
> +  /* Size of work-groups in the respective dimensions.  */
> +  uint32_t wdims[3];
> +};
Jakub Jelinek Jan. 12, 2016, 1:10 p.m. UTC | #3
On Tue, Jan 12, 2016 at 04:00:11PM +0300, Alexander Monakov wrote:
> Hello, Martin, Jakub, community,
> 
> This part of the patch:
> 
> On Mon, 7 Dec 2015, Martin Jambor wrote:
> > include/
> > 	* gomp-constants.h (GOMP_DEVICE_HSA): New macro.
> [snip]
> > 	(GOMP_kernel_launch_attributes): New type.
> > 	(GOMP_hsa_kernel_dispatch): New type.
> 
> is going to break build of NVPTX cross-compiler, because it uses uint32_t,
> uint64_t types like below, but those types will not be available when building
> nvptx libgcc.  gomp-constants.h is #include'd in libgcc via tm.h and
> offload.h.
> 
> Note how other files in include/ need to do a special dance with #ifdef
> HAVE_STDINT_H to include <stdint.h> and obtain uint64_t.
> 
> Shall I move the problematic structs into a separate file, gomp-types.h?

Or just move those into libgomp-plugin.h, those type definitions don't have
to be shared between the compiler and libgomp, the compiler has to duplicate
those definitions anyway, as it needs to create the IL of those types and
can't use the host structure type for that purpose.

> > diff --git a/include/gomp-constants.h b/include/gomp-constants.h
> > index dffd631..1dae474 100644
> > --- a/include/gomp-constants.h
> > +++ b/include/gomp-constants.h
> [snip]
> > +/* Structure describing the run-time and grid properties of an HSA kernel
> > +   lauch.  */
> > +
> > +struct GOMP_kernel_launch_attributes
> > +{
> > +  /* Number of dimensions the workload has.  Maximum number is 3.  */
> > +  uint32_t ndim;
> > +  /* Size of the grid in the three respective dimensions.  */
> > +  uint32_t gdims[3];
> > +  /* Size of work-groups in the respective dimensions.  */
> > +  uint32_t wdims[3];
> > +};

	Jakub
Jakub Jelinek Jan. 12, 2016, 1:38 p.m. UTC | #4
On Tue, Jan 12, 2016 at 02:29:06PM +0100, Martin Jambor wrote:
> GOMP_kernel_launch_attributes should not be there (it is a
> reminiscence from before the device-specific target arguments) and
> should be moved just to the HSA plugin.  I'll prepare a patch today.
> 
> While we do not have to share GOMP_hsa_kernel_dispatch, we actually do
> use them in both the plugin and the compiler, where we only use it in
> an offsetof, so that we only have the structure defined once.

But, even using it in offsetof might be wrong, the compiler could be a
cross-compiler, and you'd use offsetof on the host, while you want it for
the target, and that would be different.
So, IMHO you need (unless you already have) built the structure as a tree
type, lay it out, and then you can use at TYPE_SIZE_UNIT or
DECL_FIELD_OFFSET and the like.

	Jakub
diff mbox

Patch

diff --git a/include/gomp-constants.h b/include/gomp-constants.h
index dffd631..1dae474 100644
--- a/include/gomp-constants.h
+++ b/include/gomp-constants.h
@@ -176,6 +176,7 @@  enum gomp_map_kind
 #define GOMP_DEVICE_NOT_HOST		4
 #define GOMP_DEVICE_NVIDIA_PTX		5
 #define GOMP_DEVICE_INTEL_MIC		6
+#define GOMP_DEVICE_HSA			7
 
 #define GOMP_DEVICE_ICV			-1
 #define GOMP_DEVICE_HOST_FALLBACK	-2
@@ -201,6 +202,7 @@  enum gomp_map_kind
 #define GOMP_VERSION	0
 #define GOMP_VERSION_NVIDIA_PTX 1
 #define GOMP_VERSION_INTEL_MIC 0
+#define GOMP_VERSION_HSA 0
 
 #define GOMP_VERSION_PACK(LIB, DEV) (((LIB) << 16) | (DEV))
 #define GOMP_VERSION_LIB(PACK) (((PACK) >> 16) & 0xffff)
@@ -228,4 +230,74 @@  enum gomp_map_kind
 #define GOMP_LAUNCH_OP(X) (((X) >> GOMP_LAUNCH_OP_SHIFT) & 0xffff)
 #define GOMP_LAUNCH_OP_MAX 0xffff
 
+/* Bitmask to apply in order to find out the intended device of a target
+   argument.  */
+#define GOMP_TARGET_ARG_DEVICE_MASK		((1 << 7) - 1)
+/* The target argument is significant for all devices.  */
+#define GOMP_TARGET_ARG_DEVICE_ALL		0
+
+/* Flag set when the subsequent element in the device-specific argument
+   values.  */
+#define GOMP_TARGET_ARG_SUBSEQUENT_PARAM	(1 << 7)
+
+/* Bitmask to apply to a target argument to find out the value identifier.  */
+#define GOMP_TARGET_ARG_ID_MASK			(((1 << 8) - 1) << 8)
+/* Target argument index of NUM_TEAMS.  */
+#define GOMP_TARGET_ARG_NUM_TEAMS		(1 << 8)
+/* Target argument index of THREAD_LIMIT.  */
+#define GOMP_TARGET_ARG_THREAD_LIMIT		(2 << 8)
+
+/* If the value is directly embeded in target argument, it should be a 16-bit
+   at most and shifted by this many bits.  */
+#define GOMP_TARGET_ARG_VALUE_SHIFT		16
+
+/* HSA specific data structures.  */
+
+/* Identifiers of device-specific target arguments.  */
+#define GOMP_TARGET_ARG_HSA_KERNEL_ATTRIBUTES	(1 << 8)
+
+/* Structure describing the run-time and grid properties of an HSA kernel
+   lauch.  */
+
+struct GOMP_kernel_launch_attributes
+{
+  /* Number of dimensions the workload has.  Maximum number is 3.  */
+  uint32_t ndim;
+  /* Size of the grid in the three respective dimensions.  */
+  uint32_t gdims[3];
+  /* Size of work-groups in the respective dimensions.  */
+  uint32_t wdims[3];
+};
+
+/* Collection of information needed for a dispatch of a kernel from a
+   kernel.  */
+
+struct GOMP_hsa_kernel_dispatch
+{
+  /* Pointer to a command queue associated with a kernel dispatch agent.  */
+  void *queue;
+  /* Pointer to reserved memory for OMP data struct copying.  */
+  void *omp_data_memory;
+  /* Pointer to a memory space used for kernel arguments passing.  */
+  void *kernarg_address;
+  /* Kernel object.  */
+  uint64_t object;
+  /* Synchronization signal used for dispatch synchronization.  */
+  uint64_t signal;
+  /* Private segment size.  */
+  uint32_t private_segment_size;
+  /* Group segment size.  */
+  uint32_t group_segment_size;
+  /* Number of children kernel dispatches.  */
+  uint64_t kernel_dispatch_count;
+  /* Number of threads.  */
+  uint32_t omp_num_threads;
+  /* Debug purpose argument.  */
+  uint64_t debug;
+  /* Levels-var ICV.  */
+  uint64_t omp_level;
+  /* Kernel dispatch structures created for children kernel dispatches.  */
+  struct GOMP_hsa_kernel_dispatch **children_dispatches;
+};
+
 #endif
diff --git a/libgomp/libgomp-plugin.h b/libgomp/libgomp-plugin.h
index ab22e85..0204491 100644
--- a/libgomp/libgomp-plugin.h
+++ b/libgomp/libgomp-plugin.h
@@ -48,7 +48,8 @@  enum offload_target_type
   OFFLOAD_TARGET_TYPE_HOST = 2,
   /* OFFLOAD_TARGET_TYPE_HOST_NONSHM = 3 removed.  */
   OFFLOAD_TARGET_TYPE_NVIDIA_PTX = 5,
-  OFFLOAD_TARGET_TYPE_INTEL_MIC = 6
+  OFFLOAD_TARGET_TYPE_INTEL_MIC = 6,
+  OFFLOAD_TARGET_TYPE_HSA = 7
 };
 
 /* Auxiliary struct, used for transferring pairs of addresses from plugin
diff --git a/libgomp/libgomp.h b/libgomp/libgomp.h
index c467f97..16c591f 100644
--- a/libgomp/libgomp.h
+++ b/libgomp/libgomp.h
@@ -496,6 +496,10 @@  struct gomp_target_task
   struct target_mem_desc *tgt;
   struct gomp_task *task;
   struct gomp_team *team;
+  /* Copies of firstprivate mapped data for shared memory accelerators.  */
+  void *firstprivate_copies;
+  /* Device-specific target arguments.  */
+  void **args;
   void *hostaddrs[];
 };
 
@@ -750,7 +754,8 @@  extern void gomp_task_maybe_wait_for_dependencies (void **);
 extern bool gomp_create_target_task (struct gomp_device_descr *,
 				     void (*) (void *), size_t, void **,
 				     size_t *, unsigned short *, unsigned int,
-				     void **, enum gomp_target_task_state);
+				     void **, void **,
+				     enum gomp_target_task_state);
 
 static void inline
 gomp_finish_task (struct gomp_task *task)
@@ -924,8 +929,9 @@  struct gomp_device_descr
   void *(*dev2host_func) (int, void *, const void *, size_t);
   void *(*host2dev_func) (int, void *, const void *, size_t);
   void *(*dev2dev_func) (int, void *, const void *, size_t);
-  void (*run_func) (int, void *, void *);
-  void (*async_run_func) (int, void *, void *, void *);
+  bool (*can_run_func) (void *);
+  void (*run_func) (int, void *, void *, void **);
+  void (*async_run_func) (int, void *, void *, void **, void *);
 
   /* Splay tree containing information about mapped memory regions.  */
   struct splay_tree_s mem_map;
diff --git a/libgomp/libgomp_g.h b/libgomp/libgomp_g.h
index c238e6a..9c90d59 100644
--- a/libgomp/libgomp_g.h
+++ b/libgomp/libgomp_g.h
@@ -278,8 +278,7 @@  extern void GOMP_single_copy_end (void *);
 extern void GOMP_target (int, void (*) (void *), const void *,
 			 size_t, void **, size_t *, unsigned char *);
 extern void GOMP_target_ext (int, void (*) (void *), size_t, void **, size_t *,
-			     unsigned short *, unsigned int, void **,
-			     int, int);
+			     unsigned short *, unsigned int, void **, void **);
 extern void GOMP_target_data (int, const void *,
 			      size_t, void **, size_t *, unsigned char *);
 extern void GOMP_target_data_ext (int, size_t, void **, size_t *,
diff --git a/libgomp/oacc-host.c b/libgomp/oacc-host.c
index 9874804..a769211 100644
--- a/libgomp/oacc-host.c
+++ b/libgomp/oacc-host.c
@@ -123,7 +123,8 @@  host_host2dev (int n __attribute__ ((unused)),
 }
 
 static void
-host_run (int n __attribute__ ((unused)), void *fn_ptr, void *vars)
+host_run (int n __attribute__ ((unused)), void *fn_ptr, void *vars,
+	  void **args __attribute__((unused)))
 {
   void (*fn)(void *) = (void (*)(void *)) fn_ptr;
 
diff --git a/libgomp/target.c b/libgomp/target.c
index cf9d0e6..b453c0c 100644
--- a/libgomp/target.c
+++ b/libgomp/target.c
@@ -1261,12 +1261,13 @@  gomp_target_fallback (void (*fn) (void *), void **hostaddrs)
   *thr = old_thr;
 }
 
-/* Host fallback with firstprivate map-type handling.  */
+/* Handle firstprivate map-type for shared memory devices and the host
+   fallback.  Return the pointer of firstprivate copies which has to be freed
+   after use.  */
 
-static void
-gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
-				   void **hostaddrs, size_t *sizes,
-				   unsigned short *kinds)
+static void *
+gomp_target_unshare_firstprivate (size_t mapnum, void **hostaddrs,
+				  size_t *sizes, unsigned short *kinds)
 {
   size_t i, tgt_align = 0, tgt_size = 0;
   char *tgt = NULL;
@@ -1281,7 +1282,7 @@  gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
       }
   if (tgt_align)
     {
-      tgt = gomp_alloca (tgt_size + tgt_align - 1);
+      tgt = gomp_malloc (tgt_size + tgt_align - 1);
       uintptr_t al = (uintptr_t) tgt & (tgt_align - 1);
       if (al)
 	tgt += tgt_align - al;
@@ -1296,7 +1297,19 @@  gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
 	    tgt_size = tgt_size + sizes[i];
 	  }
     }
+  return tgt;
+}
+
+/* Host fallback with firstprivate map-type handling.  */
+
+static void
+gomp_target_fallback_firstprivate (void (*fn) (void *), size_t mapnum,
+				   void **hostaddrs, size_t *sizes,
+				   unsigned short *kinds)
+{
+  void *fpc = gomp_target_unshare_firstprivate (mapnum, hostaddrs, sizes, kinds);
   gomp_target_fallback (fn, hostaddrs);
+  free (fpc);
 }
 
 /* Helper function of GOMP_target{,_ext} routines.  */
@@ -1316,7 +1329,12 @@  gomp_get_target_fn_addr (struct gomp_device_descr *devicep,
       splay_tree_key tgt_fn = splay_tree_lookup (&devicep->mem_map, &k);
       gomp_mutex_unlock (&devicep->lock);
       if (tgt_fn == NULL)
-	gomp_fatal ("Target function wasn't mapped");
+	{
+	  if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+	    return NULL;
+	  else
+	    gomp_fatal ("Target function wasn't mapped");
+	}
 
       return (void *) tgt_fn->tgt_offset;
     }
@@ -1340,7 +1358,9 @@  GOMP_target (int device, void (*fn) (void *), const void *unused,
   struct gomp_device_descr *devicep = resolve_device (device);
 
   if (devicep == NULL
-      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      /* All shared memory devices should use the GOMP_target_ext function.  */
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return gomp_target_fallback (fn, hostaddrs);
 
   void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
@@ -1348,7 +1368,8 @@  GOMP_target (int device, void (*fn) (void *), const void *unused,
   struct target_mem_desc *tgt_vars
     = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, false,
 		     GOMP_MAP_VARS_TARGET);
-  devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start);
+  devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start,
+		     NULL);
   gomp_unmap_vars (tgt_vars, true);
 }
 
@@ -1356,6 +1377,11 @@  GOMP_target (int device, void (*fn) (void *), const void *unused,
    and several arguments have been added:
    FLAGS is a bitmask, see GOMP_TARGET_FLAG_* in gomp-constants.h.
    DEPEND is array of dependencies, see GOMP_task for details.
+   ARGS is a pointer to an array consisting of NUM_TEAMS, THREAD_LIMIT and a
+   variable number of device-specific arguments, which always take two elements
+   where the first specifies the type and the second the actual value.  The
+   last element of the array is a single NULL.
+
    NUM_TEAMS is positive if GOMP_teams will be called in the body with
    that value, or 1 if teams construct is not present, or 0, if
    teams construct does not have num_teams clause and so the choice is
@@ -1369,14 +1395,10 @@  GOMP_target (int device, void (*fn) (void *), const void *unused,
 void
 GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 		 void **hostaddrs, size_t *sizes, unsigned short *kinds,
-		 unsigned int flags, void **depend, int num_teams,
-		 int thread_limit)
+		 unsigned int flags, void **depend, void **args)
 {
   struct gomp_device_descr *devicep = resolve_device (device);
 
-  (void) num_teams;
-  (void) thread_limit;
-
   if (flags & GOMP_TARGET_FLAG_NOWAIT)
     {
       struct gomp_thread *thr = gomp_thread ();
@@ -1413,7 +1435,7 @@  GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 	  && !thr->task->final_task)
 	{
 	  gomp_create_target_task (devicep, fn, mapnum, hostaddrs,
-				   sizes, kinds, flags, depend,
+				   sizes, kinds, flags, depend, args,
 				   GOMP_TARGET_TASK_BEFORE_MAP);
 	  return;
 	}
@@ -1430,20 +1452,33 @@  GOMP_target_ext (int device, void (*fn) (void *), size_t mapnum,
 	gomp_task_maybe_wait_for_dependencies (depend);
     }
 
+  void *fn_addr;
   if (devicep == NULL
-      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || !(fn_addr = gomp_get_target_fn_addr (devicep, fn))
+      || (devicep->can_run_func && !devicep->can_run_func (fn_addr)))
     {
       gomp_target_fallback_firstprivate (fn, mapnum, hostaddrs, sizes, kinds);
       return;
     }
 
-  void *fn_addr = gomp_get_target_fn_addr (devicep, fn);
-
-  struct target_mem_desc *tgt_vars
-    = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds, true,
-		     GOMP_MAP_VARS_TARGET);
-  devicep->run_func (devicep->target_id, fn_addr, (void *) tgt_vars->tgt_start);
-  gomp_unmap_vars (tgt_vars, true);
+  struct target_mem_desc *tgt_vars;
+  void *fpc = NULL;
+  if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+    {
+      fpc = gomp_target_unshare_firstprivate (mapnum, hostaddrs, sizes, kinds);
+      tgt_vars = NULL;
+    }
+  else
+    tgt_vars = gomp_map_vars (devicep, mapnum, hostaddrs, NULL, sizes, kinds,
+			      true, GOMP_MAP_VARS_TARGET);
+  devicep->run_func (devicep->target_id, fn_addr,
+		     tgt_vars ? (void *) tgt_vars->tgt_start : hostaddrs,
+		     args);
+  if (tgt_vars)
+    gomp_unmap_vars (tgt_vars, true);
+  else
+    free (fpc);
 }
 
 /* Host fallback for GOMP_target_data{,_ext} routines.  */
@@ -1473,6 +1508,7 @@  GOMP_target_data (int device, const void *unused, size_t mapnum,
   struct gomp_device_descr *devicep = resolve_device (device);
 
   if (devicep == NULL
+      || (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
       || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
     return gomp_target_data_fallback ();
 
@@ -1491,7 +1527,8 @@  GOMP_target_data_ext (int device, size_t mapnum, void **hostaddrs,
   struct gomp_device_descr *devicep = resolve_device (device);
 
   if (devicep == NULL
-      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return gomp_target_data_fallback ();
 
   struct target_mem_desc *tgt
@@ -1521,7 +1558,8 @@  GOMP_target_update (int device, const void *unused, size_t mapnum,
   struct gomp_device_descr *devicep = resolve_device (device);
 
   if (devicep == NULL
-      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return;
 
   gomp_update (devicep, mapnum, hostaddrs, sizes, kinds, false);
@@ -1552,7 +1590,7 @@  GOMP_target_update_ext (int device, size_t mapnum, void **hostaddrs,
 	      if (gomp_create_target_task (devicep, (void (*) (void *)) NULL,
 					   mapnum, hostaddrs, sizes, kinds,
 					   flags | GOMP_TARGET_FLAG_UPDATE,
-					   depend, GOMP_TARGET_TASK_DATA))
+					   depend, NULL, GOMP_TARGET_TASK_DATA))
 		return;
 	    }
 	  else
@@ -1572,7 +1610,8 @@  GOMP_target_update_ext (int device, size_t mapnum, void **hostaddrs,
     }
 
   if (devicep == NULL
-      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return;
 
   struct gomp_thread *thr = gomp_thread ();
@@ -1673,7 +1712,7 @@  GOMP_target_enter_exit_data (int device, size_t mapnum, void **hostaddrs,
 	    {
 	      if (gomp_create_target_task (devicep, (void (*) (void *)) NULL,
 					   mapnum, hostaddrs, sizes, kinds,
-					   flags, depend,
+					   flags, depend, NULL,
 					   GOMP_TARGET_TASK_DATA))
 		return;
 	    }
@@ -1694,7 +1733,8 @@  GOMP_target_enter_exit_data (int device, size_t mapnum, void **hostaddrs,
     }
 
   if (devicep == NULL
-      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return;
 
   struct gomp_thread *thr = gomp_thread ();
@@ -1729,8 +1769,11 @@  gomp_target_task_fn (void *data)
 
   if (ttask->fn != NULL)
     {
+      void *fn_addr;
       if (devicep == NULL
-	  || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+	  || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+	  || !(fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn))
+	  || (devicep->can_run_func && !devicep->can_run_func (fn_addr)))
 	{
 	  ttask->state = GOMP_TARGET_TASK_FALLBACK;
 	  gomp_target_fallback_firstprivate (ttask->fn, ttask->mapnum,
@@ -1741,23 +1784,38 @@  gomp_target_task_fn (void *data)
 
       if (ttask->state == GOMP_TARGET_TASK_FINISHED)
 	{
-	  gomp_unmap_vars (ttask->tgt, true);
+	  if (ttask->tgt)
+	    gomp_unmap_vars (ttask->tgt, true);
 	  return false;
 	}
 
-      void *fn_addr = gomp_get_target_fn_addr (devicep, ttask->fn);
-      ttask->tgt
-	= gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs, NULL,
-			 ttask->sizes, ttask->kinds, true,
-			 GOMP_MAP_VARS_TARGET);
+      bool shared_mem;
+      if (devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
+	{
+	  shared_mem = true;
+	  ttask->tgt = NULL;
+	  ttask->firstprivate_copies
+	    = gomp_target_unshare_firstprivate (ttask->mapnum, ttask->hostaddrs,
+						ttask->sizes, ttask->kinds);
+	}
+      else
+	{
+	  shared_mem = false;
+	  ttask->tgt = gomp_map_vars (devicep, ttask->mapnum, ttask->hostaddrs,
+				      NULL, ttask->sizes, ttask->kinds, true,
+				      GOMP_MAP_VARS_TARGET);
+	}
       ttask->state = GOMP_TARGET_TASK_READY_TO_RUN;
 
       devicep->async_run_func (devicep->target_id, fn_addr,
-			       (void *) ttask->tgt->tgt_start, (void *) ttask);
+			       shared_mem ? ttask->hostaddrs
+			       : (void *) ttask->tgt->tgt_start,
+			       ttask->args, (void *) ttask);
       return true;
     }
   else if (devicep == NULL
-	   || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+	   || !(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+	   || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return false;
 
   size_t i;
@@ -1807,7 +1865,8 @@  omp_target_alloc (size_t size, int device_num)
   if (devicep == NULL)
     return NULL;
 
-  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return malloc (size);
 
   gomp_mutex_lock (&devicep->lock);
@@ -1835,7 +1894,8 @@  omp_target_free (void *device_ptr, int device_num)
   if (devicep == NULL)
     return;
 
-  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     {
       free (device_ptr);
       return;
@@ -1862,7 +1922,8 @@  omp_target_is_present (void *ptr, int device_num)
   if (devicep == NULL)
     return 0;
 
-  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return 1;
 
   gomp_mutex_lock (&devicep->lock);
@@ -1892,7 +1953,8 @@  omp_target_memcpy (void *dst, void *src, size_t length, size_t dst_offset,
       if (dst_devicep == NULL)
 	return EINVAL;
 
-      if (!(dst_devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      if (!(dst_devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+	  || dst_devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
 	dst_devicep = NULL;
     }
   if (src_device_num != GOMP_DEVICE_HOST_FALLBACK)
@@ -1904,7 +1966,8 @@  omp_target_memcpy (void *dst, void *src, size_t length, size_t dst_offset,
       if (src_devicep == NULL)
 	return EINVAL;
 
-      if (!(src_devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      if (!(src_devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+	  || src_devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
 	src_devicep = NULL;
     }
   if (src_devicep == NULL && dst_devicep == NULL)
@@ -2034,7 +2097,8 @@  omp_target_memcpy_rect (void *dst, void *src, size_t element_size,
       if (dst_devicep == NULL)
 	return EINVAL;
 
-      if (!(dst_devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      if (!(dst_devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+	  || dst_devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
 	dst_devicep = NULL;
     }
   if (src_device_num != GOMP_DEVICE_HOST_FALLBACK)
@@ -2046,7 +2110,8 @@  omp_target_memcpy_rect (void *dst, void *src, size_t element_size,
       if (src_devicep == NULL)
 	return EINVAL;
 
-      if (!(src_devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+      if (!(src_devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+	  || src_devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
 	src_devicep = NULL;
     }
 
@@ -2082,7 +2147,8 @@  omp_target_associate_ptr (void *host_ptr, void *device_ptr, size_t size,
   if (devicep == NULL)
     return EINVAL;
 
-  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400))
+  if (!(devicep->capabilities & GOMP_OFFLOAD_CAP_OPENMP_400)
+      || devicep->capabilities & GOMP_OFFLOAD_CAP_SHARED_MEM)
     return EINVAL;
 
   gomp_mutex_lock (&devicep->lock);
@@ -2225,6 +2291,7 @@  gomp_load_plugin_for_device (struct gomp_device_descr *device,
     {
       DLSYM (run);
       DLSYM (async_run);
+      DLSYM_OPT (can_run, can_run);
       DLSYM (dev2dev);
     }
   if (device->capabilities & GOMP_OFFLOAD_CAP_OPENACC_200)
diff --git a/libgomp/task.c b/libgomp/task.c
index 620facd..aa6bd67 100644
--- a/libgomp/task.c
+++ b/libgomp/task.c
@@ -581,6 +581,7 @@  GOMP_PLUGIN_target_task_completion (void *data)
       gomp_mutex_unlock (&team->task_lock);
     }
   ttask->state = GOMP_TARGET_TASK_FINISHED;
+  free (ttask->firstprivate_copies);
   gomp_target_task_completion (team, task);
   gomp_mutex_unlock (&team->task_lock);
 }
@@ -593,7 +594,7 @@  bool
 gomp_create_target_task (struct gomp_device_descr *devicep,
 			 void (*fn) (void *), size_t mapnum, void **hostaddrs,
 			 size_t *sizes, unsigned short *kinds,
-			 unsigned int flags, void **depend,
+			 unsigned int flags, void **depend, void **args,
 			 enum gomp_target_task_state state)
 {
   struct gomp_thread *thr = gomp_thread ();
@@ -653,6 +654,7 @@  gomp_create_target_task (struct gomp_device_descr *devicep,
   ttask->devicep = devicep;
   ttask->fn = fn;
   ttask->mapnum = mapnum;
+  ttask->args = args;
   memcpy (ttask->hostaddrs, hostaddrs, mapnum * sizeof (void *));
   ttask->sizes = (size_t *) &ttask->hostaddrs[mapnum];
   memcpy (ttask->sizes, sizes, mapnum * sizeof (size_t));
diff --git a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
index f8c1725..48599dd 100644
--- a/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
+++ b/liboffloadmic/plugin/libgomp-plugin-intelmic.cpp
@@ -539,7 +539,7 @@  GOMP_OFFLOAD_dev2dev (int device, void *dst_ptr, const void *src_ptr,
 
 extern "C" void
 GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars,
-			void *async_data)
+			void **, void *async_data)
 {
   TRACE ("(device = %d, tgt_fn = %p, tgt_vars = %p, async_data = %p)", device,
 	 tgt_fn, tgt_vars, async_data);
@@ -555,7 +555,7 @@  GOMP_OFFLOAD_async_run (int device, void *tgt_fn, void *tgt_vars,
 }
 
 extern "C" void
-GOMP_OFFLOAD_run (int device, void *tgt_fn, void *tgt_vars)
+GOMP_OFFLOAD_run (int device, void *tgt_fn, void *tgt_vars, void **)
 {
   TRACE ("(device = %d, tgt_fn = %p, tgt_vars = %p)", device, tgt_fn, tgt_vars);