Message ID: 4d7dbb50-e8db-209b-63e2-0efaa18eeec1@mentor.com
State: New
Series: [nvptx] Expand OpenACC child function arguments to use CUDA params space
Hi Chung-Lin!

On 2019-09-10T19:41:59+0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> this is a completely new implementation of an earlier optimization
> that Cesar submitted:
> https://gcc.gnu.org/ml/gcc-patches/2017-12/msg01202.html

Thanks for your re-work!

> The objective is to transform the original single-record-pointer argument
> form (OpenMP/pthreads originated) to multiple scalar parameters, which
> the CUDA runtime will place directly in the .params space for GPU kernels:
>
>   #pragma acc parallel copy(a, b) copyin(c)
>   {
>     a += b;
>     b -= c;
>   }
>
> compiles to GIMPLE as:
>
>   __attribute__((oacc function (1, 1, 32), omp target entrypoint))
>   main._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i)
>   {
>     ...
>     _3 = .omp_data_i_2(D)->a;
>     _4 = *_3;
>     _5 = .omp_data_i_2(D)->b;
>     _6 = *_5;
>     ...
>
> this patch adds a pass to transform this into:
>
>   __attribute__((oacc function (1, 1, 32), omp target entrypoint))
>   main._omp_fn.0 (int * c, int * b, int * a)
>   {
>     ...
>     _3 = a;
>     _4 = *_3;
>     _5 = b;
>     _6 = *_5;
>     ...

ACK.

> Cesar's original implementation tried to do this in the middle end,
> which required lots of changes throughout the compiler, the libgomp
> interface, etc., and required a dependency on libffi for the CPU-host
> fallback child function (since there is no longer a known, fixed
> single-pointer argument interface to all child functions)

Specifically, the major problem -- per my understanding -- is that Cesar's
implementation does this in the early stages of the middle end
('pass_lower_omp'), before the target vs. offload target code paths get
separated, and so the transformation was done for the target ("host
fallback") as well as all offload targets, without each of them having the
option to opt in or out.  As can be seen from the new, highly localized
code changes (nvptx code only), your re-work clearly fixes that aspect!  :-)

> This new implementation works by modifying the GIMPLE for child functions
> directly at the very start (before, actually) of RTL expansion

That's now near the other end of the pipeline.  ;-) What's the motivation
for putting it there, instead of early in the nvptx offloading compilation
(around 'pass_oacc_device_lower' etc. time, where I would've assumed this
transformation to be done)?  Not asking you to change that now, but curious
for the reason.

> and thus
> is placed in TARGET_EXPAND_TO_RTL_HOOK, as the core issue is we inherently
> need something different generated for the host fallback vs. for the GPU.

(Likewise, different per each offload target.)

> The new nvptx_expand_to_rtl_hook modifies the function decl type and
> arguments, and scans the gimple body to remove occurrences of .omp_data_i.*
> Detection of OpenACC child functions is done through the "omp target
> entrypoint" and "oacc function" attributes.  Because OpenMP target child
> functions have a more elaborate wrapper generated for them, this pass only
> supports OpenACC right now.

At the Cauldron, the question indeed was raised (Jakub, Tom) why this isn't
enabled for OpenMP, too.  My answer was that this surely can be done, but
the change as presented here already is an improvement over the current
status ("stands on its own", as Jeff Law would call it), so I'm fine with
you handling OpenACC first; OpenMP can then follow later (at some as of yet
indeterminate point in time, even).

> libgomp has been tested with this patch on x86_64-linux (nvptx-none
> accelerator) without regressions

Can you present performance numbers, too?

> (I'm currently undergoing more gcc tests as well).

As these changes, being confined to nvptx code only, can't possibly have
any effect on other target testing, I assume that's nvptx target testing
you're talking about?  (..., where also I'm not expecting any disturbance.)

> Is this okay for trunk?

I'm not the one to approve these code changes, but I do have a few
comments/questions:

> --- gcc/config/nvptx/nvptx.c	(revision 275493)
> +++ gcc/config/nvptx/nvptx.c	(working copy)
> +static void
> +nvptx_expand_to_rtl_hook (void)
> +{
> +  /* For utilizing CUDA .param kernel arguments, we detect and modify
> +     the gimple of offloaded child functions, here before RTL expansion,
> +     starting with standard OMP form:
> +       foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
> +
> +     and transform it into a style where the OMP data record fields are
> +     "exploded" into individual scalar arguments:
> +       foo._omp_fn.0 (int * a, int * b, int * c) { ... }
> +
> +     Note that there are implicit assumptions of how OMP lowering (and/or other
> +     intervening passes) behaves contained in this transformation code;
> +     if those passes change in their output, this code may possibly need
> +     updating.  */
> +
> +  if (lookup_attribute ("omp target entrypoint",
> +                        DECL_ATTRIBUTES (current_function_decl))
> +      /* The rather indirect manner in which OpenMP target functions are
> +         launched makes this transformation only valid for OpenACC currently.
> +         TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
> +         needs changes for this to work with OpenMP.  */
> +      && lookup_attribute ("oacc function",
> +                           DECL_ATTRIBUTES (current_function_decl))
> +      && VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl))))

Why the 'void' return conditional?  (Or, should that rather be a
'gcc_checking_assert' at the top of the following block?)

> +    {
> +      tree omp_data_arg = DECL_ARGUMENTS (current_function_decl);
> +      tree argtype = TREE_TYPE (omp_data_arg);
> +
> +      /* Ensure this function is of the form of a single reference argument
> +         to the OMP data record, or a single void* argument (when no values
> +         passed) */
> +      if (! (DECL_CHAIN (omp_data_arg) == NULL_TREE
> +             && ((TREE_CODE (argtype) == REFERENCE_TYPE
> +                  && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE)
> +                 || (TREE_CODE (argtype) == POINTER_TYPE
> +                     && TREE_TYPE (argtype) == void_type_node))))
> +        return;

Again, is that something we should 'gcc_checking_assert', so that we'll
notice when something changes/breaks?

Given your note above, "there are implicit assumptions [on] OMP lowering",
I'd assume that this code here does quite some 'gcc_checking_assert'ions to
make sure that we're within the expected bounds.

> +      /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by
> +         scanning and skipping those entries, creating a new local_decls list.
> +         We assume a very specific MEM_REF tree expression shape. */
> +      tree decl;
> +      unsigned int i;
> +      vec<tree, va_gc> *new_local_decls = NULL;
> +      FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl)
> +        {
> +          if (DECL_HAS_VALUE_EXPR_P (decl))
> +            {
> +              tree t = DECL_VALUE_EXPR (decl);
> +              if (TREE_CODE (t) == MEM_REF
> +                  && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF
> +                  && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == MEM_REF
> +                  && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0)
> +                      == omp_data_arg))
> +                continue;
> +            }
> +          vec_safe_push (new_local_decls, decl);
> +        }
> +      vec_free (cfun->local_decls);
> +      cfun->local_decls = new_local_decls;

Is it worth doing that manually, or can/should some dead code elimination
pass deal with that?

> +      /* Scan function body for assignments from .omp_data_i->FIELD, and using
> +         the above created fld_to_args hash map, convert them to reads of
> +         function arguments.  */
> +          else if (TREE_CODE (val) == MEM_REF
> +                   && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
> +                   && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
> +            {
> +              /* This case may happen in the final tree level optimization
> +                 output, due to SLP:
> +                   vect.XX = MEM <vector(1) unsigned long> [(void *).omp_data_i_5(D) + 8B]
> +
> +                 Therefore here we need a more elaborate search of the field
> +                 list to reverse map to which field the offset is referring
> +                 to.  */

Would this be simpler if the conversion were done earlier?  (As I mentioned
above.)

> +          /* If we found the corresponding OMP data record field, replace the
> +             RHS with the newly created PARM_DECL.  */
> +          if (new_val != NULL_TREE)
> +            {
> +              if (dump_file)
> +                {
> +                  fprintf (dump_file, "For gimple stmt: ");
> +                  print_gimple_stmt (dump_file, stmt, 0);
> +                  fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n",
> +                           print_generic_expr_to_str (val),
> +                           print_generic_expr_to_str (new_val));
> +                }
> +              /* Write in looked up ARG as new RHS value.  */
> +              *val_ptr = new_val;
> +            }

If 'new_val == NULL_TREE', that simply means that we've been looking at
something that doesn't need to be handled here, right?

> +      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
> +      tree name;
> +      FOR_EACH_SSA_NAME (i, name, cfun)
> +        if (SSA_NAME_VAR (name) == omp_data_arg)
> +          (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;

Again, manual cleanup vs. automated?

> --- libgomp/plugin/plugin-nvptx.c	(revision 275493)
> +++ libgomp/plugin/plugin-nvptx.c	(working copy)
> @@ -1438,78 +1374,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void
>  				 unsigned *dims, void *targ_mem_desc,
>  				 struct goacc_asyncqueue *aq)
>  {
> [...]
> -  if (mapnum > 0)
> -    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
> +  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream);
>  }

Wasn't that the only user of 'GOMP_OFFLOAD_openacc_async_queue_callback'?


Grüße
 Thomas
Hi Thomas,

thanks for the review.

On 2019/9/20 12:28 AM, Thomas Schwinge wrote:
>> This new implementation works by modifying the GIMPLE for child functions
>> directly at the very start (before, actually) of RTL expansion
>
> That's now near the other end of the pipeline.  ;-) What's the motivation
> for putting it there, instead of early in the nvptx offloading
> compilation (around 'pass_oacc_device_lower' etc. time, where I would've
> assumed this transformation to be done)?  Not asking you to change that
> now, but curious for the reason.

I am not sure we have a natural boundary that defines/marks the start of
the offload compiler stages.  Maybe if we had an explicit "start_of_offload"
pass, we could embed this processing there, and enable it with a bool-valued
target hook provided by the accelerator backend (possibly only when
ACCEL_COMPILER is defined).

Short of that, I think placing it here before RTL expansion seems the most
well defined, even if we have to handle some optimized obscurity.

>> and thus
>> is placed in TARGET_EXPAND_TO_RTL_HOOK, as the core issue is we inherently
>> need something different generated for the host fallback vs. for the GPU.
>
> (Likewise, different per each offload target.)
>
>> The new nvptx_expand_to_rtl_hook modifies the function decl type and
>> arguments, and scans the gimple body to remove occurrences of .omp_data_i.*
>> Detection of OpenACC child functions is done through the "omp target
>> entrypoint" and "oacc function" attributes.  Because OpenMP target child
>> functions have a more elaborate wrapper generated for them, this pass
>> only supports OpenACC right now.
>
> At the Cauldron, the question indeed was raised (Jakub, Tom) why this
> isn't enabled for OpenMP, too.  My answer was that this surely can be
> done, but the change as presented here already is an improvement over the
> current status ("stands on its own", as Jeff Law would call it), so I'm
> fine with you handling OpenACC first; OpenMP can then follow later (at
> some as of yet indeterminate point in time, even).

The OpenMP way of wrapping the user-defined GPU kernel with lots of
initialization code does make this much more tedious, I think.  The
question should actually be: can OpenMP simply do this kind of
initialization in the host libgomp runtime like OpenACC does, and make the
nvptx kernel proper more similar between the two?

>> libgomp has been tested with this patch on x86_64-linux (nvptx-none
>> accelerator) without regressions
>
> Can you present performance numbers, too?

Haven't got to that yet.

>> (I'm currently undergoing more gcc tests as well).
>
> As these changes, being confined to nvptx code only, can't possibly have
> any effect on other target testing, I assume that's nvptx target testing
> you're talking about?  (..., where also I'm not expecting any
> disturbance.)

Yeah, I was talking about nvptx-none compiler testing.  Haven't found any
changes.

>> --- gcc/config/nvptx/nvptx.c	(revision 275493)
>> +++ gcc/config/nvptx/nvptx.c	(working copy)
>> +static void
>> +nvptx_expand_to_rtl_hook (void)
>> +{
>> +  ...
>> +  if (lookup_attribute ("omp target entrypoint",
>> +                        DECL_ATTRIBUTES (current_function_decl))
>> +      && lookup_attribute ("oacc function",
>> +                           DECL_ATTRIBUTES (current_function_decl))
>> +      && VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl))))
>
> Why the 'void' return conditional?  (Or, should that rather be a
> 'gcc_checking_assert' at the top of the following block?)

That's the shape of child functions omp-low generates.  Maybe that should
be an assertion, though here I'm just doing sanity checking and ignoring
otherwise.

Come to think of it, maybe I should try using the assertion to check if
I'm unintentionally ignoring transforming some cases...

>> +    {
>> +      tree omp_data_arg = DECL_ARGUMENTS (current_function_decl);
>> +      tree argtype = TREE_TYPE (omp_data_arg);
>> +
>> +      /* Ensure this function is of the form of a single reference argument
>> +         to the OMP data record, or a single void* argument (when no values
>> +         passed) */
>> +      if (! (DECL_CHAIN (omp_data_arg) == NULL_TREE
>> +             && ((TREE_CODE (argtype) == REFERENCE_TYPE
>> +                  && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE)
>> +                 || (TREE_CODE (argtype) == POINTER_TYPE
>> +                     && TREE_TYPE (argtype) == void_type_node))))
>> +        return;
>
> Again, is that something we should 'gcc_checking_assert', so that we'll
> notice when something changes/breaks?

As above.

>> +      /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by
>> +         scanning and skipping those entries, creating a new local_decls list.
>> +         We assume a very specific MEM_REF tree expression shape. */
>> +      [...]
>> +      vec_free (cfun->local_decls);
>> +      cfun->local_decls = new_local_decls;
>
> Is it worth doing that manually, or can/should some dead code elimination
> pass deal with that?

I think I ICE'd somewhere before adding this removal.

>> +          else if (TREE_CODE (val) == MEM_REF
>> +                   && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
>> +                   && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
>> +            {
>> +              /* This case may happen in the final tree level optimization
>> +                 output, due to SLP:
>> +                   vect.XX = MEM <vector(1) unsigned long> [(void *).omp_data_i_5(D) + 8B]
>> +
>> +                 Therefore here we need a more elaborate search of the field
>> +                 list to reverse map to which field the offset is referring
>> +                 to.  */
>
> Would this be simpler if the conversion were done earlier?  (As I
> mentioned above.)

Yes, it would be much less laborious :P

>> +          /* If we found the corresponding OMP data record field, replace the
>> +             RHS with the newly created PARM_DECL.  */
>> +          if (new_val != NULL_TREE)
>> +            {
>> +              [...]
>> +              /* Write in looked up ARG as new RHS value.  */
>> +              *val_ptr = new_val;
>> +            }
>
> If 'new_val == NULL_TREE', that simply means that we've been looking at
> something that doesn't need to be handled here, right?

Technically, it only means we haven't found something to replace the
.omp_data_i->FIELD ref.

>> +      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
>> +      tree name;
>> +      FOR_EACH_SSA_NAME (i, name, cfun)
>> +        if (SSA_NAME_VAR (name) == omp_data_arg)
>> +          (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;
>
> Again, manual cleanup vs. automated?

The "automated" one that marked it unused didn't really remove it for some
reason; I forget why, probably because it was still considered "used" in
some way.  Only this manual manipulation worked.

>> --- libgomp/plugin/plugin-nvptx.c	(revision 275493)
>> +++ libgomp/plugin/plugin-nvptx.c	(working copy)
>> @@ -1438,78 +1374,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void
>>  				 unsigned *dims, void *targ_mem_desc,
>>  				 struct goacc_asyncqueue *aq)
>>  {
>> [...]
>> -  if (mapnum > 0)
>> -    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
>> +  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream);
>>  }
>
> Wasn't that the only user of 'GOMP_OFFLOAD_openacc_async_queue_callback'?

No, gomp_map_vars uses it as a plugin hook from libgomp proper, though this
was the only instance from inside the nvptx plugin.  (Another use, from
oacc-async.c:goacc_async_free(), appears to be orphaned now, though I think
we should keep that routine for a while, as it appears it may still be of
use.)

I'll try changing some of those cases we identified to be 'assertifiable'
and see what happens.

Thanks,
Chung-Lin
On 2019/9/24 6:43 PM, Chung-Lin Tang wrote:
>>> --- gcc/config/nvptx/nvptx.c	(revision 275493)
>>> +++ gcc/config/nvptx/nvptx.c	(working copy)
>>> +static void
>>> +nvptx_expand_to_rtl_hook (void)
>>> +{
>>> +  /* For utilizing CUDA .param kernel arguments, we detect and modify
>>> +     the gimple of offloaded child functions, here before RTL expansion,
>>> +     starting with standard OMP form:
>>> +       foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
>>> +
>>> +     and transform it into a style where the OMP data record fields are
>>> +     "exploded" into individual scalar arguments:
>>> +       foo._omp_fn.0 (int * a, int * b, int * c) { ... }
>>> +
>>> +     Note that there are implicit assumptions of how OMP lowering (and/or other
>>> +     intervening passes) behaves contained in this transformation code;
>>> +     if those passes change in their output, this code may possibly need
>>> +     updating.  */
>>> +
>>> +  if (lookup_attribute ("omp target entrypoint",
>>> +                        DECL_ATTRIBUTES (current_function_decl))
>>> +      /* The rather indirect manner in which OpenMP target functions are
>>> +         launched makes this transformation only valid for OpenACC currently.
>>> +         TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
>>> +         needs changes for this to work with OpenMP.  */
>>> +      && lookup_attribute ("oacc function",
>>> +                           DECL_ATTRIBUTES (current_function_decl))
>>> +      && VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl))))
>>
>> Why the 'void' return conditional?  (Or, should that rather be a
>> 'gcc_checking_assert' at the top of the following block?)
>
> That's the shape of child functions omp-low generates.  Maybe that should
> be an assertion, though here I'm just doing sanity checking and ignoring
> otherwise.
>
> Come to think of it, maybe I should try using the assertion to check if
> I'm unintentionally ignoring transforming some cases...

I've updated the patch to use an assertion for those convention checks.
I think it's better to leave a level of checking in place, so gcc_assert()
instead of gcc_checking_assert().  Also tested with no regressions.

Thanks,
Chung-Lin

Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 276406)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -68,6 +68,10 @@
 #include "attribs.h"
 #include "tree-vrp.h"
 #include "tree-ssa-operands.h"
+#include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "tree-cfg.h"
+#include "gimple-ssa.h"
 #include "tree-ssanames.h"
 #include "gimplify.h"
 #include "tree-phinodes.h"
@@ -6437,6 +6441,226 @@ nvptx_set_current_function (tree fndecl)
   oacc_bcast_partition = 0;
 }
 
+static void
+nvptx_expand_to_rtl_hook (void)
+{
+  /* For utilizing CUDA .param kernel arguments, we detect and modify
+     the gimple of offloaded child functions, here before RTL expansion,
+     starting with standard OMP form:
+       foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
+
+     and transform it into a style where the OMP data record fields are
+     "exploded" into individual scalar arguments:
+       foo._omp_fn.0 (int * a, int * b, int * c) { ... }
+
+     Note that there are implicit assumptions of how OMP lowering (and/or other
+     intervening passes) behaves contained in this transformation code;
+     if those passes change in their output, this code may possibly need
+     updating.  */
+
+  if (lookup_attribute ("omp target entrypoint",
+                        DECL_ATTRIBUTES (current_function_decl))
+      /* The rather indirect manner in which OpenMP target functions are
+         launched makes this transformation only valid for OpenACC currently.
+         TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
+         needs changes for this to work with OpenMP.  */
+      && lookup_attribute ("oacc function",
+                           DECL_ATTRIBUTES (current_function_decl)))
+    {
+      tree omp_data_arg = DECL_ARGUMENTS (current_function_decl);
+      tree argtype = TREE_TYPE (omp_data_arg);
+
+      /* Ensure this function is of the form of a single reference argument
+         to the OMP data record, or a single void* argument (when no values
+         passed) */
+      gcc_assert (VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl)))
+                  && (DECL_CHAIN (omp_data_arg) == NULL_TREE
+                      && ((TREE_CODE (argtype) == REFERENCE_TYPE
+                           && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE)
+                          || (TREE_CODE (argtype) == POINTER_TYPE
+                              && TREE_TYPE (argtype) == void_type_node))));
+      if (dump_file)
+        {
+          fprintf (dump_file, "Detected offloaded child function %s, "
+                   "starting parameter conversion\n",
+                   print_generic_expr_to_str (current_function_decl));
+          fprintf (dump_file, "OMP data record argument: %s (tree type: %s)\n",
+                   print_generic_expr_to_str (omp_data_arg),
+                   print_generic_expr_to_str (argtype));
+          fprintf (dump_file, "Data record fields:\n");
+        }
+
+      hash_map<tree,tree> fld_to_args;
+      tree fld, rectype = TREE_TYPE (argtype);
+      tree arglist = NULL_TREE, argtypelist = NULL_TREE;
+
+      if (TREE_CODE (rectype) == RECORD_TYPE)
+        {
+          /* For each field in the OMP data record type, create a corresponding
+             PARM_DECL, and map field -> parm using the fld_to_args hash_map.
+             Also create the tree chains for creating function type and
+             DECL_ARGUMENTS below.  */
+          for (fld = TYPE_FIELDS (rectype); fld; fld = DECL_CHAIN (fld))
+            {
+              tree narg = build_decl (DECL_SOURCE_LOCATION (fld), PARM_DECL,
+                                      DECL_NAME (fld), TREE_TYPE (fld));
+              DECL_ARTIFICIAL (narg) = 1;
+              DECL_ARG_TYPE (narg) = TREE_TYPE (fld);
+              DECL_CONTEXT (narg) = current_function_decl;
+              TREE_USED (narg) = 1;
+              TREE_READONLY (narg) = 1;
+
+              if (dump_file)
+                fprintf (dump_file, "\t%s, type: %s, offset: %s bytes + %s bits\n",
+                         print_generic_expr_to_str (fld),
+                         print_generic_expr_to_str (TREE_TYPE (fld)),
+                         print_generic_expr_to_str (DECL_FIELD_OFFSET (fld)),
+                         print_generic_expr_to_str (DECL_FIELD_BIT_OFFSET (fld)));
+              fld_to_args.put (fld, narg);
+
+              TREE_CHAIN (narg) = arglist;
+              arglist = narg;
+              argtypelist = tree_cons (NULL_TREE, TREE_TYPE (narg),
+                                       argtypelist);
+            }
+          arglist = nreverse (arglist);
+          argtypelist = nreverse (argtypelist);
+        }
+      /* This is needed to not be mistaken for a stdarg function.  */
+      argtypelist = chainon (argtypelist, void_list_node);
+
+      if (dump_file)
+        {
+          fprintf (dump_file, "Function before OMP data arg replaced:\n");
+          dump_function_to_file (current_function_decl, dump_file, dump_flags);
+        }
+
+      /* Actually modify the tree type and DECL_ARGUMENTS here.  */
+      TREE_TYPE (current_function_decl) = build_function_type (void_type_node,
+                                                               argtypelist);
+      DECL_ARGUMENTS (current_function_decl) = arglist;
+
+      /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by
+         scanning and skipping those entries, creating a new local_decls list.
+         We assume a very specific MEM_REF tree expression shape.  */
+      tree decl;
+      unsigned int i;
+      vec<tree, va_gc> *new_local_decls = NULL;
+      FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl)
+        {
+          if (DECL_HAS_VALUE_EXPR_P (decl))
+            {
+              tree t = DECL_VALUE_EXPR (decl);
+              if (TREE_CODE (t) == MEM_REF
+                  && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF
+                  && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == MEM_REF
+                  && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0)
+                      == omp_data_arg))
+                continue;
+            }
+          vec_safe_push (new_local_decls, decl);
+        }
+      vec_free (cfun->local_decls);
+      cfun->local_decls = new_local_decls;
+
+      /* Scan function body for assignments from .omp_data_i->FIELD, and using
+         the above created fld_to_args hash map, convert them to reads of
+         function arguments.  */
+      basic_block bb;
+      gimple_stmt_iterator gsi;
+      FOR_EACH_BB_FN (bb, cfun)
+        for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+          {
+            tree val, *val_ptr = NULL;
+            gimple *stmt = gsi_stmt (gsi);
+            if (is_gimple_assign (stmt)
+                && gimple_assign_rhs_class (stmt) == GIMPLE_SINGLE_RHS)
+              val_ptr = gimple_assign_rhs1_ptr (stmt);
+            else if (is_gimple_debug (stmt) && gimple_debug_bind_p (stmt))
+              val_ptr = gimple_debug_bind_get_value_ptr (stmt);
+
+            if (val_ptr == NULL || (val = *val_ptr) == NULL_TREE)
+              continue;
+
+            tree new_val = NULL_TREE, fld = NULL_TREE;
+
+            if (TREE_CODE (val) == COMPONENT_REF
+                && TREE_CODE (TREE_OPERAND (val, 0)) == MEM_REF
+                && (TREE_CODE (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+                    == SSA_NAME)
+                && (SSA_NAME_VAR (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+                    == omp_data_arg))
+              {
+                /* .omp_data->FIELD case.  */
+                fld = TREE_OPERAND (val, 1);
+                new_val = *fld_to_args.get (fld);
+              }
+            else if (TREE_CODE (val) == MEM_REF
+                     && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
+                     && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
+              {
+                /* This case may happen in the final tree level optimization
+                   output, due to SLP:
+                     vect.XX = MEM <vector(1) unsigned long> [(void *).omp_data_i_5(D) + 8B]
+
+                   Therefore here we need a more elaborate search of the field
+                   list to reverse map to which field the offset is referring
+                   to.  */
+                unsigned HOST_WIDE_INT offset
+                  = tree_to_uhwi (TREE_OPERAND (val, 1));
+
+                for (hash_map<tree, tree>::iterator i = fld_to_args.begin ();
+                     i != fld_to_args.end (); ++i)
+                  {
+                    tree cur_fld = (*i).first;
+                    tree cur_arg = (*i).second;
+                    gcc_assert (TREE_CODE (cur_arg) == PARM_DECL);
+
+                    unsigned HOST_WIDE_INT cur_offset =
+                      (tree_to_uhwi (DECL_FIELD_OFFSET (cur_fld))
+                       + (tree_to_uhwi (DECL_FIELD_BIT_OFFSET (cur_fld))
+                          / BITS_PER_UNIT));
+
+                    if (offset == cur_offset)
+                      {
+                        new_val = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (val),
+                                          cur_arg);
+                        break;
+                      }
+                  }
+              }
+
+            /* If we found the corresponding OMP data record field, replace the
+               RHS with the newly created PARM_DECL.  */
+            if (new_val != NULL_TREE)
+              {
+                if (dump_file)
+                  {
+                    fprintf (dump_file, "For gimple stmt: ");
+                    print_gimple_stmt (dump_file, stmt, 0);
+                    fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n",
+                             print_generic_expr_to_str (val),
+                             print_generic_expr_to_str (new_val));
+                  }
+                /* Write in looked up ARG as new RHS value.  */
+                *val_ptr = new_val;
+              }
+          }
+
+      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
+      tree name;
+      FOR_EACH_SSA_NAME (i, name, cfun)
+        if (SSA_NAME_VAR (name) == omp_data_arg)
+          (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;
+
+      if (dump_file)
+        {
+          fprintf (dump_file, "Function after OMP data arg replaced: ");
+          dump_function_to_file (current_function_decl, dump_file, dump_flags);
+        }
+    }
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
@@ -6576,6 +6800,9 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_SET_CURRENT_FUNCTION
 #define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
 
+#undef TARGET_EXPAND_TO_RTL_HOOK
+#define TARGET_EXPAND_TO_RTL_HOOK nvptx_expand_to_rtl_hook
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"
Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 276406)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -695,16 +695,24 @@ link_ptx (CUmodule *module, const struct targ_ptx_
 
 static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
-            unsigned *dims, void *targ_mem_desc,
-            CUdeviceptr dp, CUstream stream)
+            unsigned *dims, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
   int i;
-  void *kargs[1];
   struct nvptx_thread *nvthd = nvptx_thread ();
   int warp_size = nvthd->ptx_dev->warp_size;
 
+  void **kernel_args = NULL;
+  GOMP_PLUGIN_debug (0, "prepare mappings (mapnum: %u)\n", (unsigned) mapnum);
+
+  if (mapnum > 0)
+    {
+      kernel_args = alloca (mapnum * sizeof (void *));
+      for (int i = 0; i < mapnum; i++)
+        kernel_args[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+    }
+
   function = targ_fn->fn;
 
   /* Initialize the launch dimensions.  Typically this is constant,
@@ -936,11 +944,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
                             api_info);
     }
 
-  kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
                     dims[GOMP_DIM_GANG], 1, 1,
                     dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
-                    0, stream, kargs, 0);
+                    0, stream, kernel_args, 0);
 
   if (profiling_p)
     {
@@ -1349,67 +1356,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si
                            void **hostaddrs, void **devaddrs,
                            unsigned *dims, void *targ_mem_desc)
 {
-  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, NULL);
-
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info data_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_p = __builtin_expect (prof_info != NULL, false);
-
-  void **hp = NULL;
-  CUdeviceptr dp = 0;
-
-  if (mapnum > 0)
-    {
-      size_t s = mapnum * sizeof (void *);
-      hp = alloca (s);
-      for (int i = 0; i < mapnum; i++)
-        hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
-      if (profiling_p)
-        goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
-    }
-
-  /* Copy the (device) pointers to arguments to the device (dp and hp might in
-     fact have the same value on a unified-memory system).  */
-  if (mapnum > 0)
-    {
-      if (profiling_p)
-        {
-          prof_info->event_type = acc_ev_enqueue_upload_start;
-
-          data_event_info.data_event.event_type = prof_info->event_type;
-          data_event_info.data_event.valid_bytes
-            = _ACC_DATA_EVENT_INFO_VALID_BYTES;
-          data_event_info.data_event.parent_construct
-            = acc_construct_parallel;
-          data_event_info.data_event.implicit = 1; /* Always implicit.  */
-          data_event_info.data_event.tool_info = NULL;
-          data_event_info.data_event.var_name = NULL;
-          data_event_info.data_event.bytes = mapnum * sizeof (void *);
-          data_event_info.data_event.host_ptr = hp;
-          data_event_info.data_event.device_ptr = (const void *) dp;
-
-          api_info->device_api = acc_device_api_cuda;
-
-          GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-                                                api_info);
-        }
-      CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
-                        mapnum * sizeof (void *));
-      if (profiling_p)
-        {
-          prof_info->event_type = acc_ev_enqueue_upload_end;
-          data_event_info.data_event.event_type = prof_info->event_type;
-          GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-                                                api_info);
-        }
-    }
-
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-              dp, NULL);
-
   CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
   const char *maybe_abort_msg = "(perhaps abort was called)";
   if (r == CUDA_ERROR_LAUNCH_FAILED)
@@ -1417,20 +1365,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si
                        maybe_abort_msg);
   else if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
-
-  CUDA_CALL_ASSERT (cuMemFree, dp);
-  if (profiling_p)
-    goacc_profiling_acc_ev_free (thr, (void *) dp);
 }
 
-static void
-cuda_free_argmem (void *ptr)
-{
-  void **block = (void **) ptr;
-  nvptx_free (block[0], (struct ptx_device *) block[1]);
-  free (block);
-}
-
 void
 GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
                                  void **hostaddrs, void **devaddrs,
@@ -1437,78 +1373,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void
                                  unsigned *dims, void *targ_mem_desc,
                                  struct goacc_asyncqueue *aq)
 {
-  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info data_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_p = __builtin_expect (prof_info != NULL, false);
-
-  void **hp = NULL;
-  CUdeviceptr dp = 0;
-  void **block = NULL;
-
-  if (mapnum > 0)
-    {
-      size_t s = mapnum * sizeof (void *);
-      block = (void **) GOMP_PLUGIN_malloc (2 * sizeof (void *) + s);
-      hp = block + 2;
-      for (int i = 0; i < mapnum; i++)
-        hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
-      if (profiling_p)
-        goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
-    }
-
-  /* Copy the (device) pointers to arguments to the device (dp and hp might in
-     fact have the same value on a unified-memory system).  */
-  if (mapnum > 0)
-    {
-      if (profiling_p)
-        {
-          prof_info->event_type = acc_ev_enqueue_upload_start;
-
-          data_event_info.data_event.event_type = prof_info->event_type;
-          data_event_info.data_event.valid_bytes
-            = _ACC_DATA_EVENT_INFO_VALID_BYTES;
-          data_event_info.data_event.parent_construct
-            = acc_construct_parallel;
-          data_event_info.data_event.implicit = 1; /* Always implicit.  */
-          data_event_info.data_event.tool_info = NULL;
-          data_event_info.data_event.var_name = NULL;
-          data_event_info.data_event.bytes = mapnum * sizeof (void *);
-          data_event_info.data_event.host_ptr = hp;
-          data_event_info.data_event.device_ptr = (const void *) dp;
-
-          api_info->device_api = acc_device_api_cuda;
-
-          GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-                                                api_info);
-        }
-
-      CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
-                        mapnum * sizeof (void *), aq->cuda_stream);
-      block[0] = (void *) dp;
-
-      struct nvptx_thread *nvthd =
-        (struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
-      block[1] = (void *) nvthd->ptx_dev;
-
-      if (profiling_p)
-        {
-          prof_info->event_type = acc_ev_enqueue_upload_end;
-          data_event_info.data_event.event_type = prof_info->event_type;
-          GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-                                                api_info);
-        }
-    }
-
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-              dp, aq->cuda_stream);
-
-  if (mapnum > 0)
-    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
+ 
nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream); } void *
Hi Chung-Lin!

While we're all waiting for Tom to comment on this ;-) -- here's another
item I realized:

On 2019-09-10T19:41:59+0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
> The libgomp nvptx plugin changes are also quite contained, with lots of
> now unneeded [...] code deleted (since we no longer first cuAlloc a
> buffer for the argument record before cuLaunchKernel)

It would be nice ;-) -- but unless I'm confused, it's not that simple: we
either have to reject (force host-fallback execution) or keep supporting
"old-style" nvptx offloading code: new-libgomp has to continue to work
with nvptx offloading code once generated by old-GCC.  Possibly even a
mixture of old and new nvptx offloading code, if libraries are involved,
huh!

I have not completely thought that through, but I suppose this could be
addressed by adding a flag to the 'struct nvptx_fn' (or similar) that's
synthesized by nvptx 'mkoffload'?

Maybe in fact the 'enum id_map_flag' machinery that I once added for
'Un-parallelized OpenACC kernels constructs with nvptx offloading:
"avoid offloading"'?  (That's part of og8 commit
2d42fbf7e989e4bb76727b32ef11deb5845d5ab1 -- not present on og9, huh?!)
The 'enum id_map_flag' machinery serves the purpose of transporting
information from the offload compiler to libgomp, which seems to be
what's needed here?  (But please verify.)

For reference, your proposed changes: > --- libgomp/plugin/plugin-nvptx.c (revision 275493) > +++ libgomp/plugin/plugin-nvptx.c (working copy) > @@ -696,16 +696,24 @@ link_ptx (CUmodule *module, const struct targ_ptx_ > > static void > nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs, > - unsigned *dims, void *targ_mem_desc, > - CUdeviceptr dp, CUstream stream) > + unsigned *dims, CUstream stream) > { > struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn; > CUfunction function; > int i; > - void *kargs[1]; > struct nvptx_thread *nvthd = nvptx_thread (); > int warp_size = nvthd->ptx_dev->warp_size; > + void **kernel_args = NULL; > > + GOMP_PLUGIN_debug (0, "prepare mappings (mapnum: %u)\n", (unsigned) mapnum); > + > + if (mapnum > 0) > + { > + kernel_args = alloca (mapnum * sizeof (void *)); > + for (int i = 0; i < mapnum; i++) > + kernel_args[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]); > + } > + > function = targ_fn->fn; > > /* Initialize the launch dimensions. 
Typically this is constant, > @@ -937,11 +945,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **host > api_info); > } > > - kargs[0] = &dp; > CUDA_CALL_ASSERT (cuLaunchKernel, function, > dims[GOMP_DIM_GANG], 1, 1, > dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1, > - 0, stream, kargs, 0); > + 0, stream, kernel_args, 0); > > if (profiling_p) > { > @@ -1350,67 +1357,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si > void **hostaddrs, void **devaddrs, > unsigned *dims, void *targ_mem_desc) > { > - GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__); > + nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, NULL); > > - struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread (); > - acc_prof_info *prof_info = thr->prof_info; > - acc_event_info data_event_info; > - acc_api_info *api_info = thr->api_info; > - bool profiling_p = __builtin_expect (prof_info != NULL, false); > - > - void **hp = NULL; > - CUdeviceptr dp = 0; > - > - if (mapnum > 0) > - { > - size_t s = mapnum * sizeof (void *); > - hp = alloca (s); > - for (int i = 0; i < mapnum; i++) > - hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]); > - CUDA_CALL_ASSERT (cuMemAlloc, &dp, s); > - if (profiling_p) > - goacc_profiling_acc_ev_alloc (thr, (void *) dp, s); > - } > - > - /* Copy the (device) pointers to arguments to the device (dp and hp might in > - fact have the same value on a unified-memory system). */ > - if (mapnum > 0) > - { > - if (profiling_p) > - { > - prof_info->event_type = acc_ev_enqueue_upload_start; > - > - data_event_info.data_event.event_type = prof_info->event_type; > - data_event_info.data_event.valid_bytes > - = _ACC_DATA_EVENT_INFO_VALID_BYTES; > - data_event_info.data_event.parent_construct > - = acc_construct_parallel; > - data_event_info.data_event.implicit = 1; /* Always implicit. 
*/ > - data_event_info.data_event.tool_info = NULL; > - data_event_info.data_event.var_name = NULL; > - data_event_info.data_event.bytes = mapnum * sizeof (void *); > - data_event_info.data_event.host_ptr = hp; > - data_event_info.data_event.device_ptr = (const void *) dp; > - > - api_info->device_api = acc_device_api_cuda; > - > - GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info, > - api_info); > - } > - CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp, > - mapnum * sizeof (void *)); > - if (profiling_p) > - { > - prof_info->event_type = acc_ev_enqueue_upload_end; > - data_event_info.data_event.event_type = prof_info->event_type; > - GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info, > - api_info); > - } > - } > - > - nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc, > - dp, NULL); > - > CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL); > const char *maybe_abort_msg = "(perhaps abort was called)"; > if (r == CUDA_ERROR_LAUNCH_FAILED) > @@ -1418,20 +1366,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si > maybe_abort_msg); > else if (r != CUDA_SUCCESS) > GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r)); > - > - CUDA_CALL_ASSERT (cuMemFree, dp); > - if (profiling_p) > - goacc_profiling_acc_ev_free (thr, (void *) dp); > } > > -static void > -cuda_free_argmem (void *ptr) > -{ > - void **block = (void **) ptr; > - nvptx_free (block[0], (struct ptx_device *) block[1]); > - free (block); > -} > - > void > GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum, > void **hostaddrs, void **devaddrs, > @@ -1438,78 +1374,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void > unsigned *dims, void *targ_mem_desc, > struct goacc_asyncqueue *aq) > { > - GOMP_PLUGIN_debug (0, " %s: prepare mappings\n", __FUNCTION__); > - > - struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread (); > - acc_prof_info *prof_info = thr->prof_info; > - acc_event_info data_event_info; > - acc_api_info 
*api_info = thr->api_info; > - bool profiling_p = __builtin_expect (prof_info != NULL, false); > - > - void **hp = NULL; > - CUdeviceptr dp = 0; > - void **block = NULL; > - > - if (mapnum > 0) > - { > - size_t s = mapnum * sizeof (void *); > - block = (void **) GOMP_PLUGIN_malloc (2 * sizeof (void *) + s); > - hp = block + 2; > - for (int i = 0; i < mapnum; i++) > - hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]); > - CUDA_CALL_ASSERT (cuMemAlloc, &dp, s); > - if (profiling_p) > - goacc_profiling_acc_ev_alloc (thr, (void *) dp, s); > - } > - > - /* Copy the (device) pointers to arguments to the device (dp and hp might in > - fact have the same value on a unified-memory system). */ > - if (mapnum > 0) > - { > - if (profiling_p) > - { > - prof_info->event_type = acc_ev_enqueue_upload_start; > - > - data_event_info.data_event.event_type = prof_info->event_type; > - data_event_info.data_event.valid_bytes > - = _ACC_DATA_EVENT_INFO_VALID_BYTES; > - data_event_info.data_event.parent_construct > - = acc_construct_parallel; > - data_event_info.data_event.implicit = 1; /* Always implicit. 
*/ > - data_event_info.data_event.tool_info = NULL; > - data_event_info.data_event.var_name = NULL; > - data_event_info.data_event.bytes = mapnum * sizeof (void *); > - data_event_info.data_event.host_ptr = hp; > - data_event_info.data_event.device_ptr = (const void *) dp; > - > - api_info->device_api = acc_device_api_cuda; > - > - GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info, > - api_info); > - } > - > - CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp, > - mapnum * sizeof (void *), aq->cuda_stream); > - block[0] = (void *) dp; > - > - struct nvptx_thread *nvthd = > - (struct nvptx_thread *) GOMP_PLUGIN_acc_thread (); > - block[1] = (void *) nvthd->ptx_dev; > - > - if (profiling_p) > - { > - prof_info->event_type = acc_ev_enqueue_upload_end; > - data_event_info.data_event.event_type = prof_info->event_type; > - GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info, > - api_info); > - } > - } > - > - nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc, > - dp, aq->cuda_stream); > - > - if (mapnum > 0) > - GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block); > + nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream); > } > > void * Grüße Thomas
On 08-10-2019 16:05, Thomas Schwinge wrote:
> Hi Chung-Lin!
>
> While we're all waiting for Tom to comment on this ;-)

Ack, thanks for the ping ...

> -- here's another item I realized:
>
> On 2019-09-10T19:41:59+0800, Chung-Lin Tang
> <chunglin_tang@mentor.com> wrote:
>> The libgomp nvptx plugin changes are also quite contained, with
>> lots of now unneeded [...] code deleted (since we no longer first
>> cuAlloc a buffer for the argument record before cuLaunchKernel)
>
> It would be nice ;-) -- but unless I'm confused, it's not that
> simple: we either have to reject (force host-fallback execution) or
> keep supporting "old-style" nvptx offloading code: new-libgomp has
> to continue to work with nvptx offloading code once generated by
> old-GCC.  Possibly even a mixture of old and new nvptx offloading
> code, if libraries are involved, huh!
>
> I have not completely thought that through, but I suppose this
> could be addressed by adding a flag to the 'struct nvptx_fn' (or
> similar) that's synthesized by nvptx 'mkoffload'?
>
> Maybe in fact the 'enum id_map_flag' machinery that I once added
> for 'Un-parallelized OpenACC kernels constructs with nvptx
> offloading: "avoid offloading"'?  (That's part of og8 commit
> 2d42fbf7e989e4bb76727b32ef11deb5845d5ab1 -- not present on og9,
> huh?!)  The 'enum id_map_flag' machinery serves the purpose of
> transporting information from the offload compiler to libgomp,
> which seems to be what's needed here?  (But please verify.)

... and for raising this issue.  I think this needs to be addressed.
It would be great if we could avoid it, but ...

AFAIU, this means bumping GOMP_VERSION_NVIDIA_PTX (1 -> 2).

Using a new a.out (registers with GOMP_VERSION_NVIDIA_PTX == 2) with an
old libgomp (supports GOMP_VERSION_NVIDIA_PTX <= 1) will give us an
"Offload data incompatible with PTX plugin" error.

Using an old a.out (registers with GOMP_VERSION_NVIDIA_PTX == 1) with a
new libgomp (supports GOMP_VERSION_NVIDIA_PTX <= 2) will have to be
supported in the way that things are currently handled.

Using a new a.out (registers with GOMP_VERSION_NVIDIA_PTX == 2) with a
new libgomp (supports GOMP_VERSION_NVIDIA_PTX <= 2) will have to be
supported in the way that the patch implements things.

The current approach is that all offload functions are assumed to be
transformed by the optimization, which implies that failure to transform
should be a compilation error (is that indeed ensured by the patch?).
Which is a bit funny for an 'optimization'.  We might want to be able to
switch this on/off at the offload-function level.

That ties in with the fact that if we're going to keep the old path
alive for backward compatibility, it would be nice if we could actually
test it in the trunk version by disabling the optimization.  Which is
also nice to have if we run into issues with the optimization.  And once
we allow this to be disabled at the user level, we're going to have to
track it at the offload-function level.

So I'd say for GOMP_VERSION_NVIDIA_PTX == 2 we extend target_data with a
flag such that we can query things on a per-offload-function level,
while taking care to represent the common case where the flag is the
same for all offload functions in an economical way.

That leaves the question of how to get that information to mkoffload;
perhaps the patch Thomas mentioned can be of help there.

Thanks,
- Tom
On 01-10-2019 14:45, Chung-Lin Tang wrote:
> Index: gcc/config/nvptx/nvptx.c
> ===================================================================
> --- gcc/config/nvptx/nvptx.c	(revision 276406)
> +++ gcc/config/nvptx/nvptx.c	(working copy)
> @@ -68,6 +68,10 @@
>  #include "attribs.h"
>  #include "tree-vrp.h"
>  #include "tree-ssa-operands.h"
> +#include "tree-pretty-print.h"
> +#include "gimple-pretty-print.h"
> +#include "tree-cfg.h"
> +#include "gimple-ssa.h"
>  #include "tree-ssanames.h"
>  #include "gimplify.h"
>  #include "tree-phinodes.h"
> @@ -6437,6 +6441,226 @@ nvptx_set_current_function (tree fndecl)
>    oacc_bcast_partition = 0;
>  }
>
> +static void
> +nvptx_expand_to_rtl_hook (void)
> +{
> +  /* For utilizing CUDA .param kernel arguments, we detect and modify
> +     the gimple of offloaded child functions, here before RTL expansion,
> +     starting with standard OMP form:
> +       foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
> +
> +     and transform it into a style where the OMP data record fields are
> +     "exploded" into individual scalar arguments:
> +       foo._omp_fn.0 (int * a, int * b, int * c) { ... }
> +
> +     Note that there are implicit assumptions of how OMP lowering (and/or other
> +     intervening passes) behaves contained in this transformation code;
> +     if those passes change in their output, this code may possibly need
> +     updating.  */
> +
> +  if (lookup_attribute ("omp target entrypoint",
> +			DECL_ATTRIBUTES (current_function_decl))
> +      /* The rather indirect manner in which OpenMP target functions are
> +	 launched makes this transformation only valid for OpenACC currently.
> +	 TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
> +	 needs changes for this to work with OpenMP.  */
> +      && lookup_attribute ("oacc function",
> +			   DECL_ATTRIBUTES (current_function_decl)))
> +    {

Please do an early return here.  Otherwise, no comments on the code as
such.

Thanks,
- Tom
On 2019/10/8 10:05 PM, Thomas Schwinge wrote:
> Hi Chung-Lin!
>
> While we're all waiting for Tom to comment on this ;-) -- here's another
> item I realized:
>
> On 2019-09-10T19:41:59+0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
>> The libgomp nvptx plugin changes are also quite contained, with lots of
>> now unneeded [...] code deleted (since we no longer first cuAlloc a
>> buffer for the argument record before cuLaunchKernel)
> It would be nice ;-) -- but unless I'm confused, it's not that simple: we
> either have to reject (force host-fallback execution) or keep supporting
> "old-style" nvptx offloading code: new-libgomp has to continue to work
> with nvptx offloading code once generated by old-GCC.  Possibly even a
> mixture of old and new nvptx offloading code, if libraries are involved,
> huh!
>
> I have not completely thought that through, but I suppose this could be
> addressed by adding a flag to the 'struct nvptx_fn' (or similar) that's
> synthesized by nvptx 'mkoffload'?

Hi Thomas, Tom,
I've looked at the problem; it is unfortunate that we overlooked the
need for versioning of NVPTX images, and did not reserve something in
'struct nvptx_tdata' for a purpose like this.

But how about something like:

typedef struct nvptx_tdata
{
  const struct targ_ptx_obj *ptx_objs;
  unsigned ptx_num;

  unsigned ptx_version;   /* <==== Add version field here. */

  const char *const *var_names;
  unsigned var_num;

  const struct targ_fn_launch *fn_descs;
  unsigned fn_num;
} nvptx_tdata_t;

We currently only support x86_64 and powerpc64le hosts, which are both
LP64 targets.

Assuming that, the position above where I put the new 'ptx_version'
field is already a 32-bit sized alignment hole, so it doesn't change the
layout of the other fields, and in the static 'target_data' variable
generated by mkoffload it should be zeroed in currently circulating
binaries (unless binutils is not doing the intuitive thing...)

If these assumptions are safe, then we can treat it as if ptx_version ==
0 right now, and from now on bump it to 1 for these new nvptx convention
changes.

(We can do a similar thing in 'struct targ_fn_launch' if we want to
differentiate at a per-function level.)

Any considerations?

Thanks,
Chung-Lin
On 2019/11/8 8:55 PM, Chung-Lin Tang wrote:
> On 2019/10/8 10:05 PM, Thomas Schwinge wrote:
>> Hi Chung-Lin!
>>
>> While we're all waiting for Tom to comment on this ;-) -- here's another
>> item I realized:
>>
>> On 2019-09-10T19:41:59+0800, Chung-Lin Tang <chunglin_tang@mentor.com> wrote:
>>> The libgomp nvptx plugin changes are also quite contained, with lots of
>>> now unneeded [...] code deleted (since we no longer first cuAlloc a
>>> buffer for the argument record before cuLaunchKernel)
>> It would be nice ;-) -- but unless I'm confused, it's not that simple: we
>> either have to reject (force host-fallback execution) or keep supporting
>> "old-style" nvptx offloading code: new-libgomp has to continue to work
>> with nvptx offloading code once generated by old-GCC.  Possibly even a
>> mixture of old and new nvptx offloading code, if libraries are involved,
>> huh!
>>
>> I have not completely thought that through, but I suppose this could be
>> addressed by adding a flag to the 'struct nvptx_fn' (or similar) that's
>> synthesized by nvptx 'mkoffload'?
>
> Hi Thomas, Tom,
> I've looked at the problem, it is unfortunate that we overlooked the
> need for versioning of NVPTX images, and did not reserve something in
> 'struct nvptx_tdata' for something like this.
>
> But how about something like:
>
> typedef struct nvptx_tdata
> {
>   const struct targ_ptx_obj *ptx_objs;
>   unsigned ptx_num;
>
>   unsigned ptx_version;   /* <==== Add version field here. */
>
>   const char *const *var_names;
>   unsigned var_num;
>
>   const struct targ_fn_launch *fn_descs;
>   unsigned fn_num;
> } nvptx_tdata_t;
>
> We currently only support x86_64 and powerpc64le hosts, which are both
> LP64 targets.
>
> Assuming that, the position above where I put the new 'ptx_version'
> field is already a 32-bit sized alignment hole, doesn't change the
> layout of other fields, and in the static 'target_data' variable
> generated by mkoffload should be zeroed in current circulating
> binaries (unless binutils is not doing the intuitive thing...)
>
> If these assumptions are safe, then we can treat as if ptx_version == 0
> right now, and from now on bump it to 1 for these new nvptx convention
> changes.
>
> (We can do a similar thing in 'struct targ_fn_launch' if we want to
> differentiate at a per-function level.)
>
> Any considerations?

Hi Tom, Thomas,
as a concept, here is a version of what I mentioned above.  The
_exec/_async_exec plugin hooks now switch between versions of the code
based on the image version.

Thanks,
Chung-Lin

gcc/
	* config/nvptx/mkoffload.c (process): Add 'ptx_version' field to
	generated struct nvptx_tdata, initialized to '1'.
	* config/nvptx/nvptx.c (nvptx_expand_to_rtl_hook): New function
	implementing the CUDA .param space transformation.
	(TARGET_EXPAND_TO_RTL_HOOK): Implement hook with
	nvptx_expand_to_rtl_hook.

libgomp/
	* plugin/plugin-nvptx.c (struct nvptx_tdata): Add 'ptx_version'
	field.
	(struct targ_fn_descriptor): Add 'image' field.
	(struct ptx_image_data): Adjust 'target_data' to be of proper
	pointer type 'const nvptx_tdata_t *'.
	(nvptx_exec): Adjust arguments, add kernel argument setup code,
	adjust cuLaunchKernel calling code.
	(GOMP_OFFLOAD_load_image): Remove now unneeded pointer cast for
	target_data, initialize 'image' link for each function descriptor,
	move adding of new_image to dev->images later, after everything is
	set up.
	(openacc_exec_v0): Rename from old GOMP_OFFLOAD_openacc_exec.
	(openacc_async_exec_v0): Rename from old
	GOMP_OFFLOAD_openacc_async_exec.
	(GOMP_OFFLOAD_openacc_exec): Switch between v0/v1 versions of code.
	(GOMP_OFFLOAD_openacc_async_exec): Likewise.
	(openacc_exec_v1): New function.
	(openacc_async_exec_v1): Likewise.
Index: gcc/config/nvptx/mkoffload.c =================================================================== --- gcc/config/nvptx/mkoffload.c (revision 278656) +++ gcc/config/nvptx/mkoffload.c (working copy) @@ -310,12 +310,13 @@ process (FILE *in, FILE *out) "static const struct nvptx_tdata {\n" " const struct ptx_obj *ptx_objs;\n" " unsigned ptx_num;\n" + " unsigned char ptx_version;\n" " const char *const *var_names;\n" " unsigned var_num;\n" " const struct nvptx_fn *fn_names;\n" " unsigned fn_num;\n" "} target_data = {\n" - " ptx_objs, sizeof (ptx_objs) / sizeof (ptx_objs[0]),\n" + " ptx_objs, sizeof (ptx_objs) / sizeof (ptx_objs[0]), 1,\n" " var_mappings," " sizeof (var_mappings) / sizeof (var_mappings[0]),\n" " func_mappings," Index: gcc/config/nvptx/nvptx.c =================================================================== --- gcc/config/nvptx/nvptx.c (revision 278656) +++ gcc/config/nvptx/nvptx.c (working copy) @@ -68,6 +68,10 @@ #include "attribs.h" #include "tree-vrp.h" #include "tree-ssa-operands.h" +#include "tree-pretty-print.h" +#include "gimple-pretty-print.h" +#include "tree-cfg.h" +#include "gimple-ssa.h" #include "tree-ssanames.h" #include "gimplify.h" #include "tree-phinodes.h" @@ -6463,6 +6467,226 @@ nvptx_set_current_function (tree fndecl) oacc_bcast_partition = 0; } +static void +nvptx_expand_to_rtl_hook (void) +{ + /* For utilizing CUDA .param kernel arguments, we detect and modify + the gimple of offloaded child functions, here before RTL expansion, + starting with standard OMP form: + foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... } + + and transform it into a style where the OMP data record fields are + "exploded" into individual scalar arguments: + foo._omp_fn.0 (int * a, int * b, int * c) { ... 
} + + Note that there are implicit assumptions of how OMP lowering (and/or other + intervening passes) behaves contained in this transformation code; + if those passes change in their output, this code may possibly need + updating. */ + + if (lookup_attribute ("omp target entrypoint", + DECL_ATTRIBUTES (current_function_decl)) + /* The rather indirect manner in which OpenMP target functions are + launched makes this transformation only valid for OpenACC currently. + TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc. + needs changes for this to work with OpenMP. */ + && lookup_attribute ("oacc function", + DECL_ATTRIBUTES (current_function_decl))) + { + tree omp_data_arg = DECL_ARGUMENTS (current_function_decl); + tree argtype = TREE_TYPE (omp_data_arg); + + /* Ensure this function is of the form of a single reference argument + to the OMP data record, or a single void* argument (when no values + passed) */ + gcc_assert (VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl))) + && (DECL_CHAIN (omp_data_arg) == NULL_TREE + && ((TREE_CODE (argtype) == REFERENCE_TYPE + && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE) + || (TREE_CODE (argtype) == POINTER_TYPE + && TREE_TYPE (argtype) == void_type_node)))); + if (dump_file) + { + fprintf (dump_file, "Detected offloaded child function %s, " + "starting parameter conversion\n", + print_generic_expr_to_str (current_function_decl)); + fprintf (dump_file, "OMP data record argument: %s (tree type: %s)\n", + print_generic_expr_to_str (omp_data_arg), + print_generic_expr_to_str (argtype)); + fprintf (dump_file, "Data record fields:\n"); + } + + hash_map<tree,tree> fld_to_args; + tree fld, rectype = TREE_TYPE (argtype); + tree arglist = NULL_TREE, argtypelist = NULL_TREE; + + if (TREE_CODE (rectype) == RECORD_TYPE) + { + /* For each field in the OMP data record type, create a corresponding + PARM_DECL, and map field -> parm using the fld_to_args hash_map. 
+ Also create the tree chains for creating function type and + DECL_ARGUMENTS below. */ + for (fld = TYPE_FIELDS (rectype); fld; fld = DECL_CHAIN (fld)) + { + tree narg = build_decl (DECL_SOURCE_LOCATION (fld), PARM_DECL, + DECL_NAME (fld), TREE_TYPE (fld)); + DECL_ARTIFICIAL (narg) = 1; + DECL_ARG_TYPE (narg) = TREE_TYPE (fld); + DECL_CONTEXT (narg) = current_function_decl; + TREE_USED (narg) = 1; + TREE_READONLY (narg) = 1; + + if (dump_file) + fprintf (dump_file, "\t%s, type: %s, offset: %s bytes + %s bits\n", + print_generic_expr_to_str (fld), + print_generic_expr_to_str (TREE_TYPE (fld)), + print_generic_expr_to_str (DECL_FIELD_OFFSET (fld)), + print_generic_expr_to_str (DECL_FIELD_BIT_OFFSET (fld))); + fld_to_args.put (fld, narg); + + TREE_CHAIN (narg) = arglist; + arglist = narg; + argtypelist = tree_cons (NULL_TREE, TREE_TYPE (narg), + argtypelist); + } + arglist = nreverse (arglist); + argtypelist = nreverse (argtypelist); + } + /* This is needed to not be mistaken for a stdarg function. */ + argtypelist = chainon (argtypelist, void_list_node); + + if (dump_file) + { + fprintf (dump_file, "Function before OMP data arg replaced:\n"); + dump_function_to_file (current_function_decl, dump_file, dump_flags); + } + + /* Actually modify the tree type and DECL_ARGUMENTS here. */ + TREE_TYPE (current_function_decl) = build_function_type (void_type_node, + argtypelist); + DECL_ARGUMENTS (current_function_decl) = arglist; + + /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by + scanning and skipping those entries, creating a new local_decls list. + We assume a very specific MEM_REF tree expression shape. 
*/
+      tree decl;
+      unsigned int i;
+      vec<tree, va_gc> *new_local_decls = NULL;
+      FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl)
+        {
+          if (DECL_HAS_VALUE_EXPR_P (decl))
+            {
+              tree t = DECL_VALUE_EXPR (decl);
+              if (TREE_CODE (t) == MEM_REF
+                  && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF
+                  && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == MEM_REF
+                  && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0)
+                      == omp_data_arg))
+                continue;
+            }
+          vec_safe_push (new_local_decls, decl);
+        }
+      vec_free (cfun->local_decls);
+      cfun->local_decls = new_local_decls;
+
+      /* Scan function body for assignments from .omp_data_i->FIELD, and using
+         the above created fld_to_args hash map, convert them to reads of
+         function arguments.  */
+      basic_block bb;
+      gimple_stmt_iterator gsi;
+      FOR_EACH_BB_FN (bb, cfun)
+        for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+          {
+            tree val, *val_ptr = NULL;
+            gimple *stmt = gsi_stmt (gsi);
+            if (is_gimple_assign (stmt)
+                && gimple_assign_rhs_class (stmt) == GIMPLE_SINGLE_RHS)
+              val_ptr = gimple_assign_rhs1_ptr (stmt);
+            else if (is_gimple_debug (stmt) && gimple_debug_bind_p (stmt))
+              val_ptr = gimple_debug_bind_get_value_ptr (stmt);
+
+            if (val_ptr == NULL || (val = *val_ptr) == NULL_TREE)
+              continue;
+
+            tree new_val = NULL_TREE, fld = NULL_TREE;
+
+            if (TREE_CODE (val) == COMPONENT_REF
+                && TREE_CODE (TREE_OPERAND (val, 0)) == MEM_REF
+                && (TREE_CODE (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+                    == SSA_NAME)
+                && (SSA_NAME_VAR (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+                    == omp_data_arg))
+              {
+                /* .omp_data->FIELD case.  */
+                fld = TREE_OPERAND (val, 1);
+                new_val = *fld_to_args.get (fld);
+              }
+            else if (TREE_CODE (val) == MEM_REF
+                     && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
+                     && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
+              {
+                /* This case may happen in the final tree level optimization
+                   output, due to SLP:
+                   vect.XX = MEM <vector(1) unsigned long> [(void *).omp_data_i_5(D) + 8B]
+
+                   Therefore here we need a more elaborate search of the field
+                   list to reverse map to which field the offset is referring
+                   to.  */
+                unsigned HOST_WIDE_INT offset
+                  = tree_to_uhwi (TREE_OPERAND (val, 1));
+
+                for (hash_map<tree, tree>::iterator i = fld_to_args.begin ();
+                     i != fld_to_args.end (); ++i)
+                  {
+                    tree cur_fld = (*i).first;
+                    tree cur_arg = (*i).second;
+                    gcc_assert (TREE_CODE (cur_arg) == PARM_DECL);
+
+                    unsigned HOST_WIDE_INT cur_offset =
+                      (tree_to_uhwi (DECL_FIELD_OFFSET (cur_fld))
+                       + (tree_to_uhwi (DECL_FIELD_BIT_OFFSET (cur_fld))
+                          / BITS_PER_UNIT));
+
+                    if (offset == cur_offset)
+                      {
+                        new_val = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (val),
+                                          cur_arg);
+                        break;
+                      }
+                  }
+              }
+
+            /* If we found the corresponding OMP data record field, replace the
+               RHS with the new created PARM_DECL.  */
+            if (new_val != NULL_TREE)
+              {
+                if (dump_file)
+                  {
+                    fprintf (dump_file, "For gimple stmt: ");
+                    print_gimple_stmt (dump_file, stmt, 0);
+                    fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n",
+                             print_generic_expr_to_str (val),
+                             print_generic_expr_to_str (new_val));
+                  }
+                /* Write in looked up ARG as new RHS value.  */
+                *val_ptr = new_val;
+              }
+          }
+
+      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
+      tree name;
+      FOR_EACH_SSA_NAME (i, name, cfun)
+        if (SSA_NAME_VAR (name) == omp_data_arg)
+          (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;
+
+      if (dump_file)
+        {
+          fprintf (dump_file, "Function after OMP data arg replaced: ");
+          dump_function_to_file (current_function_decl, dump_file, dump_flags);
+        }
+    }
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
@@ -6605,6 +6829,9 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_SET_CURRENT_FUNCTION
 #define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
 
+#undef TARGET_EXPAND_TO_RTL_HOOK
+#define TARGET_EXPAND_TO_RTL_HOOK nvptx_expand_to_rtl_hook
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"

Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 278656)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -239,6 +239,8 @@ typedef struct nvptx_tdata
   const struct targ_ptx_obj *ptx_objs;
   unsigned ptx_num;
 
+  unsigned char ptx_version;
+
   const char *const *var_names;
   unsigned var_num;
 
@@ -254,12 +256,13 @@ struct targ_fn_descriptor
   const struct targ_fn_launch *launch;
   int regs_per_thread;
   int max_threads_per_block;
+  struct ptx_image_data *image;
 };
 
 /* A loaded PTX image.  */
 struct ptx_image_data
 {
-  const void *target_data;
+  const nvptx_tdata_t *target_data;
   CUmodule module;
 
   struct targ_fn_descriptor *fns;  /* Array of functions.  */
@@ -695,16 +698,30 @@ link_ptx (CUmodule *module, const struct targ_ptx_
 
 static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
-	    unsigned *dims, void *targ_mem_desc,
-	    CUdeviceptr dp, CUstream stream)
+	    unsigned *dims, CUdeviceptr dp, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
   int i;
-  void *kargs[1];
+  void *kargs[1] = { &dp };
+  void **kernel_args;
   struct nvptx_thread *nvthd = nvptx_thread ();
   int warp_size = nvthd->ptx_dev->warp_size;
 
+  if (__builtin_expect (dp == 0, true))
+    {
+      /* This is the newer "exploded" CUDA parameter case.  */
+      GOMP_PLUGIN_debug (0, "prepare mappings (mapnum: %u)\n",
+			 (unsigned) mapnum);
+      if (mapnum > 0)
+	{
+	  kernel_args = alloca (mapnum * sizeof (void *));
+	  for (int i = 0; i < mapnum; i++)
+	    kernel_args[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+	}
+    }
+  else
+    kernel_args = kargs;
+
   function = targ_fn->fn;
 
   /* Initialize the launch dimensions.  Typically this is constant,
@@ -936,11 +953,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
					    api_info);
     }
 
-  kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
		    dims[GOMP_DIM_GANG], 1, 1,
		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
-		    0, stream, kargs, 0);
+		    0, stream, kernel_args, 0);
 
   if (profiling_p)
     {
@@ -1232,15 +1248,10 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version
   *target_table = targ_tbl;
 
   new_image = GOMP_PLUGIN_malloc (sizeof (struct ptx_image_data));
-  new_image->target_data = target_data;
+  new_image->target_data = (const nvptx_tdata_t *) target_data;
   new_image->module = module;
   new_image->fns = targ_fns;
 
-  pthread_mutex_lock (&dev->image_lock);
-  new_image->next = dev->images;
-  dev->images = new_image;
-  pthread_mutex_unlock (&dev->image_lock);
-
   for (i = 0; i < fn_entries; i++, targ_fns++, targ_tbl++)
     {
       CUfunction function;
@@ -1257,11 +1268,17 @@ GOMP_OFFLOAD_load_image (int ord, unsigned version
       targ_fns->launch = &fn_descs[i];
       targ_fns->regs_per_thread = nregs;
       targ_fns->max_threads_per_block = mthrs;
+      targ_fns->image = new_image;
 
       targ_tbl->start = (uintptr_t) targ_fns;
       targ_tbl->end = targ_tbl->start + 1;
     }
 
+  pthread_mutex_lock (&dev->image_lock);
+  new_image->next = dev->images;
+  dev->images = new_image;
+  pthread_mutex_unlock (&dev->image_lock);
+
   for (j = 0; j < var_entries; j++, targ_tbl++)
     {
       CUdeviceptr var;
@@ -1344,10 +1361,9 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
	  && nvptx_free (ptr, ptx_devices[ord]));
 }
 
-void
-GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
-			   void **hostaddrs, void **devaddrs,
-			   unsigned *dims, void *targ_mem_desc)
+static void
+openacc_exec_v0 (void (*fn) (void *), size_t mapnum,
+		 void **hostaddrs, void **devaddrs, unsigned *dims)
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
@@ -1407,8 +1423,7 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
	}
     }
 
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, NULL);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, dp, NULL);
 
   CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
   const char *maybe_abort_msg = "(perhaps abort was called)";
@@ -1424,6 +1439,43 @@ GOMP_OFFLOAD_free (int ord, void *ptr)
 }
 
 static void
+openacc_exec_v1 (void (*fn) (void *), size_t mapnum,
+		 void **hostaddrs, void **devaddrs, unsigned *dims)
+{
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, 0, NULL);
+
+  CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
+  const char *maybe_abort_msg = "(perhaps abort was called)";
+  if (r == CUDA_ERROR_LAUNCH_FAILED)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s %s\n", cuda_error (r),
+		       maybe_abort_msg);
+  else if (r != CUDA_SUCCESS)
+    GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
+}
+
+void
+GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), size_t mapnum,
+			   void **hostaddrs, void **devaddrs,
+			   unsigned *dims, void *targ_mem_desc)
+{
+  struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
+  unsigned ptx_version = targ_fn->image->target_data->ptx_version;
+
+  if (__builtin_expect (ptx_version == 1, true))
+    openacc_exec_v1 (fn, mapnum, hostaddrs, devaddrs, dims);
+  else
+    switch (ptx_version)
+      {
+      case 0:
+	openacc_exec_v0 (fn, mapnum, hostaddrs, devaddrs, dims);
+	break;
+      default:
+	GOMP_PLUGIN_fatal ("Unsupported PTX image code version '%u'\n",
+			   ptx_version);
+      };
+}
+
+static void
 cuda_free_argmem (void *ptr)
 {
   void **block = (void **) ptr;
@@ -1431,11 +1483,10 @@ cuda_free_argmem (void *ptr)
   free (block);
 }
 
-void
-GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
-				 void **hostaddrs, void **devaddrs,
-				 unsigned *dims, void *targ_mem_desc,
-				 struct goacc_asyncqueue *aq)
+static void
+openacc_async_exec_v0 (void (*fn) (void *), size_t mapnum,
+		       void **hostaddrs, void **devaddrs,
+		       unsigned *dims, struct goacc_asyncqueue *aq)
 {
   GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
 
@@ -1504,13 +1555,43 @@ cuda_free_argmem (void *ptr)
	}
     }
 
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, aq->cuda_stream);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, dp, aq->cuda_stream);
 
   if (mapnum > 0)
     GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
 }
 
+static void
+openacc_async_exec_v1 (void (*fn) (void *), size_t mapnum,
+		       void **hostaddrs, void **devaddrs,
+		       unsigned *dims, struct goacc_asyncqueue *aq)
+{
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, 0, aq->cuda_stream);
+}
+
+void
+GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
+				 void **hostaddrs, void **devaddrs,
+				 unsigned *dims, void *targ_mem_desc,
+				 struct goacc_asyncqueue *aq)
+{
+  struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
+  unsigned ptx_version = targ_fn->image->target_data->ptx_version;
+
+  if (__builtin_expect (ptx_version == 1, true))
+    openacc_async_exec_v1 (fn, mapnum, hostaddrs, devaddrs, dims, aq);
+  else
+    switch (ptx_version)
+      {
+      case 0:
+	openacc_async_exec_v0 (fn, mapnum, hostaddrs, devaddrs, dims, aq);
+	break;
+      default:
+	GOMP_PLUGIN_fatal ("Unsupported PTX image code version '%u'\n",
+			   ptx_version);
+      };
+}
+
 void *
 GOMP_OFFLOAD_openacc_create_thread_data (int ord)
 {
Index: gcc/config/nvptx/nvptx.c
===================================================================
--- gcc/config/nvptx/nvptx.c	(revision 275493)
+++ gcc/config/nvptx/nvptx.c	(working copy)
@@ -68,6 +68,10 @@
 #include "attribs.h"
 #include "tree-vrp.h"
 #include "tree-ssa-operands.h"
+#include "tree-pretty-print.h"
+#include "gimple-pretty-print.h"
+#include "tree-cfg.h"
+#include "gimple-ssa.h"
 #include "tree-ssanames.h"
 #include "gimplify.h"
 #include "tree-phinodes.h"
@@ -6437,6 +6441,228 @@ nvptx_set_current_function (tree fndecl)
   oacc_bcast_partition = 0;
 }
 
+static void
+nvptx_expand_to_rtl_hook (void)
+{
+  /* For utilizing CUDA .param kernel arguments, we detect and modify
+     the gimple of offloaded child functions, here before RTL expansion,
+     starting with standard OMP form:
+     foo._omp_fn.0 (const struct .omp_data_t.8 & restrict .omp_data_i) { ... }
+
+     and transform it into a style where the OMP data record fields are
+     "exploded" into individual scalar arguments:
+     foo._omp_fn.0 (int * a, int * b, int * c) { ... }
+
+     Note that there are implicit assumptions of how OMP lowering (and/or other
+     intervening passes) behaves contained in this transformation code;
+     if those passes change in their output, this code may possibly need
+     updating.  */
+
+  if (lookup_attribute ("omp target entrypoint",
+			DECL_ATTRIBUTES (current_function_decl))
+      /* The rather indirect manner in which OpenMP target functions are
+	 launched makes this transformation only valid for OpenACC currently.
+	 TODO: e.g. write_omp_entry(), nvptx_declare_function_name(), etc.
+	 needs changes for this to work with OpenMP.  */
+      && lookup_attribute ("oacc function",
+			   DECL_ATTRIBUTES (current_function_decl))
+      && VOID_TYPE_P (TREE_TYPE (DECL_RESULT (current_function_decl))))
+    {
+      tree omp_data_arg = DECL_ARGUMENTS (current_function_decl);
+      tree argtype = TREE_TYPE (omp_data_arg);
+
+      /* Ensure this function is of the form of a single reference argument
+	 to the OMP data record, or a single void* argument (when no values
+	 passed)  */
+      if (! (DECL_CHAIN (omp_data_arg) == NULL_TREE
+	     && ((TREE_CODE (argtype) == REFERENCE_TYPE
+		  && TREE_CODE (TREE_TYPE (argtype)) == RECORD_TYPE)
+		 || (TREE_CODE (argtype) == POINTER_TYPE
+		     && TREE_TYPE (argtype) == void_type_node))))
+	return;
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Detected offloaded child function %s, "
+		   "starting parameter conversion\n",
+		   print_generic_expr_to_str (current_function_decl));
+	  fprintf (dump_file, "OMP data record argument: %s (tree type: %s)\n",
+		   print_generic_expr_to_str (omp_data_arg),
+		   print_generic_expr_to_str (argtype));
+	  fprintf (dump_file, "Data record fields:\n");
+	}
+
+      hash_map<tree,tree> fld_to_args;
+      tree fld, rectype = TREE_TYPE (argtype);
+      tree arglist = NULL_TREE, argtypelist = NULL_TREE;
+
+      if (TREE_CODE (rectype) == RECORD_TYPE)
+	{
+	  /* For each field in the OMP data record type, create a corresponding
+	     PARM_DECL, and map field -> parm using the fld_to_args hash_map.
+	     Also create the tree chains for creating function type and
+	     DECL_ARGUMENTS below.  */
+	  for (fld = TYPE_FIELDS (rectype); fld; fld = DECL_CHAIN (fld))
+	    {
+	      tree narg = build_decl (DECL_SOURCE_LOCATION (fld), PARM_DECL,
+				      DECL_NAME (fld), TREE_TYPE (fld));
+	      DECL_ARTIFICIAL (narg) = 1;
+	      DECL_ARG_TYPE (narg) = TREE_TYPE (fld);
+	      DECL_CONTEXT (narg) = current_function_decl;
+	      TREE_USED (narg) = 1;
+	      TREE_READONLY (narg) = 1;
+
+	      if (dump_file)
+		fprintf (dump_file, "\t%s, type: %s, offset: %s bytes + %s bits\n",
+			 print_generic_expr_to_str (fld),
+			 print_generic_expr_to_str (TREE_TYPE (fld)),
+			 print_generic_expr_to_str (DECL_FIELD_OFFSET (fld)),
+			 print_generic_expr_to_str (DECL_FIELD_BIT_OFFSET (fld)));
+	      fld_to_args.put (fld, narg);
+
+	      TREE_CHAIN (narg) = arglist;
+	      arglist = narg;
+	      argtypelist = tree_cons (NULL_TREE, TREE_TYPE (narg),
+				       argtypelist);
+	    }
+	  arglist = nreverse (arglist);
+	  argtypelist = nreverse (argtypelist);
+	}
+      /* This is needed to not be mistaken for a stdarg function.  */
+      argtypelist = chainon (argtypelist, void_list_node);
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Function before OMP data arg replaced:\n");
+	  dump_function_to_file (current_function_decl, dump_file, dump_flags);
+	}
+
+      /* Actually modify the tree type and DECL_ARGUMENTS here.  */
+      TREE_TYPE (current_function_decl) = build_function_type (void_type_node,
+							       argtypelist);
+      DECL_ARGUMENTS (current_function_decl) = arglist;
+
+      /* Remove local decls which correspond to *.omp_data_i->FIELD entries, by
+	 scanning and skipping those entries, creating a new local_decls list.
+	 We assume a very specific MEM_REF tree expression shape.  */
+      tree decl;
+      unsigned int i;
+      vec<tree, va_gc> *new_local_decls = NULL;
+      FOR_EACH_VEC_SAFE_ELT (cfun->local_decls, i, decl)
+	{
+	  if (DECL_HAS_VALUE_EXPR_P (decl))
+	    {
+	      tree t = DECL_VALUE_EXPR (decl);
+	      if (TREE_CODE (t) == MEM_REF
+		  && TREE_CODE (TREE_OPERAND (t, 0)) == COMPONENT_REF
+		  && TREE_CODE (TREE_OPERAND (TREE_OPERAND (t, 0), 0)) == MEM_REF
+		  && (TREE_OPERAND (TREE_OPERAND (TREE_OPERAND (t, 0), 0), 0)
+		      == omp_data_arg))
+		continue;
+	    }
+	  vec_safe_push (new_local_decls, decl);
+	}
+      vec_free (cfun->local_decls);
+      cfun->local_decls = new_local_decls;
+
+      /* Scan function body for assignments from .omp_data_i->FIELD, and using
+	 the above created fld_to_args hash map, convert them to reads of
+	 function arguments.  */
+      basic_block bb;
+      gimple_stmt_iterator gsi;
+      FOR_EACH_BB_FN (bb, cfun)
+	for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	  {
+	    tree val, *val_ptr = NULL;
+	    gimple *stmt = gsi_stmt (gsi);
+	    if (is_gimple_assign (stmt)
+		&& gimple_assign_rhs_class (stmt) == GIMPLE_SINGLE_RHS)
+	      val_ptr = gimple_assign_rhs1_ptr (stmt);
+	    else if (is_gimple_debug (stmt) && gimple_debug_bind_p (stmt))
+	      val_ptr = gimple_debug_bind_get_value_ptr (stmt);
+
+	    if (val_ptr == NULL || (val = *val_ptr) == NULL_TREE)
+	      continue;
+
+	    tree new_val = NULL_TREE, fld = NULL_TREE;
+
+	    if (TREE_CODE (val) == COMPONENT_REF
+		&& TREE_CODE (TREE_OPERAND (val, 0)) == MEM_REF
+		&& (TREE_CODE (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+		    == SSA_NAME)
+		&& (SSA_NAME_VAR (TREE_OPERAND (TREE_OPERAND (val, 0), 0))
+		    == omp_data_arg))
+	      {
+		/* .omp_data->FIELD case.  */
+		fld = TREE_OPERAND (val, 1);
+		new_val = *fld_to_args.get (fld);
+	      }
+	    else if (TREE_CODE (val) == MEM_REF
+		     && TREE_CODE (TREE_OPERAND (val, 0)) == SSA_NAME
+		     && SSA_NAME_VAR (TREE_OPERAND (val, 0)) == omp_data_arg)
+	      {
+		/* This case may happen in the final tree level optimization
+		   output, due to SLP:
+		   vect.XX = MEM <vector(1) unsigned long> [(void *).omp_data_i_5(D) + 8B]
+
+		   Therefore here we need a more elaborate search of the field
+		   list to reverse map to which field the offset is referring
+		   to.  */
+		unsigned HOST_WIDE_INT offset
+		  = tree_to_uhwi (TREE_OPERAND (val, 1));
+
+		for (hash_map<tree, tree>::iterator i = fld_to_args.begin ();
+		     i != fld_to_args.end (); ++i)
+		  {
+		    tree cur_fld = (*i).first;
+		    tree cur_arg = (*i).second;
+		    gcc_assert (TREE_CODE (cur_arg) == PARM_DECL);
+
+		    unsigned HOST_WIDE_INT cur_offset =
+		      (tree_to_uhwi (DECL_FIELD_OFFSET (cur_fld))
+		       + (tree_to_uhwi (DECL_FIELD_BIT_OFFSET (cur_fld))
+			  / BITS_PER_UNIT));
+
+		    if (offset == cur_offset)
+		      {
+			new_val = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (val),
+					  cur_arg);
+			break;
+		      }
+		  }
+	      }
+
+	    /* If we found the corresponding OMP data record field, replace the
+	       RHS with the new created PARM_DECL.  */
+	    if (new_val != NULL_TREE)
+	      {
+		if (dump_file)
+		  {
+		    fprintf (dump_file, "For gimple stmt: ");
+		    print_gimple_stmt (dump_file, stmt, 0);
+		    fprintf (dump_file, "\tReplacing OMP recv ref %s with %s\n",
+			     print_generic_expr_to_str (val),
+			     print_generic_expr_to_str (new_val));
+		  }
+		/* Write in looked up ARG as new RHS value.  */
+		*val_ptr = new_val;
+	      }
+	  }
+
+      /* Delete SSA_NAMEs of .omp_data_i by setting them to NULL_TREE.  */
+      tree name;
+      FOR_EACH_SSA_NAME (i, name, cfun)
+	if (SSA_NAME_VAR (name) == omp_data_arg)
+	  (*SSANAMES (cfun))[SSA_NAME_VERSION (name)] = NULL_TREE;
+
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Function after OMP data arg replaced: ");
+	  dump_function_to_file (current_function_decl, dump_file, dump_flags);
+	}
+    }
+}
+
 #undef TARGET_OPTION_OVERRIDE
 #define TARGET_OPTION_OVERRIDE nvptx_option_override
@@ -6576,6 +6802,9 @@ nvptx_set_current_function (tree fndecl)
 #undef TARGET_SET_CURRENT_FUNCTION
 #define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function
 
+#undef TARGET_EXPAND_TO_RTL_HOOK
+#define TARGET_EXPAND_TO_RTL_HOOK nvptx_expand_to_rtl_hook
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-nvptx.h"

Index: libgomp/plugin/plugin-nvptx.c
===================================================================
--- libgomp/plugin/plugin-nvptx.c	(revision 275493)
+++ libgomp/plugin/plugin-nvptx.c	(working copy)
@@ -696,16 +696,24 @@ link_ptx (CUmodule *module, const struct targ_ptx_
 
 static void
 nvptx_exec (void (*fn), size_t mapnum, void **hostaddrs, void **devaddrs,
-	    unsigned *dims, void *targ_mem_desc,
-	    CUdeviceptr dp, CUstream stream)
+	    unsigned *dims, CUstream stream)
 {
   struct targ_fn_descriptor *targ_fn = (struct targ_fn_descriptor *) fn;
   CUfunction function;
   int i;
-  void *kargs[1];
   struct nvptx_thread *nvthd = nvptx_thread ();
   int warp_size = nvthd->ptx_dev->warp_size;
 
+  void **kernel_args = NULL;
+  GOMP_PLUGIN_debug (0, "prepare mappings (mapnum: %u)\n", (unsigned) mapnum);
+
+  if (mapnum > 0)
+    {
+      kernel_args = alloca (mapnum * sizeof (void *));
+      for (int i = 0; i < mapnum; i++)
+	kernel_args[i] = (devaddrs[i] ? &devaddrs[i] : &hostaddrs[i]);
+    }
+
   function = targ_fn->fn;
 
   /* Initialize the launch dimensions.  Typically this is constant,
@@ -937,11 +945,10 @@ nvptx_exec (void (*fn), size_t mapnum, void **host
					    api_info);
     }
 
-  kargs[0] = &dp;
   CUDA_CALL_ASSERT (cuLaunchKernel, function,
		    dims[GOMP_DIM_GANG], 1, 1,
		    dims[GOMP_DIM_VECTOR], dims[GOMP_DIM_WORKER], 1,
-		    0, stream, kargs, 0);
+		    0, stream, kernel_args, 0);
 
   if (profiling_p)
     {
@@ -1350,67 +1357,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si
			   void **hostaddrs, void **devaddrs,
			   unsigned *dims, void *targ_mem_desc)
 {
-  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, NULL);
 
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info data_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_p = __builtin_expect (prof_info != NULL, false);
-
-  void **hp = NULL;
-  CUdeviceptr dp = 0;
-
-  if (mapnum > 0)
-    {
-      size_t s = mapnum * sizeof (void *);
-      hp = alloca (s);
-      for (int i = 0; i < mapnum; i++)
-	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
-      if (profiling_p)
-	goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
-    }
-
-  /* Copy the (device) pointers to arguments to the device (dp and hp might in
-     fact have the same value on a unified-memory system).  */
-  if (mapnum > 0)
-    {
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_start;
-
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  data_event_info.data_event.valid_bytes
-	    = _ACC_DATA_EVENT_INFO_VALID_BYTES;
-	  data_event_info.data_event.parent_construct
-	    = acc_construct_parallel;
-	  data_event_info.data_event.implicit = 1; /* Always implicit.  */
-	  data_event_info.data_event.tool_info = NULL;
-	  data_event_info.data_event.var_name = NULL;
-	  data_event_info.data_event.bytes = mapnum * sizeof (void *);
-	  data_event_info.data_event.host_ptr = hp;
-	  data_event_info.data_event.device_ptr = (const void *) dp;
-
-	  api_info->device_api = acc_device_api_cuda;
-
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-      CUDA_CALL_ASSERT (cuMemcpyHtoD, dp, (void *) hp,
-			mapnum * sizeof (void *));
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_end;
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-    }
-
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, NULL);
-
   CUresult r = CUDA_CALL_NOCHECK (cuStreamSynchronize, NULL);
   const char *maybe_abort_msg = "(perhaps abort was called)";
   if (r == CUDA_ERROR_LAUNCH_FAILED)
@@ -1418,20 +1366,8 @@ GOMP_OFFLOAD_openacc_exec (void (*fn) (void *), si
		       maybe_abort_msg);
   else if (r != CUDA_SUCCESS)
     GOMP_PLUGIN_fatal ("cuStreamSynchronize error: %s", cuda_error (r));
-
-  CUDA_CALL_ASSERT (cuMemFree, dp);
-  if (profiling_p)
-    goacc_profiling_acc_ev_free (thr, (void *) dp);
 }
 
-static void
-cuda_free_argmem (void *ptr)
-{
-  void **block = (void **) ptr;
-  nvptx_free (block[0], (struct ptx_device *) block[1]);
-  free (block);
-}
-
 void
 GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void *), size_t mapnum,
				 void **hostaddrs, void **devaddrs,
@@ -1438,78 +1374,7 @@ GOMP_OFFLOAD_openacc_async_exec (void (*fn) (void
				 unsigned *dims, void *targ_mem_desc,
				 struct goacc_asyncqueue *aq)
 {
-  GOMP_PLUGIN_debug (0, "  %s: prepare mappings\n", __FUNCTION__);
-
-  struct goacc_thread *thr = GOMP_PLUGIN_goacc_thread ();
-  acc_prof_info *prof_info = thr->prof_info;
-  acc_event_info data_event_info;
-  acc_api_info *api_info = thr->api_info;
-  bool profiling_p = __builtin_expect (prof_info != NULL, false);
-
-  void **hp = NULL;
-  CUdeviceptr dp = 0;
-  void **block = NULL;
-
-  if (mapnum > 0)
-    {
-      size_t s = mapnum * sizeof (void *);
-      block = (void **) GOMP_PLUGIN_malloc (2 * sizeof (void *) + s);
-      hp = block + 2;
-      for (int i = 0; i < mapnum; i++)
-	hp[i] = (devaddrs[i] ? devaddrs[i] : hostaddrs[i]);
-      CUDA_CALL_ASSERT (cuMemAlloc, &dp, s);
-      if (profiling_p)
-	goacc_profiling_acc_ev_alloc (thr, (void *) dp, s);
-    }
-
-  /* Copy the (device) pointers to arguments to the device (dp and hp might in
-     fact have the same value on a unified-memory system).  */
-  if (mapnum > 0)
-    {
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_start;
-
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  data_event_info.data_event.valid_bytes
-	    = _ACC_DATA_EVENT_INFO_VALID_BYTES;
-	  data_event_info.data_event.parent_construct
-	    = acc_construct_parallel;
-	  data_event_info.data_event.implicit = 1; /* Always implicit.  */
-	  data_event_info.data_event.tool_info = NULL;
-	  data_event_info.data_event.var_name = NULL;
-	  data_event_info.data_event.bytes = mapnum * sizeof (void *);
-	  data_event_info.data_event.host_ptr = hp;
-	  data_event_info.data_event.device_ptr = (const void *) dp;
-
-	  api_info->device_api = acc_device_api_cuda;
-
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-
-      CUDA_CALL_ASSERT (cuMemcpyHtoDAsync, dp, (void *) hp,
-			mapnum * sizeof (void *), aq->cuda_stream);
-      block[0] = (void *) dp;
-
-      struct nvptx_thread *nvthd =
-	(struct nvptx_thread *) GOMP_PLUGIN_acc_thread ();
-      block[1] = (void *) nvthd->ptx_dev;
-
-      if (profiling_p)
-	{
-	  prof_info->event_type = acc_ev_enqueue_upload_end;
-	  data_event_info.data_event.event_type = prof_info->event_type;
-	  GOMP_PLUGIN_goacc_profiling_dispatch (prof_info, &data_event_info,
-						api_info);
-	}
-    }
-
-  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, targ_mem_desc,
-	      dp, aq->cuda_stream);
-
-  if (mapnum > 0)
-    GOMP_OFFLOAD_openacc_async_queue_callback (aq, cuda_free_argmem, block);
+  nvptx_exec (fn, mapnum, hostaddrs, devaddrs, dims, aq->cuda_stream);
 }
 
 void *