Message ID | aaf895fb1e3a009afc146d08d0cc267fa81971b3.1614342218.git.julian@codesourcery.com |
---|---|
State | New |
Headers | show |
Series | openacc: Gang-private variables in shared memory | expand |
Hi! On 2021-02-26T04:34:50-0800, Julian Brown <julian@codesourcery.com> wrote: > This patch Thanks, Julian, for your continued improving of these changes! This has iterated through several conceptually different designs and implementations, by several people, over the past several years. It's now been made my task to finish it up -- but I'll very much appreciate your input (Julian's, primarily) on the following remarks, which are basically my open work items. > implements a method to track the "private-ness" of > OpenACC variables declared in offload regions in gang-partitioned, > worker-partitioned or vector-partitioned modes. Variables declared > implicitly in scoped blocks and those declared "private" on enclosing > directives (e.g. "acc parallel") are both handled. Variables that are > e.g. gang-private can then be adjusted so they reside in GPU shared > memory. > > The reason for doing this is twofold: correct implementation of OpenACC > semantics ACK, and as mentioned before, this very much relates to <https://gcc.gnu.org/PR90115> "OpenACC: predetermined private levels for variables declared in blocks" (plus the corresponding use of 'private' clauses, implicit/explicit, including 'firstprivate') and <https://gcc.gnu.org/PR90114> "Predetermined private levels for variables declared in OpenACC accelerator routines", which we thus should refer in testcases/ChangeLog/commit log, as appropriate. I do understand we're not yet addressing all of that (and that's fine!), but we should capture remaining work items of the PRs and Cesar's list in <http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>), as appropriate. I was surprised that we didn't really have to fix up any existing libgomp testcases, because there seem to be quite some that contain a pattern (exemplified by the 'tmp' variable) as follows: int main() { #define N 123 int data[N]; int tmp; #pragma acc parallel // implicit 'firstprivate(tmp)' { // 'tmp' now conceptually made gang-private here. #pragma acc loop gang for (int i = 0; i < 123; ++i) { tmp = i + 234; data[i] = tmp; } } for (int i = 0; i < 123; ++i) if (data[i] != i + 234) __builtin_abort (); return 0; } With the code changes as posted, this actually now does *not* use gang-private memory for 'tmp', but instead continues to use "thread-private registers", as before. Same for: --- s3.c 2021-04-13 17:26:49.628739379 +0200 +++ s3_2.c 2021-04-13 17:29:43.484579664 +0200 @@ -4,6 +4,6 @@ int data[N]; - int tmp; -#pragma acc parallel // implicit 'firstprivate(tmp)' +#pragma acc parallel { + int tmp; // 'tmp' now conceptually made gang-private here. #pragma acc loop gang I suppose that's due to conditionalizing this transformation on 'TREE_ADDRESSABLE' (as you're doing), so we should be mostly "safe" regarding such existing testcases (but I haven't verified that yet in detail). That needs to be documented in testcases, with some kind of dump scanning (host compilation-side even; see below). A note for later: if this weren't just a 'gang' loop, but 'gang' plus 'worker' and/or 'vector', we'd actually be fixing up user code with undefined behavior into "correct" code (by *not* making 'tmp' gang-private, but thread-private), right? As that may not be obvious to the reader, I'd like to have the 'TREE_ADDRESSABLE' conditionalization be documented in the code. You had explained that in <http://mid.mail-archive.com/20190612204216.0ec83e4e@squid.athome>: "a non-addressable variable [...]". > and optimisation, since shared memory might be faster than > the main memory on a GPU. Do we potentially have a problem that making more use of (scarce) gang-private memory may negatively affect peformance, because potentially fewer OpenACC gangs may then be launched to the GPU hardware in parallel? (Of course, OpenACC semantics conformance firstly is more important than performance, but there may be ways to be conformant and performant; "quality of implementation".) Have you run any such performance testing with the benchmarking codes that we've got set up? (As I'm more familiar with that, I'm using nvptx offloading examples in the following, whilst assuming that similar discussion may apply for GCN offloading, which uses similar hardware concepts, as far as I remember.) Looking at the existing 'libgomp.oacc-c-c++-common/private-variables.c' (random example), for nvptx offloading, '-O0', we see the following PTX JIT compilation changes (word-'diff' of 'GOMP_DEBUG=1' at run-time): info : Function properties for 'local_g_1$_omp_fn$0': info : used 27 registers, 32 stack, [-176-]{+256+} bytes smem, 328 bytes cmem[0], 0 bytes lmem info : Function properties for 'local_w_1$_omp_fn$0': info : used 40 registers, 48 stack, [-176-]{+256+} bytes smem, 328 bytes cmem[0], 0 bytes lmem info : Function properties for 'local_w_2$_omp_fn$0': [...] info : Function properties for 'parallel_g_1$_omp_fn$0': info : used 27 registers, 32 stack, [-176-]{+256+} bytes smem, 328 bytes cmem[0], 0 bytes lmem info : Function properties for 'parallel_g_2$_omp_fn$0': info : used 32 registers, 160 stack, [-176-]{+256+} bytes smem, 328 bytes cmem[0], 0 bytes lmem ... that is, PTX '.shared' usage increases from 176 to 256 bytes for *all* functions, even though only 'loop_g_4$_omp_fn$0' and 'loop_g_5$_omp_fn$0' are actually using gang-private memory. Execution testing works before (original code, not using gang-private memory) as well as after (code changes as posted, using gang-private memory), so use on gang-private memory doesn't seem necessary here for "correct execution" -- or at least: "expected execution result". ;-) I haven't looked yet whether there's a potentional issue in the testcases here. The additional '256 - 176 = 80' bytes of PTX '.shared' memory requested are due to GCC nvptx back end implementation's use of a global "Shared memory block for gang-private variables": // BEGIN VAR DEF: __oacc_bcast .shared .align 8 .u8 __oacc_bcast[176]; +// BEGIN VAR DEF: __gangprivate_shared +.shared .align 32 .u8 __gangprivate_shared[64]; ..., plus (I suppose) an additional '80 - 64 = 16' padding/unused bytes to establish '.align 32' after '.align 8' for '__oacc_bcast'. Per <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities>, "Table 15. Technical Specifications per Compute Capability", "Compute Capability": "3.5", we have a "Maximum amount of shared memory per SM": "48 KB", so with '176 bytes smem', that permits '48 * 1024 / 176 = 279' thread blocks ('num_gangs') resident at one point in time, whereas with '256 bytes smem', it's just '48 * 1024 / 256 = 192' thread blocks resident at one point in time. (Not sure that I got all the details right, but you get the idea/concern?) Anyway, that shall be OK for now, but we shall later look into optimizing that; can't we have '.shared' local to the relevant PTX functions instead of global? Interestingly, compiling with '-O2', we see: // BEGIN VAR DEF: __oacc_bcast .shared .align 8 .u8 __oacc_bcast[144]; {+// BEGIN VAR DEF: __gangprivate_shared+} {+.shared .align 128 .u8 __gangprivate_shared[32];+} With '-O2', only 'loop_g_5$_omp_fn$0' is using gang-private memory, and apparently the PTX JIT is able to figure that out from the PTX code that GCC generates, and is then able to localize '.shared' memory usage to just 'loop_g_5$_omp_fn$0': [...] info : Function properties for 'loop_g_4$_omp_fn$0': info : used 12 registers, 0 stack, 144 bytes smem, 328 bytes cmem[0], 0 bytes lmem info : Function properties for 'loop_g_5$_omp_fn$0': info : used [-30-]{+32+} registers, 32 stack, [-144-]{+288+} bytes smem, 328 bytes cmem[0], 0 bytes lmem info : Function properties for 'loop_g_6$_omp_fn$0': info : used 13 registers, 0 stack, 144 bytes smem, 328 bytes cmem[0], 0 bytes lmem [...] This strongly suggests to me that indeed there must exist a programmatic way to get rid of the global "Shared memory block for gang-private variables". The additional '288 - 144 = 144' bytes of PTX '.shared' memory requested are 32 bytes for 'int x[8]' ('#pragma acc loop gang private(x)') plus '288 - 32 - 144 = 112' padding/unused bytes to establish '.align 128' (!) after '.align 8' for '__oacc_bcast'. That's clearly not ideal: 112 bytes wasted in contrast to just '144 + 32 = 176' bytes actually used. (I have not yet looked why/whether this really needs '.align 128'.) I have not yet looked whether similar concerns exist for the GCC GCN back end implementation. (That one also does set 'TREE_STATIC' for gang-private memory, so it's a global allocation?) > Handling of private variables is intimately > tied to the execution model for gangs/workers/vectors implemented by > a particular target: for current targets, we use (or on mainline, will > soon use) a broadcasting/neutering scheme. > > That is sufficient for code that e.g. sets a variable in worker-single > mode and expects to use the value in worker-partitioned mode. The > difficulty (semantics-wise) comes when the user wants to do something like > an atomic operation in worker-partitioned mode and expects a worker-single > (gang private) variable to be shared across each partitioned worker. > Forcing use of shared memory for such variables makes that work properly. Are we reliably making sure that gang-private variables (and other levels, in general) are not subject to the usual broadcasting scheme (nvptx, at least), or does that currently work "by accident"? (I haven't looked into that, yet.) > In terms of implementation, the parallelism level of a given loop is > not fixed until the oaccdevlow pass in the offload compiler, so the > patch delays fixing the parallelism level of variables declared on or > within such loops until the same point. This is done by adding a new > internal UNIQUE function (OACC_PRIVATE) that lists (the address of) each > private variable as an argument, and other arguments set so as to be able > to determine the correct parallelism level to use for the listed > variables. This new internal function fits into the existing scheme for > demarcating OpenACC loops, as described in comments in the patch. Yes, thanks, that's conceptually now much better than the earlier variants that we had. :-) (Hooray, again, for Nathan's OpenACC execution model design!) What we should add, though, is a bunch of testcases to verify that the expected processing does/doesn't happen for relevant source code constructs. I'm thinking that when the transformation is/isn't done, that gets logged, and we can then scan the dumps accordingly. Some of that is implemented already; we should be able to do such scanning generally for host compilation, too, not just offloading compilation. Generally, we also have to make sure that the expected privatizations (plural) happen if there are multiple levels of parallelism involved: (deep) loops nests with 'gang', 'worker', 'vector', 'seq' as well as combinations of 'gang', 'worker', 'vector' on one level. #pragma acc parallel { int x; // What's 'x' at this level? #pragma acc loop seq private(x) [for] { // What's 'x' at this level? #pragma acc loop private(x) [for] { // What's 'x' at this level? #pragma acc loop worker vector private(x) [for...] { // What's 'x' at this level? Etc. > Two new target hooks are introduced: TARGET_GOACC_ADJUST_PRIVATE_DECL and > TARGET_GOACC_EXPAND_VAR_DECL. The first can tweak a variable declaration > at oaccdevlow time, and the second at expand time. The first or both > of these target hooks can be used by a given offload target, depending > on its strategy for implementing private variables. ACK. So, currently we're only looking at making the gang-private level work. Regarding that, we have two configurations: (1) for GCN offloading, 'targetm.goacc.adjust_private_decl' does the work (in particular, change 'TREE_TYPE' etc.) and there is no 'targetm.goacc.expand_var_decl', and (2) for nvptx offloading, 'targetm.goacc.adjust_private_decl' only sets a marker ('oacc gangprivate' attribute) and then 'targetm.goacc.expand_var_decl' does the work. Therefore I suggest we clarify the (currently) expected handling similar to: --- gcc/omp-offload.c +++ gcc/omp-offload.c @@ -1854,6 +1854,19 @@ oacc_rewrite_var_decl (tree *tp, int *walk_subtrees, void *data) return NULL_TREE; } +static tree +oacc_rewrite_var_decl_ (tree *tp, int *walk_subtrees, void *data) +{ + tree t = oacc_rewrite_var_decl (tp, walk_subtrees, data); + if (targetm.goacc.expand_var_decl) + { + walk_stmt_info *wi = (walk_stmt_info *) data; + var_decl_rewrite_info *info = (var_decl_rewrite_info *) wi->info; + gcc_assert (!info->modified); + } + return t; +} + /* Return TRUE if CALL is a call to a builtin atomic/sync operation. */ static bool @@ -2195,6 +2208,9 @@ execute_oacc_device_lower () COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also rewritten to use the new decl, adjusting types of appropriate tree nodes as necessary. */ + if (targetm.goacc.expand_var_decl) + gcc_assert (adjusted_vars.is_empty ()); + if (targetm.goacc.adjust_private_decl) { FOR_ALL_BB_FN (bb, cfun) @@ -2217,7 +2233,7 @@ execute_oacc_device_lower () memset (&wi, 0, sizeof (wi)); wi.info = &info; - walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi); + walk_gimple_op (stmt, oacc_rewrite_var_decl_, &wi); if (info.modified) update_stmt (stmt); Or, in fact, 'if (targetm.goacc.expand_var_decl)', skip the 'adjusted_vars' handling completely? I do understand that eventually (in particular, for worker-private level?), both 'targetm.goacc.adjust_private_decl' and 'targetm.goacc.expand_var_decl' may need to do things, but that's currently not meant to be addressed, and thus not fully worked out and implemented, and thus untested. Hence, 'assert' what currently is implemented/tested, only. (Given that eventual goal, that's probably sufficient motivation to indeed add the 'adjusted_vars' handling in generic 'gcc/omp-offload.c' instead of moving it into the GCN back end?) For 'libgomp.oacc-c-c++-common/static-variable-1.c' that I've recently added, the code changes here cause execution test FAILs for nvptx offloading (because of making 'static' variables gang-private), and trigger an ICE with GCN offloading compilation. It isn't clear to me what the desired semantics are for (user-specified) 'static' variables -- see <https://github.com/OpenACC/openacc-spec/issues/372> "C/C++ 'static' variables" (only visible to members of the GitHub OpenACC organization) -- but an ICE clearly isn't the right answer. ;-) As for certain transformation/optimizations, 'static' variables may be synthesized in the GCC middle end, I suppose we should preserve the status quo (as documented via 'libgomp.oacc-c-c++-common/static-variable-1.c') until #372 gets resolved in OpenACC? (I suppose, skip the transformation if 'TREE_STATIC' is set, or similar.) A few individual comments (search for '[TS]'), for easy reference embedded in full-quote of the generic code changes. GCN and nvptx back end code changes to be found in <http://mid.mail-archive.com/d6ae43626eed9fd968250ee10109433e810d1048.1614342218.git.julian@codesourcery.com>, <http://mid.mail-archive.com/aab0a87b99797e1fcc73e7f3e76152405289805a.1614342218.git.julian@codesourcery.com>. > --- a/gcc/target.def > +++ b/gcc/target.def > @@ -1712,6 +1712,36 @@ for allocating any storage for reductions when necessary.", > void, (gcall *call), > default_goacc_reduction) > > +DEFHOOK > +(expand_var_decl, > +"This hook, if defined, is used by accelerator target back-ends to expand\n\ > +specially handled kinds of @code{VAR_DECL} expressions. A particular use is\n\ > +to place variables with specific attributes inside special accelarator\n\ > +memories. A return value of @code{NULL} indicates that the target does not\n\ > +handle this @code{VAR_DECL}, and normal RTL expanding is resumed.\n\ > +\n\ > +Only define this hook if your accelerator target needs to expand certain\n\ > +@code{VAR_DECL} nodes in a way that differs from the default. You can also adjust\n\ > +private variables at OpenACC device-lowering time using the\n\ > +@code{TARGET_GOACC_ADJUST_PRIVATE_DECL} target hook.", > +rtx, (tree var), > +NULL) > + > +DEFHOOK > +(adjust_private_decl, > +"This hook, if defined, is used by accelerator target back-ends to adjust\n\ > +OpenACC variable declarations that should be made private to the given\n\ > +parallelism level (i.e. @code{GOMP_DIM_GANG}, @code{GOMP_DIM_WORKER} or\n\ > +@code{GOMP_DIM_VECTOR}). A typical use for this hook is to force variable\n\ > +declarations at the @code{gang} level to reside in GPU shared memory, by\n\ > +setting the address space of the decl and making it static.\n\ > +\n\ > +You may also use the @code{TARGET_GOACC_EXPAND_VAR_DECL} hook if the\n\ > +adjusted variable declaration needs to be expanded to RTL in a non-standard\n\ > +way.", > +tree, (tree var, int level), > +NULL) > + > HOOK_VECTOR_END (goacc) > > /* Functions relating to vectorization. */ > --- a/gcc/doc/tm.texi > +++ b/gcc/doc/tm.texi > @@ -6227,6 +6227,32 @@ like @code{cond_add@var{m}}. The default implementation returns a zero > constant of type @var{type}. > @end deftypefn > > +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_VAR_DECL (tree @var{var}) > +This hook, if defined, is used by accelerator target back-ends to expand > +specially handled kinds of @code{VAR_DECL} expressions. A particular use is > +to place variables with specific attributes inside special accelarator > +memories. A return value of @code{NULL} indicates that the target does not > +handle this @code{VAR_DECL}, and normal RTL expanding is resumed. > + > +Only define this hook if your accelerator target needs to expand certain > +@code{VAR_DECL} nodes in a way that differs from the default. You can also adjust > +private variables at OpenACC device-lowering time using the > +@code{TARGET_GOACC_ADJUST_PRIVATE_DECL} target hook. > +@end deftypefn > + > +@deftypefn {Target Hook} tree TARGET_GOACC_ADJUST_PRIVATE_DECL (tree @var{var}, int @var{level}) > +This hook, if defined, is used by accelerator target back-ends to adjust > +OpenACC variable declarations that should be made private to the given > +parallelism level (i.e. @code{GOMP_DIM_GANG}, @code{GOMP_DIM_WORKER} or > +@code{GOMP_DIM_VECTOR}). A typical use for this hook is to force variable > +declarations at the @code{gang} level to reside in GPU shared memory, by > +setting the address space of the decl and making it static. > + > +You may also use the @code{TARGET_GOACC_EXPAND_VAR_DECL} hook if the > +adjusted variable declaration needs to be expanded to RTL in a non-standard > +way. > +@end deftypefn > + > @node Anchored Addresses > @section Anchored Addresses > @cindex anchored addresses > --- a/gcc/doc/tm.texi.in > +++ b/gcc/doc/tm.texi.in > @@ -4219,6 +4219,10 @@ address; but often a machine-dependent strategy can generate better code. > > @hook TARGET_PREFERRED_ELSE_VALUE > > +@hook TARGET_GOACC_EXPAND_VAR_DECL > + > +@hook TARGET_GOACC_ADJUST_PRIVATE_DECL > + > @node Anchored Addresses > @section Anchored Addresses > @cindex anchored addresses > --- a/gcc/expr.c > +++ b/gcc/expr.c > @@ -10224,8 +10224,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode, > exp = SSA_NAME_VAR (ssa_name); > goto expand_decl_rtl; > > - case PARM_DECL: > case VAR_DECL: > + /* Allow accel compiler to handle variables that require special > + treatment, e.g. if they have been modified in some way earlier in > + compilation by the adjust_private_decl OpenACC hook. */ > + if (flag_openacc && targetm.goacc.expand_var_decl) > + { > + temp = targetm.goacc.expand_var_decl (exp); > + if (temp) > + return temp; > + } > + /* ... fall through ... */ > + > + case PARM_DECL: [TS] Are we sure that we don't need the same handling for a 'PARM_DECL', too? (If yes, to document and verify that, should we thus again unify the two 'case's, and in 'targetm.goacc.expand_var_decl' add a 'gcc_checking_assert (TREE_CODE (var) == VAR_DECL')'?) Also, are we sure that all the following existing processing is not relevant to do before the 'return temp' (see above)? That's not a concern for GCN (which doesn't use 'targetm.goacc.expand_var_decl', and thus does execute all this following existing processing), but it is for nvptx (which does use 'targetm.goacc.expand_var_decl', and thus doesn't execute all this following existing processing if that returned something). Or, is 'targetm.goacc.expand_var_decl' conceptually and practically meant to implement all of the following processing, or is this for other reasons not relevant in the 'targetm.goacc.expand_var_decl' case: > /* If a static var's type was incomplete when the decl was written, > but the type is complete now, lay out the decl now. */ > if (DECL_SIZE (exp) == 0 | && COMPLETE_OR_UNBOUND_ARRAY_TYPE_P (TREE_TYPE (exp)) | && (TREE_STATIC (exp) || DECL_EXTERNAL (exp))) | layout_decl (exp, 0); | | /* fall through */ | | case FUNCTION_DECL: | case RESULT_DECL: | decl_rtl = DECL_RTL (exp); | expand_decl_rtl: | gcc_assert (decl_rtl); | | /* DECL_MODE might change when TYPE_MODE depends on attribute target | settings for VECTOR_TYPE_P that might switch for the function. */ | if (currently_expanding_to_rtl | && code == VAR_DECL && MEM_P (decl_rtl) | && VECTOR_TYPE_P (type) && exp && DECL_MODE (exp) != mode) | decl_rtl = change_address (decl_rtl, TYPE_MODE (type), 0); | else | decl_rtl = copy_rtx (decl_rtl); | | /* Record writes to register variables. */ | if (modifier == EXPAND_WRITE | && REG_P (decl_rtl) | && HARD_REGISTER_P (decl_rtl)) | add_to_hard_reg_set (&crtl->asm_clobbers, | GET_MODE (decl_rtl), REGNO (decl_rtl)); | | /* Ensure variable marked as used even if it doesn't go through | a parser. If it hasn't be used yet, write out an external | definition. */ | if (exp) | TREE_USED (exp) = 1; | | /* Show we haven't gotten RTL for this yet. */ | temp = 0; | | /* Variables inherited from containing functions should have | been lowered by this point. */ | if (exp) | context = decl_function_context (exp); | gcc_assert (!exp | || SCOPE_FILE_SCOPE_P (context) | || context == current_function_decl | || TREE_STATIC (exp) | || DECL_EXTERNAL (exp) | /* ??? C++ creates functions that are not TREE_STATIC. */ | || TREE_CODE (exp) == FUNCTION_DECL); | | /* This is the case of an array whose size is to be determined | from its initializer, while the initializer is still being parsed. | ??? We aren't parsing while expanding anymore. */ | | if (MEM_P (decl_rtl) && REG_P (XEXP (decl_rtl, 0))) | temp = validize_mem (decl_rtl); | | /* If DECL_RTL is memory, we are in the normal case and the | address is not valid, get the address into a register. */ | | else if (MEM_P (decl_rtl) && modifier != EXPAND_INITIALIZER) | { | if (alt_rtl) | *alt_rtl = decl_rtl; | decl_rtl = use_anchored_address (decl_rtl); | if (modifier != EXPAND_CONST_ADDRESS | && modifier != EXPAND_SUM | && !memory_address_addr_space_p (exp ? DECL_MODE (exp) | : GET_MODE (decl_rtl), | XEXP (decl_rtl, 0), | MEM_ADDR_SPACE (decl_rtl))) | temp = replace_equiv_address (decl_rtl, | copy_rtx (XEXP (decl_rtl, 0))); | } | | /* If we got something, return it. But first, set the alignment | if the address is a register. */ | if (temp != 0) | { | if (exp && MEM_P (temp) && REG_P (XEXP (temp, 0))) | mark_reg_pointer (XEXP (temp, 0), DECL_ALIGN (exp)); | } | else if (MEM_P (decl_rtl)) | temp = decl_rtl; | | if (temp != 0) | { | if (MEM_P (temp) | && modifier != EXPAND_WRITE | && modifier != EXPAND_MEMORY | && modifier != EXPAND_INITIALIZER | && modifier != EXPAND_CONST_ADDRESS | && modifier != EXPAND_SUM | && !inner_reference_p | && mode != BLKmode | && MEM_ALIGN (temp) < GET_MODE_ALIGNMENT (mode)) | temp = expand_misaligned_mem_ref (temp, mode, unsignedp, | MEM_ALIGN (temp), NULL_RTX, NULL); | | return temp; | } | [...] [TS] I don't understand that yet. :-| Instead of the current "early-return" handling: temp = targetm.goacc.expand_var_decl (exp); if (temp) return temp; ... should we maybe just set: DECL_RTL (exp) = targetm.goacc.expand_var_decl (exp) ... (or similar), and then let the usual processing continue? > --- a/gcc/internal-fn.c > +++ b/gcc/internal-fn.c > @@ -2957,6 +2957,8 @@ expand_UNIQUE (internal_fn, gcall *stmt) > else > gcc_unreachable (); > break; > + case IFN_UNIQUE_OACC_PRIVATE: > + break; > } > > if (pattern) > --- a/gcc/internal-fn.h > +++ b/gcc/internal-fn.h > @@ -36,7 +36,8 @@ along with GCC; see the file COPYING3. If not see > #define IFN_UNIQUE_CODES \ > DEF(UNSPEC), \ > DEF(OACC_FORK), DEF(OACC_JOIN), \ > - DEF(OACC_HEAD_MARK), DEF(OACC_TAIL_MARK) > + DEF(OACC_HEAD_MARK), DEF(OACC_TAIL_MARK), \ > + DEF(OACC_PRIVATE) > > enum ifn_unique_kind { > #define DEF(X) IFN_UNIQUE_##X > --- a/gcc/omp-low.c > +++ b/gcc/omp-low.c > @@ -171,6 +171,9 @@ struct omp_context > > /* True if there is bind clause on the construct (i.e. a loop construct). */ > bool loop_p; > + > + /* Addressable variable decls in this context. */ > + vec<tree> oacc_addressable_var_decls; > }; > > static splay_tree all_contexts; > @@ -7048,8 +7051,9 @@ lower_lastprivate_clauses (tree clauses, tree predicate, gimple_seq *body_p, > > static void > lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner, > - gcall *fork, gcall *join, gimple_seq *fork_seq, > - gimple_seq *join_seq, omp_context *ctx) > + gcall *fork, gcall *private_marker, gcall *join, > + gimple_seq *fork_seq, gimple_seq *join_seq, > + omp_context *ctx) > { > gimple_seq before_fork = NULL; > gimple_seq after_fork = NULL; > @@ -7253,6 +7257,8 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner, > > /* Now stitch things together. */ > gimple_seq_add_seq (fork_seq, before_fork); > + if (private_marker) > + gimple_seq_add_stmt (fork_seq, private_marker); > if (fork) > gimple_seq_add_stmt (fork_seq, fork); > gimple_seq_add_seq (fork_seq, after_fork); > @@ -7989,7 +7995,7 @@ lower_oacc_loop_marker (location_t loc, tree ddvar, bool head, > HEAD and TAIL. */ > > static void > -lower_oacc_head_tail (location_t loc, tree clauses, > +lower_oacc_head_tail (location_t loc, tree clauses, gcall *private_marker, > gimple_seq *head, gimple_seq *tail, omp_context *ctx) > { > bool inner = false; > @@ -7997,6 +8003,14 @@ lower_oacc_head_tail (location_t loc, tree clauses, > gimple_seq_add_stmt (head, gimple_build_assign (ddvar, integer_zero_node)); > > unsigned count = lower_oacc_head_mark (loc, ddvar, clauses, head, ctx); > + > + if (private_marker) > + { > + gimple_set_location (private_marker, loc); > + gimple_call_set_lhs (private_marker, ddvar); > + gimple_call_set_arg (private_marker, 1, ddvar); > + } > + > tree fork_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK); > tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN); > > @@ -8027,7 +8041,8 @@ lower_oacc_head_tail (location_t loc, tree clauses, > &join_seq); > > lower_oacc_reductions (loc, clauses, place, inner, > - fork, join, &fork_seq, &join_seq, ctx); > + fork, (count == 1) ? private_marker : NULL, > + join, &fork_seq, &join_seq, ctx); > > /* Append this level to head. */ > gimple_seq_add_seq (head, fork_seq); [TS] That looks good in principle. Via the testing mentioned above, I just want to make sure that this does all the expected things regarding differently nested loops and privatization levels. > @@ -9992,6 +10007,32 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p, > } > } > > +/* Record vars listed in private clauses in CLAUSES in CTX. This information > + is used to mark up variables that should be made private per-gang. */ > + > +static void > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) > +{ > + for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) > + if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE) > + { > + tree decl = OMP_CLAUSE_DECL (c); > + if (VAR_P (decl) && TREE_ADDRESSABLE (decl)) > + ctx->oacc_addressable_var_decls.safe_push (decl); > + } > +} > + > +/* Record addressable vars declared in BINDVARS in CTX. This information is > + used to mark up variables that should be made private per-gang. */ > + > +static void > +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars) > +{ > + for (tree v = bindvars; v; v = DECL_CHAIN (v)) > + if (VAR_P (v) && TREE_ADDRESSABLE (v)) > + ctx->oacc_addressable_var_decls.safe_push (v); > +} > + [TS] For these two, we'd add the 'TREE_ADDRESSABLE' rationale mentioned above. > /* Callback for walk_gimple_seq. Find #pragma omp scan statement. */ > > static tree > @@ -10821,6 +10862,57 @@ lower_omp_for_scan (gimple_seq *body_p, gimple_seq *dlist, gomp_for *stmt, > *dlist = new_dlist; > } > > +/* Build an internal UNIQUE function with type IFN_UNIQUE_OACC_PRIVATE listing > + the addresses of variables that should be made private at the surrounding > + parallelism level. Such functions appear in the gimple code stream in two > + forms, e.g. for a partitioned loop: > + > + .data_dep.6 = .UNIQUE (OACC_HEAD_MARK, .data_dep.6, 1, 68); > + .data_dep.6 = .UNIQUE (OACC_PRIVATE, .data_dep.6, -1, &w); > + .data_dep.6 = .UNIQUE (OACC_FORK, .data_dep.6, -1); > + .data_dep.6 = .UNIQUE (OACC_HEAD_MARK, .data_dep.6); > + > + or alternatively, OACC_PRIVATE can appear at the top level of a parallel, > + not as part of a HEAD_MARK sequence: > + > + .UNIQUE (OACC_PRIVATE, 0, 0, &w); > + > + For such stand-alone appearances, the 3rd argument is always 0, denoting > + gang partitioning. */ > + > +static gcall * > +make_oacc_private_marker (omp_context *ctx) > +{ > + int i; > + tree decl; > + > + if (ctx->oacc_addressable_var_decls.length () == 0) > + return NULL; > + > + auto_vec<tree, 5> args; > + > + args.quick_push (build_int_cst (integer_type_node, IFN_UNIQUE_OACC_PRIVATE)); > + args.quick_push (integer_zero_node); > + args.quick_push (integer_minus_one_node); > + > + FOR_EACH_VEC_ELT (ctx->oacc_addressable_var_decls, i, decl) > + { > + for (omp_context *thisctx = ctx; thisctx; thisctx = thisctx->outer) > + { > + tree inner_decl = maybe_lookup_decl (decl, thisctx); > + if (inner_decl) > + { > + decl = inner_decl; > + break; > + } > + } > + tree addr = build_fold_addr_expr (decl); > + args.safe_push (addr); > + } > + > + return gimple_build_call_internal_vec (IFN_UNIQUE, args); > +} > + > /* Lower code for an OMP loop directive. */ > > static void > @@ -10837,6 +10929,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) > > push_gimplify_context (); > > + oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt)); > + > lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx); > > block = make_node (BLOCK); > @@ -10855,6 +10949,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) > gbind *inner_bind > = as_a <gbind *> (gimple_seq_first_stmt (omp_for_body)); > tree vars = gimple_bind_vars (inner_bind); > + if (is_gimple_omp_oacc (ctx->stmt)) > + oacc_record_vars_in_bind (ctx, vars); > gimple_bind_append_vars (new_stmt, vars); > /* bind_vars/BLOCK_VARS are being moved to new_stmt/block, don't > keep them on the inner_bind and it's block. */ > @@ -10968,6 +11064,11 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) > > lower_omp (gimple_omp_body_ptr (stmt), ctx); > > + gcall *private_marker = NULL; > + if (is_gimple_omp_oacc (ctx->stmt) > + && !gimple_seq_empty_p (omp_for_body)) > + private_marker = make_oacc_private_marker (ctx); > + > /* Lower the header expressions. At this point, we can assume that > the header is of the form: > > @@ -11022,7 +11123,7 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) > if (is_gimple_omp_oacc (ctx->stmt) > && !ctx_in_oacc_kernels_region (ctx)) > lower_oacc_head_tail (gimple_location (stmt), > - gimple_omp_for_clauses (stmt), > + gimple_omp_for_clauses (stmt), private_marker, > &oacc_head, &oacc_tail, ctx); > > /* Add OpenACC partitioning and reduction markers just before the loop. */ > @@ -13019,8 +13120,14 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) > them as a dummy GANG loop. */ > tree level = build_int_cst (integer_type_node, GOMP_DIM_GANG); > > + gcall *private_marker = make_oacc_private_marker (ctx); > + > + if (private_marker) > + gimple_call_set_arg (private_marker, 2, level); > + > lower_oacc_reductions (gimple_location (ctx->stmt), clauses, level, > - false, NULL, NULL, &fork_seq, &join_seq, ctx); > + false, NULL, private_marker, NULL, &fork_seq, > + &join_seq, ctx); > } > > gimple_seq_add_seq (&new_body, fork_seq); > @@ -13262,6 +13369,9 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx) > ctx); > break; > case GIMPLE_BIND: > + if (ctx && is_gimple_omp_oacc (ctx->stmt)) > + oacc_record_vars_in_bind (ctx, > + gimple_bind_vars (as_a <gbind *> (stmt))); > lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx); > maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt)); > break; [TS] I have not yet verified whether these lowering case are sufficient to also handle the <https://gcc.gnu.org/PR90114> "Predetermined private levels for variables declared in OpenACC accelerator routines" case. (If yes, then that needs testcases, too, if not, then need to add a TODO note, for later.) > --- a/gcc/omp-offload.c > +++ b/gcc/omp-offload.c > @@ -53,6 +53,7 @@ along with GCC; see the file COPYING3. If not see > #include "attribs.h" > #include "cfgloop.h" > #include "context.h" > +#include "convert.h" > > /* Describe the OpenACC looping structure of a function. The entire > function is held in a 'NULL' loop. */ > @@ -1356,7 +1357,9 @@ oacc_loop_xform_head_tail (gcall *from, int level) > = ((enum ifn_unique_kind) > TREE_INT_CST_LOW (gimple_call_arg (stmt, 0))); > > - if (k == IFN_UNIQUE_OACC_FORK || k == IFN_UNIQUE_OACC_JOIN) > + if (k == IFN_UNIQUE_OACC_FORK > + || k == IFN_UNIQUE_OACC_JOIN > + || k == IFN_UNIQUE_OACC_PRIVATE) > *gimple_call_arg_ptr (stmt, 2) = replacement; > else if (k == kind && stmt != from) > break; > @@ -1773,6 +1776,136 @@ default_goacc_reduction (gcall *call) > gsi_replace_with_seq (&gsi, seq, true); > } > > +struct var_decl_rewrite_info > +{ > + gimple *stmt; > + hash_map<tree, tree> *adjusted_vars; > + bool avoid_pointer_conversion; > + bool modified; > +}; > + > +/* Helper function for execute_oacc_device_lower. Rewrite VAR_DECLs (by > + themselves or wrapped in various other nodes) according to ADJUSTED_VARS in > + the var_decl_rewrite_info pointed to via DATA. Used as part of coercing > + gang-private variables in OpenACC offload regions to reside in GPU shared > + memory. */ > + > +static tree > +oacc_rewrite_var_decl (tree *tp, int *walk_subtrees, void *data) > +{ > + walk_stmt_info *wi = (walk_stmt_info *) data; > + var_decl_rewrite_info *info = (var_decl_rewrite_info *) wi->info; > + > + if (TREE_CODE (*tp) == ADDR_EXPR) > + { > + tree arg = TREE_OPERAND (*tp, 0); > + tree *new_arg = info->adjusted_vars->get (arg); > + > + if (new_arg) > + { > + if (info->avoid_pointer_conversion) > + { > + *tp = build_fold_addr_expr (*new_arg); > + info->modified = true; > + *walk_subtrees = 0; > + } > + else > + { > + gimple_stmt_iterator gsi = gsi_for_stmt (info->stmt); > + tree repl = build_fold_addr_expr (*new_arg); > + gimple *stmt1 > + = gimple_build_assign (make_ssa_name (TREE_TYPE (repl)), repl); > + tree conv = convert_to_pointer (TREE_TYPE (*tp), > + gimple_assign_lhs (stmt1)); > + gimple *stmt2 > + = gimple_build_assign (make_ssa_name (TREE_TYPE (*tp)), conv); > + gsi_insert_before (&gsi, stmt1, GSI_SAME_STMT); > + gsi_insert_before (&gsi, stmt2, GSI_SAME_STMT); > + *tp = gimple_assign_lhs (stmt2); > + info->modified = true; > + *walk_subtrees = 0; > + } > + } > + } > + else if (TREE_CODE (*tp) == COMPONENT_REF || TREE_CODE (*tp) == ARRAY_REF) > + { > + tree *base = &TREE_OPERAND (*tp, 0); > + > + while (TREE_CODE (*base) == COMPONENT_REF > + || TREE_CODE (*base) == ARRAY_REF) > + base = &TREE_OPERAND (*base, 0); > + > + if (TREE_CODE (*base) != VAR_DECL) > + return NULL; > + > + tree *new_decl = info->adjusted_vars->get (*base); > + if (!new_decl) > + return NULL; > + > + int base_quals = TYPE_QUALS (TREE_TYPE (*new_decl)); > + tree field = TREE_OPERAND (*tp, 1); > + > + /* Adjust the type of the field. */ > + int field_quals = TYPE_QUALS (TREE_TYPE (field)); > + if (TREE_CODE (field) == FIELD_DECL && field_quals != base_quals) > + { > + tree *field_type = &TREE_TYPE (field); > + while (TREE_CODE (*field_type) == ARRAY_TYPE) > + field_type = &TREE_TYPE (*field_type); > + field_quals |= base_quals; > + *field_type = build_qualified_type (*field_type, field_quals); > + } > + > + /* Adjust the type of the component ref itself. */ > + tree comp_type = TREE_TYPE (*tp); > + int comp_quals = TYPE_QUALS (comp_type); > + if (TREE_CODE (*tp) == COMPONENT_REF && comp_quals != base_quals) > + { > + comp_quals |= base_quals; > + TREE_TYPE (*tp) > + = build_qualified_type (comp_type, comp_quals); > + } > + > + *base = *new_decl; > + info->modified = true; > + } > + else if (TREE_CODE (*tp) == VAR_DECL) > + { > + tree *new_decl = info->adjusted_vars->get (*tp); > + if (new_decl) > + { > + *tp = *new_decl; > + info->modified = true; > + } > + } > + > + return NULL_TREE; > +} > + > +/* Return TRUE if CALL is a call to a builtin atomic/sync operation. */ > + > +static bool > +is_sync_builtin_call (gcall *call) > +{ > + tree callee = gimple_call_fndecl (call); > + > + if (callee != NULL_TREE > + && gimple_call_builtin_p (call, BUILT_IN_NORMAL)) > + switch (DECL_FUNCTION_CODE (callee)) > + { > +#undef DEF_SYNC_BUILTIN > +#define DEF_SYNC_BUILTIN(ENUM, NAME, TYPE, ATTRS) case ENUM: > +#include "sync-builtins.def" > +#undef DEF_SYNC_BUILTIN > + return true; > + > + default: > + ; > + } > + > + return false; > +} > + > /* Main entry point for oacc transformations which run on the device > compiler after LTO, so we know what the target device is at this > point (including the host fallback). */ > @@ -1922,6 +2055,8 @@ execute_oacc_device_lower () > dominance information to update SSA. */ > calculate_dominance_info (CDI_DOMINATORS); > > + hash_map<tree, tree> adjusted_vars; > + > /* Now lower internal loop functions to target-specific code > sequences. */ > basic_block bb; > @@ -1998,6 +2133,45 @@ execute_oacc_device_lower () > case IFN_UNIQUE_OACC_TAIL_MARK: > remove = true; > break; > + > + case IFN_UNIQUE_OACC_PRIVATE: > + { > + HOST_WIDE_INT level > + = TREE_INT_CST_LOW (gimple_call_arg (call, 2)); > + if (level == -1) > + break; > + for (unsigned i = 3; > + i < gimple_call_num_args (call); > + i++) > + { > + tree arg = gimple_call_arg (call, i); > + gcc_assert (TREE_CODE (arg) == ADDR_EXPR); > + tree decl = TREE_OPERAND (arg, 0); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + static char const *const axes[] = > + /* Must be kept in sync with GOMP_DIM > + enumeration. */ > + { "gang", "worker", "vector" }; > + fprintf (dump_file, "Decl UID %u has %s " > + "partitioning:", DECL_UID (decl), > + axes[level]); > + print_generic_decl (dump_file, decl, TDF_SLIM); > + fputc ('\n', dump_file); > + } > + if (targetm.goacc.adjust_private_decl) > + { > + tree oldtype = TREE_TYPE (decl); > + tree newdecl > + = targetm.goacc.adjust_private_decl (decl, level); > + if (TREE_TYPE (newdecl) != oldtype > + || newdecl != decl) > + adjusted_vars.put (decl, newdecl); > + } > + } > + remove = true; > + } > + break; > } > break; > } > @@ -2029,6 +2203,55 @@ execute_oacc_device_lower () > gsi_next (&gsi); > } > > + /* Make adjustments to gang-private local variables if required by the > + target, e.g. forcing them into a particular address space. Afterwards, > + ADDR_EXPR nodes which have adjusted variables as their argument need to > + be modified in one of two ways: > + > + 1. They can be recreated, making a pointer to the variable in the new > + address space, or > + > + 2. The address of the variable in the new address space can be taken, > + converted to the default (original) address space, and the result of > + that conversion subsituted in place of the original ADDR_EXPR node. > + > + Which of these is done depends on the gimple statement being processed. > + At present atomic operations and inline asms use (1), and everything else > + uses (2). At least on AMD GCN, there are atomic operations that work > + directly in the LDS address space. > + > + COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also rewritten to use > + the new decl, adjusting types of appropriate tree nodes as necessary. */ [TS] As I understand, this is only relevant for GCN offloading, but not nvptx, and I'll trust that these two variants make sense from a GCN point of view (which I cannot verify easily). > + > + if (targetm.goacc.adjust_private_decl) > + { > + FOR_ALL_BB_FN (bb, cfun) > + for (gimple_stmt_iterator gsi = gsi_start_bb (bb); > + !gsi_end_p (gsi); > + gsi_next (&gsi)) > + { > + gimple *stmt = gsi_stmt (gsi); > + walk_stmt_info wi; > + var_decl_rewrite_info info; > + > + info.avoid_pointer_conversion > + = (is_gimple_call (stmt) > + && is_sync_builtin_call (as_a <gcall *> (stmt))) > + || gimple_code (stmt) == GIMPLE_ASM; > + info.stmt = stmt; > + info.modified = false; > + info.adjusted_vars = &adjusted_vars; > + > + memset (&wi, 0, sizeof (wi)); > + wi.info = &info; > + > + walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi); > + > + if (info.modified) > + update_stmt (stmt); > + } > + } > + > free_oacc_loop (loops); > > return 0; [TS] As disucssed above, maybe can completely skip the 'adjusted_vars' rewriting for nvptx offloading? > --- /dev/null > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c [TS] Without any code changes, this one FAILs (as expected) with nvptx offloading, but with GCN offloading, it already PASSes. > @@ -0,0 +1,38 @@ > +#include <assert.h> > + > +int main (void) > +{ > + int ret; > + > + #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret) > + { > + int w = 0; > + > + #pragma acc loop worker > + for (int i = 0; i < 32; i++) > + { > + #pragma acc atomic update > + w++; > + } > + > + ret = (w == 32); > + } > + assert (ret); > + > + #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret) > + { > + int v = 0; > + > + #pragma acc loop vector > + for (int i = 0; i < 32; i++) > + { > + #pragma acc atomic update > + v++; > + } > + > + ret = (v == 32); > + } > + assert (ret); > + > + return 0; > +} > --- /dev/null > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c [TS] Both with nvptx and GCN offloading, that one already PASSes without any code changes. > @@ -0,0 +1,95 @@ > +#include <stdio.h> > +#include <openacc.h> > +#include <alloca.h> > +#include <string.h> > +#include <gomp-constants.h> > +#include <stdlib.h> > + > +#if 0 > +#define DEBUG(DIM, IDX, VAL) \ > + fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL)) > +#else > +#define DEBUG(DIM, IDX, VAL) > +#endif > + > +#define N (32*32*32) > + > +int > +check (const char *dim, int *dist, int dimsize) > +{ > + int ix; > + int exit = 0; > + > + for (ix = 0; ix < dimsize; ix++) > + { > + DEBUG(dim, ix, dist[ix]); > + if (dist[ix] < (N) / (dimsize + 0.5) > + || dist[ix] > (N) / (dimsize - 0.5)) > + { > + fprintf (stderr, "did not distribute to %ss (%d not between %d " > + "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)), > + (int) ((N) / (dimsize - 0.5))); > + exit |= 1; > + } > + } > + > + return exit; > +} > + > +int main () > +{ > + int ary[N]; > + int ix; > + int exit = 0; > + int gangsize = 0, workersize = 0, vectorsize = 0; > + int *gangdist, *workerdist, *vectordist; > + > + for (ix = 0; ix < N;ix++) > + ary[ix] = -1; > + > +#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \ > + copy(ary) copyout(gangsize, workersize, vectorsize) > + { > +#pragma acc loop gang worker vector > + for (unsigned ix = 0; ix < N; ix++) > + { > + int g, w, v; > + > + g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG); > + w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER); > + v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR); > + > + ary[ix] = (g << 16) | (w << 8) | v; > + } > + > + gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG); > + workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER); > + vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR); > + } > + > + gangdist = (int *) alloca (gangsize * sizeof (int)); > + workerdist = (int *) alloca (workersize * sizeof (int)); > + vectordist = (int *) alloca (vectorsize * sizeof (int)); > + memset (gangdist, 0, gangsize * sizeof (int)); > + memset (workerdist, 0, workersize * sizeof (int)); > + memset (vectordist, 0, vectorsize * sizeof (int)); > + > + /* Test that work is shared approximately equally amongst each active > + gang/worker/vector. */ > + for (ix = 0; ix < N; ix++) > + { > + int g = (ary[ix] >> 16) & 255; > + int w = (ary[ix] >> 8) & 255; > + int v = ary[ix] & 255; > + > + gangdist[g]++; > + workerdist[w]++; > + vectordist[v]++; > + } > + > + exit = check ("gang", gangdist, gangsize); > + exit |= check ("worker", workerdist, workersize); > + exit |= check ("vector", vectordist, vectorsize); > + > + return exit; > +} > --- /dev/null > +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 [TS] This one does show the expected behavior: FAILs without code changes, PASSes with code changes as posted. > @@ -0,0 +1,25 @@ > +! Test for "oacc gangprivate" attribute on gang-private variables > + > +! { dg-do run } > +! { dg-additional-options "-fdump-tree-oaccdevlow-details -w" } > + > +program main > + integer :: w, arr(0:31) > + > + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) > + !$acc loop gang private(w) > +! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has gang partitioning: integer\\(kind=4\\) w;" 1 "oaccdevlow" } } */ > + do j = 0, 31 > + w = 0 > + !$acc loop seq > + do i = 0, 31 > + !$acc atomic update > + w = w + 1 > + !$acc end atomic > + end do > + arr(j) = w > + end do > + !$acc end parallel > + > + if (any (arr .ne. 32)) stop 1 > +end program main > --- /dev/null > +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90 [TS] With code changes as posted, this one FAILs for nvptx offloading execution. (... for all but the Nvidia Titan V GPU in my set of testing configurations, huh?) > @@ -0,0 +1,25 @@ > +! Test for worker-private variables > + > +! { dg-do run } > +! { dg-additional-options "-fdump-tree-oaccdevlow-details" } > + > +program main > + integer :: w, arr(0:31) > + > + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) > + !$acc loop gang worker private(w) > +! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has worker partitioning: integer\\(kind=4\\) w;" 1 "oaccdevlow" } } */ > + do j = 0, 31 > + w = 0 > + !$acc loop seq > + do i = 0, 31 > + !$acc atomic update > + w = w + 1 > + !$acc end atomic > + end do > + arr(j) = w > + end do > + !$acc end parallel > + > + if (any (arr .ne. 32)) stop 1 > +end program main [TS] So we'll have to verify whether these are sufficiently testing what they're meant to be testing, and fix up as necessary. Grüße Thomas ----------------- Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
On 15/04/2021 18:26, Thomas Schwinge wrote: >> and optimisation, since shared memory might be faster than >> the main memory on a GPU. > > Do we potentially have a problem that making more use of (scarce) > gang-private memory may negatively affect peformance, because potentially > fewer OpenACC gangs may then be launched to the GPU hardware in parallel? > (Of course, OpenACC semantics conformance firstly is more important than > performance, but there may be ways to be conformant and performant; > "quality of implementation".) Have you run any such performance testing > with the benchmarking codes that we've got set up? > > (As I'm more familiar with that, I'm using nvptx offloading examples in > the following, whilst assuming that similar discussion may apply for GCN > offloading, which uses similar hardware concepts, as far as I remember.) Yes, that could happen. However, there's space for quite a lot of scalars before performance is affected: 64KB of LDS memory shared by a hardware-defined maximum of 40 threads gives about 1.5KB of space for worker-reduction variables and gang-private variables. We might have a problem if there are large private arrays. I believe we have a "good enough" solution for the usual case, and a v2.0 full solution is going to be big and hairy enough for a whole patch of it's own (requiring per-gang dynamic allocation, a different memory address space and possibly different instruction selection too). Andrew
Hi! On 2021-04-16T17:05:24+0100, Andrew Stubbs <ams@codesourcery.com> wrote: > On 15/04/2021 18:26, Thomas Schwinge wrote: >>> and optimisation, since shared memory might be faster than >>> the main memory on a GPU. >> >> Do we potentially have a problem that making more use of (scarce) >> gang-private memory may negatively affect peformance, because potentially >> fewer OpenACC gangs may then be launched to the GPU hardware in parallel? >> (Of course, OpenACC semantics conformance firstly is more important than >> performance, but there may be ways to be conformant and performant; >> "quality of implementation".) Have you run any such performance testing >> with the benchmarking codes that we've got set up? >> >> (As I'm more familiar with that, I'm using nvptx offloading examples in >> the following, whilst assuming that similar discussion may apply for GCN >> offloading, which uses similar hardware concepts, as far as I remember.) > > Yes, that could happen. Thanks for sharing the GCN perspective. > However, there's space for quite a lot of > scalars before performance is affected: 64KB of LDS memory shared by a > hardware-defined maximum of 40 threads (Instead of threads, something like thread blocks, I suppose?) > gives about 1.5KB of space for > worker-reduction variables and gang-private variables. PTX, as I understand this, may generally have a lot of Thread Blocks in flight: all for the same GPU kernel as well as any GPU kernels running asynchronously/generally concurrently (system-wide), and libgomp does try launching a high number of Thread Blocks ('num_gangs') (for purposes of hiding memory access latency?). Random example: nvptx_exec: kernel t0_r$_omp_fn$0: launch gangs=1920, workers=32, vectors=32 With that, PTX's 48 KiB of '.shared' memory per SM (processor) are then not so much anymore: just '48 * 1024 / 1920 = 25' bytes of gang-private memory available for each of the 1920 gangs: 'double x, y, z'? (... for the simple case where just one GPU kernel is executing.) (I suppose that calculation is valid for a GPU hardware variant where there is just one SM. If there are several (typically in the order of a few dozens?), I suppose the Thread Blocks launched will be distributed over all these, thus improving the situation correspondingly.) (And of course, there are certainly other factors that also limit the number of Thread Blocks that are actually executing in parallel.) > We might have a > problem if there are large private arrays. Yes, that's understood. Also, directly related, the problem that comes with supporting worker-private memory, which basically calculates to the amount necessary for gang-private memory multiplied by the number of workers? (Out of scope at present.) > I believe we have a "good enough" solution for the usual case So you believe that. ;-) It's certainly what I'd hope, too! But we don't know yet whether there's any noticeable performance impact if we run with (potentially) lesser parallelism, hence my question whether this patch has been run through performance testing. > and a > v2.0 full solution is going to be big and hairy enough for a whole patch > of it's own (requiring per-gang dynamic allocation, a different memory > address space and possibly different instruction selection too). Agree that a fully dynamic allocation scheme likely is going to be ugly, so I'd certainly like to avoid that. Before attempting that, we'd first try to optimize gang-private memory allocation: so that it's function-local (and thus GPU kernel-local) instead of device-global (assuming that's indeed possible), and try not using gang-private memory in cases where it's not actually necessary (semantically not observable, and not necessary for performance reasons). Grüße Thomas ----------------- Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
On 16/04/2021 18:30, Thomas Schwinge wrote: > Hi! > > On 2021-04-16T17:05:24+0100, Andrew Stubbs <ams@codesourcery.com> wrote: >> On 15/04/2021 18:26, Thomas Schwinge wrote: >>>> and optimisation, since shared memory might be faster than >>>> the main memory on a GPU. >>> >>> Do we potentially have a problem that making more use of (scarce) >>> gang-private memory may negatively affect peformance, because potentially >>> fewer OpenACC gangs may then be launched to the GPU hardware in parallel? >>> (Of course, OpenACC semantics conformance firstly is more important than >>> performance, but there may be ways to be conformant and performant; >>> "quality of implementation".) Have you run any such performance testing >>> with the benchmarking codes that we've got set up? >>> >>> (As I'm more familiar with that, I'm using nvptx offloading examples in >>> the following, whilst assuming that similar discussion may apply for GCN >>> offloading, which uses similar hardware concepts, as far as I remember.) >> >> Yes, that could happen. > > Thanks for sharing the GCN perspective. > >> However, there's space for quite a lot of >> scalars before performance is affected: 64KB of LDS memory shared by a >> hardware-defined maximum of 40 threads > > (Instead of threads, something like thread blocks, I suppose?) Workers. Wavefronts. The terminology is so confusing for these cases! They look like CPU threads running SIMD instructions, at least on GCN. OpenMP calls them threads. Each GCN compute unit can run up to 40 of them. A gang can have up to 16 workers (in AMD terminology, a work group can have up 16 wavefronts), so each compute unit will usually have at least two gangs, meaning each gang would get 32KB local memory. If there are no worker loops then you get 40 gangs (of one worker each) per compute unit, hence the minimum of 1.5KB per gang. The local memory is specific to the compute unit and gangs launched there will stay there until they're done, so the 40 gangs really is the limit for memory division. If you launch more gangs than there are resources then they get queued, so the memory doesn't get divided any more. >> gives about 1.5KB of space for >> worker-reduction variables and gang-private variables. > > PTX, as I understand this, may generally have a lot of Thread Blocks in > flight: all for the same GPU kernel as well as any GPU kernels running > asynchronously/generally concurrently (system-wide), and libgomp does try > launching a high number of Thread Blocks ('num_gangs') (for purposes of > hiding memory access latency?). Random example: > > nvptx_exec: kernel t0_r$_omp_fn$0: launch gangs=1920, workers=32, vectors=32 > > With that, PTX's 48 KiB of '.shared' memory per SM (processor) are then > not so much anymore: just '48 * 1024 / 1920 = 25' bytes of gang-private > memory available for each of the 1920 gangs: 'double x, y, z'? (... for > the simple case where just one GPU kernel is executing.) Your maths feels way off to me. That's not enough memory for any use, and it's not the only resource that will be stretched thin: how many GPU registers does an SM have? (I doubt that register contents are getting paged in and out.) For comparison, with the maximum num_workers(16) GCN can run only 2 gangs on each compute unit. Each compute unit can run 40 gangs simultaneously with num_workers(1), but that is the limit. If you launch more gangs than that then they are queued; even if you launch 100,000 single-worker gangs, each one will still get 1/40th of the resources. I doubt that NVPTX is magically running 1920 gangs of 32 workers on one SM without any queueing and with the gang resources split 1920 ways (and the worker resources split 61440 ways). > (I suppose that calculation is valid for a GPU hardware variant where > there is just one SM. If there are several (typically in the order of a > few dozens?), I suppose the Thread Blocks launched will be distributed > over all these, thus improving the situation correspondingly.) > > (And of course, there are certainly other factors that also limit the > number of Thread Blocks that are actually executing in parallel.) > >> We might have a >> problem if there are large private arrays. > > Yes, that's understood. > > Also, directly related, the problem that comes with supporting > worker-private memory, which basically calculates to the amount necessary > for gang-private memory multiplied by the number of workers? (Out of > scope at present.) GCN just uses the stack space for that, which lives in main memory. That's limited resource, of course, but it's not architectural. I don't know what NVPTX does here. >> I believe we have a "good enough" solution for the usual case > > So you believe that. ;-) > > It's certainly what I'd hope, too! But we don't know yet whether there's > any noticeable performance impact if we run with (potentially) lesser > parallelism, hence my question whether this patch has been run through > performance testing. Well, indeed I don't know the comparative situation with benchmark results because the benchmarks couldn't run at full occupancy, on GCN, without it. The purpose of this patch was precisely to allow us to reduce the local memory allocation enough to increase occupancy for benchmarks that don't use worker loops. >> and a >> v2.0 full solution is going to be big and hairy enough for a whole patch >> of it's own (requiring per-gang dynamic allocation, a different memory >> address space and possibly different instruction selection too). > > Agree that a fully dynamic allocation scheme likely is going to be ugly, > so I'd certainly like to avoid that. > > Before attempting that, we'd first try to optimize gang-private memory > allocation: so that it's function-local (and thus GPU kernel-local) > instead of device-global (assuming that's indeed possible), and try not > using gang-private memory in cases where it's not actually necessary > (semantically not observable, and not necessary for performance reasons). Global layout isn't ideal, but I don't know how we know how much to reserve otherwise? I suppose one would set the shared gang memory up as a stack, complete with a stack pointer in the ABI, which would allow recursion etc., but that would have other issues. Andrew
Hi! On 2021-04-18T23:53:01+0100, Andrew Stubbs <ams@codesourcery.com> wrote: > On 16/04/2021 18:30, Thomas Schwinge wrote: >> On 2021-04-16T17:05:24+0100, Andrew Stubbs <ams@codesourcery.com> wrote: >>> On 15/04/2021 18:26, Thomas Schwinge wrote: >>>>> and optimisation, since shared memory might be faster than >>>>> the main memory on a GPU. >>>> >>>> Do we potentially have a problem that making more use of (scarce) >>>> gang-private memory may negatively affect peformance, because potentially >>>> fewer OpenACC gangs may then be launched to the GPU hardware in parallel? >>>> (Of course, OpenACC semantics conformance firstly is more important than >>>> performance, but there may be ways to be conformant and performant; >>>> "quality of implementation".) Have you run any such performance testing >>>> with the benchmarking codes that we've got set up? >>>> >>>> (As I'm more familiar with that, I'm using nvptx offloading examples in >>>> the following, whilst assuming that similar discussion may apply for GCN >>>> offloading, which uses similar hardware concepts, as far as I remember.) >>> >>> Yes, that could happen. >> >> Thanks for sharing the GCN perspective. >> >>> However, there's space for quite a lot of >>> scalars before performance is affected: 64KB of LDS memory shared by a >>> hardware-defined maximum of 40 threads >> >> (Instead of threads, something like thread blocks, I suppose?) > > Workers. Wavefronts. (ACK.) > The terminology is so confusing for these cases! Absolutely! Everyone has their own, and slightly redefines meaning of certain words -- and then again uses different words for the same things/concepts... > They look like CPU threads running SIMD instructions, at least on GCN. > OpenMP calls them threads. Alright -- and in OpenACC (which is the context here), "a thread is any one vector lane of one worker of one gang" (that is, any element of a GCN SIMD instruction). > Each GCN compute unit can run up to 40 of them. A gang can have up to 16 > workers (in AMD terminology, a work group can have up 16 wavefronts), so > each compute unit will usually have at least two gangs, meaning each > gang would get 32KB local memory. If there are no worker loops then you > get 40 gangs (of one worker each) per compute unit, hence the minimum of > 1.5KB per gang. > > The local memory is specific to the compute unit and gangs launched > there will stay there until they're done, so the 40 gangs really is the > limit for memory division. If you launch more gangs than there are > resources then they get queued, so the memory doesn't get divided any more. > >>> gives about 1.5KB of space for >>> worker-reduction variables and gang-private variables. >> >> PTX, as I understand this, may generally have a lot of Thread Blocks in >> flight: all for the same GPU kernel as well as any GPU kernels running >> asynchronously/generally concurrently (system-wide), and libgomp does try >> launching a high number of Thread Blocks ('num_gangs') (for purposes of >> hiding memory access latency?). Random example: >> >> nvptx_exec: kernel t0_r$_omp_fn$0: launch gangs=1920, workers=32, vectors=32 >> >> With that, PTX's 48 KiB of '.shared' memory per SM (processor) are then >> not so much anymore: just '48 * 1024 / 1920 = 25' bytes of gang-private >> memory available for each of the 1920 gangs: 'double x, y, z'? (... for >> the simple case where just one GPU kernel is executing.) > > Your maths feels way off to me. That's not enough memory for any use, > and it's not the only resource that will be stretched thin: Might be way off, yes. I did mention "other [limiting] factors" later on, and: According to the documentation that I'd pointed to, CC 3.5 may have "Maximum number of resident blocks per SM": "16". (Aha, and if, for example, we assume there are 80 SMs, then libgomp launching 1920 gangs means '1920 / 80 = 24' Thread Blocks per SM -- which seems reasonable.) What I don't know is whether "resident" means scheduled/executing and the same applies to the '.shared' memory allocation -- or whether the two parts are separate (thus you can occupy '.shared' memory without having it used via execution). If we assume that allocation and execution are done in one, and there is no pre-emption once launched, that indeed simplifies the considerations quite some. We'd then have a decent '48 * 1024 / 16 = 3072' bytes of gang-private memory available for each of the 16 "resident" gangs (per SM). > how many GPU > registers does an SM have? "Number of 32-bit registers per SM": "64 K", and with "Maximum number of resident threads per SM": "2048", that means '64 K / 2048 = 32' registers in this configuration vs. "Maximum number of 32-bit registers per thread": "255" with correspondingly reduced occupancy. > (I doubt that register contents are getting > paged in and out.) (Again, I have not looked up to which extent Nvidia GPUs/Driver are doing any such things.) > For comparison, with the maximum num_workers(16) GCN can run only 2 > gangs on each compute unit. Each compute unit can run 40 gangs > simultaneously with num_workers(1), but that is the limit. If you launch > more gangs than that then they are queued; even if you launch 100,000 > single-worker gangs, each one will still get 1/40th of the resources. > > I doubt that NVPTX is magically running 1920 gangs of 32 workers on one > SM without any queueing and with the gang resources split 1920 ways (and > the worker resources split 61440 ways). No, indeed. As I'd said: >> (I suppose that calculation is valid for a GPU hardware variant where >> there is just one SM. If there are several (typically in the order of a >> few dozens?), I suppose the Thread Blocks launched will be distributed >> over all these, thus improving the situation correspondingly.) >> >> (And of course, there are certainly other factors that also limit the >> number of Thread Blocks that are actually executing in parallel.) >>> We might have a >>> problem if there are large private arrays. >> >> Yes, that's understood. >> >> Also, directly related, the problem that comes with supporting >> worker-private memory, which basically calculates to the amount necessary >> for gang-private memory multiplied by the number of workers? (Out of >> scope at present.) > > GCN just uses the stack space for that, which lives in main memory. > That's limited resource, of course, but it's not architectural. I don't > know what NVPTX does here. Per my understanding, neither GCN nor nvptx are supporting OpenACC worker-private memory yet. >>> I believe we have a "good enough" solution for the usual case >> >> So you believe that. ;-) >> >> It's certainly what I'd hope, too! But we don't know yet whether there's >> any noticeable performance impact if we run with (potentially) lesser >> parallelism, hence my question whether this patch has been run through >> performance testing. > > Well, indeed I don't know the comparative situation with benchmark > results because the benchmarks couldn't run at full occupancy, on GCN, > without it. The purpose of this patch was precisely to allow us to > reduce the local memory allocation enough to increase occupancy for > benchmarks that don't use worker loops. ACK, that's the GCN perspective. But for nvptx, we ought be careful to not regress existing functionality/performance. Plus, we all agree, the proposed code changes do improve certain aspects of OpenACC specification conformance: the concept of gang-private memory. >>> and a >>> v2.0 full solution is going to be big and hairy enough for a whole patch >>> of it's own (requiring per-gang dynamic allocation, a different memory >>> address space and possibly different instruction selection too). >> >> Agree that a fully dynamic allocation scheme likely is going to be ugly, >> so I'd certainly like to avoid that. >> >> Before attempting that, we'd first try to optimize gang-private memory >> allocation: so that it's function-local (and thus GPU kernel-local) >> instead of device-global (assuming that's indeed possible), and try not >> using gang-private memory in cases where it's not actually necessary >> (semantically not observable, and not necessary for performance reasons). > > Global layout isn't ideal, but I don't know how we know how much to > reserve otherwise? I suppose one would set the shared gang memory up as > a stack, complete with a stack pointer in the ABI, which would allow > recursion etc., but that would have other issues. Due to lack of in-depth knowledge, I haven't made an attempt to reason about how to implement that on GCN, but for nvptx there certainly is evidence of '.shared' memory allocation per function, building a complete call graph from the GPU kernel entry point onwards, and thus '.shared' memory allocation per each individual GPU kernel launch. (Yet, again, I'm totally fine to defer all these things for later -- unless the nvptx performance testing numbers mandate otherwise.) Grüße Thomas ----------------- Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
Hi, (Chung-Lin, question for you buried below.) On Thu, 15 Apr 2021 19:26:54 +0200 Thomas Schwinge <thomas@codesourcery.com> wrote: > Hi! > > On 2021-02-26T04:34:50-0800, Julian Brown <julian@codesourcery.com> > wrote: > > This patch > > Thanks, Julian, for your continued improving of these changes! You're welcome! > This has iterated through several conceptually different designs and > implementations, by several people, over the past several years. I hope this wasn't a hint that I'd failed to attribute the authorship of the patch properly? Many apologies if so, that certainly wasn't my intention! > > implements a method to track the "private-ness" of > > OpenACC variables declared in offload regions in gang-partitioned, > > worker-partitioned or vector-partitioned modes. Variables declared > > implicitly in scoped blocks and those declared "private" on > > enclosing directives (e.g. "acc parallel") are both handled. > > Variables that are e.g. gang-private can then be adjusted so they > > reside in GPU shared memory. > > > > The reason for doing this is twofold: correct implementation of > > OpenACC semantics > > ACK, and as mentioned before, this very much relates to > <https://gcc.gnu.org/PR90115> "OpenACC: predetermined private levels > for variables declared in blocks" (plus the corresponding use of > 'private' clauses, implicit/explicit, including 'firstprivate') and > <https://gcc.gnu.org/PR90114> "Predetermined private levels for > variables declared in OpenACC accelerator routines", which we thus > should refer in testcases/ChangeLog/commit log, as appropriate. I do > understand we're not yet addressing all of that (and that's fine!), > but we should capture remaining work items of the PRs and Cesar's > list in > <http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>), > as appropriate. From that list: > * Currently variables in private clauses inside acc loops will not > utilize shared memory. The patch should handle this properly now. > * OpenACC routines don't use shared memory, except for reductions and > worker state propagation. Routines weren't a focus of this patch (at the point I inherited it), and I did not attempt to extend it to cover routines either. TBH the state there is a bit of an unknown (but the patch won't make the situation any worse). > * Variables local to worker loops don't use shared memory. That's still true, and IIUC for that to work we'd need to expand scalars into indexed array references, (i.e. "var" -> "var_arr[vector_lane]" or similar). It's not clear if/when/why we'd want to do that. As an aside, if we want to avoid shared memory for some reason but want to maintain OpenACC semantics, we'd also have to do a similar transformation for gang-private variables ("var" -> "var[gang_number]", where the array is on the stack or in global memory, or similar). Then for worker-private variables we need to do "var" -> "var[gang_number * num_workers + worker_number]". We've avoided needing to do that so far, but for some cases -- maybe large local private arrays? -- it might be necessary, at some point. > * Variables local to automatically partitioned gang and worker loops > don't use shared memory. Local variables in automatically-partitioned gang loops should work fine now. > * Shared memory is allocated globally, not locally on a per-function > basis. We're not sure if that matters though. Arguably, that's down to the target, not this middle-end patch -- this patch itself might not *help* do per-function allocation, but it doesn't set a policy that allocation must be global either. > I was surprised that we didn't really have to fix up any existing > libgomp testcases, because there seem to be quite some that contain a > pattern (exemplified by the 'tmp' variable) as follows: > > int main() > { > #define N 123 > int data[N]; > int tmp; > > #pragma acc parallel // implicit 'firstprivate(tmp)' > { > // 'tmp' now conceptually made gang-private here. > #pragma acc loop gang > for (int i = 0; i < 123; ++i) > { > tmp = i + 234; > data[i] = tmp; > } > } > > for (int i = 0; i < 123; ++i) > if (data[i] != i + 234) > __builtin_abort (); > > return 0; > } > > With the code changes as posted, this actually now does *not* use > gang-private memory for 'tmp', but instead continues to use > "thread-private registers", as before. When "tmp" is a local, non-address-taken scalar like that, it'll probably end up in a register in offloaded code (or of course be compiled out completely), both before and after this patch. So I wouldn't expect this to not work in the pre-patch state. > Same for: > > --- s3.c 2021-04-13 17:26:49.628739379 +0200 > +++ s3_2.c 2021-04-13 17:29:43.484579664 +0200 > @@ -4,6 +4,6 @@ > int data[N]; > - int tmp; > > -#pragma acc parallel // implicit 'firstprivate(tmp)' > +#pragma acc parallel > { > + int tmp; > // 'tmp' now conceptually made gang-private here. > #pragma acc loop gang > > I suppose that's due to conditionalizing this transformation on > 'TREE_ADDRESSABLE' (as you're doing), so we should be mostly "safe" > regarding such existing testcases (but I haven't verified that yet in > detail). Right. > That needs to be documented in testcases, with some kind of dump > scanning (host compilation-side even; see below). > > A note for later: if this weren't just a 'gang' loop, but 'gang' plus > 'worker' and/or 'vector', we'd actually be fixing up user code with > undefined behavior into "correct" code (by *not* making 'tmp' > gang-private, but thread-private), right? Possibly -- coming up with a case like that might need a little "ingenuity"... > As that may not be obvious to the reader, I'd like to have the > 'TREE_ADDRESSABLE' conditionalization be documented in the code. You > had explained that in > <http://mid.mail-archive.com/20190612204216.0ec83e4e@squid.athome>: "a > non-addressable variable [...]". Yeah that probably makes sense. > > and optimisation, since shared memory might be faster than > > the main memory on a GPU. > > Do we potentially have a problem that making more use of (scarce) > gang-private memory may negatively affect peformance, because > potentially fewer OpenACC gangs may then be launched to the GPU > hardware in parallel? (Of course, OpenACC semantics conformance > firstly is more important than performance, but there may be ways to > be conformant and performant; "quality of implementation".) Have you > run any such performance testing with the benchmarking codes that > we've got set up? I don't have any numbers for this patch, no. As for the question as to whether there are constructs that are currently compiled in a semantically-correct way but that this patch pessimises -- I'm not aware of anything like that, but there might be. > (As I'm more familiar with that, I'm using nvptx offloading examples > in the following, whilst assuming that similar discussion may apply > for GCN offloading, which uses similar hardware concepts, as far as I > remember.) > > Looking at the existing > 'libgomp.oacc-c-c++-common/private-variables.c' (random example), for > nvptx offloading, '-O0', we see the following PTX JIT compilation > changes (word-'diff' of 'GOMP_DEBUG=1' at run-time): > > info : Function properties for 'local_g_1$_omp_fn$0': > info : used 27 registers, 32 stack, [-176-]{+256+} bytes smem, > 328 bytes cmem[0], 0 bytes lmem info : Function properties for > 'local_w_1$_omp_fn$0': info : used 40 registers, 48 stack, > [-176-]{+256+} bytes smem, 328 bytes cmem[0], 0 bytes lmem info : > Function properties for 'local_w_2$_omp_fn$0': [...] > info : Function properties for 'parallel_g_1$_omp_fn$0': > info : used 27 registers, 32 stack, [-176-]{+256+} bytes smem, > 328 bytes cmem[0], 0 bytes lmem info : Function properties for > 'parallel_g_2$_omp_fn$0': info : used 32 registers, 160 stack, > [-176-]{+256+} bytes smem, 328 bytes cmem[0], 0 bytes lmem > > ... that is, PTX '.shared' usage increases from 176 to 256 bytes for > *all* functions, even though only 'loop_g_4$_omp_fn$0' and > 'loop_g_5$_omp_fn$0' are actually using gang-private memory. > > Execution testing works before (original code, not using gang-private > memory) as well as after (code changes as posted, using gang-private > memory), so use on gang-private memory doesn't seem necessary here for > "correct execution" -- or at least: "expected execution result". ;-) > I haven't looked yet whether there's a potentional issue in the > testcases here. > > The additional '256 - 176 = 80' bytes of PTX '.shared' memory > requested are due to GCC nvptx back end implementation's use of a > global "Shared memory block for gang-private variables": > > // BEGIN VAR DEF: __oacc_bcast > .shared .align 8 .u8 __oacc_bcast[176]; > +// BEGIN VAR DEF: __gangprivate_shared > +.shared .align 32 .u8 __gangprivate_shared[64]; > > ..., plus (I suppose) an additional '80 - 64 = 16' padding/unused > bytes to establish '.align 32' after '.align 8' for '__oacc_bcast'. > > Per > <https://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities>, > "Table 15. Technical Specifications per Compute Capability", "Compute > Capability": "3.5", we have a "Maximum amount of shared memory per > SM": "48 KB", so with '176 bytes smem', that permits '48 * 1024 / 176 > = 279' thread blocks ('num_gangs') resident at one point in time, > whereas with '256 bytes smem', it's just '48 * 1024 / 256 = 192' > thread blocks resident at one point in time. (Not sure that I got > all the details right, but you get the idea/concern?) > > Anyway, that shall be OK for now, but we shall later look into > optimizing that; can't we have '.shared' local to the relevant PTX > functions instead of global? As mentioned in a previous posting (probably some time ago!) the NVPTX backend parts were a bit of the patch I inherited from the earliest versions of the patch, and didn't alter much. The possibility for function-local allocation has been raised before (for NVPTX), but I haven't investigated if it's possible or beneficial. > Interestingly, compiling with '-O2', we see: > > // BEGIN VAR DEF: __oacc_bcast > .shared .align 8 .u8 __oacc_bcast[144]; > {+// BEGIN VAR DEF: __gangprivate_shared+} > {+.shared .align 128 .u8 __gangprivate_shared[32];+} > > With '-O2', only 'loop_g_5$_omp_fn$0' is using gang-private memory, > and apparently the PTX JIT is able to figure that out from the PTX > code that GCC generates, and is then able to localize '.shared' > memory usage to just 'loop_g_5$_omp_fn$0': > > [...] > info : Function properties for 'loop_g_4$_omp_fn$0': > info : used 12 registers, 0 stack, 144 bytes smem, 328 bytes > cmem[0], 0 bytes lmem info : Function properties for > 'loop_g_5$_omp_fn$0': info : used [-30-]{+32+} registers, 32 > stack, [-144-]{+288+} bytes smem, 328 bytes cmem[0], 0 bytes lmem > info : Function properties for 'loop_g_6$_omp_fn$0': info : > used 13 registers, 0 stack, 144 bytes smem, 328 bytes cmem[0], 0 > bytes lmem [...] > > This strongly suggests to me that indeed there must exist a > programmatic way to get rid of the global "Shared memory block for > gang-private variables". > > The additional '288 - 144 = 144' bytes of PTX '.shared' memory > requested are 32 bytes for 'int x[8]' ('#pragma acc loop gang > private(x)') plus '288 - 32 - 144 = 112' padding/unused bytes to > establish '.align 128' (!) after '.align 8' for '__oacc_bcast'. > That's clearly not ideal: 112 bytes wasted in contrast to just '144 + > 32 = 176' bytes actually used. (I have not yet looked why/whether > this really needs '.align 128'.) I'm sure improvements are possible there (maybe later?). > I have not yet looked whether similar concerns exist for the GCC GCN > back end implementation. (That one also does set 'TREE_STATIC' for > gang-private memory, so it's a global allocation?) Yes, or rather per-CU allocation. > > Handling of private variables is intimately > > tied to the execution model for gangs/workers/vectors implemented by > > a particular target: for current targets, we use (or on mainline, > > will soon use) a broadcasting/neutering scheme. > > > > That is sufficient for code that e.g. sets a variable in > > worker-single mode and expects to use the value in > > worker-partitioned mode. The difficulty (semantics-wise) comes when > > the user wants to do something like an atomic operation in > > worker-partitioned mode and expects a worker-single (gang private) > > variable to be shared across each partitioned worker. Forcing use > > of shared memory for such variables makes that work properly. > > Are we reliably making sure that gang-private variables (and other > levels, in general) are not subject to the usual broadcasting scheme > (nvptx, at least), or does that currently work "by accident"? (I > haven't looked into that, yet.) Yes, that case is explicitly handled by the broadcasting/neutering patch recently posted. (One of the reasons that patch depends on this one.) > > In terms of implementation, the parallelism level of a given loop is > > not fixed until the oaccdevlow pass in the offload compiler, so the > > patch delays fixing the parallelism level of variables declared on > > or within such loops until the same point. This is done by adding a > > new internal UNIQUE function (OACC_PRIVATE) that lists (the address > > of) each private variable as an argument, and other arguments set > > so as to be able to determine the correct parallelism level to use > > for the listed variables. This new internal function fits into the > > existing scheme for demarcating OpenACC loops, as described in > > comments in the patch. > > Yes, thanks, that's conceptually now much better than the earlier > variants that we had. :-) (Hooray, again, for Nathan's OpenACC > execution model design!) > > What we should add, though, is a bunch of testcases to verify that the > expected processing does/doesn't happen for relevant source code > constructs. I'm thinking that when the transformation is/isn't done, > that gets logged, and we can then scan the dumps accordingly. Some of > that is implemented already; we should be able to do such scanning > generally for host compilation, too, not just offloading compilation. More test coverage is always welcome, of course. > > Two new target hooks are introduced: > > TARGET_GOACC_ADJUST_PRIVATE_DECL and TARGET_GOACC_EXPAND_VAR_DECL. > > The first can tweak a variable declaration at oaccdevlow time, and > > the second at expand time. The first or both of these target hooks > > can be used by a given offload target, depending on its strategy > > for implementing private variables. > > ACK. > > So, currently we're only looking at making the gang-private level > work. Regarding that, we have two configurations: (1) for GCN > offloading, 'targetm.goacc.adjust_private_decl' does the work (in > particular, change 'TREE_TYPE' etc.) and there is no > 'targetm.goacc.expand_var_decl', and (2) for nvptx offloading, > 'targetm.goacc.adjust_private_decl' only sets a marker ('oacc > gangprivate' attribute) and then 'targetm.goacc.expand_var_decl' does > the work. > > Therefore I suggest we clarify the (currently) expected handling > similar to: > > --- gcc/omp-offload.c > +++ gcc/omp-offload.c > @@ -1854,6 +1854,19 @@ oacc_rewrite_var_decl (tree *tp, int > *walk_subtrees, void *data) return NULL_TREE; > } > > +static tree > +oacc_rewrite_var_decl_ (tree *tp, int *walk_subtrees, void *data) > +{ > + tree t = oacc_rewrite_var_decl (tp, walk_subtrees, data); > + if (targetm.goacc.expand_var_decl) > + { > + walk_stmt_info *wi = (walk_stmt_info *) data; > + var_decl_rewrite_info *info = (var_decl_rewrite_info *) > wi->info; > + gcc_assert (!info->modified); > + } > + return t; > +} Why the ugly _ tail on the function name!? I don't think that's a typical GNU coding standards thing, is it? > + > /* Return TRUE if CALL is a call to a builtin atomic/sync > operation. */ > static bool > @@ -2195,6 +2208,9 @@ execute_oacc_device_lower () > COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also > rewritten to use the new decl, adjusting types of appropriate tree > nodes as necessary. */ > + if (targetm.goacc.expand_var_decl) > + gcc_assert (adjusted_vars.is_empty ()); If you like -- or do something like > if (targetm.goacc.adjust_private_decl) && !adjusted_vars.is_empty ()) perhaps. > { > FOR_ALL_BB_FN (bb, cfun) > @@ -2217,7 +2233,7 @@ execute_oacc_device_lower () > memset (&wi, 0, sizeof (wi)); > wi.info = &info; > > - walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi); > + walk_gimple_op (stmt, oacc_rewrite_var_decl_, &wi); > > if (info.modified) > update_stmt (stmt); > > Or, in fact, 'if (targetm.goacc.expand_var_decl)', skip the > 'adjusted_vars' handling completely? For the current pair of implementations, sure. I don't think it's necessary to set that as a constraint for future targets though? I guess it doesn't matter much until such a target exists. > I do understand that eventually (in particular, for worker-private > level?), both 'targetm.goacc.adjust_private_decl' and > 'targetm.goacc.expand_var_decl' may need to do things, but that's > currently not meant to be addressed, and thus not fully worked out and > implemented, and thus untested. Hence, 'assert' what currently is > implemented/tested, only. If you like, no strong feelings from me on that. > (Given that eventual goal, that's probably sufficient motivation to > indeed add the 'adjusted_vars' handling in generic 'gcc/omp-offload.c' > instead of moving it into the GCN back end?) I'm not sure what moving it to the GCN back end would look like. I guess it's a question of keeping the right abstractions in the right place. > For 'libgomp.oacc-c-c++-common/static-variable-1.c' that I've recently > added, the code changes here cause execution test FAILs for nvptx > offloading (because of making 'static' variables gang-private), and > trigger an ICE with GCN offloading compilation. It isn't clear to me > what the desired semantics are for (user-specified) 'static' > variables -- see <https://github.com/OpenACC/openacc-spec/issues/372> > "C/C++ 'static' variables" (only visible to members of the GitHub > OpenACC organization) -- but an ICE clearly isn't the right answer. > ;-) > > As for certain transformation/optimizations, 'static' variables may be > synthesized in the GCC middle end, I suppose we should preserve the > status quo (as documented via > 'libgomp.oacc-c-c++-common/static-variable-1.c') until #372 gets > resolved in OpenACC? (I suppose, skip the transformation if > 'TREE_STATIC' is set, or similar.) ICEs are bad -- but a user expecting static variables to do something meaningful in offloaded code is being somewhat optimistic, I think! > > --- a/gcc/expr.c > > +++ b/gcc/expr.c > > @@ -10224,8 +10224,19 @@ expand_expr_real_1 (tree exp, rtx target, > > machine_mode tmode, exp = SSA_NAME_VAR (ssa_name); > > goto expand_decl_rtl; > > > > - case PARM_DECL: > > case VAR_DECL: > > + /* Allow accel compiler to handle variables that require > > special > > + treatment, e.g. if they have been modified in some way > > earlier in > > + compilation by the adjust_private_decl OpenACC hook. */ > > + if (flag_openacc && targetm.goacc.expand_var_decl) > > + { > > + temp = targetm.goacc.expand_var_decl (exp); > > + if (temp) > > + return temp; > > + } > > + /* ... fall through ... */ > > + > > + case PARM_DECL: > > [TS] Are we sure that we don't need the same handling for a > 'PARM_DECL', too? (If yes, to document and verify that, should we > thus again unify the two 'case's, and in > 'targetm.goacc.expand_var_decl' add a 'gcc_checking_assert (TREE_CODE > (var) == VAR_DECL')'?) Maybe for routines? Those bits date from the earliest version of the patch and (same excuse again) I didn't have call to revisit those decisions. > Also, are we sure that all the following existing processing is not > relevant to do before the 'return temp' (see above)? That's not a > concern for GCN (which doesn't use 'targetm.goacc.expand_var_decl', > and thus does execute all this following existing processing), but it > is for nvptx (which does use 'targetm.goacc.expand_var_decl', and > thus doesn't execute all this following existing processing if that > returned something). Or, is 'targetm.goacc.expand_var_decl' > conceptually and practically meant to implement all of the following > processing, or is this for other reasons not relevant in the > 'targetm.goacc.expand_var_decl' case: > > > /* If a static var's type was incomplete when the decl was > > written, but the type is complete now, lay out the decl now. */ > > if (DECL_SIZE (exp) == 0 > | && COMPLETE_OR_UNBOUND_ARRAY_TYPE_P (TREE_TYPE (exp)) > | && (TREE_STATIC (exp) || DECL_EXTERNAL (exp))) > | layout_decl (exp, 0); > | > | /* fall through */ > | > | case FUNCTION_DECL: > | case RESULT_DECL: > | decl_rtl = DECL_RTL (exp); > | expand_decl_rtl: > | gcc_assert (decl_rtl); > | > | /* DECL_MODE might change when TYPE_MODE depends on > attribute target | settings for VECTOR_TYPE_P that might > switch for the function. */ | if (currently_expanding_to_rtl > | && code == VAR_DECL && MEM_P (decl_rtl) > | && VECTOR_TYPE_P (type) && exp && DECL_MODE (exp) != > mode) | decl_rtl = change_address (decl_rtl, TYPE_MODE > (type), 0); | else > | decl_rtl = copy_rtx (decl_rtl); > | > | /* Record writes to register variables. */ > | if (modifier == EXPAND_WRITE > | && REG_P (decl_rtl) > | && HARD_REGISTER_P (decl_rtl)) > | add_to_hard_reg_set (&crtl->asm_clobbers, > | GET_MODE (decl_rtl), REGNO > (decl_rtl)); | > | /* Ensure variable marked as used even if it doesn't go > through | a parser. If it hasn't be used yet, write out an > external | definition. */ > | if (exp) > | TREE_USED (exp) = 1; > | > | /* Show we haven't gotten RTL for this yet. */ > | temp = 0; > | > | /* Variables inherited from containing functions should have > | been lowered by this point. */ > | if (exp) > | context = decl_function_context (exp); > | gcc_assert (!exp > | || SCOPE_FILE_SCOPE_P (context) > | || context == current_function_decl > | || TREE_STATIC (exp) > | || DECL_EXTERNAL (exp) > | /* ??? C++ creates functions that are not > TREE_STATIC. */ | || TREE_CODE (exp) == > FUNCTION_DECL); | > | /* This is the case of an array whose size is to be > determined | from its initializer, while the initializer is > still being parsed. | ??? We aren't parsing while expanding > anymore. */ | > | if (MEM_P (decl_rtl) && REG_P (XEXP (decl_rtl, 0))) > | temp = validize_mem (decl_rtl); > | > | /* If DECL_RTL is memory, we are in the normal case and the > | address is not valid, get the address into a register. */ > | > | else if (MEM_P (decl_rtl) && modifier != EXPAND_INITIALIZER) > | { > | if (alt_rtl) > | *alt_rtl = decl_rtl; > | decl_rtl = use_anchored_address (decl_rtl); > | if (modifier != EXPAND_CONST_ADDRESS > | && modifier != EXPAND_SUM > | && !memory_address_addr_space_p (exp ? DECL_MODE > (exp) | : GET_MODE > (decl_rtl), | XEXP > (decl_rtl, 0), | > MEM_ADDR_SPACE (decl_rtl))) | temp = > replace_equiv_address (decl_rtl, | > copy_rtx (XEXP (decl_rtl, 0))); | } > | > | /* If we got something, return it. But first, set the > alignment | if the address is a register. */ > | if (temp != 0) > | { > | if (exp && MEM_P (temp) && REG_P (XEXP (temp, 0))) > | mark_reg_pointer (XEXP (temp, 0), DECL_ALIGN (exp)); > | } > | else if (MEM_P (decl_rtl)) > | temp = decl_rtl; > | > | if (temp != 0) > | { > | if (MEM_P (temp) > | && modifier != EXPAND_WRITE > | && modifier != EXPAND_MEMORY > | && modifier != EXPAND_INITIALIZER > | && modifier != EXPAND_CONST_ADDRESS > | && modifier != EXPAND_SUM > | && !inner_reference_p > | && mode != BLKmode > | && MEM_ALIGN (temp) < GET_MODE_ALIGNMENT (mode)) > | temp = expand_misaligned_mem_ref (temp, mode, > unsignedp, | MEM_ALIGN > (temp), NULL_RTX, NULL); | > | return temp; > | } > | [...] > > [TS] I don't understand that yet. :-| > > Instead of the current "early-return" handling: > > temp = targetm.goacc.expand_var_decl (exp); > if (temp) > return temp; > > ... should we maybe just set: > > DECL_RTL (exp) = targetm.goacc.expand_var_decl (exp) > > ... (or similar), and then let the usual processing continue? Hum, not sure about that. See above excuse... maybe Chung-Lin remembers? My guess is the extra processing doesn't matter in practice for the limited kinds of variables that are handled by that hook, at least for NVPTX (which skips register allocation, etc. anyway). > > [snip] > > tree fork_kind = build_int_cst (unsigned_type_node, > > IFN_UNIQUE_OACC_FORK); tree join_kind = build_int_cst > > (unsigned_type_node, IFN_UNIQUE_OACC_JOIN); > > @@ -8027,7 +8041,8 @@ lower_oacc_head_tail (location_t loc, tree > > clauses, &join_seq); > > > > lower_oacc_reductions (loc, clauses, place, inner, > > - fork, join, &fork_seq, &join_seq, > > ctx); > > + fork, (count == 1) ? private_marker : > > NULL, > > + join, &fork_seq, &join_seq, ctx); > > > > /* Append this level to head. */ > > gimple_seq_add_seq (head, fork_seq); > > [TS] That looks good in principle. Via the testing mentioned above, I > just want to make sure that this does all the expected things > regarding differently nested loops and privatization levels. Feel free to extend test coverage as you see fit... > > gimple_seq_add_seq (&new_body, fork_seq); > > @@ -13262,6 +13369,9 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, > > omp_context *ctx) ctx); > > break; > > case GIMPLE_BIND: > > + if (ctx && is_gimple_omp_oacc (ctx->stmt)) > > + oacc_record_vars_in_bind (ctx, > > + gimple_bind_vars (as_a <gbind *> > > (stmt))); lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), > > ctx); maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> > > (stmt)); break; > > [TS] I have not yet verified whether these lowering case are > sufficient to also handle the <https://gcc.gnu.org/PR90114> > "Predetermined private levels for variables declared in OpenACC > accelerator routines" case. (If yes, then that needs testcases, too, > if not, then need to add a TODO note, for later.) I believe that's a TODO. > > + 1. They can be recreated, making a pointer to the variable > > in the new > > + address space, or > > + > > + 2. The address of the variable in the new address space can > > be taken, > > + converted to the default (original) address space, and > > the result of > > + that conversion subsituted in place of the original > > ADDR_EXPR node. + > > + Which of these is done depends on the gimple statement being > > processed. > > + At present atomic operations and inline asms use (1), and > > everything else > > + uses (2). At least on AMD GCN, there are atomic operations > > that work > > + directly in the LDS address space. > > + > > + COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also > > rewritten to use > > + the new decl, adjusting types of appropriate tree nodes as > > necessary. */ > > [TS] As I understand, this is only relevant for GCN offloading, but > not nvptx, and I'll trust that these two variants make sense from a > GCN point of view (which I cannot verify easily). The idea (hope) is that that's what's necessary "generically", though the only target using that support is GCN at present. I.e. it's not supposed to be GCN-specific, necessarily. Of course though, who knows what some other exotic target will need? (We don't want to be in the state where each target has to start completely from scratch for this sort of thing, if we can help it.) > > + if (targetm.goacc.adjust_private_decl) > > + { > > + FOR_ALL_BB_FN (bb, cfun) > > + for (gimple_stmt_iterator gsi = gsi_start_bb (bb); > > + !gsi_end_p (gsi); > > + gsi_next (&gsi)) > > + { > > + gimple *stmt = gsi_stmt (gsi); > > + walk_stmt_info wi; > > + var_decl_rewrite_info info; > > + > > + info.avoid_pointer_conversion > > + = (is_gimple_call (stmt) > > + && is_sync_builtin_call (as_a <gcall *> (stmt))) > > + || gimple_code (stmt) == GIMPLE_ASM; > > + info.stmt = stmt; > > + info.modified = false; > > + info.adjusted_vars = &adjusted_vars; > > + > > + memset (&wi, 0, sizeof (wi)); > > + wi.info = &info; > > + > > + walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi); > > + > > + if (info.modified) > > + update_stmt (stmt); > > + } > > + } > > + > > free_oacc_loop (loops); > > > > return 0; > > [TS] As disucssed above, maybe can completely skip the 'adjusted_vars' > rewriting for nvptx offloading? Yeah sure, if you like. > > --- /dev/null > > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c > > [TS] Without any code changes, this one FAILs (as expected) with nvptx > offloading, but with GCN offloading, it already PASSes. Not sure about that, of course one gets lucky sometimes. > > --- /dev/null > > +++ > > b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90 > > [TS] With code changes as posted, this one FAILs for nvptx offloading > execution. (... for all but the Nvidia Titan V GPU in my set of > testing configurations, huh?) > > > @@ -0,0 +1,25 @@ > > +! Test for worker-private variables > > + > > +! { dg-do run } > > +! { dg-additional-options "-fdump-tree-oaccdevlow-details" } > > + > > +program main > > + integer :: w, arr(0:31) > > + > > + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) > > + !$acc loop gang worker private(w) > > +! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has worker > > partitioning: integer\\(kind=4\\) w;" 1 "oaccdevlow" } } */ > > + do j = 0, 31 > > + w = 0 > > + !$acc loop seq > > + do i = 0, 31 > > + !$acc atomic update > > + w = w + 1 > > + !$acc end atomic > > + end do > > + arr(j) = w > > + end do > > + !$acc end parallel > > + > > + if (any (arr .ne. 32)) stop 1 > > +end program main Boo. I don't think I saw such a failure on the systems I tested on. That needs investigation (though it might be something CUDA-version or GPU specific, hence not directly a GCC problem? Not sure.) Thanks for review, and please ask if there's anything I can help further with. Julian
Hi! On 2021-04-19T12:23:56+0100, Julian Brown <julian@codesourcery.com> wrote: > On Thu, 15 Apr 2021 19:26:54 +0200 > Thomas Schwinge <thomas@codesourcery.com> wrote: >> This has iterated through several conceptually different designs and >> implementations, by several people, over the past several years. > > I hope this wasn't a hint that I'd failed to attribute the authorship of > the patch properly? Many apologies if so, that certainly wasn't my > intention! No, not at all -- this was just to highlight the several iterations this work has gone though. With a first set of my modification merged in, I've now pushed "openacc: Add support for gang local storage allocation in shared memory [PR90115]" to master branch in commit 29a2f51806c5b30e17a8d0e9ba7915a3c53c34ff, see attached. I shall now follow up with a number of further changes, and more to come later (once developed). >> On 2021-02-26T04:34:50-0800, Julian Brown <julian@codesourcery.com> >> wrote: >> > This patch implements a method to track the "private-ness" of >> > OpenACC variables declared in offload regions in gang-partitioned, >> > worker-partitioned or vector-partitioned modes. Variables declared >> > implicitly in scoped blocks and those declared "private" on >> > enclosing directives (e.g. "acc parallel") are both handled. >> > Variables that are e.g. gang-private can then be adjusted so they >> > reside in GPU shared memory. >> > >> > The reason for doing this is twofold: correct implementation of >> > OpenACC semantics >> >> ACK, and as mentioned before, this very much relates to >> <https://gcc.gnu.org/PR90115> "OpenACC: predetermined private levels >> for variables declared in blocks" (plus the corresponding use of >> 'private' clauses, implicit/explicit, including 'firstprivate') and >> <https://gcc.gnu.org/PR90114> "Predetermined private levels for >> variables declared in OpenACC accelerator routines", which we thus >> should refer in testcases/ChangeLog/commit log, as appropriate. I do >> understand we're not yet addressing all of that (and that's fine!), >> but we should capture remaining work items of the PRs and Cesar's >> list in >> <http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>), >> as appropriate. > > From that list: [...] Thanks, that'll be useful for later. >> > Handling of private variables is intimately >> > tied to the execution model for gangs/workers/vectors implemented by >> > a particular target: for current targets, we use (or on mainline, >> > will soon use) a broadcasting/neutering scheme. >> > >> > That is sufficient for code that e.g. sets a variable in >> > worker-single mode and expects to use the value in >> > worker-partitioned mode. The difficulty (semantics-wise) comes when >> > the user wants to do something like an atomic operation in >> > worker-partitioned mode and expects a worker-single (gang private) >> > variable to be shared across each partitioned worker. Forcing use >> > of shared memory for such variables makes that work properly. >> >> Are we reliably making sure that gang-private variables (and other >> levels, in general) are not subject to the usual broadcasting scheme >> (nvptx, at least), or does that currently work "by accident"? (I >> haven't looked into that, yet.) > > Yes, that case is explicitly handled by the broadcasting/neutering patch > recently posted. (One of the reasons that patch depends on this one.) OK, I shall look into these GCN patches soon -- and I still haven't looked into the nvptx aspect. >> > --- a/gcc/expr.c >> > +++ b/gcc/expr.c >> > @@ -10224,8 +10224,19 @@ expand_expr_real_1 (tree exp, rtx target, >> > machine_mode tmode, exp = SSA_NAME_VAR (ssa_name); >> > goto expand_decl_rtl; >> > >> > - case PARM_DECL: >> > case VAR_DECL: >> > + /* Allow accel compiler to handle variables that require special >> > + treatment, e.g. if they have been modified in some way earlier in >> > + compilation by the adjust_private_decl OpenACC hook. */ >> > + if (flag_openacc && targetm.goacc.expand_var_decl) >> > + { >> > + temp = targetm.goacc.expand_var_decl (exp); >> > + if (temp) >> > + return temp; >> > + } >> > + /* ... fall through ... */ >> > + >> > + case PARM_DECL: >> >> [TS] Are we sure that we don't need the same handling for a >> 'PARM_DECL', too? (If yes, to document and verify that, should we >> thus again unify the two 'case's, and in >> 'targetm.goacc.expand_var_decl' add a 'gcc_checking_assert (TREE_CODE >> (var) == VAR_DECL')'?) > > Maybe for routines? Those bits date from the earliest version of the > patch and (same excuse again) I didn't have call to revisit those > decisions. Indeed we're currently not handling 'p' here: int f(int p) { int l; #pragma acc parallel { #pragma acc loop gang private(l, p) // 'l' is, but 'p' is *not* made gang-private here. for ([...]) ... to be fixed at some later point. >> Also, are we sure that all the following existing processing is not >> relevant to do before the 'return temp' (see above)? That's not a >> concern for GCN (which doesn't use 'targetm.goacc.expand_var_decl', >> and thus does execute all this following existing processing), but it >> is for nvptx (which does use 'targetm.goacc.expand_var_decl', and >> thus doesn't execute all this following existing processing if that >> returned something). Or, is 'targetm.goacc.expand_var_decl' >> conceptually and practically meant to implement all of the following >> processing, or is this for other reasons not relevant in the >> 'targetm.goacc.expand_var_decl' case: >> >> > /* If a static var's type was incomplete when the decl was >> > written, but the type is complete now, lay out the decl now. */ >> > if (DECL_SIZE (exp) == 0 >> | && COMPLETE_OR_UNBOUND_ARRAY_TYPE_P (TREE_TYPE (exp)) >> | && (TREE_STATIC (exp) || DECL_EXTERNAL (exp))) >> | layout_decl (exp, 0); >> | >> | /* fall through */ >> | >> | case FUNCTION_DECL: >> | case RESULT_DECL: >> | decl_rtl = DECL_RTL (exp); >> | expand_decl_rtl: >> | gcc_assert (decl_rtl); >> | >> | /* DECL_MODE might change when TYPE_MODE depends on attribute target >> | settings for VECTOR_TYPE_P that might switch for the >> | function. */ >> | if (currently_expanding_to_rtl >> | && code == VAR_DECL && MEM_P (decl_rtl) >> | && VECTOR_TYPE_P (type) && exp && DECL_MODE (exp) != mode) >> | decl_rtl = change_address (decl_rtl, TYPE_MODE (type), 0); >> | else >> | decl_rtl = copy_rtx (decl_rtl); >> | >> | /* Record writes to register variables. */ >> | if (modifier == EXPAND_WRITE >> | && REG_P (decl_rtl) >> | && HARD_REGISTER_P (decl_rtl)) >> | add_to_hard_reg_set (&crtl->asm_clobbers, >> | GET_MODE (decl_rtl), REGNO (decl_rtl)); >> | >> | /* Ensure variable marked as used even if it doesn't go through >> | a parser. If it hasn't be used yet, write out an external >> | definition. */ >> | if (exp) >> | TREE_USED (exp) = 1; >> | >> | /* Show we haven't gotten RTL for this yet. */ >> | temp = 0; >> | >> | /* Variables inherited from containing functions should have >> | been lowered by this point. */ >> | if (exp) >> | context = decl_function_context (exp); >> | gcc_assert (!exp >> | || SCOPE_FILE_SCOPE_P (context) >> | || context == current_function_decl >> | || TREE_STATIC (exp) >> | || DECL_EXTERNAL (exp) >> | /* ??? C++ creates functions that are not TREE_STATIC. */ >> | || TREE_CODE (exp) == FUNCTION_DECL); >> | >> | /* This is the case of an array whose size is to be determined >> | from its initializer, while the initializer is still being parsed. >> | ??? We aren't parsing while expanding anymore. */ >> | >> | if (MEM_P (decl_rtl) && REG_P (XEXP (decl_rtl, 0))) >> | temp = validize_mem (decl_rtl); >> | >> | /* If DECL_RTL is memory, we are in the normal case and the >> | address is not valid, get the address into a register. */ >> | >> | else if (MEM_P (decl_rtl) && modifier != EXPAND_INITIALIZER) >> | { >> | if (alt_rtl) >> | *alt_rtl = decl_rtl; >> | decl_rtl = use_anchored_address (decl_rtl); >> | if (modifier != EXPAND_CONST_ADDRESS >> | && modifier != EXPAND_SUM >> | && !memory_address_addr_space_p (exp ? DECL_MODE (exp) >> | : GET_MODE (decl_rtl), >> | XEXP (decl_rtl, 0), >> | MEM_ADDR_SPACE (decl_rtl))) >> | temp = replace_equiv_address (decl_rtl, >> | copy_rtx (XEXP (decl_rtl, 0))); >> | } >> | >> | /* If we got something, return it. But first, set the alignment >> | if the address is a register. */ >> | if (temp != 0) >> | { >> | if (exp && MEM_P (temp) && REG_P (XEXP (temp, 0))) >> | mark_reg_pointer (XEXP (temp, 0), DECL_ALIGN (exp)); >> | } >> | else if (MEM_P (decl_rtl)) >> | temp = decl_rtl; >> | >> | if (temp != 0) >> | { >> | if (MEM_P (temp) >> | && modifier != EXPAND_WRITE >> | && modifier != EXPAND_MEMORY >> | && modifier != EXPAND_INITIALIZER >> | && modifier != EXPAND_CONST_ADDRESS >> | && modifier != EXPAND_SUM >> | && !inner_reference_p >> | && mode != BLKmode >> | && MEM_ALIGN (temp) < GET_MODE_ALIGNMENT (mode)) >> | temp = expand_misaligned_mem_ref (temp, mode, unsignedp, >> | MEM_ALIGN (temp), NULL_RTX, NULL); >> | >> | return temp; >> | } >> | [...] >> >> [TS] I don't understand that yet. :-| >> >> Instead of the current "early-return" handling: >> >> temp = targetm.goacc.expand_var_decl (exp); >> if (temp) >> return temp; >> >> ... should we maybe just set: >> >> DECL_RTL (exp) = targetm.goacc.expand_var_decl (exp) >> >> ... (or similar), and then let the usual processing continue? > > Hum, not sure about that. See above excuse... maybe Chung-Lin > remembers? My guess is the extra processing doesn't matter in practice > for the limited kinds of variables that are handled by that hook, at > least for NVPTX (which skips register allocation, etc. anyway). I haven't yet looked into that further. Grüße Thomas ----------------- Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
Hi! On 2021-02-26T04:34:50-0800, Julian Brown <julian@codesourcery.com> wrote: > --- a/gcc/internal-fn.c > +++ b/gcc/internal-fn.c > @@ -2957,6 +2957,8 @@ expand_UNIQUE (internal_fn, gcall *stmt) > else > gcc_unreachable (); > break; > + case IFN_UNIQUE_OACC_PRIVATE: > + break; > } > > if (pattern) That's unexpected. Meaning: better if this doesn't happen. > --- a/gcc/omp-offload.c > +++ b/gcc/omp-offload.c > @@ -1998,6 +2133,45 @@ execute_oacc_device_lower () > case IFN_UNIQUE_OACC_TAIL_MARK: > remove = true; > break; > + > + case IFN_UNIQUE_OACC_PRIVATE: > + { > + HOST_WIDE_INT level > + = TREE_INT_CST_LOW (gimple_call_arg (call, 2)); > + if (level == -1) > + break; They should be all "handled" here (meaning: also for 'level == -1', do 'remove = true' after the real handling): > + for (unsigned i = 3; > + i < gimple_call_num_args (call); > + i++) > + { > + [...] > + } > + remove = true; > + } > + break; > } > break; > } Why we at all can have 'level == -1' cases is a separate bug to be fixed. I've pushed "[OpenACC privatization] Don't let unhandled 'IFN_UNIQUE_OACC_PRIVATE' linger [PR90115]" to master branch in commit ff451ea723deb3fe8471eb96ac9381c063ec6533, see attached. Grüße Thomas ----------------- Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
Hi! On 2021-04-19T12:23:56+0100, Julian Brown <julian@codesourcery.com> wrote: > On Thu, 15 Apr 2021 19:26:54 +0200 > Thomas Schwinge <thomas@codesourcery.com> wrote: >> On 2021-02-26T04:34:50-0800, Julian Brown <julian@codesourcery.com> >> wrote: >> > Two new target hooks are introduced: >> > TARGET_GOACC_ADJUST_PRIVATE_DECL and TARGET_GOACC_EXPAND_VAR_DECL. >> > The first can tweak a variable declaration at oaccdevlow time, and >> > the second at expand time. The first or both of these target hooks >> > can be used by a given offload target, depending on its strategy >> > for implementing private variables. >> >> ACK. >> >> So, currently we're only looking at making the gang-private level >> work. Regarding that, we have two configurations: (1) for GCN >> offloading, 'targetm.goacc.adjust_private_decl' does the work (in >> particular, change 'TREE_TYPE' etc.) and there is no >> 'targetm.goacc.expand_var_decl', and (2) for nvptx offloading, >> 'targetm.goacc.adjust_private_decl' only sets a marker ('oacc >> gangprivate' attribute) and then 'targetm.goacc.expand_var_decl' does >> the work. >> >> Therefore I suggest we clarify the (currently) expected handling >> similar to: >> >> --- gcc/omp-offload.c >> +++ gcc/omp-offload.c >> @@ -1854,6 +1854,19 @@ oacc_rewrite_var_decl (tree *tp, int *walk_subtrees, void *data) return NULL_TREE; >> } >> >> +static tree >> +oacc_rewrite_var_decl_ (tree *tp, int *walk_subtrees, void *data) >> +{ >> + tree t = oacc_rewrite_var_decl (tp, walk_subtrees, data); >> + if (targetm.goacc.expand_var_decl) >> + { >> + walk_stmt_info *wi = (walk_stmt_info *) data; >> + var_decl_rewrite_info *info = (var_decl_rewrite_info *) wi->info; >> + gcc_assert (!info->modified); >> + } >> + return t; >> +} > > Why the ugly _ tail on the function name!? I don't think that's a > typical GNU coding standards thing, is it? Heh, that was just to make the WIP prototype changes diff as small as possible. ;-) >> + >> /* Return TRUE if CALL is a call to a builtin atomic/sync operation. */ >> static bool >> @@ -2195,6 +2208,9 @@ execute_oacc_device_lower () >> COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also >> rewritten to use the new decl, adjusting types of appropriate tree >> nodes as necessary. */ >> + if (targetm.goacc.expand_var_decl) >> + gcc_assert (adjusted_vars.is_empty ()); > > If you like I've pushed "[OpenACC privatization] Explain two different configurations [PR90115]" to master branch in commit 21803fcaebeab36de0d7b6b8cf6abb9389f5e51f, see attached. > -- or do something like > >> if (targetm.goacc.adjust_private_decl) > && !adjusted_vars.is_empty ()) > > perhaps. That, too, additionally: I've pushed "[OpenACC privatization] Skip processing if no work to be done [PR90115]" to master branch in commit ad4612cb048b261f6834e9155e41e40e9252c80b, see attached. >> { >> FOR_ALL_BB_FN (bb, cfun) >> @@ -2217,7 +2233,7 @@ execute_oacc_device_lower () >> memset (&wi, 0, sizeof (wi)); >> wi.info = &info; >> >> - walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi); >> + walk_gimple_op (stmt, oacc_rewrite_var_decl_, &wi); >> >> if (info.modified) >> update_stmt (stmt); >> >> Or, in fact, 'if (targetm.goacc.expand_var_decl)', skip the >> 'adjusted_vars' handling completely? > > For the current pair of implementations, sure. I don't think it's > necessary to set that as a constraint for future targets though? I > guess it doesn't matter much until such a target exists. > >> I do understand that eventually (in particular, for worker-private >> level?), both 'targetm.goacc.adjust_private_decl' and >> 'targetm.goacc.expand_var_decl' may need to do things, but that's >> currently not meant to be addressed, and thus not fully worked out and >> implemented, and thus untested. Hence, 'assert' what currently is >> implemented/tested, only. > > If you like, no strong feelings from me on that. > >> (Given that eventual goal, that's probably sufficient motivation to >> indeed add the 'adjusted_vars' handling in generic 'gcc/omp-offload.c' >> instead of moving it into the GCN back end?) > > I'm not sure what moving it to the GCN back end would look like. I > guess it's a question of keeping the right abstractions in the right > place. Right. I guess we'll figure that out once we have more than one back end using the 'adjusted_vars' machinery. >> > + 1. They can be recreated, making a pointer to the variable in the new >> > + address space, or >> > + >> > + 2. The address of the variable in the new address space can be taken, >> > + converted to the default (original) address space, and the result of >> > + that conversion subsituted in place of the original ADDR_EXPR node. >> > + >> > + Which of these is done depends on the gimple statement being processed. >> > + At present atomic operations and inline asms use (1), and everything else >> > + uses (2). At least on AMD GCN, there are atomic operations that work >> > + directly in the LDS address space. >> > + >> > + COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also rewritten to use >> > + the new decl, adjusting types of appropriate tree nodes as necessary. */ >> >> [TS] As I understand, this is only relevant for GCN offloading, but >> not nvptx, and I'll trust that these two variants make sense from a >> GCN point of view (which I cannot verify easily). > > The idea (hope) is that that's what's necessary "generically", though > the only target using that support is GCN at present. I.e. it's not > supposed to be GCN-specific, necessarily. Of course though, who knows > what some other exotic target will need? (We don't want to be in the > state where each target has to start completely from scratch for this > sort of thing, if we can help it.) > >> > + if (targetm.goacc.adjust_private_decl) >> > + { >> > + FOR_ALL_BB_FN (bb, cfun) >> > + for (gimple_stmt_iterator gsi = gsi_start_bb (bb); >> > + !gsi_end_p (gsi); >> > + gsi_next (&gsi)) >> > + { >> > + gimple *stmt = gsi_stmt (gsi); >> > + walk_stmt_info wi; >> > + var_decl_rewrite_info info; >> > + >> > + info.avoid_pointer_conversion >> > + = (is_gimple_call (stmt) >> > + && is_sync_builtin_call (as_a <gcall *> (stmt))) >> > + || gimple_code (stmt) == GIMPLE_ASM; >> > + info.stmt = stmt; >> > + info.modified = false; >> > + info.adjusted_vars = &adjusted_vars; >> > + >> > + memset (&wi, 0, sizeof (wi)); >> > + wi.info = &info; >> > + >> > + walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi); >> > + >> > + if (info.modified) >> > + update_stmt (stmt); >> > + } >> > + } >> > + >> > free_oacc_loop (loops); >> > >> > return 0; >> >> [TS] As disucssed above, maybe can completely skip the 'adjusted_vars' >> rewriting for nvptx offloading? > > Yeah sure, if you like. Grüße Thomas ----------------- Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
Hi! On 2021-04-19T12:23:56+0100, Julian Brown <julian@codesourcery.com> wrote: > On Thu, 15 Apr 2021 19:26:54 +0200 > Thomas Schwinge <thomas@codesourcery.com> wrote: >> As that may not be obvious to the reader, I'd like to have the >> 'TREE_ADDRESSABLE' conditionalization be documented in the code. You >> had explained that in >> <http://mid.mail-archive.com/20190612204216.0ec83e4e@squid.athome>: "a >> non-addressable variable [...]". > > Yeah that probably makes sense. I've pushed "[OpenACC privatization] Explain OpenACC privatization candidate selection [PR90115]" to master branch in commit 5a0fe1f6c4ad0e50bf4684e723ae2ba17d94c9e4, see attached. Grüße Thomas ----------------- Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
Hi! On 2021-04-19T12:23:56+0100, Julian Brown <julian@codesourcery.com> wrote: > On Thu, 15 Apr 2021 19:26:54 +0200 > Thomas Schwinge <thomas@codesourcery.com> wrote: >> On 2021-02-26T04:34:50-0800, Julian Brown <julian@codesourcery.com> >> wrote: >> I was surprised that we didn't really have to fix up any existing >> libgomp testcases, because there seem to be quite some that contain a >> pattern (exemplified by the 'tmp' variable) as follows: >> >> int main() >> { >> #define N 123 >> int data[N]; >> int tmp; >> >> #pragma acc parallel // implicit 'firstprivate(tmp)' >> { >> // 'tmp' now conceptually made gang-private here. >> #pragma acc loop gang >> for (int i = 0; i < 123; ++i) >> { >> tmp = i + 234; >> data[i] = tmp; >> } >> } >> >> for (int i = 0; i < 123; ++i) >> if (data[i] != i + 234) >> __builtin_abort (); >> >> return 0; >> } >> >> With the code changes as posted, this actually now does *not* use >> gang-private memory for 'tmp', but instead continues to use >> "thread-private registers", as before. > > When "tmp" is a local, non-address-taken scalar like that, it'll > probably end up in a register in offloaded code (or of course be > compiled out completely), both before and after this patch. So I > wouldn't expect this to not work in the pre-patch state. Of course, in the example as posted, there's no need to make 'tmp' gang-private. However, even if the 'i' loop did something more spectacular (that makes 'tmp' addressable/potentially shared), at present we still wouldn't handle that case: (a) we're not processing clauses on OpenACC compute constructs (only 'loop' construct), and (b) we're not processing 'firstprivate' clauses (only 'private'). That's now all easy to fix (and reflect in the testsuite), but needs proper time allocated. Relatedly, may also think about using that new privatization functionality for the 'private' aspect that comes with 'reduction' clauses? >> Same for: >> >> --- s3.c 2021-04-13 17:26:49.628739379 +0200 >> +++ s3_2.c 2021-04-13 17:29:43.484579664 +0200 >> @@ -4,6 +4,6 @@ >> int data[N]; >> - int tmp; >> >> -#pragma acc parallel // implicit 'firstprivate(tmp)' >> +#pragma acc parallel >> { >> + int tmp; >> // 'tmp' now conceptually made gang-private here. >> #pragma acc loop gang >> >> I suppose that's due to conditionalizing this transformation on >> 'TREE_ADDRESSABLE' (as you're doing), so we should be mostly "safe" >> regarding such existing testcases (but I haven't verified that yet in >> detail). > > Right. > >> That needs to be documented in testcases, with some kind of dump >> scanning (host compilation-side even; see below). Done. >> A note for later: if this weren't just a 'gang' loop, but 'gang' plus >> 'worker' and/or 'vector', we'd actually be fixing up user code with >> undefined behavior into "correct" code (by *not* making 'tmp' >> gang-private, but thread-private), right? > > Possibly -- coming up with a case like that might need a little > "ingenuity"... Still to be done. >> > In terms of implementation, the parallelism level of a given loop is >> > not fixed until the oaccdevlow pass in the offload compiler, so the >> > patch delays fixing the parallelism level of variables declared on >> > or within such loops until the same point. This is done by adding a >> > new internal UNIQUE function (OACC_PRIVATE) that lists (the address >> > of) each private variable as an argument, and other arguments set >> > so as to be able to determine the correct parallelism level to use >> > for the listed variables. This new internal function fits into the >> > existing scheme for demarcating OpenACC loops, as described in >> > comments in the patch. >> >> Yes, thanks, that's conceptually now much better than the earlier >> variants that we had. :-) (Hooray, again, for Nathan's OpenACC >> execution model design!) >> >> What we should add, though, is a bunch of testcases to verify that the >> expected processing does/doesn't happen for relevant source code >> constructs. I'm thinking that when the transformation is/isn't done, >> that gets logged, and we can then scan the dumps accordingly. Some of >> that is implemented already; we should be able to do such scanning >> generally for host compilation, too, not just offloading compilation. > > More test coverage is always welcome, of course. ;-) I couldn't resist -- and along the way found/fixed several issues in the code. >> > [snip] >> > tree fork_kind = build_int_cst (unsigned_type_node, >> > IFN_UNIQUE_OACC_FORK); tree join_kind = build_int_cst >> > (unsigned_type_node, IFN_UNIQUE_OACC_JOIN); >> > @@ -8027,7 +8041,8 @@ lower_oacc_head_tail (location_t loc, tree >> > clauses, &join_seq); >> > >> > lower_oacc_reductions (loc, clauses, place, inner, >> > - fork, join, &fork_seq, &join_seq, ctx); >> > + fork, (count == 1) ? private_marker : NULL, >> > + join, &fork_seq, &join_seq, ctx); >> > >> > /* Append this level to head. */ >> > gimple_seq_add_seq (head, fork_seq); >> >> [TS] That looks good in principle. Via the testing mentioned above, I >> just want to make sure that this does all the expected things >> regarding differently nested loops and privatization levels. > > Feel free to extend test coverage as you see fit... A little bit added, but more still to be done. I've pushed "[OpenACC privatization] Largely extend diagnostics and corresponding testsuite coverage [PR90115]" to master branch in commit 11b8286a83289f5b54e813f14ff56d730c3f3185, see attached. I had, of course, developed that in several iterations, intertwined with implementation changes, but didn't now feel like disentangling all that, sorry. >> > --- /dev/null >> > +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90 >> >> [TS] With code changes as posted, this one FAILs for nvptx offloading >> execution. (... for all but the Nvidia Titan V GPU in my set of >> testing configurations, huh?) >> >> > @@ -0,0 +1,25 @@ >> > +! Test for worker-private variables >> > + >> > +! { dg-do run } >> > +! { dg-additional-options "-fdump-tree-oaccdevlow-details" } >> > + >> > +program main >> > + integer :: w, arr(0:31) >> > + >> > + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) >> > + !$acc loop gang worker private(w) >> > +! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has worker partitioning: integer\\(kind=4\\) w;" 1 "oaccdevlow" } } */ >> > + do j = 0, 31 >> > + w = 0 >> > + !$acc loop seq >> > + do i = 0, 31 >> > + !$acc atomic update >> > + w = w + 1 >> > + !$acc end atomic >> > + end do >> > + arr(j) = w >> > + end do >> > + !$acc end parallel >> > + >> > + if (any (arr .ne. 32)) stop 1 >> > +end program main > > Boo. I don't think I saw such a failure on the systems I tested on. > That needs investigation (though it might be something CUDA-version or > GPU specific, hence not directly a GCC problem? Not sure.) That's <https://gcc.gnu.org/PR100678> "[OpenACC/nvptx] 'libgomp.oacc-c-c++-common/private-atomic-1.c' FAILs (differently) in certain configurations", now XFAILed. Grüße Thomas ----------------- Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
Hi! First: many thanks for running this automated regression testing machinery! On 2021-05-21T18:40:55-0700, "sunil.k.pandey via Gcc-patches" <gcc-patches@gcc.gnu.org> wrote: > On Linux/x86_64, > > 325aa13996bafce0c4927876c315d1fa706d9881 is the first bad commit > commit 325aa13996bafce0c4927876c315d1fa706d9881 > Author: Thomas Schwinge <thomas@codesourcery.com> > Date: Fri May 21 08:51:47 2021 +0200 > > [OpenACC privatization] Reject 'static', 'external' in blocks [PR90115] Actually not that one, but instead one commit before is the culprit: commit 11b8286a83289f5b54e813f14ff56d730c3f3185 Author: Thomas Schwinge <thomas@codesourcery.com> Date: Thu May 20 16:11:37 2021 +0200 [OpenACC privatization] Largely extend diagnostics and corresponding testsuite coverage [PR90115] (Probably your testing aggregates commits that appear in some period of time? Maybe reflect that in the reporting emails?) > caused > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -O0 (test for warnings, line 134) > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -O0 (test for warnings, line 98) > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -O1 (test for warnings, line 134) > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -O1 (test for warnings, line 98) > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -O2 (test for warnings, line 134) > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -O2 (test for warnings, line 98) > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions (test for warnings, line 134) > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions (test for warnings, line 98) > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -O3 -g (test for warnings, line 134) > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -O3 -g (test for warnings, line 98) > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -Os (test for warnings, line 134) > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -Os (test for warnings, line 98) Sorry, and ACK, and I'm confused why I didn't see that in my own testing. I've now pushed "[OpenACC privatization] Prune uninteresting/varying diagnostics in 'libgomp.oacc-fortran/privatized-ref-2.f90'" to master branch in commit 3050a1a18276d7cdd8946e34cc1344e30efb7030, see attached. Grüße Thomas ----------------- Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
Hi Thomas, I reproduced this issue manually and it turns out this is a special case. Script takes input from https://gcc.gnu.org/pipermail/gcc-regression/ and it matches the exact error message in the triaging process. This failure reported on gcc regression https://gcc.gnu.org/pipermail/gcc-regression/2021-May/074806.html Reason it triaged 325aa13996bafce0c4927876c315d1fa706d9881 and not 11b8286a83289f5b54e813f14ff56d730c3f3185 because, Commit 325aa13996bafce0c4927876c315d1fa706d9881 is the first commit which matches the failure message reported on gcc-regression. See the difference in line number. Error message produced from commit 325aa13996bafce0c4927876c315d1fa706d9881: FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -Os (test for warnings, line 98) FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -Os (test for warnings, line 134) vs. Error message produced from commit 11b8286a83289f5b54e813f14ff56d730c3f3185: FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -Os (test for warnings, line 100) FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -foffload=disable -Os (test for warnings, line 136) Thank you so much, Sunil Pandey On Sat, May 22, 2021 at 1:41 AM Thomas Schwinge <thomas@codesourcery.com> wrote: > Hi! > > First: many thanks for running this automated regression testing > machinery! > > On 2021-05-21T18:40:55-0700, "sunil.k.pandey via Gcc-patches" < > gcc-patches@gcc.gnu.org> wrote: > > On Linux/x86_64, > > > > 325aa13996bafce0c4927876c315d1fa706d9881 is the first bad commit > > commit 325aa13996bafce0c4927876c315d1fa706d9881 > > Author: Thomas Schwinge <thomas@codesourcery.com> > > Date: Fri May 21 08:51:47 2021 +0200 > > > > [OpenACC privatization] Reject 'static', 'external' in blocks > [PR90115] > > Actually not that one, but instead one commit before is the culprit: > > commit 11b8286a83289f5b54e813f14ff56d730c3f3185 > Author: Thomas Schwinge <thomas@codesourcery.com> > Date: Thu May 20 16:11:37 2021 +0200 > > [OpenACC privatization] Largely extend diagnostics and > corresponding testsuite coverage [PR90115] > > (Probably your testing aggregates commits that appear in some period of > time? Maybe reflect that in the reporting emails?) > > > caused > > > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -O0 (test for warnings, line 134) > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -O0 (test for warnings, line 98) > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -O1 (test for warnings, line 134) > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -O1 (test for warnings, line 98) > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -O2 (test for warnings, line 134) > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -O2 (test for warnings, line 98) > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -O3 -fomit-frame-pointer > -funroll-loops -fpeel-loops -ftracer -finline-functions (test for > warnings, line 134) > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -O3 -fomit-frame-pointer > -funroll-loops -fpeel-loops -ftracer -finline-functions (test for > warnings, line 98) > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -O3 -g (test for warnings, line 134) > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -O3 -g (test for warnings, line 98) > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -Os (test for warnings, line 134) > > FAIL: libgomp.oacc-fortran/privatized-ref-2.f90 -DACC_DEVICE_TYPE_host=1 > -DACC_MEM_SHARED=1 -foffload=disable -Os (test for warnings, line 98) > > Sorry, and ACK, and I'm confused why I didn't see that in my own testing. > I've now pushed "[OpenACC privatization] Prune uninteresting/varying > diagnostics in 'libgomp.oacc-fortran/privatized-ref-2.f90'" to master > branch in commit 3050a1a18276d7cdd8946e34cc1344e30efb7030, see attached. > > > Grüße > Thomas > > > ----------------- > Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München > Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank > Thürauf >
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index 062785af1e2..94927ea7b2b 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -6227,6 +6227,32 @@ like @code{cond_add@var{m}}. The default implementation returns a zero constant of type @var{type}. @end deftypefn +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_VAR_DECL (tree @var{var}) +This hook, if defined, is used by accelerator target back-ends to expand +specially handled kinds of @code{VAR_DECL} expressions. A particular use is +to place variables with specific attributes inside special accelarator +memories. A return value of @code{NULL} indicates that the target does not +handle this @code{VAR_DECL}, and normal RTL expanding is resumed. + +Only define this hook if your accelerator target needs to expand certain +@code{VAR_DECL} nodes in a way that differs from the default. You can also adjust +private variables at OpenACC device-lowering time using the +@code{TARGET_GOACC_ADJUST_PRIVATE_DECL} target hook. +@end deftypefn + +@deftypefn {Target Hook} tree TARGET_GOACC_ADJUST_PRIVATE_DECL (tree @var{var}, int @var{level}) +This hook, if defined, is used by accelerator target back-ends to adjust +OpenACC variable declarations that should be made private to the given +parallelism level (i.e. @code{GOMP_DIM_GANG}, @code{GOMP_DIM_WORKER} or +@code{GOMP_DIM_VECTOR}). A typical use for this hook is to force variable +declarations at the @code{gang} level to reside in GPU shared memory, by +setting the address space of the decl and making it static. + +You may also use the @code{TARGET_GOACC_EXPAND_VAR_DECL} hook if the +adjusted variable declaration needs to be expanded to RTL in a non-standard +way. +@end deftypefn + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 3b19e6f4281..b8c23cf6db5 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4219,6 +4219,10 @@ address; but often a machine-dependent strategy can generate better code. @hook TARGET_PREFERRED_ELSE_VALUE +@hook TARGET_GOACC_EXPAND_VAR_DECL + +@hook TARGET_GOACC_ADJUST_PRIVATE_DECL + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/expr.c b/gcc/expr.c index 86dc1b6c973..349825cf286 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -10224,8 +10224,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode, exp = SSA_NAME_VAR (ssa_name); goto expand_decl_rtl; - case PARM_DECL: case VAR_DECL: + /* Allow accel compiler to handle variables that require special + treatment, e.g. if they have been modified in some way earlier in + compilation by the adjust_private_decl OpenACC hook. */ + if (flag_openacc && targetm.goacc.expand_var_decl) + { + temp = targetm.goacc.expand_var_decl (exp); + if (temp) + return temp; + } + /* ... fall through ... */ + + case PARM_DECL: /* If a static var's type was incomplete when the decl was written, but the type is complete now, lay out the decl now. */ if (DECL_SIZE (exp) == 0 diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c index dd7173126fb..e6611e8572f 100644 --- a/gcc/internal-fn.c +++ b/gcc/internal-fn.c @@ -2957,6 +2957,8 @@ expand_UNIQUE (internal_fn, gcall *stmt) else gcc_unreachable (); break; + case IFN_UNIQUE_OACC_PRIVATE: + break; } if (pattern) diff --git a/gcc/internal-fn.h b/gcc/internal-fn.h index c6599ce4894..9004840e0f5 100644 --- a/gcc/internal-fn.h +++ b/gcc/internal-fn.h @@ -36,7 +36,8 @@ along with GCC; see the file COPYING3. If not see #define IFN_UNIQUE_CODES \ DEF(UNSPEC), \ DEF(OACC_FORK), DEF(OACC_JOIN), \ - DEF(OACC_HEAD_MARK), DEF(OACC_TAIL_MARK) + DEF(OACC_HEAD_MARK), DEF(OACC_TAIL_MARK), \ + DEF(OACC_PRIVATE) enum ifn_unique_kind { #define DEF(X) IFN_UNIQUE_##X diff --git a/gcc/omp-low.c b/gcc/omp-low.c index df5b6cec586..fd8025e0e3f 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -171,6 +171,9 @@ struct omp_context /* True if there is bind clause on the construct (i.e. a loop construct). */ bool loop_p; + + /* Addressable variable decls in this context. */ + vec<tree> oacc_addressable_var_decls; }; static splay_tree all_contexts; @@ -7048,8 +7051,9 @@ lower_lastprivate_clauses (tree clauses, tree predicate, gimple_seq *body_p, static void lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner, - gcall *fork, gcall *join, gimple_seq *fork_seq, - gimple_seq *join_seq, omp_context *ctx) + gcall *fork, gcall *private_marker, gcall *join, + gimple_seq *fork_seq, gimple_seq *join_seq, + omp_context *ctx) { gimple_seq before_fork = NULL; gimple_seq after_fork = NULL; @@ -7253,6 +7257,8 @@ lower_oacc_reductions (location_t loc, tree clauses, tree level, bool inner, /* Now stitch things together. */ gimple_seq_add_seq (fork_seq, before_fork); + if (private_marker) + gimple_seq_add_stmt (fork_seq, private_marker); if (fork) gimple_seq_add_stmt (fork_seq, fork); gimple_seq_add_seq (fork_seq, after_fork); @@ -7989,7 +7995,7 @@ lower_oacc_loop_marker (location_t loc, tree ddvar, bool head, HEAD and TAIL. */ static void -lower_oacc_head_tail (location_t loc, tree clauses, +lower_oacc_head_tail (location_t loc, tree clauses, gcall *private_marker, gimple_seq *head, gimple_seq *tail, omp_context *ctx) { bool inner = false; @@ -7997,6 +8003,14 @@ lower_oacc_head_tail (location_t loc, tree clauses, gimple_seq_add_stmt (head, gimple_build_assign (ddvar, integer_zero_node)); unsigned count = lower_oacc_head_mark (loc, ddvar, clauses, head, ctx); + + if (private_marker) + { + gimple_set_location (private_marker, loc); + gimple_call_set_lhs (private_marker, ddvar); + gimple_call_set_arg (private_marker, 1, ddvar); + } + tree fork_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_FORK); tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN); @@ -8027,7 +8041,8 @@ lower_oacc_head_tail (location_t loc, tree clauses, &join_seq); lower_oacc_reductions (loc, clauses, place, inner, - fork, join, &fork_seq, &join_seq, ctx); + fork, (count == 1) ? private_marker : NULL, + join, &fork_seq, &join_seq, ctx); /* Append this level to head. */ gimple_seq_add_seq (head, fork_seq); @@ -9992,6 +10007,32 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p, } } +/* Record vars listed in private clauses in CLAUSES in CTX. This information + is used to mark up variables that should be made private per-gang. */ + +static void +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) +{ + for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) + if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE) + { + tree decl = OMP_CLAUSE_DECL (c); + if (VAR_P (decl) && TREE_ADDRESSABLE (decl)) + ctx->oacc_addressable_var_decls.safe_push (decl); + } +} + +/* Record addressable vars declared in BINDVARS in CTX. This information is + used to mark up variables that should be made private per-gang. */ + +static void +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars) +{ + for (tree v = bindvars; v; v = DECL_CHAIN (v)) + if (VAR_P (v) && TREE_ADDRESSABLE (v)) + ctx->oacc_addressable_var_decls.safe_push (v); +} + /* Callback for walk_gimple_seq. Find #pragma omp scan statement. */ static tree @@ -10821,6 +10862,57 @@ lower_omp_for_scan (gimple_seq *body_p, gimple_seq *dlist, gomp_for *stmt, *dlist = new_dlist; } +/* Build an internal UNIQUE function with type IFN_UNIQUE_OACC_PRIVATE listing + the addresses of variables that should be made private at the surrounding + parallelism level. Such functions appear in the gimple code stream in two + forms, e.g. for a partitioned loop: + + .data_dep.6 = .UNIQUE (OACC_HEAD_MARK, .data_dep.6, 1, 68); + .data_dep.6 = .UNIQUE (OACC_PRIVATE, .data_dep.6, -1, &w); + .data_dep.6 = .UNIQUE (OACC_FORK, .data_dep.6, -1); + .data_dep.6 = .UNIQUE (OACC_HEAD_MARK, .data_dep.6); + + or alternatively, OACC_PRIVATE can appear at the top level of a parallel, + not as part of a HEAD_MARK sequence: + + .UNIQUE (OACC_PRIVATE, 0, 0, &w); + + For such stand-alone appearances, the 3rd argument is always 0, denoting + gang partitioning. */ + +static gcall * +make_oacc_private_marker (omp_context *ctx) +{ + int i; + tree decl; + + if (ctx->oacc_addressable_var_decls.length () == 0) + return NULL; + + auto_vec<tree, 5> args; + + args.quick_push (build_int_cst (integer_type_node, IFN_UNIQUE_OACC_PRIVATE)); + args.quick_push (integer_zero_node); + args.quick_push (integer_minus_one_node); + + FOR_EACH_VEC_ELT (ctx->oacc_addressable_var_decls, i, decl) + { + for (omp_context *thisctx = ctx; thisctx; thisctx = thisctx->outer) + { + tree inner_decl = maybe_lookup_decl (decl, thisctx); + if (inner_decl) + { + decl = inner_decl; + break; + } + } + tree addr = build_fold_addr_expr (decl); + args.safe_push (addr); + } + + return gimple_build_call_internal_vec (IFN_UNIQUE, args); +} + /* Lower code for an OMP loop directive. */ static void @@ -10837,6 +10929,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) push_gimplify_context (); + oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt)); + lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx); block = make_node (BLOCK); @@ -10855,6 +10949,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) gbind *inner_bind = as_a <gbind *> (gimple_seq_first_stmt (omp_for_body)); tree vars = gimple_bind_vars (inner_bind); + if (is_gimple_omp_oacc (ctx->stmt)) + oacc_record_vars_in_bind (ctx, vars); gimple_bind_append_vars (new_stmt, vars); /* bind_vars/BLOCK_VARS are being moved to new_stmt/block, don't keep them on the inner_bind and it's block. */ @@ -10968,6 +11064,11 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) lower_omp (gimple_omp_body_ptr (stmt), ctx); + gcall *private_marker = NULL; + if (is_gimple_omp_oacc (ctx->stmt) + && !gimple_seq_empty_p (omp_for_body)) + private_marker = make_oacc_private_marker (ctx); + /* Lower the header expressions. At this point, we can assume that the header is of the form: @@ -11022,7 +11123,7 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) if (is_gimple_omp_oacc (ctx->stmt) && !ctx_in_oacc_kernels_region (ctx)) lower_oacc_head_tail (gimple_location (stmt), - gimple_omp_for_clauses (stmt), + gimple_omp_for_clauses (stmt), private_marker, &oacc_head, &oacc_tail, ctx); /* Add OpenACC partitioning and reduction markers just before the loop. */ @@ -13019,8 +13120,14 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) them as a dummy GANG loop. */ tree level = build_int_cst (integer_type_node, GOMP_DIM_GANG); + gcall *private_marker = make_oacc_private_marker (ctx); + + if (private_marker) + gimple_call_set_arg (private_marker, 2, level); + lower_oacc_reductions (gimple_location (ctx->stmt), clauses, level, - false, NULL, NULL, &fork_seq, &join_seq, ctx); + false, NULL, private_marker, NULL, &fork_seq, + &join_seq, ctx); } gimple_seq_add_seq (&new_body, fork_seq); @@ -13262,6 +13369,9 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx) ctx); break; case GIMPLE_BIND: + if (ctx && is_gimple_omp_oacc (ctx->stmt)) + oacc_record_vars_in_bind (ctx, + gimple_bind_vars (as_a <gbind *> (stmt))); lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx); maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt)); break; diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c index 57be342da97..b3f543b597a 100644 --- a/gcc/omp-offload.c +++ b/gcc/omp-offload.c @@ -53,6 +53,7 @@ along with GCC; see the file COPYING3. If not see #include "attribs.h" #include "cfgloop.h" #include "context.h" +#include "convert.h" /* Describe the OpenACC looping structure of a function. The entire function is held in a 'NULL' loop. */ @@ -1356,7 +1357,9 @@ oacc_loop_xform_head_tail (gcall *from, int level) = ((enum ifn_unique_kind) TREE_INT_CST_LOW (gimple_call_arg (stmt, 0))); - if (k == IFN_UNIQUE_OACC_FORK || k == IFN_UNIQUE_OACC_JOIN) + if (k == IFN_UNIQUE_OACC_FORK + || k == IFN_UNIQUE_OACC_JOIN + || k == IFN_UNIQUE_OACC_PRIVATE) *gimple_call_arg_ptr (stmt, 2) = replacement; else if (k == kind && stmt != from) break; @@ -1773,6 +1776,136 @@ default_goacc_reduction (gcall *call) gsi_replace_with_seq (&gsi, seq, true); } +struct var_decl_rewrite_info +{ + gimple *stmt; + hash_map<tree, tree> *adjusted_vars; + bool avoid_pointer_conversion; + bool modified; +}; + +/* Helper function for execute_oacc_device_lower. Rewrite VAR_DECLs (by + themselves or wrapped in various other nodes) according to ADJUSTED_VARS in + the var_decl_rewrite_info pointed to via DATA. Used as part of coercing + gang-private variables in OpenACC offload regions to reside in GPU shared + memory. */ + +static tree +oacc_rewrite_var_decl (tree *tp, int *walk_subtrees, void *data) +{ + walk_stmt_info *wi = (walk_stmt_info *) data; + var_decl_rewrite_info *info = (var_decl_rewrite_info *) wi->info; + + if (TREE_CODE (*tp) == ADDR_EXPR) + { + tree arg = TREE_OPERAND (*tp, 0); + tree *new_arg = info->adjusted_vars->get (arg); + + if (new_arg) + { + if (info->avoid_pointer_conversion) + { + *tp = build_fold_addr_expr (*new_arg); + info->modified = true; + *walk_subtrees = 0; + } + else + { + gimple_stmt_iterator gsi = gsi_for_stmt (info->stmt); + tree repl = build_fold_addr_expr (*new_arg); + gimple *stmt1 + = gimple_build_assign (make_ssa_name (TREE_TYPE (repl)), repl); + tree conv = convert_to_pointer (TREE_TYPE (*tp), + gimple_assign_lhs (stmt1)); + gimple *stmt2 + = gimple_build_assign (make_ssa_name (TREE_TYPE (*tp)), conv); + gsi_insert_before (&gsi, stmt1, GSI_SAME_STMT); + gsi_insert_before (&gsi, stmt2, GSI_SAME_STMT); + *tp = gimple_assign_lhs (stmt2); + info->modified = true; + *walk_subtrees = 0; + } + } + } + else if (TREE_CODE (*tp) == COMPONENT_REF || TREE_CODE (*tp) == ARRAY_REF) + { + tree *base = &TREE_OPERAND (*tp, 0); + + while (TREE_CODE (*base) == COMPONENT_REF + || TREE_CODE (*base) == ARRAY_REF) + base = &TREE_OPERAND (*base, 0); + + if (TREE_CODE (*base) != VAR_DECL) + return NULL; + + tree *new_decl = info->adjusted_vars->get (*base); + if (!new_decl) + return NULL; + + int base_quals = TYPE_QUALS (TREE_TYPE (*new_decl)); + tree field = TREE_OPERAND (*tp, 1); + + /* Adjust the type of the field. */ + int field_quals = TYPE_QUALS (TREE_TYPE (field)); + if (TREE_CODE (field) == FIELD_DECL && field_quals != base_quals) + { + tree *field_type = &TREE_TYPE (field); + while (TREE_CODE (*field_type) == ARRAY_TYPE) + field_type = &TREE_TYPE (*field_type); + field_quals |= base_quals; + *field_type = build_qualified_type (*field_type, field_quals); + } + + /* Adjust the type of the component ref itself. */ + tree comp_type = TREE_TYPE (*tp); + int comp_quals = TYPE_QUALS (comp_type); + if (TREE_CODE (*tp) == COMPONENT_REF && comp_quals != base_quals) + { + comp_quals |= base_quals; + TREE_TYPE (*tp) + = build_qualified_type (comp_type, comp_quals); + } + + *base = *new_decl; + info->modified = true; + } + else if (TREE_CODE (*tp) == VAR_DECL) + { + tree *new_decl = info->adjusted_vars->get (*tp); + if (new_decl) + { + *tp = *new_decl; + info->modified = true; + } + } + + return NULL_TREE; +} + +/* Return TRUE if CALL is a call to a builtin atomic/sync operation. */ + +static bool +is_sync_builtin_call (gcall *call) +{ + tree callee = gimple_call_fndecl (call); + + if (callee != NULL_TREE + && gimple_call_builtin_p (call, BUILT_IN_NORMAL)) + switch (DECL_FUNCTION_CODE (callee)) + { +#undef DEF_SYNC_BUILTIN +#define DEF_SYNC_BUILTIN(ENUM, NAME, TYPE, ATTRS) case ENUM: +#include "sync-builtins.def" +#undef DEF_SYNC_BUILTIN + return true; + + default: + ; + } + + return false; +} + /* Main entry point for oacc transformations which run on the device compiler after LTO, so we know what the target device is at this point (including the host fallback). */ @@ -1922,6 +2055,8 @@ execute_oacc_device_lower () dominance information to update SSA. */ calculate_dominance_info (CDI_DOMINATORS); + hash_map<tree, tree> adjusted_vars; + /* Now lower internal loop functions to target-specific code sequences. */ basic_block bb; @@ -1998,6 +2133,45 @@ execute_oacc_device_lower () case IFN_UNIQUE_OACC_TAIL_MARK: remove = true; break; + + case IFN_UNIQUE_OACC_PRIVATE: + { + HOST_WIDE_INT level + = TREE_INT_CST_LOW (gimple_call_arg (call, 2)); + if (level == -1) + break; + for (unsigned i = 3; + i < gimple_call_num_args (call); + i++) + { + tree arg = gimple_call_arg (call, i); + gcc_assert (TREE_CODE (arg) == ADDR_EXPR); + tree decl = TREE_OPERAND (arg, 0); + if (dump_file && (dump_flags & TDF_DETAILS)) + { + static char const *const axes[] = + /* Must be kept in sync with GOMP_DIM + enumeration. */ + { "gang", "worker", "vector" }; + fprintf (dump_file, "Decl UID %u has %s " + "partitioning:", DECL_UID (decl), + axes[level]); + print_generic_decl (dump_file, decl, TDF_SLIM); + fputc ('\n', dump_file); + } + if (targetm.goacc.adjust_private_decl) + { + tree oldtype = TREE_TYPE (decl); + tree newdecl + = targetm.goacc.adjust_private_decl (decl, level); + if (TREE_TYPE (newdecl) != oldtype + || newdecl != decl) + adjusted_vars.put (decl, newdecl); + } + } + remove = true; + } + break; } break; } @@ -2029,6 +2203,55 @@ execute_oacc_device_lower () gsi_next (&gsi); } + /* Make adjustments to gang-private local variables if required by the + target, e.g. forcing them into a particular address space. Afterwards, + ADDR_EXPR nodes which have adjusted variables as their argument need to + be modified in one of two ways: + + 1. They can be recreated, making a pointer to the variable in the new + address space, or + + 2. The address of the variable in the new address space can be taken, + converted to the default (original) address space, and the result of + that conversion subsituted in place of the original ADDR_EXPR node. + + Which of these is done depends on the gimple statement being processed. + At present atomic operations and inline asms use (1), and everything else + uses (2). At least on AMD GCN, there are atomic operations that work + directly in the LDS address space. + + COMPONENT_REFS, ARRAY_REFS and plain VAR_DECLs are also rewritten to use + the new decl, adjusting types of appropriate tree nodes as necessary. */ + + if (targetm.goacc.adjust_private_decl) + { + FOR_ALL_BB_FN (bb, cfun) + for (gimple_stmt_iterator gsi = gsi_start_bb (bb); + !gsi_end_p (gsi); + gsi_next (&gsi)) + { + gimple *stmt = gsi_stmt (gsi); + walk_stmt_info wi; + var_decl_rewrite_info info; + + info.avoid_pointer_conversion + = (is_gimple_call (stmt) + && is_sync_builtin_call (as_a <gcall *> (stmt))) + || gimple_code (stmt) == GIMPLE_ASM; + info.stmt = stmt; + info.modified = false; + info.adjusted_vars = &adjusted_vars; + + memset (&wi, 0, sizeof (wi)); + wi.info = &info; + + walk_gimple_op (stmt, oacc_rewrite_var_decl, &wi); + + if (info.modified) + update_stmt (stmt); + } + } + free_oacc_loop (loops); return 0; diff --git a/gcc/target.def b/gcc/target.def index be7fcde961a..00b6f8f1bc9 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -1712,6 +1712,36 @@ for allocating any storage for reductions when necessary.", void, (gcall *call), default_goacc_reduction) +DEFHOOK +(expand_var_decl, +"This hook, if defined, is used by accelerator target back-ends to expand\n\ +specially handled kinds of @code{VAR_DECL} expressions. A particular use is\n\ +to place variables with specific attributes inside special accelarator\n\ +memories. A return value of @code{NULL} indicates that the target does not\n\ +handle this @code{VAR_DECL}, and normal RTL expanding is resumed.\n\ +\n\ +Only define this hook if your accelerator target needs to expand certain\n\ +@code{VAR_DECL} nodes in a way that differs from the default. You can also adjust\n\ +private variables at OpenACC device-lowering time using the\n\ +@code{TARGET_GOACC_ADJUST_PRIVATE_DECL} target hook.", +rtx, (tree var), +NULL) + +DEFHOOK +(adjust_private_decl, +"This hook, if defined, is used by accelerator target back-ends to adjust\n\ +OpenACC variable declarations that should be made private to the given\n\ +parallelism level (i.e. @code{GOMP_DIM_GANG}, @code{GOMP_DIM_WORKER} or\n\ +@code{GOMP_DIM_VECTOR}). A typical use for this hook is to force variable\n\ +declarations at the @code{gang} level to reside in GPU shared memory, by\n\ +setting the address space of the decl and making it static.\n\ +\n\ +You may also use the @code{TARGET_GOACC_EXPAND_VAR_DECL} hook if the\n\ +adjusted variable declaration needs to be expanded to RTL in a non-standard\n\ +way.", +tree, (tree var, int level), +NULL) + HOOK_VECTOR_END (goacc) /* Functions relating to vectorization. */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c new file mode 100644 index 00000000000..28222c25da3 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c @@ -0,0 +1,38 @@ +#include <assert.h> + +int main (void) +{ + int ret; + + #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret) + { + int w = 0; + + #pragma acc loop worker + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + w++; + } + + ret = (w == 32); + } + assert (ret); + + #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret) + { + int v = 0; + + #pragma acc loop vector + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + v++; + } + + ret = (v == 32); + } + assert (ret); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c new file mode 100644 index 00000000000..a4f81a39e24 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c @@ -0,0 +1,95 @@ +#include <stdio.h> +#include <openacc.h> +#include <alloca.h> +#include <string.h> +#include <gomp-constants.h> +#include <stdlib.h> + +#if 0 +#define DEBUG(DIM, IDX, VAL) \ + fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL)) +#else +#define DEBUG(DIM, IDX, VAL) +#endif + +#define N (32*32*32) + +int +check (const char *dim, int *dist, int dimsize) +{ + int ix; + int exit = 0; + + for (ix = 0; ix < dimsize; ix++) + { + DEBUG(dim, ix, dist[ix]); + if (dist[ix] < (N) / (dimsize + 0.5) + || dist[ix] > (N) / (dimsize - 0.5)) + { + fprintf (stderr, "did not distribute to %ss (%d not between %d " + "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)), + (int) ((N) / (dimsize - 0.5))); + exit |= 1; + } + } + + return exit; +} + +int main () +{ + int ary[N]; + int ix; + int exit = 0; + int gangsize = 0, workersize = 0, vectorsize = 0; + int *gangdist, *workerdist, *vectordist; + + for (ix = 0; ix < N;ix++) + ary[ix] = -1; + +#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \ + copy(ary) copyout(gangsize, workersize, vectorsize) + { +#pragma acc loop gang worker vector + for (unsigned ix = 0; ix < N; ix++) + { + int g, w, v; + + g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG); + w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER); + v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR); + + ary[ix] = (g << 16) | (w << 8) | v; + } + + gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG); + workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER); + vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR); + } + + gangdist = (int *) alloca (gangsize * sizeof (int)); + workerdist = (int *) alloca (workersize * sizeof (int)); + vectordist = (int *) alloca (vectorsize * sizeof (int)); + memset (gangdist, 0, gangsize * sizeof (int)); + memset (workerdist, 0, workersize * sizeof (int)); + memset (vectordist, 0, vectorsize * sizeof (int)); + + /* Test that work is shared approximately equally amongst each active + gang/worker/vector. */ + for (ix = 0; ix < N; ix++) + { + int g = (ary[ix] >> 16) & 255; + int w = (ary[ix] >> 8) & 255; + int v = ary[ix] & 255; + + gangdist[g]++; + workerdist[w]++; + vectordist[v]++; + } + + exit = check ("gang", gangdist, gangsize); + exit |= check ("worker", workerdist, workersize); + exit |= check ("vector", vectordist, vectorsize); + + return exit; +} diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 new file mode 100644 index 00000000000..f330f7de1be --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 @@ -0,0 +1,25 @@ +! Test for "oacc gangprivate" attribute on gang-private variables + +! { dg-do run } +! { dg-additional-options "-fdump-tree-oaccdevlow-details -w" } + +program main + integer :: w, arr(0:31) + + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) + !$acc loop gang private(w) +! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has gang partitioning: integer\\(kind=4\\) w;" 1 "oaccdevlow" } } */ + do j = 0, 31 + w = 0 + !$acc loop seq + do i = 0, 31 + !$acc atomic update + w = w + 1 + !$acc end atomic + end do + arr(j) = w + end do + !$acc end parallel + + if (any (arr .ne. 32)) stop 1 +end program main diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90 new file mode 100644 index 00000000000..f4e67b0c708 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90 @@ -0,0 +1,25 @@ +! Test for worker-private variables + +! { dg-do run } +! { dg-additional-options "-fdump-tree-oaccdevlow-details" } + +program main + integer :: w, arr(0:31) + + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) + !$acc loop gang worker private(w) +! { dg-final { scan-tree-dump-times "Decl UID \[0-9\]+ has worker partitioning: integer\\(kind=4\\) w;" 1 "oaccdevlow" } } */ + do j = 0, 31 + w = 0 + !$acc loop seq + do i = 0, 31 + !$acc atomic update + w = w + 1 + !$acc end atomic + end do + arr(j) = w + end do + !$acc end parallel + + if (any (arr .ne. 32)) stop 1 +end program main