Message ID | 20180813172151.6bfcece3@squid.athome |
---|---|
State | New |
Headers | show |
Series | [OpenACC] Add support for gang local storage allocation in shared memory | expand |
On 08/13/2018 09:21 AM, Julian Brown wrote: > diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c > new file mode 100644 > index 0000000..2fa708a > --- /dev/null > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c > @@ -0,0 +1,106 @@ > +/* { dg-xfail-run-if "gangprivate failure" { openacc_nvidia_accel_selected } { "-O0" } { "" } } */ As a quick comment, I like the approach that you've taken with this patch, but the og8 patch only applies the gangprivate attribute in the c/c++ FE. I'd have to review the notes, but I seem to recall that excluding that clause in fortran was deliberate. Chung-Lin, do you recall the rationale behind that? With that aside, is the above xfail still necessary? It seems to xpass for me on nvptx. However, I see this regression on the host: FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/loop-gwv-2.c -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -O2 execution test There could be other regressions, but I only tested the new tests introduced by the patch so far. Cesar
On 08/13/2018 11:42 AM, Cesar Philippidis wrote: > On 08/13/2018 09:21 AM, Julian Brown wrote: > >> diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c >> new file mode 100644 >> index 0000000..2fa708a >> --- /dev/null >> +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c >> @@ -0,0 +1,106 @@ >> +/* { dg-xfail-run-if "gangprivate failure" { openacc_nvidia_accel_selected } { "-O0" } { "" } } */ > > As a quick comment, I like the approach that you've taken with this > patch, but the og8 patch only applies the gangprivate attribute in the > c/c++ FE. I'd have to review the notes, but I seem to recall that > excluding that clause in fortran was deliberate. Chung-Lin, do you > recall the rationale behind that? I found this in an old email: The older version of fortran that OpenACC supports doesn't have a concept of lexically scoped blocks like c/c++, so this isn't relevant except for explicit gang private variables. So in other words, this is safe for fortran. It probably could use a fortran test, because that functionality wasn't explicitly exercised in og7/og8. Cesar
On Mon, 13 Aug 2018 11:42:26 -0700 Cesar Philippidis <cesar@codesourcery.com> wrote: > On 08/13/2018 09:21 AM, Julian Brown wrote: > > > diff --git > > a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c > > b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c new file > > mode 100644 index 0000000..2fa708a --- /dev/null > > +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c > > @@ -0,0 +1,106 @@ > > +/* { dg-xfail-run-if "gangprivate > > failure" { openacc_nvidia_accel_selected } { "-O0" } { "" } } */ > > As a quick comment, I like the approach that you've taken with this > patch, but the og8 patch only applies the gangprivate attribute in the > c/c++ FE. I'd have to review the notes, but I seem to recall that > excluding that clause in fortran was deliberate. Chung-Lin, do you > recall the rationale behind that? > > With that aside, is the above xfail still necessary? It seems to xpass > for me on nvptx. However, I see this regression on the host: > > FAIL: libgomp.oacc-c/../libgomp.oacc-c-c++-common/loop-gwv-2.c > -DACC_DEVICE_TYPE_host=1 -DACC_MEM_SHARED=1 -O2 execution test > > There could be other regressions, but I only tested the new tests > introduced by the patch so far. Oops, this was the version of the patch I meant to post (and the one I tested). The XFAIL on loop-gwv-2.c isn't necessary, plus that test needed some other fixes to make it pass for NVPTX (it was written for GCN to start with). Everything else is the same. I'll see what I can come up with for a Fortran test. Thanks, Julian commit 7834b2f0dffec3e56e510c04e1663424b778fdfb Author: Julian Brown <julian@codesourcery.com> Date: Thu Aug 9 20:27:04 2018 -0700 [OpenACC] Add support for gang local storage allocation in shared memory 2018-08-10 Julian Brown <julian@codesourcery.com> Chung-Lin Tang <cltang@codesourcery.com> gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" atttribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls fields. (new_omp_context): Initialize oacc_decls in new omp_context. (delete_omp_context): Delete oacc_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index c0b0a2e..14eb842 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -73,6 +73,7 @@ #include "cfgloop.h" #include "fold-const.h" #include "intl.h" +#include "tree-hash-traits.h" /* This file should be included last. */ #include "target-def.h" @@ -137,6 +138,12 @@ static unsigned worker_red_size; static unsigned worker_red_align; static GTY(()) rtx worker_red_sym; +/* Shared memory block for gang-private variables. */ +static unsigned gangprivate_shared_size; +static unsigned gangprivate_shared_align; +static GTY(()) rtx gangprivate_shared_sym; +static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap; + /* Global lock variable, needed for 128bit worker & gang reductions. */ static GTY(()) tree global_lock_var; @@ -210,6 +217,10 @@ nvptx_option_override (void) SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED); worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared"); + SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED); + gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + diagnose_openacc_conflict (TARGET_GOMP, "-mgomp"); diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack"); diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt"); @@ -4968,6 +4979,10 @@ nvptx_file_end (void) write_worker_buffer (asm_out_file, worker_red_sym, worker_red_align, worker_red_size); + if (gangprivate_shared_size) + write_worker_buffer (asm_out_file, gangprivate_shared_sym, + gangprivate_shared_align, gangprivate_shared_size); + if (need_softstack_decl) { write_var_marker (asm_out_file, false, true, "__nvptx_stacks"); @@ -5915,6 +5930,47 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t) return false; } +static rtx +nvptx_goacc_expand_accel_var (tree var) +{ + if (TREE_CODE (var) == VAR_DECL + && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var))) + { + unsigned int offset, *poffset; + poffset = gangprivate_shared_hmap.get (var); + if (poffset) + offset = *poffset; + else + { + unsigned HOST_WIDE_INT align = DECL_ALIGN (var); + gangprivate_shared_size = + (gangprivate_shared_size + align - 1) & ~(align - 1); + if (gangprivate_shared_align < align) + gangprivate_shared_align = align; + + offset = gangprivate_shared_size; + bool existed = gangprivate_shared_hmap.put (var, offset); + gcc_assert (!existed); + gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var)); + } + rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset); + return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr); + } + return NULL_RTX; +} + +static GTY(()) tree nvptx_previous_fndecl; + +static void +nvptx_set_current_function (tree fndecl) +{ + if (!fndecl || fndecl == nvptx_previous_fndecl) + return; + + gangprivate_shared_hmap.empty (); + nvptx_previous_fndecl = fndecl; +} + #undef TARGET_OPTION_OVERRIDE #define TARGET_OPTION_OVERRIDE nvptx_option_override @@ -6051,6 +6107,12 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t) #undef TARGET_HAVE_SPECULATION_SAFE_VALUE #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed +#undef TARGET_GOACC_EXPAND_ACCEL_VAR +#define TARGET_GOACC_EXPAND_ACCEL_VAR nvptx_goacc_expand_accel_var + +#undef TARGET_SET_CURRENT_FUNCTION +#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function + struct gcc_target targetm = TARGET_INITIALIZER; #include "gt-nvptx.h" diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index a40f45a..fb87f67 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -6064,6 +6064,14 @@ like @code{cond_add@var{m}}. The default implementation returns a zero constant of type @var{type}. @end deftypefn +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var}) +This hook, if defined, is used by accelerator target back-ends to expand +specially handled kinds of VAR_DECL expressions. A particular use is to +place variables with specific attributes inside special accelarator +memories. A return value of NULL indicates that the target does not +handle this VAR_DECL, and normal RTL expanding is resumed. +@end deftypefn + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 39a214e..beace61 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4151,6 +4151,8 @@ address; but often a machine-dependent strategy can generate better code. @hook TARGET_PREFERRED_ELSE_VALUE +@hook TARGET_GOACC_EXPAND_ACCEL_VAR + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/expr.c b/gcc/expr.c index de6709d..2c62bf9 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -9854,8 +9854,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode, exp = SSA_NAME_VAR (ssa_name); goto expand_decl_rtl; - case PARM_DECL: case VAR_DECL: + /* Allow accel compiler to handle specific cases of variables, + specifically those tagged with the "oacc gangprivate" attribute, + which may intended to be placed in special memory in GPUs. */ + if (flag_openacc && targetm.goacc.expand_accel_var) + { + temp = targetm.goacc.expand_accel_var (exp); + if (temp) + return temp; + } + /* ... fall through ... */ + + case PARM_DECL: /* If a static var's type was incomplete when the decl was written, but the type is complete now, lay out the decl now. */ if (DECL_SIZE (exp) == 0 diff --git a/gcc/omp-low.c b/gcc/omp-low.c index 843c66f..354e182 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -124,6 +124,12 @@ struct omp_context /* True if this construct can be cancelled. */ bool cancellable; + + /* The number of levels of OpenACC partitioning invoked in this context. */ + int oacc_partitioning_levels; + + /* Decls in this context. */ + vec<tree> *oacc_decls; }; static splay_tree all_contexts; @@ -850,6 +856,7 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx) } ctx->cb.decl_map = new hash_map<tree, tree>; + ctx->oacc_decls = new vec<tree> (); return ctx; } @@ -925,6 +932,8 @@ delete_omp_context (splay_tree_value value) if (is_task_ctx (ctx)) finalize_task_copyfn (as_a <gomp_task *> (ctx->stmt)); + delete ctx->oacc_decls; + XDELETE (ctx); } @@ -5716,6 +5725,9 @@ lower_oacc_head_tail (location_t loc, tree clauses, tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN); gcc_assert (count); + + ctx->oacc_partitioning_levels = count; + for (unsigned done = 1; count; count--, done++) { gimple_seq fork_seq = NULL; @@ -6732,6 +6744,66 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p, } } +/* Record vars listed in private clauses in CLAUSES in CTX. This information + is used to mark up variables that should be made private per-gang. */ + +static void +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) +{ + tree c; + + if (!ctx) + return; + + for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) + switch (OMP_CLAUSE_CODE (c)) + { + case OMP_CLAUSE_PRIVATE: + { + tree decl = OMP_CLAUSE_DECL (c); + ctx->oacc_decls->safe_push (decl); + } + break; + + default: + /* Empty. */; + } +} + +/* Record vars declared in BINDVARS in CTX. This information is used to mark + up variables that should be made private per-gang. */ + +static void +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars) +{ + if (!ctx) + return; + + for (tree v = bindvars; v; v = DECL_CHAIN (v)) + ctx->oacc_decls->safe_push (v); +} + +/* Mark variables which are declared implicitly or explicitly as gang private + with a special attribute. These may need to have their declarations altered + later on in compilation (e.g. in execute_oacc_device_lower or the backend, + depending on how the OpenACC execution model is implemented on a given + target) to ensure that sharing semantics are correct. + Only variables which have their address taken need to be considered. */ + +static void +mark_oacc_gangprivate (vec<tree> *decls) +{ + int i; + tree decl; + + FOR_EACH_VEC_ELT (*decls, i, decl) + { + if (TREE_CODE (decl) == VAR_DECL && TREE_ADDRESSABLE (decl)) + DECL_ATTRIBUTES (decl) + = tree_cons (get_identifier ("oacc gangprivate"), + NULL, DECL_ATTRIBUTES (decl)); + } +} /* Lower code for an OMP loop directive. */ @@ -6748,6 +6820,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) push_gimplify_context (); + oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt)); + lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx); block = make_node (BLOCK); @@ -6878,7 +6952,20 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) /* Add OpenACC partitioning and reduction markers just before the loop. */ if (oacc_head) - gimple_seq_add_seq (&body, oacc_head); + { + gimple_seq_add_seq (&body, oacc_head); + + int level_total = 0; + omp_context *thisctx; + + for (thisctx = ctx; thisctx; thisctx = thisctx->outer) + level_total += thisctx->oacc_partitioning_levels; + + /* If the current context and parent contexts are distributed over a + total of one parallelism level, we have gang partitioning. */ + if (level_total == 1) + mark_oacc_gangprivate (ctx->oacc_decls); + } lower_omp_for_lastprivate (&fd, &body, &dlist, ctx); @@ -7511,6 +7598,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) clauses = gimple_omp_target_clauses (stmt); + oacc_record_private_var_clauses (ctx, clauses); + gimple_seq dep_ilist = NULL; gimple_seq dep_olist = NULL; if (omp_find_clause (clauses, OMP_CLAUSE_DEPEND)) @@ -7761,6 +7850,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) if (offloaded) { + mark_oacc_gangprivate (ctx->oacc_decls); + /* Declare all the variables created by mapping and the variables declared in the scope of the target body. */ record_vars_into (ctx->block_vars, child_fn); @@ -8755,6 +8846,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx) ctx); break; case GIMPLE_BIND: + oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt))); lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx); maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt)); break; diff --git a/gcc/target.def b/gcc/target.def index c570f38..b3b24b8 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -1701,6 +1701,16 @@ for allocating any storage for reductions when necessary.", void, (gcall *call), default_goacc_reduction) +DEFHOOK +(expand_accel_var, +"This hook, if defined, is used by accelerator target back-ends to expand\n\ +specially handled kinds of VAR_DECL expressions. A particular use is to\n\ +place variables with specific attributes inside special accelarator\n\ +memories. A return value of NULL indicates that the target does not\n\ +handle this VAR_DECL, and normal RTL expanding is resumed.", +rtx, (tree var), +NULL) + HOOK_VECTOR_END (goacc) /* Functions relating to vectorization. */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c new file mode 100644 index 0000000..f378346 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c @@ -0,0 +1,38 @@ +#include <assert.h> + +int main (void) +{ + int ret; + + #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret) + { + int w = 0; + + #pragma acc loop worker + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + w++; + } + + ret = (w == 32); + } + assert (ret); + + #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret) + { + int v = 0; + + #pragma acc loop vector + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + v++; + } + + ret = (v == 32); + } + assert (ret); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c new file mode 100644 index 0000000..a4f81a3 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c @@ -0,0 +1,95 @@ +#include <stdio.h> +#include <openacc.h> +#include <alloca.h> +#include <string.h> +#include <gomp-constants.h> +#include <stdlib.h> + +#if 0 +#define DEBUG(DIM, IDX, VAL) \ + fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL)) +#else +#define DEBUG(DIM, IDX, VAL) +#endif + +#define N (32*32*32) + +int +check (const char *dim, int *dist, int dimsize) +{ + int ix; + int exit = 0; + + for (ix = 0; ix < dimsize; ix++) + { + DEBUG(dim, ix, dist[ix]); + if (dist[ix] < (N) / (dimsize + 0.5) + || dist[ix] > (N) / (dimsize - 0.5)) + { + fprintf (stderr, "did not distribute to %ss (%d not between %d " + "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)), + (int) ((N) / (dimsize - 0.5))); + exit |= 1; + } + } + + return exit; +} + +int main () +{ + int ary[N]; + int ix; + int exit = 0; + int gangsize = 0, workersize = 0, vectorsize = 0; + int *gangdist, *workerdist, *vectordist; + + for (ix = 0; ix < N;ix++) + ary[ix] = -1; + +#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \ + copy(ary) copyout(gangsize, workersize, vectorsize) + { +#pragma acc loop gang worker vector + for (unsigned ix = 0; ix < N; ix++) + { + int g, w, v; + + g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG); + w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER); + v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR); + + ary[ix] = (g << 16) | (w << 8) | v; + } + + gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG); + workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER); + vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR); + } + + gangdist = (int *) alloca (gangsize * sizeof (int)); + workerdist = (int *) alloca (workersize * sizeof (int)); + vectordist = (int *) alloca (vectorsize * sizeof (int)); + memset (gangdist, 0, gangsize * sizeof (int)); + memset (workerdist, 0, workersize * sizeof (int)); + memset (vectordist, 0, vectorsize * sizeof (int)); + + /* Test that work is shared approximately equally amongst each active + gang/worker/vector. */ + for (ix = 0; ix < N; ix++) + { + int g = (ary[ix] >> 16) & 255; + int w = (ary[ix] >> 8) & 255; + int v = ary[ix] & 255; + + gangdist[g]++; + workerdist[w]++; + vectordist[v]++; + } + + exit = check ("gang", gangdist, gangsize); + exit |= check ("worker", workerdist, workersize); + exit |= check ("vector", vectordist, vectorsize); + + return exit; +} diff --git a/libgomp/testsuite/libgomp.oacc-c/pr85465.c b/libgomp/testsuite/libgomp.oacc-c/pr85465.c new file mode 100644 index 0000000..329e8a0 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-w" } */ + +int +main (void) +{ +#pragma acc parallel + foo (); + + return 0; +}
On Mon, 13 Aug 2018 12:06:21 -0700 Cesar Philippidis <cesar@codesourcery.com> wrote: > So in other words, this is safe for fortran. It probably could use a > fortran test, because that functionality wasn't explicitly exercised > in og7/og8. Here's a new version of the patch with a Fortran test case. It's not too easy to write a test that depends on whether gang-local variables actually end up in the right kind of memory, so I wrote one that scans the omplower dump instead. Many other (including execution) tests will already trigger the new behaviour. Tested with offloading to NVPTX. OK? Thanks, Julian 2018-08-10 Julian Brown <julian@codesourcery.com> Chung-Lin Tang <cltang@codesourcery.com> gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" atttribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls fields. (new_omp_context): Initialize oacc_decls in new omp_context. (delete_omp_context): Delete oacc_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test. commit b73428237720be8d5b6e793f8615204356336d30 Author: Julian Brown <julian@codesourcery.com> Date: Thu Aug 9 20:27:04 2018 -0700 [OpenACC] Add support for gang local storage allocation in shared memory 2018-08-10 Julian Brown <julian@codesourcery.com> Chung-Lin Tang <cltang@codesourcery.com> gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" atttribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls fields. (new_omp_context): Initialize oacc_decls in new omp_context. (delete_omp_context): Delete oacc_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test. diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index c0b0a2e..14eb842 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -73,6 +73,7 @@ #include "cfgloop.h" #include "fold-const.h" #include "intl.h" +#include "tree-hash-traits.h" /* This file should be included last. */ #include "target-def.h" @@ -137,6 +138,12 @@ static unsigned worker_red_size; static unsigned worker_red_align; static GTY(()) rtx worker_red_sym; +/* Shared memory block for gang-private variables. */ +static unsigned gangprivate_shared_size; +static unsigned gangprivate_shared_align; +static GTY(()) rtx gangprivate_shared_sym; +static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap; + /* Global lock variable, needed for 128bit worker & gang reductions. */ static GTY(()) tree global_lock_var; @@ -210,6 +217,10 @@ nvptx_option_override (void) SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED); worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared"); + SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED); + gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + diagnose_openacc_conflict (TARGET_GOMP, "-mgomp"); diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack"); diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt"); @@ -4968,6 +4979,10 @@ nvptx_file_end (void) write_worker_buffer (asm_out_file, worker_red_sym, worker_red_align, worker_red_size); + if (gangprivate_shared_size) + write_worker_buffer (asm_out_file, gangprivate_shared_sym, + gangprivate_shared_align, gangprivate_shared_size); + if (need_softstack_decl) { write_var_marker (asm_out_file, false, true, "__nvptx_stacks"); @@ -5915,6 +5930,47 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t) return false; } +static rtx +nvptx_goacc_expand_accel_var (tree var) +{ + if (TREE_CODE (var) == VAR_DECL + && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var))) + { + unsigned int offset, *poffset; + poffset = gangprivate_shared_hmap.get (var); + if (poffset) + offset = *poffset; + else + { + unsigned HOST_WIDE_INT align = DECL_ALIGN (var); + gangprivate_shared_size = + (gangprivate_shared_size + align - 1) & ~(align - 1); + if (gangprivate_shared_align < align) + gangprivate_shared_align = align; + + offset = gangprivate_shared_size; + bool existed = gangprivate_shared_hmap.put (var, offset); + gcc_assert (!existed); + gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var)); + } + rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset); + return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr); + } + return NULL_RTX; +} + +static GTY(()) tree nvptx_previous_fndecl; + +static void +nvptx_set_current_function (tree fndecl) +{ + if (!fndecl || fndecl == nvptx_previous_fndecl) + return; + + gangprivate_shared_hmap.empty (); + nvptx_previous_fndecl = fndecl; +} + #undef TARGET_OPTION_OVERRIDE #define TARGET_OPTION_OVERRIDE nvptx_option_override @@ -6051,6 +6107,12 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t) #undef TARGET_HAVE_SPECULATION_SAFE_VALUE #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed +#undef TARGET_GOACC_EXPAND_ACCEL_VAR +#define TARGET_GOACC_EXPAND_ACCEL_VAR nvptx_goacc_expand_accel_var + +#undef TARGET_SET_CURRENT_FUNCTION +#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function + struct gcc_target targetm = TARGET_INITIALIZER; #include "gt-nvptx.h" diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index a40f45a..fb87f67 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -6064,6 +6064,14 @@ like @code{cond_add@var{m}}. The default implementation returns a zero constant of type @var{type}. @end deftypefn +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var}) +This hook, if defined, is used by accelerator target back-ends to expand +specially handled kinds of VAR_DECL expressions. A particular use is to +place variables with specific attributes inside special accelarator +memories. A return value of NULL indicates that the target does not +handle this VAR_DECL, and normal RTL expanding is resumed. +@end deftypefn + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 39a214e..beace61 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4151,6 +4151,8 @@ address; but often a machine-dependent strategy can generate better code. @hook TARGET_PREFERRED_ELSE_VALUE +@hook TARGET_GOACC_EXPAND_ACCEL_VAR + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/expr.c b/gcc/expr.c index de6709d..2c62bf9 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -9854,8 +9854,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode, exp = SSA_NAME_VAR (ssa_name); goto expand_decl_rtl; - case PARM_DECL: case VAR_DECL: + /* Allow accel compiler to handle specific cases of variables, + specifically those tagged with the "oacc gangprivate" attribute, + which may intended to be placed in special memory in GPUs. */ + if (flag_openacc && targetm.goacc.expand_accel_var) + { + temp = targetm.goacc.expand_accel_var (exp); + if (temp) + return temp; + } + /* ... fall through ... */ + + case PARM_DECL: /* If a static var's type was incomplete when the decl was written, but the type is complete now, lay out the decl now. */ if (DECL_SIZE (exp) == 0 diff --git a/gcc/omp-low.c b/gcc/omp-low.c index 843c66f..b0e173d 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -124,6 +124,12 @@ struct omp_context /* True if this construct can be cancelled. */ bool cancellable; + + /* The number of levels of OpenACC partitioning invoked in this context. */ + int oacc_partitioning_levels; + + /* Decls in this context. */ + vec<tree> *oacc_decls; }; static splay_tree all_contexts; @@ -850,6 +856,7 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx) } ctx->cb.decl_map = new hash_map<tree, tree>; + ctx->oacc_decls = new vec<tree> (); return ctx; } @@ -925,6 +932,8 @@ delete_omp_context (splay_tree_value value) if (is_task_ctx (ctx)) finalize_task_copyfn (as_a <gomp_task *> (ctx->stmt)); + delete ctx->oacc_decls; + XDELETE (ctx); } @@ -5716,6 +5725,9 @@ lower_oacc_head_tail (location_t loc, tree clauses, tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN); gcc_assert (count); + + ctx->oacc_partitioning_levels = count; + for (unsigned done = 1; count; count--, done++) { gimple_seq fork_seq = NULL; @@ -6732,6 +6744,77 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p, } } +/* Record vars listed in private clauses in CLAUSES in CTX. This information + is used to mark up variables that should be made private per-gang. */ + +static void +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) +{ + tree c; + + if (!ctx) + return; + + for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) + switch (OMP_CLAUSE_CODE (c)) + { + case OMP_CLAUSE_PRIVATE: + { + tree decl = OMP_CLAUSE_DECL (c); + ctx->oacc_decls->safe_push (decl); + } + break; + + default: + /* Empty. */; + } +} + +/* Record vars declared in BINDVARS in CTX. This information is used to mark + up variables that should be made private per-gang. */ + +static void +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars) +{ + if (!ctx) + return; + + for (tree v = bindvars; v; v = DECL_CHAIN (v)) + ctx->oacc_decls->safe_push (v); +} + +/* Mark variables which are declared implicitly or explicitly as gang private + with a special attribute. These may need to have their declarations altered + later on in compilation (e.g. in execute_oacc_device_lower or the backend, + depending on how the OpenACC execution model is implemented on a given + target) to ensure that sharing semantics are correct. + Only variables which have their address taken need to be considered. */ + +static void +mark_oacc_gangprivate (vec<tree> *decls) +{ + int i; + tree decl; + + FOR_EACH_VEC_ELT (*decls, i, decl) + { + if (TREE_CODE (decl) == VAR_DECL + && TREE_ADDRESSABLE (decl) + && !lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl))) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, + "Setting 'oacc gangprivate' attribute for decl:"); + print_generic_decl (dump_file, decl, TDF_SLIM); + fputc ('\n', dump_file); + } + DECL_ATTRIBUTES (decl) + = tree_cons (get_identifier ("oacc gangprivate"), + NULL, DECL_ATTRIBUTES (decl)); + } + } +} /* Lower code for an OMP loop directive. */ @@ -6748,6 +6831,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) push_gimplify_context (); + oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt)); + lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx); block = make_node (BLOCK); @@ -6878,7 +6963,20 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) /* Add OpenACC partitioning and reduction markers just before the loop. */ if (oacc_head) - gimple_seq_add_seq (&body, oacc_head); + { + gimple_seq_add_seq (&body, oacc_head); + + int level_total = 0; + omp_context *thisctx; + + for (thisctx = ctx; thisctx; thisctx = thisctx->outer) + level_total += thisctx->oacc_partitioning_levels; + + /* If the current context and parent contexts are distributed over a + total of one parallelism level, we have gang partitioning. */ + if (level_total == 1) + mark_oacc_gangprivate (ctx->oacc_decls); + } lower_omp_for_lastprivate (&fd, &body, &dlist, ctx); @@ -7511,6 +7609,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) clauses = gimple_omp_target_clauses (stmt); + oacc_record_private_var_clauses (ctx, clauses); + gimple_seq dep_ilist = NULL; gimple_seq dep_olist = NULL; if (omp_find_clause (clauses, OMP_CLAUSE_DEPEND)) @@ -7761,6 +7861,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) if (offloaded) { + mark_oacc_gangprivate (ctx->oacc_decls); + /* Declare all the variables created by mapping and the variables declared in the scope of the target body. */ record_vars_into (ctx->block_vars, child_fn); @@ -8755,6 +8857,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx) ctx); break; case GIMPLE_BIND: + oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt))); lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx); maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt)); break; diff --git a/gcc/target.def b/gcc/target.def index c570f38..b3b24b8 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -1701,6 +1701,16 @@ for allocating any storage for reductions when necessary.", void, (gcall *call), default_goacc_reduction) +DEFHOOK +(expand_accel_var, +"This hook, if defined, is used by accelerator target back-ends to expand\n\ +specially handled kinds of VAR_DECL expressions. A particular use is to\n\ +place variables with specific attributes inside special accelarator\n\ +memories. A return value of NULL indicates that the target does not\n\ +handle this VAR_DECL, and normal RTL expanding is resumed.", +rtx, (tree var), +NULL) + HOOK_VECTOR_END (goacc) /* Functions relating to vectorization. */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c new file mode 100644 index 0000000..f378346 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c @@ -0,0 +1,38 @@ +#include <assert.h> + +int main (void) +{ + int ret; + + #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret) + { + int w = 0; + + #pragma acc loop worker + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + w++; + } + + ret = (w == 32); + } + assert (ret); + + #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret) + { + int v = 0; + + #pragma acc loop vector + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + v++; + } + + ret = (v == 32); + } + assert (ret); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c new file mode 100644 index 0000000..a4f81a3 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c @@ -0,0 +1,95 @@ +#include <stdio.h> +#include <openacc.h> +#include <alloca.h> +#include <string.h> +#include <gomp-constants.h> +#include <stdlib.h> + +#if 0 +#define DEBUG(DIM, IDX, VAL) \ + fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL)) +#else +#define DEBUG(DIM, IDX, VAL) +#endif + +#define N (32*32*32) + +int +check (const char *dim, int *dist, int dimsize) +{ + int ix; + int exit = 0; + + for (ix = 0; ix < dimsize; ix++) + { + DEBUG(dim, ix, dist[ix]); + if (dist[ix] < (N) / (dimsize + 0.5) + || dist[ix] > (N) / (dimsize - 0.5)) + { + fprintf (stderr, "did not distribute to %ss (%d not between %d " + "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)), + (int) ((N) / (dimsize - 0.5))); + exit |= 1; + } + } + + return exit; +} + +int main () +{ + int ary[N]; + int ix; + int exit = 0; + int gangsize = 0, workersize = 0, vectorsize = 0; + int *gangdist, *workerdist, *vectordist; + + for (ix = 0; ix < N;ix++) + ary[ix] = -1; + +#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \ + copy(ary) copyout(gangsize, workersize, vectorsize) + { +#pragma acc loop gang worker vector + for (unsigned ix = 0; ix < N; ix++) + { + int g, w, v; + + g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG); + w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER); + v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR); + + ary[ix] = (g << 16) | (w << 8) | v; + } + + gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG); + workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER); + vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR); + } + + gangdist = (int *) alloca (gangsize * sizeof (int)); + workerdist = (int *) alloca (workersize * sizeof (int)); + vectordist = (int *) alloca (vectorsize * sizeof (int)); + memset (gangdist, 0, gangsize * sizeof (int)); + memset (workerdist, 0, workersize * sizeof (int)); + memset (vectordist, 0, vectorsize * sizeof (int)); + + /* Test that work is shared approximately equally amongst each active + gang/worker/vector. */ + for (ix = 0; ix < N; ix++) + { + int g = (ary[ix] >> 16) & 255; + int w = (ary[ix] >> 8) & 255; + int v = ary[ix] & 255; + + gangdist[g]++; + workerdist[w]++; + vectordist[v]++; + } + + exit = check ("gang", gangdist, gangsize); + exit |= check ("worker", workerdist, workersize); + exit |= check ("vector", vectordist, vectorsize); + + return exit; +} diff --git a/libgomp/testsuite/libgomp.oacc-c/pr85465.c b/libgomp/testsuite/libgomp.oacc-c/pr85465.c new file mode 100644 index 0000000..329e8a0 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-w" } */ + +int +main (void) +{ +#pragma acc parallel + foo (); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 new file mode 100644 index 0000000..5f8a5e6 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 @@ -0,0 +1,25 @@ +! Test for "oacc gangprivate" attribute on gang-private variables + +! { dg-do run } +! { dg-additional-options "-fdump-tree-omplower-details" } +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' attribute for decl: integer\\(kind=4\\) w;" 1 "omplower" } } */ + +program main + integer :: w, arr(0:31) + + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) + !$acc loop gang private(w) + do j = 0, 31 + w = 0 + !$acc loop seq + do i = 0, 31 + !$acc atomic update + w = w + 1 + !$acc end atomic + end do + arr(j) = w + end do + !$acc end parallel + + if (any (arr .ne. 32)) stop 1 +end program main
On 15 August 2018 18:46:37 CEST, Julian Brown <julian@codesourcery.com> wrote: >On Mon, 13 Aug 2018 12:06:21 -0700 >Cesar Philippidis <cesar@codesourcery.com> wrote: atttribute has more t than strictly necessary. Don't like signed integer levels where they should be some unsigned. Also don't like single switch cases instead of if. And omitting function comments even if the hook way above is documented may be ok ish but is a bit lazy ;) thanks, > >> So in other words, this is safe for fortran. It probably could use a >> fortran test, because that functionality wasn't explicitly exercised >> in og7/og8. > >Here's a new version of the patch with a Fortran test case. It's not >too easy to write a test that depends on whether gang-local variables >actually end up in the right kind of memory, so I wrote one that scans >the omplower dump instead. Many other (including execution) tests will >already trigger the new behaviour. > >Tested with offloading to NVPTX. > >OK? > >Thanks, > >Julian > >2018-08-10 Julian Brown <julian@codesourcery.com> > Chung-Lin Tang <cltang@codesourcery.com> > > gcc/ > * config/nvptx/nvptx.c (tree-hash-traits.h): Include. > (gangprivate_shared_size): New global variable. > (gangprivate_shared_align): Likewise. > (gangprivate_shared_sym): Likewise. > (gangprivate_shared_hmap): Likewise. > (nvptx_option_override): Initialize gangprivate_shared_sym, > gangprivate_shared_align. > (nvptx_file_end): Output gangprivate_shared_sym. > (nvptx_goacc_expand_accel_var): New function. > (nvptx_set_current_function): New function. > (TARGET_SET_CURRENT_FUNCTION): Define hook. > (TARGET_GOACC_EXPAND_ACCEL): Likewise. > * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. > * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. > * expr.c (expand_expr_real_1): Remap decls marked with the > "oacc gangprivate" atttribute. > * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls > fields. > (new_omp_context): Initialize oacc_decls in new omp_context. > (delete_omp_context): Delete oacc_decls in old omp_context. >(lower_oacc_head_tail): Record partitioning-level count in omp context. > (oacc_record_private_var_clauses, oacc_record_vars_in_bind) > (mark_oacc_gangprivate): New functions. > (lower_omp_for): Call oacc_record_private_var_clauses with "for" > clauses. Call mark_oacc_gangprivate for gang-partitioned loops. > (lower_omp_target): Call oacc_record_private_var_clauses with "target" > clauses. > Call mark_oacc_gangprivate for offloaded target regions. > (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. > * target.def (expand_accel_var): New hook. > > libgomp/ > * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. > * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. > * testsuite/libgomp.oacc-c/pr85465.c: New test. > * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.
On Wed, 15 Aug 2018 21:56:54 +0200 Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote: > On 15 August 2018 18:46:37 CEST, Julian Brown > <julian@codesourcery.com> wrote: > >On Mon, 13 Aug 2018 12:06:21 -0700 > >Cesar Philippidis <cesar@codesourcery.com> wrote: > > atttribute has more t than strictly necessary. > Don't like signed integer levels where they should be some unsigned. > Also don't like single switch cases instead of if. > And omitting function comments even if the hook way above is > documented may be ok ish but is a bit lazy ;) Here's a new version with those comments addressed. I also changed the logic around a little to avoid adding decls to the vec in omp_context which would never be given the gang-private attribute. Re-tested with offloading to NVPTX. OK? Julian 2018-08-10 Julian Brown <julian@codesourcery.com> Chung-Lin Tang <cltang@codesourcery.com> gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" attribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_addressable_var_decls fields. (new_omp_context): Initialize oacc_addressable_var_decls in new omp_context. (delete_omp_context): Delete oacc_addressable_var_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test. commit e276442550a85b62866ba13890eacf4e946d1079 Author: Julian Brown <julian@codesourcery.com> Date: Thu Aug 9 20:27:04 2018 -0700 [OpenACC] Add support for gang local storage allocation in shared memory 2018-08-10 Julian Brown <julian@codesourcery.com> Chung-Lin Tang <cltang@codesourcery.com> gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" attribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_addressable_var_decls fields. (new_omp_context): Initialize oacc_addressable_var_decls in new omp_context. (delete_omp_context): Delete oacc_addressable_var_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test. diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index c0b0a2e..7aeefdb 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -73,6 +73,7 @@ #include "cfgloop.h" #include "fold-const.h" #include "intl.h" +#include "tree-hash-traits.h" /* This file should be included last. */ #include "target-def.h" @@ -137,6 +138,12 @@ static unsigned worker_red_size; static unsigned worker_red_align; static GTY(()) rtx worker_red_sym; +/* Shared memory block for gang-private variables. */ +static unsigned gangprivate_shared_size; +static unsigned gangprivate_shared_align; +static GTY(()) rtx gangprivate_shared_sym; +static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap; + /* Global lock variable, needed for 128bit worker & gang reductions. */ static GTY(()) tree global_lock_var; @@ -210,6 +217,10 @@ nvptx_option_override (void) SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED); worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared"); + SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED); + gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + diagnose_openacc_conflict (TARGET_GOMP, "-mgomp"); diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack"); diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt"); @@ -4968,6 +4979,10 @@ nvptx_file_end (void) write_worker_buffer (asm_out_file, worker_red_sym, worker_red_align, worker_red_size); + if (gangprivate_shared_size) + write_worker_buffer (asm_out_file, gangprivate_shared_sym, + gangprivate_shared_align, gangprivate_shared_size); + if (need_softstack_decl) { write_var_marker (asm_out_file, false, true, "__nvptx_stacks"); @@ -5915,6 +5930,52 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t) return false; } +/* Implement TARGET_GOACC_EXPAND_ACCEL_VAR. Place "oacc gangprivate" + variables in shared memory. */ + +static rtx +nvptx_goacc_expand_accel_var (tree var) +{ + if (TREE_CODE (var) == VAR_DECL + && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var))) + { + unsigned int offset, *poffset; + poffset = gangprivate_shared_hmap.get (var); + if (poffset) + offset = *poffset; + else + { + unsigned HOST_WIDE_INT align = DECL_ALIGN (var); + gangprivate_shared_size = + (gangprivate_shared_size + align - 1) & ~(align - 1); + if (gangprivate_shared_align < align) + gangprivate_shared_align = align; + + offset = gangprivate_shared_size; + bool existed = gangprivate_shared_hmap.put (var, offset); + gcc_assert (!existed); + gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var)); + } + rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset); + return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr); + } + return NULL_RTX; +} + +static GTY(()) tree nvptx_previous_fndecl; + +/* Implement TARGET_SET_CURRENT_FUNCTION. Reset per-function context. */ + +static void +nvptx_set_current_function (tree fndecl) +{ + if (!fndecl || fndecl == nvptx_previous_fndecl) + return; + + gangprivate_shared_hmap.empty (); + nvptx_previous_fndecl = fndecl; +} + #undef TARGET_OPTION_OVERRIDE #define TARGET_OPTION_OVERRIDE nvptx_option_override @@ -6051,6 +6112,12 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t) #undef TARGET_HAVE_SPECULATION_SAFE_VALUE #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed +#undef TARGET_GOACC_EXPAND_ACCEL_VAR +#define TARGET_GOACC_EXPAND_ACCEL_VAR nvptx_goacc_expand_accel_var + +#undef TARGET_SET_CURRENT_FUNCTION +#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function + struct gcc_target targetm = TARGET_INITIALIZER; #include "gt-nvptx.h" diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index a40f45a..fb87f67 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -6064,6 +6064,14 @@ like @code{cond_add@var{m}}. The default implementation returns a zero constant of type @var{type}. @end deftypefn +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var}) +This hook, if defined, is used by accelerator target back-ends to expand +specially handled kinds of VAR_DECL expressions. A particular use is to +place variables with specific attributes inside special accelarator +memories. A return value of NULL indicates that the target does not +handle this VAR_DECL, and normal RTL expanding is resumed. +@end deftypefn + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 39a214e..beace61 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4151,6 +4151,8 @@ address; but often a machine-dependent strategy can generate better code. @hook TARGET_PREFERRED_ELSE_VALUE +@hook TARGET_GOACC_EXPAND_ACCEL_VAR + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/expr.c b/gcc/expr.c index de6709d..f186a41 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -9854,8 +9854,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode, exp = SSA_NAME_VAR (ssa_name); goto expand_decl_rtl; - case PARM_DECL: case VAR_DECL: + /* Allow accel compiler to handle specific cases of variables, + specifically those tagged with the "oacc gangprivate" attribute, + which may be intended to be placed in special memory in GPUs. */ + if (flag_openacc && targetm.goacc.expand_accel_var) + { + temp = targetm.goacc.expand_accel_var (exp); + if (temp) + return temp; + } + /* ... fall through ... */ + + case PARM_DECL: /* If a static var's type was incomplete when the decl was written, but the type is complete now, lay out the decl now. */ if (DECL_SIZE (exp) == 0 diff --git a/gcc/omp-low.c b/gcc/omp-low.c index 843c66f..a649d2e 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -124,6 +124,12 @@ struct omp_context /* True if this construct can be cancelled. */ bool cancellable; + + /* The number of levels of OpenACC partitioning invoked in this context. */ + unsigned oacc_partitioning_levels; + + /* Addressable variable decls in this context. */ + vec<tree> *oacc_addressable_var_decls; }; static splay_tree all_contexts; @@ -850,6 +856,7 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx) } ctx->cb.decl_map = new hash_map<tree, tree>; + ctx->oacc_addressable_var_decls = new vec<tree> (); return ctx; } @@ -925,6 +932,8 @@ delete_omp_context (splay_tree_value value) if (is_task_ctx (ctx)) finalize_task_copyfn (as_a <gomp_task *> (ctx->stmt)); + delete ctx->oacc_addressable_var_decls; + XDELETE (ctx); } @@ -5716,6 +5725,9 @@ lower_oacc_head_tail (location_t loc, tree clauses, tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN); gcc_assert (count); + + ctx->oacc_partitioning_levels = count; + for (unsigned done = 1; count; count--, done++) { gimple_seq fork_seq = NULL; @@ -6732,6 +6744,68 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p, } } +/* Record vars listed in private clauses in CLAUSES in CTX. This information + is used to mark up variables that should be made private per-gang. */ + +static void +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) +{ + tree c; + + if (!ctx) + return; + + for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) + if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE) + { + tree decl = OMP_CLAUSE_DECL (c); + if (TREE_CODE (decl) == VAR_DECL && TREE_ADDRESSABLE (decl)) + ctx->oacc_addressable_var_decls->safe_push (decl); + } +} + +/* Record addressable vars declared in BINDVARS in CTX. This information is + used to mark up variables that should be made private per-gang. */ + +static void +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars) +{ + if (!ctx) + return; + + for (tree v = bindvars; v; v = DECL_CHAIN (v)) + if (TREE_CODE (v) == VAR_DECL && TREE_ADDRESSABLE (v)) + ctx->oacc_addressable_var_decls->safe_push (v); +} + +/* Mark addressable variables which are declared implicitly or explicitly as + gang private with a special attribute. These may need to have their + declarations altered later on in compilation (e.g. in + execute_oacc_device_lower or the backend, depending on how the OpenACC + execution model is implemented on a given target) to ensure that sharing + semantics are correct. */ + +static void +mark_oacc_gangprivate (vec<tree> *decls) +{ + int i; + tree decl; + + FOR_EACH_VEC_ELT (*decls, i, decl) + if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl))) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, + "Setting 'oacc gangprivate' attribute for decl:"); + print_generic_decl (dump_file, decl, TDF_SLIM); + fputc ('\n', dump_file); + } + DECL_ATTRIBUTES (decl) + = tree_cons (get_identifier ("oacc gangprivate"), + NULL, DECL_ATTRIBUTES (decl)); + } +} /* Lower code for an OMP loop directive. */ @@ -6748,6 +6822,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) push_gimplify_context (); + oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt)); + lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx); block = make_node (BLOCK); @@ -6878,7 +6954,20 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) /* Add OpenACC partitioning and reduction markers just before the loop. */ if (oacc_head) - gimple_seq_add_seq (&body, oacc_head); + { + gimple_seq_add_seq (&body, oacc_head); + + unsigned level_total = 0; + omp_context *thisctx; + + for (thisctx = ctx; thisctx; thisctx = thisctx->outer) + level_total += thisctx->oacc_partitioning_levels; + + /* If the current context and parent contexts are distributed over a + total of one parallelism level, we have gang partitioning. */ + if (level_total == 1) + mark_oacc_gangprivate (ctx->oacc_addressable_var_decls); + } lower_omp_for_lastprivate (&fd, &body, &dlist, ctx); @@ -7511,6 +7600,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) clauses = gimple_omp_target_clauses (stmt); + oacc_record_private_var_clauses (ctx, clauses); + gimple_seq dep_ilist = NULL; gimple_seq dep_olist = NULL; if (omp_find_clause (clauses, OMP_CLAUSE_DEPEND)) @@ -7761,6 +7852,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) if (offloaded) { + mark_oacc_gangprivate (ctx->oacc_addressable_var_decls); + /* Declare all the variables created by mapping and the variables declared in the scope of the target body. */ record_vars_into (ctx->block_vars, child_fn); @@ -8755,6 +8848,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx) ctx); break; case GIMPLE_BIND: + oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt))); lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx); maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt)); break; diff --git a/gcc/target.def b/gcc/target.def index c570f38..b3b24b8 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -1701,6 +1701,16 @@ for allocating any storage for reductions when necessary.", void, (gcall *call), default_goacc_reduction) +DEFHOOK +(expand_accel_var, +"This hook, if defined, is used by accelerator target back-ends to expand\n\ +specially handled kinds of VAR_DECL expressions. A particular use is to\n\ +place variables with specific attributes inside special accelarator\n\ +memories. A return value of NULL indicates that the target does not\n\ +handle this VAR_DECL, and normal RTL expanding is resumed.", +rtx, (tree var), +NULL) + HOOK_VECTOR_END (goacc) /* Functions relating to vectorization. */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c new file mode 100644 index 0000000..f378346 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c @@ -0,0 +1,38 @@ +#include <assert.h> + +int main (void) +{ + int ret; + + #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret) + { + int w = 0; + + #pragma acc loop worker + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + w++; + } + + ret = (w == 32); + } + assert (ret); + + #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret) + { + int v = 0; + + #pragma acc loop vector + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + v++; + } + + ret = (v == 32); + } + assert (ret); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c new file mode 100644 index 0000000..a4f81a3 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c @@ -0,0 +1,95 @@ +#include <stdio.h> +#include <openacc.h> +#include <alloca.h> +#include <string.h> +#include <gomp-constants.h> +#include <stdlib.h> + +#if 0 +#define DEBUG(DIM, IDX, VAL) \ + fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL)) +#else +#define DEBUG(DIM, IDX, VAL) +#endif + +#define N (32*32*32) + +int +check (const char *dim, int *dist, int dimsize) +{ + int ix; + int exit = 0; + + for (ix = 0; ix < dimsize; ix++) + { + DEBUG(dim, ix, dist[ix]); + if (dist[ix] < (N) / (dimsize + 0.5) + || dist[ix] > (N) / (dimsize - 0.5)) + { + fprintf (stderr, "did not distribute to %ss (%d not between %d " + "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)), + (int) ((N) / (dimsize - 0.5))); + exit |= 1; + } + } + + return exit; +} + +int main () +{ + int ary[N]; + int ix; + int exit = 0; + int gangsize = 0, workersize = 0, vectorsize = 0; + int *gangdist, *workerdist, *vectordist; + + for (ix = 0; ix < N;ix++) + ary[ix] = -1; + +#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \ + copy(ary) copyout(gangsize, workersize, vectorsize) + { +#pragma acc loop gang worker vector + for (unsigned ix = 0; ix < N; ix++) + { + int g, w, v; + + g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG); + w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER); + v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR); + + ary[ix] = (g << 16) | (w << 8) | v; + } + + gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG); + workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER); + vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR); + } + + gangdist = (int *) alloca (gangsize * sizeof (int)); + workerdist = (int *) alloca (workersize * sizeof (int)); + vectordist = (int *) alloca (vectorsize * sizeof (int)); + memset (gangdist, 0, gangsize * sizeof (int)); + memset (workerdist, 0, workersize * sizeof (int)); + memset (vectordist, 0, vectorsize * sizeof (int)); + + /* Test that work is shared approximately equally amongst each active + gang/worker/vector. */ + for (ix = 0; ix < N; ix++) + { + int g = (ary[ix] >> 16) & 255; + int w = (ary[ix] >> 8) & 255; + int v = ary[ix] & 255; + + gangdist[g]++; + workerdist[w]++; + vectordist[v]++; + } + + exit = check ("gang", gangdist, gangsize); + exit |= check ("worker", workerdist, workersize); + exit |= check ("vector", vectordist, vectorsize); + + return exit; +} diff --git a/libgomp/testsuite/libgomp.oacc-c/pr85465.c b/libgomp/testsuite/libgomp.oacc-c/pr85465.c new file mode 100644 index 0000000..329e8a0 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-w" } */ + +int +main (void) +{ +#pragma acc parallel + foo (); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 new file mode 100644 index 0000000..5f8a5e6 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 @@ -0,0 +1,25 @@ +! Test for "oacc gangprivate" attribute on gang-private variables + +! { dg-do run } +! { dg-additional-options "-fdump-tree-omplower-details" } +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' attribute for decl: integer\\(kind=4\\) w;" 1 "omplower" } } */ + +program main + integer :: w, arr(0:31) + + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) + !$acc loop gang private(w) + do j = 0, 31 + w = 0 + !$acc loop seq + do i = 0, 31 + !$acc atomic update + w = w + 1 + !$acc end atomic + end do + arr(j) = w + end do + !$acc end parallel + + if (any (arr .ne. 32)) stop 1 +end program main
On 16 August 2018 17:46:43 CEST, Julian Brown <julian@codesourcery.com> wrote: >On Wed, 15 Aug 2018 21:56:54 +0200 >Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote: > >> On 15 August 2018 18:46:37 CEST, Julian Brown >> <julian@codesourcery.com> wrote: >> >On Mon, 13 Aug 2018 12:06:21 -0700 >> >Cesar Philippidis <cesar@codesourcery.com> wrote: >> >> atttribute has more t than strictly necessary. >> Don't like signed integer levels where they should be some unsigned. >> Also don't like single switch cases instead of if. >> And omitting function comments even if the hook way above is >> documented may be ok ish but is a bit lazy ;) > >Here's a new version with those comments addressed. I also changed the >logic around a little to avoid adding decls to the vec in omp_context >which would never be given the gang-private attribute. > >Re-tested with offloading to NVPTX. > >OK? (TREE_CODE (var) == VAR_DECL Is nowadays known as VAR_P (decl), FWIW. ISTM that global variables are not JIT-friendly. No further comments from me. Thanks,
On 8/16/18 5:46 PM, Julian Brown wrote: > On Wed, 15 Aug 2018 21:56:54 +0200 > Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote: > >> On 15 August 2018 18:46:37 CEST, Julian Brown >> <julian@codesourcery.com> wrote: >>> On Mon, 13 Aug 2018 12:06:21 -0700 >>> Cesar Philippidis <cesar@codesourcery.com> wrote: >> >> atttribute has more t than strictly necessary. >> Don't like signed integer levels where they should be some unsigned. >> Also don't like single switch cases instead of if. >> And omitting function comments even if the hook way above is >> documented may be ok ish but is a bit lazy ;) > > Here's a new version with those comments addressed. I also changed the > logic around a little to avoid adding decls to the vec in omp_context > which would never be given the gang-private attribute. > > Re-tested with offloading to NVPTX. > > OK? As far as the nvptx part is concerned, I see: ... === ERROR type #4: trailing operator (1 error(s)) === gcc/config/nvptx/nvptx.c:5946:27: gangprivate_shared_size = ... Otherwise, the nvptx part is OK. Thanks, - Tom > > Julian > > 2018-08-10 Julian Brown <julian@codesourcery.com> > Chung-Lin Tang <cltang@codesourcery.com> > > gcc/ > * config/nvptx/nvptx.c (tree-hash-traits.h): Include. > (gangprivate_shared_size): New global variable. > (gangprivate_shared_align): Likewise. > (gangprivate_shared_sym): Likewise. > (gangprivate_shared_hmap): Likewise. > (nvptx_option_override): Initialize gangprivate_shared_sym, > gangprivate_shared_align. > (nvptx_file_end): Output gangprivate_shared_sym. > (nvptx_goacc_expand_accel_var): New function. > (nvptx_set_current_function): New function. > (TARGET_SET_CURRENT_FUNCTION): Define hook. > (TARGET_GOACC_EXPAND_ACCEL): Likewise. > * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. > * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. > * expr.c (expand_expr_real_1): Remap decls marked with the > "oacc gangprivate" attribute. > * omp-low.c (omp_context): Add oacc_partitioning_level and > oacc_addressable_var_decls fields. > (new_omp_context): Initialize oacc_addressable_var_decls in new > omp_context. > (delete_omp_context): Delete oacc_addressable_var_decls in old > omp_context. > (lower_oacc_head_tail): Record partitioning-level count in omp context. > (oacc_record_private_var_clauses, oacc_record_vars_in_bind) > (mark_oacc_gangprivate): New functions. > (lower_omp_for): Call oacc_record_private_var_clauses with "for" > clauses. Call mark_oacc_gangprivate for gang-partitioned loops. > (lower_omp_target): Call oacc_record_private_var_clauses with "target" > clauses. > Call mark_oacc_gangprivate for offloaded target regions. > (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. > * target.def (expand_accel_var): New hook. > > libgomp/ > * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. > * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. > * testsuite/libgomp.oacc-c/pr85465.c: New test. > * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test. >
On Fri, 17 Aug 2018 18:39:00 +0200 Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote: > On 16 August 2018 17:46:43 CEST, Julian Brown > <julian@codesourcery.com> wrote: > >On Wed, 15 Aug 2018 21:56:54 +0200 > >Bernhard Reutner-Fischer <rep.dot.nop@gmail.com> wrote: > > > >> On 15 August 2018 18:46:37 CEST, Julian Brown > >> <julian@codesourcery.com> wrote: > >> >On Mon, 13 Aug 2018 12:06:21 -0700 > >> >Cesar Philippidis <cesar@codesourcery.com> wrote: > >> > >> atttribute has more t than strictly necessary. > >> Don't like signed integer levels where they should be some > >> unsigned. Also don't like single switch cases instead of if. > >> And omitting function comments even if the hook way above is > >> documented may be ok ish but is a bit lazy ;) > > > >Here's a new version with those comments addressed. I also changed > >the logic around a little to avoid adding decls to the vec in > >omp_context which would never be given the gang-private attribute. > > > >Re-tested with offloading to NVPTX. > > > >OK? > > (TREE_CODE (var) == VAR_DECL > Is nowadays known as VAR_P (decl), FWIW. Fixed. (And also Tom's formatting nit mentioned in another email.) > ISTM that global variables are not JIT-friendly. > No further comments from me. Probably true, but AFAIK nobody's trying to use the (GCC) JIT with the PTX backend, and the backend already uses global variables for several other purposes. Of course PTX code is JIT'ted itself by the NVidia runtime, but I guess that's not what you were referring to! Is this version OK? Re-tested with offloading to NVPTX. Thanks, Julian commit 3335ddfa72944be5359280116e8eb4febd4ed3c7 Author: Julian Brown <julian@codesourcery.com> Date: Thu Aug 9 20:27:04 2018 -0700 [OpenACC] Add support for gang local storage allocation in shared memory 2018-08-10 Julian Brown <julian@codesourcery.com> Chung-Lin Tang <cltang@codesourcery.com> gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" attribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_addressable_var_decls fields. (new_omp_context): Initialize oacc_addressable_var_decls in new omp_context. (delete_omp_context): Delete oacc_addressable_var_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test. diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index 9903a27..02c2847 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -73,6 +73,7 @@ #include "cfgloop.h" #include "fold-const.h" #include "intl.h" +#include "tree-hash-traits.h" /* This file should be included last. */ #include "target-def.h" @@ -137,6 +138,12 @@ static unsigned worker_red_size; static unsigned worker_red_align; static GTY(()) rtx worker_red_sym; +/* Shared memory block for gang-private variables. */ +static unsigned gangprivate_shared_size; +static unsigned gangprivate_shared_align; +static GTY(()) rtx gangprivate_shared_sym; +static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap; + /* Global lock variable, needed for 128bit worker & gang reductions. */ static GTY(()) tree global_lock_var; @@ -210,6 +217,10 @@ nvptx_option_override (void) SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED); worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared"); + SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED); + gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + diagnose_openacc_conflict (TARGET_GOMP, "-mgomp"); diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack"); diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt"); @@ -4971,6 +4982,10 @@ nvptx_file_end (void) write_worker_buffer (asm_out_file, worker_red_sym, worker_red_align, worker_red_size); + if (gangprivate_shared_size) + write_worker_buffer (asm_out_file, gangprivate_shared_sym, + gangprivate_shared_align, gangprivate_shared_size); + if (need_softstack_decl) { write_var_marker (asm_out_file, false, true, "__nvptx_stacks"); @@ -5918,6 +5933,52 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t) return false; } +/* Implement TARGET_GOACC_EXPAND_ACCEL_VAR. Place "oacc gangprivate" + variables in shared memory. */ + +static rtx +nvptx_goacc_expand_accel_var (tree var) +{ + if (VAR_P (var) + && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var))) + { + unsigned int offset, *poffset; + poffset = gangprivate_shared_hmap.get (var); + if (poffset) + offset = *poffset; + else + { + unsigned HOST_WIDE_INT align = DECL_ALIGN (var); + gangprivate_shared_size + = (gangprivate_shared_size + align - 1) & ~(align - 1); + if (gangprivate_shared_align < align) + gangprivate_shared_align = align; + + offset = gangprivate_shared_size; + bool existed = gangprivate_shared_hmap.put (var, offset); + gcc_assert (!existed); + gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var)); + } + rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset); + return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr); + } + return NULL_RTX; +} + +static GTY(()) tree nvptx_previous_fndecl; + +/* Implement TARGET_SET_CURRENT_FUNCTION. Reset per-function context. */ + +static void +nvptx_set_current_function (tree fndecl) +{ + if (!fndecl || fndecl == nvptx_previous_fndecl) + return; + + gangprivate_shared_hmap.empty (); + nvptx_previous_fndecl = fndecl; +} + #undef TARGET_OPTION_OVERRIDE #define TARGET_OPTION_OVERRIDE nvptx_option_override @@ -6054,6 +6115,12 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t) #undef TARGET_HAVE_SPECULATION_SAFE_VALUE #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed +#undef TARGET_GOACC_EXPAND_ACCEL_VAR +#define TARGET_GOACC_EXPAND_ACCEL_VAR nvptx_goacc_expand_accel_var + +#undef TARGET_SET_CURRENT_FUNCTION +#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function + struct gcc_target targetm = TARGET_INITIALIZER; #include "gt-nvptx.h" diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index e348f0a..9164917 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -6124,6 +6124,14 @@ like @code{cond_add@var{m}}. The default implementation returns a zero constant of type @var{type}. @end deftypefn +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var}) +This hook, if defined, is used by accelerator target back-ends to expand +specially handled kinds of VAR_DECL expressions. A particular use is to +place variables with specific attributes inside special accelarator +memories. A return value of NULL indicates that the target does not +handle this VAR_DECL, and normal RTL expanding is resumed. +@end deftypefn + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index f1ad80d..3cdaca2 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4202,6 +4202,8 @@ address; but often a machine-dependent strategy can generate better code. @hook TARGET_PREFERRED_ELSE_VALUE +@hook TARGET_GOACC_EXPAND_ACCEL_VAR + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/expr.c b/gcc/expr.c index 85b7847..0f73deb 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -9874,8 +9874,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode, exp = SSA_NAME_VAR (ssa_name); goto expand_decl_rtl; - case PARM_DECL: case VAR_DECL: + /* Allow accel compiler to handle specific cases of variables, + specifically those tagged with the "oacc gangprivate" attribute, + which may be intended to be placed in special memory in GPUs. */ + if (flag_openacc && targetm.goacc.expand_accel_var) + { + temp = targetm.goacc.expand_accel_var (exp); + if (temp) + return temp; + } + /* ... fall through ... */ + + case PARM_DECL: /* If a static var's type was incomplete when the decl was written, but the type is complete now, lay out the decl now. */ if (DECL_SIZE (exp) == 0 diff --git a/gcc/omp-low.c b/gcc/omp-low.c index b406ce7..f078110 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -133,6 +133,12 @@ struct omp_context /* True if this construct can be cancelled. */ bool cancellable; + + /* The number of levels of OpenACC partitioning invoked in this context. */ + unsigned oacc_partitioning_levels; + + /* Addressable variable decls in this context. */ + vec<tree> *oacc_addressable_var_decls; }; static splay_tree all_contexts; @@ -872,6 +878,7 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx) } ctx->cb.decl_map = new hash_map<tree, tree>; + ctx->oacc_addressable_var_decls = new vec<tree> (); return ctx; } @@ -953,6 +960,8 @@ delete_omp_context (splay_tree_value value) delete ctx->task_reduction_map; } + delete ctx->oacc_addressable_var_decls; + XDELETE (ctx); } @@ -6470,6 +6479,9 @@ lower_oacc_head_tail (location_t loc, tree clauses, tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN); gcc_assert (count); + + ctx->oacc_partitioning_levels = count; + for (unsigned done = 1; count; count--, done++) { gimple_seq fork_seq = NULL; @@ -8144,6 +8156,68 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p, } } +/* Record vars listed in private clauses in CLAUSES in CTX. This information + is used to mark up variables that should be made private per-gang. */ + +static void +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) +{ + tree c; + + if (!ctx) + return; + + for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) + if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE) + { + tree decl = OMP_CLAUSE_DECL (c); + if (VAR_P (decl) && TREE_ADDRESSABLE (decl)) + ctx->oacc_addressable_var_decls->safe_push (decl); + } +} + +/* Record addressable vars declared in BINDVARS in CTX. This information is + used to mark up variables that should be made private per-gang. */ + +static void +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars) +{ + if (!ctx) + return; + + for (tree v = bindvars; v; v = DECL_CHAIN (v)) + if (VAR_P (v) && TREE_ADDRESSABLE (v)) + ctx->oacc_addressable_var_decls->safe_push (v); +} + +/* Mark addressable variables which are declared implicitly or explicitly as + gang private with a special attribute. These may need to have their + declarations altered later on in compilation (e.g. in + execute_oacc_device_lower or the backend, depending on how the OpenACC + execution model is implemented on a given target) to ensure that sharing + semantics are correct. */ + +static void +mark_oacc_gangprivate (vec<tree> *decls) +{ + int i; + tree decl; + + FOR_EACH_VEC_ELT (*decls, i, decl) + if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl))) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + { + fprintf (dump_file, + "Setting 'oacc gangprivate' attribute for decl:"); + print_generic_decl (dump_file, decl, TDF_SLIM); + fputc ('\n', dump_file); + } + DECL_ATTRIBUTES (decl) + = tree_cons (get_identifier ("oacc gangprivate"), + NULL, DECL_ATTRIBUTES (decl)); + } +} /* Lower code for an OMP loop directive. */ @@ -8161,6 +8235,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) push_gimplify_context (); + oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt)); + lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx); block = make_node (BLOCK); @@ -8316,7 +8392,20 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) /* Add OpenACC partitioning and reduction markers just before the loop. */ if (oacc_head) - gimple_seq_add_seq (&body, oacc_head); + { + gimple_seq_add_seq (&body, oacc_head); + + unsigned level_total = 0; + omp_context *thisctx; + + for (thisctx = ctx; thisctx; thisctx = thisctx->outer) + level_total += thisctx->oacc_partitioning_levels; + + /* If the current context and parent contexts are distributed over a + total of one parallelism level, we have gang partitioning. */ + if (level_total == 1) + mark_oacc_gangprivate (ctx->oacc_addressable_var_decls); + } lower_omp_for_lastprivate (&fd, &body, &dlist, ctx); @@ -9092,6 +9181,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) clauses = gimple_omp_target_clauses (stmt); + oacc_record_private_var_clauses (ctx, clauses); + gimple_seq dep_ilist = NULL; gimple_seq dep_olist = NULL; if (omp_find_clause (clauses, OMP_CLAUSE_DEPEND)) @@ -9342,6 +9433,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) if (offloaded) { + mark_oacc_gangprivate (ctx->oacc_addressable_var_decls); + /* Declare all the variables created by mapping and the variables declared in the scope of the target body. */ record_vars_into (ctx->block_vars, child_fn); @@ -10336,6 +10429,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx) ctx); break; case GIMPLE_BIND: + oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt))); lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx); maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt)); break; diff --git a/gcc/target.def b/gcc/target.def index 96f37e0..e154b17 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -1707,6 +1707,16 @@ for allocating any storage for reductions when necessary.", void, (gcall *call), default_goacc_reduction) +DEFHOOK +(expand_accel_var, +"This hook, if defined, is used by accelerator target back-ends to expand\n\ +specially handled kinds of VAR_DECL expressions. A particular use is to\n\ +place variables with specific attributes inside special accelarator\n\ +memories. A return value of NULL indicates that the target does not\n\ +handle this VAR_DECL, and normal RTL expanding is resumed.", +rtx, (tree var), +NULL) + HOOK_VECTOR_END (goacc) /* Functions relating to vectorization. */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c new file mode 100644 index 0000000..f378346 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c @@ -0,0 +1,38 @@ +#include <assert.h> + +int main (void) +{ + int ret; + + #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret) + { + int w = 0; + + #pragma acc loop worker + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + w++; + } + + ret = (w == 32); + } + assert (ret); + + #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret) + { + int v = 0; + + #pragma acc loop vector + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + v++; + } + + ret = (v == 32); + } + assert (ret); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c new file mode 100644 index 0000000..a4f81a3 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c @@ -0,0 +1,95 @@ +#include <stdio.h> +#include <openacc.h> +#include <alloca.h> +#include <string.h> +#include <gomp-constants.h> +#include <stdlib.h> + +#if 0 +#define DEBUG(DIM, IDX, VAL) \ + fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL)) +#else +#define DEBUG(DIM, IDX, VAL) +#endif + +#define N (32*32*32) + +int +check (const char *dim, int *dist, int dimsize) +{ + int ix; + int exit = 0; + + for (ix = 0; ix < dimsize; ix++) + { + DEBUG(dim, ix, dist[ix]); + if (dist[ix] < (N) / (dimsize + 0.5) + || dist[ix] > (N) / (dimsize - 0.5)) + { + fprintf (stderr, "did not distribute to %ss (%d not between %d " + "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)), + (int) ((N) / (dimsize - 0.5))); + exit |= 1; + } + } + + return exit; +} + +int main () +{ + int ary[N]; + int ix; + int exit = 0; + int gangsize = 0, workersize = 0, vectorsize = 0; + int *gangdist, *workerdist, *vectordist; + + for (ix = 0; ix < N;ix++) + ary[ix] = -1; + +#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) \ + copy(ary) copyout(gangsize, workersize, vectorsize) + { +#pragma acc loop gang worker vector + for (unsigned ix = 0; ix < N; ix++) + { + int g, w, v; + + g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG); + w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER); + v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR); + + ary[ix] = (g << 16) | (w << 8) | v; + } + + gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG); + workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER); + vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR); + } + + gangdist = (int *) alloca (gangsize * sizeof (int)); + workerdist = (int *) alloca (workersize * sizeof (int)); + vectordist = (int *) alloca (vectorsize * sizeof (int)); + memset (gangdist, 0, gangsize * sizeof (int)); + memset (workerdist, 0, workersize * sizeof (int)); + memset (vectordist, 0, vectorsize * sizeof (int)); + + /* Test that work is shared approximately equally amongst each active + gang/worker/vector. */ + for (ix = 0; ix < N; ix++) + { + int g = (ary[ix] >> 16) & 255; + int w = (ary[ix] >> 8) & 255; + int v = ary[ix] & 255; + + gangdist[g]++; + workerdist[w]++; + vectordist[v]++; + } + + exit = check ("gang", gangdist, gangsize); + exit |= check ("worker", workerdist, workersize); + exit |= check ("vector", vectordist, vectorsize); + + return exit; +} diff --git a/libgomp/testsuite/libgomp.oacc-c/pr85465.c b/libgomp/testsuite/libgomp.oacc-c/pr85465.c new file mode 100644 index 0000000..329e8a0 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-w" } */ + +int +main (void) +{ +#pragma acc parallel + foo (); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 new file mode 100644 index 0000000..5f8a5e6 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 @@ -0,0 +1,25 @@ +! Test for "oacc gangprivate" attribute on gang-private variables + +! { dg-do run } +! { dg-additional-options "-fdump-tree-omplower-details" } +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' attribute for decl: integer\\(kind=4\\) w;" 1 "omplower" } } */ + +program main + integer :: w, arr(0:31) + + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) + !$acc loop gang private(w) + do j = 0, 31 + w = 0 + !$acc loop seq + do i = 0, 31 + !$acc atomic update + w = w + 1 + !$acc end atomic + end do + arr(j) = w + end do + !$acc end parallel + + if (any (arr .ne. 32)) stop 1 +end program main
On Tue, 11 Dec 2018 15:08:11 +0000
Julian Brown <julian@codesourcery.com> wrote:
> Is this version OK? Re-tested with offloading to NVPTX.
This is a ping for the patch posted here:
https://gcc.gnu.org/ml/gcc-patches/2018-08/msg00749.html
This is a new version of the patch, rebased and with a couple of
additional bugfixes, as follows:
Firstly, in mark_oacc_gangprivate, each decl is looked up (using
maybe_lookup_decl) to apply the "oacc gangprivate" attribute to the
innermost-nested copy of the decl.
Secondly, I'd misunderstood when the maximum parallelism level was
calculated for each nested omp_context, meaning that the code to
trigger adding the "oacc gangprivate" attribute could trigger in the
wrong circumstances. I've fixed this by moving the attribute-setting to
execute_lower_omp.
I've also added a new testcase (gangprivate-attrib-2.f90). Re-tested
with offloading to nvptx.
OK for trunk?
Thank you,
Julian
2019-06-03 Julian Brown <julian@codesourcery.com>
Chung-Lin Tang <cltang@codesourcery.com>
gcc/
* config/nvptx/nvptx.c (tree-hash-traits.h): Include.
(gangprivate_shared_size): New global variable.
(gangprivate_shared_align): Likewise.
(gangprivate_shared_sym): Likewise.
(gangprivate_shared_hmap): Likewise.
(nvptx_option_override): Initialize gangprivate_shared_sym,
gangprivate_shared_align.
(nvptx_file_end): Output gangprivate_shared_sym.
(nvptx_goacc_expand_accel_var): New function.
(nvptx_set_current_function): Initialise gangprivate_shared_hmap. Add
function comment.
(TARGET_GOACC_EXPAND_ACCEL): Likewise.
* doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook.
* doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise.
* expr.c (expand_expr_real_1): Remap VAR_DECLs marked with the
"oacc gangprivate" attribute.
* omp-low.c (omp_context): Add oacc_partitioning_level and
oacc_addressable_var_decls fields.
(new_omp_context): Initialize oacc_addressable_var_decls in new
omp_context.
(delete_omp_context): Delete oacc_addressable_var_decls in old
omp_context.
(lower_oacc_head_tail): Record partitioning-level count in omp context.
(oacc_record_private_var_clauses, oacc_record_vars_in_bind,
mark_oacc_gangprivate): New functions.
(lower_omp_for): Call oacc_record_private_var_clauses with "for"
clauses.
(lower_omp_target): Likewise, for "target" clauses.
Call mark_oacc_gangprivate for offloaded target regions.
(process_oacc_gangprivate_1): New function.
(lower_omp_1): Call oacc_record_vars_in_bind for GIMPLE_BIND within OMP
regions.
(execute_lower_omp): Call process_oacc_gangprivate_1 for each OMP
context.
* target.def (expand_accel_var): New hook.
libgomp/
* testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test.
* testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test.
* testsuite/libgomp.oacc-c/pr85465.c: New test.
* testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test.
* testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90: New test.
On Mon, Jun 03, 2019 at 05:02:45PM +0100, Julian Brown wrote: > * omp-low.c (omp_context): Add oacc_partitioning_level and > oacc_addressable_var_decls fields. > (new_omp_context): Initialize oacc_addressable_var_decls in new > omp_context. > (delete_omp_context): Delete oacc_addressable_var_decls in old > omp_context. > (lower_oacc_head_tail): Record partitioning-level count in omp context. > (oacc_record_private_var_clauses, oacc_record_vars_in_bind, > mark_oacc_gangprivate): New functions. > (lower_omp_for): Call oacc_record_private_var_clauses with "for" > clauses. > (lower_omp_target): Likewise, for "target" clauses. > Call mark_oacc_gangprivate for offloaded target regions. > (process_oacc_gangprivate_1): New function. > (lower_omp_1): Call oacc_record_vars_in_bind for GIMPLE_BIND within OMP > regions. > (execute_lower_omp): Call process_oacc_gangprivate_1 for each OMP > context. Just commenting on the above part: > --- a/gcc/omp-low.c > +++ b/gcc/omp-low.c > @@ -137,6 +137,12 @@ struct omp_context > > /* True if this construct can be cancelled. */ > bool cancellable; > + > + /* The number of levels of OpenACC partitioning invoked in this context. */ > + unsigned oacc_partitioning_levels; > + > + /* Addressable variable decls in this context. */ > + vec<tree> *oacc_addressable_var_decls; Why vec<tree> * rather than vec<tree>? > @@ -878,6 +884,7 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx) > } > > ctx->cb.decl_map = new hash_map<tree, tree>; > + ctx->oacc_addressable_var_decls = new vec<tree> (); You then don't have to new it here and delete below. As the context is cleared with XCNEW, you don't need to do anything here, and just release when deleting. Note, even if using a pointer for some reason was needed (not in this case), using unconditional new for something only used for small subset of contexts is unacceptable, it would be then desirable to only create when needed. > > return ctx; > } > @@ -960,6 +967,7 @@ delete_omp_context (splay_tree_value value) > } > > delete ctx->lastprivate_conditional_map; > + delete ctx->oacc_addressable_var_decls; > > XDELETE (ctx); > } > @@ -8458,6 +8469,79 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p, > } > } > > +/* Record vars listed in private clauses in CLAUSES in CTX. This information > + is used to mark up variables that should be made private per-gang. */ > + > +static void > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) > +{ > + tree c; > + > + if (!ctx) > + return; > + > + for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) > + if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE) > + { > + tree decl = OMP_CLAUSE_DECL (c); > + if (VAR_P (decl) && TREE_ADDRESSABLE (decl)) > + ctx->oacc_addressable_var_decls->safe_push (decl); > + } > +} You don't want to do this for all GOMP_FOR or GOMP_TARGET context, I'd hope you only want to do that for OpenACC contexts. Perhaps it is ok to bail out early if the context isn't OpenACC one. On the other side, the if (!ctx) condition makes no sense, the callers of course guarantee that ctx is non-NULL. > @@ -10665,6 +10774,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx) > ctx); > break; > case GIMPLE_BIND: > + oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt))); Again, why is this done unconditionally? It should be relevant to gather it only in some subset of context, so guard that and don't do it otherwise. > lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx); > maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt)); > break; > @@ -10905,6 +11015,7 @@ execute_lower_omp (void) > > if (all_contexts) > { > + splay_tree_foreach (all_contexts, process_oacc_gangprivate_1, NULL); Similarly. Either guard with if (flag_openacc), or have some flag cleared at the start of the pass and set only if you find something interesting so that the splay_tree_foreach does something. Jakub
Hi Jakub, Thanks for the review! I believe I've addressed all your comments in the attached version of the patch. On Mon, 3 Jun 2019 18:23:00 +0200 Jakub Jelinek <jakub@redhat.com> wrote: > Why vec<tree> * rather than vec<tree>? > > @@ -878,6 +884,7 @@ new_omp_context (gimple *stmt, omp_context > > *outer_ctx) } > > > > ctx->cb.decl_map = new hash_map<tree, tree>; > > + ctx->oacc_addressable_var_decls = new vec<tree> (); > > You then don't have to new it here and delete below. As the context > is cleared with XCNEW, you don't need to do anything here, and just > release when deleting. Note, even if using a pointer for some reason > was needed (not in this case), using unconditional new for something > only used for small subset of contexts is unacceptable, it would be > then desirable to only create when needed. Fixed. > > +/* Record vars listed in private clauses in CLAUSES in CTX. This > > information > > + is used to mark up variables that should be made private > > per-gang. */ + > > +static void > > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) > > +{ > > + tree c; > > + > > + if (!ctx) > > + return; > > + > > + for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) > > + if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE) > > + { > > + tree decl = OMP_CLAUSE_DECL (c); > > + if (VAR_P (decl) && TREE_ADDRESSABLE (decl)) > > + ctx->oacc_addressable_var_decls->safe_push (decl); > > + } > > +} > > You don't want to do this for all GOMP_FOR or GOMP_TARGET context, > I'd hope you only want to do that for OpenACC contexts. Perhaps it > is ok to bail out early if the context isn't OpenACC one. On the > other side, the if (!ctx) condition makes no sense, the callers of > course guarantee that ctx is non-NULL. I'm not sure where that came from -- ctx can be NULL at the top-level of lower_omp as called from execute_lower_omp. Maybe that was left over from an earlier version of the patch. Anyway, I've removed that bit and fixed the patch to only call oacc_record_private_var_clauses in OpenACC contexts. > > @@ -10665,6 +10774,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, > > omp_context *ctx) ctx); > > break; > > case GIMPLE_BIND: > > + oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind > > *> (stmt))); > > Again, why is this done unconditionally? It should be relevant to > gather it only in some subset of context, so guard that and don't do > it otherwise. And here (where ctx *can* be NULL). > > lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), > > ctx); maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> > > (stmt)); break; > > @@ -10905,6 +11015,7 @@ execute_lower_omp (void) > > > > if (all_contexts) > > { > > + splay_tree_foreach (all_contexts, > > process_oacc_gangprivate_1, NULL); > > Similarly. Either guard with if (flag_openacc), or have some flag > cleared at the start of the pass and set only if you find something > interesting so that the splay_tree_foreach does something. I've introduced maybe_oacc_gangprivate_vars, and the splay tree walk is only called if that's true. It's set whenever something's put in oacc_addressable_var_decls in some omp context. Re-tested with offloading to NVPTX. OK? Thanks, Julian
On Fri, Jun 07, 2019 at 03:08:37PM +0100, Julian Brown wrote: > diff --git a/gcc/omp-low.c b/gcc/omp-low.c > index a7f35ffe416..67e1e82ec00 100644 > --- a/gcc/omp-low.c > +++ b/gcc/omp-low.c > @@ -9794,6 +9882,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) > > if (offloaded) > { > + mark_oacc_gangprivate (&ctx->oacc_addressable_var_decls, ctx); > + The above one still doesn't seem to be guarded for OpenACC constructs only. As for the rest of the patch, you need Tom to look over the nvptx changes. Jakub
On 12-06-19 12:22, Jakub Jelinek wrote: > On Fri, Jun 07, 2019 at 03:08:37PM +0100, Julian Brown wrote: >> diff --git a/gcc/omp-low.c b/gcc/omp-low.c >> index a7f35ffe416..67e1e82ec00 100644 >> --- a/gcc/omp-low.c >> +++ b/gcc/omp-low.c >> @@ -9794,6 +9882,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) >> >> if (offloaded) >> { >> + mark_oacc_gangprivate (&ctx->oacc_addressable_var_decls, ctx); >> + > > The above one still doesn't seem to be guarded for OpenACC constructs only. > > As for the rest of the patch, you need Tom to look over the nvptx changes. I haven't seen any nvptx changes mentioned since I ok-ed the nvptx part ( https://gcc.gnu.org/ml/gcc-patches/2018-10/msg00324.html ), so on that basis I'd say it's still ok. Thanks, - Tom
Hi! First, thanks for picking this up, and improving the patch you inherited. Then, just a few individual comments, not a complete review. (As far as I concerned, and as far as relevant, these can be addressed later, incrementally, of course.) I understand right that this will address some aspects of PR90115 "OpenACC: predetermined private levels for variables declared in blocks" (so please mention that one in the ChangeLog updates, and commit log), but it doesn't address all of these aspects (and see also Cesar's list in <http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>), and also not yet PR90114 "Predetermined private levels for variables declared in OpenACC accelerator routines"? On Fri, 7 Jun 2019 15:08:37 +0100, Julian Brown <julian@codesourcery.com> wrote: > --- a/gcc/config/nvptx/nvptx.c > +++ b/gcc/config/nvptx/nvptx.c > @@ -5237,6 +5248,10 @@ nvptx_file_end (void) > write_shared_buffer (asm_out_file, vector_red_sym, > vector_red_align, vector_red_size); > > + if (gangprivate_shared_size) > + write_shared_buffer (asm_out_file, gangprivate_shared_sym, > + gangprivate_shared_align, gangprivate_shared_size); Curious, what is the reason that we maintain this '__gangprivate_shared' variable on a per-file basis instead of on a per-function basis (with names '__gangprivate_shared_[function]', or similar), which should make it more obvious where each block of '.shared' memory belongs to? > --- a/gcc/doc/tm.texi > +++ b/gcc/doc/tm.texi > +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var}) > +This hook, if defined, is used by accelerator target back-ends to expand > +specially handled kinds of VAR_DECL expressions. A particular use is to > +place variables with specific attributes inside special accelarator > +memories. A return value of NULL indicates that the target does not > +handle this VAR_DECL, and normal RTL expanding is resumed. > +@end deftypefn I guess I'm not terribly happy with the 'goacc.expand_accel_var' name. Using different "memories" for specially tagged DECLs seems to be a pretty generic concept (address spaces?), and... > --- a/gcc/expr.c > +++ b/gcc/expr.c > @@ -9974,8 +9974,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode, > exp = SSA_NAME_VAR (ssa_name); > goto expand_decl_rtl; > > - case PARM_DECL: > case VAR_DECL: > + /* Allow accel compiler to handle specific cases of variables, > + specifically those tagged with the "oacc gangprivate" attribute, > + which may be intended to be placed in special memory in GPUs. */ > + if (flag_openacc && targetm.goacc.expand_accel_var) > + { > + temp = targetm.goacc.expand_accel_var (exp); > + if (temp) > + return temp; > + } > + /* ... fall through ... */ > + > + case PARM_DECL: ... I'm thus confused that there isn't already a generic mechanism available in GCC, that we can just use instead of adding a new one here? Thinking about the "address spaces" stuff in 'gcc/target.def' -- or is that the wrong concept? (I'm not familiar with all that, and haven't looked closely.) > --- a/gcc/omp-low.c > +++ b/gcc/omp-low.c > +/* Record vars listed in private clauses in CLAUSES in CTX. This information > + is used to mark up variables that should be made private per-gang. */ > + > +static void > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) > +{ > + for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) > + if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE) > + { > + tree decl = OMP_CLAUSE_DECL (c); > + if (VAR_P (decl) && TREE_ADDRESSABLE (decl)) > + { > + ctx->oacc_addressable_var_decls.safe_push (decl); > + maybe_oacc_gangprivate_vars = true; > + } > + } > +} Are all the relevant variables addressable? And/or, need only those be considered? > +/* Record addressable vars declared in BINDVARS in CTX. This information is > + used to mark up variables that should be made private per-gang. */ > + > +static void > +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars) > +{ > + for (tree v = bindvars; v; v = DECL_CHAIN (v)) > + if (VAR_P (v) && TREE_ADDRESSABLE (v)) > + { > + ctx->oacc_addressable_var_decls.safe_push (v); > + maybe_oacc_gangprivate_vars = true; > + } > +} Likewise. > +/* Mark addressable variables which are declared implicitly or explicitly as > + gang private with a special attribute. These may need to have their > + declarations altered later on in compilation (e.g. in > + execute_oacc_device_lower or the backend, depending on how the OpenACC > + execution model is implemented on a given target) to ensure that sharing > + semantics are correct. */ > + > +static void > +mark_oacc_gangprivate (vec<tree> *decls, omp_context *ctx) > +{ > + int i; > + tree decl; > + > + FOR_EACH_VEC_ELT (*decls, i, decl) > + { > + for (omp_context *thisctx = ctx; thisctx; thisctx = thisctx->outer) > + { > + tree inner_decl = maybe_lookup_decl (decl, thisctx); > + if (inner_decl) > + { > + decl = inner_decl; > + break; > + } > + } > + if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl))) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, > + "Setting 'oacc gangprivate' attribute for decl:"); > + print_generic_decl (dump_file, decl, TDF_SLIM); > + fputc ('\n', dump_file); > + } > + DECL_ATTRIBUTES (decl) > + = tree_cons (get_identifier ("oacc gangprivate"), > + NULL, DECL_ATTRIBUTES (decl)); > + } > + } > +} So I'm confused how that can be done here ('omplower'), given that the decision about how levels of parallelism (gang, worker, vector) are assigned is only done later ('oaccdevlow'), separately/differently per offloading target? The following seems relevant: > +/* Find gang-private variables in a context. */ > + > +static int > +process_oacc_gangprivate (splay_tree_node node, void * ARG_UNUSED (data)) > +{ > + omp_context *ctx = (omp_context *) node->value; > + unsigned level_total = 0; > + omp_context *thisctx; > + > + for (thisctx = ctx; thisctx; thisctx = thisctx->outer) > + level_total += thisctx->oacc_partitioning_levels; > + > + /* If the current context and parent contexts are distributed over a > + total of one parallelism level, we have gang partitioning. */ > + if (level_total == 1) > + mark_oacc_gangprivate (&ctx->oacc_addressable_var_decls, ctx); > + > + return 0; > +} ..., but I didn't quickly manage to grok that. (I shall try harder, later on.) But still then, this looks like it might work for the outer level (gang) only (because all offloading targets are expected to assign gang level to the outermost loop -- might that be the underlying assumption?), but it won't work for inner loop/privatization levels? (..., which I understand this patch isn't doing anything about.) > --- /dev/null > +++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c > @@ -0,0 +1,11 @@ > +/* { dg-do compile } */ > +/* { dg-additional-options "-w" } */ > + > +int > +main (void) > +{ > +#pragma acc parallel > + foo (); > + > + return 0; > +} I think that given your re-work of the implementation (move stuff from front ends into OMP lowering) this test case isn't relevant anymore (was a front end ICE). > --- /dev/null > +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 > @@ -0,0 +1,25 @@ > +! Test for "oacc gangprivate" attribute on gang-private variables > + > +! { dg-do run } > +! { dg-additional-options "-fdump-tree-omplower-details" } > +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' attribute for decl: integer\\(kind=4\\) w;" 1 "omplower" } } */ I prefer if such scanning is placed close to relevant source code constructs, so I'd move this 'scan-tree-dump-times'... > + > +program main > + integer :: w, arr(0:31) > + > + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) > + !$acc loop gang private(w) ... here. (Just to make sure, a Fortran 'integer' will always be 'integer(kind=4)'?) > + do j = 0, 31 > + w = 0 > + !$acc loop seq > + do i = 0, 31 > + !$acc atomic update > + w = w + 1 > + !$acc end atomic > + end do > + arr(j) = w > + end do > + !$acc end parallel > + > + if (any (arr .ne. 32)) stop 1 > +end program main > --- /dev/null > +++ b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90 > @@ -0,0 +1,23 @@ > +! Test for lack of "oacc gangprivate" attribute on worker-private variables > + > +! { dg-do run } > +! { dg-additional-options "-fdump-tree-omplower-details" } > +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' attribute for decl" 0 "omplower" } } */ Likewise... > + > +program main > + integer :: w, arr(0:31) > + > + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) > + !$acc loop gang worker private(w) ... here (I suppose). > + do j = 0, 31 > + w = 0 > + !$acc loop seq > + do i = 0, 31 > + w = w + 1 > + end do > + arr(j) = w > + end do > + !$acc end parallel > + > + if (any (arr .ne. 32)) stop 1 > +end program main Grüße Thomas
On Wed, 12 Jun 2019 13:57:22 +0200 Thomas Schwinge <thomas@codesourcery.com> wrote: > Hi! > > First, thanks for picking this up, and improving the patch you > inherited. Thanks for review! > I understand right that this will address some aspects of PR90115 > "OpenACC: predetermined private levels for variables declared in > blocks" (so please mention that one in the ChangeLog updates, and > commit log), but it doesn't address all of these aspects (and see > also Cesar's list in > <http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>), > and also not yet PR90114 "Predetermined private levels for variables > declared in OpenACC accelerator routines"? There's two possible reasons for placing gang-private variables in shared memory: correct implementation of OpenACC semantics, or optimisation, since shared memory is faster than local memory (on NVidia devices). Handling of private variables is intimately tied with the execution model for gangs/workers/vectors implemented by a particular target: for PTX, that's handled in the backend using a broadcasting/neutering scheme. That is sufficient for code that e.g. sets a variable in worker-single mode and expects to use the value in worker-partitioned mode. The difficulty (semantics-wise) comes when the user wants to do something like an atomic operation in worker-partitioned mode and expects a worker-single variable to be shared across each partitioned worker. Forcing use of shared memory for such variables makes that work properly. It is *not* sufficient for the next level down, though -- expecting to perform atomic operations in vector-partitioned mode on a variable that is declared in vector-single mode, i.e. so that it is supposed to be shared across all vector elements. AFAIK, that's not straightforward, and we haven't attempted to implement it. I think the original motivation for this patch was optimisation, though -- typical code won't try to use atomics in this way. Cesar's list of caveats that you linked to seems to support that notion. > On Fri, 7 Jun 2019 15:08:37 +0100, Julian Brown > <julian@codesourcery.com> wrote: > > --- a/gcc/config/nvptx/nvptx.c > > +++ b/gcc/config/nvptx/nvptx.c > > > @@ -5237,6 +5248,10 @@ nvptx_file_end (void) > > write_shared_buffer (asm_out_file, vector_red_sym, > > vector_red_align, vector_red_size); > > > > + if (gangprivate_shared_size) > > + write_shared_buffer (asm_out_file, gangprivate_shared_sym, > > + gangprivate_shared_align, > > gangprivate_shared_size); > > Curious, what is the reason that we maintain this > '__gangprivate_shared' variable on a per-file basis instead of on a > per-function basis (with names '__gangprivate_shared_[function]', or > similar), which should make it more obvious where each block of > '.shared' memory belongs to? I can't comment on that, I'm afraid that was a part of the patch that I inherited and didn't alter much... > > --- a/gcc/doc/tm.texi > > +++ b/gcc/doc/tm.texi > > > +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree > > @var{var}) +This hook, if defined, is used by accelerator target > > back-ends to expand +specially handled kinds of VAR_DECL > > expressions. A particular use is to +place variables with specific > > attributes inside special accelarator +memories. A return value of > > NULL indicates that the target does not +handle this VAR_DECL, and > > normal RTL expanding is resumed. +@end deftypefn > > I guess I'm not terribly happy with the 'goacc.expand_accel_var' name. > Using different "memories" for specially tagged DECLs seems to be a > pretty generic concept (address spaces?), and... This is partly another NVPTX weirdness -- the target uses address spaces, but only within the backend, and without using the generic middle-end address space machinery. The other reason for using an attribute instead of assigning an address space is that the former can be detected by the target compiler, but will be ignored by the host compiler. Forcing use of an address space this early would mean that the same non-standard address space would have to make sense for both host and offloaded code. For AMD GCN, we do use the generic address space support, and I found that I could re-use the "oacc gangprivate" attribute -- but not the expand_accel_var hook (expand time is too late for that target). Instead, another new hook "TARGET_GOACC_ADJUST_GANGPRIVATE_DECL" is called from omp-offload.c:execute_oacc_device_lower for variables that have the "oacc gangprivate" attribute set. Those bits haven't been posted upstream yet, though. > > --- a/gcc/expr.c > > +++ b/gcc/expr.c > > @@ -9974,8 +9974,19 @@ expand_expr_real_1 (tree exp, rtx target, > > machine_mode tmode, exp = SSA_NAME_VAR (ssa_name); > > goto expand_decl_rtl; > > > > - case PARM_DECL: > > case VAR_DECL: > > + /* Allow accel compiler to handle specific cases of > > variables, > > + specifically those tagged with the "oacc gangprivate" > > attribute, > > + which may be intended to be placed in special memory in > > GPUs. */ > > + if (flag_openacc && targetm.goacc.expand_accel_var) > > + { > > + temp = targetm.goacc.expand_accel_var (exp); > > + if (temp) > > + return temp; > > + } > > + /* ... fall through ... */ > > + > > + case PARM_DECL: > > ... I'm thus confused that there isn't already a generic mechanism > available in GCC, that we can just use instead of adding a new one > here? Thinking about the "address spaces" stuff in 'gcc/target.def' > -- or is that the wrong concept? (I'm not familiar with all that, > and haven't looked closely.) Same point again -- the same address space would have to be supported on the host and offload compiler. I'm happy to accept suggestions for another name for the hook though? > > --- a/gcc/omp-low.c > > +++ b/gcc/omp-low.c > > > +/* Record vars listed in private clauses in CLAUSES in CTX. This > > information > > + is used to mark up variables that should be made private > > per-gang. */ + > > +static void > > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) > > +{ > > + for (tree c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) > > + if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE) > > + { > > + tree decl = OMP_CLAUSE_DECL (c); > > + if (VAR_P (decl) && TREE_ADDRESSABLE (decl)) > > + { > > + ctx->oacc_addressable_var_decls.safe_push (decl); > > + maybe_oacc_gangprivate_vars = true; > > + } > > + } > > +} > > Are all the relevant variables addressable? And/or, need only those > be considered? Yes, I believe so. At least from a correctness perspective, a non-addressable variable can't be accessed outside the current thread, so it can go in a (faster than shared memory) register -- though that register may need to be broadcast in some circumstances. A variable can only meaningfully be "shared" across workers or vector lanes if its address is taken, e.g. by a call to an atomic builtin. From an optimisation perspective, the answer might be fuzzier: maybe sometimes, using shared memory directly would be faster than broadcasting. > > +/* Record addressable vars declared in BINDVARS in CTX. This > > information is > > + used to mark up variables that should be made private > > per-gang. */ + > > +static void > > +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars) > > +{ > > + for (tree v = bindvars; v; v = DECL_CHAIN (v)) > > + if (VAR_P (v) && TREE_ADDRESSABLE (v)) > > + { > > + ctx->oacc_addressable_var_decls.safe_push (v); > > + maybe_oacc_gangprivate_vars = true; > > + } > > +} > > Likewise. > > > > +/* Mark addressable variables which are declared implicitly or > > explicitly as > > + gang private with a special attribute. These may need to have > > their > > + declarations altered later on in compilation (e.g. in > > + execute_oacc_device_lower or the backend, depending on how the > > OpenACC > > + execution model is implemented on a given target) to ensure > > that sharing > > + semantics are correct. */ > > + > > +static void > > +mark_oacc_gangprivate (vec<tree> *decls, omp_context *ctx) > > +{ > > + int i; > > + tree decl; > > + > > + FOR_EACH_VEC_ELT (*decls, i, decl) > > + { > > + for (omp_context *thisctx = ctx; thisctx; thisctx = > > thisctx->outer) > > + { > > + tree inner_decl = maybe_lookup_decl (decl, thisctx); > > + if (inner_decl) > > + { > > + decl = inner_decl; > > + break; > > + } > > + } > > + if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES > > (decl))) > > + { > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + { > > + fprintf (dump_file, > > + "Setting 'oacc gangprivate' attribute for > > decl:"); > > + print_generic_decl (dump_file, decl, TDF_SLIM); > > + fputc ('\n', dump_file); > > + } > > + DECL_ATTRIBUTES (decl) > > + = tree_cons (get_identifier ("oacc gangprivate"), > > + NULL, DECL_ATTRIBUTES (decl)); > > + } > > + } > > +} > > So I'm confused how that can be done here ('omplower'), given that the > decision about how levels of parallelism (gang, worker, vector) are > assigned is only done later ('oaccdevlow'), separately/differently per > offloading target? > > The following seems relevant: > > > +/* Find gang-private variables in a context. */ > > + > > +static int > > +process_oacc_gangprivate (splay_tree_node node, void * ARG_UNUSED > > (data)) +{ > > + omp_context *ctx = (omp_context *) node->value; > > + unsigned level_total = 0; > > + omp_context *thisctx; > > + > > + for (thisctx = ctx; thisctx; thisctx = thisctx->outer) > > + level_total += thisctx->oacc_partitioning_levels; > > + > > + /* If the current context and parent contexts are distributed > > over a > > + total of one parallelism level, we have gang partitioning. */ > > + if (level_total == 1) > > + mark_oacc_gangprivate (&ctx->oacc_addressable_var_decls, ctx); > > + > > + return 0; > > +} > > ..., but I didn't quickly manage to grok that. (I shall try harder, > later on.) > > But still then, this looks like it might work for the outer level > (gang) only (because all offloading targets are expected to assign > gang level to the outermost loop -- might that be the underlying > assumption?), but it won't work for inner loop/privatization levels? > (..., which I understand this patch isn't doing anything about.) The "oacc gangprivate" only applies to variables that are (addressable and) private per-gang, but the attribute marking works on both top-level "acc parallel" directives and "acc loop" directives below that -- so long as they don't explicitly use parallelism finer than "gang" level. It also works on variables declared private() using OpenACC clauses in all supported languages, or those that are declared in an appropriate C/C++ scope. At least for loops with reductions, gang-partitioned loops have different semantics from worker and vector-partitioned loops. So I think in general, it must be the case that it is possible to analyse OpenACC code "lexically" to determine which loops are gang partitioned, and which are partitioned at finer levels. It can't be deferred entirely to the target. It's been a while since I read those bits of the standard, though! But yes, in GCC, omp-low only tries to calculate the maximum partitioning level for each loop nest. The final determination isn't made until oaccdevlow time. That's OK if shared memory is being used only as an optimisation, much less OK if it's a necessary part of implementing OpenACC semantics properly. It might be more of an issue if we tried to support "vector-shared" variables properly. > > --- /dev/null > > +++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c > > @@ -0,0 +1,11 @@ > > +/* { dg-do compile } */ > > +/* { dg-additional-options "-w" } */ > > + > > +int > > +main (void) > > +{ > > +#pragma acc parallel > > + foo (); > > + > > + return 0; > > +} > > I think that given your re-work of the implementation (move stuff from > front ends into OMP lowering) this test case isn't relevant anymore > (was a front end ICE). OK, I can remove that. > > --- /dev/null > > +++ > > b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90 > > @@ -0,0 +1,25 @@ +! Test for "oacc gangprivate" attribute on > > gang-private variables + > > +! { dg-do run } > > +! { dg-additional-options "-fdump-tree-omplower-details" } > > +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' > > attribute for decl: integer\\(kind=4\\) w;" 1 "omplower" } } */ > > I prefer if such scanning is placed close to relevant source code > constructs, so I'd move this 'scan-tree-dump-times'... > > > + > > +program main > > + integer :: w, arr(0:31) > > + > > + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) > > + !$acc loop gang private(w) > > ... here. > > (Just to make sure, a Fortran 'integer' will always be > 'integer(kind=4)'?) No idea! I can check. > > + do j = 0, 31 > > + w = 0 > > + !$acc loop seq > > + do i = 0, 31 > > + !$acc atomic update > > + w = w + 1 > > + !$acc end atomic > > + end do > > + arr(j) = w > > + end do > > + !$acc end parallel > > + > > + if (any (arr .ne. 32)) stop 1 > > +end program main > > > --- /dev/null > > +++ > > b/libgomp/testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90 > > @@ -0,0 +1,23 @@ +! Test for lack of "oacc gangprivate" attribute > > on worker-private variables + > > +! { dg-do run } > > +! { dg-additional-options "-fdump-tree-omplower-details" } > > +! { dg-final { scan-tree-dump-times "Setting 'oacc gangprivate' > > attribute for decl" 0 "omplower" } } */ > > Likewise... > > > + > > +program main > > + integer :: w, arr(0:31) > > + > > + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) > > + !$acc loop gang worker private(w) > > ... here (I suppose). > > > + do j = 0, 31 > > + w = 0 > > + !$acc loop seq > > + do i = 0, 31 > > + w = w + 1 > > + end do > > + arr(j) = w > > + end do > > + !$acc end parallel > > + > > + if (any (arr .ne. 32)) stop 1 > > +end program main Thanks, Julian
Hi! This is a new patch that takes a different approach to the last-posted version in this thread. I have combined the previous incremental patches on the og9 branch that culminated in the following patch: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01220.html From that email, the following explanation was given of the previous approaches taken as to how the partitioning level for OpenACC "private" variables was calculated and represented in the compiler, and how this patch differs: - The first (by Chung-Lin Tang) recorded which variables should be made private per-gang in each front end (i.e. separately in C, C++ and Fortran) using a new attribute "oacc gangprivate". This was deemed too early; the final determination about which loops are assigned which parallelism level has not yet been made at parse time. - The second, last discussed here: https://gcc.gnu.org/ml/gcc-patches/2019-06/msg00726.html moved the analysis of OpenACC contexts to determine parallelism levels to omp-low.c (but kept the "oacc gangprivate" attribute and the NVPTX backend parts). However (as mentioned in that mail), this is still too early: in fact the final determination of the parallelism level for each loop (especially for loops without explicit gang/worker/vector clauses) does not happen until we reach the device compiler, in the oaccloops pass. This patch builds on the second approach, but delays fixing the parallelism level of each "private" variable (those that are addressable, and declared private using OpenACC clauses or by defining them in a scope nested within a compute region or partitioned loop) until the oaccdevlow pass. This is done by adding a new internal UNIQUE function (OACC_PRIVATE) that lists (the address of) each private variable as an argument. These new internal functions fit into the existing scheme for demarking OpenACC loops, as described in comments in the patch. Use of the "oacc gangprivate" attribute is now restricted to the NVPTX backend (and could probably be replaced with some lighter-weight mechanism as a followup). I realised I omitted to make some of the cosmetic changes Thomas highlighted below on starting to write this email, but I can do that (with suitable retesting) if desired before committing. On Wed, 12 Jun 2019 20:42:16 +0100 Julian Brown <julian@codesourcery.com> wrote: > On Wed, 12 Jun 2019 13:57:22 +0200 > Thomas Schwinge <thomas@codesourcery.com> wrote: > > > I understand right that this will address some aspects of PR90115 > > "OpenACC: predetermined private levels for variables declared in > > blocks" (so please mention that one in the ChangeLog updates, and > > commit log), but it doesn't address all of these aspects (and see > > also Cesar's list in > > <http://mid.mail-archive.com/70d27ebd-762e-59a3-082f-48fa0c687212@codesourcery.com>), > > and also not yet PR90114 "Predetermined private levels for variables > > declared in OpenACC accelerator routines"? > > There's two possible reasons for placing gang-private variables in > shared memory: correct implementation of OpenACC semantics, or > optimisation, since shared memory is faster than local memory (on > NVidia devices). Handling of private variables is intimately tied > with the execution model for gangs/workers/vectors implemented by a > particular target: for PTX, that's handled in the backend using a > broadcasting/neutering scheme. > > That is sufficient for code that e.g. sets a variable in worker-single > mode and expects to use the value in worker-partitioned mode. The > difficulty (semantics-wise) comes when the user wants to do something > like an atomic operation in worker-partitioned mode and expects a > worker-single variable to be shared across each partitioned worker. > Forcing use of shared memory for such variables makes that work > properly. > > It is *not* sufficient for the next level down, though -- expecting to > perform atomic operations in vector-partitioned mode on a variable > that is declared in vector-single mode, i.e. so that it is supposed to > be shared across all vector elements. AFAIK, that's not > straightforward, and we haven't attempted to implement it. > > I think the original motivation for this patch was optimisation, > though -- typical code won't try to use atomics in this way. Cesar's > list of caveats that you linked to seems to support that notion. After a little further investigation, I came to the conclusion that the patch was always originally about correctness, but optimisation. But that's largely academic now. > > I guess I'm not terribly happy with the 'goacc.expand_accel_var' > > name. Using different "memories" for specially tagged DECLs seems > > to be a pretty generic concept (address spaces?), and... > > This is partly another NVPTX weirdness -- the target uses address > spaces, but only within the backend, and without using the generic > middle-end address space machinery. The other reason for using an > attribute instead of assigning an address space is that the former can > be detected by the target compiler, but will be ignored by the host > compiler. Forcing use of an address space this early would mean that > the same non-standard address space would have to make sense for both > host and offloaded code. > > For AMD GCN, we do use the generic address space support, and I found > that I could re-use the "oacc gangprivate" attribute -- but not the > expand_accel_var hook (expand time is too late for that target). > Instead, another new hook "TARGET_GOACC_ADJUST_GANGPRIVATE_DECL" is > called from omp-offload.c:execute_oacc_device_lower for variables that > have the "oacc gangprivate" attribute set. Those bits haven't been > posted upstream yet, though. This patch uses both target hooks -- the TARGET_GOACC_ADJUST_PRIVATE_DECL (renamed), and TARGET_GOACC_EXPAND_ACCEL_VAR. The first can tweak the decl at oaccdevlow time, and the second at expand time. This version of the patch doesn't provide full support for gang-private variables on AMD GCN yet though, since that depends on other code that hasn't been upstreamed yet. (GCN works with the equivalent patch to this on the og9 branch though.) > > > --- a/gcc/expr.c > > > +++ b/gcc/expr.c > > > @@ -9974,8 +9974,19 @@ expand_expr_real_1 (tree exp, rtx target, > > > machine_mode tmode, exp = SSA_NAME_VAR (ssa_name); > > > goto expand_decl_rtl; > > > > > > - case PARM_DECL: > > > case VAR_DECL: > > > + /* Allow accel compiler to handle specific cases of > > > variables, > > > + specifically those tagged with the "oacc gangprivate" > > > attribute, > > > + which may be intended to be placed in special memory in > > > GPUs. */ > > > + if (flag_openacc && targetm.goacc.expand_accel_var) > > > + { > > > + temp = targetm.goacc.expand_accel_var (exp); > > > + if (temp) > > > + return temp; > > > + } > > > + /* ... fall through ... */ > > > + > > > + case PARM_DECL: > > > > ... I'm thus confused that there isn't already a generic mechanism > > available in GCC, that we can just use instead of adding a new one > > here? Thinking about the "address spaces" stuff in 'gcc/target.def' > > -- or is that the wrong concept? (I'm not familiar with all that, > > and haven't looked closely.) > > Same point again -- the same address space would have to be supported > on the host and offload compiler. I'm happy to accept suggestions for > another name for the hook though? (Still not renamed in this version, sorry.) > > > +/* Mark addressable variables which are declared implicitly or > > > explicitly as > > > + gang private with a special attribute. These may need to have > > > their > > > + declarations altered later on in compilation (e.g. in > > > + execute_oacc_device_lower or the backend, depending on how the > > > OpenACC > > > + execution model is implemented on a given target) to ensure > > > that sharing > > > + semantics are correct. */ > > > + > > > +static void > > > +mark_oacc_gangprivate (vec<tree> *decls, omp_context *ctx) > > > +{ > > > + int i; > > > + tree decl; > > > + > > > + FOR_EACH_VEC_ELT (*decls, i, decl) > > > + { > > > + for (omp_context *thisctx = ctx; thisctx; thisctx = > > > thisctx->outer) > > > + { > > > + tree inner_decl = maybe_lookup_decl (decl, thisctx); > > > + if (inner_decl) > > > + { > > > + decl = inner_decl; > > > + break; > > > + } > > > + } > > > + if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES > > > (decl))) > > > + { > > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > > + { > > > + fprintf (dump_file, > > > + "Setting 'oacc gangprivate' attribute for > > > decl:"); > > > + print_generic_decl (dump_file, decl, TDF_SLIM); > > > + fputc ('\n', dump_file); > > > + } > > > + DECL_ATTRIBUTES (decl) > > > + = tree_cons (get_identifier ("oacc gangprivate"), > > > + NULL, DECL_ATTRIBUTES (decl)); > > > + } > > > + } > > > +} > > > > So I'm confused how that can be done here ('omplower'), given that > > the decision about how levels of parallelism (gang, worker, vector) > > are assigned is only done later ('oaccdevlow'), > > separately/differently per offloading target? > > > > The following seems relevant: > > > > > +/* Find gang-private variables in a context. */ > > > + > > > +static int > > > +process_oacc_gangprivate (splay_tree_node node, void * ARG_UNUSED > > > (data)) +{ > > > + omp_context *ctx = (omp_context *) node->value; > > > + unsigned level_total = 0; > > > + omp_context *thisctx; > > > + > > > + for (thisctx = ctx; thisctx; thisctx = thisctx->outer) > > > + level_total += thisctx->oacc_partitioning_levels; > > > + > > > + /* If the current context and parent contexts are distributed > > > over a > > > + total of one parallelism level, we have gang partitioning. > > > */ > > > + if (level_total == 1) > > > + mark_oacc_gangprivate (&ctx->oacc_addressable_var_decls, > > > ctx); + > > > + return 0; > > > +} > > > > ..., but I didn't quickly manage to grok that. (I shall try harder, > > later on.) > > > > But still then, this looks like it might work for the outer level > > (gang) only (because all offloading targets are expected to assign > > gang level to the outermost loop -- might that be the underlying > > assumption?), but it won't work for inner loop/privatization levels? > > (..., which I understand this patch isn't doing anything about.) > > The "oacc gangprivate" only applies to variables that are (addressable > and) private per-gang, but the attribute marking works on both > top-level "acc parallel" directives and "acc loop" directives below > that -- so long as they don't explicitly use parallelism finer than > "gang" level. It also works on variables declared private() using > OpenACC clauses in all supported languages, or those that are declared > in an appropriate C/C++ scope. > > At least for loops with reductions, gang-partitioned loops have > different semantics from worker and vector-partitioned loops. So I > think in general, it must be the case that it is possible to analyse > OpenACC code "lexically" to determine which loops are gang > partitioned, and which are partitioned at finer levels. It can't be > deferred entirely to the target. It's been a while since I read those > bits of the standard, though! > > But yes, in GCC, omp-low only tries to calculate the maximum > partitioning level for each loop nest. The final determination isn't > made until oaccdevlow time. That's OK if shared memory is being used > only as an optimisation, much less OK if it's a necessary part of > implementing OpenACC semantics properly. It might be more of an issue > if we tried to support "vector-shared" variables properly. So: this version moves the partitioning-level calculation for private variables out of omp-low, so this isn't an issue any more. Variables are privatized according to the "true" partitioning level of the scope inside the parallel region that they are associated with (i.e. "parallel" region, or loop). > > > + > > > +program main > > > + integer :: w, arr(0:31) > > > + > > > + !$acc parallel num_gangs(32) num_workers(32) copyout(arr) > > > + !$acc loop gang private(w) > > > > ... here. > > > > (Just to make sure, a Fortran 'integer' will always be > > 'integer(kind=4)'?) > > No idea! I can check. That's a yes, I think. Re-tested with offloading to nvptx. OK for mainline? Thanks, Julian 2019-11-06 Julian Brown <julian@codesourcery.com> Chung-Lin Tang <cltang@codesourcery.com> gcc/ * config/gcn/gcn-protos.h (gcn_goacc_adjust_gangprivate_decl): Rename to... (gcn_goacc_adjust_private_decl): ...this. Add and use LEVEL parameter. * config/gcn/gcn-tree.c (gcn_goacc_adjust_gangprivate_decl): Rename to... (gcn_goacc_adjust_private_decl): ...this. Add LEVEL parameter. * config/gcn/gcn.c (TARGET_GOACC_ADJUST_GANGPRIVATE_DECL): Delete. (TARGET_GOACC_ADJUST_PRIVATE_DECL): Define using renamed gcn_goacc_adjust_private_decl. * config/nvptx/nvptx.c (tree-hash-traits.h, tree-pretty-print.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_adjust_private_decl): New function. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_GOACC_ADJUST_PRIVATE_DECL, TARGET_GOACC_EXPAND_ACCEL_VAR): Define hooks. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR, TARGET_GOACC_ADJUST_PRIVATE_DECL): Place new documentation hooks. * doc/tm.texi: Regenerate. * expr.c (expand_expr_real_1): Expand decls using the expand_accel_var OpenACC hook if defined. * internal-fn.c (expand_UNIQUE): Handle IFN_UNIQUE_OACC_PRIVATE. * internal-fn.h (IFN_UNIQUE_CODES): Add OACC_PRIVATE. * omp-low.c (omp_context): Add oacc_addressable_var_decls field. (new_omp_context): Initialize oacc_addressable_var_decls in new omp_context. (delete_omp_context): Delete oacc_addressable_var_decls in old omp_context. (lower_oacc_reductions): Add PRIVATE_MARKER parameter. Insert private marker before fork. (lower_oacc_head_tail): Add PRIVATE_MARKER parameter. Modify private marker's gimple call arguments, and pass it to lower_oacc_reductions. (oacc_record_private_var_clauses, oacc_record_vars_in_bind, make_oacc_private_marker): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call oacc_record_vars_in_bind for OpenACC contexts. Create private marker and pass to lower_oacc_head_tail. (lower_omp_target): Create private marker and pass to lower_oacc_reductions. (lower_omp_1): Call oacc_record_vars_in_bind for OpenACC bind contexts. * omp-offload.c (convert.h): Include. (oacc_loop_xform_head_tail): Treat private-variable markers like fork/join when transforming head/tail sequences. (execute_oacc_device_lower): Use IFN_UNIQUE_OACC_PRIVATE to determine partitioning level of private variables, and process any found via adjust_private_decl target hook. * target.def (expand_accel_var, adjust_private_decl): New target hooks. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-fortran/gangprivate-attrib-1.f90: New test. * testsuite/libgomp.oacc-fortran/gangprivate-attrib-2.f90: New test.
Hi! On 2019-06-07T15:08:37+0100, Julian Brown <julian@codesourcery.com> wrote: > Hi Jakub, > > Thanks for the review! I believe I've addressed all your comments in > the attached version of the patch. > > On Mon, 3 Jun 2019 18:23:00 +0200 > Jakub Jelinek <jakub@redhat.com> wrote: >> > +/* Record vars listed in private clauses in CLAUSES in CTX. This information >> > + is used to mark up variables that should be made private per-gang. */ >> > + >> > +static void >> > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) >> > +{ >> > + [...] >> > +} >> >> You don't want to do this for all GOMP_FOR or GOMP_TARGET context, >> I'd hope you only want to do that for OpenACC contexts. > I've [...] fixed the patch to only call oacc_record_private_var_clauses in > OpenACC contexts. > commit 6c2a018b940d0b132395048b0600f7d897319ee2 > Author: Julian Brown <julian@codesourcery.com> > Date: Thu Aug 9 20:27:04 2018 -0700 > > [OpenACC] Add support for gang local storage allocation in shared memory > --- a/gcc/omp-low.c > +++ b/gcc/omp-low.c > @@ -8599,6 +8681,9 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) > > push_gimplify_context (); > > + if (is_gimple_omp_oacc (ctx->stmt)) > + oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt)); > + > lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx); > > block = make_node (BLOCK); So, yes -- but then, apparently, that again got lost in a later version of the patch. ;-) I've pushed "[OpenACC privatization] Don't evaluate OpenMP 'for' clauses [PR90115]" to master branch in commit 3a285ebd0cf5ab762726018515d23280fa6dd445, see attached. Grüße Thomas ----------------- Mentor Graphics (Deutschland) GmbH, Arnulfstrasse 201, 80634 München Registergericht München HRB 106955, Geschäftsführer: Thomas Heurung, Frank Thürauf
Hi Julian! Two more questions here, in context of <https://gcc.gnu.org/PR102330> "[12 Regression] ICE in expand_gimple_stmt_1, at cfgexpand.c:3932 since r12-980-g29a2f51806c": On 2019-06-03T17:02:45+0100, Julian Brown <julian@codesourcery.com> wrote: > This is a new version of the patch, rebased The code as we've now got it in master branch has changed some more, but I think the behavior I'm seeing may have been introduced here: > and with a couple of > additional bugfixes, as follows: > > Firstly, in mark_oacc_gangprivate, each decl is looked up (using > maybe_lookup_decl) to apply the "oacc gangprivate" attribute to the > innermost-nested copy of the decl. > --- a/gcc/omp-low.c > +++ b/gcc/omp-low.c > @@ -137,6 +137,12 @@ struct omp_context > + /* Addressable variable decls in this context. */ > + vec<tree> *oacc_addressable_var_decls; > }; > +/* Record vars listed in private clauses in CLAUSES in CTX. This information > + is used to mark up variables that should be made private per-gang. */ > + > +static void > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) > +{ > + tree c; > + > + if (!ctx) > + return; > + > + for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) > + if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE) > + { > + tree decl = OMP_CLAUSE_DECL (c); > + if (VAR_P (decl) && TREE_ADDRESSABLE (decl)) > + ctx->oacc_addressable_var_decls->safe_push (decl); > + } > +} So, here we analyze 'OMP_CLAUSE_DECL (c)' (as is, without translation through 'lookup_decl (decl, ctx)')... > +/* Record addressable vars declared in BINDVARS in CTX. This information is > + used to mark up variables that should be made private per-gang. */ > + > +static void > +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars) > +{ > + if (!ctx) > + return; > + > + for (tree v = bindvars; v; v = DECL_CHAIN (v)) > + if (VAR_P (v) && TREE_ADDRESSABLE (v)) > + ctx->oacc_addressable_var_decls->safe_push (v); > +} ..., and similarly here analyze 'v' (without 'lookup_decl (v, ctx)')... > +/* Mark addressable variables which are declared implicitly or explicitly as > + gang private with a special attribute. These may need to have their > + declarations altered later on in compilation (e.g. in > + execute_oacc_device_lower or the backend, depending on how the OpenACC > + execution model is implemented on a given target) to ensure that sharing > + semantics are correct. */ > + > +static void > +mark_oacc_gangprivate (vec<tree> *decls, omp_context *ctx) > +{ > + int i; > + tree decl; > + > + FOR_EACH_VEC_ELT (*decls, i, decl) > + { > + for (omp_context *thisctx = ctx; thisctx; thisctx = thisctx->outer) > + { > + tree inner_decl = maybe_lookup_decl (decl, thisctx); > + if (inner_decl) > + { > + decl = inner_decl; > + break; > + } > + } > + if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (decl))) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, > + "Setting 'oacc gangprivate' attribute for decl:"); > + print_generic_decl (dump_file, decl, TDF_SLIM); > + fputc ('\n', dump_file); > + } > + DECL_ATTRIBUTES (decl) > + = tree_cons (get_identifier ("oacc gangprivate"), > + NULL, DECL_ATTRIBUTES (decl)); > + } > + } > +} ..., but here we action on the 'maybe_lookup_decl'-translated 'inner_decl', if applicable. In certain cases that one may be different from the original 'decl'. (In particular (only?), when the OMP lowering has made 'decl' "late 'TREE_ADDRESSABLE'".) This assymetry I understand to give rise to <https://gcc.gnu.org/PR102330> "[12 Regression] ICE in expand_gimple_stmt_1, at cfgexpand.c:3932 since r12-980-g29a2f51806c". It makes sense to me that we do the OpenACC privatization on the 'lookup_decl' -- but shouldn't we then do that in the analysis phase, too? (This appears to work fine for OpenACC 'private' clauses (..., and avoids marking a few as addressable/gang-private), and for those in 'gimple_bind_vars' it doesn't seem to make a difference (for the current test cases and/or compiler transformations).) And, second question: what case did you run into or foresee, that you here need the 'thisctx' loop and 'maybe_lookup_decl', instead of a plain 'lookup_decl (decl, ctx)'? Per my testing that's sufficient. Unless you think this needs more consideration, I suggest to do these two changes. (I have a WIP patch in testing.) Grüße Thomas ----------------- Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
On Mon, 14 Feb 2022 16:56:35 +0100 Thomas Schwinge <thomas@codesourcery.com> wrote: > Hi Julian! > > Two more questions here, in context of <https://gcc.gnu.org/PR102330> > "[12 Regression] ICE in expand_gimple_stmt_1, at cfgexpand.c:3932 > since r12-980-g29a2f51806c": > > On 2019-06-03T17:02:45+0100, Julian Brown <julian@codesourcery.com> > wrote: > > +/* Record vars listed in private clauses in CLAUSES in CTX. This > > information > > + is used to mark up variables that should be made private > > per-gang. */ + > > +static void > > +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) > > +{ > > + tree c; > > + > > + if (!ctx) > > + return; > > + > > + for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) > > + if (OMP_CLAUSE_CODE (c) == OMP_CLAUSE_PRIVATE) > > + { > > + tree decl = OMP_CLAUSE_DECL (c); > > + if (VAR_P (decl) && TREE_ADDRESSABLE (decl)) > > + ctx->oacc_addressable_var_decls->safe_push (decl); > > + } > > +} > > So, here we analyze 'OMP_CLAUSE_DECL (c)' (as is, without translation > through 'lookup_decl (decl, ctx)')... I think you're right that this one should be using lookup_decl, but... > > +/* Record addressable vars declared in BINDVARS in CTX. This > > information is > > + used to mark up variables that should be made private per-gang. > > */ + > > +static void > > +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars) > > +{ > > + if (!ctx) > > + return; > > + > > + for (tree v = bindvars; v; v = DECL_CHAIN (v)) > > + if (VAR_P (v) && TREE_ADDRESSABLE (v)) > > + ctx->oacc_addressable_var_decls->safe_push (v); > > +} > > ..., and similarly here analyze 'v' (without 'lookup_decl (v, > ctx)')... I'm not so sure about this one: if the variables are declared at a particular binding level, I think they have to be in the current OMP context (and thus shadow any definitions that might be present in the parent context)? Maybe that can be confirmed via an assertion. > > +/* Mark addressable variables which are declared implicitly or > > explicitly as > > + gang private with a special attribute. These may need to have > > their > > + declarations altered later on in compilation (e.g. in > > + execute_oacc_device_lower or the backend, depending on how the > > OpenACC > > + execution model is implemented on a given target) to ensure > > that sharing > > + semantics are correct. */ > > + > > +static void > > +mark_oacc_gangprivate (vec<tree> *decls, omp_context *ctx) > > +{ > > + int i; > > + tree decl; > > + > > + FOR_EACH_VEC_ELT (*decls, i, decl) > > + { > > + for (omp_context *thisctx = ctx; thisctx; thisctx = > > thisctx->outer) > > + { > > + tree inner_decl = maybe_lookup_decl (decl, thisctx); > > + if (inner_decl) > > + { > > + decl = inner_decl; > > + break; > > + } > > + } > > + if (!lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES > > (decl))) > > + { > > + if (dump_file && (dump_flags & TDF_DETAILS)) > > + { > > + fprintf (dump_file, > > + "Setting 'oacc gangprivate' attribute for > > decl:"); > > + print_generic_decl (dump_file, decl, TDF_SLIM); > > + fputc ('\n', dump_file); > > + } > > + DECL_ATTRIBUTES (decl) > > + = tree_cons (get_identifier ("oacc gangprivate"), > > + NULL, DECL_ATTRIBUTES (decl)); > > + } > > + } > > +} > > ..., but here we action on the 'maybe_lookup_decl'-translated > 'inner_decl', if applicable. In certain cases that one may be > different from the original 'decl'. (In particular (only?), when the > OMP lowering has made 'decl' "late 'TREE_ADDRESSABLE'".) This > assymetry I understand to give rise to <https://gcc.gnu.org/PR102330> > "[12 Regression] ICE in expand_gimple_stmt_1, at cfgexpand.c:3932 > since r12-980-g29a2f51806c". > > It makes sense to me that we do the OpenACC privatization on the > 'lookup_decl' -- but shouldn't we then do that in the analysis phase, > too? (This appears to work fine for OpenACC 'private' clauses (..., > and avoids marking a few as addressable/gang-private), and for those > in 'gimple_bind_vars' it doesn't seem to make a difference (for the > current test cases and/or compiler transformations).) Yes, I think you're right. > And, second question: what case did you run into or foresee, that you > here need the 'thisctx' loop and 'maybe_lookup_decl', instead of a > plain 'lookup_decl (decl, ctx)'? Per my testing that's sufficient. I'd probably misunderstood about lookup_decl walking up through parent contexts itself... oops. > Unless you think this needs more consideration, I suggest to do these > two changes. (I have a WIP patch in testing.) Sounds good to me. Thank you, Julian
commit 9637e7ea887e100f35d99b8d12101f9f8a9b94e3 Author: Julian Brown <julian@codesourcery.com> Date: Thu Aug 9 20:27:04 2018 -0700 [OpenACC] Add support for gang local storage allocation in shared memory 2018-08-10 Julian Brown <julian@codesourcery.com> Chung-Lin Tang <cltang@codesourcery.com> gcc/ * config/nvptx/nvptx.c (tree-hash-traits.h): Include. (gangprivate_shared_size): New global variable. (gangprivate_shared_align): Likewise. (gangprivate_shared_sym): Likewise. (gangprivate_shared_hmap): Likewise. (nvptx_option_override): Initialize gangprivate_shared_sym, gangprivate_shared_align. (nvptx_file_end): Output gangprivate_shared_sym. (nvptx_goacc_expand_accel_var): New function. (nvptx_set_current_function): New function. (TARGET_SET_CURRENT_FUNCTION): Define hook. (TARGET_GOACC_EXPAND_ACCEL): Likewise. * doc/tm.texi (TARGET_GOACC_EXPAND_ACCEL_VAR): Document new hook. * doc/tm.texi.in (TARGET_GOACC_EXPAND_ACCEL_VAR): Likewise. * expr.c (expand_expr_real_1): Remap decls marked with the "oacc gangprivate" atttribute. * omp-low.c (omp_context): Add oacc_partitioning_level and oacc_decls fields. (new_omp_context): Initialize oacc_decls in new omp_context. (delete_omp_context): Delete oacc_decls in old omp_context. (lower_oacc_head_tail): Record partitioning-level count in omp context. (oacc_record_private_var_clauses, oacc_record_vars_in_bind) (mark_oacc_gangprivate): New functions. (lower_omp_for): Call oacc_record_private_var_clauses with "for" clauses. Call mark_oacc_gangprivate for gang-partitioned loops. (lower_omp_target): Call oacc_record_private_var_clauses with "target" clauses. Call mark_oacc_gangprivate for offloaded target regions. (lower_omp_1): Call vars_in_bind for GIMPLE_BIND within OMP regions. * target.def (expand_accel_var): New hook. libgomp/ * testsuite/libgomp.oacc-c-c++-common/gang-private-1.c: New test. * testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c: New test. * testsuite/libgomp.oacc-c/pr85465.c: New test. diff --git a/gcc/config/nvptx/nvptx.c b/gcc/config/nvptx/nvptx.c index c0b0a2e..14eb842 100644 --- a/gcc/config/nvptx/nvptx.c +++ b/gcc/config/nvptx/nvptx.c @@ -73,6 +73,7 @@ #include "cfgloop.h" #include "fold-const.h" #include "intl.h" +#include "tree-hash-traits.h" /* This file should be included last. */ #include "target-def.h" @@ -137,6 +138,12 @@ static unsigned worker_red_size; static unsigned worker_red_align; static GTY(()) rtx worker_red_sym; +/* Shared memory block for gang-private variables. */ +static unsigned gangprivate_shared_size; +static unsigned gangprivate_shared_align; +static GTY(()) rtx gangprivate_shared_sym; +static hash_map<tree_decl_hash, unsigned int> gangprivate_shared_hmap; + /* Global lock variable, needed for 128bit worker & gang reductions. */ static GTY(()) tree global_lock_var; @@ -210,6 +217,10 @@ nvptx_option_override (void) SET_SYMBOL_DATA_AREA (worker_red_sym, DATA_AREA_SHARED); worker_red_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + gangprivate_shared_sym = gen_rtx_SYMBOL_REF (Pmode, "__gangprivate_shared"); + SET_SYMBOL_DATA_AREA (gangprivate_shared_sym, DATA_AREA_SHARED); + gangprivate_shared_align = GET_MODE_ALIGNMENT (SImode) / BITS_PER_UNIT; + diagnose_openacc_conflict (TARGET_GOMP, "-mgomp"); diagnose_openacc_conflict (TARGET_SOFT_STACK, "-msoft-stack"); diagnose_openacc_conflict (TARGET_UNIFORM_SIMT, "-muniform-simt"); @@ -4968,6 +4979,10 @@ nvptx_file_end (void) write_worker_buffer (asm_out_file, worker_red_sym, worker_red_align, worker_red_size); + if (gangprivate_shared_size) + write_worker_buffer (asm_out_file, gangprivate_shared_sym, + gangprivate_shared_align, gangprivate_shared_size); + if (need_softstack_decl) { write_var_marker (asm_out_file, false, true, "__nvptx_stacks"); @@ -5915,6 +5930,47 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t) return false; } +static rtx +nvptx_goacc_expand_accel_var (tree var) +{ + if (TREE_CODE (var) == VAR_DECL + && lookup_attribute ("oacc gangprivate", DECL_ATTRIBUTES (var))) + { + unsigned int offset, *poffset; + poffset = gangprivate_shared_hmap.get (var); + if (poffset) + offset = *poffset; + else + { + unsigned HOST_WIDE_INT align = DECL_ALIGN (var); + gangprivate_shared_size = + (gangprivate_shared_size + align - 1) & ~(align - 1); + if (gangprivate_shared_align < align) + gangprivate_shared_align = align; + + offset = gangprivate_shared_size; + bool existed = gangprivate_shared_hmap.put (var, offset); + gcc_assert (!existed); + gangprivate_shared_size += tree_to_uhwi (DECL_SIZE_UNIT (var)); + } + rtx addr = plus_constant (Pmode, gangprivate_shared_sym, offset); + return gen_rtx_MEM (TYPE_MODE (TREE_TYPE (var)), addr); + } + return NULL_RTX; +} + +static GTY(()) tree nvptx_previous_fndecl; + +static void +nvptx_set_current_function (tree fndecl) +{ + if (!fndecl || fndecl == nvptx_previous_fndecl) + return; + + gangprivate_shared_hmap.empty (); + nvptx_previous_fndecl = fndecl; +} + #undef TARGET_OPTION_OVERRIDE #define TARGET_OPTION_OVERRIDE nvptx_option_override @@ -6051,6 +6107,12 @@ nvptx_can_change_mode_class (machine_mode, machine_mode, reg_class_t) #undef TARGET_HAVE_SPECULATION_SAFE_VALUE #define TARGET_HAVE_SPECULATION_SAFE_VALUE speculation_safe_value_not_needed +#undef TARGET_GOACC_EXPAND_ACCEL_VAR +#define TARGET_GOACC_EXPAND_ACCEL_VAR nvptx_goacc_expand_accel_var + +#undef TARGET_SET_CURRENT_FUNCTION +#define TARGET_SET_CURRENT_FUNCTION nvptx_set_current_function + struct gcc_target targetm = TARGET_INITIALIZER; #include "gt-nvptx.h" diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi index a40f45a..fb87f67 100644 --- a/gcc/doc/tm.texi +++ b/gcc/doc/tm.texi @@ -6064,6 +6064,14 @@ like @code{cond_add@var{m}}. The default implementation returns a zero constant of type @var{type}. @end deftypefn +@deftypefn {Target Hook} rtx TARGET_GOACC_EXPAND_ACCEL_VAR (tree @var{var}) +This hook, if defined, is used by accelerator target back-ends to expand +specially handled kinds of VAR_DECL expressions. A particular use is to +place variables with specific attributes inside special accelarator +memories. A return value of NULL indicates that the target does not +handle this VAR_DECL, and normal RTL expanding is resumed. +@end deftypefn + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in index 39a214e..beace61 100644 --- a/gcc/doc/tm.texi.in +++ b/gcc/doc/tm.texi.in @@ -4151,6 +4151,8 @@ address; but often a machine-dependent strategy can generate better code. @hook TARGET_PREFERRED_ELSE_VALUE +@hook TARGET_GOACC_EXPAND_ACCEL_VAR + @node Anchored Addresses @section Anchored Addresses @cindex anchored addresses diff --git a/gcc/expr.c b/gcc/expr.c index de6709d..2c62bf9 100644 --- a/gcc/expr.c +++ b/gcc/expr.c @@ -9854,8 +9854,19 @@ expand_expr_real_1 (tree exp, rtx target, machine_mode tmode, exp = SSA_NAME_VAR (ssa_name); goto expand_decl_rtl; - case PARM_DECL: case VAR_DECL: + /* Allow accel compiler to handle specific cases of variables, + specifically those tagged with the "oacc gangprivate" attribute, + which may intended to be placed in special memory in GPUs. */ + if (flag_openacc && targetm.goacc.expand_accel_var) + { + temp = targetm.goacc.expand_accel_var (exp); + if (temp) + return temp; + } + /* ... fall through ... */ + + case PARM_DECL: /* If a static var's type was incomplete when the decl was written, but the type is complete now, lay out the decl now. */ if (DECL_SIZE (exp) == 0 diff --git a/gcc/omp-low.c b/gcc/omp-low.c index 843c66f..354e182 100644 --- a/gcc/omp-low.c +++ b/gcc/omp-low.c @@ -124,6 +124,12 @@ struct omp_context /* True if this construct can be cancelled. */ bool cancellable; + + /* The number of levels of OpenACC partitioning invoked in this context. */ + int oacc_partitioning_levels; + + /* Decls in this context. */ + vec<tree> *oacc_decls; }; static splay_tree all_contexts; @@ -850,6 +856,7 @@ new_omp_context (gimple *stmt, omp_context *outer_ctx) } ctx->cb.decl_map = new hash_map<tree, tree>; + ctx->oacc_decls = new vec<tree> (); return ctx; } @@ -925,6 +932,8 @@ delete_omp_context (splay_tree_value value) if (is_task_ctx (ctx)) finalize_task_copyfn (as_a <gomp_task *> (ctx->stmt)); + delete ctx->oacc_decls; + XDELETE (ctx); } @@ -5716,6 +5725,9 @@ lower_oacc_head_tail (location_t loc, tree clauses, tree join_kind = build_int_cst (unsigned_type_node, IFN_UNIQUE_OACC_JOIN); gcc_assert (count); + + ctx->oacc_partitioning_levels = count; + for (unsigned done = 1; count; count--, done++) { gimple_seq fork_seq = NULL; @@ -6732,6 +6744,66 @@ lower_omp_for_lastprivate (struct omp_for_data *fd, gimple_seq *body_p, } } +/* Record vars listed in private clauses in CLAUSES in CTX. This information + is used to mark up variables that should be made private per-gang. */ + +static void +oacc_record_private_var_clauses (omp_context *ctx, tree clauses) +{ + tree c; + + if (!ctx) + return; + + for (c = clauses; c; c = OMP_CLAUSE_CHAIN (c)) + switch (OMP_CLAUSE_CODE (c)) + { + case OMP_CLAUSE_PRIVATE: + { + tree decl = OMP_CLAUSE_DECL (c); + ctx->oacc_decls->safe_push (decl); + } + break; + + default: + /* Empty. */; + } +} + +/* Record vars declared in BINDVARS in CTX. This information is used to mark + up variables that should be made private per-gang. */ + +static void +oacc_record_vars_in_bind (omp_context *ctx, tree bindvars) +{ + if (!ctx) + return; + + for (tree v = bindvars; v; v = DECL_CHAIN (v)) + ctx->oacc_decls->safe_push (v); +} + +/* Mark variables which are declared implicitly or explicitly as gang private + with a special attribute. These may need to have their declarations altered + later on in compilation (e.g. in execute_oacc_device_lower or the backend, + depending on how the OpenACC execution model is implemented on a given + target) to ensure that sharing semantics are correct. + Only variables which have their address taken need to be considered. */ + +static void +mark_oacc_gangprivate (vec<tree> *decls) +{ + int i; + tree decl; + + FOR_EACH_VEC_ELT (*decls, i, decl) + { + if (TREE_CODE (decl) == VAR_DECL && TREE_ADDRESSABLE (decl)) + DECL_ATTRIBUTES (decl) + = tree_cons (get_identifier ("oacc gangprivate"), + NULL, DECL_ATTRIBUTES (decl)); + } +} /* Lower code for an OMP loop directive. */ @@ -6748,6 +6820,8 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) push_gimplify_context (); + oacc_record_private_var_clauses (ctx, gimple_omp_for_clauses (stmt)); + lower_omp (gimple_omp_for_pre_body_ptr (stmt), ctx); block = make_node (BLOCK); @@ -6878,7 +6952,20 @@ lower_omp_for (gimple_stmt_iterator *gsi_p, omp_context *ctx) /* Add OpenACC partitioning and reduction markers just before the loop. */ if (oacc_head) - gimple_seq_add_seq (&body, oacc_head); + { + gimple_seq_add_seq (&body, oacc_head); + + int level_total = 0; + omp_context *thisctx; + + for (thisctx = ctx; thisctx; thisctx = thisctx->outer) + level_total += thisctx->oacc_partitioning_levels; + + /* If the current context and parent contexts are distributed over a + total of one parallelism level, we have gang partitioning. */ + if (level_total == 1) + mark_oacc_gangprivate (ctx->oacc_decls); + } lower_omp_for_lastprivate (&fd, &body, &dlist, ctx); @@ -7511,6 +7598,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) clauses = gimple_omp_target_clauses (stmt); + oacc_record_private_var_clauses (ctx, clauses); + gimple_seq dep_ilist = NULL; gimple_seq dep_olist = NULL; if (omp_find_clause (clauses, OMP_CLAUSE_DEPEND)) @@ -7761,6 +7850,8 @@ lower_omp_target (gimple_stmt_iterator *gsi_p, omp_context *ctx) if (offloaded) { + mark_oacc_gangprivate (ctx->oacc_decls); + /* Declare all the variables created by mapping and the variables declared in the scope of the target body. */ record_vars_into (ctx->block_vars, child_fn); @@ -8755,6 +8846,7 @@ lower_omp_1 (gimple_stmt_iterator *gsi_p, omp_context *ctx) ctx); break; case GIMPLE_BIND: + oacc_record_vars_in_bind (ctx, gimple_bind_vars (as_a <gbind *> (stmt))); lower_omp (gimple_bind_body_ptr (as_a <gbind *> (stmt)), ctx); maybe_remove_omp_member_access_dummy_vars (as_a <gbind *> (stmt)); break; diff --git a/gcc/target.def b/gcc/target.def index c570f38..b3b24b8 100644 --- a/gcc/target.def +++ b/gcc/target.def @@ -1701,6 +1701,16 @@ for allocating any storage for reductions when necessary.", void, (gcall *call), default_goacc_reduction) +DEFHOOK +(expand_accel_var, +"This hook, if defined, is used by accelerator target back-ends to expand\n\ +specially handled kinds of VAR_DECL expressions. A particular use is to\n\ +place variables with specific attributes inside special accelarator\n\ +memories. A return value of NULL indicates that the target does not\n\ +handle this VAR_DECL, and normal RTL expanding is resumed.", +rtx, (tree var), +NULL) + HOOK_VECTOR_END (goacc) /* Functions relating to vectorization. */ diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c new file mode 100644 index 0000000..f378346 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/gang-private-1.c @@ -0,0 +1,38 @@ +#include <assert.h> + +int main (void) +{ + int ret; + + #pragma acc parallel num_gangs(1) num_workers(32) copyout(ret) + { + int w = 0; + + #pragma acc loop worker + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + w++; + } + + ret = (w == 32); + } + assert (ret); + + #pragma acc parallel num_gangs(1) vector_length(32) copyout(ret) + { + int v = 0; + + #pragma acc loop vector + for (int i = 0; i < 32; i++) + { + #pragma acc atomic update + v++; + } + + ret = (v == 32); + } + assert (ret); + + return 0; +} diff --git a/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c new file mode 100644 index 0000000..2fa708a --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c-c++-common/loop-gwv-2.c @@ -0,0 +1,106 @@ +/* { dg-xfail-run-if "gangprivate failure" { openacc_nvidia_accel_selected } { "-O0" } { "" } } */ + +#include <stdio.h> +#include <openacc.h> +#include <alloca.h> +#include <string.h> +#include <gomp-constants.h> +#include <stdlib.h> + +#if 0 +#define DEBUG(DIM, IDX, VAL) \ + fprintf (stderr, "%sdist[%d] = %d\n", (DIM), (IDX), (VAL)) +#else +#define DEBUG(DIM, IDX, VAL) +#endif + +#define N (32*32*32) + +int +check (const char *dim, int *dist, int dimsize) +{ + int ix; + int exit = 0; + + for (ix = 0; ix < dimsize; ix++) + { + DEBUG(dim, ix, dist[ix]); + if (dist[ix] < (N) / (dimsize + 0.5) + || dist[ix] > (N) / (dimsize - 0.5)) + { + fprintf (stderr, "did not distribute to %ss (%d not between %d " + "and %d)\n", dim, dist[ix], (int) ((N) / (dimsize + 0.5)), + (int) ((N) / (dimsize - 0.5))); + exit |= 1; + } + } + + return exit; +} + +int main () +{ + int ary[N]; + int ix; + int exit = 0; + int ondev = 0; + int gangsize = 0, workersize = 0, vectorsize = 0; + int *gangdist, *workerdist, *vectordist; + + for (ix = 0; ix < N;ix++) + ary[ix] = -1; + +#pragma acc parallel num_gangs(32) num_workers(32) vector_length(32) copy(ary) copy(ondev) copyout(gangsize, workersize, vectorsize) + { +#pragma acc loop gang worker vector + for (unsigned ix = 0; ix < N; ix++) + { + if (acc_on_device (acc_device_not_host)) + { + int g, w, v; + + g = __builtin_goacc_parlevel_id (GOMP_DIM_GANG); + w = __builtin_goacc_parlevel_id (GOMP_DIM_WORKER); + v = __builtin_goacc_parlevel_id (GOMP_DIM_VECTOR); + + ary[ix] = (g << 16) | (w << 8) | v; + ondev = 1; + } + else + ary[ix] = ix; + } + + gangsize = __builtin_goacc_parlevel_size (GOMP_DIM_GANG); + workersize = __builtin_goacc_parlevel_size (GOMP_DIM_WORKER); + vectorsize = __builtin_goacc_parlevel_size (GOMP_DIM_VECTOR); + } + + gangdist = (int *) alloca (gangsize * sizeof (int)); + workerdist = (int *) alloca (workersize * sizeof (int)); + vectordist = (int *) alloca (vectorsize * sizeof (int)); + memset (gangdist, 0, gangsize * sizeof (int)); + memset (workerdist, 0, workersize * sizeof (int)); + memset (vectordist, 0, vectorsize * sizeof (int)); + + /* Test that work is shared approximately equally amongst each active + gang/worker/vector. */ + for (ix = 0; ix < N; ix++) + { + if (ondev) + { + int g = (ary[ix] >> 16) & 255; + int w = (ary[ix] >> 8) & 255; + int v = ary[ix] & 255; + + gangdist[g]++; + workerdist[w]++; + vectordist[v]++; + } + } + + exit = check ("gang", gangdist, gangsize); + exit |= check ("worker", workerdist, workersize); + exit |= check ("vector", vectordist, vectorsize); + + return exit; +} diff --git a/libgomp/testsuite/libgomp.oacc-c/pr85465.c b/libgomp/testsuite/libgomp.oacc-c/pr85465.c new file mode 100644 index 0000000..329e8a0 --- /dev/null +++ b/libgomp/testsuite/libgomp.oacc-c/pr85465.c @@ -0,0 +1,11 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-w" } */ + +int +main (void) +{ +#pragma acc parallel + foo (); + + return 0; +}