Message-ID: <559D9A29.2020409@acm.org>
Date: Wed, 08 Jul 2015 17:46:17 -0400
From: Nathan Sidwell
To: Jakub Jelinek
CC: GCC Patches
Subject: Re: [gomp] Move openacc vector& worker single handling to RTL
References: <5597120D.2080308@acm.org>
 <20150703231159.GP10247@tucnak.redhat.com> <559844EF.6010208@acm.org>
 <559AD85B.2050102@acm.org> <20150707095408.GD10247@tucnak.redhat.com>
 <559BDE68.9010302@acm.org> <20150707142229.GG10247@tucnak.redhat.com>
 <559D381C.7020804@acm.org> <20150708145822.GB10247@tucnak.redhat.com>
In-Reply-To: <20150708145822.GB10247@tucnak.redhat.com>

On 07/08/15 10:58, Jakub Jelinek wrote:
> On Wed, Jul 08, 2015 at 10:47:56AM -0400, Nathan Sidwell wrote:
>> +/* Generate loop head markers in outer->inner order.  */
>> +
>> +static void
>> +gen_oacc_fork (gimple_seq *seq, unsigned mask)
>> +{
>> +  {
>> +    // TODDO: Determine this information from the parallel region itself
>
> TODO ?

I want to clean this up with the offloading launch API.  As it happens, I did
manage to have the PTX backend DTRT if it doesn't encounter this internal fn.
I'm dropping it from this patch (it'd undoubtedly need moving around anyway).

>> +      gcall *call = gimple_build_call_internal
>> +	(IFN_GOACC_FORK, 1, arg);
>
> Why the line-break?  That should fit into 80 columns just fine.

Oh, it does now that I've changed the name of the internal function.

>> +   It'd be better to place the OACC_LOOP markers just inside the outer
>> +   conditional, so they can be entirely eliminated if the loop is
>> +   unreachable.
>
> Putting OACC_FORK/OACC_JOIN unconditionally into the comment is very
> confusing.  The expand_omp_for_static_nochunk routine is used for
> #pragma omp for schedule(static), #pragma omp distribute etc. which
> certainly don't want to emit such markers in there.  So perhaps mention
> somewhere that you wrap all the above sequence in between
> OACC_FORK/OACC_JOIN markers.

Done (at both sites).

> Please avoid such whitespace changes.

Fixed (& searched for others).

> In any case, as it is a gomp-4_0-branch patch, I'll defer full review to the
> branch maintainers.

Thanks for your review!

nathan
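For readers following the fork/join exchange above, the marker ordering is
easy to see in isolation.  Below is a minimal standalone sketch (plain C,
not GCC internals; the printf calls stand in for emitting IFN_GOACC_FORK /
IFN_GOACC_JOIN internal calls) of how gen_oacc_fork and gen_oacc_join walk
the level mask:

/* Minimal sketch of the marker ordering, reusing the OACC_gang ..
   OACC_vector numbering from omp-low.h.  Not GCC code: printf stands
   in for gimple_build_call_internal.  */
#include <stdio.h>

enum oacc_loop_levels { OACC_gang, OACC_worker, OACC_vector, OACC_HWM };
#define OACC_LOOP_MASK(X) (1 << (X))

static void
gen_fork (unsigned mask)		/* Head markers, outer->inner.  */
{
  for (unsigned level = OACC_gang; level != OACC_HWM; level++)
    if (mask & OACC_LOOP_MASK (level))
      printf ("GOACC_FORK (%u)\n", level);
}

static void
gen_join (unsigned mask)		/* Tail markers, inner->outer.  */
{
  for (unsigned level = OACC_HWM; level-- != OACC_gang;)
    if (mask & OACC_LOOP_MASK (level))
      printf ("GOACC_JOIN (%u)\n", level);
}

int
main (void)
{
  /* A worker+vector partitioned loop.  */
  unsigned mask = OACC_LOOP_MASK (OACC_worker) | OACC_LOOP_MASK (OACC_vector);

  gen_fork (mask);	/* GOACC_FORK (1), GOACC_FORK (2)  */
  /* ... partitioned loop body ...  */
  gen_join (mask);	/* GOACC_JOIN (2), GOACC_JOIN (1)  */
  return 0;
}

Forks are emitted outer->inner and joins inner->outer, so the markers nest
properly around the partitioned loop.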
2015-07-08  Nathan Sidwell

	Infrastructure:
	* gimple.h (gimple_call_internal_unique_p): Declare.
	* gimple.c (gimple_call_same_target_p): Add check for
	gimple_call_internal_unique_p.
	* internal-fn.c (gimple_call_internal_unique_p): New.
	* omp-low.h (OACC_LOOP_MASK): Define here...
	* omp-low.c (OACC_LOOP_MASK): ... not here.
	* tree-ssa-threadedge.c (record_temporary_equivalences_from_stmts):
	Add check for gimple_call_internal_unique_p.
	* tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for
	the gimple statements.

	Additions:
	* internal-fn.def (GOACC_FORK, GOACC_JOIN): New.
	* internal-fn.c (gimple_call_internal_unique_p): Add check for
	IFN_GOACC_FORK, IFN_GOACC_JOIN.
	(expand_GOACC_FORK, expand_GOACC_JOIN): New.
	* omp-low.c (gen_oacc_fork, gen_oacc_join): New.
	(expand_omp_for_static_nochunk): Add oacc loop fork & join calls.
	(expand_omp_for_static_chunk): Likewise.
	* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_fork,
	nvptx_expand_oacc_join): Declare.
	* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
	UNSPEC_BR_UNIFIED): New unspecs.
	(UNSPECV_FORK, UNSPECV_FORKED, UNSPECV_JOINING, UNSPECV_JOIN,
	UNSPECV_BR_HIDDEN): New.
	(BITS, BITD): New mode iterators.
	(br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New
	branches.
	(nvptx_fork, nvptx_forked, nvptx_joining, nvptx_join): New insns.
	(oacc_fork, oacc_join): New expands.
	(nvptx_broadcast): New insn.
	(unpacksi2, packsi2): New insns.
	(worker_load, worker_store): New insns.
	(nvptx_barsync): Renamed from ...
	(threadbarrier_insn): ... here.
	* config/nvptx/nvptx.c: Include hash-map.h, dominance.h, cfg.h &
	omp-low.h.
	(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
	worker_bcast_sym): New.
	(nvptx_option_override): Initialize worker_bcast_sym.
	(nvptx_expand_oacc_fork, nvptx_expand_oacc_join): New.
	(nvptx_gen_unpack, nvptx_gen_pack): New.
	(struct wcast_data_t, propagate_mask): New types.
	(nvptx_gen_vcast, nvptx_gen_wcast): New.
	(nvptx_print_operand): Change 'U' specifier to look at operand
	itself.
	(struct parallel): New struct.
	(parallel::parallel, parallel::~parallel): Ctor & dtor.
	(bb_insn_map_t): New map.
	(insn_bb_t, insn_bb_vec_t): New tuple & vector of.
	(nvptx_split_blocks, nvptx_discover_pre): New.
	(bb_par_t, bb_par_vec_t): New tuple & vector of.
	(nvptx_dump_pars, nvptx_discover_pars): New.
	(nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen,
	nvptx_wpropagate): New.
	(nvptx_wsync): New.
	(nvptx_single, nvptx_skip_par): New.
	(nvptx_process_pars): New.
	(nvptx_neuter_pars): New.
	(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
	nvptx_discover_pars, nvptx_process_pars & nvptx_neuter_pars.
	(nvptx_cannot_copy_insn): Check for broadcast, sync, fork & join
	insns.
	(nvptx_file_end): Output worker broadcast array definition.

	Deletions:
	* builtins.c (expand_oacc_thread_barrier): Delete.
	(expand_oacc_thread_broadcast): Delete.
	(expand_builtin): Adjust.
	* gimple.h (struct gimple_statement_omp_parallel_layout): Remove
	broadcast_array member.
	(gimple_omp_target_broadcast_array): Delete.
	(gimple_omp_target_set_broadcast_array): Delete.
	* omp-low.c (omp_region): Remove broadcast_array member.
	(oacc_broadcast): Delete.
	(build_oacc_threadbarrier): Delete.
	(oacc_loop_needs_threadbarrier_p): Delete.
	(oacc_alloc_broadcast_storage): Delete.
	(find_omp_target_region): Remove call to
	gimple_omp_target_broadcast_array.
	(enclosing_target_region, required_predication_mask,
	generate_vector_broadcast, generate_oacc_broadcast,
	make_predication_test, predicate_bb, find_predicatable_bbs,
	predicate_omp_regions): Delete.
	(use, gen, live_in): Delete.
	(populate_loop_live_in, oacc_populate_live_in_1,
	oacc_populate_live_in, populate_loop_use, oacc_broadcast_1,
	oacc_broadcast): Delete.
	(execute_expand_omp): Remove predicate_omp_regions call.
	(lower_omp_target): Remove oacc_alloc_broadcast_storage call.
	Remove gimple_omp_target_set_broadcast_array call.
	(make_gimple_omp_edges): Remove oacc_loop_needs_threadbarrier_p
	check.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Remove
	BUILT_IN_GOACC_THREADBARRIER.
	* omp-builtins.def (BUILT_IN_GOACC_THREAD_BROADCAST,
	BUILT_IN_GOACC_THREAD_BROADCAST_LL, BUILT_IN_GOACC_THREADBARRIER):
	Delete.
	* config/nvptx/nvptx.md (UNSPECV_WARPBCAST): Delete.
	(br_true, br_false): Remove U format specifier.
	(oacc_thread_broadcastsi, oacc_thread_broadcast_di): Delete.
	(oacc_threadbarrier): Delete.
	* config/nvptx/nvptx.c (condition_unidirectional_p): Delete.
	(nvptx_reorg_subreg): Remove unidirectionality checking.
	(nvptx_cannot_copy_insn): Remove broadcast and barrier insns.
	* config/nvptx/nvptx.h (machine_function): Remove
	warp_equal_pseudos.
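The infrastructure half of the ChangeLog centers on one rule: a "unique"
internal call, such as GOACC_FORK or GOACC_JOIN, marks a particular spot in
the CFG, so two calls to the same internal function must still compare
unequal; otherwise tail merging and jump threading would happily fuse
distinct markers.  A rough standalone sketch of the intended semantics
(plain C; the stub types and helpers are illustrative stand-ins, not GCC's):

/* Stand-in model of the "unique internal call" rule; the real code
   lives in internal-fn.c and gimple.c.  */
#include <stdbool.h>

enum internal_fn { IFN_VA_ARG, IFN_GOACC_FORK, IFN_GOACC_JOIN };

struct call_stub
{
  bool internal_p;
  enum internal_fn fn;
};

/* Fork/join markers denote a specific CFG location, so no two of them
   are interchangeable.  */
static bool
internal_unique_p (const struct call_stub *call)
{
  return call->fn == IFN_GOACC_FORK || call->fn == IFN_GOACC_JOIN;
}

/* Mirrors the new gimple_call_same_target_p logic for internal calls:
   same internal fn, and not unique.  */
bool
same_target_p (const struct call_stub *c1, const struct call_stub *c2)
{
  if (c1->internal_p)
    return (c2->internal_p
	    && c1->fn == c2->fn
	    && !internal_unique_p (c1));
  return false;		/* Ordinary-call comparison elided.  */
}

Tail merging (same_succ_def::equal) and jump threading then simply refuse
to treat blocks containing such calls as mergeable or threadable.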
Index: tree-ssa-alias.c
===================================================================
--- tree-ssa-alias.c	(revision 225323)
+++ tree-ssa-alias.c	(working copy)
@@ -1764,7 +1764,6 @@ ref_maybe_used_by_call_p_1 (gcall *call,
       case BUILT_IN_GOMP_ATOMIC_END:
       case BUILT_IN_GOMP_BARRIER:
       case BUILT_IN_GOMP_BARRIER_CANCEL:
-      case BUILT_IN_GOACC_THREADBARRIER:
       case BUILT_IN_GOMP_TASKWAIT:
       case BUILT_IN_GOMP_TASKGROUP_END:
       case BUILT_IN_GOMP_CRITICAL_START:
Index: gimple.c
===================================================================
--- gimple.c	(revision 225323)
+++ gimple.c	(working copy)
@@ -1380,12 +1380,27 @@ bool
 gimple_call_same_target_p (const_gimple c1, const_gimple c2)
 {
   if (gimple_call_internal_p (c1))
-    return (gimple_call_internal_p (c2)
-	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+    {
+      if (!gimple_call_internal_p (c2)
+	  || gimple_call_internal_fn (c1) != gimple_call_internal_fn (c2))
+	return false;
+
+      if (gimple_call_internal_unique_p (c1))
+	return false;
+
+      return true;
+    }
+  else if (gimple_call_fn (c1) == gimple_call_fn (c2))
+    return true;
   else
-    return (gimple_call_fn (c1) == gimple_call_fn (c2)
-	    || (gimple_call_fndecl (c1)
-		&& gimple_call_fndecl (c1) == gimple_call_fndecl (c2)));
+    {
+      tree decl = gimple_call_fndecl (c1);
+
+      if (!decl || decl != gimple_call_fndecl (c2))
+	return false;
+
+      return true;
+    }
 }
 
 /* Detect flags from a GIMPLE_CALL.  This is just like
Index: gimple.h
===================================================================
--- gimple.h	(revision 225323)
+++ gimple.h	(working copy)
@@ -581,10 +581,6 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT
   /* [ WORD 11 ]
      Size of the gang-local memory to allocate.  */
   tree ganglocal_size;
-
-  /* [ WORD 12 ]
-     A pointer to the array to be used for broadcasting across threads.  */
-  tree broadcast_array;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -2693,6 +2689,11 @@ gimple_call_internal_fn (const_gimple gs
   return static_cast <const gcall *> (gs)->u.internal_fn;
 }
 
+/* Return true if this internal gimple call is unique.  */
+
+extern bool
+gimple_call_internal_unique_p (const_gimple);
+
 /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt
    that could alter control flow.  */
 
@@ -5248,25 +5249,6 @@ gimple_omp_target_set_ganglocal_size (go
 }
 
 
-/* Return the pointer to the broadcast array associated with OMP_TARGET GS.  */
-
-static inline tree
-gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt)
-{
-  return omp_target_stmt->broadcast_array;
-}
-
-
-/* Set PTR to be the broadcast array associated with OMP_TARGET
-   GS.  */
-
-static inline void
-gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr)
-{
-  omp_target_stmt->broadcast_array = ptr;
-}
-
-
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
 static inline tree
Index: tree-ssa-threadedge.c
===================================================================
--- tree-ssa-threadedge.c	(revision 225323)
+++ tree-ssa-threadedge.c	(working copy)
@@ -310,6 +310,17 @@ record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we cannot thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL)
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+
+	  if (gimple_call_internal_p (call)
+	      && gimple_call_internal_unique_p (call))
+	    return NULL;
+	}
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.
*/ stmt_count++; Index: omp-low.c =================================================================== --- omp-low.c (revision 225323) +++ omp-low.c (working copy) @@ -166,14 +166,8 @@ struct omp_region /* For an OpenACC loop, the level of parallelism requested. */ int gwv_this; - - tree broadcast_array; }; -/* Levels of parallelism as defined by OpenACC. Increasing numbers - correspond to deeper loop nesting levels. */ -#define OACC_LOOP_MASK(X) (1 << (X)) - /* Context structure. Used to store information about each parallel directive in the code. */ @@ -292,8 +286,6 @@ static vec taskreg_contex static void scan_omp (gimple_seq *, omp_context *); static tree scan_omp_1_op (tree *, int *, void *); -static basic_block oacc_broadcast (basic_block, basic_block, - struct omp_region *); #define WALK_SUBSTMTS \ case GIMPLE_BIND: \ @@ -3487,15 +3479,6 @@ build_omp_barrier (tree lhs) return g; } -/* Build a call to GOACC_threadbarrier. */ - -static gcall * -build_oacc_threadbarrier (void) -{ - tree fndecl = builtin_decl_explicit (BUILT_IN_GOACC_THREADBARRIER); - return gimple_build_call (fndecl, 0); -} - /* If a context was created for STMT when it was scanned, return it. */ static omp_context * @@ -3506,6 +3489,37 @@ maybe_lookup_ctx (gimple stmt) return n ? (omp_context *) n->value : NULL; } +/* Generate loop head markers in outer->inner order. */ + +static void +gen_oacc_fork (gimple_seq *seq, unsigned mask) +{ + unsigned level; + + for (level = OACC_gang; level != OACC_HWM; level++) + if (mask & OACC_LOOP_MASK (level)) + { + tree arg = build_int_cst (unsigned_type_node, level); + gcall *call = gimple_build_call_internal (IFN_GOACC_FORK, 1, arg); + gimple_seq_add_stmt (seq, call); + } +} + +/* Generate loop tail markers in inner->outer order. */ + +static void +gen_oacc_join (gimple_seq *seq, unsigned mask) +{ + unsigned level; + + for (level = OACC_HWM; level-- != OACC_gang; ) + if (mask & OACC_LOOP_MASK (level)) + { + tree arg = build_int_cst (unsigned_type_node, level); + gcall *call = gimple_build_call_internal (IFN_GOACC_JOIN, 1, arg); + gimple_seq_add_stmt (seq, call); + } +} /* Find the mapping for DECL in CTX or the immediately enclosing context that has a mapping for DECL. @@ -6777,21 +6791,6 @@ expand_omp_for_generic (struct omp_regio } } - -/* True if a barrier is needed after a loop partitioned over - gangs/workers/vectors as specified by GWV_BITS. OpenACC semantics specify - that a (conceptual) barrier is needed after worker and vector-partitioned - loops, but not after gang-partitioned loops. Currently we are relying on - warp reconvergence to synchronise threads within a warp after vector loops, - so an explicit barrier is not helpful after those. */ - -static bool -oacc_loop_needs_threadbarrier_p (int gwv_bits) -{ - return !(gwv_bits & OACC_LOOP_MASK (OACC_gang)) - && (gwv_bits & OACC_LOOP_MASK (OACC_worker)); -} - /* A subroutine of expand_omp_for. Generate code for a parallel loop with static schedule and no specified chunk size. Given parameters: @@ -6827,6 +6826,11 @@ oacc_loop_needs_threadbarrier_p (int gwv V += STEP; if (V cond e) goto L1; L2: + + For OpenACC the above is wrapped in an OACC_FORK/OACC_JOIN pair. + Currently we wrap the whole sequence, but it'd be better to place the + markers just inside the outer conditional, so they can be entirely + eliminated if the loop is unreachable. */ static void @@ -6868,10 +6872,6 @@ expand_omp_for_static_nochunk (struct om } exit_bb = region->exit; - /* Broadcast variables to OpenACC threads. 
*/ - entry_bb = oacc_broadcast (entry_bb, fin_bb, region); - region->entry = entry_bb; - /* Iteration space partitioning goes in ENTRY_BB. */ gsi = gsi_last_bb (entry_bb); gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR); @@ -6893,6 +6893,15 @@ expand_omp_for_static_nochunk (struct om t = fold_binary (fd->loop.cond_code, boolean_type_node, fold_convert (type, fd->loop.n1), fold_convert (type, fd->loop.n2)); + + if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) + { + gimple_seq seq = NULL; + + gen_oacc_fork (&seq, region->gwv_this); + gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT); + } + if (fd->collapse == 1 && TYPE_UNSIGNED (type) && (t == NULL_TREE || !integer_onep (t))) @@ -7134,17 +7143,17 @@ expand_omp_for_static_nochunk (struct om /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing. */ gsi = gsi_last_bb (exit_bb); - if (!gimple_omp_return_nowait_p (gsi_stmt (gsi))) + if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) + { + gimple_seq seq = NULL; + + gen_oacc_join (&seq, region->gwv_this); + gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT); + } + else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi))) { t = gimple_omp_return_lhs (gsi_stmt (gsi)); - if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) - { - gcc_checking_assert (t == NULL_TREE); - if (oacc_loop_needs_threadbarrier_p (region->gwv_this)) - gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT); - } - else - gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT); + gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT); } gsi_remove (&gsi, true); @@ -7248,6 +7257,11 @@ find_phi_with_arg_on_edge (tree arg, edg trip += 1; goto L0; L4: + + For OpenACC the above is wrapped in an OACC_FORK/OACC_JOIN pair. + Currently we wrap the whole sequence, but it'd be better to place the + markers just inside the outer conditional, so they can be entirely + eliminated if the loop is unreachable. */ static void @@ -7281,10 +7295,6 @@ expand_omp_for_static_chunk (struct omp_ gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2); fin_bb = BRANCH_EDGE (iter_part_bb)->dest; - /* Broadcast variables to OpenACC threads. */ - entry_bb = oacc_broadcast (entry_bb, fin_bb, region); - region->entry = entry_bb; - gcc_assert (broken_loop || fin_bb == FALLTHRU_EDGE (cont_bb)->dest); seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb)); @@ -7318,6 +7328,14 @@ expand_omp_for_static_chunk (struct omp_ t = fold_binary (fd->loop.cond_code, boolean_type_node, fold_convert (type, fd->loop.n1), fold_convert (type, fd->loop.n2)); + if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) + { + gimple_seq seq = NULL; + + gen_oacc_fork (&seq, region->gwv_this); + gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT); + } + if (fd->collapse == 1 && TYPE_UNSIGNED (type) && (t == NULL_TREE || !integer_onep (t))) @@ -7576,17 +7594,18 @@ expand_omp_for_static_chunk (struct omp_ /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing. 
*/ gsi = gsi_last_bb (exit_bb); - if (!gimple_omp_return_nowait_p (gsi_stmt (gsi))) + + if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) + { + gimple_seq seq = NULL; + + gen_oacc_join (&seq, region->gwv_this); + gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT); + } + else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi))) { t = gimple_omp_return_lhs (gsi_stmt (gsi)); - if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) - { - gcc_checking_assert (t == NULL_TREE); - if (oacc_loop_needs_threadbarrier_p (region->gwv_this)) - gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT); - } - else - gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT); + gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT); } gsi_remove (&gsi, true); @@ -9158,20 +9177,6 @@ expand_omp_atomic (struct omp_region *re expand_omp_atomic_mutex (load_bb, store_bb, addr, loaded_val, stored_val); } -/* Allocate storage for OpenACC worker threads in CTX to broadcast - condition results. */ - -static void -oacc_alloc_broadcast_storage (omp_context *ctx) -{ - tree vull_type_node = build_qualified_type (long_long_unsigned_type_node, - TYPE_QUAL_VOLATILE); - - ctx->worker_sync_elt - = alloc_var_ganglocal (NULL_TREE, vull_type_node, ctx, - TYPE_SIZE_UNIT (vull_type_node)); -} - /* Mark the loops inside the kernels region starting at REGION_ENTRY and ending at REGION_EXIT. */ @@ -9947,7 +9952,6 @@ find_omp_target_region_data (struct omp_ region->gwv_this |= OACC_LOOP_MASK (OACC_worker); if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH)) region->gwv_this |= OACC_LOOP_MASK (OACC_vector); - region->broadcast_array = gimple_omp_target_broadcast_array (stmt); } /* Helper for build_omp_regions. Scan the dominator tree starting at @@ -10091,669 +10095,6 @@ build_omp_regions (void) build_omp_regions_1 (ENTRY_BLOCK_PTR_FOR_FN (cfun), NULL, false); } -/* Walk the tree upwards from region until a target region is found - or we reach the end, then return it. */ -static omp_region * -enclosing_target_region (omp_region *region) -{ - while (region != NULL - && region->type != GIMPLE_OMP_TARGET) - region = region->outer; - return region; -} - -/* Return a mask of GWV_ values indicating the kind of OpenACC - predication required for basic blocks in REGION. */ - -static int -required_predication_mask (omp_region *region) -{ - while (region - && region->type != GIMPLE_OMP_FOR && region->type != GIMPLE_OMP_TARGET) - region = region->outer; - if (!region) - return 0; - - int outer_masks = region->gwv_this; - omp_region *outer_target = region; - while (outer_target != NULL && outer_target->type != GIMPLE_OMP_TARGET) - { - if (outer_target->type == GIMPLE_OMP_FOR) - outer_masks |= outer_target->gwv_this; - outer_target = outer_target->outer; - } - if (!outer_target) - return 0; - - int mask = 0; - if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_worker)) != 0 - && (region->type == GIMPLE_OMP_TARGET - || (outer_masks & OACC_LOOP_MASK (OACC_worker)) == 0)) - mask |= OACC_LOOP_MASK (OACC_worker); - if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_vector)) != 0 - && (region->type == GIMPLE_OMP_TARGET - || (outer_masks & OACC_LOOP_MASK (OACC_vector)) == 0)) - mask |= OACC_LOOP_MASK (OACC_vector); - return mask; -} - -/* Generate a broadcast across OpenACC vector threads (a warp on GPUs) - so that VAR is broadcast to DEST_VAR. The new statements are added - after WHERE. Return the stmt after which the block should be split. 
*/ - -static gimple -generate_vector_broadcast (tree dest_var, tree var, - gimple_stmt_iterator &where) -{ - gimple retval = gsi_stmt (where); - tree vartype = TREE_TYPE (var); - tree call_arg_type = unsigned_type_node; - enum built_in_function fn = BUILT_IN_GOACC_THREAD_BROADCAST; - - if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type)) - { - fn = BUILT_IN_GOACC_THREAD_BROADCAST_LL; - call_arg_type = long_long_unsigned_type_node; - } - - bool need_conversion = !types_compatible_p (vartype, call_arg_type); - tree casted_var = var; - - if (need_conversion) - { - gassign *conv1 = NULL; - casted_var = create_tmp_var (call_arg_type); - - /* Handle floats and doubles. */ - if (!INTEGRAL_TYPE_P (vartype)) - { - tree t = fold_build1 (VIEW_CONVERT_EXPR, call_arg_type, var); - conv1 = gimple_build_assign (casted_var, t); - } - else - conv1 = gimple_build_assign (casted_var, NOP_EXPR, var); - - gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING); - } - - tree decl = builtin_decl_explicit (fn); - gimple call = gimple_build_call (decl, 1, casted_var); - gsi_insert_after (&where, call, GSI_NEW_STMT); - tree casted_dest = dest_var; - - if (need_conversion) - { - gassign *conv2 = NULL; - casted_dest = create_tmp_var (call_arg_type); - - if (!INTEGRAL_TYPE_P (vartype)) - { - tree t = fold_build1 (VIEW_CONVERT_EXPR, vartype, casted_dest); - conv2 = gimple_build_assign (dest_var, t); - } - else - conv2 = gimple_build_assign (dest_var, NOP_EXPR, casted_dest); - - gsi_insert_after (&where, conv2, GSI_CONTINUE_LINKING); - } - - gimple_call_set_lhs (call, casted_dest); - return retval; -} - -/* Generate a broadcast across OpenACC threads in REGION so that VAR - is broadcast to DEST_VAR. MASK specifies the parallelism level and - thereby the broadcast method. If it is only vector, we - can use a warp broadcast, otherwise we fall back to memory - store/load. */ - -static gimple -generate_oacc_broadcast (omp_region *region, tree dest_var, tree var, - gimple_stmt_iterator &where, int mask) -{ - if (mask == OACC_LOOP_MASK (OACC_vector)) - return generate_vector_broadcast (dest_var, var, where); - - omp_region *parent = enclosing_target_region (region); - - tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE); - tree ptr = create_tmp_var (build_pointer_type (elttype)); - gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR, - parent->broadcast_array); - gsi_insert_after (&where, cast1, GSI_NEW_STMT); - gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var); - gsi_insert_after (&where, st, GSI_NEW_STMT); - - gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT); - - gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR, - parent->broadcast_array); - gsi_insert_after (&where, cast2, GSI_NEW_STMT); - gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr)); - gsi_insert_after (&where, ld, GSI_NEW_STMT); - - gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT); - - return st; -} - -/* Build a test for OpenACC predication. TRUE_EDGE is the edge that should be - taken if the block should be executed. SKIP_DEST_BB is the destination to - jump to otherwise. MASK specifies the type of predication, it can contain - the bits for VECTOR and/or WORKER. 
*/ - -static void -make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask) -{ - basic_block cond_bb = true_edge->src; - - gimple_stmt_iterator tmp_gsi = gsi_last_bb (cond_bb); - tree decl = builtin_decl_explicit (BUILT_IN_GOACC_ID); - tree comp_var = NULL_TREE; - unsigned ix; - - for (ix = OACC_worker; ix <= OACC_vector; ix++) - if (OACC_LOOP_MASK (ix) & mask) - { - gimple call = gimple_build_call - (decl, 1, build_int_cst (unsigned_type_node, ix)); - tree var = create_tmp_var (unsigned_type_node); - - gimple_call_set_lhs (call, var); - gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT); - if (comp_var) - { - tree new_comp = create_tmp_var (unsigned_type_node); - gassign *ior = gimple_build_assign (new_comp, - BIT_IOR_EXPR, comp_var, var); - gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT); - comp_var = new_comp; - } - else - comp_var = var; - } - - tree cond = build2 (EQ_EXPR, boolean_type_node, comp_var, - fold_convert (unsigned_type_node, integer_zero_node)); - gimple cond_stmt = gimple_build_cond_empty (cond); - gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT); - - true_edge->flags = EDGE_TRUE_VALUE; - - /* Force an abnormal edge before a broadcast operation that might be present - in SKIP_DEST_BB. This is only done for the non-execution edge (with - respect to the predication done by this function) -- the opposite - (execution) edge that reaches the broadcast operation must be made - abnormal also, e.g. in this function's caller. */ - edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE); - basic_block false_abnorm_bb = split_edge (e); - edge abnorm_edge = single_succ_edge (false_abnorm_bb); - abnorm_edge->flags |= EDGE_ABNORMAL; -} - -/* Apply OpenACC predication to basic block BB which is in - region PARENT. MASK has a bitmask of levels that need to be - applied; VECTOR and/or WORKER may be set. */ - -static void -predicate_bb (basic_block bb, struct omp_region *parent, int mask) -{ - /* We handle worker-single vector-partitioned loops by jumping - around them if not in the controlling worker. Don't insert - unnecessary (and incorrect) predication. 
*/ - if (parent->type == GIMPLE_OMP_FOR - && (parent->gwv_this & OACC_LOOP_MASK (OACC_vector))) - mask &= ~OACC_LOOP_MASK (OACC_worker); - - if (mask == 0 || parent->type == GIMPLE_OMP_ATOMIC_LOAD) - return; - - gimple_stmt_iterator gsi; - gimple stmt; - - gsi = gsi_last_bb (bb); - stmt = gsi_stmt (gsi); - if (stmt == NULL) - return; - - basic_block skip_dest_bb = NULL; - - if (gimple_code (stmt) == GIMPLE_OMP_ENTRY_END) - return; - - if (gimple_code (stmt) == GIMPLE_COND) - { - tree cond_var = create_tmp_var (boolean_type_node); - tree broadcast_cond = create_tmp_var (boolean_type_node); - gassign *asgn = gimple_build_assign (cond_var, - gimple_cond_code (stmt), - gimple_cond_lhs (stmt), - gimple_cond_rhs (stmt)); - gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING); - gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn); - - gimple splitpoint = generate_oacc_broadcast (parent, broadcast_cond, - cond_var, gsi_asgn, - mask); - - edge e = split_block (bb, splitpoint); - e->flags = EDGE_ABNORMAL; - skip_dest_bb = e->dest; - - gimple_cond_set_condition (as_a (stmt), EQ_EXPR, - broadcast_cond, boolean_true_node); - } - else if (gimple_code (stmt) == GIMPLE_SWITCH) - { - gswitch *sstmt = as_a (stmt); - tree var = gimple_switch_index (sstmt); - tree new_var = create_tmp_var (TREE_TYPE (var)); - - gassign *asgn = gimple_build_assign (new_var, var); - gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING); - gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn); - - gimple splitpoint = generate_oacc_broadcast (parent, new_var, var, - gsi_asgn, mask); - - edge e = split_block (bb, splitpoint); - e->flags = EDGE_ABNORMAL; - skip_dest_bb = e->dest; - - gimple_switch_set_index (sstmt, new_var); - } - else if (is_gimple_omp (stmt)) - { - gsi_prev (&gsi); - gimple split_stmt = gsi_stmt (gsi); - enum gimple_code code = gimple_code (stmt); - - /* First, see if we must predicate away an entire loop or atomic region. */ - if (code == GIMPLE_OMP_FOR - || code == GIMPLE_OMP_ATOMIC_LOAD) - { - omp_region *inner; - inner = *bb_region_map->get (FALLTHRU_EDGE (bb)->dest); - skip_dest_bb = single_succ (inner->exit); - gcc_assert (inner->entry == bb); - if (code != GIMPLE_OMP_FOR - || ((inner->gwv_this & OACC_LOOP_MASK (OACC_vector)) - && !(inner->gwv_this & OACC_LOOP_MASK (OACC_worker)) - && (mask & OACC_LOOP_MASK (OACC_worker)))) - { - gimple_stmt_iterator head_gsi = gsi_start_bb (bb); - gsi_prev (&head_gsi); - edge e0 = split_block (bb, gsi_stmt (head_gsi)); - int mask2 = mask; - if (code == GIMPLE_OMP_FOR) - mask2 &= ~OACC_LOOP_MASK (OACC_vector); - if (!split_stmt || code != GIMPLE_OMP_FOR) - { - /* The simple case: nothing here except the for, - so we just need to make one branch around the - entire loop. */ - inner->entry = e0->dest; - make_predication_test (e0, skip_dest_bb, mask2); - return; - } - basic_block for_block = e0->dest; - /* The general case, make two conditions - a full one around the - code preceding the for, and one branch around the loop. */ - edge e1 = split_block (for_block, split_stmt); - basic_block bb3 = e1->dest; - edge e2 = split_block (for_block, split_stmt); - basic_block bb2 = e2->dest; - - make_predication_test (e0, bb2, mask); - make_predication_test (single_pred_edge (bb3), skip_dest_bb, - mask2); - inner->entry = bb3; - return; - } - } - - /* Only a few statements need special treatment. 
*/ - if (gimple_code (stmt) != GIMPLE_OMP_FOR - && gimple_code (stmt) != GIMPLE_OMP_CONTINUE - && gimple_code (stmt) != GIMPLE_OMP_RETURN) - { - edge e = single_succ_edge (bb); - skip_dest_bb = e->dest; - } - else - { - if (!split_stmt) - return; - edge e = split_block (bb, split_stmt); - skip_dest_bb = e->dest; - if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE) - { - gcc_assert (parent->cont == bb); - parent->cont = skip_dest_bb; - } - else if (gimple_code (stmt) == GIMPLE_OMP_RETURN) - { - gcc_assert (parent->exit == bb); - parent->exit = skip_dest_bb; - } - else if (gimple_code (stmt) == GIMPLE_OMP_FOR) - { - omp_region *inner; - inner = *bb_region_map->get (FALLTHRU_EDGE (skip_dest_bb)->dest); - gcc_assert (inner->entry == bb); - inner->entry = skip_dest_bb; - } - } - } - else if (single_succ_p (bb)) - { - edge e = single_succ_edge (bb); - skip_dest_bb = e->dest; - if (gimple_code (stmt) == GIMPLE_GOTO) - gsi_prev (&gsi); - if (gsi_stmt (gsi) == 0) - return; - } - - if (skip_dest_bb != NULL) - { - gimple_stmt_iterator head_gsi = gsi_start_bb (bb); - gsi_prev (&head_gsi); - edge e2 = split_block (bb, gsi_stmt (head_gsi)); - make_predication_test (e2, skip_dest_bb, mask); - } -} - -/* Walk the dominator tree starting at BB to collect basic blocks in - WORKLIST which need OpenACC vector predication applied to them. */ - -static void -find_predicatable_bbs (basic_block bb, vec &worklist) -{ - struct omp_region *parent = *bb_region_map->get (bb); - if (required_predication_mask (parent) != 0) - worklist.safe_push (bb); - basic_block son; - for (son = first_dom_son (CDI_DOMINATORS, bb); - son; - son = next_dom_son (CDI_DOMINATORS, son)) - find_predicatable_bbs (son, worklist); -} - -/* Apply OpenACC vector predication to all basic blocks. HEAD_BB is the - first. */ - -static void -predicate_omp_regions (basic_block head_bb) -{ - vec worklist = vNULL; - find_predicatable_bbs (head_bb, worklist); - int i; - basic_block bb; - FOR_EACH_VEC_ELT (worklist, i, bb) - { - omp_region *region = *bb_region_map->get (bb); - int mask = required_predication_mask (region); - predicate_bb (bb, region, mask); - } -} - -/* USE and GET sets for variable broadcasting. */ -static std::set use, gen, live_in; - -/* This is an extremely conservative live in analysis. We only want to - detect is any compiler temporary used inside an acc loop is local to - that loop or not. So record all decl uses in all the basic blocks - post-dominating the acc loop in question. */ -static tree -populate_loop_live_in (tree *tp, int *walk_subtrees, - void *data_ ATTRIBUTE_UNUSED) -{ - struct walk_stmt_info *wi = (struct walk_stmt_info *) data_; - - if (wi && wi->is_lhs) - { - if (VAR_P (*tp)) - live_in.insert (*tp); - } - else if (IS_TYPE_OR_DECL_P (*tp)) - *walk_subtrees = 0; - - return NULL_TREE; -} - -static void -oacc_populate_live_in_1 (basic_block entry_bb, basic_block exit_bb, - basic_block loop_bb) -{ - basic_block son; - gimple_stmt_iterator gsi; - - if (entry_bb == exit_bb) - return; - - if (!dominated_by_p (CDI_DOMINATORS, loop_bb, entry_bb)) - return; - - for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi)) - { - struct walk_stmt_info wi; - gimple stmt; - - memset (&wi, 0, sizeof (wi)); - stmt = gsi_stmt (gsi); - - walk_gimple_op (stmt, populate_loop_live_in, &wi); - } - - /* Continue walking the dominator tree. 
*/ - for (son = first_dom_son (CDI_DOMINATORS, entry_bb); - son; - son = next_dom_son (CDI_DOMINATORS, son)) - oacc_populate_live_in_1 (son, exit_bb, loop_bb); -} - -static void -oacc_populate_live_in (basic_block entry_bb, omp_region *region) -{ - /* Find the innermost OMP_TARGET region. */ - while (region && region->type != GIMPLE_OMP_TARGET) - region = region->outer; - - if (!region) - return; - - basic_block son; - - for (son = first_dom_son (CDI_DOMINATORS, region->entry); - son; - son = next_dom_son (CDI_DOMINATORS, son)) - oacc_populate_live_in_1 (son, region->exit, entry_bb); -} - -static tree -populate_loop_use (tree *tp, int *walk_subtrees, void *data_) -{ - struct walk_stmt_info *wi = (struct walk_stmt_info *) data_; - std::set::iterator it; - - /* There isn't much to do for LHS ops. There shouldn't be any pointers - or references here. */ - if (wi && wi->is_lhs) - return NULL_TREE; - - if (VAR_P (*tp)) - { - tree type; - - *walk_subtrees = 0; - - /* Filter out incompatible decls. */ - if (INDIRECT_REF_P (*tp) || is_global_var (*tp)) - return NULL_TREE; - - type = TREE_TYPE (*tp); - - /* Aggregate types aren't supported either. */ - if (AGGREGATE_TYPE_P (type)) - return NULL_TREE; - - /* Filter out decls inside GEN. */ - it = gen.find (*tp); - if (it == gen.end ()) - use.insert (*tp); - } - else if (IS_TYPE_OR_DECL_P (*tp)) - *walk_subtrees = 0; - - return NULL_TREE; -} - -/* INIT is true if this is the first time this function is called. */ - -static void -oacc_broadcast_1 (basic_block entry_bb, basic_block exit_bb, bool init, - int mask) -{ - basic_block son; - gimple_stmt_iterator gsi; - gimple stmt; - tree block, var; - - if (entry_bb == exit_bb) - return; - - /* Populate the GEN set. */ - - gsi = gsi_start_bb (entry_bb); - stmt = gsi_stmt (gsi); - - /* There's nothing to do if stmt is empty or if this is the entry basic - block to the vector loop. The entry basic block to pre-expanded loops - do not have an entry label. As such, the scope containing the initial - entry_bb should not be added to the gen set. */ - if (stmt != NULL && !init && (block = gimple_block (stmt)) != NULL) - for (var = BLOCK_VARS (block); var; var = DECL_CHAIN (var)) - gen.insert(var); - - /* Populate the USE set. */ - - for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi)) - { - struct walk_stmt_info wi; - - memset (&wi, 0, sizeof (wi)); - stmt = gsi_stmt (gsi); - - walk_gimple_op (stmt, populate_loop_use, &wi); - } - - /* Continue processing the children of this basic block. */ - for (son = first_dom_son (CDI_DOMINATORS, entry_bb); - son; - son = next_dom_son (CDI_DOMINATORS, son)) - oacc_broadcast_1 (son, exit_bb, false, mask); -} - -/* Broadcast variables to OpenACC vector loops. This function scans - all of the basic blocks withing an acc vector loop. It maintains - two sets of decls, a GEN set and a USE set. The GEN set contains - all of the decls in the the basic block's scope. The USE set - consists of decls used in current basic block, but are not in the - GEN set, globally defined or were transferred into the the accelerator - via a data movement clause. - - The vector loop begins at ENTRY_BB and end at EXIT_BB, where EXIT_BB - is a latch back to ENTRY_BB. Once a set of used variables have been - determined, they will get broadcasted in a pre-header to ENTRY_BB. 
*/ - -static basic_block -oacc_broadcast (basic_block entry_bb, basic_block exit_bb, omp_region *region) -{ - gimple_stmt_iterator gsi; - std::set::iterator it; - int mask = region->gwv_this; - - /* Nothing to do if this isn't an acc worker or vector loop. */ - if (mask == 0) - return entry_bb; - - use.empty (); - gen.empty (); - live_in.empty (); - - /* Currently, subroutines aren't supported. */ - gcc_assert (!lookup_attribute ("oacc function", - DECL_ATTRIBUTES (current_function_decl))); - - /* Populate live_in. */ - oacc_populate_live_in (entry_bb, region); - - /* Populate the set of used decls. */ - oacc_broadcast_1 (entry_bb, exit_bb, true, mask); - - /* Filter out all of the GEN decls from the USE set. Also filter out - any compiler temporaries that which are not present in LIVE_IN. */ - for (it = use.begin (); it != use.end (); it++) - { - std::set::iterator git, lit; - - git = gen.find (*it); - lit = live_in.find (*it); - if (git != gen.end () || lit == live_in.end ()) - use.erase (it); - } - - if (mask == OACC_LOOP_MASK (OACC_vector)) - { - /* Broadcast all decls in USE right before the last instruction in - entry_bb. */ - gsi = gsi_last_bb (entry_bb); - - gimple_seq seq = NULL; - gimple_stmt_iterator g2 = gsi_start (seq); - - for (it = use.begin (); it != use.end (); it++) - generate_oacc_broadcast (region, *it, *it, g2, mask); - - gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING); - } - else if (mask & OACC_LOOP_MASK (OACC_worker)) - { - if (use.empty ()) - return entry_bb; - - /* If this loop contains a worker, then each broadcast must be - predicated. */ - - for (it = use.begin (); it != use.end (); it++) - { - /* Worker broadcasting requires predication. To do that, there - needs to be several new parent basic blocks before the omp - for instruction. */ - - gimple_seq seq = NULL; - gimple_stmt_iterator g2 = gsi_start (seq); - gimple splitpoint = generate_oacc_broadcast (region, *it, *it, - g2, mask); - gsi = gsi_last_bb (entry_bb); - gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING); - edge e = split_block (entry_bb, splitpoint); - e->flags |= EDGE_ABNORMAL; - basic_block dest_bb = e->dest; - gsi_prev (&gsi); - edge e2 = split_block (entry_bb, gsi_stmt (gsi)); - e2->flags |= EDGE_ABNORMAL; - make_predication_test (e2, dest_bb, mask); - - /* Update entry_bb. */ - entry_bb = dest_bb; - } - } - - return entry_bb; -} - /* Main entry point for expanding OMP-GIMPLE into runtime calls. */ static unsigned int @@ -10772,8 +10113,6 @@ execute_expand_omp (void) fprintf (dump_file, "\n"); } - predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun)); - remove_exit_barriers (root_omp_region); expand_omp (root_omp_region); @@ -12342,10 +11681,7 @@ lower_omp_target (gimple_stmt_iterator * orlist = NULL; if (is_gimple_omp_oacc (stmt)) - { - oacc_init_count_vars (ctx, clauses); - oacc_alloc_broadcast_storage (ctx); - } + oacc_init_count_vars (ctx, clauses); if (has_reduction) { @@ -12631,7 +11967,6 @@ lower_omp_target (gimple_stmt_iterator * gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT); gimple_omp_target_set_ganglocal_size (stmt, sz); - gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt); pop_gimplify_context (NULL); } @@ -13348,16 +12683,7 @@ make_gimple_omp_edges (basic_block bb, s ((for_stmt = last_stmt (cur_region->entry)))) == GF_OMP_FOR_KIND_OACC_LOOP) { - /* Called before OMP expansion, so this information has not been - recorded in cur_region->gwv_this yet. 
*/ - int gwv_bits = find_omp_for_region_gwv (for_stmt); - if (oacc_loop_needs_threadbarrier_p (gwv_bits)) - { - make_edge (bb, bb->next_bb, EDGE_FALLTHRU | EDGE_ABNORMAL); - fallthru = false; - } - else - fallthru = true; + fallthru = true; } else /* In the case of a GIMPLE_OMP_SECTION, the edge will go Index: omp-low.h =================================================================== --- omp-low.h (revision 225323) +++ omp-low.h (working copy) @@ -20,6 +20,8 @@ along with GCC; see the file COPYING3. #ifndef GCC_OMP_LOW_H #define GCC_OMP_LOW_H +/* Levels of parallelism as defined by OpenACC. Increasing numbers + correspond to deeper loop nesting levels. */ enum oacc_loop_levels { OACC_gang, @@ -27,6 +29,7 @@ enum oacc_loop_levels OACC_vector, OACC_HWM }; +#define OACC_LOOP_MASK(X) (1 << (X)) struct omp_region; Index: tree-ssa-tail-merge.c =================================================================== --- tree-ssa-tail-merge.c (revision 225323) +++ tree-ssa-tail-merge.c (working copy) @@ -608,10 +608,13 @@ same_succ_def::equal (const same_succ_de { s1 = gsi_stmt (gsi1); s2 = gsi_stmt (gsi2); - if (gimple_code (s1) != gimple_code (s2)) - return 0; - if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2)) - return 0; + if (s1 != s2) + { + if (gimple_code (s1) != gimple_code (s2)) + return 0; + if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2)) + return 0; + } gsi_next_nondebug (&gsi1); gsi_next_nondebug (&gsi2); gsi_advance_fw_nondebug_nonlocal (&gsi1); Index: omp-builtins.def =================================================================== --- omp-builtins.def (revision 225323) +++ omp-builtins.def (working copy) @@ -69,13 +69,6 @@ DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_GA BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST) DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DEVICEPTR, "GOACC_deviceptr", BT_FN_PTR_PTR, ATTR_CONST_NOTHROW_LEAF_LIST) -DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST, "GOACC_thread_broadcast", - BT_FN_UINT_UINT, ATTR_NOTHROW_LEAF_LIST) -DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST_LL, "GOACC_thread_broadcast_ll", - BT_FN_ULONGLONG_ULONGLONG, ATTR_NOTHROW_LEAF_LIST) -DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREADBARRIER, "GOACC_threadbarrier", - BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST) - DEF_GOACC_BUILTIN_COMPILER (BUILT_IN_ACC_ON_DEVICE, "acc_on_device", BT_FN_INT_INT, ATTR_CONST_NOTHROW_LEAF_LIST) Index: internal-fn.def =================================================================== --- internal-fn.def (revision 225323) +++ internal-fn.def (working copy) @@ -64,3 +64,5 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL) DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL) DEF_INTERNAL_FN (GOACC_DATA_END_WITH_ARG, ECF_NOTHROW, ".r") +DEF_INTERNAL_FN (GOACC_FORK, ECF_NOTHROW | ECF_LEAF, ".") +DEF_INTERNAL_FN (GOACC_JOIN, ECF_NOTHROW | ECF_LEAF, ".") Index: config/nvptx/nvptx.md =================================================================== --- config/nvptx/nvptx.md (revision 225323) +++ config/nvptx/nvptx.md (working copy) @@ -52,15 +52,25 @@ UNSPEC_NID UNSPEC_SHARED_DATA + + UNSPEC_BIT_CONV + + UNSPEC_BROADCAST + UNSPEC_BR_UNIFIED ]) (define_c_enum "unspecv" [ UNSPECV_LOCK UNSPECV_CAS UNSPECV_XCHG - UNSPECV_WARP_BCAST UNSPECV_BARSYNC UNSPECV_ID + + UNSPECV_FORK + UNSPECV_FORKED + UNSPECV_JOINING + UNSPECV_JOIN + UNSPECV_BR_HIDDEN ]) (define_attr "subregs_ok" "false,true" @@ -253,6 +263,8 @@ (define_mode_iterator QHSIM [QI HI SI]) (define_mode_iterator SDFM [SF DF]) (define_mode_iterator SDCM 
[SC DC]) +(define_mode_iterator BITS [SI SF]) +(define_mode_iterator BITD [DI DF]) ;; This mode iterator allows :P to be used for patterns that operate on ;; pointer-sized quantities. Exactly one of the two alternatives will match. @@ -813,7 +825,7 @@ (label_ref (match_operand 1 "" "")) (pc)))] "" - "%j0\\tbra%U0\\t%l1;") + "%j0\\tbra\\t%l1;") (define_insn "br_false" [(set (pc) @@ -822,7 +834,34 @@ (label_ref (match_operand 1 "" "")) (pc)))] "" - "%J0\\tbra%U0\\t%l1;") + "%J0\\tbra\\t%l1;") + +;; a hidden conditional branch +(define_insn "br_true_hidden" + [(unspec_volatile:SI [(ne (match_operand:BI 0 "nvptx_register_operand" "R") + (const_int 0)) + (label_ref (match_operand 1 "" "")) + (match_operand:SI 2 "const_int_operand" "i")] + UNSPECV_BR_HIDDEN)] + "" + "%j0\\tbra%U2\\t%l1;") + +;; unified conditional branch +(define_insn "br_uni_true" + [(set (pc) (if_then_else + (ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")] + UNSPEC_BR_UNIFIED) (const_int 0)) + (label_ref (match_operand 1 "" "")) (pc)))] + "" + "%j0\\tbra.uni\\t%l1;") + +(define_insn "br_uni_false" + [(set (pc) (if_then_else + (eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")] + UNSPEC_BR_UNIFIED) (const_int 0)) + (label_ref (match_operand 1 "" "")) (pc)))] + "" + "%J0\\tbra.uni\\t%l1;") (define_expand "cbranch4" [(set (pc) @@ -1326,37 +1365,92 @@ return asms[INTVAL (operands[1])]; }) -(define_insn "oacc_thread_broadcastsi" - [(set (match_operand:SI 0 "nvptx_register_operand" "") - (unspec_volatile:SI [(match_operand:SI 1 "nvptx_register_operand" "")] - UNSPECV_WARP_BCAST))] +(define_insn "nvptx_fork" + [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")] + UNSPECV_FORK)] "" - "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;") + "// fork %0;" +) -(define_expand "oacc_thread_broadcastdi" - [(set (match_operand:DI 0 "nvptx_register_operand" "") - (unspec_volatile:DI [(match_operand:DI 1 "nvptx_register_operand" "")] - UNSPECV_WARP_BCAST))] - "" -{ - rtx t = gen_reg_rtx (DImode); - emit_insn (gen_lshrdi3 (t, operands[1], GEN_INT (32))); - rtx op0 = force_reg (SImode, gen_lowpart (SImode, t)); - rtx op1 = force_reg (SImode, gen_lowpart (SImode, operands[1])); - rtx targ0 = gen_reg_rtx (SImode); - rtx targ1 = gen_reg_rtx (SImode); - emit_insn (gen_oacc_thread_broadcastsi (targ0, op0)); - emit_insn (gen_oacc_thread_broadcastsi (targ1, op1)); - rtx t2 = gen_reg_rtx (DImode); - rtx t3 = gen_reg_rtx (DImode); - emit_insn (gen_extendsidi2 (t2, targ0)); - emit_insn (gen_extendsidi2 (t3, targ1)); - rtx t4 = gen_reg_rtx (DImode); - emit_insn (gen_ashldi3 (t4, t2, GEN_INT (32))); - emit_insn (gen_iordi3 (operands[0], t3, t4)); - DONE; +(define_insn "nvptx_forked" + [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")] + UNSPECV_FORKED)] + "" + "// forked %0;" +) + +(define_insn "nvptx_joining" + [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")] + UNSPECV_JOINING)] + "" + "// joining %0;" +) + +(define_insn "nvptx_join" + [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")] + UNSPECV_JOIN)] + "" + "// join %0;" +) + +(define_expand "oacc_fork" + [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")] + UNSPECV_FORKED)] + "" +{ + nvptx_expand_oacc_fork (operands[0]); }) +(define_expand "oacc_join" + [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")] + UNSPECV_JOIN)] + "" +{ + nvptx_expand_oacc_join (operands[0]); +}) + +;; only 32-bit shuffles exist. 
+(define_insn "nvptx_broadcast" + [(set (match_operand:BITS 0 "nvptx_register_operand" "") + (unspec:BITS + [(match_operand:BITS 1 "nvptx_register_operand" "")] + UNSPEC_BROADCAST))] + "" + "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;") + +;; extract parts of a 64 bit object into 2 32-bit ints +(define_insn "unpacksi2" + [(set (match_operand:SI 0 "nvptx_register_operand" "") + (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "") + (const_int 0)] UNSPEC_BIT_CONV)) + (set (match_operand:SI 1 "nvptx_register_operand" "") + (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))] + "" + "%.\\tmov.b64 {%0,%1}, %2;") + +;; pack 2 32-bit ints into a 64 bit object +(define_insn "packsi2" + [(set (match_operand:BITD 0 "nvptx_register_operand" "") + (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "") + (match_operand:SI 2 "nvptx_register_operand" "")] + UNSPEC_BIT_CONV))] + "" + "%.\\tmov.b64 %0, {%1,%2};") + +(define_insn "worker_load" + [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R") + (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")] + UNSPEC_SHARED_DATA))] + "" + "%.\\tld.shared%u0\\t%0,%1;") + +(define_insn "worker_store" + [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")] + UNSPEC_SHARED_DATA) + (match_operand:SDISDFM 0 "nvptx_register_operand" "R"))] + "" + "%.\\tst.shared%u1\\t%1,%0;") + (define_insn "ganglocal_ptr" [(set (match_operand:P 0 "nvptx_register_operand" "") (unspec:P [(const_int 0)] UNSPEC_SHARED_DATA))] @@ -1462,14 +1556,8 @@ "%.\\tatom%A1.b%T0.\\t%0, %1, %2;") ;; ??? Mark as not predicable later? -(define_insn "threadbarrier_insn" - [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)] +(define_insn "nvptx_barsync" + [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] + UNSPECV_BARSYNC)] "" "bar.sync\\t%0;") - -(define_expand "oacc_threadbarrier" - [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)] - "" -{ - operands[0] = const0_rtx; -}) Index: config/nvptx/nvptx.c =================================================================== --- config/nvptx/nvptx.c (revision 225323) +++ config/nvptx/nvptx.c (working copy) @@ -24,6 +24,7 @@ #include "coretypes.h" #include "tm.h" #include "rtl.h" +#include "hash-map.h" #include "hash-set.h" #include "machmode.h" #include "vec.h" @@ -74,6 +75,9 @@ #include "df.h" #include "dumpfile.h" #include "builtins.h" +#include "dominance.h" +#include "cfg.h" +#include "omp-low.h" /* Record the function decls we've written, and the libfuncs and function decls corresponding to them. */ @@ -97,6 +101,16 @@ static GTY((cache)) static GTY((cache)) hash_table *declared_fndecls_htab; static GTY((cache)) hash_table *needed_fndecls_htab; +/* Size of buffer needed to broadcast across workers. This is used + for both worker-neutering and worker broadcasting. It is shared + by all functions emitted. The buffer is placed in shared memory. + It'd be nice if PTX supported common blocks, because then this + could be shared across TUs (taking the largest size). */ +static unsigned worker_bcast_hwm; +static unsigned worker_bcast_align; +#define worker_bcast_name "__worker_bcast" +static GTY(()) rtx worker_bcast_sym; + /* Allocate a new, cleared machine_function structure. 
*/ static struct machine_function * @@ -124,6 +138,8 @@ nvptx_option_override (void) needed_fndecls_htab = hash_table::create_ggc (17); declared_libfuncs_htab = hash_table::create_ggc (17); + + worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name); } /* Return the mode to be used when declaring a ptx object for OBJ. @@ -1066,6 +1082,210 @@ nvptx_expand_compare (rtx compare) return gen_rtx_NE (BImode, pred, const0_rtx); } + +/* Expand the oacc fork & join primitive into ptx-required unspecs. */ + +void +nvptx_expand_oacc_fork (rtx mode) +{ + /* Emit fork for worker level. */ + if (UINTVAL (mode) == OACC_worker) + emit_insn (gen_nvptx_fork (mode)); +} + +void +nvptx_expand_oacc_join (rtx mode) +{ + /* Emit joining for all pars. */ + emit_insn (gen_nvptx_joining (mode)); +} + +/* Generate instruction(s) to unpack a 64 bit object into 2 32 bit + objects. */ + +static rtx +nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src) +{ + rtx res; + + switch (GET_MODE (src)) + { + case DImode: + res = gen_unpackdisi2 (dst0, dst1, src); + break; + case DFmode: + res = gen_unpackdfsi2 (dst0, dst1, src); + break; + default: gcc_unreachable (); + } + return res; +} + +/* Generate instruction(s) to pack 2 32 bit objects into a 64 bit + object. */ + +static rtx +nvptx_gen_pack (rtx dst, rtx src0, rtx src1) +{ + rtx res; + + switch (GET_MODE (dst)) + { + case DImode: + res = gen_packsidi2 (dst, src0, src1); + break; + case DFmode: + res = gen_packsidf2 (dst, src0, src1); + break; + default: gcc_unreachable (); + } + return res; +} + +/* Generate an instruction or sequence to broadcast register REG + across the vectors of a single warp. */ + +static rtx +nvptx_gen_vcast (rtx reg) +{ + rtx res; + + switch (GET_MODE (reg)) + { + case SImode: + res = gen_nvptx_broadcastsi (reg, reg); + break; + case SFmode: + res = gen_nvptx_broadcastsf (reg, reg); + break; + case DImode: + case DFmode: + { + rtx tmp0 = gen_reg_rtx (SImode); + rtx tmp1 = gen_reg_rtx (SImode); + + start_sequence (); + emit_insn (nvptx_gen_unpack (tmp0, tmp1, reg)); + emit_insn (nvptx_gen_vcast (tmp0)); + emit_insn (nvptx_gen_vcast (tmp1)); + emit_insn (nvptx_gen_pack (reg, tmp0, tmp1)); + res = get_insns (); + end_sequence (); + } + break; + case BImode: + { + rtx tmp = gen_reg_rtx (SImode); + + start_sequence (); + emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx)); + emit_insn (nvptx_gen_vcast (tmp)); + emit_insn (gen_rtx_SET (BImode, reg, + gen_rtx_NE (BImode, tmp, const0_rtx))); + res = get_insns (); + end_sequence (); + } + break; + + case HImode: + case QImode: + default:debug_rtx (reg);gcc_unreachable (); + } + return res; +} + +/* Structure used when generating a worker-level spill or fill. */ + +struct wcast_data_t +{ + rtx base; + rtx ptr; + unsigned offset; +}; + +/* Direction of the spill/fill and looping setup/teardown indicator. */ + +enum propagate_mask + { + PM_read = 1 << 0, + PM_write = 1 << 1, + PM_loop_begin = 1 << 2, + PM_loop_end = 1 << 3, + + PM_read_write = PM_read | PM_write + }; + +/* Generate instruction(s) to spill or fill register REG to/from the + worker broadcast array. PM indicates what is to be done, REP + how many loop iterations will be executed (0 for not a loop). 
*/ + +static rtx +nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data) +{ + rtx res; + machine_mode mode = GET_MODE (reg); + + switch (mode) + { + case BImode: + { + rtx tmp = gen_reg_rtx (SImode); + + start_sequence (); + if (pm & PM_read) + emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx)); + emit_insn (nvptx_gen_wcast (tmp, pm, rep, data)); + if (pm & PM_write) + emit_insn (gen_rtx_SET (BImode, reg, + gen_rtx_NE (BImode, tmp, const0_rtx))); + res = get_insns (); + end_sequence (); + } + break; + + default: + { + rtx addr = data->ptr; + + if (!addr) + { + unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT; + + if (align > worker_bcast_align) + worker_bcast_align = align; + data->offset = (data->offset + align - 1) & ~(align - 1); + addr = data->base; + if (data->offset) + addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset)); + } + + addr = gen_rtx_MEM (mode, addr); + addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA); + if (pm & PM_read) + res = gen_rtx_SET (mode, addr, reg); + if (pm & PM_write) + res = gen_rtx_SET (mode, reg, addr); + + if (data->ptr) + { + /* We're using a ptr, increment it. */ + start_sequence (); + + emit_insn (res); + emit_insn (gen_adddi3 (data->ptr, data->ptr, + GEN_INT (GET_MODE_SIZE (GET_MODE (res))))); + res = get_insns (); + end_sequence (); + } + else + rep = 1; + data->offset += rep * GET_MODE_SIZE (GET_MODE (reg)); + } + break; + } + return res; +} + /* When loading an operand ORIG_OP, verify whether an address space conversion to generic is required, and if so, perform it. Also check for SYMBOL_REFs for function decls and call @@ -1647,23 +1867,6 @@ nvptx_print_operand_address (FILE *file, nvptx_print_address_operand (file, addr, VOIDmode); } -/* Return true if the value of COND is the same across all threads in a - warp. */ - -static bool -condition_unidirectional_p (rtx cond) -{ - if (CONSTANT_P (cond)) - return true; - if (GET_CODE (cond) == REG) - return cfun->machine->warp_equal_pseudos[REGNO (cond)]; - if (GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMPARE - || GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMM_COMPARE) - return (condition_unidirectional_p (XEXP (cond, 0)) - && condition_unidirectional_p (XEXP (cond, 1))); - return false; -} - /* Print an operand, X, to FILE, with an optional modifier in CODE. Meaning of CODE: @@ -1677,8 +1880,7 @@ condition_unidirectional_p (rtx cond) t -- print a type opcode suffix, promoting QImode to 32 bits T -- print a type size in bits u -- print a type opcode suffix without promotions. - U -- print ".uni" if a condition consists only of values equal across all - threads in a warp. */ + U -- print ".uni" if the const_int operand is non-zero. */ static void nvptx_print_operand (FILE *file, rtx x, int code) @@ -1740,7 +1942,7 @@ nvptx_print_operand (FILE *file, rtx x, goto common; case 'U': - if (condition_unidirectional_p (x)) + if (INTVAL (x)) fprintf (file, ".uni"); break; @@ -1900,7 +2102,7 @@ get_replacement (struct reg_replace *r) conversion copyin/copyout instructions. 
 */

 static void
-nvptx_reorg_subreg (int max_regs)
+nvptx_reorg_subreg ()
 {
   struct reg_replace qiregs, hiregs, siregs, diregs;
   rtx_insn *insn, *next;
@@ -1914,11 +2116,6 @@ nvptx_reorg_subreg (int max_regs)
   siregs.mode = SImode;
   diregs.mode = DImode;

-  cfun->machine->warp_equal_pseudos
-    = ggc_cleared_vec_alloc (max_regs);
-
-  auto_vec warp_reg_worklist;
-
   for (insn = get_insns (); insn; insn = next)
     {
       next = NEXT_INSN (insn);
@@ -1934,18 +2131,6 @@ nvptx_reorg_subreg (int max_regs)
       diregs.n_in_use = 0;
       extract_insn (insn);

-      if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi
-          || (GET_CODE (PATTERN (insn)) == SET
-              && CONSTANT_P (SET_SRC (PATTERN (insn)))))
-        {
-          rtx dest = recog_data.operand[0];
-          if (REG_P (dest) && REG_N_SETS (REGNO (dest)) == 1)
-            {
-              cfun->machine->warp_equal_pseudos[REGNO (dest)] = true;
-              warp_reg_worklist.safe_push (REGNO (dest));
-            }
-        }
-
       enum attr_subregs_ok s_ok = get_attr_subregs_ok (insn);
       for (int i = 0; i < recog_data.n_operands; i++)
        {
@@ -1999,71 +2184,742 @@ nvptx_reorg_subreg (int max_regs)
           validate_change (insn, recog_data.operand_loc[i], new_reg, false);
        }
     }
+}
+
+/* Loop structure of the function.  The entire function is described as
+   a NULL loop.  We should be able to extend this to represent
+   superblocks.  */
+
+#define OACC_null OACC_HWM
+
+struct parallel
+{
+  /* Parent parallel.  */
+  parallel *parent;
+
+  /* Next sibling parallel.  */
+  parallel *next;
+
+  /* First child parallel.  */
+  parallel *inner;
+
+  /* Partitioning mode of the parallel.  */
+  unsigned mode;
+
+  /* Partitioning used within inner parallels.  */
+  unsigned inner_mask;
+
+  /* Location of parallel forked and join.  The forked is the first
+     block in the parallel and the join is the first block after
+     the partition.  */
+  basic_block forked_block;
+  basic_block join_block;
+
+  rtx_insn *forked_insn;
+  rtx_insn *join_insn;
+
+  rtx_insn *fork_insn;
+  rtx_insn *joining_insn;
+
+  /* Basic blocks in this parallel, but not in child parallels.  The
+     FORKED and JOINING blocks are in the partition.  The FORK and JOIN
+     blocks are not.  */
+  auto_vec<basic_block> blocks;
+
+public:
+  parallel (parallel *parent, unsigned mode);
+  ~parallel ();
+};
+
+/* Constructor links the new parallel into its parent's chain of
+   children.  */
+
+parallel::parallel (parallel *parent_, unsigned mode_)
+  :parent (parent_), next (0), inner (0), mode (mode_), inner_mask (0)
+{
+  forked_block = join_block = 0;
+  forked_insn = join_insn = 0;
+  fork_insn = joining_insn = 0;
+
+  if (parent)
+    {
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
+parallel::~parallel ()
+{
+  delete inner;
+  delete next;
+}
+
+/* Map of basic blocks to insns.  */
+typedef hash_map<basic_block, rtx_insn *> bb_insn_map_t;
+
+/* A tuple of an insn of interest and the BB in which it resides.  */
+typedef std::pair<rtx_insn *, basic_block> insn_bb_t;
+typedef auto_vec<insn_bb_t> insn_bb_vec_t;
+
+/* Split basic blocks such that the forked and join unspecs are each at
+   the start of their basic blocks.  Thus afterwards each block will
+   have a single partitioning mode.  We also do the same for return
+   insns, as they are executed by every thread.  Populate MAP with
+   head and tail blocks.  We also clear the BB visited flag, which is
+   used when finding partitions.  */
+
+static void
+nvptx_split_blocks (bb_insn_map_t *map)
+{
+  insn_bb_vec_t worklist;
+  basic_block block;
+  rtx_insn *insn;

-  while (!warp_reg_worklist.is_empty ())
+  /* Locate all the reorg instructions of interest.
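+     These are the forked and join markers and the return
+     instructions; each must end up first in its basic block, so any
+     that appear mid-block get queued for splitting below.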
+     */
+  FOR_ALL_BB_FN (block, cfun)
     {
-      int regno = warp_reg_worklist.pop ();
+      bool seen_insn = false;
+
+      /* Clear visited flag, for use by parallel locator.  */
+      block->flags &= ~BB_VISITED;

-      df_ref use = DF_REG_USE_CHAIN (regno);
-      for (; use; use = DF_REF_NEXT_REG (use))
+      FOR_BB_INSNS (block, insn)
        {
-          rtx_insn *insn;
-          if (!DF_REF_INSN_INFO (use))
-            continue;
-          insn = DF_REF_INSN (use);
-          if (DEBUG_INSN_P (insn))
-            continue;
-
-          /* The only insns we have to exclude are those which refer to
-             memory.  */
-          rtx pat = PATTERN (insn);
-          if (GET_CODE (pat) == SET
-              && (MEM_P (SET_SRC (pat)) || MEM_P (SET_DEST (pat))))
+          if (!INSN_P (insn))
            continue;
-
-          df_ref insn_use;
-          bool all_equal = true;
-          FOR_EACH_INSN_USE (insn_use, insn)
+          switch (recog_memoized (insn))
            {
-              unsigned insn_regno = DF_REF_REGNO (insn_use);
-              if (!cfun->machine->warp_equal_pseudos[insn_regno])
-                {
-                  all_equal = false;
-                  break;
-                }
+            default:
+              seen_insn = true;
+              continue;
+            case CODE_FOR_nvptx_forked:
+            case CODE_FOR_nvptx_join:
+              break;
+
+            case CODE_FOR_return:
+              /* We also need to split just before return insns, as
+                 that insn needs executing by all threads, but the
+                 block it is in probably does not.  */
+              break;
            }
-          if (!all_equal)
-            continue;
-          df_ref insn_def;
-          FOR_EACH_INSN_DEF (insn_def, insn)
+
+          if (seen_insn)
+            /* We've found an instruction that must be at the start of
+               a block, but isn't.  Add it to the worklist.  */
+            worklist.safe_push (insn_bb_t (insn, block));
+          else
+            /* It was already the first instruction.  Just add it to
+               the map.  */
+            map->get_or_insert (block) = insn;
+          seen_insn = true;
+        }
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  insn_bb_t *elt;
+  basic_block remap = 0;
+  for (ix = 0; worklist.iterate (ix, &elt); ix++)
+    {
+      if (remap != elt->second)
+        {
+          block = elt->second;
+          remap = block;
+        }
+
+      /* Split block before insn.  The insn is in the new block.  */
+      edge e = split_block (block, PREV_INSN (elt->first));
+
+      block = e->dest;
+      map->get_or_insert (block) = elt->first;
+    }
+}
+
+/* BLOCK is a basic block containing a head or tail instruction.
+   Locate the associated prehead or pretail instruction, which must be
+   in the single predecessor block.  */
+
+static rtx_insn *
+nvptx_discover_pre (basic_block block, int expected)
+{
+  gcc_assert (block->preds->length () == 1);
+  basic_block pre_block = (*block->preds)[0]->src;
+  rtx_insn *pre_insn;
+
+  for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn);
+       pre_insn = PREV_INSN (pre_insn))
+    gcc_assert (pre_insn != BB_HEAD (pre_block));
+
+  gcc_assert (recog_memoized (pre_insn) == expected);
+  return pre_insn;
+}
+
+/* Dump this parallel and all its inner parallels.  */
+
+static void
+nvptx_dump_pars (parallel *par, unsigned depth)
+{
+  fprintf (dump_file, "%u: mode %d head=%d, tail=%d\n",
+           depth, par->mode,
+           par->forked_block ? par->forked_block->index : -1,
+           par->join_block ? par->join_block->index : -1);
+
+  fprintf (dump_file, " blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (par->inner)
+    nvptx_dump_pars (par->inner, depth + 1);
+
+  if (par->next)
+    nvptx_dump_pars (par->next, depth);
+}
+
+typedef std::pair<basic_block, parallel *> bb_par_t;
+typedef auto_vec<bb_par_t> bb_par_vec_t;
+
+/* Walk the CFG looking for fork & join markers.  Construct a
+   loop structure for the function.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.
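+   For example, a vector-partitioned loop nested within a
+   worker-partitioned loop yields two nested parallels hanging off
+   the outermost NULL parallel.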
+   We rely on the BB visited flag having
+   been cleared when splitting blocks.  */

+static parallel *
+nvptx_discover_pars (bb_insn_map_t *map)
+{
+  parallel *outer_par = new parallel (0, OACC_null);
+  bb_par_vec_t worklist;
+  basic_block block;
+
+  /* Mark the exit block as visited, and seed the worklist with the
+     entry block.  */
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  worklist.safe_push (bb_par_t (block, outer_par));
+
+  while (worklist.length ())
+    {
+      bb_par_t bb_par = worklist.pop ();
+      parallel *l = bb_par.second;
+
+      block = bb_par.first;
+
+      /* Have we visited this block already?  */
+      if (block->flags & BB_VISITED)
+        continue;
+      block->flags |= BB_VISITED;
+
+      rtx_insn **endp = map->get (block);
+      if (endp)
+        {
+          rtx_insn *end = *endp;
+
+          /* This is a block head or tail, or return instruction.  */
+          switch (recog_memoized (end))
+            {
+            case CODE_FOR_return:
+              /* Return instructions are in their own block, and we
+                 don't need to do anything more.  */
+              continue;
+
+            case CODE_FOR_nvptx_forked:
+              /* Loop head, create a new inner loop and add it into
+                 our parent's child list.  */
+              {
+                unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+
+                l = new parallel (l, mode);
+                l->forked_block = block;
+                l->forked_insn = end;
+                if (mode == OACC_worker)
+                  l->fork_insn
+                    = nvptx_discover_pre (block, CODE_FOR_nvptx_fork);
+              }
+              break;
+
+            case CODE_FOR_nvptx_join:
+              /* A loop tail.  Finish the current loop and return to
+                 parent.  */
+              {
+                unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+
+                gcc_assert (l->mode == mode);
+                l->join_block = block;
+                l->join_insn = end;
+                if (mode == OACC_worker)
+                  l->joining_insn
+                    = nvptx_discover_pre (block, CODE_FOR_nvptx_joining);
+                l = l->parent;
+              }
+              break;
+
+            default:
+              gcc_unreachable ();
+            }
+        }
+
+      /* Add this block onto the current loop's list of blocks.  */
+      l->blocks.safe_push (block);
+
+      /* Push each destination block onto the work list.  */
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, block->succs)
+        worklist.safe_push (bb_par_t (e->dest, l));
+    }

+  if (dump_file)
+    {
+      fprintf (dump_file, "\nLoops\n");
+      nvptx_dump_pars (outer_par, 0);
+      fprintf (dump_file, "\n");
+    }
+
+  return outer_par;
+}
+
+/* Propagate live state at the start of a partitioned region.  BLOCK
+   provides the live register information, and might not contain
+   INSN.  Propagation is inserted just after INSN.  RW indicates
+   whether we are reading and/or writing state.  This separation is
+   needed for worker-level propagation, where we essentially do a
+   spill & fill.  FN is the worker function that generates the
+   propagation instructions for a single register.  DATA is user data.
+
+   We propagate the live register set and the entire frame.  We could
+   do better by (a) propagating just the live set that is used within
+   the partitioned regions and (b) only propagating stack entries that
+   are used.  The latter might be quite hard to determine.  */
+
+static void
+nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
+                 rtx (*fn) (rtx, propagate_mask,
+                            unsigned, void *), void *data)
+{
+  bitmap live = DF_LIVE_IN (block);
+  bitmap_iterator iterator;
+  unsigned ix;
+
+  /* Copy the frame array.
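+     The sequence emitted below amounts to a small counted loop;
+     roughly (a sketch only, with FN's propagation step elided):
+
+        ptr = frame_pointer, idx = frame_size / 8;
+        do {
+          idx -= 1;
+          tmp = *(uint64_t *)ptr;   // if RW includes PM_read
+          ... FN propagates tmp ...
+          *(uint64_t *)ptr = tmp;   // if RW includes PM_write
+          ptr += 8;
+        } while (idx != 0);
+
+     For a single-quadword frame the loop collapses to straight-line
+     code.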
+     */
+  HOST_WIDE_INT fs = get_frame_size ();
+  if (fs)
+    {
+      rtx tmp = gen_reg_rtx (DImode);
+      rtx idx = NULL_RTX;
+      rtx ptr = gen_reg_rtx (Pmode);
+      rtx pred = NULL_RTX;
+      rtx_code_label *label = NULL;
+
+      gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1)));
+      fs /= GET_MODE_SIZE (DImode);
+      /* Detect single iteration loop.  */
+      if (fs == 1)
+        fs = 0;
+
+      start_sequence ();
+      emit_insn (gen_rtx_SET (Pmode, ptr, frame_pointer_rtx));
+      if (fs)
+        {
+          idx = gen_reg_rtx (SImode);
+          pred = gen_reg_rtx (BImode);
+          label = gen_label_rtx ();
+
+          emit_insn (gen_rtx_SET (SImode, idx, GEN_INT (fs)));
+          /* Allow the worker function to initialize anything it
+             needs.  */
+          rtx init = fn (tmp, PM_loop_begin, fs, data);
+          if (init)
+            emit_insn (init);
+          emit_label (label);
+          LABEL_NUSES (label)++;
+          emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1)));
+        }
+      if (rw & PM_read)
+        emit_insn (gen_rtx_SET (DImode, tmp, gen_rtx_MEM (DImode, ptr)));
+      emit_insn (fn (tmp, rw, fs, data));
+      if (rw & PM_write)
+        emit_insn (gen_rtx_SET (DImode, gen_rtx_MEM (DImode, ptr), tmp));
+      if (fs)
+        {
+          emit_insn (gen_rtx_SET (SImode, pred,
+                                  gen_rtx_NE (BImode, idx, const0_rtx)));
+          emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
+          emit_insn (gen_br_true_hidden (pred, label, GEN_INT (1)));
+          rtx fini = fn (tmp, PM_loop_end, fs, data);
+          if (fini)
+            emit_insn (fini);
+          emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
+        }
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp));
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr));
+      rtx cpy = get_insns ();
+      end_sequence ();
+      insn = emit_insn_after (cpy, insn);
+    }
+
+  /* Copy live registers.  */
+  EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
+        {
+          rtx bcast = fn (reg, rw, 0, data);
+
+          insn = emit_insn_after (bcast, insn);
+        }
+    }
+}

-/* PTX-specific reorganization
-    1) mark now-unused registers, so function begin doesn't declare
-    unused registers.
-    2) replace subregs with suitable sequences.
-*/
+/* Worker for nvptx_vpropagate.  */
+
+static rtx
+vprop_gen (rtx reg, propagate_mask pm,
+           unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+{
+  if (!(pm & PM_read_write))
+    return 0;
+
+  return nvptx_gen_vcast (reg);
+}
+
+/* Propagate state that is live at start of BLOCK across the vectors
+   of a single warp.  Propagation is inserted just after INSN.  */
+
+static void
+nvptx_vpropagate (basic_block block, rtx_insn *insn)
+{
+  nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0);
+}
+
+/* Worker for nvptx_wpropagate.  */
+
+static rtx
+wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+{
+  wcast_data_t *data = (wcast_data_t *)data_;
+
+  if (pm & PM_loop_begin)
+    {
+      /* Starting a loop, initialize pointer.  */
+      unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
+
+      if (align > worker_bcast_align)
+        worker_bcast_align = align;
+      data->offset = (data->offset + align - 1) & ~(align - 1);
+
+      data->ptr = gen_reg_rtx (Pmode);
+
+      return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset));
+    }
+  else if (pm & PM_loop_end)
+    {
+      rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr);
+      data->ptr = NULL_RTX;
+      return clobber;
+    }
+  else
+    return nvptx_gen_wcast (reg, pm, rep, data);
+}
+
+/* Spill or fill the state that is live at the start of BLOCK.  PRE_P
+   indicates if this is just before partitioned mode (do spill), or
+   just after it starts (do fill).
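+   In practice the spill sequence is inserted just after the fork
+   instruction and the fill just after the forked marker; see
+   nvptx_process_pars below.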
+   Sequence is inserted just after INSN.  */

+static void
+nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn)
+{
+  wcast_data_t data;
+
+  data.base = gen_reg_rtx (Pmode);
+  data.offset = 0;
+  data.ptr = NULL_RTX;
+
+  nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data);
+  if (data.offset)
+    {
+      /* Stuff was emitted, initialize the base pointer now.  */
+      rtx init = gen_rtx_SET (Pmode, data.base, worker_bcast_sym);
+      emit_insn_after (init, insn);
+
+      if (worker_bcast_hwm < data.offset)
+        worker_bcast_hwm = data.offset;
+    }
+}
+
+/* Emit a worker-level synchronization barrier.  */
+
+static void
+nvptx_wsync (bool tail_p, rtx_insn *insn)
+{
+  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+}
+
+/* Single neutering according to MASK.  FROM is the incoming block and
+   TO is the outgoing block.  These may be the same block.  Insert at
+   start of FROM:
+
+     if (tid.<mode>) hidden_goto end.
+
+   and insert before ending branch of TO (if there is such an insn):
+
+     end:
+     <possibly a broadcast of the branch condition>
+     <the original ending branch>
+
+   We currently only use different FROM and TO when skipping an entire
+   loop.  We could do more if we detected superblocks.  */
+
+static void
+nvptx_single (unsigned mask, basic_block from, basic_block to)
+{
+  rtx_insn *head = BB_HEAD (from);
+  rtx_insn *tail = BB_END (to);
+  unsigned skip_mask = mask;
+
+  /* Find first insn of from block.  */
+  while (head != BB_END (from) && !INSN_P (head))
+    head = NEXT_INSN (head);
+
+  /* Find last insn of to block.  */
+  rtx_insn *limit = from == to ? head : BB_HEAD (to);
+  while (tail != limit && !INSN_P (tail) && !LABEL_P (tail))
+    tail = PREV_INSN (tail);
+
+  /* Detect if tail is a branch.  */
+  rtx tail_branch = NULL_RTX;
+  rtx cond_branch = NULL_RTX;
+  if (tail && INSN_P (tail))
+    {
+      tail_branch = PATTERN (tail);
+      if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx)
+        tail_branch = NULL_RTX;
+      else
+        {
+          cond_branch = SET_SRC (tail_branch);
+          if (GET_CODE (cond_branch) != IF_THEN_ELSE)
+            cond_branch = NULL_RTX;
+        }
+    }
+
+  if (tail == head)
+    {
+      /* If this is empty, do nothing.  */
+      if (!head || !INSN_P (head))
+        return;
+
+      /* If this is a dummy insn, do nothing.  */
+      switch (recog_memoized (head))
+        {
+        default:
+          break;
+        case CODE_FOR_nvptx_fork:
+        case CODE_FOR_nvptx_forked:
+        case CODE_FOR_nvptx_joining:
+        case CODE_FOR_nvptx_join:
+          return;
+        }
+
+      if (cond_branch)
+        {
+          /* If we're only doing vector single, there's no need to
+             emit skip code because we'll not insert anything.  */
+          if (!(mask & OACC_LOOP_MASK (OACC_vector)))
+            skip_mask = 0;
+        }
+      else if (tail_branch)
+        /* Block with only unconditional branch.  Nothing to do.  */
+        return;
+    }
+
+  /* Insert the vector test inside the worker test.  */
+  unsigned mode;
+  rtx_insn *before = tail;
+  for (mode = OACC_worker; mode <= OACC_vector; mode++)
+    if (OACC_LOOP_MASK (mode) & skip_mask)
+      {
+        rtx id = gen_reg_rtx (SImode);
+        rtx pred = gen_reg_rtx (BImode);
+        rtx_code_label *label = gen_label_rtx ();
+
+        emit_insn_before (gen_oacc_id (id, GEN_INT (mode)), head);
+        rtx cond = gen_rtx_SET (BImode, pred,
+                                gen_rtx_NE (BImode, id, const0_rtx));
+        emit_insn_before (cond, head);
+        emit_insn_before (gen_br_true_hidden (pred, label,
+                                              GEN_INT (mode != OACC_vector)),
+                          head);
+
+        LABEL_NUSES (label)++;
+        if (tail_branch)
+          before = emit_label_before (label, before);
+        else
+          emit_label_after (label, tail);
+      }
+
+  /* Now deal with propagating the branch condition.
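+     For a vector-only mask the condition is rebroadcast across the
+     warp.  When worker mode is involved it is staged through the
+     shared broadcast buffer instead; the emitted sequence is roughly
+     (a sketch, with illustrative names):
+
+        bcast[0] = pred;    // lone active thread spills
+        bar.sync;           // all threads synchronize
+        pred = bcast[0];    // every thread fills
+
+     Finally the branch condition is wrapped in UNSPEC_BR_UNIFIED so
+     the branch can be emitted in uniform (.uni) form.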
+     */
+  if (cond_branch)
+    {
+      rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
+
+      if (OACC_LOOP_MASK (OACC_vector) == mask)
+        {
+          /* Vector mode only, do a shuffle.  */
+          emit_insn_before (nvptx_gen_vcast (pvar), tail);
+        }
+      else
+        {
+          /* Includes worker mode, do spill & fill.  By construction
+             we should never have worker mode only.  */
+          wcast_data_t data;
+
+          data.base = worker_bcast_sym;
+          data.ptr = 0;
+
+          if (worker_bcast_hwm < GET_MODE_SIZE (SImode))
+            worker_bcast_hwm = GET_MODE_SIZE (SImode);
+
+          data.offset = 0;
+          emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+                            before);
+          emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+          data.offset = 0;
+          emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
+                            tail);
+        }
+
+      extract_insn (tail);
+      rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar),
+                                 UNSPEC_BR_UNIFIED);
+      validate_change (tail, recog_data.operand_loc[0], unsp, false);
+    }
+}
+
+/* PAR is a parallel that is being skipped in its entirety according to
+   MASK.  Treat this as skipping a superblock starting at forked
+   and ending at joining.  */
+
+static void
+nvptx_skip_par (unsigned mask, parallel *par)
+{
+  basic_block tail = par->join_block;
+  gcc_assert (tail->preds->length () == 1);
+
+  basic_block pre_tail = (*tail->preds)[0]->src;
+  gcc_assert (pre_tail->succs->length () == 1);
+
+  nvptx_single (mask, par->forked_block, pre_tail);
+}
+
+/* Process the parallel PAR and all its contained
+   parallels.  We do everything but the neutering.  Return mask of
+   partitioned modes used within this parallel.  */
+
+static unsigned
+nvptx_process_pars (parallel *par)
+{
+  unsigned inner_mask = OACC_LOOP_MASK (par->mode);
+
+  /* Do the inner parallels first.  */
+  if (par->inner)
+    {
+      par->inner_mask = nvptx_process_pars (par->inner);
+      inner_mask |= par->inner_mask;
+    }
+
+  switch (par->mode)
+    {
+    case OACC_null:
+      /* Dummy parallel.  */
+      break;
+
+    case OACC_vector:
+      nvptx_vpropagate (par->forked_block, par->forked_insn);
+      break;
+
+    case OACC_worker:
+      {
+        nvptx_wpropagate (false, par->forked_block,
+                          par->forked_insn);
+        nvptx_wpropagate (true, par->forked_block, par->fork_insn);
+        /* Insert begin and end synchronizations.  */
+        nvptx_wsync (false, par->forked_insn);
+        nvptx_wsync (true, par->joining_insn);
+      }
+      break;
+
+    case OACC_gang:
+      break;
+
+    default: gcc_unreachable ();
+    }
+
+  /* Now do siblings.  */
+  if (par->next)
+    inner_mask |= nvptx_process_pars (par->next);
+  return inner_mask;
+}
+
+/* Neuter the parallel described by PAR.  We recurse in depth-first
+   order.  MODES are the partitioning of the execution and OUTER is
+   the partitioning of the parallels we are contained in.  */
+
+static void
+nvptx_neuter_pars (parallel *par, unsigned modes, unsigned outer)
+{
+  unsigned me = (OACC_LOOP_MASK (par->mode)
+                 & (OACC_LOOP_MASK (OACC_worker)
+                    | OACC_LOOP_MASK (OACC_vector)));
+  unsigned skip_mask = 0, neuter_mask = 0;
+
+  if (par->inner)
+    nvptx_neuter_pars (par->inner, modes, outer | me);
+
+  for (unsigned mode = OACC_worker; mode <= OACC_vector; mode++)
+    {
+      if ((outer | me) & OACC_LOOP_MASK (mode))
+        { /* Mode is partitioned: no neutering.  */ }
+      else if (!(modes & OACC_LOOP_MASK (mode)))
+        { /* Mode is not used: nothing to do.  */ }
+      else if (par->inner_mask & OACC_LOOP_MASK (mode)
+               || !par->forked_insn)
+        /* Partitioned in inner parallels, or we're not partitioned
+           at all: neuter individual blocks.
+           */
+        neuter_mask |= OACC_LOOP_MASK (mode);
+      else if (!par->parent || !par->parent->forked_insn
+               || par->parent->inner_mask & OACC_LOOP_MASK (mode))
+        /* Parent isn't a parallel, or already contains this
+           partitioning: skip the parallel at this level.  */
+        skip_mask |= OACC_LOOP_MASK (mode);
+      else
+        { /* Parent will skip this parallel itself.  */ }
+    }
+
+  if (neuter_mask)
+    {
+      basic_block block;
+
+      for (unsigned ix = 0; par->blocks.iterate (ix, &block); ix++)
+        nvptx_single (neuter_mask, block, block);
+    }
+
+  if (skip_mask)
+    nvptx_skip_par (skip_mask, par);
+
+  if (par->next)
+    nvptx_neuter_pars (par->next, modes, outer);
+}
+
+/* NVPTX machine dependent reorg.
+   Insert vector and worker single neutering code and state
+   propagation when entering partitioned mode.  Fix up subregs.  */
+
+static void
+nvptx_reorg (void)
+{
   /* We are freeing block_for_insn in the toplev to keep compatibility
      with old MDEP_REORGS that are not CFG based.  Recompute it now.  */
   compute_bb_for_insn ();
@@ -2072,19 +2928,36 @@ nvptx_reorg (void)
   df_clear_flags (DF_LR_RUN_DCE);
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+
+  /* Split blocks and record interesting unspecs.  */
+  bb_insn_map_t bb_insn_map;
+
+  nvptx_split_blocks (&bb_insn_map);
+
+  /* Compute live registers.  */
   df_analyze ();
   regstat_init_n_sets_and_refs ();

-  int max_regs = max_reg_num ();
-
+  if (dump_file)
+    df_dump (dump_file);
+
   /* Mark unused regs as unused.  */
+  int max_regs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;

-  /* Replace subregs.  */
-  nvptx_reorg_subreg (max_regs);
+  parallel *pars = nvptx_discover_pars (&bb_insn_map);
+
+  nvptx_process_pars (pars);
+  nvptx_neuter_pars (pars, (OACC_LOOP_MASK (OACC_vector)
+                            | OACC_LOOP_MASK (OACC_worker)), 0);
+  delete pars;
+
+  nvptx_reorg_subreg ();
+
   regstat_free_n_sets_and_refs ();

   df_finish_pass (true);
@@ -2133,19 +3006,24 @@ nvptx_vector_alignment (const_tree type)
   return MIN (align, BIGGEST_ALIGNMENT);
 }

-/* Indicate that INSN cannot be duplicated.  This is true for insns
-   that generate a unique id.  To be on the safe side, we also
-   exclude instructions that have to be executed simultaneously by
-   all threads in a warp.  */
+/* Indicate that INSN cannot be duplicated.  */

 static bool
 nvptx_cannot_copy_insn_p (rtx_insn *insn)
 {
-  if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi)
-    return true;
-  if (recog_memoized (insn) == CODE_FOR_threadbarrier_insn)
-    return true;
-  return false;
+  switch (recog_memoized (insn))
+    {
+    case CODE_FOR_nvptx_broadcastsi:
+    case CODE_FOR_nvptx_broadcastsf:
+    case CODE_FOR_nvptx_barsync:
+    case CODE_FOR_nvptx_fork:
+    case CODE_FOR_nvptx_forked:
+    case CODE_FOR_nvptx_joining:
+    case CODE_FOR_nvptx_join:
+      return true;
+    default:
+      return false;
+    }
 }

 /* Record a symbol for mkoffload to enter into the mapping table.  */
@@ -2185,6 +3063,21 @@ nvptx_file_end (void)
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
     nvptx_record_fndecl (decl, true);
   fputs (func_decls.str().c_str(), asm_out_file);
+
+  if (worker_bcast_hwm)
+    {
+      /* Define the broadcast buffer.
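+         For example, a high-water mark of 12 bytes with 8-byte
+         alignment rounds up and emits
+
+           .shared.align 8 .u8 <buffer-name>[16];
+
+         where <buffer-name> is whatever worker_bcast_name holds.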
*/ + + if (worker_bcast_align < GET_MODE_SIZE (SImode)) + worker_bcast_align = GET_MODE_SIZE (SImode); + worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1) + & ~(worker_bcast_align - 1); + + fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name); + fprintf (asm_out_file, ".shared.align %d .u8 %s[%d];\n", + worker_bcast_align, + worker_bcast_name, worker_bcast_hwm); + } } #undef TARGET_OPTION_OVERRIDE Index: config/nvptx/nvptx.h =================================================================== --- config/nvptx/nvptx.h (revision 225323) +++ config/nvptx/nvptx.h (working copy) @@ -235,7 +235,6 @@ struct nvptx_pseudo_info struct GTY(()) machine_function { rtx_expr_list *call_args; - char *warp_equal_pseudos; rtx start_call; tree funtype; bool has_call_with_varargs; Index: config/nvptx/nvptx-protos.h =================================================================== --- config/nvptx/nvptx-protos.h (revision 225323) +++ config/nvptx/nvptx-protos.h (working copy) @@ -32,6 +32,8 @@ extern void nvptx_register_pragmas (void extern const char *nvptx_section_for_decl (const_tree); #ifdef RTX_CODE +extern void nvptx_expand_oacc_fork (rtx); +extern void nvptx_expand_oacc_join (rtx); extern void nvptx_expand_call (rtx, rtx); extern rtx nvptx_expand_compare (rtx); extern const char *nvptx_ptx_type_from_mode (machine_mode, bool); Index: builtins.c =================================================================== --- builtins.c (revision 225323) +++ builtins.c (working copy) @@ -5947,20 +5947,6 @@ expand_builtin_acc_on_device (tree exp A #endif } -/* Expand a thread synchronization point for OpenACC threads. */ -static void -expand_oacc_threadbarrier (void) -{ -#ifdef HAVE_oacc_threadbarrier - rtx insn = GEN_FCN (CODE_FOR_oacc_threadbarrier) (); - if (insn != NULL_RTX) - { - emit_insn (insn); - } -#endif -} - - /* Expand a thread-id/thread-count builtin for OpenACC. */ static rtx @@ -6032,47 +6018,6 @@ expand_oacc_ganglocal_ptr (rtx target AT return NULL_RTX; } -/* Handle a GOACC_thread_broadcast builtin call EXP with target TARGET. - Return the result. */ - -static rtx -expand_builtin_oacc_thread_broadcast (tree exp, rtx target) -{ - tree arg0 = CALL_EXPR_ARG (exp, 0); - enum insn_code icode; - - enum machine_mode mode = TYPE_MODE (TREE_TYPE (arg0)); - gcc_assert (INTEGRAL_MODE_P (mode)); - do - { - icode = direct_optab_handler (oacc_thread_broadcast_optab, mode); - mode = GET_MODE_WIDER_MODE (mode); - } - while (icode == CODE_FOR_nothing && mode != VOIDmode); - if (icode == CODE_FOR_nothing) - return expand_expr (arg0, NULL_RTX, VOIDmode, EXPAND_NORMAL); - - rtx tmp = target; - machine_mode mode0 = insn_data[icode].operand[0].mode; - machine_mode mode1 = insn_data[icode].operand[1].mode; - if (!tmp || !REG_P (tmp) || GET_MODE (tmp) != mode0) - tmp = gen_reg_rtx (mode0); - rtx op1 = expand_expr (arg0, NULL_RTX, mode1, EXPAND_NORMAL); - if (GET_MODE (op1) != mode1) - op1 = convert_to_mode (mode1, op1, 0); - - /* op1 might be an immediate, place it inside a register. */ - op1 = force_reg (mode1, op1); - - rtx insn = GEN_FCN (icode) (tmp, op1); - if (insn != NULL_RTX) - { - emit_insn (insn); - return tmp; - } - return const0_rtx; -} - /* Expand an expression EXP that calls a built-in function, with result going to TARGET if that's convenient (and in mode MODE if that's convenient). 
@@ -7225,14 +7170,6 @@ expand_builtin (tree exp, rtx target, rt return target; break; - case BUILT_IN_GOACC_THREAD_BROADCAST: - case BUILT_IN_GOACC_THREAD_BROADCAST_LL: - return expand_builtin_oacc_thread_broadcast (exp, target); - - case BUILT_IN_GOACC_THREADBARRIER: - expand_oacc_threadbarrier (); - return const0_rtx; - default: /* just do library call, if unknown builtin */ break; } Index: internal-fn.c =================================================================== --- internal-fn.c (revision 225323) +++ internal-fn.c (working copy) @@ -98,6 +98,20 @@ init_internal_fns () internal_fn_fnspec_array[IFN_LAST] = 0; } +/* Return true if this internal fn call is a unique marker -- it + should not be duplicated or merged. */ + +bool +gimple_call_internal_unique_p (const_gimple gs) +{ + switch (gimple_call_internal_fn (gs)) + { + default: return false; + case IFN_GOACC_FORK: return true; + case IFN_GOACC_JOIN: return true; + } +} + /* ARRAY_TYPE is an array of vector modes. Return the associated insn for load-lanes-style optab OPTAB. The insn must exist. */ @@ -1990,6 +2004,26 @@ expand_GOACC_DATA_END_WITH_ARG (gcall *s gcc_unreachable (); } +static void +expand_GOACC_FORK (gcall *stmt) +{ + rtx mode = expand_normal (gimple_call_arg (stmt, 0)); + +#ifdef HAVE_oacc_fork + emit_insn (gen_oacc_fork (mode)); +#endif +} + +static void +expand_GOACC_JOIN (gcall *stmt) +{ + rtx mode = expand_normal (gimple_call_arg (stmt, 0)); + +#ifdef HAVE_oacc_join + emit_insn (gen_oacc_join (mode)); +#endif +} + /* Routines to expand each internal function, indexed by function number. Each routine has the prototype:
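
       expand_<NAME> (gcall *stmt)

   where STMT is the statement that performs the call.  */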