From patchwork Mon Jul 6 19:34:51 2015 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nathan Sidwell X-Patchwork-Id: 491808 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id A4038140311 for ; Tue, 7 Jul 2015 05:35:21 +1000 (AEST) Authentication-Results: ozlabs.org; dkim=pass (1024-bit key; unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org header.b=tQ8FSiqv; dkim-atps=neutral DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :message-id:date:from:mime-version:to:cc:subject:references :in-reply-to:content-type; q=dns; s=default; b=oWOzBLvgyThhxbzlq 5R5YyHpqzmdCC6OXCVJT/oT4xsbP5gQMGT67s80x4vnD+QB76zppR5IbdSFJRazV LiAFqKg5893osHRjqdJN5l8bMG4F69abOXq8Xnjua2MStwatfXJ563Q5S0GpEoHT 12MrIQuHH53FUMiLvp8aKtSZM8= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender :message-id:date:from:mime-version:to:cc:subject:references :in-reply-to:content-type; s=default; bh=wJKqVzeJNZoyx1EYvb9Vjoa 5vns=; b=tQ8FSiqvSkPxUR6sCZRV69FZVEtnu2j55RR/dd9S/4Stxrpajgml7a9 EeHI7666acvDgi4K2fmf2RTLt0/KYvCmOF51QTTzpJD/YEEUGpufHPIDTm2feytE 73iTGH6Ekkb24oHWAxz+cY5ymI44s38f43DVBCZXkSGJOIDyIfBQ= Received: (qmail 77565 invoked by alias); 6 Jul 2015 19:35:07 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 77524 invoked by uid 89); 6 Jul 2015 19:35:06 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=3.8 required=5.0 tests=BAYES_99, BAYES_999, FREEMAIL_FROM, KAM_ASCII_DIVIDERS, RCVD_IN_DNSWL_LOW, SPF_PASS autolearn=no version=3.3.2 X-HELO: mail-qk0-f173.google.com Received: from mail-qk0-f173.google.com (HELO mail-qk0-f173.google.com) (209.85.220.173) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES128-GCM-SHA256 encrypted) ESMTPS; Mon, 06 Jul 2015 19:34:57 +0000 Received: by qkbp125 with SMTP id p125so124704659qkb.2 for ; Mon, 06 Jul 2015 12:34:55 -0700 (PDT) X-Received: by 10.55.31.17 with SMTP id f17mr853686qkf.43.1436211295298; Mon, 06 Jul 2015 12:34:55 -0700 (PDT) Received: from ?IPv6:2601:19b:400:a983:a2a8:cdff:fe3e:b48? ([2601:19b:400:a983:a2a8:cdff:fe3e:b48]) by mx.google.com with ESMTPSA id e64sm9827171qkh.19.2015.07.06.12.34.52 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 06 Jul 2015 12:34:53 -0700 (PDT) Message-ID: <559AD85B.2050102@acm.org> Date: Mon, 06 Jul 2015 15:34:51 -0400 From: Nathan Sidwell User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: Jakub Jelinek CC: GCC Patches Subject: Re: [gomp] Move openacc vector& worker single handling to RTL References: <5597120D.2080308@acm.org> <20150703231159.GP10247@tucnak.redhat.com> <559844EF.6010208@acm.org> In-Reply-To: <559844EF.6010208@acm.org> On 07/04/15 16:41, Nathan Sidwell wrote: > On 07/03/15 19:11, Jakub Jelinek wrote: >> If the builtins are not meant to be used by users directly (I assume they >> aren't) nor have a 1-1 correspondence to a library routine, it is much >> better to emit them as internal calls (see internal-fn.{c,def}) instead of >> BUILT_IN_NORMAL functions. > This patch uses internal builtins, I had to make one additional change to tree-ssa-tail-merge.c's same_succ_def::equal hash compare function. The new internal fn I introduced should compare EQ but not otherwise compare EQUAL, and that was blowing up the has function, which relied on EQUAL only. I don't know why I didn't hit this problem in the previous patch with the regular builtin. comments? nathan 2015-07-06 Nathan Sidwell Infrastructure: * gimple.h (gimple_call_internal_unique_p): Declare. * gimple.c (gimple_call_same_target_p): Add check for gimple_call_internal_unique_p. * internal-fn.c (gimple_call_internal_unique_p): New. * omp-low.h (OACC_LOOP_MASK): Define here... * omp-low.c (OACC_LOOP_MASK): ... not here. * tree-ssa-threadedge.c (record_temporary_equivalences_from_stmts): Add check for gimple_call_internal_unique_p. * tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for the gimple statements. Additions: * internal-fn.def (GOACC_LEVELS, GOACC_LOOP): New. * internal-fn.c (gimple_call_internal_unique_p): Add check for IFN_GOACC_LOOP. (expand_GOACC_LEVELS, expand_GOACC_LOOP): New. * omp-low.c (gen_oacc_loop_head, gen_oacc_loop_tail): New. (expand_omp_for_static_nochunk): Add oacc loop head & tail calls. (expand_omp_for_static_chunk): Likewise. * tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Add BUILT_IN_GOACC_LOOP. * config/nvptx/nvptx-protos.h ( nvptx_expand_oacc_loop): New. * config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST, UNSPEC_BR_UNIFIED): New unspecs. (UNSPECV_LEVELS, UNSPECV_LOOP, UNSPECV_BR_HIDDEN): New. (BITS, BITD): New mode iterators. (br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New branches. (oacc_levels, nvptx_loop): New insns. (oacc_loop): New expand (nvptx_broadcast): New insn. (unpacksi2, packsi2): New insns. (worker_load, worker_store): New insns. (nvptx_barsync): Renamed from ... (threadbarrier_insn): ... here. config/nvptx/nvptx.c: Include hash-map,h, dominance.h, cfg.h & omp-low.h. (nvptx_loop_head, nvptx_loop_tail, nvtpx_loop_prehead, nvptx_loop_pretail, LOOP_MODE_CHANGE_P: New. (worker_bcast_hwm, worker_bcast_align, worker_bcast_name, worker_bcast_sym): New. (nvptx_opetion_override): Initialize worker_bcast_sym. (nvptx_expand_oacc_loop): New. (nvptx_gen_unpack, nvptx_gen_pack): New. (struct wcast_data_t, propagate_mask): New types. (nvptx_gen_vcast, nvptx_gen_wcast): New. (nvptx_print_operand): Change 'U' specifier to look at operand itself. (struct reorg_unspec, struct reorg_loop): New structs. (unspec_map_t): New map. (loop_t, work_loop_t): New types. (nvptx_split_blocks, nvptx_discover_pre, nvptx_dump_loops, nvptx_discover_loops): New. (nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen, nvptx_wpropagate): New. (nvptx_wsync): New. (nvptx_single, nvptx_skip_loop): New. (nvptx_process_loops): New. (nvptx_neuter_loops): New. (nvptx_reorg): Add liveness DF problem. Call nvptx_split_loops, nvptx_discover_loops, nvptx_process_loops & nvptx_neuter_loops. (nvptx_cannot_copy_insn): Check for broadcast, sync & loop insns. (nvptx_file_end): Output worker broadcast array definition. Deletions: * builtins.c (expand_oacc_thread_barrier): Delete. (expand_oacc_thread_broadcast): Delete. (expand_builtin): Adjust. * gimple.c (struct gimple_statement_omp_parallel_layout): Remove broadcast_array member. (gimple_omp_target_broadcast_array): Delete. (gimple_omp_target_set_broadcast_array): Delete. * omp-low.c (omp_region): Remove broadcast_array member. (oacc_broadcast): Delete. (build_oacc_threadbarrier): Delete. (oacc_loop_needs_threadbarrier_p): Delete. (oacc_alloc_broadcast_storage): Delete. (find_omp_target_region): Remove call to gimple_omp_target_broadcast_array. (enclosing_target_region, required_predication_mask, generate_vector_broadcast, generate_oacc_broadcast, make_predication_test, predicate_bb, find_predicatable_bbs, predicate_omp_regions): Delete. (use, gen, live_in): Delete. (populate_loop_live_in, oacc_populate_live_in_1, oacc_populate_live_in, populate_loop_use, oacc_broadcast_1, oacc_broadcast): Delete. (execute_expand_omp): Remove predicate_omp_regions call. (lower_omp_target): Remove oacc_alloc_broadcast_storage call. Remove gimple_omp_target_set_broadcast_array call. (make_gimple_omp_edges): Remove oacc_loop_needs_threadbarrier_p check. * tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Remove BUILT_IN_GOACC_THREADBARRIER. * omp-builtins.def (BUILT_IN_GOACC_THREAD_BROADCAST, BUILT_IN_GOACC_THREAD_BROADCAST_LL, BUILT_IN_GOACC_THREADBARRIER): Delete. * config/nvptx/nvptx.md (UNSPECV_WARPBCAST): Delete. (br_true, br_false): Remove U format specifier. (oacc_thread_broadcastsi, oacc_thread_broadcast_di): Delete. (oacc_threadbarrier): Delete. * config/.nvptx/nvptx.c (condition_unidirectional_p): Delete. (nvptx_print_operand): Change 'U' specifier to look at operand itself. (nvptx_reorg_subreg): Remove unidirection checking. (nvptx_cannot_copy_insn): Remove broadcast and barrier insns. * config/nvptx/nvptx.h (machine_function): Remove arp_equal_pseudos. Index: omp-low.c =================================================================== --- omp-low.c (revision 225323) +++ omp-low.c (working copy) @@ -166,14 +166,8 @@ struct omp_region /* For an OpenACC loop, the level of parallelism requested. */ int gwv_this; - - tree broadcast_array; }; -/* Levels of parallelism as defined by OpenACC. Increasing numbers - correspond to deeper loop nesting levels. */ -#define OACC_LOOP_MASK(X) (1 << (X)) - /* Context structure. Used to store information about each parallel directive in the code. */ @@ -292,8 +286,6 @@ static vec taskreg_contex static void scan_omp (gimple_seq *, omp_context *); static tree scan_omp_1_op (tree *, int *, void *); -static basic_block oacc_broadcast (basic_block, basic_block, - struct omp_region *); #define WALK_SUBSTMTS \ case GIMPLE_BIND: \ @@ -3487,15 +3479,6 @@ build_omp_barrier (tree lhs) return g; } -/* Build a call to GOACC_threadbarrier. */ - -static gcall * -build_oacc_threadbarrier (void) -{ - tree fndecl = builtin_decl_explicit (BUILT_IN_GOACC_THREADBARRIER); - return gimple_build_call (fndecl, 0); -} - /* If a context was created for STMT when it was scanned, return it. */ static omp_context * @@ -3506,6 +3489,56 @@ maybe_lookup_ctx (gimple stmt) return n ? (omp_context *) n->value : NULL; } +/* Generate loop head markers in outer->inner order. */ + +static void +gen_oacc_loop_head (gimple_seq *seq, unsigned mask) +{ + { + // TODDO: Determine this information from the parallel region itself + // and emit it once in the offload function. Currently the target + // geometry definition is being extracted early. For now inform + // the backend we're using all axes of parallelism, which is a + // safe default. + gcall *call = gimple_build_call_internal + (IFN_GOACC_LEVELS, 1, + build_int_cst (unsigned_type_node, + OACC_LOOP_MASK (OACC_gang) + | OACC_LOOP_MASK (OACC_vector) + | OACC_LOOP_MASK (OACC_worker))); + gimple_seq_add_stmt (seq, call); + } + + tree arg0 = build_int_cst (unsigned_type_node, 0); + unsigned level; + + for (level = OACC_gang; level != OACC_HWM; level++) + if (mask & OACC_LOOP_MASK (level)) + { + tree arg1 = build_int_cst (unsigned_type_node, level); + gcall *call = gimple_build_call_internal + (IFN_GOACC_LOOP, 2, arg0, arg1); + gimple_seq_add_stmt (seq, call); + } +} + +/* Generate loop tail markers in inner->outer order. */ + +static void +gen_oacc_loop_tail (gimple_seq *seq, unsigned mask) +{ + tree arg0 = build_int_cst (unsigned_type_node, 1); + unsigned level; + + for (level = OACC_HWM; level-- != OACC_gang; ) + if (mask & OACC_LOOP_MASK (level)) + { + tree arg1 = build_int_cst (unsigned_type_node, level); + gcall *call = gimple_build_call_internal + (IFN_GOACC_LOOP, 2, arg0, arg1); + gimple_seq_add_stmt (seq, call); + } +} /* Find the mapping for DECL in CTX or the immediately enclosing context that has a mapping for DECL. @@ -6777,21 +6810,6 @@ expand_omp_for_generic (struct omp_regio } } - -/* True if a barrier is needed after a loop partitioned over - gangs/workers/vectors as specified by GWV_BITS. OpenACC semantics specify - that a (conceptual) barrier is needed after worker and vector-partitioned - loops, but not after gang-partitioned loops. Currently we are relying on - warp reconvergence to synchronise threads within a warp after vector loops, - so an explicit barrier is not helpful after those. */ - -static bool -oacc_loop_needs_threadbarrier_p (int gwv_bits) -{ - return !(gwv_bits & OACC_LOOP_MASK (OACC_gang)) - && (gwv_bits & OACC_LOOP_MASK (OACC_worker)); -} - /* A subroutine of expand_omp_for. Generate code for a parallel loop with static schedule and no specified chunk size. Given parameters: @@ -6800,6 +6818,7 @@ oacc_loop_needs_threadbarrier_p (int gwv where COND is "<" or ">", we generate pseudocode + OACC_LOOP_HEAD if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2; if (cond is <) adj = STEP - 1; @@ -6827,6 +6846,11 @@ oacc_loop_needs_threadbarrier_p (int gwv V += STEP; if (V cond e) goto L1; L2: + OACC_LOOP_TAIL + + It'd be better to place the OACC_LOOP markers just inside the outer + conditional, so they can be entirely eliminated if the loop is + unreachable. */ static void @@ -6868,10 +6892,6 @@ expand_omp_for_static_nochunk (struct om } exit_bb = region->exit; - /* Broadcast variables to OpenACC threads. */ - entry_bb = oacc_broadcast (entry_bb, fin_bb, region); - region->entry = entry_bb; - /* Iteration space partitioning goes in ENTRY_BB. */ gsi = gsi_last_bb (entry_bb); gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR); @@ -6893,6 +6913,15 @@ expand_omp_for_static_nochunk (struct om t = fold_binary (fd->loop.cond_code, boolean_type_node, fold_convert (type, fd->loop.n1), fold_convert (type, fd->loop.n2)); + + if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) + { + gimple_seq seq = NULL; + + gen_oacc_loop_head (&seq, region->gwv_this); + gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT); + } + if (fd->collapse == 1 && TYPE_UNSIGNED (type) && (t == NULL_TREE || !integer_onep (t))) @@ -6951,6 +6980,7 @@ expand_omp_for_static_nochunk (struct om case GF_OMP_FOR_KIND_OACC_LOOP: { gimple_seq seq = NULL; + nthreads = expand_oacc_get_num_threads (&seq, region->gwv_this); threadid = expand_oacc_get_thread_num (&seq, region->gwv_this); gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT); @@ -7134,18 +7164,19 @@ expand_omp_for_static_nochunk (struct om /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing. */ gsi = gsi_last_bb (exit_bb); - if (!gimple_omp_return_nowait_p (gsi_stmt (gsi))) + if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) + { + gimple_seq seq = NULL; + + gen_oacc_loop_tail (&seq, region->gwv_this); + gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT); + } + else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi))) { t = gimple_omp_return_lhs (gsi_stmt (gsi)); - if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) - { - gcc_checking_assert (t == NULL_TREE); - if (oacc_loop_needs_threadbarrier_p (region->gwv_this)) - gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT); - } - else - gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT); + gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT); } + gsi_remove (&gsi, true); /* Connect all the blocks. */ @@ -7220,6 +7251,7 @@ find_phi_with_arg_on_edge (tree arg, edg where COND is "<" or ">", we generate pseudocode +OACC_LOOP_HEAD if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2; if (cond is <) adj = STEP - 1; @@ -7230,6 +7262,7 @@ find_phi_with_arg_on_edge (tree arg, edg else n = (adj + N2 - N1) / STEP; trip = 0; + V = threadid * CHUNK * STEP + N1; -- this extra definition of V is here so that V is defined if the loop is not entered @@ -7248,6 +7281,7 @@ find_phi_with_arg_on_edge (tree arg, edg trip += 1; goto L0; L4: +OACC_LOOP_TAIL */ static void @@ -7281,10 +7315,6 @@ expand_omp_for_static_chunk (struct omp_ gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2); fin_bb = BRANCH_EDGE (iter_part_bb)->dest; - /* Broadcast variables to OpenACC threads. */ - entry_bb = oacc_broadcast (entry_bb, fin_bb, region); - region->entry = entry_bb; - gcc_assert (broken_loop || fin_bb == FALLTHRU_EDGE (cont_bb)->dest); seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb)); @@ -7296,7 +7326,7 @@ expand_omp_for_static_chunk (struct omp_ trip_update_bb = split_edge (FALLTHRU_EDGE (cont_bb)); } exit_bb = region->exit; - + /* Trip and adjustment setup goes in ENTRY_BB. */ gsi = gsi_last_bb (entry_bb); gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR); @@ -7318,6 +7348,14 @@ expand_omp_for_static_chunk (struct omp_ t = fold_binary (fd->loop.cond_code, boolean_type_node, fold_convert (type, fd->loop.n1), fold_convert (type, fd->loop.n2)); + if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) + { + gimple_seq seq = NULL; + + gen_oacc_loop_head (&seq, region->gwv_this); + gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT); + } + if (fd->collapse == 1 && TYPE_UNSIGNED (type) && (t == NULL_TREE || !integer_onep (t))) @@ -7576,18 +7614,20 @@ expand_omp_for_static_chunk (struct omp_ /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing. */ gsi = gsi_last_bb (exit_bb); - if (!gimple_omp_return_nowait_p (gsi_stmt (gsi))) + + if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) + { + gimple_seq seq = NULL; + + gen_oacc_loop_tail (&seq, region->gwv_this); + gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT); + } + else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi))) { t = gimple_omp_return_lhs (gsi_stmt (gsi)); - if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP) - { - gcc_checking_assert (t == NULL_TREE); - if (oacc_loop_needs_threadbarrier_p (region->gwv_this)) - gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT); - } - else - gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT); + gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT); } + gsi_remove (&gsi, true); /* Connect the new blocks. */ @@ -9158,20 +9198,6 @@ expand_omp_atomic (struct omp_region *re expand_omp_atomic_mutex (load_bb, store_bb, addr, loaded_val, stored_val); } -/* Allocate storage for OpenACC worker threads in CTX to broadcast - condition results. */ - -static void -oacc_alloc_broadcast_storage (omp_context *ctx) -{ - tree vull_type_node = build_qualified_type (long_long_unsigned_type_node, - TYPE_QUAL_VOLATILE); - - ctx->worker_sync_elt - = alloc_var_ganglocal (NULL_TREE, vull_type_node, ctx, - TYPE_SIZE_UNIT (vull_type_node)); -} - /* Mark the loops inside the kernels region starting at REGION_ENTRY and ending at REGION_EXIT. */ @@ -9947,7 +9973,6 @@ find_omp_target_region_data (struct omp_ region->gwv_this |= OACC_LOOP_MASK (OACC_worker); if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH)) region->gwv_this |= OACC_LOOP_MASK (OACC_vector); - region->broadcast_array = gimple_omp_target_broadcast_array (stmt); } /* Helper for build_omp_regions. Scan the dominator tree starting at @@ -10091,669 +10116,6 @@ build_omp_regions (void) build_omp_regions_1 (ENTRY_BLOCK_PTR_FOR_FN (cfun), NULL, false); } -/* Walk the tree upwards from region until a target region is found - or we reach the end, then return it. */ -static omp_region * -enclosing_target_region (omp_region *region) -{ - while (region != NULL - && region->type != GIMPLE_OMP_TARGET) - region = region->outer; - return region; -} - -/* Return a mask of GWV_ values indicating the kind of OpenACC - predication required for basic blocks in REGION. */ - -static int -required_predication_mask (omp_region *region) -{ - while (region - && region->type != GIMPLE_OMP_FOR && region->type != GIMPLE_OMP_TARGET) - region = region->outer; - if (!region) - return 0; - - int outer_masks = region->gwv_this; - omp_region *outer_target = region; - while (outer_target != NULL && outer_target->type != GIMPLE_OMP_TARGET) - { - if (outer_target->type == GIMPLE_OMP_FOR) - outer_masks |= outer_target->gwv_this; - outer_target = outer_target->outer; - } - if (!outer_target) - return 0; - - int mask = 0; - if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_worker)) != 0 - && (region->type == GIMPLE_OMP_TARGET - || (outer_masks & OACC_LOOP_MASK (OACC_worker)) == 0)) - mask |= OACC_LOOP_MASK (OACC_worker); - if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_vector)) != 0 - && (region->type == GIMPLE_OMP_TARGET - || (outer_masks & OACC_LOOP_MASK (OACC_vector)) == 0)) - mask |= OACC_LOOP_MASK (OACC_vector); - return mask; -} - -/* Generate a broadcast across OpenACC vector threads (a warp on GPUs) - so that VAR is broadcast to DEST_VAR. The new statements are added - after WHERE. Return the stmt after which the block should be split. */ - -static gimple -generate_vector_broadcast (tree dest_var, tree var, - gimple_stmt_iterator &where) -{ - gimple retval = gsi_stmt (where); - tree vartype = TREE_TYPE (var); - tree call_arg_type = unsigned_type_node; - enum built_in_function fn = BUILT_IN_GOACC_THREAD_BROADCAST; - - if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type)) - { - fn = BUILT_IN_GOACC_THREAD_BROADCAST_LL; - call_arg_type = long_long_unsigned_type_node; - } - - bool need_conversion = !types_compatible_p (vartype, call_arg_type); - tree casted_var = var; - - if (need_conversion) - { - gassign *conv1 = NULL; - casted_var = create_tmp_var (call_arg_type); - - /* Handle floats and doubles. */ - if (!INTEGRAL_TYPE_P (vartype)) - { - tree t = fold_build1 (VIEW_CONVERT_EXPR, call_arg_type, var); - conv1 = gimple_build_assign (casted_var, t); - } - else - conv1 = gimple_build_assign (casted_var, NOP_EXPR, var); - - gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING); - } - - tree decl = builtin_decl_explicit (fn); - gimple call = gimple_build_call (decl, 1, casted_var); - gsi_insert_after (&where, call, GSI_NEW_STMT); - tree casted_dest = dest_var; - - if (need_conversion) - { - gassign *conv2 = NULL; - casted_dest = create_tmp_var (call_arg_type); - - if (!INTEGRAL_TYPE_P (vartype)) - { - tree t = fold_build1 (VIEW_CONVERT_EXPR, vartype, casted_dest); - conv2 = gimple_build_assign (dest_var, t); - } - else - conv2 = gimple_build_assign (dest_var, NOP_EXPR, casted_dest); - - gsi_insert_after (&where, conv2, GSI_CONTINUE_LINKING); - } - - gimple_call_set_lhs (call, casted_dest); - return retval; -} - -/* Generate a broadcast across OpenACC threads in REGION so that VAR - is broadcast to DEST_VAR. MASK specifies the parallelism level and - thereby the broadcast method. If it is only vector, we - can use a warp broadcast, otherwise we fall back to memory - store/load. */ - -static gimple -generate_oacc_broadcast (omp_region *region, tree dest_var, tree var, - gimple_stmt_iterator &where, int mask) -{ - if (mask == OACC_LOOP_MASK (OACC_vector)) - return generate_vector_broadcast (dest_var, var, where); - - omp_region *parent = enclosing_target_region (region); - - tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE); - tree ptr = create_tmp_var (build_pointer_type (elttype)); - gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR, - parent->broadcast_array); - gsi_insert_after (&where, cast1, GSI_NEW_STMT); - gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var); - gsi_insert_after (&where, st, GSI_NEW_STMT); - - gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT); - - gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR, - parent->broadcast_array); - gsi_insert_after (&where, cast2, GSI_NEW_STMT); - gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr)); - gsi_insert_after (&where, ld, GSI_NEW_STMT); - - gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT); - - return st; -} - -/* Build a test for OpenACC predication. TRUE_EDGE is the edge that should be - taken if the block should be executed. SKIP_DEST_BB is the destination to - jump to otherwise. MASK specifies the type of predication, it can contain - the bits for VECTOR and/or WORKER. */ - -static void -make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask) -{ - basic_block cond_bb = true_edge->src; - - gimple_stmt_iterator tmp_gsi = gsi_last_bb (cond_bb); - tree decl = builtin_decl_explicit (BUILT_IN_GOACC_ID); - tree comp_var = NULL_TREE; - unsigned ix; - - for (ix = OACC_worker; ix <= OACC_vector; ix++) - if (OACC_LOOP_MASK (ix) & mask) - { - gimple call = gimple_build_call - (decl, 1, build_int_cst (unsigned_type_node, ix)); - tree var = create_tmp_var (unsigned_type_node); - - gimple_call_set_lhs (call, var); - gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT); - if (comp_var) - { - tree new_comp = create_tmp_var (unsigned_type_node); - gassign *ior = gimple_build_assign (new_comp, - BIT_IOR_EXPR, comp_var, var); - gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT); - comp_var = new_comp; - } - else - comp_var = var; - } - - tree cond = build2 (EQ_EXPR, boolean_type_node, comp_var, - fold_convert (unsigned_type_node, integer_zero_node)); - gimple cond_stmt = gimple_build_cond_empty (cond); - gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT); - - true_edge->flags = EDGE_TRUE_VALUE; - - /* Force an abnormal edge before a broadcast operation that might be present - in SKIP_DEST_BB. This is only done for the non-execution edge (with - respect to the predication done by this function) -- the opposite - (execution) edge that reaches the broadcast operation must be made - abnormal also, e.g. in this function's caller. */ - edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE); - basic_block false_abnorm_bb = split_edge (e); - edge abnorm_edge = single_succ_edge (false_abnorm_bb); - abnorm_edge->flags |= EDGE_ABNORMAL; -} - -/* Apply OpenACC predication to basic block BB which is in - region PARENT. MASK has a bitmask of levels that need to be - applied; VECTOR and/or WORKER may be set. */ - -static void -predicate_bb (basic_block bb, struct omp_region *parent, int mask) -{ - /* We handle worker-single vector-partitioned loops by jumping - around them if not in the controlling worker. Don't insert - unnecessary (and incorrect) predication. */ - if (parent->type == GIMPLE_OMP_FOR - && (parent->gwv_this & OACC_LOOP_MASK (OACC_vector))) - mask &= ~OACC_LOOP_MASK (OACC_worker); - - if (mask == 0 || parent->type == GIMPLE_OMP_ATOMIC_LOAD) - return; - - gimple_stmt_iterator gsi; - gimple stmt; - - gsi = gsi_last_bb (bb); - stmt = gsi_stmt (gsi); - if (stmt == NULL) - return; - - basic_block skip_dest_bb = NULL; - - if (gimple_code (stmt) == GIMPLE_OMP_ENTRY_END) - return; - - if (gimple_code (stmt) == GIMPLE_COND) - { - tree cond_var = create_tmp_var (boolean_type_node); - tree broadcast_cond = create_tmp_var (boolean_type_node); - gassign *asgn = gimple_build_assign (cond_var, - gimple_cond_code (stmt), - gimple_cond_lhs (stmt), - gimple_cond_rhs (stmt)); - gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING); - gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn); - - gimple splitpoint = generate_oacc_broadcast (parent, broadcast_cond, - cond_var, gsi_asgn, - mask); - - edge e = split_block (bb, splitpoint); - e->flags = EDGE_ABNORMAL; - skip_dest_bb = e->dest; - - gimple_cond_set_condition (as_a (stmt), EQ_EXPR, - broadcast_cond, boolean_true_node); - } - else if (gimple_code (stmt) == GIMPLE_SWITCH) - { - gswitch *sstmt = as_a (stmt); - tree var = gimple_switch_index (sstmt); - tree new_var = create_tmp_var (TREE_TYPE (var)); - - gassign *asgn = gimple_build_assign (new_var, var); - gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING); - gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn); - - gimple splitpoint = generate_oacc_broadcast (parent, new_var, var, - gsi_asgn, mask); - - edge e = split_block (bb, splitpoint); - e->flags = EDGE_ABNORMAL; - skip_dest_bb = e->dest; - - gimple_switch_set_index (sstmt, new_var); - } - else if (is_gimple_omp (stmt)) - { - gsi_prev (&gsi); - gimple split_stmt = gsi_stmt (gsi); - enum gimple_code code = gimple_code (stmt); - - /* First, see if we must predicate away an entire loop or atomic region. */ - if (code == GIMPLE_OMP_FOR - || code == GIMPLE_OMP_ATOMIC_LOAD) - { - omp_region *inner; - inner = *bb_region_map->get (FALLTHRU_EDGE (bb)->dest); - skip_dest_bb = single_succ (inner->exit); - gcc_assert (inner->entry == bb); - if (code != GIMPLE_OMP_FOR - || ((inner->gwv_this & OACC_LOOP_MASK (OACC_vector)) - && !(inner->gwv_this & OACC_LOOP_MASK (OACC_worker)) - && (mask & OACC_LOOP_MASK (OACC_worker)))) - { - gimple_stmt_iterator head_gsi = gsi_start_bb (bb); - gsi_prev (&head_gsi); - edge e0 = split_block (bb, gsi_stmt (head_gsi)); - int mask2 = mask; - if (code == GIMPLE_OMP_FOR) - mask2 &= ~OACC_LOOP_MASK (OACC_vector); - if (!split_stmt || code != GIMPLE_OMP_FOR) - { - /* The simple case: nothing here except the for, - so we just need to make one branch around the - entire loop. */ - inner->entry = e0->dest; - make_predication_test (e0, skip_dest_bb, mask2); - return; - } - basic_block for_block = e0->dest; - /* The general case, make two conditions - a full one around the - code preceding the for, and one branch around the loop. */ - edge e1 = split_block (for_block, split_stmt); - basic_block bb3 = e1->dest; - edge e2 = split_block (for_block, split_stmt); - basic_block bb2 = e2->dest; - - make_predication_test (e0, bb2, mask); - make_predication_test (single_pred_edge (bb3), skip_dest_bb, - mask2); - inner->entry = bb3; - return; - } - } - - /* Only a few statements need special treatment. */ - if (gimple_code (stmt) != GIMPLE_OMP_FOR - && gimple_code (stmt) != GIMPLE_OMP_CONTINUE - && gimple_code (stmt) != GIMPLE_OMP_RETURN) - { - edge e = single_succ_edge (bb); - skip_dest_bb = e->dest; - } - else - { - if (!split_stmt) - return; - edge e = split_block (bb, split_stmt); - skip_dest_bb = e->dest; - if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE) - { - gcc_assert (parent->cont == bb); - parent->cont = skip_dest_bb; - } - else if (gimple_code (stmt) == GIMPLE_OMP_RETURN) - { - gcc_assert (parent->exit == bb); - parent->exit = skip_dest_bb; - } - else if (gimple_code (stmt) == GIMPLE_OMP_FOR) - { - omp_region *inner; - inner = *bb_region_map->get (FALLTHRU_EDGE (skip_dest_bb)->dest); - gcc_assert (inner->entry == bb); - inner->entry = skip_dest_bb; - } - } - } - else if (single_succ_p (bb)) - { - edge e = single_succ_edge (bb); - skip_dest_bb = e->dest; - if (gimple_code (stmt) == GIMPLE_GOTO) - gsi_prev (&gsi); - if (gsi_stmt (gsi) == 0) - return; - } - - if (skip_dest_bb != NULL) - { - gimple_stmt_iterator head_gsi = gsi_start_bb (bb); - gsi_prev (&head_gsi); - edge e2 = split_block (bb, gsi_stmt (head_gsi)); - make_predication_test (e2, skip_dest_bb, mask); - } -} - -/* Walk the dominator tree starting at BB to collect basic blocks in - WORKLIST which need OpenACC vector predication applied to them. */ - -static void -find_predicatable_bbs (basic_block bb, vec &worklist) -{ - struct omp_region *parent = *bb_region_map->get (bb); - if (required_predication_mask (parent) != 0) - worklist.safe_push (bb); - basic_block son; - for (son = first_dom_son (CDI_DOMINATORS, bb); - son; - son = next_dom_son (CDI_DOMINATORS, son)) - find_predicatable_bbs (son, worklist); -} - -/* Apply OpenACC vector predication to all basic blocks. HEAD_BB is the - first. */ - -static void -predicate_omp_regions (basic_block head_bb) -{ - vec worklist = vNULL; - find_predicatable_bbs (head_bb, worklist); - int i; - basic_block bb; - FOR_EACH_VEC_ELT (worklist, i, bb) - { - omp_region *region = *bb_region_map->get (bb); - int mask = required_predication_mask (region); - predicate_bb (bb, region, mask); - } -} - -/* USE and GET sets for variable broadcasting. */ -static std::set use, gen, live_in; - -/* This is an extremely conservative live in analysis. We only want to - detect is any compiler temporary used inside an acc loop is local to - that loop or not. So record all decl uses in all the basic blocks - post-dominating the acc loop in question. */ -static tree -populate_loop_live_in (tree *tp, int *walk_subtrees, - void *data_ ATTRIBUTE_UNUSED) -{ - struct walk_stmt_info *wi = (struct walk_stmt_info *) data_; - - if (wi && wi->is_lhs) - { - if (VAR_P (*tp)) - live_in.insert (*tp); - } - else if (IS_TYPE_OR_DECL_P (*tp)) - *walk_subtrees = 0; - - return NULL_TREE; -} - -static void -oacc_populate_live_in_1 (basic_block entry_bb, basic_block exit_bb, - basic_block loop_bb) -{ - basic_block son; - gimple_stmt_iterator gsi; - - if (entry_bb == exit_bb) - return; - - if (!dominated_by_p (CDI_DOMINATORS, loop_bb, entry_bb)) - return; - - for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi)) - { - struct walk_stmt_info wi; - gimple stmt; - - memset (&wi, 0, sizeof (wi)); - stmt = gsi_stmt (gsi); - - walk_gimple_op (stmt, populate_loop_live_in, &wi); - } - - /* Continue walking the dominator tree. */ - for (son = first_dom_son (CDI_DOMINATORS, entry_bb); - son; - son = next_dom_son (CDI_DOMINATORS, son)) - oacc_populate_live_in_1 (son, exit_bb, loop_bb); -} - -static void -oacc_populate_live_in (basic_block entry_bb, omp_region *region) -{ - /* Find the innermost OMP_TARGET region. */ - while (region && region->type != GIMPLE_OMP_TARGET) - region = region->outer; - - if (!region) - return; - - basic_block son; - - for (son = first_dom_son (CDI_DOMINATORS, region->entry); - son; - son = next_dom_son (CDI_DOMINATORS, son)) - oacc_populate_live_in_1 (son, region->exit, entry_bb); -} - -static tree -populate_loop_use (tree *tp, int *walk_subtrees, void *data_) -{ - struct walk_stmt_info *wi = (struct walk_stmt_info *) data_; - std::set::iterator it; - - /* There isn't much to do for LHS ops. There shouldn't be any pointers - or references here. */ - if (wi && wi->is_lhs) - return NULL_TREE; - - if (VAR_P (*tp)) - { - tree type; - - *walk_subtrees = 0; - - /* Filter out incompatible decls. */ - if (INDIRECT_REF_P (*tp) || is_global_var (*tp)) - return NULL_TREE; - - type = TREE_TYPE (*tp); - - /* Aggregate types aren't supported either. */ - if (AGGREGATE_TYPE_P (type)) - return NULL_TREE; - - /* Filter out decls inside GEN. */ - it = gen.find (*tp); - if (it == gen.end ()) - use.insert (*tp); - } - else if (IS_TYPE_OR_DECL_P (*tp)) - *walk_subtrees = 0; - - return NULL_TREE; -} - -/* INIT is true if this is the first time this function is called. */ - -static void -oacc_broadcast_1 (basic_block entry_bb, basic_block exit_bb, bool init, - int mask) -{ - basic_block son; - gimple_stmt_iterator gsi; - gimple stmt; - tree block, var; - - if (entry_bb == exit_bb) - return; - - /* Populate the GEN set. */ - - gsi = gsi_start_bb (entry_bb); - stmt = gsi_stmt (gsi); - - /* There's nothing to do if stmt is empty or if this is the entry basic - block to the vector loop. The entry basic block to pre-expanded loops - do not have an entry label. As such, the scope containing the initial - entry_bb should not be added to the gen set. */ - if (stmt != NULL && !init && (block = gimple_block (stmt)) != NULL) - for (var = BLOCK_VARS (block); var; var = DECL_CHAIN (var)) - gen.insert(var); - - /* Populate the USE set. */ - - for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi)) - { - struct walk_stmt_info wi; - - memset (&wi, 0, sizeof (wi)); - stmt = gsi_stmt (gsi); - - walk_gimple_op (stmt, populate_loop_use, &wi); - } - - /* Continue processing the children of this basic block. */ - for (son = first_dom_son (CDI_DOMINATORS, entry_bb); - son; - son = next_dom_son (CDI_DOMINATORS, son)) - oacc_broadcast_1 (son, exit_bb, false, mask); -} - -/* Broadcast variables to OpenACC vector loops. This function scans - all of the basic blocks withing an acc vector loop. It maintains - two sets of decls, a GEN set and a USE set. The GEN set contains - all of the decls in the the basic block's scope. The USE set - consists of decls used in current basic block, but are not in the - GEN set, globally defined or were transferred into the the accelerator - via a data movement clause. - - The vector loop begins at ENTRY_BB and end at EXIT_BB, where EXIT_BB - is a latch back to ENTRY_BB. Once a set of used variables have been - determined, they will get broadcasted in a pre-header to ENTRY_BB. */ - -static basic_block -oacc_broadcast (basic_block entry_bb, basic_block exit_bb, omp_region *region) -{ - gimple_stmt_iterator gsi; - std::set::iterator it; - int mask = region->gwv_this; - - /* Nothing to do if this isn't an acc worker or vector loop. */ - if (mask == 0) - return entry_bb; - - use.empty (); - gen.empty (); - live_in.empty (); - - /* Currently, subroutines aren't supported. */ - gcc_assert (!lookup_attribute ("oacc function", - DECL_ATTRIBUTES (current_function_decl))); - - /* Populate live_in. */ - oacc_populate_live_in (entry_bb, region); - - /* Populate the set of used decls. */ - oacc_broadcast_1 (entry_bb, exit_bb, true, mask); - - /* Filter out all of the GEN decls from the USE set. Also filter out - any compiler temporaries that which are not present in LIVE_IN. */ - for (it = use.begin (); it != use.end (); it++) - { - std::set::iterator git, lit; - - git = gen.find (*it); - lit = live_in.find (*it); - if (git != gen.end () || lit == live_in.end ()) - use.erase (it); - } - - if (mask == OACC_LOOP_MASK (OACC_vector)) - { - /* Broadcast all decls in USE right before the last instruction in - entry_bb. */ - gsi = gsi_last_bb (entry_bb); - - gimple_seq seq = NULL; - gimple_stmt_iterator g2 = gsi_start (seq); - - for (it = use.begin (); it != use.end (); it++) - generate_oacc_broadcast (region, *it, *it, g2, mask); - - gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING); - } - else if (mask & OACC_LOOP_MASK (OACC_worker)) - { - if (use.empty ()) - return entry_bb; - - /* If this loop contains a worker, then each broadcast must be - predicated. */ - - for (it = use.begin (); it != use.end (); it++) - { - /* Worker broadcasting requires predication. To do that, there - needs to be several new parent basic blocks before the omp - for instruction. */ - - gimple_seq seq = NULL; - gimple_stmt_iterator g2 = gsi_start (seq); - gimple splitpoint = generate_oacc_broadcast (region, *it, *it, - g2, mask); - gsi = gsi_last_bb (entry_bb); - gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING); - edge e = split_block (entry_bb, splitpoint); - e->flags |= EDGE_ABNORMAL; - basic_block dest_bb = e->dest; - gsi_prev (&gsi); - edge e2 = split_block (entry_bb, gsi_stmt (gsi)); - e2->flags |= EDGE_ABNORMAL; - make_predication_test (e2, dest_bb, mask); - - /* Update entry_bb. */ - entry_bb = dest_bb; - } - } - - return entry_bb; -} - /* Main entry point for expanding OMP-GIMPLE into runtime calls. */ static unsigned int @@ -10772,8 +10134,6 @@ execute_expand_omp (void) fprintf (dump_file, "\n"); } - predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun)); - remove_exit_barriers (root_omp_region); expand_omp (root_omp_region); @@ -12342,10 +11702,7 @@ lower_omp_target (gimple_stmt_iterator * orlist = NULL; if (is_gimple_omp_oacc (stmt)) - { - oacc_init_count_vars (ctx, clauses); - oacc_alloc_broadcast_storage (ctx); - } + oacc_init_count_vars (ctx, clauses); if (has_reduction) { @@ -12631,7 +11988,6 @@ lower_omp_target (gimple_stmt_iterator * gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT); gimple_omp_target_set_ganglocal_size (stmt, sz); - gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt); pop_gimplify_context (NULL); } @@ -13348,16 +12704,7 @@ make_gimple_omp_edges (basic_block bb, s ((for_stmt = last_stmt (cur_region->entry)))) == GF_OMP_FOR_KIND_OACC_LOOP) { - /* Called before OMP expansion, so this information has not been - recorded in cur_region->gwv_this yet. */ - int gwv_bits = find_omp_for_region_gwv (for_stmt); - if (oacc_loop_needs_threadbarrier_p (gwv_bits)) - { - make_edge (bb, bb->next_bb, EDGE_FALLTHRU | EDGE_ABNORMAL); - fallthru = false; - } - else - fallthru = true; + fallthru = true; } else /* In the case of a GIMPLE_OMP_SECTION, the edge will go Index: omp-low.h =================================================================== --- omp-low.h (revision 225323) +++ omp-low.h (working copy) @@ -20,6 +20,8 @@ along with GCC; see the file COPYING3. #ifndef GCC_OMP_LOW_H #define GCC_OMP_LOW_H +/* Levels of parallelism as defined by OpenACC. Increasing numbers + correspond to deeper loop nesting levels. */ enum oacc_loop_levels { OACC_gang, @@ -27,6 +29,7 @@ enum oacc_loop_levels OACC_vector, OACC_HWM }; +#define OACC_LOOP_MASK(X) (1 << (X)) struct omp_region; Index: builtins.c =================================================================== --- builtins.c (revision 225323) +++ builtins.c (working copy) @@ -5947,20 +5947,6 @@ expand_builtin_acc_on_device (tree exp A #endif } -/* Expand a thread synchronization point for OpenACC threads. */ -static void -expand_oacc_threadbarrier (void) -{ -#ifdef HAVE_oacc_threadbarrier - rtx insn = GEN_FCN (CODE_FOR_oacc_threadbarrier) (); - if (insn != NULL_RTX) - { - emit_insn (insn); - } -#endif -} - - /* Expand a thread-id/thread-count builtin for OpenACC. */ static rtx @@ -6032,47 +6018,6 @@ expand_oacc_ganglocal_ptr (rtx target AT return NULL_RTX; } -/* Handle a GOACC_thread_broadcast builtin call EXP with target TARGET. - Return the result. */ - -static rtx -expand_builtin_oacc_thread_broadcast (tree exp, rtx target) -{ - tree arg0 = CALL_EXPR_ARG (exp, 0); - enum insn_code icode; - - enum machine_mode mode = TYPE_MODE (TREE_TYPE (arg0)); - gcc_assert (INTEGRAL_MODE_P (mode)); - do - { - icode = direct_optab_handler (oacc_thread_broadcast_optab, mode); - mode = GET_MODE_WIDER_MODE (mode); - } - while (icode == CODE_FOR_nothing && mode != VOIDmode); - if (icode == CODE_FOR_nothing) - return expand_expr (arg0, NULL_RTX, VOIDmode, EXPAND_NORMAL); - - rtx tmp = target; - machine_mode mode0 = insn_data[icode].operand[0].mode; - machine_mode mode1 = insn_data[icode].operand[1].mode; - if (!tmp || !REG_P (tmp) || GET_MODE (tmp) != mode0) - tmp = gen_reg_rtx (mode0); - rtx op1 = expand_expr (arg0, NULL_RTX, mode1, EXPAND_NORMAL); - if (GET_MODE (op1) != mode1) - op1 = convert_to_mode (mode1, op1, 0); - - /* op1 might be an immediate, place it inside a register. */ - op1 = force_reg (mode1, op1); - - rtx insn = GEN_FCN (icode) (tmp, op1); - if (insn != NULL_RTX) - { - emit_insn (insn); - return tmp; - } - return const0_rtx; -} - /* Expand an expression EXP that calls a built-in function, with result going to TARGET if that's convenient (and in mode MODE if that's convenient). @@ -7225,14 +7170,6 @@ expand_builtin (tree exp, rtx target, rt return target; break; - case BUILT_IN_GOACC_THREAD_BROADCAST: - case BUILT_IN_GOACC_THREAD_BROADCAST_LL: - return expand_builtin_oacc_thread_broadcast (exp, target); - - case BUILT_IN_GOACC_THREADBARRIER: - expand_oacc_threadbarrier (); - return const0_rtx; - default: /* just do library call, if unknown builtin */ break; } Index: omp-builtins.def =================================================================== --- omp-builtins.def (revision 225323) +++ omp-builtins.def (working copy) @@ -69,13 +69,6 @@ DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_GA BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST) DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DEVICEPTR, "GOACC_deviceptr", BT_FN_PTR_PTR, ATTR_CONST_NOTHROW_LEAF_LIST) -DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST, "GOACC_thread_broadcast", - BT_FN_UINT_UINT, ATTR_NOTHROW_LEAF_LIST) -DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST_LL, "GOACC_thread_broadcast_ll", - BT_FN_ULONGLONG_ULONGLONG, ATTR_NOTHROW_LEAF_LIST) -DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREADBARRIER, "GOACC_threadbarrier", - BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST) - DEF_GOACC_BUILTIN_COMPILER (BUILT_IN_ACC_ON_DEVICE, "acc_on_device", BT_FN_INT_INT, ATTR_CONST_NOTHROW_LEAF_LIST) Index: gimple.c =================================================================== --- gimple.c (revision 225323) +++ gimple.c (working copy) @@ -1380,12 +1380,27 @@ bool gimple_call_same_target_p (const_gimple c1, const_gimple c2) { if (gimple_call_internal_p (c1)) - return (gimple_call_internal_p (c2) - && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2)); + { + if (!gimple_call_internal_p (c2) + || gimple_call_internal_fn (c1) != gimple_call_internal_fn (c2)) + return false; + + if (gimple_call_internal_unique_p (c1)) + return false; + + return true; + } + else if (gimple_call_fn (c1) == gimple_call_fn (c2)) + return true; else - return (gimple_call_fn (c1) == gimple_call_fn (c2) - || (gimple_call_fndecl (c1) - && gimple_call_fndecl (c1) == gimple_call_fndecl (c2))); + { + tree decl = gimple_call_fndecl (c1); + + if (!decl || decl != gimple_call_fndecl (c2)) + return false; + + return true; + } } /* Detect flags from a GIMPLE_CALL. This is just like Index: gimple.h =================================================================== --- gimple.h (revision 225323) +++ gimple.h (working copy) @@ -581,10 +581,6 @@ struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT /* [ WORD 11 ] Size of the gang-local memory to allocate. */ tree ganglocal_size; - - /* [ WORD 12 ] - A pointer to the array to be used for broadcasting across threads. */ - tree broadcast_array; }; /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */ @@ -2693,6 +2689,11 @@ gimple_call_internal_fn (const_gimple gs return static_cast (gs)->u.internal_fn; } +/* Return true, if this internal gimple call is unique. */ + +extern bool +gimple_call_internal_unique_p (const_gimple); + /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt that could alter control flow. */ @@ -5248,25 +5249,6 @@ gimple_omp_target_set_ganglocal_size (go } -/* Return the pointer to the broadcast array associated with OMP_TARGET GS. */ - -static inline tree -gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt) -{ - return omp_target_stmt->broadcast_array; -} - - -/* Set PTR to be the broadcast array associated with OMP_TARGET - GS. */ - -static inline void -gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr) -{ - omp_target_stmt->broadcast_array = ptr; -} - - /* Return the clauses associated with OMP_TEAMS GS. */ static inline tree Index: tree-ssa-threadedge.c =================================================================== --- tree-ssa-threadedge.c (revision 225323) +++ tree-ssa-threadedge.c (working copy) @@ -310,6 +310,17 @@ record_temporary_equivalences_from_stmts && gimple_asm_volatile_p (as_a (stmt))) return NULL; + /* If the statement is a unique builtin, we can not thread + through here. */ + if (gimple_code (stmt) == GIMPLE_CALL) + { + gcall *call = as_a (stmt); + + if (gimple_call_internal_p (call) + && gimple_call_internal_unique_p (call)) + return NULL; + } + /* If duplicating this block is going to cause too much code expansion, then do not thread through this block. */ stmt_count++; Index: tree-ssa-tail-merge.c =================================================================== --- tree-ssa-tail-merge.c (revision 225323) +++ tree-ssa-tail-merge.c (working copy) @@ -608,10 +608,13 @@ same_succ_def::equal (const same_succ_de { s1 = gsi_stmt (gsi1); s2 = gsi_stmt (gsi2); - if (gimple_code (s1) != gimple_code (s2)) - return 0; - if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2)) - return 0; + if (s1 != s2) + { + if (gimple_code (s1) != gimple_code (s2)) + return 0; + if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2)) + return 0; + } gsi_next_nondebug (&gsi1); gsi_next_nondebug (&gsi2); gsi_advance_fw_nondebug_nonlocal (&gsi1); Index: internal-fn.def =================================================================== --- internal-fn.def (revision 225323) +++ internal-fn.def (working copy) @@ -64,3 +64,5 @@ DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL) DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL) DEF_INTERNAL_FN (GOACC_DATA_END_WITH_ARG, ECF_NOTHROW, ".r") +DEF_INTERNAL_FN (GOACC_LEVELS, ECF_NOTHROW | ECF_LEAF, "..") +DEF_INTERNAL_FN (GOACC_LOOP, ECF_NOTHROW | ECF_LEAF, "..") Index: tree-ssa-alias.c =================================================================== --- tree-ssa-alias.c (revision 225323) +++ tree-ssa-alias.c (working copy) @@ -1764,7 +1764,6 @@ ref_maybe_used_by_call_p_1 (gcall *call, case BUILT_IN_GOMP_ATOMIC_END: case BUILT_IN_GOMP_BARRIER: case BUILT_IN_GOMP_BARRIER_CANCEL: - case BUILT_IN_GOACC_THREADBARRIER: case BUILT_IN_GOMP_TASKWAIT: case BUILT_IN_GOMP_TASKGROUP_END: case BUILT_IN_GOMP_CRITICAL_START: Index: config/nvptx/nvptx-protos.h =================================================================== --- config/nvptx/nvptx-protos.h (revision 225323) +++ config/nvptx/nvptx-protos.h (working copy) @@ -32,6 +32,7 @@ extern void nvptx_register_pragmas (void extern const char *nvptx_section_for_decl (const_tree); #ifdef RTX_CODE +extern void nvptx_expand_oacc_loop (rtx, rtx); extern void nvptx_expand_call (rtx, rtx); extern rtx nvptx_expand_compare (rtx); extern const char *nvptx_ptx_type_from_mode (machine_mode, bool); Index: config/nvptx/nvptx.md =================================================================== --- config/nvptx/nvptx.md (revision 225323) +++ config/nvptx/nvptx.md (working copy) @@ -52,15 +52,23 @@ UNSPEC_NID UNSPEC_SHARED_DATA + + UNSPEC_BIT_CONV + + UNSPEC_BROADCAST + UNSPEC_BR_UNIFIED ]) (define_c_enum "unspecv" [ UNSPECV_LOCK UNSPECV_CAS UNSPECV_XCHG - UNSPECV_WARP_BCAST UNSPECV_BARSYNC UNSPECV_ID + + UNSPECV_LEVELS + UNSPECV_LOOP + UNSPECV_BR_HIDDEN ]) (define_attr "subregs_ok" "false,true" @@ -253,6 +261,8 @@ (define_mode_iterator QHSIM [QI HI SI]) (define_mode_iterator SDFM [SF DF]) (define_mode_iterator SDCM [SC DC]) +(define_mode_iterator BITS [SI SF]) +(define_mode_iterator BITD [DI DF]) ;; This mode iterator allows :P to be used for patterns that operate on ;; pointer-sized quantities. Exactly one of the two alternatives will match. @@ -813,7 +823,7 @@ (label_ref (match_operand 1 "" "")) (pc)))] "" - "%j0\\tbra%U0\\t%l1;") + "%j0\\tbra\\t%l1;") (define_insn "br_false" [(set (pc) @@ -822,7 +832,34 @@ (label_ref (match_operand 1 "" "")) (pc)))] "" - "%J0\\tbra%U0\\t%l1;") + "%J0\\tbra\\t%l1;") + +;; a hidden conditional branch +(define_insn "br_true_hidden" + [(unspec_volatile:SI [(ne (match_operand:BI 0 "nvptx_register_operand" "R") + (const_int 0)) + (label_ref (match_operand 1 "" "")) + (match_operand:SI 2 "const_int_operand" "i")] + UNSPECV_BR_HIDDEN)] + "" + "%j0\\tbra%U2\\t%l1;") + +;; unified conditional branch +(define_insn "br_uni_true" + [(set (pc) (if_then_else + (ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")] + UNSPEC_BR_UNIFIED) (const_int 0)) + (label_ref (match_operand 1 "" "")) (pc)))] + "" + "%j0\\tbra.uni\\t%l1;") + +(define_insn "br_uni_false" + [(set (pc) (if_then_else + (eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")] + UNSPEC_BR_UNIFIED) (const_int 0)) + (label_ref (match_operand 1 "" "")) (pc)))] + "" + "%J0\\tbra.uni\\t%l1;") (define_expand "cbranch4" [(set (pc) @@ -1326,37 +1363,72 @@ return asms[INTVAL (operands[1])]; }) -(define_insn "oacc_thread_broadcastsi" - [(set (match_operand:SI 0 "nvptx_register_operand" "") - (unspec_volatile:SI [(match_operand:SI 1 "nvptx_register_operand" "")] - UNSPECV_WARP_BCAST))] +(define_insn "oacc_levels" + [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")] + UNSPECV_LEVELS)] "" - "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;") + "// levels %0;" +) -(define_expand "oacc_thread_broadcastdi" - [(set (match_operand:DI 0 "nvptx_register_operand" "") - (unspec_volatile:DI [(match_operand:DI 1 "nvptx_register_operand" "")] - UNSPECV_WARP_BCAST))] - "" -{ - rtx t = gen_reg_rtx (DImode); - emit_insn (gen_lshrdi3 (t, operands[1], GEN_INT (32))); - rtx op0 = force_reg (SImode, gen_lowpart (SImode, t)); - rtx op1 = force_reg (SImode, gen_lowpart (SImode, operands[1])); - rtx targ0 = gen_reg_rtx (SImode); - rtx targ1 = gen_reg_rtx (SImode); - emit_insn (gen_oacc_thread_broadcastsi (targ0, op0)); - emit_insn (gen_oacc_thread_broadcastsi (targ1, op1)); - rtx t2 = gen_reg_rtx (DImode); - rtx t3 = gen_reg_rtx (DImode); - emit_insn (gen_extendsidi2 (t2, targ0)); - emit_insn (gen_extendsidi2 (t3, targ1)); - rtx t4 = gen_reg_rtx (DImode); - emit_insn (gen_ashldi3 (t4, t2, GEN_INT (32))); - emit_insn (gen_iordi3 (operands[0], t3, t4)); - DONE; +(define_insn "nvptx_loop" + [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "") + (match_operand:SI 1 "const_int_operand" "")] + UNSPECV_LOOP)] + "" + "// loop %0, %1;" +) + +(define_expand "oacc_loop" + [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "") + (match_operand:SI 1 "const_int_operand" "")] + UNSPECV_LOOP)] + "" +{ + nvptx_expand_oacc_loop (operands[0], operands[1]); }) +;; only 32-bit shuffles exist. +(define_insn "nvptx_broadcast" + [(set (match_operand:BITS 0 "nvptx_register_operand" "") + (unspec:BITS + [(match_operand:BITS 1 "nvptx_register_operand" "")] + UNSPEC_BROADCAST))] + "" + "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;") + +;; extract parts of a 64 bit object into 2 32-bit ints +(define_insn "unpacksi2" + [(set (match_operand:SI 0 "nvptx_register_operand" "") + (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "") + (const_int 0)] UNSPEC_BIT_CONV)) + (set (match_operand:SI 1 "nvptx_register_operand" "") + (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))] + "" + "%.\\tmov.b64 {%0,%1}, %2;") + +;; pack 2 32-bit ints into a 64 bit object +(define_insn "packsi2" + [(set (match_operand:BITD 0 "nvptx_register_operand" "") + (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "") + (match_operand:SI 2 "nvptx_register_operand" "")] + UNSPEC_BIT_CONV))] + "" + "%.\\tmov.b64 %0, {%1,%2};") + +(define_insn "worker_load" + [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R") + (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")] + UNSPEC_SHARED_DATA))] + "" + "%.\\tld.shared%u0\\t%0,%1;") + +(define_insn "worker_store" + [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")] + UNSPEC_SHARED_DATA) + (match_operand:SDISDFM 0 "nvptx_register_operand" "R"))] + "" + "%.\\tst.shared%u1\\t%1,%0;") + (define_insn "ganglocal_ptr" [(set (match_operand:P 0 "nvptx_register_operand" "") (unspec:P [(const_int 0)] UNSPEC_SHARED_DATA))] @@ -1462,14 +1534,8 @@ "%.\\tatom%A1.b%T0.\\t%0, %1, %2;") ;; ??? Mark as not predicable later? -(define_insn "threadbarrier_insn" - [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)] +(define_insn "nvptx_barsync" + [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] + UNSPECV_BARSYNC)] "" "bar.sync\\t%0;") - -(define_expand "oacc_threadbarrier" - [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)] - "" -{ - operands[0] = const0_rtx; -}) Index: config/nvptx/nvptx.c =================================================================== --- config/nvptx/nvptx.c (revision 225323) +++ config/nvptx/nvptx.c (working copy) @@ -24,6 +24,7 @@ #include "coretypes.h" #include "tm.h" #include "rtl.h" +#include "hash-map.h" #include "hash-set.h" #include "machmode.h" #include "vec.h" @@ -74,6 +75,15 @@ #include "df.h" #include "dumpfile.h" #include "builtins.h" +#include "dominance.h" +#include "cfg.h" +#include "omp-low.h" + +#define nvptx_loop_head 0 +#define nvptx_loop_tail 1 +#define LOOP_MODE_CHANGE_P(X) ((X) < 2) +#define nvptx_loop_prehead 2 +#define nvptx_loop_pretail 3 /* Record the function decls we've written, and the libfuncs and function decls corresponding to them. */ @@ -97,6 +107,16 @@ static GTY((cache)) static GTY((cache)) hash_table *declared_fndecls_htab; static GTY((cache)) hash_table *needed_fndecls_htab; +/* Size of buffer needed to broadcast across workers. This is used + for both worker-neutering and worker broadcasting. It is shared + by all functions emitted. The buffer is placed in shared memory. + It'd be nice if PTX supported common blocks, because then this + could be shared across TUs (taking the largest size). */ +static unsigned worker_bcast_hwm; +static unsigned worker_bcast_align; +#define worker_bcast_name "__worker_bcast" +static GTY(()) rtx worker_bcast_sym; + /* Allocate a new, cleared machine_function structure. */ static struct machine_function * @@ -124,6 +144,8 @@ nvptx_option_override (void) needed_fndecls_htab = hash_table::create_ggc (17); declared_libfuncs_htab = hash_table::create_ggc (17); + + worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name); } /* Return the mode to be used when declaring a ptx object for OBJ. @@ -1053,6 +1075,7 @@ nvptx_static_chain (const_tree fndecl, b return gen_rtx_REG (Pmode, OUTGOING_STATIC_CHAIN_REGNUM); } + /* Emit a comparison COMPARE, and return the new test to be used in the jump. */ @@ -1066,6 +1089,203 @@ nvptx_expand_compare (rtx compare) return gen_rtx_NE (BImode, pred, const0_rtx); } + +/* Expand the oacc_loop primitive into ptx-required unspecs. */ + +void +nvptx_expand_oacc_loop (rtx kind, rtx mode) +{ + /* Emit pre-tail for all loops and emit pre-head for worker level. */ + if (UINTVAL (kind) || UINTVAL (mode) == OACC_worker) + emit_insn (gen_nvptx_loop (GEN_INT (UINTVAL (kind) + 2), mode)); +} + +/* Generate instruction(s) to unpack a 64 bit object into 2 32 bit + objects. */ + +static rtx +nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src) +{ + rtx res; + + switch (GET_MODE (src)) + { + case DImode: + res = gen_unpackdisi2 (dst0, dst1, src); + break; + case DFmode: + res = gen_unpackdfsi2 (dst0, dst1, src); + break; + default: gcc_unreachable (); + } + return res; +} + +/* Generate instruction(s) to pack 2 32 bit objects into a 64 bit + object. */ + +static rtx +nvptx_gen_pack (rtx dst, rtx src0, rtx src1) +{ + rtx res; + + switch (GET_MODE (dst)) + { + case DImode: + res = gen_packsidi2 (dst, src0, src1); + break; + case DFmode: + res = gen_packsidf2 (dst, src0, src1); + break; + default: gcc_unreachable (); + } + return res; +} + +/* Generate an instruction or sequence to broadcast register REG + across the vectors of a single warp. */ + +static rtx +nvptx_gen_vcast (rtx reg) +{ + rtx res; + + switch (GET_MODE (reg)) + { + case SImode: + res = gen_nvptx_broadcastsi (reg, reg); + break; + case SFmode: + res = gen_nvptx_broadcastsf (reg, reg); + break; + case DImode: + case DFmode: + { + rtx tmp0 = gen_reg_rtx (SImode); + rtx tmp1 = gen_reg_rtx (SImode); + + start_sequence (); + emit_insn (nvptx_gen_unpack (tmp0, tmp1, reg)); + emit_insn (nvptx_gen_vcast (tmp0)); + emit_insn (nvptx_gen_vcast (tmp1)); + emit_insn (nvptx_gen_pack (reg, tmp0, tmp1)); + res = get_insns (); + end_sequence (); + } + break; + case BImode: + { + rtx tmp = gen_reg_rtx (SImode); + + start_sequence (); + emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx)); + emit_insn (nvptx_gen_vcast (tmp)); + emit_insn (gen_rtx_SET (BImode, reg, + gen_rtx_NE (BImode, tmp, const0_rtx))); + res = get_insns (); + end_sequence (); + } + break; + + case HImode: + case QImode: + default:debug_rtx (reg);gcc_unreachable (); + } + return res; +} + +/* Structure used when generating a worker-level spill or fill. */ + +struct wcast_data_t +{ + rtx base; + rtx ptr; + unsigned offset; +}; + +/* Direction of the spill/fill and looping setup/teardown indicator. */ + +enum propagate_mask + { + PM_read = 1 << 0, + PM_write = 1 << 1, + PM_loop_begin = 1 << 2, + PM_loop_end = 1 << 3, + + PM_read_write = PM_read | PM_write + }; + +/* Generate instruction(s) to spill or fill register REG to/from the + worker broadcast array. PM indicates what is to be done, REP + how many loop iterations will be executed (0 for not a loop). */ + +static rtx +nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data) +{ + rtx res; + machine_mode mode = GET_MODE (reg); + + switch (mode) + { + case BImode: + { + rtx tmp = gen_reg_rtx (SImode); + + start_sequence (); + if (pm & PM_read) + emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx)); + emit_insn (nvptx_gen_wcast (tmp, pm, rep, data)); + if (pm & PM_write) + emit_insn (gen_rtx_SET (BImode, reg, + gen_rtx_NE (BImode, tmp, const0_rtx))); + res = get_insns (); + end_sequence (); + } + break; + + default: + { + rtx addr = data->ptr; + + if (!addr) + { + unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT; + + if (align > worker_bcast_align) + worker_bcast_align = align; + data->offset = (data->offset + align - 1) & ~(align - 1); + addr = data->base; + if (data->offset) + addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset)); + } + + addr = gen_rtx_MEM (mode, addr); + addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA); + if (pm & PM_read) + res = gen_rtx_SET (mode, addr, reg); + if (pm & PM_write) + res = gen_rtx_SET (mode, reg, addr); + + if (data->ptr) + { + /* We're using a ptr, increment it. */ + start_sequence (); + + emit_insn (res); + emit_insn (gen_adddi3 (data->ptr, data->ptr, + GEN_INT (GET_MODE_SIZE (GET_MODE (res))))); + res = get_insns (); + end_sequence (); + } + else + rep = 1; + data->offset += rep * GET_MODE_SIZE (GET_MODE (reg)); + } + break; + } + return res; +} + /* When loading an operand ORIG_OP, verify whether an address space conversion to generic is required, and if so, perform it. Also check for SYMBOL_REFs for function decls and call @@ -1647,23 +1867,6 @@ nvptx_print_operand_address (FILE *file, nvptx_print_address_operand (file, addr, VOIDmode); } -/* Return true if the value of COND is the same across all threads in a - warp. */ - -static bool -condition_unidirectional_p (rtx cond) -{ - if (CONSTANT_P (cond)) - return true; - if (GET_CODE (cond) == REG) - return cfun->machine->warp_equal_pseudos[REGNO (cond)]; - if (GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMPARE - || GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMM_COMPARE) - return (condition_unidirectional_p (XEXP (cond, 0)) - && condition_unidirectional_p (XEXP (cond, 1))); - return false; -} - /* Print an operand, X, to FILE, with an optional modifier in CODE. Meaning of CODE: @@ -1677,8 +1880,7 @@ condition_unidirectional_p (rtx cond) t -- print a type opcode suffix, promoting QImode to 32 bits T -- print a type size in bits u -- print a type opcode suffix without promotions. - U -- print ".uni" if a condition consists only of values equal across all - threads in a warp. */ + U -- print ".uni" if the const_int operand is non-zero. */ static void nvptx_print_operand (FILE *file, rtx x, int code) @@ -1740,10 +1942,10 @@ nvptx_print_operand (FILE *file, rtx x, goto common; case 'U': - if (condition_unidirectional_p (x)) + if (INTVAL (x)) fprintf (file, ".uni"); break; - + case 'c': op_mode = GET_MODE (XEXP (x, 0)); switch (x_code) @@ -1900,7 +2102,7 @@ get_replacement (struct reg_replace *r) conversion copyin/copyout instructions. */ static void -nvptx_reorg_subreg (int max_regs) +nvptx_reorg_subreg () { struct reg_replace qiregs, hiregs, siregs, diregs; rtx_insn *insn, *next; @@ -1914,11 +2116,6 @@ nvptx_reorg_subreg (int max_regs) siregs.mode = SImode; diregs.mode = DImode; - cfun->machine->warp_equal_pseudos - = ggc_cleared_vec_alloc (max_regs); - - auto_vec warp_reg_worklist; - for (insn = get_insns (); insn; insn = next) { next = NEXT_INSN (insn); @@ -1934,18 +2131,6 @@ nvptx_reorg_subreg (int max_regs) diregs.n_in_use = 0; extract_insn (insn); - if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi - || (GET_CODE (PATTERN (insn)) == SET - && CONSTANT_P (SET_SRC (PATTERN (insn))))) - { - rtx dest = recog_data.operand[0]; - if (REG_P (dest) && REG_N_SETS (REGNO (dest)) == 1) - { - cfun->machine->warp_equal_pseudos[REGNO (dest)] = true; - warp_reg_worklist.safe_push (REGNO (dest)); - } - } - enum attr_subregs_ok s_ok = get_attr_subregs_ok (insn); for (int i = 0; i < recog_data.n_operands; i++) { @@ -1999,71 +2184,782 @@ nvptx_reorg_subreg (int max_regs) validate_change (insn, recog_data.operand_loc[i], new_reg, false); } } +} + +/* An unspec of interest and the BB in which it resides. */ +struct reorg_unspec +{ + rtx_insn *insn; + basic_block block; +}; - while (!warp_reg_worklist.is_empty ()) +/* Loop structure of the function.The entire function is described as + a NULL loop. We should be able to extend this to represent + superblocks. */ + +#define OACC_null OACC_HWM + +struct reorg_loop +{ + /* Parent loop. */ + reorg_loop *parent; + + /* Next sibling loop. */ + reorg_loop *next; + + /* First child loop. */ + reorg_loop *inner; + + /* Partitioning mode of the loop. */ + unsigned mode; + + /* Partitioning used within inner loops. */ + unsigned inner_mask; + + /* Location of loop head and tail. The head is the first block in + the partitioned loop and the tail is the first block out of the + partitioned loop. */ + basic_block head_block; + basic_block tail_block; + + rtx_insn *head_insn; + rtx_insn *tail_insn; + + rtx_insn *pre_head_insn; + rtx_insn *pre_tail_insn; + + /* basic blocks in this loop, but not in child loops. The HEAD and + PRETAIL blocks are in the loop. The PREHEAD and TAIL blocks + are not. */ + auto_vec blocks; + +public: + reorg_loop (reorg_loop *parent, unsigned mode); + ~reorg_loop (); +}; + +typedef auto_vec unspec_vec_t; + +/* Constructor links the new loop into it's parent's chain of + children. */ + +reorg_loop::reorg_loop (reorg_loop *parent_, unsigned mode_) + :parent (parent_), next (0), inner (0), mode (mode_), inner_mask (0) +{ + head_block = tail_block = 0; + head_insn = tail_insn = 0; + pre_head_insn = pre_tail_insn = 0; + + if (parent) { - int regno = warp_reg_worklist.pop (); + next = parent->inner; + parent->inner = this; + } +} + +reorg_loop::~reorg_loop () +{ + delete inner; + delete next; +} + +/* Map of basic blocks to unspecs */ +typedef hash_map unspec_map_t; + +/* Split basic blocks such that each loop head & tail unspecs are at + the start of their basic blocks. Thus afterwards each block will + have a single partitioning mode. We also do the same for return + insns, as they are executed by every thread. Return the partitioning + execution mode of the function as a whole. Populate MAP with head + and tail blocks. We also clear the BB visited flag, which is + used when finding loops. */ + +static unsigned +nvptx_split_blocks (unspec_map_t *map) +{ + auto_vec worklist; + basic_block block; + rtx_insn *insn; + unsigned levels = ~0U; // Assume the worst WRT required neutering + + /* Locate all the reorg instructions of interest. */ + FOR_ALL_BB_FN (block, cfun) + { + bool seen_insn = false; + + // Clear visited flag, for use by loop locator */ + block->flags &= ~BB_VISITED; - df_ref use = DF_REG_USE_CHAIN (regno); - for (; use; use = DF_REF_NEXT_REG (use)) + FOR_BB_INSNS (block, insn) { - rtx_insn *insn; - if (!DF_REF_INSN_INFO (use)) - continue; - insn = DF_REF_INSN (use); - if (DEBUG_INSN_P (insn)) - continue; - - /* The only insns we have to exclude are those which refer to - memory. */ - rtx pat = PATTERN (insn); - if (GET_CODE (pat) == SET - && (MEM_P (SET_SRC (pat)) || MEM_P (SET_DEST (pat)))) + if (!INSN_P (insn)) continue; + switch (recog_memoized (insn)) + { + default: + seen_insn = true; + continue; + case CODE_FOR_oacc_levels: + /* We just need to detect this and note its argument. */ + { + unsigned l = UINTVAL (XVECEXP (PATTERN (insn), 0, 0)); + /* If we see this multiple times, this should all + agree. */ + gcc_assert (levels == ~0U || l == levels); + levels = l; + } + continue; + + case CODE_FOR_nvptx_loop: + { + rtx kind = XVECEXP (PATTERN (insn), 0, 0); + if (!LOOP_MODE_CHANGE_P (UINTVAL (kind))) + { + seen_insn = true; + continue; + } + } + break; + + case CODE_FOR_return: + /* We also need to split just before return insns, as + that insn needs executing by all threads, but the + block it is in probably does not. */ + break; + } - df_ref insn_use; - bool all_equal = true; - FOR_EACH_INSN_USE (insn_use, insn) + if (seen_insn) { - unsigned insn_regno = DF_REF_REGNO (insn_use); - if (!cfun->machine->warp_equal_pseudos[insn_regno]) - { - all_equal = false; - break; - } + /* We've found an instruction that must be at the start of + a block, but isn't. Add it to the worklist. */ + reorg_unspec uns; + uns.insn = insn; + uns.block = block; + worklist.safe_push (uns); } - if (!all_equal) - continue; - df_ref insn_def; - FOR_EACH_INSN_DEF (insn_def, insn) + else + /* It was already the first instruction. Just add it to + the map. */ + map->get_or_insert (block) = insn; + seen_insn = true; + } + } + + /* Split blocks on the worklist. */ + unsigned ix; + reorg_unspec *elt; + basic_block remap = 0; + for (ix = 0; worklist.iterate (ix, &elt); ix++) + { + if (remap != elt->block) + { + block = elt->block; + remap = block; + } + + /* Split block before insn. The insn is in the new block */ + edge e = split_block (block, PREV_INSN (elt->insn)); + + block = e->dest; + map->get_or_insert (block) = elt->insn; + } + + return levels; +} + +/* BLOCK is a basic block containing a head or tail instruction. + Locate the associated prehead or pretail instruction, which must be + in the single predecessor block. */ + +static rtx_insn * +nvptx_discover_pre (basic_block block, unsigned expected) +{ + gcc_assert (block->preds->length () == 1); + basic_block pre_block = (*block->preds)[0]->src; + rtx_insn *pre_insn; + + for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn); + pre_insn = PREV_INSN (pre_insn)) + gcc_assert (pre_insn != BB_HEAD (pre_block)); + + gcc_assert (recog_memoized (pre_insn) == CODE_FOR_nvptx_loop + && (UINTVAL (XVECEXP (PATTERN (pre_insn), 0, 0)) + == expected)); + return pre_insn; +} + +typedef std::pair loop_t; +typedef auto_vec work_loop_t; + +/* Dump this loop and all its inner loops. */ + +static void +nvptx_dump_loops (reorg_loop *loop, unsigned depth) +{ + fprintf (dump_file, "%u: mode %d head=%d, tail=%d\n", + depth, loop->mode, + loop->head_block ? loop->head_block->index : -1, + loop->tail_block ? loop->tail_block->index : -1); + + fprintf (dump_file, " blocks:"); + + basic_block block; + for (unsigned ix = 0; loop->blocks.iterate (ix, &block); ix++) + fprintf (dump_file, " %d", block->index); + fprintf (dump_file, "\n"); + if (loop->inner) + nvptx_dump_loops (loop->inner, depth + 1); + + if (loop->next) + nvptx_dump_loops (loop->next, depth); +} + +/* Walk the BBG looking for loop head & tail markers. Construct a + loop structure for the function. MAP is a mapping of basic blocks + to head & taiol markers, discoveded when splitting blocks. This + speeds up the discovery. We rely on the BB visited flag having + been cleared when splitting blocks. */ + +static reorg_loop * +nvptx_discover_loops (unspec_map_t *map) +{ + reorg_loop *outer_loop = new reorg_loop (0, OACC_null); + work_loop_t worklist; + basic_block block; + + // Mark entry and exit blocks as visited. + block = EXIT_BLOCK_PTR_FOR_FN (cfun); + block->flags |= BB_VISITED; + block = ENTRY_BLOCK_PTR_FOR_FN (cfun); + worklist.safe_push (loop_t (block, outer_loop)); + + while (worklist.length ()) + { + loop_t loop = worklist.pop (); + reorg_loop *l = loop.second; + + block = loop.first; + + // Have we met this block? + if (block->flags & BB_VISITED) + continue; + block->flags |= BB_VISITED; + + rtx_insn **endp = map->get (block); + if (endp) + { + rtx_insn *end = *endp; + + /* This is a block head or tail, or return instruction. */ + switch (recog_memoized (end)) { - unsigned dregno = DF_REF_REGNO (insn_def); - if (cfun->machine->warp_equal_pseudos[dregno]) - continue; - cfun->machine->warp_equal_pseudos[dregno] = true; - warp_reg_worklist.safe_push (dregno); + case CODE_FOR_return: + /* Return instructions are in their own block, and we + don't need to do anything more. */ + continue; + + case CODE_FOR_nvptx_loop: + { + unsigned kind = UINTVAL (XVECEXP (PATTERN (end), 0, 0)); + unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 1)); + + switch (kind) + { + case nvptx_loop_head: + /* Loop head, create a new inner loop and add it + into our parent's child list. */ + l = new reorg_loop (l, mode); + l->head_block = block; + l->head_insn = end; + if (mode == OACC_worker) + l->pre_head_insn + = nvptx_discover_pre (block, nvptx_loop_prehead); + break; + + case nvptx_loop_tail: + /* A loop tail. Finish the current loop and + return to parent. */ + gcc_assert (l->mode == mode); + l->tail_block = block; + l->tail_insn = end; + if (mode == OACC_worker) + l->pre_tail_insn + = nvptx_discover_pre (block, nvptx_loop_pretail); + l = l->parent; + break; + + default: + gcc_unreachable (); + } + } + break; + + default:gcc_unreachable (); } } + /* Add this block onto the current loop's list of blocks. */ + l->blocks.safe_push (block); + + /* Push each destination block onto the work list. */ + edge e; + edge_iterator ei; + + loop.second = l; + FOR_EACH_EDGE (e, ei, block->succs) + { + loop.first = e->dest; + + worklist.safe_push (loop); + } } if (dump_file) - for (int i = 0; i < max_regs; i++) - if (cfun->machine->warp_equal_pseudos[i]) - fprintf (dump_file, "Found warp invariant pseudo %d\n", i); + { + fprintf (dump_file, "\nLoops\n"); + nvptx_dump_loops (outer_loop, 0); + fprintf (dump_file, "\n"); + } + + return outer_loop; +} + +/* Propagate live state at the start of a partitioned region. BLOCK + provides the live register information, and might not contain + INSN. Propagation is inserted just after INSN. RW indicates whether + we are reading and/or writing state. This + separation is needed for worker-level proppagation where we + essentially do a spill & fill. FN is the underlying worker + function to generate the propagation instructions for single + register. DATA is user data. + + We propagate the live register set and the entire frame. We could + do better by (a) propagating just the live set that is used within + the partitioned regions and (b) only propagating stack entries that + are used. The latter might be quite hard to determine. */ + +static void +nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw, + rtx (*fn) (rtx, propagate_mask, + unsigned, void *), void *data) +{ + bitmap live = DF_LIVE_IN (block); + bitmap_iterator iterator; + unsigned ix; + + /* Copy the frame array. */ + HOST_WIDE_INT fs = get_frame_size (); + if (fs) + { + rtx tmp = gen_reg_rtx (DImode); + rtx idx = NULL_RTX; + rtx ptr = gen_reg_rtx (Pmode); + rtx pred = NULL_RTX; + rtx_code_label *label = NULL; + + gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1))); + fs /= GET_MODE_SIZE (DImode); + /* Detect single iteration loop. */ + if (fs == 1) + fs = 0; + + start_sequence (); + emit_insn (gen_rtx_SET (Pmode, ptr, frame_pointer_rtx)); + if (fs) + { + idx = gen_reg_rtx (SImode); + pred = gen_reg_rtx (BImode); + label = gen_label_rtx (); + + emit_insn (gen_rtx_SET (SImode, idx, GEN_INT (fs))); + /* Allow worker function to initialize anything needed */ + rtx init = fn (tmp, PM_loop_begin, fs, data); + if (init) + emit_insn (init); + emit_label (label); + LABEL_NUSES (label)++; + emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1))); + } + if (rw & PM_read) + emit_insn (gen_rtx_SET (DImode, tmp, gen_rtx_MEM (DImode, ptr))); + emit_insn (fn (tmp, rw, fs, data)); + if (rw & PM_write) + emit_insn (gen_rtx_SET (DImode, gen_rtx_MEM (DImode, ptr), tmp)); + if (fs) + { + emit_insn (gen_rtx_SET (SImode, pred, + gen_rtx_NE (BImode, idx, const0_rtx))); + emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode)))); + emit_insn (gen_br_true_hidden (pred, label, GEN_INT (1))); + rtx fini = fn (tmp, PM_loop_end, fs, data); + if (fini) + emit_insn (fini); + emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx)); + } + emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp)); + emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr)); + rtx cpy = get_insns (); + end_sequence (); + insn = emit_insn_after (cpy, insn); + } + + /* Copy live registers. */ + EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator) + { + rtx reg = regno_reg_rtx[ix]; + + if (REGNO (reg) >= FIRST_PSEUDO_REGISTER) + { + rtx bcast = fn (reg, rw, 0, data); + + insn = emit_insn_after (bcast, insn); + } + } +} + +/* Worker for nvptx_vpropagate. */ + +static rtx +vprop_gen (rtx reg, propagate_mask pm, + unsigned ARG_UNUSED (count), void *ARG_UNUSED (data)) +{ + if (!(pm & PM_read_write)) + return 0; + + return nvptx_gen_vcast (reg); } -/* PTX-specific reorganization - 1) mark now-unused registers, so function begin doesn't declare - unused registers. - 2) replace subregs with suitable sequences. -*/ +/* Propagate state that is live at start of BLOCK across the vectors + of a single warp. Propagation is inserted just after INSN. */ static void -nvptx_reorg (void) +nvptx_vpropagate (basic_block block, rtx_insn *insn) { - struct reg_replace qiregs, hiregs, siregs, diregs; - rtx_insn *insn, *next; + nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0); +} + +/* Worker for nvptx_wpropagate. */ + +static rtx +wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_) +{ + wcast_data_t *data = (wcast_data_t *)data_; + + if (pm & PM_loop_begin) + { + /* Starting a loop, initialize pointer. */ + unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT; + + if (align > worker_bcast_align) + worker_bcast_align = align; + data->offset = (data->offset + align - 1) & ~(align - 1); + + data->ptr = gen_reg_rtx (Pmode); + + return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset)); + } + else if (pm & PM_loop_end) + { + rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr); + data->ptr = NULL_RTX; + return clobber; + } + else + return nvptx_gen_wcast (reg, pm, rep, data); +} + +/* Spill or fill live state that is live at start of BLOCK. PRE_P + indicates if this is just before partitioned mode (do spill), or + just after it starts (do fill). Sequence is inserted just after + INSN. */ + +static void +nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn) +{ + wcast_data_t data; + + data.base = gen_reg_rtx (Pmode); + data.offset = 0; + data.ptr = NULL_RTX; + + nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data); + if (data.offset) + { + /* Stuff was emitted, initialize the base pointer now. */ + rtx init = gen_rtx_SET (Pmode, data.base, worker_bcast_sym); + emit_insn_after (init, insn); + + if (worker_bcast_hwm < data.offset) + worker_bcast_hwm = data.offset; + } +} + +/* Emit a worker-level synchronization barrier. */ + +static void +nvptx_wsync (bool tail_p, rtx_insn *insn) +{ + emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn); +} + +/* Single neutering according to MASK. FROM is the incoming block and + TO is the outgoing block. These may be the same block. Insert at + start of FROM: + + if (tid.) hidden_goto end. + + and insert before ending branch of TO (if there is such an insn): + + end: + + + + We currently only use differnt FROM and TO when skipping an entire + loop. We could do more if we detected superblocks. */ + +static void +nvptx_single (unsigned mask, basic_block from, basic_block to) +{ + rtx_insn *head = BB_HEAD (from); + rtx_insn *tail = BB_END (to); + unsigned skip_mask = mask; + + /* Find first insn of from block */ + while (head != BB_END (from) && !INSN_P (head)) + head = NEXT_INSN (head); + + /* Find last insn of to block */ + rtx_insn *limit = from == to ? head : BB_HEAD (to); + while (tail != limit && !INSN_P (tail) && !LABEL_P (tail)) + tail = PREV_INSN (tail); + + /* Detect if tail is a branch. */ + rtx tail_branch = NULL_RTX; + rtx cond_branch = NULL_RTX; + if (tail && INSN_P (tail)) + { + tail_branch = PATTERN (tail); + if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx) + tail_branch = NULL_RTX; + else + { + cond_branch = SET_SRC (tail_branch); + if (GET_CODE (cond_branch) != IF_THEN_ELSE) + cond_branch = NULL_RTX; + } + } + + if (tail == head) + { + /* If this is empty, do nothing. */ + if (!head || !INSN_P (head)) + return; + + /* If this is a dummy insn, do nothing. */ + switch (recog_memoized (head)) + { + default:break; + case CODE_FOR_nvptx_loop: + case CODE_FOR_oacc_levels: + return; + } + if (cond_branch) + { + /* If we're only doing vector single, there's no need to + emit skip code because we'll not insert anything. */ + if (!(mask & OACC_LOOP_MASK (OACC_vector))) + skip_mask = 0; + } + else if (tail_branch) + /* Block with only unconditional branch. Nothing to do. */ + return; + } + + /* Insert the vector test inside the worker test. */ + unsigned mode; + rtx_insn *before = tail; + for (mode = OACC_worker; mode <= OACC_vector; mode++) + if (OACC_LOOP_MASK (mode) & skip_mask) + { + rtx id = gen_reg_rtx (SImode); + rtx pred = gen_reg_rtx (BImode); + rtx_code_label *label = gen_label_rtx (); + + emit_insn_before (gen_oacc_id (id, GEN_INT (mode)), head); + rtx cond = gen_rtx_SET (BImode, pred, + gen_rtx_NE (BImode, id, const0_rtx)); + emit_insn_before (cond, head); + emit_insn_before (gen_br_true_hidden (pred, label, + GEN_INT (mode != OACC_vector)), + head); + + LABEL_NUSES (label)++; + if (tail_branch) + before = emit_label_before (label, before); + else + emit_label_after (label, tail); + } + + /* Now deal with propagating the branch condition. */ + if (cond_branch) + { + rtx pvar = XEXP (XEXP (cond_branch, 0), 0); + + if (OACC_LOOP_MASK (OACC_vector) == mask) + { + /* Vector mode only, do a shuffle. */ + emit_insn_before (nvptx_gen_vcast (pvar), tail); + } + else + { + /* Includes worker mode, do spill & fill. by construction + we should never have worker mode only. */ + wcast_data_t data; + + data.base = worker_bcast_sym; + data.ptr = 0; + + if (worker_bcast_hwm < GET_MODE_SIZE (SImode)) + worker_bcast_hwm = GET_MODE_SIZE (SImode); + + data.offset = 0; + emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data), + before); + emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail); + data.offset = 0; + emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data), + tail); + } + + extract_insn (tail); + rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar), + UNSPEC_BR_UNIFIED); + validate_change (tail, recog_data.operand_loc[0], unsp, false); + } +} + +/* LOOP is a loop that is being skipped in its entirety according to + MASK. Treat this as skipping a superblock starting at loop head + and ending at loop pre-tail. */ + +static void +nvptx_skip_loop (unsigned mask, reorg_loop *loop) +{ + basic_block tail = loop->tail_block; + gcc_assert (tail->preds->length () == 1); + + basic_block pre_tail = (*tail->preds)[0]->src; + gcc_assert (pre_tail->succs->length () == 1); + + nvptx_single (mask, loop->head_block, pre_tail); +} + +/* Process the loop LOOP and all its contained loops. We do + everything but the neutering. Return mask of partition modes used + within this loop. */ + +static unsigned +nvptx_process_loops (reorg_loop *loop) +{ + unsigned inner_mask = OACC_LOOP_MASK (loop->mode); + + /* Do the inner loops first. */ + if (loop->inner) + { + loop->inner_mask = nvptx_process_loops (loop->inner); + inner_mask |= loop->inner_mask; + } + + switch (loop->mode) + { + case OACC_null: + /* Dummy loop. */ + break; + + case OACC_vector: + nvptx_vpropagate (loop->head_block, loop->head_insn); + break; + + case OACC_worker: + { + nvptx_wpropagate (false, loop->head_block, loop->head_insn); + nvptx_wpropagate (true, loop->head_block, loop->pre_head_insn); + /* Insert begin and end synchronizations. */ + nvptx_wsync (false, loop->head_insn); + nvptx_wsync (true, loop->pre_tail_insn); + } + break; + + case OACC_gang: + break; + + default:gcc_unreachable (); + } + + /* Now do siblings. */ + if (loop->next) + inner_mask |= nvptx_process_loops (loop->next); + return inner_mask; +} + +/* Neuter the loop described by LOOP. We recurse in depth-first + order. LEVELS are the partitioning of the execution and OUTER is + the partitioning of the loops we are contained in. Return the + partitioning level within this loop. */ + +static void +nvptx_neuter_loops (reorg_loop *loop, unsigned levels, unsigned outer) +{ + unsigned me = (OACC_LOOP_MASK (loop->mode) + & (OACC_LOOP_MASK (OACC_worker) + | OACC_LOOP_MASK (OACC_vector))); + unsigned skip_mask = 0, neuter_mask = 0; + + if (loop->inner) + nvptx_neuter_loops (loop->inner, levels, outer | me); + + for (unsigned mode = OACC_worker; mode <= OACC_vector; mode++) + { + if ((outer | me) & OACC_LOOP_MASK (mode)) + { /* Mode is partitioned: no neutering. */ } + else if (!(levels & OACC_LOOP_MASK (mode))) + { /* Mode is not used: nothing to do. */ } + else if (loop->inner_mask & OACC_LOOP_MASK (mode) + || !loop->head_insn) + /* Partitioning inside this loop, or we're not a loop: neuter + individual blocks. */ + neuter_mask |= OACC_LOOP_MASK (mode); + else if (!loop->parent || !loop->parent->head_insn + || loop->parent->inner_mask & OACC_LOOP_MASK (mode)) + /* Parent isn't a loop or contains this partitioning: skip + loop at this level. */ + skip_mask |= OACC_LOOP_MASK (mode); + else + { /* Parent will skip this loop itself. */ } + } + + if (neuter_mask) + { + basic_block block; + + for (unsigned ix = 0; loop->blocks.iterate (ix, &block); ix++) + nvptx_single (neuter_mask, block, block); + } + + if (skip_mask) + nvptx_skip_loop (skip_mask, loop); + + if (loop->next) + nvptx_neuter_loops (loop->next, levels, outer); +} + +/* NVPTX machine dependent reorg. + Insert vector and worker single neutering code and state + propagation when entering partioned mode. Fixup subregs. */ + +static void +nvptx_reorg (void) +{ /* We are freeing block_for_insn in the toplev to keep compatibility with old MDEP_REORGS that are not CFG based. Recompute it now. */ compute_bb_for_insn (); @@ -2072,19 +2968,34 @@ nvptx_reorg (void) df_clear_flags (DF_LR_RUN_DCE); df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS); + df_live_add_problem (); + + /* Split blocks and record interesting unspecs. */ + unspec_map_t unspec_map; + unsigned levels = nvptx_split_blocks (&unspec_map); + + /* Compute live regs */ df_analyze (); regstat_init_n_sets_and_refs (); - int max_regs = max_reg_num (); - + if (dump_file) + df_dump (dump_file); + /* Mark unused regs as unused. */ + int max_regs = max_reg_num (); for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++) if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0) regno_reg_rtx[i] = const0_rtx; - /* Replace subregs. */ - nvptx_reorg_subreg (max_regs); + reorg_loop *loops = nvptx_discover_loops (&unspec_map); + + nvptx_process_loops (loops); + nvptx_neuter_loops (loops, levels, 0); + delete loops; + + nvptx_reorg_subreg (); + regstat_free_n_sets_and_refs (); df_finish_pass (true); @@ -2133,19 +3044,21 @@ nvptx_vector_alignment (const_tree type) return MIN (align, BIGGEST_ALIGNMENT); } -/* Indicate that INSN cannot be duplicated. This is true for insns - that generate a unique id. To be on the safe side, we also - exclude instructions that have to be executed simultaneously by - all threads in a warp. */ +/* Indicate that INSN cannot be duplicated. */ static bool nvptx_cannot_copy_insn_p (rtx_insn *insn) { - if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi) - return true; - if (recog_memoized (insn) == CODE_FOR_threadbarrier_insn) - return true; - return false; + switch (recog_memoized (insn)) + { + case CODE_FOR_nvptx_broadcastsi: + case CODE_FOR_nvptx_broadcastsf: + case CODE_FOR_nvptx_barsync: + case CODE_FOR_nvptx_loop: + return true; + default: + return false; + } } /* Record a symbol for mkoffload to enter into the mapping table. */ @@ -2185,6 +3098,21 @@ nvptx_file_end (void) FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter) nvptx_record_fndecl (decl, true); fputs (func_decls.str().c_str(), asm_out_file); + + if (worker_bcast_hwm) + { + /* Define the broadcast buffer. */ + + if (worker_bcast_align < GET_MODE_SIZE (SImode)) + worker_bcast_align = GET_MODE_SIZE (SImode); + worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1) + & ~(worker_bcast_align - 1); + + fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name); + fprintf (asm_out_file, ".shared.align %d .u8 %s[%d];\n", + worker_bcast_align, + worker_bcast_name, worker_bcast_hwm); + } } #undef TARGET_OPTION_OVERRIDE Index: config/nvptx/nvptx.h =================================================================== --- config/nvptx/nvptx.h (revision 225323) +++ config/nvptx/nvptx.h (working copy) @@ -235,7 +235,6 @@ struct nvptx_pseudo_info struct GTY(()) machine_function { rtx_expr_list *call_args; - char *warp_equal_pseudos; rtx start_call; tree funtype; bool has_call_with_varargs; Index: internal-fn.c =================================================================== --- internal-fn.c (revision 225323) +++ internal-fn.c (working copy) @@ -98,6 +98,19 @@ init_internal_fns () internal_fn_fnspec_array[IFN_LAST] = 0; } +/* Return true if this internal fn call is a unique marker -- it + should not be duplicated or merged. */ + +bool +gimple_call_internal_unique_p (const_gimple gs) +{ + switch (gimple_call_internal_fn (gs)) + { + default: return false; + case IFN_GOACC_LOOP: return true; + } +} + /* ARRAY_TYPE is an array of vector modes. Return the associated insn for load-lanes-style optab OPTAB. The insn must exist. */ @@ -1990,6 +2003,28 @@ expand_GOACC_DATA_END_WITH_ARG (gcall *s gcc_unreachable (); } + +static void +expand_GOACC_LEVELS (gcall *stmt) +{ + rtx mask = expand_normal (gimple_call_arg (stmt, 0)); + +#ifdef HAVE_oacc_levels + emit_insn (gen_oacc_levels (mask)); +#endif +} + +static void +expand_GOACC_LOOP (gcall *stmt) +{ + rtx kind = expand_normal (gimple_call_arg (stmt, 0)); + rtx level = expand_normal (gimple_call_arg (stmt, 1)); + +#ifdef HAVE_oacc_loop + emit_insn (gen_oacc_loop (kind, level)); +#endif +} + /* Routines to expand each internal function, indexed by function number. Each routine has the prototype: