
[gomp] Move openacc vector& worker single handling to RTL

Message ID 559AD85B.2050102@acm.org
State New

Commit Message

Nathan Sidwell July 6, 2015, 7:34 p.m. UTC
On 07/04/15 16:41, Nathan Sidwell wrote:
> On 07/03/15 19:11, Jakub Jelinek wrote:

>> If the builtins are not meant to be used by users directly (I assume they
>> aren't) nor have a 1-1 correspondence to a library routine, it is much
>> better to emit them as internal calls (see internal-fn.{c,def}) instead of
>> BUILT_IN_NORMAL functions.
>

This patch uses internal functions.  I had to make one additional change to
tree-ssa-tail-merge.c's same_succ_def::equal hash comparison function.  The new
internal fn I introduced should compare EQ but not otherwise compare EQUAL, and
that was blowing up the hash function, which relied on EQUAL only.  I don't know
why I didn't hit this problem in the previous patch with the regular builtin.
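
As an illustration (a minimal sketch, not code from the patch), the
comparison rule the tail-merge fix needs is roughly:

  /* "Same call target" for two statements, where a unique internal call
     (e.g. IFN_GOACC_LOOP) is equal only to itself.  Hypothetical helper,
     simplified from the gimple_call_same_target_p and
     same_succ_def::equal changes below.  */
  static bool
  stmts_have_same_call_target (const_gimple s1, const_gimple s2)
  {
    if (s1 == s2)
      return true;   /* EQ: literally the same statement.  */
    if (gimple_call_internal_p (s1)
        && gimple_call_internal_unique_p (s1))
      return false;  /* A unique call never compares EQUAL to another.  */
    return gimple_call_same_target_p (s1, s2);
  }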

comments?

nathan
2015-07-06  Nathan Sidwell  <nathan@codesourcery.com>

	Infrastructure:
	* gimple.h (gimple_call_internal_unique_p): Declare.
	* gimple.c (gimple_call_same_target_p): Add check for
	gimple_call_internal_unique_p.
	* internal-fn.c (gimple_call_internal_unique_p): New.
	* omp-low.h (OACC_LOOP_MASK): Define here...
	* omp-low.c (OACC_LOOP_MASK): ... not here.
	* tree-ssa-threadedge.c	(record_temporary_equivalences_from_stmts):
	Add check for gimple_call_internal_unique_p.
	* tree-ssa-tail-merge.c (same_succ_def::equal): Add EQ check for
	the gimple statements.

	Additions:
	* internal-fn.def (GOACC_LEVELS, GOACC_LOOP): New.
	* internal-fn.c (gimple_call_internal_unique_p): Add check for
	IFN_GOACC_LOOP.
	(expand_GOACC_LEVELS, expand_GOACC_LOOP): New.
	* omp-low.c (gen_oacc_loop_head, gen_oacc_loop_tail): New.
	(expand_omp_for_static_nochunk): Add oacc loop head & tail calls.
	(expand_omp_for_static_chunk): Likewise.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Add
	BUILT_IN_GOACC_LOOP.
	* config/nvptx/nvptx-protos.h (nvptx_expand_oacc_loop): New.
	* config/nvptx/nvptx.md (UNSPEC_BIT_CONV, UNSPEC_BROADCAST,
	UNSPEC_BR_UNIFIED): New unspecs.
	(UNSPECV_LEVELS, UNSPECV_LOOP, UNSPECV_BR_HIDDEN): New.
	(BITS, BITD): New mode iterators.
	(br_true_hidden, br_false_hidden, br_uni_true, br_uni_false): New
	branches.
	(oacc_levels, nvptx_loop): New insns.
	(oacc_loop): New expander.
	(nvptx_broadcast<mode>): New insn.
	(unpack<mode>si2, packsi<mode>2): New insns.
	(worker_load<mode>, worker_store<mode>): New insns.
	(nvptx_barsync): Renamed from ...
	(threadbarrier_insn): ... here.
	* config/nvptx/nvptx.c: Include hash-map.h, dominance.h, cfg.h &
	omp-low.h.
	(nvptx_loop_head, nvptx_loop_tail, nvptx_loop_prehead,
	nvptx_loop_pretail, LOOP_MODE_CHANGE_P): New.
	(worker_bcast_hwm, worker_bcast_align, worker_bcast_name,
	worker_bcast_sym): New.
	(nvptx_option_override): Initialize worker_bcast_sym.
	(nvptx_expand_oacc_loop): New.
	(nvptx_gen_unpack, nvptx_gen_pack): New.
	(struct wcast_data_t, propagate_mask): New types.
	(nvptx_gen_vcast, nvptx_gen_wcast): New.
	(nvptx_print_operand): Change 'U' specifier to look at operand
	itself.
	(struct reorg_unspec, struct reorg_loop): New structs.
	(unspec_map_t): New map.
	(loop_t, work_loop_t): New types.
	(nvptx_split_blocks, nvptx_discover_pre, nvptx_dump_loops,
	nvptx_discover_loops): New.
	(nvptx_propagate, vprop_gen, nvptx_vpropagate, wprop_gen,
	nvptx_wpropagate): New.
	(nvptx_wsync): New.
	(nvptx_single, nvptx_skip_loop): New.
	(nvptx_process_loops): New.
	(nvptx_neuter_loops): New.
	(nvptx_reorg): Add liveness DF problem.  Call nvptx_split_blocks,
	nvptx_discover_loops, nvptx_process_loops & nvptx_neuter_loops.
	(nvptx_cannot_copy_insn): Check for broadcast, sync & loop insns.
	(nvptx_file_end): Output worker broadcast array definition.

	Deletions:
	* builtins.c (expand_oacc_thread_barrier): Delete.
	(expand_oacc_thread_broadcast): Delete.
	(expand_builtin): Adjust.
	* gimple.h (struct gimple_statement_omp_parallel_layout): Remove
	broadcast_array member.
	(gimple_omp_target_broadcast_array): Delete.
	(gimple_omp_target_set_broadcast_array): Delete.
	* omp-low.c (omp_region): Remove broadcast_array member.
	(oacc_broadcast): Delete.
	(build_oacc_threadbarrier): Delete.
	(oacc_loop_needs_threadbarrier_p): Delete.
	(oacc_alloc_broadcast_storage): Delete.
	(find_omp_target_region_data): Remove call to
	gimple_omp_target_broadcast_array.
	(enclosing_target_region, required_predication_mask,
	generate_vector_broadcast, generate_oacc_broadcast,
	make_predication_test, predicate_bb, find_predicatable_bbs,
	predicate_omp_regions): Delete.
	(use, gen, live_in): Delete.
	(populate_loop_live_in, oacc_populate_live_in_1,
	oacc_populate_live_in, populate_loop_use, oacc_broadcast_1,
	oacc_broadcast): Delete.
	(execute_expand_omp): Remove predicate_omp_regions call.
	(lower_omp_target): Remove oacc_alloc_broadcast_storage call.
	Remove gimple_omp_target_set_broadcast_array call.
	(make_gimple_omp_edges): Remove oacc_loop_needs_threadbarrier_p
	check.
	* tree-ssa-alias.c (ref_maybe_used_by_call_p_1): Remove
	BUILT_IN_GOACC_THREADBARRIER.
	* omp-builtins.def (BUILT_IN_GOACC_THREAD_BROADCAST,
	BUILT_IN_GOACC_THREAD_BROADCAST_LL,
	BUILT_IN_GOACC_THREADBARRIER): Delete.
	* config/nvptx/nvptx.md (UNSPECV_WARPBCAST): Delete.
	(br_true, br_false): Remove 'U' format specifier.
	(oacc_thread_broadcastsi, oacc_thread_broadcast_di): Delete.
	(oacc_threadbarrier): Delete.
	* config/nvptx/nvptx.c (condition_unidirectional_p): Delete.
	(nvptx_print_operand): Change 'U' specifier to look at operand
	itself.
	(nvptx_reorg_subreg): Remove unidirection checking.
	(nvptx_cannot_copy_insn): Remove broadcast and barrier insns.
	* config/nvptx/nvptx.h (machine_function): Remove
	warp_equal_pseudos.

Comments

Jakub Jelinek July 7, 2015, 9:54 a.m. UTC | #1
On Mon, Jul 06, 2015 at 03:34:51PM -0400, Nathan Sidwell wrote:
> On 07/04/15 16:41, Nathan Sidwell wrote:
> >On 07/03/15 19:11, Jakub Jelinek wrote:
> 
> >>If the builtins are not meant to be used by users directly (I assume they
> >>aren't) nor have a 1-1 correspondence to a library routine, it is much
> >>better to emit them as internal calls (see internal-fn.{c,def}) instead of
> >>BUILT_IN_NORMAL functions.
> >
> 
> This patch uses internal functions.  I had to make one additional change to
> tree-ssa-tail-merge.c's same_succ_def::equal hash comparison function.  The new
> internal fn I introduced should compare EQ but not otherwise compare EQUAL,
> and that was blowing up the hash function, which relied on EQUAL only.  I
> don't know why I didn't hit this problem in the previous patch with the
> regular builtin.

How does this interact with
#pragma acc routine {gang,worker,vector,seq} ?
Or is that something to be added later on?

	Jakub
Nathan Sidwell July 7, 2015, 2:12 p.m. UTC | #2
On 07/07/15 05:54, Jakub Jelinek wrote:
> On Mon, Jul 06, 2015 at 03:34:51PM -0400, Nathan Sidwell wrote:

> How does this interact with
> #pragma acc routine {gang,worker,vector,seq} ?
> Or is that something to be added later on?

That is to be added later on.  I suspect such routines will trivially work, as 
they'll be marked up with the loop head/tail functions and levels builtin (the 
latter might need a bit of reworking).  What will need additional work at that 
point is the callers of routines -- they're typically called from a foo-single 
mode, but need to get all threads into the called function.  I'm thinking each 
call site will look like a mini-loop[*] surrounded by a head/tail marker.  (All
that can be done in the device-side compiler once real call sites are known.)

nathan

[*] of course it won't be a loop.  Perhaps fork/join are less confusing names 
after all.  WDYT?
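
For concreteness, a purely hypothetical sketch of such a call site, reusing
the GOACC_LOOP head/tail encoding from the patch below (first argument 0 for
head/fork, 1 for tail/join); the level argument would depend on the routine's
marking:

  GOACC_LOOP (0, level);   /* fork: bring all threads of LEVEL in */
  foo ();                  /* every thread executes the routine */
  GOACC_LOOP (1, level);   /* join: back to single mode */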
Jakub Jelinek July 7, 2015, 2:22 p.m. UTC | #3
On Tue, Jul 07, 2015 at 10:12:56AM -0400, Nathan Sidwell wrote:
> On 07/07/15 05:54, Jakub Jelinek wrote:
> >On Mon, Jul 06, 2015 at 03:34:51PM -0400, Nathan Sidwell wrote:
> 
> >How does this interact with
> >#pragma acc routine {gang,worker,vector,seq} ?
> >Or is that something to be added later on?
> 
> That is to be added later on.  I suspect such routines will trivially work,
> as they'll be marked up with the loop head/tail functions and levels builtin
> (the latter might need a bit of reworking).  What will need additional work
> at that point is the callers of routines -- they're typically called from a
> foo-single mode, but need to get all threads into the called function.  I'm
> thinking each call site will look like a mini-loop[*] surrounded by a
> head/tail marker.  (All that can be done in the device-side compiler once
> real call sites are known.)

Wouldn't function attributes be better for that case, and just use the internal
functions for the case when the mode is being changed in the middle of a
function?

I agree that fork/join might be less confusing.

BTW, where do you plan to lower the internal functions for non-PTX?
Doing it in RTL mach reorg is too late for those; we shouldn't be writing it
for each single target, as for non-PTX (perhaps non-HSA) I bet the behavior
is the same.

	Jakub
Nathan Sidwell July 7, 2015, 2:43 p.m. UTC | #4
On 07/07/15 10:22, Jakub Jelinek wrote:
> On Tue, Jul 07, 2015 at 10:12:56AM -0400, Nathan Sidwell wrote:

> Wouldn't function attributes be better for that case, and just use the internal
> functions for the case when the mode is being changed in the middle of
> function?

It may be.  I've been thinking how the top-level offloaded function (kernel)
should be marked to specify gang/worker/vector dimensions to allow a less
device-specific launch mechanism.  I suspect that and routines will have similar 
solutions.

> I agree that fork/join might be less confusing.
>
> BTW, where do you plan to lower the internal functions for non-PTX?
> Doing it in RTL mach reorg is too late for those, we shouldn't be writing it
> for each single target, as for non-PTX (perhaps non-HSA) I bet the behavior
> is the same.

I suspect other devices can add a new device-specific lowering pass somewhere
soon after the LTO readback.  I think we're going to need that pass for some
other pieces of PTX.

FWIW on a device that has a PTX-like architecture, I think this specific piece 
should be done as late as possible.  Perhaps pieces of the PTX mach-dep-reorg 
can be abstracted for general use?

nathan
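
A note for reading the patch: gen_oacc_loop_head emits one IFN_GOACC_LEVELS
call followed by head markers in outer->inner order, and gen_oacc_loop_tail
emits tail markers in inner->outer order; the first GOACC_LOOP argument
encodes head (0) vs tail (1), the second the level.  For a loop partitioned
over worker and vector, the generated GIMPLE is roughly (illustrative, not
actual compiler output):

  GOACC_LEVELS (gang | worker | vector);   /* safe default: all axes */
  GOACC_LOOP (0, OACC_worker);   /* head markers, outermost level first */
  GOACC_LOOP (0, OACC_vector);
  ... partitioned loop body ...
  GOACC_LOOP (1, OACC_vector);   /* tail markers, innermost level first */
  GOACC_LOOP (1, OACC_worker);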

Patch

Index: omp-low.c
===================================================================
--- omp-low.c	(revision 225323)
+++ omp-low.c	(working copy)
@@ -166,14 +166,8 @@  struct omp_region
 
   /* For an OpenACC loop, the level of parallelism requested.  */
   int gwv_this;
-
-  tree broadcast_array;
 };
 
-/* Levels of parallelism as defined by OpenACC.  Increasing numbers
-   correspond to deeper loop nesting levels.  */
-#define OACC_LOOP_MASK(X) (1 << (X))
-
 /* Context structure.  Used to store information about each parallel
    directive in the code.  */
 
@@ -292,8 +286,6 @@  static vec<omp_context *> taskreg_contex
 
 static void scan_omp (gimple_seq *, omp_context *);
 static tree scan_omp_1_op (tree *, int *, void *);
-static basic_block oacc_broadcast (basic_block, basic_block,
-				   struct omp_region *);
 
 #define WALK_SUBSTMTS  \
     case GIMPLE_BIND: \
@@ -3487,15 +3479,6 @@  build_omp_barrier (tree lhs)
   return g;
 }
 
-/* Build a call to GOACC_threadbarrier.  */
-
-static gcall *
-build_oacc_threadbarrier (void)
-{
-  tree fndecl = builtin_decl_explicit (BUILT_IN_GOACC_THREADBARRIER);
-  return gimple_build_call (fndecl, 0);
-}
-
 /* If a context was created for STMT when it was scanned, return it.  */
 
 static omp_context *
@@ -3506,6 +3489,56 @@  maybe_lookup_ctx (gimple stmt)
   return n ? (omp_context *) n->value : NULL;
 }
 
+/* Generate loop head markers in outer->inner order.  */
+
+static void
+gen_oacc_loop_head (gimple_seq *seq, unsigned mask)
+{
+  {
+    // TODO: Determine this information from the parallel region itself
+    // and emit it once in the offload function.  Currently the target
+    // geometry definition is being extracted early.  For now inform
+    // the backend we're using all axes of parallelism, which is a
+    // safe default.
+    gcall *call = gimple_build_call_internal
+      (IFN_GOACC_LEVELS, 1, 
+       build_int_cst (unsigned_type_node,
+		      OACC_LOOP_MASK (OACC_gang)
+		      | OACC_LOOP_MASK (OACC_vector)
+		      | OACC_LOOP_MASK (OACC_worker)));
+    gimple_seq_add_stmt (seq, call);
+  }
+
+  tree arg0 = build_int_cst (unsigned_type_node, 0);
+  unsigned level;
+
+  for (level = OACC_gang; level != OACC_HWM; level++)
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg1 = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal
+	  (IFN_GOACC_LOOP, 2, arg0, arg1);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
+
+/* Generate loop tail markers in inner->outer order.  */
+
+static void
+gen_oacc_loop_tail (gimple_seq *seq, unsigned mask)
+{
+  tree arg0 = build_int_cst (unsigned_type_node, 1);
+  unsigned level;
+
+  for (level = OACC_HWM; level-- != OACC_gang; )
+    if (mask & OACC_LOOP_MASK (level))
+      {
+	tree arg1 = build_int_cst (unsigned_type_node, level);
+	gcall *call = gimple_build_call_internal
+	  (IFN_GOACC_LOOP, 2, arg0, arg1);
+	gimple_seq_add_stmt (seq, call);
+      }
+}
 
 /* Find the mapping for DECL in CTX or the immediately enclosing
    context that has a mapping for DECL.
@@ -6777,21 +6810,6 @@  expand_omp_for_generic (struct omp_regio
     }
 }
 
-
-/* True if a barrier is needed after a loop partitioned over
-   gangs/workers/vectors as specified by GWV_BITS.  OpenACC semantics specify
-   that a (conceptual) barrier is needed after worker and vector-partitioned
-   loops, but not after gang-partitioned loops.  Currently we are relying on
-   warp reconvergence to synchronise threads within a warp after vector loops,
-   so an explicit barrier is not helpful after those.  */
-
-static bool
-oacc_loop_needs_threadbarrier_p (int gwv_bits)
-{
-  return !(gwv_bits & OACC_LOOP_MASK (OACC_gang))
-    && (gwv_bits & OACC_LOOP_MASK (OACC_worker));
-}
-
 /* A subroutine of expand_omp_for.  Generate code for a parallel
    loop with static schedule and no specified chunk size.  Given
    parameters:
@@ -6800,6 +6818,7 @@  oacc_loop_needs_threadbarrier_p (int gwv
 
    where COND is "<" or ">", we generate pseudocode
 
+  OACC_LOOP_HEAD
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -6827,6 +6846,11 @@  oacc_loop_needs_threadbarrier_p (int gwv
 	V += STEP;
 	if (V cond e) goto L1;
     L2:
+ OACC_LOOP_TAIL
+
+ It'd be better to place the OACC_LOOP markers just inside the outer
+ conditional, so they can be entirely eliminated if the loop is
+ unreachable.
 */
 
 static void
@@ -6868,10 +6892,6 @@  expand_omp_for_static_nochunk (struct om
     }
   exit_bb = region->exit;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   /* Iteration space partitioning goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -6893,6 +6913,15 @@  expand_omp_for_static_nochunk (struct om
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_loop_head (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -6951,6 +6980,7 @@  expand_omp_for_static_nochunk (struct om
     case GF_OMP_FOR_KIND_OACC_LOOP:
       {
 	gimple_seq seq = NULL;
+	
 	nthreads = expand_oacc_get_num_threads (&seq, region->gwv_this);
 	threadid = expand_oacc_get_thread_num (&seq, region->gwv_this);
 	gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
@@ -7134,18 +7164,19 @@  expand_omp_for_static_nochunk (struct om
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_loop_tail (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-	{
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+    
   gsi_remove (&gsi, true);
 
   /* Connect all the blocks.  */
@@ -7220,6 +7251,7 @@  find_phi_with_arg_on_edge (tree arg, edg
 
    where COND is "<" or ">", we generate pseudocode
 
+OACC_LOOP_HEAD
 	if ((__typeof (V)) -1 > 0 && N2 cond N1) goto L2;
 	if (cond is <)
 	  adj = STEP - 1;
@@ -7230,6 +7262,7 @@  find_phi_with_arg_on_edge (tree arg, edg
 	else
 	  n = (adj + N2 - N1) / STEP;
 	trip = 0;
+
 	V = threadid * CHUNK * STEP + N1;  -- this extra definition of V is
 					      here so that V is defined
 					      if the loop is not entered
@@ -7248,6 +7281,7 @@  find_phi_with_arg_on_edge (tree arg, edg
 	trip += 1;
 	goto L0;
     L4:
+OACC_LOOP_TAIL
 */
 
 static void
@@ -7281,10 +7315,6 @@  expand_omp_for_static_chunk (struct omp_
   gcc_assert (EDGE_COUNT (iter_part_bb->succs) == 2);
   fin_bb = BRANCH_EDGE (iter_part_bb)->dest;
 
-  /* Broadcast variables to OpenACC threads.  */
-  entry_bb = oacc_broadcast (entry_bb, fin_bb, region);
-  region->entry = entry_bb;
-
   gcc_assert (broken_loop
 	      || fin_bb == FALLTHRU_EDGE (cont_bb)->dest);
   seq_start_bb = split_edge (FALLTHRU_EDGE (iter_part_bb));
@@ -7296,7 +7326,7 @@  expand_omp_for_static_chunk (struct omp_
       trip_update_bb = split_edge (FALLTHRU_EDGE (cont_bb));
     }
   exit_bb = region->exit;
-
+  
   /* Trip and adjustment setup goes in ENTRY_BB.  */
   gsi = gsi_last_bb (entry_bb);
   gcc_assert (gimple_code (gsi_stmt (gsi)) == GIMPLE_OMP_FOR);
@@ -7318,6 +7348,14 @@  expand_omp_for_static_chunk (struct omp_
     t = fold_binary (fd->loop.cond_code, boolean_type_node,
 		     fold_convert (type, fd->loop.n1),
 		     fold_convert (type, fd->loop.n2));
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+	
+      gen_oacc_loop_head (&seq, region->gwv_this);
+      gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+    }
+
   if (fd->collapse == 1
       && TYPE_UNSIGNED (type)
       && (t == NULL_TREE || !integer_onep (t)))
@@ -7576,18 +7614,20 @@  expand_omp_for_static_chunk (struct omp_
 
   /* Replace the GIMPLE_OMP_RETURN with a barrier, or nothing.  */
   gsi = gsi_last_bb (exit_bb);
-  if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
+
+  if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
+    {
+      gimple_seq seq = NULL;
+
+      gen_oacc_loop_tail (&seq, region->gwv_this);
+      gsi_insert_seq_after (&gsi, seq, GSI_SAME_STMT);
+    }
+  else if (!gimple_omp_return_nowait_p (gsi_stmt (gsi)))
     {
       t = gimple_omp_return_lhs (gsi_stmt (gsi));
-      if (gimple_omp_for_kind (fd->for_stmt) == GF_OMP_FOR_KIND_OACC_LOOP)
-        {
-	  gcc_checking_assert (t == NULL_TREE);
-	  if (oacc_loop_needs_threadbarrier_p (region->gwv_this))
-	    gsi_insert_after (&gsi, build_oacc_threadbarrier (), GSI_SAME_STMT);
-	}
-      else
-	gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
+      gsi_insert_after (&gsi, build_omp_barrier (t), GSI_SAME_STMT);
     }
+
   gsi_remove (&gsi, true);
 
   /* Connect the new blocks.  */
@@ -9158,20 +9198,6 @@  expand_omp_atomic (struct omp_region *re
   expand_omp_atomic_mutex (load_bb, store_bb, addr, loaded_val, stored_val);
 }
 
-/* Allocate storage for OpenACC worker threads in CTX to broadcast
-   condition results.  */
-
-static void
-oacc_alloc_broadcast_storage (omp_context *ctx)
-{
-  tree vull_type_node = build_qualified_type (long_long_unsigned_type_node,
-					      TYPE_QUAL_VOLATILE);
-
-  ctx->worker_sync_elt
-    = alloc_var_ganglocal (NULL_TREE, vull_type_node, ctx,
-			   TYPE_SIZE_UNIT (vull_type_node));
-}
-
 /* Mark the loops inside the kernels region starting at REGION_ENTRY and ending
    at REGION_EXIT.  */
 
@@ -9947,7 +9973,6 @@  find_omp_target_region_data (struct omp_
     region->gwv_this |= OACC_LOOP_MASK (OACC_worker);
   if (find_omp_clause (clauses, OMP_CLAUSE_VECTOR_LENGTH))
     region->gwv_this |= OACC_LOOP_MASK (OACC_vector);
-  region->broadcast_array = gimple_omp_target_broadcast_array (stmt);
 }
 
 /* Helper for build_omp_regions.  Scan the dominator tree starting at
@@ -10091,669 +10116,6 @@  build_omp_regions (void)
   build_omp_regions_1 (ENTRY_BLOCK_PTR_FOR_FN (cfun), NULL, false);
 }
 
-/* Walk the tree upwards from region until a target region is found
-   or we reach the end, then return it.  */
-static omp_region *
-enclosing_target_region (omp_region *region)
-{
-  while (region != NULL
-	 && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  return region;
-}
-
-/* Return a mask of GWV_ values indicating the kind of OpenACC
-   predication required for basic blocks in REGION.  */
-
-static int
-required_predication_mask (omp_region *region)
-{
-  while (region
-	 && region->type != GIMPLE_OMP_FOR && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-  if (!region)
-    return 0;
-
-  int outer_masks = region->gwv_this;
-  omp_region *outer_target = region;
-  while (outer_target != NULL && outer_target->type != GIMPLE_OMP_TARGET)
-    {
-      if (outer_target->type == GIMPLE_OMP_FOR)
-	outer_masks |= outer_target->gwv_this;
-      outer_target = outer_target->outer;
-    }
-  if (!outer_target)
-    return 0;
-
-  int mask = 0;
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_worker)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_worker)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_worker);
-  if ((outer_target->gwv_this & OACC_LOOP_MASK (OACC_vector)) != 0
-      && (region->type == GIMPLE_OMP_TARGET
-	  || (outer_masks & OACC_LOOP_MASK (OACC_vector)) == 0))
-    mask |= OACC_LOOP_MASK (OACC_vector);
-  return mask;
-}
-
-/* Generate a broadcast across OpenACC vector threads (a warp on GPUs)
-   so that VAR is broadcast to DEST_VAR.  The new statements are added
-   after WHERE.  Return the stmt after which the block should be split.  */
-
-static gimple
-generate_vector_broadcast (tree dest_var, tree var,
-			   gimple_stmt_iterator &where)
-{
-  gimple retval = gsi_stmt (where);
-  tree vartype = TREE_TYPE (var);
-  tree call_arg_type = unsigned_type_node;
-  enum built_in_function fn = BUILT_IN_GOACC_THREAD_BROADCAST;
-
-  if (TYPE_PRECISION (vartype) > TYPE_PRECISION (call_arg_type))
-    {
-      fn = BUILT_IN_GOACC_THREAD_BROADCAST_LL;
-      call_arg_type = long_long_unsigned_type_node;
-    }
-
-  bool need_conversion = !types_compatible_p (vartype, call_arg_type);
-  tree casted_var = var;
-
-  if (need_conversion)
-    {
-      gassign *conv1 = NULL;
-      casted_var = create_tmp_var (call_arg_type);
-
-      /* Handle floats and doubles.  */
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, call_arg_type, var);
-	  conv1 = gimple_build_assign (casted_var, t);
-	}
-      else
-	conv1 = gimple_build_assign (casted_var, NOP_EXPR, var);
-
-      gsi_insert_after (&where, conv1, GSI_CONTINUE_LINKING);
-    }
-
-  tree decl = builtin_decl_explicit (fn);
-  gimple call = gimple_build_call (decl, 1, casted_var);
-  gsi_insert_after (&where, call, GSI_NEW_STMT);
-  tree casted_dest = dest_var;
-
-  if (need_conversion)
-    {
-      gassign *conv2 = NULL;
-      casted_dest = create_tmp_var (call_arg_type);
-
-      if (!INTEGRAL_TYPE_P (vartype))
-	{
-	  tree t = fold_build1 (VIEW_CONVERT_EXPR, vartype, casted_dest);
-	  conv2 = gimple_build_assign (dest_var, t);
-	}
-      else
-	conv2 = gimple_build_assign (dest_var, NOP_EXPR, casted_dest);
-
-      gsi_insert_after (&where, conv2, GSI_CONTINUE_LINKING);
-    }
-
-  gimple_call_set_lhs (call, casted_dest);
-  return retval;
-}
-
-/* Generate a broadcast across OpenACC threads in REGION so that VAR
-   is broadcast to DEST_VAR.  MASK specifies the parallelism level and
-   thereby the broadcast method.  If it is only vector, we
-   can use a warp broadcast, otherwise we fall back to memory
-   store/load.  */
-
-static gimple
-generate_oacc_broadcast (omp_region *region, tree dest_var, tree var,
-			 gimple_stmt_iterator &where, int mask)
-{
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    return generate_vector_broadcast (dest_var, var, where);
-
-  omp_region *parent = enclosing_target_region (region);
-
-  tree elttype = build_qualified_type (TREE_TYPE (var), TYPE_QUAL_VOLATILE);
-  tree ptr = create_tmp_var (build_pointer_type (elttype));
-  gassign *cast1 = gimple_build_assign (ptr, NOP_EXPR,
-				       parent->broadcast_array);
-  gsi_insert_after (&where, cast1, GSI_NEW_STMT);
-  gassign *st = gimple_build_assign (build_simple_mem_ref (ptr), var);
-  gsi_insert_after (&where, st, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  gassign *cast2 = gimple_build_assign (ptr, NOP_EXPR,
-					parent->broadcast_array);
-  gsi_insert_after (&where, cast2, GSI_NEW_STMT);
-  gassign *ld = gimple_build_assign (dest_var, build_simple_mem_ref (ptr));
-  gsi_insert_after (&where, ld, GSI_NEW_STMT);
-
-  gsi_insert_after (&where, build_oacc_threadbarrier (), GSI_NEW_STMT);
-
-  return st;
-}
-
-/* Build a test for OpenACC predication.  TRUE_EDGE is the edge that should be
-   taken if the block should be executed.  SKIP_DEST_BB is the destination to
-   jump to otherwise.  MASK specifies the type of predication, it can contain
-   the bits for VECTOR and/or WORKER.  */
-
-static void
-make_predication_test (edge true_edge, basic_block skip_dest_bb, int mask)
-{
-  basic_block cond_bb = true_edge->src;
-  
-  gimple_stmt_iterator tmp_gsi = gsi_last_bb (cond_bb);
-  tree decl = builtin_decl_explicit (BUILT_IN_GOACC_ID);
-  tree comp_var = NULL_TREE;
-  unsigned ix;
-
-  for (ix = OACC_worker; ix <= OACC_vector; ix++)
-    if (OACC_LOOP_MASK (ix) & mask)
-      {
-	gimple call = gimple_build_call
-	  (decl, 1, build_int_cst (unsigned_type_node, ix));
-	tree var = create_tmp_var (unsigned_type_node);
-
-	gimple_call_set_lhs (call, var);
-	gsi_insert_after (&tmp_gsi, call, GSI_NEW_STMT);
-	if (comp_var)
-	  {
-	    tree new_comp = create_tmp_var (unsigned_type_node);
-	    gassign *ior = gimple_build_assign (new_comp,
-						BIT_IOR_EXPR, comp_var, var);
-	    gsi_insert_after (&tmp_gsi, ior, GSI_NEW_STMT);
-	    comp_var = new_comp;
-	  }
-	else
-	  comp_var = var;
-      }
-
-  tree cond = build2 (EQ_EXPR, boolean_type_node, comp_var,
-		      fold_convert (unsigned_type_node, integer_zero_node));
-  gimple cond_stmt = gimple_build_cond_empty (cond);
-  gsi_insert_after (&tmp_gsi, cond_stmt, GSI_NEW_STMT);
-
-  true_edge->flags = EDGE_TRUE_VALUE;
-
-  /* Force an abnormal edge before a broadcast operation that might be present
-     in SKIP_DEST_BB.  This is only done for the non-execution edge (with
-     respect to the predication done by this function) -- the opposite
-     (execution) edge that reaches the broadcast operation must be made
-     abnormal also, e.g. in this function's caller.  */
-  edge e = make_edge (cond_bb, skip_dest_bb, EDGE_FALSE_VALUE);
-  basic_block false_abnorm_bb = split_edge (e);
-  edge abnorm_edge = single_succ_edge (false_abnorm_bb);
-  abnorm_edge->flags |= EDGE_ABNORMAL;
-}
-
-/* Apply OpenACC predication to basic block BB which is in
-   region PARENT.  MASK has a bitmask of levels that need to be
-   applied; VECTOR and/or WORKER may be set.  */
-
-static void
-predicate_bb (basic_block bb, struct omp_region *parent, int mask)
-{
-  /* We handle worker-single vector-partitioned loops by jumping
-     around them if not in the controlling worker.  Don't insert
-     unnecessary (and incorrect) predication.  */
-  if (parent->type == GIMPLE_OMP_FOR
-      && (parent->gwv_this & OACC_LOOP_MASK (OACC_vector)))
-    mask &= ~OACC_LOOP_MASK (OACC_worker);
-
-  if (mask == 0 || parent->type == GIMPLE_OMP_ATOMIC_LOAD)
-    return;
-
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-
-  gsi = gsi_last_bb (bb);
-  stmt = gsi_stmt (gsi);
-  if (stmt == NULL)
-    return;
-
-  basic_block skip_dest_bb = NULL;
-
-  if (gimple_code (stmt) == GIMPLE_OMP_ENTRY_END)
-    return;
-
-  if (gimple_code (stmt) == GIMPLE_COND)
-    {
-      tree cond_var = create_tmp_var (boolean_type_node);
-      tree broadcast_cond = create_tmp_var (boolean_type_node);
-      gassign *asgn = gimple_build_assign (cond_var,
-					   gimple_cond_code (stmt),
-					   gimple_cond_lhs (stmt),
-					   gimple_cond_rhs (stmt));
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, broadcast_cond,
-						   cond_var, gsi_asgn,
-						   mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_cond_set_condition (as_a <gcond *> (stmt), EQ_EXPR,
-				 broadcast_cond, boolean_true_node);
-    }
-  else if (gimple_code (stmt) == GIMPLE_SWITCH)
-    {
-      gswitch *sstmt = as_a <gswitch *> (stmt);
-      tree var = gimple_switch_index (sstmt);
-      tree new_var = create_tmp_var (TREE_TYPE (var));
-
-      gassign *asgn = gimple_build_assign (new_var, var);
-      gsi_insert_before (&gsi, asgn, GSI_CONTINUE_LINKING);
-      gimple_stmt_iterator gsi_asgn = gsi_for_stmt (asgn);
-
-      gimple splitpoint = generate_oacc_broadcast (parent, new_var, var,
-						   gsi_asgn, mask);
-
-      edge e = split_block (bb, splitpoint);
-      e->flags = EDGE_ABNORMAL;
-      skip_dest_bb = e->dest;
-
-      gimple_switch_set_index (sstmt, new_var);
-    }
-  else if (is_gimple_omp (stmt))
-    {
-      gsi_prev (&gsi);
-      gimple split_stmt = gsi_stmt (gsi);
-      enum gimple_code code = gimple_code (stmt);
-
-      /* First, see if we must predicate away an entire loop or atomic region.  */
-      if (code == GIMPLE_OMP_FOR
-	  || code == GIMPLE_OMP_ATOMIC_LOAD)
-	{
-	  omp_region *inner;
-	  inner = *bb_region_map->get (FALLTHRU_EDGE (bb)->dest);
-	  skip_dest_bb = single_succ (inner->exit);
-	  gcc_assert (inner->entry == bb);
-	  if (code != GIMPLE_OMP_FOR
-	      || ((inner->gwv_this & OACC_LOOP_MASK (OACC_vector))
-		  && !(inner->gwv_this & OACC_LOOP_MASK (OACC_worker))
-		  && (mask & OACC_LOOP_MASK  (OACC_worker))))
-	    {
-	      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-	      gsi_prev (&head_gsi);
-	      edge e0 = split_block (bb, gsi_stmt (head_gsi));
-	      int mask2 = mask;
-	      if (code == GIMPLE_OMP_FOR)
-		mask2 &= ~OACC_LOOP_MASK (OACC_vector);
-	      if (!split_stmt || code != GIMPLE_OMP_FOR)
-		{
-		  /* The simple case: nothing here except the for,
-		     so we just need to make one branch around the
-		     entire loop.  */
-		  inner->entry = e0->dest;
-		  make_predication_test (e0, skip_dest_bb, mask2);
-		  return;
-		}
-	      basic_block for_block = e0->dest;
-	      /* The general case, make two conditions - a full one around the
-		 code preceding the for, and one branch around the loop.  */
-	      edge e1 = split_block (for_block, split_stmt);
-	      basic_block bb3 = e1->dest;
-	      edge e2 = split_block (for_block, split_stmt);
-	      basic_block bb2 = e2->dest;
-
-	      make_predication_test (e0, bb2, mask);
-	      make_predication_test (single_pred_edge (bb3), skip_dest_bb,
-				     mask2);
-	      inner->entry = bb3;
-	      return;
-	    }
-	}
-
-      /* Only a few statements need special treatment.  */
-      if (gimple_code (stmt) != GIMPLE_OMP_FOR
-	  && gimple_code (stmt) != GIMPLE_OMP_CONTINUE
-	  && gimple_code (stmt) != GIMPLE_OMP_RETURN)
-	{
-	  edge e = single_succ_edge (bb);
-	  skip_dest_bb = e->dest;
-	}
-      else
-	{
-	  if (!split_stmt)
-	    return;
-	  edge e = split_block (bb, split_stmt);
-	  skip_dest_bb = e->dest;
-	  if (gimple_code (stmt) == GIMPLE_OMP_CONTINUE)
-	    {
-	      gcc_assert (parent->cont == bb);
-	      parent->cont = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_RETURN)
-	    {
-	      gcc_assert (parent->exit == bb);
-	      parent->exit = skip_dest_bb;
-	    }
-	  else if (gimple_code (stmt) == GIMPLE_OMP_FOR)
-	    {
-	      omp_region *inner;
-	      inner = *bb_region_map->get (FALLTHRU_EDGE (skip_dest_bb)->dest);
-	      gcc_assert (inner->entry == bb);
-	      inner->entry = skip_dest_bb;
-	    }
-	}
-    }
-  else if (single_succ_p (bb))
-    {
-      edge e = single_succ_edge (bb);
-      skip_dest_bb = e->dest;
-      if (gimple_code (stmt) == GIMPLE_GOTO)
-	gsi_prev (&gsi);
-      if (gsi_stmt (gsi) == 0)
-	return;
-    }
-
-  if (skip_dest_bb != NULL)
-    {
-      gimple_stmt_iterator head_gsi = gsi_start_bb (bb);
-      gsi_prev (&head_gsi);
-      edge e2 = split_block (bb, gsi_stmt (head_gsi));
-      make_predication_test (e2, skip_dest_bb, mask);
-    }
-}
-
-/* Walk the dominator tree starting at BB to collect basic blocks in
-   WORKLIST which need OpenACC vector predication applied to them.  */
-
-static void
-find_predicatable_bbs (basic_block bb, vec<basic_block> &worklist)
-{
-  struct omp_region *parent = *bb_region_map->get (bb);
-  if (required_predication_mask (parent) != 0)
-    worklist.safe_push (bb);
-  basic_block son;
-  for (son = first_dom_son (CDI_DOMINATORS, bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    find_predicatable_bbs (son, worklist);
-}
-
-/* Apply OpenACC vector predication to all basic blocks.  HEAD_BB is the
-   first.  */
-
-static void
-predicate_omp_regions (basic_block head_bb)
-{
-  vec<basic_block> worklist = vNULL;
-  find_predicatable_bbs (head_bb, worklist);
-  int i;
-  basic_block bb;
-  FOR_EACH_VEC_ELT (worklist, i, bb)
-    {
-      omp_region *region = *bb_region_map->get (bb);
-      int mask = required_predication_mask (region);
-      predicate_bb (bb, region, mask);
-    }
-}
-
-/* USE and GET sets for variable broadcasting.  */
-static std::set<tree> use, gen, live_in;
-
-/* This is an extremely conservative live in analysis.  We only want to
-   detect is any compiler temporary used inside an acc loop is local to
-   that loop or not.  So record all decl uses in all the basic blocks
-   post-dominating the acc loop in question.  */
-static tree
-populate_loop_live_in (tree *tp, int *walk_subtrees,
-		       void *data_ ATTRIBUTE_UNUSED)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-
-  if (wi && wi->is_lhs)
-    {
-      if (VAR_P (*tp))
-	live_in.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-static void
-oacc_populate_live_in_1 (basic_block entry_bb, basic_block exit_bb,
-			 basic_block loop_bb)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  if (!dominated_by_p (CDI_DOMINATORS, loop_bb, entry_bb))
-    return;
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-      gimple stmt;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_live_in, &wi);
-    }
-
-  /* Continue walking the dominator tree.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, exit_bb, loop_bb);
-}
-
-static void
-oacc_populate_live_in (basic_block entry_bb, omp_region *region)
-{
-  /* Find the innermost OMP_TARGET region.  */
-  while (region  && region->type != GIMPLE_OMP_TARGET)
-    region = region->outer;
-
-  if (!region)
-    return;
-
-  basic_block son;
-
-  for (son = first_dom_son (CDI_DOMINATORS, region->entry);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_populate_live_in_1 (son, region->exit, entry_bb);
-}
-
-static tree
-populate_loop_use (tree *tp, int *walk_subtrees, void *data_)
-{
-  struct walk_stmt_info *wi = (struct walk_stmt_info *) data_;
-  std::set<tree>::iterator it;
-
-  /* There isn't much to do for LHS ops. There shouldn't be any pointers
-     or references here.  */
-  if (wi && wi->is_lhs)
-    return NULL_TREE;
-
-  if (VAR_P (*tp))
-    {
-      tree type;
-
-      *walk_subtrees = 0;
-
-      /* Filter out incompatible decls.  */
-      if (INDIRECT_REF_P (*tp) || is_global_var (*tp))
-	return NULL_TREE;
-
-      type = TREE_TYPE (*tp);
-
-      /* Aggregate types aren't supported either.  */
-      if (AGGREGATE_TYPE_P (type))
-	return NULL_TREE;
-
-      /* Filter out decls inside GEN.  */
-      it = gen.find (*tp);
-      if (it == gen.end ())
-	use.insert (*tp);
-    }
-  else if (IS_TYPE_OR_DECL_P (*tp))
-    *walk_subtrees = 0;
-
-  return NULL_TREE;
-}
-
-/* INIT is true if this is the first time this function is called.  */
-
-static void
-oacc_broadcast_1 (basic_block entry_bb, basic_block exit_bb, bool init,
-		  int mask)
-{
-  basic_block son;
-  gimple_stmt_iterator gsi;
-  gimple stmt;
-  tree block, var;
-
-  if (entry_bb == exit_bb)
-    return;
-
-  /* Populate the GEN set.  */
-
-  gsi = gsi_start_bb (entry_bb);
-  stmt = gsi_stmt (gsi);
-
-  /* There's nothing to do if stmt is empty or if this is the entry basic
-     block to the vector loop.  The entry basic block to pre-expanded loops
-     do not have an entry label.  As such, the scope containing the initial
-     entry_bb should not be added to the gen set.  */
-  if (stmt != NULL && !init && (block = gimple_block (stmt)) != NULL)
-    for (var = BLOCK_VARS (block); var; var = DECL_CHAIN (var))
-      gen.insert(var);
-
-  /* Populate the USE set.  */
-
-  for (gsi = gsi_start_bb (entry_bb); !gsi_end_p (gsi); gsi_next (&gsi))
-    {
-      struct walk_stmt_info wi;
-
-      memset (&wi, 0, sizeof (wi));
-      stmt = gsi_stmt (gsi);
-
-      walk_gimple_op (stmt, populate_loop_use, &wi);
-    }
-
-  /* Continue processing the children of this basic block.  */
-  for (son = first_dom_son (CDI_DOMINATORS, entry_bb);
-       son;
-       son = next_dom_son (CDI_DOMINATORS, son))
-    oacc_broadcast_1 (son, exit_bb, false, mask);
-}
-
-/* Broadcast variables to OpenACC vector loops.  This function scans
-   all of the basic blocks withing an acc vector loop.  It maintains
-   two sets of decls, a GEN set and a USE set.  The GEN set contains
-   all of the decls in the the basic block's scope.  The USE set
-   consists of decls used in current basic block, but are not in the
-   GEN set, globally defined or were transferred into the the accelerator
-   via a data movement clause.
-
-   The vector loop begins at ENTRY_BB and end at EXIT_BB, where EXIT_BB
-   is a latch back to ENTRY_BB.  Once a set of used variables have been
-   determined, they will get broadcasted in a pre-header to ENTRY_BB.  */
-
-static basic_block
-oacc_broadcast (basic_block entry_bb, basic_block exit_bb, omp_region *region)
-{
-  gimple_stmt_iterator gsi;
-  std::set<tree>::iterator it;
-  int mask = region->gwv_this;
-
-  /* Nothing to do if this isn't an acc worker or vector loop.  */
-  if (mask == 0)
-    return entry_bb;
-
-  use.empty ();
-  gen.empty ();
-  live_in.empty ();
-
-  /* Currently, subroutines aren't supported.  */
-  gcc_assert (!lookup_attribute ("oacc function",
-				 DECL_ATTRIBUTES (current_function_decl)));
-
-  /* Populate live_in.  */
-  oacc_populate_live_in (entry_bb, region);
-
-  /* Populate the set of used decls.  */
-  oacc_broadcast_1 (entry_bb, exit_bb, true, mask);
-
-  /* Filter out all of the GEN decls from the USE set.  Also filter out
-     any compiler temporaries that which are not present in LIVE_IN.  */
-  for (it = use.begin (); it != use.end (); it++)
-    {
-      std::set<tree>::iterator git, lit;
-
-      git = gen.find (*it);
-      lit = live_in.find (*it);
-      if (git != gen.end () || lit == live_in.end ())
-	use.erase (it);
-    }
-
-  if (mask == OACC_LOOP_MASK (OACC_vector))
-    {
-      /* Broadcast all decls in USE right before the last instruction in
-	 entry_bb.  */
-      gsi = gsi_last_bb (entry_bb);
-
-      gimple_seq seq = NULL;
-      gimple_stmt_iterator g2 = gsi_start (seq);
-
-      for (it = use.begin (); it != use.end (); it++)
-	generate_oacc_broadcast (region, *it, *it, g2, mask);
-
-      gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-    }
-  else if (mask & OACC_LOOP_MASK (OACC_worker))
-    {
-      if (use.empty ())
-	return entry_bb;
-
-      /* If this loop contains a worker, then each broadcast must be
-	 predicated.  */
-
-      for (it = use.begin (); it != use.end (); it++)
-	{
-	  /* Worker broadcasting requires predication.  To do that, there
-	     needs to be several new parent basic blocks before the omp
-	     for instruction.  */
-
-	  gimple_seq seq = NULL;
-	  gimple_stmt_iterator g2 = gsi_start (seq);
-	  gimple splitpoint = generate_oacc_broadcast (region, *it, *it,
-						       g2, mask);
-	  gsi = gsi_last_bb (entry_bb);
-	  gsi_insert_seq_before (&gsi, seq, GSI_CONTINUE_LINKING);
-	  edge e = split_block (entry_bb, splitpoint);
-	  e->flags |= EDGE_ABNORMAL;
-	  basic_block dest_bb = e->dest;
-	  gsi_prev (&gsi);
-	  edge e2 = split_block (entry_bb, gsi_stmt (gsi));
-	  e2->flags |= EDGE_ABNORMAL;
-	  make_predication_test (e2, dest_bb, mask);
-
-	  /* Update entry_bb.  */
-	  entry_bb = dest_bb;
-	}
-    }
-
-  return entry_bb;
-}
-
 /* Main entry point for expanding OMP-GIMPLE into runtime calls.  */
 
 static unsigned int
@@ -10772,8 +10134,6 @@  execute_expand_omp (void)
 	  fprintf (dump_file, "\n");
 	}
 
-      predicate_omp_regions (ENTRY_BLOCK_PTR_FOR_FN (cfun));
-
       remove_exit_barriers (root_omp_region);
 
       expand_omp (root_omp_region);
@@ -12342,10 +11702,7 @@  lower_omp_target (gimple_stmt_iterator *
   orlist = NULL;
 
   if (is_gimple_omp_oacc (stmt))
-    {
-      oacc_init_count_vars (ctx, clauses);
-      oacc_alloc_broadcast_storage (ctx);
-    }
+    oacc_init_count_vars (ctx, clauses);
 
   if (has_reduction)
     {
@@ -12631,7 +11988,6 @@  lower_omp_target (gimple_stmt_iterator *
   gsi_insert_seq_before (gsi_p, sz_ilist, GSI_SAME_STMT);
 
   gimple_omp_target_set_ganglocal_size (stmt, sz);
-  gimple_omp_target_set_broadcast_array (stmt, ctx->worker_sync_elt);
   pop_gimplify_context (NULL);
 }
 
@@ -13348,16 +12704,7 @@  make_gimple_omp_edges (basic_block bb, s
 				  ((for_stmt = last_stmt (cur_region->entry))))
 	     == GF_OMP_FOR_KIND_OACC_LOOP)
         {
-	  /* Called before OMP expansion, so this information has not been
-	     recorded in cur_region->gwv_this yet.  */
-	  int gwv_bits = find_omp_for_region_gwv (for_stmt);
-	  if (oacc_loop_needs_threadbarrier_p (gwv_bits))
-	    {
-	      make_edge (bb, bb->next_bb, EDGE_FALLTHRU | EDGE_ABNORMAL);
-	      fallthru = false;
-	    }
-	  else
-	    fallthru = true;
+	  fallthru = true;
 	}
       else
 	/* In the case of a GIMPLE_OMP_SECTION, the edge will go
Index: omp-low.h
===================================================================
--- omp-low.h	(revision 225323)
+++ omp-low.h	(working copy)
@@ -20,6 +20,8 @@  along with GCC; see the file COPYING3.
 #ifndef GCC_OMP_LOW_H
 #define GCC_OMP_LOW_H
 
+/* Levels of parallelism as defined by OpenACC.  Increasing numbers
+   correspond to deeper loop nesting levels.  */
 enum oacc_loop_levels
   {
     OACC_gang,
@@ -27,6 +29,7 @@  enum oacc_loop_levels
     OACC_vector,
     OACC_HWM
   };
+#define OACC_LOOP_MASK(X) (1 << (X))
 
 struct omp_region;
 
Index: builtins.c
===================================================================
--- builtins.c	(revision 225323)
+++ builtins.c	(working copy)
@@ -5947,20 +5947,6 @@  expand_builtin_acc_on_device (tree exp A
 #endif
 }
 
-/* Expand a thread synchronization point for OpenACC threads.  */
-static void
-expand_oacc_threadbarrier (void)
-{
-#ifdef HAVE_oacc_threadbarrier
-  rtx insn = GEN_FCN (CODE_FOR_oacc_threadbarrier) ();
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-    }
-#endif
-}
-
-
 /* Expand a thread-id/thread-count builtin for OpenACC.  */
 
 static rtx
@@ -6032,47 +6018,6 @@  expand_oacc_ganglocal_ptr (rtx target AT
   return NULL_RTX;
 }
 
-/* Handle a GOACC_thread_broadcast builtin call EXP with target TARGET.
-   Return the result.  */
-
-static rtx
-expand_builtin_oacc_thread_broadcast (tree exp, rtx target)
-{
-  tree arg0 = CALL_EXPR_ARG (exp, 0);
-  enum insn_code icode;
-
-  enum machine_mode mode = TYPE_MODE (TREE_TYPE (arg0));
-  gcc_assert (INTEGRAL_MODE_P (mode));
-  do
-    {
-      icode = direct_optab_handler (oacc_thread_broadcast_optab, mode);
-      mode = GET_MODE_WIDER_MODE (mode);
-    }
-  while (icode == CODE_FOR_nothing && mode != VOIDmode);
-  if (icode == CODE_FOR_nothing)
-    return expand_expr (arg0, NULL_RTX, VOIDmode, EXPAND_NORMAL);
-
-  rtx tmp = target;
-  machine_mode mode0 = insn_data[icode].operand[0].mode;
-  machine_mode mode1 = insn_data[icode].operand[1].mode;
-  if (!tmp || !REG_P (tmp) || GET_MODE (tmp) != mode0)
-    tmp = gen_reg_rtx (mode0);
-  rtx op1 = expand_expr (arg0, NULL_RTX, mode1, EXPAND_NORMAL);
-  if (GET_MODE (op1) != mode1)
-    op1 = convert_to_mode (mode1, op1, 0);
-
-  /* op1 might be an immediate, place it inside a register.  */
-  op1 = force_reg (mode1, op1);
-
-  rtx insn = GEN_FCN (icode) (tmp, op1);
-  if (insn != NULL_RTX)
-    {
-      emit_insn (insn);
-      return tmp;
-    }
-  return const0_rtx;
-}
-
 /* Expand an expression EXP that calls a built-in function,
    with result going to TARGET if that's convenient
    (and in mode MODE if that's convenient).
@@ -7225,14 +7170,6 @@  expand_builtin (tree exp, rtx target, rt
 	return target;
       break;
 
-    case BUILT_IN_GOACC_THREAD_BROADCAST:
-    case BUILT_IN_GOACC_THREAD_BROADCAST_LL:
-      return expand_builtin_oacc_thread_broadcast (exp, target);
-
-    case BUILT_IN_GOACC_THREADBARRIER:
-      expand_oacc_threadbarrier ();
-      return const0_rtx;
-
     default:	/* just do library call, if unknown builtin */
       break;
     }
Index: omp-builtins.def
===================================================================
--- omp-builtins.def	(revision 225323)
+++ omp-builtins.def	(working copy)
@@ -69,13 +69,6 @@  DEF_GOACC_BUILTIN (BUILT_IN_GOACC_GET_GA
 		   BT_FN_PTR, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOACC_BUILTIN (BUILT_IN_GOACC_DEVICEPTR, "GOACC_deviceptr",
 		   BT_FN_PTR_PTR, ATTR_CONST_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST, "GOACC_thread_broadcast",
-		   BT_FN_UINT_UINT, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREAD_BROADCAST_LL, "GOACC_thread_broadcast_ll",
-		   BT_FN_ULONGLONG_ULONGLONG, ATTR_NOTHROW_LEAF_LIST)
-DEF_GOACC_BUILTIN (BUILT_IN_GOACC_THREADBARRIER, "GOACC_threadbarrier",
-		   BT_FN_VOID, ATTR_NOTHROW_LEAF_LIST)
-
 DEF_GOACC_BUILTIN_COMPILER (BUILT_IN_ACC_ON_DEVICE, "acc_on_device",
 			    BT_FN_INT_INT, ATTR_CONST_NOTHROW_LEAF_LIST)
 
Index: gimple.c
===================================================================
--- gimple.c	(revision 225323)
+++ gimple.c	(working copy)
@@ -1380,12 +1380,27 @@  bool
 gimple_call_same_target_p (const_gimple c1, const_gimple c2)
 {
   if (gimple_call_internal_p (c1))
-    return (gimple_call_internal_p (c2)
-	    && gimple_call_internal_fn (c1) == gimple_call_internal_fn (c2));
+    {
+      if (!gimple_call_internal_p (c2)
+	  || gimple_call_internal_fn (c1) != gimple_call_internal_fn (c2))
+	return false;
+
+      if (gimple_call_internal_unique_p (c1))
+	return false;
+      
+      return true;
+    }
+  else if (gimple_call_fn (c1) == gimple_call_fn (c2))
+    return true;
   else
-    return (gimple_call_fn (c1) == gimple_call_fn (c2)
-	    || (gimple_call_fndecl (c1)
-		&& gimple_call_fndecl (c1) == gimple_call_fndecl (c2)));
+    {
+      tree decl = gimple_call_fndecl (c1);
+
+      if (!decl || decl != gimple_call_fndecl (c2))
+	return false;
+
+      return true;
+    }
 }
 
 /* Detect flags from a GIMPLE_CALL.  This is just like
Index: gimple.h
===================================================================
--- gimple.h	(revision 225323)
+++ gimple.h	(working copy)
@@ -581,10 +581,6 @@  struct GTY((tag("GSS_OMP_PARALLEL_LAYOUT
   /* [ WORD 11 ]
      Size of the gang-local memory to allocate.  */
   tree ganglocal_size;
-
-  /* [ WORD 12 ]
-     A pointer to the array to be used for broadcasting across threads.  */
-  tree broadcast_array;
 };
 
 /* GIMPLE_OMP_PARALLEL or GIMPLE_TASK */
@@ -2693,6 +2689,11 @@  gimple_call_internal_fn (const_gimple gs
   return static_cast <const gcall *> (gs)->u.internal_fn;
 }
 
+/* Return true, if this internal gimple call is unique.  */
+
+extern bool
+gimple_call_internal_unique_p (const_gimple);
+
 /* If CTRL_ALTERING_P is true, mark GIMPLE_CALL S to be a stmt
    that could alter control flow.  */
 
@@ -5248,25 +5249,6 @@  gimple_omp_target_set_ganglocal_size (go
 }
 
 
-/* Return the pointer to the broadcast array associated with OMP_TARGET GS.  */
-
-static inline tree
-gimple_omp_target_broadcast_array (const gomp_target *omp_target_stmt)
-{
-  return omp_target_stmt->broadcast_array;
-}
-
-
-/* Set PTR to be the broadcast array associated with OMP_TARGET
-   GS.  */
-
-static inline void
-gimple_omp_target_set_broadcast_array (gomp_target *omp_target_stmt, tree ptr)
-{
-  omp_target_stmt->broadcast_array = ptr;
-}
-
-
 /* Return the clauses associated with OMP_TEAMS GS.  */
 
 static inline tree
Index: tree-ssa-threadedge.c
===================================================================
--- tree-ssa-threadedge.c	(revision 225323)
+++ tree-ssa-threadedge.c	(working copy)
@@ -310,6 +310,17 @@  record_temporary_equivalences_from_stmts
 	  && gimple_asm_volatile_p (as_a <gasm *> (stmt)))
 	return NULL;
 
+      /* If the statement is a unique builtin, we can not thread
+	 through here.  */
+      if (gimple_code (stmt) == GIMPLE_CALL)
+	{
+	  gcall *call = as_a <gcall *> (stmt);
+
+	  if (gimple_call_internal_p (call)
+	      && gimple_call_internal_unique_p (call))
+	    return NULL;
+	}
+
       /* If duplicating this block is going to cause too much code
 	 expansion, then do not thread through this block.  */
       stmt_count++;
Index: tree-ssa-tail-merge.c
===================================================================
--- tree-ssa-tail-merge.c	(revision 225323)
+++ tree-ssa-tail-merge.c	(working copy)
@@ -608,10 +608,13 @@  same_succ_def::equal (const same_succ_de
     {
       s1 = gsi_stmt (gsi1);
       s2 = gsi_stmt (gsi2);
-      if (gimple_code (s1) != gimple_code (s2))
-	return 0;
-      if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
-	return 0;
+      if (s1 != s2)
+	{
+	  if (gimple_code (s1) != gimple_code (s2))
+	    return 0;
+	  if (is_gimple_call (s1) && !gimple_call_same_target_p (s1, s2))
+	    return 0;
+	}
       gsi_next_nondebug (&gsi1);
       gsi_next_nondebug (&gsi2);
       gsi_advance_fw_nondebug_nonlocal (&gsi1);
Index: internal-fn.def
===================================================================
--- internal-fn.def	(revision 225323)
+++ internal-fn.def	(working copy)
@@ -64,3 +64,5 @@  DEF_INTERNAL_FN (MUL_OVERFLOW, ECF_CONST
 DEF_INTERNAL_FN (TSAN_FUNC_EXIT, ECF_NOVOPS | ECF_LEAF | ECF_NOTHROW, NULL)
 DEF_INTERNAL_FN (VA_ARG, ECF_NOTHROW | ECF_LEAF, NULL)
 DEF_INTERNAL_FN (GOACC_DATA_END_WITH_ARG, ECF_NOTHROW, ".r")
+DEF_INTERNAL_FN (GOACC_LEVELS, ECF_NOTHROW | ECF_LEAF, "..")
+DEF_INTERNAL_FN (GOACC_LOOP, ECF_NOTHROW | ECF_LEAF, "..")
Index: tree-ssa-alias.c
===================================================================
--- tree-ssa-alias.c	(revision 225323)
+++ tree-ssa-alias.c	(working copy)
@@ -1764,7 +1764,6 @@  ref_maybe_used_by_call_p_1 (gcall *call,
 	case BUILT_IN_GOMP_ATOMIC_END:
 	case BUILT_IN_GOMP_BARRIER:
 	case BUILT_IN_GOMP_BARRIER_CANCEL:
-	case BUILT_IN_GOACC_THREADBARRIER:
 	case BUILT_IN_GOMP_TASKWAIT:
 	case BUILT_IN_GOMP_TASKGROUP_END:
 	case BUILT_IN_GOMP_CRITICAL_START:
Index: config/nvptx/nvptx-protos.h
===================================================================
--- config/nvptx/nvptx-protos.h	(revision 225323)
+++ config/nvptx/nvptx-protos.h	(working copy)
@@ -32,6 +32,7 @@  extern void nvptx_register_pragmas (void
 extern const char *nvptx_section_for_decl (const_tree);
 
 #ifdef RTX_CODE
+extern void nvptx_expand_oacc_loop (rtx, rtx);
 extern void nvptx_expand_call (rtx, rtx);
 extern rtx nvptx_expand_compare (rtx);
 extern const char *nvptx_ptx_type_from_mode (machine_mode, bool);
Index: config/nvptx/nvptx.md
===================================================================
--- config/nvptx/nvptx.md	(revision 225323)
+++ config/nvptx/nvptx.md	(working copy)
@@ -52,15 +52,23 @@ 
    UNSPEC_NID
 
    UNSPEC_SHARED_DATA
+
+   UNSPEC_BIT_CONV
+
+   UNSPEC_BROADCAST
+   UNSPEC_BR_UNIFIED
 ])
 
 (define_c_enum "unspecv" [
    UNSPECV_LOCK
    UNSPECV_CAS
    UNSPECV_XCHG
-   UNSPECV_WARP_BCAST
    UNSPECV_BARSYNC
    UNSPECV_ID
+
+   UNSPECV_LEVELS
+   UNSPECV_LOOP
+   UNSPECV_BR_HIDDEN
 ])
 
 (define_attr "subregs_ok" "false,true"
@@ -253,6 +261,8 @@ 
 (define_mode_iterator QHSIM [QI HI SI])
 (define_mode_iterator SDFM [SF DF])
 (define_mode_iterator SDCM [SC DC])
+(define_mode_iterator BITS [SI SF])
+(define_mode_iterator BITD [DI DF])
 
 ;; This mode iterator allows :P to be used for patterns that operate on
 ;; pointer-sized quantities.  Exactly one of the two alternatives will match.
@@ -813,7 +823,7 @@ 
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%j0\\tbra%U0\\t%l1;")
+  "%j0\\tbra\\t%l1;")
 
 (define_insn "br_false"
   [(set (pc)
@@ -822,7 +832,34 @@ 
 		      (label_ref (match_operand 1 "" ""))
 		      (pc)))]
   ""
-  "%J0\\tbra%U0\\t%l1;")
+  "%J0\\tbra\\t%l1;")
+
+;; a hidden conditional branch
+(define_insn "br_true_hidden"
+  [(unspec_volatile:SI [(ne (match_operand:BI 0 "nvptx_register_operand" "R")
+			    (const_int 0))
+		        (label_ref (match_operand 1 "" ""))
+			(match_operand:SI 2 "const_int_operand" "i")]
+			UNSPECV_BR_HIDDEN)]
+  ""
+  "%j0\\tbra%U2\\t%l1;")
+
+;; unified conditional branch
+(define_insn "br_uni_true"
+  [(set (pc) (if_then_else
+	(ne (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%j0\\tbra.uni\\t%l1;")
+
+(define_insn "br_uni_false"
+  [(set (pc) (if_then_else
+	(eq (unspec:BI [(match_operand:BI 0 "nvptx_register_operand" "R")]
+		       UNSPEC_BR_UNIFIED) (const_int 0))
+        (label_ref (match_operand 1 "" "")) (pc)))]
+  ""
+  "%J0\\tbra.uni\\t%l1;")
 
 (define_expand "cbranch<mode>4"
   [(set (pc)
@@ -1326,37 +1363,72 @@ 
   return asms[INTVAL (operands[1])];
 })
 
-(define_insn "oacc_thread_broadcastsi"
-  [(set (match_operand:SI 0 "nvptx_register_operand" "")
-	(unspec_volatile:SI [(match_operand:SI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
+(define_insn "oacc_levels"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")]
+		       UNSPECV_LEVELS)]
   ""
-  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+  "// levels %0;"
+)
 
-(define_expand "oacc_thread_broadcastdi"
-  [(set (match_operand:DI 0 "nvptx_register_operand" "")
-	(unspec_volatile:DI [(match_operand:DI 1 "nvptx_register_operand" "")]
-			    UNSPECV_WARP_BCAST))]
-  ""
-{
-  rtx t = gen_reg_rtx (DImode);
-  emit_insn (gen_lshrdi3 (t, operands[1], GEN_INT (32)));
-  rtx op0 = force_reg (SImode, gen_lowpart (SImode, t));
-  rtx op1 = force_reg (SImode, gen_lowpart (SImode, operands[1]));
-  rtx targ0 = gen_reg_rtx (SImode);
-  rtx targ1 = gen_reg_rtx (SImode);
-  emit_insn (gen_oacc_thread_broadcastsi (targ0, op0));
-  emit_insn (gen_oacc_thread_broadcastsi (targ1, op1));
-  rtx t2 = gen_reg_rtx (DImode);
-  rtx t3 = gen_reg_rtx (DImode);
-  emit_insn (gen_extendsidi2 (t2, targ0));
-  emit_insn (gen_extendsidi2 (t3, targ1));
-  rtx t4 = gen_reg_rtx (DImode);
-  emit_insn (gen_ashldi3 (t4, t2, GEN_INT (32)));
-  emit_insn (gen_iordi3 (operands[0], t3, t4));
-  DONE;
+(define_insn "nvptx_loop"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")
+		        (match_operand:SI 1 "const_int_operand" "")]
+		       UNSPECV_LOOP)]
+  ""
+  "// loop %0, %1;"
+)
+
+(define_expand "oacc_loop"
+  [(unspec_volatile:SI [(match_operand:SI 0 "const_int_operand" "")
+		        (match_operand:SI 1 "const_int_operand" "")]
+		       UNSPECV_LOOP)]
+  ""
+{
+  nvptx_expand_oacc_loop (operands[0], operands[1]);
 })
 
+;; Only 32-bit shuffles exist.
+(define_insn "nvptx_broadcast<mode>"
+  [(set (match_operand:BITS 0 "nvptx_register_operand" "")
+	(unspec:BITS
+		[(match_operand:BITS 1 "nvptx_register_operand" "")]
+		  UNSPEC_BROADCAST))]
+  ""
+  "%.\\tshfl.idx.b32\\t%0, %1, 0, 31;")
+
+;; Extract parts of a 64-bit object into two 32-bit ints.
+(define_insn "unpack<mode>si2"
+  [(set (match_operand:SI 0 "nvptx_register_operand" "")
+        (unspec:SI [(match_operand:BITD 2 "nvptx_register_operand" "")
+		    (const_int 0)] UNSPEC_BIT_CONV))
+   (set (match_operand:SI 1 "nvptx_register_operand" "")
+        (unspec:SI [(match_dup 2) (const_int 1)] UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 {%0,%1}, %2;")
+
+;; Pack two 32-bit ints into a 64-bit object.
+(define_insn "packsi<mode>2"
+  [(set (match_operand:BITD 0 "nvptx_register_operand" "")
+        (unspec:BITD [(match_operand:SI 1 "nvptx_register_operand" "")
+		      (match_operand:SI 2 "nvptx_register_operand" "")]
+		    UNSPEC_BIT_CONV))]
+  ""
+  "%.\\tmov.b64 %0, {%1,%2};")
+
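+;; Loads & stores to the worker broadcast buffer are wrapped in an
+;; UNSPEC_SHARED_DATA to keep them distinct from ordinary memory
+;; accesses.
+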
+(define_insn "worker_load<mode>"
+  [(set (match_operand:SDISDFM 0 "nvptx_register_operand" "=R")
+        (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "m")]
+			 UNSPEC_SHARED_DATA))]
+  ""
+  "%.\\tld.shared%u0\\t%0,%1;")
+
+(define_insn "worker_store<mode>"
+  [(set (unspec:SDISDFM [(match_operand:SDISDFM 1 "memory_operand" "=m")]
+			 UNSPEC_SHARED_DATA)
+	(match_operand:SDISDFM 0 "nvptx_register_operand" "R"))]
+  ""
+  "%.\\tst.shared%u1\\t%1,%0;")
+
 (define_insn "ganglocal_ptr<mode>"
   [(set (match_operand:P 0 "nvptx_register_operand" "")
 	(unspec:P [(const_int 0)] UNSPEC_SHARED_DATA))]
@@ -1462,14 +1534,8 @@ 
   "%.\\tatom%A1.b%T0.<logic>\\t%0, %1, %2;")
 
 ;; ??? Mark as not predicable later?
-(define_insn "threadbarrier_insn"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
+(define_insn "nvptx_barsync"
+  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")]
+		    UNSPECV_BARSYNC)]
   ""
   "bar.sync\\t%0;")
-
-(define_expand "oacc_threadbarrier"
-  [(unspec_volatile [(match_operand:SI 0 "const_int_operand" "")] UNSPECV_BARSYNC)]
-  ""
-{
-  operands[0] = const0_rtx;
-})
Index: config/nvptx/nvptx.c
===================================================================
--- config/nvptx/nvptx.c	(revision 225323)
+++ config/nvptx/nvptx.c	(working copy)
@@ -24,6 +24,7 @@ 
 #include "coretypes.h"
 #include "tm.h"
 #include "rtl.h"
+#include "hash-map.h"
 #include "hash-set.h"
 #include "machmode.h"
 #include "vec.h"
@@ -74,6 +75,15 @@ 
 #include "df.h"
 #include "dumpfile.h"
 #include "builtins.h"
+#include "dominance.h"
+#include "cfg.h"
+#include "omp-low.h"
+
+#define nvptx_loop_head		0
+#define nvptx_loop_tail		1
+#define LOOP_MODE_CHANGE_P(X) ((X) < 2)
+#define nvptx_loop_prehead 	2
+#define nvptx_loop_pretail 	3
 
 /* Record the function decls we've written, and the libfuncs and function
    decls corresponding to them.  */
@@ -97,6 +107,16 @@  static GTY((cache))
 static GTY((cache)) hash_table<tree_hasher> *declared_fndecls_htab;
 static GTY((cache)) hash_table<tree_hasher> *needed_fndecls_htab;
 
+/* Size of buffer needed to broadcast across workers.  This is used
+   for both worker-neutering and worker broadcasting.  It is shared
+   by all functions emitted.  The buffer is placed in shared memory.
+   It'd be nice if PTX supported common blocks, because then this
+   could be shared across TUs (taking the largest size).  */
+static unsigned worker_bcast_hwm;
+static unsigned worker_bcast_align;
+#define worker_bcast_name "__worker_bcast"
+static GTY(()) rtx worker_bcast_sym;
+
 /* Allocate a new, cleared machine_function structure.  */
 
 static struct machine_function *
@@ -124,6 +144,8 @@  nvptx_option_override (void)
   needed_fndecls_htab = hash_table<tree_hasher>::create_ggc (17);
   declared_libfuncs_htab
     = hash_table<declared_libfunc_hasher>::create_ggc (17);
+
+  worker_bcast_sym = gen_rtx_SYMBOL_REF (Pmode, worker_bcast_name);
 }
 
 /* Return the mode to be used when declaring a ptx object for OBJ.
@@ -1053,6 +1075,7 @@  nvptx_static_chain (const_tree fndecl, b
     return gen_rtx_REG (Pmode, OUTGOING_STATIC_CHAIN_REGNUM);
 }
 
+
 /* Emit a comparison COMPARE, and return the new test to be used in the
    jump.  */
 
@@ -1066,6 +1089,203 @@  nvptx_expand_compare (rtx compare)
   return gen_rtx_NE (BImode, pred, const0_rtx);
 }
 
+
+/* Expand the oacc_loop primitive into ptx-required unspecs.  */
+
+void
+nvptx_expand_oacc_loop (rtx kind, rtx mode)
+{
+  /* Emit pre-tail for all loops and emit pre-head for worker level.  */
+  if (UINTVAL (kind) || UINTVAL (mode) == OACC_worker)
+    emit_insn (gen_nvptx_loop (GEN_INT (UINTVAL (kind) + 2), mode));
+}
+
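+/* For example, a worker-partitioned loop ends up bracketed as
+   (schematically):
+
+	// loop prehead, worker		emitted by the expander above
+	// loop head, worker		the oacc_loop pattern itself
+	   ...partitioned body...
+	// loop pretail, worker
+	// loop tail, worker
+
+   The pre markers give the reorg machinery a handle on the blocks
+   just outside the partitioned region.  */
+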
+/* Generate instruction(s) to unpack a 64-bit object into two 32-bit
+   objects.  */
+
+static rtx
+nvptx_gen_unpack (rtx dst0, rtx dst1, rtx src)
+{
+  rtx res;
+  
+  switch (GET_MODE (src))
+    {
+    case DImode:
+      res = gen_unpackdisi2 (dst0, dst1, src);
+      break;
+    case DFmode:
+      res = gen_unpackdfsi2 (dst0, dst1, src);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate instruction(s) to pack two 32-bit objects into a 64-bit
+   object.  */
+
+static rtx
+nvptx_gen_pack (rtx dst, rtx src0, rtx src1)
+{
+  rtx res;
+  
+  switch (GET_MODE (dst))
+    {
+    case DImode:
+      res = gen_packsidi2 (dst, src0, src1);
+      break;
+    case DFmode:
+      res = gen_packsidf2 (dst, src0, src1);
+      break;
+    default: gcc_unreachable ();
+    }
+  return res;
+}
+
+/* Generate an instruction or sequence to broadcast register REG
+   across the vectors of a single warp.  */
+
+static rtx
+nvptx_gen_vcast (rtx reg)
+{
+  rtx res;
+
+  switch (GET_MODE (reg))
+    {
+    case SImode:
+      res = gen_nvptx_broadcastsi (reg, reg);
+      break;
+    case SFmode:
+      res = gen_nvptx_broadcastsf (reg, reg);
+      break;
+    case DImode:
+    case DFmode:
+      {
+	rtx tmp0 = gen_reg_rtx (SImode);
+	rtx tmp1 = gen_reg_rtx (SImode);
+
+	start_sequence ();
+	emit_insn (nvptx_gen_unpack (tmp0, tmp1, reg));
+	emit_insn (nvptx_gen_vcast (tmp0));
+	emit_insn (nvptx_gen_vcast (tmp1));
+	emit_insn (nvptx_gen_pack (reg, tmp0, tmp1));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_vcast (tmp));
+	emit_insn (gen_rtx_SET (BImode, reg,
+				gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+      
+    case HImode:
+    case QImode:
+    default:
+      gcc_unreachable ();
+    }
+  return res;
+}
+
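+/* A DImode or DFmode broadcast thus becomes, roughly:
+
+	mov.b64	{%lo,%hi}, %r;
+	shfl.idx.b32	%lo, %lo, 0, 31;
+	shfl.idx.b32	%hi, %hi, 0, 31;
+	mov.b64	%r, {%lo,%hi};  */
+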
+/* Structure used when generating a worker-level spill or fill.  */
+
+struct wcast_data_t
+{
+  rtx base;
+  rtx ptr;
+  unsigned offset;
+};
+
+/* Direction of the spill/fill and looping setup/teardown indicator.  */
+
+enum propagate_mask
+  {
+    PM_read = 1 << 0,
+    PM_write = 1 << 1,
+    PM_loop_begin = 1 << 2,
+    PM_loop_end = 1 << 3,
+
+    PM_read_write = PM_read | PM_write
+  };
+
+/* Generate instruction(s) to spill or fill register REG to/from the
+   worker broadcast array.  PM indicates what is to be done, REP
+   how many loop iterations will be executed (0 for not a loop).  */
+   
+static rtx
+nvptx_gen_wcast (rtx reg, propagate_mask pm, unsigned rep, wcast_data_t *data)
+{
+  rtx  res;
+  machine_mode mode = GET_MODE (reg);
+
+  switch (mode)
+    {
+    case BImode:
+      {
+	rtx tmp = gen_reg_rtx (SImode);
+	
+	start_sequence ();
+	if (pm & PM_read)
+	  emit_insn (gen_sel_truesi (tmp, reg, GEN_INT (1), const0_rtx));
+	emit_insn (nvptx_gen_wcast (tmp, pm, rep, data));
+	if (pm & PM_write)
+	  emit_insn (gen_rtx_SET (BImode, reg,
+				  gen_rtx_NE (BImode, tmp, const0_rtx)));
+	res = get_insns ();
+	end_sequence ();
+      }
+      break;
+
+    default:
+      {
+	rtx addr = data->ptr;
+
+	if (!addr)
+	  {
+	    unsigned align = GET_MODE_ALIGNMENT (mode) / BITS_PER_UNIT;
+
+	    if (align > worker_bcast_align)
+	      worker_bcast_align = align;
+	    data->offset = (data->offset + align - 1) & ~(align - 1);
+	    addr = data->base;
+	    if (data->offset)
+	      addr = gen_rtx_PLUS (Pmode, addr, GEN_INT (data->offset));
+	  }
+	
+	addr = gen_rtx_MEM (mode, addr);
+	addr = gen_rtx_UNSPEC (mode, gen_rtvec (1, addr), UNSPEC_SHARED_DATA);
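+	/* Exactly one of PM_read & PM_write is set here, so RES is
+	   always initialized by one of the following sets.  */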
+	if (pm & PM_read)
+	  res = gen_rtx_SET (mode, addr, reg);
+	if (pm & PM_write)
+	  res = gen_rtx_SET (mode, reg, addr);
+
+	if (data->ptr)
+	  {
+	    /* We're using a ptr, increment it.  */
+	    start_sequence ();
+	    
+	    emit_insn (res);
+	    emit_insn (gen_adddi3 (data->ptr, data->ptr,
+				   GEN_INT (GET_MODE_SIZE (GET_MODE (res)))));
+	    res = get_insns ();
+	    end_sequence ();
+	  }
+	else
+	  rep = 1;
+	data->offset += rep * GET_MODE_SIZE (GET_MODE (reg));
+      }
+      break;
+    }
+  return res;
+}
+
 /* When loading an operand ORIG_OP, verify whether an address space
    conversion to generic is required, and if so, perform it.  Also
    check for SYMBOL_REFs for function decls and call
@@ -1647,23 +1867,6 @@  nvptx_print_operand_address (FILE *file,
   nvptx_print_address_operand (file, addr, VOIDmode);
 }
 
-/* Return true if the value of COND is the same across all threads in a
-   warp.  */
-
-static bool
-condition_unidirectional_p (rtx cond)
-{
-  if (CONSTANT_P (cond))
-    return true;
-  if (GET_CODE (cond) == REG)
-    return cfun->machine->warp_equal_pseudos[REGNO (cond)];
-  if (GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMPARE
-      || GET_RTX_CLASS (GET_CODE (cond)) == RTX_COMM_COMPARE)
-    return (condition_unidirectional_p (XEXP (cond, 0))
-	    && condition_unidirectional_p (XEXP (cond, 1)));
-  return false;
-}
-
 /* Print an operand, X, to FILE, with an optional modifier in CODE.
 
    Meaning of CODE:
@@ -1677,8 +1880,7 @@  condition_unidirectional_p (rtx cond)
    t -- print a type opcode suffix, promoting QImode to 32 bits
    T -- print a type size in bits
    u -- print a type opcode suffix without promotions.
-   U -- print ".uni" if a condition consists only of values equal across all
-        threads in a warp.  */
+   U -- print ".uni" if the const_int operand is non-zero.  */
 
 static void
 nvptx_print_operand (FILE *file, rtx x, int code)
@@ -1740,10 +1942,10 @@  nvptx_print_operand (FILE *file, rtx x,
       goto common;
 
     case 'U':
-      if (condition_unidirectional_p (x))
+      if (INTVAL (x))
 	fprintf (file, ".uni");
       break;
-
+      
     case 'c':
       op_mode = GET_MODE (XEXP (x, 0));
       switch (x_code)
@@ -1900,7 +2102,7 @@  get_replacement (struct reg_replace *r)
    conversion copyin/copyout instructions.  */
 
 static void
-nvptx_reorg_subreg (int max_regs)
+nvptx_reorg_subreg ()
 {
   struct reg_replace qiregs, hiregs, siregs, diregs;
   rtx_insn *insn, *next;
@@ -1914,11 +2116,6 @@  nvptx_reorg_subreg (int max_regs)
   siregs.mode = SImode;
   diregs.mode = DImode;
 
-  cfun->machine->warp_equal_pseudos
-    = ggc_cleared_vec_alloc<char> (max_regs);
-
-  auto_vec<unsigned> warp_reg_worklist;
-
   for (insn = get_insns (); insn; insn = next)
     {
       next = NEXT_INSN (insn);
@@ -1934,18 +2131,6 @@  nvptx_reorg_subreg (int max_regs)
       diregs.n_in_use = 0;
       extract_insn (insn);
 
-      if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi
-	  || (GET_CODE (PATTERN (insn)) == SET
-	      && CONSTANT_P (SET_SRC (PATTERN (insn)))))
-	{
-	  rtx dest = recog_data.operand[0];
-	  if (REG_P (dest) && REG_N_SETS (REGNO (dest)) == 1)
-	    {
-	      cfun->machine->warp_equal_pseudos[REGNO (dest)] = true;
-	      warp_reg_worklist.safe_push (REGNO (dest));
-	    }
-	}
-
       enum attr_subregs_ok s_ok = get_attr_subregs_ok (insn);
       for (int i = 0; i < recog_data.n_operands; i++)
 	{
@@ -1999,71 +2184,782 @@  nvptx_reorg_subreg (int max_regs)
 	  validate_change (insn, recog_data.operand_loc[i], new_reg, false);
 	}
     }
+}
+
+/* An unspec of interest and the BB in which it resides.  */
+struct reorg_unspec
+{
+  rtx_insn *insn;
+  basic_block block;
+};
 
-  while (!warp_reg_worklist.is_empty ())
+/* Loop structure of the function.  The entire function is described
+   as a NULL loop.  We should be able to extend this to represent
+   superblocks.  */
+
+#define OACC_null OACC_HWM
+
+struct reorg_loop
+{
+  /* Parent loop.  */
+  reorg_loop *parent;
+  
+  /* Next sibling loop.  */
+  reorg_loop *next;
+
+  /* First child loop.  */
+  reorg_loop *inner;
+
+  /* Partitioning mode of the loop.  */
+  unsigned mode;
+
+  /* Partitioning used within inner loops.  */
+  unsigned inner_mask;
+
+  /* Location of loop head and tail.  The head is the first block in
+     the partitioned loop and the tail is the first block out of the
+     partitioned loop.  */
+  basic_block head_block;
+  basic_block tail_block;
+
+  rtx_insn *head_insn;
+  rtx_insn *tail_insn;
+
+  rtx_insn *pre_head_insn;
+  rtx_insn *pre_tail_insn;
+
+  /* Basic blocks in this loop, but not in child loops.  The HEAD and
+     PRETAIL blocks are in the loop.  The PREHEAD and TAIL blocks
+     are not.  */
+  auto_vec<basic_block> blocks;
+
+public:
+  reorg_loop (reorg_loop *parent, unsigned mode);
+  ~reorg_loop ();
+};
+
+typedef auto_vec<reorg_unspec> unspec_vec_t;
+
+/* Constructor links the new loop into its parent's chain of
+   children.  */
+
+reorg_loop::reorg_loop (reorg_loop *parent_, unsigned mode_)
+  :parent (parent_), next (0), inner (0), mode (mode_), inner_mask (0)
+{
+  head_block = tail_block = 0;
+  head_insn = tail_insn = 0;
+  pre_head_insn = pre_tail_insn = 0;
+  
+  if (parent)
     {
-      int regno = warp_reg_worklist.pop ();
+      next = parent->inner;
+      parent->inner = this;
+    }
+}
+
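+/* The destructor recursively deletes the inner loops and then the
+   rest of the sibling chain.  */
+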
+reorg_loop::~reorg_loop ()
+{
+  delete inner;
+  delete next;
+}
+
+/* Map of basic blocks to unspecs.  */
+typedef hash_map<basic_block, rtx_insn *> unspec_map_t;
+
+/* Split basic blocks so that each loop head & tail unspec is at the
+   start of its basic block.  Thus afterwards each block will have a
+   single partitioning mode.  We do the same for return insns, as
+   they are executed by every thread.  Return the partitioning
+   execution mode of the function as a whole.  Populate MAP with the
+   head and tail blocks.  We also clear the BB visited flag, which is
+   used when finding loops.  */
+
+static unsigned
+nvptx_split_blocks (unspec_map_t *map)
+{
+  auto_vec<reorg_unspec> worklist;
+  basic_block block;
+  rtx_insn *insn;
+  unsigned levels = ~0U; // Assume the worst WRT required neutering
+
+  /* Locate all the reorg instructions of interest.  */
+  FOR_ALL_BB_FN (block, cfun)
+    {
+      bool seen_insn = false;
+
+      /* Clear visited flag, for use by loop locator.  */
+      block->flags &= ~BB_VISITED;
       
-      df_ref use = DF_REG_USE_CHAIN (regno);
-      for (; use; use = DF_REF_NEXT_REG (use))
+      FOR_BB_INSNS (block, insn)
 	{
-	  rtx_insn *insn;
-	  if (!DF_REF_INSN_INFO (use))
-	    continue;
-	  insn = DF_REF_INSN (use);
-	  if (DEBUG_INSN_P (insn))
-	    continue;
-
-	  /* The only insns we have to exclude are those which refer to
-	     memory.  */
-	  rtx pat = PATTERN (insn);
-	  if (GET_CODE (pat) == SET
-	      && (MEM_P (SET_SRC (pat)) || MEM_P (SET_DEST (pat))))
+	  if (!INSN_P (insn))
 	    continue;
+	  switch (recog_memoized (insn))
+	    {
+	    default:
+	      seen_insn = true;
+	      continue;
+	    case CODE_FOR_oacc_levels:
+	      /* We just need to detect this and note its argument.  */
+	      {
+		unsigned l = UINTVAL (XVECEXP (PATTERN (insn), 0, 0));
+		/* If we see this multiple times, they should all
+		   agree.  */
+		gcc_assert (levels == ~0U || l == levels);
+		levels = l;
+	      }
+	      continue;
+
+	    case CODE_FOR_nvptx_loop:
+	      {
+		rtx kind = XVECEXP (PATTERN (insn), 0, 0);
+		if (!LOOP_MODE_CHANGE_P (UINTVAL (kind)))
+		  {
+		    seen_insn = true;
+		    continue;
+		  }
+	      }
+	      break;
+	      
+	    case CODE_FOR_return:
+	      /* We also need to split just before return insns, as
+		 that insn needs executing by all threads, but the
+		 block it is in probably does not.  */
+	      break;
+	    }
 
-	  df_ref insn_use;
-	  bool all_equal = true;
-	  FOR_EACH_INSN_USE (insn_use, insn)
+	  if (seen_insn)
 	    {
-	      unsigned insn_regno = DF_REF_REGNO (insn_use);
-	      if (!cfun->machine->warp_equal_pseudos[insn_regno])
-		{
-		  all_equal = false;
-		  break;
-		}
+	      /* We've found an instruction that must be at the start of
+		 a block, but isn't.  Add it to the worklist.  */
+	      reorg_unspec uns;
+	      uns.insn = insn;
+	      uns.block = block;
+	      worklist.safe_push (uns);
 	    }
-	  if (!all_equal)
-	    continue;
-	  df_ref insn_def;
-	  FOR_EACH_INSN_DEF (insn_def, insn)
+	  else
+	    /* It was already the first instruction.  Just add it to
+	       the map.  */
+	    map->get_or_insert (block) = insn;
+	  seen_insn = true;
+	}
+    }
+
+  /* Split blocks on the worklist.  */
+  unsigned ix;
+  reorg_unspec *elt;
+  basic_block remap = 0;
+  for (ix = 0; worklist.iterate (ix, &elt); ix++)
+    {
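+      /* Consecutive unspecs from the same original block must be
+	 split out of each newly created tail block in turn, hence
+	 only reset BLOCK when we meet a new original block.  */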
+      if (remap != elt->block)
+	{
+	  block = elt->block;
+	  remap = block;
+	}
+      
+      /* Split block before INSN.  The insn is in the new block.  */
+      edge e = split_block (block, PREV_INSN (elt->insn));
+
+      block = e->dest;
+      map->get_or_insert (block) = elt->insn;
+    }
+
+  return levels;
+}
+
+/* BLOCK is a basic block containing a head or tail instruction.
+   Locate the associated prehead or pretail instruction, which must be
+   in the single predecessor block.  */
+
+static rtx_insn *
+nvptx_discover_pre (basic_block block, unsigned expected)
+{
+  gcc_assert (block->preds->length () == 1);
+  basic_block pre_block = (*block->preds)[0]->src;
+  rtx_insn *pre_insn;
+
+  for (pre_insn = BB_END (pre_block); !INSN_P (pre_insn);
+       pre_insn = PREV_INSN (pre_insn))
+    gcc_assert (pre_insn != BB_HEAD (pre_block));
+
+  gcc_assert (recog_memoized (pre_insn) == CODE_FOR_nvptx_loop
+	      && (UINTVAL (XVECEXP (PATTERN (pre_insn), 0, 0))
+		  == expected));
+  return pre_insn;
+}
+
+typedef std::pair<basic_block, reorg_loop *> loop_t;
+typedef auto_vec<loop_t> work_loop_t;
+
+/* Dump this loop and all its inner loops.  */
+
+static void
+nvptx_dump_loops (reorg_loop *loop, unsigned depth)
+{
+  fprintf (dump_file, "%u: mode %d head=%d, tail=%d\n",
+	   depth, loop->mode,
+	   loop->head_block ? loop->head_block->index : -1,
+	   loop->tail_block ? loop->tail_block->index : -1);
+
+  fprintf (dump_file, "    blocks:");
+
+  basic_block block;
+  for (unsigned ix = 0; loop->blocks.iterate (ix, &block); ix++)
+    fprintf (dump_file, " %d", block->index);
+  fprintf (dump_file, "\n");
+  if (loop->inner)
+    nvptx_dump_loops (loop->inner, depth + 1);
+
+  if (loop->next)
+    nvptx_dump_loops (loop->next, depth);
+}
+
+/* Walk the CFG looking for loop head & tail markers.  Construct a
+   loop structure for the function.  MAP is a mapping of basic blocks
+   to head & tail markers, discovered when splitting blocks.  This
+   speeds up the discovery.  We rely on the BB visited flag having
+   been cleared when splitting blocks.  */
+
+static reorg_loop *
+nvptx_discover_loops (unspec_map_t *map)
+{
+  reorg_loop *outer_loop = new reorg_loop (0, OACC_null);
+  work_loop_t worklist;
+  basic_block block;
+
+  // Mark entry and exit blocks as visited.
+  block = EXIT_BLOCK_PTR_FOR_FN (cfun);
+  block->flags |= BB_VISITED;
+  block = ENTRY_BLOCK_PTR_FOR_FN (cfun);
+  worklist.safe_push (loop_t (block, outer_loop));
+
+  while (worklist.length ())
+    {
+      loop_t loop = worklist.pop ();
+      reorg_loop *l = loop.second;
+
+      block = loop.first;
+
+      // Have we met this block?
+      if (block->flags & BB_VISITED)
+	continue;
+      block->flags |= BB_VISITED;
+      
+      rtx_insn **endp = map->get (block);
+      if (endp)
+	{
+	  rtx_insn *end = *endp;
+	  
+	  /* This is a block head or tail, or return instruction.  */
+	  switch (recog_memoized (end))
 	    {
-	      unsigned dregno = DF_REF_REGNO (insn_def);
-	      if (cfun->machine->warp_equal_pseudos[dregno])
-		continue;
-	      cfun->machine->warp_equal_pseudos[dregno] = true;
-	      warp_reg_worklist.safe_push (dregno);
+	    case CODE_FOR_return:
+	      /* Return instructions are in their own block, and we
+		 don't need to do anything more.  */
+	      continue;
+
+	    case CODE_FOR_nvptx_loop:
+	      {
+		unsigned kind = UINTVAL (XVECEXP (PATTERN (end), 0, 0));
+		unsigned mode = UINTVAL (XVECEXP (PATTERN (end), 0, 1));
+		
+		switch (kind)
+		  {
+		  case nvptx_loop_head:
+		    /* Loop head, create a new inner loop and add it
+		       into our parent's child list.  */
+		    l = new reorg_loop (l, mode);
+		    l->head_block = block;
+		    l->head_insn = end;
+		    if (mode == OACC_worker)
+		      l->pre_head_insn
+			= nvptx_discover_pre (block, nvptx_loop_prehead);
+		    break;
+
+		  case nvptx_loop_tail:
+		    /* A loop tail.  Finish the current loop and
+		       return to parent.  */
+		    gcc_assert (l->mode == mode);
+		    l->tail_block = block;
+		    l->tail_insn = end;
+		    if (mode == OACC_worker)
+		      l->pre_tail_insn
+			= nvptx_discover_pre (block, nvptx_loop_pretail);
+		    l = l->parent;
+		    break;
+		    
+		  default:
+		    gcc_unreachable ();
+		  }
+	      }
+	      break;
+
+	    default: gcc_unreachable ();
 	    }
 	}
+      /* Add this block onto the current loop's list of blocks.  */
+      l->blocks.safe_push (block);
+
+      /* Push each destination block onto the work list.  */
+      edge e;
+      edge_iterator ei;
+
+      loop.second = l;
+      FOR_EACH_EDGE (e, ei, block->succs)
+	{
+	  loop.first = e->dest;
+	  
+	  worklist.safe_push (loop);
+	}
     }
 
   if (dump_file)
-    for (int i = 0; i < max_regs; i++)
-      if (cfun->machine->warp_equal_pseudos[i])
-	fprintf (dump_file, "Found warp invariant pseudo %d\n", i);
+    {
+      fprintf (dump_file, "\nLoops\n");
+      nvptx_dump_loops (outer_loop, 0);
+      fprintf (dump_file, "\n");
+    }
+  
+  return outer_loop;
+}
+
+/* Propagate live state at the start of a partitioned region.  BLOCK
+   provides the live register information, and might not contain
+   INSN.  Propagation is inserted just after INSN.  RW indicates
+   whether we are reading and/or writing state.  This separation is
+   needed for worker-level propagation, where we essentially do a
+   spill & fill.  FN is the underlying worker function to generate
+   the propagation instructions for a single register.  DATA is user
+   data.
+
+   We propagate the live register set and the entire frame.  We could
+   do better by (a) propagating just the live set that is used within
+   the partitioned regions and (b) only propagating stack entries that
+   are used.  The latter might be quite hard to determine.  */
+
+static void
+nvptx_propagate (basic_block block, rtx_insn *insn, propagate_mask rw,
+		 rtx (*fn) (rtx, propagate_mask,
+			    unsigned, void *), void *data)
+{
+  bitmap live = DF_LIVE_IN (block);
+  bitmap_iterator iterator;
+  unsigned ix;
+
+  /* Copy the frame array.  */
+  HOST_WIDE_INT fs = get_frame_size ();
+  if (fs)
+    {
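+      /* The emitted copy loop is, in outline (a sketch; FN supplies
+	 the per-word propagation):
+
+	     ptr = frame; idx = fs;	// fs counts DImode words
+	   label:
+	     idx -= 1;
+	     tmp = *ptr;		// if reading
+	     fn (tmp);
+	     *ptr = tmp;		// if writing
+	     ptr += 8;
+	     if (idx != 0) goto label;	// hidden branch  */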
+      rtx tmp = gen_reg_rtx (DImode);
+      rtx idx = NULL_RTX;
+      rtx ptr = gen_reg_rtx (Pmode);
+      rtx pred = NULL_RTX;
+      rtx_code_label *label = NULL;
+
+      gcc_assert (!(fs & (GET_MODE_SIZE (DImode) - 1)));
+      fs /= GET_MODE_SIZE (DImode);
+      /* Detect single iteration loop. */
+      if (fs == 1)
+	fs = 0;
+
+      start_sequence ();
+      emit_insn (gen_rtx_SET (Pmode, ptr, frame_pointer_rtx));
+      if (fs)
+	{
+	  idx = gen_reg_rtx (SImode);
+	  pred = gen_reg_rtx (BImode);
+	  label = gen_label_rtx ();
+	  
+	  emit_insn (gen_rtx_SET (SImode, idx, GEN_INT (fs)));
+	  /* Allow worker function to initialize anything needed.  */
+	  rtx init = fn (tmp, PM_loop_begin, fs, data);
+	  if (init)
+	    emit_insn (init);
+	  emit_label (label);
+	  LABEL_NUSES (label)++;
+	  emit_insn (gen_addsi3 (idx, idx, GEN_INT (-1)));
+	}
+      if (rw & PM_read)
+	emit_insn (gen_rtx_SET (DImode, tmp, gen_rtx_MEM (DImode, ptr)));
+      emit_insn (fn (tmp, rw, fs, data));
+      if (rw & PM_write)
+	emit_insn (gen_rtx_SET (DImode, gen_rtx_MEM (DImode, ptr), tmp));
+      if (fs)
+	{
+	  emit_insn (gen_rtx_SET (SImode, pred,
+				  gen_rtx_NE (BImode, idx, const0_rtx)));
+	  emit_insn (gen_adddi3 (ptr, ptr, GEN_INT (GET_MODE_SIZE (DImode))));
+	  emit_insn (gen_br_true_hidden (pred, label, GEN_INT (1)));
+	  rtx fini = fn (tmp, PM_loop_end, fs, data);
+	  if (fini)
+	    emit_insn (fini);
+	  emit_insn (gen_rtx_CLOBBER (GET_MODE (idx), idx));
+	}
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (tmp), tmp));
+      emit_insn (gen_rtx_CLOBBER (GET_MODE (ptr), ptr));
+      rtx cpy = get_insns ();
+      end_sequence ();
+      insn = emit_insn_after (cpy, insn);
+    }
+
+  /* Copy live registers.  */
+  EXECUTE_IF_SET_IN_BITMAP (live, 0, ix, iterator)
+    {
+      rtx reg = regno_reg_rtx[ix];
+
+      if (REGNO (reg) >= FIRST_PSEUDO_REGISTER)
+	{
+	  rtx bcast = fn (reg, rw, 0, data);
+
+	  insn = emit_insn_after (bcast, insn);
+	}
+    }
+}
+
+/* Worker for nvptx_vpropagate.  */
+
+static rtx
+vprop_gen (rtx reg, propagate_mask pm,
+	   unsigned ARG_UNUSED (count), void *ARG_UNUSED (data))
+{
+  if (!(pm & PM_read_write))
+    return 0;
+  
+  return nvptx_gen_vcast (reg);
 }
 
-/* PTX-specific reorganization
-   1) mark now-unused registers, so function begin doesn't declare
-   unused registers.
-   2) replace subregs with suitable sequences.
-*/
+/* Propagate state that is live at start of BLOCK across the vectors
+   of a single warp.  Propagation is inserted just after INSN.  */
 
 static void
-nvptx_reorg (void)
+nvptx_vpropagate (basic_block block, rtx_insn *insn)
 {
-  struct reg_replace qiregs, hiregs, siregs, diregs;
-  rtx_insn *insn, *next;
+  nvptx_propagate (block, insn, PM_read_write, vprop_gen, 0);
+}
+
+/* Worker for nvptx_wpropagate.  */
+
+static rtx
+wprop_gen (rtx reg, propagate_mask pm, unsigned rep, void *data_)
+{
+  wcast_data_t *data = (wcast_data_t *)data_;
+
+  if (pm & PM_loop_begin)
+    {
+      /* Starting a loop, initialize pointer.    */
+      unsigned align = GET_MODE_ALIGNMENT (GET_MODE (reg)) / BITS_PER_UNIT;
+
+      if (align > worker_bcast_align)
+	worker_bcast_align = align;
+      data->offset = (data->offset + align - 1) & ~(align - 1);
+
+      data->ptr = gen_reg_rtx (Pmode);
+
+      return gen_adddi3 (data->ptr, data->base, GEN_INT (data->offset));
+    }
+  else if (pm & PM_loop_end)
+    {
+      rtx clobber = gen_rtx_CLOBBER (GET_MODE (data->ptr), data->ptr);
+      data->ptr = NULL_RTX;
+      return clobber;
+    }
+  else
+    return nvptx_gen_wcast (reg, pm, rep, data);
+}
+
+/* Spill or fill live state at the start of BLOCK.  PRE_P indicates
+   whether this is just before partitioned mode (do a spill), or just
+   after it starts (do a fill).  The sequence is inserted just after
+   INSN.  */
+
+static void
+nvptx_wpropagate (bool pre_p, basic_block block, rtx_insn *insn)
+{
+  wcast_data_t data;
+
+  data.base = gen_reg_rtx (Pmode);
+  data.offset = 0;
+  data.ptr = NULL_RTX;
+
+  nvptx_propagate (block, insn, pre_p ? PM_read : PM_write, wprop_gen, &data);
+  if (data.offset)
+    {
+      /* Stuff was emitted, initialize the base pointer now.  */
+      rtx init = gen_rtx_SET (Pmode, data.base, worker_bcast_sym);
+      emit_insn_after (init, insn);
+      
+      if (worker_bcast_hwm < data.offset)
+	worker_bcast_hwm = data.offset;
+    }
+}
+
+/* Emit a worker-level synchronization barrier.  */
+
+static void
+nvptx_wsync (bool tail_p, rtx_insn *insn)
+{
+  emit_insn_after (gen_nvptx_barsync (GEN_INT (tail_p)), insn);
+}
+
+/* Single neutering according to MASK.  FROM is the incoming block and
+   TO is the outgoing block.  These may be the same block. Insert at
+   start of FROM:
+   
+     if (tid.<axis>) hidden_goto end.
+
+   and insert before ending branch of TO (if there is such an insn):
+
+     end:
+     <possibly-broadcast-cond>
+     <branch>
+
+   We currently only use different FROM and TO when skipping an entire
+   loop.  We could do more if we detected superblocks.  */
+
+static void
+nvptx_single (unsigned mask, basic_block from, basic_block to)
+{
+  rtx_insn *head = BB_HEAD (from);
+  rtx_insn *tail = BB_END (to);
+  unsigned skip_mask = mask;
+
+  /* Find first insn of FROM block.  */
+  while (head != BB_END (from) && !INSN_P (head))
+    head = NEXT_INSN (head);
+
+  /* Find last insn of TO block.  */
+  rtx_insn *limit = from == to ? head : BB_HEAD (to);
+  while (tail != limit && !INSN_P (tail) && !LABEL_P (tail))
+    tail = PREV_INSN (tail);
+
+  /* Detect if tail is a branch.  */
+  rtx tail_branch = NULL_RTX;
+  rtx cond_branch = NULL_RTX;
+  if (tail && INSN_P (tail))
+    {
+      tail_branch = PATTERN (tail);
+      if (GET_CODE (tail_branch) != SET || SET_DEST (tail_branch) != pc_rtx)
+	tail_branch = NULL_RTX;
+      else
+	{
+	  cond_branch = SET_SRC (tail_branch);
+	  if (GET_CODE (cond_branch) != IF_THEN_ELSE)
+	    cond_branch = NULL_RTX;
+	}
+    }
+
+  if (tail == head)
+    {
+      /* If this is empty, do nothing.  */
+      if (!head || !INSN_P (head))
+	return;
+
+      /* If this is a dummy insn, do nothing.  */
+      switch (recog_memoized (head))
+	{
+	default: break;
+	case CODE_FOR_nvptx_loop:
+	case CODE_FOR_oacc_levels:
+	  return;
+	}
 
+      if (cond_branch)
+	{
+	  /* If we're only doing vector single, there's no need to
+	     emit skip code because we'll not insert anything.  */
+	  if (!(mask & OACC_LOOP_MASK (OACC_vector)))
+	    skip_mask = 0;
+	}
+      else if (tail_branch)
+	/* Block with only unconditional branch.  Nothing to do.  */
+	return;
+    }
+
+  /* Insert the vector test inside the worker test.  */
+  unsigned mode;
+  rtx_insn *before = tail;
+  for (mode = OACC_worker; mode <= OACC_vector; mode++)
+    if (OACC_LOOP_MASK (mode) & skip_mask)
+      {
+	rtx id = gen_reg_rtx (SImode);
+	rtx pred = gen_reg_rtx (BImode);
+	rtx_code_label *label = gen_label_rtx ();
+
+	emit_insn_before (gen_oacc_id (id, GEN_INT (mode)), head);
+	rtx cond = gen_rtx_SET (BImode, pred,
+				gen_rtx_NE (BImode, id, const0_rtx));
+	emit_insn_before (cond, head);
+	emit_insn_before (gen_br_true_hidden (pred, label,
+					      GEN_INT (mode != OACC_vector)),
+			  head);
+
+	LABEL_NUSES (label)++;
+	if (tail_branch)
+	  before = emit_label_before (label, before);
+	else
+	  emit_label_after (label, tail);
+      }
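+
+  /* For example, with a worker & vector SKIP_MASK the region now
+     starts, schematically (the axis assignment is illustrative):
+
+	 if (tid.<worker-axis>) goto Lw;  // hidden branch
+	 if (tid.<vector-axis>) goto Lv;  // hidden branch
+	 ...single-threaded code...
+      Lv: Lw:			// just before the tail branch  */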
+
+  /* Now deal with propagating the branch condition.  */
+  if (cond_branch)
+    {
+      rtx pvar = XEXP (XEXP (cond_branch, 0), 0);
+
+      if (OACC_LOOP_MASK (OACC_vector) == mask)
+	{
+	  /* Vector mode only, do a shuffle.  */
+	  emit_insn_before (nvptx_gen_vcast (pvar), tail);
+	}
+      else
+	{
+	  /* Includes worker mode, do spill & fill.  By construction
+	     we should never have worker mode only.  */
+	  wcast_data_t data;
+
+	  data.base = worker_bcast_sym;
+	  data.ptr = 0;
+
+	  if (worker_bcast_hwm < GET_MODE_SIZE (SImode))
+	    worker_bcast_hwm = GET_MODE_SIZE (SImode);
+
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_read, 0, &data),
+			    before);
+	  emit_insn_before (gen_nvptx_barsync (GEN_INT (2)), tail);
+	  data.offset = 0;
+	  emit_insn_before (nvptx_gen_wcast (pvar, PM_write, 0, &data),
+			    tail);
+	}
+
+      extract_insn (tail);
+      rtx unsp = gen_rtx_UNSPEC (BImode, gen_rtvec (1, pvar),
+				 UNSPEC_BR_UNIFIED);
+      validate_change (tail, recog_data.operand_loc[0], unsp, false);
+    }
+}
+
+/* LOOP is a loop that is being skipped in its entirety according to
+   MASK.  Treat this as skipping a superblock starting at loop head
+   and ending at loop pre-tail.  */
+
+static void
+nvptx_skip_loop (unsigned mask, reorg_loop *loop)
+{
+  basic_block tail = loop->tail_block;
+  gcc_assert (tail->preds->length () == 1);
+
+  basic_block pre_tail = (*tail->preds)[0]->src;
+  gcc_assert (pre_tail->succs->length () == 1);
+
+  nvptx_single (mask, loop->head_block, pre_tail);
+}
+
+/* Process the loop LOOP and all its contained loops.  We do
+   everything but the neutering.  Return mask of partition modes used
+   within this loop.  */
+
+static unsigned
+nvptx_process_loops (reorg_loop *loop)
+{
+  unsigned inner_mask = OACC_LOOP_MASK (loop->mode);
+  
+  /* Do the inner loops first.  */
+  if (loop->inner)
+    {
+      loop->inner_mask = nvptx_process_loops (loop->inner);
+      inner_mask |= loop->inner_mask;
+    }
+  
+  switch (loop->mode)
+    {
+    case OACC_null:
+      /* Dummy loop.  */
+      break;
+
+    case OACC_vector:
+      nvptx_vpropagate (loop->head_block, loop->head_insn);
+      break;
+      
+    case OACC_worker:
+      {
+	nvptx_wpropagate (false, loop->head_block, loop->head_insn);
+	nvptx_wpropagate (true, loop->head_block, loop->pre_head_insn);
+	/* Insert begin and end synchronizations.  */
+	nvptx_wsync (false, loop->head_insn);
+	nvptx_wsync (true, loop->pre_tail_insn);
+      }
+      break;
+
+    case OACC_gang:
+      break;
+
+    default: gcc_unreachable ();
+    }
+
+  /* Now do siblings.  */
+  if (loop->next)
+    inner_mask |= nvptx_process_loops (loop->next);
+  return inner_mask;
+}
+
+/* Neuter the loop described by LOOP.  We recurse in depth-first
+   order.  LEVELS is the partitioning of the execution and OUTER is
+   the partitioning of the loops we are contained in.  */
+
+static void
+nvptx_neuter_loops (reorg_loop *loop, unsigned levels, unsigned outer)
+{
+  unsigned me = (OACC_LOOP_MASK (loop->mode)
+		 & (OACC_LOOP_MASK (OACC_worker)
+		    | OACC_LOOP_MASK (OACC_vector)));
+  unsigned skip_mask = 0, neuter_mask = 0;
+  
+  if (loop->inner)
+    nvptx_neuter_loops (loop->inner, levels, outer | me);
+
+  for (unsigned mode = OACC_worker; mode <= OACC_vector; mode++)
+    {
+      if ((outer | me) & OACC_LOOP_MASK (mode))
+	{ /* Mode is partitioned: no neutering.  */ }
+      else if (!(levels & OACC_LOOP_MASK (mode)))
+	{ /* Mode is not used: nothing to do.  */ }
+      else if (loop->inner_mask & OACC_LOOP_MASK (mode)
+	       || !loop->head_insn)
+	/* Partitioning inside this loop, or we're not a loop: neuter
+	   individual blocks.  */
+	neuter_mask |= OACC_LOOP_MASK (mode);
+      else if (!loop->parent || !loop->parent->head_insn
+	       || loop->parent->inner_mask & OACC_LOOP_MASK (mode))
+	/* Parent isn't a loop or contains this partitioning: skip
+	   loop at this level.  */
+	skip_mask |= OACC_LOOP_MASK (mode);
+      else
+	{ /* Parent will skip this loop itself.  */ }
+    }
+
+  if (neuter_mask)
+    {
+      basic_block block;
+
+      for (unsigned ix = 0; loop->blocks.iterate (ix, &block); ix++)
+	nvptx_single (neuter_mask, block, block);
+    }
+
+  if (skip_mask)
+    nvptx_skip_loop (skip_mask, loop);
+  
+  if (loop->next)
+    nvptx_neuter_loops (loop->next, levels, outer);
+}
+
+/* NVPTX machine dependent reorg.
+   Insert vector and worker single neutering code and state
+   propagation when entering partitioned mode.  Fixup subregs.  */
+
+static void
+nvptx_reorg (void)
+{
   /* We are freeing block_for_insn in the toplev to keep compatibility
      with old MDEP_REORGS that are not CFG based.  Recompute it now.  */
   compute_bb_for_insn ();
@@ -2072,19 +2968,34 @@  nvptx_reorg (void)
 
   df_clear_flags (DF_LR_RUN_DCE);
   df_set_flags (DF_NO_INSN_RESCAN | DF_NO_HARD_REGS);
+  df_live_add_problem ();
+  
+  /* Split blocks and record interesting unspecs.  */
+  unspec_map_t unspec_map;
+  unsigned levels = nvptx_split_blocks (&unspec_map);
+
+  /* Compute live regs.  */
   df_analyze ();
   regstat_init_n_sets_and_refs ();
 
-  int max_regs = max_reg_num ();
-
+  if (dump_file)
+    df_dump (dump_file);
+  
   /* Mark unused regs as unused.  */
+  int max_regs = max_reg_num ();
   for (int i = LAST_VIRTUAL_REGISTER + 1; i < max_regs; i++)
     if (REG_N_SETS (i) == 0 && REG_N_REFS (i) == 0)
       regno_reg_rtx[i] = const0_rtx;
 
-  /* Replace subregs.  */
-  nvptx_reorg_subreg (max_regs);
+  reorg_loop *loops = nvptx_discover_loops (&unspec_map);
+
+  nvptx_process_loops (loops);
+  nvptx_neuter_loops (loops, levels, 0);
 
+  delete loops;
+
+  nvptx_reorg_subreg ();
+  
   regstat_free_n_sets_and_refs ();
 
   df_finish_pass (true);
@@ -2133,19 +3044,21 @@  nvptx_vector_alignment (const_tree type)
   return MIN (align, BIGGEST_ALIGNMENT);
 }
 
-/* Indicate that INSN cannot be duplicated.  This is true for insns
-   that generate a unique id.  To be on the safe side, we also
-   exclude instructions that have to be executed simultaneously by
-   all threads in a warp.  */
+/* Indicate that INSN cannot be duplicated.  */
 
 static bool
 nvptx_cannot_copy_insn_p (rtx_insn *insn)
 {
-  if (recog_memoized (insn) == CODE_FOR_oacc_thread_broadcastsi)
-    return true;
-  if (recog_memoized (insn) == CODE_FOR_threadbarrier_insn)
-    return true;
-  return false;
+  switch (recog_memoized (insn))
+    {
+    case CODE_FOR_nvptx_broadcastsi:
+    case CODE_FOR_nvptx_broadcastsf:
+    case CODE_FOR_nvptx_barsync:
+    case CODE_FOR_nvptx_loop:
+      return true;
+    default:
+      return false;
+    }
 }
 
 /* Record a symbol for mkoffload to enter into the mapping table.  */
@@ -2185,6 +3098,21 @@  nvptx_file_end (void)
   FOR_EACH_HASH_TABLE_ELEMENT (*needed_fndecls_htab, decl, tree, iter)
     nvptx_record_fndecl (decl, true);
   fputs (func_decls.str().c_str(), asm_out_file);
+
+  if (worker_bcast_hwm)
+    {
+      /* Define the broadcast buffer.  */
+
+      if (worker_bcast_align < GET_MODE_SIZE (SImode))
+	worker_bcast_align = GET_MODE_SIZE (SImode);
+      worker_bcast_hwm = (worker_bcast_hwm + worker_bcast_align - 1)
+	& ~(worker_bcast_align - 1);
+      
+      fprintf (asm_out_file, "// BEGIN VAR DEF: %s\n", worker_bcast_name);
+      fprintf (asm_out_file, ".shared.align %d .u8 %s[%d];\n",
+	       worker_bcast_align,
+	       worker_bcast_name, worker_bcast_hwm);
+    }
 }
 
 #undef TARGET_OPTION_OVERRIDE
Index: config/nvptx/nvptx.h
===================================================================
--- config/nvptx/nvptx.h	(revision 225323)
+++ config/nvptx/nvptx.h	(working copy)
@@ -235,7 +235,6 @@  struct nvptx_pseudo_info
 struct GTY(()) machine_function
 {
   rtx_expr_list *call_args;
-  char *warp_equal_pseudos;
   rtx start_call;
   tree funtype;
   bool has_call_with_varargs;
Index: internal-fn.c
===================================================================
--- internal-fn.c	(revision 225323)
+++ internal-fn.c	(working copy)
@@ -98,6 +98,19 @@  init_internal_fns ()
   internal_fn_fnspec_array[IFN_LAST] = 0;
 }
 
+/* Return true if this internal fn call is a unique marker -- it
+   should not be duplicated or merged.  */
+
+bool
+gimple_call_internal_unique_p (const_gimple gs)
+{
+  switch (gimple_call_internal_fn (gs))
+    {
+    default: return false;
+    case IFN_GOACC_LOOP: return true;
+    }
+}
+
 /* ARRAY_TYPE is an array of vector modes.  Return the associated insn
    for load-lanes-style optab OPTAB.  The insn must exist.  */
 
@@ -1990,6 +2003,28 @@  expand_GOACC_DATA_END_WITH_ARG (gcall *s
   gcc_unreachable ();
 }
 
+
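+/* Expand the GOACC_LEVELS marker, emitting the target's oacc_levels
+   insn, if it has one.  */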
+static void
+expand_GOACC_LEVELS (gcall *stmt)
+{
+  rtx mask = expand_normal (gimple_call_arg (stmt, 0));
+  
+#ifdef HAVE_oacc_levels
+  emit_insn (gen_oacc_levels (mask));
+#endif
+}
+
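+/* Expand a GOACC_LOOP head or tail marker, emitting the target's
+   oacc_loop insn, if it has one.  */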
+static void
+expand_GOACC_LOOP (gcall *stmt)
+{
+  rtx kind = expand_normal (gimple_call_arg (stmt, 0));
+  rtx level = expand_normal (gimple_call_arg (stmt, 1));
+  
+#ifdef HAVE_oacc_loop
+  emit_insn (gen_oacc_loop (kind, level));
+#endif
+}
+
 /* Routines to expand each internal function, indexed by function number.
    Each routine has the prototype: