diff mbox

Make try_unroll_loop_completely to use loop bounds recorded

Message ID 20121016173226.GA1373@kam.mff.cuni.cz
State New
Headers show

Commit Message

Jan Hubicka Oct. 16, 2012, 5:32 p.m. UTC
Hi,
here is third revised version of the complette unroling changes.  While working
on the RTL variant I noticed PR54937 and the fact that I was overly aggressive
on forcing single exit of the last iteration to be taken, because loop may terminate
otherwise (by EH or by exitting the program).  Same thinko is in loop-niter.

This patch adds loop_edge_to_cancel that is more conservative: it looks for the
exit conditional where the non-exitting edges leads to latch and verifies that
latch contains no statement with side effect that may terminate the loop.
This still actually matches quite few non-single-exit loops and works well in
practice.

Unlike previous revision it also enables complette unrolling when code size
does not grow even for non-innermost loops (with update in
tree_unroll_loops_completely to walk them). This is something we did on RTL
land but missed in trees.  This actually enables quite some optimizations when
things can be propagated to the tiny inner loop body.

I also fixed accounting in tree_estimate_loop_size for the cases where last
iteration is not going to be updated.

Finally I added code constructing __bulitin_unreachable as suggested by
Ian.

Bootstrapped/regtested x86_64-linux, also bootstrapped with -O3 and -Werror
disabled and benchmarked. Among best benefits is about 7% improvement on Applu,
and it causes up to 15% improvements on vectorized loops with small iteration
counts (by completelly peeling the precondition code).  There are no real
performance regressions but there is some code size bloat.  

I plan to followup with strenghtening the heuristic to disable unrolling when
benefits are absymal.  Easy is to limit unrolling on loops with CFG and/or
calls in them.  We already have quite informed analysis in place.  I also plan
to move simple FDO guided loop peeling from RTL level to trees to enable more
propagation into peeled sequences.

The patch also triggers bug in niter and requires xfailing do_1.f90 testcase.
I filled PR 54932 to track this.

There are also confused array bound warnings I hope to track incrementally, too,
by recording statements that are known to become unreachable in the last
iteration and adding __buitin_unreachable in front of them. This is also
important to avoid duplication leading to dead code: no other optimizers
force paths leading to out of bound accesses to not happen.

Honza


	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add edge_to_cancel
	parameter and use it to estimate code optimized out in the final iteration.
	(loop_edge_to_cancel): New function.
	(try_unroll_loop_completely): New IRRED_IVALIDATED parameter;
	handle unrolling loops with bounds given via max_loop_iteratins;
	handle unrolling non-inner loops when code size shrinks;
	tidy dump output; when the last iteration loop still stays
	as loop in the CFG forcongly redirect the latch to
	__builtin_unreachable.
	(canonicalize_loop_induction_variables): Add irred_invlaidated
	parameter; record niter bound derrived; dump
	max_loop_iterations bounds; call try_unroll_loop_completely
	even if no niter bound is given.
	(canonicalize_induction_variables): Handle irred_invalidated.
	(tree_unroll_loops_completely): Handle non-innermost loops;
	handle irred_invalidated.
	* cfgloop.h (unlop): Declare.
	* cfgloopmanip.c (unloop): Export.
	* tree.c (build_common_builtin_nodes): Build BULTIN_UNREACHABLE.

	* gcc.target/i386/l_fma_float_?.c: Update.
	* gcc.target/i386/l_fma_double_?.c: Update.
	* gfortran.dg/do_1.f90: XFAIL
	* gcc.dg/tree-ssa/cunroll-1.c: New testcase.
	* gcc.dg/tree-ssa/cunroll-2.c: New testcase.
	* gcc.dg/tree-ssa/cunroll-3.c: New testcase.
	* gcc.dg/tree-ssa/cunroll-4.c: New testcase.
	* gcc.dg/tree-ssa/cunroll-5.c: New testcase.
	* gcc.dg/tree-ssa/ldist-17.c: Block cunroll to make testcase still
	valid.

Comments

Richard Biener Oct. 17, 2012, 10:21 a.m. UTC | #1
On Tue, 16 Oct 2012, Jan Hubicka wrote:

> Hi,
> here is third revised version of the complette unroling changes.  While working
> on the RTL variant I noticed PR54937 and the fact that I was overly aggressive
> on forcing single exit of the last iteration to be taken, because loop may terminate
> otherwise (by EH or by exitting the program).  Same thinko is in loop-niter.
> 
> This patch adds loop_edge_to_cancel that is more conservative: it looks for the
> exit conditional where the non-exitting edges leads to latch and verifies that
> latch contains no statement with side effect that may terminate the loop.
> This still actually matches quite few non-single-exit loops and works well in
> practice.
> 
> Unlike previous revision it also enables complette unrolling when code size
> does not grow even for non-innermost loops (with update in
> tree_unroll_loops_completely to walk them). This is something we did on RTL
> land but missed in trees.  This actually enables quite some optimizations when
> things can be propagated to the tiny inner loop body.
> 
> I also fixed accounting in tree_estimate_loop_size for the cases where last
> iteration is not going to be updated.
> 
> Finally I added code constructing __bulitin_unreachable as suggested by
> Ian.
> 
> Bootstrapped/regtested x86_64-linux, also bootstrapped with -O3 and -Werror
> disabled and benchmarked. Among best benefits is about 7% improvement on Applu,
> and it causes up to 15% improvements on vectorized loops with small iteration
> counts (by completelly peeling the precondition code).  There are no real
> performance regressions but there is some code size bloat.  
> 
> I plan to followup with strenghtening the heuristic to disable unrolling when
> benefits are absymal.  Easy is to limit unrolling on loops with CFG and/or
> calls in them.  We already have quite informed analysis in place.  I also plan
> to move simple FDO guided loop peeling from RTL level to trees to enable more
> propagation into peeled sequences.
> 
> The patch also triggers bug in niter and requires xfailing do_1.f90 testcase.
> I filled PR 54932 to track this.
> 
> There are also confused array bound warnings I hope to track incrementally, too,
> by recording statements that are known to become unreachable in the last
> iteration and adding __buitin_unreachable in front of them. This is also
> important to avoid duplication leading to dead code: no other optimizers
> force paths leading to out of bound accesses to not happen.

Quite a large patch ... it could have been split into bugfixes
and enhancements ;)

Still - ok!

Thanks,
Richard.

> Honza
> 
> 
> 	* tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add edge_to_cancel
> 	parameter and use it to estimate code optimized out in the final iteration.
> 	(loop_edge_to_cancel): New function.
> 	(try_unroll_loop_completely): New IRRED_IVALIDATED parameter;
> 	handle unrolling loops with bounds given via max_loop_iteratins;
> 	handle unrolling non-inner loops when code size shrinks;
> 	tidy dump output; when the last iteration loop still stays
> 	as loop in the CFG forcongly redirect the latch to
> 	__builtin_unreachable.
> 	(canonicalize_loop_induction_variables): Add irred_invlaidated
> 	parameter; record niter bound derrived; dump
> 	max_loop_iterations bounds; call try_unroll_loop_completely
> 	even if no niter bound is given.
> 	(canonicalize_induction_variables): Handle irred_invalidated.
> 	(tree_unroll_loops_completely): Handle non-innermost loops;
> 	handle irred_invalidated.
> 	* cfgloop.h (unlop): Declare.
> 	* cfgloopmanip.c (unloop): Export.
> 	* tree.c (build_common_builtin_nodes): Build BULTIN_UNREACHABLE.
> 
> 	* gcc.target/i386/l_fma_float_?.c: Update.
> 	* gcc.target/i386/l_fma_double_?.c: Update.
> 	* gfortran.dg/do_1.f90: XFAIL
> 	* gcc.dg/tree-ssa/cunroll-1.c: New testcase.
> 	* gcc.dg/tree-ssa/cunroll-2.c: New testcase.
> 	* gcc.dg/tree-ssa/cunroll-3.c: New testcase.
> 	* gcc.dg/tree-ssa/cunroll-4.c: New testcase.
> 	* gcc.dg/tree-ssa/cunroll-5.c: New testcase.
> 	* gcc.dg/tree-ssa/ldist-17.c: Block cunroll to make testcase still
> 	valid.
> Index: tree-ssa-loop-ivcanon.c
> ===================================================================
> --- tree-ssa-loop-ivcanon.c	(revision 192483)
> +++ tree-ssa-loop-ivcanon.c	(working copy)
> @@ -192,7 +192,7 @@ constant_after_peeling (tree op, gimple 
>     Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT.  */
>  
>  static void
> -tree_estimate_loop_size (struct loop *loop, edge exit, struct loop_size *size)
> +tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size)
>  {
>    basic_block *body = get_loop_body (loop);
>    gimple_stmt_iterator gsi;
> @@ -208,8 +208,8 @@ tree_estimate_loop_size (struct loop *lo
>      fprintf (dump_file, "Estimating sizes for loop %i\n", loop->num);
>    for (i = 0; i < loop->num_nodes; i++)
>      {
> -      if (exit && body[i] != exit->src
> -	  && dominated_by_p (CDI_DOMINATORS, body[i], exit->src))
> +      if (edge_to_cancel && body[i] != edge_to_cancel->src
> +	  && dominated_by_p (CDI_DOMINATORS, body[i], edge_to_cancel->src))
>  	after_exit = true;
>        else
>  	after_exit = false;
> @@ -231,7 +231,7 @@ tree_estimate_loop_size (struct loop *lo
>  	  /* Look for reasons why we might optimize this stmt away. */
>  
>  	  /* Exit conditional.  */
> -	  if (body[i] == exit->src && stmt == last_stmt (exit->src))
> +	  if (exit && body[i] == exit->src && stmt == last_stmt (exit->src))
>  	    {
>  	      if (dump_file && (dump_flags & TDF_DETAILS))
>  	        fprintf (dump_file, "   Exit condition will be eliminated.\n");
> @@ -314,36 +314,161 @@ estimated_unrolled_size (struct loop_siz
>    return unr_insns;
>  }
>  
> +/* Loop LOOP is known to not loop.  See if there is an edge in the loop
> +   body that can be remove to make the loop to always exit and at
> +   the same time it does not make any code potentially executed 
> +   during the last iteration dead.  
> +
> +   After complette unrolling we still may get rid of the conditional
> +   on the exit in the last copy even if we have no idea what it does.
> +   This is quite common case for loops of form
> +
> +     int a[5];
> +     for (i=0;i<b;i++)
> +       a[i]=0;
> +
> +   Here we prove the loop to iterate 5 times but we do not know
> +   it from induction variable.
> +
> +   For now we handle only simple case where there is exit condition
> +   just before the latch block and the latch block contains no statements
> +   with side effect that may otherwise terminate the execution of loop
> +   (such as by EH or by terminating the program or longjmp).
> +
> +   In the general case we may want to cancel the paths leading to statements
> +   loop-niter identified as having undefined effect in the last iteration.
> +   The other cases are hopefully rare and will be cleaned up later.  */
> +
> +edge
> +loop_edge_to_cancel (struct loop *loop)
> +{
> +  VEC (edge, heap) *exits;
> +  unsigned i;
> +  edge edge_to_cancel;
> +  gimple_stmt_iterator gsi;
> +
> +  /* We want only one predecestor of the loop.  */
> +  if (EDGE_COUNT (loop->latch->preds) > 1)
> +    return NULL;
> +
> +  exits = get_loop_exit_edges (loop);
> +
> +  FOR_EACH_VEC_ELT (edge, exits, i, edge_to_cancel)
> +    {
> +       /* Find the other edge than the loop exit
> +          leaving the conditoinal.  */
> +       if (EDGE_COUNT (edge_to_cancel->src->succs) != 2)
> +         continue;
> +       if (EDGE_SUCC (edge_to_cancel->src, 0) == edge_to_cancel)
> +         edge_to_cancel = EDGE_SUCC (edge_to_cancel->src, 1);
> +       else
> +         edge_to_cancel = EDGE_SUCC (edge_to_cancel->src, 0);
> +
> +      /* We should never have conditionals in the loop latch. */
> +      gcc_assert (edge_to_cancel->dest != loop->header);
> +
> +      /* Check that it leads to loop latch.  */
> +      if (edge_to_cancel->dest != loop->latch)
> +        continue;
> +
> +      VEC_free (edge, heap, exits);
> +
> +      /* Verify that the code in loop latch does nothing that may end program
> +         execution without really reaching the exit.  This may include
> +	 non-pure/const function calls, EH statements, volatile ASMs etc.  */
> +      for (gsi = gsi_start_bb (loop->latch); !gsi_end_p (gsi); gsi_next (&gsi))
> +	if (gimple_has_side_effects (gsi_stmt (gsi)))
> +	   return NULL;
> +      return edge_to_cancel;
> +    }
> +  VEC_free (edge, heap, exits);
> +  return NULL;
> +}
> +
>  /* Tries to unroll LOOP completely, i.e. NITER times.
>     UL determines which loops we are allowed to unroll.
> -   EXIT is the exit of the loop that should be eliminated.  */
> +   EXIT is the exit of the loop that should be eliminated.  
> +   IRRED_INVALIDATED is used to bookkeep if information about
> +   irreducible regions may become invalid as a result
> +   of the transformation.  */
>  
>  static bool
>  try_unroll_loop_completely (struct loop *loop,
>  			    edge exit, tree niter,
> -			    enum unroll_level ul)
> +			    enum unroll_level ul,
> +			    bool *irred_invalidated)
>  {
>    unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
>    gimple cond;
>    struct loop_size size;
> +  bool n_unroll_found = false;
> +  HOST_WIDE_INT maxiter;
> +  basic_block latch;
> +  edge latch_edge;
> +  location_t locus;
> +  int flags;
> +  gimple stmt;
> +  gimple_stmt_iterator gsi;
> +  edge edge_to_cancel = NULL;
> +  int num = loop->num;
>  
> -  if (loop->inner)
> -    return false;
> +  /* See if we proved number of iterations to be low constant.
>  
> -  if (!host_integerp (niter, 1))
> +     EXIT is an edge that will be removed in all but last iteration of 
> +     the loop.
> +
> +     EDGE_TO_CACNEL is an edge that will be removed from the last iteration
> +     of the unrolled sequence and is expected to make the final loop not
> +     rolling. 
> +
> +     If the number of execution of loop is determined by standard induction
> +     variable test, then EXIT and EDGE_TO_CANCEL are the two edges leaving
> +     from the iv test.  */
> +  if (host_integerp (niter, 1))
> +    {
> +      n_unroll = tree_low_cst (niter, 1);
> +      n_unroll_found = true;
> +      edge_to_cancel = EDGE_SUCC (exit->src, 0);
> +      if (edge_to_cancel == exit)
> +	edge_to_cancel = EDGE_SUCC (exit->src, 1);
> +    }
> +  /* We do not know the number of iterations and thus we can not eliminate
> +     the EXIT edge.  */
> +  else
> +    exit = NULL;
> +
> +  /* See if we can improve our estimate by using recorded loop bounds.  */
> +  maxiter = max_loop_iterations_int (loop);
> +  if (maxiter >= 0
> +      && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
> +    {
> +      n_unroll = maxiter;
> +      n_unroll_found = true;
> +      /* Loop terminates before the IV variable test, so we can not
> +	 remove it in the last iteration.  */
> +      edge_to_cancel = NULL;
> +    }
> +
> +  if (!n_unroll_found)
>      return false;
> -  n_unroll = tree_low_cst (niter, 1);
>  
>    max_unroll = PARAM_VALUE (PARAM_MAX_COMPLETELY_PEEL_TIMES);
>    if (n_unroll > max_unroll)
>      return false;
>  
> +  if (!edge_to_cancel)
> +    edge_to_cancel = loop_edge_to_cancel (loop);
> +
>    if (n_unroll)
>      {
> +      sbitmap wont_exit;
> +      edge e;
> +      unsigned i;
> +      VEC (edge, heap) *to_remove = NULL;
>        if (ul == UL_SINGLE_ITER)
>  	return false;
>  
> -      tree_estimate_loop_size (loop, exit, &size);
> +      tree_estimate_loop_size (loop, exit, edge_to_cancel, &size);
>        ninsns = size.overall;
>  
>        unr_insns = estimated_unrolled_size (&size, n_unroll);
> @@ -354,6 +479,18 @@ try_unroll_loop_completely (struct loop 
>  		   (int) unr_insns);
>  	}
>  
> +      /* We unroll only inner loops, because we do not consider it profitable
> +	 otheriwse.  We still can cancel loopback edge of not rolling loop;
> +	 this is always a good idea.  */
> +      if (loop->inner && unr_insns > ninsns)
> +	{
> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> +	    fprintf (dump_file, "Not unrolling loop %d:"
> +		     "it is not innermost and code would grow.\n",
> +		     loop->num);
> +	  return false;
> +	}
> +
>        if (unr_insns > ninsns
>  	  && (unr_insns
>  	      > (unsigned) PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS)))
> @@ -369,17 +506,10 @@ try_unroll_loop_completely (struct loop 
>  	  && unr_insns > ninsns)
>  	{
>  	  if (dump_file && (dump_flags & TDF_DETAILS))
> -	    fprintf (dump_file, "Not unrolling loop %d.\n", loop->num);
> +	    fprintf (dump_file, "Not unrolling loop %d: size would grow.\n",
> +		     loop->num);
>  	  return false;
>  	}
> -    }
> -
> -  if (n_unroll)
> -    {
> -      sbitmap wont_exit;
> -      edge e;
> -      unsigned i;
> -      VEC (edge, heap) *to_remove = NULL;
>  
>        initialize_original_copy_tables ();
>        wont_exit = sbitmap_alloc (n_unroll + 1);
> @@ -408,15 +538,67 @@ try_unroll_loop_completely (struct loop 
>        free_original_copy_tables ();
>      }
>  
> -  cond = last_stmt (exit->src);
> -  if (exit->flags & EDGE_TRUE_VALUE)
> -    gimple_cond_make_true (cond);
> -  else
> -    gimple_cond_make_false (cond);
> -  update_stmt (cond);
> +  /* Remove the conditional from the last copy of the loop.  */
> +  if (edge_to_cancel)
> +    {
> +      cond = last_stmt (edge_to_cancel->src);
> +      if (edge_to_cancel->flags & EDGE_TRUE_VALUE)
> +	gimple_cond_make_false (cond);
> +      else
> +	gimple_cond_make_true (cond);
> +      update_stmt (cond);
> +      /* Do not remove the path. Doing so may remove outer loop
> +	 and confuse bookkeeping code in tree_unroll_loops_completelly.  */
> +    }
> +  /* We did not manage to cancel the loop.
> +     The loop latch remains reachable even if it will never be reached
> +     at runtime.  We must redirect it to somewhere, so create basic
> +     block containg __builtin_unreachable call for this reason.  */
> +  else
> +    {
> +      latch = loop->latch;
> +      latch_edge = loop_latch_edge (loop);
> +      flags = latch_edge->flags;
> +      locus = latch_edge->goto_locus;
> +
> +      /* Unloop destroys the latch edge.  */
> +      unloop (loop, irred_invalidated);
> +
> +      /* Create new basic block for the latch edge destination and wire
> +	 it in.  */
> +      stmt = gimple_build_call (builtin_decl_implicit (BUILT_IN_UNREACHABLE), 0);
> +      latch_edge = make_edge (latch, create_basic_block (NULL, NULL, latch), flags);
> +      latch_edge->probability = 0;
> +      latch_edge->count = 0;
> +      latch_edge->flags |= flags;
> +      latch_edge->goto_locus = locus;
> +
> +      latch_edge->dest->loop_father = current_loops->tree_root;
> +      latch_edge->dest->count = 0;
> +      latch_edge->dest->frequency = 0;
> +      set_immediate_dominator (CDI_DOMINATORS, latch_edge->dest, latch_edge->src);
> +
> +      gsi = gsi_start_bb (latch_edge->dest);
> +      gsi_insert_after (&gsi, stmt, GSI_NEW_STMT);
> +    }
>  
>    if (dump_file && (dump_flags & TDF_DETAILS))
> -    fprintf (dump_file, "Unrolled loop %d completely.\n", loop->num);
> +    {
> +      if (!n_unroll)
> +        fprintf (dump_file, "Turned loop %d to non-loop; it never loops.\n",
> +		 num);
> +      else
> +        fprintf (dump_file, "Unrolled loop %d completely "
> +		 "(duplicated %i times).\n", num, (int)n_unroll);
> +      if (exit)
> +        fprintf (dump_file, "Exit condition of peeled iterations was "
> +		 "eliminated.\n");
> +      if (edge_to_cancel)
> +        fprintf (dump_file, "Last iteration exit edge was proved true.\n");
> +      else
> +        fprintf (dump_file, "Latch of last iteration was marked by "
> +		 "__builtin_unreachable ().\n");
> +    }
>  
>    return true;
>  }
> @@ -425,12 +608,15 @@ try_unroll_loop_completely (struct loop 
>     CREATE_IV is true if we may create a new iv.  UL determines
>     which loops we are allowed to completely unroll.  If TRY_EVAL is true, we try
>     to determine the number of iterations of a loop by direct evaluation.
> -   Returns true if cfg is changed.  */
> +   Returns true if cfg is changed.  
> +
> +   IRRED_INVALIDATED is used to keep if irreducible reginos needs to be recomputed.  */
>  
>  static bool
>  canonicalize_loop_induction_variables (struct loop *loop,
>  				       bool create_iv, enum unroll_level ul,
> -				       bool try_eval)
> +				       bool try_eval,
> +				       bool *irred_invalidated)
>  {
>    edge exit = NULL;
>    tree niter;
> @@ -455,22 +641,34 @@ canonicalize_loop_induction_variables (s
>  	      || TREE_CODE (niter) != INTEGER_CST))
>  	niter = find_loop_niter_by_eval (loop, &exit);
>  
> -      if (chrec_contains_undetermined (niter)
> -	  || TREE_CODE (niter) != INTEGER_CST)
> -	return false;
> +      if (TREE_CODE (niter) != INTEGER_CST)
> +	exit = NULL;
>      }
>  
> -  if (dump_file && (dump_flags & TDF_DETAILS))
> +  /* We work exceptionally hard here to estimate the bound
> +     by find_loop_niter_by_eval.  Be sure to keep it for future.  */
> +  if (niter && TREE_CODE (niter) == INTEGER_CST)
> +    record_niter_bound (loop, tree_to_double_int (niter), false, true);
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS)
> +      && TREE_CODE (niter) == INTEGER_CST)
>      {
>        fprintf (dump_file, "Loop %d iterates ", loop->num);
>        print_generic_expr (dump_file, niter, TDF_SLIM);
>        fprintf (dump_file, " times.\n");
>      }
> +  if (dump_file && (dump_flags & TDF_DETAILS)
> +      && max_loop_iterations_int (loop) >= 0)
> +    {
> +      fprintf (dump_file, "Loop %d iterates at most %i times.\n", loop->num,
> +	       (int)max_loop_iterations_int (loop));
> +    }
>  
> -  if (try_unroll_loop_completely (loop, exit, niter, ul))
> +  if (try_unroll_loop_completely (loop, exit, niter, ul, irred_invalidated))
>      return true;
>  
> -  if (create_iv)
> +  if (create_iv
> +      && niter && !chrec_contains_undetermined (niter))
>      create_canonical_iv (loop, exit, niter);
>  
>    return false;
> @@ -485,15 +683,21 @@ canonicalize_induction_variables (void)
>    loop_iterator li;
>    struct loop *loop;
>    bool changed = false;
> +  bool irred_invalidated = false;
>  
>    FOR_EACH_LOOP (li, loop, 0)
>      {
>        changed |= canonicalize_loop_induction_variables (loop,
>  							true, UL_SINGLE_ITER,
> -							true);
> +							true,
> +							&irred_invalidated);
>      }
>    gcc_assert (!need_ssa_update_p (cfun));
>  
> +  if (irred_invalidated
> +      && loops_state_satisfies_p (LOOPS_HAVE_MARKED_IRREDUCIBLE_REGIONS))
> +    mark_irreducible_loops ();
> +
>    /* Clean up the information about numbers of iterations, since brute force
>       evaluation could reveal new information.  */
>    scev_reset ();
> @@ -594,9 +798,10 @@ tree_unroll_loops_completely (bool may_i
>  
>    do
>      {
> +      bool irred_invalidated = false;
>        changed = false;
>  
> -      FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
> +      FOR_EACH_LOOP (li, loop, 0)
>  	{
>  	  struct loop *loop_father = loop_outer (loop);
>  
> @@ -609,7 +814,8 @@ tree_unroll_loops_completely (bool may_i
>  	    ul = UL_NO_GROWTH;
>  
>  	  if (canonicalize_loop_induction_variables (loop, false, ul,
> -						     !flag_tree_loop_ivcanon))
> +						     !flag_tree_loop_ivcanon,
> +						     &irred_invalidated))
>  	    {
>  	      changed = true;
>  	      /* If we'll continue unrolling, we need to propagate constants
> @@ -629,6 +835,10 @@ tree_unroll_loops_completely (bool may_i
>  	  struct loop **iter;
>  	  unsigned i;
>  
> +	  if (irred_invalidated
> +	      && loops_state_satisfies_p (LOOPS_HAVE_MARKED_IRREDUCIBLE_REGIONS))
> +	    mark_irreducible_loops ();
> +
>  	  update_ssa (TODO_update_ssa);
>  
>  	  /* Propagate the constants within the new basic blocks.  */
> Index: tree.c
> ===================================================================
> --- tree.c	(revision 192483)
> +++ tree.c	(working copy)
> @@ -9524,6 +9524,15 @@ build_common_builtin_nodes (void)
>    tree tmp, ftype;
>    int ecf_flags;
>  
> +  if (!builtin_decl_explicit_p (BUILT_IN_UNREACHABLE))
> +    {
> +      ftype = build_function_type (void_type_node, void_list_node);
> +      local_define_builtin ("__builtin_unreachable", ftype, BUILT_IN_UNREACHABLE,
> +			    "__builtin_unreachable",
> +			    ECF_NOTHROW | ECF_LEAF | ECF_NORETURN
> +			    | ECF_CONST | ECF_LEAF);
> +    }
> +
>    if (!builtin_decl_explicit_p (BUILT_IN_MEMCPY)
>        || !builtin_decl_explicit_p (BUILT_IN_MEMMOVE))
>      {
> Index: cfgloop.h
> ===================================================================
> --- cfgloop.h	(revision 192483)
> +++ cfgloop.h	(working copy)
> @@ -320,7 +321,8 @@ extern struct loop *loopify (edge, edge,
>  struct loop * loop_version (struct loop *, void *,
>  			    basic_block *, unsigned, unsigned, unsigned, bool);
>  extern bool remove_path (edge);
> -void scale_loop_frequencies (struct loop *, int, int);
> +extern void unloop (struct loop *, bool *);
> +extern void scale_loop_frequencies (struct loop *, int, int);
>  
>  /* Induction variable analysis.  */
>  
> Index: cfgloopmanip.c
> ===================================================================
> --- cfgloopmanip.c	(revision 192483)
> +++ cfgloopmanip.c	(working copy)
> @@ -37,7 +37,6 @@ static int find_path (edge, basic_block 
>  static void fix_loop_placements (struct loop *, bool *);
>  static bool fix_bb_placement (basic_block);
>  static void fix_bb_placements (basic_block, bool *);
> -static void unloop (struct loop *, bool *);
>  
>  /* Checks whether basic block BB is dominated by DATA.  */
>  static bool
> @@ -895,7 +894,7 @@ loopify (edge latch_edge, edge header_ed
>     If this may cause the information about irreducible regions to become
>     invalid, IRRED_INVALIDATED is set to true.  */
>  
> -static void
> +void
>  unloop (struct loop *loop, bool *irred_invalidated)
>  {
>    basic_block *body;
> Index: testsuite/gcc.target/i386/l_fma_float_5.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_float_5.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_float_5.c	(working copy)
> @@ -12,7 +12,7 @@
>  /* { dg-final { scan-assembler-times "vfmsub132ps" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmadd132ps" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132ps" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132ss" 16  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfmsub132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132ss" 72  } } */
> Index: testsuite/gcc.target/i386/l_fma_double_4.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_double_4.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_double_4.c	(working copy)
> @@ -12,7 +12,7 @@
>  /* { dg-final { scan-assembler-times "vfmsub132pd" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmadd132pd" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132pd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132sd" 16  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132sd" 40  } } */
> +/* { dg-final { scan-assembler-times "vfmsub132sd" 40  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132sd" 40  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132sd" 40  } } */
> Index: testsuite/gcc.target/i386/l_fma_float_2.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_float_2.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_float_2.c	(working copy)
> @@ -12,7 +12,7 @@
>  /* { dg-final { scan-assembler-times "vfmsub132ps" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmadd132ps" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132ps" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132ss" 16  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfmsub132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132ss" 72  } } */
> Index: testsuite/gcc.target/i386/l_fma_float_6.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_float_6.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_float_6.c	(working copy)
> @@ -12,7 +12,7 @@
>  /* { dg-final { scan-assembler-times "vfmsub132ps" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmadd132ps" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132ps" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132ss" 16  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfmsub132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132ss" 72  } } */
> Index: testsuite/gcc.target/i386/l_fma_double_1.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_double_1.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_double_1.c	(working copy)
> @@ -16,11 +16,11 @@
>  /* { dg-final { scan-assembler-times "vfnmadd231pd" 4  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132pd" 4  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub231pd" 4  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd213sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmsub213sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd213sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub213sd" 8  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfmadd213sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfmsub132sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfmsub213sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd213sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub213sd" 20  } } */
> Index: testsuite/gcc.target/i386/l_fma_double_5.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_double_5.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_double_5.c	(working copy)
> @@ -12,7 +12,7 @@
>  /* { dg-final { scan-assembler-times "vfmsub132pd" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmadd132pd" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132pd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132sd" 16  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132sd" 40  } } */
> +/* { dg-final { scan-assembler-times "vfmsub132sd" 40  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132sd" 40  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132sd" 40  } } */
> Index: testsuite/gcc.target/i386/l_fma_float_3.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_float_3.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_float_3.c	(working copy)
> @@ -16,11 +16,11 @@
>  /* { dg-final { scan-assembler-times "vfnmadd231ps" 4  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132ps" 4  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub231ps" 4  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd213ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmsub213ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd213ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub213ss" 8  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfmadd213ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfmsub132ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfmsub213ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd213ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub213ss" 36  } } */
> Index: testsuite/gcc.target/i386/l_fma_double_2.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_double_2.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_double_2.c	(working copy)
> @@ -12,7 +12,7 @@
>  /* { dg-final { scan-assembler-times "vfmsub132pd" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmadd132pd" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132pd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132sd" 16  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132sd" 40  } } */
> +/* { dg-final { scan-assembler-times "vfmsub132sd" 40  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132sd" 40  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132sd" 40  } } */
> Index: testsuite/gcc.target/i386/l_fma_double_6.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_double_6.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_double_6.c	(working copy)
> @@ -12,7 +12,7 @@
>  /* { dg-final { scan-assembler-times "vfmsub132pd" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmadd132pd" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132pd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132sd" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132sd" 16  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132sd" 40 } } */
> +/* { dg-final { scan-assembler-times "vfmsub132sd" 40  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132sd" 40  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132sd" 40  } } */
> Index: testsuite/gcc.target/i386/l_fma_float_4.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_float_4.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_float_4.c	(working copy)
> @@ -12,7 +12,7 @@
>  /* { dg-final { scan-assembler-times "vfmsub132ps" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmadd132ps" 8  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132ps" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132ss" 16  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132ss" 16  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfmsub132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132ss" 72  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132ss" 72  } } */
> Index: testsuite/gcc.target/i386/l_fma_double_3.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_double_3.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_double_3.c	(working copy)
> @@ -16,11 +16,11 @@
>  /* { dg-final { scan-assembler-times "vfnmadd231pd" 4  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132pd" 4  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub231pd" 4  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd213sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmsub213sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd213sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132sd" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub213sd" 8  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfmadd213sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfmsub132sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfmsub213sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd213sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132sd" 20  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub213sd" 20  } } */
> Index: testsuite/gcc.target/i386/l_fma_float_1.c
> ===================================================================
> --- testsuite/gcc.target/i386/l_fma_float_1.c	(revision 192483)
> +++ testsuite/gcc.target/i386/l_fma_float_1.c	(working copy)
> @@ -16,11 +16,11 @@
>  /* { dg-final { scan-assembler-times "vfnmadd231ps" 4  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub132ps" 4  } } */
>  /* { dg-final { scan-assembler-times "vfnmsub231ps" 4  } } */
> -/* { dg-final { scan-assembler-times "vfmadd132ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmadd213ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmsub132ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfmsub213ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd132ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmadd213ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub132ss" 8  } } */
> -/* { dg-final { scan-assembler-times "vfnmsub213ss" 8  } } */
> +/* { dg-final { scan-assembler-times "vfmadd132ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfmadd213ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfmsub132ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfmsub213ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd132ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfnmadd213ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub132ss" 36  } } */
> +/* { dg-final { scan-assembler-times "vfnmsub213ss" 36  } } */
> Index: testsuite/gfortran.dg/do_1.f90
> ===================================================================
> --- testsuite/gfortran.dg/do_1.f90	(revision 192483)
> +++ testsuite/gfortran.dg/do_1.f90	(working copy)
> @@ -1,4 +1,5 @@
> -! { dg-do run }
> +! { dg-do run { xfail *-*-* } }
> +! XFAIL is tracked in PR 54932
>  ! Program to check corner cases for DO statements.
>  program do_1
>    implicit none
> Index: testsuite/gcc.dg/tree-ssa/cunroll-1.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/cunroll-1.c	(revision 0)
> +++ testsuite/gcc.dg/tree-ssa/cunroll-1.c	(revision 0)
> @@ -0,0 +1,13 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
> +int a[2];
> +test(int c)
> +{ 
> +  int i;
> +  for (i=0;i<c;i++)
> +    a[i]=5;
> +}
> +/* Array bounds says the loop will not roll much.  */
> +/* { dg-final { scan-tree-dump "Unrolled loop 1 completely .duplicated 1 times.." "cunroll"} } */
> +/* { dg-final { scan-tree-dump "Last iteration exit edge was proved true." "cunroll"} } */
> +/* { dg-final { cleanup-tree-dump "cunroll" } } */
> Index: testsuite/gcc.dg/tree-ssa/cunroll-2.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/cunroll-2.c	(revision 0)
> +++ testsuite/gcc.dg/tree-ssa/cunroll-2.c	(revision 0)
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
> +int a[2];
> +test(int c)
> +{ 
> +  int i;
> +  for (i=0;i<c;i++)
> +    {
> +      a[i]=5;
> +      if (test2())
> +	return;
> +    }
> +}
> +/* We are not able to get rid of the final conditional because the loop has two exits.  */
> +/* { dg-final { scan-tree-dump "Unrolled loop 1 completely .duplicated 2 times.." "cunroll"} } */
> +/* { dg-final { cleanup-tree-dump "cunroll" } } */
> Index: testsuite/gcc.dg/tree-ssa/cunroll-3.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/cunroll-3.c	(revision 0)
> +++ testsuite/gcc.dg/tree-ssa/cunroll-3.c	(revision 0)
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
> +int a[1];
> +test(int c)
> +{ 
> +  int i;
> +  for (i=0;i<c;i++)
> +    {
> +      a[i]=5;
> +    }
> +}
> +/* If we start duplicating headers prior curoll, this loop will have 0 iterations.  */
> +
> +/* { dg-final { scan-tree-dump "Unrolled loop 1 completely .duplicated 1 times.." "cunrolli"} } */
> +/* { dg-final { cleanup-tree-dump "cunrolli" } } */
> Index: testsuite/gcc.dg/tree-ssa/cunroll-4.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/cunroll-4.c	(revision 0)
> +++ testsuite/gcc.dg/tree-ssa/cunroll-4.c	(revision 0)
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
> +int a[1];
> +test(int c)
> +{ 
> +  int i=0,j;
> +  for (i=0;i<c;i++)
> +    {
> +      for (j=0;j<c;j++)
> +	{
> +          a[i]=5;
> +	  test2();
> +	}
> +    }
> +}
> +
> +/* We should do this as part of cunrolli, but our cost model do not take into account early exit
> +   from the last iteration.  */
> +/* { dg-final { scan-tree-dump "Turned loop 1 to non-loop; it never loops." "cunrolli"} } */
> +/* { dg-final { scan-tree-dump "Last iteration exit edge was proved true." "cunrolli"} } */
> +/* { dg-final { cleanup-tree-dump "cunroll" } } */
> Index: testsuite/gcc.dg/tree-ssa/cunroll-5.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/cunroll-5.c	(revision 0)
> +++ testsuite/gcc.dg/tree-ssa/cunroll-5.c	(revision 0)
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
> +int *a;
> +test(int c)
> +{ 
> +  int i;
> +  for (i=0;i<6;i++)
> +    a[i]=5;
> +}
> +/* Basic testcase for complette unrolling.  */
> +/* { dg-final { scan-tree-dump "Unrolled loop 1 completely .duplicated 5 times.." "cunroll"} } */
> +/* { dg-final { scan-tree-dump "Exit condition of peeled iterations was eliminated." "cunroll"} } */
> +/* { dg-final { scan-tree-dump "Last iteration exit edge was proved true." "cunroll"} } */
> +/* { dg-final { cleanup-tree-dump "cunroll" } } */
> Index: testsuite/gcc.dg/tree-ssa/ldist-17.c
> ===================================================================
> --- testsuite/gcc.dg/tree-ssa/ldist-17.c	(revision 192483)
> +++ testsuite/gcc.dg/tree-ssa/ldist-17.c	(working copy)
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
> +/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details fdisable-tree-cunroll -fdisable-tree-cunrolli" } */
>  
>  typedef int mad_fixed_t;
>  struct mad_pcm
> 
>
H.J. Lu Oct. 24, 2012, 11:53 p.m. UTC | #2
On Tue, Oct 16, 2012 at 10:32 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> here is third revised version of the complette unroling changes.  While working
> on the RTL variant I noticed PR54937 and the fact that I was overly aggressive
> on forcing single exit of the last iteration to be taken, because loop may terminate
> otherwise (by EH or by exitting the program).  Same thinko is in loop-niter.
>
> This patch adds loop_edge_to_cancel that is more conservative: it looks for the
> exit conditional where the non-exitting edges leads to latch and verifies that
> latch contains no statement with side effect that may terminate the loop.
> This still actually matches quite few non-single-exit loops and works well in
> practice.
>
> Unlike previous revision it also enables complette unrolling when code size
> does not grow even for non-innermost loops (with update in
> tree_unroll_loops_completely to walk them). This is something we did on RTL
> land but missed in trees.  This actually enables quite some optimizations when
> things can be propagated to the tiny inner loop body.
>
> I also fixed accounting in tree_estimate_loop_size for the cases where last
> iteration is not going to be updated.
>
> Finally I added code constructing __bulitin_unreachable as suggested by
> Ian.
>
> Bootstrapped/regtested x86_64-linux, also bootstrapped with -O3 and -Werror
> disabled and benchmarked. Among best benefits is about 7% improvement on Applu,
> and it causes up to 15% improvements on vectorized loops with small iteration
> counts (by completelly peeling the precondition code).  There are no real
> performance regressions but there is some code size bloat.
>
> I plan to followup with strenghtening the heuristic to disable unrolling when
> benefits are absymal.  Easy is to limit unrolling on loops with CFG and/or
> calls in them.  We already have quite informed analysis in place.  I also plan
> to move simple FDO guided loop peeling from RTL level to trees to enable more
> propagation into peeled sequences.
>
> The patch also triggers bug in niter and requires xfailing do_1.f90 testcase.
> I filled PR 54932 to track this.
>
> There are also confused array bound warnings I hope to track incrementally, too,
> by recording statements that are known to become unreachable in the last
> iteration and adding __buitin_unreachable in front of them. This is also
> important to avoid duplication leading to dead code: no other optimizers
> force paths leading to out of bound accesses to not happen.
>
> Honza
>
>
>         * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add edge_to_cancel
>         parameter and use it to estimate code optimized out in the final iteration.
>         (loop_edge_to_cancel): New function.
>         (try_unroll_loop_completely): New IRRED_IVALIDATED parameter;
>         handle unrolling loops with bounds given via max_loop_iteratins;
>         handle unrolling non-inner loops when code size shrinks;
>         tidy dump output; when the last iteration loop still stays
>         as loop in the CFG forcongly redirect the latch to
>         __builtin_unreachable.
>         (canonicalize_loop_induction_variables): Add irred_invlaidated
>         parameter; record niter bound derrived; dump
>         max_loop_iterations bounds; call try_unroll_loop_completely
>         even if no niter bound is given.
>         (canonicalize_induction_variables): Handle irred_invalidated.
>         (tree_unroll_loops_completely): Handle non-innermost loops;
>         handle irred_invalidated.
>         * cfgloop.h (unlop): Declare.
>         * cfgloopmanip.c (unloop): Export.
>         * tree.c (build_common_builtin_nodes): Build BULTIN_UNREACHABLE.
>

This caused:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55051
Jan Hubicka Oct. 25, 2012, 12:09 a.m. UTC | #3
> >         * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add edge_to_cancel
> >         parameter and use it to estimate code optimized out in the final iteration.
> >         (loop_edge_to_cancel): New function.
> >         (try_unroll_loop_completely): New IRRED_IVALIDATED parameter;
> >         handle unrolling loops with bounds given via max_loop_iteratins;
> >         handle unrolling non-inner loops when code size shrinks;
> >         tidy dump output; when the last iteration loop still stays
> >         as loop in the CFG forcongly redirect the latch to
> >         __builtin_unreachable.
> >         (canonicalize_loop_induction_variables): Add irred_invlaidated
> >         parameter; record niter bound derrived; dump
> >         max_loop_iterations bounds; call try_unroll_loop_completely
> >         even if no niter bound is given.
> >         (canonicalize_induction_variables): Handle irred_invalidated.
> >         (tree_unroll_loops_completely): Handle non-innermost loops;
> >         handle irred_invalidated.
> >         * cfgloop.h (unlop): Declare.
> >         * cfgloopmanip.c (unloop): Export.
> >         * tree.c (build_common_builtin_nodes): Build BULTIN_UNREACHABLE.
> >
> 
> This caused:
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55051
Oops, I did not really anticipated this to happen during profiledbootstrap, but I
discussed the problem in
http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01103.html
I suppose we will need to find porper fix for this long lasting issue more
urgently now :(

Honza
> 
> 
> -- 
> H.J.
Richard Biener Oct. 25, 2012, 10:32 a.m. UTC | #4
On Thu, 25 Oct 2012, Jan Hubicka wrote:

> > >         * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add edge_to_cancel
> > >         parameter and use it to estimate code optimized out in the final iteration.
> > >         (loop_edge_to_cancel): New function.
> > >         (try_unroll_loop_completely): New IRRED_IVALIDATED parameter;
> > >         handle unrolling loops with bounds given via max_loop_iteratins;
> > >         handle unrolling non-inner loops when code size shrinks;
> > >         tidy dump output; when the last iteration loop still stays
> > >         as loop in the CFG forcongly redirect the latch to
> > >         __builtin_unreachable.
> > >         (canonicalize_loop_induction_variables): Add irred_invlaidated
> > >         parameter; record niter bound derrived; dump
> > >         max_loop_iterations bounds; call try_unroll_loop_completely
> > >         even if no niter bound is given.
> > >         (canonicalize_induction_variables): Handle irred_invalidated.
> > >         (tree_unroll_loops_completely): Handle non-innermost loops;
> > >         handle irred_invalidated.
> > >         * cfgloop.h (unlop): Declare.
> > >         * cfgloopmanip.c (unloop): Export.
> > >         * tree.c (build_common_builtin_nodes): Build BULTIN_UNREACHABLE.
> > >
> > 
> > This caused:
> > 
> > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55051
> Oops, I did not really anticipated this to happen during profiledbootstrap, but I
> discussed the problem in
> http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01103.html
> I suppose we will need to find porper fix for this long lasting issue more
> urgently now :(

Please try to construct a testcase I can look at.

Richard.
Jan Hubicka Oct. 25, 2012, 12:05 p.m. UTC | #5
> On Thu, 25 Oct 2012, Jan Hubicka wrote:
> 
> > > >         * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add edge_to_cancel
> > > >         parameter and use it to estimate code optimized out in the final iteration.
> > > >         (loop_edge_to_cancel): New function.
> > > >         (try_unroll_loop_completely): New IRRED_IVALIDATED parameter;
> > > >         handle unrolling loops with bounds given via max_loop_iteratins;
> > > >         handle unrolling non-inner loops when code size shrinks;
> > > >         tidy dump output; when the last iteration loop still stays
> > > >         as loop in the CFG forcongly redirect the latch to
> > > >         __builtin_unreachable.
> > > >         (canonicalize_loop_induction_variables): Add irred_invlaidated
> > > >         parameter; record niter bound derrived; dump
> > > >         max_loop_iterations bounds; call try_unroll_loop_completely
> > > >         even if no niter bound is given.
> > > >         (canonicalize_induction_variables): Handle irred_invalidated.
> > > >         (tree_unroll_loops_completely): Handle non-innermost loops;
> > > >         handle irred_invalidated.
> > > >         * cfgloop.h (unlop): Declare.
> > > >         * cfgloopmanip.c (unloop): Export.
> > > >         * tree.c (build_common_builtin_nodes): Build BULTIN_UNREACHABLE.
> > > >
> > > 
> > > This caused:
> > > 
> > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55051
> > Oops, I did not really anticipated this to happen during profiledbootstrap, but I
> > discussed the problem in
> > http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01103.html
> > I suppose we will need to find porper fix for this long lasting issue more
> > urgently now :(
> 
> Please try to construct a testcase I can look at.
Like this:
int a[3];
int b[4];
main()
{
  int i;
  for (i=0;i<4;i++)
    if (b[i]==2)
     a[i]++;
}
here we deirve the bound of the loop by IV variable to be 4, a[i] does not count because
it does not dominate latch edge.  We unroll 4 times and the 4th copy gets out of range access.
(before or after my patch; with my patch we trigger this more often because more often we derive
some bounds)

I was thinking of the following:

 1) make loop->bounds to collect all statements having undefined effect on given iteration, not only
    those dominating latch
 2) make record_estimate to ignore those not dominating latch for loop bound
 3) Once done make walk through the loop determining upper bound by Dijiskra's algorithm
 4) Make complette unroling to patch in __builtin_trap into every peeled copy when given statement
    is known to not be executed.

3) will make us to handle correctly the following:
int a[3];
int b[4];
int c[3];
main()
{
  int i;
  for (i=0;b[i];i++)
    if (b[i]==2)
     a[i]++;
   else
     c[i]++;
}

Here bound is 3 by a[i] and c[i] access but it is not discovered because there
is no dominance relationship.

4) should get rid of those warings and moreover improve codegen since
__builtin_trap is less confusing than impossible path in the CFG.  We will need
to update not only the last iteratin of the loop, but all peeled copies, so we
will need change in duplicate_loop_to_header_edge.

We can also patch in __builtin_unreachable but that will make bugs with out
overflows/ out of bound accesses hard in loops to debug at -O levels (as seen
by do-1.f90 testcase and the new PR with integer overflow), so perhaps we could
do that at -Ofast or only or with -funsafe-loop-optimization or so even if it
is standard conforming.  I think friendly trap showing proper line number info
is better.

What do you think?
Honza

> 
> Richard.
Richard Biener Oct. 25, 2012, 12:20 p.m. UTC | #6
On Thu, 25 Oct 2012, Jan Hubicka wrote:

> > On Thu, 25 Oct 2012, Jan Hubicka wrote:
> > 
> > > > >         * tree-ssa-loop-ivcanon.c (tree_estimate_loop_size): Add edge_to_cancel
> > > > >         parameter and use it to estimate code optimized out in the final iteration.
> > > > >         (loop_edge_to_cancel): New function.
> > > > >         (try_unroll_loop_completely): New IRRED_IVALIDATED parameter;
> > > > >         handle unrolling loops with bounds given via max_loop_iteratins;
> > > > >         handle unrolling non-inner loops when code size shrinks;
> > > > >         tidy dump output; when the last iteration loop still stays
> > > > >         as loop in the CFG forcongly redirect the latch to
> > > > >         __builtin_unreachable.
> > > > >         (canonicalize_loop_induction_variables): Add irred_invlaidated
> > > > >         parameter; record niter bound derrived; dump
> > > > >         max_loop_iterations bounds; call try_unroll_loop_completely
> > > > >         even if no niter bound is given.
> > > > >         (canonicalize_induction_variables): Handle irred_invalidated.
> > > > >         (tree_unroll_loops_completely): Handle non-innermost loops;
> > > > >         handle irred_invalidated.
> > > > >         * cfgloop.h (unlop): Declare.
> > > > >         * cfgloopmanip.c (unloop): Export.
> > > > >         * tree.c (build_common_builtin_nodes): Build BULTIN_UNREACHABLE.
> > > > >
> > > > 
> > > > This caused:
> > > > 
> > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=55051
> > > Oops, I did not really anticipated this to happen during profiledbootstrap, but I
> > > discussed the problem in
> > > http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01103.html
> > > I suppose we will need to find porper fix for this long lasting issue more
> > > urgently now :(
> > 
> > Please try to construct a testcase I can look at.
> Like this:
> int a[3];
> int b[4];
> main()
> {
>   int i;
>   for (i=0;i<4;i++)
>     if (b[i]==2)
>      a[i]++;
> }
> here we deirve the bound of the loop by IV variable to be 4, a[i] does not count because
> it does not dominate latch edge.  We unroll 4 times and the 4th copy gets out of range access.
> (before or after my patch; with my patch we trigger this more often because more often we derive
> some bounds)
>
> I was thinking of the following:
> 
>  1) make loop->bounds to collect all statements having undefined effect on given iteration, not only
>     those dominating latch
>  2) make record_estimate to ignore those not dominating latch for loop bound
>  3) Once done make walk through the loop determining upper bound by Dijiskra's algorithm
>  4) Make complette unroling to patch in __builtin_trap into every peeled copy when given statement
>     is known to not be executed.
> 
> 3) will make us to handle correctly the following:
> int a[3];
> int b[4];
> int c[3];
> main()
> {
>   int i;
>   for (i=0;b[i];i++)
>     if (b[i]==2)
>      a[i]++;
>    else
>      c[i]++;
> }
> 
> Here bound is 3 by a[i] and c[i] access but it is not discovered because there
> is no dominance relationship.
> 
> 4) should get rid of those warings and moreover improve codegen since
> __builtin_trap is less confusing than impossible path in the CFG.  We will need
> to update not only the last iteratin of the loop, but all peeled copies, so we
> will need change in duplicate_loop_to_header_edge.
>
> We can also patch in __builtin_unreachable but that will make bugs with 
> out overflows/ out of bound accesses hard in loops to debug at -O levels 
> (as seen by do-1.f90 testcase and the new PR with integer overflow), so 
> perhaps we could do that at -Ofast or only or with 
> -funsafe-loop-optimization or so even if it is standard conforming.  I 
> think friendly trap showing proper line number info is better.

I'd prefer __builtin_unreachable ().

> What do you think?

Well, the issue is really that we exploit undefined behavior to
optimize (unroll in this case) and after that warn about undefined 
behavior ...

That said, the VRP code warning about array bound violations is
already quite weak (it requires the full value-range to be outside
of the array!) - in practice the warning isn't that great for
non-constant indices and for those we should warn during the first
VRP pass already.  So one option is to simply disable array-bound
warnings in the late VRP pass.

Richard.

> Honza
> 
> > 
> > Richard.
> 
>
Jan Hubicka Oct. 25, 2012, 12:31 p.m. UTC | #7
> > We can also patch in __builtin_unreachable but that will make bugs with 
> > out overflows/ out of bound accesses hard in loops to debug at -O levels 
> > (as seen by do-1.f90 testcase and the new PR with integer overflow), so 
> > perhaps we could do that at -Ofast or only or with 
> > -funsafe-loop-optimization or so even if it is standard conforming.  I 
> > think friendly trap showing proper line number info is better.
> 
> I'd prefer __builtin_unreachable ().

OK. I feel bit nervous about users being confused. But if you agree with the
overall plan, I will realize it with __builtin_unreachable ().  We will see
what PRs appear and we can fall back into builtin_trap at any time later with
an ease.

Of course producing traps into -O3 code just for debugging is bit uncool.  

BTW I wonder why __builtin_trap, __builtin_abort and _exit are not declared
CONST.  Do we care about memory stores just prior trapping/aborting/forcingly
exitting?

I also wonder why optimizing of builtin_unreachable can not be part of fixup of
noreturn calls in cfgcleanup.  That would make it to be handled better in the
cfgcleanups run in between cunroll iterations allowing possibly more cascaded
urolling.
 
> 
> > What do you think?
> 
> Well, the issue is really that we exploit undefined behavior to
> optimize (unroll in this case) and after that warn about undefined 
> behavior ...
> 
> That said, the VRP code warning about array bound violations is
> already quite weak (it requires the full value-range to be outside
> of the array!) - in practice the warning isn't that great for
> non-constant indices and for those we should warn during the first
> VRP pass already.  So one option is to simply disable array-bound
> warnings in the late VRP pass.

Or my original patch disabling it on all duplicated statements.  I am sure with
enough of plaing I can make testcase where the loop will be unrolled by
cunrolli confusing the early VRP pass, too.  It is just matter of introducing
enough of IV based calculations that the cost model will consider constant.

Honza
diff mbox

Patch

Index: tree-ssa-loop-ivcanon.c
===================================================================
--- tree-ssa-loop-ivcanon.c	(revision 192483)
+++ tree-ssa-loop-ivcanon.c	(working copy)
@@ -192,7 +192,7 @@  constant_after_peeling (tree op, gimple 
    Return results in SIZE, estimate benefits for complete unrolling exiting by EXIT.  */
 
 static void
-tree_estimate_loop_size (struct loop *loop, edge exit, struct loop_size *size)
+tree_estimate_loop_size (struct loop *loop, edge exit, edge edge_to_cancel, struct loop_size *size)
 {
   basic_block *body = get_loop_body (loop);
   gimple_stmt_iterator gsi;
@@ -208,8 +208,8 @@  tree_estimate_loop_size (struct loop *lo
     fprintf (dump_file, "Estimating sizes for loop %i\n", loop->num);
   for (i = 0; i < loop->num_nodes; i++)
     {
-      if (exit && body[i] != exit->src
-	  && dominated_by_p (CDI_DOMINATORS, body[i], exit->src))
+      if (edge_to_cancel && body[i] != edge_to_cancel->src
+	  && dominated_by_p (CDI_DOMINATORS, body[i], edge_to_cancel->src))
 	after_exit = true;
       else
 	after_exit = false;
@@ -231,7 +231,7 @@  tree_estimate_loop_size (struct loop *lo
 	  /* Look for reasons why we might optimize this stmt away. */
 
 	  /* Exit conditional.  */
-	  if (body[i] == exit->src && stmt == last_stmt (exit->src))
+	  if (exit && body[i] == exit->src && stmt == last_stmt (exit->src))
 	    {
 	      if (dump_file && (dump_flags & TDF_DETAILS))
 	        fprintf (dump_file, "   Exit condition will be eliminated.\n");
@@ -314,36 +314,161 @@  estimated_unrolled_size (struct loop_siz
   return unr_insns;
 }
 
+/* Loop LOOP is known to not loop.  See if there is an edge in the loop
+   body that can be remove to make the loop to always exit and at
+   the same time it does not make any code potentially executed 
+   during the last iteration dead.  
+
+   After complette unrolling we still may get rid of the conditional
+   on the exit in the last copy even if we have no idea what it does.
+   This is quite common case for loops of form
+
+     int a[5];
+     for (i=0;i<b;i++)
+       a[i]=0;
+
+   Here we prove the loop to iterate 5 times but we do not know
+   it from induction variable.
+
+   For now we handle only simple case where there is exit condition
+   just before the latch block and the latch block contains no statements
+   with side effect that may otherwise terminate the execution of loop
+   (such as by EH or by terminating the program or longjmp).
+
+   In the general case we may want to cancel the paths leading to statements
+   loop-niter identified as having undefined effect in the last iteration.
+   The other cases are hopefully rare and will be cleaned up later.  */
+
+edge
+loop_edge_to_cancel (struct loop *loop)
+{
+  VEC (edge, heap) *exits;
+  unsigned i;
+  edge edge_to_cancel;
+  gimple_stmt_iterator gsi;
+
+  /* We want only one predecestor of the loop.  */
+  if (EDGE_COUNT (loop->latch->preds) > 1)
+    return NULL;
+
+  exits = get_loop_exit_edges (loop);
+
+  FOR_EACH_VEC_ELT (edge, exits, i, edge_to_cancel)
+    {
+       /* Find the other edge than the loop exit
+          leaving the conditoinal.  */
+       if (EDGE_COUNT (edge_to_cancel->src->succs) != 2)
+         continue;
+       if (EDGE_SUCC (edge_to_cancel->src, 0) == edge_to_cancel)
+         edge_to_cancel = EDGE_SUCC (edge_to_cancel->src, 1);
+       else
+         edge_to_cancel = EDGE_SUCC (edge_to_cancel->src, 0);
+
+      /* We should never have conditionals in the loop latch. */
+      gcc_assert (edge_to_cancel->dest != loop->header);
+
+      /* Check that it leads to loop latch.  */
+      if (edge_to_cancel->dest != loop->latch)
+        continue;
+
+      VEC_free (edge, heap, exits);
+
+      /* Verify that the code in loop latch does nothing that may end program
+         execution without really reaching the exit.  This may include
+	 non-pure/const function calls, EH statements, volatile ASMs etc.  */
+      for (gsi = gsi_start_bb (loop->latch); !gsi_end_p (gsi); gsi_next (&gsi))
+	if (gimple_has_side_effects (gsi_stmt (gsi)))
+	   return NULL;
+      return edge_to_cancel;
+    }
+  VEC_free (edge, heap, exits);
+  return NULL;
+}
+
 /* Tries to unroll LOOP completely, i.e. NITER times.
    UL determines which loops we are allowed to unroll.
-   EXIT is the exit of the loop that should be eliminated.  */
+   EXIT is the exit of the loop that should be eliminated.  
+   IRRED_INVALIDATED is used to bookkeep if information about
+   irreducible regions may become invalid as a result
+   of the transformation.  */
 
 static bool
 try_unroll_loop_completely (struct loop *loop,
 			    edge exit, tree niter,
-			    enum unroll_level ul)
+			    enum unroll_level ul,
+			    bool *irred_invalidated)
 {
   unsigned HOST_WIDE_INT n_unroll, ninsns, max_unroll, unr_insns;
   gimple cond;
   struct loop_size size;
+  bool n_unroll_found = false;
+  HOST_WIDE_INT maxiter;
+  basic_block latch;
+  edge latch_edge;
+  location_t locus;
+  int flags;
+  gimple stmt;
+  gimple_stmt_iterator gsi;
+  edge edge_to_cancel = NULL;
+  int num = loop->num;
 
-  if (loop->inner)
-    return false;
+  /* See if we proved number of iterations to be low constant.
 
-  if (!host_integerp (niter, 1))
+     EXIT is an edge that will be removed in all but last iteration of 
+     the loop.
+
+     EDGE_TO_CACNEL is an edge that will be removed from the last iteration
+     of the unrolled sequence and is expected to make the final loop not
+     rolling. 
+
+     If the number of execution of loop is determined by standard induction
+     variable test, then EXIT and EDGE_TO_CANCEL are the two edges leaving
+     from the iv test.  */
+  if (host_integerp (niter, 1))
+    {
+      n_unroll = tree_low_cst (niter, 1);
+      n_unroll_found = true;
+      edge_to_cancel = EDGE_SUCC (exit->src, 0);
+      if (edge_to_cancel == exit)
+	edge_to_cancel = EDGE_SUCC (exit->src, 1);
+    }
+  /* We do not know the number of iterations and thus we can not eliminate
+     the EXIT edge.  */
+  else
+    exit = NULL;
+
+  /* See if we can improve our estimate by using recorded loop bounds.  */
+  maxiter = max_loop_iterations_int (loop);
+  if (maxiter >= 0
+      && (!n_unroll_found || (unsigned HOST_WIDE_INT)maxiter < n_unroll))
+    {
+      n_unroll = maxiter;
+      n_unroll_found = true;
+      /* Loop terminates before the IV variable test, so we can not
+	 remove it in the last iteration.  */
+      edge_to_cancel = NULL;
+    }
+
+  if (!n_unroll_found)
     return false;
-  n_unroll = tree_low_cst (niter, 1);
 
   max_unroll = PARAM_VALUE (PARAM_MAX_COMPLETELY_PEEL_TIMES);
   if (n_unroll > max_unroll)
     return false;
 
+  if (!edge_to_cancel)
+    edge_to_cancel = loop_edge_to_cancel (loop);
+
   if (n_unroll)
     {
+      sbitmap wont_exit;
+      edge e;
+      unsigned i;
+      VEC (edge, heap) *to_remove = NULL;
       if (ul == UL_SINGLE_ITER)
 	return false;
 
-      tree_estimate_loop_size (loop, exit, &size);
+      tree_estimate_loop_size (loop, exit, edge_to_cancel, &size);
       ninsns = size.overall;
 
       unr_insns = estimated_unrolled_size (&size, n_unroll);
@@ -354,6 +479,18 @@  try_unroll_loop_completely (struct loop 
 		   (int) unr_insns);
 	}
 
+      /* We unroll only inner loops, because we do not consider it profitable
+	 otheriwse.  We still can cancel loopback edge of not rolling loop;
+	 this is always a good idea.  */
+      if (loop->inner && unr_insns > ninsns)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "Not unrolling loop %d:"
+		     "it is not innermost and code would grow.\n",
+		     loop->num);
+	  return false;
+	}
+
       if (unr_insns > ninsns
 	  && (unr_insns
 	      > (unsigned) PARAM_VALUE (PARAM_MAX_COMPLETELY_PEELED_INSNS)))
@@ -369,17 +506,10 @@  try_unroll_loop_completely (struct loop 
 	  && unr_insns > ninsns)
 	{
 	  if (dump_file && (dump_flags & TDF_DETAILS))
-	    fprintf (dump_file, "Not unrolling loop %d.\n", loop->num);
+	    fprintf (dump_file, "Not unrolling loop %d: size would grow.\n",
+		     loop->num);
 	  return false;
 	}
-    }
-
-  if (n_unroll)
-    {
-      sbitmap wont_exit;
-      edge e;
-      unsigned i;
-      VEC (edge, heap) *to_remove = NULL;
 
       initialize_original_copy_tables ();
       wont_exit = sbitmap_alloc (n_unroll + 1);
@@ -408,15 +538,67 @@  try_unroll_loop_completely (struct loop 
       free_original_copy_tables ();
     }
 
-  cond = last_stmt (exit->src);
-  if (exit->flags & EDGE_TRUE_VALUE)
-    gimple_cond_make_true (cond);
-  else
-    gimple_cond_make_false (cond);
-  update_stmt (cond);
+  /* Remove the conditional from the last copy of the loop.  */
+  if (edge_to_cancel)
+    {
+      cond = last_stmt (edge_to_cancel->src);
+      if (edge_to_cancel->flags & EDGE_TRUE_VALUE)
+	gimple_cond_make_false (cond);
+      else
+	gimple_cond_make_true (cond);
+      update_stmt (cond);
+      /* Do not remove the path. Doing so may remove outer loop
+	 and confuse bookkeeping code in tree_unroll_loops_completelly.  */
+    }
+  /* We did not manage to cancel the loop.
+     The loop latch remains reachable even if it will never be reached
+     at runtime.  We must redirect it to somewhere, so create basic
+     block containg __builtin_unreachable call for this reason.  */
+  else
+    {
+      latch = loop->latch;
+      latch_edge = loop_latch_edge (loop);
+      flags = latch_edge->flags;
+      locus = latch_edge->goto_locus;
+
+      /* Unloop destroys the latch edge.  */
+      unloop (loop, irred_invalidated);
+
+      /* Create new basic block for the latch edge destination and wire
+	 it in.  */
+      stmt = gimple_build_call (builtin_decl_implicit (BUILT_IN_UNREACHABLE), 0);
+      latch_edge = make_edge (latch, create_basic_block (NULL, NULL, latch), flags);
+      latch_edge->probability = 0;
+      latch_edge->count = 0;
+      latch_edge->flags |= flags;
+      latch_edge->goto_locus = locus;
+
+      latch_edge->dest->loop_father = current_loops->tree_root;
+      latch_edge->dest->count = 0;
+      latch_edge->dest->frequency = 0;
+      set_immediate_dominator (CDI_DOMINATORS, latch_edge->dest, latch_edge->src);
+
+      gsi = gsi_start_bb (latch_edge->dest);
+      gsi_insert_after (&gsi, stmt, GSI_NEW_STMT);
+    }
 
   if (dump_file && (dump_flags & TDF_DETAILS))
-    fprintf (dump_file, "Unrolled loop %d completely.\n", loop->num);
+    {
+      if (!n_unroll)
+        fprintf (dump_file, "Turned loop %d to non-loop; it never loops.\n",
+		 num);
+      else
+        fprintf (dump_file, "Unrolled loop %d completely "
+		 "(duplicated %i times).\n", num, (int)n_unroll);
+      if (exit)
+        fprintf (dump_file, "Exit condition of peeled iterations was "
+		 "eliminated.\n");
+      if (edge_to_cancel)
+        fprintf (dump_file, "Last iteration exit edge was proved true.\n");
+      else
+        fprintf (dump_file, "Latch of last iteration was marked by "
+		 "__builtin_unreachable ().\n");
+    }
 
   return true;
 }
@@ -425,12 +608,15 @@  try_unroll_loop_completely (struct loop 
    CREATE_IV is true if we may create a new iv.  UL determines
    which loops we are allowed to completely unroll.  If TRY_EVAL is true, we try
    to determine the number of iterations of a loop by direct evaluation.
-   Returns true if cfg is changed.  */
+   Returns true if cfg is changed.  
+
+   IRRED_INVALIDATED is used to keep if irreducible reginos needs to be recomputed.  */
 
 static bool
 canonicalize_loop_induction_variables (struct loop *loop,
 				       bool create_iv, enum unroll_level ul,
-				       bool try_eval)
+				       bool try_eval,
+				       bool *irred_invalidated)
 {
   edge exit = NULL;
   tree niter;
@@ -455,22 +641,34 @@  canonicalize_loop_induction_variables (s
 	      || TREE_CODE (niter) != INTEGER_CST))
 	niter = find_loop_niter_by_eval (loop, &exit);
 
-      if (chrec_contains_undetermined (niter)
-	  || TREE_CODE (niter) != INTEGER_CST)
-	return false;
+      if (TREE_CODE (niter) != INTEGER_CST)
+	exit = NULL;
     }
 
-  if (dump_file && (dump_flags & TDF_DETAILS))
+  /* We work exceptionally hard here to estimate the bound
+     by find_loop_niter_by_eval.  Be sure to keep it for future.  */
+  if (niter && TREE_CODE (niter) == INTEGER_CST)
+    record_niter_bound (loop, tree_to_double_int (niter), false, true);
+
+  if (dump_file && (dump_flags & TDF_DETAILS)
+      && TREE_CODE (niter) == INTEGER_CST)
     {
       fprintf (dump_file, "Loop %d iterates ", loop->num);
       print_generic_expr (dump_file, niter, TDF_SLIM);
       fprintf (dump_file, " times.\n");
     }
+  if (dump_file && (dump_flags & TDF_DETAILS)
+      && max_loop_iterations_int (loop) >= 0)
+    {
+      fprintf (dump_file, "Loop %d iterates at most %i times.\n", loop->num,
+	       (int)max_loop_iterations_int (loop));
+    }
 
-  if (try_unroll_loop_completely (loop, exit, niter, ul))
+  if (try_unroll_loop_completely (loop, exit, niter, ul, irred_invalidated))
     return true;
 
-  if (create_iv)
+  if (create_iv
+      && niter && !chrec_contains_undetermined (niter))
     create_canonical_iv (loop, exit, niter);
 
   return false;
@@ -485,15 +683,21 @@  canonicalize_induction_variables (void)
   loop_iterator li;
   struct loop *loop;
   bool changed = false;
+  bool irred_invalidated = false;
 
   FOR_EACH_LOOP (li, loop, 0)
     {
       changed |= canonicalize_loop_induction_variables (loop,
 							true, UL_SINGLE_ITER,
-							true);
+							true,
+							&irred_invalidated);
     }
   gcc_assert (!need_ssa_update_p (cfun));
 
+  if (irred_invalidated
+      && loops_state_satisfies_p (LOOPS_HAVE_MARKED_IRREDUCIBLE_REGIONS))
+    mark_irreducible_loops ();
+
   /* Clean up the information about numbers of iterations, since brute force
      evaluation could reveal new information.  */
   scev_reset ();
@@ -594,9 +798,10 @@  tree_unroll_loops_completely (bool may_i
 
   do
     {
+      bool irred_invalidated = false;
       changed = false;
 
-      FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
+      FOR_EACH_LOOP (li, loop, 0)
 	{
 	  struct loop *loop_father = loop_outer (loop);
 
@@ -609,7 +814,8 @@  tree_unroll_loops_completely (bool may_i
 	    ul = UL_NO_GROWTH;
 
 	  if (canonicalize_loop_induction_variables (loop, false, ul,
-						     !flag_tree_loop_ivcanon))
+						     !flag_tree_loop_ivcanon,
+						     &irred_invalidated))
 	    {
 	      changed = true;
 	      /* If we'll continue unrolling, we need to propagate constants
@@ -629,6 +835,10 @@  tree_unroll_loops_completely (bool may_i
 	  struct loop **iter;
 	  unsigned i;
 
+	  if (irred_invalidated
+	      && loops_state_satisfies_p (LOOPS_HAVE_MARKED_IRREDUCIBLE_REGIONS))
+	    mark_irreducible_loops ();
+
 	  update_ssa (TODO_update_ssa);
 
 	  /* Propagate the constants within the new basic blocks.  */
Index: tree.c
===================================================================
--- tree.c	(revision 192483)
+++ tree.c	(working copy)
@@ -9524,6 +9524,15 @@  build_common_builtin_nodes (void)
   tree tmp, ftype;
   int ecf_flags;
 
+  if (!builtin_decl_explicit_p (BUILT_IN_UNREACHABLE))
+    {
+      ftype = build_function_type (void_type_node, void_list_node);
+      local_define_builtin ("__builtin_unreachable", ftype, BUILT_IN_UNREACHABLE,
+			    "__builtin_unreachable",
+			    ECF_NOTHROW | ECF_LEAF | ECF_NORETURN
+			    | ECF_CONST | ECF_LEAF);
+    }
+
   if (!builtin_decl_explicit_p (BUILT_IN_MEMCPY)
       || !builtin_decl_explicit_p (BUILT_IN_MEMMOVE))
     {
Index: cfgloop.h
===================================================================
--- cfgloop.h	(revision 192483)
+++ cfgloop.h	(working copy)
@@ -320,7 +321,8 @@  extern struct loop *loopify (edge, edge,
 struct loop * loop_version (struct loop *, void *,
 			    basic_block *, unsigned, unsigned, unsigned, bool);
 extern bool remove_path (edge);
-void scale_loop_frequencies (struct loop *, int, int);
+extern void unloop (struct loop *, bool *);
+extern void scale_loop_frequencies (struct loop *, int, int);
 
 /* Induction variable analysis.  */
 
Index: cfgloopmanip.c
===================================================================
--- cfgloopmanip.c	(revision 192483)
+++ cfgloopmanip.c	(working copy)
@@ -37,7 +37,6 @@  static int find_path (edge, basic_block 
 static void fix_loop_placements (struct loop *, bool *);
 static bool fix_bb_placement (basic_block);
 static void fix_bb_placements (basic_block, bool *);
-static void unloop (struct loop *, bool *);
 
 /* Checks whether basic block BB is dominated by DATA.  */
 static bool
@@ -895,7 +894,7 @@  loopify (edge latch_edge, edge header_ed
    If this may cause the information about irreducible regions to become
    invalid, IRRED_INVALIDATED is set to true.  */
 
-static void
+void
 unloop (struct loop *loop, bool *irred_invalidated)
 {
   basic_block *body;
Index: testsuite/gcc.target/i386/l_fma_float_5.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_float_5.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_float_5.c	(working copy)
@@ -12,7 +12,7 @@ 
 /* { dg-final { scan-assembler-times "vfmsub132ps" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmadd132ps" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132ps" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfmsub132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132ss" 16  } } */
+/* { dg-final { scan-assembler-times "vfmadd132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfmsub132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132ss" 72  } } */
Index: testsuite/gcc.target/i386/l_fma_double_4.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_double_4.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_double_4.c	(working copy)
@@ -12,7 +12,7 @@ 
 /* { dg-final { scan-assembler-times "vfmsub132pd" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmadd132pd" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132pd" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfmsub132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132sd" 16  } } */
+/* { dg-final { scan-assembler-times "vfmadd132sd" 40  } } */
+/* { dg-final { scan-assembler-times "vfmsub132sd" 40  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132sd" 40  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132sd" 40  } } */
Index: testsuite/gcc.target/i386/l_fma_float_2.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_float_2.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_float_2.c	(working copy)
@@ -12,7 +12,7 @@ 
 /* { dg-final { scan-assembler-times "vfmsub132ps" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmadd132ps" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132ps" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfmsub132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132ss" 16  } } */
+/* { dg-final { scan-assembler-times "vfmadd132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfmsub132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132ss" 72  } } */
Index: testsuite/gcc.target/i386/l_fma_float_6.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_float_6.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_float_6.c	(working copy)
@@ -12,7 +12,7 @@ 
 /* { dg-final { scan-assembler-times "vfmsub132ps" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmadd132ps" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132ps" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfmsub132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132ss" 16  } } */
+/* { dg-final { scan-assembler-times "vfmadd132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfmsub132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132ss" 72  } } */
Index: testsuite/gcc.target/i386/l_fma_double_1.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_double_1.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_double_1.c	(working copy)
@@ -16,11 +16,11 @@ 
 /* { dg-final { scan-assembler-times "vfnmadd231pd" 4  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132pd" 4  } } */
 /* { dg-final { scan-assembler-times "vfnmsub231pd" 4  } } */
-/* { dg-final { scan-assembler-times "vfmadd132sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd213sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfmsub132sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfmsub213sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmadd213sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmsub213sd" 8  } } */
+/* { dg-final { scan-assembler-times "vfmadd132sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfmadd213sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfmsub132sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfmsub213sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfnmadd213sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfnmsub213sd" 20  } } */
Index: testsuite/gcc.target/i386/l_fma_double_5.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_double_5.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_double_5.c	(working copy)
@@ -12,7 +12,7 @@ 
 /* { dg-final { scan-assembler-times "vfmsub132pd" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmadd132pd" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132pd" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfmsub132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132sd" 16  } } */
+/* { dg-final { scan-assembler-times "vfmadd132sd" 40  } } */
+/* { dg-final { scan-assembler-times "vfmsub132sd" 40  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132sd" 40  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132sd" 40  } } */
Index: testsuite/gcc.target/i386/l_fma_float_3.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_float_3.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_float_3.c	(working copy)
@@ -16,11 +16,11 @@ 
 /* { dg-final { scan-assembler-times "vfnmadd231ps" 4  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132ps" 4  } } */
 /* { dg-final { scan-assembler-times "vfnmsub231ps" 4  } } */
-/* { dg-final { scan-assembler-times "vfmadd132ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd213ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfmsub132ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfmsub213ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmadd213ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmsub213ss" 8  } } */
+/* { dg-final { scan-assembler-times "vfmadd132ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfmadd213ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfmsub132ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfmsub213ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfnmadd213ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfnmsub213ss" 36  } } */
Index: testsuite/gcc.target/i386/l_fma_double_2.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_double_2.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_double_2.c	(working copy)
@@ -12,7 +12,7 @@ 
 /* { dg-final { scan-assembler-times "vfmsub132pd" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmadd132pd" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132pd" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfmsub132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132sd" 16  } } */
+/* { dg-final { scan-assembler-times "vfmadd132sd" 40  } } */
+/* { dg-final { scan-assembler-times "vfmsub132sd" 40  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132sd" 40  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132sd" 40  } } */
Index: testsuite/gcc.target/i386/l_fma_double_6.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_double_6.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_double_6.c	(working copy)
@@ -12,7 +12,7 @@ 
 /* { dg-final { scan-assembler-times "vfmsub132pd" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmadd132pd" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132pd" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfmsub132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132sd" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132sd" 16  } } */
+/* { dg-final { scan-assembler-times "vfmadd132sd" 40 } } */
+/* { dg-final { scan-assembler-times "vfmsub132sd" 40  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132sd" 40  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132sd" 40  } } */
Index: testsuite/gcc.target/i386/l_fma_float_4.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_float_4.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_float_4.c	(working copy)
@@ -12,7 +12,7 @@ 
 /* { dg-final { scan-assembler-times "vfmsub132ps" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmadd132ps" 8  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132ps" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfmsub132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132ss" 16  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132ss" 16  } } */
+/* { dg-final { scan-assembler-times "vfmadd132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfmsub132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132ss" 72  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132ss" 72  } } */
Index: testsuite/gcc.target/i386/l_fma_double_3.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_double_3.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_double_3.c	(working copy)
@@ -16,11 +16,11 @@ 
 /* { dg-final { scan-assembler-times "vfnmadd231pd" 4  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132pd" 4  } } */
 /* { dg-final { scan-assembler-times "vfnmsub231pd" 4  } } */
-/* { dg-final { scan-assembler-times "vfmadd132sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd213sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfmsub132sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfmsub213sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmadd213sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132sd" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmsub213sd" 8  } } */
+/* { dg-final { scan-assembler-times "vfmadd132sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfmadd213sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfmsub132sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfmsub213sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfnmadd213sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132sd" 20  } } */
+/* { dg-final { scan-assembler-times "vfnmsub213sd" 20  } } */
Index: testsuite/gcc.target/i386/l_fma_float_1.c
===================================================================
--- testsuite/gcc.target/i386/l_fma_float_1.c	(revision 192483)
+++ testsuite/gcc.target/i386/l_fma_float_1.c	(working copy)
@@ -16,11 +16,11 @@ 
 /* { dg-final { scan-assembler-times "vfnmadd231ps" 4  } } */
 /* { dg-final { scan-assembler-times "vfnmsub132ps" 4  } } */
 /* { dg-final { scan-assembler-times "vfnmsub231ps" 4  } } */
-/* { dg-final { scan-assembler-times "vfmadd132ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfmadd213ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfmsub132ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfmsub213ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmadd132ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmadd213ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmsub132ss" 8  } } */
-/* { dg-final { scan-assembler-times "vfnmsub213ss" 8  } } */
+/* { dg-final { scan-assembler-times "vfmadd132ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfmadd213ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfmsub132ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfmsub213ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfnmadd132ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfnmadd213ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfnmsub132ss" 36  } } */
+/* { dg-final { scan-assembler-times "vfnmsub213ss" 36  } } */
Index: testsuite/gfortran.dg/do_1.f90
===================================================================
--- testsuite/gfortran.dg/do_1.f90	(revision 192483)
+++ testsuite/gfortran.dg/do_1.f90	(working copy)
@@ -1,4 +1,5 @@ 
-! { dg-do run }
+! { dg-do run { xfail *-*-* } }
+! XFAIL is tracked in PR 54932
 ! Program to check corner cases for DO statements.
 program do_1
   implicit none
Index: testsuite/gcc.dg/tree-ssa/cunroll-1.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/cunroll-1.c	(revision 0)
+++ testsuite/gcc.dg/tree-ssa/cunroll-1.c	(revision 0)
@@ -0,0 +1,13 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
+int a[2];
+test(int c)
+{ 
+  int i;
+  for (i=0;i<c;i++)
+    a[i]=5;
+}
+/* Array bounds says the loop will not roll much.  */
+/* { dg-final { scan-tree-dump "Unrolled loop 1 completely .duplicated 1 times.." "cunroll"} } */
+/* { dg-final { scan-tree-dump "Last iteration exit edge was proved true." "cunroll"} } */
+/* { dg-final { cleanup-tree-dump "cunroll" } } */
Index: testsuite/gcc.dg/tree-ssa/cunroll-2.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/cunroll-2.c	(revision 0)
+++ testsuite/gcc.dg/tree-ssa/cunroll-2.c	(revision 0)
@@ -0,0 +1,16 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
+int a[2];
+test(int c)
+{ 
+  int i;
+  for (i=0;i<c;i++)
+    {
+      a[i]=5;
+      if (test2())
+	return;
+    }
+}
+/* We are not able to get rid of the final conditional because the loop has two exits.  */
+/* { dg-final { scan-tree-dump "Unrolled loop 1 completely .duplicated 2 times.." "cunroll"} } */
+/* { dg-final { cleanup-tree-dump "cunroll" } } */
Index: testsuite/gcc.dg/tree-ssa/cunroll-3.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/cunroll-3.c	(revision 0)
+++ testsuite/gcc.dg/tree-ssa/cunroll-3.c	(revision 0)
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-cunrolli-details" } */
+int a[1];
+test(int c)
+{ 
+  int i;
+  for (i=0;i<c;i++)
+    {
+      a[i]=5;
+    }
+}
+/* If we start duplicating headers prior curoll, this loop will have 0 iterations.  */
+
+/* { dg-final { scan-tree-dump "Unrolled loop 1 completely .duplicated 1 times.." "cunrolli"} } */
+/* { dg-final { cleanup-tree-dump "cunrolli" } } */
Index: testsuite/gcc.dg/tree-ssa/cunroll-4.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/cunroll-4.c	(revision 0)
+++ testsuite/gcc.dg/tree-ssa/cunroll-4.c	(revision 0)
@@ -0,0 +1,21 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
+int a[1];
+test(int c)
+{ 
+  int i=0,j;
+  for (i=0;i<c;i++)
+    {
+      for (j=0;j<c;j++)
+	{
+          a[i]=5;
+	  test2();
+	}
+    }
+}
+
+/* We should do this as part of cunrolli, but our cost model do not take into account early exit
+   from the last iteration.  */
+/* { dg-final { scan-tree-dump "Turned loop 1 to non-loop; it never loops." "cunrolli"} } */
+/* { dg-final { scan-tree-dump "Last iteration exit edge was proved true." "cunrolli"} } */
+/* { dg-final { cleanup-tree-dump "cunroll" } } */
Index: testsuite/gcc.dg/tree-ssa/cunroll-5.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/cunroll-5.c	(revision 0)
+++ testsuite/gcc.dg/tree-ssa/cunroll-5.c	(revision 0)
@@ -0,0 +1,14 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O3 -fdump-tree-cunroll-details" } */
+int *a;
+test(int c)
+{ 
+  int i;
+  for (i=0;i<6;i++)
+    a[i]=5;
+}
+/* Basic testcase for complette unrolling.  */
+/* { dg-final { scan-tree-dump "Unrolled loop 1 completely .duplicated 5 times.." "cunroll"} } */
+/* { dg-final { scan-tree-dump "Exit condition of peeled iterations was eliminated." "cunroll"} } */
+/* { dg-final { scan-tree-dump "Last iteration exit edge was proved true." "cunroll"} } */
+/* { dg-final { cleanup-tree-dump "cunroll" } } */
Index: testsuite/gcc.dg/tree-ssa/ldist-17.c
===================================================================
--- testsuite/gcc.dg/tree-ssa/ldist-17.c	(revision 192483)
+++ testsuite/gcc.dg/tree-ssa/ldist-17.c	(working copy)
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details" } */
+/* { dg-options "-O2 -ftree-loop-distribution -ftree-loop-distribute-patterns -fdump-tree-ldist-details fdisable-tree-cunroll -fdisable-tree-cunrolli" } */
 
 typedef int mad_fixed_t;
 struct mad_pcm