[2/2] loops: Invoke lim after successful loop interchange

Message ID ri6pn4mz4j9.fsf@suse.cz
State New
Series [1/2] cfgloop: Extend loop iteration macros to loop only over sub-loops

Commit Message

Martin Jambor Nov. 9, 2020, 7:58 p.m. UTC
Hi,

this patch modifies the loop invariant pass so that it can operate
only on a single requested loop and its sub-loops and ignore the rest
of the function, much like it currently ignores basic blocks that are
not in any real loop.  The patch then invokes it from within the loop
interchange pass whenever that pass successfully swaps two loops.  This
avoids the non-LTO -Ofast run-time regressions of 410.bwaves and
503.bwaves_r (making them 19% and 15% faster than current master on an
AMD zen2 machine) while not introducing a full LIM pass into the pass
pipeline.

I have not modified the LIM data structures, which means the pass
still contains vectors indexed by loop->num even though only a single
loop nest is actually processed.  I also did not replace the uses of
pre_and_rev_post_order_compute_fn with a function that would compute a
postorder only for a given loop.  I can of course do so if the
approach is otherwise deemed viable.
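
For illustration only, here is a rough sketch of what such a
loop-restricted postorder could look like -- it is not part of the
patch and loop_rev_post_order is a made-up name (the edge rescanning
makes it quadratic in the worst case; it is kept simple on purpose):

  /* Store the indices of the blocks of LOOP into RPO in reverse
     post-order, never leaving the loop body.  Returns the number of
     blocks written, which is always LOOP->num_nodes.  */
  static unsigned
  loop_rev_post_order (class loop *loop, int *rpo)
  {
    auto_bitmap visited;
    auto_vec<basic_block> stack;
    unsigned pos = loop->num_nodes;

    stack.safe_push (loop->header);
    bitmap_set_bit (visited, loop->header->index);
    while (!stack.is_empty ())
      {
        basic_block bb = stack.last ();
        bool pushed = false;
        edge e;
        edge_iterator ei;
        FOR_EACH_EDGE (e, ei, bb->succs)
          /* Descend into successors that are in LOOP and not yet
             visited; bitmap_set_bit returns true for new bits.  */
          if (flow_bb_inside_loop_p (loop, e->dest)
              && bitmap_set_bit (visited, e->dest->index))
            {
              stack.safe_push (e->dest);
              pushed = true;
              break;
            }
        if (!pushed)
          {
            stack.pop ();
            rpo[--pos] = bb->index;  /* post-order, filled backwards.  */
          }
      }
    return loop->num_nodes - pos;
  }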

The patch adds one additional global variable requested_loop to the
pass and then at various places behaves differently when it is set.  I
was considering storing the fake root loop into it for normal
operation, but since this loop often requires special handling anyway,
I came to the conclusion that the code would actually end up less
straightforward.

I have bootstrapped and tested the patch on x86_64-linux and a very
similar one on aarch64-linux.  I have also tested it by modifying the
tree_ssa_lim function to run loop_invariant_motion_from_loop on each
real outermost loop in a function and this variant also passed
bootstrap and all tests, including dump scans, of all languages.

I have built the entire SPEC 2006 FPrate suite while monitoring the
activity of the LIM pass without and with the patch (on top of commit
b642fca1c31, with which 526.blender_r and 538.imagick_r seemed to be
failing) and it only examined 0.2% more loops, 0.02% more BBs and an
even smaller percentage of statements, because it is invoked only in a
rather special circumstance.  But the patch allows for more such
need-based uses at hopefully reasonable cost.

Since I do not have much experience with loop optimizers, I expect
that there will be requests to adjust the patch during the review.
Still, it fixes a performance regression against GCC 9 and so I hope
to address the concerns in time to get it into GCC 11.

Thanks,

Martin


gcc/ChangeLog:

2020-11-08  Martin Jambor  <mjambor@suse.cz>

	* gimple-loop-interchange.cc (pass_linterchange::execute): Call
	loop_invariant_motion_from_loop on affected loop nests.
	* tree-ssa-loop-im.c (requested_loop): New variable.
	(get_topmost_lim_loop): New function.
	(outermost_invariant_loop): Use it, cap discovered topmost loop at
	requested_loop.
	(determine_max_movement): Use get_topmost_lim_loop.
	(set_level): Assert that the selected loop is not outside of
	requested_loop.
	(compute_invariantness): Do not process loops outside of
	requested_loop, if non-NULL.
	(move_computations_worker): Likewise.
	(mark_ref_stored): Stop iteration at requested_loop, if non-NULL.
	(mark_ref_loaded): Likewise.
	(analyze_memory_references): If non-NULL, only process basic
	blocks and loops in requested_loop.  Compute contains_call bitmap.
	(do_store_motion): Only process requested_loop if non-NULL.
	(fill_always_executed_in): Likewise.  Also accept contains_call as
	a parameter rather than computing it.
	(tree_ssa_lim_initialize): New parameter which is stored into
	requested_loop.  Additional dumping.  Only initialize
	bb_loop_postorder for loops within requested_loop, if non-NULL.
	(tree_ssa_lim_finalize): Clear requested_loop, additional dumping.
	(loop_invariant_motion_from_loop): New function.
	(tree_ssa_lim): Move all functionality to
	loop_invariant_motion_from_loop, call it.
	* tree-ssa-loop-manip.h (loop_invariant_motion_from_loop): Declare.

---
 gcc/gimple-loop-interchange.cc |  30 +++++-
 gcc/tree-ssa-loop-im.c         | 176 ++++++++++++++++++++++++---------
 gcc/tree-ssa-loop-manip.h      |   2 +
 3 files changed, 156 insertions(+), 52 deletions(-)

Comments

Richard Biener Nov. 11, 2020, 9:52 a.m. UTC | #1
On Mon, 9 Nov 2020, Martin Jambor wrote:

> Hi,
> 
> this patch modifies the loop invariant pass so that it can operate
> only on a single requested loop and its sub-loops and ignore the rest
> of the function, much like it currently ignores basic blocks that are
> not in any real loop.  The patch then invokes it from within the loop
> interchange pass whenever that pass successfully swaps two loops.  This
> avoids the non-LTO -Ofast run-time regressions of 410.bwaves and
> 503.bwaves_r (making them 19% and 15% faster than current master on an
> AMD zen2 machine) while not introducing a full LIM pass into the pass
> pipeline.
> 
> I have not modified the LIM data structures, which means the pass
> still contains vectors indexed by loop->num even though only a single
> loop nest is actually processed.  I also did not replace the uses of
> pre_and_rev_post_order_compute_fn with a function that would compute a
> postorder only for a given loop.  I can of course do so if the
> approach is otherwise deemed viable.
> 
> The patch adds one additional global variable requested_loop to the
> pass and then at various places behaves differently when it is set.  I
> was considering storing the fake root loop into it for normal
> operation, but since this loop often requires special handling anyway,
> I came to the conclusion that the code would actually end up less
> straightforward.
> 
> I have bootstrapped and tested the patch on x86_64-linux and a very
> similar one on aarch64-linux.  I have also tested it by modifying the
> tree_ssa_lim function to run loop_invariant_motion_from_loop on each
> real outermost loop in a function and this variant also passed
> bootstrap and all tests, including dump scans, of all languages.
> 
> I have built the entire SPEC 2006 FPrate suite while monitoring the
> activity of the LIM pass without and with the patch (on top of commit
> b642fca1c31, with which 526.blender_r and 538.imagick_r seemed to be
> failing) and it only examined 0.2% more loops, 0.02% more BBs and an
> even smaller percentage of statements, because it is invoked only in a
> rather special circumstance.  But the patch allows for more such
> need-based uses at hopefully reasonable cost.
> 
> Since I do not have much experience with loop optimizers, I expect
> that there will be requests to adjust the patch during the review.
> Still, it fixes a performance regression against GCC 9 and so I hope
> to address the concerns in time to get it into GCC 11.
> 
> Thanks,
> 
> Martin
> 
> 
> gcc/ChangeLog:
> 
> 2020-11-08  Martin Jambor  <mjambor@suse.cz>
> 
> 	* gimple-loop-interchange.cc (pass_linterchange::execute): Call
> 	loop_invariant_motion_from_loop on affected loop nests.
> 	* tree-ssa-loop-im.c (requested_loop): New variable.
> 	(get_topmost_lim_loop): New function.
> 	(outermost_invariant_loop): Use it, cap discovered topmost loop at
> 	requested_loop.
> 	(determine_max_movement): Use get_topmost_lim_loop.
> 	(set_level): Assert that the selected loop is not outside of
> 	requested_loop.
> 	(compute_invariantness): Do not process loops outside of
> 	requested_loop, if non-NULL.
> 	(move_computations_worker): Likewise.
> 	(mark_ref_stored): Stop iteration at requested_loop, if non-NULL.
> 	(mark_ref_loaded): Likewise.
> 	(analyze_memory_references): If non-NULL, only process basic
> 	blocks and loops in requested_loop.  Compute contains_call bitmap.
> 	(do_store_motion): Only process requested_loop if non-NULL.
> 	(fill_always_executed_in): Likewise.  Also accept contains_call as
> 	a parameter rather than computing it.
> 	(tree_ssa_lim_initialize): New parameter which is stored into
> 	requested_loop.  Additional dumping.  Only initialize
> 	bb_loop_postorder for loops within requested_loop, if non-NULL.
> 	(tree_ssa_lim_finalize): Clear requested_loop, additional dumping.
> 	(loop_invariant_motion_from_loop): New function.
> 	(tree_ssa_lim): Move all functionality to
> 	loop_invariant_motion_from_loop, call it.
> 	* tree-ssa-loop-manip.h (loop_invariant_motion_from_loop): Declare.
> 
> ---
>  gcc/gimple-loop-interchange.cc |  30 +++++-
>  gcc/tree-ssa-loop-im.c         | 176 ++++++++++++++++++++++++---------
>  gcc/tree-ssa-loop-manip.h      |   2 +
>  3 files changed, 156 insertions(+), 52 deletions(-)
> 
> diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
> index 1656004ecf0..8c376228779 100644
> --- a/gcc/gimple-loop-interchange.cc
> +++ b/gcc/gimple-loop-interchange.cc
> @@ -2068,6 +2068,7 @@ pass_linterchange::execute (function *fun)
>      return 0;
>  
>    bool changed_p = false;
> +  auto_vec<class loop *, 4> loops_to_lim;
>    class loop *loop;
>    FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
>      {
> @@ -2077,7 +2078,11 @@ pass_linterchange::execute (function *fun)
>        if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
>  	{
>  	  tree_loop_interchange loop_interchange (loop_nest);
> -	  changed_p |= loop_interchange.interchange (datarefs, ddrs);
> +	  if (loop_interchange.interchange (datarefs, ddrs))
> +	    {
> +	      changed_p = true;
> +	      loops_to_lim.safe_push (loop_nest[0]);
> +	    }
>  	}
>        free_dependence_relations (ddrs);
>        free_data_refs_with_aux (datarefs);
> @@ -2085,8 +2090,27 @@ pass_linterchange::execute (function *fun)
>      }
>  
>    if (changed_p)
> -    scev_reset ();
> -  return changed_p ? (TODO_update_ssa_only_virtuals) : 0;
> +    {
> +      unsigned todo = TODO_update_ssa_only_virtuals;
> +      unsigned len  = loops_to_lim.length ();
> +      for (unsigned i = len - 1; i > 0; i--)
> +	for (int j = i - 1; j >= 0; j--)
> +	  if (loops_to_lim[j] == loops_to_lim[i]
> +	      || flow_loop_nested_p (loops_to_lim[j], loops_to_lim[i]))

I think this all should never happen, so you can remove the pruning
of loops_to_lim.

> +	    {
> +	      loops_to_lim.pop ();
> +	      break;
> +	    }
> +
> +      len  = loops_to_lim.length ();
> +      for (unsigned i = 0; i < len; i++)
> +	todo |= loop_invariant_motion_from_loop (cfun, loops_to_lim[i]);
> +
> +      scev_reset ();
> +      return todo;
> +    }
> +  else
> +    return 0;
>  }
>  
>  } // anon namespace
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 6bb07e133cd..24b541dfb17 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -244,6 +244,11 @@ static struct
>  static bitmap_obstack lim_bitmap_obstack;
>  static obstack mem_ref_obstack;
>  
> +/* If LIM has been requested only for a particular loop (and all of the
> +   loops enclosed within it), this variable points to it; otherwise it is
> +   NULL and the pass processes the whole function.  */
> +static class loop *requested_loop;
> +
>  static bool ref_indep_loop_p (class loop *, im_mem_ref *, dep_kind);
>  static bool ref_always_accessed_p (class loop *, im_mem_ref *, bool);
>  static bool refs_independent_p (im_mem_ref *, im_mem_ref *, bool = true);
> @@ -410,6 +415,21 @@ movement_possibility (gimple *stmt)
>    return ret;
>  }
>  
> +/* Return the topmost real loop in which LOOP is nested if processing the
> +   whole function, or requested_loop otherwise.  */
> +
> +static class loop *
> +get_topmost_lim_loop (class loop *loop)
> +{
> +  if (requested_loop)
> +    {
> +      gcc_assert (loop == requested_loop
> +		  || flow_loop_nested_p (requested_loop, loop));
> +      return requested_loop;
> +    }
> +  return superloop_at_depth (loop, 1);
> +}
> +
>  /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
>     loop to that we could move the expression using DEF if it did not have
>     other operands, i.e. the outermost loop enclosing LOOP in that the value
> @@ -424,18 +444,18 @@ outermost_invariant_loop (tree def, class loop *loop)
>    struct lim_aux_data *lim_data;
>  
>    if (!def)
> -    return superloop_at_depth (loop, 1);
> +    return get_topmost_lim_loop (loop);
>  
>    if (TREE_CODE (def) != SSA_NAME)
>      {
>        gcc_assert (is_gimple_min_invariant (def));
> -      return superloop_at_depth (loop, 1);
> +      return get_topmost_lim_loop (loop);
>      }
>  
>    def_stmt = SSA_NAME_DEF_STMT (def);
>    def_bb = gimple_bb (def_stmt);
>    if (!def_bb)
> -    return superloop_at_depth (loop, 1);
> +    return get_topmost_lim_loop (loop);
>  
>    max_loop = find_common_loop (loop, def_bb->loop_father);
>  
> @@ -443,6 +463,16 @@ outermost_invariant_loop (tree def, class loop *loop)
>    if (lim_data != NULL && lim_data->max_loop != NULL)
>      max_loop = find_common_loop (max_loop,
>  				 loop_outer (lim_data->max_loop));
> +  if (requested_loop)
> +    {
> +      class loop *req_outer = loop_outer (requested_loop);
> +      if (flow_loop_nested_p (max_loop, req_outer))
> +	max_loop = req_outer;
> +      else
> +	gcc_assert (max_loop == req_outer
> +		    || flow_loop_nested_p (req_outer, max_loop));
> +    }
> +
>    if (max_loop == loop)
>      return NULL;
>    max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
> @@ -677,7 +707,7 @@ determine_max_movement (gimple *stmt, bool must_preserve_exec)
>    if (must_preserve_exec)
>      level = ALWAYS_EXECUTED_IN (bb);
>    else
> -    level = superloop_at_depth (loop, 1);
> +    level = get_topmost_lim_loop (loop);
>    lim_data->max_loop = level;
>  
>    if (gphi *phi = dyn_cast <gphi *> (stmt))
> @@ -813,6 +843,9 @@ set_level (gimple *stmt, class loop *orig_loop, class loop *level)
>  
>    gcc_assert (level == lim_data->max_loop
>  	      || flow_loop_nested_p (lim_data->max_loop, level));
> +  gcc_assert (!requested_loop
> +	      || requested_loop == lim_data->max_loop
> +	      || flow_loop_nested_p (requested_loop, level));
>  
>    lim_data->tgt_loop = level;
>    FOR_EACH_VEC_ELT (lim_data->depends, i, dep_stmt)

I'm not sure the above massaging is really required; we can
either limit the motion at the final stage, for example here, or
simply go ahead, since we already modify code outside of the
requested loop (we insert outside of it, after all).

> @@ -983,7 +1016,12 @@ compute_invariantness (basic_block bb)
>    class loop *outermost = ALWAYS_EXECUTED_IN (bb);
>    struct lim_aux_data *lim_data;
>  
> -  if (!loop_outer (bb->loop_father))
> +  if (requested_loop)
> +    {
> +      if (!flow_bb_inside_loop_p (requested_loop, bb))
> +	return;

Changing the set of blocks to iterate over is really desired.
We want region-based algorithms to be O(size-of-region), and the
walking alone makes it O(size-of-function) this way.
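
To illustrate the difference (a sketch only, not code from the patch;
process_block stands in for whatever per-block work is done):

  /* O(size-of-function): the filter alone runs once per block in FUN,
     no matter how small the requested region is.  */
  basic_block bb;
  FOR_EACH_BB_FN (bb, fun)
    {
      if (!flow_bb_inside_loop_p (requested_loop, bb))
        continue;
      process_block (bb);
    }

  /* O(size-of-region): only the blocks of REQUESTED_LOOP are ever
     visited.  */
  basic_block *body = get_loop_body (requested_loop);
  for (unsigned i = 0; i < requested_loop->num_nodes; i++)
    process_block (body[i]);
  free (body);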

> +    }
> +  else if (!loop_outer (bb->loop_father))
>      return;
>  
>    if (dump_file && (dump_flags & TDF_DETAILS))
> @@ -1122,7 +1160,12 @@ move_computations_worker (basic_block bb)
>    struct lim_aux_data *lim_data;
>    unsigned int todo = 0;
>  
> -  if (!loop_outer (bb->loop_father))
> +  if (requested_loop)
> +    {
> +      if (!flow_bb_inside_loop_p (requested_loop, bb))
> +	return todo;
> +    }
> +  else if (!loop_outer (bb->loop_father))
>      return todo;
>  
>    for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); )
> @@ -1414,7 +1457,9 @@ set_ref_stored_in_loop (im_mem_ref *ref, class loop *loop)
>  static void
>  mark_ref_stored (im_mem_ref *ref, class loop *loop)
>  {
> -  while (loop != current_loops->tree_root
> +  class loop *stop_loop = requested_loop
> +    ? loop_outer (requested_loop) : current_loops->tree_root;

I guess storing the "root loop" alongside the requested one would
simplify this.  There's also the old idea of splitting the
tree_ssa_lim () operation over the outermost loops in the function,
a refactoring that would reduce peak memory use and make your
refactoring a bit more natural, given that the main pass would then
essentially do

  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
    loop_invariant_motion_from_loop (fun, loop);

which is also a good test for proper compile-time behavior of the
per-loop operation.

> +  while (loop != stop_loop
>  	 && set_ref_stored_in_loop (ref, loop))
>      loop = loop_outer (loop);
>  }
> @@ -1435,7 +1480,9 @@ set_ref_loaded_in_loop (im_mem_ref *ref, class loop *loop)
>  static void
>  mark_ref_loaded (im_mem_ref *ref, class loop *loop)
>  {
> -  while (loop != current_loops->tree_root
> +  class loop *stop_loop = requested_loop
> +    ? loop_outer (requested_loop) : current_loops->tree_root;
> +  while (loop != stop_loop
>  	 && set_ref_loaded_in_loop (ref, loop))
>      loop = loop_outer (loop);
>  }
> @@ -1619,24 +1666,35 @@ sort_locs_in_loop_postorder_cmp (const void *loc1_, const void *loc2_,
>    return bb_loop_postorder[loop1->num] < bb_loop_postorder[loop2->num] ? -1 : 1;
>  }
>  
> -/* Gathers memory references in loops.  */
> +/* Gathers memory references in loops.  Set a bit in CONTAINS_CALL
> +   corresponding to the index of each basic block which contains a non-pure
> +   call statement.  */
>  
>  static void
> -analyze_memory_references (void)
> +analyze_memory_references (sbitmap contains_call)
>  {
>    gimple_stmt_iterator bsi;
>    basic_block bb, *bbs;
>    class loop *loop, *outer;
>    unsigned i, n;
>  
> -  /* Collect all basic-blocks in loops and sort them after their
> -     loops postorder.  */
> +  /* Collect all basic-blocks in loops (either in the whole function or just in
> +     the requested loop) and sort them after their loops postorder.  */
>    i = 0;
> -  bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
> -  FOR_EACH_BB_FN (bb, cfun)
> -    if (bb->loop_father != current_loops->tree_root)
> -      bbs[i++] = bb;
> -  n = i;
> +  if (requested_loop)
> +    {
> +      bbs = get_loop_body (requested_loop);
> +      n = requested_loop->num_nodes;
> +    }
> +  else
> +    {
> +      bbs = XNEWVEC (basic_block,
> +		     n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
> +      FOR_EACH_BB_FN (bb, cfun)
> +	if (bb->loop_father != current_loops->tree_root)
> +	  bbs[i++] = bb;
> +      n = i;
> +    }
>    gcc_sort_r (bbs, n, sizeof (basic_block), sort_bbs_in_loop_postorder_cmp,
>  	      bb_loop_postorder);
>  
> @@ -1648,7 +1706,11 @@ analyze_memory_references (void)
>      {
>        basic_block bb = bbs[i];
>        for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
> -        gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
> +	{
> +	  gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
> +	  if (nonpure_call_p (gsi_stmt (bsi)))
> +	    bitmap_set_bit (contains_call, bb->index);
> +	}
>      }
>  
>    /* Verify the list of gathered memory references is sorted after their
> @@ -1667,7 +1729,9 @@ analyze_memory_references (void)
>  
>    /* Propagate the information about accessed memory references up
>       the loop hierarchy.  */
> -  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
> +  class loop *stop_loop = requested_loop
> +    ? loop_outer (requested_loop) : current_loops->tree_root;

Having that extra "enclosing_loop" global in addition to requested_loop
would simplify all of these.
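
Concretely, something like this (just a sketch of the suggestion,
with the initialization done once in tree_ssa_lim_initialize):

  /* The loop at which upward walks and propagation stop:
     loop_outer (requested_loop) when a single loop was requested,
     current_loops->tree_root otherwise.  */
  static class loop *enclosing_loop;

  /* ... in tree_ssa_lim_initialize ...  */
  enclosing_loop = requested_loop
    ? loop_outer (requested_loop) : current_loops->tree_root;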

> +  FOR_EACH_ENCLOSED_LOOP (requested_loop, loop, LI_FROM_INNERMOST)
>      {
>        /* Finalize the overall touched references (including subloops).  */
>        bitmap_ior_into (&memory_accesses.all_refs_stored_in_loop[loop->num],
> @@ -1676,7 +1740,7 @@ analyze_memory_references (void)
>        /* Propagate the information about accessed memory references up
>  	 the loop hierarchy.  */
>        outer = loop_outer (loop);
> -      if (outer == current_loops->tree_root)
> +      if (outer == stop_loop)
>  	continue;
>  
>        bitmap_ior_into (&memory_accesses.all_refs_stored_in_loop[outer->num],
> @@ -2895,8 +2959,11 @@ do_store_motion (void)
>    class loop *loop;
>    bitmap sm_executed = BITMAP_ALLOC (&lim_bitmap_obstack);
>  
> -  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
> -    store_motion_loop (loop, sm_executed);
> +  if (requested_loop)
> +    store_motion_loop (requested_loop, sm_executed);

You should avoid store-motion in the per-loop invariant motion mode
since it requires an SSA update which is inherently O(size-of-function).

That said, in the way it's currently structured I think it's
"better" to export tree_ssa_lim () and call it from interchange
if any loop was interchanged (thus run a full pass, but conditional
on an interchange having been done).  You can make it cheaper by adding
a flag to tree_ssa_lim controlling whether to do store-motion (I guess
this might be an interesting user-visible flag as well and a
possibility to make select lim passes cheaper via a pass flag) and not
do store-motion from the interchange call.  I think that's how we
should fix the regression; refactoring LIM properly requires more work
that doesn't seem to fit the stage1 deadline.
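
To sketch the pass-flag idea -- for illustration only, this is not
what any patch in this thread does; it reuses the set_pass_param
mechanism that passes like pass_ccp already use:

  class pass_lim : public gimple_opt_pass
  {
  public:
    pass_lim (gcc::context *ctxt)
      : gimple_opt_pass (pass_data_lim, ctxt), m_store_motion (true) {}

    opt_pass * clone () { return new pass_lim (m_ctxt); }
    /* Set from NEXT_PASS_WITH_ARG (pass_lim, ...) in passes.def.  */
    void set_pass_param (unsigned int n, bool param)
      { gcc_assert (n == 0); m_store_motion = param; }
    virtual unsigned int execute (function *);

  private:
    bool m_store_motion;
  };

with selected instances in passes.def becoming
NEXT_PASS_WITH_ARG (pass_lim, false).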

Thanks,
Richard.



> +  else
> +    for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
> +      store_motion_loop (loop, sm_executed);
>  
>    BITMAP_FREE (sm_executed);
>  }
> @@ -2982,27 +3049,13 @@ fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
>     of its header implies execution of bb.  */
>  
>  static void
> -fill_always_executed_in (void)
> +fill_always_executed_in (sbitmap contains_call)
>  {
> -  basic_block bb;
>    class loop *loop;
>  
> -  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
> -  bitmap_clear (contains_call);
> -  FOR_EACH_BB_FN (bb, cfun)
> -    {
> -      gimple_stmt_iterator gsi;
> -      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
> -	{
> -	  if (nonpure_call_p (gsi_stmt (gsi)))
> -	    break;
> -	}
> -
> -      if (!gsi_end_p (gsi))
> -	bitmap_set_bit (contains_call, bb->index);
> -    }
> -
> -  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
> +  for (loop = requested_loop ? requested_loop : current_loops->tree_root->inner;
> +       loop;
> +       loop = loop->next)
>      fill_always_executed_in_1 (loop, contains_call);
>  }
>  
> @@ -3010,11 +3063,17 @@ fill_always_executed_in (void)
>  /* Compute the global information needed by the loop invariant motion pass.  */
>  
>  static void
> -tree_ssa_lim_initialize (void)
> +tree_ssa_lim_initialize (class loop *req_loop)
>  {
>    class loop *loop;
>    unsigned i;
>  
> +  gcc_assert (!req_loop || req_loop != current_loops->tree_root);
> +  requested_loop = req_loop;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS) && requested_loop)
> +    fprintf (dump_file, "Initializing LIM for loop %d.\n\n", req_loop->num);
> +
>    bitmap_obstack_initialize (&lim_bitmap_obstack);
>    gcc_obstack_init (&mem_ref_obstack);
>    lim_aux_data_map = new hash_map<gimple *, lim_aux_data *>;
> @@ -3051,7 +3110,7 @@ tree_ssa_lim_initialize (void)
>       its postorder index.  */
>    i = 0;
>    bb_loop_postorder = XNEWVEC (unsigned, number_of_loops (cfun));
> -  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
> +  FOR_EACH_ENCLOSED_LOOP (req_loop, loop, LI_FROM_INNERMOST)
>      bb_loop_postorder[loop->num] = i++;
>  }
>  
> @@ -3086,23 +3145,32 @@ tree_ssa_lim_finalize (void)
>      free_affine_expand_cache (&memory_accesses.ttae_cache);
>  
>    free (bb_loop_postorder);
> +  requested_loop = NULL;
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "LIM finished.\n\n");
>  }
>  
> -/* Moves invariants from loops.  Only "expensive" invariants are moved out --
> -   i.e. those that are likely to be win regardless of the register pressure.  */
> +/* Moves invariants from LOOP in FUN.  If LOOP is NULL, perform the
> +   transformation on all loops within FUN.  Only "expensive" invariants are
> +   moved out -- i.e. those that are likely to be a win regardless of the
> +   register pressure.  Return the pass TODO flags that need to be carried
> +   out after the transformation.  */
>  
> -static unsigned int
> -tree_ssa_lim (function *fun)
> +unsigned int
> +loop_invariant_motion_from_loop (function *fun, class loop *loop)
>  {
>    unsigned int todo = 0;
>  
> -  tree_ssa_lim_initialize ();
> +  tree_ssa_lim_initialize (loop);
>  
> +  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
> +  bitmap_clear (contains_call);
>    /* Gathers information about memory accesses in the loops.  */
> -  analyze_memory_references ();
> +  analyze_memory_references (contains_call);
>  
>    /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
> -  fill_always_executed_in ();
> +  fill_always_executed_in (contains_call);
>  
>    int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
>    int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
> @@ -3135,6 +3203,16 @@ tree_ssa_lim (function *fun)
>    return todo;
>  }
>  
> +/* Moves invariants from all loops in FUN.  Only "expensive" invariants are
> +   moved out -- i.e. those that are likely to be a win regardless of the
> +   register pressure.  */
> +
> +static unsigned int
> +tree_ssa_lim (function *fun)
> +{
> +  return loop_invariant_motion_from_loop (fun, NULL);
> +}
> +
>  /* Loop invariant motion pass.  */
>  
>  namespace {
> diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
> index e789e4fcb0b..c89a86aef4e 100644
> --- a/gcc/tree-ssa-loop-manip.h
> +++ b/gcc/tree-ssa-loop-manip.h
> @@ -56,6 +56,8 @@ extern void tree_unroll_loop (class loop *, unsigned,
>  			      edge, class tree_niter_desc *);
>  extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
>  
> +extern unsigned int loop_invariant_motion_from_loop (function *fun,
> +						     class loop *loop);
>  
>  
>  #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
>
Martin Jambor Nov. 12, 2020, 8:14 p.m. UTC | #2
Hi,

On Wed, Nov 11 2020, Richard Biener wrote:
> On Mon, 9 Nov 2020, Martin Jambor wrote:
>
>> this patch modifies the loop invariant pass so that it can operate
>> only on a single requested loop and its sub-loops and ignore the rest
>> of the function, much like it currently ignores basic blocks that are
>> not in any real loop.  The patch then invokes it from within the loop
>> interchange pass whenever that pass successfully swaps two loops.  This
>> avoids the non-LTO -Ofast run-time regressions of 410.bwaves and
>> 503.bwaves_r (making them 19% and 15% faster than current master on an
>> AMD zen2 machine) while not introducing a full LIM pass into the pass
>> pipeline.
>> 
>> I have not modified the LIM data structures, which means the pass
>> still contains vectors indexed by loop->num even though only a single
>> loop nest is actually processed.  I also did not replace the uses of
>> pre_and_rev_post_order_compute_fn with a function that would compute a
>> postorder only for a given loop.  I can of course do so if the
>> approach is otherwise deemed viable.
>> 
>> The patch adds one additional global variable requested_loop to the
>> pass and then at various places behaves differently when it is set.  I
>> was considering storing the fake root loop into it for normal
>> operation, but since this loop often requires special handling anyway,
>> I came to the conclusion that the code would actually end up less
>> straightforward.
>> 
>> I have bootstrapped and tested the patch on x86_64-linux and a very
>> similar one on aarch64-linux.  I have also tested it by modifying the
>> tree_ssa_lim function to run loop_invariant_motion_from_loop on each
>> real outermost loop in a function and this variant also passed
>> bootstrap and all tests, including dump scans, of all languages.
>> 
>> I have built the entire SPEC 2006 FPrate suite while monitoring the
>> activity of the LIM pass without and with the patch (on top of commit
>> b642fca1c31, with which 526.blender_r and 538.imagick_r seemed to be
>> failing) and it only examined 0.2% more loops, 0.02% more BBs and an
>> even smaller percentage of statements, because it is invoked only in a
>> rather special circumstance.  But the patch allows for more such
>> need-based uses at hopefully reasonable cost.
>> 
>> Since I do not have much experience with loop optimizers, I expect
>> that there will be requests to adjust the patch during the review.
>> Still, it fixes a performance regression against GCC 9 and so I hope
>> to address the concerns in time to get it into GCC 11.
>> 

[...]

>
> That said, in the way it's currently structured I think it's
> "better" to export tree_ssa_lim () and call it from interchange
> if any loop was interchanged (thus run a full pass, but conditional
> on an interchange having been done).  You can make it cheaper by
> adding a flag to tree_ssa_lim controlling whether to do store-motion
> (I guess this might be an interesting user-visible flag as well and
> a possibility to make select lim passes cheaper via a pass flag) and
> not do store-motion from the interchange call.  I think that's how
> we should fix the regression; refactoring LIM properly requires more
> work that doesn't seem to fit the stage1 deadline.
>

So just like this?  Bootstrapped and tested on x86_64-linux and I have
verified it fixes the bwaves regression.

Thanks,

Martin



gcc/ChangeLog:

2020-11-12  Martin Jambor  <mjambor@suse.cz>

	PR tree-optimization/94406
	* tree-ssa-loop-im.c (tree_ssa_lim): Renamed to
	loop_invariant_motion_in_fun, added a parameter to control store
	motion.
	(pass_lim::execute): Adjust call to tree_ssa_lim, now
	loop_invariant_motion_in_fun.
	* tree-ssa-loop-manip.h (loop_invariant_motion_in_fun): Declare.
	* gimple-loop-interchange.cc (pass_linterchange::execute): Call
	loop_invariant_motion_in_fun if any interchange has been done.
---
 gcc/gimple-loop-interchange.cc |  9 +++++++--
 gcc/tree-ssa-loop-im.c         | 12 +++++++-----
 gcc/tree-ssa-loop-manip.h      |  2 +-
 3 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
index 1656004ecf0..a36dbb49b1f 100644
--- a/gcc/gimple-loop-interchange.cc
+++ b/gcc/gimple-loop-interchange.cc
@@ -2085,8 +2085,13 @@ pass_linterchange::execute (function *fun)
     }
 
   if (changed_p)
-    scev_reset ();
-  return changed_p ? (TODO_update_ssa_only_virtuals) : 0;
+    {
+      unsigned todo = TODO_update_ssa_only_virtuals;
+      todo |= loop_invariant_motion_in_fun (cfun, false);
+      scev_reset ();
+      return todo;
+    }
+  return 0;
 }
 
 } // anon namespace
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 6bb07e133cd..3c7412737f0 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -3089,10 +3089,11 @@ tree_ssa_lim_finalize (void)
 }
 
 /* Moves invariants from loops.  Only "expensive" invariants are moved out --
-   i.e. those that are likely to be win regardless of the register pressure.  */
+   i.e. those that are likely to be a win regardless of the register pressure.
+   Only perform store motion if STORE_MOTION is true.  */
 
-static unsigned int
-tree_ssa_lim (function *fun)
+unsigned int
+loop_invariant_motion_in_fun (function *fun, bool store_motion)
 {
   unsigned int todo = 0;
 
@@ -3114,7 +3115,8 @@ tree_ssa_lim (function *fun)
 
   /* Execute store motion.  Force the necessary invariants to be moved
      out of the loops as well.  */
-  do_store_motion ();
+  if (store_motion)
+    do_store_motion ();
 
   free (rpo);
   rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
@@ -3175,7 +3177,7 @@ pass_lim::execute (function *fun)
 
   if (number_of_loops (fun) <= 1)
     return 0;
-  unsigned int todo = tree_ssa_lim (fun);
+  unsigned int todo = loop_invariant_motion_in_fun (fun, true);
 
   if (!in_loop_pipeline)
     loop_optimizer_finalize ();
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index e789e4fcb0b..d8e918ef7c9 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -55,7 +55,7 @@ extern void tree_transform_and_unroll_loop (class loop *, unsigned,
 extern void tree_unroll_loop (class loop *, unsigned,
 			      edge, class tree_niter_desc *);
 extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
-
+extern unsigned int loop_invariant_motion_in_fun (function *, bool);
 
 
 #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
Richard Biener Nov. 13, 2020, 7:27 a.m. UTC | #3
On Thu, 12 Nov 2020, Martin Jambor wrote:

> Hi,
> 
> On Wed, Nov 11 2020, Richard Biener wrote:
> > On Mon, 9 Nov 2020, Martin Jambor wrote:
> >
> >> this patch modifies the loop invariant pass so that it can operate
> >> only on a single requested loop and its sub-loops and ignore the rest
> >> of the function, much like it currently ignores basic blocks that are
> >> not in any real loop.  The patch then invokes it from within the loop
> >> interchange pass whenever that pass successfully swaps two loops.  This
> >> avoids the non-LTO -Ofast run-time regressions of 410.bwaves and
> >> 503.bwaves_r (making them 19% and 15% faster than current master on an
> >> AMD zen2 machine) while not introducing a full LIM pass into the pass
> >> pipeline.
> >> 
> >> I have not modified the LIM data structures, which means the pass
> >> still contains vectors indexed by loop->num even though only a single
> >> loop nest is actually processed.  I also did not replace the uses of
> >> pre_and_rev_post_order_compute_fn with a function that would compute a
> >> postorder only for a given loop.  I can of course do so if the
> >> approach is otherwise deemed viable.
> >> 
> >> The patch adds one additional global variable requested_loop to the
> >> pass and then at various places behaves differently when it is set.  I
> >> was considering storing the fake root loop into it for normal
> >> operation, but since this loop often requires special handling anyway,
> >> I came to the conclusion that the code would actually end up less
> >> straightforward.
> >> 
> >> I have bootstrapped and tested the patch on x86_64-linux and a very
> >> similar one on aarch64-linux.  I have also tested it by modifying the
> >> tree_ssa_lim function to run loop_invariant_motion_from_loop on each
> >> real outermost loop in a function and this variant also passed
> >> bootstrap and all tests, including dump scans, of all languages.
> >> 
> >> I have built the entire SPEC 2006 FPrate suite while monitoring the
> >> activity of the LIM pass without and with the patch (on top of commit
> >> b642fca1c31, with which 526.blender_r and 538.imagick_r seemed to be
> >> failing) and it only examined 0.2% more loops, 0.02% more BBs and an
> >> even smaller percentage of statements, because it is invoked only in a
> >> rather special circumstance.  But the patch allows for more such
> >> need-based uses at hopefully reasonable cost.
> >> 
> >> Since I do not have much experience with loop optimizers, I expect
> >> that there will be requests to adjust the patch during the review.
> >> Still, it fixes a performance regression against GCC 9 and so I hope
> >> to address the concerns in time to get it into GCC 11.
> >> 
> 
> [...]
> 
> >
> > That said, in the way it's currently structured I think it's
> > "better" to export tree_ssa_lim () and call it from interchange
> > if any loop was interchanged (thus run a full pass, but conditional
> > on an interchange having been done).  You can make it cheaper by
> > adding a flag to tree_ssa_lim controlling whether to do store-motion
> > (I guess this might be an interesting user-visible flag as well and
> > a possibility to make select lim passes cheaper via a pass flag) and
> > not do store-motion from the interchange call.  I think that's how
> > we should fix the regression; refactoring LIM properly requires more
> > work that doesn't seem to fit the stage1 deadline.
> >
> 
> So just like this?  Bootstrapped and tested on x86_64-linux and I have
> verified it fixes the bwaves regression.

OK.

Thanks,
Richard.


> Thanks,
> 
> Martin
> 
> 
> 
> gcc/ChangeLog:
> 
> 2020-11-12  Martin Jambor  <mjambor@suse.cz>
> 
> 	PR tree-optimization/94406
> 	* tree-ssa-loop-im.c (tree_ssa_lim): Renamed to
> 	loop_invariant_motion_in_fun, added a parameter to control store
> 	motion.
> 	(pass_lim::execute): Adjust call to tree_ssa_lim, now
> 	loop_invariant_motion_in_fun.
> 	* tree-ssa-loop-manip.h (loop_invariant_motion_in_fun): Declare.
> 	* gimple-loop-interchange.cc (pass_linterchange::execute): Call
> 	loop_invariant_motion_in_fun if any interchange has been done.
> ---
>  gcc/gimple-loop-interchange.cc |  9 +++++++--
>  gcc/tree-ssa-loop-im.c         | 12 +++++++-----
>  gcc/tree-ssa-loop-manip.h      |  2 +-
>  3 files changed, 15 insertions(+), 8 deletions(-)
> 
> diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
> index 1656004ecf0..a36dbb49b1f 100644
> --- a/gcc/gimple-loop-interchange.cc
> +++ b/gcc/gimple-loop-interchange.cc
> @@ -2085,8 +2085,13 @@ pass_linterchange::execute (function *fun)
>      }
>  
>    if (changed_p)
> -    scev_reset ();
> -  return changed_p ? (TODO_update_ssa_only_virtuals) : 0;
> +    {
> +      unsigned todo = TODO_update_ssa_only_virtuals;
> +      todo |= loop_invariant_motion_in_fun (cfun, false);
> +      scev_reset ();
> +      return todo;
> +    }
> +  return 0;
>  }
>  
>  } // anon namespace
> diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
> index 6bb07e133cd..3c7412737f0 100644
> --- a/gcc/tree-ssa-loop-im.c
> +++ b/gcc/tree-ssa-loop-im.c
> @@ -3089,10 +3089,11 @@ tree_ssa_lim_finalize (void)
>  }
>  
>  /* Moves invariants from loops.  Only "expensive" invariants are moved out --
> -   i.e. those that are likely to be win regardless of the register pressure.  */
> +   i.e. those that are likely to be a win regardless of the register pressure.
> +   Only perform store motion if STORE_MOTION is true.  */
>  
> -static unsigned int
> -tree_ssa_lim (function *fun)
> +unsigned int
> +loop_invariant_motion_in_fun (function *fun, bool store_motion)
>  {
>    unsigned int todo = 0;
>  
> @@ -3114,7 +3115,8 @@ tree_ssa_lim (function *fun)
>  
>    /* Execute store motion.  Force the necessary invariants to be moved
>       out of the loops as well.  */
> -  do_store_motion ();
> +  if (store_motion)
> +    do_store_motion ();
>  
>    free (rpo);
>    rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
> @@ -3175,7 +3177,7 @@ pass_lim::execute (function *fun)
>  
>    if (number_of_loops (fun) <= 1)
>      return 0;
> -  unsigned int todo = tree_ssa_lim (fun);
> +  unsigned int todo = loop_invariant_motion_in_fun (fun, true);
>  
>    if (!in_loop_pipeline)
>      loop_optimizer_finalize ();
> diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
> index e789e4fcb0b..d8e918ef7c9 100644
> --- a/gcc/tree-ssa-loop-manip.h
> +++ b/gcc/tree-ssa-loop-manip.h
> @@ -55,7 +55,7 @@ extern void tree_transform_and_unroll_loop (class loop *, unsigned,
>  extern void tree_unroll_loop (class loop *, unsigned,
>  			      edge, class tree_niter_desc *);
>  extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
> -
> +extern unsigned int loop_invariant_motion_in_fun (function *, bool);
>  
>  
>  #endif /* GCC_TREE_SSA_LOOP_MANIP_H */
>

Patch

diff --git a/gcc/gimple-loop-interchange.cc b/gcc/gimple-loop-interchange.cc
index 1656004ecf0..8c376228779 100644
--- a/gcc/gimple-loop-interchange.cc
+++ b/gcc/gimple-loop-interchange.cc
@@ -2068,6 +2068,7 @@  pass_linterchange::execute (function *fun)
     return 0;
 
   bool changed_p = false;
+  auto_vec<class loop *, 4> loops_to_lim;
   class loop *loop;
   FOR_EACH_LOOP (loop, LI_ONLY_INNERMOST)
     {
@@ -2077,7 +2078,11 @@  pass_linterchange::execute (function *fun)
       if (prepare_perfect_loop_nest (loop, &loop_nest, &datarefs, &ddrs))
 	{
 	  tree_loop_interchange loop_interchange (loop_nest);
-	  changed_p |= loop_interchange.interchange (datarefs, ddrs);
+	  if (loop_interchange.interchange (datarefs, ddrs))
+	    {
+	      changed_p = true;
+	      loops_to_lim.safe_push (loop_nest[0]);
+	    }
 	}
       free_dependence_relations (ddrs);
       free_data_refs_with_aux (datarefs);
@@ -2085,8 +2090,27 @@  pass_linterchange::execute (function *fun)
     }
 
   if (changed_p)
-    scev_reset ();
-  return changed_p ? (TODO_update_ssa_only_virtuals) : 0;
+    {
+      unsigned todo = TODO_update_ssa_only_virtuals;
+      unsigned len  = loops_to_lim.length ();
+      for (unsigned i = len - 1; i > 0; i--)
+	for (int j = i - 1; j >= 0; j--)
+	  if (loops_to_lim[j] == loops_to_lim[i]
+	      || flow_loop_nested_p (loops_to_lim[j], loops_to_lim[i]))
+	    {
+	      loops_to_lim.pop ();
+	      break;
+	    }
+
+      len  = loops_to_lim.length ();
+      for (unsigned i = 0; i < len; i++)
+	todo |= loop_invariant_motion_from_loop (cfun, loops_to_lim[i]);
+
+      scev_reset ();
+      return todo;
+    }
+  else
+    return 0;
 }
 
 } // anon namespace
diff --git a/gcc/tree-ssa-loop-im.c b/gcc/tree-ssa-loop-im.c
index 6bb07e133cd..24b541dfb17 100644
--- a/gcc/tree-ssa-loop-im.c
+++ b/gcc/tree-ssa-loop-im.c
@@ -244,6 +244,11 @@  static struct
 static bitmap_obstack lim_bitmap_obstack;
 static obstack mem_ref_obstack;
 
+/* If LIM has been requested only for a particular loop (and all of the
+   loops enclosed within it), this variable points to it; otherwise it is
+   NULL and the pass processes the whole function.  */
+static class loop *requested_loop;
+
 static bool ref_indep_loop_p (class loop *, im_mem_ref *, dep_kind);
 static bool ref_always_accessed_p (class loop *, im_mem_ref *, bool);
 static bool refs_independent_p (im_mem_ref *, im_mem_ref *, bool = true);
@@ -410,6 +415,21 @@  movement_possibility (gimple *stmt)
   return ret;
 }
 
+/* Return the topmost real loop in which LOOP is nested if processing the
+   whole function, or requested_loop otherwise.  */
+
+static class loop *
+get_topmost_lim_loop (class loop *loop)
+{
+  if (requested_loop)
+    {
+      gcc_assert (loop == requested_loop
+		  || flow_loop_nested_p (requested_loop, loop));
+      return requested_loop;
+    }
+  return superloop_at_depth (loop, 1);
+}
+
 /* Suppose that operand DEF is used inside the LOOP.  Returns the outermost
    loop to that we could move the expression using DEF if it did not have
    other operands, i.e. the outermost loop enclosing LOOP in that the value
@@ -424,18 +444,18 @@  outermost_invariant_loop (tree def, class loop *loop)
   struct lim_aux_data *lim_data;
 
   if (!def)
-    return superloop_at_depth (loop, 1);
+    return get_topmost_lim_loop (loop);
 
   if (TREE_CODE (def) != SSA_NAME)
     {
       gcc_assert (is_gimple_min_invariant (def));
-      return superloop_at_depth (loop, 1);
+      return get_topmost_lim_loop (loop);
     }
 
   def_stmt = SSA_NAME_DEF_STMT (def);
   def_bb = gimple_bb (def_stmt);
   if (!def_bb)
-    return superloop_at_depth (loop, 1);
+    return get_topmost_lim_loop (loop);
 
   max_loop = find_common_loop (loop, def_bb->loop_father);
 
@@ -443,6 +463,16 @@  outermost_invariant_loop (tree def, class loop *loop)
   if (lim_data != NULL && lim_data->max_loop != NULL)
     max_loop = find_common_loop (max_loop,
 				 loop_outer (lim_data->max_loop));
+  if (requested_loop)
+    {
+      class loop *req_outer = loop_outer (requested_loop);
+      if (flow_loop_nested_p (max_loop, req_outer))
+	max_loop = req_outer;
+      else
+	gcc_assert (max_loop == req_outer
+		    || flow_loop_nested_p (req_outer, max_loop));
+    }
+
   if (max_loop == loop)
     return NULL;
   max_loop = superloop_at_depth (loop, loop_depth (max_loop) + 1);
@@ -677,7 +707,7 @@  determine_max_movement (gimple *stmt, bool must_preserve_exec)
   if (must_preserve_exec)
     level = ALWAYS_EXECUTED_IN (bb);
   else
-    level = superloop_at_depth (loop, 1);
+    level = get_topmost_lim_loop (loop);
   lim_data->max_loop = level;
 
   if (gphi *phi = dyn_cast <gphi *> (stmt))
@@ -813,6 +843,9 @@  set_level (gimple *stmt, class loop *orig_loop, class loop *level)
 
   gcc_assert (level == lim_data->max_loop
 	      || flow_loop_nested_p (lim_data->max_loop, level));
+  gcc_assert (!requested_loop
+	      || requested_loop == lim_data->max_loop
+	      || flow_loop_nested_p (requested_loop, level));
 
   lim_data->tgt_loop = level;
   FOR_EACH_VEC_ELT (lim_data->depends, i, dep_stmt)
@@ -983,7 +1016,12 @@  compute_invariantness (basic_block bb)
   class loop *outermost = ALWAYS_EXECUTED_IN (bb);
   struct lim_aux_data *lim_data;
 
-  if (!loop_outer (bb->loop_father))
+  if (requested_loop)
+    {
+      if (!flow_bb_inside_loop_p (requested_loop, bb))
+	return;
+    }
+  else if (!loop_outer (bb->loop_father))
     return;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
@@ -1122,7 +1160,12 @@  move_computations_worker (basic_block bb)
   struct lim_aux_data *lim_data;
   unsigned int todo = 0;
 
-  if (!loop_outer (bb->loop_father))
+  if (requested_loop)
+    {
+      if (!flow_bb_inside_loop_p (requested_loop, bb))
+	return todo;
+    }
+  else if (!loop_outer (bb->loop_father))
     return todo;
 
   for (gphi_iterator bsi = gsi_start_phis (bb); !gsi_end_p (bsi); )
@@ -1414,7 +1457,9 @@  set_ref_stored_in_loop (im_mem_ref *ref, class loop *loop)
 static void
 mark_ref_stored (im_mem_ref *ref, class loop *loop)
 {
-  while (loop != current_loops->tree_root
+  class loop *stop_loop = requested_loop
+    ? loop_outer (requested_loop) : current_loops->tree_root;
+  while (loop != stop_loop
 	 && set_ref_stored_in_loop (ref, loop))
     loop = loop_outer (loop);
 }
@@ -1435,7 +1480,9 @@  set_ref_loaded_in_loop (im_mem_ref *ref, class loop *loop)
 static void
 mark_ref_loaded (im_mem_ref *ref, class loop *loop)
 {
-  while (loop != current_loops->tree_root
+  class loop *stop_loop = requested_loop
+    ? loop_outer (requested_loop) : current_loops->tree_root;
+  while (loop != stop_loop
 	 && set_ref_loaded_in_loop (ref, loop))
     loop = loop_outer (loop);
 }
@@ -1619,24 +1666,35 @@  sort_locs_in_loop_postorder_cmp (const void *loc1_, const void *loc2_,
   return bb_loop_postorder[loop1->num] < bb_loop_postorder[loop2->num] ? -1 : 1;
 }
 
-/* Gathers memory references in loops.  */
+/* Gathers memory references in loops.  Set a bit in CONTAINS_CALL
+   corresponding to the index of each basic block which contains a non-pure
+   call statement.  */
 
 static void
-analyze_memory_references (void)
+analyze_memory_references (sbitmap contains_call)
 {
   gimple_stmt_iterator bsi;
   basic_block bb, *bbs;
   class loop *loop, *outer;
   unsigned i, n;
 
-  /* Collect all basic-blocks in loops and sort them after their
-     loops postorder.  */
+  /* Collect all basic-blocks in loops (either in the whole function or just in
+     the requested loop) and sort them after their loops postorder.  */
   i = 0;
-  bbs = XNEWVEC (basic_block, n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
-  FOR_EACH_BB_FN (bb, cfun)
-    if (bb->loop_father != current_loops->tree_root)
-      bbs[i++] = bb;
-  n = i;
+  if (requested_loop)
+    {
+      bbs = get_loop_body (requested_loop);
+      n = requested_loop->num_nodes;
+    }
+  else
+    {
+      bbs = XNEWVEC (basic_block,
+		     n_basic_blocks_for_fn (cfun) - NUM_FIXED_BLOCKS);
+      FOR_EACH_BB_FN (bb, cfun)
+	if (bb->loop_father != current_loops->tree_root)
+	  bbs[i++] = bb;
+      n = i;
+    }
   gcc_sort_r (bbs, n, sizeof (basic_block), sort_bbs_in_loop_postorder_cmp,
 	      bb_loop_postorder);
 
@@ -1648,7 +1706,11 @@  analyze_memory_references (void)
     {
       basic_block bb = bbs[i];
       for (bsi = gsi_start_bb (bb); !gsi_end_p (bsi); gsi_next (&bsi))
-        gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
+	{
+	  gather_mem_refs_stmt (bb->loop_father, gsi_stmt (bsi));
+	  if (nonpure_call_p (gsi_stmt (bsi)))
+	    bitmap_set_bit (contains_call, bb->index);
+	}
     }
 
   /* Verify the list of gathered memory references is sorted after their
@@ -1667,7 +1729,9 @@  analyze_memory_references (void)
 
   /* Propagate the information about accessed memory references up
      the loop hierarchy.  */
-  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+  class loop *stop_loop = requested_loop
+    ? loop_outer (requested_loop) : current_loops->tree_root;
+  FOR_EACH_ENCLOSED_LOOP (requested_loop, loop, LI_FROM_INNERMOST)
     {
       /* Finalize the overall touched references (including subloops).  */
       bitmap_ior_into (&memory_accesses.all_refs_stored_in_loop[loop->num],
@@ -1676,7 +1740,7 @@  analyze_memory_references (void)
       /* Propagate the information about accessed memory references up
 	 the loop hierarchy.  */
       outer = loop_outer (loop);
-      if (outer == current_loops->tree_root)
+      if (outer == stop_loop)
 	continue;
 
       bitmap_ior_into (&memory_accesses.all_refs_stored_in_loop[outer->num],
@@ -2895,8 +2959,11 @@  do_store_motion (void)
   class loop *loop;
   bitmap sm_executed = BITMAP_ALLOC (&lim_bitmap_obstack);
 
-  for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
-    store_motion_loop (loop, sm_executed);
+  if (requested_loop)
+    store_motion_loop (requested_loop, sm_executed);
+  else
+    for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
+      store_motion_loop (loop, sm_executed);
 
   BITMAP_FREE (sm_executed);
 }
@@ -2982,27 +3049,13 @@  fill_always_executed_in_1 (class loop *loop, sbitmap contains_call)
    of its header implies execution of bb.  */
 
 static void
-fill_always_executed_in (void)
+fill_always_executed_in (sbitmap contains_call)
 {
-  basic_block bb;
   class loop *loop;
 
-  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
-  bitmap_clear (contains_call);
-  FOR_EACH_BB_FN (bb, cfun)
-    {
-      gimple_stmt_iterator gsi;
-      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
-	{
-	  if (nonpure_call_p (gsi_stmt (gsi)))
-	    break;
-	}
-
-      if (!gsi_end_p (gsi))
-	bitmap_set_bit (contains_call, bb->index);
-    }
-
-  for (loop = current_loops->tree_root->inner; loop; loop = loop->next)
+  for (loop = requested_loop ? requested_loop : current_loops->tree_root->inner;
+       loop;
+       loop = loop->next)
     fill_always_executed_in_1 (loop, contains_call);
 }
 
@@ -3010,11 +3063,17 @@  fill_always_executed_in (void)
 /* Compute the global information needed by the loop invariant motion pass.  */
 
 static void
-tree_ssa_lim_initialize (void)
+tree_ssa_lim_initialize (class loop *req_loop)
 {
   class loop *loop;
   unsigned i;
 
+  gcc_assert (!req_loop || req_loop != current_loops->tree_root);
+  requested_loop = req_loop;
+
+  if (dump_file && (dump_flags & TDF_DETAILS) && requested_loop)
+    fprintf (dump_file, "Initializing LIM for loop %d.\n\n", req_loop->num);
+
   bitmap_obstack_initialize (&lim_bitmap_obstack);
   gcc_obstack_init (&mem_ref_obstack);
   lim_aux_data_map = new hash_map<gimple *, lim_aux_data *>;
@@ -3051,7 +3110,7 @@  tree_ssa_lim_initialize (void)
      its postorder index.  */
   i = 0;
   bb_loop_postorder = XNEWVEC (unsigned, number_of_loops (cfun));
-  FOR_EACH_LOOP (loop, LI_FROM_INNERMOST)
+  FOR_EACH_ENCLOSED_LOOP (req_loop, loop, LI_FROM_INNERMOST)
     bb_loop_postorder[loop->num] = i++;
 }
 
@@ -3086,23 +3145,32 @@  tree_ssa_lim_finalize (void)
     free_affine_expand_cache (&memory_accesses.ttae_cache);
 
   free (bb_loop_postorder);
+  requested_loop = NULL;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "LIM finished.\n\n");
 }
 
-/* Moves invariants from loops.  Only "expensive" invariants are moved out --
-   i.e. those that are likely to be win regardless of the register pressure.  */
+/* Moves invariants from LOOP in FUN.  If LOOP is NULL, perform the
+   transformation on all loops within FUN.  Only "expensive" invariants are
+   moved out -- i.e. those that are likely to be a win regardless of the
+   register pressure.  Return the pass TODO flags that need to be carried
+   out after the transformation.  */
 
-static unsigned int
-tree_ssa_lim (function *fun)
+unsigned int
+loop_invariant_motion_from_loop (function *fun, class loop *loop)
 {
   unsigned int todo = 0;
 
-  tree_ssa_lim_initialize ();
+  tree_ssa_lim_initialize (loop);
 
+  auto_sbitmap contains_call (last_basic_block_for_fn (cfun));
+  bitmap_clear (contains_call);
   /* Gathers information about memory accesses in the loops.  */
-  analyze_memory_references ();
+  analyze_memory_references (contains_call);
 
   /* Fills ALWAYS_EXECUTED_IN information for basic blocks.  */
-  fill_always_executed_in ();
+  fill_always_executed_in (contains_call);
 
   int *rpo = XNEWVEC (int, last_basic_block_for_fn (fun));
   int n = pre_and_rev_post_order_compute_fn (fun, NULL, rpo, false);
@@ -3135,6 +3203,16 @@  tree_ssa_lim (function *fun)
   return todo;
 }
 
+/* Moves invariants from all loops in FUN.  Only "expensive" invariants are
+   moved out -- i.e. those that are likely to be a win regardless of the
+   register pressure.  */
+
+static unsigned int
+tree_ssa_lim (function *fun)
+{
+  return loop_invariant_motion_from_loop (fun, NULL);
+}
+
 /* Loop invariant motion pass.  */
 
 namespace {
diff --git a/gcc/tree-ssa-loop-manip.h b/gcc/tree-ssa-loop-manip.h
index e789e4fcb0b..c89a86aef4e 100644
--- a/gcc/tree-ssa-loop-manip.h
+++ b/gcc/tree-ssa-loop-manip.h
@@ -56,6 +56,8 @@  extern void tree_unroll_loop (class loop *, unsigned,
 			      edge, class tree_niter_desc *);
 extern tree canonicalize_loop_ivs (class loop *, tree *, bool);
 
+extern unsigned int loop_invariant_motion_from_loop (function *fun,
+						     class loop *loop);
 
 
 #endif /* GCC_TREE_SSA_LOOP_MANIP_H */