[RFC,PR66873] Use graphite for parloops

Message ID 55A6C1DF.1050108@mentor.com
State New

Commit Message

Tom de Vries July 15, 2015, 8:26 p.m. UTC
Hi,

I tried to parallelize this fortran test-case (based on 
autopar/outer-1.c), specifically the outer loop of the first loop nest 
using -ftree-parallelize-loops=2:
...
program main
   implicit none
   integer, parameter         :: n = 500
   integer, dimension (0:n-1, 0:n-1) :: x
   integer                    :: i, j, ii, jj


   do ii = 0, n - 1
      do jj = 0, n - 1
         x(jj, ii) = ii + jj + 3
      end do
   end do

   do i = 0, n - 1
      do j = 0, n - 1
         if (x(j, i) .ne. i + j + 3) call abort
      end do
   end do

end program main
...

But autopar fails to parallelize due to failing dependency analysis.

I then tried to add -floop-parallelize-all, and found that the graphite 
dependency analysis did manage to decide that the iterations are 
independent.
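
To make the independence claim concrete, here is a rough C rendering of the
two loop nests. It is only a sketch: it is not the actual autopar/outer-1.c,
and the names fill, count_mismatches and run_test_case are made up for
illustration. The Fortran x(jj, ii) is written as x[ii][jj] so that jj stays
the contiguous index. Each outer iteration ii writes only row ii, so no two
outer iterations touch the same element.

```c
#include <assert.h>

#define N 500

static int x[N][N];

/* First loop nest: outer iteration ii writes only row ii, so the outer
   iterations are independent of each other.  */
static void
fill (void)
{
  for (int ii = 0; ii < N; ii++)
    for (int jj = 0; jj < N; jj++)
      x[ii][jj] = ii + jj + 3;
}

/* Second loop nest: count the elements that do not hold the expected
   value instead of aborting on the first one.  */
static int
count_mismatches (void)
{
  int bad = 0;
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      if (x[i][j] != i + j + 3)
        bad++;
  return bad;
}

/* Run both nests; the assert plays the role of "call abort".  */
static void
run_test_case (void)
{
  fill ();
  assert (count_mismatches () == 0);
}
```

Calling run_test_case () from main reproduces the behaviour of the Fortran
program: it returns normally when every element matches.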

At https://gcc.gnu.org/wiki/Graphite/Parallelization I read:
...
In GCC there already exists an auto-parallelization pass
(tree-parloops.c), which is based on the lambda framework originally
developed by Sebastian. Since the Lambda framework is limited to some cases
(e.g. triangle loops, loops with 'if' conditions), Graphite was
developed to handle the loops that lambda was not able to handle.
...

So I wondered, why not always use the graphite dependency analysis in 
parloops. (Of course you could use -floop-parallelize-all, but that also 
changes the heuristic). So I wrote a patch for parloops to use graphite 
dependency analysis by default (so without -floop-parallelize-all), but 
while testing found out that all the reduction test-cases started 
failing because the modifications graphite makes to the code mess up
the parloops reduction analysis.

Then I came up with this patch, which:
- first runs a parloops pass, restricted to reduction loops only,
- then runs graphite dependency analysis
- followed by a normal parloops pass run.

This way, we get to both:
- compile the reduction testcases as before, and
- profit from the better graphite dependency analysis otherwise.

A point worth noting is that I stopped running pass_iv_canon before 
parloops (only in case of -ftree-parallelize-loops > 1) because running 
it before graphite makes the graphite scop detection fail.

Bootstrapped and reg-tested on x86_64.

Any comments?

Thanks,
- Tom

Comments

Richard Biener July 16, 2015, 8:46 a.m. UTC | #1
On Wed, Jul 15, 2015 at 10:26 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> Hi,
>
> I tried to parallelize this fortran test-case (based on autopar/outer-1.c),
> specifically the outer loop of the first loop nest using
> -ftree-parallelize-loops=2:
> ...
> program main
>   implicit none
>   integer, parameter         :: n = 500
>   integer, dimension (0:n-1, 0:n-1) :: x
>   integer                    :: i, j, ii, jj
>
>
>   do ii = 0, n - 1
>      do jj = 0, n - 1
>         x(jj, ii) = ii + jj + 3
>      end do
>   end do
>
>   do i = 0, n - 1
>      do j = 0, n - 1
>         if (x(j, i) .ne. i + j + 3) call abort
>      end do
>   end do
>
> end program main
> ...
>
> But autopar fails to parallelize due to failing dependency analysis.
>
> I then tried to add -floop-parallelize-all, and found that the graphite
> dependency analysis did manage to decide that the iterations are
> independent.
>
> At https://gcc.gnu.org/wiki/Graphite/Parallelization I read:
> ...
> In GCC there already exists an auto-parallelization pass (tree-parloops.c),
> which is based on the lambda framework originally developed by Sebastian.
> Since the Lambda framework is limited to some cases (e.g. triangle loops, loops
> with 'if' conditions), Graphite was developed to handle the loops that
> lambda was not able to handle.
> ...
>
> So I wondered, why not always use the graphite dependency analysis in
> parloops. (Of course you could use -floop-parallelize-all, but that also
> changes the heuristic). So I wrote a patch for parloops to use graphite
> dependency analysis by default (so without -floop-parallelize-all), but
> while testing found out that all the reduction test-cases started failing
> because the modifications graphite makes to the code mess up the parloops
> reduction analysis.
>
> Then I came up with this patch, which:
> - first runs a parloops pass, restricted to reduction loops only,
> - then runs graphite dependency analysis
> - followed by a normal parloops pass run.
>
> This way, we get to both:
> - compile the reduction testcases as before, and
> - profit from the better graphite dependency analysis otherwise.
>
> A point worth noting is that I stopped running pass_iv_canon before parloops
> (only in case of -ftree-parallelize-loops > 1) because running it before
> graphite makes the graphite scop detection fail.
>
> Bootstrapped and reg-tested on x86_64.
>
> Any comments?

graphite dependence analysis is too slow to be enabled unconditionally.
(read: hours in some simple cases - see bugzilla)

Richard.

> Thanks,
> - Tom
Thomas Schwinge July 16, 2015, 10:19 a.m. UTC | #2
Hi Tom!

On Thu, 16 Jul 2015 10:46:00 +0200, Richard Biener <richard.guenther@gmail.com> wrote:
> On Wed, Jul 15, 2015 at 10:26 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> > I tried to parallelize this fortran test-case (based on autopar/outer-1.c),
> > [...]

> > So I wondered, why not always use the graphite dependency analysis in
> > parloops. (Of course you could use -floop-parallelize-all, but that also
> > changes the heuristic). So I wrote a patch for parloops to use graphite
> > dependency analysis by default (so without -floop-parallelize-all), but
> > while testing found out that all the reduction test-cases started failing
> > because the modifications graphite makes to the code mess up the parloops
> > reduction analysis.
> >
> > Then I came up with this patch, which:
> > - first runs a parloops pass, restricted to reduction loops only,
> > - then runs graphite dependency analysis
> > - followed by a normal parloops pass run.
> >
> > This way, we get to both:
> > - compile the reduction testcases as before, and
> > - profit from the better graphite dependency analysis otherwise.

> graphite dependence analysis is too slow to be enabled unconditionally.
> (read: hours in some simple cases - see bugzilla)

Haha, "cool"!  ;-)

Maybe it is still reasonable to use graphite to analyze the code inside
OpenACC kernels regions -- maybe such code can reasonably be expected to
not have the properties that make its analysis lengthy?  So, Tom, could
you please identify and check such PRs, to get an understanding of what
these properties are?


Grüße,
 Thomas
Richard Biener July 16, 2015, 10:23 a.m. UTC | #3
On Thu, Jul 16, 2015 at 12:19 PM, Thomas Schwinge
<thomas@codesourcery.com> wrote:
> Hi Tom!
>
> On Thu, 16 Jul 2015 10:46:00 +0200, Richard Biener <richard.guenther@gmail.com> wrote:
>> On Wed, Jul 15, 2015 at 10:26 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
>> > I tried to parallelize this fortran test-case (based on autopar/outer-1.c),
>> > [...]
>
>> > So I wondered, why not always use the graphite dependency analysis in
>> > parloops. (Of course you could use -floop-parallelize-all, but that also
>> > changes the heuristic). So I wrote a patch for parloops to use graphite
>> > dependency analysis by default (so without -floop-parallelize-all), but
>> > while testing found out that all the reduction test-cases started failing
>> > because the modifications graphite makes to the code mess up the parloops
>> > reduction analysis.
>> >
>> > Then I came up with this patch, which:
>> > - first runs a parloops pass, restricted to reduction loops only,
>> > - then runs graphite dependency analysis
>> > - followed by a normal parloops pass run.
>> >
>> > This way, we get to both:
>> > - compile the reduction testcases as before, and
>> > - profit from the better graphite dependency analysis otherwise.
>
>> graphite dependence analysis is too slow to be enabled unconditionally.
>> (read: hours in some simple cases - see bugzilla)
>
> Haha, "cool"!  ;-)
>
> Maybe it is still reasonable to use graphite to analyze the code inside
> OpenACC kernels regions -- maybe such code can reasonably be expected to
> not have the properties that make its analysis lengthy?  So, Tom, could
> you please identify and check such PRs, to get an understanding of what
> these properties are?

Like the ones in PR62113, PR53852, or PR59121.

>
> Grüße,
>  Thomas
Richard Biener July 16, 2015, 10:28 a.m. UTC | #4
On Thu, Jul 16, 2015 at 12:23 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Thu, Jul 16, 2015 at 12:19 PM, Thomas Schwinge
> <thomas@codesourcery.com> wrote:
>> Hi Tom!
>>
>> On Thu, 16 Jul 2015 10:46:00 +0200, Richard Biener <richard.guenther@gmail.com> wrote:
>>> On Wed, Jul 15, 2015 at 10:26 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
>>> > I tried to parallelize this fortran test-case (based on autopar/outer-1.c),
>>> > [...]
>>
>>> > So I wondered, why not always use the graphite dependency analysis in
>>> > parloops. (Of course you could use -floop-parallelize-all, but that also
>>> > changes the heuristic). So I wrote a patch for parloops to use graphite
>>> > dependency analysis by default (so without -floop-parallelize-all), but
>>> > while testing found out that all the reduction test-cases started failing
>>> > because the modifications graphite makes to the code mess up the parloops
>>> > reduction analysis.
>>> >
>>> > Then I came up with this patch, which:
>>> > - first runs a parloops pass, restricted to reduction loops only,
>>> > - then runs graphite dependency analysis
>>> > - followed by a normal parloops pass run.
>>> >
>>> > This way, we get to both:
>>> > - compile the reduction testcases as before, and
>>> > - profit from the better graphite dependency analysis otherwise.
>>
>>> graphite dependence analysis is too slow to be enabled unconditionally.
>>> (read: hours in some simple cases - see bugzilla)
>>
>> Haha, "cool"!  ;-)
>>
>> Maybe it is still reasonable to use graphite to analyze the code inside
>> OpenACC kernels regions -- maybe such code can reasonably be expected to
>> not have the properties that make its analysis lengthy?  So, Tom, could
>> you please identify and check such PRs, to get an understanding of what
>> these properties are?
>
> Like the ones in PR62113, PR53852, or PR59121.

Btw, it would be nice to handle this case (or at least figure out why we can't)
in GCC's dependence analysis.

Richard.

>>
>> Grüße,
>>  Thomas
Tom de Vries July 16, 2015, 11:19 a.m. UTC | #5
On 16/07/15 12:23, Richard Biener wrote:
> On Thu, Jul 16, 2015 at 12:19 PM, Thomas Schwinge
> <thomas@codesourcery.com> wrote:
>> Hi Tom!
>>
>> On Thu, 16 Jul 2015 10:46:00 +0200, Richard Biener <richard.guenther@gmail.com> wrote:
>>> On Wed, Jul 15, 2015 at 10:26 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
>>>> I tried to parallelize this fortran test-case (based on autopar/outer-1.c),
>>>> [...]
>>
>>>> So I wondered, why not always use the graphite dependency analysis in
>>>> parloops. (Of course you could use -floop-parallelize-all, but that also
>>>> changes the heuristic). So I wrote a patch for parloops to use graphite
>>>> dependency analysis by default (so without -floop-parallelize-all), but
>>>> while testing found out that all the reduction test-cases started failing
>>>> because the modifications graphite makes to the code mess up the parloops
>>>> reduction analysis.
>>>>
>>>> Then I came up with this patch, which:
>>>> - first runs a parloops pass, restricted to reduction loops only,
>>>> - then runs graphite dependency analysis
>>>> - followed by a normal parloops pass run.
>>>>
>>>> This way, we get to both:
>>>> - compile the reduction testcases as before, and
>>>> - profit from the better graphite dependency analysis otherwise.
>>
>>> graphite dependence analysis is too slow to be enabled unconditionally.
>>> (read: hours in some simple cases - see bugzilla)
>>
>> Haha, "cool"!  ;-)
>>
>> Maybe it is still reasonable to use graphite to analyze the code inside
>> OpenACC kernels regions -- maybe such code can reasonably be expected to
>> not have the properties that make its analysis lengthy?  So, Tom, could
>> you please identify and check such PRs, to get an understanding of what
>> these properties are?
>
> Like the ones in PR62113, PR53852, or PR59121.

PR62113 and PR59121 do not reproduce for me on trunk.

PR53852 does reproduce for me (to the point that I had to reset my laptop).

Thanks,
- Tom
Sebastian Pop July 20, 2015, 6:22 p.m. UTC | #6
Tom de Vries wrote:
> >>>graphite dependence analysis is too slow to be enabled unconditionally.
> >>>(read: hours in some simple cases - see bugzilla)
> >>
> >>Haha, "cool"!  ;-)
> >>
> >>Maybe it is still reasonable to use graphite to analyze the code inside
> >>OpenACC kernels regions -- maybe such code can reasonably be expected to
> >>not have the properties that make its analysis lengthy?  So, Tom, could
> >>you please identify and check such PRs, to get an understanding of what
> >>these properties are?
> >
> >Like the ones in PR62113, PR53852, or PR59121.
> 
> PR62113 and PR59121 do not reproduce for me on trunk.
> 
> PR53852 does reproduce for me (to the point that I had to reset my laptop).

ISL has a way to count the number of operations: once a watermark is exceeded,
it outputs an error code that we can use to leave graphite; see the
documentation of isl_ctx_set_max_operations().  With that mechanism we can set
a goal for graphite of at most, say, 10% overhead on whole compilation time.
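
The watermark idea can be illustrated without ISL. The sketch below does not
call isl_ctx_set_max_operations() or any other ISL function; it only models
the behaviour described above with made-up names (budget_ctx, budget_tick,
toy_analysis) and an arbitrary quota, so treat it as an assumption about the
shape of the mechanism rather than as graphite code. The analysis counts its
own operations and gives up cleanly once the quota is exceeded, letting the
caller skip the transform instead of spending hours.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an analysis context with an operation quota.  */
struct budget_ctx
{
  unsigned long ops;      /* operations counted so far */
  unsigned long max_ops;  /* quota; 0 means unlimited */
};

/* Count one operation.  Returns false once the quota has been exceeded.  */
static bool
budget_tick (struct budget_ctx *ctx)
{
  assert (ctx);
  ctx->ops++;
  return ctx->max_ops == 0 || ctx->ops <= ctx->max_ops;
}

/* Toy "dependence analysis" that does one unit of work per pair of the N
   accesses.  Returns 1 if it ran to completion and 0 if it bailed out
   because the quota was exceeded.  */
static int
toy_analysis (struct budget_ctx *ctx, unsigned long n)
{
  for (unsigned long i = 0; i < n; i++)
    for (unsigned long j = 0; j < n; j++)
      if (!budget_tick (ctx))
        return 0;  /* over budget: caller should skip the transform */
  return 1;
}
```

With a quota of 1000, toy_analysis on 10 accesses (100 operations) completes,
while the same call on 1000 accesses stops at operation 1001 and reports
failure, which is the point where a pass would leave the region untouched.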
Sebastian Pop July 20, 2015, 6:31 p.m. UTC | #7
Tom de Vries wrote:
> So I wondered, why not always use the graphite dependency analysis
> in parloops. (Of course you could use -floop-parallelize-all, but
> that also changes the heuristic). So I wrote a patch for parloops to
> use graphite dependency analysis by default (so without
> -floop-parallelize-all), but while testing found out that all the
> reduction test-cases started failing because the modifications
> graphite makes to the code mess up the parloops reduction
> analysis.
> 
> Then I came up with this patch, which:
> - first runs a parloops pass, restricted to reduction loops only,

I would prefer to fix graphite to catch the reduction loop and avoid running an
extra pass before graphite for that case.  Can you please specify which files
fail to be parallelized?  Are they all the testcases whose flags you update?

Also it seems to me that you are missing -ffast-math to parallelize all these
loops: without that flag graphite would not mark reductions as
associative/commutative operations and they would not be recognized as parallel.
Is that something the current parloops detection is less strict about?

Thanks,
Sebastian

> - then runs graphite dependency analysis
> - followed by a normal parloops pass run.
> 
> This way, we get to both:
> - compile the reduction testcases as before, and
> - profit from the better graphite dependency analysis otherwise.
> 
> A point worth noting is that I stopped running pass_iv_canon before
> parloops (only in case of -ftree-parallelize-loops > 1) because
> running it before graphite makes the graphite scop detection fail.
> 
> Bootstrapped and reg-tested on x86_64.
> 
> Any comments?
> 
> Thanks,
> - Tom

> Use graphite for parloops
> 
> 2015-07-15  Tom de Vries  <tom@codesourcery.com>
> 
> 	PR tree-optimization/66873
> 	* graphite-isl-ast-to-gimple.c (translate_isl_ast_for_loop)
> 	(scop_to_isl_ast): Handle flag_tree_parallelize_loops.
> 	* graphite-poly.c (apply_poly_transforms): Same.
> 	* graphite.c (gate_graphite_transforms): Remove static.
> 	(pass_graphite_parloops): New pass.
> 	(make_pass_graphite_parloops): New function.
> 	(pass_graphite_transforms2): New pass.
> 	(make_pass_graphite_transforms2): New function.
> 	* omp-low.c (pass_expand_omp_ssa::clone): Same.
> 	* passes.def: Add pass groups pass_parallelize_reductions and
> 	pass_graphite_parloops.
> 	* tree-parloops.c (gen_parallel_loop): Add debug print for alternative
> 	exit-first loop transform.
> 	(parallelize_loops): Add reductions_only parameter.
> 	(pass_parallelize_loops::execute): Call parallelize_loops with extra
> 	argument.
> 	(pass_parallelize_reductions): New pass.
> 	(pass_parallelize_reductions::execute)
> 	(make_pass_parallelize_reductions): New function.
> 	* tree-pass.h (make_pass_graphite_parloops)
> 	(make_pass_parallelize_reductions, make_pass_graphite_transforms2)
> 	(gate_graphite_transforms): Declare.
> 	* tree-ssa-loop-ivcanon.c (pass_iv_canon::gate): Return false if
> 	flag_tree_parallelize_loops > 1.
> 
> 	* gcc.dg/autopar/outer-6.c: Update for new pass parloopsred.
> 	* gcc.dg/autopar/reduc-1.c: Same.
> 	* gcc.dg/autopar/reduc-1char.c: Same.
> 	* gcc.dg/autopar/reduc-1short.c: Same.
> 	* gcc.dg/autopar/reduc-2.c: Same.
> 	* gcc.dg/autopar/reduc-2char.c: Same.
> 	* gcc.dg/autopar/reduc-2short.c: Same.
> 	* gcc.dg/autopar/reduc-3.c: Same.
> 	* gcc.dg/autopar/reduc-6.c: Same.
> 	* gcc.dg/autopar/reduc-7.c: Same.
> 	* gcc.dg/autopar/reduc-8.c: Same.
> 	* gcc.dg/autopar/reduc-9.c: Same.
> 	* gcc.dg/parloops-exit-first-loop-alt-2.c: Same.
> 	* gcc.dg/parloops-exit-first-loop-alt-3.c: Same.
> 	* gcc.dg/parloops-exit-first-loop-alt-4.c: Same.
> 	* gcc.dg/parloops-exit-first-loop-alt-5.c: Same.
> 	* gcc.dg/parloops-exit-first-loop-alt-6.c: Same.
> 	* gcc.dg/parloops-exit-first-loop-alt-7.c: Same.
> 	* gcc.dg/parloops-exit-first-loop-alt-pr66652.c: Same.
> 	* gcc.dg/parloops-exit-first-loop-alt.c: Same.
> 	* gfortran.dg/parloops-exit-first-loop-alt-2.f95: Same.
> 	* gfortran.dg/parloops-exit-first-loop-alt.f95: Same.
> 	* gfortran.dg/parloops-outer-1.f95: New test.
> ---
>  gcc/graphite-isl-ast-to-gimple.c                   |  6 +-
>  gcc/graphite-poly.c                                |  3 +-
>  gcc/graphite.c                                     | 83 ++++++++++++++++++-
>  gcc/omp-low.c                                      |  1 +
>  gcc/passes.def                                     | 11 +++
>  gcc/testsuite/gcc.dg/autopar/outer-6.c             |  6 +-
>  gcc/testsuite/gcc.dg/autopar/reduc-1.c             |  7 +-
>  gcc/testsuite/gcc.dg/autopar/reduc-1char.c         |  7 +-
>  gcc/testsuite/gcc.dg/autopar/reduc-1short.c        |  7 +-
>  gcc/testsuite/gcc.dg/autopar/reduc-2.c             |  7 +-
>  gcc/testsuite/gcc.dg/autopar/reduc-2char.c         |  7 +-
>  gcc/testsuite/gcc.dg/autopar/reduc-2short.c        |  7 +-
>  gcc/testsuite/gcc.dg/autopar/reduc-3.c             |  5 +-
>  gcc/testsuite/gcc.dg/autopar/reduc-6.c             |  6 +-
>  gcc/testsuite/gcc.dg/autopar/reduc-7.c             |  7 +-
>  gcc/testsuite/gcc.dg/autopar/reduc-8.c             |  7 +-
>  gcc/testsuite/gcc.dg/autopar/reduc-9.c             |  7 +-
>  .../gcc.dg/parloops-exit-first-loop-alt-2.c        |  9 +--
>  .../gcc.dg/parloops-exit-first-loop-alt-3.c        |  9 +--
>  .../gcc.dg/parloops-exit-first-loop-alt-4.c        |  9 +--
>  .../gcc.dg/parloops-exit-first-loop-alt-5.c        |  9 +--
>  .../gcc.dg/parloops-exit-first-loop-alt-6.c        |  9 +--
>  .../gcc.dg/parloops-exit-first-loop-alt-7.c        |  9 +--
>  .../gcc.dg/parloops-exit-first-loop-alt-pr66652.c  | 11 +--
>  .../gcc.dg/parloops-exit-first-loop-alt.c          | 10 +--
>  .../gfortran.dg/parloops-exit-first-loop-alt-2.f95 |  9 +--
>  .../gfortran.dg/parloops-exit-first-loop-alt.f95   | 10 +--
>  gcc/testsuite/gfortran.dg/parloops-outer-1.f95     | 37 +++++++++
>  gcc/tree-parloops.c                                | 93 ++++++++++++++++++++--
>  gcc/tree-pass.h                                    |  5 ++
>  gcc/tree-ssa-loop-ivcanon.c                        |  6 +-
>  31 files changed, 303 insertions(+), 116 deletions(-)
>  create mode 100644 gcc/testsuite/gfortran.dg/parloops-outer-1.f95
> 
> diff --git a/gcc/graphite-isl-ast-to-gimple.c b/gcc/graphite-isl-ast-to-gimple.c
> index b32781a5..bdafd40 100644
> --- a/gcc/graphite-isl-ast-to-gimple.c
> +++ b/gcc/graphite-isl-ast-to-gimple.c
> @@ -442,7 +442,8 @@ translate_isl_ast_for_loop (loop_p context_loop,
>    redirect_edge_succ_nodup (next_e, after);
>    set_immediate_dominator (CDI_DOMINATORS, next_e->dest, next_e->src);
>  
> -  if (flag_loop_parallelize_all)
> +  if (flag_loop_parallelize_all
> +      || flag_tree_parallelize_loops > 1)
>    {
>      isl_id *id = isl_ast_node_get_annotation (node_for);
>      gcc_assert (id);
> @@ -995,7 +996,8 @@ scop_to_isl_ast (scop_p scop, ivs_params &ip)
>    context_isl = set_options (context_isl, schedule_isl, options_luj);
>  
>    isl_union_map *dependences = NULL;
> -  if (flag_loop_parallelize_all)
> +  if (flag_loop_parallelize_all
> +      || flag_tree_parallelize_loops > 1)
>    {
>      dependences = scop_get_dependences (scop);
>      context_isl =
> diff --git a/gcc/graphite-poly.c b/gcc/graphite-poly.c
> index bcd08d8..e32325e 100644
> --- a/gcc/graphite-poly.c
> +++ b/gcc/graphite-poly.c
> @@ -241,7 +241,8 @@ apply_poly_transforms (scop_p scop)
>    if (flag_graphite_identity)
>      transform_done = true;
>  
> -  if (flag_loop_parallelize_all)
> +  if (flag_loop_parallelize_all
> +      || flag_tree_parallelize_loops > 1)
>      transform_done = true;
>  
>    if (flag_loop_block)
> diff --git a/gcc/graphite.c b/gcc/graphite.c
> index a81ef6a..6ba58c0 100644
> --- a/gcc/graphite.c
> +++ b/gcc/graphite.c
> @@ -319,7 +319,7 @@ graphite_transforms (struct function *fun)
>    return 0;
>  }
>  
> -static bool
> +bool
>  gate_graphite_transforms (void)
>  {
>    /* Enable -fgraphite pass if any one of the graphite optimization flags
> @@ -373,6 +373,45 @@ make_pass_graphite (gcc::context *ctxt)
>  
>  namespace {
>  
> +const pass_data pass_data_graphite_parloops =
> +{
> +  GIMPLE_PASS, /* type */
> +  "graphite_parloops", /* name */
> +  OPTGROUP_LOOP, /* optinfo_flags */
> +  TV_GRAPHITE, /* tv_id */
> +  ( PROP_cfg | PROP_ssa ), /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};
> +
> +class pass_graphite_parloops : public gimple_opt_pass
> +{
> +public:
> +  pass_graphite_parloops (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_graphite_parloops, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +  {
> +    return (flag_tree_parallelize_loops > 1
> +	    && !gate_graphite_transforms ());
> +  }
> +
> +}; // class pass_graphite_parloops
> +
> +} // anon namespace
> +
> +gimple_opt_pass *
> +make_pass_graphite_parloops (gcc::context *ctxt)
> +{
> +  return new pass_graphite_parloops (ctxt);
> +}
> +
> +namespace {
> +
>  const pass_data pass_data_graphite_transforms =
>  {
>    GIMPLE_PASS, /* type */
> @@ -407,4 +446,46 @@ make_pass_graphite_transforms (gcc::context *ctxt)
>    return new pass_graphite_transforms (ctxt);
>  }
>  
> +/* It would be preferable to use a clone of pass_data_graphite_transforms rather
> +   than declare a new pass.  But when using a clone of
> +   pass_data_graphite_transforms (and changing the gate to trigger for
> +   flag_tree_parallelize_loops > 1 as well) in pass group
> +   pass_graphite_parloops, the pass is not executed.  */
> +
> +namespace {
> +
> +const pass_data pass_data_graphite_transforms2 =
> +{
> +  GIMPLE_PASS, /* type */
> +  "graphite2", /* name */
> +  OPTGROUP_LOOP, /* optinfo_flags */
> +  TV_GRAPHITE_TRANSFORMS, /* tv_id */
> +  ( PROP_cfg | PROP_ssa ), /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};
> +
> +class pass_graphite_transforms2 : public gimple_opt_pass
> +{
> +public:
> +  pass_graphite_transforms2 (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_graphite_transforms2, ctxt)
> +  {}
>  
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +  {
> +    return (flag_tree_parallelize_loops > 1);
> +  }
> +  virtual unsigned int execute (function *fun) { return graphite_transforms (fun); }
> +}; // class pass_graphite_transforms2
> +
> +} // anon namespace
> +
> +gimple_opt_pass *
> +make_pass_graphite_transforms2 (gcc::context *ctxt)
> +{
> +  return new pass_graphite_transforms2 (ctxt);
> +}
> diff --git a/gcc/omp-low.c b/gcc/omp-low.c
> index 3135606..8cbee3a 100644
> --- a/gcc/omp-low.c
> +++ b/gcc/omp-low.c
> @@ -9576,6 +9576,7 @@ public:
>        return !(fun->curr_properties & PROP_gimple_eomp);
>      }
>    virtual unsigned int execute (function *) { return execute_expand_omp (); }
> +  opt_pass *clone () { return new pass_expand_omp_ssa (m_ctxt); }
>  
>  }; // class pass_expand_omp_ssa
>  
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 5cd07ae..aa1d1a1 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -244,6 +244,17 @@ along with GCC; see the file COPYING3.  If not see
>  	      NEXT_PASS (pass_dce);
>  	  POP_INSERT_PASSES ()
>  	  NEXT_PASS (pass_iv_canon);
> +	  NEXT_PASS (pass_parallelize_reductions);
> +	  PUSH_INSERT_PASSES_WITHIN (pass_parallelize_reductions)
> +	      NEXT_PASS (pass_expand_omp_ssa);
> +	  POP_INSERT_PASSES ()
> +	  NEXT_PASS (pass_graphite_parloops);
> +	  PUSH_INSERT_PASSES_WITHIN (pass_graphite_parloops)
> +	      NEXT_PASS (pass_graphite_transforms2);
> +	      NEXT_PASS (pass_lim);
> +	      NEXT_PASS (pass_copy_prop);
> +	      NEXT_PASS (pass_dce);
> +	  POP_INSERT_PASSES ()
>  	  NEXT_PASS (pass_parallelize_loops);
>  	  PUSH_INSERT_PASSES_WITHIN (pass_parallelize_loops)
>  	      NEXT_PASS (pass_expand_omp_ssa);
> diff --git a/gcc/testsuite/gcc.dg/autopar/outer-6.c b/gcc/testsuite/gcc.dg/autopar/outer-6.c
> index 6bef7cc..0f01bd5 100644
> --- a/gcc/testsuite/gcc.dg/autopar/outer-6.c
> +++ b/gcc/testsuite/gcc.dg/autopar/outer-6.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-optimized" } */
>  
>  void abort (void);
>  
> @@ -44,6 +44,6 @@ int main(void)
>  
>  
>  /* Check that outer loop is parallelized.  */
> -/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" } } */
> -/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 0 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 0 "parloopsred" } } */
>  /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" } } */
> diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-1.c b/gcc/testsuite/gcc.dg/autopar/reduc-1.c
> index 6e9a280..4fc9b31 100644
> --- a/gcc/testsuite/gcc.dg/autopar/reduc-1.c
> +++ b/gcc/testsuite/gcc.dg/autopar/reduc-1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
>  
>  #include <stdarg.h>
>  #include <stdlib.h>
> @@ -66,6 +66,7 @@ int main (void)
>  }
>  
>  
> -/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" } } */
> -/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
>  
> diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-1char.c b/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
> index 48ead88..497b7e0 100644
> --- a/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
> +++ b/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
>  
>  #include <stdarg.h>
>  #include <stdlib.h>
> @@ -60,6 +60,7 @@ int main (void)
>  }
>  
>  
> -/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" } } */
> -/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
>  
> diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-1short.c b/gcc/testsuite/gcc.dg/autopar/reduc-1short.c
> index f3f547c..6af8e4b 100644
> --- a/gcc/testsuite/gcc.dg/autopar/reduc-1short.c
> +++ b/gcc/testsuite/gcc.dg/autopar/reduc-1short.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
>  
>  #include <stdarg.h>
>  #include <stdlib.h>
> @@ -59,6 +59,7 @@ int main (void)
>    return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" } } */
> -/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
>  
> diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-2.c b/gcc/testsuite/gcc.dg/autopar/reduc-2.c
> index 3ad16e4..2d0b2a1 100644
> --- a/gcc/testsuite/gcc.dg/autopar/reduc-2.c
> +++ b/gcc/testsuite/gcc.dg/autopar/reduc-2.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
>  
>  #include <stdarg.h>
>  #include <stdlib.h>
> @@ -63,6 +63,7 @@ int main (void)
>    return 0;
>  }
>  
> -/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" } } */
> -/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
>  
> diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-2char.c b/gcc/testsuite/gcc.dg/autopar/reduc-2char.c
> index 072489f..49ef16d 100644
> --- a/gcc/testsuite/gcc.dg/autopar/reduc-2char.c
> +++ b/gcc/testsuite/gcc.dg/autopar/reduc-2char.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
>  
>  #include <stdarg.h>
>  #include <stdlib.h>
> @@ -60,7 +60,8 @@ int main (void)
>  }
>  
>  
> -/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
> -/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
>  
>  
> diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-2short.c b/gcc/testsuite/gcc.dg/autopar/reduc-2short.c
> index 4dbbc8a..3ec1c2a 100644
> --- a/gcc/testsuite/gcc.dg/autopar/reduc-2short.c
> +++ b/gcc/testsuite/gcc.dg/autopar/reduc-2short.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
>  
>  #include <stdarg.h>
>  #include <stdlib.h>
> @@ -59,6 +59,7 @@ int main (void)
>  }
>  
>  
> -/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
> -/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
>  
> diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-3.c b/gcc/testsuite/gcc.dg/autopar/reduc-3.c
> index 0d4baef..e7ca82b 100644
> --- a/gcc/testsuite/gcc.dg/autopar/reduc-3.c
> +++ b/gcc/testsuite/gcc.dg/autopar/reduc-3.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
>  
>  #include <stdarg.h>
>  #include <stdlib.h>
> @@ -50,6 +50,7 @@ int main (void)
>  }
>  
>  
> -/* { dg-final { scan-tree-dump-times "Detected reduction" 1 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 1 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloopsred" } } */
>  /* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops" } } */
>  
> diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-6.c b/gcc/testsuite/gcc.dg/autopar/reduc-6.c
> index 91f679e..6c5ec7b 100644
> --- a/gcc/testsuite/gcc.dg/autopar/reduc-6.c
> +++ b/gcc/testsuite/gcc.dg/autopar/reduc-6.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
>  
>  #include <stdarg.h>
>  #include <stdlib.h>
> @@ -56,6 +56,6 @@ int main (void)
>  
>  
>  /* need -ffast-math to  parallelize these loops.  */
> -/* { dg-final { scan-tree-dump-times "Detected reduction" 0 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 0 "parloopsred" } } */
>  /* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
> -/* { dg-final { scan-tree-dump-times "FAILED: it is not a part of reduction" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "FAILED: it is not a part of reduction" 3 "parloopsred" } } */
> diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-7.c b/gcc/testsuite/gcc.dg/autopar/reduc-7.c
> index 77b99e1..dccf2a5 100644
> --- a/gcc/testsuite/gcc.dg/autopar/reduc-7.c
> +++ b/gcc/testsuite/gcc.dg/autopar/reduc-7.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
>  
>  #include <stdlib.h>
>  
> @@ -84,6 +84,7 @@ int main (void)
>  }
>  
>  
> -/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
> -/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
>  
> diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-8.c b/gcc/testsuite/gcc.dg/autopar/reduc-8.c
> index 16fb954..466bcc5 100644
> --- a/gcc/testsuite/gcc.dg/autopar/reduc-8.c
> +++ b/gcc/testsuite/gcc.dg/autopar/reduc-8.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
>  
>  #include <stdlib.h>
>  
> @@ -84,5 +84,6 @@ int main (void)
>  }
>  
>  
> -/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
> -/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
> diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-9.c b/gcc/testsuite/gcc.dg/autopar/reduc-9.c
> index 90f4db2..11556d7 100644
> --- a/gcc/testsuite/gcc.dg/autopar/reduc-9.c
> +++ b/gcc/testsuite/gcc.dg/autopar/reduc-9.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
>  
>  #include <stdlib.h>
>  
> @@ -84,5 +84,6 @@ int main (void)
>  }
>  
>  
> -/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
> -/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
> diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-2.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-2.c
> index 24e605a..f1cf75f 100644
> --- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-2.c
> +++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-2.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target pthread } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
>  
>  /* Constant bound, vector addition.  */
>  
> @@ -19,9 +19,4 @@ f (void)
>        c[i] = a[i] + b[i];
>  }
>  
> -/* Three times three array accesses:
> -   - three in f._loopfn.0
> -   - three in the parallel
> -   - three in the low iteration count loop
> -   Crucially, none for a peeled off last iteration following the parallel.  */
> -/* { dg-final { scan-tree-dump-times "(?n)\\\[i" 9 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
> diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-3.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-3.c
> index fec53a1..6c34084 100644
> --- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-3.c
> +++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-3.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target pthread } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloopsred-details" } */
>  
>  /* Variable bound, reduction.  */
>  
> @@ -18,9 +18,4 @@ f (unsigned int n, unsigned int *__restrict__ a)
>    return sum;
>  }
>  
> -/* Three array accesses:
> -   - one in f._loopfn.0
> -   - one in the parallel
> -   - one in the low iteration count loop
> -   Crucially, none for a peeled off last iteration following the parallel.  */
> -/* { dg-final { scan-tree-dump-times "(?n)\\\* 4" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloopsred" } } */
> diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-4.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-4.c
> index 2b8d289..f051ed4 100644
> --- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-4.c
> +++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-4.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target pthread } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloopsred-details" } */
>  
>  /* Constant bound, reduction.  */
>  
> @@ -20,9 +20,4 @@ f (void)
>    return sum;
>  }
>  
> -/* Three array accesses:
> -   - one in f._loopfn.0
> -   - one in the parallel
> -   - one in the low iteration count loop
> -   Crucially, none for a peeled off last iteration following the parallel.  */
> -/* { dg-final { scan-tree-dump-times "(?n)\\\* 4" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloopsred" } } */
> diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-5.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-5.c
> index 3f799cf..3c1e99b 100644
> --- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-5.c
> +++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-5.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target pthread } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
>  
>  /* Variable bound, vector addition, unsigned loop counter, unsigned bound.  */
>  
> @@ -14,9 +14,4 @@ f (unsigned int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
>      c[i] = a[i] + b[i];
>  }
>  
> -/* Three times a store:
> -   - one in f._loopfn.0
> -   - one in the parallel
> -   - one in the low iteration count loop
> -   Crucially, none for a peeled off last iteration following the parallel.  */
> -/* { dg-final { scan-tree-dump-times "(?n)^  \\*_\[0-9\]*" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
> diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-6.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-6.c
> index ee19a55..edc60ba 100644
> --- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-6.c
> +++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-6.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target pthread } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
>  
>  /* Variable bound, vector addition, unsigned loop counter, signed bound.  */
>  
> @@ -14,9 +14,4 @@ f (int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
>      c[i] = a[i] + b[i];
>  }
>  
> -/* Three times a store:
> -   - one in f._loopfn.0
> -   - one in the parallel
> -   - one in the low iteration count loop
> -   Crucially, none for a peeled off last iteration following the parallel.  */
> -/* { dg-final { scan-tree-dump-times "(?n)^  \\*_\[0-9\]*" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
> diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-7.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-7.c
> index c337342..38be2e8 100644
> --- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-7.c
> +++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-7.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target pthread } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
>  
>  /* Variable bound, vector addition, signed loop counter, signed bound.  */
>  
> @@ -14,9 +14,4 @@ f (int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
>      c[i] = a[i] + b[i];
>  }
>  
> -/* Three times a store:
> -   - one in f._loopfn.0
> -   - one in the parallel
> -   - one in the low iteration count loop
> -   Crucially, none for a peeled off last iteration following the parallel.  */
> -/* { dg-final { scan-tree-dump-times "(?n)^  \\*_\[0-9\]*" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
> diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-pr66652.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-pr66652.c
> index 2ea097d..7b64368 100644
> --- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-pr66652.c
> +++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-pr66652.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target pthread } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloopsred-details" } */
>  
>  #include <stdio.h>
>  #include <stdlib.h>
> @@ -22,10 +22,5 @@ f (unsigned int n, unsigned int sum)
>    return sum;
>  }
>  
> -/* Four times % 13:
> -   - once in f._loopfn.0
> -   - once in the parallel
> -   - once in the low iteration count loop
> -   - once for a peeled off last iteration following the parallel.
> -   In other words, we want try_transform_to_exit_first_loop_alt to fail.  */
> -/* { dg-final { scan-tree-dump-times "(?n)% 13" 4 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 1 "parloopsred" } } */
> +/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 0 "parloopsred" } } */
> diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt.c
> index 0b69165..44596e3 100644
> --- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt.c
> +++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt.c
> @@ -1,6 +1,6 @@
>  /* { dg-do compile } */
>  /* { dg-require-effective-target pthread } */
> -/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
> +/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
>  
>  /* Variable bound, vector addition, signed loop counter, unsigned bound.  */
>  
> @@ -14,9 +14,5 @@ f (unsigned int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
>      c[i] = a[i] + b[i];
>  }
>  
> -/* Three times a store:
> -   - one in f._loopfn.0
> -   - one in the parallel
> -   - one in the low iteration count loop
> -   Crucially, none for a peeled off last iteration following the parallel.  */
> -/* { dg-final { scan-tree-dump-times "(?n)^  \\*_\[0-9\]*" 3 "parloops" } } */
> +/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
> +
> diff --git a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95 b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95
> index f26a6e3..52434f2 100644
> --- a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95
> +++ b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95
> @@ -1,7 +1,7 @@
>  ! { dg-additional-options "-O2" }
>  ! { dg-require-effective-target pthread }
>  ! { dg-additional-options "-ftree-parallelize-loops=2" }
> -! { dg-additional-options "-fdump-tree-parloops" }
> +! { dg-additional-options "-fdump-tree-parloops-details" }
>  
>  ! Constant bound, vector addition.
>  
> @@ -16,9 +16,4 @@ subroutine foo ()
>    end do
>  end subroutine foo
>  
> -! Three times plus 25:
> -! - once in f._loopfn.0
> -! - once in the parallel
> -! - once in the low iteration count loop
> -! Crucially, none for a peeled off last iteration following the parallel.
> -! { dg-final { scan-tree-dump-times "(?n) \\+ 25;" 3 "parloops" } }
> +! { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } }
> diff --git a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95 b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95
> index 6dc8a38..1eb9dfd 100644
> --- a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95
> +++ b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95
> @@ -1,7 +1,7 @@
>  ! { dg-additional-options "-O2" }
>  ! { dg-require-effective-target pthread }
>  ! { dg-additional-options "-ftree-parallelize-loops=2" }
> -! { dg-additional-options "-fdump-tree-parloops" }
> +! { dg-additional-options "-fdump-tree-parloops-details" }
>  
>  ! Variable bound, vector addition.
>  
> @@ -17,9 +17,5 @@ subroutine foo (nr)
>    end do
>  end subroutine foo
>  
> -! Three times plus 25:
> -! - once in f._loopfn.0
> -! - once in the parallel
> -! - once in the low iteration count loop
> -! Crucially, none for a peeled off last iteration following the parallel.
> -! { dg-final { scan-tree-dump-times "(?n) \\+ 25;" 3 "parloops" } }
> +! { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } }
> +
> diff --git a/gcc/testsuite/gfortran.dg/parloops-outer-1.f95 b/gcc/testsuite/gfortran.dg/parloops-outer-1.f95
> new file mode 100644
> index 0000000..144e4e8
> --- /dev/null
> +++ b/gcc/testsuite/gfortran.dg/parloops-outer-1.f95
> @@ -0,0 +1,37 @@
> +! { dg-do compile }
> +! { dg-additional-options "-O2" }
> +! { dg-additional-options "-ftree-parallelize-loops=2" }
> +! { dg-additional-options "-fdump-tree-parloops-all" }
> +! { dg-additional-options "-fdump-tree-optimized" }
> +
> +! Based on autopar/outer-1.c.
> +
> +program main
> +  implicit none
> +  integer, parameter         :: n = 500
> +  integer, dimension (0:n-1, 0:n-1) :: x
> +  integer                    :: i, j, ii, jj
> +
> +
> +  do ii = 0, n - 1
> +     do jj = 0, n - 1
> +        x(jj, ii) = ii + jj + 3
> +     end do
> +  end do
> +
> +  do i = 0, n - 1
> +     do j = 0, n - 1
> +        if (x(j, i) .ne. i + j + 3) call abort
> +     end do
> +  end do
> +
> +end program main
> +
> +! Check that only one loop is analyzed, and that it can be parallelized.
> +! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } }
> +! { dg-final { scan-tree-dump-not "FAILED:" "parloops" } }
> +! { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" } }
> +
> +! Check that the loop has been split off into a function.
> +! { dg-final { scan-tree-dump-times "(?n);; Function main._loopfn.0 " 1 "optimized" } }
> +
> diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
> index 036677b..4bfe588 100644
> --- a/gcc/tree-parloops.c
> +++ b/gcc/tree-parloops.c
> @@ -2238,7 +2238,15 @@ gen_parallel_loop (struct loop *loop,
>       increment) and immediately follows the loop exit test.  Attempt to move the
>       entry of the loop directly before the exit check and increase the number of
>       iterations of the loop by one.  */
> -  if (!try_transform_to_exit_first_loop_alt (loop, reduction_list, nit))
> +  if (try_transform_to_exit_first_loop_alt (loop, reduction_list, nit))
> +    {
> +      if (dump_file
> +	  && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file,
> +		 "alternative exit-first loop transform succeeded"
> +		 " for loop %d\n", loop->num);
> +    }
> +  else
>      {
>        /* Fall back on the method that handles more cases, but duplicates the
>  	 loop body: move the exit condition of LOOP to the beginning of its
> @@ -2508,7 +2516,7 @@ try_create_reduction_list (loop_p loop,
>     otherwise.  */
>  
>  static bool
> -parallelize_loops (void)
> +parallelize_loops (bool reductions_only)
>  {
>    unsigned n_threads = flag_tree_parallelize_loops;
>    bool changed = false;
> @@ -2584,10 +2592,31 @@ parallelize_loops (void)
>        if (!try_create_reduction_list (loop, &reduction_list))
>  	continue;
>  
> -      if (!flag_loop_parallelize_all
> -	  && !loop_parallel_p (loop, &parloop_obstack))
> +      if (reductions_only
> +	  && reduction_list.elements () == 0)
>  	continue;
>  
> +      if (!flag_loop_parallelize_all)
> +	{
> +	  bool independent = false;
> +
> +	  if (!independent
> +	      && loop->can_be_parallel)
> +	    {
> +	      if (dump_file
> +		  && (dump_flags & TDF_DETAILS))
> +		fprintf (dump_file,
> +			 "  SUCCESS: may be parallelized, graphite analysis\n");
> +	      independent = true;
> +	    }
> +
> +	  if (!independent)
> +	    independent = loop_parallel_p (loop, &parloop_obstack);
> +
> +	  if (!independent)
> +	    continue;
> +	}
> +
>        changed = true;
>        if (dump_file && (dump_flags & TDF_DETAILS))
>        {
> @@ -2652,7 +2681,7 @@ pass_parallelize_loops::execute (function *fun)
>    if (number_of_loops (fun) <= 1)
>      return 0;
>  
> -  if (parallelize_loops ())
> +  if (parallelize_loops (false))
>      {
>        fun->curr_properties &= ~(PROP_gimple_eomp);
>        return TODO_update_ssa;
> @@ -2668,3 +2697,57 @@ make_pass_parallelize_loops (gcc::context *ctxt)
>  {
>    return new pass_parallelize_loops (ctxt);
>  }
> +
> +namespace {
> +
> +const pass_data pass_data_parallelize_reductions =
> +{
> +  GIMPLE_PASS, /* type */
> +  "parloopsred", /* name */
> +  OPTGROUP_LOOP, /* optinfo_flags */
> +  TV_TREE_PARALLELIZE_LOOPS, /* tv_id */
> +  ( PROP_cfg | PROP_ssa ), /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  0, /* todo_flags_finish */
> +};
> +
> +class pass_parallelize_reductions : public gimple_opt_pass
> +{
> +public:
> +  pass_parallelize_reductions (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_parallelize_reductions, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +  {
> +    return (flag_tree_parallelize_loops > 1
> +	    && !gate_graphite_transforms ());
> +  }
> +  virtual unsigned int execute (function *);
> +}; // class pass_parallelize_reductions
> +
> +unsigned
> +pass_parallelize_reductions::execute (function *fun)
> +{
> +  if (number_of_loops (fun) <= 1)
> +    return 0;
> +
> +  if (parallelize_loops (true))
> +    {
> +      fun->curr_properties &= ~(PROP_gimple_eomp);
> +      return TODO_update_ssa;
> +    }
> +
> +  return 0;
> +}
> +
> +} // anon namespace
> +
> +gimple_opt_pass *
> +make_pass_parallelize_reductions (gcc::context *ctxt)
> +{
> +  return new pass_parallelize_reductions (ctxt);
> +}
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index c47b22e..f0a7017 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -368,7 +368,9 @@ extern gimple_opt_pass *make_pass_scev_cprop (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_empty_loop (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_record_bounds (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_graphite (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_graphite_parloops (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_graphite_transforms (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_graphite_transforms2 (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_if_conversion (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_loop_distribution (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_vectorize (gcc::context *ctxt);
> @@ -377,6 +379,7 @@ extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_parallelize_reductions (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
> @@ -595,6 +598,8 @@ extern gimple_opt_pass *make_pass_update_address_taken (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_convert_switch (gcc::context *ctxt);
>  extern gimple_opt_pass *make_pass_lower_vaarg (gcc::context *ctxt);
>  
> +extern bool gate_graphite_transforms (void);
> +
>  /* Current optimization pass.  */
>  extern opt_pass *current_pass;
>  
> diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
> index eca70a9..43724ed 100644
> --- a/gcc/tree-ssa-loop-ivcanon.c
> +++ b/gcc/tree-ssa-loop-ivcanon.c
> @@ -1421,7 +1421,11 @@ public:
>    {}
>  
>    /* opt_pass methods: */
> -  virtual bool gate (function *) { return flag_tree_loop_ivcanon != 0; }
> +  virtual bool gate (function *)
> +  {
> +    return (flag_tree_loop_ivcanon != 0
> +	    && flag_tree_parallelize_loops <= 1);
> +  }
>    virtual unsigned int execute (function *fun);
>  
>  }; // class pass_iv_canon
> -- 
> 1.9.1
>
Tom de Vries July 20, 2015, 11:24 p.m. UTC | #8
On 20/07/15 20:22, Sebastian Pop wrote:
> Tom de Vries wrote:
>>>>> graphite dependence analysis is too slow to be enabled unconditionally.
>>>>> (read: hours in some simple cases - see bugzilla)
>>>>
>>>> Haha, "cool"!  ;-)
>>>>
>>>> Maybe it is still reasonable to use graphite to analyze the code inside
>>>> OpenACC kernels regions -- maybe such code can reasonably be expected to
>>>> not have the properties that make its analysis lengthy?  So, Tom, could
>>>> you please identify and check such PRs, to get an understanding of what
>>>> these properties are?
>>>
>>> Like the one in PR62113 or 53852 or 59121.
>>
>> PR62113 and PR59121 do not reproduce for me on trunk.
>>
>> PR53852 does reproduce for me (to the point that I had to reset my laptop).
>
> ISL has a way to count the number of operations, based on a watermark it will
> output an error code that we can use to leave graphite: see documentation of
> isl_ctx_set_max_operations().  With that mechanism we can set a goal for
> graphite of at max (say 10% overhead) of whole compilation time.
>

Agreed, bounding graphite to a limited runtime sounds like a good idea.

Determining the bound (in terms of isl operations) doesn't look trivial 
though. I suppose a basic version could be the number of gimple 
operations in the function times a constant.
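
A minimal sketch of that basic version, assuming a made-up constant 
(GRAPHITE_OPS_PER_STMT) and a made-up helper name; neither exists in 
GCC. The result is the kind of value that could be handed to 
isl_ctx_set_max_operations():

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical tuning constant: isl operations allowed per gimple
   statement.  The value 1000 is an arbitrary placeholder.  */
#define GRAPHITE_OPS_PER_STMT 1000UL

/* Hypothetical helper: return an isl operation budget proportional to
   the size of the function, saturating at ULONG_MAX instead of
   wrapping around.  */
unsigned long
graphite_max_operations (unsigned long n_gimple_stmts)
{
  assert (GRAPHITE_OPS_PER_STMT > 0);

  /* Assumption: a limit of 0 is treated by isl as "no limit", so
     never return 0 for an empty function.  */
  if (n_gimple_stmts == 0)
    return 1;

  if (n_gimple_stmts > ULONG_MAX / GRAPHITE_OPS_PER_STMT)
    return ULONG_MAX;

  return n_gimple_stmts * GRAPHITE_OPS_PER_STMT;
}
```

Graphite would then bail out of the analysis when isl reports that the 
quota has been exceeded. Picking the constant so that the overhead 
stays near the 10% goal would still need measurements.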

Thanks,
- Tom
Tom de Vries July 21, 2015, 12:21 a.m. UTC | #9
On 20/07/15 20:31, Sebastian Pop wrote:
> Tom de Vries wrote:
>> So I wondered, why not always use the graphite dependency analysis
>> in parloops. (Of course you could use -floop-parallelize-all, but
>> that also changes the heuristic). So I wrote a patch for parloops to
>> use graphite dependency analysis by default (so without
>> -floop-parallelize-all), but while testing found out that all the
>> reduction test-cases started failing because the modifications
>> graphite makes to the code messes up the parloops reduction
>> analysis.
>>
>> Then I came up with this patch, which:
>> - first runs a parloops pass, restricted to reduction loops only,
>
> I would prefer to fix graphite to catch the reduction loop and avoid running an
> extra pass before graphite for that case.

> Can you please specify which file is
> failing to be parallelized?  Are they all those testcases that you update the flags?

Yep, f.i. autopar/reduc-1.c.

> Also it seems to me that you are missing -ffast-math to parallelize all these
> loops: without that flag graphite would not mark reductions as
> associative/commutative operations and they would not be recognized as parallel.

For an unsigned int reduction, we don't need -ffast-math, so we 
don't have to specify it for parloops. It seems graphite is too strict 
here, since it won't do any reductions without -fassociative-math.

But indeed, with -ffast-math -ftree-parallelize-loops=2 
-floop-parallelize-all we are able to parallelize the 3 reduction loops 
in autopar/reduc-1.c.

> Is that something the current parloops detection is not too strict about?

Parloops uses vect_is_simple_reduction_1, which tests extensively 
whether reordering of operations is allowed. Graphite's testing 
seems to be limited to checking -fassociative-math, which makes 
me suspect that tests are missing there, f.i. TYPE_OVERFLOW_TRAPS.
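
To illustrate why an unsigned int reduction needs no -ffast-math: 
unsigned arithmetic in C wraps modulo 2^N, so reassociating the sum is 
exact. A small sketch (function names are made up for illustration and 
are not taken from the testsuite):

```c
#include <assert.h>
#include <stddef.h>

/* Sum in source order, as the sequential loop would.  */
unsigned int
sum_forward (const unsigned int *a, size_t n)
{
  unsigned int sum = 0;

  assert (a != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    sum += a[i];
  return sum;
}

/* Sum as two partial sums that are combined at the end, roughly the
   way two threads would split a parallelized reduction.  */
unsigned int
sum_two_partials (const unsigned int *a, size_t n)
{
  unsigned int s0 = 0, s1 = 0;

  assert (a != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    {
      if (i % 2 == 0)
	s0 += a[i];
      else
	s1 += a[i];
    }
  return s0 + s1;
}
```

Both functions return the same value for any input, even when the 
intermediate sums wrap. The same does not hold for floating point, and 
for signed types with trapping overflow the reordering could change 
whether a trap occurs.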

Thanks,
- Tom
Tom de Vries July 26, 2015, 9:21 p.m. UTC | #10
On 16/07/15 12:28, Richard Biener wrote:
> On Thu, Jul 16, 2015 at 12:23 PM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Thu, Jul 16, 2015 at 12:19 PM, Thomas Schwinge
>> <thomas@codesourcery.com> wrote:
>>> Hi Tom!
>>>
>>> On Thu, 16 Jul 2015 10:46:00 +0200, Richard Biener <richard.guenther@gmail.com> wrote:
>>>> On Wed, Jul 15, 2015 at 10:26 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
>>>>> I tried to parallelize this fortran test-case (based on autopar/outer-1.c),
>>>>> [...]
>>>
>>>>> So I wondered, why not always use the graphite dependency analysis in
>>>>> parloops. (Of course you could use -floop-parallelize-all, but that also
>>>>> changes the heuristic). So I wrote a patch for parloops to use graphite
>>>>> dependency analysis by default (so without -floop-parallelize-all), but
>>>>> while testing found out that all the reduction test-cases started failing
>>>>> because the modifications graphite makes to the code messes up the parloops
>>>>> reduction analysis.
>>>>>
>>>>> Then I came up with this patch, which:
>>>>> - first runs a parloops pass, restricted to reduction loops only,
>>>>> - then runs graphite dependency analysis
>>>>> - followed by a normal parloops pass run.
>>>>>
>>>>> This way, we get to both:
>>>>> - compile the reduction testcases as before, and
>>>>> - profit from the better graphite dependency analysis otherwise.
>>>
>>>> graphite dependence analysis is too slow to be enabled unconditionally.
>>>> (read: hours in some simple cases - see bugzilla)
>>>
>>> Haha, "cool"!  ;-)
>>>
>>> Maybe it is still reasonable to use graphite to analyze the code inside
>>> OpenACC kernels regions -- maybe such code can reasonably be expected to
>>> not have the properties that make its analysis lengthy?  So, Tom, could
>>> you please identify and check such PRs, to get an understanding of what
>>> these properties are?
>>
>> Like the one in PR62113 or 53852 or 59121.
>
> Btw, it would be nice to handle this case (or at least figure out why we can't)
> in GCCs dependence analysis.
>

I wrote an equivalent test-case in C:
...
$ cat src/gcc/testsuite/gcc.dg/autopar/outer-7.c
/* { dg-do compile } */
/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details -fdump-tree-optimized" } */

void abort (void);

#define N 500

int
main (void)
{
   int i, j;
   int x[N][N];
   int *y = &x[0][0];

   for (i = 0; i < N; i++)
     for (j = 0; j < N; j++)
       /* y[i * N + j] == x[i][j].  */
       y[i * N + j] = i + j + 3;

   for (i = 0; i < N; i++)
     for (j = 0; j < N; j++)
       if (x[i][j] != i + j + 3)
	abort ();

   return 0;
}

/* Check that outer loop is parallelized.  */
/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" } } */
/* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" } } */
...

With -fno-tree-loop-ivcanon to keep the original iteration order, we get:
...
#(Data Ref:
#  bb: 4
#  stmt: *_15 = _17;
#  ref: *_15;
#  base_object: MEM[(int *)&x];
#  Access function 0: {{0B, +, 2000}_1, +, 4}_4
#)
#(Data Ref:
#  bb: 4
#  stmt: *_15 = _17;
#  ref: *_15;
#  base_object: MEM[(int *)&x];
#  Access function 0: {{0B, +, 2000}_1, +, 4}_4
#)
   access_fn_A: {{0B, +, 2000}_1, +, 4}_4
   access_fn_B: {{0B, +, 2000}_1, +, 4}_4

  (subscript
   iterations_that_access_an_element_twice_in_A: [0]
   last_conflict: scev_not_known
   iterations_that_access_an_element_twice_in_B: [0]
   last_conflict: scev_not_known
   (Subscript distance: 0 ))
   inner loop index: 0
   loop nest: (1 4 )
   distance_vector:   0   0
   distance_vector:   1 -500
   direction_vector:     =    =
   direction_vector:     +    -
)
   FAILED: data dependencies exist across iterations
...

If we replace the y[i * N + j] with x[i][j] we get instead:
...
#(Data Ref:
#  bb: 4
#  stmt: x[i_7][j_8] = _12;
#  ref: x[i_7][j_8];
#  base_object: x;
#  Access function 0: {0, +, 1}_4
#  Access function 1: {0, +, 1}_1
#)
#(Data Ref:
#  bb: 4
#  stmt: x[i_7][j_8] = _12;
#  ref: x[i_7][j_8];
#  base_object: x;
#  Access function 0: {0, +, 1}_4
#  Access function 1: {0, +, 1}_1
#)
   access_fn_A: {0, +, 1}_4
   access_fn_B: {0, +, 1}_4

  (subscript
   iterations_that_access_an_element_twice_in_A: [0]
   last_conflict: scev_not_known
   iterations_that_access_an_element_twice_in_B: [0]
   last_conflict: scev_not_known
   (Subscript distance: 0 ))
   access_fn_A: {0, +, 1}_1
   access_fn_B: {0, +, 1}_1

  (subscript
   iterations_that_access_an_element_twice_in_A: [0]
   last_conflict: scev_not_known
   iterations_that_access_an_element_twice_in_B: [0]
   last_conflict: scev_not_known
   (Subscript distance: 0 ))
   inner loop index: 0
   loop nest: (1 4 )
   distance_vector:   0   0
   direction_vector:     =    =
)
   SUCCESS: may be parallelized
parallelizing outer loop 8
...
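
As an aside on the first dump: the extra distance vector (1, -500) does solve the offset equation for the linearized access, since with the access function {{0B, +, 2000}_1, +, 4}_4 we have 2000 * 1 + 4 * (-500) = 0.  But a distance of -500 in the inner loop cannot occur when j only ranges over 0..N-1.  The stand-alone sketch below checks that arithmetic by brute force.  The helper names are made up for illustration and are not GCC internals, and it is only a guess that the reported failure comes from the analysis not using the inner loop bound here.

```c
#include <stdlib.h>

/* Illustration only: these helpers are hypothetical, not part of GCC.  */

/* Nonzero if moving DI outer and DJ inner iterations lands on the same
   element of an array accessed as y[i * n + j].  */
int
same_element (int di, int dj, int n)
{
  return di * n + dj == 0;
}

/* Nonzero if (DI, DJ) can actually occur when both loops run from 0 to
   N - 1, i.e. both distances are smaller than N in absolute value.  */
int
within_bounds (int di, int dj, int n)
{
  return abs (di) < n && abs (dj) < n;
}

/* Count the distance vectors that hit the same element while staying
   inside the loop bounds (the loop ranges enforce the bounds).  */
int
count_feasible_distances (int n)
{
  int count = 0;
  for (int di = -(n - 1); di < n; di++)
    for (int dj = -(n - 1); dj < n; dj++)
      if (same_element (di, dj, n))
        count++;
  return count;
}
```

With n = 500, same_element (1, -500, 500) holds but within_bounds (1, -500, 500) does not, and the only feasible distance is (0, 0), which is the single distance vector the x[i][j] version reports.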

Thanks,
- Tom
Sebastian Pop July 27, 2015, 3:48 a.m. UTC | #11
On Sun, Jul 26, 2015 at 4:21 PM, Tom de Vries <Tom_deVries@mentor.com> wrote:
> I wrote an equivalent test-case in C:
> [...]

It looks like a delinearization pass could help reconstruct a
two-dimensional array reference, and make the Banerjee dependence test
succeed.
Note that Graphite works in this case only because the loop bounds are
statically known: N is 500.  If N were instead passed in as a
function parameter, Graphite would also fail to compute the
dependence, as it cannot represent "i * N", so we would also need the
delinearization pass for Graphite.
Here is a bug that I recently opened for that:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66981
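
To make the parametric case concrete, here is a minimal sketch of the linearized access with a run-time N, next to the two-dimensional form that a delinearization pass would aim to recover.  The function names are hypothetical and not taken from the PR or the testsuite.

```c
/* Illustration only: hypothetical functions, not from the testsuite.  */

/* Linearized form: the subscript i * n + j multiplies a loop index by
   the parameter n, so it is not affine in (i, j, n).  */
void
fill_linear (int n, int *y)
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      y[i * n + j] = i + j + 3;
}

/* Delinearized form: each subscript is a plain loop index.  */
void
fill_2d (int n, int x[n][n])
{
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      x[i][j] = i + j + 3;
}

/* Return 1 if both forms store the same values for N == 4.  */
int
forms_agree (void)
{
  enum { N = 4 };
  int a[N * N];
  int b[N][N];

  fill_linear (N, a);
  fill_2d (N, b);
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      if (a[i * N + j] != b[i][j] || b[i][j] != i + j + 3)
        return 0;
  return 1;
}
```

Both functions write the same values to the same layout; only the second exposes one subscript per loop.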

Sebastian

Patch

Use graphite for parloops

2015-07-15  Tom de Vries  <tom@codesourcery.com>

	PR tree-optimization/66873
	* graphite-isl-ast-to-gimple.c (translate_isl_ast_for_loop)
	(scop_to_isl_ast): Handle flag_tree_parallelize_loops.
	* graphite-poly.c (apply_poly_transforms): Same.
	* graphite.c (gate_graphite_transforms): Remove static.
	(pass_graphite_parloops): New pass.
	(make_pass_graphite_parloops): New function.
	(pass_graphite_transforms2): New pass.
	(make_pass_graphite_transforms2): New function.
	* omp-low.c (pass_expand_omp_ssa::clone): Same.
	* passes.def: Add pass groups pass_parallelize_reductions and
	pass_graphite_parloops.
	* tree-parloops.c (gen_parallel_loop): Add debug print for alternative
	exit-first loop transform.
	(parallelize_loops): Add reductions_only parameter.
	(pass_parallelize_loops::execute): Call parallelize_loops with extra
	argument.
	(pass_parallelize_reductions): New pass.
	(pass_parallelize_reductions::execute)
	(make_pass_parallelize_reductions): New function.
	* tree-pass.h (make_pass_graphite_parloops)
	(make_pass_parallelize_reductions, make_pass_graphite_transforms2)
	(gate_graphite_transforms): Declare.
	* tree-ssa-loop-ivcanon.c (pass_iv_canon::gate): Return false if
	flag_tree_parallelize_loops > 1.

	* gcc.dg/autopar/outer-6.c: Update for new pass parloopsred.
	* gcc.dg/autopar/reduc-1.c: Same.
	* gcc.dg/autopar/reduc-1char.c: Same.
	* gcc.dg/autopar/reduc-1short.c: Same.
	* gcc.dg/autopar/reduc-2.c: Same.
	* gcc.dg/autopar/reduc-2char.c: Same.
	* gcc.dg/autopar/reduc-2short.c: Same.
	* gcc.dg/autopar/reduc-3.c: Same.
	* gcc.dg/autopar/reduc-6.c: Same.
	* gcc.dg/autopar/reduc-7.c: Same.
	* gcc.dg/autopar/reduc-8.c: Same.
	* gcc.dg/autopar/reduc-9.c: Same.
	* gcc.dg/parloops-exit-first-loop-alt-2.c: Same.
	* gcc.dg/parloops-exit-first-loop-alt-3.c: Same.
	* gcc.dg/parloops-exit-first-loop-alt-4.c: Same.
	* gcc.dg/parloops-exit-first-loop-alt-5.c: Same.
	* gcc.dg/parloops-exit-first-loop-alt-6.c: Same.
	* gcc.dg/parloops-exit-first-loop-alt-7.c: Same.
	* gcc.dg/parloops-exit-first-loop-alt-pr66652.c: Same.
	* gcc.dg/parloops-exit-first-loop-alt.c: Same.
	* gfortran.dg/parloops-exit-first-loop-alt-2.f95: Same.
	* gfortran.dg/parloops-exit-first-loop-alt.f95: Same.
	* gfortran.dg/parloops-outer-1.f95: New test.
---
 gcc/graphite-isl-ast-to-gimple.c                   |  6 +-
 gcc/graphite-poly.c                                |  3 +-
 gcc/graphite.c                                     | 83 ++++++++++++++++++-
 gcc/omp-low.c                                      |  1 +
 gcc/passes.def                                     | 11 +++
 gcc/testsuite/gcc.dg/autopar/outer-6.c             |  6 +-
 gcc/testsuite/gcc.dg/autopar/reduc-1.c             |  7 +-
 gcc/testsuite/gcc.dg/autopar/reduc-1char.c         |  7 +-
 gcc/testsuite/gcc.dg/autopar/reduc-1short.c        |  7 +-
 gcc/testsuite/gcc.dg/autopar/reduc-2.c             |  7 +-
 gcc/testsuite/gcc.dg/autopar/reduc-2char.c         |  7 +-
 gcc/testsuite/gcc.dg/autopar/reduc-2short.c        |  7 +-
 gcc/testsuite/gcc.dg/autopar/reduc-3.c             |  5 +-
 gcc/testsuite/gcc.dg/autopar/reduc-6.c             |  6 +-
 gcc/testsuite/gcc.dg/autopar/reduc-7.c             |  7 +-
 gcc/testsuite/gcc.dg/autopar/reduc-8.c             |  7 +-
 gcc/testsuite/gcc.dg/autopar/reduc-9.c             |  7 +-
 .../gcc.dg/parloops-exit-first-loop-alt-2.c        |  9 +--
 .../gcc.dg/parloops-exit-first-loop-alt-3.c        |  9 +--
 .../gcc.dg/parloops-exit-first-loop-alt-4.c        |  9 +--
 .../gcc.dg/parloops-exit-first-loop-alt-5.c        |  9 +--
 .../gcc.dg/parloops-exit-first-loop-alt-6.c        |  9 +--
 .../gcc.dg/parloops-exit-first-loop-alt-7.c        |  9 +--
 .../gcc.dg/parloops-exit-first-loop-alt-pr66652.c  | 11 +--
 .../gcc.dg/parloops-exit-first-loop-alt.c          | 10 +--
 .../gfortran.dg/parloops-exit-first-loop-alt-2.f95 |  9 +--
 .../gfortran.dg/parloops-exit-first-loop-alt.f95   | 10 +--
 gcc/testsuite/gfortran.dg/parloops-outer-1.f95     | 37 +++++++++
 gcc/tree-parloops.c                                | 93 ++++++++++++++++++++--
 gcc/tree-pass.h                                    |  5 ++
 gcc/tree-ssa-loop-ivcanon.c                        |  6 +-
 31 files changed, 303 insertions(+), 116 deletions(-)
 create mode 100644 gcc/testsuite/gfortran.dg/parloops-outer-1.f95

diff --git a/gcc/graphite-isl-ast-to-gimple.c b/gcc/graphite-isl-ast-to-gimple.c
index b32781a5..bdafd40 100644
--- a/gcc/graphite-isl-ast-to-gimple.c
+++ b/gcc/graphite-isl-ast-to-gimple.c
@@ -442,7 +442,8 @@  translate_isl_ast_for_loop (loop_p context_loop,
   redirect_edge_succ_nodup (next_e, after);
   set_immediate_dominator (CDI_DOMINATORS, next_e->dest, next_e->src);
 
-  if (flag_loop_parallelize_all)
+  if (flag_loop_parallelize_all
+      || flag_tree_parallelize_loops > 1)
   {
     isl_id *id = isl_ast_node_get_annotation (node_for);
     gcc_assert (id);
@@ -995,7 +996,8 @@  scop_to_isl_ast (scop_p scop, ivs_params &ip)
   context_isl = set_options (context_isl, schedule_isl, options_luj);
 
   isl_union_map *dependences = NULL;
-  if (flag_loop_parallelize_all)
+  if (flag_loop_parallelize_all
+      || flag_tree_parallelize_loops > 1)
   {
     dependences = scop_get_dependences (scop);
     context_isl =
diff --git a/gcc/graphite-poly.c b/gcc/graphite-poly.c
index bcd08d8..e32325e 100644
--- a/gcc/graphite-poly.c
+++ b/gcc/graphite-poly.c
@@ -241,7 +241,8 @@  apply_poly_transforms (scop_p scop)
   if (flag_graphite_identity)
     transform_done = true;
 
-  if (flag_loop_parallelize_all)
+  if (flag_loop_parallelize_all
+      || flag_tree_parallelize_loops > 1)
     transform_done = true;
 
   if (flag_loop_block)
diff --git a/gcc/graphite.c b/gcc/graphite.c
index a81ef6a..6ba58c0 100644
--- a/gcc/graphite.c
+++ b/gcc/graphite.c
@@ -319,7 +319,7 @@  graphite_transforms (struct function *fun)
   return 0;
 }
 
-static bool
+bool
 gate_graphite_transforms (void)
 {
   /* Enable -fgraphite pass if any one of the graphite optimization flags
@@ -373,6 +373,45 @@  make_pass_graphite (gcc::context *ctxt)
 
 namespace {
 
+const pass_data pass_data_graphite_parloops =
+{
+  GIMPLE_PASS, /* type */
+  "graphite_parloops", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_GRAPHITE, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_graphite_parloops : public gimple_opt_pass
+{
+public:
+  pass_graphite_parloops (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_graphite_parloops, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+    return (flag_tree_parallelize_loops > 1
+	    && !gate_graphite_transforms ());
+  }
+
+}; // class pass_graphite_parloops
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_graphite_parloops (gcc::context *ctxt)
+{
+  return new pass_graphite_parloops (ctxt);
+}
+
+namespace {
+
 const pass_data pass_data_graphite_transforms =
 {
   GIMPLE_PASS, /* type */
@@ -407,4 +446,46 @@  make_pass_graphite_transforms (gcc::context *ctxt)
   return new pass_graphite_transforms (ctxt);
 }
 
+/* It would be preferable to use a clone of pass_data_graphite_transforms rather
+   than declare a new pass.  But when using a clone of
+   pass_data_graphite_transforms (and changing the gate to trigger for
+   flag_tree_parallelize_loops > 1 as well) in pass group
+   pass_graphite_parloops, the pass is not executed.  */
+
+namespace {
+
+const pass_data pass_data_graphite_transforms2 =
+{
+  GIMPLE_PASS, /* type */
+  "graphite2", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_GRAPHITE_TRANSFORMS, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_graphite_transforms2 : public gimple_opt_pass
+{
+public:
+  pass_graphite_transforms2 (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_graphite_transforms2, ctxt)
+  {}
 
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+    return (flag_tree_parallelize_loops > 1);
+  }
+  virtual unsigned int execute (function *fun) { return graphite_transforms (fun); }
+}; // class pass_graphite_transforms2
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_graphite_transforms2 (gcc::context *ctxt)
+{
+  return new pass_graphite_transforms2 (ctxt);
+}
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 3135606..8cbee3a 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -9576,6 +9576,7 @@  public:
       return !(fun->curr_properties & PROP_gimple_eomp);
     }
   virtual unsigned int execute (function *) { return execute_expand_omp (); }
+  opt_pass *clone () { return new pass_expand_omp_ssa (m_ctxt); }
 
 }; // class pass_expand_omp_ssa
 
diff --git a/gcc/passes.def b/gcc/passes.def
index 5cd07ae..aa1d1a1 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -244,6 +244,17 @@  along with GCC; see the file COPYING3.  If not see
 	      NEXT_PASS (pass_dce);
 	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_iv_canon);
+	  NEXT_PASS (pass_parallelize_reductions);
+	  PUSH_INSERT_PASSES_WITHIN (pass_parallelize_reductions)
+	      NEXT_PASS (pass_expand_omp_ssa);
+	  POP_INSERT_PASSES ()
+	  NEXT_PASS (pass_graphite_parloops);
+	  PUSH_INSERT_PASSES_WITHIN (pass_graphite_parloops)
+	      NEXT_PASS (pass_graphite_transforms2);
+	      NEXT_PASS (pass_lim);
+	      NEXT_PASS (pass_copy_prop);
+	      NEXT_PASS (pass_dce);
+	  POP_INSERT_PASSES ()
 	  NEXT_PASS (pass_parallelize_loops);
 	  PUSH_INSERT_PASSES_WITHIN (pass_parallelize_loops)
 	      NEXT_PASS (pass_expand_omp_ssa);
diff --git a/gcc/testsuite/gcc.dg/autopar/outer-6.c b/gcc/testsuite/gcc.dg/autopar/outer-6.c
index 6bef7cc..0f01bd5 100644
--- a/gcc/testsuite/gcc.dg/autopar/outer-6.c
+++ b/gcc/testsuite/gcc.dg/autopar/outer-6.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-optimized" } */
 
 void abort (void);
 
@@ -44,6 +44,6 @@  int main(void)
 
 
 /* Check that outer loop is parallelized.  */
-/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 0 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 0 "parloopsred" } } */
 /* { dg-final { scan-tree-dump-times "loopfn" 4 "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-1.c b/gcc/testsuite/gcc.dg/autopar/reduc-1.c
index 6e9a280..4fc9b31 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-1.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-1.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -66,6 +66,7 @@  int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-1char.c b/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
index 48ead88..497b7e0 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -60,6 +60,7 @@  int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-1short.c b/gcc/testsuite/gcc.dg/autopar/reduc-1short.c
index f3f547c..6af8e4b 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-1short.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-1short.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -59,6 +59,7 @@  int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-2.c b/gcc/testsuite/gcc.dg/autopar/reduc-2.c
index 3ad16e4..2d0b2a1 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-2.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-2.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -63,6 +63,7 @@  int main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 4 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 3 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-2char.c b/gcc/testsuite/gcc.dg/autopar/reduc-2char.c
index 072489f..49ef16d 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-2char.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-2char.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -60,7 +60,8 @@  int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
 
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-2short.c b/gcc/testsuite/gcc.dg/autopar/reduc-2short.c
index 4dbbc8a..3ec1c2a 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-2short.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-2short.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -59,6 +59,7 @@  int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-3.c b/gcc/testsuite/gcc.dg/autopar/reduc-3.c
index 0d4baef..e7ca82b 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-3.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-3.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -50,6 +50,7 @@  int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 1 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 1 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloopsred" } } */
 /* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloops" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-6.c b/gcc/testsuite/gcc.dg/autopar/reduc-6.c
index 91f679e..6c5ec7b 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-6.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-6.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
 
 #include <stdarg.h>
 #include <stdlib.h>
@@ -56,6 +56,6 @@  int main (void)
 
 
 /* need -ffast-math to  parallelize these loops.  */
-/* { dg-final { scan-tree-dump-times "Detected reduction" 0 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 0 "parloopsred" } } */
 /* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "FAILED: it is not a part of reduction" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "FAILED: it is not a part of reduction" 3 "parloopsred" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-7.c b/gcc/testsuite/gcc.dg/autopar/reduc-7.c
index 77b99e1..dccf2a5 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-7.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-7.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
 
 #include <stdlib.h>
 
@@ -84,6 +84,7 @@  int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
 
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-8.c b/gcc/testsuite/gcc.dg/autopar/reduc-8.c
index 16fb954..466bcc5 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-8.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-8.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
 
 #include <stdlib.h>
 
@@ -84,5 +84,6 @@  int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
diff --git a/gcc/testsuite/gcc.dg/autopar/reduc-9.c b/gcc/testsuite/gcc.dg/autopar/reduc-9.c
index 90f4db2..11556d7 100644
--- a/gcc/testsuite/gcc.dg/autopar/reduc-9.c
+++ b/gcc/testsuite/gcc.dg/autopar/reduc-9.c
@@ -1,5 +1,5 @@ 
 /* { dg-do compile } */
-/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloops-details -fdump-tree-optimized" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=4 -fdump-tree-parloopsred-details -fdump-tree-parloops-details -fdump-tree-optimized" } */
 
 #include <stdlib.h>
 
@@ -84,5 +84,6 @@  int main (void)
 }
 
 
-/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloops" } } */
-/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "Detected reduction" 2 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 2 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } } */
diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-2.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-2.c
index 24e605a..f1cf75f 100644
--- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-2.c
+++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-2.c
@@ -1,6 +1,6 @@ 
 /* { dg-do compile } */
 /* { dg-require-effective-target pthread } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
 
 /* Constant bound, vector addition.  */
 
@@ -19,9 +19,4 @@  f (void)
       c[i] = a[i] + b[i];
 }
 
-/* Three times three array accesses:
-   - three in f._loopfn.0
-   - three in the parallel
-   - three in the low iteration count loop
-   Crucially, none for a peeled off last iteration following the parallel.  */
-/* { dg-final { scan-tree-dump-times "(?n)\\\[i" 9 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-3.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-3.c
index fec53a1..6c34084 100644
--- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-3.c
+++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-3.c
@@ -1,6 +1,6 @@ 
 /* { dg-do compile } */
 /* { dg-require-effective-target pthread } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloopsred-details" } */
 
 /* Variable bound, reduction.  */
 
@@ -18,9 +18,4 @@  f (unsigned int n, unsigned int *__restrict__ a)
   return sum;
 }
 
-/* Three array accesses:
-   - one in f._loopfn.0
-   - one in the parallel
-   - one in the low iteration count loop
-   Crucially, none for a peeled off last iteration following the parallel.  */
-/* { dg-final { scan-tree-dump-times "(?n)\\\* 4" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloopsred" } } */
diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-4.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-4.c
index 2b8d289..f051ed4 100644
--- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-4.c
+++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-4.c
@@ -1,6 +1,6 @@ 
 /* { dg-do compile } */
 /* { dg-require-effective-target pthread } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloopsred-details" } */
 
 /* Constant bound, reduction.  */
 
@@ -20,9 +20,4 @@  f (void)
   return sum;
 }
 
-/* Three array accesses:
-   - one in f._loopfn.0
-   - one in the parallel
-   - one in the low iteration count loop
-   Crucially, none for a peeled off last iteration following the parallel.  */
-/* { dg-final { scan-tree-dump-times "(?n)\\\* 4" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloopsred" } } */
diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-5.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-5.c
index 3f799cf..3c1e99b 100644
--- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-5.c
+++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-5.c
@@ -1,6 +1,6 @@ 
 /* { dg-do compile } */
 /* { dg-require-effective-target pthread } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
 
 /* Variable bound, vector addition, unsigned loop counter, unsigned bound.  */
 
@@ -14,9 +14,4 @@  f (unsigned int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
     c[i] = a[i] + b[i];
 }
 
-/* Three times a store:
-   - one in f._loopfn.0
-   - one in the parallel
-   - one in the low iteration count loop
-   Crucially, none for a peeled off last iteration following the parallel.  */
-/* { dg-final { scan-tree-dump-times "(?n)^  \\*_\[0-9\]*" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-6.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-6.c
index ee19a55..edc60ba 100644
--- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-6.c
+++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-6.c
@@ -1,6 +1,6 @@ 
 /* { dg-do compile } */
 /* { dg-require-effective-target pthread } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
 
 /* Variable bound, vector addition, unsigned loop counter, signed bound.  */
 
@@ -14,9 +14,4 @@  f (int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
     c[i] = a[i] + b[i];
 }
 
-/* Three times a store:
-   - one in f._loopfn.0
-   - one in the parallel
-   - one in the low iteration count loop
-   Crucially, none for a peeled off last iteration following the parallel.  */
-/* { dg-final { scan-tree-dump-times "(?n)^  \\*_\[0-9\]*" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-7.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-7.c
index c337342..38be2e8 100644
--- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-7.c
+++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-7.c
@@ -1,6 +1,6 @@ 
 /* { dg-do compile } */
 /* { dg-require-effective-target pthread } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
 
 /* Variable bound, vector addition, signed loop counter, signed bound.  */
 
@@ -14,9 +14,4 @@  f (int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
     c[i] = a[i] + b[i];
 }
 
-/* Three times a store:
-   - one in f._loopfn.0
-   - one in the parallel
-   - one in the low iteration count loop
-   Crucially, none for a peeled off last iteration following the parallel.  */
-/* { dg-final { scan-tree-dump-times "(?n)^  \\*_\[0-9\]*" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-pr66652.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-pr66652.c
index 2ea097d..7b64368 100644
--- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-pr66652.c
+++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt-pr66652.c
@@ -1,6 +1,6 @@ 
 /* { dg-do compile } */
 /* { dg-require-effective-target pthread } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloopsred-details" } */
 
 #include <stdio.h>
 #include <stdlib.h>
@@ -22,10 +22,5 @@  f (unsigned int n, unsigned int sum)
   return sum;
 }
 
-/* Four times % 13:
-   - once in f._loopfn.0
-   - once in the parallel
-   - once in the low iteration count loop
-   - once for a peeled off last iteration following the parallel.
-   In other words, we want try_transform_to_exit_first_loop_alt to fail.  */
-/* { dg-final { scan-tree-dump-times "(?n)% 13" 4 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "parallelizing inner loop" 1 "parloopsred" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 0 "parloopsred" } } */
diff --git a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt.c b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt.c
index 0b69165..44596e3 100644
--- a/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt.c
+++ b/gcc/testsuite/gcc.dg/parloops-exit-first-loop-alt.c
@@ -1,6 +1,6 @@ 
 /* { dg-do compile } */
 /* { dg-require-effective-target pthread } */
-/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops" } */
+/* { dg-options "-O2 -ftree-parallelize-loops=2 -fdump-tree-parloops-details" } */
 
 /* Variable bound, vector addition, signed loop counter, unsigned bound.  */
 
@@ -14,9 +14,5 @@  f (unsigned int n, unsigned int *__restrict__ a, unsigned int *__restrict__ b,
     c[i] = a[i] + b[i];
 }
 
-/* Three times a store:
-   - one in f._loopfn.0
-   - one in the parallel
-   - one in the low iteration count loop
-   Crucially, none for a peeled off last iteration following the parallel.  */
-/* { dg-final { scan-tree-dump-times "(?n)^  \\*_\[0-9\]*" 3 "parloops" } } */
+/* { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } } */
+
diff --git a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95 b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95
index f26a6e3..52434f2 100644
--- a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95
+++ b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt-2.f95
@@ -1,7 +1,7 @@ 
 ! { dg-additional-options "-O2" }
 ! { dg-require-effective-target pthread }
 ! { dg-additional-options "-ftree-parallelize-loops=2" }
-! { dg-additional-options "-fdump-tree-parloops" }
+! { dg-additional-options "-fdump-tree-parloops-details" }
 
 ! Constant bound, vector addition.
 
@@ -16,9 +16,4 @@  subroutine foo ()
   end do
 end subroutine foo
 
-! Three times plus 25:
-! - once in f._loopfn.0
-! - once in the parallel
-! - once in the low iteration count loop
-! Crucially, none for a peeled off last iteration following the parallel.
-! { dg-final { scan-tree-dump-times "(?n) \\+ 25;" 3 "parloops" } }
+! { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } }
diff --git a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95 b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95
index 6dc8a38..1eb9dfd 100644
--- a/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95
+++ b/gcc/testsuite/gfortran.dg/parloops-exit-first-loop-alt.f95
@@ -1,7 +1,7 @@ 
 ! { dg-additional-options "-O2" }
 ! { dg-require-effective-target pthread }
 ! { dg-additional-options "-ftree-parallelize-loops=2" }
-! { dg-additional-options "-fdump-tree-parloops" }
+! { dg-additional-options "-fdump-tree-parloops-details" }
 
 ! Variable bound, vector addition.
 
@@ -17,9 +17,5 @@  subroutine foo (nr)
   end do
 end subroutine foo
 
-! Three times plus 25:
-! - once in f._loopfn.0
-! - once in the parallel
-! - once in the low iteration count loop
-! Crucially, none for a peeled off last iteration following the parallel.
-! { dg-final { scan-tree-dump-times "(?n) \\+ 25;" 3 "parloops" } }
+! { dg-final { scan-tree-dump-times "alternative exit-first loop transform succeeded" 1 "parloops" } }
+
diff --git a/gcc/testsuite/gfortran.dg/parloops-outer-1.f95 b/gcc/testsuite/gfortran.dg/parloops-outer-1.f95
new file mode 100644
index 0000000..144e4e8
--- /dev/null
+++ b/gcc/testsuite/gfortran.dg/parloops-outer-1.f95
@@ -0,0 +1,37 @@ 
+! { dg-do compile }
+! { dg-additional-options "-O2" }
+! { dg-additional-options "-ftree-parallelize-loops=2" }
+! { dg-additional-options "-fdump-tree-parloops-all" }
+! { dg-additional-options "-fdump-tree-optimized" }
+
+! Based on autopar/outer-1.c.
+
+program main
+  implicit none
+  integer, parameter         :: n = 500
+  integer, dimension (0:n-1, 0:n-1) :: x
+  integer                    :: i, j, ii, jj
+
+
+  do ii = 0, n - 1
+     do jj = 0, n - 1
+        x(jj, ii) = ii + jj + 3
+     end do
+  end do
+
+  do i = 0, n - 1
+     do j = 0, n - 1
+        if (x(j, i) .ne. i + j + 3) call abort
+     end do
+  end do
+
+end program main
+
+! Check that only one loop is analyzed, and that it can be parallelized.
+! { dg-final { scan-tree-dump-times "SUCCESS: may be parallelized" 1 "parloops" } }
+! { dg-final { scan-tree-dump-not "FAILED:" "parloops" } }
+! { dg-final { scan-tree-dump-times "parallelizing outer loop" 1 "parloops" } }
+
+! Check that the loop has been split off into a function.
+! { dg-final { scan-tree-dump-times "(?n);; Function main._loopfn.0 " 1 "optimized" } }
+
diff --git a/gcc/tree-parloops.c b/gcc/tree-parloops.c
index 036677b..4bfe588 100644
--- a/gcc/tree-parloops.c
+++ b/gcc/tree-parloops.c
@@ -2238,7 +2238,15 @@  gen_parallel_loop (struct loop *loop,
      increment) and immediately follows the loop exit test.  Attempt to move the
      entry of the loop directly before the exit check and increase the number of
      iterations of the loop by one.  */
-  if (!try_transform_to_exit_first_loop_alt (loop, reduction_list, nit))
+  if (try_transform_to_exit_first_loop_alt (loop, reduction_list, nit))
+    {
+      if (dump_file
+	  && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file,
+		 "alternative exit-first loop transform succeeded"
+		 " for loop %d\n", loop->num);
+    }
+  else
     {
       /* Fall back on the method that handles more cases, but duplicates the
 	 loop body: move the exit condition of LOOP to the beginning of its
@@ -2508,7 +2516,7 @@  try_create_reduction_list (loop_p loop,
    otherwise.  */
 
 static bool
-parallelize_loops (void)
+parallelize_loops (bool reductions_only)
 {
   unsigned n_threads = flag_tree_parallelize_loops;
   bool changed = false;
@@ -2584,10 +2592,31 @@  parallelize_loops (void)
       if (!try_create_reduction_list (loop, &reduction_list))
 	continue;
 
-      if (!flag_loop_parallelize_all
-	  && !loop_parallel_p (loop, &parloop_obstack))
+      if (reductions_only
+	  && reduction_list.elements () == 0)
 	continue;
 
+      if (!flag_loop_parallelize_all)
+	{
+	  bool independent = false;
+
+	  if (!independent
+	      && loop->can_be_parallel)
+	    {
+	      if (dump_file
+		  && (dump_flags & TDF_DETAILS))
+		fprintf (dump_file,
+			 "  SUCCESS: may be parallelized, graphite analysis\n");
+	      independent = true;
+	    }
+
+	  if (!independent)
+	    independent = loop_parallel_p (loop, &parloop_obstack);
+
+	  if (!independent)
+	    continue;
+	}
+
       changed = true;
       if (dump_file && (dump_flags & TDF_DETAILS))
       {
@@ -2652,7 +2681,7 @@  pass_parallelize_loops::execute (function *fun)
   if (number_of_loops (fun) <= 1)
     return 0;
 
-  if (parallelize_loops ())
+  if (parallelize_loops (false))
     {
       fun->curr_properties &= ~(PROP_gimple_eomp);
       return TODO_update_ssa;
@@ -2668,3 +2697,57 @@  make_pass_parallelize_loops (gcc::context *ctxt)
 {
   return new pass_parallelize_loops (ctxt);
 }
+
+namespace {
+
+const pass_data pass_data_parallelize_reductions =
+{
+  GIMPLE_PASS, /* type */
+  "parloopsred", /* name */
+  OPTGROUP_LOOP, /* optinfo_flags */
+  TV_TREE_PARALLELIZE_LOOPS, /* tv_id */
+  ( PROP_cfg | PROP_ssa ), /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  0, /* todo_flags_finish */
+};
+
+class pass_parallelize_reductions : public gimple_opt_pass
+{
+public:
+  pass_parallelize_reductions (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_parallelize_reductions, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+  {
+    return (flag_tree_parallelize_loops > 1
+	    && !gate_graphite_transforms ());
+  }
+  virtual unsigned int execute (function *);
+}; // class pass_parallelize_reductions
+
+unsigned
+pass_parallelize_reductions::execute (function *fun)
+{
+  if (number_of_loops (fun) <= 1)
+    return 0;
+
+  if (parallelize_loops (true))
+    {
+      fun->curr_properties &= ~(PROP_gimple_eomp);
+      return TODO_update_ssa;
+    }
+
+  return 0;
+}
+
+} // anon namespace
+
+gimple_opt_pass *
+make_pass_parallelize_reductions (gcc::context *ctxt)
+{
+  return new pass_parallelize_reductions (ctxt);
+}
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index c47b22e..f0a7017 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -368,7 +368,9 @@  extern gimple_opt_pass *make_pass_scev_cprop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_empty_loop (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_record_bounds (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_graphite (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_graphite_parloops (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_graphite_transforms (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_graphite_transforms2 (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_if_conversion (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_distribution (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_vectorize (gcc::context *ctxt);
@@ -377,6 +379,7 @@  extern gimple_opt_pass *make_pass_slp_vectorize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unroll (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_complete_unrolli (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_parallelize_loops (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_parallelize_reductions (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_loop_prefetch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_iv_optimize (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tree_loop_done (gcc::context *ctxt);
@@ -595,6 +598,8 @@  extern gimple_opt_pass *make_pass_update_address_taken (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_convert_switch (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_lower_vaarg (gcc::context *ctxt);
 
+extern bool gate_graphite_transforms (void);
+
 /* Current optimization pass.  */
 extern opt_pass *current_pass;
 
diff --git a/gcc/tree-ssa-loop-ivcanon.c b/gcc/tree-ssa-loop-ivcanon.c
index eca70a9..43724ed 100644
--- a/gcc/tree-ssa-loop-ivcanon.c
+++ b/gcc/tree-ssa-loop-ivcanon.c
@@ -1421,7 +1421,11 @@  public:
   {}
 
   /* opt_pass methods: */
-  virtual bool gate (function *) { return flag_tree_loop_ivcanon != 0; }
+  virtual bool gate (function *)
+  {
+    return (flag_tree_loop_ivcanon != 0
+	    && flag_tree_parallelize_loops <= 1);
+  }
   virtual unsigned int execute (function *fun);
 
 }; // class pass_iv_canon
-- 
1.9.1
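
For reviewers, the per-loop decision that the tree-parloops.c hunk above introduces in parallelize_loops can be modeled in isolation. This is only a sketch and not part of the patch: struct loop_model, its fields and should_parallelize are hypothetical names, and the two boolean inputs stand in for loop->can_be_parallel (set by the graphite analysis) and for the result of loop_parallel_p.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-loop facts the patch consults.  */
struct loop_model
{
  int n_reductions;       /* models reduction_list.elements () */
  bool can_be_parallel;   /* models loop->can_be_parallel (graphite result) */
  bool parloops_says_ok;  /* models the result of loop_parallel_p () */
};

/* Return true if the modeled loop would be handed to gen_parallel_loop.
   REDUCTIONS_ONLY models the new parallelize_loops argument;
   PARALLELIZE_ALL models flag_loop_parallelize_all.  */
static bool
should_parallelize (const struct loop_model *loop, bool reductions_only,
                    bool parallelize_all)
{
  /* The parloopsred pass skips loops without reductions.  */
  if (reductions_only && loop->n_reductions == 0)
    return false;

  /* With -floop-parallelize-all no dependence check is done here.  */
  if (parallelize_all)
    return true;

  /* Prefer the graphite verdict when it is positive.  */
  if (loop->can_be_parallel)
    return true;

  /* Otherwise fall back on the existing parloops analysis.  */
  return loop->parloops_says_ok;
}
```

In this model, a loop that only graphite can prove independent (n_reductions = 0, can_be_parallel = true, parloops_says_ok = false) is rejected when reductions_only is true and accepted when it is false, which corresponds to the parloopsred pass followed by the normal parloops pass.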