diff mbox

[RFA,PR,rtl-optimization/64317] Enhance postreload-gcse.c to eliminate more redundant loads

Message ID 5500B3FC.7040303@redhat.com
State New
Headers show

Commit Message

Jeff Law March 11, 2015, 9:30 p.m. UTC
64317 is a P2 regression related to making the PIC register a pseudo 
register on the x86 port.

Basically post-LRA we still have some redundant loads of the PIC 
register from memory (it's spilled to memory as other objects are more 
profitable to keep in registers).  But of the set of ~10 actual loads, 
we really only need ~5.

The basic problem is that LRA and the basic postreload optimizers 
effectively forget everything they know when they encounter a block that 
starts with a label -- even if that block is only reachable from one 
predecessor.  ie, these routines work on a subset of EBBs, but not 
maximal EBBs.

The first thought was to have LRA clean this up since this is largely an 
issue with cross-block inheritance of values.   Unfortunately Vlad has 
indicated that handling maximal EBBs is nontrivial in LRA.

I also investigated extending the simple EBB oriented optimizer in 
postreload.c to handle this case.  But it turns out that cselib really 
isn't suited to a scoped hash table, which would be the natural way to 
handle this.  So I killed that idea.

That leaves postreload-gcse and if you look at the incoming RTL you'd 
think that postreload-gcse would handle this just fine.  Unfortunately, 
postreload-gcse is badly mis-named. Specifically it does no dataflow 
analysis of available expressions!!

There may be some benefit in full dataflow analysis in 
postreload-gcse.c.  But we can get most of the benefits with a 
relatively simple change.  Specifically if we're unable to find the 
expression we want in the given block, see if the block has a single 
predecessor.  If the expression is available in that predecessor and 
transparent through the current block, then the expression is still 
available at the end of the current block.

That was enough to fix the PR.  But after instrumentation I was 
discouraged by the fact that postreload-gcse does virtually nothing on 
x86.  It's unable to find a single available load to put in its hash 
tables on an x86, x86_64 or ppc64 bootstrap.  Though many are found 
during testsuite runs.

Digging further into the history of postreload-gcse indicated the 
postreload-gcse.c work was primarily tested on ppc using spec.  My first 
attempts compiling spec with the instrumented compiler didn't produce 
anything useful either.   On a whim, I tried again with -O3 -funroll 
loops, then suddenly postreload-gcse.c got reasonably active.

Across a subset of spec2k, compiling with -O3 -funroll-loops, this patch 
enables about a 2X improvement in loads eliminated (ppc64).  On a ppc64 
-O3 -funroll-loops bootstrap of GCC I see 4142 loads eliminated with 
this patch and 2739 loads eliminated without.  A notable improvement.

Even with -O2 -funroll-loops, there's nothing interesting happening on 
x86_64.  Probably because it's not a load-store architecture.

Anyway, the question in my mind is whether or not we want to put this in 
during stage4.  The patch itself is medium sized, but mostly mechanical 
in nature.

It's been bootstrapped on x86.  It's been bootstrapped and regression 
tested on x86_64.  It's been bootstrapped and regression tested with -O3 
-funroll-loops on ppc64 (configured with --disable-werror).

OK for the trunk or wait for stage1?  I'm happy to go with the release 
manager's opinions on this.

Jeff

Comments

Jakub Jelinek March 16, 2015, 7:27 p.m. UTC | #1
On Wed, Mar 11, 2015 at 03:30:36PM -0600, Jeff Law wrote:
> +#ifndef GCC_GCSE__COMMONH
> +#define GCC_GCSE__COMMONH

GCC_GCSE_COMMON_H instead?

> @@ -1308,8 +1396,19 @@ gcse_after_reload_main (rtx f ATTRIBUTE_UNUSED)
>  
>    if (expr_table->elements () > 0)
>      {
> +      /* Knowing which MEMs are transparent through a block can signifiantly
> +	 increase the number of reundant loads found.  So compute transparency
> +	 information for for each memory expression in the hash table.  */

s/for for/for/ ?

> +      df_analyze ();
> +      /* This can not be part of the normal allocation routine because
> +	 we have to know the number of elements in the hash table.  */
> +      transp = sbitmap_vector_alloc (last_basic_block_for_fn (cfun),
> +				     expr_table->elements ());
> +      bitmap_vector_ones (transp, last_basic_block_for_fn (cfun));
> +      expr_table->traverse <FILE *, compute_expr_transp> (dump_file);
>        eliminate_partially_redundant_loads ();
>        delete_redundant_insns ();
> +      sbitmap_vector_free (transp);
>  
>        if (dump_file)
>  	{

What effect does the patch have on compile time on say x86_64 or ppc64?

I'm slightly leaning towards trying it even in stage4, but if e.g. richi
disagrees, we could defer it to stage1 too.

	Jakub
Jeff Law March 16, 2015, 7:45 p.m. UTC | #2
On 03/16/15 13:27, Jakub Jelinek wrote:
> On Wed, Mar 11, 2015 at 03:30:36PM -0600, Jeff Law wrote:
>> +#ifndef GCC_GCSE__COMMONH
>> +#define GCC_GCSE__COMMONH
>
> GCC_GCSE_COMMON_H instead?
:-) Will fix.

>
>> @@ -1308,8 +1396,19 @@ gcse_after_reload_main (rtx f ATTRIBUTE_UNUSED)
>>
>>     if (expr_table->elements () > 0)
>>       {
>> +      /* Knowing which MEMs are transparent through a block can signifiantly
>> +	 increase the number of reundant loads found.  So compute transparency
>> +	 information for for each memory expression in the hash table.  */
>
> s/for for/for/ ?
Similarly.


>
>> +      df_analyze ();
>> +      /* This can not be part of the normal allocation routine because
>> +	 we have to know the number of elements in the hash table.  */
>> +      transp = sbitmap_vector_alloc (last_basic_block_for_fn (cfun),
>> +				     expr_table->elements ());
>> +      bitmap_vector_ones (transp, last_basic_block_for_fn (cfun));
>> +      expr_table->traverse <FILE *, compute_expr_transp> (dump_file);
>>         eliminate_partially_redundant_loads ();
>>         delete_redundant_insns ();
>> +      sbitmap_vector_free (transp);
>>
>>         if (dump_file)
>>   	{
>
> What effect does the patch have on compile time on say x86_64 or ppc64?
I'll test both.  In the common case, the cost is going to be the basic 
bookkeeping so that we can compute the transparent property.  The actual 
computation of transparency and everything else is guarded on having 
something in the hash tables -- and the overwhelming majority of the 
time there's nothing in the hash tables.

Regardless, I'll pin down boxes and do some testing.


>
> I'm slightly leaning towards trying it even in stage4, but if e.g. richi
> disagrees, we could defer it to stage1 too.
I'd be OK either way.  I just want us to make a decision one way or the 
other :-)

Jeff
Richard Biener March 17, 2015, 10:35 a.m. UTC | #3
On March 16, 2015 8:45:18 PM GMT+01:00, Jeff Law <law@redhat.com> wrote:
>On 03/16/15 13:27, Jakub Jelinek wrote:
>> On Wed, Mar 11, 2015 at 03:30:36PM -0600, Jeff Law wrote:
>>> +#ifndef GCC_GCSE__COMMONH
>>> +#define GCC_GCSE__COMMONH
>>
>> GCC_GCSE_COMMON_H instead?
>:-) Will fix.
>
>>
>>> @@ -1308,8 +1396,19 @@ gcse_after_reload_main (rtx f
>ATTRIBUTE_UNUSED)
>>>
>>>     if (expr_table->elements () > 0)
>>>       {
>>> +      /* Knowing which MEMs are transparent through a block can
>signifiantly
>>> +	 increase the number of reundant loads found.  So compute
>transparency
>>> +	 information for for each memory expression in the hash table.  */
>>
>> s/for for/for/ ?
>Similarly.
>
>
>>
>>> +      df_analyze ();
>>> +      /* This can not be part of the normal allocation routine
>because
>>> +	 we have to know the number of elements in the hash table.  */
>>> +      transp = sbitmap_vector_alloc (last_basic_block_for_fn
>(cfun),
>>> +				     expr_table->elements ());
>>> +      bitmap_vector_ones (transp, last_basic_block_for_fn (cfun));
>>> +      expr_table->traverse <FILE *, compute_expr_transp>
>(dump_file);
>>>         eliminate_partially_redundant_loads ();
>>>         delete_redundant_insns ();
>>> +      sbitmap_vector_free (transp);
>>>
>>>         if (dump_file)
>>>   	{
>>
>> What effect does the patch have on compile time on say x86_64 or
>ppc64?
>I'll test both.  In the common case, the cost is going to be the basic 
>bookkeeping so that we can compute the transparent property.  The
>actual 
>computation of transparency and everything else is guarded on having 
>something in the hash tables -- and the overwhelming majority of the 
>time there's nothing in the hash tables.
>
>Regardless, I'll pin down boxes and do some testing.
>
>
>>
>> I'm slightly leaning towards trying it even in stage4, but if e.g.
>richi
>> disagrees, we could defer it to stage1 too.
>I'd be OK either way.  I just want us to make a decision one way or the

If it fixes a regression then OK for trunk, otherwise please wait for stage 1 to open.

Richard.

>other :-)
>
>Jeff
Jeff Law March 17, 2015, 6:27 p.m. UTC | #4
On 03/17/2015 04:35 AM, Richard Biener wrote:

>> I'll test both.  In the common case, the cost is going to be the basic
>> bookkeeping so that we can compute the transparent property.  The
>> actual
>> computation of transparency and everything else is guarded on having
>> something in the hash tables -- and the overwhelming majority of the
>> time there's nothing in the hash tables.
>>
>> Regardless, I'll pin down boxes and do some testing.
>>
>>
>>>
>>> I'm slightly leaning towards trying it even in stage4, but if e.g.
>> richi
>>> disagrees, we could defer it to stage1 too.
>> I'd be OK either way.  I just want us to make a decision one way or the
>
> If it fixes a regression then OK for trunk, otherwise please wait for stage 1 to open.
It fixes 64317 which is a P2 regression.

jeff
Andrew Pinski March 18, 2015, 6:19 a.m. UTC | #5
On Tue, Mar 17, 2015 at 11:27 AM, Jeff Law <law@redhat.com> wrote:
> On 03/17/2015 04:35 AM, Richard Biener wrote:
>
>>> I'll test both.  In the common case, the cost is going to be the basic
>>> bookkeeping so that we can compute the transparent property.  The
>>> actual
>>> computation of transparency and everything else is guarded on having
>>> something in the hash tables -- and the overwhelming majority of the
>>> time there's nothing in the hash tables.
>>>
>>> Regardless, I'll pin down boxes and do some testing.
>>>
>>>
>>>>
>>>> I'm slightly leaning towards trying it even in stage4, but if e.g.
>>>
>>> richi
>>>>
>>>> disagrees, we could defer it to stage1 too.
>>>
>>> I'd be OK either way.  I just want us to make a decision one way or the
>>
>>
>> If it fixes a regression then OK for trunk, otherwise please wait for
>> stage 1 to open.
>
> It fixes 64317 which is a P2 regression.


I had a pass which I inherited from my predecessor which was basically
doing the same as your patch:
https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/postreload-load.c;h=2652e51f8eca310d51db3a30e5f6c8847be436ce;hb=refs/heads/apinski/thunderx-cost

But I like your patch much better than this one.  I saw many loads
being removed for AARCH64 also at -Ofast -funroll-loops on SPEC 2006
with this pass.  But it seemed to due to subregs which I had mentioned
at https://gcc.gnu.org/ml/gcc/2015-01/msg00125.html .  When I get a
chance I can test your patch out on AARCH64 and disable "my" pass to
see if "my" pass catches more than your patch.

Thanks,
Andrew

>
> jeff
>
Jeff Law March 21, 2015, 7:47 a.m. UTC | #6
On 03/16/2015 01:27 PM, Jakub Jelinek wrote:
>
> What effect does the patch have on compile time on say x86_64 or ppc64?
I bootstrapped x86_64 trunk then timed compiling 640 .i/.ii files with 
-O2.  That was repeated 10 times (to get a sense of variability).  Each 
run took a little over 40 minutes with variability of less than a second 
(slowest run  compared to fastest run).

I then also gathered data for a smaller set of runs for -O2 
-funroll-loops (given the consistency between runs, 10 iterations seemed 
like major overkill).  Again variability was less than a second.

I then applied the patch and repeated the -O2 and -O2 -funroll-loops 
test.  The patched compiler was within a second of the unpatched 
compiler for both the -O2 and -O2 -funroll-loops tests.  Given the 
difference was within the noise level of those tests, my conclusion is 
the new code makes no measurable difference in compile time performance 
for x86_64.

Similar tests are in progress on powerpc64-linux-gnu.

Jeff
Jakub Jelinek March 21, 2015, 8:13 a.m. UTC | #7
On Sat, Mar 21, 2015 at 01:47:10AM -0600, Jeff Law wrote:
> On 03/16/2015 01:27 PM, Jakub Jelinek wrote:
> >
> >What effect does the patch have on compile time on say x86_64 or ppc64?
> I bootstrapped x86_64 trunk then timed compiling 640 .i/.ii files with -O2.
> That was repeated 10 times (to get a sense of variability).  Each run took a
> little over 40 minutes with variability of less than a second (slowest run
> compared to fastest run).
> 
> I then also gathered data for a smaller set of runs for -O2 -funroll-loops
> (given the consistency between runs, 10 iterations seemed like major
> overkill).  Again variability was less than a second.
> 
> I then applied the patch and repeated the -O2 and -O2 -funroll-loops test.
> The patched compiler was within a second of the unpatched compiler for both
> the -O2 and -O2 -funroll-loops tests.  Given the difference was within the
> noise level of those tests, my conclusion is the new code makes no
> measurable difference in compile time performance for x86_64.
> 
> Similar tests are in progress on powerpc64-linux-gnu.

Thanks for the testing.  I think we want the patch in now.

	Jakub
Jeff Law March 23, 2015, 5:26 a.m. UTC | #8
On 03/16/2015 01:27 PM, Jakub Jelinek wrote:
> On Wed, Mar 11, 2015 at 03:30:36PM -0600, Jeff Law wrote:
>> +#ifndef GCC_GCSE__COMMONH
>> +#define GCC_GCSE__COMMONH
>
> GCC_GCSE_COMMON_H instead?
>
>> @@ -1308,8 +1396,19 @@ gcse_after_reload_main (rtx f ATTRIBUTE_UNUSED)
>>
>>     if (expr_table->elements () > 0)
>>       {
>> +      /* Knowing which MEMs are transparent through a block can signifiantly
>> +	 increase the number of reundant loads found.  So compute transparency
>> +	 information for for each memory expression in the hash table.  */
>
> s/for for/for/ ?
>
>> +      df_analyze ();
>> +      /* This can not be part of the normal allocation routine because
>> +	 we have to know the number of elements in the hash table.  */
>> +      transp = sbitmap_vector_alloc (last_basic_block_for_fn (cfun),
>> +				     expr_table->elements ());
>> +      bitmap_vector_ones (transp, last_basic_block_for_fn (cfun));
>> +      expr_table->traverse <FILE *, compute_expr_transp> (dump_file);
>>         eliminate_partially_redundant_loads ();
>>         delete_redundant_insns ();
>> +      sbitmap_vector_free (transp);
>>
>>         if (dump_file)
>>   	{
>
> What effect does the patch have on compile time on say x86_64 or ppc64?
>
> I'm slightly leaning towards trying it even in stage4, but if e.g. richi
> disagrees, we could defer it to stage1 too.
Oh yea, forgot to mention, for PPC the compile-time testing results were 
essentially the same -- there's significantly more variation in the 
timings, but the before/after comparisons were in the noise.

Jeff
Jeff Law March 23, 2015, 5:27 a.m. UTC | #9
On 03/18/2015 12:19 AM, Andrew Pinski wrote:
> On Tue, Mar 17, 2015 at 11:27 AM, Jeff Law <law@redhat.com> wrote:
>> On 03/17/2015 04:35 AM, Richard Biener wrote:
>>
>>>> I'll test both.  In the common case, the cost is going to be the basic
>>>> bookkeeping so that we can compute the transparent property.  The
>>>> actual
>>>> computation of transparency and everything else is guarded on having
>>>> something in the hash tables -- and the overwhelming majority of the
>>>> time there's nothing in the hash tables.
>>>>
>>>> Regardless, I'll pin down boxes and do some testing.
>>>>
>>>>
>>>>>
>>>>> I'm slightly leaning towards trying it even in stage4, but if e.g.
>>>>
>>>> richi
>>>>>
>>>>> disagrees, we could defer it to stage1 too.
>>>>
>>>> I'd be OK either way.  I just want us to make a decision one way or the
>>>
>>>
>>> If it fixes a regression then OK for trunk, otherwise please wait for
>>> stage 1 to open.
>>
>> It fixes 64317 which is a P2 regression.
>
>
> I had a pass which I inherited from my predecessor which was basically
> doing the same as your patch:
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/postreload-load.c;h=2652e51f8eca310d51db3a30e5f6c8847be436ce;hb=refs/heads/apinski/thunderx-cost
>
> But I like your patch much better than this one.  I saw many loads
> being removed for AARCH64 also at -Ofast -funroll-loops on SPEC 2006
> with this pass.  But it seemed to due to subregs which I had mentioned
> at https://gcc.gnu.org/ml/gcc/2015-01/msg00125.html .  When I get a
> chance I can test your patch out on AARCH64 and disable "my" pass to
> see if "my" pass catches more than your patch.
Well, they're doing different things, so they ought to be complementary 
to some degree.
Jeff
Jeff Law April 24, 2015, 3:10 a.m. UTC | #10
On 03/18/2015 12:19 AM, Andrew Pinski wrote:
> On Tue, Mar 17, 2015 at 11:27 AM, Jeff Law <law@redhat.com> wrote:
>> On 03/17/2015 04:35 AM, Richard Biener wrote:
>>
>>>> I'll test both.  In the common case, the cost is going to be the basic
>>>> bookkeeping so that we can compute the transparent property.  The
>>>> actual
>>>> computation of transparency and everything else is guarded on having
>>>> something in the hash tables -- and the overwhelming majority of the
>>>> time there's nothing in the hash tables.
>>>>
>>>> Regardless, I'll pin down boxes and do some testing.
>>>>
>>>>
>>>>>
>>>>> I'm slightly leaning towards trying it even in stage4, but if e.g.
>>>>
>>>> richi
>>>>>
>>>>> disagrees, we could defer it to stage1 too.
>>>>
>>>> I'd be OK either way.  I just want us to make a decision one way or the
>>>
>>>
>>> If it fixes a regression then OK for trunk, otherwise please wait for
>>> stage 1 to open.
>>
>> It fixes 64317 which is a P2 regression.
>
>
> I had a pass which I inherited from my predecessor which was basically
> doing the same as your patch:
> https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/postreload-load.c;h=2652e51f8eca310d51db3a30e5f6c8847be436ce;hb=refs/heads/apinski/thunderx-cost
>
> But I like your patch much better than this one.  I saw many loads
> being removed for AARCH64 also at -Ofast -funroll-loops on SPEC 2006
> with this pass.  But it seemed to due to subregs which I had mentioned
> at https://gcc.gnu.org/ml/gcc/2015-01/msg00125.html .  When I get a
> chance I can test your patch out on AARCH64 and disable "my" pass to
> see if "my" pass catches more than your patch.
ISTM this pass from your predecessor is tackling a different issue. 
Namely it arranges to copy the result of the load (or source of a store) 
from a register which gets clobbered into a non-clobbered hard register 
so that it can be re-used later.

My patch detected cases where the the load/store register is not 
clobbered across intermediate blocks and arranges to re-use it.

Unless I'm missing something, they ought to be able to work together.

I'd recommend officially submitting that code.  Bonus points for 
integrating it into the existing pass rather than running it as a 
separate pass.

jeff
diff mbox

Patch

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 28979d5..91863be 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,34 @@ 
+2015-03-11  Jeff Law  <law@redhat.com>
+
+	PR rtl-optimization/64317
+	* Makefile.in (OBJS): Add gcse-common.c
+	* gcse.c: Include gcse-common.h
+	(struct modify_pair_s): Move structure definition to gcse-common.h
+	(compute_transp): Move function to gcse-common.c.
+	(canon_list_insert): Similarly.
+	(record_last_mem_set_info): Break out some code and put it into
+	gcse-common.c.  Call into the new common code.
+	(compute_local_properties): Pass additional arguments to compute_transp.
+	* postreload-gcse.c: Include gcse-common.h and df.h
+	(modify_mem_list_set, blocks_with_calls): New variables.
+	(modify_mem_list, canon_modify_mem_list, transp): Likewise.
+	(get_bb_avail_insn): Pass in the expression index too.
+	(alloc_mem): Allocate memory for the new bitmaps and lists.
+	(free_mem): Free memory for the new bitmaps and lists.
+	(insert_expr_in_table): Record a bitmap index for each entry we
+	add to the table.
+	(record_last_mem_set_info): Call into common code in gcse-common.c.
+	(get_bb_avail_insn): If no available insn was found in the requested
+	BB.  If BB has a single predecessor, see if the expression is
+	transparent in BB and available in that single predecessor.
+	(compute_expr_transp): New wrapper for compute_transp.
+	(eliminate_partially_redundant_load): Pass expression's bitmap_index
+	to get_bb_avail_insn.  Compute next_pred_bb_end a bit later.
+	(gcse_after_reload_main): If there are elements in the hash table,
+	then compute transparency for all the elements in the hash table.
+	* gcse-common.h: New file.
+	* gcse-common.c: New file.
+
 2015-03-11  Marek Polacek  <polacek@redhat.com>
 
 	PR tree-optimization/65388
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index de1f3b6..670ddeb 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1235,6 +1235,7 @@  OBJS = \
 	function.o \
 	fwprop.o \
 	gcse.o \
+	gcse-common.o \
 	ggc-common.o \
 	gimple.o \
 	gimple-builder.o \
diff --git a/gcc/gcse-common.c b/gcc/gcse-common.c
new file mode 100644
index 0000000..f7cf4fb
--- /dev/null
+++ b/gcc/gcse-common.c
@@ -0,0 +1,227 @@ 
+/* Shared code for before and after reload gcse implementations.
+   Copyright (C) 1997-2015 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it under
+   the terms of the GNU General Public License as published by the Free
+   Software Foundation; either version 3, or (at your option) any later
+   version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or
+   FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+   for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>. 
+
+   It is expected that more hunks of gcse.c and postreload-gcse.c should
+   migrate into this file.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "rtl.h"
+#include "vec.h"
+#include "predict.h"
+#include "bitmap.h"
+#include "basic-block.h"
+#include "df.h"
+#include "gcse-common.h"
+
+
+/* Record all of the canonicalized MEMs of record_last_mem_set_info's insn.
+   Note we store a pair of elements in the list, so they have to be
+   taken off pairwise.  */
+
+void
+canon_list_insert (rtx dest, const_rtx x ATTRIBUTE_UNUSED, void *data)
+{
+  rtx dest_addr, insn;
+  int bb;
+  modify_pair pair;
+
+  while (GET_CODE (dest) == SUBREG
+      || GET_CODE (dest) == ZERO_EXTRACT
+      || GET_CODE (dest) == STRICT_LOW_PART)
+    dest = XEXP (dest, 0);
+
+  /* If DEST is not a MEM, then it will not conflict with a load.  Note
+     that function calls are assumed to clobber memory, but are handled
+     elsewhere.  */
+
+  if (! MEM_P (dest))
+    return;
+
+  dest_addr = get_addr (XEXP (dest, 0));
+  dest_addr = canon_rtx (dest_addr);
+  insn = ((struct gcse_note_stores_info *)data)->insn;
+  bb = BLOCK_FOR_INSN (insn)->index;
+
+  pair.dest = dest;
+  pair.dest_addr = dest_addr;
+  vec<modify_pair> *canon_mem_list
+    = ((struct gcse_note_stores_info *)data)->canon_mem_list;
+  canon_mem_list[bb].safe_push (pair);
+}
+
+/* Record memory modification information for INSN.  We do not actually care
+   about the memory location(s) that are set, or even how they are set (consider
+   a CALL_INSN).  We merely need to record which insns modify memory.  */
+
+void
+record_last_mem_set_info_common (rtx_insn *insn,
+				 vec<rtx_insn *> *modify_mem_list,
+				 vec<modify_pair> *canon_modify_mem_list,
+				 bitmap modify_mem_list_set,
+				 bitmap blocks_with_calls)
+
+{
+  int bb;
+
+  bb = BLOCK_FOR_INSN (insn)->index;
+  modify_mem_list[bb].safe_push (insn);
+  bitmap_set_bit (modify_mem_list_set, bb);
+
+  if (CALL_P (insn))
+    bitmap_set_bit (blocks_with_calls, bb);
+  else
+    {
+      struct gcse_note_stores_info data;
+      data.insn = insn;
+      data.canon_mem_list = canon_modify_mem_list;
+      note_stores (PATTERN (insn), canon_list_insert, (void*) &data);
+    }
+}
+
+
+/* For each block, compute whether X is transparent.  X is either an
+   expression or an assignment [though we don't care which, for this context
+   an assignment is treated as an expression].  For each block where an
+   element of X is modified, reset the INDX bit in BMAP. 
+
+   BLOCKS_WITH_CALLS indicates which blocks contain CALL_INSNs which kill
+   memory.
+
+   MODIFY_MEM_LIST_SET indicates which blocks have memory stores which might
+   kill a particular memory location.
+
+   CANON_MODIFY_MEM_LIST is the canonicalized list of memory locations modified
+   for each block.  */
+
+void
+compute_transp (const_rtx x, int indx, sbitmap *bmap,
+		bitmap blocks_with_calls,
+		bitmap modify_mem_list_set,
+	        vec<modify_pair> *canon_modify_mem_list)
+{
+  int i, j;
+  enum rtx_code code;
+  const char *fmt;
+
+  /* repeat is used to turn tail-recursion into iteration since GCC
+     can't do it when there's no return value.  */
+ repeat:
+
+  if (x == 0)
+    return;
+
+  code = GET_CODE (x);
+  switch (code)
+    {
+    case REG:
+	{
+	  df_ref def;
+	  for (def = DF_REG_DEF_CHAIN (REGNO (x));
+	       def;
+	       def = DF_REF_NEXT_REG (def))
+	    bitmap_clear_bit (bmap[DF_REF_BB (def)->index], indx);
+	}
+
+      return;
+
+    case MEM:
+      if (! MEM_READONLY_P (x))
+	{
+	  bitmap_iterator bi;
+	  unsigned bb_index;
+	  rtx x_addr;
+
+	  x_addr = get_addr (XEXP (x, 0));
+	  x_addr = canon_rtx (x_addr);
+
+	  /* First handle all the blocks with calls.  We don't need to
+	     do any list walking for them.  */
+	  EXECUTE_IF_SET_IN_BITMAP (blocks_with_calls, 0, bb_index, bi)
+	    {
+	      bitmap_clear_bit (bmap[bb_index], indx);
+	    }
+
+	  /* Now iterate over the blocks which have memory modifications
+	     but which do not have any calls.  */
+	  EXECUTE_IF_AND_COMPL_IN_BITMAP (modify_mem_list_set,
+					  blocks_with_calls,
+					  0, bb_index, bi)
+	    {
+	      vec<modify_pair> list
+		= canon_modify_mem_list[bb_index];
+	      modify_pair *pair;
+	      unsigned ix;
+
+	      FOR_EACH_VEC_ELT_REVERSE (list, ix, pair)
+		{
+		  rtx dest = pair->dest;
+		  rtx dest_addr = pair->dest_addr;
+
+		  if (canon_true_dependence (dest, GET_MODE (dest),
+					     dest_addr, x, x_addr))
+		    {
+		      bitmap_clear_bit (bmap[bb_index], indx);
+		      break;
+		    }
+	        }
+	    }
+	}
+
+      x = XEXP (x, 0);
+      goto repeat;
+
+    case PC:
+    case CC0: /*FIXME*/
+    case CONST:
+    CASE_CONST_ANY:
+    case SYMBOL_REF:
+    case LABEL_REF:
+    case ADDR_VEC:
+    case ADDR_DIFF_VEC:
+      return;
+
+    default:
+      break;
+    }
+
+  for (i = GET_RTX_LENGTH (code) - 1, fmt = GET_RTX_FORMAT (code); i >= 0; i--)
+    {
+      if (fmt[i] == 'e')
+	{
+	  /* If we are about to do the last recursive call
+	     needed at this level, change it into iteration.
+	     This function is called enough to be worth it.  */
+	  if (i == 0)
+	    {
+	      x = XEXP (x, i);
+	      goto repeat;
+	    }
+
+	  compute_transp (XEXP (x, i), indx, bmap, blocks_with_calls,
+			  modify_mem_list_set, canon_modify_mem_list);
+	}
+      else if (fmt[i] == 'E')
+	for (j = 0; j < XVECLEN (x, i); j++)
+	  compute_transp (XVECEXP (x, i, j), indx, bmap, blocks_with_calls,
+			  modify_mem_list_set, canon_modify_mem_list);
+    }
+}
diff --git a/gcc/gcse-common.h b/gcc/gcse-common.h
new file mode 100644
index 0000000..b485b22
--- /dev/null
+++ b/gcc/gcse-common.h
@@ -0,0 +1,47 @@ 
+/* Structures and prototypes common across the normal GCSE
+   implementation and the post-reload implementation.
+   Copyright (C) 1997-2015 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_GCSE__COMMONH
+#define GCC_GCSE__COMMONH
+
+typedef vec<rtx_insn *> vec_rtx_heap;
+typedef struct modify_pair_s
+{
+  rtx dest;                     /* A MEM.  */
+  rtx dest_addr;                /* The canonical address of `dest'.  */
+} modify_pair;
+
+typedef vec<modify_pair> vec_modify_pair_heap;
+
+struct gcse_note_stores_info
+{
+  rtx_insn *insn;
+  vec<modify_pair> *canon_mem_list;
+};
+
+extern void compute_transp (const_rtx, int, sbitmap *, bitmap,
+			    bitmap, vec<modify_pair> *);
+extern void record_last_mem_set_info_common (rtx_insn *,
+					     vec<rtx_insn *> *,
+					     vec<modify_pair> *,
+					     bitmap, bitmap);
+
+
+#endif
diff --git a/gcc/gcse.c b/gcc/gcse.c
index e03b36c..37aac6a 100644
--- a/gcc/gcse.c
+++ b/gcc/gcse.c
@@ -189,6 +189,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "dbgcnt.h"
 #include "target.h"
 #include "gcse.h"
+#include "gcse-common.h"
 
 /* We support GCSE via Partial Redundancy Elimination.  PRE optimizations
    are a superset of those done by classic GCSE.
@@ -424,13 +425,6 @@  static regset reg_set_bitmap;
 static vec<rtx_insn *> *modify_mem_list;
 static bitmap modify_mem_list_set;
 
-typedef struct modify_pair_s
-{
-  rtx dest;			/* A MEM.  */
-  rtx dest_addr;		/* The canonical address of `dest'.  */
-} modify_pair;
-
-
 /* This array parallels modify_mem_list, except that it stores MEMs
    being set and their canonicalized memory addresses.  */
 static vec<modify_pair> *canon_modify_mem_list;
@@ -507,12 +501,10 @@  static void alloc_hash_table (struct gcse_hash_table_d *);
 static void free_hash_table (struct gcse_hash_table_d *);
 static void compute_hash_table_work (struct gcse_hash_table_d *);
 static void dump_hash_table (FILE *, const char *, struct gcse_hash_table_d *);
-static void compute_transp (const_rtx, int, sbitmap *);
 static void compute_local_properties (sbitmap *, sbitmap *, sbitmap *,
 				      struct gcse_hash_table_d *);
 static void mems_conflict_for_gcse_p (rtx, const_rtx, void *);
 static int load_killed_in_block_p (const_basic_block, int, const_rtx, int);
-static void canon_list_insert (rtx, const_rtx, void *);
 static void alloc_pre_mem (int, int);
 static void free_pre_mem (void);
 static struct edge_list *compute_pre_data (void);
@@ -733,7 +725,10 @@  compute_local_properties (sbitmap *transp, sbitmap *comp, sbitmap *antloc,
 	     We start by assuming all are transparent [none are killed], and
 	     then reset the bits for those that are.  */
 	  if (transp)
-	    compute_transp (expr->expr, indx, transp);
+	    compute_transp (expr->expr, indx, transp,
+			    blocks_with_calls,
+			    modify_mem_list_set,
+			    canon_modify_mem_list);
 
 	  /* The occurrences recorded in antic_occr are exactly those that
 	     we want to set to nonzero in ANTLOC.  */
@@ -1489,40 +1484,6 @@  record_last_reg_set_info (rtx insn, int regno)
     }
 }
 
-/* Record all of the canonicalized MEMs of record_last_mem_set_info's insn.
-   Note we store a pair of elements in the list, so they have to be
-   taken off pairwise.  */
-
-static void
-canon_list_insert (rtx dest ATTRIBUTE_UNUSED, const_rtx x ATTRIBUTE_UNUSED,
-		   void * v_insn)
-{
-  rtx dest_addr, insn;
-  int bb;
-  modify_pair pair;
-
-  while (GET_CODE (dest) == SUBREG
-      || GET_CODE (dest) == ZERO_EXTRACT
-      || GET_CODE (dest) == STRICT_LOW_PART)
-    dest = XEXP (dest, 0);
-
-  /* If DEST is not a MEM, then it will not conflict with a load.  Note
-     that function calls are assumed to clobber memory, but are handled
-     elsewhere.  */
-
-  if (! MEM_P (dest))
-    return;
-
-  dest_addr = get_addr (XEXP (dest, 0));
-  dest_addr = canon_rtx (dest_addr);
-  insn = (rtx) v_insn;
-  bb = BLOCK_FOR_INSN (insn)->index;
-
-  pair.dest = dest;
-  pair.dest_addr = dest_addr;
-  canon_modify_mem_list[bb].safe_push (pair);
-}
-
 /* Record memory modification information for INSN.  We do not actually care
    about the memory location(s) that are set, or even how they are set (consider
    a CALL_INSN).  We merely need to record which insns modify memory.  */
@@ -1530,21 +1491,13 @@  canon_list_insert (rtx dest ATTRIBUTE_UNUSED, const_rtx x ATTRIBUTE_UNUSED,
 static void
 record_last_mem_set_info (rtx_insn *insn)
 {
-  int bb;
-
   if (! flag_gcse_lm)
     return;
 
-  /* load_killed_in_block_p will handle the case of calls clobbering
-     everything.  */
-  bb = BLOCK_FOR_INSN (insn)->index;
-  modify_mem_list[bb].safe_push (insn);
-  bitmap_set_bit (modify_mem_list_set, bb);
-
-  if (CALL_P (insn))
-    bitmap_set_bit (blocks_with_calls, bb);
-  else
-    note_stores (PATTERN (insn), canon_list_insert, (void*) insn);
+  record_last_mem_set_info_common (insn, modify_mem_list,
+				   canon_modify_mem_list,
+				   modify_mem_list_set,
+				   blocks_with_calls);
 }
 
 /* Called from compute_hash_table via note_stores to handle one
@@ -1699,120 +1652,6 @@  free_modify_mem_tables (void)
   canon_modify_mem_list = 0;
 }
 
-/* For each block, compute whether X is transparent.  X is either an
-   expression or an assignment [though we don't care which, for this context
-   an assignment is treated as an expression].  For each block where an
-   element of X is modified, reset the INDX bit in BMAP.  */
-
-static void
-compute_transp (const_rtx x, int indx, sbitmap *bmap)
-{
-  int i, j;
-  enum rtx_code code;
-  const char *fmt;
-
-  /* repeat is used to turn tail-recursion into iteration since GCC
-     can't do it when there's no return value.  */
- repeat:
-
-  if (x == 0)
-    return;
-
-  code = GET_CODE (x);
-  switch (code)
-    {
-    case REG:
-	{
-	  df_ref def;
-	  for (def = DF_REG_DEF_CHAIN (REGNO (x));
-	       def;
-	       def = DF_REF_NEXT_REG (def))
-	    bitmap_clear_bit (bmap[DF_REF_BB (def)->index], indx);
-	}
-
-      return;
-
-    case MEM:
-      if (! MEM_READONLY_P (x))
-	{
-	  bitmap_iterator bi;
-	  unsigned bb_index;
-	  rtx x_addr;
-
-	  x_addr = get_addr (XEXP (x, 0));
-	  x_addr = canon_rtx (x_addr);
-
-	  /* First handle all the blocks with calls.  We don't need to
-	     do any list walking for them.  */
-	  EXECUTE_IF_SET_IN_BITMAP (blocks_with_calls, 0, bb_index, bi)
-	    {
-	      bitmap_clear_bit (bmap[bb_index], indx);
-	    }
-
-	  /* Now iterate over the blocks which have memory modifications
-	     but which do not have any calls.  */
-	  EXECUTE_IF_AND_COMPL_IN_BITMAP (modify_mem_list_set,
-					  blocks_with_calls,
-					  0, bb_index, bi)
-	    {
-	      vec<modify_pair> list
-		= canon_modify_mem_list[bb_index];
-	      modify_pair *pair;
-	      unsigned ix;
-
-	      FOR_EACH_VEC_ELT_REVERSE (list, ix, pair)
-		{
-		  rtx dest = pair->dest;
-		  rtx dest_addr = pair->dest_addr;
-
-		  if (canon_true_dependence (dest, GET_MODE (dest),
-					     dest_addr, x, x_addr))
-		    {
-		      bitmap_clear_bit (bmap[bb_index], indx);
-		      break;
-		    }
-	        }
-	    }
-	}
-
-      x = XEXP (x, 0);
-      goto repeat;
-
-    case PC:
-    case CC0: /*FIXME*/
-    case CONST:
-    CASE_CONST_ANY:
-    case SYMBOL_REF:
-    case LABEL_REF:
-    case ADDR_VEC:
-    case ADDR_DIFF_VEC:
-      return;
-
-    default:
-      break;
-    }
-
-  for (i = GET_RTX_LENGTH (code) - 1, fmt = GET_RTX_FORMAT (code); i >= 0; i--)
-    {
-      if (fmt[i] == 'e')
-	{
-	  /* If we are about to do the last recursive call
-	     needed at this level, change it into iteration.
-	     This function is called enough to be worth it.  */
-	  if (i == 0)
-	    {
-	      x = XEXP (x, i);
-	      goto repeat;
-	    }
-
-	  compute_transp (XEXP (x, i), indx, bmap);
-	}
-      else if (fmt[i] == 'E')
-	for (j = 0; j < XVECLEN (x, i); j++)
-	  compute_transp (XVECEXP (x, i, j), indx, bmap);
-    }
-}
-
 /* Compute PRE+LCM working variables.  */
 
 /* Local properties of expressions.  */
diff --git a/gcc/postreload-gcse.c b/gcc/postreload-gcse.c
index 324264a..eb1e0eb 100644
--- a/gcc/postreload-gcse.c
+++ b/gcc/postreload-gcse.c
@@ -67,6 +67,8 @@  along with GCC; see the file COPYING3.  If not see
 #include "target.h"
 #include "tree-pass.h"
 #include "dbgcnt.h"
+#include "df.h"
+#include "gcse-common.h"
 
 /* The following code implements gcse after reload, the purpose of this
    pass is to cleanup redundant loads generated by reload and other
@@ -121,6 +123,9 @@  struct expr
   /* The same hash for this entry.  */
   hashval_t hash;
 
+  /* Index in the transparent bitmaps.  */
+  unsigned int bitmap_index;
+
   /* List of available occurrence in basic blocks in the function.  */
   struct occr *avail_occr;
 };
@@ -235,6 +240,24 @@  static struct modifies_mem  *modifies_mem_obstack_bottom;
    block, have no gaps, and only apply to real insns.  */
 static int *uid_cuid;
 #define INSN_CUID(INSN) (uid_cuid[INSN_UID (INSN)])
+
+/* Bitmap of blocks which have memory stores.  */
+static bitmap modify_mem_list_set;
+
+/* Bitmap of blocks which have calls.  */
+static bitmap blocks_with_calls;
+
+/* Vector indexed by block # with a list of all the insns that
+   modify memory within the block.  */
+static vec<rtx_insn *> *modify_mem_list;
+
+/* Vector indexed by block # with a canonicalized list of insns
+   that modify memory in the block.  */
+static vec<modify_pair> *canon_modify_mem_list;
+
+/* Vector of simple bitmaps indexed by block number.  Each component sbitmap
+   indicates which expressions are transparent through the block.  */
+static sbitmap *transp;
 
 
 /* Helpers for memory allocation/freeing.  */
@@ -266,7 +289,7 @@  static bool reg_used_on_edge (rtx, edge);
 static rtx get_avail_load_store_reg (rtx_insn *);
 
 static bool bb_has_well_behaved_predecessors (basic_block);
-static struct occr* get_bb_avail_insn (basic_block, struct occr *);
+static struct occr* get_bb_avail_insn (basic_block, struct occr *, int);
 static void hash_scan_set (rtx_insn *);
 static void compute_hash_table (void);
 
@@ -321,6 +344,15 @@  alloc_mem (void)
   modifies_mem_obstack_bottom =
     (struct modifies_mem *) obstack_alloc (&modifies_mem_obstack,
 					   sizeof (struct modifies_mem));
+
+  blocks_with_calls = BITMAP_ALLOC (NULL);
+  modify_mem_list_set = BITMAP_ALLOC (NULL);
+
+  modify_mem_list = (vec_rtx_heap *) xcalloc (last_basic_block_for_fn (cfun),
+					      sizeof (vec_rtx_heap));
+  canon_modify_mem_list
+    = (vec_modify_pair_heap *) xcalloc (last_basic_block_for_fn (cfun),
+					sizeof (vec_modify_pair_heap));
 }
 
 /* Free memory allocated by alloc_mem.  */
@@ -338,6 +370,16 @@  free_mem (void)
   obstack_free (&unoccr_obstack, NULL);
   obstack_free (&modifies_mem_obstack, NULL);
 
+  unsigned i;
+  bitmap_iterator bi;
+  EXECUTE_IF_SET_IN_BITMAP (modify_mem_list_set, 0, i, bi)
+    {
+      modify_mem_list[i].release ();
+      canon_modify_mem_list[i].release ();
+    }
+
+  BITMAP_FREE (blocks_with_calls);
+  BITMAP_FREE (modify_mem_list_set);
   free (reg_avail_info);
 }
 
@@ -376,8 +418,14 @@  insert_expr_in_table (rtx x, rtx_insn *insn)
   slot = expr_table->find_slot_with_hash (cur_expr, hash, INSERT);
 
   if (! (*slot))
-    /* The expression isn't found, so insert it.  */
-    *slot = cur_expr;
+    {
+      /* The expression isn't found, so insert it.  */
+      *slot = cur_expr;
+
+      /* Anytime we add an entry to the table, record the index
+	 of the new entry.  */
+      cur_expr->bitmap_index = expr_table->elements ();
+    }
   else
     {
       /* The expression is already in the table, so roll back the
@@ -698,6 +746,11 @@  record_last_mem_set_info (rtx_insn *insn)
   list_entry->insn = insn;
   list_entry->next = modifies_mem_list;
   modifies_mem_list = list_entry;
+
+  record_last_mem_set_info_common (insn, modify_mem_list,
+				   canon_modify_mem_list,
+				   modify_mem_list_set,
+				   blocks_with_calls);
 }
 
 /* Called from compute_hash_table via note_stores to handle one
@@ -955,15 +1008,45 @@  bb_has_well_behaved_predecessors (basic_block bb)
 /* Search for the occurrences of expression in BB.  */
 
 static struct occr*
-get_bb_avail_insn (basic_block bb, struct occr *occr)
+get_bb_avail_insn (basic_block bb, struct occr *orig_occr, int bitmap_index)
 {
+  struct occr *occr = orig_occr;
+
   for (; occr != NULL; occr = occr->next)
     if (BLOCK_FOR_INSN (occr->insn) == bb)
       return occr;
+
+  /* If we could not find an occurrence in BB, see if BB
+     has a single predecessor with an occurrence that is
+     transparent through BB.  */
+  if (single_pred_p (bb)
+      && bitmap_bit_p (transp[bb->index], bitmap_index)
+      && (occr = get_bb_avail_insn (single_pred (bb), orig_occr, bitmap_index)))
+    {
+      rtx avail_reg = get_avail_load_store_reg (occr->insn);
+      if (!reg_set_between_p (avail_reg,
+			      PREV_INSN (BB_HEAD (bb)),
+			      NEXT_INSN (BB_END (bb)))
+	  && !reg_killed_on_edge (avail_reg, single_pred_edge (bb)))
+	return occr;
+    }
+
   return NULL;
 }
 
 
+/* This helper is called via htab_traverse.  */
+int
+compute_expr_transp (expr **slot, FILE *dump_file ATTRIBUTE_UNUSED)
+{
+  struct expr *expr = *slot;
+
+  compute_transp (expr->expr, expr->bitmap_index, transp,
+		  blocks_with_calls, modify_mem_list_set,
+		  canon_modify_mem_list);
+  return 1;
+}
+
 /* This handles the case where several stores feed a partially redundant
    load. It checks if the redundancy elimination is possible and if it's
    worth it.
@@ -1016,9 +1099,13 @@  eliminate_partially_redundant_load (basic_block bb, rtx_insn *insn,
       avail_insn = NULL;
       avail_reg = NULL_RTX;
       pred_bb = pred->src;
-      next_pred_bb_end = NEXT_INSN (BB_END (pred_bb));
-      for (a_occr = get_bb_avail_insn (pred_bb, expr->avail_occr); a_occr;
-	   a_occr = get_bb_avail_insn (pred_bb, a_occr->next))
+      for (a_occr = get_bb_avail_insn (pred_bb,
+				       expr->avail_occr,
+				       expr->bitmap_index);
+	   a_occr;
+	   a_occr = get_bb_avail_insn (pred_bb,
+				       a_occr->next,
+				       expr->bitmap_index))
 	{
 	  /* Check if the loaded register is not used.  */
 	  avail_insn = a_occr->insn;
@@ -1038,6 +1125,7 @@  eliminate_partially_redundant_load (basic_block bb, rtx_insn *insn,
 	      avail_insn = NULL;
 	      continue;
 	    }
+	  next_pred_bb_end = NEXT_INSN (BB_END (BLOCK_FOR_INSN (avail_insn)));
 	  if (!reg_set_between_p (avail_reg, avail_insn, next_pred_bb_end))
 	    /* AVAIL_INSN remains non-null.  */
 	    break;
@@ -1150,9 +1238,9 @@  eliminate_partially_redundant_load (basic_block bb, rtx_insn *insn,
   /* Delete the insn if it is not available in this block and mark it
      for deletion if it is available. If insn is available it may help
      discover additional redundancies, so mark it for later deletion.  */
-  for (a_occr = get_bb_avail_insn (bb, expr->avail_occr);
+  for (a_occr = get_bb_avail_insn (bb, expr->avail_occr, expr->bitmap_index);
        a_occr && (a_occr->insn != insn);
-       a_occr = get_bb_avail_insn (bb, a_occr->next))
+       a_occr = get_bb_avail_insn (bb, a_occr->next, expr->bitmap_index))
     ;
 
   if (!a_occr)
@@ -1308,8 +1396,19 @@  gcse_after_reload_main (rtx f ATTRIBUTE_UNUSED)
 
   if (expr_table->elements () > 0)
     {
+      /* Knowing which MEMs are transparent through a block can signifiantly
+	 increase the number of reundant loads found.  So compute transparency
+	 information for for each memory expression in the hash table.  */
+      df_analyze ();
+      /* This can not be part of the normal allocation routine because
+	 we have to know the number of elements in the hash table.  */
+      transp = sbitmap_vector_alloc (last_basic_block_for_fn (cfun),
+				     expr_table->elements ());
+      bitmap_vector_ones (transp, last_basic_block_for_fn (cfun));
+      expr_table->traverse <FILE *, compute_expr_transp> (dump_file);
       eliminate_partially_redundant_loads ();
       delete_redundant_insns ();
+      sbitmap_vector_free (transp);
 
       if (dump_file)
 	{