
[RFC] A new and simple pass merging paired load store instructions

Message ID 32B4330F-1D0F-4D4E-BF7A-2E5B2148B893@comcast.net
State New

Commit Message

Mike Stump May 15, 2014, 4:51 p.m. UTC
On May 15, 2014, at 12:26 AM, bin.cheng <bin.cheng@arm.com> wrote:
> Here comes up with a new GCC pass looking through each basic block and
> merging paired load store even they are not adjacent to each other.

So I have a target that has load and store multiple support for a large number of registers (2-n registers), and I added a sched0 pass that is a light copy of the regular scheduling pass, using a different cost function which arranges all loads first, then all stores, then everything else.  Within a group of loads or stores, the secondary key is the base register and the next key is the offset.  The net result: all loads off the same register are sorted in increasing order.  We can then use some define_insns and some peephole2 patterns to take the seemingly unrelated instructions, which are now adjacent, and knock them down into single instructions instead of the mass of instructions they were before.  A peephole pass that runs early lets the patterns do the heavy lifting.  This scheme can handle unlimited complexity in the addressing forms just by ensuring the cost function for the new scheduling pass looks at all the relevant bits (target dependent, if the simple machine-independent form reg + n is not enough).  The sched0 and peephole passes run early:

+      NEXT_PASS (pass_sched0);
+      NEXT_PASS (pass_peephole1);
       NEXT_PASS (pass_web);
       NEXT_PASS (pass_rtl_cprop);
       NEXT_PASS (pass_cse2);

(before register allocation) so, it can arrange to have things in adjacent registers for the load and store multiple instructions.  The register allocator can then arrange all the code to use those registers directly.

So, having done all this work, I think it would be nicer if there were a pass that managed it, so that one doesn't have to write any of the peepholes or the define_insns (you need something like 3*n patterns, and the patterns are O(n), so you need around n*4/2 lines of code, which is annoying for large n).  A pass could use the existing load store multiple patterns directly, so no additional port work.  In my work, since I extend lifetimes of values in registers pretty much without limit, this could be a bad thing.  The code is naturally limited to only extending the lives of things where load and store multiple are used; if they aren't used, the regular scheduling pass would undo all the sched0 motion.  I chose a light copy of sched, as I don't care about compile time and I wanted a very small patch that was easy to maintain.  If this pass went into trunk, we'd run the new passes _only_ if a port asked for them.  99% of the ports likely don't want either, though peephole before register allocation might be interesting for others to solve other issues.

I wanted this to run before register allocation as my load and store multiple instructions only take consecutive register ranges (n-m), and I need the register allocator to make that true.  I do reg-to-reg motion to move things around as needed, but I expect the allocator to get rid of almost all of it.  Very complex cases might wind up with a few extra moves, but I have nice bubbles that I can fit these extra moves into.

In my scheme: no new options, no documentation for new options, no new param options, no silly constants, no hard-to-write/maintain pass, no new weird target interface, no limit to just pairs, works on stores as well, runs earlier, 430 lines instead of 1086 lines, conceptually much simpler, and the added benefit of a peephole pass before register allocation that the port can use for other things…

So, my question is, does my scheme on your target find more or fewer things?  Would your scheme pair pairs (so that 4 registers would go into 1 instruction)?

Comments

Jeff Law May 15, 2014, 5:13 p.m. UTC | #1
On 05/15/14 10:51, Mike Stump wrote:
> On May 15, 2014, at 12:26 AM, bin.cheng <bin.cheng@arm.com> wrote:
>> Here comes up with a new GCC pass looking through each basic block
>> and merging paired load store even they are not adjacent to each
>> other.
>
> So I have a target that has load and store multiple support that
> supports large a number of registers (2-n registers), and I added a
> sched0 pass that is a light copy of the regular scheduling pass that
> uses a different cost function which arranges all loads first, then
> all stores then everything else.  Within a group of loads or stores
> the secondary key is the base register, the next key is the offset.
> The net result, all loads off the same register are sorted in
> increasing order.
Glad to see someone else stumble on (ab)using the scheduler to do this.

I've poked at the scheduler several times to do similar stuff, but was 
never really satisfied with the results and never tried to polish those 
prototypes into something worth submitting.

One example I've poked at was discovery of stores which then feed into a 
load from the same location, which obviously we'd prefer to turn into a 
store + copy (subject to a mess of constraints).  There's a handful of 
these kinds of transformations that seem to naturally drop out of this 
kind of work.

Similarly a post-reload pass could be used to promote single word 
loads/stores to double-word operations.

If anyone cared about PA 1.1 code generation, it'd be a much cleaner way 
to support the non-fused fmpyadd fmpysub insns.

Anyway, if you want to move forward with the idea, I'd certainly support 
doing so.

jeff
Mike Stump May 15, 2014, 6:41 p.m. UTC | #2
On May 15, 2014, at 10:13 AM, Jeff Law <law@redhat.com> wrote:
> I've poked at the scheduler several times to do similar stuff, but was never really satisfied with the results and never tried to polish those prototypes into something worth submitting.

What was lacking?  The cleanliness of the patch, or that it didn’t win me more than 0.01% code improvement so I never submitted it?

> One example I've poked at was discovery of stores which then feed into a load from the same location.

Trivial to do with my patch: just change the sort key to arrange for that to happen.  However, if you want multiple support and this as well, then you need to recognize a store_m / load-from-a-part-of-it, which would be kinda annoying.

> Anyway, if you want to move forward with the idea, I'd certainly support doing so.

I’d be happy to clean up my patch as necessary and contribute it; however, people would have to tell me what they would like to see done to it.  In my mind, the arm, more, nds32, rs6000 and s390 ports would be the biggest in-tree winners.

To be the most useful, my patch would benefit from an optimization person writing MI code to coalesce adjacent memory operations into either larger-mode loads and stores or into gen_load_multiple/gen_store_multiple; then all the port work (mine is around 16*n*n lines) would be zero.
Jeff Law May 15, 2014, 8:01 p.m. UTC | #3
On 05/15/14 12:41, Mike Stump wrote:
> On May 15, 2014, at 10:13 AM, Jeff Law <law@redhat.com> wrote:
>> I've poked at the scheduler several times to do similar stuff, but
>> was never really satisfied with the results and never tried to
>> polish those prototypes into something worth submitting.
>
> What was lacking?  The cleanliness of the patch or the, it didn’t win
> me more than 0.01% code improvement so I never submitted it?
Cleanliness and applicability of my implementation.

For fmpyadd/fmpysub on the PA, my recollection was that the right way to 
go was to detect when both were in the ready queue at the same time, 
then resort the queue to have them issue consecutively.  We didn't have 
the necessary hooks to have target dependent code examine and rearrange 
the queues back when I looked at this.  By the time we had those 
capabilities, the PA8000 class processors had been released and those 
instructions were considered bad so I never came back to this.

For the memory optimizations, IIRC, the dependencies keep them from 
getting into the ready queue at the same time.  Thus it's significantly 
harder to get them to issue consecutively when you've got an issue rate > 1.



> Trivial to do with my patch, just change the sort key to arrange that
> to happen, however, if you want multiple support and this, then, you
> need to recognize store_m / load from a part of that, which would be
> kinda annoying.
But if you've got an issue rate > 1, then it's a lot less likely you'll 
get the store/load pairing up the way you want.

Arguably one could throttle down the issue rate for these scheduler-like 
passes which makes that problem go away.

Jeff
Mike Stump May 15, 2014, 9:25 p.m. UTC | #4
On May 15, 2014, at 1:01 PM, Jeff Law <law@redhat.com> wrote:
> For the memory optimizations, IIRC, the dependencies keep them from getting into the ready queue at the same time.  Thus it's significantly harder to get them to issue consecutively when you've got an issue rate > 1.

> But if you've got an issue rate > 1, then it's a lot less likely you'll get the store/load pairing up the way you want.

Ah, yeah, on second thought, that won’t work if they are not ready together…

I sort the ready queue:

first ready set:
load
store
everything else

next ready set:
load store
everything else

Now, it might work if there are no other instructions and no loads in the next set, and so on, but that would be an accident.  The utility is only in instructions that are ready together; tick boundaries and issue rates don’t matter any, as I don’t care about ticks or issue rates.  The only consideration on order is my ordering operator, and once ordered, the target is free to combine as it sees fit.  So, for store/load optimization, my patch is rather useless.
Bin.Cheng May 16, 2014, 10:07 a.m. UTC | #5
On Fri, May 16, 2014 at 12:51 AM, Mike Stump <mikestump@comcast.net> wrote:
> On May 15, 2014, at 12:26 AM, bin.cheng <bin.cheng@arm.com> wrote:
>> Here comes up with a new GCC pass looking through each basic block and
>> merging paired load store even they are not adjacent to each other.
>
> So I have a target that has load and store multiple support that supports large a number of registers (2-n registers), and I added a sched0 pass that is a light copy of the regular scheduling pass that uses a different cost function which arranges all loads first, then all stores then everything else.  Within a group of loads or stores the secondary key is the base register, the next key is the offset.  The net result, all loads off the same register are sorted in increasing order.  We then can use some define_insns and some peephole to patterns to take the seemingly unrelated instructions, which are now adjacent to knock them down into single instructions, instead of the mass of instructions they were before.  And then a peephole pass that runs early to allow the patterns to do the heavy lifting.  This scheme can handle unlimited complexity on the addressing forms just by ensuring the cost function for the new scheduling pass looks at all relevant bits (target dependent, if the simple machine independent form reg + n is not enough).  The sched0 and the peephole pass run early:
>
> +      NEXT_PASS (pass_sched0);
> +      NEXT_PASS (pass_peephole1);
>        NEXT_PASS (pass_web);
>        NEXT_PASS (pass_rtl_cprop);
>        NEXT_PASS (pass_cse2);
>
> (before register allocation) so, it can arrange to have things in adjacent registers for the load and store multiple instructions.  The register allocator can then arrange all the code to use those registers directly.
>
> So, having done all this work, I think it would be nicer if there were a pass that managed it (so that one doesn't have to write any of the peephole or the define_insns (you need like 3*n patterns, and the patterns of O(n), so, you need around n*4/2 lines of code, which is annoying for large n.  A pass could use the existing load store multiple patterns directly, so, no additional port work.  In my work, since I extend life times of values into registers, pretty much without limit, this could be a bad thing.  The code is naturally limited to only extending the lives of things where load and store multiple are used, as if they aren't used, the regular scheduling pass would undo all the sched0 motion.  I choose a light copy of sched, as I don't care about compile time, and I wanted a very small patch that was easy to maintain.  If this pass when into trunk, we'd run the new passes _only_ if a port asked for them.  99% of the ports likely don't want either, though, peephole before register allocation might be interesting for others to solve other issues.
>
> I wanted this to run before register allocation as my load and store multiple instructions only take consecutive register ranges (n-m), and I need the register allocator to manage to make it true.  I do reg to reg motion to move things around as needed, but almost all I expect the allocator to get rid of.  Very complex cases might wind up with a few extra moves, but I have nice bubbles that I can fit these extra moves into.
>
> In my scheme, no new options, no documentation for new options, no new param options, no silly constants, no hard to write/maintain pass, no new weird targets interface, no limit on just pairs, works on stores as well, runs earlier, 430 lines instead of 1086 lines, conceptually much simpler, added benefit of peephole before register allocation that can be used for other things by the port...
>
> So, my question is, does my scheme on your target find more or fewer things?  Would your scheme pair pairs (so that 4 registers would go into 1 instruction)?
>
Hi Mike,
Thanks for bringing up this new method.  I have to admit that I was not
very much into this method at first glance, especially as it requires
another pass in cooperation with the scheduler.

For the first question: unfortunately, I can't apply the patch to svn
trunk; could you please update it so I can do some experiments with
it?  From my experience, lots of opportunities on ARM are generated by
RA, so putting the pass before RA would miss many cases.  Another
possible issue is the interaction with other optimizations.  Putting
it at an early stage of RTL, we may disturb other optimizations like
gcse/-fgcse-lm/-fgcse-sm/dse.  Of course, it has advantages too; for
example, fwprop_addr sometimes corrupts load/store pair opportunities
on ARM, which wouldn't be a problem for your patch.

For the second question, the answer is no for the current
implementation.  For ARM the most important opportunity is the paired
one, so I just started with pairs of two instructions, which is much
simpler.  I do have a draft idea of how to merge more than two
instructions, but haven't worked on it yet.

Thanks,
bin
Bin.Cheng May 16, 2014, 10:07 a.m. UTC | #6
On Fri, May 16, 2014 at 1:13 AM, Jeff Law <law@redhat.com> wrote:
> On 05/15/14 10:51, Mike Stump wrote:
>>
>> On May 15, 2014, at 12:26 AM, bin.cheng <bin.cheng@arm.com> wrote:
>>>
>>> Here comes up with a new GCC pass looking through each basic block
>>> and merging paired load store even they are not adjacent to each
>>> other.
>>
>>
>> So I have a target that has load and store multiple support that
>> supports large a number of registers (2-n registers), and I added a
>> sched0 pass that is a light copy of the regular scheduling pass that
>> uses a different cost function which arranges all loads first, then
>> all stores then everything else.  Within a group of loads or stores
>> the secondary key is the base register, the next key is the offset.
>> The net result, all loads off the same register are sorted in
>> increasing order.
>
> Glad to see someone else stumble on (ab)using the scheduler to do this.
Emm, if it's (ab)using, should we still do it then?

>
> I've poked at the scheduler several times to do similar stuff, but was never
> really satisfied with the results and never tried to polish those prototypes
> into something worth submitting.
>
> One example I've poked at was discovery of stores which then feed into a
> load from the same location.  Which obviously we'd prefer to turn into a
> store + copy (subject to mess of constraints).  There's a handful of these
> kind of transformations that seem to naturally drop out of this kind of
> work.
As Mike stated, merging of consecutive memory accesses is all about
the base register and the offset.  I am thinking of another method:
collecting all memory accesses with the same base register, then doing
the merge work.  In this way, we should be able to merge more than 2
instructions, and it would also be possible to remove redundant load
instructions in one pass.

My question is: how many of these redundant loads could there be?  Is
there any rtl pass responsible for this now?

Thanks,
bin
>
> Similarly a post-reload pass could be used to promote single word
> loads/stores to double-word operations.
>
> If anyone cared about PA 1.1 code generation, it'd be a much cleaner way to
> support the non-fused fmpyadd fmpysub insns.
>
> Anyway, if you want to move forward with the idea, I'd certainly support
> doing so.
>
> jeff
Jeff Law May 16, 2014, 4:18 p.m. UTC | #7
On 05/16/14 04:07, Bin.Cheng wrote:
> On Fri, May 16, 2014 at 1:13 AM, Jeff Law <law@redhat.com> wrote:
>> On 05/15/14 10:51, Mike Stump wrote:
>>>
>>> On May 15, 2014, at 12:26 AM, bin.cheng <bin.cheng@arm.com> wrote:
>>>>
>>>> Here comes up with a new GCC pass looking through each basic block
>>>> and merging paired load store even they are not adjacent to each
>>>> other.
>>>
>>>
>>> So I have a target that has load and store multiple support that
>>> supports large a number of registers (2-n registers), and I added a
>>> sched0 pass that is a light copy of the regular scheduling pass that
>>> uses a different cost function which arranges all loads first, then
>>> all stores then everything else.  Within a group of loads or stores
>>> the secondary key is the base register, the next key is the offset.
>>> The net result, all loads off the same register are sorted in
>>> increasing order.
>>
>> Glad to see someone else stumble on (ab)using the scheduler to do this.
> Emm, If it's (ab)using, should we still do it then?
I think it'd still be fine.  There's even been a comment about doing 
this kind of thing in the scheduler that's been around since the early 
90s...

The scheduler is a bit interesting in that it has a wealth of dependency 
information and the ability to reorganize the insn stream in relatively 
arbitrary ways.  That seems to make it a natural place to think about 
transformations of this nature.  We just haven't had a good 
infrastructure for doing that.

In theory we're a lot closer now to being able to plug in different 
costing/sorting models and let the scheduler do its thing.  Those models 
might rewrite for register pressure, or encourage certain independent 
insns to issue back-to-back to encourage combining, or to build 
candidate insns for delay slot scheduling, etc.

> As Mike stated, merging of consecutive memory accesses is all about
> the base register and the offset. I am thinking another method
> collecting all memory accesses with same base register then doing the
> merge work.  In this way, we should be able to merge more than 2
> instructions, also it would be possible to remove redundant load
> instructions in one pass.
>
> My question is how many these redundant loads could be?  Is there any
> rtl pass responsible for this now?
I suspect it's a lot less important now than it used to be.  But there's 
probably some cases where it'd be useful.  Combining sub-word accesses 
into full-word accesses comes immediately to mind.

I'm not aware of any pass which does these kinds of changes in a general 
form.  Some passes (caller-save) do a fair amount of work to track when 
they can generate multi-object loads/stores (and it was a huge win back 
on the old sparc processors).


jeff
Bin.Cheng May 19, 2014, 6:38 a.m. UTC | #8
On Sat, May 17, 2014 at 12:18 AM, Jeff Law <law@redhat.com> wrote:
> On 05/16/14 04:07, Bin.Cheng wrote:
>>
>> On Fri, May 16, 2014 at 1:13 AM, Jeff Law <law@redhat.com> wrote:
>>>
>>> On 05/15/14 10:51, Mike Stump wrote:
>>>>
>>>>
>>>> On May 15, 2014, at 12:26 AM, bin.cheng <bin.cheng@arm.com> wrote:
>>>>>
>>>>>
>>>>> Here comes up with a new GCC pass looking through each basic block
>>>>> and merging paired load store even they are not adjacent to each
>>>>> other.
>>>>
>>>>
>>>>
>>>> So I have a target that has load and store multiple support that
>>>> supports large a number of registers (2-n registers), and I added a
>>>> sched0 pass that is a light copy of the regular scheduling pass that
>>>> uses a different cost function which arranges all loads first, then
>>>> all stores then everything else.  Within a group of loads or stores
>>>> the secondary key is the base register, the next key is the offset.
>>>> The net result, all loads off the same register are sorted in
>>>> increasing order.
>>>
>>>
>>> Glad to see someone else stumble on (ab)using the scheduler to do this.
>>
>> Emm, If it's (ab)using, should we still do it then?
>
> I think it'd still be fine.  There's even been a comment about doing this
> kind of thing in the scheduler that's been around since the early 90s...
>
> The scheduler is a bit interesting in that it has a wealth of dependency
> information and the ability to reorganize the insn stream in relatively
> arbitrary ways.  That seems to make it a natural place to think about
> transformations of this nature.  We just haven't had a good infrastructure
> for doing that.
>
> In theory we're a lot closer now to being able to plug in different
> costing/sorting models and let the scheduler do its thing.  Those models
> might rewrite for register pressure, or encourage certain independent insns
> to issue back-to-back to encourage combining, or to build candidate insns
> for delay slot scheduling, etc.
>
>
>> As Mike stated, merging of consecutive memory accesses is all about
>> the base register and the offset. I am thinking another method
>> collecting all memory accesses with same base register then doing the
>> merge work.  In this way, we should be able to merge more than 2
>> instructions, also it would be possible to remove redundant load
>> instructions in one pass.
>>
>> My question is how many these redundant loads could be?  Is there any
>> rtl pass responsible for this now?
>
> I suspect it's a lot less important now than it used to be.  But there's
> probably some cases where it'd be useful.  Combining sub-word accesses into
> full-word accesses come immediately to mind.
>
> I'm not aware of any pass which does these kind of changes in a general
> form.  Some passes (caller-save) do a fair amount of work to track when they
> can generate multi-object loads/stores (and it was a huge win back on the
> old sparc processors).
>
Glad this RFC has attracted some attention, and thanks for all the
comments.  Here I can see four major concerns, as below:
1) Should we do it in a separate pass, or just along with the scheduler?

2) When should we run the new pass, before or after RA?  There are
both advantages and disadvantages, and it very much depends on the
target for which we are compiling.
I have no simple answer to this.  Maybe we can run the pass twice, or
follow Oleg's suggestion.  I think it's a new strategy for GCC to let
the backend decide when to run a pass.

3) Do we need a new target hook interface?
I answered this in other messages and I still think it's target dependent.

4) The optimization should be able to handle cases with more than 2
consecutive load/store instructions.
The current implementation can't handle such cases and needs further extension.

3) and 4) are just implementation questions, while I am not sure
about 1) and 2), so are there any more comments that would let us make
some decisions and carry on with this optimization?

Thanks,
bin
Jeff Law May 19, 2014, 5:16 p.m. UTC | #9
On 05/19/14 00:38, Bin.Cheng wrote:
> 1) Should we do it in a separated pass, or just along with scheduler?
ISTM that combining these insns can impact the schedule we'd like to 
generate, possibly in significant ways.  That argues for a separate 
pass that runs before the scheduler.

So I'd probably look at splitting out the code which builds the 
dependency information, running that first, then the pass to merge these 
memory ops (incrementally updating the dependency info), then the scheduler.


>
> 2) When should we run the new pass, before or after RA?  There are
> both advantages and disadvantages and very depends on the target for
> which we are compiling.
> I have no simple answer to this.  Maybe we can run the pass twice or
> follow Oleg's suggestion.  I think it's a new strategy for GCC to let
> backend decide when to run a pass.
If we only run the dependency stuff once and share the data between the 
new pass and the scheduler, then I think running the new pass twice just 
before scheduling would be fine.

>
> 3) Do we need a new target hook interface?
> I answered this in other messages and I still think it's target dependent.
Still pondering that one :-)

>
> 4) The optimization should be able to handle cases with more than 2
> consecutive load/store instructions.
> The current implementation can't handle such cases and need further extension.
Yea, probably.  I think if we build the infrastructure right, extension 
for > 2 loads/stores shouldn't be terribly bad to implement.

Jeff

Patch

diff --git a/gcc/haifa-sched.c b/gcc/haifa-sched.c
index 4317c9f..9702ebe 100644
--- a/gcc/haifa-sched.c
+++ b/gcc/haifa-sched.c
@@ -1361,7 +1361,10 @@  insn_cost (rtx insn)
       if (recog_memoized (insn) < 0)
 	return 0;
 
-      cost = insn_default_latency (insn);
+      if (sched_load)
+	cost = 0;
+      else
+	cost = insn_default_latency (insn);
       if (cost < 0)
 	cost = 0;
 
@@ -1383,7 +1386,10 @@  insn_cost (rtx insn)
 	}
       else
 	{
-	  cost = insn_default_latency (insn);
+	  if (sched_load)
+	    cost = 0;
+	  else
+	    cost = insn_default_latency (insn);
 	  if (cost < 0)
 	    cost = 0;
 
@@ -1568,6 +1574,27 @@  dep_list_size (rtx insn, sd_list_types_def list)
   return nodbgcount;
 }
 
+static int saved_priority;
+bool sched_load;
+
+
+static int
+pri (bool wasload, int regno, HOST_WIDE_INT offp) {
+  int p;
+  int off = (int)offp;
+  int m = /* ((1<14)-1) */ 0x3fff;
+  off = off + (1<<13);
+  off = off & m;
+  /* loads go first, then stores */
+  p = wasload*(INT_MAX/4+1) + (INT_MAX/4+1);
+  /* then we sort upon the base register, increasing regnos.  */
+  p -= (regno & 0xfff) <<14;
+  /* then we sort on the offset, increasing offsets.  */
+  p -= off;
+  gcc_assert (p>0);
+  return p;
+}
+
 /* Compute the priority number for INSN.  */
 static int
 priority (rtx insn)
@@ -1582,7 +1609,43 @@  priority (rtx insn)
     {
       int this_priority = -1;
 
-      if (dep_list_size (insn, SD_LIST_FORW) == 0)
+      if (sched_load)
+	{
+	  rtx x;
+	  if (! INSN_P (insn))
+	    this_priority = 0;
+	  else if (GET_CODE (x=PATTERN (insn)) != SET)
+	    this_priority = saved_priority;
+	  else if (GET_CODE (SET_DEST (x)) == REG
+		   && GET_CODE (SET_SRC (x)) == MEM
+		   && GET_CODE (XEXP (SET_SRC (x), 0)) == REG)
+	    /* we have a recognized load: r = (mem r)  */
+	    this_priority = pri (true, REGNO (XEXP (SET_SRC (x), 0)), 0);
+	  else if (GET_CODE (SET_DEST (x)) == REG
+		   && GET_CODE (SET_SRC (x)) == MEM
+		   && GET_CODE (XEXP (SET_SRC (x), 0)) == PLUS
+		   && GET_CODE (XEXP (XEXP (SET_SRC (x), 0), 0)) == REG
+		   && GET_CODE (XEXP (XEXP (SET_SRC (x), 0), 1)) == CONST_INT)
+	    /* we have a recognized load: r = (mem (+ r n)) */
+	    this_priority = pri (true, REGNO (XEXP (XEXP (SET_SRC (x), 0), 0)),
+				 INTVAL (XEXP (XEXP (SET_SRC (x), 0), 1)));
+	  else if (GET_CODE (SET_SRC (x)) == REG
+		   && GET_CODE (SET_DEST (x)) == MEM
+		   && GET_CODE (XEXP (SET_DEST (x), 0)) == REG)
+	    /* we have a recognized store: (mem r) = r  */
+	    this_priority = pri (false, REGNO (XEXP (SET_DEST (x), 0)), 0);
+	  else if (GET_CODE (SET_SRC (x)) == REG
+		   && GET_CODE (SET_DEST (x)) == MEM
+		   && GET_CODE (XEXP (SET_DEST (x), 0)) == PLUS
+		   && GET_CODE (XEXP (XEXP (SET_DEST (x), 0), 0)) == REG
+		   && GET_CODE (XEXP (XEXP (SET_DEST (x), 0), 1)) == CONST_INT)
+	    /* we have a recognized store: (mem (+ r n)) = r */
+	    this_priority = pri (false, REGNO (XEXP (XEXP (SET_DEST (x), 0), 0)),
+				 INTVAL (XEXP (XEXP (SET_DEST (x), 0), 1)));
+	  else
+	    this_priority = saved_priority;
+	}
+      else if (dep_list_size (insn, SD_LIST_FORW) == 0)
 	/* ??? We should set INSN_PRIORITY to insn_cost when and insn has
 	   some forward deps but all of them are ignored by
 	   contributes_to_priority hook.  At the moment we set priority of
@@ -2513,6 +2576,8 @@  model_set_excess_costs (rtx *insns, int count)
     }
 }
 
+static int last_insn_pri;
+
 /* Returns a positive value if x is preferred; returns a negative value if
    y is preferred.  Should never return 0, since that will make the sort
    unstable.  */
@@ -2544,6 +2609,23 @@  rank_for_schedule (const void *x, const void *y)
   /* Make sure that priority of TMP and TMP2 are initialized.  */
   gcc_assert (INSN_PRIORITY_KNOWN (tmp) && INSN_PRIORITY_KNOWN (tmp2));
 
+  if (sched_load)
+    {
+      /* The instruction that is closest after to the last instruction
+	 is the instruction we want to pick next.  */
+      int a = last_insn_pri-INSN_PRIORITY (tmp);
+      int b = last_insn_pri-INSN_PRIORITY (tmp2);
+      if (a>=0)
+	{
+	  if (b<0)
+	    return -1;
+	  return a-b;
+	}
+      if (b<0)
+	return a-b;
+      return 1;
+    }
+
   if (sched_pressure != SCHED_PRESSURE_NONE)
     {
       int diff;
@@ -5620,6 +5702,7 @@  commit_schedule (rtx prev_head, rtx tail, basic_block *target_bb)
 {
   unsigned int i;
   rtx insn;
+  basic_block dirty_block = 0;
 
   last_scheduled_insn = prev_head;
   for (i = 0;
@@ -5645,6 +5728,19 @@  commit_schedule (rtx prev_head, rtx tail, basic_block *target_bb)
 
       if (current_sched_info->begin_move_insn)
 	(*current_sched_info->begin_move_insn) (insn, last_scheduled_insn);
+
+      /* We have to mark the block dirty so that df will rescan
+	 instructions whenever we move instructions around.  */
+      if (PREV_INSN (insn) != last_scheduled_insn)
+	{
+	  basic_block new_dirty_block = BLOCK_FOR_INSN (insn);
+	  if (new_dirty_block != dirty_block)
+	    {
+	      df_set_bb_dirty (new_dirty_block);
+	      dirty_block = new_dirty_block;
+	    }
+	}
+
       move_insn (insn, last_scheduled_insn,
 		 current_sched_info->next_tail);
       if (!DEBUG_INSN_P (insn))
@@ -5766,6 +5862,10 @@  prune_ready_list (state_t temp_state, bool first_cycle_insn_p,
 		  reason = "shadow tick";
 		}
 	    }
+
+	  if (cost > 1 && sched_load)
+	    cost = 1;
+
 	  if (cost >= 1)
 	    {
 	      if (SCHED_GROUP_P (insn) && cost > min_cost_group)
@@ -6652,12 +6752,19 @@  sched_init (void)
   curr_state = xmalloc (dfa_state_size);
 }
 
+bool free_hid = false;
+
 static void haifa_init_only_bb (basic_block, basic_block);
 
 /* Initialize data structures specific to the Haifa scheduler.  */
 void
 haifa_sched_init (void)
 {
+  if (free_hid)
+    {
+      free_hid = false;
+      haifa_finish_h_i_d ();
+    }
   setup_sched_dump ();
   sched_init ();
 
@@ -6698,6 +6805,9 @@  haifa_sched_init (void)
   after_recovery = 0;
 
   modulo_ii = 0;
+  /* normal instructions go first.  */
+  saved_priority = INT_MAX/2 + 2;
+  last_insn_pri = INT_MAX/2 + 1;
 }
 
 /* Finish work with the data specific to the Haifa scheduler.  */
@@ -6745,8 +6855,10 @@  haifa_sched_finish (void)
 void
 sched_finish (void)
 {
-  if (!flag_timing_info)
+  if (!flag_timing_info || !reload_completed)
     haifa_finish_h_i_d ();
+  else
+    free_hid = true;
   if (sched_pressure != SCHED_PRESSURE_NONE)
     {
       if (regstat_n_sets_and_refs != NULL)
@@ -8568,4 +8680,10 @@  get_ready_element (int i)
   return ready_element (&ready, i);
 }
 
+/* Out-of-line wrapper for the HID macro, callable from a debugger.  */
+haifa_insn_data_def* (HID) (rtx insn)
+{
+  return HID (insn);
+}
+
 #endif /* INSN_SCHEDULING */
diff --git a/gcc/passes.c b/gcc/passes.c
index 2a5ea00..26e1455 100644
--- a/gcc/passes.c
+++ b/gcc/passes.c
@@ -1600,6 +1600,8 @@  init_optimization_passes (void)
 	  NEXT_PASS (pass_rtl_loop_done);
 	  *p = NULL;
 	}
+      NEXT_PASS (pass_sched0);
+      NEXT_PASS (pass_peephole1);
       NEXT_PASS (pass_web);
       NEXT_PASS (pass_rtl_cprop);
       NEXT_PASS (pass_cse2);
diff --git a/gcc/recog.c b/gcc/recog.c
index 3c56703..5cf76a8 100644
--- a/gcc/recog.c
+++ b/gcc/recog.c
@@ -72,7 +72,7 @@  static rtx split_insn (rtx);
 
 int volatile_ok;
 
-struct recog_data recog_data;
+struct Recog_data recog_data;
 
 /* Contains a vector of operand_alternative structures for every operand.
    Set up by preprocess_constraints.  */
@@ -3104,6 +3104,10 @@  peep2_find_free_register (int from, int to, const char *class_str,
   cl = (class_str[0] == 'r' ? GENERAL_REGS
	   : REG_CLASS_FROM_CONSTRAINT (class_str[0], class_str));
 
+  /* Before register allocation we can simply hand back a fresh pseudo.  */
+  if (can_create_pseudo_p ())
+    return gen_reg_rtx (mode);
+
   for (i = 0; i < FIRST_PSEUDO_REGISTER; i++)
     {
       int raw_regno, regno, success, j;
@@ -3750,6 +3753,27 @@  rest_of_handle_peephole2 (void)
   return 0;
 }
 
+struct rtl_opt_pass pass_peephole1 =
+{
+ {
+  RTL_PASS,
+  "peephole1",                          /* name */
+  OPTGROUP_NONE,                        /* optinfo_flags */
+  gate_handle_peephole2,                /* gate */
+  rest_of_handle_peephole2,             /* execute */
+  NULL,                                 /* sub */
+  NULL,                                 /* next */
+  0,                                    /* static_pass_number */
+  TV_PEEPHOLE2,                         /* tv_id */
+  0,                                    /* properties_required */
+  0,                                    /* properties_provided */
+  0,                                    /* properties_destroyed */
+  0,                                    /* todo_flags_start */
+  TODO_df_finish | TODO_verify_rtl_sharing |
+  0                                    /* todo_flags_finish */
+ }
+};
+
 struct rtl_opt_pass pass_peephole2 =
 {
  {
diff --git a/gcc/recog.h b/gcc/recog.h
index 0e7d537..9b86964 100644
--- a/gcc/recog.h
+++ b/gcc/recog.h
@@ -180,7 +180,7 @@  extern int which_alternative;
 
 /* The following vectors hold the results from insn_extract.  */
 
-struct recog_data
+struct Recog_data
 {
   /* It is very tempting to make the 5 operand related arrays into a
      structure and index on that.  However, to be source compatible
@@ -246,7 +246,7 @@  struct recog_data
   rtx insn;
 };
 
-extern struct recog_data recog_data;
+extern struct Recog_data recog_data;
 
 /* Contains a vector of operand_alternative structures for every operand.
    Set up by preprocess_constraints.  */
diff --git a/gcc/reload.c b/gcc/reload.c
index 5fd43a3..e3d6eff 100644
--- a/gcc/reload.c
+++ b/gcc/reload.c
@@ -895,7 +895,7 @@  can_reload_into (rtx in, int regno, enum machine_mode mode)
 {
   rtx dst, test_insn;
   int r = 0;
-  struct recog_data save_recog_data;
+  struct Recog_data save_recog_data;
 
   /* For matching constraints, we often get notional input reloads where
      we want to use the original register as the reload register.  I.e.
diff --git a/gcc/sched-int.h b/gcc/sched-int.h
index c2c1f2f..444113e 100644
--- a/gcc/sched-int.h
+++ b/gcc/sched-int.h
@@ -1579,5 +1579,7 @@  extern void sd_debug_lists (rtx, sd_list_types_def);
 
 #endif /* INSN_SCHEDULING */
 
+extern bool sched_load;
+
 #endif /* GCC_SCHED_INT_H */
 
diff --git a/gcc/sched-rgn.c b/gcc/sched-rgn.c
index 46edff8..f97de39 100644
--- a/gcc/sched-rgn.c
+++ b/gcc/sched-rgn.c
@@ -3526,6 +3526,18 @@  gate_handle_sched (void)
 
 /* Run instruction scheduler.  */
 static unsigned int
+rest_of_handle_sched0 (void)
+{
+#ifdef INSN_SCHEDULING
+  sched_load = true;
+  schedule_insns ();
+  sched_load = false;
+#endif
+  return 0;
+}
+
+/* Run instruction scheduler.  */
+static unsigned int
 rest_of_handle_sched (void)
 {
 #ifdef INSN_SCHEDULING
@@ -3570,6 +3582,28 @@  rest_of_handle_sched2 (void)
   return 0;
 }
 
+struct rtl_opt_pass pass_sched0 =
+{
+ {
+  RTL_PASS,
+  "sched0",                             /* name */
+  OPTGROUP_NONE,                        /* optinfo_flags */
+  gate_handle_sched,                    /* gate */
+  rest_of_handle_sched0,                /* execute */
+  NULL,                                 /* sub */
+  NULL,                                 /* next */
+  0,                                    /* static_pass_number */
+  TV_SCHED,                             /* tv_id */
+  0,                                    /* properties_required */
+  0,                                    /* properties_provided */
+  0,                                    /* properties_destroyed */
+  0,                                    /* todo_flags_start */
+  TODO_df_finish | TODO_verify_rtl_sharing |
+  TODO_verify_flow |
+  TODO_ggc_collect                      /* todo_flags_finish */
+ }
+};
+
 struct rtl_opt_pass pass_sched =
 {
  {
diff --git a/gcc/simplify-rtx.c b/gcc/simplify-rtx.c
index 4c67495..73412c4 100644
--- a/gcc/simplify-rtx.c
+++ b/gcc/simplify-rtx.c
@@ -5370,9 +5370,9 @@  static rtx
 simplify_immed_subreg (enum machine_mode outermode, rtx op,
 		       enum machine_mode innermode, unsigned int byte)
 {
-  /* We support up to 512-bit values (for V8DFmode).  */
+  /* We support up to 1024-bit values (for V16DFmode).  */
   enum {
-    max_bitsize = 512,
+    max_bitsize = 1024,
     value_bit = 8,
     value_mask = (1 << value_bit) - 1
   };
diff --git a/gcc/testsuite/gfortran.dg/select_type_4.f90 b/gcc/testsuite/gfortran.dg/select_type_4.f90
index 7e12d93..b58cc39 100644
--- a/gcc/testsuite/gfortran.dg/select_type_4.f90
+++ b/gcc/testsuite/gfortran.dg/select_type_4.f90
@@ -1,4 +1,7 @@ 
 ! { dg-do run }
+! This testcase seems to expose a flaw in the alias set information when loads
+! are rescheduled earlier by sched0; -fno-strict-aliasing works around it.
+! { dg-options "-fno-strict-aliasing" }
 !
 ! Contributed by by Richard Maine
 ! http://coding.derkeiler.com/Archive/Fortran/comp.lang.fortran/2006-10/msg00104.html
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 7581cf3..b1e1c4f 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -425,6 +425,8 @@  extern struct rtl_opt_pass pass_rtl_unroll_and_peel_loops;
 extern struct rtl_opt_pass pass_rtl_doloop;
 extern struct rtl_opt_pass pass_rtl_loop_done;
 
+extern struct rtl_opt_pass pass_sched0;
+extern struct rtl_opt_pass pass_peephole1;
 extern struct rtl_opt_pass pass_web;
 extern struct rtl_opt_pass pass_cse2;
 extern struct rtl_opt_pass pass_df_initialize_opt;