diff mbox

[AVR] : Fix PR36467, PR49687 (better widening mul)

Message ID 4E24541F.7050606@gjlay.de
State New

Commit Message

Georg-Johann Lay July 18, 2011, 3:41 p.m. UTC
This patch improves 16=8*8 and 16=16*8 multiplication.

mulhi3 is extended to handle small, signed 9-bit constants, which can be
multiplied more efficiently than by loading the constant and performing
a full 16-bit multiplication.

Some combine patterns are added that match when one operand is a small
constant; the constant is then handled in the split part.

The cost computation is adapted to the new patterns.

Test cases attached to the PRs compile to good code now.

Tested without regressions.

Ok to commit?

	PR target/36467
	PR target/49687
	* config/avr/avr.md (mulhi3): Use register_or_s9_operand for
	operand2 and expand appropriately if there is a CONST_INT in
	operand2.
	(usmulqihi3): New insn.
	(*sumulqihi3): New insn.
	(*osmulqihi3): New insn.
	(*oumulqihi3): New insn.
	(*muluqihi3.uconst): New insn_and_split.
	(*muluqihi3.sconst): New insn_and_split.
	(*mulsqihi3.sconst): New insn_and_split.
	(*mulsqihi3.uconst): New insn_and_split.
	(*mulsqihi3.oconst): New insn_and_split.
	(*ashifthi3.signx.const): New insn_and_split.
	(*ashifthi3.signx.const7): New insn_and_split.
	(*ashifthi3.zerox.const): New insn_and_split.
	(mulsqihi3): New insn.
	(muluqihi3): New insn.
	(muloqihi3): New insn.
	* config/avr/avr.c (avr_rtx_costs): Report costs of above insns.
	(avr_gate_split1): New function.
	* config/avr/avr-protos.h (avr_gate_split1): New prototype.
	* config/avr/predicates.md (const_2_to_7_operand): New.
	(const_2_to_6_operand): New.
	(u8_operand): New.
	(s8_operand): New.
	(o8_operand): New.
	(s9_operand): New.
	(register_or_s9_operand): New.

Comments

Richard Henderson July 18, 2011, 5:45 p.m. UTC | #1
On 07/18/2011 08:41 AM, Georg-Johann Lay wrote:
> +(define_insn_and_split "*muluqihi3.uconst"
> +  [(set (match_operand:HI 0 "register_operand"                         "=r")
> +        (mult:HI (zero_extend:HI (match_operand:QI 1 "register_operand" "r"))
> +                 (match_operand:HI 2 "u8_operand"                       "M")))]
> +  "AVR_HAVE_MUL
> +   && avr_gate_split1()"
> +  { gcc_unreachable(); }
> +  "&& 1"
> +  [(set (match_dup 3)
> +        (match_dup 2))
> +   ; umulqihi3
> +   (set (match_dup 0)
> +        (mult:HI (zero_extend:HI (match_dup 1))
> +                 (zero_extend:HI (match_dup 3))))]
> +  {
> +    operands[2] = gen_int_mode (INTVAL (operands[2]), QImode);
> +    operands[3] = gen_reg_rtx (QImode);
> +  })
> +

I'm not keen on this at all.  I'd much prefer a formulation like

(define_insn_and_split "*muliqihi3_uconst"
  [(set (match_operand:HI 0 "register_operand" "=r")
	(mult:HI (zero_extend:HI
		  (match_operand:QI 1 "register_operand" "r"))
		 (match_operand:HI 2 "u8_operand" "n")))
   (clobber (match_scratch:QI 3 "=&r"))]
  "AVR_HAVE_MUL"
  "#"
  "&& reload_completed"
  [...]
)

I see the obvious problem, of course: pass_split_after_reload
runs after pass_postreload_cse.

Does anything break if we simply move pass_split_after_reload
earlier?  Right to the beginning of pass_postreload for instance.
Seems to me that every port would gain by optimizing the stuff
that comes out of the post-reload splitters.


r~
Georg-Johann Lay July 18, 2011, 6:05 p.m. UTC | #2
Richard Henderson wrote:
> On 07/18/2011 08:41 AM, Georg-Johann Lay wrote:
>> +(define_insn_and_split "*muluqihi3.uconst"
>> +  [(set (match_operand:HI 0 "register_operand"                         "=r")
>> +        (mult:HI (zero_extend:HI (match_operand:QI 1 "register_operand" "r"))
>> +                 (match_operand:HI 2 "u8_operand"                       "M")))]
>> +  "AVR_HAVE_MUL
>> +   && avr_gate_split1()"
>> +  { gcc_unreachable(); }
>> +  "&& 1"
>> +  [(set (match_dup 3)
>> +        (match_dup 2))
>> +   ; umulqihi3
>> +   (set (match_dup 0)
>> +        (mult:HI (zero_extend:HI (match_dup 1))
>> +                 (zero_extend:HI (match_dup 3))))]
>> +  {
>> +    operands[2] = gen_int_mode (INTVAL (operands[2]), QImode);
>> +    operands[3] = gen_reg_rtx (QImode);
>> +  })
>> +
> 
> I'm not keen on this at all.  I'd much prefer a formulation like
> 
> (define_insn_and_split "*muliqihi3_uconst"
>   [(set (match_operand:HI 0 "register_operand" "=r")
> 	(mult:HI (zero_extend:HI
> 		  (match_operand:QI 1 "register_operand" "r"))
> 		 (match_operand:HI 2 "u8_operand" "n")))
>    (clobber (match_scratch:QI 3 "=&r"))]
>   "AVR_HAVE_MUL"
>   "#"
>   "&& reload_completed"
>   [...]
> )
> 
> I see the obvious problem, of course, pass_split_after_reload
> runs after pass_postreload_cse.

What's wrong with pre-reload splits?
The only weak point is in target-independent code: there is nothing
like split1_completed, and other information needed for better
pass-awareness is missing.

Split1 is almost like a .expand2 pass. Nice :-)
It retains all the semantics, whereas a CLOBBER is always a loss of information.

If a value is used two or more times, it is by no means clear that
the uses will be allocated to the same hard register.
Note that for some passes (after reload but before the post-reload split)
there will be a clobber around OP3, so that it is not reusable.

These drawbacks are not there for pre-reload splits; doing these things
after register allocation is always harder or no longer possible, because
different registers are wasted on the same value.

> Does anything break if we simply move pass_split_after_reload
> earlier?  Right to the beginning of pass_postreload for instance.
> Seems to me that every port would gain by optimizing the stuff
> that comes out of the post-reload splitters.

I don't know the reason for that or if other machines rely on it.

Johann

> r~
Richard Henderson July 18, 2011, 6:10 p.m. UTC | #3
On 07/18/2011 11:05 AM, Georg-Johann Lay wrote:
> What's bad with pre-reload splits?
> The only weak point is in target-independent code because there
> is nothing like split1_completed and other missing information
> for better pass-awareness.

Nothing's wrong with pre-reload splits.

However, what you've done is try very hard to work around reload
doing the Right Thing with constant spilling, namely re-generate
the constant rather than spill and restore it.  I cannot believe
that's the right way to proceed.

>> Does anything break if we simply move pass_split_after_reload
>> earlier?  Right to the beginning of pass_postreload for instance.
>> Seems to me that every port would gain by optimizing the stuff
>> that comes out of the post-reload splitters.
> 
> I don't know the reason for that or if other machines rely on it.

And that is something we need to find out.


r~
Georg-Johann Lay July 18, 2011, 7:15 p.m. UTC | #4
Richard Henderson schrieb:
> On 07/18/2011 11:05 AM, Georg-Johann Lay wrote:
> 
>>What's bad with pre-reload splits?
>>The only weak point is in target-independent code because there
>>is nothing like split1_completed and other missing information
>>for better pass-awareness.
> 
> Nothing's wrong with pre-reload splits.
> 
> However, what you've done is try very hard to work around reload
> doing the Right Thing with constant spilling, namely re-generate
> the constant rather than spill and restore it.  I cannot believe
> that's the right way to proceed.

You mean that with a clobber reload will rematerialize the constant
if there are not enough registers instead of spilling it?
And with an ordinary move like
    (set (reg) (const_int))
reload will not see that it can rematerialize it, and will spill it to
the stack?

Moreover, I wonder why target-independent code does not already
catch the situation because the pattern to be generated is just
an ordinary umulqihi3 widening multiplication.

Johann
Richard Henderson July 18, 2011, 8:14 p.m. UTC | #5
On 07/18/2011 12:15 PM, Georg-Johann Lay wrote:
>> However, what you've done is try very hard to work around reload
>> doing the Right Thing with constant spilling, namely re-generate
>> the constant rather than spill and restore it.  I cannot believe
>> that's the right way to proceed.
> 
> You mean that with a clobber reload will rematerialize the constant
> if there are not enough registers instead of spilling it?
> And with an ordinary move like
>    (set (reg) (const_int))
> reload will not see that it can rematerialize it and spill it to
> the stack?

Well, it certainly didn't use to.

Vlad, do you know what the current state of the register allocator
is wrt this sort of rematerialization?

> Moreover, I wonder why target-independent code does not already
> catch the situation because the pattern to be generated is just
> an ordinary umulqihi3 widening multiplication.

Yes, it is sad that the backends have to work around the fact
that sign/zero_extension of constants is invalid rtl.


r~
Vladimir Makarov July 19, 2011, 12:23 a.m. UTC | #6
On 07/18/2011 04:14 PM, Richard Henderson wrote:
> On 07/18/2011 12:15 PM, Georg-Johann Lay wrote:
>>> However, what you've done is try very hard to work around reload
>>> doing the Right Thing with constant spilling, namely re-generate
>>> the constant rather than spill and restore it.  I cannot believe
>>> that's the right way to proceed.
>> You mean that with a clobber reload will rematerialize the constant
>> if there are not enough registers instead of spilling it?
>> And with an ordinary move like
>>     (set (reg) (const_int))
>> reload will not see that it can rematerialize it and spill it to
>> the stack?
> Well, it certainly didn't use to.
>
> Vlad, do you know what the current state of the register allocator
> is wrt this sort of rematerialization?
>
Rematerialization is done in reload.  So it is the same as it was before 
IRA.

Reload should definitely rematerialize constants and I saw it many times 
when I analyzed the code generated by IRA+reload.

It will also do rematerialization of 'a hard reg + const'.
>> Moreover, I wonder why target-independent code does not already
>> catch the situation because the pattern to be generated is just
>> an ordinary umulqihi3 widening multiplication.
> Yes, it is sad that the backends have to work around the fact
> that sign/zero_extension of constants is invalid rtl.
>
>
> r~
Georg-Johann Lay July 19, 2011, 8:13 a.m. UTC | #7
Vladimir Makarov wrote:
> On 07/18/2011 04:14 PM, Richard Henderson wrote:
>> On 07/18/2011 12:15 PM, Georg-Johann Lay wrote:
>>>> However, what you've done is try very hard to work around reload
>>>> doing the Right Thing with constant spilling, namely re-generate
>>>> the constant rather than spill and restore it.  I cannot believe
>>>> that's the right way to proceed.
>>> You mean that with a clobber reload will rematerialize the constant
>>> if there are not enough registers instead of spilling it?
>>> And with an ordinary move like
>>>     (set (reg) (const_int))
>>> reload will not see that it can rematerialize it and spill it to
>>> the stack?
>> Well, it certainly didn't use to.
>>
>> Vlad, do you know what the current state of the register allocator
>> is wrt this sort of rematerialization?
>>
> Rematerialization is done in reload.  So it is the same as it was before
> IRA.
> 
> Reload should definitely rematerialize constants and I saw it many times
> when I analyzed the code generated by IRA+reload.
> 
> It will also do rematerialization of 'a hard reg + const'.

If I understand you correctly, a pre-reload split has no disadvantage
compared to a post-reload split? As outlined in
  http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01422.html

The first split is pre-reload and generates a pseudo to load the constant.
The second split is post-reload and uses a clobber reg.

The question is what's best if the constant is used more than once, e.g. used
again in a similar insn.  Is one approach better than the other? Or does it
not matter at all?

The whole patch is here:
  http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01411.html

Johann
Georg-Johann Lay July 19, 2011, 10:21 a.m. UTC | #8
Richard Henderson wrote:
> On 07/18/2011 12:15 PM, Georg-Johann Lay wrote:
>> Moreover, I wonder why target-independent code does not already
>> catch the situation because the pattern to be generated is just
>> an ordinary umulqihi3 widening multiplication.
> 
> Yes, it is sad that the backends have to work around the fact
> that sign/zero_extension of constants is invalid rtl.

Why is that invalid?

(set (reg:HI A)
     (const_int 1000))

(set (reg:SI B)
     (mult:SI (zero_extend:SI (reg:HI C))
              (zero_extend:SI (reg:HI A))))


If a target has native support for

(set (reg:SI B)
     (mult:SI (zero_extend:SI (reg:HI C))
              (const_int 1000)))

then a combine pattern is straightforward.

If the target has no native support, it has to work around this by means
of a combine pattern, too, along with obtaining a pseudo or a clobber reg,
respectively, which is quite tedious, as we have just seen.

There is a dedicated pass for widening-multiply detection; IMO it would
be better to detect the case there and use a combine pattern if there is
native support for widening mul with a constant.

That's much more straightforward than the current approach of not
recognizing such widening multiplies.

Johann

Georg-Johann Lay July 19, 2011, 12:24 p.m. UTC | #9
Richard Henderson wrote:
> On 07/18/2011 08:41 AM, Georg-Johann Lay wrote:
>> +(define_insn_and_split "*muluqihi3.uconst"
>> +  [(set (match_operand:HI 0 "register_operand"                         "=r")
>> +        (mult:HI (zero_extend:HI (match_operand:QI 1 "register_operand" "r"))
>> +                 (match_operand:HI 2 "u8_operand"                       "M")))]
>> +  "AVR_HAVE_MUL
>> +   && avr_gate_split1()"
>> +  { gcc_unreachable(); }
>> +  "&& 1"
>> +  [(set (match_dup 3)
>> +        (match_dup 2))
>> +   ; umulqihi3
>> +   (set (match_dup 0)
>> +        (mult:HI (zero_extend:HI (match_dup 1))
>> +                 (zero_extend:HI (match_dup 3))))]
>> +  {
>> +    operands[2] = gen_int_mode (INTVAL (operands[2]), QImode);
>> +    operands[3] = gen_reg_rtx (QImode);
>> +  })
>> +
> 
> I'm not keen on this at all.  I'd much prefer a formulation like
> 
> (define_insn_and_split "*muliqihi3_uconst"
>   [(set (match_operand:HI 0 "register_operand" "=r")
> 	(mult:HI (zero_extend:HI
> 		  (match_operand:QI 1 "register_operand" "r"))
> 		 (match_operand:HI 2 "u8_operand" "n")))
>    (clobber (match_scratch:QI 3 "=&r"))]
>   "AVR_HAVE_MUL"
>   "#"
>   "&& reload_completed"
>   [...]
> )
> 
> I see the obvious problem, of course, pass_split_after_reload
> runs after pass_postreload_cse.

I intend to do similar things for HI->SI widening multiply.
These multiplications are too complex to do inline and
thus are expanded to libgcc calls like so:

(define_expand "umulhisi3"
  [(set (reg:HI 18)
        (match_operand:HI 1 "register_operand" ""))
   (set (reg:HI 20)
        (match_operand:HI 2 "register_operand" ""))
   (set (reg:SI 22)
        (mult:SI (zero_extend:SI (reg:HI 18))
                 (zero_extend:SI (reg:HI 20))))
   (set (match_operand:SI 0 "register_operand" "")
        (reg:SI 22))]
  "AVR_HAVE_MUL"
  "")

This is no longer possible after reload, so for these cases
there is no alternative to a pre-reload split.
Or am I missing something?

Johann
Vladimir Makarov July 19, 2011, 2:23 p.m. UTC | #10
On 07/19/2011 04:13 AM, Georg-Johann Lay wrote:
> Vladimir Makarov wrote:
>> On 07/18/2011 04:14 PM, Richard Henderson wrote:
>>> On 07/18/2011 12:15 PM, Georg-Johann Lay wrote:
>>>>> However, what you've done is try very hard to work around reload
>>>>> doing the Right Thing with constant spilling, namely re-generate
>>>>> the constant rather than spill and restore it.  I cannot believe
>>>>> that's the right way to proceed.
>>>> You mean that with a clobber reload will rematerialize the constant
>>>> if there are not enough registers instead of spilling it?
>>>> And with an ordinary move like
>>>>      (set (reg) (const_int))
>>>> reload will not see that it can rematerialize it and spill it to
>>>> the stack?
>>> Well, it certainly didn't use to.
>>>
>>> Vlad, do you know what the current state of the register allocator
>>> is wrt this sort of rematerialization?
>>>
>> Rematerialization is done in reload.  So it is the same as it was before
>> IRA.
>>
>> Reload should definitely rematerialize constants and I saw it many times
>> when I analyzed the code generated by IRA+reload.
>>
>> It will also do rematerialization of 'a hard reg + const'.
> If I understand you correctly, a pre-reload split has no disadvantage
> compared to a post-reload split? As outlined in
>    http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01422.html
>
It is hard for me to say.  Only experiments on several targets, and the
effort of carrying them out, can show it.

GCC is an old compiler.  More generally speaking, I believe in the classic
approach that at this stage (1st insn scheduling, RA) RTL insns should
be as close as possible to real target insns.

The splitting mechanism is mostly a consequence of an outdated method of
partial code selection (the combiner).  Moving to the classical approach is
a huge job and might be too difficult to implement.
> The first split is pre-reload and generates a pseudo to load the constant.
> The second split is post-reload and uses a clobber reg.
>

> The question is what's best if the constant is used more than once, e.g. used
> again in similar insn.
Again, it is hard for me to say definitely.  To remove the second constant
load, reload has an inheritance mechanism.  In the post-reload passes, there
is gcse for this.

The reload pass can combine such reloads for one insn.  But again, I think
you should try it if you want to know; reload is too complicated to
affirm anything definitely.
>    Is one approach better than the other? Or does it not
> matter at all?
>
> The whole patch is here:
>    http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01411.html
>
> Johann
>
Georg-Johann Lay July 19, 2011, 3:04 p.m. UTC | #11
Vladimir Makarov wrote:
> On 07/19/2011 04:13 AM, Georg-Johann Lay wrote:
>> Vladimir Makarov wrote:
>>> On 07/18/2011 04:14 PM, Richard Henderson wrote:
>>>> On 07/18/2011 12:15 PM, Georg-Johann Lay wrote:
>>>>>> However, what you've done is try very hard to work around reload
>>>>>> doing the Right Thing with constant spilling, namely re-generate
>>>>>> the constant rather than spill and restore it.  I cannot believe
>>>>>> that's the right way to proceed.
>>>>> You mean that with a clobber reload will rematerialize the constant
>>>>> if there are not enough registers instead of spilling it?
>>>>> And with an ordinary move like
>>>>>      (set (reg) (const_int))
>>>>> reload will not see that it can rematerialize it and spill it to
>>>>> the stack?
>>>> Well, it certainly didn't use to.
>>>>
>>>> Vlad, do you know what the current state of the register allocator
>>>> is wrt this sort of rematerialization?
>>>>
>>> Rematerialization is done in reload.  So it is the same as it was before
>>> IRA.
>>>
>>> Reload should definitely rematerialize constants and I saw it many times
>>> when I analyzed the code generated by IRA+reload.
>>>
>>> It will also do rematerialization of 'a hard reg + const'.
>> If I understand you correctly, a pre-reload split has no disadvantage
>> compared to a post-reload split? As outlined in
>>    http://gcc.gnu.org/ml/gcc-patches/2011-07/msg01422.html
>>
> It is hard to me to say it.  Only the experiment on several targets can
> show it and efforts of achieving this.
> 
> GCC is an old compiler.  More generally speaking, I believe in classic
> approach that at this stage (1st insn scheduling, RA) RTL insns should
> be as close as possible to real target insns.

That's what I tried with the pre-reload split:
as close as possible to real instructions.

> The splitting mechanism is mostly a consequence of an outdated method of
> partial code selection (combiner).  Moving to the classical approach is
> a huge job and might be too difficult to implement it.
>> The first split is pre-reload and generates a pseudo to load the
>> constant.
>> The second split is post-reload and uses a clobber reg.
>>
>> The question is what's best if the constant is used more than once,
>> e.g. used
>> again in similar insn.
> Again it is hard to say definitely for me.  To remove second constant
> load, reload has inheritance mechanism.  In post-reload passes, there is
> gcse for this.

Inside reload it cannot work, because the alternative to the pre-reload
split is a clobber. And as the reg is inside a clobber, it cannot be reused.

It could be reused after the post-reload split, but as Richard mentioned,
post-reload CSE runs before the post-reload split. So that cannot work either.

> reload pass can combine such reloads for one insn.  But again i think
> you should try if you wan to know, reload is too complicated to
> definitely affirm something.

At least it cannot work for clobbers as explained above.
Richard Henderson July 19, 2011, 5:32 p.m. UTC | #12
On 07/19/2011 03:21 AM, Georg-Johann Lay wrote:
>> Yes, it is sad that the backends have to work around the fact
>> that sign/zero_extension of constants is invalid rtl.
> 
> Why is that invalid?
> 
> (set (reg:HI A)
>      (const_int 1000))
> 
> (set (reg:SI B)
>      (mult:SI (zero_extend:SI (reg:HI C))
>               (zero_extend:SI (reg:HI A))))
> 
> 
> If a target has native support for
> 
> (set (reg:SI B)
>      (mult:SI (zero_extend:SI (reg:HI C))
>               (const_int 1000)))
> 
> then a combine pattern is straightforward.

Yes.  And the result you show is quite valid and correct.

What isn't correct is

	(zero_extend:HI (const_int 100000))

because const_int does not have a mode, and we no longer
know what the source mode is for the extension.


r~

Patch

Index: config/avr/predicates.md
===================================================================
--- config/avr/predicates.md	(revision 176136)
+++ config/avr/predicates.md	(working copy)
@@ -73,6 +73,16 @@  (define_predicate "const_0_to_7_operand"
   (and (match_code "const_int")
        (match_test "IN_RANGE (INTVAL (op), 0, 7)")))
 
+;; Return 1 if OP is constant integer 2..7 for MODE.
+(define_predicate "const_2_to_7_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), 2, 7)")))
+
+;; Return 1 if OP is constant integer 2..6 for MODE.
+(define_predicate "const_2_to_6_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), 2, 6)")))
+
 ;; Returns true if OP is either the constant zero or a register.
 (define_predicate "reg_or_0_operand"
   (ior (match_operand 0 "register_operand")
@@ -156,3 +166,26 @@  (define_predicate "const_8_16_24_operand
   (and (match_code "const_int")
        (match_test "8 == INTVAL(op) || 16 == INTVAL(op) || 24 == INTVAL(op)")))
 
+;; Unsigned CONST_INT that fits in 8 bits, i.e. 0..255.
+(define_predicate "u8_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), 0, 255)")))
+
+;; Signed CONST_INT that fits in 8 bits, i.e. -128..127.
+(define_predicate "s8_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), -128, 127)")))
+
+;; One-extended CONST_INT that fits in 8 bits, i.e. -256..-1.
+(define_predicate "o8_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), -256, -1)")))
+
+;; Signed CONST_INT that fits in 9 bits, i.e. -256..255.
+(define_predicate "s9_operand"
+  (and (match_code "const_int")
+       (match_test "IN_RANGE (INTVAL (op), -256, 255)")))
+
+(define_predicate "register_or_s9_operand"
+  (ior (match_operand 0 "register_operand")
+       (match_operand 0 "s9_operand")))
Index: config/avr/avr.md
===================================================================
--- config/avr/avr.md	(revision 176276)
+++ config/avr/avr.md	(working copy)
@@ -1017,19 +1017,309 @@  (define_insn "umulqihi3"
   [(set_attr "length" "3")
    (set_attr "cc" "clobber")])
 
+(define_insn "usmulqihi3"
+  [(set (match_operand:HI 0 "register_operand"                         "=r")
+        (mult:HI (zero_extend:HI (match_operand:QI 1 "register_operand" "a"))
+                 (sign_extend:HI (match_operand:QI 2 "register_operand" "a"))))]
+  "AVR_HAVE_MUL"
+  "mulsu %2,%1
+	movw %0,r0
+	clr __zero_reg__"
+  [(set_attr "length" "3")
+   (set_attr "cc" "clobber")])
+
+;; Above insn is not canonicalized by insn combine, so here is a version with
+;; operands swapped.
+
+(define_insn "*sumulqihi3"
+  [(set (match_operand:HI 0 "register_operand"                         "=r")
+        (mult:HI (sign_extend:HI (match_operand:QI 1 "register_operand" "a"))
+                 (zero_extend:HI (match_operand:QI 2 "register_operand" "a"))))]
+  "AVR_HAVE_MUL"
+  "mulsu %1,%2
+	movw %0,r0
+	clr __zero_reg__"
+  [(set_attr "length" "3")
+   (set_attr "cc" "clobber")])
+
+;; One-extend operand 1
+
+(define_insn "*osmulqihi3"
+  [(set (match_operand:HI 0 "register_operand"                                        "=&r")
+        (mult:HI (not:HI (zero_extend:HI (not:QI (match_operand:QI 1 "register_operand" "a"))))
+                 (sign_extend:HI (match_operand:QI 2 "register_operand"                 "a"))))]
+  "AVR_HAVE_MUL"
+  "mulsu %2,%1
+	movw %0,r0
+	sub %B0,%2
+	clr __zero_reg__"
+  [(set_attr "length" "4")
+   (set_attr "cc" "clobber")])
+
+(define_insn "*oumulqihi3"
+  [(set (match_operand:HI 0 "register_operand"                                        "=&r")
+        (mult:HI (not:HI (zero_extend:HI (not:QI (match_operand:QI 1 "register_operand" "r"))))
+                 (zero_extend:HI (match_operand:QI 2 "register_operand"                 "r"))))]
+  "AVR_HAVE_MUL"
+  "mul %2,%1
+	movw %0,r0
+	sub %B0,%2
+	clr __zero_reg__"
+  [(set_attr "length" "4")
+   (set_attr "cc" "clobber")])
+
+
+;******************************************************************************
+; mul HI: $1 = sign/zero-extend, $2 = small constant
+;******************************************************************************
+
+(define_insn_and_split "*muluqihi3.uconst"
+  [(set (match_operand:HI 0 "register_operand"                         "=r")
+        (mult:HI (zero_extend:HI (match_operand:QI 1 "register_operand" "r"))
+                 (match_operand:HI 2 "u8_operand"                       "M")))]
+  "AVR_HAVE_MUL
+   && avr_gate_split1()"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (match_dup 3)
+        (match_dup 2))
+   ; umulqihi3
+   (set (match_dup 0)
+        (mult:HI (zero_extend:HI (match_dup 1))
+                 (zero_extend:HI (match_dup 3))))]
+  {
+    operands[2] = gen_int_mode (INTVAL (operands[2]), QImode);
+    operands[3] = gen_reg_rtx (QImode);
+  })
+
+(define_insn_and_split "*muluqihi3.sconst"
+  [(set (match_operand:HI 0 "register_operand"                         "=r")
+        (mult:HI (zero_extend:HI (match_operand:QI 1 "register_operand" "a"))
+                 (match_operand:HI 2 "s8_operand"                       "n")))]
+  "AVR_HAVE_MUL
+   && avr_gate_split1()"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (match_dup 3)
+        (match_dup 2))
+   ; usmulqihi3
+   (set (match_dup 0)
+        (mult:HI (zero_extend:HI (match_dup 1))
+                 (sign_extend:HI (match_dup 3))))]
+  {
+    operands[2] = gen_int_mode (INTVAL (operands[2]), QImode);
+    operands[3] = gen_reg_rtx (QImode);
+  })
+
+(define_insn_and_split "*mulsqihi3.sconst"
+  [(set (match_operand:HI 0 "register_operand"                         "=r")
+        (mult:HI (sign_extend:HI (match_operand:QI 1 "register_operand" "d"))
+                 (match_operand:HI 2 "s8_operand"                       "n")))]
+  "AVR_HAVE_MUL
+   && avr_gate_split1()"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (match_dup 3)
+        (match_dup 2))
+   ; mulqihi3
+   (set (match_dup 0)
+        (mult:HI (sign_extend:HI (match_dup 1))
+                 (sign_extend:HI (match_dup 3))))]
+  {
+    operands[2] = gen_int_mode (INTVAL (operands[2]), QImode);
+    operands[3] = gen_reg_rtx (QImode);
+  })
+
+(define_insn_and_split "*mulsqihi3.uconst"
+  [(set (match_operand:HI 0 "register_operand"                         "=r")
+        (mult:HI (sign_extend:HI (match_operand:QI 1 "register_operand" "a"))
+                 (match_operand:HI 2 "u8_operand"                       "M")))]
+  "AVR_HAVE_MUL
+   && avr_gate_split1()"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (match_dup 3)
+        (match_dup 2))
+   ; usmulqihi3
+   (set (match_dup 0)
+        (mult:HI (zero_extend:HI (match_dup 3))
+                 (sign_extend:HI (match_dup 1))))]
+  {
+    operands[2] = gen_int_mode (INTVAL (operands[2]), QImode);
+    operands[3] = gen_reg_rtx (QImode);
+  })
+
+(define_insn_and_split "*mulsqihi3.oconst"
+  [(set (match_operand:HI 0 "register_operand"                         "=r")
+        (mult:HI (sign_extend:HI (match_operand:QI 1 "register_operand" "a"))
+                 (match_operand:HI 2 "o8_operand"                       "n")))]
+  "AVR_HAVE_MUL
+   && avr_gate_split1()"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (match_dup 3)
+        (match_dup 2))
+   ; *osmulqihi3
+   (set (match_dup 0)
+        (mult:HI (not:HI (zero_extend:HI (not:QI (match_dup 3))))
+                 (sign_extend:HI (match_dup 1))))]
+  {
+    operands[2] = gen_int_mode (INTVAL (operands[2]), QImode);
+    operands[3] = gen_reg_rtx (QImode);
+  })
+
+;; The EXTEND of $1 only appears in combine, we don't see it in expand so that
+;; expand decides to use ASHIFT instead of MUL because ASHIFT costs are cheaper
+;; at that time.  Fix that.
+
+(define_insn_and_split "*ashifthi3.signx.const"
+  [(set (match_operand:HI 0 "register_operand"                           "=r")
+        (ashift:HI (sign_extend:HI (match_operand:QI 1 "register_operand" "d"))
+                   (match_operand:HI 2 "const_2_to_6_operand"             "I")))]
+  "AVR_HAVE_MUL
+   && avr_gate_split1()"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (match_dup 3)
+        (match_dup 2))
+   ; mulqihi3
+   (set (match_dup 0)
+        (mult:HI (sign_extend:HI (match_dup 1))
+                 (sign_extend:HI (match_dup 3))))]
+  {
+    operands[2] = GEN_INT (1 << INTVAL (operands[2]));
+    operands[3] = gen_reg_rtx (QImode);
+  })
+
+(define_insn_and_split "*ashifthi3.signx.const7"
+  [(set (match_operand:HI 0 "register_operand"                           "=r")
+        (ashift:HI (sign_extend:HI (match_operand:QI 1 "register_operand" "a"))
+                   (const_int 7)))]
+  "AVR_HAVE_MUL
+   && avr_gate_split1()"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (match_dup 3)
+        (match_dup 2))
+   ; usmulqihi3
+   (set (match_dup 0)
+        (mult:HI (zero_extend:HI (match_dup 3))
+                 (sign_extend:HI (match_dup 1))))]
+  {
+    operands[2] = gen_int_mode (1 << 7, QImode);
+    operands[3] = gen_reg_rtx (QImode);
+  })
+
+(define_insn_and_split "*ashifthi3.zerox.const"
+  [(set (match_operand:HI 0 "register_operand"                           "=r")
+        (ashift:HI (zero_extend:HI (match_operand:QI 1 "register_operand" "r"))
+                   (match_operand:HI 2 "const_2_to_7_operand"             "I")))]
+  "AVR_HAVE_MUL
+   && avr_gate_split1()"
+  { gcc_unreachable(); }
+  "&& 1"
+  [(set (match_dup 3)
+        (match_dup 2))
+   ; umulqihi3
+   (set (match_dup 0)
+        (mult:HI (zero_extend:HI (match_dup 1))
+                 (zero_extend:HI (match_dup 3))))]
+  {
+    operands[2] = gen_int_mode (1 << INTVAL (operands[2]), QImode);
+    operands[3] = gen_reg_rtx (QImode);
+  })
+
+;******************************************************************************
+; mul HI: $1 = sign-/zero-/one-extend, $2 = reg
+;******************************************************************************
+
+(define_insn "mulsqihi3"
+  [(set (match_operand:HI 0 "register_operand"                        "=&r")
+        (mult:HI (sign_extend:HI (match_operand:QI 1 "register_operand" "a"))
+                 (match_operand:HI 2 "register_operand"                 "a")))]
+  "AVR_HAVE_MUL"
+  "mulsu %1,%A2
+	movw %0,r0
+	mul %1,%B2
+	add %B0,r0
+	clr __zero_reg__"
+  [(set_attr "length" "5")
+   (set_attr "cc" "clobber")])
+
+(define_insn "muluqihi3"
+  [(set (match_operand:HI 0 "register_operand"                        "=&r")
+        (mult:HI (zero_extend:HI (match_operand:QI 1 "register_operand" "r"))
+                 (match_operand:HI 2 "register_operand"                 "r")))]
+  "AVR_HAVE_MUL"
+  "mul %1,%A2
+	movw %0,r0
+	mul %1,%B2
+	add %B0,r0
+	clr __zero_reg__"
+  [(set_attr "length" "5")
+   (set_attr "cc" "clobber")])
+
+;; one-extend operand 1
+
+(define_insn "muloqihi3"
+  [(set (match_operand:HI 0 "register_operand"                                        "=&r")
+        (mult:HI (not:HI (zero_extend:HI (not:QI (match_operand:QI 1 "register_operand" "r"))))
+                 (match_operand:HI 2 "register_operand"                                 "r")))]
+  "AVR_HAVE_MUL"
+  "mul %1,%A2
+	movw %0,r0
+	mul %1,%B2
+	add %B0,r0
+	sub %B0,%A2
+	clr __zero_reg__"
+  [(set_attr "length" "6")
+   (set_attr "cc" "clobber")])
+
+;******************************************************************************
+
 (define_expand "mulhi3"
   [(set (match_operand:HI 0 "register_operand" "")
 	(mult:HI (match_operand:HI 1 "register_operand" "")
-		 (match_operand:HI 2 "register_operand" "")))]
+                 (match_operand:HI 2 "register_or_s9_operand" "")))]
   ""
-  "
-{
-  if (!AVR_HAVE_MUL)
-    {
-      emit_insn (gen_mulhi3_call (operands[0], operands[1], operands[2]));
-      DONE;
-    }
-}")
+  {
+    if (!AVR_HAVE_MUL)
+      {
+        if (!register_operand (operands[2], HImode))
+          operands[2] = force_reg (HImode, operands[2]);
+
+        emit_insn (gen_mulhi3_call (operands[0], operands[1], operands[2]));
+        DONE;
+      }
+
+    /* For small constants we can do better by extending them on the fly.
+       The constant can be loaded in one instruction and the widening
+       multiplication is shorter.  Try the unsigned variant first because
+       it allows constraint "d", whereas the signed version needs "a".  */
+
+    if (s9_operand (operands[2], HImode))
+      {
+        rtx reg = force_reg (QImode, gen_int_mode (INTVAL (operands[2]), QImode));
+
+        if (u8_operand (operands[2], HImode))
+          {
+            emit_insn (gen_muluqihi3 (operands[0], reg, operands[1]));
+          }
+        else if (s8_operand (operands[2], HImode))
+          {
+            emit_insn (gen_mulsqihi3 (operands[0], reg, operands[1]));
+          }
+        else
+          {
+            emit_insn (gen_muloqihi3 (operands[0], reg, operands[1]));
+          }
+
+        DONE;
+      }
+
+    if (!register_operand (operands[2], HImode))
+      operands[2] = force_reg (HImode, operands[2]);
+  })
 
 (define_insn "*mulhi3_enh"
   [(set (match_operand:HI 0 "register_operand" "=&r")
Index: config/avr/avr-protos.h
===================================================================
--- config/avr/avr-protos.h	(revision 176136)
+++ config/avr/avr-protos.h	(working copy)
@@ -117,3 +117,4 @@  extern int class_max_nregs (enum reg_cla
 #ifdef REAL_VALUE_TYPE
 extern void asm_output_float (FILE *file, REAL_VALUE_TYPE n);
 #endif
+extern bool avr_gate_split1 (void);
Index: config/avr/avr.c
===================================================================
--- config/avr/avr.c	(revision 176276)
+++ config/avr/avr.c	(working copy)
@@ -5473,7 +5473,42 @@  avr_rtx_costs (rtx x, int codearg, int o
 
 	case HImode:
 	  if (AVR_HAVE_MUL)
-	    *total = COSTS_N_INSNS (!speed ? 7 : 10);
+            {
+              rtx op0 = XEXP (x, 0);
+              rtx op1 = XEXP (x, 1);
+              enum rtx_code code0 = GET_CODE (op0);
+              enum rtx_code code1 = GET_CODE (op1);
+              bool ex0 = SIGN_EXTEND == code0 || ZERO_EXTEND == code0;
+              bool ex1 = SIGN_EXTEND == code1 || ZERO_EXTEND == code1;
+
+              if (ex0
+                  && (u8_operand (op1, HImode)
+                      || s8_operand (op1, HImode)))
+                {
+                  *total = COSTS_N_INSNS (!speed ? 4 : 6);
+                  return true;
+                }
+              if (ex0
+                  && register_operand (op1, HImode))
+                {
+                  *total = COSTS_N_INSNS (!speed ? 5 : 8);
+                  return true;
+                }
+              else if (ex0 || ex1)
+                {
+                  *total = COSTS_N_INSNS (!speed ? 3 : 5);
+                  return true;
+                }
+              else if (register_operand (op0, HImode)
+                       && (u8_operand (op1, HImode)
+                           || s8_operand (op1, HImode)))
+                {
+                  *total = COSTS_N_INSNS (!speed ? 6 : 9);
+                  return true;
+                }
+              else
+                *total = COSTS_N_INSNS (!speed ? 7 : 10);
+            }
 	  else if (!speed)
 	    *total = COSTS_N_INSNS (AVR_HAVE_JMP_CALL ? 2 : 1);
 	  else
@@ -5556,6 +5591,17 @@  avr_rtx_costs (rtx x, int codearg, int o
 	  break;
 
 	case HImode:
+          if (AVR_HAVE_MUL)
+            {
+              if (const_2_to_7_operand (XEXP (x, 1), HImode)
+                  && (SIGN_EXTEND == GET_CODE (XEXP (x, 0))
+                      || ZERO_EXTEND == GET_CODE (XEXP (x, 0))))
+                {
+                  *total = COSTS_N_INSNS (!speed ? 4 : 6);
+                  return true;
+                }
+            }
+
 	  if (GET_CODE (XEXP (x, 1)) != CONST_INT)
 	    {
 	      *total = COSTS_N_INSNS (!speed ? 5 : 41);
@@ -6888,4 +6934,29 @@  avr_expand_builtin (tree exp, rtx target
 }
 
 
+/* FIXME:  We compose some insns by means of insn combine
+      and split them in split1.  We don't want IRA/reload
+      to combine them back to the original insns because
+      that would defeat some CSE optimizations when
+      constants are involved.  If IRA/reload recombines
+      them, the recombined insns get split again after
+      reload, but by then CSE no longer takes place.
+         At present there appears to be no other way to
+      hide these insns from IRA.  Notice that split1 runs
+      unconditionally, so all of our insns will be split
+      regardless of command line options.  */
+
+#include "tree-pass.h"
+
+bool
+avr_gate_split1 (void)
+{
+  /* True iff the current pass still runs before
+     pass_match_asm_constraints, i.e. before the splitters
+     guarded by this function have been applied.  */
+  return (current_pass->static_pass_number
+          < pass_match_asm_constraints.pass.static_pass_number);
+}
+
+
 #include "gt-avr.h"