diff mbox

[2/3,AArch64] Improve zero extend

Message ID HE1PR0801MB1482D53679CE2F587DE019D183300@HE1PR0801MB1482.eurprd08.prod.outlook.com
State New
Headers show

Commit Message

Wilco Dijkstra July 19, 2016, 3:31 p.m. UTC
When zero extending a 32-bit value to 64 bits, there should always be a
SET operation on the outside, according to the patterns in aarch64.md.
However, the mid-end can also ask for the cost of a made-up instruction,
where the zero-extend is part of another operation, not SET.

In this case we currently cost the zero extend operation as a uxtb/uxth.
Instead, cost it the same way we cost "normal" 32-to-64-bit zero
extends: as a "mov" or the cost of the inner operation.

Bootstrapped and tested on aarch64-none-elf.

2016-07-19  Kristina Martsenko  <kristina.martsenko@arm.com>

	* config/aarch64/aarch64.c (aarch64_rtx_costs): Fix cost of zero extend.

---
 gcc/config/aarch64/aarch64.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Richard Earnshaw (lists) July 20, 2016, 9:35 a.m. UTC | #1
On 19/07/16 16:31, Wilco Dijkstra wrote:
> When zero extending a 32-bit value to 64 bits, there should always be a
> SET operation on the outside, according to the patterns in aarch64.md.
> However, the mid-end can also ask for the cost of a made-up instruction,
> where the zero-extend is part of another operation, not SET.
> 
> In this case we currently cost the zero extend operation as a uxtb/uxth.
> Instead, cost it the same way we cost "normal" 32-to-64-bit zero
> extends: as a "mov" or the cost of the inner operation.
> 
> Bootstrapped and tested on aarch64-none-elf.
> 
> 2016-07-19  Kristina Martsenko  <kristina.martsenko@arm.com>
> 
> 	* config/aarch64/aarch64.c (aarch64_rtx_costs): Fix cost of zero extend.
> 

I'm not sure about this, while rtx_cost is called recursively as it
walks the RTL, I'd normally expect the outer levels of the recursion to
catch the cases where zero-extend is folded into a more complex
operation.  Hitting a case like this suggests that something isn't doing
that correctly.

So what was the top-level RTX passed into rtx_cost?  I'd like to get a
better understanding about the use case before acking this patch.

A test-case would be really useful here, even if it can't be used in the
testsuite.

R.

> ---
>  gcc/config/aarch64/aarch64.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index bddffc3ab28cde3a996fd13c060de36227315fb5..a2621313d3278d39db0f1d5640b33201efefac21 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -6421,12 +6421,11 @@ cost_plus:
>  	 a 'w' register implicitly zeroes the upper bits of an 'x'
>  	 register.  However, if this is
>  
> -	   (set (reg) (zero_extend (reg)))
> +	   (zero_extend (reg))
>  
>  	 we must cost the explicit register move.  */
>        if (mode == DImode
> -	  && GET_MODE (op0) == SImode
> -	  && outer == SET)
> +	  && GET_MODE (op0) == SImode)
>  	{
>  	  int op_cost = rtx_cost (op0, VOIDmode, ZERO_EXTEND, 0, speed);
>  
>
Wilco Dijkstra July 20, 2016, 1:08 p.m. UTC | #2
Richard Earnshaw wrote:
> I'm not sure about this, while rtx_cost is called recursively as it
> walks the RTL, I'd normally expect the outer levels of the recursion to
> catch the cases where zero-extend is folded into a more complex
> operation.  Hitting a case like this suggests that something isn't doing
> that correctly.

As mentioned, the query is about an non-existent instruction, so the existing
rtx_cost code won't handle it. In fact there is no other check for "outer" anywhere
in aarch64_rtx_cost. We either assume outer == SET or know that if it isn't, the
expression will be split.

> So what was the top-level RTX passed into rtx_cost?  I'd like to get a
> better understanding about the use case before acking this patch.

An example would be:

long f(unsigned x) { return (long)x * 20; }

Combine tries to merge the constant into the multiply, so we get this cost query:

(mult:DI (zero_extend:DI (reg/v:SI 74 [ x ]))
    (const_int 20 [0x14]))

Given this is not a legal multiply, rtx_mult_cost recurses, assuming both the
zero_extend and the immediate are going to be split off. But then the zero_extend
is a SET, ie. a zero-cost operation. So not checking outer is correct.

Wilco
Richard Earnshaw (lists) July 20, 2016, 1:32 p.m. UTC | #3
On 20/07/16 14:08, Wilco Dijkstra wrote:
> Richard Earnshaw wrote:
>> I'm not sure about this, while rtx_cost is called recursively as it
>> walks the RTL, I'd normally expect the outer levels of the recursion to
>> catch the cases where zero-extend is folded into a more complex
>> operation.  Hitting a case like this suggests that something isn't doing
>> that correctly.
> 
> As mentioned, the query is about an non-existent instruction, so the existing
> rtx_cost code won't handle it. In fact there is no other check for "outer" anywhere
> in aarch64_rtx_cost. We either assume outer == SET or know that if it isn't, the
> expression will be split.
> 
>> So what was the top-level RTX passed into rtx_cost?  I'd like to get a
>> better understanding about the use case before acking this patch.
> 
> An example would be:
> 
> long f(unsigned x) { return (long)x * 20; }
> 
> Combine tries to merge the constant into the multiply, so we get this cost query:
> 
> (mult:DI (zero_extend:DI (reg/v:SI 74 [ x ]))
>     (const_int 20 [0x14]))
> 
> Given this is not a legal multiply, rtx_mult_cost recurses, assuming both the
> zero_extend and the immediate are going to be split off. But then the zero_extend
> is a SET, ie. a zero-cost operation. So not checking outer is correct.
> 
> Wilco
> 

Why does combine care what the cost is if the instruction isn't valid?

R.
Wilco Dijkstra July 20, 2016, 1:40 p.m. UTC | #4
Richard Earnshaw wrote:
> Why does combine care what the cost is if the instruction isn't valid?

No idea. Combine does lots of odd things that don't make sense to me. 
Unfortunately the costs we give for cases like this need to be accurate or
they negatively affect code quality. The reason for this patch was to fix
some unexpected slowdowns caused by the cost for zero_extend being
too high.

Wilco
Richard Earnshaw (lists) July 20, 2016, 1:56 p.m. UTC | #5
On 20/07/16 14:40, Wilco Dijkstra wrote:
> Richard Earnshaw wrote:
>> Why does combine care what the cost is if the instruction isn't valid?
> 
> No idea. Combine does lots of odd things that don't make sense to me. 
> Unfortunately the costs we give for cases like this need to be accurate or
> they negatively affect code quality. The reason for this patch was to fix
> some unexpected slowdowns caused by the cost for zero_extend being
> too high.
> 
> Wilco
> 

Well if I take your testcase and plug it into a fairly recent gcc I get:

x:
        mov     w1, 20
        umull   x0, w0, w1
        ret

If I change the constant to 33, I then get:

x:
        uxtw    x0, w0
        add     x0, x0, x0, lsl 5
        ret

Both of which look reasonable to me.
Wilco Dijkstra July 20, 2016, 3:28 p.m. UTC | #6
Richard Earnshaw wrote:
> Both of which look reasonable to me.

Yes the code we generate for these examples is fine, I don't believe this
example ever went bad. It's just the cost calculation that is incorrect with
the outer check.

Wilco
Richard Earnshaw (lists) July 20, 2016, 3:30 p.m. UTC | #7
On 20/07/16 16:28, Wilco Dijkstra wrote:
> Richard Earnshaw wrote:
>> Both of which look reasonable to me.
> 
> Yes the code we generate for these examples is fine, I don't believe this
> example ever went bad. It's just the cost calculation that is incorrect with
> the outer check.
> 
> Wilco
> 
> 

So under what circumstances does it lead to sub-optimal code?

R.
Wilco Dijkstra July 20, 2016, 3:39 p.m. UTC | #8
Richard Earnshaw wrote:
> So under what circumstances does it lead to sub-optimal code?

If the cost is incorrect Combine can make the wrong decision, for example
whether to emit a multiply-add or not. I'm not sure whether this still happens
as Kyrill fixed several issues in Combine since this patch was written.

Wilco
diff mbox

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index bddffc3ab28cde3a996fd13c060de36227315fb5..a2621313d3278d39db0f1d5640b33201efefac21 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -6421,12 +6421,11 @@  cost_plus:
 	 a 'w' register implicitly zeroes the upper bits of an 'x'
 	 register.  However, if this is
 
-	   (set (reg) (zero_extend (reg)))
+	   (zero_extend (reg))
 
 	 we must cost the explicit register move.  */
       if (mode == DImode
-	  && GET_MODE (op0) == SImode
-	  && outer == SET)
+	  && GET_MODE (op0) == SImode)
 	{
 	  int op_cost = rtx_cost (op0, VOIDmode, ZERO_EXTEND, 0, speed);