Message ID: 1348766397-20731-3-git-send-email-rth@twiddle.net
State: New
On Thu, Sep 27, 2012 at 10:19:52AM -0700, Richard Henderson wrote:
> We can't do complete constant folding because we lack "mov2",
> or the ability to insert opcodes in the stream.  But we can
> at least canonicalize add2 operand ordering and simplify
> add2 to add when the lowpart adds a constant 0.
>
> Signed-off-by: Richard Henderson <rth@twiddle.net>
> ---
>  tcg/optimize.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/tcg/optimize.c b/tcg/optimize.c
> index 55f2a24..004c336 100644
> --- a/tcg/optimize.c
> +++ b/tcg/optimize.c
> @@ -470,6 +470,11 @@ static TCGArg *tcg_constant_folding(TCGContext *s, uint16_t *tcg_opc_ptr,
>              if (swap_commutative(args[0], &args[4], &args[3])) {
>                  args[5] = tcg_invert_cond(args[5]);
>              }
> +            break;
> +        case INDEX_op_add2_i32:
> +            swap_commutative(args[0], &args[2], &args[4]);
> +            swap_commutative(args[1], &args[3], &args[5]);
> +            break;
>          default:
>              break;
>          }
> @@ -522,6 +527,32 @@ static TCGArg *tcg_constant_folding(TCGContext *s, uint16_t *tcg_opc_ptr,
>                  continue;
>              }
>              break;
> +        case INDEX_op_add2_i32:
> +        case INDEX_op_sub2_i32:
> +            /* Simplify op rl, rh, al, ah, 0, bh => op rh, ah, bh.
> +               The zero implies there will be no carry into the high part.
> +               But only when rl == al, since we can't insert the extra move
> +               that would be required.  */
> +            if (temps[args[4]].state == TCG_TEMP_CONST
> +                && temps[args[4]].val == 0
> +                && temps_are_copies(args[0], args[2])) {
> +                if (temps[args[5]].state == TCG_TEMP_CONST
> +                    && temps[args[5]].val == 0
> +                    && temps_are_copies(args[1], args[3])) {
> +                    gen_opc_buf[op_index] = INDEX_op_nop;
> +                } else {
> +                    gen_opc_buf[op_index] = (op == INDEX_op_add2_i32
> +                                             ? INDEX_op_add_i32
> +                                             : INDEX_op_sub_i32);
> +                    args[0] = args[1];
> +                    args[1] = args[3];
> +                    args[2] = args[5];
> +                    gen_args += 3;
> +                }
> +                args += 6;
> +                continue;
> +            }
> +            break;
>          default:
>              break;
>          }
> --
> 1.7.11.4
>

I understand that we can't easily insert an instruction, so the
limitation comes from here, but is it really something happening often?
Running an optimization has a CPU cost, so if the pattern is rare, the
pass might cost more than it saves.
On 09/27/2012 04:20 PM, Aurelien Jarno wrote:
> I understand that we can't easily insert an instruction, so the
> limitation comes from here, but is it really something happening often?

It will certainly appear sometimes.  E.g. s390x has an add immediate
instruction that does exactly: r1 += imm16 << 32.

Or did you mean specifically the full constant being folded?  That would
happen quite a bit more often.  You can see that with almost any 64-bit
RISC guest when it attempts to generate a constant from addition
primitives instead of logical primitives.

For a 32-bit host, we've already decomposed logical primitives into
32-bit operations, and we can constant-fold through all of those.  But
when addition comes into play, we can't constant-fold through add2.

r~
On Thu, Sep 27, 2012 at 5:19 PM, Richard Henderson <rth@twiddle.net> wrote:
> We can't do complete constant folding because we lack "mov2",
> or the ability to insert opcodes in the stream.  But we can
> at least canonicalize add2 operand ordering and simplify
> add2 to add when the lowpart adds a constant 0.

Couldn't we introduce add2_part1 and add2_part2, the latter being a nop
for architectures that don't need it?

> [remainder of the patch quoted in full, snipped]
On Thu, Sep 27, 2012 at 04:28:47PM -0700, Richard Henderson wrote:
> On 09/27/2012 04:20 PM, Aurelien Jarno wrote:
> > I understand that we can't easily insert an instruction, so the
> > limitation comes from here, but is it really something happening often?
>
> It will certainly appear sometimes.  E.g. s390x has an add immediate
> instruction that does exactly: r1 += imm16 << 32.
>
> Or did you mean specifically the full constant being folded?  That
> would happen quite a bit more often.  You can see that with almost any
> 64-bit RISC guest when it attempts to generate a constant from
> addition primitives instead of logical primitives.
>
> For a 32-bit host, we've already decomposed logical primitives into
> 32-bit operations, and we can constant-fold through all of those.  But
> when addition comes into play, we can't constant-fold through add2.

I tried this patch on an i386 host running an x86_64 target, but it
even fails to start SeaBIOS; there is probably a logic error somewhere
in the patch.

For the add2 ops that seemed to have been handled correctly, this patch
optimized 0.2% of them.  I am not sure it is worth it as is.

I think optimizing add2, and in general all *2 ops, is a good idea, but
we should be able to do more aggressive optimization.  Maybe, a bit
like Blue was suggesting, add2 should always be followed by a nop, so
that we can do more optimizations?
On 2012-09-30 00:04, Blue Swirl wrote:
>> We can't do complete constant folding because we lack "mov2",
>> or the ability to insert opcodes in the stream.  But we can
>> at least canonicalize add2 operand ordering and simplify
>> add2 to add when the lowpart adds a constant 0.
>
> Couldn't we introduce add2_part1 and add2_part2, the latter being nop
> for architectures that don't need it?

Possibly.  It certainly would be easy to model these as addcc + addx on
targets like sparc, where CC never gets clobbered during moves.

I'm a bit worried about i386 though, since loading 0 wants to use xor,
which clobbers the flags.  We could possibly work around this by taking
care of the constant loading for add2_part2 manually.  E.g. given

    { INDEX_op_add2_part2, { "r", "ri", "ri" } }

the backend could emit

    if (args[2] == args[0] && !const_args[2]) {
        /* swap arg1, arg2 */
    }
    if (const_args[1]) {
        mov $args[1], args[0]
    } else {
        mov args[1], args[0]
    }
    adcl args[2], args[0]

which means that tcg_out_movi will not have to be called in between.
It's all a bit fragile though.

That said, I do wonder if having synthetic mov2{rr,ri,ii} opcodes isn't
just easier.  Those could be broken up into two moves by tcg.c without
the backends having to care about it.

r~
On 2012-10-01 10:46, Aurelien Jarno wrote:
> For the add2 ops that seemed to have been handled correctly, this
> patch optimized 0.2% of them.  I am not sure it is worth it as is.

You're probably right.

> I think optimizing add2, and in general all *2 ops, is a good idea,
> but we should be able to do more aggressive optimization.  Maybe, a
> bit like Blue was suggesting, add2 should always be followed by a nop,
> so that we can do more optimizations?

Adding an extra nop sounds like a better idea than add2_part[12].  And
it's probably easier than adding mov2 opcodes -- one little assert
inside the optimizer and we're golden.

r~
diff --git a/tcg/optimize.c b/tcg/optimize.c
index 55f2a24..004c336 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -470,6 +470,11 @@ static TCGArg *tcg_constant_folding(TCGContext *s, uint16_t *tcg_opc_ptr,
             if (swap_commutative(args[0], &args[4], &args[3])) {
                 args[5] = tcg_invert_cond(args[5]);
             }
+            break;
+        case INDEX_op_add2_i32:
+            swap_commutative(args[0], &args[2], &args[4]);
+            swap_commutative(args[1], &args[3], &args[5]);
+            break;
         default:
             break;
         }
@@ -522,6 +527,32 @@ static TCGArg *tcg_constant_folding(TCGContext *s, uint16_t *tcg_opc_ptr,
                 continue;
             }
             break;
+        case INDEX_op_add2_i32:
+        case INDEX_op_sub2_i32:
+            /* Simplify op rl, rh, al, ah, 0, bh => op rh, ah, bh.
+               The zero implies there will be no carry into the high part.
+               But only when rl == al, since we can't insert the extra move
+               that would be required.  */
+            if (temps[args[4]].state == TCG_TEMP_CONST
+                && temps[args[4]].val == 0
+                && temps_are_copies(args[0], args[2])) {
+                if (temps[args[5]].state == TCG_TEMP_CONST
+                    && temps[args[5]].val == 0
+                    && temps_are_copies(args[1], args[3])) {
+                    gen_opc_buf[op_index] = INDEX_op_nop;
+                } else {
+                    gen_opc_buf[op_index] = (op == INDEX_op_add2_i32
+                                             ? INDEX_op_add_i32
+                                             : INDEX_op_sub_i32);
+                    args[0] = args[1];
+                    args[1] = args[3];
+                    args[2] = args[5];
+                    gen_args += 3;
+                }
+                args += 6;
+                continue;
+            }
+            break;
         default:
             break;
         }
We can't do complete constant folding because we lack "mov2",
or the ability to insert opcodes in the stream.  But we can
at least canonicalize add2 operand ordering and simplify
add2 to add when the lowpart adds a constant 0.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 tcg/optimize.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)