Message ID: 1348766397-20731-3-git-send-email-rth@twiddle.net
State: New
On Thu, Sep 27, 2012 at 10:19:52AM -0700, Richard Henderson wrote:
> We can't do complete constant folding because we lack "mov2",
> or the ability to insert opcodes in the stream.  But we can
> at least canonicalize add2 operand ordering and simplify
> add2 to add when the lowpart adds a constant 0.
>
> Signed-off-by: Richard Henderson <rth@twiddle.net>
> ---
>  tcg/optimize.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/tcg/optimize.c b/tcg/optimize.c
> index 55f2a24..004c336 100644
> --- a/tcg/optimize.c
> +++ b/tcg/optimize.c
> @@ -470,6 +470,11 @@ static TCGArg *tcg_constant_folding(TCGContext *s, uint16_t *tcg_opc_ptr,
>              if (swap_commutative(args[0], &args[4], &args[3])) {
>                  args[5] = tcg_invert_cond(args[5]);
>              }
> +            break;
> +        case INDEX_op_add2_i32:
> +            swap_commutative(args[0], &args[2], &args[4]);
> +            swap_commutative(args[1], &args[3], &args[5]);
> +            break;
>          default:
>              break;
>          }
> @@ -522,6 +527,32 @@ static TCGArg *tcg_constant_folding(TCGContext *s, uint16_t *tcg_opc_ptr,
>                  continue;
>              }
>              break;
> +        case INDEX_op_add2_i32:
> +        case INDEX_op_sub2_i32:
> +            /* Simplify op rl, rh, al, ah, 0, bh => op rh, ah, bh.
> +               The zero implies there will be no carry into the high part.
> +               But only when rl == al, since we can't insert the extra move
> +               that would be required.  */
> +            if (temps[args[4]].state == TCG_TEMP_CONST
> +                && temps[args[4]].val == 0
> +                && temps_are_copies(args[0], args[2])) {
> +                if (temps[args[5]].state == TCG_TEMP_CONST
> +                    && temps[args[5]].val == 0
> +                    && temps_are_copies(args[1], args[3])) {
> +                    gen_opc_buf[op_index] = INDEX_op_nop;
> +                } else {
> +                    gen_opc_buf[op_index] = (op == INDEX_op_add2_i32
> +                                             ? INDEX_op_add_i32
> +                                             : INDEX_op_sub_i32);
> +                    args[0] = args[1];
> +                    args[1] = args[3];
> +                    args[2] = args[5];
> +                    gen_args += 3;
> +                }
> +                args += 6;
> +                continue;
> +            }
> +            break;
>          default:
>              break;
>          }
> --
> 1.7.11.4
>

I understand that we can't easily insert an instruction, so the
limitation comes from here, but is it really something happening often?
Running an optimization has a CPU cost, so if the pattern is rare, the
pass might cost more than it saves.
On 09/27/2012 04:20 PM, Aurelien Jarno wrote:
> I understand that we can't easily insert an instruction, so the
> limitation comes from here, but is it really something happening often?

It will certainly appear sometimes.  E.g. s390x has an add immediate
instruction that does exactly: r1 += imm16 << 32.

Or did you mean specifically the full constant being folded?  That would
happen quite a bit more often.  You can see that with almost any 64-bit
RISC guest when it attempts to generate a constant from addition
primitives instead of logical primitives.

For a 32-bit host, we've already decomposed logical primitives into
32-bit operations, and we can constant-fold through all of those.  But
when addition comes into play, we can't constant-fold through add2.

r~
On Thu, Sep 27, 2012 at 5:19 PM, Richard Henderson <rth@twiddle.net> wrote:
> We can't do complete constant folding because we lack "mov2",
> or the ability to insert opcodes in the stream.  But we can
> at least canonicalize add2 operand ordering and simplify
> add2 to add when the lowpart adds a constant 0.

Couldn't we introduce add2_part1 and add2_part2, the latter being a nop
for architectures that don't need it?

> [remainder of the patch quoted in full, snipped]
On Thu, Sep 27, 2012 at 04:28:47PM -0700, Richard Henderson wrote:
> On 09/27/2012 04:20 PM, Aurelien Jarno wrote:
> > I understand that we can't easily insert an instruction, so the
> > limitation comes from here, but is it really something happening often?
>
> It will certainly appear sometimes.  E.g. s390x has an add immediate
> instruction that does exactly: r1 += imm16 << 32.
>
> Or did you mean specifically the full constant being folded?  That
> would happen quite a bit more often.  You can see that with almost any
> 64-bit RISC guest when it attempts to generate a constant from
> addition primitives instead of logical primitives.
>
> For a 32-bit host, we've already decomposed logical primitives into
> 32-bit operations, and we can constant-fold through all of those.  But
> when addition comes into play, we can't constant-fold through add2.

I tried this patch on an i386 host running an x86_64 target, but it
even fails to start SeaBIOS; there is probably a logic error somewhere
in the patch.

For the add2 ops that seemed to have been handled correctly, this patch
optimized 0.2% of them.  I am not sure it is worth it as is.

I think optimizing add2, and in general all *2 ops, is a good idea, but
we should be able to do more aggressive optimization.  Maybe, a bit
like Blue was suggesting, add2 should always be followed by a nop, so
that we can do more optimizations?
On 2012-09-30 00:04, Blue Swirl wrote:
>> We can't do complete constant folding because we lack "mov2",
>> or the ability to insert opcodes in the stream.  But we can
>> at least canonicalize add2 operand ordering and simplify
>> add2 to add when the lowpart adds a constant 0.
>
> Couldn't we introduce add2_part1 and add2_part2, the latter being nop
> for architectures that don't need it?

Possibly.  It certainly would be easy to model these as addcc + addx on
targets like sparc, where CC never gets clobbered during moves.

I'm a bit worried about i386 though, since loading 0 wants to use xor,
which clobbers the flags.  We could possibly work around this by taking
care of the constant loading for add2_part2 manually.  E.g. given

    { INDEX_op_add2_part2, { "r", "ri", "ri" } }

the backend could emit

    if (args[2] == args[0] && !const_args[2]) {
        /* swap arg1, arg2 */
    }
    if (const_args[1]) {
        mov $args[1], args[0]
    } else {
        mov args[1], args[0]
    }
    adcl args[2], args[0]

which means that tcg_out_movi will not have to be called in between.
It's all a bit fragile though.

That said, I do wonder if having synthetic mov2{rr,ri,ii} opcodes isn't
just easier.  Those could be broken up into two moves by tcg.c without
the backends having to care about it.

r~
On 2012-10-01 10:46, Aurelien Jarno wrote:
> For the add2 ops that seemed to have been handled correctly, this
> patch optimized 0.2% of them.  I am not sure it is worth it as is.

You're probably right.

> I think optimizing add2, and in general all *2 ops, is a good idea,
> but we should be able to do more aggressive optimization.  Maybe, a
> bit like Blue was suggesting, add2 should always be followed by a nop,
> so that we can do more optimizations?

Adding an extra nop sounds like a better idea than add2_part[12].  And
it's probably easier than adding mov2 opcodes -- one little assert
inside the optimizer and we're golden.

r~
diff --git a/tcg/optimize.c b/tcg/optimize.c
index 55f2a24..004c336 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -470,6 +470,11 @@ static TCGArg *tcg_constant_folding(TCGContext *s, uint16_t *tcg_opc_ptr,
             if (swap_commutative(args[0], &args[4], &args[3])) {
                 args[5] = tcg_invert_cond(args[5]);
             }
+            break;
+        case INDEX_op_add2_i32:
+            swap_commutative(args[0], &args[2], &args[4]);
+            swap_commutative(args[1], &args[3], &args[5]);
+            break;
         default:
             break;
         }
@@ -522,6 +527,32 @@ static TCGArg *tcg_constant_folding(TCGContext *s, uint16_t *tcg_opc_ptr,
                 continue;
             }
             break;
+        case INDEX_op_add2_i32:
+        case INDEX_op_sub2_i32:
+            /* Simplify op rl, rh, al, ah, 0, bh => op rh, ah, bh.
+               The zero implies there will be no carry into the high part.
+               But only when rl == al, since we can't insert the extra move
+               that would be required.  */
+            if (temps[args[4]].state == TCG_TEMP_CONST
+                && temps[args[4]].val == 0
+                && temps_are_copies(args[0], args[2])) {
+                if (temps[args[5]].state == TCG_TEMP_CONST
+                    && temps[args[5]].val == 0
+                    && temps_are_copies(args[1], args[3])) {
+                    gen_opc_buf[op_index] = INDEX_op_nop;
+                } else {
+                    gen_opc_buf[op_index] = (op == INDEX_op_add2_i32
+                                             ? INDEX_op_add_i32
+                                             : INDEX_op_sub_i32);
+                    args[0] = args[1];
+                    args[1] = args[3];
+                    args[2] = args[5];
+                    gen_args += 3;
+                }
+                args += 6;
+                continue;
+            }
+            break;
         default:
             break;
         }
We can't do complete constant folding because we lack "mov2",
or the ability to insert opcodes in the stream.  But we can
at least canonicalize add2 operand ordering and simplify
add2 to add when the lowpart adds a constant 0.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 tcg/optimize.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)