Message ID | 1294350874-6885-3-git-send-email-aurelien@aurel32.net |
---|---|
State | New |
Headers | show |
Hi, On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote: > Improve constant loading in two ways: > - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using > the mvn rd, #0. Fix the conditions. > - On <= ARMv6 versions, where movw and movt are not available, load the > constants using mov and orr with rotations depending on the constant > to load. This is very useful for example to load constants where the > low byte is 0. This reduce the generated code size by about 7%. That's a nice improvement. For some instructions using MVN and AND could yield even shorter code and I think with that the optimisation options (except loading from a constant pool) would be exhausted :) ... > } > + } else { > + int opc = ARITH_MOV; > + int rn = 0; > + > + do { > + int i, rot; > + > + i = ctz32(arg) & ~1; > + rot = ((32 - i) << 7) & 0xf00; > + tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot); > + arg &= ~(0xff << i); > + > + opc = ARITH_ORR; > + rn = rd; I think you could get rid of rn and just use rd from the start of the loop. Otherwise acked by me too. Best regards
On 7 January 2011 13:52, andrzej zaborowski <balrog@zabor.org> wrote: > On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote: >> Improve constant loading in two ways: >> - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using >> the mvn rd, #0. Fix the conditions. >> - On <= ARMv6 versions, where movw and movt are not available, load the >> constants using mov and orr with rotations depending on the constant >> to load. This is very useful for example to load constants where the >> low byte is 0. This reduce the generated code size by about 7%. > > That's a nice improvement. For some instructions using MVN and AND Oops, I mean for some *values*.
On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote: > Hi, > > On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote: > > Improve constant loading in two ways: > > - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using > > the mvn rd, #0. Fix the conditions. > > - On <= ARMv6 versions, where movw and movt are not available, load the > > constants using mov and orr with rotations depending on the constant > > to load. This is very useful for example to load constants where the > > low byte is 0. This reduce the generated code size by about 7%. > > That's a nice improvement. For some instructions using MVN and AND > could yield even shorter code and I think with that the optimisation > options (except loading from a constant pool) would be exhausted :) I also did something with MVN and BIC, it works well, but the problem is to find the right heuristic to choose between MOV/ORR and MVN/BIC. In my tries, it was making the code bigger. > ... > > } > > + } else { > > + int opc = ARITH_MOV; > > + int rn = 0; > > + > > + do { > > + int i, rot; > > + > > + i = ctz32(arg) & ~1; > > + rot = ((32 - i) << 7) & 0xf00; > > + tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot); > > + arg &= ~(0xff << i); > > + > > + opc = ARITH_ORR; > > + rn = rd; > > I think you could get rid of rn and just use rd from the start of the > loop. Otherwise acked by me too. > What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order to generate a correct ARM instruction.
On 7 January 2011 15:40, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote: >> Hi, >> >> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > Improve constant loading in two ways: >> > - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using >> > the mvn rd, #0. Fix the conditions. >> > - On <= ARMv6 versions, where movw and movt are not available, load the >> > constants using mov and orr with rotations depending on the constant >> > to load. This is very useful for example to load constants where the >> > low byte is 0. This reduce the generated code size by about 7%. >> >> That's a nice improvement. For some instructions using MVN and AND >> could yield even shorter code and I think with that the optimisation >> options (except loading from a constant pool) would be exhausted :) > > I also did something with MVN and BIC, it works well, but the problem is > to find the right heuristic to choose between MOV/ORR and MVN/BIC. In my > tries, it was making the code bigger. I was thinking of running both without writing the instructions, then comparing the lengths and then running the better method. It's possible that the cost of this outweights the shorter code advantage though. > >> ... >> > } >> > + } else { >> > + int opc = ARITH_MOV; >> > + int rn = 0; >> > + >> > + do { >> > + int i, rot; >> > + >> > + i = ctz32(arg) & ~1; >> > + rot = ((32 - i) << 7) & 0xf00; >> > + tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot); >> > + arg &= ~(0xff << i); >> > + >> > + opc = ARITH_ORR; >> > + rn = rd; >> >> I think you could get rid of rn and just use rd from the start of the >> loop. Otherwise acked by me too. >> > > What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order > to generate a correct ARM instruction. According to my ARM926 manual rn is ignored for MOV/MVN, perhaps it's different in later revisions. Cheers
On Fri, Jan 07, 2011 at 04:56:32PM +0100, andrzej zaborowski wrote: > On 7 January 2011 15:40, Aurelien Jarno <aurelien@aurel32.net> wrote: > > On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote: > >> Hi, > >> > >> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote: > >> > Improve constant loading in two ways: > >> > - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using > >> > the mvn rd, #0. Fix the conditions. > >> > - On <= ARMv6 versions, where movw and movt are not available, load the > >> > constants using mov and orr with rotations depending on the constant > >> > to load. This is very useful for example to load constants where the > >> > low byte is 0. This reduce the generated code size by about 7%. > >> > >> That's a nice improvement. For some instructions using MVN and AND > >> could yield even shorter code and I think with that the optimisation > >> options (except loading from a constant pool) would be exhausted :) > > > > I also did something with MVN and BIC, it works well, but the problem is > > to find the right heuristic to choose between MOV/ORR and MVN/BIC. In my > > tries, it was making the code bigger. > > I was thinking of running both without writing the instructions, then > comparing the lengths and then running the better method. It's > possible that the cost of this outweights the shorter code advantage > though. > > > > >> ... > >> > } > >> > + } else { > >> > + int opc = ARITH_MOV; > >> > + int rn = 0; > >> > + > >> > + do { > >> > + int i, rot; > >> > + > >> > + i = ctz32(arg) & ~1; > >> > + rot = ((32 - i) << 7) & 0xf00; > >> > + tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot); > >> > + arg &= ~(0xff << i); > >> > + > >> > + opc = ARITH_ORR; > >> > + rn = rd; > >> > >> I think you could get rid of rn and just use rd from the start of the > >> loop. Otherwise acked by me too. > >> > > > > What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order > > to generate a correct ARM instruction. > > According to my ARM926 manual rn is ignored for MOV/MVN, perhaps it's > different in later revisions. > I have just tried, and it actually works (tried on ARMv5 and ARMv7). Note that binutils is not able to disassemble such an instruction and outputs in qemu.log something like: | 0x01000008: e3aa50ff undefined instruction 0xe3aa50ff However what worries me the most is that the "ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition" defines this opcode with the rn field as "(0)(0)(0)(0)". Looking at what it means: | An instruction is UNPREDICTABLE if: | [...] | * the pseudocode for that encoding does not indicate that a different | special case applies, and a bit marked (0) or (1) in the encoding | diagram of an instruction is not 0 or 1 respectively. In short is it still going to work on newer CPUs?
On 9 January 2011 23:40, Aurelien Jarno <aurelien@aurel32.net> wrote: > On Fri, Jan 07, 2011 at 04:56:32PM +0100, andrzej zaborowski wrote: >> On 7 January 2011 15:40, Aurelien Jarno <aurelien@aurel32.net> wrote: >> > On Fri, Jan 07, 2011 at 01:52:25PM +0100, andrzej zaborowski wrote: >> >> On 6 January 2011 22:54, Aurelien Jarno <aurelien@aurel32.net> wrote: >> >> ... >> >> > } >> >> > + } else { >> >> > + int opc = ARITH_MOV; >> >> > + int rn = 0; >> >> > + >> >> > + do { >> >> > + int i, rot; >> >> > + >> >> > + i = ctz32(arg) & ~1; >> >> > + rot = ((32 - i) << 7) & 0xf00; >> >> > + tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot); >> >> > + arg &= ~(0xff << i); >> >> > + >> >> > + opc = ARITH_ORR; >> >> > + rn = rd; >> >> >> >> I think you could get rid of rn and just use rd from the start of the >> >> loop. Otherwise acked by me too. >> >> >> > >> > What do you mean exactly? rn has to be 0 when opc is ARITH_MOV in order >> > to generate a correct ARM instruction. >> >> According to my ARM926 manual rn is ignored for MOV/MVN, perhaps it's >> different in later revisions. >> > > I have just tried, and it actually works (tried on ARMv5 and ARMv7). Also works under qemu-arm :) > Note that binutils is not able to disassemble such an instruction and > outputs in qemu.log something like: > | 0x01000008: e3aa50ff undefined instruction 0xe3aa50ff > > However what worries me the most is that the "ARM Architecture Reference > Manual ARMv7-A and ARMv7-R edition" defines this opcode with the rn field > as "(0)(0)(0)(0)". Looking at what it means: > > | An instruction is UNPREDICTABLE if: > | [...] > | * the pseudocode for that encoding does not indicate that a different > | special case applies, and a bit marked (0) or (1) in the encoding > | diagram of an instruction is not 0 or 1 respectively. > > In short is it still going to work on newer CPUs? Perhaps let's be on the safe side and use your version with rn = 0. I think it *should* work on the new ARM ISAs because of backwards compatibility: x works under ARMv4 & ARMv5 and x is not listed under the differences between new and old ISA, thus it needs to work under a new ISA. Cheers
On 9 January 2011 23:33, andrzej zaborowski <balrogg@gmail.com> wrote: > On 9 January 2011 23:40, Aurelien Jarno <aurelien@aurel32.net> wrote: >> Note that binutils is not able to disassemble such an instruction and >> outputs in qemu.log something like: >> | 0x01000008: e3aa50ff undefined instruction 0xe3aa50ff >> >> However what worries me the most is that the "ARM Architecture Reference >> Manual ARMv7-A and ARMv7-R edition" defines this opcode with the rn field >> as "(0)(0)(0)(0)". Looking at what it means: >> >> | An instruction is UNPREDICTABLE if: >> | [...] >> | * the pseudocode for that encoding does not indicate that a different >> | special case applies, and a bit marked (0) or (1) in the encoding >> | diagram of an instruction is not 0 or 1 respectively. >> >> In short is it still going to work on newer CPUs? It might not work on existing CPUs, never mind newer ones. We mean it about UNPREDICTABLE :-) Some cores choose to make patterns which fail these "should be zero/one" checks cause an UNDEF exception. Some don't. > I think it *should* work on the new ARM ISAs because of backwards > compatibility: x works under ARMv4 & ARMv5 and x is not listed under > the differences between new and old ISA, thus it needs to work under a > new ISA. I went back and checked the ARM ARM for ARMv4 (that's ARM document DUI0100B, dated 1996). It says that for MOV and MVN bits 19..16 are "SBZ", ie "Should Be Zero", meaning that non-zero is UNPREDICTABLE. So this isn't a change in behaviour -- the ISA has always been clear that you should not do it. [Note for the unwary: UNPREDICTABLE in ARM docs doesn't mean totally unpredictable -- an implementation isn't allowed to permit it to be a security hole or to hang the processor, for instance. But you can't rely on it doing anything useful or consistent.] -- PMM
diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c index 08c44c1..1eb5605 100644 --- a/tcg/arm/tcg-target.c +++ b/tcg/arm/tcg-target.c @@ -406,35 +406,38 @@ static inline void tcg_out_dat_imm(TCGContext *s, } static inline void tcg_out_movi32(TCGContext *s, - int cond, int rd, int32_t arg) + int cond, int rd, uint32_t arg) { /* TODO: This is very suboptimal, we can easily have a constant * pool somewhere after all the instructions. */ - - if (arg < 0 && arg > -0x100) - return tcg_out_dat_imm(s, cond, ARITH_MVN, rd, 0, (~arg) & 0xff); - - if (use_armv7_instructions) { + if ((int)arg < 0 && (int)arg >= -0x100) { + tcg_out_dat_imm(s, cond, ARITH_MVN, rd, 0, (~arg) & 0xff); + } else if (use_armv7_instructions) { /* use movw/movt */ /* movw */ tcg_out32(s, (cond << 28) | 0x03000000 | (rd << 12) | ((arg << 4) & 0x000f0000) | (arg & 0xfff)); - if (arg & 0xffff0000) + if (arg & 0xffff0000) { /* movt */ tcg_out32(s, (cond << 28) | 0x03400000 | (rd << 12) | ((arg >> 12) & 0x000f0000) | ((arg >> 16) & 0xfff)); - } else { - tcg_out_dat_imm(s, cond, ARITH_MOV, rd, 0, arg & 0xff); - if (arg & 0x0000ff00) - tcg_out_dat_imm(s, cond, ARITH_ORR, rd, rd, - ((arg >> 8) & 0xff) | 0xc00); - if (arg & 0x00ff0000) - tcg_out_dat_imm(s, cond, ARITH_ORR, rd, rd, - ((arg >> 16) & 0xff) | 0x800); - if (arg & 0xff000000) - tcg_out_dat_imm(s, cond, ARITH_ORR, rd, rd, - ((arg >> 24) & 0xff) | 0x400); } + } else { + int opc = ARITH_MOV; + int rn = 0; + + do { + int i, rot; + + i = ctz32(arg) & ~1; + rot = ((32 - i) << 7) & 0xf00; + tcg_out_dat_imm(s, cond, opc, rd, rn, ((arg >> i) & 0xff) | rot); + arg &= ~(0xff << i); + + opc = ARITH_ORR; + rn = rd; + } while (arg); + } } static inline void tcg_out_mul32(TCGContext *s,
Improve constant loading in two ways: - On all ARM versions, it's possible to load 0xffffff00 = -0x100 using the mvn rd, #0. Fix the conditions. - On <= ARMv6 versions, where movw and movt are not available, load the constants using mov and orr with rotations depending on the constant to load. This is very useful for example to load constants where the low byte is 0. This reduce the generated code size by about 7%. Also fix the coding style at the same time. Cc: Andrzej Zaborowski <balrog@zabor.org> Signed-off-by: Aurelien Jarno <aurelien@aurel32.net> --- tcg/arm/tcg-target.c | 39 +++++++++++++++++++++------------------ 1 files changed, 21 insertions(+), 18 deletions(-)