Message ID | AANLkTik0uFO9MKH5Rgn1xEEPGa51vlT_cJdEsn-bCsea@mail.gmail.com |
---|---|
State | New |
Headers | show |
On 5/8/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote: > On Thu, May 6, 2010 at 10:51 PM, Blue Swirl <blauwirbel@gmail.com> wrote: > > On 5/5/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote: > >> On Wed, May 5, 2010 at 12:21 AM, Blue Swirl <blauwirbel@gmail.com> wrote: > >> > On 5/3/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote: > >> >> On Mon, May 3, 2010 at 11:54 PM, Blue Swirl <blauwirbel@gmail.com> wrote: > >> >> > On 5/3/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote: > >> >> >> On Mon, May 3, 2010 at 11:24 PM, Blue Swirl <blauwirbel@gmail.com> wrote: > >> >> >> > On 5/3/10, Igor Kovalenko <igor.v.kovalenko@gmail.com> wrote: > >> >> >> >> Hi! > >> >> >> >> > >> >> >> >> There is an issue with lazy conditional codes evaluation where > >> >> >> >> we return from trap handler with mismatching conditionals. > >> >> >> >> > >> >> >> >> I seldom reproduce it here when dragging qemu window while > >> >> >> >> machine is working through silo initialization. I use gentoo minimal cd > >> >> >> >> install-sparc64-minimal-20100322.iso but I think anything with silo boot > >> >> >> >> would experience the same. Once in a while it would report crc error, > >> >> >> >> unable to open cd partition or it would fail to decompress image. > >> >> >> > > >> >> >> > I think I've also seen this. > >> >> >> > > >> >> >> >> Pattern that fails appears to require a sequence of compare insn > >> >> >> >> possibly followed by a few instructions which do not touch conditionals, > >> >> >> >> then conditional branch insn. If it happens that we trap while processing > >> >> >> >> conditional branch insn so it is restarted after return from trap then > >> >> >> >> seldom conditional codes are calculated incorrectly. > >> >> >> >> > >> >> >> >> I cannot point to exact cause but it appears that after trap return > >> >> >> >> we may have CC_OP and CC_SRC* mismatch somewhere, > >> >> >> >> since adding more cond evaluation flushes over the code helps. > >> >> >> >> > >> >> >> >> We already tried doing flush more frequently and it is still not > >> >> >> >> complete, so the question is how to finally do this once and right :) > >> >> >> >> > >> >> >> >> Obviously I do not get the design of lazy evaluation right, but > >> >> >> >> the following list appears to be good start. Plan is to prepare > >> >> >> >> a change to qemu and find a way to test it. > >> >> >> >> > >> >> >> >> 1. Since SPARC* is a RISC CPU it seems to be not profitable to > >> >> >> >> use DisasContext->cc_op to predict if flags should be not evaluated > >> >> >> >> due to overriding insn. Instead we can drop cc_op from disassembler > >> >> >> >> context and simplify code to only use cc_op from env. > >> >> >> > > >> >> >> > Not currently, but in the future we may use that to do even lazier > >> >> >> > flags computation. For example the sequence 'cmp x, y; bne target' > >> >> >> > could be much more optimal by changing the branch to do the > >> >> >> > comparison. Here's an old unfinished patch to do some of this. > >> > >> > >> I wonder if it buys anything. Sparc RISC architecture means optimizing compiler > >> would prevent any extra flags computation, right? So it is basically 1-to-1 > >> conditional computation and use. Or even worse, if we delay computation > >> until there are two or more consumers, correct? > > > > For x86 target, that is the other part of savings from using lazy > > condition computation. It's true that it will not benefit RISC targets > > much. > > > > But I think you are missing the other part, where the actual flags are > > not calculated but instead the original comparison can be used. > > > > For example, consider this Sparc64 code: > > IN: > > 0x000001fff000be58: cmp %g2, 0x51 > > > > OUT: [size=82] > > 0x40df16b0: mov 0x10(%r14),%rbp > > 0x40df16b4: mov %rbp,%rbx > > 0x40df16b7: sub $0x51,%rbx > > 0x40df16bb: mov %rbx,%r12 > > 0x40df16be: mov %r12,0x10a60(%r14) > > 0x40df16c5: mov %rbp,0x60(%r14) > > 0x40df16c9: mov $0x51,%ebp > > 0x40df16ce: mov %rbp,0x68(%r14) > > 0x40df16d2: mov %rbx,0x70(%r14) > > 0x40df16d6: mov $0x7,%ebp > > 0x40df16db: mov %ebp,0x78(%r14) > > 0x40df16df: mov $0x1fff000be5c,%rbp > > 0x40df16e9: mov %rbp,0x48(%r14) > > 0x40df16ed: mov $0x1fff000be60,%rbp > > 0x40df16f7: mov %rbp,0x50(%r14) > > 0x40df16fb: xor %eax,%eax > > 0x40df16fd: jmpq 0xbfface > > > > -------------- > > IN: > > 0x000001fff000be5c: bne 0x1fff000c268 > > > > OUT: [size=95] > > 0x40df1710: callq 0x518260 > > 0x40df1715: mov 0x98(%r14),%ebp > > 0x40df171c: mov %ebp,%ebx > > 0x40df171e: shr $0x16,%rbx > > 0x40df1722: and $0x1,%ebx > > 0x40df1725: xor $0x1,%rbx > > 0x40df1729: mov %rbx,0x90(%r14) > > 0x40df1730: mov $0x1fff000be60,%rbp > > 0x40df173a: mov %rbp,0x48(%r14) > > 0x40df173e: test %rbx,%rbx > > 0x40df1741: je 0x40df175a > > 0x40df1747: mov $0x1fff000c268,%rbp > > 0x40df1751: mov %rbp,0x50(%r14) > > 0x40df1755: jmpq 0x40df1768 > > 0x40df175a: mov $0x1fff000be64,%rbp > > 0x40df1764: mov %rbp,0x50(%r14) > > 0x40df1768: xor %eax,%eax > > 0x40df176a: jmpq 0xbfface > > > > Instead of calculating any flags, we should generate for combined > > 'cmp/btest + branch' sequences something like: > > mov 0x10(%r14),%rbp > > mov %rbp,%rbx > > cmp $0x51,%rbx > > je 0x40ac275a > > mov $0x1fff000c268,%rbp > > mov %rbp,0x50(%r14) > > jmpq 0x40ac2768 > > mov $0x1fff000be64,%rbp > > mov %rbp,0x50(%r14) > > xor %eax,%eax > > jmpq 0xbfface > > > > But I fully agree that correct code generation is the most important issue. > > > >> >> >> > > >> >> >> >> Another point is that we always write to env->cc_op when > >> >> >> >> translating *cc insns > >> >> >> >> This should solve any issue with dc->cc_op prediction going > >> >> >> >> out of sync with env->cc_op and cpu_cc_src* > >> >> >> > > >> >> >> > I think this is what is happening now. > >> >> >> > > >> >> >> >> 2. We must flush lazy evaluation back to CC_OP_FLAGS in a few cases when > >> >> >> >> a. conditional code is required by insn (like addc, cond branch etc.) > >> >> >> >> - here we can optimize by evaluating specific bits (carry?) > >> >> >> >> - not sure if it works in case we have two cond consuming insns, > >> >> >> >> where first needs carry another needs the rest of flags > >> >> >> > > >> >> >> > Here's another patch to optimize C flag handling. It doesn't pass my > >> >> >> > tests though. > >> >> >> > > >> >> >> >> b. CCR is read by rdccr (helper_rdccr) > >> >> >> >> - have to compute all flags > >> >> >> >> c. trap occurs and we prepare trap level context (saving pstate) > >> >> >> >> - have to compute all flags > >> >> >> >> d. control goes out of tcg runtime (so gdbstub reads correct value from env) > >> >> >> >> - have to compute all flags > >> >> >> > > >> >> >> > Fully agree. > >> >> >> > >> >> >> > >> >> >> Cool > >> >> >> > >> >> >> Still I'd propose to kill dc->cc_op, find a reliable way to test it > >> >> >> and then add it back possibly with more optimizations. > >> >> >> I'm lost in the code up to the point where I believe we need to > >> >> >> save/restore cc_op and cpu_cc* while switching trap levels. > >> >> > > >> >> > I'd think this should do the trick: > >> >> > > >> >> > diff --git a/target-sparc/op_helper.c b/target-sparc/op_helper.c > >> >> > index b27778b..94921cd 100644 > >> >> > --- a/target-sparc/op_helper.c > >> >> > +++ b/target-sparc/op_helper.c > >> >> > @@ -3506,6 +3506,8 @@ void do_interrupt(CPUState *env) > >> >> > } > >> >> > tsptr = cpu_tsptr(env); > >> >> > > >> >> > + helper_compute_psr(); > >> >> > + > >> >> > tsptr->tstate = ((uint64_t)GET_CCR(env) << 32) | > >> >> > ((env->asi & 0xff) << 24) | ((env->pstate & 0xf3f) << 8) | > >> >> > GET_CWP64(env); > >> >> > > >> >> > >> >> > >> >> Thanks, this change seems to work here for silo issue. > >> >> > >> >> Another change would be to flush for gdbstub use of GET_CCR and for > >> >> helper_rdccr. > >> >> I tried to embed flush into GET_CCR but the code looks ugly since we > >> >> need to proxy a call to helper_compute_psr from gdbstub passing > >> >> available env pointer. > >> >> > >> >> Not really tested with your changes, but still what is the breakage you see? > >> > > >> > Aurora 2.0 (http://distro.ibiblio.org/pub/linux/distributions/aurora/build-2.0/sparc/iso/) > >> > breaks. > >> > > >> > This is what I get with git HEAD, having pressed enter key twice: > >> > Welcome to Aurora SPARC Linux > >> > > >> > > >> > > >> > > >> > +--------------+ CD Found +--------------+ > >> > | | > >> > | To begin testing the CD media before | > >> > | installation press OK. | > >> > | | > >> > | Choose Skip to skip the media test | > >> > | and start the installation. | > >> > | | > >> > | +----+ +------+ | > >> > | | OK | | Skip | | > >> > | +----+ +------+ | > >> > | | > >> > | | > >> > +----------------------------------------+ > >> > > >> > > >> > > >> > > >> > <Tab>/<Alt-Tab> between elements | <Space> selects | <F12> next screen > >> > > >> > This is what I get with the C flag patch applied: > >> > Welcome to Aurora SPARC Linux > >> > > >> > > >> > > >> > > >> > > >> > +--------------+ Error +---------------+ > >> > | | > >> > | failed to read keymap information: | > >> > | Success | > >> > | | > >> > | +----+ | > >> > | | OK | | > >> > | +----+ | > >> > | | > >> > | | > >> > +--------------------------------------+ > >> > > >> > > >> > > >> > > >> > > >> > > >> > <Tab>/<Alt-Tab> between elements | <Space> selects | <F12> next screen > >> > > >> > >> > >> I do reproduce the issue here with 0001-Convert-C-flag-input-BROKEN.patch > > > > The patch seems harmless, so maybe it uncovers some hideous bug. > > > > > Right, the bug is inside the gen_op_*cc helpers - we need to compute C > using current flag source, whereas current implementation clobbers source > then computes C. > > I think I now get the lazy evaluation design right :) > > Something along these changes to your 0001-Convert-C-flag-input-BROKEN.patch > appears to be good for aurora 2.0 test: > > diff --git a/target-sparc/translate.c b/target-sparc/translate.c > index 94c343d..ea7c71b 100644 > --- a/target-sparc/translate.c > +++ b/target-sparc/translate.c > @@ -334,9 +334,9 @@ static inline void gen_op_add_cc(TCGv dst, TCGv > src1, TCGv src2) > > static inline void gen_op_addxi_cc(TCGv dst, TCGv src1, target_long src2) > { > + gen_helper_compute_C_icc(cpu_tmp0); > tcg_gen_mov_tl(cpu_cc_src, src1); > tcg_gen_movi_tl(cpu_cc_src2, src2); > - gen_helper_compute_C_icc(cpu_tmp0); > tcg_gen_add_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0); > tcg_gen_addi_tl(cpu_cc_dst, cpu_cc_dst, src2); > tcg_gen_mov_tl(dst, cpu_cc_dst); > @@ -344,9 +344,9 @@ static inline void gen_op_addxi_cc(TCGv dst, TCGv > src1, target_long src2) > > static inline void gen_op_addx_cc(TCGv dst, TCGv src1, TCGv src2) > { > + gen_helper_compute_C_icc(cpu_tmp0); > tcg_gen_mov_tl(cpu_cc_src, src1); > tcg_gen_mov_tl(cpu_cc_src2, src2); > - gen_helper_compute_C_icc(cpu_tmp0); > tcg_gen_add_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0); > tcg_gen_add_tl(cpu_cc_dst, cpu_cc_dst, cpu_cc_src2); > tcg_gen_mov_tl(dst, cpu_cc_dst); > @@ -417,9 +417,9 @@ static inline void gen_op_sub_cc(TCGv dst, TCGv > src1, TCGv src2) > > static inline void gen_op_subxi_cc(TCGv dst, TCGv src1, target_long src2) > { > + gen_helper_compute_C_icc(cpu_tmp0); > tcg_gen_mov_tl(cpu_cc_src, src1); > tcg_gen_movi_tl(cpu_cc_src2, src2); > - gen_helper_compute_C_icc(cpu_tmp0); > tcg_gen_sub_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0); > tcg_gen_subi_tl(cpu_cc_dst, cpu_cc_dst, src2); > tcg_gen_mov_tl(dst, cpu_cc_dst); > @@ -427,9 +427,9 @@ static inline void gen_op_subxi_cc(TCGv dst, TCGv > src1, target_long src2) > > static inline void gen_op_subx_cc(TCGv dst, TCGv src1, TCGv src2) > { > + gen_helper_compute_C_icc(cpu_tmp0); > tcg_gen_mov_tl(cpu_cc_src, src1); > tcg_gen_mov_tl(cpu_cc_src2, src2); > - gen_helper_compute_C_icc(cpu_tmp0); > tcg_gen_sub_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0); > tcg_gen_sub_tl(cpu_cc_dst, cpu_cc_dst, cpu_cc_src2); > tcg_gen_mov_tl(dst, cpu_cc_dst); Thanks a lot, with this patch my tests passed! I applied the combined patch. I also did a bit of refactoring to get the original Sparc64 issue fixed.
Blue Swirl wrote: > Thanks a lot, with this patch my tests passed! I applied the combined patch. Yes, I definitely see an improvement with this patch - at least my Debian lenny SPARC boot cd doesn't randomly kernel panic any more. It looks as if it now just can't find /init which could just be due to an incorrect device mapping somewhere. > I also did a bit of refactoring to get the original Sparc64 issue fixed. However, one thing I did notice is that this does introduce a noticeable performance penalty. With OpenBIOS SVN head I see the following: With commit 72139e83a98eba2bfed2dbc2db2818fb19e47ca0 (just before the changes): [ 59.225406] Failed to execute /init [ 59.304088] Kernel panic - not syncing: No init found. Try passing init= option to kernel. [ 59.450313] Press Stop-A (L1-A) to return to the boot prom With commit 5a834bb47c373e887de5210b7ceae96e1ef413f7 (just after the changes): [ 70.384466] Failed to execute /init [ 70.474804] Kernel panic - not syncing: No init found. Try passing init= option to kernel. So while it's technically correct, it seems to have added ~15% overhead to the emulation :( ATB, Mark.
On 5/10/10, Mark Cave-Ayland <mark.cave-ayland@siriusit.co.uk> wrote: > Blue Swirl wrote: > > > > Thanks a lot, with this patch my tests passed! I applied the combined > patch. > > > > Yes, I definitely see an improvement with this patch - at least my Debian > lenny SPARC boot cd doesn't randomly kernel panic any more. It looks as if > it now just can't find /init which could just be due to an incorrect device > mapping somewhere. > > > > I also did a bit of refactoring to get the original Sparc64 issue fixed. > > > > However, one thing I did notice is that this does introduce a noticeable > performance penalty. With OpenBIOS SVN head I see the following: > > With commit 72139e83a98eba2bfed2dbc2db2818fb19e47ca0 (just > before the changes): > > [ 59.225406] Failed to execute /init > [ 59.304088] Kernel panic - not syncing: No init found. Try passing > init= option to kernel. > [ 59.450313] Press Stop-A (L1-A) to return to the boot prom > > With commit 5a834bb47c373e887de5210b7ceae96e1ef413f7 (just > after the changes): > > [ 70.384466] Failed to execute /init > [ 70.474804] Kernel panic - not syncing: No init found. Try passing > init= option to kernel. > > > So while it's technically correct, it seems to have added ~15% overhead to > the emulation :( Guest time can be unreliable, it could also indicate that Linux executes a lot more timer interrupts. Could you retest and measure the wall clock time? I think the C flag change should only increase performance. The next commit may have negative effects because more work is done every interrupt, but it's also more correct now.
Blue Swirl wrote: > Guest time can be unreliable, it could also indicate that Linux > executes a lot more timer interrupts. Could you retest and measure the > wall clock time? > > I think the C flag change should only increase performance. The next > commit may have negative effects because more work is done every > interrupt, but it's also more correct now. Hmmm looks like you're right. I did a few more tests measuring wall clock time and averaging across a set of runs gives roughly the same time (although there does seem to be quite a bit of variation in resulting times on my system here). So nothing to worry about here. ATB, Mark.
diff --git a/target-sparc/translate.c b/target-sparc/translate.c index 94c343d..ea7c71b 100644 --- a/target-sparc/translate.c +++ b/target-sparc/translate.c @@ -334,9 +334,9 @@ static inline void gen_op_add_cc(TCGv dst, TCGv src1, TCGv src2) static inline void gen_op_addxi_cc(TCGv dst, TCGv src1, target_long src2) { + gen_helper_compute_C_icc(cpu_tmp0); tcg_gen_mov_tl(cpu_cc_src, src1); tcg_gen_movi_tl(cpu_cc_src2, src2); - gen_helper_compute_C_icc(cpu_tmp0); tcg_gen_add_tl(cpu_cc_dst, cpu_cc_src, cpu_tmp0); tcg_gen_addi_tl(cpu_cc_dst, cpu_cc_dst, src2);