Message ID: 20150618090813.GF19635@aurel32.net
State: New
On 18/06/2015 11:08, Aurelien Jarno wrote:
> For an i386 guest still on an x86 host, I get a 4% slower boot time by
> not using retranslation (see patch below). This is not that much
> compared to the complexity retranslation brings us.

QEMU could just always compute and store the restore_state information.
TCG needs to help filling it in (a new TCG opcode?), but it should be
easy.

Paolo

> diff --git a/target-i386/translate.c b/target-i386/translate.c
> index 58b1959..de65bba 100644
> --- a/target-i386/translate.c
> +++ b/target-i386/translate.c
> @@ -8001,6 +8001,9 @@ static inline void gen_intermediate_code_internal(X86CPU *cpu,
>
>      gen_tb_start(tb);
>      for(;;) {
> +        gen_update_cc_op(dc);
> +        gen_jmp_im(pc_ptr - dc->cs_base);
> +
>          if (unlikely(!QTAILQ_EMPTY(&cs->breakpoints))) {
>              QTAILQ_FOREACH(bp, &cs->breakpoints, entry) {
>                  if (bp->pc == pc_ptr &&
> diff --git a/translate-all.c b/translate-all.c
> index b6b0e1c..3d4c017 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -212,6 +212,8 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
>      int64_t ti;
>  #endif
>
> +    return -1;
> +
>  #ifdef CONFIG_PROFILER
>      ti = profile_getclock();
>  #endif
On 2015-06-18 11:29, Paolo Bonzini wrote:
> On 18/06/2015 11:08, Aurelien Jarno wrote:
> > For an i386 guest still on an x86 host, I get a 4% slower boot time by
> > not using retranslation (see patch below). This is not that much
> > compared to the complexity retranslation brings us.
>
> QEMU could just always compute and store the restore_state information.
> TCG needs to help filling it in (a new TCG opcode?), but it should be
> easy.

Yes, that was another approach I have in mind (I called it an exception
table in my other mail), but it requires a bit more work than just
saving the CPU state all the time. The problem is that the state
information we want to save varies from target to target. Going through
a TCG opcode means we can use the liveness analysis pass to save the
minimum amount of data.

That said, I would like to push the idea of always saving the CPU state
a bit further, to see if we can keep the same performance. There are
still improvements to make, by removing more code on the core side (such
as the call to tb_find_pc, which is now useless), or on the target side
by checking/improving helper flags. We might save the CPU state too
often if a helper doesn't declare that it doesn't touch globals.
On 18/06/2015 11:42, Aurelien Jarno wrote:
>> QEMU could just always compute and store the restore_state information.
>> TCG needs to help filling it in (a new TCG opcode?), but it should be
>> easy.
>
> Yes, that was another approach I have in mind (I called it an exception
> table in my other mail),

Okay, understood. My idea was more like always generating the gen_op_*
arrays.

> but it requires a bit more work than just saving the CPU state all the
> time. The problem is that the state information we want to save varies
> from target to target. Going through a TCG opcode means we can use the
> liveness analysis pass to save the minimum amount of data.

I mentioned a TCG opcode because the target PC is not available inside
the translator. So the translator could pepper the TCG instruction
stream with things like

    checkpoint $target_pc, $target_cc_op, $0

TCG can then use them to fill in an array stored inside the
TranslationBlock, together with the host PC. Since the gen_opc_pc,
gen_opc_instr_start and gen_opc_icount arrays are inside tcg_ctx, it may
be a good idea to store the checkpoint information compressed in a byte
array (e.g. as a series of ULEB128 values---the host and target PCs can
even be stored as deltas from the last value).

As a first step, gen_intermediate_code_pc and tcg_gen_code_search_pc can
then be merged into a single target-independent function that
uncompresses the byte array up to the required host PC into tcg_ctx.
Later you can optimize them to remove the tcg_ctx arrays altogether.

So the patches could be something like this:

1) SPARC: put the jump target information directly in gen_opc_* without
   using gen_opc_jump_pc (not trivial)

2) a few targets: instead of gen_opc_* arrays, use a new generic member
   of tcg_ctx (similar to how csbase is used generically), e.g.
   tcg_ctx.gen_opc_target1[] and tcg_ctx.gen_opc_target2[]

3) all targets: always fill in tcg_ctx.gen_*, even if search_pc is false

4) TCG: add support for a checkpoint operation, make it fill in
   tcg_ctx.gen_*

5) all targets: change explicit filling of tcg_ctx.gen_* to use the
   checkpoint operation

6) TCG/translate-all: convert gen_intermediate_code_pc as outlined above

> That said, I would like to push the idea of always saving the CPU state
> a bit further, to see if we can keep the same performance. There are
> still improvements to make, by checking/improving helper flags on the
> target side. We might save the CPU state too often if a helper doesn't
> declare that it doesn't touch globals.

True, on the other hand there are a lot of helpers to audit...

Paolo
On 2015-06-18 12:02, Paolo Bonzini wrote:
> On 18/06/2015 11:42, Aurelien Jarno wrote:
> [...]
>
> I mentioned a TCG opcode because the target PC is not available inside
> the translator.

Well, it is available through s->gen_opc_pc, but that's not that clean.

> So the translator could pepper the TCG instruction stream with things
> like
>
>     checkpoint $target_pc, $target_cc_op, $0

Yes, it's clearly better to add an explicit instruction for that. As
said, we can pass it through the liveness analysis. But it means that
this instruction will have a variable number of arguments.

> TCG can then use them to fill in an array stored inside the
> TranslationBlock, together with the host PC. Since the gen_opc_pc,
> gen_opc_instr_start and gen_opc_icount arrays are inside tcg_ctx, it
> may be a good idea to store the checkpoint information compressed in a
> byte array (e.g. as a series of ULEB128 values---the host and target
> PCs can even be stored as deltas from the last value).

Either as deltas from the last value or as deltas from the start of the
TB. What I am worried about is the size of the checkpoint information:
even if we do some compression, we might have one record per guest
instruction. I have implemented a naive version of that without
compression, storing the checkpoint data at the end of the generated
code, and it's about 30% of the size of the TB for MIPS. It's probably
smaller on architectures storing only the PC. Also, its size is quite
variable. That's why it's probably not a good idea to store it directly
in the TranslationBlock. I don't like storing it directly in the
generated code either, especially given that this part is supposed to
be executable.

> As a first step, gen_intermediate_code_pc and tcg_gen_code_search_pc
> can then be merged into a single target-independent function that
> uncompresses the byte array up to the required host PC into tcg_ctx.
> Later you can optimize them to remove the tcg_ctx arrays altogether.
>
> So the patches could be something like this:
>
> 1) SPARC: put the jump target information directly in gen_opc_* without
>    using gen_opc_jump_pc (not trivial)
>
> 2) a few targets: instead of gen_opc_* arrays, use a new generic member
>    of tcg_ctx (similar to how csbase is used generically), e.g.
>    tcg_ctx.gen_opc_target1[] and tcg_ctx.gen_opc_target2[]
>
> 3) all targets: always fill in tcg_ctx.gen_*, even if search_pc is
>    false
>
> 4) TCG: add support for a checkpoint operation, make it fill in
>    tcg_ctx.gen_*
>
> 5) all targets: change explicit filling of tcg_ctx.gen_* to use the
>    checkpoint operation
>
> 6) TCG/translate-all: convert gen_intermediate_code_pc as outlined
>    above

That sounds like a plan for when I have more time ;-)

Aurelien
> From: Aurelien Jarno [mailto:aurelien@aurel32.net]
> On 2015-06-18 12:02, Paolo Bonzini wrote:
> > TCG can then use them to fill in an array stored inside the
> > TranslationBlock, together with the host PC. [...]
> [...]
> > So the patches could be something like this:
> > [...]
> > 6) TCG/translate-all: convert gen_intermediate_code_pc as outlined
> >    above
>
> That sounds like a plan for when I have more time ;-)

Doesn't this approach still require my fixes to work correctly?

Pavel Dovgalyuk
On 2015-06-19 08:09, Pavel Dovgaluk wrote:
> > From: Aurelien Jarno [mailto:aurelien@aurel32.net]
> > [...]
> > That sounds like a plan for when I have more time ;-)
>
> Doesn't this approach still require my fixes to work correctly?

Yes, it does.

Aurélien
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 58b1959..de65bba 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -8001,6 +8001,9 @@ static inline void gen_intermediate_code_internal(X86CPU *cpu,
 
     gen_tb_start(tb);
     for(;;) {
+        gen_update_cc_op(dc);
+        gen_jmp_im(pc_ptr - dc->cs_base);
+
         if (unlikely(!QTAILQ_EMPTY(&cs->breakpoints))) {
             QTAILQ_FOREACH(bp, &cs->breakpoints, entry) {
                 if (bp->pc == pc_ptr &&
diff --git a/translate-all.c b/translate-all.c
index b6b0e1c..3d4c017 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -212,6 +212,8 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     int64_t ti;
 #endif
 
+    return -1;
+
 #ifdef CONFIG_PROFILER
     ti = profile_getclock();
 #endif