Message ID: 20150618090813.GF19635@aurel32.net
State: New
On 18/06/2015 11:08, Aurelien Jarno wrote:
> For an i386 guest still on an x86 host, I get a 4% slower boot time by
> not using retranslation (see patch below). This is not that much
> compared to the complexity retranslation brings us.

QEMU could just always compute and store the restore_state information.
TCG needs to help filling it in (a new TCG opcode?), but it should be
easy.

Paolo

> diff --git a/target-i386/translate.c b/target-i386/translate.c
> index 58b1959..de65bba 100644
> --- a/target-i386/translate.c
> +++ b/target-i386/translate.c
> @@ -8001,6 +8001,9 @@ static inline void gen_intermediate_code_internal(X86CPU *cpu,
>
>      gen_tb_start(tb);
>      for(;;) {
> +        gen_update_cc_op(dc);
> +        gen_jmp_im(pc_ptr - dc->cs_base);
> +
>          if (unlikely(!QTAILQ_EMPTY(&cs->breakpoints))) {
>              QTAILQ_FOREACH(bp, &cs->breakpoints, entry) {
>                  if (bp->pc == pc_ptr &&
> diff --git a/translate-all.c b/translate-all.c
> index b6b0e1c..3d4c017 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -212,6 +212,8 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
>      int64_t ti;
>  #endif
>
> +    return -1;
> +
>  #ifdef CONFIG_PROFILER
>      ti = profile_getclock();
>  #endif
On 2015-06-18 11:29, Paolo Bonzini wrote:
> On 18/06/2015 11:08, Aurelien Jarno wrote:
> > For an i386 guest still on an x86 host, I get a 4% slower boot time by
> > not using retranslation (see patch below). This is not that much
> > compared to the complexity retranslation brings us.
>
> QEMU could just always compute and store the restore_state information.
> TCG needs to help filling it in (a new TCG opcode?), but it should be
> easy.

Yes, that was another approach I have in mind (I called it an exception
table in my other mail), but it requires a bit more work than just
saving the CPU state all the time. The problem is that the state
information we want to save varies from target to target. Going through
a TCG opcode means we can use the liveness analysis pass to save the
minimum amount of data.

That said, I would like to push the idea of always saving the CPU state
a bit further, to see if we can keep the same performance. There are
still improvements to make, by removing more code on the core side (such
as the call to tb_find_pc, which is now useless), or on the target side
by checking/improving helper flags. We might save the CPU state too
often if a helper doesn't declare that it doesn't touch globals.
On 18/06/2015 11:42, Aurelien Jarno wrote:
>> QEMU could just always compute and store the restore_state information.
>> TCG needs to help filling it in (a new TCG opcode?), but it should be
>> easy.
>
> Yes, that was another approach I have in mind (I called it an exception
> table in my other mail),

Okay, understood. My idea was more like always generating the gen_op_*
arrays.

> but it requires a bit more work than just saving the CPU state all the
> time. The problem is that the state information we want to save varies
> from target to target. Going through a TCG opcode means we can use the
> liveness analysis pass to save the minimum amount of data.

I mentioned a TCG opcode because the target PC is not available inside
the translator. So the translator could pepper the TCG instruction
stream with things like

    checkpoint $target_pc, $target_cc_op, $0

TCG can then use them to fill in an array stored inside the
TranslationBlock, together with the host PC. Since the gen_opc_pc,
gen_opc_instr_start and gen_opc_icount arrays are inside tcg_ctx, it may
be a good idea to store the checkpoint information compressed in a byte
array (e.g. as a series of ULEB128 values---the host and target PCs can
even be stored as deltas from the last value).

As a first step, gen_intermediate_code_pc and tcg_gen_code_search_pc can
then be merged into a single target-independent function that
uncompresses the byte array up to the required host PC into tcg_ctx.
Later you can optimize them to remove the tcg_ctx arrays altogether.

So the patches could be something like this:

1) SPARC: put the jump target information directly in gen_opc_* without
   using gen_opc_jump_pc (not trivial)

2) a few targets: instead of gen_opc_* arrays, use a new generic member
   of tcg_ctx (similar to how csbase is used generically), e.g.
   tcg_ctx.gen_opc_target1[] and tcg_ctx.gen_opc_target2[]

3) all targets: always fill in tcg_ctx.gen_*, even if search_pc is false

4) TCG: add support for a checkpoint operation, make it fill in
   tcg_ctx.gen_*

5) all targets: change explicit filling of tcg_ctx.gen_* to use the
   checkpoint operation

6) TCG/translate-all: convert gen_intermediate_code_pc as outlined above

> That said, I would like to push the idea of always saving the CPU state
> a bit further, to see if we can keep the same performance. There are
> still improvements to make, by checking/improving helper flags on the
> target side. We might save the CPU state too often if a helper doesn't
> declare that it doesn't touch globals.

True, on the other hand there are a lot of helpers to audit...

Paolo
On 2015-06-18 12:02, Paolo Bonzini wrote:
> On 18/06/2015 11:42, Aurelien Jarno wrote:
> [...]
>
> I mentioned a TCG opcode because the target PC is not available inside
> the translator.

Well, it is available through s->gen_opc_pc, but that's not that clean.

> So the translator could pepper the TCG instruction stream with things
> like
>
>     checkpoint $target_pc, $target_cc_op, $0

Yes, it's clearly better to add an explicit instruction for that. As
said, we can pass it through the liveness analysis. But it means that
this instruction will have a variable number of arguments.

> TCG can then use them to fill in an array stored inside the
> TranslationBlock, together with the host PC. Since the gen_opc_pc,
> gen_opc_instr_start and gen_opc_icount arrays are inside tcg_ctx, it
> may be a good idea to store the checkpoint information compressed in a
> byte array (e.g. as a series of ULEB128 values---the host and target
> PCs can even be stored as deltas from the last value).

Either as deltas from the last value or as deltas from the start of the
TB. What I am worried about is the size of the checkpoint information:
even if we do some compression, we might have one record per guest
instruction. I have implemented a naive version of that without
compression, storing the checkpoint data at the end of the generated
code, and it's about 30% of the size of the TB for MIPS. It's probably
smaller on architectures storing only the PC. Also, its size is quite
variable. That's why it's probably not a good idea to store it directly
in the TranslationBlock. I don't like storing it directly in the
generated code either, especially given that this part is supposed to
be executable.

> As a first step, gen_intermediate_code_pc and tcg_gen_code_search_pc
> can then be merged into a single target-independent function that
> uncompresses the byte array up to the required host PC into tcg_ctx.
> Later you can optimize them to remove the tcg_ctx arrays altogether.
>
> So the patches could be something like this:
>
> 1) SPARC: put the jump target information directly in gen_opc_* without
>    using gen_opc_jump_pc (not trivial)
>
> 2) a few targets: instead of gen_opc_* arrays, use a new generic member
>    of tcg_ctx (similar to how csbase is used generically), e.g.
>    tcg_ctx.gen_opc_target1[] and tcg_ctx.gen_opc_target2[]
>
> 3) all targets: always fill in tcg_ctx.gen_*, even if search_pc is
>    false
>
> 4) TCG: add support for a checkpoint operation, make it fill in
>    tcg_ctx.gen_*
>
> 5) all targets: change explicit filling of tcg_ctx.gen_* to use the
>    checkpoint operation
>
> 6) TCG/translate-all: convert gen_intermediate_code_pc as outlined
>    above

That sounds like a plan for when I have more time ;-)

Aurelien
> From: Aurelien Jarno [mailto:aurelien@aurel32.net]
> On 2015-06-18 12:02, Paolo Bonzini wrote:
> > TCG can then use them to fill in an array stored inside the
> > TranslationBlock, together with the host PC. [...]
> [...]
> > So the patches could be something like this:
> > [...]
> > 6) TCG/translate-all: convert gen_intermediate_code_pc as outlined
> >    above
>
> That sounds like a plan for when I have more time ;-)

Doesn't this approach still require my fixes to work correctly?

Pavel Dovgalyuk
On 2015-06-19 08:09, Pavel Dovgaluk wrote:
> > From: Aurelien Jarno [mailto:aurelien@aurel32.net]
> > [...]
> > That sounds like a plan for when I have more time ;-)
>
> Doesn't this approach still require my fixes to work correctly?

Yes, it does.

Aurélien
diff --git a/target-i386/translate.c b/target-i386/translate.c
index 58b1959..de65bba 100644
--- a/target-i386/translate.c
+++ b/target-i386/translate.c
@@ -8001,6 +8001,9 @@ static inline void gen_intermediate_code_internal(X86CPU *cpu,
 
     gen_tb_start(tb);
     for(;;) {
+        gen_update_cc_op(dc);
+        gen_jmp_im(pc_ptr - dc->cs_base);
+
         if (unlikely(!QTAILQ_EMPTY(&cs->breakpoints))) {
             QTAILQ_FOREACH(bp, &cs->breakpoints, entry) {
                 if (bp->pc == pc_ptr &&
diff --git a/translate-all.c b/translate-all.c
index b6b0e1c..3d4c017 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -212,6 +212,8 @@ static int cpu_restore_state_from_tb(CPUState *cpu, TranslationBlock *tb,
     int64_t ti;
 #endif
 
+    return -1;
+
 #ifdef CONFIG_PROFILER
     ti = profile_getclock();
 #endif