Message ID | 1449734536-9885-1-git-send-email-aurelien@aurel32.net |
---|---|
State | New |
On 12/10/2015 12:02 AM, Aurelien Jarno wrote:
> Note: I don't really get the reason for the current 16MB limit. With the
> standard branch instructions the offset is coded on 24 bits, but shifted
> right by 2, which should give us +/-32MB jumps, and therefore a 32MB
> limit.

That might be me with an off-by-one error in the bit counting...

r~
On 2015-12-10 07:31, Richard Henderson wrote:
> On 12/10/2015 12:02 AM, Aurelien Jarno wrote:
> > Note: I don't really get the reason for the current 16MB limit. With the
> > standard branch instructions the offset is coded on 24 bits, but shifted
> > right by 2, which should give us +/-32MB jumps, and therefore a 32MB
> > limit.
>
> That might be me with an off-by-one error in the bit counting...

While the way the value is stored has been changed a few times recently, the
original value dates from this commit:

commit 1cb0661e009267a5d060c4686f0857784a8da228
Author: balrog <balrog@c046a42c-6fe2-441c-8c8c-71466251a162>
Date:   Mon Dec 1 02:10:17 2008 +0000

    arm: Reserve code buffer in memory range reachable for pc-relative branch.

    Unfortunately this range is so narrow that I'm not sure if it makes more
    sense to always use memory load to pc kind of branch instead.

    git-svn-id: svn://svn.savannah.nongnu.org/qemu/trunk@5844 c046a42c-6fe2-441c-8c8c-71466251a162

This doesn't fully explain why it is 16MB and not 32MB.

Aurelien
On Thu, Dec 10, 2015 at 4:02 PM, Aurelien Jarno <aurelien@aurel32.net> wrote:
> Use a ldr pc, [pc, #-4] kind of branch for direct jumps. This removes the
> need to flush the icache on TB linking, and allows removing the limit
> on the code generation buffer.
>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: TeLeMan <geleman@gmail.com>
> Cc: Andrzej Zaborowski <balrogg@gmail.com>
> Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
> ---
>  include/exec/exec-all.h | 24 ++++--------------------
>  tcg/arm/tcg-target.c    |  8 +++-----
>  translate-all.c         |  2 --
>  3 files changed, 7 insertions(+), 27 deletions(-)
>
> Note: I don't really get the reason for the current 16MB limit. With the
> standard branch instructions the offset is coded on 24 bits, but shifted
> right by 2, which should give us +/-32MB jumps, and therefore a 32MB
> limit.
>
> Therefore it might be a good idea to benchmark the original QEMU vs a
> QEMU with a 32MB buffer vs a QEMU with this patch.
>
> If mixing data and instructions like in this patch causes too much
> performance trouble, at least on ARMv7 we might want to try movw +
> movt + mov pc. It's only 4 bytes longer than the current patch.
>
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index d900b0d..3cd63c9 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -274,26 +274,10 @@ void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr);
>  #elif defined(__arm__)
>  static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
>  {
> -#if !QEMU_GNUC_PREREQ(4, 1)
> -    register unsigned long _beg __asm ("a1");
> -    register unsigned long _end __asm ("a2");
> -    register unsigned long _flg __asm ("a3");
> -#endif
> -
> -    /* we could use a ldr pc, [pc, #-4] kind of branch and avoid the flush */
> -    *(uint32_t *)jmp_addr =
> -        (*(uint32_t *)jmp_addr & ~0xffffff)
> -        | (((addr - (jmp_addr + 8)) >> 2) & 0xffffff);
> -
> -#if QEMU_GNUC_PREREQ(4, 1)
> -    __builtin___clear_cache((char *) jmp_addr, (char *) jmp_addr + 4);
> -#else
> -    /* flush icache */
> -    _beg = jmp_addr;
> -    _end = jmp_addr + 4;
> -    _flg = 0;
> -    __asm __volatile__ ("swi 0x9f0002" : : "r" (_beg), "r" (_end), "r" (_flg));
> -#endif
> +    /* Patch the branch destination. It uses a ldr pc, [pc, #-4] kind
> +       of branch so we write absolute address and we don't need to
> +       flush icache. */
> +    *(uint32_t *)jmp_addr = addr;
>  }
>  #elif defined(__sparc__) || defined(__mips__)
>  void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr);
> diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
> index 3edf6a6..f28b9ba 100644
> --- a/tcg/arm/tcg-target.c
> +++ b/tcg/arm/tcg-target.c
> @@ -986,10 +986,6 @@ static inline void tcg_out_st8(TCGContext *s, int cond,
>      tcg_out_st8_12(s, cond, rd, rn, offset);
>  }
>
> -/* The _goto case is normally between TBs within the same code buffer, and
> - * with the code buffer limited to 16MB we wouldn't need the long case.
> - * But we also use it for the tail-call to the qemu_ld/st helpers, which does.
> - */
>  static inline void tcg_out_goto(TCGContext *s, int cond, tcg_insn_unit *addr)
>  {
>      intptr_t addri = (intptr_t)addr;
> @@ -1649,8 +1645,10 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
>      case INDEX_op_goto_tb:
>          if (s->tb_jmp_offset) {
>              /* Direct jump method */
> +            tcg_out_ld32_12(s, COND_AL, TCG_REG_PC, TCG_REG_PC, -4);
>              s->tb_jmp_offset[args[0]] = tcg_current_code_size(s);
> -            tcg_out_b_noaddr(s, COND_AL);
> +            /* Skip over address */
> +            s->code_ptr++;
>          } else {
>              /* Indirect jump method */
>              intptr_t ptr = (intptr_t)(s->tb_next + args[0]);
> diff --git a/translate-all.c b/translate-all.c
> index 042a857..1ca113c 100644
> --- a/translate-all.c
> +++ b/translate-all.c
> @@ -472,8 +472,6 @@ static inline PageDesc *page_find(tb_page_addr_t index)
>  # define MAX_CODE_GEN_BUFFER_SIZE (2ul * 1024 * 1024 * 1024)
>  #elif defined(__aarch64__)
>  # define MAX_CODE_GEN_BUFFER_SIZE (128ul * 1024 * 1024)
> -#elif defined(__arm__)
> -# define MAX_CODE_GEN_BUFFER_SIZE (16u * 1024 * 1024)
>  #elif defined(__s390x__)
>  /* We have a +- 4GB range on the branches; leave some slop. */
>  # define MAX_CODE_GEN_BUFFER_SIZE (3ul * 1024 * 1024 * 1024)
> --
> 2.6.2
>

Tested-by: TeLeMan <geleman@gmail.com>

tb_size | flush_count | boot_time
--------|-------------|------------------
32MB    | about 90    | about 37 minutes
128MB   | 4           | about 30 minutes
256MB   | 1           | about 30 minutes

boot_time is the time from booting Windows XP until qemu's CPU usage drops
below 50%. I tested on an ARM A7 at 1.5GHz, and boot_time is not precise.
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index d900b0d..3cd63c9 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -274,26 +274,10 @@ void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr);
 #elif defined(__arm__)
 static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
 {
-#if !QEMU_GNUC_PREREQ(4, 1)
-    register unsigned long _beg __asm ("a1");
-    register unsigned long _end __asm ("a2");
-    register unsigned long _flg __asm ("a3");
-#endif
-
-    /* we could use a ldr pc, [pc, #-4] kind of branch and avoid the flush */
-    *(uint32_t *)jmp_addr =
-        (*(uint32_t *)jmp_addr & ~0xffffff)
-        | (((addr - (jmp_addr + 8)) >> 2) & 0xffffff);
-
-#if QEMU_GNUC_PREREQ(4, 1)
-    __builtin___clear_cache((char *) jmp_addr, (char *) jmp_addr + 4);
-#else
-    /* flush icache */
-    _beg = jmp_addr;
-    _end = jmp_addr + 4;
-    _flg = 0;
-    __asm __volatile__ ("swi 0x9f0002" : : "r" (_beg), "r" (_end), "r" (_flg));
-#endif
+    /* Patch the branch destination. It uses a ldr pc, [pc, #-4] kind
+       of branch so we write absolute address and we don't need to
+       flush icache. */
+    *(uint32_t *)jmp_addr = addr;
 }
 #elif defined(__sparc__) || defined(__mips__)
 void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr);
diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index 3edf6a6..f28b9ba 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -986,10 +986,6 @@ static inline void tcg_out_st8(TCGContext *s, int cond,
     tcg_out_st8_12(s, cond, rd, rn, offset);
 }

-/* The _goto case is normally between TBs within the same code buffer, and
- * with the code buffer limited to 16MB we wouldn't need the long case.
- * But we also use it for the tail-call to the qemu_ld/st helpers, which does.
- */
 static inline void tcg_out_goto(TCGContext *s, int cond, tcg_insn_unit *addr)
 {
     intptr_t addri = (intptr_t)addr;
@@ -1649,8 +1645,10 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_goto_tb:
         if (s->tb_jmp_offset) {
             /* Direct jump method */
+            tcg_out_ld32_12(s, COND_AL, TCG_REG_PC, TCG_REG_PC, -4);
             s->tb_jmp_offset[args[0]] = tcg_current_code_size(s);
-            tcg_out_b_noaddr(s, COND_AL);
+            /* Skip over address */
+            s->code_ptr++;
         } else {
             /* Indirect jump method */
             intptr_t ptr = (intptr_t)(s->tb_next + args[0]);
diff --git a/translate-all.c b/translate-all.c
index 042a857..1ca113c 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -472,8 +472,6 @@ static inline PageDesc *page_find(tb_page_addr_t index)
 # define MAX_CODE_GEN_BUFFER_SIZE (2ul * 1024 * 1024 * 1024)
 #elif defined(__aarch64__)
 # define MAX_CODE_GEN_BUFFER_SIZE (128ul * 1024 * 1024)
-#elif defined(__arm__)
-# define MAX_CODE_GEN_BUFFER_SIZE (16u * 1024 * 1024)
 #elif defined(__s390x__)
 /* We have a +- 4GB range on the branches; leave some slop. */
 # define MAX_CODE_GEN_BUFFER_SIZE (3ul * 1024 * 1024 * 1024)
Use a ldr pc, [pc, #-4] kind of branch for direct jumps. This removes the
need to flush the icache on TB linking, and allows removing the limit on
the code generation buffer.

Cc: Richard Henderson <rth@twiddle.net>
Cc: TeLeMan <geleman@gmail.com>
Cc: Andrzej Zaborowski <balrogg@gmail.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 include/exec/exec-all.h | 24 ++++--------------------
 tcg/arm/tcg-target.c    |  8 +++-----
 translate-all.c         |  2 --
 3 files changed, 7 insertions(+), 27 deletions(-)

Note: I don't really get the reason for the current 16MB limit. With the
standard branch instructions the offset is coded on 24 bits, but shifted
right by 2, which should give us +/-32MB jumps, and therefore a 32MB
limit.

Therefore it might be a good idea to benchmark the original QEMU vs a
QEMU with a 32MB buffer vs a QEMU with this patch.

If mixing data and instructions like in this patch causes too much
performance trouble, at least on ARMv7 we might want to try movw +
movt + mov pc. It's only 4 bytes longer than the current patch.