From patchwork Thu Dec 10 08:02:16 2015
X-Patchwork-Submitter: Aurelien Jarno
X-Patchwork-Id: 555013
From: Aurelien Jarno
To: qemu-devel@nongnu.org
Cc: TeLeMan, Aurelien Jarno, Richard Henderson
Date: Thu, 10 Dec 2015 09:02:16 +0100
Message-Id: <1449734536-9885-1-git-send-email-aurelien@aurel32.net>
Subject: [Qemu-devel] [PATCH] tcg/arm: improve direct jump

Use an ldr pc, [pc, #-4] kind of branch for direct jumps. This removes
the need to flush the icache on TB linking, and allows removing the
limit on the code generation buffer.

Cc: Richard Henderson
Cc: TeLeMan
Cc: Andrzej Zaborowski
Signed-off-by: Aurelien Jarno
Tested-by: TeLeMan
---
 include/exec/exec-all.h | 24 ++++--------------------
 tcg/arm/tcg-target.c    |  8 +++-----
 translate-all.c         |  2 --
 3 files changed, 7 insertions(+), 27 deletions(-)

Note: I don't really get the reason for the current 16MB limit. With the
standard branch instructions the offset is coded on 24 bits, but shifted
right by 2, which should give us +/-32MB jumps, and therefore a 32MB
limit. It might therefore be a good idea to benchmark the original QEMU
vs. a QEMU with a 32MB buffer vs. a QEMU with this patch.

If mixing data and instructions like in this patch causes too much
performance trouble, at least on ARMv7 we might want to try
movw + movt + mov pc instead. That is only 4 bytes longer than the
current patch.
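For reference, here is a rough sketch of what a direct-jump site looks
like with this scheme and how relinking works. This is an illustration
only, not part of the patch; the helper name below is made up.

/* The emitted jump site is two consecutive 32-bit words:
 *
 *     ldr pc, [pc, #-4]   @ jump through the literal that follows
 *     .word <target>      @ absolute address of the next TB
 *
 * On ARM the pc reads as the address of the ldr plus 8, so the #-4
 * offset points at the word right after the instruction.  The
 * instruction itself never changes; relinking a TB is a single data
 * store to the literal, which the ldr fetches as data, so no icache
 * flush is needed. */
static inline void example_set_jmp_literal(uint32_t *literal, uint32_t target)
{
    *literal = target;    /* same effect as the new tb_set_jmp_target1() below */
}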
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index d900b0d..3cd63c9 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -274,26 +274,10 @@ void aarch64_tb_set_jmp_target(uintptr_t jmp_addr, uintptr_t addr);
 #elif defined(__arm__)
 static inline void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr)
 {
-#if !QEMU_GNUC_PREREQ(4, 1)
-    register unsigned long _beg __asm ("a1");
-    register unsigned long _end __asm ("a2");
-    register unsigned long _flg __asm ("a3");
-#endif
-
-    /* we could use a ldr pc, [pc, #-4] kind of branch and avoid the flush */
-    *(uint32_t *)jmp_addr =
-        (*(uint32_t *)jmp_addr & ~0xffffff)
-        | (((addr - (jmp_addr + 8)) >> 2) & 0xffffff);
-
-#if QEMU_GNUC_PREREQ(4, 1)
-    __builtin___clear_cache((char *) jmp_addr, (char *) jmp_addr + 4);
-#else
-    /* flush icache */
-    _beg = jmp_addr;
-    _end = jmp_addr + 4;
-    _flg = 0;
-    __asm __volatile__ ("swi 0x9f0002" : : "r" (_beg), "r" (_end), "r" (_flg));
-#endif
+    /* Patch the branch destination. It uses a ldr pc, [pc, #-4] kind
+       of branch so we write absolute address and we don't need to
+       flush icache. */
+    *(uint32_t *)jmp_addr = addr;
 }
 #elif defined(__sparc__) || defined(__mips__)
 void tb_set_jmp_target1(uintptr_t jmp_addr, uintptr_t addr);
diff --git a/tcg/arm/tcg-target.c b/tcg/arm/tcg-target.c
index 3edf6a6..f28b9ba 100644
--- a/tcg/arm/tcg-target.c
+++ b/tcg/arm/tcg-target.c
@@ -986,10 +986,6 @@ static inline void tcg_out_st8(TCGContext *s, int cond,
     tcg_out_st8_12(s, cond, rd, rn, offset);
 }
 
-/* The _goto case is normally between TBs within the same code buffer, and
- * with the code buffer limited to 16MB we wouldn't need the long case.
- * But we also use it for the tail-call to the qemu_ld/st helpers, which does.
- */
 static inline void tcg_out_goto(TCGContext *s, int cond, tcg_insn_unit *addr)
 {
     intptr_t addri = (intptr_t)addr;
@@ -1649,8 +1645,10 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_goto_tb:
         if (s->tb_jmp_offset) {
             /* Direct jump method */
+            tcg_out_ld32_12(s, COND_AL, TCG_REG_PC, TCG_REG_PC, -4);
             s->tb_jmp_offset[args[0]] = tcg_current_code_size(s);
-            tcg_out_b_noaddr(s, COND_AL);
+            /* Skip over address */
+            s->code_ptr++;
         } else {
             /* Indirect jump method */
             intptr_t ptr = (intptr_t)(s->tb_next + args[0]);
diff --git a/translate-all.c b/translate-all.c
index 042a857..1ca113c 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -472,8 +472,6 @@ static inline PageDesc *page_find(tb_page_addr_t index)
 # define MAX_CODE_GEN_BUFFER_SIZE  (2ul * 1024 * 1024 * 1024)
 #elif defined(__aarch64__)
 # define MAX_CODE_GEN_BUFFER_SIZE  (128ul * 1024 * 1024)
-#elif defined(__arm__)
-# define MAX_CODE_GEN_BUFFER_SIZE  (16u * 1024 * 1024)
 #elif defined(__s390x__)
 /* We have a +- 4GB range on the branches; leave some slop.  */
 # define MAX_CODE_GEN_BUFFER_SIZE  (3ul * 1024 * 1024 * 1024)