target-arm: use clz32() instead of a for loop

Message ID

20091015211452.GC7071@volta.aurel32.net

State

New

Headers

Date: Thu, 15 Oct 2009 23:14:52 +0200
From: Aurelien Jarno <aurelien@aurel32.net>
To: qemu-devel@nongnu.org
Message-ID: <20091015211452.GC7071@volta.aurel32.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-15
Content-Disposition: inline
User-Agent: Mutt/1.5.18 (2008-05-17)
Subject: [Qemu-devel] [PATCH] target-arm: use clz32() instead of a for loop
Precedence: list
Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Comments

Laurent Desnogues Oct. 18, 2009, 2:21 p.m. UTC | #1

On Thu, Oct 15, 2009 at 11:14 PM, Aurelien Jarno <aurelien@aurel32.net> wrote:
> Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
> ---
>  target-arm/helper.c |    6 ++----
>  1 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/target-arm/helper.c b/target-arm/helper.c
> index 701629a..656b5df 100644
> --- a/target-arm/helper.c
> +++ b/target-arm/helper.c
> @@ -7,6 +7,7 @@
>  #include "gdbstub.h"
>  #include "helpers.h"
>  #include "qemu-common.h"
> +#include "host-utils.h"
>
>  static uint32_t cortexa8_cp15_c0_c1[8] =
>  { 0x1031, 0x11, 0x400, 0, 0x31100003, 0x20000000, 0x01202000, 0x11 };
> @@ -394,10 +395,7 @@ uint32_t HELPER(uxtb16)(uint32_t x)
>
>  uint32_t HELPER(clz)(uint32_t x)
>  {
> -    int count;
> -    for (count = 32; x; count--)
> -        x >>= 1;
> -    return count;
> +    return clz32(x);
>  }
>
>  int32_t HELPER(sdiv)(int32_t num, int32_t den)
> --
> 1.6.1.3

Acked-by: Laurent Desnogues <laurent.desnogues@gmail.com>

Stuart Brady Oct. 23, 2009, 12:34 a.m. UTC | #2

On Thu, Oct 15, 2009 at 11:14:52PM +0200, Aurelien Jarno wrote:
> @@ -394,10 +395,7 @@ uint32_t HELPER(uxtb16)(uint32_t x)
>  
>  uint32_t HELPER(clz)(uint32_t x)
>  {
> -    int count;
> -    for (count = 32; x; count--)
> -        x >>= 1;
> -    return count;
> +    return clz32(x);
>  }
>  
>  int32_t HELPER(sdiv)(int32_t num, int32_t den)

Just a quick note that the implementation of clz, ctz and popcnt is
still listed in the TCG TODO list.  The last time I looked, I noticed
that quite a few architectures have clz/ctz instructions:

   http://lkml.indiana.edu/hypermail/linux/kernel/0601.3/1683.html

For those that don't, I think a combination of the following two hacks at
http://graphics.stanford.edu/~seander/bithacks.html could be used:

   'Round up to the next highest power of 2'
   'Counting bits set, in parallel'

With this, it should be possible to implement clz and ctz without too
many operations for both 32-bit and 64-bit integers, without requiring
floats, lookup tables or branches.  Of course, __builtin_clz() might
well do a better job...
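
A minimal sketch of how those two hacks might combine, for illustration only. The function names (popcount32_sketch, clz32_sketch, ctz32_sketch) are hypothetical and are not QEMU's. The idea is to smear the highest set bit downwards (the first half of the round-up trick), then count the set bits in parallel; no branches, tables or floats are needed, and an input of 0 naturally yields 32.

```c
#include <stdint.h>

/* 'Counting bits set, in parallel': sum adjacent fields of 1, 2, 4, 8, 16 bits. */
static uint32_t popcount32_sketch(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    v = (v + (v >> 4)) & 0x0f0f0f0fu;
    v = (v + (v >> 8)) & 0x00ff00ffu;
    v = (v + (v >> 16)) & 0x0000ffffu;
    return v;
}

/* Smear the highest set bit into every lower position, then count the
 * bits that are still clear.  Returns 32 for x == 0. */
static uint32_t clz32_sketch(uint32_t x)
{
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    return 32u - popcount32_sketch(x);
}

/* Isolate the lowest set bit, subtract one to get a mask of the trailing
 * zeros, and count that mask.  Returns 32 for x == 0. */
static uint32_t ctz32_sketch(uint32_t x)
{
    return popcount32_sketch((x & (0u - x)) - 1u);
}
```

For example, clz32_sketch(1) gives 31 and ctz32_sketch(8) gives 3. Whether this beats a compiler builtin on any given host is not something this sketch measures.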

BTW, it may be worth pointing out:

   B[4] = 0x0000ffff;
   B[3] = B[4] ^ (B[4] << 8) => 0x00ff00ff
   B[2] = B[3] ^ (B[3] << 4) => 0x0f0f0f0f
   B[1] = B[2] ^ (B[2] << 2) => 0x33333333
   B[0] = B[1] ^ (B[1] << 1) => 0x55555555

In reality, I wonder if five separate loads would be quicker, though.
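
The derivation above can be written as a short loop, sketched here for illustration. The function name derive_masks_sketch is hypothetical; the shift for step i is 1 << i, which gives 8, 4, 2 and 1 for B[3] down to B[0].

```c
#include <stdint.h>

/* Fill B[0..4] with the parallel bit-count masks, deriving each one
 * from its successor instead of loading five separate constants. */
static void derive_masks_sketch(uint32_t B[5])
{
    int i;

    B[4] = 0x0000ffffu;
    for (i = 3; i >= 0; i--) {
        /* XOR the mask with a copy of itself shifted by half its field width. */
        B[i] = B[i + 1] ^ (B[i + 1] << (1 << i));
    }
}
```

This trades four constant loads for four shift-and-XOR pairs; which is cheaper depends on the host, as noted above.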

Cheers,

Aurelien Jarno Oct. 23, 2009, 7:04 a.m. UTC | #3

Stuart Brady wrote:
> On Thu, Oct 15, 2009 at 11:14:52PM +0200, Aurelien Jarno wrote:
>> @@ -394,10 +395,7 @@ uint32_t HELPER(uxtb16)(uint32_t x)
>>  
>>  uint32_t HELPER(clz)(uint32_t x)
>>  {
>> -    int count;
>> -    for (count = 32; x; count--)
>> -        x >>= 1;
>> -    return count;
>> +    return clz32(x);
>>  }
>>  
>>  int32_t HELPER(sdiv)(int32_t num, int32_t den)
> 
> Just a quick note that the implementation of clz, ctz and popcnt is
> still listed in the TCG TODO list.  The last time I looked, I noticed
> that quite a few architectures have clz/ctz instructions:
> 
>    http://lkml.indiana.edu/hypermail/linux/kernel/0601.3/1683.html

OTOH, a dump shows that those instructions are not used that often, so I
am not sure it is worth implementing them.

> For those that don't, I think a combination of the following two hacks at
> http://graphics.stanford.edu/~seander/bithacks.html could be used:

The best is probably to use a helper in that case, calling clz32(x).

Stuart Brady Oct. 23, 2009, 12:47 p.m. UTC | #4

On Fri, Oct 23, 2009 at 09:04:53AM +0200, Aurelien Jarno wrote:
> Stuart Brady wrote:
> > Just a quick note that the implementation of clz, ctz and popcnt is
> > still listed in the TCG TODO list.  The last time I looked, I noticed
> > that quite a few architectures have clz/ctz instructions:
> > 
> >    http://lkml.indiana.edu/hypermail/linux/kernel/0601.3/1683.html
> 
> OTOH, a dump shows that those instructions are not used that often, so I
> am not sure it is worth implementing them.

Really?  I'm surprised, as I gather that optimised ffs/fls/hweight
functions in the kernel do give a modest gain...  I suppose I'll have
to try it on several different targets and see! :-)

> > For those that don't, I think a combination of the following two hacks at
> > http://graphics.stanford.edu/~seander/bithacks.html could be used:
> 
> The best is probably to use a helper in that case, calling clz32(x).

Yes, you're right.

There are several other places that should also call clz32()/ctz32().
The ones that I can see are helper_neon_cls_s32() for ARM, helper_bsf()
and helper_bsr() for X86, helper_ff1() for M68K.  (I'm not sure about
'do_clz8' and 'do_clz16', though.)

At some point, possibly next weekend, I'll submit patches to add clz
and ctz helpers to tcg-runtime.c, and to convert Alpha, ARM, CRIS, M68K,
MIPS, PowerPC and x86 (any others I've missed?) to use those helpers.

Cheers,

Aurelien Jarno Oct. 23, 2009, 2:38 p.m. UTC | #5

Stuart Brady wrote:
> On Fri, Oct 23, 2009 at 09:04:53AM +0200, Aurelien Jarno wrote:
>> Stuart Brady wrote:
>>> Just a quick note that the implementation of clz, ctz and popcnt is
>>> still listed in the TCG TODO list.  The last time I looked, I noticed
>>> that quite a few architectures have clz/ctz instructions:
>>>
>>>    http://lkml.indiana.edu/hypermail/linux/kernel/0601.3/1683.html
>> OTOH, a dump shows that those instructions are not used that often, so I
>> am not sure it is worth implementing them.
> 
> Really?  I'm surprised, as I gather that optimised ffs/fls/hweight
> functions in the kernel do give a modest gain...  I suppose I'll have
> to try it on several different targets and see! :-)

I had a quick look at MIPS, and at least there, these instructions are used often.

>>> For those that don't, I think a combination of the following two hacks at
>>> http://graphics.stanford.edu/~seander/bithacks.html could be used:
>> The best is probably to use a helper in that case, calling clz32(x).
> 
> Yes, you're right.
> 
> There are several other places that should also call clz32()/ctz32().
> The ones that I can see are helper_neon_cls_s32() for ARM, helper_bsf()
> and helper_bsr() for X86, helper_ff1() for M68K.  (I'm not sure about
> 'do_clz8' and 'do_clz16', though.)
> 
> At some point, possibly next weekend, I'll submit patches to add clz
> and ctz helpers to tcg-runtime.c, and to convert Alpha, ARM, CRIS, M68K,
> MIPS, PowerPC and x86 (any others I've missed?) to use those helpers.

The main problem I see for a TCG implementation is the definition of
clz/ctz. Some targets define clz(0) or ctz(0) as returning 32, while
others define the result as "undefined".

If we go for the common denominator for the TCG op, that is clz(0) =
undefined, it means that a test with brcond has to be added in the
targets using clz(0) = 32, and this is likely to cost more in slowdown
than it gains in speed.

If we go for clz(0) = 32, it means the test has to be implemented in
TCG, which might be complicated for some hosts.
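
To illustrate the extra test in plain C rather than in TCG ops, here is a minimal sketch, assuming a GCC-compatible compiler and a 32-bit unsigned int. The name clz32_zero_is_32 is hypothetical. GCC documents the result of __builtin_clz(0) as undefined, so a target that wants clz(0) = 32 has to guard the zero case explicitly.

```c
#include <stdint.h>

/* Wrap an "undefined at zero" primitive to get clz(0) = 32 semantics.
 * The conditional is the C-level analogue of the brcond mentioned above. */
static inline uint32_t clz32_zero_is_32(uint32_t x)
{
    return x ? (uint32_t)__builtin_clz(x) : 32u;
}
```

On hosts whose native instruction already returns 32 for a zero input, a compiler may be able to drop the test; on others it remains as a compare and branch or a conditional move.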

diff --git a/target-arm/helper.c b/target-arm/helper.c
index 701629a..656b5df 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -7,6 +7,7 @@ 
 #include "gdbstub.h"
 #include "helpers.h"
 #include "qemu-common.h"
+#include "host-utils.h"
 
 static uint32_t cortexa8_cp15_c0_c1[8] =
 { 0x1031, 0x11, 0x400, 0, 0x31100003, 0x20000000, 0x01202000, 0x11 };
@@ -394,10 +395,7 @@  uint32_t HELPER(uxtb16)(uint32_t x)
 
 uint32_t HELPER(clz)(uint32_t x)
 {
-    int count;
-    for (count = 32; x; count--)
-        x >>= 1;
-    return count;
+    return clz32(x);
 }
 
 int32_t HELPER(sdiv)(int32_t num, int32_t den)
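
As a rough sanity check of the replacement, the sketch below compares the removed loop with a stand-in count-leading-zeros routine. The stand-in (clz32_standin, a binary search) is an assumption written for this example and is not the code from host-utils.h; clz_loop_old reproduces the removed lines. Both are expected to return 32 for an input of 0.

```c
#include <stdint.h>

/* The loop removed by the patch: start at 32 and decrement once per
 * shift until the value is exhausted. */
static uint32_t clz_loop_old(uint32_t x)
{
    int count;

    for (count = 32; x; count--)
        x >>= 1;
    return (uint32_t)count;
}

/* Hypothetical stand-in for clz32(): binary search over the top bits. */
static uint32_t clz32_standin(uint32_t x)
{
    uint32_t n = 0;

    if (x == 0)
        return 32;
    if (!(x & 0xffff0000u)) { n += 16; x <<= 16; }
    if (!(x & 0xff000000u)) { n += 8;  x <<= 8; }
    if (!(x & 0xf0000000u)) { n += 4;  x <<= 4; }
    if (!(x & 0xc0000000u)) { n += 2;  x <<= 2; }
    if (!(x & 0x80000000u)) { n += 1; }
    return n;
}

/* Returns 1 if both routines agree on zero, on every single-bit value,
 * and on each single-bit value with bit 0 also set; 0 otherwise. */
static int loop_matches_standin(void)
{
    int bit;

    if (clz_loop_old(0) != clz32_standin(0))
        return 0;
    for (bit = 0; bit < 32; bit++) {
        uint32_t x = (uint32_t)1 << bit;

        if (clz_loop_old(x) != clz32_standin(x))
            return 0;
        if (clz_loop_old(x | 1u) != clz32_standin(x | 1u))
            return 0;
    }
    return 1;
}
```

The loop takes up to 32 iterations per call, while the stand-in takes a fixed five comparisons.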
