Message ID | 20091015211452.GC7071@volta.aurel32.net |
---|---|
State | New |
On Thu, Oct 15, 2009 at 11:14 PM, Aurelien Jarno <aurelien@aurel32.net> wrote:
> Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
> ---
>  target-arm/helper.c |    6 ++----
>  1 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/target-arm/helper.c b/target-arm/helper.c
> index 701629a..656b5df 100644
> --- a/target-arm/helper.c
> +++ b/target-arm/helper.c
> @@ -7,6 +7,7 @@
>  #include "gdbstub.h"
>  #include "helpers.h"
>  #include "qemu-common.h"
> +#include "host-utils.h"
>
>  static uint32_t cortexa8_cp15_c0_c1[8] =
>  { 0x1031, 0x11, 0x400, 0, 0x31100003, 0x20000000, 0x01202000, 0x11 };
> @@ -394,10 +395,7 @@ uint32_t HELPER(uxtb16)(uint32_t x)
>
>  uint32_t HELPER(clz)(uint32_t x)
>  {
> -    int count;
> -    for (count = 32; x; count--)
> -        x >>= 1;
> -    return count;
> +    return clz32(x);
>  }
>
>  int32_t HELPER(sdiv)(int32_t num, int32_t den)
> --
> 1.6.1.3

Acked-by: Laurent Desnogues <laurent.desnogues@gmail.com>
On Thu, Oct 15, 2009 at 11:14:52PM +0200, Aurelien Jarno wrote:
> @@ -394,10 +395,7 @@ uint32_t HELPER(uxtb16)(uint32_t x)
>
>  uint32_t HELPER(clz)(uint32_t x)
>  {
> -    int count;
> -    for (count = 32; x; count--)
> -        x >>= 1;
> -    return count;
> +    return clz32(x);
>  }
>
>  int32_t HELPER(sdiv)(int32_t num, int32_t den)

Just a quick note that the implementation of clz, ctz and popcnt is
still listed in the TCG TODO list. The last time I looked, I noticed
that quite a few architectures have clz/ctz instructions:

http://lkml.indiana.edu/hypermail/linux/kernel/0601.3/1683.html

For those that don't, I think a combination of the following two hacks at
http://graphics.stanford.edu/~seander/bithacks.html could be used:

  'Round up to the next highest power of 2'
  'Counting bits set, in parallel'

With this, it should be possible to implement clz and ctz without too
many operations for both 32-bit and 64-bit integers, without requiring
floats, lookup tables or branches.

Of course, __builtin_clz() might well do a better job...

BTW, it may be worth pointing out:

  B[4] = 0x0000ffff;
  B[3] = B[4] ^ (B[4] << 8)  => 0x00ff00ff
  B[2] = B[3] ^ (B[3] << 4)  => 0x0f0f0f0f
  B[1] = B[2] ^ (B[2] << 2)  => 0x33333333
  B[0] = B[1] ^ (B[1] << 1)  => 0x55555555

In reality, I wonder if five separate loads would be quicker, though.

Cheers,
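One way the two bit hacks mentioned above might be combined is sketched below. This is only an illustration, not code from QEMU or from the patch: the function names (`popcount32_sketch`, `smear_right`, `clz32_sketch`, `ctz32_sketch`, `derive_masks`) are made up here, and the sketch uses only the bit-smearing step of 'Round up to the next highest power of 2', not the full rounding. Under those assumptions, both clz and ctz come out branch-free and return 32 for a zero input.

```c
#include <stdint.h>

/* 'Counting bits set, in parallel': add neighbouring bit fields of
 * doubling width, using the masks 0x55555555 ... 0x0000ffff. */
static uint32_t popcount32_sketch(uint32_t v)
{
    v = (v & 0x55555555u) + ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    v = (v & 0x0f0f0f0fu) + ((v >> 4) & 0x0f0f0f0fu);
    v = (v & 0x00ff00ffu) + ((v >> 8) & 0x00ff00ffu);
    v = (v & 0x0000ffffu) + ((v >> 16) & 0x0000ffffu);
    return v;
}

/* Smearing step taken from 'Round up to the next highest power of 2':
 * copy the highest set bit into every lower bit position. */
static uint32_t smear_right(uint32_t v)
{
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v;
}

/* After smearing, the set bits are exactly those at or below the
 * highest set bit, so the leading zeros are the remaining ones. */
static uint32_t clz32_sketch(uint32_t x)
{
    return 32 - popcount32_sketch(smear_right(x));
}

/* ~x & (x - 1) keeps only the bits below the lowest set bit;
 * for x == 0 it is all ones, giving 32. */
static uint32_t ctz32_sketch(uint32_t x)
{
    return popcount32_sketch(~x & (x - 1));
}

/* The mask derivation from the message: each mask follows from the
 * previous one with a shift and an XOR instead of a separate load. */
static void derive_masks(uint32_t B[5])
{
    B[4] = 0x0000ffffu;
    B[3] = B[4] ^ (B[4] << 8);
    B[2] = B[3] ^ (B[3] << 4);
    B[1] = B[2] ^ (B[2] << 2);
    B[0] = B[1] ^ (B[1] << 1);
}
```

Whether this beats a compiler builtin or a native instruction would need measuring on each host; the sketch only shows that the combination is possible without branches or tables.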
Stuart Brady wrote:
> On Thu, Oct 15, 2009 at 11:14:52PM +0200, Aurelien Jarno wrote:
>> @@ -394,10 +395,7 @@ uint32_t HELPER(uxtb16)(uint32_t x)
>>
>>  uint32_t HELPER(clz)(uint32_t x)
>>  {
>> -    int count;
>> -    for (count = 32; x; count--)
>> -        x >>= 1;
>> -    return count;
>> +    return clz32(x);
>>  }
>>
>>  int32_t HELPER(sdiv)(int32_t num, int32_t den)
>
> Just a quick note that the implementation of clz, ctz and popcnt is
> still listed in the TCG TODO list. The last time I looked, I noticed
> that quite a few architectures have clz/ctz instructions:
>
> http://lkml.indiana.edu/hypermail/linux/kernel/0601.3/1683.html

OTOH, a dump shows that those instructions are not used that often, so I
am not sure it is worth implementing them.

> For those that don't, I think a combination of the following two hacks at
> http://graphics.stanford.edu/~seander/bithacks.html could be used:

The best is probably to use a helper in that case, calling clz32(x).
On Fri, Oct 23, 2009 at 09:04:53AM +0200, Aurelien Jarno wrote:
> Stuart Brady wrote:
> > Just a quick note that the implementation of clz, ctz and popcnt is
> > still listed in the TCG TODO list. The last time I looked, I noticed
> > that quite a few architectures have clz/ctz instructions:
> >
> > http://lkml.indiana.edu/hypermail/linux/kernel/0601.3/1683.html
>
> OTOH, a dump shows that those instructions are not used that often, so I
> am not sure it is worth implementing them.

Really? I'm surprised, as I gather that optimised ffs/fls/hweight
functions in the kernel do give a modest gain... I suppose I'll have
to try it on several different targets and see! :-)

> > For those that don't, I think a combination of the following two hacks at
> > http://graphics.stanford.edu/~seander/bithacks.html could be used:
>
> The best is probably to use a helper in that case, calling clz32(x).

Yes, you're right.

There are several other places that should also call clz32()/ctz32().
The ones that I can see are helper_neon_cls_s32() for ARM, helper_bsf()
and helper_bsr() for X86, helper_ff1() for M68K. (I'm not sure about
'do_clz8' and 'do_clz16', though.)

At some point, possibly next weekend, I'll submit patches to add clz
and ctz helpers to tcg-runtime.c, and to convert Alpha, ARM, CRIS, M68K,
MIPS, PowerPC and x86 (any others I've missed?) to use those helpers.

Cheers,
Stuart Brady wrote:
> On Fri, Oct 23, 2009 at 09:04:53AM +0200, Aurelien Jarno wrote:
>> Stuart Brady wrote:
>>> Just a quick note that the implementation of clz, ctz and popcnt is
>>> still listed in the TCG TODO list. The last time I looked, I noticed
>>> that quite a few architectures have clz/ctz instructions:
>>>
>>> http://lkml.indiana.edu/hypermail/linux/kernel/0601.3/1683.html
>> OTOH, a dump shows that those instructions are not used that often, so I
>> am not sure it is worth implementing them.
>
> Really? I'm surprised, as I gather that optimised ffs/fls/hweight
> functions in the kernel do give a modest gain... I suppose I'll have
> to try it on several different targets and see! :-)

I gave a quick look at MIPS, and at least here, it is used often.

>>> For those that don't, I think a combination of the following two hacks at
>>> http://graphics.stanford.edu/~seander/bithacks.html could be used:
>> The best is probably to use a helper in that case, calling clz32(x).
>
> Yes, you're right.
>
> There are several other places that should also call clz32()/ctz32().
> The ones that I can see are helper_neon_cls_s32() for ARM, helper_bsf()
> and helper_bsr() for X86, helper_ff1() for M68K. (I'm not sure about
> 'do_clz8' and 'do_clz16', though.)
>
> At some point, possibly next weekend, I'll submit patches to add clz
> and ctz helpers to tcg-runtime.c, and to convert Alpha, ARM, CRIS, M68K,
> MIPS, PowerPC and x86 (any others I've missed?) to use those helpers.

The main problem I see for a TCG implementation is the definition of
clz/ctz. Some targets define clz(0) or ctz(0) as returning 32, while
others define the result as "undefined".

If we go for the common denominator for the TCG op, that is clz(0) =
undefined, it means that a test with brcond has to be added in the
targets using clz(0) = 32, and this is likely to cause more slowdown
than speed gain.

If we go for clz(0) = 32, it means the test has to be implemented in
TCG, which might be complicated for some hosts.
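The trade-off described above can be illustrated in plain C, outside of TCG. The sketch below assumes a GCC- or Clang-compatible compiler, whose `__builtin_clz()` and `__builtin_ctz()` are documented as undefined for a zero argument; they stand in for a primitive with "clz(0) = undefined" semantics. The wrapper names are hypothetical and are not QEMU functions. A target that needs clz(0) = 32 then pays for one extra zero test on every call.

```c
#include <stdint.h>

/* Stand-ins for a primitive whose result is undefined for 0.
 * Assumption: GCC/Clang builtins; callers must never pass 0. */
static inline uint32_t clz32_undef_on_zero(uint32_t x)
{
    return (uint32_t)__builtin_clz(x);
}

static inline uint32_t ctz32_undef_on_zero(uint32_t x)
{
    return (uint32_t)__builtin_ctz(x);
}

/* "clz(0) == 32" semantics layered on top: the zero test here
 * plays the role of the extra brcond discussed above. */
static inline uint32_t clz32_zero_is_32(uint32_t x)
{
    return x ? clz32_undef_on_zero(x) : 32;
}

static inline uint32_t ctz32_zero_is_32(uint32_t x)
{
    return x ? ctz32_undef_on_zero(x) : 32;
}
```

On hosts whose native instruction already returns 32 for zero, a compiler may be able to drop the test; on others it stays, which is the cost being weighed in the message above.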
diff --git a/target-arm/helper.c b/target-arm/helper.c
index 701629a..656b5df 100644
--- a/target-arm/helper.c
+++ b/target-arm/helper.c
@@ -7,6 +7,7 @@
 #include "gdbstub.h"
 #include "helpers.h"
 #include "qemu-common.h"
+#include "host-utils.h"

 static uint32_t cortexa8_cp15_c0_c1[8] =
 { 0x1031, 0x11, 0x400, 0, 0x31100003, 0x20000000, 0x01202000, 0x11 };
@@ -394,10 +395,7 @@ uint32_t HELPER(uxtb16)(uint32_t x)

 uint32_t HELPER(clz)(uint32_t x)
 {
-    int count;
-    for (count = 32; x; count--)
-        x >>= 1;
-    return count;
+    return clz32(x);
 }

 int32_t HELPER(sdiv)(int32_t num, int32_t den)
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 target-arm/helper.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)
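For readers who want to convince themselves that the helper change preserves behaviour, here is a small standalone sketch. It lifts the removed loop out of the `HELPER()` macro and compares it against a binary-search count of leading zeros. `clz32_standin()` is a hypothetical stand-in written for this example; it is assumed, not confirmed here, that QEMU's `clz32()` from host-utils.h likewise returns 32 for a zero input, which is what the old loop does.

```c
#include <stdint.h>

/* The loop removed by the patch, as a plain function. */
static uint32_t clz_loop(uint32_t x)
{
    int count;
    for (count = 32; x; count--)
        x >>= 1;
    return count;
}

/* Hypothetical stand-in for clz32(): binary search over halves,
 * returning 32 for 0 (an assumption about the real helper). */
static uint32_t clz32_standin(uint32_t x)
{
    uint32_t n = 0;
    if (x == 0)
        return 32;
    if (!(x & 0xffff0000u)) { n += 16; x <<= 16; }
    if (!(x & 0xff000000u)) { n += 8;  x <<= 8; }
    if (!(x & 0xf0000000u)) { n += 4;  x <<= 4; }
    if (!(x & 0xc0000000u)) { n += 2;  x <<= 2; }
    if (!(x & 0x80000000u)) { n += 1; }
    return n;
}

/* Returns 1 if both versions agree on 0, on every single-bit value,
 * and on every value with all bits set up to a given position. */
static int clz_variants_agree(void)
{
    if (clz_loop(0) != clz32_standin(0))
        return 0;
    for (int i = 0; i < 32; i++) {
        uint32_t bit = (uint32_t)1 << i;
        uint32_t run = bit | (bit - 1);
        if (clz_loop(bit) != clz32_standin(bit))
            return 0;
        if (clz_loop(run) != clz32_standin(run))
            return 0;
    }
    return 1;
}
```

The check is a spot test over 65 inputs rather than a proof, but it covers the zero case, which is the one where implementations most often differ.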