Message ID | 1387711957-703-1-git-send-email-aurelien@aurel32.net
---|---
State | New
On 12/22/2013 03:32 AM, Aurelien Jarno wrote:
> find_first_bit has started to be used heavily in TCG code. The current
> implementation based on find_next_bit is not optimal and can't be
> optimized by the compiler if the bit array has a fixed size, which is
> the case most of the time.
>
> This new implementation does not use find_next_bit and is yet small
> enough to be inlined.
>
> Cc: Richard Henderson <rth@twiddle.net>
> Cc: Corentin Chary <corentin.chary@gmail.com>
>
> Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
> ---
>  include/qemu/bitops.h | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)

Reviewed-by: Richard Henderson <rth@twiddle.net>

r~
On 22/12/2013 12:32, Aurelien Jarno wrote:
> find_first_bit has started to be used heavily in TCG code. The current
> implementation based on find_next_bit is not optimal and can't be
> optimized by the compiler if the bit array has a fixed size, which is
> the case most of the time.

If you mean by fully unrolling the loop, that's right.

However...

> -    return find_next_bit(addr, size, 0);
> +    unsigned long result, tmp;
> +
> +    for (result = 0; result < size; result += BITS_PER_LONG) {
> +        tmp = *addr++;
> +        if (tmp) {
> +            result += ctzl(tmp);
> +            return result < size ? result : size;
> +        }
> +    }
> +    /* Not found */
> +    return size;
>  }
>
>  /**

... you probably want to limit this to bitmaps that are of constant size,
and small enough that the compiler will unroll them.

So it probably would be a good idea to add an

    if (!__builtin_constant_p(size) || size > 8 * BITS_PER_LONG)
        return find_next_bit(addr, size, 0);

Not urgent since TCG is the only user of find_first_bit right now, but
worth considering.

Paolo
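As a concrete illustration, here is a minimal sketch of what the guard suggested above would look like inside find_first_bit. This is not a change that was actually applied; it simply combines the suggested early return with the loop from the patch, and it relies on find_next_bit, ctzl and BITS_PER_LONG already being provided by qemu/bitops.h.

```c
static inline unsigned long find_first_bit(const unsigned long *addr,
                                           unsigned long size)
{
    unsigned long result, tmp;

    /* Fall back to the generic out-of-line helper when the size is not a
     * compile-time constant, or is too large for the loop below to be
     * profitably unrolled (threshold as suggested in the review above). */
    if (!__builtin_constant_p(size) || size > 8 * BITS_PER_LONG) {
        return find_next_bit(addr, size, 0);
    }

    for (result = 0; result < size; result += BITS_PER_LONG) {
        tmp = *addr++;
        if (tmp) {
            result += ctzl(tmp);
            return result < size ? result : size;
        }
    }
    /* Not found */
    return size;
}
```

When size is a compile-time constant, the __builtin_constant_p() test folds away entirely, so the guard costs nothing in the fixed-size case.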
On Thu, Jun 19, 2014 at 10:36:18AM +0200, Paolo Bonzini wrote:
> On 22/12/2013 12:32, Aurelien Jarno wrote:
> > find_first_bit has started to be used heavily in TCG code. The current
> > implementation based on find_next_bit is not optimal and can't be
> > optimized by the compiler if the bit array has a fixed size, which is
> > the case most of the time.
>
> If you mean by fully unrolling the loop, that's right.
>
> However...
>
> > -    return find_next_bit(addr, size, 0);
> > +    unsigned long result, tmp;
> > +
> > +    for (result = 0; result < size; result += BITS_PER_LONG) {
> > +        tmp = *addr++;
> > +        if (tmp) {
> > +            result += ctzl(tmp);
> > +            return result < size ? result : size;
> > +        }
> > +    }
> > +    /* Not found */
> > +    return size;
> >  }
> >
> >  /**
>
> ... you probably want to limit this to bitmaps that are of constant
> size, and small enough that the compiler will unroll them.
>
> So it probably would be a good idea to add an
>
>     if (!__builtin_constant_p(size) || size > 8 * BITS_PER_LONG)
>         return find_next_bit(addr, size, 0);
>
> Not urgent since TCG is the only user of find_first_bit right now,
> but worth considering.

In practice on x86_64, this function takes 27 instructions in the
general case, and 18 instructions in the fixed case, even for big
sizes. I therefore think that checking if the size is constant is a
good idea, but we should not make any test on the size itself and
trust the compiler to correctly decide if the loop should be unrolled
or not. I will send a patch about that in the next few days.
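A rough way to reproduce the instruction-count comparison mentioned above is to compile fixed-size and variable-size wrappers around find_first_bit and disassemble the result. The harness below is purely illustrative: the file name, wrapper names and compiler invocation are made up, and the include lines assume a configured QEMU tree where the qemu/ headers are on the include path.

```c
/* first_bit_codegen.c -- illustrative harness, not part of QEMU.
 * Build and inspect with something roughly like:
 *     gcc -O2 -I include -I build -c first_bit_codegen.c
 *     objdump -d first_bit_codegen.o
 */
#include "qemu/osdep.h"
#include "qemu/bitops.h"

/* Compile-time-constant size: the inlined loop can be fully unrolled. */
unsigned long first_bit_fixed(const unsigned long *bits)
{
    return find_first_bit(bits, 2 * BITS_PER_LONG);
}

/* Runtime size: the compiler has to keep the loop. */
unsigned long first_bit_variable(const unsigned long *bits, unsigned long size)
{
    return find_first_bit(bits, size);
}
```

Comparing the disassembly of the two wrappers shows how much of the difference comes from the size being known at compile time; the exact instruction counts depend on the compiler version and flags.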
On 20/06/2014 10:48, Aurelien Jarno wrote:
> In practice on x86_64, this function takes 27 instructions in the
> general case, and 18 instructions in the fixed case, even for big
> sizes. I therefore think that checking if the size is constant is a
> good idea, but we should not make any test on the size itself and
> trust the compiler to correctly decide if the loop should be unrolled
> or not.

But if the size is large enough that the compiler will (likely) not
unroll the function, then it should pay off to use the more optimized
code in find_next_bit.

This of course is unless you expect find_first_bit to return a small
value and not be used in a loop; and dually expect find_next_bit's
usage to be more like walking sparser bitmaps in a loop. That actually
makes sense, and then there's no need to change anything.

Paolo
On Fri, Jun 20, 2014 at 10:58:31AM +0200, Paolo Bonzini wrote:
> On 20/06/2014 10:48, Aurelien Jarno wrote:
> > In practice on x86_64, this function takes 27 instructions in the
> > general case, and 18 instructions in the fixed case, even for big
> > sizes. I therefore think that checking if the size is constant is a
> > good idea, but we should not make any test on the size itself and
> > trust the compiler to correctly decide if the loop should be
> > unrolled or not.
>
> But if the size is large enough that the compiler will (likely) not
> unroll the function, then it should pay off to use the more
> optimized code in find_next_bit.

The point here is that, since find_next_bit is a generalized version of
find_first_bit, it is actually slower. I originally noticed that by
running profiling tools and seeing this function appear relatively high
in the profile for what it is supposed to do.

> This of course is unless you expect find_first_bit to return a small
> value and not be used in a loop; and dually expect find_next_bit's
> usage to be more like walking sparser bitmaps in a loop.

I think that's the point. In the TCG case, this is used to scan the
temp allocation bitmap to answer the question "give me a free temp".
That said, people might invent new usages.

> This actually makes sense, and then there's no need to change anything.
>
> Paolo
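To make the "give me a free temp" pattern concrete, here is a schematic sketch of that kind of lookup over a fixed-size bitmap. This is not TCG's actual allocator: MAX_TEMPS, free_temps and alloc_temp are invented for illustration, and the sketch assumes QEMU's qemu/bitmap.h and qemu/bitops.h are available.

```c
#include "qemu/osdep.h"
#include "qemu/bitops.h"
#include "qemu/bitmap.h"

#define MAX_TEMPS 512                          /* hypothetical fixed pool size */

static DECLARE_BITMAP(free_temps, MAX_TEMPS);  /* set bit = temp is free */

/* Return the index of a free temp and mark it as used, or -1 if none. */
static int alloc_temp(void)
{
    /* The size is a compile-time constant, so with the inline
     * find_first_bit the compiler can unroll the scan. */
    unsigned long idx = find_first_bit(free_temps, MAX_TEMPS);

    if (idx == MAX_TEMPS) {
        return -1;                             /* no free temp available */
    }
    clear_bit(idx, free_temps);
    return idx;
}
```

Because the bitmap size is fixed and the caller only wants the first free slot, the non-looping find_first_bit usage described in this thread is exactly the case the inline implementation targets.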
```diff
diff --git a/include/qemu/bitops.h b/include/qemu/bitops.h
index 304c90c..041142a 100644
--- a/include/qemu/bitops.h
+++ b/include/qemu/bitops.h
@@ -157,7 +157,17 @@ unsigned long find_next_zero_bit(const unsigned long *addr,
 static inline unsigned long find_first_bit(const unsigned long *addr,
                                            unsigned long size)
 {
-    return find_next_bit(addr, size, 0);
+    unsigned long result, tmp;
+
+    for (result = 0; result < size; result += BITS_PER_LONG) {
+        tmp = *addr++;
+        if (tmp) {
+            result += ctzl(tmp);
+            return result < size ? result : size;
+        }
+    }
+    /* Not found */
+    return size;
 }
 
 /**
```
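For experimenting with the patched function outside the QEMU tree, the following self-contained approximation substitutes __builtin_ctzl for QEMU's ctzl and defines BITS_PER_LONG locally; it only mirrors the logic of the patch and is not the header itself.

```c
#include <stdio.h>
#include <limits.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

/* Stand-in for QEMU's ctzl(); only valid for non-zero arguments. */
#define ctzl(x) __builtin_ctzl(x)

static inline unsigned long find_first_bit(const unsigned long *addr,
                                           unsigned long size)
{
    unsigned long result, tmp;

    for (result = 0; result < size; result += BITS_PER_LONG) {
        tmp = *addr++;
        if (tmp) {
            result += ctzl(tmp);
            return result < size ? result : size;
        }
    }
    /* Not found */
    return size;
}

int main(void)
{
    unsigned long bits[2] = { 0, 1UL << 3 };   /* only bit 3 of the second word is set */

    printf("%lu\n", find_first_bit(bits, 2 * BITS_PER_LONG)); /* 67 on LP64, 35 on 32-bit */
    printf("%lu\n", find_first_bit(bits, 16));                /* nothing in range -> 16 */
    return 0;
}
```

Note the `result < size ? result : size` clamp: it keeps the return value at `size` when the only set bits lie beyond the requested range, matching the "not found" convention of the existing helpers.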
find_first_bit has started to be used heavily in TCG code. The current
implementation based on find_next_bit is not optimal and can't be
optimized by the compiler if the bit array has a fixed size, which is
the case most of the time.

This new implementation does not use find_next_bit and is yet small
enough to be inlined.

Cc: Richard Henderson <rth@twiddle.net>
Cc: Corentin Chary <corentin.chary@gmail.com>

Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
---
 include/qemu/bitops.h | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)