Message ID: E594D4E5-EDA7-4277-9910-D498E9E1E66A@dlhnet.de
State: New
On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);

Still pointless.

-- PMM
On 11/03/2013 16:24, Peter Lieven wrote:
>
>> How would that be different in your patch? But you can solve it by
>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>> BITS_PER_LONG.
>
> This is what I have now:
>
> diff --git a/util/bitops.c b/util/bitops.c
> index e72237a..b0dc93f 100644
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>      const unsigned long *p = addr + BITOP_WORD(offset);
>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>      unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
>
>      if (offset >= size) {
>          return size;
>      }
>      size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);
>      if (offset) {
>          tmp = *(p++);
>          tmp &= (~0UL << offset);
> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>          size -= BITS_PER_LONG;
>          result += BITS_PER_LONG;
>      }
> -    while (size & ~(BITS_PER_LONG-1)) {
> +    while (size >= 4*BITS_PER_LONG) {
> +        d0 = *p;
> +        d1 = *(p+1);
> +        d2 = *(p+2);
> +        d3 = *(p+3);
> +        if (d0 || d1 || d2 || d3) {
> +            break;
> +        }
> +        p+=4;
> +        result += 4*BITS_PER_LONG;
> +        size -= 4*BITS_PER_LONG;
> +    }
> +    while (size >= BITS_PER_LONG) {
>          if ((tmp = *(p++))) {
>              goto found_middle;
>          }

Minus the %= vs. &=,

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Perhaps:

    tmp = *p;
    d1 = *(p+1);
    d2 = *(p+2);
    d3 = *(p+3);
    if (tmp) {
        goto found_middle;
    }
    if (d1 || d2 || d3) {
        break;
    }

Paolo
On 11.03.2013 at 16:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/03/2013 16:24, Peter Lieven wrote:
>>
>>> How would that be different in your patch? But you can solve it by
>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>> BITS_PER_LONG.
>>
>> This is what I have now:
>>
>> diff --git a/util/bitops.c b/util/bitops.c
>> index e72237a..b0dc93f 100644
>> --- a/util/bitops.c
>> +++ b/util/bitops.c
>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>      const unsigned long *p = addr + BITOP_WORD(offset);
>>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>>      unsigned long tmp;
>> +    unsigned long d0,d1,d2,d3;
>>
>>      if (offset >= size) {
>>          return size;
>>      }
>>      size -= result;
>> -    offset %= BITS_PER_LONG;
>> +    offset &= (BITS_PER_LONG-1);
>>      if (offset) {
>>          tmp = *(p++);
>>          tmp &= (~0UL << offset);
>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>          size -= BITS_PER_LONG;
>>          result += BITS_PER_LONG;
>>      }
>> -    while (size & ~(BITS_PER_LONG-1)) {
>> +    while (size >= 4*BITS_PER_LONG) {
>> +        d0 = *p;
>> +        d1 = *(p+1);
>> +        d2 = *(p+2);
>> +        d3 = *(p+3);
>> +        if (d0 || d1 || d2 || d3) {
>> +            break;
>> +        }
>> +        p+=4;
>> +        result += 4*BITS_PER_LONG;
>> +        size -= 4*BITS_PER_LONG;
>> +    }
>> +    while (size >= BITS_PER_LONG) {
>>          if ((tmp = *(p++))) {
>>              goto found_middle;
>>          }
>
> Minus the %= vs. &=,
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>
> Perhaps:
>
>     tmp = *p;
>     d1 = *(p+1);
>     d2 = *(p+2);
>     d3 = *(p+3);
>     if (tmp) {
>         goto found_middle;
>     }
>     if (d1 || d2 || d3) {
>         break;
>     }

I do not know what gcc internally makes of the d0 || d1 || d2 || d3?
I would guess it is something like one addition with carry and one test?
Your proposed change would introduce two tests (maybe)? What about this,
to be sure?

    tmp = *p;
    d1 = *(p+1);
    d2 = *(p+2);
    d3 = *(p+3);
    if (tmp || d1 || d2 || d3) {
        if (tmp) {
            goto found_middle;
        }
        break;
    }

Peter
On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
> +    unsigned long d0,d1,d2,d3;

These commas should have spaces after them. Also, since the variables
are only used inside the scope of your newly added while loop:

> -    while (size & ~(BITS_PER_LONG-1)) {
> +    while (size >= 4*BITS_PER_LONG) {

it would be better to declare them here.

> +        d0 = *p;
> +        d1 = *(p+1);
> +        d2 = *(p+2);
> +        d3 = *(p+3);
> +        if (d0 || d1 || d2 || d3) {
> +            break;
> +        }
> +        p+=4;
> +        result += 4*BITS_PER_LONG;
> +        size -= 4*BITS_PER_LONG;
> +    }
> +    while (size >= BITS_PER_LONG) {
>          if ((tmp = *(p++))) {
>              goto found_middle;
>          }

thanks
-- PMM
On 11.03.2013 at 16:37, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
>> +    unsigned long d0,d1,d2,d3;
>
> These commas should have spaces after them. Also, since
> the variables are only used inside the scope of your
> newly added while loop:
>
>> -    while (size & ~(BITS_PER_LONG-1)) {
>> +    while (size >= 4*BITS_PER_LONG) {
>
> it would be better to declare them here.

Can you verify that this does not make a difference in the generated
object code? In buffer_is_zero() it is outside the loop.

Thanks,
Peter
On 11/03/2013 16:41, Peter Lieven wrote:
> Can you verify that this does not make a difference in the generated
> object code? In buffer_is_zero() it is outside the loop.

No, it doesn't.

Paolo
On 11.03.2013 at 16:42, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/03/2013 16:41, Peter Lieven wrote:
>> Can you verify that this does not make a difference in the generated
>> object code? In buffer_is_zero() it is outside the loop.
>
> No, it doesn't.

OK, I will send the final patch tomorrow. One last thought: would it
make sense to update only `size` in the while loops and compute the
`result` at the end as `orgsize` - `size`?

Peter
On 11/03/2013 16:37, Peter Lieven wrote:
>
> On 11.03.2013 at 16:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 11/03/2013 16:24, Peter Lieven wrote:
>>>
>>>> How would that be different in your patch? But you can solve it by
>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>>> BITS_PER_LONG.
>>>
>>> This is what I have now:
>>>
>>> diff --git a/util/bitops.c b/util/bitops.c
>>> index e72237a..b0dc93f 100644
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>      const unsigned long *p = addr + BITOP_WORD(offset);
>>>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>      unsigned long tmp;
>>> +    unsigned long d0,d1,d2,d3;
>>>
>>>      if (offset >= size) {
>>>          return size;
>>>      }
>>>      size -= result;
>>> -    offset %= BITS_PER_LONG;
>>> +    offset &= (BITS_PER_LONG-1);
>>>      if (offset) {
>>>          tmp = *(p++);
>>>          tmp &= (~0UL << offset);
>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>          size -= BITS_PER_LONG;
>>>          result += BITS_PER_LONG;
>>>      }
>>> -    while (size & ~(BITS_PER_LONG-1)) {
>>> +    while (size >= 4*BITS_PER_LONG) {
>>> +        d0 = *p;
>>> +        d1 = *(p+1);
>>> +        d2 = *(p+2);
>>> +        d3 = *(p+3);
>>> +        if (d0 || d1 || d2 || d3) {
>>> +            break;
>>> +        }
>>> +        p+=4;
>>> +        result += 4*BITS_PER_LONG;
>>> +        size -= 4*BITS_PER_LONG;
>>> +    }
>>> +    while (size >= BITS_PER_LONG) {
>>>          if ((tmp = *(p++))) {
>>>              goto found_middle;
>>>          }
>>
>> Minus the %= vs. &=,
>>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>
>> Perhaps:
>>
>>     tmp = *p;
>>     d1 = *(p+1);
>>     d2 = *(p+2);
>>     d3 = *(p+3);
>>     if (tmp) {
>>         goto found_middle;
>>     }
>>     if (d1 || d2 || d3) {
>>         break;
>>     }
>
> I do not know what gcc internally makes of the d0 || d1 || d2 || d3?

It depends on the target and how expensive branches are.

> I would guess it is something like one addition with carry and one test?

It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
followed by a compare-and-jump.

That is, either:

    test %r8, %r8
    jz second_loop
    test %r9, %r9
    jz second_loop
    test %r10, %r10
    jz second_loop
    test %r11, %r11
    jz second_loop

or

    or %r9, %r8
    or %r11, %r10
    or %r8, %r10
    jz second_loop

Don't let the length of the code fool you. The processor knows how to
optimize all of these, and GCC knows too.

> Your proposed change would introduce two tests (maybe)?

Yes, but I expect them to be fairly well predicted.

> What about this, to be sure?
>
>     tmp = *p;
>     d1 = *(p+1);
>     d2 = *(p+2);
>     d3 = *(p+3);
>     if (tmp || d1 || d2 || d3) {
>         if (tmp) {
>             goto found_middle;
>         }
>         break;
>     }

I suspect that GCC would rewrite it to my version (definitely if it
produces 4 compare-and-jumps, but possibly even if it goes for bitwise
ORs; I haven't checked).

Regarding your other question ("one last thought: would it make sense to
update only `size` in the while loops and compute the `result` at the end
as `orgsize` - `size`?"), again the compiler knows better and might even
do this for you. It will likely drop the p increases and use p[result],
so if you do that change you may even get the same code, only this time
p is increased and you get an extra subtraction at the end. :)

Bottom line: don't try to outsmart an optimizing C compiler on
micro-optimization, unless you have benchmarked it and it shows there is
a problem.

Paolo
Even more efficient might be to do bitwise instead of logical OR:

> if (tmp | d1 | d2 | d3) {

That should remove 3 of the 4 conditional jumps and should become
3 bitwise ORs and one conditional jump.

On Mon, Mar 11, 2013 at 8:58 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/03/2013 16:37, Peter Lieven wrote:
>>
>> On 11.03.2013 at 16:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>> On 11/03/2013 16:24, Peter Lieven wrote:
>>>>
>>>>> How would that be different in your patch? But you can solve it by
>>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>>>> BITS_PER_LONG.
>>>>
>>>> This is what I have now:
>>>>
>>>> diff --git a/util/bitops.c b/util/bitops.c
>>>> index e72237a..b0dc93f 100644
>>>> --- a/util/bitops.c
>>>> +++ b/util/bitops.c
>>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>>      const unsigned long *p = addr + BITOP_WORD(offset);
>>>>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>>      unsigned long tmp;
>>>> +    unsigned long d0,d1,d2,d3;
>>>>
>>>>      if (offset >= size) {
>>>>          return size;
>>>>      }
>>>>      size -= result;
>>>> -    offset %= BITS_PER_LONG;
>>>> +    offset &= (BITS_PER_LONG-1);
>>>>      if (offset) {
>>>>          tmp = *(p++);
>>>>          tmp &= (~0UL << offset);
>>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>>          size -= BITS_PER_LONG;
>>>>          result += BITS_PER_LONG;
>>>>      }
>>>> -    while (size & ~(BITS_PER_LONG-1)) {
>>>> +    while (size >= 4*BITS_PER_LONG) {
>>>> +        d0 = *p;
>>>> +        d1 = *(p+1);
>>>> +        d2 = *(p+2);
>>>> +        d3 = *(p+3);
>>>> +        if (d0 || d1 || d2 || d3) {
>>>> +            break;
>>>> +        }
>>>> +        p+=4;
>>>> +        result += 4*BITS_PER_LONG;
>>>> +        size -= 4*BITS_PER_LONG;
>>>> +    }
>>>> +    while (size >= BITS_PER_LONG) {
>>>>          if ((tmp = *(p++))) {
>>>>              goto found_middle;
>>>>          }
>>>
>>> Minus the %= vs. &=,
>>>
>>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>>
>>> Perhaps:
>>>
>>>     tmp = *p;
>>>     d1 = *(p+1);
>>>     d2 = *(p+2);
>>>     d3 = *(p+3);
>>>     if (tmp) {
>>>         goto found_middle;
>>>     }
>>>     if (d1 || d2 || d3) {
>>>         break;
>>>     }
>>
>> I do not know what gcc internally makes of the d0 || d1 || d2 || d3?
>
> It depends on the target and how expensive branches are.
>
>> I would guess it is something like one addition with carry and one test?
>
> It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
> followed by a compare-and-jump.
>
> That is, either:
>
>     test %r8, %r8
>     jz second_loop
>     test %r9, %r9
>     jz second_loop
>     test %r10, %r10
>     jz second_loop
>     test %r11, %r11
>     jz second_loop
>
> or
>
>     or %r9, %r8
>     or %r11, %r10
>     or %r8, %r10
>     jz second_loop
>
> Don't let the length of the code fool you. The processor knows how to
> optimize all of these, and GCC knows too.
>
>> Your proposed change would introduce two tests (maybe)?
>
> Yes, but I expect them to be fairly well predicted.
>
>> What about this, to be sure?
>>
>>     tmp = *p;
>>     d1 = *(p+1);
>>     d2 = *(p+2);
>>     d3 = *(p+3);
>>     if (tmp || d1 || d2 || d3) {
>>         if (tmp) {
>>             goto found_middle;
>>         }
>>         break;
>>     }
>
> I suspect that GCC would rewrite it to my version (definitely if it
> produces 4 compare-and-jumps, but possibly even if it goes for bitwise
> ORs; I haven't checked).
>
> Regarding your other question ("one last thought: would it make sense to
> update only `size` in the while loops and compute the `result` at the end
> as `orgsize` - `size`?"), again the compiler knows better and might even
> do this for you. It will likely drop the p increases and use p[result],
> so if you do that change you may even get the same code, only this time
> p is increased and you get an extra subtraction at the end. :)
>
> Bottom line: don't try to outsmart an optimizing C compiler on
> micro-optimization, unless you have benchmarked it and it shows there is
> a problem.
>
> Paolo
On 11/03/2013 18:06, ronnie sahlberg wrote:
> Even more efficient might be to do bitwise instead of logical OR:
>
>> if (tmp | d1 | d2 | d3) {
>
> That should remove 3 of the 4 conditional jumps and should become
> 3 bitwise ORs and one conditional jump.

Without any serious profiling, please let the compiler do that.

Paolo
On 11.03.2013 18:07, Paolo Bonzini wrote:
> On 11/03/2013 18:06, ronnie sahlberg wrote:
>> Even more efficient might be to do bitwise instead of logical OR:
>>
>>> if (tmp | d1 | d2 | d3) {
>>
>> That should remove 3 of the 4 conditional jumps and should become
>> 3 bitwise ORs and one conditional jump.
>
> Without any serious profiling, please let the compiler do that.

Paolo is right. I ran some tests with gcc 4.6.3 on x86_64 (with -O3)
and tried the various ideas. They all made no significant difference.
Even unrolling to 8 unsigned longs didn't change anything.

What I tried is running 1^20 iterations of:

    find_next_bit(bitfield, 4194304, 0);

I chose the bitfield to be 4 MByte, which equals a 16 GB VM. The
bitfield was completely zeroed, so find_next_bit had to run through
the whole bitfield.

The original version took 1 minute and 10 seconds, whereas all the
others took approx. 37-38 seconds, which is almost a 100% boost ;-)

So I think this here is the final version:

    while (size >= 4*BITS_PER_LONG) {
        unsigned long d1, d2, d3;
        tmp = *p;
        d1 = *(p+1);
        d2 = *(p+2);
        d3 = *(p+3);
        if (tmp) {
            goto found_middle;
        }
        if (d1 || d2 || d3) {
            break;
        }
        p += 4;
        result += 4*BITS_PER_LONG;
        size -= 4*BITS_PER_LONG;
    }

    Peter
diff --git a/util/bitops.c b/util/bitops.c
index e72237a..b0dc93f 100644
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
     const unsigned long *p = addr + BITOP_WORD(offset);
     unsigned long result = offset & ~(BITS_PER_LONG-1);
     unsigned long tmp;
+    unsigned long d0,d1,d2,d3;
 
     if (offset >= size) {
         return size;
     }
     size -= result;
-    offset %= BITS_PER_LONG;
+    offset &= (BITS_PER_LONG-1);
     if (offset) {
         tmp = *(p++);
         tmp &= (~0UL << offset);
@@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
         size -= BITS_PER_LONG;
         result += BITS_PER_LONG;
     }
-    while (size & ~(BITS_PER_LONG-1)) {
+    while (size >= 4*BITS_PER_LONG) {
+        d0 = *p;
+        d1 = *(p+1);
+        d2 = *(p+2);
+        d3 = *(p+3);
+        if (d0 || d1 || d2 || d3) {
+            break;
+        }
+        p+=4;
+        result += 4*BITS_PER_LONG;
+        size -= 4*BITS_PER_LONG;
+    }
+    while (size >= BITS_PER_LONG) {
         if ((tmp = *(p++))) {
             goto found_middle;
         }