Message ID: E594D4E5-EDA7-4277-9910-D498E9E1E66A@dlhnet.de
State: New
On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);

Still pointless.

-- PMM
On 11/03/2013 16:24, Peter Lieven wrote:
>
>> How would that be different in your patch? But you can solve it by
>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>> BITS_PER_LONG.
>
> This is what I have now:
>
> diff --git a/util/bitops.c b/util/bitops.c
> index e72237a..b0dc93f 100644
> --- a/util/bitops.c
> +++ b/util/bitops.c
> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>      const unsigned long *p = addr + BITOP_WORD(offset);
>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>      unsigned long tmp;
> +    unsigned long d0,d1,d2,d3;
>
>      if (offset >= size) {
>          return size;
>      }
>      size -= result;
> -    offset %= BITS_PER_LONG;
> +    offset &= (BITS_PER_LONG-1);
>      if (offset) {
>          tmp = *(p++);
>          tmp &= (~0UL << offset);
> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>          size -= BITS_PER_LONG;
>          result += BITS_PER_LONG;
>      }
> -    while (size & ~(BITS_PER_LONG-1)) {
> +    while (size >= 4*BITS_PER_LONG) {
> +        d0 = *p;
> +        d1 = *(p+1);
> +        d2 = *(p+2);
> +        d3 = *(p+3);
> +        if (d0 || d1 || d2 || d3) {
> +            break;
> +        }
> +        p+=4;
> +        result += 4*BITS_PER_LONG;
> +        size -= 4*BITS_PER_LONG;
> +    }
> +    while (size >= BITS_PER_LONG) {
>          if ((tmp = *(p++))) {
>              goto found_middle;
>          }

Minus the %= vs. &=,

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Perhaps:

    tmp = *p;
    d1 = *(p+1);
    d2 = *(p+2);
    d3 = *(p+3);
    if (tmp) {
        goto found_middle;
    }
    if (d1 || d2 || d3) {
        break;
    }

Paolo
On 11.03.2013 at 16:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/03/2013 16:24, Peter Lieven wrote:
>>
>>> How would that be different in your patch? But you can solve it by
>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>> BITS_PER_LONG.
>>
>> This is what I have now:
>>
>> diff --git a/util/bitops.c b/util/bitops.c
>> index e72237a..b0dc93f 100644
>> --- a/util/bitops.c
>> +++ b/util/bitops.c
>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>      const unsigned long *p = addr + BITOP_WORD(offset);
>>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>>      unsigned long tmp;
>> +    unsigned long d0,d1,d2,d3;
>>
>>      if (offset >= size) {
>>          return size;
>>      }
>>      size -= result;
>> -    offset %= BITS_PER_LONG;
>> +    offset &= (BITS_PER_LONG-1);
>>      if (offset) {
>>          tmp = *(p++);
>>          tmp &= (~0UL << offset);
>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>          size -= BITS_PER_LONG;
>>          result += BITS_PER_LONG;
>>      }
>> -    while (size & ~(BITS_PER_LONG-1)) {
>> +    while (size >= 4*BITS_PER_LONG) {
>> +        d0 = *p;
>> +        d1 = *(p+1);
>> +        d2 = *(p+2);
>> +        d3 = *(p+3);
>> +        if (d0 || d1 || d2 || d3) {
>> +            break;
>> +        }
>> +        p+=4;
>> +        result += 4*BITS_PER_LONG;
>> +        size -= 4*BITS_PER_LONG;
>> +    }
>> +    while (size >= BITS_PER_LONG) {
>>          if ((tmp = *(p++))) {
>>              goto found_middle;
>>          }
>
> Minus the %= vs. &=,
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>
> Perhaps:
>
>     tmp = *p;
>     d1 = *(p+1);
>     d2 = *(p+2);
>     d3 = *(p+3);
>     if (tmp) {
>         goto found_middle;
>     }
>     if (d1 || d2 || d3) {
>         break;
>     }

I do not know what gcc internally makes of the d0 || d1 || d2 || d3?
I would guess it is something like one addition with carry and one test?
Your proposed change would introduce two tests (maybe)? What about this,
to be sure?

    tmp = *p;
    d1 = *(p+1);
    d2 = *(p+2);
    d3 = *(p+3);
    if (tmp || d1 || d2 || d3) {
        if (tmp) {
            goto found_middle;
        }
        break;
    }

Peter
On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
> +    unsigned long d0,d1,d2,d3;

These commas should have spaces after them. Also, since the variables
are only used inside the scope of your newly added while loop:

> -    while (size & ~(BITS_PER_LONG-1)) {
> +    while (size >= 4*BITS_PER_LONG) {

it would be better to declare them here.

> +        d0 = *p;
> +        d1 = *(p+1);
> +        d2 = *(p+2);
> +        d3 = *(p+3);
> +        if (d0 || d1 || d2 || d3) {
> +            break;
> +        }
> +        p+=4;
> +        result += 4*BITS_PER_LONG;
> +        size -= 4*BITS_PER_LONG;
> +    }
> +    while (size >= BITS_PER_LONG) {
>          if ((tmp = *(p++))) {
>              goto found_middle;
>          }

thanks
-- PMM
On 11.03.2013 at 16:37, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 11 March 2013 15:24, Peter Lieven <pl@dlhnet.de> wrote:
>> +    unsigned long d0,d1,d2,d3;
>
> These commas should have spaces after them. Also, since
> the variables are only used inside the scope of your
> newly added while loop:
>
>> -    while (size & ~(BITS_PER_LONG-1)) {
>> +    while (size >= 4*BITS_PER_LONG) {
>
> it would be better to declare them here.

Can you verify that this does not make a difference in the generated
object code? In buffer_is_zero() it is outside the loop.

Thanks,
Peter
On 11/03/2013 16:41, Peter Lieven wrote:
> Can you verify that this does not make a difference in the generated
> object code? In buffer_is_zero() it is outside the loop.

No, it doesn't.

Paolo
On 11.03.2013 at 16:42, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/03/2013 16:41, Peter Lieven wrote:
>> Can you verify that this does not make a difference in the generated
>> object code? In buffer_is_zero() it is outside the loop.
>
> No, it doesn't.

OK, I will send the final patch tomorrow. One last thought: would it
make sense to update only `size` in the while loops and compute the
`result` at the end as `orgsize` - `size`?

Peter
On 11/03/2013 16:37, Peter Lieven wrote:
>
> On 11.03.2013 at 16:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> On 11/03/2013 16:24, Peter Lieven wrote:
>>>
>>>> How would that be different in your patch? But you can solve it by
>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>>> BITS_PER_LONG.
>>>
>>> This is what I have now:
>>>
>>> diff --git a/util/bitops.c b/util/bitops.c
>>> index e72237a..b0dc93f 100644
>>> --- a/util/bitops.c
>>> +++ b/util/bitops.c
>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>      const unsigned long *p = addr + BITOP_WORD(offset);
>>>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>      unsigned long tmp;
>>> +    unsigned long d0,d1,d2,d3;
>>>
>>>      if (offset >= size) {
>>>          return size;
>>>      }
>>>      size -= result;
>>> -    offset %= BITS_PER_LONG;
>>> +    offset &= (BITS_PER_LONG-1);
>>>      if (offset) {
>>>          tmp = *(p++);
>>>          tmp &= (~0UL << offset);
>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>          size -= BITS_PER_LONG;
>>>          result += BITS_PER_LONG;
>>>      }
>>> -    while (size & ~(BITS_PER_LONG-1)) {
>>> +    while (size >= 4*BITS_PER_LONG) {
>>> +        d0 = *p;
>>> +        d1 = *(p+1);
>>> +        d2 = *(p+2);
>>> +        d3 = *(p+3);
>>> +        if (d0 || d1 || d2 || d3) {
>>> +            break;
>>> +        }
>>> +        p+=4;
>>> +        result += 4*BITS_PER_LONG;
>>> +        size -= 4*BITS_PER_LONG;
>>> +    }
>>> +    while (size >= BITS_PER_LONG) {
>>>          if ((tmp = *(p++))) {
>>>              goto found_middle;
>>>          }
>>
>> Minus the %= vs. &=,
>>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>
>> Perhaps:
>>
>>     tmp = *p;
>>     d1 = *(p+1);
>>     d2 = *(p+2);
>>     d3 = *(p+3);
>>     if (tmp) {
>>         goto found_middle;
>>     }
>>     if (d1 || d2 || d3) {
>>         break;
>>     }
>
> I do not know what gcc internally makes of the d0 || d1 || d2 || d3?

It depends on the target and how expensive branches are.

> I would guess it is something like one addition with carry and one test?

It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
followed by a compare-and-jump.

That is, either:

    test %r8, %r8
    jz second_loop
    test %r9, %r9
    jz second_loop
    test %r10, %r10
    jz second_loop
    test %r11, %r11
    jz second_loop

or

    or %r9, %r8
    or %r11, %r10
    or %r8, %r10
    jz second_loop

Don't let the length of the code fool you. The processor knows how to
optimize all of these, and GCC knows too.

> Your proposed change would introduce two tests (maybe)?

Yes, but I expect them to be fairly well predicted.

> What about this, to be sure?
>
>     tmp = *p;
>     d1 = *(p+1);
>     d2 = *(p+2);
>     d3 = *(p+3);
>     if (tmp || d1 || d2 || d3) {
>         if (tmp) {
>             goto found_middle;
>         }
>         break;
>     }

I suspect that GCC would rewrite it to my version (definitely if it
produces 4 compare-and-jumps, but possibly even if it goes for bitwise
ORs; I haven't checked).

Regarding your other question ("one last thought: would it make sense to
update only `size` in the while loops and compute the `result` at the end
as `orgsize` - `size`?"), again the compiler knows better and might even
do this for you. It will likely drop the p increases and use p[result],
so if you do that change you may even get the same code, only this time
p is increased and you get an extra subtraction at the end. :)

Bottom line: don't try to outsmart an optimizing C compiler on
micro-optimization, unless you have benchmarked it and it shows there is
a problem.

Paolo
Even more efficient might be to do bitwise instead of logical OR:

> if (tmp | d1 | d2 | d3) {

That should remove 3 of the 4 conditional jumps and should become
3 bitwise ORs and one conditional jump.

On Mon, Mar 11, 2013 at 8:58 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> On 11/03/2013 16:37, Peter Lieven wrote:
>>
>> On 11.03.2013 at 16:29, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>> On 11/03/2013 16:24, Peter Lieven wrote:
>>>>
>>>>> How would that be different in your patch? But you can solve it by
>>>>> making two >= loops, one checking for 4*BITS_PER_LONG and one checking
>>>>> BITS_PER_LONG.
>>>>
>>>> This is what I have now:
>>>>
>>>> diff --git a/util/bitops.c b/util/bitops.c
>>>> index e72237a..b0dc93f 100644
>>>> --- a/util/bitops.c
>>>> +++ b/util/bitops.c
>>>> @@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>>      const unsigned long *p = addr + BITOP_WORD(offset);
>>>>      unsigned long result = offset & ~(BITS_PER_LONG-1);
>>>>      unsigned long tmp;
>>>> +    unsigned long d0,d1,d2,d3;
>>>>
>>>>      if (offset >= size) {
>>>>          return size;
>>>>      }
>>>>      size -= result;
>>>> -    offset %= BITS_PER_LONG;
>>>> +    offset &= (BITS_PER_LONG-1);
>>>>      if (offset) {
>>>>          tmp = *(p++);
>>>>          tmp &= (~0UL << offset);
>>>> @@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
>>>>          size -= BITS_PER_LONG;
>>>>          result += BITS_PER_LONG;
>>>>      }
>>>> -    while (size & ~(BITS_PER_LONG-1)) {
>>>> +    while (size >= 4*BITS_PER_LONG) {
>>>> +        d0 = *p;
>>>> +        d1 = *(p+1);
>>>> +        d2 = *(p+2);
>>>> +        d3 = *(p+3);
>>>> +        if (d0 || d1 || d2 || d3) {
>>>> +            break;
>>>> +        }
>>>> +        p+=4;
>>>> +        result += 4*BITS_PER_LONG;
>>>> +        size -= 4*BITS_PER_LONG;
>>>> +    }
>>>> +    while (size >= BITS_PER_LONG) {
>>>>          if ((tmp = *(p++))) {
>>>>              goto found_middle;
>>>>          }
>>>
>>> Minus the %= vs. &=,
>>>
>>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>>>
>>> Perhaps:
>>>
>>>     tmp = *p;
>>>     d1 = *(p+1);
>>>     d2 = *(p+2);
>>>     d3 = *(p+3);
>>>     if (tmp) {
>>>         goto found_middle;
>>>     }
>>>     if (d1 || d2 || d3) {
>>>         break;
>>>     }
>>
>> I do not know what gcc internally makes of the d0 || d1 || d2 || d3?
>
> It depends on the target and how expensive branches are.
>
>> I would guess it is something like one addition with carry and one test?
>
> It could be either 4 compare-and-jump sequences, or 3 bitwise ORs
> followed by a compare-and-jump.
>
> That is, either:
>
>     test %r8, %r8
>     jz second_loop
>     test %r9, %r9
>     jz second_loop
>     test %r10, %r10
>     jz second_loop
>     test %r11, %r11
>     jz second_loop
>
> or
>
>     or %r9, %r8
>     or %r11, %r10
>     or %r8, %r10
>     jz second_loop
>
> Don't let the length of the code fool you. The processor knows how to
> optimize all of these, and GCC knows too.
>
>> Your proposed change would introduce two tests (maybe)?
>
> Yes, but I expect them to be fairly well predicted.
>
>> What about this, to be sure?
>>
>>     tmp = *p;
>>     d1 = *(p+1);
>>     d2 = *(p+2);
>>     d3 = *(p+3);
>>     if (tmp || d1 || d2 || d3) {
>>         if (tmp) {
>>             goto found_middle;
>>         }
>>         break;
>>     }
>
> I suspect that GCC would rewrite it to my version (definitely if it
> produces 4 compare-and-jumps, but possibly even if it goes for bitwise
> ORs; I haven't checked).
>
> Regarding your other question ("one last thought: would it make sense to
> update only `size` in the while loops and compute the `result` at the end
> as `orgsize` - `size`?"), again the compiler knows better and might even
> do this for you. It will likely drop the p increases and use p[result],
> so if you do that change you may even get the same code, only this time
> p is increased and you get an extra subtraction at the end. :)
>
> Bottom line: don't try to outsmart an optimizing C compiler on
> micro-optimization, unless you have benchmarked it and it shows there is
> a problem.
>
> Paolo
On 11/03/2013 18:06, ronnie sahlberg wrote:
> Even more efficient might be to do bitwise instead of logical OR:
>
>> if (tmp | d1 | d2 | d3) {
>
> That should remove 3 of the 4 conditional jumps and should become
> 3 bitwise ORs and one conditional jump.

Without any serious profiling, please let the compiler do that.

Paolo
On 11.03.2013 18:07, Paolo Bonzini wrote:
> On 11/03/2013 18:06, ronnie sahlberg wrote:
>> Even more efficient might be to do bitwise instead of logical OR:
>>
>>> if (tmp | d1 | d2 | d3) {
>>
>> That should remove 3 of the 4 conditional jumps and should become
>> 3 bitwise ORs and one conditional jump.
>
> Without any serious profiling, please let the compiler do that.

Paolo is right. I ran some tests with gcc 4.6.3 on x86_64 (with -O3)
and tried the various ideas. They all made no significant difference.
Even unrolling to 8 unsigned longs didn't change anything.

What I tried is running 1^20 iterations of:

    find_next_bit(bitfield, 4194304, 0);

I chose the bitfield to be 4 MByte, which equals a 16 GB VM. The
bitfield was completely zeroed, so find_next_bit had to run through
the whole bitfield.

The original version took 1 minute and 10 seconds, whereas all the
others took approx. 37-38 seconds, which is almost a 100% boost ;-)

So I think this here is the final version:

    while (size >= 4*BITS_PER_LONG) {
        unsigned long d1, d2, d3;
        tmp = *p;
        d1 = *(p+1);
        d2 = *(p+2);
        d3 = *(p+3);
        if (tmp) {
            goto found_middle;
        }
        if (d1 || d2 || d3) {
            break;
        }
        p += 4;
        result += 4*BITS_PER_LONG;
        size -= 4*BITS_PER_LONG;
    }

    Peter
diff --git a/util/bitops.c b/util/bitops.c
index e72237a..b0dc93f 100644
--- a/util/bitops.c
+++ b/util/bitops.c
@@ -24,12 +24,13 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
     const unsigned long *p = addr + BITOP_WORD(offset);
     unsigned long result = offset & ~(BITS_PER_LONG-1);
     unsigned long tmp;
+    unsigned long d0,d1,d2,d3;
 
     if (offset >= size) {
         return size;
     }
     size -= result;
-    offset %= BITS_PER_LONG;
+    offset &= (BITS_PER_LONG-1);
     if (offset) {
         tmp = *(p++);
         tmp &= (~0UL << offset);
@@ -42,7 +43,19 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
         size -= BITS_PER_LONG;
         result += BITS_PER_LONG;
     }
-    while (size & ~(BITS_PER_LONG-1)) {
+    while (size >= 4*BITS_PER_LONG) {
+        d0 = *p;
+        d1 = *(p+1);
+        d2 = *(p+2);
+        d3 = *(p+3);
+        if (d0 || d1 || d2 || d3) {
+            break;
+        }
+        p+=4;
+        result += 4*BITS_PER_LONG;
+        size -= 4*BITS_PER_LONG;
+    }
+    while (size >= BITS_PER_LONG) {
         if ((tmp = *(p++))) {
             goto found_middle;
         }