diff mbox

[PATCHv2,3/9] buffer_is_zero: use vector optimizations if possible

Message ID 1363362619-3190-4-git-send-email-pl@kamp.de
State New
Headers show

Commit Message

Peter Lieven March 15, 2013, 3:50 p.m. UTC
performance gain on SSE2 is approx. 20-25%. altivec
is not tested. performance for unsigned long arithmetic
is unchanged.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 util/cutils.c |    7 +++++++
 1 file changed, 7 insertions(+)

Comments

Eric Blake March 19, 2013, 4:08 p.m. UTC | #1
On 03/15/2013 09:50 AM, Peter Lieven wrote:
> performance gain on SSE2 is approx. 20-25%. altivec
> is not tested. performance for unsigned long arithmetic
> is unchanged.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> ---
>  util/cutils.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/util/cutils.c b/util/cutils.c
> index 857dd7d..00d98fb 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -190,6 +190,13 @@ size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>   */
>  bool buffer_is_zero(const void *buf, size_t len)
>  {
> +    /* use vector optimized zero check if possible */
> +    if (((uintptr_t) buf) % sizeof(VECTYPE) == 0 
> +          && len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> +             * sizeof(VECTYPE)) == 0) {

Is it worth factoring this check into something more reusable, by adding
something like 'bool buffer_can_use_vectors(buf, len)' in patch 2/9?

> +        return buffer_find_nonzero_offset(buf, len)==len;

Spaces around binary operators.

Is it worth rewriting this function into a simpler:

check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs until
we are aligned
check buffer_find_nonzero_offset on the aligned middle
check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs at tail

instead of having two instances of code that can loop over the entire
buffer?  Otherwise, searching for zeros on an unaligned buffer will
remain slower, even though the bulk of the search could still benefit
from the vector operations.

> +    }
> +
>      /*
>       * Use long as the biggest available internal data type that fits into the
>       * CPU register and unroll the loop to smooth out the effect of memory

Your patch results in C99 declarations after statements; while we
require C99, I'm not sure if qemu prefers to stick to the C89 style of
declarations before statements.
Peter Lieven March 19, 2013, 4:14 p.m. UTC | #2
Am 19.03.2013 um 17:08 schrieb Eric Blake <eblake@redhat.com>:

> On 03/15/2013 09:50 AM, Peter Lieven wrote:
>> performance gain on SSE2 is approx. 20-25%. altivec
>> is not tested. performance for unsigned long arithmetic
>> is unchanged.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> util/cutils.c |    7 +++++++
>> 1 file changed, 7 insertions(+)
>> 
>> diff --git a/util/cutils.c b/util/cutils.c
>> index 857dd7d..00d98fb 100644
>> --- a/util/cutils.c
>> +++ b/util/cutils.c
>> @@ -190,6 +190,13 @@ size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>  */
>> bool buffer_is_zero(const void *buf, size_t len)
>> {
>> +    /* use vector optimized zero check if possible */
>> +    if (((uintptr_t) buf) % sizeof(VECTYPE) == 0 
>> +          && len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> +             * sizeof(VECTYPE)) == 0) {
> 
> Is it worth factoring this check into something more reusable, by adding
> something like 'bool buffer_can_use_vectors(buf, len)' in patch 2/9?

good idea!

> 
>> +        return buffer_find_nonzero_offset(buf, len)==len;
> 
> Spaces around binary operators.
> 
> Is it worth rewriting this function into a simpler:
> 
> check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs until
> we are aligned
> check buffer_find_nonzero_offset on the aligned middle
> check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs at tail
> 
> instead of having two instances of code that can loop over the entire
> buffer?  Otherwise, searching for zeros on an unaligned buffer will
> remain slower, even though the bulk of the search could still benefit
> from the vector operations.

i do not think that this is necessary as in all cases I know the buffers are aligned properly.

a) ram pages (4k)
b) block migration pages (1M)
c) bdrv sectors (512)

At least on x86_64 they have all 16-byte aligned addresses. 

It would just add more branches to the case where the optimized version can be used maybe
making it slower.

Peter

> 
>> +    }
>> +
>>     /*
>>      * Use long as the biggest available internal data type that fits into the
>>      * CPU register and unroll the loop to smooth out the effect of memory
> 
> Your patch results in C99 declarations after statements; while we
> require C99, I'm not sure if qemu prefers to stick to the C89 style of
> declarations before statements.

the code was slower if I made the checks afterwards.

Peter

> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>
Peter Lieven March 19, 2013, 7:44 p.m. UTC | #3
Am 19.03.2013 um 17:08 schrieb Eric Blake <eblake@redhat.com>:

> On 03/15/2013 09:50 AM, Peter Lieven wrote:
>> performance gain on SSE2 is approx. 20-25%. altivec
>> is not tested. performance for unsigned long arithmetic
>> is unchanged.
>> 
>> Signed-off-by: Peter Lieven <pl@kamp.de>
>> ---
>> util/cutils.c |    7 +++++++
>> 1 file changed, 7 insertions(+)
>> 
>> diff --git a/util/cutils.c b/util/cutils.c
>> index 857dd7d..00d98fb 100644
>> --- a/util/cutils.c
>> +++ b/util/cutils.c
>> @@ -190,6 +190,13 @@ size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>  */
>> bool buffer_is_zero(const void *buf, size_t len)
>> {
>> +    /* use vector optimized zero check if possible */
>> +    if (((uintptr_t) buf) % sizeof(VECTYPE) == 0 
>> +          && len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>> +             * sizeof(VECTYPE)) == 0) {
> 
> Is it worth factoring this check into something more reusable, by adding
> something like 'bool buffer_can_use_vectors(buf, len)' in patch 2/9?
> 
>> +        return buffer_find_nonzero_offset(buf, len)==len;
> 
> Spaces around binary operators.
> 
> Is it worth rewriting this function into a simpler:
> 
> check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs until
> we are aligned
> check buffer_find_nonzero_offset on the aligned middle
> check up to (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR - 1) longs at tail
> 
> instead of having two instances of code that can loop over the entire
> buffer?  Otherwise, searching for zeros on an unaligned buffer will
> remain slower, even though the bulk of the search could still benefit
> from the vector operations.
> 
>> +    }
>> +
>>     /*
>>      * Use long as the biggest available internal data type that fits into the
>>      * CPU register and unroll the loop to smooth out the effect of memory
> 
> Your patch results in C99 declarations after statements; while we
> require C99, I'm not sure if qemu prefers to stick to the C89 style of
> declarations before statements.

See my comments to patch 7/9 regarding is_zero_page(). I will move the
call to the vector optimized version after the variable definitions. The
speed concern at that point was mainline to use it for zero page checking.

Peter

> 
> -- 
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>
diff mbox

Patch

diff --git a/util/cutils.c b/util/cutils.c
index 857dd7d..00d98fb 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -190,6 +190,13 @@  size_t buffer_find_nonzero_offset(const void *buf, size_t len)
  */
 bool buffer_is_zero(const void *buf, size_t len)
 {
+    /* use vector optimized zero check if possible */
+    if (((uintptr_t) buf) % sizeof(VECTYPE) == 0 
+          && len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
+             * sizeof(VECTYPE)) == 0) {
+        return buffer_find_nonzero_offset(buf, len)==len;
+    }
+
     /*
      * Use long as the biggest available internal data type that fits into the
      * CPU register and unroll the loop to smooth out the effect of memory